Skip to content
Snippets Groups Projects
Unverified Commit 9c66e3c1 authored by Gärber, Florian's avatar Gärber, Florian Committed by GitHub
Browse files

fix: var.relations $surr.res returning matrix interspersed with unexpected NA (#4)

This PR fixes an issue where the result of `var.relations` would contain
long runs of NA starting in the latter half of variables. This issue
caused mean adjusted agreements to be shifted horizontally and across
following rows.

The issue was caused by setting the mean adjusted agreement of a
variable with itself to NA incorrectly. Instead of using the current
index, the variable index was used. This resulted in cells being
incorrectly overwritten as NA if the variable index was smaller than the
number of all variables, and in the implicit addition of NA beyond the
end of the vector until the variable index when the variable index was
larger than the number of all variables.

This caused `mean.index` to return vectors with a length not exactly
equal to the number of variables, which in the following steps caused an
overrun when the concatenated vectors were being cast to a matrix with
dimensions being exactly the length of all variables by all candidate
variables. A warning message would appear when this occurred: `data
length [...] is not a sub-multiple or multiple of the number of rows
[...]`. It did not however notify that extra data was dropped
implicitly.

This PR also contains some basic test cases for the internal
`mean.index` to prevent regression of this issue.
- Assert that the length of the return value is always exactly as long
as the number of variables
- Assert that for unknown variables (that is, variables not represented
in `list.res` or `index.variables`) the function will return an NA
vector of the appropriate length.
parents 39dbf1f0 17fd97b9
No related branches found
No related tags found
No related merge requests found
...@@ -3,6 +3,7 @@ ...@@ -3,6 +3,7 @@
<!-- News Style-guide: https://style.tidyverse.org/news.html --> <!-- News Style-guide: https://style.tidyverse.org/news.html -->
* Moved to new repository: [AGSeifert/RFSurrogates](https://github.com/AGSeifert/RFSurrogates) * Moved to new repository: [AGSeifert/RFSurrogates](https://github.com/AGSeifert/RFSurrogates)
* Fixed `mean.index()` bug which caused the return value to be of incorrect length in some cases (#4).
--- ---
......
...@@ -67,7 +67,7 @@ mean.index <- function(i, list.res, index.variables) { ...@@ -67,7 +67,7 @@ mean.index <- function(i, list.res, index.variables) {
list <- list.res[which(names(list.res) == index.variables[i])] list <- list.res[which(names(list.res) == index.variables[i])]
mean.list <- round(Reduce("+", list) / length(list), 2) mean.list <- round(Reduce("+", list) / length(list), 2)
if (length(mean.list) > 0) { if (length(mean.list) > 0) {
mean.list[index.variables[i]] <- NA mean.list[i] <- NA
return(mean.list) return(mean.list)
} else { } else {
return(rep(NA, length(index.variables))) return(rep(NA, length(index.variables)))
......
# Troubleshooting common issues
## `Error in var.relations: allvariables do not contain the candidate variables`
This guard clause is intended to catch cases where you are attempting to look at candidate variables not included in the data set.
Firstly, make sure that your `candidates` argument only contains variables present in your indepedent variables `x`:
```r
all(candidates %in% colnames(x)) # Must evaluate to TRUE
```
However, if you are using a data set with number-like feature identifiers, these will be automatically coerced into `X<number>` when the data frame is created, causing this mismatch.
To fix this, change your argument by adding `paste0("X", candidates)`.
Make sure to do the same for the `variables` argument too.
## `var.select.smd` is stuck at `Growing trees.. Progress`
The output `Growing trees..` is made by `ranger` while a large forest is being built. The main part of the function however does not emit any progress messages, causing it to appear stalled.
You can verify the function is still running by checking your CPU usage.
This issue is especially relevant for large data sets.
This should be improved since the rework (Version 0.3.0).
It is still recommended to use a high performance computer with large amounts of RAM for larger data sets.
test_that("mean.index always returns a numeric vector of length(index.variables)", {
list.res <- rlist::list.flatten(
list(
list(`1` = c(0.0, 0.5, 0.5, 0.5)),
list(`1` = c(0.0, 0.6, 0.7, 0.8)),
list(`5` = c(0.4, 0.0, 0.4, 0.4)),
list(`99` = c(0.1, 0.2, 0.1, 0.2)),
list(`100` = c(0.9, 0.3, 0.0, 0.9)),
list(`10000` = c(0.7, 0.8, 0.9, 0.0)),
list(`5` = c(0.3, 0.0, 0.5, 0.6), `100` = c(0.7, 0.6, 0.0, 0.9))
)
)
index.variables = c(1, 5, 100, 10000)
# 1 would not go out of bounds, thus should always pass
testthat::expect_length(mean.index(1, list.res, index.variables), length(index.variables))
testthat::expect_length(mean.index(2, list.res, index.variables), length(index.variables))
testthat::expect_length(mean.index(3, list.res, index.variables), length(index.variables))
testthat::expect_length(mean.index(4, list.res, index.variables), length(index.variables))
# 5 is out of bounds of index.variables, returning only NA of correct length
testthat::expect_length(mean.index(5, list.res, index.variables), length(index.variables))
})
test_that("mean.index returns NA vector when i is out of bounds of index.variables", {
list.res <- list()
index.variables = 1:4
testthat::expect_equal(mean.index(5, list.res, index.variables), rep(NA, length(index.variables)))
})
test_that("testthat works", {
expect_equal(2 * 2, 4)
})
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Please register or to comment