Unverified Commit cadfb60f authored by Gärber, Florian, committed by GitHub

v0.3.4 (#10)

parents 0354f3b7 0229933b
Branches v0.3
Tags v0.3.4
Showing with 338 additions and 206 deletions
Type: Package
Package: RFSurrogates
Title: Surrogate Minimal Depth Variable Importance
Version: 0.3.3
Version: 0.3.4
Authors@R: c(
person("Stephan", "Seifert", , "stephan.seifert@uni-hamburg.de", role = c("aut", "cre"),
comment = c(ORCID = "0000-0003-2567-5728")),
......
# RFSurrogates (development version)
<!-- News Style-guide: https://style.tidyverse.org/news.html -->
# RFSurrogates 0.3.4
* `var.select.smd()`, `var.select.md()`, `var.relations()`, `var.relations.mfi()`: Made several improvements to developer experience:
* `create.forest` now defaults to `is.null(forest)`, so it will automatically be `TRUE` if no forest is provided, and `FALSE` otherwise.
* `x` is no longer required if `create.forest` is `FALSE`.
* (Internal) Inverted some nested guard clauses for readability.
* `addLayer()`: Refactor for-loop to lapply.
* Add `num.threads` param to enable parallelization using `parallel::mclapply()`. It defaults to 1 for backward compatibility.
* `getTreeranger()`: Refactor `lapply()` to `parallel::mclapply()`.
* Add `num.threads` param (passed to `mc.cores` in `parallel::mclapply()`). It defaults to 1 for backward compatibility.
* Add `add_layer` param to include the effect of `addLayer` within the same loop. Defaults to `FALSE` for backward compatibility.
* (Internal) `getsingletree()`: Add `add_layer` param to enable adding layers within the same loop.
* `addSurrogates()`:
* Clarified the default value of `num.threads` by making `parallel::detectCores()` the parameter default.
* Added assertion that `RF` is a `ranger` object.
* Added assertion that `RF$num.trees` and `length(trees)` are equal. This is not considered a breaking change since these values should always be equal when the function is used correctly.
* Added S3 classes to the `trees` list objects.
* `getTreeranger()` now returns a `RangerTrees` list.
* `addLayer()` and `getTreeranger(add_layer = TRUE)` add the `LayerTrees` class to the list (indicating presence of the `layer` list item). `addLayer()` now requires that its `trees` param inherits `RangerTrees`.
* `addSurrogates()` now adds the `SurrogateTrees` class. It now requires that its `trees` param inherits `RangerTrees`.
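
The class scheme described in the last bullets can be illustrated with a minimal base R sketch. The helper names `as_ranger_trees()` and `mark_layer()` are hypothetical stand-ins, not functions of the package; only the class names `RangerTrees` and `LayerTrees` come from the entries above.

```r
# Hypothetical helper: tag a plain list of trees the way `getTreeranger()` is described to.
as_ranger_trees <- function(trees) {
  class(trees) <- "RangerTrees"
  trees
}

# Hypothetical helper: mimic the guard and class extension described for `addLayer()`.
mark_layer <- function(trees) {
  if (!inherits(trees, "RangerTrees")) {
    stop("`trees` must be a `RangerTrees` object.")
  }
  # Append the new class so earlier classes are kept.
  class(trees) <- c(class(trees), "LayerTrees")
  trees
}

trees <- as_ranger_trees(list(data.frame(nodeID = 1)))
layered <- mark_layer(trees)

inherits(layered, "RangerTrees") # still a RangerTrees list
inherits(layered, "LayerTrees")  # and now also a LayerTrees list
```

Because classes are appended rather than replaced, a later step that only checks for `RangerTrees` accepts both plain and layered tree lists.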
# RFSurrogates 0.3.3
......
#' Add layer information to a forest that was created by getTreeranger
#'
#' This function adds the layer information to each node in a list with trees that was obtained by getTreeranger.
#' You should use [`getTreeranger()`] with `add_layer = TRUE` instead.
#'
#' @param trees The output of [`getTreeranger()`].
#' @param num.threads (Default: 1) Number of threads to spawn for parallelization.
#'
#' @returns A list of tree data frames of length `RF$num.trees`.
#' Each row of the tree data frames corresponds to a node of the respective tree and the columns correspond to:
#' * `nodeID`: ID of the respective node (important for left and right daughters in the next columns)
#' * `leftdaughter`: ID of the left daughter of this node
#' * `rightdaughter`: ID of the right daughter of this node
#' * `splitvariable`: ID of the split variable
#' * `splitpoint`: Split point of the split variable.
#' For categorical variables this is a comma-separated list of values, representing the factor levels (in the original order) going to the right.
#' * `status`: `0` for terminal (`splitpoint` is `NA`) and `1` for non-terminal.
#' * `layer`: Tree layer depth information, starting at 0 (root node) and incremented for each layer.
#'
#' @param trees list of trees created by getTreeranger
#' @return a list with trees. Each row of the list elements corresponds to a node of the respective tree and the columns correspond to:
#' \itemize{
#' \item nodeID: ID of the respective node (important for left and right daughters in the next columns)
#' \item leftdaughter: ID of the left daughter of this node
#' \item rightdaughter: ID of the right daughter of this node
#' \item splitvariable: ID of the split variable
#' \item splitpoint: splitpoint of the split variable
#' \item status: "0" for terminal and "1" for non-terminal
#' \item layer: layer information (0 means root node, 1 means 1 layer below root, etc)
#' }
#' @export
addLayer <- function(trees) {
# This function adds the respective layer to the different nodes in a tree. The tree has to be prepared by getTree function
tree.layer <- list()
num.trees <- length(trees)
for (i in 1:num.trees) {
tree <- trees[[i]]
layer <- rep(NA, nrow(tree))
layer[1] <- 0
t <- 1
while (anyNA(layer)) {
r <- unlist(tree[which(layer == (t - 1)), 2:3])
layer[r] <- t
t <- t + 1
}
tree <- cbind(tree, layer)
tree <- tree[order(as.numeric(tree[, "layer"])), ]
tree.layer[[i]] <- tree
#' @md
addLayer <- function(trees, num.threads = 1) {
if (!inherits(trees, "RangerTrees")) {
stop("`trees` must be a `getTreeranger` `RangerTrees` object.")
}
layer.trees <- parallel::mclapply(trees, add_layer_to_tree, mc.cores = num.threads)
class(layer.trees) <- c(class(trees), "LayerTrees")
return(layer.trees)
}
#' Internal function
#'
#' This function adds the respective layer to the different nodes in a tree.
#' The tree has to be prepared by getTree function.
#'
#' @param tree A tree data frame from [getTreeranger()].
#'
#' @returns A tree data frame with `layer` added.
#'
#' @seealso [addLayer()]
#'
#' @keywords internal
#' @md
add_layer_to_tree <- function(tree) {
layer <- rep(NA, nrow(tree))
layer[1] <- 0
t <- 1
while (anyNA(layer)) {
r <- unlist(tree[which(layer == (t - 1)), 2:3])
layer[r] <- t
t <- t + 1
}
return(tree.layer)
tree <- cbind(tree, layer)
tree <- tree[order(as.numeric(tree[, "layer"])), ]
return(tree)
}
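
As a worked example of the layer assignment above, the following sketch runs the same loop on a toy tree. The function name `layer_sketch()` and the toy data frame are assumptions made for illustration; the loop body follows `add_layer_to_tree()` as shown in this diff, and `0` marks a missing daughter, as in the trees returned by `getTreeranger()`.

```r
# Illustrative copy of the layer loop, applied to a hand-built tree.
layer_sketch <- function(tree) {
  layer <- rep(NA, nrow(tree))
  layer[1] <- 0 # the root node is layer 0
  t <- 1
  while (anyNA(layer)) {
    # Daughters (columns 2 and 3) of all nodes in the previous layer.
    # Zero entries (no daughter) are ignored by R's index assignment.
    r <- unlist(tree[which(layer == (t - 1)), 2:3])
    layer[r] <- t
    t <- t + 1
  }
  tree <- cbind(tree, layer)
  tree[order(as.numeric(tree[, "layer"])), ]
}

# Toy tree: node 1 splits into 2 and 3, node 2 splits into 4 and 5.
toy <- data.frame(
  nodeID = 1:5,
  leftdaughter = c(2, 4, 0, 0, 0),
  rightdaughter = c(3, 5, 0, 0, 0)
)
toy_layered <- layer_sketch(toy)
toy_layered$layer
```

The loop relies on every node being reachable from the root; otherwise `anyNA(layer)` would never become `FALSE`.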
#' Add surrogate information that was created by getTreeranger
#' Add surrogate information to a tree list.
#'
#' This function adds surrogate variables and adjusted agreement values to a forest that was created by getTreeranger.
#' This function adds surrogate variables and adjusted agreement values to a forest that was created by [getTreeranger].
#'
#' @param RF random forest object created by ranger (with keep.inbag=TRUE).
#' @param trees list of trees created by getTreeranger.
#' @param s Predefined number of surrogate splits (it may happen that the actual number of surrogate splits differes in individual nodes). Default is 1 \% of no. of variables.
#' @param RF A [ranger::ranger] object which was created with `keep.inbag = TRUE`.
#' @param trees List of trees created by [getTreeranger].
#' @param s Predefined number of surrogate splits (it may happen that the actual number of surrogate splits differs in individual nodes).
#' @param Xdata data without the dependent variable.
#' @param num.threads number of threads used for parallel execution. Default is number of CPUs available.
#' @return a list with trees containing of lists of nodes with the elements:
#' \itemize{
#' \item nodeID: ID of the respective node (important for left and right daughters in the next columns)
#' \item leftdaughter: ID of the left daughter of this node
#' \item rightdaughter: ID of the right daughter of this node
#' \item splitvariable: ID of the split variable
#' \item splitpoint: splitpoint of the split variable
#' \item status: "0" for terminal and "1" for non-terminal
#' \item layer: layer information (0 means root node, 1 means 1 layer below root, etc)
#' \item surrogate_i: numbered surrogate variables (number depending on s)
#' \item adj_i: adjusted agreement of variable i
#' }
#' @param num.threads (Default: [parallel::detectCores()]) Number of threads to spawn for parallelization.
#'
#' @returns A list of trees.
#' Each tree consists of a list of nodes with the elements:
#' * `nodeID`: ID of the respective node (important for left and right daughters in the next columns)
#' * `leftdaughter`: ID of the left daughter of this node
#' * `rightdaughter`: ID of the right daughter of this node
#' * `splitvariable`: ID of the split variable
#' * `splitpoint`: splitpoint of the split variable
#' * `status`: `0` for terminal and `1` for non-terminal
#' * `layer`: layer information (`0` means root node, `1` means 1 layer below root, etc)
#' * `surrogate_i`: numbered surrogate variables (number depending on s)
#' * `adj_i`: adjusted agreement of variable i
#'
#' @export
addSurrogates <- function(RF, trees, s, Xdata, num.threads) {
num.trees <- length(trees)
ncat <- sapply(sapply(Xdata, levels), length) # determine number of categories (0 for continuous variables)
names(ncat) <- colnames(Xdata)
#' @md
addSurrogates <- function(RF, trees, s, Xdata, num.threads = parallel::detectCores()) {
if (!inherits(RF, "ranger")) {
stop("`RF` must be a ranger object.")
}
if (is.null(num.threads)) {
num.threads <- parallel::detectCores()
if (!inherits(trees, "RangerTrees")) {
stop("`trees` must be a `getTreeranger` `RangerTrees` object.")
}
num.trees <- RF$num.trees
if (num.trees != length(trees)) {
stop("Number of trees in ranger model `RF` does not match number of extracted trees in `trees`.")
}
ncat <- sapply(sapply(Xdata, levels), length) # determine number of categories (0 for continuous variables)
names(ncat) <- colnames(Xdata)
if (any(ncat > 0)) {
Xdata[, which(ncat > 0)] <- sapply(Xdata[, which(ncat > 0)], unclass)
}
......@@ -48,6 +59,9 @@ addSurrogates <- function(RF, trees, s, Xdata, num.threads) {
ncat = ncat
)
)
class(trees.surr) <- c(class(trees), "SurrogateTrees")
return(trees.surr)
}
......@@ -56,6 +70,7 @@ addSurrogates <- function(RF, trees, s, Xdata, num.threads) {
#' This is an internal function
#'
#' @keywords internal
#' @md
getSurrogate <- function(surr.par, k = 1, maxsurr) {
# weights and trees are extracted
tree <- surr.par$trees[[k]]
......@@ -79,6 +94,7 @@ getSurrogate <- function(surr.par, k = 1, maxsurr) {
#' @useDynLib RFSurrogates, .registration = TRUE
#'
#' @keywords internal
#' @md
SurrTree <- function(j, wt, Xdata, controls, column.names, tree, maxsurr, ncat) {
node <- tree[j, ]
# for non-terminal nodes get surrogates
......@@ -132,6 +148,7 @@ SurrTree <- function(j, wt, Xdata, controls, column.names, tree, maxsurr, ncat)
#' This is an internal function
#'
#' @keywords internal
#' @md
name.surr <- function(i, surrogate.names) {
surrogate.names <- c(surrogate.names, paste0("surrogate_", i))
return(surrogate.names)
......@@ -142,6 +159,7 @@ name.surr <- function(i, surrogate.names) {
#' This is an internal function
#'
#' @keywords internal
#' @md
name.adj <- function(i, adj.names) {
adj.names <- c(adj.names, paste0("adj_", i))
return(adj.names)
......
......@@ -34,8 +34,6 @@
#'
#' # investigate variable relations
#' rel <- var.relations(
#' x = data.frame(),
#' create.forest = FALSE,
#' forest = list(trees = trees.surr, allvariables = allvariables),
#' variables = allvariables,
#' candidates = allvariables,
......
#' Get a list of structured trees for ranger
#' Get a list of structured trees from a ranger object.
#'
#' This function creates a list of trees for ranger objects, similar to what the getTree function does for randomForest objects.
#'
#' @param RF random forest object created by ranger (with keep.inbag=TRUE)
#' @param num.trees number of trees
#' @return a list with trees. Each row of the list elements corresponds to a node of the respective tree and the columns correspond to:
#' \itemize{
#' \item nodeID: ID of the respective node (important for left and right daughters in the next columns)
#' \item leftdaughter: ID of the left daughter of this node
#' \item rightdaughter: ID of the right daughter of this node
#' \item splitvariable: ID of the split variable
#' \item splitpoint: splitpoint of the split variable (for categorical variables this is a comma separated lists of values, representing the factor levels (in the original order) going to the right)
#' \item status: "0" for terminal and "1" for non-terminal
#' }
#' @param RF A [`ranger::ranger`] object which was created with `keep.inbag = TRUE`.
#' @param num.trees (Deprecated) Number of trees to convert (Default: `RF$num.trees`).
#' @param add_layer (Default: `FALSE`) Whether to [addLayer()] in the same loop.
#' @param num.threads (Default: 1) Number of threads to spawn for parallelization.
#'
#' @returns A list of tree data frames of length `RF$num.trees`.
#' Each row of the tree data frames corresponds to a node of the respective tree and the columns correspond to:
#' * `nodeID`: ID of the respective node (important for left and right daughters in the next columns)
#' * `leftdaughter`: ID of the left daughter of this node
#' * `rightdaughter`: ID of the right daughter of this node
#' * `splitvariable`: ID of the split variable
#' * `splitpoint`: Split point of the split variable.
#' For categorical variables this is a comma-separated list of values, representing the factor levels (in the original order) going to the right.
#' * `status`: `0` for terminal (`splitpoint` is `NA`) and `1` for non-terminal.
#' * `layer`: If `add_layer` is `TRUE`, see [addLayer()]
#'
#' @export
getTreeranger <- function(RF, num.trees) {
trees <- lapply(1:num.trees, getsingletree, RF = RF)
#' @md
getTreeranger <- function(RF, num.trees = RF$num.trees, add_layer = FALSE, num.threads = 1) {
trees <- parallel::mclapply(1:num.trees, getsingletree,
mc.cores = num.threads,
RF = RF,
add_layer = add_layer
)
class(trees) <- "RangerTrees"
if (add_layer) {
class(trees) <- c(class(trees), "LayerTrees")
}
return(trees)
}
......@@ -24,26 +39,42 @@ getTreeranger <- function(RF, num.trees) {
#'
#' This is an internal function
#'
#' @param RF A [`ranger::ranger`] object.
#' @param k Tree index to convert.
#' @param add_layer (Default: `FALSE`) Whether to add layer information via [add_layer_to_tree()].
#'
#' @returns A tree data frame for the `k`th tree in `RF`.
#' Each row of the tree data frames corresponds to a node of the respective tree and the columns correspond to:
#' * `nodeID`: ID of the respective node (important for left and right daughters in the next columns)
#' * `leftdaughter`: ID of the left daughter of this node
#' * `rightdaughter`: ID of the right daughter of this node
#' * `splitvariable`: ID of the split variable
#' * `splitpoint`: Split point of the split variable.
#' For categorical variables this is a comma-separated list of values, representing the factor levels (in the original order) going to the right.
#' * `status`: `0` for terminal (`splitpoint` is `NA`) and `1` for non-terminal.
#'
#' @keywords internal
getsingletree <- function(RF, k = 1) {
#' @md
getsingletree <- function(RF, k = 1, add_layer = FALSE) {
# Here we use the treeInfo function of the ranger package to extract the trees. In an earlier version this was done with a self-implemented function.
tree.ranger <- ranger::treeInfo(RF, tree = k)
ktree <- data.frame(
as.numeric(tree.ranger$nodeID + 1),
as.numeric(tree.ranger$leftChild + 1),
as.numeric(tree.ranger$rightChild + 1),
as.numeric(tree.ranger$splitvarID + 1),
tree.ranger$splitval,
tree.ranger$terminal
nodeID = as.numeric(tree.ranger$nodeID + 1),
leftdaughter = as.numeric(tree.ranger$leftChild + 1),
rightdaughter = as.numeric(tree.ranger$rightChild + 1),
splitvariable = as.numeric(tree.ranger$splitvarID + 1),
splitpoint = tree.ranger$splitval,
status = as.numeric(!tree.ranger$terminal)
)
if (is.factor(ktree[, 5])) {
ktree[, 5] <- as.character(levels(ktree[, 5]))[ktree[, 5]]
if (is.factor(ktree[, "splitpoint"])) {
ktree[, "splitpoint"] <- as.character(levels(ktree[, "splitpoint"]))[ktree[, "splitpoint"]]
}
ktree[, 6] <- as.numeric(ktree[, 6] == FALSE)
for (i in 2:4) {
ktree[, i][is.na(ktree[, i])] <- 0
ktree[, 2:4][is.na(ktree[, 2:4])] <- 0
if (add_layer) {
ktree <- add_layer_to_tree(ktree)
}
colnames(ktree) <- c("nodeID", "leftdaughter", "rightdaughter", "splitvariable", "splitpoint", "status")
return(ktree)
}
......@@ -27,8 +27,6 @@
#'
#' # execute SMD on tree with reduced number of surrogates
#' res.new <- var.select.smd(
#' x = data.frame(),
#' create.forest = FALSE,
#' forest = forest.new,
#' num.threads = 1
#' )
......@@ -36,8 +34,6 @@
#'
#' # investigate variable relations
#' rel <- var.relations(
#' x = data.frame(),
#' create.forest = FALSE,
#' forest = forest.new,
#' variables = c("X1", "X7"),
#' candidates = res$forest[["allvariables"]][1:100],
......
......@@ -43,14 +43,15 @@
#'
#' @export
var.relations <- function(x = NULL, y = NULL, num.trees = 500, type = "regression", s = NULL, mtry = NULL, min.node.size = 1,
num.threads = NULL, status = NULL, save.ranger = FALSE, create.forest = TRUE, forest = NULL,
num.threads = NULL, status = NULL, save.ranger = FALSE, create.forest = is.null(forest), forest = NULL,
save.memory = FALSE, case.weights = NULL,
variables, candidates, t = 5, select.rel = TRUE) {
if (!is.data.frame(x)) {
stop("x has to be a data frame")
}
if (create.forest) {
## check data
if (!is.data.frame(x)) {
stop("x has to be a data frame")
}
if (length(y) != nrow(x)) {
stop("length of y and number of rows in x are different")
}
......@@ -123,25 +124,24 @@ var.relations <- function(x = NULL, y = NULL, num.trees = 500, type = "regressio
forest <- list(trees = trees.surr, allvariables = colnames(data[, -1]))
}
if (!create.forest) {
if (is.null(forest)) {
stop("set create.forest to TRUE or analyze an existing random forest specified by parameter forest")
}
if (!create.forest && is.null(forest)) {
stop("set create.forest to TRUE or analyze an existing random forest specified by parameter forest")
}
trees <- forest[["trees"]]
allvariables <- forest[["allvariables"]]
if (all(candidates %in% allvariables)) {
if (all(variables %in% allvariables)) {
# count surrogates
s <- count.surrogates(trees)
results.meanAdjAgree <- meanAdjAgree(trees, variables, allvariables, candidates, t = t, s$s.a, select.var = select.rel, num.threads = num.threads)
} else {
stop("allvariables do not contain the chosen variables")
}
} else {
if (!all(candidates %in% allvariables)) {
stop("allvariables do not contain the candidate variables")
}
if (!all(variables %in% allvariables)) {
stop("allvariables do not contain the chosen variables")
}
# count surrogates
s <- count.surrogates(trees)
results.meanAdjAgree <- meanAdjAgree(trees, variables, allvariables, candidates, t = t, s$s.a, select.var = select.rel, num.threads = num.threads)
if (select.rel) {
surr.var <- results.meanAdjAgree$surr.var
varlist <- list()
......
......@@ -43,14 +43,15 @@
#'
#' @export
var.relations.mfi <- function(x = NULL, y = NULL, num.trees = 500, type = "regression", s = NULL, mtry = NULL, min.node.size = 1,
num.threads = NULL, status = NULL, save.ranger = FALSE, create.forest = TRUE, forest = NULL,
num.threads = NULL, status = NULL, save.ranger = FALSE, create.forest = is.null(forest), forest = NULL,
save.memory = FALSE, case.weights = NULL,
variables, candidates, p.t = 0.01, select.rel = TRUE, method = "janitza") {
if (!is.data.frame(x)) {
stop("x has to be a data frame")
}
if (create.forest) {
## check data
if (!is.data.frame(x)) {
stop("x has to be a data frame")
}
if (length(y) != nrow(x)) {
stop("length of y and number of rows in x are different")
}
......@@ -147,34 +148,32 @@ var.relations.mfi <- function(x = NULL, y = NULL, num.trees = 500, type = "regre
forest_perm <- list(trees = trees.surr_perm, allvariables = colnames(data_perm[, -1]))
}
if (!create.forest) {
if (is.null(forest)) {
stop("set create.forest to TRUE or analyze an existing random forest specified by parameter forest")
}
if (!create.forest && is.null(forest)) {
stop("set create.forest to TRUE or analyze an existing random forest specified by parameter forest")
}
if (all(candidates %in% allvariables)) {
if (all(variables %in% allvariables)) {
# count surrogates
s <- count.surrogates(forest$trees)
rel <- meanAdjAgree(forest$trees,
variables = allvariables, allvariables = allvariables, candidates = allvariables,
t = t, s$s.a, select.var = FALSE, num.threads = num.threads
)
allvariables_perm <- colnames(x_perm)
rel_perm <- meanAdjAgree(forest_perm$trees,
variables = allvariables_perm, allvariables = allvariables_perm, candidates = allvariables_perm,
t = t, s$s.a, select.var = FALSE, num.threads = num.threads
)
} else {
stop("allvariables do not contain the chosen variables")
}
} else {
if (!all(candidates %in% allvariables)) {
stop("allvariables do not contain the candidate variables")
}
if (!all(variables %in% allvariables)) {
stop("allvariables do not contain the chosen variables")
}
# count surrogates
s <- count.surrogates(forest$trees)
rel <- meanAdjAgree(forest$trees,
variables = allvariables, allvariables = allvariables, candidates = allvariables,
t = t, s$s.a, select.var = FALSE, num.threads = num.threads
)
allvariables_perm <- colnames(x_perm)
rel_perm <- meanAdjAgree(forest_perm$trees,
variables = allvariables_perm, allvariables = allvariables_perm, candidates = allvariables_perm,
t = t, s$s.a, select.var = FALSE, num.threads = num.threads
)
adj.agree <- rel$surr.res
adj.agree.perm <- rel_perm$surr.res
diag(adj.agree) <- diag(adj.agree.perm) <- 1
......
......@@ -13,8 +13,8 @@
#' @param num.threads number of threads used for parallel execution. Default is number of CPUs available.
#' @param status status variable, only applicable to survival data. Use 1 for event and 0 for censoring.
#' @param save.ranger Set TRUE if ranger object should be saved. Default is that ranger object is not saved (FALSE).
#' @param create.forest set FALSE if you want to analyze an existing forest. Default is TRUE.
#' @param forest the random forest that should be analyzed if create.forest is set to FALSE. (x and y still have to be given to obtain variable names)
#' @param create.forest Whether to create a new forest (TRUE) or analyze the existing `forest` (FALSE). Default: TRUE if `forest` is NULL, FALSE otherwise.
#' @param forest the random forest that should be analyzed.
#' @param save.memory Use memory saving (but slower) splitting mode. No effect for survival and GWAS data. Warning: This option slows down the tree growing, use only if you encounter memory problems. (This parameter is transferred to ranger)
#' @param case.weights Weights for sampling of training observations. Observations with larger weights will be selected with higher probability in the bootstrap (or subsampled) samples for the trees.
#'
......@@ -59,7 +59,7 @@
#'
#' @export
var.select.md <- function(x = NULL, y = NULL, num.trees = 500, type = "regression", mtry = NULL, min.node.size = 1, num.threads = NULL,
status = NULL, save.ranger = FALSE, create.forest = TRUE, forest = NULL, save.memory = FALSE, case.weights = NULL) {
status = NULL, save.ranger = FALSE, create.forest = is.null(forest), forest = NULL, save.memory = FALSE, case.weights = NULL) {
results.smd <- var.select.smd(
x = x, y = y, num.trees = num.trees, type = type, mtry = mtry, min.node.size = min.node.size, num.threads = num.threads,
status = status, save.ranger = save.ranger, s = 0, create.forest = create.forest, forest = forest,
......
......@@ -167,25 +167,25 @@ var.select.mir <- function(x = NULL, y = NULL, num.trees = 500, type = "regressi
if (select.var) {
if (method.sel == "janitza") {
if (corr.rel) {
## Mirrored VIMP (# This part is taken from ranger function)
m1 <- mir[mir < 0]
m2 <- mir[mir == 0]
null.rel <- c(m1, -m1, m2)
if (!corr.rel) {
stop("Janitza approach should only be conducted with corrected relations")
}
pval <- 1 - ranger:::numSmaller(mir, null.rel) / length(null.rel)
names(pval) <- allvariables
selected <- as.numeric(pval <= p.t.sel)
names(selected) <- names(pval)
## Mirrored VIMP (# This part is taken from ranger function)
m1 <- mir[mir < 0]
m2 <- mir[mir == 0]
null.rel <- c(m1, -m1, m2)
if (length(m1) == 0) {
stop("No negative importance values found for selection of important variables. Consider the 'permutation' approach.")
}
if (length(m1) < 100) {
warning("Only few negative importance values found for selection of important variables, inaccurate p-values. Consider the 'permutation' approach.")
}
} else {
stop("Janitza approach should only be conducted with corrected relations")
pval <- 1 - ranger:::numSmaller(mir, null.rel) / length(null.rel)
names(pval) <- allvariables
selected <- as.numeric(pval <= p.t.sel)
names(selected) <- names(pval)
if (length(m1) == 0) {
stop("No negative importance values found for selection of important variables. Consider the 'permutation' approach.")
}
if (length(m1) < 100) {
warning("Only few negative importance values found for selection of important variables, inaccurate p-values. Consider the 'permutation' approach.")
}
}
......
......@@ -14,8 +14,8 @@
#' @param s predefined number of surrogate splits (it may happen that the actual number of surrogate splits differs in individual nodes). Default is 1 \% of no. of variables.
#' @param status status variable, only applicable to survival data. Use 1 for event and 0 for censoring.
#' @param save.ranger set TRUE if ranger object should be saved. Default is that ranger object is not saved (FALSE).
#' @param create.forest set FALSE if you want to analyze an existing forest. Default is TRUE.
#' @param forest the random forest that should be analyzed if create.forest is set to FALSE. (x and y still have to be given to obtain variable names)
#' @param create.forest Whether to create a new forest (TRUE) or analyze the existing `forest` (FALSE). Default: TRUE if `forest` is NULL, FALSE otherwise.
#' @param forest the random forest that should be analyzed
#' @param save.memory Use memory saving (but slower) splitting mode. No effect for survival and GWAS data. Warning: This option slows down the tree growing, use only if you encounter memory problems. (This parameter is transferred to ranger)
#' @param case.weights Weights for sampling of training observations. Observations with larger weights will be selected with higher probability in the bootstrap (or subsampled) samples for the trees.
#'
......@@ -63,13 +63,14 @@
##' }
#' @export
var.select.smd <- function(x = NULL, y = NULL, num.trees = 500, type = "regression", s = NULL, mtry = NULL, min.node.size = 1,
num.threads = NULL, status = NULL, save.ranger = FALSE, create.forest = TRUE, forest = NULL,
num.threads = NULL, status = NULL, save.ranger = FALSE, create.forest = is.null(forest), forest = NULL,
save.memory = FALSE, case.weights = NULL) {
if (!is.data.frame(x)) {
stop("x has to be a data frame")
}
if (create.forest) {
## check data
if (!is.data.frame(x)) {
stop("x has to be a data frame")
}
if (length(y) != nrow(x)) {
stop("length of y and number of rows in x are different")
}
......
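
The new default `create.forest = is.null(forest)` works because R evaluates default arguments lazily, inside the function, so a default may refer to an argument that appears later in the signature. A minimal sketch, assuming a hypothetical function `analyze_sketch()` that only mimics the guard logic of this hunk and builds no forest:

```r
# Hypothetical stand-in for the signature pattern used by `var.select.smd()`.
analyze_sketch <- function(x = NULL, create.forest = is.null(forest), forest = NULL) {
  if (create.forest) {
    # `x` is only validated when a forest has to be created.
    if (!is.data.frame(x)) {
      stop("x has to be a data frame")
    }
    return("created a new forest from x")
  }
  "reused the supplied forest"
}

analyze_sketch(x = data.frame(a = 1))         # no forest given: create.forest is TRUE
analyze_sketch(forest = list(trees = list())) # forest given: create.forest is FALSE, x not needed
```

Callers that pass `create.forest` explicitly keep the previous behavior, since an explicit value overrides the default.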
......@@ -118,7 +118,6 @@ Subsequently, variable relations are analyzed with `var.relations()`. The parame
```r
candidates <- colnames(SMD_example_data)[2:101]
rel <- var.relations(
x = data.frame(), create.forest = FALSE,
forest = res.smd$forest,
variables = c("X1", "X7"), candidates = candidates,
t = 5)
......
......@@ -4,23 +4,28 @@
\alias{addLayer}
\title{Add layer information to a forest that was created by getTreeranger}
\usage{
addLayer(trees)
addLayer(trees, num.threads = 1)
}
\arguments{
\item{trees}{list of trees created by getTreeranger}
\item{trees}{The output of \code{\link[=getTreeranger]{getTreeranger()}}.}
\item{num.threads}{(Default: 1) Number of threads to spawn for parallelization.}
}
\value{
a list with trees. Each row of the list elements corresponds to a node of the respective tree and the columns correspond to:
A list of tree data frames of length \code{RF$num.trees}.
Each row of the tree data frames corresponds to a node of the respective tree and the columns correspond to:
\itemize{
\item nodeID: ID of the respective node (important for left and right daughters in the next columns)
\item leftdaughter: ID of the left daughter of this node
\item rightdaughter: ID of the right daughter of this node
\item splitvariable: ID of the split variable
\item splitpoint: splitpoint of the split variable
\item status: "0" for terminal and "1" for non-terminal
\item layer: layer information (0 means root node, 1 means 1 layer below root, etc)
\item \code{nodeID}: ID of the respective node (important for left and right daughters in the next columns)
\item \code{leftdaughter}: ID of the left daughter of this node
\item \code{rightdaughter}: ID of the right daughter of this node
\item \code{splitvariable}: ID of the split variable
\item \code{splitpoint}: Split point of the split variable.
For categorical variables this is a comma-separated list of values, representing the factor levels (in the original order) going to the right.
\item \code{status}: \code{0} for terminal (\code{splitpoint} is \code{NA}) and \code{1} for non-terminal.
\item \code{layer}: Tree layer depth information, starting at 0 (root node) and incremented for each layer.
}
}
\description{
This function adds the layer information to each node in a list with trees that was obtained by getTreeranger.
You should use \code{\link[=getTreeranger]{getTreeranger()}} with \code{add_layer = TRUE} instead.
}
......@@ -2,35 +2,36 @@
% Please edit documentation in R/addSurrogates.R
\name{addSurrogates}
\alias{addSurrogates}
\title{Add surrogate information that was created by getTreeranger}
\title{Add surrogate information to a tree list.}
\usage{
addSurrogates(RF, trees, s, Xdata, num.threads)
addSurrogates(RF, trees, s, Xdata, num.threads = parallel::detectCores())
}
\arguments{
\item{RF}{random forest object created by ranger (with keep.inbag=TRUE).}
\item{RF}{A \link[ranger:ranger]{ranger::ranger} object which was created with \code{keep.inbag = TRUE}.}
\item{trees}{list of trees created by getTreeranger.}
\item{trees}{List of trees created by \link{getTreeranger}.}
\item{s}{Predefined number of surrogate splits (it may happen that the actual number of surrogate splits differes in individual nodes). Default is 1 \% of no. of variables.}
\item{s}{Predefined number of surrogate splits (it may happen that the actual number of surrogate splits differs in individual nodes).}
\item{Xdata}{data without the dependent variable.}
\item{num.threads}{number of threads used for parallel execution. Default is number of CPUs available.}
\item{num.threads}{(Default: \code{\link[parallel:detectCores]{parallel::detectCores()}}) Number of threads to spawn for parallelization.}
}
\value{
a list with trees containing of lists of nodes with the elements:
A list of trees.
Each tree consists of a list of nodes with the elements:
\itemize{
\item nodeID: ID of the respective node (important for left and right daughters in the next columns)
\item leftdaughter: ID of the left daughter of this node
\item rightdaughter: ID of the right daughter of this node
\item splitvariable: ID of the split variable
\item splitpoint: splitpoint of the split variable
\item status: "0" for terminal and "1" for non-terminal
\item layer: layer information (0 means root node, 1 means 1 layer below root, etc)
\item surrogate_i: numbered surrogate variables (number depending on s)
\item adj_i: adjusted agreement of variable i
\item \code{nodeID}: ID of the respective node (important for left and right daughters in the next columns)
\item \code{leftdaughter}: ID of the left daughter of this node
\item \code{rightdaughter}: ID of the right daughter of this node
\item \code{splitvariable}: ID of the split variable
\item \code{splitpoint}: splitpoint of the split variable
\item \code{status}: \code{0} for terminal and \code{1} for non-terminal
\item \code{layer}: layer information (\code{0} means root node, \code{1} means 1 layer below root, etc)
\item \code{surrogate_i}: numbered surrogate variables (number depending on s)
\item \code{adj_i}: adjusted agreement of variable i
}
}
\description{
This function adds surrogate variables and adjusted agreement values to a forest that was created by getTreeranger.
This function adds surrogate variables and adjusted agreement values to a forest that was created by \link{getTreeranger}.
}
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/addLayer.R
\name{add_layer_to_tree}
\alias{add_layer_to_tree}
\title{Internal function}
\usage{
add_layer_to_tree(tree)
}
\arguments{
\item{tree}{A tree data frame from \code{\link[=getTreeranger]{getTreeranger()}}.}
}
\value{
A tree data frame with \code{layer} added.
}
\description{
This function adds the respective layer to the different nodes in a tree.
The tree has to be prepared by \code{\link[=getTreeranger]{getTreeranger()}}.
}
\seealso{
\code{\link[=addLayer]{addLayer()}}
}
\keyword{internal}
......@@ -44,8 +44,6 @@ trees.surr <- addSurrogates(RF = RF, trees = trees.lay, s = 10, Xdata = x, num.t
# investigate variable relations
rel <- var.relations(
x = data.frame(),
create.forest = FALSE,
forest = list(trees = trees.surr, allvariables = allvariables),
variables = allvariables,
candidates = allvariables,
......
......@@ -2,24 +2,31 @@
% Please edit documentation in R/getTreeranger.R
\name{getTreeranger}
\alias{getTreeranger}
\title{Get a list of structured trees for ranger}
\title{Get a list of structured trees from a ranger object.}
\usage{
getTreeranger(RF, num.trees)
getTreeranger(RF, num.trees = RF$num.trees, add_layer = FALSE, num.threads = 1)
}
\arguments{
\item{RF}{random forest object created by ranger (with keep.inbag=TRUE)}
\item{RF}{A \code{\link[ranger:ranger]{ranger::ranger}} object which was created with \code{keep.inbag = TRUE}.}
\item{num.trees}{number of trees}
\item{num.trees}{(Deprecated) Number of trees to convert (Default: \code{RF$num.trees}).}
\item{add_layer}{(Default: \code{FALSE}) Whether to \code{\link[=addLayer]{addLayer()}} in the same loop.}
\item{num.threads}{(Default: 1) Number of threads to spawn for parallelization.}
}
\value{
a list with trees. Each row of the list elements corresponds to a node of the respective tree and the columns correspond to:
A list of tree data frames of length \code{RF$num.trees}.
Each row of the tree data frames corresponds to a node of the respective tree and the columns correspond to:
\itemize{
\item nodeID: ID of the respective node (important for left and right daughters in the next columns)
\item leftdaughter: ID of the left daughter of this node
\item rightdaughter: ID of the right daughter of this node
\item splitvariable: ID of the split variable
\item splitpoint: splitpoint of the split variable (for categorical variables this is a comma separated lists of values, representing the factor levels (in the original order) going to the right)
\item status: "0" for terminal and "1" for non-terminal
\item \code{nodeID}: ID of the respective node (important for left and right daughters in the next columns)
\item \code{leftdaughter}: ID of the left daughter of this node
\item \code{rightdaughter}: ID of the right daughter of this node
\item \code{splitvariable}: ID of the split variable
\item \code{splitpoint}: Split point of the split variable.
For categorical variables this is a comma-separated list of values, representing the factor levels (in the original order) going to the right.
\item \code{status}: \code{0} for terminal (\code{splitpoint} is \code{NA}) and \code{1} for non-terminal.
\item \code{layer}: If \code{add_layer} is \code{TRUE}, see \code{\link[=addLayer]{addLayer()}}
}
}
\description{
......
......@@ -4,7 +4,27 @@
\alias{getsingletree}
\title{getsingletree}
\usage{
getsingletree(RF, k = 1)
getsingletree(RF, k = 1, add_layer = FALSE)
}
\arguments{
\item{RF}{A \code{\link[ranger:ranger]{ranger::ranger}} object.}
\item{k}{Tree index to convert.}
\item{add_layer}{(Default: \code{FALSE}) Whether to \code{\link[=addLayer]{addLayer()}} in the same loop.}
}
\value{
A tree data frame for the \code{k}th tree in \code{RF}.
Each row of the tree data frames corresponds to a node of the respective tree and the columns correspond to:
\itemize{
\item \code{nodeID}: ID of the respective node (important for left and right daughters in the next columns)
\item \code{leftdaughter}: ID of the left daughter of this node
\item \code{rightdaughter}: ID of the right daughter of this node
\item \code{splitvariable}: ID of the split variable
\item \code{splitpoint}: Split point of the split variable.
For categorical variables this is a comma-separated list of values, representing the factor levels (in the original order) going to the right.
\item \code{status}: \code{0} for terminal (\code{splitpoint} is \code{NA}) and \code{1} for non-terminal.
}
}
\description{
This is an internal function
......
......@@ -36,8 +36,6 @@ forest.new <- reduce.surrogates(forest = res$forest, s = 10)
# execute SMD on tree with reduced number of surrogates
res.new <- var.select.smd(
x = data.frame(),
create.forest = FALSE,
forest = forest.new,
num.threads = 1
)
......@@ -45,8 +43,6 @@ res.new$var
# investigate variable relations
rel <- var.relations(
x = data.frame(),
create.forest = FALSE,
forest = forest.new,
variables = c("X1", "X7"),
candidates = res$forest[["allvariables"]][1:100],
......