Unverified commit 4529d2c1, authored by Gärber, Florian and committed by GitHub

chore: Take non-breaking changes from v0.4.x branch (#11)

parents c012fab8 91473f65
Type: Package
Package: RFSurrogates
Title: Surrogate Minimal Depth Variable Importance
Version: 0.3.3.9000
Version: 0.3.3.9005
Authors@R: c(
person("Stephan", "Seifert", , "stephan.seifert@uni-hamburg.de", role = c("aut", "cre"),
comment = c(ORCID = "0000-0003-2567-5728")),
......@@ -36,4 +36,5 @@ LinkingTo:
Config/testthat/edition: 3
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.2.3
......@@ -4,6 +4,20 @@
* `create.forest` now defaults to `is.null(forest)`, so it will automatically be `TRUE` if no forest is provided, and `FALSE` otherwise.
* `x` is no longer required if `create.forest` is `FALSE`.
* (Internal) Inverted some nested guard clauses for readability.
* `addLayer()`: Refactor for-loop to lapply.
  * Add `num.threads` param to enable parallelization using `parallel::mclapply()`. It defaults to 1 for backward compatibility.
* `getTreeranger()`: Refactor `lapply()` to `parallel::mclapply()`.
  * Add `num.threads` param (passed to `mc.cores` in `parallel::mclapply()`). It defaults to 1 for backward compatibility.
  * Add `add_layer` param to include the effect of `addLayer` within the same loop. Defaults to `FALSE` for backward compatibility.
* (Internal) `getsingletree()`: Add `add_layer` param to enable adding layers within the same loop.
* `addSurrogates()`:
  * Clarified default value for `num.threads` to be `parallel::detectCores()` by adding it as a default to the parameter.
  * Added assertion that `RF` is a `ranger` object.
  * Added assertion that `RF$num.trees` and `length(trees)` are equal. This is not considered a breaking change since these values should always be equal when the function is used correctly.
* Added S3 classes to the `trees` list objects.
  * `getTreeranger()` now returns a `RangerTrees` list.
  * `addLayer()` and `getTreeranger(add_layer = TRUE)` add the `LayerTrees` class to the list (indicating presence of the `layer` list item). It now requires that its `trees` param inherits `RangerTrees`.
  * `addSurrogates()` now adds the `SurrogateTrees` class. It now requires that its `trees` param inherits `RangerTrees`.
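
The class tagging described in the entries above can be sketched in base R without a fitted forest. This is only an illustration: `toy_trees` stands in for the output of `getTreeranger()`, and `require_ranger_trees()` is a hypothetical helper that mirrors the guard clauses in `addLayer()` and `addSurrogates()`; neither name is part of the package.

```r
# Toy stand-in for the output of getTreeranger(): a list of tree data frames.
toy_trees <- list(
  data.frame(nodeID = 1:3, leftdaughter = c(2, 0, 0), rightdaughter = c(3, 0, 0))
)
class(toy_trees) <- "RangerTrees"

# Appending a class keeps the earlier ones, so later checks still pass.
layer_trees <- toy_trees
class(layer_trees) <- c(class(layer_trees), "LayerTrees")

# Hypothetical guard in the style of addLayer() / addSurrogates().
require_ranger_trees <- function(x) {
  if (!inherits(x, "RangerTrees")) {
    stop("`trees` must be a `getTreeranger` `RangerTrees` object.")
  }
  invisible(TRUE)
}

require_ranger_trees(layer_trees) # passes: the object carries both classes

# A plain list carries neither class, so the guard raises an error.
rejected <- inherits(try(require_ranger_trees(list()), silent = TRUE), "try-error")
```

Because the classes are appended rather than replaced, a `LayerTrees` or `SurrogateTrees` list still passes any check for `RangerTrees`.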
# RFSurrogates 0.3.3
......
#' Add layer information to a forest that was created by getTreeranger
#'
#' This function adds the layer information to each node in a list with trees that was obtained by getTreeranger.
#' You should use [`getTreeranger()`] with `add_layer = TRUE` instead.
#'
#' @param trees The output of [`getTreeranger()`].
#' @param num.threads (Default: 1) Number of threads to spawn for parallelization.
#'
#' @returns A list of tree data frames of length `RF$num.trees`.
#' Each row of the tree data frames corresponds to a node of the respective tree and the columns correspond to:
#' * `nodeID`: ID of the respective node (important for left and right daughters in the next columns)
#' * `leftdaughter`: ID of the left daughter of this node
#' * `rightdaughter`: ID of the right daughter of this node
#' * `splitvariable`: ID of the split variable
#' * `splitpoint`: Split point of the split variable.
#' For categorical variables this is a comma-separated list of values, representing the factor levels (in the original order) going to the right.
#' * `status`: `0` for terminal (`splitpoint` is `NA`) and `1` for non-terminal.
#' * `layer`: Tree layer depth information, starting at 0 (root node) and incremented for each layer.
#'
#' @param trees list of trees created by getTreeranger
#' @return a list with trees. Each row of the list elements corresponds to a node of the respective tree and the columns correspond to:
#' \itemize{
#' \item nodeID: ID of the respective node (important for left and right daughters in the next columns)
#' \item leftdaughter: ID of the left daughter of this node
#' \item rightdaughter: ID of the right daughter of this node
#' \item splitvariable: ID of the split variable
#' \item splitpoint: splitpoint of the split variable
#' \item status: "0" for terminal and "1" for non-terminal
#' \item layer: layer information (0 means root node, 1 means 1 layer below root, etc)
#' }
#' @export
addLayer <- function(trees) {
# This function adds the respective layer to the different nodes in a tree. The tree has to be prepared by getTree function
tree.layer <- list()
num.trees <- length(trees)
for (i in 1:num.trees) {
tree <- trees[[i]]
layer <- rep(NA, nrow(tree))
layer[1] <- 0
t <- 1
while (anyNA(layer)) {
r <- unlist(tree[which(layer == (t - 1)), 2:3])
layer[r] <- t
t <- t + 1
}
tree <- cbind(tree, layer)
tree <- tree[order(as.numeric(tree[, "layer"])), ]
tree.layer[[i]] <- tree
addLayer <- function(trees, num.threads = 1) {
if (!inherits(trees, "RangerTrees")) {
stop("`trees` must be a `getTreeranger` `RangerTrees` object.")
}
layer.trees <- parallel::mclapply(trees, add_layer_to_tree, mc.cores = num.threads)
class(layer.trees) <- c(class(trees), "LayerTrees")
return(layer.trees)
}
#' Internal function
#'
#' This function adds the respective layer to the different nodes in a tree.
#' The tree has to be prepared by getTree function.
#'
#' @param tree A tree data frame from [getTreeranger()].
#'
#' @returns A tree data frame with `layer` added.
#'
#' @seealso [addLayer()]
#'
#' @keywords internal
add_layer_to_tree <- function(tree) {
layer <- rep(NA, nrow(tree))
layer[1] <- 0
t <- 1
while (anyNA(layer)) {
r <- unlist(tree[which(layer == (t - 1)), 2:3])
layer[r] <- t
t <- t + 1
}
return(tree.layer)
tree <- cbind(tree, layer)
tree <- tree[order(as.numeric(tree[, "layer"])), ]
return(tree)
}
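
To see what the layer loop computes, it can be run on a hand-built tree. The sketch below is an illustration rather than package code: `node_layers()` and `toy_tree` are hypothetical names, and it assumes, as in the output of `getsingletree()`, that row `i` describes node `i` and that terminal nodes have daughter IDs of `0`.

```r
# Hypothetical helper: returns the layer of every node in one tree data frame.
node_layers <- function(tree) {
  layer <- rep(NA, nrow(tree))
  layer[1] <- 0 # the root node is layer 0
  t <- 1
  while (anyNA(layer)) {
    # Columns 2:3 hold the daughter IDs of all nodes in the previous layer.
    r <- unlist(tree[which(layer == (t - 1)), 2:3])
    layer[r] <- t # zero indices (daughters of terminal nodes) are ignored
    t <- t + 1
  }
  layer
}

# Node 1 splits into 2 and 3; node 3 splits into 4 and 5; nodes 2, 4, 5 are terminal.
toy_tree <- data.frame(
  nodeID = 1:5,
  leftdaughter = c(2, 0, 4, 0, 0),
  rightdaughter = c(3, 0, 5, 0, 0),
  splitvariable = c(1, 0, 2, 0, 0),
  splitpoint = c(0.5, NA, 1.5, NA, NA),
  status = c(1, 0, 1, 0, 0)
)

toy_layers <- node_layers(toy_tree)
print(toy_layers) # 0 1 1 2 2
```

The loop relies on R dropping `0` indices during assignment, which is why the daughters of terminal nodes need no special handling.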
#' Add surrogate information that was created by getTreeranger
#' Add surrogate information to a tree list.
#'
#' This function adds surrogate variables and adjusted agreement values to a forest that was created by getTreeranger.
#' This function adds surrogate variables and adjusted agreement values to a forest that was created by [getTreeranger].
#'
#' @param RF random forest object created by ranger (with keep.inbag=TRUE).
#' @param trees list of trees created by getTreeranger.
#' @param s Predefined number of surrogate splits (it may happen that the actual number of surrogate splits differes in individual nodes). Default is 1 \% of no. of variables.
#' @param RF A [ranger::ranger] object which was created with `keep.inbag = TRUE`.
#' @param trees List of trees created by [getTreeranger].
#' @param s Predefined number of surrogate splits (it may happen that the actual number of surrogate splits differs in individual nodes).
#' @param Xdata data without the dependent variable.
#' @param num.threads number of threads used for parallel execution. Default is number of CPUs available.
#' @return a list with trees containing of lists of nodes with the elements:
#' \itemize{
#' \item nodeID: ID of the respective node (important for left and right daughters in the next columns)
#' \item leftdaughter: ID of the left daughter of this node
#' \item rightdaughter: ID of the right daughter of this node
#' \item splitvariable: ID of the split variable
#' \item splitpoint: splitpoint of the split variable
#' \item status: "0" for terminal and "1" for non-terminal
#' \item layer: layer information (0 means root node, 1 means 1 layer below root, etc)
#' \item surrogate_i: numbered surrogate variables (number depending on s)
#' \item adj_i: adjusted agreement of variable i
#' }
#' @param num.threads (Default: [parallel::detectCores()]) Number of threads to spawn for parallelization.
#'
#' @returns A list of trees, each consisting of a list of nodes with the elements:
#' * `nodeID`: ID of the respective node (important for left and right daughters in the next columns)
#' * `leftdaughter`: ID of the left daughter of this node
#' * `rightdaughter`: ID of the right daughter of this node
#' * `splitvariable`: ID of the split variable
#' * `splitpoint`: splitpoint of the split variable
#' * `status`: `0` for terminal and `1` for non-terminal
#' * `layer`: layer information (`0` means root node, `1` means 1 layer below root, etc)
#' * `surrogate_i`: numbered surrogate variables (number depending on s)
#' * `adj_i`: adjusted agreement of variable i
#'
#' @export
addSurrogates <- function(RF, trees, s, Xdata, num.threads) {
num.trees <- length(trees)
ncat <- sapply(sapply(Xdata, levels), length) # determine number of categories (o for continuous variables)
names(ncat) <- colnames(Xdata)
addSurrogates <- function(RF, trees, s, Xdata, num.threads = parallel::detectCores()) {
if (!inherits(RF, "ranger")) {
stop("`RF` must be a ranger object.")
}
if (is.null(num.threads)) {
num.threads <- parallel::detectCores()
if (!inherits(trees, "RangerTrees")) {
stop("`trees` must be a `getTreeranger` `RangerTrees` object.")
}
num.trees <- RF$num.trees
if (num.trees != length(trees)) {
stop("Number of trees in ranger model `RF` does not match number of extracted trees in `trees`.")
}
ncat <- sapply(sapply(Xdata, levels), length) # determine number of categories (0 for continuous variables)
names(ncat) <- colnames(Xdata)
if (any(ncat > 0)) {
Xdata[, which(ncat > 0)] <- sapply(Xdata[, which(ncat > 0)], unclass)
}
......@@ -48,6 +58,9 @@ addSurrogates <- function(RF, trees, s, Xdata, num.threads) {
ncat = ncat
)
)
class(trees.surr) <- c(class(trees), "SurrogateTrees")
return(trees.surr)
}
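
The category counting at the start of `addSurrogates()` can be tried on a small data frame. This is a sketch under assumptions: `toy_x` is made-up data, and the conversion uses `lapply(..., as.integer)` as a stand-in for the `sapply(..., unclass)` call in the function, so it shows the idea rather than the exact implementation.

```r
# Made-up predictors: one continuous column and two factors.
toy_x <- data.frame(
  x1 = c(0.1, 0.7, 0.3),
  x2 = factor(c("a", "b", "a")),
  x3 = factor(c("lo", "hi", "mid"), levels = c("lo", "mid", "hi"))
)

# Number of factor levels per column; levels() is NULL for numeric columns,
# so continuous variables end up with a count of 0.
ncat <- sapply(sapply(toy_x, levels), length)
print(ncat)

# Replace factor columns by their integer level codes (variant of the
# unclass() step, chosen here because it drops the levels attribute).
is_cat <- ncat > 0
if (any(is_cat)) {
  toy_x[is_cat] <- lapply(toy_x[is_cat], as.integer)
}
print(toy_x)
```

The codes follow the order of the factor levels, which is why `x3` maps `"hi"` to `3` and `"mid"` to `2`.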
......
#' Get a list of structured trees for ranger
#' Get a list of structured trees from a ranger object.
#'
#' This function creates a list of trees for ranger objects, similar to what the getTree function does for `randomForest` objects.
#'
#' @param RF random forest object created by ranger (with keep.inbag=TRUE)
#' @param num.trees number of trees
#' @return a list with trees. Each row of the list elements corresponds to a node of the respective tree and the columns correspond to:
#' \itemize{
#' \item nodeID: ID of the respective node (important for left and right daughters in the next columns)
#' \item leftdaughter: ID of the left daughter of this node
#' \item rightdaughter: ID of the right daughter of this node
#' \item splitvariable: ID of the split variable
#' \item splitpoint: splitpoint of the split variable (for categorical variables this is a comma separated lists of values, representing the factor levels (in the original order) going to the right)
#' \item status: "0" for terminal and "1" for non-terminal
#' }
#' @param RF A [`ranger::ranger`] object which was created with `keep.inbag = TRUE`.
#' @param num.trees (Deprecated) Number of trees to convert (Default: `RF$num.trees`).
#' @param add_layer (Default: `FALSE`) Whether to [addLayer()] in the same loop.
#' @param num.threads (Default: 1) Number of threads to spawn for parallelization.
#'
#' @returns A list of tree data frames of length `RF$num.trees`.
#' Each row of the tree data frames corresponds to a node of the respective tree and the columns correspond to:
#' * `nodeID`: ID of the respective node (important for left and right daughters in the next columns)
#' * `leftdaughter`: ID of the left daughter of this node
#' * `rightdaughter`: ID of the right daughter of this node
#' * `splitvariable`: ID of the split variable
#' * `splitpoint`: Split point of the split variable.
#' For categorical variables this is a comma-separated list of values, representing the factor levels (in the original order) going to the right.
#' * `status`: `0` for terminal (`splitpoint` is `NA`) and `1` for non-terminal.
#' * `layer`: If `add_layer` is `TRUE`, see [addLayer()]
#'
#' @export
getTreeranger <- function(RF, num.trees) {
trees <- lapply(1:num.trees, getsingletree, RF = RF)
getTreeranger <- function(RF, num.trees = RF$num.trees, add_layer = FALSE, num.threads = 1) {
trees <- parallel::mclapply(1:num.trees, getsingletree,
mc.cores = num.threads,
RF = RF,
add_layer = add_layer
)
class(trees) <- "RangerTrees"
if (add_layer) {
class(trees) <- c(class(trees), "LayerTrees")
}
return(trees)
}
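
The way `parallel::mclapply()` hands its arguments to `getsingletree()` can be illustrated without ranger. In this sketch `describe_tree()` and `toy_forest` are hypothetical stand-ins; only the argument order `(RF, k, add_layer)` is taken from the function above. `mc.cores = 1` is used so that the example also runs on Windows, where forking is not available.

```r
library(parallel)

# Hypothetical stand-in with the same argument order as getsingletree().
describe_tree <- function(RF, k = 1, add_layer = FALSE) {
  paste0(RF$name, ": tree ", k, if (add_layer) " (with layer)" else "")
}

toy_forest <- list(name = "toy", num.trees = 3) # not a real ranger object

# Each element of the first argument is passed unnamed. Because RF and
# add_layer are supplied by name, the element is matched to `k`.
res <- mclapply(seq_len(toy_forest$num.trees), describe_tree,
  mc.cores = 1,
  RF = toy_forest,
  add_layer = TRUE
)

print(unlist(res))
```

Naming `RF` in the call is what makes the tree index land in `k`, even though `k` is the second formal argument.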
......@@ -24,26 +38,41 @@ getTreeranger <- function(RF, num.trees) {
#'
#' This is an internal function
#'
#' @param RF A [`ranger::ranger`] object.
#' @param k Tree index to convert.
#' @param add_layer (Default: `FALSE`) Whether to add layer information via [add_layer_to_tree()].
#'
#' @returns A tree data frame for the `k`th tree in `RF`.
#' Each row of the tree data frames corresponds to a node of the respective tree and the columns correspond to:
#' * `nodeID`: ID of the respective node (important for left and right daughters in the next columns)
#' * `leftdaughter`: ID of the left daughter of this node
#' * `rightdaughter`: ID of the right daughter of this node
#' * `splitvariable`: ID of the split variable
#' * `splitpoint`: Split point of the split variable.
#' For categorical variables this is a comma-separated list of values, representing the factor levels (in the original order) going to the right.
#' * `status`: `0` for terminal (`splitpoint` is `NA`) and `1` for non-terminal.
#'
#' @keywords internal
getsingletree <- function(RF, k = 1) {
getsingletree <- function(RF, k = 1, add_layer = FALSE) {
# Here we use the treeInfo function of the ranger package to extract the trees; in an earlier version this was done with a self-implemented function.
tree.ranger <- ranger::treeInfo(RF, tree = k)
ktree <- data.frame(
as.numeric(tree.ranger$nodeID + 1),
as.numeric(tree.ranger$leftChild + 1),
as.numeric(tree.ranger$rightChild + 1),
as.numeric(tree.ranger$splitvarID + 1),
tree.ranger$splitval,
tree.ranger$terminal
nodeID = as.numeric(tree.ranger$nodeID + 1),
leftdaughter = as.numeric(tree.ranger$leftChild + 1),
rightdaughter = as.numeric(tree.ranger$rightChild + 1),
splitvariable = as.numeric(tree.ranger$splitvarID + 1),
splitpoint = tree.ranger$splitval,
status = as.numeric(!tree.ranger$terminal)
)
if (is.factor(ktree[, 5])) {
ktree[, 5] <- as.character(levels(ktree[, 5]))[ktree[, 5]]
if (is.factor(ktree[, "splitpoint"])) {
ktree[, "splitpoint"] <- as.character(levels(ktree[, "splitpoint"]))[ktree[, "splitpoint"]]
}
ktree[, 6] <- as.numeric(ktree[, 6] == FALSE)
for (i in 2:4) {
ktree[, i][is.na(ktree[, i])] <- 0
ktree[, 2:4][is.na(ktree[, 2:4])] <- 0
if (add_layer) {
ktree <- add_layer_to_tree(ktree)
}
colnames(ktree) <- c("nodeID", "leftdaughter", "rightdaughter", "splitvariable", "splitpoint", "status")
return(ktree)
}
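
The two clean-up steps in `getsingletree()`, converting a factor `splitpoint` to character and replacing `NA` IDs with `0`, can be checked on a toy data frame. The values below are invented for illustration and only imitate the shape of the data frame built from `ranger::treeInfo()`. They are not real ranger output.

```r
# Invented three-node tree: node 1 splits, nodes 2 and 3 are terminal.
ktree <- data.frame(
  nodeID = c(1, 2, 3),
  leftdaughter = c(2, NA, NA),
  rightdaughter = c(3, NA, NA),
  splitvariable = c(1, NA, NA),
  splitpoint = factor(c("1,3", NA, NA)),
  status = as.numeric(!c(FALSE, TRUE, TRUE)) # terminal flag inverted to 1/0
)

# Indexing the level labels by the factor yields its character values.
if (is.factor(ktree[, "splitpoint"])) {
  ktree[, "splitpoint"] <- as.character(levels(ktree[, "splitpoint"]))[ktree[, "splitpoint"]]
}

# Vectorized replacement over columns 2:4 instead of a for-loop.
ktree[, 2:4][is.na(ktree[, 2:4])] <- 0

print(ktree)
```

After this step terminal nodes have daughter and split variable IDs of `0`, while their `splitpoint` stays `NA`, matching the `status` description in the documentation above.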
......@@ -4,23 +4,28 @@
\alias{addLayer}
\title{Add layer information to a forest that was created by getTreeranger}
\usage{
addLayer(trees)
addLayer(trees, num.threads = 1)
}
\arguments{
\item{trees}{list of trees created by getTreeranger}
\item{trees}{The output of \code{\link[=getTreeranger]{getTreeranger()}}.}
\item{num.threads}{(Default: 1) Number of threads to spawn for parallelization.}
}
\value{
a list with trees. Each row of the list elements corresponds to a node of the respective tree and the columns correspond to:
A list of tree data frames of length \code{RF$num.trees}.
Each row of the tree data frames corresponds to a node of the respective tree and the columns correspond to:
\itemize{
\item nodeID: ID of the respective node (important for left and right daughters in the next columns)
\item leftdaughter: ID of the left daughter of this node
\item rightdaughter: ID of the right daughter of this node
\item splitvariable: ID of the split variable
\item splitpoint: splitpoint of the split variable
\item status: "0" for terminal and "1" for non-terminal
\item layer: layer information (0 means root node, 1 means 1 layer below root, etc)
\item \code{nodeID}: ID of the respective node (important for left and right daughters in the next columns)
\item \code{leftdaughter}: ID of the left daughter of this node
\item \code{rightdaughter}: ID of the right daughter of this node
\item \code{splitvariable}: ID of the split variable
\item \code{splitpoint}: Split point of the split variable.
For categorical variables this is a comma-separated list of values, representing the factor levels (in the original order) going to the right.
\item \code{status}: \code{0} for terminal (\code{splitpoint} is \code{NA}) and \code{1} for non-terminal.
\item \code{layer}: Tree layer depth information, starting at 0 (root node) and incremented for each layer.
}
}
\description{
This function adds the layer information to each node in a list with trees that was obtained by getTreeranger.
You should use \code{\link[=getTreeranger]{getTreeranger()}} with \code{add_layer = TRUE} instead.
}
......@@ -2,35 +2,36 @@
% Please edit documentation in R/addSurrogates.R
\name{addSurrogates}
\alias{addSurrogates}
\title{Add surrogate information that was created by getTreeranger}
\title{Add surrogate information to a tree list.}
\usage{
addSurrogates(RF, trees, s, Xdata, num.threads)
addSurrogates(RF, trees, s, Xdata, num.threads = parallel::detectCores())
}
\arguments{
\item{RF}{random forest object created by ranger (with keep.inbag=TRUE).}
\item{RF}{A \link[ranger:ranger]{ranger::ranger} object which was created with \code{keep.inbag = TRUE}.}
\item{trees}{list of trees created by getTreeranger.}
\item{trees}{List of trees created by \link{getTreeranger}.}
\item{s}{Predefined number of surrogate splits (it may happen that the actual number of surrogate splits differes in individual nodes). Default is 1 \% of no. of variables.}
\item{s}{Predefined number of surrogate splits (it may happen that the actual number of surrogate splits differs in individual nodes).}
\item{Xdata}{data without the dependent variable.}
\item{num.threads}{number of threads used for parallel execution. Default is number of CPUs available.}
\item{num.threads}{(Default: \code{\link[parallel:detectCores]{parallel::detectCores()}}) Number of threads to spawn for parallelization.}
}
\value{
a list with trees containing of lists of nodes with the elements:
A list of trees, each consisting of a list of nodes with the elements:
\itemize{
\item nodeID: ID of the respective node (important for left and right daughters in the next columns)
\item leftdaughter: ID of the left daughter of this node
\item rightdaughter: ID of the right daughter of this node
\item splitvariable: ID of the split variable
\item splitpoint: splitpoint of the split variable
\item status: "0" for terminal and "1" for non-terminal
\item layer: layer information (0 means root node, 1 means 1 layer below root, etc)
\item surrogate_i: numbered surrogate variables (number depending on s)
\item adj_i: adjusted agreement of variable i
\item \code{nodeID}: ID of the respective node (important for left and right daughters in the next columns)
\item \code{leftdaughter}: ID of the left daughter of this node
\item \code{rightdaughter}: ID of the right daughter of this node
\item \code{splitvariable}: ID of the split variable
\item \code{splitpoint}: splitpoint of the split variable
\item \code{status}: \code{0} for terminal and \code{1} for non-terminal
\item \code{layer}: layer information (\code{0} means root node, \code{1} means 1 layer below root, etc)
\item \code{surrogate_i}: numbered surrogate variables (number depending on s)
\item \code{adj_i}: adjusted agreement of variable i
}
}
\description{
This function adds surrogate variables and adjusted agreement values to a forest that was created by getTreeranger.
This function adds surrogate variables and adjusted agreement values to a forest that was created by \link{getTreeranger}.
}
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/addLayer.R
\name{add_layer_to_tree}
\alias{add_layer_to_tree}
\title{Internal function}
\usage{
add_layer_to_tree(tree)
}
\arguments{
\item{tree}{A tree data frame from \code{\link[=getTreeranger]{getTreeranger()}}.}
}
\value{
A tree data frame with \code{layer} added.
}
\description{
This function adds the respective layer to the different nodes in a tree.
The tree has to be prepared by getTree function.
}
\seealso{
\code{\link[=addLayer]{addLayer()}}
}
\keyword{internal}
......@@ -2,24 +2,31 @@
% Please edit documentation in R/getTreeranger.R
\name{getTreeranger}
\alias{getTreeranger}
\title{Get a list of structured trees for ranger}
\title{Get a list of structured trees from a ranger object.}
\usage{
getTreeranger(RF, num.trees)
getTreeranger(RF, num.trees = RF$num.trees, add_layer = FALSE, num.threads = 1)
}
\arguments{
\item{RF}{random forest object created by ranger (with keep.inbag=TRUE)}
\item{RF}{A \code{\link[ranger:ranger]{ranger::ranger}} object which was created with \code{keep.inbag = TRUE}.}
\item{num.trees}{number of trees}
\item{num.trees}{(Deprecated) Number of trees to convert (Default: \code{RF$num.trees}).}
\item{add_layer}{(Default: \code{FALSE}) Whether to \code{\link[=addLayer]{addLayer()}} in the same loop.}
\item{num.threads}{(Default: 1) Number of threads to spawn for parallelization.}
}
\value{
a list with trees. Each row of the list elements corresponds to a node of the respective tree and the columns correspond to:
A list of tree data frames of length \code{RF$num.trees}.
Each row of the tree data frames corresponds to a node of the respective tree and the columns correspond to:
\itemize{
\item nodeID: ID of the respective node (important for left and right daughters in the next columns)
\item leftdaughter: ID of the left daughter of this node
\item rightdaughter: ID of the right daughter of this node
\item splitvariable: ID of the split variable
\item splitpoint: splitpoint of the split variable (for categorical variables this is a comma separated lists of values, representing the factor levels (in the original order) going to the right)
\item status: "0" for terminal and "1" for non-terminal
\item \code{nodeID}: ID of the respective node (important for left and right daughters in the next columns)
\item \code{leftdaughter}: ID of the left daughter of this node
\item \code{rightdaughter}: ID of the right daughter of this node
\item \code{splitvariable}: ID of the split variable
\item \code{splitpoint}: Split point of the split variable.
For categorical variables this is a comma-separated list of values, representing the factor levels (in the original order) going to the right.
\item \code{status}: \code{0} for terminal (\code{splitpoint} is \code{NA}) and \code{1} for non-terminal.
\item \code{layer}: If \code{add_layer} is \code{TRUE}, see \code{\link[=addLayer]{addLayer()}}
}
}
\description{
......
......@@ -4,7 +4,27 @@
\alias{getsingletree}
\title{getsingletree}
\usage{
getsingletree(RF, k = 1)
getsingletree(RF, k = 1, add_layer = FALSE)
}
\arguments{
\item{RF}{A \code{\link[ranger:ranger]{ranger::ranger}} object.}
\item{k}{Tree index to convert.}
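\item{add_layer}{(Default: \code{FALSE}) Whether to add layer information via \code{\link[=add_layer_to_tree]{add_layer_to_tree()}}.}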
}
\value{
A tree data frame for the \code{k}th tree in \code{RF}.
Each row of the tree data frames corresponds to a node of the respective tree and the columns correspond to:
\itemize{
\item \code{nodeID}: ID of the respective node (important for left and right daughters in the next columns)
\item \code{leftdaughter}: ID of the left daughter of this node
\item \code{rightdaughter}: ID of the right daughter of this node
\item \code{splitvariable}: ID of the split variable
\item \code{splitpoint}: Split point of the split variable.
For categorical variables this is a comma-separated list of values, representing the factor levels (in the original order) going to the right.
\item \code{status}: \code{0} for terminal (\code{splitpoint} is \code{NA}) and \code{1} for non-terminal.
}
}
\description{
This is an internal function
......