This R package provides several functions for applying approaches based on random forests. Minimal depth (MD), surrogate minimal depth (SMD), and mutual impurity reduction (MIR), a corrected variant of SMD, can be applied to assess the importance of variables and to select important variables. In addition, the mean adjusted agreement and the mutual forest impact (MFI), a corrected variant of the former, can be applied to investigate variable relations based on surrogate variables.
# SurrogateMinimalDepth
In this R package functions are provided to select important variables with surrogate minimal depth (SMD) and minimal depth (MD) and to investigate variable relations with the mean adjusted agreement of surrogate variables.
Please cite the following manuscripts if you use the package:
SMD: S. Seifert, S. Gundlach, S. Szymczak, Surrogate minimal depth as an importance measure for variables in random forests, Bioinformatics 2019, 35, 3663-3671.
Please cite the following manuscript if you use the package:
Stephan Seifert, Sven Gundlach and Silke Szymczak (2018): Surrogate minimal depth as an importance measure for variables in random forests. In revision at Bioinformatics.
@@ -18,7 +17,7 @@ The package contains an example data set which consists of a single replicate of
# Usage
First the package and the example data are loaded:
```
library(RFSurrogates)
library(SurrogateMinimalDepth)
data("SMD_example_data")
dim(SMD_example_data)
[1] 100 201
...
...
@@ -37,7 +36,7 @@ The data set has 100 observations in the rows and the columns contain the contin
## Minimal depth
First, we perform variable selection based on minimal depth using 1000 trees in the random forest. To make the analysis reproducible we set the seed first.
```
set.seed(42)
res.md = var.select.md(x = SMD_example_data[,2:ncol(SMD_example_data)], y = SMD_example_data[,1], ntree=1000)
...
...
@@ -54,13 +53,12 @@ head(md)
X1 X2 X3 X4 X5 X6
9.823 7.848 6.164 6.662 6.442 6.390
res.md$info$threshold
[1] 9.23097
```
We can see that variables X2, …, X6 have MD values smaller than the threshold in contrast to X1.
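The comparison against the threshold can be reproduced by hand in base R. The following is a minimal sketch that retypes the six MD values and the threshold from the output above into a plain named vector. The names `md_values` and `selected` are illustrative and are not objects returned by the package, and the selection rule (MD smaller than the threshold) is taken from the sentence above.

```r
# MD values and threshold copied from the printed output above
md_values <- c(X1 = 9.823, X2 = 7.848, X3 = 6.164,
               X4 = 6.662, X5 = 6.442, X6 = 6.390)
threshold <- 9.23097

# Assumed rule: a variable is selected if its MD value is below the threshold
selected <- names(md_values)[md_values < threshold]
print(selected)
```

Only X1 exceeds the threshold among these six variables, so it is the one left out of `selected`.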
## Surrogate Minimal depth (SMD)
## Surrogate Minimal depth
Now we would like to analyze the example data with surrogate minimal depth, which works similarly. However, we need to specify an additional parameter s, i.e. the number of surrogate variables that should be considered. In this analysis we use s = 10. Based on our simulation studies, we recommend setting this parameter to approximately 1% of the predictor variables in larger datasets.
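As a rough illustration of the 1% rule of thumb, the helper below computes a candidate value for s from the number of predictors. `suggest_s` is a hypothetical name and not a function of the package. Rounding to the nearest integer and enforcing a minimum of 1 are assumptions, since the recommendation only says "approximately 1%".

```r
# Hypothetical helper: about 1% of the predictors, but at least 1
suggest_s <- function(n_predictors, fraction = 0.01) {
  max(1L, as.integer(round(n_predictors * fraction)))
}

s_large <- suggest_s(5000)  # a larger dataset with 5000 predictors
s_small <- suggest_s(200)   # the example data has 200 predictors
print(c(s_large, s_small))
```

For the 200 predictors of the example data the rule would give 2, whereas this walkthrough uses s = 10; the recommendation is stated for larger datasets.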
Variable selection with var.select.smd is conducted:
...
...
@@ -90,10 +88,7 @@ res.smd$info$threshold
We can see that variables X1, …, X6 have SMD values smaller than the threshold.
## Variable relations based on the mean adjusted agreement of surrogate variables
## Variable relations (based on the mean adjusted agreement of surrogate variables)
Now we want to investigate the relations of variables. We would like to identify which of the first 100 predictor variables are related to X1 and X7. We simulated 10 correlated predictor variables for each of these two basic variables.
One way to investigate variable relations is to use the results from var.select.smd. Hence, SMD is first conducted as in the previous section:
All of the variables that are correlated to X1 are correctly identified as related to X1 and all of the variables that are correlated to X7 are correctly identified as related to X7.
## Variable relations based on mutual forest impact (MFI)
## Mutual impurity reduction (MIR)
Now we would like to analyze the example data with MIR, which determines the variable importance from the actual impurity reduction combined with the relations determined by MFI. Unlike MD and SMD, this approach calculates p-values for the selection of important variables. For this, the null distribution is obtained either from negative importance scores, which is called the Janitza approach, or by permutation. Since this example dataset is comparatively small, we use the permutation approach (see the second publication for more details about this parameter and MIR in general).
The selected variables are stored in res.mir$var. Here, the relevant variables cp1_1 to cp1_10, cp2_1, cp2_3, cp2_4, cp2_6, cp2_7, cp2_10, cp3_1, cp3_4, cp3_5, as well as the non-relevant variables cgn_72 and cgn_81 are selected.
The MIR values and p-values can be extracted as follows:
We can see that variables X1, …, X6 have a p-value of 0 and are selected.
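A p-value of exactly 0 is plausible when p-values are computed empirically from a finite null distribution: it occurs whenever no null importance value reaches the observed one. The sketch below illustrates this general idea in base R. It is not the package's implementation, and `empirical_p`, `null_importance`, and the observed values are made up for illustration.

```r
# Generic empirical p-value: share of null values at least as large as the
# observed importance (an assumption about the general idea, not package code)
empirical_p <- function(observed, null_values) {
  mean(null_values >= observed)
}

# Made-up null distribution of 201 importance values between -1 and 1
null_importance <- (-100:100) / 100

p_strong <- empirical_p(5, null_importance)    # larger than every null value
p_weak   <- empirical_p(0.5, null_importance)  # inside the null range
print(c(p_strong, p_weak))
```

Here `p_strong` is 0 because the observed value exceeds all null values, while `p_weak` lies strictly between 0 and 1.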
Since this approach is based on the actual impurity reduction combined with the relations determined by MFI, both of these can also be extracted from the results: