Commit 9896f131 authored by Gärber, Florian; committed by GitHub

chore: Update documentation (#8)

This PR includes several minor changes to configuration:
- `.Rbuildignore` now ignores two additional files that were previously
flagged by `R CMD check`
- `_pkgdown.yml` now includes basic custom sections for the reference
section
- `_pkgdown.yml` and `DESCRIPTION` now contain the GitHub Pages URL

Further, this PR updates the README.md code examples to match the latest
released version. The selected variables changed slightly within the
older sections, but the changes did not affect the overall conclusions.

The updated README also highlights an issue with how some functions are
called (such as requiring an empty `data.frame` even when no forest is
being created), which I intend to address in the future.
For now, the updated README will at least provide guidance as to how the
function can be called successfully.
parents 9ea0fad3 04d087eb
@@ -4,3 +4,5 @@
^_pkgdown\.yml$
^docs$
^pkgdown$
^TROUBLESHOOTING\.md$
^\.github$
@@ -18,6 +18,8 @@ Description: This package provides functions to obtain surrogate splits
corresponding adjusted agreement values are used for surrogate minimal
depth variable importance and to investigate variable relations.
License: MIT + file LICENSE
URL: https://agseifert.github.io/RFSurrogates/,
https://github.com/AGSeifert/RFSurrogates
Imports:
linkcomm,
parallel,
......
@@ -4,189 +4,217 @@ In this R package, several functions are provided for applying approaches based
Please cite the following manuscripts if you use the package:
[1] S. Seifert, S. Gundlach, S. Szymczak, Surrogate minimal depth as an importance measure for variables in random forests, Bioinformatics 2019, 35, 3663-3671.
[1] S. Seifert, S. Gundlach, S. Szymczak, Surrogate minimal depth as an importance measure for variables in random forests, Bioinformatics 2019, 35, 3663-3671. [[doi:10.1093/bioinformatics/btz149](https://doi.org/10.1093/bioinformatics/btz149)]
[2] publication about MFI/MIR under preparation
[2] publication about MFI/MIR under preparation [[arXiv Preprint](https://doi.org/10.48550/ARXIV.2304.02490)]
# Install
```
library(devtools)
install_github("AGSeifert/RFSurrogates")
# Installation
```r
devtools::install_github("AGSeifert/RFSurrogates")
```
# Example data
The package contains an example data set which consists of a single replicate of the simulation study 1 in publication [1]. Please refer to the paper and the documentation of the SMD_example_data for further details on the simulation scenario. The R script for the simulation is published here: doi.org/10.25592/uhhfdm.12620.
# Example Data [![DOI](https://www.fdr.uni-hamburg.de/badge/DOI/10.25592/uhhfdm.12620.svg)](https://doi.org/10.25592/uhhfdm.12620)
The package contains an example data set which consists of a single replicate of the simulation study 1 in publication [1]. Please refer to the paper and the documentation of the SMD_example_data for further details on the simulation scenario. The R script for the simulation is published [here](https://doi.org/10.25592/uhhfdm.12620).
# Usage
First the package and the example data are loaded:
```
library(RFSurrogates
```r
library(RFSurrogates)
data("SMD_example_data")
dim(SMD_example_data)
[1] 100 201
# [1] 100 201
head(SMD_example_data[, 1:5])
y X1 X2 X3 X4
1 1.8222421 -0.02768266 -1.1019154 2.2659401 0.008021516
2 -1.0401813 0.73258486 -0.4107975 0.7587792 -0.718752746
3 2.7139607 -0.05399936 1.1851261 0.9743160 -2.563176970
4 -0.7081372 -0.84838121 -0.8975802 0.5247899 1.180683275
5 -1.0264429 -0.42219003 0.5439467 -0.1626504 0.682333020
6 3.1871209 0.91722598 0.1974106 0.9571554 0.351634641
# y X1 X2 X3 X4
# 1 1.8222421 -0.02768266 -1.1019154 2.2659401 0.008021516
# 2 -1.0401813 0.73258486 -0.4107975 0.7587792 -0.718752746
# 3 2.7139607 -0.05399936 1.1851261 0.9743160 -2.563176970
# 4 -0.7081372 -0.84838121 -0.8975802 0.5247899 1.180683275
# 5 -1.0264429 -0.42219003 0.5439467 -0.1626504 0.682333020
# 6 3.1871209 0.91722598 0.1974106 0.9571554 0.351634641
```
The data set has 100 observations in the rows; the columns contain the continuous outcome variable y and 200 continuous predictor variables.
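The calls in the following sections pass the predictors as `x` and the outcome as `y`. As a minimal sketch of that split, the snippet below uses a small stand-in data frame (`toy`, hypothetical and not part of the package) with the outcome in the first column, as in `SMD_example_data`:

```r
# Stand-in data with the outcome in the first column (hypothetical values)
set.seed(1)
toy <- data.frame(y = rnorm(5), X1 = rnorm(5), X2 = rnorm(5))

# Negative indexing drops the outcome column and keeps all predictors
x <- toy[, -1]
y <- toy[, 1]

# The explicit column range selects the same predictors
x_alt <- toy[, 2:ncol(toy)]
print(identical(x, x_alt))
```

Both forms select the same columns; `[, -1]` is simply shorter when the outcome is known to be the first column.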
## Minimal depth
## Minimal Depth (MD)
First, we perform variable selection based on minimal depth using 1000 trees in the random forest. To make the analysis reproducible we set the seed first.
```
```r
set.seed(42)
res.md = var.select.md(x = SMD_example_data[,2:ncol(SMD_example_data)], y = SMD_example_data[,1], ntree=1000)
res.md <- var.select.md(
x = SMD_example_data[, -1], y = SMD_example_data[, 1],
num.trees = 1000)
res.md$var
[1] "X2" "X3" "X4" "X5" "X6" "cp1_8" "cp2_6" "cp2_7" "cp3_4" "cp3_6" "cp8_10" "cgn_68" "cgn_72" "cgn_81"
# [1] "X2" "X3" "X4" "X5" "X6" "cp1_8"
# [7] "cp2_1" "cp2_6" "cp2_7" "cp3_4" "cp3_6" "cgn_20"
# [13] "cgn_68" "cgn_72" "cgn_81"
```
The selected variables are stored in res.md$var. In this analysis the relevant basic variables X2 to X6, as well as the relevant variables cp2_6, cp2_7, cp3_4, and cp3_6, and the non-relevant variables cp8_10, cgn_68, cgn_72, and cgn_81 are selected.
The selected variables are stored in `res.md$var`. In this analysis the relevant basic variables `"X2"` to `"X6"`, as well as the relevant variables `"cp1_8"`, `"cp2_1"`, `"cp2_6"`, `"cp2_7"`, `"cp3_4"`, and `"cp3_6"`, and the non-relevant variables `"cgn_20"`, `"cgn_68"`, `"cgn_72"`, and `"cgn_81"` are selected.
The MD values for each predictor variable and the threshold to select variables can be extracted as follows:
```
md = res.md$info$depth
```r
md <- res.md$info$depth
head(md)
X1 X2 X3 X4 X5 X6
9.823 7.848 6.164 6.662 6.442 6.390
# X1 X2 X3 X4 X5 X6
# 9.743 7.907 6.155 6.615 6.456 6.428
res.md$info$threshold
[1] 9.23097
# [1] 9.230328
```
We can see that variables X2, …, X6 have MD values smaller than the threshold in contrast to X1.
## Surrogate Minimal depth (SMD)
We can see that variables `"X2"`, …, `"X6"` have MD values smaller than the threshold in contrast to `"X1"`.
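The comparison against the threshold can be reproduced by hand. The sketch below copies the MD values and the threshold from the output above and assumes the selection rule is simply "depth below threshold", which is how the text describes it; it does not call the package:

```r
# MD values and threshold copied from the output shown above
md <- c(X1 = 9.743, X2 = 7.907, X3 = 6.155, X4 = 6.615, X5 = 6.456, X6 = 6.428)
threshold <- 9.230328

# A variable is kept when its minimal depth lies below the threshold
selected <- names(md)[md < threshold]
print(selected)
# [1] "X2" "X3" "X4" "X5" "X6"
```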
## Surrogate Minimal Depth (SMD)
Now we would like to analyze the example data with surrogate minimal depth which works similarly. However, we need to specify an additional parameter s, i.e. the number of surrogate variables that should be considered. In this analysis we use s = 10. Based on our simulation studies we recommend to set this parameter to approximately 1% of the predictor variables in larger datasets.
Now we would like to analyze the example data with surrogate minimal depth, which works similarly. However, we need to specify an additional parameter `s`, the number of surrogate variables that should be considered. In this analysis we use `s = 10`. Based on our simulation studies we recommend setting this parameter to approximately 1% of the predictor variables in larger datasets.
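The 1% rule of thumb can be written as a small helper. `choose_s` is a hypothetical name used only for illustration and is not a function of the package; note that the example here uses `s = 10` for 200 predictors, so the rule is not applied to this small data set:

```r
# Hypothetical helper: roughly 1% of the predictors, but at least one surrogate
choose_s <- function(n_predictors, fraction = 0.01) {
  max(1L, as.integer(round(n_predictors * fraction)))
}

s_large <- choose_s(1000)
print(s_large)
# [1] 10
```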
Variable selection with var.select.smd is conducted:
```
```r
set.seed(42)
res.smd = var.select.smd(x = SMD_example_data[,2:ncol(SMD_example_data)], y = SMD_example_data[,1], s = 10, ntree = 1000)
res.smd <- var.select.smd(
x = SMD_example_data[, -1], y = SMD_example_data[, 1],
s = 10, num.trees = 1000)
res.smd$var
[1] "X1" "X2" "X3" "X4" "X5" "X6" "cp1_1" "cp1_2" "cp1_3" "cp1_4" "cp1_5" "cp1_6" "cp1_7" "cp1_8" "cp1_9"
[16] "cp1_10" "cp2_4" "cp2_6"
# [1] "X1" "X2" "X3" "X4" "X5" "X6"
# [7] "X7" "X8" "cp1_1" "cp1_2" "cp1_3" "cp1_4"
# [13] "cp1_5" "cp1_6" "cp1_7" "cp1_8" "cp1_9" "cp1_10"
# [19] "cp2_1" "cp2_3" "cp2_4" "cp2_6" "cp2_7" "cp2_9"
# [25] "cp2_10" "cp3_4"
```
The selected variables are stored in res.smd$var. In this analysis the relevant basic variables X1 to X6, as well as the relevant variables cp1_1 to cp1_10, cp2_4, and cp2_6 are selected. Compared to MD more of the relevant variables and none of the non-relevant variables are selected.
The selected variables are stored in `res.smd$var`. In this analysis the relevant basic variables `"X1"` to `"X6"`, as well as `"X7"` and `"X8"`, and the relevant variables `"cp1_1"` to `"cp1_10"`, `"cp2_1"`, `"cp2_3"`, `"cp2_4"`, `"cp2_6"`, `"cp2_7"`, `"cp2_9"`, `"cp2_10"`, and `"cp3_4"` are selected. Compared to MD more of the relevant variables and none of the non-relevant `cgn` variables are selected.
The SMD values for each predictor variable and the threshold to select variables can be extracted as follows:
```
smd = res.smd$info$depth
head(smd)
X1 X2 X3 X4 X5 X6
2.344 2.287 2.095 2.576 2.509 2.276
# X1 X2 X3 X4 X5 X6
# 2.112 2.085 1.806 2.208 2.148 2.014
res.smd$info$threshold
[1] 2.690082
# [1] 2.671644
```
We can see that variables X1, …, X6 have SMD values smaller than the threshold.
We can see that variables `"X1"`, …, `"X6"` have SMD values smaller than the threshold.
## Variable relations based on the mean adjusted agreement of surrogate variables
Now we want to investigate the relations of variables. We would like to identify which of the first 100 predictor variables are related to X1 and X7. We simulated 10 correlated predictor variables for each of these two basic variables.
One possibility to investigate variable relations is to use the results from var.select.smd. Hence, first SMD is conducted like in the previous section:
Now we want to investigate the relations of variables. We would like to identify which of the first 100 predictor variables are related to `"X1"` and `"X7"`. We simulated 10 correlated predictor variables for each of these two basic variables.
One possibility to investigate variable relations is to use the results from `var.select.smd()`. Hence, SMD is first conducted as in the [previous section](#surrogate-minimal-depth-smd).
```
res.smd = var.select.smd(x = SMD_example_data[,2:ncol(SMD_example_data)], y = SMD_example_data[,1], s = 10, ntree = 1000)
```
Subsequently, variable relations are analyzed with var.relations. The parameter t can be adapted to either focus on strongly related variables only (high numbers) or to include also moderately related variables (low numbers):
Subsequently, variable relations are analyzed with `var.relations()`. The parameter `t` can be adapted to either focus on strongly related variables only (high numbers) or to include also moderately related variables (low numbers):
```
candidates = colnames(SMD_example_data )[2:101]
rel = var.relations(forest = res.smd$forest, variables = c("X1","X7"), candidates = candidates, t = 5)
rel$var
$X1
[1] "cp1_1" "cp1_2" "cp1_3" "cp1_4" "cp1_5" "cp1_6" "cp1_7" "cp1_8" "cp1_9" "cp1_10"
```r
candidates <- colnames(SMD_example_data)[2:101]
rel <- var.relations(
x = data.frame(), create.forest = FALSE,
forest = res.smd$forest,
variables = c("X1", "X7"), candidates = candidates,
t = 5)
$X7
[1] "cp7_1" "cp7_2" "cp7_3" "cp7_4" "cp7_5" "cp7_6" "cp7_7" "cp7_8" "cp7_9" "cp7_10"
rel$var
# $X1
# [1] "cp1_1" "cp1_2" "cp1_3" "cp1_4" "cp1_5" "cp1_6"
# [7] "cp1_7" "cp1_8" "cp1_9" "cp1_10"
#
# $X7
# [1] "cp7_1" "cp7_2" "cp7_3" "cp7_4" "cp7_5" "cp7_6"
# [7] "cp7_7" "cp7_8" "cp7_9" "cp7_10"
```
All of the variables that are correlated to X1 are correctly identified as related to X1 and all of the variables that are correlated to X7 are correcly identified as related to X7.
All of the variables that are correlated to `"X1"` are correctly identified as related to `"X1"` and all of the variables that are correlated to `"X7"` are correctly identified as related to `"X7"`.
## Variable relations based on mutual forest impact (MFI)
## Variable relations based on Mutual Forest Impact (MFI)
MFI is a corrected relation parameter calculated by the mean adjusted agreement of the variables and permuted versions of them. Related variables are selected by p-values obtained from a null distribution either determined by negative relation scores (based on the Janitza approach) or by permuted relations.
We use the default parameters for the selection here, which are a p-value threshold of 0.01 and the Janitza approach.
```
```r
set.seed(42)
rel.mfi = var.relations.mfi(x = x, y = y, s = 10, ntree = 1000, variables = c("X1","X7"), candidates = colnames(x)[1:100], p.t = 0.01, method = "janitza" )
rel.mfi$var.rel
$X1
[1] "cp1_1" "cp1_2" "cp1_3" "cp1_4" "cp1_5" "cp1_6" "cp1_7" "cp1_8" "cp1_9" "cp1_10"
rel.mfi <- var.relations.mfi(
x = SMD_example_data[, -1], y = SMD_example_data[, 1],
s = 10, num.trees = 1000, variables = c("X1","X7"),
candidates = colnames(SMD_example_data)[2:101],
p.t = 0.01, method = "janitza", num.threads = 1)
$X7
[1] "cp7_1" "cp7_2" "cp7_3" "cp7_4" "cp7_5" "cp7_6" "cp7_7" "cp7_8" "cp7_9" "cp7_10"
rel.mfi$var.rel
# $X1
# [1] "cp1_1" "cp1_2" "cp1_3" "cp1_4" "cp1_5" "cp1_6"
# [7] "cp1_7" "cp1_8" "cp1_9" "cp1_10"
#
# $X7
# [1] "cp7_1" "cp7_2" "cp7_3" "cp7_4" "cp7_5" "cp7_6"
# [7] "cp7_7" "cp7_8" "cp7_9" "cp7_10"
```
Also by MFI, all of the variables that are correlated to X1 are correctly identified as related to X1 and all of the variables that are correlated to X7 are correcly identified as related to X7.
Also the matrix of determined relation (surr.res), permuted relations (surr.perm) and determined p-values (p.rel) can be extracted as followes:
Also by MFI, all of the variables that are correlated to `"X1"` are correctly identified as related to `"X1"` and all of the variables that are correlated to `"X7"` are correctly identified as related to `"X7"`.
The matrices of determined relations (`surr.res`), permuted relations (`surr.perm`) and determined p-values (`p.rel`) can also be extracted as follows:
```
MFI = rel.mfi$surr.res
surr.perm = rel.mfi$surr.perm
p.rel = rel.mfi$p.rel
```r
MFI <- rel.mfi$surr.res
surr.perm <- rel.mfi$surr.perm
p.rel <- rel.mfi$p.rel
```
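To illustrate the idea of a null distribution built from negative scores, here is a minimal base R sketch. It loosely follows the mirroring idea of the Janitza approach (negative scores, zeros, and the mirrored negative scores form the null distribution); the scores are made up, and the code is an assumption about the general technique, not the implementation used by `var.relations.mfi()`:

```r
# Hypothetical relation scores; real values would come from the package
scores <- c(a = -0.2, b = -0.1, c = 0.05, d = 0.9, e = -0.3)

# Null distribution: negative scores, zeros, and the mirrored negative scores
negative <- scores[scores < 0]
zeros <- scores[scores == 0]
null_scores <- c(negative, zeros, -negative)

# One-sided p-value: share of null scores at least as large as the observed score
p_values <- vapply(scores, function(s) mean(null_scores >= s), numeric(1))

# Keep the variables whose p-value falls below the threshold (assumed rule)
p_t <- 0.01
related <- names(p_values)[p_values < p_t]
print(related)
# [1] "d"
```

Only `d` lies above every value of the mirrored null distribution, so it is the only score with a p-value of 0.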
## Mutual impurity reduction (MIR)
## Mutual Impurity Reduction (MIR)
Now we would like to analyze the example data with MIR, which determines the variable importance by the actual impurity reduction combined with the relations determined by MFI. Different to MD and SMD, this approach calculates p-values for the selection of important variables. For this, the null distribution is obtained in a similar way as for MFI, either by negative importance scores called the Janitza approach or by permutation. Since this example dataset is comparatively small, we use the permutation approach. As a threshold for selection a value of 0.01 is applied (p.t.sel = 0.01).
Now we would like to analyze the example data with MIR, which determines the variable importance by the actual impurity reduction combined with the relations determined by MFI. In contrast to MD and SMD, this approach calculates p-values for the selection of important variables. For this, the null distribution is obtained in a similar way as for MFI, either from negative importance scores (the Janitza approach) or by permutation. Since this example data set is comparatively small, we use the permutation approach. A threshold of 0.01 is applied for the selection (`p.t.sel = 0.01`).
```
```r
set.seed(42)
res.mir = var.select.mir(x = SMD_example_data[,2:ncol(SMD_example_data)], y = SMD_example_data[,1], s = 10, ntree = 1000, method.sel = "permutation", p.t.sel = 0.01)
res.mir <- var.select.mir(
x = SMD_example_data[, -1], y = SMD_example_data[, 1],
s = 10, num.trees = 1000, method.sel = "permutation",
p.t.sel = 0.01, num.threads = 1)
res.mir$var
[1] "X1" "X2" "X3" "X4" "X5" "X6" "cp1_1" "cp1_2" "cp1_3" "cp1_4" "cp1_5" "cp1_6"
[13] "cp1_7" "cp1_8" "cp1_9" "cp1_10" "cp2_1" "cp2_3" "cp2_4" "cp2_6" "cp2_7" "cp2_10" "cp3_1" "cp3_4"
[25] "cp3_5" "cgn_72" "cgn_81"
# [1] "X1" "X2" "X3" "X4" "X5" "X6"
# [7] "cp1_1" "cp1_2" "cp1_3" "cp1_4" "cp1_5" "cp1_6"
# [13] "cp1_7" "cp1_8" "cp1_9" "cp1_10" "cp2_1" "cp2_3"
# [19] "cp2_4" "cp2_6" "cp2_7" "cp2_10" "cp3_1" "cp3_4"
# [25] "cp3_5" "cgn_72" "cgn_81"
```
The selected variables are stored in res.mir$var. Here, the relevant variables cp1_1 to cp1_10, cp2_1, cp2_3, cp2_4, cp2_6, cp2_7, cp2_10, cp3_1, cp3_4, cp3_5, as well as the non-relevant variables cgn_72 and cgn_81 are selected.
The selected variables are stored in `res.mir$var`. Here, the relevant variables `"cp1_1"` to `"cp1_10"`, `"cp2_1"`, `"cp2_3"`, `"cp2_4"`, `"cp2_6"`, `"cp2_7"`, `"cp2_10"`, `"cp3_1"`, `"cp3_4"`, `"cp3_5"`, as well as the non-relevant variables `"cgn_72"` and `"cgn_81"` are selected.
The MIR values and p-values can be extracted as follows:
```
mir = res.mir$info$MIR
mir <- res.mir$info$MIR
head(mir)
X1 X2 X3 X4 X5 X6
10.68243 15.95674 27.09036 20.50233 23.16293 21.15731
# X1 X2 X3 X4 X5 X6
# 10.68243 15.95674 27.09036 20.50233 23.16293 21.15731
pvalues <- res.mir$info$pvalue
pvalues = res.mir$info$pvalue
head(pvalues)
X1 X2 X3 X4 X5 X6
0 0 0 0 0 0
# X1 X2 X3 X4 X5 X6
# 0 0 0 0 0 0
```
We can see that variables X1, …, X6 have a p-value of 0 and are selected.
Since this approach is based on the actual impurity reduction combined with the relations determined by MFI, both of these can also be extracted from the results:
We can see that variables `"X1"`, …, `"X6"` have a p-value of 0 and are selected.
Since this approach is based on the actual impurity reduction combined with the relations determined by MFI, both of these can also be extracted from the results:
```
air = res.mir$info$AIR
air <- res.mir$info$AIR
head(air)
X1 X2 X3 X4 X5 X6
1.072849 13.133904 26.444900 19.155187 22.718355 20.782305
# X1 X2 X3 X4 X5 X6
# 1.072849 13.133904 26.444900 19.155187 22.718355 20.782305
res.mfi = res.mir$info$relations
res.mfi <- res.mir$info$relations
```
res.mfi contains the results of var.relations.mfi conducted in MIR.
`res.mfi` contains the results of `var.relations.mfi()` conducted in MIR.
url: ~
url: https://agseifert.github.io/RFSurrogates
template:
bootstrap: 5
reference:
- title: Variable Selection
contents:
- starts_with("var.select")
- title: Variable Relations
contents:
- starts_with("var.relations")
- title: Data
contents:
- has_keyword("datasets")
- title: Additional functions
contents:
- -has_keyword("datasets")
- -starts_with("var.")