Title: | Spectrally Deconfounded Models |
---|---|
Description: | Screen for and analyze non-linear sparse direct effects in the presence of unobserved confounding using spectral deconfounding techniques (Ćevid, Bühlmann, and Meinshausen (2020) <https://jmlr.org/papers/v21/19-545.html>; Guo, Ćevid, and Bühlmann (2022) <doi:10.1214/21-AOS2152>). These methods have been shown to provide good estimates of the true direct effect if many covariates are observed (e.g., in high-dimensional settings) and the confounding is fairly dense. Even if these assumptions are violated, there is usually not much to lose, and the deconfounded models will, in general, estimate a function closer to the true one than classical least-squares optimization. 'SDModels' provides the functions SDAM() for Spectrally Deconfounded Additive Models (Scheidegger, Guo, and Bühlmann (2025) <doi:10.1145/3711116>) and SDForest() for Spectrally Deconfounded Random Forests (Ulmer, Scheidegger, and Bühlmann (2025) <doi:10.48550/arXiv.2502.03969>). |
Authors: | Markus Ulmer [aut, cre, cph] |
Maintainer: | Markus Ulmer <[email protected]> |
License: | GPL-3 |
Version: | 1.0.6 |
Built: | 2025-03-11 09:31:04 UTC |
Source: | https://github.com/markusul/sdmodels |
Returns a copy of the forest object. Useful if you want to keep the original forest for comparison with a pruned version, since pruning modifies the forest in place.
## S3 method for class 'SDForest' copy(object, ...)
object |
an SDForest object |
... |
Further arguments passed to or from other methods. |
A copy of the SDForest object
Markus Ulmer
set.seed(1) X <- matrix(rnorm(10 * 20), nrow = 10) Y <- rnorm(10) fit <- SDForest(x = X, y = Y, nTree = 2, cp = 0.5) fit2 <- copy(fit)
Returns a copy of the tree object. Useful if you want to keep the original tree for comparison with a pruned version, since pruning modifies the tree in place.
## S3 method for class 'SDTree' copy(object, ...)
object |
an SDTree object |
... |
Further arguments passed to or from other methods. |
A copy of the SDTree object
Markus Ulmer
set.seed(1) X <- matrix(rnorm(10 * 20), nrow = 10) Y <- rnorm(10) fit <- SDTree(x = X, y = Y, cp = 0.5) fit2 <- copy(fit)
Estimates the optimal complexity parameter for the SDTree using cross-validation. The transformations are estimated for each training set and validation set separately to ensure independence of the validation set.
cvSDTree( formula = NULL, data = NULL, x = NULL, y = NULL, max_leaves = NULL, cp = 0, min_sample = 5, mtry = NULL, fast = TRUE, Q_type = "trim", trim_quantile = 0.5, q_hat = 0, Qf = NULL, A = NULL, gamma = 0.5, gpu = FALSE, mem_size = 1e+07, max_candidates = 100, nfolds = 3, cp_seq = NULL, mc.cores = 1, Q_scale = TRUE )
formula |
Object of class |
data |
Training data of class |
x |
Predictor data, alternative to |
y |
Response vector, alternative to |
max_leaves |
Maximum number of leaves for the grown tree. |
cp |
Complexity parameter, minimum loss decrease to split a node.
A split is only performed if the loss decrease is larger than |
min_sample |
Minimum number of observations per leaf.
A split is only performed if both resulting leaves have at least
|
mtry |
Number of randomly selected covariates to consider for a split,
if |
fast |
If |
Q_type |
Type of deconfounding, one of 'trim', 'pca', 'no_deconfounding'.
'trim' corresponds to the Trim transform (Ćevid et al. 2020)
as implemented in the Doubly debiased lasso (Guo et al. 2022),
'pca' to the PCA transformation (Paul et al. 2008).
See |
trim_quantile |
Quantile for Trim transform,
only needed for trim and DDL_trim, see |
q_hat |
Assumed confounding dimension, only needed for pca,
see |
Qf |
Spectral transformation, if |
A |
Numerical Anchor of class |
gamma |
Strength of distributional robustness, |
gpu |
If |
mem_size |
Number of split candidates that can be evaluated at once. This is a trade-off between memory and speed; it can be decreased if either the memory is not sufficient or the GPU is too small. |
max_candidates |
Maximum number of split points that are proposed at each node for each covariate. |
nfolds |
Number of folds for cross-validation. It is recommended not to use more than 5 folds if the number of covariates is larger than the number of observations. In this case, the spectral transformation could differ too much if the validation data is substantially smaller than the training data. |
cp_seq |
Sequence of complexity parameters cp to compare using cross-validation,
if |
mc.cores |
Number of cores to use for parallel computation. |
Q_scale |
Should data be scaled to estimate the spectral transformation?
Default is |
A list containing
cp_min |
The optimal complexity parameter. |
cp_table |
A table containing the complexity parameter, the mean and the standard deviation of the loss on the validation sets for the complexity parameters. If multiple complexity parameters result in the same loss, only the one with the largest complexity parameter is shown. |
Markus Ulmer
Guo Z, Ćevid D, Bühlmann P (2022).
“Doubly debiased lasso: High-dimensional inference under hidden confounding.”
The Annals of Statistics, 50(3).
ISSN 0090-5364, doi:10.1214/21-AOS2152.
Paul D, Bair E, Hastie T, Tibshirani R (2008).
““Preconditioning” for feature selection and regression in high-dimensional problems.”
The Annals of Statistics, 36(4).
ISSN 0090-5364, doi:10.1214/009053607000000578.
Ćevid D, Bühlmann P, Meinshausen N (2020).
“Spectral Deconfounding via Perturbed Sparse Linear Models.”
J. Mach. Learn. Res., 21(1).
ISSN 1532-4435, http://jmlr.org/papers/v21/19-545.html.
SDTree
prune.SDTree
regPath.SDTree
set.seed(1) n <- 50 X <- matrix(rnorm(n * 5), nrow = n) y <- sign(X[, 1]) * 3 + rnorm(n, 0, 5) cp <- cvSDTree(x = X, y = y, Q_type = 'no_deconfounding') cp
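A minimal sketch of how the selected complexity parameter is typically used afterwards, assuming the X and y from the example above: grow a deep tree with cp = 0 and prune it at the cross-validated cp_min (prune() works in place, hence the copy()).
fit <- SDTree(x = X, y = y, Q_type = 'no_deconfounding', cp = 0)
fit_pruned <- prune(copy(fit), cp = cp$cp_min)  # cp_min from cvSDTree() above
fit_pruned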
Evaluates a function of x on a Fourier basis, where only the subset of covariates indexed by js has a causal effect on Y. The function is a sum of sine and cosine basis terms of the causal covariates, weighted by the parameters beta.
f_four(x, beta, js)
x |
a vector of covariates |
beta |
the parameter vector for the function f(X) |
js |
the indices of the causal covariates in X |
the value of the function f(x)
Markus Ulmer
set.seed(42) # simulation of confounded data sim_data <- simulate_data_nonlinear(q = 2, p = 150, n = 100, m = 2) X <- sim_data$X j <- sim_data$j[1] apply(X, 1, function(x) f_four(x, sim_data$beta, j))
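Continuing the example above, a small illustrative sketch that plots the evaluated component against its causal covariate; with a single index in j, the value of f_four should depend on X[, j] only.
f_vals <- apply(X, 1, function(x) f_four(x, sim_data$beta, j))
plot(X[, j], f_vals, xlab = paste0('X', j), ylab = 'f(X)')  # causal component of f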
Converts the trees in an SDForest object from class list to class Node (Glur 2023).
## S3 method for class 'SDForest' fromList(object, ...)
object |
an SDForest object with the trees in list format |
... |
Further arguments passed to or from other methods. |
an SDForest object with the trees in Node format
Markus Ulmer
Glur C (2023). “data.tree: General Purpose Hierarchical Data Structure.” https://CRAN.R-project.org/package=data.tree.
set.seed(1) n <- 10 X <- matrix(rnorm(n * 5), nrow = n) y <- sign(X[, 1]) * 3 + rnorm(n) model <- SDForest(x = X, y = y, Q_type = 'no_deconfounding', cp = 0.5, nTree = 2) fromList(toList(model))
Converts the tree in an SDTree object from class list to class Node (Glur 2023).
## S3 method for class 'SDTree' fromList(object, ...)
object |
an SDTree object with the tree in list format |
... |
Further arguments passed to or from other methods. |
an SDTree object with the tree in Node format
Markus Ulmer
Glur C (2023). “data.tree: General Purpose Hierarchical Data Structure.” https://CRAN.R-project.org/package=data.tree.
set.seed(1) n <- 10 X <- matrix(rnorm(n * 5), nrow = n) y <- sign(X[, 1]) * 3 + rnorm(n) model <- SDTree(x = X, y = y, Q_type = 'no_deconfounding', cp = 0.5) fromList(toList(model))
This function extracts the sequence of complexity parameters of an SDForest that result in changes of the SDForest if pruned. Only cp values that differ in the first three digits after the decimal point are returned.
## S3 method for class 'SDForest' get_cp_seq(object, ...)
object |
an SDForest object |
... |
Further arguments passed to or from other methods. |
A sequence of complexity parameters
Markus Ulmer
regPath
stabilitySelection
get_cp_seq.SDTree
set.seed(1) n <- 10 X <- matrix(rnorm(n * 5), nrow = n) y <- sign(X[, 1]) * 3 + rnorm(n) model <- SDForest(x = X, y = y, Q_type = 'no_deconfounding', cp = 0, nTree = 2) get_cp_seq(model)
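The extracted sequence can be reused as the grid for other functions; a short sketch, continuing the example above, passing it to regPath() via its cp_seq argument.
cp_seq <- get_cp_seq(model)
paths <- regPath(model, cp_seq = cp_seq)  # regularization path on the extracted grid
plot(paths)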
This function extracts the sequence of complexity parameters of an SDTree that result in changes of the tree structure if pruned. Only cp values that differ in the first three digits after the decimal point are returned.
## S3 method for class 'SDTree' get_cp_seq(object, ...)
object |
an SDTree object |
... |
Further arguments passed to or from other methods. |
A sequence of complexity parameters
Markus Ulmer
set.seed(1) n <- 10 X <- matrix(rnorm(n * 5), nrow = n) y <- sign(X[, 1]) * 3 + rnorm(n) model <- SDTree(x = X, y = y, Q_type = 'no_deconfounding', cp = 0) get_cp_seq(model)
Estimates the spectral transformation Q for spectral deconfounding by shrinking the leading singular values of the covariates.
get_Q(X, type, trim_quantile = 0.5, q_hat = 0, gpu = FALSE, scaling = TRUE)
X |
Numerical covariates of class |
type |
Type of deconfounding, one of 'trim', 'pca', 'no_deconfounding'. 'trim' corresponds to the Trim transform (Ćevid et al. 2020) as implemented in the Doubly debiased lasso (Guo et al. 2022), 'pca' to the PCA transformation (Paul et al. 2008) and 'no_deconfounding' to the identity. |
trim_quantile |
Quantile for Trim transform, only needed for trim. |
q_hat |
Assumed confounding dimension, only needed for pca. |
gpu |
If |
scaling |
Whether X should be scaled before calculating the spectral transformation. |
Q of class matrix
, the spectral transformation matrix.
Markus Ulmer
Guo Z, Ćevid D, Bühlmann P (2022).
“Doubly debiased lasso: High-dimensional inference under hidden confounding.”
The Annals of Statistics, 50(3).
ISSN 0090-5364, doi:10.1214/21-AOS2152.
Paul D, Bair E, Hastie T, Tibshirani R (2008).
““Preconditioning” for feature selection and regression in high-dimensional problems.”
The Annals of Statistics, 36(4).
ISSN 0090-5364, doi:10.1214/009053607000000578.
Ćevid D, Bühlmann P, Meinshausen N (2020).
“Spectral Deconfounding via Perturbed Sparse Linear Models.”
J. Mach. Learn. Res., 21(1).
ISSN 1532-4435, http://jmlr.org/papers/v21/19-545.html.
set.seed(1) X <- matrix(rnorm(50 * 20), nrow = 50) Q_trim <- get_Q(X, 'trim') Q_pca <- get_Q(X, 'pca', q_hat = 5) Q_plain <- get_Q(X, 'no_deconfounding')
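To see the effect of the transformation, compare the singular values of X and of Q X; a small sketch continuing the example above, where the trim transform should cap the leading singular values while 'no_deconfounding' leaves them essentially unchanged.
head(svd(X)$d)               # original singular values
head(svd(Q_trim %*% X)$d)    # leading values shrunk by the trim transform
head(svd(Q_plain %*% X)$d)   # identity: unchanged up to numerical error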
Estimates the anchor transformation for the Anchor-Objective.
The anchor transformation is W = I - (1 - sqrt(gamma)) P_A,
where P_A = A (A^T A)^(-1) A^T is the orthogonal projection onto the column space of A.
For gamma = 1 this is just the identity.
For gamma = 0 this corresponds to residuals after orthogonally projecting onto A.
For large gamma this is close to the orthogonal projection onto A, scaled by sqrt(gamma).
The estimator argmin_f ||W (Y - f(X))||_2^2 / n
corresponds to the Anchor-Regression Estimator
(Rothenhäusler et al. 2021), (Bühlmann 2020).
get_W(A, gamma, intercept = FALSE, gpu = FALSE)
A |
Numerical Anchor of class |
gamma |
Strength of distributional robustness, |
intercept |
Logical, whether to include an intercept in the anchor. |
gpu |
If |
W of class matrix
, the anchor transformation matrix.
Markus Ulmer
Bühlmann P (2020).
“Invariance, Causality and Robustness.”
Statistical Science, 35(3).
ISSN 0883-4237, doi:10.1214/19-STS721.
Rothenhäusler D, Meinshausen N, Bühlmann P, Peters J (2021).
“Anchor Regression: Heterogeneous Data Meet Causality.”
Journal of the Royal Statistical Society Series B: Statistical Methodology, 83(2), 215–246.
ISSN 1369-7412, doi:10.1111/rssb.12398.
set.seed(1) n <- 50 X <- matrix(rnorm(n * 1), nrow = n) Y <- 3 * X + rnorm(n) W <- get_W(X, gamma = 0) resid <- W %*% Y
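A small consistency sketch under the transformation described above (no intercept in the anchor): for gamma = 0 the transformed response should match the residuals of regressing Y on the anchor, and for gamma = 1 the transformation should be numerically the identity.
max(abs(resid - residuals(lm(Y ~ X - 1))))  # gamma = 0: residuals after projecting out the anchor
W1 <- get_W(X, gamma = 1)
max(abs(W1 - diag(n)))                      # gamma = 1: identity transformation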
This function merges two forests. The trees are combined and the variable importance is calculated as a weighted average of the two forests. If the forests are trained on the same data, the predictions and oob_predictions are combined as well.
mergeForest(fit1, fit2)
fit1 |
first |
fit2 |
second |
merged SDForest
object
set.seed(1)
n <- 50
X <- matrix(rnorm(n * 5), nrow = n)
y <- sign(X[, 1]) * 3 + rnorm(n)
fit1 <- SDForest(x = X, y = y, Q_type = 'no_deconfounding', nTree = 5, cp = 0.5)
fit2 <- SDForest(x = X, y = y, nTree = 5, cp = 0.5)
mergeForest(fit1, fit2)
Markus Ulmer
This function calculates the partial dependence of a model on a single variable. For that, predictions are made for all observations in the dataset while varying the value of the variable of interest. The overall partial effect is the average of all predictions (Friedman 2001).
partDependence(object, j, X = NULL, subSample = NULL, mc.cores = 1)
object |
A model object that has a predict method that takes newdata as argument and returns predictions. |
j |
The variable for which the partial dependence should be calculated. Either the column index of the variable in the dataset or the name of the variable. |
X |
The dataset on which the partial dependence should be calculated. Should contain the same variables as the dataset used to train the model. If NULL, tries to extract the dataset from the model object. |
subSample |
Number of samples to draw from the original data for the empirical partial dependence. If NULL, all the observations are used. |
mc.cores |
Number of cores to use for parallel computation. Parallel computing is only supported on Unix. |
An object of class partDependence
containing
preds_mean |
The average prediction for each value of the variable of interest. |
x_seq |
The sequence of values for the variable of interest. |
preds |
The predictions for each value of the variable of interest for each observation. |
j |
The name of the variable of interest. |
xj |
The values of the variable of interest in the dataset. |
Markus Ulmer
Friedman JH (2001). “Greedy Function Approximation: A Gradient Boosting Machine.” The Annals of Statistics, 29(5), 1189–1232. ISSN 00905364, http://www.jstor.org/stable/2699986.
set.seed(1) x <- rnorm(100) y <- sign(x) * 3 + rnorm(100) model <- SDTree(x = x, y = y, Q_type = 'no_deconfounding') pd <- partDependence(model, 1, X = x, subSample = 10) plot(pd)
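For intuition, the same averaging scheme can be written out by hand for any model with a predict method; a minimal, self-contained sketch using a plain lm() fit on the built-in mtcars data (independent of 'SDModels').
fit_lm <- lm(mpg ~ wt + hp, data = mtcars)
wt_grid <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 20)
pdp <- sapply(wt_grid, function(v) {
  newX <- mtcars
  newX$wt <- v                           # set the variable of interest to v for all observations
  mean(predict(fit_lm, newdata = newX))  # average prediction = partial effect at v
})
plot(wt_grid, pdp, type = 'l', xlab = 'wt', ylab = 'partial effect on mpg')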
This function plots the partial dependence of a model on a single variable.
## S3 method for class 'partDependence' plot(x, n_examples = 19, ...)
x |
An object of class |
n_examples |
Number of examples to plot in addition to the average prediction. |
... |
Further arguments passed to or from other methods. |
A ggplot object.
Markus Ulmer
partDependence
set.seed(1)
x <- rnorm(10)
y <- sign(x) * 3 + rnorm(10)
model <- SDTree(x = x, y = y, Q_type = 'no_deconfounding', cp = 0.5)
pd <- partDependence(model, 1, X = x)
plot(pd)
This function visualizes the variable importance of an SDTree or SDForest for different complexity parameters. Both the regularization path and the stability selection path can be visualized.
## S3 method for class 'paths' plot(x, plotly = FALSE, selection = NULL, sqrt_scale = FALSE, ...)
x |
A |
plotly |
If TRUE, the plot is returned as an interactive plotly plot. Might be slow for large data. |
selection |
A vector of indices of the covariates to be plotted. Can be used to plot only a subset of the covariates in case of many covariates. |
sqrt_scale |
If TRUE the y-axis is on a square root scale. |
... |
Further arguments passed to or from other methods. |
A ggplot
object with the variable importance for different regularization.
If the path
object includes a cp_min value, a black dashed line is
added to indicate the out-of-bag optimal variable selection.
Markus Ulmer
set.seed(1) n <- 10 X <- matrix(rnorm(n * 5), nrow = n) y <- sign(X[, 1]) * 3 + sign(X[, 2]) + rnorm(n) model <- SDTree(x = X, y = y, Q_type = 'no_deconfounding', cp = 0.5) paths <- regPath(model) plot(paths) plot(paths, plotly = TRUE)
Plot the SDTree.
## S3 method for class 'SDTree' plot(x, ...)
x |
Fitted object of class |
... |
Further arguments for DiagrammeR::render_graph() |
graph object from DiagrammeR::render_graph()
Markus Ulmer
SDTree
set.seed(1)
n <- 10
X <- matrix(rnorm(n * 5), nrow = n)
y <- sign(X[, 1]) * 3 + rnorm(n)
model <- SDTree(x = X, y = y, Q_type = 'no_deconfounding', cp = 0.5)
plot(model)
This function visualizes the out-of-bag performance of an SDForest for different complexity parameters. Can be used to choose the optimal complexity parameter.
plotOOB(object, sqrt_scale = FALSE)
object |
A paths object with loss_path |
sqrt_scale |
If TRUE the x-axis is on a square root scale. |
A ggplot object
Markus Ulmer
set.seed(1) n <- 10 X <- matrix(rnorm(n * 5), nrow = n) y <- sign(X[, 1]) * 3 + sign(X[, 2]) + rnorm(n) model <- SDForest(x = X, y = y, Q_type = 'no_deconfounding', cp = 0.5) paths <- regPath(model) plotOOB(paths)
Predicts the contribution of an individual component j using a fitted SDAM.
predict_individual_fj(object, j, newdata = NULL)
object |
Fitted object of class |
j |
Which component to evaluate. |
newdata |
New test data of class |
A vector of predictions for fj evaluated at Xjnew.
Cyrill Scheidegger
set.seed(1) X <- matrix(rnorm(10 * 5), ncol = 5) Y <- sin(X[, 1]) - X[, 2] + rnorm(10) model <- SDAM(x = X, y = Y, Q_type = "trim", trim_quantile = 0.5, nfold = 2, n_K = 1) predict_individual_fj(model, j = 1)
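Since an SDAM is additive, the full prediction should decompose into the intercept plus the sum of the individual component fits. The following is only a sketch of that check, continuing the example above and assuming predict_individual_fj() accepts newdata in the same format as predict.SDAM().
fj_sum <- rowSums(sapply(seq_len(model$p), function(j)
  predict_individual_fj(model, j = j, newdata = data.frame(X))))
# should be close to zero if the decomposition holds
max(abs(predict(model, newdata = data.frame(X)) - (model$intercept + fj_sum)))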
Predicts the response for new data using a fitted SDAM.
## S3 method for class 'SDAM' predict(object, newdata, ...)
object |
Fitted object of class |
newdata |
New test data of class |
... |
Further arguments passed to or from other methods. |
A vector of predictions for the new data.
Cyrill Scheidegger
set.seed(1) X <- matrix(rnorm(10 * 5), ncol = 5) Y <- sin(X[, 1]) - X[, 2] + rnorm(10) model <- SDAM(x = X, y = Y, Q_type = "trim", trim_quantile = 0.5, nfold = 2, n_K = 1) predict(model, newdata = data.frame(X))
Predicts the response for new data using a fitted SDForest.
## S3 method for class 'SDForest' predict(object, newdata, ...)
object |
Fitted object of class |
newdata |
New test data of class |
... |
Further arguments passed to or from other methods. |
A vector of predictions for the new data.
Markus Ulmer
set.seed(1) n <- 50 X <- matrix(rnorm(n * 5), nrow = n) y <- sign(X[, 1]) * 3 + rnorm(n) model <- SDForest(x = X, y = y, Q_type = 'no_deconfounding', nTree = 5, cp = 0.5) predict(model, newdata = data.frame(X))
Predicts the response for new data using a fitted SDTree.
## S3 method for class 'SDTree' predict(object, newdata, ...)
object |
Fitted object of class |
newdata |
New test data of class |
... |
Further arguments passed to or from other methods. |
A vector of predictions for the new data.
Markus Ulmer
set.seed(1) n <- 10 X <- matrix(rnorm(n * 5), nrow = n) y <- sign(X[, 1]) * 3 + rnorm(n) model <- SDTree(x = X, y = y, Q_type = 'no_deconfounding', cp = 0.5) predict(model, newdata = data.frame(X))
Predicts the response for the training data using only the trees in the SDForest that were not trained on the observation.
predictOOB(object, X = NULL)
object |
Fitted object of class |
X |
Covariates of the training data.
If |
A vector of out-of-bag predictions for the training data.
set.seed(1) n <- 50 X <- matrix(rnorm(n * 5), nrow = n) y <- sign(X[, 1]) * 3 + rnorm(n) model <- SDForest(x = X, y = y, Q_type = 'no_deconfounding', nTree = 5, cp = 0.5) predictOOB(model)
Markus Ulmer
SDForest
prune.SDForest
plotOOB
Print contents of the partDependence.
## S3 method for class 'partDependence' print(x, ...)
x |
Fitted object of class |
... |
Further arguments passed to or from other methods. |
No return value, called for side effects
Markus Ulmer
partDependence
, plot.partDependence
set.seed(1) x <- rnorm(10) y <- sign(x) * 3 + rnorm(10) model <- SDTree(x = x, y = y, Q_type = 'no_deconfounding', cp = 0.5) pd <- partDependence(model, 1, X = x) print(pd)
Print number of covariates and number of active covariates for SDAM object.
## S3 method for class 'SDAM' print(x, ...)
x |
Fitted object of class |
... |
Further arguments passed to or from other methods. |
No return value, called for side effects
Cyrill Scheidegger
set.seed(1) X <- matrix(rnorm(10 * 5), ncol = 5) Y <- sin(X[, 1]) - X[, 2] + rnorm(10) model <- SDAM(x = X, y = Y, Q_type = "trim", trim_quantile = 0.5, nfold = 2, n_K = 1) print(model)
Print contents of the SDForest.
## S3 method for class 'SDForest' print(x, ...)
x |
Fitted object of class |
... |
Further arguments passed to or from other methods. |
No return value, called for side effects
Markus Ulmer
set.seed(1) n <- 50 X <- matrix(rnorm(n * 5), nrow = n) y <- sign(X[, 1]) * 3 + rnorm(n) model <- SDForest(x = X, y = y, Q_type = 'no_deconfounding', nTree = 5, cp = 0.5) print(model)
Print contents of the SDTree.
## S3 method for class 'SDTree' print(x, ...)
x |
Fitted object of class |
... |
Further arguments passed to or from other methods. |
No return value, called for side effects
Markus Ulmer
set.seed(1) n <- 10 X <- matrix(rnorm(n * 5), nrow = n) y <- sign(X[, 1]) * 3 + rnorm(n) model <- SDTree(x = X, y = y, Q_type = 'no_deconfounding', cp = 0.5) print(model)
Prunes all trees in the forest and re-calculates the out-of-bag predictions and performance measures. The training data is needed to calculate the out-of-bag statistics. Note that the forest is pruned in place. If you intend to keep the original forest, make a copy of it before pruning.
## S3 method for class 'SDForest' prune(object, cp, X = NULL, Y = NULL, Q = NULL, pred = TRUE, ...)
object |
an SDForest object |
cp |
Complexity parameter, the higher the value the more nodes are pruned. |
X |
The training data, if NULL the data from the forest object is used. |
Y |
The training response variable, if NULL the data from the forest object is used. |
Q |
The transformation function, if NULL the data from the forest object is used. |
pred |
If TRUE the predictions are calculated, if FALSE only the out-of-bag statistics are calculated. This can be set to FALSE to save computation time if only the out-of-bag statistics are needed. |
... |
Further arguments passed to or from other methods. |
A pruned SDForest object
Markus Ulmer
set.seed(1) X <- matrix(rnorm(10 * 20), nrow = 10) Y <- rnorm(10) fit <- SDForest(x = X, y = Y, nTree = 2) pruned_fit <- prune(copy(fit), 0.2)
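Since pruning re-computes the out-of-bag statistics, a quick before/after comparison can be made via the oob_loss element of the fitted objects; a short sketch continuing the example above.
fit$oob_loss         # out-of-bag loss before pruning
pruned_fit$oob_loss  # out-of-bag loss after pruning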
Removes all nodes that did not improve the loss by more than cp times the initial loss. Either by themselves or by one of their successors. Note that the tree is pruned in place. If you intend to keep the original tree, make a copy of it before pruning.
## S3 method for class 'SDTree' prune(object, cp, ...)
object |
an SDTree object |
cp |
Complexity parameter, the higher the value the more nodes are pruned. |
... |
Further arguments passed to or from other methods. |
A pruned SDTree object
Markus Ulmer
set.seed(1) X <- matrix(rnorm(10 * 20), nrow = 10) Y <- rnorm(10) tree <- SDTree(x = X, y = Y) pruned_tree <- prune(copy(tree), 0.2) tree pruned_tree
This function calculates the variable importance of an SDForest and the out-of-bag performance for different complexity parameters.
## S3 method for class 'SDForest' regPath(object, cp_seq = NULL, X = NULL, Y = NULL, Q = NULL, copy = TRUE, ...)
object |
an SDForest object |
cp_seq |
A sequence of complexity parameters. If NULL, the sequence is calculated automatically using only relevant values. |
X |
The training data, if NULL the data from the forest object is used. |
Y |
The training response variable, if NULL the data from the forest object is used. |
Q |
The transformation matrix, if NULL the data from the forest object is used. |
copy |
Whether the tree should be copied for the regularization path. If FALSE, the pruning is done in place and will change the SDForest. This might be reasonable if the SDForest is too large to copy. |
... |
Further arguments passed to or from other methods. |
An object of class paths
containing
cp |
The sequence of complexity parameters. |
varImp_path |
A |
loss_path |
A |
cp_min |
The complexity parameter with the lowest out-of-bag performance. |
type |
Path type |
Markus Ulmer
plot.paths
plotOOB
regPath.SDTree
prune
get_cp_seq
SDForest
set.seed(1) n <- 10 X <- matrix(rnorm(n * 5), nrow = n) y <- sign(X[, 1]) * 3 + sign(X[, 2]) + rnorm(n) model <- SDForest(x = X, y = y, Q_type = 'no_deconfounding', cp = 0.5) paths <- regPath(model) plotOOB(paths) plot(paths) plot(paths, plotly = TRUE)
This function calculates the variable importance of an SDTree for different complexity parameters.
## S3 method for class 'SDTree' regPath(object, cp_seq = NULL, ...)
object |
an SDTree object |
cp_seq |
A sequence of complexity parameters. If NULL, the sequence is calculated automatically using only relevant values. |
... |
Further arguments passed to or from other methods. |
An object of class paths
containing
cp |
The sequence of complexity parameters. |
varImp_path |
A |
type |
Path type |
Markus Ulmer
plot.paths
prune
get_cp_seq
SDTree
set.seed(1) n <- 10 X <- matrix(rnorm(n * 5), nrow = n) y <- sign(X[, 1]) * 3 + sign(X[, 2]) + rnorm(n) model <- SDTree(x = X, y = y, Q_type = 'no_deconfounding', cp = 0.5) paths <- regPath(model) plot(paths) plot(paths, plotly = TRUE)
Estimate high-dimensional additive models using spectral deconfounding (Scheidegger et al. 2025). The covariates are expanded into B-spline basis functions. A spectral transformation is used to remove bias arising from hidden confounding, and a group lasso objective is minimized to enforce component-wise sparsity. The optimal number of basis functions per component and the sparsity penalty are chosen by cross-validation.
SDAM( formula = NULL, data = NULL, x = NULL, y = NULL, Q_type = "trim", trim_quantile = 0.5, q_hat = 0, nfolds = 5, cv_method = "1se", n_K = 4, n_lambda1 = 10, n_lambda2 = 20, Q_scale = TRUE, ind_lin = NULL, mc.cores = 1, verbose = TRUE )
formula |
Object of class |
data |
Training data of class |
x |
Matrix of covariates, alternative to |
y |
Vector of responses, alternative to |
Q_type |
Type of deconfounding, one of 'trim', 'pca', 'no_deconfounding'.
'trim' corresponds to the Trim transform (Ćevid et al. 2020)
as implemented in the Doubly debiased lasso (Guo et al. 2022),
'pca' to the PCA transformation (Paul et al. 2008).
See |
trim_quantile |
Quantile for Trim transform,
only needed for trim, see |
q_hat |
Assumed confounding dimension, only needed for pca,
see |
nfolds |
The number of folds for cross-validation. Default is 5. |
cv_method |
The method for selecting the regularization parameter during cross-validation. One of "min" (minimum cv-loss) and "1se" (one-standard-error rule). Default is "1se". |
n_K |
The number of candidate values for the number of basis functions for B-splines. Default is 4. |
n_lambda1 |
The number of candidate values for the regularization parameter in the initial cross-validation step. Default is 10. |
n_lambda2 |
The number of candidate values for the regularization parameter in the second stage of cross-validation (once the optimal number of basis functions K is decided, a second stage of cross-validation for the regularization parameter is performed on a finer grid). Default is 20. |
Q_scale |
Should data be scaled to estimate the spectral transformation?
Default is |
ind_lin |
A vector of indices specifying which covariates to model linearly (i.e. not expanded into basis functions). Default is NULL. |
mc.cores |
Number of cores to use for parallel processing, if |
verbose |
If |
An object of class 'SDAM' containing the following elements:
X |
The original design matrix. |
p |
The number of covariates in 'X'. |
intercept |
The intercept term of the fitted model. |
K |
A vector of the number of basis functions for each covariate, where 1 corresponds to a linear term. The entries of the vector will mostly be the same, but some entries might be lower if the corresponding component of X contains only a few unique values. |
breaks |
A list of breakpoints used for the B-splines. Used to reconstruct the B-spline basis functions. |
coefs |
A list of coefficients for the B-spline basis functions for each component. |
active |
A vector of active covariates that contribute to the model. |
Cyrill Scheidegger
Guo Z, Ćevid D, Bühlmann P (2022).
“Doubly debiased lasso: High-dimensional inference under hidden confounding.”
The Annals of Statistics, 50(3).
ISSN 0090-5364, doi:10.1214/21-AOS2152.
Paul D, Bair E, Hastie T, Tibshirani R (2008).
““Preconditioning” for feature selection and regression in high-dimensional problems.”
The Annals of Statistics, 36(4).
ISSN 0090-5364, doi:10.1214/009053607000000578.
Scheidegger C, Guo Z, Bühlmann P (2025).
“Spectral Deconfounding for High-Dimensional Sparse Additive Models.”
ACM/IMS Journal of Data Science.
doi:10.1145/3711116.
Ćevid D, Bühlmann P, Meinshausen N (2020).
“Spectral Deconfounding via Perturbed Sparse Linear Models.”
J. Mach. Learn. Res., 21(1).
ISSN 1532-4435, http://jmlr.org/papers/v21/19-545.html.
get_Q
, predict.SDAM
, varImp.SDAM
,
predict_individual_fj
, partDependence
set.seed(1)
X <- matrix(rnorm(10 * 5), ncol = 5)
Y <- sin(X[, 1]) - X[, 2] + rnorm(10)
model <- SDAM(x = X, y = Y, Q_type = "trim", trim_quantile = 0.5, nfold = 2, n_K = 1)

library(HDclassif)
data(wine)
names(wine) <- c("class", "alcohol", "malicAcid", "ash", "alcalinityAsh", "magnesium",
  "totPhenols", "flavanoids", "nonFlavPhenols", "proanthocyanins", "colIntens",
  "hue", "OD", "proline")
wine <- log(wine)

# estimate model
# do not use class in the model and restrict proline to be linear
model <- SDAM(alcohol ~ -class + ., wine, ind_lin = "proline", nfold = 3)

# extract variable importance
varImp(model)

# most important variable
mostImp <- names(which.max(varImp(model)))
mostImp

# predict for individual Xj
predJ <- predict_individual_fj(object = model, j = mostImp)
plot(wine[, mostImp], predJ, xlab = paste0("log ", mostImp), ylab = "log alcohol")

# partial dependence
plot(partDependence(model, mostImp))

# predict
predict(model, newdata = wine[42, ])

## alternative function call
mod_none <- SDAM(x = as.matrix(wine[1:10, -c(1, 2)]), y = wine$alcohol[1:10],
  Q_type = "no_deconfounding", nfolds = 2, n_K = 4, n_lambda1 = 4, n_lambda2 = 8)
Estimates a regression Random Forest using spectral deconfounding.
The spectrally deconfounded Random Forest (SDForest) combines SDTrees in the same way
as the original Random Forest (Breiman 2001).
The idea is to combine multiple regression trees into an ensemble in order to
decrease variance and get a smooth function. Ensembles work best if the different
models are as independent of each other as possible. To decorrelate the regression trees
as much as possible from each other, there are two mechanisms. The first one is bagging
(Breiman 1996), where each regression tree is trained on an independent bootstrap sample
of the observations, i.e., a random sample of size n drawn with replacement from the observations.
The second mechanism to decrease the correlation is that only a random subset
of the covariates is available for each split. Before each split,
mtry of the p covariates are sampled, and the split that reduces the loss the most
is chosen only among those.
SDForest( formula = NULL, data = NULL, x = NULL, y = NULL, nTree = 100, cp = 0, min_sample = 5, mtry = NULL, mc.cores = 1, Q_type = "trim", trim_quantile = 0.5, q_hat = 0, Qf = NULL, A = NULL, gamma = 7, max_size = NULL, gpu = FALSE, return_data = TRUE, mem_size = 1e+07, leave_out_ind = NULL, envs = NULL, nTree_leave_out = NULL, nTree_env = NULL, max_candidates = 100, Q_scale = TRUE, verbose = TRUE )
formula |
Object of class |
data |
Training data of class |
x |
Matrix of covariates, alternative to |
y |
Vector of responses, alternative to |
nTree |
Number of trees to grow. |
cp |
Complexity parameter, minimum loss decrease to split a node.
A split is only performed if the loss decrease is larger than |
min_sample |
Minimum number of observations per leaf.
A split is only performed if both resulting leaves have at least
|
mtry |
Number of randomly selected covariates to consider for a split,
if |
mc.cores |
Number of cores to use for parallel processing,
if |
Q_type |
Type of deconfounding, one of 'trim', 'pca', 'no_deconfounding'.
'trim' corresponds to the Trim transform (Ćevid et al. 2020)
as implemented in the Doubly debiased lasso (Guo et al. 2022),
'pca' to the PCA transformation (Paul et al. 2008).
See |
trim_quantile |
Quantile for Trim transform,
only needed for trim, see |
q_hat |
Assumed confounding dimension, only needed for pca,
see |
Qf |
Spectral transformation, if |
A |
Numerical Anchor of class |
gamma |
Strength of distributional robustness, |
max_size |
Maximum number of observations used for a bootstrap sample.
If |
gpu |
If |
return_data |
If |
mem_size |
Number of split candidates that can be evaluated at once. This is a trade-off between memory and speed; it can be decreased if either the memory is not sufficient or the GPU is too small. |
leave_out_ind |
Indices of observations that should not be used for training. |
envs |
Vector of environments of class |
nTree_leave_out |
Number of trees that should be estimated while leaving one of the environments out. Results in number of environments times number of trees. |
nTree_env |
Number of trees that should be estimated for each environment. Results in number of environments times number of trees. |
max_candidates |
Maximum number of split points that are proposed at each node for each covariate. |
Q_scale |
Should data be scaled to estimate the spectral transformation?
Default is |
verbose |
If |
Object of class SDForest
containing:
predictions |
Vector of predictions for each observation. |
forest |
List of SDTree objects. |
var_names |
Names of the covariates. |
oob_loss |
Out-of-bag loss. MSE |
oob_SDloss |
Out-of-bag loss using the spectral transformation. |
var_importance |
Variable importance. The variable importance is calculated as the sum of the decrease in the loss function resulting from all splits that use a covariate for each tree. The mean of the variable importance of all trees results in the variable importance for the forest. |
oob_ind |
List of indices of trees that did not contain the observation in the training set. |
oob_predictions |
Out-of-bag predictions. |
If return_data
is TRUE
the following are also returned:
X |
Matrix of covariates. |
Y |
Vector of responses. |
Q |
Spectral transformation. |
If envs
is provided the following are also returned:
envs |
Vector of environments. |
nTree_env |
Number of trees for each environment. |
ooEnv_ind |
List of indices of trees that did not contain the observation or the same environment in the training set for each observation. |
ooEnv_loss |
Out-of-bag loss using only trees that did not contain the observation or the same environment. |
ooEnv_SDloss |
Out-of-bag loss using the spectral transformation and only trees that did not contain the observation or the same environment. |
ooEnv_predictions |
Out-of-bag predictions using only trees that did not contain the observation or the same environment. |
nTree_leave_out |
If environments are left out, the environment that was left out for each tree. |
nTree_env |
If environments are provided, the environment each tree is trained with. |
Markus Ulmer
Breiman L (1996).
“Bagging predictors.”
Machine Learning, 24(2), 123–140.
ISSN 0885-6125, doi:10.1007/BF00058655.
Breiman L (2001).
“Random Forests.”
Machine Learning, 45(1), 5–32.
ISSN 08856125, doi:10.1023/A:1010933404324.
Guo Z, Ćevid D, Bühlmann P (2022).
“Doubly debiased lasso: High-dimensional inference under hidden confounding.”
The Annals of Statistics, 50(3).
ISSN 0090-5364, doi:10.1214/21-AOS2152.
Paul D, Bair E, Hastie T, Tibshirani R (2008).
““Preconditioning” for feature selection and regression in high-dimensional problems.”
The Annals of Statistics, 36(4).
ISSN 0090-5364, doi:10.1214/009053607000000578.
Ćevid D, Bühlmann P, Meinshausen N (2020).
“Spectral Deconfounding via Perturbed Sparse Linear Models.”
J. Mach. Learn. Res., 21(1).
ISSN 1532-4435, http://jmlr.org/papers/v21/19-545.html.
get_Q
, get_W
, SDTree
,
simulate_data_nonlinear
, regPath
,
stabilitySelection
, prune
, partDependence
set.seed(1)
n <- 50
X <- matrix(rnorm(n * 5), nrow = n)
y <- sign(X[, 1]) * 3 + rnorm(n)
model <- SDForest(x = X, y = y, Q_type = 'no_deconfounding', nTree = 5, cp = 0.5)
predict(model, newdata = data.frame(X))

set.seed(42)
# simulation of confounded data
sim_data <- simulate_data_nonlinear(q = 2, p = 150, n = 100, m = 2)
X <- sim_data$X
Y <- sim_data$Y
train_data <- data.frame(X, Y)
# causal parents of y
sim_data$j

# comparison to classical random forest
fit_ranger <- ranger::ranger(Y ~ ., train_data, importance = 'impurity')

fit <- SDForest(x = X, y = Y, nTree = 10, Q_type = 'pca', q_hat = 2)
fit <- SDForest(Y ~ ., nTree = 10, train_data)
fit

# comparison of variable importance
imp_ranger <- fit_ranger$variable.importance
imp_sdf <- fit$var_importance
imp_col <- rep('black', length(imp_ranger))
imp_col[sim_data$j] <- 'red'
plot(imp_ranger, imp_sdf, col = imp_col, pch = 20,
  xlab = 'ranger', ylab = 'SDForest', main = 'Variable Importance')

# check regularization path of variable importance
path <- regPath(fit)
# out of bag error for different regularization
plotOOB(path)
plot(path)

# detection of causal parent using stability selection
stablePath <- stabilitySelection(fit)
plot(stablePath)

# pruning of forest according to optimal out-of-bag performance
fit <- prune(fit, cp = path$cp_min)

# partial functional dependence of y on the most important covariate
most_imp <- which.max(fit$var_importance)
dep <- partDependence(fit, most_imp)
plot(dep, n_examples = 100)
Estimates a regression tree using spectral deconfounding.
A regression tree is part of the function class of step functions
f(X) = sum_{m = 1}^M 1_{X in R_m} c_m, where (R_m) with m = 1, ..., M
are regions dividing the space of X into M rectangular parts. Each region has response level c_m.
For the training data, we can write the step function as f(X) = E c, where E in {0, 1}^(n x M)
is an indicator matrix encoding to which region an observation belongs and c is a vector
containing the levels corresponding to the different regions. This function then minimizes
(c_hat, E_hat) = argmin_{c, E} ||Q (Y - E c)||_2^2 / n.
We find E_hat by using the tree structure and repeated splitting of the leaves,
similar to the original CART algorithm (Breiman et al. 2017).
Since comparing all possibilities for E is computationally infeasible, we let the tree grow greedily.
Given the current tree, we iterate over all leaves and all possible splits.
We choose the one that reduces the spectral loss the most and, after each split, estimate
all the leaf levels c_hat = argmin_c ||Q Y - Q E c||_2^2,
which is just a linear regression problem. This is repeated until the loss decreases by
less than a minimum loss decrease after a split.
The minimum loss decrease equals a cost-complexity parameter cp times
the initial loss when only an overall mean is estimated.
The cost-complexity parameter cp controls the complexity of a regression tree
and acts as a regularization parameter.
SDTree( formula = NULL, data = NULL, x = NULL, y = NULL, max_leaves = NULL, cp = 0.01, min_sample = 5, mtry = NULL, fast = TRUE, Q_type = "trim", trim_quantile = 0.5, q_hat = 0, Qf = NULL, A = NULL, gamma = 0.5, gpu = FALSE, mem_size = 1e+07, max_candidates = 100, Q_scale = TRUE )
formula |
Object of class |
data |
Training data of class |
x |
Matrix of covariates, alternative to |
y |
Vector of responses, alternative to |
max_leaves |
Maximum number of leaves for the grown tree. |
cp |
Complexity parameter, minimum loss decrease to split a node.
A split is only performed if the loss decrease is larger than |
min_sample |
Minimum number of observations per leaf.
A split is only performed if both resulting leaves have at least
|
mtry |
Number of randomly selected covariates to consider for a split,
if |
fast |
If |
Q_type |
Type of deconfounding, one of 'trim', 'pca', 'no_deconfounding'.
'trim' corresponds to the Trim transform (Ćevid et al. 2020)
as implemented in the Doubly debiased lasso (Guo et al. 2022),
'pca' to the PCA transformation (Paul et al. 2008).
See |
trim_quantile |
Quantile for Trim transform,
only needed for trim, see |
q_hat |
Assumed confounding dimension, only needed for pca,
see |
Qf |
Spectral transformation, if |
A |
Numerical Anchor of class |
gamma |
Strength of distributional robustness, |
gpu |
If |
mem_size |
Number of split candidates that can be evaluated at once. This is a trade-off between memory and speed; it can be decreased if either the memory is not sufficient or the GPU is too small. |
max_candidates |
Maximum number of split points that are proposed at each node for each covariate. |
Q_scale |
Should data be scaled to estimate the spectral transformation?
Default is |
Object of class SDTree
containing
predictions |
Predictions for the training set. |
tree |
The estimated tree of class |
var_names |
Names of the covariates in the training data. |
var_importance |
Variable importance of the covariates. The variable importance is calculated as the sum of the decrease in the loss function resulting from all splits that use this covariate. |
Markus Ulmer
Breiman L, Friedman JH, Olshen RA, Stone CJ (2017).
Classification And Regression Trees.
Routledge.
ISBN 9781315139470, doi:10.1201/9781315139470.
Glur C (2023).
“data.tree: General Purpose Hierarchical Data Structure.”
https://CRAN.R-project.org/package=data.tree.
Guo Z, Ćevid D, Bühlmann P (2022).
“Doubly debiased lasso: High-dimensional inference under hidden confounding.”
The Annals of Statistics, 50(3).
ISSN 0090-5364, doi:10.1214/21-AOS2152.
Paul D, Bair E, Hastie T, Tibshirani R (2008).
““Preconditioning” for feature selection and regression in high-dimensional problems.”
The Annals of Statistics, 36(4).
ISSN 0090-5364, doi:10.1214/009053607000000578.
Ćevid D, Bühlmann P, Meinshausen N (2020).
“Spectral Deconfounding via Perturbed Sparse Linear Models.”
J. Mach. Learn. Res., 21(1).
ISSN 1532-4435, http://jmlr.org/papers/v21/19-545.html.
simulate_data_nonlinear
, regPath.SDTree
,
prune.SDTree
, partDependence
set.seed(1)
n <- 10
X <- matrix(rnorm(n * 5), nrow = n)
y <- sign(X[, 1]) * 3 + rnorm(n)
model <- SDTree(x = X, y = y, cp = 0.5)

set.seed(42)
# simulation of confounded data
sim_data <- simulate_data_step(q = 2, p = 15, n = 100, m = 2)
X <- sim_data$X
Y <- sim_data$Y
train_data <- data.frame(X, Y)
# causal parents of y
sim_data$j

tree_plain_cv <- cvSDTree(Y ~ ., train_data, Q_type = "no_deconfounding")
tree_plain <- SDTree(Y ~ ., train_data, Q_type = "no_deconfounding", cp = 0)

tree_causal_cv <- cvSDTree(Y ~ ., train_data)
tree_causal <- SDTree(y = Y, x = X, cp = 0)

# check regularization path of variable importance
path <- regPath(tree_causal)
plot(path)

tree_plain <- prune(tree_plain, cp = tree_plain_cv$cp_min)
tree_causal <- prune(tree_causal, cp = tree_causal_cv$cp_min)
plot(tree_causal)
plot(tree_plain)
Simulation of data from a confounded non-linear model. The data generating process is given by
Y = f(X) + H delta + nu,   X = H Gamma + E,
where f(X) is a random function on the Fourier basis
with a subset of size m covariates X_j
having a causal effect on Y.
nu and E are random error terms and
H (an n x q matrix) is a matrix of random confounding covariates.
Gamma (q x p) and delta (of length q) are random coefficients.
For the simulation, all the above parameters are drawn from a standard normal distribution, except for
nu, which is drawn from a normal distribution with standard deviation 0.1.
The parameters beta defining f
are drawn from a uniform distribution between -1 and 1.
simulate_data_nonlinear(q, p, n, m, K = 2, eff = NULL, fixEff = FALSE)
q |
number of confounding covariates in H |
p |
number of covariates in X |
n |
number of observations |
m |
number of covariates with a causal effect on Y |
K |
number of fourier basis functions K |
eff |
the number of affected covariates in X by the confounding, if NULL all covariates are affected |
fixEff |
if eff is smaller than p: If fixEff = TRUE, the causal parents are always affected by confounding if fixEff = FALSE, affected covariates are chosen completely at random. |
a list containing the simulated data:
X |
a matrix of covariates |
Y |
a vector of responses |
f_X |
a vector of the true function f(X) |
j |
the indices of the causal covariates in X |
beta |
the parameter vector for the function f(X), see |
H |
the matrix of confounding covariates |
Markus Ulmer
set.seed(42) # simulation of confounded data sim_data <- simulate_data_nonlinear(q = 2, p = 150, n = 100, m = 2) X <- sim_data$X Y <- sim_data$Y
Simulation of data from a confounded non-linear model, where the non-linear function is a random regression tree. The data generating process is given by
Y = f(X) + H delta + nu,   X = H Gamma + E,
where f(X) is a random regression tree with m random splits of the data,
resulting in a random step function with m + 1 levels, i.e. leaf levels.
nu and E are random error terms and
H (an n x q matrix) is a matrix of random confounding covariates.
Gamma (q x p) and delta (of length q) are random coefficients.
For the simulation, all the above parameters are drawn from a standard normal distribution, except for
nu, which is drawn from a normal distribution with standard deviation 10.
The leaf levels
are drawn from a uniform distribution between -50 and 50.
simulate_data_step(q, p, n, m, make_tree = FALSE)
q |
number of confounding covariates in H |
p |
number of covariates in X |
n |
number of observations |
m |
number of covariates with a causal effect on Y |
make_tree |
Whether the random regression tree should be returned. |
a list containing the simulated data:
X |
a |
Y |
a |
f_X |
a |
j |
the indices of the causal covariates in X |
tree |
If |
Markus Ulmer
Glur C (2023). “data.tree: General Purpose Hierarchical Data Structure.” https://CRAN.R-project.org/package=data.tree.
set.seed(42) # simulation of confounded data sim_data <- simulate_data_step(q = 2, p = 15, n = 100, m = 2) X <- sim_data$X Y <- sim_data$Y
This function calculates the stability selection of an SDForest (Meinshausen and Bühlmann 2010). Stability selection is calculated as the fraction of trees in the forest that select a variable for a split at each complexity parameter.
## S3 method for class 'SDForest' stabilitySelection(object, cp_seq = NULL, ...)
object |
an SDForest object |
cp_seq |
A sequence of complexity parameters. If NULL, the sequence is calculated automatically using only relevant values. |
... |
Further arguments passed to or from other methods. |
An object of class paths
containing
cp |
The sequence of complexity parameters. |
varImp_path |
A |
type |
Path type |
Markus Ulmer
Meinshausen N, Bühlmann P (2010). “Stability Selection.” Journal of the Royal Statistical Society Series B: Statistical Methodology, 72(4), 417–473. ISSN 1369-7412, doi:10.1111/j.1467-9868.2010.00740.x.
plot.paths
regPath
prune
get_cp_seq
SDForest
set.seed(1) n <- 10 X <- matrix(rnorm(n * 5), nrow = n) y <- sign(X[, 1]) * 3 + sign(X[, 2]) + rnorm(n) model <- SDForest(x = X, y = y, Q_type = 'no_deconfounding', nTree = 2, cp = 0.5) paths <- stabilitySelection(model) plot(paths) plot(paths, plotly = TRUE)
Converts the trees in an SDForest object from class Node (Glur 2023) to class list. This makes it substantially easier to save the forest to disk.
## S3 method for class 'SDForest' toList(object, ...)
object |
an SDForest object with the trees in Node format |
... |
Further arguments passed to or from other methods. |
an SDForest object with the trees in list format
Markus Ulmer
Glur C (2023). “data.tree: General Purpose Hierarchical Data Structure.” https://CRAN.R-project.org/package=data.tree.
set.seed(1) n <- 10 X <- matrix(rnorm(n * 5), nrow = n) y <- sign(X[, 1]) * 3 + rnorm(n) model <- SDForest(x = X, y = y, Q_type = 'no_deconfounding', cp = 0.5, nTree = 2) toList(model)
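A minimal sketch of the intended save/restore round trip using base R serialization (the temporary file is only illustrative), continuing the example above.
rds <- tempfile(fileext = '.rds')
saveRDS(toList(model), rds)               # trees stored as plain lists
model_restored <- fromList(readRDS(rds))  # back to Node format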
Converts the tree in an SDTree object from class Node (Glur 2023) to class list. This makes it substantially easier to save the tree to disk.
## S3 method for class 'SDTree' toList(object, ...)
object |
an SDTree object with the tree in Node format |
... |
Further arguments passed to or from other methods. |
an SDTree object with the tree in list format
Markus Ulmer
Glur C (2023). “data.tree: General Purpose Hierarchical Data Structure.” https://CRAN.R-project.org/package=data.tree.
set.seed(1) n <- 10 X <- matrix(rnorm(n * 5), nrow = n) y <- sign(X[, 1]) * 3 + rnorm(n) model <- SDTree(x = X, y = y, Q_type = 'no_deconfounding', cp = 0.5) toList(model)
This function extracts the variable importance of an SDAM. The variable importance is calculated as the empirical squared L2 norm of fj. The measure is not standardized.
## S3 method for class 'SDAM' varImp(object)
object |
an |
A vector of variable importance
Cyrill Scheidegger
set.seed(1) X <- matrix(rnorm(10 * 5), ncol = 5) Y <- sin(X[, 1]) - X[, 2] + rnorm(10) model <- SDAM(x = X, y = Y, Q_type = "trim", trim_quantile = 0.5, nfold = 2) varImp(model)
This function extracts the variable importance of an SDForest. The variable importance is calculated as the sum of the decrease in the loss function resulting from all splits that use a covariate for each tree. The mean of the variable importance of all trees results in the variable importance for the forest.
## S3 method for class 'SDForest' varImp(object)
object |
an SDForest object |
A named vector of variable importance
Markus Ulmer
data(iris) fit <- SDForest(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, iris, nTree = 10, cp = 0.5) varImp(fit)
This function extracts the variable importance of an SDTree. The variable importance is calculated as the sum of the decrease in the loss function resulting from all splits that use this covariate.
## S3 method for class 'SDTree' varImp(object)
object |
an SDTree object |
A named vector of variable importance
Markus Ulmer
data(iris) tree <- SDTree(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, iris, cp = 0.5) varImp(tree)