| Title: | Directed Dependence Coefficient |
|---|---|
| Description: | Directed Dependence Coefficient (didec) is a measure of functional dependence. Multivariate Feature Ordering by Conditional Independence (MFOCI) is a variable selection algorithm based on didec. Hierarchical Variable Clustering (VarClustPartition) is a variable clustering method based on didec. For more information, see the paper by Ansari and Fuchs (2025, <doi:10.48550/arXiv.2212.01621>), and the paper by Fuchs and Wang (2024, <doi:10.1016/j.ijar.2024.109185>). |
| Authors: | Yuping Wang [aut, cre], Sebastian Fuchs [aut], Jonathan Ansari [aut] |
| Maintainer: | Yuping Wang <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.1.0 |
| Built: | 2026-06-02 09:29:26 UTC |
| Source: | https://github.com/cran/didec |
A data set of bioclimatic variables for locations homogeneously distributed over the global landmass from CHELSA ("Climatologies at high resolution for the earth’s land surface areas").
bioclimatic
An object of class data.frame with 1862 rows and 19 columns.
D.N. Karger, O. Conrad, J. Böhner, T. Kawohl, H. Kreft, R.W. Soria-Auza, N.E. Zimmermann, H.P. Linder, M. Kessler, Climatologies at high resolution for the Earth's land surface areas, Sci. Data 4(1), 2017.

    data(bioclimatic)
    head(bioclimatic)

The directed dependence coefficient (didec) estimates the degree of functional dependence of a random vector Y on a random vector X, based on an i.i.d. sample of (X,Y).

    didec(
      X,
      Y,
      trans = FALSE,
      trans.method = c("standardization"),
      estim.method = c("copula"),
      perm = FALSE,
      perm.method = c("decreasing")
    )


| Argument | Description |
|---|---|
| X | A numeric matrix or data.frame/data.table. Contains the predictor vector X. |
| Y | A numeric matrix or data.frame/data.table. Contains the response vector Y. |
| trans | A logical. If |
| trans.method | An optional character string specifying the data standardization method. This must be one of the strings |
| estim.method | An optional character string specifying a method for estimating the directed dependence coefficient. This must be one of the strings |
| perm | A logical. If |
| perm.method | An optional character string specifying a method for permuting the response variables. This must be one of the strings |

The directed dependence coefficient (didec) extends Azadkia & Chatterjee's measure of functional dependence (Azadkia & Chatterjee, 2021) to a vector of response variables. It was introduced by Ansari & Fuchs (2025).
estim.method selects one of two methods for estimating the directed dependence coefficient. "codec" uses the function codec, which estimates Azadkia & Chatterjee's measure of functional dependence and is provided in the R package FOCI. "copula" estimates the directed dependence coefficient based on a dimension reduction principle; see Fuchs (2024). The value returned by didec may be positive or negative. In the asymptotic limit, however, it is guaranteed to lie between 0 and 1.
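Since didec extends Azadkia & Chatterjee's coefficient, the scalar-response case is a useful reference point for the statement about finite-sample values. The base-R sketch below implements the nearest-neighbour estimator from Azadkia & Chatterjee (2021) for the unconditional case. The helper name `ac_coef` is hypothetical, and ties among nearest neighbours are resolved by taking the first index rather than at random. It is an illustration only and not the code used by didec or FOCI.

```r
# Sketch of the Azadkia-Chatterjee estimator T_n(Y, X) for a scalar response y.
# Assumption: ties among nearest neighbours are resolved by the first index.
ac_coef <- function(x, y) {
  x <- as.matrix(x)
  n <- nrow(x)
  R <- rank(y, ties.method = "max")          # R_i = #{j : y_j <= y_i}
  L <- n + 1 - rank(y, ties.method = "min")  # L_i = #{j : y_j >= y_i}
  D <- as.matrix(dist(x))                    # pairwise distances between predictors
  diag(D) <- Inf                             # a point is not its own neighbour
  M <- apply(D, 1, which.min)                # index of the nearest neighbour of each x_i
  sum(n * pmin(R, R[M]) - L^2) / sum(L * (n - L))
}

set.seed(1)
n <- 500
x <- rnorm(n)
dep   <- ac_coef(x, sin(3 * x))  # y is a function of x: value close to 1
indep <- ac_coef(x, rnorm(n))    # y independent of x: value close to 0, possibly negative
print(c(dependent = dep, independent = indep))
```

In the independent case the estimate fluctuates around 0, so small negative values are expected in finite samples.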
By definition, didec is invariant under permutations of the variables within the predictor vector X. Invariance under permutations within the response vector Y is achieved by computing the arithmetic mean over all possible permutations.
In addition to the option "full", which runs all permutations of the response variables, less computationally intensive options are also available: a random selection of permutations ("sample") and cyclic permutations (either "increasing" or "decreasing").
Note that when the number of variables is large, choosing "full" may result in long computation times.
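The cost difference between the full set of permutations and cyclic permutations can be sketched in base R. The helpers `cyclic_perms`, `avg_over_perms`, and `toy_stat` are hypothetical names introduced for illustration, and `toy_stat` is an arbitrary order-dependent statistic standing in for an estimator. How the package orders its "increasing" and "decreasing" cycles is not shown here.

```r
# All q cyclic shifts of the column indices 1..q, one shift per row.
cyclic_perms <- function(q) {
  shifts <- lapply(0:(q - 1), function(s) ((seq_len(q) - 1 + s) %% q) + 1)
  do.call(rbind, shifts)
}

# Average an order-dependent statistic f over a set of column permutations.
avg_over_perms <- function(f, Y, perms) {
  mean(apply(perms, 1, function(p) f(Y[, p, drop = FALSE])))
}

# Toy statistic that depends on the column order (stand-in, not didec).
toy_stat <- function(Y) sum(seq_len(ncol(Y)) * colMeans(Y))

set.seed(1)
Y <- matrix(rnorm(60), ncol = 3)
perms <- cyclic_perms(3)
print(perms)

# Averaging over all cyclic shifts removes the dependence on cyclic reordering.
a <- avg_over_perms(toy_stat, Y, perms)
b <- avg_over_perms(toy_stat, Y[, c(2, 3, 1)], perms)
print(c(a, b))

# q cyclic shifts versus q! permutations in total.
print(c(cyclic = nrow(cyclic_perms(6)), full = factorial(6)))
```

For six response variables this is 6 evaluations instead of 720, which is why the cyclic options are attractive when the response vector is large.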
The degree of functional dependence of the random vector Y on the random vector X.
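A minimal usage sketch on simulated data follows. It relies only on the signature shown above and on the option "full" mentioned in the details. The simulated variables are made up for illustration, and the calls are skipped when the package is not installed.

```r
# Simulated data: Y1 is a function of X1, Y2 depends on X1 and X2 plus small noise.
set.seed(42)
n <- 200
X <- data.frame(X1 = rnorm(n), X2 = rnorm(n))
Y <- data.frame(Y1 = X$X1^2, Y2 = X$X1 + X$X2 + rnorm(n, sd = 0.1))

if (requireNamespace("didec", quietly = TRUE)) {
  # Default settings (estim.method = "copula", no permutation averaging)
  print(didec::didec(X, Y))
  # Average over all permutations of the response variables
  print(didec::didec(X, Y, perm = TRUE, perm.method = "full"))
} else {
  message("Package 'didec' is not installed; skipping the calls.")
}
```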
Yuping Wang, Sebastian Fuchs, Jonathan Ansari
J. Ansari, S. Fuchs, A direct extension of Azadkia & Chatterjee's rank correlation to multi-response vectors, Available at https://arxiv.org/abs/2212.01621, 2025.
M. Azadkia, S. Chatterjee, A simple measure of conditional dependence, Ann. Stat. 49 (6), 2021.
S. Fuchs, Quantifying directed dependence via dimension reduction, J. Multivariate Anal. 201, Article ID 105266, 2024.
A variable selection algorithm based on the directed dependence coefficient (didec).

    mfoci(
      X,
      Y,
      trans = FALSE,
      trans.method = c("standardization"),
      estim.method = c("copula"),
      perm = FALSE,
      perm.method = c("decreasing"),
      pre.selected = NULL,
      select.method = c("forward"),
      autostop = TRUE,
      max.num = NULL
    )


| Argument | Description |
|---|---|
| X | A numeric matrix or data.frame/data.table. Contains the predictor vector X. |
| Y | A numeric matrix or data.frame/data.table. Contains the response vector Y. |
| trans | A logical. If |
| trans.method | An optional character string specifying a method for data standardization. This must be one of the strings |
| estim.method | An optional character string specifying a method for estimating the directed dependence coefficient |
| perm | A logical. If |
| perm.method | An optional character string specifying a method for permuting the response variables. This must be one of the strings |
| pre.selected | An integer vector for indexing pre-selected components from predictor X. |
| select.method | An optional character string specifying a feature selection method. This must be one of the strings |
| autostop | A logical. If |
| max.num | An integer for limiting the maximal number of selected variables if |

mfoci involves a forward feature selection algorithm for multiple-outcome data that employs the directed dependence coefficient (didec) at each step.
If autostop == TRUE, the algorithm stops at the first non-increasing value of didec, thereby selecting a subset of variables.
Otherwise, all predictor variables are ranked according to their predictive strength measured by didec.
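The forward step and the stopping rule described above can be sketched generically in base R. In this sketch the score is the adjusted R-squared of a linear model, used purely as a stand-in so that the example runs without the package. The function names `forward_select` and `adj_r2` are hypothetical, and this is not the mfoci implementation.

```r
# Stand-in score (NOT didec): adjusted R-squared of a linear model.
adj_r2 <- function(Xs, y) {
  summary(lm(y ~ ., data = data.frame(y = y, Xs)))$adj.r.squared
}

# Greedy forward selection: at each step add the variable that maximises the
# score; with autostop = TRUE, stop at the first non-increasing score.
forward_select <- function(X, y, score, autostop = TRUE) {
  remaining <- colnames(X)
  chosen <- character(0)
  values <- numeric(0)
  best_so_far <- -Inf
  while (length(remaining) > 0) {
    cand <- sapply(remaining, function(v) score(X[, c(chosen, v), drop = FALSE], y))
    best <- which.max(cand)
    if (autostop && cand[[best]] <= best_so_far) break
    chosen <- c(chosen, remaining[best])
    values <- c(values, cand[[best]])
    best_so_far <- cand[[best]]
    remaining <- remaining[-best]
  }
  data.frame(variable = chosen, score = values, stringsAsFactors = FALSE)
}

set.seed(1)
n <- 200
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
y <- 2 * X$x1 + X$x2 + rnorm(n)

sel  <- forward_select(X, y, adj_r2, autostop = TRUE)   # selected subset
rank <- forward_select(X, y, adj_r2, autostop = FALSE)  # full ranking
print(sel)
print(rank)
```

With `autostop = FALSE` every variable appears in the output, ordered by the step at which it was added.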
In addition to the forward feature selection algorithm, this function also provides a best subset selection, which can be accomplished by select.method == "subset".
This method selects features by calculating the directed dependence coefficient of all possible feature combinations.
Note that the features selected by this method are not ordered.
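Best subset selection evaluates every non-empty combination of features, so its cost grows as 2^p - 1 for p predictors. The base-R sketch below enumerates the subsets with `combn` and again uses adjusted R-squared as a stand-in score. The names `best_subset` and `adj_r2` are hypothetical, and the package's own search may differ.

```r
# Stand-in score (NOT didec): adjusted R-squared of a linear model.
adj_r2 <- function(Xs, y) {
  summary(lm(y ~ ., data = data.frame(y = y, Xs)))$adj.r.squared
}

# Exhaustive search over all non-empty subsets of the columns of X.
best_subset <- function(X, y, score) {
  vars <- colnames(X)
  subsets <- unlist(
    lapply(seq_along(vars), function(k) combn(vars, k, simplify = FALSE)),
    recursive = FALSE
  )
  scores <- sapply(subsets, function(s) score(X[, s, drop = FALSE], y))
  list(
    n_subsets = length(subsets),
    selected  = subsets[[which.max(scores)]],  # unordered set of features
    score     = max(scores)
  )
}

set.seed(1)
n <- 200
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
y <- 2 * X$x1 + X$x2 + rnorm(n)

res <- best_subset(X, y, adj_r2)
print(res)
```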
A list containing:
A vector listing all features in X;
A vector listing the pre.selected features in X if pre.selected != NULL;
A data.frame listing the selected and ranked variables and the corresponding values of the directed dependence coefficient if select.method == "forward"; A vector listing the selected features if select.method == "subset";
The values of the directed dependence coefficient if select.method == "subset".
Sebastian Fuchs, Jonathan Ansari, Yuping Wang
J. Ansari, S. Fuchs, A direct extension of Azadkia & Chatterjee's rank correlation to multi-response vectors, Available at https://arxiv.org/abs/2212.01621, 2025.

    library(didec)
    df <- as.data.frame(bioclimatic)
    X <- df[, c(9:12)]
    Y <- df[, c(1, 8)]
    mfoci(X, Y, pre.selected = c(1, 3))

VarClustPartition is a hierarchical variable clustering algorithm based on the directed dependence coefficient (didec) or a concordance measure (Kendall's tau or Spearman's footrule). The partition is determined either by a pre-selected number of clusters or by an optimality criterion (Adiam&Msplit or Silhouette coefficient).

    VarClustPartition(
      X,
      trans = FALSE,
      trans.method = c("standardization"),
      dist.method = c("PD"),
      estim.method = c("copula"),
      linkage = FALSE,
      link.method = c("complete"),
      part.method = c("optimal"),
      part.criterion = c("Adiam&Msplit"),
      num.cluster = NULL,
      plot = FALSE
    )


| Argument | Description |
|---|---|
| X | A numeric matrix or data.frame/data.table. Contains the variables to be clustered. |
| trans | A logical. If |
| trans.method | An optional character string specifying a method for data standardization. This must be one of the strings |
| dist.method | An optional character string specifying a distance function for clustering. This must be one of the strings |
| estim.method | An optional character string specifying a method for estimating the directed dependence coefficient if |
| linkage | A logical. If |
| link.method | An optional character string selecting a linkage method. This must be one of the strings |
| part.method | An optional character string selecting a partitioning method. This must be one of the strings |
| part.criterion | An optional character string selecting a criterion for the optimal partition if |
| num.cluster | An integer value for the pre-selected number of clusters if |
| plot | A logical. If |

VarClustPartition performs a hierarchical variable clustering based on the directed dependence coefficient (didec) and provides a partition of the set of variables.
If dist.method == "PD" (perfect dependence) or dist.method == "MPD" (mutual perfect dependence), the clustering is performed using didec either as a directed ("PD") or as a symmetric ("MPD") dependence coefficient.
If dist.method == "kendall" or dist.method == "footrule", clustering is performed using either multivariate Kendall's tau ("kendall") or multivariate Spearman's footrule ("footrule"). "kendall" uses the function cor.fk, which is provided in the R package pcaPP, to calculate bivariate Kendall's tau.
Instead of using one of the above-mentioned four multivariate measures for the clustering, the option linkage == TRUE enables the use of bivariate linkage methods, including complete linkage (link.method == "complete"), average linkage (link.method == "average") and single linkage (link.method == "single").
Note that the multivariate distance methods are computationally demanding because higher-dimensional dependencies are included in the calculation, in contrast to linkage methods which only incorporate pairwise dependencies.
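The linkage route, which needs only pairwise dissimilarities, can be sketched with base R's `hclust` and `cutree`. The dissimilarity 1 - |Kendall's tau| used here is an assumption made for illustration. It is not necessarily the distance that VarClustPartition computes.

```r
set.seed(1)
n <- 50
X1 <- rnorm(n)
X2 <- X1            # exact copy of X1
X3 <- rnorm(n)
X4 <- X3 + X2       # depends on both X2 and X3
X <- data.frame(X1, X2, X3, X4)

# Pairwise dissimilarity between variables (assumed form: 1 - |Kendall's tau|).
D <- 1 - abs(cor(X, method = "kendall"))

# Agglomerative clustering of the variables; "average" and "single" are the
# other two linkage rules mentioned in the text.
hc <- hclust(as.dist(D), method = "complete")

# Cut the dendrogram into a pre-selected number of clusters.
clusters <- cutree(hc, k = 3)
print(split(names(clusters), clusters))
```

Only the 4 x 4 matrix of pairwise values enters the computation, which is why linkage methods are cheaper than the multivariate distance methods.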
A pre-selected number of clusters num.cluster can be realized with the option part.method == "selected".
Otherwise (part.method == "optimal"), the number of clusters is determined by maximizing the intra-cluster similarity (similarity within the same cluster) and minimizing the inter-cluster similarity (similarity among the clusters). Two optimality criteria (Fuchs & Wang 2024) are available:
"Adiam&Msplit": Adiam measures the intra-cluster similarity and Msplit measures the inter-cluster similarity.
"Silhouette": A mixed coefficient incorporating the intra-cluster similarity and the inter-cluster similarity. The optimal number of clusters corresponds to the maximum Silhouette coefficient.
A list containing:
A dendrogram without colored branches;
An integer value determining the number of clusters after partitioning;
A list containing the clusters after partitioning.
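How a Silhouette-type criterion can determine the number of clusters is illustrated by the base-R sketch below. It computes the classical average silhouette width (Kaufman, 1990) from a matrix of pairwise dissimilarities between variables and picks the number of clusters with the largest value. The helper `avg_silhouette`, the dissimilarity 1 - |Spearman's rho|, and the convention of assigning width 0 to singleton clusters are assumptions of this sketch. The criterion implemented in the package follows Fuchs & Wang (2024) and may differ in detail.

```r
# Average silhouette width for a partition cl, given a dissimilarity matrix D.
avg_silhouette <- function(D, cl) {
  s <- sapply(seq_along(cl), function(i) {
    own <- setdiff(which(cl == cl[i]), i)
    if (length(own) == 0) return(0)             # singleton cluster: width 0
    a <- mean(D[i, own])                        # mean distance within own cluster
    others <- setdiff(unique(cl), cl[i])
    b <- min(sapply(others, function(g) mean(D[i, cl == g])))  # nearest other cluster
    (b - a) / max(a, b)
  })
  mean(s)
}

set.seed(1)
n <- 200
X1 <- rnorm(n)
X2 <- X1 + rnorm(n, sd = 0.1)   # close to X1
X3 <- rnorm(n)
X4 <- X3 + rnorm(n, sd = 0.1)   # close to X3
X <- data.frame(X1, X2, X3, X4)

D  <- 1 - abs(cor(X, method = "spearman"))
hc <- hclust(as.dist(D), method = "complete")

# Evaluate candidate numbers of clusters and keep the maximiser.
ks  <- 2:3
sil <- sapply(ks, function(k) avg_silhouette(D, cutree(hc, k = k)))
best_k <- ks[which.max(sil)]
print(data.frame(k = ks, avg_silhouette = sil))
print(best_k)
```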
Yuping Wang, Sebastian Fuchs
S. Fuchs, Y. Wang, Hierarchical variable clustering based on the predictive strength between random vectors, Int. J. Approx. Reason. 170, Article ID 109185, 2024.
P. Hansen, B. Jaumard, Cluster analysis and mathematical programming, Math. Program. 79 (1) 191–215, 1997.
L. Kaufman, Finding Groups in Data, John Wiley & Sons, 1990.

    library(didec)

    # Simulated data: X2 duplicates X1, and X4 depends on X2 and X3
    n <- 50
    X1 <- rnorm(n, 0, 1)
    X2 <- X1
    X3 <- rnorm(n, 0, 1)
    X4 <- X3 + X2
    X <- data.frame(X1 = X1, X2 = X2, X3 = X3, X4 = X4)
    vcp <- VarClustPartition(X,
                             dist.method = c("PD"),
                             part.method = c("optimal"),
                             part.criterion = c("Silhouette"),
                             plot = TRUE)
    vcp$clusters

    # Bioclimatic data with complete linkage
    data("bioclimatic")
    X <- bioclimatic[c(2:4, 9)]
    vcp1 <- VarClustPartition(X,
                              linkage = TRUE,
                              link.method = c("complete"),
                              dist.method = "PD",
                              part.method = "optimal",
                              part.criterion = "Silhouette",
                              plot = TRUE)
    vcp1$clusters
    vcp2 <- VarClustPartition(X,
                              linkage = TRUE,
                              link.method = c("complete"),
                              dist.method = "footrule",
                              part.method = "optimal",
                              part.criterion = "Adiam&Msplit",
                              plot = TRUE)
    vcp2$clusters
