MetaDICT is a computational method designed for the integration of microbiome datasets, effectively addressing batch effects while preserving biological variation across heterogeneous datasets.
The method operates in two stages:
Initial Batch Effect Estimation – Utilizes covariate balancing to estimate batch effects, which are defined as heterogeneous sequencing efficiency across datasets.
Refinement via Shared Dictionary Learning – Further refines batch effect estimation by leveraging shared structures across datasets.
Compared with regression-based approaches, MetaDICT reduces overcorrection when unobserved confounding variables are present, giving a more reliable integration of datasets. The resulting integrated data can be applied to downstream analyses, including Principal Coordinates Analysis (PCoA), taxa/sample community detection, and differential abundance testing. For more details, please refer to the MetaDICT paper (Yuan and Wang 2024).
The package can be installed from GitHub:
if (!requireNamespace("devtools", quietly = TRUE))
  install.packages("devtools")
devtools::install_github("BoYuan07/MetaDICT", build_vignettes = TRUE)
Load the package:
library(MetaDICT)
The example data consist of two simulated datasets generated from gut microbiome data collected by He et al. (2018). The two datasets share the same set of taxa, and each contains 200 samples. Load the example data:
# load data
data("exampleData")
O = exampleData$O
meta = exampleData$meta
dist_mat = exampleData$dist_mat
taxonomy = exampleData$taxonomy
tree = exampleData$tree
The object contains the following components:
Integrated Count Matrix (O) – A merged abundance data matrix (taxa-by-sample) combining both datasets.
Meta Table (meta) – The sample metadata, which includes details for both datasets:
  batch – Indicates the dataset source (Dataset1 or Dataset2).
  Y – A covariate associated with microbial compositions.
  Y2 – An uninformative covariate with no biological relevance.
Sequence Distance Matrix (dist_mat) – A matrix quantifying the relationships between sequences.
Phylogenetic Tree (tree)
Taxonomy Table (taxonomy)
We will use this example data to illustrate how to use MetaDICT.
Significant batch effects are observed between these two datasets on PCoA plots.
The PCoA plots of target variable Y:
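A PCoA of this kind can be sketched in base R. The example below is only an illustration under stated assumptions: it uses a small simulated taxa-by-sample matrix (not exampleData), Bray-Curtis dissimilarity on relative abundances, and `cmdscale` for the ordination. The package's own plotting helper may use a different dissimilarity or layout.

```r
# Simulated stand-in for a taxa-by-sample count matrix with two batches.
set.seed(1)
n_taxa <- 50
n_per_batch <- 30
base_profile <- rgamma(n_taxa, shape = 1)             # shared mean abundances
eff <- list(rep(1, n_taxa), runif(n_taxa, 0.1, 1))    # batch-specific efficiency
counts <- do.call(cbind, lapply(eff, function(w) {
  matrix(rpois(n_taxa * n_per_batch, lambda = 100 * base_profile * w),
         nrow = n_taxa)
}))
batch <- rep(c("Dataset1", "Dataset2"), each = n_per_batch)

# Bray-Curtis dissimilarity between samples, computed on relative abundances.
rel <- sweep(counts, 2, colSums(counts), "/")
bray_curtis <- function(p) {
  n <- ncol(p)
  d <- matrix(0, n, n)
  for (i in seq_len(n)) for (j in seq_len(n)) {
    d[i, j] <- sum(abs(p[, i] - p[, j])) / sum(p[, i] + p[, j])
  }
  as.dist(d)
}
bc <- bray_curtis(rel)

# Classical multidimensional scaling on a dissimilarity matrix is PCoA.
pcoa <- cmdscale(bc, k = 2, eig = TRUE)
plot(pcoa$points, col = as.integer(factor(batch)), pch = 19,
     xlab = "PCo1", ylab = "PCo2")
legend("topright", legend = unique(batch), col = 1:2, pch = 19)
```

Coloring the same ordination by batch and then by a biological covariate shows whether samples cluster by study or by biology.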
A crucial assumption in MetaDICT is that the microbial load matrix can be approximated by a product of two matrices, one of which is a shared dictionary. A simple diagnostic for this assumption is to compute the singular values of the sequencing count matrix in each study and check how fast they decay:
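The diagnostic can be sketched with base R's `svd`. The matrix below is simulated under an assumed low-rank model (names such as `D_sim`, `R_sim`, and `O_sim` are illustrative and not part of the package), so the example shows only what a fast decay looks like.

```r
set.seed(2)
n_taxa <- 100
n_samples <- 80
rank_true <- 5

# Low-rank signal: dictionary (taxa x atoms) times representation (atoms x samples).
D_sim <- matrix(rexp(n_taxa * rank_true), n_taxa, rank_true)
R_sim <- matrix(rexp(rank_true * n_samples), rank_true, n_samples)

# Observed counts: Poisson noise around the low-rank signal.
O_sim <- matrix(rpois(n_taxa * n_samples, lambda = 20 * D_sim %*% R_sim),
                n_taxa, n_samples)

# Singular values only (no singular vectors needed for the diagnostic).
sv <- svd(O_sim, nu = 0, nv = 0)$d

# Fraction of total squared singular value captured by the leading components.
var_explained <- cumsum(sv^2) / sum(sv^2)

plot(sv, type = "b", log = "y", xlab = "Index", ylab = "Singular value")
```

For real data, the same computation would be applied to the columns of the count matrix belonging to each study, for example the samples selected by `meta$batch`. A sharp drop after the first few singular values supports the low-rank approximation.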
We apply MetaDICT to remove batch effects and integrate these two datasets. MetaDICT requires three inputs: an integrated count table, an integrated meta table, and a taxa dissimilarity matrix used for measurement efficiency (batch effect) estimation. A taxa dissimilarity matrix can be passed directly to MetaDICT:
# main function of MetaDICT
metadict_res = MetaDICT(O, meta, distance_matrix = dist_mat)
X = metadict_res$count
D = metadict_res$D
R_list = metadict_res$R
w = metadict_res$w
meta_output = metadict_res$meta
The results from MetaDICT include:
X – The adjusted abundance matrix after integration.
D – The learned dictionary representing shared features across datasets.
R – The latent factor representation of samples.
w – The scaling factors capturing dataset-specific measurement variations.
Batch effect is significantly reduced using MetaDICT:
MetaDICT can also accept a phylogenetic tree as input and use phylogenetic information to estimate taxa similarity:
If a phylogenetic tree is not available, MetaDICT can use taxonomic information to estimate taxa similarity. In this case, the taxonomy level of the count table must be specified.
metadict_res2 = MetaDICT(O, meta, taxonomy = taxonomy, tax_level = "order")
X2 = metadict_res2$count
Users can specify which covariates to include in MetaDICT through the covariates argument. In this example, we use only Y2 in the covariate balancing step. MetaDICT is able to preserve the biological variation of Y even when Y is not observed.
metadict_res3 = MetaDICT(O, meta, covariates = c("Y2"),
                         distance_matrix = dist_mat)
X3 = metadict_res3$count
MetaDICT includes two predefined parameters \(\alpha\) and \(\beta\). The parameter \(\alpha\) enforces the low-rank structure of the shared dictionary. The parameter \(\beta\) enforces the smoothness of the estimated measurement efficiency. If customize_parameter = FALSE, MetaDICT automatically selects parameters based on the provided inputs. When customize_parameter = TRUE, users can manually set parameter values to customize the analysis.
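To give some intuition for what "smoothness of the estimated measurement efficiency" can mean, the sketch below uses a graph-Laplacian penalty, a common way to encourage similar taxa to receive similar values. This is an assumption made for illustration only: the text above does not state that MetaDICT uses exactly this form, and every name here (`dist_sim`, `A`, `L`, `smoothness`, `w_smooth`, `w_rough`) is hypothetical.

```r
set.seed(3)
n_taxa <- 30

# Hypothetical taxa dissimilarity: taxa placed on a line, distance = gap.
pos <- sort(runif(n_taxa))
dist_sim <- as.matrix(dist(pos))

# Symmetrised k-nearest-neighbour graph built from the dissimilarity matrix.
k <- 5
A <- matrix(0, n_taxa, n_taxa)
for (i in seq_len(n_taxa)) {
  nn <- order(dist_sim[i, ])[2:(k + 1)]   # position 1 is the taxon itself
  A[i, nn] <- 1
}
A <- pmax(A, t(A))

# Graph Laplacian: degree matrix minus adjacency matrix.
L <- diag(rowSums(A)) - A

# Penalty t(w) %*% L %*% w equals the sum over edges of (w_i - w_j)^2.
smoothness <- function(w) as.numeric(t(w) %*% L %*% w)

w_smooth <- 0.3 + 0.5 * pos            # varies slowly across similar taxa
w_rough  <- runif(n_taxa, 0.3, 0.8)    # unrelated to taxa similarity

c(smooth = smoothness(w_smooth), rough = smoothness(w_rough))
```

Under this kind of penalty, a larger weight pushes the estimate toward vectors like `w_smooth`, in which closely related taxa share similar efficiencies.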
In certain situations, new studies may become available after multiple datasets have already been integrated and a machine learning model has been trained on the combined data. To incorporate these new studies into the existing integrated dataset, use the following function:
# load the data
data("exampleData_transfer")
new_data = exampleData_transfer$new_data
new_meta = exampleData_transfer$new_meta
# add new dataset to previous result
new_data_res = metadict_add_new_data(new_data, new_meta, metadict_res)
# corrected count
new_count = new_data_res$count
# integrate count tables
all_count_raw = cbind(X, new_data)
all_count_corrected = cbind(X, new_count)
covariates <- intersect(colnames(meta), colnames(new_meta))
all_meta = rbind(meta[, covariates, drop = FALSE], new_meta[, covariates, drop = FALSE])
Before batch correction:
# PCoA plot of sample covariate
pcoa_plot_discrete(all_count_raw, all_meta$Y, "Sample", colorset = "Set2")
After batch correction:
# PCoA plot of sample covariate
pcoa_plot_discrete(all_count_corrected, all_meta$Y, "Sample", colorset = "Set2")
The corrected data new_count can be directly applied to a pre-trained machine learning model.
We can detect taxa communities and sample subpopulations using the output of MetaDICT.
Load ggraph for visualization:
library(ggraph)
The shared dictionary D can be used for taxa community detection. The number of columns used in this process is determined from the elbow of the column-wise squared norms.
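The column-wise squared norms and a rough elbow can be computed as follows. This is a minimal sketch on a simulated dictionary (`D_sim` is a stand-in, not the package output), and the "largest drop" rule is one simple heuristic among several; in practice the elbow is usually read off the plot.

```r
set.seed(4)
n_taxa <- 200
n_atoms <- 40
n_strong <- 8

# Hypothetical dictionary: the first few columns carry most of the energy.
col_scale <- c(rep(1, n_strong), rep(0.05, n_atoms - n_strong))
D_sim <- sweep(matrix(rnorm(n_taxa * n_atoms), n_taxa, n_atoms),
               2, col_scale, "*")

# Squared Euclidean norm of each column, sorted from largest to smallest.
col_norms <- colSums(D_sim^2)
sorted <- sort(col_norms, decreasing = TRUE)

plot(sorted, type = "b", xlab = "Column (sorted)", ylab = "Squared norm")

# Crude elbow heuristic: position of the largest drop between consecutive values.
elbow <- which.max(-diff(sorted))
elbow
```

With the dictionary returned by MetaDICT, the same quantity is `colSums(D^2)`; columns before the elbow are the ones kept for community detection.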
The elbow value is around 20. We apply the community detection method:
D_filter = D[,1:20]
taxa_c = community_detection(D_filter, max_k = 5)
taxa_cluster = taxa_c$cluster
taxa_graph = taxa_c$graph
ggraph(taxa_graph, layout = "stress") +
  geom_node_point(aes(color = as.factor(taxa_cluster)), size = 2) +
  scale_color_brewer(palette = "Set1", name = "Taxa Cluster") +
  theme_bw() +
  xlab("") +
  ylab("") +
  theme(
    legend.position = "right",
    legend.title = element_text(size = 12, face = "bold"),
    legend.text = element_text(size = 10)
  )
R version 4.5.1 (2025-06-13)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.3 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggraph_2.2.2 MetaDICT_1.1.0 DT_0.34.0 lubridate_1.9.4
[5] forcats_1.0.1 stringr_1.5.2 dplyr_1.1.4 purrr_1.1.0
[9] readr_2.1.5 tidyr_1.3.1 tibble_3.3.0 ggplot2_4.0.0
[13] tidyverse_2.0.0 BiocStyle_2.37.1
loaded via a namespace (and not attached):
[1] tidyselect_1.2.1 viridisLite_0.4.2 farver_2.1.2
[4] viridis_0.6.5 S7_0.2.0 fastmap_1.2.0
[7] tweenr_2.0.3 RANN_2.6.2 digest_0.6.37
[10] timechange_0.3.0 lifecycle_1.0.4 cluster_2.1.8.1
[13] statmod_1.5.1 magrittr_2.0.4 compiler_4.5.1
[16] rlang_1.1.6 sass_0.4.10 tools_4.5.1
[19] igraph_2.2.1 yaml_2.3.10 knitr_1.50
[22] ggsignif_0.6.4 graphlayouts_1.2.2 labeling_0.4.3
[25] htmlwidgets_1.6.4 RColorBrewer_1.1-3 abind_1.4-8
[28] withr_3.0.2 sys_3.4.3 polyclip_1.10-7
[31] grid_4.5.1 ggpubr_0.6.2 edgeR_4.9.0
[34] scales_1.4.0 MASS_7.3-65 cli_3.6.5
[37] rmarkdown_2.30 vegan_2.7-2 generics_0.1.4
[40] tzdb_0.5.0 ggforce_0.5.0 ape_5.8-1
[43] cachem_1.1.0 splines_4.5.1 parallel_4.5.1
[46] BiocManager_1.30.26 matrixStats_1.5.0 vctrs_0.6.5
[49] Matrix_1.7-4 jsonlite_2.0.0 carData_3.0-5
[52] car_3.1-3 hms_1.1.4 ggrepel_0.9.6
[55] rstatix_0.7.3 Formula_1.2-5 maketools_1.3.2
[58] locfit_1.5-9.12 limma_3.65.7 jquerylib_0.1.4
[61] glue_1.8.0 cowplot_1.2.0 ecodist_2.1.3
[64] stringi_1.8.7 gtable_0.3.6 pillar_1.11.1
[67] htmltools_0.5.8.1 R6_2.6.1 tidygraph_1.3.1
[70] evaluate_1.0.5 lattice_0.22-7 backports_1.5.0
[73] memoise_2.0.1 broom_1.0.10 bslib_0.9.0
[76] Rcpp_1.1.0 gridExtra_2.3 nlme_3.1-168
[79] permute_0.9-8 mgcv_1.9-3 xfun_0.53
[82] buildtools_1.0.0 pkgconfig_2.0.3