Contents

1 Introduction

In DNA methylation data, genes are often represented by variable number of correlated probes, and a single probe can map to multiple genes. This complex data structure poses significant challenges for gene set enrichment analysis (GSEA), and can lead to biased enrichment results. The dmGsea package offers several functions with novel methods specifically designed to perform efficient gene set enrichment analysis while addressing probe dependency and probe number bias. Compared to alternative packages for DNA methylation data, these methods effectively utilize probe dependency information, provide higher statistical power, can well control type I errer rates, and are computationally more efficient. The package fully supports enrichment analysis for Illumina DNA methylation array data, and is easily extendable to other types of omics data when provided with appropriate probe annotation information.

2 List of functions

3 Example Analysis

The following examples are brief demonstrations on how to perform gene set enrichment analysis using dmGsea functions.

3.1 Example 1: Differentially methylated probes from EWAS

require(dmGsea)
#generating example data
annopkg <- "IlluminaHumanMethylation450kanno.ilmn12.hg19"
anno <- minfi::getAnnotation(eval(annopkg))
#Use a subset of the data in the example to speed up execution
anno <- anno[1:10000,]
probe.p <- data.frame(Name=rownames(anno),p=runif(nrow(anno)))
probe.p$p[1:500] <- probe.p$p[1:500]/100000
Data4cor <- matrix(runif(nrow(probe.p)*100),ncol=100)
rownames(Data4cor) <- rownames(anno)

#geneset enrichment analysis with threshold based method
#for top ranked 1000 genes
gsGene(probe.p <- probe.p,Data4Cor=Data4cor,arrayType="450K",nTopGene=1000,
    outGenep=TRUE, method="Threshold",gSetName="KEGG",species="Human",
    outfile="gs1",ncore=1)
file.remove("gs1_KEGG_KEGG.csv")
  • To perform GSEA using significant genes, set FDRthre = 0.05 to apply an FDR threshold of 0.05.

  • To perform GSEA using a ranking-based method, specify method = "Ranking".

  • If the argument Data4Cor is not provided, GSEA will be performed without using the methylation data matrix to account for between-CpG correlation.

  • Use gsPG() to perform enrichment analysis based on probe group-level p-values.

  • To perform GSEA directly with probe-level p-values without combining them into gene-level statistics, use the gsProbe() function. It applies a noncentral hypergeometric test to adjust for bias introduced by variable numbers of probes per gene.

3.2 Example 2: Enrichment analysis for arrays other than 450K and EPIC

#generate example dataset
kegg <- getKEGG(species="Human")
gene1 <- unique(as.vector(unlist(kegg[1:5])))
gene2 <- unique(as.vector(unlist(kegg[6:length(kegg)])))
gene1 <- rep(gene1,sample(1:10,length(gene1),replace=TRUE))
gene2 <- rep(gene2,sample(1:10,length(gene2),replace=TRUE))
p11 <- runif(length(gene1))*(1e-3)
p2 <- runif(length(gene2))
geneid <- c(gene1,gene2)
p <- c(p11,p2)
Name <- paste0("cg",1:length(p))
probe.p <- data.frame(Name=Name,p=p)
GeneProbeTable <- data.frame(Name=Name,entrezid=geneid)
dat <- matrix(runif(length(p)*100),ncol=100)
rownames(dat) <- Name

#enrichment analysis
gsGene(probe.p=probe.p,Data4Cor=dat,GeneProbeTable=GeneProbeTable,
        method="Threshold",gSetName="KEGG",species="Human",outfile="gs5",
        ncore=1)
file.remove("gs5_KEGG_KEGG.csv")
  • To perform GSEA using a ranking-based method, specify method = "Ranking".

3.3 Example 3: Enrichment analysis with user provided geneset

#generatin example dataset
userGeneset <- getKEGG(species="Human")

#enrichment analysis
gsGene(probe.p=probe.p,Data4Cor=dat,GeneProbeTable=GeneProbeTable,
        method="Threshold",geneSet=userGeneset,species="Human",outfile="gs7",
        ncore=1)
file.remove("gs7_userSet_userSet.csv")
  • To perform GSEA using a ranking-based method, specify method = "Ranking".

3.4 Example 4: Enrichment analysis for gene expression type of data that do not

3.5 need to combine test statistics

#generatin example dataset
kegg <- getKEGG(species="Human")
gene <- unique(as.vector(unlist(kegg)))
p <- runif(length(gene))
names(p) <- gene
stats <- -log(p)*sample(c(1,-1),length(p),replace=TRUE)

#traditional GSEA analysis, enrichment toward higher or lower end of statstics
stats <- sort(stats,decr=TRUE)
gsRank(stats=stats,gSetName="KEGG",scoreType="std",outfile="gs9",nperm=1e4,
    ncore=1)
file.remove("gs9_KEGG_KEGG.csv")
file.remove("gsGene_genep.csv")

#enrichment of genes with higher statistics
stats <- sort(abs(stats),decr=TRUE)

4 Gene set and pathway databases

The package includes built-in support for KEGG, GO, MSigDB, and Reactome gene sets for both human and mouse pathways. All functions also offer options to incorporate custom, user-provided gene sets.

4.1 Kyoto Encyclopedia of Genes and Genomes (KEGG)

The KEGG pathway database is a widely used resource that provides a comprehensive collection of manually curated biological pathways. These pathways cover various biological processes, including metabolism, cellular processes, genetic information processing, and human diseases. KEGG pathways integrate information about molecular interactions, reactions, and relationships between genes,proteins, and other molecules, helping researchers understand complex biological functions at a systems level.

4.2 Gene Ontology (GO)

GO is a widely used framework for describing the roles of genes and their products (proteins, RNAs) in biological systems. Unlike pathway databases that focus on specific molecular interactions, GO provides a standardized vocabulary for annotating gene functions across species in three main categories:

  • Biological Process (BP): Describes the biological goals a gene or protein contributes to, such as cell division or metabolic processes.
  • Molecular Function (MF): Refers to the specific biochemical activities of a gene product, like binding or catalysis.
  • Cellular Component (CC): Indicates where in the cell a gene product is active, such as the nucleus, membrane,or cytoplasm.

4.3 The Molecular Signatures Database (MSigDB)

MSigDB is a comprehensive collection of gene sets used for interpreting high-throughput gene expression data in biological research. It is a key resource for gene set enrichment analysis (GSEA), helping researchers identify biological pathways, processes, and mechanisms that are overrepresented in a given dataset.

The Molecular Signatures Database (MSigDB) is divided into several major collections, each of which contains different sub-categories. Here’s a list of all the main categories and their sub-categories:

  • H: Hallmark Gene Sets. These gene sets represent fundamental biological processes, combining several similar gene sets into cohesive biological themes.
  • C1: Positional Gene Sets. Gene sets based on the chromosomal location of genes.
  • C2: Curated Gene Sets. C2.CP: Canonical Pathways: Gene sets derived from well-known pathway databases, including KEGG, Reactome, BioCarta, and others. C2.CGP: Chemical and Genetic Perturbations: Gene sets derived from published studies of chemical/genetic perturbations, often based on experimental data.
  • C3: Regulatory Target Gene Sets. C3.TFT: Transcription Factor Targets: Gene sets defined by transcription factor binding motifs. C3.MIR: microRNA Targets: Gene sets representing genes targeted by specific microRNAs.
  • C4: Computational Gene Sets. C4.CGN: Cancer Gene Neighborhoods: Gene sets computationally derived based on the relationships between genes in cancer studies. C4.CM: Cancer Modules: Gene sets based on modules of genes that co-vary across different cancers.
  • C5: GO Gene Sets. C5.BP: Biological Process: Gene sets from the biological process branch of Gene Ontology (GO). C5.CC: Cellular Component: Gene sets from the cellular component branch of GO. C5.MF: Molecular Function: Gene sets from the molecular function branch of GO.
  • C6: Oncogenic Signatures. Gene sets representing signatures of oncogenic pathway activation, often based on experimental perturbation of cancer-related genes.
  • C7: Immunologic Signatures. Gene sets derived from immunological studies, such as immune cell expression profiles, cytokine treatments, and immune responses.
  • C8: Cell Type Signatures. Gene sets representing the expression profiles of different cell types, including cell states, tissue types, and developmental stages.

4.4 Reactome

Reactome is a freely accessible, curated database of biological pathways that provides detailed insights into molecular processes across a wide range of organisms. It covers various cellular and biochemical processes, such as signal transduction, metabolism, gene expression, and immune responses. Each pathway in Reactome is represented as a network of molecular interactions, where entities like proteins, small molecules, and complexes participate in reactions, including binding, transport, and modifications. These reactions are organized hierarchically, from individual molecular events to larger biological processes.Reactome pathways are extensively curated by domain experts and are used in functional analysis of high-throughput omics data (e.g., gene expression, proteomics). It integrates experimental data, enabling researchers to explore the molecular mechanisms behind diseases, drug responses, and other biological phenomena.

5 Types of gene set enrichment analyses

Threshold-Based GSEA or Over-Representation Analysis. It requires a threshold to define “significant” genes (e.g., p-value, fold change) and tests whether the overlap between a predefined list of differentially expressed genes (DEGs) and a gene set is statistically significant.

Ranking-Based or functional class scoring (FCS) GSEA, ranks all genes in a dataset based on a continuous metric (e.g., P-value, fold change, t-statistic) and assesses whether the genes in a predefined gene set cluster at the top or bottom of this ranked list.

The choice between threshold-based and ranking-based Gene Set Enrichment Analysis (GSEA) depends on the nature of the hypothesis being tested, the type of data, and the goals of the analysis. Threshold-Based GSEA: Use for well-defined, significant subsets of genes/features when you have a justifiable cutoff and are interested in strong, specific signals. Ranking-Based GSEA: Use when you want to incorporate the full dataset, avoid arbitrary cutoffs, or have a hypothesis that requires considering a continuous spectrum of feature significance.

6 Method options to combine p-value:

sessionInfo() 
## R version 4.5.0 (2025-04-11)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.1
##  [2] minfi_1.55.0                                      
##  [3] bumphunter_1.51.0                                 
##  [4] locfit_1.5-9.12                                   
##  [5] iterators_1.0.14                                  
##  [6] foreach_1.5.2                                     
##  [7] Biostrings_2.77.1                                 
##  [8] XVector_0.49.0                                    
##  [9] dmGsea_0.99.3                                     
## [10] SummarizedExperiment_1.39.0                       
## [11] Biobase_2.69.0                                    
## [12] GenomicRanges_1.61.0                              
## [13] GenomeInfoDb_1.45.4                               
## [14] IRanges_2.43.0                                    
## [15] S4Vectors_0.47.0                                  
## [16] BiocGenerics_0.55.0                               
## [17] generics_0.1.4                                    
## [18] MatrixGenerics_1.21.0                             
## [19] matrixStats_1.5.0                                 
## [20] Matrix_1.7-3                                      
## [21] BiocStyle_2.37.0                                  
## 
## loaded via a namespace (and not attached):
##   [1] bitops_1.0-9              beanplot_1.3.1           
##   [3] DBI_1.2.3                 poolr_1.2-0              
##   [5] BiasedUrn_2.0.12          magrittr_2.0.3           
##   [7] rlang_1.1.6               scrime_1.3.5             
##   [9] compiler_4.5.0            RSQLite_2.4.1            
##  [11] GenomicFeatures_1.61.3    DelayedMatrixStats_1.31.0
##  [13] png_0.1-8                 vctrs_0.6.5              
##  [15] quadprog_1.5-8            pkgconfig_2.0.3          
##  [17] crayon_1.5.3              fastmap_1.2.0            
##  [19] Rsamtools_2.25.0          rmarkdown_2.29           
##  [21] tzdb_0.5.0                UCSC.utils_1.5.0         
##  [23] preprocessCore_1.71.0     purrr_1.0.4              
##  [25] bit_4.6.0                 xfun_0.52                
##  [27] cachem_1.1.0              jsonlite_2.0.0           
##  [29] blob_1.2.4                rhdf5filters_1.21.0      
##  [31] DelayedArray_0.35.1       reshape_0.8.9            
##  [33] Rhdf5lib_1.31.0           BiocParallel_1.43.3      
##  [35] R6_2.6.1                  bslib_0.9.0              
##  [37] RColorBrewer_1.1-3        limma_3.65.1             
##  [39] rtracklayer_1.69.0        genefilter_1.91.0        
##  [41] jquerylib_0.1.4           Rcpp_1.0.14              
##  [43] bookdown_0.43             knitr_1.50               
##  [45] readr_2.1.5               tidyselect_1.2.1         
##  [47] rentrez_1.2.4             illuminaio_0.51.0        
##  [49] splines_4.5.0             abind_1.4-8              
##  [51] yaml_2.3.10               siggenes_1.83.0          
##  [53] codetools_0.2-20          curl_6.3.0               
##  [55] doRNG_1.8.6.2             tibble_3.3.0             
##  [57] lattice_0.22-7            plyr_1.8.9               
##  [59] KEGGREST_1.49.0           askpass_1.2.1            
##  [61] evaluate_1.0.3            survival_3.8-3           
##  [63] xml2_1.3.8                mclust_6.1.1             
##  [65] pillar_1.10.2             BiocManager_1.30.26      
##  [67] rngtools_1.5.2            mathjaxr_1.8-0           
##  [69] RCurl_1.98-1.17           hms_1.1.3                
##  [71] sparseMatrixStats_1.21.0  xtable_1.8-4             
##  [73] glue_1.8.0                tools_4.5.0              
##  [75] BiocIO_1.19.0             data.table_1.17.4        
##  [77] GenomicAlignments_1.45.0  annotate_1.87.0          
##  [79] GEOquery_2.77.0           XML_3.99-0.18            
##  [81] rhdf5_2.53.1              grid_4.5.0               
##  [83] tidyr_1.3.1               AnnotationDbi_1.71.0     
##  [85] base64_2.0.2              nlme_3.1-168             
##  [87] nor1mix_1.3-3             HDF5Array_1.37.0         
##  [89] restfulr_0.0.15           cli_3.6.5                
##  [91] S4Arrays_1.9.1            dplyr_1.1.4              
##  [93] sass_0.4.10               digest_0.6.37            
##  [95] SparseArray_1.9.0         dqrng_0.4.1              
##  [97] org.Hs.eg.db_3.21.0       rjson_0.2.23             
##  [99] memoise_2.0.1             htmltools_0.5.8.1        
## [101] multtest_2.65.0           lifecycle_1.0.4          
## [103] h5mread_1.1.1             httr_1.4.7               
## [105] statmod_1.5.0             openssl_2.3.3            
## [107] bit64_4.6.0-1             MASS_7.3-65