---
title: "Functional Analysis of DNAm Sequencing Data"
shorttitle: "KYCG-SEQ"
package: knowYourCG
output: rmarkdown::html_vignette
fig_width: 6
fig_height: 5
vignette: >
%\VignetteEngine{knitr::rmarkdown}
%\VignetteIndexEntry{"1. Sequencing Data Analysis"}
%\VignetteEncoding{UTF-8}
---
KnowYourCG (KYCG) is a supervised learning framework designed for the
functional analysis of DNA methylation data. Unlike existing tools that
focus on genes or genomic intervals, KnowYourCG directly targets CpG
dinucleotides, featuring automated supervised screenings of diverse
biological and technical influences, including sequence motifs,
transcription factor binding, histone modifications, replication timing,
cell-type-specific methylation, and trait associations. KnowYourCG
addresses the challenges of data sparsity in various methylation
datasets, including low-pass Nanopore sequencing, single-cell DNA
methylomes, 5-hydroxymethylation profiles, spatial DNA methylation maps,
and array-based datasets for epigenome-wide association studies and
epigenetic clocks.
The input to KYCG is a CpG set (query). The CpG sets can represent differential
methylation, results from an epigenome-wide association study, or any sets
that may be derived from analysis. If analyzing **sequencing data**, the
preferred format is a YAME-compressed binary vector of 0 and 1 to indicate
whether the CpG is in the set. This format assumes a specific order of CpGs
following the genomic coordinates. Since it's a coordinate-free approach, the
reference coordinate is critical. Please refer to the YAME documentation for
details. https://zhou-lab.github.io/YAME/.
# QUICK START
```{r seq1, fig.width=5, fig.height=5, message=FALSE, eval=.Platform$OS.type!="windows"}
# Code that should run only on non-Windows systems
library(knowYourCG)
# Download query and knowledgebase datasets:
temp_dir <- tempdir()
knowledgebase <- file.path(temp_dir, "ChromHMM.20220414.cm")
query <- file.path(temp_dir, "mm10_f3_10cells.cg")
knowledgebase_url <- "https://zenodo.org/records/18175656/files/ChromHMM.20220414.cm"
query_url <- "https://zenodo.org/records/18176004/files/mm10_f3_10cells.cg"
download.file(knowledgebase_url, destfile = knowledgebase)
download.file(query_url, destfile = query)
# test enrichment (require YAME installed in shell)
res = testEnrichment(query, knowledgebase)
KYCG_plotDot(res, short_label=TRUE)
```
# KNOWLEDGEBASES
The curated target features are called the knowledgebase sets. We have curated
a variety of knowledgebases that represent different categorical and continuous
methylation features such as CpGs associated with chromatin states, technical
artifacts, gene association and gene expression correlation, transcription
factor binding sites, tissue specific methylation, CpG density, etc.
Whole-genome knowledgebases are available as listed in the following tables.
| Assembly | Link |
|--------------|-----------------------------------------|
| human (hg38) | |
| mouse (mm10) | |
# INPUT FORMAT
For non-array data, CpG sets (query, knowledgebase, and universe) should be
formatted using YAME. YAME supports binary representation of a set (format "b")
and optionally with a universe (format "d"). Format "d" can be created from
format "b" using the `yame mask` function.
If you have a BED-formatted data, you can pack it to YAME-"b" format
using the following pipeline. This pipeline requires
[BEDTools](https://bedtools.readthedocs.io/en/latest/content/installation.html)
and a reference coordinate file (YAME-"r") available below:
| Assembly | Link |
|--------------|---------------------------------------------|
| human (hg38) | |
| mouse (mm10) | |
```bash
yame unpack cpg_nocontig.cr |
bedtools intersect -a - -b [your_input.bed] -c -sorted |
cut -f4 | yame pack -fb - > [your_input.cg]
```
The above assumes your input is already sorted. Check out the [bedtools
intersect](https://bedtools.readthedocs.io/en/latest/content/tools/intersect.html)
if you encounter any problems at this step.
As a convenience, you can do the same conversion with the `bedToCg()` wrapper,
which validates inputs and checks for required tools before running the
pipeline:
```R
bed_file <- "input_cpgs.bed"
ref_cr <- "cpg_nocontig.cr"
out_file <- "input_cpgs.cg"
bedToCg(bed_file, ref_cr, out_file)
```
# ENRICHMENT TESTING
Then we simply run
[`yame summary`](https://zhou-lab.github.io/YAME/docs/summarize.html)
with `-m` feature file for enrichment testing. We have provided
comprehensive enrichment feature files, and you can download them from
Zenodo [mm10](https://zenodo.org/records/18175656)/[hg38](https://zenodo.org/records/18175838).
You can also create your own feature file with
[`yame pack`](https://zhou-lab.github.io/YAME/docs/pack_unpack.html).
```bash
yame summary -m feature.cm yourfile.cg > yourfile.txt
```
Detailed information of the output columns can be found on the
[`yame summary`](https://zhou-lab.github.io/YAME/docs/summarize.html)
page. Basically, a higher log2oddsratio indicates a stronger association
between the feature being tested and the query set. Generally, a large
log2 odds ratio is typically considered to be around 2 or greater, with
values between 1 and 2 often being viewed as potentially important and
worthy of further investigation, while values around 0.5 might be
considered a small effect size. For significance testing,
[sesame](https://www.bioconductor.org/packages/release/bioc/html/sesame.html)
R package provides the testEnrichmentFisherN function, which is also
provided in the yame github R page. The four input parameters correspond
to the four columns from yame summary output.
```
ND = N_mask
NQ = N_query
NDQ = N_overlap
NU = N_universe
```
We can create a coarse differential methylation dataset the following way
```bash
yame pairwise -H 1 -c 10 sample1.cg sample2.cg -o output.cg
```
-H controls directionality and -c controls minimum coverage.
The output is a query CG set with proper universe background. Selecting the
appropriate background for enrichment testing is crucial because it can
significantly impact the interpretation of the results. Usually, we use the
background set that is measured in the experiment under different conditions.
```bash
yame mask -c query.cg universe.cg | yame summary -m feature.cm - > yourfile.txt
```
The following is an analysis example in R that explicitly calls `yame` in R.
```R
if (nzchar(Sys.which("yame"))) {
df <- read.table(
text = system("yame summary -m feature.cm yourfile.cg", intern = TRUE),
header = TRUE
)
} else {
message("`yame` not found on PATH; install it to run this example.")
}
```
# SESSION INFO
```{r session-info}
sessionInfo()
```