---
title: "Cell type deconvolutiond of (cf)MeDIP-seq data with decemedip"
author:
- name: Ning Shen
  affiliation: 
  - Department of Statistics, University of British Columbia; 
  - Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute
  email: ning.shen.wk@gmail.com
output: 
  BiocStyle::html_document:
    titlecaps: false
    toc_float: true
    fig_width: 8
    fig_height: 6
    number_sections: true
bibliography: ../inst/references.bib
vignette: >
  %\VignetteIndexEntry{Cell type deconvolutiond of (cf)MeDIP-seq data with decemedip}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "##"
)
```


# Citation

**If you use decemedip in published research, please cite @shen2025decemedip, or: **

Shen N, Zhang Z, Baca S, Korthauer K. decemedip: hierarchical Bayesian modeling for cell type deconvolution of immunoprecipitation-based DNA methylomes. bioRxiv. 2025:2025-05.


# Installation

This package will be submitted to Bioconductor. For now, you can install the development version from GitHub:

```{r installation, eval = FALSE}
# Install stable version from Bioconductor (once available)
BiocManager::install("decemedip")
```

After installation, load the decemedip package:

```{r setup, warning=F, message=F}
library(SummarizedExperiment)
library(dplyr)
library(ggplot2)
library(decemedip)
options(digits = 2)
```

# Background

Cell-free and bulk DNA methylation data obtained through MeDIP-seq reflect a mixture of methylation signals across multiple cell types. Decomposing these signals to infer cell type composition can provide valuable insights for cancer diagnosis, immune response monitoring, and other biomedical applications. However, challenges like enrichment-induced biases and sparse reference data make this task complex. The `decemedip` package addresses these challenges through a hierarchical Bayesian framework that estimates cell type proportions and models the relationship between MeDIP-seq counts and reference methylation data.

decemedip couples a logit-normal model with a generalize additive model (GAM) framework. For each site $i \in \{1, ..., N\}$ in the reference panel, the input to the model is the fractional methylation levels $x_{ik}$ for each cell type $k \in \{1, ..., K\}$, the CpG density level $z_i$ and the MeDIP-seq read count $y_i$, where $K$ is the total number of cell types. A unit simplex variable that follows a logit-normal prior, $\boldsymbol{\pi} = (\pi_1, ..., \pi_K)$ where $\pi_k > 0, \sum_{k=1}^K \pi_k = 1$, is included to describe the proportions of cell types in the reference panel while taking into account the correlations between these cell types.

```{r, fig.retina = NULL, fig.align='center', fig.wide = TRUE, echo=FALSE}
# knitr::include_graphics(system.file("figures/method_main_figure.png", package = "decemedip"), dpi = 300)
```

# Input Data

`decemedip` requires three primary inputs: 

1. **Reference methylation matrix**: A matrix of methylation levels for multiple cell types at selected CpG sites.
2. **CpG density**: Information about CpG site density in the genome.
3. **MeDIP-seq read counts**: Coverage values from the sample of interest.

## Reference panel

Our package provides default reference matrices for hg19 and hg38 along with the corresponding CpG density information, as objects in class `SummarizedExperiment`. The objects can be accessed by calling `data(hg19.ref.cts.se)` and `data(hg19.ref.anc.se)`, or `data(hg38.ref.cts.se)` and `data(hg38.ref.anc.se)`. By default, the main function `decemedip()` applies `hg19.ref.cts.se` and `hg19.ref.anc.se` as the reference panels. Please refer to the manuscript for details of how the default reference panels were constructed.

On a side note, we provide the function `makeReferencePanel()` to allow user to build their own reference panels, which only requires input of reference CpGs and corresponded fractional methylation level matrix. The function computes CpG density on its own. Note that the cell type-specific sites and anchor sites need to be included in two `SummarizedExperiment` objects to be inputs to the main function `decemedip()`. See `?makeReferencePanel` for more information.

We show how the reference panel look like in the following chunks:

```{r}
data(hg19.ref.cts.se)
print(hg19.ref.cts.se)
head(granges(hg19.ref.cts.se))
```

```{r}
data(hg19.ref.anc.se)
print(hg19.ref.anc.se)
head(granges(hg19.ref.anc.se))
```


# Fit the Bayesian Model

The main function `decemedip()` fits the decemedip model. It allows two types of input:

1. A BAM file of the sample MeDIP-seq data, or
1. read counts of the reference CpG sites of the sample.

We provide instructions for both input types as follows.

## Input with a BAM file

```{r eval=FALSE}
sample_bam_file <- "path/to/bam/files"
paired <- TRUE # whether the sequencing is paired-end
```

```{r eval=FALSE}
output <- decemedip(
  sample_bam_file = sample_bam_file,
  paired = paired,
  cores = 4
)
```

PS: By default, the `decemedip()` function uses a hg19 reference panel. But users may add the arguments `ref_cts = hg38.ref.cts.se, ref_anc = hg38.ref.anc.se` to apply read counts extraction on hg38 data.

## Input with read counts

We use built-in objects of the package that contains read counts of the prostate tumor patient-derived xenograft (PDX) samples from the @berchuck2022detecting study to demonstrate the output and diagnostics in this vignette. 

Note that in this example, we use only a subset of the cell types from the default reference panel to reduce the build time of the vignette. Only blood cells and prostate are included.

```{r}
data(example.hg19.ref.cts.se)
data(example.hg19.ref.anc.se)

data(example.pdx.counts.cts.se)
data(example.pdx.counts.anc.se)
```

We extract the sample `LuCaP_147CR` from this example dataset for follwing illustration:
```{r}
# read counts of cell type-specific CpGs of the sample 'LuCaP_147CR'
counts_cts <- assay(example.pdx.counts.cts.se)[, "LuCaP_147CR"]
# read counts of anchor CpGs of the sample 'LuCaP_147CR'
counts_anc <- assay(example.pdx.counts.anc.se)[, "LuCaP_147CR"]
```

Due to the vignette running time limit by Bioconductor, we only run 500 iterations (`iter = 500`) for the purpose of demonstration, which causes the warning of effective Samples Size (ESS) being too low. In regular cases, we recommend to run 2000 iterations (the default) for a stable posterior inference. 

```{r eval=TRUE}
output <- decemedip(
  counts_cts = counts_cts,
  counts_anc = counts_anc,
  ref_cts = example.hg19.ref.cts.se,
  ref_anc = example.hg19.ref.anc.se,
  diagnostics = TRUE,
  cores = 4,
  iter = 500
)
```

The `output` is a list containing two elements:

+ `data_list`: An organized list of variables used as input to the Stan posterior sampling function.
+ `posterior`: An `stanfit` object produced by Stan representing the fitted posteriors. 


# Checking model outputs

## Cell type proportions

After running the model, you may extract and save the summary of fitted posteriors using the `monitor()` and `extract()` functions provided by the `RStan` package. See documentation of `RStan` for details of these functions.

```{r message=FALSE}
library(rstan)
```

Extract the fitted posterior of cell type proportions ($\boldsymbol\pi$): 

```{r eval=TRUE}
smr_pi.df <- getSummaryOnPi(output$posterior, cell_type_names = colnames(example.hg19.ref.cts.se))
print(smr_pi.df)
```

+ Columns in the `smr_pi.df`:
  + `cell_type`: The name of the parameter or variable being analyzed.
  + `mean`: The posterior mean, representing the point estimate of the parameter.
  + `se_mean`: The standard error of the mean, calculated as sd / sqrt(n_eff), indicating precision of the mean estimate.
  + `sd`: The posterior standard deviation, representing the spread or uncertainty of the parameter estimate.
  + `2.5%`, `25%`, `50%` (median), `75%`, `97.5%`: Percentiles of the posterior distribution, providing a summary of parameter uncertainty. These define the 95% credible interval (2.5% to 97.5%).

+ Diagnostics Columns:
  + `n_eff`: The effective sample size, indicating how many independent samples the chain produced after accounting for autocorrelation.
  + `Rhat`: The potential scale reduction factor, measuring chain convergence. Values close to 1.00 suggest good convergence.
  + `valid`: A flag indicating whether diagnostic checks (e.g., Rhat and n_eff) passed for this parameter (1 = passed, 0 = potential issues).
  
Plotting out the fitted cell type proportions with credible intervals:

```{r, fig.fullwidth=TRUE, warning=FALSE, eval=TRUE}
labels <- gsub("_", " ", smr_pi.df$cell_type)
labels <- gsub("(.*) EPIC", "\\1", labels)

smr_pi.df |>
  mutate(cell_type = factor(cell_type, labels = labels)) |>
  ggplot(aes(cell_type, mean)) +
  geom_linerange(aes(ymin = `2.5%`, ymax = `97.5%`),
    position = position_dodge2(width = 0.035),
    linewidth = 7, alpha = 0.3
  ) +
  geom_linerange(aes(ymin = `25%`, ymax = `75%`),
    position = position_dodge2(width = 0.035),
    linewidth = 7, alpha = 1
  ) +
  geom_point(
    position = position_dodge2(width = 0.035),
    fill = "white", shape = 21, size = 8
  ) +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```


## Fitted relationship between fractional methylation and MeDIP-seq counts

Note that this plot is only accessible when `diagnostics` is set to`TRUE` in the `decemedip()` function. 
The actual \gls*{MeDIP-seq} read counts (orange) and the fitted counts predicted by the GAM component (black) are shown in the figure across varying levels of CpG density. Grey area represents the 95\% credible intervals of the predicted counts. `CpG density: x' means that there are x CpGs in the 100-bp window surrounding the CpG.

```{r, fig.fullwidth=TRUE, warning=FALSE, eval=TRUE}
plotDiagnostics(output, plot_type = "y_fit") +
  ylim(0, 300)
```


# Conclusion

The `decemedip` package provides a robust framework for cell type deconvolution from MeDIP-seq data. By following this vignette, users can apply the method to their own datasets, extract key model outputs, and generate diagnostic plots for analysis.


# Session Info

```{r}
sessionInfo()
```

# References