---
title: "OVESEG User Manual"
author: "Lulu Chen"
date: "`r Sys.Date()`"
bibliography: oveseg.bib
output: 
  BiocStyle::html_document:
    toc: true
vignette: >
  %\VignetteIndexEntry{OVESEG User Manual}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"   
)
```
# Introduction

OVESEG-test (One Versus Everyone Subtype Exclusively-expressed Genes test) is a statistically-principled multiple-group comparison method that can detect tissue/cell-specific marker genes (MGs) among many subtypes (e.g. tissue/cell types) [@oveseg]. To assess the statistical significance of MGs, OVESEG-test uses a specifically designed test statistics that mathematically matches the definition of MGs, and employs a novel permutation scheme to estimate the corresponding distribution under null hypothesis where the expression patterns of non-MGs can be highly complex.

`OVESEG` package provides functions to compute OVESEG-test statistics, derive component weights in the mixture null distribution model and estimate p-values from weightedly aggregated permutations. While OVESEG-test p-values can be used to detect MGs, component weights of hypotheses also help portrait all kinds of upregulation patterns among tissue/cell types.

# Quick Start

The function `OVESEGtest()` includes all necessary steps to obtain OVESEG-test p-values. You need to specify the expression matrix, tissue/cell type labels, number of permutations, and parallel option. For microarray data, the expressions need to be log2-transformed prior to the test. For RNAseq count data, the counts need to be transformed to logCPM (log2-counts per million) prior to the test and use `limma::voom()` to incorporate the mean-variance relationship. Parallel option is set according to `r Biocpkg("BiocParallel")` package.

```{r, eval=FALSE}
#Microarray
rtest <- OVESEGtest(y, group, NumPerm=999, 
                    BPPARAM=BiocParallel::SnowParam())

#RNAseq
yvoom <- limma::voom(count, model.matrix(~0+factor(group)))
rtest <- OVESEGtest(yvoom$E, group, weights=yvoom$weights, NumPerm = 999, 
                    BPPARAM=BiocParallel::SnowParam())
```

# OVESEG-test

## Datatypes and Input Format

Theoretically, OVESEG-test can be applied to any molecular expression data types as long as they can be modeled by normal distribution after a transformation, such as log2-transformation for microarray and proteomics data, logCPM for RNAseq counts, logit2-transformation for DNA methylation beta values. If mean-variance relationship exists after transformation, `limma::vooma()/voom()` need to be performed firstly and the resulted weight matrix is used as the input of `OVESEG`.

Requirements for the input expression data:

- be transformed (log2/logCPM/logit2).
- be already processed by normalization and batch effect removal.
- no missing values; the molecules containing missing values should be removed prior to OVESEG-test.
- no low-expressed molecules; otherwise, the results will be affected.


The input expression data should be stored in a matrix. Data frame, or SummarizedExperiment object is also accepted and will be internally coerced into a matrix format before analysis. Rows correspond to probes and columns to samples. Tissue/cell labels must match each column in expression matrix. Weight matrix, if provided, must match each spot in expression matrix.

## Microarray data

We use a data set of purified B cells, CD4+ T cells and CD8+ T cells (downsampled from [GSE28490](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE28490)) as an example to show OVESEG-test on microarray data.

```{r, echo=TRUE}
library(OVESEG)
#Import data
data(RocheBT)
y <- RocheBT$y #5000*15 matrix
group <- RocheBT$group #15 labels

#filter low-expressed probes
groupMean <- sapply(levels(group), function(x) rowMeans(y[,group == x]))
groupMeanMax <- apply(groupMean, 1, max)
keep <- (groupMeanMax > log2(64))
y <- y[keep,]

#OVESEG-test
rtest1 <- OVESEGtest(y, group, NumPerm = 999, 
                     BPPARAM=BiocParallel::SnowParam())
```

Note there are many low-expressed probe filtering methods. Here we filtered the probes whose mean expression is less than a threshold even in the highest expressed group. If mean-variance relationship is obvious, we can consider to apply `limma::vooma()` firstly.

```{r, echo=TRUE}
#OVESEG-test
yvooma <- limma::vooma(y, model.matrix(~0+factor(group)))
rtest2 <- OVESEGtest(yvooma$E, group, weights=yvooma$weights, NumPerm = 999,
                     BPPARAM=BiocParallel::SnowParam())
```

## RNAseq count data

We use a data set of purified B cells, CD4+ T cells and CD8+ T cells (downsampled from [GSE60424](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60424)) as an example to show OVESEG-test on RNAseq count data.

```{r, echo=TRUE}
library(OVESEG)
#Import data
data(countBT)
count <- countBT$count #10000*60 matrix
group <- countBT$group #60 labels

#filter low-expressed probes
groupMean <- sapply(levels(group), function(x) rowMeans(count[,group == x]))
groupMeanMax <- apply(groupMean, 1, max)
keep <- (groupMeanMax > 2)
count <- count[keep,]

#OVESEG-test
lib.size <- max(colSums(count))
yvoom <- limma::voom(count, model.matrix(~0+factor(group)), 
                     lib.size = lib.size)
rtest3 <- OVESEGtest(yvoom$E, group, weights=yvoom$weights, NumPerm = 999,
                     BPPARAM=BiocParallel::SnowParam())
```
Note there are many low-expressed probe filtering methods. Here we filtered the probes whose mean count is less than a threshold even in the highest expressed group. `lib.size` and other parameters in `limma::voom()` can be set manually according to `r Biocpkg("limma")` package.

## p-values

`OVESEGtest` returns three p-values: `pv.overall`, `pv.oneside` and `pv.multiside`. `pv.overall` is calculated by all permutations regardless of upregulated subtypes. `pv.oneside` is subtype-specific p-values calculated specifically for the upregulated subtype of each probe. `pv.multiside` is `pv.oneside` multiplied by K (K comparison correction) and truncated at 1. More details can be found in the paper [@oveseg].

# Useful intermediate results

## Test statistics

OVESEG-test statistics are defined as $$\mathbf{\max_{k=1,...,K}\left\{min_{l \neq k}\left\{ \frac{\mu_k(j)-\mu_l(j)}{\sigma(j)\sqrt{\frac{1}{N_k}+\frac{1}{N_l}}} \right\}\right\}}$$ where $\mathbf{\mu_k(j)}$ is the mean of logarithmic expressions of gene j in subtype k. While it takes a long time to execute permutations for p-value estimation, `OVESEGtstat()` is useful if users only need test statistics for ranking genes: 

```{r, echo=TRUE}
#OVESEG-test statistics
rtstat1 <- OVESEGtstat(y, RocheBT$group)
rtstat2 <- OVESEGtstat(yvooma$E, RocheBT$group, weights=yvooma$weights)
rtstat3 <- OVESEGtstat(yvoom$E, countBT$group, weights=yvoom$weights)
```

## Posterior probabilities of null hypothesis components

Null hypothesis of OVESEG-test is modeled as a mixture of multiple components, where the weights of components are estimated from posterior probabilities over all probes. Those posterior probabilities are also returned by `OVESEGtest()`. If users only want to observe probewise posterior probabilities, `postProbNull()` is helpful.

```{r, echo=TRUE}
ppnull1 <- postProbNull(y, RocheBT$group)
ppnull2 <- postProbNull(yvooma$E, RocheBT$group, weights=yvooma$weights)
ppnull3 <- postProbNull(yvoom$E, countBT$group, weights=yvoom$weights)
```

By accumulating and normalizing probewise posterior probability with the function `nullDistri()`, we can also obtain the probability of one subtype being upregulated conditioned on null hypotheses. The subtype with higher probability tends to get more False Positive MGs.

```{r, echo=TRUE}
nullDistri(ppnull1)
nullDistri(ppnull2)
nullDistri(ppnull3)
```

There are totally $\mathbf{2^K-1}$ types of expression patterns in a real dataset: probes exclusively expressed in 1,2,…, or K of subtypes. Probe unexpressed in any of subtypes should have been filtered during pre-processing. Accumulating and normalizing probewise posterior probability of null hypotheses and of alternative hypotheses using the function `patternDistri()` can present us the ratios of all possible upregulation patterns among subtypes:

```{r, echo=TRUE}
patterns <- patternDistri(ppnull1)
#or
patterns <- patternDistri(rtest1)
```

The ratios of patterns can be illustrated as following.

```{r, fig.height=6, fig.width=6, fig.show='hold'}
library(gridExtra)
library(grid)
library(reshape2)
library(ggplot2)

gridpatterns <- function (ppnull) {
    K <- length(ppnull$label)
    cellcomp <- patternDistri(ppnull)
    cellcomp <- cellcomp[order(cellcomp[,K+1], decreasing = TRUE),]
    
    #Bar Chart to show pattern percentage
    DF1 <- data.frame(Rank = seq_len(2^K - 1), cellcomp)
    p1 <- ggplot(DF1, aes(x = Rank, y = Wpattern)) +
        geom_bar(stat = "identity") +
        scale_y_continuous(labels=scales::percent) +
        theme_bw(base_size = 16) +
        theme(axis.text.x = element_blank(),
              axis.title.x = element_blank(),
              plot.margin=unit(c(1,1,-0.2,1), "cm"),
              panel.grid.minor = element_line(size = 1),
              panel.grid.major = element_line(size = 1),
              panel.border = element_blank(),
              axis.ticks.x = element_blank()) +
        ylab("Percentage of up in certain subtype(s)")
    
    #patterns as X-axis
    DF2 <- data.frame(Rank = seq_len(2^K-1), cellcomp[,-(K+1)])
    DF2 <- melt(DF2, id.var="Rank")
    p2 <- ggplot(DF2, aes(x = Rank, y = value, fill = variable)) +
        geom_bar(stat = "identity") +
        scale_fill_brewer(palette="Set2") +
        theme(line = element_blank(),
              axis.text.x = element_blank(),
              axis.text.y = element_blank(),
              title = element_blank(),
              panel.background = element_rect(fill = "white"),
              legend.position="bottom",
              legend.key.size = unit(1.2,"line"),
              plot.margin=unit(c(-0.2,1,1,1), "cm")) +
        scale_y_reverse() +
        guides(fill = guide_legend(nrow = 1, byrow = TRUE))
    
    g1 <- ggplotGrob(p1)
    g2 <- ggplotGrob(p2)
    colnames(g1) <- paste0(seq_len(ncol(g1)))
    colnames(g2) <- paste0(seq_len(ncol(g2)))
    g <- gtable_combine(g1, g2, along=2)
    panels <- g$layout$t[grepl("panel", g$layout$name)]
    n1 <- length(g$heights[panels[1]])
    n2 <- length(g$heights[panels[2]])
    # assign new (relative) heights to the panels
    # notice the *4 here to get different heights
    g$heights[panels[1]] <- unit(n1*4,"null")
    g$heights[panels[2]] <- unit(n2,"null")
    return(g)
} 

grid.newpage()
grid.draw(gridpatterns(ppnull1))
#or grid.draw(gridpatterns(rtest1))
```


# Notes

Default variance estimator in `OVESEG` is empirical Bayes moderated variance estimator used in `r Biocpkg("limma")` (argument `alpha="moderated"`). Another option is adding a constant $\alpha$ to pooled variance estimator (argument `alpha=`$\alpha$). Setting argument `alpha=NULL` will treat all variances as the same constant.

Before MG detection, OVESEG-test p-values still need multiple comparison correction or false discovery rate control.

```{r, echo=TRUE}
##multiple comparison correction 
pv.overall.adj <- stats::p.adjust(rtest1$pv.overall, method="bonferroni")
pv.multiside.adj <- stats::p.adjust(rtest1$pv.multiside, method="bonferroni")

##fdr control
qv.overall <- fdrtool::fdrtool(rtest1$pv.overall, statistic="pvalue",
                               plot=FALSE, verbose=FALSE)$qval
qv.multiside <- fdrtool::fdrtool(rtest1$pv.multiside, statistic="pvalue",
                                 plot=FALSE, verbose=FALSE)$qval
```

# sessionInfo

```{r, echo=TRUE}
sessionInfo()
```

# References