---
title: "Confident fold change"
author: "Paul Harrison"
date: "`r Sys.Date()`"
output: 
    BiocStyle::html_document
vignette: >
    %\VignetteIndexEntry{Confident fold change}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
params:
    cache: !r FALSE
---

This document shows typical Topconfects usage with limma, edgeR, or DESeq2.

The first step is to load a dataset. Here, we're looking at RNA-seq data that
investigates the response of *Arabodopsis thaliana* to a bacterial pathogen.
Besides the experimental and control conditions, there is also a batch effect.
This dataset is also examined in section 4.2 of the `edgeR` user manual, and
I've followed the initial filtering steps in the `edgeR` manual.

```{r setup, include=FALSE}
knitr::opts_chunk$set(fig.width=7, fig.height=6)
```


```{r load, message=FALSE, warning=FALSE}
library(topconfects)

library(NBPSeq)
library(edgeR)
library(limma)

library(dplyr)
library(ggplot2)

data(arab)

# Retrieve symbol for each gene
info <- 
    AnnotationDbi::select(
        org.At.tair.db::org.At.tair.db, 
        rownames(arab), "SYMBOL") %>%
    group_by(TAIR) %>% 
    summarize(
        SYMBOL=paste(unique(na.omit(SYMBOL)),collapse="/"))
arab_info <- 
    info[match(rownames(arab),info$TAIR),] %>% 
    select(-TAIR) %>%
    as.data.frame
rownames(arab_info) <- rownames(arab)

# Extract experimental design from sample names
Treat <- factor(substring(colnames(arab),1,4)) %>% relevel(ref="mock")
Time <- factor(substring(colnames(arab),5,5))

y <- DGEList(arab, genes=as.data.frame(arab_info))

# Keep genes with at least 3 samples having an RPM of more than 2
keep <- rowSums(cpm(y)>2) >= 3
y <- y[keep,,keep.lib.sizes=FALSE]
y <- calcNormFactors(y)
```


# limma analysis

## Standard limma analysis steps

```{r limma}
design <- model.matrix(~Time+Treat)
design[,]

fit <-
    voom(y, design) %>%
    lmFit(design)
```

## Apply topconfects

Find largest fold changes that we can be confident of at FDR 0.05.

```{r limma_confects}
confects <- limma_confects(fit, coef="Treathrcc", fdr=0.05)

confects
```


## Looking at the result

Here the usual `logFC` values estimated by `limma` are shown as dots, with lines
to the `confect` value.

```{r fig.height=7}
confects_plot(confects)
```

`confects_plot_me` overlays the confects (red/blue) on a Mean-Difference Plot
(grey) (as might be produced by `limma::plotMD`). As we should expect, the very
noisy differences with low mean expression are removed if we look at the
confects.

```{r}
confects_plot_me(confects)
```

Let's compare this to the ranking we obtain from `topTable`.

```{r, fig.height=7}
fit_eb <- eBayes(fit)
top <- topTable(fit_eb, coef="Treathrcc", n=Inf)

rank_rank_plot(confects$table$name, rownames(top), "limma_confects", "topTable")
```

You can see that the top 19 genes from topTable are all within the top 40 for
topconfects ranking, but topconfects has also highly ranked some other genes.
These have a large effect size, and sufficient if not overwhelming evidence of
this.

An MD-plot highlighting the positions of the top 40 genes in both rankings also
illustrates the differences between these two ways of ranking genes.

```{r}
plotMD(fit, legend="bottomleft", status=paste0(
    ifelse(rownames(fit) %in% rownames(top)[1:40], "topTable ",""),
    ifelse(rownames(fit) %in% confects$table$name[1:40], "confects ","")))
```


# edgeR analysis

An analysis in edgeR produces similar results. Note that only quasi-likelihood
testing from edgeR is supported.

## Standard edgeR analysis

```{r edger}
y <- estimateDisp(y, design, robust=TRUE)
efit <- glmQLFit(y, design, robust=TRUE)
```

## Apply topconfects

A step of 0.05 is used here merely so that the vignette will build quickly. 
`edger_confects` calls `edgeR::glmTreat` repeatedly, which is necessarily slow.
In practice a smaller value such as 0.01 should be used.

```{r edger_confects}
econfects <- edger_confects(efit, coef="Treathrcc", fdr=0.05, step=0.05)

econfects
```


## Looking at the result

```{r fig.height=7}
confects_plot(econfects)
confects_plot_me(econfects)
```

```{r}
etop <-
    glmQLFTest(efit, coef="Treathrcc") %>%
    topTags(n=Inf)

plotMD(efit, legend="bottomleft", status=paste0(
    ifelse(rownames(efit) %in% econfects$table$name[1:40], "confects ", ""),
    ifelse(rownames(efit) %in% rownames(etop)[1:40], "topTags ","")))
```


# DESeq2 analysis

DESeq2 does its own filtering of lowly expressed genes, so we start from the
original count matrix. The initial steps are as for a normal DESeq2 analysis.

```{r message=F, warning=F}
library(DESeq2)

dds <- DESeqDataSetFromMatrix(
    countData = arab,
    colData = data.frame(Time, Treat),
    rowData = arab_info,
    design = ~Time+Treat)

dds <- DESeq(dds)
```

## Apply topconfects

The contrast or coefficient to test is specified as in the `DESeq2::results`
function. The step of 0.05 is merely so that this vignette will build quickly, 
in practice a smaller value such as 0.01 should be used. `deseq2_confects` 
calls `results` repeatedly, and in fairness `results` 
has not been optimized for this.

```{r}
dconfects <- deseq2_confects(dds, name="Treat_hrcc_vs_mock", step=0.05)
```

DESeq2 offers shrunken estimates of LFC. This is another sensible way of ranking
genes. Let's compare them to the confect values.

```{r}
shrunk <- lfcShrink(dds, coef="Treat_hrcc_vs_mock", type="ashr")
dconfects$table$shrunk <- shrunk$log2FoldChange[dconfects$table$index]

dconfects
```

DESeq2 filters some genes, these are placed last in the table. If your intention
is to obtain a ranking of all genes, you should disable this with
`deseq2_confects(..., cooksCutoff=Inf, independentFiltering=FALSE)`.

```{r}
table(dconfects$table$filtered)
tail(dconfects$table)
```

## Looking at the result

Shrunk LFC estimates are shown in red.

```{r fig.height=7}
confects_plot(dconfects) + 
    geom_point(aes(x=shrunk, size=baseMean, color="lfcShrink"), alpha=0.75)
```

`lfcShrink` aims for a best estimate of the LFC, whereas confect is a
conservative estimate. `lfcShrink` can produce non-zero values for genes which
can't be said to significantly differ from zero -- it doesn't do double duty as
an indication of significance -- whereas the confect value will be `NA` in this
case. The plot below compares these two quantities. Only un-filtered genes are
shown (see above).

```{r}
filter(dconfects$table, !filtered) %>%
ggplot(aes(
        x=ifelse(is.na(confect),0,confect), y=shrunk, color=!is.na(confect))) +
    geom_point() + geom_abline() + coord_fixed() + theme_bw() +
    labs(color="Significantly\nnon-zero at\nFDR 0.05", 
        x="confect", y="lfcShrink using ashr")
```

# Comparing results

```{r fig.height=7}
rank_rank_plot(confects$table$name, econfects$table$name, 
    "limma confects", "edgeR confects")
rank_rank_plot(confects$table$name, dconfects$table$name, 
    "limma confects", "DESeq2 confects")
rank_rank_plot(econfects$table$name, dconfects$table$name, 
    "edgeR confects", "DESeq2 confects")
```

---

```{r}
sessionInfo()
```