---
title: "ScreenR Example Analysis"
output:
    BiocStyle::html_document:
author:
- name: "Emanuel Michele Soda"
  affiliation: Istituto Europeo Oncologia (IEO),
               Milano, Italy
  email: emanuelmichele@ieo.it
- name: "Elena Ceccacci"
  affiliation: Istituto Europeo Oncologia (IEO),
               Milano, Italy
  email: elena.ceccacci@ieo.it
package: ScreenR  
vignette: >
  %\VignetteIndexEntry{ScreenR Example Analysis}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::knitr}
  
date: "`r Sys.Date()`"
editor_options: 
  chunk_output_type: console
  markdown: 
    wrap: 80
---

# Importing Package

```{r packages, message=TRUE, warning=TRUE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(ggplot2)
library(dplyr)
library(tidyr)
```

# Introduction

The R package **ScreenR** has been developed to perform analysis of
High-throughput RNA interference screening using pooled shRNA libraries and next
generation sequencing.. Nowadays several short hairpin RNA (shRNA) libraries are
commercial available, and in the last years the interest in this type of
analysis, often called barcode screening, has greatly increased for their
benefits both from a time-consuming point of view and for the possibility of
carrying out screening on a large number of genes simultaneously. However, the
bioinformatic analysis of this type of screening still lacks a golden standard.
Here, ScreenR allows the user to carry out a preliminary quality check of the
experiment, visually inspect the data and finally identify the most significant
hits of the experiment through a series of plots and cross-statistical analyses.

# Installation {#installation}

## Bioconductor

**ScreenR** requires several CRAN and Bioconductor R packages to be installed.
Dependencies are usually handled automatically, when installing the package
using the following commands:

```{r ScreenR install Bioc, eval=FALSE, message=TRUE, warning=TRUE}
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("ScreenR")
```

## Github

The newest version can directly be installed from GitHub using the CRAN package
devtools:

```{r github install, eval=FALSE, message=TRUE, warning=TRUE}
if (!require("devtools", quietly = TRUE))
   install.packages("devtools")

devtools::install_github("EmanuelSoda/ScreenR")
```

# Analysis

Here is reported the ScreenR pipeline
<img src="/Users/ieo5571/Documents/IEO/Stage/ScreenR/man/figures/Pipeline.png" align="top"/>

## Loading the package

After installation, loading the package is simple as :

```{r, message=TRUE, warning=TRUE}
library(ScreenR)
```

## Read Data

The input of ScreenR is a count table obtained from barcode alignment to the
reference genome/library. A count table is usually the starting point of an
RNA-seq deferentially expressed genes analysis and consists of a matrix
containing reads count organized with:

-   Genes on the rows
-   Samples on the columns

For this vignette we will use as an example a Loss of Function Chemical
lethality Genetic Screening performed using shRNA libraries where each gene is
represented by ten slightly different shRNA each labeled with a unique barcode
coming from an unpublished dataset generated using the
[Cellecta](https://cellecta.com/) protocol.

First of all the data has to be read. Then another very important step is to set
up the column names in the following way:

-   The first part of the string is the timepoint
-   The second is the type of sample (example: treated or control or the name of
    the treatment)
-   Third slot is a way to make each replicate unique (for example a letter or a
    number)

In the end we have

**`Time point_Type of sample_replicate`**

and for example we could have **h24_control_1** which mean the first replicate
of the control sample at 24 hour.

Since this dataset comes from a Chemical Synthetic Lethality experiments the
samples treated with the drug combined with the shRNAs knockdown should present
a decreased number of reads compared to the controls.

```{r read_data, message=TRUE, warning=TRUE}
data(count_table)
data(annotation_table)

data <- count_table
colnames(data) <- c(
    "Barcode", "Time1", "Time2", "Time3_TRT_A", "Time3_TRT_B", "Time3_TRT_C",
    "Time3_CTRL_A", "Time3_CTRL_B", "Time3_CTRL_C", 
    "Time4_TRT_A", "Time4_TRT_B", "Time4_TRT_C", 
    "Time4_CTRL_A", "Time4_CTRL_B", "Time4_CTRL_C"
)
data <- data %>%
    dplyr::mutate(Barcode = as.factor(Barcode)) %>%
    dplyr::filter(Barcode != "*") %>%  
  tibble()


total_Annotation <- annotation_table
```

## Object Creation

The second needed step is to create a **ScreenR object** from the count table.
The ScreenR object is created using the function **create_screenr_object()** and
will be used to store the most important information to perform the analysis.
Most of the ScreenR function takes as main input the ScreenR object to perform
the needed operation and return a result.

```{r Create_Object, message=TRUE, warning=TRUE}
groups <- factor(c(
    "T1/T2", "T1/T2",
    "Time3_TRT", "Time3_TRT", "Time3_TRT",
    "Time3_CTRL", "Time3_CTRL", "Time3_CTRL",
    "Time4_TRT", "Time4_TRT", "Time4_TRT",
    "Time4_CTRL", "Time4_CTRL", "Time4_CTRL"
))


palette <- c("#66c2a5", "#fc8d62", rep("#8da0cb", 3), rep("#e78ac3", 3),
    rep("#a6d854", 3), rep("#ffd92f", 3))


object <- create_screenr_object(
    table = data, annotation = total_Annotation, groups = groups,
    replicates = ""
)
```

## Removing all zero rows

```{r Removing all zero rows, message=TRUE, warning=TRUE}
object <- remove_all_zero_row(object)
```

## Computing the needed tables

Once the object is created, the data must be normalized to start the analysis.
ScreenR uses *Counts Per Million* (**CPM**) normalization which has the
following mathematical expression:

$$CPM = \frac{Number \; of \; mapped \; reads \; to \; a \; barcode} 
             { \sum_{sample}{Number\; of \;mapped \; reads}} *10^{6}$$

The number of reads mapped for each barcode in a sample are normalized by the
number of reads in that sample and multiplied by one million.

This information is store in a *data table* which is a tidy version of the
original *count table* and will be used throughout the analysis.

```{r normalization, message=TRUE, warning=TRUE}
object <- normalize_data(object)
object <- compute_data_table(object)
```

## Quality Check

The first step to perform when dealing with sequencing data is to check the
quality of the samples. In ScreenR this can be done using several methods.

### Mapped Reads

The total number of mapped reads can be displayed with a barplot with the
formula.

```{r plot_mapped_reads, message=TRUE, warning=TRUE}
plot <- plot_mapped_reads(object, palette) +
    ggplot2::coord_flip() +
    ggplot2::scale_y_continuous(labels = scales::comma) +
    ggplot2::theme(legend.position = "none") +
    ggplot2::ggtitle("Number of Mapped Reads in each sample")

plot
```

For example the distribution can be seen using both boxplots or density plots.

\#### Boxplot Mapped Reads

```{r  plot_mapped_reads_distribution_boxplot, message=TRUE, warning=TRUE}
plot <- plot_mapped_reads_distribution(
    object, palette,
    alpha = 0.8,
    type = "boxplot"
) +
    coord_flip() +
    theme(legend.position = "none") 

plot
```

#### Density plot

```{r  plot_mapped_reads_distribution_density, message=TRUE, warning=TRUE}
plot <- plot_mapped_reads_distribution(
    object, palette,
    alpha = 0.5,
    type = "density"
) +
    ggplot2::theme(legend.position = "none") 

plot
```

### Barcode Lost

Another very important quality check when a Genetic Screening is performed is to
check the barcode lost during the experiment, meaning the barcode that after
different time points or treatments results in reads count equal to zero.
ScreenR implements a function able to compute and plot the number of barcodes
lost for each samples.

```{r  plot_barcode_lost, message=TRUE, warning=TRUE}
plot <- plot_barcode_lost(screenR_Object = object, palette = palette) +
    ggplot2::coord_flip()
plot
```

Moreover it is important to check if the lost barcodes in a sample all belong to
the same gene, in order to verify that an adequate number of barcodes per gene
are still present. This can be done by visualizing the number of barcode lost in
a sample by gene.

```{r  plot_barcode_lost_for_gene, message=TRUE, warning=TRUE}
plot <- plot_barcode_lost_for_gene(object,
    samples = c("Time4_TRT_C", "Time4_CTRL_C")
)
plot

```

### Plot MDS {.tabset}

In order to see the samples clusterization an initial MDS analysis can be
conducted. In ScreenR this can be done using the *plot_mds* function and the
user can decide the color code of the graph in order to highlight the trend of
the samples based on replicates, treatment or timepoints simply by modifying the
field levels in the plot_mds function.

#### For Sample

```{r  Plot_MDS_Sample, message=TRUE, warning=TRUE}
plot_mds(screenR_Object = object)
```

#### For Treatment

```{r  Plot_MDS_Treatment, message=TRUE, warning=TRUE}
group_table <- get_data_table(object)   %>%
    select(Sample, Day, Treatment) %>%
    distinct()

group_treatment <- group_table$Treatment

plot_mds(
    screenR_Object = object,
    groups = factor(x = group_treatment, levels = unique(group_treatment))
)
```

#### For Day

```{r  Plot_MDS_Day, message=TRUE, warning=TRUE}
group_day <- group_table$Day

plot_mds(
    screenR_Object = object,
    groups = factor(x = group_day, levels = unique(group_day))
)
```

## Statistical Analysis

Once the various steps of the quality check have been passed, the actual
statistical analysis can begin. The statistical Analysis of ScreenR is based on
three different methods:

-   [Z-score filtering](https://pubmed.ncbi.nlm.nih.gov/21515799/)
-   [CAMERA filtering](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3458527/)
-   [ROAST filtering](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2922896/)

### Z-score hit

In order to compute the Z-score, first a list of metrics has to be computed. In
particular a *Log2FC* is computed for the treated vs control samples in the
different conditions. This can be done using the function *compute_metrics()*.
Here is reported an example of treated vs control in different day.

Then the different distribution of the Z-score can be plotted using the
*plot_zscore_distribution* function.

```{r  compute_metrics, message=TRUE, warning=TRUE}
# 2DG
data_with_measure_TRT <- list(
    Time3 = compute_metrics(
        object,
        control = "CTRL", treatment = "TRT",
        day = "Time3"
    ),
    Time4 = compute_metrics(
        object,
        control = "CTRL", treatment = "TRT",
        day = "Time4"
    )
)

plot_zscore_distribution(data_with_measure_TRT, alpha = 0.8) 
```

### Z-score hit

Based on these metrics the Z-score hit identification can be computed using the
*find_zscore_hit* function.

```{r  Z_score_hit, message=TRUE, warning=TRUE}
zscore_hit_TRT <- list(
    Time3 = find_zscore_hit(
        table_treate_vs_control = data_with_measure_TRT$Time3,
        number_barcode = 6, metric = "median"
    ),
    Time4 = find_zscore_hit(
        table_treate_vs_control = data_with_measure_TRT$Time4,
        number_barcode = 6, metric = "median"
    )
)
zscore_hit_TRT
```

### CAMERA

The same can be done with the CAMERA hit using the function *find_camera_hit*.

```{r  CAMERA, message=TRUE, warning=TRUE}
matrix_model <- model.matrix(~ groups)
colnames(matrix_model) <- unique(groups)


camera_hit_TRT <- list(
    Time3 = find_camera_hit(
        screenR_Object = object, matrix_model = matrix_model,
        contrast = "Time3_TRT",
    ),
    Time4 = find_camera_hit(
        screenR_Object = object, matrix_model = matrix_model,
        contrast = "Time4_TRT"
    )
)

camera_hit_TRT
```

### ROAST

Last but not least this is done also for the ROAST hit using the function\
*find_roast_hit*.

```{r ROAST, message=TRUE, warning=TRUE}
roast_hit_TRT <- list(
    Time3 = find_roast_hit(
        screenR_Object = object, matrix_model = matrix_model,
        contrast = "Time3_TRT", 
    ),
    Time4 = find_roast_hit(
        screenR_Object = object, matrix_model = matrix_model,
        contrast = "Time4_TRT"
    )
)

roast_hit_TRT
```

### Find Common Hit

ScreenR considers as final hit only the one result as candidate hit in all three
statistical methods. However this is a particularly stringent method and in some
cases leads to a small number of results. For this reason the user can also
decide to opt for a less stringent method that considers only the hits present
in at least two of the statistical methods. The two different strategies can be
computed with the function::

-   **common_hit_TRT_at_least_2**: considering candidate Hits the one present in
    at least two of the three methods (less stringent)

-   **common_hit_TRT_at_least_3**: considering candidate Hits the one present in
    all of the three methods

```{r Common_Hit,  message=TRUE, warning=TRUE}
common_hit_TRT_at_least_2 <- list(
    Time3 = find_common_hit(
        zscore_hit_TRT$Time3, camera_hit_TRT$Time3, roast_hit_TRT$Day3,
        common_in = 2
    ),
    Time4 = find_common_hit(
        zscore_hit_TRT$Time4, camera_hit_TRT$Time4, roast_hit_TRT$Day6,
        common_in = 2
    )
)

common_hit_TRT_at_least_3 <- list(
    Time3 = find_common_hit(
        zscore_hit_TRT$Time3, camera_hit_TRT$Time3, roast_hit_TRT$Time3,
        common_in = 3
    ),
    Time4 = find_common_hit(
        zscore_hit_TRT$Time4, camera_hit_TRT$Time4, roast_hit_TRT$Time4,
        common_in = 3
    )
)
```

### Plot common hit

The intersection of the hits found by the three statistical methods can be
easily visualized using the *plot_common_hit* function.

```{r Venn_diagram_in_at_least_2,  message=TRUE, warning=TRUE}
plot_common_hit(
    hit_zscore = zscore_hit_TRT$Time4, hit_camera = camera_hit_TRT$Time4,
    roast_hit_TRT$Time4, show_elements = FALSE, show_percentage = TRUE
)
```

As we all know, when we deal with statistical methods the is the possibility of
*type I* error also known as "false positive". For this reason is important to
visualize the results obtained. This can be done by visualizing the trend of the
candidate hits obtained using the function **plot_trend**.

```{r plot_trend,  message=TRUE, warning=TRUE}
candidate_hits <- common_hit_TRT_at_least_2$Time3

plot_trend(screenR_Object = object, 
           genes = candidate_hits[1], 
           nrow = 2, ncol = 2, 
           group_var = c("Time1", "Time2", "TRT"))

plot_trend(screenR_Object = object, 
           genes = candidate_hits[2], 
           nrow = 2, ncol = 2, 
           group_var = c("Time1", "Time2", "TRT"))
```

```{r,  message=TRUE, warning=TRUE}
sessionInfo()
```