---
author:
- 'Aleksandra P<span>e</span>kowska, Simon Anders'
title: 'Basics of ChIP-seq data analysis'
...

# Introduction

This vignette describes steps of a basic analysis of ChIP-seq data. To
exemplify this tutorial, we use ChIP-seq data for the lysine 27
acetylation of the histone H3 (i.e H3K27ac).

## Objectives of this tutorial

After completing this vignette, you will be able to:\
1. Read a ChIP-seq experiment into *R*\
2. Extend the reads and bin the data (details and relevance discussed
later)\
3. Generate .bedGraph files\
4. Visualize ChIP-seq data with *R*\
5. Perform basic analysis of ChIP-seq peaks\
6. Generate average profiles and heatmaps of ChIP-seq enrichment around
a set of annotated genomic loci\
In the appendix part, we show how to download, preprocess and asses the
quality of .fastq files.

# Data

H3K27ac is a histone modification associated with active promoters and
enhancers. We downloaded data corresponding to a ChIP-seq experiment
with two biological replicates of mouse Embryonic Stem cells (mESC)
along with the input control sample *Histone H3K27ac separates active
from poised enhancers and predicts developmental state* by Creyghton *et
al*., PNAS 2010.

## Preprocessing of data

The first part of ChIP-seq analysis workflow consists in read
preprocessing. We will not focus here on these first steps, we outline
them and provide the code in the *Appendix* part of the vignette. The
three major steps in the preprocessing are briefly outlined below.

### Initial quality assessment

Sequenced reads are saved in .fastq files. The very first step in the
analyses of sequencing results consists in quality assessment. The *R*
package *ShortRead* provides a *qa* to perform this analysis. The reader
can find the code necessary to generate a *HTML* read quality control
report in the *Appendix* part of the vignette.

### External to *R* data opperations

Initial parts of the analysis of sequenced reads include: alignment,
filtering and peak finding. They can be performed using tools such as
*Bowtie2*, *samtools* or *MACS*. We provide all the necessary code in
the *Appendix* part of the vignette.

### Additional considerations

Two steps from this vignette (visualization and read distribution
analysis) require *biomart* database querying via the internet. In order
to avoid many downloads, we provide the necessary objects in the data
package *EpigeneticsCSAMA*. The code for the generation of these objects
can be found in the *Appendix* of the vignette.

Additionally, in order to reduce memory requirements, we restrict our
analysis to the filtered reads mapping to chromosome 6.

## Data package

The data files produced by the steps above are placed in the R objects
of a data package called *EpigeneticsCSAMA*, which we load now. (Note
that such a data package is used for convenience in this course, but
typically, you would not package up interemediate data in this way.)

```{r Dir, echo=TRUE, eval=TRUE}
library(EpigeneticsCSAMA)
dataDirectory =  system.file("bedfiles", package="EpigeneticsCSAMA")
```

The variable *dataDirectory* shows the directory containing the data
objects necessary for this vignette.

```{r DirShow}
dataDirectory
```

In order to explore this files, have a look at them with a text editor
or via a terminal emulator.

# Reading the filtered ChIP-seq reads

We need to load the *GenomicRanges*, *rtracklayer* and *IRanges*
packages. To read the .bam file to *R*, we use the *import.bed* function
from the *rtracklayer* package. The result is a *GRanges* object. This
is an extremely useful and powerful class of objects which the readers
are already familiar with. Each filtered read is represented here as a
genomic interval.

```{r RepresentReadsAsGRanges,eval=TRUE, results='hide'}

library(GenomicRanges)
library(rtracklayer)
library(IRanges)

input = import.bed(file.path(dataDirectory, 'ES_input_filtered_ucsc_chr6.bed'))
rep1 = import.bed(file.path(dataDirectory, 'H3K27ac_rep1_filtered_ucsc_chr6.bed'))
rep2 = import.bed(file.path(dataDirectory, 'H3K27ac_rep2_filtered_ucsc_chr6.bed'))

```

The objects *input*, *rep1* and *rep2* hold the genomic annotation of
the filtered reads for the input sample and ChIP-seq replicate 1 and
replicate 2, respectively. We display the *rep1* object. We see that the
strand information, read name along with alignment score are included as
information for each read.

```{r dataStr}
rep1
```

We see that we have roughly the same number of reads in the input and
IP-ed experiments.

```{r ReadNumber}
length(input)
length(rep1)
length(rep2)
```

# Preparation of the ChIP-seq and control samples: read extension

The reads correspond to sequences at the end of each IP-ed fragment
(single-end sequencing data). We need to extend these reads in order to
represent each IP-ed DNA fragment.

We estimate the mean read length using the *estimate.mean.fraglen*
function from *chipseq* packege. Next, we extend the reads to the
inferred read length using the *resize* function. We remove any reads
for which the coordinates, after the extension, exceed chromosome
length. These three analysis steps are wrapped in a single function
*prepareChIPseq* function which we define below.

```{r ReadExtension_Definition, results='hide'}
library(chipseq)

prepareChIPseq = function(reads){
    frag.len = median( estimate.mean.fraglen(reads) )
    cat( paste0( 'Median fragment size for this library is ', round(frag.len)))
    reads.extended = resize(reads, width = frag.len)
    return( trim(reads.extended) )
}
```

We next apply it to the input and ChIP-seq samples.

```{r ReadExtension,eval=TRUE}
input = prepareChIPseq( input )
rep1 = prepareChIPseq( rep1 )
rep2 = prepareChIPseq( rep2 )
```

Compare with above to see how *rep1* has changed.

```{r Rep1Inspect}
rep1
```

# Binning the ChIP-seq and control

The next step in the analysis is to count how many reads map to each of
the pre-established genomic intervals (bins).

## Generation of bins

We will tile the genome into non-overlapping bins of size 200 bp. To
this end we need the information about chromosome sizes in the mouse
genome (assembly *mm9*). In the data package, we provide the object *si*
(strand information), which holds these data. The reader can find the
code necessary to create the *si* object in the *Obtaining *si* object
for *mm9** of the *Appendix*.

```{r GetBins_preps}
data(si)
si
```

Next, we use the *tileGenome* function from the *GenomicRanges* package
to generate a *GRanges* object with intervals covering the genome in
tiles (bins) of size of 200 bp.

```{r GetBins,eval=TRUE}
binsize = 200
bins = tileGenome(si['chr6'], tilewidth=binsize,
                  cut.last.tile.in.chrom=TRUE)
bins
```

## Binning

We now count how many reads fall into each bin. For this purpose, we
define the function *BinChIPseq*. It takes two arguments, *reads* and
*bins* which are *GRanges* objects.

```{r Binning_function,eval=TRUE}
BinChIPseq = function( reads, bins ){

       mcols(bins)$score = countOverlaps( bins, reads ) 
       return( bins ) 
}
```

Now we apply it to the objects *input*, *rep1* and *rep2*. We obtain
*input.200bins*, *rep1.200bins* and *rep2.200bins*, which are *GRanges*
objects that contain the binned read coverage of the input and ChIP-seq
experiments.

```{r Binning, eval=TRUE}
input.200bins = BinChIPseq( input, bins )
rep1.200bins = BinChIPseq( rep1, bins )
rep2.200bins = BinChIPseq( rep2, bins )

rep1.200bins
```

We can plot coverage for 1000 bins, starting from bin 200,000.

```{r simplePlot,fig.width=7, fig.height=7, out.width='.65\\linewidth', fig.align='center'}
plot( 200000:201000, rep1.200bins$score[200000:201000], 
   xlab="chr6", ylab="counts per bin" )
```

Below, we will see more sophisticated ways of plotting coverage.

## Exporting binned data

At this step of the analysis, the data is ready to be visualized and
shared. One of the most common means of sharing ChIP-seq data is to
generate .wig, .binWig or .bedGraph files. They are memory and
size-efficient files holding the information about the signal along the
genome. Moreover, these files can be visualized in genome browsers such
as IGV and IGB. We show how to export the binned data as a .bedGraph
file.

```{r ExportbedGraphFiles}
export(input.200bins, 
       con='input_chr6.bedGraph',
       format = "bedGraph")
export(rep1.200bins, 
       con='H3K27ac_rep1_chr6.bedGraph',
       format = "bedGraph")
export(rep2.200bins, 
       con='H3K27ac_rep2_chr6.bedGraph',
       format = "bedGraph")
```

In the next section, we see how to visualize bedGraph files using *R*.

# Visualisation of ChIP-seq data wit *Gviz*

Now, we have data which we would like to display along the genome. *R*
offers a flexible infrastructure for visualisation of many types of
genomics data. Here, we use the *Gviz* package for these purposes.

```{r Visualisation_Prep_libs, results='hide'}
library(Gviz)
```

The principle of working with *Gviz* relies on the generation of tracks
which can be, for example ChIP-seq signal along the genome, ChIP-seq
peaks, gene models or any kind of other data such as annotation of CpG
islands in the genome. We start with loading the gene models for
chromosome 6 starting at position 122,530,000 and ending at position
122,900,000. We focus on this region as it harbors the *Nanog* gene,
which is stongly expressed in ES cells.

We obtain the annotation using *biomaRt* package. Work with *biomaRt*
package relies on querying the *biomart* database. In the *Appendix*, we
show how to obtain gene models for protein coding genes for the archive
mouse genome assembly (mm9) and how to generate the *bm* object holding
the annotation of all the RefSeq genes.

```{r BM}
data(bm)
bm
```

We include the *GenomeAxisTrack* object which is a coordinate axis
showing the genomic span of the analyzed region.

```{r AT}
AT = GenomeAxisTrack( )
```

We plot the result using the *plotTracks* function. We choose the region
to zoom into with the *from* and *to* arguments. The
*transcriptAnnotation* argument allows to put the gene symbols in the
plot.

```{r Visualisation_Gviz, fig.width=5, fig.height=3, out.width='.95\\linewidth', fig.align='center'}
plotTracks(c( bm, AT),
           from=122530000, to=122900000,
           transcriptAnnotation="symbol", window="auto", 
           cex.title=1, fontsize=10 )
```

We next add our two data tracks to the plot. We first generate
*DataTrack* objects with *DataTrack* function. We include the
information about how the track is be labaled and colored. We obtain
*input.track*, *rep1.track* and *rep2.track* objects.

```{r dataTrackGet}
input.track = DataTrack(input.200bins, 
                        strand="*", genome="mm9", col.histogram='gray',
                        fill.histogram='black', name="Input", col.axis="black",
                        cex.axis=0.4, ylim=c(0,150))

rep1.track = DataTrack(rep1.200bins, 
                        strand="*", genome="mm9", col.histogram='steelblue',
                        fill.histogram='black', name="Rep. 1", col.axis='steelblue',
                        cex.axis=0.4, ylim=c(0,150))

rep2.track = DataTrack(rep2.200bins, 
                        strand="*", genome="mm9", col.histogram='steelblue',
                        fill.histogram='black', name="Rep. 2", col.axis='steelblue',
                        cex.axis=0.4, ylim=c(0,150))
```

Finally, we plot these tracks along with the genomic features. We
observe a uniform coverage in the case of the input track and pronounced
peaks of enrichment H3K27ac in promoter and intergenic regions.
Importantly, H3K27ac enriched regions are easily identified.

```{r dataTrackPlot, fig.width=5, fig.height=4, out.width='.95\\linewidth', fig.align='center'}
plotTracks(c(input.track, rep1.track, rep2.track, bm, AT),
           from=122530000, to=122900000,
           transcriptAnnotation="symbol", window="auto", 
           type="histogram", cex.title=0.7, fontsize=10 )
```

# ChIP-seq peaks

ChIP-seq experiments are designed to isolate regions enriched in a
factor of interest. The identification of enriched regions, often
refered to as peak finding, is an area of research by itself. There are
many algorithms and tools used for peak finding. The choice of a method
is strongly motivated by the kind of factor analyzed. For instance,
transcription factor ChIP-seq yield well defined narrow peaks whereas
histone modifications ChIP-seq experiments such as H3K36me3 yield
extended regions of high coverage. Finally, ChIP-seq with antobodies
recognizing polymerase II result in narrow peaks combined with extended
regions of enrichment.

## Identification of peaks

As we saw in the previous section of the tutorial, H3K27ac mark shows
well defined peaks. In such a case, *MACS* is one of the most commonly
used software for peak finding. ChIP-seq peak calling can also be done
in *R* with the *BayesPeak* package. However, we stick here to the most
common approach and use *MACS*. We ran *MACS* for you and provide the
result in the data package. You can find the code necessary to obtain
the peaks in the *Appendix* of the vignette.

## Peaks – basic analysis in *R*

We next import the .bed files of the isolated peaks from the data
package.

```{r MACSreadingtoR}
peaks.rep1 = import.bed(file.path(dataDirectory,'Rep1_peaks_ucsc_chr6.bed'))
peaks.rep2 = import.bed(file.path(dataDirectory,'Rep2_peaks_ucsc_chr6.bed'))
```

First step in the analysis of the identified peaks is to simply display
them in the browser, along with the ChIP-seq and input tracks. To this
end, we use *AnnotationTrack* function. We display peaks as boxes
colored in blue.

```{r PeaksInBrowser_preps}
peaks1.track = AnnotationTrack(peaks.rep1, 
                               genome="mm9", name='Peaks Rep. 1',
                               chromosome='chr6',
                               shape='box',fill='blue3',size=2)
peaks2.track = AnnotationTrack(peaks.rep2, 
                               genome="mm9", name='Peaks Rep. 2',
                               chromosome='chr6',
                               shape='box',fill='blue3',size=2)
```

We visualise the *Nanog* locus.

```{r PeaksInBrowserPlot_nanog, fig.width=5, fig.height=4, out.width='.95\\linewidth', fig.align='center'}
plotTracks(c(input.track, rep1.track, peaks1.track,
             rep2.track, peaks2.track, bm, AT),
           from=122630000, to=122700000,
           transcriptAnnotation="symbol", window="auto", 
           type="histogram", cex.title=0.7, fontsize=10 )
```

We can see that *MACS* has succesfully identified regions showing high
H3K27ac signal. We see that both biological replicates agree well,
however, in some cases peaks are called only in one sample. In the next
section, we will analyse how often do we see the overlap between peaks
and isolate reproducible peaks.

## Venn diagrams

We first find the overlap between the peak sets of the two replicates.

```{r findOverlap}
ovlp = findOverlaps( peaks.rep1, peaks.rep2 )
ovlp
```

If a peak in one replicate overlaps with mutiple peaks in the other
replicate, it will appear multiple times in *ovlp*. To see, how many
peaks overlap with something in the other replicate, we count the number
of unique peaks in each of the two columns of *ovlp* and take the
smaller of these two counts to as the number of common peaks.

```{r nbrCommonPeaks}
ov = min( length(unique( queryHits(ovlp) )), length(unique( subjectHits(ovlp) ) ) )
```

We draw this as a Venn diagram, using the *draw.pairwise.venn* function
from the *VennDiagram* package.

```{r VennDiagram1, fig.width=4, fig.height=4, out.width='.75\\linewidth', fig.align='center'}
library(VennDiagram)

draw.pairwise.venn( 
   area1=length(peaks.rep1),
   area2=length(peaks.rep2), 
   cross.area=ov, 
   category=c("rep1", "rep2"), 
   fill=c("steelblue", "blue3"), 
   cat.cex=0.7)
```

We will focus only on peaks identified in both replicates (hereafter
refered to as enriched areas). The enriched areas are colored in green.

```{r EnrichedRegionsIsolation, fig.width=5, fig.height=5, out.width='.95\\linewidth', fig.align='center'}
enriched.regions = Reduce(subsetByOverlaps, list(peaks.rep1, peaks.rep2))

enr.reg.track = AnnotationTrack(enriched.regions,
                                genome="mm9", name='Enriched regions',
                                chromosome='chr6',
                                shape='box',fill='green3',size=2)

plotTracks(c(input.track, rep1.track, peaks1.track,
             rep2.track, peaks2.track, enr.reg.track, 
             bm, AT),
           from=122630000, to=122700000,
           transcriptAnnotation="symbol", window="auto", 
           type="histogram", cex.title=0.5, fontsize=10 )

```

## Isolation of promoters overlapping H3K27ac peaks

One of the questions of a ChIP seq analyses is to which extend
ChIP-enriched regions overlap a chosen type of features, such as
promoters or regions enriched with other modifications. To this end, the
overlap between peaks of ChIP-seq signal and the features of interest is
analysed.

We exemplify such an analysis by testing how many of the H3K27ac
enriched regions overlap promoter regions.

### Identification of promoters

As shown in the Appendix, we have used *biomaRt* to get coordinates for
start and end of all mouse genes. (These are the coordinates of the
outermost UTR boundaries.) We load the results of the *biomaRt* query
from the data package. It is given in the object *egs*, a *data.frame*
containing *ensembl* ID along with gene symbols, genomic coordinates and
orientation of of mouse genes.

```{r TSS}
data(egs)
head(egs)
```

We next identify the transcription start site (TSS), taking into account
gene orientation.

```{r TSSfinding}
egs$TSS = ifelse( egs$strand == "1", egs$start_position, egs$end_position )
head(egs)
```

We consider regions of $\pm 200$ bp around the TSS as promoters.

```{r Promoter}
promoter_regions = 
  GRanges(seqnames = Rle( paste0('chr', egs$chromosome_name) ),
          ranges = IRanges( start = egs$TSS - 200,
                            end = egs$TSS + 200 ),
          strand = Rle( rep("*", nrow(egs)) ),
          gene = egs$external_gene_id)
promoter_regions
```

### Overlapping promoters with H3K27ac enriched regions

Now we would like to know how many of out the promoters overlap with a
H3K27ac enriched regions.

```{r}
ovlp2 = findOverlaps( enriched.regions, promoter_regions )

cat(sprintf( "%d of %d promoters are overlapped by an enriched region.",
   length( unique(subjectHits(ovlp2)) ), length( promoter_regions ) ) )
```

We can also turn the question around:

```{r}
ovlp2b = findOverlaps( promoter_regions, enriched.regions )

cat(sprintf( "%d of %d enriched regions overlap a promoter.",
   length( unique( subjectHits(ovlp2b) ) ), length( enriched.regions ) ) )
```

Is this a significant enrichment? To see, we first calculate how much
chromosome 6 is part of a promoter region. The following command reduces
the promoter list to non-overlapping intervals and sums up their widths

```{r}
promotor_total_length = sum(width(reduce(promoter_regions)))
promotor_total_length
```

Which fraction of the chromsome is this?

```{r}
promotor_fraction_of_chromosome_6 = promotor_total_length / seqlengths(si)["chr6"]
```

Nearly a quarter of promoters are overlapped by H3K27ac-enriched regions
even though they make up only half a percent of the chromosome. Clearly,
this is a strong enrichment. A binomial test confirms this:

```{r}
binom.test( length( unique( subjectHits( ovlp2b ) ) ), length( enriched.regions ), promotor_fraction_of_chromosome_6 )
```

Which promotors are overlapped with an H3K27ac peak? Let’s see some
examples:

```{r promoterRegionTiling,eval=TRUE}
pos.TSS = egs[ unique( queryHits( findOverlaps( promoter_regions, enriched.regions ) ) ),]
pos.TSS[1:3,]
```

The first three promoters identified as overlapping a H3K27ac peak
include: *Brpf1*, *Ogg1* and *Camk1 loci*.

## Analysis of the distribution of H3K27ac around a subset of gene promoters

In this part of the analysis, we show how to generate plots displaying
the distribution of ChIP-seq signal around certain genomic positions,
here a set of promoter regions. These include a heatmap representation
and an average profile for H3K27ac signal at promoters overlapping a
peak of H3K27ac identified by *MACS*. These are one of the most
frequently performed analysis steps in ChIP-seq experiments.

In the previous section, we have identified promoters overlaping a
H3K27ac peak (the *pos.TSS* object). In order to get a comprehensive
view of the distribution of H3K27ac around the corresponding TSS, we
extend the analysed region to $\pm 1000$ bp around the TSS. We divide
each of these 2000 bp regions into 20 bins of 100 bp length each and
order the bins with increasing position for genes on the ’+’ strand and
decreasing for genes on the ’-’ strand.

Next, we tile the promoter regions with consecutive 100bp tiles. For
each region, we order the tiles according to the gene orientation. We
create 20 tiles per promoter region.

```{r Tiles}
tiles = sapply( 1:nrow(pos.TSS), function(i)
   if( pos.TSS$strand[i] == "1" )
      pos.TSS$TSS[i] + seq( -1000, 900, length.out=20 )
   else
      pos.TSS$TSS[i] + seq( 900, -1000, length.out=20 ) )

tiles = GRanges(tilename = paste( rep( pos.TSS$ensembl_gene_id, each=20), 1:20, sep="_" ),
                seqnames = Rle( rep(paste0('chr', pos.TSS$chromosome_name), each=20) ), 
                ranges = IRanges(start = as.vector(tiles),
                                 width = 100),
                strand = Rle(rep("*", length(as.vector(tiles)))),
                seqinfo=si)

tiles                
```

Next, we count how many reads are mapping to each tile. The resulting
vector *H3K27ac.p* is next used to create a matrix (*H3K27ac.p.matrix*),
where each row is a H3K27ac-enriched promoter. Each column coresponds to
a consecutive 100bp tile of 2000 bp region around the TSS overlapping a
H3K27ac peak. Since we have divided each promoter region in 21 tiles, we
obtain a matrix with 21 columns and 634 rows (the number of promoters
overlapping H3K27ac peak).

```{r AverProf_I,eval=TRUE}
H3K27ac.p = countOverlaps( tiles, rep1) +
  countOverlaps( tiles, rep2 )

H3K27ac.p.matrix = matrix( H3K27ac.p, nrow=nrow(pos.TSS), 
                           ncol=20, byrow=TRUE )
```

Finally, we plot the result as a heatmap and as a plot of average values
per each tile for all the included promoters.

```{r Aver_plot, fig.width=8, fig.height=10, out.width='.70\\linewidth', fig.align='center', dev.args = list(pointsize=11)}
colors = colorRampPalette(c('white','red','gray','black'))(100) 

layout(mat=matrix(c(1,2,0,3), 2, 2), 
       widths=c(2,2,2), 
       heights=c(0.5,5,0.5,5), TRUE)


par(mar=c(4,4,1.5,1))
image(seq(0, max(H3K27ac.p.matrix), length.out=100), 1,
      matrix(seq(0, max(H3K27ac.p.matrix), length.out=100),100,1),
      col = colors,
      xlab='Distance from TSS', ylab='',
      main='Number of reads', yaxt='n',
      lwd=3, axes=TRUE)
box(col='black', lwd=2)
image(x=seq(-1000, 1000, length.out=20),
      y=1:nrow(H3K27ac.p.matrix),
      z=t(H3K27ac.p.matrix[order(rowSums(H3K27ac.p.matrix)),]), 
      col=colors,
      xlab='Distance from TSS (bp)',
      ylab='Promoters', lwd=2)
box(col='black', lwd=2)
abline(v=0, lwd=1, col='gray')
plot(x=seq(-1000, 1000, length.out=20),
     y=colMeans(H3K27ac.p.matrix),
     ty='b', pch=19,
     col='red4',lwd=2,
     ylab='Mean tag count',
     xlab='Distance from TSS (bp)')
abline(h=seq(1,100,by=5),
       v=seq(-1000, 1000, length.out=20),
       lwd=0.25, col='gray')
box(col='black', lwd=2)

```

We observe a strong enrichment of H3K27ac modification right after the
TSS and a weaker peak of H3K27ac at the region immediately upstream of
the TSS.

# Session info

```{r}
sessionInfo()
```

# Appendix

## Obtaining data from European Nucleotide Archive

The European Nucleotide Archive (http://www.ebi.ac.uk/ena) provides many
types of raw sequencing data, sequence assembly information and
functional annotation. We download the data corresponding to ChIP-seq
experiment mapping the H3K27ac histone modification in mouse Embryonic
Stem cells (mES cells) along with the input control sample from the
study *Histone H3K27ac separates active from poised enhancers and
predicts developmental state* by Creyghton *et al*.

```{r DataDownload, echo=TRUE,eval=FALSE}
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR066/SRR066787/SRR066787.fastq.gz .
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR066/SRR066766/SRR066766.fastq.gz .
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR066/SRR066767/SRR066767.fastq.gz .
```

## Read quality

Read quality is the first step in all the analyses of sequenced reads.
The package *ShortRead* provides a function taking as input the .fastq
files downloaded from the ENA database. We first generate a vector with
fastq file names.

```{r ReadQuality_preps, echo=TRUE,eval=FALSE}
fls = list.files(dataDirectory, ".fastq$", full=TRUE)
names(fls) = sub(".fastq", "", basename(fls))
```

We read each of these files and apply the *qas* function assessing the
quality of the reads in each file. Finally, we generate a *HTML* quality
report.

```{r QA, echo=TRUE,eval=FALSE}
library(ShortRead)
qas = lapply(seq_along(fls),
              function(i, fls) qa(readFastq(fls[i]), names(fls)[i]),
              fls)
qa = do.call(rbind, qas)
rpt = report(qa,dest = 'QA_report.html')
```

## External file preparations

The next step is to align the reads to mm9 mouse genome assembly. This
is done using *Bowtie2* tool. The resulting .sam files are next
transformed to .bam files and filtered for best aligned reads using
*samtools*. PCR duplicates are removed. BAM files are next transfomed to
bed files. For the sake of consistency with other tools, in the final
step of data preprocessing we add a ’chr’ prefix to the chromosome names
using *awk*.

```{r ReadProcessing, echo=TRUE, eval=FALSE}
gunzip SRR066787.fastq.gz
gunzip SRR066766.fastq.gz 
gunzip SRR066767.fastq.gz 
```

## Alignment

```{r Alignment, echo=TRUE, eval=FALSE}
bowtie2 -p 8 -q NCBIM37.67 SRR066787.fastq -S ES_input.sam
bowtie2 -p 8 -q NCBIM37.67 SRR066766.fastq -S H3K27ac_rep1.sam
bowtie2 -p 8 -q NCBIM37.67 SRR066767.fastq -S H3K27ac_rep2.sam
```

## Retaining only best alignments

```{r BestQualityRead, echo=TRUE, eval=FALSE}
samtools view -bS -q 40 ES_input.sam > ES_input_bestAlignment.bam
samtools view -bS -q 40 H3K27ac_rep1.sam > H3K27ac_rep1_bestAlignment.bam
samtools view -bS -q 40 H3K27ac_rep2.sam > H3K27ac_rep2_bestAlignment.bam
```

## PCR duplicate removal

```{r PCRDuplRemoval, echo=TRUE, eval=FALSE}
samtools rmdup -s  ES_input_bestAlignment.bam ES_input_filtered.bam
samtools rmdup -s  H3K27ac_rep1_bestAlignment.bam H3K27ac_rep1_filtered.bam
samtools rmdup -s  H3K27ac_rep2_bestAlignment.bam H3K27ac_rep2_filtered.bam
```

## Transforming reads to .bed format

```{r BAMTOBED, echo=TRUE, eval=FALSE}
bedtools bamtobed -i ES_input_filtered.bam > ES_input_filtered.bed
bedtools bamtobed -i H3K27ac_rep1_filtered.bam > H3K27ac_rep1_filtered.bed
bedtools bamtobed -i H3K27ac_rep2_filtered.bam > H3K27ac_rep2_filtered.bed
```

## Additional preparations

```{r Prefixes, echo=TRUE, eval=FALSE}
awk '$0="chr"$0' ES_input_filtered.bed > ES_input_filtered_ucsc.bed
awk '$0="chr"$0' H3K27ac_rep1_filtered.bed > H3K27ac_rep1_filtered_ucsc.bed
awk '$0="chr"$0' H3K27ac_rep2_filtered.bed > H3K27ac_rep2_filtered_ucsc.bed
```

Finally, for the purpose of this lab, we isolate data for only one
chromosome (chr6).

```{r bedSubsetting, echo=TRUE, eval=FALSE}
awk '{if($1=="chr6") print $0}' ES_input_filtered_ucsc.bed 
> ES_input_filtered_ucsc_chr6.bed
awk '{if($1=="chr6") print $0}' H3K27ac_rep1_filtered_ucsc.bed 
> H3K27ac_rep1_filtered_ucsc_chr6.bed
awk '{if($1=="chr6") print $0}' H3K27ac_rep2_filtered_ucsc.bed  
> H3K27ac_rep2_filtered_ucsc_chr6.bed
```

### Obtaining object *si* for *mm9*

We obtain chromosome lengths from the *BSgenome.Mmusculus.UCSC.mm9*
package. The chromosome names in the *si* file are in the *ensembl*
format, we add a prefix ’chr’ to chromosome names.

```{r Getmm9SequenceInfo, echo=TRUE,eval=FALSE}
library(BSgenome.Mmusculus.UCSC.mm9)
genome = BSgenome.Mmusculus.UCSC.mm9
si = seqinfo(genome)
si = si[ paste0('chr', c(1:19, 'X', 'Y'))]
```

### Obtaining object *bm* for *mm9*

```{r Visualisation_Prep_mart, eval=FALSE}
library(biomaRt)
mart = useMart(biomart = "ENSEMBL_MART_ENSEMBL", 
               dataset = "mmusculus_gene_ensembl", 
               host="may2012.archive.ensembl.org")
fm = Gviz:::.getBMFeatureMap()
fm["symbol"] = "external_gene_id"
```

Next, we get a snapshot of the results for chromosome 6 starting at
position 122530000 and ending at position 122900000. This region amongst
others encodes a highly ES cell specific *Nanog* gene. We first isolate
gene models for this interval. The result *bm* is saved in the data
directory.

```{r Visualisation_Prep_region,eval=FALSE}
bm = BiomartGeneRegionTrack(chromosome='chr6', genome="mm9", 
                             start=122530000, end = 122900000, 
                             biomart=mart,filter=list("with_ox_refseq_mrna"=TRUE), 
                             size=4, name="RefSeq", utr5="red3", utr3="red3", 
                             protein_coding="black", col.line=NULL, cex=7,
                             collapseTranscripts="longest",
                             featureMap=fm)
```

### Peak finding with *MACS*

```{r macs,eval=FALSE}
macs14 -t H3K27ac_rep1_filtered.bed -c ES_input_filtered_ucsc.bed -f BED -g mm --nomodel -n Rep1
macs14 -t H3K27ac_rep2_filtered.bed -c ES_input_filtered_ucsc.bed -f BED -g mm --nomodel -n Rep2
awk '$0="chr"$0' Rep1_peaks.bed > Rep1_peaks_ucsc.bed
awk '$0="chr"$0' Rep2_peaks.bed > Rep2_peaks_ucsc.bed
awk '{if($1=="chr6") print $0}' Rep1_peaks_ucsc.bed > Rep1_peaks_ucsc_chr6.bed
awk '{if($1=="chr6") print $0}' Rep2_peaks_ucsc.bed > Rep2_peaks_ucsc_chr6.bed
```

### Promoter isolation

Here we provide the code necessary to isolate gene models from the
*biomart* data base. The object *egs* contains the annotation of the
most external 5 and 3 prime UTRs for each gene model.

```{r usingMartToFindFeaturesOfInterest,eval=FALSE}
listAttributes(mart)[1:3,]
ds = useDataset('mmusculus_gene_ensembl', mart=mart)
chroms = 6

egs = getBM(attributes = c('ensembl_gene_id','external_gene_id',
                           'chromosome_name','start_position',
                           'end_position','strand'), 
            filters='chromosome_name',
            values=chroms,
            mart=ds)
```