This package and the underlying code are distributed under the Artistic license
2.0. You are free to use and redistribute this software.
 Rationale
“The ENCODE (Encyclopedia of DNA Elements) Consortium is an international
collaboration of research groups funded by the National Human Genome Research
Institute (NHGRI). The goal of ENCODE is to build a comprehensive parts list of
functional elements in the human genome, including elements that act at the
protein and RNA levels, and regulatory elements that control cells and
circumstances in which a gene is active” source:
ENCODE Projet Portal.
However, retrieving and downloading data can be time consuming using the
current web portal, especially when multiple files from different experiments
are involved.
This package has been designed to facilitate access to ENCODE data by compiling the
metadata associated with files, experiments, datasets, biosamples, and treatments.
We implemented time-saving features to select ENCODE files by querying their
metadata, downloading them and validating that the file was correctly
downloaded.
This vignette will introduce the main features of the ENCODExplorer package.
 
 Main functions
 Query
The queryEncode function allows the user to find the subset of files corresponding to
a precise query defined according to the following criteria :
By default, the query function uses exact string matching to perform the
selection of the relevant entries. This behavior can be changed by modifying the
fixed or fuzzy parameters. Setting fixed to FALSE will perform
case-insensitive regular expression matching. Setting fuzzy to TRUE will
retrieve search results where the query string is a partial match.
The result set is a subset of the encode_df_lite table.
For example, to select all fastq files originating from assays on the MCF-7
(human breast cancer) cell line:
query_results <- queryEncode(organism = "Homo sapiens", 
                      biosample_name = "MCF-7", file_format = "fastq",
                      fixed = TRUE)
## Results : 811 files, 233 datasets
The same request with approximate spelling of the biosample name will return
no results:
query_results <- queryEncode(organism = "Homo sapiens", biosample_name = "mcf7",
                        file_format = "fastq", fixed = TRUE,
                        fuzzy = FALSE)
## No result found in encode_df. You can try the <searchEncode> function or set the fuzzy option to TRUE.
However, if you follow the warning guidance and set the fuzzy parameter to
TRUE:
query_results <- queryEncode(organism = "Homo sapiens",
                    biosample_name = "mcf7", file_format = "fastq",
                    fixed = TRUE, fuzzy = TRUE)
## Results : 811 files, 233 datasets
You can also perform matching through regular expressions by setting fixed to
FALSE.
query_results <- queryEncode(assay = ".*RNA-seq",
                    biosample_name = "HeLa-S3", fixed = FALSE)
## Results : 318 files, 11 datasets
table(query_results$assay)
## 
## polyA minus RNA-seq  polyA plus RNA-seq       small RNA-seq 
##                  90                 150                  78
Finally, the queryEncodeGeneric function can be used to perform searches on
columns which are not part of the queryEncode interface but are present within
the encode_df_lite data.table:
query_results <- queryEncodeGeneric(biosample_name="HeLa-S3",
                    assay="RNA-seq", submitted_by="Diane Trout",
                    fuzzy=TRUE)
## Results : 54 files, 2 datasets
table(query_results$submitted_by)
## 
## Diane Trout 
##          54
These criteria correspond to the filters that you can find on ENCODE portal:
 
 fuzzySearch
This function is a more user-friendly version of queryEncode that also
perform searches on the encode_df_lite object. The character vector or the
list of characters specified by the user will be searched for in every column of
the database. The user can also constrain the query by selecting the specific
columns in which to search for the query term by using the filterVector
parameter.
The following request will produce a data.table with every files containing
the term brca.
fuzzy_results <- fuzzySearch(searchTerm = c("brca"))
## Results: 236 files, 7 datasets
Multiple terms can be searched simultaneously. This example extracts all
files containing brca or ZNF24 within the target column.
fuzzy_results <- fuzzySearch(searchTerm = c("brca", "ZNF24"),
                             filterVector = c("target"),
                             multipleTerm = TRUE)
## Results: 710 files, 17 datasets
When searching for multiple terms, three type of input can be passed to the
searchTerm parameter :
- A single character where the various terms are separated by commas
- A character vector
- A list of characters
 
 Search
This function simulates a keyword search performed through the
ENCODE web portal.
The searchEncode function returns a data frame corresponding to the result
page provided by the ENCODE portal. If a specific file or dataset isn’t
available with fuzzySearch or queryEncode (i.e. within get_encode_df()),
the user can access the latest data from the ENCODE database through the
searchEncode function.
The searchToquery function convert the result of a search to a data.table
with the same design as get_encode_df(). This format contains more metadata
and allow the user to extract all files within the dataset. This format also
allows the user to create a design using the createDesign function.
Here is the example of the following search : “a549 chip-seq homo sapiens”.
On ENCODE portal :
With our function :
  search_results <- searchEncode(searchTerm = "a549 chip-seq homo sapiens",
                                 limit = "all")
## results : 783
 
 createDesign
This function organizes the data.table created by fuzzySearch, queryEncode
or searchToquery. It extracts the replicate and control files within a dataset.
It creates a data.table with the file accessions, the dataset accessions and
numeric values associated with the nature of the file (1:replicate / 2:control)
when the format parameter is set to long.
By setting the format parameter to wide, each dataset will have its own column
as illustrated below.
 
 downloadEncode
downloadEncode allows a user to download a file or an entire dataset. Downloading
files can be done by providing a vector of file accessions or dataset accessions
(represented by the accession column in get_encode_df()) to the file_acc parameter.
This parameter can also be the data.table created by queryEncode, fuzzySearch,
searchToquery or createDesign.
If the accession doesn’t exist within the passed-in get_encode_df() database,
downloadEncode will search for the accession directly within the ENCODE database.
The path to the download directory can be specified (default: /tmp).
To ensure the integrity of each file, the md5 sum of each downloaded file
is compared to the reported md5 sum in ENCODE.
Moreover, if the accession is a dataset accession, the function will download each
file in this dataset. The format option, which is set by default to all, enables the
downloading of a specific format.
Here is a small example query:
query_results <- queryEncode(assay = "switchgear", target ="elavl1", fixed = FALSE)
## Results : 2 files, 1 datasets
And its equivalent search:
search_results <- searchEncode(searchTerm = "switchgear elavl1", limit = "all")
## results : 1
To select a particular file format you can:
- add filters to your query and then run the downloadEncodefunction.
query_results <- queryEncode(assay = "switchgear", target ="elavl1",
                             file_format = "bed" , fixed = FALSE)
downloadEncode(query_results)
- specify the format to the downloadEncodefunction.
downloadEncode(search_results, format = "bed")
 
 Conversion
The function searchToquery enables the conversion of the results of
searchEncode to a queryEncode output based on the accession numbers.
The user can then benefit from all the collected metadata and the createDesign
function.
The structure of the result set is similar to the get_encode_df() structure.
Let’s try it with the previous example :
- search
search_results <- searchEncode(searchTerm = "switchgear elavl1", limit = "all")
## results : 1
- convert
convert_results <- searchToquery(searchResults = search_results)
 
 shinyEncode
This function launches the shinyApp of ENCODExplorer that implements the
fuzzySearch and queryEncode search functions. It also allows the creation
of a design to organize and download specific files with the downloadEncode
function. The Search tab of shinyEncode uses the fuzzySearch function for a
low specificity request while the Advanced Search tab uses the queryEncode
function.
 
 
 Summarizing ENCODE data
While queryEncode, searchEncode and downloadEncode gives the user access to
ENCODE’s raw files, ENCODExplorer also provides helper functions which load and
summarize ENCODE data for common biological questions.
 Obtaining consensus peaks from ChIP-Seq
The most common question in a ChIP-Seq assay is: “Where does the protein of
interest bind the genome?” To answer this question, ENCODExplorer provides the
queryConsensusPeaks method. queryConsensusPeaks finds all ChIP-seq peak
files matching the given criteria, split them by treatment group, and builds a
set of “consensus peaks”. The “consensus peaks” identified by ENCODExplorer are
those that appear in all replicates of a given group. Two peaks are considered
to belong to the same binding event if at least one of their nucleotides
overlap.
# Obtain a summary of all peaks for CTCF ChIP-Seq assays in the 22Rv1
# (human prostate carcinoma) cell line.
res = queryConsensusPeaks("22Rv1", "GRCh38", "CTCF")
## Results : 40 files, 2 datasets
## Found the following output_type: peaks and background as input for IDR, optimal IDR thresholded peaks, conservative IDR thresholded peaks
## Selecting optimal idr thresholded peaks. To choose another output_type, specify it in the 'output_type', argument or set use_interactive to TRUE.
## [1] "Success downloading file : ./ENCFF147YCW.bed.gz"
## [1] "Success downloading file : ./ENCFF730MQM.bed.gz"
## [1] "Files can be found at /tmp/RtmpCa8una/Rbuild46967413fc0a/ENCODExplorer/vignettes"
The list of downloaded files is available through the files() method.
files(res)
##            ENCFF147YCW            ENCFF730MQM 
## "./ENCFF147YCW.bed.gz" "./ENCFF730MQM.bed.gz"
The metadata for those file is available through the file_metadata() method.
The file metadata are split according to treatment group:
f_meta = file_metadata(res)
names(f_meta)
## [1] "17β-hydroxy-5α-androstan-3-one;10;nM;4;hour"
## [2] "NA;NA;NA;NA;NA"
f_meta[[1]][,1:5]
##      accession file_accession      file_type file_format file_size
## 1: ENCSR847XGE    ENCFF147YCW bed narrowPeak         bed    1.5 Mb
A data-frame explaining how each treatment group was split is available
through the metadata() method:
metadata(res)
##                        treatment treatment_amount treatment_amount_unit
## 1 17β-hydroxy-5α-androstan-3-one               10                    nM
## 2                           <NA>               NA                  <NA>
##   treatment_duration treatment_duration_unit
## 1                  4                    hour
## 2                 NA                    <NA>
##                                   split_group
## 1 17β-hydroxy-5α-androstan-3-one;10;nM;4;hour
## 2                              NA;NA;NA;NA;NA
The list of all peaks identified in individual files are accessed through
the peaks() method.
names(peaks(res))
## [1] "17β-hydroxy-5α-androstan-3-one;10;nM;4;hour"
## [2] "NA;NA;NA;NA;NA"
names(peaks(res)[[1]])
## [1] "ENCFF147YCW"
peaks(res)[[1]]
## GRangesList object of length 1:
## $ENCFF147YCW
## GRanges object with 90007 ranges and 6 metadata columns:
##                         seqnames              ranges strand |        name
##                            <Rle>           <IRanges>  <Rle> | <character>
##       [1]                   chr8 118811953-118812652      * |        <NA>
##       [2]                  chr11   77411794-77412493      * |        <NA>
##       [3]                   chr3 156174297-156174996      * |        <NA>
##       [4]                   chr5 155844245-155844944      * |        <NA>
##       [5]                   chrX   78101567-78102266      * |        <NA>
##       ...                    ...                 ...    ... .         ...
##   [90003]                   chr7   43838516-43839567      * |        <NA>
##   [90004]                   chr1 225474579-225475646      * |        <NA>
##   [90005] chr17_GL000205v2_ran..         55936-57017      * |        <NA>
##   [90006]                   chr7 151172179-151173297      * |        <NA>
##   [90007]       chrUn_GL000219v1       124914-126043      * |        <NA>
##               score signalValue    pValue    qValue      peak
##           <numeric>   <numeric> <numeric> <numeric> <integer>
##       [1]       857     17.0273        -1   0.10922       350
##       [2]       788     17.0314        -1   0.10943       350
##       [3]       758     17.0363        -1   0.10953       350
##       [4]       617     17.0412        -1   0.10965       350
##       [5]       592     17.0430        -1   0.10968       350
##       ...       ...         ...       ...       ...       ...
##   [90003]      1000     1955.77        -1    5.1613       513
##   [90004]      1000     1962.42        -1    5.1613       548
##   [90005]      1000     2486.40        -1    5.1613       520
##   [90006]      1000     2517.74        -1    5.1613       528
##   [90007]      1000     3011.03        -1    5.1613       528
##   -------
##   seqinfo: 61 sequences from an unspecified genome; no seqlengths
Finally, the consensus peaks (those who are present in all individual
replicates) are accessed through the consensus() method:
names(consensus(res))
## [1] "17β-hydroxy-5α-androstan-3-one;10;nM;4;hour"
## [2] "NA;NA;NA;NA;NA"
consensus(res)
## GRangesList object of length 2:
## $`17β-hydroxy-5α-androstan-3-one;10;nM;4;hour`
## GRanges object with 52988 ranges and 0 metadata columns:
##                         seqnames        ranges strand
##                            <Rle>     <IRanges>  <Rle>
##       [1]                   chr8   71076-71250      *
##       [2]                   chr8 206401-207402      *
##       [3]                   chr8 240248-241146      *
##       [4]                   chr8 265480-266224      *
##       [5]                   chr8 299020-299928      *
##       ...                    ...           ...    ...
##   [52984] chr14_KI270724v1_ran..   31580-32279      *
##   [52985] chr14_GL000009v2_ran.. 157035-157734      *
##   [52986] chr9_KI270719v1_random 147925-148624      *
##   [52987] chr14_KI270726v1_ran..   41635-42264      *
##   [52988] chr1_KI270711v1_random   22862-23263      *
##   -------
##   seqinfo: 61 sequences from an unspecified genome; no seqlengths
## 
## $`NA;NA;NA;NA;NA`
## GRanges object with 55189 ranges and 0 metadata columns:
##                         seqnames        ranges strand
##                            <Rle>     <IRanges>  <Rle>
##       [1]                   chr8   71046-71288      *
##       [2]                   chr8 206321-207382      *
##       [3]                   chr8 229367-230042      *
##       [4]                   chr8 240212-240887      *
##       [5]                   chr8 265493-266220      *
##       ...                    ...           ...    ...
##   [55185] chr9_KI270719v1_random 147868-148543      *
##   [55186] chr9_KI270719v1_random 164211-164886      *
##   [55187] chr14_KI270726v1_ran..   41660-42293      *
##   [55188] chr1_KI270711v1_random   22877-23392      *
##   [55189]       chrUn_KI270743v1   28835-29510      *
##   -------
##   seqinfo: 61 sequences from an unspecified genome; no seqlengths
 
 Fine-tuning a consensus peaks query
Certain versions of the ENCODE pipeline provide multiple calling algorithms.
Also, sometimes multiple labs have performed ChIP-seq experiments on the same
tissue and protein, and these results might not be directly comparable.
ENCODExplorer uses heuristics to try and determine which set of files will
provide the most informative results, but the results of these heuristics might
prove unsatisfactory.
In such cases, a user can provide his own set of ENCODE metadata and his own
choice of splitting columns using the buildQueryConsensus function. The user
can also specify which proportion of individual replicates a peak must appear in
to be included in the consensus peaks through the consensus_threshold parameter:
query_results = queryEncodeGeneric(biosample_name="A549", assembly="GRCh38",
                                   file_format="^bed$", output_type="^peaks$", 
                                   treatment_duration_unit="minute",
                                   treatment_duration="(^5$|^10$)", 
                                   target="NR3C1", fixed=FALSE)
## Results : 6 files, 2 datasets
# Obtain a summary of all peaks for NR3C1 ChIP-Seq assays in the A549
# cell line.
res = buildConsensusPeaks(query_results, split_by=c("treatment_duration"), 
                          consensus_threshold=0.5)
## [1] "Success downloading file : ./ENCFF944WPL.bed.gz"
## [1] "Success downloading file : ./ENCFF349VZU.bed.gz"
## [1] "Success downloading file : ./ENCFF424TOY.bed.gz"
## [1] "Success downloading file : ./ENCFF259MMW.bed.gz"
## [1] "Success downloading file : ./ENCFF494KBA.bed.gz"
## [1] "Success downloading file : ./ENCFF404GGW.bed.gz"
## [1] "Files can be found at /tmp/RtmpCa8una/Rbuild46967413fc0a/ENCODExplorer/vignettes"
res
## An object of class ENCODEBindingConsensus.
## Summarizing 6 ENCODE files into 2 categories.
## 
## Metadata:
##   treatment_duration split_group
## 1                 10          10
## 2                  5           5
## 
## Consensus regions:
## GRangesList object of length 2:
## $`10`
## GRanges object with 31431 ranges and 0 metadata columns:
##                         seqnames            ranges strand
##                            <Rle>         <IRanges>  <Rle>
##       [1]                  chr14 20501403-20501711      *
##       [2]                  chr14 20628847-20629251      *
##       [3]                  chr14 20686268-20686794      *
##       [4]                  chr14 20848492-20849570      *
##       [5]                  chr14 20986843-20987131      *
##       ...                    ...               ...    ...
##   [31427]       chrUn_KI270333v1          999-1819      *
##   [31428]       chrUn_KI270333v1         1887-2683      *
##   [31429] chr9_KI270720v1_random         2403-2997      *
##   [31430]       chrUn_KI270336v1          609-1026      *
##   [31431]       chrUn_KI270442v1     391000-391452      *
##   -------
##   seqinfo: 38 sequences from an unspecified genome; no seqlengths
## 
## $`5`
## GRanges object with 21114 ranges and 0 metadata columns:
##                         seqnames            ranges strand
##                            <Rle>         <IRanges>  <Rle>
##       [1]                  chr14 20628896-20629164      *
##       [2]                  chr14 20686395-20686803      *
##       [3]                  chr14 20848565-20848981      *
##       [4]                  chr14 20849138-20849593      *
##       [5]                  chr14 20986851-20987151      *
##       ...                    ...               ...    ...
##   [21110]       chrUn_KI270333v1         2056-2668      *
##   [21111] chr9_KI270720v1_random         2414-2833      *
##   [21112]       chrUn_KI270336v1             1-320      *
##   [21113]       chrUn_KI270336v1          646-1026      *
##   [21114]       chrUn_KI270442v1     391043-391393      *
##   -------
##   seqinfo: 38 sequences from an unspecified genome; no seqlengths
 
 Obtaining average gene expression
For RNA-Seq experiment, the most straightforward type of results is the
expression level of all genes or transcripts. ENCODExplorer provides the
queryGeneExpression and queryTranscriptExpression methods to summarize
these results. ENCODExplorer finds all gene or transcript expression levels
for a given biosample and calculates per-condition mean values.
Most biosamples in the ENCODE Project have RNA-seq experiments targeting
different cell fractions, such as whole cells, cytoplasmic fractions, and
nuclear fractions. Since it makes no biological sense to aggregate such results,
ENCODExplorer automatically splits them by the dataset_description column,
which details the cell fraction as well as other methodological or biological
parameters which make samples unfit for aggregation.
# Obtain a summary of all peaks for NR3C1 ChIP-Seq assays in the A549
# cell line.
res = queryGeneExpression("bone marrow")
## Results : 2 files, 1 datasets
## Only mm10 was found. Selecting it.
## Only polyA plus RNA-seq was found. Selecting it.
## [1] "Success downloading file : ./ENCFF128MGD.tsv"
## [1] "Success downloading file : ./ENCFF339PKR.tsv"
## [1] "Files can be found at /tmp/RtmpCa8una/Rbuild46967413fc0a/ENCODExplorer/vignettes"
The files(), file_metadata() and metadata() methods behave the same way as
they do for queryConsensusPeaks:
metadata(res)
##     treatment treatment_amount treatment_amount_unit treatment_duration
## All      <NA>               NA                  <NA>                 NA
##     treatment_duration_unit
## All                    <NA>
You can see which expression metric ENCODExplorer extracted using the
metric() method.
metric(res)
## [1] "^TPM$"
Per gene/transcript values for all metrics are available through the
metric_data() method:
head(metric_data(res))
##      id   All
## 1 10000 0.000
## 2 10001 0.000
## 3 10002 0.000
## 4 10003 1.985
## 5 10004 0.000
## 6 10005 0.000
You can also get a list of the raw ENCODE files by calling the raw_data()
method.
head(raw_data(res)[[1]][[1]])
##   gene_id transcript_id.s. length effective_length expected_count  TPM FPKM
## 1   10000            10000     72               43              0 0.00 0.00
## 2   10001            10001     73               44              0 0.00 0.00
## 3   10002            10002     73               44              0 0.00 0.00
## 4   10003            10003     75               46              1 3.97 4.24
## 5   10004            10004     78               49              0 0.00 0.00
## 6   10005            10005     73               44              0 0.00 0.00
##   posterior_mean_count posterior_standard_deviation_of_count pme_TPM pme_FPKM
## 1                 0.00                                  0.00    3.69     4.42
## 2                 0.00                                  0.00    3.60     4.32
## 3                 0.00                                  0.00    3.60     4.32
## 4                 0.96                                  0.18    6.78     8.12
## 5                 0.00                                  0.00    3.24     3.88
## 6                 0.00                                  0.00    3.60     4.32
##   TPM_ci_lower_bound TPM_ci_upper_bound FPKM_ci_lower_bound FPKM_ci_upper_bound
## 1        4.03657e-05           10.98300         4.83120e-05             13.1626
## 2        2.14146e-04           10.75490         2.56696e-04             12.8875
## 3        7.29571e-05           10.82350         3.14634e-04             12.9686
## 4        6.52931e-03           16.16710         3.66534e-02             19.4081
## 5        8.06415e-05            9.65181         9.66626e-05             11.5613
## 6        3.69893e-05           10.86470         4.43908e-05             13.0253
 
 Fine tuning expression summaries
Just as it is the case for ChIP-seq assays, it can sometimes be easier for the
user to perform filtering of the ENCODE results manually. For thse cases,
ENCODExplorer provides the buildExpressionSummary method.
query_results = queryEncodeGeneric(biosample_name="neural tube", 
                                   output_type="gene quantifications",
                                   file_type="tsv",
                                   assay="polyA plus RNA-seq",
                                   assembly="^mm10$",
                                   dataset_biosample_summary="(15.5|13.5)",
                                   fixed=FALSE)
## Results : 4 files, 2 datasets
                                   
buildExpressionSummary(query_results, split_by="dataset_biosample_summary")                                   
## [1] "Success downloading file : ./ENCFF037GWJ.tsv"
## [1] "Success downloading file : ./ENCFF365DLM.tsv"
## [1] "Success downloading file : ./ENCFF049EIV.tsv"
## [1] "Success downloading file : ./ENCFF502BTV.tsv"
## [1] "Files can be found at /tmp/RtmpCa8una/Rbuild46967413fc0a/ENCODExplorer/vignettes"
## An object of class ENCODEExpressionSummary.
## Summarizing 4 ENCODE files into 2 categories.
## 
## Metadata:
##                dataset_biosample_summary                            split_group
## 1 C57BL/6 neural tube embryo (15.5 days) C57BL/6 neural tube embryo (15.5 days)
## 2 C57BL/6 neural tube embryo (13.5 days) C57BL/6 neural tube embryo (13.5 days)
## 
## Sumarizing 69691 gene expression levels.
 
 Interactive mode
queryConsensusPeaks, queryGeneExpression and queryTranscriptExpression all
make educated guesses about the assembly, assay and sample types to be used for
generating the summaries. However, by setting the use_interactive argument to
TRUE, a user can take direct control of some these choices.
queryGeneExpression("neural tube", use_interactive=TRUE)
 
 
 Updating the ENCODE file database
By default, ENCODExplorer retrieves the ENCODE metadata from its sister
package, ENCODExplorerData. The version of the metadata provided
by default will be updated with each Bioconductor release in the
ENCODExplorer package. However, since all of ENCODExplorer’s function take an
explicit df parameter, it is possible to use the AnnotationHub
package to download a more recent version:
require(AnnotationHub)
ah = AnnotationHub()
query(ah, "ENCODExplorerData")
## AnnotationHub with 4 records
## # snapshotDate(): 2020-10-26
## # $dataprovider: ENCODE Project
## # $species: NA
## # $rdataclass: data.table
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## #   rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH69290"]]' 
## 
##             title                                         
##   AH69290 | ENCODE File Metadata (Light, 2019-04-12 build)
##   AH69291 | ENCODE File Metadata (Full, 2019-04-12 build) 
##   AH75131 | ENCODE File Metadata (Light, 2019-10-13 build)
##   AH75132 | ENCODE File Metadata (Full, 2019-10-13 build)
Finally, it is also possible to use ENCODExplorerData
functionalities to generate an up-to-date data.table, and pass it to
ENCODExplorer’s functions.
We refer the user to the ENCODExplorerData vignettes for details
on how to generate an up-to-date data.table.