--- title: "Introduction to combinatorial demultiplexing" bibliography: ../inst/REFERENCES.bib author: - name: Jakob Peder Pettersen affiliation: - UiT, the Arctic University of Norway, Department of Computer Science email: jakobpeder.pettersen@gmail.com output: BiocStyle::html_document: self_contained: yes toc: true toc_float: true toc_depth: 2 code_folding: show date: "`r doc_date()`" package: "`r pkg_ver('posDemux')`" vignette: > %\VignetteIndexEntry{Introduction to combinatorial demultiplexing} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", crop = NULL # Related to # https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html ) ``` ```{r vignetteSetup, echo=FALSE, message=FALSE, warning = FALSE} ## Bib setup library("RefManageR") ## Write bibliography information bib_packages <- c( R = citation(), BiocStyle = citation("BiocStyle")[1], knitr = citation("knitr")[1], RefManageR = citation("RefManageR")[1], rmarkdown = citation("rmarkdown")[1], sessioninfo = citation("sessioninfo")[1], testthat = citation("testthat")[1] ) bib <- ReadBib("../inst/REFERENCES.bib") # for (pkg in names(bib_packages)) { # bib[[pkg]] <- bib_packages[[pkg]] # } BibOptions( check.entries = FALSE, style = "markdown", cite.style = "numeric", bib.style = "numeric" ) ``` # Basics ## When and why to use this package This package is aimed at single-cell RNA-seq approaches such as SPLiT-seq `r Citep(bib[c("Rosenberg2018","Kuijpers2024")])`, BacDrop `r Citep(bib[c("Ma2023")])`, and PETRI-seq `r Citep(bib[c("Blattman2020","Blattman2024")])`. Here, each cell to be sequenced is defined by its own unique combination of multiple DNA barcodes. We do assume that the individual barcodes come from a predefined whitelist, but that the combination of barcodes is random and that every possible combination of barcodes is valid. In order to identify from which cell each read is coming from, we must extract the barcodes, compare them with the reference barcodes, and find the best match. ## Required knowledge `r Biocpkg("posDemux")` uses the `r Biocpkg("Biostrings")` package for handling sequencing data and hence a basic understanding of `Biostrings` is necessary to use this package. Also important to note is the fact that while `posDemux` does provide utilities for demultiplexing scRNA-seq data, it is designed to be part of a greater workflow. In particular, the package does not include essential components of a scRNA-seq workflow like quality control, alignment, feature counting and UMI deduplication, neither does it provide any utilities for tertiary analysis once the gene count matrix is constructed. Hence, the user should be familiar with the other components of such workflows `r Citep(bib["Kuijpers2024"])` and know how the functionality of this package plays a part therein. If you are asking yourself the question "Where do I start using Bioconductor?" you might be interested in [this blog post](https://lcolladotor.github.io/2014/10/16/where-do-i-start-using-bioconductor/). ## Limitations Since this is a positional demultiplexer, the position of all segments and barcodes must be known in advance and be the same for all reads. Hence, the package is not suited in cases where segments occur at variable positions among the reads. Furthermore, in the cases where the segments are at fixed positions among the reads, but the segment lengths are unknown, it may require some trial and error to obtain the correct segmentation. The demultiplexer is written with Illumina sequencing in mind. It assumes that substitution sequencing errors may occur, but has no safeguard against indel sequencing errors. Hence, indel errors will result in erroneous segmentation and failure to assign the correct barcode. For sequencing platforms where indel errors are common, such as Nanopore sequencing, this package may therefore not be suitable. This package does not by itself support combining barcodes on both reverse and forward reads, but this limitation can be circumvented by either artificially merging reads or manipulating the result structures from the demultiplexer. ## Ways to interact with this package This vignette describes the basic usage of the package where the FASTQ files are read into `R` before the demultiplexing begins. For large datasets or memory-limited systems, reading all sequences into memory at once is too taxing. For this reason, `posDemux` has a streaming API where only parts of the input file is read at a time. For brevity, this functionality is explained in its own vignette (`vignette("streaming")`). We believe that most end-users won't interact with this package directly, but rather in the form of a custom-made bioinformatic pipeline. Also, we think that most developers of bioinformatics pipelines would prefer the streaming API. Still, we think those developing with this package should have an understanding of its low-level API as to understand what it does under the hood. ## Install `posDemux` This package is hosted on [Bioconductor](http://bioconductor.org) repository for `R` packages. In order to install it from a fresh `R` install, run: ```{r "install", eval = FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install("posDemux") ## Check that you have a valid Bioconductor installation BiocManager::valid() ``` If you want to obtain the development version of the package, it is available on [GitHub](https://github.com/yaccos/posDemux): ```{r "github_install", eval = FALSE} if (!requireNamespace("devtools", quietly = TRUE)) { install.packages("devtools") } devtools::install("yaccos/posDemux") ``` If you compile the package from source, we do recommend you use the compiler flag `-ftree-vectorize` as this provides considerable speedup to the demultiplexer. ## Asking for help As package developers, we try to explain clearly how to use our packages and in which order to use the functions. But `R` and `Bioconductor` have a steep learning curve so it is critical to learn where to ask for help. The blog post quoted above mentions some but we would like to highlight the [Bioconductor support site](https://support.bioconductor.org/) as the main resource for getting help: Remember to use the `posDemux` tag and check [the older posts](https://support.bioconductor.org/tag/posDemux/). Other alternatives are available such as creating GitHub issues and tweeting. However, please note that if you want to receive help you should adhere to the [posting guidelines](http://www.bioconductor.org/help/support/posting-guide/). It is particularly critical that you provide a small reproducible example and your session information so package developers can track down the source of the error. ## Citing `posDemux` We hope that `r Biocpkg("posDemux")` will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you! ```{r "citation"} ## Citation info citation("posDemux") ``` # An example with PETRI-seq ## Quick overview of the method The PETRI-seq method is a method for single-cell RNA-sequencing of bacteria `r Citep(bib[c("Blattman2020","Blattman2024")])`. It uses paired-end Illumina sequencing where the forward read contains the cell barcodes and the transcript UMI, whereas the reverse read contain the cDNA to be aligned to the genome. Hence, `posDemux` will only be applied to the forward reads. We start by loading the `posDemux` itself and some helper packages: ```{r "start", message=FALSE} library(posDemux) library(Biostrings) library(purrr) library(magrittr) ``` ## Sequence annotation For PETRI-seq, the forward read consists of the following segments in the following order (from 5' to 3'): - UMI: 7 nucletides - Barcode 3(`bc3`) : 7 nucleotides - Linker: 15 nucletides - Barcode 2(`bc2`): 7 nucleotides - Linker: 14 nucletides - Barcode 1(`bc1`): 7 nucleotides - The rest of the read is ignored Hence, we consider the first segment, the UMI, as a payload (denoted as `P` in the sequence annotation) to be kept. The three barcodes (denoted as `B`) are used for demultiplexing. Finally, the linkers and the read past Barcode 1 (denoted as `A`) are ignored. We specify it with as: ```{r segments} sequence_annotation <- c(UMI = "P", "B", "A", "B", "A", "B", "A") segment_lengths <- c(7L, 7L, 15L, 7L, 14L, 7L, NA_integer_) ``` ## Data loading For the demultiplexing, we need the listing of the barcodes, and of course the reads to demultiplex. For this package, the PETRI-seq barcodes and synthetically generated reads are provided. We load the barcodes: ```{r barcodes} barcode_files <- system.file("extdata/PETRI-seq_barcodes", c( bc1 = "bc1.fa", bc2 = "bc2.fa", bc3 = "bc3.fa" ), package = "posDemux" ) names(barcode_files) <- paste0("bc", 1L:3L) barcode_index <- map(barcode_files, readDNAStringSet) ``` ... and the FASTQ file containing the forward reads: ```{r read_fastq} input_fastq <- system.file("extdata", "PETRI-seq_forward_reads.fq.gz", package = "posDemux" ) reads <- readDNAStringSet(input_fastq, format = "fastq") ``` `posDemux` accepts both `DNAStringSet` and `QualityScaledDNAStringSet` as input where the latter contain the quality scores of the reads. However, the quality score is ignored by the demultiplexer and the only reason to pass a `QualityScaledDNAStringSet` is when it is desirable to retain the quality scores in the payload. ## Running the demultiplexer Before we run the demultiplexer, we must ensure that the barcodes index is arranged in the correct order: ```{r barcode_reorder} barcodes <- barcode_index[c("bc3", "bc2", "bc1")] ``` We are now ready to run the demultiplexer: ```{r run_demultiplex} demultiplex_res <- combinatorial_demultiplex( reads, barcodes = barcodes, segments = sequence_annotation, segment_lengths = segment_lengths ) ``` The main results from this demultiplexer are the table of assigned barcodes: ```{r assigned_barcodes} head(demultiplex_res$assigned_barcodes) ``` the table of mismatches to the barcode: ```{r mismatches} head(demultiplex_res$mismatches) ``` as well as the extracted UMI sequences: ```{r umis} demultiplex_res$payload$UMI ``` ## Error correction `posDemux` handles two types reads filtering: - Retaining only the reads being matched with a barcode combination after error correction. - Removing barcode combinations which have few reads as these are usually artifacts and are not assumed to cover the transcripts of an entire cell. This subsection consider the first type of filtering. During demultiplexing,the barcode with the smallest Hamming distance to the read is chosen and the number of mismatches is recorded. If there are multiple barcodes being equal in distance to the query, one of these barcodes is chosen. When there are no mismatches for any of the barcode sets, we obviously want to keep this read. When one or more barcode has mismatches, `posDemux` allows for error correction. In that case, we will keep the read if the number of mismatches to each of its barcodes is below a certain threshold. In our example, the smallest Hamming distance between two different barcodes is $3$ for all barcode sets. Given $n$, the number of barcode sequencing errors which can be reliably corrected in a barcode set, the minimum distance required between two distinct barcodes needs to be at least $2n+1$. This means that in our case, we can correct for exactly one sequencing error: ```{r filter_res} filtered_res <- filter_demultiplex_res( demultiplex_res, allowed_mismatches = 1L ) ``` We will now inspect the summary of the demultiplexing and filtering: ```{r summary_res} filtered_res$summary_res ``` Now, the results in `filtered_res$demultiplex_res` only shows the reads either without mismatches or where error correction could be applied: ```{r filtered_tables} head(filtered_res$demultiplex_res$assigned_barcodes) head(filtered_res$demultiplex_res$mismatches) ``` We can also see which reads are retained: ```{r retained} head(filtered_res$retained) ``` ## Filtering by barcode frequency Now as we have conducted the filtering based on barcode mismatches, we proceed with the second kind of filtering. It utilizes the Knee method `r Citep(bib[c("Macosko2015","Blattman2020")])` where the barcode combinations are arranged in descending order on frequency and the most abundant ones are selected. The cutoff is set such that we achieve a saturation where most reads are covered, yet the remaining barcode combinations not covered all have a small number of reads. We first create a frequency table of the barcodes is question: ```{r freq_table} freq_table <- create_freq_table( filtered_res$demultiplex_res$assigned_barcodes ) head(freq_table) ``` This package contains an interaction Shiny application for selecting the number of barcodes. It can be run by: ```{r shiny_app, eval=FALSE} interactive_bc_cutoff(freq_table) ``` If you are working outside RStudio, you may have to run ```{r shiny_app_cmd, eval=FALSE} app <- interactive_bc_cutoff(freq_table) shiny::runApp(app, launch.browser = FALSE) ``` and opening the resulting link inside the browser. If you are working on a headless system, you may consider: - Using RStudio Server - Opening a reverse ssh tunnel for the Shiny server - Copy the frequency table onto your local machine and run the application there If you run the application, you will see that keeping approximately 500 barcodes is the optimal choice^[In case you wonder why the best choice is keeping half the barcodes: Remember that the data is synthetic, so making half of the barcode combinations artifacts with low numbers of reads, was a design decision.]. We can also calculate the frequency of the least abundant barcode combination being included. ```{r cutoff_convert} bc_cutoff <- 500L freq_cutoff <- bc_to_freq_cutoff(freq_table, bc_cutoff) freq_cutoff ``` We can convert the cutoff the other way as well, but it is in general not an exact inverse as multiple barcodes can have the same frequency: ```{r cutoff_reconstruct} reconstrued_bc_cutoff <- freq_to_bc_cutoff( freq_table, freq_cutoff ) reconstrued_bc_cutoff ``` The cutoff can be illustrated by the Knee plot as in the interactive application: ```{r knee_plot} knee_plot(freq_table = freq_table, cutoff = bc_cutoff) ``` We can also illustrate this filtering by looking at the distribution of barcode frequencies: ```{r freq_plot} # Since the cutoff lines of the plot are provided by the literal x-coordinate, # we must use the frequency cutoff freq_plot(freq_table, cutoff = freq_cutoff, type = "density", log_scale_x = TRUE ) ``` Often the frequency plot gets easier to interpret when we scale the y-axis by the number of reads, hence creating a mass plot of where the reads are distributed based on their respective barcodes: ```{r mass_plot} # Since the cutoff lines of the plot are provided by the literal x-coordinate, # we must use the frequency cutoff freq_plot(freq_table, cutoff = freq_cutoff, type = "density", log_scale_x = TRUE, scale_by_reads = TRUE ) ``` ## Exporting results Now as we know how many barcodes to keep, we can extract the desired results and write it to a table. We start with the frequency table. Since it is already sorted in descending order of frequency, we simply take the top entries in the table: ```{r select_freq_table} selected_freq_table <- freq_table[seq_len(bc_cutoff), ] ``` For finding which reads correspond to these barcodes, we do the following: ```{r extract_tables} assigned_barcodes <- filtered_res$demultiplex_res$assigned_barcodes read_in_selection <- row_match(assigned_barcodes, selected_freq_table) selected_assigned_barcodes <- assigned_barcodes[read_in_selection, ] ``` We continue with the UMIs which we convert to a character vector (remember that the payload and assigned barcodes are aligned in order): ```{r extract_umis} assigned_UMI <- filtered_res$demultiplex_res$payload$UMI %>% as.character() selected_assigned_UMI <- assigned_UMI[read_in_selection] ``` With all of this done, we can make a data frame containing the UMI, read identifier, and barcode assignments: ```{r combined_frame} res_table <- as.data.frame(selected_assigned_barcodes) %>% dplyr::mutate( read = rownames(selected_assigned_barcodes), UMI = selected_assigned_UMI ) %>% # Ensures the columns appears in the desired order dplyr::select(read, UMI, bc3, bc2, bc1) head(res_table) ``` Finally, we can write the table to file. The following is the suggested way for achieving the desired formatting: ```{r write_to_file} file <- tempfile(pattern = "barcode_table", fileext = ".txt") write.table(res_table, file, row.names = FALSE, col.names = TRUE, sep = "\t", eol = "\n", quote = FALSE ) ``` # Reproducibility The `r Biocpkg("posDemux")` package `r Citep(bib[["posDemux"]])` was made possible thanks to: - R `r Citep(bib[["R"]])` - `r Biocpkg("BiocStyle")` `r Citep(bib[["BiocStyle"]])` - `r CRANpkg("knitr")` `r Citep(bib[["knitr"]])` - `r CRANpkg("RefManageR")` `r Citep(bib[["RefManageR"]])` - `r CRANpkg("rmarkdown")` `r Citep(bib[["rmarkdown"]])` - `r CRANpkg("sessioninfo")` `r Citep(bib[["sessioninfo"]])` - `r CRANpkg("testthat")` `r Citep(bib[["testthat"]])` This package was developed using `r BiocStyle::Biocpkg("biocthis")`. `R` session information: ```{r reproduce3, echo=FALSE} ## Session info library("sessioninfo") options(width = 120) session_info() ``` # Bibliography This vignette was generated using `r Biocpkg("BiocStyle")` `r Citep(bib[["BiocStyle"]])` with `r CRANpkg("knitr")` `r Citep(bib[["knitr"]])` and `r CRANpkg("rmarkdown")` `r Citep(bib[["rmarkdown"]])` running behind the scenes. Citations made with `r CRANpkg("RefManageR")` `r Citep(bib[["RefManageR"]])`. ```{r biblio, results = "asis", echo = FALSE, warning = FALSE, message = FALSE} ## Print bibliography PrintBibliography(bib, .opts = list(hyperlink = "to.doc", style = "html")) ```