BatchQC 2.7.1
Sequencing and microarray samples are often collected or processed in multiple batches or at different times. This often produces technical biases that can lead to incorrect results in the downstream analysis. BatchQC is a software tool that streamlines batch preprocessing and evaluation by providing shiny app interactive diagnostics, visualizations, and statistical analyses to explore the extent to which batch variation impacts the data. This is contained in a full R package which allows reproducibility of all shiny evaluations and analyses. BatchQC diagnostics help determine whether batch adjustment needs to be done, and how correction should be applied before proceeding with a downstream analysis. Moreover, BatchQC interactively applies multiple common batch effect approaches to the data, and the user can quickly see the benefits of each method. BatchQC is developed as a Shiny App, but is reproducible at the command line through the functions contained in the R package. The output of the shiny app is organized into multiple tabs, and each tab features an important part of the batch effect analysis and visualization of the data. The BatchQC interface has the following features:
BatchQC() is the function that launches the Shiny App in interactive mode.
To begin, install Bioconductor and then install BatchQC:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("BatchQC")
To install the most up-to-date version of BatchQC, please install directly from GitHub. You will need the devtools package. You can install both of these with the following commands:
if (!require("devtools", quietly = TRUE)) {
install.packages("devtools")
}
library(devtools)
install_github("wejlab/BatchQC")
You should now be able to load BatchQC and launch the shiny app.
library(BatchQC)
BatchQC()
Below is an example of how you may use the shiny app to analyze your data:
Upon launching the shiny app, you will be on the “Upload Data” screen. From here, you may upload the following data:
A preview of the selected data will show up under the “Input Summary” and “Full Metadata” tab. Input summary shows the counts file/assay on the Input summary tab and the metadata on the “Full Metadata” tab. If a preview does not load as expected, please ensure your dataset is structured properly and a valid file type.
You MUST hit “Upload” for your data to be available to interact with the shiny app.
All shiny app functions can also be run from the command line; all command line execution require that your data is in a Summarized Experiment object. You can create a summarized experiment object from a features and metadata matrix. This would be done as follows:
# Load package data examples
data(signature_data)
data(batch_indicator)
# Create SE Object
features <- signature_data
metadata <- batch_indicator
se_object <- BatchQC::summarized_experiment(features, metadata)
# Name your assay
SummarizedExperiment::assayNames(se_object) <- "log_intensity"
# Ensure all relevat metadata varaibles are factors
se_object$batch <- as.factor(se_object$batch)
se_object$condition <- as.factor(se_object$condition)
Likewise, you can upload your own data into a summarized experiment object for use with command line functions by providing a data matrix and sample info matrix to the summarized_experiment function.
The remainder of this documentation will utilize the TB example data set included in the package. This can be loaded from the command line as a summarized experiment object with the following:
se_object <- tb_data_upload()
## Loading required namespace: curatedTBData
## Downloading: GSE152218
##
|
| | 0%
|
|============== | 20%
|
|============================ | 40%
|
|========================================== | 60%
|
|======================================================== | 80%
|
|======================================================================| 100%
## Downloading: GSE101705
##
|
| | 0%
|
|========== | 14%
|
|==================== | 29%
|
|============================== | 43%
|
|======================================== | 57%
|
|================================================== | 71%
|
|============================================================ | 86%
|
|======================================================================| 100%
## Finished!
For ease of reference, we will also create variables to store the assay we are interested in, the batch variable, and the covariable we are interested in.
assay_of_interest <- "features"
batch <- "Experiment"
covar <- "TBStatus"
You may choose to use these tools in a different order, although we recommend checking the distribution of your data before deciding the best batch correction method. you may also choose to explore your raw data with the various tools within BatchQC to confirm the presence of a batch effect prior to applying a batch correction method. In this example, we have applied a batch correction method based on the results of the distribution check, but you may choose to do further analysis before selecting (a) (or even comparing) batch correction method(s).
compute_aic' can provide a negative binomial distribution value only for positive discrete data. Positive non-discrete data will result incompute_aic`
reporting only the voom and log-normal distribution analysis.
The negative binomial goodness-of-fit will provide an estimate for discrete or non-discrete data.
These tools are found on the Batch Correction/Normalization tab as we recommend utilizing these tools to inform your choice of batch correction method.
TB_AIC <- compute_aic(se = se_object,
assay_of_interest = assay_of_interest,
batchind = batch,
groupind = covar,
maxit = 1000,
zero_filt_percent = 15
)
## Warning in sqrt(1/i): NaNs produced
## Warning in sqrt(1/i): NaNs produced
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: algorithm did not converge
TB_AIC["AIC_table"]
## $AIC_table
## nb_AIC lognormal_AIC voom_AIC
## Total AIC 18608300.213 4683915.6234 3174420.1261
## Frequency of Minimum AIC 1.000 14109.0000 3247.0000
## AIC Score 18608300.213 331.9807 977.6471
## Median AIC 1099.963 128.2294 177.1217
TB_AIC["AIC_boxplot"]
## $AIC_boxplot
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
The lowest AIC represented by “AIC Score” indicates the model that is likely the best fit. In this dataset, the lowest “AIC Score” is lognormal. Therefore, we should consider using ComBat as our correction method.
We can further confirm that a negative binomial distribution is not a good fit by running the DESeq2 negative binomial check.
# Set a seed if you would like reproducible results
set.seed(1)
TB_fit <- goodness_of_fit_nb(se = se_object,
count_matrix = assay_of_interest,
condition = covar,
other_variables = batch,
num_genes = 3000,
method = "edgeR")
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the BatchQC package.
## Please report the issue at <https://github.com/wejlab/BatchQC/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
TB_fit$res_histogram
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
TB_fit$recommendation
## [1] "These methods have not been tested on data sets with between 21-149 samples. Thus, caution should be had when interpretting the results. With an FDR adjusted p-value cut off of 0.05, 1072 of your condition variable features are below the cutoff. If edgeR's assumptions are met, we would expect a uniform distribution of significant features. Since 36% of features have a significant adjusted pvalue (<0.05), you should be cautious about using edgeR for your analysis. You have significant features, and thus you are at risk of receiving false results."
The recommendation states that the edgeR negative binomial distribution is likely being violated. Therefore we probably should not consider down stream methods that assume a negative binomial distribution.
After you have successfully uploaded your data set of interest, you can apply one of the following Normalization methods if appropriate:
To use one of these two methods, select it from the normalization method drop down, select an assay to perform the normalization on from the drop down, and name the new assay (the prepopulated name can be changed to your preference). You may also choose to log-transform the results of either of these normalization methods or your raw data by selecting the “Log transform” box. Hit “Normalize” to complete the analysis. Your new assay will now be available for further analysis.
Alternatively, this can be called directly from the command line on an SE object and will return the same SE object with an additional assay with the name that you provide as the output_assay_name.
Our TB data set is raw count data and to utilize ComBat we will need normalized data. We will utilize log_CPM normalized data for ComBat. However, examples are provided below on how you would complete many different normalization methods if needed via the command line:
# CPM Normalization
se_object <- BatchQC::normalize_SE(se = se_object, method = "CPM",
log_bool = FALSE, assay_to_normalize = assay_of_interest,
output_assay_name = "CPM_normalized_counts")
# CPM Normalization w/log
se_object <- BatchQC::normalize_SE(se = se_object, method = "CPM",
log_bool = TRUE, assay_to_normalize = assay_of_interest,
output_assay_name = "CPM_log_normalized_counts")
# DESeq Normalization
se_object <- BatchQC::normalize_SE(se = se_object, method = "DESeq",
log_bool = FALSE, assay_to_normalize = assay_of_interest,
output_assay_name = "DESeq_normalized_counts")
# DESeq Normalization w/log
se_object <- BatchQC::normalize_SE(se = se_object, method = "DESeq",
log_bool = TRUE, assay_to_normalize = assay_of_interest,
output_assay_name = "DESeq_log_normalized_counts")
# log adjust
se_object <- BatchQC::normalize_SE(se = se_object, method = "none",
log_bool = TRUE, assay_to_normalize = assay_of_interest,
output_assay_name = "log_normalized_counts")
Select the appropriate Batch Effect correction model to create a batch-corrected assay for your data. You may select one of the following:
To apply your desired batch effect correction model, select the method from the drop down menu, then select the appropriate assay in the second drop down menu, select your batch variable (typically labeled batch in built-in examples), and the covariates that you would like to preserve (not including the batch variable). If desired, revise the name for the corrected assay and then hit “Correct.”
A progress bar will pop up in the bottom right hand corner and a message will state “Correction Complete” in the bottom right hand corner when complete.
You are now ready to analyze your raw and batch effect corrected data sets!
The distribution analysis indicated that our TB data follows a log normal distribution, so ComBat is likely an appropriate batch correction tool. We will use the log-CPM assay since ComBat requires normalized data. Therefore, to run from the command line use the following:
# ComBat correction
se_object <- BatchQC::batch_correct(se = se_object, method = "ComBat",
assay_to_normalize = "CPM_log_normalized_counts", batch = batch,
covar = c(covar), output_assay_name = "ComBat_corrected")
## Found 4182 genes with uniform expression within a single batch (all zeros); these will not be adjusted for batch.
## Found2batches
## Adjusting for1covariate(s) or covariate level(s)
## Standardizing Data across genes
## Fitting L/S model and finding priors
## Finding parametric adjustments
## Adjusting the Data
ComBat-Seq (or limma or sva or svaseq) can be applied to a data set by providing “ComBat-Seq” (or “limma” or “sva” or “svaseq”) to the method parameter in the function call.
This tab allows you to view the batch status of each condition. Select the variable that represents your batch variable and then select the variable that represents the condition. A table will then appear listing each condition in the row and how many samples in each condition are in each batch (displayed in the columns).
To recreate this table, run the following code, which will produce a tibble as output:
batch_design_tibble <- BatchQC::batch_design(se = se_object, batch = batch,
covariate = covar)
batch_design_tibble
## # A tibble: 2 × 3
## # Groups: TBStatus [2]
## TBStatus GSE101705 GSE152218
## <fct> <int> <int>
## 1 LTBI 15 32
## 2 PTB 28 16
This tab displays the Pearson correlation coefficient and Cramer’s V estimation based on the selected batch and condition.
This table can be recreated with the following code:
confound_table <- BatchQC::confound_metrics(se = se_object, batch = batch)
confound_table
## Pearson Correlation Coefficient Cramer's V
## TBStatus 0.4279870 0.3175219
## HIVStatus NaN NaN
## BMI 1.0000000 1.0000000
## BMIcat 0.5898655 0.4589234
Pearson correlation coefficient indicates the strength of the linear relationship between your batch and condition variables. It can range from -1 to 1. The closer the value is to 0, the less likely there is a batch effect. The closer the value is to -1 or 1, the more likely there is a batch effect.
To calculate only the pearson correlation coefficient, run the following:
pearson_cor_result <- BatchQC::std_pearson_corr_coef(batch_design_tibble)
pearson_cor_result
## [1] 0.427987
This is an additional metric for batch effect and will be between 0 and 1. The closer the value is to 0, the lower the batch effect. The closer the value is to 1, the greater the batch effect
To calculate only Cramer’s V, run the following:
cramers_v_result <- BatchQC::cramers_v(batch_design_tibble)
cramers_v_result
## [1] 0.3175219
If both the Pearson Correlation Coefficient and the Cramer’s V test indicate low batch effect, you likely do not need to adjust your results for batch. However, if one or both metrics indicate some or high batch effect, you should use additional analysis tools (listed below) to explore the need for a batch correction for your results.
First, select your unadjusted assay, your batch variable and any covariates of interest. Click “Here we go!” to see the variation analysis results. The “Individual Variable” tab shows the individual, or raw variation, explained by each variable. The “Residual” tab displays the type 2, or residual, variation for each variable. A box plot is displayed along with a table with the variation specific to each gene listed. This table is searchable and can be displayed in ascending or descending order. Notice the amount of variation attributed to batch. In our signature data set, most of the variation is attributed to batch rather than the condition. This indicates a need for a batch correction.
The “Individual Variation Variable/Batch Ratio” tab displays the individual, or raw variation, divided by the batch. A ratio greater that 1 indicates that batch has a stronger affect than the variable of interest. And the “Residual Variation Variable/Batch Ratio” displays the residual, or type 2 variation, divided by the batch. A ratio greater that 1 indicates that batch has a stronger affect than the variable of interest.
Next, select the batch corrected assay, with the same batch and condition variable and observe the changes in variation. When batch effect is being properly corrected, you should notice that the variation explained by the batch variable has decreased in the batch corrected assay, indicating that the variation in the data is now more likely to be associated with your condition of interest rather than the batch. The ratio statistics should also be less than 1. This is visible in our example data set and confirms the need for batch adjustment.
To recreate the variation analysis plot for the TB counts assay, use the following code for individual variation:
ex_variation <- batchqc_explained_variation(se = se_object,
batch = batch, condition = covar, assay_name = assay_of_interest)
EV_boxplot <- BatchQC::EV_plotter(batchqc_ev = ex_variation[[1]])
EV_boxplot
BatchQC::summary_stats_EV_table(ex_variation[[1]])
## Explained Experiment TBStatus Unexplained
## Min. 0.000 0.00000 0.000000 6.370
## 1st Qu. 1.730 1.00000 0.600000 65.010
## Median 8.580 3.10000 1.190000 91.420
## Mean 19.871 14.29845 4.043791 80.129
## 3rd Qu. 34.990 22.19000 4.440000 98.270
## Max. 93.630 93.58000 53.000000 100.000
To recreate the residual variation analysis table for the TB counts assay, use the following code:
EV_boxplot_type_2 <- BatchQC::EV_plotter(batchqc_ev = ex_variation[[2]])
EV_boxplot_type_2
BatchQC::summary_stats_EV_table(ex_variation[[2]])
## Explained Experiment TBStatus Unexplained
## Min. 0.000 0.00000 0.000000 6.370
## 1st Qu. 1.730 0.54000 0.530000 65.010
## Median 8.580 3.73500 1.620000 91.420
## Mean 19.871 15.82695 5.572554 80.129
## 3rd Qu. 34.990 25.96000 7.130000 98.270
## Max. 93.630 84.65000 54.450000 100.000
We can then compare our batch corrected assay, using the following code:
ex_variation <- batchqc_explained_variation(se = se_object,
batch = batch, condition = covar, assay_name = "ComBat_corrected")
EV_boxplot <- BatchQC::EV_plotter(batchqc_ev = ex_variation[[1]])
EV_boxplot
BatchQC::summary_stats_EV_table(ex_variation[[1]])
## Explained Experiment TBStatus Unexplained
## Min. 0.000000 0.000000 0.000000 1.62000
## 1st Qu. 0.730000 0.190000 0.500000 87.15000
## Median 3.040000 0.570000 2.270000 96.96000
## Mean 8.379356 1.071386 8.004516 91.62064
## 3rd Qu. 12.850000 1.340000 12.520000 99.27000
## Max. 98.380000 98.350000 60.260000 100.00000
EV_boxplot_type_2 <- BatchQC::EV_plotter(batchqc_ev = ex_variation[[2]])
EV_boxplot_type_2
BatchQC::summary_stats_EV_table(ex_variation[[2]])
## Explained Experiment TBStatus Unexplained
## Min. 0.000000 0.0000000 0.000000 1.62000
## 1st Qu. 0.730000 0.0200000 0.380000 87.15000
## Median 3.040000 0.0700000 2.100000 96.96000
## Mean 8.379356 0.3745516 7.308033 91.62064
## 3rd Qu. 12.850000 0.2600000 11.540000 99.27000
## Max. 98.380000 89.3800000 54.730000 100.00000
EV_table <- BatchQC::EV_table(batchqc_ev = ex_variation[[1]])
EV_table
## Gene Name Explained Experiment TBStatus Unexplained
## <char> <num> <num> <num> <num>
## 1: A1BG 10.74 1.14 10.74 89.26
## 2: A1BG-AS1 11.89 1.47 11.87 88.11
## 3: A1CF 0.86 0.37 0.73 99.14
## 4: A2M 1.13 0.13 1.13 98.87
## 5: A2M-AS1 15.47 0.95 15.39 84.53
## ---
## 33582: ZYG11A 1.88 0.62 1.73 98.12
## 33583: ZYG11AP1 3.14 2.11 2.02 96.86
## 33584: ZYG11B 9.90 0.03 9.21 90.10
## 33585: ZYX 2.36 0.07 2.31 97.64
## 33586: ZZEF1 27.60 2.27 27.57 72.40
EV_table_type_2 <- BatchQC::EV_table(batchqc_ev = ex_variation[[2]])
EV_table_type_2
## Gene Name Explained Experiment TBStatus Unexplained
## <char> <num> <num> <num> <num>
## 1: A1BG 10.74 0.00 9.60 89.26
## 2: A1BG-AS1 11.89 0.02 10.42 88.11
## 3: A1CF 0.86 0.13 0.49 99.14
## 4: A2M 1.13 0.00 1.00 98.87
## 5: A2M-AS1 15.47 0.08 14.52 84.53
## ---
## 33582: ZYG11A 1.88 0.15 1.26 98.12
## 33583: ZYG11AP1 3.14 1.12 1.02 96.86
## 33584: ZYG11B 9.90 0.69 9.87 90.10
## 33585: ZYX 2.36 0.06 2.30 97.64
## 33586: ZZEF1 27.60 0.03 25.33 72.40
To confirm the presence of batch effects in the simulated dataset, we use kBET, a method that quantifies batch effects through a chi-square test. kBET works by comparing the batch label distribution in each sample’s local neighborhood to the global batch label distribution across the entire dataset. If these distributions differ significantly, it indicates the presence of batch effects.
Below is how we would run the kBET analysis on the TB data set.
TB_kBET <- BatchQC::run_kBET(se = se_object,
assay_to_normalize = assay_of_interest,
batch = batch)
BatchQC::plot_kBET(TB_kBET)
The boxplot shows that the observed rejection rate is significantly higher than the expected rejection rate, indicating a strong batch effect in the TB dataset.
BatchQC contains a number of visualization methods that may help you identify a batch effect (or to confirm that a batch effect has been removed).
The heatmap tab contains options for a sample correlation heatmap or for a more traditional heatmap comparing samples and genes. Select your assay of interest and the batch and condition(s) of interest. We recommend viewing no more than 500 elements at a time (or the max number in your data set).
This heat map shows how similar each sample is to each other sample in your data (samples are plotted on both the x and y axis, so are identical to their self). The samples are clustered together based on similarity. Ideally, you would expect your samples to cluster and be more similar to samples from the same condition rather than from the sample batch. If there is clear visual clustering of the batch in your raw data, you should apply a batch correction method. Your batch corrected assay should have stronger clustering by condition.
Using the TB dataset, we initially see clustering based on batch, but after apply ComBat, our adjusted data set clusters more based on condition.
Use the following code to recreate the sample correlation heatmap:
heatmaps <- BatchQC::heatmap_plotter(se = se_object, assay = assay_of_interest,
nfeature = 38, annotation_column = c(batch, covar),
log_option = "FALSE")
correlation_heatmap <- heatmaps$correlation_heatmap
correlation_heatmap
The heatmap shoes the sample clustering on the x axis and the y axis is each gene. The values on the heatmap represent gene expression. The dendrogram on the x axis is the same clustering arrangement as was seen in the sample correlations. This heatmap allows the viewer to see patterns of gene expression across the samples based on the selected variables. Remember that if batch is clustering together on your raw data, you should perform a batch effect correction to adjust your data.
Use the following code to recreate the heatmap:
heatmaps <- BatchQC::heatmap_plotter(se = se_object, assay = assay_of_interest,
nfeature = 38, annotation_column = c(batch, covar),
log_option = FALSE)
heatmap <- heatmaps$topn_heatmap
heatmap
Within the dendrogram tab you can view either a traditional dendrogram or a circular dendrogram.
This tab shows a more detailed dendrogram than seen in the heatmap. When samples cluster together by batch, you are likely to have a strong batch effect in your data and you should consider a batch correction method for your data. Samples are clustered based on similarity of gene expression and other metadata using an Euclidian distance metric. Labels will include the batch variable and one covariate of interest.
Use the following code to recreate the dendrogram for the TB data set:
dendrogram_plot <- BatchQC::dendrogram_plotter(se = se_object,
assay = assay_of_interest,
batch_var = batch,
category_var = covar)
## Coordinate system already present.
## ℹ Adding new coordinate system, which will replace the existing one.
dendrogram_plot$dendrogram
This circularizes the previous dendrogram to better show relatedness of all branches on the dendrogram without the appearance of great distance of the samples at the top and the bottom of the chart.
Use the following code to recreate the circular dendrogram:
circular_dendrogram_plot <- BatchQC::dendrogram_plotter(
se = se_object, assay = assay_of_interest, batch_var = batch,
category_var = covar)
## Coordinate system already present.
## ℹ Adding new coordinate system, which will replace the existing one.
circular_dendrogram_plot$circular_dendrogram
You may select multiple assays to plot PCA analysis side by side. Therefore, select your raw data assay and batch corrected assay, the batch variable as shape, condition as color, and the number of top variable features you are interested in. Review the clustering pattern. In data sets where a strong batch effect is seen, similar shapes (batch) will all cluster together. Whereas if a batch effect is not seen, shapes (batch) should be dispersed throughout and you should see clustering by condition (color).
Use the following code to recreate the PCA plot and associated table that lists the explained variation of each PC for our TB data set. Please note that count data should be log-normalized prior to PCA plotting, so we use the log_normalized assay:
pca_plot <- BatchQC::PCA_plotter(se = se_object, nfeature = 20, color = batch,
shape = covar, batch = batch,
assays = c("log_normalized_counts", "ComBat_corrected"),
xaxisPC = 1, yaxisPC = 2, log_option = FALSE)
pca_plot$plot
## Warning: The following aesthetics were dropped during statistical transformation: shape.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
## The following aesthetics were dropped during statistical transformation: shape.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
pca_plot$var_explained
## log_normalized_counts ComBat_corrected
## PC1 0.66782 0.55537
## PC2 0.20989 0.17098
You may plot your data on a UMAP by selecting the assay, the batch variable, and selecting a value for nearest neighbors and minimum distance for the points to plot. The default for neighbors is 15 and minimum distance default is 0.1. We expect each batch to cluster together when there is a batch effect present. Otherwise, having dispersed points in each batch indicates there is not a strong batch effect. You can adjust the nearest neighbor parameter, a lower value will prioritize local structure and a higher value will represent bigger picture, but lose finer details. A higher minimum distance value will also put less emphasis on the global structure.
Use the following code to recreate the UMAP for the TB data set:
umap_plot <- BatchQC::umap(se = se_object,
assay_of_interest = assay_of_interest, batch = batch, covar = covar,
neighbors = 15, min_distance = 0.1, spread = 1, log_option = TRUE)
umap_plot
This code creates a UMAP for the batch corrected assay:
umap_plot_batch_corrected <- BatchQC::umap(se = se_object,
assay_of_interest = "ComBat_corrected", batch = batch, covar = covar,
neighbors = 15, min_distance = 0.1, spread = 1)
umap_plot_batch_corrected
To explore which features are differentially expressed between conditions, use the differential expression tab. DE_Seq2 should be used for count data (non-negative, integer values) with a negative binomial distribution, edgeR can be used for edgeR-normalized data, ANOVA can be used for normal-distributed data, Kruskal-Wallis test can be used as a nonparametric equivalent of one-way ANOVA, and limma can be used for all other data. edgeR implemented in BatchQC uses glm quasi-likelihood F-tests to test for differential expression.
Our TB data is count data with a log normal distribution, so we will use limma. We will select the assay we are interested in, select the batch variable and all covariates that we would like to include in the analysis, and select a method to adjust p-values. The “Results Table” return a table for each condition within each variable. The results table displays the log2 fold change, p-value and adjusted p-value for each feature. The “P-Value Analysis” tab displays a box plot of p-values for each analysis. In both the “Results Table” and “P-Value Analysis” tabs, the tables can be sorted by ascending or descending order of a column of your choice. They are also searchable with the “Search” bar (which can be helpful if you are interested in a specific feature or p-value threshold).
To run the limma differential expression analysis from the command line, run the following code (which can be modified to run DESeq2 when appropriate by changing the method argument to “DESeq2”):
differential_expression <- BatchQC::DE_analyze(se = se_object,
method = "limma", batch = batch, conditions = c(covar),
assay_to_analyze = assay_of_interest, padj_method = "BH")
After running the differential expression analysis, you can see all the completed analysis by running this code:
names(differential_expression)
## [1] "(Intercept)" "TBStatusPTB" "ExperimentGSE152218"
You can then view the log2 fold change, p-value and adjusted p-value table for a specific analysis of interest, by running the following:
head(differential_expression[["TBStatusPTB"]])
## log2FoldChange pvalue padj
## GGA2 -1924.29097 5.592932e-17 1.270415e-12
## HIP1R -1266.45296 7.565144e-17 1.270415e-12
## TTC9 -656.55235 3.717015e-16 4.161322e-12
## CARNS1 -278.81677 5.372133e-16 4.510712e-12
## LRIG1 -364.31449 1.518643e-15 9.872038e-12
## TRBJ1-4 -58.61267 1.773130e-15 9.872038e-12
After running a differential expression analysis, you may replicate the p-value box plot and table from the command line by running the following code:
pval_plotter(differential_expression)
## Warning: `position_dodge()` requires non-overlapping x intervals.
head(pval_summary(differential_expression))
## (Intercept) TBStatusPTB ExperimentGSE152218
## EIF1P7 1.820311e-54 5.592932e-17 2.022356e-52
## YIPF4 4.501172e-54 7.565144e-17 5.544681e-46
## RAB4B 1.071535e-52 3.717015e-16 2.052650e-41
## COX15 3.493950e-52 5.372133e-16 1.517725e-40
## LMO4 6.885670e-51 1.518643e-15 2.254907e-40
## LRRC57 1.306966e-50 1.773130e-15 1.006524e-39
The “Volcano Plot” is based on the limma or DESeq2 analysis results and must be run after completing the differential expression analysis. You should select the analysis result that you are interested in, specify the threshold for p-value cut off (the default is 0.05), and the magnitude of expression change (the shiny default is the average of the absolute value of the minimum and maximum value in the data set). After clinking “Run,” a volcano plot will appear. All values that are below the p-value threshold and exceed the set expression change threshold will appear in red, indicating that these may be potential features that differ between the conditions in that variable.
In a data set with a batch effect, you will likely see many significant values in both your p-value analysis and in the volcano plot. When you apply a batch correction analysis, you should see more significant values in your covariate conditions than in your batch conditions and fewer significant values that meet the volcano plot thresholds.
The volcano plot is interactive and you can select specific points to view the feature information, log2fold change, and -log10(p-value).
To recreate the volcano plot for the “TBStatusPTB” assay (which you may change by using any of the named results assays) differential expression result from our uncorrected TB data , run the following code:
value <- round((max(abs(
differential_expression[[length(differential_expression)]][, 1]))
+ min(
abs(differential_expression[[length(differential_expression)]][, 1])))
/ 2)
volcano_plot(DE_results = differential_expression[["TBStatusPTB"]],
pslider = 0.05,
fcslider = value)
You may download the data set that you have been working with int he shiny app, including the assays you added in the “Normalization” or “Batch Effect Correction” steps when you uploaded your data. A preview is provided that shows the Summarized Experiment (SE) object, including the dimensions, metadata table, assays, rownames and colNames.
Click the “Download” button and you can save this SE object to your computer for additional analysis outside of the BatchQC shiny app. The object is saved as a .RDS Summarized Experiment object in the folder of your choice. You should give it a descriptive name (default is “se.RDS”).
If you run code from the command line according to the above instructions, all assays will be saved to your original SE object and available for your use in other applications. You may save it as an SE object by running the following:
file_location <- "location/to/save/object"
saveRDS(object = se_object, file = file_location)
Many of these analyses can be run again on the batch corrected assay to see improvements in groupings based on your biological variable rather than the batch effect.
## R Under development (unstable) (2026-01-15 r89304)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] Biobase_2.71.0 S4Vectors_0.49.0 BiocGenerics_0.57.0
## [4] generics_0.1.4 curatedTBData_2.7.0 BatchQC_2.7.1
## [7] BiocStyle_2.39.0
##
## loaded via a namespace (and not attached):
## [1] shinythemes_1.2.0 splines_4.6.0
## [3] later_1.4.5 filelock_1.0.3
## [5] bitops_1.0-9 tibble_3.3.1
## [7] XML_3.99-0.20 httr2_1.2.2
## [9] Ckmeans.1d.dp_4.3.5 lifecycle_1.0.5
## [11] Rdpack_2.6.5 rstatix_0.7.3
## [13] edgeR_4.9.2 lattice_0.22-7
## [15] MASS_7.3-65 crosstalk_1.2.2
## [17] MultiAssayExperiment_1.37.2 backports_1.5.0
## [19] magrittr_2.0.4 plotly_4.12.0
## [21] limma_3.67.0 sass_0.4.10
## [23] rmarkdown_2.30 jquerylib_0.1.4
## [25] yaml_2.3.12 metapod_1.19.1
## [27] httpuv_1.6.16 otel_0.2.0
## [29] askpass_1.2.1 reticulate_1.44.1
## [31] reader_1.0.6 DBI_1.2.3
## [33] RColorBrewer_1.1-3 abind_1.4-8
## [35] GenomicRanges_1.63.1 purrr_1.2.1
## [37] rappdirs_0.3.4 sva_3.59.0
## [39] IRanges_2.45.0 irlba_2.3.7
## [41] genefilter_1.93.0 testthat_3.3.2
## [43] pheatmap_1.0.13 umap_0.2.10.0
## [45] RSpectra_0.16-2 annotate_1.89.0
## [47] dqrng_0.4.1 codetools_0.2-20
## [49] DelayedArray_0.37.0 scuttle_1.21.0
## [51] tidyselect_1.2.1 farver_2.1.2
## [53] ScaledMatrix_1.19.0 matrixStats_1.5.0
## [55] BiocFileCache_3.1.0 Seqinfo_1.1.0
## [57] jsonlite_2.0.0 BiocNeighbors_2.5.3
## [59] Formula_1.2-5 survival_3.8-6
## [61] tools_4.6.0 ggnewscale_0.5.2
## [63] Rcpp_1.1.1 glue_1.8.0
## [65] SparseArray_1.11.10 xfun_0.56
## [67] mgcv_1.9-4 DESeq2_1.51.6
## [69] MatrixGenerics_1.23.0 dplyr_1.2.0
## [71] withr_3.0.2 BiocManager_1.30.27
## [73] fastmap_1.2.0 bluster_1.21.0
## [75] shinyjs_2.1.1 openssl_2.3.4
## [77] caTools_1.18.3 digest_0.6.39
## [79] rsvd_1.0.5 R6_2.6.1
## [81] mime_0.13 gtools_3.9.5
## [83] dichromat_2.0-0.1 RSQLite_2.4.5
## [85] utf8_1.2.6 tidyr_1.3.2
## [87] data.table_1.18.2.1 FNN_1.1.4.1
## [89] htmlwidgets_1.6.4 httr_1.4.7
## [91] S4Arrays_1.11.1 pkgconfig_2.0.3
## [93] gtable_0.3.6 NCmisc_1.2.0
## [95] blob_1.3.0 S7_0.2.1
## [97] SingleCellExperiment_1.33.0 XVector_0.51.0
## [99] brio_1.1.5 htmltools_0.5.9
## [101] carData_3.0-6 bookdown_0.46
## [103] scales_1.4.0 tidyverse_2.0.0
## [105] blockmodeling_1.1.8 png_0.1-8
## [107] scran_1.39.0 EBSeq_2.9.0
## [109] ggdendro_0.2.0 knitr_1.51
## [111] reshape2_1.4.5 curl_7.0.0
## [113] nlme_3.1-168 cachem_1.1.0
## [115] stringr_1.6.0 BiocVersion_3.23.1
## [117] KernSmooth_2.23-26 parallel_4.6.0
## [119] Harman_1.39.0 AnnotationDbi_1.73.0
## [121] pillar_1.11.1 grid_4.6.0
## [123] vctrs_0.7.1 gplots_3.3.0
## [125] promises_1.5.0 ggpubr_0.6.2
## [127] car_3.1-5 BiocSingular_1.27.1
## [129] dbplyr_2.5.1 beachmat_2.27.2
## [131] xtable_1.8-4 cluster_2.1.8.1
## [133] evaluate_1.0.5 magick_2.9.0
## [135] tinytex_0.58 cli_3.6.5
## [137] locfit_1.5-9.12 compiler_4.6.0
## [139] rlang_1.1.7 crayon_1.5.3
## [141] ggsignif_0.6.4 labeling_0.4.3
## [143] plyr_1.8.9 stringi_1.8.7
## [145] viridisLite_0.4.2 BiocParallel_1.45.0
## [147] Biostrings_2.79.4 lazyeval_0.2.2
## [149] Matrix_1.7-4 ExperimentHub_3.1.0
## [151] RcppEigen_0.3.4.0.2 bit64_4.6.0-1
## [153] ggplot2_4.0.2 KEGGREST_1.51.1
## [155] statmod_1.5.1 shiny_1.12.1
## [157] SummarizedExperiment_1.41.0 AnnotationHub_4.1.0
## [159] rbibutils_2.4.1 igraph_2.2.1
## [161] broom_1.0.12 memoise_2.0.1
## [163] bslib_0.10.0 bit_4.6.0