--- title: "Topology-based Clustering in Seurat" author: - name: Mattia Chiesa affiliation: - Bioinformatics and Aritificial Intelligence facility, Centro Cardiologico Monzino, IRCCS, Milan, Italy - Department of Electronics, Information and Biomedical Engineering, Politecnico di Milano, Milan, Italy email: mattia.chiesa@cardiologicomonzino.it - name: Laura Ballarini affiliation: - Bioinformatics and Aritificial Intelligence facility, Centro Cardiologico Monzino, IRCCS, Milan, Italy - Dipartimento di Ingegneria Industrale e dell Informazione, Universita' degli studi di Pavia, Pavia, Italy - name: Alessia Gerbasi affiliation: Dipartimento di Ingegneria Industrale e dell Informazione, Universita' degli studi di Pavia, Pavia, Italy - name: Giuseppe Albi affiliation: Dipartimento di Ingegneria Industrale e dell Informazione, Universita' degli studi di Pavia, Pavia, Italy - name: Arianna Dagliati affiliation: Dipartimento di Ingegneria Industrale e dell Informazione, Universita' degli studi di Pavia, Pavia, Italy - name: Luca Piacentini affiliation: Bioinformatics and Aritificial Intelligence facility, Centro Cardiologico Monzino, IRCCS, Milan, Italy - name: Carlo Leonardi affiliation: Wellcome Sanger Institute, Cambridge, UK vignette_author: - name: Carlo Leonardi affiliation: Wellcome Sanger Institute, Cambridge, UK email: cl35@sanger.ac.uk package: PIUMA output: BiocStyle::html_document: toc_float: true bibliography: library.bib vignette: | %\VignetteIndexEntry{PIUMA package and Seurat} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: markdown: wrap: 80 --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` # Introduction This tutorial shows how to seamlessly integrate PIUMA into the Seurat workflow for scRNA-seq, using datasets from SeuratData, finding clusters cells with PIUMA and, then, performing downstream analyses again with Seurat. The update has been inspired by this book [@2023shapeofdata]. # Installation PIUMA can be installed by: ```{r install-package, eval=FALSE} if (!require("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install("PIUMA") ``` # Scope of this Vignette In this vignette, we demonstrate how to integrate PIUMA’s TDA–based geometry–informed clustering into a Seurat workflow for single‐cell RNA‐seq data. Due to compilation time constrains of this vignette, we did not execute the chunks of this vignette, thus the user will not see the generated plots. ## Seurat pbmc3k testing data As a practical demo of PIUMA-Seurat integration, we’ll use the pbmc3k dataset. The idea is simple: run your usual Seurat preprocessing, then hand the data to PIUMA for TDA and process again with Seurat. Below we fetch the data and perform the standard scRNA-seq steps (QC, normalization, feature selection, scaling, and dimensional reduction) before moving on to Mapper. ```{r,seurat-0, fig.width=10, fig.height=10, warning=FALSE, eval=FALSE,message=FALSE} set.seed(42) library(Seurat) library(SingleCellExperiment) library(SummarizedExperiment) library(PIUMA) #devtools::install_github('satijalab/seurat-data') library(SeuratData) SeuratData::InstallData("pbmc3k") pbmc3k <- SeuratData::LoadData("pbmc3k") pbmc3k <- UpdateSeuratObject(pbmc3k) pbmc3k <- NormalizeData(pbmc3k) pbmc3k <- ScaleData(pbmc3k) pbmc3k <- FindVariableFeatures(pbmc3k) pbmc3k <- RunPCA(pbmc3k) pbmc3k <- RunUMAP(pbmc3k,dims = 1:20) DimPlot(pbmc3k,group.by = 'seurat_annotations') ``` The cells are colored according to cell types. Let's now convert the S4 Seurat file into a SummarizedExperiment format, prior to converting it to TDAObj. ```{r,seurat-1, fig.width=10, fig.height=10, warning=FALSE, eval=FALSE,message=FALSE} set.seed(42) counts_mat <- pbmc3k@assays$RNA@layers$counts logcounts_mat <- pbmc3k@assays$RNA@layers$data colnames(counts_mat) <- rownames(pbmc3k@meta.data) colnames(logcounts_mat) <- rownames(pbmc3k@meta.data) rownames(counts_mat) <- rownames(pbmc3k@assays$RNA@features@.Data) rownames(logcounts_mat) <- rownames(pbmc3k@assays$RNA@features@.Data) counts_dense <- as.matrix(counts_mat) logcounts_dense <- as.matrix(logcounts_mat) cell_meta <- pbmc3k@meta.data gene_meta <- DataFrame( gene_id = rownames(counts_mat), row.names = rownames(counts_mat) ) # assemble the SummarizedExperiment se <- SummarizedExperiment( assays = list( counts = counts_dense, logcounts = logcounts_dense ), colData = cell_meta, rowData = gene_meta ) #refine Seurat annotation colData(se)$seurat_annotations <- factor( ifelse( is.na(colData(se)$seurat_annotations), "Unknown", as.character(colData(se)$seurat_annotations) ) ) # Create a PIUMA TDAobj from the SummarizedExperiment tda_obj <- makeTDAobjFromSE(se,'seurat_annotations') ``` Now we can proceed to run the PIUMA TDA pipeline, using preprocessed dimensional reductions of Seurat. ```{r,seurat-2, fig.width=10, fig.height=10, warning=FALSE, eval=FALSE,message=FALSE} set.seed(42) library(umap) tda_obj <- dfToDistance(tda_obj, distMethod = "euclidean") umap <- pbmc3k@reductions$umap@cell.embeddings colnames(umap) <- c('comp1','comp2') tda_obj@comp <- as.data.frame(umap) ``` Perfect! Now we have everything set up for our mapperCore() TDA call. ## PIUMA TDA clustering With our hyperparameters (nBins = 33, overlap = 0.15), we can now proceed with the standard PIUMA workflow. ```{r,seurat-4, fig.width=10, fig.height=10, warning=FALSE, eval=FALSE,message=FALSE} set.seed(42) tda_obj <- mapperCore(x = tda_obj, nBins = 33, overlap = 0.15) tda_obj <- jaccardMatrix(tda_obj) tda_obj <- setGraph(tda_obj) tda_obj <- predict_mapper_class(tda_obj) tda_obj <- autoClusterMapper(tda_obj,method = 'walktrap') ``` Now we can go back to the Seurat object: ```{r,seurat-5, fig.width=20, fig.height=10, warning=FALSE, eval=FALSE,message=FALSE} set.seed(42) clusters_piuma <- tda_obj@clustering$obs_cluster all_cells <- rownames(pbmc3k@meta.data) piuma_clusters <- setNames( rep(NA_integer_, length(all_cells)), all_cells ) piuma_clusters[ clusters_piuma$obs ] <- clusters_piuma$cluster pbmc3k@meta.data$PIUMA_clusters <- as.factor(piuma_clusters) DimPlot(pbmc3k,group.by = 'PIUMA_clusters',label = TRUE,label.box = TRUE,repel = TRUE)+ DimPlot(pbmc3k,group.by = 'seurat_annotations',label = TRUE,label.box = TRUE,repel = TRUE) ``` Here, we are showing the identified clusters from TDA compared to the original annotations. Some clusters mostly overlap with Seurat clusters, such as the subdivision between Naive CD4 T cells and Memory CD4 T cells, while other clusters, such as cluster 11 in PIUMA, do not have a clear corresponding sub-cluster. ## Biological Validation: GZMK+ CD8+ T Subset Let's investigate what PIUMA has uncovered in this dataset: ```{r,seurat-6, fig.width=10, fig.height=10, warning=FALSE, eval=FALSE,message=FALSE,fig.cap = "(1) Feature plot of GZMK in the original UMAP space (2) Violin plots of GZMK expression across Seurat‐annotated cell types in the pbmc3k dataset. Within each cell type, violins are split to compare cells belonging to PIUMA/TDA Cluster 12 versus all other cells. PIUMA Cluster 12 selectively enriches for a GZMK+ CD8+ T-cell subpopulation"} set.seed(42) library(dplyr) library(ggplot2) Idents(pbmc3k) <- 'PIUMA_clusters' markers_12 <- FindMarkers(pbmc3k, ident.1 = 12, group.by = "PIUMA_clusters", test.use = "wilcox", logfc.threshold = 0.25, only.pos=TRUE) markers_12_top <- markers_12 %>% filter(pct.1 - pct.2 > 0.7) FeaturePlot(pbmc3k,features=rownames(markers_12_top),alpha = 1)+ggplot2::ggtitle('Expression of Granzime K') pbmc3k$GZMK_pos <- FetchData(pbmc3k, vars = "GZMK")[,1] > 2 pbmc3k$GZMK_pos <- factor(pbmc3k$GZMK_pos, levels = c(FALSE, TRUE), labels = c("GZMK–", "GZMK+")) pbmc3k$Cluster_12 <- pbmc3k$PIUMA_clusters==12 pbmc3k_sub <- subset(pbmc3k, subset = !is.na(seurat_annotations)) VlnPlot(pbmc3k_sub, features = "GZMK", group.by = "seurat_annotations", split.by = "Cluster_12", pt.size = 0, assay = "RNA" ) + ggplot2::scale_fill_manual( name = "PIUMA Cluster 12", values = c("FALSE" = "grey80", "TRUE" = "steelblue"), labels = c("Other cells", "Cluster 12") ) + ggplot2::guides(fill = ggplot2::guide_legend(override.aes = list(alpha = 1))) + ggplot2::theme( axis.text.x = ggplot2::element_text(angle = 45, hjust = 1), legend.position = "right" ) + ggplot2::labs( title = "TDA Cluster 12 identifies a GZMK+ CD8+ T subset", x = "Seurat annotation", y = "GZMK expression" ) ``` PIUMA cluster 12 selectively isolates a GZMK+ T-cell sub-population. While standard annotation already captures GZMK+ CD4+ T cells, PIUMA uniquely uncovers a discrete GZMK+ CD8+ T-cell subset that the original clustering missed. ## Quantitative Comparison Having demonstrated PIUMA’s ability to pinpoint biologically meaningful cellular subtypes, we now turn to a quantitative comparison with the standard annotation: ```{r,seurat-7, fig.width=10, fig.height=10, warning=FALSE, eval=FALSE,message=FALSE,fig.cap = "Quantitative comparison of PIUMA‐derived clusters against the original Seurat annotations, showing Adjusted Rand Index (ARI) = 0.571 and Normalized Mutual Information (NMI) = 0.68 (computed on cells with non-missing labels). These values indicate strong overall concordance: major cell‐type compartments are preserved, while leaving substantial “wiggle room” (<1.0) for PIUMA to unearth novel substructure (e.g. the GZMK+ CD8+ T-cell subset in cluster 8)."} set.seed(42) library(mclust) library(aricode) library(viridis) valid <- which(!is.na(pbmc3k$seurat_annotations) & !is.na(pbmc3k$PIUMA_clusters)) ari <- mclust::adjustedRandIndex(pbmc3k$seurat_annotations[valid], pbmc3k$PIUMA_clusters[valid]) nmi <- aricode::NMI(pbmc3k$seurat_annotations[valid], pbmc3k$PIUMA_clusters[valid]) df <- data.frame( Metric = c("ARI","NMI"), Value = c(ari, nmi) ) ggplot2::ggplot(df, ggplot2::aes(x = Metric, y = 1, fill = Value)) + ggplot2::geom_tile(color = "white", width = 0.9, height = 0.9) + ggplot2::geom_text(aes(label = sprintf("%.3f", Value)), color = "white", size = 6) + viridis::scale_fill_viridis(option = "magma", limits = c(0, 1)) + coord_fixed(ratio = 1) + # force squares ggplot2::theme_minimal() + ggplot2::theme( axis.title = element_blank(), axis.text = element_blank(), axis.ticks = element_blank(), panel.grid = element_blank(), plot.title = element_text(hjust = 0.5) ) + ggplot2::labs( fill = "Score", title = "Clustering Concordance (ARI & NMI)" ) ``` We observe an ARI of 0.571 and an NMI of 0.68 between the original Seurat labels and the PIUMA TDA clusters. If PIUMA was merely recapitulating the standard annotation, both metrics would approach 1.0. Their lower values here instead reflect PIUMA’s reorganization of cells, precisely the flexibility needed to reveal previously unrecognized cellular subpopulations. ## Conclusion This vignette demonstrates that PIUMA achieves strong concordance with Seurat’s standard annotations (ARI = 0.571, NMI = 0.68) while offering the flexibility to discover novel cell‐state subpopulations, such as a GZMK+ CD8+ T‐cell cluster unseen by conventional methods. # Session Info {.unnumbered} ```{r session_info} sessionInfo() ``` # References {.unnumbered}