---
title: "Topology-based Clustering in Seurat"
author:
  - name: Mattia Chiesa
    affiliation: 
    - Bioinformatics and Aritificial Intelligence facility, Centro Cardiologico Monzino, IRCCS, Milan, Italy
    - Department of Electronics, Information and Biomedical Engineering, Politecnico di Milano, Milan, Italy
    email: mattia.chiesa@cardiologicomonzino.it
  - name: Laura Ballarini
    affiliation:   
    - Bioinformatics and Aritificial Intelligence facility, Centro Cardiologico Monzino, IRCCS, Milan, Italy
    - Dipartimento di Ingegneria Industrale e dell Informazione, Universita' degli studi di Pavia, Pavia, Italy
  - name: Alessia Gerbasi
    affiliation:  Dipartimento di Ingegneria Industrale e dell Informazione, Universita' degli studi di Pavia, Pavia, Italy
  - name: Giuseppe Albi
    affiliation: Dipartimento di Ingegneria Industrale e dell Informazione, Universita' degli studi di Pavia, Pavia, Italy
  - name: Arianna Dagliati
    affiliation: Dipartimento di Ingegneria Industrale e dell Informazione, Universita' degli studi di Pavia, Pavia, Italy
  - name: Luca Piacentini
    affiliation: Bioinformatics and Aritificial Intelligence facility, Centro Cardiologico Monzino, IRCCS, Milan, Italy
  - name: Carlo Leonardi
    affiliation: Wellcome Sanger Institute, Cambridge, UK  

vignette_author:
  - name: Carlo Leonardi
    affiliation: Wellcome Sanger Institute, Cambridge, UK
    email: cl35@sanger.ac.uk
package: PIUMA
output:
  BiocStyle::html_document:
    toc_float: true
bibliography: library.bib
vignette: |
  %\VignetteIndexEntry{PIUMA package and Seurat}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  markdown: 
    wrap: 80
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```


# Introduction

This tutorial shows how to seamlessly integrate PIUMA 
into the Seurat workflow for scRNA-seq, using datasets from SeuratData, finding 
clusters cells with PIUMA and, then, performing downstream analyses again with Seurat. 
The update has been inspired by this book [@2023shapeofdata].

# Installation

PIUMA can be installed by:

```{r install-package, eval=FALSE}
if (!require("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("PIUMA")
```

# Scope of this Vignette

In this vignette, we demonstrate how to integrate PIUMA’s TDA–based 
geometry–informed clustering into a Seurat workflow for single‐cell RNA‐seq 
data. Due to compilation time constrains of this vignette, we did not execute the chunks of this vignette, thus the user will not see the generated plots.

## Seurat pbmc3k testing data

As a practical demo of PIUMA-Seurat integration, we’ll use the pbmc3k dataset.
The idea is simple: run your usual Seurat preprocessing, then hand the data to 
PIUMA for TDA and process again with Seurat.
Below we fetch the data and perform the standard scRNA-seq steps 
(QC, normalization, feature selection, scaling, and dimensional reduction) 
before moving on to Mapper.

```{r,seurat-0, fig.width=10, fig.height=10, warning=FALSE, eval=FALSE,message=FALSE}
set.seed(42)
library(Seurat)
library(SingleCellExperiment)
library(SummarizedExperiment)
library(PIUMA)
#devtools::install_github('satijalab/seurat-data')
library(SeuratData)

SeuratData::InstallData("pbmc3k")
pbmc3k <- SeuratData::LoadData("pbmc3k")

pbmc3k <- UpdateSeuratObject(pbmc3k)
pbmc3k <- NormalizeData(pbmc3k)
pbmc3k <- ScaleData(pbmc3k)
pbmc3k <- FindVariableFeatures(pbmc3k)
pbmc3k <- RunPCA(pbmc3k)
pbmc3k <- RunUMAP(pbmc3k,dims = 1:20)
DimPlot(pbmc3k,group.by = 'seurat_annotations')
```

The cells are colored according to cell types. Let's now convert the S4 Seurat
file into a SummarizedExperiment format, prior to converting it to TDAObj.

```{r,seurat-1, fig.width=10, fig.height=10, warning=FALSE, eval=FALSE,message=FALSE}
set.seed(42)
counts_mat <- pbmc3k@assays$RNA@layers$counts
logcounts_mat <- pbmc3k@assays$RNA@layers$data

colnames(counts_mat) <- rownames(pbmc3k@meta.data)
colnames(logcounts_mat) <- rownames(pbmc3k@meta.data)

rownames(counts_mat) <- rownames(pbmc3k@assays$RNA@features@.Data)
rownames(logcounts_mat) <- rownames(pbmc3k@assays$RNA@features@.Data)

counts_dense    <- as.matrix(counts_mat)
logcounts_dense <- as.matrix(logcounts_mat)
cell_meta     <- pbmc3k@meta.data

gene_meta     <- DataFrame(
  gene_id = rownames(counts_mat),
  row.names = rownames(counts_mat)
)

# assemble the SummarizedExperiment
se <- SummarizedExperiment(
  assays   = list(
    counts    = counts_dense,
    logcounts = logcounts_dense
  ),
  colData = cell_meta,
  rowData = gene_meta
)

#refine Seurat annotation
colData(se)$seurat_annotations <- factor(
  ifelse(
    is.na(colData(se)$seurat_annotations),
    "Unknown",
    as.character(colData(se)$seurat_annotations)
  )
)

# Create a PIUMA TDAobj from the SummarizedExperiment
tda_obj <- makeTDAobjFromSE(se,'seurat_annotations')
```

Now we can proceed to run the PIUMA TDA pipeline, using preprocessed dimensional
reductions of Seurat.


```{r,seurat-2, fig.width=10, fig.height=10, warning=FALSE, eval=FALSE,message=FALSE}
set.seed(42)
library(umap)
tda_obj <- dfToDistance(tda_obj, distMethod = "euclidean")
umap <- pbmc3k@reductions$umap@cell.embeddings
colnames(umap) <- c('comp1','comp2')

tda_obj@comp <- as.data.frame(umap)
```

Perfect! Now we have everything set up for our mapperCore() TDA call. 

## PIUMA TDA clustering 

With our hyperparameters (nBins = 33, overlap = 0.15), we can now 
proceed with the standard PIUMA workflow.

```{r,seurat-4, fig.width=10, fig.height=10, warning=FALSE, eval=FALSE,message=FALSE}
set.seed(42)
tda_obj <- mapperCore(x = tda_obj,
                      nBins = 33,
                      overlap = 0.15)
tda_obj <- jaccardMatrix(tda_obj)
tda_obj <- setGraph(tda_obj)
tda_obj <- predict_mapper_class(tda_obj)
tda_obj <- autoClusterMapper(tda_obj,method = 'walktrap')
```

Now we can go back to the Seurat object:

```{r,seurat-5, fig.width=20, fig.height=10, warning=FALSE, eval=FALSE,message=FALSE}
set.seed(42)
clusters_piuma <- tda_obj@clustering$obs_cluster
all_cells     <- rownames(pbmc3k@meta.data)
piuma_clusters <- setNames(
  rep(NA_integer_, length(all_cells)),
  all_cells
)
piuma_clusters[ clusters_piuma$obs ] <- clusters_piuma$cluster
pbmc3k@meta.data$PIUMA_clusters <- as.factor(piuma_clusters)

DimPlot(pbmc3k,group.by = 'PIUMA_clusters',label = TRUE,label.box = TRUE,repel = TRUE)+
  DimPlot(pbmc3k,group.by = 'seurat_annotations',label = TRUE,label.box = TRUE,repel = TRUE)

```

Here, we are showing the identified clusters from TDA compared to the original
annotations. Some clusters mostly overlap with Seurat clusters, such as the subdivision between
Naive CD4 T cells and Memory CD4 T cells, while other clusters, such as cluster 
11 in PIUMA, do not have a clear corresponding sub-cluster.

## Biological Validation: GZMK+ CD8+ T Subset

Let's investigate what PIUMA has uncovered in this dataset:

```{r,seurat-6, fig.width=10, fig.height=10, warning=FALSE, eval=FALSE,message=FALSE,fig.cap = "(1) Feature plot of GZMK in the original UMAP space (2) Violin plots of GZMK expression across Seurat‐annotated cell types in the pbmc3k dataset. Within each cell type, violins are split to compare cells belonging to PIUMA/TDA Cluster 12 versus all other cells. PIUMA Cluster 12 selectively enriches for a GZMK+ CD8+ T-cell subpopulation"}
set.seed(42)
library(dplyr)
library(ggplot2)

Idents(pbmc3k) <- 'PIUMA_clusters'
markers_12 <- FindMarkers(pbmc3k,
                         ident.1    = 12,
                         group.by   = "PIUMA_clusters",
                         test.use   = "wilcox",
                         logfc.threshold = 0.25,
                         only.pos=TRUE)
markers_12_top <- markers_12 %>% filter(pct.1 - pct.2 > 0.7)                      

FeaturePlot(pbmc3k,features=rownames(markers_12_top),alpha = 1)+ggplot2::ggtitle('Expression of Granzime K')
pbmc3k$GZMK_pos <- FetchData(pbmc3k, vars = "GZMK")[,1] > 2
pbmc3k$GZMK_pos <- factor(pbmc3k$GZMK_pos,
                          levels = c(FALSE, TRUE),
                          labels = c("GZMK–", "GZMK+"))
pbmc3k$Cluster_12 <- pbmc3k$PIUMA_clusters==12
pbmc3k_sub <- subset(pbmc3k, subset = !is.na(seurat_annotations))

VlnPlot(pbmc3k_sub,
        features = "GZMK",
        group.by = "seurat_annotations",
        split.by = "Cluster_12",
        pt.size  = 0,
        assay    = "RNA"
) +
  ggplot2::scale_fill_manual(
    name   = "PIUMA Cluster 12",
    values = c("FALSE" = "grey80", "TRUE" = "steelblue"),
    labels = c("Other cells", "Cluster 12")
  ) +
  ggplot2::guides(fill = ggplot2::guide_legend(override.aes = list(alpha = 1))) +
  ggplot2::theme(
    axis.text.x     = ggplot2::element_text(angle = 45, hjust = 1),
    legend.position = "right"
  ) +
  ggplot2::labs(
    title = "TDA Cluster 12 identifies a GZMK+ CD8+ T subset",
    x     = "Seurat annotation",
    y     = "GZMK expression"
  )

```

PIUMA cluster 12 selectively isolates a 
GZMK+ T-cell sub-population. While standard annotation already captures 
GZMK+ CD4+ T cells, PIUMA uniquely uncovers a discrete GZMK+ CD8+ T-cell 
subset that the original clustering missed. 

## Quantitative Comparison

Having demonstrated PIUMA’s ability to pinpoint biologically meaningful cellular
subtypes, we now turn to a quantitative comparison with the standard annotation:

```{r,seurat-7, fig.width=10, fig.height=10, warning=FALSE, eval=FALSE,message=FALSE,fig.cap = "Quantitative comparison of PIUMA‐derived clusters against the original Seurat annotations, showing Adjusted Rand Index (ARI) = 0.571 and Normalized Mutual Information (NMI) = 0.68 (computed on cells with non-missing labels). These values indicate strong overall concordance: major cell‐type compartments are preserved, while leaving substantial “wiggle room” (<1.0) for PIUMA to unearth novel substructure (e.g. the GZMK+ CD8+ T-cell subset in cluster 8)."}
set.seed(42)
library(mclust)
library(aricode)
library(viridis)
valid   <- which(!is.na(pbmc3k$seurat_annotations) & !is.na(pbmc3k$PIUMA_clusters))
ari     <- mclust::adjustedRandIndex(pbmc3k$seurat_annotations[valid],
                             pbmc3k$PIUMA_clusters[valid])
nmi     <- aricode::NMI(pbmc3k$seurat_annotations[valid],
               pbmc3k$PIUMA_clusters[valid])
df <- data.frame(
  Metric = c("ARI","NMI"),
  Value  = c(ari, nmi)
)

ggplot2::ggplot(df, ggplot2::aes(x = Metric, y = 1, fill = Value)) +
  ggplot2::geom_tile(color = "white", width = 0.9, height = 0.9) +
  ggplot2::geom_text(aes(label = sprintf("%.3f", Value)),
            color = "white", size = 6) +
  viridis::scale_fill_viridis(option = "magma", limits = c(0, 1)) +
  coord_fixed(ratio = 1) +      # force squares
  ggplot2::theme_minimal() +
  ggplot2::theme(
    axis.title    = element_blank(),
    axis.text     = element_blank(),
    axis.ticks    = element_blank(),
    panel.grid    = element_blank(),
    plot.title    = element_text(hjust = 0.5)
  ) +
  ggplot2::labs(
    fill  = "Score",
    title = "Clustering Concordance (ARI & NMI)"
  )
```

We observe an ARI of 0.571 and an NMI of 0.68 
between the original Seurat labels and the PIUMA TDA clusters. If PIUMA was 
merely recapitulating the standard annotation, both metrics would approach 1.0. 
Their lower values here instead reflect PIUMA’s reorganization of cells, 
precisely the flexibility needed to reveal previously unrecognized cellular 
subpopulations.

## Conclusion

This vignette demonstrates that PIUMA achieves strong concordance with 
Seurat’s standard annotations (ARI = 0.571, NMI = 0.68) while offering the 
flexibility to discover novel cell‐state subpopulations, such as a GZMK+ CD8+ 
T‐cell cluster unseen by conventional methods.

# Session Info {.unnumbered}

```{r session_info}
sessionInfo()
```

# References {.unnumbered}