This vignette describes the steps to generate supercells for cytometry data using the SuperCellCyto R package.
Briefly, supercells are “mini” clusters of cells that are similar in their marker expressions. The motivation behind supercells is that instead of analysing millions of individual cells, you can analyse thousands of supercells, making downstream analysis much faster while maintaining biological interpretability.
See the other vignettes for further guidance.
You can install the stable version of SuperCellCyto from Bioconductor using:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("SuperCellCyto")
For the latest development version, you can install it from GitHub using pak.
The function that creates supercells is called runSuperCellCyto, and it operates on a data.table object, an enhanced version of R’s native data.frame.
In addition to needing the data stored in a data.table object, it also requires the markers to have been transformed beforehand, and the data to contain a cell ID column and a sample column. Note that runSuperCellCyto does not perform any data transformation or scaling.
If you are not sure how to import CSV or FCS files into a data.table object, and/or how to subsequently prepare the object for SuperCellCyto, please consult this vignette. In that vignette, we also explain why we need the cell ID and sample columns.
For this vignette, we will simulate some toy data using the
simCytoData function. Specifically, we will simulate 15
markers and 3 samples, with each sample containing 10,000 cells. Hence
in total, we will have a toy dataset containing 15 markers and 30,000
cells.
# Load the package that provides simCytoData and runSuperCellCyto
library(SuperCellCyto)
n_markers <- 15
n_samples <- 3
dat <- simCytoData(nmarkers = n_markers, ncells = rep(10000, n_samples))
head(dat)
#> Marker_1 Marker_2 Marker_3 Marker_4 Marker_5 Marker_6 Marker_7 Marker_8
#> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 14.78278 11.567102 18.69865 13.76314 9.245221 7.727297 7.884639 6.850134
#> 2: 14.02632 9.129753 18.43311 13.21190 5.309538 6.765459 7.385760 6.355473
#> 3: 12.98504 9.206499 19.37516 16.55864 5.578001 9.078976 8.422849 6.768931
#> 4: 12.18127 8.764114 17.72421 15.22696 6.139471 7.929010 5.560673 7.781486
#> 5: 14.06126 10.362235 19.23445 14.91916 6.993390 8.372682 7.499620 8.083946
#> 6: 12.61782 8.892380 17.59964 13.01365 6.826958 8.978095 7.700187 7.035582
#> Marker_9 Marker_10 Marker_11 Marker_12 Marker_13 Marker_14 Marker_15
#> <num> <num> <num> <num> <num> <num> <num>
#> 1: 8.923444 14.94594 11.54265 14.29901 10.620021 7.021821 8.755222
#> 2: 9.757331 12.68158 13.67492 14.22971 8.966575 5.445757 5.999378
#> 3: 11.558556 14.44387 14.31546 14.40503 10.315988 6.252150 7.984758
#> 4: 9.465287 14.94592 12.93958 14.05161 7.536462 6.085299 8.046376
#> 5: 8.554612 11.20589 12.14442 14.09274 9.391597 5.343297 6.336340
#> 6: 8.355508 14.62260 13.36571 13.61676 9.618760 5.159058 6.666807
#> Sample Cell_Id
#> <char> <char>
#> 1: Sample_1 Cell_1
#> 2: Sample_1 Cell_2
#> 3: Sample_1 Cell_3
#> 4: Sample_1 Cell_4
#> 5: Sample_1 Cell_5
#> 6: Sample_1 Cell_6
For our toy dataset, we will transform our data using the arcsinh transformation. We will use the base R asinh function to do this:
# Specify which columns are the markers to transform
marker_cols <- paste0("Marker_", seq_len(n_markers))
# The co-factor for arc-sinh
cofactor <- 5
# Do the transformation
dat_asinh <- asinh(dat[, marker_cols, with = FALSE] / cofactor)
# Rename the new columns
marker_cols_asinh <- paste0(marker_cols, "_asinh")
names(dat_asinh) <- marker_cols_asinh
# Add them to our previously loaded data
dat <- cbind(dat, dat_asinh)
head(dat[, marker_cols_asinh, with = FALSE])
#> Marker_1_asinh Marker_2_asinh Marker_3_asinh Marker_4_asinh Marker_5_asinh
#> <num> <num> <num> <num> <num>
#> 1: 1.804618 1.575616 2.029575 1.737175 1.3740137
#> 2: 1.754998 1.362974 2.015764 1.698850 0.9244785
#> 3: 1.682669 1.370324 2.063949 1.912670 0.9608002
#> 4: 1.623292 1.327297 1.977967 1.832704 1.0337088
#> 5: 1.757342 1.475573 2.056893 1.813321 1.1372134
#> 6: 1.655947 1.339940 1.971181 1.684723 1.1177003
#> Marker_6_asinh Marker_7_asinh Marker_8_asinh Marker_9_asinh Marker_10_asinh
#> <num> <num> <num> <num> <num>
#> 1: 1.219718 1.2366913 1.120436 1.342981 1.815022
#> 2: 1.110411 1.1820223 1.060703 1.421717 1.660635
#> 3: 1.358086 1.2929705 1.110824 1.574938 1.782674
#> 4: 1.241434 0.9584849 1.225591 1.394760 1.815020
#> 5: 1.287838 1.1947211 1.257848 1.306345 1.546569
#> 6: 1.348311 1.2167688 1.142111 1.286075 1.794303
#> Marker_11_asinh Marker_12_asinh Marker_13_asinh Marker_14_asinh
#> <num> <num> <num> <num>
#> 1: 1.573675 1.773154 1.497755 1.1405160
#> 2: 1.731134 1.768569 1.347190 0.9430297
#> 3: 1.774239 1.780130 1.471546 1.0478616
#> 4: 1.679397 1.756695 1.198802 1.0268489
#> 5: 1.620491 1.759449 1.387855 0.9290995
#> 6: 1.709682 1.727132 1.409007 0.9036898
#> Marker_15_asinh
#> <num>
#> 1: 1.326416
#> 2: 1.015893
#> 3: 1.247367
#> 4: 1.253889
#> 5: 1.058335
#> 6: 1.098629
We will also create a column Cell_id_dummy which uniquely identifies each cell. It will have values such as Cell_1, Cell_2, all the way until Cell_x where x is the number of cells in the dataset.
dat$Cell_id_dummy <- paste0("Cell_", seq_len(nrow(dat)))
head(dat$Cell_id_dummy, n = 10)
#> [1] "Cell_1" "Cell_2" "Cell_3" "Cell_4" "Cell_5" "Cell_6" "Cell_7"
#> [8] "Cell_8" "Cell_9" "Cell_10"
By default, the simCytoData function generates cells for multiple samples, and the resulting data.table object already has a column called Sample that denotes the sample each cell comes from.
Let’s take note of the sample and cell ID columns for later.
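The calls further down refer to these two columns through the variables sample_col and cell_id_col. A minimal sketch of recording them, assuming we use the Sample column generated by simCytoData and the Cell_id_dummy column created above:

```r
# Column denoting which sample each cell comes from
sample_col <- "Sample"
# Column that uniquely identifies each cell
# (assumption: we use the dummy ID column created above)
cell_id_col <- "Cell_id_dummy"
```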
Now that we have our data, let’s create some supercells. To do this, we will use the runSuperCellCyto function and pass the markers, sample and cell ID columns as parameters.
We need to specify the markers because the function will create supercells based only on the expression of those markers. We highly recommend creating supercells using all markers in your data, be they cell type or cell state markers. However, if for any reason you want to use only a subset of the markers in your data, make sure you specify them in a vector that you later pass to the runSuperCellCyto function.
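As an illustration of building such a vector, the sketch below drops two markers from the full set. The marker names follow the toy data; the choice of which markers to leave out is purely hypothetical.

```r
# All arcsinh-transformed marker columns, named as in the toy data
all_markers <- paste0("Marker_", seq_len(15), "_asinh")
# Hypothetical markers to leave out (for illustration only)
markers_to_skip <- c("Marker_14_asinh", "Marker_15_asinh")
# The vector we would pass to the markers parameter
markers_subset <- setdiff(all_markers, markers_to_skip)
length(markers_subset)
```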
For this tutorial, we will use all the arcsinh transformed markers in the toy data.
supercells <- runSuperCellCyto(
    dt = dat,
    markers = marker_cols_asinh,
    sample_colname = sample_col,
    cell_id_colname = cell_id_col
)
Let’s dig deeper into the object it created:
It is a list containing 3 elements:
names(supercells)
#> [1] "supercell_expression_matrix" "supercell_cell_map"
#> [3] "supercell_object"
The supercell_object contains the metadata used to create the supercells. It is a list, and each element contains the metadata used to create the supercells for a sample. This will come in handy if we need to either regenerate the supercells using different gamma values (so we get more or fewer supercells) or do some debugging later down the line. More on regenerating supercells in the Controlling supercells granularity section below.
The supercell_expression_matrix contains the marker
expression of each supercell. These are calculated by taking the average
of the marker expression of all the cells contained within a
supercell.
head(supercells$supercell_expression_matrix)
#> Marker_1_asinh Marker_2_asinh Marker_3_asinh Marker_4_asinh Marker_5_asinh
#> <num> <num> <num> <num> <num>
#> 1: 1.654041 1.458318 1.987401 1.779596 0.9881635
#> 2: 1.676502 1.435836 1.996955 1.791418 1.1193335
#> 3: 1.663106 1.417878 2.000335 1.803773 0.8263119
#> 4: 1.675163 1.403706 1.993347 1.773286 1.1474580
#> 5: 1.711270 1.340816 2.019482 1.820541 1.1179375
#> 6: 1.684951 1.279524 1.995874 1.796703 1.1879662
#> Marker_6_asinh Marker_7_asinh Marker_8_asinh Marker_9_asinh Marker_10_asinh
#> <num> <num> <num> <num> <num>
#> 1: 1.342093 1.1604070 1.195228 1.340560 1.737991
#> 2: 1.408708 0.9539824 1.232327 1.350786 1.763375
#> 3: 1.351761 1.1289082 1.016328 1.372125 1.759001
#> 4: 1.352578 0.9933325 1.179617 1.356472 1.749128
#> 5: 1.223357 1.0255913 1.193484 1.352087 1.773609
#> 6: 1.353466 1.0661695 1.275910 1.326281 1.762229
#> Marker_11_asinh Marker_12_asinh Marker_13_asinh Marker_14_asinh
#> <num> <num> <num> <num>
#> 1: 1.676863 1.704090 1.430442 0.8675931
#> 2: 1.717376 1.728814 1.415665 1.0746079
#> 3: 1.727630 1.753401 1.326540 0.8652571
#> 4: 1.695803 1.705386 1.391143 0.8304150
#> 5: 1.737982 1.753675 1.298775 0.9139457
#> 6: 1.718693 1.721333 1.413943 0.9450858
#> Marker_15_asinh Sample SuperCellId
#> <num> <char> <char>
#> 1: 1.221217 Sample_1 SuperCell_1_Sample_Sample_1
#> 2: 1.267439 Sample_1 SuperCell_2_Sample_Sample_1
#> 3: 1.187305 Sample_1 SuperCell_3_Sample_Sample_1
#> 4: 1.166070 Sample_1 SuperCell_4_Sample_Sample_1
#> 5: 1.197264 Sample_1 SuperCell_5_Sample_Sample_1
#> 6: 1.206566 Sample_1 SuperCell_6_Sample_Sample_1
Therein, we will have the following columns:
- The markers we specified when creating the supercells (stored in the marker_cols_asinh variable). In this example, they are the arcsinh transformed markers in our toy data.
- A column (Sample in this case) denoting which sample a supercell belongs to (note the column name is the same as what is stored in the sample_col variable).
- The SuperCellId column denoting the unique ID of the supercell.
Let’s have a look at SuperCellId:
head(unique(supercells$supercell_expression_matrix$SuperCellId))
#> [1] "SuperCell_1_Sample_Sample_1" "SuperCell_2_Sample_Sample_1"
#> [3] "SuperCell_3_Sample_Sample_1" "SuperCell_4_Sample_Sample_1"
#> [5] "SuperCell_5_Sample_Sample_1" "SuperCell_6_Sample_Sample_1"
Let’s break down one of them, SuperCell_1_Sample_Sample_1. SuperCell_1 is a number (1 to however many supercells there are in a sample) used to uniquely identify each supercell within a sample. Notably, you may encounter the same prefix (SuperCell_1, SuperCell_2) repeated across different samples, e.g.,
supercell_ids <- unique(supercells$supercell_expression_matrix$SuperCellId)
supercell_ids[grep("SuperCell_1_", supercell_ids)]
#> [1] "SuperCell_1_Sample_Sample_1" "SuperCell_1_Sample_Sample_2"
#> [3] "SuperCell_1_Sample_Sample_3"
While the IDs of these 3 supercells are all prefixed with SuperCell_1, this does not make them equal to one another! SuperCell_1_Sample_Sample_1 will only contain cells from Sample_1 while SuperCell_1_Sample_Sample_2 will only contain cells from Sample_2.
By now, you may have noticed that we appended the sample name to each supercell ID. This helps differentiate the supercells in different samples.
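If you ever need the two parts of an ID separately, they can be pulled apart with a regular expression. This is only a sketch: it assumes every ID follows the SuperCell_<number>_Sample_<sample name> pattern seen in the output above, and the third ID below is made up for illustration.

```r
ids <- c(
    "SuperCell_1_Sample_Sample_1",
    "SuperCell_1_Sample_Sample_2",
    "SuperCell_12_Sample_Sample_3"  # made-up ID for illustration
)
# Assumed ID layout: SuperCell_<index>_Sample_<sample name>
pattern <- "^SuperCell_([0-9]+)_Sample_(.*)$"
id_parts <- data.frame(
    SuperCellId = ids,
    index_in_sample = as.integer(sub(pattern, "\\1", ids)),
    sample = sub(pattern, "\\2", ids),
    stringsAsFactors = FALSE
)
id_parts
```

Two IDs sharing the same index_in_sample but a different sample refer to different supercells.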
supercell_cell_map maps each cell in our dataset to the
supercell it belongs to.
head(supercells$supercell_cell_map)
#> SuperCellID CellId Sample
#> <char> <char> <char>
#> 1: SuperCell_297_Sample_Sample_1 Cell_1 Sample_1
#> 2: SuperCell_18_Sample_Sample_1 Cell_2 Sample_1
#> 3: SuperCell_32_Sample_Sample_1 Cell_3 Sample_1
#> 4: SuperCell_22_Sample_Sample_1 Cell_4 Sample_1
#> 5: SuperCell_401_Sample_Sample_1 Cell_5 Sample_1
#> 6: SuperCell_39_Sample_Sample_1 Cell_6 Sample_1
This map is very useful if we later need to expand the supercells out. It is also the reason why we need a column in the dataset that uniquely identifies each cell.
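To make the idea of expanding supercells concrete, here is a small base R sketch on made-up data. The column names mirror supercell_cell_map, but the values and the cell-to-supercell assignments are invented. It joins the map back onto the cell-level data and then shows that averaging the cells of each supercell gives the supercell-level expression described earlier.

```r
# Toy cell-level data: 6 cells and one marker (values are made up)
cells <- data.frame(
    Cell_Id = paste0("Cell_", 1:6),
    Marker_1_asinh = c(1.0, 1.2, 1.1, 2.0, 2.2, 2.1),
    stringsAsFactors = FALSE
)
# Toy map in the same shape as supercell_cell_map
cell_map <- data.frame(
    SuperCellID = rep(
        c("SuperCell_1_Sample_Sample_1", "SuperCell_2_Sample_Sample_1"),
        each = 3
    ),
    CellId = paste0("Cell_", 1:6),
    Sample = "Sample_1",
    stringsAsFactors = FALSE
)
# Expand: attach the supercell ID to every individual cell
expanded <- merge(cells, cell_map, by.x = "Cell_Id", by.y = "CellId")
# Averaging the cells within each supercell gives the supercell expression
supercell_means <- aggregate(
    Marker_1_asinh ~ SuperCellID,
    data = expanded, FUN = mean
)
supercell_means
```

The same join lets you carry supercell-level annotations, such as cluster labels, back to the individual cells.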
runSuperCellCyto in parallel
By default, runSuperCellCyto will process each sample one after the other. As each sample is processed independently of the others, we can process all of them in parallel.
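As a rough sketch of where this is heading, the wrapper below shows what a parallel run might look like once the steps listed next are in place. The wrapper name run_supercells_in_parallel is hypothetical. We also assume that the backend is passed through an argument named BPPARAM and that one task per sample is requested through the tasks argument of MulticoreParam; please check the function documentation for the exact arguments.

```r
# Hypothetical helper wrapping a parallel run of runSuperCellCyto
run_supercells_in_parallel <- function(dt, markers, sample_col, cell_id_col) {
    # One parallel task per sample
    n_samples <- length(unique(dt[[sample_col]]))
    SuperCellCyto::runSuperCellCyto(
        dt = dt,
        markers = markers,
        sample_colname = sample_col,
        cell_id_colname = cell_id_col,
        # Assumption: the parallel backend is supplied via BPPARAM
        BPPARAM = BiocParallel::MulticoreParam(tasks = n_samples),
        load_balancing = TRUE
    )
}

# Usage with the toy data from this vignette:
# supercells_par <- run_supercells_in_parallel(
#     dat, marker_cols_asinh, sample_col, cell_id_col
# )
```

MulticoreParam relies on forking, which is not available on Windows; SnowParam is the usual alternative there.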
To do this, we need to:
- Create a BiocParallelParam object from the BiocParallel package. This object can either be of type MulticoreParam or SnowParam. We highly recommend consulting their vignette for more information.
- Configure the BiocParallelParam object according to the number of samples we have in the dataset.
- Set the load_balancing parameter of the runSuperCellCyto function to TRUE. This ensures an even distribution of the supercell creation jobs. As each sample will be processed by a parallel job, we don’t want a job that processes a large sample to also be assigned other smaller samples if possible. If you want to know more about how this feature works, please refer to our manuscript.
Controlling supercells granularity
This is described in the runSuperCellCyto function’s documentation, but let’s briefly go through it here.
The runSuperCellCyto function is equipped with various parameters which can be customised to alter the composition of the supercells. The one that is likely to be used the most is the gamma parameter, denoted as gam in the function. By default, the value for gam is set to 20, which we found works well for most cases.
The gamma parameter controls how many supercells to generate and, indirectly, how many cells are captured within each supercell. It is defined by the formula gamma = n_cells / n_supercells, where n_cells denotes the number of cells and n_supercells denotes the number of supercells.
In general, the larger the gamma parameter, the fewer supercells we will get. Say for instance we have 10,000 cells. If gamma is set to 10, we will end up with about 1,000 supercells, whereas if gamma is set to 50, we will end up with about 200 supercells.
You may have noticed, after reading the sections above, that runSuperCellCyto is run on each sample independently, and that we can only set 1 value for the gamma parameter. Indeed, for now, the same gamma value is used across all samples, and depending on how many cells we have in each sample, we will end up with a different number of supercells for each sample. For instance, say we have 10,000 cells for sample 1, and 100,000 cells for sample 2. If gamma is set to 10, for sample 1, we will get 1,000 supercells (10,000/10) while for sample 2, we will get 10,000 supercells (100,000/10).
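The arithmetic above can be checked with a few lines of base R. The helper expected_supercells is hypothetical and simply rearranges the formula into n_supercells = n_cells / gamma; rounding to a whole number is our assumption, and the counts produced by the package are approximate.

```r
# Hypothetical helper: approximate number of supercells for a given gamma
expected_supercells <- function(n_cells, gam) round(n_cells / gam)

# A single sample of 10,000 cells
expected_supercells(10000, gam = 10)  # 1000
expected_supercells(10000, gam = 50)  # 200

# The same gamma applied to samples of different sizes
cells_per_sample <- c(sample_1 = 10000, sample_2 = 100000)
expected_supercells(cells_per_sample, gam = 10)
```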
Do note: whatever gamma value you choose, you should not expect each supercell to contain exactly the same number of cells. This behaviour is intentional to ensure rare cell types are not intermixed with non-rare cell types in a supercell.
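One way to see this for yourself is to count the cells assigned to each supercell in the cell map. The sketch below uses a made-up map in the same shape as supercell_cell_map; with real output you would tabulate supercells$supercell_cell_map$SuperCellID instead.

```r
# Toy map in the shape of supercell_cell_map (assignments are made up)
cell_map <- data.frame(
    SuperCellID = c(
        rep("SuperCell_1_Sample_Sample_1", 4),
        rep("SuperCell_2_Sample_Sample_1", 1),
        rep("SuperCell_3_Sample_Sample_1", 3)
    ),
    CellId = paste0("Cell_", 1:8),
    stringsAsFactors = FALSE
)
# Number of cells captured by each supercell
supercell_sizes <- table(cell_map$SuperCellID)
supercell_sizes
# Spread of the supercell sizes
summary(as.vector(supercell_sizes))
```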
If you have run runSuperCellCyto once and have not discarded the SuperCell object it generated (seriously, please don’t!), you can use the object to quickly regenerate supercells using different gamma values.
As an example, using the SuperCell object we have generated for our
toy dataset, we will regenerate the supercells using gamma of 10 and 50.
The function to do this is recomputeSupercells. We will
store the output in a list, one element per gamma value.
addt_gamma_vals <- c(10, 50)
supercells_addt_gamma <- lapply(addt_gamma_vals, function(gam) {
    recomputeSupercells(
        dt = dat,
        sc_objects = supercells$supercell_object,
        markers = marker_cols_asinh,
        sample_colname = sample_col,
        cell_id_colname = cell_id_col,
        gam = gam
    )
})
We should end up with a list containing 2 elements. The 1st element contains supercells generated using gamma = 10, and the 2nd contains supercells generated using gamma = 50.
supercells_addt_gamma[[1]]
#> $supercell_expression_matrix
#> Marker_1_asinh Marker_2_asinh Marker_3_asinh Marker_4_asinh
#> <num> <num> <num> <num>
#> 1: 1.700258 1.2633787 2.004239 1.814396
#> 2: 1.694261 1.4472994 1.996105 1.806227
#> 3: 1.696676 1.3270727 2.012580 1.805320
#> 4: 1.692268 1.3625901 2.006320 1.809748
#> 5: 1.709087 1.4584984 2.015922 1.795555
#> ---
#> 2996: 2.072977 0.9898431 1.281365 2.043598
#> 2997: 2.044742 1.1495015 1.536270 1.993100
#> 2998: 2.112726 1.0005266 1.347721 2.064821
#> 2999: 2.056008 0.8638917 1.422907 2.082006
#> 3000: 2.087241 0.9845857 1.546533 2.062052
#> Marker_5_asinh Marker_6_asinh Marker_7_asinh Marker_8_asinh
#> <num> <num> <num> <num>
#> 1: 1.1001476 1.3672295 1.1583204 1.020670
#> 2: 1.1153047 1.4239319 0.9584862 1.218206
#> 3: 1.1182118 1.4236379 1.2334068 1.302024
#> 4: 1.1899240 1.3495341 1.0160619 1.244196
#> 5: 1.0969891 1.3967963 1.0543984 1.238779
#> ---
#> 2996: 0.9613936 0.9122869 1.8054184 1.689606
#> 2997: 1.0651929 1.1261519 1.7480124 1.618265
#> 2998: 0.7290134 1.0519855 1.6797040 1.607769
#> 2999: 0.7908642 1.0035246 1.7854573 1.579124
#> 3000: 1.0226738 1.1621878 1.6303588 1.735277
#> Marker_9_asinh Marker_10_asinh Marker_11_asinh Marker_12_asinh
#> <num> <num> <num> <num>
#> 1: 1.395034 1.760907 1.740636 1.7274701
#> 2: 1.338449 1.780174 1.728354 1.7565790
#> 3: 1.384181 1.795822 1.738325 1.7511742
#> 4: 1.455372 1.740196 1.720587 1.7199402
#> 5: 1.491709 1.767345 1.737076 1.7312752
#> ---
#> 2996: 1.382776 1.246211 1.893865 1.1308438
#> 2997: 1.134617 1.248637 1.938155 1.1778568
#> 2998: 1.167446 1.203092 1.919244 1.1718252
#> 2999: 1.269429 1.207888 1.939008 0.9957091
#> 3000: 1.061431 1.312269 1.944217 1.3513589
#> Marker_13_asinh Marker_14_asinh Marker_15_asinh Sample
#> <num> <num> <num> <char>
#> 1: 1.318602 1.1549314 1.106958 Sample_1
#> 2: 1.369665 1.0453555 1.278743 Sample_1
#> 3: 1.228401 0.7567715 1.136054 Sample_1
#> 4: 1.377570 1.1756663 1.117614 Sample_1
#> 5: 1.424382 0.9356724 1.117582 Sample_1
#> ---
#> 2996: 1.960290 1.9957841 1.536391 Sample_3
#> 2997: 1.911389 2.0079570 1.719991 Sample_3
#> 2998: 1.981379 2.0147756 1.757036 Sample_3
#> 2999: 1.925785 2.0025586 1.636496 Sample_3
#> 3000: 1.956039 2.0206586 1.612030 Sample_3
#> SuperCellId
#> <char>
#> 1: SuperCell_1_Sample_Sample_1
#> 2: SuperCell_2_Sample_Sample_1
#> 3: SuperCell_3_Sample_Sample_1
#> 4: SuperCell_4_Sample_Sample_1
#> 5: SuperCell_5_Sample_Sample_1
#> ---
#> 2996: SuperCell_996_Sample_Sample_3
#> 2997: SuperCell_997_Sample_Sample_3
#> 2998: SuperCell_998_Sample_Sample_3
#> 2999: SuperCell_999_Sample_Sample_3
#> 3000: SuperCell_1000_Sample_Sample_3
#>
#> $supercell_cell_map
#> SuperCellID CellId Sample
#> <char> <char> <char>
#> 1: SuperCell_32_Sample_Sample_1 Cell_1 Sample_1
#> 2: SuperCell_794_Sample_Sample_1 Cell_2 Sample_1
#> 3: SuperCell_85_Sample_Sample_1 Cell_3 Sample_1
#> 4: SuperCell_500_Sample_Sample_1 Cell_4 Sample_1
#> 5: SuperCell_533_Sample_Sample_1 Cell_5 Sample_1
#> ---
#> 29996: SuperCell_4_Sample_Sample_3 Cell_29996 Sample_3
#> 29997: SuperCell_113_Sample_Sample_3 Cell_29997 Sample_3
#> 29998: SuperCell_557_Sample_Sample_3 Cell_29998 Sample_3
#> 29999: SuperCell_785_Sample_Sample_3 Cell_29999 Sample_3
#> 30000: SuperCell_481_Sample_Sample_3 Cell_30000 Sample_3
The output generated by recomputeSupercells is essentially a list:
- supercell_expression_matrix: A data.table object that contains the marker expression for each supercell.
- supercell_cell_map: A data.table that maps each cell to its corresponding supercell.
As mentioned before, gamma dictates the granularity of supercells. Compared to the previous run where gamma was set to 20, we should get more supercells for gamma = 10, and fewer for gamma = 50. Let’s see if that’s the case.
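One way to check is to count the rows of each supercell_expression_matrix, since every row is one supercell. The helper count_supercells below is hypothetical, and the comparison only runs if the supercells and supercells_addt_gamma objects from this vignette exist in your session.

```r
# Hypothetical helper: one row of the expression matrix per supercell
count_supercells <- function(res) nrow(res$supercell_expression_matrix)

# Runs only when the objects created earlier are available
if (exists("supercells") && exists("supercells_addt_gamma")) {
    n_supercells <- c(
        gamma_20 = count_supercells(supercells),
        gamma_10 = count_supercells(supercells_addt_gamma[[1]]),
        gamma_50 = count_supercells(supercells_addt_gamma[[2]])
    )
    print(n_supercells)
}
```

For the toy data of 30,000 cells, the printed output above already shows 3,000 rows for gamma = 10, which matches 30,000/10.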
In the future, we may add the ability to specify a different gam value for different samples. For now, if we want to do this, we will need to break down our data into multiple data.table objects, each containing data from 1 sample, and run the runSuperCellCyto function on each of them with a different gam parameter value. Something like the following:
n_markers <- 10
dat <- simCytoData(nmarkers = n_markers)
markers_col <- paste0("Marker_", seq_len(n_markers))
sample_col <- "Sample"
cell_id_col <- "Cell_Id"
samples <- unique(dat[[sample_col]])
gam_values <- c(10, 20, 10)
supercells_diff_gam <- lapply(seq_along(samples), function(i) {
    sample <- samples[i]
    gam <- gam_values[i]
    # Keep only the cells belonging to the current sample
    dat_samp <- dat[dat[[sample_col]] == sample, ]
    supercell_samp <- runSuperCellCyto(
        dt = dat_samp,
        markers = markers_col,
        sample_colname = sample_col,
        cell_id_colname = cell_id_col,
        gam = gam
    )
    return(supercell_samp)
})
Subsequently, to extract and combine the supercell_expression_matrix and supercell_cell_map, we will need to use rbind:
supercell_expression_matrix <- do.call(
    "rbind", lapply(
        supercells_diff_gam, function(x) x[["supercell_expression_matrix"]]
    )
)
supercell_cell_map <- do.call(
    "rbind", lapply(
        supercells_diff_gam, function(x) x[["supercell_cell_map"]]
    )
)
rbind(
    head(supercell_expression_matrix, n = 3),
    tail(supercell_expression_matrix, n = 3)
)
#> Marker_1 Marker_2 Marker_3 Marker_4 Marker_5 Marker_6 Marker_7 Marker_8
#> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 6.721237 17.42854 15.94193 15.220307 14.74566 7.195091 10.77248 17.67051
#> 2: 7.082797 17.00876 15.32411 16.987421 14.96861 5.635079 12.73142 16.51664
#> 3: 7.432550 14.92124 15.61198 16.379962 14.68095 7.152308 12.04727 18.32607
#> 4: 9.909016 13.21091 14.89304 5.849434 11.99746 8.015280 15.07472 16.50933
#> 5: 12.084492 11.93139 14.40607 4.153426 11.64125 9.895273 14.96549 14.19265
#> 6: 11.884558 13.66217 15.88546 5.085003 12.54244 9.646548 16.50927 15.14807
#> Marker_9 Marker_10 Sample SuperCellId
#> <num> <num> <char> <char>
#> 1: 8.262861 8.041802 Sample_1 SuperCell_1_Sample_Sample_1
#> 2: 6.512652 7.418535 Sample_1 SuperCell_2_Sample_Sample_1
#> 3: 7.506751 7.666498 Sample_1 SuperCell_3_Sample_Sample_1
#> 4: 20.268531 14.743797 Sample_2 SuperCell_498_Sample_Sample_2
#> 5: 20.004255 17.347272 Sample_2 SuperCell_499_Sample_Sample_2
#> 6: 18.887566 16.348424 Sample_2 SuperCell_500_Sample_Sample_2
rbind(head(supercell_cell_map, n = 3), tail(supercell_cell_map, n = 3))
#> SuperCellID CellId Sample
#> <char> <char> <char>
#> 1: SuperCell_666_Sample_Sample_1 Cell_1 Sample_1
#> 2: SuperCell_447_Sample_Sample_1 Cell_2 Sample_1
#> 3: SuperCell_912_Sample_Sample_1 Cell_3 Sample_1
#> 4: SuperCell_277_Sample_Sample_2 Cell_19998 Sample_2
#> 5: SuperCell_428_Sample_Sample_2 Cell_19999 Sample_2
#> 6: SuperCell_18_Sample_Sample_2 Cell_20000 Sample_2
If for whatever reason you don’t mind (or perhaps, more to the point, want) each supercell to contain cells from different biological samples, you still need to have the sample column in your data.table. However, you need to set every value in that column to the same single value. That way, SuperCellCyto will treat all cells as coming from one sample.
Just note, the parallel processing feature in SuperCellCyto won’t work for this as you will essentially only have 1 sample and nothing for SuperCellCyto to parallelise.
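A minimal sketch of this on made-up data is shown below. Rather than overwriting the original labels, it adds a new column holding a single value; the column name Sample_pooled and the value "all_samples" are hypothetical choices, and you would then pass "Sample_pooled" as the sample column.

```r
# Toy data with two biological samples (values are made up)
dat_toy <- data.frame(
    Cell_Id = paste0("Cell_", 1:4),
    Sample = c("Sample_1", "Sample_1", "Sample_2", "Sample_2"),
    Marker_1_asinh = c(1.1, 1.3, 0.9, 1.0),
    stringsAsFactors = FALSE
)
# Keep the original labels and add a column with exactly one unique value
dat_toy$Sample_pooled <- "all_samples"
unique(dat_toy$Sample_pooled)
```

Keeping the original Sample column means you can still trace each cell back to its biological sample afterwards.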
Is your dataset so huge that you are constantly running out of RAM when generating supercells? This happens, and we have a solution for it.
Since supercells are generated for each sample independently of the others, you can easily break up the process. For example, you can work through a subset of the samples at a time: generate supercells for that subset, save the output (using the qs package), extract the supercell_expression_matrix and supercell_cell_map, and export them as CSV files using data.table’s fwrite function.
Once you have processed all the samples, you can then load all the supercell_expression_matrix and supercell_cell_map CSV files and analyse them.
If you want to regenerate the supercells using different gamma values, load the relevant output saved using the qs package and the relevant data (remember to note which output belongs to which set of samples!), and run the recomputeSupercells function.
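The batch-wise workflow could be wrapped in a small helper along the following lines. This is only a sketch: the function name process_batch, its arguments, and the file naming scheme are all hypothetical, and it assumes the qs and data.table packages are installed.

```r
# Hypothetical helper: create supercells for one batch of samples and save them
process_batch <- function(dat_batch, batch_name, markers, sample_col,
                          cell_id_col, out_dir = ".") {
    res <- SuperCellCyto::runSuperCellCyto(
        dt = dat_batch,
        markers = markers,
        sample_colname = sample_col,
        cell_id_colname = cell_id_col
    )
    # Save the full output so supercells can be regenerated later
    qs::qsave(res, file.path(out_dir, paste0(batch_name, "_supercells.qs")))
    # Export the two tables needed for downstream analysis as CSV files
    data.table::fwrite(
        res$supercell_expression_matrix,
        file.path(out_dir, paste0(batch_name, "_expression_matrix.csv"))
    )
    data.table::fwrite(
        res$supercell_cell_map,
        file.path(out_dir, paste0(batch_name, "_cell_map.csv"))
    )
    invisible(res)
}
```

Including the batch name in every file name makes it easier to remember which output belongs to which set of samples.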
sessionInfo()
#> R version 4.5.2 (2025-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] parallel stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] BiocParallel_1.44.0 SuperCellCyto_1.0.0 BiocStyle_2.38.0
#>
#> loaded via a namespace (and not attached):
#> [1] cli_3.6.5 knitr_1.51 rlang_1.1.7
#> [4] xfun_0.56 otel_0.2.0 data.table_1.18.2.1
#> [7] jsonlite_2.0.0 buildtools_1.0.0 plyr_1.8.9
#> [10] htmltools_0.5.9 maketools_1.3.2 sys_3.4.3
#> [13] sass_0.4.10 rmarkdown_2.30 grid_4.5.2
#> [16] evaluate_1.0.5 jquerylib_0.1.4 fastmap_1.2.0
#> [19] yaml_2.3.12 lifecycle_1.0.5 BiocManager_1.30.27
#> [22] compiler_4.5.2 igraph_2.2.1 codetools_0.2-20
#> [25] Rcpp_1.1.1 pkgconfig_2.0.3 lattice_0.22-7
#> [28] digest_0.6.39 SuperCell_1.1 R6_2.6.1
#> [31] RANN_2.6.2 magrittr_2.0.4 bslib_0.10.0
#> [34] Matrix_1.7-4 tools_4.5.2 cachem_1.1.0