Seurat BUG: Cell Order Affects ScaleData & PCA/UMAP Results
Hey guys,
I wanted to share something pretty important I stumbled upon while working with Seurat (v5). It turns out that just shuffling the column order of a sparse count matrix—without actually changing the data or metadata—can seriously mess up the results you get from ScaleData()
. This, in turn, throws off PCA, Harmony, UMAP, and even your clustering results. It's a bit of a head-scratcher, so let's dive into what's happening and how to reproduce it.
Issue Description
So, here's the deal: the order of cells in your sparse count matrix shouldn't be a big deal, right? You'd think that as long as the data itself remains untouched, the results should be consistent. But that's not what's happening in Seurat v5. When you change the column order, ScaleData()
spits out different results, and these differences cascade through your downstream analyses. This means your PCA plots, UMAP visualizations, and cluster assignments can all be affected. It’s like rearranging the deck chairs on the Titanic—except in this case, the chairs are your cells, and the Titanic is your entire analysis.
The implications here are pretty significant. If cell order is influencing your results, it introduces a source of variability that shouldn't be there. This could lead to misinterpretations of your data, especially if you're comparing different datasets or replicates where cell order might vary. Imagine spending hours, days, or even weeks on an analysis, only to find out that the foundations are shaky because of something as seemingly trivial as column order. That's why it's crucial to understand and address this issue.
To really drive this home, let's think about what ScaleData()
is supposed to do. It's designed to normalize gene expression across cells, removing unwanted technical variation and making it easier to identify biologically meaningful differences. This involves centering and scaling the data, typically using z-scores. The process should be consistent regardless of how the cells are arranged in the matrix. However, the bug suggests that there's some dependency on cell order that's creeping into the calculations. This could be related to how the sparse matrix operations are handled or some other internal mechanism within the function. Whatever the cause, it's clear that we need to get to the bottom of it to ensure the reliability of our single-cell analyses.
Reproducing Code Example
To show you exactly what I mean, I’ve put together a minimal reproducible example using a real single-cell RNA sequencing dataset. This dataset comes from a 10x Genomics experiment and includes 32,285 genes across 64,466 cells. It's a substantial dataset, which makes the impact of this bug even more concerning. Here’s the code you can use to see the issue for yourself:
# Create reference Seurat object
# Assuming 'counts' is your real dgCMatrix
seu1 <- CreateSeuratObject(counts)
# Normalize the data
seu1 <- NormalizeData(seu1)
# Find variable features
seu1 <- FindVariableFeatures(seu1)
# Scale the data
seu1 <- ScaleData(seu1)
# Randomize the column (cell) order
set.seed(456)
shuffled_cells <- sample(colnames(counts))
counts_shuffled <- counts[, shuffled_cells]
# Create a second Seurat object with shuffled data
seu2 <- CreateSeuratObject(counts_shuffled)
# Normalize the data again
seu2 <- NormalizeData(seu2)
# Find variable features again
seu2 <- FindVariableFeatures(seu2)
# Scale the data again
seu2 <- ScaleData(seu2)
# Reorder seu2 to match seu1 for comparison
# This ensures that the cells are in the same order for direct comparison
seu2 <- seu2[, colnames(seu1)]
# Get scaled data from both objects
scale1 <- GetAssayData(seu1, layer = "scale.data")
scale2 <- GetAssayData(seu2, layer = "scale.data")
# Compare the scaled data matrices
max_diff <- max(abs(scale1 - scale2))
num_genes_diff <- sum(apply(abs(scale1 - scale2), 1, max) > 1e-5)
# Print the results
cat("Max abs difference:", max_diff, "\n")
cat("Genes with >1e-5 difference:", num_genes_diff, "\n")
Let's break down what this code does: first, we create a reference Seurat object (seu1
) from the original count matrix. We then normalize, find variable features, and scale the data. Next, we shuffle the column order of the count matrix and create a second Seurat object (seu2
). We perform the same normalization, feature selection, and scaling steps on this object. To make a fair comparison, we reorder the columns of seu2
to match the original order in seu1
. Finally, we compare the scaled data matrices from both objects and print the maximum absolute difference and the number of genes with substantial differences.
The crucial part here is the comparison. If ScaleData()
were working correctly, the scaled data in seu1
and seu2
should be virtually identical, even though the cell order was shuffled in between. However, as you’ll see when you run the code, this isn't the case. There are significant differences between the scaled data matrices, indicating that cell order is indeed influencing the results. This highlights a potential flaw in the scaling process that needs to be addressed to ensure the reliability of Seurat analyses.
Error Message
When you run the code, you’ll likely see an output similar to this:
Output:
Max abs difference: 10.54427
Genes with >1e-5 difference: 2000
This output is pretty alarming. The Max abs difference
value tells us the largest absolute difference between any corresponding elements in the two scaled data matrices. A value of 10.54427 is huge—it means that for some gene-cell combinations, the scaled expression values differ by more than 10 standard deviations. This is a massive discrepancy that can have a major impact on downstream analyses.
The Genes with >1e-5 difference
value is equally concerning. It tells us how many genes have at least one cell where the scaled expression differs by more than 1e-5 (a very small threshold). In this case, we see that 2000 genes have significant differences in their scaled expression values between the two objects. Given that we started with 32,285 genes, this means that a substantial portion of the transcriptome is affected by the cell order issue.
These results underscore the severity of the problem. It's not just a minor numerical difference; it's a widespread effect that alters the scaled expression values for a large number of genes. This, in turn, can lead to different PCA results, UMAP embeddings, cluster assignments, and ultimately, different biological interpretations. Imagine if you were trying to identify differentially expressed genes between conditions, and the scaling process was introducing spurious differences simply because of cell order. You could end up chasing false positives and missing real biological signals. That's why it's so critical to understand and resolve this bug.
Additional Comments
No response
Session Info
Here’s the session info from my R environment, which might be helpful for anyone trying to reproduce this issue:
sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.6.1
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Matrix_1.6-5 harmony_1.2.0 Rcpp_1.0.13-1 lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1
[7] dplyr_1.1.4 purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1
[13] tidyverse_2.0.0 SeuratWrappers_0.3.5 Seurat_5.1.0 Signac_1.14.0 SeuratObject_5.0.2 sp_2.1-4
loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-3 rstudioapi_0.16.0 jsonlite_1.8.9 magrittr_2.0.3 spatstat.utils_3.1-1
[6] farver_2.1.2 zlibbioc_1.48.2 vctrs_0.6.5 ROCR_1.0-11 spatstat.explore_3.3-3
[11] Rsamtools_2.18.0 RCurl_1.98-1.16 RcppRoll_0.3.1 htmltools_0.5.8.1 sctransform_0.4.1
[16] parallelly_1.41.0 KernSmooth_2.23-24 htmlwidgets_1.6.4 ica_1.0-3 plyr_1.8.9
[21] plotly_4.10.4 zoo_1.8-12 igraph_2.1.2 mime_0.12 lifecycle_1.0.4
[26] pkgconfig_2.0.3 rsvd_1.0.5 R6_2.5.1 fastmap_1.2.0 GenomeInfoDbData_1.2.11
[31] fitdistrplus_1.2-1 future_1.34.0 shiny_1.10.0 digest_0.6.37 colorspace_2.1-1
[36] patchwork_1.3.0 S4Vectors_0.40.2 tensor_1.5 RSpectra_0.16-2 irlba_2.3.5.1
[41] pkgload_1.4.0 GenomicRanges_1.54.1 labeling_0.4.3 progressr_0.15.1 timechange_0.3.0
[46] spatstat.sparse_3.1-0 httr_1.4.7 polyclip_1.10-7 abind_1.4-8 compiler_4.3.2
[51] remotes_2.5.0 withr_3.0.2 BiocParallel_1.36.0 fastDummies_1.7.4 R.utils_2.12.3
[56] MASS_7.3-60.0.1 tools_4.3.2 lmtest_0.9-40 httpuv_1.6.15 future.apply_1.11.3
[61] goftest_1.2-3 R.oo_1.26.0 glue_1.8.0 nlme_3.1-166 promises_1.3.2
[66] grid_4.3.2 Rtsne_0.17 cluster_2.1.6 reshape2_1.4.4 generics_0.1.3
[71] gtable_0.3.6 spatstat.data_3.1-4 tzdb_0.4.0 R.methodsS3_1.8.2 hms_1.1.3
[76] data.table_1.16.4 XVector_0.42.0 BiocGenerics_0.48.1 spatstat.geom_3.3-4 RcppAnnoy_0.0.22
[81] ggrepel_0.9.6 RANN_2.6.2 pillar_1.10.0 spam_2.11-0 RcppHNSW_0.6.0
[86] later_1.4.1 splines_4.3.2 lattice_0.22-6 survival_3.7-0 deldir_2.0-4
[91] tidyselect_1.2.1 Biostrings_2.70.3 miniUI_0.1.1.1 pbapply_1.7-2 gridExtra_2.3
[96] IRanges_2.36.0 scattermore_1.2 RhpcBLASctl_0.23-42 stats4_4.3.2 matrixStats_1.4.1
[101] stringi_1.8.4 lazyeval_0.2.2 codetools_0.2-20 BiocManager_1.30.24 cli_3.6.3
[106] uwot_0.2.2 xtable_1.8-4 reticulate_1.42.0 munsell_0.5.1 GenomeInfoDb_1.38.8
[111] globals_0.16.3 spatstat.random_3.3-2 png_0.1-8 spatstat.univar_3.1-1 parallel_4.3.2
[116] dotCall64_1.2 bitops_1.0-9 listenv_0.9.1 viridisLite_0.4.2 scales_1.3.0
[121] ggridges_0.5.6 leiden_0.4.3.1 crayon_1.5.3 rlang_1.1.4 cowplot_1.1.3
[126] fastmatch_1.1-4
This info can help ensure that you have the same package versions and R environment as I do, which is crucial for reproducing the bug. The sessionInfo()
output lists all the attached packages and their versions, as well as the R version and operating system. If you encounter different results when running the code, comparing your session info with mine might help identify potential discrepancies.
For example, if you're using an older version of Seurat or one of its dependencies, the bug might have already been fixed. Alternatively, if you're using a different operating system or R version, there might be subtle differences in how the code is executed. By providing this detailed information, I hope to make it easier for others to reproduce the issue and contribute to finding a solution.
Conclusion
This bug in ScaleData()
is a big deal, guys. It just goes to show how important it is to dig deep and validate our methods, even in well-established packages like Seurat. If you're working with single-cell data, keep this in mind and double-check your results. Hopefully, this will get resolved soon, but in the meantime, stay vigilant and keep exploring!