13 SingleR

13.1 Learning outcomes

Apply bioconductor packages to automate cell type annotations
Evaluate the trade-offs between reference-based and marker-based annotation

13.2 Overview

Why do we need to do this?

In the previous section we annotated clusters manually using canonical marker genes. This works well for well-characterised cell types, but it is slow, subjective, and depends on the analyst’s domain knowledge. Different researchers may reach different conclusions from the same data.

Reference-based annotation automates this process by comparing each cell’s expression profile against a curated reference dataset of known cell types. This is faster, reproducible, and less dependent on prior knowledge, making it easier to annotate large datasets or cell types you are less familiar with.

Read in our manually annotated seurat object.

seurat_object <- qs2::qs_read(here("data", "seurat_object_preprocessed.harmony.clustered.markers.qs2"))
Idents(seurat_object) <- seurat_object$markers

In this workshop we have focused on the Seurat package. However, there is another whole ecosystem of R packages for single cell analysis within Bioconductor. We won’t go into any detail on these packages in this workshop, but there is good material describing the object type online : OSCA.

For now, we’ll just convert our Seurat object into an object called SingleCellExperiment. This allows us to utilise the Bioconductor packages.

sce <- as.SingleCellExperiment(seurat_object)
#> Found more than one class "package_version" in cache; using the first, from namespace 'SeuratObject'
#> Also defined by 'alabaster.base'
#> Found more than one class "package_version" in cache; using the first, from namespace 'SeuratObject'
#> Also defined by 'alabaster.base'
#> Found more than one class "package_version" in cache; using the first, from namespace 'SeuratObject'
#> Also defined by 'alabaster.base'
#> Found more than one class "package_version" in cache; using the first, from namespace 'SeuratObject'
#> Also defined by 'alabaster.base'
#> Found more than one class "package_version" in cache; using the first, from namespace 'SeuratObject'
#> Also defined by 'alabaster.base'
#> Found more than one class "package_version" in cache; using the first, from namespace 'SeuratObject'
#> Also defined by 'alabaster.base'
#> Found more than one class "package_version" in cache; using the first, from namespace 'SeuratObject'
#> Also defined by 'alabaster.base'
#> Found more than one class "package_version" in cache; using the first, from namespace 'SeuratObject'
#> Also defined by 'alabaster.base'
sce
#> class: SingleCellExperiment 
#> dim: 35635 4877 
#> metadata(0):
#> assays(3): counts logcounts scaledata
#> rownames(35635): MIR1302-10 FAM138A ... MT-ND6 MT-CYB
#> rowData names(0):
#> colnames(4877): AGGGCGCTATTTCC-1 GGAGACGATTCGTT-1 ...
#>   ATGTTGCTAAAAGC-1 GATGACACTAGCGT-1
#> colData names(24): orig.ident nCount_RNA ...
#>   NK.cells1 ident
#> Found more than one class "package_version" in cache; using the first, from namespace 'SeuratObject'
#> Also defined by 'alabaster.base'
#> reducedDimNames(4): PCA UMAP HARMONY UMAP_HARMONY
#> mainExpName: RNA
#> Found more than one class "package_version" in cache; using the first, from namespace 'SeuratObject'
#> Also defined by 'alabaster.base'
#> altExpNames(0):

We will now use a package called SingleR to label each cell. SingleR uses a reference data set of cell types with expression data to infer the best label for each cell. A convenient collection of cell type reference is in the celldex package which currently contains the follow sets:

ls('package:celldex')
#>  [1] "BlueprintEncodeData"             
#>  [2] "DatabaseImmuneCellExpressionData"
#>  [3] "defineTextQuery"                 
#>  [4] "fetchLatestVersion"              
#>  [5] "fetchMetadata"                   
#>  [6] "fetchReference"                  
#>  [7] "HumanPrimaryCellAtlasData"       
#>  [8] "ImmGenData"                      
#>  [9] "listReferences"                  
#> [10] "listVersions"                    
#> [11] "MonacoImmuneData"                
#> [12] "MouseRNAseqData"                 
#> [13] "NovershternHematopoieticData"    
#> [14] "saveReference"                   
#> [15] "searchReferences"                
#> [16] "surveyReferences"

We will use HumanPrimaryCellAtlasData because our dataset is human PBMCs - broad, well-characterised immune cell types that are well represented in this atlas. It provides both high-level (label.main) and fine-grained (label.fine) annotations.

This reference would be a poor choice if your data came from a specific tissue (e.g. brain, liver), a non-human species, or contained rare or novel cell types not present in the atlas (e.g. tumour cells). In those cases, you would need a more appropriate reference. For example a tissue-specific atlas or one built from your own annotated data.

Choosing a reference for your own data

The best reference is a published single-cell study in your model system. Ideally this means a study from the same tissue, species, and experimental context as your data. A well-annotated scRNA-seq dataset from the literature gives you cell type labels that are directly relevant to what you expect to find. Generic atlases like HPCA are useful for common human cell types, but will lack resolution for tissue-specific subtypes or anything outside their scope.

A key principle: annotation quality is bounded by the reference. If a cell type is not represented in the reference, SingleR cannot call it. It will assign the closest available label, which may be misleading.

Where to look:

Published scRNA-seq studies in your tissue or cell type of interest
Tissue-specific atlases: Human Cell Atlas, Tabula Sapiens (human), Tabula Muris (mouse)
Bioconductor’s celldex package for common reference types

Any annotated SingleCellExperiment can be used as a SingleR reference directly, so a downloaded and annotated published dataset is a valid option.

Non-model organisms and no-reference scenarios

For non-model organisms, curated scRNA-seq references are often sparse or non-existent. Options in this case:

Cross-species references with ortholog mapping: convert gene IDs to their human or mouse orthologs and apply a well-annotated reference from a related species. This can work for broadly conserved cell types (e.g. immune populations) but will fail for lineages with divergent gene expression programs.
Bulk RNA-seq references: SingleR accepts bulk RNA-seq data as a reference. If a well-annotated bulk dataset exists for your organism or tissue, it can serve as a fallback.
Marker-based annotation only: when no suitable reference exists, the manual top-down approach from the previous section is your primary tool. This requires more domain knowledge and literature review, but is the most flexible option.
De novo annotation: for truly novel cell types, annotation becomes a research question in itself and will require integration of marker genes, trajectory analysis, and cross-validation against external data sources.

The absence of a reference is common in emerging model systems and is not a dead end, it shifts the bulk of the investigative work back to marker-based interpretation.

Let’s download the reference dataset:

# This too is a sce object,
# colData is equivalent to seurat's metadata
ref.set <- celldex::HumanPrimaryCellAtlasData()

HPCA contains two categories of labels at different granularities, main and fine. Display the main labels.

unique(ref.set$label.main)
#>  [1] "DC"                   "Smooth_muscle_cells" 
#>  [3] "Epithelial_cells"     "B_cell"              
#>  [5] "Neutrophils"          "T_cells"             
#>  [7] "Monocyte"             "Erythroblast"        
#>  [9] "BM & Prog."           "Endothelial_cells"   
#> [11] "Gametocytes"          "Neurons"             
#> [13] "Keratinocytes"        "HSC_-G-CSF"          
#> [15] "Macrophage"           "NK_cell"             
#> [17] "Embryonic_stem_cells" "Tissue_stem_cells"   
#> [19] "Chondrocytes"         "Osteoblasts"         
#> [21] "BM"                   "Platelets"           
#> [23] "Fibroblasts"          "iPS_cells"           
#> [25] "Hepatocytes"          "MSC"                 
#> [27] "Neuroepithelial_cell" "Astrocyte"           
#> [29] "HSC_CD34+"            "CMP"                 
#> [31] "GMP"                  "MEP"                 
#> [33] "Myelocyte"            "Pre-B_cell_CD34-"    
#> [35] "Pro-B_cell_CD34+"     "Pro-Myelocyte"

An example of the types of “fine” labels.

head(unique(ref.set$label.fine), n = 20)
#>  [1] "DC:monocyte-derived:immature"       
#>  [2] "DC:monocyte-derived:Galectin-1"     
#>  [3] "DC:monocyte-derived:LPS"            
#>  [4] "DC:monocyte-derived"                
#>  [5] "Smooth_muscle_cells:bronchial:vit_D"
#>  [6] "Smooth_muscle_cells:bronchial"      
#>  [7] "Epithelial_cells:bronchial"         
#>  [8] "B_cell"                             
#>  [9] "Neutrophil"                         
#> [10] "T_cell:CD8+_Central_memory"         
#> [11] "T_cell:CD8+"                        
#> [12] "T_cell:CD4+"                        
#> [13] "T_cell:CD8+_effector_memory_RA"     
#> [14] "T_cell:CD8+_effector_memory"        
#> [15] "T_cell:CD8+_naive"                  
#> [16] "Monocyte"                           
#> [17] "Erythroblast"                       
#> [18] "BM"                                 
#> [19] "DC:monocyte-derived:rosiglitazone"  
#> [20] "DC:monocyte-derived:AM580"

Now we’ll label our cells using the SingleCellExperiment object, with the above reference set. This will take a minute.

pred.cnts <- SingleR::SingleR(test = sce, ref = ref.set, labels = ref.set$label.main)

Labels assigned to fewer than 10 cell are likely low-confidence predictions - too few cells to represent a genuine population. We collapse these into an “Other” category and transfer the remaining labels back to the Seurat object.

lbls.keep <- table(pred.cnts$labels)>10
seurat_object$SingleR.labels <- ifelse(lbls.keep[pred.cnts$labels], pred.cnts$labels, 'Other')
DimPlot(seurat_object, reduction='umap_harmony', group.by='SingleR.labels')

13.2.1 Comparing annotation methods

Before comparing, it is worth summarising what each approach offers:

	Reference-based (SingleR)	Marker-based (manual)
Speed	Fast, automated	Slow, manual
Objectivity	Reproducible	Analyst-dependent
Flexibility	Bounded by reference	Uses any domain knowledge
Limitations	Misses cell types absent from reference	Subjective, error-prone

The $cell column in our metadata contains the cell type labels provided by the original dataset authors. We can use these as a ground truth to assess both annotation approaches side by side.

library(patchwork)
p1 <- DimPlot(seurat_object, reduction = "umap_harmony", label = TRUE, repel = TRUE) +
        ggtitle("Marker-based") + NoLegend()
p2 <- DimPlot(seurat_object, reduction = "umap_harmony", group.by = "SingleR.labels",
              label = TRUE, repel = TRUE) +
        ggtitle("SingleR (HPCA main)") + NoLegend()
p3 <- DimPlot(seurat_object, reduction = "umap_harmony", group.by = "cell",
              label = TRUE, repel = TRUE) +
        ggtitle("Ground truth") + NoLegend()
p1 | p2 | p3

To go beyond the visual and count exact agreement, compare SingleR labels against the ground truth:

table(SingleR = seurat_object$SingleR.labels, Ground_truth = seurat_object$cell)
#>              Ground_truth
#> SingleR       B cells CD14+ Monocytes CD4 T cells
#>   B_cell          424               0           5
#>   DC                2              14           1
#>   Macrophage        0               9           0
#>   Monocyte         12            1039           9
#>   Neutrophils       3              23           0
#>   NK_cell           4               0          19
#>   Other             5               0           2
#>   T_cells          41               2        2031
#>              Ground_truth
#> SingleR       CD8 T cells Dendritic cells FCGR3A+ Monocytes
#>   B_cell                0               9                 3
#>   DC                    0              25                 4
#>   Macrophage            0               0                 2
#>   Monocyte              0              44               293
#>   Neutrophils           0               0                 3
#>   NK_cell              95               5                 2
#>   Other                 1               2                 3
#>   T_cells             326               1                 2
#>              Ground_truth
#> SingleR       Megakaryocytes NK cells
#>   B_cell                   6        0
#>   DC                       0        0
#>   Macrophage               0        0
#>   Monocyte                 8        2
#>   Neutrophils              0        0
#>   NK_cell                  5      338
#>   Other                    5        0
#>   T_cells                 24       24

Answer in Slack

Looking at the three plots and the table above, where do manual annotation, SingleR, and the ground truth agree? Where do they differ? What might explain the discrepancies?

Exercise: Fine-grained annotation

Rerun SingleR using label.fine instead of label.main and plot the results.

# Add your code here

Discuss

How does the granularity of the labels change? Does finer annotation make the results easier or harder to interpret? What trade-off does this introduce?

Fine-grained labels capture biologically meaningful distinctions that broad labels miss — for example, separating CD4+ naive from CD4+ memory T cells. However, the more granular the labels, the harder it is for the algorithm to distinguish them reliably: closely related subtypes have very similar expression profiles, and the reference may not have enough examples of each subtype to anchor them confidently. In practice, run both: use label.main to confirm the broad populations are correct, then apply label.fine to any cluster where a finer distinction matters for your biological question.

#qs2::qs_save(seurat_object, here("data", "seurat_object_preprocessed.harmony.clustered.markers.singler.qs2"))

12 Cluster Markers

14 Differential Expression