11 Clustering

11.1 Learning outcomes

  • Explain how clustering groups cells with similar gene expression profiles, and how the resolution parameter controls how broadly or finely those clusters are defined
  • Run the FindNeighbors and FindClusters functions to cluster cells
  • Evaluate a clustree to distinguish resolutions that capture the variability of the data from those that over-cluster

11.2 Overview

Slides

Why do we need to do this?

Before we can compare stimulated and control PBMCs, we need to know which cell types we are comparing. Clustering groups cells by similarity in gene expression so that each cluster can be inspected and labelled as a cell type.

11.3 Cluster cells

Seurat uses a graph-based clustering approach. Think of it like mapping a social network: each cell is a node, and connections are drawn to its nearest neighbours based on similar gene expression. Cells that are more tightly connected form communities which become your clusters.

This is what we ran at the end of the previous Harmony section.

seurat_object <- FindNeighbors(seurat_object, dims = 1:10, reduction = "harmony")
seurat_object <- FindClusters(seurat_object, resolution = 0.5)

Two functions do the work:

  • FindNeighbors(): builds the graph by connecting each cell to its most similar neighbours in PCA space.
  • FindClusters(): partitions that graph into communities. The resolution parameter controls the granularity: lower values produce fewer, broader clusters; higher values produce more, finer clusters.

Because the right resolution depends on your data, the next section shows how to test several values and pick the most biologically meaningful one.

Further reading

Read in the harmonised data for this lesson.

seurat_object <- qs2::qs_read("data/seurat_object_preprocessed.harmony.qs2")

Look at cluster IDs of the first 5 cells

head(Idents(seurat_object), 5)
#> AGGGCGCTATTTCC-1 GGAGACGATTCGTT-1 CACCGTTGTCGTAG-1 
#>                1                5                4 
#> TATCGTACACGCAT-1 TACGAGACCTATTC-1 
#>                3                0 
#> Levels: 0 1 2 3 4 5 6 7

Check out the clusters.

DimPlot(seurat_object, reduction = "umap_harmony")
# Equivalent to
# DimPlot(seurat_object,reduction="umap", group.by="seurat_clusters")
# DimPlot(seurat_object,reduction="umap", group.by="RNA_snn_res.0.5")

11.4 Choosing a cluster resolution

Its a good idea to try different resolutions when clustering to identify the variability of your data.

min_res <- 0.4
max_res <- 2
interval <- 0.2
seurat_object <- FindClusters(seurat_object, resolution = seq(min_res, max_res, interval), verbose = F)

# the different clustering created
names(seurat_object@meta.data)
#>  [1] "orig.ident"       "nCount_RNA"      
#>  [3] "nFeature_RNA"     "ind"             
#>  [5] "stim"             "cell"            
#>  [7] "multiplets"       "percent.mt"      
#>  [9] "RNA_snn_res.0.6"  "seurat_clusters" 
#> [11] "pca_clusters"     "harmony_clusters"
#> [13] "RNA_snn_res.0.4"  "RNA_snn_res.0.8" 
#> [15] "RNA_snn_res.1"    "RNA_snn_res.1.2" 
#> [17] "RNA_snn_res.1.4"  "RNA_snn_res.1.6" 
#> [19] "RNA_snn_res.1.8"  "RNA_snn_res.2"
# Look at cluster IDs of the first 5 cells
head(Idents(seurat_object), 5)
#> AGGGCGCTATTTCC-1 GGAGACGATTCGTT-1 CACCGTTGTCGTAG-1 
#>                3               17               12 
#> TATCGTACACGCAT-1 TACGAGACCTATTC-1 
#>               13                9 
#> 21 Levels: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ... 20
DimPlot(seurat_object, reduction = "umap_harmony")

11.4.1 How to read a clustree

A clustree plots each cluster at each resolution as a node, with arrows showing where cells move as resolution increases. Use it to find the resolution that captures real biological structure without over-splitting.

Stable trunk — a large node that persists across several resolutions with arrows pointing straight down. Cells are consistently assigned to this cluster regardless of resolution. This is a cluster you can trust.

Branching point — a node that splits into two or more child nodes at higher resolution. A clean branch (one dominant arrow per child) suggests the split is capturing a genuine subpopulation. Multiple small arrows from the same parent (“scatter”) suggests noise rather than biology.

Over-clustering — at high resolutions, previously stable clusters start fragmenting with many small arrows going in multiple directions. The tree looks “bushy”. Cells are being split arbitrarily rather than by meaningful gene expression differences.

Choosing a resolution — pick the highest resolution at which the major trunks remain stable and branches are still clean. Pushing beyond this point adds clusters that don’t reflect biology.

library(clustree)
#> Loading required package: ggraph
#> 
#> Attaching package: 'ggraph'
#> The following object is masked from 'package:sp':
#> 
#>     geometry
clustree(seurat_object, prefix = "RNA_snn_res.", show_axis=TRUE) + 
  theme(legend.key.size = unit(0.05, "cm"))

Exercise: Choosing a resolution

  1. Plot a clustree to decide how many clusters you have and what resolution captures them. Look for the point where stable trunks begin to fragment.
  2. Validate the resolution using DimPlot() — do the clusters look well-separated on the UMAP?

Tip: The name of the cluster is prefixed with ‘RNA_snn_res’ and the number of the resolution e.g. RNA_snn_res.0.4

# Add your DimPlots here
#DimPlot(seurat_object, reduction = "umap_harmony", group.by = "")

Looking at the clustree, resolution 0.6 is a reasonable choice for this dataset:

  • The major trunks are stable between 0.4 and 0.6 (no effect on clusters)
  • Beyond 0.6 the tree becomes unstable, clusters begin fragmenting with scattered arrows, which suggests the splits no longer reflect genuine biology.

Set the resolution for 0.6:

Idents(seurat_object) <- seurat_object$RNA_snn_res.0.6

Plot the UMAP with coloured clusters.

DimPlot(seurat_object, reduction = "umap_harmony", label = TRUE, repel = TRUE, label.box = TRUE) + NoLegend()

Answer in Slack

At this point we have clusters, but do they reflect genuine cell types or could they still be driven by technical variation? What additional evidence would you want to see before trusting these clusters as biological groups?

qs2::qs_save(seurat_object, "data/seurat_object_preprocessed.harmony.clustered.qs2")