3 Seurat PBMC3k Tutorial
3.1 Load data
3.1.1 Input data
What do the input files look like? It varies, but this is the output of the CellRanger pipleine for a single sample.
├── analysis
│ ├── clustering
│ ├── diffexp
│ ├── pca
│ ├── tsne
│ └── umap
├── cloupe.cloupe
├── filtered_feature_bc_matrix
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── filtered_feature_bc_matrix.h5
├── metrics_summary.csv
├── molecule_info.h5
├── possorted_genome_bam.bam
├── possorted_genome_bam.bam.bai
├── raw_feature_bc_matrix
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── raw_feature_bc_matrix.h5
└── web_summary.html
We don’t usually need to look at all of these, the most useful files are:
- web_summary.html Nice QC summary
- filtered_feature_bc_matrix/ folder has 3 files which contain the ‘counts matrix’, how many copies of each gene per cell. Often available for public data, e.g. this kidney organioid study
The contents are more fully described here. And 10X provides some example data like this (requires email): https://www.10xgenomics.com/datasets/5k-human-pbmcs-3-v3-1-chromium-controller-3-1-standard
3.1.2 Setup the Seurat Object
Tutorial: https://satijalab.org/seurat/articles/pbmc3k_tutorial#setup-the-seurat-object
library(dplyr)
library(ggplot2)
library(Seurat)
library(patchwork)
# Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "data/pbmc3k/filtered_gene_bc_matrices/hg19/") # This path on our VMs!
# Initialize the Seurat object with the raw (non-normalized data).
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
pbmc
3.2 QC Filtering
Tutorial: https://satijalab.org/seurat/articles/pbmc3k_tutorial#qc-and-selecting-cells-for-further-analysis
Low quality cells can add noise to your results leading you to the wrong biological conclusions. Using only good quality cells helps you to avoid this. Reduce noise in the data by filtering out low quality cells such as dying or stressed cells (high mitochondrial expression) and cells with few features that can reflect empty droplets.
The metadata of a seurat object contains any information about a cell, not just QC pbmc@meta.data
or pbmc[[]]
.
Running PercentageFeatureSet()
added new columns, but we can also add information ourselves. We will use the metadata a lot!
E.g. In a real experiment with multiple samples, you can load each sample, and record information about it- then combine into one, big, seurat object. NB: We won’t cover that today, but see merging seurat objects tutorial
so.1 <- # load sample 1
so.1$samplename <- "sample1"
so.1$group <- "treatment"
# then sample 2-12
Challenge: Filter the cells
Apply the filtering thresolds defined above.
- How many cells survived filtering?
The PBMC3k dataset we’re working with in this tutorial is quite old. There are a number of other example datasets available from the 10X website, including this one - published in 2022, sequencing 10k PBMCs with a newer chemistry and counting method.
- What thresholds would you chose to apply to this modern dataset?
pbmc10k_unfiltered <- readRDS("data/10k_PBMC_v3.1ChromiumX_Intronic.rds")
VlnPlot(pbmc10k_unfiltered, features = "nCount_RNA") + scale_y_log10()
VlnPlot(pbmc10k_unfiltered, features = "nFeature_RNA") + scale_y_log10()
3.3 Normalisation
Tutorial: https://satijalab.org/seurat/articles/pbmc3k_tutorial#normalizing-the-data
The sequencing depth can be different per cell. This can bias the counts of expression showing higher numbers for more sequenced cells leading to the wrong biological conclusions. To correct this the feature counts are normalized.
3.4 PCAs and UMAPs
There are a few steps of number-crunching before we get to generating a nice UMAP representation of our data.
Why are these steps needed?
-
Identification of highly variable features (feature selection) Tutorial: Identifying the most variable features allows retaining the real biological variability of the data and reduce noise in the data.
- Scaling the data Tutorial: Highly expresed genes can overpower the signal of other less expresed genes with equal importance. Within the same cell the assumption is that the underlying RNA content is constant. Aditionally, If variables are provided in vars.to.regress, they are individually regressed against each feature, and the resulting residuals are then scaled and centered. This step allows controling for cell cycle and other factors that may bias your clustering.
- Dimensionality reduction (PCA) Tutorial: Imagine each gene represents a dimension - or an axis on a plot. We could plot the expression of two genes with a simple scatterplot. But a genome has thousands of genes - how do you collate all the information from each of those genes in a way that allows you to visualise it in a 2 dimensional image. This is where dimensionality reduction comes in, we calculate meta-features that contains combinations of the variation of different genes. From thousands of genes, we end up with 10s of meta-features
- Run non-linear dimensional reduction (UMAP/tSNE) Tutorial: Finally, make the UMAP plot.
NB: We skip clustering for now, as we will return to cover it in greater depth.
Challenge: PC genes
You can plot gene expression on the UMAP with the FeaturePlot()
function.
Try out some genes that were highly weighted in the principal component analysis. How do they look?
3.4.1 Save
You can save the object at this point so that it can easily be loaded back in without having to rerun the computationally intensive steps performed above, or easily shared with collaborators.
saveRDS(pbmc, file = "pbmc_tutorial_saved.rds")
3.5 The Seurat object
To accommodate the complexity of data arising from a single cell RNA seq experiment, the seurat object keeps this as a container of multiple data tables that are linked.
Note that seurat supports multimodal data; e.g. RNA, and protein tags in the same cells. As of seurat v5, these are stored as assays each of which can have multiple layers ). Today we only work with one assay; RNA.
- RNA assay (The default, and only, Assay)
- counts layer : raw counts data
- data layer : normalised data
- scale.data layer : scaled data (may only be available for highly variable genes)
Beware: The notation around Layers and Assays has changed between Seurat v4 and v5! In older seurat objects ‘assay’ is used to refer to v5 layers.