4 Load data
Let’s get started with a single cell introduction
4.1 Learning outcomes
- Be able to load raw single cell RNAseq into a Seurat object
- Understand the basic structure of a Seurat object, how to get counts and cell annotations
4.2 The example experiment data
For today’s workshop, we’ll be using some data from Kang et al 2018 - Multiplexed droplet single-cell RNA-sequencing using natural genetic variation.
This dataset represents human peripheral blood mononuclear cells (PBMCs), pooled from eight individual donors. There were two conditions for the samples, one of which was stimulated with IFN-beta, and a unstimulated control.
They used the genetic differences among donors enable the identification of the individual for each cell, and to identify some cell doublets.
For the purposes of today’s workshop, we will load the individual IDs, stimulation condition and doublet status for this data from a separate metadata table.
PLEASE NOTE: The dataset used in this workshop is a modified version derived from this study (see here). It has been adapted to introduce additional complexity for instructional purposes. (Mitochondrial expression levels have been introduced to demonstrate how to interpret and apply mitochondrial thresholds for filtering purposes.) Please refrain from drawing any biological conclusions from this data as it does not represent real experimental results.
4.3 What does the raw data look like?
What do the input files look like? It varies, but this is the output of the CellRanger pipeline, described here
├── analysis
│ ├── clustering
│ ├── diffexp
│ ├── pca
│ ├── tsne
│ └── umap
├── cloupe.cloupe
├── filtered_feature_bc_matrix
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── filtered_feature_bc_matrix.h5. --> matrix to read in h5 format
├── metrics_summary.csv
├── molecule_info.h5
├── possorted_genome_bam.bam
├── possorted_genome_bam.bam.bai
├── raw_feature_bc_matrix
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
├── raw_feature_bc_matrix.h5
└── web_summary.html
4.4 Setup the Seurat Object
We start by reading in the data. There are several options for loading the data. The Read10X() function reads in the output of the cellranger pipeline from 10X, returning a unique molecular identified (UMI) count matrix. The values in this matrix represent the number of molecules for each feature (i.e. gene; row) that are detected in each cell (column).
We next use the count matrix to create a Seurat object. The object serves as a container that contains both data (like the count matrix) and analysis (like PCA, or clustering results) for a single-cell dataset. For a technical discussion of the Seurat object structure, check out the GitHub Wiki. For example, the count matrix is stored in pbmc@assays$RNA@counts.
4.5 Different ways of loading the data
We’ll use method 3 today.
4.5.1 Example 1: From the whole data folder
Load your data using the path to the folder filtered_feature_bc_matrix that is in the output folder of your cellranger run using the Read10X function.
# Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "outs/filtered_feature_bc_matrix")
# Initialize the Seurat object with the raw (non-normalized data).
seurat_object <- CreateSeuratObject(counts = pbmc.data, min.cells = 3, min.features = 200)4.5.2 Example 2: From filtered_feature_bc_matrix folder
Load your data directing the ReadMtx function to each of the relevant files in the filtered_feature_bc_matrix folder in the outputs from your cellranger run. MTX is a simple text format for sparse matrices.
expression_matrix <- ReadMtx(
mtx = "outs/filtered_feature_bc_matrix/count_matrix.mtx.gz", features = "outs/filtered_feature_bc_matrix/features.tsv.gz",
cells = "outs/filtered_feature_bc_matrix/barcodes.tsv.gz"
)
seurat_object <- CreateSeuratObject(counts = expression_matrix)4.5.3 Example 3: From .h5 files
Load your data using the Read10X_h5 function to each of the relevant HDF5 files. HDF5 is an efficient binary format.
For this workshop, we also have a text file with some of the metadata from the original study, which we add with AddMetaData.
pbmc.data <- Read10X_h5("data/filtered_feature_bc_matrix.h5")
metadata <- read.table("data/metadata.txt")
seurat_object <- CreateSeuratObject(counts = pbmc.data ,
assay = "RNA", project = 'pbmc')
#> Warning: Feature names cannot have underscores ('_'),
#> replacing with dashes ('-')
# Use AddMetaData to add new meta data to object
seurat_object <- AddMetaData(object = seurat_object, metadata = metadata)4.6 What does the seurat object look like?
seurat_object
#> An object of class Seurat
#> 35635 features across 5000 samples within 1 assay
#> Active assay: RNA (35635 features, 0 variable features)
#> 1 layer present: countsLets examine a few genes in the first thirty cells.
The . values in the matrix represent 0s (no molecules detected). Since most values in an scRNA-seq matrix are 0, Seurat uses a sparse-matrix representation whenever possible. This results in significant memory and speed savings for Drop-seq/inDrop/10x data.
GetAssayData(seurat_object, layer='counts')[c("CD3D","TCL1A","MS4A1"), 1:30]
#> 3 x 30 sparse Matrix of class "dgCMatrix"
#> [[ suppressing 30 column names 'AGGGCGCTATTTCC-1', 'GGAGACGATTCGTT-1', 'CACCGTTGTCGTAG-1' ... ]]
#>
#> CD3D . 2 . . . 5 . . . . 2 . . . 1 . . 1 1 . . . 1 . 1 3 .
#> TCL1A . . . . . . . . . . . . . . . . . . . . . . . 3 . . .
#> MS4A1 . . . . . . . . . . . . . . . . . . . . . . . 1 . . .
#>
#> CD3D 2 . .
#> TCL1A . . .
#> MS4A1 . . .And the cell metadata for the first 10 cells (most of these columns come from the original paper, we’ll get into them later.)
head(seurat_object@meta.data)
#> orig.ident nCount_RNA nFeature_RNA ind
#> AGGGCGCTATTTCC-1 SeuratProject 2020 523 1256
#> GGAGACGATTCGTT-1 SeuratProject 840 381 1256
#> CACCGTTGTCGTAG-1 SeuratProject 3097 995 1016
#> TATCGTACACGCAT-1 SeuratProject 1011 540 1488
#> TGACGCCTTGCTTT-1 SeuratProject 570 367 101
#> TACGAGACCTATTC-1 SeuratProject 2399 770 1244
#> stim cell multiplets
#> AGGGCGCTATTTCC-1 stim CD14+ Monocytes singlet
#> GGAGACGATTCGTT-1 stim CD4 T cells singlet
#> CACCGTTGTCGTAG-1 ctrl FCGR3A+ Monocytes singlet
#> TATCGTACACGCAT-1 stim B cells singlet
#> TGACGCCTTGCTTT-1 ctrl CD4 T cells ambs
#> TACGAGACCTATTC-1 stim CD4 T cells singletWhat about Multiple sampes?
In the real world, you’ll probably want to load multiple samples, and join them together into one big data object before analysis.
This is described in the seurat merge vignette
How much RAM does the sparse matrix save?
dense.size <- object.size(as.matrix(pbmc.data))
#> Warning in asMethod(object): sparse->dense coercion:
#> allocating vector of size 1.3 GiB
dense.size
#> 1428252816 bytes
sparse.size <- object.size(pbmc.data)
sparse.size
#> 39615768 bytes
dense.size / sparse.size
#> 36.1 bytes