4 Load data

Let’s get started with a single cell introduction

4.1 Learning outcomes

Be able to load raw single cell RNAseq into a Seurat object
Understand the basic structure of a Seurat object, how to get counts and cell annotations

4.2 The example experiment data

For today’s workshop, we’ll be using some data from Kang et al 2018 - Multiplexed droplet single-cell RNA-sequencing using natural genetic variation.

This dataset represents human peripheral blood mononuclear cells (PBMCs), pooled from eight individual donors. There were two conditions for the samples, one of which was stimulated with IFN-beta, and a unstimulated control.

They used the genetic differences among donors enable the identification of the individual for each cell, and to identify some cell doublets.

For the purposes of today’s workshop, we will load the individual IDs, stimulation condition and doublet status for this data from a separate metadata table.

PLEASE NOTE: The dataset used in this workshop is a modified version derived from this study (see here). It has been adapted to introduce additional complexity for instructional purposes. (Mitochondrial expression levels have been introduced to demonstrate how to interpret and apply mitochondrial thresholds for filtering purposes.) Please refrain from drawing any biological conclusions from this data as it does not represent real experimental results.

4.3 What does the raw data look like?

What do the input files look like? It varies, but this is the output of the CellRanger pipeline, described here

├── analysis
│   ├── clustering
│   ├── diffexp
│   ├── pca
│   ├── tsne
│   └── umap
├── cloupe.cloupe
├── filtered_feature_bc_matrix
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── filtered_feature_bc_matrix.h5.  --> matrix to read in h5 format
├── metrics_summary.csv
├── molecule_info.h5
├── possorted_genome_bam.bam
├── possorted_genome_bam.bam.bai
├── raw_feature_bc_matrix
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── raw_feature_bc_matrix.h5 
└── web_summary.html

4.4 Setup the Seurat Object

We start by reading in the data. There are several options for loading the data. The Read10X() function reads in the output of the cellranger pipeline from 10X, returning a unique molecular identified (UMI) count matrix. The values in this matrix represent the number of molecules for each feature (i.e. gene; row) that are detected in each cell (column).

We next use the count matrix to create a Seurat object. The object serves as a container that contains both data (like the count matrix) and analysis (like PCA, or clustering results) for a single-cell dataset. For a technical discussion of the Seurat object structure, check out the GitHub Wiki. For example, the count matrix is stored in pbmc@assays$RNA@counts.

library(here)
library(dplyr)
library(ggplot2)
library(Seurat)
library(patchwork)

4.5 Different ways of loading the data

We’ll use method 3 today.

4.5.1 Example 1: From the whole data folder

Load your data using the path to the folder filtered_feature_bc_matrix that is in the output folder of your cellranger run using the Read10X function.

# Load the PBMC dataset
 pbmc.data <- Read10X(data.dir = "outs/filtered_feature_bc_matrix")
# Initialize the Seurat object with the raw (non-normalized data).
 seurat_object <- CreateSeuratObject(counts = pbmc.data, min.cells = 3, min.features = 200)

4.5.2 Example 2: From filtered_feature_bc_matrix folder

Load your data directing the ReadMtx function to each of the relevant files in the filtered_feature_bc_matrix folder in the outputs from your cellranger run. MTX is a simple text format for sparse matrices.

expression_matrix <- ReadMtx(
  mtx = "outs/filtered_feature_bc_matrix/count_matrix.mtx.gz", features = "outs/filtered_feature_bc_matrix/features.tsv.gz",
  cells = "outs/filtered_feature_bc_matrix/barcodes.tsv.gz"
)
seurat_object <- CreateSeuratObject(counts = expression_matrix)

4.5.3 Example 3: From .h5 files

Load your data using the Read10X_h5 function to each of the relevant HDF5 files. HDF5 is an efficient binary format.

For this workshop, we also have a text file with some of the metadata from the original study, which we add with AddMetaData.

pbmc.data <- Read10X_h5(here("data", "filtered_feature_bc_matrix.h5"))
metadata <- read.table(here("data", "metadata.txt"))
  
seurat_object <- CreateSeuratObject(counts = pbmc.data ,
                                 assay = "RNA", project = 'pbmc')
#> Warning: Feature names cannot have underscores ('_'),
#> replacing with dashes ('-')

# Use AddMetaData to add new meta data to object
seurat_object  <- AddMetaData(object = seurat_object, metadata = metadata)

4.6 What does the seurat object look like?

seurat_object 
#> An object of class Seurat 
#> 35635 features across 5000 samples within 1 assay 
#> Active assay: RNA (35635 features, 0 variable features)
#>  1 layer present: counts

Lets examine a few genes in the first thirty cells.

The . values in the matrix represent 0s (no molecules detected). Since most values in an scRNA-seq matrix are 0, Seurat uses a sparse-matrix representation whenever possible. This results in significant memory and speed savings for Drop-seq/inDrop/10x data.

GetAssayData(seurat_object, layer='counts')[c("CD3D","TCL1A","MS4A1"), 1:30]
#> 3 x 30 sparse Matrix of class "dgCMatrix"
#>   [[ suppressing 30 column names 'AGGGCGCTATTTCC-1', 'GGAGACGATTCGTT-1', 'CACCGTTGTCGTAG-1' ... ]]
#>                                                            
#> CD3D  . 2 . . . 5 . . . . 2 . . . 1 . . 1 1 . . . 1 . 1 3 .
#> TCL1A . . . . . . . . . . . . . . . . . . . . . . . 3 . . .
#> MS4A1 . . . . . . . . . . . . . . . . . . . . . . . 1 . . .
#>            
#> CD3D  2 . .
#> TCL1A . . .
#> MS4A1 . . .

And the cell metadata for the first 10 cells (most of these columns come from the original paper, we’ll get into them later.)

head(seurat_object@meta.data)
#>                     orig.ident nCount_RNA nFeature_RNA  ind
#> AGGGCGCTATTTCC-1 SeuratProject       2020          523 1256
#> GGAGACGATTCGTT-1 SeuratProject        840          381 1256
#> CACCGTTGTCGTAG-1 SeuratProject       3097          995 1016
#> TATCGTACACGCAT-1 SeuratProject       1011          540 1488
#> TGACGCCTTGCTTT-1 SeuratProject        570          367  101
#> TACGAGACCTATTC-1 SeuratProject       2399          770 1244
#>                  stim              cell multiplets
#> AGGGCGCTATTTCC-1 stim   CD14+ Monocytes    singlet
#> GGAGACGATTCGTT-1 stim       CD4 T cells    singlet
#> CACCGTTGTCGTAG-1 ctrl FCGR3A+ Monocytes    singlet
#> TATCGTACACGCAT-1 stim           B cells    singlet
#> TGACGCCTTGCTTT-1 ctrl       CD4 T cells       ambs
#> TACGAGACCTATTC-1 stim       CD4 T cells    singlet

What about Multiple sampes?

In the real world, you’ll probably want to load multiple samples, and join them together into one big data object before analysis.

This is described in the seurat merge vignette


#so.sample1 <- # load it
#so.sample1$sample_name <- "S1" # label it 
#so.sample1$group       <- "control" 
# ...
# # load the other samples
# Join them
seurat_object.all <- merge(so.sample1, y = c(so.sample2, so.sample3), add.cell.ids = c("S1", "S2", "S3"), project = "Someproject")

How much RAM does the sparse matrix save?

dense.size <- object.size(as.matrix(pbmc.data))
#> Warning in asMethod(object): sparse->dense coercion:
#> allocating vector of size 1.3 GiB
dense.size
#> 1428252816 bytes
sparse.size <- object.size(pbmc.data)
sparse.size
#> 39615768 bytes
dense.size / sparse.size
#> 36.1 bytes

3 Schedule

5 QC Filtering