What is the difference between Crohn’s Disease and healthy samples, in terms of gene expression within each cell type?
We can test for differential expression as though this was a single cell experiment. For a review of approaches see Soneson & Robinson 2018.
We’re going to use a pseudobulk approach; summing up all cells within a group, and treating it more like a bulk RNAseq experiment.
For a more detailed walkthrough of this test, see Spatial Sampler: Differential expression between groups using pseudobulk
We need to create a cell-level annotation for the pseudobulk grouping in our analysis. We can do this by concatenating the tissue sample ID with the cell type; here we call this column pdb_sample.
Now use the aggregateAcrossCells() function from the scuttle package to sum up each gene’s expression across each pdb_sample.
This will build a SummarizedExperiment object with one entry per pdb_sample combined sample-cell type group. Each entry contains the sum of copies of the gene for all cells within the group.
This can take a while to run - we will load one that’s already been preprocessed
NB: We might consider applying an additional filter here for minimum counts per cell (particularly since we don’t need to maintain spatial information), but this data has already been filtered.
## DO NOT RUN ##
# sfe$pdb_sample <- paste0(sfe$tissue_sample, '_', sfe$celltype_subset)
# se.pdb <- aggregateAcrossCells(sfe, ids=sfe$pdb_sample,
# BPPARAM = MulticoreParam(workers=4) )
Load pseudobulk
se.pdb <- readRDS(spe_pseudobulk_by_celltype_file)
The counts per cell look like this. Many zeros.
counts(sfe)[1:10,1:8]
## <10 x 8> sparse DelayedMatrix object of type "double":
## HC_a_1000_1 HC_a_1000_2 HC_a_1000_4 ... HC_a_1001_4 HC_a_1002_1
## AATK 0 0 0 . 0 0
## ABL1 1 0 0 . 0 0
## ABL2 0 0 0 . 0 0
## ACE 0 0 0 . 0 0
## ACE2 0 0 0 . 1 0
## ACKR1 0 0 0 . 0 0
## ACKR3 0 0 0 . 0 0
## ACKR4 0 0 1 . 0 0
## ACTA2 0 0 0 . 0 0
## ACTG2 0 0 0 . 0 0
But the counts per pseudobulk sample look like this:
counts(se.pdb)[1:10,1:8]
## CD_a_epi CD_a_myeloids CD_a_plasmas CD_a_stroma CD_a_tcells CD_b_epi
## AATK 176 40 83 257 9 0
## ABL1 263 33 114 391 11 1
## ABL2 172 37 67 174 17 0
## ACE 142 19 45 117 6 0
## ACE2 310 31 130 253 14 2
## ACKR1 106 29 56 267 10 0
## ACKR3 102 21 73 166 2 0
## ACKR4 152 17 56 168 13 2
## ACTA2 120 30 60 852 15 0
## ACTG2 130 38 86 1529 22 0
## CD_b_myeloids CD_b_plasmas
## AATK 483 353
## ABL1 483 750
## ABL2 566 589
## ACE 540 373
## ACE2 399 619
## ACKR1 271 411
## ACKR3 284 370
## ACKR4 348 436
## ACTA2 385 377
## ACTG2 287 433
There’s going to be cases where a given sample doesn’t have enough cells to reasonably test. We will pick a threshold for ‘minimum cells per pseudobulk sample’ based on the distribution of the data. A pool of just 6 cells would be way too small!
DT::datatable(data.frame(colData(se.pdb))[,c('pdb_sample','individual_code','tissue_sample','ncells')])
100 is a decent threshold for this data. We could consider 200, but we would start to lose more of the T-cells, which are rarer overall.
# What is the minimum acceptable number of cells in a pool?
min_cells_per_pdbsample <- 100
ggplot(colData(se.pdb), aes(x=ncells, col=celltype_subset)) +
geom_density() +
geom_vline(xintercept=min_cells_per_pdbsample, lty=3) +
geom_rug() +
scale_x_log10() +
theme_bw() +
ggtitle("How many cells per pseduobulk sample?")
Apply filter
# Filter on minimum number of cells per pdb group
# (ncells column was added by aggregateAcrossCells)
se.pdb <- se.pdb[,se.pdb$ncells >= min_cells_per_pdbsample]
First, we pull out the counts data, and corresponding pseudobulk annotation.
Then, we build a table of the contrasts we want to test - here there is only one - ‘CD-HC’; Crohn’s disease vs Healthy Controls. Having this table makes it easier to dynamically skip cell types or contrasts which don’t have enough cells to test.
# Pull out the pseudobulk counts matrix for passed samples
pseudobulk_counts_matrix <- counts(se.pdb)
# And the corresponding annotations
pseudobulk_anno_table <- as_tibble(colData(se.pdb)[,c('pdb_sample','group', 'tissue_sample', 'celltype_subset', 'ncells')]) # + any other experimental factors
# Build a table of each contrast we might want to do.
contrasts_wanted <- bind_cols(
A= c("CD"), # First terms
B= c("HC") # Second terms (usually control)
)
This section of code will test for differential expression, looping through each cell type.
DGEList object with edgeR for testingvoom and limma to test specified contrasts# Empty list to collect results
de_result_list <- list()
## Cycle through each cell type
for (the_celltype in levels(se.pdb$celltype_subset)) {
# Subset pseudobulk object to one cell type
se.pdb.this <- se.pdb[,se.pdb$celltype_subset == the_celltype]
# And pull out the annotation and counts
anno_table.this <- as_tibble(colData(se.pdb.this))
count_matrix.this <- counts(se.pdb.this)
## Check for sufficient replicates ##
# To do any calculations, we need at least 2 pseudobulk groups per contrast.
# There are plenty in this experiment, but with less replicates and rare cell types.
# It's very common to have to skip some contrasts for some cell types.
# Skip clusters with no samples after filtering
if( nrow(anno_table.this) < 1 ) {next}
# Count how many biological replicates per group (only need one FOV to count it)
biosample_per_group <- anno_table.this %>%
select(group, tissue_sample) %>%
pull(group) %>% # pull out group column, one entry per tissue sample
table() # count how many
# Are there enough biological replicates to consider the contrast?
min_biosample_per_group <- 2
enough_biosamples <-
(unname(biosample_per_group[contrasts_wanted$A]) >= min_biosample_per_group) &
(unname(biosample_per_group[contrasts_wanted$B]) >= min_biosample_per_group)
# Make a new table with contrasts that we have enough biological replicates for.
contrasts_to_test <- contrasts_wanted[enough_biosamples,]
## Setup model
# Setup objects for limma
dge <- DGEList(count_matrix.this)
dge <- calcNormFactors(dge)
# Build model
group <- anno_table.this$group
# CUSTOMISE:
# Model design
# Add other experimental factors here
# ~0 + group
# ~0 + group + individual
# ~0 + group + individual + slide
design <- model.matrix( ~0 + group)
# Run voom
vm <- voom(dge, design = design, plot = FALSE)
# Fit model
fit <- lmFit(vm, design)
# Define and fit contrasts and run eBayes
# Doing this in an automated way so we can include/exclude contrasts where
# there are/aren't enough replicates available.
# The list of strings generated here is parsed by the makeContrasts function
# to build a contrasts matrix.
contrast_str_list <- paste0("group",contrasts_to_test$A,"-","group",contrasts_to_test$B)
contrasts <- makeContrasts(contrasts=contrast_str_list,
levels=coef(fit))
fit <- contrasts.fit(fit, contrasts)
fit <- eBayes(fit)
## Loop through contrasts.
# You can run multiple contrasts at onces, but doing it this way allows us
# to skip individual contrasts within a cell type.
for ( the_coef in colnames(contrasts) ) {
# Make a version of the coefficient name that doesn't include a '-', or the
# factor name (group)
# Life is easier without special characters
# groupUC-groupHC => UCvHC
contrast_name <- gsub("group","", gsub("-","v",the_coef))
de_result.this <- topTable(fit, n = Inf, adjust.method = "BH", coef = the_coef) %>%
rownames_to_column("target") %>%
mutate(contrast=contrast_name,
contrast_group="pairwise",
celltype=the_celltype) %>%
select(celltype,contrast_group, contrast,target,everything()) %>%
arrange(P.Value)
# Build a unique name for this result by adding <celltype>_<A>v<B>
de_result_list[[paste(the_celltype, contrast_name, sep="_")]] <- de_result.this
}
}
## Only 1 column left in sample GSM7473688_CD_a, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473690_CD_c, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473682_HC_a, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473683_HC_b, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473684_HC_c, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473688_CD_a, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473689_CD_b, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473690_CD_c, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473682_HC_a, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473683_HC_b, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473684_HC_c, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473688_CD_a, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473689_CD_b, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473690_CD_c, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473682_HC_a, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473683_HC_b, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473684_HC_c, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473688_CD_a, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473689_CD_b, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473690_CD_c, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473682_HC_a, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473683_HC_b, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473684_HC_c, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473688_CD_a, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473689_CD_b, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473690_CD_c, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473682_HC_a, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473683_HC_b, graphs are meaningless, dropping graphs.
# Join together results for all cell types, and pull out those with a significant
# adjusted p-value
de_results_all.celltype <- bind_rows(de_result_list)
de_results_sig.celltype <- filter(de_results_all.celltype, adj.P.Val < 0.05)
The significant results:
DT::datatable(de_results_sig.celltype)
We can plot each cell’s expression, but most counts are zero, which makes the plot difficult to interpret.
plotExpression(sfe, "SELENOP", x="group", colour_by = "group",
other_fields = "celltype_subset") +
facet_wrap(~celltype_subset)
While differential expression used counts data, we can still calculate some normalised expression at the pseudobulk level for plotting.
# Get some normalised values, purely for plotting.
# (Can't use logged counts because the sizes are so dramatically different!)
dge <- DGEList(counts(se.pdb))
dge <- calcNormFactors(dge)
norm_counts <- cpm(dge)
# Store normalised log2 scale expression in
assay(se.pdb, "logcounts") <- log2(norm_counts)
Here we can see the expression per biological sample.
plotExpression(se.pdb, "SELENOP", x="group", colour_by = "group",
other_fields = "celltype_subset") +
facet_wrap(~celltype_subset)