What is the difference between Crohn’s Disease and healthy samples, in terms of gene expression within each cell type?

We can test for differential expression as though this was a single cell experiment. For a review of approaches see Soneson & Robinson 2018.

We’re going to use a pseudobulk approach; summing up all cells within a group, and treating it more like a bulk RNAseq experiment.

For a more detailed walkthrough of this test, see Spatial Sampler: Differential expression between groups using pseudobulk

Pool samples into pseudobulk

We need to create a cell-level annotation for the pseudobulk grouping in our analysis. We can do this by concatenating the tissue sample ID with the cell type; here we call this column pdb_sample.

Now use the aggregateAcrossCells() function from the scuttle package to sum up each gene’s expression across each pdb_sample.

This will build a SummarizedExperiment object with one entry per pdb_sample combined sample-cell type group. Each entry contains the sum of copies of the gene for all cells within the group.

This can take a while to run - we will load one that’s already been preprocessed

NB: We might consider applying an additional filter here for minimum counts per cell (particularly since we don’t need to maintain spatial information), but this data has already been filtered.

## DO NOT RUN ##
# sfe$pdb_sample <- paste0(sfe$tissue_sample, '_', sfe$celltype_subset)
# se.pdb <- aggregateAcrossCells(sfe, ids=sfe$pdb_sample,
#                                BPPARAM = MulticoreParam(workers=4)  )

Load pseudobulk

se.pdb <- readRDS(spe_pseudobulk_by_celltype_file)

The counts per cell look like this. Many zeros.

counts(sfe)[1:10,1:8]
## <10 x 8> sparse DelayedMatrix object of type "double":
##       HC_a_1000_1 HC_a_1000_2 HC_a_1000_4 ... HC_a_1001_4 HC_a_1002_1
## AATK            0           0           0   .           0           0
## ABL1            1           0           0   .           0           0
## ABL2            0           0           0   .           0           0
## ACE             0           0           0   .           0           0
## ACE2            0           0           0   .           1           0
## ACKR1           0           0           0   .           0           0
## ACKR3           0           0           0   .           0           0
## ACKR4           0           0           1   .           0           0
## ACTA2           0           0           0   .           0           0
## ACTG2           0           0           0   .           0           0

But the counts per pseudobulk sample look like this:

counts(se.pdb)[1:10,1:8]
##       CD_a_epi CD_a_myeloids CD_a_plasmas CD_a_stroma CD_a_tcells CD_b_epi
## AATK       176            40           83         257           9        0
## ABL1       263            33          114         391          11        1
## ABL2       172            37           67         174          17        0
## ACE        142            19           45         117           6        0
## ACE2       310            31          130         253          14        2
## ACKR1      106            29           56         267          10        0
## ACKR3      102            21           73         166           2        0
## ACKR4      152            17           56         168          13        2
## ACTA2      120            30           60         852          15        0
## ACTG2      130            38           86        1529          22        0
##       CD_b_myeloids CD_b_plasmas
## AATK            483          353
## ABL1            483          750
## ABL2            566          589
## ACE             540          373
## ACE2            399          619
## ACKR1           271          411
## ACKR3           284          370
## ACKR4           348          436
## ACTA2           385          377
## ACTG2           287          433

Filter to a minimum number of cells per pool

There’s going to be cases where a given sample doesn’t have enough cells to reasonably test. We will pick a threshold for ‘minimum cells per pseudobulk sample’ based on the distribution of the data. A pool of just 6 cells would be way too small!

DT::datatable(data.frame(colData(se.pdb))[,c('pdb_sample','individual_code','tissue_sample','ncells')])

100 is a decent threshold for this data. We could consider 200, but we would start to lose more of the T-cells, which are rarer overall.

# What is the minimum acceptable number of cells in a pool?
min_cells_per_pdbsample <- 100

ggplot(colData(se.pdb), aes(x=ncells, col=celltype_subset)) +
  geom_density() +
  geom_vline(xintercept=min_cells_per_pdbsample, lty=3) +
  geom_rug() +
  scale_x_log10() +
  theme_bw() +
  ggtitle("How many cells per pseduobulk sample?")

Apply filter

# Filter on minimum number of cells per pdb group
# (ncells column was added by aggregateAcrossCells)
se.pdb <- se.pdb[,se.pdb$ncells >= min_cells_per_pdbsample]

Differential expression

First, we pull out the counts data, and corresponding pseudobulk annotation.

Then, we build a table of the contrasts we want to test - here there is only one - ‘CD-HC’; Crohn’s disease vs Healthy Controls. Having this table makes it easier to dynamically skip cell types or contrasts which don’t have enough cells to test.

# Pull out the pseudobulk counts matrix for passed samples
pseudobulk_counts_matrix <- counts(se.pdb)
# And the corresponding annotations
pseudobulk_anno_table    <- as_tibble(colData(se.pdb)[,c('pdb_sample','group', 'tissue_sample', 'celltype_subset', 'ncells')]) # + any other experimental factors

# Build a table of each contrast we might want to do. 
contrasts_wanted <- bind_cols(
  A= c("CD"), # First terms
  B= c("HC")  # Second terms (usually control)
)

This section of code will test for differential expression, looping through each cell type.

  1. Subset the counts and annotation to only that cell type
  2. And then, for each contrast we want to test (which is only 1)
    1. Check if there are enough biological replicates to test (ie. at least 2vs2)
    2. Build a DGEList object with edgeR for testing
    3. Define the design matrix, here simply: ~0 + group
    4. Use voom and limma to test specified contrasts
    5. Pull out a results table, annotated with cell type and contrast tested
    6. Record it
  3. Merge together all results tables
# Empty list to collect results
de_result_list <- list()

## Cycle through each cell type
for (the_celltype in levels(se.pdb$celltype_subset)) {

  # Subset pseudobulk object to one cell type
  se.pdb.this <- se.pdb[,se.pdb$celltype_subset == the_celltype]
  
  # And pull out the annotation and counts
  anno_table.this   <- as_tibble(colData(se.pdb.this))
  count_matrix.this <- counts(se.pdb.this)

  
  ## Check for sufficient replicates ##
  # To do any calculations, we need at least 2 pseudobulk groups per contrast.
  # There are plenty in this experiment, but with less replicates and rare cell types.
  # It's very common to have to skip some contrasts for some cell types.

  # Skip clusters with no samples after filtering
  if( nrow(anno_table.this) < 1 ) {next}
  
  # Count how many biological replicates per group (only need one FOV to count it)
  biosample_per_group <- anno_table.this %>% 
    select(group, tissue_sample) %>%
    pull(group) %>% # pull out group column, one entry per tissue sample
    table() # count how many

  # Are there enough biological replicates to consider the contrast?
  min_biosample_per_group <- 2
  enough_biosamples <- 
    (unname(biosample_per_group[contrasts_wanted$A]) >= min_biosample_per_group) &
    (unname(biosample_per_group[contrasts_wanted$B]) >= min_biosample_per_group)
    
  # Make a new table with contrasts that we have enough biological replicates for.
  contrasts_to_test <- contrasts_wanted[enough_biosamples,]
  
  
  
  ## Setup model
  
  # Setup objects for limma
  dge <- DGEList(count_matrix.this)
  dge <- calcNormFactors(dge)
  
  # Build model
  group           <- anno_table.this$group

  # CUSTOMISE:
  # Model design 
  # Add other experimental factors here 
  # ~0 + group
  # ~0 + group + individual
  # ~0 + group + individual + slide
  design    <- model.matrix( ~0 + group)
  
  # Run voom
  vm  <- voom(dge, design = design, plot = FALSE)
  # Fit model
  fit     <- lmFit(vm, design) 

  # Define and fit contrasts and run eBayes
  # Doing this in an automated way so we can include/exclude contrasts where 
  # there are/aren't enough replicates available. 
  # The list of strings generated here is parsed by the makeContrasts function
  # to build a contrasts matrix.
  contrast_str_list <- paste0("group",contrasts_to_test$A,"-","group",contrasts_to_test$B)
  
  contrasts <- makeContrasts(contrasts=contrast_str_list,
                           levels=coef(fit))
  
  fit <- contrasts.fit(fit, contrasts)
  fit <- eBayes(fit)

  
  
  ## Loop through contrasts.
  # You can run multiple contrasts at onces, but doing it this way allows us
  # to skip individual contrasts within a cell type.
  for ( the_coef in colnames(contrasts) ) {
    # Make a version of the coefficient name that doesn't include a '-', or the 
    # factor name (group)
    # Life is easier without special characters
    # groupUC-groupHC => UCvHC
    contrast_name <- gsub("group","", gsub("-","v",the_coef))
    
    de_result.this <- topTable(fit, n = Inf, adjust.method = "BH", coef = the_coef) %>%
      rownames_to_column("target") %>%
      mutate(contrast=contrast_name,
             contrast_group="pairwise",
             celltype=the_celltype) %>%
      select(celltype,contrast_group, contrast,target,everything()) %>%
      arrange(P.Value)
    
      # Build a unique name for this result by adding <celltype>_<A>v<B>
      de_result_list[[paste(the_celltype, contrast_name, sep="_")]] <- de_result.this
    
  }

}
## Only 1 column left in sample GSM7473688_CD_a, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473690_CD_c, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473682_HC_a, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473683_HC_b, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473684_HC_c, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473688_CD_a, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473689_CD_b, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473690_CD_c, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473682_HC_a, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473683_HC_b, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473684_HC_c, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473688_CD_a, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473689_CD_b, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473690_CD_c, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473682_HC_a, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473683_HC_b, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473684_HC_c, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473688_CD_a, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473689_CD_b, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473690_CD_c, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473682_HC_a, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473683_HC_b, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473684_HC_c, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473688_CD_a, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473689_CD_b, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473690_CD_c, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473682_HC_a, graphs are meaningless, dropping graphs.
## Only 1 column left in sample GSM7473683_HC_b, graphs are meaningless, dropping graphs.
# Join together results for all cell types, and pull out those with a significant 
# adjusted p-value
de_results_all.celltype <- bind_rows(de_result_list)
de_results_sig.celltype <- filter(de_results_all.celltype, adj.P.Val < 0.05)

The significant results:

DT::datatable(de_results_sig.celltype)

Evaluate DE (by cell type) genes

We can plot each cell’s expression, but most counts are zero, which makes the plot difficult to interpret.

plotExpression(sfe, "SELENOP", x="group", colour_by = "group",
               other_fields = "celltype_subset") + 
 facet_wrap(~celltype_subset)

While differential expression used counts data, we can still calculate some normalised expression at the pseudobulk level for plotting.

# Get some normalised values, purely for plotting.
# (Can't use logged counts because the sizes are so dramatically different!)
dge <- DGEList(counts(se.pdb))
dge <- calcNormFactors(dge)
norm_counts <- cpm(dge)

# Store normalised log2 scale expression in 
assay(se.pdb, "logcounts") <- log2(norm_counts)

Here we can see the expression per biological sample.

plotExpression(se.pdb, "SELENOP", x="group", colour_by = "group",
               other_fields = "celltype_subset") + 
 facet_wrap(~celltype_subset)