Differential Expression: By niche

Generate niches
- Explore niches
Pool samples into pseudobulk
Filter to a minimum number of cells per pool
Differential expression
Evaluate DE (by regional niche) genes
- IGHG1 and IGHG2
- WNT2B

What is the difference between Crohn’s Disease and healthy samples in different regions of the tissue?

There are clear ‘regions’ in the tissue, e.g. a lymphoid region full of T cells and B cells. Or the very outer layer of epithelial, or the basal layer or stromal cells. We can identify and test for differences within these ‘niches’.

plotSpatialFeature(sfe.sample.HC,'celltype_subset', colGeometryName = "cellSeg") + 
  theme(legend.title=element_blank()) +
  ggtitle(sample)

Generate niches

We can use BANKSY to define niches. Rather than using cell types, BANKSY uses three pieces of information: each cell’s gene expression, its surrounding gene expression and a ‘azimuthal gabor filter’ (AGF) to capture the surrounding gene expression gradient - see Figure 1.

Again, this takes some time to run. The results are already loaded in the sfe object.

# ## DO NOT RUN ##
# 
# # Lambda defines how much BANKSY should consider the gene expression surrounding a cell
# # O uses only *this* cell. 0.8 is recommended for finding domains (niches).
# lambdas      <- c(0.6) 
# 
# # Resolution of clustering step 
# # Higher values = more niches.
# resolutions <- c(0.3, 0.5)
# 
# # Try to limit to just highly variable genes
# hvgs <- rownames(rowData(sfe))[rowData(sfe)$hvg]
# sfe.hvg <- sfe[hvgs,]
# sfe.hvg$tissue_sample<- droplevels(sfe.hvg$tissue_sample)
# 
# # Split by sample
# get_one_sample <- function(the_sample){sfe.hvg[,sfe.hvg$tissue_sample == the_sample]}
# sfe.list <- lapply(FUN=get_one_sample, X=levels(sfe.hvg$tissue_sample))
# 
# # Get the underlying BANKSY computation
# sfe.list <- lapply(sfe.list, computeBanksy, assay_name = 'logcounts', k_geom = k_geom)
# sfe.merged <- do.call(cbind, sfe.list)
# sfe.merged <- Banksy::runBanksyPCA(sfe.merged, lambda = lambdas, npcs = npcs, group='tissue_sample', seed=12)
# sfe.merged <- Banksy::clusterBanksy(sfe.merged, lambda = lambdas, npcs = npcs, resolution = resolutions, seed=12)
# 
# niches_table <- colData(sfe.merged)
# niche_lookup <- setNames(niches_table$clust_M0_lam0.6_k50_res0.5, nm=as.character( niches_table$cell ))
# niche_lookup2 <- setNames(niches_table$clust_M0_lam0.6_k50_res0.3, nm=as.character( niches_table$cell ))
# sfe$clust_M0_lam0.6_k50_res0.5 <- niche_lookup[as.character(sfe$cell)]
# sfe$clust_M0_lam0.6_k50_res0.3 <- niche_lookup2[as.character(sfe$cell)]
# sfe$niche <- factor(paste0('n', as.character(sfe$clust_M0_lam0.6_k50_res0.3)),
#                      levels=paste0('n',levels(sfe$clust_M0_lam0.6_k50_res0.3)))

See clusters assigned at different resolutions and the niche assigned using 0.3 resolution.

head(colData(sfe), n=100) %>% as_tibble() %>% 
  select(cell, tissue_sample, celltype_subset, clust_M0_lam0.6_k50_res0.5, clust_M0_lam0.6_k50_res0.3, niche )

Observe niches plotted on the healthy tissue.

plotSpatialFeature(sfe.sample.HC, "niche",colGeometryName = "cellSeg") + 
  theme(legend.title=element_blank()) +
  ggtitle(sample)

Compare with one of the Crohn’s samples.

plotSpatialFeature(sfe.sample.CD,'celltype_subset', colGeometryName = "cellSeg") + 
  theme(legend.title=element_blank()) +
  ggtitle(sample)

plotSpatialFeature(sfe.sample.CD, "niche",colGeometryName = "cellSeg") + # Choose this one
  theme(legend.title=element_blank()) +
  ggtitle(sample)

Explore niches

What cell types make up the niches?

Can see that;

Niche 7 is almost entirely epithelia along the outer edge.
Niche 2 is the epithelia in the epithelial ‘crypt’ structures in the lamina propria.
Niche 6 captures the lymphoid regions seen in in HC_a, full of T cells and plasma cells.

## Cell type proportions
celltype_summary_table<- colData(sfe) %>% as_tibble() %>%
  group_by(celltype_subset, niche ) %>%
  summarise(n_cells = n())

## `summarise()` has grouped output by 'celltype_subset'. You can override using
## the `.groups` argument.

ggplot(celltype_summary_table, aes(fill=celltype_subset, y=n_cells, x=niche) )+
  geom_bar(position="fill", stat="identity") +
  theme_bw() +
  coord_flip() +
  theme(legend.position = "bottom") +
  scale_y_continuous(expand = c(0,0)) +
  ggtitle( "Niche composition by cell type")

Likewise, its helpful to see the proportion of niche types across different samples. This can be quite variable - particularly when we’re working on a particularly small region of each sample!

## Niche proportions
celltype_summary_table<- colData(sfe) %>% as_tibble() %>%
  group_by(tissue_sample, niche ) %>%
  summarise(n_cells = n())

## `summarise()` has grouped output by 'tissue_sample'. You can override using the
## `.groups` argument.

ggplot(celltype_summary_table, aes(fill=niche, y=n_cells, x=tissue_sample) )+
  geom_bar(position="fill", stat="identity") +
  theme_bw() +
  coord_flip() +
  theme(legend.position = "bottom") +
  scale_y_continuous(expand = c(0,0)) +
  ggtitle( "Sample composition by niche")

Pool samples into pseudobulk

Like when conducting differential expression by cell type, we need to create a cell-level annotation for the pseudobulk grouping in our analysis, but this time by niche. We do this by concatenating the tissue sample ID with the niche; here we call this column pdb_sample.

The aggregateAcrossCells() function from the scuttle package is used to sum up each gene’s expression across each pdb_sample.

This will build a SummarizedExperiment object with one entry per pdb_sample combined sample-niche group. Each entry contains the sum of copies of the gene for all cells within the group.

This can take a while to run - we will load one that’s already been preprocessed.

# # DO NOT RUN
# sfe$pdb_sample <- paste0(sfe$tissue_sample, '_', sfe$niche)
# se.pdb <- aggregateAcrossCells(sfe, ids=sfe$pdb_sample,
#                                BPPARAM = MulticoreParam(workers=4)  )

Load preprocessed pseudobulk

se.pdb <- readRDS(spe_pseudobulk_by_niche_file)

counts(se.pdb)[1:5,1:5]

##      CD_a_n1 CD_a_n2 CD_a_n3 CD_a_n4 CD_a_n5
## AATK      22     141     123      70     200
## ABL1      46     208     139      87     319
## ABL2      19     130     109      64     131
## ACE       13     123      52      44      83
## ACE2      33     242     168      90     172

Filter to a minimum number of cells per pool

How many cell per pool?

DT::datatable(data.frame(colData(se.pdb))[,c('pdb_sample','individual_code','tissue_sample','niche','ncells')])

We will pick a threshold for ‘minimum cells per pseudobulk sample’ based on the distribution of the data. A pool of 2 or 5 cells would be too small to reasonably test.

# What is the minimum acceptable number of cells in a pool?

min_cells_per_pdbsample <- 100

ggplot(colData(se.pdb), aes(x=ncells, col=niche)) +
  geom_density() +
  geom_vline(xintercept=min_cells_per_pdbsample, lty=3) +
  geom_rug() +
  scale_x_log10() +
  theme_bw() +
  ggtitle("How many cells per pseduobulk sample?")

Here we can see that 100 cells is a good cutoff for all niches.

Apply filter.

# Filter on minimum number of cells per pdb group
# (ncells column was added by aggregateAcrossCells)
se.pdb <- se.pdb[,se.pdb$ncells >= min_cells_per_pdbsample]

Differential expression

Differential expression calculated exactly as per cell type, just with niche.

# Pull out the pseudobulk counts matrix for passed samples
pseudobulk_counts_matrix <- counts(se.pdb)
# And the corresponding annotations
pseudobulk_anno_table    <- as_tibble(colData(se.pdb)[,c('pdb_sample','group', 'tissue_sample', 'niche', 'ncells')]) # + any other experimental factors


# Build a table of each contrast we might want to do. 
contrasts_wanted <- bind_cols(
  A= c("CD"), # First terms
  B= c("HC")  # Second terms (usually control)
)

We won’t be able to test niches 6 or 8. But the below code will test for the number of available conditions per side, and skip them automatically.

table(pseudobulk_anno_table$niche, pseudobulk_anno_table$group)

##     
##      HC CD
##   n1  3  3
##   n2  3  3
##   n3  3  3
##   n4  3  3
##   n5  3  2
##   n6  3  1
##   n7  3  2
##   n8  0  1

# Empty list to collect results
de_result_list <- list()

## Cycle through each niche
for (the_niche in levels(se.pdb$niche)) {

  # Subset pseudobulk object to one niche
  se.pdb.this <- se.pdb[,se.pdb$niche == the_niche]
  
  # And pull out the annotation and counts
  anno_table.this   <- as_tibble(colData(se.pdb.this))
  count_matrix.this <- counts(se.pdb.this)

  
  ## Check for sufficient replicates ##
  # To do any calculations, we need at least 2 pseudobulk groups per contrast.
  # There are plenty in this experiment, but with less replicates and rare cell types
  # It's very common to have to skip some contrasts for some niches.

  # Skip clusters with no samples after filtering
  if( nrow(anno_table.this) < 1 ) {next}
  
  # Count how many biological replicates per group (only need one FOV to count it)
  biosample_per_group <- anno_table.this %>% 
    select(group, tissue_sample) %>%
    pull(group) %>% # pull out group column, one entry per tissue sample
    table() # count how many

  # Are there enough biological replicates to consider the contrast?
  min_biosample_per_group <- 2
  enough_biosamples <- 
    (unname(biosample_per_group[contrasts_wanted$A]) >= min_biosample_per_group) &
    (unname(biosample_per_group[contrasts_wanted$B]) >= min_biosample_per_group)
    
  # Make a new table with contrasts that we have enough biological replicates for.
  contrasts_to_test <- contrasts_wanted[enough_biosamples,]
  
  # Skip if no contrasts are testable
  if (nrow(contrasts_to_test) == 0) {next}
  
  ## Setup model
  
  # Setup objects for limma
  dge <- DGEList(count_matrix.this)
  dge <- calcNormFactors(dge)
  
  # Build model
  group           <- anno_table.this$group

  # Model design 
  # Add other experimental factors here 
  # ~0 + group
  # ~0 + group + individual
  # ~0 + group + individual + slide
  design    <- model.matrix( ~0 + group)
  
  # Run voom
  vm  <- voom(dge, design = design, plot = FALSE)
  
  # Fit model
  fit     <- lmFit(vm, design) 

  # Define and fit contrasts and run eBayes
  # Doing this in an automated way so we can include/exclude contrasts where 
  # there are/aren't enough replicates available. 
  # The list of strings generated here is parsed by the makeContrasts function
  # to build a contrasts matrix.
  contrast_str_list <- paste0("group",contrasts_to_test$A,"-","group",contrasts_to_test$B)
  
  contrasts <- makeContrasts(contrasts=contrast_str_list,
                           levels=coef(fit))
  
  fit <- contrasts.fit(fit, contrasts)
  fit <- eBayes(fit)

  
  
  ## Loop through contrasts.
  # You can run multiple contrasts at once, but doing it this way allows us
  # to skip individual contrasts within a niche.
  for ( the_coef in colnames(contrasts) ) {
    # Make a version of the coefficient name that doesn't include a '-', or the 
    # factor name (group)
    # Life is easier without special characters
    # groupUC-groupHC => UCvHC
    contrast_name <- gsub("group","", gsub("-","v",the_coef))
    
    de_result.this <- topTable(fit, n = Inf, adjust.method = "BH", coef = the_coef) %>%
      rownames_to_column("target") %>%
      mutate(contrast=contrast_name,
             contrast_group="pairwise",
             niche=the_niche) %>%
      select(niche,contrast_group, contrast,target,everything()) %>%
      arrange(P.Value)
    
      # Build a unique name for this result by adding <niche>_<A>v<B>
      de_result_list[[paste(the_niche, contrast_name, sep="_")]] <- de_result.this
    
  }

}

## Only 1 column left in sample GSM7473688_CD_a, graphs are meaningless, dropping graphs.

## Only 1 column left in sample GSM7473689_CD_b, graphs are meaningless, dropping graphs.

## Only 1 column left in sample GSM7473690_CD_c, graphs are meaningless, dropping graphs.

## Only 1 column left in sample GSM7473682_HC_a, graphs are meaningless, dropping graphs.

## Only 1 column left in sample GSM7473683_HC_b, graphs are meaningless, dropping graphs.

## Only 1 column left in sample GSM7473684_HC_c, graphs are meaningless, dropping graphs.

## Only 1 column left in sample GSM7473688_CD_a, graphs are meaningless, dropping graphs.

## Only 1 column left in sample GSM7473689_CD_b, graphs are meaningless, dropping graphs.

## Only 1 column left in sample GSM7473690_CD_c, graphs are meaningless, dropping graphs.

## Only 1 column left in sample GSM7473682_HC_a, graphs are meaningless, dropping graphs.

## Only 1 column left in sample GSM7473683_HC_b, graphs are meaningless, dropping graphs.

## Only 1 column left in sample GSM7473684_HC_c, graphs are meaningless, dropping graphs.

## Only 1 column left in sample GSM7473688_CD_a, graphs are meaningless, dropping graphs.

## Only 1 column left in sample GSM7473689_CD_b, graphs are meaningless, dropping graphs.

## Only 1 column left in sample GSM7473690_CD_c, graphs are meaningless, dropping graphs.

## Only 1 column left in sample GSM7473682_HC_a, graphs are meaningless, dropping graphs.

## Only 1 column left in sample GSM7473683_HC_b, graphs are meaningless, dropping graphs.

## Only 1 column left in sample GSM7473684_HC_c, graphs are meaningless, dropping graphs.

## Only 1 column left in sample GSM7473688_CD_a, graphs are meaningless, dropping graphs.

## Only 1 column left in sample GSM7473689_CD_b, graphs are meaningless, dropping graphs.

## Only 1 column left in sample GSM7473690_CD_c, graphs are meaningless, dropping graphs.

## Only 1 column left in sample GSM7473682_HC_a, graphs are meaningless, dropping graphs.

## Only 1 column left in sample GSM7473683_HC_b, graphs are meaningless, dropping graphs.

## Only 1 column left in sample GSM7473684_HC_c, graphs are meaningless, dropping graphs.

## Only 1 column left in sample GSM7473688_CD_a, graphs are meaningless, dropping graphs.

## Only 1 column left in sample GSM7473689_CD_b, graphs are meaningless, dropping graphs.

## Only 1 column left in sample GSM7473682_HC_a, graphs are meaningless, dropping graphs.

## Only 1 column left in sample GSM7473683_HC_b, graphs are meaningless, dropping graphs.

## Only 1 column left in sample GSM7473684_HC_c, graphs are meaningless, dropping graphs.

## Only 1 column left in sample GSM7473690_CD_c, graphs are meaningless, dropping graphs.

## Only 1 column left in sample GSM7473682_HC_a, graphs are meaningless, dropping graphs.

## Only 1 column left in sample GSM7473683_HC_b, graphs are meaningless, dropping graphs.

## Only 1 column left in sample GSM7473684_HC_c, graphs are meaningless, dropping graphs.

## Only 1 column left in sample GSM7473688_CD_a, graphs are meaningless, dropping graphs.

## Only 1 column left in sample GSM7473690_CD_c, graphs are meaningless, dropping graphs.

## Only 1 column left in sample GSM7473682_HC_a, graphs are meaningless, dropping graphs.

## Only 1 column left in sample GSM7473683_HC_b, graphs are meaningless, dropping graphs.

## Only 1 column left in sample GSM7473684_HC_c, graphs are meaningless, dropping graphs.

## Only 1 column left in sample GSM7473689_CD_b, graphs are meaningless, dropping graphs.

# Join together results for all niches, and pull out those with a significant adjusted p-value
de_results_all.niche <- bind_rows(de_result_list)
de_results_sig.niche <- filter(de_results_all.niche, adj.P.Val < 0.05)

DT::datatable(de_results_sig.niche)

Evaluate DE (by regional niche) genes

Again, make some normalised expression just for plotting.

# Get some normalised values, purely for plotting.
# (can't use logged counts because the sizes are so dramatically different!)
dge <- DGEList(counts(se.pdb))
dge <- calcNormFactors(dge)
norm_counts <- cpm(dge)

# Store normalised log2 scale expression in 
assay(se.pdb, "logcounts") <- log2(norm_counts)

IGHG1 and IGHG2

IGHG1 and IGHG2 (heavy chain immunoglobulins) are behaving very similarly, but the change seems to be fairly global.

p1 <- plotExpression(se.pdb, "IGHG2", x="group", colour_by = "group",
               other_fields = "niche") + 
  ggtitle("IGHG2") + 
  facet_wrap(~niche)
p2 <- plotExpression(se.pdb, "IGHG1", x="group", colour_by = "group",
               other_fields = "niche") + 
  ggtitle("IGHG1") + 
  facet_wrap(~niche)

p1 + p2

## Warning: Groups with fewer than two datapoints have been dropped.
## i Set `drop = FALSE` to consider such groups for position adjustment purposes.
## Groups with fewer than two datapoints have been dropped.
## i Set `drop = FALSE` to consider such groups for position adjustment purposes.

## Warning: Computation failed in `stat_ydensity()`.
## Caused by error in `$<-.data.frame`:
## ! replacement has 1 row, data has 0

## Warning: Groups with fewer than two datapoints have been dropped.
## i Set `drop = FALSE` to consider such groups for position adjustment purposes.
## Groups with fewer than two datapoints have been dropped.
## i Set `drop = FALSE` to consider such groups for position adjustment purposes.

## Warning: Computation failed in `stat_ydensity()`.
## Caused by error in `$<-.data.frame`:
## ! replacement has 1 row, data has 0

Just showing IGHG2; Suddenly Niche 1 makes sense!

p1 <- plotSpatialFeature(sfe.sample.HC, "IGHG2",colGeometryName = "cellSeg") + 
  theme(legend.title=element_blank()) +
  ggtitle(sample) +
  ggtitle("IGHG2 - HC_b")
p2 <- plotSpatialFeature(sfe.sample.CD, "IGHG2",colGeometryName = "cellSeg") + 
  theme(legend.title=element_blank()) +
  ggtitle(sample) +
    ggtitle("IGHG2 - CD_a")

p1 / p2

WNT2B

Could a lack of WNT2B be relevant? https://pubmed.ncbi.nlm.nih.gov/38697357/

plotExpression(se.pdb, "WNT2B", x="group", colour_by = "group",
               other_fields = "niche") + 
  ggtitle("WNT2B") + 
  facet_wrap(~niche)

## Warning: Groups with fewer than two datapoints have been dropped.
## i Set `drop = FALSE` to consider such groups for position adjustment purposes.
## Groups with fewer than two datapoints have been dropped.
## i Set `drop = FALSE` to consider such groups for position adjustment purposes.

## Warning: Computation failed in `stat_ydensity()`.
## Caused by error in `$<-.data.frame`:
## ! replacement has 1 row, data has 0

p1 <- plotSpatialFeature(sfe.sample.HC, "WNT2B",colGeometryName = "cellSeg") + 
  theme(legend.title=element_blank()) +
  ggtitle(sample) +
  ggtitle("WNT2B - HC_b")
p2 <- plotSpatialFeature(sfe.sample.CD, "WNT2B",colGeometryName = "cellSeg") + 
  theme(legend.title=element_blank()) +
  ggtitle(sample) +
    ggtitle("WNT2B - CD_a")

p1 / p2