6 Input Data

What does the data look like when you get it? Examine raw data.

  • Explain e.g. CosMx in detail, refer to similarities in e.g. Xenium
  • Load data into object
  • Brief explanation of different data structures
  • Sample annotation: Applying sample names to individual slides. Mention difficulty of multiple samples per slide (can we recommend some approach?)

=====

6.1 The raw data

For a CosMx SMI slide (as used in this experiment), the output comes in two parts:

  • Flat files : Most of what we need for processing. Typically the directory of flat files is read in as a whole, so we don’t need to worry too much about the individual files. Note that it contains no images.
    • SLIDE-polygons.csv.gz : Cell borders
    • SLIDE_exprMat_file.csv.gz : Counts of genes per cell (Counts matrix)
    • SLIDE_fov_positions_file.csv.gz : Location of FOVs on slide
    • SLIDE_metadata_file.csv.gz : Cell level QC metadata
    • SLIDE_tx_file.csv.gz : Location of individual transcripts.
  • Raw files : A very large directory containing many files, including the microscopy images.
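
To see what the flat files contain before loading them into an object, it can help to list the directory and read the first few rows of one table. The following is a minimal sketch in base R. It builds a tiny mock slide directory on the fly so that it runs anywhere; the slide name `SLIDE`, the gene columns and all values are invented for illustration, and real CosMx tables have many more columns.

```r
# Build a tiny mock flat-file directory so the example runs anywhere.
# With real data, point slide_dir at your own flat-file directory instead.
slide_dir <- file.path(tempdir(), "SLIDE_flatfiles")
dir.create(slide_dir, showWarnings = FALSE)

# Invented counts: one row per cell, one column per gene
mock_expr <- data.frame(fov = c(1, 1, 2), cell_ID = c(1, 2, 1),
                        GeneA = c(0, 3, 1), GeneB = c(2, 0, 5))
con <- gzfile(file.path(slide_dir, "SLIDE_exprMat_file.csv.gz"), "w")
write.csv(mock_expr, con, row.names = FALSE)
close(con)

# List the compressed tables present in the directory
flat_files <- list.files(slide_dir, pattern = "\\.csv\\.gz$")
print(flat_files)

# read.csv() reads gzip-compressed files directly; nrows keeps it quick
expr_head <- read.csv(file.path(slide_dir, flat_files[1]), nrows = 2)
print(expr_head)
```

Reading only a few rows is worthwhile on real data, because the transcript file in particular can be very large.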

For a Xenium slide there is just one typical output directory. The vendor documentation discusses the file formats in its section on data archiving, but again, software tools take the directory as a whole.

6.2 Loading the data

6.2.1 Load one sample from raw data

Load the libraries and set the paths.

## Libraries
library(Seurat)
library(tidyverse)
library(here)
## Paths
data_dir              <- file.path("data")
raw_data_dir          <- file.path("raw_data")

The code to load one sample should be as follows.

## Load one sample
sample_path <- file.path(raw_data_dir, "GSM7473682_HC_a")
so <- LoadNanostring(sample_path,
                     assay = 'RNA',
                     fov = "GSM7473682_HC_a")

# HOWEVER:
# This default method drops most of the metadata in the Seurat object,
# e.g. which FOV each cell belongs to is missing.

# so@meta.data
#        orig.ident nCount_RNA nFeature_RNA
# 1_1 SeuratProject        368          189
# 2_1 SeuratProject        810          286
# 3_1 SeuratProject        119           74

# An alternative loading function is provided here.
# In time, this should be fixed within Seurat itself; see the discussion at
# https://github.com/satijalab/seurat/discussions/9261
source("scripts/LoadNanostring_edited_function.R")

Each sample should be annotated with its experimental details. This particular study has one sample per slide (easy!), but slides typically hold several samples.

so$tissue_sample   <- "HC_A"
so$group           <- "HC"
so$condition       <- "Healthy Controls"
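
When a slide does carry several samples, one possible approach (an assumption here, not something the workshop data requires) is to keep a small lookup table that maps each FOV to its sample and to apply it to the per-cell metadata. The sketch below uses a plain data frame as a stand-in for `so@meta.data`; the `fov` column, the slide layout and the sample names are all hypothetical.

```r
# Stand-in for so@meta.data: one row per cell, with the FOV it belongs to.
cell_meta <- data.frame(
  fov = c(1, 1, 2, 3, 3),
  row.names = c("1_1", "2_1", "1_2", "1_3", "2_3")
)

# Hypothetical slide layout: which FOVs belong to which sample
fov_to_sample <- data.frame(
  fov           = c(1, 2, 3),
  tissue_sample = c("sample_A", "sample_A", "sample_B"),
  group         = c("HC", "HC", "CD")
)

# match() looks up each cell's FOV and keeps the original cell order
idx <- match(cell_meta$fov, fov_to_sample$fov)
cell_meta$tissue_sample <- fov_to_sample$tissue_sample[idx]
cell_meta$group         <- fov_to_sample$group[idx]

# Cells in FOVs missing from the lookup table would show up as NA here
print(table(cell_meta$tissue_sample, useNA = "ifany"))
```

With a Seurat object, the resulting vectors could be assigned in the same way as the single-sample annotations, provided the object carries a per-cell FOV column. The default loader drops that column, which is one reason to prefer a loader that keeps the full metadata.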

6.2.2 Load all samples from raw data

Code example goes here (not run)
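
As a rough sketch of how this could look, the block below discovers one sub-directory per sample and derives the annotations from the directory names, assuming they follow the pattern `<accession>_<group>_<replicate>`. Only `GSM7473682_HC_a` comes from the text above; the other directory names, the `mock_raw_dir` location and the `sample_sheet` table are made up for illustration. The Seurat loading step is left commented out, since it needs the real data.

```r
# Mock raw-data directory with one sub-directory per sample.
# Only the first name comes from the text; the others are invented.
mock_raw_dir <- file.path(tempdir(), "raw_data")
sample_names <- c("GSM7473682_HC_a", "GSM0000001_HC_b", "GSM0000002_CD_a")
for (s in sample_names) {
  dir.create(file.path(mock_raw_dir, s), recursive = TRUE, showWarnings = FALSE)
}

# Discover sample directories and split their names into parts,
# assuming the pattern <accession>_<group>_<replicate>
sample_dirs <- sort(list.dirs(mock_raw_dir, recursive = FALSE, full.names = FALSE))
parts <- strsplit(sample_dirs, "_")
sample_sheet <- data.frame(
  sample_dir = sample_dirs,
  group      = vapply(parts, `[`, character(1), 2),
  replicate  = vapply(parts, `[`, character(1), 3)
)
sample_sheet$tissue_sample <- paste(sample_sheet$group,
                                    sample_sheet$replicate, sep = "_")
print(sample_sheet)

# With real data and Seurat attached, each directory could then be loaded
# and annotated in a loop (not run here):
# so_list <- lapply(seq_len(nrow(sample_sheet)), function(i) {
#   so <- LoadNanostring(file.path(mock_raw_dir, sample_sheet$sample_dir[i]),
#                        assay = "RNA", fov = sample_sheet$sample_dir[i])
#   so$tissue_sample <- sample_sheet$tissue_sample[i]
#   so$group         <- sample_sheet$group[i]
#   so
# })
# so_all <- merge(so_list[[1]], y = so_list[-1])
```

Keeping the annotations in one table makes it easy to check them by eye before any loading starts. If the directory names do not encode the experimental design, the same table could instead be read from a hand-written sample sheet.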

6.2.3 Load the workshop data

This dataset is a subset of the experimental data: only the first 5 FOVs of each sample, and only the CD and HC sample groups.

so_raw <- readRDS(file.path(data_dir, "GSE234713_CosMx_IBD_seurat_00_raw_subsampled.RDS"))