6 Input Data

What does the data look like when you get it? Examine raw data.

Explain e.g. CosMx in detail, refer to similarities in e.g. Xenium
Load data into object
Brief explanation of different data structures
Sample annotation: Applying sample names to individual slides. Mention difficulty of multiple samples per slide (can we recommend some approach?)

=====

6.1 The raw data

For a CosMx SMI slide (as in this experiment), the output has two parts:

Flat files: These contain most of what we need for processing. Typically the directory of flat files is read in as a whole, so we don't need to worry much about the individual files. Note, however, that there are no images in here.
- SLIDE-polygons.csv.gz : Cell borders
- SLIDE_exprMat_file.csv.gz : Counts of genes per cell (Counts matrix)
- SLIDE_fov_positions_file.csv.gz : Location of FOVs on slide
- SLIDE_metadata_file.csv.gz : Cell level QC metadata
- SLIDE_tx_file.csv.gz : Location of individual transcripts.
Raw files: A very large, messy directory with lots of files, including the microscopy images.
- RawFiles/SLIDE/RUN_CODE/CellStatsDir/Morphology2D : Location of images
- Export setup document: https://github.com/Nanostring-Biostats/CosMxDACustomModules/blob/main/Export/CosMxDAExportSetup.docx
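Before loading anything, it can help to check which flat files are actually present and to look at the first few rows of one. The following is a minimal base R sketch, not part of the workshop scripts: `find_flat_files()` and `peek_flat_file()` are hypothetical helper names, and the file-name endings are simply those listed above (the slide prefix is assumed to vary between runs).

```r
# Hypothetical helper: report which of the expected CosMx flat files
# exist in a directory. File-name endings are taken from the list above.
find_flat_files <- function(flat_dir) {
  suffixes <- c(
    polygons      = "-polygons.csv.gz",
    expression    = "_exprMat_file.csv.gz",
    fov_positions = "_fov_positions_file.csv.gz",
    metadata      = "_metadata_file.csv.gz",
    transcripts   = "_tx_file.csv.gz"
  )
  files <- list.files(flat_dir)
  # For each expected ending, collect the files that match it
  hits <- lapply(suffixes, function(s) files[endsWith(files, s)])
  data.frame(
    type  = names(suffixes),
    file  = unname(vapply(
      hits,
      function(h) if (length(h) > 0) h[1] else NA_character_,
      character(1)
    )),
    found = unname(lengths(hits) > 0)
  )
}

# Hypothetical helper: read only the first n rows of a gzipped CSV,
# which is usually enough to see the column names and value types.
peek_flat_file <- function(path, n = 5) {
  utils::read.csv(gzfile(path), nrows = n)
}

# Example (not run): point these at the flat-file directory of one slide.
# find_flat_files(file.path("raw_data", "GSM7473682_HC_a"))
```

Reading only a few rows matters most for the transcript file, which is typically by far the largest of the five.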

For a Xenium slide there is just one typical output directory. The 10x Genomics documentation discusses the formats in its section on data archiving, but again, software tools take the directory as a whole.

6.2 Loading the data

6.2.1 Load one sample from raw data

Load the libraries and set the paths.

library(Seurat)

## Loading required package: SeuratObject

## Loading required package: sp

## 
## Attaching package: 'SeuratObject'

## The following objects are masked from 'package:base':
## 
##     intersect, t

library(SeuratObject)
library(tidyverse)

## ── Attaching core tidyverse packages ─────────────────────────────────────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0

## ── Conflicts ───────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(here)

## here() starts at /home/s.williams/projects/intro-spatial-transcriptomics-workshop

## Paths
data_dir              <- file.path("data/")
raw_data_dir          <- file.path("raw_data/")

In principle, one sample can be loaded as follows.

## Load one sample
sample_path <- file.path(raw_data_dir, "GSM7473682_HC_a")
so <- LoadNanostring(sample_path,
                     assay = "RNA",
                     fov   = "GSM7473682_HC_a")


# HOWEVER:
# This default method drops most of the metadata from the Seurat object,
# e.g. which FOV each cell belongs to is missing.

# so@meta.data
#        orig.ident nCount_RNA nFeature_RNA
# 1_1 SeuratProject        368          189
# 2_1 SeuratProject        810          286
# 3_1 SeuratProject        119           74

# An alternative loading function is sourced here.
# In time, this should be fixed within Seurat; see the discussion at
# https://github.com/satijalab/seurat/discussions/9261
source("scripts/LoadNanostring_edited_function.R")

Each sample should be annotated with its experimental details. This particular study has one sample per slide (easy!), but slides typically hold more than one sample.

so$tissue_sample   <- "HC_A"
so$group           <- "HC"
so$condition       <- "Healthy Controls"
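When a slide does hold several samples, one possible approach is to keep a small lookup table that assigns each FOV to a sample and to annotate cells through it. The sketch below uses a toy data frame in place of the Seurat metadata; the `fov` column, the FOV numbers and the `CD_A` sample name are assumptions made for illustration, not values from this study.

```r
# Toy cell-level metadata standing in for so@meta.data.
# Assumption: a `fov` column is available (the default loader drops it).
cell_meta <- data.frame(
  cell = c("1_1", "2_1", "1_2", "1_3", "2_3"),
  fov  = c(1, 1, 2, 3, 3)
)

# Hypothetical lookup table: which FOVs belong to which sample.
# In practice this would come from the slide layout recorded at the bench.
fov_to_sample <- data.frame(
  fov           = c(1, 2, 3),
  tissue_sample = c("HC_A", "HC_A", "CD_A"),
  group         = c("HC", "HC", "CD")
)

# match() keeps the cells in their original order, unlike merge()
idx <- match(cell_meta$fov, fov_to_sample$fov)

# Fail early if any FOV has not been assigned to a sample
stopifnot(!anyNA(idx))

cell_meta$tissue_sample <- fov_to_sample$tissue_sample[idx]
cell_meta$group         <- fov_to_sample$group[idx]
cell_meta
```

With a real object, the resulting vectors could be assigned in the same way as the single-sample annotation (for example `so$tissue_sample <- ...`), provided the metadata carries an FOV column.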

6.2.2 Load all samples from raw data

Code example goes here (not run)
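As a rough sketch of how this could look, the sample annotation can be derived from the directory names and then applied in a loop. This assumes every sample directory follows the pattern of the one used above (`<GSM id>_<group>_<replicate>`, e.g. `GSM7473682_HC_a`); `parse_sample_dirs()` is a hypothetical helper, and the Seurat part is wrapped in `if (FALSE)` because it needs the raw data on disk. It uses the default `LoadNanostring()`, so the metadata caveat from the previous section still applies.

```r
# Hypothetical helper: turn sample directory names into an annotation table.
# Assumed naming pattern: <GSM id>_<group>_<replicate>
parse_sample_dirs <- function(sample_dirs) {
  ids   <- basename(sample_dirs)
  parts <- strsplit(ids, "_", fixed = TRUE)
  stopifnot(all(lengths(parts) == 3))
  group     <- vapply(parts, `[`, character(1), 2)
  replicate <- vapply(parts, `[`, character(1), 3)
  data.frame(
    path          = sample_dirs,
    sample_id     = ids,
    tissue_sample = paste(group, toupper(replicate), sep = "_"),
    group         = group
  )
}

# Runnable check on the one directory name used in this chapter
parse_sample_dirs("raw_data/GSM7473682_HC_a")

# Not run: load every sample, annotate it, then merge into one object.
if (FALSE) {
  library(Seurat)
  samples <- parse_sample_dirs(list.dirs("raw_data", recursive = FALSE))
  so_list <- lapply(seq_len(nrow(samples)), function(i) {
    so <- LoadNanostring(samples$path[i],
                         assay = "RNA",
                         fov   = samples$sample_id[i])
    so$tissue_sample <- samples$tissue_sample[i]
    so$group         <- samples$group[i]
    so
  })
  # Prefix cell names with the sample id so they stay unique after merging
  so_all <- merge(so_list[[1]], y = so_list[-1],
                  add.cell.ids = samples$sample_id)
}
```

Annotating each object before merging means the sample labels travel with the cells, so nothing has to be reconstructed from cell names afterwards.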

6.2.3 Load the workshop data

This dataset is a subset of the experimental data: only the first 5 fields of view (FOVs) of each sample, and only the CD and HC sample groups.

so_raw <- readRDS(file.path(data_dir, "GSE234713_CosMx_IBD_seurat_00_raw_subsampled.RDS"))
