6 Input Data
What does the data look like when you get it? Examine raw data.
- Explain e.g. CosMx in detail, refer to similarities in e.g. Xenium
- Load data into object
- Brief explanation of different data structures
- Sample annotation: Applying sample names to individual slides. Mention difficulty of multiple samples per slide (can we recommend some approach?)
=====
6.1 The raw data
For a CosMx SMI slide (as in this experiment), the raw data comes in two parts:
- Flat files : Contain most of what we need for processing. Typically a directory of flat files is read in as a whole, and we don't need to worry too much about the individual files. But note there are no images in here.
  - SLIDE-polygons.csv.gz : Cell borders
  - SLIDE_exprMat_file.csv.gz : Counts of genes per cell (counts matrix)
  - SLIDE_fov_positions_file.csv.gz : Location of FOVs on slide
  - SLIDE_metadata_file.csv.gz : Cell-level QC metadata
  - SLIDE_tx_file.csv.gz : Location of individual transcripts
- Raw files : A very large, cluttered directory with many files, including the microscopy images.
  - RawFiles/SLIDE/RUN_CODE/CellStatsDir/Morphology2D : Location of images
  - Export setup instructions: https://github.com/Nanostring-Biostats/CosMxDACustomModules/blob/main/Export/CosMxDAExportSetup.docx
For a Xenium slide there is a single standard output directory. The 10x Genomics documentation describes the file formats in its section on data archiving, but again, software tools take the directory as a whole.
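Before loading anything, it can be worth checking that a CosMx flat file directory actually contains the five files listed above. The following is a minimal base R sketch, not part of any package: the function name `check_flat_files` and the suffix table are illustrative assumptions, and the file names are matched only on the suffixes shown in the list.

```r
# Suffixes of the CosMx flat files listed above, with a short description.
flat_file_suffixes <- c(
  "-polygons"           = "Cell borders",
  "_exprMat_file"       = "Counts of genes per cell",
  "_fov_positions_file" = "Location of FOVs on slide",
  "_metadata_file"      = "Cell-level QC metadata",
  "_tx_file"            = "Location of individual transcripts"
)

# Hypothetical helper: report which expected flat file is present in a directory.
# Works for both .csv and .csv.gz files.
check_flat_files <- function(flat_dir) {
  files <- list.files(flat_dir)
  # Strip a trailing .gz so both compressed and uncompressed files match
  stems <- sub("\\.gz$", "", files)
  found <- vapply(names(flat_file_suffixes), function(suffix) {
    hit <- files[endsWith(stems, paste0(suffix, ".csv"))]
    if (length(hit) > 0) hit[1] else NA_character_
  }, character(1))
  data.frame(
    contents = unname(flat_file_suffixes),
    file     = unname(found),
    stringsAsFactors = FALSE
  )
}

# Usage (the path is a placeholder for your own flat file directory):
# check_flat_files("path/to/flat_files")
```

A row with `NA` in the `file` column means that flat file was not found, which is worth resolving before trying to read the directory as a whole.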
6.2 Loading the data
6.2.1 Load one sample from raw data
Load libraries and set paths.
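The setup chunk itself is not shown here, only its startup messages. A minimal sketch of what it might look like, judging from those messages (Seurat, tidyverse and here are attached), follows. The value of `raw_data_dir` is an assumption for illustration; point it at wherever your raw data lives.

```r
# Packages inferred from the startup messages shown below
library(Seurat)     # also attaches SeuratObject and sp
library(tidyverse)
library(here)

# ASSUMPTION: hypothetical location of the raw data, relative to the project root
raw_data_dir <- here("raw_data")
```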
## Loading required package: SeuratObject
## Loading required package: sp
##
## Attaching package: 'SeuratObject'
## The following objects are masked from 'package:base':
##
## intersect, t
## ── Attaching core tidyverse packages ─────────────────────────────────────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## here() starts at /home/s.williams/projects/intro-spatial-transcriptomics-workshop
The code to load one sample should be as follows.
## Load one sample
sample_path <- file.path(raw_data_dir, "GSM7473682_HC_a")
so <- LoadNanostring(sample_path,
                     assay = "RNA",
                     fov = "GSM7473682_HC_a")

# HOWEVER:
# This default method drops most of the metadata from the Seurat object,
# e.g. which FOV each cell belongs to is missing.
# so@meta.data
#     orig.ident    nCount_RNA nFeature_RNA
# 1_1 SeuratProject        368          189
# 2_1 SeuratProject        810          286
# 3_1 SeuratProject        119           74

# An alternative function is sourced below.
# In time, this should be fixed within Seurat; see the comments here:
# https://github.com/satijalab/seurat/discussions/9261
source("scripts/LoadNanostring_edited_function.R")
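To illustrate what "keeping the metadata" involves, here is a minimal base R sketch of the matching step. It is not the edited function sourced above. It assumes the cell-level metadata file has `fov` and `cell_ID` columns and that Seurat cell names follow the `<cell_ID>_<fov>` pattern suggested by the row names in the output above (`1_1`, `2_1`, ...). The toy values and the helper name `key_cell_metadata` are made up for illustration.

```r
# Toy stand-in for SLIDE_metadata_file.csv.gz; a real file has many more columns.
cell_metadata <- data.frame(
  fov     = c(1, 1, 1),
  cell_ID = c(1, 2, 3),
  Area    = c(5021, 8830, 1764)  # hypothetical values
)

# Hypothetical helper: key the metadata by the assumed "<cell_ID>_<fov>" cell name
key_cell_metadata <- function(md) {
  cell_names <- paste(md$cell_ID, md$fov, sep = "_")
  # Cell names must be unique to be usable as row names
  stopifnot(!anyDuplicated(cell_names))
  rownames(md) <- cell_names
  md
}

md <- key_cell_metadata(cell_metadata)

# With a Seurat object `so` whose cell names use the same pattern, the
# columns could then be attached with Seurat's AddMetaData():
# so <- AddMetaData(so, metadata = md)
```

Check the row names of `so@meta.data` against the keys built here before attaching anything; if the patterns differ, the added columns will be filled with `NA`.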
Each sample should be annotated with its experimental details. This particular study has one sample per slide (easy!), but there are typically more.
so$tissue_sample <- "HC_A"
so$group <- "HC"
so$condition <- "Healthy Controls"
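When a slide carries more than one sample, a single assignment like the one above no longer works. One possible approach, sketched below in base R, is to keep a small lookup table of which FOVs belong to which sample and annotate cells through it. The FOV ranges, the extra sample names and the helper `assign_sample_by_fov` are all hypothetical, and the approach assumes each sample occupies its own block of FOVs and that the FOV of each cell is available in the object's metadata.

```r
# ASSUMPTION: hypothetical slide layout, one contiguous block of FOVs per sample
sample_lookup <- data.frame(
  tissue_sample = c("HC_A", "HC_B", "HC_C"),
  fov_start     = c(1, 8, 15),
  fov_end       = c(7, 14, 20),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the sample name for each FOV number,
# or NA when the FOV is not covered by exactly one row of the lookup table.
assign_sample_by_fov <- function(fov, lookup) {
  vapply(fov, function(f) {
    hit <- which(f >= lookup$fov_start & f <= lookup$fov_end)
    if (length(hit) == 1) lookup$tissue_sample[hit] else NA_character_
  }, character(1))
}

# One FOV value per cell
cell_fov <- c(1, 7, 8, 20, 25)
cell_sample <- assign_sample_by_fov(cell_fov, sample_lookup)

# With a Seurat object that retains a per-cell `fov` column:
# so$tissue_sample <- assign_sample_by_fov(so$fov, sample_lookup)
```

Cells left as `NA` fall outside every listed FOV range, which makes gaps or mistakes in the lookup table easy to spot before any downstream analysis.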