7 PCAs and UMAPs

7.1 Identification of highly variable features (feature selection)

Tutorial: https://satijalab.org/seurat/articles/pbmc3k_tutorial#identification-of-highly-variable-features-feature-selection

Why do we need to do this?

Restricting downstream analysis to the most variable features retains the real biological variability of the data while reducing noise from genes that vary little between cells.
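To make the idea concrete, here is a minimal sketch in plain Python. It is not Seurat's implementation: Seurat's default "vst" method models the mean-variance relationship before ranking genes, whereas this sketch simply ranks genes by their variance-to-mean ratio as a stand-in. The function name, gene names, and expression values are hypothetical.

```python
from statistics import mean, pvariance

def top_variable_features(expr, n_top):
    """Rank genes by variance-to-mean ratio and return the top n_top names.

    expr maps a gene name to its normalized expression values across cells.
    This is a simplified stand-in for a real feature-selection method.
    """
    dispersion = {}
    for gene, values in expr.items():
        mu = mean(values)
        # Genes that are never expressed get a dispersion of 0.
        dispersion[gene] = pvariance(values) / mu if mu > 0 else 0.0
    ranked = sorted(dispersion, key=dispersion.get, reverse=True)
    return ranked[:n_top]

# Toy data (hypothetical): three genes measured in four cells.
expr = {
    "GeneA": [0.0, 5.0, 0.0, 5.0],  # on in some cells, off in others
    "GeneB": [2.0, 2.0, 2.0, 2.0],  # constant, carries no information
    "GeneC": [1.0, 2.0, 1.0, 2.0],  # mild variation
}

hvg = top_variable_features(expr, 2)
print(hvg)
```

The constant gene is dropped because it cannot help to tell cells apart, which is the intuition behind feature selection.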

7.2 Scaling the data

Tutorial: https://satijalab.org/seurat/articles/pbmc3k_tutorial#scaling-the-data

Why do we need to do this?

Highly expressed genes can overpower the signal of less expressed genes that are equally important. Scaling shifts and rescales the expression of each gene so that its mean across cells is 0 and its variance is 1, which gives every gene equal weight in downstream analyses. Within the same cell, the assumption is that the underlying RNA content is constant. Additionally, if variables are provided in vars.to.regress, they are individually regressed against each feature, and the resulting residuals are then scaled and centered. This step allows controlling for cell cycle and other factors that may bias your clustering.
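The two operations described above (regressing out a variable, then centering and scaling the residuals) can be sketched in plain Python for a single gene. This is an illustration only, assuming a simple linear model with one covariate; it is not Seurat's code, and the function names, the covariate (a hypothetical cell cycle score), and the values are made up for the example.

```python
from statistics import mean, stdev

def regress_out(values, covariate):
    """Return the residuals of a least-squares linear fit values ~ covariate."""
    x_bar, y_bar = mean(covariate), mean(values)
    sxx = sum((x - x_bar) ** 2 for x in covariate)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(covariate, values))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = y_bar - slope * x_bar
    return [y - (intercept + slope * x) for x, y in zip(covariate, values)]

def center_and_scale(values):
    """Shift to mean 0 and rescale to unit standard deviation."""
    mu, sd = mean(values), stdev(values)
    if sd == 0:
        return [0.0 for _ in values]  # constant input: nothing to scale
    return [(v - mu) / sd for v in values]

# Toy data (hypothetical): one gene in four cells, plus a per-cell covariate.
gene = [1.0, 3.0, 2.0, 6.0]
covariate = [0.0, 1.0, 0.0, 1.0]  # e.g. a cell cycle score

residuals = regress_out(gene, covariate)
scaled = center_and_scale(residuals)
print(residuals)
print(scaled)
```

After regression the residuals no longer track the covariate, so cells are not grouped by it; the final scaling step then puts this gene on the same footing as every other gene.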