哈佛大学单细胞课程

针对单细胞基础分析深入学习

Posted by CHY on August 21, 2020

记录学习哈佛单细胞数据分析课程笔记。

单细胞分析优势及特点

单细胞分析可以解决的问题:

  • 探究组织中存在的细胞类型
  • 鉴定未知的或稀有的细胞类型
  • 探究发育过程中基因表达差异
  • 不同环境下不同类型细胞的表达差异
  • 空间转录组表达信息探索

单细胞分析存在的挑战:

  • 数据量过大
  • 每个细胞对应测序量过低(often detecting only 10-50% of the transcriptome per cell)
  • 细胞样本间存在技术差异
  • 细胞样本间存在生物差异

单细胞关于批次需要考虑的问题:

  • Were all RNA isolations performed on the same day?
  • Were all library preparations performed on the same day?
  • Did the same person perform the RNA isolation/library preparation for all samples?
  • Did you use the same reagents for all samples?
  • Did you perform the RNA isolation/library preparation in the same location?

raw data to count matrix

3’(or5’)-end sequencing 与 Full length sequencing 对比
3’(or5’)-end sequencing

  • More accurate quantification through use of unique molecular identifiers distinguishing biological duplicates from amplification (PCR) duplicates

  • Larger number of cells sequenced allows better identity of cell type populations
  • Cheaper per cell cost
  • Best results with > 10,000 cells Full length sequencing
  • Detection of isoform-level differences in expression
  • Identification of allele-specific differences in expression
  • Deeper sequencing of a smaller number of cells
  • Best for samples with low number of cells
常规分析流程
  • Generation of the count matrix (method-specific steps): formating reads, demultiplexing samples, mapping and quantification
  • Quality control of the raw counts: filtering of poor quality cells
  • Clustering of filtered counts: clustering cells based on similarities in transcriptional activity (cell types = different clusters)
  • Marker identification: identifying gene markers for each cluster
  • Optional downstream steps
表达矩阵获取

表达矩阵获取 对于 3’(or5’)-end sequencing 测序得到的序列可以用 umis 和 zUMIS 工具进行表达矩阵的获取,具体原理如下:

  1. Formatting reads and filtering noisy cellular barcodes
  2. Demultiplexing the samples
  3. Mapping/pseudo-mapping to transcriptome
  4. Collapsing UMIs and quantification of reads (Kallisto or featureCounts) 对于 10x Genomics 获取的测序数据,直接采用官方的 cellranger 流程分析。
Formatting reads and filtering noisy cellular barcodes

在 droplet-based 方法中,许多的微珠只能捕获少量的 reads,主要是由于:

  • encapsulation of free floating RNA from dying cells
  • simple cells (RBCs, etc.) expressing few genes
  • cells that failed for some reason 在进行比对之前,需要将这些进行过滤。

Quality control set-up(读取数据到 R)

数据读入:

  1. readMM() Matrix 包函数,直接生成稀疏矩阵
  2. Read10X() Seurat 包函数,使用 Cellranger 输出目录作为输入,生成稀疏矩阵
### readMM()函数
# Read in `matrix.mtx`
counts <- readMM("data/ctrl_raw_feature_bc_matrix/matrix.mtx.gz")
# Read in `genes.tsv`
genes <- read_tsv("data/ctrl_raw_feature_bc_matrix/features.tsv.gz", col_names = FALSE)
gene_ids <- genes$X1
# Read in `barcodes.tsv`
cell_ids <- read_tsv("data/ctrl_raw_feature_bc_matrix/barcodes.tsv.gz", col_names = FALSE)$X1
# Make the column names as the cell IDs and the row names as the gene IDs
rownames(counts) <- gene_ids
colnames(counts) <- cell_ids
# Create each individual Seurat object for every sample
for (file in c("ctrl_raw_feature_bc_matrix", "stim_raw_feature_bc_matrix")){
        seurat_data <- Read10X(data.dir = paste0("data/", file))
        seurat_obj <- CreateSeuratObject(counts = seurat_data,
                                         min.features = 100,
                                         project = file)
        assign(file, seurat_obj)
}
# Create a merged Seurat object
merged_seurat <- merge(x = ctrl_raw_feature_bc_matrix,
                       y = stim_raw_feature_bc_matrix,
                       add.cell.id = c("ctrl", "stim"))

Quality control 质量控制

Goals:

  • To filter the data to only include true cells that are of high quality, so that when we cluster our cells it is easier to identify distinct cell type populations
  • To identify any failed samples and either try to salvage the data or remove from analysis, in addition to, trying to understand why the sample failed Challenges:
  • Delineating cells that are poor quality from less complex cells
  • Choosing appropriate thresholds for filtering, so as to keep high quality cells without removing biologically relevant cell types

构建 Seurat 对象后,会自动生成 nCount_RNA/nFeature_RNA 数据,还需要其他指标:

  • number of genes detected per UMI: this metric with give us an idea of the complexity of our dataset (more genes detected per UMI, more complex our data)
# Add number of genes per UMI for each cell to metadata
merged_seurat$log10GenesPerUMI <- log10(merged_seurat$nFeature_RNA) / log10(merged_seurat$nCount_RNA)
  • mitochondrial ratio: this metric will give us a percentage of cell reads originating from the mitochondrial genes
# Compute percent mito ratio
merged_seurat$mitoRatio <- PercentageFeatureSet(object = merged_seurat, pattern = "^MT-")
merged_seurat$mitoRatio <- merged_seurat@meta.data$mitoRatio / 100
Cell counts
# Visualize the number of cell counts per sample
metadata %>%
  	ggplot(aes(x=sample, fill=sample)) +
  	geom_bar() +
  	theme_classic() +
  	theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
  	theme(plot.title = element_text(hjust=0.5, face="bold")) +
  	ggtitle("NCells")
UMI counts (transcripts) per cell
# Visualize the number UMIs/transcripts per cell
metadata %>%
  	ggplot(aes(color=sample, x=nUMI, fill= sample)) +
  	geom_density(alpha = 0.2) +
  	scale_x_log10() +
  	theme_classic() +
  	ylab("Cell density") +
  	geom_vline(xintercept = 500)
Genes detected per cell
# Visualize the distribution of genes detected per cell via histogram
metadata %>%
  	ggplot(aes(color=sample, x=nGene, fill= sample)) +
  	geom_density(alpha = 0.2) +
  	theme_classic() +
  	scale_x_log10() +
  	geom_vline(xintercept = 300)

# Visualize the distribution of genes detected per cell via boxplot
metadata %>%
  	ggplot(aes(x=sample, y=log10(nGene), fill=sample)) +
  	geom_boxplot() +
  	theme_classic() +
  	theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
  	theme(plot.title = element_text(hjust=0.5, face="bold")) +
  	ggtitle("NCells vs NGenes")
UMIs vs. genes detected

Mitochondrial read fractions are only high (light blue color) in particularly low count cells with few detected genes. This could be indicative of damaged/dying cells whose cytoplasmic mRNA has leaked out through a broken membrane, and thus, only mRNA located in the mitochondria is still conserved.

# Visualize the correlation between genes detected and number of UMIs and determine whether strong presence of cells with low numbers of genes/UMIs
metadata %>%
 	ggplot(aes(x=nUMI, y=nGene, color=mitoRatio)) +
 	geom_point() +
scale_colour_gradient(low = "gray90", high = "black") +
 	stat_smooth(method=lm) +
 	scale_x_log10() +
 	scale_y_log10() +
 	theme_classic() +
 	geom_vline(xintercept = 500) +
 	geom_hline(yintercept = 250) +
 	facet_wrap(~sample)
Mitochondrial counts ratio
# Visualize the distribution of mitochondrial gene expression detected per cell
metadata %>%
 	ggplot(aes(color=sample, x=mitoRatio, fill=sample)) +
 	geom_density(alpha = 0.2) +
 	scale_x_log10() +
 	theme_classic() +
 	geom_vline(xintercept = 0.2)
Complexity

Outlier cells in these samples might be cells that have a less complex RNA species than other cells. expect the novelty score to be above 0.80.

# Visualize the overall complexity of the gene expression by visualizing the genes detected per UMI
metadata %>%
  	ggplot(aes(x=log10GenesPerUMI, color = sample, fill=sample)) +
  	geom_density(alpha = 0.2) +
  	theme_classic() +
  	geom_vline(xintercept = 0.8)
Cell-level Filtering
  • nUMI > 500
  • nGene > 250
  • log10GenesPerUMI > 0.8
  • mitoRatio < 0.2
# Filter out low quality reads using selected thresholds - these will change with experiment
filtered_seurat <- subset(x = merged_seurat,
                         subset= (nUMI >= 500) &
                           (nGene >= 250) &
                           (log10GenesPerUMI > 0.80) &
                           (mitoRatio < 0.20))
Gene-level filtering

keep only genes which are expressed in 10 or more cells.

# Output a logical vector for every gene on whether the more than zero counts per cell
# Extract counts
counts <- GetAssayData(object = filtered_seurat, slot = "counts")
# Output a logical vector for every gene on whether the more than zero counts per cell
nonzero <- counts > 0
# Sums all TRUE values and returns TRUE if more than 10 TRUE values per gene
keep_genes <- Matrix::rowSums(nonzero) >= 10
# Only keeping those genes expressed in more than 10 cells
filtered_counts <- counts[keep_genes, ]
# Reassign to filtered Seurat object
filtered_seurat <- CreateSeuratObject(filtered_counts, meta.data = filtered_seurat@meta.data)

当数据进行过滤后,最好是重新去看看各项指标是否正确利于下游分析。

Count Normalization and Principal Component Analysis

count normalization, which is essential to make accurate comparisons of gene expression between cells (or samples). 细胞层面
主要考虑的因素:Sequencing depth,Gene length
In scRNA-seq analysis, we will be comparing the expression of different genes within the cells to cluster the cells. If using a 3’ or 5’ droplet-based method, the length of the gene will not affect the analysis because only the 5’ or 3’ end of the transcript is sequenced. However, if using full-length sequencing, the transcript length should be accounted for.

聚类分析准备

To identify clusters, the following steps will be performed:

  1. Normalization, variance stabilization, and regression of unwanted variation (e.g. mitochondrial transcript abundance, cell cycle phase, etc.) for each sample
  2. Integration of the samples using shared highly variable genes (optional, but recommended to align cells from different samples/conditions if cell types are separating by sample/condition)
  3. Clustering cells based on top PCs (metagenes)
  4. Exploration of quality control metrics: determine whether clusters are unbalanced wrt UMIs, genes, cell cycle, mitochondrial content, samples, etc.
  5. Searching for expected cell types using known cell type-specific gene markers
# Normalization, variance stabilization, and regression of unwanted variation for each sample
# sctransform
# 计算Cell cycle score前需要先Normalize
seurat_phase <- NormalizeData(filtered_seurat)
# Load cell cycle markers
load("data/cycle.rda")
# Score cells for cell cycle
seurat_phase <- CellCycleScoring(seurat_phase,
                                 g2m.features = g2m_genes,
                                 s.features = s_genes)
# View cell cycle scores and phases assigned to cells
View(seurat_phase@meta.data)
# Identify the most variable genes
seurat_phase <- FindVariableFeatures(seurat_phase,
                     selection.method = "vst",
                     nfeatures = 2000,
                     verbose = FALSE)
# Scale the counts
seurat_phase <- ScaleData(seurat_phase)
# Perform PCA
seurat_phase <- RunPCA(seurat_phase)
# Plot the PCA colored by cell cycle phase
DimPlot(seurat_phase,
        reduction = "pca",
        group.by= "Phase",
        split.by = "Phase")

聚类分析

Identify significant PCs
  • 热图,探究 PC 中是否包含高变基因列表
  • elbow plot
聚类

For datasets of 3,000 - 5,000 cells, the resolution set between 0.4-1.4 generally yields good clustering.
一般建议多设置几个 reslution,进而判断哪个更适合后续分析。

# Determine the clusters for various resolutions
seurat_integrated <- FindClusters(object = seurat_integrated,
                               resolution = c(0.4, 0.6, 0.8, 1.0, 1.4))

在进行降维可视化的时候,建议选择与聚类一直的 PC 数。

聚类质量控制

# Extract identity and sample information from seurat object to determine the number of cells per cluster per sample
n_cells <- FetchData(seurat_integrated,
                     vars = c("ident", "orig.ident")) %>%
        dplyr::count(ident, orig.ident) %>%
        tidyr::spread(ident, n)

# View table
View(n_cells)
# Explore whether clusters segregate by cell cycle phase
DimPlot(seurat_integrated,
        label = TRUE,
        split.by = "Phase")  + NoLegend()
# Determine metrics to plot present in seurat_integrated@meta.data
metrics <-  c("nUMI", "nGene", "S.Score", "G2M.Score", "mitoRatio")
FeaturePlot(seurat_integrated,
            reduction = "umap",
            features = metrics,
            pt.size = 0.4,
            sort.cell = TRUE,
            min.cutoff = 'q10',
            label = TRUE)
# Exploration of the PCs driving the different clusters
# Defining the information in the seurat object of interest
columns <- c(paste0("PC_", 1:16),
            "ident",
            "UMAP_1", "UMAP_2")
# Extracting this data from the seurat object
pc_data <- FetchData(seurat_integrated,
                     vars = columns)
# Adding cluster label to center of cluster on UMAP
umap_label <- FetchData(seurat_integrated,
                        vars = c("ident", "UMAP_1", "UMAP_2"))  %>%
  group_by(ident) %>%
  summarise(x=mean(UMAP_1), y=mean(UMAP_2))
# Plotting a UMAP plot for each of the PCs
map(paste0("PC_", 1:16), function(pc){
        ggplot(pc_data,
               aes(UMAP_1, UMAP_2)) +
                geom_point(aes_string(color=pc),
                           alpha = 0.7) +
                scale_color_gradient(guide = FALSE,
                                     low = "grey90",
                                     high = "blue")  +
                geom_text(data=umap_label,
                          aes(label=ident, x, y)) +
                ggtitle(pc)
}) %>%
        plot_grid(plotlist = .)

在进行 Marker gene 分析时,采用 RNA assay slot 中的 normalized count data.
在利用 Marker gene 进行注释时,需要多个 Marker gene 同时表达才能断定.

Marker 鉴定

Before we start our marker identification we will explicitly set our default assay, we want to use the original counts and not the integrated data.鉴定 Marker gene 时选择原始数据值.

  1. Identification of all markers for each cluster 存在三个重要参数:logfc.threshold、min.diff.pct、min.pct
  2. Identification of conserved markers for each cluster
DefaultAssay(seurat_integrated) <- "RNA"
FindConservedMarkers(seurat_integrated,
                     ident.1 = cluster,
                     grouping.var = "sample",
                     only.pos = TRUE,
		     min.diff.pct = 0.25,
                     min.pct = 0.25,
		     logfc.threshold = 0.25)
  • gene: gene symbol
  • condition_p_val: p-value not adjusted for multiple test correction for condition
  • condition_avg_logFC: average log2 fold change for condition. Positive values indicate that the gene is more highly expressed in the cluster.
  • condition_pct.1: percentage of cells where the gene is detected in the cluster for condition
  • condition_pct.2: percentage of cells where the gene is detected on average in the other clusters for condition
  • condition_p_val_adj: adjusted p-value for condition, based on bonferroni correction using all genes in the dataset, used to determine significance
  • max_pval: largest p value of p value calculated by each group/condition
  • minimump_p_val: combined p value we suggest looking for markers with large differences in expression between pct.1 and pct.2 and larger fold changes.
  1. Marker identification between specific clusters
# Create function to get conserved markers for any given cluster
get_conserved <- function(cluster){
  FindConservedMarkers(seurat_integrated,
                       ident.1 = cluster,
                       grouping.var = "sample",
                       only.pos = TRUE) %>%
    rownames_to_column(var = "gene") %>%
    left_join(y = unique(annotations[, c("gene_name", "description")]),
               by = c("gene" = "gene_name")) %>%
    cbind(cluster_id = cluster, .)
  }
# Iterate function across desired clusters
conserved_markers <- map_dfr(c(7,20), get_conserved)

分析细节

cell cycle scoring

assign each cell a score, based on its expression of G2/M and S phase markers.

# Download cell cycle genes for organism at https://github.com/hbc/tinyatlas/tree/master/cell_cycle. Read it in with:
cc_file <- getURL("https://raw.githubusercontent.com/hbc/tinyatlas/master/cell_cycle/Homo_sapiens.csv")
cell_cycle_genes <- read.csv(text = cc_file)

# ID转换(视情况具体转换)
# Connect to AnnotationHub
ah <- AnnotationHub()
# Access the Ensembl database for organism
ahDb <- query(ah,
              pattern = c("Homo sapiens", "EnsDb"),
              ignore.case = TRUE)
# Acquire the latest annotation files
id <- ahDb %>%
        mcols() %>%
        rownames() %>%
        tail(n = 1)
# Download the appropriate Ensembldb database
edb <- ah[[id]]
# Extract gene-level information from database
annotations <- genes(edb,
                     return.type = "data.frame")
# Select annotations of interest
annotations <- annotations %>%
        dplyr::select(gene_id, gene_name, seq_name, gene_biotype, description)
# Get gene names for Ensembl IDs for each gene
cell_cycle_markers <- dplyr::left_join(cell_cycle_genes, annotations, by = c("geneID" = "gene_id"))

# Acquire the S phase genes
s_genes <- cell_cycle_markers %>%
        dplyr::filter(phase == "S") %>%
        pull("gene_name")
# Acquire the G2M phase genes
g2m_genes <- cell_cycle_markers %>%
        dplyr::filter(phase == "G2/M") %>%
        pull("gene_name")

# Perform cell cycle scoring
seurat_control <- CellCycleScoring(seurat_control,
                                   g2m.features = g2m_genes,
                                   s.features = s_genes)

# Perform PCA and color by cell cycle phase
seurat_control <- RunPCA(seurat_control)

# Visualize the PCA, grouping by cell cycle phase
DimPlot(seurat_control,
        reduction = "pca",
        group.by= "Phase")

elbow plot 判断

通常 elbow plot 的判断都较为主观,需要更为定量的方法。
判断依据:

  1. The point where the principal components only contribute 5% of standard deviation and the principal components cumulatively contribute 90% of the standard deviation.
  2. The point where the percent change in variation between the consecutive PCs is less than 0.1%.
# first metric
# Determine percent of variation associated with each PC
pct <- seurat_integrated[["pca"]]@stdev / sum(seurat_integrated[["pca"]]@stdev) * 100
# Calculate cumulative percents for each PC
cumu <- cumsum(pct)
# Determine which PC exhibits cumulative percent greater than 90% and % variation associated with the PC as less than 5
co1 <- which(cumu > 90 & pct < 5)[1]
co1

# second metric
# Determine the difference between variation of PC and subsequent PC
co2 <- sort(which((pct[1:length(pct) - 1] - pct[2:length(pct)]) > 0.1), decreasing = T)[1] + 1
# last point where change of % of variation is more than 0.1%.
co2

# Minimum of the two calculation
pcs <- min(co1, co2)
pcs

# Create a dataframe with values
plot_df <- data.frame(pct = pct,
           cumu = cumu,
           rank = 1:length(pct))

# Elbow plot to visualize
  ggplot(plot_df, aes(cumu, pct, label = rank, color = rank > pcs)) +
  geom_text() +
  geom_vline(xintercept = 90, color = "grey") +
  geom_hline(yintercept = min(pct[pct > 5]), color = "grey") +
  theme_bw()

####