FiRE通过为每个细胞计算rareness/outlierness score来鉴定rare cell。为了计算邻近度,FiRE使用了Sketching算法。
Github: https://github.com/princethewinner/FiRE
Example
## L <- number of estimators
## M <- Number of dims to sample
## H <- Number of bins
## seed <- seed for random number generator
## verbose <- verbose level
library('FiRE')
data(sample_data) #Samples * Features
data(sample_label)
## Samples with label'1'represent abundant,
## while samples with label'2'represent rare.
## Saving rownames and colnames
rnames <- rownames(sample_data)
cnames <- colnames(sample_data)
## Converting data.frame to matrix
sample_data <- as.matrix(sample_data)
sample_label <- as.matrix(sample_label)
L <- 100 # Number of estimators
M <- 50 # Dims to be sampled
# Model creation without optional parameter
model <- new(FiRE::FiRE, L, M)
## There are 3 more optional parameters they can be passed as
## model <- new(FiRE::FiRE, L, M, H, seed, verbose)
## Hashing all samples
model$fit(sample_data)
## Computing score of each sample
rareness_score <- model$score(sample_data)
## Apply IQR-based criteria to identify rare samples for further downstream analysis.
q3 <- quantile(rareness_score, 0.75)
iqr <- IQR(rareness_score)
th <- q3 + (1.5*iqr)
## Select indexes that satisfy IQR-based thresholding criteria.
indIqr <- which(rareness_score >= th)
## Create a vector for binary predictions
predictions <- rep(1, dim(sample_data)[1])
predictions[indIqr] <- 2 #Replace predictions for rare samples with'2'.
## To access model parameters
## Parameters - generated weights# model$w
## Parameters - sample dimensions
# model$d
## Parameters - generated thresholds
# model$ths
## Parameters - Bins
# model$b
FiRE只能分析UMI数据?
目前FiRE主要针对海量单细胞数据进行分析。
Model of FiRE参数应该如何设置?
L:Total number of estimators估计的次数。
M:Number of features to be randomly sampled for estimator每次估计选择的基因数,不能设置的过大,不然会导致对noisy expression readings的敏感。
H:代表hash table size,应该足够大能够避免异质细胞间的不必要的碰撞。一般要比细胞数大十倍。
判断细胞是否为rare cell的标准
目前FiRE默认采用分位数的标准进行筛选,FiRE score is ≥q3 +1.5 ×IQR,IQR表示75th percentile−25th percentile。
FiRE采用的dropClust原理
单细胞转录组数据特点
- 高维度。矩阵不仅在列的维度(基因)达到了上万,而且在行维度(细胞)也达到了上万,例如文中所用的数据集68k PBMC来说,就是一个68579 cells x 32738 genes的一个表达矩阵。
- 稀疏矩阵。由于在测序过程中发生干扰,或基因本身没有表达而造成的大量的表达值0(即droplet)的矩阵元素出现,68k PBMC中dropout rate达到了98.33%。
目前聚类算法的缺点
- 基于Nearest Neighbour Network的聚类方法,由于涉及到对细胞与细胞两两之间相似度的计算(即把细胞看成是向量,计算向量之间的相似度),往往会造成很大的计算开销。
- 对细胞进行简单的Random Sampling会使得细胞数量较少的那些类别有可能无法被采样,导致最终聚类的精度不够高。(Highly Parallel Genome-Wide Expression Profling of Individual Cells Using Nanoliter Droplets )
- 基于KMeans的方法有两个明显的缺点。一是需要明确指定聚类的数量,这对发现细胞类别来说是不足的;二是难以挖掘那些非球状的聚类。
dropClust优势
- 采用局部敏感哈希(LSH)的方法去找到细胞的Nearest Neighbour。(加速聚类的过程)
- 使用这些Neighbour的信息,提出了一种细胞采样的方法Structure Preserving Sampling (SPS) ,使得细胞数量较少的那些类别被采样到的比率会相对比较大,降低由于采样而导致的聚类精度损失。
- 提出的整个过程能够处理大规模、高稀疏度的单细胞RNA数据。
dropClust原理
dropClust大致可以分为四个步骤,首先通过Structure Preserving Sampling和Gene Selection的方法,筛选出一个小规模的表达矩阵。再通过一个简单的层次聚类方法对这个小的表达矩阵的Cell进行分类,最后把没有被Sample到的那些Cell通过LSH的方法分配到已经分好类的这些Cell所属的类别中,完成所有Cell的聚类过程。
Sketching算法原理
A Sketch encodes a high-dimensional data point to a bit vector. Th length of the bit vector is usually much smaller than the data dimension. These bit vectors are constructed using a randomized algorithm as outlined below: For each cell, c in a given scRNA-seq data, a Sketch or bit vector (containing 0s and 1 s) of size M is generated by randomly selecting M genes at a time and applying thresholds on each of them. A threshold is a randomly chosen numeric value lying between the minimum and maximum values observed in a given expression matrix. A weight vector w is generated randomly such that w 2 R M . The dot product between a Sketch and w is mapped to one of the predetermined hash codes using the modulo hashing technique. The hamming distance between a pair of Sketches approximates the L 1 distance between their corresponding high-dimensional data points (cells).