FiRE

FiRE通过为每个细胞计算rareness/outlierness score来鉴定rare cell。为了计算邻近度，FiRE使用了Sketching算法。
Github: https://github.com/princethewinner/FiRE

Example

## L <- number of estimators
## M <- Number of dims to sample
## H <- Number of bins
## seed <- seed for random number generator
## verbose <- verbose level
library('FiRE')
data(sample_data) #Samples * Features
data(sample_label)
## Samples with label'1'represent abundant,
## while samples with label'2'represent rare.
## Saving rownames and colnames
rnames <- rownames(sample_data)
cnames <- colnames(sample_data)
## Converting data.frame to matrix
sample_data <- as.matrix(sample_data)
sample_label <- as.matrix(sample_label)
L <- 100 # Number of estimators
M <- 50 # Dims to be sampled
# Model creation without optional parameter
model <- new(FiRE::FiRE, L, M)
## There are 3 more optional parameters they can be passed as
## model <- new(FiRE::FiRE, L, M, H, seed, verbose)
## Hashing all samples
model$fit(sample_data)
## Computing score of each sample
rareness_score <- model$score(sample_data)
## Apply IQR-based criteria to identify rare samples for further downstream analysis.
q3 <- quantile(rareness_score, 0.75)
iqr <- IQR(rareness_score)
th <- q3 + (1.5*iqr)
## Select indexes that satisfy IQR-based thresholding criteria.
indIqr <- which(rareness_score >= th)
## Create a vector for binary predictions
predictions <- rep(1, dim(sample_data)[1])
predictions[indIqr] <- 2 #Replace predictions for rare samples with'2'.
## To access model parameters
## Parameters - generated weights# model$w
## Parameters - sample dimensions
# model$d
## Parameters - generated thresholds
# model$ths
## Parameters - Bins
# model$b

FiRE只能分析UMI数据？

目前FiRE主要针对海量单细胞数据进行分析。

Model of FiRE参数应该如何设置？

L：Total number of estimators估计的次数。
M：Number of features to be randomly sampled for estimator每次估计选择的基因数，不能设置的过大，不然会导致对noisy expression readings的敏感。
H：代表hash table size,应该足够大能够避免异质细胞间的不必要的碰撞。一般要比细胞数大十倍。

判断细胞是否为rare cell的标准

目前FiRE默认采用分位数的标准进行筛选，FiRE score is ≥q3 +1.5 ×IQR，IQR表示75th percentile−25th percentile。

FiRE采用的dropClust原理

单细胞转录组数据特点

高维度。矩阵不仅在列的维度（基因）达到了上万，而且在行维度（细胞）也达到了上万，例如文中所用的数据集68k PBMC来说，就是一个68579 cells x 32738 genes的一个表达矩阵。
稀疏矩阵。由于在测序过程中发生干扰，或基因本身没有表达而造成的大量的表达值0（即droplet）的矩阵元素出现，68k PBMC中dropout rate达到了98.33%。
目前聚类算法的缺点
基于Nearest Neighbour Network的聚类方法，由于涉及到对细胞与细胞两两之间相似度的计算（即把细胞看成是向量，计算向量之间的相似度），往往会造成很大的计算开销。
对细胞进行简单的Random Sampling会使得细胞数量较少的那些类别有可能无法被采样，导致最终聚类的精度不够高。（Highly Parallel Genome-Wide Expression Profling of Individual Cells Using Nanoliter Droplets ）
基于KMeans的方法有两个明显的缺点。一是需要明确指定聚类的数量，这对发现细胞类别来说是不足的；二是难以挖掘那些非球状的聚类。
dropClust优势
采用局部敏感哈希（LSH）的方法去找到细胞的Nearest Neighbour。（加速聚类的过程）
使用这些Neighbour的信息，提出了一种细胞采样的方法Structure Preserving Sampling (SPS) ，使得细胞数量较少的那些类别被采样到的比率会相对比较大，降低由于采样而导致的聚类精度损失。
提出的整个过程能够处理大规模、高稀疏度的单细胞RNA数据。
dropClust原理

dropClust大致可以分为四个步骤，首先通过Structure Preserving Sampling和Gene Selection的方法，筛选出一个小规模的表达矩阵。再通过一个简单的层次聚类方法对这个小的表达矩阵的Cell进行分类，最后把没有被Sample到的那些Cell通过LSH的方法分配到已经分好类的这些Cell所属的类别中，完成所有Cell的聚类过程。

Sketching算法原理

A Sketch encodes a high-dimensional data point to a bit vector. Th length of the bit vector is usually much smaller than the data dimension. These bit vectors are constructed using a randomized algorithm as outlined below: For each cell, c in a given scRNA-seq data, a Sketch or bit vector (containing 0s and 1 s) of size M is generated by randomly selecting M genes at a time and applying thresholds on each of them. A threshold is a randomly chosen numeric value lying between the minimum and maximum values observed in a given expression matrix. A weight vector w is generated randomly such that w 2 R M . The dot product between a Sketch and w is mapped to one of the predetermined hash codes using the modulo hashing technique. The hamming distance between a pair of Sketches approximates the L 1 distance between their corresponding high-dimensional data points (cells).

单细胞rare cell鉴定软件

Example

FiRE只能分析UMI数据？

Model of FiRE参数应该如何设置？

判断细胞是否为rare cell的标准

FiRE采用的dropClust原理

单细胞转录组数据特点

目前聚类算法的缺点

dropClust优势

dropClust原理

Sketching算法原理

CATALOG

FEATURED TAGS