SymSim

单细胞数据模拟(2)

Posted by CHY on June 16, 2020

本节主要学习另一个单细胞数据模拟软件SymSim。
SymSim与Splatter类似,可以用于模拟多种情况的数据(用于聚类、差异分析等等)。

  1. Benchmark clustering methods;
  2. Benchmark methods for differentially expressed genes;
  3. Benchmark trajectory inference methods;
  4. Test the effects of different confounding factors on the performance of each computational method;
  5. Estimate how many cells we need to sequence in order to detect a rare population under various realistic scenarios.

安装

# SymSim: 模拟单细胞RNA测序数据

# R包加载安装
library("devtools")
devtools::install_github("YosefLab/SymSim")
library(SysSim)

主要函数

SimulateTrueCounts( )主要生成真实的转录计数 and True2ObservedCounts( )SimulateTrueCounts会针对给定数量的基因和细胞生成真实的转录计数,其中细胞可以来自一个单一种群,多个离散种群或连续种群。 然后,True2ObservedCounts模拟文库的制备和测序程序,并将真实的成绩单计数转换为观察到的读数计数或UMI计数。
SimulateTrueCounts( )结果为含有4个元素的列表list,1.真实的转录计数的表达矩阵;2.基因meta信息;3.细胞meta信息;4.模拟所用的参数。
True2ObservedCounts( )结果为含有两个元素的列表,1.reads count矩阵或UMI矩阵;2.细胞meta信息。

批次数据模拟

DivideBatches( )利用True2ObservedCounts( )的结果作为输入,将数据拆分为多个批次的数据。批次信息存在输出结果的meta信息中。

Glodstandard金标准信息获取

SimulateTrueCounts()用于获取细胞聚类的真实信息;
getDEgenes()用于获取差异基因的信息;
getTrajectoryGenes()用于获取细胞的拟时信息。

真实数据集参考

BestMatchParams( )可参考真实数据进行参数估计,最终返回各项参数。

SimulateTrueCounts()参数

ncells_total total number of cells from all populations;
min_popsize number of cells in the rarest population;
i_minpop specifies which population has the smallest size;
ngenes number of genes;
evf_center the value which evf mean is generated from (default=1);
nevf number of EVFs for each kinetic parameter (default=30);
evf_type indicates the population structure of the cells, can be “one.population”, “discrete” or “continuous”;
n_de_evf number of differential evfs between populations for one kinetic parameter (default=18 when vary=’s’);
impulse when generating continuous populations, use the impulse model or not. Default is FALSE;
vary which kinetic parameters have differential evfs. Can be “all”, “kon”, “koff”, “s”, “except_kon”, “except_koff”, “except_s”;
Sigma controls heterogeneity each population;
phyla a tree which defines relationship between populations;
geffect_mean the mean of gene effect size;
gene_effects_sd controls differences between genes;
gene_effect_prob probability of non-zero values in the gene effect vectors;
bimod adjusts the bimodality of gene expression, thus controls intrinsic variation;
param_realdata the experimental dataset used to estimate kon, koff and s parameters;
scale_s the cell size parameter in (0,1). Use smaller value for cell types known to be small (like naive cells);
prop_hge proportion of high expression outlier genes (default=0.015);
mean_hge the inflation parameter to increase s for the high expression outlier genes;
randseed random seed to reproduce the results;

True2ObservedCounts()参数

true_counts true transcript counts from function SimulateTrueCounts;
meta_cell cell identity from function SimulateTrueCounts;
nbatch number of batches the cells are sequenced on;
protocol protocol for library preparation, can be “nonUMI” (without UMIs) or “10x” (with UMIs);
alpha_mean mean capture effeciency of all cells;
alpha_sd standard deviation of capture efficiency of all cells;
lenslope controls the amount of gene length bias;
nbins number of bins to simulate gene length bias;
gene_len gene lengths;
amp_bias_limit amount of amplification bias;
rate_2PCR PCR efficiency during amplification;
nPCR1 number of PCR cycles in the pre-amplification step;
nPCR2 number of PCR cycles in the second amplification step for fragments;
LinearAmp if linear amplification should be used instead of PCR amplification for the pre-amplification step. Default is FALSE;
LinearAmp_coef the number by which the number of transcript is multiplied if linear amplification is used;
depth_mean mean sequencing depth of all cells;
depth_sd standard deviation of sequencing depth of all cells;

Simulate one population模拟单个类群

ngenes <- 500
true_counts_res <- SimulateTrueCounts(ncells_total=300, ngenes=ngenes, evf_type="one.population", Sigma=0.4, randseed=0)
tsne_true_counts <- PlotTsne(meta=true_counts_res[[3]], data=log2(true_counts_res[[1]]+1), evf_type="one.population", n_pc=20, label='pop', saving = F, plotname="one.population")
tsne_true_counts[[2]]

data(gene_len_pool)
gene_len <- sample(gene_len_pool, ngenes, replace = FALSE)
observed_counts <- True2ObservedCounts(true_counts=true_counts_res[[1]], meta_cell=true_counts_res[[3]], protocol="nonUMI", alpha_mean=0.1, alpha_sd=0.05, gene_len=gene_len, depth_mean=1e5, depth_sd=3e3)

plot(log2(rowMeans(observed_counts[[1]])+1), log2(apply(observed_counts[[1]],1,cv)), col=adjustcolor("blue", alpha.f = 0.5), pch=19, xlab="log2(mean+1)", ylab="log2(CV)")

Simulate multiple discrete populations多个离散类群

true_counts_res <- SimulateTrueCounts(ncells_total=300, min_popsize=50, i_minpop=2, ngenes=ngenes, nevf=10, evf_type="discrete", n_de_evf=9, vary="s", Sigma=0.4, phyla=Phyla5(), randseed=0)
true_counts_res_dis <- true_counts_res
tsne_true_counts <- PlotTsne(meta=true_counts_res[[3]], data=log2(true_counts_res[[1]]+1), evf_type="discrete", n_pc=20, label='pop', saving = F, plotname="discrete populations (true counts)")
tsne_true_counts[[2]]

observed_counts <- True2ObservedCounts(true_counts=true_counts_res[[1]], meta_cell=true_counts_res[[3]], protocol="nonUMI", alpha_mean=0.1, alpha_sd=0.05, gene_len=gene_len, depth_mean=1e5, depth_sd=3e3)
tsne_nonUMI_counts <- PlotTsne(meta=observed_counts[[2]], data=log2(observed_counts[[1]]+1), evf_type="discrete", n_pc=20, label='pop', saving = F, plotname="observed counts nonUMI")
tsne_nonUMI_counts[[2]]
observed_counts <- True2ObservedCounts(true_counts=true_counts_res[[1]], meta_cell=true_counts_res[[3]], protocol="UMI", alpha_mean=0.05, alpha_sd=0.02, gene_len=gene_len, depth_mean=5e4, depth_sd=3e3)
tsne_UMI_counts <- PlotTsne(meta=observed_counts[[2]], data=log2(observed_counts[[1]]+1), evf_type="discrete", n_pc=20, label='pop', saving = F, plotname="observed counts UMI")
tsne_UMI_counts[[2]]

batch effect信息添加

observed_counts_2batches <- DivideBatches(observed_counts_res = observed_counts, nbatch = 2, batch_effect_size = 1)
tsne_batches <- PlotTsne(meta=observed_counts_2batches[[2]], data=log2(observed_counts_2batches[[1]]+1), evf_type="discrete", n_pc=20, label='batch', saving = F, plotname="observed counts in batches")
tsne_batches[[2]]

Simulate continuous populations连续类群(拟时分析相关)

true_counts_res <- SimulateTrueCounts(ncells_total=500, ngenes=ngenes, nevf=20, evf_type="continuous", n_de_evf=12, vary="s", Sigma=0.4, phyla=Phyla5(), randseed=1)
tsne_true_counts <- PlotTsne(meta=true_counts_res[[3]], data=log2(true_counts_res[[1]]+1), evf_type="continuous", n_pc=20, label='pop', saving = F, plotname="continuous populations (true counts)")
tsne_true_counts[[2]]

金标准数据获取

提取准确的差异基因等信息可以用于评估相关软件的性能。

DEinfo <- getDEgenes(true_counts_res = true_counts_res_dis, popA = 1, popB = 3)
summary(DEinfo)
TrajInfo <- getTrajectoryGenes(true_counts_res$cell_meta)
head(TrajInfo)