单细胞知识点细节

单细胞分析过程中遇到的知识点细节

Posted by CHY on January 10, 2021

本文主要记录一些为提升单细胞分析能力的知识点,应用于单细胞的一些高级分析。

Pseudocell概念

为了从高通量单细胞mRNA数据中增加基因数量和基因表达相关性,从同一细胞群中的多个细胞中收集数据,制作假细胞(Pseudocell)用于网络解释。Pseudocell概念是为了弥补稀疏矩阵在计算相关性上的缺陷,毕竟零值太多,影响相关性的计算。
Pseudocell概念常是将单细胞数据应用于一些bulk seq工具时,如WGCNA或GSEA。

load("/home/jingjingw/Jingjingw/Project/2018-MH-new/Pseudocell/FetalStomach1_500more.RData")
name<-"FetalStomach1"
outfile1<-"Human_FetalStomach1_pseudocell20.Rds"
outfile2<-"Human_FetalStomach1_pseudocell20.pheno.out"

Inter<-get(paste(name,"pbmc",sep = "_"))
Inter[Inter<0]=0
idd<-get(paste(name,"Anno1",sep = "_"))
Inter.id<-cbind(rownames(idd),idd$Cluster_id)

rownames(Inter.id)<-rownames(idd)
colnames(Inter.id)<-c("CellID","Celltype")
Inter.id<-as.data.frame(Inter.id)
Inter1<-Inter[,Inter.id$CellID]
Inter<-as.matrix(Inter1)
pseudocell.size = 20 ## 10 test
new_ids_list = list()
for (i in 1:length(levels(Inter.id$Celltype))) {
    cluster_id = levels(Inter.id$Celltype)[i]
    cluster_cells <- rownames(Inter.id[Inter.id$Celltype == cluster_id,])
    cluster_size <- length(cluster_cells)       
    pseudo_ids <- floor(seq_along(cluster_cells)/pseudocell.size)
    pseudo_ids <- paste0(cluster_id, "_Cell", pseudo_ids)
    names(pseudo_ids) <- sample(cluster_cells)  
    new_ids_list[[i]] <- pseudo_ids     
    }
    
new_ids <- unlist(new_ids_list)
new_ids <- as.data.frame(new_ids)
new_ids_length <- table(new_ids)

new_colnames <- rownames(new_ids)  ###add
all.data<-Inter[,as.character(new_colnames)] ###add
all.data <- t(all.data)###add

new.data<-aggregate(list(all.data[,1:length(all.data[1,])]),
    list(name=new_ids[,1]),FUN=mean)
rownames(new.data)<-new.data$name
new.data<-new.data[,-1]

new_ids_length<-as.matrix(new_ids_length)##
short<-which(new_ids_length<10)##
new_good_ids<-as.matrix(new_ids_length[-short,])##
result<-t(new.data)[,rownames(new_good_ids)]
colnames(result)<-paste("Human",colnames(result),sep="")
rownames(result)<-rownames(Inter)
#saveRDS(result,file=outdir1[i]) ###
saveRDS(result,file=outfile1) ###
cellty<-gsub("[_]Cell[0-9]|[_]Cell[0-9][0-9]|[_]Cell[0-9][0-9][0-9]|[_]Cell[0-9][0-9][0-9][0-9]|[_]Cell[0-9][0-9][0-9][0-9][0-9]","",colnames(result))
new.phe<-paste(colnames(result),'HumanFetal',cellty,sep="\t")

#write.table(new.phe,file=outdir2[i],quote=F,row.names=F) ###

write.table(new.phe,file=outfile2,quote=F,row.names=F) ###
library(scater)
summed <- aggregateAcrossCells(merged, 
    id=colData(merged)[,c("celltype.mapped", "sample")])
summed

Pseudocell源码

提取Seurat数据做GSEA

SeuratRunGSEA<-  function(seuo,group,test1,test2,gmt,outdir,testname){
        Idents(seuo) <- group
        seuo <- subset(seuo,idents=c(test1,test2))  
        seuo@assays$RNA@counts-> counts   # 是用count还是data? 
        expr_data <- as.matrix(counts)  # 数可以取 pseudocell 

# 整理GSEA需要的表达谱格式
          write.table(rbind(c('symbols',colnames(expr_data)),
                    cbind(rownames(expr_data),expr_data)),
              file='expr.txt',quote=F,sep='\t',col.names=F,row.names=F)

# 整理GSEA需要的分组格式
        pheno<-as.character(seuo@meta.data[,group])
        con<-file('pheno.cls',open='w')
        write(paste(length(pheno),'2 1'),con)
        write(paste('# ',test1,' ',test2,sep=''),con)
        classes<-''
        for (i in 1:length(pheno)){
                classes<-paste(classes,pheno[i])
        }
        write(classes,con)
        close(con)

#   GSEA 命令 
        command <- paste('java -Xmx512m -cp  gsea-3.0.jar xtools.gsea.Gsea -res expr.txt -cls pheno.cls#',test1,'_versus_',test2,' -gmx ',gmt,
                   ' -collapse false -nperm 1000 -permute gene_set -rnd_type no_balance -scoring_scheme weighted -rpt_label ',testname,
                   ' -metric Diff_of_Classes -sort real -order descending -include_only_symbols false -make_sets true -median false -num 100',
                   ' -plot_top_x 20 -rnd_seed 123456 -save_rnd_lists false -set_max 10000 -set_min 5 -zip_report false -out ', outdir, ' -gui false',sep='')

        system(command)

        com<-file('command.log',open='w')
        write(command,com)  # 保存运行的命令,以便在命令行中调参
        close(com)
}

批量下载GSEA基因集
制作自己的gene set文件给gsea软件
ssGSEA分析结果解读

10x比对bam文件拆分

samtools index *.bam
/workcenters/workcenter1/chenhy/software/subset-bam_linux --bam *.bam --cell-barcodes barcodes.txt --out-bam *.bam --cores 2
samtools merge out.bam in1.bam in2.bam in3.bam

subset-bam

Benchmarking常用指标对于的R包

aircode : 主要用于cluster之间的比较。

Drop-seq分析流程

read 1 contains a cell barcode and a molecular barcode (also known as a UMI); read 2 is aligned to the reference genome.
准备metadata,meta数据可以都放在同一目录下

  • fasta物种基因组文件
  • gtf物种基因组注释文件
  • dict Picards基于基因组文件创建的序列字典文件,http://broadinstitute.github.io/picard/command-line-overview.html#CreateSequenceDictionary
  • refFlat 基因组注释文件另一种形式,Picards专用,可以使用ConvertToRefFlat命令转换
  • genes.intervals 基因间隔,optional ;http://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/samtools/util/IntervalList.html
  • exons.intervals 外显子间隔,optional
  • rRNA.intervals 核糖体RNA,optional
  • reduced.gtf GTF相关信息,ReduceGTF命令可以操作
# CreateSequenceDictionary
java -jar /workcenters/workcenter1/chenhy/software/Drop-seq-2.4.0/lib/picard-2.20.5.jar CreateSequenceDictionary  REFERENCE=TAIR10.fasta  OUTPUT=TAIR10.dict  SPECIES=Arabidopsis &

# ConvertToRefFlat
ConvertToRefFlat
ANNOTATIONS_FILE=my.gtf
SEQUENCE_DICTIONARY=my.dict
OUTPUT=my.refFlat

比对前处理1:convert paired FastQ files to unaligned BAM, 主要是为了添加UMI和barcode标签

nohup java -Xmx4g -Djava.io.tmpdir=./tmp -jar /workcenters/workcenter1/chenhy/software/Drop-seq_tools-2.4.0/3rdParty/picard/picard.jar FastqToSam FASTQ=../SRR8206656_S1_L001_R1_001.fastq.gz FASTQ2=../SRR8206656_S1_L001_R2_001.fastq.gz QUALITY_FORMAT=Standard OUTPUT=SRR8206656_unaligned_data.bam SAMPLE_NAME=SRR8206656_ara SORT_ORDER=queryname TMP_DIR=./tmp &

比对前处理2:创建新的标记bam

java ­Xmx4g ­jar /path/to/dropseq/TagBamWithReadSequenceExtended.jar 
INPUT=my_unaligned_data.bam 
OUTPUT=my_unaligned_data_tagged_Cell.bam 
SUMMARY=unaligned_taggedMolecular.bam_summary.txt 
BASE_RANGE=13­20   
BASE_QUALITY=10   
BARCODED_READ=1   
DISCARD_READ=False   
TAG_NAME=XM   
NUM_BASES_BELOW_QUALITY=1 

Cell Barcode处理

java ­Xmx4g ­jar /path/to/dropseq/TagBamWithReadSequenceExtended.jar 
INPUT=my_unaligned_data_taggedCell.bam 
OUTPUT=my_unaligned_data_tagged_CellMolecular.bam 
SUMMARY=unaligned_taggedCell.bam_summary.txt 
BASE_RANGE=1­12   
BASE_QUALITY=10   
BARCODED_READ=1   
DISCARD_READ=True   
TAG_NAME=XC   
NUM_BASES_BELOW_QUALITY=1

比对前处理3:TrimStartingSequence

java ­Xmx4g ­jar /path/to/dropseq/TrimStartingSequence.jar 
INPUT=my_unaligned_data_tagged_CellMolecular.bam 
OUTPUT=my_unaligned_data_tagged_CellMolecular_trimmedSmart.bam 
OUTPUT_SUMMARY=adapter_trimming_report.txt 
SEQUENCE=AAGCAGTGGTATCAACGCAGAGTGAATGGG 
MISMATCHES=0 
NUM_BASES=5 

PloyA裁剪

java ­Xmx4g ­jar /path/to/dropseq/PolyATrimmer.jar 
INPUT=my_unaligned_data_tagged_CellMolecular_trimmedSmart.bam 
OUTPUT=my_unaligned_data_tagged_trimmed_filtered.bam 
OUTPUT_SUMMARY=polyA_trimming_report.txt 
MISMATCHES=0 
NUM_BASES=6

比对前处理4:SamToFastq

java ­Xmx4g ­jar /path/to/picard/picard.jar SamToFastq 
INPUT=my_unaligned_data_tagged_CellMolecular_trimmedSmart_polyAFiltered.bam 
FASTQ=my_prepared_reference.fastq

比对::STAR, STAR比对前,需要对参考基因组构建index

/path/to/STAR/STAR   
­­genomeDir /path/to/STAR_REFERENCE   
­­readFilesIn my_prepared_reference.fastq   
­­outFileNamePrefix star  

比对2:SortSam, 调用picard中SortSam命令

java ­Xmx4g ­jar /path/to/picard/picard.jar SortSam 
I=starAligned.out.sam 
O=starAligned.out.bam 
SO=queryname

比对3:MergeBamAlignment, 将标签与序列比对文件合并

java ­Xmx4g ­jar /path/to/picard/picard.jar MergeBamAlignment 
REFERENCE_SEQUENCE=my_fasta.fasta 
UNMAPPED_BAM=my_unaligned_data_tagged_trimmed_filtered.bam
ALIGNED_BAM=starAligned.out.bam
OUTPUT=out.bam 
INCLUDE_SECONDARY_ALIGNMENTS=false 
PAIRED_RUN=false VALIDATION_STRINGENCY=SILENT

比对4:TagReadWithGeneExon, 将比对的注释信息添加

java ­Xmx4g ­jar /path/to/dropseq/TagReadWithGeneExon.jar
I=/dev/stdin 
O=out_gene_exon_tagged.bam
ANNOTATIONS_FILE=${refFlat} 
TAG=GE

流程化解决分析, 通过构建流程,一次性运行命令

java ­Xmx4g ­jar /path/to/dropseq/TagBamWithReadSequenceExtended.jar
INPUT=/my_unaligned_data.bam
OUTPUT=/dev/stdout
COMPRESSION_LEVEL=0
SUMMARY=​ unaligned_taggedMolecular.bam_summary.txt  
BASE_RANGE=13­20 BASE_QUALITY=10 BARCODED_READ=1 DISCARD_READ=False
TAG_NAME=XM NUM_BASES_BELOW_QUALITY=1 |
java ­Xmx4g ­jar /path/to/dropseq/TagBamWithReadSequenceExtended.jar
INPUT=/dev/stdin
OUTPUT=/dev/stdout
COMPRESSION_LEVEL=0
SUMMARY=​ unaligned_taggedCell.bam_summary.txt 
BASE_RANGE=1­12 BASE_QUALITY=10 BARCODED_READ=1 DISCARD_READ=True
TAG_NAME=XC NUM_BASES_BELOW_QUALITY=1

Drop-seq_alignment.sh脚本运行文件, 通过编写的sh脚本统一运行文件

# 201
nohup /workcenters/workcenter1/chenhy/software/Drop-seq-2.4.0/src/scripts/Drop-seq_alignment.sh \
-g /workcenters/workcenter3/chenhy/PlantSingleCellParadise/ara/ara_genome_dropseq/Drop-ref/STAR \
-r /workcenters/workcenter3/chenhy/PlantSingleCellParadise/ara/ara_genome_dropseq/Drop-ref/TAIR10_ara.fasta.gz \
-d /workcenters/workcenter1/chenhy/software/Drop-seq_tools-2.4.0 \
-o /workcenters/workcenter3/chenhy/PlantSingleCellParadise/ara/ara_seqrawdata/GSE122687/Bam_file/SRR8206655 \
/workcenters/workcenter3/chenhy/PlantSingleCellParadise/ara/ara_seqrawdata/GSE122687/Bam_file/SRR8206655_unaligned_data.bam &
# 202
nohup /data1/chy/software/Drop-seq-2.4.0/src/scripts/Drop-seq_alignment.sh \
-g /data2/chy/singlecell/ara_root1_5/GSE122687/seq_data/Drop-ref/STAR \
-r /data2/chy/singlecell/ara_root1_5/GSE122687/seq_data/Drop-ref/TAIR10_ara.fasta.gz \
-d /data1/chy/software/Drop-seq_tools-2.4.0 \
-o /data2/chy/singlecell/ara_root1_5/GSE122687/seq_data/SRR8206654 \
/data2/chy/singlecell/ara_root1_5/GSE122687/seq_data/SRR8206654_unaligned_data.bam &

基因表达矩阵获取, 输入文件为比对步骤得到的aligned BAM文件

java ­Xmx4g ­jar /path/to/dropseq/DigitalExpression.jar
I=out_gene_exon_tagged.bam
O=out_gene_exon_tagged.dge.txt.gz
SUMMARY=out_gene_exon_tagged.dge.summary.txt
NUM_CORE_BARCODES=100

CEL-seq2原始数据比对分析

构建yaml文件

# 构建bowtie2_index
nohup bowtie2-build B73_RefGen_v3.fa B73_V3_index &
# 构建STAR_index
nohup STAR --runThreadN 10 --runMode genomeGenerate \
--genomeDir ./STAR \
--genomeFastaFiles ./Zea_mays.B73_RefGen_v4.dna.toplevel.fa \
--sjdbGTFfile ./Zea_mays.B73_RefGen_v4.49.chr.gff3 &
## CEL-seq2 Tech Setting ##
BC_INDEX_FPATH: '/workcenters/workcenter3/chenhy/PlantSingleCellParadise/ara/maize_genome/CELseq2/barcodes_cel-seq_umis96.tab'
BC_IDs_DEFAULT: '1-96'
UMI_LENGTH: 6
BC_LENGTH: 6

## Tools ##
BOWTIE2_INDEX_PREFIX: '/workcenters/workcenter3/chenhy/PlantSingleCellParadise/ara/maize_genome/bowtie2_index/B73_V3_index'
BOWTIE2: '/workcenters/workcenter1/chenhy/software/bowtie2-2.4.1-linux-x86_64/bowtie2'

## Annotations ##
GFF: '/workcenters/workcenter3/chenhy/PlantSingleCellParadise/ara/maize_genome/V3/B73_RefGen_v3.fa'

## Demultiplexing ##
FASTQ_QUAL_MIN_OF_BC: 10
CUT_LENGTH: 35
## Alignment ##
ALIGNER: 'bowtie2'
## UMI Count ##
ALN_QUAL_MIN: 0

## Running Parameters ##
num_threads: 5
verbose: True

直接运行

nohup celseq2 --config-file /workcenters/workcenter3/chenhy/PlantSingleCellParadise/ara/maize_scdata/maize_SAM/SAM+P2/CELseq2/wonderful_CEL-Seq2_config.yaml \
--experiment-table /workcenters/workcenter3/chenhy/PlantSingleCellParadise/ara/maize_scdata/maize_SAM/SAM+P2/CELseq2/wonderful_experiment_table.txt \
--output-dir /workcenters/workcenter3/chenhy/PlantSingleCellParadise/ara/maize_scdata/maize_SAM/SAM+P2/CELseq2_result/SRR11943509 \
--keep-temp -j 10 &

生信相关网站

PH525x series - Biomedical Data Science
PLOB-bioinformatics

参考基因组下载

动物参考基因组:http://asia.ensembl.org/index.html
植物参考基因组:http://plants.ensembl.org/index.html
其他真菌细菌等参考基因组:http://ensemblgenomes.org/
基因组数据中MT表示线粒体基因组,PT表示叶绿体基因组。

单细胞知识链接

单细胞测序常见问题答疑!

单细胞分析小问题

什么是nFeature_RNA和nCount_RNA? nFeature_RNA is the number of genes detected in each cell. nCount_RNA is the total number of molecules detected within a cell.
nFeature_RNA & nCount_RNA

10X三个文件代表什么? read type有3种,I1为sample index read, R1为barcode和UMI, R2才是测序read.

10X测序文件的编号如何理解? Sample_S1_L00X_R1_001.fastq.gz. The files names indicate that they were all from the same sample called pbmc_1k_v3 and the library was run on two lanes, Lane 1: L001 and lane 2: L002.
10X官方解释
10X官方命名规则

如何通过Seurat对象获取细胞对应的坐标?

UMAP_zuobiao <- data_obj@reductions$umap@cell.embeddings
TSNE_zuobiao <- data_obj@reductions$tsne@cell.embeddings
write.table(UMAP_zuobiao,"./finial_result/UMAP_zuobiao.txt")
write.table(TSNE_zuobiao,"./finial_result/TSNE_zuobiao.txt")

如何通过Seurat对象获取原始表达数据?

DefaultAssay(data_obj) <- 'RNA'
countdata <- GetAssayData(data_obj, slot = "counts")
write.table(countdata,"./finial_result/raw_count_combine.txt")

单细胞数据整合后如何查看整合效果?

DimPlot(data_obj,split.by = 'orig.ident')

Drop-seq分析中运行picard遇到一个错误: “java.io.IOException: No space left on device” https://www.biostars.org/p/42613/
修改前
java -Xmx2g -jar ${picard_path} SortSam I=${bam_info}.bam O=${bam_info}.s.bam SO=coordinate
java -Xmx2g -jar ${picard_path} MarkDuplicates I=${bam_info}.s.bam O=${bam_info}.sm.bam M=${bam_info}.markdup_metrics.txt
修改后
java -Xmx2g -Djava.io.tmpdir=./tmp -jar ${picard_path} SortSam I=${bam_info}.bam O=${bam_info}.s.bam SO=coordinate TMP_DIR=./tmp
java -Xmx2g -Djava.io.tmpdir=./tmp -jar ${picard_path} MarkDuplicates I=${bam_info}.s.bam O=${bam_info}.sm.bam M=${bam_info}.markdup_metrics.txt TMP_DIR=./tmp

提取特定样本细胞 Seurat软件

control <- data_obj@meta.data$orig.ident[data_obj@meta.data$orig.ident=="Root_single_cell_shr_datamatrix.csv"]
length(control)
cell_name <- colnames(data_obj)
control <- cell_name[1:length(control)]
WT_obj <- subset(data_obj, cells = control)

wig等相关格式转换工具 http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/

gbff文件转为gff文件

from BCBio import GFF
from Bio import SeqIO
in_file = "your_file.gb"
out_file = "your_file.gff"
in_handle = open(in_file)
out_handle = open(out_file, "w")
GFF.write(SeqIO.parse(in_handle, "genbank"), out_handle)
in_handle.close()
out_handle.close()

FindConservedMarkers vs FindMarkers vs FindAllMarkers Seurat FindMarkers will find markers between two different identity groups - you have to specify both identity groups. This is useful for comparing the differences between two specific groups.
FindAllMarkers will find markers differentially expressed in each identity group by comparing it to all of the others - you don’t have to manually define anything. Note that markers may bleed over between closely-related groups - they are not forced to be specific to only one group. This is what most people use (and likely what you want).
FindConservedMarkers will find markers that are conserved between two groups - this can be useful if you want to find markers that are conserved between a treated and untreated condition for a specific cell type or group of cells. It means they are differentially expressed compared to other groups, but have similar expression between the two groups you’re actually comparing.

拟时分析中每个PC轴生物学解释 即如何量化坐标轴主成分的方向与哪些因素相关呢?
如果将各个细胞X轴的坐标与细胞中的各个基因或某些基因集合的表达量进行相关性分析,就可以知道细胞的X轴坐标与哪些基因正相关,然后从相关基因的功能就可以推测X轴对应的生物学意义。
还可以量化这种坐标轴位置信息与相关基因表达量的关系,采用图形化的方式展示,其中X轴是各个细胞在分化轨迹图中的X轴坐标值,Y轴是对应坐标轴相关基因的平均表达量,最终进行拟合。
参考文献:Single-cell Map of Diverse Immune Phenotypes in the Breast Tumor Microenvironment.

SingleR自动注释软件收集的数据库 SingleR目前自带了7个参考数据集合,这些数据都是来自利用细胞分选的方法得到的纯细胞。都是bulkRNA或者microarray数据。

  • BlueprintEncodeData Labels
  • HumanPrimaryCellAtlasData Labels
  • DatabaseImmuneCellExpressionData Labels
  • NovershternHematopoieticData Labels
  • MonacoImmuneData Labels
  • ImmGenData Labels
  • MouseRNAseqData Labels legacy SingleR package提供包含归一化表达值和基于bulk RNA-seq, microarray和single-cell RNA-seq数据的细胞类型标签的RDA文件,这些数据来自:
  • Blueprint (Martens and Stunnenberg 2013) and Encode (The ENCODE Project Consortium 2012),
  • the Human Primary Cell Atlas (Mabbott et al. 2013),
  • the murine ImmGen (Heng et al. 2008), and
  • a collection of mouse data sets downloaded from GEO (Benayoun et al. 2019). 前三个参考数据集的bulk RNA-seq 和 microarray数据集是从预先分选的细胞群体中获得的,即这些样本的细胞标记大多是基于各自的分选/纯化策略而不是通过电子预测(in silico prediction)方法导出的。

还准备了来自bulk RNA-seq 和免疫细胞微阵列数据的另外三个参考数据集。这些数据集中的每一个也是从预先分类的细胞群体中获得的:

  • The Database for Immune Cell Expression(/eQTLs/Epigenomics) (Schmiedel et al. 2018),
  • Novershtern Hematopoietic Cell Data - GSE24759 - formerly known as * * * Differentiation Map (Novershtern et al. 2011), and
  • Monaco Immune Cell Data - GSE107011 (Monaco et al. 2019). 目前植物中类似的数据很少,往下开发比较困难。

cellranger构建gtf等参考文件时出现问题 EXITING because of FATAL PARAMETER ERROR: limitGenomeGenerateRAM=25300000000is too small for your genome
SOLUTION: please specify –limitGenomeGenerateRAM not less than 271935602954 and make that much RAM available
修改内存及–genomeChrBinNbits参数
nohup STAR –runMode genomeGenerate –genomeDir /workcenters/workcenter3/chenhy/BY2_singlecell/Tobacco_genome_K326_2014/K326_2014_strand/star –runThreadN 2 –genomeFastaFiles /workcenters/workcenter3/chenhy/BY2_singlecell/Tobacco_genome_K326_2014/Ntab-K326_AWOJ-SS.fa –sjdbGTFfile /workcenters/workcenter3/chenhy/BY2_singlecell/Tobacco_genome_K326_2014/Ntab-K326_AWOJ-SS_K326_rnaseq_strand.gtf –limitGenomeGenerateRAM 25300000000 –genomeSAsparseD 1 –genomeSAindexNbases 12 –genomeChrBinNbits 18 &