基因组注释文件转换

几种gff/gtf文件格式转换方法

Posted by CHY on January 27, 2021

多种方法用于转换gff和gtf文件

第一种:gffread

gffread my.gff3 -T -o my.gtf gffread my.gtf -o- >my.gff3

第二种: NBISweden/AGAT(安装perl模块有点繁琐)

conda install -c bioconda agat conda install -c bioconda perl-sort-naturally cpanm Bio::Tools::GFF # 需要sudo权限 agat_convert_sp_gff2gtf.pl -gff augustus_out.gff3 -o augustus_out.gtf

第三种:R包 rtracklayer (有待验证)

BiocManager::install(“rtracklayer”) library(rtracklayer) gff_file<- import(“/workcenters/workcenter3/chenhy/BY2_singlecell/Tobacco_genome_K326_2014/Ntab-K326_AWOJ-SS_K326_rnaseq.gff3”) export(gff_file,”/workcenters/workcenter3/chenhy/BY2_singlecell/Tobacco_genome_K326_2014/Ntab-K326_AWOJ-SS_K326_rnaseq_kb.gtf”,”gtf”)

第四种:Python脚本

import sys

inFile = open(sys.argv[1], 'r')

ID_gene = ''

for line in inFile:
    # skip comment lines that start with the '#' character
    if line[0] != '#':
        # split line into columns by tab
        data = line.strip().split('\t')

        ID_gene = ID_gene
        ID_mRNA = ''

        # if the feature is a gene
        if data[2] == "gene":
            # get the id
            ID_gene = data[-1].split('ID=')[-1].split(';')[0]
            data[-1] = 'gene_id "' + ID_gene + '"; '

        elif data[2] == "mRNA":
            # get two id
            ID_mRNA = data[-1].split('ID=')[-1].split(';')[0]
            ID_gene = data[-1].split('Parent=')[-1].split(';')[0]
            data[-1] = 'gene_id "' + ID_gene + '"; transcript_id "' + ID_mRNA + '"; '

        # if the feature is anything else
        else:
            # get the parent as the ID
            ID_mRNA = data[-1].split('Parent=')[-1].split(';')[0]
            data[-1] = 'gene_id "' + ID_gene + '"; transcript_id "' + ID_mRNA + '"; '

        # print out this new GTF line
        print('\t'.join(data))