Learning Python: read_gtf from the pyranges module always fails when reading my GTF file

源码 · 2024-9-8 06:18:46

The pyranges documentation:
https://biocore-ntnu.github.io/pyranges/loadingcreating-pyranges.html
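My GTF file looks like this: in the last column, the ID and the string after it are joined by an equals sign (for example ID=STRG.1), whereas the usual GTF convention is a space. The function pyranges defines to split that string therefore uses whitespace as the separator. So here we modify the reading code slightly and add the equals sign as an extra separator.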
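To make the separator change concrete, here is a minimal standard-library sketch of splitting one attribute string on either a space or an equals sign. The helper name `split_attributes` and the space-separated sample string are illustrative assumptions, not part of pyranges or of my file.

```python
import re

def split_attributes(attr):
    """Split one attribute string into a dict (hypothetical helper).

    Accepts both 'key value' and 'key=value' pairs separated by ';'.
    """
    pairs = {}
    for kv in re.split(r";\s*", attr.strip()):
        if not kv:  # skip the empty piece left by a trailing semicolon
            continue
        # Split only once, on the first space or equals sign.
        parts = re.split(r"[ =]", kv.replace('"', ""), maxsplit=1)
        pairs[parts[0]] = parts[-1]
    return pairs

# Assumed examples of the two styles.
gtf_style = 'gene_id "STRG.1"; transcript_id "STRG.1.1";'
gff3_style = "ID=STRG.1.1;Parent=STRG.1"

print(split_attributes(gtf_style))
print(split_attributes(gff3_style))
```

Splitting only once per pair keeps any later spaces or equals signs inside the value intact.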
First, define the function that splits the last column:
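import re
import pandas as pd

def to_rows(anno):
    rowdicts = []
    # Check on the first row that the attribute column holds strings.
    try:
        for first in anno.head(1):
            first.replace('"', '').replace(";", "").split()
    except AttributeError:
        raise Exception(
            "Invalid attribute string: {l}. If the file is in GFF3 format, "
            "use pr.read_gff3 instead.".format(l=first))

    for l in anno:
        # Split into key/value pairs on ";" (with or without a following space),
        # skip empty pieces, then split each pair once on a space or an equals sign.
        rowdicts.append({kk[0]: kk[-1]
                         for kk in [re.split(' |=', kv.replace('""', '"NA"').replace('"', ''), maxsplit=1)
                                    for kv in re.split('; |;', l) if kv]})

    return pd.DataFrame.from_dict(rowdicts).set_index(anno.index)

The function that reads the GTF file: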
def read_gtf_full(f, as_df=False, nrows=None, skiprows=0):
    dtypes = {
        "Chromosome": "category",
        "Feature": "category",
        "Strand": "category"
    }
    names = "Chromosome Source Feature Start End Score Strand Frame Attribute".split()

    # Read the file in chunks; lines starting with "#" are treated as comments.
    df_iter = pd.read_csv(
        f,
        sep="\t",
        header=None,
        names=names,
        dtype=dtypes,
        chunksize=int(1e5),
        skiprows=skiprows,
        nrows=nrows,
        comment="#")

    _to_rows = to_rows

    dfs = []
    for df in df_iter:
        # Expand the Attribute column into one column per key.
        extra = _to_rows(df.Attribute)
        df = df.drop("Attribute", axis=1)
        ndf = pd.concat([df, extra], axis=1, sort=False)
        dfs.append(ndf)

    df = pd.concat(dfs, sort=False)
    # Shift the 1-based start coordinate to a 0-based start.
    df.loc[:, "Start"] = df.Start - 1

    if not as_df:
        return PyRanges(df)
    else:
        return df

Read the GTF file:
import pyranges as pr
from pyranges import PyRanges

read_gtf_full("example02.gtf")

Contents of example02.gtf:
##gff-version 3
# gffread v0.12.7
# gffread -E --keep-genes /mnt/shared/scratch/wguo/barkeRTD/stringtie/B1/Stringtie_B1.gtf -o 00.newgtf/B1/Stringtie_B1_new.gtf
chr1H_part_1 StringTie gene 72141 73256 . + . ID=STRG.1
chr1H_part_1 StringTie transcript 72141 73256 1000 + . ID=STRG.1.1;Parent=STRG.1
chr1H_part_1 StringTie exon 72141 72399 1000 + . Parent=STRG.1.1
chr1H_part_1 StringTie exon 72822 73256 1000 + . Parent=STRG.1.1
chr1H_part_1 StringTie gene 102332 103882 . + . ID=STRG.2
chr1H_part_1 StringTie transcript 102332 103882 1000 + . ID=STRG.2.1;Parent=STRG.2
chr1H_part_1 StringTie exon 102332 103882 1000 + . Parent=STRG.2.1
chr1H_part_1 StringTie transcript 102332 103750 1000 + . ID=STRG.2.2;Parent=STRG.2
chr1H_part_1 StringTie exon 102332 103533 1000 + . Parent=STRG.2.2
chr1H_part_1 StringTie exon 103640 103750 1000 + . Parent=STRG.2.2
chr1H_part_1 StringTie gene 104391 108013 . - . ID=STRG.3
chr1H_part_1 StringTie transcript 104391 108013 1000 - . ID=STRG.3.4;Parent=STRG.3

You are welcome to follow my WeChat official account:
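小明的数据分析笔记本 (Xiao Ming's Data Analysis Notebook)

The account mainly shares: 1. simple examples of data analysis and data visualization in R and Python; 2. reading notes on transcriptomics, genomics, and population genetics literature related to horticultural plants; 3. introductory bioinformatics learning materials and my own study notes.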
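As a quick sanity check on the approach in this post, the sketch below parses a single annotation line with the standard library only, so it runs without pandas or pyranges. The helper `parse_line` is hypothetical, and the sample line assumes that the fields are tab-separated and that the transcript's attribute column reads ID=STRG.1.1;Parent=STRG.1.

```python
import re

COLUMNS = "Chromosome Source Feature Start End Score Strand Frame Attribute".split()

def parse_line(line):
    """Parse one tab-separated annotation line into a dict (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(COLUMNS, fields))
    attribute = record.pop("Attribute")

    # GTF/GFF starts are 1-based; subtracting 1 mirrors the
    # `df.Start - 1` step in read_gtf_full.
    record["Start"] = int(record["Start"]) - 1
    record["End"] = int(record["End"])

    # Expand the attribute column: pairs separated by ';',
    # key and value separated by a space or an equals sign.
    for kv in re.split(r";\s*", attribute):
        if kv:
            parts = re.split(r"[ =]", kv.replace('"', ""), maxsplit=1)
            record[parts[0]] = parts[-1]
    return record

# Assumed sample line (tab-separated).
sample = "chr1H_part_1\tStringTie\ttranscript\t72141\t73256\t1000\t+\t.\tID=STRG.1.1;Parent=STRG.1"
record = parse_line(sample)
print(record)
```

If the ID and Parent keys come out as separate entries here, the same splitting rule should yield separate ID and Parent columns in the data frame built by read_gtf_full.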
