PyRanges documentation
https://biocore-ntnu.github.io/pyranges/loadingcreating-pyranges.html
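In my own GTF file, each ID is joined to the string that follows it by an equals sign, whereas a space is the usual separator. Because of that, the function pyranges defines for splitting the attribute string splits on spaces. So here we tweak the reading code slightly by adding the equals sign as a second separator.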
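As a minimal sketch of the idea, the snippet below splits each key/value pair once on either a space or an equals sign, so both attribute styles parse the same way. The helper name `split_attributes` and the two sample strings are made up for illustration; they are not part of pyranges.

```python
import re


def split_attributes(attr):
    """Parse one attribute string into a dict.

    Hypothetical helper for illustration only; not part of pyranges.
    """
    pairs = {}
    # Drop a trailing ";" so it does not produce an empty record.
    for kv in re.split(r";\s*", attr.strip().rstrip(";")):
        # Split once on the first space or "=" between key and value.
        key, value = re.split(r"[ =]", kv.replace('"', ""), maxsplit=1)
        pairs[key] = value
    return pairs


# GTF style: key and value separated by a space (illustrative string)
print(split_attributes('gene_id "STRG.1"; transcript_id "STRG.1.1";'))
# "=" style: key and value joined by an equals sign (illustrative string)
print(split_attributes("ID=STRG.1.1;Parent=STRG.1"))
```

Splitting only once per pair (`maxsplit=1`) matters because a value may itself contain a space or an equals sign.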
First, define the function that splits the last column:
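def to_rows(anno):
    rowdicts = []
    try:
        # Sanity check: the first attribute value must behave like a string
        l = anno.head(1)
        for l in l:
            l.replace('"', '').replace(";", "").split()
    except AttributeError:
        raise Exception("Invalid attribute string: {l}. If the file is in GFF3 format, use pr.read_gff3 instead.".format(l=l))

    for l in anno:
        # Split the records on ";" (with or without a following space),
        # then split each key/value pair once on a space or "=".
        # rstrip('; ') drops a trailing ";" so it does not yield an empty key.
        rowdicts.append({kk[0]: kk[-1]
                         for kk in [re.split(' |=', kv.replace('""', '"NA"').replace('"', ''), maxsplit=1)
                                    for kv in re.split('; |;', l.rstrip('; '))]})

    return pd.DataFrame.from_dict(rowdicts).set_index(anno.index)

The function that reads the GTF file: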
def read_gtf_full(f, as_df=False, nrows=None, skiprows=0):
    dtypes = {
        "Chromosome": "category",
        "Feature": "category",
        "Strand": "category"
    }
    names = "Chromosome Source Feature Start End Score Strand Frame Attribute".split()

    # Read the file in chunks; lines starting with "#" are skipped as comments
    df_iter = pd.read_csv(
        f,
        sep="\t",
        header=None,
        names=names,
        dtype=dtypes,
        chunksize=int(1e5),
        skiprows=skiprows,
        nrows=nrows,
        comment="#")

    _to_rows = to_rows
    dfs = []
    for df in df_iter:
        # Expand the Attribute column into one column per key
        extra = _to_rows(df.Attribute)
        df = df.drop("Attribute", axis=1)
        ndf = pd.concat([df, extra], axis=1, sort=False)
        dfs.append(ndf)

    df = pd.concat(dfs, sort=False)
    # Convert the 1-based start coordinate to 0-based
    df.loc[:, "Start"] = df.Start - 1

    if not as_df:
        return PyRanges(df)
    else:
        return df

Reading the GTF file:
import re
import pandas as pd
import pyranges as pr
from pyranges import PyRanges

read_gtf_full("example02.gtf")

Contents of example02.gtf:
##gff-version 3
# gffread v0.12.7
# gffread -E --keep-genes /mnt/shared/scratch/wguo/barkeRTD/stringtie/B1/Stringtie_B1.gtf -o 00.newgtf/B1/Stringtie_B1_new.gtf
chr1H_part_1	StringTie	gene	72141	73256	.	+	.	ID=STRG.1
chr1H_part_1	StringTie	transcript	72141	73256	1000	+	.	ID=STRG.1.1;Parent=STRG.1
chr1H_part_1	StringTie	exon	72141	72399	1000	+	.	Parent=STRG.1.1
chr1H_part_1	StringTie	exon	72822	73256	1000	+	.	Parent=STRG.1.1
chr1H_part_1	StringTie	gene	102332	103882	.	+	.	ID=STRG.2
chr1H_part_1	StringTie	transcript	102332	103882	1000	+	.	ID=STRG.2.1;Parent=STRG.2
chr1H_part_1	StringTie	exon	102332	103882	1000	+	.	Parent=STRG.2.1
chr1H_part_1	StringTie	transcript	102332	103750	1000	+	.	ID=STRG.2.2;Parent=STRG.2
chr1H_part_1	StringTie	exon	102332	103533	1000	+	.	Parent=STRG.2.2
chr1H_part_1	StringTie	exon	103640	103750	1000	+	.	Parent=STRG.2.2
chr1H_part_1	StringTie	gene	104391	108013	.	-	.	ID=STRG.3
chr1H_part_1	StringTie	transcript	104391	108013	1000	-	.	ID=STRG.3.4;Parent=STRG.3

Feel free to follow my WeChat official account:
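小明的数据分析笔记本
The 小明的数据分析笔记本 official account mainly shares: 1. simple examples of data analysis and data visualization in R and Python; 2. reading notes on the literature in transcriptomics, genomics, and population genetics of horticultural plants; 3. introductory bioinformatics learning materials and my own study notes!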