一样平常瞎掰
工欲善其事必先利其器,好用的工具对服务来说很紧张,可以让你做起事来收到事半功倍的结果。前段时间分析数据时,须要给intervals做基因注释,须要利用gtf文件,末了发现一款非常好用的python包 — pyranges。该包可以很轻松地将bed、gtf、gff3等格式文件读取为PyRanges对象,该对象有点雷同潘各人的DataFrame,熟悉潘各人的同砚应该能了解到DataFrame利用数据的快乐。同样,皮软杰斯内置利用intervals的方法也绝对能给你带来无语伦比的痛快酣畅。废话不多说,下面来感受一下使用皮软杰斯做注释基因有多方便。
读取文件
read_bed方法可以用来读取bed格式的文件,read_gtf、read_gff3分别用来读取gtf、gff3文件:
import pyranges as prbed = pr.read_bed('data.bed')bed+--------------+-----------+-----------+------------+-----------+--------------+| Chromosome | Start | End | Name | Score | Strand || (category) | (int32) | (int32) | (object) | (int64) | (category) ||--------------+-----------+-----------+------------+-----------+--------------|| chr1 | 212609534 | 212609559 | U0 | 0 | + || chr1 | 169887529 | 169887554 | U0 | 0 | + || chr1 | 216711011 | 216711036 | U0 | 0 | + || chr1 | 144227079 | 144227104 | U0 | 0 | + || ... | ... | ... | ... | ... | ... || chrY | 15224235 | 15224260 | U0 | 0 | - || chrY | 13517892 | 13517917 | U0 | 0 | - || chrY | 8010951 | 8010976 | U0 | 0 | - || chrY | 7405376 | 7405401 | U0 | 0 | - |+--------------+-----------+-----------+------------+-----------+--------------+Stranded PyRanges object has 10,000 rows and 6 columns from 24 chromosomes.For printing, the PyRanges was sorted on Chromosome and Strand.gtf = pr.read_gtf('gencode.vM24.annotation.gtf')gtf+--------------+------------+-------------+-----------+-----------+------------+--------------+------------+----------------------+----------------+-------+| Chromosome | Source | Feature | Start | End | Score | Strand | Frame | gene_id | gene_type | +15 || (category) | (object) | (object) | (int32) | (int32) | (object) | (category) | (object) | (object) | (object) | ... ||--------------+------------+-------------+-----------+-----------+------------+--------------+------------+----------------------+----------------+-------|| chr1 | HAVANA | gene | 3073252 | 3074322 | . | + | . | ENSMUSG00000102693.1 | TEC | ... || chr1 | HAVANA | transcript | 3073252 | 3074322 | . | + | . | ENSMUSG00000102693.1 | TEC | ... || chr1 | HAVANA | exon | 3073252 | 3074322 | . | + | . | ENSMUSG00000102693.1 | TEC | ... || chr1 | ENSEMBL | gene | 3102015 | 3102125 | . | + | . | ENSMUSG00000064842.1 | snRNA | ... || ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... || chrY | ENSEMBL | CDS | 90838871 | 90839177 | . | - | 0 | ENSMUSG00000096850.1 | protein_coding | ... || chrY | ENSEMBL | start_codon | 90839174 | 90839177 | . | - | 0 | ENSMUSG00000096850.1 | protein_coding | ... || chrY | ENSEMBL | stop_codon | 90838868 | 90838871 | . | - | 0 | ENSMUSG00000096850.1 | protein_coding | ... || chrY | ENSEMBL | UTR | 90838868 | 90838871 | . | - | . | ENSMUSG00000096850.1 | protein_coding | ... |+--------------+------------+-------------+-----------+-----------+------------+--------------+------------+----------------------+----------------+-------+Stranded PyRanges object has 1,870,769 rows and 25 columns from 22 chromosomes.For printing, the PyRanges was sorted on Chromosome and Strand.15 hidden columns: gene_name, level, mgi_id, havana_gene, transcript_id, transcript_type, transcript_name, transcript_support_level, tag, havana_transcript, exon_number, ... (+ 4 more.) 上面说过PyRanges对象雷同潘各人的DataFrame,以是可以用雷同的方法来过滤数据,如下面只保存gtf内里Feature属性为gene的行:
gene = gtf[gtf.Feature == 'gene']gene+--------------+------------+------------+-----------+-----------+------------+--------------+------------+----------------------+----------------------+-------+| Chromosome | Source | Feature | Start | End | Score | Strand | Frame | gene_id | gene_type | +15 || (category) | (object) | (object) | (int32) | (int32) | (object) | (category) | (object) | (object) | (object) | ... ||--------------+------------+------------+-----------+-----------+------------+--------------+------------+----------------------+----------------------+-------|| chr1 | HAVANA | gene | 3073252 | 3074322 | . | + | . | ENSMUSG00000102693.1 | TEC | ... || chr1 | ENSEMBL | gene | 3102015 | 3102125 | . | + | . | ENSMUSG00000064842.1 | snRNA | ... || chr1 | HAVANA | gene | 3252756 | 3253236 | . | + | . | ENSMUSG00000102851.1 | processed_pseudogene | ... || chr1 | HAVANA | gene | 3466586 | 3513553 | . | + | . | ENSMUSG00000089699.1 | lncRNA | ... || ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... || chrY | HAVANA | gene | 90603500 | 90605864 | . | - | . | ENSMUSG00000099619.6 | lncRNA | ... || chrY | HAVANA | gene | 90665345 | 90667625 | . | - | . | ENSMUSG00000099399.6 | lncRNA | ... || chrY | HAVANA | gene | 90752426 | 90755467 | . | - | . | ENSMUSG00000095366.2 | lncRNA | ... || chrY | ENSEMBL | gene | 90838868 | 90839177 | . | - | . | ENSMUSG00000096850.1 | protein_coding | ... |+--------------+------------+------------+-----------+-----------+------------+--------------+------------+----------------------+----------------------+-------+Stranded PyRanges object has 55,385 rows and 25 columns from 22 chromosomes.For printing, the PyRanges was sorted on Chromosome and Strand.15 hidden columns: gene_name, level, mgi_id, havana_gene, transcript_id, transcript_type, transcript_name, transcript_support_level, tag, havana_transcript, exon_number, ... (+ 4 more.)基因注释
皮软杰斯内里有nearest、k_nearest两种方法可以用。通过函数名我们就可以猜到这两个函数都可以用来查找interval1聚集里每一个区间在interval2聚集内里距离近来的区间,默认是返回有交集的区间,而k_nearest可以指定返回的区间个数。通过代码看一下区别:
bed_test = pr.PyRanges(chromosomes="chr1", strands="+", starts=[3074300], ends=[3074322])bed+--------------+-----------+-----------+--------------+| Chromosome | Start | End | Strand || (category) | (int32) | (int32) | (category) ||--------------+-----------+-----------+--------------|| chr1 | 3074300 | 3074322 | + |+--------------+-----------+-----------+--------------+bed_test.nearest(gtf,suffix="_gtf")+--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------+| Chromosome | Start | End | Strand | Source | Feature | Start_gtf | End_gtf | Score | Strand_gtf | Frame | +18 || (category) | (int32) | (int32) | (category) | (object) | (object) | (int32) | (int32) | (object) | (category) | (object) | ... ||--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------|| chr1 | 3074300 | 3074322 | + | HAVANA | gene | 3073252 | 3074322 | . | + | . | ... |+--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------+bed_test.nearest(gene,overlap=False,suffix="_gtf")+--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------+| Chromosome | Start | End | Strand | Source | Feature | Start_gtf | End_gtf | Score | Strand_gtf | Frame | +18 || (category) | (int32) | (int32) | (category) | (object) | (object) | (int32) | (int32) | (object) | (category) | (object) | ... ||--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------|| chr1 | 3074300 | 3074322 | + | ENSEMBL | gene | 3102015 | 3102125 | . | + | . | ... |+--------------+-----------+-----------+--------------+------------+------------+-------------+-----------+------------+--------------+------------+-------+bed_test.k_nearest(gene,k=5,suffix="_gtf")+--------------+-----------+-----------+--------------+------------+------------+-----------+-----------+------------+--------------+------------+-------+| Chromosome | Start | End | Strand | Source | Feature | Start_b | End_b | Score | Strand_b | Frame | +18 || (category) | (int32) | (int32) | (category) | (object) | (object) | (int32) | (int32) | (object) | (category) | (object) | ... ||--------------+-----------+-----------+--------------+------------+------------+-----------+-----------+------------+--------------+------------+-------|| chr1 | 3074300 | 3074322 | + | HAVANA | gene | 3073252 | 3074322 | . | + | . | ... || chr1 | 3074300 | 3074322 | + | ENSEMBL | gene | 3102015 | 3102125 | . | + | . | ... || chr1 | 3074300 | 3074322 | + | HAVANA | gene | 3205900 | 3671498 | . | - | . | ... || chr1 | 3074300 | 3074322 | + | HAVANA | gene | 3252756 | 3253236 | . | + | . | ... || chr1 | 3074300 | 3074322 | + | HAVANA | gene | 3365730 | 3368549 | . | - | . | ... |+--------------+-----------+-----------+--------------+------------+------------+-----------+-----------+------------+--------------+------------+-------+ 从上面的代码可知有无overlap=False参数的区别,以及nearest和k_nearest函数的区别。固然,尚有一些比力实用的参数如strandedness:"same",指定链是否雷同;how:"upstream"、 "downstream"分别指定选取近来区间的方位,nb_cpu:指定使用的cpu数。除此之外,k_nearest尚有一个参数ties可以指定有距离雷同的区间时怎样返回,可选值为"first"、"last"、"different"。
suffix参数指定当两个聚集有雷同列名时,后一个聚集列名会添加的字段,默认是'_b'。不外,似乎k_nearest函数有BUG这个参数无效,但并不影响结果可以忽略。
从上面可以看出k_nearest函数功能上略胜一筹,注释基因时用此函数相对更自由一些:
anno = bed.k_nearest(gene, ties='different')anno = anno[['Chromosome', 'Start', 'End, 'Strand', 'gene_id']]anno+--------------+-----------+-----------+--------------+----------------------+| Chromosome | Start | End | Strand | gene_id || (category) | (int32) | (int32) | (category) | (object) ||--------------+-----------+-----------+--------------+----------------------|| chr1 | 212609534 | 212609559 | + | ENSMUSG00000102307.1 || chr1 | 169887529 | 169887554 | + | ENSMUSG00000103110.1 || chr1 | 216711011 | 216711036 | + | ENSMUSG00000102307.1 || chr1 | 144227079 | 144227104 | + | ENSMUSG00000100389.1 || ... | ... | ... | ... | ... || chrY | 15224235 | 15224260 | - | ENSMUSG00000102835.5 || chrY | 13517892 | 13517917 | - | ENSMUSG00000102174.1 || chrY | 8010951 | 8010976 | - | ENSMUSG00000099861.1 || chrY | 7405376 | 7405401 | - | ENSMUSG00000118447.1 |+--------------+-----------+-----------+--------------+----------------------+anno.to_csv("data_anno.txt", sep="\t", header=True)竣事语
pyranges包的功能还是挺多的,不但包罗区间的交集、差集,还可以举行一些统计运算如Jaccard、Fisher等。固然,现在有不少现成的软件如homer、chipseeker可以做基因注释,很多时间我们可以直接使用这些软件即可,但pyranges还是值得学习收藏一下,大概做个性化数据处置惩罚的时间使用它会来得更为方便些。
往期回顾
scanpy踩坑实录
差别基因密度分布
R绘图配色总结
saddleplot | A/B compartments
双曲线火山图一键拿捏
|