无参转录组GO、KEGG富集分析——diamond+idmapping+GOstats

⽆参转录组GO、KEGG富集分析——
diamond+idmapping+GOstats
准备
(1)⽤⽆参转录组分析软件得到unigene fasta⽂件,命名为my_unigenes.fa,格式如下表所⽰:
>MSTRG.5.1 gene=MSTRG.5 TGATGTCATCGATCCGTGACGTTTAGTATTCAACCAATAGGAATCAACCACGTAGGATTGGCGATCCTCG TCAAACGTGTAAACGGTAGATCTGAACCGTTGACTGGCTGAGAGAAAACACATATTGTGGTATTTTAGTC GGTACGATTAAGAAACACGAGAAACACACCGATGAGCGATAACAACCGTAGGATCGTCAACTTGCTTTTG ACATCGTTGCCTATAAATTAAATATCAACAGCCCTACACGTGAGATTACTTTCATTATCTTCCCTTTTCG AGATCGGAGCACTAATTTGAATTCAGAGAAGCGAAAGCGACGCAGAGAAGATGGCTCGTACCAAGCAGAC CGCTCGTAAGTCTACCGGAGGAAAGGCCCCGAGGAAGCAGCTTGCTACTAAGGTATGGACTCTCTCTCTC TCTCTCTCTCTCTCTCTCTCTCTC
>MSTRG.6.1 gene=MSTRG.6 GTAGAAATATAATGGGCTTAAAGATAAGGCCCATTAATTACAGAATCCTACGGGCACGTTACGTGCCGGT
TTAGTTTATTTTGCATTAAAAACCAATTTCGGAGTCGTCATCTTCCTCTTCGCCCAGATTCACTTCCTTC GAGAAGATTCGCATCATTTTCCCAGTTATGGATTGGCAAGGACAGAAACTAGCGGAGCAGCTGATGCAGA TCCTGCTCTTGATCGCCGCCGTTGTGGCGTTCGTCGTTGGTTACACGACGGCGTCGTTTAGGACGATGAT GTTGATTTACGCGGGAGGGGTTGGTGTCACGACGTTGATCACGGTGCCGAACTGGCCATTCTTTAACCGT CATCCACTCAAGTGGTTGGAACCAAGTGAAGCGGAGAAGCATCCTAAACCGGAAGTCGTTGTTAGCTCGA AGAAGAAGTCATCTAAAAAGTAGCAAATCGTCTTTTGTAATCTCTCTTTTTTTTCTAGACCTTTGATATA AAAAAAAAAGACATGTTCTGGATTTTGCTTATGAATAAGATAGTCTAAATGACATAATAATTTCGATTGA TTCTGAGACATCCTTGCTTAATTGTTATGTA
>BnaA01g00010D.1 gene=BnaA01g00010D TTTAGTTTATTTTGCATTAAAAACCAATTTCGGAGTCGTCATCTTCCTCTTCGCCCAGATTCACTTCCTT CGAGAAGATTCGCATCATTTTCCCAGGTATAGATTCTCTGACGAGATCTGATTCTAGTTTGTTGCTTATT GTTCTTGTAGATTCGAATCCGGCGATTATCAATTGCATTTCGTGCTGGATTCAATTCGAAAGATCCGATC TAATCGTTTTGGTTGGTGTTGATTCAGTTATGGATTGGCAAGGACAGAAACTAGCGGAGCAGCTGATGCA GATCCTGCTCTTGATCGCCGCCGTTGTGGCGTTCGTCGTTGGTTACACGACGGCGTCGTTTAGGACGATG ATGTTGATTTACGCGGGAGGGGTTGGTGTCACGACGTTGATCACGGTGCCGAACTGGCCATTCTTTAACC GTCATCCACTCAAGTGGTTGGAACCAAGTGAAGCGGAGAAGCATCCTAAACCGGAAGTCGTTGTTAGCTC GAAGAAGAAGTCATCTAAAAAGTAG
王文颖(2)下载并安装DIAMOND⼯具,可提升BLAST⽐对速度。
wget github/bbuchfink/diamond/releases/download/v0.9.18/
tar xvzf
1.将unigenes⽐对到swissprot数据库(NR数据库同)
(1)获取swissprot数据库:
wget ftp://v/blast/db/
gzip -
陈学祥
(2)建库:
diamond makedb --in swissprot -d swissprot
(3)blastx⽐对:
Evalue设定为1e-5,每query输出1条对应hit。阈值设定规则见参考2。
diamond blastx -d swissprot -q my_unigenes.fa -k 1 -e 0.00001 -o swiss_dia_matches.m8
得到的swiss_dia_matches.m8⽂件格式如下表所⽰,第⼆列为swissprot accession number:
MSTRG.5.1    Q5MYA4.3    96.2    26    1    0    331    408    1    26    7.6e-06    51.6
MSTRG.6.1    Q944J0.1    83.7    92    14    1    168    440    1    92    4.2e-36    152.5
BnaA01g00010D.1    Q944J0.1    83.7    92    14    1    240    512    1    92    1.4e-35    150.6
2.GO注释
(1)获取⽂件和Uniprot2GO_annotated.py⽂件(py⽂件下载见参考3):
wget ftp://town.edu/databases/idmapping/
idmapping⽂件另⼀下载路径:
wget /pub/databases/uniprot/knowledgebase/idmapping/idmapping_
⽂件格式如下表所⽰:靳树增
Q6GZX4 001R_FRG3G 2947773 YP_031579.1 81941549; 49237298  PF04947 GO:0006355; GO:0046782; GO:0006351  UniRef100_Q6GZX4 UniRef9
Q6GZX3 002L_FRG3G 2947774 YP_031580.1 49237299; 81941548; 47060117  PF03003 GO:0033644; GO:0016021  UniRef100_Q6GZX3 UniRef90_Q6GZX3 Q197F8 002R_IIV3 4156251 YP_654574.1 109287880; 123808694; 106073503      UniRef100_Q197F8 UniRef90_Q197F8 UniRef50_Q197F8 UPI0000D83464
⽂件以tab键分隔,第1列为swissport accession number(Uniportkb ID),第4列为NR ID,第8列为GO注释。
(2)更改swiss_dia_matches.m8⽂件,将其第⼆列以“.”分开,只取“.”前⾯的字符串。更改后如下表所⽰:
MSTRG.5.1    Q5MYA4    96.2    26    1    0    331    408    1    26    7.6e-06    51.6
汪天云
MSTRG.6.1    Q944J0    83.7    92    14    1    168    440    1    92    4.2e-36    152.5
BnaA01g00010D.1    Q944J0    83.7    92    14    1    240    512    1    92    1.4e-35    150.6
(3)修改下载的Uniprot2GO_annotated.py⽂件:
由⽂件格式可知:
为swissprot⽐对结果做GO注释时,第16⾏应改为:
UniProt_GO[lsplit[0]] = lsplit[7]
为NR⽐对结果作GO注释时,第16⾏应改为:
UniProt_GO[lsplit[3]] = lsplit[7]
此外,若在windows CMD中运⾏,第14⾏需加⼊decode()函数,linux中则不需要:
lsplit = line.decode().rstrip().split("\t")
(4)运⾏py⽂件进⾏GO注释:
python UniProt2GO_annotate.py swiss_dia_matches.m8 swiss_go.out
swiss_go.out⽂件格式如下表所⽰:
BnaA08g27970D  GO:0045271,GO:0016021,GO:0005739,GO:0031966,GO:0009853,GO:0005747,GO:0055114
BnaAnng26910D  GO:0031087,GO:0010606,GO:0000932,GO:0017148,GO:0042803,GO:0006397
BnaC07g50970D  GO:0042938,GO:0042939,GO:0016021,GO:0042936,GO:0042937,GO:0022857,GO:0006807,GO:0005886,GO:0015031,GO:0009506(5)⽤R包db为GO注释增添EVIDENCE注释:
⽤python将swiss_go.out⽂件的GO条⽬按“,”拆开,存⼊⽂件swiss_go_forStats:
def before_GOstat(f1,f2):
for i adlines():
j = i.split('\t')
for k in j[1].split(','):
m = j[0] + '\t' + k
if(m[-1] != '\n'):
m = m + '\n'
print(m)
f2.write(m)
f1 = open('swiss_go.out','r')
f2 = open('swiss_go_forStats','w')
f1.close()
f2.close()
⽂件swiss_go_forStats格式如下表所⽰:
BnaA08g27970D GO:0005747
BnaA08g27970D GO:0055114
BnaAnng26910D GO:0031087
EVIDENCE注释:
>library(db)
>#keytypes(db)
>swiss_id <- read.delim('swiss_go_forStats',header = F)
>colnames(swiss_id) <- c('gene_id','GO')
>ev_id <- select(db,keys = as.vector(swiss_id$GO),columns = c('EVIDENCE'),keytype = "GO")
>library(dplyr)
>swiss_goev <- left_join(swiss_id,ev_id[,1:2])
>write.table(swiss_goev,'swiss_goev_forStats',row.names = F,quote = F)
swiss_goev_forStats⽂件格式如下表所⽰:
gene_id    GO    EVIDENCE
BnaA08g27970D    GO:0045271    NA
BnaA08g27970D    GO:0016021    IDA
BnaA08g27970D    GO:0016021    IBA
BnaA08g27970D    GO:0016021    IEA
3.KEGG注释(以Brassica napus为例)
(1)KAAS⽹站⾃动注释my_unigenes.fa⽂件,得到基因名称对应的KO list,保存为⽂件,格式如下表所⽰:
MSTRG.6.1 K12946
BnaA01g00010D.1 K12946
BnaA01g00050D.1 K09338
bna00001.keg⽂件格式如下:
+D    GENES    KO
#<h2><a href="/kegg/kegg2.html"><img src="/Fig/bget/kegg3.gif" align="middle" border=0></a>  KEGG Orthology (KO) - Brassica napus (rape) </h2>
%<style type="text/css"><!-- #grid{ table-layout:fixed; font-family: monospace; position: relative; width: 1200px; color: black; }.col1{ position: relative; background : !
A<b>Metabolism</b>
B
B  <b>Carbohydrate metabolism</b>
C    00010 Glycolysis / Gluconeogenesis [PATH:bna00010]
D      106359065 probable hexokinase-like 2 protein    K00844 HK; hexokinase [EC:2.7.1.1]
D      106439029 hexokinase-3-like isoform X1    K00844 HK; hexokinase [EC:2.7.1.1]
D      106384641 probable hexokinase-like 2 protein    K00844 HK; hexokinase [EC:2.7.1.1]
(3)python解析bna00001.keg⽂件,得到path id(C)对应的KO号(D K*****),为⽂
件注释path id,结果保存在
all_kegg_path⽂件中。
import re
f1 = open('bna00001.keg','r')
dict1 = {}
dict2 = {}
for i adlines():
if(i[0] == 'C'):
d = re.search('\d+',i)
if(i[0] == 'D'):
h = re.search('K\d+',i)
up()] = d.group()
dict1.up(), set()).up()) f1.close()
f2 = open('','r')
f3 = open('all_kegg_path','w')经纬度地图
for j adlines():
高考2019jj = j.split('\t')[1].split('\n')[0]
for k,v in dict1.items():
#print(v)
if(jj in v):
u = j.split('\n')[0]+'\t'+ k + '\n'
print(u)
f3.write(u)
f2.close()
f3.close()
all_kegg_path⽂件格式如下表所⽰:
MSTRG.6.1 K12946 03060
BnaA01g00010D.1 K12946 03060
MSTRG.16.1 K02266 00190
4.⽤R包GOstats进⾏GO及KEGG富集分析(1)安装GOstats
>source("/biocLite.R")
>biocLite("GOstats")
(2)GO富集分析(以CC为例,BP、MF同)

本文发布于:2024-09-21 05:41:56,感谢您对本站的认可!

本文链接:https://www.17tex.com/xueshu/200712.html

版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系,我们将在24小时内删除。

标签:注释   分析   表所   格式
留言与评论(共有 0 条评论)
   
验证码:
Copyright ©2019-2024 Comsenz Inc.Powered by © 易纺专利技术学习网 豫ICP备2022007602号 豫公网安备41160202000603 站长QQ:729038198 关于我们 投诉建议