Improving GENCODE reference gene annotation using a high-stringency proteogenomics workflow

Cited by: 60
Authors
Wright, James C. [1]
Mudge, Jonathan [1]
Weisser, Hendrik [1]
Barzine, Mitra P. [2]
Gonzalez, Jose M. [1]
Brazma, Alvis [2]
Choudhary, Jyoti S. [1]
Harrow, Jennifer [1]
Affiliations
[1] Wellcome Trust Sanger Institute, Wellcome Genome Campus, Cambridge CB10 1SA, England
[2] EMBL, European Bioinformatics Institute, Wellcome Genome Campus, Cambridge CB10 1SA, England
Funding
Wellcome Trust (UK); National Institutes of Health (US);
Keywords
PROTEIN-CODING GENES; FALSE DISCOVERY RATE; MS-GF PLUS; PEPTIDE IDENTIFICATION; MASS-SPECTROMETRY; TRANSCRIPTOMES; ACCURATE; DATABASE; REVEALS; EXPRESSION;
DOI
10.1038/ncomms11778
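Chinese Library Classification (CLC)
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification codes
070301 [Inorganic chemistry]; 070403 [Astrophysics]; 070507 [Natural resources and territorial spatial planning]; 090105 [Crop production systems and ecological engineering];
Abstract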
Complete annotation of the human genome is indispensable for medical research. The GENCODE consortium strives to provide this, augmenting computational and experimental evidence with manual annotation. The rapidly developing field of proteogenomics provides evidence for the translation of genes into proteins and can be used to discover and refine gene models. However, for both the proteomics and annotation groups, there is a lack of guidelines for integrating these data. Here we report a stringent workflow for the interpretation of proteogenomic data that could be used by the annotation community to interpret novel proteogenomic evidence. Based on reprocessing of three large-scale publicly available human data sets, we show that a conservative approach, using stringent filtering, is required to generate valid identifications. Evidence has been found supporting 16 novel protein-coding genes being added to GENCODE. Despite this, many peptide identifications in pseudogenes cannot be annotated due to the absence of orthogonal supporting evidence.
Pages: 11
Related papers
56 in total
[1]
Agarwala R, 2018, NUCLEIC ACIDS RES, V46, pD8, DOI [10.1093/nar/gks1189, 10.1093/nar/gkx1095, 10.1093/nar/gkq1172]
[2]
Proteogenomic Analysis of Human Chromosome 9-Encoded Genes from Human Samples and Lung Cancer Tissues [J].
Ahn, Jung-Mo ;
Kim, Min-Sik ;
Kim, Yong-In ;
Jeong, Seul-Ki ;
Lee, Hyoung-Joo ;
Lee, Sun Hee ;
Paik, Young-Ki ;
Pandey, Akhilesh ;
Cho, Je-Yoel .
JOURNAL OF PROTEOME RESEARCH, 2014, 13 (01) :137-146
[3]
HTSeq-a Python framework to work with high-throughput sequencing data [J].
Anders, Simon ;
Pyl, Paul Theodor ;
Huber, Wolfgang .
BIOINFORMATICS, 2015, 31 (02) :166-169
[4]
Non-model organisms, a species endangered by proteogenomics [J].
Armengaud, Jean ;
Trapp, Judith ;
Pible, Olivier ;
Geffard, Olivier ;
Chaumot, Arnaud ;
Hartmann, Erica M. .
JOURNAL OF PROTEOMICS, 2014, 105 :5-18
[5]
UniProt: a hub for protein information [J].
Bateman, Alex ;
Martin, Maria Jesus ;
O'Donovan, Claire ;
Magrane, Michele ;
Apweiler, Rolf ;
Alpi, Emanuele ;
Antunes, Ricardo ;
Arganiska, Joanna ;
Bely, Benoit ;
Bingley, Mark ;
Bonilla, Carlos ;
Britto, Ramona ;
Bursteinas, Borisas ;
Chavali, Gayatri ;
Cibrian-Uhalte, Elena ;
Da Silva, Alan ;
De Giorgi, Maurizio ;
Dogan, Tunca ;
Fazzini, Francesco ;
Gane, Paul ;
Castro, Leyla Garcia ;
Garmiri, Penelope ;
Hatton-Ellis, Emma ;
Hieta, Reija ;
Huntley, Rachael ;
Legge, Duncan ;
Liu, Wudong ;
Luo, Jie ;
MacDougall, Alistair ;
Mutowo, Prudence ;
Nightingale, Andrew ;
Orchard, Sandra ;
Pichler, Klemens ;
Poggioli, Diego ;
Pundir, Sangya ;
Pureza, Luis ;
Qi, Guoying ;
Rosanoff, Steven ;
Saidi, Rabie ;
Sawford, Tony ;
Shypitsyna, Aleksandra ;
Turner, Edward ;
Volynkin, Vladimir ;
Wardell, Tony ;
Watkins, Xavier ;
Zellner, Hermann ;
Cowley, Andrew ;
Figueira, Luis ;
Li, Weizhong ;
McWilliam, Hamish .
NUCLEIC ACIDS RESEARCH, 2015, 43 (D1) :D204-D212
[6]
Comprehensive proteomics [J].
Beck, Martin ;
Claassen, Manfred ;
Aebersold, Ruedi .
CURRENT OPINION IN BIOTECHNOLOGY, 2011, 22 (01) :3-8
[7]
Addressing Statistical Biases in Nucleotide-Derived Protein Databases for Proteogenomic Search Strategies [J].
Blakeley, Paul ;
Overton, Ian M. ;
Hubbard, Simon J. .
JOURNAL OF PROTEOME RESEARCH, 2012, 11 (11) :5221-5234
[8]
Branca RMM, 2014, NAT METHODS, V11, P59, DOI [10.1038/nmeth.2732, 10.1038/NMETH.2732]
[9]
Shotgun proteomics aids discovery of novel protein-coding genes, alternative splicing, and "resurrected" pseudogenes in the mouse genome [J].
Brosch, Markus ;
Saunders, Gary I. ;
Frankish, Adam ;
Collins, Mark O. ;
Yu, Lu ;
Wright, James ;
Verstraten, Ruth ;
Adams, David J. ;
Harrow, Jennifer ;
Choudhary, Jyoti S. ;
Hubbard, Tim .
GENOME RESEARCH, 2011, 21 (05) :756-767
[10]
Accurate and Sensitive Peptide Identification with Mascot Percolator [J].
Brosch, Markus ;
Yu, Lu ;
Hubbard, Tim ;
Choudhary, Jyoti .
JOURNAL OF PROTEOME RESEARCH, 2009, 8 (06) :3176-3181