Using proteomics to mine genome sequences

Cited by: 23
Authors
Arthur, JW [1]
Wilkins, MR [1]
Affiliations
[1] Proteome Syst Ltd, N Ryde, NSW 1670, Australia
Keywords
proteomics; genome annotation; open reading frames; peptide mass fingerprinting; mass spectrometry
DOI
10.1021/pr034056e
Chinese Library Classification number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
We present a method for mining unannotated or annotated genome sequences with proteomic data to identify open reading frames. The region of a genome coding for a protein sequence is identified by using information from the analysis of proteins and peptides with MALDI-TOF mass spectrometry. The raw genome sequence or any unassembled contigs of an organism are theoretically cleaved into a number of equal-sized but overlapping fragments, and these are then translated in all six frames into a series of virtual proteins. Each virtual protein is then subjected to a theoretical enzymatic digestion. Standard proteomic sample preparation methods are used to separate, array, and digest the proteins of interest into peptides. The masses of the resulting peptides are measured using mass spectrometry and compared to the theoretical peptide masses of the virtual proteins. The region of the genome coding for a particular protein can then be identified when a large number of peptide masses from the protein match peptide masses from a virtual protein. The method makes no assumptions about the location of a protein in a particular gene sequence or the positions or types of start and stop codons. To illustrate this approach, all 773 proteins of Pseudomonas aeruginosa contained in SWISS-PROT were used to theoretically test the method and optimize parameters. Increasing the size of the virtual proteins results in an overall improvement in the ability to detect the coding region, at the cost of decreasing the sensitivity of the method for smaller proteins. Increasing the minimum number of matching peptides, lowering the mass error tolerance, or increasing the signal-to-noise ratio of the simulated mass spectrum improves the ability to detect coding regions. The method is further demonstrated on experimental data from Mycobacterium tuberculosis and is also shown to work with eukaryotic organisms (e.g., Homo sapiens).
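The workflow described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: every function name and default parameter is hypothetical, and it assumes the standard genetic code, trypsin as the enzyme (cleavage after K or R, not before P, no missed cleavages), monoisotopic neutral peptide masses, a ppm mass tolerance, and that translated fragments are split at stop codons before digestion. The abstract specifies none of these details.

```python
import re
from itertools import product

# Standard genetic code in the conventional TCAG ordering (assumption).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Monoisotopic residue masses in Da; a peptide adds one water.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frame_translations(dna):
    """Yield (frame_label, virtual_protein) for all six reading frames."""
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
            protein = "".join(CODON_TABLE.get(c, "X") for c in codons)
            yield f"{strand}{offset + 1}", protein


def tryptic_peptides(protein, min_len=5):
    """Theoretical trypsin digest; segments are split at stop codons first."""
    peptides = []
    for segment in protein.split("*"):
        for pep in re.split(r"(?<=[KR])(?!P)", segment):
            if len(pep) >= min_len and "X" not in pep:
                peptides.append(pep)
    return peptides


def peptide_mass(peptide):
    return sum(MONO[a] for a in peptide) + WATER


def count_hits(observed, theoretical, tol_ppm):
    """Number of observed masses matched by at least one theoretical mass."""
    return sum(
        any(abs(o - t) <= o * tol_ppm * 1e-6 for t in theoretical)
        for o in observed
    )


def scan_genome(genome, observed, window=300, step=150, tol_ppm=50.0, min_hits=3):
    """Score overlapping fragments in six frames against observed masses.

    Returns (hit_count, window_start, frame) tuples, best first.
    A trailing region shorter than one step is not scanned in this sketch.
    """
    hits = []
    for start in range(0, max(len(genome) - window, 0) + 1, step):
        fragment = genome[start:start + window]
        for frame, protein in six_frame_translations(fragment):
            masses = [peptide_mass(p) for p in tryptic_peptides(protein)]
            n = count_hits(observed, masses, tol_ppm)
            if n >= min_hits:
                hits.append((n, start, frame))
    return sorted(hits, reverse=True)
```

Calling `scan_genome(genome, observed_masses)` returns candidate coding regions ranked by the number of matching peptide masses. The `window`, `tol_ppm`, and `min_hits` arguments correspond loosely to the virtual protein size, mass error tolerance, and minimum number of matching peptides that the abstract reports as tunable parameters.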
Pages: 393-402
Number of pages: 10