Generation, annotation, analysis and database integration of 16,500 white spruce EST clusters

Cited by: 88
Authors
Pavy, N
Paule, C
Parsons, L
Crow, JA
Morency, MJ
Cooke, J
Johnson, JE
Noumen, E
Guillet-Claude, C
Butterfield, Y
Barber, S
Yang, G
Liu, J
Stott, J
Kirkpatrick, R
Siddiqui, A
Holt, R
Marra, M
Seguin, A
Retzel, E
Bousquet, J
MacKay, J
Affiliations
[1] Univ Laval, ARBOREA, Ste Foy, PQ G1K 7P4, Canada
[2] Univ Laval, Canada Res Chair Forest Genom, Ste Foy, PQ G1K 7P4, Canada
[3] Univ Minnesota, Ctr Computat Genom & Bioinformat, Minneapolis, MN 55455 USA
[4] Nat Resources Canada, Canadian Forestry Serv, Laurentian Forestry Ctr, Quebec City, PQ G1V 4C7, Canada
[5] British Columbia Canc Agcy, Genome Sci Ctr, Vancouver, BC V5Z 1L3, Canada
[6] Univ Alberta, Dept Biol Sci, Edmonton, AB T6G 2E9, Canada
DOI
10.1186/1471-2164-6-144
Chinese Library Classification (CLC) number
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology]
Discipline classification code
071005; 0836; 090102; 100705
Abstract
Background: The sequencing and analysis of ESTs is currently the only practical approach for large-scale gene discovery and annotation in conifers, because their very large genomes are unlikely to be sequenced in the near future. Our objective was to produce extensive collections of ESTs and cDNA clones to support the manufacture of cDNA microarrays and gene discovery in white spruce (Picea glauca [Moench] Voss).
Results: We produced 16 cDNA libraries from different tissues and a variety of treatments, and partially sequenced 50,000 cDNA clones. High-quality 3' and 5' reads were assembled into 16,578 consensus sequences, 45% of which represented full-length inserts. Consensus sequences derived from 5' and 3' reads of the same cDNA clone were linked to define 14,471 transcripts. A large proportion (84%) of the spruce sequences matched a pine sequence, but only 68% of the spruce transcripts had homologs in Arabidopsis or rice. Nearly all the sequences that matched the Populus trichocarpa genome (the only sequenced tree genome) also matched the rice or Arabidopsis genomes. We used several sequence similarity search approaches to assign putative functions, including BLAST searches against general and specialized databases (transcription factors, cell wall related proteins), Gene Ontology term assignment, and Hidden Markov Model searches against Pfam protein families and domains. In total, 70% of the spruce transcripts displayed matches to proteins of known or unknown function in the UniRef100 database (blastx e-value < 1e-10). We identified multigenic families that appeared larger in spruce than in the Arabidopsis or rice genomes. Detailed analysis of the translationally controlled tumour protein and S-adenosylmethionine synthetase families confirmed a twofold size difference. Sequences and annotations were organized in a dedicated database, SpruceDB. Several search tools were developed to mine the data based either on occurrence in the cDNA libraries or on functional annotations.
Conclusion: This report illustrates specific approaches for large-scale gene discovery and annotation in an organism that is very distantly related to any of the fully sequenced genomes. The ArboreaSet sequences and cDNA clones represent a valuable resource for investigations ranging from plant comparative genomics to applied conifer genetics.
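The abstract states that consensus sequences derived from 5' and 3' reads of the same cDNA clone were linked to define transcripts, but this record does not describe how the linking was done. The sketch below is one plausible way to do it, not the authors' method: it treats the problem as finding connected components with a union-find structure. The function name `link_transcripts`, the input layout of (clone, consensus) pairs, and the identifiers in the example are all hypothetical.

```python
from collections import defaultdict


def link_transcripts(read_assignments):
    """Group consensus sequences into putative transcripts.

    read_assignments: iterable of (clone_id, consensus_id) pairs, one per
    read (hypothetical layout). Two consensus sequences are placed in the
    same group when they are connected through reads of a shared clone.
    """
    parent = {}

    def find(x):
        # Return the representative of x's group, compressing the path.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        # Merge the groups containing a and b.
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_consensus_for_clone = {}
    for clone_id, consensus_id in read_assignments:
        find(consensus_id)  # register the consensus even if it stays alone
        if clone_id in first_consensus_for_clone:
            union(first_consensus_for_clone[clone_id], consensus_id)
        else:
            first_consensus_for_clone[clone_id] = consensus_id

    groups = defaultdict(set)
    for consensus_id in list(parent):
        groups[find(consensus_id)].add(consensus_id)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical example: cloneA and cloneB chain c1, c2 and c3 together.
reads = [
    ("cloneA", "c1"), ("cloneA", "c2"),
    ("cloneB", "c2"), ("cloneB", "c3"),
    ("cloneC", "c4"),
]
transcripts = link_transcripts(reads)
print(len(transcripts), transcripts)
```

Under this reading, the number of groups returned plays the role of the transcript count, which is why it can be smaller than the number of consensus sequences (14,471 versus 16,578 in the abstract).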
Pages: 19