Database of Trypanosoma cruzi repeated genes: 20 000 additional gene variants

Cited by: 41
Authors
Arner, Erik [1 ]
Kindlund, Ellen [1 ]
Nilsson, Daniel [1 ]
Farzana, Fatima [2 ]
Ferella, Marcela [1 ]
Tammi, Martti T. [2 ]
Andersson, Bjoern [1 ]
Affiliations
[1] Karolinska Inst, Dept Cell & Mol Biol, Stockholm, Sweden
[2] Natl Univ Singapore, Dept Biol Sci Biochem, Singapore 117548, Singapore
DOI
10.1186/1471-2164-8-391
Chinese Library Classification
Q81 [Bioengineering (biotechnology)]; Q93 [Microbiology]
Subject classification codes
071005; 0836; 090102; 100705
Abstract
Background: Repeats are present in all genomes and often have important functions. However, in large genome sequencing projects, many repetitive regions remain uncharacterized. The genome of the protozoan parasite Trypanosoma cruzi consists of more than 50% repeats. These repeats include surface molecule genes and several other gene families. In the T. cruzi genome sequencing project, it was clear that not all copies of repetitive genes were present in the assembly, due to collapse of nearly identical repeats. However, at the time of publication of the T. cruzi genome, it was not clear to what extent this had occurred.
Results: We have developed a pipeline to estimate the genomic repeat content, in which shotgun reads are aligned to the genomic sequence and the gene copy number is estimated from the average shotgun coverage. This method was applied to the genome of T. cruzi, and copy numbers of all protein coding sequences and pseudogenes were estimated. The 22 640 results were stored in a database available online. An estimated 18% of all protein coding sequences and pseudogenes exist in 14 or more copies in the T. cruzi CL Brener genome. The average coverage of the annotated protein coding sequences and pseudogenes indicates a total gene copy number, including allelic gene variants, of over 40 000.
Conclusion: Our results indicate that the number of protein coding sequences and pseudogenes in the T. cruzi genome may be twice the previous estimate. We have constructed a database of the T. cruzi gene repeat data that is available as a resource to the community. The main purpose of the database is to enable biologists interested in repeated, unfinished regions to closely examine and resolve these regions themselves using all available shotgun data, instead of having to rely on annotated consensus sequences that are often erroneous and possibly misleading.
Five repetitive genes were studied in more detail, in order to illustrate how the database can be used to analyze and extract information about gene repeats with different characteristics in Trypanosoma cruzi.
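The coverage-based estimate described in the Results can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `estimate_copy_number`, the per-base depth input, and the single-copy coverage value are all hypothetical, and the sketch assumes that reads have already been aligned and that a reference coverage for single-copy sequence is known.

```python
from statistics import mean


def estimate_copy_number(per_base_depth, single_copy_coverage):
    """Return the mean read depth over a gene divided by the single-copy coverage.

    per_base_depth: read depth at each aligned position of the gene (hypothetical input).
    single_copy_coverage: expected depth for a sequence present once (assumed known).
    """
    if not per_base_depth:
        raise ValueError("per_base_depth must not be empty")
    if single_copy_coverage <= 0:
        raise ValueError("single_copy_coverage must be positive")
    return mean(per_base_depth) / single_copy_coverage


# Hypothetical gene of 1000 bp, each base covered by 140 reads,
# with an assumed single-copy coverage of 10x.
depth = [140] * 1000
copies = estimate_copy_number(depth, 10.0)
print(round(copies, 1))
```

A gene whose reads pile up at many times the single-copy depth is taken to represent that many collapsed copies; how the single-copy reference depth is derived is left open here.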
Pages: 15