Efficient computation of absent words in genomic sequences

被引:50
作者
Herold, Julia [1 ]
Kurtz, Stefan [2 ]
Giegerich, Robert [1 ]
机构
[1] Univ Bielefeld, Ctr Biotechnol, D-33501 Bielefeld, Germany
[2] Univ Hamburg, Ctr Bioinformat, D-20146 Hamburg, Germany
关键词
D O I
10.1186/1471-2105-9-167
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
Background: Analysis of sequence composition is a routine task in genome research. Organisms are characterized by their base composition, dinucleotide relative abundance, codon usage, and so on. Unique subsequences are markers of special interest in genome comparison, expression profiling, and genetic engineering. Relative to a random sequence of the same length, unique subsequences are overrepresented in real genomes. Shortest words absent from a genome have been addressed in two recent studies. Results: We describe a new algorithm and software for the computation of absent words. It is more efficient than previous algorithms and easier to use. It directly computes unwords without the need to specify a length estimate. Moreover, it avoids the space requirements of index structures such as suffix trees and suffix arrays. Our implementation is available as an open source package. We compute unwords of human and mouse as well as some other organisms, covering a genome size range from 10(9) down to 10(5) bp. Conclusion: The new algorithm computes absent words for the human genome in 10 minutes on standard hardware, using only 2.5 Mb of space. This enables us to perform this type of analysis not only for the largest genomes available so far, but also for the emerging pan- and meta-genome data.
引用
收藏
页数:9
相关论文
共 16 条
[1]  
Abouelhoda M. I., 2004, Journal of Discrete Algorithms, V2, P53, DOI 10.1016/S1570-8667(03)00065-0
[2]   Nullomers: Really a Matter of Natural Selection? [J].
Acquisti, Claudia ;
Poste, George ;
Curtiss, David ;
Kumar, Sudhir .
PLOS ONE, 2007, 2 (10)
[3]  
[Anonymous], 2002, P 6 INT C RES COMP M
[4]   Verbumculus and the discovery of unusual words [J].
Apostolico, A ;
Gong, FC ;
Lonardi, S .
JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY, 2004, 19 (01) :22-41
[5]   Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii [J].
Bult, CJ ;
White, O ;
Olsen, GJ ;
Zhou, LX ;
Fleischmann, RD ;
Sutton, GG ;
Blake, JA ;
FitzGerald, LM ;
Clayton, RA ;
Gocayne, JD ;
Kerlavage, AR ;
Dougherty, BA ;
Tomb, JF ;
Adams, MD ;
Reich, CI ;
Overbeek, R ;
Kirkness, EF ;
Weinstock, KG ;
Merrick, JM ;
Glodek, A ;
Scott, JL ;
Geoghagen, NSM ;
Weidman, JF ;
Fuhrmann, JL ;
Nguyen, D ;
Utterback, TR ;
Kelley, JM ;
Peterson, JD ;
Sadow, PW ;
Hanna, MC ;
Cotton, MD ;
Roberts, KM ;
Hurst, MA ;
Kaine, BP ;
Borodovsky, M ;
Klenk, HP ;
Fraser, CM ;
Smith, HO ;
Woese, CR ;
Venter, JC .
SCIENCE, 1996, 273 (5278) :1058-1073
[6]   Mauve: Multiple alignment of conserved genomic sequence with rearrangements [J].
Darling, ACE ;
Mau, B ;
Blattner, FR ;
Perna, NT .
GENOME RESEARCH, 2004, 14 (07) :1394-1403
[7]   Complete genome sequence of the hyperthermophilic archaeon Thermococcus kodakaraensis KOD1 and comparison with Pyrococcus genomes [J].
Fukui, T ;
Atomi, H ;
Kanai, T ;
Matsumi, R ;
Fujiwara, S ;
Imanaka, T .
GENOME RESEARCH, 2005, 15 (03) :352-363
[8]   The genome sequence of the filamentous fungus Neurospora crassa [J].
Galagan, JE ;
Calvo, SE ;
Borkovich, KA ;
Selker, EU ;
Read, ND ;
Jaffe, D ;
FitzHugh, W ;
Ma, LJ ;
Smirnov, S ;
Purcell, S ;
Rehman, B ;
Elkins, T ;
Engels, R ;
Wang, SG ;
Nielsen, CB ;
Butler, J ;
Endrizzi, M ;
Qui, DY ;
Ianakiev, P ;
Pedersen, DB ;
Nelson, MA ;
Werner-Washburne, M ;
Selitrennikoff, CP ;
Kinsey, JA ;
Braun, EL ;
Zelter, A ;
Schulte, U ;
Kothe, GO ;
Jedd, G ;
Mewes, W ;
Staben, C ;
Marcotte, E ;
Greenberg, D ;
Roy, A ;
Foley, K ;
Naylor, J ;
Stabge-Thomann, N ;
Barrett, R ;
Gnerre, S ;
Kamal, M ;
Kamvysselis, M ;
Mauceli, E ;
Bielke, C ;
Rudd, S ;
Frishman, D ;
Krystofova, S ;
Rasmussen, C ;
Metzenberg, RL ;
Perkins, DD ;
Kroken, S .
NATURE, 2003, 422 (6934) :859-868
[9]  
Hampikian G, 2007, PACIFIC SYMPOSIUM ON BIOCOMPUTING 2007, P355
[10]   Genome comparison without alignment using shortest unique substrings -: art. no. 123 [J].
Haubold, B ;
Pierstorff, N ;
Möller, F ;
Wiehe, T .
BMC BIOINFORMATICS, 2005, 6 (1)