Nested containment list (NCList): a new algorithm for accelerating interval query of genome alignment and interval databases

被引:36
作者
Alekseyenko, Alexander V.
Lee, Christopher J. [1 ]
机构
[1] Univ Calif Los Angeles, David Geffen Sch Med, Dept Biomath, Los Angeles, CA 90095 USA
[2] Univ Calif Los Angeles, Inst Mol Biol, Inst Genom & Proteom, Ctr Computat Biol,Dept Chem & Biochem, Los Angeles, CA 90095 USA
关键词
D O I
10.1093/bioinformatics/btl647
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
Motivation: The exponential growth of sequence databases poses a major challenge to bioinformatics tools for querying alignment and annotation databases. There is a pressing need for methods for finding overlapping sequence intervals that are highly scalable to database size, query interval size, result size and construction/updating of the interval database. Results: We have developed a new interval database representation, the Nested Containment List (NCList), whose query time is O(n + log N), where N is the database size and n is the size of the result set. In all cases tested, this query algorithm is 5-500-fold faster than other indexing methods tested in this study, such as MySQL multi-column indexing, MySQL binning and R-Tree indexing. We provide performance comparisons both in simulated datasets and real-world genome alignment databases, across a wide range of database sizes and query interval widths. We also present an in-place NCList construction algorithm that yields database construction times that are similar to 100-fold faster than other methods available. The NCList data structure appears to provide a useful foundation for highly scalable interval database applications.
引用
收藏
页码:1386 / 1393
页数:8
相关论文
共 15 条
[1]   Ensembl 2006 [J].
Birney, E. ;
Andrews, D. ;
Caccamo, M. ;
Chen, Y. ;
Clarke, L. ;
Coates, G. ;
Cox, T. ;
Cunningham, F. ;
Curwen, V. ;
Cutts, T. ;
Down, T. ;
Durbin, R. ;
Fernandez-Suarez, X. M. ;
Flicek, P. ;
Graf, S. ;
Hammond, M. ;
Herrero, J. ;
Howe, K. ;
Iyer, V. ;
Jekosch, K. ;
Kahari, A. ;
Kasprzyk, A. ;
Keefe, D. ;
Kokocinski, F. ;
Kulesha, E. ;
London, D. ;
Longden, I. ;
Melsopp, C. ;
Meidl, P. ;
Overduin, B. ;
Parker, A. ;
Proctor, G. ;
Prlic, A. ;
Rae, M. ;
Rios, D. ;
Redmond, S. ;
Schuster, M. ;
Sealy, I. ;
Searle, S. ;
Severin, J. ;
Slater, G. ;
Smedley, D. ;
Smith, J. ;
Stabenau, A. ;
Stalker, J. ;
Trevanion, S. ;
Ureta-Vidal, A. ;
Vogel, J. ;
White, S. ;
Woodwark, C. .
NUCLEIC ACIDS RESEARCH, 2006, 34 :D556-D561
[2]   Aligning multiple genomic sequences with the threaded blockset aligner [J].
Blanchette, M ;
Kent, WJ ;
Riemer, C ;
Elnitski, L ;
Smit, AFA ;
Roskin, KM ;
Baertsch, R ;
Rosenbloom, K ;
Clawson, H ;
Green, ED ;
Haussler, D ;
Miller, W .
GENOME RESEARCH, 2004, 14 (04) :708-715
[3]  
Cormen T.H., 2001, INTRO ALGORITHMS, P311
[4]  
Edelsbrunner H., 1980, Dynamic Rectangle Intersection Searching
[5]  
Enderle J., 2004, Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, P683, DOI DOI 10.1145/1007568.1007645
[6]   GALA, a database for genomic sequence alignments and annotations [J].
Giardine, B ;
Elnitski, L ;
Riemer, C ;
Makalowska, L ;
Schwartz, S ;
Miller, W ;
Hardison, RC .
GENOME RESEARCH, 2003, 13 (04) :732-741
[7]  
Guttman Antonin., 1984, SIGMOD Conference, P47, DOI [DOI 10.1145/602259.602266, 10.1145/ 971697.602266, DOI 10.1145/971697.602266]
[8]   The human genome browser at UCSC [J].
Kent, WJ ;
Sugnet, CW ;
Furey, TS ;
Roskin, KM ;
Pringle, TH ;
Zahler, AM ;
Haussler, D .
GENOME RESEARCH, 2002, 12 (06) :996-1006
[9]  
Kolovson C. P., 1991, SIGMOD Record, V20, P138, DOI 10.1145/119995.115807
[10]  
KRIEGEL H., 2000, Proc. 26th Int. Conf. on Very Large Databases (VLDB), P407