Efficient large-scale sequence comparison by locality-sensitive hashing

被引:145
作者
Buhler, J [1 ]
机构
[1] Univ Washington, Dept Comp Sci & Engn, Seattle, WA 98195 USA
基金
美国国家科学基金会;
关键词
D O I
10.1093/bioinformatics/17.5.419
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
Motivation: Comparison of multimegabase genomic DNA sequences is a popular technique for finding and annotating conserved genome features. Performing such comparisons entails finding many short local alignments between sequences up to tens of megabases in length. To process such long sequences efficiently, existing algorithms find alignments by expanding around short runs of matching bases with no substitutions or other differences. Unfortunately, exact matches that are short enough to occur often in significant alignments also occur frequently by chance in the background sequence. Thus, these algorithms must trade off between efficiency and sensitivity to features without long exact matches. Results: We introduce a new algorithm, LSH-ALL-PAIRS, to find ungapped local alignments in genomic sequence with up to a specified fraction of substitutions. The length and substitution rate of these alignments can be chosen so that they appear frequently in significant similarities yet still remain rare in the background sequence. The algorithm finds ungapped alignments efficiently using a randomized search technique, locality-sensitive hashing. We have found LSH-ALL-PAIRS to be both efficient and sensitive for finding local similarities with as little as 63% identity in mammalian genomic sequences up to tens of megabases in length.
引用
收藏
页码:419 / 428
页数:10
相关论文
共 23 条
  • [11] Long human-mouse sequence alignments reveal novel regulatory elements: A reason to sequence the mouse genome
    Hardison, RC
    Oeltjen, J
    Miller, W
    [J]. GENOME RESEARCH, 1997, 7 (10): : 959 - 966
  • [12] The DNA sequence of human chromosome 21
    Hattori, M
    Fujiyama, A
    Taylor, TD
    Watanabe, H
    Yada, T
    Park, HS
    Toyoda, A
    Ishii, K
    Totoki, Y
    Choi, DK
    Soeda, E
    Ohki, M
    Takagi, T
    Sakaki, Y
    Taudien, S
    Blechschmidt, K
    Polley, A
    Menzel, U
    Delabar, J
    Kumpf, K
    Lehmann, R
    Patterson, D
    Reichwald, K
    Rump, A
    Schillhabel, M
    Schudy, A
    Zimmermann, W
    Rosenthal, A
    Kudoh, J
    Shibuya, K
    Kawasaki, K
    Asakawa, S
    Shintani, A
    Sasaki, T
    Nagamine, K
    Mitsuyama, S
    Antonarakis, SE
    Minoshima, S
    Shimizu, N
    Nordsiek, G
    Hornischer, K
    Brandt, P
    Scharfe, M
    Schön, O
    Desario, A
    Reichelt, J
    Kauer, G
    Blöcker, H
    Ramser, J
    Beck, A
    [J]. NATURE, 2000, 405 (6784) : 311 - 319
  • [13] A TIME-EFFICIENT, LINEAR-SPACE LOCAL SIMILARITY ALGORITHM
    HUANG, XQ
    MILLER, W
    [J]. ADVANCES IN APPLIED MATHEMATICS, 1991, 12 (03) : 337 - 357
  • [14] Indyk P, 1998, P 1998 S THEOR COMP
  • [15] METHODS FOR ASSESSING THE STATISTICAL SIGNIFICANCE OF MOLECULAR SEQUENCE FEATURES BY USING GENERAL SCORING SCHEMES
    KARLIN, S
    ALTSCHUL, SF
    [J]. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 1990, 87 (06) : 2264 - 2268
  • [16] Identification of a coordinate regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons
    Loots, GG
    Locksley, RM
    Blakespoor, CM
    Wang, ZE
    Miller, W
    Rubin, EM
    Frazer, KA
    [J]. SCIENCE, 2000, 288 (5463) : 136 - 140
  • [17] Motwani R., 1995, RANDOMIZED ALGORITHM
  • [18] MULTIPLE FILTRATION AND APPROXIMATE PATTERN-MATCHING
    PEVZNER, PA
    WATERMAN, MS
    [J]. ALGORITHMICA, 1995, 13 (1-2) : 135 - 154
  • [19] PipMaker - A Web server for aligning two genomic DNA sequences
    Schwartz, S
    Zhang, Z
    Frazer, KA
    Smit, A
    Riemer, C
    Bouck, J
    Gibbs, R
    Hardison, R
    Miller, W
    [J]. GENOME RESEARCH, 2000, 10 (04) : 577 - 586
  • [20] Smit AFA, 1999, REPEATMASKER