Annotating large genomes with exact word matches

Cited by: 53
Authors
Healy, J
Thomas, EE
Schwartz, JT
Wigler, M
Affiliations
[1] Cold Spring Harbor Lab, Cold Spring Harbor, NY 11724 USA
[2] NYU, Courant Inst Math Sci, New York, NY 10003 USA
DOI
10.1101/gr.1350803
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010 ; 081704 ;
Abstract
We have developed a tool for rapidly determining the number of exact matches of any word within large, internally repetitive genomes or sets of genomes. Thus we can readily annotate any sequence, including the entire human genome, with the counts of its constituent words. We create a Burrows-Wheeler transform of the genome, which together with auxiliary data structures facilitating counting, can reside in about one gigabyte of RAM. Our original interest was motivated by oligonucleotide probe design, and we describe a general protocol for defining unique hybridization probes. But our method also has applications for the analysis of genome structure and assembly. We demonstrate the identification of chromosome-specific repeats, and outline a general procedure for finding undiscovered repeats. We also illustrate the changing contents of the human genome assemblies by comparing the annotations built from different genome freezes.
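The counting idea in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: it builds a Burrows-Wheeler transform by naive suffix sorting and counts exact word matches with the backward search of Ferragina and Manzini, using uncompressed occurrence tables that would not fit a human genome into one gigabyte. The function names (`bwt_index`, `count_word`, `annotate`) and the `$` sentinel are assumptions made for illustration.

```python
from collections import Counter


def bwt_index(text):
    """Build a toy BWT index for `text` (naive suffix sort; short inputs only)."""
    text += "$"  # assumed sentinel: absent from the input, sorts before A/C/G/T
    order = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character that precedes each suffix in sorted order
    bwt = "".join(text[i - 1] for i in order)
    # first[c]: number of characters in the text strictly smaller than c
    counts = Counter(text)
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    # occ[c][i]: occurrences of c in bwt[:i] (the auxiliary counting structure)
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in counts:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return first, occ, len(bwt)


def count_word(index, word):
    """Count exact (possibly overlapping) occurrences of `word` by backward search."""
    first, occ, n = index
    lo, hi = 0, n  # half-open range of sorted suffixes that match so far
    for c in reversed(word):
        if c not in first:
            return 0
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


def annotate(index, sequence, k):
    """Annotate `sequence` with the indexed count of each of its k-letter words."""
    return [count_word(index, sequence[i:i + k])
            for i in range(len(sequence) - k + 1)]


index = bwt_index("ACGTACGTTACG")
print(count_word(index, "ACG"))
print(annotate(index, "ACGT", 3))
```

Each query takes time proportional to the word length, independent of the size of the indexed text, which is what makes annotating every word of a long sequence practical. A probe-design filter in the spirit of the abstract would keep only words whose count is exactly one.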
Pages: 2306 - 2315
Page count: 10
Related papers
18 in total
  • [1] ALTSCHUL SF, 1990, J MOL BIOL, V215, P403, DOI 10.1006/jmbi.1990.9999
  • [2] Opportunistic data structures with applications
    Ferragina, P
    Manzini, G
    [J]. 41ST ANNUAL SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE, PROCEEDINGS, 2000, : 390 - 398
  • [3] Gusfield D, 1997, ALGORITHMS STRINGS T
  • [4] Repbase Update - a database and an electronic journal of repetitive elements
    Jurka, J
    [J]. TRENDS IN GENETICS, 2000, 16 (09) : 418 - 420
  • [5] The UCSC Genome Browser Database
    Karolchik, D
    Baertsch, R
    Diekhans, M
    Furey, TS
    Hinrichs, A
    Lu, YT
    Roskin, KM
    Schwartz, M
    Sugnet, CW
    Thomas, DJ
    Weber, RJ
    Haussler, D
    Kent, WJ
    [J]. NUCLEIC ACIDS RESEARCH, 2003, 31 (01) : 51 - 54
  • [6] Kent WJ, 2002, GENOME RES, V12, P656, DOI 10.1101/gr.229202
  • [7] REPuter: fast computation of maximal repeats in complete genomes
    Kurtz, S
    Schleiermacher, C
    [J]. BIOINFORMATICS, 1999, 15 (05) : 426 - 427
  • [8] REPuter: the manifold applications of repeat analysis on a genomic scale
    Kurtz, S
    Choudhuri, JV
    Ohlebusch, E
    Schleiermacher, C
    Stoye, J
    Giegerich, R
    [J]. NUCLEIC ACIDS RESEARCH, 2001, 29 (22) : 4633 - 4642
  • [9] Kurtz S, 1999, SOFTWARE PRACT EXPER, V29, P1149, DOI 10.1002/(SICI)1097-024X(199911)29:13<1149::AID-SPE274>3.0.CO;2-O