A machine-learning approach to combined evidence validation of genome assemblies

Cited by: 13
Authors
Choi, Jeong-Hyeon [1]
Kim, Sun [1, 2]
Tang, Haixu [1, 2]
Andrews, Justen [1, 3]
Gilbert, Don G. [1]
Colbourne, John K. [1]
Affiliations
[1] Indiana Univ, Ctr Genom & Bioinformat, Bloomington, IN 47405 USA
[2] Indiana Univ, Sch Informat, Bloomington, IN 47405 USA
[3] Indiana Univ, Dept Biol, Bloomington, IN 47405 USA
Funding
National Science Foundation (USA)
DOI
10.1093/bioinformatics/btm608
Chinese Library Classification
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: While it is common to refer to 'the genome sequence' as if it were a single, complete and contiguous DNA string, it is in fact an assembly of millions of small, partially overlapping DNA fragments. Sophisticated computer algorithms (assemblers and scaffolders) merge these DNA fragments into contigs, and place these contigs into sequence scaffolds using the paired-end sequences derived from large-insert DNA libraries. Each step in this automated process is susceptible to errors; hence, the resulting draft assembly represents, in practice, only a likely assembly that requires further validation. Knowing which parts of the draft assembly are likely free of errors is critical if researchers are to draw reliable conclusions from the assembled sequence data.

Results: We develop a machine-learning method to detect assembly errors in sequence assemblies. Several in silico measures for assembly validation have been proposed by various researchers. Using three benchmark Drosophila draft genomes, we evaluate these techniques along with some new measures that we propose, including the good-minus-bad coverage (GMB), the good-to-bad-ratio (RGB), the average Z-score (AZ) and the average absolute Z-score (ASZ). Our results show that the GMB measure performs better than the others in both sensitivity and specificity for assembly error detection. Nevertheless, no single measure performs well enough to reliably detect the genomic regions that require further experimental verification. To exploit the advantages of all these measures, we develop a novel machine-learning approach that combines the individual measures to achieve a higher prediction accuracy (i.e. greater than 90%). Our combined evidence approach avoids the difficult and often ad hoc selection of the many parameters that the individual measures require, and significantly improves the overall precision on the benchmark data sets.
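
The abstract names the four proposed measures without defining them. The following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes that each measure is computed at a single assembly position from the mate pairs spanning it, that a pair's Z-score is its observed insert size standardized against its library's mean and standard deviation, and that a pair counts as "good" when its absolute Z-score is within a cutoff. The `MatePair` class, the `measures_at` function and the cutoff of 3.0 are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class MatePair:
    """A mate pair placed on the assembly (hypothetical representation)."""
    start: int       # leftmost assembled coordinate of the pair
    end: int         # rightmost assembled coordinate of the pair
    lib_mean: float  # expected insert size of the source library
    lib_sd: float    # standard deviation of the library insert size

    @property
    def z(self) -> float:
        # Standardized deviation of the observed insert size.
        return ((self.end - self.start) - self.lib_mean) / self.lib_sd


def measures_at(pos, pairs, z_cutoff=3.0):
    """Compute assumed GMB, RGB, AZ and ASZ values at one position.

    ASSUMPTION: a pair is 'good' when |z| <= z_cutoff, 'bad' otherwise.
    """
    zs = [p.z for p in pairs if p.start <= pos < p.end]
    good = sum(1 for z in zs if abs(z) <= z_cutoff)
    bad = len(zs) - good
    return {
        "GMB": good - bad,                                # good minus bad
        "RGB": good / bad if bad else float("inf"),       # good-to-bad ratio
        "AZ": mean(zs) if zs else 0.0,                    # average Z-score
        "ASZ": mean(abs(z) for z in zs) if zs else 0.0,   # average |Z|
    }


# Three made-up pairs from a library with mean 3000 and sd 300.
pairs = [
    MatePair(0, 3000, 3000.0, 300.0),    # z = 0
    MatePair(100, 3400, 3000.0, 300.0),  # z = 1
    MatePair(500, 5000, 3000.0, 300.0),  # z = 5, counted as bad
]
m = measures_at(1000, pairs)
```

Under these assumptions, per-position values like `m` could serve as the feature vector that a combined classifier is trained on, which is the role the abstract assigns to the individual measures.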
Pages: 744-750
Page count: 7