DNA sequence quality trimming and vector removal

Cited by: 404
Authors
Chou, HH [1]
Holmes, MH
Affiliations
[1] Iowa State Univ, Dept Comp Sci, Dept Zool & Genet, Ames, IA 50011 USA
[2] Inst Genom Res, Dept Bioinformat, Rockville, MD 20850 USA
DOI
10.1093/bioinformatics/17.12.1093
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: Most sequence comparison methods assume that the data being compared are trustworthy, but this is not the case with raw DNA sequences obtained from automatic sequencing machines. Nevertheless, sequence comparisons need to be done on them in order to remove vector splice sites and contaminants. This step is necessary before other genomic data processing stages can be carried out, such as fragment assembly or EST clustering. A specialized tool is therefore needed to solve this apparent dilemma.
Results: We have designed and implemented a program that specifically addresses the problem. This program, called Lucy, has been in use since 1998 at The Institute for Genomic Research (TIGR). During this period, many rounds of experience-driven modifications were made to Lucy to improve its accuracy and its ability to deal with extremely difficult input cases. We believe we have finally obtained a useful program which strikes a delicate balance among the many issues involved in the raw sequence cleaning problem, and we wish to share it with the research community.
Pages: 1093-1104
Number of pages: 12
Related papers
10 items in total
  • [1] Sequence assembly with CAFTOOLS
    Dear, S
    Durbin, R
    Hillier, L
    Marth, G
    Thierry-Mieg, J
    Mott, R
    [J]. GENOME RESEARCH, 1998, 8 (03): 260-267
  • [2] Base-calling of automated sequencer traces using phred. II. Error probabilities
    Ewing, B
    Green, P
    [J]. GENOME RESEARCH, 1998, 8 (03): 186-194
  • [3] Base-calling of automated sequencer traces using phred. I. Accuracy assessment
    Ewing, B
    Hillier, L
    Wendl, MC
    Green, P
    [J]. GENOME RESEARCH, 1998, 8 (03): 175-185
  • [4] A tool for analyzing and annotating genomic sequences
    Huang, XQ
    Adams, MD
    Zhou, H
    Kerlavage, AR
    [J]. GENOMICS, 1997, 46 (01): 37-45
  • [5] Sequence and analysis of chromosome 2 of the plant Arabidopsis thaliana
    Lin, XY
    Kaul, SS
    Rounsley, S
    Shea, TP
    Benito, MI
    Town, CD
    Fujii, CY
    Mason, T
    Bowman, CL
    Barnstead, M
    Feldblyum, TV
    Buell, CR
    Ketchum, KA
    Lee, J
    Ronning, CM
    Koo, HL
    Moffat, KS
    Cronin, LA
    Shen, M
    Pai, G
    Van Aken, S
    Umayam, L
    Tallon, LJ
    Gill, JE
    Adams, MD
    Carrera, AJ
    Creasy, TH
    Goodman, HM
    Somerville, CR
    Copenhaver, GP
    Preuss, D
    Nierman, WC
    White, O
    Eisen, JA
    Salzberg, SL
    Fraser, CM
    Venter, JC
    [J]. NATURE, 1999, 402 (6763): 761-+
  • [6] *NAT CTR BIOT INF, 2001, FASTA FIL FORM SPEC
  • [7] *PARACEL, 2000, TRACETUNER CAPT MOST
  • [8] Smith TM, 1997, COMPUT APPL BIOSCI, V13, P175
  • [9] VEKLEROV E, 1996, MICROB COMP GENOMICS, V1
  • [10] Automated sequence preprocessing in a large-scale sequencing environment
    Wendl, MC
    Dear, S
    Hodgson, D
    Hillier, L
    [J]. GENOME RESEARCH, 1998, 8 (09): 975-984