Analysis of high-depth sequence data for studying viral diversity: a comparison of next generation sequencing platforms using Segminator II

Cited by: 50
Authors
Archer, John [1]
Baillie, Greg [2]
Watson, Simon J. [2]
Kellam, Paul [2, 3]
Rambaut, Andrew [4, 5]
Robertson, David L. [1]
Affiliations
[1] Univ Manchester, Fac Life Sci, Manchester, Lancs, England
[2] Wellcome Trust Sanger Inst, Cambridge, England
[3] UCL, Div Infect & Immun, UCL MRC Ctr Med Mol Virol, London, England
[4] Univ Edinburgh, Inst Evolutionary Biol, Edinburgh, Midlothian, Scotland
[5] NIH, Fogarty Int Ctr, Bethesda, MD 20892 USA
Source
BMC BIOINFORMATICS | 2012 / Vol. 13
Funding
Biotechnology and Biological Sciences Research Council (UK);
Keywords
SNP DISCOVERY; HIV-1; READS; BIOINFORMATICS; ALIGNMENT; VARIANTS; MINORITY; ACCURACY; VACCINES; THERAPY;
DOI
10.1186/1471-2105-13-47
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010; 081704;
Abstract
Background: Next generation sequencing provides detailed insight into the variation present within viral populations, introducing the possibility of treatment strategies that are both reactive and predictive. Current software tools, however, need to be scaled up to accommodate high-depth viral data sets, which are often temporally or spatially linked. In addition, due to the development of novel sequencing platforms and chemistries, each with implicit strengths and weaknesses, it will be helpful for researchers to be able to routinely compare and combine data sets from different platforms/chemistries. In particular, error associated with a specific sequencing process must be quantified so that true biological variation may be identified.
Results: Segminator II was developed to allow for the efficient comparison of data sets derived from different sources. We demonstrate its usage by comparing large data sets from 12 influenza H1N1 samples sequenced on both the 454 Life Sciences and Illumina platforms, permitting quantification of platform error. For mismatches, median error rates of 0.10% and 0.12%, respectively, suggested that both platforms performed similarly. For insertions and deletions, median error rates within the 454 data (0.3% and 0.2%, respectively) were significantly higher than those within the Illumina data (0.004% and 0.006%, respectively). In agreement with previous observations, these higher rates were strongly associated with homopolymeric stretches on the 454 platform. Outside of such regions, both platforms had similar indel error profiles. Additionally, we apply our software to the identification of low frequency variants.
Conclusion: We have demonstrated, using Segminator II, that it is possible to distinguish platform-specific error from biological variation using data derived from two different platforms. We have used this approach to quantify the amount of error present within the 454 and Illumina platforms in relation to genomic location as well as location on the read. Given that next generation data is increasingly important in the analysis of drug resistance and vaccine trials, this software will be useful to the pathogen research community. A zip file containing the source code and jar file is freely available for download from http://www.bioinf.manchester.ac.uk/segminator/.
Pages: 11