A data-mining approach to spacer oligonucleotide typing of Mycobacterium tuberculosis
Cited by: 67
Authors:
Sebban, M
Mokrousov, I
Rastogi, N
Sola, C [1]
Affiliations:
[1] Inst Pasteur Guadeloupe, Unite TB & Mycobacteries, BP 484, F-97165 Pointe A Pitre, Guadeloupe, France
[2] French W Indies & Guiana Univ, TRIVIA, Dept Math & Comp Sci, F-97159 Pointe A Pitre, Guadeloupe, France
Motivation: The Direct Repeat (DR) locus of Mycobacterium tuberculosis is a suitable model for studying (i) the molecular epidemiology and (ii) the evolutionary genetics of tuberculosis. It is analyzed with a DNA genotyping technique called spacer oligonucleotide typing (spoligotyping). In this paper, we investigated data analysis methods for discovering intelligible knowledge rules from spoligotyping data, methods that had not previously been applied to this representation. To this end, the C4.5 induction algorithm was applied to build decision trees, from which knowledge rules were produced. A Prototype Selection (PS) procedure was then applied to eliminate noisy data, which simplified the decision rules and reduced the number of spacers that must be tested to solve classification tasks. In the second part of the paper, the contribution of 25 additional spacers, and of the knowledge rules inferred from them, was studied from a machine learning point of view. From a statistical point of view, the correlations between spacers were analyzed; the results suggest that both negative and positive correlations may be related to potential structural constraints within the DR locus that may shape its evolution directly or indirectly.
Results: By generating knowledge rules induced from decision trees, it was shown that expert knowledge can be not only modeled but also improved and simplified to solve automatic classification tasks on unknown patterns. A practical consequence of this study may be a simplification of the spoligotyping technique, reducing the experimental constraints and increasing the number of samples that can be processed.
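To illustrate the kind of rule induction the abstract describes, the following is a minimal sketch of the gain-ratio criterion that C4.5 uses to choose a split, applied to binary spacer patterns (1 = spacer present, 0 = absent). It is not the authors' implementation: the four-spacer patterns, the class labels "A" and "B", and the function names are all hypothetical and purely synthetic, and a real spoligotype has many more spacers. Prototype Selection and tree pruning are not shown.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(patterns, labels, spacer):
    """C4.5-style gain ratio for splitting on one binary spacer."""
    n = len(labels)
    groups = {0: [], 1: []}
    for pattern, label in zip(patterns, labels):
        groups[pattern[spacer]].append(label)
    # Expected entropy of the class labels after the split.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values() if g)
    info_gain = entropy(labels) - conditional
    # Split information penalizes splits that are very unbalanced.
    split_info = entropy([pattern[spacer] for pattern in patterns])
    return info_gain / split_info if split_info > 0 else 0.0

def best_spacer(patterns, labels):
    """Index of the spacer with the highest gain ratio."""
    return max(range(len(patterns[0])),
               key=lambda s: gain_ratio(patterns, labels, s))

# Synthetic toy data (assumption): 4 spacers, 2 hypothetical classes.
patterns = [
    (1, 1, 0, 1),
    (1, 0, 0, 1),
    (1, 1, 1, 0),
    (0, 1, 1, 1),
    (0, 0, 1, 1),
    (0, 1, 0, 0),
]
labels = ["A", "A", "A", "B", "B", "B"]

root = best_spacer(patterns, labels)
print(f"Split on spacer index {root}: present -> A, absent -> B")
```

In this toy set a single spacer separates the two classes, so one rule suffices; this mirrors, in miniature, the idea that a well-chosen subset of spacers may be enough for classification.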