Consistency of systematic chemical identifiers within and between small-molecule databases

Cited by: 30
Authors
Akhondi, Saber A. [1 ]
Kors, Jan A. [1 ]
Muresan, Sorel [2 ]
Affiliations
[1] Erasmus Univ, Med Ctr, Dept Med Informat, NL-3000 CA Rotterdam, Netherlands
[2] AstraZeneca R&D, Discovery Sci, Chem Innovat Ctr, S-43183 Molndal, Sweden
Keywords
Molecular structure; Chemical databases; Systematic chemical identifiers; Quality control; InChI; SMILES; IUPAC; Curation; Standard; Quality
DOI
10.1186/1758-2946-4-35
Chinese Library Classification (CLC) number
O6 [Chemistry];
Discipline classification code
070301 [Inorganic Chemistry];
Abstract
Background: Correctness of structures and associated metadata within public and commercial chemical databases greatly impacts drug discovery research activities such as quantitative structure-property relationship modelling and compound novelty checking. MOL files, SMILES notations, IUPAC names, and InChI strings are ubiquitous file formats and systematic identifiers for chemical structures. Although they are interchangeable for many cheminformatics purposes, there have been no studies on the inconsistency of these structure identifiers arising from different approaches to data integration, including the use of different software and different rules for structure standardisation. We have investigated the consistency of systematic identifiers of small molecules within and between some of the commonly used chemical resources, with and without structure standardisation.

Results: The consistency between systematic chemical identifiers and their corresponding MOL representation varies greatly between data sources (37.2% to 98.5%). We observed the lowest overall consistency for MOL-IUPAC names. Disregarding stereochemistry increases the consistency (84.8% to 99.9%). A wide variation in consistency also exists between MOL representations of compounds linked via cross-references (25.8% to 93.7%). Removing stereochemistry improved the consistency (47.6% to 95.6%).

Conclusions: We have shown that considerable inconsistency exists in structural representation and systematic chemical identifiers within and between databases. This can have a great influence, especially when merging data and when systematic identifiers are used as a key index for structure integration or for cross-querying several databases. Regenerating systematic identifiers from their MOL representation, and applying well-defined and documented chemistry standardisation rules to all compounds before creating them, can dramatically increase internal consistency.
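
The kind of consistency percentage reported in the abstract, with and without stereochemistry, can be illustrated with a minimal Python sketch. It is not the authors' pipeline. It assumes that each pair of identifiers has already been converted to a Standard InChI string by some external toolkit, and it approximates "disregarding stereochemistry" by dropping the InChI stereo layers (/b, /t, /m, /s). The function names `strip_stereo_layers` and `consistency` are hypothetical, chosen only for this example.

```python
from typing import Iterable, Tuple

# InChI layer prefixes that encode stereochemistry:
# /b double-bond, /t tetrahedral, /m and /s stereo-type markers.
STEREO_PREFIXES = ("b", "t", "m", "s")


def strip_stereo_layers(inchi: str) -> str:
    """Return the InChI string with its stereo layers removed."""
    parts = inchi.split("/")
    # parts[0] is the version prefix and parts[1] the formula; keep both as-is.
    head, layers = parts[:2], parts[2:]
    kept = [layer for layer in layers if not layer.startswith(STEREO_PREFIXES)]
    return "/".join(head + kept)


def consistency(pairs: Iterable[Tuple[str, str]], ignore_stereo: bool = False) -> float:
    """Percentage of identifier pairs whose InChI strings agree."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no identifier pairs to compare")
    matches = 0
    for left, right in pairs:
        if ignore_stereo:
            left, right = strip_stereo_layers(left), strip_stereo_layers(right)
        matches += left == right
    return 100.0 * matches / len(pairs)


if __name__ == "__main__":
    # Illustrative records only, not data from the study.
    l_ala = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
    d_ala = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
    ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
    methanol = "InChI=1S/CH4O/c1-2/h2H,1H3"
    records = [(l_ala, l_ala), (l_ala, d_ala), (ethanol, methanol)]
    print(f"with stereo:    {consistency(records):.1f}%")
    print(f"without stereo: {consistency(records, ignore_stereo=True):.1f}%")
```

In this toy set the two alanine enantiomers disagree only in their stereo layers, so ignoring stereochemistry turns that pair into a match, while the ethanol/methanol pair remains a genuine structural mismatch.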
Pages: 7