Reactome pathway analysis: a high-performance in-memory approach

Cited by: 607
Authors
Fabregat, Antonio [1, 3]
Sidiropoulos, Konstantinos [1]
Viteri, Guilherme [1]
Forner, Oscar [1]
Marin-Garcia, Pablo [4, 5]
Arnau, Vicente [6, 7]
D'Eustachio, Peter [8]
Stein, Lincoln [9, 10]
Hermjakob, Henning [1, 2]
Affiliations
[1] European Bioinformat Inst EMBL EBI, European Mol Biol Lab, Wellcome Genome Campus, Hinxton, England
[2] Natl Ctr Prot Sci, Beijing Inst Radiat Med, Beijing Proteome Res Ctr, State Key Lab Prote, Beijing 102206, Peoples R China
[3] Open Targets, Wellcome Genome Campus, Hinxton, England
[4] Univ Valencia, Fdn Invest INCLIVA, Valencia, Spain
[5] Inst Med Genom, Valencia, Spain
[6] Univ Valencia, Escuela Tecn Sup Ingn, Valencia, Spain
[7] Univ Valencia CSIC, Inst Integrat Syst Biol I2SysBio, Valencia, Spain
[8] NYU Langone Med Ctr, New York, NY USA
[9] Ontario Inst Canc Res, Toronto, ON, Canada
[10] Univ Toronto, Dept Mol Genet, Toronto, ON, Canada
Funding
US National Institutes of Health (NIH)
Keywords
Pathway analysis; Over-representation analysis; Data structures; GENE; PROTEIN; EXPRESSION;
DOI
10.1186/s12859-017-1559-2
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification code
070307 [Chemical Biology]
Abstract
Background: Reactome aims to provide bioinformatics tools for visualisation, interpretation and analysis of pathway knowledge to support basic research, genome analysis, modelling, systems biology and education. Pathway analysis methods have a broad range of applications in physiological and biomedical research; from a performance point of view, one of the main problems for these methods is the constantly increasing size of the data samples.
Results: Here, we present a new high-performance in-memory implementation of the well-established over-representation analysis method. To achieve this, the over-representation analysis method is divided into four steps and, for each of them, specific data structures are used to improve performance and minimise the memory footprint. The first step, finding out whether an identifier in the user's sample corresponds to an entity in Reactome, is addressed using a radix tree as a lookup table. The second step, modelling the proteins, chemicals, their orthologues in other species and their composition in complexes and sets, is addressed with a graph. The third and fourth steps, which aggregate the results and calculate the statistics, are solved with a double-linked tree.
Conclusion: Through the use of highly optimised, in-memory data structures and algorithms, Reactome has achieved a stable, high-performance pathway analysis service that analyses genome-wide datasets within seconds, allowing interactive exploration and analysis of high-throughput data. The proposed pathway analysis approach is available on the Reactome production web site, either via the AnalysisService for programmatic access or via the user submission interface integrated into the PathwayBrowser. Reactome is an open data and open source project and all of its source code, including the code described here, is available in the AnalysisTools repository in the Reactome GitHub
Pages: 9
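The first and last steps described in the abstract (identifier lookup in a radix tree, then an over-representation statistic) can be illustrated with a minimal Python sketch. This is not the Reactome AnalysisTools code: the class and function names, the identifiers and the pathway names are all hypothetical, the graph and double-linked tree steps are omitted, and the abstract does not name the statistical test, so the hypergeometric upper tail used here is an assumption chosen because it is a common textbook choice for over-representation analysis.

```python
import os
from math import comb


class RadixNode:
    """Node of a radix (compressed prefix) tree; names are hypothetical."""

    def __init__(self):
        self.edges = {}    # first character of edge label -> (label, child)
        self.value = None  # payload stored for a complete key, else None


def insert(root, key, value):
    """Insert key -> value, splitting an edge when only a prefix matches."""
    node = root
    while key:
        entry = node.edges.get(key[0])
        if entry is None:
            leaf = RadixNode()
            leaf.value = value
            node.edges[key[0]] = (key, leaf)
            return
        label, child = entry
        common = len(os.path.commonprefix([label, key]))
        if common < len(label):
            # Split the edge: the shared prefix leads to a new middle node.
            mid = RadixNode()
            mid.edges[label[common]] = (label[common:], child)
            node.edges[key[0]] = (label[:common], mid)
            child = mid
        node, key = child, key[common:]
    node.value = value


def lookup(root, key):
    """Return the value stored for key, or None if key is absent."""
    node = root
    while key:
        entry = node.edges.get(key[0])
        if entry is None or not key.startswith(entry[0]):
            return None
        label, node = entry
        key = key[len(label):]
    return node.value


def hypergeom_pvalue(found, sample_size, pathway_size, universe_size):
    """P(X >= found) when drawing sample_size entities without replacement
    from a universe in which pathway_size entities belong to the pathway."""
    upper = min(sample_size, pathway_size)
    tail = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, sample_size - k)
        for k in range(found, upper + 1)
    )
    return tail / comb(universe_size, sample_size)


# Hypothetical lookup table: identifier -> set of pathway names.
tree = RadixNode()
insert(tree, "P12345", {"Pathway A"})
insert(tree, "P12999", {"Pathway A", "Pathway B"})
insert(tree, "Q00001", {"Pathway B"})

# Hypothetical user sample; identifiers not in the tree are ignored.
sample = ["P12345", "P12999", "X99999"]
hits = {}
for identifier in sample:
    for pathway in lookup(tree, identifier) or ():
        hits[pathway] = hits.get(pathway, 0) + 1
```

With the per-pathway hit counts in `hits`, a call such as `hypergeom_pvalue(hits["Pathway A"], sample_size, pathway_size, universe_size)` gives a raw p-value for one pathway; the sizes must come from the real data set, and a multiple-testing correction would normally follow.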