A parallel computing approach to creating engineering concept spaces for semantic retrieval: The Illinois Digital Library Initiative project

Cited by: 42
Authors
Chen, HC
Schatz, B
Ng, T
Martinez, J
Kirchhoff, A
Lin, CT
Affiliations
[1] UNIV ILLINOIS,NATL CTR SUPERCOMP APPLICAT,BECKMAN INST,URBANA,IL 61801
[2] UNIV ARIZONA,SCI & ENGN LIB,TUCSON,AZ 85712
[3] UNIV ARIZONA,DEPT LIB & INFORMAT STUDIES,TUCSON,AZ 85712
Funding
National Science Foundation (USA); National Aeronautics and Space Administration (USA);
Keywords
semantic retrieval; concept space; concept association; parallel computing; digital library;
DOI
10.1109/34.531798
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This research presents preliminary results generated from the semantic retrieval research component of the Illinois Digital Library Initiative (DLI) project. Using a variation of the automatic thesaurus generation techniques, which we refer to as the concept space approach, we aimed to create graphs of domain-specific concepts (terms) and their weighted co-occurrence relationships for all major engineering domains. Merging these concept spaces and providing traversal paths across different concept spaces could potentially help alleviate the vocabulary (difference) problem evident in large-scale information retrieval. We have experimented previously with such a technique for a smaller molecular biology domain (Worm Community System, with 10+ MBs of document collection) with encouraging results. In order to address the scalability issue related to large-scale information retrieval and analysis for the current Illinois DLI project, we recently conducted experiments using the concept space approach on parallel supercomputers. Our test collection included 2+ GBs of computer science and electrical engineering abstracts extracted from the INSPEC database. The concept space approach called for extensive textual and statistical analysis (a form of knowledge discovery) based on automatic indexing and co-occurrence analysis algorithms, both previously tested in the biology domain. Initial testing results using a 512-node CM-5 and a 16-processor SGI Power Challenge were promising. The Power Challenge was later selected to create a comprehensive computer engineering concept space of about 270,000 terms and 4,000,000+ links using 24.5 hours of CPU time. Our system evaluation involving 12 knowledgeable subjects revealed that the automatically created computer engineering concept space generated significantly higher concept recall than the human-generated INSPEC computer engineering thesaurus. However, the INSPEC thesaurus was more precise than the automatic concept space.
Our current work mainly involves creating concept spaces for other major engineering domains and developing robust graph matching and traversal algorithms for cross-domain, concept-based retrieval. Future work will also include generating individualized concept spaces for assisting user-specific concept-based information retrieval.
Pages: 771-782
Page count: 12