A NEW FAMILY OF POWERFUL MULTIVARIATE STATISTICAL SEQUENCE-ANALYSIS TECHNIQUES

Cited by: 67
Authors
VANHEEL, M
Affiliation
[1] Fritz Haber Institute, the Max Planck Society, W-1000 Berlin Dahlem
Keywords
MULTIVARIATE SEQUENCE ANALYSIS; SEQUENCE ALIGNMENTS; ATLAS OF SEQUENCES; STRUCTURE PREDICTION; HUMAN GENOME PROJECT;
DOI
10.1016/0022-2836(91)90360-I
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010 ; 081704 ;
Abstract
A novel multivariate statistical approach is presented for extracting and exploiting intrinsic information present in our ever-growing sequence data banks. The information extraction from the sequences avoids the pitfalls of intersequence alignment by analyzing secondary invariant functions derived from the sequences in the data bank rather than the sequences themselves. One such typical invariant function is a 20 × 20 histogram of occurrences of amino acid pairs in a given sequence or fragment thereof. To illustrate the potential of the approach, an analysis of 10,000 protein sequences from the National Biomedical Research Foundation Protein Identification Resource is presented, which already reveals great biological detail. For example, ζ-hemoglobin is found to lie close to amphibian and fish α-hemoglobin, which, in turn, is an important clue to the physiological function of this mammalian early embryonic hemoglobin. The multivariate statistical framework presented unifies such apparently unrelated issues as phylogenetic comparisons between a set of sequences and distance matrices between the constituents of the biological sequences. The Multivariate Statistical Sequence Analysis (MSSA) principles can be used for a wide spectrum of sequence analysis problems such as: assignment of family memberships to new sequences, validation of new incoming sequences to be entered into the database, prediction of structure from sequence, discrimination of coding from non-coding DNA regions, and automatic generation of an atlas of protein or DNA sequences. The MSSA techniques represent a self-contained approach to learning continuously and automatically from the growing stream of new sequences. The MSSA approach is particularly likely to play a significant role in major sequencing efforts such as the human genome project. © 1991.
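The 20 × 20 pair histogram mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify how a "pair" is defined, so the version below assumes pairs of residues separated by a fixed distance along the chain (adjacent residues by default); the names `pair_histogram`, `AMINO_ACIDS`, and `gap` are hypothetical and are not taken from the paper.

```python
# Sketch only: one possible reading of the "20 x 20 histogram of amino acid
# pairs". The pair definition (residue at position i, residue at i + gap)
# is an assumption, not the paper's stated method.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def pair_histogram(sequence, gap=1):
    """Count ordered residue pairs (a, b) where b lies `gap` positions after a.

    Returns a 20 x 20 list of lists; rows index the first residue, columns
    the second. Pairs containing a non-standard symbol are skipped.
    """
    hist = [[0] * 20 for _ in range(20)]
    seq = sequence.upper()
    for a, b in zip(seq, seq[gap:]):
        if a in INDEX and b in INDEX:
            hist[INDEX[a]][INDEX[b]] += 1
    return hist


if __name__ == "__main__":
    # Arbitrary toy fragment, used only to show the call.
    h = pair_histogram("MVLSPADKTNVKAAW")
    print(sum(map(sum, h)), "pairs counted")
```

Because the table has the same shape for every sequence regardless of its length, such histograms can be compared directly without aligning the sequences, which is the property the abstract emphasizes.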
Pages: 877-887
Number of pages: 11
Related papers
31 in total
[1]   A SENSITIVE PROCEDURE TO COMPARE AMINO-ACID-SEQUENCES [J].
ARGOS, P .
JOURNAL OF MOLECULAR BIOLOGY, 1987, 193 (02) :385-396
[2]  
Benzecri JP, 1973, ANAL DONNEES, VI
[3]   CLASSIFICATION OF IMAGE DATA IN CONJUGATE REPRESENTATION SPACES [J].
BORLAND, L ;
VANHEEL, M .
JOURNAL OF THE OPTICAL SOCIETY OF AMERICA A-OPTICS IMAGE SCIENCE AND VISION, 1990, 7 (04) :601-610
[4]   ORCHESTRATING THE HUMAN GENOME PROJECT [J].
CANTOR, CR .
SCIENCE, 1990, 248 (4951) :49-51
[5]  
CHAPMAN BS, 1980, J BIOL CHEM, V255, P9051
[6]   CONFORMATIONAL PARAMETERS FOR AMINO-ACIDS IN HELICAL, BETA-SHEET, AND RANDOM COIL REGIONS CALCULATED FROM PROTEINS [J].
CHOU, PY ;
FASMAN, GD .
BIOCHEMISTRY, 1974, 13 (02) :211-222
[7]   PREDICTION OF PROTEIN CONFORMATION [J].
CHOU, PY ;
FASMAN, GD .
BIOCHEMISTRY, 1974, 13 (02) :222-245
[8]   STRUCTURE OF THE ZETA-CHAIN OF HUMAN-EMBRYONIC HEMOGLOBIN [J].
CLEGG, JB ;
GAGNON, J .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA-BIOLOGICAL SCIENCES, 1981, 78 (10) :6076-6080
[9]  
Dayhoff MO, 1978, ATLAS PROTEIN SEQUEN, V5
[10]   STRUCTURE OF THE PROTEIN SUBUNITS IN THE PHOTOSYNTHETIC REACTION CENTER OF RHODOPSEUDOMONAS-VIRIDIS AT 3 Å RESOLUTION [J].
DEISENHOFER, J ;
EPP, O ;
MIKI, K ;
HUBER, R ;
MICHEL, H .
NATURE, 1985, 318 (6047) :618-624