Paralogous genes are genes that descend from a progenitor gene which duplicated within an ancestral genome, each copy having diverged prior to speciation. Because comprehensive information is available on the functions of Escherichia coli proteins, analysis of sequence-related E. coli paralogous proteins can give information on the early ancestors of protein families now present in many contemporary organisms, such as the enzymes of metabolism, some kinds of transport mechanisms and some kinds of regulatory mechanisms. In the first step, we confirmed that E. coli contains a very high proportion of paralogous proteins. Next, we defined two main classes of paralogous proteins. One class consists of proteins that contain a single structural segment homologous to a single set of related proteins. The other class consists of proteins that contain more than one structural segment of homology, each segment being homologous to a different, unrelated set of proteins. We define such an independent structural segment of homology as a module. A module (mean length 209 amino acids) often corresponds to an entire protein, but some proteins appear to be assembled from two or three modules having independent origins. Most multimodular proteins appear to have been formed early in their history; a minority appear to be relatively recent fusions of independent modules. Examining 1404 independent structural segments of homology, comprising both modules and entire proteins, we found that the segments fell into 352 sequence-related groups or families. The majority of these families (which range from 2 to 62 members) are functionally homogeneous. This strongly suggests that the 1404 present-day modules and proteins derive from a minimal set of 352 ancestral modules, each already of the same size as, and similar in function to, all members of its progeny. © 1997 Academic Press Limited.