Databases
3D-GENOMICS (at Imperial College London)
A MySQL-based relational database of protein structure and function annotation across different genomes. The analysis pipeline currently focuses on protein sequences, for which we perform several analysis steps: identification of transmembrane regions, coiled coils, low-complexity regions, PROSITE patterns, Pfam and SCOP domains, repeats and homologous sequences, and secondary structure prediction. Structural information (fold classification) is assigned to the genome sequences via homology, using BLAST, PSI-BLAST and our in-house software 3D-PSSM.
Genomic Threading Database (at UCL)
The database will contain structural annotations for genes within key completed genomes, made using a modified version of our recently developed GenTHREADER software.
GenTHREADER is a new, fast and powerful fold recognition method that can be applied either to whole translated genomic sequences (proteomes) or to individual protein sequences, as in the case of the PSIPRED server. When GenTHREADER was applied to the genome of Mycoplasma genitalium, it predicted that ~47% (presently 61%) of coding regions showed a significant relationship to proteins of known structure. Consequently, these regions could be modelled on known folds. It is important to note that, unlike most threading methods such as the original THREADER, GenTHREADER attempts to make inferences about possible evolutionary relationships. This allows false positive predictions to be filtered out, producing a more reliable overall prediction (Jones, 1999).
Gene3D (at UCL)
The Gene3D database currently holds over 200 completed genomes (Pearl et al., 2005; Lee et al., 2005; Sillitoe et al., 2005). Gene3D aims to determine the domain architecture of each member sequence. It is built by searching a library of hidden Markov models (HMMs) representing over 1500 CATH and over 8000 Pfam families against the sequences of the completed genomes, using a SAM-T99 protocol (Sillitoe et al., 2005). CATH domain matches are given priority over Pfam domain matches because CATH domains are identified from structure, which is generally considered a more reliable basis for delineating protein domains than sequence.
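The priority rule described above can be sketched in a few lines of Python. This is only an illustration, not the Gene3D implementation: the DomainMatch record, the family names, the coordinates and the rule that any overlap discards the lower-priority match are all assumptions made for the example, since the text does not say how overlaps are actually resolved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainMatch:
    """One HMM hit on a genome sequence (hypothetical record layout)."""
    source: str    # "CATH" or "Pfam"
    family: str    # family identifier (placeholder names below)
    start: int     # first residue of the match, 1-based inclusive
    end: int       # last residue of the match, inclusive
    evalue: float  # match E-value reported by the HMM search


def overlaps(a: DomainMatch, b: DomainMatch) -> bool:
    """True if the two matches share at least one residue."""
    return a.start <= b.end and b.start <= a.end


def resolve_architecture(matches):
    """Pick non-overlapping matches, preferring CATH over Pfam.

    Assumption: within the same source, the lower E-value wins, and any
    overlap with an already accepted match rejects the candidate.
    """
    ranked = sorted(matches, key=lambda m: (m.source != "CATH", m.evalue))
    accepted = []
    for candidate in ranked:
        if not any(overlaps(candidate, kept) for kept in accepted):
            accepted.append(candidate)
    # Report the domains in sequence order, i.e. as a domain architecture.
    return sorted(accepted, key=lambda m: m.start)


matches = [
    DomainMatch("Pfam", "PF_example_A", 10, 120, 1e-30),
    DomainMatch("CATH", "cath_example_1", 15, 110, 1e-10),
    DomainMatch("Pfam", "PF_example_B", 130, 250, 1e-12),
]
architecture = resolve_architecture(matches)
print([(m.source, m.family) for m in architecture])
```

In this toy input the CATH match is kept even though the overlapping Pfam match has a better E-value, and the second Pfam match fills the region that no CATH domain covers.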
Catalytic Site Atlas (at EBI)
The Catalytic Site Atlas (CSA) is a database documenting enzyme active sites and catalytic residues in enzymes of known 3D structure. It defines a classification of catalytic residues which includes only those residues thought to be directly involved in some aspect of the reaction catalysed by an enzyme. The CSA contains two types of entry:
Original hand-annotated entries, derived from the primary literature. References for these entries are given.
Homologous entries, found by PSI-BLAST alignment (using an E-value cut-off of 0.00005) to one of the original entries. The equivalent residues, which align in sequence to the catalytic residues of the original entry, are documented.
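How a homologous entry might be derived from an original one can be sketched as follows. This is a minimal illustration and not the CSA pipeline: the function names, the toy alignment and the convention that a hit is accepted when its E-value is at or below the cut-off are assumptions, and the sketch does not check whether the aligned residue is conserved.

```python
EVALUE_CUTOFF = 0.00005  # cut-off quoted in the text


def map_residues(aligned_annotated, aligned_homologue, catalytic_positions):
    """Map catalytic positions through a gapped pairwise alignment.

    Both strings come from the same alignment and use '-' for gaps.
    Positions are 1-based and count residues only, not gap characters.
    Returns {annotated_position: (homologue_position, homologue_residue)}.
    """
    mapping = {}
    a_pos = 0  # residues seen so far in the hand-annotated sequence
    h_pos = 0  # residues seen so far in the homologous sequence
    for a_char, h_char in zip(aligned_annotated, aligned_homologue):
        if a_char != "-":
            a_pos += 1
        if h_char != "-":
            h_pos += 1
        # Only columns where both sequences have a residue can be transferred.
        if a_char != "-" and h_char != "-" and a_pos in catalytic_positions:
            mapping[a_pos] = (h_pos, h_char)
    return mapping


def transfer_annotation(evalue, aligned_annotated, aligned_homologue,
                        catalytic_positions):
    """Transfer catalytic residues only for hits that pass the cut-off."""
    if evalue > EVALUE_CUTOFF:
        return {}
    return map_residues(aligned_annotated, aligned_homologue,
                        catalytic_positions)


# Toy alignment (invented sequences): annotated entry vs. a PSI-BLAST hit.
annotated = "MKH-DSA"
homologue = "MRHGD-A"
catalytic = {3, 4, 5}  # H3, D4 and S5 of the annotated sequence

accepted = transfer_annotation(1e-10, annotated, homologue, catalytic)
rejected = transfer_annotation(1e-3, annotated, homologue, catalytic)
print(accepted)
print(rejected)
```

Here H3 and D4 map to positions 3 and 5 of the homologue, S5 is dropped because it aligns to a gap, and the second hit yields nothing because its E-value is above the cut-off.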
ProFunc (at EBI)
The ProFunc server has been developed to help identify the likely biochemical function of a protein from its three-dimensional structure. It uses both sequence- and structure-based methods (see below) to try to provide clues as to the protein's likely or possible function. Often, where one method fails to provide any functional insight, another may be more helpful.