Databases

3D-GENOMICS (at Imperial College London)

A MySQL-based relational database of protein structure and function annotation across different genomes. The analysis pipeline currently focuses on protein sequences, which pass through several analysis steps: identification of transmembrane regions, coiled coils, low-complexity regions, PROSITE patterns, Pfam and SCOP domains, repeats and homologous sequences, and secondary structure prediction. Structural information (fold classification) is assigned to genome sequences via homology, using BLAST, PSI-BLAST and our in-house software 3D-PSSM.

 

Genomic Threading Database (at UCL)

The database will contain structural annotations for genes within key completed genomes, made using a modified version of our recently developed GenTHREADER software.

GenTHREADER is a new, fast and powerful fold recognition method that can be applied either to whole translated genomic sequences (proteomes) or to individual protein sequences, as in the case of the PSIPRED server. When GenTHREADER was applied to the genome of Mycoplasma genitalium, it predicted that ~47% (presently 61%) of coding regions showed a significant relationship to proteins of known structure; these regions could consequently be modelled on known folds. Unlike most threading methods, such as the original THREADER, GenTHREADER attempts to make inferences about possible evolutionary relationships. This allows false-positive predictions to be filtered out, producing a more reliable overall prediction (Jones, 1999).

 

Gene3D (at UCL)

The Gene3D database currently holds over 200 completed genomes (Pearl et al., 2005; Lee et al., 2005; Sillitoe et al., 2005). Gene3D aims to determine the domain architecture of each member sequence. It is built by scanning the sequences of the completed genomes against a library of hidden Markov models (HMMs) representing over 1500 CATH and over 8000 Pfam families, using a SAM-T99 protocol (Sillitoe et al., 2005). CATH domain matches are given priority over Pfam domain matches because CATH domains are identified from structure, which is generally considered a more reliable basis for delineating protein domains than sequence.
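The priority rule described above can be illustrated with a small sketch. This is not the Gene3D implementation: the `DomainMatch` record, the family identifiers, and the rule that any overlap at all rejects a lower-priority match are assumptions made for illustration only. Matches are ranked with CATH ahead of Pfam (ties broken by E-value), and a match is kept only if it does not overlap one already accepted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainMatch:
    source: str    # "CATH" or "Pfam"
    family: str    # hypothetical family identifier
    start: int     # 1-based, inclusive
    end: int       # 1-based, inclusive
    evalue: float  # HMM match E-value


def overlaps(a, b):
    """True if the two matches share at least one residue."""
    return a.start <= b.end and b.start <= a.end


def resolve_architecture(matches):
    """Keep non-overlapping matches, preferring CATH over Pfam."""
    # CATH sorts first (False < True); within a source, lower E-value first.
    ranked = sorted(matches, key=lambda m: (m.source != "CATH", m.evalue))
    accepted = []
    for match in ranked:
        if not any(overlaps(match, kept) for kept in accepted):
            accepted.append(match)
    # Report the architecture in sequence order.
    return sorted(accepted, key=lambda m: m.start)


# Hypothetical matches on one protein sequence.
matches = [
    DomainMatch("Pfam", "pfam_example_1", 5, 120, 1e-40),
    DomainMatch("CATH", "cath_example_1", 10, 110, 1e-20),
    DomainMatch("Pfam", "pfam_example_2", 150, 260, 1e-15),
]
architecture = resolve_architecture(matches)
for m in architecture:
    print(m.source, m.family, m.start, m.end)
```

In this example the first Pfam match is dropped despite its lower E-value, because it overlaps the CATH match; the second Pfam match covers a region with no CATH assignment and is retained.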

 

Catalytic Site Atlas (at EBI)

The Catalytic Site Atlas (CSA) is a database documenting enzyme active sites and catalytic residues in enzymes of known 3D structure. It defines a classification of catalytic residues that includes only those residues thought to be directly involved in some aspect of the reaction catalysed by an enzyme. The CSA contains two types of entry:

  • Original hand-annotated entries, derived from the primary literature. References for these entries are given.
  • Homologous entries, found by PSI-BLAST alignment (using an E-value cut-off of 0.00005) to one of the original entries. The equivalent residues, which align in sequence to the catalytic residues found in the original entry, are documented.
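As a rough sketch of how homologous entries of this kind could be derived, the code below filters hits by the E-value cut-off quoted above and transfers catalytic residue positions across a gapped pairwise alignment. It is not the CSA pipeline: the function names, the hit records, the toy sequences, and the choice to treat the cut-off as inclusive are all assumptions for illustration.

```python
EVALUE_CUTOFF = 0.00005  # cut-off quoted in the text


def passes_cutoff(evalue, cutoff=EVALUE_CUTOFF):
    """Assumption: hits at or below the cut-off are accepted."""
    return evalue <= cutoff


def map_catalytic_residues(aligned_original, aligned_homologue, catalytic_positions):
    """Map 1-based catalytic positions in the original sequence to the
    aligned position and residue in the homologue. Positions aligned
    to a gap in the homologue are omitted."""
    mapping = {}
    orig_pos = 0
    homo_pos = 0
    for orig_char, homo_char in zip(aligned_original, aligned_homologue):
        if orig_char != "-":
            orig_pos += 1
        if homo_char != "-":
            homo_pos += 1
        both_residues = orig_char != "-" and homo_char != "-"
        if both_residues and orig_pos in catalytic_positions:
            mapping[orig_pos] = (homo_pos, homo_char)
    return mapping


# Hypothetical hits against one hand-annotated entry.
hits = [
    {"id": "hit_A", "evalue": 1e-30},
    {"id": "hit_B", "evalue": 1e-3},
]
kept = [hit["id"] for hit in hits if passes_cutoff(hit["evalue"])]

# Toy alignment: original residues 3, 5 and 6 are annotated as catalytic.
equivalent = map_catalytic_residues("MKH-SDE", "MRHASD-", {3, 5, 6})
print(kept)
print(equivalent)
```

Here only `hit_A` passes the cut-off, and original residue 6 has no equivalent because it aligns to a gap in the homologue.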

    ProFunc (at EBI)

    The ProFunc server has been developed to help identify the likely biochemical function of a protein from its three-dimensional structure. It uses both sequence- and structure-based methods (see below) to try to provide clues as to the protein's likely or possible function. Often, where one method fails to provide any functional insight, another may be more helpful.

     

     

    Data sources

    Source                  Data type (genome)           Date/version stamp
    Ensembl                 Anopheles gambiae            MOZ2
    TIGR                    Arabidopsis thaliana         23/03/04
    WormBase                Caenorhabditis briggsae      vCB25
    WormBase                Caenorhabditis elegans       v116
    Joint Genome Institute  Ciona intestinalis           v1.0
    Ensembl                 Danio rerio                  v3
    Ensembl                 Drosophila melanogaster      MOZ2
    SwissProt               Encephalitozoon cuniculi     23/03/04
    Singapore               Fugu rubripes                v6.28
    SwissProt               Guillardia theta             23/03/04
    Ensembl                 Homo sapiens                 NCBI 34
    Ensembl                 Mus musculus                 NCBI m32
    Broad Institute         Neurospora crassa            v3
    TIGR                    Oryza sativa                 23/03/04
    SwissProt               Plasmodium falciparum        23/03/04
    Ensembl                 Rattus norvegicus            RGSC 3.1
    SwissProt               Saccharomyces cerevisiae     23/03/04
    SwissProt               Schizosaccharomyces pombe    23/03/04

    SwissProt               Archaeal genomes             23/03/04
    SwissProt               Bacterial genomes            23/03/04

    Funding

    BBSRC has provided substantial funds from its Bioinformatics and E-Science Programme (BEP). BBSRC wishes to encourage the use of GRID technology in research areas underpinning innovation, prosperity and improvement in the quality of life, especially genomics, structural biology, dynamic processes in cells and biodiversity. BBSRC has made a significant investment in bioinformatics research in recent years to establish a small but strong bioinformatics research community and now wishes to build on this investment.

    DTI has provided additional funds from its Harnessing Genomics programme to facilitate industrial contacts throughout the programme and to extend the grant by 3 months to produce a technical report on the product that would integrate and expand on academic publications. In particular, the funding would support two meetings: the first a consultancy exercise with UK industry, and the second a three-day industrial workshop to showcase the product at the end of the grant.