Appendix (Technical details)
A1 Development of structure and function-based annotations of proteomes
The foundation of the e-Protein project is the set of component bioinformatics resources developed at the three sites. Under e-Protein, each group has developed its database and associated computational resources,
keeping pace with the rapid expansion in sequenced proteomes and with novel methods of annotation. The component databases receive over 2,000 hits to their web pages per month. Here we describe the current status of the component bioinformatics
resources, and then report on the DAS server that provides a common front end for use by the community.
The 3D-Genomics database at Imperial College was developed as part of the e-Protein project. 3D-Genomics contains annotations for the protein sequences of over 200 genomes (Fleming et al., 2004).
Structural annotations inferred from homology to proteins of known structure (PDB entries and SCOP domains) have been supplemented by functional annotations, namely GO (Gene Ontology) terms and Pfam. The modular architecture of the 3D-Genomics
annotation pipeline allows novel annotation software to be integrated without recomputing existing annotations. Several novel annotation methods have been integrated over the course of the project, including assignment of GO terms,
intermediate sequence search (ISS) to identify remote homologues, and fold recognition using Phyre. We are currently able to keep pace with the rate of genome sequencing using resources from the London e-Science Centre. 3D-Genomics has
been used to analyse the organisation of proteins in the human genome (Mayor et al., 2004).
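The idea of adding annotation methods without recomputing existing results can be illustrated with a minimal sketch. This is not the 3D-Genomics implementation: the function names, the in-memory store and the toy annotators below are all hypothetical stand-ins for the real methods (GO assignment, ISS, Phyre) and the real database.

```python
from typing import Callable, Dict, Tuple

# Hypothetical annotator type: takes a protein sequence, returns an annotation.
Annotator = Callable[[str], str]


def run_pipeline(
    sequences: Dict[str, str],
    annotators: Dict[str, Annotator],
    store: Dict[Tuple[str, str], str],
) -> int:
    """Run every annotator on every sequence, skipping stored results.

    Returns the number of annotations that actually had to be computed.
    """
    computed = 0
    for seq_id, sequence in sequences.items():
        for method, annotate in annotators.items():
            key = (seq_id, method)
            if key in store:
                continue  # existing annotation: do not recompute
            store[key] = annotate(sequence)
            computed += 1
    return computed


# Toy annotators (assumptions for illustration only).
def toy_length(sequence: str) -> str:
    return str(len(sequence))


def toy_composition(sequence: str) -> str:
    return "".join(sorted(set(sequence)))


def toy_first_residue(sequence: str) -> str:
    return sequence[0]
```

In this sketch, registering a new method in the `annotators` mapping and rerunning the pipeline computes only the annotations that are missing from the store.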
The Genomic Threading Database (GTD) at UCL was established during the course of the e-Protein project (McGuffin et al., 2004a; McGuffin et al., 2004b). Improved Web interfaces to the database have been
developed to allow general users to carry out searches, download data and view summary statistics. Over 218 proteomes have been annotated to date, using distributed versions of the GenTHREADER fold recognition method. The major development that
has occurred in the past year is the automation of GTD updates. Password-protected interfaces have been developed to allow trusted users to update and maintain the database automatically through a web browser. This allows a) submission
of new or updated proteomes to clusters for annotation and subsequent tracking of annotation progress, b) configuration of the resulting raw annotation data and automatic upload of the data into the GTD, and c) automatic generation of useful statistics
concerning the uploaded data, i.e. coverage and reliability of annotations and fold frequencies per genome. GTD has been extended to include the prediction of disordered regions (Ward et al., 2004a, 2004b) and enhanced protein structure prediction
(Jones and McGuffin, 2003; Ward et al., 2003; McGuffin and Jones, 2003).
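The per-genome statistics mentioned in step c) could be computed along the following lines. This is a sketch under stated assumptions, not the GTD code: coverage is taken here to mean the fraction of proteins with at least one annotation at or above a confidence threshold, and the record layout, threshold and identifiers are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical record: (protein_id, fold_id, confidence between 0 and 1).
Annotation = Tuple[str, str, float]


def summarise(
    proteome_size: int,
    annotations: List[Annotation],
    min_confidence: float,
) -> Dict[str, object]:
    """Summarise coverage and fold frequencies for one proteome."""
    if proteome_size <= 0:
        raise ValueError("proteome_size must be positive")
    # Keep only annotations considered reliable (assumed threshold rule).
    reliable = [a for a in annotations if a[2] >= min_confidence]
    covered = {protein_id for protein_id, _, _ in reliable}
    return {
        "coverage": len(covered) / proteome_size,
        "fold_frequencies": Counter(fold for _, fold, _ in reliable),
    }
```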
The Gene3D database currently holds over 200 completed genomes (Pearl et al., 2005; Lee et al., 2005; Sillitoe et al., 2005). Gene3D aims to determine the domain architecture of each member sequence. Gene3D is
built by searching a library of hidden Markov models (HMMs) representing over 1,500 CATH and over 8,000 Pfam families against the sequences of the completed genomes using a SAM-t99 protocol (Sillitoe et al., 2005). CATH domain matches are given priority
over Pfam domain matches because CATH domains are identified from structure, which is generally considered a more reliable basis for delineating protein domains than identification from sequence.
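The priority rule can be sketched as a simple greedy resolution of overlapping matches. This is an illustration only: the source does not describe how Gene3D resolves conflicts in detail, so the strict no-overlap rule, the E-value tie-break and the family names below are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DomainMatch:
    source: str    # "CATH" or "Pfam"
    family: str    # placeholder family name, not a real identifier
    start: int     # 1-based, inclusive
    end: int       # 1-based, inclusive
    evalue: float


def overlaps(a: DomainMatch, b: DomainMatch) -> bool:
    """True if the two matches share at least one residue."""
    return a.start <= b.end and b.start <= a.end


def assign_domains(matches: List[DomainMatch]) -> List[DomainMatch]:
    """Accept matches greedily: CATH before Pfam, then lowest E-value first."""
    ranked = sorted(matches, key=lambda m: (m.source != "CATH", m.evalue))
    accepted: List[DomainMatch] = []
    for match in ranked:
        # Assumed rule: reject any match overlapping an accepted one.
        if not any(overlaps(match, kept) for kept in accepted):
            accepted.append(match)
    return sorted(accepted, key=lambda m: m.start)
```

Under this rule a Pfam match is discarded when it overlaps a CATH match, even if its E-value is better, and is kept when it covers a region with no CATH assignment.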
Functional annotation was tackled at the EBI. Two sources of functional annotation have been made available via DAS: the Catalytic Site Atlas (CSA) (Porter et al., 2003) and ligand information obtained from PDBsum
(SAS) (Laskowski, 2001). The CSA consists of a set of sequences whose catalytic residues are annotated by reference to the primary literature, and extended by homology to cover all sequences contained in UniProt (Apweiler, 2004). Similarly, Sequences
Annotated by Structure (SAS) annotates possible ligand-binding regions in a sequence by homology to those in crystallised protein structures where a ligand is observed to be bound. In contrast to model-based approaches, which make predictions by statistical
matching of patterns, each annotation from CSA or SAS can be traced back to the primary literature for evaluation. In this respect, the annotations provided by the CSA and SAS servers are derived from sources external to sequence databases, and so the information
they provide about protein function is complementary to that from statistical modelling approaches. The CSA indicates potential catalytic sites, and thus provides additional information in regions that may already be annotated as catalytic by another DAS server.
Todd et al. (2001) showed that there is good agreement between the functions of enzymes at 30% sequence identity, and so the Enzyme Commission classification of the matched enzyme suggests possible roles for the sequence under study. The ligand information from
SAS indicates potential bound metals and cofactors, providing clues to function, the active region of a protein, and mechanism.
A2 The Distributed Annotation System (DAS) within the e-Protein project
The major bioinformatics aim of e-Protein is to provide a single front end for the community to the above data resources. The e-Protein project decided at an early stage to make the annotations from each of the contributing
sites available through the Distributed Annotation System (DAS) server. This way, no central repository of feature annotations would be necessary, and each site would retain full control over updates and modifications to its data.
Additionally, no agreement on a database schema would be needed, as a DAS annotation server is database and schema agnostic. This approach has since been adopted by other prominent projects for the same or similar reasons. The DAS server
is now operational and provides the required single front end to the distributed proteome annotation resources. The client is accessible to the community at www.e-protein.org/e-proteindastypr.html (Flash must be installed) or via
the main page of the e-Protein web site.
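The way a client can combine features from independent annotation servers, with no shared schema, can be sketched as follows. The XML documents are simplified, hand-written examples modelled on the DAS features response format (element names such as SEGMENT, FEATURE, TYPE, START and END follow our reading of the DAS specification and should be checked against it). The source names, segment identifier and feature values are hypothetical, and no network request is made.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical responses from two annotation sources for the same sequence.
SAMPLE_RESPONSES: Dict[str, str] = {
    "structure_source": """<DASGFF><GFF version="1.0">
<SEGMENT id="EXAMPLE_1" start="1" stop="300">
<FEATURE id="f1" label="domain"><TYPE id="domain">domain</TYPE>
<START>10</START><END>120</END></FEATURE>
</SEGMENT></GFF></DASGFF>""",
    "function_source": """<DASGFF><GFF version="1.0">
<SEGMENT id="EXAMPLE_1" start="1" stop="300">
<FEATURE id="f2" label="site"><TYPE id="catalytic_site">site</TYPE>
<START>57</START><END>57</END></FEATURE>
<FEATURE id="f3" label="ligand"><TYPE id="ligand_binding">ligand</TYPE>
<START>150</START><END>160</END></FEATURE>
</SEGMENT></GFF></DASGFF>""",
}


def parse_features(source: str, xml_text: str) -> List[dict]:
    """Extract a flat list of features from one features document."""
    root = ET.fromstring(xml_text)
    features = []
    for segment in root.iter("SEGMENT"):
        for feature in segment.iter("FEATURE"):
            features.append({
                "source": source,
                "segment": segment.get("id"),
                "type": feature.find("TYPE").get("id"),
                "start": int(feature.findtext("START")),
                "end": int(feature.findtext("END")),
            })
    return features


def merge_annotations(responses: Dict[str, str]) -> List[dict]:
    """Merge features from all sources, ordered along the sequence."""
    merged: List[dict] = []
    for source, xml_text in responses.items():
        merged.extend(parse_features(source, xml_text))
    return sorted(merged, key=lambda f: (f["start"], f["end"], f["source"]))
```

Each source only has to emit the common exchange format; the client never needs to know how a site stores its data.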
A3 Distributed Computing Resource Management
JYDE - The UCL team investigated a new distributed computing resource management system that could cope with the scale of the problems we were facing in e-Protein. The design of such a system has to
acknowledge the size of the processing problem and the diversity of application types in need of support. Our work on these issues has so far resulted in the creation of the Job Yield Distribution Environment (JYDE). The system uses the Sun Grid Engine
(SGE and SGEEE) and Condor as local schedulers on member clusters and relies on these to provide information about local resource availability. This information is then used to select where to send individual tasks. JYDE's only concerns are resource selection and load
balancing between clusters; local management is left to the local schedulers. JYDE consists of two parts. The Grid Distribution Manager (GriDM) acts as a broker. It is a distributed P2P system based on JXTA and performs global resource discovery and scheduling services.
The Job Portal (JPortal) is responsible for task preparation and distribution. It breaks user requests into tasks and issues requests to the GriDM for processing permits. On receiving a permit, the JPortal takes responsibility for delivering the task to the selected cluster
and recovering any data resulting from the processing. All tasks are owned by a proxy user, so clusters deal with only a single user entity. Authentication, access and priority issues are handled internally within JYDE. JYDE has been used to support e-Protein activities locally
at UCL for most of the duration of the project.
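The division of labour between the broker and the portal can be illustrated with a small, single-process sketch. It is not the JYDE code: the class and method names are hypothetical, the selection policy (most free slots first) is an assumption, and the P2P layer, the local schedulers and the proxy user are not modelled.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Cluster:
    name: str
    free_slots: int  # availability as reported by the local scheduler


class Broker:
    """Stand-in for the GriDM role: grants permits for a chosen cluster."""

    def __init__(self, clusters: List[Cluster]) -> None:
        self.clusters: Dict[str, Cluster] = {c.name: c for c in clusters}

    def request_permit(self) -> Optional[str]:
        candidates = [c for c in self.clusters.values() if c.free_slots > 0]
        if not candidates:
            return None  # no capacity anywhere: the task must wait
        # Assumed policy: most free slots first, name as a tie-break.
        best = min(candidates, key=lambda c: (-c.free_slots, c.name))
        best.free_slots -= 1
        return best.name


class Portal:
    """Stand-in for the JPortal role: splits requests and places tasks."""

    def __init__(self, broker: Broker, chunk_size: int) -> None:
        self.broker = broker
        self.chunk_size = chunk_size

    def submit(
        self, sequence_ids: List[str]
    ) -> Tuple[List[Tuple[str, List[str]]], List[List[str]]]:
        """Return (placements, pending) for one user request."""
        tasks = [
            sequence_ids[i:i + self.chunk_size]
            for i in range(0, len(sequence_ids), self.chunk_size)
        ]
        placements, pending = [], []
        for task in tasks:
            cluster = self.broker.request_permit()
            if cluster is None:
                pending.append(task)
            else:
                placements.append((cluster, task))
        return placements, pending
```

The broker here decides only where a task goes; how the task is run on the selected cluster would remain the job of that cluster's own scheduler.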
GridSAM - Imperial explored the use of GridSAM within its ICENI package to capture the workflow of the proteome annotation pipeline and map it to multiple Grid resources, providing the capability of true resource brokering. At
Imperial, two ICENI components (a Batch Binary and a JDML Provider) were developed, which are capable of creating a copy of the proteome pipeline in ICENI and executing it sequentially. We have also developed a Synchronisation Component to manage data flow between
stages of the e-Protein pipeline. This model enables us to reuse database access code provided by Keiran Fleming (Structural Bioinformatics Group), which has been integrated into the e-Protein workflow. Components use this code to place output directly into a
database.