EMPWR: Computational Exploration of Molecules in the Context of Biological Pathway Networks

From HMS Genetics Department Wiki

Project Description

This proposal is to combine BioPAX[1] and the instance Store (iS)[2], two cutting-edge research efforts, in order to explore a molecule's context in a network of different biological pathways. The recently developed BioPAX standard for biological pathway exchange is an ontology written in the Web Ontology Language (OWL)[3] to describe genes, proteins, enzymes, and small molecules in metabolic, signal transduction, and molecular interaction pathways. An ontology is an explicit formal specification of how to represent the semantics of the objects and concepts that are understood to exist in some area of interest, and of the relationships that hold among them[4]. The recently developed instance Store takes instances of classes formalized in OWL and, combined with the ontology itself, uses reasoning technology to ask questions and make inferences about these instance data.

Current reasoning technology is designed to work with instances held in working memory, which limits its usefulness. The instance Store extends reasoning to use a database, so that the number of instances that can be reasoned over is no longer limited by what fits in working memory[2]. Pathway data have recently become available in BioPAX format from sources including BIND, KEGG, BioCyc, and PUMA2; this availability is a necessary step toward exploring the utility of these data, the instance Store, and reasoning technology.

OWL is underpinned by strong semantics based upon a fragment of first order logic[3]. This means that statements in OWL have a precise meaning, and this strong formalism dictates that the collection of statements about the domain made in OWL can be reasoned over computationally. Ontologies written in OWL can be checked for consistency and contradictions [5]. In addition, a reasoner is able to infer the subsumption hierarchy implied by the description of classes in the ontology. Similarly, it is possible to reason that an instance with a particular description belongs to a particular class. It is this formalism of the OWL language that will enable data instances in the BioPAX format to be explored and reasoned over via the BioPAX ontology.
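As a rough illustration of what inferring a subsumption hierarchy means, the sketch below treats each class as a set of required features and says that one class subsumes another when everything it requires is also required by the other. This is a deliberately simplified stand-in for description-logic reasoning, not how an OWL reasoner works, and the class names and features are hypothetical rather than taken from the BioPAX ontology.

```python
from __future__ import annotations

# Toy class definitions: class name -> features every member must have.
# All names here are hypothetical and only illustrate the idea.
CLASS_DEFINITIONS = {
    "PhysicalEntity": frozenset(),
    "Protein": frozenset({"has_sequence"}),
    "Enzyme": frozenset({"has_sequence", "catalyzes_reaction"}),
    "SmallMolecule": frozenset({"has_chemical_formula"}),
}


def subsumes(parent: str, child: str) -> bool:
    """Return True if every feature required by parent is required by child."""
    return CLASS_DEFINITIONS[parent] <= CLASS_DEFINITIONS[child]


def inferred_superclasses(name: str) -> set[str]:
    """Return all other classes whose definitions subsume the named class."""
    return {
        other
        for other in CLASS_DEFINITIONS
        if other != name and subsumes(other, name)
    }


if __name__ == "__main__":
    for class_name in CLASS_DEFINITIONS:
        print(class_name, "is a kind of", sorted(inferred_superclasses(class_name)))
```

Nothing in the definitions states that an Enzyme is a Protein; the relationship follows from the descriptions alone, which is the sense in which a reasoner infers the hierarchy.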

The purpose of the BioPAX standard is to provide a common language for data export and integration of heterogeneous pathway databases. It uses an ontology written in OWL to describe biological pathways at three different levels. Level 1 describes metabolic pathways. These include biochemical reactions, enzymes, metabolites, and other reactions. Level 2 describes molecular interactions such as those obtained through yeast two-hybrid experiments[6]. Level 3 describes signal transduction pathways. The ontology is used as a schema into which instance data from compliant databases can be semantically mapped for transfer between databases or to form a knowledge base consistent with the BioPAX ontology. In this proposal, we wish to use cutting-edge Semantic Web technology to form such a knowledge base.

Intellectual Merit

The combination of these two nascent technologies will enable the knowledge captured in the BioPAX OWL schema to realize its full potential rather than simply being used as an exchange language. As each data instance is added to the iS, it is classified to its place within the ontology[2]. This means that the BioPAX ontology can easily be used to explore the data in a biologically meaningful manner – the BioPAX ontology captures a community's understanding of the pathway domain. This classification process will also capture those instances not consistent with the ontology, therefore either improving data quality or highlighting novel facts. New classes of information can be dynamically created to retrieve instances, e.g., all the pathways from a species in which a specific chemical is involved. Such queries are very easy using the iS and a reasoner. The main feature, however, is the systematic, consistent use of the data via the ontology and reasoner – it easily finds those data that do not fit the ontology and highlights novel or obscure findings.

Exploratory Nature of Proposed Work

This combination of BioPAX and the instance Store will enable exploration in two ways: it should facilitate the exploration of molecules within the biological context of pathways, and it combines two bleeding-edge technologies in a novel manner to address a grand challenge within post-genomic biology. The exploratory nature of the proposed work falls into the following areas:

• Technology – Will the instance Store work; be scalable; etc.?
• Standards – Is the BioPAX standard able to adequately represent knowledge of biological pathways such that all data required for reasoning can be generated?
• Capabilities – Can we ask and answer biologically interesting questions?

One of the grand challenges within bioinformatics is the ability to gain a holistic view of all the entities forming a molecular context, an area of growing importance in pharmaceutical development. To explore the context of a molecular entity, it is necessary to move between gene networks, protein networks, and signal networks, often using small molecules as intermediates. All these entities need to be first-class citizens in an interlinked representation of biological pathways. This necessary precondition is supplied by BioPAX, and the instance Store has the potential to enable biologically relevant exploration of these data.

Prior Work

In recent work at Manchester, for example, we have used an ontology of the phosphatase family of proteins and a store of protein instances to produce insights into the phosphatase complement of a genome[7].

In the ontology, each class of phosphatase is defined by its domain composition. This means it is possible to recognize to which class an individual phosphatase belongs purely by its domain composition. We take the protein complement of a cell, perform an INTERPRO scan of those proteins and write out descriptions of each protein's domain composition in OWL. These are placed in the instance Store and we use the ontology and the reasoner to ask to which classes any protein belongs. We have done this for two species and found our performance at classification to match human curators and at a speed that is several orders of magnitude faster. In addition, the systematic, thorough nature of the survey is able to detect phosphatases that do not conform to the classification. This kind of biological insight is difficult, if not impossible, to produce with standard database technology.
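The classification step described above can be sketched in a few lines, under heavy assumptions: classes are reduced to sets of required domains, a protein is assigned to the most specific classes whose domains it contains, and proteins that match nothing are set aside for review. The class names, domain names, and protein identifiers are hypothetical, and a real reasoner working over OWL descriptions does considerably more than this subset test.

```python
# Hypothetical class definitions: class name -> domains a member must contain.
PHOSPHATASE_CLASSES = {
    "TyrosinePhosphatase": {"PTP_catalytic"},
    "ReceptorTyrosinePhosphatase": {"PTP_catalytic", "transmembrane"},
    "DualSpecificityPhosphatase": {"DSP_catalytic"},
}


def classify_protein(domains):
    """Return the most specific classes whose required domains are all present."""
    matches = {
        name
        for name, required in PHOSPHATASE_CLASSES.items()
        if required <= domains
    }
    # Drop a match when another match requires a strict superset of its domains.
    return {
        name
        for name in matches
        if not any(
            PHOSPHATASE_CLASSES[name] < PHOSPHATASE_CLASSES[other]
            for other in matches
        )
    }


def survey(proteins):
    """Split proteins into those that were classified and those that were not."""
    classified, unclassified = {}, []
    for protein_id, domains in proteins.items():
        classes = classify_protein(set(domains))
        if classes:
            classified[protein_id] = classes
        else:
            unclassified.append(protein_id)
    return classified, unclassified


if __name__ == "__main__":
    # Hypothetical domain compositions, standing in for domain-scan output.
    example_proteins = {
        "P1": ["PTP_catalytic", "transmembrane", "fibronectin_III"],
        "P2": ["DSP_catalytic"],
        "P3": ["PTP_catalytic"],
        "P4": ["unknown_domain"],
    }
    print(survey(example_proteins))
```

Because every protein goes through the same test, the unclassified list is produced as a side effect of the survey rather than by separate inspection.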

In preparation for this work, the investigators have explored converting pathway data into BioPAX OWL format through unpaid collaboration. This has shown us that it is possible to gather the wide range of data necessary for this collaboration.

Goals/Visions for the Future

The instance Store is under constant development and the queries and inferences it can handle are expanding. We have used the instance Store to ask simple questions over a small amount of data (thousands of proteins). We wish to use the developing instance Store in conjunction with complex BioPAX data to explore biological pathway data and to make novel and interesting biological inferences. This fulfills one of the key objectives of the pathways community from which the BioPAX initiative arose: to enable the creation of an Open Source pathway resource, in this case a knowledge base[8]. Currently, the instance Store is one of the few technologies potentially capable of handling the large amounts of complex, interlinked instance data that the BioPAX format enables. By large datasets we mean those whose instances exceed the capacity of working memory.

We wish to ask questions such as:

• What metabolites do a collection of pathways have in common?
• What are the direct and indirect interactions of the entities involved in all kinds of pathways?
• What is the context of a molecule within a variety of pathway types?
• What is the effect of describing variations on classes and instances within the BioPAX model?
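To make the first of these questions concrete, here is a minimal sketch that answers it with plain set operations over a handful of in-memory records. The `Pathway` record, the function names, and the heavily truncated metabolite lists are assumptions made for illustration; they are not the BioPAX data model or the instance Store query interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pathway:
    """Hypothetical, much-simplified pathway record."""
    name: str
    organism: str
    metabolites: frozenset


# Illustrative data only; the metabolite lists are far from complete.
PATHWAYS = [
    Pathway("glycolysis", "Mus musculus",
            frozenset({"glucose", "glucose 6-phosphate", "pyruvate", "NADH"})),
    Pathway("pentose phosphate pathway", "Mus musculus",
            frozenset({"glucose 6-phosphate", "NADPH", "ribose 5-phosphate"})),
    Pathway("citric acid cycle", "Mus musculus",
            frozenset({"acetyl-CoA", "citrate", "NADH"})),
    Pathway("pentose phosphate pathway", "Bos taurus",
            frozenset({"glucose 6-phosphate", "NADPH", "ribose 5-phosphate"})),
]


def common_metabolites(pathways):
    """Return the metabolites shared by every pathway in the collection."""
    metabolite_sets = [pathway.metabolites for pathway in pathways]
    if not metabolite_sets:
        return set()
    return set(frozenset.intersection(*metabolite_sets))


def pathways_involving(pathways, organism, metabolite):
    """Return names of pathways in one organism that involve a metabolite."""
    return [
        pathway.name
        for pathway in pathways
        if pathway.organism == organism and metabolite in pathway.metabolites
    ]


if __name__ == "__main__":
    print(common_metabolites(PATHWAYS[:2]))
    print(pathways_involving(PATHWAYS, "Mus musculus", "NADPH"))
```

The second function mirrors the kind of dynamically defined class mentioned earlier, all pathways from a species in which a specific chemical is involved, expressed here as a simple filter rather than as a class description handed to a reasoner.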

The current work at Manchester on resistance to the parasitic disease trypanosomiasis in cattle[9] is an excellent example of the kind of problem one encounters when dealing with complex interactions across many kinds of pathway, where the molecules involved in a process have complex relationships to each other. Experiment-based research papers often concentrate upon a single experiment that tests a particular hypothesis. For example, in both mouse and cattle models of trypanosomiasis it is known that the resistance/susceptibility phenotype is controlled by one or more genes lying within a quantitative trait locus (QTL)[9]. However, the QTL region contains over 50 genes, and it is not known which of these genes are responsible for the resistance phenotype. Microarray gene expression experiments have demonstrated that trypanosomiasis-resistant mice have upregulated genes involved in superoxide production[9]. Researchers would like to understand whether there is a significant link between these two pieces of information. In one paper it is found that one of the genes in the QTL is important in controlling the production of the metabolite NADPH. In another paper it is found that a defect in the enzyme encoded by this gene reduces NADPH production, which reduces the body's ability to produce superoxides. Connecting these facts involves jumping from gene networks, through metabolic pathways and metabolites, to an observed phenotype, an activity not supported by current bioinformatics databases. One must also jump from publication to publication to find the necessary information, an effort that could be greatly reduced if the facts were held in a single knowledge base together with a reasoner that finds and interprets them.
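The chain of facts in this example can be pictured as a path through a graph of labelled relationships. The sketch below, which is only an illustration and not the instance Store's mechanism, stores each fact as a (source, relation, target) triple and uses a breadth-first search to recover the chain from gene to phenotype. The node names and relation labels are hypothetical placeholders, since the specific gene and enzyme are not named here.

```python
from collections import deque

# Hypothetical facts, each imagined as coming from a different source.
FACTS = [
    ("QTL gene (placeholder)", "encodes", "enzyme (placeholder)"),
    ("enzyme (placeholder)", "controls production of", "NADPH"),
    ("NADPH", "is needed to produce", "superoxide"),
    ("superoxide", "is linked to", "trypanosomiasis resistance"),
]


def find_chain(facts, start, goal):
    """Breadth-first search over facts; return the chain of triples or None."""
    outgoing = {}
    for source, relation, target in facts:
        outgoing.setdefault(source, []).append((relation, target))

    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, chain = queue.popleft()
        if node == goal:
            return chain
        for relation, target in outgoing.get(node, []):
            if target not in visited:
                visited.add(target)
                queue.append((target, chain + [(node, relation, target)]))
    return None


if __name__ == "__main__":
    chain = find_chain(FACTS, "QTL gene (placeholder)", "trypanosomiasis resistance")
    for source, relation, target in chain or []:
        print(f"{source} {relation} {target}")
```

The search follows relationships only in the direction they are stated, so asking for a chain from NADPH back to the enzyme returns nothing; whether and how to traverse inverse relationships is a modelling decision left out of this sketch.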

One of the objectives of this work will be to take the exemplars where biological pathways have been elucidated in order to connect genotype to phenotype and recreate them within The Instance Store Pathway Exploratory. We wish, through this work on a broad pathway exploratory, to create a virtuous circle of exciting biology and a driving force for innovative computer science. Support for this grant would bring together two new, cutting edge developments in the computer science and bioinformatics arenas in a venture full of potential.

Availability of Work Product

All aspects of this work will be freely available. The instance Store is Open Source software for reasoning over instances in a database, and it is available over the internet. Both the data and the converters producing those data will be freely available. A web site will be created containing links to all these resources, the use cases gathered, project details, and the publications deriving from this work. All parts of the BioPAX standard are already freely available.

Broader Impact

This work will have a major impact on the development of Semantic Web technology, because the life sciences provide the vast quantities of data and the use cases that are needed both to test and to drive forward the development of next-generation Web technology. The World Wide Web Consortium (W3C) has recently recognized the life sciences by holding a workshop in Cambridge, Massachusetts, to learn about their relevance, and Sir Tim Berners-Lee, Director of the W3C and inventor of the World Wide Web, gave the keynote presentation at BioIT World. An official life science initiative is underway. In addition, the proposed work will have a major impact on multiple challenging areas in biological research, because biological pathways are present in all living organisms and therefore in every organism under study. Understanding and reasoning over these data will thus enhance life science research in drug discovery, where it will provide a way of exploring the context in which a particular drug target lies; in disease research such as cancer, where it will facilitate the understanding of cancer pathways and so lead to better diagnosis and treatment; and in environmental research such as bioremediation, where integrated data will facilitate analyses of metabolic pathways such as Flux Balance Analysis.

Plan of Work and Time Line


References

1. http://www.biopax.org

2. Horrocks, I., et al. The Instance Store: DL reasoning with large numbers of individuals. in Proc. of the 2004 Description Logic Workshop. 2004.

3. Horrocks, I., P. Patel-Schneider, and F. van Harmelen, From SHIQ and RDF to OWL: The making of a web ontology language. Journal of Web Semantics, 2003. 1(1): p. 7-26.

4. Gruber, T.R. Towards principles for the design of ontologies used for knowledge sharing. in Proc of Int. Workshop on Formal Ontology. 1993.

5. Stevens, R., et al., Building a Bioinformatics Ontology Using OIL. IEEE Transactions on Information Technology and Biomedicine, Jun 2002. 6(2): p. 135-41.

6. Interactions among members of the Bcl-2 protein family analyzed with a yeast two-hybrid system. Proc Natl Acad Sci U S A, 1994. 91(20).

7. Wolstencroft, K., et al., Intelligent Classification of Proteins Using an Ontology. 2005.

8. Luciano, J.S., PAX of mind for pathway researchers. Drug Discovery Today, Jul 2005. 10(13): p. 937-942.

9. Black, S.J., et al., Innate and acquired control of trypanosome parasitaemia in Cape buffalo. International Journal for Parasitology, 2001. 31: p. 562-65.

Investigators

The Principal Investigator for Harvard Medical School is Dr. Joanne Luciano.

The Principal Investigator for the University of Manchester (UK) is Dr. Robert Stevens.
