Research

Knowledge Management in Bioinformatics is a new research area covering all methods for the management and analysis of complex biological data sets. Topics include knowledge representation and extraction, specialized data structures, modeling of complex domains, intelligent methods for information retrieval, and semantic database integration.

The professorship for Knowledge Management in Bioinformatics at the Humboldt University in Berlin was established in October 2002. Research is mainly devoted to the following topics:


Data Integration

The integration of molecular biology databases has been considered one of the key challenges in bioinformatics since the early days of the Human Genome Project. Integrated data sets provide new insights, can be used to cross-validate experimental studies, and reveal relationships that are not detectable otherwise. Information sources are available in abundance: several hundred biological databases are freely accessible to researchers, not counting those that are only commercially available. However, data integration is a difficult task due to the heterogeneity of the resources. Data in biological databases describes many different "things": DNA sequences, protein structures, biological pathways, protein interactions, gene expression arrays, references to literature, genomic maps, taxonomies and ontologies, etc.

The goal of combining these different things under a unified interface is the driving force behind data integration, but the heterogeneity is also its main obstacle because of the many syntactic, structural, and semantic differences between sources. Different sources often have different understandings of such common concepts as function, gene, or locus. These differences must be properly accounted for to avoid data sets that are integrated but meaningless due to poor data quality.

We work on several aspects of data integration, especially methods for semantic database integration, integration of structured and unstructured sources, query optimization in distributed scenarios, query transformation using query correspondences, and automatic schema discovery.

Text Mining

Despite the many databases available in bioinformatics, most of the available knowledge is still encoded in publications, i.e., in unstructured natural language. Due to the reputation associated with publications (and not yet with database entries), this is not likely to change in the near future. But extracting this knowledge for more than a handful of objects, such as genes, is difficult since it requires the parsing of human language. Text mining uses a combination of algorithms from machine learning, natural language processing, statistics, and data mining to automatically discover biological objects, their relationships, and their properties in text databases. Discovered facts may then be combined with structured data from databases.

Our group develops tools for the identification of biological objects in text (genes, proteins, diseases, etc.), the detection of relationships between these objects (protein-protein interaction, gene regulation, etc.), and the extraction of further properties of objects and relationships (kinetic parameters, direction of interactions, contextual knowledge, etc.).
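To illustrate the first two tasks, the following minimal Python sketch tags entity mentions with a dictionary lookup and reports pairs of entities that occur in the same sentence. It is not the group's method: the lexicon, the function names `find_entities` and `cooccurring_pairs`, and the example text are all assumptions made for illustration, and real systems typically rely on machine learning and deeper linguistic analysis rather than plain string matching and co-occurrence.

```python
import re
from itertools import combinations

# Hypothetical toy lexicon: lowercase surface form -> (entity type, canonical name).
LEXICON = {
    "brca1": ("gene", "BRCA1"),
    "rad51": ("gene", "RAD51"),
    "p53": ("protein", "TP53"),
    "breast cancer": ("disease", "breast cancer"),
}

# Longer names are tried first so multi-word entries are not shadowed by shorter ones.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(LEXICON, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def find_entities(sentence):
    """Return (start, end, type, canonical_name) for each lexicon match."""
    hits = []
    for m in _PATTERN.finditer(sentence):
        etype, name = LEXICON[m.group(1).lower()]
        hits.append((m.start(), m.end(), etype, name))
    return hits


def cooccurring_pairs(text):
    """Return the set of entity pairs mentioned together in one sentence."""
    pairs = set()
    # Naive sentence splitting on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        names = sorted({name for _, _, _, name in find_entities(sentence)})
        pairs.update(combinations(names, 2))
    return pairs


example = (
    "BRCA1 interacts with RAD51 in DNA repair. "
    "Mutations in p53 are common in breast cancer."
)
pairs = cooccurring_pairs(example)
```

Co-occurrence only suggests that two objects may be related; deciding whether a sentence actually states an interaction, and in which direction, requires the relationship extraction methods described above.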

Complex databases: modeling, scalability, interfaces

Historically, most data in molecular biology was stored in flat files. Almost all major public databases started as such, including GenBank, SWISS-PROT, PDB, etc. Today, many of these databases are still distributed as flat files, but are internally managed using relational database management systems (RDBMS). This shift was mainly motivated by the advantages that RDBMS offer in terms of availability, tool support, scalability, accessibility, archival, etc. However, commercial RDBMS have been developed with transactional applications in mind, and certain properties of molecular biology data sets are not optimally supported, such as hierarchical or graph-based data, large and structured controlled vocabularies (ontologies), very large sequences, tight integration of data storage with data analysis, high flexibility in adding annotations to every piece of data, and the management of versioned, imprecise, or inconsistent data. In many cases, it is not trivial to determine the best way to support a biomedical application using an RDBMS.

We are interested in all ways to support bioinformatics with databases. This includes specialized data models, development of tailored data structures and access methods, annotation management, performance optimization, support for special domains such as proteomics, expression data, or network data, and the integration of workflow and database technologies.
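As a small illustration of why hierarchical data needs extra care in a relational system, the sketch below stores a toy ontology as an edge table in SQLite and retrieves all ancestors of a term with a recursive query. The schema, the term identifiers, and the function name `ancestors` are assumptions for this example only; this is one common workaround, not a description of any particular database or of the group's approach.

```python
import sqlite3

# In-memory database with a hypothetical toy ontology (identifiers are invented).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE term (id TEXT PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE is_a (
        child  TEXT REFERENCES term(id),
        parent TEXT REFERENCES term(id)
    );
    INSERT INTO term VALUES
        ('T1', 'biological process'),
        ('T2', 'metabolic process'),
        ('T3', 'DNA metabolic process'),
        ('T4', 'DNA repair');
    INSERT INTO is_a VALUES ('T2', 'T1'), ('T3', 'T2'), ('T4', 'T3');
    """
)


def ancestors(conn, term_id):
    """Return the names of all ancestors of term_id, nearest first."""
    rows = conn.execute(
        """
        WITH RECURSIVE up(id, depth) AS (
            -- direct parents of the starting term
            SELECT parent, 1 FROM is_a WHERE child = ?
            UNION
            -- parents of every term reached so far
            SELECT is_a.parent, up.depth + 1
            FROM is_a JOIN up ON is_a.child = up.id
        )
        SELECT term.name FROM up JOIN term ON term.id = up.id
        ORDER BY up.depth
        """,
        (term_id,),
    ).fetchall()
    return [name for (name,) in rows]


result = ancestors(conn, "T4")
```

The query works, but every hierarchy traversal becomes a recursive join, and in an ontology where a term has several parents the same ancestor can be reached at more than one depth. Costs of this kind are one reason to consider the tailored data structures and access methods mentioned above.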

Knowledge Representation

Molecular biology research studies complex systems that are not yet fully understood. The representation of knowledge about these systems is a key concern of bioinformatics. Every piece of knowledge, whether gained experimentally or in silico, must be encoded in suitable data models to prevent the loss of information, to optimally support the usage of the data, and to avoid misinterpretation. This applies to various applications. To pursue data integration in the presence of heterogeneous and autonomous data sources, one has to properly represent the semantics of the data in the different sources. To support automatic data analysis in the presence of complex knowledge, annotations of biological objects must be uniformly structured, semantically consistent, and encoded in a computer-readable manner. Also, computational inference using logical representation and reasoning systems such as OWL is only feasible if the semantics of the data under study are defined precisely and unambiguously.

We investigate different forms of knowledge representation for bioinformatics. This includes efficient management and modeling of ontologies, the role of metadata in general, and the inclusion of external knowledge into the text mining process.