Internship topic, advanced bachelor (banaba) in bioinformatics, 2015-2016: Semantic disambiguation of pharma and biotech companies in clinical studies and patents
Abstract 2019-2020: Creation of a knowledge graph for biomarker-disease associations
Biomarkers, or biological markers, are a broad subcategory of biomedical indicators. They are meant to be objective, quantifiable characteristics of biological processes, states or conditions, and are intended for use in biomedical research and clinical decision making. Although biomarkers are often discussed in the scientific literature, the level of detail and the quality of their descriptions and annotations vary widely between sources. The absence of a minimum information standard and of a harmonized data model that clearly describes the concept of a biomarker makes it difficult to verify, integrate and analyze the available biomarker information. The aim of this project is to develop a semantic model that captures the minimal information needed to clearly identify and describe biomarkers as meaningful data. As a next step, the model is applied to assess the quality of the biomarker data provided by selected data sources and to integrate this data into a large knowledge graph. This enables the re-use and interoperability of harmonized biomarker data in a research or clinical setting.
The project started by investigating the variety of biomarker annotations in the scientific literature. Based on this, the requirements for a minimum biomarker information model were defined. This minimal model was then extended by defining additional data properties and relations between entities within the biomarker domain. The model was created as a semantic model in OWL (Web Ontology Language), with a major emphasis on linking the biomarker properties to existing, widely used biomedical ontologies. In addition, the FAIR (Findable, Accessible, Interoperable and Re-usable) data principles were applied to make the data sufficiently FAIR for a number of predefined use cases.
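As an illustration of how a minimum information model can be used to assess the quality of a biomarker record, the following is a minimal Python sketch. The field names, the set of required fields and the completeness score are assumptions made for this example; the abstract does not list the properties of the actual model, and the identifiers used are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BiomarkerRecord:
    """Hypothetical minimal biomarker record (field names are illustrative)."""
    name: Optional[str] = None
    biomarker_type: Optional[str] = None       # e.g. "diagnostic" or "prognostic"
    disease_id: Optional[str] = None           # identifier from an existing ontology
    measured_entity_id: Optional[str] = None   # e.g. a gene or protein identifier
    source: Optional[str] = None               # where the record was extracted from


# Assumed set of fields that a record must provide to be considered minimal.
REQUIRED = ("name", "disease_id", "measured_entity_id", "source")


def missing_fields(record):
    """Return the required fields that are empty or absent in the record."""
    return [field for field in REQUIRED if not getattr(record, field)]


def completeness(record):
    """Fraction of required fields that are filled in (0.0 to 1.0)."""
    return 1 - len(missing_fields(record)) / len(REQUIRED)


# Placeholder values, not real biomarker data.
example = BiomarkerRecord(name="marker A", disease_id="EX:disease-1", source="example source")
print(missing_fields(example))
print(completeness(example))
```

A check of this kind could be run per data source to compare how well each source covers the minimal model before integration.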
Data was extracted from several sources describing biomarkers and processed with Python scripts to fit the ontology model. The processed data, describing biomarkers in a harmonized manner, was then ready for integration and subsequent use in two different ways. First, the data was represented in RDF (Resource Description Framework) format and merged into a semantic graph database, which allows a wide variety of search questions to be answered by constructing SPARQL queries. Second, the data was ingested into DISQOVER, a linked data search, navigation and analysis platform for life sciences and healthcare developed by ONTOFORCE. In DISQOVER, the integrated biomarker data is further enriched with 145 semantically harmonized public data sources available via DISQOVER data federation. Links are made automatically with instances of the gene, protein, variant, disease and publication data types, which are derived from core resources such as PubMed, NCBI Gene, UniProt and SNOMED CT. As a result, the modelled biomarker data is merged into this large pre-existing knowledge graph and made available for searching, filtering and analytics via the user interface. Dashboards were prepared for different personas, such as researchers, clinicians and patients, to address a wide variety of use cases.
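The step from a harmonized record to RDF can be sketched as follows. This is not the project's actual script: the namespace `http://example.org/biomarker/`, the class name `Biomarker`, the property `associatedWithDisease` and the record keys are hypothetical. Only the Python standard library is used, and the output is written in the N-Triples syntax, which a graph database could then load.

```python
# Hypothetical namespace for this sketch; the real ontology IRIs are not given in the abstract.
BASE = "http://example.org/biomarker/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def escape_literal(text):
    """Escape backslashes, quotes and line breaks for an N-Triples string literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def record_to_triples(record):
    """Map one harmonized record (a dict with assumed keys) to a list of triples."""
    subject = BASE + record["id"]
    return [
        (subject, RDF_TYPE, ("iri", BASE + "Biomarker")),
        (subject, RDFS_LABEL, ("literal", record["name"])),
        (subject, BASE + "associatedWithDisease", ("iri", record["disease_iri"])),
    ]


def to_ntriples(triples):
    """Serialize triples as N-Triples, one statement per line."""
    lines = []
    for subject, predicate, (kind, value) in triples:
        obj = f"<{value}>" if kind == "iri" else f'"{escape_literal(value)}"'
        lines.append(f"<{subject}> <{predicate}> {obj} .")
    return "\n".join(lines)


# Placeholder record, not real biomarker data.
record = {
    "id": "bm1",
    "name": "marker A",
    "disease_iri": "http://example.org/disease/d1",
}
print(to_ntriples(record_to_triples(record)))
```

In practice an RDF library would usually handle serialization; the hand-written version here only makes the subject, predicate and object structure of each statement explicit.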
After this traineeship, this project will be continued and further developed within ONTOFORCE.
Technologiepark 122 (3F)