Contact details
Traineeship proposition

Traineeship topic, advanced bachelor (banaba) in bioinformatics 2015-2016: Semantic disambiguation of pharma and biotech companies in clinical studies and patents

Abstract 2019-2020: Creation of a knowledge graph for biomarker-disease associations

Biomarkers, or biological markers, are a broad subcategory of biomedical indicators. They are meant to be objective, quantifiable characteristics of biological processes, states or conditions, and are intended for use in biomedical research and clinical decision making. Although biomarkers are often discussed in the scientific literature, the level of detail and the quality of their descriptions and annotations vary widely between sources. The absence of a minimum information standard and of a harmonized data model that clearly describes the concept of a biomarker makes it hard to verify, integrate and analyze the available biomarker information. The aim of this project is to develop a semantic model that captures the minimal information needed to clearly identify and describe biomarkers as meaningful data. In a next step, the model is applied to assess the quality of the biomarker data provided by selected data sources and to integrate this data into a large knowledge graph. This enables the re-use and interoperability of harmonized biomarker data in a research or clinical setting.

The project started by investigating the variety of biomarker annotations in the scientific literature. Based on this, the requirements for a minimum biomarker information model were defined. This minimal model was then extended with additional data properties and relations between entities within the biomarker domain. The model was created as a semantic model in OWL (Web Ontology Language), with a major emphasis on linking the biomarker properties to existing, widely used biomedical ontologies. In addition, the FAIR (Findable, Accessible, Interoperable and Re-usable) data principles were applied to make the data sufficiently FAIR for a number of predefined use cases.

Data was extracted from several sources describing biomarkers and processed with Python scripts to fit the ontology model. The processed data, which describes biomarkers in a harmonized manner, was then integrated and used in two ways. First, the data was represented in RDF (Resource Description Framework) format and merged into a semantic graph database, which makes it possible to answer a wide variety of search questions by constructing SPARQL queries. Second, the data was ingested into DISQOVER, a linked data search, navigation and analysis platform for life sciences and healthcare developed by ONTOFORCE. In DISQOVER, the integrated biomarker data is further enriched with 145 semantically harmonized public data sources available via DISQOVER data federation. Links are made automatically with instances of the gene, protein, variant, disease and publication data types, which are derived from core resources such as PubMed, NCBI Gene, UniProt and SNOMED CT. As a result, the modelled biomarker data is merged into this large pre-existing knowledge graph and made available for searching, filtering and analytics via the user interface. Dashboards are prepared for different personas, such as researchers, clinicians and patients, to address a wide variety of use cases.
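The harmonization step mentioned above, in which source data is processed with Python to fit the model and is checked against a minimum set of information, could look roughly like the sketch below. It is not the project's actual script: the required fields, the source names and the column mappings are all hypothetical and only illustrate the idea.

```python
# Hypothetical minimal model: the fields every harmonized record must carry.
REQUIRED_FIELDS = ("biomarker_name", "biomarker_type", "disease")

# Hypothetical per-source column names mapped onto the model's field names.
FIELD_MAPS = {
    "source_a": {"Marker": "biomarker_name", "Kind": "biomarker_type", "Condition": "disease"},
    "source_b": {"name": "biomarker_name", "category": "biomarker_type", "indication": "disease"},
}


def harmonize(source: str, raw: dict) -> dict:
    """Rename source-specific fields to model fields and trim whitespace."""
    mapping = FIELD_MAPS[source]
    return {
        mapping[key]: value.strip()
        for key, value in raw.items()
        if key in mapping and value is not None
    }


def missing_fields(record: dict) -> list:
    """Return the required fields that are absent or empty in a record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


# Invented example records, not taken from any real data source.
raw_records = [
    ("source_a", {"Marker": " Marker X ", "Kind": "protein", "Condition": "Disease Y"}),
    ("source_b", {"name": "Marker Z", "category": "", "indication": "Disease Y", "extra": "ignored"}),
]

for source, raw in raw_records:
    record = harmonize(source, raw)
    print(record, "missing:", missing_fields(record))
```

Keeping the mapping in a plain dictionary per source means a new data source only needs a new mapping entry, while the completeness check stays the same for all sources.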

After this traineeship, the project will be continued and further developed within ONTOFORCE.

Thesis summary 1, 2014-2015: Integration and semantic conversion of autism-related gene data in a linked data environment
The work described in this thesis was performed in collaboration with ONTOFORCE. The aim was to convert freely accessible data from AutDB and SFARI Gene, two databases containing curated gene information related to autism, into a semantic web format.
The source data was only accessible in HTML format, so scraping techniques had to be developed to capture the data from the databases. The data was first converted to CSV and subsequently to RDF, one of the standard formats for linked open data. The intermediate CSV step made it possible to build a multipurpose CSV-to-RDF program. The RDF-converted data, represented in the typical triple format, will be ready for integration into DISQOVER, a knowledge search system for life science and healthcare data developed by ONTOFORCE and based on semantic web and linked open data principles. By integrating the AutDB and SFARI Gene data into DISQOVER, people active in autism research will be able to get a more integrated overview of the genetic information in that field.
To reach this goal, the scraping and conversion process was implemented in Python. Python and SPARQL, an SQL-like language for querying semantic web data, were learned during the internship through tutorials in order to reach a high scripting level. The semantic web principles were also taught, to provide good insight into the semantic web. The project ended successfully at the stage where the data will shortly be ready for upload into DISQOVER. This will further enrich the data already present and generate new links between data that can be crucial in further research projects.
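The HTML-to-CSV-to-RDF pipeline described above can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the thesis code: the sample page, the column names and the `http://example.org/gene/` namespace are invented, and the real AutDB and SFARI Gene pages will have different markup.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of every <tr> in an HTML document."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text chunks of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_csv(html: str) -> str:
    """Scrape all table rows from the HTML and return them as CSV text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()


def csv_to_ntriples(csv_text: str, base: str, key: str) -> list:
    """Generic conversion: one subject per row, one predicate per column."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{base}{row[key]}>"
        for column, value in row.items():
            if column != key and value:
                escaped = value.replace("\\", "\\\\").replace('"', '\\"')
                lines.append(f'{subject} <{base}{column}> "{escaped}" .')
    return lines


# Invented sample page; not the real AutDB or SFARI Gene markup.
SAMPLE_HTML = """
<table>
  <tr><th>gene_symbol</th><th>chromosome</th></tr>
  <tr><td>GENE1</td><td>7</td></tr>
  <tr><td>GENE2</td><td>X</td></tr>
</table>
"""

csv_text = html_table_to_csv(SAMPLE_HTML)
print(csv_text)
print("\n".join(csv_to_ntriples(csv_text, "http://example.org/gene/", "gene_symbol")))
```

Because the second function only depends on the CSV header, the same converter can be reused for any table that has a key column, which is the benefit of the intermediate CSV step.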
Through this thesis, hopefully new “links for lives” will be found that can be used to search smarter for new insights to boost autism research. (Ontoforce, 2015)
Thesis summary 2, 2014-2015: Integration and semantic conversion of oncological mutation data in a linked data environment
Semantic Web programming is a relatively new technology in informatics. A lot of big data is available over the WWW, stored in databases. The Semantic Web offers a powerful and practical approach to gaining mastery over this multitude of information and information services. Semantics provide the leverage to make more information an advantage rather than an overwhelming burden.
There are a few requirements for creating a Semantic Web: new data representations are needed, along with some knowledge of computer science and insight into the data. Data in a Semantic Web expresses its own meaning, and the extra information attached to it adds a whole new dimension.
This project concerns the integration and semantic conversion of somatic mutations in cancer; the data is taken from the Catalogue Of Somatic Mutations In Cancer (COSMIC) database. The purpose of the project is to convert this somatic mutation data into an environment in which searches are easier and more efficient. A Semantic Web is very interesting for cancer research. Every day scientists discover new technologies, methods and medicines to treat tumors. By storing this data and creating a Semantic Web, new solutions can be discovered and new technologies and methods can be developed.
A Semantic Web is developed by creating an ontology of the COSMIC classification. On the basis of this ontology, the full COSMIC data can be converted into a linked data environment.
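As a rough illustration of what an ontology of a classification amounts to, the sketch below turns a nested classification into `rdfs:subClassOf` statements in N-Triples form. The tree, the class names and the `http://example.org/cosmic/` namespace are assumptions made for this example and do not reproduce the actual COSMIC classification or the ontology built in the thesis.

```python
RDFS_SUBCLASS = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
BASE = "http://example.org/cosmic/"  # hypothetical namespace

# Invented classification tree: each key is a class, each value its subclasses.
CLASSIFICATION = {
    "PrimarySite": {"Lung": {}, "Breast": {}},
    "Histology": {"Carcinoma": {"Adenocarcinoma": {}}},
}


def subclass_triples(tree: dict, parent=None) -> list:
    """Walk the tree depth-first and emit one subClassOf triple per child."""
    triples = []
    for name, children in tree.items():
        if parent is not None:
            triples.append(f"<{BASE}{name}> <{RDFS_SUBCLASS}> <{BASE}{parent}> .")
        triples.extend(subclass_triples(children, parent=name))
    return triples


hierarchy = subclass_triples(CLASSIFICATION)
print("\n".join(hierarchy))
```

Once such a class hierarchy exists, individual mutation records can be typed with these classes, which is what allows the full data set to be converted consistently.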


Technologiepark 122 (3F)
9052 Zwijnaarde


Traineeship supervisor
Filip Pattyn