VIB – Bioinformatics Core (BITS)

Contact details
Traineeship proposition
Traineeship proposal in bioinformatics (2017-2018): Development and implementation of integrative bioinformatics algorithms
The goal of the project is to support the integrative efforts at the VIB Bioinformatics Core facility, with emphasis on integrating data from VIB's research activities. Specifically, the intern will use state-of-the-art bioinformatics software to integrate proteomics, metabolomics, and transcriptomics data sets. The main focus will be on:
  • Prioritization of different integration algorithms based on current results and needs
  • Implementation of data integration scripts and parsing of Big Data files with custom scripts
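As a rough illustration of the kind of custom script meant here, the sketch below streams tab-separated omics tables row by row and joins them on a shared identifier. The file layout (a header row, an identifier in the first column, one numeric value in the second) and all function names are assumptions made for the example, not part of the project description.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, Tuple


def read_measurements(handle: Iterable[str]) -> Iterator[Tuple[str, float]]:
    """Yield (identifier, value) pairs one row at a time instead of loading the whole file."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row (assumed to be present)
    for row in reader:
        if len(row) < 2 or not row[1]:
            continue  # ignore incomplete rows
        yield row[0], float(row[1])


def integrate(layers: Dict[str, Iterable[str]]) -> Dict[str, Dict[str, float]]:
    """Collect, per identifier, the value measured in each omics layer."""
    merged: Dict[str, Dict[str, float]] = {}
    for layer_name, handle in layers.items():
        for identifier, value in read_measurements(handle):
            merged.setdefault(identifier, {})[layer_name] = value
    return merged


# Tiny in-memory stand-ins for real (much larger) files.
transcriptomics = io.StringIO("gene\tvalue\nG1\t2.5\nG2\t0.1\n")
proteomics = io.StringIO("gene\tvalue\nG1\t1.2\nG3\t4.0\n")

table = integrate({"transcriptomics": transcriptomics, "proteomics": proteomics})
# Identifiers measured in both layers are the candidates for joint analysis.
shared = sorted(gene for gene, values in table.items() if len(values) == 2)
```

With real data, the `io.StringIO` objects would be replaced by open file handles; only the merged table, not the raw files, is kept in memory.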
Traineeship topic (2010-2011)
The Bioinformatics Training and Service (BITS) facility provides various bioinformatics services to all research departments of VIB (see
BITS recently decided to invest in a new server, named 'Biobase 2'. The specifications of this server (not yet finalized) are: ~48 to 96 GB RAM, 16 to 32 cores, 2 TB storage. The goal of the project is to prepare this server for use within VIB for the analysis of biological data, with a focus on sequence data.
1) Installation of the operating system
The Linux distribution CentOS will be installed as the OS. The machine must be configured with the necessary services (ssh, httpd) for use within the BITS network.
2) Installation of the Galaxy portal (
After studying how the Galaxy portal works, it will be installed on Biobase 2. VIB users must be able to log in securely and store data. The server will get an external IP address and will be reachable over the internet.
3) Extending the Galaxy portal
The intention is to extend the Galaxy portal with software that BITS has purchased and, where applicable, with software requested by users. The following packages will be installed and integrated into Galaxy: cgatools (Linux command-line tool for the analysis of Complete Genomics data) and Real Time Genomics (command-line tool). A Galaxy interface for these tools improves their ease of use. Further extensions to consider: Cytoscape (open-source visualization platform) and R (open-source statistical package). As mentioned, the extensions are subject to change.
4) (optional) Integration of Galaxy with cloud services
Depending on decisions within BITS, we may duplicate our Galaxy portal to a cloud instance. With the experience gained from building Galaxy, this should go smoothly. In the longer term (and after consultation), each department could obtain its own cloud instance with the Galaxy portal.
5) Assisting with training sessions
After installation of the portal, the student (as system administrator of Biobase 2) will take part in a training session for VIB scientists on the use of Biobase 2. This aspect is very important for turning users' wishes into services.
Proposed traineeship topic (2010-2011) 1: installing a Linux computer with the required bioinformatics software and databases in VIB departments, and configuring these computers for remote access
  • Linux machine
  • Installation of required software: HMM, EMBOSS, Perl, R
  • Installation of required databases, with update solutions provided
  • Configuration for SSH access / remote desktop access, with user-friendliness as a priority
  • Integration with the services offered by BITS
Proposed traineeship topic (2010-2011) 2: Creating a database (MySQL) with results of the '100 genomes' project and a user-friendly interface (HTML, PHP) to retrieve data from the database
  • Creating a database (MySQL) with information on human variation (variation files of roughly 300 MB per genome)
  • Building an interface (PHP) to search the database and answer questions such as: given a particular disease, return all variations on chromosome 1 and compare them with a reference genome
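The sketch below illustrates the kind of query such an interface would run. The proposal names MySQL and PHP; for a self-contained example, Python's built-in sqlite3 module is used here as a stand-in. The table layout, column names and sample rows are hypothetical.

```python
import sqlite3

# In-memory database standing in for the MySQL server named in the proposal.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE variant (
        sample     TEXT,
        chromosome TEXT,
        position   INTEGER,
        ref_allele TEXT,   -- allele in the reference genome
        alt_allele TEXT,   -- allele observed in the sample
        disease    TEXT
    );
    CREATE INDEX idx_variant_lookup ON variant (disease, chromosome, position);
    """
)

# Hypothetical rows; real variation files would be bulk-loaded instead.
rows = [
    ("genome_1", "1", 2500, "C", "T", "disease_x"),
    ("genome_1", "1", 1000, "A", "G", "disease_x"),
    ("genome_2", "2", 300, "G", "A", "disease_x"),
    ("genome_2", "1", 700, "T", "C", "disease_y"),
]
conn.executemany("INSERT INTO variant VALUES (?, ?, ?, ?, ?, ?)", rows)


def variants_for(disease: str, chromosome: str):
    """Return (position, reference allele, observed allele) for one disease and chromosome."""
    cursor = conn.execute(
        "SELECT position, ref_allele, alt_allele FROM variant "
        "WHERE disease = ? AND chromosome = ? ORDER BY position",
        (disease, chromosome),
    )
    return cursor.fetchall()


hits = variants_for("disease_x", "1")
```

Parameterized queries (the `?` placeholders) keep user input from the web form out of the SQL text, which matters for any interface exposed to users.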
Proposed traineeship topic (2010-2011) 3: feasibility study and test installation of a BitTorrent server for use
  • Paths to investigate: integration
  • Securing the BitTorrent server: tunnelling BitTorrent traffic over SSH
Proposed traineeship topic (2010-2011) 4: feasibility study on implementing a distributed computing platform at VIB (BOINC software)
  • BOINC server
  • Implementation of a common interface
  • Certain premade algorithms …
Abstract 2018-2019 (1): Refinement of a reproducible pipeline for consolidated protein interaction data
Protein-protein interactions play a major role in biochemical processes and pathways; they are involved in processes in all sorts of cell types and enable signaling between cells. The Ian Donaldson lab has created iRefIndex (Donaldson, Magklaras, & Razick, 2008), which consolidates protein-protein interaction data from 22 source databases such as BioGRID, IntAct or InnateDB and 15 descriptive databases such as Entrez or UniProt into one database. This combined database allows the compilation of organism-specific protein interactomes that can be used by the community. It creates a unique identifier for each interactor and for every (mostly binary) interaction. The aim of this project is to build an updated version (v16) of iRefIndex. The update creates a larger data source covering more interactions, organisms, and gene aliases by adding new interaction database sources such as APID (Alonso-Lopez, et al., 2016) and descriptive data sources such as gene accessions for Arabidopsis thaliana. A suite of scripts already existed (Donaldson & Botzki, 2019) but had to be adapted for the newly included database sources. A streamlined pipeline is followed, and errors are ironed out at every step. The iRefIndex database (PostgreSQL) is initialized, and the data sources are downloaded and then parsed. Since the release of iRefIndex version 15 in early 2018, several source databases have been updated and contain new information, which required small adaptations of the scripts. For the newly added databases, scripts are written or adapted to fit into the existing pipeline. Subsequent steps include building the iRefIndex database and cleaning up the results. In the last step, version 15 and the new version are compared using the iRefIndex data files in MITAB format.
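The version comparison can be pictured as a set comparison of interaction pairs. The following minimal sketch assumes the plain PSI-MITAB layout, in which the first two tab-separated columns hold the interactor identifiers; the actual iRefIndex files carry additional columns and their own identifiers, which are ignored here. The function name and the sample lines are invented for illustration.

```python
from typing import FrozenSet, Iterable, Set


def interaction_pairs(lines: Iterable[str]) -> Set[FrozenSet[str]]:
    """Collect unordered interactor pairs from MITAB-style lines."""
    pairs: Set[FrozenSet[str]] = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and empty lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 2:
            continue
        # A frozenset makes A-B and B-A count as the same interaction.
        pairs.add(frozenset(columns[:2]))
    return pairs


old_release = [
    "#uidA\tuidB\n",
    "uniprotkb:P1\tuniprotkb:P2\n",
    "uniprotkb:P2\tuniprotkb:P3\n",
]
new_release = [
    "#uidA\tuidB\n",
    "uniprotkb:P2\tuniprotkb:P1\n",  # same pair as before, other orientation
    "uniprotkb:P2\tuniprotkb:P3\n",
    "uniprotkb:P4\tuniprotkb:P4\n",  # new self-interaction
]

old_pairs = interaction_pairs(old_release)
new_pairs = interaction_pairs(new_release)
gained = new_pairs - old_pairs  # present only in the new release
lost = old_pairs - new_pairs    # present only in the old release
```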
As the final result, the new version is successfully built and has almost one million more entries than its predecessor. An R script is used to evaluate data quality and identify missing or wrongly associated elements, such as missing PubMed IDs or authors, wrongly associated taxa, or ill-defined identifiers for interactors in place of the requested UniProt IDs. Missing PubMed IDs are inserted after manual curation. Compared with the previous version, the new version scores better on all but one quality parameter. Further research might also improve this last parameter, the missing UniProt IDs for interactors. In summary, the newly built database has an overall higher quality than the previous version and contains readily available gene accessions for Arabidopsis thaliana genes. Unfortunately, the inclusion of the APID database had to be dropped because the downloaded datasets did not describe the type of protein-protein interaction. Alerting the authors of the database did not result in an updated version during the internship.
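One of the quality checks described above, flagging records without a PubMed ID, could look roughly like the sketch below. The project used an R script; this Python version is only an illustration. It assumes the PSI-MITAB 2.5 convention that the ninth column holds publication identifiers such as `pubmed:123`, separated by `|`, with `-` for an empty field. The helper names and sample records are hypothetical.

```python
from typing import Iterable, List


def missing_pubmed(lines: Iterable[str]) -> List[int]:
    """Return the 1-based numbers of lines that carry no pubmed: identifier."""
    missing: List[int] = []
    for number, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue  # skip header and empty lines
        columns = line.rstrip("\n").split("\t")
        publication = columns[8] if len(columns) > 8 else "-"
        entries = publication.split("|")
        if not any(entry.startswith("pubmed:") for entry in entries):
            missing.append(number)
    return missing


def record(publication: str) -> str:
    """Build a nine-column dummy record; only the publication column matters here."""
    return "\t".join(["a", "b", "-", "-", "-", "-", "-", "-", publication]) + "\n"


records = [
    record("pubmed:123"),
    record("-"),
    record("doi:10.1/x|pubmed:456"),
    record("doi:10.1/y"),
]
flagged = missing_pubmed(records)  # candidates for manual curation
```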
Bibliography
Alonso-Lopez, M., Gutierrez, M., Lopes, K., Prieto, C., Santamaria, R., & De Las Rivas, J. (2016). APID interactomes: providing proteome-based interactomes with controlled quality for multiple species and derived networks. Nucleic Acids Research.
Donaldson, I., & Botzki, A. (2019). Irefindex Github page. Retrieved from GitHub:
Donaldson, I., Magklaras, G., & Razick, S. (2008). iRefIndex: A consolidated protein interaction database with provenance. BMC Bioinformatics.
Abstract 2018-2019 (2): Understanding the relevance of aminoacyl-tRNA synthetases in dominant peripheral neuropathies
Hereditary peripheral neuropathies comprise different subgroups based on the loss of motor and sensory neurons. Dominant-intermediate Charcot–Marie–Tooth neuropathy, the most common human inherited neurological disorder, is characterized by axonal degeneration and demyelination of peripheral nerves. Dominant mutations in tRNA synthetases, which are essential for ligating a tRNA molecule to the appropriate amino acid during protein translation, cause Charcot–Marie–Tooth neuropathy. In the past, several studies suggested a loss of aminoacylation (enzymatic) function as the cause. More recently, a toxic gain of function has been considered the more likely underlying problem. The aim of the project is to find conserved co-expression patterns between orthologs across species, providing a highly relevant list of candidate genes that potentially share similar functions and act in the same pathway. Conservation across distantly related species, human and fly, suggests a mechanism that is essential, since it has survived a long evolutionary distance. Transcriptomics data are generated by performing total RNA sequencing on Drosophila brains expressing control and mutant constructs of the YARS gene. In addition, genomic ortholog data are used in combination with the transcriptomics data, and other datasets, such as genetic modifier screens and YARS/GARSvsWT, are accessed and integrated. The CoExpNetViz tool (Tzfadia et al., 2016) is used for cross-species co-expression analysis, to search for genes involved in a common pathway or process, or to find functional orthologs of the bait genes. CoExpNetViz takes as input a set of bait genes, at least one gene expression matrix for each species and, optionally, a gene family file. The algorithm generates output files, including files to visualize the co-expression network in Cytoscape. Once the network is created and the appropriate layout is applied, different properties of the co-expression network are investigated.
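The idea of bait-driven co-expression can be illustrated with a simplified sketch: correlate every gene's expression profile with that of a bait gene and keep the genes above a threshold. This is not CoExpNetViz's implementation; the single-species setup, the fixed Pearson threshold, the choice to count strong negative correlation as co-expression, and the gene names are all assumptions made for the example.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equally long profiles (0.0 if one profile is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def coexpressed_with(
    bait: str, expression: Dict[str, List[float]], threshold: float = 0.9
) -> List[str]:
    """Genes whose absolute correlation with the bait reaches the threshold."""
    return sorted(
        gene
        for gene, profile in expression.items()
        if gene != bait and abs(pearson(expression[bait], profile)) >= threshold
    )


# Hypothetical expression matrix: one row per gene, one column per sample.
expression = {
    "bait": [1.0, 2.0, 3.0, 4.0],
    "gene_a": [2.1, 4.0, 6.2, 8.1],  # rises with the bait
    "gene_b": [4.0, 3.0, 2.0, 1.0],  # falls as the bait rises
    "gene_c": [1.0, 3.0, 1.0, 3.0],  # only weakly related
}
partners = coexpressed_with("bait", expression)
```

Running the same procedure per species and linking the resulting partners through gene families would give a cross-species view of the kind described above.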
Further downstream analysis is executed through Cytoscape plugins. MCODE and ClusterViz are used to find highly interconnected regions in the large network. Corresponding GO annotations (Ensembl) and enriched terms (DAVID) are added to the genes in the clusters. A second Gene Set Enrichment Analysis (GSEA) is run with the fgsea package in RStudio to obtain more fine-grained p-values. In the last step, an attempt is made to link the genetic modifier, YARSvsWT and GARSvsWT datasets to genes present in the clusters derived from the transcriptomics data. The visualization and analysis of the co-expression network resulted in the discovery of a few orthologous inter-species clusters and a larger number of non-orthologous inter-species clusters. One gene from the genetic modifier data, Mkp3, is found in a non-orthologous cluster. The gene is related to phosphatase/kinase activity. More genes are related to tyrosyl-tRNA synthetase (YARS) and/or glycyl-tRNA synthetase (GARS). CNPY2, ABT1 and EIF4E2 stand out as interesting genes because of their function and ontology related to neuronal disease and pathway processes. Moreover, some of the genes share both GARS and YARS mutational functions. Some interesting candidate genes are found in orthologous and non-orthologous clusters. However, these genes are predictions based on the data and first need to be verified by biochemical assays.
Abstract traineeship advanced bachelor of bioinformatics 2017-2018: Development and implementation of integrative bioinformatics algorithms

The goal of the project is to support the integrative efforts at the VIB Bioinformatics Core facility, with emphasis on integrating data from VIB's research activities. Specifically, state-of-the-art bioinformatics software packages are used to integrate proteomics, metabolomics, and transcriptomics data sets. Limitations in the functions of these packages are identified and addressed, with the aim of making them more user-friendly for biologists who have no or only very basic computational skills, and thereby enhancing biological discovery.

Rapid technological advances have led to the massive production of different types of biological (big) data and enabled construction of complex networks with various types of interactions between diverse biological entities. Single omics analysis methods were shown to be limited in dealing with such heterogeneous networked data. Integrative methods can collectively mine multiple types of biological data and produce more holistic, systems-level biological insights [1].

Principal component analysis (PCA) combined with sparse Partial Least Squares (sPLS), as implemented in the R package mixOmics [2], Weighted Gene Co-expression Network Analysis (WGCNA) [3] and Multi-Omics Factor Analysis (MOFA) [4] are used to summarize and decipher two data sets consisting of transcriptomics, proteomics and metabolomics experiments from Mus musculus and Medicago truncatula. These three robust statistical approaches were designed to uncover driver genes that are responsible for numerous cellular processes. While PCA is an appropriate and commonly used method, WGCNA holds several advantages in the analysis of highly multivariate, complex data by modelling them as networks/modules [5].

The WGCNA R software package incorporates functions for performing various aspects of weighted correlation network analysis, such as network construction, module detection, gene selection and visualization. MOFA is a factor analysis model for the integration of multi-omic data sets. Once trained, the model output can be used for downstream analyses, including the visualisation of samples in factor space and enrichment analysis. The mixOmics R package offers several multivariate methods suited to large omics data sets. These methods reduce the dimension of the data by using components, which are then used to produce graphical outputs that show the relationships and correlation structure between the integrated omics data sets.
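The network construction step of weighted correlation network analysis can be sketched as follows: for an unsigned network, the connection strength between two genes is the absolute Pearson correlation raised to a soft-thresholding power beta. The code is a plain-Python illustration of that formula, not the WGCNA R API; the value beta = 6, the function names and the toy expression profiles are assumptions.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equally long profiles (0.0 if one profile is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def soft_threshold_adjacency(
    expression: Dict[str, List[float]], beta: int = 6
) -> Dict[Tuple[str, str], float]:
    """Unsigned weighted adjacency: |correlation| ** beta for every gene pair."""
    return {
        (a, b): abs(pearson(expression[a], expression[b])) ** beta
        for a, b in combinations(sorted(expression), 2)
    }


def connectivity(adjacency: Dict[Tuple[str, str], float]) -> Dict[str, float]:
    """Sum of connection strengths per gene; highly connected genes are hub candidates."""
    totals: Dict[str, float] = {}
    for (a, b), weight in adjacency.items():
        totals[a] = totals.get(a, 0.0) + weight
        totals[b] = totals.get(b, 0.0) + weight
    return totals


# Toy expression profiles: one list of sample values per gene.
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [1.0, 3.0, 1.0, 3.0],
}
adjacency = soft_threshold_adjacency(expression)
hubs = connectivity(adjacency)
```

Raising the correlation to a power keeps strong correlations close to 1 while pushing weak ones towards 0, so no hard cut-off is needed to decide which genes are connected.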

Whereas the mixOmics and MOFA R packages can be run easily, the original WGCNA protocol contains a few functions that are designed to run on a single cluster or metabolite. To apply WGCNA to the above data sets, the open-source code base of WGCNA has been extended with additional utility functions. These extensions circumvent limitations that are hard for non-computational experts (i.e. biologists) to work around and that cause them to lose a lot of time and, eventually, interest in applying such advanced methods. As a result, an iterative WGCNA method was developed that allows users with little or no coding skills to run the code in a swift and user-friendly manner. This improved strategy greatly expands the general applicability of WGCNA and provides procedures that run in a loop to relate modules to external clinical traits and to identify important links between these traits and genes.

The final step in this project is to provide a Jupyter Notebook with an SoS kernel suitable for executing WGCNA, MOFA and mixOmics in a user-friendly interface [Fig.1]. The Jupyter Notebook is designed so that non-experts can easily run the scripts and to allow more reproducible analysis. The provided notebook includes a function to merge the significant genes from the three methods in order to identify common genes.
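The merging step can be pictured as a set intersection over the gene lists reported by each method. The sketch below is an assumption about how such a function might look, not the notebook's actual code; the function name and gene identifiers are invented.

```python
from typing import Dict, Iterable, List


def common_genes(results: Dict[str, Iterable[str]]) -> List[str]:
    """Return the genes reported as significant by every method, sorted by name."""
    gene_sets = [set(genes) for genes in results.values()]
    if not gene_sets:
        return []
    return sorted(set.intersection(*gene_sets))


# Hypothetical significant-gene lists per method.
significant = {
    "WGCNA": ["geneA", "geneB", "geneC"],
    "MOFA": ["geneB", "geneC", "geneD"],
    "mixOmics": ["geneC", "geneB", "geneE"],
}
shared_genes = common_genes(significant)
```

Genes found by all three methods are supported by independent statistical approaches, which makes them natural first candidates for follow-up.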

Thesis summary 2010-2011: Modification of Galaxy for Next-Generation Sequencing
Bioinformatics is facing a flood of sequence data. As a result, there is a shortage of space to store all the data. This also leads to a shortage of experts to analyze the sequence data. Next-Generation Sequencing involves data files that quickly reach sizes of tens or hundreds of gigabytes. Analysis and processing require an enormous amount of time and computing power, which makes them impossible without the use of computers. The necessary hardware and software must therefore be available.
BITS recently invested in a new server. The goal of the project is to prepare this server for use within VIB for the analysis of biological data, with a focus on sequence data. Preparing the server consisted of several parts.
The first part consisted of installing the operating system and its virtual machines. The Linux distribution Red Hat 6 was installed on the server as the main operating system. Linux is the environment that is always used at the traineeship site. It is the most important operating system in the bioinformatics world. Linux is very stable and therefore ideal for server purposes. The server also had to be configured with the necessary services for use within the BITS network.
Next came the installation of the Galaxy portal. This web portal is a collection of various bioinformatics tools. Galaxy integrates all the necessary functionality in one place, which makes it easy to use for analysis. After studying how the Galaxy web portal works, it was installed on the server.
The Galaxy portal was then extended and adapted with tools specifically geared to the research areas of VIB. These are mainly tools for Next-Generation Sequencing.


Rijvisschestraat 120
9000 Gent


Traineeship supervisor
Alexander Botzki
0032 9 244 66 34
Oren Tzfadia