VIB – Bioinformatics Core (BITS)

Abstract 2020-2021: Creating a new release of the non-redundant protein-protein interaction database iRefIndex

iRefIndex is a non-redundant protein-protein interaction (PPI) database created in 2005. It was built to combine data from multiple PPI databases into a single index of all publicly available PPIs. This is achieved by assigning all redundant records of a protein to one key, the redundant object identifier (rog). A rog is the SHA-1 digest of the protein's primary sequence concatenated with the taxonomy identifier of its source organism. The rogs of the two proteins in a PPI are concatenated and hashed again with SHA-1 to form the redundant interaction identifier (rig). These two keys connect identical proteins and identical PPIs across the source databases. Since the database was created, much further research has identified additional PPIs, so a new version has to be built every year to keep the data up to date. Each new version runs into problems caused by changed data formats, moved data repositories and newly added databases. The goal is to make the newly available PPIs easily accessible in the iRefIndex database and to provide the complete information for each PPI in tabular format.
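
The key construction described above can be sketched in a few lines of Python. This is a minimal illustration and not the iRefIndex implementation: the function names `rog_key` and `rig_key`, the hexadecimal encoding, the normalisation of the sequence and the sorting of the two keys before hashing (so that A-B and B-A give the same key) are all assumptions made for the example.

```python
import hashlib


def rog_key(sequence: str, taxid: int) -> str:
    """SHA-1 of the primary sequence concatenated with the taxonomy identifier."""
    # Assumption: whitespace is removed and the sequence is upper-cased first.
    normalized = "".join(sequence.split()).upper()
    return hashlib.sha1(f"{normalized}{taxid}".encode("ascii")).hexdigest()


def rig_key(rog_a: str, rog_b: str) -> str:
    """SHA-1 of the two interactor keys concatenated."""
    # Assumption: the keys are sorted so the result does not depend on argument order.
    first, second = sorted((rog_a, rog_b))
    return hashlib.sha1(f"{first}{second}".encode("ascii")).hexdigest()


# Toy sequences, not real proteins: two records of the same protein share one key.
a = rog_key("MKTAYIAK", 9606)
b = rog_key("mktayiak\n", 9606)
c = rog_key("MSDNEQ", 9606)
interaction = rig_key(a, c)
```

Because the taxonomy identifier is part of the hashed string, an identical sequence from another organism receives a different key.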

Abstract 2019-2020: Optimisation of the pipeline to set up an index for protein-protein interactions: the iRefIndex

Many protein-protein interaction (PPI) datasets are publicly available, spread across multiple data sources. When these are naively combined, the resulting data source may contain many interaction pairs that appear different because of how each dataset is annotated but are in fact identical. Other PPIs are unique to a single source.

The goal of the iRefIndex project is to build a unifying interaction index that brings all this interaction data together. Redundant interactions and redundant objects (the protein interactors) are collected into groups called ‘redundant interaction groups’ (RIGs) and ‘redundant object groups’ (ROGs). This way, the index contains no duplicated proteins or protein interactions.

To set up this index, a sophisticated pipeline was created by Donaldson et al.¹ eight years ago. Given a configuration file, it can automatically download the preferred source files from the internet or use a local copy. In this case, 24 sources, including MatrixDB², Reactome³ and HPIDB⁴, were used to obtain the necessary interaction data. These source files can be in mitab format, in XML format or in a source-specific format. In addition, 9 curation sources are used, such as Entrez Gene⁵ and RefSeq⁶, each with its own structure. After the download, the files are parsed into a coherent format. For the mitab format, this means that the PubMed IDs, authors, aliases, interactions, etc. are each written to separate files. These files are then imported into the iRefIndex database. The last step is building the index from this database. The result of the pipeline is a mitab file containing all interactions and interaction partners.
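
As an illustration of the parsing step, the sketch below splits one mitab line into separate record lists, in the spirit of the separate files described above. The column names follow the 15-column PSI-MI TAB 2.5 layout; the function names, the output structure and the example values are hypothetical and are not taken from the iRefIndex scripts. The simple split on `|` ignores quoted values, which a production parser would have to handle.

```python
MITAB25_COLUMNS = [
    "id_a", "id_b", "alt_ids_a", "alt_ids_b", "aliases_a", "aliases_b",
    "detection_methods", "first_authors", "publication_ids",
    "taxid_a", "taxid_b", "interaction_types", "source_dbs",
    "interaction_ids", "confidence",
]


def parse_mitab_line(line: str) -> dict:
    """Map the first 15 tab-separated fields to lists of values."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(MITAB25_COLUMNS):
        raise ValueError(f"expected at least 15 columns, got {len(fields)}")
    # "-" marks an empty field; multiple values are separated by "|".
    return {
        name: [] if raw == "-" else raw.split("|")
        for name, raw in zip(MITAB25_COLUMNS, fields)
    }


def split_into_tables(record: dict) -> dict:
    """Group one parsed record into per-topic rows (hypothetical layout)."""
    # Assumption: both interactor identifiers are present.
    id_a, id_b = record["id_a"][0], record["id_b"][0]
    pair = (id_a, id_b)
    return {
        "interactions": [pair],
        "publications": [(pair, pub) for pub in record["publication_ids"]],
        "authors": [(pair, author) for author in record["first_authors"]],
        "aliases": [(id_a, alias) for alias in record["aliases_a"]]
        + [(id_b, alias) for alias in record["aliases_b"]],
    }


# Placeholder values for illustration only.
example = "\t".join([
    "uniprotkb:P12345", "uniprotkb:Q67890", "-", "-",
    "uniprotkb:GENE1(gene name)", "-",
    'psi-mi:"MI:0018"(two hybrid)', "Doe J (2020)",
    "pubmed:00000000|doi:10.0000/example",
    "taxid:9606(human)", "taxid:9606(human)",
    'psi-mi:"MI:0915"(physical association)', "example-db",
    "example:EX-1", "-",
])
tables = split_into_tables(parse_mitab_line(example))
```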

The goal of this project was to optimize the pipeline used to create iRefIndex version 17. The main problem is that some source files are malformed or have changed format altogether, for example by switching from XML to mitab or by dropping an entire column. Such malformations and changes must be corrected before the files can be processed further, either in the source file itself or in the parsing scripts. Where the malformation is in the source file, we aim to report it to the maintainer of that source after the final release.
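
A malformed or restructured tab-separated source file can often be caught before parsing by checking the number of columns per line. The sketch below is a hypothetical pre-flight check and not part of the iRefIndex pipeline; the function name, the comment prefix and the report format are assumptions.

```python
from collections import Counter


def column_count_report(lines, expected_columns, comment_prefix="#"):
    """Return (histogram of column counts, line numbers that deviate)."""
    histogram = Counter()
    bad_lines = []
    for number, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith(comment_prefix):
            continue  # skip blank lines and header/comment lines
        count = len(line.rstrip("\n").split("\t"))
        histogram[count] += 1
        if count != expected_columns:
            bad_lines.append(number)
    return histogram, bad_lines


sample = [
    "#id_a\tid_b\tmethod\n",
    "A\tB\ttwo hybrid\n",
    "C\tD\n",  # a column is missing here
    "E\tF\tpull down\n",
]
histogram, bad_lines = column_count_report(sample, expected_columns=3)
```

If every line deviates in the same way, the format itself has probably changed and the parsing script needs adapting; isolated deviations point to malformed lines in the source file.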

In the final iRefIndex version 17, four new sources were added compared with version 16: BAR⁷, MINT⁸, HuRI⁹ and SPIKE¹⁰. The MOLCON source is not included because it was no longer available.

Another difference from iRefIndex version 16 is that the index now includes not only interactions reported with a PubMed identifier but also those reported with a DOI. Adding sources enriches iRefIndex, since new unique interactions enter the index. For example, the HuRI source contributes 51482 different interactions, of which 10368 are unique to HuRI.
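
How "unique to a source" can be counted is sketched below, assuming each interaction has already been reduced to a redundant-interaction key and that the reporting sources are known. The data structure, the function name and the toy keys are assumptions for illustration; the real counts come from the iRefIndex database.

```python
from collections import defaultdict


def unique_per_source(records):
    """records: iterable of (interaction_key, source).

    Returns {source: (distinct interactions, interactions seen in no other source)}.
    """
    sources_by_key = defaultdict(set)
    for key, source in records:
        sources_by_key[key].add(source)

    totals = defaultdict(int)
    uniques = defaultdict(int)
    for sources in sources_by_key.values():
        for source in sources:
            totals[source] += 1
        if len(sources) == 1:
            uniques[next(iter(sources))] += 1
    return {source: (totals[source], uniques[source]) for source in totals}


records = [
    ("rig1", "HuRI"), ("rig1", "MINT"),
    ("rig2", "HuRI"),
    ("rig3", "MINT"), ("rig3", "SPIKE"),
    ("rig2", "HuRI"),  # a repeated report within one source counts once
]
summary = unique_per_source(records)
```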

In total, version 17 contains 153527 unique proteins and 1185907 unique protein interactions or complexes. This is an increase over version 16, which contained 135893 unique proteins and 967265 unique protein interactions or complexes.
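
The growth between the two releases follows directly from the figures above; the short calculation below makes the absolute and relative increases explicit.

```python
v16 = {"proteins": 135893, "interactions": 967265}
v17 = {"proteins": 153527, "interactions": 1185907}

growth = {}
for name in v16:
    absolute = v17[name] - v16[name]
    relative = 100 * absolute / v16[name]  # percentage relative to version 16
    growth[name] = (absolute, round(relative, 1))
    print(f"{name}: +{absolute} ({relative:.1f} %)")
```

This gives 17634 additional unique proteins (about 13.0 %) and 218642 additional unique interactions or complexes (about 22.6 %).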

Sources

1. Razick, S., Magklaras, G. & Donaldson, I.M. iRefIndex: A consolidated protein interaction database with provenance. BMC Bioinformatics 9, 405 (2008). https://doi.org/10.1186/1471-2105-9-405
2. Launay, G., Salza, R., Multedo, D., Thierry-Mieg, N. and Ricard-Blum, S. (2015) MatrixDB, the extracellular matrix interaction database: updated content, a new navigator and expanded functionalities. Nucleic Acids Research, 43, D321-327.
3. Fabregat, A., Jupe, S., Matthews, L., Sidiropoulos, K., Gillespie, M., Garapati, P., Haw, R., Jassal, B., Korninger, F., May, B. et al. (2018) The Reactome Pathway Knowledgebase. Nucleic Acids Research, 46, D649-D655.
4. Ammari, M.G., Gresham, C.R., McCarthy, F.M. and Nanduri, B. (2016) HPIDB 2.0: a curated database for host-pathogen interactions. Database: the journal of biological databases and curation, 2016.
5. Maglott D, Ostell J, Pruitt KD, Tatusova T. Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res. 2005;33(Database issue):D54‐D58. doi:10.1093/nar/gki031
6. Pruitt KD, Tatusova T, Maglott DR. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2005;33(Database issue):D501‐D504. doi:10.1093/nar/gki025
7. Toufighi K, Brady SM, Austin R, Ly E, Provart NJ. The Botany Array Resource: e-Northerns, Expression Angling, and promoter analyses. Plant J. 2005;43(1):153‐163. doi:10.1111/j.1365-313X.2005.02437.x
8. Orchard, S., Ammari, M., Aranda, B., Breuza, L., Briganti, L., Broackes-Carter, F., Campbell, N.H., Chavali, G., Chen, C., del-Toro, N. et al. (2014) The MIntAct project--IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Research, 42, D358-363.
9. Luck, K., Kim, D., Lambourne, L. et al. A reference map of the human binary protein interactome. Nature 580, 402–408 (2020). doi:10.1038/s41586-020-2188-x
10. Paz A, Brownstein Z, Ber Y, et al. SPIKE: a database of highly curated human signaling pathways. Nucleic Acids Res. 2011;39(Database issue):D793‐D799. doi:10.1093/nar/gkq1167

Abstract 2018-2019 (1): Refinement of a reproducible pipeline for consolidated protein interaction data

Protein-protein interactions play a major role in biochemical processes and pathways: they are involved in processes in all sorts of cell types and enable signaling between cells. The Ian Donaldson lab created iRefIndex (Donaldson, Magklaras, & Razick, 2008), which centralizes protein-protein interaction data from 22 source databases, such as BioGRID, IntAct or InnateDB, and 15 descriptive databases, such as Entrez or UniProt, in one database. This combined database allows the compilation of organism-specific protein interactomes that can be used by the community. It creates a unique identifier for each interactor and for every interaction, most of which are binary. The aim of this project is to build an updated version (v16) of iRefIndex. The update yields a larger data source covering more interactions, organisms and gene aliases by adding new interaction databases such as APID (Alonso-Lopez, et al., 2016) and descriptive data sources such as gene accessions for Arabidopsis thaliana. A suite of scripts already existed (Donaldson & Botzki, 2019) but had to be adapted for the newly included database sources. The pipeline follows a fixed sequence of steps, and mistakes were ironed out at every step. The iRefIndex database (PostgreSQL) is initialized, and the data sources are downloaded and then parsed. Since the release of iRefIndex version 15 in early 2018, several source databases have been updated and contain new information, which required small adaptations of the scripts. For the newly added databases, scripts were written or adapted to fit them into the existing pipeline. Subsequent steps included building the iRefIndex database and cleaning up the results. In the last step, version 15 and the new version were compared using the iRefIndex data files in MITAB format.
As the final result, the new version was built successfully and has almost one million more entries than its predecessor. An R script is used to evaluate data quality and to identify missing or wrongly associated elements, such as missing PubMed IDs or authors, wrongly associated taxa, or ill-defined interactor identifiers in place of the requested UniProt IDs. Missing PubMed IDs are inserted after manual curation. Compared with the previous version, the new version scores better on all but one quality parameter. Further work might also improve this last parameter, the missing UniProt IDs for interactors. In summary, the newly built database is of higher overall quality than the previous version and contains readily available gene accessions for Arabidopsis thaliana genes. Unfortunately, the inclusion of the APID database had to be dropped because the downloaded datasets did not describe the type of protein-protein interaction. The authors of the database were alerted, but no updated version was received during the internship.
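
The kind of check described above can be illustrated with a short sketch. Python is used here only for illustration; the original evaluation was done with an R script that is not shown. The sketch counts records with a missing PubMed ID or author and interactors whose identifier is not a UniProt identifier. The field names, the `uniprotkb:` prefix test and the toy rows are assumptions, not the project's actual schema.

```python
MISSING = (None, "", "-")


def quality_summary(rows):
    """Count common defects in a list of interaction records (dicts)."""
    summary = {"missing_pubmed": 0, "missing_author": 0, "non_uniprot_interactor": 0}
    for row in rows:
        if row.get("pubmed_id") in MISSING:
            summary["missing_pubmed"] += 1
        if row.get("author") in MISSING:
            summary["missing_author"] += 1
        # Assumption: UniProt identifiers carry a MITAB-style "uniprotkb:" prefix.
        for field in ("id_a", "id_b"):
            interactor = row.get(field) or ""
            if not interactor.startswith("uniprotkb:"):
                summary["non_uniprot_interactor"] += 1
    return summary


# Placeholder records for illustration only.
rows = [
    {"id_a": "uniprotkb:P00001", "id_b": "uniprotkb:P00002",
     "pubmed_id": "00000001", "author": "Doe J"},
    {"id_a": "refseq:NP_000001", "id_b": "uniprotkb:P00002",
     "pubmed_id": "-", "author": "Doe J"},
    {"id_a": "uniprotkb:P00001", "id_b": None,
     "pubmed_id": "", "author": None},
]
report = quality_summary(rows)
```

Running the same summary on two releases gives one number per quality parameter, which is the basis for a version-to-version comparison.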

Bibliography

Alonso-Lopez, M., Gutierrez, M., Lopes, K., Prieto, C., Santamaria, R., & De Las Rivas, J. (2016). APID interactomes: providing proteome-based interactomes with controlled quality for multiple species and derived networks. Nucleic Acids Research.
Donaldson, I., & Botzki, A. (2019). Irefindex Github page. Retrieved from GitHub: https://github.com/abotzki/irefindex
Donaldson, I., Magklaras, G., & Razick, S. (2008). iRefIndex: A consolidated protein interaction database with provenance. BMC Bioinformatics.

Abstract 2018-2019 (2): Understanding the relevance of aminoacyl-tRNA synthetases in dominant peripheral neuropathies

Hereditary peripheral neuropathies comprise different subgroups, defined by the loss of motor and sensory neurons. Dominant-intermediate Charcot–Marie–Tooth neuropathy, the most common human inherited neurological disorder, is characterized by axonal degeneration and demyelination of peripheral nerves. Dominant mutations in tRNA synthetases, which are essential for ligating a tRNA molecule to its appropriate amino acid during protein translation, cause Charcot–Marie–Tooth neuropathy. In the past, several studies suggested a loss of aminoacylation (enzymatic) function as the cause. More recently, a toxic gain of function has come to be considered the more likely underlying problem. The aim of the project is to find conserved co-expression patterns between orthologs across species, providing a highly relevant list of candidate genes that potentially share similar functions and act in the same pathway. Conservation across distantly related species, human and fly, suggests a mechanism that is essential, since it has been retained over a long evolutionary distance. Transcriptomics data are generated by performing total RNA sequencing on Drosophila brains expressing control and mutant constructs of the YARS gene. In addition, genomic ortholog data are combined with the transcriptomics data, and other datasets, such as genetic modifier screens and the YARSvsWT and GARSvsWT comparisons, are accessed and integrated. The CoExpNetViz tool (Tzfadia et al., 2016) is used for cross-species co-expression analysis, to search for genes involved in a common pathway or process, or to find functional orthologs of the bait genes. CoExpNetViz takes as input a set of bait genes, at least one gene expression matrix for each species and, optionally, a gene family file. The algorithm generates output files, including files to visualize the co-expression network in Cytoscape. Once the network was created and an appropriate layout applied, different properties of the co-expression network were investigated.
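
The core idea of bait-based co-expression analysis can be sketched as follows: correlate the expression profile of each bait gene with every other gene and keep the pairs above a cutoff. This is a simplified illustration and not CoExpNetViz's algorithm; the fixed Pearson cutoff of 0.9, the function names and the toy expression values are assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant profile: correlation undefined, treated as none
    return cov / sqrt(var_x * var_y)


def coexpressed_with_baits(expression, baits, cutoff=0.9):
    """Return (bait, gene, r) edges whose absolute correlation reaches the cutoff."""
    edges = []
    for bait in baits:
        for gene, profile in expression.items():
            if gene in baits:
                continue  # bait-to-bait pairs are skipped in this sketch
            r = pearson(expression[bait], profile)
            if abs(r) >= cutoff:
                edges.append((bait, gene, round(r, 3)))
    return edges


# Hypothetical genes and values: one row per gene, one column per sample.
expression = {
    "bait1": [1.0, 2.0, 3.0, 4.0],
    "geneA": [2.0, 4.0, 6.0, 8.0],  # rises with the bait
    "geneB": [4.0, 3.0, 2.0, 1.0],  # falls as the bait rises
    "geneC": [1.0, 3.0, 2.0, 1.0],  # no clear relation
}
edges = coexpressed_with_baits(expression, ["bait1"])
```

The resulting edge list is the kind of table that a network tool such as Cytoscape can import for visualization.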
Further downstream analysis was carried out with Cytoscape plugins. MCODE and ClusterViz were used to find highly interconnected regions in the large network. Corresponding GO annotations (Ensembl) and enriched terms (DAVID) were added to the genes in the clusters. A second Gene Set Enrichment Analysis (GSEA) was run with the fgsea package in RStudio to obtain more fine-grained p-values. In the last step, an attempt was made to link the genetic modifier, YARSvsWT and GARSvsWT datasets to genes present in the clusters derived from the transcriptomics data. The visualization and analysis of the co-expression network led to the discovery of a few orthologous inter-species clusters and a larger number of non-orthologous inter-species clusters. One gene from the genetic modifier data, Mkp3, was found in a non-orthologous cluster; it is related to phosphatase/kinase activity. Further genes are related to tyrosyl-tRNA synthetase (YARS) and/or glycyl-tRNA synthetase (GARS). CNPY2, ABT1 and EIF4E2 stand out as interesting genes because their function and ontology relate to neuronal disease and pathway processes. Moreover, some of the genes share both GARS and YARS mutational functions. Interesting candidate genes were thus found in both orthologous and non-orthologous clusters. However, these genes are predictions based on the data and first need to be verified by biochemical assays.

Abstract traineeship advanced bachelor of bioinformatics 2017-2018: Development and implementation of integrative bioinformatics algorithms

The goal of the project is to support the integrative efforts of the VIB Bioinformatics Core facility, with emphasis on integrating data from VIB’s research activities. Specifically, state-of-the-art bioinformatics software tools are used to integrate proteomics, metabolomics and transcriptomics data sets. Limitations in the functions of these tools are identified and addressed, with the aim of making them more user-friendly for biologists who have no or only very basic computational skills, and thereby enhancing biological discovery.

Rapid technological advances have led to the massive production of different types of biological (big) data and enabled construction of complex networks with various types of interactions between diverse biological entities. Single omics analysis methods were shown to be limited in dealing with such heterogeneous networked data. Integrative methods can collectively mine multiple types of biological data and produce more holistic, systems-level biological insights [1].

Principal component analysis (PCA) combined with sparse Partial Least Squares (sPLS), as implemented in the R package mixOmics [2], Weighted Gene Co-expression Network Analysis (WGCNA) [3] and Multi-Omics Factor Analysis (MOFA) [4] are used to summarize and decipher two data sets consisting of transcriptomics, proteomics and metabolomics experiments from Mus musculus and Medicago truncatula. These three robust statistical approaches were designed to uncover driver genes that are responsible for numerous cellular processes. While PCA is an appropriate and commonly used method, WGCNA holds several advantages in the analysis of highly multivariate, complex data by modelling them as networks/modules [5].

The WGCNA R software package incorporates functions for performing various aspects of weighted correlation network analysis, such as network construction, module detection, gene selection and visualization. MOFA is a factor analysis model for the integration of multi-omic data sets. Once trained, the model output can be used for downstream analyses, including the visualisation of samples in factor space and enrichment analysis. The mixOmics R package offers several multivariate methods suited to large omics data sets. They reduce the dimension of the data to a set of components, which are used to produce graphical outputs that show the relationships and correlation structure between the integrated omics data sets.

Whereas the mixOmics and MOFA R packages can be run easily, the original WGCNA protocol contains a few functions that are designed to run on a single cluster or metabolite. To apply WGCNA to the above data sets, the open-source code base of WGCNA was extended with additional utility functions. These extensions circumvent limitations that are hard for non-computational experts (i.e. biologists) to work around and that would otherwise cost them a lot of time and, eventually, their interest in applying such advanced methods. As a result, an iterative WGCNA method was developed that allows users with little or no coding skills to run the code in a swift and user-friendly manner. This improved strategy greatly expands the general applicability of WGCNA and provides processes that run in a loop to relate modules to external clinical traits and to identify important links between these traits and genes.

The final step in this project is to provide a Jupyter Notebook with an SoS kernel suitable for executing WGCNA, MOFA and mixOmics in a user-friendly interface [Fig.1]. Jupyter Notebook was developed to let non-experts run scripts easily and to make analyses more reproducible. The provided notebook includes a function that merges the significant genes from the three methods to identify the genes they have in common.
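
The merge step can be pictured as a set intersection over the gene lists reported by the three methods. The sketch below is an assumption about how such a function might look; the function name, the gene identifiers and the option to require agreement from only some of the methods are hypothetical and are not taken from the provided notebook.

```python
from collections import Counter


def common_genes(gene_lists, min_methods=None):
    """Return genes reported by at least `min_methods` methods (default: all of them)."""
    if min_methods is None:
        min_methods = len(gene_lists)
    counts = Counter()
    for genes in gene_lists.values():
        counts.update(set(genes))  # each method counts once per gene
    return sorted(gene for gene, n in counts.items() if n >= min_methods)


# Hypothetical gene identifiers for illustration only.
significant = {
    "WGCNA": ["geneA", "geneB", "geneC"],
    "MOFA": ["geneA", "geneB", "geneC", "geneD"],
    "mixOmics": ["geneC", "geneB", "geneE", "geneB"],
}
shared = common_genes(significant)
at_least_two = common_genes(significant, min_methods=2)
```

Requiring agreement from all three methods gives the most conservative list; relaxing `min_methods` trades stringency for coverage.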

Thesis summary 2010-2011: Modification of Galaxy for Next-Generation Sequencing

Bioinformatics is struggling with a flood of sequence data. As a result, there is a shortage of space to store all the data. This also leads to a shortage of experts to analyse the sequence data. Next-Generation Sequencing works with data files that quickly reach sizes of several tens or hundreds of gigabytes. Analysis and processing require an enormous amount of time and computing power, and are impossible for humans without the use of computers. The necessary hardware and software must therefore be available.

BITS recently invested in a new server. The goal of the project is to prepare this server for use within VIB for the analysis of biological data, with a focus on sequence data. Preparing the server consisted of several parts.

The first part consisted of installing the operating system and its virtual machines. The Linux distribution Red Hat 6 was installed as the main operating system on the server. Linux is the environment that is always used at the traineeship site. It is the most important operating system in the bioinformatics world. Linux is very stable and therefore ideal for server purposes. The server also had to be configured with the necessary services for use within the BITS network.

Next came the installation of the Galaxy portal. This web portal is a collection of various bioinformatics tools. Galaxy integrates all the necessary functionality in one place, which makes it easy to use for analysis. After the workings of the Galaxy web portal had been studied, it was installed on the server.

The Galaxy portal was then extended and adapted with tools aimed specifically at the fields in which VIB conducts research. These are mainly tools for Next-Generation Sequencing.

Tags: bioinformatics

Address

Rijvisschestraat 120

9052 Gent

Belgium

Contacts

Traineeship supervisor

Alexander Botzki

0032 9 244 66 34

Alexander.Botzki@vib.be

Oren Tzfadia

oren.tzfadia@vib.be
