Kulak, Public Health and Primary Care
Abstract traineeship advanced bachelor of bioinformatics 2017-2018: Benchmark datasets for interaction datamining
The CLUS tool developed at KU Leuven contains a hierarchical multi-label classification (HMC) algorithm capable of predicting multiple, hierarchically organized classes for a protein at the same time. The algorithm was evaluated using separate training, test and validation datasets. These datasets consist of multiple experimentally derived attributes for all yeast proteins, together with their 2007 classifications taken from either the MIPS Functional Catalogue (FunCat) (http://mips.helmholtz-muenchen.de/funcatDB/) or Gene Ontology (GO) (http://www.geneontology.org/). To check the predictive capabilities of the program, we updated the datasets with the current classifications according to the FunCat and GO databases. The current classifications were retrieved from GO and FunCat in an automated way using R and put in place of the old ones, while the old classifications were kept for comparison purposes. Comparing the predictions CLUS makes for a given protein from the old classifications with the newly annotated classifications in FunCat and GO illustrates the predictive ability of the algorithm. The updated training, test and validation sets can also be used to retrain the algorithm for more accurate predictions based on the updated knowledge in FunCat and GO.
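The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in the traineeship or the evaluation built into CLUS: the function names are hypothetical, the class codes are invented FunCat-style dotted codes, and the hierarchy is assumed to be a tree in which a class implies all of its ancestors. GO is a directed acyclic graph and would need the ontology graph instead of string splitting.

```python
from typing import Iterable, Set


def with_ancestors(classes: Iterable[str]) -> Set[str]:
    """Expand dotted FunCat-style codes so that each class implies its ancestors.

    Example: "01.01.03" expands to {"01", "01.01", "01.01.03"}.
    """
    expanded: Set[str] = set()
    for code in classes:
        parts = code.split(".")
        for depth in range(1, len(parts) + 1):
            expanded.add(".".join(parts[:depth]))
    return expanded


def newly_confirmed(predicted: Iterable[str],
                    old: Iterable[str],
                    new: Iterable[str]) -> Set[str]:
    """Return predicted classes that were absent from the old annotation
    but are present in the new annotation (hypothetical helper)."""
    old_full = with_ancestors(old)
    new_full = with_ancestors(new)
    return (with_ancestors(predicted) - old_full) & new_full


# Invented example for one protein; these codes are placeholders, not real data.
predicted = {"01.01", "10.03"}          # classes predicted from the old data
old_annotation = {"01.01.03"}           # annotation in the old dataset
new_annotation = {"01.01.03", "10.03.01"}  # annotation after the update

confirmed = newly_confirmed(predicted, old_annotation, new_annotation)
print(sorted(confirmed))
```

A prediction counts as confirmed here only when it was not already implied by the old annotation, so the count reflects knowledge that was added after the original datasets were built.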
The second part of this internship concerns the STITCH database, which describes interactions between proteins and chemicals. We created an interaction file containing the interaction between a chemical and every protein in the database, together with the score assigned by STITCH. Another file holding fingerprint information for the chemicals was generated using the OpenBabel software (http://openbabel.org/wiki/Main_Page). A fingerprint is a numeric representation of informative features of a chemical, which allows chemicals to be compared easily. To allow comparison between the proteins, both Pfam (https://pfam.xfam.org/) and GO IDs were retrieved and converted to binary vectors. By comparing fingerprint vectors between chemicals and binary Pfam/GO vectors between proteins, it is possible to predict interactions of interest between proteins and chemicals, which can then be verified experimentally.
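The conversion of Pfam and GO IDs to binary vectors, and the comparison of such vectors, can be sketched as follows. This is a minimal Python illustration and not the code written during the traineeship: the protein names and their ID sets are placeholders, the helper names are hypothetical, and the Tanimoto coefficient is assumed here as the similarity measure because it is a common choice for binary fingerprints. The abstract does not state which measure was used.

```python
from typing import Dict, List, Set


def build_vocabulary(annotations: Dict[str, Set[str]]) -> List[str]:
    """Collect every ID seen across all proteins, in a fixed sorted order."""
    return sorted(set().union(*annotations.values()))


def to_binary_vector(ids: Set[str], vocabulary: List[str]) -> List[int]:
    """Return 1 at each position whose ID is assigned to the protein, else 0."""
    return [1 if term in ids else 0 for term in vocabulary]


def tanimoto(a: List[int], b: List[int]) -> float:
    """Shared set bits divided by bits set in either vector (0.0 if both empty)."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


# Placeholder annotations for illustration only, not retrieved data.
annotations = {
    "protein_A": {"PF00069", "GO:0005524"},
    "protein_B": {"PF00069", "GO:0005524", "GO:0004672"},
    "protein_C": {"PF00071", "GO:0005525"},
}

vocabulary = build_vocabulary(annotations)
vectors = {name: to_binary_vector(ids, vocabulary)
           for name, ids in annotations.items()}

print(vectors["protein_A"])
print(tanimoto(vectors["protein_A"], vectors["protein_B"]))
print(tanimoto(vectors["protein_A"], vectors["protein_C"]))
```

The same `tanimoto` function applies to chemical fingerprints once they are expressed as bit vectors of equal length, so chemicals and proteins can each be compared within their own vector space.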
Etienne Sabbelaan 53