Howest, Biomedical Laboratory Technology and Bioinformatics programme
Abstract 2019-2020 (1): Identification and genome assembly of microorganisms using nanopore sequencing (the Bioinformatics Knowledge Center (BiKC) of Howest)
Rapid tracking and identification of unwanted microorganisms spreading through research labs and medical facilities has always been of the utmost importance. A Flemish hospital was confronted with a bacterial contamination in its neonatal care unit. Using conventional bacterial identification techniques, the genus was determined to be Enterobacter. However, a more specific classification is required to combat this contamination and to establish whether one or multiple strains are involved. To this end, DNA was extracted from single clones grown from the bacterial contaminants and sequenced using a MinION device from Oxford Nanopore Technologies (ONT). With the sequencing data available, a search was started for bioinformatics tools that can correctly identify microorganisms, compare samples and even assemble the genomes of these bacteria. In a first step, multiple quality control (QC) tools were tested and scored on their ease of installation, ease of use and output. One QC tool stood out above the rest: NanoComp. After a successful QC, classification of the samples was started with Kraken 2 as the main tool. Two databases were used: MiniKraken2 (version 2, an 8 GB standard Kraken 2 database) and a custom RefSeq bacterial database built on the bioinformatics server of Howest. Two tools were tested to visualise the output; one of them proved to be far superior for this case. Kraken 2 gave a more specific classification at species level and even identified subspecies in the different samples. One distinct species of Enterobacter was largely present in all samples: E. hormaechei. However, determining to which subspecies of E. hormaechei each sample belongs was not yet possible. To help with further identification, genome assembly was attempted using multiple tools. These were again evaluated on ease of installation, ease of use and output.
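The species-level result described above can be read directly from a Kraken 2 report file. The following is a minimal sketch, assuming the standard six-column tab-separated report layout (percentage, clade reads, direct reads, rank code, taxonomy ID, indented name). The function name `top_taxa` and the example lines are hypothetical and are not data from this project.

```python
def top_taxa(report_text, rank="S", n=3):
    """Return the n most abundant taxa of one rank from a Kraken 2 report.

    Each result is (name, reads_in_clade, percent_of_reads).
    """
    rows = []
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        percent, clade_reads, _direct, rank_code, _taxid, name = fields[:6]
        if rank_code == rank:
            rows.append((name.strip(), int(clade_reads), float(percent)))
    # Most reads first
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:n]


# Made-up report lines, for illustration only.
EXAMPLE_REPORT = "\n".join([
    "  2.00\t20\t20\tU\t0\tunclassified",
    " 98.00\t980\t5\tR\t1\troot",
    " 90.00\t900\t10\tG\t547\t      Enterobacter",
    " 70.00\t700\t700\tS\t158836\t        Enterobacter hormaechei",
    " 12.00\t120\t120\tS\t550\t        Enterobacter cloacae",
])

if __name__ == "__main__":
    for name, reads, percent in top_taxa(EXAMPLE_REPORT):
        print(f"{name}\t{reads}\t{percent:.2f}%")
```

Passing `rank="G"` gives the genus-level view instead, which makes it easy to compare the report against the conventional identification.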
One of the assemblers came out on top, which is consistent with the current literature: Flye. Flye's advantages are its speed and the GFA file included in its output, which allows the assembly to be visualised with appropriate software. The initial assembly constructed by Flye was then polished using Racon followed by Medaka, as suggested by ONT. This resulted in a final consensus FASTA file for each sample. As a last step, the genome assemblies produced by the different tools were fed to classification and identification software. Kraken 2 was used again in order to compare results before and after assembly. Type (Strain) Genome Server (TYGS) and Genome-to-Genome Distance Calculator (GGDC) were among the tested web-based tools that proved to be very insightful; these two were especially useful for comparing the samples with each other. Throughout this identification effort many bioinformatics tools were reviewed. Some had installation procedures that were too intricate, with too many dependencies, while others fell short in ease of use or output. However, for each step in the analysis the right software was found, and a workflow was set up with the chosen tools. The proposed workflow allows rapid identification of microorganisms, even at subspecies level, and allows for easy strain comparison between samples.
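The assembly and polishing order described above (Flye, then Racon, then Medaka) can be written down as a list of commands. This is only a sketch under assumptions: the helper name `assembly_commands`, the directory layout and the thread count are hypothetical, minimap2 is assumed as the overlapper that Racon needs, and the exact options may differ between tool versions. Nothing is executed by the function itself.

```python
from pathlib import Path


def assembly_commands(reads, out_dir, threads=8):
    """List the commands of a Flye -> Racon -> Medaka run.

    Each item is (argv, stdout_file). stdout_file is None when the tool
    writes its own output files, otherwise the tool's standard output
    should be redirected to that path.
    """
    out = Path(out_dir)
    flye_dir = out / "flye"
    draft = flye_dir / "assembly.fasta"   # Flye's default assembly file name
    overlaps = out / "overlaps.paf"       # read-to-draft overlaps for Racon
    racon_fasta = out / "racon.fasta"
    t = str(threads)
    return [
        # 1. De novo assembly of the raw nanopore reads
        (["flye", "--nano-raw", str(reads), "--out-dir", str(flye_dir),
          "--threads", t], None),
        # 2. Map the reads back to the draft (assumed overlapper: minimap2)
        (["minimap2", "-x", "map-ont", "-t", t, str(draft), str(reads)],
         overlaps),
        # 3. First polishing round with Racon (writes FASTA to stdout)
        (["racon", "-t", t, str(reads), str(overlaps), str(draft)],
         racon_fasta),
        # 4. Final consensus with Medaka
        (["medaka_consensus", "-i", str(reads), "-d", str(racon_fasta),
          "-o", str(out / "medaka"), "-t", t], None),
    ]


if __name__ == "__main__":
    for argv, stdout_file in assembly_commands("sample1.fastq", "results"):
        redirect = f" > {stdout_file}" if stdout_file else ""
        print(" ".join(argv) + redirect)
```

Keeping the commands as data makes it straightforward to print them for review first and only later hand each one to `subprocess.run`.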
References
[1] W. De Coster, S. D’Hert, D. T. Schultz, M. Cruts, and C. Van Broeckhoven, “NanoPack: Visualizing and processing long-read sequencing data,” Bioinformatics, vol. 34, no. 15, pp. 2666–2669, 2018, doi: 10.1093/bioinformatics/bty149.
[2] D. E. Wood, J. Lu, and B. Langmead, “Improved metagenomic analysis with Kraken 2,” Genome Biol., vol. 20, no. 1, pp. 1–13, 2019, doi: 10.1186/s13059-019-1891-0.
[3] M. Kolmogorov, J. Yuan, Y. Lin, and P. A. Pevzner, “Assembly of long, error-prone reads using repeat graphs,” Nat. Biotechnol., vol. 37, no. 5, pp. 540–546, 2019, doi: 10.1038/s41587-019-0072-8.
[4] Oxford Nanopore Technologies, “microbial-genome-assembly-workflow,” 5th May 2020.
[5] J. P. Meier-Kolthoff and M. Göker, “TYGS is an automated high-throughput platform for state-of-the-art genome-based taxonomy,” Nat. Commun., vol. 10, no. 1, Dec. 2019, doi: 10.1038/s41467-019-10210-3.
[6] J. P. Meier-Kolthoff, A. F. Auch, H. P. Klenk, and M. Göker, “Genome sequence-based species delimitation with confidence intervals and improved distance functions,” BMC Bioinformatics, vol. 14, 2013, doi: 10.1186/1471-2105-14-60.
[7] R. R. Wick, M. B. Schultz, J. Zobel, and K. E. Holt, “Bandage: Interactive visualization of de novo genome assemblies,” Bioinformatics, vol. 31, no. 20, pp. 3350–3352, 2015, doi: 10.1093/bioinformatics/btv383.
Abstract 2019-2020 (2): Nanopore sequencing analysis in the Bioinformatics Knowledge Center (BiKC) of Howest
Nanopore sequencing is a portable third-generation sequencing technology. Third-generation sequencing technologies are characterized by their ability to sequence long reads in real time. In this case the MinION was used to sequence two types of samples: a microbiome sample (whole genome and 16S) and ten plasmids from GeneCorner. For both datasets, different tools for quality control (QC), mapping and data analysis were compared in order to find the best tools for our needs.
The QC tools that were tested are NanoComp, NanoPlot and longread_plots. To map the whole-genome sequencing (WGS) data, several tools were tested: TopHat2/Bowtie2, HISAT2, Kraken 2, QIIME 2 and minimap2. Kraken 2 was tested with two different reference databases to see the effect on the results. The results were visualized using Krona and Pavian. For the second project, ten plasmids were analyzed using the QC tools from the first project, and reads were mapped with minimap2.
Out of the ten analyzed plasmids, nine had at least one variation. Most of these were point mutations. In one sample two deletions were present. Such variations are often missed by current methods, which determine only a partial sequence or rely on restriction enzyme pattern analysis, whereas here the entire plasmid was sequenced at once. For the microbiome data both methods (WGS and 16S) gave the same enterotype (Firmicutes), but there were large differences in the distribution of the species between the two methods.
Abstract 2019-2020 (3): Automated bioinformatics workflow for imputation of clinical relevant variants in the context of preventive health care
The goal of the traineeship was to develop an automated workflow that starts from data retrieved through Nanopore or Illumina sequencing. The result of the workflow is a file with imputed data. This data can then be used to deduce unknown genotypes for risk genes and to provide targeted preventive healthcare. The workflow must be as simple as possible, require little to no human intervention and be usable by people with only basic computer knowledge. The workflow itself contains four steps, which are combined in one imputation script. The first step is to convert the input files into the right format for the imputation. This is a two-step process. It starts with converting the idat files to gtc files with Illumina GenCall. Afterwards, these gtc files are converted to VCF files with a Python script also written by Illumina. The second step is preparing the data for imputation. It starts with renaming the files from the Array Picker code to the sample name. Afterwards all the samples are merged into one big file, from which the X chromosome is extracted. In the data of the X chromosome, all haploid data is converted to homozygous diploid data. Step three is the actual imputation, done with the Beagle tool. This tool performs the imputation chromosome by chromosome with a fixed number of seeds. The final step of the imputation script collects all data from the Beagle tool into one big file. To automate the script, a Docker container was built. In this container all required tools are preinstalled, and the imputation script runs automatically at set times through a cron job. The user only needs to place the input files in a shared input folder, and after some time the result file is available in the shared output folder.
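The conversion of haploid X chromosome calls to homozygous diploid calls can be illustrated on a single VCF data line. This is a minimal sketch and not the traineeship's actual script: the function names are hypothetical, and it assumes tab-separated VCF lines whose FORMAT column contains a GT entry.

```python
def diploidize_gt(gt):
    """Turn a haploid GT call ('0', '1', '.') into a homozygous diploid one."""
    if "/" in gt or "|" in gt:
        return gt  # already diploid, leave untouched
    return f"{gt}/{gt}"


def diploidize_line(line):
    """Rewrite the GT subfield of every sample in one VCF line."""
    line = line.rstrip("\n")
    if line.startswith("#"):
        return line  # header lines are passed through unchanged
    fields = line.split("\t")
    if len(fields) < 10:
        return line  # no FORMAT/sample columns present
    format_keys = fields[8].split(":")
    if "GT" not in format_keys:
        return line
    gt_index = format_keys.index("GT")
    for i in range(9, len(fields)):
        subfields = fields[i].split(":")
        if gt_index < len(subfields):
            subfields[gt_index] = diploidize_gt(subfields[gt_index])
        fields[i] = ":".join(subfields)
    return "\t".join(fields)


if __name__ == "__main__":
    # Made-up record: three samples, the first and third are haploid calls.
    record = "X\t100\trs1\tA\tG\t.\tPASS\t.\tGT\t1\t0/1\t."
    print(diploidize_line(record))
```

Existing diploid calls are left untouched, so the function can be applied to a merged file that mixes male and female samples.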
Abstract 2018-2019 (1): Implementation of bioinformatic scripts for detection of structural variants on hybridization-capture NGS data of FFPE-solid tumors
Cancer is one of the leading causes of death worldwide. All cancers start in a cell due to changes in genes (mutations) that cause the cell to divide and multiply uncontrollably. These mutations may occur accidentally during cell division, be caused by environmental factors or be inherited [1,3]. Large-scale studies confirmed the presence of single nucleotide variants (SNV) and structural variations (SV) in the majority of cancers [2]. Detecting these variants is crucial for clinical diagnostics and treatment. For the detection, specialized bioinformatic scripts are applied to next-generation sequencing (NGS) data of tumor DNA/RNA. The aim of this project is to implement such bioinformatic scripts and to provide proof-of-concept verification on NGS data of formalin-fixed paraffin-embedded (FFPE) tumor DNA. Finally, after validation, Docker containers of these tools/scripts are created and will be integrated in the routine NGS pipeline at AZ Delta. Three different bioinformatic scripts are installed and tested. The first tool is a Shiny app called SNPitty that allows visualization of variant call format (VCF) files. This tool supports the detection of allelic imbalances (AI) by means of heterozygosity markers. In this project SNPitty is used to interpret the presence or absence of 1p/19q co-deletion in brain tumor samples. The second script is GeneFuse, which takes FASTQ files as input and tries to detect gene fusions. The third script is BreakDancer, which detects SV from BAM files. This tool is used to detect intra- and inter-chromosomal translocations. A total of twenty-six samples are analyzed with SNPitty. GeneFuse is run on nineteen samples (sixteen negative and three positive). No fusions were detected in the negative control samples. In two of the three positive samples an ALK fusion was found with a low number of supporting reads. The same samples used for GeneFuse are analyzed with BreakDancer.
With the current sequencing method, the analysis by GeneFuse (with predefined parameters) is not sensitive enough to detect gene fusions reliably. Future projects could address this by changing the sequencing method and/or the predefined parameters of GeneFuse. The output of a BreakDancer analysis shows the chromosomal breakpoints with their positions, but not which genes are involved. Finding the genes requires further annotation of the positions. BreakDancer is therefore less user-friendly and more time-consuming than GeneFuse.
References
[1] How cancer starts. (2014, October 27). Retrieved June 10, 2019, from Cancer Research UK website: https://www.cancerresearchuk.org/about-cancer/what-iscancer/how-cancer-s...
[2] Macintyre, G., Ylstra, B., & Brenton, J. D. (2016). Sequencing Structural Variants in Cancer for Precision Therapeutics. Trends in Genetics, 32(9), 530–542. https://doi.org/10.1016/j.tig.2016.07.002
[3] What Is Cancer? [CgvArticle]. (2007, September 17). Retrieved June 10, 2019, from National Cancer Institute website: https://www.cancer.gov/aboutcancer/understanding/what-is-cancer
Abstract traineeship (advanced bachelor of bioinformatics) 2017-2018 1: The development of an application for chimerism research after a hematopoietic stem cell transplantation
Leukemia patients need a bone marrow transplant so that they can develop healthy blood cells again. From that moment on, the blood cells of the patient/host should contain the DNA of the donor, meaning healthy blood cells are being produced by the host.
Chimerism is the occurrence of two sets of DNA in one organism. The higher the chimerism in a patient, the more donor DNA is active. After a transplant the patient should have a high degree of chimerism.
The molecular laboratory of AZ Sint-Jan measures chimerism in a leukemia patient in order to check if a transplant worked for the patient. Multiple measurements are performed to provide a follow-up for the patient.
The tool takes data generated with GeneMapper 5 and determines the number of peaks for donor, patient and sample per marker. The type of the marker is assigned based on the peaks. Type I markers have no shared alleles between donor and host, type II markers use a shared and a non-shared allele to calculate chimerism, and type III markers are not informative. When stutter interference is present, a stutter ratio is calculated, which is used as a criterion to exclude markers from the calculation of the total percentage chimerism.
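The marker typing described above can be sketched as a small function. This is a simplified illustration, not the tool's code: the function name is hypothetical, alleles are represented by their repeat numbers, and a marker is assumed to be non-informative (type III) when donor and host carry exactly the same alleles.

```python
def marker_type(donor_alleles, host_alleles):
    """Classify one marker as type 'I', 'II' or 'III'.

    I   : donor and host share no allele
    II  : at least one shared and at least one non-shared allele
    III : identical allele sets, so the marker is not informative
    """
    donor, host = set(donor_alleles), set(host_alleles)
    if not donor or not host:
        raise ValueError("both donor and host need at least one allele")
    if not donor & host:
        return "I"
    if donor == host:
        return "III"
    return "II"


if __name__ == "__main__":
    # Made-up allele calls for three markers.
    print(marker_type([12, 14], [15, 17]))  # no shared allele
    print(marker_type([12, 14], [14, 16]))  # allele 14 is shared
    print(marker_type([12, 14], [12, 14]))  # identical profiles
```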
The result of the calculations is a total percentage for all type I markers, all type II markers and both type I & type II markers. A table is shown where values can be checked for an alternative total calculation. The tool can produce a report and a raw data text file of the results.
An additional part of the project was the validation of the tool. Dr. F. Nollet provided data, which was processed by both the lab software and the webtool. The results were then compared to check the tool. The validation had to be run three times. In the first validation run, stutter ratios were not detected. This error was solved by adjusting the standard stutter range from 0.2 to 0.5. The second validation run revealed that stutter ratios were calculated on irrelevant peaks. Extra lines of code were added to exclude these peaks. The third and last validation run revealed no further errors, and the tool slightly outperforms the lab software when calculating stutter of the amelogenin marker. It was concluded that the tool is validated with GeneMapper data and is ready to use.
Abstract 2018-2019 (2): BIOINFORMATICS AND GENOMIC ANALYSIS OF ALLERGY GENES IN THE PEANUT GENOME
During this traineeship different open source tools were tested on data generated by two nanopore sequencing experiments. These experiments were performed with the MinION nanopore DNA sequencer from Oxford Nanopore Technologies (ONT). The MinION sequencer is a portable device weighing less than 100 g. The heart of the device is a flow cell that bears up to 2048 nanopores. These nanopores are controlled in groups of 512 by an application-specific integrated circuit (ASIC) (Jain, Olsen, Paten, & Akeson, 2016). DNA sequencing is performed by adding sample to the flow cell. DNA molecules passing through or near a nanopore cause a change in the magnitude of the current in the nanopore. This change in current is measured by a sensor. The data is passed to a microchip (the ASIC), and data processing is done by the MinKNOW software, which handles data acquisition and analysis (Lu, Giordano, & Ning, 2016). The first experiment was to identify a bacterial contaminant from a lab that could not be easily identified with standard biochemical tests. DNA was extracted from a bacterial culture and the sample was prepared for sequencing on the MinION according to the 1D Genomic DNA by Ligation (SQK-LSK109) protocol. The second nanopore sequencing experiment was performed on DNA from cultivated peanut (Arachis hypogaea). For QC on the data of both experiments two open source tools were tested. One of these tools is NanoR, a package for the cross-platform statistical language and environment R that was designed to simplify and improve nanopore data visualization (Bolognini, Bartalucci, Mingrino, Vannucchi, & Magi, 2019). The other tool that was tested for QC is NanoPlot. NanoPlot is part of NanoPack, a set of Python scripts for visualizing and processing long-read sequencing data (De Coster, D’Hert, Schultz, Cruts, & Van Broeckhoven, 2018).
The open source tool Kraken 2 was used to try to identify the bacterial contaminant from the first experiment. Kraken is a program for assigning taxonomic labels to metagenomic DNA sequences. It achieves fast classification speeds and high accuracy by using exact alignment of k-mers (Wood & Salzberg, 2014). The output of Kraken 2 was visualized with Krona, another open source tool, which visualizes relative abundances and confidences within the complex hierarchies of metagenomic classifications (Ondov, Bergman, & Phillippy, 2011). GraphMap was used for mapping the reads of the second nanopore sequencing experiment. GraphMap is a mapping algorithm designed to analyse nanopore sequencing reads. It progressively refines candidate alignments to robustly handle potentially high error rates and uses a fast graph traversal to align long reads with speed and high precision (Sović et al., 2016).
Abstract traineeship (advanced bachelor of bioinformatics) 2017-2018 2: Bio-informatic study of the gene NLRP7 and the possible role in Inflammatory Bowel Disease
This study concerns the gene NLRP7 and the mutations in this gene that may play a role in the development of inflammatory bowel disease (IBD). It was also examined which amino acids have to be changed to introduce these mutations in the mouse genome.
The research started with making a list of all genes in the NLR gene family and the different isoforms of NLRP7. For each protein the list contains the subfamily, the symbol, the full name, aliases, the chromosome, the start and end location on the chromosome and the protein identifier. This was done for both the human and the mouse genome. A bit more than 50 members were found in both genomes. Then the flanking genes were studied, to find possible duplications that happened during evolution or to find closely related genes. Remarkably, the mouse genome does not contain NLRP7. Based on the NLR family list, a multiple sequence alignment of all protein sequences of the different family members was performed with the tool MUSCLE. This was done to study conserved domains between the family members. It also has the advantage that the regions that can contain the possibly deleterious mutations in NLRP7 can be compared between the different family members. Next, a phylogenetic tree was made from the MUSCLE output with the application Simple Phylogeny. The proteins NLRP2 and NLRP7 were found to be closely related, so NLRP2 can be used as an alternative for NLRP7. Then a pairwise alignment of some proteins in the NLR* family was performed (at the EBI). Next, the protein structure was predicted. First the Protein Data Bank (PDB) was searched for homologues of the NLRP7 protein by blasting the protein against the RCSB protein database. Some homologues of the protein were found, and these were loaded into PyMOL. These models give an impression of what the protein looks like. To get more reliable results, the NLRP7 protein was modelled with several tools.
The first tool used was SWISS-MODEL, which models the protein against one other similar protein in the protein database. Here it was noticed that few proteins in the protein database have a high sequence similarity; the highest was 5IRL, a NOD2 protein in rabbits, with 33% sequence identity. To overcome this problem a second tool was used: MODELLER. This tool has the advantage that multiple proteins or parts of proteins can be used for modelling another protein such as NLRP7. The algorithm calculates several models (the number is adjustable), from which the best are chosen by inspecting their Ramachandran plots in RAMPAGE. These models can be loaded into PyMOL, where they can be aligned with other proteins and/or analyzed. Next, a differential expression analysis was performed to investigate whether the gene NLRP7 is differentially expressed in the IBD population versus the non-IBD population (control). For this analysis, GEO datasets were first searched on the NCBI website and then analyzed further in R with a script. The problem was that few good datasets with clear results were available; an example is datasets with much variation in the control group. One dataset gave results with somewhat higher, though still low, significance; this can be checked against the GEO profiles, which give some indication. For the mouse, no good datasets or profiles comparing IBD versus non-IBD mice were found. The evolutionary history of the genes NLRP2 and NLRP7 was also investigated. This was done by making a multiple sequence alignment with MUSCLE of the NLRP2 and NLRP7 proteins of different species. Next, the UCSC genome browser was used to check for a number of organisms whether they contain the genes NLRP7 and/or NLRP2.
This also gives information on where the genes could have originated. From these results an evolutionary tree of the different classes was made. Lastly, post-translational modifications were investigated on prediction servers, mostly on the server “elm.eu”. There it was predicted whether the positions of the possibly deleterious mutations can undergo post-translational modification, and what happens when the mutations are introduced into the protein. The conclusion is that modifying the mouse gene Nlrp2 can be attempted, and it is known which amino acids have to be changed. Introducing the mutation can give a better understanding of its role in IBD, and possibly help in the development of medicine.
Abstract traineeship (advanced bachelor of bioinformatics) 2016-2017: Development of a webtool for chimerism calculations after stem cell transplantation
Stem cell transplantations are a widely used treatment for diseases like leukemia. To make sure the new stem cells are used to grow healthy white blood cells, the patient needs to be tested at regular intervals. To determine whether the treatment was successful, the chimerism percentage must be calculated. This percentage corresponds to the number of blood cells derived from the new stem cells (of the donor) versus the number of blood cells derived from the original stem cells of the patient.
Data is collected via a variety of biochemical techniques based on the variable number of tandem repeats (VNTR) in genes. These can vary between people and are therefore useful to identify the origin of the blood cells. The chimerism percentage can be calculated from this data via a formula. There are different formulas depending on the VNTR type.
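As an illustration of such a formula, the sketch below covers only the simplest case, in which donor and patient share no allele at a marker, so every peak can be attributed to one of them. It uses the commonly applied ratio of donor peak area to total peak area. This is an assumption for illustration and not necessarily the formula implemented in QuickChim; the function name and the peak areas are hypothetical.

```python
def percent_donor(donor_peak_areas, patient_peak_areas):
    """Percent donor chimerism for a marker without shared alleles.

    donor_peak_areas   : areas of the donor-specific peaks in the sample
    patient_peak_areas : areas of the patient-specific peaks in the sample
    """
    donor_total = sum(donor_peak_areas)
    patient_total = sum(patient_peak_areas)
    total = donor_total + patient_total
    if total == 0:
        raise ValueError("no peak area found for this marker")
    return 100.0 * donor_total / total


if __name__ == "__main__":
    # Made-up peak areas from a follow-up sample.
    print(percent_donor([600.0, 300.0], [60.0, 40.0]))
```

Markers where donor and patient share an allele need a different formula, because the shared peak contains signal from both.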
However, these calculations are still done manually in some laboratories. This is time-consuming and there is always a chance of human error. The goal of this project was to develop a webtool, called QuickChim, that executes the chimerism calculations automatically to speed up the follow-up process.
QuickChim is a webtool written in PHP, HTML and CSS. The information about the kits used to analyze the samples is stored in a MySQL database.
To calculate the percent donor chimerism, the user needs to upload three files: one for a sample taken from the donor at the start of the treatment (the donor reference file), one for a sample taken from the patient at the start of the treatment (the patient reference file) and one for the sample taken for follow-up analysis. These files contain information about every peak detected by the biochemical technique used to analyze the sample. The important information for this tool is the color, size and area of each peak.
Once the files are uploaded and the parameters are set, the user can submit the data and QuickChim calculates the percentage chimerism for every VNTR. When the calculations are successful, two buttons are generated. The first produces a PDF report of the results, which can be downloaded and used to report the results to the doctor in charge of the treatment. The second lets the user download a text file containing the parameters used for the analysis, together with tables of the information used from all three uploaded files.