Howest, Biomedical Laboratory Technology and Bioinformatics programme
Abstract Advanced bachelor of Bioinformatics 2020-2021 (1): VALIDATION AND OPTIMIZATION OF A WORKFLOW FOR TARGETED NANOPORE SEQUENCING (the Bioinformatics Knowledge Center (BiKC) of Howest)
During this traineeship an optimized workflow for targeted nanopore sequencing in Nextflow was developed, validated and optimized. Firstly, the individual tools had to be installed and tested. These tools can be split up in different parts of the workflow: basecalling, QC, mapping and variant calling. Sequencing was performed on a MinION single-use flongle flow cell from Oxford Nanopore Technologies. Fast basecalling was done during the sequencing using Guppy basecalling within the MinKNOW software. This converts fast5 files from the nanopore sequencer into fastq files. There is also the option to use the high accuracy basecaller. This option takes much longer but yields a higher accuracy and was additionally performed after the sequencing run.
The quality control tools are tools that output plots to visualize the quality of the data. In most cases the filtering is already done by the MinKNOW software itself. Default plots are generated by this software, however it is also favourable to make use of other tools like NanoPlot, MinIONQC and NanoR to include plots generated by these tools in a custom made, standard output report. The mapping is done with the minimap2 tool and aligns the DNA sequences against a large reference genome or database. This tool is ideal for mapping long genomic reads against a human reference genome. For rather short reads like the PCR fragments used in this project it took less than 2 minutes to map the reads against a reference genome. This tool outputs in the sam format. A conversion to bam format, filtering and sorting of the reads is required using samtools.
The final step in the workflow is variant calling, for which bcftools was initially used. This software looks for differences between the reads and the reference genome and reports them in variant call format (vcf). PEPPER-Margin-DeepVariant, another variant caller, was tested locally; however, it was not possible to get this software to run on the server, so for now bcftools is used in the workflow. The workflow is written in Nextflow, a bioinformatics workflow manager that handles dependencies through built-in support for Conda, Docker, Singularity and Modules. This makes it very robust and able to run in different environments or on servers with Docker enabled. The workflow works well with the HLA datasets from the Belgian Diabetes Registry and also with a targeted sequencing dataset from another project. In conclusion, this workflow can be used for other targeted sequencing analyses in the future.
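The mapping, conversion and variant calling steps described above can be sketched as a list of command lines. This is only an illustrative sketch in Python, not the Nextflow workflow itself: the file names, the `prefix` argument, the mapping quality cut-off of 20 and the choice to drop unmapped reads are assumptions, since the abstract does not state the actual filter settings.

```python
import subprocess


def build_commands(reads, reference, prefix, min_mapq=20, threads=4):
    """Return the command lines for mapping, BAM handling and variant calling.

    All output names are derived from the hypothetical `prefix`; `min_mapq`
    is an assumed filter, not a value taken from the project.
    """
    sam = f"{prefix}.sam"
    bam = f"{prefix}.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    pileup = f"{prefix}.pileup.bcf"
    vcf = f"{prefix}.vcf.gz"
    return [
        # Map nanopore reads; -a writes SAM, -x map-ont selects the ONT preset.
        ["minimap2", "-t", str(threads), "-a", "-x", "map-ont", "-o", sam, reference, reads],
        # Convert to BAM, dropping unmapped reads (-F 4) and low-MAPQ reads (-q).
        ["samtools", "view", "-b", "-F", "4", "-q", str(min_mapq), "-o", bam, sam],
        # Sort by coordinate and index, as required by the variant caller.
        ["samtools", "sort", "-o", sorted_bam, bam],
        ["samtools", "index", sorted_bam],
        # Pile up the reads against the reference, then call variants only (-v).
        ["bcftools", "mpileup", "-Ou", "-f", reference, "-o", pileup, sorted_bam],
        ["bcftools", "call", "-m", "-v", "-Oz", "-o", vcf, pileup],
    ]


def run_all(commands):
    """Run each command in order and stop at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Calling `run_all(build_commands("reads.fastq", "ref.fa", "sample1"))` would execute the six steps in order, provided minimap2, samtools and bcftools are installed.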
Abstract Advanced bachelor of Bioinformatics 2020-2021 (2): VALIDATION OF A NANOPORE SEQUENCING WORKFLOW FOR DE NOVO ASSEMBLY OF BACTERIAL GENOMES AND PLASMIDS (the Bioinformatics Knowledge Center (BiKC) of Howest)
At the Howest BiKC, sequencing is done using third generation sequencing technology. The platform used is a MinION sequencer from Oxford Nanopore Technologies. To automate the processing of the data, a Nextflow script was developed to automatically run all steps of the de novo assembly process. To validate this Nextflow script, it is used on both a bacterial genome dataset and on a plasmid dataset. An overview of the full workflow used in the Nextflow script can be seen in Figure 1.
Currently, the quality control performed on the data is done by the MinKNOW software. This built-in QC is already a good QC but certain statistics are not included. Because of this, a separate QC program should be added to the workflow. Nanopack and MinIONQC were tested as potential QC programs for nanopore sequencing. Both programs have very similar plots. An advantage of Nanopack is that it also generates an interactive html report. However, MinIONQC is selected because it can work directly on the sequencing summary file. All plots of this program are saved as separate png-files. To make a report out of these separate plots, a python script is developed to write a Word file that incorporates some quality data and all plots generated by MinIONQC. This script will then save the Word file as docx as well as save a pdf version of the report. Both the script as well as the QC program can be added to the workflow to generate more in depth QC reports.
A first dataset contains two barcoded bacterial genome samples. When the Nextflow script runs, the Flye algorithm is used to make an initial assembly. This assembly is then polished using Racon and Medaka. Lastly, it is annotated with gene information using Prokka. The genome sequences are used as input for the “Type Strain Genome Server” (TYGS) to try to determine the species from which the samples were taken. To compare the performance of another assembler to Flye, the Canu assembler was used. Canu is an alternative assembler that produced very good results. Because of these positive results, Canu is added as an alternative to Flye in the Nextflow script. Finally, several samples are run on the TYGS server together to see how closely related the species of the different samples are. The resulting phylogenetic tree showed very close relatedness for all but one sample.
A second dataset consists of twelve plasmids that were run in duplicate, giving a total of 24 barcoded samples. Initially these plasmids are assembled using the standard Nextflow script. However, the Flye assemblies are of very poor quality and the assemblies of the replicate samples differ in length. A paper detailing the performance of different assemblers is used as a guide for testing several assemblers on plasmid assembly. Firstly, Canu is used, as the paper states that Canu is the best assembler for plasmids. However, the resulting Canu assemblies are of very poor quality, with both replicate assemblies once again differing in length. Next, Miniasm is used in combination with Racon polishing, as the paper states that this combination produces the best circular sequences. Indeed, the assemblies of the two replicates are of very good quality and show very close resemblance. As a result of this test, a new Nextflow script is developed specifically for plasmid assembly, using Miniasm as assembler and Racon as polisher.
Using the new plasmid workflow, all samples are analyzed and the resulting assemblies are aligned to the reference sequences. The assemblies resemble the reference sequences well, with similarity between 85% and 100%. The alignment is done using BRIG, because plasmids are circular and an algorithm is needed that accounts for this circularity. However, the assembly algorithm could still be improved to address the two big hurdles of plasmid assembly: the short sequence length and the circularity.
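To illustrate why circularity matters when an assembly is compared with its reference: an assembler may start the sequence at any position on the circle, and on either strand, so a plain linear comparison can report a mismatch for an identical plasmid. The toy sketch below is not what BRIG does; it only checks for exact identity up to rotation and strand, and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def is_rotation(a, b):
    """True if b can be obtained by rotating the circular sequence a."""
    # Every rotation of a occurs as a substring of a written twice in a row.
    return len(a) == len(b) and b in a + a


def same_circular_sequence(assembly, reference):
    """True if both sequences describe the same circle, on either strand."""
    return is_rotation(reference, assembly) or is_rotation(
        reverse_complement(reference), assembly
    )
```

Real assemblies contain sequencing errors, so in practice an alignment-based comparison is needed; the sketch only shows the rotation problem itself.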
The reads are also mapped to the references using Minimap2. The resulting IGV views show quite a lot of variant nucleotides throughout the sequences. To see whether these variants are caused by the fast basecalling, the reads are basecalled again using high accuracy basecalling. The resulting view shows fewer variants in the sequences, but some are still present. These may be true variants within the plasmids or errors in the basecalling. While the assemblies are of good enough quality, the assembly algorithm could still be improved upon. However, when a reference sequence is available, mapping should still be preferred as it shows the variants within the plasmids.
Abstract Advanced bachelor of Bioinformatics 2020-2021 (3): VALIDATION AND OPTIMIZATION OF MULTIPLE IMPUTATION FLOWS FOR SNP INTERPRETATION (Emma.Health, Evelyn Verlinde)
More and more people are becoming interested in knowing what their body really needs and what to do to maintain the best healthy lifestyle. By looking at the DNA, important information can be found. The used method to analyse the DNA in this study is a combination of genotyping by Illumina array and imputation. Genotype imputation estimates the genotypes of unobserved variants using the genotype data of other observed variants based on a collection of haplotypes from thousands of individuals, which is known as a haplotype reference panel. Multiple imputation servers exist. We compared and analysed Beagle and the Michigan imputation server, both using the same 1000 genomes phase 3 reference panel.
There are three major research questions. Firstly, what is the optimal sample size to impute? Multiple runs with different numbers of samples were analysed with both Beagle and the Michigan imputation server. When checking the data, the first thing to look at is whether the SNP is imputed (IMP) or typed (TYP). The typed SNPs are always correct. The imputed SNPs need further examination. If the R2 quality score of an imputed SNP is higher than 0.8, we can accept the SNP. If it is lower, we cannot trust the result. By doing this analysis with different sample sizes we can see whether the R2 values change. The more samples we use to impute, the higher the R2 value gets, which in turn means better results. The second question is what to do when an R2 value is below 0.8. As mentioned above, we can try to use more samples. It can also be useful to include a sample with heterozygous alleles in the DNA. This further increases the R2 value. Lastly, we want to discuss what happens when we sequence some regions of interest and compare these with the imputed results. The primers are designed with the OpenPrimer tool. After this, DNA is extracted from saliva, followed by a PCR to amplify the targeted DNA. Of the nine primer pairs, seven were able to amplify the region of interest. Primer pairs one and three did not work and primer pair seven gave low quality results. The last step is to make a library of the samples and use it for targeted sequencing. We used the Flongle for the targeted sequencing. Flongle is an adapter for MinION that enables direct, real-time DNA sequencing on smaller, single-use flow cells. The sequenced data and the imputed data are compared. Minor differences were found, but these need further investigation.
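The acceptance rule described above (typed SNPs are always accepted, imputed SNPs only when R2 is higher than 0.8) can be sketched as a small filter on the INFO column of an imputed VCF record. The key names are assumptions: Beagle is assumed here to report the score as `DR2` with an `IMP` flag, and the Michigan server as `R2` with an `IMPUTED` flag; they should be checked against the actual output files.

```python
def parse_info(info):
    """Turn a VCF INFO string such as 'DR2=0.93;AF=0.12;IMP' into a dict."""
    fields = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, _, value = item.partition("=")
        # Flags without a value (e.g. IMP) are stored as True.
        fields[key] = value if value else True
    return fields


def accept_variant(info, threshold=0.8,
                   r2_keys=("DR2", "R2"), imputed_flags=("IMP", "IMPUTED")):
    """Apply the rule from the text: typed SNPs pass, imputed need R2 > threshold."""
    fields = parse_info(info)
    if not any(flag in fields for flag in imputed_flags):
        return True  # typed SNP: taken as correct
    for key in r2_keys:
        if key in fields:
            return float(fields[key]) > threshold
    return False  # imputed but no quality score: do not trust it
```

For example, `accept_variant("DR2=0.55;IMP")` rejects the SNP, which is the case where the text suggests imputing again with more samples.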
Abstract Bachelor Project 1 (FBT) 2020-2021: Optimisation of targeted sequencing workflows using nanopore technology
DNA sequencing has many useful applications. Because of this, there have been many developments, which led to third generation sequencing. In this research, sequencing technology from Oxford Nanopore Technologies, one of the third generation sequencing platforms, is used to perform targeted DNA sequencing. During this project, two workflows for targeted sequencing are optimized. The first case handled the sequencing of human leukocyte antigens for the prediction of Type I diabetes. This is in partnership with the Belgian Diabetes Registry. The second case dealt with the detection of single nucleotide polymorphisms in order to predict diseases. This is in partnership with EMMA.health.
Both cases are worked out from human DNA extraction from buccal swabs through to DNA sequencing. Before sequencing, the genes of interest were first amplified by PCR. From these amplicons, libraries can be made which can be used for sequencing. For the HLA case, the reproducibility of primers found in literature was tested for amplifying the HLA class I genes. In addition, newly developed primers were tested to amplify an HLA class II gene, namely HLA-DQA1. For the SNP case, the reproducibility of already existing primer mixes was tested. Furthermore, newly developed primers were tested to amplify new regions of interest.
Using available polymerases and primers from literature, four out of five amplicons were successfully reproduced. Using the self-designed primers, the desired amplicons were also successfully generated. After PCR amplification, libraries were prepared with HLA-A, as well as with twelve HLA class II samples that were received from the external partner, the Belgian Diabetes Registry. These samples were only 240 bases long. They were sequenced as proof of concept, since these fragments had already been analyzed by the external partner. After running the libraries, sufficient reads for analysis were obtained for every sample. The SNP regions were successfully generated using already existing primer mixes. New primers were also tested, whereby seven out of nine new regions of interest were successfully amplified. After two multiplex PCRs, a library was constructed for sequencing. From these results, it could be seen that five out of eight regions were amplified from primer mix 1 and six out of eight regions from primer mix 2.
Two optimized protocols using targeted sequencing are now ready for use, although further optimization may be required to get better results. Using PCR, both HLA and SNP regions can successfully be amplified. However, certain primer pairs will need further optimization. Oxford Nanopore Technologies sequencing is perfectly capable of sequencing the regions of interest, such as the HLA genes, at a cost-efficient rate using long reads. It was also shown that short reads can be used, although longer reads are preferred because long fragments are sequenced better by the nanopores than shorter fragments and more information can be generated from long reads.
Abstract Bachelor Project 2 (FBT) 2020-2021: OPTIMISATION OF WHOLE GENOME SEQUENCING WORKFLOW USING NANOPORE TECHNOLOGY
Whole genome sequencing is becoming more and more affordable thanks to newly developed technologies, the so-called third generation sequencing. In this project whole genome sequencing using Oxford Nanopore Technologies was used to sequence DNA from bacteria and SARS-CoV-2. The goals of this research are to optimize DNA extraction and sample preparation for nanopore sequencing, and to validate whether a MinION with a Flongle flow cell is a viable way to perform DNA sequencing, by comparing it to other sequencing technologies such as Illumina sequencing.
Optimisation of the workflow mainly happened by optimizing the DNA extraction on Streptomyces bacteria. These bacteria produce a secondary metabolite of interest, but their genome has not been fully sequenced yet, so gene cluster mining cannot be performed yet. The extraction is done using the Wizard® genomic DNA purification kit.
SARS-CoV-2 is a relatively new virus. A lot of research is happening at the moment because it causes a very contagious and deadly disease. A hospital asked Howest for help to validate their workflow for detection of SARS-CoV-2 variants. By comparing the results of the samples sequenced with nanopore technology at Howest with the results obtained with Illumina technology at the hospital, both workflows could be validated, since the resulting variants were the same for all samples.
The third goal is to sequence bacteria from infections in newborn children. The aim of this study is to determine whether or not the infections are linked to each other. It is possible that the same bacteria infected multiple babies or that the infections are separate cases. To determine this, whole genome sequencing is used and a phylogenetic tree is constructed to determine whether the bacteria are the same type or not.
This project shows that whole genome sequencing using Oxford Nanopore Technologies is a viable option, provided that the DNA samples are high enough in concentration and of good quality.
Abstract 2019-2020 (BIT1): Identification and genome assembly of microorganisms using nanopore sequencing (the Bioinformatics Knowledge Center (BiKC) of Howest)
Quick tracking and identification of unwanted microorganisms spreading through research labs and medical facilities has always been of the utmost importance. A Flemish hospital was confronted with a bacterial contamination in their neonatal care unit. Using conventional bacterial identification techniques the genus was determined to be Enterobacter. However, a more specific classification is required to combat this infestation and to establish if only one or multiple strains are in play. To this end, DNA was extracted from single clones grown from the bacterial contaminants and sequenced using a MinION device from Oxford Nanopore Technologies (ONT). Having the sequencing data, a quest for bioinformatics tools to correctly identify microorganisms, compare between samples and even assemble the genomes of these bacteria was initiated. In a first step, multiple quality control (QC) tools were tested and scored based on their ease of installation, ease of use and the received output. One QC tool did stick out above the rest: NanoComp [1]. With a successful QC, classification of the samples started with Kraken 2 as the main tool [2]. Two databases were used: MiniKraken2 (version 2, an 8 Gb standard Kraken 2 database) and a custom RefSeq bacterial database built on the bioinformatics server of Howest. Two tools were tested to visually represent the output; one of them proved to be far more superior for this case. Kraken 2 resulted in a more specific classification of species and even identification of subspecies found in the different samples. One distinct species of Enterobacter was largely present in all samples: E. hormaechei. However, declaring to which subspecies of E. hormaechei each sample belongs was not yet possible. To help with the further identification, genome assembly was attempted using multiple tools. These were again tried for ease of installation, usage and the received output. 
One of the assemblers came out on top, which was consistent with the current literature: Flye [3]. Advantages of Flye are the great speed with which it operates and the GFA file included in the output. This file allows visualisation of the assembly when using appropriate software. The initial assembly constructed by Flye was then polished using Racon followed by Medaka, as suggested by ONT [4]. This resulted in a final consensus fasta file for each sample. A last step with the genome assemblies produced by different tools is to feed them to classification and identification software. Kraken 2 was used again in order to compare results before and after assembly. Type (Strain) Genome Server (TYGS) and Genome-to-Genome Distance Calculator (GGDC) were some of the tested web-based tools which proved to be very insightful; these two were specifically interesting to compare the samples among each other [5], [6]. Throughout this identification quest many bioinformatics tools passed the stage. Some were found to have too intricate installation procedures with too many dependencies, while others were lacking in ease of use and output. However, for each step in the analysis process the right software was found and a workflow was set up with the chosen tools. Thus the proposed workflow allows rapid identification of microorganisms, even at subspecies level, and allows for an easy strain comparison between samples.
References [1] W. De Coster, S. D’Hert, D. T. Schultz, M. Cruts, and C. Van Broeckhoven, “NanoPack: Visualizing and processing long-read sequencing data,” Bioinformatics, vol. 34, no. 15, pp. 2666–2669, 2018, doi: 10.1093/bioinformatics/bty149. [2] D. E. Wood, J. Lu, and B. Langmead, “Improved metagenomic analysis with Kraken 2,” Genome Biol., vol. 20, no. 1, pp. 1–13, 2019, doi: 10.1186/s13059-019-1891-0. [3] M. Kolmogorov, J. Yuan, Y. Lin, and P. A. Pevzner, “Assembly of long, error-prone reads using repeat graphs,” Nat. Biotechnol., vol. 37, no. 5, pp. 540–546, 2019, doi: 10.1038/s41587-019-0072-8. [4] Oxford Nanopore Technologies, “microbial-genome-assembly-workflow,” 5th May 2020. . [5] J. P. Meier-Kolthoff and M. Göker, “TYGS is an automated high-throughput platform for state-of-the-art genome-based taxonomy,” Nat. Commun., vol. 10, no. 1, Dec. 2019, doi: 10.1038/s41467-019-10210-3. [6] J. P. Meier-Kolthoff, A. F. Auch, H. P. Klenk, and M. Göker, “Genome sequence-based species delimitation with confidence intervals and improved distance functions,” BMC Bioinformatics, vol. 14, 2013, doi: 10.1186/1471-2105-14-60. [7] R. R. Wick, M. B. Schultz, J. Zobel, and K. E. Holt, “Bandage: Interactive visualization of de novo genome assemblies,” Bioinformatics, vol. 31, no. 20, pp. 3350–3352, 2015, doi: 10.1093/bioinformatics/btv383.
Abstract 2019-2020 (BIT2): Nanopore sequencing analysis in the Bioinformatics Knowledge Center (BiKC) of Howest
Nanopore sequencing is a portable third generation sequencing technology. Third generation sequencing technologies are characterized by their ability to sequence long reads in real time. In this case the MinION was used to sequence two types of samples: a microbiome sample (whole genome and 16S) and ten plasmids from GeneCorner. For both datasets, different tools for quality control (QC), mapping and analysis of the data were compared in order to find the best tools for our needs.
The different QC tools that were tested are: NanoComp, NanoPlot and longread_plots. To map the WGS data a lot of mapping tools were tested: tophat2/bowtie2, Hisat2, kraken2, Qiime2 and minimap2. Kraken2 was tested with two different reference databases to see the effects on the results. The results were visualized using Krona and Pavian. For the second project, ten plasmids were analyzed using the QC tools from the first project and reads were mapped with minimap2.
Out of the ten analyzed plasmids, nine had at least one variation. Most of the time these were point mutations. In one sample two deletions were present. These are often not found with the current methods since we are sequencing the entire plasmid sequence at once while most methods only determine partial sequence or use restriction enzyme pattern analysis. For the microbiome data both methods (WGS and 16S) gave the same enterotype (Firmicutes) but there were large differences in the distribution of the species between both methods.
Abstract 2019-2020 (BIT3): Automated bioinformatics workflow for imputation of clinical relevant variants in the context of preventive health care
The goal of the traineeship was to develop an automated workflow that starts from data retrieved through Nanopore or Illumina sequencing. The result of the workflow is a file with imputed data. This data can then be used to deduce unknown genotypes for risk genes and to provide targeted preventive healthcare. The workflow must be as simple as possible, with little to no human intervention in the process, so that it can be carried out by people with only basic computer knowledge. The workflow itself contains four steps, which are combined in one imputation script. The first step is to convert the input files into the right format for the imputation. This is a two-step process. It starts by converting the idat files to gtc files with Illumina gencall. Afterwards, these gtc files are converted to vcf files with a Python script also written by Illumina. The second step is preparing the data for imputation. It starts with renaming the files from the Array Picker code to the sample name. Afterwards all the samples are merged into one big file, from which the X chromosome is extracted. In the data of the X chromosome, all haploid data is converted to homozygous diploid data. Step three is the actual imputation, done with the Beagle tool. This tool performs the imputation chromosome by chromosome with a fixed number of seeds. The final step of the imputation script collects all data from the Beagle tool into one big file. To automate the script, a Docker container was built. In this container all required tools are preinstalled, and the imputation script runs automatically at set times through a cron job. The user only needs to place the input files in a shared input folder and after some time the result file is available in the shared output folder.
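The conversion of haploid X chromosome calls to homozygous diploid calls, mentioned in the second step, could look roughly like the sketch below. It is not the project's script: it assumes a tab-separated VCF in which GT is the first FORMAT key (as the VCF specification requires when GT is present), and the function names are hypothetical.

```python
def haploid_to_homozygous(sample_field):
    """Rewrite a haploid genotype ('1', '0:35', '.') as homozygous diploid."""
    parts = sample_field.split(":")
    genotype = parts[0]
    # Diploid calls contain a '/' or '|' separator and are left untouched.
    if "/" not in genotype and "|" not in genotype:
        parts[0] = f"{genotype}/{genotype}"
    return ":".join(parts)


def fix_vcf_line(line):
    """Apply the conversion to every sample column of one VCF data line."""
    if line.startswith("#"):
        return line  # header lines are passed through unchanged
    columns = line.rstrip("\n").split("\t")
    # Columns 0-8 are the fixed VCF fields plus FORMAT; samples start at 9.
    columns[9:] = [haploid_to_homozygous(sample) for sample in columns[9:]]
    return "\t".join(columns) + "\n"
```

Applied line by line to the extracted X chromosome file, this leaves female (diploid) calls as they are and doubles the male (haploid) calls.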
Abstract 2018-2019 (BIT1): Implementation of bioinformatic scripts for detection of structural variants on hybridization-capture NGS data of FFPE-solid tumors
Cancer is one of the leading causes of death worldwide. All cancers start in a cell due to changes in genes (mutations) that cause the cells to divide and multiply uncontrollably. The mutations may occur accidentally during cell division, be caused by environmental factors or be inherited [1], [3]. Large-scale studies confirmed the presence of single nucleotide variants (SNV) and structural variations (SV) in the majority of cancers [2]. Detecting these variants is crucial for clinical diagnostics and treatment. For the detection, specialized bioinformatic scripts are applied to next-generation sequencing (NGS) data of tumor DNA/RNA. The aim of this project is to implement bioinformatic scripts and provide proof-of-concept verification on NGS data of formalin-fixed paraffin-embedded (FFPE) tumor DNA. Finally, after validation, Docker containers of these tools/scripts are created and will be integrated in the routine NGS pipeline at AZ Delta. Three different bioinformatic scripts are installed and tested. The first tool is a Shiny app called SNPitty that allows visualization of variant call format (VCF) files. This tool supports the detection of allelic imbalances (AI) by means of heterozygosity markers. In this project SNPitty is used to interpret the presence or absence of 1p/19q co-deletion in brain tumor samples. The second script is GeneFuse, which takes fastq files as input and tries to detect gene fusions. The third script is Breakdancer, which detects SV from bam files. This tool is used to detect intra- and inter-chromosomal translocations. A total of twenty-six samples are analyzed by SNPitty. GeneFuse is run on nineteen samples (sixteen negative and three positive). No fusions were detected in the negative control samples. In two of the three positive samples an ALK fusion was found with a low number of supporting reads. The same samples used for GeneFuse are analyzed by Breakdancer.
With the current sequencing method, the analysis by GeneFuse (with predefined parameters) is not sensitive enough to detect gene fusions. Future projects to solve this issue could change the sequencing method and/or the predefined parameters of GeneFuse. The output of a Breakdancer analysis shows the chromosomal breakpoints with their positions, but not which genes are involved. To find this out, further annotation of the positions is required. Breakdancer is therefore less user-friendly and more time consuming than GeneFuse.
References [1] How cancer starts. (2014, October 27). Retrieved 10 June 2019, from Cancer Research UK website: https://www.cancerresearchuk.org/about-cancer/what-iscancer/how-cancer-s... [2] Macintyre, G., Ylstra, B., & Brenton, J. D. (2016). Sequencing Structural Variants in Cancer for Precision Therapeutics. Trends in Genetics, 32(9), 530–542. https://doi.org/10.1016/j.tig.2016.07.002 [3] What Is Cancer? [CgvArticle]. (2007, September 17). Retrieved 10 June 2019, from National Cancer Institute website: https://www.cancer.gov/aboutcancer/understanding/what-is-cancer
Abstract traineeship: 2017-2018 (BIT1): The development of an application for chimerism research after a hematopoietic stem cell transplantation
Leukemia patients need a bone marrow transplant so that they can develop healthy blood cells again. From that moment on, the blood cells of the patient/host should contain the DNA of the donor, meaning healthy blood cells are being produced by the host.
Chimerism is the occurrence of two sets of DNA in an organism. The more chimerism there is in a patient the more donor DNA is active. After a transplant the patient should have a high chimerism grade.
The molecular laboratory of AZ Sint-Jan measures chimerism in a leukemia patient in order to check if a transplant worked for the patient. Multiple measurements are performed to provide a follow-up for the patient.
The Advanced Bachelor of Bioinformatics of Howest developed a webtool to calculate chimerism in patients which can be freely accessed. The program is written in HTML, CSS, PHP and Javascript. This project was already started by another student last year. The interface and calculations were loaded on a single page, but chimerism was only calculated on the individual markers. There was also no possibility to upload new kits. The tool was split into multiple pages, a total percentage chimerism page was added and a couple of buttons were added to add, remove and view kits in the tool. Some changes were also made to the layout of the tool, the CSS code was altered and a header was added to the top of the pages.
The tool takes data generated with Genemapper 5 and determines the number of peaks for donor, patient and sample per marker. The type of the marker is assigned based on the peaks. Type I markers have no shared alleles between donor and host, type II markers use a shared and a non-shared allele to calculate chimerism and type III markers are not informative. When stutter interference is present a stutter ratio is calculated which is used as a criterion to exclude markers from the total percentage chimerism calculation.
The result of the calculations is a total percentage for all type I markers, all type II markers and both type I & type II markers. A table is shown where values can be checked for an alternative total calculation. The tool can produce a report and a raw data text file of the results.
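For type I markers, where donor and patient share no alleles, a commonly used calculation divides the summed donor peak areas by the summed donor and patient peak areas. The sketch below assumes that formula and assumes the total is a plain mean over the informative markers; the abstract does not state how the webtool combines markers, and type II markers and stutter handling are left out. The function names are hypothetical.

```python
from statistics import mean


def type1_donor_percentage(donor_areas, patient_areas):
    """Percentage donor chimerism for one type I marker (no shared alleles)."""
    donor = sum(donor_areas)
    patient = sum(patient_areas)
    if donor + patient == 0:
        raise ValueError("marker has no peak area")
    return 100.0 * donor / (donor + patient)


def total_donor_percentage(markers):
    """Mean over markers; each marker is a (donor_areas, patient_areas) pair."""
    return mean(type1_donor_percentage(d, p) for d, p in markers)
```

A marker with donor peaks of area 900 and 900 and patient peaks of area 100 and 100 would give 90% donor chimerism under this assumption.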
An additional part of the project was the validation of the tool. Dr. F. Nollet provided data which was processed by both the lab software and the webtool. The results were then compared to check the tool. The validation had to be run three times. In the first validation run there was an error with stutter ratios not being detected. This error was solved by adjusting the standard stutter range from 0.2 to 0.5. The second validation run then revealed that stutter ratios were calculated on peaks that were irrelevant. Extra lines of code were added to exclude these peaks. The third and last validation run revealed no other errors, and the tool slightly outperforms the lab software when calculating stutter of the amelogenin marker. It can be concluded that the tool is validated with Genemapper data and is ready to use.
Abstract 2018-2019 (BIT2): BIOINFORMATICS AND GENOMIC ANALYSIS OF ALLERGY GENES IN THE PEANUT GENOME
During this traineeship different open source tools were tested on data generated by two different nanopore sequencing experiments. These experiments were performed with the MinION nanopore DNA sequencer from Oxford Nanopore Technologies (ONT). The MinION sequencer is a portable device weighing less than 100g. The heart of the device is a flow cell that bears up to 2048 nanopores. These nanopores can be controlled by an applicationspecific integrated circuit (ASIC) in groups of 512 (Jain, Olsen, Paten, & Akeson, 2016). DNA sequencing is performed by adding sample to the flowcel. DNA molecules passing through or near the nanopore cause a change in the magnitude of the current in the nanopore. This change in current is measured by a sensor. The data is passed to a microchip (the ASIC) and data processing is done by the MinKNOW software, which deals with data acquisition and analysis. (Lu, Giordano, & Ning, 2016) The first experiment was to identify a bacterial contaminant from a lab that could not be easily identified with standard biochemical tests. DNA was extracted from a bacterial culture and the sample was prepared for sequencing on MinION according to the 1D Genomic DNA by Ligation (SQK-LSK109) protocol. The second nanopore sequencing experiment was performed on DNA from cultivated peanut (Arachis hypogaea). For QC on the data of both experiments two open source tools were tested. One of these tools is NanoR. NanoR is a package for the cross-platform statistical language and environment R that was designed with the purpose to simplify and improve nanopore data visualization. (Bolognini, Bartalucci, Mingrino, Vannucchi, & Magi, 2019). The other tool that was tested for QC is Nanoplot. NanoPlot is part of NanoPack, which is a set of Python scripts for visualizing and processing long – read sequencing data. (De Coster, D’Hert, Schultz, Cruts, & Van Broeckhoven, 2018). 
The open source tool Kraken 2 was used to try to identify the bacterial contaminant from the first experiment. Kraken is a program for assigning taxonomic labels to metagenomic DNA sequences. It achieves fast classification speeds and high accuracy by using exact alignment of k-mers (Wood & Salzberg, 2014). The output of Kraken 2 was visualized with Krona, another open source tool, which visualizes relative abundances and confidences within the complex hierarchies of metagenomic classifications (Ondov, Bergman, & Phillippy, 2011). GraphMap was used to map the reads of the second nanopore sequencing experiment. GraphMap is a mapping algorithm designed to analyse nanopore sequencing reads. It progressively refines candidate alignments to robustly handle potentially high error rates and uses fast graph traversal to align long reads with speed and high precision (Sović et al., 2016).
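The exact k-mer matching that Kraken relies on can be illustrated with a toy classifier. This is a simplified sketch and not Kraken's actual algorithm: Kraken maps each k-mer to the lowest common ancestor of the genomes that contain it, whereas the sketch below simply ignores k-mers shared between taxa and lets the remaining exact hits vote. The taxon names, reference sequences and function names are all hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Return every overlapping substring of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(references, k):
    """Map each k-mer to the set of taxa whose reference contains it."""
    index = {}
    for taxon, seq in references.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(taxon)
    return index


def classify(read, index, k):
    """Assign the taxon with the most exact k-mer hits, or None if no hit."""
    votes = Counter()
    for kmer in kmers(read, k):
        taxa = index.get(kmer, ())
        if len(taxa) == 1:  # only count k-mers unique to one taxon
            votes[next(iter(taxa))] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical miniature "genomes"; real references are millions of bases.
    references = {
        "taxon_A": "ACGTACGTTAGC",
        "taxon_B": "TTTTGGGGCCCCAAAA",
    }
    index = build_index(references, k=4)
    print(classify("GGGGCCCC", index, k=4))
```

Because every lookup is an exact dictionary hit rather than an alignment, classification time grows only with the read length, which is the property that makes k-mer based classifiers fast.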
Abstract traineeship 2017-2018 (BIT2): Bio-informatic study of the gene NLRP7 and the possible role in Inflammatory Bowel Disease
This traineeship studied the gene NLRP7 and the mutations in this gene that may play a role in the development of inflammatory bowel disease (IBD). It also examined which amino acids have to be changed to introduce these mutations in the mouse genome.
The research started with compiling a list of all genes in the NLR gene family and the different isoforms of NLRP7. For each entry the list contained the protein subfamily, symbol, full name, aliases, chromosome, start and end position on the chromosome, and protein identifier. This was done for both the human and the mouse genome. Slightly more than 50 members were found in each genome. The flanking genes were then studied to find possible duplications that occurred during evolution and to find closely related genes. Remarkably, the mouse genome does not contain NLRP7. Based on the NLR family list, a multiple sequence alignment of the protein sequences of all family members was performed with the tool MUSCLE. The aim was to study the domains conserved between the family members. An additional advantage is that the regions that may contain the possibly deleterious mutations in NLRP7 can be compared between the different family members. Next, a phylogenetic tree was built from the MUSCLE output with the application Simple Phylogeny. The proteins NLRP2 and NLRP7 were found to be closely related, so NLRP2 can be used as an alternative for NLRP7. A pairwise alignment of some proteins in the NLR family was then performed (at the EBI). Next, the protein structure was predicted. First, the Protein Data Bank (PDB) was searched for homologues of NLRP7 by running a BLAST search with the protein against the RCSB protein database. Some homologues were found and loaded into PyMOL. These structures give an impression of what the protein looks like. To get more reliable results, the NLRP7 protein was modelled with several tools.
The first tool used was SWISS-MODEL, which models the protein against one similar protein in the protein database. It turned out that few proteins in the database have a high sequence similarity to NLRP7; the best hit was 5IRL, a NOD2 protein from rabbit, with 33% sequence identity. To overcome this problem a second tool, MODELLER, was used. Its advantage is that multiple proteins or parts of proteins can be used as templates for modelling another protein such as NLRP7. The algorithm calculates several models (the number is configurable), and the best models were chosen by inspecting their Ramachandran plots in RAMPAGE. These models can be loaded into PyMOL, where they can be aligned with other proteins and analyzed further. Next, a differential expression analysis was performed to investigate whether the gene NLRP7 is differentially expressed in the IBD population versus the non-IBD (control) population. For this analysis, GEO datasets were first searched on the NCBI website and then analyzed further in R with a script. The problem was that few good datasets with clear results were available; some datasets, for example, showed a lot of variation in the control group. One dataset gave results with somewhat more significance, although it was still low; this can be verified by searching the GEO profiles, which give some indication. For mouse, no good datasets or profiles comparing IBD with non-IBD mice were found. The evolutionary history of the genes NLRP2 and NLRP7 was also investigated. To this end, a multiple sequence alignment of the NLRP2 and NLRP7 proteins from different species was made with MUSCLE. Next, the UCSC Genome Browser was used to check for a number of organisms whether they contain the genes NLRP7 and/or NLRP2.
This also indicates where the genes could have originated. From these results, an evolutionary tree with the different classes was made. Finally, post-translational modifications were investigated with prediction servers, mostly the server “elm.eu”. These servers were used to predict whether the positions where the possibly deleterious mutations occur could be modified after translation, and what happens when the mutations are introduced in the protein. The conclusion is that modifying the mouse gene Nlrp2 can be attempted, and it is now known which amino acids have to be changed. Introducing the mutation can give a better understanding of its role in IBD and may help in the development of medicines.
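The sequence identity used above to judge modelling templates (such as the 33% reported for 5IRL) is a ratio computed over a pairwise alignment. The sketch below is a minimal illustration, assuming identity is counted as identical residues divided by the number of alignment columns in which neither sequence has a gap; tools differ in the denominator they use, so the value will not necessarily match what a given server reports. The function name and example sequences are hypothetical and not taken from the study.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identical residues over columns without a gap in either sequence."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns where one sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


if __name__ == "__main__":
    # Two hypothetical aligned protein fragments (gaps shown as '-').
    print(percent_identity("MKT-LLA", "MRTALL-"))
```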
Abstract traineeship (advanced bachelor of bioinformatics) 2016-2017: Development of a webtool for chimerism calculations after stem cell transplantation
Stem cell transplantation is a widely used treatment for diseases like leukemia. To make sure that the new stem cells are producing healthy white blood cells, the patient needs to be tested at regular intervals. To determine whether the treatment was successful, the chimerism percentage must be calculated. This percentage expresses the number of blood cells derived from the new stem cells (of the donor) relative to the number of blood cells derived from the original stem cells of the patient.
Data is collected via a variety of biochemical techniques based on variable number tandem repeats (VNTRs) in genes. These vary between people and are therefore useful for identifying the origin of the blood cells. The chimerism percentage can be calculated from this data with a formula. Different formulas apply depending on the VNTR type.
However, these calculations are still done manually in some laboratories. This is time-consuming and there is always a chance of human error. The goal of this project was to develop a webtool, called QuickChim, that executes the chimerism calculations automatically to speed up the follow-up process.
QuickChim is a webtool written in PHP, HTML and CSS. The information about the kits used to analyze the samples is stored in a MySQL database.
To calculate the percent donor chimerism, the user needs to upload three files: one for a sample taken from the donor at the start of the treatment (the donor reference file), one for a sample taken from the patient at the start of the treatment (the patient reference file) and one for the sample taken for follow-up analysis. These files contain information about every peak detected by the biochemical technique used to analyze the sample. The important information for this tool is the color, size and area of each peak.
When the files are uploaded and the parameters are set, the user can submit the data and QuickChim calculates the chimerism percentage for every VNTR. When the calculations are successful, two buttons are generated. The first prints a PDF report of the results, which can be downloaded and used to report the results to the doctor in charge of the treatment. The second lets the user download a text file containing the parameters used for the analysis, together with tables containing the information used from all three uploaded files.
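The abstract does not give the formulas QuickChim uses, so the following is only a sketch of one commonly used approach, written in Python rather than the PHP of the tool itself. It assumes that, for one marker, the donor percentage is the summed peak area of alleles seen only in the donor reference, divided by the summed area of donor-only plus patient-only alleles in the follow-up sample. All function names, allele sizes and peak areas are hypothetical.

```python
def informative_alleles(donor_ref, patient_ref):
    """Split allele sizes into donor-only and patient-only sets."""
    donor_only = set(donor_ref) - set(patient_ref)
    patient_only = set(patient_ref) - set(donor_ref)
    return donor_only, patient_only


def percent_donor(followup_peaks, donor_ref, patient_ref):
    """Percent donor chimerism for one marker.

    followup_peaks maps allele size -> peak area in the follow-up sample;
    donor_ref and patient_ref list the allele sizes in the reference samples.
    """
    donor_only, patient_only = informative_alleles(donor_ref, patient_ref)
    if not donor_only or not patient_only:
        # Assumption: markers without a unique allele on both sides
        # need a different formula and are rejected here.
        raise ValueError("marker is not informative under this simple formula")
    donor_area = sum(followup_peaks.get(a, 0.0) for a in donor_only)
    patient_area = sum(followup_peaks.get(a, 0.0) for a in patient_only)
    total = donor_area + patient_area
    if total == 0:
        raise ValueError("no informative peaks in the follow-up sample")
    return 100.0 * donor_area / total


if __name__ == "__main__":
    # Hypothetical marker: donor alleles 120/132, patient alleles 124/140.
    followup = {120: 400.0, 132: 500.0, 124: 50.0, 140: 50.0}
    print(percent_donor(followup, [120, 132], [124, 140]))
```

Shared alleles are left out of both sums because their peak area cannot be attributed to either person, which is one reason different formulas are needed for different allele constellations.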
Address
Rijselstraat 5
8200 St.-Michiels Brugge
050/381277 Belgium
Contacts
Traineeship supervisor
Marjolein Vandekerckhove
marjolein.vandekerckhove@howest.be
Traineeship supervisor
Jasper Decuyper
jasper.decuyper@howest.be
Traineeship supervisor
Paco Hulpiau
paco.hulpiau@howest.be
Traineeship supervisor
Cedric Hermans
cedric.hermans2@howest.be