Instituut voor tropische geneeskunde
Abstract 2019-2020: Gene discovery in Mycobacterium tuberculosis, using omics data integration methods
Mycobacterium tuberculosis (MTB) causes tuberculosis (TB), a potentially serious bacterial disease that mainly affects the lungs. TB is among the top 10 causes of death worldwide and is the
leading cause of death from a single infectious agent. In 2018, approximately 10 million people became ill with TB, of whom 1.5 million died. A wide range of anti-TB drugs already exists, and these drugs
are effective against most TB strains in the world. However, multidrug-resistant TB remains a public health crisis and a major threat. For that reason, research on potential new
drug and vaccine targets is essential. One of the major virulence factors of MTB is ESX-1 (ESAT-6 secretion system 1), a type VII
secretion system. This complex protein machinery allows MTB to leave the phagosome of the macrophage and reach the cytosol, where it replicates
and kills the cell. Today, there are still many knowledge gaps regarding the ESX-1 system. One of these is how ESX-1 proteins cross the mycobacterial outer membrane
(MOM). It is hypothesized that the ESX-1 proteins (e.g. the virulence factor EsxAB) pass the MOM through a machinery of as yet unknown composition. Identifying the
composition and structure of this hypothetical MOM machinery would reveal new drug targets that could, in time, give rise to anti-TB drugs.
The aim of this project was to apply omics data integration techniques to generate a list of candidate genes that are potentially involved in the ESX-1 system.
The application of omics data integration techniques involved several steps. Three types of MTB networks were constructed and analysed. First, the environmental and gene
regulatory influence network (EGRIN) of MTB, built from a compendium of 2325 publicly available mRNA expression profiles, was downloaded
(http://networks.systemsbiology.net/mtb/). Secondly, a protein-protein interaction network was downloaded from the STRING database, which contains
physical and functional protein interactions derived from computational predictions and from knowledge combined from other, primary databases (i.e. databases containing
experimental results). Thirdly, the CoExpNetViz tool was used to generate co-expression networks from publicly available gene expression matrices. A list of 35 ESX-1
genes of interest (GOI), comprising all known ESX-1 genes, was submitted as bait genes, and co-expression values were calculated using mutual information and Pearson
correlation coefficients.
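As an illustration of the bait-gene idea described above, the following minimal sketch computes Pearson correlation coefficients between bait genes and all other genes in a small expression matrix and keeps the pairs above a cut-off. It is not the CoExpNetViz implementation: the gene names, the toy expression values and the fixed threshold of 0.8 are assumptions made for the example, and the mutual information measure is not covered.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant profile: correlation is undefined, treated as 0 here
    return cov / sqrt(var_x * var_y)


def coexpressed_with_baits(expression, baits, threshold=0.8):
    """Return (bait, gene, r) edges for non-bait genes with |r| >= threshold."""
    edges = []
    for bait in baits:
        for gene, profile in expression.items():
            if gene in baits:
                continue
            r = pearson(expression[bait], profile)
            if abs(r) >= threshold:
                edges.append((bait, gene, round(r, 3)))
    return edges


# Hypothetical toy expression matrix: gene -> expression across four conditions.
expression = {
    "baitA": [1.0, 2.0, 3.0, 4.0],
    "geneX": [2.0, 4.0, 6.0, 8.1],   # follows the bait closely
    "geneY": [4.0, 3.0, 2.0, 1.0],   # perfectly anti-correlated with the bait
    "geneZ": [1.0, 3.0, 1.0, 3.0],   # only weakly correlated with the bait
}
edges = coexpressed_with_baits(expression, ["baitA"], threshold=0.8)
for bait, gene, r in edges:
    print(bait, gene, r)
```

Keeping both strongly positive and strongly negative correlations is a design choice of this sketch; a stricter analysis could keep only positive co-expression.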
In Cytoscape, the GOI were selected together with their neighbouring nodes in all the above-mentioned networks and an appropriate layout was created. Then, different
analyses were performed using MCODE and CytoHubba to identify highly interconnected nodes and regions in each of the GOI networks. This yielded mostly ESX-1-related
genes. However, among the most connected nodes, several genes/proteins of unknown function were identified as ESX-1 gene candidates. After a thorough web search, this list of genes
with unknown function was filtered, keeping only the biologically most interesting genes (e.g. those coding for proteins that have been identified in the mycobacterial membrane).
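The selection of GOI plus first neighbours, followed by a ranking of the most connected nodes, can be sketched outside Cytoscape as follows. This is a simplified stand-in, not the MCODE or CytoHubba algorithms: it ranks nodes by plain degree only, and the edge list and gene names are hypothetical.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (node_a, node_b) pairs."""
    adjacency = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def goi_subnetwork(adjacency, goi):
    """Keep the genes of interest, their first neighbours, and the edges among them."""
    keep = set(goi)
    for gene in goi:
        keep |= adjacency.get(gene, set())
    return {node: adjacency.get(node, set()) & keep for node in keep}


def rank_by_degree(subnetwork, exclude=()):
    """Rank nodes by number of connections inside the subnetwork, highest first."""
    ranked = [(node, len(neighbours))
              for node, neighbours in subnetwork.items()
              if node not in exclude]
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


# Hypothetical network: two known genes of interest and some unknown genes.
edges = [
    ("goi1", "goi2"), ("goi1", "cand1"), ("goi2", "cand1"),
    ("goi2", "cand2"), ("cand1", "cand2"), ("cand2", "far1"),
    ("far1", "far2"),
]
goi = ["goi1", "goi2"]
sub = goi_subnetwork(build_adjacency(edges), goi)
candidates = rank_by_degree(sub, exclude=goi)
print(candidates)
```

Excluding the GOI from the ranking leaves only the non-GOI neighbours, which mirrors the step of looking for unknown genes among the most connected nodes.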
To increase the understanding of the experimental conditions under which the GOI and the identified gene candidates are expressed, their expression levels were visualized with
ggplot2 in RStudio. For this, the same datasets were used as for the construction of the co-expression networks.
Using SAMtools and IGV, SNPs were identified in the GOI and candidate genes in locally available genomic NGS data of clinical isolates covering all seven lineages of MTB
circulating globally. To see whether the same results could be achieved with a SNP-calling tool, several tools were tested to identify the best one for future experiments. Using
VarScan on the GOI, a total of 103 lineage-specific SNPs and 9 SNPs shared between at least two lineages were detected. These results corresponded with ~95% of the
SNPs identified visually using SAMtools and IGV.
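The distinction between lineage-specific and shared SNPs, and the comparison between visually identified and tool-called SNPs, can be expressed in a few lines. The sketch below is only an illustration under assumptions: SNPs are represented as hypothetical (gene, position, alternative base) tuples, the lineage labels and calls are invented, and no VarScan or SAMtools output is parsed.

```python
def classify_snps(snps_by_lineage):
    """Split SNPs into lineage-specific ones and ones shared by two or more lineages."""
    lineages_per_snp = {}
    for lineage, snps in snps_by_lineage.items():
        for snp in snps:
            lineages_per_snp.setdefault(snp, set()).add(lineage)
    specific = {snp: next(iter(found))
                for snp, found in lineages_per_snp.items() if len(found) == 1}
    shared = {snp: sorted(found)
              for snp, found in lineages_per_snp.items() if len(found) >= 2}
    return specific, shared


def concordance(reference_calls, tool_calls):
    """Fraction of the reference SNPs that the tool also reports."""
    reference_calls = set(reference_calls)
    if not reference_calls:
        raise ValueError("reference call set is empty")
    return len(reference_calls & set(tool_calls)) / len(reference_calls)


# Hypothetical SNP calls per lineage: (gene, position, alternative base).
snps_by_lineage = {
    "L1": {("geneA", 100, "T"), ("geneB", 250, "G")},
    "L2": {("geneB", 250, "G")},
    "L4": {("geneC", 42, "A")},
}
specific, shared = classify_snps(snps_by_lineage)

# Hypothetical comparison of visually identified SNPs with tool-called SNPs.
manual = {("geneA", 100, "T"), ("geneB", 250, "G"), ("geneC", 42, "A"), ("geneD", 7, "C")}
called = {("geneA", 100, "T"), ("geneB", 250, "G"), ("geneC", 42, "A")}
print(len(specific), len(shared), concordance(manual, called))
```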
In conclusion, by applying omics methods to MTB, interesting candidate genes related to the ESX-1 protein machinery can be identified. In the future, additional
NGS experiments are needed to expand the available data and to make any bioinformatic approach more robust and reliable. Thereafter, an expression
quantitative trait loci (eQTL) analysis should be performed to associate the observed SNPs in MTB with differential expression and to link them to phenotype.
1. World Health Organization. Global tuberculosis report 2019 https://www.who.int/tb/publications/global_report/en/
2. Abdallah AM, Gey van Pittius NC, Champion PA, et al. Type VII secretion--mycobacteria show the way. Nat Rev Microbiol.
3. Bosserman RE, Champion PA. Esx Systems and the Mycobacterial Cell Envelope: What's the Connection?. J Bacteriol.
2017;199(17):e00131-17. Published 2017 Aug 8. doi:10.1128/JB.00131-17
4. Peterson EJ, Reiss DJ, Turkarslan S, et al. A high-resolution network model for global gene regulation in Mycobacterium
tuberculosis. Nucleic Acids Res. 2014;42(18):11291-11303. doi:10.1093/nar/gku777
5. Szklarczyk D, Gable AL, Lyon D, et al. STRING v11: protein-protein association networks with increased coverage, supporting
functional discovery in genome-wide experimental datasets. Nucleic Acids Res. 2019;47(D1):D607-D613. doi:10.1093/nar/gky1131
6. Tzfadia O, Diels T, De Meyer S, Vandepoele K, Aharoni A, Van de Peer Y. CoExpNetViz: Comparative Co-Expression Networks
Construction and Visualization Tool. Front Plant Sci. 2016;6:1194. Published 2016 Jan 5. doi:10.3389/fpls.2015.01194
7. Shannon P, Markiel A, Ozier O, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks.
Genome Res. 2003;13(11):2498-2504.
8. Li H, Handsaker B, Wysoker A, et al. The Sequence Alignment/Map (SAM) format and SAMtools. Bioinformatics.
2009;25:2078-2079.
9. Robinson JT, Thorvaldsdóttir H, Wenger AM, Zehir A, Mesirov JP. Variant Review with the Integrative
Genomics Viewer (IGV). Cancer Res. 2017;77(21):31-34.
10. Koboldt D, Zhang Q, Larson D, et al. VarScan 2: Somatic mutation and copy number alteration discovery in cancer by exome sequencing.
2012. doi:10.1101/gr.129684.111
Abstract 2018-2019: Genomics in the tropics: building portable pipelines
The Institute of Tropical Medicine (ITM) in Antwerp is an internationally renowned research institute for tropical medicine and public health in developing countries (www.itg.be). The unit of Diagnostic Bacteriology (DIAB) at the ITM conducts research on bacterial bloodstream infections (BSI), focusing mainly on invasive salmonellosis and antimicrobial resistance (AMR) in Central Africa. To investigate the spread and genetic characteristics of AMR lineages of invasive Salmonella, whole genome sequencing (WGS) is performed using Illumina and MinION sequencing platforms. Short-term or even real-time genomics of AMR outbreaks requires that strains can be sequenced and analyzed locally at national reference laboratories (NRL) in Africa. Therefore, DIAB develops tools that bring the complete workflow, both wet lab and dry lab, on-site in endemic regions. Currently, DIAB is working closely with the NRL in Kigali (Rwanda) to implement the developed workflows for Salmonella typhi AMR surveillance in Rwanda. The DIAB pipeline for Illumina data analysis of S. typhi consists of the following tools:
1. FastQC & MultiQC: quality control of the reads before and after trimming
2. Trimmomatic: removing adaptors from the reads
3. SPAdes: building the assembly
4. Pathogenwatch: open-source and rapid webtool developed by the Wellcome Trust Sanger Institute (UK) that analyses the uploaded Salmonella typhi assemblies for: assembly quality control, genotyping, phylogenetic analysis and prediction of antibiotic resistance
The aim of my thesis was to transfer the DIAB pipeline to different environments while retaining full functionality by using container software. After researching multiple container software options, I selected Docker, as it was one of the few container systems that met the requirements, such as support for all major operating systems (Windows, macOS and Unix). To further automate the pipeline, I used Snakemake. Snakemake allows multiple tools to be chained and monitors the input and output of each tool. This makes the pipeline easy to manage: if an input or output file is missing, Snakemake throws an error and stops the analysis, allowing users to investigate the problem. Afterwards, Snakemake restarts the pipeline at the step that failed. I successfully containerized the in-house pipeline, including the Snakemake management system, using Docker. Snakemake itself was containerized to reduce the number of programs that need to be installed on the host OS. To achieve this, the Docker container running Snakemake had to be able to start other Docker containers, so a ‘Docker in Docker’ system was created.
In a next phase, the containerized pipeline was tested on Windows, macOS and Unix with the same test data panel (2 S. typhi samples from Rwanda that were recently sequenced with Illumina at DIAB). The complete data analysis using the containerized pipeline took about 10 minutes on both the DIAB server (Ubuntu 18.04; 30 threads) and a laptop (Windows 10; 9 threads). On macOS, it took 2 hours and 25 minutes to run the samples (macOS v10.13.6; 2 threads).
Next to the Illumina short-read pipeline, I have developed a new WGS pipeline for the analysis of MinION long reads. The MinION sequencing device (Oxford Nanopore Technologies, UK) is a portable sequencer that allows genome sequencing in remote settings and provides real-time basecalling using the Guppy tool.
To assemble the long reads, I selected Unicycler, which allows users to use only MinION reads or to combine Illumina and MinION reads into one “hybrid” assembly. If Illumina and MinION reads are combined, the same tools as in the previous pipeline are used (up to and including Trimmomatic) to prepare the Illumina reads, after which both Illumina and MinION reads are passed to Unicycler for hybrid assembly. The complete long-read pipeline for MinION reads is listed below:
1. Guppy: used for both basecalling and demultiplexing of the MinION reads
2. PycoQC: quality control of the basecalled MinION reads
3. Porechop: trimming (and demultiplex correction) of the MinION reads
4. Unicycler: a pre-existing mini assembly pipeline for bacterial genomes. Unicycler takes care of the following steps: genome assembly, polishing of the assembly and circularising replicons
5. Bandage: visualizing the quality of the assembly
6. Prokka: annotating the assembly
By visualizing the different assembly types with Bandage, it appeared that for 3 out of the 6 assembled S. typhi genomes, a long-read-only assembly already produced a closed circular contig (±4.8 Mb) without using Illumina reads. However, long reads are more prone to sequencing errors than short reads, so the overall quality of the contig improved when Illumina reads were added. As with the Illumina pipeline, this new MinION data analysis pipeline will be automated with Snakemake and packaged in a portable Docker environment after my thesis. In conclusion, I have created an automated and containerized short-read WGS pipeline using Docker and Snakemake, as well as a WGS pipeline for MinION long reads that is now being tested. Both pipelines are available for download on GitHub.
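The stop-and-resume behaviour described for Snakemake in this abstract can be illustrated with a small file-based sketch. This is plain Python and not Snakemake or its rule syntax: the step names, file names and the `run_pipeline` helper are hypothetical, and real Snakemake also considers details such as file timestamps that are ignored here.

```python
import os
import tempfile


class MissingFileError(Exception):
    """Raised when a step lacks an input file or fails to produce an output file."""


def run_pipeline(steps, workdir):
    """Run steps in order, skipping steps whose outputs already exist.

    Each step is (name, input_files, output_files, action). The run stops with
    an error as soon as an input is missing. Returns the names of executed steps.
    """
    executed = []
    for name, inputs, outputs, action in steps:
        in_paths = [os.path.join(workdir, f) for f in inputs]
        out_paths = [os.path.join(workdir, f) for f in outputs]
        missing = [p for p in in_paths if not os.path.exists(p)]
        if missing:
            raise MissingFileError(f"step '{name}' is missing input: {missing}")
        if all(os.path.exists(p) for p in out_paths):
            continue  # outputs are already present, nothing to redo
        action(in_paths, out_paths)
        not_made = [p for p in out_paths if not os.path.exists(p)]
        if not_made:
            raise MissingFileError(f"step '{name}' did not produce: {not_made}")
        executed.append(name)
    return executed


def copy_with_suffix(suffix):
    """Stand-in for a real tool: copy the first input to the first output."""
    def action(in_paths, out_paths):
        with open(in_paths[0]) as src, open(out_paths[0], "w") as dst:
            dst.write(src.read() + suffix)
    return action


workdir = tempfile.mkdtemp()
steps = [
    ("trim", ["reads.fastq"], ["trimmed.fastq"], copy_with_suffix("# trimmed\n")),
    ("assemble", ["trimmed.fastq"], ["assembly.fasta"], copy_with_suffix("# assembled\n")),
]

# First attempt: the raw reads are absent, so the run stops with an error.
try:
    run_pipeline(steps, workdir)
    first_error = None
except MissingFileError as exc:
    first_error = str(exc)

# After the input is provided, the full pipeline runs.
with open(os.path.join(workdir, "reads.fastq"), "w") as handle:
    handle.write("@read1\nACGT\n+\nIIII\n")
full_run = run_pipeline(steps, workdir)

# If a later output is lost, only the affected step is repeated.
os.remove(os.path.join(workdir, "assembly.fasta"))
resumed_run = run_pipeline(steps, workdir)
print(first_error is not None, full_run, resumed_run)
```

Deciding what to run from the presence of files, rather than from a log of past runs, is what lets a rerun pick up at the failed step without repeating earlier work.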
Abstract traineeship advanced bachelor of bioinformatics 2017-2018: 16S metagenomics for bacteria identification in blood samples: automatization of the data analysis workflow
Bacterial bloodstream infections (BSI) are a large public health threat with high mortality rates worldwide. Routine diagnosis is currently based on blood culture followed by microbiological techniques for bacterial species identification. Molecular diagnostic tools have been developed for identifying the presence of bacteria in the blood of BSI patients and are mostly based on single or multiplexed polymerase chain reaction (PCR) assays. 16S metagenomics is an open-ended technique that can detect simultaneously all bacteria in a given sample based on PCR amplification of the 16S ribosomal RNA gene followed by deep sequencing and taxonomic labelling of the amplicons by searching for similar sequences in public databases. My host lab at the Institute of Tropical Medicine (ITM) in Antwerp has conducted pivotal work in the field of 16S metagenomics for blood analysis (Decuypere S. et al., 2016, Plos NTS 10:e0004470; Sandra Van Puyvelde 2016, MSc thesis UGent).
The objective of the internship at ITM Antwerp was to develop a standardized and semi-automated bioinformatic workflow for 16S metagenomics data analysis of blood samples. The following methodologies were standardized in a semi-automated workflow on a local server: (i) Pre-processing: quality trimming, removal of adapter sequences, length filtering and pairing of the raw reads using Trimmomatic, followed by species-level filtering to exclude human reads using Kraken; (ii) Taxonomic labelling of the processed reads: taxonomic classification of the contigs to bacterial genus/species level, summarised in a table of amplicon sequence variants (ASV) using DADA2, and visualization of the ASV data using phyloseq in R.
To enable a semi-automated 16S pipeline, different scripts were used. In the pre-processing part of the pipeline, I developed a bash script that calls the Trimmomatic and Kraken tools. To assess read quantity and quality in this part of the pipeline, FastQC and MultiQC were run, also from a bash script, and a Python script was developed to parse the number of filtered reads from the Kraken output file into a CSV file for visualization of the read counts in R. In the taxonomic labelling part of the pipeline, where taxonomy is assigned to the reads, the R server software was used to run the DADA2 R script on the local server via a browser. This script runs the different steps of the DADA2 workflow: (i) filtering and trimming, (ii) dereplication, (iii) merging of paired reads, (iv) removal of chimeras and (v) taxonomy assignment.
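A parsing step of the kind described here could look roughly like the sketch below. It is not the script developed during the internship, which is not shown in this abstract. It assumes Kraken's standard tab-separated per-read output (classification status C/U in the first column, taxonomy ID in the third) and counts only reads assigned exactly to NCBI taxonomy ID 9606 (Homo sapiens) as human; the sample name and read lines are invented.

```python
import csv
import io

HUMAN_TAXID = "9606"  # NCBI taxonomy ID for Homo sapiens


def count_kraken_reads(lines):
    """Count total, unclassified, human and other classified reads.

    Assumes one read per line: status (C/U), read ID, taxonomy ID, ...
    """
    counts = {"total": 0, "unclassified": 0, "human": 0, "other_classified": 0}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        status, taxid = fields[0], fields[2]
        counts["total"] += 1
        if status == "U":
            counts["unclassified"] += 1
        elif taxid == HUMAN_TAXID:
            counts["human"] += 1
        else:
            counts["other_classified"] += 1
    return counts


def write_counts_csv(per_sample_counts, handle):
    """Write one CSV row per sample so the counts can be plotted elsewhere."""
    fieldnames = ["sample", "total", "unclassified", "human", "other_classified"]
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    for sample, counts in sorted(per_sample_counts.items()):
        writer.writerow({"sample": sample, **counts})


# Hypothetical per-read output for one sample.
example_lines = [
    "C\tread1\t9606\t250\t9606:220\n",
    "C\tread2\t562\t250\t562:200\n",
    "U\tread3\t0\t250\t0:220\n",
    "C\tread4\t9606\t250\t9606:210\n",
]
counts = count_kraken_reads(example_lines)
buffer = io.StringIO()
write_counts_csv({"sample01": counts}, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

In practice the lines would come from an open file handle per sample and the CSV would be written to disk; the in-memory buffer only keeps the example self-contained.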
To assess the 16S pipeline, a panel of 80 DNA samples with known numbers of colony-forming units (CFU) of different bacterial species spiked in blood was used. Briefly, the samples were spiked with four different concentrations (200, 20, 2 and 0.2 CFU/ml) of four different bacteria (Escherichia coli, Klebsiella pneumoniae, Pseudomonas aeruginosa and Staphylococcus aureus) and with combinations of E. coli plus one other bacterial species in different ratios (1:1, 1:9) to reflect mixed infections in blood. A set of blank samples was also included in the panel. Following the Illumina 16S metagenomics protocol, a library was prepared and run on the MiSeq platform (2 × 300 cycles, paired-end). The sample preparation was conducted prior to the internship.
Running the MiSeq reads of this experimental sample panel through the 16S metagenomics pipeline made it possible to optimize and standardize the parameters of every step in order to obtain the maximum number of true positive classified reads in the ASV table. During my study, I observed that (i) pre-filtering with Trimmomatic does not need high stringency in this pipeline, since there is a quality filtering step in the DADA2 workflow and the paired-end reads must remain long enough to merge into one contig; (ii) Kraken should be used to exclude human reads from the dataset, since the raw data contained a high number of human reads and removing them lowers the computational time of further processing in DADA2; and (iii) true positives in the ASV table were obtained for Escherichia and Klebsiella at concentrations between 2 and 200 CFU and in the mixed spiked samples. Pseudomonas and Staphylococcus showed false positive reads in the blank samples, although the read counts were lower than in the samples spiked with these species.
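The evaluation logic behind these observations (reads assigned to a spiked genus count as true positives, and signal in a spiked sample is compared with the background in blank samples) can be sketched as follows. The read counts are invented for illustration and the helper names are hypothetical; this is not the evaluation code used in the study.

```python
def summarise_sample(genus_counts, expected_genera):
    """Split the reads of one sample into true and false positives by genus."""
    expected = set(expected_genera)
    true_pos = sum(n for genus, n in genus_counts.items() if genus in expected)
    false_pos = sum(n for genus, n in genus_counts.items() if genus not in expected)
    total = true_pos + false_pos
    return {
        "true_positive_reads": true_pos,
        "false_positive_reads": false_pos,
        "true_positive_fraction": true_pos / total if total else 0.0,
    }


def exceeds_blank(spiked_counts, blank_samples, genus):
    """True if the genus has more reads in the spiked sample than in any blank."""
    background = max((blank.get(genus, 0) for blank in blank_samples), default=0)
    return spiked_counts.get(genus, 0) > background


# Hypothetical genus-level read counts taken from an ASV table.
spiked = {"Pseudomonas": 900, "Staphylococcus": 40, "Escherichia": 60}
blanks = [
    {"Pseudomonas": 35},
    {"Pseudomonas": 12, "Staphylococcus": 50},
]
summary = summarise_sample(spiked, expected_genera=["Pseudomonas"])
print(summary)
print(exceeds_blank(spiked, blanks, "Pseudomonas"))
print(exceeds_blank(spiked, blanks, "Staphylococcus"))
```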
In conclusion, a 16S metagenomics pipeline was standardized, automated and evaluated on MiSeq reads of a blood sample panel spiked with known numbers of different bacterial species. Further improvements should be made in the wet-lab part, such as decreasing the amplification of human DNA and reducing the number of false positive hits. Increasing the sequencing depth and/or read length could result in higher diagnostic sensitivity. Another limitation of the current 16S metagenomics workflow for use as a diagnostic test is that taxonomic labelling of the reads is only accurate to the genus level and not to the species level.
Sandra Van Puyvelde
Tessa de Block