Institute of Tropical Medicine
Abstract 2018-2019: Genomics in the tropics: building portable pipelines
The Institute of Tropical Medicine (ITM) in Antwerp is an internationally renowned research institute in tropical medicine and public health in developing countries (www.itg.be). The Unit of Diagnostic Bacteriology (DIAB) at the ITM conducts research on bacterial bloodstream infections (BSI), with a main focus on invasive salmonellosis and antimicrobial resistance (AMR) in Central Africa. To investigate the spread and genetic characteristics of AMR lineages of invasive Salmonella, whole genome sequencing (WGS) is performed using the Illumina and MinION sequencing platforms. Short-term or even real-time genomics of AMR outbreaks requires that strains can be sequenced and analyzed locally at national reference laboratories (NRL) in Africa. Therefore, DIAB develops tools that can bring the complete workflow, both wet- and dry-lab, on-site in endemic regions. Currently, DIAB is working closely with the NRL in Kigali (Rwanda) to implement the developed workflows for Salmonella Typhi AMR surveillance in Rwanda. The DIAB pipeline for Illumina data analysis of S. Typhi consists of the following tools:
1. FastQC & MultiQC: quality-control of the reads before and after trimming
2. Trimmomatic: removing adaptors from the reads
3. SPAdes: building the assembly
4. Pathogenwatch: an open-source, rapid web tool developed by the Wellcome Trust Sanger Institute (UK) that analyses uploaded Salmonella Typhi assemblies for assembly quality control, genotyping, phylogenetic analysis and prediction of antibiotic resistance
The aim of my thesis was to transfer the DIAB pipeline to different environments while retaining full functionality by using container software. After researching multiple container-software options, I selected Docker, as it was one of the few container systems that met the requirements, such as support for all major operating systems (Windows, macOS and Linux). To further automate the pipeline, I used Snakemake, which allows chaining multiple tools and monitoring the input and output of each tool. This makes the pipeline easier to manage: if an input or output file is missing, Snakemake throws an error and stops the analysis, allowing users to investigate the problem. When restarted, Snakemake resumes the pipeline at the step that failed. I successfully containerized the in-house pipeline, including the Snakemake workflow manager, using Docker. Snakemake itself was containerized to reduce the number of programs that need to be installed on the host OS. To achieve this, the Docker container holding Snakemake needed to be able to start other Docker containers, so a 'Docker-in-Docker' system was created. Next, the containerized pipeline was tested on Windows, macOS and Linux using the same test data panel (two S. Typhi samples from Rwanda that were recently sequenced with Illumina at DIAB). The complete data analysis using the containerized pipeline took about 10 minutes on both the DIAB server (Ubuntu 18.04; 30 threads) and a laptop (Windows 10; 9 threads), and 2 hours and 25 minutes on macOS (v10.13.6; 2 threads).
In addition to the Illumina short-read pipeline, I developed a new WGS pipeline for the analysis of MinION long reads. The MinION (Oxford Nanopore Technologies, UK) is a portable sequencing device that allows genome sequencing in remote settings and provides real-time basecalling through the Guppy tool.
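The error-and-resume behaviour described above can be illustrated with a small Python sketch. This is a toy illustration only, not Snakemake itself, and the step and file names are hypothetical:

```python
import os

def run_pipeline(steps, workdir):
    """Run (name, function, output_file) steps in order. A step whose
    output file already exists is skipped, so rerunning the pipeline
    resumes at the step that previously failed."""
    for name, func, output in steps:
        out_path = os.path.join(workdir, output)
        if os.path.exists(out_path):
            continue  # finished in a previous run
        func(out_path)
        # Mimic Snakemake's output monitoring: stop if the step
        # did not produce the file it promised.
        if not os.path.exists(out_path):
            raise RuntimeError(f"step '{name}' did not produce {output}")
```

Because completed steps leave their output files behind, a rerun after fixing the problem starts directly at the failed step instead of repeating the whole analysis.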
To assemble the long reads I selected Unicycler, which allows users to use MinION reads only, or to combine Illumina and MinION reads into one "hybrid" assembly. If Illumina and MinION reads are combined, the same tools as in the previous pipeline (up until Trimmomatic) are used to prepare the Illumina reads, after which both read sets are fed into Unicycler for hybrid assembly. The complete long-read pipeline for MinION reads is listed below:
1. Guppy: used for both basecalling and demultiplexing of the MinION reads
2. PycoQC: quality control of the basecalled MinION reads
3. Porechop: trimming (and demultiplex correction) of the MinION reads
4. Unicycler: a pre-existing mini assembly pipeline for bacterial genomes. Unicycler takes care of the following steps: genome assembly, polishing of the assembly and circularising replicons
5. Bandage: visualizing the quality of the assembly
6. Prokka: annotating the assembly
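Besides the visual inspection that Bandage provides in step 5, assembly quality is often summarised numerically from the contig lengths. A minimal sketch computing contig lengths and the N50 statistic from FASTA-formatted text (an illustrative complement, not a step of the pipeline above):

```python
def contig_lengths(fasta_text):
    """Parse contig lengths from FASTA-formatted text."""
    lengths, current = [], 0
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            if current:
                lengths.append(current)
            current = 0
        else:
            current += len(line.strip())
    if current:
        lengths.append(current)
    return lengths

def n50(lengths):
    """Smallest contig length such that contigs at least this long
    together cover >= 50% of the total assembly length."""
    total, running = sum(lengths), 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
```

A single closed circular chromosome, as obtained for some of the long-read-only assemblies, would give an N50 equal to the full genome length.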
By visualizing the different assembly types with Bandage, it appeared that for 3 out of the 6 assembled S. Typhi genomes, a long-read-only assembly already produced a closed circular contig (±4.8 Mb) without using Illumina reads. However, long reads are more sensitive to sequencing errors than short reads; therefore, when Illumina reads were added, the overall quality of the contigs improved. As for the previous Illumina pipeline, this new MinION data analysis pipeline will be automated with Snakemake and packaged in a portable Docker environment beyond my thesis. In conclusion, I have created an automated and containerized short-read WGS pipeline using Docker and Snakemake, as well as a WGS pipeline for MinION long reads that is now being tested. Both pipelines are available for download on GitHub.
Abstract traineeship advanced bachelor of bioinformatics 2017-2018: 16S metagenomics for bacteria identification in blood samples: automation of the data analysis workflow
Bacterial bloodstream infections (BSI) are a large public health threat with high mortality rates worldwide. Routine diagnosis is currently based on blood culture followed by microbiological techniques for bacterial species identification. Molecular diagnostic tools have been developed for identifying the presence of bacteria in the blood of BSI patients and are mostly based on single or multiplexed polymerase chain reaction (PCR) assays. 16S metagenomics is an open-ended technique that can simultaneously detect all bacteria in a given sample, based on PCR amplification of the 16S ribosomal RNA gene followed by deep sequencing and taxonomic labelling of the amplicons by searching for similar sequences in public databases. My host lab at the Institute of Tropical Medicine (ITM) in Antwerp has conducted pivotal work in the field of 16S metagenomics for blood analysis (Decuypere S. et al., 2016, PLoS Negl Trop Dis 10:e0004470; Sandra Van Puyvelde 2016, MSc thesis UGent).
The objective of the internship at ITM Antwerp was to develop a standardized and semi-automated bioinformatic workflow for the 16S metagenomics data analysis of blood samples. The following methodologies have been standardized in a semi-automated workflow on a local server: (i) pre-processing: quality trimming, adapter removal, length filtering and pairing of the raw reads using Trimmomatic, and species-level filtering to exclude human reads using Kraken; (ii) taxonomic labelling of the processed reads: taxonomic classification of the contigs to bacterial genus/species level into a table of amplicon sequence variants (ASVs) using DADA2, and visualization of the ASV data using phyloseq in R.
To enable a semi-automated 16S pipeline, different scripts were used. In the pre-processing part of the pipeline, I developed a bash script for calling the Trimmomatic and Kraken tools. To assess read quantity and quality in this part of the pipeline, FastQC and MultiQC were also called from a bash script, and a Python script was developed to parse the number of filtered reads from the output file of the Kraken tool into a CSV file, to enable visualization of the read quantities in R. In the taxonomic labelling part of the pipeline, where taxonomy is assigned to the reads, the R server software was used to run the DADA2 R script on the local server via a browser. This script runs the different steps in the DADA2 workflow: (i) filtering and trimming, (ii) dereplication, (iii) merging paired reads, (iv) removing chimeras and (v) assigning taxonomy.
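A parsing step of this kind can be sketched as follows, assuming the standard tab-separated Kraken report layout (percentage, clade read count, direct read count, rank code, taxid, indented name); the actual script's column choices and output layout may differ:

```python
import csv

def kraken_report_to_csv(report_text, csv_file, taxa=("Homo sapiens",)):
    """Extract clade read counts for selected taxa from a Kraken
    report and write them as CSV rows for downstream plotting in R."""
    writer = csv.writer(csv_file)
    writer.writerow(["taxon", "clade_reads"])
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip malformed or empty lines
        percent, clade_reads, direct, rank, taxid, name = fields[:6]
        if name.strip() in taxa:
            writer.writerow([name.strip(), clade_reads])
```

The resulting two-column CSV can be read directly in R with `read.csv` for plotting read quantities per sample.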
To assess the 16S pipeline, a panel of 80 DNA samples with known numbers of colony forming units (CFU) of different bacterial species spiked in blood was used. Briefly, the samples were spiked with four different concentrations (200, 20, 2 and 0.2 CFU/ml) of four different bacteria (Escherichia coli, Klebsiella pneumoniae, Pseudomonas aeruginosa and Staphylococcus aureus) and with combinations of E. coli plus one other bacterial species in different ratios (1:1, 1:9) to reflect mixed infections in blood. A set of blank samples was also included in the panel. Following the Illumina 16S metagenomics protocol, a library prep was made and run on the MiSeq platform (2 × 300 cycles, paired-end reading). The sample preparation was conducted prior to the internship.
Running the MiSeq reads of this experimental sample panel through the 16S metagenomics workflow enables optimization and standardization of the parameters of every step in the pipeline, in order to obtain the maximum number of true-positive classified reads in the ASV table. During my study, I observed that (i) pre-filtering with Trimmomatic does not need high stringency in this pipeline, since there is a quality-filtering step in the DADA2 workflow and the paired-end reads must remain long enough to merge into one contig; (ii) Kraken should be used to exclude human reads from the dataset, since the raw data contained a high number of human reads, and removing them lowers the computational time of further processing in DADA2; and (iii) true positives in the ASV table are obtained for Escherichia and Klebsiella at concentrations between 2 and 200 CFU/ml and in the mixed spiked samples. Pseudomonas and Staphylococcus showed false-positive reads in the blank samples, although their read counts were lower than in the samples spiked with these species.
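The blank-sample check used to spot such false positives can be sketched as a small Python routine; the count data below are hypothetical, not the actual panel results:

```python
def flag_blank_contaminants(counts, blank_samples):
    """Given ASV read counts per sample as {taxon: {sample: reads}},
    return the taxa that also show reads in the blank (negative
    control) samples, with their maximum read count in a blank."""
    flagged = {}
    for taxon, per_sample in counts.items():
        blank_reads = [per_sample.get(s, 0) for s in blank_samples]
        if max(blank_reads, default=0) > 0:
            flagged[taxon] = max(blank_reads)
    return flagged
```

Taxa flagged this way are candidate contaminants; comparing their blank read counts against the spiked samples (where counts are much higher) helps separate contamination from true signal.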
In conclusion, a 16S metagenomics pipeline was standardized, automated and evaluated on MiSeq reads of a blood sample panel spiked with known numbers of different bacterial species. Further improvements should be made in the wet-lab part, such as decreasing the amplification of human DNA and the number of false-positive hits. Increasing the sequencing depth and/or read length could result in higher diagnostic sensitivity. Another limitation of the current 16S metagenomics workflow for use as a diagnostic test is that taxonomic labelling of the reads is only accurate to the genus level, not the species level.
Sandra Van Puyvelde
Tessa de Block