Abstract advanced bachelor of bioinformatics 2020-2021: ANALYSIS OF VIRAL AND BACTERIAL METAGENOMICS FOR PATHOGEN IDENTIFICATION AFTER THIRD GENERATION SEQUENCING
PathoSense has developed a sequence-based diagnostics platform for veterinary infectious diseases, in which nanopore sequencing is used to generate long-read data from metagenomic samples. Taxonomic assignment of the resulting viral & bacterial sequences is currently done using DIAMOND & Kraken2 on a local database. However, some results still required manual curation through online NCBI BLAST searches against the complete database in order to:
- check the specificity of some reads (whether they were classified correctly)
- get species & strain information
- get subtypes of viruses with a segmented genome (e.g. Influenza virus & Rotavirus)
These manual BLAST searches were very time-consuming, so this traineeship tried to answer the following question: is it possible to automate BLAST searches for a complete sequencing run, filter useful information, visualize the results & generate a final results chart?
As a starting point for the script, an existing pipeline script (PathoPipe) generates a specific directory structure for each sequencing run (12 barcoded samples). Every sample has a barcode directory containing specific taxon directories. Each taxon directory contains one multifasta file with either single reads or contigs that were classified as this taxon. The central Python script is called from the original pipeline bash script and is therefore executed automatically once the previous pipeline steps have finished.
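The directory layout described above could be traversed roughly as follows. This is a minimal sketch, not the PathoPipe code: the directory names (`barcode01`, `Rotavirus_A`), the `.fasta` extension and the function names are assumptions made for illustration.

```python
from pathlib import Path
import tempfile


def find_taxon_fastas(run_dir):
    """Return (barcode, taxon, fasta_path) for every multifasta in a run.

    Assumed layout (hypothetical names): run_dir/barcodeXX/<taxon>/<file>.fasta
    """
    found = []
    for barcode_dir in sorted(Path(run_dir).glob("barcode*")):
        if not barcode_dir.is_dir():
            continue
        for taxon_dir in sorted(p for p in barcode_dir.iterdir() if p.is_dir()):
            for fasta in sorted(taxon_dir.glob("*.fasta")):
                found.append((barcode_dir.name, taxon_dir.name, fasta))
    return found


def count_records(fasta_path):
    """Count sequences (reads or contigs) by counting FASTA header lines."""
    with open(fasta_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Small self-contained demo on a temporary, made-up run directory.
with tempfile.TemporaryDirectory() as tmp:
    taxon_dir = Path(tmp) / "barcode01" / "Rotavirus_A"
    taxon_dir.mkdir(parents=True)
    (taxon_dir / "reads.fasta").write_text(">read1\nACGT\n>read2\nGGCC\n")
    demo = [(b, t, count_records(f)) for b, t, f in find_taxon_fastas(tmp)]

print(demo)
```

Each tuple returned by `find_taxon_fastas` identifies one multifasta file that would be passed on to the downstream search step.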
The code is split up into different Python scripts, with one central script that calls and directs the other scripts. To make the code non-redundant and easy to read, the logic is further split up into modules. Firstly, the BLAST searches were automated using the subprocess module to perform BLAST+ searches against a local NCBI nucleotide database of ~250 GB. Since these searches are time-consuming, GNU Parallel was used to speed up this process. Next, useful information was extracted, including the specificity of each read/contig, by determining whether the resulting taxonomy IDs match the pre-classified taxon. If not, reads/contigs were flagged as aspecific. This determination is done using the NCBI taxonomy database via the ncbi_taxonomy module. By finding overlapping taxonomy IDs between results, specific reads were divided into groups, specified as different subtaxa. In order to extract information on biological relatedness, a scoring system was developed and validated to determine the top BLAST hits for each of these (new) taxonomic groups. Furthermore, additional biologically relevant information was extracted from these BLAST results, including:
- presence of a 16S rRNA gene contig in bacteria for species assignment
- subtyping of viruses like Influenza virus (H1N1) and Rotavirus A (G5P), based on specific segments of their genome.
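The BLAST automation and the specificity check could look roughly like the sketch below. It is an illustration under stated assumptions, not the traineeship code: the function names, the chosen output columns, the toy taxonomy (made-up IDs in a plain parent dictionary instead of the NCBI taxonomy database) and the rule "a read is specific if any hit falls under the expected taxon" are all hypothetical. Only the `blastn` options shown (`-query`, `-db`, `-outfmt`, `-max_target_seqs`, `-num_threads`, `-out`) are standard BLAST+ options.

```python
import subprocess

# Columns requested from blastn in tabular format (outfmt 6).
BLAST_COLUMNS = ["qseqid", "sseqid", "pident", "length", "evalue", "bitscore", "staxids"]


def build_blastn_command(query_fasta, db_path, out_path, threads=4):
    """Build the blastn argument list for one multifasta file."""
    return [
        "blastn",
        "-query", str(query_fasta),
        "-db", str(db_path),
        "-outfmt", "6 " + " ".join(BLAST_COLUMNS),
        "-max_target_seqs", "10",
        "-num_threads", str(threads),
        "-out", str(out_path),
    ]


def run_blastn(query_fasta, db_path, out_path):
    """Run blastn; requires BLAST+ and a local database (not run in this demo)."""
    subprocess.run(build_blastn_command(query_fasta, db_path, out_path), check=True)


def lineage(taxid, parents):
    """Return the taxid together with all of its ancestors in a parent map."""
    seen = set()
    while taxid is not None and taxid not in seen:
        seen.add(taxid)
        taxid = parents.get(taxid)
    return seen


def flag_hits(tabular_text, expected_taxid, parents):
    """Flag each query as 'specific' or 'aspecific' (assumed rule: any hit matches)."""
    matches = {}
    for line in tabular_text.strip().splitlines():
        row = dict(zip(BLAST_COLUMNS, line.split("\t")))
        # staxids may contain several taxonomy IDs separated by ';'
        taxids = [int(t) for t in row["staxids"].split(";")]
        ok = any(expected_taxid in lineage(t, parents) for t in taxids)
        matches[row["qseqid"]] = matches.get(row["qseqid"], False) or ok
    return {q: ("specific" if ok else "aspecific") for q, ok in matches.items()}


# Toy taxonomy with made-up IDs: 1 = root, 10 = a virus species,
# 11 = a subtype of that species, 20 = an unrelated taxon.
parents = {11: 10, 10: 1, 20: 1, 1: None}
example = (
    "read1\tsubjA\t98.5\t900\t0.0\t1600\t11\n"
    "read2\tsubjB\t91.0\t500\t1e-50\t700\t20\n"
)
flags = flag_hits(example, expected_taxid=10, parents=parents)
command = build_blastn_command("reads.fasta", "/path/to/nt", "hits.tsv")
print(flags)
```

Separating command construction from execution keeps the example testable without a BLAST installation; one such command per multifasta file is the kind of job list that can be handed to a parallel runner.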
As a third step, all results per barcode were visualized through an interactive horizontal barplot (HTML page), built with Plotly. An example output file is given in Figure 1. All resulting HTML pages from one sequencing run (12 barcodes) were combined into one page, allowing quicker scroll-through of all the results at once.
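A horizontal barplot of this kind can be produced with a few lines of Plotly. The sketch below is only an illustration: the taxa, the counts and the function names are made up, and the real page contains considerably more information. Plotly is imported inside the plotting function so that the data preparation can be used without it.

```python
def bar_data(counts):
    """Sort taxa by ascending count so the largest bar ends up at the top."""
    items = sorted(counts.items(), key=lambda kv: kv[1])
    labels = [name for name, _ in items]
    values = [value for _, value in items]
    return labels, values


def write_barplot(counts, title, html_path):
    """Write an interactive horizontal barplot to a standalone HTML file."""
    import plotly.graph_objects as go  # third-party dependency, imported lazily

    labels, values = bar_data(counts)
    fig = go.Figure(go.Bar(x=values, y=labels, orientation="h"))
    fig.update_layout(title=title, xaxis_title="Read count")
    fig.write_html(html_path)


# Made-up counts for one barcode, for illustration only.
counts = {"Rotavirus A": 120, "Escherichia coli": 45, "Influenza A virus": 300}
labels, values = bar_data(counts)
print(labels, values)
```

Calling `write_barplot(counts, "barcode01", "barcode01.html")` would write one page per barcode; how the 12 pages are merged into a single page is not shown here.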
As a final step, a new .tsv file is generated that contains the curated raw sequencing numbers. Data from the run is converted into a viral .tsv file & a bacterial .tsv file. Only taxon, subtaxon and read counts were required in these files. Contig counts (contiguous sequences obtained from multiple reads) were converted to read counts and inter-sample contamination was removed. For final interpretation, the resulting curated counts were converted to low/medium/high relative to an internal spike-in control.
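The last two steps, writing the three-column .tsv file and converting counts to low/medium/high, could be sketched as follows. The thresholds (10% and 100% of the spike-in read count), the example rows and all function names are assumptions chosen for illustration; the abstract does not state the cut-offs actually used.

```python
import csv
import io


def abundance_class(reads, spike_in_reads, low_ratio=0.1, high_ratio=1.0):
    """Classify a read count relative to the spike-in control.

    The ratio thresholds are hypothetical placeholders.
    """
    if spike_in_reads <= 0:
        raise ValueError("spike-in read count must be positive")
    ratio = reads / spike_in_reads
    if ratio < low_ratio:
        return "low"
    if ratio < high_ratio:
        return "medium"
    return "high"


def write_results_tsv(rows, handle):
    """Write taxon, subtaxon and read count as tab-separated values."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["taxon", "subtaxon", "reads"])
    for taxon, subtaxon, reads in rows:
        writer.writerow([taxon, subtaxon, reads])


# Made-up curated counts for one sample.
rows = [("Influenza A virus", "H1N1", 300), ("Rotavirus A", "G5P", 12)]
spike_in_reads = 200

buffer = io.StringIO()
write_results_tsv(rows, buffer)
tsv_text = buffer.getvalue()
levels = {taxon: abundance_class(reads, spike_in_reads) for taxon, _, reads in rows}
print(tsv_text)
print(levels)
```

Expressing each count as a ratio to the spike-in makes the classification comparable between samples with different sequencing depth.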
Importantly, since the databases used change frequently, a cron task was added to the system to keep them up-to-date. Once a month, this cron task executes a separate script that backs up and updates all databases, including the NCBI nucleotide & taxonomy databases.
While the final .tsv files still need some fine-tuning before implementation in the PathoSense diagnostics mobile app & website, the script is already actively used to perform automated diagnostics interpretation. This is possible because the horizontal barplot visualization has been validated extensively and performs well, resulting in the significant time savings that this traineeship aimed for.
Abstract advanced bachelor of bioinformatics 2019-2020: Development of bioinformatics applications for real-time geographical tracing of viral and bacterial infectious diseases
In the late 1980s, the syndrome that caused reproductive and respiratory problems in pigs was first referred to as 'Mystery Swine Disease'; today it is known as 'Porcine Reproductive and Respiratory Syndrome (PRRS)'. To this day, this disease remains one of the most widespread and economically devastating diseases in the pig industry [1]. The PRRS virus (PRRSV) is a member of the genus Porarterivirus, belonging to the family Arteriviridae within the order Nidovirales. This relatively small enveloped virus contains a single-stranded positive-sense RNA genome with a length of about 15 kb, encoding 10 open reading frames (ORFs) [2,3]. The first circulating European (type 1) and North American (type 2) genotype isolates to be characterized turned out to be surprisingly different genetically. Although the general disease phenotype, broad clinical symptoms, genomic organization and time of onset were all similar, these strains differed by ~40% at the nucleotide level [4].
The rapid evolution of the virus makes it possible to derive the history of an epidemic from its genomic data. With such a high mutation rate, however, back mutations can complicate phylogenetic and genomic conclusions. In addition, the increased availability of novel sequencing technologies has made rapid genome sequencing of pathogens possible. Such genomic information can be linked and plotted on a spatiotemporal map. This shows the spread of the virus at the population level and helps to understand its evolution.
To set up the spatiotemporal analysis pipeline, the Nextstrain [5] software package was fully adapted to PRRSV genomes. Nextstrain consists of data curation, analysis and visualization components. Python scripts maintain a database with available sequences and associated metadata. A set of tools performs phylodynamic analyses [6], including sub-sampling, multiple-sequence alignment, phylogenetic inference, temporal dating of ancestral nodes, and discrete geographical reconstruction, including inference of the most likely transmission events. This uses the maximum-likelihood analyses implemented in TreeTime and allowed a complete analysis of the entire PRRSV ORF5 dataset (n = 768 samples) in 20 minutes [7]. The data are presented in multiple panels whose views remain synchronized when interacting with the data. From the ORF5 dataset it could be concluded that most sequences differ little from each other. The highest numbers of strains were found in Italy, Spain and the United Kingdom, and these are mostly located in the same clade. Other countries contain a mixture of different clades. This can probably be explained by the extensive free transport of pigs throughout Europe, which allows the virus to spread quickly over large areas. For whole genome sequences, much less data was available (n = 113). Here, the observed clusters were mainly regionally bound. Nevertheless, more whole genome sequences are required to further support this hypothesis.
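The analysis steps described above correspond roughly to a chain of commands from augur, the command-line toolkit behind Nextstrain. The sketch below only assembles such a chain from Python; it is not the pipeline used in this project. All file names, the `country` metadata column and the function name are assumptions, the sub-sampling step is left out, and option names may differ between augur versions.

```python
import subprocess


def augur_steps(sequences, metadata, reference, out_dir="results"):
    """Return the augur commands for align, tree, refine, traits and export."""
    aligned = f"{out_dir}/aligned.fasta"
    raw_tree = f"{out_dir}/tree_raw.nwk"
    tree = f"{out_dir}/tree.nwk"
    branch_lengths = f"{out_dir}/branch_lengths.json"
    traits = f"{out_dir}/traits.json"
    return [
        # Multiple-sequence alignment against a reference
        ["augur", "align", "--sequences", sequences,
         "--reference-sequence", reference, "--output", aligned],
        # Phylogenetic inference
        ["augur", "tree", "--alignment", aligned, "--output", raw_tree],
        # Temporal dating of internal nodes with TreeTime
        ["augur", "refine", "--tree", raw_tree, "--alignment", aligned,
         "--metadata", metadata, "--timetree",
         "--output-tree", tree, "--output-node-data", branch_lengths],
        # Discrete geographical reconstruction on a metadata column
        ["augur", "traits", "--tree", tree, "--metadata", metadata,
         "--columns", "country", "--output-node-data", traits],
        # Export for visualization in Auspice
        ["augur", "export", "v2", "--tree", tree, "--metadata", metadata,
         "--node-data", branch_lengths, traits,
         "--output", "auspice/prrsv_orf5.json"],
    ]


def run_pipeline(steps):
    """Run each step in order; requires augur to be installed."""
    for command in steps:
        subprocess.run(command, check=True)


# Hypothetical input file names.
steps = augur_steps("orf5.fasta", "metadata.tsv", "reference.gb")
for command in steps:
    print(" ".join(command))
```

Calling `run_pipeline(steps)` would execute the chain; in practice Nextstrain builds usually describe the same steps in a workflow file so that only outdated results are recomputed.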
Finally, a vaccinology analysis was performed on these samples. With the help of peptide-specific serum antibodies, the antigenic regions in the envelope proteins had been characterized and neutralizing regions mapped [8]. Using EMBOSS scripts and the Python-based visualization tool Plotly, these regions were compared with known vaccine strains. By implementing a scoring system, the most appropriate vaccine strain can be proposed. Once these results can be linked to in vitro data from neutralization studies, it can be evaluated whether this approach can be used to predict vaccine effectiveness.
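One simple way to score vaccine strains against a field strain is to average the sequence identity over the mapped antigenic regions. The sketch below uses that idea purely as an illustration: the scoring rule, the region coordinates and the short toy sequences are assumptions, and the sequences are assumed to be already aligned and of equal length. It is not the scoring system developed in this project.

```python
def region_identity(seq_a, seq_b, start, end):
    """Fraction of identical positions between two aligned sequences in [start, end)."""
    a, b = seq_a[start:end], seq_b[start:end]
    if not a or len(a) != len(b):
        raise ValueError("regions must be non-empty and of equal length")
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def rank_vaccines(field_seq, vaccines, regions):
    """Rank vaccine strains by mean identity over the given regions (best first)."""
    scores = {}
    for name, vaccine_seq in vaccines.items():
        identities = [region_identity(field_seq, vaccine_seq, s, e) for s, e in regions]
        scores[name] = sum(identities) / len(identities)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy aligned protein fragments and made-up region coordinates (0-based, end-exclusive).
field_strain = "MLGKCLTAGC"
vaccines = {"vaccine_A": "MLGKCLTAGY", "vaccine_B": "MLEKCLSAGC"}
regions = [(0, 4), (5, 10)]

ranking = rank_vaccines(field_strain, vaccines, regions)
print(ranking)
```

A real scoring scheme would probably weight neutralizing regions more heavily and use a substitution matrix instead of plain identity.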
1. Lunney, J. K., Benfield, D. A. & Rowland, R. R. R. Porcine reproductive and respiratory syndrome virus: An update on an emerging and re-emerging viral disease of swine. Virus Res. 154, 1–6 (2010).
2. Cavanagh, D. Nidovirales: a new order comprising Coronaviridae and Arteriviridae. Arch. Virol. 142, 629–633 (1997).
3. Snijder, E. J. & Meulenberg, J. J. M. The molecular biology of arteriviruses. J. Gen. Virol. 79, 961–979 (1998).
4. Morrison, R. B., Collins, J. E., Harris, L., Christianson, W. T., Benfield, D. A., Chladek, D. W., Gorcyca, D. E. & Joo, H. S. Serologic evidence incriminating a recently isolated virus (ATCC VR-2332) as the cause of swine infertility and respiratory syndrome (SIRS). J. Vet. Diagn. Investig. 4, 186–188 (1992).
5. Hadfield et al. Nextstrain: real-time tracking of pathogen evolution. Bioinformatics (2018).
6. Volz, E. M. et al. Viral phylodynamics. PLoS Comput. Biol. 9, e1002947 (2013).
7. Sagulenko, P. et al. TreeTime: maximum-likelihood phylodynamic analysis. Virus Evol. 4, vex042 (2018).
8. Vanhee, M., Van Breedam, W., Costers, S., Geldhof, M., Noppe, Y. & Nauwynck, H. Characterization of antigenic regions in the porcine reproductive and respiratory syndrome virus by the use of peptide-specific serum antibodies. Vaccine 29, 4794–4804 (2011). https://doi.org/10.1016/j.vaccine.2011.04.071
Dr. Sebastiaan Theuns
+32 9 264 73 87