Abstract traineeship advanced bachelor of bioinformatics 2019-2020: Mining plant metatranscriptomics datasets in search for plant pathogens
The internship is carried out within the framework of a European project called "Plant Health Bioinformatics Network". The focus of the internship is on the fourth work package, which covers a community effort to mine RNA-seq data for non-viral pathogens. Many of the partners in this European project are plant virologists. They generate large amounts of plant-derived RNA-seq data solely to detect viruses in the samples, but the viral fraction is generally only a very small part of the available data. Therefore, during the project a method is developed to detect non-viral organisms in RNA-seq datasets, with the aim of detecting plant pathogens of other origins (bacteria, fungi, insects, mites, etc.). First, the project partners map their data against a ribosomal RNA database (SILVA). The result files are sent to ILVO, where they are summarized into a count file per pathogen group. These count files are then visualized in bar plots showing the rRNA content. Based mainly on the bar plots, a selection is made of the most interesting samples, which potentially contain non-viral plant pathogens. For those samples, a data transfer is arranged from the partner to ILVO, where an in-depth analysis is performed on the data. The method used for this in-depth analysis is created and optimized during the first part of the internship. Two main strategies are explored. The first is the use of a taxonomic classifier for metagenomics data (Kraken2) directly on the reads; Kraken2 uses a lowest common ancestor algorithm to classify a read. The second strategy consists of a meta-assembly (rnaSPAdes) followed by a database search and taxonomic assignment to species level based on the top hit. In addition, two hybrid pipelines are created using both tools. The main difference between those two hybrid pipelines is the relative position of the meta-assembly and the classification tool.
The method based solely on the meta-assembly was chosen for further use in the project because of its speed and because its results were similar to those of the other methods. A disadvantage of this method is that it relies on the top hit and does not use a lowest common ancestor approach.
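To make the contrast between a top-hit assignment and a lowest common ancestor (LCA) assignment concrete, the following is a minimal sketch in Python. It is not Kraken2's implementation: it only illustrates the general LCA idea on toy lineages, and the function name and the example lineages are assumptions made for illustration.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages, or None.

    Each lineage is a list of taxon names ordered from root to leaf.
    This is a simplified illustration of the LCA idea, not Kraken2's code.
    """
    if not lineages:
        return None
    shared = []
    # Walk down the ranks in parallel and stop at the first disagreement.
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared[-1] if shared else None


# Toy example (hypothetical hits for one sequence).
hits = [
    ["root", "Bacteria", "Proteobacteria", "Pseudomonas"],
    ["root", "Bacteria", "Proteobacteria", "Xanthomonas"],
]
top_hit_taxon = hits[0][-1]                 # top-hit strategy: take the best hit as-is
lca_taxon = lowest_common_ancestor(hits)    # LCA strategy: back off to the shared ancestor
```

With ambiguous hits such as these, the top-hit strategy reports a genus, whereas the LCA strategy reports the more conservative shared ancestor.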
In the second part of the internship, data transfers between ILVO and the partners are arranged and a project report is written using R Markdown. This project report visualizes not only the rRNA content of the samples but also their metadata, such as the application of an rRNA-depletion step, the RNA extraction method, the plant tissue used, etc.
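The step in which mapping results are summarized into a count file per pathogen group could look roughly like the sketch below. The input format (one semicolon-separated, SILVA-style lineage per mapped read) and the keyword-to-group table are assumptions made for illustration; the actual grouping used in the project is not described here.

```python
from collections import Counter

# Hypothetical keyword-to-group table; the real grouping is not given in the text.
GROUP_KEYWORDS = {
    "Bacteria": "bacteria",
    "Fungi": "fungi",
    "Insecta": "insects",
    "Acari": "mites",
}


def group_of(lineage):
    """Map a semicolon-separated lineage to a pathogen group, else 'other'."""
    ranks = lineage.split(";")
    for keyword, group in GROUP_KEYWORDS.items():
        if keyword in ranks:
            return group
    return "other"


def count_by_group(lineages):
    """Count mapped reads per pathogen group."""
    return Counter(group_of(lineage) for lineage in lineages)


# Toy input: one lineage per mapped read (hypothetical data).
mapped = [
    "Bacteria;Proteobacteria;Gammaproteobacteria",
    "Eukaryota;Opisthokonta;Fungi;Ascomycota",
    "Eukaryota;Archaeplastida;Chloroplastida",
    "Bacteria;Firmicutes",
]
counts = count_by_group(mapped)
```

A table of such counts per sample is the kind of input from which the bar plots of rRNA content can be drawn.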
Abstract traineeship advanced bachelor of bioinformatics 2017-2018: Using high-throughput sequencing data from Coffea species to decipher the origin of coffee
Did you drink a cup of coffee to wake up this morning? Drinking coffee in the morning is a habit that is ingrained in many cultures around the world. As such, most people associate coffee with an espresso in the morning, a cappuccino after lunch or a latte macchiato in the evening. Some might even think of coffee beans, but coffee is certainly much more than that. Nearly all coffee that we drink originates from the plant species Coffea arabica. Coffea arabica is a very interesting species to investigate because it is the only known polyploid, more specifically tetraploid, species of the genus Coffea. Tetraploid means that the genome of this species consists of four copies of each chromosome. We know that Coffea arabica originated from the natural hybridization between two diploid Coffea species and that Coffea canephora is the paternal species. Nevertheless, the identity of the maternal ancestor of this hybrid is uncertain. The purpose of this study is to identify the maternal line of Coffea arabica.
For this project we used different types of samples, namely samples collected from greenhouse accessions, herbarium samples and samples collected in the wild that were immediately dried on silica for optimal DNA preservation. All samples were processed in the lab to obtain genotyping-by-sequencing (GBS) libraries. The GBS data were first analyzed using a standard bioinformatics pipeline that includes trimming, merging, quality filtering and mapping of the data. However, the merging efficiency was markedly lower for herbarium samples than for the other sample types, and the reference-based calling method used for single nucleotide polymorphisms (SNPs) was very inaccurate when applied to Coffea arabica samples. Because of these difficulties, we tried a new reference-free pipeline based on the GIbPSs software on a smaller set of samples (mainly accessions retrieved from the greenhouse). The GIbPSs software identifies loci without using a reference genome and detects variation (SNPs) in loci that are shared by samples.
We ran this reference-free pipeline multiple times in order to optimize the settings for our study system and added extra filtering steps to remove microbial contamination from our dataset. The addition of these filtering steps required the programming of custom Python, Bash, and R scripts and the implementation of BLAST for locus identification. Finally, we were able to exclude several candidate species as potential ancestors of Arabica coffee based on the percentage of common alleles and the number of common loci.
From these results, it is plausible that Coffea eugenioides is the maternal ancestor of Coffea arabica. This conclusion must be confirmed by further analyses on larger datasets including more Coffea species and more samples per species. However, we provide a promising reference-free bioinformatics pipeline for the identification of the hybrid origin of plant species with a complex evolutionary history. As many crops probably have a hybrid origin, this tool can be of great help in the improvement of modern agriculture.
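The two comparison criteria mentioned above, the number of common loci and the percentage of common alleles, can be illustrated with a small Python sketch. The data layout (a dictionary from locus name to a set of alleles) and the exact definition of the percentage are assumptions made for illustration; they are not taken from the GIbPSs software or from the study's scripts.

```python
def compare_to_hybrid(hybrid, candidate):
    """Return (number of common loci, percentage of common alleles).

    Both arguments map a locus name to the set of alleles observed there.
    The percentage is defined here, as an assumption, as the share of the
    candidate's alleles at common loci that also occur in the hybrid.
    """
    common_loci = hybrid.keys() & candidate.keys()
    candidate_alleles = sum(len(candidate[locus]) for locus in common_loci)
    shared_alleles = sum(len(hybrid[locus] & candidate[locus]) for locus in common_loci)
    percentage = 100.0 * shared_alleles / candidate_alleles if candidate_alleles else 0.0
    return len(common_loci), percentage


# Toy data (hypothetical loci and alleles).
arabica = {"loc1": {"A", "G"}, "loc2": {"C"}, "loc3": {"T"}}
candidate_species = {"loc1": {"A"}, "loc2": {"T"}, "loc4": {"G"}}
n_common_loci, pct_common_alleles = compare_to_hybrid(arabica, candidate_species)
```

Under these assumptions, a candidate species with few common loci or a low percentage of common alleles would be a less likely ancestor than one that scores high on both.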