ILVO Merelbeke

Contact details
Traineeship proposition

Abstract traineeship advanced bachelor of bioinformatics 2017-2018: Using high-throughput sequencing data from Coffea species to decipher the origin of coffee

Did you drink a cup of coffee to wake up this morning? Drinking coffee in the morning is a habit that is ingrained in many cultures around the world. As such, most people associate coffee with an espresso in the morning, a cappuccino after lunch or a latte macchiato in the evening. Some might even think of coffee beans, but coffee is certainly much more than that. Nearly all coffee that we drink originates from the plant species Coffea arabica. Coffea arabica is a very interesting species to investigate because it is the only known polyploid, more specifically tetraploid, species of the genus Coffea. Tetraploid means that the genome of this species consists of four copies of each chromosome. We know that Coffea arabica originated from the natural hybridization between two diploid Coffea species and that Coffea canephora is the paternal species. Nevertheless, the identity of the maternal ancestor of this hybrid is uncertain. The purpose of this study is to identify the maternal line of Coffea arabica.

For this project we used three types of samples: samples collected from greenhouse accessions, herbarium samples, and samples collected in the wild that were immediately dried on silica for optimal DNA preservation. All samples were processed in the lab to obtain genotyping-by-sequencing (GBS) libraries. The GBS data were first analyzed using a standard bioinformatics pipeline that includes trimming, merging, quality filtering and mapping of the data. However, the merging efficiency was markedly lower for herbarium samples than for the other sample types, and the reference-based calling method we used for single nucleotide polymorphisms (SNPs) was very inaccurate when applied to Coffea arabica samples. Because of these difficulties, we tried a new reference-free pipeline based on the GIbPSs software on a smaller set of samples (mainly accessions retrieved from the greenhouse). The GIbPSs software identifies loci without using a reference genome and detects variation (SNPs) in loci that are shared between samples.

We ran this reference-free pipeline multiple times to optimize the settings for our study system, and added extra filtering steps to remove microbial contamination from our dataset. Adding these filtering steps required writing custom Python, Bash, and R scripts and using BLAST for locus identification. Finally, based on the percentage of common alleles and the number of common loci, we were able to exclude several candidate species as potential ancestors of Arabica coffee.
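As a rough illustration of the two comparison metrics mentioned above, the sketch below counts the loci shared between a hybrid sample and a candidate parent and computes the percentage of the candidate's alleles at those loci that also occur in the hybrid. This is not the scripts used in the project or the GIbPSs output format: the data layout (a dictionary mapping locus IDs to sets of allele sequences), the function name `compare_to_candidate`, and the exact definition of "percentage of common alleles" are all assumptions made for the example.

```python
from typing import Dict, Set, Tuple

# Hypothetical data layout: locus ID -> set of allele sequences seen in a sample.
Genotypes = Dict[str, Set[str]]


def compare_to_candidate(hybrid: Genotypes, candidate: Genotypes) -> Tuple[int, float]:
    """Return (number of common loci, percentage of common alleles).

    Assumed definition: the percentage is the share of the candidate's
    alleles, at loci present in both samples, that also occur in the hybrid.
    """
    common_loci = hybrid.keys() & candidate.keys()
    total_alleles = 0
    shared_alleles = 0
    for locus in common_loci:
        total_alleles += len(candidate[locus])
        shared_alleles += len(candidate[locus] & hybrid[locus])
    percentage = 100.0 * shared_alleles / total_alleles if total_alleles else 0.0
    return len(common_loci), percentage


# Toy, made-up genotypes purely for illustration (not project data).
arabica = {"loc1": {"A", "C"}, "loc2": {"G"}, "loc3": {"T"}}
candidate_a = {"loc1": {"A"}, "loc2": {"G"}, "loc4": {"T"}}
candidate_b = {"loc1": {"T"}, "loc3": {"T", "G"}}

for name, genotypes in [("candidate_a", candidate_a), ("candidate_b", candidate_b)]:
    n_loci, pct = compare_to_candidate(arabica, genotypes)
    print(f"{name}: {n_loci} common loci, {pct:.1f}% common alleles")
```

Under this definition, a candidate that shares few loci with the hybrid, or whose alleles rarely appear in it, would be a weaker candidate ancestor. Any real threshold for exclusion would have to come from the data.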

From these results, it is plausible that Coffea eugenioides is the maternal ancestor of Coffea arabica. This conclusion must be confirmed by further analyses on larger datasets that include more Coffea species and more samples per species. Nevertheless, we provide a promising reference-free bioinformatics pipeline for identifying the hybrid origin of plant species with a complex evolutionary history. As many crops probably have a hybrid origin, this tool can be of great help in the improvement of modern agriculture.


Burgemeester Van Gansberghelaan 92
9820 Merelbeke


Traineeship supervisor
Tom Ruttink
Traineeship supervisor
Sofie Derycke