Abstract traineeship advanced bachelor of bioinformatics 2017-2018: Development of a sequencing quality trending platform for Illumina MiSeq, MiniSeq and NextSeq instruments
Next-generation sequencing (NGS), also known as massively parallel sequencing, allows DNA to be sequenced in a high-throughput manner. Illumina, the NGS market leader, manufactures and sells several sequencing instruments, each capable of sequencing millions of DNA fragments simultaneously. NGS data is typically analyzed for mutation analysis (variant calling), copy number analysis (CNV calling) or other applications.
For each sequencing experiment, it is crucial to know the quality of the generated sequencing data, as this determines whether the downstream analysis can produce valid results. For this purpose, an Illumina sequencing instrument provides the user with a set of quality parameters each time an experiment is performed. These quality parameters are stored in a difficult-to-access file format. The Illumina Sequencing Analysis Viewer (SAV, see Figure 1) offers a click-and-browse interface to visualize certain quality parameters for a specific sequencing experiment. This software is easy to use, but it does not allow the user to easily export the quality data, nor does it allow multiple sequencing experiments to be assessed in a trending analysis.
The purpose of the internship was to extract the quality control features from the difficult-to-access files, structure the relevant quality parameters in an easily accessible file format, automatically generate printable quality reports and, lastly, create a longitudinal trending platform for these quality parameters. The trending platform had to be compatible with Illumina MiSeq, MiniSeq and NextSeq instruments and agnostic of the sequencing experiment set-up.
At the start of the internship, a proof-of-concept script using ‘Illuminate’ was available. Illuminate is an open-source package that extracts the quality data from a single sequencing experiment output folder (‘run folder’). However, this package only supports MiSeq data and, partially, NextSeq data. Illuminate therefore had to be replaced by an in-house analysis script.
In summary, the work can be divided into three discrete parts: 1) quality feature extraction and structuring, 2) PDF report generation per sequencing experiment and 3) longitudinal trending using a web interface. Furthermore, human intervention had to be eliminated as much as possible: all analyses, reports and trending overviews had to be generated in an automated fashion.
Results – Part 1: Feature extraction
In each Illumina run folder, quality parameters are scattered across different XML and binary files. The XML files are human-readable, but complex, and contain structured information such as instrument name, instrument type, passed-filter percentages, etc. Data from these files can easily be extracted. The binary files contain quality data specific to the generation of the DNA sequences (error rates, Q30 scores, etc.). These binary files are machine-readable only and have a complicated structure that depends on the Illumina instrument (MiSeq, NextSeq or MiniSeq).
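As an illustration of why the XML side is the easy part, the following minimal Python sketch pulls a few run-level fields from an XML document. The project itself used Perl for this step. The embedded XML is a simplified, hypothetical example, and its element and attribute names are assumptions made for the sketch rather than a description of the actual instrument output.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified run-information document. The element and
# attribute names are assumptions made for this sketch.
SAMPLE_XML = """<RunInfo>
  <Run Id="180301_M01234_0042_000000000-ABCDE" Number="42">
    <Flowcell>000000000-ABCDE</Flowcell>
    <Instrument>M01234</Instrument>
    <Reads>
      <Read Number="1" NumCycles="151" IsIndexedRead="N" />
      <Read Number="2" NumCycles="8" IsIndexedRead="Y" />
      <Read Number="3" NumCycles="151" IsIndexedRead="N" />
    </Reads>
  </Run>
</RunInfo>"""


def extract_run_info(xml_text):
    """Return a flat dict with a few run-level fields from the XML text."""
    root = ET.fromstring(xml_text)
    run = root.find("Run")
    reads = run.findall("./Reads/Read")
    return {
        "run_id": run.get("Id"),
        "run_number": int(run.get("Number")),
        "instrument": run.findtext("Instrument"),
        "flowcell": run.findtext("Flowcell"),
        # Sum the cycles of all reads, index reads included.
        "total_cycles": sum(int(r.get("NumCycles")) for r in reads),
        "index_reads": sum(1 for r in reads if r.get("IsIndexedRead") == "Y"),
    }


info = extract_run_info(SAMPLE_XML)
print(info)
```

The result is a flat dictionary, which is convenient to merge with parameters from other files before writing everything to a single structured output.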
The first part, quality feature extraction and structuring, focused on extracting the relevant parameters from these XML and binary files. It was realized with a combination of Python (to process the binary files) and object-oriented (OO) Perl modules and scripts (to process the XML files). Everything is ‘glued’ together by a master script (in Perl), which operates on a run folder. Ultimately, this master script extracts all relevant quality parameters and structures them in a set of YAML files (one set per run folder). YAML is both human-readable and machine-readable and combines the advantages of ‘regular flat files’ with those of a structured format such as XML or HTML.
The object-oriented design of the Perl code made it possible to test and validate the source code automatically using an available unit testing (UT) framework, ultimately allowing quick deployment into a real production environment.
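The binary side can be sketched in Python with the standard `struct` module. The record layout below (a two-byte header followed by fixed-width records of lane, tile, cycle and error rate) is a hypothetical layout invented for this example. It is not the actual Illumina file format, which differs per metric and per instrument.

```python
import struct

# Hypothetical layout, little-endian, no padding:
#   header: version (uint8), record size in bytes (uint8)
#   record: lane (uint16), tile (uint16), cycle (uint16), error rate (float32)
HEADER = struct.Struct("<BB")
RECORD = struct.Struct("<HHHf")


def parse_metrics(blob):
    """Decode the header and all fixed-width records from a bytes object."""
    version, record_size = HEADER.unpack_from(blob, 0)
    if record_size != RECORD.size:
        raise ValueError(f"unexpected record size {record_size}")
    body = blob[HEADER.size:]
    if len(body) % record_size:
        raise ValueError("file is truncated")
    records = [
        {"lane": lane, "tile": tile, "cycle": cycle, "error_rate": err}
        for lane, tile, cycle, err in RECORD.iter_unpack(body)
    ]
    return version, records


def mean_error_per_lane(records):
    """Average the error rate over all tiles and cycles of each lane."""
    sums = {}
    for rec in records:
        total, count = sums.get(rec["lane"], (0.0, 0))
        sums[rec["lane"]] = (total + rec["error_rate"], count + 1)
    return {lane: total / count for lane, (total, count) in sums.items()}


# Build a small in-memory example instead of reading a real file.
blob = HEADER.pack(3, RECORD.size) + b"".join(
    RECORD.pack(*row)
    for row in [(1, 1101, 1, 0.5), (1, 1101, 2, 1.5), (2, 1101, 1, 0.25)]
)
version, records = parse_metrics(blob)
lane_means = mean_error_per_lane(records)
print(version, lane_means)
```

Checking the record size stored in the header against the expected size is one way to detect a layout that differs between instruments or software versions before any values are misread.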
Results – Part 2: Report generation
The second part focused on the generation of standardized PDF reports (see Figure 1) containing all relevant quality parameters stored in the YAML files. For this purpose, a Perl script, again operating at run folder level, was written. This script, in turn, uses LaTeX and Sweave.
LaTeX is a mark-up language, comparable to HTML, which allows PDF documents to be generated from source code. Sweave is an R tool that allows R source code to be embedded in and executed from within a LaTeX document, so that typical R plots can be generated automatically and directly from within the Perl script.
Ultimately, the Perl script generates a standard LaTeX (.tex) file and converts it to a PDF. This PDF contains all relevant quality parameters and plots required to either approve or reject a sequencing experiment. The PDF report captures a static ‘snapshot’ of the sequencing experiment and allows the experiment to be signed off, which is important for validation purposes.
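The idea of generating a .tex file from structured parameters can be sketched as follows. This is a simplified Python illustration, not the Perl and Sweave implementation described above: it produces only a table, without plots, and the parameter names and values are made up for the example.

```python
# Characters with a special meaning in LaTeX and their escaped forms.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#",
    "_": r"\_", "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}


def latex_escape(value):
    """Escape a value character by character so nothing is escaped twice."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in str(value))


def render_report(run_id, params):
    """Return the LaTeX source of a one-table quality report."""
    rows = "\n".join(
        f"{latex_escape(name)} & {latex_escape(value)} \\\\"
        for name, value in params.items()
    )
    return "\n".join([
        r"\documentclass{article}",
        r"\begin{document}",
        rf"\section*{{Sequencing QC report: {latex_escape(run_id)}}}",
        r"\begin{tabular}{ll}",
        r"\hline",
        r"Parameter & Value \\",
        r"\hline",
        rows,
        r"\hline",
        r"\end{tabular}",
        r"\end{document}",
        "",
    ])


# Hypothetical parameters, as they might be read from a YAML file.
params = {"Instrument": "M01234", "Q30 (%)": 92.5, "Yield_total (Gb)": 7.8}
tex = render_report("180301_M01234_0042", params)
print(tex)
```

Writing the returned string to a .tex file and compiling it with a LaTeX engine such as pdflatex would then produce the PDF. Escaping is the step most easily forgotten, since run identifiers and parameter names frequently contain underscores or percent signs.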
Results – Part 3: Longitudinal trending
The third part focused on the creation of a longitudinal trending platform. This longitudinal approach allows one or more parameters to be assessed over time, enabling measurement of, for example, lab, operator or instrument performance.
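A longitudinal view of one parameter can be sketched in Python as follows. The run summaries are hypothetical. The rule used to flag a run (a drop of more than a fixed amount below the mean of the preceding runs on the same instrument) is an assumption chosen for illustration and is not taken from the platform itself.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical per-run summaries, deliberately not in chronological order.
RUNS = [
    {"run_date": date(2018, 3, 1), "instrument": "MiSeq-A", "q30": 93.0},
    {"run_date": date(2018, 1, 10), "instrument": "MiSeq-A", "q30": 94.0},
    {"run_date": date(2018, 2, 5), "instrument": "MiSeq-A", "q30": 92.0},
    {"run_date": date(2018, 3, 20), "instrument": "MiSeq-A", "q30": 81.0},
    {"run_date": date(2018, 2, 14), "instrument": "NextSeq-B", "q30": 88.0},
    {"run_date": date(2018, 3, 2), "instrument": "NextSeq-B", "q30": 87.0},
]


def trend(runs, group_key, parameter, window=3, max_drop=5.0):
    """Group runs by group_key, order them by date and compare each value
    with the mean of up to `window` preceding runs in the same group."""
    series = defaultdict(list)
    for run in sorted(runs, key=lambda r: r["run_date"]):
        history = series[run[group_key]]
        previous = [point["value"] for point in history[-window:]]
        baseline = mean(previous) if previous else None
        value = run[parameter]
        history.append({
            "date": run["run_date"],
            "value": value,
            "baseline": baseline,
            # Illustrative rule: flag a drop larger than max_drop.
            "flagged": baseline is not None and baseline - value > max_drop,
        })
    return dict(series)


result = trend(RUNS, "instrument", "q30")
for instrument, points in result.items():
    for point in points:
        print(instrument, point["date"], point["value"], point["flagged"])
```

Passing a different `group_key`, for example an operator or lab field if one is recorded per run, would give the corresponding per-operator or per-lab view.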
Results – Part 4: Automation
Human intervention was eliminated by combining ‘trigger scripts’ with the Linux crontab. These trigger scripts are executed by crontab at frequent intervals. The first trigger script scans the sequencing data repository and flags new run folders for processing. A second trigger script scans the sequencing data repository and flags run folders for which the YAML files have been generated but no PDF report is available yet. The flagged run folders are stored in a MySQL database. An executor script, also started periodically by crontab, launches the required scripts (as described in parts 1 and 2). All of these scripts are written in Perl.
The script that generates the JSON file used by the longitudinal trending platform is also executed periodically (using crontab), so that the website always contains up-to-date parameters.
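The logic of the two trigger scripts can be sketched in Python as below. Several things are assumptions made for the example: the scripts in the project are written in Perl, an in-memory SQLite database stands in for the MySQL database, and the file name patterns (`*.yaml`, `*.pdf`), table name and stage names are hypothetical.

```python
import sqlite3
import tempfile
from pathlib import Path


def needed_stage(run_folder):
    """Decide which processing step, if any, a run folder still needs."""
    has_yaml = any(run_folder.glob("*.yaml"))
    has_pdf = any(run_folder.glob("*.pdf"))
    if not has_yaml:
        return "extract"   # new run folder: quality features not extracted yet
    if not has_pdf:
        return "report"    # YAML present, PDF report still missing
    return None


def flag_run_folders(repository, db):
    """Scan the repository and queue every run folder that needs work."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS queue ("
        "run_folder TEXT, stage TEXT, PRIMARY KEY (run_folder, stage))"
    )
    for folder in sorted(p for p in Path(repository).iterdir() if p.is_dir()):
        stage = needed_stage(folder)
        if stage is not None:
            # The primary key makes repeated scans idempotent.
            db.execute(
                "INSERT OR IGNORE INTO queue VALUES (?, ?)", (folder.name, stage)
            )
    db.commit()


# Demonstration on a temporary directory with three fake run folders.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "run_a").mkdir()
    (repo / "run_b").mkdir()
    (repo / "run_b" / "quality.yaml").write_text("q30: 92.5\n")
    (repo / "run_c").mkdir()
    (repo / "run_c" / "quality.yaml").write_text("q30: 90.1\n")
    (repo / "run_c" / "report.pdf").write_bytes(b"%PDF-")

    db = sqlite3.connect(":memory:")
    flag_run_folders(repo, db)
    flag_run_folders(repo, db)  # a second scan must not add duplicates
    queue = db.execute(
        "SELECT run_folder, stage FROM queue ORDER BY run_folder"
    ).fetchall()
    db.close()

print(queue)
```

Because the scan runs repeatedly from a scheduler, making the insert idempotent matters: a folder that is still waiting for processing should not be queued a second time on the next scan.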
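The JSON generation step could look roughly like the following Python sketch. The structure of the JSON document (one chronologically ordered list of runs per instrument) and all field names are assumptions, since the actual file layout used by the website is not described here.

```python
import json
from datetime import date


def build_trending_json(run_summaries):
    """Serialize per-run summaries to JSON, grouped by instrument and
    ordered by run date within each instrument."""
    ordered = sorted(
        run_summaries, key=lambda r: (r["instrument"], r["run_date"])
    )
    payload = {}
    for run in ordered:
        entry = {k: v for k, v in run.items() if k != "instrument"}
        # Dates are not JSON-serializable, so store them as ISO 8601 strings.
        entry["run_date"] = run["run_date"].isoformat()
        payload.setdefault(run["instrument"], []).append(entry)
    return json.dumps(payload, indent=2, sort_keys=True)


# Hypothetical summaries, as they might be collected from the YAML files.
summaries = [
    {"instrument": "MiSeq-A", "run_date": date(2018, 3, 1), "q30": 93.0},
    {"instrument": "NextSeq-B", "run_date": date(2018, 2, 14), "q30": 88.0},
    {"instrument": "MiSeq-A", "run_date": date(2018, 1, 10), "q30": 94.0},
]
text = build_trending_json(summaries)
print(text)
```

Regenerating the whole file on each scheduled execution keeps the script stateless, at the cost of re-reading all run summaries every time.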
An automated sequencing quality trending platform requiring no hands-on or manual steps for Illumina MiSeq, MiniSeq and NextSeq instruments was successfully developed.
Joachim De Schrijver