Agilent Technologies
Abstract 2019-2020: Streamlining generation and querying of sequencing data annotation via a standardized front-end
Next Generation Sequencing (NGS) is commonly used within Multiplicom to develop and test its molecular diagnostics kits. An important part of this process is properly annotating and storing the generated sequencing data. Annotating the data allows users to retrieve the correct files when the data is needed for a subsequent phase.
The project is divided into two parts:
- Automatization of the creation of the annotation files
- Querying existing annotation files
Both parts of the project follow similar steps; the querying part requires one extra preceding step.
For the automated creation of the annotation files, the user fills in a front-end form, after which the data is sent through an API (Application Programming Interface). The API performs several checks on the data to ensure that all necessary fields are filled in and that no user-related errors can influence the created annotation file. The checked user data is passed from the API to an underlying Python script via a temporary file; this script creates the annotation file.
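A minimal sketch of the API-side validation and hand-off in Python (the field names, the script path, and the JSON hand-off format are assumptions for illustration, not the actual implementation):

```python
import json
import subprocess
import tempfile

# Hypothetical required fields; the real front-end defines its own set.
REQUIRED_FIELDS = ["project", "run_id", "operator", "kit"]

def validate(form_data):
    """Return a list of error messages for missing or empty fields."""
    return [f"missing field: {f}" for f in REQUIRED_FIELDS
            if not form_data.get(f)]

def hand_off(form_data):
    """Write the checked data to a temporary file and invoke the
    annotation-generating script (path is illustrative)."""
    with tempfile.NamedTemporaryFile("w", suffix=".json",
                                     delete=False) as tmp:
        json.dump(form_data, tmp)
        tmp_path = tmp.name
    subprocess.run(["python", "create_annotation.py", tmp_path], check=True)
```

Rejecting incomplete submissions at the API keeps user-related errors out of the generated annotation files.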
In the second part, an extra preceding step was necessary: summarizing all the annotation files into a database using SQLite (sqlite3). This summarizing script is run via crontab to ensure that the database stays up to date with the newest annotation files. The user can fill in the necessary search parameters via a separate query front-end.
This user data is sent to an API, which performs several checks on it. Once the data has passed these checks, it is forwarded to a PHP script that queries the SQLite database based on the entered parameters. The query results are formatted to show only the names of the matching annotation files; if the user wants more information, pressing a button displays a summary of the annotation file.
A testing script was created to test several types of user input in both front-ends. The script sends different kinds of requests to the API and compares each result with the expected result. This makes testing the API faster and more user-friendly, since the different scenarios do not have to be created manually and no comparison has to be made by the user.
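The core of such a test runner can be sketched as follows; in production the `send` function would issue an HTTP request to the API, but here it is injected so the comparison logic stands on its own (names are illustrative):

```python
def run_cases(send, cases):
    """Run request/response test cases against an API.
    `cases`: list of (payload, expected_response) tuples.
    Returns a list of failure descriptions (empty = all passed)."""
    failures = []
    for payload, expected in cases:
        actual = send(payload)
        if actual != expected:
            failures.append({"payload": payload,
                             "expected": expected,
                             "actual": actual})
    return failures
```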
The implementation of the annotation generation front-end reduces the number of user-related errors, while the query front-end improves time efficiency.
Abstract traineeship advanced bachelor of bioinformatics 2017-2018: Development of sequencing quality trending platform for Illumina MiSeq, MiniSeq and NextSeq instruments
Background
Next generation sequencing (NGS), also known as massive parallel sequencing, allows DNA sequencing in a high-throughput manner. Illumina, the NGS market leader, manufactures and sells different sequencing instruments capable of sequencing millions of DNA fragments simultaneously. NGS data is typically analyzed for the purpose of mutation analysis (variant calling), copy number analysis (CNV calling) or other applications.
For each sequencing experiment, it is crucial to know the quality of the generated sequencing data, as this determines whether the downstream analysis can produce valid results. For this purpose, an Illumina sequencing instrument provides the user with quality parameters each time an experiment is performed. These quality parameters are contained in a difficult-to-access file format. However, the Illumina Sequencing Analysis Viewer (SAV, see Figure 1) provides the user with a click-and-browse interface to visualize certain quality parameters for a specific sequencing experiment. This software is easy to use, but does not allow the user to easily export the quality data, nor does it allow multiple sequencing experiments to be assessed in a trending analysis.
Purpose
The purpose of the internship was to extract the quality control features from the difficult-to-access files, structure the relevant quality parameters in an easily accessible file format, automatically generate printable quality reports and, lastly, create a longitudinal trending platform for said quality parameters. The trending platform must be compatible with Illumina MiSeq, MiniSeq and NextSeq instruments and agnostic of the sequencing experiment set-up.
At the start of the internship, a proof-of-concept script using ‘Illuminate’ was available. Illuminate is an open-source package that extracts the quality data from a single sequencing experiment output folder (‘run folder’). However, this package fully supports only MiSeq data and only partially supports NextSeq data. Illuminate therefore had to be replaced by an in-house analysis script.
In summary, the work can be divided into three discrete parts: 1) quality feature extraction and structuring, 2) PDF report generation per sequencing experiment and 3) longitudinal trending using a web interface. Furthermore, human intervention had to be eliminated as much as possible: all analyses, reports and trending overviews need to be generated in an automated fashion.
Results – Part 1: Feature extraction
In each Illumina run folder, quality parameters are scattered across different XML and binary files. The XML files are human-readable, but complex, and contain structured information such as instrument name, instrument type, passed-filter percentages etc. Data from these files can easily be extracted. The binary files contain quality data specific to the generation of the DNA sequences (error rates, Q30 scores etc). These binary files are machine-readable only and have a complicated structure depending on the Illumina instrument (MiSeq, NextSeq or MiniSeq).
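As an illustration of how such machine-readable binary files can be parsed, the following Python sketch reads fixed-size records with the `struct` module. The record layout here (lane, tile, cycle, error rate) is a hypothetical example; the real Illumina layouts differ per file and instrument:

```python
import struct

# Assumed little-endian record: lane (uint16), tile (uint16),
# cycle (uint16), error_rate (float32) -- illustrative only.
RECORD = struct.Struct("<HHHf")

def parse_records(data):
    """Yield one dict per fixed-size record in the byte string."""
    for offset in range(0, len(data) - RECORD.size + 1, RECORD.size):
        lane, tile, cycle, error_rate = RECORD.unpack_from(data, offset)
        yield {"lane": lane, "tile": tile,
               "cycle": cycle, "error_rate": error_rate}
```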
The first part, quality feature extraction and structuring, focused on the extraction of the relevant parameters from these XML and binary files and was realized by a combination of Python (to process the binary files) and object-oriented (OO) Perl modules/scripts (to process the XML files). Everything is ‘glued’ together by a master script (in Perl), which operates on a run folder. Ultimately, this master script extracts all relevant quality parameters and structures them in a set of YAML files (one set per run folder). The YAML format is both human- and machine-readable and combines the advantages of ‘regular flat files’ with those of a structured format such as XML or HTML.
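The structuring step can be illustrated with a minimal emitter for a flat set of quality parameters (a real implementation would use a YAML library; the parameter keys are illustrative):

```python
def to_yaml(params):
    """Serialize a flat dict of quality parameters as simple YAML.
    Handles only scalar values; keys are sorted for stable output."""
    return "\n".join(f"{key}: {value}"
                     for key, value in sorted(params.items())) + "\n"
```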
The OO Perl design made it possible to do automated source-code testing and validation using an available unit-testing (UT) framework, ultimately allowing quick deployment into a real production environment.
Results – Part 2: Report generation
The second part focused on the generation of standardized PDF reports (see Figure 1) containing all relevant quality parameters stored in the YAML files. To this end, a Perl script, again operating at run-folder level, was made. This script, in turn, uses LaTeX and Sweave.
LaTeX is a mark-up language, comparable to HTML, which allows PDFs to be generated from source code. Sweave is an R package that allows R source code to be embedded in and executed from within LaTeX, so typical R plots can be generated automatically and directly from within the Perl script.
Ultimately, the Perl script generates a standard LaTeX (.tex) file and converts it to a PDF. This PDF contains all relevant quality parameters and plots required to either approve or reject a sequencing experiment. The PDF report allows a static ‘snapshot’ to be taken of the sequencing experiment and allows sign-off of the experiment, important for validation purposes.
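The assembly of the .tex source can be sketched as follows. The actual pipeline uses Perl with Sweave-embedded R plots; this Python version, with illustrative parameter keys, only shows how the LaTeX document is built from the extracted parameters:

```python
# Minimal LaTeX document skeleton; %s is filled with table rows.
TEMPLATE = r"""\documentclass{article}
\begin{document}
\section*{Sequencing quality report}
\begin{tabular}{ll}
%s
\end{tabular}
\end{document}
"""

def make_tex(params):
    """Render a flat dict of quality parameters as a LaTeX table."""
    rows = "\n".join(rf"{k} & {v} \\" for k, v in sorted(params.items()))
    return TEMPLATE % rows
```

The resulting string would then be written to a .tex file and compiled to PDF (e.g. with pdflatex).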
Results – Part 3: Longitudinal trending
The third part focused on the creation of a longitudinal trending platform. This longitudinal approach allows one (or more) parameters to be assessed over time enabling measurement of, for example, lab, operator or instrument performance.
Therefore, a Perl script was made that extracts all relevant data from the YAML files of multiple run folders and converts it to a JSON file. JSON is a frequently used data interchange format that allows data to be passed between different scripts or programs. This JSON file is then visualized via a web interface displaying different graphs. For each instrument, an interactive plot is available that allows a chosen quality parameter to be shown for the last 10 sequencing experiments (see Figure 1). This web interface is made using a combination of HTML and JavaScript (jQuery and Chart.js).
Results – Part 4: Automation
Human intervention was eliminated by combining ‘trigger scripts’ and Linux crontab. These trigger scripts are executed by crontab at frequent intervals. The first trigger script scans the sequencing data repository and flags new run folders for processing. A second trigger script scans the repository and flags run folders where the YAML files have been generated but no PDF report is available yet. The flagged run folders are stored in a MySQL database. An executor script starts, at periodic intervals (using crontab), the required scripts (as described in steps 1 and 2). All scripts are written in Perl.
The script that generates the JSON file used in the longitudinal trending platform is also executed periodically (using crontab) allowing the website to contain up-to-date parameters.
Conclusion
An automated sequencing quality trending platform requiring no hands-on or manual steps for Illumina MiSeq, MiniSeq and NextSeq instruments was successfully developed.
Tags: bioinformatics
Address
Galileilaan 18
2845 Niel
Belgium
Contacts
Traineeship supervisor
Joachim De Schrijver
joachim.de-schrijver@agilent.com