Abstract traineeship advanced bachelor of bioinformatics 2017-2018: Development and implementation of bioinformatic algorithms for NGS data analysis
At the Center of Molecular Diagnostics (CMD) of the Jessa Hospital, targeted next-generation sequencing (NGS) is used to detect somatic variants in solid and myeloid tumors. A dedicated panel is used for each tumor group (solid or myeloid). Samples are sequenced on MiSeq instruments (Illumina) and the results are processed with the accompanying software, MiSeq Reporter (Illumina). Variants are then annotated and filtered with the VariantStudio software (Illumina) using specific criteria. Relevant variants are reported to the clinician and inform the patient’s diagnosis, prognosis and treatment.
In this set-up, data can easily be analyzed on a per-sample basis. However, a meta-analysis of these NGS data is not feasible because they are spread over different files and stored in various file formats. The goal of this traineeship was to design and develop a database that groups these data and serves two main purposes. Firstly, when a variant formerly classified as a variant of unknown significance becomes relevant for disease management, patients carrying that variant can be easily traced via the database and appropriate measures can be taken. Secondly, data can be easily extracted from the database for statistical analyses and visualizations for research or publication purposes. In addition, the database enables long-term quality control by allowing detected mutation frequencies to be compared with population mutation frequencies over time.
To realize this project, an automated workflow was established. Firstly, since the VariantStudio software used for annotation and filtering of the variants is not suitable for automation, this annotation and filtering process was reproduced as closely as possible using the open-source tools SnpEff and SnpSift. This was accomplished via two Bash scripts, one for annotation and one for filtering of the variants. The former takes variant call format (VCF) files of several NGS runs as input and generates annotated VCF files per run; the latter then filters these VCF files and generates a raw database file containing the annotated and filtered variants of the processed runs. This final output was used to optimize the SnpSift filter by comparing it with the results obtained via VariantStudio. Secondly, two Python scripts were developed: one to further process the raw database file and one to parse relevant analysis, sample and patient data present in Microsoft Excel files. The Bash and Python scripts were assembled into one main Bash script, which takes the ‘raw’ VCF files as input and outputs a number of database files for import into the database. In addition, log files and a file containing ‘aberrant’ data for manual curation are generated. Thirdly, a database was developed and implemented in Microsoft Access, taking database normalization principles into account. The aforementioned database files can be easily (manually) imported into the database. Finally, several queries and forms were designed to consult the data in an organized way. Where appropriate, Visual Basic for Applications (VBA) procedures were written.
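As an illustration of the step that turns annotated VCF records into a flat raw database file, the following is a minimal Python sketch and not the scripts written during the traineeship. It assumes SnpEff's standard `ANN` INFO field and reads only its first annotation entry. The function names (`parse_vcf_line`, `keep`, `rows_to_tsv`), the column selection, the impact-based filter and the example line are all hypothetical; the actual filter criteria are not stated here.

```python
import csv
import io

# Leading subfields of the SnpEff ANN annotation, in their documented order.
ANN_FIELDS = ["allele", "effect", "impact", "gene", "gene_id", "feature_type",
              "feature_id", "biotype", "rank", "hgvs_c", "hgvs_p"]

# Hypothetical column layout of the flat output file.
COLUMNS = ["run", "chrom", "pos", "ref", "alt",
           "gene", "effect", "impact", "hgvs_c", "hgvs_p"]

# Illustrative input only, not taken from the project data.
EXAMPLE_LINE = (
    "chr7\t55259515\t.\tT\tG\t100\tPASS\t"
    "DP=1500;ANN=G|missense_variant|MODERATE|EGFR|EGFR|transcript|"
    "NM_005228.3|protein_coding|21/28|c.2573T>G|p.Leu858Arg"
)


def parse_vcf_line(line, run_id):
    """Turn one VCF data line into a flat record using the first ANN entry."""
    chrom, pos, _id, ref, alt, _qual, flt, info = line.rstrip("\n").split("\t")[:8]
    # INFO is a ';'-separated list; flag entries without '=' are skipped.
    info_map = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    first_ann = info_map.get("ANN", "").split(",")[0].split("|")
    record = {"run": run_id, "chrom": chrom, "pos": int(pos),
              "ref": ref, "alt": alt, "filter": flt}
    record.update(zip(ANN_FIELDS, first_ann))
    return record


def keep(record, impacts=("HIGH", "MODERATE")):
    """Assumed filter: passing calls with a high or moderate predicted impact."""
    return record["filter"] == "PASS" and record.get("impact") in impacts


def rows_to_tsv(records):
    """Serialize the kept records as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, delimiter="\t",
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(r for r in records if keep(r))
    return buffer.getvalue()
```

Calling `rows_to_tsv([parse_vcf_line(EXAMPLE_LINE, "run01")])` yields a header row plus one data row; records that fail the assumed filter are left out and could instead be routed to a separate file for manual curation.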
In conclusion, an automated workflow for annotating, filtering and storing variants in a database was successfully implemented. The developed database meets the initial objectives. Patients carrying a particular variant can be easily tracked down via the database. Moreover, data can be consulted per sequencing panel with the option to set different search criteria. Furthermore, it is possible to detect whether a patient has had more than one sample analyzed. This is important in the context of patient follow-up, as a patient initially eligible for a certain therapy might develop resistance to it by acquiring other mutations at a later stage. If this is detected early on, therapy might be adjusted. Besides supporting meta-analyses, the database allows data to be easily exported for further statistical processing and visualization in Excel, R or any other statistical software package.
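The two look-ups mentioned above, tracing patients who carry a given variant and finding patients with more than one analyzed sample, can be sketched as follows. This is an assumption-laden illustration: Python's built-in `sqlite3` module stands in for Microsoft Access, and the three-table schema, the identifiers and the placeholder variants are hypothetical rather than the actual database design.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a small, made-up normalized schema."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE patient (patient_id TEXT PRIMARY KEY);
        CREATE TABLE sample (
            sample_id  TEXT PRIMARY KEY,
            patient_id TEXT REFERENCES patient(patient_id),
            panel      TEXT
        );
        CREATE TABLE variant_call (
            sample_id TEXT REFERENCES sample(sample_id),
            gene      TEXT,
            hgvs_c    TEXT
        );
    """)
    con.executemany("INSERT INTO patient VALUES (?)", [("P1",), ("P2",)])
    con.executemany("INSERT INTO sample VALUES (?, ?, ?)", [
        ("S1", "P1", "solid"),
        ("S2", "P1", "solid"),
        ("S3", "P2", "myeloid"),
    ])
    con.executemany("INSERT INTO variant_call VALUES (?, ?, ?)", [
        ("S1", "GENE_A", "c.100A>G"),
        ("S2", "GENE_A", "c.100A>G"),
        ("S2", "GENE_B", "c.200C>T"),
        ("S3", "GENE_C", "c.300G>A"),
    ])
    return con


def patients_with_variant(con, gene, hgvs_c):
    """Return the IDs of patients with at least one sample carrying the variant."""
    rows = con.execute("""
        SELECT DISTINCT s.patient_id
        FROM variant_call AS v
        JOIN sample AS s ON s.sample_id = v.sample_id
        WHERE v.gene = ? AND v.hgvs_c = ?
        ORDER BY s.patient_id
    """, (gene, hgvs_c))
    return [row[0] for row in rows]


def patients_with_multiple_samples(con):
    """Return (patient_id, sample_count) for patients with more than one sample."""
    rows = con.execute("""
        SELECT patient_id, COUNT(*)
        FROM sample
        GROUP BY patient_id
        HAVING COUNT(*) > 1
        ORDER BY patient_id
    """)
    return [(row[0], row[1]) for row in rows]
```

Keeping patients, samples and variant calls in separate tables is what makes both questions a single join or grouping; the same queries could be expressed in Access SQL against an equivalent schema.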