Abstract traineeship advanced bachelor of bioinformatics 2019-2020: Adaptation, validation and deployment of R-based HR deficiency detection tools in targeted gene panels
With the advent of cheap and reliable DNA sequencing technologies and their wide-scale adoption in industry, healthcare was bound to follow suit. Using genomic data acquired from patients, a personalized approach to medicine can be taken. A prime example of personalized medicine is cancer treatment, in which specific mutational signatures can be tackled in different ways. A myriad of bioinformatics tools has been developed for the analysis and visualization of these data.
The right tools for the job
For this project, a tool named “SigMA” was selected. SigMA is an R-based Shiny application that specializes in the detection of Signature 3, the mutational signature associated with HRD (Homologous Recombination Deficiency), in breast cancer data. SigMA can be used for much more than breast cancer data, however. Twenty-six different tumor types can be selected, thirteen of which have specially tuned models for the accurate detection of Signature 3. SigMA is also not restricted to data from a single sequencing platform and is specifically developed for targeted gene panels (rather than exome or whole-genome sequencing), which is a major boon for the staff at the Jessa hospital in Hasselt, who use the targeted pan-cancer Illumina TSO500 sequencing panel. Another advantage SigMA has over its competitors is that it contains four different analysis methods. The most important of these are Multivariable Analysis for the thirteen tumor types mentioned earlier and Maximum Likelihood for all others. The output SigMA generates, combined with other clinical observations, should give an accurate picture of HRD status in patients. Lastly, SigMA is also very user-friendly. It only requires VCF (Variant Call Format) or MAF (Mutation Annotation Format) files to perform an analysis. The only preparation these files may need is filtering, both to reduce the analysis time per sample and to remove unnecessary data that could confuse the algorithm.
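The filtering step mentioned above could look roughly like the following Python sketch. It is not the filter used in the project: the function name `filter_vcf_lines`, the `min_qual` threshold and the example records are all assumptions made for illustration. It keeps the header lines and only those records whose FILTER column is `PASS` (or unset) and whose QUAL meets the threshold.

```python
def filter_vcf_lines(lines, min_qual=30.0):
    """Keep VCF header lines and records that pass a simple quality filter.

    min_qual is a hypothetical threshold, chosen only for this example.
    """
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Meta-information and column header lines are always kept.
            kept.append(line)
            continue
        fields = line.split("\t")
        if len(fields) < 8:
            continue  # skip malformed records
        qual, filter_status = fields[5], fields[6]
        if filter_status not in ("PASS", "."):
            continue  # the variant caller flagged this record
        if qual != "." and float(qual) < min_qual:
            continue  # below the illustrative quality threshold
        kept.append(line)
    return kept


# Made-up records, for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr17\t43094464\t.\tC\tT\t99\tPASS\tDP=812",
    "chr13\t32340301\t.\tG\tA\t12\tPASS\tDP=40",
    "chr7\t55191822\t.\tT\tG\t80\tLowDP\tDP=9",
]
filtered = filter_vcf_lines(example)
print("\n".join(filtered))
```

In practice the filter criteria would depend on the variant caller and the sequencing panel, so the threshold here should be read as a placeholder.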
A standalone Shiny app was developed so that SigMA’s output can be rendered again after an analysis has concluded, since SigMA itself offers no option to display the results of a previous analysis without rerunning it.
To assess the output generated by SigMA, it was compared with that of another tool, signature.tools.lib. Signature.tools.lib is an R package that has to be built locally on the host system and, unlike SigMA, provides no Shiny interface. It uses cosine similarity for all of its calculations and is less user-friendly than SigMA, since running it requires an R script. This does, however, make it easier to integrate into a pipeline. Signature.tools.lib is designed to work only on Linux platforms, but can be adapted to run on Windows systems as well. It furthermore requires VCF files to be bgzipped and indexed with a TBI (Tabix Index) file. These files can be generated with the bgzip and tabix utilities distributed alongside Samtools, which are developed for Unix-like systems such as Linux, so additional steps have to be taken to make this tool Windows compatible, if desired.
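To make the cosine similarity measure concrete, here is a minimal Python sketch that compares a sample's mutation counts with a reference signature. It does not reproduce the code of signature.tools.lib; the function name and the six-channel toy vectors are assumptions for illustration (real single-base substitution catalogues use 96 channels).

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_a * norm_b)


# Toy data: mutation counts of a sample and a signature given as proportions.
sample_counts = [12, 3, 5, 20, 4, 6]
signature = [0.24, 0.06, 0.10, 0.40, 0.08, 0.12]

similarity = cosine_similarity(sample_counts, signature)
print(round(similarity, 3))
```

Because cosine similarity ignores vector length, raw counts can be compared directly with a signature expressed as proportions; in this toy case the two vectors are proportional, so the similarity equals 1 up to floating-point rounding.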
While both tools, after adjustment and improvement, proved successful in detecting mutational signatures in targeted sequencing panel data generated by the Jessa hospital, further measures will be needed to properly validate them for clinical use. SigMA proved the more promising of the two by virtue of its user-friendliness, ease of deployment, robust analysis techniques and the traceability of its final analysis results.
Abstract traineeship advanced bachelor of bioinformatics 2017-2018: Development and implementation of bioinformatic algorithms for NGS data analysis
At the Center of Molecular Diagnostics (CMD) of the Jessa Hospital, targeted next-generation sequencing (NGS) is used to detect somatic variants in solid and myeloid tumors. For each group of tumors (solid or myeloid) a specific panel is used. Samples are analyzed on MiSeq instruments (Illumina) and results are processed with the accompanying software, i.e. MiSeq Reporter (Illumina). Subsequently, variants are annotated and filtered with the VariantStudio software (Illumina) using specific criteria. Relevant variants are reported to the clinician, who can use them to determine the patient’s diagnosis, prognosis and treatment.
In this set-up, data can easily be analyzed on a per-sample basis. However, a meta-analysis of these NGS data is impossible, as they are spread over different files and stored in various file formats. The goal of this traineeship was to design and develop a database that groups these data and serves two main purposes. Firstly, when a variant formerly classified as a variant of unknown significance becomes relevant for disease management, patients carrying this particular variant can easily be traced via the database and appropriate measures can be taken. Secondly, data can easily be extracted from the database for statistical analyses and visualizations for research or publication purposes. In addition, the database enables long-term quality control by comparing detected mutation frequencies with population mutation frequencies over time.
To realize this project, an automated workflow was established. Firstly, since the VariantStudio software used for annotation and filtering of the variants is not suitable for automation, this annotation and filtering process was reproduced as closely as possible using the open-source tools SnpEff and SnpSift. This was accomplished via two Bash scripts, one for annotation and one for filtering of the variants. The former takes variant call format (VCF) files of several NGS runs as input and generates annotated VCF files per run; the latter then filters these VCF files and generates a raw database file containing the annotated and filtered variants of the processed runs. This final output was used to optimize the SnpSift filter by comparing it with the results obtained via VariantStudio. Secondly, two Python scripts were developed: one to further process the raw database file and one to parse relevant analysis, sample and patient data present in Microsoft Excel files. The Bash and Python scripts were combined into one main Bash script, which takes the ‘raw’ VCF files as input and outputs a number of database files for import into the database. In addition, log files and a file containing ‘aberrant’ data for manual curation are generated. Thirdly, a database was developed and implemented in Microsoft Access, taking database normalization principles into account. The aforementioned database files can easily be imported into the database manually. Finally, several queries and forms were designed to consult the data in an organized way. Where appropriate, Visual Basic for Applications (VBA) procedures were written.
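As an illustration of the step that turns annotated VCF records into rows of a database file, the Python sketch below reads the first entry of the `ANN` field that SnpEff adds to the INFO column. It is not the script written during the traineeship: the function names, the chosen columns, the `run_id` value and the example record are assumptions.

```python
import csv
import io

# Names for the first eleven pipe-separated subfields of a SnpEff ANN entry.
ANN_KEYS = [
    "allele", "annotation", "impact", "gene", "gene_id", "feature_type",
    "feature_id", "biotype", "rank", "hgvs_c", "hgvs_p",
]


def first_annotation(info):
    """Return the first ANN entry of an INFO string as a dict (empty if absent)."""
    for entry in info.split(";"):
        if entry.startswith("ANN="):
            first = entry[len("ANN="):].split(",")[0]
            return dict(zip(ANN_KEYS, first.split("|")))
    return {}


def vcf_record_to_row(line, run_id):
    """Convert one annotated VCF record into a flat row for a database file."""
    chrom, pos, _id, ref, alt, _qual, _flt, info = line.rstrip("\n").split("\t")[:8]
    ann = first_annotation(info)
    return {
        "run_id": run_id,
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "alt": alt,
        "gene": ann.get("gene", ""),
        "effect": ann.get("annotation", ""),
        "hgvs_c": ann.get("hgvs_c", ""),
        "hgvs_p": ann.get("hgvs_p", ""),
    }


# Illustrative record; run_id is a hypothetical identifier.
record = (
    "chr17\t7674220\t.\tC\tT\t99\tPASS\t"
    "DP=500;ANN=T|missense_variant|MODERATE|TP53|ENSG00000141510|transcript|"
    "ENST00000269305|protein_coding|7/11|c.743G>A|p.Arg248Gln|||||"
)
row = vcf_record_to_row(record, run_id="RUN001")

# Write the row as tab-separated text, as it might appear in a database file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row), delimiter="\t")
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```

Taking only the first ANN entry is a simplification; a variant can carry several annotations, one per affected transcript, and a real workflow would have to decide which of them to store.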
In conclusion, an automated workflow for annotating, filtering and storing variants in a database was successfully implemented. The developed database meets the initial objectives. Patients carrying a particular variant can easily be tracked down via the database. Moreover, data can be consulted per sequencing panel with the option to set different search criteria. Furthermore, it is possible to detect whether a patient has had more than one sample analyzed. This is important in the context of patient follow-up, as a patient initially eligible for a certain therapy might develop resistance to this therapy by acquiring other mutations at a later stage. If this is detected early on, therapy might be adjusted. In addition to using the database for meta-analyses, data can easily be exported for further statistical processing and visualization in Excel, R or any other statistical software package.
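The two lookups described above, tracing the patients who carry a given variant and finding patients with more than one analyzed sample, can be sketched as SQL queries. The example below uses SQLite from Python in place of Microsoft Access so that it is self-contained; the table names, column names and all data are hypothetical and do not reflect the actual database design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (patient_id INTEGER PRIMARY KEY, code TEXT);
CREATE TABLE sample (
    sample_id INTEGER PRIMARY KEY,
    patient_id INTEGER REFERENCES patient(patient_id),
    panel TEXT
);
CREATE TABLE variant_call (
    sample_id INTEGER REFERENCES sample(sample_id),
    gene TEXT,
    hgvs_p TEXT
);
""")

# Made-up data, for illustration only.
conn.executemany("INSERT INTO patient VALUES (?, ?)",
                 [(1, "P001"), (2, "P002"), (3, "P003")])
conn.executemany("INSERT INTO sample VALUES (?, ?, ?)",
                 [(10, 1, "solid"), (11, 1, "solid"),
                  (12, 2, "myeloid"), (13, 3, "solid")])
conn.executemany("INSERT INTO variant_call VALUES (?, ?, ?)",
                 [(10, "KRAS", "p.Gly12Asp"), (11, "KRAS", "p.Gly12Asp"),
                  (11, "TP53", "p.Arg248Gln"), (12, "JAK2", "p.Val617Phe"),
                  (13, "KRAS", "p.Gly12Asp")])


def patients_with_variant(conn, gene, hgvs_p):
    """Return the codes of all patients with at least one sample carrying the variant."""
    rows = conn.execute(
        """SELECT DISTINCT p.code
           FROM patient p
           JOIN sample s ON s.patient_id = p.patient_id
           JOIN variant_call v ON v.sample_id = s.sample_id
           WHERE v.gene = ? AND v.hgvs_p = ?
           ORDER BY p.code""",
        (gene, hgvs_p),
    ).fetchall()
    return [code for (code,) in rows]


def patients_with_multiple_samples(conn):
    """Return (patient code, sample count) for patients with more than one sample."""
    return conn.execute(
        """SELECT p.code, COUNT(*)
           FROM patient p
           JOIN sample s ON s.patient_id = p.patient_id
           GROUP BY p.patient_id, p.code
           HAVING COUNT(*) > 1
           ORDER BY p.code"""
    ).fetchall()


carriers = patients_with_variant(conn, "KRAS", "p.Gly12Asp")
repeated = patients_with_multiple_samples(conn)
print(carriers)
print(repeated)
```

Keeping patients, samples and variant calls in separate tables follows the normalization idea mentioned earlier: each patient is stored once, and both lookups reduce to joins over the linking keys.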