Center of Medical Genetics Edegem
Abstract 1 2019-2020: Pipeline for Copy Number Alteration and Tumor Fraction Detection in Cell-Free DNA of Cancer Patients using Ultra Low-Pass Whole-Genome Sequencing
Introduction: Cell-free DNA (cfDNA) is DNA that circulates freely in body fluid samples and has several (potential) clinical applications. The most frequently used source for cfDNA is the blood plasma. Every person has cfDNA in his plasma, but in healthy persons the concentration is generally low. Interestingly, cancer patients have increased levels of cfDNA and part of this cfDNA originates from the tumor. The cfDNA part that originates from a tumor is called circulating tumor DNA (ctDNA). The presence of ctDNA makes it possible to detect molecular alterations specific for the tumor in the cfDNA, as an alternative for tumor tissue. In addition, the fraction of ctDNA in cfDNA (i.e. tumor fraction) has been shown to be useful as a prognostic and follow-up biomarker in several cancer types. The current methods to detect tumor fraction have several shortcomings that limit feasibility. Alternative approaches are thus required and a possible alternative approach could be the use of ultra low-pass whole-genome sequencing (ULPWGS) of cfDNA, which can be used to determine the copy number alteration (CNA) profile and the tumor fraction.
Aim: Our aim was to develop a pipeline which allows the analysis of ULPWGS data of cfDNA from cancer patients and provides a CNA profile and tumor fraction for every analyzed sample.
Pipeline: To create the analysis pipeline we used Snakemake1, a workflow manager. The pipeline is summarized in Figure 1 and will be further explained here. The pipeline can be executed locally, but for improved performance the pipeline can also submit jobs to a server (e.g. cluster, grid or cloud system). The raw data is created by ULPWGS of cfDNA on the NextSeq500 using 75 bp single-end sequencing. Samples are analyzed on four lanes (L001-L002-L003-L004), so for every sample we start with four fastq files. The first step in the analysis is mapping the reads in the fastq files, for which we use the Burrows Wheeler Aligner (BWA)-MEM algorithm, and results in four separate BAM files. Then, the four different BAM files for one sample are merged using Picard Merge. Using Picard SortSam, the merged BAM file is sorted to prepare it for deduplication. Marking and removal of duplicate reads is then performed by Picard MarkDuplicates. The BAM file is sorted via Samtools Sort to create the final BAM file. We have used two tools to generate CNA profiles, namely the Bioconductor package QDNAseq2 and the R-based tool ichorCNA³. The final BAM file could be used immediately for running QDNAseq, which is implemented as an R script. For ichorCNA, a few additional steps are required. First, an index file for the final BAM file (BAI) should be generated and then, readCounter needs to be run to generate the WIG input file for ichorCNA. readCounter counts the number of reads in every predetermined bin (interval) and writes this in the WIG file. ichorCNA is then used to generate the CNA pattern and in addition, it also estimates the tumor fraction. To summarize the QDNAseq and ichorCNA results, the pipeline runs a Python script that generates interactive html pages using php. A page per patient is generated as multiple serial samples can be analyzed per patient and then it is nice to group the results for the different samples and visualize the evolution of tumor fraction over time. Quality control is performed on the raw reads using FastQC and we also visualize the evolution of the number of reads as an additional quality control metric for the different steps.
Conclusion: We were able to develop a pipeline for analysis of ULPWGS data of cfDNA samples from cancer patients. The pipeline summarizes the results per patient in interactive html pages, which include CNA patterns and tumor fractions of each sample. As the pipeline is developed using a workflow manager, it is relatively straightforward to implement additional analyses or tools, if desired, by adding new rules.
1. Köster, J. and Rahmann, S. 2012. Bioinformatics 28(19):2520-2.
2. Scheinin, I., et al. 2014. Genome Research 24:2022–32.
3. Adalsteinsson, V.A., et al. 2017. Nature Communications 8(1):1324.
Abstract 2 2019-2020: Evaluation of structural variation detection tools for whole genome sequencing pipeline construction
Structural variants (SV) or copy number variations (CNV) in the human genome may impact gene functions and cause a wide range of human diseases. SV detection tools are aimed at detecting these variations from whole genome sequencing (WGS) data. One algorithm (or tool) alone can’t detect all SVs and SV types such as copy number variations, deletions, duplications, inversions, translocations etc. A WGS pipeline containing an array of SV detection tools tries to achieve a higher accuracy in calling SVs for each SV type.
69 SV detection algorithms were paired and ranked using the F-measure, which takes into account the precision and recall, based on the analysis of literature data. Twelve top performing tools were selected and nine were implemented successfully using Singularity container images starting from existing Docker images or Github repositories. Using Singularity images is advantageous because they contain all the dependencies needed by the tools. This eliminates the need to install and set up all the dependencies on the server and resolves any clashing when different SV detection tools use different dependency versions, etc. Bash was used to construct scripts to run the tools on sample data in a reproducible manner and to build up the WGS pipeline. The output consisted of VCF and BCF files. Processing of these files was performed using SURVIVOR (in Python) and vcfR (in R). A descriptive analysis was employed to compare and evaluate tool performances. One important component part of the analysis to highlight is the number of overlapping calls shared between the tools. Combining SV detection tools which are good at calling the same SV types improves the accuracy. Visual representations of the analysis and of the filtered VCF output were made using Circlize, VennDiagram and related packages in R.
The final product is a WGS pipeline for calling structural variants such as deletions, duplications and inversions with tools like DELLY, GRIDSS, Manta, TIDDIT, etc.
Keywords: Bash, Circos, Docker, R, Singularity, Structural variation, SURVIVOR, vcfR whole genome sequencing
Abstract 2018-2019 (1): FUNCTIONAL PRIORITIZATION OF NON-CODING VARIANTS IN WHOLE GENOME SEQUENCING BASED ON PUBLIC DATABASES
Brugada syndrome is a genetic disorder in which the electrical activity of the heart is abnormal. It increases the risk of abnormal heart rhythms and sudden cardiac death. The most commonly involved gene is the SCNA5 gene, which encodes the cardiac sodium channel. However, genetic mutations in other (unknown) genes can be associated with Brugada syndrome. (1) In order to investigate the importance of these other genes, it may be valuable to look at all the interesting regions, including non-coding regions. Of four families, one or more members’ genome was sequenced and analysed by whole genome sequencing (WGS). The resulting non-coding variant files (hereby referred to as ‘WGS-samples’) were the start-point of the traineeship. In order to retrieve all kinds of important information on the variants, a summary of available and informative annotation sources were explored and presented. Data aggregation started by downloading different online databases of the human genome (hg19). Three databases were obtained from the ENCODE project; a transcription-factor binding-site database (TFBS), a Dnase clusters database and a genome segmentation database. Information from different human cell-lines was obtained from the Enhancer Atlas database. Lastly, candidate cis-regulatory elements (ccRE) information was retrieved from the SCREEN databases. On every obtained database-file, database manipulations were performed and relevant fields/columns were parsed to exclude the useful and interesting information, using R-scripts, a python-script and the linux command-line. To retrieve the useful information on the WGS-samples, ANNOVAR was used to annotate these files against the different database-files. The ANNOVAR-software is a tool to perform fast and easy variant annotations. (2) The tool uses variant call format (VCF) files as input, therefor the WGS-samples had to be converted to the correct format using an R-script. The database-files had to be converted as well, into the correct annovar-format, using the linux command-line. Next, the WGS-samples in VCF format were converted to the .avinput format so they could be correctly annotated. The subsequent variant annotation was region-based and generated output annotation-files for each sample against every database. Analogously, two predictive variants scoring tools were downloaded and installed. These tools, Genome Wide Annotation of Variants (GWAVA) and Regulatory Single nucleotide Variant Predictor (RSVP), are applicable to variants outside coding regions. (3,4) They predict the effect of a coding variant on a protein function. Trials were performed with testfiles to run the software, resolve errors and adapt where necessary. A third variant scoring tool, Combined Annotation Dependent Depletion (CADD), was already installed on the server and ready-to-use. The WGS-files were again converted to obtain a softwarecompatible format. Next, these compatible formats were assigned a score by running them against every scoring tool. 2 As a final step, the information obtained by the process above had to be represented in an interactive, well-defined way. First, an attempt was made to create heatmaps to present the information, but this was not ideal because of the amount of data. Next, it was decided that the visualization and interpretation of the results would be performed via shiny app. In this way, an interactive representation of the acquired data was possible by merging all the data-files into one summary. The variant scores were used as a base for the summary. The app was created in a way that only the variants that had a score above a specific threshold for the three scoring-predictors were displayed in relationship with the different databases. This results in an efficient manner to exclude only significant variants.
1. Antzelevitch C, Brugada P, Borggrefe M, Brugada J, Brugada R, Corrado D, et al. Brugada Syndrome: Report of the Second Consensus Conference. Circulation [Internet]. 2005 Feb 8 [cited 2019 Jun 14];111(5):659–70. Available from: http://www.ncbi.nlm.nih.gov/pubmed/15655131 2. Yang H, Wang K. Genomic variant annotation and prioritization with ANNOVAR and wANNOVAR. Nat Protoc [Internet]. 2015 Oct 17 [cited 2019 Jun 13];10(10):1556– 66. Available from: http://www.ncbi.nlm.nih.gov/pubmed/26379229 3. Ritchie GRS, Dunham I, Zeggini E, Flicek P. Functional annotation of noncoding sequence variants. Nat Methods [Internet]. 2014 Mar 2 [cited 2019 Jun 13];11(3):294–6. Available from: http://www.ncbi.nlm.nih.gov/pubmed/24487584 4. Peterson TA, Mort M, Cooper DN, Radivojac P, Kann MG, Mooney SD. Regulatory Single-Nucleotide Variant Predictor Increases Predictive Performance of Functional Regulatory Variants. Hum Mutat [Internet]. 2016 [cited 2019 Jun 13];37(11):1137–43. Available from: http://www.ncbi.nlm.nih.gov/pubmed/27406314
Abstract 2018-2019 (2): Detection of structural variation in whole exome sequencing
Multiple robust methods exist to determine structural variation in Microarray (MA) data. Sadly, this is not yet the case for Whole Exome Sequencing (WES) data. To determine an optimal approach to detect structural variation, multiple CNV calling programs were examined and compared to find which program was most useful and user-friendly to utilize. For a set of ± 500 WES samples, gold-standard structural variation data was extracted from a MYSQL database on MA data, and then reformatted using an R script. The CNV calling programs examined during this traineeship consisted of ExomeCNV, CoNIFER, CNVkit, ExomeDepth and HMZDelFinder. Due to time constraints, only CoNIFER was used on the whole WES data set, while the other tools were limited to proof-of-concept implementation on a small test data set. The presentation will therefore mostly be covering the usage of the CoNIFER program. Since the programs all needed different input files, much pre-processing had to be done on the data first.
The raw data was obtained as unsorted and unfiltered BAM files, so the first step was to use SAMtools to filter and sort the raw BAM files. Next, the sorted BAMs had to be reordered to match the reference genome and bed file to work with certain tools such as GATK. This was done using the Picard tool with its ReorderSam function. After the BAMs were reordered, the creation of different derivative files could start: The GATK DepthOfCoverage function calculated the coverage for each sample, which was needed for ExomeCNV. Atlas2 was used to create VCF files from the WES data, which was needed to use HMZDelFinder. A multitude of bash scripts were written to run these pre-processing programs on the data (both locally and on the shared server). Raw BAM Data Sorted/Filtered BAM Data Reordered BAM Data Coverage files VCF files SAMtools Picard GATK ATLAS 2 Figure 1: Visualisation of the pre-processing Once all pre-processing was done (which was first attempted on the test data, then on the whole WES data set), the actual programs could be tested. Again, the CNV callers were first tested on the small test data set as a proof-of-concept and if the test was successful, the programs could be adjusted to work on the server on the whole data set. Since CoNIFER was the only tool that was run on the whole data set, only this caller was further discussed. CoNIFER consists of multiple python programs which can be activated from the command line. Figure 2 shows the pipeline of this CNV caller. First, the RPKM counts of each sample are calculated. These counts are then used to normalize the samples by using singular value decomposition (SVD). This normalization is done by the ‘analyze’ function. Once the samples are normalized, the actual CNV’s can be called using the ‘call’ function. A large call-file consisting of all calls for each sample will be made. This call file can finally be used to either export the CNVs to create files which collects the call per sample, or to plot the call in a graphic visualisation. Lastly, the obtained CNVs from the project’s WES data were processed using R and Rshiny to visualise the statistics (such as accuracy and precision) of the CNV calling tool, compared to the results obtained from microarray data from the same samples. These statistics were used to determine if CoNIFER is able to reliably call CNVs on WES data. In conclusion, we created the basis of a framework to evaluate WES-CNV callers on a large in-house validation dataset, using an interactive Rshiny interface. Extension of our work to additional methods is straightforward.
J. Fah Sathirapongsasuti, H. L. (2012). Package ‘ExomeCNV’. Opgehaald van CRAN: http://www2.uaem.mx/r-mirror/web/packages/ExomeCNV/ExomeCNV.pdf Niklas Krumm, P. H. (2012). Copy number variation detection and genotyping from exome sequence data. Genome Res. Plagnol, V. (2016, May 15). package ExomeDepth. Opgehaald van CRAN: https://cran.rproject.org/web/packages/ExomeDepth/ExomeDepth.pdf Rstudio. (2017). tutorial. Opgehaald van shiny rstudio: https://shiny.rstudio.com/tutorial/ Talevich, E. (2018). CNVkit Documentation. Opgehaald van ReadTheDocs: https://buildmedia.readthedocs.org/media/pdf/cnvkit/latest/cnvkit.pdf
Abstract traineeship (advanced bachelor of bioinformatics) 2017-2018 1: Creating reusable resources for analysing and interpreting DNA methylation data in the context of cancer research
DNA methylation is an epigenetic mechanism in which a methyl (CH3) group is added to DNA. The most documented methylation process is the methylation of the 5’ carbon group of Cytosine. DNA methylation induces changes in transcriptional regulation, mainly through promoter hypermethylation. In the context of cancer, DNA methylation can play different roles; hypermethylation of tumor suppressor genes, represses transcription, while hypomethylation of oncogenes promotes transcription, in both cases resulting in cancer propagation. Modern developments in next-generation sequencing and microarray technologies, have made it possible to study DNA methylation genome-wide over a large sample cohort. With such new methods however, substantial challenges arise regarding processing, analysis and interpretation. To that extent, several statistical tools have been created, but these lack ease of use and serial automation. The aim of this project was to create a reusable pipeline in which methylation data can be analyzed, using both pre-existing tool and in-house methods, ultimately resulting in a tactile and streamlined process. To achieve this goal, a selection of functions from the ChAMP package in R were used in combination with novel functions to acquire, preprocess and analyze the data. Functions for graphical illustration were also added to the conduit making it a “one-click” tool for routinely performed DNA methylation tasks.
Abstract traineeship (advanced bachelor of bioinformatics) 2017-2018 2: Integration of public and private genetic data in a collaborative online visual platform
The goal of the internship is to make a usable web page to interact with a database containing experimental data from routine NIPT analysis. NIPT or non- invasive prenatal testing is a test performed on pregnant women. The presence of circulating cell-free fetal DNA in the maternal plasma of the pregnant woman, in combination with recent advances in next generation sequencing (NGS) technologies, has made NIPT of fetal aneuploidy or copy number variation (CNV) a reality. NGS for aneuploidy detection applies counting statistics to millions of sequencing reads to identify subtle changes in the small percentage of fetal DNA present in the total cell-free DNA isolated from maternal plasma. An increase or decrease in the number of normalized sequencing reads, typically converted to a ‘z-score’, is indicative of aneuploidy for the respective chromosome or copy number variant (CNV) in a smaller region.
Abstract traineeship (advanced bachelor of bioinformatics) 2016-2017: Detection of patient-specific (sub)clonal variants by targeted resequencing
Many biological processes do follow a Darwinian evolutionary process. Cancer cell proliferation is not an exception. Those that are best adapted to environment will survive or divide faster than other tumor cells. During cancer cell division, multiple daughter cancer cells arise from one ancestor cell, all with their own additional, so-called somatic aberrations. These aberrations might be copy-number variations (CNVs) or nucleotide variations, eg. single nucleotide variations (SNVs) or insertion-deletions (indels). Cells that originate from the ancestor cell and share the same somatic mutations are called clonal cells. Cells that come from these clonal cells are subclonal cells. All cells from the tumor are phylogenetical related to each other. These somatic mutations can be used to draw an phylogenetical tree (undermentioned figure) visualizing the tumor-specific evolution.
Using targeted next-generation sequencing (e.g. Agilent Haloplex technique) we could analyze genomic regions of interest in multiple samples in a cost-effective manner. The samples were taken on the same moment from different spots of the tumor and metastases. Ultra-deep sequencing of samples allows detection of low-frequent subclonal mutations. In this study includes 66 colon adenocarcinoma associated genes. The technique is effective to detect allele frequencies in a range of 1% to 100%. The detection limit from 1% is necessary for these subclonal variants. Based on existing tools like PyClone we can analyze the inference of a clonal population.
PyClone takes as input the allelic count data as well the copy number information and uses a hierarchical Bayes statistical model to infer the cellular prevalence for each mutation. Every sample is supposed to be a mixture of several cellular populations, so PyClone can calculate the cellular frequency per mutation.
Prins Boudewijnlaan 43