Center of Medical Genetics Edegem
Abstract 2018-2019 (1): FUNCTIONAL PRIORITIZATION OF NON-CODING VARIANTS IN WHOLE GENOME SEQUENCING BASED ON PUBLIC DATABASES
Brugada syndrome is a genetic disorder in which the electrical activity of the heart is abnormal. It increases the risk of abnormal heart rhythms and sudden cardiac death. The most commonly involved gene is SCN5A, which encodes the cardiac sodium channel. However, mutations in other (unknown) genes can also be associated with Brugada syndrome. (1) To investigate the importance of these other genes, it may be valuable to look at all regions of interest, including non-coding regions. In four families, the genome of one or more members was analysed by whole genome sequencing (WGS). The resulting non-coding variant files (hereafter referred to as ‘WGS-samples’) were the starting point of the traineeship. To retrieve relevant information on the variants, the available and informative annotation sources were explored and summarised. Data aggregation started by downloading different online databases for the human genome (hg19). Three databases were obtained from the ENCODE project: a transcription-factor binding-site (TFBS) database, a DNase clusters database and a genome segmentation database. Information from different human cell lines was obtained from the Enhancer Atlas database. Lastly, candidate cis-regulatory element (cCRE) information was retrieved from the SCREEN databases. Each downloaded database file was processed, and the relevant fields/columns were parsed to extract the useful information, using R scripts, a Python script and the Linux command line. To retrieve the useful information on the WGS-samples, ANNOVAR was used to annotate these files against the different database files. ANNOVAR is a tool for fast and easy variant annotation. (2) The tool accepts variant call format (VCF) files as input; therefore, the WGS-samples had to be converted to the correct format using an R script.
The database files had to be converted as well, into the correct ANNOVAR format, using the Linux command line. Next, the WGS-samples in VCF format were converted to the .avinput format so they could be correctly annotated. The subsequent variant annotation was region-based and generated an output annotation file for each sample against every database. Similarly, two predictive variant scoring tools were downloaded and installed. These tools, Genome Wide Annotation of Variants (GWAVA) and Regulatory Single nucleotide Variant Predictor (RSVP), are applicable to variants outside coding regions. (3,4) They predict the functional impact of a non-coding variant. Trials were performed with test files to run the software, resolve errors and adapt where necessary. A third variant scoring tool, Combined Annotation Dependent Depletion (CADD), was already installed on the server and ready to use. The WGS-files were again converted to a software-compatible format. Next, the variants in these files were scored with each scoring tool. As a final step, the information obtained by the process above had to be represented in an interactive, well-defined way. First, an attempt was made to present the information as heatmaps, but this was not ideal because of the amount of data. Next, it was decided that the visualization and interpretation of the results would be performed via a Shiny app. In this way, an interactive representation of the acquired data was possible by merging all the data files into one summary. The variant scores were used as the basis for the summary. The app was built so that only the variants that scored above a specific threshold for all three scoring tools were displayed, together with their annotations from the different databases. This provides an efficient way to retain only the significant variants.
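The region-based annotation and the threshold filter described above can be sketched as follows. This is a minimal illustration only: the coordinates, region labels, scores and cut-off values are all hypothetical, and the code stands in for the ANNOVAR and Shiny steps rather than reproducing them.

```python
# Hypothetical regions per chromosome: (start, end, label), 0-based, half-open.
regions = {
    "chr3": [(100, 200, "TFBS:CTCF"), (500, 650, "DNase_cluster")],
}

# Hypothetical variants with one score per scoring tool.
variants = [
    {"chrom": "chr3", "pos": 150, "gwava": 0.62, "rsvp": 0.71, "cadd": 14.2},
    {"chrom": "chr3", "pos": 300, "gwava": 0.80, "rsvp": 0.90, "cadd": 20.0},
    {"chrom": "chr3", "pos": 510, "gwava": 0.30, "rsvp": 0.75, "cadd": 16.0},
]

# Illustrative cut-offs; the thresholds used in the app are not stated.
thresholds = {"gwava": 0.5, "rsvp": 0.5, "cadd": 10.0}


def annotate(variant, regions_by_chrom):
    """Return the labels of all regions that contain the variant position."""
    hits = []
    for start, end, label in regions_by_chrom.get(variant["chrom"], []):
        if start <= variant["pos"] < end:
            hits.append(label)
    return hits


def passes_all(variant, cutoffs):
    """True only if the variant exceeds the cut-off of every scoring tool."""
    return all(variant[tool] > cutoff for tool, cutoff in cutoffs.items())


# Keep the variants that pass all three cut-offs and attach their region labels.
summary = [
    {**variant, "regions": annotate(variant, regions)}
    for variant in variants
    if passes_all(variant, thresholds)
]

for row in summary:
    print(row["chrom"], row["pos"], row["regions"])
```

Requiring all three scores to pass is the strictest reading of the filter; a looser rule (for example, two out of three) would only change `passes_all`.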
1. Antzelevitch C, Brugada P, Borggrefe M, Brugada J, Brugada R, Corrado D, et al. Brugada Syndrome: Report of the Second Consensus Conference. Circulation [Internet]. 2005 Feb 8 [cited 2019 Jun 14];111(5):659–70. Available from: http://www.ncbi.nlm.nih.gov/pubmed/15655131
2. Yang H, Wang K. Genomic variant annotation and prioritization with ANNOVAR and wANNOVAR. Nat Protoc [Internet]. 2015 Oct 17 [cited 2019 Jun 13];10(10):1556–66. Available from: http://www.ncbi.nlm.nih.gov/pubmed/26379229
3. Ritchie GRS, Dunham I, Zeggini E, Flicek P. Functional annotation of noncoding sequence variants. Nat Methods [Internet]. 2014 Mar 2 [cited 2019 Jun 13];11(3):294–6. Available from: http://www.ncbi.nlm.nih.gov/pubmed/24487584
4. Peterson TA, Mort M, Cooper DN, Radivojac P, Kann MG, Mooney SD. Regulatory Single-Nucleotide Variant Predictor Increases Predictive Performance of Functional Regulatory Variants. Hum Mutat [Internet]. 2016 [cited 2019 Jun 13];37(11):1137–43. Available from: http://www.ncbi.nlm.nih.gov/pubmed/27406314
Abstract 2018-2019 (2): Detection of structural variation in whole exome sequencing
Multiple robust methods exist to determine structural variation in microarray (MA) data. This is not yet the case for whole exome sequencing (WES) data. To determine an optimal approach for detecting structural variation, multiple CNV calling programs were examined and compared to find which was the most useful and user-friendly. For a set of approximately 500 WES samples, gold-standard structural variation data was extracted from a MySQL database of MA data and then reformatted using an R script. The CNV calling programs examined during this traineeship were ExomeCNV, CoNIFER, CNVkit, ExomeDepth and HMZDelFinder. Due to time constraints, only CoNIFER was used on the whole WES data set, while the other tools were limited to a proof-of-concept implementation on a small test data set. The presentation will therefore mostly cover the usage of the CoNIFER program. Since the programs all needed different input files, the data first required extensive pre-processing.
The raw data was obtained as unsorted and unfiltered BAM files, so the first step was to filter and sort the raw BAM files with SAMtools. Next, the sorted BAMs had to be reordered to match the reference genome and BED file in order to work with certain tools such as GATK. This was done using the Picard ReorderSam function. After the BAMs were reordered, the different derivative files could be created: the GATK DepthOfCoverage function calculated the coverage for each sample, which was needed for ExomeCNV, and Atlas2 was used to create VCF files from the WES data, which were needed for HMZDelFinder. A number of bash scripts were written to run these pre-processing programs on the data (both locally and on the shared server).
[Figure 1: Visualisation of the pre-processing. Raw BAM data -> (SAMtools) -> sorted/filtered BAM data -> (Picard) -> reordered BAM data -> (GATK) -> coverage files; reordered BAM data -> (Atlas2) -> VCF files.]
Once all pre-processing was done (first on the test data, then on the whole WES data set), the actual programs could be tested. Again, the CNV callers were first tested on the small test data set as a proof of concept; if the test was successful, the programs could be adjusted to run on the server on the whole data set. Since CoNIFER was the only tool that was run on the whole data set, only this caller is discussed further. CoNIFER consists of multiple Python programs which can be run from the command line. Figure 2 shows the pipeline of this CNV caller. First, the RPKM counts of each sample are calculated. These counts are then used to normalize the samples by singular value decomposition (SVD). This normalization is done by the ‘analyze’ function. Once the samples are normalized, the actual CNVs can be called using the ‘call’ function. This produces one large call file containing all calls for each sample.
This call file can finally be used either to export the CNVs to files that collect the calls per sample, or to plot the calls in a graphical visualisation. Lastly, the CNVs obtained from the project’s WES data were processed using R and Shiny to visualise the statistics (such as accuracy and precision) of the CNV calling tool, compared to the results obtained from microarray data from the same samples. These statistics were used to determine whether CoNIFER is able to reliably call CNVs on WES data. In conclusion, we created the basis of a framework to evaluate WES-CNV callers on a large in-house validation dataset, using an interactive Shiny interface. Extension of our work to additional methods is straightforward.
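The RPKM and SVD normalisation steps described above can be sketched with NumPy as follows. This is a simplified illustration, not CoNIFER's own code: the read counts are simulated, the total mapped reads are approximated by the sum over the toy exons, and the number of removed components (`n_remove`) is an arbitrary choice.

```python
import numpy as np


def rpkm(read_counts, exon_lengths_bp, total_mapped_reads):
    """Reads per kilobase of exon per million mapped reads."""
    counts = np.asarray(read_counts, dtype=float)
    lengths = np.asarray(exon_lengths_bp, dtype=float)
    return counts * 1e9 / (lengths * total_mapped_reads)


def svd_normalize(matrix, n_remove):
    """Zero the n_remove largest singular values and rebuild the matrix.

    The strongest components are assumed to capture systematic (batch)
    effects shared across samples rather than true copy number changes.
    """
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    s = s.copy()
    s[:n_remove] = 0.0
    return (u * s) @ vt


# Simulated data: 6 hypothetical samples x 8 exons of 150 bp each.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=200, size=(6, 8))
lengths = np.full(8, 150)
totals = counts.sum(axis=1)  # stand-in for the total mapped reads per sample

rpkm_matrix = np.vstack([rpkm(c, lengths, t) for c, t in zip(counts, totals)])

# Standardise each exon across samples before the decomposition.
z = (rpkm_matrix - rpkm_matrix.mean(axis=0)) / rpkm_matrix.std(axis=0)
normalized = svd_normalize(z, n_remove=1)
```

Exons whose normalised values deviate strongly from zero in a given sample are then the candidates for a CNV call; the cut-off for such a call is a separate parameter that is not shown here.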
J. Fah Sathirapongsasuti, H. L. (2012). Package ‘ExomeCNV’. Retrieved from CRAN: http://www2.uaem.mx/r-mirror/web/packages/ExomeCNV/ExomeCNV.pdf
Niklas Krumm, P. H. (2012). Copy number variation detection and genotyping from exome sequence data. Genome Res.
Plagnol, V. (2016, May 15). package ExomeDepth. Retrieved from CRAN: https://cran.rproject.org/web/packages/ExomeDepth/ExomeDepth.pdf
Rstudio. (2017). tutorial. Retrieved from shiny rstudio: https://shiny.rstudio.com/tutorial/
Talevich, E. (2018). CNVkit Documentation. Retrieved from ReadTheDocs: https://buildmedia.readthedocs.org/media/pdf/cnvkit/latest/cnvkit.pdf
Abstract traineeship (advanced bachelor of bioinformatics) 2017-2018 1: Creating reusable resources for analysing and interpreting DNA methylation data in the context of cancer research
DNA methylation is an epigenetic mechanism in which a methyl (CH3) group is added to DNA. The best documented methylation process is the methylation of the carbon-5 position of cytosine. DNA methylation induces changes in transcriptional regulation, mainly through promoter hypermethylation. In the context of cancer, DNA methylation can play different roles: hypermethylation of tumor suppressor genes represses transcription, while hypomethylation of oncogenes promotes transcription, in both cases resulting in cancer propagation. Modern developments in next-generation sequencing and microarray technologies have made it possible to study DNA methylation genome-wide over a large sample cohort. With such new methods, however, substantial challenges arise regarding processing, analysis and interpretation. To that end, several statistical tools have been created, but these lack ease of use and automation of consecutive steps. The aim of this project was to create a reusable pipeline in which methylation data can be analyzed, using both pre-existing tools and in-house methods, ultimately resulting in a practical and streamlined process. To achieve this goal, a selection of functions from the ChAMP package in R was used in combination with novel functions to acquire, preprocess and analyze the data. Functions for graphical illustration were also added to the pipeline, making it a “one-click” tool for routinely performed DNA methylation tasks.
Abstract traineeship (advanced bachelor of bioinformatics) 2017-2018 2: Integration of public and private genetic data in a collaborative online visual platform
The goal of the internship is to build a usable web page for interacting with a database containing experimental data from routine NIPT analysis. NIPT, or non-invasive prenatal testing, is a test performed on pregnant women. The presence of circulating cell-free fetal DNA in the maternal plasma, in combination with recent advances in next-generation sequencing (NGS) technologies, has made NIPT for fetal aneuploidy or copy number variation (CNV) a reality. NGS for aneuploidy detection applies counting statistics to millions of sequencing reads to identify subtle changes in the small percentage of fetal DNA present in the total cell-free DNA isolated from maternal plasma. An increase or decrease in the number of normalized sequencing reads, typically converted to a ‘z-score’, is indicative of aneuploidy for the respective chromosome, or of a CNV in a smaller region.
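The conversion of normalized read counts to a z-score mentioned above can be illustrated with a short sketch. The reference fractions, the sample fractions and the choice of chromosome 21 are hypothetical, and real NIPT pipelines apply additional corrections (for example for GC content) that are not shown here.

```python
from statistics import mean, stdev


def chromosome_z_score(sample_fraction, reference_fractions):
    """Z-score of a sample's read fraction against euploid reference samples."""
    mu = mean(reference_fractions)
    sigma = stdev(reference_fractions)
    return (sample_fraction - mu) / sigma


# Hypothetical fraction of all reads mapping to chromosome 21
# in a set of reference pregnancies assumed to be euploid.
reference_chr21 = [0.0130, 0.0131, 0.0129, 0.0130, 0.0132, 0.0128]

z_typical = chromosome_z_score(0.0130, reference_chr21)
z_elevated = chromosome_z_score(0.0138, reference_chr21)

print(f"typical sample:  z = {z_typical:.2f}")
print(f"elevated sample: z = {z_elevated:.2f}")
```

A sample whose z-score lies far from zero has more (or fewer) reads on that chromosome than expected; the cut-off at which this is reported is a laboratory decision and is not specified in the text.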
Abstract traineeship (advanced bachelor of bioinformatics) 2016-2017: Detection of patient-specific (sub)clonal variants by targeted resequencing
Many biological processes follow a Darwinian evolutionary process, and cancer cell proliferation is no exception. The tumor cells that are best adapted to their environment survive or divide faster than other tumor cells. During cancer cell division, multiple daughter cancer cells arise from one ancestor cell, each with its own additional, so-called somatic aberrations. These aberrations can be copy-number variations (CNVs) or nucleotide variations, e.g. single nucleotide variations (SNVs) or insertion-deletions (indels). Cells that originate from the ancestor cell and share the same somatic mutations are called clonal cells. Cells that descend from these clonal cells are subclonal cells. All cells of the tumor are phylogenetically related to each other. These somatic mutations can be used to draw a phylogenetic tree (figure below) visualizing the tumor-specific evolution.
Using targeted next-generation sequencing (e.g. the Agilent HaloPlex technique), we could analyze genomic regions of interest in multiple samples in a cost-effective manner. The samples were taken at the same time from different sites of the tumor and its metastases. Ultra-deep sequencing of the samples allows detection of low-frequency subclonal mutations. This study includes 66 colon adenocarcinoma-associated genes. The technique can detect allele frequencies in a range of 1% to 100%. The detection limit of 1% is necessary to detect these subclonal variants. Using existing tools such as PyClone, we can infer the clonal populations.
PyClone takes as input the allelic count data as well as the copy number information, and uses a hierarchical Bayesian statistical model to infer the cellular prevalence of each mutation. Every sample is assumed to be a mixture of several cellular populations, so PyClone can calculate the cellular prevalence per mutation.
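To make the relationship between allelic counts, copy number and cellular prevalence concrete, the sketch below computes a simple point estimate. It is not PyClone's hierarchical Bayesian model: it assumes a known tumour purity, a known number of mutated copies per cell (`multiplicity`) and a diploid normal genome, and all input values are hypothetical.

```python
def cellular_prevalence(alt_reads, total_reads, tumour_cn, purity,
                        multiplicity=1, normal_cn=2):
    """Point estimate of the fraction of cancer cells carrying a mutation.

    Derived from the expected variant allele frequency (VAF):
        VAF = purity * multiplicity * prevalence
              / (purity * tumour_cn + (1 - purity) * normal_cn)
    """
    vaf = alt_reads / total_reads
    # Average number of copies of the locus per cell in the mixed sample.
    avg_copies = purity * tumour_cn + (1 - purity) * normal_cn
    estimate = vaf * avg_copies / (purity * multiplicity)
    # Sampling noise can push the estimate above 1, so cap it.
    return min(estimate, 1.0)


# Hypothetical mutations in a copy-neutral region of a sample with 80% purity.
subclonal = cellular_prevalence(alt_reads=200, total_reads=1000,
                                tumour_cn=2, purity=0.8)
clonal = cellular_prevalence(alt_reads=400, total_reads=1000,
                             tumour_cn=2, purity=0.8)

print(f"subclonal mutation: {subclonal:.2f}")
print(f"clonal mutation:    {clonal:.2f}")
```

A full model such as PyClone additionally treats the multiplicity as unknown and clusters mutations with similar prevalence across samples, which this point estimate does not attempt.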
Prins Boudewijnlaan 43