Erasmus MC (Rotterdam)
Abstract 1 2019-2020: Implementing RNA sequencing to stratify cardiovascular disorders in diagnostics using alternative and aberrant splicing, a bioinformatics perspective
The total number of individuals who suffer from relatively rare diseases is high, and most of these diseases are caused by a pathogenic variant in a single gene. The current molecular diagnostic technique, Whole-Exome Sequencing (WES), resolves around 30-40% of these cases by identifying the clinically relevant DNA variant. More often, the DNA variants found are of uncertain clinical significance (VUS). In addition, most clinical cases cannot be properly diagnosed, either because the genetic aberration involved is not yet associated with the disease or because the aberration cannot be detected by WES. Such aberrations include deep intronic variants, structural variants and repeat expansions. However, these variants may lead to alternative splicing, to over- or underexpression, or to loss of expression of genes or isoforms. RNA sequencing (RNAseq) measures the expression levels of transcripts in tissues and is therefore a useful additional technique for detecting these variants.
The Erasmus MC Clinical Genetics department has an urgent need for a robust, high-quality workflow for analyzing RNAseq data to assess aberrant gene expression and stratify patients in its regular diagnostics. Recently, an initial in-house pipeline was developed for the analysis of the RNAseq data. During my traineeship, a quality control workflow was developed on top of this in-house pipeline, and the workflow was compared to a publicly available RNAseq workflow built with Nextflow. Furthermore, a differential gene expression analysis was performed on whole-transcriptome data to distinguish cardiovascular patients from healthy individuals.
The developed quality control workflow produces a QC table and plots of the mapped data. The numbers of total, duplicate and mapped reads (from both HISAT2 and Kallisto mapping) are calculated with samtools flagstat. To assess RNA degradation of the samples, the inner distance of the paired reads was obtained with RSeQC InnerDistance. The table was extended with the sex of each sample as inferred from the RNAseq data (to check for possible sample swaps). This was calculated from the normalized counts of a panel of chrX and chrY genes. The quality report revealed that samples isolated with PreAnalytiX (PAX) kits are not preferred for RNA sequencing.
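As an illustration of the chrX/chrY check described above, here is a minimal Python sketch. It is not the code used in the workflow: the marker genes (XIST on chrX; RPS4Y1, DDX3Y and UTY on chrY), the pseudocount, the log2 threshold and the function names are all assumptions chosen for this example, since the abstract does not specify the gene panel or the decision rule.

```python
import math
from statistics import mean

# Hypothetical marker panels (assumption): the abstract does not list the genes.
CHRX_GENES = ("XIST",)
CHRY_GENES = ("RPS4Y1", "DDX3Y", "UTY")


def infer_sex(norm_counts, pseudocount=1.0, min_log2_ratio=1.0):
    """Infer sex from normalized counts of chrX and chrY marker genes.

    norm_counts maps gene symbol -> normalized count; missing genes count as 0.
    The pseudocount and threshold are illustrative values, not validated ones.
    """
    x_level = mean(norm_counts.get(gene, 0.0) for gene in CHRX_GENES)
    y_level = mean(norm_counts.get(gene, 0.0) for gene in CHRY_GENES)
    log2_ratio = math.log2((y_level + pseudocount) / (x_level + pseudocount))
    if log2_ratio >= min_log2_ratio:
        return "male"
    if log2_ratio <= -min_log2_ratio:
        return "female"
    return "ambiguous"


def flag_mismatches(counts_by_sample, recorded_sex):
    """Return {sample_id: inferred_sex} for samples that disagree with the records."""
    flagged = {}
    for sample_id, counts in counts_by_sample.items():
        inferred = infer_sex(counts)
        if inferred != recorded_sex.get(sample_id):
            flagged[sample_id] = inferred
    return flagged
```

A sample flagged by `flag_mismatches` is only a candidate for a sample swap; an "ambiguous" call can also point to low-quality data and would need manual review.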
An external automated Nextflow workflow was also used (https://nf-co.re/rnaseq). This is a publicly available bioinformatics analysis pipeline that runs in a Docker container, with the advantages of up-to-date software and highly reproducible results. Comparison of mapping with HISAT2 (with the same references as in the current in-house pipeline) and STAR revealed that HISAT2 is faster and needs less memory than STAR, but yields fewer reads of exonic origin, fewer uniquely mapped reads and more unmapped reads. Therefore, further data analysis was performed on the STAR results.
Finally, differential gene expression analysis was performed on 19 samples, 5 of which came from patients with a cardiovascular disorder. The DESeq2 method, implemented in R, was used to compare cardiovascular samples with controls; the same approach can be applied to other sample groups. This resulted in 57 genes that were statistically significantly up- or downregulated. These genes are probably biomarkers (their expression changed due to a mutation in a gene upstream in the same pathway) and can be used as an indication of cardiovascular disease to classify patients with a doubtful indication. A gene enrichment study with DAVID was performed on these significant genes to find common biological properties. One of the results revealed that 22 of these genes belong to the GAD_disease_class “Cardiovascular”. A literature search on those 22 genes with the Agilent Literature Search in Cytoscape produced a network showing that many of these genes are related to each other according to the literature. In the future, further pathway studies have to be performed on these genes.
Detection of aberrant gene expression of one sample within a group of samples was performed with OUTRIDER, but no significant results were obtained, probably because of the small sample size. In the future this analysis has to be repeated with a larger dataset.
In summary, samples isolated with PAX kits are not recommended for RNA sequencing, the Nextflow pipeline with the STAR mapper is recommended for future experiments, and DESeq2 can be used to explore gene expression differences between sample groups in order to find biomarkers for a disorder and to classify patients with an uncertain disorder.
Abstract 2 2019-2020: CREATING A DATABASE FOR LOOKING UP VARIANTS FITTING IN A CLINICAL PICTURE
Since the introduction of next-generation sequencing (NGS), new difficulties have arisen in interpreting variants in large datasets. At our department, NGS is applied to DNA samples to find sequence variants with clinical effects, which are usually located in exons. In exome sequencing, 100,000-150,000 variants can be reported. While public databases already provide insight into variants in worldwide populations, the purpose of this project was to introduce a secure local database for looking up variant frequencies in a local population, in this case Rotterdam and its surroundings. The constructed system makes it possible to look up genotype information by gene symbol. In addition, an efficient feature was created to look up patients with the same gene mutation. Variant call format (VCF) files and gene transfer format (GTF) files are combined to gather the information required to investigate genetic variation. A Python script parses these data and sends them to the developed MySQL database. The application around the database is built with Electron, a framework for building desktop applications with web technologies. The graphical user interface (GUI) shows two tables, a features table and a variants table. Furthermore, the application enables the end user to retrieve data by typing in a gene symbol or transcript ID, or to search for a gene on a chromosome by start and end positions. To eliminate redundant data, an efficient feature was made to extract only “transcripts” or “exons”. The application uses an SQL query to return the requested results. This application provides an efficient and straightforward method to extract genetic variants located in exons in a local population.
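The lookup described in this abstract can be sketched roughly as follows. This is an illustrative Python example only, not the project's implementation: it uses the standard-library sqlite3 module as a stand-in for MySQL so that it runs without a server, and the table layout, column names, function names and demo rows (gene GENE1, samples P1-P3) are all made up for the example.

```python
import sqlite3

# Hypothetical schema (assumption): not the project's actual MySQL layout.
SCHEMA = """
CREATE TABLE features (
    gene_symbol TEXT, transcript_id TEXT, feature_type TEXT,
    chrom TEXT, start_pos INTEGER, end_pos INTEGER
);
CREATE TABLE variants (
    sample_id TEXT, chrom TEXT, pos INTEGER,
    ref TEXT, alt TEXT, genotype TEXT
);
"""


def build_demo_db():
    """Create an in-memory database filled with made-up demo rows."""
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA)
    con.executemany("INSERT INTO features VALUES (?, ?, ?, ?, ?, ?)", [
        ("GENE1", "TX1", "transcript", "chr1", 100, 500),
        ("GENE1", "TX1", "exon", "chr1", 100, 200),
        ("GENE1", "TX2", "exon", "chr1", 150, 250),  # overlaps the TX1 exon
        ("GENE1", "TX1", "exon", "chr1", 400, 500),
    ])
    con.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?, ?)", [
        ("P1", "chr1", 180, "A", "G", "0/1"),
        ("P2", "chr1", 180, "A", "G", "1/1"),
        ("P1", "chr1", 300, "C", "T", "0/1"),  # intronic position
        ("P2", "chr1", 450, "G", "A", "0/1"),
        ("P3", "chr2", 180, "A", "G", "0/1"),  # other chromosome
    ])
    return con


def variants_in_gene(con, gene_symbol, feature_type="exon"):
    """Return variants that fall inside features of one type for a gene.

    DISTINCT prevents a variant from being reported once per overlapping
    feature (for example, exons shared between transcripts).
    """
    query = """
        SELECT DISTINCT v.sample_id, v.chrom, v.pos, v.ref, v.alt, v.genotype
        FROM variants AS v
        JOIN features AS f
          ON v.chrom = f.chrom AND v.pos BETWEEN f.start_pos AND f.end_pos
        WHERE f.gene_symbol = ? AND f.feature_type = ?
        ORDER BY v.pos, v.sample_id
    """
    return con.execute(query, (gene_symbol, feature_type)).fetchall()


def samples_with_variant(con, chrom, pos, ref, alt):
    """Return the IDs of all samples that carry the given variant."""
    query = """
        SELECT DISTINCT sample_id FROM variants
        WHERE chrom = ? AND pos = ? AND ref = ? AND alt = ?
        ORDER BY sample_id
    """
    return [row[0] for row in con.execute(query, (chrom, pos, ref, alt))]
```

Calling `variants_in_gene(con, "GENE1")` on the demo database returns only the exonic variants, while passing `feature_type="transcript"` also includes the intronic one. The parameterized queries (`?` placeholders) keep user-typed gene symbols out of the SQL text; a MySQL driver would typically use a different placeholder style.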
Dr. Molewaterplein 40
Walter de Valk
010 704 45 34
Harmen van de Werken