Abstract advanced bachelor of bioinformatics 2019-2020: Development of a Cancer Immunity RNAseq Nextflow Pipeline
Cancer is the second leading cause of death and is responsible for 9.6 million deaths in 2018 worldwide. Multiple improvements in diagnostics and therapy options have been made during the past decades. One of the new ways to treat cancer is immune checkpoint blocking or immunotherapy with success in achieving long-term durable responses. However, major challenges still remain, including but not limited to the detection of mechanisms of intrinsic and acquired resistance to therapy with immune checkpoint blockers and the identification of predictive markers for response. RNA-seq is a recent transcriptome profiling technique that allows broad profiling of samples. The aim of this internship is to improve the current HistoGeneX RNA-seq Nextflow pipeline with Cancer Immunity packages. Three packages were implemented: arcasHLA, quanTIseq and MIXCR. ArcasHLA performs high resolution genotyping for HLA class I and class II genes from RNA sequencing. QuanTIseq quantifies the tumor immune contexture from human RNA-seq data. MiXCR is a tool used for fast and accurate analysis of T and B cell receptor repertoire sequence data.
Development of the pipeline was performed on a test server in the test environment, isolated from the production environment. To allow reproducibility and migration from the test to production environment, the bio-IT team at HistoGeneX creates their pipelines using the Nextflow language and uses containers based on BioConda and Docker. Tools can be, preferably, installed using conda environments and an environment.yml YAML file or by directly adding them to the Dockerfile that is used to build the Docker image. The environment.yml file contains all used software packages with their version that will be installed in the conda environment. All software packages that are available in the anaconda cloud can be automatically installed using this method. Unfortunately, the three main tools were not in the AnaConda cloud or only contained an old version at the time of this internship. This method of installation was only used for helper packages. The dockerfile contains the build instructions for a Docker image. MIXCR and arcasHLA were installed by modifying the current HistoGeneX dockerfile with all necessary commands to install the packages. QuanTIseq consists of multiple steps that were already implemented in the current HistoGeneX pipeline so only necessary scripts were downloaded from the github page and copied into the Docker image.
A Nextflow pipeline consists of different processes. The processes are executed independently from each other and can run in parallel. Each process needs an input and output and a process can have more than one input or output. The input and output are defined as channels. The interaction between the different processes and the execution order of the processes depends on these channels. First, separate Nextflow scripts were made for each tool. The arcasHLA script consists of two processes. The first process contains the extract and genotyping step. Here, the BAM files that were created by the STAR alignment were used as input. From these files only the chromosome 6 reads and related Human Leucocyte Antigen (HLA) sequences were extracted and written into paired fastq.gz files. These files were used for the genotype step to predict the most likely genotype. Next, the second process is performed where the output files from genotyping are merged into one .tsv file. Because the existing pipeline aligns against a version of the genome that does not contain the alternate HLA reads, a second script was created for arcasHLA where the trimmed fastq.gz files were used as input. This allowed a comparison between both methods in a later phase. Both methods were implemented into the pipeline. For quanTIseq, a script containing one process was created. quanTIseq exists of four different processes starting from raw fastq files, including trimming, kallisto pseudoalignment and immunephenotyping. Since trimming and kallisto were already implemented in the HistoGeneX pipeline, both steps were left out. Two scripts were adjusted for use in the HistoGeneX pipeline and a new conversion file was created because the GENCODE reference genome was different than used by quanTIseq. MIXCR uses FASTQ files as input so one process was created where the output files from the trimming process were used. Once they worked in a separate script, the Nextflow processes made for the tools were added to the existing pipeline. Nextflow uses configuration files for proper implementation of hardware of software settings. In these files, parameters that are used in the pipeline are described. Extra parameters were successfully added for the processes of the tools that were integrated in the pipeline. After successful testing, all pipeline files were transferred to the production server.
Biological validation of the pipeline was performed in two different clinical trial study sets: cervical cancer patients and hormone-positive breast cancer. The cervical cancer dataset consisted of pre- and post-treatment tumor tissue and normal adjacent tissue when available. The breast cancer dataset consisted only of pre- and post-treatment tumor tissue. T-cell staining results (CD8) were available for both datasets whereas Treg staining results (Foxp3) were available for the breast cancer dataset. This allowed comparing the quanTIseq data directly with IHC data. Dedicated data-analysis R scripts were created to analyze the results. Immunohistochemistry (IHC) results and patient information data were queried from the dedicated LIMS and result SQL servers.
QuanTIseq quantifies cell fractions of 10different immune cell types, including CD8 T cells and Treg cells. Comparison with IHC showed in general good to great correlations for both CD8 Tcells and Treg cells as depicted in Figure 1a-b with pearson correlation coefficients of 0.958 and 0.923 respectively. It should be noted that CD8 T cells were not picked up until a certain IHC threshold. For HLA haplotyping, results from the raw FASTQ files and extracted BAM files were first compared. While results were mostly similar, there were some discrepancies. Therefore, it was decided to only use the raw FASTQ results in subsequent analyses. Since pre-and post-treatment and normal and tumor results were available, the haplotypes of a patient from normal and tumor tissue were compared with the haplotypes of the same patient but from normal and tumor tissue after the cancer treatment. An overview from all the samples of one patient was created to compare and investigate differences between these samples. The results for patient 01-52 are illustrated in Figure 1c. RNAseq haplotyping showed that a potential Loss of Heterozygosity (LOH) event occurred in this patient. HistoGeneX will investigate this in further experiments. Moreover, RNAseq of 10 samples with known haplotypes will occur in the week of June 15th to further validate this. Finally, MiXCR analyses the T cell receptors (TCR) in the sample. We hypothesized that a high amount of T cells (CD8) should lead to more TCR sequences in the sample. Indeed, there was a high correlation between number of T cells and number of TCR sequences detected (results not showed). One of the major causes of cervical cancer is human papillomavirus (HPV). Therefore, we queried an online database containing sequences of TCR’s against HPV to check if there were TCR sequences against HPV present in the samples. No hits were found.
In conclusion, the internship resulted in the integration of three cancer immunity packages in the HistoGeneX RNA-seq pipeline. The pipeline has been successfully transferred to the production environment and runs without errors. QuanTIseq results were validated using IHC methods. arcasHLA and MiXCR results will be validated by HistoGeneX in a later phase.
Abstract 2018-2019: Improving standard Next-Generation Sequencing reporting by addition of Mutational Signatures and Copy Number Variations
The goal of this traineeship is to get familiar with bio-informatics software and tools in a more practical way. HistoGeneX performs Next-Generation Sequencing (NGS) to guide treatment options in patients with solid tumors. Currently, only mutations are reported to the clinicians. This project aims to improve the reporting by addition of Mutational Signatures and Copy Number variations (CNVs) to this reporting if possible. Therefore, my project consists of 2 subprojects: Mutational signatures and Copy Number Variation pipelines.
1. Mutational signatures
Somatic mutations are present in all cells of the human body and occur throughout life. Different mutational processes generate unique combinations of mutation types, termed mutational signatures. In the past few years, large-scale analyses have revealed many mutational signatures. COSMIC (Catalogue of somatic mutations in cancer) delivers a comprehensive resource by providing the profiles of and additional information about known mutational signatures. While Whole Exome Sequencing (WES) and Whole-Genome Sequencing (WGS) is of a different order of data, we want to incorporate these signatures in targeted sequencing data. In WGS data, the mutational signatures are automatically extracted from all present somatic mutations using dedicated bio-informatics tools. This is of course not feasible in targeted sequencing since the number of mutations detected is very low compared to WES and/or WGS. Therefore, this project is based on 3 easy to distinguish signatures: signature 1, 4 and 10 since these signatures only consist of a few single mutations. Signature 1 is associated with natural mutations due to aging, signature 4 represents mutations likely caused by tobacco smoke mutagens and signature 10 is associated with POLE (polymerase epsilon) somatic mutations in colorectal cancer. A Shiny application was created using R and RStudio to visualize the different mutations based on the output results of a commercially available Mutation Caller (Sophia DDM). The visualization gives a clean overview of the present mutations. Three tables are generated to illustrate the number and percentages of the different mutations and signatures. After incorporating patient data to the mutational signature data, data-analysis was performed in R. Although we would expect a correlation between the age of a patient and the number of signature 1 mutations, this was not present in our data. Signature 4 is mostly present in Lung Cancer samples, as expected. Signature 4 is, according to COSMIC, also found in various other cancers. These cancers are however not sequenced for diagnostic reasons at HistoGeneX. Signature 10 is mostly found in colorectal samples, as expected, although the results were not as clear as for signature 4. These results show the limitations of using WES and WGS-based signatures in a targeted gene panel. There was a strong correlation with low-quality samples and the amount of C>T mutations that were not due to aging.
2. Copy Number Variation pipelines
Variability in humans or in the human genome is caused by multiple types of genetic alterations. These alterations can involve differences in the DNA nucleotide sequence as well as changes in chromosome structure. CNVs are an example of structural variants that involve alterations in the number of copies of specific regions of DNA, which can either be a deletion or duplication. CNVs can either spontaneously arise de novo, or they can be inherited, as is the case for other types of genetic variations. Detecting CNV’s in targeted data is difficult. Recently, the CNVPanelizer Bioconductor package was created to perform this analysis. We implemented a completely new pythonbased CNVPanelizer pipeline to visualise the Copy Number (CN) of the genes in the targeted sequencing panel used at HistoGeneX. After trimming, aligning, sorting and indexing the raw FASTQ input, CNVPanelizer was used to infer the CN status. The whole pipeline was integrated into a Shiny application, allowing lab technicians to run this script after each sequencing run. A putative CN status is determined based on the ratios and can either be normal, amplification or deletion. Along with these results, a figure of the deletions and amplifications is generated after running the pipeline. This allows for a quick interpretation of the NGS analysis results. We then compared the results of the CNVPanelizer pipeline to previously detected and confirmed CNVs using another technique in patient samples. All 12 known and confirmed CNVs were detected using CNVPanelizer. More importantly, four additional CNVs were detected: a KRAS amplification in a lung sample, an EGFR and MEK1 deletion in a lung sample and a PTEN deletion in a melanoma sample. These CNVs are possibly of clinical importance. A PTEN deletion in a melanoma sample can be linked to loss of function and poor response to BRAFi monotherapy and a MEK1 deletion in a lung sample is thought to be associated with smoking. While these results are certainly promising, they should be validated using a different approach, e.g. FISH or WES/WGS. Five tonsil samples were sequenced to generate this baseline sample. A large number of FGFR3, HRAS and STK11 amplifications and ERBB2 deletions was found. This could be due to the reference samples and should be further investigated. In conclusion, the mutational signatures application will be used to provide additional information about the quality and origin of the sample. The CNV application provides accurate CNV results that will be used in guiding patient treatment.
 Alexandrov, L. B., Nik-Zainal, S., … Stratton, M. R. (2013). Signatures of mutational processes in human cancer. Nature, 500, 415-421.  Conrad, D. F., Pinto, D., Redon, R., Feuk, L., Gokcumen, O., Zhang, Y., … Hurles, M. E. (2010). Origins and functional impact of copy number variation in the human genome. Nature, 464, 704–712.  Catalanotti, F. Cheng, D. T., Shoushtari, A. N., Johnson, D. B., Panageas, K. S., Momtaz, P., … Solit, D. B. (2017). PTEN Loss-of-Function Alterations Are Associated With Intrinsic Resistance to BRAF Inhibitors in Metastatic Melanoma. JCO Precision Oncology, 1 - 15.
Pieter-Jan van Dam
+32 3 502 05 00