Abstract advanced bachelor of bioinformatics 2020-2021: DEVELOPMENT OF A NEXTFLOW SINGLE CELL-RNA SEQUENCING PROCESSING PIPELINE FOR DOWNSTREAM ANALYSIS WITH SEURAT
Single-cell RNA sequencing
Gene expression profiling of bulk tissue or cell mixtures by RNA-seq has been established as technique to interrogate gene expression levels. While bulk RNA can differentiate the gene expression between conditions, underlying differences in cell-type specific transcriptomes are averaged out. Singe-cell RNAseq overcomes this problem by isolating the cells and has been used in multiple novel research findings and in multiple domains, including cancer and SARS-Cov-2 research [Olsen T. (2018), Bassez, A (2021), Stephenson E. (2021)]. Multiple protocols for single-cell RNAseq are available with 10x Genomics protocols being the most widely spread and used in research during the past years. The goal of the traineeship is to create a pipeline to preprocess single-cell RNAseq data for Quality control and downstream analysis using Seurat. The pipeline is written using the Nextflow program language. Furthermore, the Seurat objects will be imported and analyzed using R.
PBMC samples from four patients were processed according to the manufacturers protocol. 5000 cells were sorted using FACS and processed using the 10x Chromium controller. Libraries were prepared with the Single Cell Immune Profiling workflow, generating only Gene Expression libraries. Libraries were then sequenced on a NovaSeq 6000 instrument from Illumina on a NovaSeq SP flowcell. A Nextflow pipeline to process these samples and generating Quality control, gene and transcript count matrices and a Seurat object file was created. An overview of the newly created processing pipeline is depicted in Figure 1. This ensures scalability and portability of the processing pipeline between servers and environments, both local and cloud-based. The samples will be preprocessed using 2 workflows: cellranger and kallisto combined with bustools. The cellranger count workflow uses the input fastq folder and performs alignment, filtering, barcode counting and UMI counting using one command. Chromium cellular barcodes were used to generate feature- barcode matrices, determine clusters and perform gene expression analysis.
For the second workflow, kallisto and bustools commands were used. Kallisto bus first performs pseudoalignment to the reference genome (GRCh38 GENCODE v32) and then generates a BUS file, containing cell barcode, UMI information and equivalence class for each cell. Then the BUS files are barcode corrected using a barcode whitelist and are sorted. A summary of the content of the sorted BUS file is created using bustools inspect. The sorted BUS files are converted into a barcode-feature matrix, where the features can be genes or transcripts in a TCC (Transcript Compatibility Counts) matrix. Bustools count is executed twice to generate two matrices, both for genes and transcripts. A gene map converting transcripts to genes was generated using a bash line command with the aforementioned reference genome. The formed matrix files from both workflows were converted into Seurat object using a custom-made R script.
In a next phase, the created Seurat objects were then futher analyzed. Due to time constraints, only the gene matrices produced by cellranger were analyzed in R. Seurat is an R package that enables performing Quality Control and analysis of scRNA-seq data. First, doublets, generated by library preparation, were detected using the DoubletFinder package and removed for further downstream analysis. Doublets occur when two cells are captured in one droplet, leading to a mixture of transcriptomes and thus erroneous results. After doublet removal, are samples were merged into one Seurat object.
The percentage of features mitochondrial and ribosomal features, and features derived from hemoglobin or platelets were calculated. Cells were the percentage of mitochondrial RNA features were too high were removed from the object. This was performed because low-quality or dying cells have higher mitochondrial counts than normal cells. Cells containing low ribosomal features were also removed as these cells are also of low-quality or dying. The percentage hemoglobin/platelets is calculated to see if there is red blood cell contamination. A scatter plot was created based on the cell cycle score. The cell cycle score is calculated to mitigate the possible effect of heterogeneity in cell cycle stages. All cell cycle scores calculated in all cells were roughly the same meaning no further normalization was needed.
Finally clusters were created and visualized using UMAP. The clusters are found by finding the markers of each cell and clustering them based on their Euclidean distance in PCA space. The clusters were then labeled for cell type by using the SingleR function from the SingleR package. As shown in Figure 2, all known PBMC cell types were detected in the samples, indicating proper processing using the Nextflow pipeline. Detailed analysis of these clusters and differential gene expression analyses were not performed due to time constraints.
In conclusion, the internship resulted in the development of a single-cell RNA-seq processing pipeline. The pipeline has been successfully tested in the HistoGeneX validation environment and generates the expected results.
Abstract advanced bachelor of bioinformatics 2019-2020: Development of a Cancer Immunity RNAseq Nextflow Pipeline
Cancer is the second leading cause of death and is responsible for 9.6 million deaths in 2018 worldwide. Multiple improvements in diagnostics and therapy options have been made during the past decades. One of the new ways to treat cancer is immune checkpoint blocking or immunotherapy with success in achieving long-term durable responses. However, major challenges still remain, including but not limited to the detection of mechanisms of intrinsic and acquired resistance to therapy with immune checkpoint blockers and the identification of predictive markers for response. RNA-seq is a recent transcriptome profiling technique that allows broad profiling of samples. The aim of this internship is to improve the current HistoGeneX RNA-seq Nextflow pipeline with Cancer Immunity packages. Three packages were implemented: arcasHLA, quanTIseq and MIXCR. ArcasHLA performs high resolution genotyping for HLA class I and class II genes from RNA sequencing. QuanTIseq quantifies the tumor immune contexture from human RNA-seq data. MiXCR is a tool used for fast and accurate analysis of T and B cell receptor repertoire sequence data.
Development of the pipeline was performed on a test server in the test environment, isolated from the production environment. To allow reproducibility and migration from the test to production environment, the bio-IT team at HistoGeneX creates their pipelines using the Nextflow language and uses containers based on BioConda and Docker. Tools can be, preferably, installed using conda environments and an environment.yml YAML file or by directly adding them to the Dockerfile that is used to build the Docker image. The environment.yml file contains all used software packages with their version that will be installed in the conda environment. All software packages that are available in the anaconda cloud can be automatically installed using this method. Unfortunately, the three main tools were not in the AnaConda cloud or only contained an old version at the time of this internship. This method of installation was only used for helper packages. The dockerfile contains the build instructions for a Docker image. MIXCR and arcasHLA were installed by modifying the current HistoGeneX dockerfile with all necessary commands to install the packages. QuanTIseq consists of multiple steps that were already implemented in the current HistoGeneX pipeline so only necessary scripts were downloaded from the github page and copied into the Docker image.
A Nextflow pipeline consists of different processes. The processes are executed independently from each other and can run in parallel. Each process needs an input and output and a process can have more than one input or output. The input and output are defined as channels. The interaction between the different processes and the execution order of the processes depends on these channels. First, separate Nextflow scripts were made for each tool. The arcasHLA script consists of two processes. The first process contains the extract and genotyping step. Here, the BAM files that were created by the STAR alignment were used as input. From these files only the chromosome 6 reads and related Human Leucocyte Antigen (HLA) sequences were extracted and written into paired fastq.gz files. These files were used for the genotype step to predict the most likely genotype. Next, the second process is performed where the output files from genotyping are merged into one .tsv file. Because the existing pipeline aligns against a version of the genome that does not contain the alternate HLA reads, a second script was created for arcasHLA where the trimmed fastq.gz files were used as input. This allowed a comparison between both methods in a later phase. Both methods were implemented into the pipeline. For quanTIseq, a script containing one process was created. quanTIseq exists of four different processes starting from raw fastq files, including trimming, kallisto pseudoalignment and immunephenotyping. Since trimming and kallisto were already implemented in the HistoGeneX pipeline, both steps were left out. Two scripts were adjusted for use in the HistoGeneX pipeline and a new conversion file was created because the GENCODE reference genome was different than used by quanTIseq. MIXCR uses FASTQ files as input so one process was created where the output files from the trimming process were used. Once they worked in a separate script, the Nextflow processes made for the tools were added to the existing pipeline. Nextflow uses configuration files for proper implementation of hardware of software settings. In these files, parameters that are used in the pipeline are described. Extra parameters were successfully added for the processes of the tools that were integrated in the pipeline. After successful testing, all pipeline files were transferred to the production server.
Biological validation of the pipeline was performed in two different clinical trial study sets: cervical cancer patients and hormone-positive breast cancer. The cervical cancer dataset consisted of pre- and post-treatment tumor tissue and normal adjacent tissue when available. The breast cancer dataset consisted only of pre- and post-treatment tumor tissue. T-cell staining results (CD8) were available for both datasets whereas Treg staining results (Foxp3) were available for the breast cancer dataset. This allowed comparing the quanTIseq data directly with IHC data. Dedicated data-analysis R scripts were created to analyze the results. Immunohistochemistry (IHC) results and patient information data were queried from the dedicated LIMS and result SQL servers.
QuanTIseq quantifies cell fractions of 10different immune cell types, including CD8 T cells and Treg cells. Comparison with IHC showed in general good to great correlations for both CD8 Tcells and Treg cells as depicted in Figure 1a-b with pearson correlation coefficients of 0.958 and 0.923 respectively. It should be noted that CD8 T cells were not picked up until a certain IHC threshold. For HLA haplotyping, results from the raw FASTQ files and extracted BAM files were first compared. While results were mostly similar, there were some discrepancies. Therefore, it was decided to only use the raw FASTQ results in subsequent analyses. Since pre-and post-treatment and normal and tumor results were available, the haplotypes of a patient from normal and tumor tissue were compared with the haplotypes of the same patient but from normal and tumor tissue after the cancer treatment. An overview from all the samples of one patient was created to compare and investigate differences between these samples. The results for patient 01-52 are illustrated in Figure 1c. RNAseq haplotyping showed that a potential Loss of Heterozygosity (LOH) event occurred in this patient. HistoGeneX will investigate this in further experiments. Moreover, RNAseq of 10 samples with known haplotypes will occur in the week of June 15th to further validate this. Finally, MiXCR analyses the T cell receptors (TCR) in the sample. We hypothesized that a high amount of T cells (CD8) should lead to more TCR sequences in the sample. Indeed, there was a high correlation between number of T cells and number of TCR sequences detected (results not showed). One of the major causes of cervical cancer is human papillomavirus (HPV). Therefore, we queried an online database containing sequences of TCR’s against HPV to check if there were TCR sequences against HPV present in the samples. No hits were found.
In conclusion, the internship resulted in the integration of three cancer immunity packages in the HistoGeneX RNA-seq pipeline. The pipeline has been successfully transferred to the production environment and runs without errors. QuanTIseq results were validated using IHC methods. arcasHLA and MiXCR results will be validated by HistoGeneX in a later phase.
Abstract 2018-2019: Improving standard Next-Generation Sequencing reporting by addition of Mutational Signatures and Copy Number Variations
The goal of this traineeship is to get familiar with bio-informatics software and tools in a more practical way. HistoGeneX performs Next-Generation Sequencing (NGS) to guide treatment options in patients with solid tumors. Currently, only mutations are reported to the clinicians. This project aims to improve the reporting by addition of Mutational Signatures and Copy Number variations (CNVs) to this reporting if possible. Therefore, my project consists of 2 subprojects: Mutational signatures and Copy Number Variation pipelines.
1. Mutational signatures
Somatic mutations are present in all cells of the human body and occur throughout life. Different mutational processes generate unique combinations of mutation types, termed mutational signatures. In the past few years, large-scale analyses have revealed many mutational signatures. COSMIC (Catalogue of somatic mutations in cancer) delivers a comprehensive resource by providing the profiles of and additional information about known mutational signatures. While Whole Exome Sequencing (WES) and Whole-Genome Sequencing (WGS) is of a different order of data, we want to incorporate these signatures in targeted sequencing data. In WGS data, the mutational signatures are automatically extracted from all present somatic mutations using dedicated bio-informatics tools. This is of course not feasible in targeted sequencing since the number of mutations detected is very low compared to WES and/or WGS. Therefore, this project is based on 3 easy to distinguish signatures: signature 1, 4 and 10 since these signatures only consist of a few single mutations. Signature 1 is associated with natural mutations due to aging, signature 4 represents mutations likely caused by tobacco smoke mutagens and signature 10 is associated with POLE (polymerase epsilon) somatic mutations in colorectal cancer. A Shiny application was created using R and RStudio to visualize the different mutations based on the output results of a commercially available Mutation Caller (Sophia DDM). The visualization gives a clean overview of the present mutations. Three tables are generated to illustrate the number and percentages of the different mutations and signatures. After incorporating patient data to the mutational signature data, data-analysis was performed in R. Although we would expect a correlation between the age of a patient and the number of signature 1 mutations, this was not present in our data. Signature 4 is mostly present in Lung Cancer samples, as expected. Signature 4 is, according to COSMIC, also found in various other cancers. These cancers are however not sequenced for diagnostic reasons at HistoGeneX. Signature 10 is mostly found in colorectal samples, as expected, although the results were not as clear as for signature 4. These results show the limitations of using WES and WGS-based signatures in a targeted gene panel. There was a strong correlation with low-quality samples and the amount of C>T mutations that were not due to aging.
2. Copy Number Variation pipelines
Variability in humans or in the human genome is caused by multiple types of genetic alterations. These alterations can involve differences in the DNA nucleotide sequence as well as changes in chromosome structure. CNVs are an example of structural variants that involve alterations in the number of copies of specific regions of DNA, which can either be a deletion or duplication. CNVs can either spontaneously arise de novo, or they can be inherited, as is the case for other types of genetic variations. Detecting CNV’s in targeted data is difficult. Recently, the CNVPanelizer Bioconductor package was created to perform this analysis. We implemented a completely new pythonbased CNVPanelizer pipeline to visualise the Copy Number (CN) of the genes in the targeted sequencing panel used at HistoGeneX. After trimming, aligning, sorting and indexing the raw FASTQ input, CNVPanelizer was used to infer the CN status. The whole pipeline was integrated into a Shiny application, allowing lab technicians to run this script after each sequencing run. A putative CN status is determined based on the ratios and can either be normal, amplification or deletion. Along with these results, a figure of the deletions and amplifications is generated after running the pipeline. This allows for a quick interpretation of the NGS analysis results. We then compared the results of the CNVPanelizer pipeline to previously detected and confirmed CNVs using another technique in patient samples. All 12 known and confirmed CNVs were detected using CNVPanelizer. More importantly, four additional CNVs were detected: a KRAS amplification in a lung sample, an EGFR and MEK1 deletion in a lung sample and a PTEN deletion in a melanoma sample. These CNVs are possibly of clinical importance. A PTEN deletion in a melanoma sample can be linked to loss of function and poor response to BRAFi monotherapy and a MEK1 deletion in a lung sample is thought to be associated with smoking. While these results are certainly promising, they should be validated using a different approach, e.g. FISH or WES/WGS. Five tonsil samples were sequenced to generate this baseline sample. A large number of FGFR3, HRAS and STK11 amplifications and ERBB2 deletions was found. This could be due to the reference samples and should be further investigated. In conclusion, the mutational signatures application will be used to provide additional information about the quality and origin of the sample. The CNV application provides accurate CNV results that will be used in guiding patient treatment.
 Alexandrov, L. B., Nik-Zainal, S., … Stratton, M. R. (2013). Signatures of mutational processes in human cancer. Nature, 500, 415-421.  Conrad, D. F., Pinto, D., Redon, R., Feuk, L., Gokcumen, O., Zhang, Y., … Hurles, M. E. (2010). Origins and functional impact of copy number variation in the human genome. Nature, 464, 704–712.  Catalanotti, F. Cheng, D. T., Shoushtari, A. N., Johnson, D. B., Panageas, K. S., Momtaz, P., … Solit, D. B. (2017). PTEN Loss-of-Function Alterations Are Associated With Intrinsic Resistance to BRAF Inhibitors in Metastatic Melanoma. JCO Precision Oncology, 1 - 15.
Pieter-Jan van Dam
+32 3 502 05 00