Maastricht University, BiGCaT-Bioinformatics department
Abstract, advanced bachelor of bioinformatics 2019-2020: Building an automated RNA-seq workflow for determining gene-specific isoform expression
Recent large genome-scale studies suggest that practically all human multi-exon genes can be spliced into multiple transcript isoforms. Gencode v25 annotates 58,037 human genes and 198,093 isoforms, an average of 3.4 annotated transcripts per gene; when only protein-coding genes are considered, the ratio rises to 7:1. Even so, the number of annotated transcripts does not fully capture the complexity of alternative splicing events in cells. Novel transcripts are regularly discovered by RNA sequencing (RNA-seq), which can detect transcript isoforms, gene fusions, single nucleotide variants, and other features without requiring prior knowledge. This leads to my research question: how effectively can an automated workflow determine gene-specific isoform expression from RNA-seq data?
RNA-seq has emerged as a powerful transcriptome profiling technology that allows in-depth analysis of alternative splicing. In a typical RNA-seq assay, extracted RNAs are reverse transcribed and fragmented into cDNA libraries, which are then sequenced on high-throughput sequencers. Transcript isoforms of the same gene are highly similar in sequence and share a large proportion of overlapping regions. Identifying the true origin of short sequencing reads is therefore challenging, because a read from an overlapping region can come from any of the isoforms that contain it.
Many tools can align raw reads to a reference genome or transcriptome. For this project I will use Spliced Transcripts Alignment to a Reference (STAR), a software package that enables highly accurate and ultra-fast alignment of RNA-seq reads to a reference genome. It detects both annotated and novel splice junctions and can discover more complex RNA sequence arrangements, such as chimeric and circular RNA. It can also align spliced sequences of any length with moderate error rates, which provides scalability for emerging sequencing technologies.
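As an illustration of the alignment step, the sketch below builds the two STAR invocations that such a step usually needs: generating a genome index and aligning reads against it. It is a minimal example under stated assumptions, not the project's actual configuration: the helper names, file paths, thread count and read length are hypothetical, and only the STAR options themselves come from the STAR command-line interface.

```python
from typing import List


def star_index_cmd(genome_dir: str, fasta: str, gtf: str,
                   read_length: int = 100, threads: int = 4) -> List[str]:
    """Build the STAR command that generates a genome index.

    read_length is an assumed value; STAR's --sjdbOverhang is
    conventionally set to (read length - 1).
    """
    return [
        "STAR", "--runMode", "genomeGenerate",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--genomeFastaFiles", fasta,
        "--sjdbGTFfile", gtf,
        "--sjdbOverhang", str(read_length - 1),
    ]


def star_align_cmd(genome_dir: str, fastqs: List[str], out_prefix: str,
                   threads: int = 4) -> List[str]:
    """Build the STAR command that aligns one sample (one or two FASTQ files)."""
    cmd = [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", *fastqs,
        "--outFileNamePrefix", out_prefix,
        "--outSAMtype", "BAM", "SortedByCoordinate",
    ]
    # Gzipped input has to be decompressed on the fly.
    if all(f.endswith(".gz") for f in fastqs):
        cmd += ["--readFilesCommand", "zcat"]
    return cmd


if __name__ == "__main__":
    # Hypothetical paths, printed only; nothing is executed here.
    print(" ".join(star_index_cmd("star_index", "genome.fa", "annotation.gtf")))
    print(" ".join(star_align_cmd("star_index",
                                  ["s1_R1.fastq.gz", "s1_R2.fastq.gz"],
                                  "out/s1.")))
```

Returning the command as a list keeps the sketch testable without STAR installed; the list can be passed to `subprocess.run(cmd, check=True)` when the tool is available.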
After mapping, an appropriate quantification tool must be chosen. Several packages have been developed to quantify expression at the transcript level. This project concentrates on the RNA-Seq by Expectation-Maximization (RSEM) package, which uses an iterative Expectation-Maximization algorithm to probabilistically assign reads to the isoforms from which they originate.
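To make the Expectation-Maximization idea concrete, the following toy sketch estimates isoform proportions from reads that are each compatible with one or more isoforms. It is a deliberately simplified illustration, not RSEM's implementation: it assumes every compatible isoform is equally likely to generate a read apart from its abundance, and it ignores transcript length, fragment length and sequencing error, all of which RSEM's full model accounts for. The function name and the example data are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def em_isoform_abundance(read_compat: Iterable[Set[str]],
                         isoforms: List[str],
                         n_iter: int = 100,
                         tol: float = 1e-9) -> Dict[str, float]:
    """Estimate the fraction of reads originating from each isoform.

    read_compat holds, for every read, the set of isoforms it aligns to.
    """
    reads = list(read_compat)
    # Start from a uniform guess.
    theta = {iso: 1.0 / len(isoforms) for iso in isoforms}
    for _ in range(n_iter):
        # E-step: split each read across its compatible isoforms,
        # in proportion to the current abundance estimates.
        counts = dict.fromkeys(isoforms, 0.0)
        for compat in reads:
            total = sum(theta[i] for i in compat)
            if total == 0.0:
                continue  # read is only compatible with zero-abundance isoforms
            for i in compat:
                counts[i] += theta[i] / total
        assigned = sum(counts.values())
        if assigned == 0.0:
            break  # nothing to learn from
        # M-step: new abundances are the normalised expected read counts.
        new_theta = {i: counts[i] / assigned for i in isoforms}
        delta = max(abs(new_theta[i] - theta[i]) for i in isoforms)
        theta = new_theta
        if delta < tol:
            break
    return theta


if __name__ == "__main__":
    # 3 reads unique to isoform A, 1 unique to B, 4 shared by both.
    reads = [{"A"}] * 3 + [{"B"}] + [{"A", "B"}] * 4
    print(em_isoform_abundance(reads, ["A", "B"]))
```

In this example the uniquely mapping reads favour isoform A three to one, so the shared reads are gradually attributed in the same proportion and the estimates settle near 0.75 for A and 0.25 for B.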
By combining these tools in an automated workflow, we want to make the determination of isoform expression easier, more understandable, and better visualized. The image on the second page illustrates the pipeline we are following.
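One way such a workflow could be automated is sketched below: an ordered list of steps in which STAR writes a transcriptome-coordinate BAM file that RSEM then quantifies. This is a hedged outline rather than the pipeline shown in the image. The function names, sample name, directory layout and thread count are hypothetical, and it assumes that a STAR index and an RSEM reference (built with `rsem-prepare-reference`) already exist.

```python
import shlex
import subprocess
from typing import List, Tuple

Step = Tuple[str, List[str]]  # (step name, command as argument list)


def build_steps(sample: str, fastqs: List[str], star_index: str,
                rsem_ref: str, out_dir: str, threads: int = 4) -> List[Step]:
    """Return the align and quantify steps for one sample."""
    prefix = f"{out_dir}/{sample}."
    star = [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", star_index,
        "--readFilesIn", *fastqs,
        "--outFileNamePrefix", prefix,
        "--outSAMtype", "BAM", "Unsorted",
        # Also write alignments in transcript coordinates for RSEM.
        "--quantMode", "TranscriptomeSAM",
    ]
    rsem = ["rsem-calculate-expression", "--alignments", "-p", str(threads)]
    if len(fastqs) == 2:
        rsem.append("--paired-end")
    rsem += [
        f"{prefix}Aligned.toTranscriptome.out.bam",  # produced by the STAR step
        rsem_ref,
        f"{out_dir}/{sample}",  # RSEM output name prefix
    ]
    return [("align", star), ("quantify", rsem)]


def run_pipeline(steps: List[Step], dry_run: bool = True) -> List[str]:
    """Run the steps in order; with dry_run=True only print the commands."""
    done = []
    for name, cmd in steps:
        print(f"[{name}] {shlex.join(cmd)}")
        if not dry_run:
            # check=True stops the pipeline as soon as a tool fails.
            subprocess.run(cmd, check=True)
        done.append(name)
    return done


if __name__ == "__main__":
    steps = build_steps("s1", ["s1_R1.fastq", "s1_R2.fastq"],
                        "star_index", "rsem_ref/human", "out")
    run_pipeline(steps, dry_run=True)
```

The dry-run default lets the command sequence be inspected before any tool is executed. With RSEM, the isoform-level estimates would be expected in a file named after the output prefix with the suffix `.isoforms.results`.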