Abstract traineeship advanced bachelor of bioinformatics 2020-2021: IDENTIFYING AND VISUALIZING PROTEIN CLEAVAGE SITES IN ARABIDOPSIS THALIANA
This traineeship addresses the following question: “To what extent are protein cleavage products in Arabidopsis thaliana (Arabidopsis) detectable in shotgun proteomics data, and can they be accounted for in future searches?” To answer it, the trainee uses the Python and R programming languages to construct a computational pipeline that acquires, preprocesses and searches public proteomics data, alongside post hoc scripts that summarize and visualize the obtained results.
To identify protein cleavage products, two separate search indices are needed: a tryptic and a semi-tryptic one. The tryptic index accounts for full digestion (peptides with two enzymatic termini) and the semi-tryptic index for partial digestion (one enzymatic terminus). The semi-tryptic index makes it possible to locate non-tryptic, cellular cleavage events, in other words natural cleavage sites that may have a biological meaning. These indices are built with the Crux tide-index function (https://crux.ms/commands/tide-index.html). Tide is a tool for identifying peptides from tandem mass spectra. It assigns peptides to spectra by comparing the observed spectra to a catalog of theoretical spectra derived from a database of known proteins.
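The difference between the two indices can be illustrated with a minimal Python sketch. It is not how tide-index is implemented. It assumes the common trypsin rule (cleave after K or R unless the next residue is P), no missed cleavages by default, and a hypothetical minimum peptide length of 3. The function names and the example sequence are made up for illustration.

```python
def cleavage_sites(protein):
    """Positions (0-based offsets) where trypsin is assumed to cut: after K/R, not before P."""
    sites = [0]
    for i, residue in enumerate(protein[:-1]):
        if residue in "KR" and protein[i + 1] != "P":
            sites.append(i + 1)
    sites.append(len(protein))
    return sites


def tryptic_spans(protein, missed=0):
    """(start, end) spans with two enzymatic termini, allowing `missed` missed cleavages."""
    sites = cleavage_sites(protein)
    return [
        (sites[i], sites[j])
        for i in range(len(sites) - 1)
        for j in range(i + 1, min(i + 2 + missed, len(sites)))
    ]


def semi_tryptic_spans(protein, missed=0, min_len=3):
    """Spans with exactly one enzymatic terminus; the other terminus may fall anywhere."""
    full = set(tryptic_spans(protein, missed))
    spans = set()
    for start, end in full:
        for k in range(start + 1, end):
            if k - start >= min_len:
                spans.add((start, k))  # enzymatic N-terminus, free C-terminus
            if end - k >= min_len:
                spans.add((k, end))    # free N-terminus, enzymatic C-terminus
    return sorted(spans - full)


protein = "MASKTLLPR"  # hypothetical sequence
tryptic = [protein[s:e] for s, e in tryptic_spans(protein)]
semi = [protein[s:e] for s, e in semi_tryptic_spans(protein)]
```

For this toy sequence the tryptic peptides are MASK and TLLPR, while the semi-tryptic set adds fragments such as LLPR and LPR, whose N-termini do not follow a K or R and could therefore point to a non-tryptic cleavage.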
The first step of this pipeline is the gathering of raw Thermo (.raw) proteomics data files and their conversion to peak list (.mgf) files. For this project we re-analyzed three large Arabidopsis proteomic studies (PXD012708, PXD014877 and PXD013868) that are publicly available in the PRIDE repository (https://www.ebi.ac.uk/pride/). To obtain the FTP addresses of all .raw files, the pridepy package (https://github.com/PRIDE-Archive/pridepy) was used, which searches the PRIDE repository for a given project identifier. Each FTP address is then processed in a loop, and the downloaded file is stored under the PRIDE identifier it was queried with. All data are downloaded over multiple parallel connections with Axel (https://github.com/axel-download-accelerator/axel), which gives a considerable speed boost compared to wget, and converted to .mgf format using ThermoRawFileParser (https://github.com/compomics/ThermoRawFileParser). The converted .mgf files are then searched with the Crux cascade-search function (https://crux.ms/commands/cascade-search.html). This search uses the previously built indices and searches both of them in an automated, fast and statistically robust manner to find the cleavage products.
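The bookkeeping behind the download step could look like the following Python sketch. It only plans the work and runs nothing. The directory layout, the function name, the number of connections and the example FTP address are assumptions for illustration. The Axel options used are -n (number of connections) and -o (output file). The pridepy lookup and the ThermoRawFileParser call are deliberately left out.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def plan_downloads(urls_by_project, root="data", connections=8):
    """Return one record per .raw file: where to store it and how to fetch it."""
    plan = []
    for project, urls in urls_by_project.items():
        for url in urls:
            name = PurePosixPath(urlparse(url).path).name
            # Hypothetical layout: one folder per PRIDE project identifier.
            raw_path = f"{root}/{project}/raw/{name}"
            mgf_path = f"{root}/{project}/mgf/{PurePosixPath(name).stem}.mgf"
            # axel: -n sets the number of connections, -o the output file.
            command = ["axel", "-n", str(connections), "-o", raw_path, url]
            plan.append(
                {"project": project, "raw": raw_path, "mgf": mgf_path, "download": command}
            )
    return plan


# Hypothetical FTP address, for illustration only.
plan = plan_downloads(
    {"PXD012708": ["ftp://ftp.example.org/pride/PXD012708/sample_01.raw"]}
)
```

Each record could then be handed to subprocess.run for the download, followed by the conversion to the planned .mgf path.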
In total, 769 MS samples were analyzed, and two post hoc scripts are needed to interpret the obtained data. The first post hoc procedure summarizes all the data so that they are readable by non-experts. The summary is made with a Python script that filters the peptide identification results for identified cleavage sites. The number of peptide-to-spectrum matches (PSMs) to semi-tryptic peptides and their respective modification status (N-terminal acetylation, pyro-Glu formation or unmodified) are stored. In addition, a TargetP 2.0 prediction (http://www.cbs.dtu.dk/services/TargetP/) is incorporated in the summary, so that identified sites can be compared with the predicted cleavage sites of N-terminal sorting signals for organellar targeting, e.g. a mitochondrial transit peptide (mTP) or a chloroplast transit peptide. All this information is then written to a tab-delimited file that can easily be copied into Excel.
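The summary step can be sketched as below. This is an illustration under assumptions, not the trainee's script. The PSM record fields (protein, start, mod, semi_tryptic), the modification labels and the column names are hypothetical, and a cleavage site is represented by the 1-based position of the first residue of the semi-tryptic peptide. The TargetP input is assumed to be a mapping from protein identifier to the predicted first residue of the mature protein.

```python
import csv
import io
from collections import Counter, defaultdict

MODS = ("acetyl", "pyro-Glu", "none")  # hypothetical labels for the modification status


def summarize(psms, targetp_start):
    """Count semi-tryptic PSMs per (protein, site) and per modification status."""
    counts = defaultdict(Counter)
    for psm in psms:
        if psm["semi_tryptic"]:
            counts[(psm["protein"], psm["start"])][psm["mod"]] += 1
    rows = []
    for (protein, start), by_mod in sorted(counts.items()):
        matches = targetp_start.get(protein) == start
        rows.append(
            [protein, start, *(by_mod[m] for m in MODS), sum(by_mod.values()), matches]
        )
    return rows


def to_tsv(rows):
    """Serialize the summary rows as tab-delimited text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["protein", "site", *MODS, "total_psms", "matches_targetp"])
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical records for illustration.
psms = [
    {"protein": "P1", "start": 52, "mod": "acetyl", "semi_tryptic": True},
    {"protein": "P1", "start": 52, "mod": "none", "semi_tryptic": True},
    {"protein": "P1", "start": 10, "mod": "none", "semi_tryptic": False},
]
rows = summarize(psms, {"P1": 52})
table = to_tsv(rows)
```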
The second post hoc procedure visualizes these data with an R script. The script requires a single protein identifier and outputs a graph that displays several tryptic and semi-tryptic peptide identification results (Figure 1): the number of semi-tryptic PSMs and their respective modification status, a heatmap above the x-axis showing the tryptic PSMs, a heatmap under the x-axis showing the probability that a PSM will be found (predicted with DeepMSPeptide, a deep learning algorithm that predicts peptide detectability) and a vertical line marking the predicted cleavage site of that protein.
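The plot itself is produced in R, but the data behind the tryptic heatmap can be sketched in Python. The sketch assumes that the heatmap shows, for every residue of the protein, the number of tryptic PSMs covering it. The input format (1-based inclusive start and end plus a PSM count per peptide) and the function name are hypothetical.

```python
def per_residue_psm_counts(protein_length, peptides):
    """Sum the PSM counts of all peptides covering each residue position."""
    counts = [0] * protein_length
    for start, end, n_psms in peptides:  # 1-based, inclusive coordinates
        for position in range(start - 1, end):
            counts[position] += n_psms
    return counts


# Two hypothetical peptides on a protein of 6 residues.
coverage = per_residue_psm_counts(6, [(1, 3, 2), (3, 5, 1)])
```

The resulting list has one value per residue and could be passed to any heatmap routine, with the position along the protein on the x-axis.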
Lastly, all of the data are processed into .fasta files to produce ‘cleavage-aware’ FASTA databases. A total of four FASTA files are made by adding N-terminally truncated versions of existing proteins; in other words, the identified cleavage sites will be searched as protein N-termini. The first two FASTA files supplement the representative Arabidopsis proteome (Araport11, 48,359 entries) with the cleavage sites supported by more than 5 PSMs and with the cleavage sites matching TargetP 2.0 predictions, respectively. The last two FASTA files do the same, except that they use the full Arabidopsis proteome (including splice forms).
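A minimal sketch of how such truncated entries could be generated is given below. The header naming scheme (identifier plus a _from suffix), the line width and the example data are assumptions. Only the threshold of more than 5 PSMs comes from the text. A site is again represented by the 1-based position of the new N-terminal residue.

```python
def truncated_entries(proteins, site_psms, min_psms=5):
    """Build N-terminally truncated proteins for sites with more than `min_psms` PSMs."""
    entries = []
    for (protein_id, start), n_psms in sorted(site_psms.items()):
        sequence = proteins.get(protein_id)
        if sequence is None or not 1 < start <= len(sequence):
            continue  # unknown protein, or a site that does not truncate anything
        if n_psms > min_psms:
            # Hypothetical header convention: original identifier plus new start.
            entries.append((f"{protein_id}_from{start}", sequence[start - 1:]))
    return entries


def to_fasta(entries, width=60):
    """Format (header, sequence) pairs as FASTA text with wrapped sequence lines."""
    lines = []
    for header, sequence in entries:
        lines.append(f">{header}")
        lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"


# Hypothetical protein and PSM counts per cleavage site.
proteins = {"AT1G01010.1": "MASKTLLPRAA"}
site_psms = {("AT1G01010.1", 5): 6, ("AT1G01010.1", 3): 5}
extra = truncated_entries(proteins, site_psms)
fasta_text = to_fasta(extra)
```

Appending this text to the original proteome FASTA gives a database in which the truncated forms can be matched as ordinary protein N-termini by a standard tryptic search.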
After creating these cleavage-aware FASTA databases, we tested their merit on a dataset studying the effect of 4 hours of mannitol stress in Arabidopsis (PXD008900) using a standard, tryptic MaxQuant search. Searching the Araport11 representative proteins identified 15,843 tryptic peptides, while searching the extended databases with all cleavage sites or with only the cleavage sites matching TargetP 2.0 predictions resulted in 402 (+2.54%) and 787 (+4.97%) additional peptide identifications, respectively. This demonstrates that accounting for protein cleavage products contributes measurably to tryptic peptide identification. Searching the full Araport11 proteome with splice forms resulted in 16,049 peptide identifications, roughly 600 fewer than when searching with protein cleavage products. Hence, instead of accounting for pre-mRNA splicing in protein databases, it could be more meaningful to consider protein cleavage.
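The reported percentages can be rechecked from the peptide counts with a few lines of Python. The variable names are made up, the two gains are assigned in the order in which the sentence lists them, and the final comparison assumes that the splice-form search is compared with the larger of the two extended-database results.

```python
baseline = 15843        # peptides with the representative Araport11 proteome
gain_first_db = 402     # additional peptides, first extended database
gain_second_db = 787    # additional peptides, second extended database
splice_forms = 16049    # peptides with the full proteome including splice forms


def percent_gain(gain, reference):
    """Relative gain in percent, rounded to two decimals."""
    return round(100 * gain / reference, 2)


pct_first = percent_gain(gain_first_db, baseline)
pct_second = percent_gain(gain_second_db, baseline)
# Difference between the best cleavage-aware search and the splice-form search.
difference = baseline + gain_second_db - splice_forms
```

Under these assumptions the gains come out at 2.54% and 4.97%, and the difference with the splice-form search is 581 peptides, in line with the figure of about 600 quoted above.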
Abstract traineeship advanced bachelor of bioinformatics 2017-2018: Computational promoter and expression analysis to characterize stress regulation of Nictaba-related genes
The main characters in this story are the Nictaba lectin orthologues from Arabidopsis thaliana. Lectins are proteins that can selectively and reversibly bind to sugar structures. This class of proteins is widely represented in the plant kingdom but is also present in animals and fungi.
Some plant lectin families are constitutively expressed, while other families show an inducible expression mainly upon stress signals. A plant can suffer from abiotic and biotic stresses. Abiotic stress includes heat, drought, heavy metals, cold and salt stress while biotic stresses consist of insect infestations and pathogen infections.
One of these inducible plant lectins is Nictaba, or Nicotiana tabacum agglutinin. It was first discovered in the leaves of tobacco and was identified as a jasmonate-inducible protein. In the genome of Arabidopsis thaliana, 31 orthologues of the Nictaba lectin gene have been identified. The Nictaba domain is often linked to another protein domain or to a C- or N-terminal region. The most common domain is the F-box domain; other domains include a TIR domain (Toll/interleukin-1 receptor) and an AIG1 (avirulence induced gene 1)-type G domain.
To analyse the evolutionary relationships between the Nictaba-related genes in Arabidopsis and in other species, a phylogenetic tree was made. To obtain a clear view of the phylogeny of the Nictaba domain, the multiple sequence alignment (MEGA software) and the phylogenetic tree (RAxML) were built from the Nictaba domain sequences only, which avoids interference from the other (non-lectin) domains. In the obtained Maximum Likelihood tree, we observe three main clades, designated A, B and C. The phylogenetic relationship is related to the corresponding protein domain architecture. Clades B and C contain the F-box Nictaba-related genes. Clade A contains separate branches for TIR-Nictaba-related lectins and for proteins with only a Nictaba lectin domain.
To investigate the potential involvement of Arabidopsis Nictaba-related lectins in the plant stress response, this project aims to determine the gene expression profiles for different Nictaba orthologues in Arabidopsis thaliana and analyse the conservation of regulatory sequences in the promoter regions of these genes. Therefore a workflow was designed in which we started from the expression profiles from the Nictaba-related genes and defined a set of co-expressed genes. From this set of genes, the gene-of-interest and its co-expressed genes, a gene ontology (GO) enrichment and motif enrichment were produced. The resulting data was visualized in a graph.
To define the expression profile, the project started with a Genevestigator search for the 31 Nictaba orthologues in Arabidopsis. We selected microarray experiments from wild-type plants for diverse biotic and abiotic stresses (including salicylic acid, methyl jasmonate, abscisic acid, indole-3-acetic acid (auxin), salt, heat, cold, drought, Pseudomonas and Myzus) and different plant parts. The selection cut-off was set to FC=1.5 and p-value=0.05. Based on these results and on the organisation of the phylogenetic tree, 6 genes with different expression profiles were selected: AT1G80110 (PP2-B11), AT1G31200 (PP2-A9), AT2G02350 (PP2-B9), AT4G19840 (PP2-A1), AT5G52120 (PP2-A14) and AT1G65390 (PP2-A5). The genes PP2-B11, PP2-A14 and PP2-A5 were selected because they show an interesting stress-responsive expression profile. The other 3 genes (PP2-A9, PP2-B9 and PP2-A1) have previously been studied in our research facility. All 3 clades of the phylogenetic tree have representatives among the selected genes. PP2-A9 has almost no N- or C-terminal domain and is located in clade A together with PP2-A1 (N-terminal domain) and PP2-A5 (TIR domain). PP2-B11 (F-box) and PP2-B9 (small C-terminal domain) belong to clade B. One gene was selected from clade C: PP2-A14, with an F-box domain linked to a Nictaba domain.
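The selection cut-off can be written down as a small filter. The sketch assumes that FC is a linear-scale fold change, that both up- and down-regulation count (FC of at least 1.5 or at most 1/1.5) and that the thresholds are inclusive. The text fixes none of these details, so they are assumptions, as is the function name.

```python
def is_stress_responsive(fold_change, p_value, fc_cutoff=1.5, p_cutoff=0.05):
    """True when the change is large enough in either direction and significant."""
    changed = fold_change >= fc_cutoff or fold_change <= 1 / fc_cutoff
    return changed and p_value <= p_cutoff


# Hypothetical (fold change, p-value) pairs for one gene in four experiments.
measurements = [(2.0, 0.01), (0.5, 0.01), (1.2, 0.001), (3.0, 0.2)]
calls = [is_stress_responsive(fc, p) for fc, p in measurements]
```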
Using their transcription profiles, 100 and 200 co-expressed genes were identified for each of the selected Nictaba-related genes (positive and negative correlation). Since co-expressed genes are assumed to be involved in the same processes, a GO enrichment analysis was performed. This analysis reveals whether a GO term (a hierarchical grouping of gene descriptions) is significantly more frequent among the co-expressed genes than in the pool of Arabidopsis genes. In addition, we isolated the sequence 5 kb upstream and 1 kb downstream of the Nictaba-related genes to perform a motif enrichment that identifies cis-regulatory elements (and linked transcription factors) enriched in the promoters of these genes.
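Both steps can be illustrated with a short Python sketch. It assumes Pearson correlation as the co-expression measure and a one-sided hypergeometric test for the enrichment. The text names neither, so both are stand-ins for whatever the tools actually used. Gene names, profiles and function names are hypothetical.

```python
from math import comb, sqrt


def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def coexpressed(profiles, gene, top_n=100):
    """Return the top_n most positively and most negatively correlated genes."""
    scores = sorted(
        ((pearson(profiles[gene], p), g) for g, p in profiles.items() if g != gene),
        reverse=True,
    )
    positive = [g for r, g in scores[:top_n] if r > 0]
    negative = [g for r, g in reversed(scores[-top_n:]) if r < 0]
    return positive, negative


def enrichment_p_value(hits, set_size, annotated, background):
    """P(X >= hits) when drawing set_size genes from a background of which
    `annotated` genes carry the GO term (hypergeometric upper tail)."""
    upper = min(set_size, annotated)
    tail = sum(
        comb(annotated, i) * comb(background - annotated, set_size - i)
        for i in range(hits, upper + 1)
    )
    return tail / comb(background, set_size)


profiles = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [4, 3, 2, 1], "D": [1, 3, 2, 4]}
positive, negative = coexpressed(profiles, "A", top_n=2)
```

In this toy example, genes B and D end up in the positively correlated set of gene A and gene C in the negatively correlated set.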
To give a clear presentation of the data obtained from the GO and motif enrichment, we processed the data in Cytoscape, filtering on the q-value (the p-value corrected for the false discovery rate) and on the number of hits for the enriched term in the set of co-expressed genes.
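How a q-value relates to the underlying p-values can be shown with the Benjamini-Hochberg procedure. Using this particular correction is an assumption, since the text only says that the p-value is corrected for the false discovery rate. The thresholds in the filter (q-value at most 0.05, at least 5 hits) and the record fields are hypothetical as well.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        q_values[index] = running_min
    return q_values


def filter_terms(terms, max_q=0.05, min_hits=5):
    """Keep enriched terms that pass both the q-value and the hit-count filter."""
    return [t for t in terms if t["q"] <= max_q and t["hits"] >= min_hits]


q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
terms = [
    {"term": "term_1", "q": 0.01, "hits": 12},
    {"term": "term_2", "q": 0.01, "hits": 2},
    {"term": "term_3", "q": 0.20, "hits": 30},
]
kept = filter_terms(terms)
```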
The results of this analysis support the hypothesis that the Arabidopsis Nictaba orthologues are involved in the stress response pathways of the plant. Most Nictaba-related genes show a stress-regulated expression profile. In addition, the in-depth analysis of the selected genes retrieved similar stress-related GO terms and cis-regulatory elements. We have shown that this type of analysis can provide information about the function of a gene of interest.
Abstract bachelor thesis 2015-2016: Identification of new PGPRs in wheat
For reasons of confidentiality, the abstract cannot be published.
To achieve this, the structure of the CORNET database is first adapted so that all the necessary information can be integrated into it. The data are then loaded into the database.
Once the database contains the necessary information, the TF tool is developed. This is a new tool, in addition to the tools already available on the site.
The output of the TF tool is provided in text format or visualized in Cytoscape. In Cytoscape, the result is displayed as a network of one or more TFs and their targets.
This visualization shows not only the interactions between the transcription factor and the target gene, but also the regulation that the transcription factor exerts on the target gene and the reliability of the data.
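A text output of this kind could be a tab-delimited edge table, which Cytoscape can import as a network. The following Python sketch shows the idea. The column names, the regulation labels, the confidence values and the identifiers are all hypothetical and do not reflect the actual CORNET schema.

```python
COLUMNS = ("tf", "regulation", "target", "confidence")  # hypothetical column names


def subnetwork(edges, tfs):
    """Select the TF-target edges of one or more transcription factors."""
    wanted = set(tfs)
    return [edge for edge in edges if edge["tf"] in wanted]


def to_edge_table(edges):
    """Write edges as tab-delimited text: one TF-target interaction per line."""
    lines = ["\t".join(COLUMNS)]
    lines += ["\t".join(str(edge[c]) for c in COLUMNS) for edge in edges]
    return "\n".join(lines) + "\n"


# Hypothetical interactions for illustration.
edges = [
    {"tf": "TF1", "regulation": "activation", "target": "GENE_A", "confidence": 0.9},
    {"tf": "TF1", "regulation": "repression", "target": "GENE_B", "confidence": 0.6},
    {"tf": "TF2", "regulation": "activation", "target": "GENE_A", "confidence": 0.8},
]
table = to_edge_table(subnetwork(edges, ["TF1"]))
```

In such a table the regulation column could be mapped to the edge style and the confidence column to the edge width, so that the type of regulation and the reliability of the data are both visible in the network.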
After its development, the TF tool was integrated with the protein-protein interaction tool and the co-expression tool, so that the result of one tool can be supplemented with information from another.
In the future, other data types could also be integrated into CORNET. Analogous information from other organisms could be added as well, such as the economically and agronomically important maize, rice and poplar. This would make it possible to compare whether the molecules and the interactions between them are specific to a particular organism or occur in several organisms.