UHasselt - Center for Statistics
Abstract 2018-2019: IMPLEMENTATION OF PACMASS IN PYTHON
Mass spectrometry (MS) is a technique that can be used to detect, identify and quantify molecules, such as proteins and peptides, based on their mass-to-charge ratio. A molecule measured with MS can be identified with three approaches: querying a database of theoretical MS spectra, querying a database of experimentally obtained MS spectra, and de novo sequencing. De novo approaches extract peptide sequences directly from the MS spectra. Their advantage is that they can identify peptides whose MS spectrum is not present in any database. pacMASS is a newly developed de novo method that predicts the atomic composition of peptides and small proteins based on the monoisotopic or average mass observed in MS1 spectra. It is a time- and memory-efficient approach that can generate a list of possible compositions for peptides or proteins with a mass of 400 to 4000 daltons.
The pacMASS algorithm consists of three main steps:
1) Determining the ranges of C-, H-, N- and O-atoms based on predicted isotope ratios
In this step the isotope ratios are estimated from the observed monoisotopic or average mass with a polynomial regression model [1]. The prediction intervals for the isotope ratios are calculated with a mean squared error model [2]. These prediction intervals are compared with the theoretical isotope ratios from the human proteome (UniProtKB9606) to define the minimum and maximum number of C-, H-, N- and O-atoms.
2) Filtering the H- and N-ranges based on the nitrogen and hydrogen rules
The nitrogen rule states that the number of nitrogen atoms is even when the nominal mass is even. The hydrogen rule states that peptides with a nominal mass that is divisible by two have an even number of hydrogen atoms. To apply these two rules, the nominal mass is estimated from the observed monoisotopic or average mass [3]. The prediction interval for the nominal mass is calculated with a mean squared error model [2]. When the rounded upper and lower limits of the predicted nominal mass are identical, the hydrogen and nitrogen rules can be applied.
3) Generating all possible elemental compositions and filtering based on mass
The tolerance of the mass filter is set according to the mass accuracy of the mass spectrometer.
In some situations, where the nitrogen and hydrogen rules cannot be applied, a fourth step is required.
4) Extra filtering based on Senior’s theorem
This step is applied when the rounded upper and lower limits of the predicted nominal mass (step 2) are not identical, so that the nitrogen and hydrogen rules cannot be applied. The possible elemental compositions are reduced using the first condition of Senior’s theorem [4], which states that the sum of the valences, or the total number of atoms having odd valences, is even. This extra filtering yields the same possible elemental compositions as the nitrogen and hydrogen rules but is less memory efficient.
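Steps 2 to 4 can be illustrated with a minimal NumPy sketch. It is not the pacMASS code itself: the function names, the ppm-based tolerance, the atom ranges and the example mass (triglycine, C6H11N3O4) are all assumptions chosen for illustration, and the ranges that step 1 would predict are simply passed in by hand.

```python
import numpy as np

# Standard monoisotopic masses (Da) and valences, in the order C, H, N, O.
MONO_MASS = np.array([12.0, 1.00782503, 14.0030740, 15.9949146])
VALENCE = np.array([4, 1, 3, 2])


def candidate_compositions(ranges, observed_mass, tol_ppm, nominal_mass=None):
    """Enumerate C/H/N/O counts inside `ranges` and keep those matching the mass.

    ranges: four inclusive (min, max) tuples in the order C, H, N, O
            (hypothetical input; in pacMASS these come from step 1).
    nominal_mass: if known, the hydrogen and nitrogen rules are applied
                  to the H and N ranges before the grid is built (step 2).
    """
    axes = [np.arange(lo, hi + 1) for lo, hi in ranges]
    if nominal_mass is not None:
        parity = nominal_mass % 2
        axes[1] = axes[1][axes[1] % 2 == parity]  # hydrogen rule
        axes[2] = axes[2][axes[2] % 2 == parity]  # nitrogen rule
    # Step 3: all combinations as rows of an (n, 4) integer array.
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 4)
    masses = grid @ MONO_MASS
    tol = observed_mass * tol_ppm * 1e-6  # assumed ppm-style tolerance
    return grid[np.abs(masses - observed_mass) <= tol]


def senior_first_condition(compositions):
    """Step 4: keep compositions whose summed valence is even."""
    total_valence = compositions @ VALENCE
    return compositions[total_valence % 2 == 0]


# Illustrative ranges around triglycine (monoisotopic mass about 189.07496 Da).
ranges = [(4, 8), (8, 14), (1, 5), (2, 6)]
with_rules = candidate_compositions(ranges, 189.07496, tol_ppm=5, nominal_mass=189)
with_senior = senior_first_condition(
    candidate_compositions(ranges, 189.07496, tol_ppm=5)
)
```

When the nominal mass is known, the parity rules shrink the H and N axes before the grid is built, so fewer rows are ever held in memory. The Senior route has to enumerate the full grid first and filter afterwards, which is one way to read the remark that it is less memory efficient.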
The original pacMASS algorithm was written in R. During my internship I implemented pacMASS in Python to improve its speed, using the pandas and NumPy packages. pandas is an open-source library that provides high-performance data structures and data analysis tools. In the pacMASS script, pandas is mainly used for importing data from .txt or .csv files and for creating a data frame with the results. NumPy is a package designed for scientific computing with Python. Its most important data type is the ndarray, whose values are stored in a contiguous block of memory, which makes accessing them very fast. ndarrays are used in the pacMASS script to store all the data and to perform mathematical operations on them throughout all steps of the pacMASS algorithm. The pacMASS Python script can be used from the command line or through a graphical user interface that was created with PyQt.
[1] Valkenborg, D., Jansen, I., and Burzykowski, T. (2008) A model-based method for the prediction of the isotopic distribution of peptides. J. Am. Soc. Mass Spectrom., 19, 703-712.
[2] Eng, J.K., McCormack, A.L., and Yates, J.R. (1994) An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom., 5, 976-989.
[3] Perkins, D.N., Pappin, D.J., Creasy, D.M., and Cottrell, J.S. (1999) Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis, 20, 3551-3567.
[4] Senior, J.K. (1951) Unimerism. J. Chem. Phys., 19, 865-873.
Agoralaan building D