Mass spectrometry-based proteomics plays an important role in identifying peptides. Peptide identification depends strongly on the precursor mass estimated from the mass spectrum; however, precise precursor masses are difficult to estimate because spectra are often too noisy to yield correct isotope clusters. Conventional tools such as RAPID and MS-Deconv mitigate this problem by applying heuristic functions to recognize correct isotope clusters, so that more precise precursor masses can be estimated. However, because these heuristic functions are based on similarity to theoretical isotope clusters, they are limited in modelling the patterns of experimental isotope clusters. Here, we propose a machine learning approach to identifying correct isotope clusters, in the hope that it can better characterize experimental isotope clusters. Furthermore, we extend this concept from recognizing isotope clusters to predicting monoisotopic masses by developing a new software tool called MaSIC, which stands for MAss Spectrum Isotopic Cluster.
We designed an artificial neural network model to learn the characteristics of isotope clusters. The model takes as input a monoisotopic mass and the intensities of the first through twelfth peaks in a cluster, and predicts whether the given cluster is an isotope cluster.
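As an illustration of the input described above, the following is a minimal sketch of how one candidate cluster could be packed into a 13-value feature vector (one monoisotopic mass plus twelve peak intensities). The class and method names are hypothetical, and the zero-padding of missing peaks and the scaling of intensities to the largest peak are assumptions made for the sketch, not details given in the text.

```java
/** Hypothetical helper: packs one candidate cluster into the 13 model inputs. */
final class ClusterFeatures {
    static final int MAX_PEAKS = 12;                 // from the text: first to twelfth peak
    static final int FEATURE_COUNT = 1 + MAX_PEAKS;  // monoisotopic mass + 12 intensities

    private ClusterFeatures() {}

    /**
     * Returns [monoisotopicMass, i1, ..., i12].
     * Assumptions (not stated in the text): missing peaks are zero-padded,
     * peaks beyond the twelfth are ignored, and intensities are scaled so
     * that the largest retained peak equals 1.
     */
    static double[] toFeatures(double monoisotopicMass, double[] intensities) {
        double[] features = new double[FEATURE_COUNT]; // Java zero-initializes the array
        features[0] = monoisotopicMass;

        int n = Math.min(intensities.length, MAX_PEAKS);
        double max = 0.0;
        for (int i = 0; i < n; i++) {
            max = Math.max(max, intensities[i]);
        }
        for (int i = 0; i < n; i++) {
            // Guard against an all-zero cluster to avoid dividing by zero.
            features[1 + i] = max > 0.0 ? intensities[i] / max : 0.0;
        }
        return features;
    }
}
```

For example, `ClusterFeatures.toFeatures(1500.75, new double[] {50.0, 100.0, 25.0})` would give a vector whose first entry is the mass, followed by 0.5, 1.0, 0.25 and nine zeros.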
To train the model, we collected 3,749,487 peptide-spectrum matches (PSMs) from a previous study. Predicted isotope clusters (PICs) corresponding to each PSM were generated with both RAPID and MS-Deconv, yielding ~1.73 M PICs after de-duplication. We then generated 0.75 M negative isotope clusters (NICs) consisting of subsequences of the 1.73 M PICs.
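The text does not specify how subsequences were chosen, so the following is only a sketch of one plausible reading: a negative cluster is a contiguous run of peaks taken from a positive cluster that skips the true monoisotopic peak, so that its first peak carries a shifted, incorrect monoisotopic mass. The `Cluster` and `NegativeClusters` types are hypothetical, the mass is assumed to be a neutral mass, and the isotope spacing is approximated by the 13C-12C mass difference.

```java
import java.util.Arrays;

/** Hypothetical container: neutral monoisotopic mass plus peak intensities. */
final class Cluster {
    final double monoisotopicMass;
    final double[] intensities;

    Cluster(double monoisotopicMass, double[] intensities) {
        this.monoisotopicMass = monoisotopicMass;
        this.intensities = intensities.clone(); // defensive copy
    }
}

final class NegativeClusters {
    /** Approximate neutral-mass spacing between adjacent isotope peaks (13C - 12C), in Da. */
    static final double ISOTOPE_SPACING = 1.00335;

    private NegativeClusters() {}

    /**
     * Builds a negative example from peaks [start, end) of a positive cluster.
     * Assumption: start >= 1, so the subsequence omits the true monoisotopic
     * peak and is labeled with the mass of its own first peak instead.
     */
    static Cluster subsequence(Cluster pic, int start, int end) {
        if (start < 1 || end <= start || end > pic.intensities.length) {
            throw new IllegalArgumentException("need 1 <= start < end <= peak count");
        }
        double[] peaks = Arrays.copyOfRange(pic.intensities, start, end);
        double shiftedMass = pic.monoisotopicMass + start * ISOTOPE_SPACING;
        return new Cluster(shiftedMass, peaks);
    }
}
```

Under these assumptions, a negative example looks like a real cluster in most respects but starts at the wrong peak, which is the kind of error a classifier of this sort would need to reject.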
Four-fifths of the PICs and NICs were used for training and the remainder for testing. We applied 5-fold cross-validation to guard against overfitting; the average accuracy was 99.98%. We also tested the model on PICs and NICs derived from different experimental methods, obtaining a sensitivity of 99.95% and a specificity of 99.85%.
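For reference, the three reported metrics follow the standard definitions for a binary classifier, where a positive is a true isotope cluster (PIC) and a negative is a NIC. The sketch below shows how they could be computed from predictions; the `ConfusionMatrix` class is a hypothetical helper and is not part of MaSIC.

```java
/** Outcome counts for a binary classifier; "positive" means a true isotope cluster (PIC). */
final class ConfusionMatrix {
    long truePositives;
    long falseNegatives;
    long trueNegatives;
    long falsePositives;

    /** Records one prediction against its ground-truth label. */
    void add(boolean isPic, boolean predictedPic) {
        if (isPic) {
            if (predictedPic) truePositives++; else falseNegatives++;
        } else {
            if (predictedPic) falsePositives++; else trueNegatives++;
        }
    }

    /** Fraction of PICs recognized as PICs: TP / (TP + FN). NaN if no PICs were added. */
    double sensitivity() {
        return (double) truePositives / (truePositives + falseNegatives);
    }

    /** Fraction of NICs rejected as NICs: TN / (TN + FP). NaN if no NICs were added. */
    double specificity() {
        return (double) trueNegatives / (trueNegatives + falsePositives);
    }

    /** Fraction of all clusters classified correctly: (TP + TN) / total. */
    double accuracy() {
        long total = truePositives + falseNegatives + trueNegatives + falsePositives;
        return (double) (truePositives + trueNegatives) / total;
    }
}
```

In a 5-fold setting, one such matrix would typically be filled per held-out fold and the resulting accuracies averaged.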
DL4J, a Java library for machine learning, was used to make the trained model available on the Java platform. Given mass spectra in mzXML format as input, MaSIC predicts all possible isotope clusters. Using MaSIC together with heuristic software can further increase prediction performance.