GPS 6.0 - Kinase-specific Phosphorylation Site Prediction

※ Computational resources of protein phosphorylation:

Introduction:

Protein phosphorylation is the most ubiquitous post-translational modification (PTM) and plays important roles in most biological processes. Identification of site-specific phosphorylated substrates is fundamental for understanding the molecular mechanisms of phosphorylation. Besides experimental approaches, prediction of potential candidates with computational methods has also attracted great attention for its convenience and speed. In this review, we present a comprehensive but brief summary of computational resources for protein phosphorylation, including phosphorylation databases, prediction of non-specific or organism-specific phosphorylation sites, prediction of kinase-specific phosphorylation sites or phospho-binding motifs, and other tools.

We apologize that computational studies without web links to databases or tools are not included in this compendium, since it is not easy for experimentalists to use such studies directly. We are grateful for user feedback. Please inform Dr. Yu Xue or Miaomiao Chen to add, remove or update one or more of the web links below.

Index:

<1> Phosphorylation Databases

<2> Prediction of non-specific or organism-specific phosphorylation sites

<3> Prediction of kinase-specific phosphorylation sites or phospho-binding motifs

<4> Miscellaneous tools

<5> Implemented tools

<6> Detection of potential phosphorylation sites from mass spectrometry data

==================================================================================

<1> Phosphorylation Databases:

1. Phospho.ELM 9.0 (PhosphoBase): contains 8,718 experimentally verified phosphorylated proteins from different species with 3,370 tyrosine, 31,754 serine and 7,449 threonine sites (Diella, et al., 2004; Diella, et al., 2008; Dinkel, et al., 2011).

2. PhosphoSitePlus: a new version of PhosphoSite and a web-based database that collects protein modification sites, including protein phosphorylation sites, from the scientific literature as well as high-throughput discovery programs. Currently, PhosphoSitePlus contains over 120,000 phosphorylation sites (Hornbeck, et al., 2012).

3. PhosphoNET: PhosphoNET presently holds data on over 950,000 known and putative phosphorylation sites (P-sites) in over 23,000 human proteins that have been collected from the scientific literature and other reputable websites. Over 19% of these phospho-sites have been experimentally validated. The rest have been predicted with a novel P-Site Predictor algorithm developed at Kinexus with academic partners at the University of British Columbia and Simon Fraser University.

4. HPRD release 9: HPRD currently contains information for 16,972 PTMs which belong to various categories, with phosphorylation (10,858), dephosphorylation (3,118) and glycosylation (1,860) forming the majority of the annotated PTMs. At least one enzyme responsible for PTMs has been annotated for 8,960 PTMs, which resulted in the documentation of 7,253 enzyme-substrate relationships (Keshava Prasad, et al., 2009).

5. PHOSIDA (Mirror website): a posttranslational modification database, comprises more than 80,000 phosphorylated, N-glycosylated or acetylated sites from nine different species. All sites are obtained from high-resolution mass spectrometric data using the same stringent quality criteria. PHOSIDA is comprised of three main components: the database environment, the prediction platform and the toolkit section (Gnad, et al., 2007; Gnad, et al., 2009; Gnad, et al., 2011).

6. PhosphoPep: a project to support systems biology signaling research by providing interactive interrogation of MS-derived phosphorylation data from 4 different organisms (Bodenmiller, et al., 2011).

7. PhosPhAt 4.0: contains information on Arabidopsis phosphorylation sites which were identified by mass spectrometry in large-scale experiments from different research groups, with 60,366 phospho-peptides matching to 8,141 nonredundant proteins (Heazlewood, et al., 2008; Durek, et al., 2010; Zulawski, et al., 2013).

8. P(3)DB 3.5: hosts protein phosphorylation data for 9 species from 32 experimental studies, containing 16,477 phosphoproteins harboring 47,923 phosphosites. Centered on these phosphorylation data, multiple related data and annotations are provided, including protein-protein interaction (PPI), gene ontology, protein tertiary structures, orthologous sequences, kinase/phosphatase classification and Kinase Client Assay (KiC Assay) data. In addition, it incorporates multiple network viewers for the above features, such as the PPI network, kinase-substrate network, phosphatase-substrate network, and domain co-occurrence network (Gao, et al., 2009; Yao, et al., 2012; Yao, et al., 2013).

9. UniProt: for each protein annotation, the "Amino acid modifications" in the "Sequence annotation (Features)" section collects the post-translational modification information of the protein (UniProt Consortium, et al., 2021).

10. dbPTM 3.0: an informative resource of experimental post-translational modifications (PTMs) obtained from public resources as well as manually curated MS/MS peptides associated with PTMs from research articles for investigating the substrate specificity of PTM sites and functional association of PTMs between substrates and their interacting proteins (Huang, et al., 2019; Lee, et al., 2006; Lu, et al., 2013).

11. SysPTM 2.0 (Mirror website): provides a systematic and sophisticated platform for proteomic PTM research, equipped not only with a knowledge base of manually curated multi-type modification data, but also with four fully developed, in-depth data mining tools (Li, et al., 2009).

12. PhosphoPOINT: a comprehensive human kinase interactome and phospho-protein database, containing 4,195 phospho-proteins with a total of 15,738 phosphorylation sites (Yang, et al., 2008).

13. NetworKIN 3.0: a method for predicting in vivo kinase-substrate relationships that augments consensus motifs with context for kinases and phosphoproteins. It is a valuable resource that opens the door to computational discovery of phospho-regulatory networks (Linding, et al., 2007; Linding, et al., 2008; Horn, et al., 2014).

14. Phospho3D 2.0: a database of three-dimensional structures of phosphorylation sites that stores information retrieved from the Phospho.ELM database, enriched with structural information and annotations at the residue level (Zanzoni, et al., 2007; Zanzoni, et al., 2011).

15. PepCyber:P~Pep 1.2: a database of human protein-protein interactions mediated by 10 classes of phosphoprotein binding domains (PPBDs) (Gong, et al., 2008).

16. PhosphoVariant: a database for human phosphovariants, which were defined as genetic variations that change phosphorylation sites or their interacting kinases (Ryu, et al., 2009).

17. ProMEX: a mass spectral reference database for proteins and protein phosphorylation sites, containing 116,364 tryptic peptide product ion spectra entries of 48,218 different peptide sequence entries (Hummel, et al., 2009; Wienkoop, et al., 2012).

18. PlantsP: contains more than 300 phosphorylation sites from Arabidopsis thaliana plasma membrane proteins (Nühse, et al., 2009).

19. LymPHOS2: a web-oriented database containing peptidic and protein sequences and spectrometric information on the phosphoproteome of human T-lymphocytes (Ovelleiro, et al., 2009).

20. PhosSNP 1.0: a genome-wide analysis of genetic polymorphisms that influence protein phosphorylation in H. sapiens. It was estimated that ~69.76% of nsSNPs (non-synonymous SNPs) are potential phosSNPs (phosphorylation-related SNPs) (64,035) in 17,614 proteins (Ren, et al., 2010).

21. The Phosphorylation Site Database: provides ready access to information from the primary scientific literature concerning those proteins from prokaryotic organisms, i.e., the members of the domains Archaea and Bacteria, that have been reported to undergo covalent phosphorylation on the hydroxyl side chains of serine, threonine, and/or tyrosine residues (Wurgler-Murphy, et al., 2004).

22. PhosphoGRID version 3.2: an online database of experimentally verified in vivo protein phosphorylation sites in the model eukaryotic organism S. cerevisiae. The database includes results from high throughput (HTP) MS proteomics studies as well as phosphosites identified in low throughput (LTP) studies of individual proteins or protein complexes (Stark, et al., 2010; Stark, et al., 2013).

23. TAIR: maintains a database of genetic and molecular data for Arabidopsis thaliana. Protein data available from TAIR includes the complete protein sequence along with phosphorylation site annotations (Lamesch, et al., 2011).

24. Medicago PhosphoProtein Database: a repository for Medicago truncatula phosphoprotein data which holds 3,404 non-redundant sites of phosphorylation on 829 proteins (Grimsrud, et al., 2010).

25. dbPPT 1.0: a comprehensive resource of plant protein phosphorylation that contains 82,175 phosphorylation sites in 31,012 proteins from 20 plant organisms. The phosphorylation sites in dbPPT were manually curated from the literature, while datasets in other public databases were also integrated (Cheng, et al., 2014).

26. EPSD: a comprehensive data resource updated from two databases, dbPPT and dbPAF, which contained 82,175 p-sites of 20 plants and 483,001 p-sites of 7 animals and fungi, respectively (Lin, et al., 2020).

27. KinBase: holds information on over 3,000 protein kinase genes found in the human genome and many other sequenced genomes (Manning, et al., 2002).

28. BioGRID: a public database that curates genetic and chemical interactions, as well as proteins with a large number of PTM sites (Oughtred, et al., 2019).

29. FPD: includes 62,272 non-redundant phosphorylation sites in 11,222 proteins across eight fungi (Bai, et al., 2017).

30. MPPD: a repository for Medicago truncatula phosphoprotein data (Grimsrud, et al., 2010).

31. iEKPD: contains 197,348 phosphorylation regulators, including 109,912 protein kinases, 23,294 protein phosphatases and 68,748 PPBD-containing proteins in 164 eukaryotic species (Guo, et al., 2019).

32. PubMed: Comprises more than 34 million citations for biomedical literature from MEDLINE, life science journals, and online books. Citations may include links to full text content from PubMed Central and publisher web sites.

33. PDB: a leading resource of structural data of biological macromolecules (Berman, et al., 2000).

<2> Prediction of non-specific or organism-specific phosphorylation sites:

1. NetPhos 2.0: produces neural network predictions for serine, threonine and tyrosine phosphorylation sites in eukaryotic proteins (Blom, et al., 1999).

2. CRP: Cleaved Radioactivity of Phosphopeptides. CRP performs an in silico proteolytic cleavage of the sequence and reports the predicted Edman cycles in which radioactivity would be observed if a given serine, threonine or tyrosine were phosphorylated (Mackey, et al., 2003).

3. DISPHOS 1.3: uses disorder information to improve the discrimination between phosphorylation and non-phosphorylation sites, and predicts serine, threonine and tyrosine phosphorylation sites in proteins (Iakoucheva, et al., 2004).

4. NetPhosYeast 1.0: predicts serine and threonine phosphorylation sites in yeast proteins (Ingrell, et al., 2007).

5. NetPhosBac 1.0: predicts serine and threonine phosphorylation sites in bacterial proteins (Miller, et al., 2009).

6. PhosPhAt 4.0: version 2.2 of the predictor was trained on a set of 802 experimentally validated serine phosphorylation sites, and an additional 1,818 threonine phosphorylation sites and 676 tyrosine sites in Arabidopsis were used to develop the 3.0 predictor for phosphorylation sites in Arabidopsis (Heazlewood, et al., 2008; Durek, et al., 2010).

7. PHOSIDA (Mirror website): a predictor based on more than 5,000 high-confidence phosphosites, using the support vector machine (SVM) algorithm (Gnad, et al., 2007).

8. GANNPhos: uses a genetic algorithm integrated neural network (GANN) algorithm (Tang, et al., 2007). The tool is not available.

9. PHOSITE: is based on case-based sequence analysis (Koenig and Grabe, 2004). The tool is not available.

10. PhosphoSVM: prediction of phosphorylation sites by integrating various protein sequence attributes with a support vector machine. The model trained by the animal phosphorylation sites was also applied to a plant phosphorylation site dataset as an independent test (Dou, et al., 2014).

11. PHOSFER: a novel tool for predicting phosphorylation sites in soybean using known phosphorylation sites from both soybean and other organisms (Trost and Kusalik, 2013).

12. PhosphoRice: a meta-predictor of rice-specific phosphorylation sites that uses a weighted voting strategy with parameters selected by restricted grid search and conditional random search (Que, et al., 2012).
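
The idea behind CRP (item 2 above) can be illustrated with a minimal sketch: cleave the sequence in silico, then report the Edman cycle (the position within each peptide) at which every serine, threonine or tyrosine would be released. The sketch assumes a simplified trypsin rule (cleave after K or R unless the next residue is P), and the function names and example sequence are hypothetical. It is not the CRP implementation.

```python
def tryptic_peptides(sequence):
    """Cleave after K or R unless the next residue is P (simplified trypsin rule)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Keep the C-terminal peptide that does not end in K or R.
        peptides.append(sequence[start:])
    return peptides


def edman_cycles(sequence):
    """List (peptide, residue, protein position, Edman cycle) for every S, T or Y."""
    results = []
    offset = 0  # number of residues preceding the current peptide
    for peptide in tryptic_peptides(sequence):
        for cycle, residue in enumerate(peptide, start=1):
            if residue in "STY":
                results.append((peptide, residue, offset + cycle, cycle))
        offset += len(peptide)
    return results


if __name__ == "__main__":
    # Made-up sequence, for illustration only.
    for row in edman_cycles("MKSAPRTYK"):
        print(row)
```

Both positions are 1-based. A different protease would only require a different cleavage rule in tryptic_peptides.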

<3> Prediction of kinase-specific phosphorylation sites or phospho-binding motifs:

1. GPS 5.0: an older version of the GPS system. GPS 5.0 introduced two novel methods, PWD and SMO, to improve the performance for predicting kinase-specific p-sites of 479 human PKs and 44,795 PKs in 161 eukaryotes (Xue, et al., 2019).

2. GPS 2.1: an older version of the GPS system. We renamed the tool as the Group-based Prediction System. GPS 2.1 software was implemented in Java and could predict kinase-specific phosphorylation sites for 408 human protein kinases in a hierarchy (Xue, et al., 2008).

3. GPS 1.10: an older version of GPS. We designed a novel algorithm, GPS (Group-based Phosphorylation sites Prediction), and constructed an easy-to-use web server for experimentalists (Xue, et al., 2005; Zhou, et al., 2004).

4. PPSP 1.0: we also developed another online program for prediction of kinase-specific phosphorylation sites, based on Bayesian Decision Theory (BDT) (Xue, et al., 2006).

5. ScanProsite: consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them (de Castro, et al., 2006; Hulo, et al., 2008).

6. ELM: is a resource for predicting functional sites in eukaryotic proteins (Puntervoll, et al., 2010).

7. Minimotif Miner: analyzes protein queries for the presence of short functional motifs that, in at least one protein, have been demonstrated to be involved in posttranslational modifications (PTM), binding to other proteins, nucleic acids, or small molecules, or protein trafficking (Balla, et al., 2006; Rajasekaran, et al., 2009).

8. PhosphoMotif Finder: contains known kinase/phosphatase substrate as well as binding motifs that are curated from the published literature. It reports the PRESENCE of any literature-derived motif in the query sequence (Amanchy, et al., 2007).

9. PREDIKIN 1.0: produces a prediction of substrates for serine/threonine protein kinases based on the primary sequence of a protein kinase catalytic domain (Brinkworth, et al., 2003).

10. Predikin & PredikinDB 2.0: consists of two components: (i) PredikinDB, a database of phosphorylation sites that links substrates to kinase sequences and (ii) a Perl module, which provides methods to classify protein kinases, reliably identify substrate-determining residues, generate scoring matrices and score putative phosphorylation sites in query sequences (Saunders, et al., 2008; Saunders and Kobe, 2008).

11. ScanSite 2.0: searches for motifs within proteins that are likely to be phosphorylated by specific protein kinases or bind to domains such as SH2 domains, 14-3-3 domains or PDZ domains (Obenauer, et al., 2003).

12. NetPhosK 1.0: produces neural network predictions of kinase-specific eukaryotic protein phosphorylation sites. Currently NetPhosK covers the following kinases: PKA, PKC, PKG, CKII, Cdc2, CaM-II, ATM, DNA PK, Cdk5, p38 MAPK, GSK3, CKI, PKB, RSK, INSR, EGFR and Src (Blom, et al., 2004).

13. PredPhospho 1.0: implemented in SVM algorithm, could predict kinase-specific phosphorylation sites for 4 kinase groups and 4 kinase families, respectively (Kim, et al., 2004).

14. PredPhospho 2.0: an enhanced version of the PredPhospho predictor, still implemented with the SVM algorithm, for 7 kinase groups and 18 kinase families, respectively (Ryu, et al., 2009).

15. KinasePhos 1.0: predicts kinase-specific phosphorylation sites within given protein sequences. A profile hidden Markov model (HMM) is trained on each group of sequences surrounding the phosphorylated residues (Huang, et al., 2005).

16. KinasePhos 2.0: a new version of the kinase-specific phosphorylation site prediction tool that uses sequence-based amino acid coupling-pattern analysis and solvent accessibility as new features for the SVM (support vector machine) (Wong, et al., 2007).

17. PhoScan: predicts kinase-specific phosphorylation sites with sequence features by a log-odds ratio approach (Li, et al., 2007).

18. pkaPS: Prediction of protein kinase A phosphorylation sites using the simplified kinase binding model (Neuberger, et al., 2007).

19. CRPhos 0.8: Prediction of kinase-specific phosphorylation sites using conditional random fields. Its source code is free for academic research and could be compiled in Linux/Unix OS (Dang, et al., 2008).

20. AutoMotif 2.0: allows for identification of PTM (post-translational modification) sites, including phosphorylation sites in proteins. The AutoMotif Server 2.0 trained a support vector machine (SVM) for each type of PTM separately on proteins of the Swiss-Prot database (version 42.0) (Plewczynski, et al., 2005; Plewczynski, et al., 2008).

21. MetaPredPS: Meta-predictors make predictions by organizing and processing the predictions produced by several other predictors in a defined problem domain (Wan, et al., 2008).

22. SMALI: searches for peptide ligands in human proteins that are likely to bind to SH2 domains (Huang, et al., 2008; Li, et al., 2008).

23. NetPhorest: is a non-redundant collection of 125 sequence-based classifiers for linear motifs in phosphorylation-dependent signaling. The collection contains both family-based and gene-specific classifiers (Miller, et al., 2008; Miller, et al., 2008; Horn, et al., 2014).

24. SiteSeek: is trained using a novel compact evolutionary and hydrophobicity profile to detect possible protein phosphorylation sites for a target sequence (Yoo, et al., 2008). The tool is not available.

25. PostMod: a prediction server for phosphorylation sites. The authors combined physicochemical information, motif information, and evolutionary information by simply comparing sequence similarities, and could predict phosphorylation sites for 48 different kinases (Jung, et al., 2010).

26. iGPS 1.0: we developed the iGPS software package (GPS algorithm with the interaction filter, or in vivo GPS) mainly for the prediction of in vivo site-specific kinase-substrate relations (ssKSRs). Eukaryotic PKs were classified into a hierarchy with four levels: group, family, subfamily, and single PK. Based on the hypothesis that similar PKs recognize similar short linear motifs (SLMs), we selected a predictor in GPS 2.0 for each PK and directly predicted the potential PKs for the non-annotated p-sites from phosphoproteomic studies. Protein-protein interaction (PPI) information was then used as the major contextual factor to filter out potentially false-positive hits (Song, et al., 2012).

27. Musite: a tool for global prediction of general and kinase-specific phosphorylation sites. The authors collected phosphoproteomics data in multiple organisms from several reliable sources and used them to train prediction models by a comprehensive machine-learning approach that integrates local sequence similarities to known phosphorylation sites, protein disorder scores, and amino acid frequencies (Gao, et al., 2010; Yao, et al., 2012).

28. MusiteDeep: the first deep-learning framework for predicting general and kinase-specific phosphorylation sites (Wang, et al., 2017).

29. PlantPhos: is a web tool for predicting potential phosphorylation sites in plant proteins with various substrate motifs based on Hidden Markov Models (HMM) and Maximal Dependence Decomposition (MDD) (Lee, et al., 2011).

30. PSEA: the authors proposed a new method called PSEA (Phosphorylation Set Enrichment Analysis) to detect new sites phosphorylated by a specific kinase, kinase family or kinase group. For each query, they assigned a P-value according to its similarity to known phosphorylated sites. The smaller the P-value, the more likely it is that the given peptides are phosphorylated by the chosen kinase type (Suo, et al., 2014).

31. PKIS: a kinase identification web server based on the latest version of Phospho.ELM (9.0). PKIS incorporates support vector machines (SVMs) with the composition of monomer spectrum (CMS) to assign protein kinases to experimentally verified human P-sites with high specificity (Zou, et al., 2013).

32. PTMPred: applies a support vector machine (SVM) with a kernel matrix computed from PSPMs (position-specific propensity matrices) to predict posttranslational modification sites (Xu, et al., 2014).

33. Phos_pred: a method for the prediction of protein kinase-specific phosphorylation sites in hierarchical structure using functional information and random forest (Fan, et al., 2014).

34. PhosK3D: a web server for identifying kinase-specific phosphorylation sites on protein sequences and three-dimensional structures (Su and Lee, 2013).

35. PhosphoPICK: a method for predicting kinase substrates using cellular context information, and is currently able to make predictions for 59 human kinases (Patrick, et al., 2013).

36. DeepPSP: a global-local information-based deep neural network for the prediction of protein phosphorylation sites (Guo, et al., 2021).

37. PhosIDN: an integrated deep neural network for improving protein phosphorylation site prediction by combining sequence and protein-protein interaction information (Yang, et al., 2021).

38. GasPhos: a kinase-specific phosphorylation prediction system using a new feature selection approach with a GA-aided ant colony system (Chen, et al., 2020).

39. KinasePhos 3.0: a new version of the kinase-specific phosphorylation site prediction tool, with 771 predictive models (Ma, et al., 2022).
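
Several predictors in this section score the residues around a candidate site against known substrates of a kinase. As a rough illustration of the log-odds ratio idea mentioned for PhoScan (item 17), the sketch below builds a position-specific log-odds matrix from a few training windows and sums the per-position scores for a query window. The training peptides, the uniform background, the pseudocount and all function names are assumptions made for this example. It does not reproduce PhoScan's features or any other tool's method.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_log_odds(windows, pseudocount=1.0):
    """Return one {residue: log-odds} dict per window position.

    Assumes equal-length windows over the 20 standard residues and a
    uniform background frequency (a simplification for illustration).
    """
    background = 1.0 / len(AMINO_ACIDS)
    total = len(windows) + pseudocount * len(AMINO_ACIDS)
    matrix = []
    for position in range(len(windows[0])):
        counts = Counter(window[position] for window in windows)
        matrix.append({
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return matrix


def score_window(matrix, window):
    """Sum the per-position log-odds of a query window."""
    return sum(column[aa] for column, aa in zip(matrix, window))


if __name__ == "__main__":
    # Made-up training windows centered loosely on a basophilic pattern.
    training = ["RRASV", "RRGSL", "KRKSI", "RRLSF"]
    matrix = build_log_odds(training)
    for query in ("RRASV", "GGASG"):
        print(query, round(score_window(matrix, query), 3))
```

A window resembling the training set receives a higher score than an unrelated one. In practice the background frequencies would come from a proteome and the cutoff from cross-validation.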

<4> Miscellaneous tools:

1. DOG 2.0: prepares publication-quality figures of protein domain structures. The scale of a protein domain and the position of a functional motif/site are precisely calculated (Ren, et al., 2009).

2. Motif-X: a software tool designed to extract overrepresented patterns from any sequence data set. The algorithm uses an iterative strategy which builds successive motifs through comparison to a dynamic statistical background (Schwartz and Gygi, 2005; Chou and Schwartz, 2005).

3. Scan-X: is a software tool designed to find motifs (identified using motif-x) within any sequence data set. The first large scale scan was performed using all available human, mouse, fly and yeast phosphorylation and acetylation data to perform a scan for undiscovered sites (Schwartz, et al., 2008).

4. MoDL: finds multiple motifs in a set of phosphorylated peptides (Ritz, et al., 2009).

5. PhosphoBlast: allows the user to submit a protein query to search against the curated dataset of phosphorylated peptides (Wang and Klemke, 2008).

6. RLIMS-P: is a rule-based text-mining program specifically designed to extract protein phosphorylation information on protein kinase, substrate and phosphorylation sites from the abstracts (Hu, et al., 2005; Yuan, et al., 2006).

7. KEA: Kinase enrichment analysis (KEA) is a web-based tool with an underlying database providing users with the ability to link lists of mammalian proteins/genes with the kinases that phosphorylate them (Lachmann and Ma'ayan, 2009).

8. HemI 1.0: an easy-to-use tool that visualizes either gene or protein expression data in heatmaps. The heatmaps can be recolored, rescaled or rotated in a customized manner. In addition, HemI provides multiple clustering strategies for analyzing the data. Publication-quality figures can be exported directly (Deng, et al., 2014).
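
Scanning a sequence for a known motif, as Scan-X (item 3 above) does with motifs identified by motif-x, can be sketched with a regular expression. The example below reports every occurrence of a pattern, including overlapping ones, by wrapping it in a lookahead. The motif R..S, the sequence and the function name are hypothetical and serve only as an illustration. This is not the Scan-X scoring procedure.

```python
import re


def scan_motif(sequence, pattern):
    """Return (1-based start, matched text) for every occurrence of the pattern.

    The pattern is wrapped in a lookahead so that overlapping matches
    are reported as well.
    """
    lookahead = re.compile("(?=(" + pattern + "))")
    return [(m.start() + 1, m.group(1)) for m in lookahead.finditer(sequence)]


if __name__ == "__main__":
    # Made-up sequence and motif, for illustration only.
    for start, match in scan_motif("MRRASVRKLSA", "R..S"):
        print(start, match)
```

The pattern uses standard regular expression syntax, so a residue class such as [ST] can replace a single letter.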

<5> Implemented tools:

1. Echarts: An Open Source JavaScript Visualization Library.

2. IUPred: the web server takes a single amino acid sequence as input and calculates the pairwise energy profile along the sequence (Dosztányi, et al., 2021).

3. 3Dmol.js: A modern, object-oriented JavaScript library for visualizing molecular data.

4. NetSurfP: a tool for predicting solvent accessibility, secondary structure, structural disorder and backbone dihedral angles for each residue of an amino acid sequence (Høie, et al., 2022).

<6> Detection of potential phosphorylation sites from mass spectrometry data:

1. PhosphoScore: is a phosphorylation assignment program that is compatible with all levels of tandem mass spectrometry spectra (MSn) generated through the Bioworks/Sequest platform. The program utilizes a "cost function" which takes into account both the match quality and normalized intensity of observed spectral peaks compared to a theoretical spectrum. PhosphoScore was written in Java (Ruttenberg, et al., 2008).

2. Ascore: measures the probability of correct phosphorylation site localization based on the presence and intensity of site-determining ions in MS/MS spectra (Beausoleil, et al., 2006).

3. Colander: a probability-based support vector machine algorithm for automatic screening for CID spectra of phosphopeptides prior to database search (Lu, et al., 2008).

4. DeBunker: an SVM-based software tool that automatically validates phosphopeptide identifications from tandem mass spectra (Lu, et al., 2007).

5. APIVASE 2.2: was developed for phosphopeptide validation by combining the information obtained from MS2 spectra and its corresponding neutral loss MS3 spectra (Jiang, et al., 2008).

6. InsPecT: a new scoring function was developed for phosphorylated peptide tandem mass spectra for ion-trap instruments, without the need for manual validation (Payne, et al., 2008).

7. Phosphopeptide FDR Estimator: is designed for analysis of phosphopeptide LC-MS/MS data (Du, et al., 2008). The tool is not available.

8. PhosTShunter: a fast and reliable tool to detect phosphorylated peptides in liquid chromatography Fourier transform tandem mass spectrometry data sets (Kocher, et al., 2006). The tool is not available.

9. PhosphoScan: a probability-based method for phosphorylation site prediction using MS2/MS3 pair information (Wan, et al., 2008). The tool is not available.

10. ArMone: a new scoring function was developed for phosphorylated peptide tandem mass spectra for ion-trap instruments, without the need for manual validation (Jiang, et al., 2010).