Prediction of protein solubility based on sequence physicochemical patterns and distributed representation information with DeepSoluE

2023 Jan 24;21(1):12.

doi: 10.1186/s12915-023-01510-8.

Chao Wang 1, Quan Zou 2

Affiliations

1 School of Software Engineering, Chengdu University of Information Technology, Chengdu, China.

2 Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China. zouquan@nclab.net.

PMID: 36694239

PMCID: PMC9875434

DOI: 10.1186/s12915-023-01510-8

Free PMC article

Chao Wang et al. BMC Biol. 2023.

Abstract

Background: Protein solubility is a precondition for efficient heterologous protein expression, which underlies most industrial applications, and for functional interpretation in basic research. However, the recurrent formation of inclusion bodies remains a persistent roadblock in protein science and industry: only about a quarter of proteins can be successfully expressed in soluble form. Although numerous solubility prediction models have been developed, their performance remains unsatisfactory given the rapid growth in available protein sequences. Hence, novel and highly accurate predictors are needed to prioritize highly soluble proteins and reduce the cost of experimental work.

Results: In this study, we developed a novel tool, DeepSoluE, which predicts protein solubility using a long short-term memory (LSTM) network with hybrid features composed of physicochemical patterns and distributed representations of amino acids. Comparisons showed that the proposed model achieved more accurate and balanced performance than existing tools. Furthermore, we explored the specific features that have a dominant impact on model performance, as well as their interaction effects.

Conclusions: DeepSoluE is suitable for predicting protein solubility in E. coli and serves as a bioinformatics tool for prescreening potentially soluble targets, reducing the cost of wet-lab experiments. The web server is freely accessible at http://lab.malab.cn/~wangchao/softs/DeepSoluE/ .

Keywords: Feature embedding; Interpretation; Machine learning; Protein solubility.
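
The hybrid-feature idea mentioned in the Results can be illustrated with a minimal sketch. It is not the DeepSoluE implementation: the abstract does not specify which physicochemical descriptors, embedding method, or network settings the authors used. The sketch assumes amino acid composition plus mean Kyte-Doolittle hydropathy as the physicochemical part, and a hypothetical `ToyEmbedding` table of random 3-mer vectors as a stand-in for a trained distributed representation. Averaging the 3-mer vectors is a simplification; an LSTM would instead consume them as a sequence.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy scale (one value per standard amino acid).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def composition(seq):
    """Fraction of each of the 20 standard amino acids in the sequence."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]


def gravy(seq):
    """Mean hydropathy of the sequence on the Kyte-Doolittle scale."""
    return sum(KD[a] for a in seq) / len(seq)


def kmers(seq, k=3):
    """Overlapping k-mers, treated as 'words' for the embedding lookup."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class ToyEmbedding:
    """Hypothetical stand-in for a trained k-mer embedding table.

    Vectors are random (fixed seed), so they carry no biological meaning.
    """

    def __init__(self, dim=8, seed=0):
        self.dim = dim
        self._rng = random.Random(seed)
        self._table = {}

    def vector(self, token):
        # Create a vector on first use and reuse it afterwards.
        if token not in self._table:
            self._table[token] = [self._rng.uniform(-1, 1) for _ in range(self.dim)]
        return self._table[token]


def hybrid_features(seq, emb, k=3):
    """Concatenate physicochemical descriptors with a mean k-mer embedding."""
    seq = seq.upper()
    if len(seq) < k or any(a not in KD for a in seq):
        raise ValueError("sequence too short or contains non-standard residues")
    physchem = composition(seq) + [gravy(seq)]          # 20 + 1 values
    vectors = [emb.vector(t) for t in kmers(seq, k)]
    mean_vec = [sum(col) / len(vectors) for col in zip(*vectors)]
    return physchem + mean_vec                          # 21 + emb.dim values


if __name__ == "__main__":
    emb = ToyEmbedding(dim=8, seed=0)
    feats = hybrid_features("MKTAYIAKQR", emb)
    print(len(feats))
```

In a real pipeline the feature vector (or the per-position embedding sequence) would be fed to a trained classifier; here the output is only a fixed-length list suitable for inspection.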

© 2023. The Author(s).

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1. Comparison of different feature selection methods. A–E Metric values and feature…

Fig. 2. Performance comparison of DeepSoluE and 11 conventional machine learning methods

Fig. 3. Feature contribution and dependency analysis. A The 20 most important features. B Summary…

Fig. 4. The DeepSoluE workflow. A Physicochemical feature encoding, feature optimization, and distributed representation of…

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Amino Acid Sequence

Computational Biology / methods

Escherichia coli* / genetics

Escherichia coli* / metabolism

Protein Processing, Post-Translational

Proteins* / metabolism

Solubility

Substances

Proteins

Grants and funding

62002051/National Natural Science Foundation of China

62131004/National Natural Science Foundation of China

62272065/National Natural Science Foundation of China

62250028/National Natural Science Foundation of China
