Cell-penetrating peptides (CPPs) can translocate into cells together with various biologically active conjugates and have numerous biomedical applications. Several machine learning (ML)-based predictors have been proposed, but most focus only on identifying CPPs. In 2018, we proposed a two-layer predictor called MLCPP that predicts CPPs and their strength simultaneously.
Although MLCPP has been widely accessed by the research community, its prediction quality needed further improvement for practical application. We therefore developed MLCPP 2.0, an updated, interpretable stacking model for accurately identifying CPPs and their strength. We updated the benchmarking dataset, explored 17 different sequence-based feature encoding algorithms, and utilized seven different conventional ML classifiers. Specifically, we constructed several baseline models using multiple rounds of 10-fold cross-validation; their predicted probability values were merged and treated as a new feature vector. A feature selection technique was then applied, and the selected features were fed into an appropriate classifier to construct an effective stacked model. Our analysis revealed that 80 and 40 baseline models are essential for MLCPP 2.0 to accurately identify CPPs and their strength, respectively. MLCPP 2.0 achieved excellent performance on the independent test set, significantly outperforming the existing state-of-the-art predictors.
We believe MLCPP 2.0 will substantially advance the discovery of novel CPPs and facilitate hypothesis-driven experimental design.
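The stacking idea described above (baseline models trained under 10-fold cross-validation, with their predicted probabilities merged into a new feature vector for a second-stage classifier) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the MLCPP 2.0 implementation: the sequences are synthetic, `aac` and `charge_hydrophobicity` are two toy encodings standing in for the 17 encoding algorithms, `CentroidModel` is a hypothetical stand-in for the seven conventional classifiers, only two baseline models are built, a single round of cross-validation is used, and the feature selection step is omitted.

```python
import math
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(seq):
    """Toy encoding 1: amino acid composition (20 relative frequencies)."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]


def charge_hydrophobicity(seq):
    """Toy encoding 2: fraction of K/R and fraction of hydrophobic residues."""
    n = len(seq)
    return [sum(c in "KR" for c in seq) / n, sum(c in "AILMFVW" for c in seq) / n]


class CentroidModel:
    """Hypothetical stand-in classifier: scores by distance to class centroids."""

    def fit(self, X, y):
        self.centroids = {}
        for label in (0, 1):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_proba(self, X):
        """Return a pseudo-probability of class 1 for each row of X."""
        probs = []
        for x in X:
            d0 = math.dist(x, self.centroids[0])
            d1 = math.dist(x, self.centroids[1])
            probs.append(d0 / (d0 + d1) if d0 + d1 > 0 else 0.5)
        return probs


def out_of_fold_probs(X, y, k=10, seed=0):
    """Predict each sample with a model trained on the other k-1 folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    oof = [0.0] * len(X)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = CentroidModel().fit([X[i] for i in train], [y[i] for i in train])
        for i, p in zip(held_out, model.predict_proba([X[i] for i in held_out])):
            oof[i] = p
    return oof


def build_meta_features(seqs, y, encoders, k=10):
    """One baseline model per encoder; merge their out-of-fold probabilities."""
    columns = [out_of_fold_probs([enc(s) for s in seqs], y, k) for enc in encoders]
    return [list(row) for row in zip(*columns)]


def toy_dataset(n_per_class=20, seed=1):
    """Synthetic sequences only: 'positives' are K/R-rich, 'negatives' are not."""
    rng = random.Random(seed)
    pos = ["".join(rng.choice("KRRKWLAG") for _ in range(15)) for _ in range(n_per_class)]
    neg = ["".join(rng.choice("DEESTGAN") for _ in range(15)) for _ in range(n_per_class)]
    return pos + neg, [1] * n_per_class + [0] * n_per_class


seqs, labels = toy_dataset()
meta_X = build_meta_features(seqs, labels, [aac, charge_hydrophobicity], k=10)

# Second-stage (meta) model trained on the merged probability vector.
meta_model = CentroidModel().fit(meta_X, labels)
preds = [int(p >= 0.5) for p in meta_model.predict_proba(meta_X)]

# Training-set sanity check only; this is not a performance estimate.
train_acc = sum(p == t for p, t in zip(preds, labels)) / len(labels)
```

Out-of-fold predictions are used for the meta-features so that no baseline model scores a sequence it was trained on, which keeps the second-stage classifier from learning from leaked labels. In a fuller pipeline, a feature selection step would sit between `build_meta_features` and the meta-model, and evaluation would use a held-out independent set.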