Metadata-Version: 2.4
Name: scarse
Version: 1.0.0
Summary: A package for training SCARSE on small sample sizes of peptide sequences so that it can later predict properties of unseen peptides, making SCARSE well suited for AI-infused peptide engineering.
Home-page: https://github.com/LeoAnd00/scarse
Author: Leo Andrekson, Robin Rydbergh, Rocío Mercado, Michaela Wenzel
Author-email: leo.andrekson@chalmers.se, robin.rydbergh@chalmers.se, rocom@chalmers.se, wenzelm@chalmers.se
License: MIT
Classifier: Programming Language :: Python :: 3
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: alembic==1.18.4
Requires-Dist: annotated-doc==0.0.4
Requires-Dist: anyio==4.12.1
Requires-Dist: asttokens==3.0.1
Requires-Dist: bottle==0.13.4
Requires-Dist: bottle-websocket==0.2.9
Requires-Dist: certifi==2026.2.25
Requires-Dist: cffi==2.0.0
Requires-Dist: click==8.3.1
Requires-Dist: colorama==0.4.6
Requires-Dist: colorlog==6.10.1
Requires-Dist: comm==0.2.3
Requires-Dist: debugpy==1.8.20
Requires-Dist: decorator==5.2.1
Requires-Dist: executing==2.2.1
Requires-Dist: filelock==3.20.0
Requires-Dist: fsspec==2025.12.0
Requires-Dist: future==1.0.0
Requires-Dist: gevent==25.9.1
Requires-Dist: gevent-websocket==0.10.1
Requires-Dist: greenlet==3.3.2
Requires-Dist: h11==0.16.0
Requires-Dist: hf-xet==1.3.2
Requires-Dist: httpcore==1.0.9
Requires-Dist: httpx==0.28.1
Requires-Dist: huggingface_hub==1.6.0
Requires-Dist: idna==3.11
Requires-Dist: importlib_resources==6.5.2
Requires-Dist: ipykernel==7.2.0
Requires-Dist: ipython==9.11.0
Requires-Dist: ipython_pygments_lexers==1.1.1
Requires-Dist: jedi==0.19.2
Requires-Dist: Jinja2==3.1.6
Requires-Dist: joblib==1.5.3
Requires-Dist: jupyter_client==8.8.0
Requires-Dist: jupyter_core==5.9.1
Requires-Dist: Mako==1.3.10
Requires-Dist: markdown-it-py==4.0.0
Requires-Dist: MarkupSafe==3.0.2
Requires-Dist: matplotlib-inline==0.2.1
Requires-Dist: mdurl==0.1.2
Requires-Dist: mpmath==1.3.0
Requires-Dist: nest-asyncio==1.6.0
Requires-Dist: networkx==3.6.1
Requires-Dist: numpy==2.3.5
Requires-Dist: optuna==4.7.0
Requires-Dist: packaging==26.0
Requires-Dist: pandas==3.0.1
Requires-Dist: parso==0.8.6
Requires-Dist: pillow==12.0.0
Requires-Dist: platformdirs==4.9.4
Requires-Dist: prompt_toolkit==3.0.52
Requires-Dist: psutil==7.2.2
Requires-Dist: pure_eval==0.2.3
Requires-Dist: pycparser==3.0
Requires-Dist: Pygments==2.19.2
Requires-Dist: pyparsing==3.3.2
Requires-Dist: python-dateutil==2.9.0.post0
Requires-Dist: PyYAML==6.0.3
Requires-Dist: pyzmq==27.1.0
Requires-Dist: regex==2026.2.28
Requires-Dist: rich==14.3.3
Requires-Dist: safetensors==0.7.0
Requires-Dist: scikit-learn==1.8.0
Requires-Dist: scipy==1.17.1
Requires-Dist: setuptools==70.2.0
Requires-Dist: shellingham==1.5.4
Requires-Dist: six==1.17.0
Requires-Dist: SQLAlchemy==2.0.48
Requires-Dist: stack-data==0.6.3
Requires-Dist: sympy==1.14.0
Requires-Dist: threadpoolctl==3.6.0
Requires-Dist: tokenizers==0.22.2
Requires-Dist: torch==2.10.0
Requires-Dist: torchvision==0.25.0
Requires-Dist: tornado==6.5.4
Requires-Dist: tqdm==4.67.3
Requires-Dist: traitlets==5.14.3
Requires-Dist: transformers==5.3.0
Requires-Dist: typer==0.24.1
Requires-Dist: typing_extensions==4.15.0
Requires-Dist: tzdata==2025.3
Requires-Dist: wcwidth==0.6.0
Requires-Dist: zope.event==6.1
Requires-Dist: zope.interface==8.2
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# SCARSE: Small-sample Classification And Regression Solution for low-resource peptide Engineering
<p align="center">
  <img src="figures/scarse_banner.png" alt="Workflow Diagram" style="width: 100%; height: auto;" />
</p>

## Abstract
Reliable estimation of downstream performance in low-data peptide machine learning is critical for guiding early-stage AI-driven peptide engineering, yet it is often unclear how to assess whether a model will be effective in iterative discovery settings. Here, we show that the cross-validation R² score can serve as a simple and robust proxy for predicting active learning workflow performance, enabling early-stage evaluation of model suitability for sequential peptide optimization. To support this, we introduce SCARSE, a machine learning framework combining ESM-2 protein language model embeddings with Gaussian process regression and extremely randomized trees classification, designed for low-resource peptide property prediction (20–500 training samples). We benchmark SCARSE across 23 peptide and small-protein datasets covering substitution and indel variants, antimicrobial peptides, cell-penetrating peptides, and toxic/non-toxic peptides. The protein language model approach significantly outperforms a hand-engineered descriptor baseline on substitution and indel tasks, while the two approaches achieve comparable performance on shorter, non-mutant peptide datasets where simpler descriptors capture enough of the signal. In simulated active learning workflows, SCARSE consistently outperforms baseline and random sampling strategies, and notably we demonstrate that CV R² computed from as few as 50 labeled peptides can be sufficient to estimate final active learning endpoint performance, providing a practical, data-efficient criterion for deciding whether a given dataset combined with SCARSE is suitable for iterative peptide discovery. SCARSE is released as a pip package and is available via HuggingFace Spaces to facilitate integration into peptide engineering workflows.

## How SCARSE works

SCARSE is designed for peptide property prediction in low-data regimes by combining protein language model embeddings with classical machine learning methods.

The workflow consists of the following steps:

1. **Input data**
   - A CSV file containing peptide sequences and one or more target variables.
   - The sequence column (`seq_col`) should contain amino acid sequences.
   - The target column(s) (`score_col`) contain regression values or class labels.

2. **Sequence embedding**
   - Sequences are converted into numerical representations using the ESM-2 protein language model.

3. **Model selection**
   - Depending on the task:
     - **Regression** → Gaussian Process Regression  
     - **Classification** → Extremely Randomized Trees  
   - These models are chosen for robustness in small-sample settings.

4. **Hyperparameter optimization**
   - Models are tuned using cross-validation and Optuna-based optimization.
   - The number of folds and optimization trials can be controlled by the user.

5. **Training output**
   - Cross-validation performance metrics are returned.
   - A trained model environment is stored internally and reused for prediction.

6. **Prediction**
   - New sequences are embedded using the same pipeline.
   - The trained model generates predictions for each target variable.
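
The input file described in step 1 can be produced with Python's standard `csv` module. The sketch below is only an illustration: the column names `sequence` and `score` mirror the usage examples in this README, while the helper `write_training_csv`, the output file name, and the toy rows are hypothetical and not part of SCARSE.

```python
import csv
import tempfile
from pathlib import Path

# Toy rows for illustration only; these sequences and scores are made up.
rows = [
    {"sequence": "ACDEFGHIK", "score": 0.12},
    {"sequence": "LMNPQRSTV", "score": 0.87},
    {"sequence": "WYACDEFGH", "score": 0.45},
]


def write_training_csv(path, rows, seq_col="sequence", score_cols=("score",)):
    """Write rows to a CSV with one sequence column and one or more target columns.

    Hypothetical helper, not part of the SCARSE API.
    """
    fieldnames = [seq_col, *score_cols]
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return fieldnames


# Write the example file to the system temp directory.
csv_path = Path(tempfile.gettempdir()) / "scarse_train_example.csv"
header = write_training_csv(csv_path, rows)
```

A file written this way has a header row naming the sequence column and each target column, followed by one peptide per row.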

---

## Notes and best practices

- **Call order matters**  
  You must run `scarse.train()` before calling `scarse.pred()`, as the trained model is stored internally.

- **Data quality is important**  
  - Ensure no missing or empty sequences  
  - Use consistent formatting (standard amino acid codes: `ACDEFGHIKLMNPQRSTVWY`)

- **Small datasets are supported**  
  SCARSE has been evaluated on datasets as small as ~20 samples, but performance generally improves with more data.
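
The data quality checks above can be automated before training. The following is a minimal sketch using only the standard library. The function `find_invalid_sequences` is a hypothetical helper, not part of SCARSE, and it assumes uppercase one-letter codes, which may be stricter than SCARSE itself requires.

```python
import csv
import io

# The 20 standard amino acid one-letter codes listed above.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def find_invalid_sequences(csv_text, seq_col="sequence"):
    """Return (row_number, sequence, reason) for each problematic row.

    Row numbers count data rows from 1; the header row is not counted.
    Hypothetical helper, not part of the SCARSE API.
    """
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        sequence = (row.get(seq_col) or "").strip()
        if not sequence:
            problems.append((row_number, sequence, "empty sequence"))
            continue
        # Collect every character that is not a standard amino acid code.
        unexpected = sorted(set(sequence) - STANDARD_AMINO_ACIDS)
        if unexpected:
            reason = "non-standard characters: " + "".join(unexpected)
            problems.append((row_number, sequence, reason))
    return problems


# Made-up example: row 2 is empty, row 3 contains the non-standard codes X and B.
example_csv = "sequence,score\nACDEFGHIK,0.12\n,0.50\nACDXB,0.33\n"
issues = find_invalid_sequences(example_csv)
```

An empty `issues` list means every row passed both checks. Otherwise each entry identifies the data row to fix before calling `scarse.train()`.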

## Tested for Python version
- Python version == 3.12.10

## Setup
```
pip install scarse
```

## Usage

### For training on a regression problem:
```
import scarse

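scarse.train(data_path="../app/train.csv",  # CSV with sequences and targets
             classification=False,          # False selects regression
             seq_col="sequence",            # column holding amino acid sequences
             score_col=["score"])           # one or more target columns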
```
### For training on a classification problem:
```
import scarse

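scarse.train(data_path="../app/train.csv",  # CSV with sequences and targets
             classification=True,           # True selects classification
             seq_col="sequence",            # column holding amino acid sequences
             score_col=["classes"])         # one or more class-label columns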
```
### For predicting after the model has been trained:
```
# scarse.train() must have been run first; the trained model is stored internally.
df_pred = scarse.pred(data_path="../app/test.csv", seq_col="sequence")
```

## Tutorials
See the following tutorial, structured as a Python notebook:
* [tutorial.ipynb](tutorial.ipynb)

## Correlation with active learning endpoint performance
Below we illustrate the relation between the CV R² score and active learning endpoint performance. <br>
The y-axis shows how many times better SCARSE-guided active learning performs than random sampling, measured by the accumulation of the top 10% of peptides. <br>
By comparing the CV R² score of your data with the corresponding figure below for your dataset size, you can get an indication of how suitable your data, combined with SCARSE, is for active learning peptide engineering. <br>
Note that this guide applies only to regression problems. <br>

<p align="center">
  <img src="figures/active_learning_performance.png" alt="CV R² score versus active learning endpoint performance" width="500"/>
</p>
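
For reference, R² is the coefficient of determination, 1 - SS_res / SS_tot, computed on held-out predictions. The sketch below shows that formula in plain Python. It assumes predictions pooled across folds and uses made-up numbers. How SCARSE aggregates R² across folds is not specified here, so treat this as an illustration of the metric rather than SCARSE's exact computation.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) < 2:
        raise ValueError("need two equally sized sequences with at least 2 values")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: squared error of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared deviation of the true values from their mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R² is undefined when all true values are identical")
    return 1.0 - ss_res / ss_tot


# Made-up held-out values and predictions, for illustration only.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
cv_r2 = r2_score(measured, predicted)
```

A value near 1 means the held-out predictions track the measurements closely, while a value at or below 0 means the model does no better than always predicting the mean.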

## Citation
Coming soon!
