Metadata-Version: 2.4
Name: s8kpred
Version: 0.1.0
Summary: Protein secondary structure prediction (3-state & 8-state) using XGBoost and PSSM features
License: MIT
Project-URL: Homepage, https://github.com/mayank2801/s8kpred
Project-URL: Documentation, https://github.com/mayank2801/s8kpred#readme
Project-URL: Bug Tracker, https://github.com/mayank2801/s8kpred/issues
Project-URL: Source, https://github.com/mayank2801/s8kpred
Keywords: bioinformatics,protein,secondary structure,machine learning,xgboost,PSSM,PSI-BLAST
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Operating System :: OS Independent
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.24
Requires-Dist: pandas>=2.0
Requires-Dist: xgboost>=1.7
Requires-Dist: biopython>=1.80
Requires-Dist: scikit-learn>=1.3
Provides-Extra: plot
Requires-Dist: biotite>=0.38; extra == "plot"
Requires-Dist: matplotlib>=3.7; extra == "plot"
Provides-Extra: dev
Requires-Dist: pytest>=7; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Requires-Dist: mypy; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"
Requires-Dist: biotite>=0.38; extra == "dev"
Requires-Dist: matplotlib>=3.7; extra == "dev"
Dynamic: license-file

# S8kPred — Protein Secondary Structure Prediction

[![PyPI](https://img.shields.io/pypi/v/s8kpred)](https://pypi.org/project/s8kpred/)
[![Python](https://img.shields.io/pypi/pyversions/s8kpred)](https://pypi.org/project/s8kpred/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)

**S8kPred** predicts protein secondary structure directly from amino acid sequence using XGBoost models trained on PSSM (Position-Specific Scoring Matrix) and tripeptide propensity features.

- **3-state**: Helix (H), Beta-strand (E), Coil/Loop (L)
- **8-state**: H, G, I, E, B, T, S, L (DSSP eight-state alphabet, with L for loop/coil)

---

## Requirements

| Dependency | Purpose |
|---|---|
| Python ≥ 3.9 | Runtime |
| `numpy`, `pandas`, `xgboost`, `scikit-learn` | Core ML pipeline |
| `biopython` | FASTA I/O for PSI-BLAST |
| **NCBI PSI-BLAST** | PSSM generation (external binary) |
| **UniRef50 (or similar) BLAST database** | PSI-BLAST database |
| `biotite`, `matplotlib` *(optional)* | Cartoon structure plots |

---

## Installation

### From PyPI (recommended)
```bash
pip install s8kpred
```

### With cartoon plot support
```bash
pip install "s8kpred[plot]"    # quotes keep zsh from globbing the brackets
```

### From GitHub (latest development version)
```bash
pip install git+https://github.com/mayank2801/s8kpred.git
```

### From source
```bash
git clone https://github.com/mayank2801/s8kpred.git
cd s8kpred
pip install -e .          # editable install
pip install -e ".[plot]"  # with plotting extras
```

---

## Setting up PSI-BLAST

S8kPred requires NCBI PSI-BLAST to generate evolutionary features. You have two options:

### Option A — System install
```bash
# Ubuntu / Debian
sudo apt install ncbi-blast+

# macOS (Homebrew)
brew install blast

# Conda
conda install -c bioconda -c conda-forge "blast>=2.14"
```

### Option B — Manual download
Download the NCBI BLAST+ toolkit from:
https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/

Then either add the `bin/` folder to your `PATH` or pass the full path via `--psiblast`.
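
Before starting a long run it can help to confirm that the binary is actually reachable. The following is a minimal sketch using only the standard library; `find_psiblast` is a hypothetical helper for illustration and is not part of the s8kpred API.

```python
import shutil
from typing import Optional


def find_psiblast(name_or_path: str = "psiblast") -> Optional[str]:
    """Return the resolved path to a psiblast executable, or None.

    Hypothetical helper (not part of s8kpred). shutil.which accepts a bare
    name, which it searches for on PATH, or a full path, which it checks
    directly; it returns None when nothing executable is found.
    """
    return shutil.which(name_or_path)


if __name__ == "__main__":
    path = find_psiblast()
    if path:
        print(f"psiblast found at {path}")
    else:
        print("psiblast not found: install BLAST+ or pass --psiblast")
```

If the lookup fails for a manual download, passing the full path (for example the `bin/psiblast` file inside the unpacked toolkit) to the same function shows whether the file exists and is executable.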

---

## Setting up a BLAST database

S8kPred works best with **UniRef50**. Download and format it:

```bash
# Download
wget https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref50/uniref50.fasta.gz
gunzip uniref50.fasta.gz

# Build BLAST database
mkdir -p ~/blast_dbs/uniref50
makeblastdb -in uniref50.fasta \
            -dbtype prot \
            -out ~/blast_dbs/uniref50/uniref50 \
            -title "UniRef50"
```

Then point s8kpred at it:
```bash
export S8KPRED_BLASTDB=~/blast_dbs/uniref50/uniref50
```
or pass `--blastdb ~/blast_dbs/uniref50/uniref50` on every invocation.
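
The database path is a prefix rather than a file, so a typo is easy to miss. The sketch below checks for the index files that `makeblastdb -dbtype prot` writes next to the prefix. `blastdb_looks_valid` is a hypothetical helper, not part of s8kpred, and the check is only a heuristic: it confirms that an index file is present, not that the database is complete.

```python
from pathlib import Path


def blastdb_looks_valid(prefix: str) -> bool:
    """Heuristic check that a protein BLAST database exists at this prefix.

    Hypothetical helper (not part of s8kpred). makeblastdb writes
    `<prefix>.pin` for a single-volume database; a large database is split
    into volumes (`<prefix>.00.pin`, `<prefix>.01.pin`, ...) tied together
    by a `<prefix>.pal` alias file.
    """
    base = Path(prefix).expanduser()
    candidates = [
        base.with_name(base.name + ".pin"),
        base.with_name(base.name + ".pal"),
        base.with_name(base.name + ".00.pin"),
    ]
    return any(p.is_file() for p in candidates)


if __name__ == "__main__":
    prefix = "~/blast_dbs/uniref50/uniref50"
    print(prefix, "->", "ok" if blastdb_looks_valid(prefix) else "no index files found")
```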

---

## Model data files

The trained XGBoost models and lookup tables are **not** bundled in the PyPI wheel because of their size. Download them from the [Releases page](https://github.com/mayank2801/s8kpred/releases) and place them in the `data/` directory of the installed `s8kpred` package:

```
s8kpred/data/
  TriPeptidePropensityThreeStateSecStructure2AND.csv
  TriPeptidePropensityEightStateSecStructure.csv
  TripeptideBinaryTable_60.csv
  model_3state.json
  model_8state.ubj
```

Or override paths at runtime:
```bash
s8kpred predict -i input.fasta \
  --blastdb ~/blast_dbs/uniref50/uniref50 \
  --model-3state /path/to/model_3state.json \
  --model-8state /path/to/model_8state.ubj
```
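
Because the data files are installed by hand, a quick check for missing ones can save a failed run. This is a sketch under stated assumptions: `missing_data_files` and `installed_data_dir` are hypothetical helpers, not part of the s8kpred API, and `installed_data_dir` assumes the package is importable and keeps its data in `<package>/data/` as described in this section.

```python
import importlib.util
from pathlib import Path
from typing import List

# File names as listed in this section.
REQUIRED_FILES = (
    "TriPeptidePropensityThreeStateSecStructure2AND.csv",
    "TriPeptidePropensityEightStateSecStructure.csv",
    "TripeptideBinaryTable_60.csv",
    "model_3state.json",
    "model_8state.ubj",
)


def missing_data_files(data_dir: Path) -> List[str]:
    """Return the required file names that are absent from data_dir."""
    return [name for name in REQUIRED_FILES if not (data_dir / name).is_file()]


def installed_data_dir() -> Path:
    """Locate s8kpred/data/ in the active environment (assumed layout)."""
    spec = importlib.util.find_spec("s8kpred")
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError("s8kpred is not installed in this environment")
    return Path(spec.origin).parent / "data"


if __name__ == "__main__":
    try:
        data_dir = installed_data_dir()
        print(f"{data_dir}: missing {missing_data_files(data_dir) or 'nothing'}")
    except ModuleNotFoundError as exc:
        print(exc)
```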

---

## Quick start

### Command line

```bash
# Single FASTA file
s8kpred predict -i protein.fasta --blastdb ~/blast_dbs/uniref50/uniref50

# Multi-sequence FASTA
s8kpred predict -i multi_seq.fasta --blastdb ~/blast_dbs/uniref50/uniref50

# Multiple separate FASTA files in one run
s8kpred predict -i seq1.fasta seq2.fasta seq3.fasta \
                --blastdb ~/blast_dbs/uniref50/uniref50

# Inline sequence (no file needed)
s8kpred predict \
  --sequence MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGD \
  --id my_protein \
  --blastdb ~/blast_dbs/uniref50/uniref50

# Custom output folder and job name
s8kpred predict -i input.fasta \
  --blastdb ~/blast_dbs/uniref50/uniref50 \
  --output-dir ./results \
  --job experiment_01

# Skip 8-state prediction
s8kpred predict -i input.fasta --blastdb ... --no-8state

# Skip cartoon plots
s8kpred predict -i input.fasta --blastdb ... --no-plot

# Quiet mode (suppress progress output)
s8kpred predict -i input.fasta --blastdb ... --quiet

# Use more PSI-BLAST threads
s8kpred predict -i input.fasta --blastdb ... --threads 16

# Override PSI-BLAST location
s8kpred predict -i input.fasta \
  --psiblast /opt/ncbi-blast/bin/psiblast \
  --blastdb ~/blast_dbs/uniref50/uniref50
```

### Python API

```python
from s8kpred import predict, predict_file

# ── Single sequence ──────────────────────────────────────────────────
result = predict(
    sequence="MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGD",
    seq_id="my_protein",
    blastdb="/data/blast/uniref50/uniref50",
)

print(result.results_3state["my_protein"])   # e.g. "LLLHHHHHHLLEEEEL..."
print(result.results_8state["my_protein"])   # e.g. "LLLHHHHHHLLEEEELL..."
print(result.job_dir)                         # Path to all output files

# ── Single FASTA file ────────────────────────────────────────────────
result = predict_file(
    fasta_file="proteins.fasta",
    blastdb="/data/blast/uniref50/uniref50",
    output_dir="./results",
)
print(result.summary())

# ── Multi-sequence FASTA ─────────────────────────────────────────────
result = predict_file("multi_seq.fasta", blastdb="...")
for seq_id, ss in result.results_3state.items():
    print(f"{seq_id}: {ss}")

# ── Custom model paths ────────────────────────────────────────────────
from pathlib import Path
result = predict_file(
    "proteins.fasta",
    blastdb="...",
    model_3state=Path("/models/model_3state.json"),
    model_8state=Path("/models/model_8state.ubj"),
)

# ── Skip 8-state to save time ─────────────────────────────────────────
result = predict("MKTAYI...", blastdb="...", run_8state=False)
```

---

## Output files

All outputs are written to a timestamped job directory under `--output-dir`:

```
s8kpred_jobs/
└── 20250210_153042_a1b2c3/
    ├── FASTA/
    │   └── input_sequence.fasta        # combined input
    ├── pssm_outputs/
    │   ├── Seq_1.pssm                  # raw PSI-BLAST PSSM
    │   └── ...
    ├── PSSM_Features_ML_17W.csv        # sliding-window PSSM features
    ├── ResultThreeState.ss2            # PSIPRED-style vertical format
    ├── ResultThreeState.horiz          # PSIPRED-style horizontal format
    ├── ResultThreeState.fas            # pseudo-FASTA format
    ├── ResultThreeState.csv            # per-residue probabilities
    ├── ResultEightState.ss2
    ├── ResultEightState.horiz
    ├── ResultEightState.fas
    ├── ResultEightState.csv
    ├── Seq_1_cartoon.png               # helix/sheet cartoon (requires biotite)
    └── log.dat                         # timing and status log
```
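
The name `PSSM_Features_ML_17W.csv` suggests a 17-residue window. As an illustration of how sliding-window PSSM features are commonly built, the sketch below concatenates the PSSM rows of each residue and its eight neighbours on either side, padding with zeros beyond the termini. This is not the package's feature code: the window width is inferred from the file name, and the zero padding, column order, and the function name `window_features` are all assumptions. It is written in plain Python to stay dependency-free.

```python
from typing import List, Sequence


def window_features(
    pssm: Sequence[Sequence[float]], window: int = 17
) -> List[List[float]]:
    """Flatten a sliding window of PSSM rows into one vector per residue.

    pssm   : one row of scores per residue (20 columns for a standard PSSM).
    window : odd window width; positions beyond the termini are zero-padded
             (an assumption, other padding schemes are possible).
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so it can be centred on a residue")
    if not pssm:
        return []
    half = window // 2
    pad = [0.0] * len(pssm[0])
    features = []
    for centre in range(len(pssm)):
        row: List[float] = []
        for pos in range(centre - half, centre + half + 1):
            # Real PSSM row inside the sequence, zeros outside it.
            row.extend(pssm[pos] if 0 <= pos < len(pssm) else pad)
        features.append(row)
    return features
```

With a standard 20-column PSSM and the default window, each residue gets 17 × 20 = 340 values.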

### Secondary structure codes

| Code | State |
|------|-------|
| **3-state** | |
| H    | α-Helix |
| E    | β-Strand |
| L    | Loop / Coil |
| **8-state** | |
| H    | α-Helix |
| G    | 3₁₀-Helix |
| I    | π-Helix |
| E    | β-Strand |
| B    | β-Bridge |
| T    | Turn |
| S    | Bend |
| L    | Loop / Coil |
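
A predicted string is often easier to read as a list of segments or as a per-state composition. The sketch below works on any string of the codes in the table, such as the values of `result.results_3state`; `composition` and `segments` are hypothetical helpers for illustration, not part of the s8kpred API.

```python
from collections import Counter
from itertools import groupby
from typing import Dict, List, Tuple


def composition(ss: str) -> Dict[str, float]:
    """Fraction of residues assigned to each state code."""
    if not ss:
        return {}
    counts = Counter(ss)
    return {state: n / len(ss) for state, n in counts.items()}


def segments(ss: str) -> List[Tuple[str, int, int]]:
    """Collapse a prediction into (state, start, end) runs, 1-based inclusive."""
    runs = []
    pos = 1
    for state, group in groupby(ss):
        length = sum(1 for _ in group)
        runs.append((state, pos, pos + length - 1))
        pos += length
    return runs


if __name__ == "__main__":
    example = "LLHHHHEEL"
    print(segments(example))
    # [('L', 1, 2), ('H', 3, 6), ('E', 7, 8), ('L', 9, 9)]
    print(composition("HHEL"))
    # {'H': 0.5, 'E': 0.25, 'L': 0.25}
```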

---

## Environment variables

| Variable | Default | Description |
|---|---|---|
| `S8KPRED_BLASTDB` | *(empty)* | BLAST database path prefix |
| `S8KPRED_PSIBLAST` | `psiblast` | PSI-BLAST binary path |
| `S8KPRED_ITERATIONS` | `3` | PSI-BLAST iterations |
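
For scripts that wrap s8kpred, the table can be mirrored in a small settings object. This is a sketch only: `BlastSettings` and `settings_from_env` are hypothetical names, and the validation of the iteration count is an assumption rather than the package's documented behaviour.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class BlastSettings:
    blastdb: str
    psiblast: str
    iterations: int


def settings_from_env(env: Mapping[str, str] = os.environ) -> BlastSettings:
    """Read the three variables from the table, applying its defaults."""
    raw_iterations = env.get("S8KPRED_ITERATIONS", "3")
    try:
        iterations = int(raw_iterations)
    except ValueError:
        raise ValueError(
            f"S8KPRED_ITERATIONS must be an integer, got {raw_iterations!r}"
        ) from None
    if iterations < 1:
        # Assumed constraint: PSI-BLAST needs at least one iteration.
        raise ValueError("S8KPRED_ITERATIONS must be at least 1")
    return BlastSettings(
        blastdb=env.get("S8KPRED_BLASTDB", ""),
        psiblast=env.get("S8KPRED_PSIBLAST", "psiblast"),
        iterations=iterations,
    )
```

Passing a plain dictionary instead of `os.environ` makes the function easy to exercise in tests.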

---

## CLI reference

```
s8kpred predict --help
```

```
usage: s8kpred predict [-h] (-i FASTA [FASTA ...] | -s SEQ)
                       [--blastdb DB] [--psiblast BIN] [--iterations N]
                       [--threads N] [-o DIR] [--job ID]
                       [--model-3state PATH] [--model-8state PATH]
                       [--no-3state] [--no-8state] [--no-plot] [-q]
                       [--id ID]
```
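
When the CLI is driven from another program, building the argument list explicitly avoids shell-quoting problems. The sketch below uses only flags that appear in this README; `build_predict_command` is a hypothetical helper and covers just a subset of the options.

```python
import shlex
from typing import List, Optional, Sequence


def build_predict_command(
    fasta_files: Sequence[str],
    blastdb: str,
    output_dir: Optional[str] = None,
    job: Optional[str] = None,
    threads: Optional[int] = None,
    skip_8state: bool = False,
    skip_plot: bool = False,
    quiet: bool = False,
) -> List[str]:
    """Assemble an `s8kpred predict` argument list for subprocess.run."""
    if not fasta_files:
        raise ValueError("at least one FASTA file is required")
    cmd = ["s8kpred", "predict", "-i", *fasta_files, "--blastdb", blastdb]
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    if job is not None:
        cmd += ["--job", job]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    if skip_8state:
        cmd.append("--no-8state")
    if skip_plot:
        cmd.append("--no-plot")
    if quiet:
        cmd.append("--quiet")
    return cmd


if __name__ == "__main__":
    cmd = build_predict_command(
        ["seq1.fasta", "seq2.fasta"], "/data/blast/uniref50/uniref50", threads=16
    )
    # Print the command; run it with subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```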

---

## Citation

If you use S8kPred in your research, please cite:

> [Your citation here]

---

## License

MIT — see [LICENSE](LICENSE) for details.
