Metadata-Version: 2.4
Name: spam-classifier
Version: 0.2.0
Summary: Spam/ham classifier with an MLOps-style training pipeline.
Author-email: Emin Tagiev <emin.tagiev@phystech.edu>
License: MIT
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pydantic==2.12.5
Requires-Dist: scikit-learn==1.8.0
Requires-Dist: pandas==3.0.0
Requires-Dist: nltk==3.9.2
Requires-Dist: pyyaml==6.0.3
Requires-Dist: joblib==1.5.3
Provides-Extra: dev
Requires-Dist: pytest==9.0.2; extra == "dev"
Requires-Dist: pytest-cov==7.0.0; extra == "dev"
Requires-Dist: black==26.1.0; extra == "dev"
Requires-Dist: mypy==1.19.1; extra == "dev"
Requires-Dist: flake8==7.3.0; extra == "dev"
Requires-Dist: isort==7.0.0; extra == "dev"
Requires-Dist: pre-commit==4.5.1; extra == "dev"
Requires-Dist: types-PyYAML==6.0.12.20250915; extra == "dev"
Dynamic: license-file

# Spam classifier

This project demonstrates how to package and train a simple spam/ham classifier with MLOps practices. It is designed for students learning how to structure ML code into modules, build training pipelines, configure via YAML, and add tests and CI.

## Project structure

- `spam_classifier/` — package code (pipeline, training, inference)
- `data/` — raw and processed datasets
- `config.yaml` — pipeline and training configuration
- `tests/` — pytest suite (unit + quality)
- `.github/workflows/ci.yml` — GitHub Actions CI

## Setup (uv)

```bash
uv venv --seed --python 3.13
uv pip install -e ".[dev]"
```

Minimum supported Python version is 3.11. If you prefer `venv`, you can still use it, but the project CI and Makefile expect `uv`.

## Data

Download and prepare the dataset:

```bash
make download_data
make process_data
```

`make process_data` builds `data/processed/train.csv` and `data/processed/test.csv`. The holdout split is controlled by:

- `data.test_size` in `config.yaml` (default 0.1)
- `training.use_holdout` (True/False), which decides whether training evaluates on the holdout set
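
To make the effect of `data.test_size` concrete, here is a minimal standard-library sketch of a holdout split. The helper name `holdout_split` and the seed are hypothetical, and the project's own split may work differently (for example, it may stratify by label).

```python
import random


def holdout_split(rows, test_size=0.1, seed=42):
    """Shuffle rows and reserve a fraction of them as a holdout set.

    Hypothetical helper for illustration; not the project's implementation.
    """
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be strictly between 0 and 1")
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    # Keep at least one row in the holdout set.
    n_test = max(1, round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


train, test = holdout_split(range(100), test_size=0.1)
print(len(train), len(test))  # 90 10
```

With the default of 0.1, one row in ten is set aside for the holdout evaluation.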

## Training

Train with cross-validation and optional holdout evaluation:

```bash
make train
```

Training behavior is controlled in `config.yaml`:

- `training.cv_folds` — number of CV folds
- `training.metrics` — metrics to log (accuracy/precision/recall/f1/roc_auc)
- `training.use_holdout` — evaluate on `test.csv` if True
- `training.run_validation` — run CV if True
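
The four keys above could be mirrored in code roughly as follows. This is a sketch only: it uses a standard-library dataclass to stay dependency-free (the project depends on `pydantic`, so its real config model probably looks different), and the class name `TrainingConfig` and every default value are assumptions.

```python
from dataclasses import dataclass, field

# Metric names listed in the README for `training.metrics`.
ALLOWED_METRICS = {"accuracy", "precision", "recall", "f1", "roc_auc"}


@dataclass
class TrainingConfig:
    """Hypothetical mirror of the `training` section of config.yaml."""

    cv_folds: int = 5  # assumed default, not taken from the project
    metrics: list[str] = field(default_factory=lambda: ["accuracy", "f1"])
    use_holdout: bool = True  # evaluate on test.csv after training
    run_validation: bool = True  # run cross-validation

    def __post_init__(self):
        # Fail early on values that cannot produce a valid training run.
        if self.cv_folds < 2:
            raise ValueError("cv_folds must be at least 2")
        unknown = set(self.metrics) - ALLOWED_METRICS
        if unknown:
            raise ValueError(f"unknown metrics: {sorted(unknown)}")


cfg = TrainingConfig(cv_folds=3, metrics=["precision", "recall"])
print(cfg)
```

Validating at load time means a typo in a metric name surfaces before any training starts.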

### Versioned artifacts

Package version is stored in `spam_classifier/_VERSION`. Model and log filenames include this version:

- Model: `spam_classifier/models/spam_classifier_vX.Y.Z.pkl`
- Logs: `spam_classifier/logs/logs_X.Y.Z.log`
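
The naming scheme above can be expressed as a small helper. The function name `artifact_paths` is hypothetical; only the two filename patterns come from this README.

```python
from pathlib import Path


def artifact_paths(version, package_dir="spam_classifier"):
    """Return (model_path, log_path) for a given package version.

    Hypothetical helper that follows the filename patterns listed above.
    """
    root = Path(package_dir)
    model = root / "models" / f"spam_classifier_v{version}.pkl"
    log = root / "logs" / f"logs_{version}.log"
    return model, log


model_path, log_path = artifact_paths("0.2.0")
print(model_path.as_posix())  # spam_classifier/models/spam_classifier_v0.2.0.pkl
print(log_path.as_posix())    # spam_classifier/logs/logs_0.2.0.log
```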

## Inference

Single message:

```bash
uv run python -m spam_classifier.predict "Free prize! Call now"
```

Batch inference from a file (one message per line):

```bash
uv run python -m spam_classifier.predict data/processed/test.csv -o results/preds.csv
```

Options:

- `-o/--output` — output CSV path (default: project root)
- `--no-message` — exclude message text from output CSV
- `--model-path` — path to a trained `.pkl` model (overrides default)
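
For readers new to command-line parsing, the options above map onto `argparse` roughly like this. It is an illustrative sketch, not the actual `spam_classifier.predict` implementation; the positional argument name `input` and the `None` defaults are assumptions.

```python
import argparse


def build_parser():
    """Build a parser with the options documented above (illustrative only)."""
    parser = argparse.ArgumentParser(prog="spam_classifier.predict")
    # Assumed name for the positional argument: a message or a file path.
    parser.add_argument("input", help="message text or path to a file of messages")
    parser.add_argument("-o", "--output", default=None, help="output CSV path")
    parser.add_argument(
        "--no-message",
        action="store_true",
        help="exclude message text from the output CSV",
    )
    parser.add_argument(
        "--model-path", default=None, help="path to a trained .pkl model"
    )
    return parser


args = build_parser().parse_args(
    ["data/processed/test.csv", "-o", "results/preds.csv", "--no-message"]
)
print(args.input, args.output, args.no_message, args.model_path)
```

`argparse` turns `--no-message` into the attribute `no_message`, and `--model-path` into `model_path`.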

If you installed the package from PyPI, you must train a model or pass `--model-path`
because no weights are bundled with the package by default.

If you have activated the virtual environment, you can omit `uv run` and call `python` directly.

## Tests

Run the full test suite:

```bash
uv run pytest tests
```

Quality tests (require a trained model and holdout data):

```bash
uv run pytest -m quality
```

If you have activated the virtual environment, you can omit `uv run` for pytest as well.

## CI

GitHub Actions runs on PRs to `main` and `develop`:

- `black --check`
- `flake8`
- `mypy`
- `pytest tests`

## Pre-commit

Install and run pre-commit hooks:

```bash
pre-commit install
pre-commit run --all-files
```

Hooks included: `black`, `flake8`, `mypy`.

## Publishing

### TestPyPI (manual)

1. Update `spam_classifier/_VERSION`
2. Trigger the Publish workflow manually:
   - Go to **Actions → Publish → Run workflow**
   - Select `testpypi`
3. The package is built and published to TestPyPI

### PyPI (release)

1. Update `spam_classifier/_VERSION`
2. Create a GitHub Release (the tag should match the version, e.g. `v0.2.0` for version 0.2.0)
3. The Publish workflow will build and upload to PyPI
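
Before creating the release, it can help to check that the tag and the version file agree. The sketch below assumes that `spam_classifier/_VERSION` contains only the bare version string and that tags are the version with a `v` prefix; both function names are hypothetical.

```python
from pathlib import Path


def read_version(path="spam_classifier/_VERSION"):
    """Read the package version, assuming the file holds only the version string."""
    return Path(path).read_text(encoding="utf-8").strip()


def tag_matches_version(tag, version):
    """Return True when the release tag is the version prefixed with 'v'."""
    return tag == f"v{version}"


print(tag_matches_version("v0.2.0", "0.2.0"))  # True
print(tag_matches_version("v0.1.0", "0.2.0"))  # False
```

In a checkout of the repository, `tag_matches_version(tag, read_version())` would perform the check against the real file.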

### Trusted publishing

This project uses GitHub Actions OIDC (trusted publishing). You must configure
the trusted publisher on **PyPI and TestPyPI** to allow the `Publish` workflow
from this repository to upload packages.
