Metadata-Version: 2.4
Name: gatekeeper-ml
Version: 0.1.3
Summary: Lightweight payload classifier for HTTP log gatekeeping.
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: Faker>=37.1.0
Requires-Dist: joblib>=1.4.2
Requires-Dist: matplotlib>=3.9.1
Requires-Dist: numpy>=1.26.4
Requires-Dist: pandas>=2.2.2
Requires-Dist: pyarrow>=16.1.0
Requires-Dist: requests>=2.32.3
Requires-Dist: scikit-learn>=1.5.1
Requires-Dist: scipy>=1.13.1
Requires-Dist: seaborn>=0.13.2

# Gatekeeper ML

Gatekeeper ML is a lightweight binary classifier for raw HTTP log fields. It combines character-level TF-IDF features with handcrafted text statistics to label each input string as either:

- `0` for normal
- `1` for suspicious
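
The handcrafted statistics paired with the TF-IDF features are typically simple per-string aggregates. A minimal sketch of what such features might look like (the names and exact set here are illustrative, not the package's actual feature list):

```python
from collections import Counter
import math

def text_stats(value: str) -> dict:
    """Handcrafted statistics of the kind often paired with char TF-IDF:
    length, digit/special-character ratios, and character entropy."""
    n = max(len(value), 1)  # guard against division by zero for ""
    counts = Counter(value)
    # Shannon entropy over the character distribution; injection payloads
    # tend to mix many character classes and score higher.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {
        "length": len(value),
        "digit_ratio": sum(ch.isdigit() for ch in value) / n,
        "special_ratio": sum(not ch.isalnum() for ch in value) / n,
        "entropy": entropy,
    }
```

In a scikit-learn pipeline, features like these would sit alongside the TF-IDF vectorizer, e.g. via a `FeatureUnion` or `ColumnTransformer`.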

The design goal is low-latency screening. Safe-looking values can bypass the ML model entirely through a conservative regex fast path, and only suspicious candidates need deeper downstream analysis.
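
The fast path can be pictured as a strict allow-list regex: values that match are cleared as normal without ever touching the model. A minimal sketch, assuming a hypothetical pattern (the package's actual regex and function name may differ):

```python
import re

# Hypothetical allow-list: short strings made only of characters common
# in benign paths, IDs, and simple query values. Anything else falls
# through to the ML model. Conservative on purpose: a miss here costs a
# little latency, not recall.
_SAFE_VALUE = re.compile(r"^[A-Za-z0-9_./-]{1,128}$")

def fast_path_is_safe(value: str) -> bool:
    """Return True only for values the allow-list confidently clears."""
    return bool(_SAFE_VALUE.match(value))
```

Under this pattern, `/api/v1/users/42` is cleared by the fast path, while `<script>alert(1)</script>` and `' OR 1=1 --` fall through to the classifier.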

## Project Structure

```text
gatekeeper-ml/
├── cli.py
├── data/
│   ├── raw/
│   └── processed/
├── models/
├── pyproject.toml
├── src/
│   └── gatekeeper_ml/
│       ├── __init__.py
│       ├── config.py
│       ├── data_loader.py
│       ├── features.py
│       ├── predict.py
│       ├── train.py
│       └── models/
│           └── payload_classifier.pkl
├── README.md
├── MANIFEST.in
└── requirements.txt
```

## 1. Create a Virtual Environment

From the `gatekeeper-ml` directory:

### Windows PowerShell

```powershell
python -m venv .venv
.venv\Scripts\Activate.ps1
```

### macOS / Linux

```bash
python3 -m venv .venv
source .venv/bin/activate
```

## 2. Install Dependencies

```bash
pip install --upgrade pip
pip install -r requirements.txt
```

For local development as an installable package:

```bash
pip install -e .
```

To build and publish distribution artifacts, also install the packaging tools:

```bash
pip install build twine
```

## 3. Fetch and Prepare the Dataset

This command downloads suspicious payloads from PayloadsAllTheThings, generates normal HTTP-like samples, and writes the combined processed dataset to `data/processed/`.

```bash
python cli.py fetch
```

To force a fresh download from the upstream payload sources:

```bash
python cli.py fetch --force-refresh
```
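
Conceptually, the normal class is synthesized from benign HTTP-like strings (the real loader uses Faker, a declared dependency, for richer variety). A dependency-free sketch of that generation idea, with entirely hypothetical segment names:

```python
import random

# Illustrative path segments; the real generator produces a much wider
# variety of paths, user agents, and query values.
SEGMENTS = ["api", "v1", "users", "orders", "items", "health"]

def normal_sample(rng: random.Random) -> str:
    """Build a benign-looking URL path such as /api/v1/users/1234."""
    depth = rng.randint(1, 3)
    parts = [rng.choice(SEGMENTS) for _ in range(depth)]
    parts.append(str(rng.randint(1, 99999)))
    return "/" + "/".join(parts)

rng = random.Random(0)  # seeded for reproducible samples
samples = [normal_sample(rng) for _ in range(4)]
```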

## 4. Train the Model

This command:

- loads the processed dataset
- trains the RandomForest-based pipeline
- saves the serialized `.pkl` model artifact to `models/`
- prints Precision, Recall, F1-Score, and the recall-oriented threshold

```bash
python cli.py train
```

To rebuild the dataset before training:

```bash
python cli.py train --force-refresh
```

Expected artifacts:

- `models/gatekeeper_payload_classifier.pkl`
- `models/training_metrics.json`
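
The recall-oriented threshold printed by `train` can be understood as the highest probability cutoff that still meets a target recall on held-out positives. A self-contained sketch of that selection logic, with an assumed function name and default target (the actual training code may choose its threshold differently):

```python
def pick_recall_threshold(probs, labels, target_recall=0.95):
    """Pick the highest cutoff on P(suspicious) that still attains
    `target_recall` on the positive class. Higher cutoffs mean fewer
    false alarms, so we take the largest one that qualifies."""
    positives = [p for p, y in zip(probs, labels) if y == 1]
    if not positives:
        return 0.5  # no positives to calibrate against
    best = 0.0
    for cut in sorted(set(probs)):
        caught = sum(1 for p in positives if p >= cut)
        if caught / len(positives) >= target_recall and cut > best:
            best = cut
    return best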

## 5. Run Predictions

Pass one or more strings directly on the command line:

```bash
python cli.py predict "/api/v1/users/42" "<script>alert(1)</script>" "' OR 1=1 --"
```

Example output:

```text
Prediction results
0    /api/v1/users/42
1    <script>alert(1)</script>
1    ' OR 1=1 --
Total inference time: 4.812 ms
Average per input:    1.604 ms
```

To point at a custom trained model:

```bash
python cli.py predict --model-path models/gatekeeper_payload_classifier.pkl "12345" "../../etc/passwd"
```

## 6. Use as a Python Library

After `pip install -e .`, you can import the classifier directly:

```python
from gatekeeper_ml import PayloadClassifier

# Option 1: Default load
clf = PayloadClassifier()

# Option 2: Custom path load
clf = PayloadClassifier(model_path="path/to/your_model.pkl")

predictions = clf.predict_batch([
    "/api/v1/users/42",
    "<script>alert(1)</script>",
])
```

## 7. Build the Package

Before building a release, update the version in `pyproject.toml`.

Build both the source distribution and wheel:

```bash
python -m build
```

This creates the release artifacts in `dist/`:

- `dist/gatekeeper_ml-<version>.tar.gz`
- `dist/gatekeeper_ml-<version>-py3-none-any.whl`

Validate the generated metadata before upload:

```bash
python -m twine check dist/*
```

## 8. Publish to TestPyPI

TestPyPI is the safest way to verify packaging before a real release.

Create a TestPyPI API token, then export it:

### Windows PowerShell

```powershell
$env:TWINE_USERNAME="__token__"
$env:TWINE_PASSWORD="pypi-..."
```

### macOS / Linux

```bash
export TWINE_USERNAME="__token__"
export TWINE_PASSWORD="pypi-..."
```

Upload to TestPyPI:

```bash
python -m twine upload --repository testpypi dist/*
```

You can then verify installation from TestPyPI:

```bash
pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ gatekeeper-ml
```

## 9. Publish to PyPI

After the TestPyPI release looks good, upload the same `dist/` artifacts to PyPI:

```bash
python -m twine upload dist/*
```

Recommended release flow:

1. Update the version in `pyproject.toml`.
2. Remove old build artifacts if needed: `rm -rf dist build *.egg-info` (PowerShell: `Remove-Item -Recurse -Force dist, build, *.egg-info`).
3. Run `python -m build`.
4. Run `python -m twine check dist/*`.
5. Publish to TestPyPI.
6. Install and smoke-test from TestPyPI.
7. Publish to PyPI.

## 10. Trusted Publishing

For CI/CD releases, PyPI now supports Trusted Publishing via OIDC, which avoids storing long-lived API tokens in your repository secrets. If you publish from GitHub Actions or another supported CI provider, this is the recommended long-term setup.

References:

- PyPI Trusted Publishing docs: https://docs.pypi.org/trusted-publishers/using-a-publisher/
- Twine project page: https://pypi.org/project/twine/

## Notes

- The data loader prefers live GitHub payload sources but includes a bootstrap suspicious set so the pipeline remains usable when remote fetches fail.
- The `predict` command loads the model once and evaluates inputs in batch form for low overhead.
- Fast-path safe heuristics are intentionally conservative to preserve recall.
- A dedicated module usage guide is available at `docs/USAGE.md`.
