Metadata-Version: 2.4
Name: trakr
Version: 0.1.0
Summary: Lightweight artifact integrity and drift detection for ML/data pipelines
Author: Trakr contributors
License: MIT
Project-URL: Homepage, https://github.com/your-org/trakr
Project-URL: Repository, https://github.com/your-org/trakr
Project-URL: Issues, https://github.com/your-org/trakr/issues
Project-URL: Changelog, https://github.com/your-org/trakr/blob/main/CHANGELOG.md
Keywords: ml,mlops,pipeline,data-pipeline,artifacts,integrity,hashing,reproducibility,drift-detection,cli
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Quality Assurance
Classifier: Topic :: System :: Archiving
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: typer>=0.9
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.0
Provides-Extra: s3
Requires-Dist: boto3>=1.28; extra == "s3"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: ruff>=0.5; extra == "dev"
Requires-Dist: build>=1.0; extra == "dev"
Dynamic: license-file

# trakr

Artifact integrity and drift detection for ML and data pipelines.

A small CLI that hashes the things you care about, snapshots them alongside
your environment, and yells when anything moves. That's it.

---

## Why this exists

I kept getting bitten by the same thing: a model I trained two weeks ago
behaves slightly differently in prod than it did on my laptop. A dataset
"didn't change", except the file's modification time says it did. Someone in
staging swapped a config and we noticed three days later.

The tools I had didn't help much. DVC versions things, MLflow tracks
experiments, git-lfs stores blobs — none of them answer the simple question
"is this file the same one I snapshotted last Tuesday?".

trakr is the boring, 500-line answer to that. SHA-256 a file. Save the hash
and the env. Compare them later. Exit non-zero if anything moved.

It is meant to live *next to* your existing pipeline tools, not replace them.

## How it compares

|                                            | trakr | DVC | MLflow | git-lfs | W&B |
| ------------------------------------------ | :--: | :-: | :----: | :-----: | :-: |
| SHA-256 integrity verification             |  ✅  |     |        |         |     |
| Drift diff between runs                    |  ✅  |     |        |         |     |
| Python + package env capture               |  ✅  |     |   ✅   |         | ✅  |
| S3 (no-download ETag check)                |  ✅  | ✅  |   ✅   |         | ✅  |
| Zero config to start                       |  ✅  |     |        |         |     |
| Non-zero exit on drift (CI-friendly)       |  ✅  | ✅  |        |         |     |
| JSON output                                |  ✅  |     |   ✅   |         | ✅  |
| Pre-commit hook                            |  ✅  | ✅  |        |   ✅    |     |
| Runs without a server or DB                |  ✅  | ✅  |        |   ✅    |     |
| 3 dependencies, no daemon                  |  ✅  |     |        |   ✅    |     |
| Dataset / model versioning                 |      | ✅  |   ✅   |   ✅    | ✅  |
| Experiment tracking & metrics              |      |     |   ✅   |         | ✅  |
| Model registry                             |      |     |   ✅   |         | ✅  |
| Visualization dashboards                   |      |     |   ✅   |         | ✅  |

Use DVC to version, MLflow to track experiments, W&B for dashboards. Use
trakr to make sure the artifacts they point at are the ones you actually
expect.

## Install

```bash
pip install trakr
```

For S3 support:

```bash
pip install "trakr[s3]"
```

Needs Python 3.10 or newer.

## Quickstart

```bash
trakr init
trakr track model.pkl --name model --type model
trakr track data/train.csv --name training --type dataset
trakr snapshot
trakr verify
```

That's the whole thing. `init` creates a `.trakr/` directory in your repo,
`track` registers a path under an alias, `snapshot` writes a manifest of the
current state, and `verify` re-hashes everything and compares.

When something drifts, `verify` exits 1 and prints what changed:

```
  Verify against run_2026-04-23-001
┏━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Artifact ┃ Status     ┃ Detail       ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ model    │ ✗ mismatch │ hash changed │
│ training │ ✓ verified │ hash match   │
└──────────┴────────────┴──────────────┘
✗ Drift detected — some artifacts do not match the latest snapshot.
```

That non-zero exit is the whole point — drop `trakr verify` into a CI step
and a drifted artifact stops the pipeline.

## Commands

### `trakr init`

Creates `.trakr/` (config, manifests folder, cache folder). Deliberately not
idempotent: if `.trakr/` already exists, running `init` again is an error,
because you probably don't want to clobber it.

### `trakr track <path> --name <alias> [--type <type>]`

Register a path. Local files are hashed; `s3://bucket/key` paths are tracked
by ETag and size (no download). The `--type` is a free-form label —
`model`, `dataset`, `config`, whatever you find useful in the diff later.

Re-running with the same `--name` overwrites the entry, which is convenient
when you're iterating on what to track.
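
For a feel of what "hashed" and "tracked by ETag and size" mean in practice,
here is a minimal sketch. It is not trakr's actual code: the function names,
the returned keys, and the 1 MiB chunk size are assumptions made for
illustration. The S3 half takes any client that behaves like
`boto3.client("s3")` and only issues a HEAD request.

```python
import hashlib
from pathlib import Path
from urllib.parse import urlparse

CHUNK_SIZE = 1024 * 1024  # assumed chunk size: read 1 MiB at a time


def local_file_info(path: str) -> dict:
    """Hash a local file with SHA-256 and record its size (hypothetical helper)."""
    digest = hashlib.sha256()
    file_path = Path(path)
    with file_path.open("rb") as fh:
        # Stream the file so large artifacts never sit fully in memory.
        for chunk in iter(lambda: fh.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return {"hash": digest.hexdigest(), "size": file_path.stat().st_size}


def s3_object_info(uri: str, client) -> dict:
    """Describe an S3 object from a HEAD request only (hypothetical helper).

    `client` is expected to behave like boto3.client("s3").
    """
    parsed = urlparse(uri)
    key = parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"not an s3://bucket/key URI: {uri!r}")
    head = client.head_object(Bucket=parsed.netloc, Key=key)
    return {
        "etag": head["ETag"].strip('"'),  # S3 returns the ETag wrapped in quotes
        "size": head["ContentLength"],
        "last_modified": head["LastModified"].isoformat(),
    }
```

Passing the client in keeps the function easy to test with a stub, and keeps
`boto3` an optional dependency.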

### `trakr snapshot`

Writes `.trakr/manifests/run_<id>.yaml` with:

- the current hash + size of each tracked local file
- ETag + size + last-modified for each S3 object
- the Python version and every installed package
- a timestamp and an auto-incrementing run id (`YYYY-MM-DD-NNN`)

Files larger than 10 MB get a progress bar while they're being hashed.
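
The environment capture and the run id can be pictured roughly like this.
This is a sketch under assumptions, not the real implementation: the function
names and dictionary keys are made up, and it assumes the `NNN` counter
restarts at `001` each day.

```python
import platform
import re
from datetime import date
from importlib.metadata import distributions


def capture_environment() -> dict:
    """Record the interpreter version and every installed distribution."""
    packages = {}
    for dist in distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs with no readable name
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "packages": dict(sorted(packages.items())),
    }


def next_run_id(existing_ids: list, today: date) -> str:
    """Return the next YYYY-MM-DD-NNN id, given the ids already on disk."""
    prefix = today.isoformat()
    pattern = re.compile(rf"^{re.escape(prefix)}-(\d{{3}})$")
    # Only ids from the same day count towards the counter (assumption).
    used = [int(m.group(1)) for rid in existing_ids if (m := pattern.match(rid))]
    return f"{prefix}-{max(used, default=0) + 1:03d}"
```

With no earlier runs on 2026-04-23, `next_run_id([], date(2026, 4, 23))`
gives `2026-04-23-001`, which matches the id format shown in the `verify`
output above.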

### `trakr verify [--json]`

Re-collects the same information that `snapshot` records and compares it
against the latest manifest. Prints a table by default. With `--json` it
prints a machine-readable result, useful in CI:

```bash
trakr verify --json | jq '.status'
# "ok" or "drift"
```

Exit code is 0 on a clean verify, 1 if anything mismatched.
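
The comparison itself is simple. A minimal sketch, assuming each artifact is
reduced to a comparable record (for example its hash): the `compare` and
`exit_code` names, the `artifacts` key, and the `missing` state are
hypothetical, while the `ok` / `drift` status values and the 0 / 1 exit codes
come from the description above.

```python
def compare(snapshot: dict, current: dict) -> dict:
    """Compare recorded artifact records against freshly collected ones."""
    results = {}
    for name, recorded in snapshot.items():
        now = current.get(name)
        if now is None:
            results[name] = "missing"      # tracked at snapshot time, gone now
        elif now == recorded:
            results[name] = "verified"
        else:
            results[name] = "mismatch"
    # Any artifact that is not verified counts as drift.
    status = "ok" if all(v == "verified" for v in results.values()) else "drift"
    return {"status": status, "artifacts": results}


def exit_code(result: dict) -> int:
    """Map a comparison result to the process exit code."""
    return 0 if result["status"] == "ok" else 1
```

Keeping the exit code a pure function of the result makes the CI behavior
easy to reason about: one mismatch anywhere means exit 1.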

### `trakr diff <run1> <run2>`

Tree view of what changed between two snapshots — added, removed, changed
artifacts, plus environment differences (Python version, package versions).
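
Conceptually, the diff is three set operations applied to the artifacts and
again to the packages. The sketch below assumes a manifest loaded into a dict
with `artifacts` and `environment` keys; that shape and the function names
are illustrative guesses, not the real manifest schema.

```python
def diff_mapping(old: dict, new: dict) -> dict:
    """Classify keys as added, removed, or changed between two mappings."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


def diff_runs(run1: dict, run2: dict) -> dict:
    """Diff two snapshots: artifacts first, then the environment."""
    env1, env2 = run1["environment"], run2["environment"]
    return {
        "artifacts": diff_mapping(run1["artifacts"], run2["artifacts"]),
        "packages": diff_mapping(env1["packages"], env2["packages"]),
        "python_changed": env1["python"] != env2["python"],
    }
```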

### `trakr list [--json]`

Shows what you're currently tracking, with the last-known hash from the
latest snapshot if there is one.

### `trakr history [--limit N]`

Recent snapshots, newest first. Default limit is 20.

### `trakr status [--json]`

Quick summary panel: how many artifacts you track, when the last snapshot
was, whether anything has drifted since. Cheap to run, good for a shell
prompt or a Makefile target.

### `trakr untrack <name>`

Stop tracking an artifact. Doesn't delete anything from disk or from old
manifests, just removes it from `config.yaml`.

## Configuration

`.trakr/config.yaml` is plain YAML and is meant to be hand-edited:

```yaml
pipeline: default            # free-form label, shown in manifests
hash_algorithm: sha256       # reserved; sha256 is the only one wired up today
log_level: info              # default; CLI flags override
artifacts:
  - name: model
    path: model.pkl
    type: model
  - name: training-data
    path: s3://my-bucket/data.csv
    type: dataset
```

### Environment variables

| Variable          | What it does                                                  |
| ----------------- | ------------------------------------------------------------- |
| `TRAKR_DIR`       | Use a different directory than `./.trakr/`.                   |
| `TRAKR_LOG_LEVEL` | `debug` / `info` / `warning` / `error`.                       |
| `TRAKR_NO_COLOR`  | Disable colored output. (Also respects standard `NO_COLOR`.)  |
| `AWS_*`           | Whatever boto3 uses — credentials, region, profile.           |

### Global flags

```bash
trakr --version
trakr -v <cmd>          # debug logging
trakr -q <cmd>          # quiet (warnings only)
trakr --trakr-dir /path <cmd>  # custom .trakr/ location
```

These go *before* the subcommand: `trakr -v snapshot`, not `trakr snapshot -v`.
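
One plausible way the directory and color settings could be resolved is
sketched below. The precedence shown (flag, then `TRAKR_DIR`, then the
default) is an assumption, as are the function names; the README only says
that both the flag and the variable exist.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_trakr_dir(cli_value: Optional[str], env: Mapping[str, str] = os.environ) -> Path:
    """Pick the trakr directory: --trakr-dir, then TRAKR_DIR, then ./.trakr."""
    if cli_value:
        return Path(cli_value)          # assumed: an explicit flag wins
    if env.get("TRAKR_DIR"):
        return Path(env["TRAKR_DIR"])
    return Path(".trakr")


def color_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Color stays on unless TRAKR_NO_COLOR or NO_COLOR is set and non-empty."""
    return not (env.get("TRAKR_NO_COLOR") or env.get("NO_COLOR"))
```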

## CI integration

### GitHub Actions

```yaml
name: verify-artifacts
on: [push, pull_request]
jobs:
  trakr:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12" }
      - run: pip install trakr
      - run: trakr verify
```

### Pre-commit

```yaml
repos:
  - repo: https://github.com/your-org/trakr
    rev: v0.1.0
    hooks:
      - id: trakr-verify
```

### GitLab CI

```yaml
verify-artifacts:
  image: python:3.12
  script:
    - pip install trakr
    - trakr verify
```

## Building from source

```bash
git clone https://github.com/your-org/trakr.git
cd trakr
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev,s3]"

pytest          # 30 tests
ruff check src/ tests/
```

The layout is small:

```
src/trakr/
  cli/        # typer commands and rich rendering — the user-facing layer
  core/       # hashing, manifest, environment — pure logic, no UI
  handlers/   # one module per artifact source (local, s3)
```

Adding a new handler is roughly: drop a file in `handlers/` that exposes
`get_artifact_info(path) -> dict`, then teach `_get_handler` in `cli/commands.py`
to recognize the new prefix. Tests welcome.

## Contributing

PRs welcome. Read [CONTRIBUTING.md](CONTRIBUTING.md) for the dev setup,
coding style (ruff), and what we look for in a PR.

The project is intentionally small. If a feature requires a server, a daemon,
or a fourth dependency, it probably belongs in a downstream tool — open an
issue first so we can talk it through.

By participating you agree to the [Code of Conduct](CODE_OF_CONDUCT.md).

## Roadmap

Things on the list, in roughly the order I'd reach for them:

- `trakr doctor` — diagnose common setup problems
- glob patterns in `trakr track`
- BLAKE3 and SHA-512 as alternative hash algorithms
- remote manifest storage (S3/GCS backend)
- `trakr verify --run <id>` to verify against a specific run
- `trakr show <run_id>` to pretty-print a single manifest
- a `demo/` directory with an end-to-end sample pipeline
- a published GitHub Action

These are all reasonable starter PRs — open an issue if you want to take one.

## License

[MIT](LICENSE).
