Metadata-Version: 2.4
Name: locus-etl
Version: 0.0.2
Summary: Locus — turn any unstructured corpus into validated, source-grounded tabular data. CLI: `locus`.
Author: Dibae101
License: MIT
License-File: LICENSE
Keywords: data-extraction,etl,grounding,llm,provenance,rag
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering
Requires-Python: >=3.11
Requires-Dist: packaging>=23.0
Requires-Dist: pandas>=2.0
Requires-Dist: pyarrow>=15.0
Requires-Dist: pydantic>=2.6
Requires-Dist: pyyaml>=6.0
Requires-Dist: rapidfuzz>=3.0
Requires-Dist: rich>=13.0
Requires-Dist: typer>=0.12
Provides-Extra: dedup
Requires-Dist: splink>=4.0; extra == 'dedup'
Provides-Extra: docker
Requires-Dist: docker>=7.0; extra == 'docker'
Provides-Extra: docling
Requires-Dist: docling>=2.0; extra == 'docling'
Provides-Extra: embeddings
Requires-Dist: fastembed>=0.3; extra == 'embeddings'
Provides-Extra: html
Requires-Dist: selectolax>=0.3; extra == 'html'
Requires-Dist: trafilatura>=1.8; extra == 'html'
Provides-Extra: llm
Requires-Dist: instructor>=1.3; extra == 'llm'
Requires-Dist: litellm>=1.40; extra == 'llm'
Provides-Extra: normalize
Requires-Dist: babel>=2.14; extra == 'normalize'
Requires-Dist: phonenumbers>=8.13; extra == 'normalize'
Requires-Dist: python-dateutil>=2.9; extra == 'normalize'
Provides-Extra: oci
Requires-Dist: oras>=0.2.0; extra == 'oci'
Provides-Extra: ocr
Requires-Dist: rapidocr-onnxruntime>=1.3; extra == 'ocr'
Provides-Extra: pdf
Requires-Dist: pdfplumber>=0.11; extra == 'pdf'
Provides-Extra: serve
Requires-Dist: fastapi>=0.110; extra == 'serve'
Requires-Dist: httpx>=0.27; extra == 'serve'
Requires-Dist: uvicorn>=0.29; extra == 'serve'
Provides-Extra: sql
Requires-Dist: sqlalchemy>=2.0; extra == 'sql'
Description-Content-Type: text/markdown

# Locus

Turn any unstructured corpus into validated, **source-grounded** tabular data — ready to feed an LLM.

[![PyPI](https://img.shields.io/pypi/v/locus-etl.svg)](https://pypi.org/project/locus-etl/)
[![Python](https://img.shields.io/pypi/pyversions/locus-etl.svg)](https://pypi.org/project/locus-etl/)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)

Locus packages data operations as reusable, versioned **images**. You pull an image, point it at your own data, run it locally, and get a clean table where **every cell carries its source location and a faithfulness score**. Images compose into pipelines, and you can publish your own to Locus Hub (public or private).
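
To make "every cell carries its source location and a faithfulness score" concrete, here is a minimal sketch of what such a cell record could look like. The `GroundedCell` class, its field names, and the `ground` helper are hypothetical and are not the Locus API; the score uses the standard library's `difflib` as a stand-in for whatever matcher the engine actually applies.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class GroundedCell:
    """Hypothetical cell record: a value plus where it came from."""

    value: str
    source_uri: str          # document the value was read from
    span: tuple[int, int]    # character offsets of the supporting text
    faithfulness: float      # 0.0-1.0 similarity between value and evidence


def ground(value: str, source_uri: str, text: str, span: tuple[int, int]) -> GroundedCell:
    """Score a value against the source text it claims to come from."""
    start, end = span
    evidence = text[start:end]
    # Stand-in metric (assumption): case-insensitive similarity ratio.
    score = SequenceMatcher(None, value.lower(), evidence.lower()).ratio()
    return GroundedCell(value, source_uri, span, round(score, 3))


text = "Invoice from Acme Corp for 100 USD"
exact = ground("Acme Corp", "data/invoice.txt", text, (13, 22))
loose = ground("ACME Corporation", "data/invoice.txt", text, (13, 22))
print(exact)
print(loose)
```

A value copied verbatim from its span scores 1.0, while a paraphrased or expanded value scores lower, which is the kind of signal a reviewer could filter on.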

## Install

```bash
pip install locus-etl          # the CLI command is `locus`
```

Optional extras: `pip install "locus-etl[pdf,serve,llm,oci]"` (PDF parsing, Hub/result UI, LLM engine, OCI registry).

## Quickstart

```bash
locus catalog list                  # see the official image catalog
printf 'name,amount\nAcme,100\nGlobex,200\n' > data.csv
cat > locusfile.yaml <<EOF
image: doc-to-tables
source: { type: files, path: ./data.csv }
EOF
locus run locusfile.yaml --export out.csv   # grounded table + _lineage column
locus run locusfile.yaml --serve            # preview UI with per-cell provenance
locus hub                                    # browse the catalog in a local web UI
```

## Architecture (layered)

![Locus layered architecture](docs/architecture.png)

The diagram is generated from [`docs/generate_architecture_diagram.py`](docs/generate_architecture_diagram.py) (PNG + SVG in `docs/`).

### Layer summary

| Layer | Spec | Responsibility |
|-------|------|----------------|
| **Layer 1 — Engine** | `unstructured-to-tabular-etl` | Raw corpus -> validated, source-grounded table. Connectors, parsing, extraction, cleaning, the cell-level grounding/faithfulness contract, review. Embedded inside every image. |
| **Layer 2 — Runtime** | `locus-image-runtime` | Packaging, CLI, Locusfile, image pull, multi-image composition (DAG), typed stage interchange, cross-stage provenance, serve/export, and publishing to Locus Hub. |

## Key properties

- **Local-first.** Default runtime is a plain Python process — no daemon, no Linux VM. Docker is an optional backend.
- **Privacy is explicit.** Deterministic engine keeps data local; the LLM engine activates only when you add a key, with a consent notice before any data leaves.
- **Trust travels with the data.** Provenance and faithfulness survive every pipeline stage, from extraction through merge and redaction.
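
The "Privacy is explicit" property amounts to a two-step gate: no key means no LLM engine, and a key alone is not enough without consent. The sketch below shows one way such a gate could be written. The function name, the `LLM_API_KEY` variable name, and the notice text are assumptions made for illustration, not what Locus actually reads or prints.

```python
from collections.abc import Callable, Mapping


def select_engine(env: Mapping[str, str], confirm: Callable[[str], bool]) -> str:
    """Pick an engine; fall back to the local one unless the user opts in."""
    # Hypothetical variable name; the real key name may differ.
    key = env.get("LLM_API_KEY", "").strip()
    if not key:
        # No key configured: never ask, never leave the machine.
        return "deterministic"
    notice = "The LLM engine sends document text to an external provider. Continue?"
    # A key alone is not consent; the user must confirm the notice.
    return "llm" if confirm(notice) else "deterministic"


# The confirm callback is injected so the gate can be tested without a prompt.
print(select_engine({}, confirm=lambda notice: True))
print(select_engine({"LLM_API_KEY": "sk-example"}, confirm=lambda notice: False))
print(select_engine({"LLM_API_KEY": "sk-example"}, confirm=lambda notice: True))
```

Passing the environment and the confirmation prompt as arguments keeps the decision a pure function, so the default of staying local is easy to verify.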

## Documentation

Detailed design lives in the spec documents:

- Layer 1 engine — [`.kiro/specs/unstructured-to-tabular-etl/requirements.md`](.kiro/specs/unstructured-to-tabular-etl/requirements.md)
- Layer 2 runtime — [`.kiro/specs/locus-image-runtime/requirements.md`](.kiro/specs/locus-image-runtime/requirements.md)
- Image catalog (planned images + build order) — [`.kiro/specs/locus-image-runtime/image-catalog.md`](.kiro/specs/locus-image-runtime/image-catalog.md)
- Architecture & decision log — [`.kiro/specs/unstructured-to-tabular-etl/architecture-notes.md`](.kiro/specs/unstructured-to-tabular-etl/architecture-notes.md)

## Status

**Layer 1 engine: feature-complete.** **Layer 2 runtime: feature-complete (all 12 build stages done).**

- **Engine.** Raw corpus → validated, source-grounded table with cell-level provenance; a deterministic default engine and an opt-in guardrailed LLM engine; cleaning/dedup; human-in-the-loop review.
- **I/O.** File/HTTP/REST/SQL connectors with CSV/PDF/HTML/records parsers and DataFrame/Parquet/SQL emitters.
- **CLI.** The `locus` CLI runs single images and multi-stage pipelines (typed DAG with static type-check + cross-stage provenance), builds/publishes/pulls images via a local registry, and serves a local result UI with the provenance viewer.
- **Quality.** 208 tests, CI on Python 3.11/3.12 (ruff + mypy strict + pytest).

```bash
pip install locus-etl        # CLI command is `locus`; extras: [pdf] [llm] [serve] [oci] ...
locus init                   # add .env to .gitignore
locus catalog list           # see the official image catalog
locus run locusfile.yaml     # run a pipeline, get a grounded table
locus run locusfile.yaml --serve --port 8080   # preview UI with provenance
locus hub                    # browse the image catalog in a local web UI
# image lifecycle subcommands: locus build, locus push, locus pull, locus search, locus inspect
```

Remaining work is the official image catalog and a hub-side discovery index. The
**OCI/Harbor registry backend is implemented** (`OrasImageStore`): set `LOCUS_REGISTRY`
(and optionally `LOCUS_NAMESPACE`) to push/pull/inspect against Harbor, GHCR, ECR, or any
OCI registry; otherwise a local filesystem registry is the zero-config default.

The engine can also be driven directly from Python:

```python
from locus_engine import (
    Pipeline, PipelineConfig, PluginRegistry,
    FileConnector, CsvParser, Connector, Parser, SourceRef,
)

# Register each plugin under the interface it implements.
registry = PluginRegistry()
registry.register(FileConnector(), Connector)
registry.register(CsvParser(), Parser)

# Load the pipeline config from a plain dict and bind it to the registry.
config = PipelineConfig.load({"source": {"type": "files", "path": "./data"}})
pipeline = Pipeline(config, registry)

# Run the pipeline on one source file, then emit the result as a table.
out = pipeline.run([SourceRef(uri="./data/invoices.csv", kind="file")])
frame = pipeline.emit(out)          # pandas DataFrame with a _lineage column
print(frame)
```

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md).

## License

[MIT](LICENSE) © 2026 Dibae101
