Metadata-Version: 2.4
Name: spindle-eval
Version: 0.1.0
Summary: Pipeline-agnostic evaluation and observability for knowledge graph, RAG, and KOS pipelines
Author: Spindle Team
License: MIT
Project-URL: Homepage, https://github.com/danielkentwood/spindle-eval
Project-URL: Documentation, https://github.com/danielkentwood/spindle-eval#readme
Project-URL: Repository, https://github.com/danielkentwood/spindle-eval
Project-URL: Changelog, https://github.com/danielkentwood/spindle-eval/releases
Keywords: graph-rag,evaluation,mlflow,experimentation,rag,knowledge-graph,hydra
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: mlflow>=3.0
Requires-Dist: hydra-core>=1.3
Requires-Dist: omegaconf>=2.3
Requires-Dist: ragas>=0.2
Requires-Dist: optuna<3.0,>=2.10
Requires-Dist: hydra-optuna-sweeper>=1.2
Requires-Dist: langfuse>=2.0
Requires-Dist: opentelemetry-sdk>=1.20
Requires-Dist: opentelemetry-exporter-otlp>=1.20
Requires-Dist: scipy>=1.10
Requires-Dist: numpy>=1.24
Requires-Dist: scikit-learn>=1.3
Requires-Dist: rank-bm25>=0.2
Requires-Dist: sentence-transformers>=2.2
Requires-Dist: rdflib>=7.0
Requires-Dist: pyshacl>=0.25
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: ruff>=0.1; extra == "dev"
Requires-Dist: build>=1.0; extra == "dev"
Requires-Dist: twine>=5.0; extra == "dev"
Provides-Extra: spindle
Requires-Dist: spindle>=0.1.0; extra == "spindle"
Dynamic: license-file

# spindle-eval

Pipeline-agnostic evaluation and observability framework for knowledge graph, RAG, and KOS pipelines. spindle-eval wraps any pipeline defined as a sequence of `Stage` objects and adds structured experiment tracking, automated metrics, parameter sweeps, quality gates, baseline comparisons, and CI/CD regression detection.

Originally built for [spindle](https://github.com/danielkentwood/spindle) (a Graph RAG pipeline), spindle-eval is designed to evaluate **any** pipeline — full end-to-end systems, individual stages, or partial subsets.

## Why spindle-eval?

Multi-stage pipelines have many interacting parameters. Tuning them requires more than ad-hoc scripts. spindle-eval provides:

- **Stage-gated evaluation** — each stage must meet quality thresholds before downstream stages run, enforcing upstream-first optimization
- **Pipeline-agnostic execution** — define stages with the `Stage` protocol, wire them with `StageDef`, run them with `PipelineExecutor`
- **Composable configs** — Hydra config groups for every pipeline aspect, enabling single runs or multi-dimensional parameter sweeps
- **Multiple tracking backends** — MLflow for experiments, file-based for CI, composite for multi-backend, no-op for benchmarking
- **Structured events** — thread-safe event store with duration analysis, token tracking, and error filtering
- **KOS metrics** — intrinsic quality metrics for SKOS taxonomies and OWL ontologies (taxonomy depth, label quality, SHACL conformance, etc.)
- **Automated regression detection** — CI compares metrics against baselines with bootstrap confidence intervals
- **Golden dataset management** — versioned evaluation datasets with a question-type taxonomy and extensible reference fields for extraction and KOS evaluation

## Architecture overview

```
                    ┌─────────────────────────────┐
                    │     Hydra Configuration     │
                    │ (composable YAML per stage) │
                    └──────────────┬──────────────┘
                                   │
                    ┌──────────────▼──────────────┐
                    │     spindle-eval runner     │
                    │ (discovery + orchestration) │
                    └──────────────┬──────────────┘
                                   │
                    ┌──────────────▼──────────────┐
                    │      PipelineExecutor       │
                    │   (stage wiring, metrics,   │
                    │    gates, event logging)    │
                    └──────────────┬──────────────┘
                                   │
          ┌────────────┬───────────┼───────────┬────────────┐
          ▼            ▼           ▼           ▼            ▼
      Stage 1      Stage 2     Stage 3     Stage N    Metric fns
      (any)        (any)       (any)       (any)     (attached)
          │            │           │           │            │
          └────────────┴───────────┴───────────┴────────────┘
                                   │
                    ┌──────────────▼──────────────┐
                    │       Tracker backends      │
                    ├──────────┬─────────┬────────┤
                    ▼          ▼         ▼        ▼
                 MLflow     File     Langfuse   No-op
              (experiments) (JSON)  (traces)  (benchmarks)
```

## Installation

```bash
pip install spindle-eval
```
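
The package also declares a `spindle` extra that installs the spindle pipeline alongside it:

```bash
pip install "spindle-eval[spindle]"
```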

For co-development with a pipeline package (editable install):

```bash
pip install -e ".[dev]"
pip install -e /path/to/your-pipeline
```

## Quick start

### Full pipeline evaluation

```bash
# Single evaluation run
python -m spindle_eval.runner retrieval=hybrid generation=claude evaluation=quick

# Parameter sweep
python -m spindle_eval.runner --multirun \
  preprocessing.chunk_size=256,512,1024 \
  retrieval.top_k=5,10,20
```

### Evaluate a single stage

```python
from spindle_eval.pipeline import PipelineExecutor
from spindle_eval.protocols import StageDef, StageResult
from spindle_eval.tracking import create_tracker
from spindle_eval.metrics.chunk_metrics import boundary_coherence, size_distribution

class MyChunker:
    name = "chunking"
    def run(self, inputs, cfg):
        chunks = do_chunking(cfg)  # your own chunking logic
        return StageResult(outputs={"chunks": chunks})

tracker = create_tracker("file", output_dir="./results")
stages = [
    StageDef(
        name="chunking",
        stage=MyChunker(),
        metrics=[boundary_coherence, size_distribution],
    ),
]
result = PipelineExecutor(tracker).execute(stages, cfg)  # cfg: your Hydra/OmegaConf config
tracker.end_run()
```

### Evaluate a KOS builder

```python
from spindle_eval.metrics.kos_metrics import taxonomy_depth, label_quality, orphan_concept_ratio

stages = [
    StageDef(
        name="taxonomy",
        stage=MyTaxonomyBuilder(),
        input_keys={"chunks": "preprocessing.chunks"},
        metrics=[taxonomy_depth, label_quality, orphan_concept_ratio],
        gate=lambda m: m.get("orphan_concept_ratio", 1.0) < 0.3,
    ),
]
```

## Configuration

Hydra config groups live in `spindle_eval/conf/` (packaged with the install) and compose together:

| Group | Options | Controls |
|---|---|---|
| `preprocessing` | `default`, `small_chunks`, `large_chunks` | Chunking strategy and size |
| `ontology` | `schema_first`, `schema_free`, `hybrid` | Entity/relation schema discovery |
| `extraction` | `llm`, `nlp`, `finetuned` | Triple extraction method |
| `retrieval` | `hybrid`, `local`, `global`, `drift` | Graph retrieval strategy |
| `generation` | `gpt4`, `claude`, `gemini` | LLM for answer generation |
| `evaluation` | `quick`, `full` | Number of evaluation examples |
| `sweep` | `none`, `er_threshold`, `retrieval`, `chunk_size` | Predefined sweep dimensions |

Pipeline packages can register additional config groups via Hydra's `SearchPathPlugin`. See [docs/hydra-config-conventions.md](docs/hydra-config-conventions.md).
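
As a rough sketch (the `my_pipeline` names below are placeholders, not part of spindle-eval), a plugin module placed under Hydra's `hydra_plugins` namespace package can append a pipeline's own `conf/` directory to the config search path:

```python
# hydra_plugins/my_pipeline_searchpath.py  (hypothetical module and package names)
from hydra.core.config_search_path import ConfigSearchPath
from hydra.plugins.search_path_plugin import SearchPathPlugin


class MyPipelineSearchPathPlugin(SearchPathPlugin):
    def manipulate_search_path(self, search_path: ConfigSearchPath) -> None:
        # Make my_pipeline/conf/* compose alongside spindle_eval's built-in groups.
        search_path.append(provider="my-pipeline", path="pkg://my_pipeline.conf")
```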

## Metrics

### RAG quality (via Ragas)
Faithfulness, context recall, context precision, answer correctness, answer relevancy.

### Graph quality
Connectivity, modularity, B-CUBED clustering, CEAF entity alignment, subgraph completeness.

### Extraction quality
Triple extraction precision, recall, and F1 — with configurable stage gates.

### KOS quality
Taxonomy depth/breadth, label quality, definition completeness, thesaurus connectivity, orphan ratio, axiom density, SHACL conformance. See [docs/kos-evaluation-guide.md](docs/kos-evaluation-guide.md).

### Chunk and provenance quality
Boundary coherence, size distribution, evidence span coverage.

### Statistical rigor
Bootstrap confidence intervals for all metrics, used for regression detection in CI.
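
Purely to illustrate the technique (this is not spindle-eval's own API), a percentile bootstrap over per-example scores looks roughly like:

```python
import numpy as np

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Mean of `scores` plus a (1 - alpha) percentile-bootstrap confidence interval."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    means = np.array([
        rng.choice(scores, size=scores.size, replace=True).mean()
        for _ in range(n_resamples)
    ])
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    # e.g. flag a regression when the candidate interval falls entirely below the baseline's
    return scores.mean(), (lo, hi)
```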

## Tracking backends

| Backend | Class | Use case |
|---------|-------|----------|
| MLflow | `MLflowTracker` | Production experiment tracking |
| File | `FileTracker` | Local development, CI |
| Langfuse | Via OpenTelemetry | Trace-level debugging |
| No-op | `NoOpTracker` | Benchmarking, unit tests |
| Composite | `CompositeTracker` | Fan out to multiple backends |

```python
from spindle_eval.tracking import create_tracker

tracker = create_tracker("mlflow")
tracker = create_tracker("file", output_dir="./results")
tracker = create_tracker("noop")
```
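
To fan out to several backends at once, a composite can be assembled from individually created trackers. The snippet below is a sketch only: it assumes `CompositeTracker` accepts an iterable of trackers, so check `spindle_eval.tracking` for the actual constructor (or factory key):

```python
from spindle_eval.tracking import CompositeTracker, create_tracker

# Assumption: CompositeTracker wraps an iterable of trackers and forwards every call.
tracker = CompositeTracker([
    create_tracker("mlflow"),                        # production experiment tracking
    create_tracker("file", output_dir="./results"),  # JSON artifacts for CI
])
```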

## Documentation

| Guide | Audience |
|-------|----------|
| [Spindle Developer Guide](docs/spindle-developer-guide.md) | Pipeline developers integrating with spindle-eval |
| [Custom Pipeline Guide](docs/custom-pipeline-guide.md) | Developers building non-spindle pipelines |
| [KOS Evaluation Guide](docs/kos-evaluation-guide.md) | Developers evaluating SKOS/OWL knowledge structures |
| [Hydra Config Conventions](docs/hydra-config-conventions.md) | Config authors and sweep designers |
| [Tracking Setup](docs/tracking_setup.md) | Setting up MLflow/Langfuse (GKE or local Docker) |
| [PyPI Publishing](docs/pypi-publish.md) | Building and uploading releases to PyPI |

## Requirements

- Python 3.10+
- Pipeline package (optional; if none is installed, mock stages are substituted, controlled via `runner.allow_mock_fallback`; see the example below)
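
For example, a run that opts into the mock fallback could pass the flag as a Hydra override (the exact override path is an assumption; adjust it to how `runner.allow_mock_fallback` is exposed in your config):

```bash
# Hypothetical override syntax
python -m spindle_eval.runner runner.allow_mock_fallback=true evaluation=quick
```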

## Project structure

```
spindle-eval/
├── src/spindle_eval/
│   ├── runner.py           # Hydra entrypoint, pipeline discovery
│   ├── pipeline.py         # PipelineExecutor (stage wiring, metrics, gates)
│   ├── protocols.py        # Stage, StageDef, StageResult, Tracker protocols
│   ├── compat.py           # Legacy component dict → StageDef adapter
│   ├── mocks.py            # Mock Stage implementations for testing
│   ├── metrics/            # Ragas, graph, extraction, KOS, chunk, provenance
│   ├── tracking/           # MLflow, file, noop, composite trackers
│   ├── events/             # Event store, duration/token/error analysis
│   ├── datasets/           # Golden dataset loading, KOS reference extraction
│   ├── baselines/          # Baseline runner implementations
│   ├── ci/                 # Regression detection, PR report generation
│   ├── production/         # Feedback loops, staleness monitoring
│   ├── conf/               # Hydra config groups (packaged for pip install)
│   └── golden_data/        # Default evaluation datasets (JSONL)
├── docs/                   # Developer guides
├── baselines/              # Baseline metric snapshots
└── tests/
```
