Metadata-Version: 2.4
Name: knowlytix-benchmark
Version: 0.0.2
Summary: Benchmark for structured information retrieval from financial documents using graph-verifiable questions
Project-URL: Homepage, https://github.com/knowlytix/gms
Project-URL: Issues, https://github.com/knowlytix/gms/issues
Author: Agus Sudjianto, Wingyan Lau
License-Expression: Apache-2.0
Keywords: benchmark,finance,knowledge-graph,llm,question-answering,structured-retrieval
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.12
Description-Content-Type: text/markdown

# knowlytix-benchmark

> Benchmark for **Struct**ured **R**etrieval from **Fin**ancial **Doc**uments.
> Auto-generates questions from document-graph topology and scores LLM
> predictions against a provably correct graph-traversal baseline.
> (Internally referred to as the FinStructBench module — `knowlytix.benchmark.*`.)

`knowlytix-benchmark` is one of four packages in the [Geometric Memory Systems][gms-repo]
family. Use it to answer questions like *"how close does my RAG pipeline get
to the graph-verified ground truth on this financial report?"* — with metrics
that resist gaming because the ground truth comes from graph operations, not
a held-out human-labeled test set.

- **Package**: `knowlytix-benchmark`
- **License**: Apache-2.0
- **Python**: 3.12+
- **Status**: alpha (v0.x)
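
To make the "graph-verified ground truth" idea concrete, here is a toy sketch. It is not the package's question generator or its `DocumentGraph` type: the graph, the node names, and the `reachable` helper are all hypothetical and use only the standard library. It shows a question whose answer is computed by traversing the graph rather than taken from a human label.

```python
from collections import deque

# Hypothetical toy document graph: each node maps to the nodes it references.
toy_graph = {
    "report": ["capital_section", "risk_section"],
    "capital_section": ["table_cet1"],
    "risk_section": ["table_cet1", "table_var"],
    "table_cet1": [],
    "table_var": [],
}


def reachable(graph, start):
    """Return every node reachable from `start` by following edges (BFS)."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph[node]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# The question is derived from the topology, and so is its answer.
question = "Which tables does risk_section reference?"
ground_truth = sorted(reachable(toy_graph, "risk_section"))
print(question, "->", ground_truth)
```

Because the expected answer falls out of a deterministic traversal, a prediction can be checked against it without a hand-labeled answer key.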

## Install

```bash
pip install knowlytix-benchmark
```

Depends on [`knowlytix-core`][knowlytix-core-pypi] (pinned `~=0.1.0`). LLM-mode
scoring routes through [LiteLLM][litellm]: set `GMS_LLM_MODEL` plus your
provider's API key, and any model that LiteLLM supports should work.

## Quickstart — score a prediction set

```python
import json
from importlib.resources import files

from knowlytix.benchmark import score_answer

# Smoke fixtures shipped with the wheel:
questions = json.loads((files("knowlytix.benchmark.fixtures.smoke") / "questions.json").read_text())
predictions = json.loads((files("knowlytix.benchmark.fixtures.smoke") / "predictions.json").read_text())

# Index predictions by question id so each question can look up its answer.
by_id = {p["id"]: p["answer"] for p in predictions["predictions"]}
for q in questions["questions"]:
    # Compare one prediction against that question's ground truth.
    result = score_answer(by_id[q["id"]], q["ground_truth"])
    mark = "correct" if result.correct else "wrong"
    print(f"{q['id']}: {mark}  partial={result.partial_score:.2f}  ({result.detail})")
```
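
The loop above only relies on a few keys: a top-level `"questions"` list whose items carry `"id"` and `"ground_truth"`, and a top-level `"predictions"` list whose items carry `"id"` and `"answer"`. The sketch below writes a prediction file in that shape and checks it for unanswered questions before scoring. The layout is inferred from the quickstart alone, so the shipped fixtures may carry more fields, and the ids and values here are made up.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical data in the shape the quickstart reads (values are made up).
questions = {
    "questions": [
        {"id": "q1", "ground_truth": 42.0},
        {"id": "q2", "ground_truth": "Tier 1"},
    ]
}
predictions = {"predictions": [{"id": "q1", "answer": 42.0}]}

# Round-trip the predictions through a JSON file, as a pipeline would.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "predictions.json"
    path.write_text(json.dumps(predictions, indent=2))
    loaded = json.loads(path.read_text())

# Find questions without a prediction; the quickstart's by_id[...] lookup
# would raise KeyError for these.
by_id = {p["id"]: p["answer"] for p in loaded["predictions"]}
missing = [q["id"] for q in questions["questions"] if q["id"] not in by_id]
print("unanswered question ids:", missing)
```

Running a check like this first avoids a `KeyError` partway through a scoring run when a pipeline skips a question.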

## Quickstart — run the full benchmark

```python
from knowlytix.benchmark import Benchmark, get_instance_path

bench = Benchmark(get_instance_path("model_validation"))
result = bench.run()          # graph-only mode (no LLM, no API key needed)
bench.print_results(result)
```

To evaluate an LLM against the same ground-truth graph:

```python
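from knowlytix.benchmark import Benchmark, get_instance_path
from knowlytix.benchmark.llm_caller import create_client

bench = Benchmark(get_instance_path("model_validation"))
client = create_client()       # reads GMS_LLM_MODEL_SCORER, falling back to GMS_LLM_MODEL
result = bench.run(llm_client=client)
bench.print_results(result)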
```

## CLI

```bash
benchmark --instance model_validation
benchmark --instance credit_portfolio --llm-model anthropic/claude-opus-4-6
```

## Configuration

### `FINSTRUCTBENCH_*` — scoring tolerances

| Variable | Default | Meaning |
|---|---|---|
| `FINSTRUCTBENCH_FLOAT_TOL` | `1e-6` | Absolute tolerance for float comparisons. |
| `FINSTRUCTBENCH_CLOSE_THRESHOLD` | `0.01` | Relative tolerance for "close enough" financial values. |
| `FINSTRUCTBENCH_TUPLE_ELEMENT_TOL` | `1e-3` | Tolerance per element inside tuple answers. |
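
The difference between the absolute and relative tolerances in the table can be sketched as follows. This is not the package's scoring code: the helper names are hypothetical, the constants simply mirror the documented defaults, and treating the tuple tolerance as an absolute per-element bound is an assumption.

```python
import math

# Constants mirror the documented defaults; they are not read from the package.
FLOAT_TOL = 1e-6          # FINSTRUCTBENCH_FLOAT_TOL
CLOSE_THRESHOLD = 0.01    # FINSTRUCTBENCH_CLOSE_THRESHOLD
TUPLE_ELEMENT_TOL = 1e-3  # FINSTRUCTBENCH_TUPLE_ELEMENT_TOL


def is_exact(pred, truth, abs_tol=FLOAT_TOL):
    """Absolute comparison: |pred - truth| must not exceed abs_tol."""
    return math.isclose(pred, truth, rel_tol=0.0, abs_tol=abs_tol)


def is_close(pred, truth, rel_tol=CLOSE_THRESHOLD):
    """Relative comparison: the error is measured against |truth|."""
    if truth == 0:
        return pred == 0
    return abs(pred - truth) / abs(truth) <= rel_tol


def tuple_matches(pred, truth, tol=TUPLE_ELEMENT_TOL):
    """Element-wise check (assumed absolute) for tuple-valued answers."""
    return len(pred) == len(truth) and all(
        abs(p - t) <= tol for p, t in zip(pred, truth)
    )


print(is_exact(0.1 + 0.2, 0.3))                  # float rounding noise only
print(is_exact(100.5, 100.0))                    # off by 0.5
print(is_close(100.5, 100.0))                    # off by 0.5 percent
print(is_close(102.0, 100.0))                    # off by 2 percent
print(tuple_matches((1.0004, 2.0), (1.0, 2.0)))  # each element within 1e-3
```

An absolute tolerance suits values that should match up to rounding noise, while a relative one scales with the magnitude of the figure being compared.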

### `GMS_LLM_*` — LLM routing (only needed for LLM-mode scoring)

| Variable | Meaning |
|---|---|
| `GMS_LLM_MODEL` | Base LiteLLM model string. |
| `GMS_LLM_MODEL_SCORER` | Override for scoring calls (recommended). |
| `GMS_LLM_TIMEOUT_SECONDS` | Per-call timeout. Default `60`. |

See `.env.example` in the source repo for the full provider key reference.
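
As a rough illustration of how these variables might resolve, the sketch below assumes that `GMS_LLM_MODEL_SCORER` takes precedence over `GMS_LLM_MODEL` and that the timeout falls back to `60` when unset. The helper functions are hypothetical and are not part of `knowlytix.benchmark`; `create_client()` may resolve its settings differently.

```python
import os
from collections.abc import Mapping
from typing import Optional


def resolve_scorer_model(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Prefer the scorer override, then the base model (assumed precedence)."""
    return env.get("GMS_LLM_MODEL_SCORER") or env.get("GMS_LLM_MODEL")


def resolve_timeout(env: Mapping[str, str] = os.environ, default: float = 60.0) -> float:
    """Parse the per-call timeout, using the documented default when unset."""
    raw = env.get("GMS_LLM_TIMEOUT_SECONDS")
    return float(raw) if raw else default


# A plain dict stands in for the process environment in this example.
example_env = {"GMS_LLM_MODEL": "anthropic/claude-opus-4-6"}
print(resolve_scorer_model(example_env))
print(resolve_timeout(example_env))
```

Passing the environment in as a mapping keeps the lookup easy to test without mutating `os.environ`.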

## Included benchmark instances

Five synthetic financial-domain instances ship with the wheel:

| Instance | Topic |
|---|---|
| `basel_capital` | Bank capital adequacy under Basel III |
| `credit_portfolio` | Credit risk portfolio analysis |
| `fair_lending` | Fair lending compliance testing |
| `model_validation` | Model validation report (largest) |
| `stress_test` | Stress testing scenarios |

All instances are synthetic: no real institution, person, or market event is depicted.

## Public API

```python
from knowlytix.benchmark import (
    Benchmark, BenchmarkResult,
    DocumentGraph, ENMEntry, ENMKey, PhaseEncoder,
    FinStructBenchSettings,
    GeneratedQuestion, ScoreResult, score_answer,
    default_generators, get_instance_path, ingest_markdown, list_instances,
)
```

`GeneratedQuestion` is a stable contract consumed by `knowlytix.harness.testing.bridge` —
don't rename without coordinating (see `CLAUDE.md` §Coding standards).

## Related packages

| Package | Role |
|---|---|
| [`knowlytix-core`][knowlytix-core-pypi] | Geometric memory engine (required runtime dep) |
| [`knowlytix-knowledge`][knowlytix-knowledge-pypi] | Document-graph ingest + query front-end |
| [`knowlytix-harness`][knowlytix-harness-pypi] | DOE-driven testing + runtime governance (consumes `GeneratedQuestion`) |

## Links

- Source: [knowlytix/gms][gms-repo]
- Paper: _FinStructBench: Benchmarking Structured Retrieval from Financial Documents_

[gms-repo]: https://github.com/knowlytix/gms
[knowlytix-core-pypi]: https://pypi.org/project/knowlytix-core/
[knowlytix-knowledge-pypi]: https://pypi.org/project/knowlytix-knowledge/
[knowlytix-harness-pypi]: https://pypi.org/project/knowlytix-harness/
[litellm]: https://docs.litellm.ai/
