Metadata-Version: 2.4
Name: llamella
Version: 0.1.0
Summary: LLM-free response quality scoring. Grade every response. No second LLM call.
Author-email: sunnyguntuka <sunnyguntuka.smi@gmail.com>
License: Apache-2.0
Project-URL: Homepage, https://github.com/sunnyguntuka/llamella
Project-URL: Repository, https://github.com/sunnyguntuka/llamella
Project-URL: Documentation, https://github.com/sunnyguntuka/llamella#readme
Project-URL: Issues, https://github.com/sunnyguntuka/llamella/issues
Project-URL: Changelog, https://github.com/sunnyguntuka/llamella/blob/main/CHANGELOG.md
Keywords: llm,evaluation,nlp,rag,quality-scoring,nli,hallucination-detection,groundedness,ai-evaluation
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Quality Assurance
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: sentence-transformers<4.0,>=3.0
Requires-Dist: numpy<3.0,>=1.24
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Requires-Dist: sqlalchemy>=2.0; extra == "dev"
Requires-Dist: cryptography>=41.0; extra == "dev"
Provides-Extra: ci
Requires-Dist: pytest; extra == "ci"
Requires-Dist: pytest-cov; extra == "ci"
Requires-Dist: ruff; extra == "ci"
Requires-Dist: sqlalchemy>=2.0; extra == "ci"
Requires-Dist: cryptography>=41.0; extra == "ci"
Requires-Dist: numpy<3.0,>=1.24; extra == "ci"
Provides-Extra: pandas
Requires-Dist: pandas; extra == "pandas"
Provides-Extra: security
Requires-Dist: cryptography>=41.0; extra == "security"
Provides-Extra: database
Requires-Dist: sqlalchemy>=2.0; extra == "database"
Provides-Extra: bench
Requires-Dist: scipy>=1.10; extra == "bench"
Requires-Dist: datasets>=2.14; extra == "bench"
Requires-Dist: tqdm>=4.65; extra == "bench"
Requires-Dist: matplotlib>=3.7; extra == "bench"
Requires-Dist: psutil>=5.9; extra == "bench"
Dynamic: license-file

<p align="center">
  <img src="assets/llamella-logo.png" alt="llamella - LLM-free response quality scoring" width="280">
</p>

<h3 align="center">LLM-free response quality scoring</h3>
<p align="center">Grade every response. No second LLM call. Zero cost. Deterministic.</p>

<p align="center">
  <a href="https://pypi.org/project/llamella/">
    <img src="https://img.shields.io/pypi/v/llamella" alt="PyPI">
  </a>
  <a href="https://github.com/sunnyguntuka/llamella/actions/workflows/tests.yml">
    <img src="https://github.com/sunnyguntuka/llamella/actions/workflows/tests.yml/badge.svg" alt="Tests">
  </a>
  <a href="https://codecov.io/gh/sunnyguntuka/llamella">
    <img src="https://codecov.io/gh/sunnyguntuka/llamella/graph/badge.svg?token=ETQ357WCNN" alt="Coverage">
  </a>
  <a href="LICENSE">
    <img src="https://img.shields.io/badge/license-Apache--2.0-blue" alt="License">
  </a>
  <a href="https://pypi.org/project/llamella/">
    <img src="https://img.shields.io/pypi/dm/llamella" alt="Downloads">
  </a>
</p>

<p align="center">
  <img src="assets/feature_icon_strip.svg" alt="Features: NLI scoring, evidence mapping, five metrics, IQS composite score, feedback loop, quality gate" width="680">
</p>


## Why llamella?

Teams deploying LLM agents and RAG systems can't manually review
every response. Existing tools use LLM-as-judge - a second LLM
call per evaluation - which costs $0.01–0.05/eval, takes 2–5s,
and gives non-deterministic results. **llamella** scores every
response locally using NLI models and embedding similarity.
Zero cost. Deterministic. 100% coverage.


## How it's different

<p align="center">
  <img src="assets/readme_approach_comparison.svg" alt="LLM-as-judge approach sends each response to GPT for evaluation; llamella scores locally using NLI cross-encoders and embedding similarity with no API call" width="680">
</p>

| Feature | llamella | DeepEval | RAGAS | TruthScore |
|---|---|---|---|---|
| Cost per eval | **$0.00** | $0.01–0.05 | $0.01–0.05 | Requires LLM |
| Latency (GPU) | **10–50ms** | 2–5s | 2–5s | 2–5s+ |
| Latency (CPU) | **600ms–2s** | 2–5s | 2–5s | 2–5s+ |
| LLM call required | **No** | Yes | Yes | Yes (claim decomposition) |
| Deterministic | **Yes** | No | No | No |
| Runs offline | **Yes** | No | No | Partial (Ollama) |
| Feedback loop | **Yes** | No | No | No |
| Metrics | **5 + composite** | 50+ (LLM-judged) | 4 (LLM-judged) | 1 |


## Quick start

```bash
pip install llamella
```

```python
from llamella import Auditor

auditor = Auditor()

result = auditor.score(
    query="What is our refund policy?",
    response="We offer a 30-day full refund at no extra cost.",
    context=["All customers are eligible for a 30-day full refund at no extra cost."]
)

print(result.iqs)           # 0.93 - composite Information Quality Score
print(result.groundedness)  # 0.97
print(result.flags)         # [] - no issues
```

Two convenience functions for one-off scoring:

```python
import llamella

# Returns full EntailmentResult
result = llamella.score(
    query="What is our refund policy?",
    response="We offer a 30-day full refund.",
    context=["30-day full refund at no extra cost."]
)

# Returns True if IQS >= threshold
passed = llamella.verify(
    query="What is our refund policy?",
    response="We offer a 30-day full refund.",
    context=["30-day full refund at no extra cost."],
    threshold=0.7
)
```

For repeated scoring, instantiate `Auditor` once and reuse it -
models are loaded once and cached.


## How it works

<p align="center">
  <img src="assets/readme_architecture_diagram.svg" alt="Architecture diagram: user query and LLM response enter the llamella scoring engine, which runs five parallel metrics (groundedness via NLI, completeness via embeddings, relevance via cosine similarity, consistency via pairwise NLI, confidence via regex), combines them into an IQS score, raises quality flags, and feeds corrections back as guardrails" width="680">
</p>

Every response flows through the scoring engine, which checks five
independent quality dimensions using NLI cross-encoders, embedding
similarity, and regex pattern matching - no LLM calls anywhere in
the pipeline. Responses that score below the threshold enter the
feedback loop, where corrections become guardrails for future responses.


## Metrics

Five independent quality dimensions, each scored 0–1:

| Metric | What it measures | How | Typical CPU latency |
|---|---|---|---|
| **Groundedness** | Is the response faithful to source context? | NLI cross-encoder per claim, batched | ~800ms |
| **Completeness** | Did the response address all parts of the query? | Embedding similarity per query segment | ~150ms |
| **Relevance** | Is the response on-topic? | Cosine similarity query↔response | ~100ms |
| **Consistency** | Does the response contradict itself? | Pairwise NLI between sentences (capped at 25) | ~400ms |
| **Confidence** | How assertive vs hedged is the response? | Regex pattern matching | <1ms |

> **Note:** Latencies are for CPU (DeBERTa-v3-base). Use `device="cuda"`
> for 10–50× speedup on GPU. First call also loads model weights (~5s).
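
The sentence cap on the consistency metric matters because pairwise comparison grows quadratically with response length. The sketch below only counts the pairs an NLI model would have to score. It assumes unordered pairs and simple truncation to the first 25 sentences, neither of which this README confirms, and `sentence_pairs` is a hypothetical helper, not part of llamella.

```python
from itertools import combinations


def sentence_pairs(sentences, max_sentences=25):
    """Return unordered sentence pairs after truncating to max_sentences.

    Hypothetical helper: truncation and unordered pairing are assumptions.
    """
    capped = sentences[:max_sentences]
    return list(combinations(capped, 2))


# n sentences yield n * (n - 1) / 2 unordered pairs.
print(len(sentence_pairs([f"s{i}" for i in range(10)])))   # 45
print(len(sentence_pairs([f"s{i}" for i in range(100)])))  # 300 (4950 without the cap)
```

Under these assumptions, the cap bounds the work at 300 NLI comparisons per response regardless of length.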


## IQS - the composite score

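IQS (Information Quality Score) is the weighted harmonic mean of all
five metrics. Harmonic mean penalizes low scores hard: a response with
0.95 groundedness but 0.1 completeness scores ~0.3 under the default
weights even if the other three metrics are perfect, where a weighted
arithmetic mean would give ~0.76.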

```
Default weights:
  groundedness  0.35    # most important - is it faithful?
  completeness  0.25    # did it answer the full question?
  relevance     0.20    # is it on topic?
  consistency   0.15    # does it contradict itself?
  confidence    0.05    # calibration check
```
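
To make the penalty concrete, here is a minimal sketch of a weighted harmonic mean using the default weights above. The function name and the `eps` floor are assumptions for illustration; how llamella itself handles zero scores is not stated in this README.

```python
DEFAULT_WEIGHTS = {
    "groundedness": 0.35,
    "completeness": 0.25,
    "relevance": 0.20,
    "consistency": 0.15,
    "confidence": 0.05,
}


def weighted_harmonic_mean(scores, weights, eps=1e-6):
    """Weighted harmonic mean: sum(w) / sum(w / s).

    eps is an assumed floor that avoids division by zero.
    """
    total_weight = sum(weights[name] for name in scores)
    denominator = sum(
        weights[name] / max(score, eps) for name, score in scores.items()
    )
    return total_weight / denominator


scores = {
    "groundedness": 0.95,
    "completeness": 0.10,
    "relevance": 1.0,
    "consistency": 1.0,
    "confidence": 1.0,
}
harmonic = weighted_harmonic_mean(scores, DEFAULT_WEIGHTS)
arithmetic = sum(DEFAULT_WEIGHTS[name] * score for name, score in scores.items())

print(f"{harmonic:.4f}")    # 0.3060
print(f"{arithmetic:.4f}")  # 0.7575
```

A single weak metric dominates the denominator, so one bad dimension cannot be averaged away by four good ones.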

When no context is provided, the groundedness weight is redistributed
proportionally across the other four metrics.
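
As a rough illustration of proportional redistribution, the sketch below drops one metric and rescales the remaining weights so they still sum to 1. `redistribute` is a hypothetical helper; only the default weights come from this README.

```python
def redistribute(weights, dropped="groundedness"):
    """Drop one metric and rescale the rest so they still sum to 1."""
    remaining = {name: w for name, w in weights.items() if name != dropped}
    total = sum(remaining.values())
    return {name: w / total for name, w in remaining.items()}


defaults = {
    "groundedness": 0.35,
    "completeness": 0.25,
    "relevance": 0.20,
    "consistency": 0.15,
    "confidence": 0.05,
}

no_context = redistribute(defaults)
for name, weight in no_context.items():
    print(f"{name}: {weight:.3f}")
# completeness: 0.385
# relevance: 0.308
# consistency: 0.231
# confidence: 0.077
```

Each remaining weight is divided by 0.65, so the metrics keep their relative importance.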


## Flags

**llamella** automatically flags specific quality issues:

| Flag | Condition |
|---|---|
| `hallucination_risk` | groundedness < 0.5 AND confidence > 0.7 |
| `off_topic` | relevance < 0.3 |
| `self_contradictory` | consistency < 0.7 |
| `incomplete` | completeness < 0.3 |
| `ungrounded` | groundedness < 0.3 |
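
The table translates directly into threshold checks. The sketch below is a hypothetical re-implementation for illustration, not llamella's code. It assumes that a groundedness of `None` (no context) raises neither groundedness-based flag.

```python
def compute_flags(scores):
    """Apply the thresholds from the flags table.

    Hypothetical helper; None groundedness is assumed to raise no
    groundedness-based flag.
    """
    flags = []
    groundedness = scores.get("groundedness")
    if groundedness is not None and groundedness < 0.5 and scores["confidence"] > 0.7:
        flags.append("hallucination_risk")
    if scores["relevance"] < 0.3:
        flags.append("off_topic")
    if scores["consistency"] < 0.7:
        flags.append("self_contradictory")
    if scores["completeness"] < 0.3:
        flags.append("incomplete")
    if groundedness is not None and groundedness < 0.3:
        flags.append("ungrounded")
    return flags


print(compute_flags({
    "groundedness": 0.2,
    "completeness": 0.8,
    "relevance": 0.9,
    "consistency": 0.95,
    "confidence": 0.9,
}))
# ['hallucination_risk', 'ungrounded']
```

Flags are not mutually exclusive: a confidently stated, poorly grounded response triggers both `hallucination_risk` and `ungrounded`.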


## No LLM anywhere

Unlike the LLM-as-judge tools compared above, **llamella** uses zero LLM calls:

- **No LLM for judging** - NLI cross-encoders evaluate entailment,
  not GPT-4
- **No LLM for claim extraction** - deterministic regex and sentence
  splitting, not a second model call
- **No LLM for scoring** - embedding similarity, not generated text
- **No API key required** - works offline, air-gapped, on a laptop

The only neural models used are a 350 MB NLI cross-encoder
(DeBERTa-v3-base) and a 90 MB sentence embedding model
(all-MiniLM-L6-v2). Both run locally on CPU or GPU.


## Configuration

```python
from llamella import Auditor

auditor = Auditor(
    nli_model="cross-encoder/nli-deberta-v3-base",
    embedding_model="all-MiniLM-L6-v2",
    device="cpu",                   # or "cuda"
    weights={
        "groundedness": 0.40,
        "completeness": 0.20,
        "relevance": 0.20,
        "consistency": 0.15,
        "confidence": 0.05,
    },
    entailment_threshold=0.5,
    coverage_threshold=0.45,
    contradiction_threshold=0.7,
    max_sentences=25,
    max_query_length=10_000,
    max_response_length=50_000,
    max_context_items=50,
    max_context_item_length=10_000,
    max_batch_size=1_000,
)
```

### Custom models

```python
from llamella import Auditor
from llamella.models import trust_model

# Add the custom model to the allowlist before loading it
trust_model("myorg/fine-tuned-nli")
auditor = Auditor(nli_model="myorg/fine-tuned-nli")
```


## No-context mode

Works without source context. Groundedness is skipped; IQS is
computed from the remaining metrics.

```python
result = auditor.score(
    query="Explain quantum computing",
    response="Quantum computing uses qubits that can be in superposition..."
)
print(result.groundedness)  # None
print(result.iqs)           # computed from remaining metrics
```


## Batch scoring

```python
results = auditor.score_batch([
    {"query": "...", "response": "...", "context": ["..."]},
    {"query": "...", "response": "..."},  # no context
])
```


## Agent registry

Route scoring through per-agent configuration while sharing one
model instance:

```python
from llamella import Auditor, AgentRegistry

auditor = Auditor()
registry = AgentRegistry(auditor)

registry.register("support_bot",
    weights={"groundedness": 0.45, "completeness": 0.20,
             "relevance": 0.15, "consistency": 0.15, "confidence": 0.05},
    iqs_threshold=0.8,
    context_required=True,
)

registry.register("code_assistant",
    weights={"completeness": 0.40, "relevance": 0.30,
             "groundedness": 0.10, "consistency": 0.15, "confidence": 0.05},
    iqs_threshold=0.7,
)

result = registry.score("support_bot",
    query="What is the refund policy?",
    response="We offer 30-day refunds.",
    context=["30-day refund policy..."],
)

stats = registry.get_stats("support_bot")
```

The registry is duck-type compatible with `Auditor` - pass it to
`sample_and_score` or `DatabaseConnector` directly.


## Sampling

Score a statistically meaningful subset instead of every response:

```python
from llamella import Auditor, sample_and_score

auditor = Auditor()
items = [
    {"query": "...", "response": "...", "context": ["..."]}
    for _ in range(50_000)
]

# Random sample
results = sample_and_score(
    auditor, items, strategy="random", sample_size=500, seed=42
)

# Auto-compute size for 95% confidence, ±3% margin
confidence_results = sample_and_score(
    auditor, items, strategy="confidence",
    confidence_level=0.95, margin_of_error=0.03, seed=42
)

# Summary of the random sample above
print(results.summary())
# Sampled 500/50000 (1.0%) using random strategy.
# Mean IQS: 0.872 (±0.041), 95% CI: [0.868, 0.876]
# Flags: hallucination_risk: 12 (2.4%), incomplete: 5 (1.0%)
```
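
The interval in the sample output is consistent with a normal-approximation confidence interval for the mean, if the ±0.041 figure is read as the sample standard deviation. That reading is an assumption, and `mean_confidence_interval` is a hypothetical helper used only to check the arithmetic.

```python
import math


def mean_confidence_interval(mean, std, n, z=1.96):
    """Normal-approximation CI for a sample mean: mean ± z * std / sqrt(n)."""
    half_width = z * std / math.sqrt(n)
    return mean - half_width, mean + half_width


low, high = mean_confidence_interval(mean=0.872, std=0.041, n=500)
print(f"[{low:.3f}, {high:.3f}]")  # [0.868, 0.876]
```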

Five strategies: `random`, `percentage`, `stratified`,
`confidence`, `priority`.
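
For a sense of what the `confidence` strategy might compute, the sketch below uses Cochran's sample-size formula with a finite population correction. Whether llamella uses this exact formula is an assumption, and `sample_size` is a hypothetical helper.

```python
import math
from statistics import NormalDist


def sample_size(population, confidence_level=0.95, margin_of_error=0.03, p=0.5):
    """Cochran's formula with finite population correction.

    p=0.5 is the most conservative choice (largest required sample).
    """
    # Two-sided z-score for the requested confidence level
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    n0 = (z ** 2) * p * (1 - p) / margin_of_error ** 2
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


print(sample_size(50_000))  # 1045
```

Under these assumptions, 50,000 responses need roughly a thousand scored samples for a ±3% margin at 95% confidence.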


## Database connector

Score responses directly from a SQL database:

```bash
pip install "llamella[database]"
```

```python
from llamella import Auditor
from llamella.connectors import DatabaseConnector

auditor = Auditor()

connector = DatabaseConnector(
    connection_string="postgresql://user:pass@localhost:5432/mydb",
    source_table="llm_responses",
    column_map={
        "query": "user_query",
        "response": "agent_response",
        "context": "rag_chunks",
    },
    result_table="llamella_scores",
)

connector.score_all(auditor)
connector.score_incremental(auditor, cursor_column="created_at")
connector.score_sampled(auditor, strategy="random", sample_size=500, seed=42)
```

Supports PostgreSQL, MySQL, SQLite, BigQuery, and Snowflake.


## Feedback loop

**llamella** doesn't just score - it learns. Flagged responses enter
a correction pipeline. Human-reviewed corrections are stored and
injected back into future prompts as guardrails, preventing the
same mistake twice.

```python
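from llamella import Auditor
from llamella.feedback.store import FeedbackStore, CorrectionRecord
from llamella.feedback.injector import GuardrailInjector

auditor = Auditor()
store = FeedbackStore("corrections.jsonl")
injector = GuardrailInjector(store)

# Example inputs; in production these come from your agent
query = "What is our refund policy?"
response = "We offer a 30-day full refund at no extra cost."
context = ["All customers are eligible for a 30-day full refund at no extra cost."]

result = auditor.score(query=query, response=response, context=context)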

if result.iqs < 0.7:
    store.add(CorrectionRecord(
        id="abc123",
        timestamp="2026-06-02T00:00:00Z",
        query=query,
        response=response,
        scores=result.to_dict(),
        flags=result.flags,
        correction="The correct answer...",
        reason="Why the original was wrong",
        context_used=context,
        corrected_by="human",
    ))

guardrails = injector.build_context(query=query, strategy="relevant")
system_prompt = f"You are a helpful agent.\n{guardrails}"
```

### Encrypted storage

```python
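# Requires: pip install "llamella[security]"
from cryptography.fernet import Fernet
from llamella.feedback.store import FeedbackStore

# Persist this key securely; records cannot be decrypted without it
key = Fernet.generate_key()
store = FeedbackStore("corrections.jsonl", encryption_key=key)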
```

### Data retention

```python
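from llamella.feedback.store import FeedbackStore

store = FeedbackStore(
    "corrections.jsonl",
    max_records=10_000,   # cap on stored records
    ttl_days=90,          # record time-to-live in days
)
store.delete("record-id")               # remove a single record by id
store.purge(before_date="2026-01-01")   # remove records dated before 2026-01-01
store.validate_integrity()              # check stored records for tampering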
```


## Performance

All numbers measured on CPU (Intel i7, single thread).
Use `device="cuda"` for GPU acceleration.

| Operation | CPU latency | Notes |
|---|---|---|
| `import llamella` | ~250ms | No model loaded at import |
| First `score()` call | ~5–6s | Model weights loaded and cached |
| Subsequent `score()`, no context | ~600ms | Embedding + regex only |
| `score()`, 1 context chunk | ~1s | +NLI inference |
| `score()`, 10 context chunks | ~2.5s | Batched NLI |
| `score()`, 50 context chunks | ~10s | Consider GPU for large context |
| `score_batch(100)` | ~60s | Sequential |

> GPU latency estimated at 10–50ms per response with preloaded models.
> Run `python -m benchmarks.run_all --only speed` on your hardware
> for actual numbers.

### Improved sentence splitting

**llamella** uses a regex sentence splitter by default. For better
accuracy on complex text, enable NLTK once after installation:

```bash
python -c "import llamella; llamella.setup_nltk()"
```


## Security

- **Model allowlist** - only pre-approved model names are loaded.
  Use `trust_model()` to authorize custom models.
- **Prompt injection protection** - `GuardrailInjector` sanitizes
  all feedback fields before system-prompt interpolation.
- **Encrypted feedback store** - pass `encryption_key=` (Fernet)
  to encrypt records at rest.
- **PII scrubbing** - SSNs, emails, phone numbers, and credit cards
  are masked before guardrail injection.
- **Input limits** - configurable length limits on all inputs
  prevent memory exhaustion.
- **Tamper detection** - per-record SHA-256 hashing with sequential
  numbering.
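
As an illustration of the tamper-detection idea, the sketch below seals each record with a sequence number and a SHA-256 digest, then verifies both. It is a simplified stand-in under stated assumptions: the record layout and the `seal` and `verify` helpers are hypothetical, and the actual on-disk format of `FeedbackStore` is not described in this README.

```python
import hashlib
import json


def _digest(seq, data):
    """SHA-256 over a canonical JSON encoding of the record and its number."""
    payload = json.dumps({"seq": seq, "data": data}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def seal(records):
    """Attach a sequence number and a SHA-256 digest to each record."""
    return [
        {"seq": seq, "data": record, "sha256": _digest(seq, record)}
        for seq, record in enumerate(records, start=1)
    ]


def verify(sealed):
    """Return True only if numbering is gapless and every digest matches."""
    for expected_seq, entry in enumerate(sealed, start=1):
        if entry["seq"] != expected_seq:
            return False  # a record was deleted, inserted, or reordered
        if _digest(entry["seq"], entry["data"]) != entry["sha256"]:
            return False  # record content was modified
    return True


log = seal([{"correction": "30-day refund"}, {"correction": "No restocking fee"}])
print(verify(log))  # True

log[0]["data"]["correction"] = "90-day refund"
print(verify(log))  # False
```

The sequence numbers catch deleted or reordered records, while the digests catch edits to record content. An unkeyed hash like this one does not stop someone who can also rewrite the digests.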


## Install

```bash
pip install llamella
```

Optional extras:

```bash
pip install "llamella[security]"   # encrypted feedback storage
pip install "llamella[database]"   # database connector (SQLAlchemy)
pip install "llamella[bench]"      # benchmark suite dependencies
pip install "llamella[dev]"        # development tools (pytest, ruff)
```


## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for development setup,
coding guidelines, and the PR process.

```bash
git clone https://github.com/sunnyguntuka/llamella
cd llamella
pip install -e ".[dev]"
pytest tests/ -v
ruff check src/ tests/
```


## Changelog

See [CHANGELOG.md](CHANGELOG.md).


## License

Apache-2.0 - see [LICENSE](LICENSE).
