Metadata-Version: 2.4
Name: slotloss
Version: 0.1.0
Summary: Per-grammar-role loss decomposition for fine-tuned structured JSON output
Author-email: Breck Baldwin <breckbaldwin@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/breckbaldwin/slotloss
Project-URL: Paper, https://arxiv.org/abs/XXXX.XXXXX
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: torch>=2.0
Requires-Dist: transformers>=4.40
Provides-Extra: peft
Requires-Dist: peft>=0.10; extra == "peft"

# slotloss

Per-grammar-role loss decomposition for fine-tuned structured JSON output.

**Fine-tuning your LLM for JSON output? Your aggregate metrics might be lying to you.**

`slotloss` decomposes fine-tuning loss by grammar role (structural tokens, schema keys, enum values, booleans, free text) and compares baseline vs fine-tuned performance. It reveals per-role regressions that aggregate metrics hide.

## The Problem

Standard LoRA fine-tuning + grammar-constrained decoding produces valid JSON at all scales. Aggregate loss improves. Everything looks great.

But at 32B parameters, fine-tuning can **degrade** specific grammar roles while aggregate loss improves:

```
slotloss: Per-Grammar-Role Loss Report
======================================================================

Role             Baseline   Fine-tuned     Change  Status
----------------------------------------------------------------------
STRUCTURAL         5.3298       0.0002     -100.0%  OK (-100%)
KEY                0.4736       0.0001     -100.0%  OK (-100%)
ENUM_VALUE         0.3313       0.3029       -8.6%  OK (-9%)
BOOLEAN            0.4568       1.0498     +129.8%  !! REGRESSION (+130%)
FREE_TEXT          1.3287       0.6289      -52.7%  OK (-53%)
----------------------------------------------------------------------
TOTAL              0.5544       0.1742      -68.6%

WARNING: 1 grammar role(s) REGRESSED after fine-tuning:
  BOOLEAN: 0.4568 -> 1.0498 (+130%)

Your model may be memorizing majority values for constrained fields.
```

Aggregate loss dropped 69%, yet BOOLEAN loss rose 130%. Without `slotloss`, you'd never know.
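
The Change column can be reproduced by hand. The sketch below recomputes it from the losses in the report above. Flagging any positive change as a regression is an assumption made for illustration; the threshold `slotloss` actually applies is not stated here.

```python
def percent_change(baseline: float, finetuned: float) -> float:
    """Relative change in loss, in percent; positive means the loss went up."""
    return (finetuned - baseline) / baseline * 100.0


# (baseline, fine-tuned) losses copied from the report above.
report_rows = {
    "STRUCTURAL": (5.3298, 0.0002),
    "KEY": (0.4736, 0.0001),
    "ENUM_VALUE": (0.3313, 0.3029),
    "BOOLEAN": (0.4568, 1.0498),
    "FREE_TEXT": (1.3287, 0.6289),
    "TOTAL": (0.5544, 0.1742),
}

changes = {role: percent_change(b, f) for role, (b, f) in report_rows.items()}

# Assumption: a role counts as regressed when its loss increased at all.
regressed = [role for role, change in changes.items() if change > 0]

for role, change in changes.items():
    print(f"{role:<12} {change:+7.1f}%")
print("Regressed:", regressed)
```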

## Install

```bash
pip install slotloss
```

## Usage

### Command Line

```bash
# Compare baseline vs fine-tuned
slotloss --model Qwen/Qwen2.5-7B-Instruct \
    --checkpoint my_lora/ \
    --schema schema.json \
    --data test.jsonl \
    --device cuda

# Baseline only
slotloss --model Qwen/Qwen2.5-7B-Instruct \
    --schema schema.json \
    --data test.jsonl

# Save JSON report
slotloss --model Qwen/Qwen2.5-7B-Instruct \
    --checkpoint my_lora/ \
    --schema schema.json \
    --data test.jsonl \
    --output report.json
```

The exit code is 1 if any regression is detected and 0 otherwise, so the command can gate a CI/CD pipeline.
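
One way to use the exit code is a small wrapper script that a CI job calls. This is a minimal sketch: `run_gate` and `SLOTLOSS_CMD` are hypothetical names, and the flags and placeholder paths are the ones from the examples above.

```python
import subprocess
import sys


def run_gate(cmd: list[str]) -> bool:
    """Run a command and return True only if it exited with code 0."""
    return subprocess.run(cmd).returncode == 0


# The comparison command from the examples above; paths are placeholders.
SLOTLOSS_CMD = [
    "slotloss",
    "--model", "Qwen/Qwen2.5-7B-Instruct",
    "--checkpoint", "my_lora/",
    "--schema", "schema.json",
    "--data", "test.jsonl",
]

# In a CI script, fail the job when slotloss reports a regression:
#     sys.exit(0 if run_gate(SLOTLOSS_CMD) else 1)
```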

### Python API

```python
from slotloss import analyze

report = analyze(
    model_name="Qwen/Qwen2.5-7B-Instruct",
    checkpoint="my_lora/",
    schema="schema.json",
    data="test.jsonl",
    device="cuda",
)

print(report)  # formatted report with regression warnings

# Programmatic access
for comp in report.comparisons:
    print(f"{comp.role}: {comp.baseline_loss:.4f} -> {comp.finetuned_loss:.4f} ({comp.status})")

if report.regressions:
    print(f"REGRESSIONS: {[r.role for r in report.regressions]}")
```

### Low-Level API

```python
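import json
from slotloss import GrammarRole, assign_grammar_roles

# Assumption: `schema` is the parsed JSON Schema (a dict), not a file path
with open("schema.json") as f:
    schema = json.load(f)

# Assign grammar roles to any JSON string
roles = assign_grammar_roles('{"city": "NYC", "cuisine": "Italian"}', schema)
# [STRUCTURAL, QUOTE, KEY, KEY, KEY, KEY, QUOTE, STRUCTURAL, ...]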
```

## Data Format

Test data is JSONL: one JSON object per line, with `prompt` and `target_json` fields. `target_json` holds the expected output as a JSON-encoded string:

```json
{"prompt": "Extract restaurant info...", "target_json": "{\"city\": \"NYC\"}"}
```
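
Because `target_json` is a string rather than a nested object, the target has to be serialized twice when writing the file. A minimal sketch using only the standard library (`to_jsonl_line` is a hypothetical helper, not part of `slotloss`):

```python
import json


def to_jsonl_line(prompt: str, target: dict) -> str:
    """Serialize one example; the target is encoded as a JSON string first."""
    return json.dumps({"prompt": prompt, "target_json": json.dumps(target)})


line = to_jsonl_line("Extract restaurant info...", {"city": "NYC"})
print(line)

# Reading it back takes two decoding steps as well.
record = json.loads(line)
target = json.loads(record["target_json"])
```

To build a full test file, write one such line per example, each terminated by a newline.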

The schema is standard JSON Schema. In this example, the boolean field `has_wifi` is encoded as a string enum:

```json
{
  "type": "object",
  "properties": {
    "city": {"type": "string"},
    "cuisine": {"type": "string", "enum": ["Mexican", "Italian"]},
    "has_wifi": {"type": "string", "enum": ["True", "False"]}
  }
}
```
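
To see how a schema like this could determine the role of each field's value, here is a rough sketch. The mapping rules in `field_role` are assumptions made for illustration (in particular, treating an enum of exactly `"True"` and `"False"` as BOOLEAN); they are not taken from the `slotloss` source.

```python
schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "cuisine": {"type": "string", "enum": ["Mexican", "Italian"]},
        "has_wifi": {"type": "string", "enum": ["True", "False"]},
    },
}


def field_role(field_schema: dict) -> str:
    """Guess the grammar role of a field's value from its schema (assumed rules)."""
    enum = field_schema.get("enum")
    if enum is not None:
        # Assumption: a two-valued True/False enum is a boolean in disguise.
        if set(enum) == {"True", "False"}:
            return "BOOLEAN"
        return "ENUM_VALUE"
    if field_schema.get("type") in ("number", "integer"):
        return "NUMBER"
    return "FREE_TEXT"


value_roles = {name: field_role(spec) for name, spec in schema["properties"].items()}
print(value_roles)
```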

## Grammar Roles

| Role | Description | Examples |
|------|-------------|----------|
| STRUCTURAL | JSON syntax | `{` `}` `[` `]` `:` `,` |
| QUOTE | String delimiters | `"` |
| KEY | Object key characters | `city`, `cuisine` |
| ENUM_VALUE | Categorical values | `Italian`, `Economy` |
| BOOLEAN | Boolean strings | `True`, `False` |
| NUMBER | Numeric characters | `42`, `3.14` |
| FREE_TEXT | Non-categorical content | names, addresses |
| WHITESPACE | Formatting | spaces, newlines |
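
The table can be made concrete with a toy character-level tagger. This is a simplified sketch, not the `slotloss` implementation: it only handles flat objects whose strings contain no escape sequences, it rejects bare literals such as `true` or `null`, and `tag_characters` and `value_roles` are hypothetical names.

```python
def tag_characters(text: str, value_roles: dict[str, str]) -> list[str]:
    """Label every character of a flat JSON object with a grammar role."""
    roles = []
    in_string = False
    expecting_key = True      # the next string is a key until a ':' is seen
    string_role = "KEY"
    current_key = ""
    buffer = ""
    for ch in text:
        if ch == '"':
            roles.append("QUOTE")
            if in_string:
                if string_role == "KEY":
                    current_key = buffer  # remember which field comes next
            else:
                buffer = ""
                if expecting_key:
                    string_role = "KEY"
                else:
                    string_role = value_roles.get(current_key, "FREE_TEXT")
            in_string = not in_string
        elif in_string:
            roles.append(string_role)
            buffer += ch
        elif ch in "{}[]:,":
            roles.append("STRUCTURAL")
            if ch == ":":
                expecting_key = False
            elif ch in "{,":
                expecting_key = True
        elif ch.isspace():
            roles.append("WHITESPACE")
        elif ch in "0123456789+-.eE":
            roles.append("NUMBER")
        else:
            raise ValueError(f"unsupported character outside a string: {ch!r}")
    return roles


text = '{"city": "NYC", "cuisine": "Italian"}'
roles = tag_characters(text, {"cuisine": "ENUM_VALUE"})
print(roles[:8])
```

Per-role loss would then require mapping these character labels onto model tokens, which this sketch leaves out.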

## Why Regressions Happen

Fine-tuning on small datasets biases the model toward training-set patterns. Structural tokens (trivial decisions) improve massively, dominating the aggregate gradient. Constrained fields like booleans and enums (genuine decisions) can overfit to majority values. Aggregate loss improves because the large gains on trivial roles outweigh the regression on substantive roles.
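
The arithmetic behind this is easy to check. The sketch below assumes the aggregate is a token-weighted mean of per-role losses; all numbers are made up for illustration and are not results from the paper.

```python
# Hypothetical token counts and per-role mean losses (illustrative only).
token_counts = {"STRUCTURAL": 90, "BOOLEAN": 10}
baseline = {"STRUCTURAL": 2.0, "BOOLEAN": 0.5}
finetuned = {"STRUCTURAL": 0.1, "BOOLEAN": 1.0}


def aggregate(losses: dict[str, float], counts: dict[str, int]) -> float:
    """Token-weighted mean of per-role losses."""
    total_tokens = sum(counts.values())
    return sum(losses[role] * counts[role] for role in counts) / total_tokens


agg_baseline = aggregate(baseline, token_counts)
agg_finetuned = aggregate(finetuned, token_counts)

print(f"aggregate: {agg_baseline:.2f} -> {agg_finetuned:.2f}")
print(f"BOOLEAN:   {baseline['BOOLEAN']:.2f} -> {finetuned['BOOLEAN']:.2f}")
```

With these numbers the aggregate falls by roughly 90% even though BOOLEAN loss doubles, because BOOLEAN tokens make up only a tenth of the sequence.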

The regression emerges at scale: larger pretrained models have stronger existing competencies that fine-tuning can disrupt. The better the base model already is at a grammar role, the more fine-tuning has to lose.

## Paper

Baldwin (2026), "Valid JSON, Wrong Answer: Fine-Tuning Degrades Grammar-Role Performance at Scale Despite Improved Aggregate Loss."

## License

MIT
