Metadata-Version: 2.4
Name: hard-needle
Version: 0.1.2
Summary: Semantically hard multi-needle long-context data generator
Author: hard-needle contributors
License: MIT
Project-URL: Homepage, https://github.com/denial-web/hard-needle
Project-URL: Repository, https://github.com/denial-web/hard-needle
Project-URL: Issues, https://github.com/denial-web/hard-needle/issues
Keywords: long-context,needle-in-a-haystack,evaluation,retrieval,rag,dataset,synthetic-data
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: datasets
Requires-Dist: datasets>=2.0.0; extra == "datasets"
Provides-Extra: tokenizer
Requires-Dist: transformers>=4.30.0; extra == "tokenizer"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Dynamic: license-file

# hard-needle

[![CI](https://github.com/denial-web/hard-needle/actions/workflows/ci.yml/badge.svg)](https://github.com/denial-web/hard-needle/actions/workflows/ci.yml)
[![PyPI](https://img.shields.io/pypi/v/hard-needle.svg)](https://pypi.org/project/hard-needle/)
[![Python](https://img.shields.io/pypi/pyversions/hard-needle.svg)](https://pypi.org/project/hard-needle/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)

**Stop testing long-context LLMs with random passwords.** `hard-needle` generates haystacks where multiple confusable facts share the same template, so the model has to actually disambiguate by entity instead of pattern-matching a unique token.

```bash
pip install hard-needle
hard-needle-generate --num-examples 200 --num-needles 6 --ctx-chars 12000 --output eval.jsonl
```

## What you get

Each example places **multiple semantically similar facts** in a haystack and asks about one of them:

```
Document:
... [haystack of distractor sentences] ...
The special project code for Aurora is ATL-7704.
... [more distractors] ...
The special project code for Aegis is ANV-5503.
... [more distractors] ...
The special project code for Apollo is ATL-7701.
... [more distractors] ...

Question: What is the special project code for Apollo?
Answer: ATL-7701
```

A model that "just remembers there was a project code in the document" gets it wrong. It has to **bind the right code to the right project name**.

## Why it matters

Most public Needle-in-a-Haystack benchmarks insert one obvious sentence (`The magic password is 7XQ32B`) into Paul Graham essays. Modern LLMs ace this with shallow attention because the needle has unique surface form. Real long-context tasks — reading meeting notes, parsing legal documents, multi-hop QA — almost never look like that.

`hard-needle` gives you:

| | Standard NIH | `hard-needle` |
|---|---|---|
| Distractors | Generic prose | Semantically similar facts (multiple project codes, multiple deadlines, etc.) |
| Disambiguation | None — needle is unique | Required — model must bind value to entity |
| Eval pool isolation | N/A | Disjoint `default` / `unseen` entity pools to detect memorization |
| Output | Plain text | Structured `needle_records` with type, entity, value, char position, depth fraction |
| Negatives | None | Optional paired `corrupt_example` for contrastive eval |

Designed for **honest long-context evaluation**, **contrastive training data**, and **lost-in-the-middle studies with realistic confusion**.

## Quickstart (Python)

```python
from hard_needle import generate_hard_example, generate_dataset

ex = generate_hard_example(num_needles=3, ctx_chars=8000, seed=42)
print(ex["prompt"])           # full input prompt
print(ex["target"])           # gold answer
for r in ex["needle_records"]:
    print(r["type"], r["entity"], "->", r["value"], f"(depth={r['depth_frac']:.2f})")

ds = generate_dataset(
    num_examples=500,
    num_needles=6,
    ctx_chars=12000,
    pool_set="default",       # or "unseen" for held-out generalization eval
    include_corrupted=True,
    corruption_ratio=0.2,
    seed=0,
)
```
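
Scoring against this dataset is exact match on `target`. A minimal scoring sketch; the `exact_match_score` helper and the stub model are our own illustration, not part of the package (only the `prompt`/`target` keys come from the schema):

```python
def exact_match_score(examples, predict):
    """Score a model callable `predict(prompt) -> str` by exact match on `target`."""
    if not examples:
        return 0.0
    correct = 0
    for ex in examples:
        # Normalize trivial whitespace; anything stricter is up to you.
        correct += predict(ex["prompt"]).strip() == ex["target"].strip()
    return correct / len(examples)

# Toy usage with a stub "model" that always answers the same code:
examples = [
    {"prompt": "What is the special project code for Apollo?", "target": "ATL-7701"},
    {"prompt": "What is the special project code for Aurora?", "target": "ATL-7704"},
]
print(exact_match_score(examples, lambda p: "ATL-7701"))  # 0.5
```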

## CLI

```bash
hard-needle-generate \
    --num-examples 1000 \
    --num-needles 6 \
    --ctx-chars 12000 \
    --pool-set default \
    --include-corrupted \
    --corruption-ratio 0.2 \
    --seed 42 \
    --output train.jsonl

# Disjoint eval pool — no entity/value overlap with --pool-set default
hard-needle-generate \
    --num-examples 200 \
    --num-needles 6 \
    --ctx-chars 12000 \
    --pool-set unseen \
    --seed 100 \
    --output eval.jsonl
```

Each output line is a JSON object:

```json
{
  "prompt": "You are an internal assistant for the ...",
  "target": "ATL-7701",
  "text": "<prompt> <target>",
  "question": "What is the special project code for Apollo?",
  "target_needle_type": "project_code",
  "target_entity": "Apollo",
  "target_value": "ATL-7701",
  "needle_records": [
    {
      "type": "project_code",
      "entity": "Aurora",
      "value": "ATL-7704",
      "sentence": "The special project code for Aurora is ATL-7704.",
      "char_pos": 1842,
      "depth_frac": 0.42
    },
    ...
  ],
  "num_needles": 3,
  "ctx_chars": 8000,
  "pool_set": "default",
  "is_corrupted": false
}
```
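
Since each record carries `depth_frac`, the JSONL feeds directly into lost-in-the-middle analysis. A sketch of one way to bucket examples by the depth of their target needle; the helpers here are ours, not part of the package, and assume only the fields shown above:

```python
import json
from collections import defaultdict

def depth_bucket(depth_frac, num_buckets=4):
    """Map a depth fraction in [0, 1] to a bucket index 0..num_buckets-1."""
    return min(int(depth_frac * num_buckets), num_buckets - 1)

def bucket_examples(path, num_buckets=4):
    """Group JSONL examples by the depth of the needle the question targets."""
    buckets = defaultdict(list)
    with open(path) as f:
        for line in f:
            ex = json.loads(line)
            # Pick out the record for the entity the question asks about.
            target = next(
                r for r in ex["needle_records"]
                if r["entity"] == ex["target_entity"]
            )
            buckets[depth_bucket(target["depth_frac"], num_buckets)].append(ex)
    return buckets
```

Scoring each bucket separately then shows whether accuracy dips for needles placed mid-document.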

## Needle types

Each example uses one of four needle types — all entities are projects, but the value type varies:

| Type | Entity example | Value example |
|---|---|---|
| `project_code` | `Aurora` | `AUR-4521` |
| `deadline` | `Apollo` | `April 03` |
| `budget` | `Atlas` | `$1.4M` |
| `lead` | `Andromeda` | `Dr. Sarah Chen` |

The disjoint `unseen` pool uses different surface forms (e.g. `Brontis`, `BRX-9001`, `Dr. Aiko Tanaka`) for held-out generalization evals.

## Optional extras

```bash
pip install "hard-needle[datasets]"      # PG-19 streaming distractors (vs builtin pool)
pip install "hard-needle[tokenizer]"     # Token-aware truncation via transformers
pip install "hard-needle[dev]"           # pytest
```

## Limitations

- Context length is controlled in **characters** by default. Token-aware truncation requires `[tokenizer]` extra and is best-effort across needle insertions.
- The builtin distractor pool is small. Use `--distractor-source pg19` for production-scale data.
- Templates are deliberately simple ("The X for Y is Z"). For paraphrase-robustness studies, augment downstream.
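
As one downstream augmentation sketch, the `"The X for Y is Z."` template can be rewritten with simple string surgery. The regex and paraphrase variants below are our own assumptions based on the example sentences above; a serious paraphrase-robustness study would use an LLM paraphraser instead:

```python
import re

# Hypothetical paraphrase variants for the "The X for Y is Z." template.
PARAPHRASES = [
    "{entity}'s {field} is {value}.",
    "For {entity}, the {field} is {value}.",
    "{value} is the {field} for {entity}.",
]

TEMPLATE = re.compile(r"The (?P<field>.+?) for (?P<entity>\w+) is (?P<value>.+?)\.")

def paraphrase(sentence, variant=0):
    """Rewrite a template needle sentence; leave non-template sentences alone."""
    m = TEMPLATE.fullmatch(sentence)
    if m is None:
        return sentence
    return PARAPHRASES[variant % len(PARAPHRASES)].format(**m.groupdict())

print(paraphrase("The special project code for Apollo is ATL-7701.", 2))
# -> ATL-7701 is the special project code for Apollo.
```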

## Citing / links

If `hard-needle` helped your research or evaluation, a star is appreciated. If you publish using it, drop a link to your work in the issues — happy to maintain a "used by" list.

## License

MIT — see [LICENSE](LICENSE).
