Metadata-Version: 2.4
Name: llm2graph
Version: 0.3.4
Summary: Paper-aligned LLM-only graph construction, benchmark runners, and public-facing evaluation tools for LLM unlearning experiments.
Author: Raj Sanjay Shah
License-Expression: MIT
Project-URL: Homepage, https://pypi.org/project/llm2graph/
Keywords: knowledge-graph,llm,unlearning,benchmark,reproducibility,evaluation
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: networkx>=3.2
Requires-Dist: openai>=1.37
Requires-Dist: pydantic>=2.7
Requires-Dist: tenacity>=8.2
Requires-Dist: typer>=0.12
Provides-Extra: gemini
Requires-Dist: google-generativeai>=0.7; extra == "gemini"
Provides-Extra: hf-local
Requires-Dist: transformers>=4.44; extra == "hf-local"
Requires-Dist: accelerate>=0.33; extra == "hf-local"
Requires-Dist: sentencepiece>=0.2; extra == "hf-local"
Requires-Dist: einops>=0.7; extra == "hf-local"
Provides-Extra: dev
Requires-Dist: pytest>=8.3; extra == "dev"

# LLM2Graph

`llm2graph` is a paper-aligned toolkit for reproducing the graph-based evaluation pipeline from *The Unlearning Mirage: A Dynamic Framework for Evaluating LLM Unlearning*. It also remains usable for general-purpose workflows outside the paper's benchmarks.

It supports three layers of use:

1. Entity-centric knowledge graph construction using LLM API calls throughout elicitation, triple extraction, relevance filtering, and alias resolution.
2. Query generation from graph paths with controllable difficulty through hop depth, aliases, paraphrases, distractors, and retention probes.
3. Benchmark and public workflows through simple CLI commands, dataset loaders, and reusable Python APIs.

## Why 0.3.4

Version `0.3.4` extends the `0.3.3` paper-aligned pipeline with:

- dataset-specific runners for RWKU and TOFU
- public-friendly dataset loading from `.txt`, `.json`, `.jsonl`, and `.csv`
- seed extraction utilities for previewing local benchmark files
- reusable Python APIs for both benchmark replication and general entity-level experiments

## Install

```bash
pip install llm2graph
```

Optional extras:

```bash
pip install "llm2graph[gemini]"
pip install "llm2graph[hf-local]"
```

## Public Quickstart

If you just want to build a graph and evaluate one entity:

```powershell
$env:OPENAI_API_KEY="..."
llm2graph entity --seed "Marie Curie" --out graph.json
llm2graph gen-queries --graph graph.json --target "Marie Curie" --hops 2 --out queries.json
llm2graph eval --queries queries.json --pre-model gpt-4o-mini-2024-07-18 --post-model gpt-4o-mini-2024-07-18 --out eval_report.json
```

If you have a simple text file of seed entities, one per line:

```powershell
llm2graph extract-seeds --dataset seeds.txt
llm2graph run-benchmark --benchmark rwku --dataset seeds.txt --out-dir demo_run --limit 5
```

The generic `run-benchmark` command is the easiest public entry point. The benchmark-oriented aliases `run-rwku` and `run-tofu` are also available, and all three commands work with plain seed lists so general users do not need a benchmark-specific file format.
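
A plain seed list is just a text file with one entity name per line. As a minimal sketch using only the Python standard library, such a file could be produced as follows. The helper name `write_seed_file` is hypothetical and not part of `llm2graph`, and UTF-8 encoding is an assumption.

```python
from pathlib import Path


def write_seed_file(entities, path):
    """Write one entity per line, skipping blanks and duplicates (order preserved).

    Hypothetical helper for illustration; not part of llm2graph.
    """
    seen = set()
    lines = []
    for name in entities:
        cleaned = name.strip()
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            lines.append(cleaned)
    # UTF-8 is an assumption; the README only says "one per line".
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines
```

Calling `write_seed_file(["Marie Curie", "Stephen King"], "seeds.txt")` would produce a file that can be passed to `--dataset` in the commands above.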

## Benchmark Workflows

Build a graph with API-driven elicitation:

```bash
llm2graph entity \
  --seed "Stephen King" \
  --max-depth 2 \
  --elicitation-question-count 10 \
  --provider openai \
  --model gpt-4o-mini-2024-07-18 \
  --use-relevance \
  --relevance-threshold 3.0 \
  --out graph.json
```

Generate forget probes for every hop from `1..N` plus retention probes:

```bash
llm2graph gen-queries \
  --graph graph.json \
  --target "Stephen King" \
  --hops 3 \
  --num-paths 100 \
  --per-kind-limit 100 \
  --aliases 3 \
  --paraphrases 2 \
  --distractors 2 \
  --random-seed 13 \
  --provider openai \
  --model gpt-4o-mini-2024-07-18 \
  --out queries.json
```
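
To illustrate what "every hop from `1..N`" means, here is a toy sketch that enumerates simple paths of each length starting at a target node and samples at most `num_paths` of them per hop with a fixed random seed. The triple format, the hand-written graph, and the function `paths_by_hop` are illustrative assumptions; this is not a description of how `QueryGenerator` works internally.

```python
import random


def paths_by_hop(edges, target, max_hops, num_paths, seed=13):
    """Return {hop: [path, ...]}; each path is a list of (head, relation, tail)."""
    adjacency = {}
    for head, relation, tail in edges:
        adjacency.setdefault(head, []).append((relation, tail))

    rng = random.Random(seed)  # fixed seed so the sampling is repeatable
    result = {}
    frontier = [[]]  # all paths of the previous length; start with the empty path
    for hop in range(1, max_hops + 1):
        extended = []
        for path in frontier:
            last = path[-1][2] if path else target
            visited = {target} | {step[2] for step in path}
            for relation, tail in adjacency.get(last, []):
                if tail not in visited:  # keep paths simple (no revisited nodes)
                    extended.append(path + [(last, relation, tail)])
        frontier = extended
        if len(extended) > num_paths:
            result[hop] = rng.sample(extended, num_paths)
        else:
            result[hop] = extended
    return result


# Toy graph for illustration only.
edges = [
    ("Stephen King", "wrote", "The Shining"),
    ("The Shining", "adapted_by", "Stanley Kubrick"),
    ("Stephen King", "born_in", "Portland"),
    ("Portland", "located_in", "Maine"),
]
paths = paths_by_hop(edges, "Stephen King", max_hops=2, num_paths=100)
# paths[1] holds two 1-hop paths and paths[2] holds two 2-hop paths.
```

Each sampled path could then be turned into a question whose answer is the final tail entity, which is why deeper hops yield harder probes.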

Run a full RWKU-style sweep from a local dataset file:

```bash
llm2graph run-benchmark \
  --benchmark rwku \
  --dataset rwku_entities.jsonl \
  --out-dir rwku_run \
  --hops 3 \
  --num-paths 100 \
  --per-kind-limit 100
```

Equivalent benchmark-specific alias:

```bash
llm2graph run-rwku \
  --dataset rwku_entities.jsonl \
  --out-dir rwku_run \
  --hops 3 \
  --num-paths 100 \
  --per-kind-limit 100
```

Run a full TOFU-style sweep from a local dataset file:

```bash
llm2graph run-benchmark \
  --benchmark tofu \
  --dataset tofu_authors.json \
  --out-dir tofu_run \
  --hops 3 \
  --num-paths 100 \
  --per-kind-limit 100
```

Equivalent benchmark-specific alias:

```bash
llm2graph run-tofu \
  --dataset tofu_authors.json \
  --out-dir tofu_run \
  --hops 3 \
  --num-paths 100 \
  --per-kind-limit 100
```

Each benchmark run writes per-entity artifacts plus a top-level `benchmark_summary.json`.
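
The summary file can be inspected with the standard library alone. The sketch below assumes `benchmark_summary.json` sits directly inside the `--out-dir` directory and makes no assumption about its schema: it only reports the top-level keys. The function names are hypothetical.

```python
import json
from pathlib import Path


def load_benchmark_summary(out_dir):
    """Load <out_dir>/benchmark_summary.json and return the parsed JSON value."""
    summary_path = Path(out_dir) / "benchmark_summary.json"
    with summary_path.open(encoding="utf-8") as handle:
        return json.load(handle)


def describe(summary):
    """Return sorted top-level keys for a JSON object, else the value's type name."""
    if isinstance(summary, dict):
        return sorted(summary)
    return type(summary).__name__
```

For example, `describe(load_benchmark_summary("rwku_run"))` lists the fields available after the RWKU run shown above, which is a reasonable first step before writing any analysis code against them.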

## Public-Facing API

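```python
from llm2graph import (
    BenchmarkRunner,
    BenchmarkRunnerConfig,
    GraphBuilder,
    QueryGenerator,
    Evaluator,
    load_seeds,
)

# Load the seed entities from a local file, e.g. to inspect them before a run.
seeds = load_seeds("seeds.txt")

# Configure a dataset-level run; the runner reads the dataset from dataset_path.
runner = BenchmarkRunner(
    BenchmarkRunnerConfig(
        benchmark="rwku",
        dataset_path="seeds.txt",
        output_dir="demo_run",
        provider="openai",
        model="gpt-4o-mini-2024-07-18",
    )
)

# Execute the run and keep the returned summary.
summary = runner.run()
```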

## Reproducibility Notes

`llm2graph` stores experiment metadata inside produced JSON files so runs are easier to audit.

- graph artifacts record construction settings and provider/model provenance
- query artifacts record path-sampling settings, source graph metadata, and retention probe structure
- evaluation artifacts record pre/post/judge model metadata, skipped pre-check counts, residual flags, and paper-style summary metrics
- benchmark runs record per-entity outputs and a summary manifest

To improve reproducibility across runs:

- keep prompt settings fixed
- set `random_seed` during query generation
- persist the exact graph JSON used to create question sets
- keep provider and model versions in your experiment logs
- prefer a judge model for semantic equivalence when exact string match is too brittle
- reuse the graph per `(model, seed entity)` pair when evaluating multiple unlearning methods
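
One way to follow the last recommendation is to derive a stable file name from the `(model, seed entity)` pair and rebuild the graph only when that file is missing. The sketch below is an assumption about how you might organize your own cache directory; `graph_cache_path` is a hypothetical helper, not part of `llm2graph`.

```python
import hashlib
import re
from pathlib import Path


def graph_cache_path(cache_dir, model, seed_entity):
    """Build a stable file path for the graph of one (model, seed entity) pair."""
    # Human-readable prefix derived from the entity name.
    slug = re.sub(r"[^a-z0-9]+", "-", seed_entity.lower()).strip("-")
    # A short hash of both fields keeps different models from colliding.
    key = f"{model}\x00{seed_entity}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()[:12]
    return Path(cache_dir) / f"{slug}-{digest}.json"
```

The resulting path can be passed as `--out` to `llm2graph entity` and later as `--graph` to `llm2graph gen-queries`, so that every unlearning method is evaluated against the same graph.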

## Paper Alignment

This package is intended to support the experiments and reproducibility workflow of *The Unlearning Mirage*. The key package abstractions map onto the paper's methodology as follows:

- `GraphBuilder`: API-driven entity-to-graph elicitation and controlled BFS expansion with decay
- `QueryGenerator`: single-hop, multi-hop, alias-based, 1-hop retention, 2-hop retention, and relationship-retention probe synthesis
- `Evaluator`: pre/post comparison, residual knowledge measurement, and paper-style aggregate metrics
- `BenchmarkRunner`: dataset-level orchestration for RWKU, TOFU, or public seed lists
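
As a rough illustration of "controlled BFS expansion with decay", the sketch below expands a graph breadth-first while shrinking the number of children kept per node at each depth. The exact decay rule used by `GraphBuilder` is not specified here, so the geometric budget `base_width * decay ** depth`, the parameter names, and the function `bfs_with_decay` are all assumptions made for illustration. In the real pipeline the neighbor lookup would be backed by LLM calls rather than a dictionary.

```python
from collections import deque


def bfs_with_decay(neighbors, seed, max_depth, base_width=4, decay=0.5):
    """Breadth-first expansion where the per-node child budget shrinks with depth.

    neighbors: callable mapping a node to an ordered list of candidate neighbors.
    The budget at depth d is max(1, int(base_width * decay ** d)) (assumed rule).
    """
    depth_of = {seed: 0}
    edges = []
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        depth = depth_of[node]
        if depth >= max_depth:
            continue  # do not expand nodes at the depth limit
        budget = max(1, int(base_width * decay ** depth))
        for child in neighbors(node)[:budget]:
            edges.append((node, child))
            if child not in depth_of:
                depth_of[child] = depth + 1
                queue.append(child)
    return depth_of, edges


# Toy adjacency for illustration only.
toy = {"A": ["B", "C", "D", "E", "F"], "B": ["G", "H", "I"]}
depth_of, edges = bfs_with_decay(lambda n: toy.get(n, []), "A", max_depth=2)
# Depth 0 keeps 4 children of "A"; depth 1 keeps 2 children of "B".
```

The decaying budget keeps the graph focused on the seed entity: close neighbors are explored broadly, while distant ones contribute only a few edges each.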

## Examples

See [examples/reproducibility.md](examples/reproducibility.md) and [examples/reproduce_pipeline.py](examples/reproduce_pipeline.py) for a minimal experiment template.
