Metadata-Version: 2.4
Name: causal-worlds
Version: 0.39.0
Summary: Generate fictional-but-coherent causal operations worlds (executable sim + ground-truth answer-key) from a natural-language description, for benchmarking causal-discovery agents.
Project-URL: Homepage, https://github.com/noumenal-ai/causal-worlds
Project-URL: Repository, https://github.com/noumenal-ai/causal-worlds
Project-URL: Changelog, https://github.com/noumenal-ai/causal-worlds/blob/main/CHANGELOG.md
Project-URL: Issues, https://github.com/noumenal-ai/causal-worlds/issues
Author: Noumenal
License: MIT
License-File: LICENSE
Keywords: benchmark,causal-discovery,causality,llm,scm,synthetic-data
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Typing :: Typed
Requires-Python: >=3.13
Requires-Dist: numpy>=2.1
Requires-Dist: pydantic-settings>=2.5
Requires-Dist: pydantic>=2.9
Requires-Dist: typer>=0.12
Provides-Extra: discover
Requires-Dist: causal-learn>=0.1.4; extra == 'discover'
Requires-Dist: dagma>=1.1; extra == 'discover'
Requires-Dist: gies>=0.0.1; extra == 'discover'
Requires-Dist: lingam>=1.8; extra == 'discover'
Provides-Extra: gym
Requires-Dist: gymnasium>=0.29; extra == 'gym'
Provides-Extra: llm
Requires-Dist: anthropic>=0.40; extra == 'llm'
Requires-Dist: instructor[google-genai]>=1.0; extra == 'llm'
Provides-Extra: observability
Requires-Dist: langfuse>=3.0; extra == 'observability'
Provides-Extra: temporal
Requires-Dist: joblib>=1.3; extra == 'temporal'
Requires-Dist: lingam>=1.8; extra == 'temporal'
Requires-Dist: statsmodels>=0.14; extra == 'temporal'
Requires-Dist: tigramite>=5.2; extra == 'temporal'
Description-Content-Type: text/markdown

# causal-worlds

[![PyPI](https://img.shields.io/pypi/v/causal-worlds.svg)](https://pypi.org/project/causal-worlds/)
[![CI](https://github.com/noumenal-ai/causal-worlds/actions/workflows/ci.yml/badge.svg)](https://github.com/noumenal-ai/causal-worlds/actions/workflows/ci.yml)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
[![Python 3.13+](https://img.shields.io/badge/python-3.13%2B-blue.svg)](https://www.python.org)

**Turn a plain-language description of an operation into a fictional causal world with a declared,
ground-truth causal graph — then benchmark whether a causal-discovery method can recover it.**

Because the structure is *declared* (not learned from data), it's an **answer key**: run any
discovery method on the generated data and *score* how well it recovered the world. The worlds are
**fiction-first** — plausible and internally consistent, not models of any real system — so there is
no data to leak and nothing to memorize, which is exactly what makes a causal benchmark trustworthy.

📖 **The story:** [*Correlation lies — we built a causal world that can prove it, then tried to break it
ourselves*](https://selftaughtamit.medium.com/correlation-lies-we-built-a-causal-world-that-can-prove-it-then-tried-to-break-it-ourselves-30efae26b457) (Medium).

### Correlation lies — and because the world is *declared*, you can prove it

In the built-in `coffee` world, `overtime` and `sales` rise together (correlation **0.64**) — a naive
analyst "discovers" that overtime drives sales. But *force* `overtime` with `do()` and sales doesn't
move: the true causal effect is **0.00**. The whole association was a hidden confounder the benchmark
planted — and it *knows*, so it can score who gets fooled:

```python
import numpy as np
from causal_worlds import build_substrate, worlds

sub = build_substrate(worlds.get("coffee"), standardize=False)
ov, sa = sub.variables.index("overtime"), sub.variables.index("sales")

seen = sub.sample(40_000, seed=0).data
corr = np.corrcoef(seen[:, ov], seen[:, sa])[0, 1]                     # ≈ 0.64  → looks causal
hi = sub.sample(40_000, seed=1, do={"overtime":  1.0}).data[:, sa].mean()
lo = sub.sample(40_000, seed=1, do={"overtime": -1.0}).data[:, sa].mean()
print(round(corr, 2), round((hi - lo) / 2, 2))   # 0.64  0.00  → strong correlation, ZERO causal effect
```

That's the whole game: a method that only *sees* the data keeps the mirage; one that can *act* (and is
latent-aware) doesn't. Score any discovery method against the known truth:

```python
from causal_worlds import worlds, grade_spec, InterventionalCiDiscoverer
print(grade_spec(worlds.get("coffee"), InterventionalCiDiscoverer()))
# directed_shd=0  f1=1.0  confounded_reported=0     ← swap in YOUR discoverer
```

## See the world it builds

An SCM *is* a DAG, so look at it. `to_mermaid(spec)` and `to_dot(spec)` are **zero-dependency** string
renderers (or run `causal-worlds viz coffee`). The hidden confounder is drawn dashed — it's the latent
structure a discovery method never gets to see, and the reason the world is hard:

![The declared SCM for the built-in "coffee" world: a hidden confounder (dashed) drives several observed variables](https://raw.githubusercontent.com/noumenal-ai/causal-worlds/main/docs/figures/coffee_world.png)

```python
from causal_worlds import worlds, to_mermaid
print(to_mermaid(worlds.get("coffee")))   # paste into a ```mermaid block — GitHub renders it live
```

> **Status — on [PyPI](https://pypi.org/project/causal-worlds/), beta.** The full loop works: natural
> language → an *admitted* causal world, persisted with provenance. A Claude **author** proposes the
> world; an independent Gemini **judge** (a different model family) + statistical gates admit only
> worlds that are valid, recoverable, faithful, and *not* guessable from variable names. The engine
> (specify → sample → grade → score), all grading, and the renderers run with **no API key**; only
> *authoring* needs keys. Shipped: temporal (lagged) worlds, a control track, and a Gymnasium env.
> See the [CHANGELOG](CHANGELOG.md).

## Causality in three rungs (a 2-minute tour)

That picture isn't decoration — it's a *causal* model, and causality has exactly three levels (Judea
Pearl's **Ladder of Causation**). causal-worlds lets you stand on each rung with real code, on a world
whose true answer you already know.

**Rung 1 — Association · *what goes with what?*** In the data, `footfall` and `sales` rise together.
But is that `footfall → sales`, or is the hidden `local_buzz` (a street festival, good weather) quietly
driving *both*? From the data alone you **cannot tell** — two things moving together always have a
hidden third suspect. That gap is the whole difficulty of causality.

```python
substrate = build_substrate(worlds.get("coffee"))
data = substrate.sample(2000, seed=0)                  # just watch the world
```

**Rung 2 — Intervention · *what if I act?*** Don't watch a variable — *set* it. `do()` is **surgery on
the graph**: it cuts every arrow pointing *into* the variable and keeps every arrow pointing *out*, so
what then moves `sales` is the **real** effect, mirage removed. The data slopes `sales` on `footfall`
at **1.56** — but `do(footfall)` reveals the true effect is only **0.90** (the data overstated it by
0.66, the confounder again); and for the pure-mirage pair above, `do(overtime)` gives exactly **0.00**.

```python
hi = sub.sample(40_000, seed=1, do={"footfall":  1.0}).data[:, sa].mean()
lo = sub.sample(40_000, seed=1, do={"footfall": -1.0}).data[:, sa].mean()
print(round((hi - lo) / 2, 2))   # 0.9  — the true causal effect, vs the 1.56 the raw data suggested
```

You can *see* the surgery — `to_dot(spec, do={"footfall": 1.0})` (or `causal-worlds viz coffee
--format dot`) draws the mutilated graph: every arrow **into** `footfall` is gone, its arrows **out**
remain. Our `do()` is verified *genuine* surgery, not statistical conditioning ([docs/scope.md](docs/scope.md)):

![Rung 2 — do(footfall): the arrows into footfall are cut, its effects still flow](https://raw.githubusercontent.com/noumenal-ai/causal-worlds/main/docs/figures/coffee_do_footfall.png)

**Rung 3 — Counterfactual · *what would have happened?*** "We sold what we sold last Saturday — would
we have sold more had we set `price` differently, *that same Saturday*?" That needs the full model,
with that specific day's hidden `local_buzz` held fixed. Because causal-worlds *declares* the entire
SCM, it's exact — Pearl's **abduction → action → prediction**:

```python
from causal_worlds import counterfactual
cf = counterfactual(worlds.get("coffee"), do={"price": 3.0}, seed=0)
print(cf.factual["sales"], "->", cf.counterfactual["sales"])   # what happened -> what would have
print(cf.effect["sales"])                                       # the change attributable to the act
```

(Same noise, same hidden `local_buzz`, only `price` changed — so the difference is *caused* by the
act. On a *weekend* unit the `price → demand` sign is flipped, so the counterfactual can surprise you.)

**Why this makes a benchmark.** A method that only *sees* (Rung 1) keeps the `local_buzz` mirage as a
real `overtime → sales` edge; only a *latent-aware* method that *does* (Rung 2) escapes it — which is
exactly the [crossover result](docs/findings.md). New to causality? This *is* the tour — read the
diagrams, run the snippets, and you've climbed all three rungs.

## Install

```bash
pip install causal-worlds              # or: uv add causal-worlds
pip install 'causal-worlds[discover]'  # + the baseline discovery stack (PC/GES/FCI/GIES)
pip install 'causal-worlds[llm]'       # + natural-language authoring (Claude + Gemini)
```

The base install (engine, grading, renderers, built-in worlds, CLI) needs only `typer`, `pydantic`,
`numpy`.

## 60-second quickstart (no API key)

```bash
causal-worlds worlds                     # list built-in worlds: braking, coffee, ecommerce
causal-worlds viz coffee                 # print the SCM as Mermaid (--format dot for Graphviz)
causal-worlds gate coffee                # run the validity gates -> admitted=True
causal-worlds grade coffee               # grade the reference discoverer -> directed_shd=0 ...
causal-worlds score benchmark/v0.6/world_01   # grade the reference on a shipped benchmark world
```

New to it? Walk through the **[getting-started guide](docs/getting-started.md)** or run the
**[examples](examples/)** (each prints its expected output, so you can read without running).

## Benchmark your own discoverer

Implement one method — `recover(substrate, *, seed) -> set[(src, dst)]` — and grade it against any
world's answer key:

```python
from causal_worlds import grade_spec, worlds

class MyDiscoverer:
    def recover(self, substrate, *, seed):
        sample = substrate.sample(2000, seed=seed)        # observational data...
        flows = substrate.sample(2000, seed=seed, do={"price": 1.0})  # ...or interventional
        return {("price", "demand")}                       # your recovered edges

print(grade_spec(worlds.get("coffee"), MyDiscoverer()))
```

Or from the CLI on a persisted world: `causal-worlds score <bundle> --discoverer your_pkg:YourClass`.

## Benchmark a controller (Stage 2 — control)

The same worlds are a **control** benchmark: pick lever values to maximise an objective. Because the
mechanisms are declared, the best the levers can do is computable — a *by-construction optimal policy*
([scope §1a](docs/scope.md)) — so a policy is graded by **regret** against it, no external data needed.

```python
from causal_worlds import default_objective, grade_control, worlds

spec = worlds.get("coffee")
objective = default_objective(spec)                  # controllables raise the outcome KPI, quadratic cost
report = grade_control(spec, objective, {"price": 3.0}, seed=7)   # your policy here
print(report.regret)                                 # regret vs the declared optimum (0 = optimal play)
```

A pluggable `Controller` is graded by `grade_controller`. Or drive it as a **Gymnasium env**
(`pip install 'causal-worlds[gym]'`) where the regime shifts between steps (a perturbation):

```python
from causal_worlds.gym import ControlEnv
env = ControlEnv(worlds.get("coffee"))               # action = lever values; reward = objective
obs, info = env.reset(seed=0)                          # info["optimal_reward"] / info["regret"] per step
```

## Author a world from a description (needs `[llm]` + keys)

Set `ANTHROPIC_API_KEY` and `GEMINI_API_KEY` (see [`.env.example`](.env.example); the CLI auto-loads a
local `.env`), then:

```bash
causal-worlds generate "a coffee chain with weekend swings and variable lead times" ./my-world
causal-worlds viz ./my-world             # ...then look at what it built
```

Or describe one **conversationally** — `causal-worlds elicit ./my-world` asks the *minimal* clarifying
questions (entities & roles, what drives what, regimes, hidden causes, the objective), shows the
accumulating brief, and authors once it's complete. In Python:

```python
from causal_worlds import generate
from causal_worlds.author import build_claude_author
from causal_worlds.judge import build_gemini_judge

world = generate(
    "a hospital ED with triage staffing and bed pressure",
    author=build_claude_author(complexity="hard"),   # easy | standard | hard | adversarial
    judge=build_gemini_judge(),                       # independent model family
)
print(world.report.difficulty, world.report.grade)
```

By default authoring is **benchmark mode**: a world is rejected if it is guessable from its variable
names/roles (the published benchmark must not be name-guessable, so *intuitive* operations often fail
on purpose). To just **describe a world and get it**, use **playground mode** — `--playground` on the
CLI, or `anti_cliche=False` in Python. It keeps the faithfulness check and still reports `difficulty`
(now advisory), but never rejects on guessability:

```bash
causal-worlds generate "a regional power grid with rooftop solar and time-of-use pricing" ./grid --playground
```

(With the `observability` extra + Langfuse keys, every `generate → author → gate` run is traced.)

## What the benchmark shows

Across the 26-world hardened [`benchmark/v0.6`](benchmark/v0.6/) set, with an **information-fair**
comparison (the `+do` methods get the *same* interventional budget as the latent-aware reference):
**latent-awareness — not interventions — is the dividing line.** `PC + interventions` still scores the
hidden-confounded pair as causal just as often as observational PC — confounded-kept **30 vs 29**
(summed over the 26 worlds, seed-averaged); only the latent-aware rule reaches **0** (ΔF1 +0.37, 95%
CI [0.33, 0.42]). It's an **identifiability result**, not
"we beat the toolbox." Full table, bootstrap CIs, and the honest caveats (admission circularity,
simulated-DAG leakage, difficulty-as-descriptor, anti-cliché role leakage) are in
**[docs/findings.md](docs/findings.md)**.

## What you get per world

1. **An executable SCM** — sample observational data and `do()`-intervene, deterministically by seed.
2. **A time-series dataset** — the observed variables (the input to a discovery method).
3. **An answer key** — the declared causal edges + the hidden-confounded pairs, derived from the spec.
4. **A manifest** — full provenance (models, grader version, seed, difficulty) and an honesty label.

## Concepts

- **Spec / IR** — variables (with roles, incl. hidden), additive mechanisms (linear-Gaussian by
  default; per-term `Transform`s give additive-nonlinear forms like `square`), regime sign-flips.
- **Answer key** — directed edges over *observed* variables + the hidden-confounded pairs; *derived*
  from the spec, never stored separately, so they can't disagree.
- **Gates** — T1 validity · T2 sample-sanity · T3 faithfulness (grader-independent) · T4 anti-cliché
  (named prior recovers < half *and* a name+role-blind prior stays near chance). All must pass to admit.
- **Reference grader** — an interventional-CI discoverer that uses `do()` data to tell *confounding*
  from *causation*, where PC/GES/GIES/FCI (which assume causal sufficiency) cannot.

Depth: [`docs/foundations.md`](docs/foundations.md) (are these worlds *truly* causal? — the three
rungs, grounded in Pearl/Wright/Haavelmo) · [`docs/scope.md`](docs/scope.md) ·
[`docs/hld.md`](docs/hld.md) · [`docs/lld.md`](docs/lld.md) ·
[`docs/architecture.md`](docs/architecture.md) · [`docs/findings.md`](docs/findings.md) ·
[`docs/validation.md`](docs/validation.md).

## Roadmap

Shipped: NL authoring · independent judge + anti-cliché gate · artifact persistence · the baseline
crossover · a structural-difficulty axis · a 26-world hardened benchmark (`v0.6`) · **temporal worlds**
(lagged edges + autoregression) and **time-series grading** (PCMCI+, LPCMCI, VARLiNGAM, Granger) ·
**conversational elicitation** · the **control track** (by-construction optimal policy, regret, and
regret-under-perturbation) + a **Gymnasium env** · **graph renderers** (Mermaid / DOT).
**additive-nonlinear mechanisms** (`square`/`tanh`/… transforms — the built-in `braking` world's stopping
distance grows with **speed²**, which linear discovery can't see; [#10](https://github.com/noumenal-ai/causal-worlds/issues/10)).
Next: nonlinear **interactions** (products of parents) + a post-nonlinear form, and a temporal
benchmark *set* (n>1). Tracked as [issues](https://github.com/noumenal-ai/causal-worlds/issues).

## Why this is the unoccupied intersection

Today's tools each own one corner — *natural-language authoring × executable causal simulator ×
ground-truth answer-key for discovery* is the gap:

| Tool | Corner it owns | What it lacks (for this job) |
|---|---|---|
| **[G-Sim](https://arxiv.org/abs/2506.09272)** | LLM authors a sim + calibrates to data | needs *real data*; aimed at fidelity, not a declared answer-key |
| **[DEVS-Gen](https://arxiv.org/abs/2603.03784)** | NL → executable discrete-event ops sim | no declared causal-graph answer-key |
| **[SD-SCM](https://arxiv.org/abs/2411.08019)** | LLM fills mechanisms → counterfactuals | needs a *user-supplied* DAG; tabular, not an executable sim |
| **[TimeGraph](https://arxiv.org/abs/2506.01361)** | known-graph time-series for discovery | parametric/templated; no natural-language authoring |

Built on the shoulders of [pgmpy](https://github.com/pgmpy/pgmpy),
[DoWhy](https://github.com/py-why/dowhy),
[CausalPlayground](https://github.com/sa-and/CausalPlayground),
[causal-learn](https://github.com/py-why/causal-learn), and
[Gymnasium](https://github.com/Farama-Foundation/Gymnasium).

## Contributing

Issues and PRs welcome. The bar: `make validate` green (ruff `select=ALL`, mypy `strict`, pytest with
a coverage floor) — see [`docs/engineering.md`](docs/engineering.md). Atomic, conventional commits.

## License

[MIT](LICENSE). An open-source project from [Noumenal](https://github.com/noumenal-ai).
