open source · MIT · pip install causal-worlds

A causal world with an answer key.

Turn a plain-language description of an operation into a fictional-but-coherent causal world with a declared ground-truth causal graph. See its correlations, intervene with do(), ask counterfactuals — and check every answer against the truth, because you wrote it.

Correlation lies — and because the world is declared, you can prove it

In the built-in coffee world, overtime and sales rise together. A naive analyst "discovers" that overtime drives sales. But force overtime with an intervention and sales do not move: the whole correlation was produced by a hidden confounder.

0.64 · SEEING: correlation(overtime, sales)
0.00 · DOING: causal effect via do(overtime)
3 / 3 · rungs of Pearl's ladder, with a known answer
Figure: the declared SCM for the coffee world. The hidden confounder local_buzz (dashed red) drives footfall, overtime, and sales; it is the structure a discovery method never sees.
import numpy as np
from causal_worlds import build_substrate, worlds

# Build the built-in coffee world and look up the two columns of interest.
sub = build_substrate(worlds.get("coffee"), standardize=False)
ov, sa = sub.variables.index("overtime"), sub.variables.index("sales")

# Seeing: observational correlation between overtime and sales.
seen = sub.sample(40_000, seed=0).data
corr = np.corrcoef(seen[:, ov], seen[:, sa])[0, 1]  # ≈ 0.64  → looks causal

# Doing: force overtime high, then low (same seed), and compare mean sales.
hi = sub.sample(40_000, seed=1, do={"overtime":  1.0}).data[:, sa].mean()
lo = sub.sample(40_000, seed=1, do={"overtime": -1.0}).data[:, sa].mean()
print(round(corr, 2), round((hi - lo) / 2, 2))   # 0.64  0.00  → strong correlation, ZERO causal effect

Seeing, doing, imagining

Rung 1

Seeing

Association — what correlates. The data alone can't tell you why.

Rung 2

Doing

do() is genuine graph surgery — cut the arrows into a variable, keep the arrows out. The mirage vanishes.

Rung 3

Imagining

Counterfactuals (abduction → action → prediction), exact — on cross-sectional and temporal worlds.
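
The three counterfactual steps named above (abduction → action → prediction) can be sketched on a tiny linear model. This is only an illustration under stated assumptions: the two-variable SCM, the coefficient BETA, and the helper counterfactual_sales are hypothetical and are not part of the causal-worlds API.

```python
# Hypothetical toy SCM, chosen only for illustration:
#   footfall := u_f
#   sales    := BETA * footfall + u_s
BETA = 2.0

def counterfactual_sales(footfall_obs, sales_obs, footfall_cf):
    """What would sales have been for THIS unit had footfall been footfall_cf?"""
    # 1. Abduction: recover the unit's noise term from what was observed.
    u_s = sales_obs - BETA * footfall_obs
    # 2. Action: set footfall to the counterfactual value (its own mechanism is ignored).
    # 3. Prediction: push that value through the sales mechanism with the same noise.
    return BETA * footfall_cf + u_s

# One observed unit: footfall 1.0 and sales 2.5 imply u_s = 0.5.
print(counterfactual_sales(1.0, 2.5, footfall_cf=3.0))  # 6.5
```

Because the noise is recovered exactly in a linear additive model like this one, the counterfactual is a single number rather than a distribution; setting footfall_cf to the observed value returns the observed sales.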

Rung 2, do(footfall): the arrows into footfall are cut; its downstream effects still flow. Verified as genuine surgery, not conditioning.
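
To make the difference between surgery and conditioning concrete, here is a NumPy-only sketch. It assumes a toy SCM in which a hidden local_buzz drives both overtime and sales and overtime has no arrow into sales; the coefficients and the sample() helper are invented for illustration and are not the library's coffee world or its API.

```python
import numpy as np

def sample(n, seed, do_overtime=None):
    """Draw n units from a toy SCM; do_overtime forces overtime to a constant."""
    rng = np.random.default_rng(seed)
    local_buzz = rng.normal(size=n)       # hidden confounder
    noise_ot = rng.normal(size=n)
    noise_sales = rng.normal(size=n)
    if do_overtime is None:
        # Observational mechanism: overtime listens to local_buzz.
        overtime = 0.8 * local_buzz + 0.6 * noise_ot
    else:
        # Surgery: the arrow local_buzz -> overtime is cut and the value is forced.
        overtime = np.full(n, float(do_overtime))
    # Sales depends on local_buzz only; there is no arrow from overtime.
    sales = 0.8 * local_buzz + 0.6 * noise_sales
    return overtime, sales

# Seeing: correlation, and the gap obtained by conditioning on overtime.
overtime, sales = sample(40_000, seed=0)
corr = np.corrcoef(overtime, sales)[0, 1]
cond_gap = sales[overtime > 0].mean() - sales[overtime < 0].mean()

# Doing: force overtime high and low with the same seed and compare mean sales.
_, sales_hi = sample(40_000, seed=1, do_overtime=1.0)
_, sales_lo = sample(40_000, seed=1, do_overtime=-1.0)
do_effect = (sales_hi.mean() - sales_lo.mean()) / 2

print(f"corr={corr:.2f}  conditioning gap={cond_gap:.2f}  do effect={do_effect:.2f}")
```

Conditioning selects units whose overtime happened to be high, which also selects high local_buzz, so the sales gap is clearly positive. Under the intervention the confounder no longer reaches overtime, and with a shared seed the two sales samples are identical, so the estimated effect is zero.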

More than a coffee demo — who it's for

One generator, several jobs. The coffee world above is one cross-sectional example; the same machinery spans discovery and control, cross-sectional and temporal.

Causal discovery

Benchmark a method

Implement one function — recover() — and score it against the answer key, with PC, FCI, GES, GIES, DAGMA, and DirectLiNGAM wrapped as baselines to beat.

RL / control

Benchmark a control agent

The same worlds are a control gym: a by-construction optimal policy, regret, and regret-under-perturbation, plus a gymnasium.Env whose regime shifts under you.

Time series

Temporal causal discovery

Lagged, autoregressive worlds with a lagged answer key and temporal counterfactuals — graded against PCMCI+, LPCMCI, VARLiNGAM, and Granger.

LLM evaluation

Parrot-proof reasoning eval

A test an LLM can't ace by reciting variable names: fiction-first, name-blinded, and audited so the only way to score is to actually discover.

Sandbox

Describe a world, conversationally

causal-worlds elicit builds a world through a short dialogue; causal-worlds viz draws it — a playground for causal intuition.

Agents

Stress-test a causal agent

Plug in your own discoverer or controller and see whether it recovers structure — and stays optimal — where the standard toolbox gets fooled.

A benchmark we tried to break ourselves

A benchmark you can memorize isn't a benchmark. Fiction-first means there's nothing real to recite and no data to leak. Then we audited our own work and fixed what we found:

The headline crossover is an honest identifiability result: given the same interventions, standard methods still report a hidden-confounded pair as causal; only a latent-aware rule reaches zero. Details and the disclosed residuals: findings · foundations.

Try it in 60 seconds — no API key

pip install causal-worlds

# in Python:
from causal_worlds import worlds, grade_spec, InterventionalCiDiscoverer
print(grade_spec(worlds.get("coffee"), InterventionalCiDiscoverer()))
# directed_shd=0  f1=1.0  confounded_reported=0     ← swap in YOUR discoverer
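
For readers unfamiliar with the scores printed above, the sketch below shows one conventional way to compute a directed structural Hamming distance and an edge-level F1 between a true and a recovered edge set. It is an assumption about how such metrics are usually defined (a reversed edge counts as one error), not the library's implementation; the function names and the three-node example graph are hypothetical.

```python
def directed_shd(true_edges, found_edges):
    """Missing + extra edges, with a reversed edge counted once rather than twice."""
    true_set, found_set = set(true_edges), set(found_edges)
    missing = true_set - found_set
    extra = found_set - true_set
    # A reversal shows up as one missing edge plus its flipped twin among the extras.
    reversed_edges = {(a, b) for (a, b) in missing if (b, a) in extra}
    return len(missing) + len(extra) - len(reversed_edges)

def edge_f1(true_edges, found_edges):
    """Harmonic mean of precision and recall over directed edges."""
    true_set, found_set = set(true_edges), set(found_edges)
    if not true_set and not found_set:
        return 1.0  # nothing to find and nothing reported
    true_positives = len(true_set & found_set)
    return 2 * true_positives / (len(true_set) + len(found_set))

# Hypothetical answer key and two candidate recoveries.
truth = [("x", "y"), ("y", "z")]
perfect = [("x", "y"), ("y", "z")]
flawed = [("x", "y"), ("z", "y"), ("x", "z")]  # one reversal, one spurious edge

print(directed_shd(truth, perfect), edge_f1(truth, perfect))  # 0 1.0
print(directed_shd(truth, flawed), edge_f1(truth, flawed))    # 2 0.4
```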

Engine, grading, and the graph renderers run offline; only authoring a fresh world from a sentence needs a model. causal-worlds viz coffee draws the world for you.