open source · MIT · pip install causal-worlds

A causal world with an answer key.

Turn a plain-language description of an operation into a fictional-but-coherent causal world with a declared ground-truth causal graph. See its correlations, intervene with do(), ask counterfactuals — and check every answer against the truth, because you wrote it.

Correlation lies — and because the world is declared, you can prove it

In the built-in coffee world, overtime and sales rise together. A naive analyst "discovers" that overtime drives sales. But force overtime with an intervention and sales does not move: the whole link came from a hidden confounder.

0.64: SEEING, correlation(overtime, sales)
0.00: DOING, causal effect via do(overtime)
3 / 3: rungs of Pearl's ladder, with a known answer
The declared SCM for the coffee world. The dashed-red node, local_buzz, is a hidden confounder that drives footfall, overtime, and sales: the structure a discovery method never sees.
import numpy as np
from causal_worlds import build_substrate, worlds

sub = build_substrate(worlds.get("coffee"), standardize=False)
ov, sa = sub.variables.index("overtime"), sub.variables.index("sales")   # column indices
# SEEING: sample observationally and correlate overtime with sales
seen = sub.sample(40_000, seed=0).data
corr = np.corrcoef(seen[:, ov], seen[:, sa])[0, 1]                  # ≈ 0.64  → looks causal
# DOING: force overtime to +1 and -1 with the same seed, then compare mean sales
hi = sub.sample(40_000, seed=1, do={"overtime":  1.0}).data[:, sa].mean()
lo = sub.sample(40_000, seed=1, do={"overtime": -1.0}).data[:, sa].mean()
# (hi - lo) / 2 is the change in mean sales per unit of forced overtime
print(round(corr, 2), round((hi - lo) / 2, 2))   # 0.64  0.00  → strong correlation, ZERO causal effect

Seeing, doing, imagining

Rung 1

Seeing

Association — what correlates. The data alone can't tell you why.

Rung 2

Doing

do() is genuine graph surgery — cut the arrows into a variable, keep the arrows out. The mirage vanishes.

Rung 3

Imagining

Counterfactuals (abduction → action → prediction), exact — on cross-sectional and temporal worlds.

Rung 2: do(footfall). The arrows into footfall are cut; its downstream effects still flow. Verified genuine surgery, not conditioning.
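
To make "cut the arrows in, keep the arrows out" concrete, here is a minimal NumPy sketch of graph surgery on a toy linear SCM. It does not use causal-worlds: the `sample` helper, the graph, and every coefficient are illustrative assumptions, loosely shaped like the coffee world (a hidden buzz variable drives footfall, overtime, and sales, and only footfall has an arrow into sales).

```python
import numpy as np

def sample(n, seed, do=None):
    """Draw n rows from a toy linear SCM (all coefficients are assumptions).

    buzz -> footfall, buzz -> overtime, buzz -> sales, footfall -> sales.
    There is deliberately NO edge overtime -> sales.
    `do` maps a variable name to a forced value: that variable's structural
    equation is replaced (incoming arrows cut), outgoing arrows are kept.
    """
    do = do or {}
    rng = np.random.default_rng(seed)
    noise = rng.normal(size=(4, n))  # all noise drawn up front, so it is shared across interventions
    buzz = noise[0]                  # hidden confounder
    if "footfall" in do:
        footfall = np.full(n, float(do["footfall"]))
    else:
        footfall = 0.8 * buzz + 0.6 * noise[1]
    if "overtime" in do:
        overtime = np.full(n, float(do["overtime"]))
    else:
        overtime = 0.8 * buzz + 0.6 * noise[2]
    sales = 0.7 * footfall + 0.6 * buzz + 0.5 * noise[3]
    return {"footfall": footfall, "overtime": overtime, "sales": sales}

def effect(variable, n=40_000, seed=1):
    """Mean change in sales per unit of the forced variable."""
    hi = sample(n, seed, do={variable: 1.0})["sales"].mean()
    lo = sample(n, seed, do={variable: -1.0})["sales"].mean()
    return (hi - lo) / 2

seen = sample(40_000, seed=0)
corr = np.corrcoef(seen["overtime"], seen["sales"])[0, 1]  # clearly positive: looks causal
overtime_effect = effect("overtime")   # no arrow out of overtime, so sales does not move
footfall_effect = effect("footfall")   # the footfall -> sales arrow survives the surgery
print(round(overtime_effect, 2), round(footfall_effect, 2))   # 0.0 0.7
```

Conditioning on overtime would still pick up buzz through the back door; replacing overtime's equation removes that path, which is why the forced effect is zero while the correlation is not.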

Rung 3 — imagine a different past on the same day. Because the SCM is declared, the counterfactual is exact (abduction → action → prediction):

from causal_worlds import counterfactual, worlds
cf = counterfactual(worlds.get("coffee"), do={"footfall": 2.0}, seed=0)
print(round(cf.factual["sales"], 2), "->", round(cf.counterfactual["sales"], 2))   # 3.24 -> 4.55
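
The three steps named above can be sketched by hand on a two-variable linear SCM. This is an illustration only, not the library's implementation: the function name `counterfactual_sales` and the coefficient `B` are assumptions chosen so the arithmetic is easy to follow.

```python
# Toy linear SCM (coefficient is an illustrative assumption):
#   footfall = u_footfall
#   sales    = B * footfall + u_sales
B = 0.5

def counterfactual_sales(footfall_obs, sales_obs, footfall_cf):
    """Sales this same unit would have had under a different footfall."""
    u_sales = sales_obs - B * footfall_obs   # abduction: recover this unit's noise
    footfall = footfall_cf                   # action: replace footfall's equation
    return B * footfall + u_sales            # prediction: re-run with the same noise

factual = 2.0
imagined = counterfactual_sales(footfall_obs=1.0, sales_obs=factual, footfall_cf=3.0)
print(factual, "->", imagined)   # 2.0 -> 3.0
```

The counterfactual is exact here because the structural equation is known, so the unit's noise can be solved for rather than averaged over.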

More than a coffee demo — who it's for

One generator, several jobs. The coffee world above is one cross-sectional example; the same machinery spans discovery and control, cross-sectional and temporal.

Causal discovery

Benchmark a method

Implement one function — recover() — and score it against the answer key, with PC, FCI, GES, GIES, DAGMA, and DirectLiNGAM wrapped as baselines to beat.

RL / control

Benchmark a control agent

The same worlds are a control gym: a by-construction optimal policy, regret, and regret-under-perturbation, plus a gymnasium.Env whose regime shifts under you.

Time series

Temporal causal discovery

Lagged, autoregressive worlds with a lagged answer key and temporal counterfactuals — graded against PCMCI+, LPCMCI, VARLiNGAM, and Granger.

LLM evaluation

Parrot-proof reasoning eval

A test an LLM can't ace by reciting variable names: fiction-first, name-blinded, and audited so the only way to score is to actually discover.

Sandbox

Describe a world, get a world

Write a sentence and causal-worlds generate … --playground hands you an executable world with its answer key — no benchmark gate in the way. elicit builds one through dialogue; viz draws it.

Agents

Stress-test a causal agent

Plug in your own discoverer or controller and see whether it recovers structure — and stays optimal — where the standard toolbox gets fooled.

Every capability — a few lines each

Everything below runs with no API key (only authoring needs a model). Every number is from a live run.

Benchmark a discoverer against ground truth

Implement one function, recover() — or run the wrapped baselines — and score against the answer key. The column that matters is confounded_reported: how many spurious confounded pairs a method keeps as real edges.

from causal_worlds import worlds, grade_spec, InterventionalCiDiscoverer, PcDiscoverer, GiesDiscoverer
spec = worlds.get("coffee")
for name, disc in {"interventional-ci": InterventionalCiDiscoverer, "pc": PcDiscoverer, "gies": GiesDiscoverer}.items():
    r = grade_spec(spec, disc(), seed=0)
    print(f"{name:>18}  F1={r.f1:.2f}  confounded_kept={r.confounded_reported}")
# interventional-ci  F1=1.00  confounded_kept=0     ← the only one that nails it AND avoids the trap
#                pc  F1=0.77  confounded_kept=1
#               gies  F1=0.77  confounded_kept=1
Discovery shootout on the coffee world: only the latent-aware reference keeps zero confounded pairs and recovers the structure; the box-stock discoverers PC, FCI, and GIES all keep the spurious edge.
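
For intuition about the two reported columns, here is a sketch of how an edge-level F1 and a confounded-pair count could be computed from a declared graph. The `score` function, the toy edge lists, and the rule that a confounded pair counts as reported when an edge appears in either direction are all assumptions for illustration; the library's grader may define these differently.

```python
def score(true_edges, confounded_pairs, predicted_edges):
    """Return (f1, confounded_reported) for a predicted directed graph."""
    truth, predicted = set(true_edges), set(predicted_edges)
    tp = len(truth & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # Assumption: a confounded pair is "reported" if either direction is predicted.
    confounded_reported = sum(
        1 for a, b in confounded_pairs
        if (a, b) in predicted or (b, a) in predicted
    )
    return f1, confounded_reported

# Toy answer key (not the real coffee world).
true_edges = [("footfall", "sales"), ("price", "sales")]
confounded_pairs = [("overtime", "sales")]   # correlated only through a hidden cause

fooled = score(true_edges, confounded_pairs, true_edges + [("overtime", "sales")])
clean = score(true_edges, confounded_pairs, true_edges)
print(fooled, clean)
```

In this toy, keeping the spurious edge lowers precision, so F1 drops even though every true edge was found.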

Stay optimal when the world shifts

The same worlds are a control benchmark: the optimum is computable from the declared SCM, so a policy is graded by regret — and the load-bearing metric is regret under a regime shift.

from causal_worlds import worlds, default_objective, optimal_policy, regret_under_perturbation
spec = worlds.get("coffee"); obj = default_objective(spec)
print(optimal_policy(spec, obj))                                   # {'price': 0.0}  ← declared optimum
# the price lever's sign flips across regimes, so a regime-blind policy collapses when it flips:
rep = regret_under_perturbation(spec, obj, {"price": -1.0}, seed=7)
print(rep.per_regime)                                              # {'baseline': 0.0, 'weekend': 2.0}
Control under perturbation: a regime-blind policy has 0.0 regret in its own regime and 2.0 once the regime flips, while a regime-aware policy stays at 0.0.
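
Regret under a regime shift can be worked through on a toy problem where the price lever's sign flips. Everything in this sketch is an assumption made for illustration (the reward model, the action set, and the helper names); it is not the coffee spec or the library's objective, only a setup small enough to check by hand.

```python
# Toy model: reward = coef * price, where the sign of coef depends on the regime.
REGIMES = {"baseline": -1.0, "weekend": 1.0}
ACTIONS = (-1.0, 0.0, 1.0)

def value(price, coef):
    return coef * price

def regret(policy, regime):
    """Best achievable value in this regime minus the value the policy gets."""
    coef = REGIMES[regime]
    best = max(value(a, coef) for a in ACTIONS)
    return best - value(policy(regime), coef)

def blind(regime):
    return -1.0                                    # tuned for baseline, ignores the regime

def aware(regime):
    return -1.0 if REGIMES[regime] < 0 else 1.0    # follows the lever's sign

blind_regret = {r: regret(blind, r) for r in REGIMES}
aware_regret = {r: regret(aware, r) for r in REGIMES}
print(blind_regret)   # {'baseline': 0.0, 'weekend': 2.0}
print(aware_regret)   # {'baseline': 0.0, 'weekend': 0.0}
```

Because the optimum is computed from the declared model rather than estimated, the regret numbers are exact, which is the property the control benchmark relies on.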

Worlds that evolve in time

Lagged, autoregressive worlds with a lagged answer key — and counterfactuals that roll a whole trajectory forward under a sustained intervention.

from causal_worlds import counterfactual_temporal, worlds
tcf = counterfactual_temporal(worlds.get("supply"), do={"order": 2.0}, seed=0, steps=200)
# hold orders high for 200 steps:  stockout -2.68,  inventory +3.35,  cost +0.63  (mean shift)
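
A sustained-intervention counterfactual on a lagged world amounts to replaying the same noise with one variable held fixed. The sketch below uses a hypothetical two-variable system with assumed coefficients; it is not the supply world, and `rollout` is not a library function.

```python
import numpy as np

def rollout(noise, do_order=None):
    """Simulate a toy lagged system (coefficients are assumptions).

    order[t]     = noise[t, 0], or do_order when an intervention is held
    inventory[t] = 0.8 * inventory[t-1] + 0.5 * order[t-1] + noise[t, 1]
    """
    steps = noise.shape[0]
    order = np.zeros(steps)
    inventory = np.zeros(steps)
    for t in range(steps):
        order[t] = noise[t, 0] if do_order is None else do_order
        inventory[t] = noise[t, 1]
        if t > 0:
            inventory[t] += 0.8 * inventory[t - 1] + 0.5 * order[t - 1]
    return order, inventory

rng = np.random.default_rng(0)
noise = rng.normal(size=(200, 2))        # one shared noise trajectory

_, factual = rollout(noise)                   # what happened
_, imagined = rollout(noise, do_order=2.0)    # same noise, orders held high
mean_shift = (imagined - factual).mean()
print(round(mean_shift, 2))
```

Reusing the noise array is what makes this a counterfactual about one trajectory rather than a comparison of two independent simulations.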

Author one from a sentence — or just describe it

By default, authoring builds a benchmark-grade world and rejects it if it is guessable from the variable names. Pass --playground to keep faithfulness and a difficulty score but never reject: describe a world and just get it.

$ causal-worlds generate "a regional power grid with rooftop solar and time-of-use pricing" ./grid
not admitted: T4 cliché: names+roles recover it (prior F1 0.84 >= 0.5)
hint: re-run with --playground to author it anyway (guessability becomes an advisory score).

$ causal-worlds generate "a regional power grid with rooftop solar and time-of-use pricing" ./grid --playground
admitted -> ./grid  difficulty=0.16 (advisory)     # 8 edges + a hidden confounder, reference grader F1 0.93
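
The gate in the transcript above can be read as a threshold on how well a names-only guess recovers the graph. The sketch below is a guess at that logic, not the CLI's code: `admit` is a hypothetical function, the 0.5 threshold is taken from the "not admitted" message, and the formula difficulty = 1 - prior F1 is an assumption that happens to match the two numbers shown (0.84 and 0.16).

```python
def admit(prior_f1, threshold=0.5, playground=False):
    """Return (admitted, difficulty) for a candidate world.

    prior_f1: F1 that a names-and-roles guess achieves against the answer key.
    """
    difficulty = round(1.0 - prior_f1, 2)   # assumption, consistent with 0.84 -> 0.16
    if playground:
        return True, difficulty              # advisory score only, never reject
    return prior_f1 < threshold, difficulty

strict = admit(0.84)                         # rejected: too guessable
relaxed = admit(0.84, playground=True)       # admitted, difficulty is advisory
print(strict, relaxed)   # (False, 0.16) (True, 0.16)
```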

See any world

Zero-dependency SCM renderers — Mermaid (draws on GitHub) or Graphviz DOT, with path coefficients on the edges and hidden confounders dashed.

from causal_worlds import to_mermaid, worlds
print(to_mermaid(worlds.get("coffee")))     # or from the CLI:  causal-worlds viz coffee --format dot
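
To show what such output can look like, here is a small standalone function that turns a weighted edge list into Mermaid flowchart text, with edges out of hidden nodes drawn dashed. This is a sketch, not the library's to_mermaid: the function name `edges_to_mermaid`, the edge list, and the coefficients are assumptions, and the real renderer's output format may differ.

```python
def edges_to_mermaid(edges, hidden=()):
    """Render (source, target, coefficient) triples as a Mermaid flowchart."""
    hidden = set(hidden)
    lines = ["graph LR"]
    for src, dst, coef in edges:
        arrow = "-.->" if src in hidden else "-->"   # dashed link for hidden causes
        lines.append(f'    {src} {arrow}|"{coef:.2f}"| {dst}')
    return "\n".join(lines)

# Toy graph with assumed coefficients.
toy_edges = [
    ("local_buzz", "overtime", 0.8),
    ("local_buzz", "sales", 0.6),
    ("footfall", "sales", 0.7),
]
diagram = edges_to_mermaid(toy_edges, hidden=["local_buzz"])
print(diagram)
```

Pasting the printed text into a Mermaid code block on GitHub should draw the graph, with the two local_buzz edges dashed.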

A benchmark we tried to break ourselves

A benchmark you can memorize isn't a benchmark. Fiction-first means there's nothing real to recite and no data to leak. Then we audited our own work and fixed what we found.

The headline crossover is an honest identifiability result: given the same interventions, standard methods keep a hidden-confounded pair as causal; only a latent-aware rule brings confounded_reported to zero. Details and the disclosed residuals: findings · foundations.

Try it in 60 seconds — no API key

pip install causal-worlds

# in Python:
from causal_worlds import worlds, grade_spec, InterventionalCiDiscoverer
print(grade_spec(worlds.get("coffee"), InterventionalCiDiscoverer()))
# directed_shd=0  f1=1.0  confounded_reported=0     ← swap in YOUR discoverer

The engine, grading, and graph renderers run offline; only authoring a fresh world from a sentence needs a model. Use causal-worlds generate "…" ./world --playground for the describe-and-get-it path, or omit --playground for a benchmark-grade (name-unguessable) world. causal-worlds viz coffee draws the world for you.