open source · MIT · pip install causal-worlds
Turn a plain-language description of an operation into a fictional-but-coherent causal
world with a declared ground-truth causal graph. See its correlations, intervene with
do(), ask counterfactuals — and check every answer against the truth, because you wrote it.
In the built-in coffee world, overtime and sales rise
together. A naive analyst "discovers" that overtime drives sales. But force overtime with an
intervention and sales do not move: the entire association came from a hidden confounder.
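import numpy as np
from causal_worlds import build_substrate, worlds

# Build the built-in coffee world and look up the two columns of interest.
sub = build_substrate(worlds.get("coffee"), standardize=False)
ov, sa = sub.variables.index("overtime"), sub.variables.index("sales")

# Observational sample: how strongly do overtime and sales move together?
seen = sub.sample(40_000, seed=0).data
corr = np.corrcoef(seen[:, ov], seen[:, sa])[0, 1] # ≈ 0.64 → looks causal

# Interventional samples: force overtime high, then low, with the same seed.
hi = sub.sample(40_000, seed=1, do={"overtime": 1.0}).data[:, sa].mean()
lo = sub.sample(40_000, seed=1, do={"overtime": -1.0}).data[:, sa].mean()

# (hi - lo) / 2 is the change in mean sales per unit of forced overtime.
print(round(corr, 2), round((hi - lo) / 2, 2)) # 0.64 0.00 → strong correlation, ZERO causal effect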
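The mechanism behind that result can be reproduced without the library. The following is a minimal NumPy sketch, not the coffee world's actual structural equations: the hidden variable `demand`, the helper `sample_toy_world`, and the coefficients 0.8 and 0.6 are all assumptions, chosen so the observational correlation lands near the figure quoted above.

```python
import numpy as np

def sample_toy_world(n, seed, do_overtime=None):
    """Toy world (hypothetical): demand -> overtime, demand -> sales, no overtime -> sales arrow."""
    rng = np.random.default_rng(seed)
    demand = rng.normal(size=n)  # hidden common cause, never shown to the analyst
    if do_overtime is None:
        # Natural mechanism: overtime listens to demand.
        overtime = 0.8 * demand + 0.6 * rng.normal(size=n)
    else:
        # Graph surgery: the arrow demand -> overtime is cut and overtime is held fixed.
        overtime = np.full(n, float(do_overtime))
    # Sales depend on demand only; overtime does not appear in this equation.
    sales = 0.8 * demand + 0.6 * rng.normal(size=n)
    return overtime, sales

# Observational regime: the shared cause induces a strong correlation.
overtime, sales = sample_toy_world(40_000, seed=0)
corr = np.corrcoef(overtime, sales)[0, 1]

# Interventional regime: same seed, overtime forced to +1 and then to -1.
_, sales_hi = sample_toy_world(40_000, seed=1, do_overtime=1.0)
_, sales_lo = sample_toy_world(40_000, seed=1, do_overtime=-1.0)
effect = (sales_hi.mean() - sales_lo.mean()) / 2

print(round(corr, 2), effect)
```

Because sales never read overtime in this toy model, both interventional runs draw identical sales and the estimated effect is exactly zero, while the observational correlation stays near 0.64.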
Association: what correlates. The data alone can't tell you why.
Intervention: do() is genuine graph surgery. It cuts the arrows into a variable and keeps the arrows out, so the mirage vanishes.
Counterfactuals: abduction → action → prediction, computed exactly on both cross-sectional and temporal worlds.
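The three counterfactual steps can be sketched on a tiny linear model. This is an illustration of the general recipe, not the library's implementation: the variables `staffing`, `wait`, and `sales`, the coefficients, and the helper names are all hypothetical.

```python
# Hypothetical linear structural causal model (names and numbers invented):
#   staffing := u_staffing
#   wait     := B_WAIT  * staffing + u_wait
#   sales    := B_SALES * wait     + u_sales
B_WAIT, B_SALES = -0.5, -0.4

def forward(staffing, u_wait, u_sales):
    """Run the structural equations for one unit with fixed noise terms."""
    wait = B_WAIT * staffing + u_wait
    sales = B_SALES * wait + u_sales
    return wait, sales

def counterfactual_sales(obs, new_staffing):
    """Sales this same unit would have had under a different staffing level."""
    # 1. Abduction: recover the unit's noise terms from what was observed.
    u_wait = obs["wait"] - B_WAIT * obs["staffing"]
    u_sales = obs["sales"] - B_SALES * obs["wait"]
    # 2. Action: replace staffing's own mechanism with the chosen value.
    # 3. Prediction: push the recovered noise through the modified model.
    _, sales_cf = forward(new_staffing, u_wait, u_sales)
    return sales_cf

observed = {"staffing": 1.0, "wait": 0.25, "sales": 2.0}
cf = counterfactual_sales(observed, new_staffing=2.0)
print(cf)  # about 2.2: one extra unit of staffing adds B_WAIT * B_SALES = 0.2
```

In a linear model with additive noise the abduction step is exact, because each noise term is simply the residual of its own equation.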
do(footfall): arrows into footfall are cut; its effects still flow. Verified genuine surgery, not conditioning.
One generator, several jobs. The coffee world above is one cross-sectional example; the same machinery spans discovery and control, cross-sectional and temporal.
Implement one function — recover() — and score it against the answer
key, with PC, FCI, GES, GIES, DAGMA, and DirectLiNGAM wrapped as baselines to beat.
The same worlds are a control gym: a by-construction optimal policy,
regret, and regret-under-perturbation, plus a
gymnasium.Env whose regime shifts under you.
Lagged, autoregressive worlds with a lagged answer key and temporal counterfactuals — graded against PCMCI+, LPCMCI, VARLiNGAM, and Granger.
A test an LLM can't ace by reciting variable names: fiction-first, name-blinded, and audited so the only way to score is to actually discover.
causal-worlds elicit builds a world through a short dialogue;
causal-worlds viz draws it — a playground for causal intuition.
Plug in your own discoverer or controller and see whether it recovers structure — and stays optimal — where the standard toolbox gets fooled.
A benchmark you can memorize isn't a benchmark. Fiction-first means there's nothing real to recite and no data to leak. Then we audited our own work and fixed what we found:
The headline crossover is an honest identifiability result: given the same interventions, standard methods keep reporting a hidden-confounded pair as causal; only a latent-aware rule brings that count to zero. Details + the disclosed residuals: findings · foundations.
pip install causal-worlds
# in Python:
from causal_worlds import worlds, grade_spec, InterventionalCiDiscoverer
print(grade_spec(worlds.get("coffee"), InterventionalCiDiscoverer()))
# directed_shd=0 f1=1.0 confounded_reported=0 ← swap in YOUR discoverer
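For intuition about the directed_shd and f1 fields in that output, here is a minimal sketch of how such scores are commonly computed from adjacency matrices. The library's exact definitions are not given here, so this is an assumption: `directed_shd` counts each node pair whose edge status differs (a reversed edge counts once), `directed_f1` is the F1 score over directed edges, and both helper names and the three-node graphs are hypothetical.

```python
import numpy as np

def directed_shd(true_adj, est_adj):
    """Number of node pairs whose edge status (none, i->j, j->i) differs."""
    t = np.asarray(true_adj, dtype=bool)
    e = np.asarray(est_adj, dtype=bool)
    diff = t != e
    # A pair counts once even if both directions disagree (e.g. a reversal).
    pair_differs = diff | diff.T
    return int(np.triu(pair_differs, k=1).sum())

def directed_f1(true_adj, est_adj):
    """F1 over directed edges; returns 0.0 when no edge is correctly recovered."""
    t = np.asarray(true_adj, dtype=bool)
    e = np.asarray(est_adj, dtype=bool)
    tp = int((t & e).sum())
    if tp == 0:
        return 0.0
    precision = tp / int(e.sum())
    recall = tp / int(t.sum())
    return 2 * precision * recall / (precision + recall)

# adj[i, j] = 1 means an arrow i -> j. Node order: [hidden, overtime, sales].
truth = np.array([[0, 1, 1],
                  [0, 0, 0],
                  [0, 0, 0]])
# A naive answer that keeps overtime -> sales and misses hidden -> sales.
naive = np.array([[0, 1, 0],
                  [0, 0, 1],
                  [0, 0, 0]])

print(directed_shd(truth, truth), directed_f1(truth, truth))  # 0 1.0
print(directed_shd(truth, naive), directed_f1(truth, naive))  # 2 0.5
```

A perfect recovery scores a distance of 0 and an F1 of 1.0, while the naive graph is penalized once for the spurious arrow and once for the missing one.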
Engine, grading, and the graph renderers run offline; only authoring a fresh world from a
sentence needs a model. causal-worlds viz coffee draws the world for you.