This cached ablation compares four panel ATT estimators across a sequence of data-generating processes. The point is not to crown a universal winner; it is to make a failure mode visible. When treatment timing is selected on latent factors, a two-way fixed-effects regression can look precise while estimating the wrong counterfactual trend.
Each Monte Carlo replication generates one panel and sends the same draw through four estimators:
- FixedEffectsOLS, used as a two-way fixed-effects treatment regression
- SyntheticControl, fit to the average treated unit’s pre-period path
- SyntheticDID, using the panel estimator exposed by crabbymetrics
- MatrixCompletion, fit after masking treated post-treatment outcomes and using the completed values as counterfactuals
All four DGPs have the same true post-treatment ATT, \(\tau = 1\). The DGPs differ in whether untreated outcomes are additive or low-rank and in whether treatment is randomly assigned or selected on latent factor loadings.
The page uses Quarto execution caching and freezing, so the Monte Carlo runs once and later site renders reuse the saved output unless this file changes.
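For reference, Quarto's caching and freeze behavior is controlled from the front matter or `_quarto.yml`. The page's actual configuration is not reproduced here; this is a minimal sketch of the relevant options:

```yaml
# Sketch: enable Jupyter-style execution caching plus freeze, so rendered
# results are reused until this document's source changes.
execute:
  cache: true
  freeze: auto
```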
1 Setup
```python
from dataclasses import dataclass
from html import escape

import matplotlib.pyplot as plt
import numpy as np
from IPython.display import HTML, display

import crabbymetrics as cm

np.set_printoptions(precision=4, suppress=True)


def html_table(headers, rows):
    parts = ["<table>", "<thead>", "<tr>"]
    parts.extend(f"<th>{escape(str(h))}</th>" for h in headers)
    parts.extend(["</tr>", "</thead>", "<tbody>"])
    for row in rows:
        parts.append("<tr>")
        parts.extend(f"<td>{cell}</td>" for cell in row)
        parts.append("</tr>")
    parts.extend(["</tbody>", "</table>"])
    return "".join(parts)


@dataclass(frozen=True)
class DGP:
    key: str
    title: str
    factor_strength: float
    selection: str
    additive_only: bool = False
    nonlinear: bool = False
```
Untreated potential outcomes combine unit effects, time effects, and a latent factor component, with the factor term omitted in the additive benchmark. In selected-adoption designs, treated units are more likely to have large latent factor scores. In the strongest design, treatment is assigned outright to the units with the highest factor-selection score, so parallel trends is deliberately false.
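As a self-contained sketch of this selection mechanism (the function name, dimensions, and constants here are illustrative assumptions, not the page's actual generator):

```python
import numpy as np


def sketch_panel(rng, n=40, tt=30, t_pre=20, n_treated=8,
                 factor_strength=1.0, selected=True, tau=1.0):
    """Hedged sketch: low-rank untreated outcomes with optional selection."""
    loadings = rng.normal(size=(n, 2))   # latent unit loadings
    factors = rng.normal(size=(2, tt))   # common time factors
    unit_fe = rng.normal(size=(n, 1))
    time_fe = rng.normal(size=(1, tt))
    y0 = (unit_fe + time_fe
          + factor_strength * loadings @ factors
          + rng.normal(scale=0.5, size=(n, tt)))
    if selected:
        # Adoption selected on latent loadings: treat the units with the
        # largest factor scores, so parallel trends fails by construction.
        score = loadings.sum(axis=1)
        treated_units = np.argsort(score)[-n_treated:]
    else:
        treated_units = rng.choice(n, size=n_treated, replace=False)
    panel = y0.copy()
    panel[np.ix_(treated_units, np.arange(t_pre, tt))] += tau  # constant ATT
    return panel, treated_units, t_pre, tau
```

Setting `selected=False` recovers the random-adoption designs with the same outcome model.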
The estimators are intentionally fed identical panels within a replication. SyntheticDID and MatrixCompletion use the common fit(Y, W) panel API: Y is the observed panel and W is the same-shaped absorbing treatment matrix. MatrixCompletion treats the untreated cells (W == 0) as observed and reconstructs \(Y(0)\) in exactly the treated cells where the ATT is evaluated.
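A toy NumPy illustration of that masking convention (a standalone sketch, not crabbymetrics code):

```python
import numpy as np

# Toy 4-unit, 5-period panel: units 2 and 3 adopt treatment at period 3.
panel = np.arange(20, dtype=float).reshape(4, 5)
treated_units, t_pre = [2, 3], 3

# Same-shaped absorbing treatment matrix: 1 on treated post-treatment cells.
w = np.zeros_like(panel)
w[np.ix_(treated_units, np.arange(t_pre, panel.shape[1]))] = 1.0

observed_cells = w == 0  # cells a matrix-completion fit may use (16 here)
att_cells = w == 1       # treated post-treatment cells where the ATT lives (4 here)
```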
```python
def estimate_twfe(panel, treated_units, t_pre):
    # Two-way fixed effects on the long panel with a treated-x-post dummy.
    n, tt = panel.shape
    unit = np.repeat(np.arange(n, dtype=np.uint32), tt)
    time = np.tile(np.arange(tt, dtype=np.uint32), n)
    treated = np.zeros(n)
    treated[treated_units] = 1.0
    post = (np.arange(tt) >= t_pre).astype(float)
    d = np.repeat(treated, tt) * np.tile(post, n)
    fe = np.column_stack([unit, time])
    model = cm.FixedEffectsOLS()
    model.fit(d[:, None], fe, panel.reshape(-1))
    return float(np.asarray(model.summary()["coef"])[0])


def estimate_sc(panel, treated_units, t_pre):
    # Synthetic control fit to the average treated unit's pre-period path.
    all_units = np.arange(panel.shape[0])
    controls = np.array([i for i in all_units if i not in set(treated_units)], dtype=int)
    donors_pre = panel[np.ix_(controls, np.arange(t_pre))].T
    donors_post = panel[np.ix_(controls, np.arange(t_pre, panel.shape[1]))].T
    treated_pre = panel[np.ix_(treated_units, np.arange(t_pre))].mean(axis=0)
    treated_post = panel[np.ix_(treated_units, np.arange(t_pre, panel.shape[1]))].mean(axis=0)
    model = cm.SyntheticControl(max_iterations=800)
    model.fit(donors_pre, treated_pre)
    return float((treated_post - model.predict(donors_post)).mean())


def treatment_matrix(panel, treated_units, t_pre):
    # Absorbing treatment indicator with the same shape as the panel.
    w = np.zeros_like(panel, dtype=float)
    w[np.ix_(list(treated_units), np.arange(t_pre, panel.shape[1]))] = 1.0
    return w


def estimate_sdid(panel, treated_units, t_pre):
    model = cm.SyntheticDID(zeta_omega=1e-8, zeta_lambda=1e-8, max_iterations=1500)
    model.fit(panel, treatment_matrix(panel, treated_units, t_pre))
    return float(model.summary()["att"])


def estimate_mc(panel, treated_units, t_pre):
    model = cm.MatrixCompletion(lambda_fraction=0.03, max_iterations=500, tolerance=1e-7)
    model.fit(panel, treatment_matrix(panel, treated_units, t_pre))
    return float(model.summary()["att"])


estimators = [
    ("TWFE", estimate_twfe),
    ("SC", estimate_sc),
    ("SDID", estimate_sdid),
    ("MC", estimate_mc),
]
```
4 Monte Carlo
```python
def run_mc(reps=160, seed=2026050202):
    rng = np.random.default_rng(seed)
    results = {d.key: {name: [] for name, _ in estimators} for d in dgps}
    failures = {d.key: {name: 0 for name, _ in estimators} for d in dgps}
    for dgp in dgps:
        for _ in range(reps):
            panel, treated_units, t_pre, _tau = make_panel(rng, dgp)
            for name, fn in estimators:
                try:
                    estimate = fn(panel, treated_units, t_pre)
                    if np.isfinite(estimate):
                        results[dgp.key][name].append(estimate)
                    else:
                        failures[dgp.key][name] += 1
                except Exception:
                    failures[dgp.key][name] += 1
    for dgp in dgps:
        for name, _ in estimators:
            results[dgp.key][name] = np.asarray(results[dgp.key][name], dtype=float)
    return results, failures


results, failures = run_mc()
```
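One way to condense each estimator's draws into bias, spread, and RMSE against the known \(\tau = 1\). The helper below is a sketch, not part of the page's code; it only assumes each `results[dgp.key][name]` is a 1-D array of ATT estimates:

```python
import numpy as np


def summarize(estimates, tau=1.0):
    """Return (bias, sd, rmse) for a 1-D array of ATT estimates (sketch)."""
    est = np.asarray(estimates, dtype=float)
    bias = float(est.mean() - tau)
    sd = float(est.std(ddof=1))
    rmse = float(np.sqrt(np.mean((est - tau) ** 2)))
    return bias, sd, rmse
```

Rows built this way per DGP and estimator can be fed straight to the `html_table` helper from the setup block.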
The additive benchmark is deliberately easy: all estimators are centered near the truth. Random adoption under a factor structure is already harder for TWFE because untreated potential outcomes have richer unit-specific time paths; SC and SDID use the pre-period path and are tighter.
The selected-adoption designs are the main diagnostic. Once treatment is selected on latent factors, TWFE inherits the wrong counterfactual trend and drifts upward. SC and SDID are not magic: the strong nonlinear-selection DGP is hard for them too. But they degrade more gracefully because they explicitly match or reweight pre-treatment panel paths. In these settings the current matrix-completion default is more fragile than SDID; that is useful input for tuning and API examples, not a final verdict on the estimator class.