Balancing Weights

Entropy and quadratic calibration on Kang-Schafer and Hainmueller DGPs

BalancingWeights solves the mean-balancing problem

\[ \sum_{i \in \mathcal{C}} w_i = 1, \qquad \sum_{i \in \mathcal{C}} w_i X_i \approx \bar X_{\mathcal{T}}, \]

while keeping the control weights close to baseline weights (typically uniform) under either an entropy or a quadratic objective. In causal-inference terms, the common ATT use case is:

\[ \widehat{\tau}_{ATT} = \bar Y_{\mathcal{T}} - \sum_{i \in \mathcal{C}} w_i Y_i. \]

The examples below use two stress tests:

  • Kang-Schafer, where the observed covariates are nonlinear transformations of latent Gaussian drivers.
  • A Hainmueller-style simulation, where overlap and functional-form difficulty can be dialed up or down.

Literature Map

BalancingWeights sits in a tight cluster of weighting estimators that differ more by parameterization than by the balance conditions they impose.

  • Hainmueller (2012) formulates entropy balancing as a convex calibration problem: choose positive control weights that exactly match treated covariate moments while staying close to baseline weights.
  • Graham, de Xavier Pinto, and Egel (2012) introduce inverse probability tilting, which estimates a logit index from moment conditions chosen so that the implied weights satisfy exact sample balance.
  • Imai and Ratkovic (2014) recast the same balance-first logic as a propensity-score GMM / empirical-likelihood estimator.
  • Graham, Pinto, and Egel (2016) extend the same tilting geometry to data-combination problems by introducing separate study and auxiliary tilts.
  • Zhao and Percival (2017) make the dual interpretation explicit from the entropy-balancing side: entropy balancing behaves like a logistic propensity-score fit with a different loss.

For the ATT problem, the cleanest exact equivalence is between entropy balancing and just-identified logit CBPS. Graham’s tilting estimators belong to the same family, but full auxiliary-to-study tilting (AST) adds an extra layer of tilting beyond that baseline case.

Entropy Balancing, CBPS, and Tilting

Let \(b_i = b(X_i)\) be the balance basis supplied by the user, and let \(q_i > 0\) denote baseline control weights. BalancingWeights.fit(...) does not require an intercept column in b_i; it handles the sum-to-one constraint separately. For the ATT, entropy balancing solves

\[ \min_{\{w_i\}_{i \in \mathcal{C}}} \sum_{i \in \mathcal{C}} w_i \log\!\left(\frac{w_i}{q_i}\right) \quad \text{subject to} \quad \sum_{i \in \mathcal{C}} w_i = 1, \qquad \sum_{i \in \mathcal{C}} w_i b_i = \bar b_{\mathcal{T}}. \]

The Lagrangian first-order conditions imply a log-linear dual solution

\[ w_i(\lambda) = \frac{q_i \exp(\lambda^\top b_i)} {\sum_{j \in \mathcal{C}} q_j \exp(\lambda^\top b_j)}, \]

after the usual sign relabeling of the multipliers. This is the core Hainmueller result: entropy balancing chooses the multiplier \(\lambda\) so that the tilted control distribution exactly matches the treated moments.
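
The dual is easy to verify numerically. The standalone NumPy sketch below (synthetic data; it does not call BalancingWeights) runs Newton's method on the dual: the gradient is the weighted control mean minus the treated target and the Hessian is the weighted covariance of the basis, so exact mean balance falls out at convergence.

```python
import numpy as np

rng = np.random.default_rng(0)
n_c, n_t, k = 300, 150, 3
b_c = rng.normal(size=(n_c, k))            # control balance basis
b_t = rng.normal(0.3, 1.0, size=(n_t, k))  # treated basis with shifted means
target = b_t.mean(axis=0)
q = np.full(n_c, 1.0 / n_c)                # uniform baseline weights

# Newton's method on the convex dual: at lambda, the gradient is the
# tilted control mean minus the treated target, and the Hessian is the
# tilted covariance of the basis (positive definite).
lam = np.zeros(k)
for _ in range(50):
    w = q * np.exp(b_c @ lam)
    w /= w.sum()                           # the log-linear weight formula
    grad = w @ b_c - target                # current balance error
    if np.max(np.abs(grad)) < 1e-12:
        break
    centered = b_c - w @ b_c
    hess = (w[:, None] * centered).T @ centered
    lam -= np.linalg.solve(hess, grad)

print(np.max(np.abs(w @ b_c - target)))    # ~0: exact mean balance
print(w.min())                             # strictly positive: entropy weights never hit zero
```

The balance error is literally the dual gradient, which is why entropy-balancing solvers can report an exact "max balance error" diagnostic at convergence.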

Now write the ATT-CBPS balance equations with an augmented basis \(c_i = (1, b_i^\top)^\top\) and a logit propensity score \(p_i = \Lambda(\beta^\top c_i)\):

\[ \frac{1}{n} \sum_{i=1}^n \left[ D_i c_i - (1-D_i)\frac{p_i}{1-p_i} c_i \right] = 0. \]

Because the logit odds ratio satisfies

\[ \frac{p_i}{1-p_i} = \exp(\beta^\top c_i), \]

the implied unnormalized control weights are

\[ \tilde w_i(\beta) = (1-D_i)\exp(\beta^\top c_i). \]

The intercept moment pins down their total mass:

\[ \sum_{i \in \mathcal{C}} \tilde w_i(\beta) = \sum_{i=1}^n D_i = n_{\mathcal{T}}. \]

Dividing by \(n_{\mathcal{T}}\) yields normalized control weights

\[ w_i^{\mathrm{CBPS}} = \frac{\exp(\beta^\top c_i)} {\sum_{j \in \mathcal{C}} \exp(\beta^\top c_j)}, \qquad i \in \mathcal{C}, \]

and the remaining moments become

\[ \sum_{i \in \mathcal{C}} w_i^{\mathrm{CBPS}} c_i = \bar c_{\mathcal{T}}. \]

So with the same balance basis \(c(X)\), an intercept, and uniform baseline weights \(q_i\), ATT entropy balancing and just-identified logit CBPS deliver the same normalized weights. The practical difference is mostly primal versus dual parameterization: entropy balancing solves directly for calibration weights or multipliers, while CBPS solves for propensity-score coefficients whose logit odds generate those same weights.
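
The equivalence can be checked numerically. The standalone NumPy sketch below (synthetic data; it does not call BalancingWeights) solves the entropy-balancing dual and the just-identified logit CBPS moment conditions with Newton's method and confirms that the two normalized control weight vectors agree to numerical precision.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
x = rng.normal(size=(n, 2))
d = rng.binomial(1, 1.0 / (1.0 + np.exp(-(x[:, 0] - 0.5 * x[:, 1]))))
b_c = x[d == 0]
target = x[d == 1].mean(axis=0)

# Entropy balancing: Newton on the dual with uniform baseline weights.
lam = np.zeros(2)
for _ in range(100):
    w_ent = np.exp(b_c @ lam)
    w_ent /= w_ent.sum()
    grad = w_ent @ b_c - target
    if np.max(np.abs(grad)) < 1e-12:
        break
    centered = b_c - w_ent @ b_c
    hess = (w_ent[:, None] * centered).T @ centered
    lam -= np.linalg.solve(hess, grad)

# Just-identified logit CBPS: solve the ATT balance moments for beta
# on the augmented basis c = (1, x), using p/(1-p) = exp(beta'c).
c = np.column_stack([np.ones(n), x])
beta = np.zeros(3)
for _ in range(100):
    odds = np.exp(c @ beta)
    g = ((d - (1 - d) * odds)[:, None] * c).mean(axis=0)
    if np.max(np.abs(g)) < 1e-12:
        break
    J = -(c.T @ (((1 - d) * odds)[:, None] * c)) / n
    step = np.linalg.solve(J, g)
    beta -= step * min(1.0, 2.0 / np.linalg.norm(step))  # damp large steps

w_cbps = np.exp(c[d == 0] @ beta)
w_cbps /= w_cbps.sum()

print(np.max(np.abs(w_ent - w_cbps)))  # the two weight vectors coincide
```

The intercept coefficient cancels in the normalization, which is exactly why the augmented-basis CBPS solution and the intercept-free entropy dual land on the same weights.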

Graham’s inverse probability tilting step is the same balance-first idea in another parameterization. The 2012 IPT moments choose a logit index so the implied weights satisfy exact balance moments, which in the ATT specialization again produces inverse-odds weights proportional to \(\exp(\beta^\top c_i)\). AST adds a second layer of tilting on top of that baseline. Writing \(\hat p_i = \Lambda(r_i^\top \hat \delta)\), the auxiliary weights take the form

\[ \hat \pi_i^a \propto (1-D_i)\frac{\hat p_i} {1 - \Lambda(r_i^\top \hat \delta + t_i^\top \hat \lambda_a)}. \]

When the extra auxiliary tilt is unnecessary, so that \(\hat \lambda_a = 0\), this collapses to

\[ \hat \pi_i^a \propto (1-D_i)\frac{\hat p_i}{1-\hat p_i} = (1-D_i)\exp(r_i^\top \hat \delta), \]

which is exactly the same inverse-odds / entropy-balancing weight formula. With nontrivial study or auxiliary tilts, AST is a strict generalization rather than literally the same estimator.

Quadratic balancing in this library keeps the same balance constraints and changes only the distance penalty. It therefore targets the same sample moments as entropy balancing, but it does not imply the same log-linear weight formula.
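
With only equality constraints, the quadratic objective has a closed form. The standalone NumPy sketch below (hypothetical data; it does not call BalancingWeights, whose quadratic solver can also put controls at a lower bound) solves \(\min_w \sum_i (w_i - q_i)^2\) subject to the sum-to-one and mean-balance constraints through the KKT linear system. Under a large mean shift, some weights go negative, which is why a lower bound matters in practice.

```python
import numpy as np

rng = np.random.default_rng(2)
n_c, k = 150, 3
b_c = rng.normal(size=(n_c, k))        # control balance basis
target = np.full(k, 1.0)               # hypothetical treated means, far from 0
q = np.full(n_c, 1.0 / n_c)            # uniform baseline weights

# Equality-constrained least squares: min ||w - q||^2  s.t.  A w = rhs.
# Stationarity gives 2(w - q) + A^T mu = 0, so the KKT system is
# [2I  A^T; A  0] [w; mu] = [2q; rhs].
A = np.vstack([np.ones(n_c), b_c.T])   # sum-to-one row plus k balance rows
rhs = np.concatenate([[1.0], target])
kkt = np.block([
    [2.0 * np.eye(n_c), A.T],
    [A, np.zeros((k + 1, k + 1))],
])
sol = np.linalg.solve(kkt, np.concatenate([2.0 * q, rhs]))
w = sol[:n_c]

print(np.max(np.abs(A @ w - rhs)))  # ~0: same exact balance conditions as entropy
print(w.min())                      # can be negative: quadratic weights are not sign-constrained
```

One linear solve replaces the iterative dual fit, which is the computational appeal of the quadratic objective; the price is that positivity is no longer automatic.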

The rest of the page uses small helper functions for the simulations and diagnostics.

from html import escape

import matplotlib.pyplot as plt
import numpy as np
from IPython.display import HTML, display

import crabbymetrics as cm


def html_table(headers, rows):
    parts = [
        "<table>",
        "<thead>",
        "<tr>",
        *[f"<th>{escape(str(header))}</th>" for header in headers],
        "</tr>",
        "</thead>",
        "<tbody>",
    ]
    for row in rows:
        parts.append("<tr>")
        for cell in row:
            parts.append(f"<td>{escape(str(cell))}</td>")
        parts.append("</tr>")
    parts.extend(["</tbody>", "</table>"])
    return "".join(parts)


def expit(x):
    return 1.0 / (1.0 + np.exp(-x))


def kang_schafer_dgp(n, rng):
    z = rng.normal(size=(n, 4))
    z1, z2, z3, z4 = z.T

    propensity = expit(-z1 + 0.5 * z2 - 0.25 * z3 - 0.1 * z4)
    d = rng.binomial(1, propensity)

    y0 = 210.0 + 27.4 * z1 + 13.7 * (z2 + z3 + z4) + rng.normal(size=n)
    y = y0

    x = np.column_stack(
        [
            np.exp(z1 / 2.0),
            z2 / (1.0 + np.exp(z1)) + 10.0,
            (z1 * z3 / 25.0 + 0.6) ** 3,
            (z2 + z4 + 20.0) ** 2,
        ]
    )
    return y, d, x, z


def hainmueller_dgp(
    n,
    rng,
    overlap_design=1,
    pscore_design=1,
    outcome_design=1,
):
    mean = np.zeros(3)
    cov = np.array([[2.0, 1.0, -1.0], [1.0, 1.0, -0.5], [-1.0, -0.5, 1.0]])
    x1, x2, x3 = rng.multivariate_normal(mean=mean, cov=cov, size=n).T
    x4 = rng.uniform(-3.0, 3.0, size=n)
    x5 = rng.chisquare(df=1.0, size=n)
    x6 = rng.binomial(1, 0.5, size=n)
    x = np.column_stack([x1, x2, x3, x4, x5, x6])

    if overlap_design == 1:
        epsilon = rng.normal(0.0, np.sqrt(30.0), size=n)
    elif overlap_design == 2:
        epsilon = rng.normal(0.0, 10.0, size=n)
    elif overlap_design == 3:
        epsilon = rng.chisquare(df=5.0, size=n)
        epsilon = (epsilon - 5.0) / np.sqrt(10.0) * np.sqrt(67.6) + 0.5
    else:
        raise ValueError("unknown overlap_design")

    if pscore_design == 1:
        base_term = x1 + 2.0 * x2 - 2.0 * x3 - x4 - 0.5 * x5 + x6
    elif pscore_design == 2:
        base_term = x1 + x1**2 - x4 * x6
    elif pscore_design == 3:
        base_term = 2.0 * np.cos(x1) + np.sin(np.pi * x2)
    else:
        raise ValueError("unknown pscore_design")

    d = (base_term + epsilon > 0.0).astype(int)

    eta = rng.normal(0.0, 1.0, size=n)
    if outcome_design == 1:
        y = x1 + x2 + x3 - x4 + x5 + x6 + eta
    elif outcome_design == 2:
        y = x1 + x2 + 0.2 * x3 * x4 - np.sqrt(x5) + eta
    elif outcome_design == 3:
        y = 2.0 * np.cos(x1) + np.sin(np.pi * x2) + (x1 + x2 + x5) ** 2 + eta
    else:
        raise ValueError("unknown outcome_design")

    return y, d, x


def fit_att_balancing(y, d, x, objective):
    treated = d == 1
    control = ~treated
    model = cm.BalancingWeights(
        objective=objective,
        solver="auto",
        autoscale=True,
        max_iterations=300,
        tolerance=1e-8,
    )
    model.fit(x[control], x[treated])
    summary = model.summary()
    weights = np.asarray(summary["weights"])
    att_hat = y[treated].mean() - np.dot(weights, y[control])
    return att_hat, summary


def standardized_mean_difference(x_treated, x_control, weights=None):
    treated_mean = x_treated.mean(axis=0)
    control_mean = x_control.mean(axis=0) if weights is None else np.average(
        x_control, axis=0, weights=weights
    )
    treated_var = x_treated.var(axis=0)
    control_var = x_control.var(axis=0) if weights is None else np.average(
        (x_control - control_mean) ** 2, axis=0, weights=weights
    )
    pooled = np.sqrt(0.5 * (treated_var + control_var))
    pooled = np.where(pooled > 1e-12, pooled, 1.0)
    return (treated_mean - control_mean) / pooled


def evaluate_single_dataset():
    rng = np.random.default_rng(123)
    y, d, x, z = kang_schafer_dgp(2000, rng)
    treated = d == 1
    control = ~treated

    naive_att = y[treated].mean() - y[control].mean()
    quad_att, quad_summary = fit_att_balancing(y, d, x, "quadratic")
    ent_att, ent_summary = fit_att_balancing(y, d, x, "entropy")
    oracle_att, oracle_summary = fit_att_balancing(y, d, z, "entropy")

    smd_before = standardized_mean_difference(x[treated], x[control])
    smd_quad = standardized_mean_difference(
        x[treated], x[control], weights=np.asarray(quad_summary["weights"])
    )
    smd_ent = standardized_mean_difference(
        x[treated], x[control], weights=np.asarray(ent_summary["weights"])
    )

    rows = [
        ["Naive difference", f"{naive_att: .3f}", "--", "--", "--"],
        [
            "Quadratic balancing on observed X",
            f"{quad_att: .3f}",
            quad_summary["success"],
            f"{quad_summary['effective_sample_size']: .1f}",
            f"{quad_summary['max_abs_diff']: .2e}",
        ],
        [
            "Entropy balancing on observed X",
            f"{ent_att: .3f}",
            ent_summary["success"],
            f"{ent_summary['effective_sample_size']: .1f}",
            f"{ent_summary['max_abs_diff']: .2e}",
        ],
        [
            "Entropy balancing on latent Z (oracle)",
            f"{oracle_att: .3f}",
            oracle_summary["success"],
            f"{oracle_summary['effective_sample_size']: .1f}",
            f"{oracle_summary['max_abs_diff']: .2e}",
        ],
    ]
    display(HTML(html_table(["Estimator", "ATT estimate", "Success", "ESS", "Max balance error"], rows)))

    labels = [f"x{j + 1}" for j in range(x.shape[1])]
    fig, ax = plt.subplots(figsize=(8, 4))
    xpos = np.arange(len(labels))
    width = 0.25
    ax.bar(xpos - width, np.abs(smd_before), width=width, label="Unweighted")
    ax.bar(xpos, np.abs(smd_quad), width=width, label="Quadratic")
    ax.bar(xpos + width, np.abs(smd_ent), width=width, label="Entropy")
    ax.axhline(0.1, color="black", linestyle="--", linewidth=1.0)
    ax.set_xticks(xpos)
    ax.set_xticklabels(labels)
    ax.set_ylabel("Absolute standardized mean difference")
    ax.set_title("Kang-Schafer: single-dataset balance on observed transformed covariates")
    ax.legend()
    fig.tight_layout()

    return {
        "naive": naive_att,
        "quadratic": quad_att,
        "entropy": ent_att,
        "oracle": oracle_att,
    }

Kang-Schafer

Kang-Schafer is useful here because balancing is asked to work on the transformed covariates \(X\), not the latent Gaussian drivers \(Z\) that generated treatment and outcomes. The true ATT is zero, so the gap between the estimator and zero is pure bias.

On this single draw, both balancing estimators substantially reduce the raw covariate imbalance. The Max balance error column is the largest absolute difference between the weighted control mean and treated mean in the fitted balance basis. The oracle version that balances on latent \(Z\) shows the benchmark we would like to approach, but it uses variables that are not observed in the real Kang-Schafer problem.

single_dataset = evaluate_single_dataset()
Estimator                                 ATT estimate   Success   ESS     Max balance error
Naive difference                          -20.704        --        --      --
Quadratic balancing on observed X          -6.252        True      469.3   1.14e-12
Entropy balancing on observed X            -4.627        True      406.7   4.66e-08
Entropy balancing on latent Z (oracle)      0.001        True      337.3   6.49e-10

def run_kang_schafer_panel(n_rep=80, n=1000, seed=2026):
    rng = np.random.default_rng(seed)
    rows = []
    for rep in range(n_rep):
        y, d, x, z = kang_schafer_dgp(n, rng)
        naive = y[d == 1].mean() - y[d == 0].mean()
        quad, quad_summary = fit_att_balancing(y, d, x, "quadratic")
        ent, ent_summary = fit_att_balancing(y, d, x, "entropy")
        oracle, oracle_summary = fit_att_balancing(y, d, z, "entropy")

        rows.append(("Naive", naive, True))
        rows.append(("Quadratic on observed X", quad, bool(quad_summary["success"])))
        rows.append(("Entropy on observed X", ent, bool(ent_summary["success"])))
        rows.append(("Entropy on latent Z", oracle, bool(oracle_summary["success"])))
    return rows


def summarize_rows(rows, truth=0.0):
    method_order = [
        "Naive",
        "Quadratic on observed X",
        "Entropy on observed X",
        "Entropy on latent Z",
    ]
    out = []
    for method in method_order:
        values = np.array([row[1] for row in rows if row[0] == method], dtype=float)
        successes = np.array([row[2] for row in rows if row[0] == method], dtype=bool)
        out.append(
            [
                method,
                f"{values.mean(): .3f}",
                f"{(values.mean() - truth): .3f}",
                f"{np.sqrt(np.mean((values - truth) ** 2)): .3f}",
                f"{successes.mean(): .3f}",
            ]
        )
    return out


kang_rows = run_kang_schafer_panel()
display(HTML(html_table(["Method", "Mean Estimate", "Bias", "RMSE", "Success Rate"], summarize_rows(kang_rows))))
Method                     Mean Estimate   Bias      RMSE     Success Rate
Naive                      -20.383         -20.383   20.511   1.000
Quadratic on observed X     -6.217          -6.217    6.353   1.000
Entropy on observed X       -4.402          -4.402    4.544   1.000
Entropy on latent Z          0.027           0.027    0.095   1.000

The observed-\(X\) versions still live inside the canonical misspecification problem, so they do not become oracle estimators just by balancing means. But they do move sharply toward zero relative to the naive treated-control difference.

Hainmueller

The Hainmueller-style design below keeps the unit-level treatment effect at zero while varying overlap and the difficulty of the treatment and outcome models. The harder setting deliberately uses nonlinear assignment and outcome functions, but the outcome remains finite and stable enough that the RMSE table is still interpretable.

def run_hainmueller_panel(setting_name, overlap_design, pscore_design, outcome_design, n_rep=50, n=1500, seed=3030):
    rng = np.random.default_rng(seed)
    rows = []
    for rep in range(n_rep):
        y, d, x = hainmueller_dgp(
            n=n,
            rng=rng,
            overlap_design=overlap_design,
            pscore_design=pscore_design,
            outcome_design=outcome_design,
        )
        naive = y[d == 1].mean() - y[d == 0].mean()
        quad, quad_summary = fit_att_balancing(y, d, x, "quadratic")
        ent, ent_summary = fit_att_balancing(y, d, x, "entropy")

        rows.append((setting_name, "Naive", naive, True))
        rows.append((setting_name, "Quadratic", quad, bool(quad_summary["success"])))
        rows.append((setting_name, "Entropy", ent, bool(ent_summary["success"])))
    return rows


def summarize_hainmueller(rows, truth=0.0):
    settings = sorted({row[0] for row in rows})
    method_order = ["Naive", "Quadratic", "Entropy"]
    out = []
    for setting in settings:
        for method in method_order:
            values = np.array([row[2] for row in rows if row[0] == setting and row[1] == method], dtype=float)
            successes = np.array([row[3] for row in rows if row[0] == setting and row[1] == method], dtype=bool)
            out.append(
                [
                    setting,
                    method,
                    f"{values.mean(): .3f}",
                    f"{(values.mean() - truth): .3f}",
                    f"{np.sqrt(np.mean((values - truth) ** 2)): .3f}",
                    f"{successes.mean(): .3f}",
                ]
            )
    return out


hain_easy = run_hainmueller_panel("Easier: overlap 2 / pscore 1 / outcome 1", 2, 1, 1)
hain_hard = run_hainmueller_panel("Harder: overlap 1 / pscore 3 / outcome 3", 1, 3, 3)
hain_rows = hain_easy + hain_hard
display(
    HTML(
        html_table(
            ["Setting", "Method", "Mean Estimate", "Bias", "RMSE", "Success Rate"],
            summarize_hainmueller(hain_rows),
        )
    )
)
Setting                                     Method      Mean Estimate   Bias     RMSE    Success Rate
Easier: overlap 2 / pscore 1 / outcome 1    Naive        1.157           1.157   1.167   1.000
Easier: overlap 2 / pscore 1 / outcome 1    Quadratic   -0.001          -0.001   0.064   0.980
Easier: overlap 2 / pscore 1 / outcome 1    Entropy     -0.001          -0.001   0.063   1.000
Harder: overlap 1 / pscore 3 / outcome 3    Naive       -1.109          -1.109   1.304   1.000
Harder: overlap 1 / pscore 3 / outcome 3    Quadratic   -1.173          -1.173   1.299   1.000
Harder: overlap 1 / pscore 3 / outcome 3    Entropy     -1.197          -1.197   1.321   1.000
def rmse_by_setting(rows):
    settings = sorted({row[0] for row in rows})
    methods = ["Naive", "Quadratic", "Entropy"]
    rmse = np.zeros((len(settings), len(methods)))
    for i, setting in enumerate(settings):
        for j, method in enumerate(methods):
            values = np.array([row[2] for row in rows if row[0] == setting and row[1] == method], dtype=float)
            rmse[i, j] = np.sqrt(np.mean(values**2))
    return settings, methods, rmse


settings, methods, rmse = rmse_by_setting(hain_rows)
fig, ax = plt.subplots(figsize=(10, 4))
xpos = np.arange(len(settings))
width = 0.25
for j, method in enumerate(methods):
    ax.bar(xpos + (j - 1) * width, rmse[:, j], width=width, label=method)
ax.set_xticks(xpos)
ax.set_xticklabels(settings, rotation=10, ha="right")
ax.set_ylabel("RMSE around true ATT = 0")
ax.set_title("Hainmueller DGP: balancing weights versus the naive difference in means")
ax.legend()
fig.tight_layout()

The harder setting is included as a failure-mode check, not a victory lap. Exact mean balance is powerful when the important differences are visible in the balance basis, but it cannot force a linear mean-balance basis to recover every nonlinear response surface.

Takeaways

  • BalancingWeights is most naturally a building block. The class returns the control weights; the ATT estimate is the weighted control mean subtracted from the treated mean.
  • autoscale=True is useful on these simulation designs because the raw covariate scales can be wildly different.
  • Entropy and quadratic balancing often move together on easy designs, but their weight geometry differs: entropy keeps all weights positive, while quadratic calibration can put some controls at the lower bound.
  • Kang-Schafer remains hard when only the transformed covariates are observed. Balancing means helps, but it does not erase misspecification by itself.