make_fully_hetereogenous_dataset

extensions.synthetic_data.make_fully_hetereogenous_dataset(n_obs=1000, n_confounders=5, ate=4.0, seed=None, **doubleml_kwargs)

Generate data from an “interactive regression” model data generating process with fully heterogeneous treatment effects. The outcome is continuous and the treatment is binary. The dataset is generated using the make_confounded_irm_data function from the doubleml package. The additional “unobserved” confounder A is set to zero for all observations, since confounding is captured in X.

The general form of the data generating process is:

\[ Y_i= g(D_i,\mathbf{X_i})+\epsilon_i \] \[ D_i=f(\mathbf{X_i})+\eta_i \]

where \(Y_i\) is the outcome, \(D_i\) is the treatment, \(\mathbf{X_i}\) are the covariates, \(\epsilon_i\) and \(\eta_i\) are the error terms, \(g\) is the outcome function, and \(f\) is the treatment function.

Note that the treatment effect is fully heterogeneous, so the CATE is a function of the covariates: \(\tau(\mathbf{x}) = \mathbb{E}[g(1,\mathbf{X}) - g(0,\mathbf{X})|\mathbf{X}=\mathbf{x}]\) for any covariate value \(\mathbf{x}\).

The ATE is defined as the average of the CATE function over the covariates: \(\mathbb{E}[\tau(\mathbf{X})]\).

See the doubleml documentation for more details on the specific functional forms of the data generating process.
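To make the general form above concrete, here is a minimal NumPy sketch of a data generating process with the same structure. Everything in it is an assumption made for illustration: the function name toy_heterogeneous_irm, the linear forms chosen for \(f\), \(g\), and \(\tau\), and the column names X1, X2, ... are hypothetical and are not the functional forms used by doubleml.

```python
import numpy as np
import pandas as pd


def toy_heterogeneous_irm(n_obs=1000, n_confounders=5, seed=None):
    """Toy DGP with Y = g(D, X) + eps and a binary D driven by f(X).

    The functional forms below are illustrative assumptions only; they are
    not the ones used by doubleml's make_confounded_irm_data.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_obs, n_confounders))

    # Treatment: D = 1{f(X) + eta > 0}, so X shifts the chance of treatment.
    f_x = 0.8 * X[:, 0]
    eta = rng.logistic(size=n_obs)
    d = (f_x + eta > 0).astype(int)

    # CATE tau(X) varies with the covariates (heterogeneous effect).
    tau = 4.0 + 2.0 * X[:, 0]

    # Outcome: g(D, X) = baseline(X) + D * tau(X), plus noise eps.
    baseline = X.sum(axis=1)
    eps = rng.normal(size=n_obs)
    y = baseline + d * tau + eps

    columns = [f"X{j + 1}" for j in range(n_confounders)]
    df = pd.DataFrame(X, columns=columns)
    df.insert(0, "d", d)
    df.insert(0, "y", y)

    # The sample ATE is the average of the per-observation CATEs.
    return df, tau, float(tau.mean())


df, cates, ate = toy_heterogeneous_irm(n_obs=2000, seed=0)
print(df.head())
print(f"mean of the toy CATEs: {ate:.2f}")
```

In this toy version the first covariate enters both the treatment rule and the outcome, which is what makes it a confounder, and the treatment effect differs across observations because \(\tau\) depends on that same covariate.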

As a DAG, the data generating process can be roughly represented as:

flowchart TD;
    X((X))-->D((D));
    X((X))-->Y((Y));
    D((D))-->|"τ(X)"|Y((Y));
    linkStyle 0,1,2 stroke:black,stroke-width:2px

Parameters

| Name | Type | Description | Default |
|------|------|-------------|---------|
| n_obs | int | The number of observations to generate. Default is 1000. | 1000 |
| n_confounders | int | The number of confounders to generate. Default is 5. | 5 |
| ate | float | The average treatment effect. Default is 4.0. | 4.0 |
| seed | int \| None | The seed to use for the random number generator. Default is None. | None |
| **doubleml_kwargs | | Additional keyword arguments to pass to the data generating process. | {} |

Returns

| Type | Description |
|------|-------------|
| pd.DataFrame | The generated dataset where y is the outcome, d is the treatment, and X are the covariates. |
| pd.DataFrame | The true conditional average treatment effects. |
| float | The true average treatment effect. |
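Since the ATE is the average of the CATE function, one plausible sanity check on the returned values is that the mean of the true CATEs sits close to the true ATE. The sketch below assumes only the documented return types; the helper name ate_gap and the column name cate are hypothetical, and the stand-in values are not output of the real function. Whether the two agree exactly or only approximately depends on how the library computes the returned ATE, which is not stated here.

```python
import pandas as pd


def ate_gap(true_cates: pd.DataFrame, true_ate: float) -> float:
    """Absolute gap between the mean of the true CATEs and the true ATE."""
    return float(abs(true_cates.to_numpy().mean() - true_ate))


# Stand-in values with the documented types (not output of the real function).
demo_cates = pd.DataFrame({"cate": [3.0, 4.0, 5.0]})  # column name is hypothetical
demo_ate = 4.0
print(ate_gap(demo_cates, demo_ate))
```

With real output, the call would be ate_gap(true_cates, true_ate), using the second and third return values.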

Examples

>>> from caml.extensions.synthetic_data import make_fully_hetereogenous_dataset
>>> df, true_cates, true_ate = make_fully_hetereogenous_dataset(n_obs=1000, n_confounders=5, ate=4.0, seed=1)
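Because X influences both d and y, a raw comparison of treated and untreated outcome means will generally not recover the true ATE. The following sketch shows such a naive estimate so it can be compared against true_ate. It assumes the outcome and treatment columns are named y and d, as the Returns description suggests; the helper name naive_difference_in_means is hypothetical, and the small frame is a hand-made stand-in rather than generated data.

```python
import pandas as pd


def naive_difference_in_means(df: pd.DataFrame, outcome: str = "y",
                              treatment: str = "d") -> float:
    """Mean outcome of treated rows minus mean outcome of untreated rows.

    This ignores the covariates entirely, so it is biased under confounding.
    """
    treated = df.loc[df[treatment] == 1, outcome]
    control = df.loc[df[treatment] == 0, outcome]
    return float(treated.mean() - control.mean())


# Hand-made stand-in with the assumed column names (not generated data).
demo = pd.DataFrame({"y": [1.0, 2.0, 6.0, 7.0], "d": [0, 0, 1, 1]})
print(naive_difference_in_means(demo))
```

Applied to the df returned above, the gap between naive_difference_in_means(df) and true_ate gives a rough sense of how much confounding an estimator needs to adjust for.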