flowchart TD; X((X))-->d((d)); X((X))-->y((y)); d((d))-->|"τ(X)"|y((y)); linkStyle 0,1,2 stroke:black,stroke-width:2px
make_fully_heterogeneous_dataset
extensions.synthetic_data.make_fully_heterogeneous_dataset(n_obs=1000, n_confounders=5, theta=4.0, seed=None, **doubleml_kwargs)
Simulate data from an interactive regression model with fully heterogeneous treatment effects. The outcome is continuous and the treatment is binary. The dataset is generated using a modified version of the make_irm_data function from the doubleml package.
The general form of the data generating process is:
\[ y_i= g(d_i,\mathbf{X_i})+\epsilon_i \] \[ d_i=f(\mathbf{X_i})+\eta_i \]
where \(y_i\) is the outcome, \(d_i\) is the treatment, \(\mathbf{X_i}\) are the confounders utilized for effect heterogeneity, \(\epsilon_i\) and \(\eta_i\) are the error terms, \(g\) is the outcome function, and \(f\) is the treatment function.
See the doubleml documentation for more details on the specific functional forms of the data generating process.
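To make the general form above concrete, here is a minimal NumPy sketch of a data generating process with the same shape: a binary treatment whose assignment depends on \(\mathbf{X_i}\), and a continuous outcome whose treatment effect varies with every confounder. The function name `simulate_toy_irm` and all functional forms are hypothetical choices for illustration; they are not the forms used by doubleml's make_irm_data or by this function.

```python
import numpy as np


def simulate_toy_irm(n_obs=1000, n_confounders=5, theta=4.0, seed=None):
    """Toy DGP: binary treatment, continuous outcome, CATE varying with X.

    All functional forms here are illustrative assumptions, not the ones
    used by doubleml's make_irm_data.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_obs, n_confounders))

    # Treatment: assignment probability depends on X (assumed logistic form),
    # so X confounds the relationship between d and y.
    propensity = 1.0 / (1.0 + np.exp(-0.5 * X[:, 0]))
    d = rng.binomial(1, propensity)

    # CATE tau(X): every confounder shifts the effect (assumed additive form).
    tau = theta + X.sum(axis=1)

    # Outcome: g(d, X) = d * tau(X) + baseline(X), plus noise epsilon.
    baseline = X @ np.linspace(0.5, 1.5, n_confounders)
    y = d * tau + baseline + rng.normal(size=n_obs)

    # Return the data, the true CATEs, and their sample average.
    return X, d, y, tau, tau.mean()
```

In this sketch, \(g(1,\mathbf{X}) - g(0,\mathbf{X})\) equals `tau` by construction, which is why the true effects can be returned alongside the data.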
Note that the treatment effect is fully heterogeneous, thus the CATE is defined as: \(\tau(\mathbf{X}) = \mathbb{E}[g(1,\mathbf{X}) - g(0,\mathbf{X})|\mathbf{X}]\) for any \(\mathbf{X}\).
The ATE is defined as the average of the CATE function over all observations: \(\mathbb{E}[\tau (\cdot)]\).
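Written out for a sample of \(n\) observations, the average described above takes the following form. This is a worked restatement of the definition, assuming the ATE is computed as a simple unweighted mean of the per-observation CATEs; the index \(n\) is introduced here for illustration.

\[ \text{ATE} = \mathbb{E}[\tau(\mathbf{X})] \approx \frac{1}{n}\sum_{i=1}^{n} \tau(\mathbf{X_i}) \]

Because \(\tau(\mathbf{X_i})\) varies across observations, this average generally does not coincide with the base parameter theta.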
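As a DAG, the data generating process can be roughly represented as \(\mathbf{X} \rightarrow d\), \(\mathbf{X} \rightarrow y\), and \(d \rightarrow y\), where the effect of \(d\) on \(y\) is \(\tau(\mathbf{X})\).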
Parameters
Name | Type | Description | Default |
---|---|---|---|
n_obs | int | The number of observations to generate. | 1000 |
n_confounders | int | The number of confounders \(\mathbf{X_i}\) to generate (these are utilized fully for heterogeneity). | 5 |
theta | float | The base parameter for the treatment effect. Note this differs from the ATE. | 4.0 |
seed | int \| None | The seed to use for the random number generator. | None |
**doubleml_kwargs | | Additional keyword arguments to pass to the data generating process. | {} |
Returns
Type | Description |
---|---|
pandas.DataFrame | The generated dataset where y is the outcome, d is the treatment, and X are the confounders which are fully utilized for heterogeneity. |
numpy.ndarray | The true conditional average treatment effects. |
float | The true average treatment effect. |
Examples
>>> from caml.extensions.synthetic_data import make_fully_heterogeneous_dataset
>>> df, true_cates, true_ate = make_fully_heterogeneous_dataset(n_obs=1000, n_confounders=5, theta=4.0, seed=1)
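The returned true CATEs and true ATE are typically useful as a benchmark for an estimator. The sketch below shows one way that comparison might look. The helpers `cate_rmse` and `ate_abs_error` are hypothetical and not part of caml, and the arrays are small stand-ins rather than real output from the function.

```python
import numpy as np


def cate_rmse(estimated_cates, true_cates):
    """Root mean squared error between estimated and true CATEs."""
    est = np.asarray(estimated_cates, dtype=float).ravel()
    tru = np.asarray(true_cates, dtype=float).ravel()
    if est.shape != tru.shape:
        raise ValueError("estimated and true CATEs must have the same length")
    return float(np.sqrt(np.mean((est - tru) ** 2)))


def ate_abs_error(estimated_ate, true_ate):
    """Absolute error between an estimated ATE and the true ATE."""
    return abs(float(estimated_ate) - float(true_ate))


# Stand-in values; in practice these would be the true_cates and true_ate
# returned by the function and the predictions of some CATE estimator.
true_cates = np.array([1.0, 2.0, 3.0, 4.0])
true_ate = float(true_cates.mean())
estimated_cates = true_cates + 0.5  # an estimator biased upward by 0.5

rmse = cate_rmse(estimated_cates, true_cates)
ate_error = ate_abs_error(estimated_cates.mean(), true_ate)
```

Flattening with `ravel()` is a defensive choice so the helper accepts both one-dimensional arrays and column vectors of shape `(n, 1)`.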