causalis.scenarios.multi_unconfoundedness.dgp

Module Contents

Functions

generate_multitreatment_gamma_26

Pre-configured multi-treatment dataset with Gamma-distributed outcome.

generate_multitreatment_binary_26

Pre-configured multi-treatment dataset with Binary outcome.

generate_multitreatment_irm_26

generate_multi_dml_cx_26

The notebook simulates overlapping contact and repeat actions. This packaged DGP resolves them into a mutually exclusive one-hot treatment:

Data

multi_dml_cx_26

API

causalis.scenarios.multi_unconfoundedness.dgp.generate_multitreatment_gamma_26(n: int = 100000, seed: int = 42, include_oracle: bool = False, return_causal_data: bool = True) -> Union[pandas.DataFrame, causalis.data_contracts.multicausaldata.MultiCausalData]

Pre-configured multi-treatment dataset with Gamma-distributed outcome.

  • 3 treatment classes: d_0 (control), d_1, d_2

  • 8 confounders with realistic marginals sampled through a Gaussian copula

  • Gamma outcome with log-link confounding and heterogeneous arm effects

Examples

>>> df = generate_multitreatment_gamma_26(n=256, seed=7, return_causal_data=False)
>>> bool(df[["d_0", "d_1", "d_2"]].sum(axis=1).eq(1).all())
True
>>> {"tenure_months", "credit_utilization", "y"}.issubset(df.columns)
True

Notes

Let :math:X = (\text{tenure}, \text{sessions}, \text{spend}, \text{premium}, \text{urban}, \text{tickets}, \text{discount}, \text{credit}) denote the 8 observed confounders. The treatment assignment mechanism is a multinomial logit with calibrated marginal arm rates near :math:(0.50, 0.25, 0.25):

.. math::

s_k(X) = \alpha_{d,k} + \beta_{d,k}^{\top} X, \qquad
\Pr(D = k \mid X) = \frac{\exp(s_k(X))}{\sum_{j=0}^{2} \exp(s_j(X))}.

The confounders are jointly sampled through a Toeplitz copula with :math:\mathrm{Corr}(X_i, X_j) = 0.3^{|i-j|}.
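The copula and assignment steps above can be sketched in a few lines. The correlation matrix and the softmax form follow the notes; the intercepts and slope coefficients below are illustrative stand-ins, not the calibrated values shipped in the package (the intercepts are merely chosen so marginal arm rates land near the stated (0.50, 0.25, 0.25)):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50_000, 8

# Toeplitz correlation Corr(X_i, X_j) = 0.3 ** |i - j|
corr = 0.3 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
z = rng.standard_normal((n, p)) @ np.linalg.cholesky(corr).T

# The packaged DGP would push each column through Phi and the inverse CDF
# of its realistic marginal; here we keep the correlated Gaussians.
emp = np.corrcoef(z, rowvar=False)
print(np.round(emp[0, :4], 2))  # ~ [1.0, 0.3, 0.09, 0.03]

# Multinomial-logit assignment with hypothetical coefficients
alpha = np.array([0.0, -0.7, -0.7])          # intercepts -> rates near (0.50, 0.25, 0.25)
beta = rng.normal(scale=0.1, size=(3, p))    # hypothetical slopes beta_{d,k}
scores = alpha + z @ beta.T                  # s_k(X)
probs = np.exp(scores - scores.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)    # Pr(D = k | X)

# Vectorised categorical draw: count how many cumulative thresholds u exceeds
d = (rng.random((n, 1)) > probs.cumsum(axis=1)).sum(axis=1)
```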

The outcome uses a log link. For arm :math:k,

.. math::

\log \mu_k(X) = \alpha_y + \beta_y^{\top} X + \theta_k + \tau_k(X),
\qquad
Y(k) \mid X \sim \Gamma(\text{shape}=2, \text{scale}=\mu_k(X)/2).
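The shape/scale parameterisation above makes the conditional mean exactly the log-link mean: shape × scale = 2 · μ/2 = μ. A minimal sketch, using a scalar stand-in for the linear predictor and hypothetical values of :math:\alpha_y and the slope (only :math:\theta_2 = 0.10 is taken from the scenario):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
bx = 0.3 * rng.standard_normal(n)     # stand-in for beta_y' X (hypothetical)

theta = 0.10                          # theta_2 for arm d_2 in this scenario
log_mu = 0.5 + bx + theta             # hypothetical alpha_y = 0.5
mu = np.exp(log_mu)

# Gamma(shape=2, scale=mu/2) has mean shape * scale = mu
y = rng.gamma(shape=2.0, scale=mu / 2.0)
print(round(y.mean(), 3), round(mu.mean(), 3))
```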

This scenario fixes :math:\theta = (0, -0.05, 0.10) and uses the heterogeneous shifts

.. math::

\tau_1(X) =
\min \left\{
-0.22
- 0.0010 \, \text{tenure}
- 0.006 \, \text{sessions}
- 0.05 \, \text{premium}
- 0.04 \, \text{discount}
- 0.10 \, (\text{credit} - 0.45),
-0.02
\right\},

.. math::

\tau_2(X) =
\max \left\{
0.16
+ 0.014 \, \text{sessions}
+ 0.030 \, \log(1 + \text{spend})
+ 0.06 \, \text{urban}
- 0.006 \, \text{tickets}
+ 0.12 \, (\text{credit} - 0.45),
0.02
\right\}.

So d_1 is always weakly worse than control on the log-mean scale, while d_2 is always weakly better than control.
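The sign guarantee comes from the min/max clips, which can be checked directly. The formulas are transcribed from the equations above; the covariate ranges drawn below are hypothetical and only serve to exercise the clips:

```python
import numpy as np

def tau_1(tenure, sessions, premium, discount, credit):
    # min-clip keeps tau_1 <= -0.02 (weakly worse than control)
    raw = (-0.22 - 0.0010 * tenure - 0.006 * sessions
           - 0.05 * premium - 0.04 * discount - 0.10 * (credit - 0.45))
    return np.minimum(raw, -0.02)

def tau_2(sessions, spend, urban, tickets, credit):
    # max-clip keeps tau_2 >= 0.02 (weakly better than control)
    raw = (0.16 + 0.014 * sessions + 0.030 * np.log1p(spend)
           + 0.06 * urban - 0.006 * tickets + 0.12 * (credit - 0.45))
    return np.maximum(raw, 0.02)

rng = np.random.default_rng(2)
n = 10_000
t1 = tau_1(rng.uniform(0, 120, n), rng.uniform(0, 30, n),
           rng.integers(0, 2, n), rng.integers(0, 2, n), rng.uniform(0, 1, n))
t2 = tau_2(rng.uniform(0, 30, n), rng.uniform(0, 500, n),
           rng.integers(0, 2, n), rng.integers(0, 21, n), rng.uniform(0, 1, n))
```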

causalis.scenarios.multi_unconfoundedness.dgp.generate_multitreatment_binary_26(n: int = 100000, seed: int = 42, include_oracle: bool = False, return_causal_data: bool = True) -> Union[pandas.DataFrame, causalis.data_contracts.multicausaldata.MultiCausalData]

Pre-configured multi-treatment dataset with Binary outcome.

  • 3 treatment classes: d_0 (control), d_1, d_2

  • 8 confounders with realistic marginals sampled through a Gaussian copula

  • Binary outcome with a logistic baseline and heterogeneous arm effects

Examples

>>> df = generate_multitreatment_binary_26(n=256, seed=7, return_causal_data=False)
>>> bool(df[["d_0", "d_1", "d_2"]].sum(axis=1).eq(1).all())
True
>>> {"weekly_active_days", "engagement_score", "y"}.issubset(df.columns)
True

Notes

Let :math:X = (\text{tenure}, \text{active days}, \text{income}, \text{premium}, \text{family}, \text{complaints}, \text{discount}, \text{engagement}) denote the 8 confounders. Treatment assignment again follows a calibrated multinomial logit with target arm rates near :math:(0.50, 0.25, 0.25):

.. math::

s_k(X) = \alpha_{d,k} + \beta_{d,k}^{\top} X, \qquad
\Pr(D = k \mid X) = \frac{\exp(s_k(X))}{\sum_{j=0}^{2} \exp(s_j(X))}.

The outcome uses a logistic link with intercept :math:\alpha_y = -1.1:

.. math::

\operatorname{logit}\Pr(Y(k)=1 \mid X)
= -1.1 + \beta_y^{\top} X + \theta_k + \tau_k(X).
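The logit equation above can be sketched as follows. The intercept -1.1 and the arm shift :math:\theta_2 = 0.26 come from the notes; the linear predictor and the fixed :math:\tau value are hypothetical stand-ins for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
bx = 0.4 * rng.standard_normal(n)     # stand-in for beta_y' X (hypothetical)

theta, tau = 0.26, 0.05               # theta_2 from the scenario; tau is illustrative
logit_p = -1.1 + bx + theta + tau     # alpha_y = -1.1
p = 1.0 / (1.0 + np.exp(-logit_p))    # Pr(Y(2) = 1 | X)
y = rng.binomial(1, p)
print(round(y.mean(), 3), round(p.mean(), 3))
```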

This scenario fixes :math:\theta = (0, -0.18, 0.26) and uses

.. math::

\tau_1(X) =
\min \left\{
-0.16
- 0.0008 \, \text{tenure}
- 0.020 \, \text{active days}
- 0.08 \, \text{premium}
- 0.03 \, \text{complaints}
- 0.10 \, (\text{engagement} - 0.60),
-0.02
\right\},

.. math::

\tau_2(X) =
\max \left\{
0.14
+ 0.020 \, \text{active days}
+ 0.028 \, \log(1 + \text{income})
+ 0.05 \, \text{family}
- 0.010 \, \text{complaints}
+ 0.12 \, (\text{engagement} - 0.60),
0.02
\right\}.

The clipping keeps d_1 uniformly below control and d_2 uniformly above control on the log-odds scale, while the Gaussian copula with :math:\mathrm{Corr}(X_i, X_j) = 0.3^{|i-j|} induces cross-feature dependence.

causalis.scenarios.multi_unconfoundedness.dgp.generate_multitreatment_irm_26(n: int = 100000, seed: int = 42, include_oracle: bool = False, return_causal_data: bool = True) -> Union[pandas.DataFrame, causalis.data_contracts.multicausaldata.MultiCausalData]
causalis.scenarios.multi_unconfoundedness.dgp.generate_multi_dml_cx_26(n: int = 100000, seed: int = 42, include_oracle: bool = False, return_causal_data: bool = True) -> Union[pandas.DataFrame, causalis.data_contracts.multicausaldata.MultiCausalData]

The notebook simulates overlapping contact and repeat actions. This packaged DGP resolves them into a mutually exclusive one-hot treatment:

  • control

  • neg_contact_flg

  • error_flg

  • neg_contact_flg_error_flg

Treatment assignment matches the notebook’s independent Bernoulli contact and repeat mechanisms exactly after overlap-resolution, but is exposed through the shared multi-treatment generator so it integrates with MultiCausalData and the scenario tooling.

Examples

>>> df = generate_multi_dml_cx_26(n=256, seed=7, return_causal_data=False)
>>> treatment_cols = ["control", "neg_contact_flg", "error_flg", "neg_contact_flg_error_flg"]
>>> bool(df[treatment_cols].sum(axis=1).eq(1).all())
True
>>> {"age", "prev_apps", "csat_prev", "y"}.issubset(df.columns)
True

Notes

Write :math:a(X) for the contact logit and :math:b(X) for the repeat logit. The notebook first draws two conditionally independent Bernoulli actions,

.. math::

C \mid X \sim \operatorname{Bernoulli}(\sigma(a(X))), \qquad
R \mid X \sim \operatorname{Bernoulli}(\sigma(b(X))),

where :math:\sigma(z) = 1 / (1 + e^{-z}). In this packaged benchmark the pair :math:(C, R) is re-encoded as a one-hot treatment:

.. math::

D =
\begin{cases}
\text{control} & (C, R) = (0, 0), \\
\text{neg\_contact\_flg} & (C, R) = (1, 0), \\
\text{error\_flg} & (C, R) = (0, 1), \\
\text{neg\_contact\_flg\_error\_flg} & (C, R) = (1, 1).
\end{cases}
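The overlap-resolution above is a pure re-encoding of the pair :math:(C, R). A minimal sketch, with the two Bernoulli probabilities fixed at example values rather than driven by the notebook's logits:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10_000
p_c, p_r = 0.69, 0.45                 # example contact / repeat probabilities
c = rng.binomial(1, p_c, n)           # contact action C
r = rng.binomial(1, p_r, n)           # repeat action R

arms = ["control", "neg_contact_flg", "error_flg", "neg_contact_flg_error_flg"]
idx = c + 2 * r                       # (0,0)->0, (1,0)->1, (0,1)->2, (1,1)->3
onehot = np.eye(4, dtype=int)[idx]    # mutually exclusive one-hot treatment
```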

Let :math:p_c = \sigma(a(X)) and :math:p_r = \sigma(b(X)). Then the arm probabilities are

.. math::

\Pr(D=\text{control}\mid X) = (1-p_c)(1-p_r),

.. math::

\Pr(D=\text{neg\_contact\_flg}\mid X) = p_c (1-p_r),

.. math::

\Pr(D=\text{error\_flg}\mid X) = (1-p_c) p_r,

.. math::

\Pr(D=\text{neg\_contact\_flg\_error\_flg}\mid X) = p_c p_r.

Equivalently, this is exactly the softmax model with class scores :math:(0, a(X), b(X), a(X)+b(X)), which is why the implementation passes g_d=[None, _cx_contact_logit, _cx_repeat_logit, lambda x: _cx_contact_logit(x) + _cx_repeat_logit(x)].
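The softmax equivalence is easy to verify numerically: the normaliser of the scores :math:(0, a, b, a+b) factorises as :math:(1 + e^a)(1 + e^b), which is exactly the independent-Bernoulli normalisation. A check using the worked example values from the notes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

a, b = 0.8, -0.2                      # worked example: contact and repeat logits
p_c, p_r = sigmoid(a), sigmoid(b)

# independent-Bernoulli factorisation of the four arm probabilities
product = np.array([(1 - p_c) * (1 - p_r), p_c * (1 - p_r),
                    (1 - p_c) * p_r, p_c * p_r])

# softmax over class scores (0, a, b, a + b)
scores = np.array([0.0, a, b, a + b])
soft = np.exp(scores) / np.exp(scores).sum()

print(np.round(product, 3))
```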

The observed outcome uses a binary logit baseline :math:g_y(X) plus a class effect

.. math::

\operatorname{logit}\Pr(Y=1 \mid X, D)
= g_y(X) + \theta(D),

with :math:\theta(\text{control}) = \theta(\text{neg\_contact\_flg}) = 0 and :math:\theta(\text{error\_flg}) = \theta(\text{neg\_contact\_flg\_error\_flg}) = -0.65.

Worked overlap example: if :math:a(X)=0.8 and :math:b(X)=-0.2, then :math:p_c \approx 0.690 and :math:p_r \approx 0.450, giving arm probabilities approximately (0.170, 0.379, 0.140, 0.311) for (control, neg_contact_flg, error_flg, neg_contact_flg_error_flg).

causalis.scenarios.multi_unconfoundedness.dgp.multi_dml_cx_26

None