causalis.dgp.causaldata.functional¶
Notes¶
Use this module when you want a ready-made synthetic dataset without specifying
every structural component manually. For finer control over the DGP, instantiate
CausalDatasetGenerator directly instead.
Examples¶
>>> from causalis.dgp.causaldata.functional import generate_rct, obs_linear_effect
>>> rct = generate_rct(n=500, outcome_type="normal", return_causal_data=True)
>>> rct.outcome, rct.treatment
('y', 'd')
>>> obs = obs_linear_effect(n=500, theta=1.0, target_d_rate=0.3)
>>> {"y", "d"}.issubset(obs.columns)
True
High-level helpers for single-treatment synthetic causal datasets.
This module provides convenient wrappers around
causalis.dgp.causaldata.base.CausalDatasetGenerator for common
benchmarking setups such as classic A/B tests, observational confounding, and
CUPED-oriented examples with pre-period covariates.
Module Contents¶
Functions¶
generate_rct | Generate an RCT dataset with randomized treatment assignment.
generate_classic_rct | Generate a classic RCT dataset with three binary confounders: platform_ios, country_usa, and source_paid.
classic_rct_gamma | Generate a classic RCT dataset with three binary confounders and a gamma outcome.
obs_linear_effect | Generate an observational dataset with linear effects of confounders and a constant treatment effect.
make_cuped_tweedie | Tweedie-like DGP with mixed marginals and structured HTE. Features many zeros and a heavy right tail. Suitable for CUPED benchmarking.
generate_cuped_binary | Binary CUPED-oriented DGP with richer confounders and structured HTE.
make_gold_linear | A standard linear benchmark with moderate confounding. Based on the benchmark scenario in docs/research/dgp_benchmarking.ipynb.
API¶
- causalis.dgp.causaldata.functional.generate_rct(n: int = 20000, split: float = 0.5, random_state: Optional[int] = 42, outcome_type: str = 'binary', outcome_params: Optional[Dict] = None, confounder_specs: Optional[List[Dict[str, Any]]] = None, k: int = 0, x_sampler: Optional[Callable[[int, int, int], numpy.ndarray]] = None, add_ancillary: bool = True, deterministic_ids: bool = False, add_pre: bool = True, pre_name: str = 'y_pre', pre_corr: float = 0.7, prognostic_scale: float = 1.0, beta_y: Optional[Union[List[float], numpy.ndarray]] = None, g_y: Optional[Callable[[numpy.ndarray], numpy.ndarray]] = None, use_prognostic: Optional[bool] = None, include_oracle: bool = True, return_causal_data: bool = False) Union[pandas.DataFrame, causalis.dgp.causaldata.CausalData]¶
Generate an RCT dataset with randomized treatment assignment.
Uses CausalDatasetGenerator internally, ensuring treatment is independent of X. Specifically designed for benchmarking variance reduction techniques like CUPED.
Notes on effect scale
How outcome_params maps into the structural effect:
- outcome_type="normal": treatment shifts the mean by (mean["B"] - mean["A"]) on the outcome scale.
- outcome_type="binary": treatment shifts the log-odds by (logit(p_B) - logit(p_A)).
- outcome_type="poisson" or "gamma": treatment shifts the log-mean by log(lam_B / lam_A).
Ancillary columns (if add_ancillary=True) are generated from baseline confounders X only, avoiding outcome leakage and post-treatment adjustment issues.
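The three mappings above can be checked with a few lines of arithmetic. The following is a minimal sketch using only the standard library; the helper name structural_effect is hypothetical and not part of causalis, and it takes the two arm-level values directly instead of parsing outcome_params.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def structural_effect(outcome_type: str, a: float, b: float) -> float:
    """Hypothetical helper: effect of arm B vs. arm A on the link scale.

    `a` and `b` are the arm-level mean (normal), probability (binary),
    or rate/mean (poisson, gamma).
    """
    if outcome_type == "normal":
        return b - a                  # identity link: difference in means
    if outcome_type == "binary":
        return logit(b) - logit(a)    # logit link: difference in log-odds
    if outcome_type in ("poisson", "gamma"):
        return math.log(b / a)        # log link: log of the mean ratio
    raise ValueError(f"unknown outcome_type: {outcome_type!r}")


# Binary example from the docs: p_A = 0.10, p_B = 0.12.
effect_binary = structural_effect("binary", 0.10, 0.12)
# Gamma example: mean = shape * scale for each arm.
effect_gamma = structural_effect("gamma", 2.0 * 15.0, 2.0 * 16.5)
print(round(effect_binary, 4), round(effect_gamma, 4))
```

A two-percentage-point lift from 10% to 12% is therefore a shift of roughly 0.2 on the log-odds scale, which is the quantity the binary DGP treats as the structural effect.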
Parameters
n : int, default=20_000
    Number of samples to generate.
split : float, default=0.5
    Proportion of samples assigned to the treatment group.
random_state : int, optional
    Random seed for reproducibility.
outcome_type : {"binary", "normal", "poisson", "gamma"}, default="binary"
    Distribution family of the outcome.
outcome_params : dict, optional
    Parameters defining baseline rates/means and treatment effects, e.g. {"p": {"A": 0.1, "B": 0.12}} for binary, or {"shape": 2.0, "scale": {"A": 1.0, "B": 1.1}} for poisson/gamma.
confounder_specs : list of dict, optional
    Schema for confounder distributions.
k : int, default=0
    Number of confounders if specs not provided.
x_sampler : callable, optional
    Custom sampler for confounders.
add_ancillary : bool, default=True
    Whether to add descriptive columns like 'age', 'platform', etc.
deterministic_ids : bool, default=False
    Whether to generate deterministic user IDs.
add_pre : bool, default=True
    Whether to generate a pre-period covariate (y_pre).
pre_name : str, default="y_pre"
    Name of the pre-period covariate column.
pre_corr : float, default=0.7
    Target correlation between y_pre and the outcome Y in the control group.
prognostic_scale : float, default=1.0
    Scale of the prognostic signal derived from confounders.
include_oracle : bool, default=True
    Whether to include oracle ground-truth columns like 'cate', 'm', etc.
return_causal_data : bool, default=False
    Whether to return a CausalData object instead of a pandas.DataFrame.
Returns
pandas.DataFrame or CausalData
    Synthetic RCT dataset.
Examples
>>> from causalis.dgp.causaldata.functional import generate_rct
>>> data = generate_rct(
...     n=1000,
...     outcome_type="binary",
...     outcome_params={"p": {"A": 0.10, "B": 0.12}},
...     add_pre=True,
...     return_causal_data=True,
... )
>>> data.treatment, data.outcome
('d', 'y')
>>> "y_pre" in data.df.columns
True
>>> {"g0", "g1", "cate"}.issubset(data.df.columns)
True
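To see why pre_corr matters for CUPED benchmarking, recall that a linear CUPED adjustment leaves roughly a share 1 - rho^2 of the outcome variance, where rho is the correlation between the pre-period covariate and the outcome. The sketch below does not call causalis; it simulates a correlated (y_pre, y) pair with the standard library and applies the adjustment by hand. All variable names are illustrative.

```python
import random


def mean(values):
    return sum(values) / len(values)


def covariance(x, y):
    """Sample covariance; covariance(x, x) is the sample variance."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


rng = random.Random(0)
rho = 0.7      # plays the role of pre_corr
n = 20_000

# Build (y_pre, y) pairs with correlation rho from independent normals.
y_pre = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [rho * p + (1.0 - rho**2) ** 0.5 * rng.gauss(0.0, 1.0) for p in y_pre]

# CUPED: subtract the part of y that y_pre predicts linearly.
theta = covariance(y_pre, y) / covariance(y_pre, y_pre)
pre_mean = mean(y_pre)
y_cuped = [yi - theta * (pi - pre_mean) for yi, pi in zip(y, y_pre)]

variance_ratio = covariance(y_cuped, y_cuped) / covariance(y, y)
print(f"remaining variance share: {variance_ratio:.3f} (theory: {1 - rho**2:.2f})")
```

With the default pre_corr=0.7, about half of the outcome variance is expected to remain after adjustment, so a well-calibrated dataset should show a variance reduction close to 49% in the control group.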
- causalis.dgp.causaldata.functional.generate_classic_rct(n: int = 10000, split: float = 0.5, random_state: Optional[int] = 42, outcome_params: Optional[Dict] = None, add_pre: bool = False, beta_y: Optional[Union[List[float], numpy.ndarray]] = None, outcome_depends_on_x: bool = True, prognostic_scale: float = 1.0, pre_corr: float = 0.7, return_causal_data: bool = False, add_ancillary: bool = False, deterministic_ids: bool = False, include_oracle: bool = True, **kwargs) Union[pandas.DataFrame, causalis.dgp.causaldata.CausalData]¶
Generate a classic RCT dataset with three binary confounders: platform_ios, country_usa, and source_paid.
Parameters
n : int, default=10_000
    Number of samples to generate.
split : float, default=0.5
    Proportion of samples assigned to the treatment group.
random_state : int, optional
    Random seed for reproducibility.
outcome_params : dict, optional
    Parameters defining baseline rates/means and treatment effects, e.g. {"p": {"A": 0.1, "B": 0.15}} for binary.
add_pre : bool, default=False
    Whether to generate a pre-period covariate (y_pre).
beta_y : array-like, optional
    Linear coefficients for confounders in the outcome model.
outcome_depends_on_x : bool, default=True
    Whether to add default effects for confounders if beta_y is None.
prognostic_scale : float, default=1.0
    Scale of nonlinear prognostic signal (passed to generate_rct).
pre_corr : float, default=0.7
    Target correlation for y_pre (passed to generate_rct).
return_causal_data : bool, default=False
    Whether to return a CausalData object instead of a pandas.DataFrame.
add_ancillary : bool, default=False
    Whether to add standard ancillary columns (age, platform, etc.).
deterministic_ids : bool, default=False
    Whether to generate deterministic user IDs.
include_oracle : bool, default=True
    Whether to include oracle ground-truth columns like 'cate', 'propensity', etc.
**kwargs :
    Additional arguments passed to generate_rct.
Returns
pandas.DataFrame or CausalData
    Synthetic classic RCT dataset.
- causalis.dgp.causaldata.functional.classic_rct_gamma(n: int = 10000, split: float = 0.5, random_state: Optional[int] = 42, outcome_params: Optional[Dict] = None, add_pre: bool = False, beta_y: Optional[Union[List[float], numpy.ndarray]] = None, outcome_depends_on_x: bool = True, prognostic_scale: float = 1.0, pre_corr: float = 0.7, add_ancillary: bool = True, deterministic_ids: bool = False, include_oracle: bool = True, return_causal_data: bool = False, **kwargs) Union[pandas.DataFrame, causalis.dgp.causaldata.CausalData]¶
Generate a classic RCT dataset with three binary confounders and a gamma outcome.
The gamma outcome uses a log-mean link, so treatment effects are multiplicative on the mean scale. The default parameters are chosen to resemble a skewed real-world metric (e.g., spend or revenue).
Parameters
n : int, default=10_000
    Number of samples to generate.
split : float, default=0.5
    Proportion of samples assigned to the treatment group.
random_state : int, optional
    Random seed for reproducibility.
outcome_params : dict, optional
    Gamma parameters, e.g. {"shape": 2.0, "scale": {"A": 15.0, "B": 16.5}}. Mean = shape * scale.
add_pre : bool, default=False
    Whether to generate a pre-period covariate (y_pre).
beta_y : array-like, optional
    Linear coefficients for confounders in the log-mean outcome model.
outcome_depends_on_x : bool, default=True
    Whether to add default effects for confounders if beta_y is None.
prognostic_scale : float, default=1.0
    Scale of nonlinear prognostic signal.
pre_corr : float, default=0.7
    Target correlation for y_pre with post-outcome in control group.
add_ancillary : bool, default=True
    Whether to add standard ancillary columns (age, platform, etc.).
deterministic_ids : bool, default=False
    Whether to generate deterministic user IDs.
include_oracle : bool, default=True
    Whether to include oracle ground-truth columns like 'cate', 'propensity', etc.
return_causal_data : bool, default=False
    Whether to return a CausalData object instead of a pandas.DataFrame.
**kwargs :
    Additional arguments passed to generate_rct (e.g., pre_name, g_y, use_prognostic).
Returns
pandas.DataFrame or CausalData
    Synthetic classic RCT dataset with gamma outcome.
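As a quick sanity check on the gamma parameterization, the sketch below works through the documented example {"shape": 2.0, "scale": {"A": 15.0, "B": 16.5}} by hand. It uses random.gammavariate from the standard library, not causalis, and ignores any covariate effects, so it only illustrates the arm-level means and the multiplicative effect implied by the log-mean link.

```python
import math
import random

outcome_params = {"shape": 2.0, "scale": {"A": 15.0, "B": 16.5}}

shape = outcome_params["shape"]
mean_a = shape * outcome_params["scale"]["A"]   # mean = shape * scale
mean_b = shape * outcome_params["scale"]["B"]

ratio = mean_b / mean_a          # multiplicative effect on the mean scale
log_uplift = math.log(ratio)     # additive effect on the log-mean scale

# Empirical check for arm A: the sample mean should be close to shape * scale.
rng = random.Random(1)
draws_a = [rng.gammavariate(shape, outcome_params["scale"]["A"]) for _ in range(50_000)]
sample_mean_a = sum(draws_a) / len(draws_a)

print(f"means: A={mean_a}, B={mean_b}, ratio={ratio:.3f}, log uplift={log_uplift:.4f}")
print(f"sample mean for A: {sample_mean_a:.2f}")
```

Here a 10% larger scale in arm B translates into a 10% higher mean, regardless of the baseline level, which is what "multiplicative on the mean scale" means in practice.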
- causalis.dgp.causaldata.functional.obs_linear_effect(n: int = 10000, theta: float = 1.0, outcome_type: str = 'continuous', sigma_y: float = 1.0, target_d_rate: Optional[float] = None, confounder_specs: Optional[List[Dict[str, Any]]] = None, beta_y: Optional[numpy.ndarray] = None, beta_d: Optional[numpy.ndarray] = None, random_state: Optional[int] = 42, k: int = 0, x_sampler: Optional[Callable[[int, int, int], numpy.ndarray]] = None, include_oracle: bool = True, add_ancillary: bool = False, deterministic_ids: bool = False) pandas.DataFrame¶
Generate an observational dataset with linear effects of confounders and a constant treatment effect.
Parameters
n : int, default=10_000
    Number of samples to generate.
theta : float, default=1.0
    Constant treatment effect.
outcome_type : {"continuous", "binary", "poisson", "gamma"}, default="continuous"
    Family of the outcome distribution.
sigma_y : float, default=1.0
    Noise level for continuous outcomes.
target_d_rate : float, optional
    Target treatment prevalence (propensity mean).
confounder_specs : list of dict, optional
    Schema for confounder distributions.
beta_y : array-like, optional
    Linear coefficients for confounders in the outcome model.
beta_d : array-like, optional
    Linear coefficients for confounders in the treatment model.
random_state : int, optional
    Random seed for reproducibility.
k : int, default=0
    Number of confounders if specs not provided.
x_sampler : callable, optional
    Custom sampler for confounders.
include_oracle : bool, default=True
    Whether to include oracle ground-truth columns like 'cate', 'm', etc.
add_ancillary : bool, default=False
    If True, adds standard ancillary columns (age, platform, etc.).
deterministic_ids : bool, default=False
    If True, generates deterministic user IDs.
Returns
pandas.DataFrame
    Synthetic observational dataset.
Notes
This helper is a lightweight observational benchmark:
- treatment is not randomized unless beta_d is zero and target_d_rate forces a near-constant propensity;
- oracle columns such as m and cate are available when include_oracle=True;
- the treatment effect is constant on the structural link scale, so heterogeneity only enters through the outcome family transformation.
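The last point in the notes above is easiest to see numerically. In the sketch below, the same theta is added to a few hypothetical baseline linear predictors; on an identity link the effect stays constant, while on a logit link the effect on the probability scale depends on the baseline. The baseline values are made up for illustration and the code does not use causalis.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


theta = 1.0
baselines = [-3.0, 0.0, 1.5]   # hypothetical baseline linear predictors

# Identity link (continuous outcome): the effect equals theta for every unit.
cate_continuous = [(eta + theta) - eta for eta in baselines]

# Logit link (binary outcome): the same theta yields different probability shifts.
cate_binary = [sigmoid(eta + theta) - sigmoid(eta) for eta in baselines]

for eta, c_lin, c_bin in zip(baselines, cate_continuous, cate_binary):
    print(f"baseline={eta:+.1f}  identity-link effect={c_lin:.3f}  probability-scale effect={c_bin:.3f}")
```

Units with a baseline near zero log-odds see the largest shift in probability, while units far out in either tail see a much smaller one, even though theta is identical for all of them.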
Examples
>>> from causalis.dgp.causaldata.functional import obs_linear_effect
>>> df = obs_linear_effect(
...     n=1000,
...     theta=1.0,
...     target_d_rate=0.35,
...     k=3,
...     random_state=3141,
... )
>>> sorted(col for col in ["y", "d", "m", "cate"] if col in df.columns)
['cate', 'd', 'm', 'y']
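The target_d_rate parameter is described as the target mean propensity. One common way to hit such a target is to calibrate the intercept of a logistic treatment model by bisection; the sketch below illustrates that idea under this assumption and is not taken from the causalis implementation. The function calibrate_intercept and the beta_d values are hypothetical.

```python
import math
import random


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def calibrate_intercept(scores, target_rate, lo=-20.0, hi=20.0, iters=60):
    """Bisection for the intercept that makes the mean propensity hit target_rate.

    The mean of sigmoid(intercept + score) is increasing in the intercept,
    so halving the bracket converges to the unique solution.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        rate = sum(sigmoid(mid + s) for s in scores) / len(scores)
        if rate < target_rate:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


rng = random.Random(3141)
beta_d = [0.8, -0.5, 0.3]   # illustrative treatment-model coefficients
# Linear confounder score beta_d . x for each unit, with standard normal x.
scores = [sum(b * rng.gauss(0.0, 1.0) for b in beta_d) for _ in range(2_000)]

intercept = calibrate_intercept(scores, target_rate=0.35)
mean_propensity = sum(sigmoid(intercept + s) for s in scores) / len(scores)
print(f"intercept={intercept:.4f}, mean propensity={mean_propensity:.4f}")
```

Individual propensities still vary with the confounders after calibration; only their average is pinned to the target.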
- causalis.dgp.causaldata.functional.make_cuped_tweedie(n: int = 10000, seed: int = 42, add_pre: bool = True, pre_name: str = 'y_pre', pre_target_corr: float = 0.6, pre_spec: Optional[causalis.dgp.causaldata.preperiod.PreCorrSpec] = None, include_oracle: bool = False, return_causal_data: bool = True, theta_log: float = 0.2) Union[pandas.DataFrame, causalis.dgp.causaldata.CausalData]¶
Tweedie-like DGP with mixed marginals and structured HTE. Features many zeros and a heavy right tail. Suitable for CUPED benchmarking.
Parameters
n : int, default=10000
    Number of samples to generate.
seed : int, default=42
    Random seed.
add_pre : bool, default=True
    Whether to add a pre-period covariate 'y_pre'.
pre_name : str, default="y_pre"
    Name of the pre-period covariate column.
pre_target_corr : float, default=0.6
    Target correlation between y_pre and post-outcome y in control group.
pre_spec : PreCorrSpec, optional
    Detailed specification for pre-period calibration (transform, method, etc.). If provided, pre_target_corr is ignored in favor of pre_spec.target_corr.
include_oracle : bool, default=False
    Whether to include oracle ground-truth columns like 'cate', 'propensity', etc.
return_causal_data : bool, default=True
    Whether to return a CausalData object.
theta_log : float, default=0.2
    The log-uplift theta parameter for the treatment effect.
Returns
pd.DataFrame or CausalData
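For intuition about what "many zeros and a heavy right tail" looks like, the sketch below draws from a compound Poisson-gamma distribution, which is one standard way to obtain Tweedie-type outcomes. It is an illustration only: the parameter values are made up, the way theta_log is applied (rescaling the gamma part so the mean is multiplied by exp(theta_log)) is an assumption, and none of this reflects the actual internals of make_cuped_tweedie.

```python
import math
import random


def sample_poisson(rng: random.Random, lam: float) -> int:
    """Knuth's multiplication method; adequate for small lam."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_outcome(rng: random.Random, lam: float, shape: float, scale: float) -> float:
    """Sum of a Poisson number of gamma draws: exact zeros plus a skewed positive part."""
    return sum(rng.gammavariate(shape, scale) for _ in range(sample_poisson(rng, lam)))


rng = random.Random(42)
lam, shape, scale = 0.5, 2.0, 10.0   # illustrative values, not the library's
theta_log = 0.2
n = 20_000

control = [sample_outcome(rng, lam, shape, scale) for _ in range(n)]
# Assumption: the log-uplift multiplies the mean by exp(theta_log).
treated = [sample_outcome(rng, lam, shape, scale * math.exp(theta_log)) for _ in range(n)]

zero_share = sum(v == 0 for v in control) / n
mean_control = sum(control) / n
mean_treated = sum(treated) / n
median_control = sorted(control)[n // 2]

print(f"zeros: {zero_share:.1%}, median: {median_control:.2f}, mean: {mean_control:.2f}")
print(f"treated/control mean ratio: {mean_treated / mean_control:.3f}")
```

With these values more than half of the control outcomes are exactly zero, so the median sits at zero while the mean is pulled well above it by the tail, which is the regime where variance reduction methods are typically stress-tested.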
- causalis.dgp.causaldata.functional.generate_cuped_binary(n: int = 10000, seed: int = 42, add_pre: bool = True, pre_name: str = 'y_pre', pre_target_corr: float = 0.65, pre_spec: Optional[causalis.dgp.causaldata.preperiod.PreCorrSpec] = None, include_oracle: bool = True, return_causal_data: bool = True, theta_logit: float = 0.38) Union[pandas.DataFrame, causalis.dgp.causaldata.CausalData]¶
Binary CUPED-oriented DGP with richer confounders and structured HTE.
Designed for CUPED benchmarking with randomized treatment and a calibrated pre-period covariate while preserving exact oracle cate under include_oracle.
Parameters
n : int, default=10000
    Number of samples to generate.
seed : int, default=42
    Random seed.
add_pre : bool, default=True
    Whether to add a pre-period covariate.
pre_name : str, default="y_pre"
    Name of the pre-period covariate column.
pre_target_corr : float, default=0.65
    Target correlation between y_pre and post-outcome y in the control group.
pre_spec : PreCorrSpec, optional
    Detailed specification for pre-period calibration. If provided, pre_target_corr is ignored in favor of pre_spec.target_corr.
include_oracle : bool, default=True
    Whether to include oracle columns like m, g0, g1, cate.
return_causal_data : bool, default=True
    Whether to return a CausalData object.
theta_logit : float, default=0.38
    Baseline log-odds uplift scale for heterogeneous treatment effects.
Returns
pd.DataFrame or CausalData
- causalis.dgp.causaldata.functional.make_gold_linear(n: int = 10000, seed: int = 42) causalis.dgp.causaldata.CausalData¶
A standard linear benchmark with moderate confounding. Based on the benchmark scenario in docs/research/dgp_benchmarking.ipynb.
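The coefficients used by make_gold_linear are not listed on this page, so the sketch below uses made-up values to illustrate what a linear benchmark with moderate confounding is meant to test: the naive difference in means is biased, while adjusting for the confounder recovers the true effect. It relies only on the standard library and uses the Frisch-Waugh-Lovell residual-on-residual regression in place of a full OLS fit.

```python
import math
import random


def mean(values):
    return sum(values) / len(values)


def slope(x, y):
    """Simple-regression slope of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residualize(target, x):
    """Residuals of target after a linear regression on x (with intercept)."""
    b = slope(x, target)
    mt, mx = mean(target), mean(x)
    return [t - mt - b * (xi - mx) for t, xi in zip(target, x)]


rng = random.Random(42)
n, theta = 20_000, 1.0   # illustrative true effect

# One confounder x drives both treatment uptake and the outcome.
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
d = [1.0 if rng.random() < 1.0 / (1.0 + math.exp(-0.8 * xi)) else 0.0 for xi in x]
y = [theta * di + 1.0 * xi + rng.gauss(0.0, 1.0) for di, xi in zip(d, x)]

treated_y = [yi for yi, di in zip(y, d) if di == 1.0]
control_y = [yi for yi, di in zip(y, d) if di == 0.0]
naive = mean(treated_y) - mean(control_y)

# Frisch-Waugh-Lovell: the coefficient on d in y ~ 1 + d + x.
adjusted = slope(residualize(d, x), residualize(y, x))

print(f"true effect={theta:.2f}, naive={naive:.2f}, adjusted={adjusted:.2f}")
```

An estimator that is evaluated on such a benchmark should land near the adjusted value; landing near the naive value indicates that confounding has not been handled.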