causalis.dgp.causaldata.functional

Notes

Use this module when you want a ready-made synthetic dataset without specifying every structural component manually. For finer control over the DGP, instantiate CausalDatasetGenerator directly instead.

Examples

>>> from causalis.dgp.causaldata.functional import generate_rct, obs_linear_effect
>>> rct = generate_rct(n=500, outcome_type="normal", return_causal_data=True)
>>> rct.outcome, rct.treatment
('y', 'd')
>>> obs = obs_linear_effect(n=500, theta=1.0, target_d_rate=0.3)
>>> {"y", "d"}.issubset(obs.columns)
True

High-level helpers for single-treatment synthetic causal datasets.

This module provides convenient wrappers around causalis.dgp.causaldata.base.CausalDatasetGenerator for common benchmarking setups such as classic A/B tests, observational confounding, and CUPED-oriented examples with pre-period covariates.

Module Contents

Functions

generate_rct

Generate an RCT dataset with randomized treatment assignment.

generate_classic_rct

Generate a classic RCT dataset with three binary confounders: platform_ios, country_usa, and source_paid.

classic_rct_gamma

Generate a classic RCT dataset with three binary confounders and a gamma outcome.

obs_linear_effect

Generate an observational dataset with linear effects of confounders and a constant treatment effect.

make_cuped_tweedie

Tweedie-like DGP with mixed marginals and structured HTE. Features many zeros and a heavy right tail. Suitable for CUPED benchmarking.

generate_cuped_binary

Binary CUPED-oriented DGP with richer confounders and structured HTE.

make_gold_linear

A standard linear benchmark with moderate confounding. Based on the benchmark scenario in docs/research/dgp_benchmarking.ipynb.

API

causalis.dgp.causaldata.functional.generate_rct(n: int = 20000, split: float = 0.5, random_state: Optional[int] = 42, outcome_type: str = 'binary', outcome_params: Optional[Dict] = None, confounder_specs: Optional[List[Dict[str, Any]]] = None, k: int = 0, x_sampler: Optional[Callable[[int, int, int], numpy.ndarray]] = None, add_ancillary: bool = True, deterministic_ids: bool = False, add_pre: bool = True, pre_name: str = 'y_pre', pre_corr: float = 0.7, prognostic_scale: float = 1.0, beta_y: Optional[Union[List[float], numpy.ndarray]] = None, g_y: Optional[Callable[[numpy.ndarray], numpy.ndarray]] = None, use_prognostic: Optional[bool] = None, include_oracle: bool = True, return_causal_data: bool = False) Union[pandas.DataFrame, causalis.dgp.causaldata.CausalData]

Generate an RCT dataset with randomized treatment assignment.

Uses CausalDatasetGenerator internally, ensuring treatment is independent of X. Specifically designed for benchmarking variance reduction techniques like CUPED.
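To make the CUPED use case concrete, the following is a minimal standalone sketch of the adjustment that a pre-period covariate enables. It does not call causalis: the helper names (cuped_adjust, variance) and the simulated data are assumptions introduced for illustration only. With a correlation of 0.7 between the pre-period covariate and the outcome, the adjusted outcome should keep roughly 1 - 0.7**2 = 0.51 of the original variance.

```python
import math
import random


def variance(values):
    """Unbiased sample variance."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def cuped_adjust(y, y_pre):
    """Return y - theta * (y_pre - mean(y_pre)) with theta = cov(y, y_pre) / var(y_pre)."""
    n = len(y)
    mean_y = sum(y) / n
    mean_pre = sum(y_pre) / n
    cov = sum((a - mean_y) * (b - mean_pre) for a, b in zip(y, y_pre)) / (n - 1)
    theta = cov / variance(y_pre)
    return [a - theta * (b - mean_pre) for a, b in zip(y, y_pre)]


# Simulated stand-in data (not produced by causalis): y and y_pre are
# standard normal with correlation rho.
rng = random.Random(0)
rho = 0.7
n = 20_000
y_pre = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [rho * p + math.sqrt(1.0 - rho**2) * rng.gauss(0.0, 1.0) for p in y_pre]

y_adj = cuped_adjust(y, y_pre)
variance_ratio = variance(y_adj) / variance(y)
print(f"variance kept after CUPED: {variance_ratio:.3f} (theory: {1 - rho**2:.2f})")
```

The adjustment leaves the mean of the outcome unchanged, so a difference in means computed on the adjusted outcome targets the same effect with a smaller standard error.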

Notes on effect scale

How outcome_params maps into the structural effect:

  • outcome_type="normal": treatment shifts the mean by (mean["B"] - mean["A"]) on the outcome scale.

  • outcome_type="binary": treatment shifts the log-odds by (logit(p_B) - logit(p_A)).

  • outcome_type="poisson" or "gamma": treatment shifts the log-mean by log(lam_B / lam_A).
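The three mappings above can be checked by hand. The sketch below uses only the standard library; structural_effect is a hypothetical helper written for this illustration, not a causalis function, and it takes the two arm-level values (A, then B) directly rather than an outcome_params dictionary.

```python
import math


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def structural_effect(outcome_type, value_a, value_b):
    """Hypothetical helper: effect of B versus A on the structural (link) scale."""
    if outcome_type == "normal":
        return value_b - value_a                  # shift in the mean
    if outcome_type == "binary":
        return logit(value_b) - logit(value_a)    # shift in the log-odds
    if outcome_type in ("poisson", "gamma"):
        return math.log(value_b / value_a)        # shift in the log-mean
    raise ValueError(f"unknown outcome_type: {outcome_type!r}")


effect_normal = structural_effect("normal", 10.0, 10.5)
effect_binary = structural_effect("binary", 0.10, 0.12)
effect_poisson = structural_effect("poisson", 1.0, 1.1)

print(f"normal:  {effect_normal:.4f}")
print(f"binary:  {effect_binary:.4f}")
print(f"poisson: {effect_poisson:.4f}")
```

Note that a two-percentage-point lift from 0.10 to 0.12 corresponds to a log-odds shift of about 0.205, so the same outcome_params describe different effect sizes depending on which scale they are read on.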

Ancillary columns (if add_ancillary=True) are generated from baseline confounders X only, avoiding outcome leakage and post-treatment adjustment issues.

Parameters

n : int, default=20_000
    Number of samples to generate.
split : float, default=0.5
    Proportion of samples assigned to the treatment group.
random_state : int, optional
    Random seed for reproducibility.
outcome_type : {"binary", "normal", "poisson", "gamma"}, default="binary"
    Distribution family of the outcome.
outcome_params : dict, optional
    Parameters defining baseline rates/means and treatment effects, e.g. {"p": {"A": 0.1, "B": 0.12}} for binary, or {"shape": 2.0, "scale": {"A": 1.0, "B": 1.1}} for poisson/gamma.
confounder_specs : list of dict, optional
    Schema for confounder distributions.
k : int, default=0
    Number of confounders if specs not provided.
x_sampler : callable, optional
    Custom sampler for confounders.
add_ancillary : bool, default=True
    Whether to add descriptive columns like 'age', 'platform', etc.
deterministic_ids : bool, default=False
    Whether to generate deterministic user IDs.
add_pre : bool, default=True
    Whether to generate a pre-period covariate (y_pre).
pre_name : str, default="y_pre"
    Name of the pre-period covariate column.
pre_corr : float, default=0.7
    Target correlation between y_pre and the outcome Y in the control group.
prognostic_scale : float, default=1.0
    Scale of the prognostic signal derived from confounders.
include_oracle : bool, default=True
    Whether to include oracle ground-truth columns like 'cate', 'm', etc.
return_causal_data : bool, default=False
    Whether to return a CausalData object instead of a pandas.DataFrame.

Returns

pandas.DataFrame or CausalData
    Synthetic RCT dataset.

Examples

>>> from causalis.dgp.causaldata.functional import generate_rct
>>> data = generate_rct(
...     n=1000,
...     outcome_type="binary",
...     outcome_params={"p": {"A": 0.10, "B": 0.12}},
...     add_pre=True,
...     return_causal_data=True,
... )
>>> data.treatment, data.outcome
('d', 'y')
>>> "y_pre" in data.df.columns
True
>>> {"g0", "g1", "cate"}.issubset(data.df.columns)
True

causalis.dgp.causaldata.functional.generate_classic_rct(n: int = 10000, split: float = 0.5, random_state: Optional[int] = 42, outcome_params: Optional[Dict] = None, add_pre: bool = False, beta_y: Optional[Union[List[float], numpy.ndarray]] = None, outcome_depends_on_x: bool = True, prognostic_scale: float = 1.0, pre_corr: float = 0.7, return_causal_data: bool = False, add_ancillary: bool = False, deterministic_ids: bool = False, include_oracle: bool = True, **kwargs) Union[pandas.DataFrame, causalis.dgp.causaldata.CausalData]

Generate a classic RCT dataset with three binary confounders: platform_ios, country_usa, and source_paid.

Parameters

n : int, default=10_000
    Number of samples to generate.
split : float, default=0.5
    Proportion of samples assigned to the treatment group.
random_state : int, optional
    Random seed for reproducibility.
outcome_params : dict, optional
    Parameters defining baseline rates/means and treatment effects, e.g. {"p": {"A": 0.1, "B": 0.15}} for binary.
add_pre : bool, default=False
    Whether to generate a pre-period covariate (y_pre).
beta_y : array-like, optional
    Linear coefficients for confounders in the outcome model.
outcome_depends_on_x : bool, default=True
    Whether to add default effects for confounders if beta_y is None.
prognostic_scale : float, default=1.0
    Scale of nonlinear prognostic signal (passed to generate_rct).
pre_corr : float, default=0.7
    Target correlation for y_pre (passed to generate_rct).
return_causal_data : bool, default=False
    Whether to return a CausalData object instead of a pandas.DataFrame.
add_ancillary : bool, default=False
    Whether to add standard ancillary columns (age, platform, etc.).
deterministic_ids : bool, default=False
    Whether to generate deterministic user IDs.
include_oracle : bool, default=True
    Whether to include oracle ground-truth columns like 'cate', 'propensity', etc.
**kwargs
    Additional arguments passed to generate_rct.

Returns

pandas.DataFrame or CausalData
    Synthetic classic RCT dataset.

causalis.dgp.causaldata.functional.classic_rct_gamma(n: int = 10000, split: float = 0.5, random_state: Optional[int] = 42, outcome_params: Optional[Dict] = None, add_pre: bool = False, beta_y: Optional[Union[List[float], numpy.ndarray]] = None, outcome_depends_on_x: bool = True, prognostic_scale: float = 1.0, pre_corr: float = 0.7, add_ancillary: bool = True, deterministic_ids: bool = False, include_oracle: bool = True, return_causal_data: bool = False, **kwargs) Union[pandas.DataFrame, causalis.dgp.causaldata.CausalData]

Generate a classic RCT dataset with three binary confounders and a gamma outcome.

The gamma outcome uses a log-mean link, so treatment effects are multiplicative on the mean scale. The default parameters are chosen to resemble a skewed real-world metric (e.g., spend or revenue).
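A short worked example may help with the log-mean link. It uses the documented example values for outcome_params (shape 2.0, scale 15.0 for arm A and 16.5 for arm B) and the identity mean = shape * scale. The covariate shift of 0.3 on the log-mean scale is an arbitrary value chosen for illustration, not a causalis default.

```python
import math

shape = 2.0
scale = {"A": 15.0, "B": 16.5}

# Gamma mean is shape * scale, so each arm's baseline mean follows directly.
mean_a = shape * scale["A"]
mean_b = shape * scale["B"]

# On the log-mean scale the treatment effect is an additive shift ...
log_uplift = math.log(mean_b / mean_a)
# ... which is a multiplicative effect on the mean scale.
mean_ratio = mean_b / mean_a

# Illustrative covariate contribution on the log-mean scale (assumed value).
covariate_shift = 0.3
shifted_a = mean_a * math.exp(covariate_shift)
shifted_b = mean_b * math.exp(covariate_shift)
shifted_ratio = shifted_b / shifted_a

print(f"means: A={mean_a:.1f}, B={mean_b:.1f}")
print(f"log-mean uplift: {log_uplift:.4f}, mean ratio: {mean_ratio:.2f}")
print(f"ratio after covariate shift: {shifted_ratio:.2f}")
```

Because covariates enter additively on the log scale, they rescale both arms by the same factor: the relative uplift stays at 10% while the absolute difference in means grows with the baseline.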

Parameters

n : int, default=10_000
    Number of samples to generate.
split : float, default=0.5
    Proportion of samples assigned to the treatment group.
random_state : int, optional
    Random seed for reproducibility.
outcome_params : dict, optional
    Gamma parameters, e.g. {"shape": 2.0, "scale": {"A": 15.0, "B": 16.5}}. Mean = shape * scale.
add_pre : bool, default=False
    Whether to generate a pre-period covariate (y_pre).
beta_y : array-like, optional
    Linear coefficients for confounders in the log-mean outcome model.
outcome_depends_on_x : bool, default=True
    Whether to add default effects for confounders if beta_y is None.
prognostic_scale : float, default=1.0
    Scale of nonlinear prognostic signal.
pre_corr : float, default=0.7
    Target correlation for y_pre with post-outcome in control group.
add_ancillary : bool, default=True
    Whether to add standard ancillary columns (age, platform, etc.).
deterministic_ids : bool, default=False
    Whether to generate deterministic user IDs.
include_oracle : bool, default=True
    Whether to include oracle ground-truth columns like 'cate', 'propensity', etc.
return_causal_data : bool, default=False
    Whether to return a CausalData object instead of a pandas.DataFrame.
**kwargs
    Additional arguments passed to generate_rct (e.g., pre_name, g_y, use_prognostic).

Returns

pandas.DataFrame or CausalData
    Synthetic classic RCT dataset with gamma outcome.

causalis.dgp.causaldata.functional.obs_linear_effect(n: int = 10000, theta: float = 1.0, outcome_type: str = 'continuous', sigma_y: float = 1.0, target_d_rate: Optional[float] = None, confounder_specs: Optional[List[Dict[str, Any]]] = None, beta_y: Optional[numpy.ndarray] = None, beta_d: Optional[numpy.ndarray] = None, random_state: Optional[int] = 42, k: int = 0, x_sampler: Optional[Callable[[int, int, int], numpy.ndarray]] = None, include_oracle: bool = True, add_ancillary: bool = False, deterministic_ids: bool = False) pandas.DataFrame

Generate an observational dataset with linear effects of confounders and a constant treatment effect.

Parameters

n : int, default=10_000
    Number of samples to generate.
theta : float, default=1.0
    Constant treatment effect.
outcome_type : {"continuous", "binary", "poisson", "gamma"}, default="continuous"
    Family of the outcome distribution.
sigma_y : float, default=1.0
    Noise level for continuous outcomes.
target_d_rate : float, optional
    Target treatment prevalence (propensity mean).
confounder_specs : list of dict, optional
    Schema for confounder distributions.
beta_y : array-like, optional
    Linear coefficients for confounders in the outcome model.
beta_d : array-like, optional
    Linear coefficients for confounders in the treatment model.
random_state : int, optional
    Random seed for reproducibility.
k : int, default=0
    Number of confounders if specs not provided.
x_sampler : callable, optional
    Custom sampler for confounders.
include_oracle : bool, default=True
    Whether to include oracle ground-truth columns like 'cate', 'm', etc.
add_ancillary : bool, default=False
    If True, adds standard ancillary columns (age, platform, etc.).
deterministic_ids : bool, default=False
    If True, generates deterministic user IDs.

Returns

pandas.DataFrame
    Synthetic observational dataset.
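The target_d_rate parameter is documented as a target for the mean propensity. One common way to hit such a target in a logistic treatment model is to solve for the intercept numerically. The sketch below shows that idea with bisection; it is an assumption about how such calibration can be done, not a description of the causalis internals, and calibrate_intercept, the coefficients, and the simulated confounders are all hypothetical.

```python
import math
import random


def sigmoid(z):
    """Logistic function mapping a real score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def calibrate_intercept(scores, target_rate, tol=1e-8):
    """Bisection for the intercept alpha such that mean(sigmoid(alpha + score)) = target_rate.

    Hypothetical helper: the mean propensity is increasing in alpha, so
    bisection on a wide bracket converges to the unique solution.
    """
    lo, hi = -30.0, 30.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        mean_p = sum(sigmoid(mid + s) for s in scores) / len(scores)
        if mean_p < target_rate:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Simulated confounders and illustrative treatment coefficients (assumed values).
rng = random.Random(3141)
beta_d = [0.8, -0.5]
x = [[rng.gauss(0.0, 1.0) for _ in beta_d] for _ in range(2_000)]
scores = [sum(b * xi for b, xi in zip(beta_d, row)) for row in x]

target_d_rate = 0.3
alpha = calibrate_intercept(scores, target_d_rate)
propensity = [sigmoid(alpha + s) for s in scores]
mean_propensity = sum(propensity) / len(propensity)
print(f"intercept: {alpha:.4f}, mean propensity: {mean_propensity:.4f}")
```

Only the intercept moves, so the confounders keep the same relative influence on treatment assignment while the overall treated share matches the target.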

Notes

This helper is a lightweight observational benchmark:

  • treatment is not randomized unless beta_d is zero and target_d_rate forces a near-constant propensity;

  • oracle columns such as m and cate are available when include_oracle=True;

  • the treatment effect is constant on the structural link scale, so heterogeneity only enters through the outcome family transformation.
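The last point above is easy to see numerically. In the sketch below, a constant theta = 1.0 is added on the log-odds scale at three different baseline log-odds; the baselines are arbitrary illustrative values, and the code is standalone rather than a call into causalis.

```python
import math


def sigmoid(z):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-z))


theta = 1.0                          # constant effect on the link scale
baseline_logits = [-3.0, -1.0, 1.0]  # illustrative baseline log-odds

# Binary outcome: the same theta yields different probability-scale effects.
risk_differences = [sigmoid(b + theta) - sigmoid(b) for b in baseline_logits]

# Continuous outcome with identity link: the effect is theta for every unit.
mean_differences = [(b + theta) - b for b in baseline_logits]

for b, rd in zip(baseline_logits, risk_differences):
    print(f"baseline logit {b:+.1f}: risk difference {rd:.4f}")
```

The risk difference is largest for units whose baseline probability is near 0.5 and shrinks toward the extremes, even though theta never changes.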

Examples

>>> from causalis.dgp.causaldata.functional import obs_linear_effect
>>> df = obs_linear_effect(
...     n=1000,
...     theta=1.0,
...     target_d_rate=0.35,
...     k=3,
...     random_state=3141,
... )
>>> sorted(col for col in ["y", "d", "m", "cate"] if col in df.columns)
['cate', 'd', 'm', 'y']

causalis.dgp.causaldata.functional.make_cuped_tweedie(n: int = 10000, seed: int = 42, add_pre: bool = True, pre_name: str = 'y_pre', pre_target_corr: float = 0.6, pre_spec: Optional[causalis.dgp.causaldata.preperiod.PreCorrSpec] = None, include_oracle: bool = False, return_causal_data: bool = True, theta_log: float = 0.2) Union[pandas.DataFrame, causalis.dgp.causaldata.CausalData]

Tweedie-like DGP with mixed marginals and structured HTE. Features many zeros and a heavy right tail. Suitable for CUPED benchmarking.
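For intuition about the "many zeros and a heavy right tail" shape, here is a standalone sketch of the textbook compound Poisson-gamma construction, which is the standard representation of a Tweedie distribution with power between 1 and 2. The documentation does not state how make_cuped_tweedie builds its outcome, so this should be read as an illustration of the distribution family only; the function names and parameter values are assumptions.

```python
import math
import random


def sample_poisson(rng, lam):
    """Draw a Poisson(lam) count with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count = 0
    product = 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return count
        count += 1


def sample_compound_poisson_gamma(rng, lam, shape, scale):
    """Sum of a Poisson(lam) number of Gamma(shape, scale) draws; exactly 0.0 when the count is 0."""
    count = sample_poisson(rng, lam)
    return sum((rng.gammavariate(shape, scale) for _ in range(count)), 0.0)


rng = random.Random(42)
lam, shape, scale = 0.5, 2.0, 10.0  # illustrative values, not causalis defaults
samples = [sample_compound_poisson_gamma(rng, lam, shape, scale) for _ in range(20_000)]

zero_share = sum(1 for s in samples if s == 0.0) / len(samples)
sample_mean = sum(samples) / len(samples)
print(f"share of zeros: {zero_share:.3f} (theory: {math.exp(-lam):.3f})")
print(f"mean: {sample_mean:.2f} (theory: {lam * shape * scale:.2f}), max: {max(samples):.1f}")
```

The point mass at zero comes from draws with a zero Poisson count, and the right tail comes from units that accumulate several gamma draws.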

Parameters

n : int, default=10000
    Number of samples to generate.
seed : int, default=42
    Random seed.
add_pre : bool, default=True
    Whether to add a pre-period covariate 'y_pre'.
pre_name : str, default="y_pre"
    Name of the pre-period covariate column.
pre_target_corr : float, default=0.6
    Target correlation between y_pre and post-outcome y in control group.
pre_spec : PreCorrSpec, optional
    Detailed specification for pre-period calibration (transform, method, etc.). If provided, pre_target_corr is ignored in favor of pre_spec.target_corr.
include_oracle : bool, default=False
    Whether to include oracle ground-truth columns like 'cate', 'propensity', etc.
return_causal_data : bool, default=True
    Whether to return a CausalData object.
theta_log : float, default=0.2
    The log-uplift theta parameter for the treatment effect.

Returns

pandas.DataFrame or CausalData

causalis.dgp.causaldata.functional.generate_cuped_binary(n: int = 10000, seed: int = 42, add_pre: bool = True, pre_name: str = 'y_pre', pre_target_corr: float = 0.65, pre_spec: Optional[causalis.dgp.causaldata.preperiod.PreCorrSpec] = None, include_oracle: bool = True, return_causal_data: bool = True, theta_logit: float = 0.38) Union[pandas.DataFrame, causalis.dgp.causaldata.CausalData]

Binary CUPED-oriented DGP with richer confounders and structured HTE.

Designed for CUPED benchmarking with randomized treatment and a calibrated pre-period covariate, while preserving the exact oracle cate when include_oracle=True.

Parameters

n : int, default=10000
    Number of samples to generate.
seed : int, default=42
    Random seed.
add_pre : bool, default=True
    Whether to add a pre-period covariate.
pre_name : str, default="y_pre"
    Name of the pre-period covariate column.
pre_target_corr : float, default=0.65
    Target correlation between y_pre and post-outcome y in the control group.
pre_spec : PreCorrSpec, optional
    Detailed specification for pre-period calibration. If provided, pre_target_corr is ignored in favor of pre_spec.target_corr.
include_oracle : bool, default=True
    Whether to include oracle columns like m, g0, g1, cate.
return_causal_data : bool, default=True
    Whether to return a CausalData object.
theta_logit : float, default=0.38
    Baseline log-odds uplift scale for heterogeneous treatment effects.

Returns

pandas.DataFrame or CausalData

causalis.dgp.causaldata.functional.make_gold_linear(n: int = 10000, seed: int = 42) causalis.dgp.causaldata.CausalData

A standard linear benchmark with moderate confounding. Based on the benchmark scenario in docs/research/dgp_benchmarking.ipynb.