causalis.dgp.causaldata.base¶
Low-level generators for single-treatment synthetic causal datasets.
The central class in this module is :class:`CausalDatasetGenerator`, which
builds observational or randomized-style data with binary treatment and
optionally exposes oracle quantities such as propensities and potential-outcome
means.
Notes¶
This module is the lowest-level tabular DGP layer used by higher-level helpers
in :mod:`causalis.dgp.causaldata.functional` and scenario-specific wrappers.
Reach for it when you want direct control over the structural equations rather
than a pre-packaged example dataset.
Examples¶
>>> import numpy as np
>>> from causalis.dgp.causaldata.base import CausalDatasetGenerator
>>> gen = CausalDatasetGenerator(
...     theta=1.5,
...     beta_y=np.array([0.7, -0.4]),
...     beta_d=np.array([1.0, 0.5]),
...     target_d_rate=0.35,
...     seed=3141,
...     include_oracle=True,
...     k=2,
... )
>>> df = gen.generate(200)
>>> sorted(col for col in ["y", "d", "m", "cate"] if col in df.columns)
['cate', 'd', 'm', 'y']
Module Contents¶
Classes¶
CausalDatasetGenerator: Generate synthetic causal inference datasets with controllable confounding, treatment prevalence, noise, and (optionally) heterogeneous treatment effects.
API¶
- class causalis.dgp.causaldata.base.CausalDatasetGenerator¶
Generate synthetic causal inference datasets with controllable confounding, treatment prevalence, noise, and (optionally) heterogeneous treatment effects.
Data model (high level)
confounders X ∈ R^k are drawn from user-specified distributions.
Binary treatment D is assigned by a logistic model: D ~ Bernoulli( sigmoid(alpha_d + f_d(X) + u_strength_d * U) ), where f_d(X) = (X @ beta_d + g_d(X)) * propensity_sharpness, and U ~ N(0,1) is an optional unobserved confounder.
Outcome Y depends on treatment and confounders, with a link determined by outcome_type:
outcome_type = "continuous": Y = alpha_y + f_y(X) + u_strength_y * U + D * tau(X) + eps, eps ~ N(0, sigma_y^2)
outcome_type = "binary": logit P(Y=1 | D, X, U) = alpha_y + f_y(X) + u_strength_y * U + D * tau(X)
outcome_type = "poisson": log E[Y | D, X, U] = alpha_y + f_y(X) + u_strength_y * U + D * tau(X)
outcome_type = "gamma": log E[Y | D, X, U] = alpha_y + f_y(X) + u_strength_y * U + D * tau(X)
where f_y(X) = X @ beta_y + g_y(X), and tau(X) is either the constant theta or a user-supplied function.
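The structural equations above can be sketched in plain NumPy for the continuous case (no unobserved confounder U, g_d = g_y = 0, propensity_sharpness = 1). This is a minimal illustration of the data model, not the generator's actual implementation; the variable names mirror the parameters documented below.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 2
theta = 1.5                                # constant treatment effect tau(X) = theta
beta_y = np.array([0.7, -0.4])
beta_d = np.array([1.0, 0.5])
alpha_d, alpha_y, sigma_y = 0.0, 0.0, 1.0

X = rng.standard_normal((n, k))            # confounders X ~ N(0, I_k)
score = alpha_d + X @ beta_d               # treatment score f_d(X)
m = 1.0 / (1.0 + np.exp(-score))           # true propensity P(D=1 | X)
D = rng.binomial(1, m)                     # logistic treatment assignment
eps = rng.normal(0.0, sigma_y, n)          # Gaussian outcome noise
Y = alpha_y + X @ beta_y + D * theta + eps # continuous outcome
```

Because beta_d appears in the treatment score and beta_y in the outcome, X confounds the naive comparison of treated and untreated means; the constant theta remains the true effect.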
Returned columns
y: outcome
d: binary treatment (0/1)
x1..xk (or user-provided names)
m: true propensity P(D=1 | X), marginalized over U
m_obs: realized propensity P(D=1 | X, U)
tau_link: tau(X) on the structural (link) scale
g0: E[Y | X, D=0] on the natural outcome scale, marginalized over U
g1: E[Y | X, D=1] on the natural outcome scale, marginalized over U
cate: g1 - g0 (conditional average treatment effect on the natural outcome scale)
Notes on effect scale:
For "continuous", theta (or tau(X)) is an additive mean difference, so tau_link == cate.
For "binary", tau acts on the log-odds scale; cate is reported as a risk difference.
For "poisson" and "gamma", tau acts on the log-mean scale; cate is reported on the mean scale.
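The link-scale versus natural-scale distinction can be made concrete for the binary case: a constant tau on the log-odds scale produces a risk difference that varies with the baseline f_y(X). A small sketch using the formulas above (illustrative values, not library code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

alpha_y = 0.0
f_y = np.array([-1.0, 0.0, 1.0])        # baseline f_y(X) for three example units
tau_link = 0.8                          # constant structural effect on the log-odds scale

g0 = sigmoid(alpha_y + f_y)             # P(Y=1 | X, D=0)
g1 = sigmoid(alpha_y + f_y + tau_link)  # P(Y=1 | X, D=1)
cate = g1 - g0                          # risk difference on the natural scale
```

Even with a constant tau_link, the reported cate differs across the three units, which is why the oracle columns distinguish tau_link from cate.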
Parameters
theta : float, default=1.0
    Constant treatment effect used if tau is None.
tau : callable, optional
    Function tau(X) -> array-like of shape (n,) for heterogeneous effects.
beta_y : array-like, optional
    Linear coefficients of confounders in the outcome baseline f_y(X).
beta_d : array-like, optional
    Linear coefficients of confounders in the treatment score f_d(X).
g_y : callable, optional
    Nonlinear/additive function g_y(X) -> (n,) added to the outcome baseline.
g_d : callable, optional
    Nonlinear/additive function g_d(X) -> (n,) added to the treatment score.
alpha_y : float, default=0.0
    Outcome intercept (natural scale for continuous; log-odds for binary; log-mean for Poisson/Gamma).
alpha_d : float, default=0.0
    Treatment intercept (log-odds). If target_d_rate is set, alpha_d is auto-calibrated.
sigma_y : float, default=1.0
    Std. dev. of the Gaussian noise for continuous outcomes.
outcome_type : {"continuous", "binary", "poisson", "gamma", "tweedie"}, default="continuous"
    Outcome family and link as defined above.
confounder_specs : list of dict, optional
    Schema for generating confounders. See _gaussian_copula for details.
k : int, default=5
    Number of confounders when confounder_specs is None. Defaults to independent N(0, 1).
x_sampler : callable, optional
    Custom sampler (n, k, seed) -> X ndarray of shape (n, k). Overrides confounder_specs.
use_copula : bool, default=False
    If True and confounder_specs is provided, use a Gaussian copula for X.
copula_corr : array-like, optional
    Correlation matrix for the copula.
target_d_rate : float, optional
    Target treatment prevalence (propensity mean). Calibrates alpha_d.
u_strength_d : float, default=0.0
    Strength of the unobserved confounder U in treatment assignment.
u_strength_y : float, default=0.0
    Strength of the unobserved confounder U in the outcome.
propensity_sharpness : float, default=1.0
    Scales the X-driven treatment score to adjust positivity difficulty.
seed : int, optional
    Random seed for reproducibility.
Attributes
rng : numpy.random.Generator
    Internal RNG seeded from seed.
Notes
Oracle outputs are reported on the natural outcome scale:
- m is the treatment propensity marginalized over latent noise.
- g0 and g1 are mean potential outcomes on the observed outcome scale.
- cate is always g1 - g0 on that same natural scale, even when the structural treatment effect is specified on a link scale such as log-odds or log-mean.
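The auto-calibration of alpha_d toward target_d_rate can be understood as one-dimensional root-finding: the mean of sigmoid(alpha_d + score) is strictly increasing in alpha_d, so bisection converges. The helper below is a hypothetical sketch of that idea, not the generator's actual calibration routine.

```python
import numpy as np

def calibrate_alpha_d(score, target_rate, lo=-20.0, hi=20.0, iters=60):
    """Hypothetical sketch: bisect alpha_d so mean propensity hits target_rate."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        rate = np.mean(1.0 / (1.0 + np.exp(-(mid + score))))
        if rate < target_rate:
            lo = mid  # need a larger intercept to raise the mean propensity
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(3141)
score = rng.standard_normal(5000)            # stands in for f_d(X)
alpha_d = calibrate_alpha_d(score, target_rate=0.35)
rate = float(np.mean(1.0 / (1.0 + np.exp(-(alpha_d + score)))))
```

After calibration, the realized mean propensity matches the requested prevalence to high precision.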
Examples
>>> import numpy as np
>>> gen = CausalDatasetGenerator(
...     theta=0.8,
...     beta_y=np.array([0.5, -0.2, 0.1]),
...     beta_d=np.array([0.8, 0.4, -0.3]),
...     target_d_rate=0.4,
...     outcome_type="continuous",
...     seed=123,
...     k=3,
... )
>>> df = gen.generate(1000)
>>> float(df["d"].mean()) > 0.0
True
>>> "cate" in df.columns
True
- theta: float¶
1.0
- tau: Optional[Callable[[numpy.ndarray], numpy.ndarray]]¶
None
- beta_y: Optional[numpy.ndarray]¶
None
- beta_d: Optional[numpy.ndarray]¶
None
- g_y: Optional[Callable[[numpy.ndarray], numpy.ndarray]]¶
None
- g_d: Optional[Callable[[numpy.ndarray], numpy.ndarray]]¶
None
- alpha_y: float¶
0.0
- alpha_d: float¶
0.0
- sigma_y: float¶
1.0
- outcome_type: str¶
'continuous'
- confounder_specs: Optional[List[Dict[str, Any]]]¶
None
- k: int¶
5
- x_sampler: Optional[Callable[[int, int, int], numpy.ndarray]]¶
None
- use_copula: bool¶
False
- copula_corr: Optional[numpy.ndarray]¶
None
- target_d_rate: Optional[float]¶
None
- u_strength_d: float¶
0.0
- u_strength_y: float¶
0.0
- propensity_sharpness: float¶
1.0
- score_bounding: Optional[float]¶
None
- alpha_zi: float¶
None
- beta_zi: Optional[numpy.ndarray]¶
None
- g_zi: Optional[Callable[[numpy.ndarray], numpy.ndarray]]¶
None
- u_strength_zi: float¶
0.0
- tau_zi: Optional[Callable[[numpy.ndarray], numpy.ndarray]]¶
None
- pos_dist: str¶
'gamma'
- gamma_shape: float¶
2.0
- lognormal_sigma: float¶
1.0
- include_oracle: bool¶
True
- seed: Optional[int]¶
None
- rng: numpy.random.Generator¶
'field(...)'
- __post_init__()¶
Initialize RNG and validate configuration.
- generate(n: int, U: Optional[numpy.ndarray] = None) pandas.DataFrame¶
Draw a synthetic dataset of size n.
Parameters
n : int
    Number of samples to generate.
U : numpy.ndarray, optional
    Unobserved confounder. If None, generated from N(0, 1).
Returns
pandas.DataFrame
    The generated dataset with outcome 'y', treatment 'd', confounders, and oracle ground-truth columns.
- to_causal_data(n: int, confounders: Optional[Union[str, List[str]]] = None) causalis.dgp.causaldata.CausalData¶
Generate a dataset and convert it to a CausalData object.
Parameters
n : int
    Number of samples to generate.
confounders : str or list of str, optional
    Confounder column names to include. If None, numeric confounders are detected automatically.
Returns
CausalData
    A CausalData object containing the generated dataset.
- oracle_nuisance(num_quad: int = 21)¶
Return nuisance functions (m(x), g0(x), g1(x)) compatible with IRM.
Parameters
num_quad : int, default=21
    Number of quadrature points for marginalizing over U.
Returns
dict
    Dictionary of callables mapping X to nuisance values.
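Marginalizing over the latent U ~ N(0, 1) with a fixed number of quadrature points, as num_quad suggests, is commonly done with Gauss-Hermite quadrature. The sketch below shows the general technique in plain NumPy; whether oracle_nuisance uses exactly this rule is an assumption, and marginalize_over_u is a hypothetical helper, not part of the library.

```python
import numpy as np

def marginalize_over_u(f, num_quad=21):
    """Approximate E[f(U)] for U ~ N(0, 1) by Gauss-Hermite quadrature.

    For the physicists' Hermite rule, E[f(U)] ~= sum_i w_i * f(sqrt(2) * x_i) / sqrt(pi).
    """
    nodes, weights = np.polynomial.hermite.hermgauss(num_quad)
    u = np.sqrt(2.0) * nodes
    return (weights / np.sqrt(np.pi)) @ f(u)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Example: a propensity marginalized over U, as in the oracle column m:
# E_U[ sigmoid(score + u_strength_d * U) ] for a fixed X-driven score.
m_marginal = marginalize_over_u(lambda u: sigmoid(0.5 + 1.0 * u))
```

The rule is exact for polynomials up to degree 2 * num_quad - 1, so low moments of U are recovered exactly while smooth nonlinear links such as the sigmoid are approximated well with a modest num_quad.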