Metadata-Version: 2.4
Name: experiment-utils-pd
Version: 0.1.11
Summary: Set of utility functions for analyzing experimental and observational data
Author-email: Sebastian Daza <sebastian.daza@gmail.com>
License: MIT License
        
        Copyright (c) 2025 Sebastian Daza
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Development Status :: 3 - Alpha
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.21.5
Requires-Dist: pandas>=2.3.1
Requires-Dist: matplotlib>=3.5.2
Requires-Dist: seaborn>=0.11.2
Requires-Dist: multiprocess>=0.70.14
Requires-Dist: threadpoolctl>=3.5.0
Requires-Dist: statsmodels>=0.13.2
Requires-Dist: scipy>=1.9.1
Requires-Dist: linearmodels>=6.1
Requires-Dist: scikit-learn==1.5.2
Requires-Dist: xgboost>=2.1.3
Requires-Dist: pytest>=8.3.5
Requires-Dist: ruff>=0.11.7
Requires-Dist: ipykernel>=6.29.5
Requires-Dist: build>=1.3.0
Requires-Dist: twine>=6.1.0
Requires-Dist: lifelines>=0.30.1
Dynamic: license-file

[![ci](https://github.com/sdaza/experiment-utils-pd/actions/workflows/ci.yaml/badge.svg)](https://github.com/sdaza/experiment-utils-pd/actions/workflows/ci.yaml)
[![PyPI version](https://img.shields.io/pypi/v/experiment-utils-pd.svg)](https://pypi.org/project/experiment-utils-pd/)

# Experiment Utils

A comprehensive Python package for designing, analyzing, and validating experiments with advanced causal inference capabilities.

## Features

- **Experiment Analysis**: Estimate treatment effects with multiple adjustment methods (covariate balancing, regression, IV, AIPW)
- **Multiple Outcome Models**: OLS, logistic, Poisson, negative binomial, and Cox proportional hazards
- **Doubly Robust Estimation**: Augmented IPW (AIPW) for OLS, logistic, Poisson, and negative binomial models
- **Survival Analysis**: Cox proportional hazards with IPW and regression adjustment
- **Covariate Balance**: Check and visualize balance between treatment groups
- **Marginal Effects**: Average marginal effects for GLMs (probability change, count change)
- **Bootstrap Inference**: Robust confidence intervals and p-values via bootstrap resampling
- **Multiple Comparison Correction**: Family-wise error rate control (Bonferroni, Holm, Sidak, FDR)
- **Power Analysis**: Calculate statistical power and find optimal sample sizes
- **Retrodesign Analysis**: Assess reliability of study designs (Type S/M errors)
- **Random Assignment**: Generate balanced treatment assignments with stratification

## Table of Contents

- [Experiment Utils](#experiment-utils)
  - [Features](#features)
  - [Table of Contents](#table-of-contents)
  - [Installation](#installation)
    - [From PyPI (Recommended)](#from-pypi-recommended)
    - [From GitHub (Latest Development Version)](#from-github-latest-development-version)
  - [Quick Start](#quick-start)
  - [User Guide](#user-guide)
    - [Basic Experiment Analysis](#basic-experiment-analysis)
    - [Checking Covariate Balance](#checking-covariate-balance)
    - [Covariate Adjustment Methods](#covariate-adjustment-methods)
    - [Outcome Models](#outcome-models)
    - [Survival Analysis (Cox Models)](#survival-analysis-cox-models)
    - [Bootstrap Inference](#bootstrap-inference)
    - [Multiple Experiments](#multiple-experiments)
    - [Categorical Treatment Variables](#categorical-treatment-variables)
    - [Instrumental Variables (IV)](#instrumental-variables-iv)
    - [Multiple Comparison Adjustments](#multiple-comparison-adjustments)
    - [Non-Inferiority Testing](#non-inferiority-testing)
    - [Combining Effects (Meta-Analysis)](#combining-effects-meta-analysis)
    - [Retrodesign Analysis](#retrodesign-analysis)
  - [Power Analysis](#power-analysis)
    - [Calculate Power](#calculate-power)
    - [Power from Real Data](#power-from-real-data)
    - [Grid Power Simulation](#grid-power-simulation)
    - [Find Sample Size](#find-sample-size)
    - [Simulate Retrodesign](#simulate-retrodesign)
  - [Utilities](#utilities)
    - [Balanced Random Assignment](#balanced-random-assignment)
    - [Standalone Balance Checker](#standalone-balance-checker)
  - [Advanced Topics](#advanced-topics)
    - [When to Use Different Adjustment Methods](#when-to-use-different-adjustment-methods)
    - [Non-Collapsibility of Hazard and Odds Ratios](#non-collapsibility-of-hazard-and-odds-ratios)
    - [Handling Missing Data](#handling-missing-data)
    - [Best Practices](#best-practices)
    - [Common Workflows](#common-workflows)
  - [Contributing](#contributing)
  - [License](#license)
  - [Citation](#citation)

## Installation

### From PyPI (Recommended)

```bash
pip install experiment-utils-pd
```

### From GitHub (Latest Development Version)

```bash
pip install git+https://github.com/sdaza/experiment-utils-pd.git
```

## Quick Start

Here's a complete example analyzing an A/B test with covariate adjustment:

```python
import pandas as pd
import numpy as np
from experiment_utils.experiment_analyzer import ExperimentAnalyzer

# Create sample experiment data
np.random.seed(42)
df = pd.DataFrame({
    "user_id": range(1000),
    "treatment": np.random.choice([0, 1], 1000),
    "conversion": np.random.binomial(1, 0.15, 1000),
    "revenue": np.random.normal(50, 20, 1000),
    "age": np.random.normal(35, 10, 1000),
    "is_member": np.random.choice([0, 1], 1000),
})

# Initialize analyzer
analyzer = ExperimentAnalyzer(
    data=df,
    treatment_col="treatment",
    outcomes=["conversion", "revenue"],
    covariates=["age", "is_member"],
    adjustment="balance",  # Adjust for covariates
    balance_method="ps-logistic",
)

# Estimate treatment effects
analyzer.get_effects()

# View results
results = analyzer.results
print(results[["outcome", "absolute_effect", "relative_effect", 
               "pvalue", "stat_significance"]])

# Balance is automatically calculated when covariates are provided
balance = analyzer.balance
print(f"\nBalance: {balance['balance_flag'].mean():.1%} of covariates balanced")
```

Output:
```
       outcome  absolute_effect  relative_effect   pvalue stat_significance
0   conversion           0.0234           0.1623   0.0456                 1
1      revenue           2.1450           0.0429   0.1234                 0

Balance: 100.0% of covariates balanced
```

## User Guide

### Basic Experiment Analysis

Analyze a simple A/B test without covariate adjustment:

```python
from experiment_utils.experiment_analyzer import ExperimentAnalyzer

# Simple analysis (no covariates)
analyzer = ExperimentAnalyzer(
    data=df,
    treatment_col="treatment",
    outcomes=["conversion"],
)

analyzer.get_effects()
print(analyzer.results)
```

**Key columns in results:**
- `outcome`: Outcome variable name
- `absolute_effect`: Treatment effect (treatment - control mean)
- `relative_effect`: Lift (absolute_effect / control_mean)
- `standard_error`: Standard error of the effect
- `pvalue`: P-value for hypothesis test
- `stat_significance`: 1 if significant at alpha level, 0 otherwise
- `abs_effect_lower/upper`: Confidence interval bounds (absolute)
- `rel_effect_lower/upper`: Confidence interval bounds (relative)

### Checking Covariate Balance

**Balance is automatically calculated** when you provide covariates and run `get_effects()`:

```python
analyzer = ExperimentAnalyzer(
    data=df,
    treatment_col="treatment",
    outcomes=["conversion"],
    covariates=["age", "income", "region"],  # Can include categorical
)

analyzer.get_effects()

# Balance is automatically available
balance = analyzer.balance
print(balance[["covariate", "smd", "balance_flag"]])
print(f"\nBalanced: {balance['balance_flag'].mean():.1%}")

# Identify imbalanced covariates
imbalanced = balance[balance["balance_flag"] == 0]
if not imbalanced.empty:
    print(f"Imbalanced: {imbalanced['covariate'].tolist()}")
```

**Check balance independently** (optional, before running `get_effects()` or with custom parameters):

```python
# Check balance with different threshold
balance_strict = analyzer.check_balance(threshold=0.05)
```

**Balance metrics explained:**
- `smd`: Standardized Mean Difference (|SMD| < 0.1 indicates good balance)
- `balance_flag`: 1 if balanced, 0 if imbalanced
- `mean_treated/control`: Group means for the covariate
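
For intuition, here is a minimal sketch of how an SMD is typically computed for a single numeric covariate, using a pooled standard deviation (the package's internal formula may differ in detail). It reuses the quick-start DataFrame `df`:

```python
import numpy as np

# Manual SMD for "age" (illustration only)
treated = df.loc[df["treatment"] == 1, "age"]
control = df.loc[df["treatment"] == 0, "age"]

pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)
smd = (treated.mean() - control.mean()) / pooled_sd
print(f"SMD for age: {smd:.3f}")  # |SMD| < 0.1 is commonly treated as balanced
```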

### Covariate Adjustment Methods

When treatment and control groups differ on covariates, adjust for those differences to reduce bias:

**Option 1: Propensity Score Weighting (Recommended)**

```python
analyzer = ExperimentAnalyzer(
    data=df,
    treatment_col="treatment",
    outcomes=["conversion", "revenue"],
    covariates=["age", "income", "is_member"],
    adjustment="balance",
    balance_method="ps-logistic",  # Logistic regression for propensity scores
    target_effect="ATT",  # Average Treatment Effect on Treated
)

analyzer.get_effects()

# Check post-adjustment balance
print(analyzer.adjusted_balance)

# Retrieve weights for transparency
weights_df = analyzer.weights
print(weights_df.head())
```

**Available methods:**
- `ps-logistic`: Propensity score via logistic regression (fast, interpretable)
- `ps-xgboost`: Propensity score via XGBoost (flexible, non-linear)
- `entropy`: Entropy balancing (exact moment matching)

**Target effects:**
- `ATT`: Average Treatment Effect on Treated (most common)
- `ATE`: Average Treatment Effect (entire population)
- `ATC`: Average Treatment Effect on Control
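
For intuition, this is a minimal sketch of what `ps-logistic` weighting does for an ATT target, written directly with scikit-learn rather than the package's internal code: fit a propensity model, then reweight control units toward the treated population. It assumes the quick-start DataFrame `df`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = df[["age", "is_member"]].to_numpy()
t = df["treatment"].to_numpy()

ps = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]
weights = np.where(t == 1, 1.0, ps / (1 - ps))  # treated: 1, control: p / (1 - p)

# The weighted control mean now approximates the treated group's counterfactual
weighted_control_mean = np.average(df.loc[t == 0, "revenue"], weights=weights[t == 0])
print(weighted_control_mean)
```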

**Option 2: Regression Adjustment**

```python
analyzer = ExperimentAnalyzer(
    data=df,
    treatment_col="treatment",
    outcomes=["conversion"],
    regression_covariates=["age", "income"],  # Use regression_covariates
    adjustment=None,  # No weighting, just regression
)

analyzer.get_effects()
```

**Option 3: IPW + Regression (Combined)**

Use both propensity score weighting and regression covariates for extra robustness:

```python
analyzer = ExperimentAnalyzer(
    data=df,
    treatment_col="treatment",
    outcomes=["conversion", "revenue"],
    covariates=["age", "income", "is_member"],
    adjustment="balance",
    regression_covariates=["age", "income"],  # Also include in regression
    target_effect="ATE",
)

analyzer.get_effects()
```

**Option 4: Doubly Robust / AIPW**

Augmented Inverse Probability Weighting is consistent if either the propensity score model or the outcome model is correctly specified. Available for OLS, logistic, Poisson, and negative binomial models:

```python
analyzer = ExperimentAnalyzer(
    data=df,
    treatment_col="treatment",
    outcomes=["revenue"],
    covariates=["age", "income", "is_member"],
    adjustment="aipw",
    target_effect="ATE",
)

analyzer.get_effects()

# AIPW results include influence-function based standard errors
print(analyzer.results[["outcome", "absolute_effect", "standard_error", "pvalue"]])
```

AIPW works by fitting separate outcome models for treated and control groups, predicting potential outcomes for all units, and combining them with IPW via the augmented influence function. Standard errors are derived from the influence function, making them robust without requiring bootstrap.

> **Note**: AIPW is not supported for Cox survival models due to the complexity of survival-specific doubly robust methods. For Cox models, use IPW + Regression instead.
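
For reference, the ATE form of the AIPW estimator can be sketched as follows (a conceptual illustration using scikit-learn on the quick-start DataFrame `df`; the package's implementation details, such as the propensity and outcome model choices, may differ):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = df[["age", "is_member"]].to_numpy()
t = df["treatment"].to_numpy()
y = df["revenue"].to_numpy()

ps = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]
mu1 = LinearRegression().fit(X[t == 1], y[t == 1]).predict(X)  # E[Y | X, T=1]
mu0 = LinearRegression().fit(X[t == 0], y[t == 0]).predict(X)  # E[Y | X, T=0]

# Augmented terms for each potential outcome, then average their difference
po1 = mu1 + t * (y - mu1) / ps
po0 = mu0 + (1 - t) * (y - mu0) / (1 - ps)
ate_aipw = np.mean(po1 - po0)
print(ate_aipw)
```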

### Outcome Models

By default, all outcomes are analyzed with OLS. Use `outcome_models` to specify different model types:

**Logistic regression (binary outcomes)**

```python
analyzer = ExperimentAnalyzer(
    data=df,
    treatment_col="treatment",
    outcomes=["converted", "churned"],
    outcome_models="logistic",  # Apply to all outcomes
    covariates=["age", "tenure"],
)

analyzer.get_effects()

# By default, results report marginal effects (probability change in percentage points)
# Use compute_marginal_effects=False for odds ratios instead
```

**Poisson / Negative binomial (count outcomes)**

```python
analyzer = ExperimentAnalyzer(
    data=df,
    treatment_col="treatment",
    outcomes=["orders", "page_views"],
    outcome_models="poisson",  # or "negative_binomial" for overdispersed counts
    covariates=["age", "tenure"],
)

analyzer.get_effects()

# Results report change in expected count (marginal effects) by default
# Use compute_marginal_effects=False for rate ratios
```

**Mixed models per outcome**

```python
analyzer = ExperimentAnalyzer(
    data=df,
    treatment_col="treatment",
    outcomes=["revenue", "converted", "orders"],
    outcome_models={
        "revenue": "ols",
        "converted": "logistic",
        "orders": ["poisson", "negative_binomial"],  # Compare both
    },
    covariates=["age"],
)

analyzer.get_effects()

# Results include model_type column to distinguish
print(analyzer.results[["outcome", "model_type", "absolute_effect", "pvalue"]])
```

**Marginal effects options**

```python
# Average Marginal Effect (default) - recommended
analyzer = ExperimentAnalyzer(..., compute_marginal_effects="overall")

# Marginal Effect at the Mean
analyzer = ExperimentAnalyzer(..., compute_marginal_effects="mean")

# Odds ratios / rate ratios instead of marginal effects
analyzer = ExperimentAnalyzer(..., compute_marginal_effects=False)
```

| `compute_marginal_effects` | Logistic output | Poisson/NB output |
|---|---|---|
| `"overall"` (default) | Probability change (pp) | Change in expected count |
| `"mean"` | Probability change at mean | Count change at mean |
| `False` | Odds ratio | Rate ratio |

### Survival Analysis (Cox Models)

Analyze time-to-event outcomes using Cox proportional hazards:

```python
from experiment_utils.experiment_analyzer import ExperimentAnalyzer

# Specify Cox outcomes as tuples: (time_col, event_col)
analyzer = ExperimentAnalyzer(
    data=df,
    treatment_col="treatment",
    outcomes=[("time_to_event", "event_occurred")],
    outcome_models="cox",
    covariates=["age", "income"],
)

analyzer.get_effects()

# Results report log(HR) as absolute_effect and HR as relative_effect
print(analyzer.results[["outcome", "absolute_effect", "relative_effect", "pvalue"]])
```

**Cox with regression adjustment**

```python
analyzer = ExperimentAnalyzer(
    data=df,
    treatment_col="treatment",
    outcomes=[("survival_time", "died")],
    outcome_models="cox",
    covariates=["age", "comorbidity_score"],
    regression_covariates=["age", "comorbidity_score"],
)

analyzer.get_effects()
```

**Cox with IPW + Regression (recommended for confounded data)**

```python
analyzer = ExperimentAnalyzer(
    data=df,
    treatment_col="treatment",
    outcomes=[("survival_time", "died")],
    outcome_models="cox",
    covariates=["age", "comorbidity_score"],
    adjustment="balance",
    regression_covariates=["age", "comorbidity_score"],  # Include both
    target_effect="ATE",
)

analyzer.get_effects()
```

> **Note**: IPW alone for Cox models estimates the marginal hazard ratio, which differs from the conditional HR due to non-collapsibility. The package will warn you if you use IPW without regression covariates. See [Non-Collapsibility](#non-collapsibility-of-hazard-and-odds-ratios) for details.

**Alternative: separate event_col parameter**

```python
# Equivalent to tuple notation
analyzer = ExperimentAnalyzer(
    data=df,
    treatment_col="treatment",
    outcomes=["survival_time"],
    outcome_models="cox",
    event_col="died",  # Applies to all outcomes
)
```

**Bootstrap for survival models**

Bootstrap can be slow for Cox models with low event rates. Use `skip_bootstrap_for_survival` to fall back to robust standard errors:

```python
analyzer = ExperimentAnalyzer(
    data=df,
    treatment_col="treatment",
    outcomes=[("survival_time", "died")],
    outcome_models="cox",
    bootstrap=True,
    skip_bootstrap_for_survival=True,  # Use Cox robust SEs instead
)
```

### Bootstrap Inference

Get robust confidence intervals and p-values via bootstrapping:

```python
analyzer = ExperimentAnalyzer(
    data=df,
    treatment_col="treatment",
    outcomes=["conversion"],
    covariates=["age", "income"],
    adjustment="balance",
    bootstrap=True,
    bootstrap_iterations=2000,
    bootstrap_ci_method="percentile",
    bootstrap_seed=42,  # For reproducibility
)

analyzer.get_effects()

# Bootstrap results include robust CIs
results = analyzer.results
print(results[["outcome", "absolute_effect", "abs_effect_lower", 
               "abs_effect_upper", "inference_method"]])
```

**When to use bootstrap:**
- Small sample sizes
- Non-normal distributions
- Skepticism about asymptotic assumptions
- Want robust, distribution-free inference
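
For intuition, a percentile bootstrap CI for a plain difference in means looks like this (a standalone sketch using the quick-start DataFrame `df`; not the package's internal resampling code):

```python
import numpy as np

rng = np.random.default_rng(42)
treated = df.loc[df["treatment"] == 1, "conversion"].to_numpy()
control = df.loc[df["treatment"] == 0, "conversion"].to_numpy()

# Resample each group with replacement and record the difference in means
boot_diffs = np.array([
    rng.choice(treated, treated.size, replace=True).mean()
    - rng.choice(control, control.size, replace=True).mean()
    for _ in range(2000)
])
ci_lower, ci_upper = np.percentile(boot_diffs, [2.5, 97.5])
print(f"95% percentile CI: [{ci_lower:.4f}, {ci_upper:.4f}]")
```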

### Multiple Experiments

Analyze multiple experiments simultaneously:

```python
# Data with multiple experiments
df = pd.DataFrame({
    "experiment": ["exp_A", "exp_A", "exp_B", "exp_B"] * 100,
    "treatment": [0, 1, 0, 1] * 100,
    "outcome": np.random.randn(400),
    "age": np.random.normal(35, 10, 400),
})

analyzer = ExperimentAnalyzer(
    data=df,
    treatment_col="treatment",
    outcomes=["outcome"],
    experiment_identifier="experiment",  # Group by experiment
    covariates=["age"],
)

analyzer.get_effects()

# Results include experiment column
results = analyzer.results
print(results.groupby("experiment")[["absolute_effect", "pvalue"]].first())

# Balance per experiment (automatically calculated)
balance = analyzer.balance
print(balance.groupby("experiment")["balance_flag"].mean())
```

### Categorical Treatment Variables

Compare multiple treatment variants:

```python
df = pd.DataFrame({
    "treatment": np.random.choice(["control", "variant_A", "variant_B"], 1000),
    "outcome": np.random.randn(1000),
})

analyzer = ExperimentAnalyzer(
    data=df,
    treatment_col="treatment",
    outcomes=["outcome"],
)

analyzer.get_effects()

# Results show all pairwise comparisons
results = analyzer.results
print(results[["treatment_group", "control_group", "absolute_effect", "pvalue"]])
```

### Instrumental Variables (IV)

When treatment assignment is confounded (e.g., non-compliance in an experiment), use an instrument -- a variable that affects treatment receipt but only affects the outcome through treatment:

```python
import numpy as np
import pandas as pd
from experiment_utils.experiment_analyzer import ExperimentAnalyzer

# Simulate encouragement design with non-compliance
np.random.seed(42)
n = 5000
Z = np.random.binomial(1, 0.5, n)            # Random encouragement (instrument)
U = np.random.normal(0, 1, n)                 # Unobserved confounder
D = np.random.binomial(1, 1 / (1 + np.exp(-(-1 + 0.5 * U + 2.5 * Z))))  # Actual treatment (confounded)
Y = 2.0 * D + 1.0 * U + np.random.normal(0, 1, n)  # Outcome (true LATE = 2.0)

df = pd.DataFrame({"encouragement": Z, "treatment": D, "outcome": Y})

# IV estimation using encouragement as instrument for treatment
analyzer = ExperimentAnalyzer(
    data=df,
    treatment_col="treatment",
    outcomes=["outcome"],
    instrument_col="encouragement",
    adjustment="IV",
)

analyzer.get_effects()
print(analyzer.results[["outcome", "absolute_effect", "standard_error", "pvalue"]])
```

**IV with covariates:**

```python
analyzer = ExperimentAnalyzer(
    data=df,
    treatment_col="treatment",
    outcomes=["outcome"],
    instrument_col="encouragement",
    adjustment="IV",
    covariates=["age", "region"],  # Balance checked on instrument
)

analyzer.get_effects()
```

**Key assumptions for valid IV estimation:**
- **Relevance**: The instrument must be correlated with treatment (check first-stage F-statistic)
- **Exclusion restriction**: The instrument affects the outcome *only* through treatment
- **Independence**: The instrument is independent of unobserved confounders (holds by design in randomized encouragement)

> **Note**: IV estimation is only supported for OLS outcome models. For other model types (logistic, Cox, etc.), the analyzer will fall back to unadjusted estimation with a warning.
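
To probe the relevance assumption yourself, you can run the first-stage regression of treatment on the instrument with statsmodels (a sketch outside the package, using the simulated `df` above):

```python
import statsmodels.api as sm

# First stage: regress treatment receipt on the instrument (plus any covariates)
first_stage = sm.OLS(df["treatment"], sm.add_constant(df["encouragement"])).fit()
print(first_stage.fvalue)  # rule of thumb: F > 10 suggests a reasonably strong instrument
```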

### Multiple Comparison Adjustments

Control family-wise error rate when testing multiple hypotheses:

```python
analyzer = ExperimentAnalyzer(
    data=df,
    treatment_col="treatment",
    outcomes=["conversion", "revenue", "retention", "engagement"],
)

analyzer.get_effects()

# Apply Bonferroni correction
analyzer.adjust_pvalues(method="bonferroni")

results = analyzer.results
print(results[["outcome", "pvalue", "pvalue_mcp", "stat_significance_mcp"]])
```

**Available methods:**
- `bonferroni`: Most conservative, controls FWER
- `holm`: Less conservative than Bonferroni, still controls FWER
- `sidak`: Similar to Bonferroni, assumes independence
- `fdr_bh`: Benjamini-Hochberg FDR control (less conservative)
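
To build intuition for how these corrections differ, you can apply them to a hypothetical set of raw p-values with statsmodels (illustration only; in practice use `adjust_pvalues` on the analyzer):

```python
from statsmodels.stats.multitest import multipletests

raw_pvalues = [0.010, 0.030, 0.040, 0.200]  # hypothetical p-values for four outcomes
for method in ["bonferroni", "holm", "sidak", "fdr_bh"]:
    reject, adjusted, _, _ = multipletests(raw_pvalues, alpha=0.05, method=method)
    print(f"{method:>10}: adjusted={adjusted.round(3)} reject={reject}")
```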

### Non-Inferiority Testing

Test if a new treatment is "not worse" than control:

```python
analyzer = ExperimentAnalyzer(
    data=df,
    treatment_col="treatment",
    outcomes=["conversion"],
)

analyzer.get_effects()

# Test if treatment is within 10% of control
analyzer.test_non_inferiority(relative_margin=0.10)

results = analyzer.results
print(results[["outcome", "relative_effect", "is_non_inferior", 
               "non_inferiority_margin"]])
```
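
A common decision rule behind this kind of test (shown here as a sketch; the package's exact criterion may differ) is that the treatment is declared non-inferior when the lower bound of the relative-effect confidence interval stays above the negative margin:

```python
# Hypothetical manual check mirroring a typical non-inferiority rule
margin = 0.10
results = analyzer.results
manual_non_inferior = results["rel_effect_lower"] > -margin
print(results[["outcome", "rel_effect_lower"]].assign(manual_non_inferior=manual_non_inferior))
```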

### Combining Effects (Meta-Analysis)

When you have multiple experiments or segments, pool results using fixed-effects meta-analysis or weighted averaging.

**Fixed-effects meta-analysis (`combine_effects`)**

Combines effect estimates using inverse-variance weighting, producing a pooled effect with proper standard errors:

```python
from experiment_utils.experiment_analyzer import ExperimentAnalyzer

# Analyze multiple experiments
analyzer = ExperimentAnalyzer(
    data=df,
    treatment_col="treatment",
    outcomes=["conversion"],
    experiment_identifier="experiment",
    covariates=["age"],
)

analyzer.get_effects()

# Pool results across experiments using fixed-effects meta-analysis
pooled = analyzer.combine_effects(grouping_cols=["outcome"])
print(pooled[["outcome", "experiments", "absolute_effect", "standard_error", "pvalue"]])
```
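
For intuition, fixed-effects (inverse-variance) pooling combines per-experiment estimates like this (a standalone sketch with hypothetical numbers, not the package's code):

```python
import numpy as np

effects = np.array([0.020, 0.035])   # hypothetical per-experiment effects
ses = np.array([0.008, 0.012])       # hypothetical standard errors

weights = 1 / ses**2                                         # inverse-variance weights
pooled_effect = np.sum(weights * effects) / np.sum(weights)  # precision-weighted average
pooled_se = np.sqrt(1 / np.sum(weights))
print(f"Pooled effect: {pooled_effect:.4f} (SE {pooled_se:.4f})")
```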

**Custom grouping:**

```python
# Pool by outcome and region (e.g., combine experiments within each region)
pooled_by_region = analyzer.combine_effects(grouping_cols=["region", "outcome"])
print(pooled_by_region)
```

**Weighted average aggregation (`aggregate_effects`)**

A simpler alternative that weights by treatment group size (useful for quick summaries, but `combine_effects` provides better standard error estimates):

```python
aggregated = analyzer.aggregate_effects(grouping_cols=["outcome"])
print(aggregated[["outcome", "experiments", "absolute_effect", "pvalue"]])
```

### Retrodesign Analysis

Assess reliability of significant results (post-hoc power analysis):

```python
analyzer = ExperimentAnalyzer(
    data=df,
    treatment_col="treatment",
    outcomes=["conversion"],
)

analyzer.get_effects()

# Calculate Type S and Type M errors assuming true effect is 0.02
retro = analyzer.calculate_retrodesign(true_effect=0.02)

print(retro[["outcome", "power", "type_s_error", "type_m_error", "relative_bias"]])
```

**Metrics explained:**
- `power`: Probability of detecting the assumed true effect
- `type_s_error`: Probability of wrong sign when significant (if underpowered)
- `type_m_error`: Expected exaggeration ratio (mean |observed|/|true|)
- `relative_bias`: Expected bias ratio preserving signs (mean observed/true)
  - Typically lower than type_m_error because wrong signs partially offset overestimates
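
These quantities follow the retrodesign logic of Gelman & Carlin: given an assumed true effect and the standard error of your estimate, ask what filtering on statistical significance does to sign and magnitude. A minimal sketch of that calculation, assuming a normal sampling distribution (not the package's exact code):

```python
import numpy as np
from scipy import stats

true_effect, se, alpha = 0.02, 0.01, 0.05       # hypothetical inputs
z = stats.norm.ppf(1 - alpha / 2)
lam = true_effect / se

power = 1 - stats.norm.cdf(z - lam) + stats.norm.cdf(-z - lam)
type_s = stats.norm.cdf(-z - lam) / power        # P(wrong sign | significant)

# Type M (exaggeration) via simulation of significant estimates
draws = np.random.default_rng(42).normal(true_effect, se, 1_000_000)
significant = draws[np.abs(draws / se) > z]
type_m = np.mean(np.abs(significant)) / abs(true_effect)
print(f"power={power:.2f}, type_s={type_s:.4f}, type_m={type_m:.2f}")
```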

## Power Analysis

Design well-powered experiments using simulation-based power analysis.

### Calculate Power

Estimate statistical power for a given sample size:

```python
from experiment_utils.power_sim import PowerSim

# Initialize power simulator for proportion metric
power_sim = PowerSim(
    metric="proportion",      # or "average" for continuous outcomes
    relative_effect=False,    # False = absolute effect, True = relative
    variants=1,               # Number of treatment variants
    nsim=1000,               # Number of simulations
    alpha=0.05,              # Significance level
    alternative="two-tailed" # or "one-tailed"
)

# Calculate power
power_result = power_sim.get_power(
    baseline=[0.10],          # Control conversion rate
    effect=[0.02],           # Absolute effect size (2pp lift)
    sample_size=[5000]       # Total sample size
)

print(f"Power: {power_result['power'].iloc[0]:.2%}")
```

**Example: Multiple variants**

```python
# Compare 2 treatments vs control
power_sim = PowerSim(metric="proportion", variants=2, nsim=1000)

power_result = power_sim.get_power(
    baseline=0.10,
    effect=[0.02, 0.03],  # Different effects for each variant
    sample_size=6000
)

print(power_result[["comparison", "power"]])
```

### Power from Real Data

When your data doesn't follow standard parametric assumptions, estimate power by bootstrapping directly from observed data using `get_power_from_data()`. Instead of generating synthetic data from a distribution, it repeatedly samples from your actual dataset and injects the specified effect:

```python
from experiment_utils.power_sim import PowerSim
import pandas as pd

# Use real data for power estimation
power_sim = PowerSim(metric="average", variants=1, nsim=1000)

power_result = power_sim.get_power_from_data(
    df=historical_data,          # Your actual dataset
    metric_col="revenue",        # Column to test
    sample_size=5000,            # Sample size per group
    effect=3.0,                  # Effect to inject (absolute)
)

print(f"Power: {power_result['power'].iloc[0]:.2%}")
```
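
Conceptually, the procedure looks roughly like this (a simplified sketch with a synthetic stand-in for `historical_data`; not the package's implementation):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
observed = rng.lognormal(mean=3.5, sigma=1.0, size=20_000)  # stand-in for real revenue data
effect, n, nsim, alpha = 3.0, 5000, 500, 0.05

significant = 0
for _ in range(nsim):
    control = rng.choice(observed, n, replace=True)
    treated = rng.choice(observed, n, replace=True) + effect  # inject the assumed effect
    if stats.ttest_ind(treated, control).pvalue < alpha:
        significant += 1
print(f"Approximate power: {significant / nsim:.2%}")
```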

**When to use `get_power_from_data` vs `get_power`:**
- Use `get_power_from_data` when your metric has a non-standard distribution (heavy tails, skewed, zero-inflated)
- Use `get_power` for standard parametric scenarios (proportions, means, counts)

**With compliance:**

```python
# Account for 80% compliance
power_result = power_sim.get_power_from_data(
    df=historical_data,
    metric_col="revenue",
    sample_size=5000,
    effect=3.0,
    compliance=0.80,
)
```

### Grid Power Simulation

Explore power across a grid of parameter combinations using `grid_sim_power()`. This is useful for understanding how power varies with sample size, effect size, and baseline rates:

```python
from experiment_utils.power_sim import PowerSim

power_sim = PowerSim(metric="proportion", variants=1, nsim=1000)

# Simulate power across a grid of scenarios
grid_results = power_sim.grid_sim_power(
    baseline_rates=[0.05, 0.10, 0.15],
    effects=[0.02, 0.03, 0.05],
    sample_sizes=[1000, 2000, 5000, 10000],
    plot=True,  # Generate power curves
)

print(grid_results.head())
```

**With multiple variants and custom compliance:**

```python
power_sim = PowerSim(metric="average", variants=2, nsim=1000)

grid_results = power_sim.grid_sim_power(
    baseline_rates=[50.0],
    effects=[2.0, 5.0],
    sample_sizes=[500, 1000, 2000, 5000],
    standard_deviations=[[20.0]],
    compliances=[[0.8]],
    threads=4,        # Parallelize across scenarios
    plot=True,
)
```

The output DataFrame includes all input parameters alongside the estimated power for each comparison, making it easy to filter and compare scenarios.

### Find Sample Size

Find the minimum sample size needed to achieve target power:

```python
from experiment_utils.power_sim import PowerSim

power_sim = PowerSim(metric="proportion", variants=1, nsim=1000)

# Find sample size for 80% power
sample_result = power_sim.find_sample_size(
    target_power=0.80,
    baseline=0.10,
    effect=0.02
)

print(f"Required sample size: {sample_result['total_sample_size'].iloc[0]:,.0f}")
print(f"Achieved power: {sample_result['achieved_power_by_comparison'].iloc[0]:.2%}")
```

**Different power targets per comparison:**

```python
# Primary outcome needs 90%, secondary needs 80%
power_sim = PowerSim(metric="proportion", variants=2, nsim=1000)

sample_result = power_sim.find_sample_size(
    target_power={(0,1): 0.90, (0,2): 0.80},
    baseline=0.10,
    effect=[0.05, 0.03]
)

print(sample_result[["comparison", "sample_size_by_group", "achieved_power"]])
```

**Optimize allocation ratio:**

```python
# Find optimal allocation to minimize total sample size
sample_result = power_sim.find_sample_size(
    target_power=0.80,
    baseline=0.10,
    effect=0.05,
    optimize_allocation=True
)

print(f"Optimal allocation: {sample_result['allocation_ratio'].iloc[0]}")
print(f"Total sample size: {sample_result['total_sample_size'].iloc[0]:,.0f}")
```

**Custom allocation:**

```python
# 30% control, 70% treatment
sample_result = power_sim.find_sample_size(
    target_power=0.80,
    baseline=0.10,
    effect=0.02,
    allocation_ratio=[0.3, 0.7]
)
```

### Simulate Retrodesign

Prospective analysis of Type S (sign) and Type M (magnitude) errors:

```python
from experiment_utils.power_sim import PowerSim

power_sim = PowerSim(metric="proportion", variants=1, nsim=5000)

# Simulate underpowered study
retro = power_sim.simulate_retrodesign(
    true_effect=0.02,
    sample_size=500,
    baseline=0.10
)

print(f"Power: {retro['power'].iloc[0]:.2%}")
print(f"Type S Error: {retro['type_s_error'].iloc[0]:.2%}")
print(f"Exaggeration Ratio: {retro['exaggeration_ratio'].iloc[0]:.2f}x")
print(f"Relative Bias: {retro['relative_bias'].iloc[0]:.2f}x")
```

**Understanding retrodesign metrics:**

| Metric | Description |
|--------|-------------|
| `power` | Probability of detecting the true effect |
| `type_s_error` | Probability of getting wrong sign when significant |
| `exaggeration_ratio` | Expected overestimation (mean &#124;observed&#124;/&#124;true&#124;) |
| `relative_bias` | Expected bias preserving signs (mean observed/true) <br> Lower than exaggeration_ratio because Type S errors partially cancel out overestimates |
| `median_significant_effect` | Median effect among significant results |
| `prop_overestimate` | % of significant results that overestimate |

**Compare power scenarios:**

```python
# Low power scenario
retro_low = power_sim.simulate_retrodesign(
    true_effect=0.02, sample_size=500, baseline=0.10
)

# High power scenario
retro_high = power_sim.simulate_retrodesign(
    true_effect=0.02, sample_size=5000, baseline=0.10
)

print(f"Low power - Exaggeration: {retro_low['exaggeration_ratio'].iloc[0]:.2f}x, "
      f"Relative bias: {retro_low['relative_bias'].iloc[0]:.2f}x")
print(f"High power - Exaggeration: {retro_high['exaggeration_ratio'].iloc[0]:.2f}x, "
      f"Relative bias: {retro_high['relative_bias'].iloc[0]:.2f}x")
```

**Multiple variants:**

```python
power_sim = PowerSim(metric="proportion", variants=3, nsim=5000)

retro = power_sim.simulate_retrodesign(
    true_effect=[0.02, 0.03, 0.04],  # Different effects per variant
    sample_size=1000,
    baseline=0.10,
    target_comparisons=[(0, 1), (0, 2)]
)

print(retro[["comparison", "power", "type_s_error", "exaggeration_ratio", "relative_bias"]])
```

## Utilities

### Balanced Random Assignment

Generate balanced treatment assignments with optional stratification:

```python
from experiment_utils.utils import balanced_random_assignment
import pandas as pd
import numpy as np

# Create sample data
np.random.seed(42)
users = pd.DataFrame({
    "user_id": range(1000),
    "age_group": np.random.choice(["18-25", "26-35", "36-45", "46+"], 1000),
    "region": np.random.choice(["North", "South", "East", "West"], 1000),
})

# Simple 50/50 split
users["treatment"] = balanced_random_assignment(
    users, 
    allocation_ratio=0.5,
    seed=42
)

print(users["treatment"].value_counts())
# Output: control: 500, test: 500
```

**Stratified assignment (ensure balance within subgroups):**

```python
# Balance within age_group and region strata
users["treatment_stratified"] = balanced_random_assignment(
    users,
    allocation_ratio=0.5,
    balance_covariates=["age_group", "region"],
    check_balance=True,  # Print balance diagnostics
    seed=42
)
```

**Multiple variants:**

```python
# Three variants with equal allocation
users["assignment"] = balanced_random_assignment(
    users,
    variants=["control", "variant_A", "variant_B"]
)

# Custom allocation ratios
users["assignment_custom"] = balanced_random_assignment(
    users,
    variants=["control", "variant_A", "variant_B"],
    allocation_ratio={"control": 0.5, "variant_A": 0.3, "variant_B": 0.2},
    balance_covariates=["age_group"]
)
```

**Parameters:**
- `allocation_ratio`: Float (for binary) or dict (for multiple variants)
- `balance_covariates`: List of columns to stratify by
- `check_balance`: If True, prints balance diagnostics
- `smd_threshold`: Threshold for balance flag (default 0.1)
- `seed`: Random seed for reproducibility

### Standalone Balance Checker

Check covariate balance on any dataset without using ExperimentAnalyzer:

```python
from experiment_utils.utils import check_covariate_balance
import pandas as pd
import numpy as np

# Create sample data with imbalance
np.random.seed(42)
n_treatment = 300
n_control = 200

df = pd.concat([
    pd.DataFrame({
        "treatment": [1] * n_treatment,
        "age": np.random.normal(40, 10, n_treatment),      # Older in treatment
        "income": np.random.normal(60000, 15000, n_treatment),  # Higher income
    }),
    pd.DataFrame({
        "treatment": [0] * n_control,
        "age": np.random.normal(30, 10, n_control),         # Younger in control
        "income": np.random.normal(45000, 15000, n_control),    # Lower income
    })
])

# Check balance
balance = check_covariate_balance(
    data=df,
    treatment_col="treatment",
    covariates=["age", "income"],
    threshold=0.1  # SMD threshold
)

print(balance)
```

Output:
```
  covariate  mean_treated  mean_control       smd  balance_flag
0       age         40.23         30.15  1.012345             0
1    income      59823.45      45234.12  0.923456             0
```

**With categorical variables:**

```python
df["region"] = np.random.choice(["North", "South", "East", "West"], len(df))

balance = check_covariate_balance(
    data=df,
    treatment_col="treatment",
    covariates=["age", "income", "region"],  # Automatic categorical detection
    threshold=0.1
)

# Region will be expanded to dummy variables
print(balance[balance["covariate"].str.contains("region")])
```

**Use cases:**
- Pre-experiment: Check if randomization worked
- Post-assignment: Validate treatment assignment quality
- Observational data: Assess comparability before adjustment
- Research: Standalone balance analysis for publications

## Advanced Topics

### When to Use Different Adjustment Methods

| Method | `adjustment` | `regression_covariates` | Best for |
|---|---|---|---|
| No adjustment | `None` | `None` | Well-randomized experiments |
| Regression | `None` | `["x1", "x2"]` | Variance reduction, simple confounding |
| IPW | `"balance"` | `None` | Many covariates, non-linear confounding |
| IPW + Regression | `"balance"` | `["x1", "x2"]` | Extra robustness, survival models |
| AIPW (doubly robust) | `"aipw"` | (automatic) | Best protection against misspecification |
| IV | `"IV"` | `None` or `["x1"]` | Non-compliance, endogenous treatment (requires `instrument_col`) |

**Choosing a balance method:**
- `ps-logistic`: Default, fast, interpretable
- `ps-xgboost`: Non-linear relationships, complex interactions
- `entropy`: Exact moment matching, but can be unstable with many covariates

**Choosing an outcome model:**

| Outcome type | Model | `outcome_models` |
|---|---|---|
| Continuous (revenue, time) | OLS | `"ols"` (default) |
| Binary (converted, churned) | Logistic | `"logistic"` |
| Count (orders, clicks) | Poisson | `"poisson"` |
| Overdispersed count | Negative binomial | `"negative_binomial"` |
| Time-to-event | Cox PH | `"cox"` |

### Non-Collapsibility of Hazard and Odds Ratios

When using IPW without regression covariates for Cox or logistic models, the estimated effect may differ from the conditional effect even with perfect covariate balancing. This is not a bug -- it reflects a fundamental property called **non-collapsibility**.

**What happens**: IPW creates a pseudo-population where treatment is independent of covariates, then fits a model without covariates. This estimates the **marginal** effect (population-average). For non-collapsible measures like hazard ratios and odds ratios, the marginal effect differs from the conditional effect.

**When it matters**: The gap increases with stronger covariate effects on the outcome. For Cox models the effect is typically larger than for logistic models.

**Recommendations**:
- For Cox models: use **regression adjustment** or **IPW + Regression** to recover the conditional HR
- For logistic models: the default marginal effects output (probability change) is collapsible, so this mainly affects odds ratios (`compute_marginal_effects=False`)
- For OLS: no issue (mean differences are collapsible)
- AIPW estimates are on the marginal scale but are doubly robust

The package warns when IPW is used without regression covariates for Cox models.
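
A quick simulation makes the phenomenon concrete for odds ratios (a standalone sketch with statsmodels, outside the package): even with perfect randomization, the marginal OR is attenuated relative to the conditional OR when a strong prognostic covariate is omitted.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200_000
treat = rng.binomial(1, 0.5, n)                 # perfectly randomized treatment
x = rng.normal(0, 1, n)                         # strong prognostic covariate
p = 1 / (1 + np.exp(-(-1 + 1.0 * treat + 2.0 * x)))
y = rng.binomial(1, p)

conditional = sm.Logit(y, sm.add_constant(np.column_stack([treat, x]))).fit(disp=0)
marginal = sm.Logit(y, sm.add_constant(treat)).fit(disp=0)
print(f"Conditional OR: {np.exp(conditional.params[1]):.2f}")  # ~ exp(1.0) = 2.72
print(f"Marginal OR:    {np.exp(marginal.params[1]):.2f}")     # noticeably smaller
```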

### Handling Missing Data

The package handles missing data automatically:

- **Treatment variable**: Rows with missing treatment are dropped (logged as warning)
- **Categorical covariates**: Missing values become explicit "Missing" category
- **Numeric covariates**: Mean imputation
- **Binary covariates**: Mode imputation

```python
analyzer = ExperimentAnalyzer(
    data=df,  # Can contain missing values
    treatment_col="treatment",
    outcomes=["conversion"],
    covariates=["age", "region"],
)
# Missing data is handled automatically
analyzer.get_effects()
```

### Best Practices

**1. Always check balance:**

```python
analyzer = ExperimentAnalyzer(data=df, treatment_col="treatment", 
                              outcomes=["conversion"], covariates=["age", "income"])

analyzer.get_effects()

# Check balance from results
balance = analyzer.balance
if balance["balance_flag"].mean() < 0.8:  # <80% balanced
    print("Consider rerunning with covariate adjustment")
```

**2. Use bootstrap for small samples:**

```python
if len(df) < 500:
    analyzer = ExperimentAnalyzer(..., bootstrap=True, bootstrap_iterations=2000)
```

**3. Apply multiple comparison correction:**

```python
# Always correct when testing multiple outcomes/experiments
analyzer.get_effects()
analyzer.adjust_pvalues(method="holm")  # Less conservative than Bonferroni
```

**4. Report both absolute and relative effects:**

```python
results = analyzer.results
print(results[["outcome", "absolute_effect", "relative_effect", 
               "abs_effect_lower", "abs_effect_upper"]])
```

**5. Check sensitivity with retrodesign:**

```python
# After finding significant result, check reliability
retro = analyzer.calculate_retrodesign(true_effect=0.01)
if retro["type_m_error"].iloc[0] > 2:
    print("Warning: Results may be exaggerated")
```

### Common Workflows

**Pre-experiment: Sample size calculation**

```python
from experiment_utils.power_sim import PowerSim

# Determine required sample size
power_sim = PowerSim(metric="proportion", variants=1, nsim=1000)
result = power_sim.find_sample_size(
    target_power=0.80,
    baseline=0.10,
    effect=0.02
)
print(f"Need {result['total_sample_size'].iloc[0]:,.0f} users")
```

**During experiment: Balance check**

```python
from experiment_utils.utils import check_covariate_balance

# Check if randomization worked
balance = check_covariate_balance(
    data=experiment_df,
    treatment_col="treatment",
    covariates=["age", "region", "tenure"]
)
print(f"Balance: {balance['balance_flag'].mean():.1%}")
```

**Post-experiment: Analysis**

```python
from experiment_utils.experiment_analyzer import ExperimentAnalyzer

# Full analysis pipeline
analyzer = ExperimentAnalyzer(
    data=df,
    treatment_col="treatment",
    outcomes=["primary_metric", "secondary_metric"],
    covariates=["age", "region"],
    adjustment="balance",
    bootstrap=True,
)

analyzer.get_effects()
analyzer.adjust_pvalues(method="holm")

# Report
results = analyzer.results
print(results[["outcome", "absolute_effect", "relative_effect", 
               "pvalue_mcp", "stat_significance_mcp"]])
```

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is licensed under the MIT License.

## Citation

If you use this package in your research, please cite:

```bibtex
@software{experiment_utils_pd,
  title = {Experiment Utils PD: A Python Package for Experiment Analysis},
  author = {Sebastian Daza},
  year = {2026},
  url = {https://github.com/sdaza/experiment-utils-pd}
}
```

