Metadata-Version: 2.4
Name: ValidMLInference
Version: 1.2.0
Summary: This package implements bias correction methods for models estimated using synthetic data
Author-email: Konrad Kurczynski <konrad.kurczynski@yale.edu>, Timothy Christensen <timothy.christensen@yale.edu>
License: MIT
Project-URL: Homepage, https://github.com/KonradKurczynski/ValidMLInference
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy
Requires-Dist: scipy
Requires-Dist: jax
Requires-Dist: jaxopt
Requires-Dist: numdifftools
Requires-Dist: patsy
Requires-Dist: pandas
Dynamic: license-file

# ValidMLInference

`ValidMLInference` is a Python package for correcting bias and performing valid inference in regressions that include variables generated by AI/ML methods. The bias-correction methods are described in [Battaglia, Christensen, Hansen & Sacher (2024)](https://arxiv.org/abs/2402.15585). 

## Requirements and installation

`ValidMLInference` runs on Python 3.8 or later and depends on the following packages: `numpy`, `scipy`, `jax`, `jaxopt`, `numdifftools`, `patsy`, and `pandas`.

To install the package, run 
```
pip install ValidMLInference
```
in your terminal. 

## Using ValidMLInference

To get started, we recommend looking at the following examples and resources: 
1. [**Remote Work**](https://github.com/KonradKurczynski/ValidMLInference/blob/main/remote_work.ipynb): This notebook estimates the association between working from home and salaries using real-world job postings data [(Hansen et al., 2023)](https://dx.doi.org/10.2139/ssrn.4380734). It illustrates how the functions `ols_bca`, `ols_bcm` and `one_step` can be used to correct bias from regressing on AI/ML-generated labels. The notebook reproduces results from Table 1 of [Battaglia, Christensen, Hansen & Sacher (2024)](https://arxiv.org/abs/2402.15585).
2. [**Topic Models**](https://github.com/KonradKurczynski/ValidMLInference/blob/main/topic_model_example.ipynb): This notebook estimates the association between CEO time allocation and firm performance [(Bandiera et al., 2020)](https://doi.org/10.1086/705331). It illustrates how the functions `ols_bca_topic` and `ols_bcm_topic` can be used to correct bias from estimated topic model shares. The notebook reproduces results from Table 2 of [Battaglia, Christensen, Hansen & Sacher (2024)](https://arxiv.org/abs/2402.15585).
3. [**Synthetic Example**](https://github.com/KonradKurczynski/ValidMLInference/blob/main/synthetic_example.ipynb): A synthetic example comparing the performance of different bias-correction methods in the context of AI/ML-generated labels.
4. [**Functionality**](https://github.com/KonradKurczynski/ValidMLInference/blob/main/functionality.md): A detailed reference describing all available functions, optional arguments, and usage tips.

## Quickstart 
The code below compares the coefficients obtained by ordinary least squares with those obtained by the `one_step` approach when the regressor is subject to classification error. In this example, the 95% confidence interval produced by `one_step` contains the true parameter value of 2, whereas the interval produced by standard `ols` does not.

```python
import numpy as np
import pandas as pd
from ValidMLInference import ols, one_step

# Set random seed for reproducibility
np.random.seed(42)

# Generate synthetic data with mislabeling
n = 1000
true_effect = 2.0

# True treatment assignment
X_true = np.random.binomial(1, 0.5, n)

# Observed (mislabeled) treatment with 20% error rate
mislabel_prob = 0.2
X_obs = X_true.copy()
mislabel_mask = np.random.binomial(1, mislabel_prob, n).astype(bool)
X_obs[mislabel_mask] = 1 - X_obs[mislabel_mask]

# Generate outcome with true treatment effect
Y = 1.0 + true_effect * X_true + np.random.normal(0, 1, n)

# Create DataFrame
data = pd.DataFrame({'Y': Y, 'X_obs': X_obs})

# Naive OLS using mislabeled data
ols_result = ols(formula="Y ~ X_obs", data=data)
print("OLS Results (using mislabeled data):")
print(ols_result.summary())

# One-step estimator that corrects for mislabeling
one_step_result = one_step(formula="Y ~ X_obs", data=data)
print("\nOne-Step Results (correcting for mislabeling):")
print(one_step_result.summary())

ols_ci = ols_result.summary().loc['X_obs', ['2.5%', '97.5%']]
one_step_ci = one_step_result.summary().loc['X_obs', ['2.5%', '97.5%']]

print(f"\nTrue treatment effect: {true_effect}")
print(f"OLS 95% CI contains true value: {ols_ci['2.5%'] <= true_effect <= ols_ci['97.5%']}")
print(f"One-step 95% CI contains true value: {one_step_ci['2.5%'] <= true_effect <= one_step_ci['97.5%']}")
```

    OLS Results (using mislabeled data):
               Estimate  Std. Error    z value  P>|z|      2.5%     97.5%
    Intercept  1.392265    0.055828  24.938313    0.0  1.282843  1.501687
    X_obs      1.207589    0.078643  15.355267    0.0  1.053451  1.361727
    
    One-Step Results (correcting for mislabeling):
               Estimate  Std. Error    z value  P>|z|      2.5%     97.5%
    X_obs      1.828638    0.108976  16.780127    0.0  1.615048  2.042228
    Intercept  1.092510    0.107082  10.202534    0.0  0.882633  1.302387
    
    True treatment effect: 2.0
    OLS 95% CI contains true value: False
    One-step 95% CI contains true value: True
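
For intuition about why the naive OLS slope above lands near 1.2 rather than 2, the sketch below uses only `numpy` and does not call `ValidMLInference`. It assumes the same setup as the quickstart: a balanced binary regressor whose labels are flipped independently with probability 0.2. Under those assumptions the naive slope is attenuated by a factor of `1 - 2 * mislabel_prob`. The variable names, seed, and sample size are illustrative choices, not part of the package.

```python
import numpy as np

# Illustrative setup mirroring the quickstart (seed and sample size are arbitrary).
rng = np.random.default_rng(0)
n = 200_000
true_effect = 2.0
mislabel_prob = 0.2

# True binary regressor with P(X = 1) = 0.5.
x_true = rng.binomial(1, 0.5, n)

# Flip each label independently with probability mislabel_prob.
flip = rng.random(n) < mislabel_prob
x_obs = np.where(flip, 1 - x_true, x_true)

# The outcome depends on the true regressor, not the observed one.
y = 1.0 + true_effect * x_true + rng.normal(0.0, 1.0, n)

# With a single binary regressor, the OLS slope equals the difference in group means.
naive_slope = y[x_obs == 1].mean() - y[x_obs == 0].mean()

# Attenuation implied by symmetric mislabeling of a balanced binary regressor.
predicted_slope = true_effect * (1 - 2 * mislabel_prob)

print(f"Naive slope on mislabeled regressor: {naive_slope:.3f}")
print(f"Slope predicted by attenuation:      {predicted_slope:.3f}")
```

The naive slope should come out close to `2.0 * (1 - 2 * 0.2) = 1.2`, in line with the OLS estimate in the quickstart output. This simple closed form relies on the balanced, symmetric design; it is meant as intuition only and is not how the package computes its corrections.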
