Metadata-Version: 2.4
Name: dml-dev
Version: 0.1.1
Summary: DoubleML build, estimation, plotting, and utility pipelines.
Author: DML Pipeline Contributors
Keywords: administrative-data,causal-inference,doubleml,observational-data,program-evaluation
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering
Requires-Python: >=3.12
Description-Content-Type: text/markdown
Requires-Dist: doubleml
Requires-Dist: joblib
Requires-Dist: oi-tools[figures]
Requires-Dist: plotnine
Requires-Dist: polars
Requires-Dist: pyarrow
Requires-Dist: psutil
Requires-Dist: PyYAML
Requires-Dist: scikit-learn
Requires-Dist: threadpoolctl
Provides-Extra: dev
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"

# DML Pipeline

This repo is a small framework for running DoubleML on administrative-style
program data. It separates project-specific choices from reusable pipeline code:
you edit `project_configuration/`, then run the pipeline in `dml_code/`.

The repo currently ships with a synthetic example, so you can run the whole
flow before replacing it with real project data.

## Mental Model

The workflow has two main steps:

1. **Build an analysis dataset.** Start from a databank and program file,
   join them, construct event-time variables, and write processed panels to
   `data/build_output/`.
2. **Estimate DML effects.** Read a YAML experiment, resolve its program,
   covariates, filters, and models from the registries, then write logs to
   `outputs/raw/`.

After estimation, scripts in `project_scripts/` turn the raw logs into plots and tables.

```text
project_configuration/ + data/build/
        |
        v
dml_code.pipeline.step1_build
        |
        v
data/build_output/
        |
        v
dml_code.pipeline.step2_estimate
        |
        v
outputs/raw/ -> outputs/plots/ and outputs/tables/
```
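
The registry lookup in step 2 can be pictured as resolving names to definitions. The sketch below is illustrative only: the registry contents, the experiment keys (`program`, `covariates`, `filters`, `model`), and the `resolve` helper are all hypothetical and may not match the real code in `dml_code/` or the real YAML schema.

```python
from dataclasses import dataclass

# Hypothetical registries: every name and value here is illustrative only.
PROGRAMS = {"example_program": {"treatment_col": "treated", "enroll_year_col": "enroll_year"}}
COVARIATE_SETS = {"baseline": ["age", "prior_earnings"]}
FILTER_SETS = {"adults": "age >= 18"}
MODELS = {"lasso_logit": {"outcome": "lasso", "propensity": "logistic"}}


@dataclass(frozen=True)
class ResolvedSpec:
    """Everything one estimation run needs, with names replaced by definitions."""
    program: dict
    covariates: list
    filters: str
    model: dict


def lookup(registry: dict, kind: str, name: str):
    """Return registry[name], or fail with a message listing the known names."""
    try:
        return registry[name]
    except KeyError:
        known = ", ".join(sorted(registry))
        raise KeyError(f"unknown {kind} {name!r}; known: {known}") from None


def resolve(experiment: dict) -> ResolvedSpec:
    """Resolve the names in an experiment (as parsed from YAML) via the registries."""
    return ResolvedSpec(
        program=lookup(PROGRAMS, "program", experiment["program"]),
        covariates=lookup(COVARIATE_SETS, "covariate set", experiment["covariates"]),
        filters=lookup(FILTER_SETS, "filter set", experiment["filters"]),
        model=lookup(MODELS, "model", experiment["model"]),
    )
```

Failing on an unknown name at resolution time, rather than midway through estimation, keeps typos in an experiment file cheap to find.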

## Run The Example

```bash
python project_scripts/generate_example.py
python -m dml_code.pipeline.step1_build example_program
python -m dml_code.pipeline.step2_estimate synthetic_example
python project_scripts/plot_example.py
```

The first command creates synthetic input data in `data/build/`. Step 1 writes
processed panels to `data/build_output/`. Step 2 writes estimation and
prediction logs to `outputs/raw/`. The plotting script writes diagnostics to
`outputs/plots/` and `outputs/tables/`.
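
If you prefer to drive the four commands from Python, a minimal sketch could look like the following. The helper names `example_commands` and `run_all` are hypothetical and not part of the repo; the commands themselves are the ones listed above, run with the current interpreter.

```python
import subprocess
import sys


def example_commands(program="example_program", experiment="synthetic_example"):
    """Return the four example commands, in order, as argv lists."""
    py = sys.executable  # use the interpreter that is running this script
    return [
        [py, "project_scripts/generate_example.py"],
        [py, "-m", "dml_code.pipeline.step1_build", program],
        [py, "-m", "dml_code.pipeline.step2_estimate", experiment],
        [py, "project_scripts/plot_example.py"],
    ]


def run_all(commands, dry_run=False):
    """Run commands in order and return those processed.

    check=True makes subprocess raise CalledProcessError on a non-zero exit,
    so a failed step stops the sequence before later steps run.
    With dry_run=True nothing is executed.
    """
    done = []
    for cmd in commands:
        if not dry_run:
            subprocess.run(cmd, check=True)
        done.append(cmd)
    return done
```

Call `run_all(example_commands())` from the repo root so the relative script paths resolve.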

## What You Edit

Most project setup happens in `project_configuration/`.

- `project_configuration/build_spec.py`: define the databank files, columns to carry through,
  relative-time columns to generate, and any generated features created after
  panel construction.
- `project_configuration/registries/programs.py`: define each program: its source file,
  treatment column, enrollment-year column, and program-specific columns.
- `project_configuration/registries/covariate_sets.py`: name reusable covariate lists and mark
  categorical covariates for dummy encoding.
- `project_configuration/registries/filter_sets.py`: name reusable Polars filters for
  estimation samples.
- `project_configuration/registries/models.py`: name outcome and propensity learners.
- `project_configuration/estimation_experiments/*.yaml`: choose combinations of programs, outcomes,
  covariates, filters, models, and control sampling rates to estimate.

The pipeline code in `dml_code/` is meant to stay reusable.

- `dml_code/pipeline/`: runnable steps, `step1_build.py` and
  `step2_estimate.py`.
- `dml_code/src/`: shared helpers for building, estimating, paths, outputs,
  and logging.
  
`project_scripts/` is for ad hoc project work tied to particular runs:
generating example data, viewing outputs, making plots, running diagnostics,
and writing small experiment-specific analyses.

## How To Add A Real Project

1. Put the source Parquet files under `data/`, or point the paths in
   `project_configuration/` at their real locations.
2. Update `project_configuration/build_spec.py` with the databank files and feature-generation
   logic.
3. Add program definitions in `project_configuration/registries/programs.py`.
4. Add covariate sets, filters, and models in the registry files.
5. Create or copy a YAML file in `project_configuration/estimation_experiments/`.
6. Run step 1 for a program, then step 2 for an experiment.

Example:

```bash
python -m dml_code.pipeline.step1_build my_program
python -m dml_code.pipeline.step2_estimate my_experiment
```
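
Before running the steps on a new project, it can help to confirm that the configuration files from the checklist exist. This is a small sketch, not part of the pipeline: `missing_config_files` is a hypothetical helper, and it assumes an experiment named `my_experiment` lives at `project_configuration/estimation_experiments/my_experiment.yaml`.

```python
from pathlib import Path

# Configuration files named in this README, relative to the repo root.
CONFIG_FILES = [
    "project_configuration/build_spec.py",
    "project_configuration/registries/programs.py",
    "project_configuration/registries/covariate_sets.py",
    "project_configuration/registries/filter_sets.py",
    "project_configuration/registries/models.py",
]


def missing_config_files(repo_root, experiment):
    """Return the expected configuration paths that do not exist under repo_root.

    Assumes the experiment name maps to <experiment>.yaml in
    project_configuration/estimation_experiments/.
    """
    root = Path(repo_root)
    expected = CONFIG_FILES + [
        f"project_configuration/estimation_experiments/{experiment}.yaml"
    ]
    return [rel for rel in expected if not (root / rel).is_file()]
```

An empty list from `missing_config_files(".", "my_experiment")` means every expected file is in place.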

Use `project_scripts/` for project-specific follow-up work: viewing outputs
from particular runs, making plots and tables, running diagnostics, robustness
checks, and other exploratory analyses.

## Where Results Go

- `data/build/`: input data for step 1; the example's synthetic inputs are generated here.
- `data/build_output/`: processed analysis datasets created by step 1.
- `outputs/raw/`: machine-readable estimation, prediction, and diagnostic logs.
- `outputs/plots/`: generated figures.
- `outputs/tables/`: generated tables.
