Metadata-Version: 2.4
Name: adaptive-reliability-layer
Version: 0.3.1
Summary: Research prototype and commercial runtime for safe continual test-time adaptation under distribution shift.
Author: Adaptive Reliability Layer contributors
License: MIT
Project-URL: Homepage, https://github.com/adaptive-reliability-layer/adaptive-reliability-layer
Project-URL: Documentation, https://github.com/adaptive-reliability-layer/adaptive-reliability-layer#readme
Keywords: machine-learning,fraud,distribution-shift,mlops,reliability
Classifier: License :: OSI Approved :: MIT License
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.24
Requires-Dist: scikit-learn>=1.3
Requires-Dist: pyyaml>=6.0
Requires-Dist: pandas>=2.0
Provides-Extra: torch
Requires-Dist: torch>=2.0; extra == "torch"
Requires-Dist: torchvision>=0.15; extra == "torch"
Provides-Extra: runtime
Requires-Dist: torch>=2.0; extra == "runtime"
Requires-Dist: torchvision>=0.15; extra == "runtime"
Provides-Extra: research
Requires-Dist: torch>=2.0; extra == "research"
Requires-Dist: torchvision>=0.15; extra == "research"
Requires-Dist: wilds>=2.0; extra == "research"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: torch>=2.0; extra == "dev"
Requires-Dist: torchvision>=0.15; extra == "dev"
Requires-Dist: fastapi>=0.100; extra == "dev"
Requires-Dist: uvicorn>=0.23; extra == "dev"
Requires-Dist: httpx>=0.25; extra == "dev"
Provides-Extra: prometheus
Requires-Dist: prometheus-client>=0.19; extra == "prometheus"
Provides-Extra: serving
Requires-Dist: fastapi>=0.100; extra == "serving"
Requires-Dist: uvicorn>=0.23; extra == "serving"
Requires-Dist: httpx>=0.25; extra == "serving"
Provides-Extra: kafka
Requires-Dist: confluent-kafka>=2.3; extra == "kafka"
Provides-Extra: redis
Requires-Dist: redis>=5.0; extra == "redis"
Provides-Extra: all
Requires-Dist: pytest>=7.0; extra == "all"
Requires-Dist: torch>=2.0; extra == "all"
Requires-Dist: torchvision>=0.15; extra == "all"
Requires-Dist: wilds>=2.0; extra == "all"
Requires-Dist: fastapi>=0.100; extra == "all"
Requires-Dist: uvicorn>=0.23; extra == "all"
Requires-Dist: httpx>=0.25; extra == "all"
Requires-Dist: prometheus-client>=0.19; extra == "all"
Requires-Dist: confluent-kafka>=2.3; extra == "all"
Requires-Dist: redis>=5.0; extra == "all"
Dynamic: license-file

# Adaptive Reliability Layer

An experimental research codebase for safe continual test-time adaptation under distribution shift.

## Project Goal

The project explores whether a deployed ML model can be paired with an adaptive reliability layer that:

- detects distribution shift online
- estimates whether the model remains in its competence zone
- applies bounded, reversible adaptation when safe
- preserves source knowledge over long streams
- recalibrates uncertainty after adaptation

This is a separate project from `Intelligent_NPCs`.

## Initial Scope

The first prototype focuses on a simulated streaming setting rather than production deployment:

- streaming batches from a synthetic nonstationary environment
- latent or feature-space shift monitoring
- simple adaptation policy with safety gating
- uncertainty-aware output surface
- evaluation against a frozen baseline
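
The monitoring and gating items above could be sketched as follows. This is a minimal illustration under stated assumptions, not the package's implementation: `shift_score`, `gated_action`, and both thresholds are hypothetical names and values chosen for the example.

```python
from statistics import fmean, pstdev


def shift_score(reference, batch):
    """Mean absolute z-score of per-feature batch means against a source reference.

    `reference` and `batch` are lists of equal-length feature rows.
    """
    n_features = len(reference[0])
    scores = []
    for j in range(n_features):
        ref_col = [row[j] for row in reference]
        batch_col = [row[j] for row in batch]
        ref_std = pstdev(ref_col) or 1.0  # guard against constant features
        scores.append(abs(fmean(batch_col) - fmean(ref_col)) / ref_std)
    return fmean(scores)


def gated_action(score, adapt_above=0.5, abstain_above=3.0):
    """Hypothetical safety gate: adapt only inside a moderate-shift band."""
    if score < adapt_above:
        return "frozen"  # no meaningful shift: leave the model alone
    if score < abstain_above:
        return "adapt"   # moderate shift: bounded adaptation allowed
    return "reset"       # extreme shift: fall back to the source state


reference = [[0.0, 1.0], [1.0, 3.0], [2.0, 5.0]]
shifted = [[2.0, 5.0], [3.0, 7.0], [4.0, 9.0]]
print(gated_action(shift_score(reference, reference)))  # unshifted batch
print(gated_action(shift_score(reference, shifted)))    # shifted batch
```

The gate deliberately has an upper bound: under this assumption, very large shift is treated as outside the competence zone, so the sketch resets instead of adapting further.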

## Repo Layout

- `docs/` project documents and research notes
- `src/adaptive_reliability_layer/` core package
- `scripts/` runnable entrypoints

Useful starting docs:

- `docs/status_paper_commercial_outreach.md`
- `docs/current_findings.md`
- `docs/next_step_decision_memo.md`

## Quick Start

Create a virtual environment and install in editable mode:

```bash
python3 -m venv .venv
source .venv/bin/activate
# Commercial runtime (torch fraud pilots + sidecar):
pip install -e ".[torch,serving]"
# Full research bench (adds WILDS):
pip install -e ".[research,serving,dev]"
```

PyPI-style install (when published):

```bash
pip install "adaptive-reliability-layer[torch,serving]"
arl-customer-replay --input customer.csv --config configs/customer_shadow.yaml
```

See `docs/customer_replay.md` for the design-partner replay path.

Run the simple simulation:

```bash
python3 scripts/run_simulation.py
```

Run the baseline benchmark:

```bash
python3 scripts/run_benchmark.py
```

Run the real tabular benchmark with the PyTorch adapter model:

```bash
python3 scripts/run_tabular_benchmark.py
```

Run the harder digits-shift benchmark:

```bash
python3 scripts/run_digits_shift_benchmark.py
```

Run the real-image scale-up benchmark on Fashion-MNIST:

```bash
python3 scripts/run_fashion_mnist_shift_benchmark.py
```

Run the delayed-label temporal Fashion-MNIST benchmark:

```bash
python3 scripts/run_temporal_fashion_mnist_benchmark.py
```

Run the temporal delay/severity suite and save aggregated results:

```bash
python3 scripts/run_temporal_benchmark_suite.py
```

Run the broader temporal paper-style suite:

```bash
python3 scripts/run_temporal_paper_suite.py
```

Run the recurrence-first temporal benchmark for specialist reuse:

```bash
python3 scripts/run_recurrence_temporal_benchmark.py
```

Run the first public WILDS benchmark path on CivilComments:

```bash
python3 scripts/run_wilds_civilcomments_benchmark.py
```

Run the multi-seed WILDS CivilComments suite:

```bash
python3 scripts/run_wilds_civilcomments_suite.py
```

Run the graph-native shift benchmark:

```bash
python3 scripts/run_graph_shift_benchmark.py
```

Run the multi-seed image scale-up suite across backbones and shift severities:

```bash
python3 scripts/run_image_scaleup_suite.py
```

Run the multi-seed benchmark suite and save aggregated results:

```bash
python3 scripts/run_benchmark_suite.py
```

Run the ablation suite and save aggregated results:

```bash
python3 scripts/run_ablation_suite.py
```

## Current Status

This repository is in the research scaffolding phase. The code currently provides:

- a stream simulator for synthetic regime shift
- a source-reference profile and shift monitor over features plus outputs
- a martingale-style sequential risk monitor
- baseline policies for frozen, naive, and safety-gated adaptation
- a multi-action controller over `bn_refresh`, `label_shift`, `adapter_update`, and `reset`
- experimental `bandit`, `specialist_memory`, and `hybrid` controllers for the next research phase
- a delayed-feedback bandit controller for temporal streams with label latency
- a delayed-feedback hybrid controller that combines specialist memory with delayed controller learning
- a regime-aware delayed bandit controller with short-horizon temporal state
- confidence-filtered pseudo-label adaptation with bounded parameter drift
- a small PyTorch encoder-plus-adapter model for adapter-only test-time updates
- a real tabular streaming benchmark with regime-based shifts
- a harder digits-shift benchmark that degrades the frozen source model much more sharply
- a real-image scale-up benchmark on a fine-grained Fashion-MNIST subset with a BN-heavy convolutional model
- a configurable image scale-up path with `convnet` and `resnet_small` backbones
- `standard` and `harsh` image shift profiles for stress-testing adaptation policies
- an `extreme` image shift profile for harder temporal stress tests
- a delayed-label temporal image benchmark for studying feedback lag
- a temporal benchmark suite over multiple delay and severity settings
- smoothed, trust-weighted retrospective reward updates for delayed-feedback controller learning
- a recurrence-first temporal benchmark for testing specialist reuse under returning regimes
- a broader paper-style temporal suite with win-count and delta summaries
- saved temporal paper-suite artifacts under `results/temporal_paper_suite.{md,json}`
- a graph-native benchmark with topology-aware shift monitoring
- an initial WILDS CivilComments benchmark path for testing the controller stack on a stronger public benchmark family
- a multi-seed WILDS CivilComments suite with compact and medium public-benchmark settings
- a multi-seed benchmark suite with JSON and Markdown outputs under `results/`
- a dedicated image scale-up suite that defaults to a fast `convnet` confirmation loop, with slower `resnet_small` confirmations available when needed
- an ablation suite for controller actions, reward shaping, and specialist-memory routing
- a lightweight uncertainty wrapper
- an evaluation loop for comparing outcomes over time
- per-step traces for inspecting failure modes
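
To make the "martingale-style sequential risk monitor" item concrete, here is one common construction: a betting-style test supermartingale over bounded losses. Whether the package uses this exact form is not stated in this README; the class name `BettingRiskMonitor` and all parameter defaults are assumptions for illustration.

```python
class BettingRiskMonitor:
    """Test supermartingale for H0: expected per-step loss <= tolerated_risk.

    Losses must lie in [0, 1]. Capital starts at 1 and grows when observed
    losses exceed the tolerated level; by Ville's inequality it crosses
    1/alpha with probability at most alpha while H0 holds.
    """

    def __init__(self, tolerated_risk=0.2, bet=0.5, alpha=0.05):
        if not 0.0 < tolerated_risk < 1.0:
            raise ValueError("tolerated_risk must be in (0, 1)")
        if not 0.0 <= bet <= 1.0 / tolerated_risk:
            raise ValueError("bet must keep capital nonnegative")
        self.tolerated_risk = tolerated_risk
        self.bet = bet
        self.threshold = 1.0 / alpha
        self.capital = 1.0

    def update(self, loss):
        """Fold in one loss in [0, 1]; return True once the alarm fires."""
        if not 0.0 <= loss <= 1.0:
            raise ValueError("loss must be in [0, 1]")
        self.capital *= 1.0 + self.bet * (loss - self.tolerated_risk)
        return self.capital >= self.threshold

    def reset(self):
        """Restart the capital process, e.g. after a model reset."""
        self.capital = 1.0


monitor = BettingRiskMonitor()
stream = [0.0] * 5 + [1.0] * 20  # a clean stretch, then sustained errors
alarm_step = next((t for t, loss in enumerate(stream) if monitor.update(loss)), None)
print("alarm at step", alarm_step)
```

The accumulated capital gives a single running quantity that can be compared across policies, and the alarm threshold `1/alpha` bounds the false-alarm probability over the whole stream rather than per step.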

## Current Read

The current results point to a clear architectural conclusion:

- naive continual adaptation is consistently brittle
- reset logic is one of the highest-leverage safety mechanisms
- on tabular data, `bandit` and `hybrid` currently have the best utility/risk tradeoff
- on the harder digits-shift benchmark, all serious controllers beat or match frozen while dramatically reducing risk capital
- on the new Fashion-MNIST benchmark, the controller family roughly matches frozen accuracy while cutting sequential risk by an order of magnitude
- the harsher image profile creates a clearer separation between frozen and controller-guided behavior
- the temporal and graph tracks are now in place, so the controller abstractions are being exercised beyond flat i.i.d.-style batch streams
- on the temporal suite, regime-aware delayed control helps most in some longer-delay settings, but it is still unstable across the full grid
- on the full temporal paper suite, `controller` currently has the strongest aggregate utility story, while `frozen` still wins the most accuracy settings and `delayed_bandit` is the strongest delayed learner
- after the specialist-quality upgrade, `delayed_hybrid` became more competitive on the full temporal paper suite, but `controller` and `hybrid` still have the strongest aggregate utility story
- on the recurrence-first temporal benchmark, `delayed_hybrid` now opens multiple specialists, but the delayed branch still trails on utility and needs better routing/credit assignment
- richer specialist signatures and support-state warm starts now produce much stronger focused long-delay results for the delayed-memory branch
- the newest result is that delayed specialist memory should likely be **regime-selective**: it helps on some recurring long-delay slices, but `regime_aware_delayed_bandit` is still the strongest general delayed learner on the mixed long-delay grid
- the new explicit regime encoder makes delayed control more selective and improves some long-delay slices, especially `standard / 12` and `extreme / 12`, but it has not yet made delayed memory the strongest overall temporal branch
- the temporal image benchmark now distinguishes between immediate-learning and true delayed-feedback bandit control
- the temporal track now uses retrospective rewards at label reveal time, not just delayed replay of immediate utility
- `delayed_hybrid` is now clearly a real branch rather than collapsing to plain delayed bandit, but it is still not the strongest temporal policy overall
- the graph benchmark now degrades frozen performance meaningfully on structural rewiring, which makes it a better structural-shift testbed even though it still mostly highlights safety over raw accuracy recovery
- the WILDS CivilComments path now has a real `easy / hard / recurring` split instead of recurring collapsing back onto the majority group
- on the multi-seed WILDS suite, `bandit` and `hybrid` currently have the best public-benchmark utility while raw accuracy remains roughly tied with frozen
- delayed specialist memory now forms more controlled specialist pools with richer reuse diagnostics, but it still trails non-delayed `hybrid` on the recurrence-first benchmark
- the most promising recent direction is specialist quality: better routing signatures plus specialist support-state warm starts improved delayed-memory performance far more than controller decoupling did
- those specialist-quality gains are real but mixed at full-suite scale: they improved the delayed branch’s competitiveness without yet making it the strongest overall temporal policy
- making memory more selective helped clarify the story more than it improved the top-line numbers: routing can stay fairly loose, but warm-start reuse needs to be selective under harsher shift
- the project looks strongest as a **controller over bounded interventions**, not as a single always-on adaptation rule
- the public temporal runtime story is now sharper:
  - `PaySim` remains the strongest fraud-style bounded-auto success story
  - `UCI Gas Sensor Drift` is a neutral-but-honest maintenance benchmark
  - `OpenML Electricity` now uses a conservative `sensor_safe` profile and stands down instead of harming accuracy
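
The "controller over bounded interventions" framing with delayed feedback can be sketched as a small bandit that credits rewards only when labels arrive. This is a simplified epsilon-greedy illustration, not the package's `delayed_bandit`: the class `DelayedBandit`, its methods, and the reward values are hypothetical, and the smoothed, trust-weighted rewards mentioned earlier are omitted.

```python
import random

# Action names are taken from the multi-action controller described above.
ACTIONS = ("bn_refresh", "label_shift", "adapter_update", "reset")


class DelayedBandit:
    """Epsilon-greedy bandit whose rewards arrive after a label delay."""

    def __init__(self, actions=ACTIONS, epsilon=0.1, seed=0):
        self.actions = tuple(actions)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {a: 0 for a in self.actions}
        self.values = {a: 0.0 for a in self.actions}
        self.pending = {}  # step -> action awaiting its reward

    def select(self, step):
        """Pick an action for `step` and remember it until feedback arrives."""
        if self.rng.random() < self.epsilon:
            action = self.rng.choice(self.actions)
        else:
            action = max(self.actions, key=lambda a: self.values[a])
        self.pending[step] = action
        return action

    def reveal(self, step, reward):
        """Credit a late reward to the action that was taken at `step`."""
        action = self.pending.pop(step)
        self.counts[action] += 1
        # Incremental mean keeps the estimate inside the reward range.
        self.values[action] += (reward - self.values[action]) / self.counts[action]
        return action


bandit = DelayedBandit(epsilon=0.0)
first = bandit.select(step=0)
second = bandit.select(step=1)      # chosen before step 0's reward is known
bandit.reveal(step=0, reward=-1.0)  # labels for step 0 arrive late
third = bandit.select(step=2)
print(first, second, third)
```

The point of the sketch is the bookkeeping: the second decision is made without any feedback, and the estimate for the first action only changes once its labels are revealed.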

## Commercial Deployment Runtime

The repo now includes a production-oriented runtime layer on top of the research benchmarks:

- `ReliabilityLayer` — stable deployment surface for every batch (predictions, shift/risk scores, recommended/taken actions, trust state, rollback metadata)
- **Decision record schema** — stable operator-facing payload with `regime_id`, `regime_confidence`, `risk_score`, `why_this_action`, `rollback_eligible`, and `retrain_recommended`
- **Operating modes** — `shadow`, `recommend`, `bounded_auto`
- **Safety budgets** — per-window caps on auto-actions / resets with downgrade-to-`recommend` when budgets are exhausted
- **Model adapters** — `torch_tabular`, `sklearn`, `black_box`
- **Governance** — SQLite audit log, versioned snapshots, one-click rollback
- **Offline replay** — canonical CSV/Parquet historical streams with label-delay simulation and `reveal_labels(step, labels)` for delayed supervision
- **Runtime policies** — `delayed_bandit`, `regime_aware_delayed_bandit` (ported from research for fraud pilots)
- **Operator + buyer reports** — `technical_report.md`, `operator_report.md`, `buyer_report.md`, and replay schema artifacts for every pilot / public-story run
- **Dual-metric reports** — shadow vs `bounded_auto` on the same stream (`dual_metric_report.md`)
- **HTTP serving** — FastAPI sidecar (`/v1/batch`, `/v1/batch/{step}/labels`, `/v1/approve`, `/v1/health`, `/v1/metrics`)
- **Profile-aware runtime control** — drift signatures plus bounded action profiles for `fraud`, `sensor`, and conservative `sensor_safe` streams
- **Ingest contract** — canonical replay schema (`timestamp`, `label`, `feature_*`, optional `regime`, optional `meta_*`) → replay stream
- **Policy persistence** — save/load bandit + regime encoder state across restarts
- **Configurable KPIs** — `kpi` block in runtime YAML for buyer-facing scores
- **Pilot framework** — fraud/risk-style case study with saved KPI report
- **Public ops stories** — one-command replay artifacts for public datasets (`scripts/run_public_ops_story.py`)
- **Observability** — optional Prometheus metrics endpoint
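
As a rough picture of how the decision record and operating modes fit together, consider the sketch below. The field names come from the decision record schema listed above; the field types, the `resolve_taken_action` helper, its mode semantics, and all example values are assumptions made for illustration.

```python
from dataclasses import asdict, dataclass
import json


@dataclass(frozen=True)
class DecisionRecord:
    # Field names follow the schema above; the types are assumptions.
    regime_id: str
    regime_confidence: float
    risk_score: float
    why_this_action: str
    rollback_eligible: bool
    retrain_recommended: bool


def resolve_taken_action(mode, recommended, auto_allowed=(), approved=False):
    """Return the action actually applied under a given operating mode."""
    if mode == "shadow":
        return None  # observe and log only
    if mode == "recommend":
        return recommended if approved else None
    if mode == "bounded_auto":
        if recommended in auto_allowed:
            return recommended  # low-risk action, no approval needed
        return recommended if approved else None
    raise ValueError(f"unknown operating mode: {mode!r}")


# Illustrative values only; they are not output from a real run.
record = DecisionRecord(
    regime_id="regime_2",
    regime_confidence=0.81,
    risk_score=0.34,
    why_this_action="shift score above threshold; labels not yet revealed",
    rollback_eligible=True,
    retrain_recommended=False,
)
print(json.dumps(asdict(record), indent=2))
print(resolve_taken_action("bounded_auto", "bn_refresh", auto_allowed=("bn_refresh",)))
```

Keeping "recommended" and "taken" as separate values is what lets a shadow run and a `bounded_auto` run be compared on the same stream.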

Product milestone checklist (dual-metric pilots, verification, sidecar): **`docs/product_milestones.md`**. Run all five milestones with:

```bash
python3 scripts/run_product_milestones.py
```

### Quick start (commercial path)

```bash
pip install -e ".[dev,prometheus,serving]"

# Shadow-mode offline replay on synthetic fraud-like stream
python3 scripts/run_offline_replay.py --synthetic --config configs/default.yaml

# Pilot case study artifact (report + JSON KPIs; dual-metric when layer_builder is set)
python3 scripts/run_pilot_case_study.py --config configs/pilot_fraud_tabular.yaml

# PaySim torch pilot with regime-aware delayed bandit + dual-metric report
python3 scripts/run_pilot_torch.py

# Ingest CSV/JSONL and replay (optional --dual-mode)
python3 scripts/run_ingest_replay.py --input data/openml/credit_german.csv --dual-mode

# Canonical offline replay on a CSV or Parquet stream
python3 scripts/run_offline_replay.py --input data/openml/credit_german.csv

# Public ops story on a public dataset
python3 scripts/run_public_ops_story.py --source-id paysim_fraud --controller-name multi_action --stream-cycles 4

# Correction-centric parallel-path evaluation on the fraud SOTA suite
python3 scripts/run_correction_path_evaluation.py

# Focused decomposition of the flagship fraud SOTA lane
python3 scripts/run_production_failure_analysis.py --source ieee_cis_fraud_torch

# Prometheus metrics (optional)
python3 scripts/run_metrics_server.py --config configs/default.yaml
```

HTTP sidecar (production pilot):

```bash
pip install -e ".[serving,prometheus]"
python3 scripts/export_bundled_fraud_data.py
python3 scripts/run_serve.py --config configs/serving_pilot_fraud_torch.yaml --force-shadow
python3 scripts/run_serving_parity.py
```

Docs: `docs/sidecar_production.md` | Docker: `docker compose up arl-sidecar`

### Configuration

Default runtime config: `configs/default.yaml`

Pilot config: `configs/pilot_fraud_tabular.yaml`

Key fields:

- `operating_mode`: `shadow` | `recommend` | `bounded_auto`
- `bounded_auto_actions`: low-risk actions allowed in bounded auto mode
- `safety_budget.window_steps`: control horizon for bounded-auto budgets
- `safety_budget.max_auto_actions_per_window`: automatic intervention cap per horizon
- `safety_budget.max_resets_per_window`: reset cap per horizon
- `safety_budget.downgrade_to_recommend`: force human approval when budgets are exhausted
- `governance.audit_db_path` / `governance.snapshot_dir`: audit + rollback storage
- `replay.label_delay_steps`: delayed-label simulation for offline replay
- `replay_schema.md`: generated artifact (not a config key) that documents the canonical input contract for customer logs
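
The `safety_budget.*` fields describe per-window caps with an optional downgrade. One way such a budget could behave is sketched below. The constructor arguments mirror the config keys, but the `SafetyBudget` class, the sliding-window interpretation, the default values, and the `"auto"` / `"recommend"` / `"skip"` return values are all assumptions, not the runtime's actual behavior.

```python
from collections import deque


class SafetyBudget:
    """Sliding-window caps on automatic actions and resets (hypothetical)."""

    def __init__(self, window_steps=50, max_auto_actions_per_window=5,
                 max_resets_per_window=1, downgrade_to_recommend=True):
        self.window_steps = window_steps
        self.max_auto = max_auto_actions_per_window
        self.max_resets = max_resets_per_window
        self.downgrade_to_recommend = downgrade_to_recommend
        self._history = deque()  # (step, action) pairs still inside the window

    def _evict(self, step):
        # Keep only actions taken within the last `window_steps` steps.
        while self._history and self._history[0][0] <= step - self.window_steps:
            self._history.popleft()

    def authorize(self, step, action):
        """Return "auto", "recommend", or "skip" for `action` at `step`."""
        self._evict(step)
        autos = len(self._history)
        resets = sum(1 for _, a in self._history if a == "reset")
        exhausted = autos >= self.max_auto or (
            action == "reset" and resets >= self.max_resets
        )
        if exhausted:
            return "recommend" if self.downgrade_to_recommend else "skip"
        self._history.append((step, action))
        return "auto"


budget = SafetyBudget(window_steps=10, max_auto_actions_per_window=2,
                      max_resets_per_window=1)
decisions = [budget.authorize(step, "bn_refresh") for step in range(3)]
print(decisions)
```

Only actions that actually run automatically consume budget in this sketch, so a long quiet stretch restores the full allowance once earlier actions fall out of the window.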

### Public ops story artifacts

Every public or pilot replay now writes:

- `technical_report.md`: replay summary and per-strategy metrics
- `operator_report.md`: intervention timeline and top drift episodes
- `buyer_report.md`: risk / accuracy / retrain-deferral summary
- `summary.json`: machine-readable KPI payload
- `replay_schema.md`: canonical log schema for ingestion
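
Before running a replay, a customer log header can be checked against the canonical columns named in the ingest contract (`timestamp`, `label`, `feature_*`, optional `regime`, optional `meta_*`). The helper below is a standalone sketch: `check_replay_columns` is a hypothetical name, rejecting unknown columns is an assumption about strictness, and the sample row is invented.

```python
import csv
import io

REQUIRED = ("timestamp", "label")
OPTIONAL = ("regime",)


def check_replay_columns(columns):
    """Split header names into schema roles; raise if the contract is violated."""
    columns = list(columns)
    missing = [name for name in REQUIRED if name not in columns]
    features = [c for c in columns if c.startswith("feature_")]
    meta = [c for c in columns if c.startswith("meta_")]
    known = set(REQUIRED) | set(OPTIONAL) | set(features) | set(meta)
    unknown = [c for c in columns if c not in known]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    if not features:
        raise ValueError("expected at least one feature_* column")
    if unknown:
        raise ValueError(f"columns outside the replay schema: {unknown}")
    return {"features": features, "meta": meta, "has_regime": "regime" in columns}


# Invented sample log, used only to exercise the header check.
sample = io.StringIO(
    "timestamp,label,feature_amount,feature_hour,meta_merchant\n"
    "2024-01-01T00:00:00,0,12.5,0,m_17\n"
)
header = next(csv.reader(sample))
print(check_replay_columns(header))
```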

Current strongest public fraud/risk ops story:

- `results/ops_story_paysim_fraud_multi_action/`
- bounded auto improved accuracy from `87.0%` to `96.9%`
- intervention rate stayed at `4.4%` of batches
- retrain trigger was deferred by `1` step
- harmful drift events avoided vs frozen: `2`

### Real-data verification (before design partners)

Verify the commercial runtime across multiple public datasets:

```bash
python3 scripts/run_real_data_verification.py --config configs/real_data_verification.yaml
```

Sources included by default:

| Source | Type | Wedge |
|--------|------|-------|
| `breast_cancer` | sklearn UCI | general tabular |
| `digits` | sklearn | general tabular |
| `tabular_breast_cancer_shift` | in-repo shift stream | general tabular |
| `openml_credit_g` | OpenML German Credit | fraud/risk adjacent |
| `paysim_fraud` | PaySim-style synthetic mobile money | fraud ops proxy (time-ordered) |
| `ieee_cis_fraud` | IEEE-CIS sample or synthetic fallback | imbalanced fraud tabular |
| `openml_electricity` | OpenML Electricity | predictive maintenance proxy |
| `uci_gas_sensor_drift` | UCI Gas Sensor Array Drift | natural batch-chronological drift benchmark |
| `wilds_civilcomments_csv` | local WILDS CSV | public NLP / moderation |

Each source runs through all 8 commercial priorities: deployment surface, operating modes, offline replay, model adapters, engineering maturity, observability hooks, governance/audit, and real-data KPI evidence.

Results are saved under `results/real_data_verification/`.

Bundled offline fallbacks for OpenML-style datasets live in `data/openml/` (UCI German Credit + Spambase). Regenerate with:

```bash
python3 scripts/export_bundled_real_data.py
```

Grafana dashboard template: `observability/grafana/arl_dashboard.json`

### Docker

```bash
docker compose run --rm replay
docker compose up metrics
```
