Metadata-Version: 2.4
Name: verifiers-monitor
Version: 0.1.1
Summary: Observability framework for Verifiers RL training and evaluation
Project-URL: Homepage, https://github.com/kaushikb11/verifiers-monitor
Project-URL: Repository, https://github.com/kaushikb11/verifiers-monitor
Project-URL: Issues, https://github.com/kaushikb11/verifiers-monitor/issues
Project-URL: Documentation, https://github.com/kaushikb11/verifiers-monitor#readme
Author-email: Kaushik Bokka <kaushikbokka@gmail.com>
License: Apache-2.0
License-File: LICENSE
Keywords: machine-learning,monitoring,observability,reinforcement-learning,verifiers
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: System :: Monitoring
Requires-Python: >=3.9
Requires-Dist: aiohttp>=3.12.0
Requires-Dist: fastapi>=0.118.0
Requires-Dist: psutil>=7.1.0
Requires-Dist: pyngrok>=7.4.0
Requires-Dist: rich>=14.1.0
Requires-Dist: sqlmodel>=0.0.25
Requires-Dist: uvicorn>=0.37.0
Requires-Dist: verifiers
Requires-Dist: websockets>=15.0.0
Provides-Extra: analysis
Requires-Dist: pandas>=2.3.0; extra == 'analysis'
Provides-Extra: dev
Requires-Dist: black>=25.9.0; extra == 'dev'
Requires-Dist: isort>=6.1.0; extra == 'dev'
Requires-Dist: mypy>=1.18.0; extra == 'dev'
Requires-Dist: pre-commit>=4.0.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.25.0; extra == 'dev'
Requires-Dist: pytest>=8.4.0; extra == 'dev'
Provides-Extra: gpu
Requires-Dist: gputil>=1.4.0; extra == 'gpu'
Requires-Dist: pynvml>=8.0.4; extra == 'gpu'
Description-Content-Type: text/markdown

# Verifiers Monitor

**Real-time observability for RL training and evaluation**

Running RL experiments without visibility into rollout quality, reward distributions, or failure modes wastes time. Verifiers Monitor gives you live tracking, per-example inspection, and programmatic data access: see what's happening during runs and debug what went wrong afterward.

![Dashboard](./verifiers_monitor/assets/dashboard.png)

## Quick Start

```python
import verifiers as vf
from openai import OpenAI
from verifiers_monitor import monitor

client = OpenAI()  # any OpenAI-compatible client

# One-line integration
env = monitor(vf.load_environment("gsm8k"))
results = env.evaluate(client, model="gpt-5-mini")
# Dashboard automatically launches at localhost:8080
```
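
If you're serving the model locally, the same call works with any OpenAI-compatible client. A minimal sketch, assuming a local endpoint such as a vLLM server (the URL, API key, and model name are placeholders):

```python
import verifiers as vf
from openai import OpenAI
from verifiers_monitor import monitor

# Placeholder endpoint for a local OpenAI-compatible server (e.g. vLLM)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

env = monitor(vf.load_environment("gsm8k"))
results = env.evaluate(client, model="your-model-name")
```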

⚠️ **Training monitoring**: Not yet supported (coming soon)

See `scripts/01_monitor.py` and `scripts/02_access_data.py` for examples.

## Installation

```bash
pip install verifiers-monitor
```
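
Optional extras cover the pandas-based analysis and GPU metrics (per the package metadata, `analysis` pulls in pandas and `gpu` pulls in gputil and pynvml):

```bash
pip install "verifiers-monitor[analysis]"  # pandas, for the DataFrame export below
pip install "verifiers-monitor[gpu]"       # gputil + pynvml, for GPU metrics
```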

## What You Get

- Live progress tracking with WebSocket updates (know when long runs stall)
- Real-time reward charts showing trends as rollouts complete
- Per-example status: see which prompts pass, which fail, why
- Inspect failures: view full prompts, completions, and reward breakdowns
- Multi-rollout analysis: identify high-variance examples where the model is inconsistent
- Reward attribution: see which reward functions contribute most to scores
- Session comparison: track metrics across training iterations or evaluation experiments

## Dashboard

Launches automatically at `http://localhost:8080`. Shows success rates, response times, consistency metrics, and per-example breakdowns in real time.

## Programmatic Analysis

Access rollout data for custom analysis and debugging:

```python
from verifiers_monitor import MonitorData

data = MonitorData()

# Find worst-performing examples to understand model weaknesses
session = data.get_latest_session(env_id="math-python")
worst = data.get_top_failures(session.session_id, n=10)
for ex in worst:
    print(f"Example {ex.example_number}: avg={ex.mean_reward:.2f}, std={ex.std_reward:.2f}")
    # Check if unstable (high variance across rollouts)
    if ex.is_unstable(threshold=0.3):
        print(f"  ⚠️ Unstable: variance {ex.std_reward:.2f}")
    # Get best/worst rollouts
    best = ex.get_best_rollout()
    print(f"  Best: {best.reward:.2f}, Worst: {ex.get_worst_rollout().reward:.2f}")

# Inspect prompts and completions
failures = data.get_failed_examples(session.session_id, threshold=0.5)
for ex in failures[:5]:
    rollout = ex.rollouts[0]
    # Use convenience properties
    print(f"Prompt: {rollout.prompt_messages[0]['content'][:50]}...")
    if rollout.has_tool_calls:
        print("  Contains tool calls")

# Export to pandas for custom analysis (requires the 'analysis' extra)
df = data.to_dataframe(session.session_id)
variance_analysis = df.groupby('example_number')['reward'].std()
high_variance = variance_analysis[variance_analysis > 0.3]
print(f"Found {len(high_variance)} unstable examples")
```
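
The DataFrame export also makes cross-session comparison straightforward. A minimal sketch, assuming two session IDs you already have (the IDs below are placeholders) and the `analysis` extra installed:

```python
from verifiers_monitor import MonitorData

data = MonitorData()

# Placeholder session IDs from two runs you want to compare
baseline = data.to_dataframe("session-a").groupby("example_number")["reward"].mean()
candidate = data.to_dataframe("session-b").groupby("example_number")["reward"].mean()

# Positive delta = the candidate run scored higher on that example
delta = (candidate - baseline).sort_values()
print(delta.head(10))  # largest regressions
print(delta.tail(10))  # largest improvements
```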

---
Questions? Create an Issue or reach out on [X](https://x.com/kaushik_bokka)

Happy building! 🚀
