Metadata-Version: 2.1
Name: benchmark-reliability
Version: 0.1.3
Summary: Benchmark Reliability Framework (BRF) - dataset-level reliability auditing for predictive benchmarks
Author-email: zhanglizhuo <zhanglizhuo@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/zhanglizhuo/BenchmarkReliability
Project-URL: Repository, https://github.com/zhanglizhuo/BenchmarkReliability
Keywords: benchmark reliability,dataset auditing,educational AI,machine learning
Classifier: Development Status :: 3 - Alpha
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: numpy>=1.21
Requires-Dist: scikit-learn>=1.0
Requires-Dist: matplotlib>=3.5

# BenchmarkReliability - BRF Python Package

## Target

Provide a standardized, pip-installable Python package that computes the Benchmark Reliability Framework (BRF) metrics for any predictive dataset, so that researchers can run the four-dimension audit protocol with a single API call.

## Method

The package wraps the core logic from the BehaviorAudit project in a scikit-learn-style API:

```python
from brf import BRFAnalyzer
from brf.phase import plot_phase_diagram
from brf.report import export_json

# X, y, and groups are user-supplied arrays; they are not defined in this snippet.
analyzer = BRFAnalyzer(n_splits=30, n_permutations=200).fit(X, y, groups=groups)
print(analyzer.brf_vector)   # (B, I, N, M) -> (S, E) -> class

# Visualization
plot_phase_diagram(
    [analyzer.S], [analyzer.E],
    labels=[analyzer.class_],
    classes=[analyzer.class_],
)

# Export
export_json(analyzer.brf_vector, "results.json")
```
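
The quick-start above leaves `X`, `y`, and `groups` undefined. The sketch below shows one way to build placeholder inputs for trying the API. It assumes `groups` follows the scikit-learn convention of one group label per sample (for example, a student ID); the sizes and the synthetic labeling rule are purely illustrative and do not come from the BRF datasets.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
n_students, rows_per_student, n_features = 50, 4, 6

rng = np.random.default_rng(42)

# One group label per row: each "student" contributes several rows.
groups = np.repeat(np.arange(n_students), rows_per_student)

# Synthetic features and a binary target driven by the first feature.
X = rng.normal(size=(groups.size, n_features))
y = (X[:, 0] > 0).astype(int)

print(X.shape, y.shape, groups.shape)
```

Arrays shaped like these can then be passed to `BRFAnalyzer(...).fit(X, y, groups=groups)` as in the example above.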

## Package Structure

```
brf/
|-- __init__.py
|-- analyzer.py          <- BRFAnalyzer main class
|-- metrics/
|   |-- baseline_gap.py  <- B
|   |-- instability.py   <- I
|   |-- null_test.py     <- N (permutation test)
|   |-- metadata.py      <- M
|-- phase/
|   |-- embedding.py     <- S = N - I, E = B + M
|   |-- classifier.py    <- Reliable / Fragile / Void
|   |-- visualization.py <- phase diagram, clustering plot
|-- report/
|   |-- json_export.py
|   |-- latex_export.py
```
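
The tree above states the phase embedding as `S = N - I` and `E = B + M`. A minimal sketch of that arithmetic follows. The function name `compute_phase_sketch` and the `PhasePoint` container are hypothetical and are not the package's actual API. The thresholds that separate Reliable, Fragile, and Void are not given in this document, so classification is left out.

```python
from typing import NamedTuple


class PhasePoint(NamedTuple):
    """Hypothetical container for the two phase coordinates."""
    S: float
    E: float


def compute_phase_sketch(B: float, I: float, N: float, M: float) -> PhasePoint:
    """Map the four BRF components to (S, E) using S = N - I and E = B + M."""
    return PhasePoint(S=N - I, E=B + M)


# Illustrative component values, not results from any real dataset.
point = compute_phase_sketch(B=0.25, I=0.125, N=0.75, M=0.5)
print(point)
```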

## Steps

### Phase 1: Package skeleton (1-2 weeks)
- [x] Initialize Python project with `pyproject.toml`
- [x] Implement `BRFAnalyzer` main class with fit/predict interface
- [x] Port `compute_b`, `compute_i`, `compute_n`, `compute_m` from BehaviorAudit
- [x] Write unit tests for each metric
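
To make the metric names concrete, the sketch below computes rough analogues of B, I, and N with plain scikit-learn. These definitions are assumptions for illustration only: B as the accuracy gap over a majority-class baseline, I as the standard deviation of scores across grouped splits, and N as the p-value from `permutation_test_score`. The actual `compute_b`, `compute_i`, and `compute_n` functions may be defined differently, and M (metadata) is omitted because this document does not describe it.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    GroupShuffleSplit,
    cross_val_score,
    permutation_test_score,
)

# Synthetic data with real signal in the first feature (illustrative only).
rng = np.random.default_rng(0)
n_samples, n_groups = 200, 20
X = rng.normal(size=(n_samples, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=n_samples) > 0).astype(int)
groups = rng.integers(0, n_groups, size=n_samples)

# Grouped random splits keep all rows of one group on the same side.
cv = GroupShuffleSplit(n_splits=10, test_size=0.25, random_state=0)
model = LogisticRegression()

model_scores = cross_val_score(model, X, y, groups=groups, cv=cv)
dummy_scores = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X, y, groups=groups, cv=cv
)

# Assumed analogue of B: mean accuracy gain over the trivial baseline.
baseline_gap = model_scores.mean() - dummy_scores.mean()

# Assumed analogue of I: spread of the score across splits.
instability = model_scores.std()

# Assumed analogue of N: permutation test of the score against shuffled labels.
score, perm_scores, p_value = permutation_test_score(
    model, X, y, groups=groups, cv=cv, n_permutations=50, random_state=0
)

print(f"gap={baseline_gap:.3f} instability={instability:.3f} p={p_value:.3f}")
```

The small `n_splits` and `n_permutations` values keep the sketch fast; the quick-start example uses 30 and 200.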

### Phase 2: Phase embedding + classification (1 week)
- [x] Implement `compute_phase(S, E)` and `classify_dataset(S, E)`
- [x] Build phase diagram visualization (matplotlib)
- [x] Test on all 7 datasets from BehaviorAudit; verify BRF output matches SR paper results

### Phase 3: Documentation + distribution (1-2 weeks)
- [x] Write README with quick-start tutorial and API docs
- [ ] Publish to TestPyPI -> PyPI
- [ ] Set up ReadTheDocs for auto-generated documentation
- [ ] Add GitHub Actions CI (test on Python 3.9-3.12)

### Phase 4: HuggingFace Hub integration (optional, 1 week)
- [ ] Add HF dataset loading wrapper
- [ ] Allow `brf.fit(dataset_id="OULAD")` shorthand

## Dependencies

- `numpy>=1.21`
- `scikit-learn>=1.0`
- `matplotlib>=3.5`
- No deep learning dependencies required

## Relationship to Sister Repos

- `BehaviorAudit/`: source of the audit logic; this package refactors and generalizes it
- `LLMScoringAudit/`: first applied use case (MM-TBA x multiple LLMs)
- `BenchmarkPhase/`: large-scale application (BRF leaderboard across 30 datasets)
- `llm-annotation/`: cited for complementary MLLM pseudo-label reliability findings

## Target Journal

- Journal of Open Source Software (JOSS) - tool paper, lightweight submission
- Followed by application papers in Computers & Education (C&E) / British Journal of Educational Technology (BJET)

## Timeline

- Phase 1-2: 2-3 weeks
- Phase 3: 1-2 weeks
- Phase 4: optional
- JOSS submission: after Phase 3
