Metadata-Version: 2.1
Name: benchmark-reliability
Version: 0.1.0
Summary: Benchmark Reliability Framework (BRF) =?unknown-8bit?b?4oCU?= dataset-level reliability auditing for predictive benchmarks
Author-email: zhanglizhuo <zhanglizhuo@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/zhanglizhuo/BenchmarkReliability
Project-URL: Repository, https://github.com/zhanglizhuo/BenchmarkReliability
Keywords: benchmark reliability,dataset auditing,educational AI,machine learning
Classifier: Development Status :: 3 - Alpha
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: numpy (>=1.21)
Requires-Dist: scikit-learn (>=1.0)
Requires-Dist: matplotlib (>=3.5)

# BenchmarkReliability ��� BRF Python Package

## Target

Provide a standardized, pip-installable Python package that computes the Benchmark Reliability Framework (BRF) for any predictive dataset, enabling researchers to run the four-dimension audit protocol with a single API call.

## Method

The package wraps the core logic from the BehaviorAudit project into a sklearn-style API:

```python
from brf import BRFAnalyzer
from brf.phase import plot_phase_diagram
from brf.report import export_json

analyzer = BRFAnalyzer(n_splits=30, n_permutations=200).fit(X, y, groups=groups)
print(analyzer.brf_vector)   # (B, I, N, M) ��� (S, E) ��� class

# Visualization
plot_phase_diagram(
    [analyzer.S], [analyzer.E],
    labels=[analyzer.class_],
    classes=[analyzer.class_],
)

# Export
export_json(analyzer.brf_vector, "results.json")
```

## Package Structure

```
brf/
��������� __init__.py
��������� analyzer.py          ��� BRFAnalyzer main class
��������� metrics/
���   ��������� baseline_gap.py  ��� B
���   ��������� instability.py   ��� I
���   ��������� null_test.py     ��� N (permutation test)
���   ��������� metadata.py      ��� M
��������� phase/
���   ��������� embedding.py     ��� S = N - I, E = B + M
���   ��������� classifier.py    ��� Reliable / Fragile / Void
���   ��������� visualization.py ��� phase diagram, clustering plot
��������� report/
���   ��������� json_export.py
���   ��������� latex_export.py
```

## Steps

### Phase 1: Package skeleton (1-2 weeks)
- [x] Initialize Python project with `pyproject.toml`
- [x] Implement `BRFAnalyzer` main class with fit/predict interface
- [x] Port `compute_b`, `compute_i`, `compute_n`, `compute_m` from BehaviorAudit
- [x] Write unit tests for each metric

### Phase 2: Phase embedding + classification (1 week)
- [x] Implement `compute_phase(S, E)` and `classify_dataset(S, E)`
- [x] Build phase diagram visualization (matplotlib)
- [x] Test on all 7 datasets from BehaviorAudit; verify BRF output matches SR paper results

### Phase 3: Documentation + distribution (1-2 weeks)
- [x] Write README with quick-start tutorial and API docs
- [ ] Publish to TestPyPI ��� PyPI
- [ ] Set up ReadTheDocs for auto-generated documentation
- [ ] Add GitHub Actions CI (test on Python 3.9���3.12)

### Phase 4: HuggingFace Hub integration (optional, 1 week)
- [ ] Add HF dataset loading wrapper
- [ ] Allow `brf.fit(dataset_id="OULAD")` shorthand

## Dependencies

- `numpy>=1.21`
- `scikit-learn>=1.0`
- `matplotlib>=3.5`
- No deep learning dependencies required

## Relationship to Sister Repos

- `BehaviorAudit/`: source of the audit logic; this package refactors and generalizes it
- `LLMScoringAudit/`: first applied use case (MM-TBA �� multiple LLMs)
- `BenchmarkPhase/`: large-scale application (30 datasets BRF leaderboard)
- `llm-annotation/`: cited for complementary MLLM pseudo-label reliability findings

## Target Journal

- Journal of Open Source Software (JOSS) ��� tool paper, lightweight submission
- Followed by application papers in C&E / BJET

## Timeline

- Phase 1���2: 3 weeks
- Phase 3: 2 weeks
- Phase 4: optional
- JOSS submission: after Phase 3
