Metadata-Version: 2.4
Name: georf
Version: 0.1.0
Summary: Geometry-aware random forest with HVRT-powered generative diversity
Author: Jake Peace
License-Expression: AGPL-3.0-or-later
Project-URL: Homepage, https://github.com/jpeaceau/georf
Project-URL: Repository, https://github.com/jpeaceau/georf.git
Project-URL: Bug Tracker, https://github.com/jpeaceau/georf/issues
Keywords: random-forest,generative-diversity,synthetic-data,machine-learning,HVRT,ensemble
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy
Requires-Dist: scikit-learn
Requires-Dist: hvrt>=2.3.0
Requires-Dist: joblib
Provides-Extra: insights
Requires-Dist: matplotlib>=3.7; extra == "insights"
Requires-Dist: seaborn>=0.13; extra == "insights"
Requires-Dist: scipy>=1.11; extra == "insights"
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Provides-Extra: xgb
Requires-Dist: xgboost>=1.7; extra == "xgb"
Requires-Dist: optuna>=3.0; extra == "xgb"
Dynamic: license-file

# GeoRF — Geometry-Aware Random Forest

GeoRF replaces bootstrap resampling with **HVRT-powered generative diversity**.
Each tree trains on a completely unique synthetic dataset drawn from learned
per-partition kernel density estimates.  No tree ever sees a real sample.  No
two trees share a single training point.
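
As a rough illustration of the idea (not GeoRF's or HVRT's actual implementation), sampling from a Gaussian kernel density estimate amounts to picking a real point and adding Gaussian noise to it. The sketch below assumes the partitions are already given as plain lists of points and ignores how targets are assigned. `sample_partition_kde`, `synthetic_dataset`, the toy partitions, and the fixed bandwidth are all hypothetical choices made for this example.

```python
import random

def sample_partition_kde(points, n_samples, bandwidth, rng):
    """Draw n_samples from a Gaussian KDE centred on `points`."""
    samples = []
    for _ in range(n_samples):
        centre = rng.choice(points)  # pick one kernel centre uniformly
        samples.append(tuple(rng.gauss(c, bandwidth) for c in centre))
    return samples

def synthetic_dataset(partitions, n_samples, bandwidth, seed):
    """Build one tree's dataset, sampling each partition in proportion to its size."""
    rng = random.Random(seed)
    n_real = sum(len(p) for p in partitions)
    data = []
    for part in partitions:
        share = round(n_samples * len(part) / n_real)
        data.extend(sample_partition_kde(part, share, bandwidth, rng))
    return data

# Toy partitions (hypothetical; HVRT would learn these from the data).
partitions = [
    [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3)],
    [(5.0, 5.2), (5.1, 4.9)],
]
tree_a = synthetic_dataset(partitions, n_samples=10, bandwidth=0.05, seed=0)
tree_b = synthetic_dataset(partitions, n_samples=10, bandwidth=0.05, seed=1)
```

Because every draw adds continuous noise, two differently seeded datasets will in practice never contain the same point, and no synthetic point coincides with a real one.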

## Why?

Bootstrap bagging has a diversity ceiling.  With *n* = 250 samples, each
bootstrap draw contains only ≈158 unique samples (a fraction 1 − 1/e ≈ 63.2%),
and every tree draws from the same 250 points.  GeoRF removes this ceiling: 100
trees × 500 samples = **50 000 unique synthetic training points**.
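
The ≈158 figure can be checked with a few lines of standard-library Python. This is a minimal sketch: `expected_unique` uses the closed form n · (1 − (1 − 1/n)ⁿ), and `simulated_unique` estimates the same quantity by repeated sampling with replacement. Both function names are made up for this example.

```python
import random

def expected_unique(n: int) -> float:
    """Expected number of distinct items in n draws with replacement from n items."""
    return n * (1.0 - (1.0 - 1.0 / n) ** n)

def simulated_unique(n: int, n_draws: int = 2000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_draws):
        total += len({rng.randrange(n) for _ in range(n)})
    return total / n_draws

n = 250
print(round(expected_unique(n), 1))   # closed form, about 158.2
print(round(simulated_unique(n), 1))  # simulation, close to the closed form
print(100 * 500)                      # synthetic points across 100 trees: 50000
```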

See [`benchmark/results/`](benchmark/results/) for full results comparing GeoRF
against Random Forest, Gradient Boosting, XGBoost, LightGBM, and MLP (sklearn +
PyTorch) on standard datasets.  To run the benchmarks yourself:

```bash
cd benchmark
pip install -r requirements.txt
python run_classification.py
python run_regression.py
```

## Install

```bash
pip install -e .          # editable install from a source checkout
# or
pip install georf
```

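**Requirements:** Python ≥ 3.10, `hvrt >= 2.3.0`, `scikit-learn`, `numpy`, `joblib`

**Optional extras:** `insights` (`matplotlib`, `seaborn`, `scipy`), `xgb` (`xgboost`, `optuna`), `dev` (`pytest`, `pytest-cov`).  Install one with, for example, `pip install "georf[insights]"`.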

## Quick start

```python
from georf import GeoRFClassifier, GeoRFRegressor

# Classification (X_train, y_train, X_test are arrays you supply)
clf = GeoRFClassifier(n_estimators=100, n_samples_per_tree=500, n_jobs=-1)
clf.fit(X_train, y_train)
y_pred  = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)   # (n_samples, n_classes)

# Regression (y_train holds continuous targets here)
reg = GeoRFRegressor(n_estimators=100, n_samples_per_tree=500, n_jobs=-1)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

# Interpretability (cols: list of feature names, one per column of X_train)
clf.feature_importances(feature_names=cols)  # dict, sorted descending
clf.tree_quality_scores()                    # per-tree AUC array
clf.diversity_score()                        # float: pairwise disagreement rate
clf.provenance()                             # dataset / expansion metadata
```
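
The exact computation behind `diversity_score()` is not documented here. One common reading of "pairwise disagreement rate" is the fraction of samples on which two trees predict different labels, averaged over all pairs of trees. A minimal sketch under that assumption follows; `pairwise_disagreement` is a hypothetical helper, not part of the GeoRF API.

```python
from itertools import combinations

def pairwise_disagreement(predictions):
    """Mean fraction of samples on which two trees disagree, over all tree pairs.

    `predictions` holds one equal-length sequence of predicted labels per tree.
    """
    pairs = list(combinations(predictions, 2))
    if not pairs:
        return 0.0  # fewer than two trees: nothing to compare
    total = 0.0
    for a, b in pairs:
        total += sum(x != y for x, y in zip(a, b)) / len(a)
    return total / len(pairs)

# Three toy trees predicting on four samples.
preds = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
]
score = pairwise_disagreement(preds)  # (1/4 + 2/4 + 1/4) / 3 = 1/3
```

A score of 0 means all trees agree on every sample; higher values indicate a more diverse ensemble.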

## Parameters

| Parameter | Default | Description |
|---|---|---|
| `n_estimators` | 100 | Number of trees |
| `n_samples_per_tree` | 500 | Synthetic samples per tree |
| `max_depth` | 6 | Max tree depth |
| `min_samples_leaf` | 5 | Min samples per leaf |
| `max_features` | None | Feature subsampling (`None` = all) |
| `bandwidth` | `'auto'` | HVRT KDE bandwidth (`'auto'` = per-partition auto-selection) |
| `n_jobs` | None | Number of parallel workers (`-1` = all cores) |
| `random_state` | 42 | Reproducibility seed |

## Running tests

```bash
pip install -e ".[dev]"   # pytest and pytest-cov via the dev extra
pytest tests/
# deselect tests marked "slow" (the timing test):
pytest tests/ -m "not slow"
```

## License

AGPL-3.0-or-later
