Metadata-Version: 2.4
Name: evoforest-tab
Version: 0.1.0
Summary: Evolved universal tabular feature map + closed-form ridge: an interpretable, training-free, local in-context learner for tabular data.
License: MIT
Keywords: tabular,in-context-learning,feature-map,tabpfn,ridge,automl
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=1.13
Requires-Dist: numpy>=1.21
Requires-Dist: pyyaml>=5.4
Provides-Extra: sklearn
Requires-Dist: scikit-learn>=1.0; extra == "sklearn"
Provides-Extra: examples
Requires-Dist: scikit-learn>=1.0; extra == "examples"
Requires-Dist: pandas>=1.3; extra == "examples"
Dynamic: license-file

# tabmap — EvoForest-Tab: an evolved universal tabular feature map

`tabmap` is the reference implementation of **EvoForest-Tab** (the EvoForest computation-search framework specialized to tabular data).

`tabmap` is an interpretable, training-free, **local** in-context learner for tabular data: an
evolved universal feature map `φ: row → ℝᴷ` (16 transform families over rank-gauss, count-encoding,
and categorical-mask channels) paired with a per-dataset **closed-form Bayesian-ridge head**. Given a
labeled *support* set and an unlabeled *query* set, it predicts in a single SVD solve — no gradient
descent, no per-dataset tuning, no GPU. It is competitive with gradient boosting and with the
published **TabPFN-v2** tabular foundation model, while remaining free to run and fully inspectable.

This repository accompanies the paper *"Evolving a Universal Tabular Feature Map: Interpretable,
Closed-Form In-Context Learning Competitive with Tabular Foundation Models"* and is **stand-alone**:
the deployment pipeline (feature map + ridge) depends only on `torch`, `numpy`, and `pyyaml`.

## Install
```bash
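pip install -e .               # editable install; or: pip install .
pip install -e ".[sklearn]"    # optional: adds scikit-learn (estimator base classes)
pip install -e ".[examples]"   # optional: adds scikit-learn + pandas for the examples
# core deps: torch, numpy, pyyaml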
```

## Usage (scikit-learn style)
```python
from evoforest_tab import TabMapClassifier, TabMapRegressor

clf = TabMapClassifier(n_estimators=6).fit(X_support, y_support)   # X: ndarray or DataFrame
proba = clf.predict_proba(X_query)                                  # in-context: query needed to fit φ channels
pred  = clf.predict(X_query)

reg = TabMapRegressor(n_estimators=6).fit(X_support, y_support)
yhat = reg.predict(X_query)
```
Notes:
- It is an **in-context** learner: `predict` builds the (label-free, transductive) channels over the
  pooled support+query rows, so the query rows are needed at prediction time (as with TabPFN).
- `n_estimators` is the random-feature ensemble size (averaged decorrelated seed-variants of `φ`);
  `n_estimators=1` is the single map, `6` is the paper default (variance reduction toward the kernel limit).
- `cat_features=[...]` marks categorical columns (indices or DataFrame names); omitted → auto-detected.
- No class-count ceiling (unlike TabPFN-v2's ≤10 classes); runs on CPU in milliseconds.
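
To make the first note concrete, here is a minimal sketch of a label-free, transductive channel: a rank-gauss transform computed over the pooled support and query rows. It illustrates the general technique only; the function name `rank_gauss` and the tie and NaN handling are assumptions for illustration, not the library's actual `_channels.py` code.

```python
import numpy as np
from statistics import NormalDist


def rank_gauss(column: np.ndarray) -> np.ndarray:
    """Map a numeric column to approximately standard-normal scores via its ranks.

    Hypothetical helper for illustration; assumes no NaNs in `column`.
    """
    n = column.shape[0]
    # argsort of argsort yields each value's 0-based rank (ties broken by position).
    ranks = np.argsort(np.argsort(column, kind="stable"), kind="stable")
    # Shift ranks into the open interval (0, 1) so the inverse CDF stays finite.
    quantiles = (ranks + 0.5) / n
    inv_cdf = NormalDist().inv_cdf
    return np.array([inv_cdf(float(q)) for q in quantiles])


# Pool support and query rows, transform without using any labels, then split again.
x_support = np.array([3.0, 100.0, 7.0, 5.0])
x_query = np.array([4.0, 6.0])
pooled = np.concatenate([x_support, x_query])
z = rank_gauss(pooled)
z_support, z_query = z[: len(x_support)], z[len(x_support):]
```

Because the ranks are taken over the pooled rows, the query values influence the support scores and vice versa, which is why the query set has to be available when the channels are built.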

## What's inside
```
tabmap/
  _channels.py   raw rows  -> input channels (col-z, rank-gauss, count-encoding, categorical mask), nan-safe
  _genome.py     evaluate the evolved genome (champion.yaml) -> feature matrix Phi; seed-variants for the ensemble
  _ridge.py      closed-form Bayesian-ridge head (evidence-maximized lambda), single SVD solve
  estimator.py   TabMapClassifier / TabMapRegressor (sklearn API) + K-seed ensemble
  champion.yaml  the evolved 16-family genome (the deployment artifact)
examples/quickstart.py
reproduce/        scripts + cached TabPFN-v2 predictions to reproduce the paper's experiments
tests/
```
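
As a rough illustration of what a "single SVD solve" means for the ridge head, the sketch below computes closed-form ridge weights from one thin SVD of a feature matrix. It assumes a fixed regularization strength `lam`; the evidence-maximized choice of lambda described above is not shown, and `ridge_svd` is a hypothetical name rather than the function in `_ridge.py`.

```python
import numpy as np


def ridge_svd(phi: np.ndarray, y: np.ndarray, lam: float) -> np.ndarray:
    """Ridge weights w = (Phi^T Phi + lam I)^{-1} Phi^T y, computed from one thin SVD."""
    U, s, Vt = np.linalg.svd(phi, full_matrices=False)
    # Each singular direction is shrunk by the factor s / (s^2 + lam).
    shrink = s / (s**2 + lam)
    return Vt.T @ (shrink * (U.T @ y))


# Synthetic stand-ins for the support/query feature matrices (illustrative only).
rng = np.random.default_rng(0)
phi_support = rng.normal(size=(50, 8))
y_support = phi_support @ rng.normal(size=8) + 0.1 * rng.normal(size=50)
phi_query = rng.normal(size=(5, 8))

w = ridge_svd(phi_support, y_support, lam=1.0)
y_query_hat = phi_query @ w
```

One reason to go through the SVD is that the decomposition can be reused: trying a different `lam` only changes the `shrink` vector, so many candidate values can be scored without refactorizing.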

## Reproducing the paper
See [`reproduce/README.md`](reproduce/README.md). The cached TabPFN-v2 cloud predictions are included
so the head-to-head and routing experiments reproduce **without** any API key.

## Contributing this method upstream
`tabmap` is designed to drop into the tabular ML ecosystem. Best integration targets (most aligned first):

| Repo | Why it fits | Integration |
|---|---|---|
| **PriorLabs/tabpfn-extensions** | community extensions around TabPFN; our method is a free/local **complementary** in-context learner and a natural **cost-aware router** companion (route hard datasets to TabPFN, the rest to `tabmap`) | add as an extension module + a routing utility (`sklearn`-compatible) |
| **scikit-learn-contrib** | `TabMapClassifier`/`TabMapRegressor` already follow the estimator API | publish as a standalone `scikit-learn-contrib` project |
| **skrub** (ex dirty-cat) | tabular feature engineering / encoders; our channels (rank-gauss, count-encoding) + `φ` are a drop-in `TransformerMixin` featurizer | contribute `TabMapEncoder` (transform-only) |
| **pyg-team/pytorch-frame** | deep tabular; `φ` is a fixed featurizer usable as an input stem | add as an `encoder`/`stype` transform |
| **autogluon / TabArena** | leaderboard model implementations | submit `tabmap` as a model for the TabArena living benchmark |

The estimator's sklearn-compatible surface (`fit`/`predict`/`predict_proba`, `get_params`) is the
contribution-ready API; the transform-only `build_channels`+`build_phi` path serves the encoder use-cases.


## Combining with a foundation model (e.g. TabPFN)
`StackedTabularEnsemble` combines TabMap with any in-context base model (such as TabPFN's client) into a
single, stronger predictor. This reflects the paper's complementarity result: our map tends to win on
classification and TabPFN on regression, and combining them beats either alone. Three methods are available:
- `blend`: a fixed 50/50 average.
- `compwt`: label-free; weights each model by its support-cross-validated competence.
- `meta`: a learned ridge head over the models' out-of-fold support predictions; the most robust option.

All three are leakage-safe and in-context: the weights or head are fit on the support set, and no query
labels are used.

```python
from evoforest_tab import TabMapClassifier, StackedTabularEnsemble
from tabpfn_client import TabPFNClassifier            # or any sklearn-surface in-context model

ens = StackedTabularEnsemble(
        [TabMapClassifier(n_estimators=6), TabPFNClassifier()],
        task="classification", method="meta",          # "meta" | "compwt" | "blend"
      ).fit(X_support, y_support)
proba = ens.predict_proba(X_query)
```
The learned head (`meta`) is robust whether the two models are evenly matched or one dominates; the
label-free `compwt` is a close, deployable second with no meta-learner. See `examples/combine_tabpfn.py`.
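
For intuition about the two simpler methods, here is a small sketch of averaging class probabilities with equal weights (as in `blend`) or with weights derived from per-model competence scores (in the spirit of `compwt`). The helper `combine_probas`, the example scores, and the linear normalization of scores into weights are all assumptions for illustration; the library's actual weighting rule may differ.

```python
import numpy as np


def combine_probas(probas, scores=None):
    """Weighted average of per-model class-probability arrays.

    probas: list of (n_query, n_classes) arrays, one per model.
    scores: optional per-model competence scores estimated on the support set
            (e.g. cross-validated accuracy); None gives an equal-weight blend.
    """
    stacked = np.stack(probas)  # shape: (n_models, n_query, n_classes)
    if scores is None:
        weights = np.full(len(probas), 1.0 / len(probas))
    else:
        scores = np.asarray(scores, dtype=float)
        weights = scores / scores.sum()  # assumption: simple linear normalization
    # Contract the model axis: sum_m weights[m] * stacked[m].
    return np.tensordot(weights, stacked, axes=1)


# Toy query-set probabilities from two models (made-up numbers).
p_a = np.array([[0.9, 0.1], [0.4, 0.6]])
p_b = np.array([[0.7, 0.3], [0.2, 0.8]])

blend = combine_probas([p_a, p_b])                      # equal weights
compwt = combine_probas([p_a, p_b], scores=[0.9, 0.6])  # weights 0.6 and 0.4
```

Since the weights are non-negative and sum to one, each combined row remains a valid probability distribution.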

## Citation
If you use this library, please cite the accompanying paper *"Evolving a Universal Tabular Feature Map:
Interpretable, Closed-Form In-Context Learning Competitive with Tabular Foundation Models."* (anonymized
for review; see `../tabular_paper/`).

## License
MIT (see `LICENSE`).
