Metadata-Version: 2.4
Name: zerofolio
Version: 0.1.0
Summary: Algorithm selection with zero domain knowledge via text embeddings
Author-email: Stefan Szeider <sz@ac.tuwien.ac.at>
License-Expression: Apache-2.0
License-File: LICENSE
Keywords: ASlib,AutoML,algorithm selection,embeddings
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Requires-Dist: numpy>=1.21
Description-Content-Type: text/markdown

# ZeroFolio

Algorithm selection with zero domain knowledge via text embeddings.

ZeroFolio selects the best algorithm for a problem instance using a three-step pipeline:

1. **Serialize** — read the raw instance file as plain text (`zf.serialize`)
2. **Embed** — pass the text through any pretrained embedding model (user-provided)
3. **Select** — pick the best algorithm via weighted k-nearest neighbors (`zf.KNNSelector`)

No feature engineering, no domain expertise, no training required.
ZeroFolio handles serialization and selection; you bring your own embedding API.

## Installation

```bash
pip install zerofolio
```

## Verify Installation

A quick smoke test with synthetic data (no API key needed):

```python
import numpy as np
import zerofolio as zf

# Synthetic embeddings and runtimes for 50 instances, 4 algorithms
X_train = np.random.rand(50, 128)
rt_train = np.random.rand(50, 4) * 100

selector = zf.KNNSelector(k=5)
selector.fit(X_train, rt_train)

X_test = np.random.rand(3, 128)
print(selector.predict(X_test))  # e.g. [2 0 3]
```

## Quick Start (with Gemini Embeddings)

A full example using Google's Gemini embedding model.
Requires: `pip install google-genai`

```python
import os
import numpy as np
import zerofolio as zf
from google import genai

# --- Step 1: Serialize instance files ---
# Reads the file as plain text, shuffles lines, truncates to budget.
text = zf.serialize("instance.cnf", strategy="line_shuffle", max_chars=10000, seed=42)

# --- Step 2: Embed (user-provided) ---
# ZeroFolio does not call any API itself. You provide an embedding function.
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

def embed(text):
    result = client.models.embed_content(
        model="gemini-embedding-001",
        contents=text,
    )
    return result.embeddings[0].values

# Embed all your training instances into numpy arrays:
# for each instance file, call zf.serialize() then embed().
# train_files:      list of paths to your training instance files (you supply this)
# train_embeddings: np.array of shape (n_train, 3072)
# train_runtimes:   np.array of shape (n_train, n_algorithms) (you supply this)
#     Each row gives the runtime (seconds) of each algorithm on that instance.
#     Use PAR10 values (10× cutoff for unsolved) for best results.
train_embeddings = np.array([embed(zf.serialize(f)) for f in train_files])

# --- Step 3: Select via k-NN ---
selector = zf.KNNSelector(k=10, metric="manhattan")
selector.fit(train_embeddings, train_runtimes)

test_embedding = np.array([embed(text)])
best_algo_idx = selector.predict(test_embedding)
print(f"Selected algorithm index: {best_algo_idx[0]}")
```
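
The Quick Start comments recommend PAR10 values for `train_runtimes`. A minimal sketch of that conversion follows, assuming unsolved runs are recorded as `NaN` or as a runtime at or above the cutoff. The helper name `par10` and the sample numbers are illustrative and not part of ZeroFolio.

```python
import numpy as np

def par10(runtimes, cutoff):
    """Return a copy of `runtimes` with unsolved runs set to 10 * cutoff.

    Assumption: a run counts as unsolved if it is NaN or reaches the cutoff.
    """
    rt = np.asarray(runtimes, dtype=float)
    unsolved = np.isnan(rt) | (rt >= cutoff)
    return np.where(unsolved, 10.0 * cutoff, rt)

# Two instances, three algorithms, cutoff of 5000 seconds (made-up data).
raw = np.array([[12.5, np.nan, 300.0],
                [5000.0, 41.0, 7.2]])
train_runtimes = par10(raw, cutoff=5000.0)
print(train_runtimes)
```

Adapt the `unsolved` mask if your benchmark marks timeouts or crashes differently.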

### Using Algorithm Names

Pass algorithm names to get predictions as strings instead of indices:

```python
algos = ["minisat", "glucose", "cadical", "kissat"]
selector = zf.KNNSelector(k=10, algorithm_names=algos)
selector.fit(train_embeddings, train_runtimes)

print(selector.predict(test_embedding))  # e.g. ['glucose']
```

## Multi-Seed Voting

Different random seeds expose different parts of the instance file to the embedding model.
Average the scores across seeds for more robust selection.
(Assumes `embed` and `selector` are set up as in the Quick Start above.)

```python
scores = []
for seed in [42, 100, 500]:
    text = zf.serialize("instance.cnf", seed=seed)
    emb = np.array([embed(text)])
    scores.append(selector.predict_scores(emb))  # shape (1, n_algorithms)

avg_scores = np.mean(scores, axis=0)  # shape (1, n_algorithms)
best_algo_idx = int(np.argmin(avg_scores[0]))
```
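
To see what the averaging step does, here is a small worked example with hand-written score arrays standing in for `selector.predict_scores` output. The numbers and algorithm names are hypothetical. Lower scores are better, matching the `argmin` above.

```python
import numpy as np

# Hypothetical per-seed scores for one instance, each of shape (1, n_algorithms).
scores = [
    np.array([[30.0, 10.0, 50.0]]),  # seed 42: algorithm 1 looks best
    np.array([[12.0, 20.0, 40.0]]),  # seed 100: algorithm 0 looks best
    np.array([[15.0, 25.0, 45.0]]),  # seed 500: algorithm 0 looks best
]

avg_scores = np.mean(scores, axis=0)           # shape (1, n_algorithms)
best_algo_idx = int(np.argmin(avg_scores[0]))  # index of the lowest average

algos = ["minisat", "glucose", "cadical"]      # illustrative names
print(avg_scores[0], algos[best_algo_idx])
```

In this example a per-seed majority vote would pick index 0, while the averaged scores pick index 1, because averaging also accounts for how large each seed's margin is.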

## API Reference

**`zf.serialize(path, strategy, max_chars, seed)`**
Serialize an instance file to text. Works with any text-based format (CNF, WCNF, QDIMACS, MiniZinc, ASP, MPS, etc.). Gzipped files (`.gz`) are handled automatically.

- `strategy`: `"line_shuffle"` (default, recommended) or `"raw"`. Use `zf.list_strategies()` to see all options.
- `max_chars`: Character budget for truncation (default 10,000). The effective context window of Gemini Embedding models is approximately 2,048 tokens.
- `seed`: Random seed for line shuffling (default 42). Different seeds yield different views of the same instance. Ignored by the `"raw"` strategy.
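
The idea behind `"line_shuffle"` (shuffle the lines, then truncate to the character budget) can be sketched in plain Python. This is an illustration under that description only, not ZeroFolio's actual implementation, which may treat headers, comments, or gzip input differently. The function name `line_shuffle_sketch` is hypothetical.

```python
import random

def line_shuffle_sketch(text, max_chars=10_000, seed=42):
    """Shuffle the lines of `text` with a seeded RNG, then truncate."""
    lines = text.splitlines()
    random.Random(seed).shuffle(lines)     # same seed gives the same order
    return "\n".join(lines)[:max_chars]    # keep at most max_chars characters

sample = "p cnf 3 2\n1 -3 0\n2 3 -1 0\n"
view_a = line_shuffle_sketch(sample, max_chars=20, seed=42)
view_b = line_shuffle_sketch(sample, max_chars=20, seed=42)
full = line_shuffle_sketch(sample, seed=7)
print(repr(view_a))
```

With a fixed seed the view is reproducible, and with a small budget each seed keeps a different subset of lines, which is what the multi-seed approach relies on.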

**`zf.KNNSelector(k, metric, algorithm_names)`**
Weighted k-NN algorithm selector using inverse-distance weighting.

- `k`: Number of neighbors (default 10). Must not exceed the number of training instances.
- `metric`: `"manhattan"` (default, recommended) or `"cosine"`.
- `algorithm_names`: Optional list of algorithm name strings. When provided, `predict()` returns a list of name strings; otherwise it returns a numpy array of column indices.

Methods:
- `.fit(embeddings, runtimes)` — fit on training embeddings and runtime matrix. Raises `ValueError` if `k` exceeds the number of training instances.
- `.predict(embeddings)` — return best algorithm per instance (names or indices)
- `.predict_scores(embeddings)` — return per-algorithm weighted-average scores, shape `(n_test, n_algorithms)`
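
For intuition, the scoring described above (inverse-distance weighted k-NN with Manhattan distance) can be sketched in NumPy. This is a minimal sketch under stated assumptions, not the library's code: the function name `knn_scores_sketch` and the `eps` term that guards against division by zero are assumptions, and the real selector may normalize or break ties differently.

```python
import numpy as np

def knn_scores_sketch(X_train, rt_train, X_test, k=10, eps=1e-9):
    """Per-algorithm weighted-average runtimes, shape (n_test, n_algorithms)."""
    X_train = np.asarray(X_train, dtype=float)
    rt_train = np.asarray(rt_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    if k > len(X_train):
        raise ValueError("k exceeds the number of training instances")
    # Manhattan distance between every test and training point: (n_test, n_train)
    dist = np.abs(X_test[:, None, :] - X_train[None, :, :]).sum(axis=2)
    nn = np.argsort(dist, axis=1)[:, :k]                     # k nearest, (n_test, k)
    w = 1.0 / (np.take_along_axis(dist, nn, axis=1) + eps)   # inverse-distance weights
    w /= w.sum(axis=1, keepdims=True)                        # weights sum to 1 per row
    # Weighted average of the neighbours' runtime rows.
    return np.einsum("tk,tka->ta", w, rt_train[nn])

X_train = [[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]]
rt_train = [[1.0, 100.0], [2.0, 90.0], [500.0, 3.0]]
scores = knn_scores_sketch(X_train, rt_train, [[0.5, 0.0]], k=2)
best = int(np.argmin(scores[0]))  # lowest predicted runtime wins
print(scores, best)
```

The test point is equally close to the first two training instances, so their runtimes are averaged with equal weight and the distant third instance is ignored.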

**`zf.list_strategies()`**
Return the list of available serialization strategy names.

## Persistence

ZeroFolio selectors can be saved and loaded with `pickle`:

```python
import pickle

with open("selector.pkl", "wb") as f:
    pickle.dump(selector, f)

with open("selector.pkl", "rb") as f:
    selector = pickle.load(f)
```

## Citation

```bibtex
@article{szeider2026zerofolio,
  title={Algorithm Selection with Zero Domain Knowledge via Text Embeddings},
  author={Szeider, Stefan},
  year={2026}
}
```

## License

Apache 2.0
