Metadata-Version: 2.4
Name: phobic
Version: 0.2.0
Summary: Fast minimal perfect hash functions
Author-email: Alexander Towell <lex@metafunctor.com>
License-Expression: MIT
Project-URL: Repository, https://github.com/queelius/phobic
Keywords: hash,perfect-hash,phf,minimal-perfect-hash,hashing
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: C
Classifier: Operating System :: POSIX
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development :: Libraries
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Dynamic: license-file

# phobic

Fast minimal perfect hash functions for Python.

A **perfect hash function** maps a known set of *n* keys to distinct integers in [0, *m*) with no collisions. Build it once from your key set, then every lookup is O(1). Implemented in C11 with no runtime dependencies.
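As a toy illustration of the definition (not phobic's construction; `toy_phf` and the SHA-256 seeding are stand-ins), a perfect seed can be brute-forced for a tiny key set:

```python
import hashlib

def h(key: bytes, seed: int, m: int) -> int:
    """Seeded hash into [0, m): SHA-256 over seed || key."""
    digest = hashlib.sha256(seed.to_bytes(4, "big") + key).digest()
    return int.from_bytes(digest[:8], "big") % m

def toy_phf(keys, m):
    """Brute-force a seed under which every key lands in a distinct slot."""
    for seed in range(1_000_000):
        if len({h(k, seed, m) for k in keys}) == len(keys):
            return seed
    raise RuntimeError("no perfect seed found")

keys = [b"alice", b"bob", b"charlie", b"diana"]
seed = toy_phf(keys, m=4)               # m == n: a *minimal* perfect hash
slots = sorted(h(k, seed, 4) for k in keys)
assert slots == [0, 1, 2, 3]            # every key gets its own slot
```

Brute force only works at toy scale (the odds of a random seed being perfect shrink factorially with *n*); real PHF constructions like phobic's split keys into buckets and search per-bucket seeds instead.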

## Install

```bash
pip install phobic
```

## Usage

```python
import phobic

# Build a perfect hash function
keys = ["alice", "bob", "charlie", "diana"]
phf = phobic.build(keys)

# Query: returns a unique integer per key
phf["alice"]    # e.g., 2
phf["bob"]      # e.g., 0

# Properties
phf.num_keys      # 4
phf.range_size    # 8 (with default alpha=1.0)
phf.bits_per_key  # ~1.0
phf.collisions    # 0
phf.is_perfect    # True

# Persistence
data = phf.to_bytes()
phf2 = phobic.from_bytes(data)
```

### Build with slot map

When you will immediately fill a slot array indexed by `phf[k]` for every key (a common pattern in PHF-backed retrieval and cipher-map structures), `build_with_slots` returns the slot list as a by-product of the build, saving a redundant lookup pass:

```python
phf, slots = phobic.build_with_slots(keys)
# slots[i] == phf[keys[i]] for every i
table = [None] * phf.range_size
for i, s in enumerate(slots):
    table[s] = values[i]
```

### Options

```python
# Closer to minimal: 5% overhead instead of 100% (slower build, may need more retries)
phf = phobic.build(keys, alpha=0.05)

# Fixed seed for reproducibility
phf = phobic.build(keys, seed=42)

# Control retry budget (default 100)
phf = phobic.build(keys, max_retries=200)

# Average keys per bucket. None (default) picks ceil(log2 N), which
# minimizes bits/key. Smaller values (e.g. 8) build dramatically
# faster at large N, at the cost of ~2x bits/key: at N=100K,
# bucket_size=8 builds ~6x faster than the default.
phf = phobic.build(keys, bucket_size=8)
```

### Non-perfect builds

By default, `build()` raises `RuntimeError` if no perfect hash is found within the retry budget. With `strict=False`, it always returns the best result found:

```python
phf = phobic.build(keys, alpha=0.05, strict=False)
phf.is_perfect   # False if no perfect build was found
phf.collisions   # number of keys placed in already-occupied slots
```

The builder tries multiple seeds and gradually widens `alpha` across retries, keeping the attempt with the fewest collisions.
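That strategy can be sketched in pure Python (a simplified model, not the C implementation; the hash and the exact widening schedule here are illustrative):

```python
import hashlib

def slot(key: bytes, seed: int, m: int) -> int:
    """Seeded hash into [0, m)."""
    d = hashlib.sha256(seed.to_bytes(4, "big") + key).digest()
    return int.from_bytes(d[:8], "big") % m

def build_best(keys, alpha=0.05, max_retries=100):
    """Try one seed per retry, widening the table slightly each time;
    keep the attempt with the fewest collisions, stopping early at zero."""
    n = len(keys)
    best = None  # (collisions, seed, m)
    for attempt in range(max_retries):
        # Widen alpha gradually: 5% overhead at first, up to ~10% by the end.
        m = int(n * (1 + alpha * (1 + attempt / max_retries)))
        seen, collisions = set(), 0
        for k in keys:
            s = slot(k, attempt, m)
            if s in seen:
                collisions += 1  # key placed in an already-occupied slot
            seen.add(s)
        if best is None or collisions < best[0]:
            best = (collisions, attempt, m)
        if collisions == 0:
            break
    return best

keys = [f"key_{i}".encode() for i in range(200)]
collisions, seed, m = build_best(keys)
assert collisions >= 0 and m >= len(keys)
```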

## Space Efficiency

PHOBIC achieves ~1 bit/key for the hash structure at 100k keys. When used as a filter (PHF + n-bit fingerprints), total space is `bpk + log2(1/e)` bits/key for false positive rate `e`. This beats Bloom filters (`1.44 * log2(1/e)` bits/key) whenever `e` is small enough that the PHF overhead is absorbed; at ~1 bit/key of overhead, that is any `e` below roughly 20%.
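To make the arithmetic concrete, here is the comparison at `e = 0.1%`, plugging into the two formulas above (plain Python, no phobic required):

```python
import math

def phf_filter_bits(bpk: float, e: float) -> float:
    """PHF-backed filter: structure overhead plus log2(1/e) fingerprint bits."""
    return bpk + math.log2(1 / e)

def bloom_bits(e: float) -> float:
    """Bloom filter space at false positive rate e: 1.44 * log2(1/e) bits/key."""
    return 1.44 * math.log2(1 / e)

e = 0.001                                 # 0.1% false positive rate
print(round(phf_filter_bits(1.0, e), 2))  # 10.97 bits/key
print(round(bloom_bits(e), 2))            # 14.35 bits/key
```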

## Performance

C11 core, single-threaded. The GIL is released during construction so other Python threads can run concurrently. Typical build times: ~0.5 us/key at n=1k, ~1.8 us/key at n=100k. Query throughput: ~2M queries/sec from Python (~500 ns/query including str-to-bytes encoding).

## Scale: partitioned builds

For n above ~500K, single-threaded pilot search becomes the bottleneck; at n=1M the default parameters take over two minutes, and above ~2M the build can fail entirely. Use `build_partitioned` to shard across cores:

```python
import phobic

keys = [f"key_{i}".encode() for i in range(10_000_000)]

# Uses ThreadPoolExecutor + the GIL-releasing C build to parallelize.
phf = phobic.build_partitioned(keys)

phf.num_keys      # 10_000_000
phf.num_shards    # 666 (auto: 15K keys per shard)
phf.bits_per_key  # ~1.18 (per-shard headers + partition wrapper)
phf[b"key_42"]    # unique slot in [0, phf.range_size)

# Wire-serializable with its own magic bytes.
data = phf.to_bytes()
phf2 = phobic.PartitionedPHF.from_bytes(data)
```

Measured speedup (8-core machine):

| n | serial | partitioned | speedup |
|---:|-------:|-----------:|--------:|
| 100K | 0.10 s | 0.18 s | 0.6x (threading overhead wins at small n) |
| 500K | 1.08 s | 0.94 s | 1.2x |
| 1M | 133.5 s | 2.0 s | **67.7x** |
| 2M | fails | 4.0 s | serial unbuildable |
| 10M | - | 20.6 s | - |

At small `n` the per-shard metadata makes partitioning slightly slower. From roughly 500K up it wins decisively; at 1M+ it is often the only way to build at all. See [`src/phobic/partitioned.py`](src/phobic/partitioned.py) for the implementation.

## Demo

See [`demo.ipynb`](demo.ipynb) for an interactive walkthrough covering the API, alpha trade-offs, space efficiency, serialization, and a matplotlib plot of how collisions decrease with retry budget.

## License

MIT
