Metadata-Version: 2.4
Name: lumina-data
Version: 0.1.0.dev5
Summary: Python SDK for Lumina vector search engine
Home-page: https://github.com/aliyun/lumina
Author: Alibaba Storage Service Team
License: Apache-2.0
Project-URL: Homepage, https://github.com/aliyun/lumina
Classifier: Development Status :: 3 - Alpha
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Dynamic: home-page
Dynamic: requires-python

# lumina-data

Python SDK for the [Lumina](https://github.com/aliyun/lumina) vector search engine.
It provides low-overhead ctypes bindings to the Lumina C++ library for building and
searching vector indexes (DiskANN, Bruteforce, IVF).

## Requirements

- Linux x86_64
- Python >= 3.6

## Install

```bash
pip install .
```

Pre-built native libraries are bundled in the package. No compilation needed.

## Usage

### High-level API (list in, list out)

```python
from lumina_data import LuminaBuilder, LuminaSearcher

options = {
    "index.type": "diskann",
    "index.dimension": "128",
    "distance.metric": "l2",
    "encoding.type": "rawf32",
}

# Build
n, dim = 10000, 128
vectors = [...]  # list of n*dim floats
ids = list(range(n))

builder = LuminaBuilder(options)
builder.pretrain_from_list(vectors, n, dim)
builder.insert_from_list(vectors, ids, n, dim)
builder.dump("/path/to/index.lmi")
builder.close()

# Search
searcher = LuminaSearcher(options)
searcher.open("/path/to/index.lmi")

query = [0.1, 0.2, ...]  # list of dim floats
distances, labels = searcher.search_list(query, n=1, k=10)

for i in range(len(labels)):
    print("id=%d  distance=%.4f" % (labels[i], distances[i]))

searcher.close()
```
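The `vectors` placeholder in the example above stands for one flat list of `n * dim` floats. Below is a minimal sketch for producing such a list when experimenting, using only the standard library. `make_flat_vectors` is a hypothetical helper and not part of `lumina_data`. The layout (vector `i` occupying positions `i * dim` through `(i + 1) * dim - 1`) is an assumption inferred from the "list of n*dim floats" comment.

```python
import random


def make_flat_vectors(n, dim, seed=0):
    """Return n random vectors flattened into one list of n * dim floats.

    Hypothetical helper for illustration; not part of lumina_data.
    """
    rng = random.Random(seed)
    # Assumed layout: vector i occupies flat[i * dim : (i + 1) * dim].
    return [rng.random() for _ in range(n * dim)]


n, dim = 1000, 128
vectors = make_flat_vectors(n, dim)
ids = list(range(n))
first_vector = vectors[0:dim]  # vector 0 under the assumed layout
```

The resulting `vectors` and `ids` have the shapes that `pretrain_from_list` and `insert_from_list` are shown receiving above.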

### Raw ctypes API (zero-copy, for performance-critical code)

```python
import ctypes
from lumina_data import LuminaBuilder, LuminaSearcher

options = {
    "index.type": "diskann",
    "index.dimension": "128",
    "distance.metric": "l2",
    "encoding.type": "rawf32",
}

n, dim, k = 10000, 128, 10

# Build
# `data` is a flat sequence of n * dim floats supplied by the caller
vectors = (ctypes.c_float * (n * dim))(*data)
ids = (ctypes.c_uint64 * n)(*range(n))

with LuminaBuilder(options) as builder:
    builder.pretrain(vectors, n, dim)
    builder.insert(vectors, ids, n, dim)
    builder.dump("/path/to/index.lmi")

# Search
with LuminaSearcher(options) as searcher:
    searcher.open("/path/to/index.lmi")

    # `query_data` is a sequence of dim floats supplied by the caller
    query = (ctypes.c_float * dim)(*query_data)
    distances = (ctypes.c_float * k)()
    labels = (ctypes.c_uint64 * k)()

    searcher.search(query, 1, k, distances, labels,
                    {"diskann.search.list_size": "32"})

    for i in range(k):
        print("id=%d  distance=%.4f" % (labels[i], distances[i]))
```
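The `(ctypes.c_float * size)(*data)` pattern above copies every element out of a Python sequence. When the data already lives in a writable float32 buffer, the standard library can wrap it without copying. The sketch below uses only `ctypes` and `array`. The helper names are hypothetical, and whether a wrapped array behaves identically to a freshly allocated one when passed to `lumina_data` is an assumption that has not been verified here.

```python
import array
import ctypes


def list_to_c_floats(values):
    """Copy a Python sequence of floats into a new ctypes float array."""
    return (ctypes.c_float * len(values))(*values)


def buffer_to_c_floats(buf):
    """Wrap a writable float32 array.array as a ctypes array without copying."""
    if buf.itemsize != ctypes.sizeof(ctypes.c_float):
        raise ValueError("expected 4-byte float elements")
    return (ctypes.c_float * len(buf)).from_buffer(buf)


data = [0.5, 1.5, 2.5, 3.5]
copied = list_to_c_floats(data)       # independent copy of the values

backing = array.array("f", data)      # float32 storage owned by Python
shared = buffer_to_c_floats(backing)  # shares memory with `backing`
backing[0] = 9.0                      # the change is visible through `shared`
```

Keep `backing` alive and unresized for as long as `shared` is in use, since both refer to the same memory.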

### Filtered Search

```python
# High-level
distances, labels = searcher.search_with_filter_list(
    query, n=1, k=10, filter_ids=[0, 2, 4, 6, 8])

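# Raw ctypes (query_arr, distances, and labels are ctypes arrays
# allocated as in the raw ctypes example above)
filter_arr = (ctypes.c_uint64 * 5)(0, 2, 4, 6, 8)
searcher.search_with_filter(
    query_arr, 1, k, filter_arr, 5, distances, labels)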
```
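In the raw filtered call above, the hard-coded `5` has to match the number of filter IDs. Here is a small sketch that derives the array and its count together so they cannot drift apart. `to_filter_array` is a hypothetical helper and not part of `lumina_data`.

```python
import ctypes


def to_filter_array(filter_ids):
    """Return (ctypes uint64 array, count) for an iterable of integer IDs."""
    ids = list(filter_ids)
    arr = (ctypes.c_uint64 * len(ids))(*ids)
    return arr, len(ids)


filter_arr, filter_count = to_filter_array(range(0, 10, 2))
```

The two returned values would take the place of `filter_arr` and `5` in the `search_with_filter` call.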

### Batch Queries

```python
# High-level
all_queries = [...]  # list of n_queries * dim floats
distances, labels = searcher.search_list(all_queries, n=5, k=10)

# Raw ctypes
# `data` is a flat sequence of 5 * dim floats supplied by the caller
queries = (ctypes.c_float * (5 * dim))(*data)
distances = (ctypes.c_float * (5 * k))()
labels = (ctypes.c_uint64 * (5 * k))()
searcher.search(queries, 5, k, distances, labels)
```
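Batch results come back as flat buffers of `n * k` entries. The sketch below regroups them per query. It assumes that the results for query `q` occupy positions `q * k` through `(q + 1) * k - 1`, which is suggested by the buffer sizes above but not stated explicitly. `split_results` is a hypothetical helper, and the buffers are filled by hand rather than by a real search.

```python
import ctypes


def split_results(distances, labels, n, k):
    """Group flat result buffers into one list of (label, distance) per query."""
    if len(distances) != n * k or len(labels) != n * k:
        raise ValueError("expected buffers of length n * k")
    return [
        [(labels[q * k + j], distances[q * k + j]) for j in range(k)]
        for q in range(n)
    ]


# Stand-in buffers filled by hand; a real search call would fill them instead.
n, k = 2, 3
distances = (ctypes.c_float * (n * k))(0.0, 0.5, 1.0, 0.25, 0.75, 1.5)
labels = (ctypes.c_uint64 * (n * k))(7, 3, 9, 4, 7, 1)
per_query = split_results(distances, labels, n, k)
```

The same helper works on the plain lists returned by `search_list`, since it only relies on indexing and `len`.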

### Metadata

```python
from lumina_data import LuminaIndexMeta

# Serialize (compatible with paimon-lumina Java and paimon-cpp)
meta = LuminaIndexMeta({
    "index.dimension": "128",
    "distance.metric": "l2",
    "index.type": "diskann",
    "encoding.type": "rawf32",
})
data = meta.serialize()       # -> bytes (JSON)

# Deserialize
meta = LuminaIndexMeta.deserialize(data)
print(meta.dim, meta.metric)  # 128, MetricType.L2
```
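Since `serialize()` is described as returning JSON bytes, the payload should be inspectable with the standard `json` module. The sketch below does not call `LuminaIndexMeta`. It builds a stand-in payload with `json.dumps`, because the exact key layout of the real serialized form is not documented here and may differ.

```python
import json

options = {
    "index.dimension": "128",
    "distance.metric": "l2",
    "index.type": "diskann",
    "encoding.type": "rawf32",
}

# Stand-in for the bytes returned by meta.serialize(); the real layout may differ.
payload = json.dumps(options, sort_keys=True).encode("utf-8")

decoded = json.loads(payload.decode("utf-8"))
dimension = int(decoded["index.dimension"])  # option values are strings
```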

## API Reference

### LuminaBuilder

| Method | Input | Description |
|--------|-------|-------------|
| `__init__(options)` | `dict` | Create builder with native Lumina options. |
| `pretrain(vectors, n, dim)` | ctypes arrays | Pretrain with `n` vectors. |
| `insert(vectors, ids, n, dim)` | ctypes arrays | Insert vectors with IDs. |
| `pretrain_from_list(vectors, n, dim)` | Python lists | High-level pretrain. |
| `insert_from_list(vectors, ids, n, dim)` | Python lists | High-level insert. |
| `dump(path)` | `str` | Write index to file. |
| `close()` | | Release native resources. The builder can also be used as a context manager (`with`). |

### LuminaSearcher

| Method | Input/Output | Description |
|--------|--------------|-------------|
| `__init__(options)` | `dict` | Create searcher. |
| `open(path)` | `str` | Load index from file. |
| `search(q, n, k, dist, labels, opts)` | ctypes in/out | Raw search. `opts` is an optional dict of search options. |
| `search_with_filter(q, n, k, fids, fc, dist, labels, opts)` | ctypes in/out | Raw filtered search over the `fc` IDs in `fids`. `opts` is optional. |
| `search_list(q, n, k, opts)` | list in, list out | High-level search. `opts` is optional. |
| `search_with_filter_list(q, n, k, fids, opts)` | list in, list out | High-level filtered search. `opts` is optional. |
| `get_count()` | | Number of vectors in index. |
| `get_dimension()` | | Vector dimension. |
| `close()` | | Release native resources. The searcher can also be used as a context manager (`with`). |

### Index Options

| Key | Values | Default |
|-----|--------|---------|
| `index.type` | `bruteforce`, `diskann`, `ivf` | `diskann` |
| `index.dimension` | integer | `128` |
| `distance.metric` | `l2`, `cosine`, `inner_product` | `inner_product` |
| `encoding.type` | `rawf32`, `sq8`, `pq` | `pq` |
| `diskann.build.ef_construction` | integer | `1024` |
| `diskann.build.neighbor_count` | integer | `64` |
| `diskann.build.thread_count` | integer | `32` |
| `diskann.search.list_size` | integer | auto (1.5x top_k) |
| `diskann.search.beam_width` | integer | `4` |

## Performance

Query latency compared to native C++ (DiskANN, 100K vectors, dim=128, top-10):

| API | Avg Latency | Throughput | Latency vs C++ |
|---|---|---|---|
| C++ native | 0.367 ms | 2724 qps | baseline |
| **Raw ctypes** | **0.370 ms** | **2705 qps** | **+0.8%** |
| High-level API | 0.494 ms | 2024 qps | +34% |

The raw ctypes API adds less than 1% latency over native C++. The additional
overhead of the high-level API comes from converting Python lists to ctypes
arrays on every call.
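
The relative figures in the table can be reproduced from the latency column. The sketch below also derives throughput as `1000 / latency_ms`, which assumes queries were issued one at a time. That assumption is not stated in the benchmark description, though it matches the reported numbers closely.

```python
def overhead_pct(latency_ms, baseline_ms):
    """Latency increase over the baseline, in percent."""
    return (latency_ms - baseline_ms) / baseline_ms * 100.0


def qps_from_latency(latency_ms):
    """Queries per second when queries run one at a time (assumption)."""
    return 1000.0 / latency_ms


baseline_ms = 0.367
raw_overhead = overhead_pct(0.370, baseline_ms)         # raw ctypes row
high_level_overhead = overhead_pct(0.494, baseline_ms)  # high-level API row
baseline_qps = qps_from_latency(baseline_ms)
```

With the table's values this gives roughly 0.8% for raw ctypes and 34.6% for the high-level API.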

## Packaging & Publishing

### Build wheel

```bash
pip install wheel setuptools
python setup.py bdist_wheel
```

### Upload to PyPI

```bash
pip install twine

# Rename for PyPI (requires manylinux tag); run from the project root
mv dist/lumina_data-0.1.0-*.whl dist/lumina_data-0.1.0-cp36-cp36m-manylinux1_x86_64.whl

# Upload
twine upload dist/*.whl
```

### Install from PyPI

```bash
pip install lumina-data
```

## Tests

```bash
python3 tests/test_lumina_index.py
```

## License

Apache License 2.0
