Metadata-Version: 2.1
Name: colstore
Version: 0.1.0
Summary: Memory-mapped columnar binary format for fast random-access I/O on structured arrays.
Author-Email: Alkaid Cheng <alkaid.ccheng@gmail.com>
License: MIT
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: C++
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: MacOS
Project-URL: Homepage, https://github.com/AlkaidCheng/colstore
Project-URL: Issues, https://github.com/AlkaidCheng/colstore/issues
Requires-Python: >=3.10
Requires-Dist: numpy>=1.25
Requires-Dist: psutil>=5.9
Provides-Extra: pandas
Requires-Dist: pandas>=1.5; extra == "pandas"
Provides-Extra: progress
Requires-Dist: tqdm>=4.60; extra == "progress"
Provides-Extra: numba
Requires-Dist: numba>=0.59; extra == "numba"
Provides-Extra: all
Requires-Dist: pandas>=1.5; extra == "all"
Requires-Dist: tqdm>=4.60; extra == "all"
Requires-Dist: numba>=0.59; extra == "all"
Provides-Extra: dev
Requires-Dist: pandas>=1.5; extra == "dev"
Requires-Dist: tqdm>=4.60; extra == "dev"
Requires-Dist: numba>=0.59; extra == "dev"
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: ruff>=0.4; extra == "dev"
Requires-Dist: black>=24.0; extra == "dev"
Requires-Dist: mypy>=1.8; extra == "dev"
Description-Content-Type: text/markdown

# ColStore

A memory-mapped columnar binary format for fast, memory-efficient I/O on
structured arrays. `colstore` lets you write a tabular dataset to a single
`.cstore` file once and then load arbitrary row/column subsets without
materializing the rest. Internally, columns are stored back-to-back as raw
NumPy bytes, reads use `np.memmap`, and fancy-index gathers run through a
parallel C++ kernel (OpenMP + software prefetching) bound via Cython. Process
memory stays bounded by the size of the output you ask for; the source file
is never fully read into RAM.
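
The memory-mapping idea can be illustrated with plain NumPy. The snippet below is only a sketch of the concept, not the library's implementation: it stores a single column in a hypothetical `price.bin` file and uses NumPy's own fancy indexing in place of the C++ kernel.

```python
import os
import tempfile

import numpy as np

# Write one column as raw bytes, similar to how a column payload is stored.
values = np.arange(1_000, dtype=np.float32) * 0.5
path = os.path.join(tempfile.mkdtemp(), "price.bin")  # hypothetical file name
values.tofile(path)

# Map the file instead of reading it; pages are loaded only when touched.
column = np.memmap(path, dtype=np.float32, mode="r", shape=(values.size,))

# A fancy-index gather copies just the requested rows into a new array.
rows = np.array([3, 10, 999])
subset = np.asarray(column[rows])
print(subset)
```

Only the pages that contain rows 3, 10, and 999 need to be faulted in, which is why resident memory tracks the size of the result rather than the size of the file.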

## Install

```bash
pip install colstore
```

Building from source needs a C++17 compiler and CMake ≥ 3.18. On macOS, install
`libomp` (`brew install libomp`) to get the parallel kernel; without it, the
build still succeeds but the kernel runs single-threaded.

## Quick start

```python
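import numpy as np
import pandas as pd
from colstore import ColStore

# Example inputs; any DataFrame with fixed-size numeric columns works.
df = pd.DataFrame({
    "price": np.random.rand(1_000).astype(np.float32),
    "qty": np.arange(1_000, dtype=np.int32),
})
indices = np.array([1, 5, 9])

# Write and open in one call. `.cstore` is the canonical extension.
ds = ColStore.from_dataframe(df, "data.cstore")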

# Indexing returns lazy views; no data is read yet.
ds['price']                          # ColumnView
ds[100:200]                          # TableView
ds[100:200, 'price']                 # ColumnView
ds[100:200, ['price', 'qty']]        # TableView
ds[[1, 5, 9], ['price', 'qty']]      # TableView (fancy rows + cols)

# Materialize through one of the to_* methods.
ds['price'].to_array()                          # 1D ndarray
ds[indices, ['price', 'qty']].to_dict()         # dict of 1D arrays
ds[indices, ['price', 'qty']].to_record()       # structured ndarray
ds[indices, ['price', 'qty']].to_dataframe()    # pandas DataFrame
```

## Writing from other sources

```python
from colstore import ColStore
import numpy as np

# From a dict of 1D arrays.
ColStore.from_dict(
    {"x": np.arange(100, dtype=np.float32), "y": np.arange(100, dtype=np.int64)},
    "data.cstore",
)

# From a structured (record) array. np.zeros gives defined contents;
# np.empty would write whatever bytes happened to be in memory.
records = np.zeros(100, dtype=[("price", np.float32), ("qty", np.int32)])
ColStore.from_records(records, "data.cstore")
```

Each factory returns an opened `ColStore` ready to read from.

## Configuration

```python
from colstore import set_max_workers, set_default_madvise, set_default_backend

set_max_workers(8)                # parallel gathers across columns
set_default_madvise("sequential") # OS read-ahead hint for sorted-index reads
set_default_backend("cpp")        # gather kernel: cpp | numpy | numba
```

## On-disk format

```
[magic 8B = b"CSTORE\x00\x01"]
[manifest_len 8B (u64 little-endian)]
[manifest_json]
[zero-padding to 64-byte alignment]
[column_0 raw bytes][column_1 raw bytes]...[column_n raw bytes]
```

The manifest is a small JSON object recording `format_version`, `n_rows`,
and per-column `{name, dtype}`. Column dtypes are preserved byte-for-byte;
columns are stored back-to-back with no per-row overhead.
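
To make the layout concrete, here is a minimal pure-NumPy sketch that writes and reads a file following the diagram above. It is not the library's reader. Several details are assumptions made for illustration: the manifest key `columns`, the dtype string form (`arr.dtype.str`), `format_version` being `1`, and the padding being measured from the start of the file. The helper names `write_sketch` and `read_column` are hypothetical.

```python
import json
import os
import struct
import tempfile

import numpy as np

MAGIC = b"CSTORE\x00\x01"
ALIGN = 64


def write_sketch(path, columns):
    """Write a dict of equal-length 1D arrays following the layout above."""
    n_rows = len(next(iter(columns.values())))
    manifest = json.dumps({
        "format_version": 1,  # assumed value
        "n_rows": n_rows,
        "columns": [  # assumed key name
            {"name": name, "dtype": arr.dtype.str}
            for name, arr in columns.items()
        ],
    }).encode("utf-8")
    header_len = len(MAGIC) + 8 + len(manifest)
    with open(path, "wb") as f:
        f.write(MAGIC)
        f.write(struct.pack("<Q", len(manifest)))  # u64 little-endian
        f.write(manifest)
        f.write(b"\x00" * (-header_len % ALIGN))   # pad to 64-byte boundary
        for arr in columns.values():
            f.write(np.ascontiguousarray(arr).tobytes())


def read_column(path, name):
    """Memory-map one column by walking the manifest to find its offset."""
    with open(path, "rb") as f:
        if f.read(8) != MAGIC:
            raise ValueError("not a .cstore file")
        (manifest_len,) = struct.unpack("<Q", f.read(8))
        manifest = json.loads(f.read(manifest_len))
    header_len = 16 + manifest_len
    offset = header_len + (-header_len % ALIGN)
    n_rows = manifest["n_rows"]
    for col in manifest["columns"]:
        dtype = np.dtype(col["dtype"])
        if col["name"] == name:
            return np.memmap(path, dtype=dtype, mode="r",
                             offset=offset, shape=(n_rows,))
        offset += n_rows * dtype.itemsize  # skip over this column's bytes
    raise KeyError(name)


path = os.path.join(tempfile.mkdtemp(), "sketch.cstore")
write_sketch(path, {
    "x": np.arange(4, dtype=np.float32),
    "y": np.arange(4, dtype=np.int64) * 10,
})
y = read_column(path, "y")
print(y[:])
```

Because every column has `n_rows` elements of a known item size, the offset of any column follows from the manifest alone; no index structure is needed beyond the header.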

## Supported dtypes

Fixed-size only: `float32`, `float64`, `int8/16/32/64`, `uint8/16/32/64`,
`bool`. Object dtype (strings, Python objects) is rejected at write time,
because the design point is zero-copy random access, which requires a fixed
stride.
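
The fixed-stride requirement can be sketched as follows. The helpers `check_column` and `byte_offset` are hypothetical names used for illustration, and the check is an assumption about how such validation might look, not the library's actual code.

```python
import numpy as np

# The dtypes listed above, in native byte order.
SUPPORTED = frozenset(
    np.dtype(name)
    for name in (
        "float32", "float64",
        "int8", "int16", "int32", "int64",
        "uint8", "uint16", "uint32", "uint64",
        "bool",
    )
)


def check_column(name, values):
    """Return values as an ndarray, or raise if the dtype has no fixed stride."""
    arr = np.asarray(values)
    if arr.dtype not in SUPPORTED:
        raise TypeError(f"column {name!r}: unsupported dtype {arr.dtype}")
    return arr


def byte_offset(column_start, row, dtype):
    """Byte position of a row: the column start plus row times the item size."""
    return column_start + row * np.dtype(dtype).itemsize


prices = check_column("price", np.ones(3, dtype=np.float32))
row_10 = byte_offset(64, 10, np.int32)
```

With a fixed item size, the location of any row is a single multiplication, which is what lets a gather jump straight to the requested rows. Variable-length values such as Python strings would need an extra offset table.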

## Layout

```
colstore/
├── pyproject.toml              # scikit-build-core build
├── CMakeLists.txt              # Cython + C++ build
├── include/colstore/
│   └── gather.hpp              # public C++ header
├── src/
│   ├── cpp/gather.cpp          # OpenMP + prefetch kernel
│   ├── cython/_gather.pyx      # dtype-dispatched binding
│   └── colstore/               # Python package
│       ├── __init__.py
│       ├── config.py
│       ├── format.py
│       ├── kernels.py
│       ├── view.py             # ColumnView + TableView
│       └── store.py
└── tests/                      # pytest suite
```

## License

MIT.
