Metadata-Version: 2.4
Name: weji-goose
Version: 0.3.0
Summary: Goose — a columnar storage engine with 10 specialized encodings, roaring-bitmap predicate pushdown, bloom-filter partition skipping, zone-map pruning, probabilistic sketches, and an SQL query advisor. Ships with a PostgreSQL profiler/exporter CLI.
Author-email: Gary Lucas <garylucas@bluedrop.com>
License: MIT License
        
        Copyright (c) 2026 WEJI Northern Technologies Inc.
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
Project-URL: Homepage, https://github.com/wejinortherntechnologiesinc/Goose
Project-URL: Repository, https://github.com/wejinortherntechnologiesinc/Goose
Project-URL: Issues, https://github.com/wejinortherntechnologiesinc/Goose/issues
Keywords: columnar,compression,database,analytics,predicate-pushdown,postgres,roaring-bitmap,bloom-filter
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: MacOS
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Database
Classifier: Topic :: Database :: Front-Ends
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.24
Requires-Dist: zstandard>=0.20
Requires-Dist: sqlglot>=25.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Provides-Extra: sql
Requires-Dist: sqlglot>=25.0; extra == "sql"
Provides-Extra: pg
Requires-Dist: psycopg2-binary>=2.9; extra == "pg"
Dynamic: license-file

# Goose

> A columnar storage engine for analytical workloads — 10 specialized column encodings, roaring-bitmap predicate pushdown, bloom-filter partition skipping, zone-map pruning, probabilistic sketches, and an SQL query advisor. Ships with a PostgreSQL profiler/exporter CLI.

[![CI](https://github.com/wejinortherntechnologiesinc/Goose/actions/workflows/ci.yml/badge.svg)](https://github.com/wejinortherntechnologiesinc/Goose/actions/workflows/ci.yml)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
[![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue.svg)](https://www.python.org/downloads/)

Goose is a **universal booster**: a column-encoding and predicate-pushdown layer you can use three ways —

1. **As a Python library** — point it at data, let it pick the best encoding per column, write a compressed columnar store, and query it with predicate pushdown.
2. **As a PostgreSQL cold tier** — profile a Postgres table, export it to Goose, and query the compressed copy (CDC keeps them in sync).
3. **As an embedded encoding library** — link the C (`libgoose.a`) or Rust (`goose-encoding`) reference implementation into another database to get encoding intelligence and predicate pushdown without changing your storage engine.

The binary format, encoding selection algorithm, and predicate evaluation logic are specified language-agnostically in [`reference/SPEC.md`](reference/SPEC.md), so any database can implement a compatible reader/writer.

---

## Why

PostgreSQL is great at OLTP and terrible at analytical scans over years of history. Goose takes the analytical workload off Postgres: a `verb_id` column with 6 distinct values compresses **58×**; a monotonic timestamp column compresses **10.8×** with FOR+ZSTD; a sparse boolean flag becomes a roaring bitmap you can filter *without decoding the column*. Predicate pushdown then skips whole partitions using bloom filters and zone maps before touching row data.

## Features

- **10 column encodings** — `raw`, `bitpacked`, `dictionary`, `offset_blob`, `delta_zstd`, `zstd`, `for` (frame-of-reference), `for_zstd`, `roaring` (roaring-bitmap booleans), `zstd_dict` (shared trained dictionary).
- **Predicate pushdown** — `eq`, `neq`, `gt`, `gte`, `lt`, `lte`, `in`, compound `and`/`or`. Each predicate produces a roaring bitmap of matching rows; predicates are intersected, never materializing the full column.
- **Partition skipping** — per-partition and per-block bloom filters + zone maps let the reader rule out entire partitions before decode.
- **Adaptive selectivity feedback** — observed predicate selectivities reorder filters so the most selective (cheapest) predicates run first.
- **Cross-column correlation statistics** — propagate partition pruning across correlated columns.
- **Probabilistic sketches** — approximate answers (count-distinct, etc.) for fast exploration.
- **Atomic writes + CRC32 checksums** — every file is written to a `.tmp` then renamed; partition manifests carry checksums, verified on read.
- **SQL query advisor** — `goose.advisor.optimize()` takes a SQL `WHERE` clause, reorders predicates by selectivity, and returns optimized SQL + a structured predicate tree for `GooseTable.query()`.
- **PostgreSQL integration** — `goose-pg` CLI to profile, export, benchmark, and query.

## Install

```bash
pip install weji-goose                 # core engine (numpy, zstandard, sqlglot)
pip install "weji-goose[pg]"           # + psycopg2-binary for the goose-pg CLI
pip install "weji-goose[dev]"          # + pytest for running the test suite
```

> **Note:** the PyPI distribution is `weji-goose`; the import name is `import weji_goose`.

Requires Python ≥ 3.10.

## Quick start — Python library

```python
import numpy as np
from weji_goose.schema import TableSchema
from weji_goose.column import ColumnSpec, ColumnType, Encoding
from weji_goose.table import GooseTable
from weji_goose.query import Predicate, CompoundPredicate

schema = TableSchema("demo", columns=[
    ColumnSpec("id",    ColumnType.INT64,   Encoding.DELTA_ZSTD),  # monotonic → delta
    ColumnSpec("flag",  ColumnType.BOOLEAN, Encoding.ROARING),     # sparse bool → bitmap
    ColumnSpec("city",  ColumnType.TEXT,    Encoding.DICTIONARY),  # low-card text → dict
])

# Create and write
table = GooseTable.create(schema, "./demo_table")
table.insert({
    "id":   np.arange(1_000_000, dtype=np.int64),
    "flag": np.array([i % 5 == 0 for i in range(1_000_000)], dtype=bool),
    "city": np.array(["stjohns", "corner", "bay"] * 333_333 + ["stjohns"], dtype=object),
})

# Predicate pushdown: filter without decoding the columns
table = GooseTable.open("./demo_table")
result = table.query(
    ["city", "id"],
    where=CompoundPredicate("and", [
        Predicate("flag", "eq", True),
        Predicate("id",   "gte", 999_000),
    ]),
)
print(result["city"][:5], result["id"][:5])
print("on-disk size:", table.total_size_bytes(), "bytes")
```

### Supported types

`int64`, `int32`, `int16`, `float64`, `float32`, `boolean`, `uuid`, `text`, `interval`.

## Quick start — PostgreSQL cold tier (`goose-pg`)

```bash
# 1. Profile a table → schema + per-column compression estimates
goose-pg profile \
    --db-url "postgresql://user:pass@host/db" \
    --table sensor_readings \
    --output ./sensor_readings_schema.json -v

# 2. Export the table to Goose format (auto-partitioned, resumable)
goose-pg export \
    --db-url "postgresql://user:pass@host/db" \
    --table sensor_readings \
    --output-dir ./goose_data \
    --partition-column recorded_at --partition-interval month --resume -v

# 3. Benchmark: profile + export + measured compression ratio
goose-pg benchmark \
    --db-url "postgresql://user:pass@host/db" \
    --table sensor_readings \
    --output-dir ./bench

# 4. Query the compressed Goose data with predicate pushdown
goose-pg query \
    --goose-dir ./goose_data/sensor_readings \
    --columns recorded_at,pm25 \
    --where "pm25 >= 35 AND region = 'bay_st_george'" \
    --format csv
```

`goose-pg query --where` parses the SQL `WHERE` clause through the Goose advisor (using the opened table's schema) and pushes it down. Output formats: `table` (default), `csv`, `json`.

A `docker-compose.yml` is included for integration testing against a throwaway Postgres. The `goose` service is profile-gated under `cli`, so pass `--profile cli` to invoke it:

```bash
docker compose up -d postgres
docker compose --profile cli run --rm goose --help
docker compose --profile cli run --rm goose query --goose-dir ./goose_data/mytable --columns id,name
```

## The query advisor

`goose.advisor.optimize(sql, schema)` reorders `WHERE` predicates by selectivity — roaring-bitmap lookups first, then bloom-filtered ID lookups, then range predicates — and returns optimized SQL plus a predicate tree:

```python
from weji_goose.advisor import optimize, SCHEMAS

oq = optimize(
    "SELECT id, verb_id FROM xapi_events "
    "WHERE verb_id IN ('completed', 'passed') AND actor_id = 42",
    schema=SCHEMAS["xapi_events"],
)
print(oq.optimized_sql)
# SELECT id, verb_id
# FROM xapi_events
# WHERE actor_id = 42 AND verb_id IN ('completed', 'passed')

print(oq.predicate_json)
# {'op': 'and', 'predicates': [
#     {'column': 'actor_id', 'op': 'eq',  'value': 42},
#     {'column': 'verb_id',  'op': 'in',  'value': ['completed', 'passed']}]}
```

`OptimizedQuery` exposes: `original_sql`, `optimized_sql`, `predicate_json`, `table_name`, `selected_columns`, `warnings`.

## Encodings

| Encoding     | Best for                                              |
|--------------|------------------------------------------------------|
| `raw`        | High-cardinality, incompressible numeric data         |
| `bitpacked`  | Dense booleans (8 values/byte)                        |
| `dictionary` | Low-cardinality text/categorical (≈ < 50k distinct)   |
| `offset_blob`| Variable-length blobs with a dictionary offset index  |
| `delta_zstd` | Monotonic / near-sequential int64 (timestamps, IDs)   |
| `zstd`       | Generic high-entropy compressible data                |
| `for`        | Frame-of-reference: clustered int ranges              |
| `for_zstd`   | FOR + ZSTD for clustered ints that still compress     |
| `roaring`    | Sparse booleans — queryable without column decode     |
| `zstd_dict`  | Repeated text patterns via a shared trained dictionary |

## Benchmarks

On a 10,000-row synthetic sample (`weji_goose/benchmarks/benchmark_results.json`, reproducible via `python -m weji_goose.benchmarks.demo`):

| Metric                       | Value           |
|------------------------------|-----------------|
| Overall compression vs PG    | **3.83×**       |
| Write throughput             | ~10,000 rows/s  |
| Full-scan throughput         | ~3.96M rows/s   |
| Best per-column compression  | `verb_id` → **58.5×** (dictionary, 6 distinct) |

| Column               | Encoding    | Compression |
|----------------------|-------------|-------------|
| `verb_id`            | dictionary  | 58.5×       |
| `context_org`        | dictionary  | 14.8×       |
| `stored`             | delta_zstd  | 10.8×       |
| `actor_id`           | for         | 8.0×        |
| `object_id`          | dictionary  | 7.4×        |

## Project layout

```
weji_goose/            core engine (import as `import weji_goose`)
  schema.py            TableSchema, ColumnType, Encoding enums
  table.py             GooseTable: create / open / insert / query / scan
  query.py             predicate pushdown engine, Predicate / CompoundPredicate
  advisor.py           SQL WHERE → optimized predicates
  reader.py writer.py  checksummed, atomic, mmap-backed I/O
  bloom.py roaring.py  partition skipping + row bitmaps
  sketch.py            probabilistic sketches
  correlation.py       cross-column pruning stats
goose_pg/              PostgreSQL integration
  cli.py               goose-pg CLI: profile / export / benchmark / query
  profiler.py          schema introspection + data profiling
  exporter.py          bulk export to Goose partitions
  type_map.py          PostgreSQL → Goose type/encoding mapping
reference/
  SPEC.md              language-agnostic binary format + algorithm spec
  c/                   C reference (libgoose.a / libgoose.so + tests)
  rust/                Rust reference crate (goose-encoding)
tests/                 272 tests
```

## Reference implementations & spec

Goose's on-disk format and algorithms are defined in [`reference/SPEC.md`](reference/SPEC.md) so any database can implement a compatible reader/writer. Reference implementations live under `reference/`:

- **C** — `reference/c/` builds `libgoose.a` / `libgoose.so` (`make`, `make test`).
- **Rust** — `reference/rust/` is the `goose-encoding` crate (`cargo build`).

These are maintained alongside the Python engine as the canonical cross-language contract.

## Development

```bash
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev,pg]"
pytest -q          # 272 tests
```

The Python reference encoders/decoders are in `weji_goose/`; the C and Rust references are in `reference/`. Benchmarks: `python -m weji_goose.benchmarks.demo`.

## License

[MIT](LICENSE) © 2026 WEJI Northern Technologies Inc.
