Metadata-Version: 2.4
Name: vgi-python
Version: 0.8.1
Summary: Vector Gateway Interface - Connect DuckDB to external programs via Apache Arrow
Project-URL: Homepage, https://query.farm
Project-URL: Repository, https://github.com/Query-farm/vgi-python
Project-URL: Documentation, https://vgi-python.query.farm/
Project-URL: Issues, https://github.com/Query-farm/vgi-python/issues
Author-email: Rusty Conover <rusty@query.farm>
Maintainer-email: Query Farm LLC <hello@query.farm>
License: Query Farm Source-Available License, Version 1.0
        
        Copyright (c) 2025, 2026 Query Farm LLC. All rights reserved.
        
        ## 1. Definitions
        
        "Licensor" means Query Farm LLC (http://query.farm, hello@query.farm) and its
        affiliates under common control.
        
        "VGI" means the Vector Gateway Interface, the DuckDB extension technology developed
        by the Licensor, also referred to by the Licensor as its "Hyperfederation" database
        technology.
        
        "Licensed Work" means VGI, including its source code, object code, and any
        documentation distributed with it, in each version made available by the Licensor
        under this License.
        
        "You" (or "Your") means the individual or legal entity exercising rights under this
        License, together with all affiliates under common control with that entity.
        
        "Production Use" means any use of the Licensed Work other than for development,
        testing, evaluation, experimentation, or other non-production purposes.
        
        "Hyperfederation Services" means services relating to the federation, gateway,
        integration, querying, or interoperation of data sources using VGI or
        functionally equivalent technology, including services that expose, broker, or
        provide access to such federated or gateway capabilities.
        
        "Commercial Marketplace" means any platform, exchange, or intermediary service,
        whether or not operated for a fee, that connects providers and consumers of
        Hyperfederation Services, or that facilitates the offering, discovery, exchange,
        sale, or licensing of Hyperfederation Services among third parties.
        
        "Competing Offering" means a product or service that You make available to third
        parties, on a paid basis (including through paid support, subscription, or hosting
        arrangements), whose capabilities significantly overlap with those of the Licensor's
        version(s) of the Licensed Work.
        
        ## 2. Grant of Rights
        
        Subject to the terms and limitations of this License, the Licensor grants You a
        worldwide, royalty-free, non-exclusive license to:
        
        (a) use, copy, and run the Licensed Work for any non-production purpose;
        
        (b) modify the Licensed Work and create derivative works of it;
        
        (c) redistribute the Licensed Work and Your derivative works, provided You comply
        with Section 5; and
        
        (d) make Production Use of the Licensed Work, except where such use is restricted by
        Section 3 or reserved to the Licensor by Section 4.
        
        ## 3. Production Use Conditions
        
        The grant of Production Use in Section 2(d) does not extend to, and You may not
        without a separate commercial license from the Licensor:
        
        (a) provide a Competing Offering to third parties; or
        
        (b) offer the Licensed Work, or any derivative work of it, to third parties on a
        hosted, embedded, or as-a-service basis where doing so competes with the Licensor's
        commercial interests in the Licensed Work.
        
        "Embedded" includes incorporating the source or object code of the Licensed Work
        into a Competing Offering, and packaging a Competing Offering such that the Licensed
        Work must be accessed or downloaded for that offering to function.
        
        Hosting or using the Licensed Work for Your own internal purposes is not a Competing
        Offering and is permitted, including across Your affiliates under common control.
        
        ## 4. Reserved Rights
        
        Notwithstanding any other provision of this License, the Licensor reserves to itself
        the exclusive right to build, operate, offer, or authorize a Commercial Marketplace
        that incorporates, integrates, is built upon, or otherwise uses the Licensed Work.
        
        This License grants You no right to construct, operate, or enable a Commercial
        Marketplace using the Licensed Work, whether on a commercial or non-commercial basis,
        and any such use requires a separate written agreement with the Licensor.
        
        ## 5. Redistribution
        
        If You redistribute the Licensed Work or any derivative work of it, in original or
        modified form, You must:
        
        (a) include a complete, unmodified copy of this License with each copy; and
        
        (b) cause any recipient to receive the Licensed Work subject to the terms of this
        License.
        
        The conditions in Sections 3 and 4 apply to every recipient of the Licensed Work,
        whether received directly from the Licensor or through a third party.
        
        ## 6. Conversion to Open Source
        
        For each version of the Licensed Work, on the tenth anniversary of the date the
        Licensor first made that version publicly available (the "Change Date" for that
        version), the Licensor additionally grants You the right to use that version under
        the terms of the Apache License, Version 2.0, and on and after that version's Change
        Date the restrictions in Sections 3 and 4 no longer apply to that version.
        
        This License applies separately to each version of the Licensed Work, and the Change
        Date may differ between versions.
        
        ## 7. Commercial Licensing
        
        If Your intended use is not permitted under this License, You may obtain a separate
        commercial license from the Licensor by contacting hello@query.farm. Absent such a
        license, You must refrain from the restricted use.
        
        ## 8. Trademarks
        
        This License does not grant You any right to use the names, trademarks, service
        marks, or logos of the Licensor, including "Vector Gateway Interface," "VGI," and
        "Hyperfederation," except as required for reasonable and customary use in describing
        the origin of the Licensed Work.
        
        ## 9. Termination
        
        Any use of the Licensed Work in violation of this License automatically terminates
        Your rights under this License for the current and all other versions of the Licensed
        Work. Your rights may be reinstated only by a writing signed by the Licensor.
        
        ## 10. Disclaimer of Warranty and Limitation of Liability
        
        TO THE MAXIMUM EXTENT PERMITTED BY APPLICABLE LAW, THE LICENSED WORK IS PROVIDED ON
        AN "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, EXPRESS OR IMPLIED,
        INCLUDING WITHOUT LIMITATION ANY WARRANTIES OR CONDITIONS OF MERCHANTABILITY, FITNESS
        FOR A PARTICULAR PURPOSE, NON-INFRINGEMENT, OR TITLE.
        
        TO THE MAXIMUM EXTENT PERMITTED BY APPLICABLE LAW, IN NO EVENT WILL THE LICENSOR BE
        LIABLE TO YOU FOR ANY DAMAGES ARISING OUT OF OR RELATING TO THIS LICENSE OR THE USE
        OF THE LICENSED WORK, WHETHER IN CONTRACT, TORT, OR OTHERWISE.
License-File: LICENSE
Keywords: analytics,apache-arrow,arrow,data-engineering,database,duckdb,pyarrow,rpc,sql,udf,user-defined-functions,vgi
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: Other/Proprietary License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Database
Classifier: Topic :: Database :: Database Engines/Servers
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Requires-Python: >=3.13
Requires-Dist: click
Requires-Dist: httpx>=0.24
Requires-Dist: platformdirs
Requires-Dist: pyarrow
Requires-Dist: typer>=0.9
Requires-Dist: vgi-rpc>=0.20.4
Provides-Extra: azure
Requires-Dist: azure-identity>=1.16.0; extra == 'azure'
Requires-Dist: pymssql>=2.3.0; extra == 'azure'
Provides-Extra: dev
Requires-Dist: azure-identity>=1.16.0; extra == 'dev'
Requires-Dist: duckdb>=1.5.3; extra == 'dev'
Requires-Dist: mypy; extra == 'dev'
Requires-Dist: numpy>=2.4.1; extra == 'dev'
Requires-Dist: pyarrow-stubs; extra == 'dev'
Requires-Dist: pydoclint; extra == 'dev'
Requires-Dist: pymssql>=2.3.0; extra == 'dev'
Requires-Dist: pytest; extra == 'dev'
Requires-Dist: pytest-cov; extra == 'dev'
Requires-Dist: pytest-examples; extra == 'dev'
Requires-Dist: pytest-ruff; extra == 'dev'
Requires-Dist: pytest-xdist; extra == 'dev'
Requires-Dist: ruff; extra == 'dev'
Requires-Dist: sqlglot; extra == 'dev'
Requires-Dist: vgi-rpc[conformance,external,http,oauth,otel,sentry]; extra == 'dev'
Requires-Dist: vgi-rpc[http]; extra == 'dev'
Requires-Dist: vgi-rpc[oauth]; extra == 'dev'
Requires-Dist: vgi-rpc[otel]; extra == 'dev'
Requires-Dist: vgi-rpc[sentry]; extra == 'dev'
Provides-Extra: duckdb
Requires-Dist: duckdb>=1.5.3; extra == 'duckdb'
Provides-Extra: fixtures
Requires-Dist: vgi-fixtures; extra == 'fixtures'
Provides-Extra: haybarn
Requires-Dist: haybarn>=1.5.3rc10; extra == 'haybarn'
Provides-Extra: http
Requires-Dist: vgi-rpc[http]; extra == 'http'
Provides-Extra: oauth
Requires-Dist: vgi-rpc[oauth]; extra == 'oauth'
Provides-Extra: otel
Requires-Dist: vgi-rpc[otel]; extra == 'otel'
Provides-Extra: sentry
Requires-Dist: vgi-rpc[sentry]; extra == 'sentry'
Provides-Extra: test-fixtures
Requires-Dist: numpy>=2.4.1; extra == 'test-fixtures'
Provides-Extra: test-fixtures-writable
Requires-Dist: numpy>=2.4.1; extra == 'test-fixtures-writable'
Requires-Dist: sqlglot; extra == 'test-fixtures-writable'
Provides-Extra: transactor
Requires-Dist: sqlglot; extra == 'transactor'
Description-Content-Type: text/markdown

# VGI (Vector Gateway Interface)

<p align="center">
  <img src="https://raw.githubusercontent.com/Query-farm/vgi-python/main/docs/vgi-logo.png" alt="VGI Logo" width="480">
</p>

<p align="center">
  <strong>Apache Arrow-based protocol for extending DuckDB using any language.</strong>
</p>

<p align="center">
  Created by <a href="https://query.farm">Query.Farm</a>
</p>

---

## See It in Action

```python
# my_worker.py
# /// script
# requires-python = ">=3.13"
# dependencies = ["vgi-python"]
# ///
from typing import Annotated
from vgi import ScalarFunction, Param, Returns, Worker
from vgi.catalog import Catalog, Schema
import pyarrow as pa
import pyarrow.compute as pc

class Multiply(ScalarFunction):
    """Multiply two columns element-wise."""

    @classmethod
    def compute(
        cls,
        a: Annotated[pa.Int64Array, Param(doc="First operand")],
        b: Annotated[pa.Int64Array, Param(doc="Second operand")],
    ) -> Annotated[pa.Int64Array, Returns()]:
        return pc.multiply(a, b)

class MyWorker(Worker):
    catalog = Catalog(
        name="my_worker",
        schemas=[Schema(name="main", functions=[Multiply])],
    )

if __name__ == "__main__":
    MyWorker().run()
```

The `# /// script` block is [inline script metadata](https://packaging.python.org/en/latest/specifications/inline-script-metadata/):
`uv run my_worker.py` reads it, provisions an isolated environment with
`vgi-python`, and runs the worker — no virtualenv to create or activate.

```sql
-- First time only.
INSTALL vgi FROM community;
LOAD vgi;
-- LOCATION is the command that launches the worker. `uv run` resolves the
-- script's inline dependencies, so nothing needs to be installed first.
ATTACH 'my_worker' (TYPE vgi, LOCATION 'uv run my_worker.py');

SELECT my_worker.multiply(6, 7);
-- 42
```

Or you can launch the [Haybarn](https://github.com/Query-farm-haybarn/haybarn)
CLI and attach the worker in one step:

```bash
uvx haybarn-cli "vgi:my_worker?location=uv run my_worker.py"
```

This drops you into a session with the function you just defined, available as
`my_worker.multiply(...)`.

That's it. No C++ compilation, no extension versioning, no complex build process. Just a Python script that Haybarn (or DuckDB) can call.

---

## Installation

The Python package is published on PyPI as `vgi-python` (the `vgi` name was
taken), but you still `import vgi` in code. The examples above don't install it
explicitly — the worker script's inline `# /// script` metadata lets `uv run`
provision it on demand. To add it to a project or environment directly:

```bash
pip install vgi-python      # or: uv add vgi-python
```

You also need a DuckDB-compatible SQL engine to load the `vgi` extension and
call your functions. These examples use [Haybarn](https://github.com/Query-farm-haybarn/haybarn),
Query Farm's DuckDB distribution, which ships the `vgi` extension signed for its
own catalog and runs with no install via `uvx`:

```bash
uvx haybarn-cli              # start an interactive SQL session
```

Stock `duckdb` works too — `INSTALL vgi FROM community; LOAD vgi;` resolves the
extension from the DuckDB community repository instead.

---

## Why VGI?

VGI lets you extend DuckDB with Python functions that run in separate processes, communicating via Apache Arrow IPC. This means:

| Traditional Extensions | VGI Workers |
|----------------------|-------------|
| C/C++ compilation required | Any language with an Apache Arrow library |
| Tied to DuckDB version | Version independent |
| Complex build/release cycle | Ship a script or executable |
| Runs in-process | Process isolation |

**Use cases:**
- Call REST APIs or external services from SQL
- Run ML inference (PyTorch, scikit-learn, etc.)
- Process data with Python libraries (pandas, numpy)
- Build custom ETL transforms
- Create domain-specific functions for your team
- Expose external data sources as queryable tables and views

---

## Quick Start

### Step 1: Create a Worker

A worker is a Python script that defines one or more functions:

```python
# my_worker.py
# /// script
# requires-python = ">=3.13"
# dependencies = ["vgi-python"]
# ///
from typing import Annotated
import pyarrow as pa
import pyarrow.compute as pc
from vgi import ScalarFunction, Param, Returns, Worker
from vgi.catalog import Catalog, Schema


class UpperCase(ScalarFunction):
    """Convert string values to uppercase."""

    @classmethod
    def compute(
        cls,
        value: Annotated[pa.StringArray, Param(doc="String value to uppercase")],
    ) -> Annotated[pa.StringArray, Returns()]:
        return pc.utf8_upper(value)


class MyWorker(Worker):
    catalog = Catalog(
        name="my_funcs",
        schemas=[Schema(name="main", functions=[UpperCase])],
    )


if __name__ == "__main__":
    MyWorker().run()
```

### Step 2: Use from SQL

```sql
-- Attach the worker as a catalog (its catalog name is "my_funcs")
ATTACH 'my_funcs' (TYPE vgi, LOCATION 'uv run my_worker.py');

-- Call your function (qualify with the catalog name, or run `USE my_funcs;` first)
SELECT my_funcs.upper_case(name) FROM users;

-- Use in complex queries
SELECT id, my_funcs.upper_case(status) as status
FROM orders
WHERE created_at > '2024-01-01';
```
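
In the examples in this README, the class `Multiply` is called as `multiply` and `UpperCase` as `upper_case`, which suggests the SQL name is the snake_case form of the class name. The sketch below reproduces that mapping for these two names. The helper `sql_name_from_class` is hypothetical, and the exact rule VGI applies (for example to acronyms or explicit name overrides) is an assumption not confirmed here.

```python
import re


def sql_name_from_class(class_name: str) -> str:
    """Convert a CamelCase class name to snake_case (assumed naming rule)."""
    # Insert "_" before every capital letter except the first, then lowercase.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", class_name).lower()


for cls_name in ("Multiply", "UpperCase"):
    print(cls_name, "->", sql_name_from_class(cls_name))
```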

### Step 3: There is no step 3

Your function is now available in any DuckDB-compatible engine that loads the `vgi` extension. Ship the Python script to your team, and they can use it immediately.

---

## Function Types

VGI supports five function types:

| Type | Base Class | SQL Pattern | Use Case |
|------|------------|-------------|----------|
| **Scalar** | `ScalarFunction` | `SELECT func(col) FROM t` | Per-row transforms (1 row → 1 value) |
| **Table** | `TableFunctionGenerator` | `SELECT * FROM func(args)` | Generate rows from arguments (args → N rows) |
| **Table-In-Out** | `TableInOutFunction` | `SELECT * FROM func((SELECT ...))` | Reshape or filter a streamed table (N rows → M rows) |
| **Aggregate** | `AggregateFunction` | `SELECT func(col) FROM t GROUP BY k` | Accumulate per `GROUP BY` group (N rows → 1 value) |
| **Buffering** | `TableBufferingFunction` | `SELECT * FROM func((SELECT ...))` | See every row first — sort, top-k, full reduction (stream → state → stream) |

Each type overrides a small, predictable surface (see the `Multiply` example
above for a complete scalar worker):

- **Scalar** — `compute()` maps input arrays to one output array of the same length.
- **Table** — `process()` streams batches via `out.emit()` until `out.finish()`.
- **Table-In-Out** — `transform()` reshapes each input batch; `finish()` emits any trailing rows.
- **Aggregate** — `initialize` / `update` / `combine` / `finalize` accumulate one value per `GROUP BY` group, merging partial states across parallel workers.
- **Buffering** — like aggregate, but `finalize` streams a whole relation out once every input row has been seen (sort, top-k, full-stream reductions).

---

## Beyond Functions: Full Catalog Support

VGI workers can expose more than just functions. A worker can provide a complete database catalog with:

- **Schemas** - Organize objects into namespaces
- **Tables** - Expose external data as queryable tables
- **Views** - Define SQL views over your data
- **Functions** - Scalar, table, table-in-out, aggregate, and buffering functions

```sql
ATTACH 'external_db' (TYPE vgi, LOCATION 'uv run my_catalog_worker.py');

-- Query tables from the attached catalog
SELECT * FROM external_db.main.users;

-- Use views
SELECT * FROM external_db.analytics.daily_summary;

-- Call functions
SELECT external_db.main.transform(col) FROM my_table;
```

This enables VGI workers to act as bridges to external systems—databases, APIs, file systems—presenting them as native DuckDB catalogs.

See [Catalog Interface](docs/catalog-interface.md) for implementation details.

---

## Parallel Execution

Functions can run across multiple worker processes. The client automatically
distributes input batches round-robin across workers and collects results.

See [Function API Reference](docs/generator-api.md) for advanced patterns like distributed aggregation.
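
As a rough illustration of round-robin distribution, the sketch below assigns batches to workers in rotating order. It is a standalone toy, not the VGI client's scheduling code; `distribute_round_robin` and the string batch labels are hypothetical.

```python
from itertools import cycle


def distribute_round_robin(batches: list, worker_count: int) -> list[list]:
    """Assign batches to workers in rotating order."""
    if worker_count < 1:
        raise ValueError("worker_count must be at least 1")
    assignments = [[] for _ in range(worker_count)]
    # cycle() yields 0, 1, ..., worker_count - 1, 0, 1, ... indefinitely.
    for batch, worker in zip(batches, cycle(range(worker_count))):
        assignments[worker].append(batch)
    return assignments


assignments = distribute_round_robin(["b0", "b1", "b2", "b3", "b4"], worker_count=2)
print(assignments)  # [['b0', 'b2', 'b4'], ['b1', 'b3']]
```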

---

## Error Handling

Errors in your functions propagate to DuckDB with clear messages:

```python test="skip"
@classmethod
def compute(cls, value: Annotated[pa.Int64Array, Param()]) -> Annotated[pa.Int64Array, Returns()]:
    raise ValueError("Something went wrong")
```

```sql
SELECT my_func(col) FROM my_table;
-- Error: Something went wrong
```

Type bound violations are caught at bind time (before processing starts):

```sql
SELECT add_values(name, price) FROM orders;
-- Error: Argument 'left': Column 'name' has type string,
--        but type bound requires: is_integer
```
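
Conceptually, a bind-time check compares each argument's column type against its declared bound before any data flows. The sketch below imitates that with plain type-name strings. The names `is_integer`, `check_type_bound`, and `INTEGER_TYPES` are hypothetical here, and VGI's actual type bounds operate on Arrow types rather than strings.

```python
# Type names treated as integers in this sketch (an illustrative subset).
INTEGER_TYPES = {"int8", "int16", "int32", "int64", "uint8", "uint16", "uint32", "uint64"}


def is_integer(type_name: str) -> bool:
    return type_name in INTEGER_TYPES


def check_type_bound(argument: str, column: str, type_name: str, bound) -> None:
    """Raise before processing starts if the column type violates the bound."""
    if not bound(type_name):
        raise TypeError(
            f"Argument {argument!r}: Column {column!r} has type {type_name}, "
            f"but type bound requires: {bound.__name__}"
        )


# Input schema of the hypothetical query: add_values(name, price).
schema = {"name": "string", "price": "int64"}
message = ""
try:
    for argument, column in (("left", "name"), ("right", "price")):
        check_type_bound(argument, column, schema[column], is_integer)
except TypeError as exc:
    message = str(exc)
print(message)
```

Because the check only needs the schema, it can reject the query without reading a single row.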

### Debugging Worker Failures

When a worker fails, the Python traceback is written to stderr. By default, the client captures the worker's stderr and appends its last 50 lines to the error message, so the traceback appears alongside the error:

```
ClientError: Worker Exception: function 'my_func' raised ValueError

Worker stderr:
Traceback (most recent call last):
  File "my_worker.py", line 42, in compute
    ...
ValueError: Something went wrong
```
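
Keeping only the last 50 lines of stderr is easy to reproduce with a bounded `collections.deque`. The sketch below is an assumption about the general technique, not the VGI client's code: `run_and_capture_tail` is a hypothetical helper, and the "worker" is a one-off Python subprocess that prints noise and then raises.

```python
import subprocess
import sys
from collections import deque

STDERR_TAIL_LINES = 50  # mirrors the "last 50 lines" behaviour described above


def run_and_capture_tail(command: list[str]) -> tuple[int, str]:
    """Run a command and return its exit code plus the tail of its stderr."""
    process = subprocess.Popen(
        command, stdout=subprocess.DEVNULL, stderr=subprocess.PIPE, text=True
    )
    # A bounded deque silently discards the oldest lines once it is full.
    tail = deque(process.stderr, maxlen=STDERR_TAIL_LINES)
    returncode = process.wait()
    return returncode, "".join(tail)


# Stand-in for a failing worker: 100 lines of noise, then an exception.
failing_worker = [
    sys.executable,
    "-c",
    "import sys\n"
    "for i in range(100): print('noise', i, file=sys.stderr)\n"
    "raise ValueError('Something went wrong')",
]

returncode, stderr_tail = run_and_capture_tail(failing_worker)
if returncode != 0:
    print(f"Worker failed (exit code {returncode})\n\nWorker stderr:\n{stderr_tail}")
```

The traceback is the last thing the process writes, so it survives the truncation even when earlier output is dropped.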

For real-time debugging, set `VGI_WORKER_DEBUG=1` to stream worker logs directly to your terminal and enable DEBUG-level logging:

```bash
VGI_WORKER_DEBUG=1 python my_script.py
```

This is especially useful when integrating from C++ or other clients where stderr might otherwise be lost.

---

## Testing Your Functions

Use the VGI client for integration tests:

```python
from vgi.client import Client
from vgi import Arguments
import pyarrow as pa

batch = pa.RecordBatch.from_pydict({"name": ["alice", "bob"]})

with Client("./my_worker.py") as client:
    results = list(client.scalar_function(
        function_name="upper_case",
        input=iter([batch]),
        arguments=Arguments(positional=[pa.scalar("name")]),
    ))

assert results[0]["result"].to_pylist() == ["ALICE", "BOB"]
```

---

## Protocol Overview

VGI uses `vgi_rpc`, an Apache Arrow IPC-based RPC framework, for all
client-worker communication over stdin/stdout pipes:

```
Client                              Worker
  │                                   │
  │──── bind(request) ──────────────▶ │  Function name, args, input schema
  │◀─── BindResponse ────────────────  │  Output schema, opaque data
  │                                   │
  │──── init(request) ──────────────▶ │  Start processing stream
  │◀─── Stream header ───────────────  │  execution_id, max_workers
  │                                   │
  │──── exchange(batch1) ───────────▶ │
  │◀─── output batch 1 ──────────────  │  transform(batch)
  │         ...                       │
  │──── [stream close] ─────────────▶ │  Signal end of input
  │                                   │
  │──── init(phase=FINALIZE) ───────▶ │  Start finalize stream
  │◀─── final output batches ────────  │  finish() results
  └───────────────────────────────────┘
```
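
The sequence in the diagram can be walked through with a toy, in-memory stand-in for a worker. Nothing below speaks Arrow IPC or uses `vgi_rpc`: `ToyWorker`, `run_function_call`, the `"PROCESS"` phase label, and the dictionary-shaped responses are all hypothetical, chosen only to show the order of the steps.

```python
class ToyWorker:
    """In-memory stand-in for a worker; records each protocol step it sees."""

    def __init__(self) -> None:
        self.log: list[str] = []
        self.seen: list[int] = []

    def bind(self, function_name: str, input_schema: list[str]) -> dict:
        self.log.append("bind")
        # Reply with the output schema for this call.
        return {"output_schema": ["result"]}

    def init(self, phase: str = "PROCESS") -> dict:
        # "PROCESS" is a made-up label for the first stream.
        self.log.append(f"init({phase})")
        return {"execution_id": "toy-1", "max_workers": 1}

    def exchange(self, batch: list[int]) -> list[int]:
        self.log.append("exchange")
        self.seen.extend(batch)
        return [value * 2 for value in batch]  # plays the role of transform(batch)

    def read_final_batches(self) -> list[list[int]]:
        # Plays the role of finish(): emit trailing rows once input has ended.
        return [[sum(self.seen)]]


def run_function_call(worker: ToyWorker, batches: list[list[int]]) -> list[list[int]]:
    worker.bind("double", ["value"])
    worker.init()
    outputs = [worker.exchange(batch) for batch in batches]
    # The input stream is closed here; a second init starts the finalize stream.
    worker.init(phase="FINALIZE")
    outputs.extend(worker.read_final_batches())
    return outputs


worker = ToyWorker()
outputs = run_function_call(worker, [[1, 2], [3]])
print(worker.log)
print(outputs)
```

The log shows one `bind`, one `init`, one `exchange` per input batch, and a second `init` for the finalize phase, matching the diagram from top to bottom.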

---

## External Batch Offloading (Demo Storage)

When record batches are too large for HTTP request/response bodies, VGI supports
externalizing them to blob storage. The server replaces oversized batches with
pointer batches containing a URL, and the client transparently fetches the data.

The example HTTP server includes a built-in demo blob store for testing this
without S3 or any cloud infrastructure:

```bash
# Start with demo storage (4 KiB threshold for testing)
vgi-fixture-http --demo-storage --externalize-threshold-bytes 4096

# With zstd compression
vgi-fixture-http --demo-storage --externalize-threshold-bytes 4096 --externalize-compression zstd
```

When `--demo-storage` is enabled:
- Batches exceeding `--externalize-threshold-bytes` are stored in-memory and
  served from `/__blobs__/{id}` endpoints on the same server
- Clients can request upload URLs for large inputs via the `__upload_url__` endpoint
- The server advertises `VGI-Max-Request-Bytes` and rejects oversized requests with an HTTP 413 response

For production use, implement the `ExternalStorage` protocol from `vgi_rpc` against
your cloud storage (S3, GCS, etc.). The example server also supports S3 via `--s3-bucket`.
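
The threshold-and-pointer idea can be sketched without any HTTP or Arrow machinery. The code below is not the `ExternalStorage` protocol from `vgi_rpc`, whose interface is not shown in this README. `InMemoryBlobStore`, `externalize`, `resolve`, and the dictionary message shape are hypothetical, with payloads modelled as raw bytes.

```python
import uuid


class InMemoryBlobStore:
    """Toy blob store keyed by a random id (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        blob_id = uuid.uuid4().hex
        self._blobs[blob_id] = data
        return f"/__blobs__/{blob_id}"

    def get(self, url: str) -> bytes:
        return self._blobs[url.rsplit("/", 1)[-1]]


def externalize(payload: bytes, store: InMemoryBlobStore, threshold_bytes: int) -> dict:
    """Replace a payload above the threshold with a pointer to stored data."""
    if len(payload) > threshold_bytes:
        return {"kind": "pointer", "url": store.put(payload)}
    return {"kind": "inline", "data": payload}


def resolve(message: dict, store: InMemoryBlobStore) -> bytes:
    """Client side: fetch the data behind a pointer, or return inline data."""
    if message["kind"] == "pointer":
        return store.get(message["url"])
    return message["data"]


store = InMemoryBlobStore()
small = externalize(b"x" * 100, store, threshold_bytes=4096)
large = externalize(b"y" * 10_000, store, threshold_bytes=4096)
print(small["kind"], large["kind"])
```

Because `resolve` handles both shapes, callers never need to know whether a given batch was sent inline or by reference.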

---

## Documentation

- [Function Lifecycle](docs/lifecycle.md) - Bind, init, process, finalize
- [Metadata API](docs/metadata.md) - Function introspection
- [Function API Reference](docs/generator-api.md) - Advanced function patterns
- [Catalog Interface](docs/catalog-interface.md) - DuckDB ATTACH integration

---

## Logging

Workers support `--debug`, `--log-level`, `--log-format`, and `--log-logger` options:

```bash
# Enable debug logging
vgi-fixture-worker --debug

# JSON-formatted logs for structured pipelines
vgi-fixture-worker --log-format json

# Target a specific logger
vgi-fixture-worker --log-level DEBUG --log-logger vgi.worker
```

You can also use the `VGI_WORKER_DEBUG=1` environment variable, which enables `--debug` on the worker and stderr passthrough on the client without changing any code or CLI flags:

```bash
VGI_WORKER_DEBUG=1 python my_script.py
```

See [CLI Reference](docs/cli.md#worker-logging) for the full list of loggers and options.

---

## Development

```bash
git clone https://github.com/Query-farm/vgi-python
cd vgi-python

uv sync --all-extras        # Install dependencies
uv run pytest -n auto       # Run tests
uv run ruff check --fix .   # Lint
uv run ruff format .        # Format
uv run mypy vgi/            # Type check
```

## Requirements

- Python >= 3.13
- pyarrow
- A DuckDB-compatible engine for SQL integration — [Haybarn](https://github.com/Query-farm-haybarn/haybarn) (`uvx haybarn-cli`) or stock DuckDB

---

## License

Copyright (c) 2025, 2026 Query Farm LLC.

Licensed under the **Query Farm Source-Available License, Version 1.0** — see
[LICENSE](LICENSE) for the binding terms.

For a commercial license or any licensing questions, contact
[hello@query.farm](mailto:hello@query.farm).
