Metadata-Version: 2.4
Name: deltacat-io-core
Version: 0.1.13
Summary: Shared local IO execution layer for DeltaCAT read/write clients.
Author: Ray Team
License-Expression: Apache-2.0
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: <3.13,>=3.11
Requires-Dist: boto3~=1.34
Requires-Dist: fsspec
Requires-Dist: msgpack~=1.0.7
Requires-Dist: numpy<2,>=1.23
Requires-Dist: pyarrow<24,>=17
Requires-Dist: typing-extensions>=4.6.1
Provides-Extra: all
Requires-Dist: daft==0.5.22; extra == 'all'
Requires-Dist: fastavro; extra == 'all'
Requires-Dist: pandas==2.2.3; extra == 'all'
Requires-Dist: polars==1.28.1; extra == 'all'
Requires-Dist: pylance>=0.37.0; extra == 'all'
Provides-Extra: daft
Requires-Dist: daft==0.5.22; extra == 'daft'
Provides-Extra: io
Requires-Dist: fastavro; extra == 'io'
Provides-Extra: lance
Requires-Dist: pylance>=0.37.0; extra == 'lance'
Provides-Extra: pandas
Requires-Dist: pandas==2.2.3; extra == 'pandas'
Provides-Extra: polars
Requires-Dist: polars==1.28.1; extra == 'polars'
Description-Content-Type: text/markdown

# deltacat-io-core

`deltacat-io-core` is the shared local IO execution layer for DeltaCAT reads
and writes.

It is used by both:

- `deltacat-client` for direct thin-plan execution
- `deltacat` for shared local execution and compatibility wrappers

## Naming

- distribution/package name: `deltacat-io-core`
- Python import module: `deltacat_io_core`

The distribution uses dashes for consistency with `deltacat-client`. The import
module keeps underscores because Python module names cannot contain `-`.
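
In practice:

```python
# Installed as `deltacat-io-core`; imported with underscores.
import deltacat_io_core
```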

## Scope

`deltacat-io-core` owns the code that should behave the same regardless of
whether the caller is using the thin client or the thick DeltaCAT package.

Today that includes:

- direct execution of thin `Plan` objects
- merge-on-read (MOR) execution for thin and thick paths
- local file materialization and manifest building
- schema alignment and table conversion helpers
- sort-aware file ordering and manifest handling
- shared compaction/MOR helper layers and model types
- format-specific local readers/writers

## Non-Goals

`deltacat-io-core` does not own:

- server routes or REST/MCP request handling
- authoritative catalog/storage mutations
- native Ray job orchestration surfaces
- public end-user API shape for `deltacat` or `deltacat-client`

It is a shared implementation layer, not the top-level user product.

## Architecture

The current read architecture is:

1. The server resolves a thin `Plan`.
2. `client.catalog.read(plan=...)` executes that plan directly through
   `deltacat-io-core`.
3. `dc.read_table(plan=...)` for thin plans also executes through the same
   shared path.

There is no longer a runtime bridge back into thick DeltaCAT for thin-plan
execution. The plan contract is expected to carry the metadata required for
direct execution.
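
A minimal sketch of that flow, assuming `execute_read_plan` (named later in
this README) is importable from the package root and takes an output-type
keyword; both the import path and the signature are assumptions, not
documented API:

```python
# Sketch only: the import path and keyword argument are assumptions.
from deltacat_io_core import execute_read_plan

plan = ...  # a thin Plan resolved by the DeltaCAT server (step 1 above)
table = execute_read_plan(plan, output_type="pyarrow")  # assumed signature
```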

The current write architecture is:

1. The client stages local files or materializes local data through shared
   helpers.
2. The authoritative commit still happens through DeltaCAT server/native
   boundaries.
3. Shared write-preparation and manifest logic lives in `deltacat-io-core`.
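
As a rough illustration of step 1, the local-materialization half can be
pictured with plain PyArrow; this shows the kind of work the shared helpers
do, not their actual API:

```python
import os

import pyarrow as pa
import pyarrow.parquet as pq

# Stage a local Parquet file (the "local data materialization" half).
os.makedirs("/tmp/stage", exist_ok=True)
table = pa.table({"id": [1, 2, 3]})
pq.write_table(table, "/tmp/stage/part-0.parquet")

# deltacat-io-core would also build manifest entries for files like this;
# the authoritative commit itself stays on the server/native side (step 2).
```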

## Installation

Base install:

```bash
uv pip install deltacat-io-core
```

Optional extras:

- `deltacat-io-core[io]` for local file readers/writers (adds `fastavro`;
  `pyarrow` is already a base dependency)
- `deltacat-io-core[pandas]` for Pandas conversions
- `deltacat-io-core[polars]` for Polars conversions and lazy scan helpers
- `deltacat-io-core[daft]` for Daft conversions and lazy scan helpers
- `deltacat-io-core[lance]` for Lance dataset support
- `deltacat-io-core[all]` for the full local IO stack
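
For example, to pull in one extra (the quotes keep shells like zsh from
interpreting the brackets):

```bash
uv pip install "deltacat-io-core[polars]"
```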

## Read Capabilities

The shared read executor currently handles:

- schema-table reads
- schemaless manifest-table reads
- MOR reads
- direct `pyarrow`, `pandas`, `polars`, `numpy`, `daft`, and `ray_dataset`
  outputs where supported
- lazy `pyarrow_parquet`
- lazy `lance`

It also validates and rejects unsupported combinations directly, for example:

- schemaless + `pyarrow_parquet`
- schemaless + `lance`
- mixed-content lazy plans for format-specific readers
- unknown content types in the shared path

### Polars / Daft Capability Matrix

The shared executor applies the same capability decision in thin
`execute_read_plan(...)` execution and in thick reads that delegate into
that shared path.

| Engine | Content | v1 behavior |
| --- | --- | --- |
| Polars | Parquet | Lazy scan via `pl.scan_parquet(...)` when the existing local preconditions hold |
| Polars | Lance | Explicit eager fallback; no reader-level Lance row-filter pushdown |
| Polars | PackDS | Same as Lance; PackDS plans stay on the explicit eager Lance fallback |
| Daft | Parquet | Lazy scan via shared `build_daft_lazy_scan(...)` when the group is local/shared-eligible |
| Daft | Lance | Lazy only for a single dataset on the shared local path; multi-dataset falls back eagerly |
| Daft | PackDS | Same as Lance under PackDS v5: a single pruned episode dataset can use native lazy Lance scanning; multi-episode plans fall back eagerly |

Notes:

- Mixed-schema lazy eligibility on the shared path requires per-file
  `schema_id` lookups plus top-level schema information with resolvable
  field types, whether that comes from `schema_serialized` or a typed
  top-level `schema` summary.
- On the shared Daft path, non-identity Parquet content encodings (for
  example `.parquet.gz`) stay on the eager PyArrow path.
- When the process is pinned to `DAFT_RUNNER=ray`, the shared local Daft
  lazy path declines and falls back to the eager shared path instead of
  spawning a Ray-backed local lazy scan.
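
To make the "lazy scan" rows concrete, this is the public Polars API the
Polars/Parquet path builds on; it illustrates the behavior described in the
table above and is not deltacat-io-core's own API (the path and column names
here are made up):

```python
import polars as pl

# Lazy scan: no data is read until `.collect()`. Filters and projections
# can push down into the Parquet scan.
lf = pl.scan_parquet("data/*.parquet")
df = (
    lf.filter(pl.col("id") > 100)
    .select(["id", "value"])
    .collect()
)
```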

## Write Capabilities

The shared write layer currently covers:

- write input normalization
- local data materialization
- manifest construction for existing files and datasets
- schema/read compatibility helpers
- standard catalog write orchestration slices

Authoritative catalog mutation, commit, retention, and compaction boundaries
remain on the native/server side where they belong.

## Relationship To Other Packages

Use `deltacat-client` when you want the public thin client.

Use `deltacat` when you want the thick/native package.

Use `deltacat-io-core` directly only if you are intentionally building against
the shared execution layer itself.
