Metadata-Version: 2.3
Name: polars-checkpoint
Version: 0.1.3
Summary: Convenient APIs to break lineage in complex Polars LazyFrame jobs. Can significantly reduce memory overhead for iterative processing.
Requires-Dist: polars>=1.39.3
Requires-Python: >=3.11.1
Description-Content-Type: text/markdown

# polars-checkpoint

Materialise Polars LazyFrames to Parquet files and scan them back lazily, breaking the query lineage at that point. The streaming engine is used for sink and scan by default. This helps reduce memory pressure from expensive intermediate results in complex multi-step transforms.

## Installation

Requires `polars` 1.39.3 or later and Python 3.11.1 or later.

## Quick Start

```python
import polars as pl
from polars_checkpoint import checkpoint

lf = pl.LazyFrame({"x": range(1_000_000)}).with_columns(y=pl.col("x") * 2)

# Materialises to a temp parquet file; returns a lazy re-scan
lf = checkpoint(lf)

lf.filter(pl.col("y") > 100).collect()
```

A process-wide default session manages the temp directory and cleans it up at exit.

## Session API

For explicit control over storage location and lifecycle, use `CheckpointSession`:

```python
import polars as pl
from polars_checkpoint import CheckpointSession

# As a context manager — cleans up on exit from the block
with CheckpointSession(root_dir="./my_checkpoints") as sess:
    lf = pl.scan_csv("big.csv")
    lf = sess.checkpoint(lf, name="after-parse")
    lf = lf.filter(pl.col("status") == "active")
    lf = sess.checkpoint(lf, name="filtered")
```

```python
import polars as pl
from polars_checkpoint import CheckpointSession

# Without a context manager: cleans up at GC or interpreter shutdown,
# or when you call close() explicitly
sess = CheckpointSession(root_dir="./my_checkpoints")
lf = sess.checkpoint(pl.scan_csv("big.csv"), name="raw")
reloaded = sess["raw"]
print(sess.summary())

sess.close()  # optional; triggers early cleanup
```

### `CheckpointSession` constructor

| Parameter | Default | Description |
|---|---|---|
| `root_dir` | `None` (auto temp dir) | Parent directory for checkpoint folders. |
| `cleanup` | `True` | Delete checkpoint files on close / GC / interpreter exit. |
| `default_sink_kwargs` | `{"compression": "zstd"}` | Defaults passed to `sink_parquet` / `write_parquet`. |
| `default_scan_kwargs` | `{}` | Defaults passed to `scan_parquet`. |

### Key methods & features

- **`checkpoint(lf, *, name=None, streaming=True, ...)`** — Materialise a LazyFrame to parquet. Auto-generates a name if none given. Falls back to `collect().write_parquet()` when `streaming=False`.
- **`session[name]`** — Retrieve a checkpoint as a `LazyFrame`.
- **`name in session`** — Check existence.
- **`len(session)`** / **`iter(session)`** — Count / list checkpoints.
- **`summary()`** — Returns a Polars DataFrame with name, size (MB), and path of each checkpoint.
- **`close(timeout=None)`** — Waits for in-flight writes, then cleans up. The session can also be used as a context manager, which cleans up when the block exits.

## Thread Safety

Sessions are internally locked. Concurrent `checkpoint()` calls from multiple threads are safe; `close()` waits for all in-flight materialisations before removing files.

## Cleanup Behaviour

| Scenario | `cleanup=True` (default) | `cleanup=False` |
|---|---|---|
| `close()` / `__exit__` | Files deleted | Files retained |
| GC / interpreter shutdown | Files deleted (via `weakref.finalize`) | Files retained |

When `root_dir` is auto-generated, the entire temp directory is removed. When `root_dir` is user-supplied, only the individual checkpoint subdirectories created by the session are removed.
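
The table mentions `weakref.finalize`; the following standard-library sketch shows how that mechanism can tie a temp directory's lifetime to an object. `TempDirOwner` and the `ckpt-sketch-` prefix are hypothetical and only illustrate the pattern, under the assumption that disabling cleanup amounts to never running the finalizer.

```python
import gc
import shutil
import tempfile
import weakref
from pathlib import Path


class TempDirOwner:
    """Hypothetical sketch of finalizer-based cleanup for an auto-created temp dir."""

    def __init__(self, cleanup: bool = True) -> None:
        self.root = Path(tempfile.mkdtemp(prefix="ckpt-sketch-"))
        # The callback must not reference `self`, or the object could never be collected.
        self._finalizer = weakref.finalize(
            self, shutil.rmtree, str(self.root), ignore_errors=True
        )
        if not cleanup:
            self._finalizer.detach()  # keep files: the finalizer will never run

    def close(self) -> None:
        self._finalizer()  # runs at most once; later calls are no-ops


owner = TempDirOwner()
root = owner.root
(root / "step.parquet").write_bytes(b"placeholder")
existed_before = root.exists()
del owner      # last reference dropped
gc.collect()   # not needed on CPython, but harmless elsewhere
exists_after = root.exists()

kept = TempDirOwner(cleanup=False)
kept_root = kept.root
del kept
gc.collect()
kept_exists = kept_root.exists()
shutil.rmtree(kept_root)  # manual tidy-up for the demo
```

By default `weakref.finalize` also runs pending finalizers at interpreter exit, which matches the "GC / interpreter shutdown" row above.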