Metadata-Version: 2.4
Name: pycasher
Version: 0.5.11
Summary: Cache function results and side effects (stdout, stderr, file writes) with automatic file I/O discovery via strace or audit hooks
License-Expression: MIT
License-File: LICENSE
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.12
Requires-Dist: loguru
Requires-Dist: lz4
Provides-Extra: pandas
Requires-Dist: pandas; extra == 'pandas'
Requires-Dist: pyarrow; extra == 'pandas'
Provides-Extra: polars
Requires-Dist: polars; extra == 'polars'
Description-Content-Type: text/markdown

# pycasher

Cache Python function results **and their side effects** — stdout, stderr, and filesystem writes — with automatic invalidation.

```bash
uv add pycasher
```

If you want the `casher` CLI outside a project environment, install it as a
tool instead:

```bash
uv tool install pycasher
```

## What makes it different

Most caching libraries cache return values. casher also captures and replays:

- **stdout/stderr** printed during execution
- **Files written** by the function (restored from cache on hit)
- **Files read** by the function (used as cache keys: when a true upstream input file changes, the cache invalidates automatically)

No manual file declarations are needed. casher discovers file I/O automatically: via `strace` in subprocess mode, or via wrapped Python file handles plus tracked `shutil.copyfile()`-based copies in in-process mode.
Files that are written before they are first read during one invocation are treated as generated outputs, not cache inputs.
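
The write-before-read rule can be illustrated with a small standalone sketch. This is not casher's implementation: the `classify_file_events` helper and the `(operation, path)` event format are hypothetical, and treating a file that is read first and written later as both an input and an output is an assumption.

```python
from collections.abc import Iterable


def classify_file_events(
    events: Iterable[tuple[str, str]],
) -> tuple[set[str], set[str]]:
    """Split traced paths into (inputs, outputs) based on access order.

    `events` is an ordered sequence of ("read" | "write", path) pairs.
    Hypothetical helper for illustration only.
    """
    inputs: set[str] = set()
    outputs: set[str] = set()
    for op, path in events:
        if op == "write":
            # Every written file counts as an output.
            outputs.add(path)
        elif op == "read" and path not in outputs:
            # Read before any write: a true upstream input.
            inputs.add(path)
    return inputs, outputs


events = [
    ("read", "train.csv"),
    ("write", "model.pkl"),
    ("read", "model.pkl"),  # read after write: stays an output only
]
inputs, outputs = classify_file_events(events)
print(sorted(inputs), sorted(outputs))
```

Here `model.pkl` never becomes a cache input, because its first access in the invocation is a write.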

## Usage

```python
from casher import cached

# read_csv, fit, and save are placeholders for your own helpers.
@cached
def train(data_path: str, output_path: str, lr: float = 0.01) -> dict:
    df = read_csv(data_path)
    model = fit(df, lr=lr)
    save(model, output_path)
    return {"accuracy": model.score}

# First call — runs function, traces file I/O, caches everything
result = train("train.csv", "model.pkl")

# Second call — instant replay from cache (model.pkl restored too)
result = train("train.csv", "model.pkl")

# Change train.csv — casher detects it, re-runs automatically
```

For directory-shaped inputs, keep the argument semantics explicit instead of
making every directory `Path` recursive by magic:

```python
from pathlib import Path

from casher import cached, expand_input_dir


@cached(input_files=lambda data_dir: expand_input_dir(data_dir, "*.csv"))
def build_dataset(data_dir: Path) -> int:
    return len(list(data_dir.glob("*.csv")))
```

`Path` arguments that point to files are hashed by file content for the
function-argument portion of the cache key. Auto-discovered input files remain
path-sensitive and content-sensitive.

If you pass an existing filesystem path as a plain `str`, casher warns once per
parameter and process. String arguments still hash as strings; `Path`
arguments make path-aware hashing intent explicit.
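
The difference between the two argument types can be sketched with the standard library alone. The `hash_argument` function below is hypothetical and only mirrors the behavior described above; casher's actual hashing scheme, digest algorithm, and key layout are not shown here and may differ.

```python
import hashlib
import tempfile
from pathlib import Path


def hash_argument(value: object) -> str:
    """Hypothetical argument hasher: file content for Path, repr otherwise."""
    digest = hashlib.sha256()
    if isinstance(value, Path) and value.is_file():
        # Path pointing at a file: the key follows the file's bytes.
        digest.update(b"file:")
        digest.update(value.read_bytes())
    else:
        # Everything else, including str paths, hashes as a plain value.
        digest.update(b"value:")
        digest.update(repr(value).encode("utf-8"))
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "train.csv"
    data.write_text("a,b\n1,2\n")
    path_before, str_before = hash_argument(data), hash_argument(str(data))
    data.write_text("a,b\n3,4\n")
    path_after, str_after = hash_argument(data), hash_argument(str(data))

path_hash_changed = path_before != path_after  # follows the file content
str_hash_stable = str_before == str_after      # only sees the path text
print(path_hash_changed, str_hash_stable)
```

In this sketch an edited file changes the `Path` hash but not the `str` hash, which is the kind of silent mismatch the warning is meant to surface.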

Workflow-style functions can declare output directories explicitly to keep
reads under those roots out of `input_files`:

```python
from pathlib import Path

from casher import cached


work_dir = Path("work")


@cached(output_roots=[work_dir], replay_outputs="if-missing")
def assemble_workset() -> Path:
    generated = work_dir / "reference" / "mworld.par"
    generated.parent.mkdir(parents=True, exist_ok=True)
    generated.write_text("patched content")
    generated.read_text()
    return generated
```

On a cache hit, output files whose content already matches the cache are not
restored again. Set `replay_outputs=False` or `replay_outputs="if-missing"` to
restrict file replay further.

Cache any shell command without code changes:

```bash
casher -- python train.py --data train.csv
```

## Key features

- **Automatic file tracking**: strace (kernel-level, catches C extensions) or wrapped file handles plus `shutil.copy2()` / `copytree()` coverage in in-process mode
- **Generated-output awareness**: files written before their first read are excluded from `input_files`
- **Dependency invalidation**: changes to imported `.py` files invalidate the cache
- **Narrow dependency overrides**: use `dep_files=[...]` when only specific modules should invalidate a function
- **File-hash memoization**: unchanged files reuse cached content hashes from a small SQLite metadata store
- **LRU eviction**: configurable via `max_cache_bytes` or `CASHER_MAX_CACHE_BYTES` env var (default 32 GiB)
- **Faster hits for large artifacts**: output replay skips files whose current hash already matches the cached output
- **DataFrame support**: polars and pandas DataFrames serialized via Arrow IPC
- **Environment-aware**: include env vars in cache key with `env_vars=["MY_VAR"]`
- **Miss diagnostics**: `diagnose_misses=True` logs which recorded input changed or disappeared
- **Path-clarity warnings**: existing filesystem paths passed as plain `str` args emit a one-time warning suggesting `Path(...)`
- **Earlier progress logs**: lookup, execution start, execution finish, and cache-store phases are logged so long misses are visible live
- **Structured logging**: loguru INFO for config changes, enablement, hit/miss, mode, eviction
- **Explicit directory expansion helper**: `expand_input_dir()` for stable `input_files` lists
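
As an illustration of the file-hash memoization idea from the list above, the following sketch keeps content hashes in SQLite and reuses them while a file's size and modification time are unchanged. The table layout, the `(mtime_ns, size)` freshness check, and both function names are assumptions made for this example and are not taken from casher.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path


def open_hash_store(db_path: str = ":memory:") -> sqlite3.Connection:
    """Create (or open) a tiny metadata store for file hashes."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS file_hashes ("
        "path TEXT PRIMARY KEY, mtime_ns INTEGER, size INTEGER, sha256 TEXT)"
    )
    return conn


def memoized_file_hash(conn: sqlite3.Connection, path: Path) -> str:
    """Return the file's SHA-256, rehashing only when its metadata changed."""
    stat = path.stat()
    key = str(path.resolve())
    row = conn.execute(
        "SELECT mtime_ns, size, sha256 FROM file_hashes WHERE path = ?", (key,)
    ).fetchone()
    if row is not None and row[0] == stat.st_mtime_ns and row[1] == stat.st_size:
        return row[2]  # metadata unchanged: reuse the stored hash
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    conn.execute(
        "INSERT OR REPLACE INTO file_hashes VALUES (?, ?, ?, ?)",
        (key, stat.st_mtime_ns, stat.st_size, digest),
    )
    conn.commit()
    return digest


conn = open_hash_store()
with tempfile.TemporaryDirectory() as tmp:
    f = Path(tmp) / "train.csv"
    f.write_text("abc")
    first = memoized_file_hash(conn, f)
    second = memoized_file_hash(conn, f)  # served from the store
    f.write_text("abcdef")                # size changes, so the entry is stale
    third = memoized_file_hash(conn, f)
print(first == second, first != third)
```

The trade-off in such a scheme is that a rewrite preserving both size and mtime would go unnoticed, which is why the freshness check here is labeled an assumption.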

## Configuration

| Env var                  | Default                | Description                                                        |
| ------------------------ | ---------------------- | ------------------------------------------------------------------ |
| `CASHER_CACHE_DIR`       | _unset_                | Cache storage directory. Caching stays disabled until this is set. |
| `CASHER_MAX_CACHE_BYTES` | `34359738368` (32 GiB) | Max cache size before LRU eviction                                 |

Or set programmatically (takes priority over env vars):

```python
from casher import configure, get_config

configure(cache_dir="/data/my_cache", max_cache_bytes=10 * 1024**3)
print(get_config())  # effective config
```

If no cache directory is configured via `CASHER_CACHE_DIR`, `configure(cache_dir=...)`,
`@cached(cache_dir=...)`, or `casher --cache-dir ...`, casher runs transparently without
caching and emits a one-time warning.

## Platform support

Full caching is supported on **Linux** only (subprocess mode requires `strace`, and locking requires `fcntl`). On macOS and Windows the decorator is a transparent pass-through: functions execute normally, and caching is skipped with a one-time warning.
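
To check up front whether a machine meets these requirements, a standalone probe along the following lines may help. Both helper names are hypothetical, and this is not how casher itself performs its platform check; it relies only on `sys.platform` and `shutil.which` from the standard library.

```python
import shutil
import sys


def caching_supported(platform: str = sys.platform) -> bool:
    """Return True on Linux, the only platform with full caching."""
    return platform.startswith("linux")


def strace_available() -> bool:
    """Return True if an `strace` executable is on PATH (needed for subprocess mode)."""
    return shutil.which("strace") is not None


if not caching_supported():
    print("casher will act as a pass-through on this platform")
elif not strace_available():
    print("strace not found: subprocess mode is unavailable")
else:
    print("Linux with strace: subprocess mode requirements are met")
```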

## Documentation

See [documentation/](documentation/) for detailed docs:

- [Introduction](documentation/00-Introduction.md) — architecture, limitations, in-process vs subprocess comparison
- [Quick Start](documentation/10-Quick-Start.md) — installation, decorator options, CLI usage
- [API Reference](documentation/20-API-Reference.md) — full API surface

## License

MIT
