Metadata-Version: 2.4
Name: pycasher
Version: 0.5.15
Summary: Cache function results and side effects (stdout, stderr, file writes) with automatic file I/O discovery via strace or audit hooks
License-Expression: MIT
License-File: LICENSE
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.12
Requires-Dist: loguru
Requires-Dist: lz4
Provides-Extra: pandas
Requires-Dist: pandas; extra == 'pandas'
Requires-Dist: pyarrow; extra == 'pandas'
Provides-Extra: polars
Requires-Dist: polars; extra == 'polars'
Description-Content-Type: text/markdown

# pycasher

Cache Python function results **and their side effects** — stdout, stderr, and filesystem writes — with automatic invalidation.

```bash
uv add pycasher
```

If you want the `casher` CLI outside a project environment, install it as a
tool instead:

```bash
uv tool install pycasher
```

## What makes it different

Most caching libraries cache return values. casher also captures and replays:

- **stdout/stderr** printed during execution
- **Files written** by the function (restored from cache on hit)
- **Files read** by the function (used as cache keys — change a true upstream input file, cache auto-invalidates)

No manual file declarations needed. casher discovers file I/O automatically via `strace` (subprocess mode) or wrapped Python file handles plus tracked `shutil.copyfile()`-based copies in in-process mode.
Files that are written before they are first read during one invocation are treated as generated outputs, not cache inputs.

On Linux, `subprocess=True` is the authoritative dependency mode: cache reuse is validated against the exact implicit input paths recorded during the last successful execution, including imported project source files discovered in the child process. `subprocess=False` remains useful, but only as best-effort Python-level tracking.

## Usage

```python
from casher import cached, expand_input_dir

@cached
def train(data_path: str, output_path: str, lr: float = 0.01) -> dict:
    df = read_csv(data_path)
    model = fit(df, lr=lr)
    save(model, output_path)
    return {"accuracy": model.score}

# First call — runs function, traces file I/O, caches everything
result = train("train.csv", "model.pkl")

# Second call — instant replay from cache (model.pkl restored too)
result = train("train.csv", "model.pkl")

# Change train.csv — casher detects it, re-runs automatically
```

For directory-shaped inputs, keep the argument semantics explicit instead of
making every directory `Path` recursive by magic:

```python
from pathlib import Path

from casher import cached, expand_input_dir


@cached(input_files=lambda data_dir: expand_input_dir(data_dir, "*.csv"))
def build_dataset(data_dir: Path) -> int:
    return len(list(data_dir.glob("*.csv")))
```

`Path` arguments that point to files are hashed by file content for the
function-argument portion of the cache key. Auto-discovered input files remain
path-sensitive and content-sensitive.

If you pass an existing filesystem path as a plain `str`, casher warns once per
parameter and process. String arguments still hash as strings; `Path`
arguments make path-aware hashing intent explicit.

Declared output paths are treated differently: `Path` arguments that are listed
in `output_files=` or fall under `output_roots=` are hashed by path identity,
not file content. That keeps cache keys stable when the function creates those
outputs during the first run.

casher now also learns this automatically for common workflow signatures. If a
`Path` argument is observed during execution to be a generated output, later
lookups treat that argument as an output-path identity even without explicit
`output_files=` / `output_roots=` declarations.

Workflow-style functions can declare output directories explicitly to keep
reads under those roots out of `input_files`:

```python
from pathlib import Path

from casher import cached


work_dir = Path("work")


@cached(output_roots=[work_dir], replay_outputs="if-missing")
def assemble_workset() -> Path:
    generated = work_dir / "reference" / "mworld.par"
    generated.parent.mkdir(parents=True, exist_ok=True)
    generated.write_text("patched content")
    generated.read_text()
    return generated
```

On cache hit, unchanged output files are not restored again. You can also set
`replay_outputs=False` or `replay_outputs="if-missing"` to control file replay.

Cache any shell command without code changes:

```bash
casher -- python train.py --data train.csv
```

## Key features

- **Automatic file tracking**: authoritative `strace`-based subprocess tracking on Linux, plus best-effort Python-level tracking in in-process mode
- **Generated-output awareness**: files written before their first read are excluded from `input_files`
- **Automatic output-arg stabilization**: generated output `Path` arguments are learned from runtime I/O and matched by path identity on later lookups
- **Dependency invalidation**: in subprocess mode, imported project `.py` files are recorded as ordinary implicit inputs and revalidated by path + content hash on lookup
- **Optional dependency narrowing**: use `dep_roots=[...]` or `dep_files=[...]` to limit which imported source files from subprocess mode are considered relevant
- **File-hash memoization**: unchanged files reuse cached content hashes from a small SQLite metadata store
- **LRU eviction**: configurable via `max_cache_bytes` or `CASHER_MAX_CACHE_BYTES` env var (default 32 GB)
- **Faster hits for large artifacts**: output replay skips files whose current hash already matches the cached output
- **DataFrame support**: polars and pandas DataFrames serialized via Arrow IPC
- **Environment-aware**: include env vars in cache key with `env_vars=["MY_VAR"]`
- **Miss diagnostics**: `diagnose_misses=True` logs which recorded input changed or disappeared
- **Path-clarity warnings**: existing filesystem paths passed as plain `str` args emit a one-time warning suggesting `Path(...)`
- **Earlier progress logs**: lookup, execution start, execution finish, and cache-store phases are logged so long misses are visible live
- **Structured logging**: loguru INFO for config changes, enablement, hit/miss, mode, eviction
- **Explicit directory expansion helper**: `expand_input_dir()` for stable `input_files` lists

## Configuration

| Env var                  | Default               | Description                                                        |
| ------------------------ | --------------------- | ------------------------------------------------------------------ |
| `CASHER_CACHE_DIR`       | _unset_               | Cache storage directory. Caching stays disabled until this is set. |
| `CASHER_MAX_CACHE_BYTES` | `34359738368` (32 GB) | Max cache size before LRU eviction                                 |

Or set programmatically (takes priority over env vars):

```python
from casher import configure, get_config

configure(cache_dir="/data/my_cache", max_cache_bytes=10 * 1024**3)
print(get_config())  # effective config
```

If no cache directory is configured via `CASHER_CACHE_DIR`, `configure(cache_dir=...)`,
`@cached(cache_dir=...)`, or `casher --cache-dir ...`, casher runs transparently without
caching and emits a one-time warning.

## Platform support

Full caching on **Linux** only (requires strace for authoritative subprocess mode, fcntl for locking). On macOS and Windows the decorator is a transparent pass-through — functions execute normally, caching is skipped with a one-time warning.

## Documentation

See [documentation/](documentation/) for detailed docs:

- [Introduction](documentation/00-Introduction.md) — architecture, limitations, in-process vs subprocess comparison
- [Quick Start](documentation/10-Quick-Start.md) — installation, decorator options, CLI usage
- [API Reference](documentation/20-API-Reference.md) — full API surface

## License

MIT
