Metadata-Version: 2.3
Name: stable-write
Version: 0.1.2
Summary: Deterministic, atomic, save-only-if-modified file writing for Python data pipelines.
Keywords: deterministic,atomic,file-io,data-pipeline,xlsx,zip,shapefile
Author: Fabien Farella
Author-email: Fabien Farella <fabien.farella@gmail.com>
License: MIT
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Utilities
Requires-Dist: dataclasses>=0.6 ; python_full_version < '3.7'
Requires-Dist: importlib-metadata>=3.6 ; python_full_version < '3.8'
Requires-Dist: typing-extensions>=3.7.4 ; python_full_version < '3.8'
Requires-Python: >=3.6
Project-URL: Homepage, https://github.com/ews-ffarella/stablewrite
Project-URL: Repository, https://github.com/ews-ffarella/stablewrite
Project-URL: Bug Tracker, https://github.com/ews-ffarella/stablewrite/issues
Description-Content-Type: text/markdown

# 📝 stablewrite

[![CI](https://github.com/ews-ffarella/stablewrite/actions/workflows/ci.yml/badge.svg)](https://github.com/ews-ffarella/stablewrite/actions/workflows/ci.yml)
[![PyPI](https://img.shields.io/pypi/v/stable-write)](https://pypi.org/project/stable-write/)

**Deterministic, atomic, save-only-if-modified file writing for Python data pipelines.**

`stablewrite` is for scripts that generate files repeatedly but should only touch the output when the underlying data actually changed.

If you use **Snakemake, Make, Docker volumes, CI caches, notebooks, or report pipelines**, you have probably seen this: a script re-runs, writes the same data again, updates the file modification time, and suddenly half the downstream workflow rebuilds for no real reason.

`stablewrite` fixes that by writing into an isolated temporary directory first, normalizing volatile metadata, comparing the result with the existing destination, and publishing only when the finalized output is meaningfully different.

```python
from stable_write import save_if_changed

with save_if_changed("output/report.csv") as saver:
    saver.path.write_text("id,value\n1,100\n", encoding="utf-8")

print(saver)  # saved or skipped, with hashes and reason available on the object
```

If the generated bytes match the existing file, the destination is left untouched. Its `mtime` stays exactly as it was.
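
The stage, compare, and publish idea can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not `stablewrite`'s actual implementation: `file_digest` and `write_if_changed` are hypothetical names, and the sketch leaves out finalizers, companions, and error handling.

```python
import hashlib
import os
import shutil
import tempfile
from pathlib import Path


def file_digest(path: Path, algo: str = "blake2b") -> str:
    """Hash a file in chunks so large outputs are not loaded into memory."""
    hasher = hashlib.new(algo)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def write_if_changed(destination: Path, data: bytes) -> bool:
    """Hypothetical helper: return True if the destination was replaced."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as staging_dir:
        # Stage the new output away from the destination.
        staged = Path(staging_dir) / destination.name
        staged.write_bytes(data)
        if destination.exists() and file_digest(staged) == file_digest(destination):
            return False  # identical bytes: the destination and its mtime stay untouched
        # Copy next to the destination first so os.replace stays on one filesystem.
        sibling = destination.with_name(destination.name + ".tmp")
        shutil.copyfile(staged, sibling)
        os.replace(sibling, destination)
        return True
```

Calling `write_if_changed(Path("output/report.csv"), b"id,value\n1,100\n")` twice in a row returns `True` the first time and `False` the second, and the second call leaves the file's `mtime` alone.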

## ✨ Features

- **Save only if changed**: unchanged outputs are discarded, preserving destination `mtime` and avoiding unnecessary downstream rebuilds.
- **Atomic publish step**: files are staged away from the destination, then copied to a destination-side temp file and published with `os.replace`.
- **Deterministic ZIP/OOXML profiles**: built-in profiles for `.zip`, `.xlsx`, `.docx`, and `.pptx` normalize ZIP metadata and strip volatile OOXML core properties.
- **Companion file support**: publish bundles such as ESRI Shapefiles (`.shp`, `.dbf`, `.shx`, `.prj`, `.cpg`) together with the main file.
- **Strict explicit companions**: if you request `companions=["foo.csv"]`, that file must be created, otherwise the save fails without publishing anything.
- **Semantic comparison hook**: use `is_equal=` for formats where byte stability is unrealistic but structural equality is easy to check.
- **Zero core dependencies**: the built-in profiles use only the Python standard library (plus small backports on Python 3.6 and 3.7). Writer libraries such as pandas, openpyxl, and GeoPandas are only needed by your own code.
- **Large ZIP friendly**: ZIP entries are streamed during normalization, so embedded media in `.pptx` or `.docx` files do not need to be loaded fully into memory.

## 📦 Installation

Install the core package:

```bash
pip install stable-write
```

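The core library has no runtime dependencies on Python 3.8 and later; older interpreters pull in the `dataclasses`, `importlib-metadata`, and `typing-extensions` backports. Install the writer libraries you use in your own pipeline: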

```bash
pip install pandas openpyxl      # if you generate Excel files
pip install geopandas            # if you write shapefiles or GeoPackages
```

The built-in `xlsx`, `docx`, and `pptx` profiles do not require `openpyxl`; they patch the OOXML ZIP structure directly with the standard library.

## 🚀 Quickstart

### Basic Usage

Write to `saver.path`, not directly to the final destination. After the `with` block exits, `stablewrite` decides whether to publish.

```python
from stable_write import save_if_changed

with save_if_changed("output/report.csv") as saver:
    saver.path.write_text("id,value\n1,100\n", encoding="utf-8")

if saver.saved:
    print(f"Updated {saver.destination} ({saver.new_hash})")
else:
    print(f"Skipped: {saver.reason}")
```

### The Excel Timestamp Problem

`pandas.DataFrame.to_excel()` writes an OOXML workbook. The workbook can include dynamic metadata such as `dcterms:modified`, so two identical DataFrames saved one second apart can produce different file hashes.

Use the `xlsx` profile, or the convenience wrapper, to normalize the workbook before comparison:

```python
import pandas as pd
from stable_write import save_xlsx_if_changed

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

with save_xlsx_if_changed("results/data.xlsx") as saver:
    df.to_excel(saver.path, index=False)

if saver.saved:
    print("Excel report changed")
```

Under the hood the `xlsx` profile patches `docProps/core.xml` and then rewrites the ZIP container with deterministic entry ordering, timestamps, and extra fields.
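
To see why container metadata matters, the sketch below builds two ZIP archives with identical contents and entry timestamps two seconds apart, then rewrites both with sorted entries and a fixed timestamp. It is an illustration with hypothetical helpers (`build_zip`, `normalize_zip`, `digest`), not the library's `normalize_zip_metadata`; unlike the library, it reads each entry fully into memory for brevity.

```python
import hashlib
import io
import zipfile

FIXED_TIME = (1980, 1, 1, 0, 0, 0)  # earliest timestamp the ZIP format can store


def build_zip(timestamp: tuple) -> bytes:
    """Build a tiny archive whose only varying input is the entry timestamp."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in ("b.txt", "a.txt"):
            archive.writestr(zipfile.ZipInfo(name, date_time=timestamp), name.encode())
    return buffer.getvalue()


def normalize_zip(raw: bytes) -> bytes:
    """Rewrite an archive with sorted entry names and a fixed timestamp."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(raw)) as src, zipfile.ZipFile(out, "w") as dst:
        for name in sorted(src.namelist()):
            info = zipfile.ZipInfo(name, date_time=FIXED_TIME)
            info.compress_type = zipfile.ZIP_DEFLATED
            dst.writestr(info, src.read(name))
    return out.getvalue()


def digest(raw: bytes) -> str:
    return hashlib.blake2b(raw).hexdigest()


first = build_zip((2024, 1, 1, 12, 0, 0))
second = build_zip((2024, 1, 1, 12, 0, 2))
print(digest(first) == digest(second))                                # False
print(digest(normalize_zip(first)) == digest(normalize_zip(second)))  # True
```

The raw hashes differ even though every entry holds the same bytes; after normalization they agree, which is what lets a hash comparison skip the write.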

### Companion Files

Some formats are bundles, not single files. ESRI Shapefiles are the classic example: writing `spatial.shp` usually also creates `spatial.shx`, `spatial.dbf`, `spatial.prj`, and `spatial.cpg`.

Use `companions="auto"` when the writer decides which companion files exist:

```python
import geopandas as gpd
from stable_write import save_if_changed

gdf = gpd.read_file("raw_data.geojson")

with save_if_changed("processed/spatial.shp", companions="auto") as saver:
    gdf.to_file(saver.path)

if "spatial.dbf" in saver.changed_companions:
    print("Attribute table changed")
```

If any companion changes, the save is treated as changed and the bundle is published. Each file is replaced atomically on its own; the bundle as a whole is not transactional across multiple files.

Use an explicit list when every companion is required:

```python
with save_if_changed(
    "processed/spatial.shp",
    companions=["spatial.shx", "spatial.dbf", "spatial.prj"],
) as saver:
    gdf.to_file(saver.path)
```

If one of those listed files is missing from the temporary directory, `stablewrite` raises `FileNotFoundError` and leaves the destination untouched. That makes explicit companions a contract, while `companions="auto"` remains the optional/discovery mode.
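
The difference between the two modes can be sketched as follows. Both helpers are hypothetical and only illustrate the contract described above; in particular, matching companions by shared file stem is an assumption of this sketch, not a statement about how `stablewrite` discovers them.

```python
from pathlib import Path
from typing import Iterable, List


def discover_companions(staged_main: Path) -> List[str]:
    """Hypothetical "auto" mode: sibling files that share the main file's stem."""
    return sorted(
        candidate.name
        for candidate in staged_main.parent.iterdir()
        if candidate.is_file()
        and candidate.stem == staged_main.stem
        and candidate != staged_main
    )


def require_companions(staged_main: Path, required: Iterable[str]) -> List[str]:
    """Hypothetical strict mode: every listed companion must have been written."""
    names = list(required)
    missing = [name for name in names if not (staged_main.parent / name).is_file()]
    if missing:
        # Fail before anything is published, so the destination stays untouched.
        raise FileNotFoundError(f"missing companion files: {', '.join(missing)}")
    return names
```

Discovery never fails on an absent file, while the strict check turns a missing `.prj` into an error instead of a silently incomplete bundle.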

### Custom Semantic Comparison

Some formats are not realistically byte-stable. SQLite-based formats such as GeoPackage (`.gpkg`) may include internal metadata, page ordering, or timestamps that make byte hashes noisy.

For those cases, provide `is_equal=`. The callable receives the newly generated temp file and the existing destination and returns whether they are equivalent.

```python
from pathlib import Path

import geopandas as gpd
from stable_write import save_if_changed


def gpkg_is_equal(new: Path, existing: Path) -> bool:
    """Compare GeoPackages by data content, not raw bytes."""
    new_data = gpd.read_file(new)
    old_data = gpd.read_file(existing)
    return new_data.equals(old_data)


with save_if_changed("data/roads.gpkg", is_equal=gpkg_is_equal) as saver:
    gdf.to_file(saver.path, driver="GPKG")
```

`old_hash` and `new_hash` are still computed and stored. `is_equal` only replaces the equality decision for the main file. Companion files are still compared by hash.

## ⚙️ API Overview

### `save_if_changed(...)`

```python
save_if_changed(
    path,
    *,
    profile=None,
    finalizers=None,
    save_strategy="overwrite",
    algo="blake2b",
    safe_copy=False,
    companions="auto",
    is_equal=None,
)
```

| Argument        | Purpose                                                                                    |
| --------------- | ------------------------------------------------------------------------------------------ |
| `path`          | Final destination path.                                                                    |
| `profile`       | Named profile: `"zip"`, `"xlsx"`, `"docx"`, `"pptx"`, or any registered custom profile.   |
| `finalizers`    | Ordered list of custom `(Path) -> None` functions run before hashing. Overrides `profile`. |
| `save_strategy` | What to do when content changed: `"overwrite"`, `"raise"`, or `"skip"`.                    |
| `algo`          | Hash algorithm used for byte comparison. Defaults to `"blake2b"`.                          |
| `safe_copy`     | Use `shutil.copyfile` instead of `shutil.copy2` for the publish copy.                      |
| `companions`    | `"auto"`, `None`, `[]`, or an explicit list of companion filenames.                        |
| `is_equal`      | Optional semantic comparator for the main file.                                            |

### Registry

Profiles are stored in a global registry. The following functions manage it:

| Function                                           | Purpose                                                         |
| -------------------------------------------------- | --------------------------------------------------------------- |
| `register_profile(name, finalizers, is_equal, force)` | Register a named profile for use with `profile=`.            |
| `get_profile(name) → Profile`                      | Retrieve a registered profile; raises `ValueError` if absent.  |
| `list_profiles() → list[str]`                      | Return a sorted list of all registered profile names.           |

All three are importable directly from `stable_write`.

### Built-In Profiles

| Profile | Finalizers                                       | Use case                                                               |
| ------- | ------------------------------------------------ | ---------------------------------------------------------------------- |
| `zip`   | `normalize_zip_metadata`                         | Generic ZIP archives with volatile entry metadata.                     |
| `xlsx`  | `strip_ooxml_metadata`, `normalize_zip_metadata` | Generated Excel workbooks, including pandas/openpyxl output.           |
| `docx`  | `strip_ooxml_metadata`, `normalize_zip_metadata` | Generated Word documents.                                              |
| `pptx`  | `strip_ooxml_metadata`, `normalize_zip_metadata` | Generated PowerPoint files, including files with large embedded media. |

### Result Object

Inside the context manager you receive a `Saver`. After the context exits, it exposes:

| Attribute                  | Meaning                                                     |
| -------------------------- | ----------------------------------------------------------- |
| `saver.path`               | Temporary path you should write to inside the `with` block. |
| `saver.destination`        | Final destination path.                                     |
| `saver.saved`              | `True` if the destination was replaced.                     |
| `saver.changed`            | `True` if the new output differed from the existing output. |
| `saver.reason`             | Human-readable decision reason.                             |
| `saver.old_hash`           | Hash of the existing destination, or `None` when missing.   |
| `saver.new_hash`           | Hash of the finalized temp file.                            |
| `saver.changed_companions` | Companion filenames whose bytes changed or appeared.        |

## 🧭 Save Strategies

Use `save_strategy` to control what happens when content changed:

- `"overwrite"` (default): publish the new output.
- `"raise"`: raise `FileExistsError` and leave the destination untouched.
- `"skip"`: do not publish, but populate `changed`, `reason`, and hashes on the saver.

`"raise"` is useful for strict notebook evaluation or audit workflows where a rerun must never mutate canonical outputs silently.
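
As a rough mental model, the three strategies reduce to a small decision function. `should_publish` is a hypothetical name used only for this sketch; it assumes the comparison has already been made and does not cover how the library treats a destination that does not exist yet.

```python
def should_publish(changed: bool, save_strategy: str = "overwrite") -> bool:
    """Hypothetical helper: return True when the new output should be published."""
    if not changed:
        return False  # identical output is never republished
    if save_strategy == "overwrite":
        return True
    if save_strategy == "skip":
        return False  # the change is only reported, not written
    if save_strategy == "raise":
        raise FileExistsError("output changed and save_strategy='raise'")
    raise ValueError(f"unknown save_strategy: {save_strategy!r}")
```

Note that the strategy only matters when the content changed; an unchanged output is skipped under all three.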

## 🔌 Custom Profiles

You can package reusable finalizer chains as named profiles. A registered
profile can be selected with `profile=` anywhere you call `save_if_changed`,
including in third-party libraries built on top of `stablewrite`.

```python
from pathlib import Path

from stable_write import register_profile, save_if_changed


def strip_my_app_header(path: Path) -> None:
    """Remove the generated-on comment from app-specific text exports."""
    lines = path.read_text(encoding="utf-8").splitlines()
    cleaned = [line for line in lines if not line.startswith("# Generated on")]
    path.write_text("\n".join(cleaned) + "\n", encoding="utf-8")


# The finalizer parses the staged file as UTF-8 text, so register it for text
# exports; running it on a binary ZIP archive would fail to decode.
register_profile("my_export", finalizers=[strip_my_app_header])

with save_if_changed("output/export.txt", profile="my_export") as saver:
    write_export(saver.path)  # placeholder for your own writer
```

You can also attach a default `is_equal` comparator to a profile. When
`save_if_changed` resolves the profile, `is_equal` is used automatically unless
the caller provides their own.

```python
register_profile("gpkg", is_equal=gpkg_is_equal)
```

Use `force=True` to replace an existing registration (for example, when
testing or when upgrading a profile at startup).

## 🧹 Custom Finalizers

Finalizers are small functions that mutate the staged temporary file before hashing. They are the right tool when you want the file on disk to be canonical.

Common uses:

- Remove generated headers such as `# Generated on 2026-05-28` from text exports.
- Re-serialize JSON/YAML with sorted keys and stable indentation.
- Strip image metadata from generated plots.
- Remove absolute local paths from generated reports.

Example: canonical JSON output.

```python
import json
from pathlib import Path

from stable_write import save_if_changed


def canonical_json(path: Path) -> None:
    data = json.loads(path.read_text(encoding="utf-8"))
    path.write_text(
        json.dumps(data, sort_keys=True, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )


with save_if_changed("config.json", finalizers=[canonical_json]) as saver:
    some_library.write_json(saver.path)
```

If the finalizer raises, nothing is published. The existing destination stays untouched.
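
The effect of running an ordered chain of finalizers before hashing can be demonstrated without the library. In this sketch, `drop_generated_at` and `finalized_hash` are hypothetical helpers, and `finalized_hash` only mimics the "finalize, then hash" order described above.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable, Iterable


def drop_generated_at(path: Path) -> None:
    """Remove a volatile top-level key from a JSON document."""
    data = json.loads(path.read_text(encoding="utf-8"))
    data.pop("generated_at", None)
    path.write_text(json.dumps(data), encoding="utf-8")


def canonical_json(path: Path) -> None:
    """Re-serialize with sorted keys and stable indentation."""
    data = json.loads(path.read_text(encoding="utf-8"))
    path.write_text(json.dumps(data, sort_keys=True, indent=2) + "\n", encoding="utf-8")


def finalized_hash(
    path: Path,
    finalizers: Iterable[Callable[[Path], None]],
    algo: str = "blake2b",
) -> str:
    """Apply each finalizer in order, then hash the resulting bytes."""
    for finalize in finalizers:
        finalize(path)
    return hashlib.new(algo, path.read_bytes()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "first.json"
    second = Path(tmp) / "second.json"
    first.write_text('{"b": 2, "a": 1, "generated_at": "2026-05-28"}', encoding="utf-8")
    second.write_text('{"a": 1, "generated_at": "2026-05-29", "b": 2}', encoding="utf-8")
    chain = [drop_generated_at, canonical_json]
    same = finalized_hash(first, chain) == finalized_hash(second, chain)

print(same)  # True
```

The two inputs differ in key order and timestamp, yet they hash identically once the chain has run, so the second write would be skipped.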

## 🤔 Finalizers vs. `is_equal`

Both features help with formats that produce noisy bytes. They solve different problems.

Use a **finalizer** when you want to fix the generated file before it lands on disk:

- the stored file should have stable formatting;
- downstream tools rely on byte-level stability;
- Git diffs should be clean;
- hashes should represent the normalized artifact.

Use **`is_equal`** when you only need a smarter comparison:

- the file format is hard to rewrite safely;
- semantic equality is easy to compute in Python;
- you want to ignore fields during comparison without altering newly saved files;
- you need tolerance-based comparison, such as approximate floats.

Example: compare JSON semantically while ignoring a volatile nested key.

```python
import json
from pathlib import Path

from stable_write import save_if_changed


def json_equal_ignoring_timestamp(new_path: Path, existing_path: Path) -> bool:
    new_data = json.loads(new_path.read_text(encoding="utf-8"))
    old_data = json.loads(existing_path.read_text(encoding="utf-8"))

    new_data.get("metadata", {}).pop("generated_at", None)
    old_data.get("metadata", {}).pop("generated_at", None)

    return new_data == old_data


with save_if_changed("config.json", is_equal=json_equal_ignoring_timestamp) as saver:
    some_library.write_json(saver.path)
```

If `is_equal` returns `False`, the raw generated temp file is published. If you also want to clean the file before publication, use a finalizer as well.

| Scenario                                      | Prefer finalizer                  | Prefer `is_equal`   |
| --------------------------------------------- | --------------------------------- | ------------------- |
| Stable JSON key order on disk                 | Yes                               | Maybe not necessary |
| Ignore a nested timestamp only for comparison | Possible, but changes stored file | Yes                 |
| Clean Git diffs                               | Yes                               | No                  |
| Approximate float comparison                  | No                                | Yes                 |
| Non-Python downstream byte cache              | Yes                               | No                  |
| Expensive or risky binary rewrite             | No                                | Yes                 |
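
For the approximate-float row, a comparator might look like the sketch below. `csv_close` is a hypothetical function, not part of the library; it follows the `(new, existing) -> bool` shape that `is_equal` expects and assumes UTF-8 CSV files small enough to load into memory.

```python
import csv
import math
from pathlib import Path


def _cells_match(a: str, b: str, rel_tol: float) -> bool:
    """Compare numeric cells with a tolerance and everything else exactly."""
    try:
        return math.isclose(float(a), float(b), rel_tol=rel_tol)
    except ValueError:
        return a == b


def csv_close(new: Path, existing: Path, rel_tol: float = 1e-9) -> bool:
    """Return True when both CSV files match cell by cell within rel_tol."""
    with new.open(newline="", encoding="utf-8") as f_new:
        new_rows = list(csv.reader(f_new))
    with existing.open(newline="", encoding="utf-8") as f_old:
        old_rows = list(csv.reader(f_old))
    if len(new_rows) != len(old_rows):
        return False
    return all(
        len(row_new) == len(row_old)
        and all(_cells_match(a, b, rel_tol) for a, b in zip(row_new, row_old))
        for row_new, row_old in zip(new_rows, old_rows)
    )
```

It can be passed as `is_equal=csv_close`; for a looser tolerance, wrap it with `functools.partial(csv_close, rel_tol=1e-6)`.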

## 🧱 Guarantees and Boundaries

`stablewrite` is intentionally conservative:

- Finalizers run before hashing, so profiles can make noisy output deterministic.
- Finalizer failures leave the destination untouched.
- The final publish uses destination-side temporary files and `os.replace`.
- For companion bundles, each file is replaced atomically, but the bundle is not a transaction.
- `is_equal` affects only the main file; companions are still tracked by hash.
- Explicit companion lists are strict. Use `companions="auto"` when companion files are optional.

## 🧪 Why This Matters

A plain write updates `mtime` even when the content is identical:

```python
Path("report.csv").write_text(render_report())
```

That is enough to wake up downstream jobs in Make, Snakemake, Docker layer caches, or CI artifacts.

`stablewrite` makes the write conditional on the finalized artifact:

```python
with save_if_changed("report.csv") as saver:
    saver.path.write_text(render_report(), encoding="utf-8")
```

Same data means no replacement, no new `mtime`, and no accidental rebuild.
