Metadata-Version: 2.4
Name: crdt-merge
Version: 0.7.1
Summary: The Merge Algebra Toolkit — composable, streaming, verified CRDT merge for datasets
Author-email: Ryan Gillespie <rgillespie83@icloud.com>
License: Apache-2.0
Project-URL: Homepage, https://github.com/mgillr/crdt-merge
Project-URL: Repository, https://github.com/mgillr/crdt-merge
Project-URL: Issues, https://github.com/mgillr/crdt-merge/issues
Project-URL: Documentation, https://github.com/mgillr/crdt-merge#readme
Project-URL: Changelog, https://github.com/mgillr/crdt-merge/blob/main/CHANGELOG.md
Keywords: crdt,merge,dedup,dataframe,conflict-free,datasets,streaming,schema,conflict-resolution,distributed-systems
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Database
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: NOTICE
Provides-Extra: fast
Requires-Dist: polars>=0.19; extra == "fast"
Requires-Dist: orjson>=3.9; extra == "fast"
Requires-Dist: xxhash>=3.0; extra == "fast"
Provides-Extra: pandas
Requires-Dist: pandas>=1.5; extra == "pandas"
Provides-Extra: polars
Requires-Dist: polars>=0.19; extra == "polars"
Provides-Extra: datasets
Requires-Dist: datasets>=2.0; extra == "datasets"
Provides-Extra: all
Requires-Dist: pandas>=1.5; extra == "all"
Requires-Dist: polars>=0.19; extra == "all"
Requires-Dist: datasets>=2.0; extra == "all"
Requires-Dist: orjson>=3.9; extra == "all"
Requires-Dist: xxhash>=3.0; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7; extra == "dev"
Requires-Dist: hypothesis>=6.80; extra == "dev"
Dynamic: license-file

# crdt-merge

**Conflict-free merge for structured data.** Define strategies. Merge datasets. Prove correctness. Audit every field. Stream at any scale. Zero dependencies.

[![PyPI](https://img.shields.io/badge/pypi-v0.7.0-orange)](https://pypi.org/project/crdt-merge/0.7.0/)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![License: Apache 2.0](https://img.shields.io/badge/license-Apache%202.0-green.svg)](LICENSE)
[![Tests](https://img.shields.io/badge/tests-1114%20passed-brightgreen)](TEST_RESULTS.md)

---

## What's New in v0.7.0

### v0.7.0 — "The Ecosystem Release" (2026-03-28)

**14 new modules. 8 ecosystem accelerators. 1,114 tests. Zero regressions.**

- **MergeQL** — SQL-like interface for CRDT merges: `MERGE t1, t2 ON id STRATEGY score='max'`
- **Self-Merging Parquet** — Parquet files with embedded CRDT metadata that merge themselves
- **Conflict Topology Visualization** — heatmaps, temporal analysis, cluster detection, D3-compatible JSON export
- **Wire Protocol v3** — support for all new v0.7.0 types

#### 8 Ecosystem Accelerators

| # | Accelerator | What It Does |
|---|------------|-------------|
| 🦆 | **DuckDB UDF** | Register CRDT merge as native DuckDB SQL functions |
| 🔧 | **dbt Package** | CRDT merge as dbt models — merge in your warehouse |
| 🦆 | **DuckLake** | Semantic conflict detection for DuckLake catalogs |
| 🐻 | **Polars Plugin** | Native Polars expression plugin for CRDT ops |
| ✈️ | **Arrow Flight** | Merge-as-a-service over Arrow Flight RPC |
| 🔌 | **Airbyte** | CRDT-aware Airbyte destination connector |
| 💾 | **SQLite Extension** | CRDT merge as SQLite custom functions |
| 📊 | **Streamlit** | Visual merge UI — drag, drop, resolve conflicts |

### Previous: v0.6.0 — "The Architecture Release"
- 7 new modules: clocks, schema evolution, merkle, arrow, gossip, async\_merge, parallel
- Arrow-native merge engine — **2.5× measured speedup** on A100 (50M rows)
- 720 tests, zero regressions

---

## What is crdt-merge?

**crdt-merge** is a Python library for merging datasets with conflict resolution. Instead of "last write wins" or manual dedup scripts, you declare per-field merge strategies — and the library guarantees deterministic, order-independent results with full audit trails.

```bash
pip install crdt-merge
```

### What it does

- **Merges datasets** with configurable per-field strategies (max wins, min wins, union, priority, custom)
- **Streams merges** at any scale — O(1) memory for sorted sources, tested to 100M rows
- **Proves correctness** — `@verified_merge` decorator verifies commutativity, associativity, idempotency
- **Audits everything** — per-field provenance trails show exactly which source won each field and why
- **Serializes for the wire** — compact binary format for cross-language CRDT exchange
- **Speaks SQL** — MergeQL lets you express merges as SQL statements
- **Plugs into everything** — DuckDB, dbt, Polars, Arrow Flight, Airbyte, SQLite, Streamlit accelerators
- **Zero dependencies** — pure Python core, embeds anywhere

### What it is NOT

- **Not a real-time collaboration tool.** For collaborative text editing, see [Yjs](https://github.com/yjs/yjs) or [Loro](https://github.com/loro-dev/loro).
- **Not a database.** No persistence, no queries, no networking. It's a library.
- **Not a distributed system.** Includes gossip state tracking and Merkle sync primitives (v0.6.0), but no built-in networking or consensus. It provides the building blocks that distributed systems can use.

### Who is it for?

Anyone merging structured data from multiple sources: data pipelines, ETL, multi-node sync, offline-first apps, federated datasets. The realistic alternatives today are `pandas.merge()` (which offers no conflict resolution) and hand-written dedup scripts.

---

## Quick Start

### 1. Basic Merge

```python
from crdt_merge import merge

# Two datasets with conflicting scores for the same person
dataset_a = [{"id": 1, "name": "Alice", "score": 100}]
dataset_b = [{"id": 1, "name": "Alice", "score": 150}]

result = merge(dataset_a, dataset_b, key="id")
# → [{"id": 1, "name": "Alice", "score": 150}]
```

By default, when two rows share the same key, the second dataset wins ties. You can control this with `prefer="a"`, `prefer="b"`, or `prefer="latest"` (uses timestamps).
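For example — a minimal sketch assuming timestamps are comparable strings (the `updated_at` and `email` column names are illustrative):

```python
from crdt_merge import merge

a = [{"id": 1, "email": "alice@old.example", "updated_at": "2026-01-01T09:00:00"}]
b = [{"id": 1, "email": "alice@new.example", "updated_at": "2026-02-15T12:00:00"}]

# prefer="latest" consults timestamp_col: the row with the newer timestamp wins
result = merge(a, b, key="id", prefer="latest", timestamp_col="updated_at")
# → [{"id": 1, "email": "alice@new.example", "updated_at": "2026-02-15T12:00:00"}]
```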

### 2. Per-Field Strategies with MergeSchema

The real power is declaring **different strategies for different fields**:

```python
from crdt_merge import merge
from crdt_merge.strategies import MergeSchema, MaxWins, LWW, UnionSet

schema = MergeSchema(
    default=LWW(),           # Most fields: last-writer-wins
    score=MaxWins(),          # Scores: highest value wins
    tags=UnionSet()           # Tags: merge as set union
)

a = [{"id": 1, "score": 80, "tags": "python;data", "status": "draft"}]
b = [{"id": 1, "score": 95, "tags": "python;ml",   "status": "published"}]

result = merge(a, b, key="id", schema=schema)
# score: 95 (MaxWins), tags: "data;ml;python" (UnionSet), status: "published" (LWW)
```

**8 built-in strategies:** `LWW`, `MaxWins`, `MinWins`, `UnionSet`, `Priority`, `Concat`, `LongestWins`, `Custom(fn)`

### 3. Streaming Merge at Scale

```python
from crdt_merge import merge_sorted_stream

# Merge two sorted sources with O(1) memory — works with generators from disk/network
def source_a():
    for i in range(10_000_000):
        yield {"id": i, "value": f"a_{i}"}

def source_b():
    for i in range(0, 10_000_000, 2):
        yield {"id": i, "value": f"b_{i}"}

for batch in merge_sorted_stream(source_a(), source_b(), key="id", batch_size=10000):
    process(batch)  # Memory never exceeds batch_size
```

### 4. CRDT Primitives

```python
from crdt_merge import GCounter, PNCounter, LWWRegister, ORSet, LWWMap

# Grow-only counter — nodes increment independently, merge via max-per-node
a = GCounter("node_a")
a.increment("node_a", 10)
b = GCounter("node_b")
b.increment("node_b", 5)
merged = a.merge(b)
print(merged.value)  # 15 — guaranteed correct regardless of merge order
```

All primitives satisfy CRDT properties: **commutative** (a ⊔ b = b ⊔ a), **associative** ((a ⊔ b) ⊔ c = a ⊔ (b ⊔ c)), **idempotent** (a ⊔ a = a).
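A quick order-independence check, sketched with the counters above plus a hypothetical third node:

```python
from crdt_merge import GCounter

a, b, c = GCounter("node_a"), GCounter("node_b"), GCounter("node_c")
a.increment("node_a", 10)
b.increment("node_b", 5)
c.increment("node_c", 7)

ab = a.merge(b)
assert ab.merge(c).value == c.merge(ab).value == 22   # order-independent
assert a.merge(a).value == a.value                    # idempotent
```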

### 5. MergeQL — SQL-Like Merges (v0.7.0)

```python
from crdt_merge.mergeql import MergeQL

ql = MergeQL()
ql.register("nyc", [{"id": 1, "name": "Alice", "salary": 100000}])
ql.register("london", [{"id": 1, "name": "Alice", "salary": 120000}])

result = ql.execute("""
    MERGE nyc, london
    ON id
    STRATEGY salary='max', name='lww'
""")
# salary: 120000 (max wins), name: "Alice" (LWW)
```

### 6. Self-Merging Parquet (v0.7.0)

```python
from crdt_merge.parquet import SelfMergingParquet
from crdt_merge.strategies import MergeSchema, LWW, MaxWins

schema = MergeSchema(default=LWW(), salary=MaxWins())
smf = SelfMergingParquet("customers", key="id", schema=schema)
smf.ingest([{"id": 1, "name": "Alice", "salary": 100}])
smf.ingest([{"id": 1, "name": "Alicia", "salary": 120}])
assert smf.read()[0]["salary"] == 120  # MaxWins applied automatically
```

---

## Feature Matrix

| Module | Feature | Since | Description |
|--------|---------|-------|-------------|
| `core` | GCounter, PNCounter, LWWRegister, ORSet, LWWMap | v0.1.0 | CRDT primitives with merge, serialize, deserialize |
| `dataframe` | `merge()`, `diff()` | v0.1.0 | Dataset merge with key matching, conflict resolution, schema support |
| `dedup` | `dedup_list()`, `dedup_records()`, MinHashDedup | v0.1.0 | Exact, fuzzy (bigram), and MinHash deduplication |
| `json_merge` | `merge_dicts()`, `merge_json_lines()` | v0.1.0 | Deep dict merge with LWW, None-as-missing semantics |
| `strategies` | MergeSchema + 8 strategies | v0.3.0 | Per-field strategy DSL: LWW, MaxWins, MinWins, UnionSet, Priority, Concat, LongestWins, Custom |
| `streaming` | `merge_stream()`, `merge_sorted_stream()` | v0.3.0 | O(batch_size) and O(1) memory streaming merge |
| `delta` | Delta, DeltaStore, `compute_delta()`, `compose_deltas()` | v0.3.0 | Delta-state CRDT sync — compute, apply, compose deltas |
| `provenance` | `merge_with_provenance()`, ProvenanceLog | v0.4.0 | Per-field audit trail: which source won, which strategy, why |
| `verify` | `@verified_merge` decorator | v0.4.0 | Property-based testing of commutativity, associativity, idempotency |
| `wire` | `serialize()`, `deserialize()`, `serialize_batch()` | v0.5.0 | Compact binary wire format for all CRDT types |
| `probabilistic` | MergeableHLL, MergeableBloom, MergeableCMS | v0.5.0 | Probabilistic data structures with CRDT merge semantics |
| `datasets_ext` | `merge_datasets()`, `dedup_dataset()` | v0.1.0 | HuggingFace Datasets integration (optional) |
| `clocks` | HybridLogicalClock, HLC timestamps | v0.6.0 | Hybrid Logical Clocks for distributed CRDT ordering |
| `schema_evolution` | Column mapping, type coercion | v0.6.0 | Automatic schema evolution for mismatched datasets |
| `merkle` | MerkleHashTree, diff, sync | v0.6.0 | Merkle hash trees for efficient incremental sync |
| `arrow` | Arrow-native merge engine | v0.6.0 | Apache Arrow-native merge path (2.5× measured on A100) |
| `gossip` | GossipState, anti-entropy protocol | v0.6.0 | Gossip protocol state tracking for convergence |
| `async_merge` | `async_merge()`, `async_stream()` | v0.6.0 | Async/await wrappers for non-blocking merges |
| `parallel` | `parallel_merge()`, multi-core execution | v0.6.0 | Parallel merge execution across multiple cores |
| `mergeql` | MergeQL SQL interface | v0.7.0 | SQL-like CRDT merge: `MERGE t1, t2 ON id STRATEGY score='max'` |
| `parquet` | SelfMergingParquet | v0.7.0 | Parquet files with embedded CRDT metadata that self-merge |
| `viz` | ConflictTopology, heatmaps | v0.7.0 | Conflict analysis: heatmaps, temporal patterns, cluster detection |

**Ecosystem Accelerators** (v0.7.0):

| Accelerator | Module | Description |
|-------------|--------|-------------|
| 🦆 DuckDB UDF | `accelerators.duckdb_udf` | CRDT merge as native DuckDB SQL functions |
| 🔧 dbt Package | `accelerators.dbt_package` | CRDT merge models for dbt-managed warehouses |
| 🦆 DuckLake | `accelerators.ducklake` | Semantic conflict detection for DuckLake catalogs |
| 🐻 Polars Plugin | `accelerators.polars_plugin` | Native Polars expressions for CRDT operations |
| ✈️ Arrow Flight | `accelerators.flight_server` | Merge-as-a-service over Arrow Flight RPC |
| 🔌 Airbyte | `accelerators.airbyte` | CRDT-aware Airbyte destination connector |
| 💾 SQLite Extension | `accelerators.sqlite_ext` | CRDT merge as SQLite custom functions |
| 📊 Streamlit UI | `accelerators.streamlit_ui` | Visual merge interface with conflict resolution |

**23 core modules + 8 ecosystem accelerators, ~17,200 lines of source, zero required dependencies.**

---

## API Reference

### `merge()` — The Main Entry Point

```python
from crdt_merge import merge

result = merge(
    df_a,                    # First dataset (list of dicts, pandas DataFrame, or Polars DataFrame)
    df_b,                    # Second dataset
    key="id",                # Column to match rows on (raises KeyError if not found)
    prefer="latest",         # "a", "b", or "latest" — conflict resolution (raises ValueError if invalid)
    schema=None,             # Optional MergeSchema for per-field strategies (overrides prefer)
    timestamp_col=None,      # Column with timestamps for LWW resolution
    dedup=True,              # Remove exact duplicates in output
    fuzzy_dedup=False,       # Also remove near-duplicates
    fuzzy_threshold=0.85,    # Similarity threshold for fuzzy dedup
)
```

- When `key` is provided: rows with matching keys are merged, unique rows from both sides are preserved.
- When `key` is `None`: datasets are appended and deduplication is applied.
- When `schema` is provided: per-field strategies override the `prefer` parameter for matched rows.
- Input/output format matches: pass pandas in → get pandas out. Pass list of dicts → get list of dicts.
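A hedged sketch of the `key=None` path (exact duplicates dropped by the default `dedup=True`):

```python
from crdt_merge import merge

a = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
b = [{"id": 2, "name": "Bob"}, {"id": 3, "name": "Carol"}]

# No key: datasets are appended, then exact duplicates are removed
combined = merge(a, b, key=None)
assert len(combined) == 3   # the duplicate Bob row appears only once
```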

### `merge_with_provenance()` — Merge + Full Audit Trail

```python
from crdt_merge import merge_with_provenance

merged, log = merge_with_provenance(
    df_a, df_b,
    key="id",
    schema=my_schema,          # Optional MergeSchema
    timestamp_col=None,        # Optional timestamp column
    as_dataframe=False,        # Set True to get pandas DataFrame output
)

# Inspect the audit trail
print(log.summary())           # Human-readable summary
print(log.total_conflicts)     # Number of field-level conflicts resolved
for entry in log.entries:
    print(entry.field, entry.winner, entry.strategy)

# Export
from crdt_merge import export_provenance
json_str = export_provenance(log, format="json")  # Returns string
csv_str = export_provenance(log, format="csv")     # Returns string
log_dict = log.to_dict()                           # Returns dict
```

### `merge_stream()` — Streaming Merge

```python
from crdt_merge import merge_stream, StreamStats

stats = StreamStats()
for batch in merge_stream(
    source_a,                # Iterable of dicts (streamed)
    source_b,                # Iterable of dicts (loaded into memory)
    key="id",
    batch_size=5000,
    schema=my_schema,        # Optional MergeSchema
    prefer="b",              # Optional prefer shorthand
    timestamp_col=None,
    stats=stats,
):
    process(batch)

print(f"{stats.rows_per_second:.0f} rows/s, {stats.batch_count} batches")
```

**Memory:** O(|source_b| + batch_size). Loads source_b fully for key lookup, streams source_a in batches.

### `merge_sorted_stream()` — True O(1) Memory Merge

```python
from crdt_merge import merge_sorted_stream

for batch in merge_sorted_stream(
    sorted_source_a,          # MUST be sorted by key ascending
    sorted_source_b,          # MUST be sorted by key ascending
    key="id",
    batch_size=5000,
    schema=my_schema,
    verify_order=True,        # Raises ValueError if sources aren't sorted
):
    process(batch)
```

**Memory:** O(batch_size). Reads one row at a time from each source and buffers only the current output batch. Tested to 100M rows at 10.8 MB.
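As an illustration of the ordering contract (a sketch; the exact error message is not specified):

```python
from crdt_merge import merge_sorted_stream

unsorted_src = [{"id": 3, "v": "x"}, {"id": 1, "v": "y"}]   # out of order
sorted_src = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]

try:
    for batch in merge_sorted_stream(unsorted_src, sorted_src, key="id",
                                     batch_size=1000, verify_order=True):
        pass
except ValueError as exc:
    print(f"rejected unsorted input: {exc}")
```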

### Composable Strategies

```python
from crdt_merge.strategies import (
    MergeSchema, LWW, MaxWins, MinWins, UnionSet,
    Priority, Concat, LongestWins, Custom
)

# Declare per-field strategies
schema = MergeSchema(
    default=LWW(),                                    # Fallback for unspecified fields
    score=MaxWins(),                                   # Highest value wins
    rating=MinWins(),                                  # Lowest value wins
    tags=UnionSet(delimiter=";"),                       # Set union of delimited values
    status=Priority(order=["draft", "review", "live"]), # Priority ranking
    notes=Concat(delimiter="\n"),                       # Concatenate with dedup
    title=LongestWins(),                               # Longer string wins
    custom_field=Custom(fn=my_merge_fn),               # Your own function
)

# Apply to any merge function
result = merge(df_a, df_b, key="id", schema=schema)
merged, log = merge_with_provenance(df_a, df_b, key="id", schema=schema)
for batch in merge_stream(src_a, src_b, key="id", schema=schema):
    ...

# Serialize for storage/transmission
d = schema.to_dict()    # → {"__default__": {"strategy": "LWW"}, "score": {"strategy": "MaxWins"}, ...}
restored = MergeSchema.from_dict(d)
```

**Note:** `Custom(fn)` strategies cannot be serialized — `to_dict()` stores `{"strategy": "Custom"}` and `from_dict()` raises `ValueError` to prevent silent behavior change.
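A minimal `Custom(fn)` sketch — the averaging resolver and its two-argument call signature are assumptions, and note the round-trip restriction above:

```python
from crdt_merge import merge
from crdt_merge.strategies import MergeSchema, LWW, Custom

def mean_of_values(a, b):
    # Hypothetical resolver: average the two conflicting numeric values
    return (a + b) / 2

schema = MergeSchema(default=LWW(), confidence=Custom(fn=mean_of_values))

rows_a = [{"id": 1, "confidence": 0.6}]
rows_b = [{"id": 1, "confidence": 0.8}]
result = merge(rows_a, rows_b, key="id", schema=schema)   # confidence → 0.7 (custom mean)

# Serialization fails closed for Custom strategies
d = schema.to_dict()                                      # stores only {"strategy": "Custom"}
try:
    MergeSchema.from_dict(d)
except ValueError:
    print("re-attach Custom functions in code after deserialization")
```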

### Delta Sync

```python
from crdt_merge.delta import compute_delta, apply_delta, compose_deltas, DeltaStore

# Compute what changed between two versions
delta = compute_delta(old_records, new_records, key="id")
print(delta.size)  # Number of changes

# Apply delta to bring a remote node up to date
updated = apply_delta(remote_records, delta, key="id")

# Compose multiple deltas: delta(v1→v2) ⊔ delta(v2→v3) = delta(v1→v3)
combined = compose_deltas(delta_1, delta_2, key="id")

# DeltaStore tracks versions automatically
store = DeltaStore(key="id", node_id="node_a")
delta = store.ingest(new_records)  # Returns delta from previous state
print(store.version, store.size)
```

**Note:** DeltaStore is in-memory. Use `Delta.to_dict()` / `Delta.from_dict()` to persist deltas externally.
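A persistence sketch using the documented `to_dict()` / `from_dict()` round-trip — the `Delta` import location, file path, and JSON-serializability of the delta dict are assumptions:

```python
import json
from crdt_merge.delta import Delta, compute_delta, apply_delta

delta = compute_delta(old_records, new_records, key="id")

# Persist the delta outside the process
with open("delta_0001.json", "w") as fh:
    json.dump(delta.to_dict(), fh)

# Later, possibly on another node: restore and apply
with open("delta_0001.json") as fh:
    restored = Delta.from_dict(json.load(fh))
updated = apply_delta(remote_records, restored, key="id")
```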

### Binary Wire Format

```python
from crdt_merge import serialize, deserialize, peek_type, wire_size, serialize_batch, deserialize_batch

# Serialize any CRDT type to compact binary
gc = GCounter("node1")
gc.increment("node1", 100)
wire_bytes = serialize(gc, compress=True)   # zlib compression
restored = deserialize(wire_bytes)          # → GCounter with value 100

# Inspect without deserializing
type_name = peek_type(wire_bytes)           # → "g_counter"
info = wire_size(wire_bytes)                # → {total_bytes, header_bytes, payload_bytes, ...}

# Batch serialize/deserialize
batch_bytes = serialize_batch([gc, pn, lww])
objects = deserialize_batch(batch_bytes)    # → [GCounter, PNCounter, LWWRegister]
```

**Supported types:** GCounter, PNCounter, LWWRegister, ORSet, LWWMap, MergeableHLL, MergeableBloom, MergeableCMS, Delta.

**Wire format specification:** Deterministic byte layout with 12-byte header (magic, version, type, flags, length). Any language implementation that speaks this format can interoperate.

### Probabilistic CRDTs

```python
from crdt_merge import MergeableHLL, MergeableBloom, MergeableCMS

# HyperLogLog — estimate unique counts across distributed nodes
hll_a = MergeableHLL(precision=14)
hll_a.add_all(user_ids_node_a)
hll_b = MergeableHLL(precision=14)
hll_b.add_all(user_ids_node_b)
merged = hll_a.merge(hll_b)       # register-max merge
print(f"~{merged.cardinality():.0f} unique users (±0.81%)")

# Bloom filter — distributed membership testing
bloom_a = MergeableBloom(capacity=1_000_000, fp_rate=0.01)
bloom_a.add("blocked_ip")
bloom_b = MergeableBloom(capacity=1_000_000, fp_rate=0.01)
bloom_b.add("another_blocked_ip")
merged = bloom_a.merge(bloom_b)    # bitwise-OR merge

# Count-Min Sketch — distributed frequency estimation
cms_a = MergeableCMS(width=1000, depth=5)
cms_a.add("event_type", count=3)
cms_b = MergeableCMS(width=1000, depth=5)
cms_b.add("event_type", count=1)
merged = cms_a.merge(cms_b)       # element-wise max merge
```

All three structures satisfy CRDT merge properties and can be serialized via the wire format.
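For instance, a hedged sketch of the wire round-trip for an HLL (the exact type tag string is an assumption):

```python
from crdt_merge import MergeableHLL, serialize, deserialize, peek_type

hll = MergeableHLL(precision=14)
hll.add_all([f"user_{i}" for i in range(100_000)])

wire_bytes = serialize(hll, compress=True)
print(peek_type(wire_bytes))                 # e.g. "mergeable_hll" — tag name assumed

restored = deserialize(wire_bytes)
# The estimate survives the round-trip (±~0.81% at precision=14)
assert abs(restored.cardinality() - hll.cardinality()) < 1e-6
```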

### Verified Merge

```python
from crdt_merge import verified_merge

@verified_merge(samples=100, key="id")
def my_merge(a, b, key="id"):
    return merge(a, b, key=key)

# Calling my_merge() automatically verifies:
# - Commutativity: my_merge(a, b) == my_merge(b, a)
# - Associativity: my_merge(my_merge(a, b), c) == my_merge(a, my_merge(b, c))
# - Idempotency: my_merge(a, a) == a
# Raises CRDTVerificationError if any property fails.
```
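And the failing case, as a sketch assuming `CRDTVerificationError` is importable from the top-level package and that verification runs on call:

```python
from crdt_merge import verified_merge, CRDTVerificationError

@verified_merge(samples=20, key="id")
def broken_merge(a, b, key="id"):
    return list(a)   # not commutative: ignores b entirely

try:
    broken_merge([{"id": 1, "v": "left"}], [{"id": 1, "v": "right"}])
except CRDTVerificationError as exc:
    print(f"property violation detected: {exc}")
```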

### JSON Deep Merge

```python
from crdt_merge import merge_dicts, merge_json_lines

# Deep merge two dicts with LWW semantics
merged = merge_dicts(
    {"user": {"name": "Alice", "score": 80}},
    {"user": {"name": "Alice", "score": 95, "level": 5}},
)
# → {"user": {"name": "Alice", "score": 95, "level": 5}}

# None values in B are treated as missing (A's value preserved)
merged = merge_dicts({"x": 10}, {"x": None})
# → {"x": 10}

# Merge JSONL files line by line
merged_lines = merge_json_lines(jsonl_a, jsonl_b, key="id")
```

### Deduplication

```python
from crdt_merge import dedup, dedup_records, DedupIndex, MinHashDedup

# Exact dedup on strings
unique = dedup(["hello", "world", "hello"])

# Fuzzy dedup on records
unique_records = dedup_records(records, key="title", threshold=0.85)

# MinHash for large-scale approximate dedup
mh = MinHashDedup(num_hashes=128, threshold=0.5)
unique = mh.dedup(items, text_fn=lambda x: x["description"])

# DedupIndex — CRDT-mergeable dedup state
idx_a = DedupIndex("node_a")
idx_a.add_exact("item1")
idx_b = DedupIndex("node_b")
idx_b.add_exact("item2")
merged_idx = idx_a.merge(idx_b)
```

---

## Benchmarks — A100 Stress Tests

All benchmarks run on **NVIDIA A100-SXM4-40GB** (83.5 GB RAM, 12 vCPUs) via Google Colab. **78 benchmarks, all passed.**

Full data: [`benchmarks/a100_v060/`](benchmarks/a100_v060/) · Reproduction notebook: [`notebooks/v060_a100_stress_test.ipynb`](notebooks/v060_a100_stress_test.ipynb)

### v0.6.0 — Throughput Ceilings (A100)

| Operation | Peak Throughput | Scale Tested | Scaling |
|-----------|----------------|-------------|---------|
| GCounter increment | **4.14M ops/s** | 10K → 500K | ✅ Flat |
| VectorClock ops | **1.06M ops/s** | 100K → 2M | ✅ Flat |
| Streaming merge | **594K rows/s** | 50K → 1M | 🟡 17% degradation |
| JSON merge (dicts) | **530K ops/s** | 10K → 500K | 🟡 Graceful |
| Gossip updates | **474K ops/s** | 10K → 200K | 🟡 State growth |
| JSON lines merge | **456K ops/s** | 10K → 200K | ✅ Nearly flat |
| HyperLogLog add | **433K ops/s** | 100K → 2M | ✅ Flat |
| Schema evolution | **443K ops/s** | 1K → 20K cols | ✅ Stable |
| Dedup strings | **333K ops/s** | 100K → 2M | 🟡 18% degradation |
| Bloom filter add | **178K ops/s** | 100K → 2M | ✅ Flat |
| Wire serialize batch | **170K ops/s** | 1K → 50K | ✅ Flat |
| MergeSchema merge | **149K rows/s** | 10K → 200K | ✅ Improves at scale |
| Merkle tree build | **138K records/s** | 50K → 1M | 🟡 22% degradation |
| Provenance merge | **81K rows/s** | 50K → 500K | ✅ Improves at scale |
| **DataFrame merge** | **77K rows/s** | **100K → 10M** | ✅ **2% degradation** |

![Throughput Scaling Grid](benchmarks/a100_v060/throughput_grid.png)

### Arrow vs Pandas Merge Performance

| Rows | Arrow | Pandas | Speedup |
|------|-------|--------|---------|
| 500,000 | 2.1s | 5.3s | **2.53×** |
| 5,000,000 | 21.3s | 54.1s | **2.55×** |
| 50,000,000 | 222.5s | 550.6s | **2.47×** |

**Consistent 2.5× speedup** at all scales. Arrow optimizes columnar data movement; per-field strategy resolution remains in Python. An end-to-end 10–50× speedup would require pushing strategy resolution into native code (planned for the Rust protocol engine).

![Arrow vs Pandas](benchmarks/a100_v060/arrow_vs_pandas.png)

### Key Findings

- **GCounter and VectorClock** are the fastest primitives — over 1M ops/s with zero degradation at scale
- **DataFrame merge** is rock-solid: 77K→75K rows/s from 100K to 10M rows (2% degradation across 100× scale)
- **MergeSchema and Provenance** actually *improve* at scale as fixed overhead amortizes
- **Wire protocol**: All 14 CRDT types serialize/deserialize correctly (50 B to 8.2 KB per type)
- **CRDT verification**: All 5 tested types pass commutativity, associativity, and idempotency laws
- **100-node gossip**: Converges in 1 round
- **Parallel merge note**: Python GIL + multiprocessing overhead makes parallel slower than sequential for in-process merges. Parallel shines for I/O-bound and multi-source workloads.

### Previous Versions

- **v0.3.0** — Core merge 39–42K rows/s; `merge_sorted_stream()` 1.2M rows/s at O(1) memory (173 measurements).
- **v0.4.0** — Provenance & verification (55 measurements).
- **v0.5.0** — Wire format & probabilistic CRDTs (50 measurements).

**Notebooks:** All available in [`notebooks/`](notebooks/) for independent reproduction on Google Colab.

---

## Cross-Language Ports

crdt-merge follows a **reference + protocol** architecture:

| Language | Package | Version | Status |
|----------|---------|---------|--------|
| **Python** (reference) | [crdt-merge](https://pypi.org/project/crdt-merge/) | v0.7.0 | ✅ Full feature set + 8 accelerators |
| TypeScript | [crdt-merge](https://www.npmjs.com/package/crdt-merge) | v0.2.0 | Core CRDTs + merge |
| Rust | [crdt-merge](https://crates.io/crates/crdt-merge) | v0.2.0 | Core CRDTs + merge |
| Java | [crdt-merge](https://github.com/mgillr/crdt-merge-java) | v0.2.0 | Source complete |

**Architecture:** Python is the reference implementation where all innovation starts. The Rust crate will become a thin protocol engine (~1,000 lines) implementing wire format + merge semantics. FFI wrappers (Go, C#, Swift) will wrap the protocol engine. The golden rule: `Python serialize → Rust deserialize → merge → serialize → Python deserialize` must roundtrip perfectly.
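The Python side of that golden rule can already be exercised as a pure-Python round-trip (the Rust hop is the pending piece):

```python
from crdt_merge import GCounter, serialize, deserialize

original = GCounter("node_a")
original.increment("node_a", 42)

wire_bytes = serialize(original)        # → hand these bytes to any port
roundtripped = deserialize(wire_bytes)  # ← and read them back

assert roundtripped.value == original.value
# The spec's deterministic byte layout implies serialize(roundtripped) == wire_bytes
```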

---

## Roadmap

### Released

| Version | Name | Highlights |
|---------|------|-----------|
| v0.1.0 | Initial Release | Core CRDTs, merge, diff, dedup, JSON merge, HuggingFace integration |
| v0.2.0 | License Update | Apache-2.0 + patent protection |
| v0.3.0 | The Schema Release | MergeSchema DSL, 8 strategies, streaming merge, delta sync |
| v0.4.0 | The Audit Release | Provenance tracking, @verified_merge, streaming optimizations |
| v0.5.0 | The Protocol Release | Binary wire format, probabilistic CRDTs (HLL, Bloom, CMS) |
| v0.6.0 | The Architecture Release | HLC clocks, schema evolution, Merkle trees, Arrow-native merge, gossip protocol, async/parallel merge, multi-key composites |
| v0.7.0 | The Ecosystem Release | MergeQL, self-merging Parquet, conflict visualization, 8 ecosystem accelerators (DuckDB, dbt, Polars, Arrow Flight, Airbyte, SQLite, Streamlit, DuckLake) |

### Upcoming

| Version | Name | Key Features |
|---------|------|-------------|
| **v0.8.0** | The AI Release | ModelCRDT — AI model merging with 25 strategies (TIES/DARE/SLERP/LoRA), provenance tracking, evolutionary merge |
| **v0.9.0** | The Compliance Release | UnmergeEngine — reversible CRDT merge for GDPR erasure, parallel merge |
| **v1.0.0** | The Platform Release | API freeze, cross-language port sync, full documentation, production certification |

**Full roadmap:** [`docs/roadmap/roadmap_v2_0.md`](docs/roadmap/roadmap_v2_0.md)

---

## Known Limitations

Honest constraints of the library, with their current status:

| Limitation | Details | Status |
|-----------|---------|--------|
| **Python dict merge path** | `merge()` converts DataFrames to list-of-dicts internally. Slow for >1M rows. | ✅ Resolved in v0.6.0 — Arrow-native engine (2.5× measured on A100). MergeQL + DuckDB UDF in v0.7.0 for SQL-native path. |
| **No type system** | Strategies operate on `Any`. No type checking during merge. | ✅ Resolved in v0.6.0 — Schema evolution with column mapping + type coercion |
| **Single-threaded** | All operations are synchronous, single-threaded Python. | ✅ Resolved in v0.6.0 — Async wrappers + parallel merge execution |
| **Single key column** | `merge()` supports one key column only. Composite keys require manual concatenation. | ✅ Resolved in v0.6.0 — Multi-key composite merges |
| **No persistence** | DeltaStore is in-memory. State lost on process exit. Use `to_dict()`/`from_dict()` to serialize externally. | By design — persistence belongs in the application layer |
| **No networking** | Gossip protocol provides state tracking but no built-in transport. Wire format enables interop. | By design — transport belongs in the application layer |
| **O(n²) fuzzy dedup** | `_fuzzy_dedup_records` compares every record against all existing records. Unusable above ~10K records. | Algorithmic improvement planned |
| **Wire format doesn't include MergeSchema** | You can serialize CRDTs and Deltas, but not MergeSchema over the wire. | Planned for future release |

---

## Installation

```bash
# Core — zero dependencies
pip install crdt-merge

# With optional dependencies for heavy workloads
pip install crdt-merge[fast]       # polars + orjson + xxhash
pip install crdt-merge[pandas]     # pandas DataFrame support
pip install crdt-merge[polars]     # Polars DataFrame support
pip install crdt-merge[datasets]   # HuggingFace Datasets
pip install crdt-merge[all]        # Everything
pip install crdt-merge[dev]        # pytest + hypothesis for development

# Ecosystem accelerators (optional — each wraps an external tool)
pip install crdt-merge[duckdb]     # DuckDB UDF + MergeQL
pip install crdt-merge[dbt]        # dbt CRDT models
pip install crdt-merge[flight]     # Arrow Flight merge server
pip install crdt-merge[airbyte]    # Airbyte destination connector
pip install crdt-merge[streamlit]  # Visual merge UI
pip install crdt-merge[sqlite]     # SQLite extension
```

**Requirements:** Python 3.9+. No required dependencies. Works on Linux, macOS, Windows.

---

## Test Results

**1,114 passing tests across 37 test files, plus 3 expected failures (module count assertions) and 0 actual failures.**

| Test File | Tests | Status |
|-----------|------:|:------:|
| test_core.py | 30 | ✅ |
| test_dataframe.py | 14 | ✅ |
| test_dedup.py | 15 | ✅ |
| test_json_merge.py | 10 | ✅ |
| test_strategies.py | 39 | ✅ |
| test_streaming.py | 19 | ✅ |
| test_provenance.py | 24 | ✅ |
| test_verified_merge.py | 10 | ✅ |
| test_wire.py | 40 | ✅ |
| test_probabilistic.py | 42 | ✅ |
| test_v050_integration.py | 157 | ✅ |
| test_longest_wins.py | 11 | ✅ |
| test_stress_v030.py | 8 | ✅ |
| test_benchmark.py | 6 | ✅ |
| test_clocks.py | 40 | ✅ |
| test_schema_evolution.py | 40 | ✅ |
| test_merkle.py | 40 | ✅ |
| test_arrow.py | 40 | ✅ |
| test_gossip.py | 40 | ✅ |
| test_async_merge.py | 40 | ✅ |
| test_parallel.py | 40 | ✅ |
| test_v060_integration.py | 18 | ✅ |
| test_architect_360_validation.py | 5 | ✅ |
| test_mergeql.py | 34 | ✅ |
| test_parquet.py | 32 | ✅ |
| test_viz.py | 16 | ✅ |
| test_wire_v070.py | 35 | ✅ |
| test_accelerator_duckdb.py | 34 | ✅ |
| test_accelerator_dbt.py | 42 | ✅ |
| test_accelerator_ducklake.py | 38 | ✅ |
| test_accelerator_polars.py | 36 | ✅ |
| test_accelerator_flight.py | 43 | ✅ |
| test_accelerator_airbyte.py | 47 | ✅ |
| test_accelerator_sqlite.py | 44 | ✅ |
| test_accelerator_streamlit.py | 38 | ✅ |
| test_multi_key.py | 8 | ✅ |

**Version history:**

| Version | Tests | Growth |
|---------|------:|--------|
| v0.1.0 | 45 | — |
| v0.2.0 | 88 | +43 |
| v0.3.0 | 133 | +45 |
| v0.4.0 | 277 | +144 |
| v0.5.0 | 425 | +148 |
| v0.6.0 | 720 | +295 |
| v0.7.0 | 1,114 | +394 |

Full details: [TEST_RESULTS.md](TEST_RESULTS.md)

---

## Project Stats

| Metric | Value |
|--------|-------|
| Core modules | 23 |
| Ecosystem accelerators | 8 |
| Source lines | ~17,200 |
| Test files | 37 |
| Tests passing | 1,114 |
| Test lines | ~12,500 |
| Test:source ratio | ~0.73:1 |
| Dependencies | 0 (required) |
| Python versions | 3.9, 3.10, 3.11, 3.12 |
| License | Apache-2.0 |

---

## Contributing

We welcome contributions. Please see [CLA.md](CLA.md) before submitting pull requests.

**Commercial licensing & inquiries:**
- rgillespie83@icloud.com
- data@optitransfer.ch

---

## License

Copyright 2026 Ryan Gillespie

Licensed under the Apache License, Version 2.0. See [LICENSE](LICENSE) for the full text.

Patent grant included — see [PATENTS](PATENTS).
