Metadata-Version: 2.4
Name: rapidiff
Version: 0.3.0
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: POSIX
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Programming Language :: Rust
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing
Requires-Dist: typing-extensions
Requires-Dist: pytest ; extra == 'dev'
Requires-Dist: pytest-benchmark ; extra == 'dev'
Requires-Dist: hyperfine ; extra == 'dev'
Requires-Dist: hypothesis ; extra == 'dev'
Requires-Dist: mypy ; extra == 'dev'
Requires-Dist: ruff ; extra == 'dev'
Requires-Dist: pre-commit ; extra == 'dev'
Provides-Extra: dev
License-File: LICENSE
Summary: Python wrapper for the similar Rust diffing library
Keywords: diff,text,difflib,myers,patience
Author-email: Vladimir Gurevich <imvladikon@gmail.com>
License: MIT
Requires-Python: >=3.10
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Bug Tracker, https://github.com/imvladikon/rapidiff/issues
Project-URL: Documentation, https://github.com/imvladikon/rapidiff#readme
Project-URL: Homepage, https://github.com/imvladikon/rapidiff
Project-URL: Repository, https://github.com/imvladikon/rapidiff

# rapidiff

A Python diff library backed by Rust. It provides a `difflib.SequenceMatcher`-style API with faster line, character, and word comparisons for common Python workloads.

## Features

- **Multiple algorithms**: Myers, Patience, LCS, and Google diff-match-patch
- **Fast Python API**: public Python calls benchmark faster than `difflib` on the included correctness-gated scenarios
- **Multiple tokenization modes**: lines, words, characters, graphemes, unicode words
- **Structured spans**: `get_diff()` returns old/new spans with source-text positions and original text slices
- **Difflib compatibility checks**: opcodes, ratios, unified/context diffs, and sequence setters are covered by regression tests
- **Rust 2024 extension**: built with current PyO3 and maturin

## Installation

### From PyPI

```bash
pip install rapidiff
```

### From Source

Requirements:
- Python 3.10+
- Rust 1.95+ recommended
- maturin

```bash
git clone https://github.com/imvladikon/rapidiff
cd rapidiff

python -m venv venv
./venv/bin/python -m pip install "maturin[patchelf]" pytest hypothesis

VIRTUAL_ENV="$PWD/venv" ./venv/bin/python -m maturin develop --release

# Or build a wheel.
./venv/bin/python -m maturin build --release
```

## Quick Start

### Basic Usage

```python
from rapidiff import SequenceMatcher

sm = SequenceMatcher(
    a="Hello World\nLine 2",
    b="Hallo Welt\nLine 2",
    algorithm="myers",
    mode="chars",
)

print(f"Similarity: {sm.ratio():.2%}")

for opcode in sm.get_opcodes():
    print(opcode)
```
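
Each opcode printed above is assumed to follow the `difflib` convention: a `(tag, i1, i2, j1, j2)` tuple meaning that `a[i1:i2]` maps to `b[j1:j2]`. The sketch below uses the standard library's `difflib.SequenceMatcher` as a stand-in so it runs without the extension installed; `describe` and `apply_opcodes` are hypothetical helpers for illustration, not part of `rapidiff`.

```python
from difflib import SequenceMatcher  # stdlib stand-in for rapidiff.SequenceMatcher

a = "Hello World"
b = "Hallo Welt"

opcodes = SequenceMatcher(None, a, b, autojunk=False).get_opcodes()


def describe(opcodes, a, b):
    """Render (tag, i1, i2, j1, j2) tuples as readable lines."""
    return [
        f"{tag:7s} a[{i1}:{i2}]={a[i1:i2]!r} -> b[{j1}:{j2}]={b[j1:j2]!r}"
        for tag, i1, i2, j1, j2 in opcodes
    ]


def apply_opcodes(opcodes, a, b):
    """Rebuild b by copying equal runs from a and changed runs from b."""
    parts = []
    for tag, i1, i2, j1, j2 in opcodes:
        # For 'delete', b[j1:j2] is empty, so nothing is appended.
        parts.append(a[i1:i2] if tag == "equal" else b[j1:j2])
    return "".join(parts)


report = describe(opcodes, a, b)
print("\n".join(report))
```

Rebuilding `b` from the opcodes is a quick sanity check that the tuples cover both inputs from start to end without gaps.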

### Algorithm Comparison

```python
from rapidiff import SequenceMatcher

old_text = "The quick brown fox jumps over the lazy dog"
new_text = "The fast brown fox leaps over the lazy cat"

algorithms = ['myers', 'patience', 'lcs', 'google']

for algorithm in algorithms:
    sm = SequenceMatcher(a=old_text, b=new_text, algorithm=algorithm, mode='words')
    print(f"{algorithm:8s}: {sm.ratio():.3f}")
```

### Different Tokenization Modes

```python
from rapidiff import SequenceMatcher

text1 = "Hello world! 🌍"
text2 = "Hello earth! 🌎"

modes = ['chars', 'words', 'unicode_words', 'graphemes']

for mode in modes:
    sm = SequenceMatcher(a=text1, b=text2, algorithm='myers', mode=mode)
    print(f"{mode:12s}: {sm.ratio():.3f}")
```

### Large Text Performance

```python
from rapidiff import SequenceMatcher
import time

# Load large texts (e.g., from Project Gutenberg)
with open('war_and_peace.txt', 'r', encoding='utf-8') as f:
    text1 = f.read()
with open('war_and_peace_modified.txt', 'r', encoding='utf-8') as f:
    text2 = f.read()

# perf_counter() is a monotonic, high-resolution clock suited to timing
start = time.perf_counter()
sm = SequenceMatcher(a=text1, b=text2, algorithm='myers', mode='lines')
ratio = sm.ratio()
elapsed = time.perf_counter() - start

print(f"Processed {len(text1)} chars in {elapsed:.2f}s")
print(f"Similarity: {ratio:.1%}")
```

## Advanced Usage

### Working with Different Text Types

```python
from rapidiff import SequenceMatcher

# Code diffing
code1 = "def hello():\n    print('world')"
code2 = "def hello():\n    print('universe')"
sm = SequenceMatcher(a=code1, b=code2, algorithm='patience', mode='lines')

# Document diffing
doc1 = "The cat sat on the mat."
doc2 = "A cat was sitting on the rug."
sm = SequenceMatcher(a=doc1, b=doc2, algorithm='google', mode='words')

# Character-level analysis
text1 = "café"
text2 = "coffee"
sm = SequenceMatcher(a=text1, b=text2, algorithm='myers', mode='graphemes')
```

### API Compatibility

```python
from rapidiff import SequenceMatcher, unified_diff

sm = SequenceMatcher(None, "", "", algorithm="myers", mode="chars")

sm.set_seq1("hello")
sm.set_seq2("world")
sm.set_seqs("hello", "hello")

print(sm.ratio())
print(sm.quick_ratio())
print(sm.real_quick_ratio())

diff_results = sm.get_diff()
for result in diff_results:
    print(f"Old spans: {len(result.old_spans)}")
    print(f"New spans: {len(result.new_spans)}")

old_lines = ["one\n", "two\n"]
new_lines = ["one\n", "2\n"]
print("".join(unified_diff(old_lines, new_lines)))
```

## Algorithms & Modes

### Available Algorithms

| Algorithm | Backend | Rust crate | Best For |
|-----------|---------|------------|----------|
| `myers` | Myers diff through `similar::TextDiff` / `capture_diff_slices` | [`similar`](https://crates.io/crates/similar) ([docs.rs](https://docs.rs/similar/latest/similar/)) | Default general-purpose text diffing |
| `patience` | Patience diff through `similar::TextDiff` / `capture_diff_slices` | [`similar`](https://crates.io/crates/similar) ([docs.rs](https://docs.rs/similar/latest/similar/)) | Code or documents with many unique stable tokens |
| `lcs` | Longest Common Subsequence through `similar::TextDiff` / `capture_diff_slices` | [`similar`](https://crates.io/crates/similar) ([docs.rs](https://docs.rs/similar/latest/similar/)) | Simple sequence similarity and compatibility checks |
| `google`, `google_efficient` | diff-match-patch efficient implementation | [`diff-match-patch-rs`](https://crates.io/crates/diff-match-patch-rs) ([docs.rs](https://docs.rs/diff-match-patch-rs/latest/diff_match_patch_rs/)) | Web/content-style text with flexible cleanup heuristics |

The Python extension layer is built with [`PyO3`](https://crates.io/crates/pyo3) and [`maturin`](https://github.com/PyO3/maturin). Public Python calls use the Rust extension module; the benchmark scripts report timings through that Python API rather than timing Rust functions directly.

For `google`, non-ASCII inputs fall back to the `similar` Myers backend internally so Unicode spans and opcodes remain valid Python strings. `rapidiff` also contains an internal `wagner_fisher` implementation for reference tests and diagnostics. It is intentionally not documented as a public algorithm because it uses dynamic programming and is not intended for large user inputs.
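
To illustrate why a dynamic-programming approach is kept away from large inputs, here is a generic textbook sketch of the Wagner-Fischer edit distance. It is not `rapidiff`'s internal code, and the function name `edit_distance` is chosen for this example only.

```python
def edit_distance(a, b):
    """Wagner-Fischer edit distance, keeping only two rows of the DP table."""
    previous = list(range(len(b) + 1))  # distance from empty prefix of a
    for i, item_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete item_a
                    current[j - 1] + 1,      # insert item_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))
```

Every cell of a `len(a) x len(b)` table is visited, so two inputs of 100,000 tokens each would need on the order of 10^10 cell updates.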

### Available Modes

| Mode | Description | Use Case |
|------|-------------|----------|
| `lines` | Split by newlines | File/document comparison |
| `words` | Split by whitespace | Text content analysis |
| `chars` | Character by character | Detailed text analysis |
| `graphemes` | Unicode grapheme clusters | International text |
| `unicode_words` | Whitespace word mode kept for API compatibility | Multi-language content |
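
As a rough mental model of the modes in the table, the sketch below tokenizes one string three ways with plain Python. `tokenize` is a hypothetical helper, and the exact token boundaries used by the Rust backend (for example, whether whitespace is kept as separate tokens) are an assumption here, not something this README specifies.

```python
import unicodedata


def tokenize(text, mode):
    """Hypothetical helper: split text into comparison units for a mode."""
    if mode == "lines":
        return text.splitlines(keepends=True)  # keep line endings on tokens
    if mode == "words":
        return text.split()  # split on runs of whitespace
    if mode == "chars":
        return list(text)  # one token per Unicode code point
    raise ValueError(f"mode not covered by this sketch: {mode!r}")


sample = "one two\nthree\n"
for mode in ("lines", "words", "chars"):
    print(mode, tokenize(sample, mode))

# Code points are not always user-perceived characters: the decomposed
# form of "café" has five code points, which is why a grapheme mode exists.
composed = "caf\u00e9"
decomposed = unicodedata.normalize("NFD", composed)
print(len(tokenize(composed, "chars")), len(tokenize(decomposed, "chars")))
```

Grapheme segmentation itself needs Unicode segmentation rules that the standard library does not provide, so it is left out of the sketch.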

## Testing

Run the comprehensive test suite:

```bash
VIRTUAL_ENV="$PWD/venv" ./venv/bin/python -m maturin develop --release
./venv/bin/python -m pytest tests -q
cargo clippy -- -D warnings
```

Current QA includes:
- **Core functionality**: all algorithms and modes
- **Difflib compatibility**: ratios, opcodes, grouped opcodes, matching blocks, unified diff, context diff
- **Span invariants**: `get_diff()` spans must match source-text slices by their positions
- **Tokenization edge cases**: whitespace, tabs, Unicode, emojis, line endings
- **Large text stress**: Project Gutenberg-style text with tail insertions
- **Edge cases**: corruption scenarios, paraphrasing, empty sequences
- **Property-based testing**: Hypothesis-powered ratio and span invariants
- **Examples validation**: example code runs without errors
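
The span invariant in the list above can be pictured with a small self-contained check. The `Span` dataclass and its `start`, `end`, and `text` fields are hypothetical names for this sketch; the real objects returned by `get_diff()` may be shaped differently.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical span shape: a half-open [start, end) range plus its text."""
    start: int
    end: int
    text: str


def word_spans(source):
    """Locate each run of non-whitespace and record where it sits in source."""
    return [Span(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", source)]


def check_span_invariant(source, spans):
    """True when every span's text equals the source slice at its position."""
    return all(source[s.start:s.end] == s.text for s in spans)


source = "alpha\t beta   gamma"  # tab and repeated spaces stay in the source
spans = word_spans(source)
print(spans)
print(check_span_invariant(source, spans))
```

Because positions index into the untouched source string, tabs and repeated spaces between words are never lost, even though they are not part of any word token.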

## Performance

Run the Python-level benchmark:

```bash
./venv/bin/python scripts/benchmark_rapidiff.py
```

The benchmark first verifies that `rapidiff` and `difflib.SequenceMatcher(...)` produce matching ratios and opcodes for each scenario. It then reports median timings and IQR. Latest local run on CPython 3.12.3:

| Scenario | Action | rapidiff | difflib | Speedup |
|----------|--------|----------|---------|---------|
| lines-2k-sparse | ratio | 1.425ms | 29.281ms | 20.55x |
| lines-2k-sparse | opcodes | 1.510ms | 28.906ms | 19.14x |
| lines-2k-sparse | ratio+opcodes | 1.524ms | 29.110ms | 19.10x |
| chars-3k-unique | ratio | 0.746ms | 13.890ms | 18.62x |
| chars-3k-unique | opcodes | 0.765ms | 14.018ms | 18.32x |
| chars-3k-unique | ratio+opcodes | 0.998ms | 13.890ms | 13.92x |
| words-4k-unique | ratio | 6.443ms | 37.737ms | 5.86x |
| words-4k-unique | opcodes | 6.520ms | 37.609ms | 5.77x |
| words-4k-unique | ratio+opcodes | 6.626ms | 37.591ms | 5.67x |

The benchmark measures the public Python API, including matcher creation. `ratio+opcodes` benefits from cached comparison data inside `SequenceMatcher`.
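
A minimal sketch of a correctness-gated timing loop in the same spirit is shown below. It is not the code in `scripts/benchmark_rapidiff.py`; `time_call` and `gated_benchmark` are hypothetical helpers, and both callables use `difflib` so the example runs without the extension installed.

```python
import statistics
import time
from difflib import SequenceMatcher


def time_call(func, repeats=15):
    """Return (median, IQR) of wall-clock seconds over `repeats` calls."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    q1, _, q3 = statistics.quantiles(samples, n=4)
    return statistics.median(samples), q3 - q1


def gated_benchmark(candidate, reference, repeats=15):
    """Time both callables, but only if they return identical results."""
    if candidate() != reference():
        raise ValueError("candidate and reference disagree; refusing to time")
    cand_median, cand_iqr = time_call(candidate, repeats)
    ref_median, ref_iqr = time_call(reference, repeats)
    speedup = ref_median / cand_median if cand_median > 0 else float("inf")
    return {
        "candidate": (cand_median, cand_iqr),
        "reference": (ref_median, ref_iqr),
        "speedup": speedup,
    }


old = [f"line {i}\n" for i in range(200)]
new = old[:100] + ["changed\n"] + old[101:]


def reference():
    # Matcher creation is inside the timed call, as in the table above.
    return SequenceMatcher(None, old, new, autojunk=False).get_opcodes()


# Stand-in: replace with a callable that builds a rapidiff matcher instead.
candidate = reference

result = gated_benchmark(candidate, reference, repeats=5)
print(f"speedup: {result['speedup']:.2f}x")
```

Reporting the median with the IQR keeps a single slow run, such as one interrupted by the OS scheduler, from distorting the comparison.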

For scaling checks, run:

```bash
./venv/bin/python scripts/benchmark_scaling.py
```

This writes `docs/performance_scaling.md` and `docs/performance_scaling.svg`. The plot uses median Python-level `ratio+opcodes` time on the X axis and sequence length on the Y axis; length is line count for `lines`, character count for `chars`, and word count for `words`. Color groups each mode, while line style and marker shape distinguish `rapidiff` from builtin `difflib`. The timing script correctness-checks every scenario against `difflib` before measuring.

![rapidiff vs difflib scaling](docs/performance_scaling.svg)

Latest local scaling run:

| Mode | Length | rapidiff median | difflib median | Speedup |
|------|--------|-----------------|----------------|---------|
| lines | 250 | 0.203ms | 2.148ms | 10.56x |
| lines | 500 | 0.377ms | 4.635ms | 12.28x |
| lines | 1,000 | 0.640ms | 9.385ms | 14.68x |
| lines | 2,000 | 1.834ms | 18.858ms | 10.28x |
| lines | 4,000 | 3.079ms | 38.686ms | 12.56x |
| chars | 500 | 0.199ms | 2.231ms | 11.20x |
| chars | 1,000 | 0.380ms | 4.342ms | 11.42x |
| chars | 2,000 | 0.754ms | 9.016ms | 11.95x |
| chars | 4,000 | 1.499ms | 18.307ms | 12.21x |
| chars | 8,000 | 3.217ms | 37.182ms | 11.56x |
| words | 500 | 0.851ms | 4.163ms | 4.89x |
| words | 1,000 | 1.563ms | 9.242ms | 5.91x |
| words | 2,000 | 2.994ms | 18.000ms | 6.01x |
| words | 4,000 | 5.957ms | 37.064ms | 6.22x |
| words | 8,000 | 12.121ms | 76.059ms | 6.27x |

## Fixed Issues

This version fixes several critical issues in operation detection, ratio computation, and span/offset calculations:

- **Operation detection**: correctly identifies replace/insert/delete operations
- **Similarity ratios**: stay within the 0.0-1.0 range
- **Span calculations**: span text is extracted from the original source text by position
- **Whitespace handling**: word spans preserve original tabs and repeated spaces
- **Unicode handling**: supports international text, emoji, and grapheme mode
- **Large text handling**: stress-tested on a War and Peace-sized text with tail-only changes

## API Reference

### SequenceMatcher Class

```python
class SequenceMatcher:
    def __init__(
        self,
        isjunk=None,
        a: str = "",
        b: str = "",
        autojunk: bool = True,
        algorithm: str = "myers",
        mode: str = "lines",
    ):
        """Initialize sequence matcher."""
        
    def ratio(self) -> float:
        """Return similarity ratio between 0.0 and 1.0."""
        
    def quick_ratio(self) -> float:
        """Return upper bound estimate of ratio."""
        
    def get_opcodes(self) -> list[tuple[str, int, int, int, int]]:
        """Return list of diff operations."""
        
    def get_diff(self) -> list[DiffResult]:
        """Return structured diff results with spans."""
        
    def set_seq1(self, a: str) -> None:
        """Set first sequence."""
        
    def set_seq2(self, b: str) -> None:
        """Set second sequence."""
        
    def set_seqs(self, a: str, b: str) -> None:
        """Set both sequences."""
```
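
The docstrings above do not spell out how the ratio is computed. `difflib` defines it as `2.0 * M / T`, where `M` is the number of matched elements and `T` is the combined length of both sequences. The sketch below reproduces that formula with the standard library; that `rapidiff` uses the same definition is an assumption based on the difflib compatibility tests mentioned earlier, and `ratio_from_blocks` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def ratio_from_blocks(a, b):
    """Compute 2*M/T from matching blocks, mirroring difflib's ratio()."""
    sm = SequenceMatcher(None, a, b, autojunk=False)
    matches = sum(block.size for block in sm.get_matching_blocks())
    total = len(a) + len(b)
    # difflib reports 1.0 when both sequences are empty.
    return 2.0 * matches / total if total else 1.0


sm = SequenceMatcher(None, "abcd", "bcde", autojunk=False)
print(ratio_from_blocks("abcd", "bcde"), sm.ratio(), sm.quick_ratio())
```

Since `M` can never exceed either sequence length, the result always falls between 0.0 and 1.0, and `quick_ratio()` is an upper bound on it.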

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Run tests (`python -m pytest tests/ -v`)
4. Commit changes (`git commit -m 'Add amazing feature'`)
5. Push to branch (`git push origin feature/amazing-feature`)
6. Open a Pull Request

## Project Status

- **Version**: 0.3.0
- **Status**: Release candidate
- **Tests**: 85 tests passing, 6 benchmark/hyperfine tests skipped when unavailable
- **Installation**: Local wheel and editable maturin install verified
- **Performance**: Benchmarked against Python's `difflib`
- **Documentation**: README, benchmark script, examples, and usage guides
- **Compatibility**: Python 3.10+ on Linux, macOS, Windows

