Metadata-Version: 2.4
Name: diffino-cli
Version: 0.4.1
Summary: Declarative data diff engine for tables, powered by Polars. Output Excel, HTML, or Typst PDF.
License-Expression: MIT
Project-URL: Repository, https://codeberg.org/songwupei/diffino
Keywords: diff,excel,csv,comparison,polars,openpyxl
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Office/Business :: Financial :: Spreadsheet
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Utilities
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: polars>=0.20.0
Requires-Dist: openpyxl>=3.1.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: typer>=0.9.0
Requires-Dist: duckdb>=0.10.0
Requires-Dist: python-docx>=1.1.0
Dynamic: license-file

# diffino

Declarative data diff engine for tables and documents, powered by Polars. Compare Excel, CSV, Parquet, DuckDB, or DOCX files and generate detailed reports with character-level inline diffs.

Output formats: Excel, HTML, Typst PDF, DOCX track-changes, Changelog.

Supports cumulative changelog generation across multiple versions.

## Installation

```bash
pip install diffino-cli
```

## Quick Start

1. Prepare a config file:

```yaml
sources:
  left:
    type: excel
    path: data/v0.2.5.xlsx
  right:
    type: excel
    path: data/v0.2.6.xlsx
    version: "0.2.6"

compare:
  - left_sheet: Sheet1
    key_columns:
      - ID
    ignore_columns:
      - Notes

output:
  project: 我的项目   # "My Project"
  formats:
    - excel
    - changelog
  changelog:
    split: true
  report_dir: ./diffs
```

2. Run diffino:

```bash
diffino run config.yaml
```

3. Or generate the changelog on its own from previously saved reports:

```bash
diffino changelog generate --input-dir ./diffs --releases releases.yaml --split
```

## Features

- **Multi-format sources**: Excel, CSV, Parquet, DuckDB, DOCX
- **Key-based or fingerprint matching**: Compare by composite keys or full-row hashes
- **Column preprocessing**: Decimal rounding, text normalization, case sensitivity control
- **Character-level inline diff**: Red strikethrough for deleted text, green bold for inserted text
- **DOCX paragraph diff**: Diff DOCX body text paragraph-by-paragraph with inline diffs
- **Five output formats**:
  - **Excel**: Side-by-side old/new rows, yellow-highlighted changed cells with rich-text inline diffs; `final`, `side_by_side`, `track` styles
  - **HTML**: Self-contained report with `<del>`/`<ins>` tags and JS filtering
  - **Typst**: Typst PDF with cover page, colored tables, character-level inline diffs
  - **DOCX track-changes**: Native Word revision tracking (`<w:ins>`/`<w:del>`) — in-place for DOCX sources, generated table for others
  - **Changelog**: Cumulative changelog Typst files (`_summary.typ` + `_detail.typ`) — auto-generated with each run, includes the current report plus all previously saved reports in the same directory
- **Typst cover page**: Configurable project name; the cover title is rendered as `{{PROJECT}}对比报告` ("{{PROJECT}} Comparison Report")
- **DiffReport persistence**: Auto-save JSON reports (`{old}__{new}.json`) for changelog accumulation
- **Version auto-detection**: Parses `name-vX.Y.Z.ext` patterns, with manual override in YAML
- **Changelog generation**: `diffino changelog generate` — version summary table + detailed per-version diffs, with `--split` to separate summary/detail into two files
- **Release date config**: `releases.yaml` maps versions to release dates for changelog display
- **Parallel processing**: ThreadPoolExecutor with configurable `max_workers` for multi-sheet comparisons
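
The character-level inline diff listed above can be illustrated with Python's standard `difflib`. This is a minimal sketch, not diffino's actual implementation: the function name `inline_diff_html` is hypothetical, and it assumes that deleted text is wrapped in `<del>` and inserted text in `<ins>`, as the HTML output format describes.

```python
import difflib
from html import escape


def inline_diff_html(old: str, new: str) -> str:
    """Return HTML marking deleted text with <del> and inserted text with <ins>."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            parts.append(escape(old[i1:i2]))
            continue
        # A "replace" opcode contributes both a deletion and an insertion.
        if tag in ("delete", "replace"):
            parts.append(f"<del>{escape(old[i1:i2])}</del>")
        if tag in ("insert", "replace"):
            parts.append(f"<ins>{escape(new[j1:j2])}</ins>")
    return "".join(parts)


print(inline_diff_html("cat", "cart"))  # ca<ins>r</ins>t
```

The same opcode stream could drive other renderers, for example red strikethrough and green bold runs in a rich-text Excel cell.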

## CLI Commands

### `diffino run`

```bash
diffino run config.yaml               # Run comparison
```

Add `changelog` to `formats` to generate cumulative changelog Typst files with each run:

```yaml
output:
  formats:
    - excel
    - changelog        # Auto-generates changelog_summary.typ + changelog_detail.typ
  changelog:
    path: changelog.typ     # default
    split: true             # default: true
    summary_keep: 3         # default: 3
    max_summary_items: 3    # default: 3
    releases: releases.yaml # release dates config
```

When `changelog` is in `formats`, `save_report` is implied — the DiffReport JSON is always saved.

### `diffino validate`

```bash
diffino validate config.yaml          # Validate config only
```

### `diffino changelog generate`

```bash
# Generate changelog.typ from saved diff reports.
#   --input-dir          Directory of diff JSON files
#   --output             Output Typst file
#   --releases           Release dates config
#   --summary-keep       Versions shown in summary (default: 3)
#   --max-summary-items  Max entries per version (default: 3)
#   --split              Split into _summary.typ + _detail.typ
diffino changelog generate \
  --input-dir ./diffs \
  --output changelog.typ \
  --releases releases.yaml \
  --summary-keep 3 \
  --max-summary-items 3 \
  --split

## Source Types

| Type | Key config fields |
|---|---|
| `excel` | `path` |
| `csv` | `path` |
| `parquet` | `path` |
| `duckdb` | `database`, `query` |
| `docx` | `path` |

### DOCX mode

When `sources.*.type` is `docx`, each `left_sheet` / `right_sheet` value selects a table by its caption (exact or fuzzy match) or by its 1-based index. Use `content: paragraphs` in a compare unit to diff document body text instead of tables:

```yaml
compare:
  # Table diff by caption
  - left_sheet: 表1-客户列表   # "Table 1 - Customer List"
    key_columns:
      - 客户ID               # "Customer ID"
  # Paragraph diff
  - content: paragraphs
```
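
One way to picture the three lookup modes (exact caption, fuzzy caption, 1-based index) is the sketch below. It is illustrative only: the function `resolve_table`, the order in which the modes are tried, and the `cutoff` similarity threshold are all assumptions, not diffino's documented behavior.

```python
import difflib


def resolve_table(selector, captions, cutoff=0.6):
    """Return the position in `captions` that `selector` refers to, or None."""
    # 1) Exact caption match.
    if selector in captions:
        return captions.index(selector)
    # 2) 1-based numeric index, e.g. 2 or "2" selects the second table.
    if str(selector).isdecimal():
        position = int(selector)
        return position - 1 if 1 <= position <= len(captions) else None
    # 3) Fuzzy match: the closest caption above the similarity cutoff.
    close = difflib.get_close_matches(str(selector), captions, n=1, cutoff=cutoff)
    return captions.index(close[0]) if close else None


captions = ["Table 1 - Customers", "Table 2 - Orders"]
print(resolve_table("Table 1 Customers", captions))  # fuzzy match selects the first table
```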

## Configuration Reference

See `config.example.yaml` for a complete example.

| Section | Field | Description |
|---|---|---|
| `sources.left/right` | `type` | Source type: `excel`, `csv`, `parquet`, `duckdb`, `docx` |
| `sources.left/right` | `version` | Manual version override; if omitted, the version is parsed from the filename |
| `compare[]` | `left_sheet` / `right_sheet` | Sheet names (or DOCX table captions); `right_sheet` defaults to `left_sheet` |
| `compare[]` | `content` | Set to `paragraphs` for DOCX body text diff |
| `compare[]` | `key_columns` | Column names used for row matching |
| `compare[]` | `fingerprint` | Use full-row hash instead of key columns |
| `compare[]` | `ignore_columns` | Columns to exclude from comparison |
| `compare[]` | `column_rules` | Preprocessing rules (`decimal`, `text`) |
| `compare[]` | `label` | Human-readable label for this comparison unit |
| `output` | `project` | Project name for Typst cover (default: `数据`, "Data") |
| `output` | `title` | Report title (default: `更新说明`, "Release Notes") |
| `output` | `formats` | List of: `excel`, `html`, `typst`, `docx_track`, `changelog` |
| `output` | `save_report` | Persist DiffReport as JSON for changelog |
| `output` | `report_dir` | Directory for saved reports (default: `./diffs`) |
| `output` | `max_workers` | Thread pool size (default: 4) |
| `output` | `release_date` | Override release date (ISO format) |
| `output` | `releases` | Inline releases config (alternative to file) |
| `output.excel` | `path` | Output Excel path |
| `output.excel` | `style` | `track`, `final`, or `side_by_side` |
| `output.html` | `path` | Output HTML path |
| `output.typst` | `path` | Output Typst path |
| `output.typst` | `template` | Custom Typst template path |
| `output.docx_track` | `path` | Output DOCX path |
| `output.changelog` | `path` | Output Typst path (default: `changelog.typ`) |
| `output.changelog` | `split` | Split into `_summary.typ` + `_detail.typ` (default: `true`) |
| `output.changelog` | `summary_keep` | Versions in summary table (default: 3) |
| `output.changelog` | `max_summary_items` | Max entries per version (default: 3) |
| `output.changelog` | `releases` | Path to releases config (default: `releases.yaml`) |
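
To make the difference between `key_columns` and `fingerprint` concrete, here is a small stdlib-only sketch. It is not diffino's Polars-based engine: the function names, the SHA-256 hash, and the dict-per-row representation are assumptions chosen for illustration.

```python
import hashlib


def row_fingerprint(row: dict, ignore: frozenset = frozenset()) -> str:
    """Hash every non-ignored column value into one stable digest."""
    items = sorted((k, str(v)) for k, v in row.items() if k not in ignore)
    return hashlib.sha256(repr(items).encode("utf-8")).hexdigest()


def diff_by_key(left, right, key_columns, ignore_columns=()):
    """Classify rows as added, deleted, or modified using composite keys."""
    ignore = frozenset(ignore_columns)

    def index(rows):
        return {tuple(r[c] for c in key_columns): r for r in rows}

    old, new = index(left), index(right)
    return {
        "added": [k for k in new if k not in old],
        "deleted": [k for k in old if k not in new],
        "modified": [
            k for k in old
            if k in new
            and row_fingerprint(old[k], ignore) != row_fingerprint(new[k], ignore)
        ],
    }


def diff_by_fingerprint(left, right, ignore_columns=()):
    """Without keys, a changed row appears as one deletion plus one addition."""
    ignore = frozenset(ignore_columns)
    old = {row_fingerprint(r, ignore) for r in left}
    new = {row_fingerprint(r, ignore) for r in right}
    return {"added": len(new - old), "deleted": len(old - new)}
```

In this sketch, key-based matching can report a row as modified because the key ties the old and new versions together, while fingerprint matching only knows whether an identical row exists on the other side.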

### Column preprocessing rules

```yaml
column_rules:
  - column: 金额            # "Amount"
    type: decimal
    precision: 2
  - column: 名称            # "Name"
    type: text
    normalize_whitespace: true
    case_sensitive: false
```
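
The effect of such rules can be sketched as a normalization step applied to each cell before comparison. The helper `apply_rule` below is hypothetical, and the rounding mode (half-up) and whitespace handling are assumptions; diffino's own semantics may differ.

```python
import re
from decimal import Decimal, ROUND_HALF_UP


def apply_rule(value, rule: dict):
    """Normalize one cell value according to a single column rule."""
    if value is None:
        return None
    if rule["type"] == "decimal":
        # Quantize to `precision` decimal places, e.g. precision=2 -> 0.01.
        quantum = Decimal(1).scaleb(-rule.get("precision", 2))
        return Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP)
    if rule["type"] == "text":
        text = str(value)
        if rule.get("normalize_whitespace"):
            text = re.sub(r"\s+", " ", text).strip()
        if not rule.get("case_sensitive", True):
            text = text.casefold()
        return text
    raise ValueError(f"unknown rule type: {rule['type']!r}")


amount_rule = {"type": "decimal", "precision": 2}
# Values that differ only beyond two decimal places compare as equal.
print(apply_rule("10.0", amount_rule) == apply_rule("10.001", amount_rule))  # True
```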

### Releases config (`releases.yaml`)

Two formats are supported:

**With project name** (recommended):

```yaml
name: 穿透监管规则明细表   # "Look-Through Supervision Rules Detail Table"
releases:
  - version: "v1.0"
    date: 2026-02-13
  - version: "v1.1"
    date: 2026-05-19
```

**Legacy flat list**:

```yaml
releases:
  - version: "0.2.5"
    date: 2026-05-19
  - version: "0.2.6"
    date: 2026-05-20
```

Releases config can also be specified inline under `output.releases` in the main config.
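
As a rough sketch of how a releases config might be consumed, the hypothetical helper below turns the parsed mapping into a version-to-date lookup. It assumes the file has already been loaded into a dict, for example with PyYAML's `yaml.safe_load`, which parses unquoted ISO dates into `datetime.date` objects.

```python
from datetime import date


def release_dates(config: dict) -> dict:
    """Map each version string to its release date."""
    dates = {}
    for entry in config.get("releases", []):
        value = entry["date"]
        # Unquoted YAML dates arrive as date objects; quoted ones stay strings.
        if not isinstance(value, date):
            value = date.fromisoformat(value)
        dates[str(entry["version"])] = value
    return dates


parsed = {
    "name": "demo",  # optional project name
    "releases": [
        {"version": "v1.0", "date": date(2026, 2, 13)},
        {"version": "v1.1", "date": "2026-05-19"},
    ],
}
print(release_dates(parsed)["v1.1"])  # 2026-05-19
```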

## Changelog

### Via `diffino run` (recommended)

Add `changelog` to `output.formats`. DiffReport JSONs are auto-saved to `output.report_dir` (default `./diffs`) as `{old_version}__{new_version}.json`. After each run, all JSON reports in the directory are loaded and cumulative `changelog_summary.typ` + `changelog_detail.typ` are generated.

```yaml
output:
  formats:
    - changelog
  changelog:
    split: true
```
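
The `{old_version}__{new_version}.json` naming scheme makes saved reports easy to enumerate. The sketch below shows one way to list them; the helper names are hypothetical, and it sorts by filename only, so it makes no claim about how diffino itself orders versions.

```python
from pathlib import Path


def parse_report_name(filename: str):
    """Split '<old>__<new>.json' into (old, new); return None if it does not match."""
    path = Path(filename)
    if path.suffix != ".json":
        return None
    old, sep, new = path.stem.partition("__")
    return (old, new) if sep and old and new else None


def list_report_pairs(report_dir):
    """Return (old_version, new_version, path) for each saved report file."""
    pairs = []
    for path in sorted(Path(report_dir).glob("*.json")):
        versions = parse_report_name(path.name)
        if versions:  # skip JSON files that do not follow the naming scheme
            pairs.append((*versions, path))
    return pairs


print(parse_report_name("0.2.5__0.2.6.json"))  # ('0.2.5', '0.2.6')
```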

### Via `diffino changelog generate` (standalone)

```bash
diffino changelog generate --input-dir ./diffs --split
```

### Changelog output

- **Summary** (`_summary.typ`): Version table listing each version, date, and notable change counts
- **Detail** (`_detail.typ`): Per-version sections showing every added/deleted/modified row with old/new value inline diffs

Without `--split` (or `split: false` in config), the output is a single `changelog.typ`.

## License

MIT
