Metadata-Version: 2.4
Name: phani-data-recon
Version: 1.0.7
Summary: Phani's Generic Data Reconciliation and Deduplication Utility
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: duckdb>=0.10
Requires-Dist: pandas>=2.0
Requires-Dist: openpyxl>=3.1
Requires-Dist: jinja2>=3.1
Requires-Dist: pyyaml>=6.0
Requires-Dist: jsonschema>=4.0
Requires-Dist: ruamel.yaml>=0.17
Requires-Dist: rapidfuzz>=3.0
Requires-Dist: great-expectations>=0.18
Requires-Dist: rich>=13.0

# SAP ↔ Salesforce Data Reconciliation Utility

Reconcile SAP and Salesforce master data at bulk scale (300K–400K records).
Produces a 10-tab Excel workbook and an HTML dashboard with KPIs, field-level diffs,
fuzzy match candidates, and a prioritised action plan.

Supports multiple entity pairs through config, including:
- Accounts
- Orders
- Order Items
- Any custom SAP ↔ SF dataset with mapped keys and fields

## Quick Start

### Installation

```bash
# Install from PyPI
py -m pip install phani-data-recon

# Upgrade to the latest version
py -m pip install --upgrade phani-data-recon

# Verify installed version
py -m pip show phani-data-recon
```

### Run the Application

**Using the CLI command:**

```bash
# Verify the CLI is available
reconcile-accounts --help

# Run with explicit SAP and Salesforce input files
reconcile-accounts --sap input/sap_accounts.csv --sf input/sf_accounts.csv
```

**Using the module form (alternative if the CLI is not on PATH):**

```bash
py -m phani_data_recon.cli --sap input/sap_accounts.csv --sf input/sf_accounts.csv
```

### Configuration & Input/Output

**Configure input and output folders via config file:**

Edit `config/rules.yaml` to set default paths:

```yaml
input:
  sap:
    directory: "input"
    file_name: "sap_accounts.csv"
  sf:
    directory: "input"
    file_name: "sf_accounts.csv"

output:
  formats: ["excel", "html"]
  report:
    directory: "output"
    file_name: "reconciliation_report"
```

Then run with config:

```bash
reconcile-accounts --config config/rules.yaml
py -m phani_data_recon.cli --config config/rules.yaml
```

**Override output directory at runtime:**

```bash
reconcile-accounts --config config/rules.yaml --output-dir output/custom_run
py -m phani_data_recon.cli --config config/rules.yaml --output-dir output/custom_run
```

### Validation & Advanced Options

**Validate headers and config only (dry-run):**

```bash
reconcile-accounts --sap input/sap_accounts.csv --sf input/sf_accounts.csv --dry-run
py -m phani_data_recon.cli --sap input/sap_accounts.csv --sf input/sf_accounts.csv --dry-run
```
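As a rough illustration of what a header-only check involves, the sketch below reads just the first row of a CSV and reports which required columns are missing. The function name `missing_columns` and the required-column list are hypothetical and not part of the package's API; the real `--dry-run` may validate more than this.

```python
import csv
import io


def missing_columns(csv_text: str, required: list[str]) -> list[str]:
    """Return the required column names that are absent from the CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# Hypothetical SAP extract whose header lacks the e-mail column.
sample = "kunnr,name1\n1001,Acme GmbH\n"
print(missing_columns(sample, ["kunnr", "name1", "smtp_addr"]))
```

Only the header row is consumed, so a check like this stays cheap even for files with hundreds of thousands of rows.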

**Generate only HTML output:**

```bash
reconcile-accounts --sap input/sap_accounts.csv --sf input/sf_accounts.csv --formats html
py -m phani_data_recon.cli --sap input/sap_accounts.csv --sf input/sf_accounts.csv --formats html
```

**Skip fuzzy matching (faster for large files):**

```bash
reconcile-accounts --sap input/sap_accounts.csv --sf input/sf_accounts.csv --no-fuzzy
py -m phani_data_recon.cli --sap input/sap_accounts.csv --sf input/sf_accounts.csv --no-fuzzy
```

**Enable verbose logging:**

```bash
reconcile-accounts --sap input/sap_accounts.csv --sf input/sf_accounts.csv --verbose
py -m phani_data_recon.cli --sap input/sap_accounts.csv --sf input/sf_accounts.csv --verbose
```

### Standalone Dedup Command (Separate from Reconciliation)

Use this command when you only want deduplication and do not want to run the full reconciliation flow. It works for Accounts, Orders, Order Items, and any other entity pair configured in YAML.

You can choose dedup logic per run:
- `primary`: dedup by primary key column
- `fuzzy`: dedup by similarity on one or more fields passed as CLI params

How fields are resolved for each mode:
- `--dedup-mode primary` uses `--sap-primary-key` / `--sf-primary-key` when passed; otherwise it uses `join.primary.sap_col` / `join.primary.sf_col` from config.
- `--dedup-mode fuzzy` uses `--sap-fuzzy-field` / `--sf-fuzzy-field` when passed; otherwise it uses the config defaults: `fuzzy_match.fields[].sap_col` for SAP and `sf_fuzzy_dedup.fields[].col` for SF.
- If fuzzy mode is selected and no fuzzy fields resolve for the selected entity, the run fails fast with a clear error.
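The resolution rules above can be sketched roughly as follows. This is an illustration only: `resolve_dedup_fields` and the shape of the `cli` dictionary are hypothetical, and the package's internal implementation may differ. Only the config keys are taken from this README.

```python
def resolve_dedup_fields(mode: str, system: str, cli: dict, config: dict) -> list[str]:
    """Pick dedup columns: CLI values win, config supplies the defaults."""
    if mode == "primary":
        override = cli.get(f"{system}_primary_key")
        if override:
            return [override]
        return [config["join"]["primary"][f"{system}_col"]]

    if mode == "fuzzy":
        fields = list(cli.get(f"{system}_fuzzy_field") or [])
        if not fields:
            if system == "sap":
                entries = config.get("fuzzy_match", {}).get("fields", [])
                fields = [e["sap_col"] for e in entries if e.get("sap_col")]
            else:
                entries = config.get("sf_fuzzy_dedup", {}).get("fields", [])
                fields = [e["col"] for e in entries if e.get("col")]
        if not fields:
            # Fail fast instead of silently deduplicating on nothing.
            raise ValueError(f"no fuzzy fields resolved for system '{system}'")
        return fields

    raise ValueError(f"unknown dedup mode: {mode}")


# Sample config fragment using keys described in this README.
config = {
    "join": {"primary": {"sap_col": "SAP_Unique_ID", "sf_col": "BP_PowerCerv_Account_Id__c"}},
    "fuzzy_match": {"fields": [{"sap_col": "name1"}]},
}
print(resolve_dedup_fields("primary", "sap", {}, config))
print(resolve_dedup_fields("fuzzy", "sf", {"sf_fuzzy_field": ["Name"]}, config))
```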

```bash
# Accounts (default rules.yaml)
dedup-records --system sap --config config/rules.yaml --output-dir output/dedup
dedup-records --system sf  --config config/rules.yaml --output-dir output/dedup

# Orders
dedup-records --system sap --config config/rules.orders.example.yaml --output-dir output/dedup
dedup-records --system sf  --config config/rules.orders.example.yaml --output-dir output/dedup

# Order Items
dedup-records --system sap --config config/rules.order_items.example.yaml --output-dir output/dedup
dedup-records --system sf  --config config/rules.order_items.example.yaml --output-dir output/dedup

# Deduplicate only one side
dedup-records --system sap --sap input/sap_orders.csv --config config/rules.orders.example.yaml --output-dir output/dedup
dedup-records --system sf  --sf  input/sf_order_items.csv --config config/rules.order_items.example.yaml --output-dir output/dedup

# Fuzzy dedup on Accounts with explicit SF fields
dedup-records --system sf --dedup-mode fuzzy --sf-fuzzy-field Name --sf-fuzzy-field WC_Email__c --fuzzy-min-score 85 --config config/rules.yaml --output-dir output/dedup

# Fuzzy dedup on Accounts with explicit SAP fields
dedup-records --system sap --dedup-mode fuzzy --sap-fuzzy-field name1 --sap-fuzzy-field smtp_addr --fuzzy-min-score 85 --config config/rules.yaml --output-dir output/dedup

# Fuzzy dedup on Orders with SAP fields and weighted match mode
dedup-records --system sap --dedup-mode fuzzy --sap-fuzzy-field Order_Number --sap-fuzzy-field Customer_Name --fuzzy-min-score 85 --fuzzy-match-mode weighted --config config/rules.orders.example.yaml --output-dir output/dedup

# Fuzzy dedup on Orders with SF fields and weighted match mode
dedup-records --system sf --dedup-mode fuzzy --sf-fuzzy-field OrderNumber --sf-fuzzy-field AccountName --fuzzy-min-score 85 --fuzzy-match-mode weighted --config config/rules.orders.example.yaml --output-dir output/dedup

# Module form (if console script is not on PATH)
py -m phani_data_recon.dedup_cli --system sf --sf input/sf_accounts.csv --config config/rules.yaml --output-dir output/dedup
```

Output files are generated as:
- `output/dedup/<entity>_sap_deduped_<run_id>.csv`
- `output/dedup/<entity>_sap_duplicates_<run_id>.csv`
- `output/dedup/<entity>_sf_deduped_<run_id>.csv`
- `output/dedup/<entity>_sf_duplicates_<run_id>.csv`

Where `<entity>` comes from `entities.pair` in config (slug format), for example:
- `accounts_sap_deduped_<run_id>.csv`
- `orders_sf_duplicates_<run_id>.csv`
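The naming pattern can be reproduced with a few lines of Python. This is a sketch under assumptions: the exact slug rule and the `run_id` format are not documented here, so `slugify` (lower-case, non-alphanumeric runs collapsed to underscores) and the sample run ID are illustrative guesses.

```python
import re


def slugify(label: str) -> str:
    """Lower-case a label and collapse non-alphanumeric runs to underscores (assumed rule)."""
    return re.sub(r"[^a-z0-9]+", "_", label.lower()).strip("_")


def dedup_output_names(pair: str, system: str, run_id: str) -> dict[str, str]:
    """Build the two dedup file names for one system ('sap' or 'sf')."""
    entity = slugify(pair)
    return {
        kind: f"{entity}_{system}_{kind}_{run_id}.csv"
        for kind in ("deduped", "duplicates")
    }


# "20260511_101500" is a made-up run ID used only for illustration.
print(dedup_output_names("Order Items", "sf", "20260511_101500"))
```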

### Entity Command Cookbook

Use this section as a direct command reference for the two primary commands:

- Reconciliation command: `reconcile-accounts`
- Dedup-only command: `dedup-records`

#### Account Reconciliation + Dedup

```bash
# Reconciliation (Accounts)
reconcile-accounts --config config/rules.yaml

# Reconciliation with join-key overrides from CLI (also updates rules.yaml)
reconcile-accounts --config config/rules.yaml --sap-primary-key kunnr --sf-primary-key BP_PowerCerv_Account_Id__c --sap-fallback-key SAP_Unique_ID --sf-fallback-key WC_SAP_Identification__c --enable-fallback-key

# Dedup only (Accounts)
dedup-records --system sap --config config/rules.yaml --output-dir output/dedup/accounts
dedup-records --system sf  --config config/rules.yaml --output-dir output/dedup/accounts

# Dedup (Accounts) using fuzzy mode with explicit SF fields
dedup-records --system sf --dedup-mode fuzzy --sf-fuzzy-field Name --sf-fuzzy-field WC_Email__c --fuzzy-min-score 85 --fuzzy-match-mode any --config config/rules.yaml --output-dir output/dedup/accounts
```

#### Order Reconciliation + Dedup

```bash
# Reconciliation (Orders)
reconcile-accounts --config config/rules.orders.example.yaml

# Dedup only (Orders)
dedup-records --system sap --config config/rules.orders.example.yaml --output-dir output/dedup/orders
dedup-records --system sf  --config config/rules.orders.example.yaml --output-dir output/dedup/orders

# Dedup (Orders) using primary-key override
dedup-records --system sap --dedup-mode primary --sap-primary-key SAP_Order_Id --config config/rules.orders.example.yaml --output-dir output/dedup/orders
dedup-records --system sf  --dedup-mode primary --sf-primary-key External_Order_Id__c --config config/rules.orders.example.yaml --output-dir output/dedup/orders

# Dedup (Orders) using fuzzy fields passed per entity
dedup-records --system sap --dedup-mode fuzzy --sap-fuzzy-field Order_Number --sap-fuzzy-field Customer_Name --fuzzy-min-score 85 --config config/rules.orders.example.yaml --output-dir output/dedup/orders
dedup-records --system sf  --dedup-mode fuzzy --sf-fuzzy-field OrderNumber --sf-fuzzy-field AccountName --fuzzy-min-score 85 --config config/rules.orders.example.yaml --output-dir output/dedup/orders
```

#### Order Item Reconciliation + Dedup

```bash
# Reconciliation (Order Items)
reconcile-accounts --config config/rules.order_items.example.yaml

# Dedup only (Order Items)
dedup-records --system sap --config config/rules.order_items.example.yaml --output-dir output/dedup/order_items
dedup-records --system sf  --config config/rules.order_items.example.yaml --output-dir output/dedup/order_items

# Dedup (Order Items) using fuzzy fields passed per entity
dedup-records --system sap --dedup-mode fuzzy --sap-fuzzy-field Material --sap-fuzzy-field Item_Description --fuzzy-min-score 85 --config config/rules.order_items.example.yaml --output-dir output/dedup/order_items
dedup-records --system sf  --dedup-mode fuzzy --sf-fuzzy-field ProductCode --sf-fuzzy-field Description --fuzzy-min-score 85 --config config/rules.order_items.example.yaml --output-dir output/dedup/order_items
```

#### Module Form (No PATH dependency)

```bash
# Reconciliation command module form
py -m phani_data_recon.cli --config config/rules.orders.example.yaml

# Dedup command module form
py -m phani_data_recon.dedup_cli --system sf --config config/rules.order_items.example.yaml --output-dir output/dedup/order_items
```

Platform-specific path examples:

Windows:

```powershell
reconcile-accounts --sap .\input\sap_accounts.csv --sf .\input\sf_accounts.csv
py -m phani_data_recon.cli --config .\config\rules.yaml --dry-run
```

macOS:

```bash
reconcile-accounts --sap ./input/sap_accounts.csv --sf ./input/sf_accounts.csv
python3 -m phani_data_recon.cli --config ./config/rules.yaml --dry-run
```

If Windows `cmd` does not recognize `reconcile-accounts`, add your Python `Scripts` directory to PATH and reopen `cmd`. The path below is an example; substitute the `Scripts` directory of your own Python installation:

```cmd
setx PATH "%PATH%;%USERPROFILE%\AppData\Local\Python\pythoncore-3.14-64\Scripts"
```

Then verify:

```cmd
where reconcile-accounts
reconcile-accounts --help
```

For local development in this repository, an editable install still works:

```bash
py -m pip install -e .
```

## Package Usage

```bash
# Explicit input files
reconcile-accounts --sap input/sap_accounts.csv --sf input/sf_accounts.csv

# Config-driven execution
reconcile-accounts --config config/rules.yaml

# Override only the output directory
reconcile-accounts --config config/rules.yaml --output-dir output/run_2026_05_11

# Generate only HTML output
reconcile-accounts --sap input/sap_accounts.csv --sf input/sf_accounts.csv --formats html
```

If the console script is not available on your PATH, use:

```bash
py -m phani_data_recon.cli --dry-run
```

Production verification on Windows `cmd`:

```cmd
py -m pip show phani-data-recon
where reconcile-accounts
where dedup-records
py -m phani_data_recon.cli --sap input/sap_accounts.csv --sf input/sf_accounts.csv --dry-run
```

Expected state:
- Installed version should be `1.0.4` or later.
- If `where reconcile-accounts` is empty but module execution works, only PATH needs to be fixed.

### Python API

Run reconciliation from another Python application:

```python
from phani_data_recon.api import run_reconciliation

exit_code = run_reconciliation(
    sap="input/sap_accounts.csv",
    sf="input/sf_accounts.csv",
    config="config/rules.yaml",
    output_dir="output/api_run",
    formats=["excel", "html"],
    dry_run=False,
    no_fuzzy=False,
    verbose=True,
)

print(exit_code)
```

The API mirrors the CLI behavior and returns a process-style exit code.
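When embedding the call in a larger job, it can help to turn a failing exit code into an exception. The wrapper below is a minimal sketch; it assumes the usual process convention that `0` means success, which this README does not state explicitly. `run_or_raise` and `fake_runner` are hypothetical names; in real use, pass `run_reconciliation` as the runner.

```python
from typing import Callable


def run_or_raise(runner: Callable[..., int], **kwargs) -> None:
    """Call a reconciliation runner and raise if it reports a non-zero exit code."""
    exit_code = runner(**kwargs)
    if exit_code != 0:
        raise RuntimeError(f"reconciliation failed with exit code {exit_code}")


# Stand-in runner so the sketch runs without the package installed.
def fake_runner(**kwargs) -> int:
    return 0 if kwargs.get("sap") and kwargs.get("sf") else 2


run_or_raise(fake_runner, sap="input/sap_accounts.csv", sf="input/sf_accounts.csv")
print("reconciliation finished")
```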

## Options

```
--sap                   Path to SAP accounts CSV (optional if config input.sap is set)
--sf                    Path to Salesforce accounts CSV (optional if config input.sf is set)
--config                Path to rules YAML (default: ./config/rules.yaml, then packaged default)
--output-dir            Output directory (default: from config)
--formats               Output formats to write: excel html (default: both)
--sap-primary-key       Override SAP primary join key (join.primary.sap_col)
--sf-primary-key        Override SF primary join key (join.primary.sf_col)
--sap-fallback-key      Override SAP fallback key (join.fallback.sap_col)
--sf-fallback-key       Override SF fallback key (join.fallback.sf_col)
--enable-fallback-key   Enable fallback key matching
--disable-fallback-key  Disable fallback key matching
--dry-run               Validate config + headers only; no report written
--no-fuzzy              Skip fuzzy matching (faster for large files)
--verbose               Verbose logging
```

Resolution precedence for paths, join keys, and config:
- If `--sap` / `--sf` are passed, CLI values are used.
- If not passed, values are resolved from `config/rules.yaml` under `input.sap` and `input.sf`.
- If join-key override args are passed (`--sap-primary-key`, `--sf-primary-key`, fallback key args), CLI values are used at runtime.
- When join-key override args are passed, `rules.yaml` is also updated with those keys.
- If `--config` is not passed, the CLI tries local `config/rules.yaml` first and then the packaged default config.
- If `--output-dir` is passed, it overrides `output.report.directory`.
- If neither CLI nor config provides paths, the run exits with an input path error.

Example key override run:

```bash
reconcile-accounts --config config/rules.yaml --sap-primary-key kunnr --sf-primary-key BP_PowerCerv_Account_Id__c --sap-fallback-key SAP_Unique_ID --sf-fallback-key WC_SAP_Identification__c --enable-fallback-key
```

Standalone dedup command options:

```
dedup-records --help

# Common options
--system sap|sf
--dedup-mode primary|fuzzy
--sap-primary-key <column>
--sf-primary-key <column>
--sap-fuzzy-field <column>   # repeat for multiple fields
--sf-fuzzy-field <column>    # repeat for multiple fields
--fuzzy-min-score <0-100>
--fuzzy-match-mode weighted|any
```

Dedup mode tips:
- Use `primary` for deterministic duplicate detection by business key.
- Use `fuzzy` when entity keys are unreliable and duplicate detection should be based on descriptive fields.
- For `fuzzy`, pass fields that are meaningful for that entity (for example: Account Name/Email for Accounts, Order Number/Account Name for Orders, Product/Description for Order Items).
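To make the difference between the two match modes concrete, here is a small sketch. It rests on assumptions that this README does not confirm: `any` is taken to mean "at least one field reaches the minimum score" and `weighted` to mean "the weighted average across fields reaches it". It also uses the standard library's `difflib` as a stand-in scorer rather than the package's own fuzzy-matching dependency, so actual scores will differ.

```python
from difflib import SequenceMatcher


def field_score(a: str, b: str) -> float:
    """Similarity on a 0-100 scale, case-insensitive and whitespace-trimmed."""
    return 100.0 * SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def is_fuzzy_duplicate(rec_a, rec_b, fields, min_score=85.0, mode="weighted", weights=None):
    """Decide whether two records look like duplicates under the assumed mode semantics."""
    scores = [field_score(str(rec_a.get(f, "")), str(rec_b.get(f, ""))) for f in fields]
    if mode == "any":
        return any(score >= min_score for score in scores)
    if mode == "weighted":
        w = weights or [1.0] * len(fields)  # equal weights unless given
        return sum(s * wi for s, wi in zip(scores, w)) / sum(w) >= min_score
    raise ValueError(f"unknown fuzzy match mode: {mode}")


a = {"Name": "Acme Corp", "WC_Email__c": "info@acme.com"}
b = {"Name": "Acme Corp", "WC_Email__c": "zzz"}
fields = ["Name", "WC_Email__c"]
print(is_fuzzy_duplicate(a, b, fields, mode="any"))       # name alone is enough
print(is_fuzzy_duplicate(a, b, fields, mode="weighted"))  # the e-mail drags the average down
```

Under these assumptions, `any` is the more permissive mode and tends to produce more candidates to review, while `weighted` requires agreement across the chosen fields.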

## Configuration

Edit `config/rules.yaml` to change:
- Default input files via `input.sap` and `input.sf` (`directory` + `file_name`)
- Join key columns (SAP ↔ SF linking fields)
- Fallback-key matching toggle via `join.fallback.enabled` (default: `false` = primary-key-only matching)
- Field comparison rules, severity levels, and normalize modes
- Deduplication strategy (`keep_first` / `keep_last` / `flag_all`)
- Fuzzy match threshold and fields
- Output formats and directory
- Output report location/name via `output.report.directory` + `output.report.file_name`
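The three deduplication strategies can be pictured with the sketch below. The semantics are assumptions inferred from the option names: `keep_first` and `keep_last` retain one row per key, and `flag_all` treats every row of a duplicated key as a duplicate. `dedup_rows` is a hypothetical helper, not the package's implementation.

```python
from collections import defaultdict

STRATEGIES = ("keep_first", "keep_last", "flag_all")


def dedup_rows(rows: list[dict], key: str, strategy: str = "keep_first"):
    """Split rows into (kept, duplicates) according to the chosen strategy."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")

    # Group row positions by key value, preserving input order.
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        groups[row[key]].append(index)

    dup_indexes = set()
    for indexes in groups.values():
        if len(indexes) < 2:
            continue  # unique key, nothing to flag
        if strategy == "keep_first":
            dup_indexes.update(indexes[1:])
        elif strategy == "keep_last":
            dup_indexes.update(indexes[:-1])
        else:  # flag_all
            dup_indexes.update(indexes)

    kept = [row for i, row in enumerate(rows) if i not in dup_indexes]
    duplicates = [row for i, row in enumerate(rows) if i in dup_indexes]
    return kept, duplicates


rows = [{"id": 1, "v": "a"}, {"id": 1, "v": "b"}, {"id": 2, "v": "c"}]
for name in STRATEGIES:
    kept, dups = dedup_rows(rows, "id", name)
    print(name, [r["v"] for r in kept], [r["v"] for r in dups])
```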

### Generic Entity Configuration

This utility is generic. To reconcile other object types (for example SAP Orders ↔ Salesforce Orders or SAP Order Items ↔ Salesforce Order Items), update:

- `entities` labels (display text in reports)
- `input.sap` and `input.sf` file paths
- `join.primary.sap_col` and `join.primary.sf_col`
- `field_mappings` for the entity-specific columns
- `fuzzy_match.fields` for the entity-specific columns

Example entity labels:

```yaml
entities:
  pair:      "Orders"
  sap:       "SAP Orders"
  sf:        "Salesforce Orders"
  sap_short: "SAP"
  sf_short:  "SF"
```

Ready-to-use example configs are included:

- `config/rules.orders.example.yaml`
- `config/rules.order_items.example.yaml`

Run examples:

```bash
# SAP Orders vs Salesforce Orders
reconcile-accounts --config config/rules.orders.example.yaml

# SAP Order Items vs Salesforce Order Items
reconcile-accounts --config config/rules.order_items.example.yaml

# Module form
py -m phani_data_recon.cli --config config/rules.orders.example.yaml
```

When using the package outside this repository, pass your own config file with `--config` if you do not want to rely on the packaged defaults.

### Output Report Configuration

Use the `output.report` block in `config/rules.yaml` to control where reports are written and what base filename is used.

```yaml
output:
  formats: ["excel", "html"]
  report:
    directory: "output/month_end"
    file_name: "customer_reconciliation"
```

This writes reports under `output/month_end/` using `customer_reconciliation` as the base name, for example:
- `output/month_end/customer_reconciliation_<run_id>.html`
- `output/month_end/customer_reconciliation_<run_id>.xlsx`

Rules:
- `--output-dir` overrides `output.report.directory`
- `output.report.file_name` sets the report filename prefix
- `output.formats` selects Excel, HTML, or both

Example commands:

```bash
# Use output settings from config
reconcile-accounts --config config/rules.yaml

# Override only the output directory at runtime
reconcile-accounts --config config/rules.yaml --output-dir output/ad_hoc_run
```

### Config Reference (Input + Join)

```yaml
input:
  sap:
    directory: "input"
    file_name: "sap_accounts.csv"
  sf:
    directory: "input"
    file_name: "sf_accounts.csv"

join:
  primary:
    sap_col: "SAP_Unique_ID"
    sf_col:  "BP_PowerCerv_Account_Id__c"
  fallback:
    enabled: false
    sap_col: "SAP_Unique_ID"
    sf_col:  "WC_SAP_Identification__c"

output:
  formats: ["excel", "html"]
  report:
    directory: "output"
    file_name: "reconciliation_report"
```

Notes:
- Set `join.fallback.enabled: false` for strict primary-key-only matching (default).
- Set `join.fallback.enabled: true` only when you explicitly want fallback-key matching.
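The effect of the fallback toggle can be illustrated with a two-pass match. This sketch assumes that fallback matching is attempted only for records left unmatched by the primary key; the README does not spell out the exact order of operations, and `match_records` is a hypothetical helper. The column names in the sample are borrowed from the CLI examples.

```python
def match_records(sap_rows: list[dict], sf_rows: list[dict], join: dict):
    """Return (matched pairs, SAP-only rows, SF-only rows)."""
    passes = [(join["primary"]["sap_col"], join["primary"]["sf_col"])]
    fallback = join.get("fallback", {})
    if fallback.get("enabled"):
        passes.append((fallback["sap_col"], fallback["sf_col"]))

    matched = []
    sap_left, sf_left = list(sap_rows), list(sf_rows)
    for sap_col, sf_col in passes:
        # Index the still-unmatched SF rows by this pass's key, skipping blanks.
        sf_index = {}
        for row in sf_left:
            key = row.get(sf_col)
            if key not in (None, ""):
                sf_index.setdefault(key, row)

        used, still_unmatched = set(), []
        for row in sap_left:
            partner = sf_index.get(row.get(sap_col))
            if partner is not None and id(partner) not in used:
                matched.append((row, partner))
                used.add(id(partner))
            else:
                still_unmatched.append(row)
        sap_left = still_unmatched
        sf_left = [row for row in sf_left if id(row) not in used]
    return matched, sap_left, sf_left


sap = [{"kunnr": "1", "SAP_Unique_ID": "U1"}, {"kunnr": "2", "SAP_Unique_ID": "U2"}]
sf = [
    {"BP_PowerCerv_Account_Id__c": "1", "WC_SAP_Identification__c": ""},
    {"BP_PowerCerv_Account_Id__c": "", "WC_SAP_Identification__c": "U2"},
]
join = {
    "primary": {"sap_col": "kunnr", "sf_col": "BP_PowerCerv_Account_Id__c"},
    "fallback": {"enabled": False, "sap_col": "SAP_Unique_ID", "sf_col": "WC_SAP_Identification__c"},
}
matched, sap_only, sf_only = match_records(sap, sf, join)
print(len(matched), len(sap_only), len(sf_only))
```

With the fallback disabled, the second SAP record lands in SAP-only and the second Salesforce record in SF-only; enabling it pairs them through the fallback key.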

## Report Tabs

| Tab | Content |
|-----|---------|
| Summary | KPI counts, match rate, exception rate |
| Exact_Matches | Records found in both systems |
| Field_Mismatches | Field-level diffs (CRITICAL / HIGH / INFO) |
| SAP_Only | SAP records missing from Salesforce |
| SF_Only | Salesforce records missing from SAP |
| SAP_Duplicates | Duplicate SAP rows before dedup |
| SF_Duplicates | Duplicate SF rows before dedup |
| Fuzzy_Match_Candidates | Likely-same records not linked by ID |
| Data_Quality_Issues | Null IDs, bad formats, validation failures |
| Action_Plan | P1–P4 prioritised remediation table |

## Run Tests

```bash
py -m pip install pytest
py -m pytest tests/ -v
```

## Distribution (Business Rollout)

```bash
# Build wheel + source distribution
py -m pip install build
py -m build

# Install locally from wheel
py -m pip install dist/phani_data_recon-1.0.7-py3-none-any.whl
```

If `reconcile-accounts` is not on PATH, run:

```bash
py -m phani_data_recon.cli --dry-run
```

Published package:

```bash
py -m pip install --upgrade phani-data-recon
```

Legacy script usage inside this repository still works:

```bash
python run_reconciliation.py --dry-run
```

## CI Publishing (GitHub Actions)

This repository includes [`.github/workflows/publish-pypi.yml`](.github/workflows/publish-pypi.yml) to publish new releases to PyPI without storing a PyPI API token in GitHub.

One-time PyPI setup:
- In PyPI, open the `phani-data-recon` project settings.
- Add a Trusted Publisher for this GitHub repository.
- Set the workflow name to `publish-pypi.yml`.
- Set the environment name to `pypi`.

Release flow:
- Bump the version in `pyproject.toml`.
- Create a GitHub release or run the workflow manually from the Actions tab.
- The workflow builds `dist/` artifacts and publishes them with PyPI trusted publishing.

Notes:
- This workflow uses GitHub OIDC via `id-token: write`, so no `TWINE_PASSWORD` secret is required in GitHub.
- Keep local `twine` usage only for manual emergency releases.

## Project Structure

```
reconciliation_project/
├── input/           ← Place source CSVs here
├── config/          ← rules.yaml + schema
├── src/             ← All Python modules
├── templates/       ← Jinja2 HTML template
├── tests/           ← pytest test suite
├── output/          ← Reports generated here
└── run_reconciliation.py
```
