Metadata-Version: 2.4
Name: xmldev
Version: 0.1.0
Summary: Validate, repair, and normalize XML against a JSON ground-truth schema.
Author: xmldev contributors
License: MIT
Keywords: xml,validation,repair,schema,lint
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Text Processing :: Markup :: XML
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: lxml>=5.0
Requires-Dist: click>=8.1
Requires-Dist: jsonschema>=4.21
Requires-Dist: rapidfuzz>=3.6
Requires-Dist: openai>=1.14
Requires-Dist: prometheus-client>=0.20
Requires-Dist: python-dateutil>=2.9
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-cov>=5.0; extra == "dev"
Requires-Dist: pytest-mock>=3.12; extra == "dev"
Requires-Dist: ruff>=0.4; extra == "dev"
Dynamic: requires-python

# xmldev

A Python library and CLI for validating, repairing, and normalizing XML against
a developer-defined JSON ground-truth schema. Because apparently the world still
runs on XML, and apparently no one agrees on what that XML should look like.


## Why This Exists

You have an XML feed. Your partner sends you XML. Your internal service produces XML.
An LLM burps out XML. And somehow, none of it quite matches what the schema says it
should look like.

The element is named `<FullName>` in prod, `<fullname>` in staging, and `<full-name>`
in QA, and your downstream processor accepts exactly none of them. Someone typed
`twenty five` into an integer field. Half the closing tags are missing — recovered
by lxml but now the tree looks like abstract art. Your integration test has been
broken for three sprints and nobody knows why.

xmldev is the library that fixes all of that. You tell it what the XML is supposed
to look like (in a JSON file you write once), and it tells you what is wrong with
what you got. Then it fixes it. Then it tells you exactly what it changed and why.
No magic. No surprises. An LLM can optionally be involved, but only if you explicitly
ask for it and actually configure it first.

This project exists because "just validate the XML" is the oldest lie in enterprise
software, and someone had to write the thing that actually does it.


## Installation

```bash
pip install xmldev
```

Or from source:

```bash
git clone https://github.com/yourname/xmldev.git
cd xmldev
pip install -e ".[dev]"
```

**Python 3.10+ required.** Dependencies: `lxml`, `click`, `jsonschema`,
`rapidfuzz`, `openai`, `prometheus-client`, `python-dateutil`.

The LLM dependency (`openai`) is used only when you configure and enable LLM
fallback. If you never touch the config file, no LLM calls are made. Ever.


## Quick Start

### Python API

```python
from xmldev import Xmldev

xd = Xmldev()
schema = xd.load_schema_from_file("schema.person.json")
result = xd.validate_and_fix(open("broken.xml", "rb").read(), schema)

if result["ok"]:
    print(result["fixed_xml"])
else:
    print("Could not fully fix. Diagnostics:")
    for d in result["diagnostics"]:
        print(" ", d)
```

`result` is always a dict with:

| Key | Type | Description |
|-----|------|-------------|
| `ok` | bool | True if the output passes schema validation |
| `fixed_xml` | str or None | The repaired XML string |
| `patches` | list[dict] | Every change made, with confidence and source |
| `diagnostics` | list[str] | Human-readable description of unfixable problems |
| `provenance` | dict | Which repair layer actually fixed it |
| `original` | str | The input as received, unmodified |

### CLI

```bash
# Validate and repair a file
xmldev validate --schema schema.person.json --input broken.xml --output fixed.xml --audit audit.json

# Repair with auto-apply
xmldev repair --schema schema.person.json --input broken.xml --output fixed.xml --auto-apply

# Lint an entire directory
xmldev lint --schema schema.person.json --input ./xml_feeds/ --report report.json

# Start an HTTP server
xmldev serve --port 8080
```

Exit codes: `0` = pass, `1` = validation failed, `2` = fixed but has low-confidence
patches that need review, `3` = fatal error (bad schema, I/O failure).


## Defining Your Schema

The ground-truth schema is a JSON file you write once. It describes the structure
of the XML you expect to receive. Here is the minimal schema for a person record:

```json
{
  "root": {
    "name": "person",
    "attrs": {
      "id": { "name": "id", "type": "int", "required": true }
    },
    "children": [
      {
        "spec": {
          "name": "name",
          "text_type": "string",
          "aliases": ["fullname", "full-name", "full_name"]
        },
        "min_occurs": 1,
        "order_index": 0
      },
      {
        "spec": {
          "name": "age",
          "text_type": "int",
          "default": 0
        },
        "min_occurs": 0,
        "order_index": 1
      }
    ],
    "order_enforced": true
  }
}
```

Given this schema and the following broken XML:

```xml
<person id="12">
  <fullname>Bob</fullname>
  <age>twenty five</age>
</person>
```

xmldev will:

1. Rename `<fullname>` to `<name>` (alias match, confidence 1.0, auto-applied).
2. Coerce `"twenty five"` to `25` (words-to-number parser, confidence 0.6, flagged for review).
3. Return `ok: true` and two patches in the audit.

Nothing was sent to any API. Nothing was guessed. The alias was declared. The type
was declared. The fix was deterministic.


## Schema Reference (Short Form)

Full meta-schema lives in `xmldev/schema.py`. The fields that matter most:

### Element fields

| Field | Type | Description |
|-------|------|-------------|
| `name` | string | Required. The expected tag name. |
| `text_type` | string | One of `string`, `int`, `float`, `date`, `bool`, `enum`, `regex`. Default `string`. |
| `enum_values` | list[str] | Required if `text_type` is `enum`. |
| `pattern` | string | Regex pattern for `text_type: regex`. |
| `aliases` | list[str] | Tag names that should silently rename to this element. |
| `default` | any | Value to insert when this element is missing and `min_occurs >= 1`. |
| `attrs` | dict | Map of attribute name to `AttrSpec`. |
| `children` | list | Array of `ChildSpec` objects. |
| `order_enforced` | bool | If true, children must appear in `order_index` order. |
| `conditional_rules` | list | Expression-based rules evaluated at validation time. |
| `custom_validator` | string | `"module.path:callable"` for custom validation logic. |

### ChildSpec fields

| Field | Type | Description |
|-------|------|-------------|
| `spec` | ElementSpec | The element definition (nested). |
| `min_occurs` | int | Minimum occurrences. 0 = optional. Default 0. |
| `max_occurs` | int or `"unbounded"` | Maximum occurrences. Default 1. |
| `order_index` | int | Position when `order_enforced` is true. |

### Global config

```json
{
  "root": { ... },
  "global": {
    "allow_unknown": "move_to_extension",
    "extension_element_name": "extensions",
    "fuzzy": {
      "name_threshold": 0.85,
      "permissive_threshold": 0.70,
      "max_renames_per_doc": 10
    }
  }
}
```

`allow_unknown` controls what happens to elements not in the schema:
- `keep` — leave them alone.
- `drop` — delete them, generate a patch.
- `move_to_extension` (default) — move them into an `<extensions>` wrapper element.


## LLM Fallback

The LLM fallback is opt-in. It will not activate unless you create a config file
and explicitly set `XMDEV_ALLOW_LLM=true` in it. There is no default behavior that
sends data anywhere.

Create `xmldev.config.env` (see `examples/xmldev.config.env.example`):

```
LLM_BASE_URL=https://api.openai.com/v1
LLM_API_KEY=sk-your-key-here
LLM_MODEL_ID=gpt-4o-mini
LLM_TEMPERATURE=0.0
LLM_MAX_TOKENS=2048
XMDEV_ALLOW_LLM=true
LLM_REDACT=true
```

Then pass it at construction time:

```python
xd = Xmldev(config_path="xmldev.config.env")
result = xd.validate_and_fix(xml, schema, allow_llm=True)
```

Or via CLI:

```bash
xmldev repair --schema schema.json --input broken.xml --output fixed.xml \
  --allow-llm --config xmldev.config.env
```

PII redaction is enabled by default when LLM is used. Tags and attributes matching
a configurable list are replaced with `[REDACTED]` before the prompt is sent.
Set `LLM_REDACT=false` at your own discretion and your own legal risk.

The LLM is called after deterministic and heuristic passes have both failed. It gets
the broken XML, the relevant schema snippet, and the list of violations. If the LLM
output passes schema validation, it is accepted. If not, one retry is attempted with
a stricter prompt. If that also fails, the LLM result is discarded and diagnostics
are returned. Total LLM calls per document: maximum 2.


---


# Developer Reference

This section is for people who are integrating xmldev into a pipeline, embedding it
into a service, extending it with custom validators, or just trying to understand
why it did what it did to your XML. It is long. That is intentional.


## Architecture Overview

The library is a pipeline of sequential, independently testable stages. Each stage
produces an output that the next stage consumes. No stage reaches back to modify
a previous stage's output.

```
Input XML (str or bytes)
        |
        v
  [1] tolerant_parse        -- xmldev/parser.py
        |
        v
  [2] canonicalize           -- xmldev/parser.py
        |
        v
  [3] alias normalization   -- xmldev/repair.py  DeterministicRepairer
        |
        v
  [4] Validator              -- xmldev/validator.py
        |
  violations list
        |
        v
  [5] DeterministicRepairer  -- xmldev/repair.py   (up to 3 passes)
        |
        v
  [6] re-validate
        |
        +-- ok --> return result
        |
        v
  [7] HeuristicRepairer (if aggressive=True)
        |
        v
  [8] LLMClient (if configured and allow_llm=True)
        |
        v
  return result dict
```

Each box is a separate module. The orchestrator is `xmldev/__init__.py`
(`Xmldev.validate_and_fix`). The orchestrator holds no schema parsing, no
repair logic, and no fuzzy matching. It only calls the right thing in the
right order.


## Module Reference

### xmldev/schema.py

**What it does:** Loads and validates the user-provided ground-truth JSON schema
and converts it into a Python AST made of dataclasses.

**Key classes:**

- `SchemaLoader.load(source)` — accepts a JSON string, bytes, or `pathlib.Path`.
  Validates against an internal meta-schema using `jsonschema`. Raises
  `SchemaLoadError` (code `XM01`) on any problem.

- `Schema` — top-level container: `root: ElementSpec`, `global_config: GlobalConfig`.

- `ElementSpec` — represents one XML element:
  `name`, `text_type`, `enum_values`, `pattern`, `attrs`, `children`,
  `order_enforced`, `aliases`, `default`, `conditional_rules`, `custom_validator`.

- `ChildSpec` — wraps an `ElementSpec` with cardinality info:
  `min_occurs`, `max_occurs`, `order_index`, `required_if`.

- `AttrSpec` — attribute definition: `name`, `type`, `required`, `default`,
  `enum_values`, `pattern`, `aliases`.

- `GlobalConfig` — global behavior: `allow_unknown`, `extension_element_name`,
  `name_threshold`, `permissive_threshold`, `max_renames_per_doc`.

**Important implementation note:** The `GlobalConfig.name_threshold` field defaults
to `0.85` (the schema default), not `None`. The orchestrator detects whether the user
explicitly overrode this value by comparing against the `GlobalConfig()` default. If
not explicitly set, the fuzzy profile's threshold is used. If you change `GlobalConfig`
defaults, update `Xmldev.validate_and_fix` accordingly.

---

### xmldev/parser.py

**What it does:** Turns a string or bytes into an lxml `_Element` tree.

**Key functions:**

- `tolerant_parse(xml)` — returns a `ParseResult(root, recovered, parse_errors)`.
  First tries a strict parse. On failure, retries with `lxml.etree.XMLParser(recover=True)`.
  If recovery also fails or returns an empty tree, raises `ParseError` (code `XM02`).

- `canonicalize(root)` — modifies the tree in-place. Strips insignificant whitespace
  from text and tail nodes. Does NOT remove intentional whitespace in text content —
  only leading/trailing whitespace is stripped. Returns the root for chaining.

- `xpath_path(elem, doc_root)` — generates an XPath-style path string like
  `/person[1]/name[1]`. Used in patch records and violation messages. The path is
  position-indexed to be unambiguous even when siblings share a tag name.

- `_local_name(tag)` — strips namespace prefix from `{uri}localname` format.
  Used throughout the codebase wherever a tag name needs to be compared to a
  schema name.

**Parser recovery behavior:** lxml's recovery mode does NOT guarantee a valid tree.
It makes a best effort. The `recovered` flag in `ParseResult` indicates that recovery
was used. The `HeuristicRepairer` looks at this flag to decide whether additional
structural reconstruction is needed.

---

### xmldev/validator.py

**What it does:** Walks the parsed XML tree and produces a list of `Violation` objects.
Does NOT modify the tree.

**Key classes:**

- `Violation` — a dataclass with `code`, `path`, `element` (lxml element reference),
  `message`, and `extra` dict.

- `Validator(schema).validate(root)` — entry point. Returns `list[Violation]`.

**Violation codes:**

| Code | Meaning |
|------|---------|
| `UNKNOWN_TAG` | Element not in schema and not an alias of any schema element |
| `MISSING_REQUIRED` | min_occurs > 0 and element not present (alias-aware count) |
| `TOO_MANY` | more occurrences than max_occurs |
| `UNKNOWN_ATTR` | Attribute not in schema |
| `MISSING_REQUIRED_ATTR` | Required attribute absent and not an alias of a present attr |
| `TEXT_TYPE_ERROR` | Leaf text content fails type validation |
| `ATTR_TYPE_ERROR` | Attribute value fails type validation |
| `WRONG_ORDER` | Children appear in wrong order when order_enforced is true |
| `CONDITIONAL_RULE_FAILED` | A conditional_rule expression evaluated to False |
| `CUSTOM_VALIDATOR_FAILED` | Custom validator returned an error string |
| `CUSTOM_VALIDATOR_ERROR` | Custom validator raised an exception |

**Alias-aware cardinality counting:** When a child element appears under an alias name
(e.g., `<fullname>` for a field defined as `name` with alias `fullname`), the validator
counts it toward the canonical name's cardinality. This means `<fullname>` satisfies
`min_occurs: 1` for `name`. The actual renormalization (tag rename) happens in the
repair pre-pass, not in the validator.

**Order checking:** Only fires a `WRONG_ORDER` violation when `order_enforced: true`
is set on the parent element. The check filters the expected order list to only include
names actually present in the document, then compares order with what was found.

**Custom validators:** Any string of the form `"module.path:function_name"` in the
`custom_validator` field on an `ElementSpec` will be dynamically imported at validation
time. The function receives the `lxml._Element` and should return `None` or `True`
on success, or a non-empty string error message on failure.

---

### xmldev/fuzzy.py

**What it does:** Fuzzy string matching for tag and attribute name normalization.

**Key classes:**

- `FuzzyMatcher(profile, max_renames, name_threshold, permissive_threshold)`.

**Profiles:**

| Profile | name_threshold | permissive_threshold |
|---------|---------------|---------------------|
| strict | 0.95 | 0.80 |
| balanced (default) | 0.85 | 0.70 |
| permissive | 0.75 | 0.60 |

`name_threshold` controls tag/attribute name fuzzy matching.
`permissive_threshold` controls enum value fuzzy matching.

**Match priority** (highest to lowest):

1. Alias list match (score = 1.0, `via_alias=True`).
2. Exact match after normalization (lowercase, remove punctuation, collapse separators,
   apply synonym mappings — see `_SYNONYM_MAP` for built-in synonyms like `fullname -> name`).
3. Fuzzy score via `rapidfuzz.fuzz.ratio` >= `name_threshold`.

**Rename cap:** `max_renames` (default 10 from schema global config) is a hard cap on
total renames per document. Once hit, `renames_exhausted` returns True and all further
`match()` calls return None. Call `reset()` at the start of each document.

**Schema threshold vs profile threshold:** The schema's `global.fuzzy.name_threshold`
overrides the profile only when explicitly set by the user. If the schema does not
include a `global.fuzzy` block (or uses the default value), the profile's threshold
is used. This is handled in `Xmldev.validate_and_fix` by comparing against `GlobalConfig()`
defaults before passing to `FuzzyMatcher`.

---

### xmldev/repair.py

**What it does:** Applies rule-based fixes to an XML tree. Does NOT validate.
Does NOT call fuzzy directly — it receives a `FuzzyMatcher` instance.

**Key classes:**

- `DeterministicRepairer(schema, fuzzy, auto_apply_threshold=0.9)`

- `HeuristicRepairer(schema, auto_apply_threshold=0.9)`

**DeterministicRepairer.repair(root, violations):**

1. Deep-copies the root. All mutations are on the copy.
2. Alias pre-pass (`_normalize_alias_tags`): walks the copy tree and renames any
   element whose tag is in an alias list to the canonical name. This runs before
   re-validation so the alias count fix in the validator is not needed for repair.
3. Re-validates the copy to get fresh violations with element references pointing
   into the copy (not the original).
4. Dispatches each violation to the appropriate fixer method.
5. Returns `RepairResult(root=copy, patches=list[Patch], success=bool)`.

**Why re-validate inside repair():** The caller passes violations generated against
the original tree. After deep-copy, those element references are invalid. Re-validating
against the copy ensures fixer methods receive element objects that are actually in
the tree being modified.

**Why alias normalization is also done in validate_and_fix():** Because the validator
treats aliases as valid. A document containing only `<fullname>` (a declared alias of
`<name>`) passes validation. No violations are generated, so `repair()` is never called.
The alias pre-pass in `validate_and_fix` runs unconditionally on the live tree to ensure
aliased tags are canonicalized regardless of whether there are other violations.

**Fixer dispatch table:**

| Violation code | Fixer method | Notes |
|---------------|-------------|-------|
| `UNKNOWN_TAG` | `_fix_unknown_tag` | Fuzzy rename first; if no match, apply allow_unknown policy |
| `MISSING_REQUIRED` | `_fix_missing_required` | Inserts element with default value; skips if no default |
| `MISSING_REQUIRED_ATTR` | `_fix_missing_attr` | Inserts attribute with default; skips if no default |
| `TEXT_TYPE_ERROR` | `_fix_text_type` | Type coercion including words-to-number |
| `ATTR_TYPE_ERROR` | `_fix_attr_type` | Attribute value coercion |
| `WRONG_ORDER` | `_fix_order` | Removes and re-inserts children in schema order |
| `UNKNOWN_ATTR` | (inline) | Drops unknown attributes; confidence 0.95 |

**Confidence scores and auto-apply:** Patches with `confidence >= auto_apply_threshold`
(default 0.9) are marked `auto_applied=True`. The CLI uses this to decide the exit code:
if any patch is not auto-applied, exit code 2 is returned to signal human review is
needed.

**HeuristicRepairer.repair(root, doc_root):** Currently implements duplicate sibling
merging (collapses multiple occurrences of an element that has `max_occurs=1` by moving
children of excess occurrences into the first). Patches are marked `auto_applied=False`.
Intended as the extensible hook for DP-based reconstruction in future passes.

**Words-to-number parsing:** `_words_to_int(text)` handles English integer words like
`"twenty five"` -> `25`. Confidence for word-to-number coercions is `0.6` (below
auto-apply threshold, always requires review). Decimal words and ordinals are not
supported.

---

### xmldev/audit.py

**What it does:** Defines the `Patch` dataclass and the `AuditWriter` utility.

**Patch fields:**

```python
@dataclass
class Patch:
    type: str          # rename | insert | delete | coerce | reorder | move
    path: str          # XPath-style: /person[1]/name[1]
    original: str
    replacement: str
    rule_id: str
    confidence: float
    source: str        # deterministic | heuristic | llm
    auto_applied: bool
    notes: str
    id: str            # UUIDv4, auto-generated
    timestamp: str     # ISO 8601 UTC, auto-generated
```

**AuditWriter.write(patches, path):** Writes a JSON file with:
- `summary`: `total_patches`, `auto_applied`, `pending_review`, `by_type`, `by_source`.
- `patches`: list of `Patch.to_dict()` for every patch.

**AuditWriter.to_dict(patches):** Returns the same structure as a Python dict
without writing to disk. Used by `cli.py` for the `--audit` flag.

---

### xmldev/llm.py

**What it does:** Manages the LLM fallback. Contains config loading, PII redaction,
prompt construction, and the API call.

**LLMConfig.load(path):**
- Returns `None` if the file does not exist, or if `XMDEV_ALLOW_LLM != true`.
- Parses `KEY=VALUE` lines. Strips quotes. Supports inline comments via `#`.
- Required fields: `LLM_BASE_URL`, `LLM_API_KEY`, `LLM_MODEL_ID`.

**PIIRedactor(tags, attrs):** Walks the XML string with a regex substitution strategy.
Replaces text content of configured tags and values of configured attributes with
`[REDACTED]`. Default sensitive tags: `ssn`, `password`, `credit_card`, `secret`,
`token`, `api_key`. Override via `LLM_REDACT_TAGS=tag1,tag2` in config.

**LLMClient.repair(schema, xml_str, diagnostics=None):**
1. Optionally redacts PII from `xml_str`.
2. Calls `build_prompt(schema, xml_str, diagnostics)`.
3. Sends to `openai.OpenAI(base_url=..., api_key=...).chat.completions.create(...)`.
4. Validates response: must be parseable XML and the root tag must match `schema.root.name`.
5. If invalid, retries once.
6. Raises `LLMError` (code `XM05`) on persistent failure or if response contains
   `<xmldev_error>`.

**build_prompt(schema, xml_str, diagnostics):** Returns the prompt string. The schema
is serialized to a compact but human-readable JSON snippet. The LLM is instructed to
return only XML and nothing else. The explicit instruction to use `<xmldev_error>` for
cases where it cannot repair is included.

---

### xmldev/cli.py

**What it does:** Click-based CLI. Four commands: `validate`, `repair`, `lint`, `serve`.

**validate:** Runs the full `validate_and_fix` pipeline. Optionally writes fixed XML
and an audit JSON. Exit codes 0/1/2/3 as defined above.

**repair:** Same as validate but focuses on writing the fixed XML. Flags:
- `--auto-apply`: applies all patches regardless of confidence.
- `--auto-apply-threshold`: custom confidence cutoff.
- `--aggressive`: enables heuristic repair pass.

**lint:** Accepts a file or a directory. For directories, recursively finds `*.xml`.
Runs the pipeline on each file and reports pass/fail per file. Writes a JSON report.

**serve:** Starts a minimal `http.server.HTTPServer` on `POST /repair`. Accepts JSON
body `{"schema": {...}, "xml": "..."}`. Returns the `validate_and_fix` result as JSON.
Prometheus metrics exposed on `port+1` if `prometheus_client` is importable.

**Verbose mode (`-v`):** Sets `logging.getLogger("xmldev")` to DEBUG level. Logs
each patch type, confidence, and path. Useful for pipeline debugging.

---

## Writing Custom Validators

To attach domain-specific validation logic to any element, set `custom_validator`
in its schema `ElementSpec`:

```json
{
  "name": "email",
  "text_type": "string",
  "custom_validator": "mypackage.validators:validate_email"
}
```

The referenced function must be importable when xmldev runs. It receives the
`lxml._Element` and should return:

- `None` or `True` — validation passed.
- A non-empty string — validation failed; the string becomes the violation message.

```python
# mypackage/validators.py
import re

def validate_email(elem):
    text = (elem.text or "").strip()
    if not re.fullmatch(r"[^@]+@[^@]+\.[^@]+", text):
        return f"Invalid email address: '{text}'"
    return None
```

Custom validator exceptions are caught and surfaced as `CUSTOM_VALIDATOR_ERROR`
violations rather than propagated. This ensures one bad validator does not crash
the entire pipeline.

---

## Extending the Repair Pipeline

The repair pipeline is designed to be extended. The cleanest extension points are:

**1. Adding a new violation code and fixer:**

Add a new `elif v.code == "MY_CODE":` branch in
`DeterministicRepairer._fix_violations()`. Add the corresponding violation generation
in `Validator._check_children()` or `Validator._validate_element()`. Add a test.

**2. Adding a new heuristic:**

Subclass `HeuristicRepairer` or add a method to the existing class. Call it from
`HeuristicRepairer.repair()`. Mark patches with `source="heuristic"` and set
appropriate confidence values.

**3. Replacing the LLM client:**

`LLMClient` uses `openai.OpenAI` but only for the `chat.completions.create` call.
Any OpenAI-compatible endpoint works (Ollama, LMStudio, Mistral, Azure OpenAI, etc.)
by setting `LLM_BASE_URL` in the config. To replace the client entirely, subclass
`LLMClient` and override `repair()`.

**4. Custom fuzzy synonyms:**

Edit `_SYNONYM_MAP` in `xmldev/fuzzy.py`. The map applies during normalization,
before similarity scoring. Keys and values are normalized strings.

---

## Error Codes

| Code | When | Class |
|------|------|-------|
| XM01 | Invalid or unreadable ground-truth schema | `SchemaLoadError` |
| XM02 | XML is so broken lxml cannot recover any tree | `ParseError` |
| XM03 | I/O error reading input or schema file | (OS-level, surfaced in CLI) |
| XM04 | LLM requested but not configured | (diagnostic string) |
| XM05 | LLM returned invalid or unrepairable XML | `LLMError` |
| XM06 | Validation still fails after all repair passes | (diagnostic strings in result) |

Exceptions carry `.code`, `.message`, `.details`, `.suggested_action` attributes.

---

## Running Tests

```bash
pytest tests/ -v
```

Coverage report:

```bash
pytest tests/ --cov=xmldev --cov-report=term-missing
```

The test suite is entirely self-contained. No external services, no network calls.
LLM tests use mocked `openai.OpenAI` clients. All tests run in under 5 seconds on
a normal machine.

Test files:

| File | Coverage area |
|------|--------------|
| `tests/test_schema.py` | Schema loading, meta-schema validation, AST fields |
| `tests/test_parser.py` | Tolerant parse, recovery, canonicalization, XPath paths |
| `tests/test_repair.py` | Alias rename, fuzzy rename, type coercion, reorder, move/drop |
| `tests/test_audit.py` | Patch dataclass, AuditWriter, summary counts |
| `tests/test_llm.py` | Config loading, PII redaction, mocked LLM calls |
| `tests/test_cli.py` | Click test runner, exit codes, file outputs |

---

## Linting

```bash
ruff check xmldev/ tests/
```

The project uses `ruff` for linting. Configuration is in `pyproject.toml`.

---

## License

MIT. Go fix some XML.
