Metadata-Version: 2.4
Name: xmlcst
Version: 0.1.0
Summary: Full-fidelity XML parser with lossless round-trip editing
Project-URL: Repository, https://github.com/rcook/xmlcst
Project-URL: Issues, https://github.com/rcook/xmlcst/issues
Author: Richard Cook
License-Expression: MIT
License-File: LICENSE
Keywords: cst,editing,parser,round-trip,xml
Classifier: Development Status :: 3 - Alpha
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Text Processing :: Markup :: XML
Classifier: Typing :: Typed
Requires-Python: >=3.12
Provides-Extra: dev
Requires-Dist: basedpyright; extra == 'dev'
Requires-Dist: pytest; extra == 'dev'
Requires-Dist: ruff; extra == 'dev'
Description-Content-Type: text/markdown

# xmlcst

Full-fidelity XML concrete syntax tree for Python -- parse, edit, and serialize with zero formatting loss.

## Why xmlcst?

Existing Python XML libraries (ElementTree, lxml, minidom) parse XML into a semantic tree that discards lexical details: whitespace, comment placement, attribute quote styles, entity reference forms, and more. When you serialize back, the output can differ from the input even if you changed nothing.
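
The loss is easy to reproduce with the standard library alone. The following is a minimal sketch using `xml.etree.ElementTree`; the sample document is made up for illustration, and the exact serialized form may vary between Python versions, so the code only checks which details fail to survive.

```python
import xml.etree.ElementTree as ET

# Illustrative input: double spaces, mixed quote styles, a comment,
# and a numeric character reference (&#38; is "&").
source = "<root  b='2' a=\"1\"><!-- note --><x/>a &#38; b</root>"

# Parse and immediately serialize, without editing anything.
root = ET.fromstring(source)
output = ET.tostring(root, encoding="unicode")

print(output)
print(output == source)   # False: nothing was edited, yet the text changed
print("<!--" in output)   # False: the default parser drops comments
print("&#38;" in output)  # False: the original reference form is not retained
print("'" in output)      # False: attribute quotes are normalized
```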

`xmlcst` takes a different approach. It treats XML as **source text first** and semantic structure second, producing a concrete syntax tree (CST) that retains every byte of the original document. When you edit a single attribute, only that attribute changes in the output -- surrounding formatting, comments, and whitespace remain untouched.

This makes `xmlcst` ideal for programmatic editing of XML configuration files (Maven POMs, `.csproj` files, Spring configs, Android manifests) where changes must produce minimal, reviewable diffs.

### How xmlcst compares

| Feature | ElementTree | lxml | minidom | xmlcst |
|---|---|---|---|---|
| Attribute order | Partial | Partial | Partial | Preserved |
| Quote style (`'` vs `"`) | No | No | No | Preserved |
| Whitespace / indentation | No | No | No | Preserved |
| Comments | No | Yes | Yes | Preserved |
| Entity reference form | No | No | No | Preserved |
| CDATA vs escaped text | No | Yes | Yes | Preserved |
| Empty-element syntax (`<x/>` vs `<x />`) | No | No | No | Preserved |
| Byte-identical round-trip | No | No | No | Yes |

The closest conceptual analogue is [ruamel.yaml](https://yaml.readthedocs.io/) -- a round-trip-capable YAML library -- applied to XML.

## Installation

```
pip install xmlcst
```

Requires Python 3.12+. Pure Python with no compiled dependencies. Ships type information packaged per [PEP 561](https://peps.python.org/pep-0561/), so mypy and pyright can type-check code that uses it.

## Quick Start

### Parse and round-trip

```python
import xmlcst

source = '<project xmlns="http://maven.apache.org/POM/4.0.0">\n  <version>1.0</version>\n</project>'

doc = xmlcst.parse(source)
assert doc.to_string() == source  # byte-identical round-trip
```

### Edit an attribute (minimal diff)

```python
doc = xmlcst.parse('<root version="1.0" author="alice"/>')
doc.root.attributes["version"] = "2.0"
print(doc.to_string())
# <root version="2.0" author="alice"/>
# Only the value changed -- quotes, whitespace, other attributes untouched
```

### Navigate the tree

```python
doc = xmlcst.parse("""\
<project>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
    </dependency>
  </dependencies>
</project>""")

deps = doc.root.find("dependencies")
dep = deps.find("dependency")
group = dep.find("groupId")
print(group.children[0].content)  # "junit" (a Text node)

# Or search recursively
dep2 = doc.root.find_recursive("dependency")
all_deps = doc.root.findall_recursive("dependency")
```

### Add elements

```python
doc = xmlcst.parse("<root>\n  <a/>\n  <b/>\n</root>")
doc.root.append(xmlcst.Element("c"))
print(doc.to_string())
# <root>
#   <a/>
#   <b/>
#   <c/>
# </root>
```

### Access formatting metadata

```python
doc = xmlcst.parse('<root  id = "1"  name=\'foo\'/>')
attr = doc.root.attributes["id"]
print(attr.raw_value)          # "1"
print(attr.quote)              # '"'
print(attr.leading_whitespace) # "  "
print(attr.eq_whitespace)      # (" ", " ")
```

### Work with entity references

```python
doc = xmlcst.parse("<root>a &amp; b</root>")
text = doc.root.children[0]
print(text.content)            # "a &amp; b"  (raw, as in the source)
print(text.decoded_content())  # "a & b"      (entities resolved)

text.set_content("x < y")      # auto-escapes
print(text.content)            # "x &lt; y"
```

## Sample Application

The [`samples/bump_pom_version/`](samples/bump_pom_version/) directory contains a complete example: a Maven POM version bumper that reads a `pom.xml`, increments the patch version, and writes the file back. Only the version string changes -- all comments, whitespace, attribute quoting, and other formatting are preserved exactly.

```bash
python samples/bump_pom_version/bump_pom_version.py
# 1.2.3 -> 1.2.4
```

The script accepts an optional path argument to operate on any POM file:

```bash
python samples/bump_pom_version/bump_pom_version.py /path/to/your/pom.xml
```
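
The version arithmetic itself is small. As a rough sketch of the "increment the patch version" step, assuming a plain `MAJOR.MINOR.PATCH` string with no qualifier such as `-SNAPSHOT`, it could look like the following. `bump_patch` is a hypothetical helper written for this README, not code taken from the sample script.

```python
def bump_patch(version: str) -> str:
    """Return `version` with its patch component incremented by one.

    Assumes exactly three dot-separated, non-negative integer parts.
    """
    parts = version.split(".")
    if len(parts) != 3 or not all(p.isascii() and p.isdigit() for p in parts):
        raise ValueError(f"expected MAJOR.MINOR.PATCH, got {version!r}")
    major, minor, patch = parts
    return f"{major}.{minor}.{int(patch) + 1}"


print(bump_patch("1.2.3"))  # 1.2.4
```

Because the new string replaces only the content of the version element, the rest of the POM is left as it was.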

## API Overview

### Parsing

| Function | Input | Returns |
|---|---|---|
| `xmlcst.parse(text)` | `str` | `Document` |
| `xmlcst.parse_bytes(data)` | `bytes` | `Document` |
| `xmlcst.parse_file(path)` | `str \| Path` | `Document` |

All parse functions raise `xmlcst.ParseError` on malformed input. The error includes `message`, `line`, `column`, and `offset` attributes.
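
The position attributes describe the same location in two ways. As an illustration of how they could relate, here is a minimal sketch that derives a line and column from a character offset. It assumes a 0-based offset and 1-based line and column numbers; this README does not state which convention `ParseError` uses, and `line_and_column` is a hypothetical helper, not part of `xmlcst`.

```python
def line_and_column(text: str, offset: int) -> tuple[int, int]:
    """Map a 0-based character offset to a 1-based (line, column) pair."""
    if not 0 <= offset <= len(text):
        raise ValueError(f"offset {offset} is outside the text")
    # Lines are counted by the newlines that precede the offset.
    line = text.count("\n", 0, offset) + 1
    # rfind returns -1 when there is no earlier newline, which makes
    # the column of the first character on the first line equal to 1.
    last_newline = text.rfind("\n", 0, offset)
    column = offset - last_newline
    return line, column


text = "<root>\n  <a>\n</root>"
print(line_and_column(text, text.index("<a>")))  # (2, 3)
```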

### Node Types

| Type | Description |
|---|---|
| `Document` | Root container; holds all top-level nodes |
| `Element` | An XML element with tag, attributes, and children |
| `Attribute` | Name-value pair with formatting metadata (quote style, whitespace) |
| `AttributeList` | Ordered collection with dict-like access by name |
| `Text` | Character data (entity references preserved in raw form) |
| `Whitespace` | Whitespace-only character data between markup |
| `Comment` | `<!-- ... -->` |
| `ProcessingInstruction` | `<?target data?>` |
| `CData` | `<![CDATA[...]]>` |
| `Doctype` | `<!DOCTYPE ...>` (preserved verbatim) |
| `XmlDeclaration` | `<?xml version="1.0" ...?>` |

### Serialization

| Method | Description |
|---|---|
| `doc.to_string()` | Exact round-trip serialization (default) |
| `doc.to_string(mode="normalized")` | Pretty-printed with consistent formatting |
| `doc.to_bytes()` | UTF-8 encoded; BOM preserved if present in input |
| `doc.write(path)` | Write to file |
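
To make the BOM behavior of `to_bytes()` concrete, here is a minimal standard-library sketch of one way a UTF-8 byte order mark can be remembered on input and restored on output. `split_bom` and `join_bom` are hypothetical helpers for illustration only; they are not `xmlcst` functions and may not reflect how the library handles this internally.

```python
import codecs


def split_bom(data: bytes) -> tuple[bool, str]:
    """Decode UTF-8 bytes, reporting whether a BOM was present."""
    has_bom = data.startswith(codecs.BOM_UTF8)
    body = data[len(codecs.BOM_UTF8):] if has_bom else data
    return has_bom, body.decode("utf-8")


def join_bom(has_bom: bool, text: str) -> bytes:
    """Encode text as UTF-8, restoring the BOM if the input had one."""
    encoded = text.encode("utf-8")
    return codecs.BOM_UTF8 + encoded if has_bom else encoded


original = codecs.BOM_UTF8 + b"<root/>"
has_bom, text = split_bom(original)
print(join_bom(has_bom, text) == original)  # True
```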

## Design

`xmlcst` uses a dual-layer architecture:

1. **Token stream** (Layer 1) -- a lossless sequence of tokens covering every byte of the input. The fundamental invariant: `"".join(t.text for t in tokens) == source`.
2. **Tree API** (Layer 2) -- mutable nodes backed by the token stream. Each node tracks a token span and a dirty flag.
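
The Layer 1 invariant can be demonstrated with a toy tokenizer. The sketch below is deliberately simplistic and hypothetical: the `Token` class, the three token kinds, and the regular expression are assumptions made for illustration, not the `xmlcst` tokenizer, and the pattern ignores cases such as `>` inside attribute values. What it shows is the property itself: every character lands in exactly one token, so joining the tokens reproduces the source.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    kind: str  # "comment", "tag", or "text" in this sketch
    text: str  # exact slice of the source


# Comments are tried before generic tags so that "<!-- ... -->" stays whole.
_TOKEN_RE = re.compile(
    r"(?P<comment><!--.*?-->)|(?P<tag><[^>]*>)|(?P<text>[^<]+)",
    re.DOTALL,
)


def tokenize(source: str) -> list[Token]:
    """Split source into tokens that together cover every character."""
    tokens: list[Token] = []
    pos = 0
    while pos < len(source):
        match = _TOKEN_RE.match(source, pos)
        if match is None:
            raise ValueError(f"cannot tokenize at offset {pos}")
        tokens.append(Token(str(match.lastgroup), match.group()))
        pos = match.end()
    return tokens


source = '<root  id = "1">\n  <!-- keep me -->\n  <a/>\n</root>'
tokens = tokenize(source)
print("".join(t.text for t in tokens) == source)  # True
```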

Unmodified nodes serialize by replaying their original tokens, so their output is byte-identical to the input. Modified nodes are rebuilt from their current properties. As a result, an edit changes only the text of the nodes it touches.
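
The replay-or-rebuild rule can be sketched with a single attribute node. `AttrSketch` below is a hypothetical class written for this README; its field names echo the formatting metadata shown in Quick Start, but it is not the library's implementation and it skips value escaping for brevity.

```python
from dataclasses import dataclass


@dataclass
class AttrSketch:
    original: str                   # exact source text of the attribute
    name: str
    value: str
    quote: str
    leading_whitespace: str
    eq_whitespace: tuple[str, str]  # whitespace before and after "="
    dirty: bool = False

    def set_value(self, new_value: str) -> None:
        self.value = new_value
        self.dirty = True

    def serialize(self) -> str:
        if not self.dirty:
            # Clean node: replay the original text unchanged.
            return self.original
        # Dirty node: rebuild from current properties, reusing the
        # recorded formatting so only the value differs.
        before, after = self.eq_whitespace
        return (
            f"{self.leading_whitespace}{self.name}{before}={after}"
            f"{self.quote}{self.value}{self.quote}"
        )


attrs = [
    AttrSketch('  id = "1"', "id", "1", '"', "  ", (" ", " ")),
    AttrSketch("  name='foo'", "name", "foo", "'", "  ", ("", "")),
]


def render(items: list[AttrSketch]) -> str:
    return "<root" + "".join(a.serialize() for a in items) + "/>"


before_edit = render(attrs)
attrs[0].set_value("2")
after_edit = render(attrs)
print(before_edit)  # <root  id = "1"  name='foo'/>
print(after_edit)   # <root  id = "2"  name='foo'/>
```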

See [SPEC.md](SPEC.md) for the full specification.

## Limitations (v1)

- UTF-8 encoding only
- XML 1.0 well-formed documents only (no error recovery)
- No DTD validation or schema support
- No XPath query engine
- No streaming / SAX-style parsing
- Pure Python (no compiled acceleration)

See the [future roadmap](SPEC.md#15-future-roadmap) in the specification for planned enhancements.

## License

MIT
