Metadata-Version: 2.4
Name: emu-xml-parser
Version: 0.1.1
Summary: Parser for XML generated by Axiell EMu
License-File: LICENSE
Keywords: natural-history,collections,api
Author: Daniel Markbreiter
Author-email: dmarkbreiter@nhm.org
Requires-Python: >=3.10
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Requires-Dist: cerberus (>=1.3.8,<2.0.0)
Description-Content-Type: text/markdown

**EMu XML Parser**

- **Purpose:** Parse XML files produced by Axiell EMu into Python-native records (lists of dicts) using schema information embedded in the XML processing instruction. The parser preserves nested tables/tuples and fills missing fields with sensible defaults.

**Quick Install**

- Using pip, from a checkout of the repository (editable install for development):

```bash
# create and activate a venv (recommended)
python3 -m venv .venv
source .venv/bin/activate

# install the package in editable mode for development
pip install -e .

# install test runner
pip install pytest
```

- Other managers: `conda` or `poetry` also work. Create and activate an environment, then install with `pip install -e .`.

**Basic Usage (Python)**

- Import and parse an EMu XML file. The public function is `parse` exposed at the package root.

```python
from emu_xml_parser import parse

rows = parse("/path/to/emu_export.xml")
```

- If you want date fields parsed into Python `date` objects:

```python
rows = parse("/path/to/emu_export.xml", parse_dates=True)
```

## Single-Column Tables

EMu tables defined with only **one field** are automatically flattened to **lists of strings** instead of lists of dicts. This makes the data easier to work with.

**Example Schema:**

```xml
<?schema
table ecatalogue
	table common_name
		text short ComName
	end
	table element
		text long IPAnatomy
	end
end
?>
```

**XML Data:**

```xml
<tuple>
	<table name="common_name">
		<tuple>
			<atom name="ComName">Indian Bush Lark</atom>
		</tuple>
		<tuple>
			<atom name="ComName">Rufous-tailed Lark</atom>
		</tuple>
	</table>
	<table name="element">
		<tuple>
			<atom name="IPAnatomy">shell(s)</atom>
		</tuple>
	</table>
</tuple>
```

**Python Output:**

```python
{
	"common_name": ["Indian Bush Lark", "Rufous-tailed Lark"],  # List of strings
	"element": ["shell(s)"]  # Not [{"IPAnatomy": "shell(s)"}]
}
```

**Contrast with Multi-Column Tables:**

Multi-field tables remain as **lists of dicts**:

```xml
<?schema
table ecatalogue
	table SitSiteRef_tab
		text long locality
		integer locality_irn
	end
end
?>
```

**Python Output:**

```python
{
	"SitSiteRef_tab": [
		{"locality": "San Pedro", "locality_irn": 368989},
		{"locality": "Los Angeles", "locality_irn": 363879}
	]
}
```

**Why This Matters:**

- **Simplicity:** Access values with `row["common_name"][0]` instead of `row["common_name"][0]["ComName"]`
- **Common pattern:** Many EMu exports have single-field reference tables (taxonomy names, elements, etc.)
- **Backwards compatible:** Multi-field tables are unaffected and are still returned as lists of dicts
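
The flattening rule described in this section can be sketched in a few lines. This is an illustration only, not the package's implementation: the function name `flatten_single_field_table` and its arguments are hypothetical, and it assumes each table arrives as a list of dicts alongside the field names the schema declares for it.

```python
def flatten_single_field_table(rows, field_names):
    """Collapse a table to a list of values when its schema declares exactly one field.

    Hypothetical helper: `rows` is a list of dicts (one per <tuple>) and
    `field_names` lists the fields the schema declares for the table.
    """
    if len(field_names) != 1:
        return rows  # multi-field tables stay as lists of dicts
    (only_field,) = field_names
    return [row.get(only_field) for row in rows]


common_names = flatten_single_field_table(
    [{"ComName": "Indian Bush Lark"}, {"ComName": "Rufous-tailed Lark"}],
    ["ComName"],
)
sites = flatten_single_field_table(
    [{"locality": "San Pedro", "locality_irn": 368989}],
    ["locality", "locality_irn"],
)
print(common_names)  # ['Indian Bush Lark', 'Rufous-tailed Lark']
print(sites)         # [{'locality': 'San Pedro', 'locality_irn': 368989}]
```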

**Minimal XML Example**

Input (EMu XML contains a `<?schema ... ?>` processing instruction):

```xml
<?xml version="1.0"?>
<?schema
table ecatalogue
	date date_emu_record_modified
	date date_emu_record_inserted
	integer irn
	text short emu_guid
	text short department
	text short catalogue_number
	table SitSiteRef_tab
		text long locality
		integer locality_irn
	end
	tuple SpeTaxonRef
		text short taxon_irn
	end
	table common_name
		text short ComName
	end
end
?>
<root>
	<tuple>
		<atom name="date_emu_record_modified">2023-05-18</atom>
		<atom name="date_emu_record_inserted">2012-10-30</atom>
		<atom name="irn">368521</atom>
		<atom name="emu_guid">8767ccff-...</atom>
		<atom name="department">Ornithology</atom>
		<atom name="catalogue_number">89334</atom>
		<tuple name="SpeTaxonRef">
			<atom name="taxon_irn">24960</atom>
		</tuple>
		<table name="common_name">
			<tuple>
				<atom name="ComName">Indian Bush Lark</atom>
			</tuple>
		</table>
	</tuple>
</root>
```

Expected Python output (approximate):

```py
[
	{
		"date_emu_record_modified": "2023-05-18",
		"date_emu_record_inserted": "2012-10-30",
		"irn": 368521,
		"emu_guid": "8767ccff-...",
		"department": "Ornithology",
		"catalogue_number": "89334",
		"SpeTaxonRef": [{"taxon_irn": 24960}],
		"SitSiteRef_tab": [
			{
				"locality": None,
				"locality_irn": None
			}
		],
		"common_name": ["Indian Bush Lark"]
	}
]
```
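
To make the mapping from `<tuple>`, `<table>`, and `<atom>` elements to dicts, lists, and strings concrete, here is a minimal sketch using the standard library's `xml.etree.ElementTree`. It is not the package's code: the function name `element_to_python` is hypothetical, and the sketch ignores the schema entirely, so it performs no type conversion, no default filling, and no single-field flattening.

```python
import xml.etree.ElementTree as ET


def element_to_python(elem):
    """Recursively convert <tuple>/<table>/<atom> elements to dicts, lists and strings."""
    if elem.tag == "atom":
        return elem.text or ""
    if elem.tag == "table":
        return [element_to_python(child) for child in elem]
    # Anything else (here: <tuple>) becomes a mapping keyed by each child's name attribute.
    return {child.get("name"): element_to_python(child) for child in elem}


xml_text = """
<root>
  <tuple>
    <atom name="irn">368521</atom>
    <tuple name="SpeTaxonRef">
      <atom name="taxon_irn">24960</atom>
    </tuple>
    <table name="common_name">
      <tuple>
        <atom name="ComName">Indian Bush Lark</atom>
      </tuple>
    </table>
  </tuple>
</root>
"""

root = ET.fromstring(xml_text)
records = [element_to_python(record) for record in root]
print(records)
```

Every value comes back as a string here, and `common_name` is a list of dicts; the schema-driven steps described in this README are what turn that raw structure into the output shown above.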

Notes:

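- Atom values are converted according to the type declared in the schema: `text` fields stay strings and `integer` fields become `int`, as in the output above. `date` fields stay strings unless `parse_dates=True`, in which case they are converted to Python `date` objects.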
- Multi-field tables (tables/tuples with multiple field definitions) are represented as lists of dicts. Single-field tables become lists of strings.
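- Fields missing from a record are filled with defaults derived from the schema (empty strings or empty lists). In the approximate output above, the absent `SitSiteRef_tab` table appears as a single row whose fields are `None`.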
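
As a rough picture of what date conversion involves, the sketch below converts ISO `YYYY-MM-DD` strings like those in the example into `datetime.date` objects. It is an assumption-laden illustration rather than the package's logic: `to_date` and `date_fields` are hypothetical names, and values that are not full ISO dates are simply left unchanged.

```python
from datetime import date


def to_date(value):
    """Convert an ISO 'YYYY-MM-DD' string to a date; pass anything else through unchanged."""
    try:
        return date.fromisoformat(value)
    except (TypeError, ValueError):
        return value


record = {"date_emu_record_modified": "2023-05-18", "department": "Ornithology"}
date_fields = ["date_emu_record_modified"]  # hypothetical: fields the schema declares as `date`

converted = {
    key: to_date(val) if key in date_fields else val
    for key, val in record.items()
}
print(converted["date_emu_record_modified"])  # 2023-05-18 (now a datetime.date)
```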

**Testing**

- Run the test suite (after installing dev/test deps):

```bash
pytest -q
```

If you used a virtual environment, ensure it's activated before running `pytest`.

**Working with Real / Large Fixtures**

- Keep small, anonymized fixtures under `tests/fixtures` and reference them in tests.
- For large or private datasets, do not commit the originals; point tests to a local folder via the `TEST_EMU_XML_DIR` environment variable and skip those tests when it is unset.
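
One possible shape for such a test is sketched below. Only `TEST_EMU_XML_DIR` and `parse` come from this README; the helper `private_fixture_files` and the test name are hypothetical, and the sketch assumes the private exports are `*.xml` files directly inside that folder.

```python
import os
from pathlib import Path


def private_fixture_files(env_var="TEST_EMU_XML_DIR"):
    """Return the *.xml files in the directory named by env_var, or [] if it is unset."""
    directory = os.environ.get(env_var)
    if not directory:
        return []
    return sorted(Path(directory).glob("*.xml"))


def test_private_exports_parse():
    # Imported inside the test so the helper above stays usable without pytest.
    import pytest

    files = private_fixture_files()
    if not files:
        pytest.skip("TEST_EMU_XML_DIR is unset or contains no XML files")

    from emu_xml_parser import parse

    for xml_path in files:
        assert isinstance(parse(str(xml_path)), list)
```

Skipping rather than failing keeps the suite green on machines that do not have access to the private data.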

**Extending / Customizing**

- Conversion helpers live in `emu_xml_parser.converter` (e.g. date parsing/serialization) and validation/enforcement lives in `emu_xml_parser.validator`.
- If you need different conversion rules, you can adapt `convert_value` or wrap the parser in a small class that injects custom converters.
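
The wrapper approach might look like the sketch below. Because the signature of `convert_value` is not documented here, the sketch does not call it; instead it post-processes the rows that a parse function returns. `apply_converters`, `ConvertingParser`, and `fake_parse` are hypothetical names, and `fake_parse` merely stands in for `emu_xml_parser.parse` so the example runs without an export file.

```python
def apply_converters(rows, converters):
    """Return new rows with converters[field](value) applied to matching top-level fields."""
    converted_rows = []
    for row in rows:
        new_row = dict(row)
        for field, convert in converters.items():
            if field in new_row and new_row[field] is not None:
                new_row[field] = convert(new_row[field])
        converted_rows.append(new_row)
    return converted_rows


class ConvertingParser:
    """Hypothetical wrapper: calls a parse function, then applies custom converters."""

    def __init__(self, parse_func, converters):
        self.parse_func = parse_func
        self.converters = converters

    def parse(self, path, **kwargs):
        return apply_converters(self.parse_func(path, **kwargs), self.converters)


def fake_parse(path, **kwargs):
    """Stand-in for emu_xml_parser.parse; returns one hard-coded record."""
    return [{"irn": 368521, "catalogue_number": "89334", "department": "Ornithology"}]


parser = ConvertingParser(fake_parse, {"catalogue_number": int, "department": str.upper})
rows = parser.parse("/path/to/emu_export.xml")
print(rows)
```

In real use you would pass `emu_xml_parser.parse` as `parse_func`. Only top-level fields are handled; nested tables would need a recursive variant.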

**Files of Interest**

- `src/emu_xml_parser/core.py`: entry point `parse()` for the package
- `src/emu_xml_parser/extractor.py`: reads the `<?schema ... ?>` processing instruction
- `src/emu_xml_parser/schema.py`: schema text → structured schema
- `src/emu_xml_parser/tuple_parser.py`: recursive XML → dict conversion
- `src/emu_xml_parser/converter.py`: value conversion utilities
- `src/emu_xml_parser/validator.py`: schema enforcement and normalization
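
To give a feel for the first two steps (reading the processing instruction and turning its text into a structure), here is a self-contained sketch. It is not the code in `extractor.py` or `schema.py`: the function names are hypothetical, the regular expression assumes a single `<?schema ... ?>` block, and the resulting dict does not distinguish `table` from `tuple`.

```python
import re


def extract_schema_text(xml_text):
    """Return the body of the first <?schema ... ?> processing instruction, or None."""
    match = re.search(r"<\?schema\s(.*?)\?>", xml_text, flags=re.DOTALL)
    return match.group(1) if match else None


def parse_schema(schema_text):
    """Turn schema text into nested dicts: {name: type string or nested dict}."""
    root = {}
    stack = [root]
    for raw_line in schema_text.splitlines():
        words = raw_line.split()
        if not words:
            continue
        if words[0] == "end":
            stack.pop()
        elif words[0] in ("table", "tuple"):
            child = {}
            stack[-1][words[1]] = child
            stack.append(child)
        else:
            # e.g. "text short ComName": the last word is the field name, the rest its type
            stack[-1][words[-1]] = " ".join(words[:-1])
    return root


xml_text = """<?xml version="1.0"?>
<?schema
table ecatalogue
    integer irn
    table common_name
        text short ComName
    end
end
?>
<root/>
"""

schema = parse_schema(extract_schema_text(xml_text))
print(schema)
```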

**License & Contributing**

- Add your preferred license and contribution guidelines to the repository root.

