Metadata-Version: 2.4
Name: locationformatter
Version: 0.1.2
Summary: A dual-head NER-based parser for location strings
Project-URL: Homepage, https://github.com/semantic-ai/decide-location-formatter
Project-URL: Issues, https://github.com/semantic-ai/decide-location-formatter/issues
Author-email: Author <stefaan.vercoutere@sirus.be>
License: MIT
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.9
Requires-Dist: pytorch-crf>=0.7.2
Requires-Dist: torch>=2.0
Requires-Dist: transformers>=4.35
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Description-Content-Type: text/markdown

# decide-location-formatter

A Python package for parsing and structuring location strings into their individual address components. Built around a dual-head NER model ([`svercoutere/abb-dual-location-component-ner`](https://huggingface.co/svercoutere/abb-dual-location-component-ner)) fine-tuned on top of **XLM-RoBERTa base**.

## How it works

Raw location strings like `"Scaldisstraat 23-25, 2000 Antwerpen"` or `"Cafe den Draak, Lovegemlaan 7, 9000 Gent"` are common in municipal decision texts, but they are inconsistently formatted and often contain several distinct locations in a single string.

The pipeline has three steps:

1. **Text cleaning** — normalises whitespace, unicode, and newlines.
2. **Dual-head NER inference** — the model runs two independent CRF-decoded classification heads over every token simultaneously:
   - **Component head** — tags each token as one of 12 address component types (street, city, postcode, …).
   - **Location head** — groups tokens that belong to the same physical location into `B-LOCATION` / `I-LOCATION` spans, allowing multi-location strings to be split.
3. **Post-processing** — component spans are nested inside their parent location spans, housenumber ranges/sequences (e.g. `23-25`, `7 en 9`) are expanded into individual entries, and bus numbers are split into a separate field.
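
The nesting part of step 3 can be sketched in a few lines. This is a minimal illustration, not the package's implementation: the `(label, start, end)` span format, the `nest_components` function, and the hand-written spans standing in for model output are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical span format: (label, start_char, end_char), end exclusive.
Span = Tuple[str, int, int]


def nest_components(text: str, locations: List[Span],
                    components: List[Span]) -> List[Dict[str, str]]:
    """Attach each component span to the location span that fully contains it."""
    nested = []
    for _, loc_start, loc_end in locations:
        entry = {"location": text[loc_start:loc_end]}
        for label, start, end in components:
            if loc_start <= start and end <= loc_end:
                # Keep the first span seen for a given label.
                entry.setdefault(label.lower(), text[start:end])
        nested.append(entry)
    return nested


text = "Lovegemlaan 7, 9000 Gent en Dorpstraat 12, 9240 Zele"


def span(label: str, piece: str) -> Span:
    """Build a span by locating `piece` in `text` (illustration only)."""
    start = text.index(piece)
    return (label, start, start + len(piece))


# Hand-written spans standing in for the two heads' predictions.
locations = [span("LOCATION", "Lovegemlaan 7, 9000 Gent"),
             span("LOCATION", "Dorpstraat 12, 9240 Zele")]
components = [span("STREET", "Lovegemlaan"), span("HOUSENUMBER", "7"),
              span("POSTCODE", "9000"), span("CITY", "Gent"),
              span("STREET", "Dorpstraat"), span("HOUSENUMBER", "12"),
              span("POSTCODE", "9240"), span("CITY", "Zele")]

nested = nest_components(text, locations, components)
for entry in nested:
    print(entry)
```

Containment by character offsets is only one possible rule; the real pipeline may resolve overlaps or orphaned components differently.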

### Architecture

| Component | Detail |
|---|---|
| Base encoder | `xlm-roberta-base` (12 layers, 768 hidden) |
| Component head | Linear(768 → 256) · GELU · Dropout(0.1) · Linear(256 → 25) + CRF |
| Location head | Linear(768 → 256) · GELU · Dropout(0.1) · Linear(256 → 3) + CRF |
| Tokenisation | Word-level regex tokeniser; sub-word alignments via fast tokenizer `word_ids()` |
| Max input length | 256 sub-word tokens |
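
To illustrate the tokenisation row: a fast tokenizer's `word_ids()` returns one entry per sub-word token, giving the index of the word it came from, or `None` for special tokens. The sketch below picks the first sub-word of each word, a common way to map word-level labels onto sub-word tokens. Whether this package labels the first sub-word or uses another strategy is not stated here, so treat that choice, the `first_subword_indices` name, and the hand-written `word_ids` list as assumptions.

```python
from typing import List, Optional


def first_subword_indices(word_ids: List[Optional[int]]) -> List[int]:
    """Return the token position of the first sub-word of each word."""
    indices = []
    previous = None
    for position, word_id in enumerate(word_ids):
        # None marks special tokens such as <s> and </s>.
        if word_id is not None and word_id != previous:
            indices.append(position)
        previous = word_id
    return indices


# Hand-written example in the shape returned by word_ids():
# one special token, a word split into two sub-words, three single-token
# words, and a closing special token.
word_ids = [None, 0, 0, 1, 2, 3, None]
first = first_subword_indices(word_ids)
print(first)  # [1, 3, 4, 5]
```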

### Entity types (component head)

| Label | Description |
|---|---|
| `STREET` | Street name (no house number) |
| `ROAD` | Road or route name |
| `HOUSENUMBER` | House/building number(s), ranges or sequences |
| `POSTCODE` | Postal or ZIP code |
| `CITY` | City or municipality name |
| `PROVINCE` | Province or region name |
| `BUILDING` | Named building, site or facility |
| `INTERSECTION` | Crossing or intersection of roads |
| `PARCEL` | Land parcel, section or lot number |
| `DISTRICT` | District, neighbourhood or borough |
| `GRAVE_LOCATION` | Plot/row/number within a cemetery |
| `DOMAIN_ZONE_AREA` | Domain, zone or area name |
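
The head sizes in the architecture table are consistent with a BIO tagging scheme: 12 component types give 2 × 12 + 1 = 25 labels, and the single `LOCATION` type gives 3. The sketch below builds such label lists. The BIO scheme for the component head and the label order are assumptions; the model's actual id-to-label mapping may differ.

```python
COMPONENT_TYPES = [
    "STREET", "ROAD", "HOUSENUMBER", "POSTCODE", "CITY", "PROVINCE",
    "BUILDING", "INTERSECTION", "PARCEL", "DISTRICT", "GRAVE_LOCATION",
    "DOMAIN_ZONE_AREA",
]


def bio_labels(types):
    """Return ["O", "B-X", "I-X", ...] for the given entity types."""
    labels = ["O"]
    for entity_type in types:
        labels += [f"B-{entity_type}", f"I-{entity_type}"]
    return labels


component_labels = bio_labels(COMPONENT_TYPES)
location_labels = bio_labels(["LOCATION"])
print(len(component_labels), len(location_labels))  # 25 3
```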

## Evaluation

Evaluated on a held-out 10 % split of ~10 000 Belgian municipal decision location strings.

| Metric | Score |
|---|---|
| **Combined F1** | **0.9435** |
| Component F1 | 0.9295 |
| Location F1 | 0.9576 |
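
The table does not say how the combined score is derived. As a sanity check only, the snippet below assumes it is the unweighted mean of the two head scores, which agrees with the reported value to within rounding.

```python
component_f1 = 0.9295
location_f1 = 0.9576

# Assumption: combined F1 is the unweighted mean of the two heads.
mean_f1 = (component_f1 + location_f1) / 2
reported_combined_f1 = 0.9435

print(f"{mean_f1:.5f}")
print(abs(mean_f1 - reported_combined_f1) < 1e-4)  # True
```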

## Installation

### From source (recommended during development)

```bash
git clone https://github.com/semantic-ai/decide-location-formatter.git
cd decide-location-formatter
pip install -e .
```

### Dependencies only

```bash
# Quote each requirement so the shell does not treat ">" as a redirect.
pip install "torch>=2.0" "transformers>=4.35" "pytorch-crf>=0.7.2"
```

> The model weights (~1 GB) are downloaded automatically from the Hugging Face Hub on first use.

## Usage

### Quick start

```python
from locationformatter import LocationFormatter

lf = LocationFormatter()   # loads model once; reuse for many calls

result = lf.parse("Scaldisstraat 23-25, 2000 Antwerpen")
print(result)
```

```json
{
  "original": "Scaldisstraat 23-25, 2000 Antwerpen",
  "locations": [
    {
      "location": "Scaldisstraat 23-25, 2000 Antwerpen",
      "street": "Scaldisstraat",
      "housenumber": "23",
      "housenumber_type": "single",
      "postcode": "2000",
      "city": "Antwerpen"
    },
    {
      "location": "Scaldisstraat 23-25, 2000 Antwerpen",
      "street": "Scaldisstraat",
      "housenumber": "25",
      "housenumber_type": "single",
      "postcode": "2000",
      "city": "Antwerpen"
    }
  ]
}
```
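
Because range expansion yields one entry per house number, downstream code may want to regroup entries that share a street. The following sketch works on the result shown above, copied in as a literal dict so it runs without the model; `housenumbers_by_street` is a hypothetical helper, not part of the package.

```python
from collections import defaultdict

result = {
    "original": "Scaldisstraat 23-25, 2000 Antwerpen",
    "locations": [
        {"location": "Scaldisstraat 23-25, 2000 Antwerpen",
         "street": "Scaldisstraat", "housenumber": "23",
         "housenumber_type": "single", "postcode": "2000", "city": "Antwerpen"},
        {"location": "Scaldisstraat 23-25, 2000 Antwerpen",
         "street": "Scaldisstraat", "housenumber": "25",
         "housenumber_type": "single", "postcode": "2000", "city": "Antwerpen"},
    ],
}


def housenumbers_by_street(result):
    """Group house numbers by (street, postcode, city)."""
    grouped = defaultdict(list)
    for loc in result["locations"]:
        # Fields are optional, so use .get() rather than indexing.
        key = (loc.get("street"), loc.get("postcode"), loc.get("city"))
        if loc.get("housenumber") is not None:
            grouped[key].append(loc["housenumber"])
    return dict(grouped)


grouped = housenumbers_by_street(result)
print(grouped)
# {('Scaldisstraat', '2000', 'Antwerpen'): ['23', '25']}
```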

### Multi-location strings

Strings that contain several distinct locations are automatically split:

```python
result = lf.parse("Lovegemlaan 7, 9000 Gent en Dorpstraat 12, 9240 Zele")
for loc in result["locations"]:
    print(loc)
```

### Raw prediction (no housenumber expansion)

`predict()` returns spans straight from the model without expanding ranges or splitting bus numbers:

```python
raw = lf.predict("Heikeesstraat 2-4, 9240 Zele")
# raw["locations"][0]["housenumber"] == "2-4"
# raw["locations"][0]["housenumber_type"] == "range"
```

### One-shot helper

For a single call without keeping the model in memory:

```python
from locationformatter import parse_location

result = parse_location("Grote Markt 1, 2000 Antwerpen")
```

> **Note:** `parse_location` reloads the model on every call. Use `LocationFormatter` for repeated parsing.
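
One way to get one-shot convenience without reloading is to cache a single instance behind a function. The sketch below shows the pattern with `functools.lru_cache`. `ExpensiveModel` is a stand-in class so the example runs without downloading weights, and its `parse` returns a placeholder; in real code the cached function would presumably return `LocationFormatter()` instead.

```python
from functools import lru_cache


class ExpensiveModel:
    """Stand-in for a slow-to-load object such as LocationFormatter."""

    load_count = 0

    def __init__(self):
        # Count constructions to show the model is built only once.
        ExpensiveModel.load_count += 1

    def parse(self, text):
        # Placeholder result, not real model output.
        return {"original": text, "locations": []}


@lru_cache(maxsize=1)
def get_formatter():
    """Create the formatter on first use and reuse it afterwards."""
    return ExpensiveModel()


for text in ["Grote Markt 1, 2000 Antwerpen", "Dorpstraat 12, 9240 Zele"]:
    get_formatter().parse(text)

print(ExpensiveModel.load_count)  # 1
```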

### Custom model or device

```python
lf = LocationFormatter(
    repo="your-org/your-model",   # any compatible HF Hub repo
    device="cuda",                # "cpu" or "cuda"; auto-detected when omitted
)
```

## API reference

### `LocationFormatter`

```python
from typing import Optional  # keeps the signature valid on Python 3.9

class LocationFormatter:
    def __init__(self, repo: str = "svercoutere/abb-dual-location-component-ner",
                 device: Optional[str] = None): ...

    def parse(self, text: str) -> dict: ...
    # Full pipeline: clean → NER → expand housenumbers.
    # Returns {"original": str, "locations": list[dict]}

    def predict(self, text: str) -> dict: ...
    # NER only, no housenumber expansion.
    # Returns {"original": str, "locations": list[dict]}
```

### Helper functions

```python
from locationformatter import clean_string, clean_house_number, extract_house_and_bus_number

clean_string("  Grote   Markt\n1  ")
# → "Grote Markt 1"

clean_house_number("3 t.e.m. 7")
# → ["3", "4", "5", "6", "7"]

clean_house_number("10-14")
# → ["10", "11", "12", "13", "14"]

extract_house_and_bus_number("5 bus 3")
# → {"housenumber": "5", "bus": "3"}
```
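
To make the expansion rules concrete, here is a rough regex-based sketch that reproduces the helper outputs listed above. It is not the package's code: `expand_house_numbers` and `split_bus` are hypothetical names, and the accepted separators are assumptions. Note that it fills every integer in a range, as the `clean_house_number` examples do, whereas the quick-start output for `23-25` lists only 23 and 25, so the full pipeline evidently applies rules this sketch does not cover.

```python
import re
from typing import Dict, List

# Range separators assumed here: "-", "t.e.m." and "tot en met".
RANGE_RE = re.compile(
    r"^\s*(\d+)\s*(?:-|t\.e\.m\.|tot en met)\s*(\d+)\s*$", re.IGNORECASE)
BUS_RE = re.compile(r"^\s*(\S+)\s+bus\s+(\S+)\s*$", re.IGNORECASE)


def expand_house_numbers(value: str) -> List[str]:
    """Expand a range ("10-14") or sequence ("7 en 9") into single numbers."""
    match = RANGE_RE.match(value)
    if match:
        start, end = int(match.group(1)), int(match.group(2))
        if start <= end:
            return [str(n) for n in range(start, end + 1)]
    # Sequence: split on commas, "&" or the Dutch word "en" ("and").
    parts = re.split(r"\s*(?:,|&|\ben\b)\s*", value.strip())
    return [part for part in parts if part]


def split_bus(value: str) -> Dict[str, str]:
    """Separate a box ("bus") number from the house number when present."""
    match = BUS_RE.match(value)
    if match:
        return {"housenumber": match.group(1), "bus": match.group(2)}
    return {"housenumber": value.strip()}


print(expand_house_numbers("3 t.e.m. 7"))  # ['3', '4', '5', '6', '7']
print(expand_house_numbers("7 en 9"))      # ['7', '9']
print(split_bus("5 bus 3"))                # {'housenumber': '5', 'bus': '3'}
```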

### Output schema

Each entry in the `locations` list is a flat dict. Only fields detected by the model are included.

| Field | Type | Description |
|---|---|---|
| `location` | `str` | The substring corresponding to this location |
| `street` | `str` | Street name |
| `road` | `str` | Road/route name |
| `housenumber` | `str` | Individual house number after expansion by `parse()`; `predict()` returns the raw span, e.g. `"2-4"` |
| `housenumber_type` | `str` | `"single"`, `"range"`, or `"sequence"` |
| `bus` | `str` | Box (Dutch *bus*) or apartment number (when present) |
| `postcode` | `str` | Postal code |
| `city` | `str` | City or municipality |
| `province` | `str` | Province |
| `building` | `str` | Named building or facility |
| `intersection` | `str` | Road intersection |
| `parcel` | `str` | Land parcel identifier |
| `district` | `str` | District or neighbourhood |
| `grave_location` | `str` | Cemetery plot/row/number |
| `domain_zone_area` | `str` | Zone or area name |
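
For type-checked code, the schema above can be mirrored with a `TypedDict` in which every key is optional. The `Location` and `ParseResult` types and the `format_address` helper below are illustrative names that the package is not known to export.

```python
from typing import List, TypedDict


class Location(TypedDict, total=False):
    # total=False: any key may be absent, since only detected fields appear.
    location: str
    street: str
    road: str
    housenumber: str
    housenumber_type: str
    bus: str
    postcode: str
    city: str
    province: str
    building: str
    intersection: str
    parcel: str
    district: str
    grave_location: str
    domain_zone_area: str


class ParseResult(TypedDict):
    original: str
    locations: List[Location]


def format_address(loc: Location) -> str:
    """Join whichever common fields are present into a one-line address."""
    street = loc.get("street") or loc.get("road")
    line1 = " ".join(p for p in (street, loc.get("housenumber")) if p)
    if loc.get("bus"):
        line1 = f"{line1} bus {loc['bus']}".strip()
    line2 = " ".join(p for p in (loc.get("postcode"), loc.get("city")) if p)
    return ", ".join(p for p in (line1, line2) if p)


example: Location = {"street": "Scaldisstraat", "housenumber": "23",
                     "postcode": "2000", "city": "Antwerpen"}
print(format_address(example))  # Scaldisstraat 23, 2000 Antwerpen
```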

## Development

### Running tests

```bash
pytest tests/
```

The unit tests for the helper functions (`clean_string`, `clean_house_number`, `extract_house_and_bus_number`) do not require the model to be loaded and run offline.
