Metadata-Version: 2.4
Name: onflow-location-platform
Version: 1.0.1
Summary: OnFlow Location Platform for parsing, converting, and standardizing Vietnamese administrative units
Home-page: https://github.com/N-H-Logistics/onflow-location-platform
Author: OnFlow
Author-email: opensource@onflow.vn
Project-URL: Homepage, https://github.com/N-H-Logistics/onflow-location-platform
Project-URL: Repository, https://github.com/N-H-Logistics/onflow-location-platform
Project-URL: Issues, https://github.com/N-H-Logistics/onflow-location-platform/issues
Keywords: address parser,administrative units,vietnam,geodata,sdk
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: Vietnamese
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Text Processing
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Requires-Dist: geopy
Requires-Dist: pandas
Requires-Dist: shapely
Requires-Dist: tqdm
Requires-Dist: unidecode
Provides-Extra: scripts
Requires-Dist: beautifulsoup4; extra == "scripts"
Requires-Dist: numpy; extra == "scripts"
Requires-Dist: requests; extra == "scripts"
Requires-Dist: seleniumbase; extra == "scripts"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: project-url
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# OnFlow Location Platform

OnFlow Location Platform is a Python package and data workspace for working with Vietnamese administrative units across two address systems:

- `LEGACY`: the historical 63-province structure
- `FROM_2025`: the post-reform 34-province structure
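
The two system identifiers could be modeled as a small enum; this is only an illustrative sketch, and the package's actual internal representation may differ:

```python
from enum import Enum

class AddressSystem(Enum):
    """Illustrative identifiers for the two Vietnamese address systems."""
    LEGACY = "legacy"          # historical 63-province structure
    FROM_2025 = "from_2025"    # post-reform 34-province structure

print(AddressSystem.LEGACY.value)  # → legacy
```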

The repository contains:

- a runtime package for parsing and converting Vietnamese addresses
- packaged lookup assets under [`src/data`](src/data)
- data preparation inputs under [`data`](data)
- collection, processing, generation, and validation scripts under [`scripts`](scripts)

## Highlights

- Parse free-text Vietnamese administrative addresses into structured `AdminUnit` objects
- Convert legacy addresses to the 2025 administrative structure
- Standardize province, district, ward, and address columns in pandas DataFrames
- Query packaged lookup data from the bundled SQLite database
- Maintain a reproducible workflow from raw inputs to generated runtime assets

## Package Naming

- Repository / distribution name: `onflow-location-platform`
- Python import path: `onflow_location_platform`
- Runtime data directory: [`src/data`](src/data)

Note that the distribution name uses hyphens while the import path uses underscores: `import onflow_location_platform`.

## Repository Layout

```text
.
├── data/
│   ├── alias_keywords/
│   ├── raw/
│   ├── interim/
│   └── processed/
├── scripts/
│   ├── collecting_data/
│   ├── processing_data/
│   ├── generating_module_data/
│   └── testing_package/
├── src/
│   ├── converter/
│   ├── data/
│   ├── database/
│   ├── pandas/
│   └── parser/
└── setup.py
```

## How It Works

The repository operates as a small data platform plus a runtime package:

```text
external sources
    -> data/raw
    -> data/interim
    -> data/processed
    -> scripts/generating_module_data
    -> src/data
    -> src parser / converter / pandas / database APIs
```

In practice, the workflow is:

1. Collect source files from public endpoints or manual downloads into [`data/raw`](data/raw).
2. Clean, map, and enrich those inputs into intermediate and processed datasets under [`data/interim`](data/interim) and [`data/processed`](data/processed).
3. Generate compact runtime assets for the package in [`src/data`](src/data), including parser dictionaries, conversion mappings, and the bundled SQLite database.
4. Use the public API from [`src`](src) to parse addresses, convert legacy addresses, standardize DataFrame columns, or query lookup data.

Consequently, the runtime package never reads the workspace CSV files at execution time; it relies only on the generated assets already bundled in [`src/data`](src/data).
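
Step 3 above amounts to compacting processed tables into lookup structures. A minimal sketch of the idea, using hypothetical column names (the real columns and asset layout live in `data/processed` and `scripts/generating_module_data`):

```python
import csv
import io
import json

# Hypothetical processed rows standing in for a CSV under data/processed.
processed_csv = io.StringIO(
    "province,ward\n"
    "Hà Nội,Hồng Hà\n"
    "Hà Nội,Ba Đình\n"
)

# Build a compact province -> wards lookup, the kind of structure a
# generated parser asset in src/data might contain.
lookup = {}
for row in csv.DictReader(processed_csv):
    lookup.setdefault(row["province"], []).append(row["ward"])

asset = json.dumps(lookup, ensure_ascii=False)
print(asset)  # {"Hà Nội": ["Hồng Hà", "Ba Đình"]}
```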

## Installation

### Local editable install

```bash
python -m venv envs
source envs/bin/activate
pip install -e .
```

### Local editable install with script dependencies

```bash
python -m venv envs
source envs/bin/activate
pip install -e '.[scripts]'
```

## Quick Start

### Parse a 2025-format address

```python
from onflow_location_platform import parse_2025_address

unit = parse_2025_address("Tân Sơn Hòa, Hồ Chí Minh")
print(unit.format_address())
```

### Parse a legacy-format address

```python
from onflow_location_platform import parse_legacy_address

unit = parse_legacy_address(
    "Đường 15, Long Bình, Quận 9, Hồ Chí Minh",
    level=3,
)

print(unit.short_province, unit.short_district, unit.short_ward)
```

### Convert a legacy address to the 2025 structure

```python
from onflow_location_platform import convert_legacy_address_to_2025

unit = convert_legacy_address_to_2025("59 Nguyễn Sỹ Sách, Phường 15, Tân Bình, Hồ Chí Minh")
print(unit.format_address())
```

### Standardize administrative unit columns in pandas

```python
import pandas as pd
from onflow_location_platform.pandas import standardize_admin_unit_columns

df = pd.DataFrame(
    [
        {"province": "ha noi", "ward": "hong ha"},
        {"province": "hà nội", "ward": "ba đình"},
    ]
)

result = standardize_admin_unit_columns(
    df,
    province="province",
    ward="ward",
)

print(result)
```

### Query the bundled SQLite lookup data

```python
from onflow_location_platform.database import get_data, query

print(get_data(fields=["province", "ward"], table="admin_units", limit=5))
print(query("SELECT province, ward FROM admin_units LIMIT 5"))
```

## Public API

### `parse_2025_address(address, keep_street=True, level=2)`

Parse a Vietnamese address in the 2025 34-province structure into an `AdminUnit`.

- `keep_street=True` keeps street text when enough address segments are available
- `level=1` parses province
- `level=2` parses ward and province

### `parse_legacy_address(address, keep_street=True, level=3)`

Parse a Vietnamese address in the legacy 63-province structure into an `AdminUnit`.

- `keep_street=True` keeps street text when enough address segments are available
- `level=1` parses province
- `level=2` parses district and province
- `level=3` parses ward, district, and province
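
The `level` parameter suggests a right-to-left reading of comma-separated segments. A hypothetical sketch of that reading (the real parser also resolves aliases, prefixes such as `Quận`, and fuzzy matches against the bundled dictionaries):

```python
def split_segments(address: str, level: int) -> dict:
    """Read comma-separated segments right to left: province, then district,
    then ward, up to the requested level; the remainder is street text."""
    parts = [p.strip() for p in address.split(",")]
    unit = {}
    for label in ["province", "district", "ward"][:level]:
        unit[label] = parts.pop() if parts else None
    unit["street"] = ", ".join(parts) or None
    return unit

print(split_segments("Đường 15, Long Bình, Quận 9, Hồ Chí Minh", level=3))
# → {'province': 'Hồ Chí Minh', 'district': 'Quận 9', 'ward': 'Long Bình', 'street': 'Đường 15'}
```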

### `convert_legacy_address_to_2025(address)`

Convert a legacy-format address into a normalized `AdminUnit` in the 2025 structure.
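
Conceptually, the conversion is a table lookup from a legacy triple to a 2025 pair. The mapping entry below is hypothetical; the real table is generated into [`src/data/converter_2025.json`](src/data/converter_2025.json):

```python
# Hypothetical excerpt of a legacy -> 2025 mapping.
CONVERSION = {
    ("Hồ Chí Minh", "Tân Bình", "Phường 15"): ("Hồ Chí Minh", "Tân Sơn Hòa"),
}

def convert(province: str, district: str, ward: str):
    """Look up the 2025 (province, ward) pair for a legacy (province, district, ward) triple."""
    return CONVERSION.get((province, district, ward))

print(convert("Hồ Chí Minh", "Tân Bình", "Phường 15"))
# → ('Hồ Chí Minh', 'Tân Sơn Hòa')
```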

### `standardize_admin_unit_columns(...)`

Standardize province, district, and ward columns in a pandas DataFrame.

### `convert_address_column(...)`

Convert a full address column and optionally attach old/new administrative attributes.

### `get_data(...)` and `query(sql)`

Read lookup data from the bundled SQLite database in [`src/data/dataset.db`](src/data/dataset.db).
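
`query(sql)` is presumably a thin layer over SQLite. The sketch below uses an in-memory stand-in for the bundled database with an assumed `admin_units(province, ward)` schema, so it runs without the package installed; the real schema may differ:

```python
import sqlite3

# In-memory stand-in for src/data/dataset.db.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE admin_units (province TEXT, ward TEXT)")
conn.executemany(
    "INSERT INTO admin_units VALUES (?, ?)",
    [("Hà Nội", "Ba Đình"), ("Hồ Chí Minh", "Tân Sơn Hòa")],
)

# Same shape of query as the Quick Start example.
rows = conn.execute("SELECT province, ward FROM admin_units LIMIT 5").fetchall()
print(rows)  # [('Hà Nội', 'Ba Đình'), ('Hồ Chí Minh', 'Tân Sơn Hòa')]
```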

## Data Directories

### Runtime assets

These files are bundled with the Python package and used at runtime:

- [`src/data/parser_legacy.json`](src/data/parser_legacy.json)
- [`src/data/parser_from_2025.json`](src/data/parser_from_2025.json)
- [`src/data/converter_2025.json`](src/data/converter_2025.json)
- [`src/data/dataset.db`](src/data/dataset.db)

### Workspace data

The [`data`](data) directory is used for data preparation and notebook workflows:

- [`data/alias_keywords`](data/alias_keywords): curated alias inputs used when generating parser assets
- [`data/raw`](data/raw): collected source files
- [`data/interim`](data/interim): intermediate transformation outputs
- [`data/processed`](data/processed): processed datasets for analysis and validation

The repository currently ignores `data/raw`, `data/interim`, and `data/processed`, so those folders act as workspace outputs rather than committed package contents.

## Scripts

Operational scripts and notebooks are grouped by purpose:

- [`scripts/collecting_data`](scripts/collecting_data): external data collection and scraping
- [`scripts/processing_data`](scripts/processing_data): mapping, cleaning, enrichment, and dataset building
- [`scripts/generating_module_data`](scripts/generating_module_data): generation of packaged parser and converter assets
- [`scripts/testing_package`](scripts/testing_package): smoke tests and manual validation

Example smoke test:

```bash
PYTHONPATH=. envs/bin/python scripts/testing_package/manual_parse_smoke_test.py
```

Example benchmark:

```bash
envs/bin/python scripts/testing_package/benchmark_public_api.py
```

Example collection script:

```bash
PYTHONPATH=. envs/bin/python scripts/collecting_data/scrape_sapnhap_bando_provinces_and_wards.py --date 2026-03-31 --verbose
```

Notes:

- Some scripts require the optional dependencies from `.[scripts]`
- Collection scripts require network access
- Several workflows are notebook-driven rather than packaged as command-line tools

## Benchmark

The repository includes a reproducible micro-benchmark for the public SDK API:

```bash
envs/bin/python scripts/testing_package/benchmark_public_api.py
```

Sample results from a local run on `2026-03-31` using Python `3.11.13` on `macOS-15.5-arm64`:

| API | Sample Input | Iterations / run | Best ms/op | Mean ms/op | Ops/s |
| --- | --- | ---: | ---: | ---: | ---: |
| `parse_2025_address` | `"Tân Sơn Hòa, Hồ Chí Minh"` | 10,000 | 0.1181 | 0.1199 | 8469.8 |
| `parse_legacy_address` | `"Long Bình, Quận 9, Hồ Chí Minh"` (`level=3`) | 10,000 | 0.1003 | 0.1039 | 9965.8 |
| `convert_legacy_address_to_2025` | `"Phường 15, Tân Bình, Hồ Chí Minh"` | 5,000 | 0.2364 | 0.2402 | 4230.0 |

These numbers are indicative and will vary by machine, Python version, and whether the tested input path requires external geocoding.
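
The Best/Mean/Ops columns can be reproduced with the stdlib `timeit` module. A sketch of that harness, benchmarking a trivial stand-in workload instead of the SDK calls:

```python
import timeit

def target():
    # Stand-in workload; the real script calls the public SDK APIs.
    "Tân Sơn Hòa, Hồ Chí Minh".split(", ")

iterations = 10_000
# Five runs of `iterations` calls each; per-op time for each run.
runs = timeit.repeat(target, number=iterations, repeat=5)
per_op = [t / iterations for t in runs]

best_ms = min(per_op) * 1000
mean_ms = sum(per_op) / len(per_op) * 1000
ops_per_s = 1 / min(per_op)
print(f"best {best_ms:.4f} ms/op, mean {mean_ms:.4f} ms/op, {ops_per_s:.1f} ops/s")
```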

## Development Notes

- Python requirement: `>=3.7`
- Runtime dependencies: `geopy`, `pandas`, `shapely`, `tqdm`, `unidecode`
- Optional script dependencies: `beautifulsoup4`, `numpy`, `requests`, `seleniumbase`

## License

`setup.py` declares the project as MIT-licensed. A standalone `LICENSE` file is not present in the current workspace snapshot.
