Metadata-Version: 2.4
Name: onflow-location-platform
Version: 1.0.7
Summary: OnFlow Location Platform for parsing, converting, and standardizing Vietnamese administrative units
Home-page: https://github.com/N-H-Logistics/onflow-location-platform
Author: OnFlow
Author-email: opensource@onflow.vn
Project-URL: Homepage, https://github.com/N-H-Logistics/onflow-location-platform
Project-URL: Repository, https://github.com/N-H-Logistics/onflow-location-platform
Project-URL: Issues, https://github.com/N-H-Logistics/onflow-location-platform/issues
Keywords: address parser,administrative units,vietnam,geodata,sdk
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: Vietnamese
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Text Processing
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Requires-Dist: geopy
Requires-Dist: pandas
Requires-Dist: shapely
Requires-Dist: tqdm
Requires-Dist: unidecode
Provides-Extra: scripts
Requires-Dist: beautifulsoup4; extra == "scripts"
Requires-Dist: numpy; extra == "scripts"
Requires-Dist: requests; extra == "scripts"
Requires-Dist: seleniumbase; extra == "scripts"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: project-url
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# OnFlow Location Platform

[Vietnamese version (Tiếng Việt)](./README_VN.md)

OnFlow Location Platform is a Python package and data workspace for working with Vietnamese administrative units across two address systems:

- `LEGACY`: the historical 63-province structure
- `FROM_2025`: the post-reform 34-province structure

The repository contains:

- a runtime package for parsing and converting Vietnamese addresses
- packaged lookup assets under [`src/onflow_location_platform/data`](src/onflow_location_platform/data)
- data preparation inputs under [`data`](data)
- collection, processing, generation, and validation scripts under [`scripts`](scripts)

## Highlights

- Parse free-text Vietnamese administrative addresses into structured `AdminUnit` objects
- Convert legacy addresses to the 2025 administrative structure
- Standardize province, district, ward, and address columns in pandas DataFrames
- Query packaged lookup data from the bundled SQLite database
- Maintain a reproducible workflow from raw inputs to generated runtime assets

## Unified Interface: `OnFlowLocation`

The `OnFlowLocation` class provides a unified, property-based API to all platform operations.

```python
from onflow_location_platform import OnFlowLocation

# Initialize with an address and/or legacy codes
location_info = OnFlowLocation(
    address="842 Nguyễn Kiệm, Hạnh Thông, hồ chí minh",
    provide_code=91,
    district_code=913,
    ward_code=31066
)

# Parse or convert using properties
print(location_info.convert_address_new_to_old.format_address())
print(location_info.convert_address_old_to_new_by_code.format_address())
```

## Package Naming

- Repository / distribution name: `onflow-location-platform`
- Python import path: `onflow_location_platform`
- Runtime data directory: [`src/onflow_location_platform/data`](src/onflow_location_platform/data)

The public import path is `onflow_location_platform`.

## Repository Layout

```text
.
├── data/
│   ├── alias_keywords/
│   ├── raw/
│   ├── interim/
│   └── processed/
├── scripts/
│   ├── collecting_data/
│   ├── processing_data/
│   ├── generating_module_data/
│   └── testing_package/
├── src/
│   └── onflow_location_platform/
│       ├── converter/
│       ├── data/
│       ├── database/
│       ├── pandas/
│       └── parser/
└── setup.py
```

## How It Works

The repository operates as a small data platform plus a runtime package:

```text
external sources
    -> data/raw
    -> data/interim
    -> data/processed
    -> scripts/generating_module_data
    -> src/onflow_location_platform/data
    -> src/onflow_location_platform parser / converter / pandas / database APIs
```

In practice, the workflow is:

1. Collect source files from public endpoints or manual downloads into [`data/raw`](data/raw).
2. Clean, map, and enrich those inputs into intermediate and processed datasets under [`data/interim`](data/interim) and [`data/processed`](data/processed).
3. Generate compact runtime assets for the package in [`src/onflow_location_platform/data`](src/onflow_location_platform/data), including parser dictionaries, conversion mappings, and the bundled SQLite database.
4. Use the public API from [`src/onflow_location_platform`](src/onflow_location_platform) to parse addresses, convert legacy addresses, standardize DataFrame columns, or query lookup data.

As a result, the runtime package does not read the workspace CSV files at execution time; it relies only on the generated assets already bundled in [`src/onflow_location_platform/data`](src/onflow_location_platform/data).
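Because the assets ship inside the package, they are resolved relative to the installed module rather than the current working directory. A minimal sketch of that pattern using only the standard library (the package path in the docstring is the real one; the helper function itself is illustrative, not part of the platform's API):

```python
from importlib import resources


def read_packaged_asset(package: str, filename: str) -> bytes:
    """Read a data file bundled inside an installed package.

    The platform's own assets live under the
    onflow_location_platform.data package; this helper is a generic
    illustration, not part of its public API.
    """
    return resources.files(package).joinpath(filename).read_bytes()
```

With the platform installed, `read_packaged_asset("onflow_location_platform.data", "converter_2025.json")` would return the bundled mapping file.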

## Installation

### Local editable install

```bash
python -m venv envs
source envs/bin/activate
pip install -e .
```

### Local editable install with script dependencies

```bash
python -m venv envs
source envs/bin/activate
pip install -e '.[scripts]'
```

## Quick Start

### Parse a 2025-format address

```python
from onflow_location_platform import parse_address_new

unit = parse_address_new("Tân Sơn Hòa, Hồ Chí Minh")
print(unit.format_address())
```

### Parse a legacy-format address

```python
from onflow_location_platform import parse_address_old

unit = parse_address_old(
    "Đường 15, Long Bình, Quận 9, Hồ Chí Minh",
    level=3,
)

print(unit.short_province, unit.short_district, unit.short_ward)
```

### Convert a legacy address to the 2025 structure

```python
from onflow_location_platform import convert_address_old_to_new

unit = convert_address_old_to_new("59 Nguyễn Sỹ Sách, Phường 15, Tân Bình, Hồ Chí Minh")
print(unit.format_address())
```

### Convert a 2025 address back to legacy (`first`)

```python
from onflow_location_platform import convert_address_new_to_old

unit = convert_address_new_to_old(
    "Phường Hạnh Thông, Hồ Chí Minh",
    multi_match="first",
)
print(unit.format_address())
# Phường 1, Quận Gò Vấp, Thành Phố Hồ Chí Minh
```

### Convert a 2025 address back to legacy (`all`)

```python
from onflow_location_platform import convert_address_new_to_old

units = convert_address_new_to_old(
    "Phường Hạnh Thông, Hồ Chí Minh",
    multi_match="all",
)
for unit in units:
    if unit:
        print(unit.format_address())
# Phường 1, Quận Gò Vấp, Thành Phố Hồ Chí Minh
# Phường 3, Quận Gò Vấp, Thành Phố Hồ Chí Minh
```

### Convert a 2025 address back to legacy (`geo`)

```python
from onflow_location_platform import convert_address_new_to_old

unit = convert_address_new_to_old(
    "842 Nguyễn Kiệm, Phường Hạnh Thông, Hồ Chí Minh",
    multi_match="geo",
)
print(unit.format_address())
# 842 Nguyễn Kiệm, Phường 3, Quận Gò Vấp, Thành Phố Hồ Chí Minh
```

### Convert legacy codes to the 2025 structure

Use this when you already have the three legacy numeric primary-key codes
(province, district, ward) from a database rather than a free-text address.

```python
from onflow_location_platform import convert_address_old_to_new_by_code

result = convert_address_old_to_new_by_code(
    provide_code=38,
    district_code=399,
    ward_code=16003
)
print(result.format_address())
# Xã Hoằng Thanh, Tỉnh Thanh Hóa
```

To convert individual legacy administrative codes, use the specific functions:

```python
from onflow_location_platform import (
    convert_province_old_to_new_by_code,
    convert_district_old_to_new_by_code,
    convert_ward_old_to_new_by_code
)

# Convert a legacy province code
province = convert_province_old_to_new_by_code(38)
print(province.format_address())
# Tỉnh Thanh Hóa

# Convert a legacy district code (returns the new mapped province)
district = convert_district_old_to_new_by_code(399)
print(district.format_address())
# Tỉnh Thanh Hóa

# Convert a legacy ward code (returns the new mapped ward and province)
ward = convert_ward_old_to_new_by_code(16003)
print(ward.format_address())
# Xã Hoằng Thanh, Tỉnh Thanh Hóa
```

### Get 2025 administrative info directly using new codes

```python
from onflow_location_platform import (
    get_new_admin_unit_by_new_code,
    get_new_province_by_new_code,
    get_new_ward_by_new_code
)

# Get 2025 province info from a new province code
province = get_new_province_by_new_code("01")
print(province.province) # Thành phố Hà Nội

# Get 2025 ward info from a new ward code
ward = get_new_ward_by_new_code("00097")
print(ward.format_address()) # Phường Hồng Hà, Thành phố Hà Nội

# Get full information from both new province and ward codes
full_unit = get_new_admin_unit_by_new_code(province_code="01", ward_code="00097")
```

### Standardize administrative unit columns in pandas

```python
import pandas as pd
from onflow_location_platform.pandas import standardize_admin_unit_columns

df = pd.DataFrame(
    [
        {"province": "ha noi", "ward": "hong ha"},
        {"province": "hà nội", "ward": "ba đình"},
    ]
)

result = standardize_admin_unit_columns(
    df,
    province="province",
    ward="ward",
)

print(result)
```

### Query the bundled SQLite lookup data

```python
from onflow_location_platform.database import get_data, query

print(get_data(fields=["province", "ward"], table="admin_units", limit=5))
print(query("SELECT province, ward FROM admin_units LIMIT 5"))
```

## Public API

### `parse_address_new(address, keep_street=True, level=2)`

Parse a Vietnamese address in the 2025 34-province structure into an `AdminUnit`.

- `keep_street=True` keeps street text when enough address segments are available
- `level=1` parses province
- `level=2` parses ward and province

### `parse_address_old(address, keep_street=True, level=3)`

Parse a Vietnamese address in the legacy 63-province structure into an `AdminUnit`.

- `keep_street=True` keeps street text when enough address segments are available
- `level=1` parses province
- `level=2` parses district and province
- `level=3` parses ward, district, and province
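Conceptually, `level` controls how many trailing comma-separated segments are treated as administrative units, with anything before them kept as street text when `keep_street=True`. The sketch below illustrates only that segmentation idea; the real parser additionally normalizes aliases, abbreviations, and diacritics:

```python
def split_segments(address: str, level: int = 3):
    """Illustrative segmentation only -- not the library's parser.

    Treats the last `level` comma-separated segments as the
    administrative units and joins the rest back into street text.
    """
    parts = [p.strip() for p in address.split(",")]
    admin_units = parts[-level:]
    street = ", ".join(parts[:-level]) or None
    return street, admin_units


print(split_segments("Đường 15, Long Bình, Quận 9, Hồ Chí Minh", level=3))
# ('Đường 15', ['Long Bình', 'Quận 9', 'Hồ Chí Minh'])
```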

### `convert_address_old_to_new(address)`

Convert a legacy-format address into a normalized `AdminUnit` in the 2025 structure.

### `convert_address_old_to_new_by_code(provide_code, district_code, ward_code, address=None)`

Convert legacy administrative codes to the 2025 structure without any text parsing.
Accepts the three legacy numeric primary-key codes stored in a database.

- `provide_code` — legacy province numeric code (e.g. `79`)
- `district_code` — legacy district numeric code (e.g. `760`)
- `ward_code` — legacy ward numeric code (e.g. `26737`)
- `address` — optional raw address string attached to the returned `AdminUnit`
- both `int` and `str` inputs are accepted; leading zeros are added automatically
- raises `KeyError` if any code is not found, or `ValueError` if the codes are inconsistent (district does not belong to province, ward does not belong to district)
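The input normalization described above (`int` or `str`, with automatic zero-padding) can be sketched in plain Python. The pad widths (2, 3, and 5 digits) are guesses inferred from the sample codes in this README, not the library's internals:

```python
def pad_legacy_codes(provide_code, district_code, ward_code):
    """Normalize int/str legacy codes into zero-padded strings.

    The widths are assumptions based on this README's sample codes
    (e.g. 79, 760, 26737); the packaged implementation may differ.
    """
    return (
        str(provide_code).zfill(2),
        str(district_code).zfill(3),
        str(ward_code).zfill(5),
    )


print(pad_legacy_codes(1, 1, 97))
# ('01', '001', '00097')
```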

### `convert_province_old_to_new_by_code(pk_id)`

Resolve a single legacy province code to the 2025 province.
Returns an `AdminUnit` with only province-level fields populated.

### `convert_district_old_to_new_by_code(pk_id)`

Resolve a single legacy district code to the 2025 province it now belongs to.
Districts do not exist in the 2025 format; only the parent province is returned.

### `convert_ward_old_to_new_by_code(pk_id)`

Resolve a single legacy ward code to the 2025 ward and province.

### `get_new_admin_unit_by_new_code(province_code, ward_code, address=None)`

Get a 2025 `AdminUnit` directly from a 2025 province code and ward code.

### `get_new_province_by_new_code(province_code)`

Get a 2025 `AdminUnit` directly from a 2025 province code.

### `get_new_ward_by_new_code(ward_code, address=None)`

Get a 2025 `AdminUnit` directly from a 2025 ward code. Returns the ward and its corresponding province.

### `convert_address_new_to_old(address, multi_match="first")`

Convert a 2025-format address into legacy administrative units.

- `multi_match="first"` returns one `AdminUnit` (first candidate)
- `multi_match="all"` returns `list[AdminUnit]` (all candidates)
- `multi_match="geo"` returns one `AdminUnit` selected by nearest geocoded point, then falls back to `first` if needed
- input must explicitly contain a 2025 province keyword; otherwise an empty result is returned

### `standardize_admin_unit_columns(...)`

Standardize province, district, and ward columns in a pandas DataFrame.

### `convert_address_column(...)`

Convert a full address column and optionally attach old/new administrative attributes.
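The exact signature is not reproduced here; the general pattern the function follows — mapping a conversion over a DataFrame column — can be sketched with a hypothetical stand-in converter. `fake_convert` below is a placeholder so the sketch runs without the packaged datasets; in real use the platform's converters (e.g. `convert_address_old_to_new`) do this work:

```python
import pandas as pd


def fake_convert(address: str) -> str:
    # Placeholder for a real converter such as
    # convert_address_old_to_new(address).format_address().
    return address.title()


df = pd.DataFrame({"address": ["phường 15, tân bình, hồ chí minh"]})
df["address_new"] = df["address"].map(fake_convert)
print(df["address_new"].iloc[0])
```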

### `get_data(...)` and `query(sql)`

Read lookup data from the bundled SQLite database in [`src/onflow_location_platform/data/dataset.db`](src/onflow_location_platform/data/dataset.db).

## Data Directories

### Runtime assets

These files are bundled with the Python package and used at runtime:

- [`src/onflow_location_platform/data/parser_legacy.json`](src/onflow_location_platform/data/parser_legacy.json)
- [`src/onflow_location_platform/data/parser_from_2025.json`](src/onflow_location_platform/data/parser_from_2025.json)
- [`src/onflow_location_platform/data/converter_2025.json`](src/onflow_location_platform/data/converter_2025.json)
- [`src/onflow_location_platform/data/dataset.db`](src/onflow_location_platform/data/dataset.db)

### Workspace data

The [`data`](data) directory is used for data preparation and notebook workflows:

- [`data/alias_keywords`](data/alias_keywords): curated alias inputs used when generating parser assets
- [`data/raw`](data/raw): collected source files
- [`data/interim`](data/interim): intermediate transformation outputs
- [`data/processed`](data/processed): processed datasets for analysis and validation

The repository currently ignores `data/raw`, `data/interim`, and `data/processed`, so those folders act as workspace outputs rather than committed package contents.

## Scripts

Operational scripts and notebooks are grouped by purpose:

- [`scripts/collecting_data`](scripts/collecting_data): external data collection and scraping
- [`scripts/processing_data`](scripts/processing_data): mapping, cleaning, enrichment, and dataset building
- [`scripts/generating_module_data`](scripts/generating_module_data): generation of packaged parser and converter assets
- [`scripts/testing_package`](scripts/testing_package): smoke tests and manual validation

Example smoke test:

```bash
PYTHONPATH=. envs/bin/python scripts/testing_package/manual_parse_smoke_test.py
```

Example benchmark:

```bash
envs/bin/python scripts/testing_package/benchmark_public_api.py
```

Example collection script:

```bash
PYTHONPATH=. envs/bin/python scripts/collecting_data/scrape_sapnhap_bando_provinces_and_wards.py --date 2026-03-31 --verbose
```

Notes:

- Some scripts require the optional dependencies from `.[scripts]`
- Collection scripts require network access
- Several workflows are notebook-driven rather than packaged as command-line tools

## Benchmark

The repository includes a reproducible micro-benchmark for the public SDK API:

```bash
envs/bin/python scripts/testing_package/benchmark_public_api.py
```

Sample results from a local run on `2026-03-31` using Python `3.11.13` on `macOS-15.5-arm64`:

| API | Sample Input | Iterations / run | Best ms/op | Mean ms/op | Ops/s |
| --- | --- | ---: | ---: | ---: | ---: |
| `parse_address_new` | `"Tân Sơn Hòa, Hồ Chí Minh"` | 10,000 | 0.1181 | 0.1199 | 8469.8 |
| `parse_address_old` | `"Long Bình, Quận 9, Hồ Chí Minh"` (`level=3`) | 10,000 | 0.1003 | 0.1039 | 9965.8 |
| `convert_address_old_to_new` | `"Phường 15, Tân Bình, Hồ Chí Minh"` | 5,000 | 0.2364 | 0.2402 | 4230.0 |

These numbers are indicative and will vary by machine, Python version, and whether the tested input path requires external geocoding.

## Development Notes

- Python requirement: `>=3.7`
- Runtime dependencies: `geopy`, `pandas`, `shapely`, `tqdm`, `unidecode`
- Optional script dependencies: `beautifulsoup4`, `numpy`, `requests`, `seleniumbase`

## License

`setup.py` declares the project as MIT-licensed. A standalone `LICENSE` file is not present in the current workspace snapshot.
