Metadata-Version: 2.4
Name: humanmint
Version: 2.0.0
Summary: Clean, functional data processing for human-centric applications. Normalize and standardize names, emails, phones, departments, and job titles with a single unified API.
Author: Ricardo Nunes
License-Expression: MIT
Project-URL: Homepage, https://github.com/RicardoNunes2000/HumanMint
Project-URL: Documentation, https://github.com/RicardoNunes2000/HumanMint#documentation
Project-URL: API Reference, https://github.com/RicardoNunes2000/HumanMint/blob/main/README.md#api-reference
Project-URL: Use Cases, https://github.com/RicardoNunes2000/HumanMint/tree/main/docs/use_cases
Project-URL: Fields Guide, https://github.com/RicardoNunes2000/HumanMint/blob/main/docs/FIELDS.md
Project-URL: Bug Tracker, https://github.com/RicardoNunes2000/HumanMint/issues
Project-URL: Source Code, https://github.com/RicardoNunes2000/HumanMint
Project-URL: Changelog, https://github.com/RicardoNunes2000/HumanMint/blob/main/CHANGELOG.md
Keywords: data-processing,normalization,names,emails,phones,departments,titles,civic-data
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Operating System :: OS Independent
Classifier: Topic :: Office/Business
Classifier: Topic :: Text Processing
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: email-validator>=2.0.0
Requires-Dist: phonenumbers>=8.13.0
Requires-Dist: nameparser>=1.1.0
Requires-Dist: nicknames>=0.0.2
Requires-Dist: rapidfuzz>=3.6
Requires-Dist: orjson>=3.10
Requires-Dist: importlib_resources>=5.0; python_version < "3.9"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: black>=23.0; extra == "dev"
Requires-Dist: flake8>=6.0; extra == "dev"
Requires-Dist: mypy>=1.0; extra == "dev"
Requires-Dist: faker>=20.0; extra == "dev"
Provides-Extra: address
Requires-Dist: usaddress>=0.6.0; extra == "address"
Provides-Extra: pandas
Requires-Dist: pandas>=1.5; extra == "pandas"
Dynamic: license-file

# HumanMint v2

HumanMint cleans and normalizes messy contact data with one line of code. It standardizes names, emails, phones, addresses, departments, titles, and organizations using curated public-sector mappings you won’t find anywhere else.

```python
from humanmint import mint

result = mint(
    name="Dr. John Q. Smith, PhD",
    email="JOHN.SMITH@CITY.GOV",
    phone="(202) 555-0173 ext 456",
    department="001 - Public Works Dept",
    title="Chief of Police",
    address="123 N. Main St Apt 4B, Madison, WI 53703",
    organization="City of Madison Police Department",
)

result.name_standardized          # "John Q Smith"
result.email_standardized         # "john.smith@city.gov"
result.phone_pretty               # "+1 202-555-0173"
result.department_canonical       # "Public Works"
result.title_canonical            # "police chief"
result.address_canonical          # "123 N. Main St Apt 4B Madison WI 53703 US"

# Split multi-person names when needed
results = mint(name="John and Jane Smith", split_multi=True)
# returns [MintResult(John Smith), MintResult(Jane Smith)]
```

## Why HumanMint
- Real-world chaos: titles embedded in names, departments with numeric codes or phone extensions, inconsistently cased emails, smashed-together addresses.
- Unique data: 23K+ department variants → 64 categories; 73K+ titles mapped to curated canonical forms plus BLS titles; context-aware (department-informed) title mapping that is not available off the shelf.
- Safe defaults: length guards, optional aggressive cleaning, semantic conflict checks, bulk dedupe, and optional multi-person name splitting.

### Department & Title mapping you can’t get elsewhere
Curated public-sector mappings that solve the “impossible to Google” parts of contact normalization.
```
"City Administration"    -> "Administration"       [administration]
"Finance Department"     -> "Finance"              [finance]
"Public Works"           -> "Public Works"         [infrastructure]
"Police Department"      -> "Police"               [public safety]
```
Titles get similar treatment across 73K standardized forms with optional department context to boost accuracy.
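
To make the mapping idea concrete, here is a minimal sketch of a clean-then-lookup step. It is not HumanMint's implementation: `TOY_DEPARTMENTS` and `toy_department_lookup` are hypothetical names, and the table contains only the four examples listed above.

```python
import re

# Toy lookup built only from the four examples shown above; the real
# dataset is far larger and its internal format is not documented here.
TOY_DEPARTMENTS = {
    "city administration": ("Administration", "administration"),
    "finance": ("Finance", "finance"),
    "public works": ("Public Works", "infrastructure"),
    "police": ("Police", "public safety"),
}


def toy_department_lookup(raw):
    """Return (canonical, category) for a raw department string, or None."""
    text = raw.strip().lower()
    # Drop a leading numeric code such as "001 - ".
    text = re.sub(r"^\d+\s*[-:]\s*", "", text)
    # Drop a trailing "department", "dept", or "dept.".
    text = re.sub(r"\s+(department|dept\.?)$", "", text)
    return TOY_DEPARTMENTS.get(text)


print(toy_department_lookup("001 - Public Works Dept"))
```

An exact-match dictionary like this only works after noise is stripped; the library also depends on `rapidfuzz`, which suggests fuzzy matching plays a role for variants a plain lookup would miss.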

### All fields in one library
Names, emails, phones, addresses, departments, titles, and organizations all go through one pipeline. Most libraries cover one field; HumanMint returns the whole record with canonicalization, categorization, and confidence.

### Fast
Typical workloads run sub-millisecond per record with multithreading and built-in dedupe.
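
How dedupe and multithreading combine is not spelled out here. As a rough sketch, assuming a hypothetical per-record `normalize` function (a stand-in, not HumanMint's API), identical inputs can be normalized once and the remaining work spread over a thread pool:

```python
from concurrent.futures import ThreadPoolExecutor


def normalize(record):
    """Hypothetical stand-in for per-record cleaning work."""
    return {key: value.strip().lower() for key, value in record.items()}


def process_all(records, workers=4):
    """Normalize each distinct record once, then fan results back out."""
    # A hashable key per record lets identical inputs share one result.
    keys = [tuple(sorted(record.items())) for record in records]
    unique = dict(zip(keys, records))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so keys and results line up.
        cleaned = dict(zip(unique, pool.map(normalize, unique.values())))
    return [cleaned[key] for key in keys]


records = [{"email": " A@X.COM "}, {"email": " A@X.COM "}, {"email": "b@x.com"}]
print(process_all(records))
```

In this sketch duplicate inputs share the same result object, so a caller that mutates results would need to copy them first.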

### AI extraction (optional)
Install the optional `gliner2` package (`pip install gliner2`; it is a separate install, not a HumanMint extra) and pass `text=` with `use_gliner=True` to extract fields from unstructured text and then normalize them. Structured fields you pass always win over extracted ones. You can also pass a `GlinerConfig` (`gliner_cfg`) to control schema, threshold, and GPU usage.
GLiNER extraction is experimental and may be inaccurate; prefer structured inputs when available.
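
The "structured fields win" rule can be pictured as a simple overlay. The sketch below is illustrative only: `merge_fields` is a hypothetical helper, and treating `None` as "not provided" is an assumption rather than documented behavior.

```python
def merge_fields(extracted, structured):
    """Overlay explicitly passed fields on top of extracted ones."""
    merged = dict(extracted)
    for field, value in structured.items():
        if value is not None:  # assumption: None means "not provided"
            merged[field] = value
    return merged


extracted = {"name": "John A. Miller", "email": "jmiller@springfieldmo.gov"}
structured = {"name": "John Miller", "email": None}
merged = merge_fields(extracted, structured)
print(merged)
```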

Example (signature block → canonicalized):
```python
from humanmint import mint

text = """
John A. Miller
Deputy Director of Public Works
City of Springfield, Missouri
305 E McDaniel St, Springfield, MO 65806
Phone: (417) 864-1234
Email: jmiller@springfieldmo.gov
"""

result = mint(text=text, use_gliner=True)

# Result:
# MintResult(
#   name: John A Miller
#   email: jmiller@springfieldmo.gov
#   phone: +1 417-864-1234
#   department: Public Works
#   title:
#     raw: Deputy Director
#     normalized: Deputy Director
#     canonical: deputy director
#   address: None
#   organization: Springfield Missouri
# )
```
You can also batch texts: `mint(texts=[...], use_gliner=True)` returns a list of `MintResult` objects.

Advanced GLiNER configuration:
```python
from humanmint.gliner import GlinerConfig

cfg = GlinerConfig(
    threshold=0.85,    # optional confidence threshold
    use_gpu=True,      # move model to GPU if available
    schema=None,       # custom schema dict if desired
    extractor=None,    # reuse a preloaded GLiNER2 instance
)

result = mint(text=text, use_gliner=True, gliner_cfg=cfg)
```

## What’s new in v2 (vs v1)
- Clear, canonical property names: `name_standardized`, `email_standardized`, `phone_standardized`, `title_canonical`, `department_canonical` (legacy aliases removed).
- Explainable comparisons: `compare(..., explain=True)` returns the score together with an explanation of component scores and penalties.
- Multi-person name splitting: `split_multi=True` handles “John and Jane Smith”.
- Name enrichment: detects nicknames and generational suffixes without polluting the main name fields.
- Optional GLiNER extraction for unstructured text via `use_gliner=True` and `GlinerConfig`; multi-person GLiNER input raises a clear error.
- Structured-field pipeline remains deterministic and fast; GLiNER is opt-in and experimental.
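
To illustrate what multi-person splitting involves, here is a naive sketch for the shared-surname case. `split_shared_surname` is a hypothetical helper and not the library's algorithm, which presumably handles many more patterns.

```python
import re


def split_shared_surname(name):
    """Split 'First and First Last' into one full name per person."""
    cleaned = name.strip()
    # Require whitespace around the connector so names like "Sandy" survive.
    parts = re.split(r"\s+(?:and|&)\s+", cleaned, flags=re.IGNORECASE)
    if len(parts) != 2:
        return [cleaned]
    first, second = parts
    # A lone given name on the left borrows the surname from the right.
    if len(first.split()) == 1 and len(second.split()) >= 2:
        first = f"{first} {second.split()[-1]}"
    return [first, second]


print(split_shared_surname("John and Jane Smith"))
```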

## Installation
```bash
pip install humanmint
```

## Quickstart
```python
from humanmint import mint, compare, bulk

r1 = mint(name="Jane Doe", email="jane.doe@city.gov", department="Public Works", title="Engineer")
r2 = mint(name="J. Doe",  email="JANE.DOE@CITY.GOV", department="PW Dept",       title="Public Works Engineer")

score = compare(r1, r2)  # similarity 0–100
# Or with explanation:
score, why = compare(r1, r2, explain=True)
print("\n".join(why))

records = [
    {"name": "Alice", "email": "alice@example.com"},
    {"name": "Bob",   "email": "bob@example.com"},
]
results = bulk(records, workers=4)
```

## Access Patterns
- Dict access: `result.title["canonical"]`, `result.department["canonical"]`, `result.department["category"]`
- Properties (preferred): `name_standardized`, `title_canonical`, `department_canonical`, `email_standardized`, `phone_standardized`, `address_canonical`, `organization_canonical`
- Full dicts: `result.title`, `result.department`, `result.email`, etc.

## Recommended Properties

**Names**
- `name_standardized`, `name_first`, `name_last`, `name_middle`, `name_suffix`, `name_suffix_type`, `name_gender`, `name_nickname`

**Emails**
- `email_standardized`, `email_domain`, `email_is_valid`, `email_is_generic_inbox`, `email_is_free_provider`

**Phones**
- `phone_standardized`, `phone_e164`, `phone_pretty`, `phone_extension`, `phone_is_valid`, `phone_type`

**Departments**
- `department_canonical`, `department_category`, `department_normalized`, `department_override`

**Titles**
- `title_canonical`, `title_raw`, `title_normalized`, `title_is_valid`, `title_confidence`, `title_seniority`

**Addresses**
- `address_canonical`, `address_raw`, `address_street`, `address_unit`, `address_city`, `address_state`, `address_zip`, `address_country`

**Organizations**
- `organization_raw`, `organization_normalized`, `organization_canonical`, `organization_confidence`

Use `result.get("email.is_valid")` or other dot paths to fetch nested dict values.
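
For readers unfamiliar with dot paths, the lookup can be sketched over plain nested dicts. `get_path` is a hypothetical helper written for illustration; the sample record reuses only keys mentioned in this README and is not real output.

```python
def get_path(data, path, default=None):
    """Walk a nested dict using a dot-separated path."""
    current = data
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


record = {
    "email": {"is_valid": True},
    "department": {"canonical": "Public Works", "category": "infrastructure"},
}
print(get_path(record, "email.is_valid"))
print(get_path(record, "department.category"))
```

Returning a default instead of raising keeps lookups safe when a field was never supplied.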

## Comparing Records
```python
from humanmint import compare
score = compare(r1, r2)  # 0–100
# >85 likely duplicate, >70 similar, <50 different
```
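
The thresholds in that comment can be wrapped in a small helper. This is a sketch: `label_score` is a hypothetical name, and the "uncertain" label for scores between 50 and 70 is an assumption, since the guideline above does not cover that range.

```python
def label_score(score):
    """Map a 0-100 similarity score to the rough labels suggested above."""
    if score > 85:
        return "likely duplicate"
    if score > 70:
        return "similar"
    if score < 50:
        return "different"
    return "uncertain"  # 50-70 is not covered by the guideline above


for value in (92, 75, 60, 30):
    print(value, label_score(value))
```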

## Batch & Export
```python
from humanmint import bulk, export_json, export_csv, export_parquet, export_sql

results = bulk(records, workers=4, progress=True)
export_json(results, "out.json")
export_csv(results, "out.csv", flatten=True)
```

## CLI
```bash
humanmint clean input.csv output.csv --name-col name --email-col email --phone-col phone --dept-col department --title-col title
```

## Performance (benchmark)
| Dataset | Time | Per Record | Throughput |
|---------|------|-----------|------------|
| 1,000   | 561 ms | 0.56 ms | 1,783 rec/sec |
| 10,000  | 3.1 s  | 0.31 ms | 3,178 rec/sec |
| 50,000  | 14.0 s | 0.28 ms | 3,576 rec/sec |

## Notes
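- US-focused address parsing; `usaddress` is used when available (install it with the `address` extra: `pip install "humanmint[address]"`), otherwise heuristics.
- Optional dependencies (pandas, pyarrow, sqlalchemy, rich, tqdm) enhance exports and progress bars; pandas is also available through the `pandas` extra.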
- Department and title datasets are curated and updated regularly for best accuracy.
