Metadata-Version: 2.4
Name: affiliation-regex-parser
Version: 0.1.0
Summary: Regex-based parser for structuring academic author affiliation strings.
Author-email: Ehsan Bitaraf <ehsan.bitaraf@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/EhsanBitaraf/affiliation-regex-parser
Project-URL: Repository, https://github.com/EhsanBitaraf/affiliation-regex-parser
Project-URL: Issues, https://github.com/EhsanBitaraf/affiliation-regex-parser/issues
Keywords: affiliation,parser,regex,nlp,metadata,academia
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Operating System :: OS Independent
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: flashtext>=2.7
Provides-Extra: excel
Requires-Dist: pandas>=2.0; extra == "excel"
Requires-Dist: openpyxl>=3.1; extra == "excel"
Dynamic: license-file


# Affiliation Regex Parser

A lightweight, regex-based Python library for parsing academic author affiliation strings into structured fields (departments, institutes, universities, organizations, cities, countries, emails, postcodes, and unknown segments).

Examples and usage patterns are documented in `cookbook.ipynb`.

## Install

Core install:

```bash
pip install affiliation-regex-parser
```

Optional Excel support (for `worldcities.xlsx`-based city→country inference):

```bash
pip install "affiliation-regex-parser[excel]"
```

## Features

* Regex-first, deterministic parsing (no ML dependencies).
* Pluggable architecture via providers (e.g., custom city lists, custom inference).
* Optional city→country inference using a world cities dataset.
* Designed for batch parsing (reuse one parser instance).
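
The ideas above can be illustrated with a small, self-contained sketch using only the standard library. This is not the package's API: `SketchParser`, `CityProvider`, `DictCityProvider`, and the three patterns are hypothetical names invented for illustration, and the real `AffiliationRegexParser` may be structured quite differently. The sketch only shows the general shape of regex-first parsing, a swappable city provider, city→country inference, and reuse of one parser instance across a batch.

```python
import re
from typing import Optional, Protocol

# Hypothetical patterns for illustration; the library's own patterns may differ.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
DEPARTMENT_RE = re.compile(r"\b(?:Department|Dept\.?)\s+of\b", re.IGNORECASE)
UNIVERSITY_RE = re.compile(r"\bUniversity\b", re.IGNORECASE)


class CityProvider(Protocol):
    """Hypothetical provider interface: maps a city name to a country."""

    def country_for(self, city: str) -> Optional[str]: ...


class DictCityProvider:
    """In-memory provider; a real one might load a world cities dataset."""

    def __init__(self, mapping: dict[str, str]) -> None:
        self._mapping = {city.casefold(): country for city, country in mapping.items()}

    def country_for(self, city: str) -> Optional[str]:
        return self._mapping.get(city.casefold())


class SketchParser:
    """Deterministic parser: same input and provider always give the same output."""

    def __init__(self, cities: CityProvider) -> None:
        self._cities = cities

    def parse(self, text: str) -> dict[str, list[str]]:
        fields = ("emails", "departments", "universities", "cities", "countries", "unknown")
        result: dict[str, list[str]] = {name: [] for name in fields}

        # Pull emails out first so they do not pollute the comma-separated segments.
        result["emails"] = EMAIL_RE.findall(text)
        remainder = EMAIL_RE.sub("", text)

        for raw in remainder.split(","):
            segment = raw.strip(" .;")
            if not segment:
                continue
            if DEPARTMENT_RE.search(segment):
                result["departments"].append(segment)
            elif UNIVERSITY_RE.search(segment):
                result["universities"].append(segment)
            else:
                country = self._cities.country_for(segment)
                if country is None:
                    result["unknown"].append(segment)
                else:
                    # City recognised by the provider: infer its country.
                    result["cities"].append(segment)
                    if country not in result["countries"]:
                        result["countries"].append(country)
        return result


# Build the parser once and reuse it for the whole batch.
parser = SketchParser(DictCityProvider({"Tehran": "Iran"}))
affiliations = [
    "Department of Biomedical Informatics, Example University, Tehran. jane.doe@example.org",
    "Dept. of Physics, Sample University, Atlantis",
]
records = [parser.parse(text) for text in affiliations]
```

Because the provider is passed in rather than hard-coded, a different city source can be substituted without touching the parsing logic, and segments that match nothing end up in `unknown` instead of being silently dropped.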

## Data attribution (worldcities.xlsx)

This project can use a `worldcities.xlsx` dataset sourced from SimpleMaps World Cities (free version), licensed under **Creative Commons Attribution 4.0 (CC BY 4.0)**.

Source:

```text
https://simplemaps.com/data/world-cities
```

If you redistribute the dataset with this package, ensure attribution is preserved per CC BY 4.0.

## Documentation

* See `cookbook.ipynb` for practical examples and common configurations.
* See the docstrings in `AffiliationRegexParser` and provider classes for API details.

## License

MIT License. See `LICENSE`.

## Project status

This package is under active development. The public API aims to remain stable, but the parsed output for a given affiliation string may change between releases as patterns and fixtures evolve.


