Metadata-Version: 2.4
Name: lamisema
Version: 1.1.0
Summary: Structured Information Extraction for Nepali and multilingual PDFs
Project-URL: Homepage, https://lamisema.readthedocs.io
Project-URL: Repository, https://github.com/sanjiblamichhane/lamisema
Project-URL: Issues, https://github.com/sanjiblamichhane/lamisema/issues
Project-URL: Changelog, https://github.com/sanjiblamichhane/lamisema/blob/master/CHANGELOG.md
Project-URL: Documentation, https://lamisema.readthedocs.io
Author-email: Sanjib Lamichhane <lamich2ane@gmail.com>
License: MIT License
        
        Copyright (c) 2026 Sanjib Lamichhane
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Keywords: bikram-sambat,devanagari,nepal,nepali,nlp,ocr,pdf
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: Nepali
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: fastapi>=0.115.0
Requires-Dist: pdfplumber>=0.11.0
Requires-Dist: pillow>=10.0.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: pymupdf>=1.24.0
Requires-Dist: pytesseract>=0.3.10
Requires-Dist: python-multipart>=0.0.9
Requires-Dist: uvicorn[standard]>=0.30.0
Provides-Extra: all
Requires-Dist: boto3>=1.34.0; extra == 'all'
Requires-Dist: easyocr>=1.7.0; extra == 'all'
Requires-Dist: opencv-python-headless; extra == 'all'
Provides-Extra: dev
Requires-Dist: httpx>=0.27.0; extra == 'dev'
Requires-Dist: mkdocs-material>=9.4.0; extra == 'dev'
Requires-Dist: mkdocs>=1.5.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23.0; extra == 'dev'
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Requires-Dist: ruff>=0.4.0; extra == 'dev'
Provides-Extra: easyocr
Requires-Dist: easyocr>=1.7.0; extra == 'easyocr'
Requires-Dist: opencv-python-headless; extra == 'easyocr'
Provides-Extra: s3
Requires-Dist: boto3>=1.34.0; extra == 's3'
Provides-Extra: tesseract
Description-Content-Type: text/markdown

# LamiSema

**Structured information extraction for Nepali PDFs.**

[![PyPI](https://img.shields.io/pypi/v/lamisema.svg)](https://pypi.org/project/lamisema/)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/)
[![CI](https://github.com/sanjiblamichhane/lamisema/actions/workflows/ci.yml/badge.svg)](https://github.com/sanjiblamichhane/lamisema/actions)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)

---

## The problem

Standard PDF text-extraction tools silently return wrong output on many Nepali PDFs. Nepali PDFs fall into three types, and each needs a different approach:

| PDF type | Example source | Standard tool result |
|---|---|---|
| **Unicode-native** | Modern government portals | ✅ Works fine |
| **Legacy-encoded** | Pre-2010 docs using Preeti/Kantipur font | ❌ Returns garbage (`g]kfn` instead of `नेपाल`) |
| **Scanned** | Physical forms, old records | ❌ Returns empty string |

LamiSema detects the type first, then routes the document to the matching extraction strategy.

---

## How it works

```mermaid
flowchart LR
    PDF[PDF Input] --> P[Pre-flight\nDetect encoding type]
    P -->|unicode_native| T[Text layer\npdfplumber]
    P -->|legacy_encoded| O[OCR\nTesseract nep+eng]
    P -->|scanned| O
    T --> N[NER + Date\nNormalization]
    O --> N
    N --> J[Structured JSON\nwith confidence scores]
```
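
The routing in the diagram can be sketched in a few lines of Python. This is an illustration under stated assumptions, not LamiSema's actual detector: the function names, the 0.3 threshold, and the idea of classifying by the share of Devanagari characters in the extracted text layer are all hypothetical.

```python
# Illustrative sketch only: names and thresholds are assumptions, not LamiSema's API.

DEVANAGARI_START, DEVANAGARI_END = 0x0900, 0x097F  # Unicode Devanagari block


def devanagari_ratio(text: str) -> float:
    """Share of non-whitespace characters that fall in the Devanagari block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(DEVANAGARI_START <= ord(c) <= DEVANAGARI_END for c in chars)
    return hits / len(chars)


def classify_text_layer(text: str, threshold: float = 0.3) -> str:
    """Guess the PDF type from the text a standard extractor returned."""
    if not text.strip():
        return "scanned"          # no text layer at all
    if devanagari_ratio(text) >= threshold:
        return "unicode_native"   # real Devanagari code points
    return "legacy_encoded"       # text exists, but as Latin-range glyph codes


def choose_strategy(pdf_type: str) -> str:
    """Mirror the flowchart: only Unicode-native PDFs use the text layer."""
    return "text_layer" if pdf_type == "unicode_native" else "ocr"
```

This heuristic would mislabel an English-only PDF as legacy-encoded, so a character ratio alone is unlikely to be enough in practice.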

---

## Install

```bash
pip install lamisema
```

**System dependency:** Tesseract with the Nepali language pack, which the OCR path for legacy-encoded and scanned PDFs relies on:

```bash
# macOS
brew install tesseract tesseract-lang

# Ubuntu / Debian
sudo apt-get install tesseract-ocr tesseract-ocr-nep
```
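
To confirm that the language packs are visible to Tesseract, a small check along these lines may help. It is a sketch that shells out to `tesseract --list-langs`; the helper names `parse_langs` and `installed_langs` are made up for this example and are not part of LamiSema.

```python
# Illustrative sketch only: helper names are hypothetical.
import shutil
import subprocess


def parse_langs(output: str) -> set[str]:
    """Extract language codes from `tesseract --list-langs` output."""
    langs = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blanks and the "List of available languages ..." header line.
        if not line or " " in line:
            continue
        langs.add(line)
    return langs


def installed_langs() -> set[str]:
    """Return installed Tesseract languages, or an empty set if the binary is missing."""
    if shutil.which("tesseract") is None:
        return set()
    proc = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=False
    )
    # Read both streams in case the list is printed to stderr.
    return parse_langs(proc.stdout + "\n" + proc.stderr)


if __name__ == "__main__":
    missing = {"nep", "eng"} - installed_langs()
    print("OK" if not missing else f"Missing language packs: {sorted(missing)}")
```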

---

## Python usage

```python
from lamisema import LamiSema

pipeline = LamiSema()

with open("report.pdf", "rb") as f:
    result = pipeline.extract(f.read(), filename="report.pdf")

print(result.encoding_type)        # "legacy_encoded"
print(result.overall_confidence)   # 0.74
print(result.pages[0].entities)    # [Entity(type="DATE_BS", text="२०८१ साल असार १५", ...)]
```
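
The `DATE_BS` entity above holds a Bikram Sambat date written in Devanagari. As a rough illustration of what normalizing such a string involves, the sketch below transliterates the digits and maps the month name to its number. The function name, the month spellings, and the returned tuple are assumptions for illustration; LamiSema's own normalizer may behave differently, and converting BS to Gregorian dates also needs a calendar lookup table that is not shown here.

```python
# Illustrative sketch only: not LamiSema's normalizer.
from __future__ import annotations

import re

# Devanagari digits mapped to ASCII 0-9.
DIGITS = str.maketrans("०१२३४५६७८९", "0123456789")

# Common spellings of the Bikram Sambat months (assumed, not exhaustive).
BS_MONTHS = {
    "बैशाख": 1, "वैशाख": 1, "जेठ": 2, "असार": 3, "साउन": 4,
    "भदौ": 5, "असोज": 6, "कार्तिक": 7, "कात्तिक": 7, "मंसिर": 8,
    "पुस": 9, "पुष": 9, "माघ": 10, "फागुन": 11, "चैत": 12, "चैत्र": 12,
}


def parse_bs_date(text: str) -> tuple[int, int, int] | None:
    """Parse 'YYYY साल <month> DD' into (year, month, day) in the BS calendar."""
    ascii_text = text.translate(DIGITS)
    match = re.search(r"(\d{4})\s+(?:साल\s+)?(\S+)\s+(\d{1,2})", ascii_text)
    if match is None:
        return None
    year, month_name, day = match.groups()
    month = BS_MONTHS.get(month_name)
    if month is None:
        return None  # unknown month spelling
    return int(year), month, int(day)
```

For the entity text shown above, `parse_bs_date("२०८१ साल असार १५")` returns `(2081, 3, 15)`.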

---

## REST API

Start the server:

```bash
lamisema serve
# → http://localhost:9001/docs
```

```bash
# 1. Upload
curl -X POST http://localhost:9001/upload -F "file=@report.pdf"
# → { "doc_id": "DOC-A1B2C3D4" }

# 2. Detect encoding (fast, no extraction)
curl http://localhost:9001/preflight/DOC-A1B2C3D4

# 3. Extract everything
curl -X POST http://localhost:9001/extract/DOC-A1B2C3D4

# 4. Get result
curl http://localhost:9001/result/DOC-A1B2C3D4
```

| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/` | Health check |
| `POST` | `/upload` | Upload a PDF, get `doc_id` |
| `GET` | `/preflight/{doc_id}` | Encoding type + font analysis |
| `POST` | `/extract/{doc_id}` | Full extraction + NER |
| `GET` | `/result/{doc_id}` | Retrieve completed result |
| `POST` | `/normalize-dates` | Normalize BS dates in raw text |
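
The same four-step workflow can be driven from Python. The sketch below uses only the standard library and the endpoint paths from the table; the helper names, the timeout, and the assumption that `/extract` returns only after extraction has finished are mine. If extraction runs in the background, you would need to poll `/result` instead.

```python
# Illustrative sketch only: endpoint paths come from the table above;
# helper names, timeout, and response handling are assumptions.
from __future__ import annotations

import json
import os
import urllib.request
import uuid

BASE_URL = "http://localhost:9001"


def multipart_body(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def call(method: str, path: str, body: bytes | None = None,
         content_type: str | None = None) -> dict:
    """Send one request and decode the JSON response."""
    request = urllib.request.Request(BASE_URL + path, data=body, method=method)
    if content_type:
        request.add_header("Content-Type", content_type)
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)


def process(pdf_path: str) -> dict:
    """Upload, pre-flight, extract, then fetch the result for one PDF."""
    with open(pdf_path, "rb") as f:
        body, content_type = multipart_body(
            "file", os.path.basename(pdf_path), f.read()
        )
    doc_id = call("POST", "/upload", body, content_type)["doc_id"]
    call("GET", f"/preflight/{doc_id}")   # encoding type, fast
    call("POST", f"/extract/{doc_id}")    # full extraction
    return call("GET", f"/result/{doc_id}")
```

With the server running, `process("report.pdf")` would return the parsed JSON from `/result/{doc_id}`.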

---

## Try the demo app

A full-stack demo (Next.js frontend + API + MinIO) is in [`application-demo/`](application-demo/).

```bash
cd application-demo
cp .env.example .env
docker compose -f docker-compose.local.yaml up --build
# → http://localhost:3000
```

---

## Docs

Full documentation at **[lamisema.readthedocs.io](https://lamisema.readthedocs.io)**

- [Installation](docs/installation.md)
- [Python API](docs/python-api.md)
- [REST API Reference](docs/api-reference.md)
- [Encoding types explained](docs/encoding-types.md)
- [Contributing](docs/contributing.md)

---

## License

MIT — see [LICENSE](LICENSE)
