Metadata-Version: 2.4
Name: srilanka-epi
Version: 2.0.5
Summary: Live download and parsing of Sri Lanka Weekly Epidemiological Reports (WER) — dengue surveillance data by RDHS district
Author-email: "R.B.H.G. Chathura Kavindu Bandara Weerakoon" <chathurakavindu15@gmail.com>
Maintainer-email: "R.B.H.G. Chathura Kavindu Bandara Weerakoon" <chathurakavindu15@gmail.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/chathurakavinduweerakoon/srilanka-epi
Project-URL: Repository, https://github.com/chathurakavinduweerakoon/srilanka-epi
Project-URL: Bug Tracker, https://github.com/chathurakavinduweerakoon/srilanka-epi/issues
Project-URL: Documentation, https://github.com/chathurakavinduweerakoon/srilanka-epi#readme
Project-URL: Changelog, https://github.com/chathurakavinduweerakoon/srilanka-epi/blob/main/CHANGELOG.md
Keywords: dengue,epidemiology,sri-lanka,public-health,WER,disease-surveillance,data-extraction,pdf-parsing,open-data,ministry-of-health
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Healthcare Industry
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Scientific/Engineering :: Medical Science Apps.
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.31.0
Requires-Dist: pdfplumber>=0.10.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: beautifulsoup4>=4.12.0
Provides-Extra: ocr
Requires-Dist: pdf2image>=1.16.0; extra == "ocr"
Requires-Dist: pytesseract>=0.3.10; extra == "ocr"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: responses>=0.23.0; extra == "dev"
Requires-Dist: twine>=4.0; extra == "dev"
Requires-Dist: build>=1.0; extra == "dev"
Provides-Extra: all
Requires-Dist: srilanka-epi[dev,ocr]; extra == "all"
Dynamic: license-file

# srilanka-epi

[![PyPI version](https://badge.fury.io/py/srilanka-epi.svg)](https://badge.fury.io/py/srilanka-epi)
[![Python](https://img.shields.io/pypi/pyversions/srilanka-epi)](https://pypi.org/project/srilanka-epi/)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)
[![CI](https://github.com/chathurakavinduweerakoon/srilanka-epi/actions/workflows/ci.yml/badge.svg)](https://github.com/chathurakavinduweerakoon/srilanka-epi/actions/workflows/ci.yml)

**Production-grade Python library to live-download and parse Weekly Epidemiological Reports (WER) published by the Epidemiology Unit, Ministry of Health & Indigenous Medicine, Sri Lanka.**

> Built as part of **DengueSense LK** — Final Year Development Project, ICBT Campus / Cardiff Metropolitan University (2026).

---

## Features

| Feature | Details |
|---|---|
| 🔴 **Live data** | Auto-scrapes epid.gov.lk — no URL registration needed |
| 📄 **PDF parsing** | pdfplumber word-coordinate parser |
| 🔍 **OCR fallback** | pdf2image + Tesseract for corrupted-font PDFs |
| 🗺️ **Province mapping** | All 9 provinces → 26 RDHS districts |
| 🧹 **Auto PDF cleanup** | PDFs deleted after parsing by default — only the CSV stays |
| 🐍 **Typed** | Full PEP 484 type hints + `py.typed` marker |
| ✅ **Tested** | pytest suite with 40+ unit tests |
| 📦 **PyPI-ready** | `pip install srilanka-epi` |

---

## Installation

```bash
pip install srilanka-epi
```

**With OCR fallback** (for corrupted-font PDFs):

```bash
pip install "srilanka-epi[ocr]"
```

> **OCR also requires two system binaries that `pip` does not install:**
> - [Tesseract-OCR binary](https://github.com/UB-Mannheim/tesseract/wiki) (Windows)
> - [Poppler](https://github.com/oschwartz10612/poppler-windows) (Windows) — for pdf2image

---

## Quick Start

### 🔴 Live Data — Get the Latest WER

```python
import srilanka_epi

# One-shot: scrape epid.gov.lk, download PDF, parse dengue data
result = srilanka_epi.get_latest_wer()

print(result["metadata"])
# {'vol': 53, 'no': 18, 'week_no': 18, 'year': 2026, 'week_range': '29th April – 5th May 2026'}

print(result["dengue"])
#           rdhs  week_cases  cumulative_cases  week_no  year  vol  wer_no
# 0      Colombo         142              4521       18  2026   53      18
# 1      Gampaha          89              3102       18  2026   53      18
# ...
```

### 🔴 Live Data — Get a Specific Issue

```python
result = srilanka_epi.get_wer(volume=53, number=17)
df = result["dengue"]
print(df.head())
```

### 🔴 Live Data — Full-Year Time Series

```python
# Download all 2026 issues and build a time series
df = srilanka_epi.get_dengue_timeseries(volume=53)

# Province breakdown
df = srilanka_epi.add_province(df)
print(df.groupby("province")["week_cases"].sum().sort_values(ascending=False))

# Export
df.to_csv("dengue_2026.csv", index=False)
```
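
The province roll-up above can also be sketched in plain Python, which may help when checking results by hand. This is only an illustration: the mapping is a hand-written subset based on the `WESTERN_PROVINCE` and `NORTHERN_PROVINCE` constants documented in this README, the province labels (`"Western"`, `"Northern"`) are assumed, and the counts are made up rather than real surveillance data.

```python
from collections import defaultdict

# Hypothetical subset of a district -> province mapping (labels are assumed).
SAMPLE_PROVINCE_MAP = {
    "Colombo": "Western",
    "Gampaha": "Western",
    "Kalutara": "Western",
    "Jaffna": "Northern",
}

# Made-up rows shaped like the dengue table: (rdhs, week_cases).
sample_rows = [("Colombo", 10), ("Gampaha", 5), ("Kalutara", 3), ("Jaffna", 4)]


def cases_by_province(rows, province_map):
    """Sum weekly cases per province, largest total first."""
    totals = defaultdict(int)
    for district, week_cases in rows:
        totals[province_map.get(district, "Unknown")] += week_cases
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


print(cases_by_province(sample_rows, SAMPLE_PROVINCE_MAP))
```

Districts missing from the mapping fall into an `"Unknown"` bucket here, so gaps in the mapping stay visible instead of being silently dropped.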

### 📋 List All Available Issues

```python
issues = srilanka_epi.list_available_wers()
for (vol, no), url in issues.items():
    print(f"Vol {vol} No {no:02d}: {url}")
```
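
Because the keys are `(vol, no)` tuples, ordinary tuple ordering is enough to pick out the newest issue or filter by volume. The sketch below uses a stand-in dictionary with placeholder URLs; `latest_issue` and `issues_in_volume` are hypothetical helper names, not part of the library.

```python
# Stand-in for the mapping returned by list_available_wers();
# the URLs are placeholders, not real epid.gov.lk links.
sample_issues = {
    (52, 51): "https://example.org/wer-52-51.pdf",
    (53, 2): "https://example.org/wer-53-02.pdf",
    (53, 17): "https://example.org/wer-53-17.pdf",
}


def latest_issue(issues):
    """Return ((vol, no), url) for the highest (vol, no) key."""
    key = max(issues)  # tuples compare by volume first, then number
    return key, issues[key]


def issues_in_volume(issues, volume):
    """Return the issue numbers available for one volume, in ascending order."""
    return sorted(no for (vol, no) in issues if vol == volume)


print(latest_issue(sample_issues))
print(issues_in_volume(sample_issues, 53))
```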

---

## Parsing a Local PDF

```python
from srilanka_epi import parse_wer_pdf

# From file path
result = parse_wer_pdf("WER_Vol53_No18.pdf")

# From bytes
with open("WER.pdf", "rb") as f:
    result = parse_wer_pdf(f.read(), volume=53, number=18)

print(result["metadata"])  # {'vol': 53, 'no': 18, 'week_no': 18, 'year': 2026, ...}
print(result["dengue"])    # pd.DataFrame
```
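
When passing raw bytes, the volume and number are supplied explicitly, as in the example above. If your files follow the `WER_Vol53_No18.pdf` naming used in that example, a small helper along these lines could recover both values from the file name. `volume_number_from_name` is a hypothetical name, and the naming convention is an assumption rather than something the library enforces.

```python
import re
from pathlib import Path

# Matches names like "WER_Vol53_No18.pdf" (convention assumed from the example above).
_WER_NAME = re.compile(r"Vol(\d+)_No(\d+)", re.IGNORECASE)


def volume_number_from_name(path):
    """Return (volume, number) parsed from a file name, or None if it does not match."""
    match = _WER_NAME.search(Path(path).name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(volume_number_from_name("downloads/WER_Vol53_No18.pdf"))
print(volume_number_from_name("WER.pdf"))
```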

### Hybrid Parser (pdfplumber + OCR fallback)

Some WER PDFs have corrupted font encoding. Use `extract_dengue_data()` for automatic OCR fallback:

```python
from srilanka_epi import extract_dengue_data

result = extract_dengue_data("WER_Vol53_No18.pdf")
print(result["method"])  # "pdfplumber" or "ocr"
print(result["dengue"])
```
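
The general shape of a primary-then-fallback strategy can be sketched as follows. This is not the library's implementation: the two parsers are stand-in functions rather than pdfplumber or Tesseract, and the failure signal (an exception or an empty result) is an assumption. The library may detect corrupted fonts differently.

```python
def parse_with_fallback(source, primary, fallback):
    """Try the primary parser; use the fallback if it raises or returns no rows."""
    try:
        rows = primary(source)
    except ValueError:
        rows = []
    if rows:
        return {"method": "primary", "rows": rows}
    return {"method": "fallback", "rows": fallback(source)}


# Stand-in parsers for illustration only.
def text_parser(source):
    if "garbled" in source:
        raise ValueError("unreadable text layer")
    return [("Colombo", 1)]


def ocr_parser(source):
    return [("Colombo", 1)]


print(parse_with_fallback("clean.pdf", text_parser, ocr_parser)["method"])
print(parse_with_fallback("garbled.pdf", text_parser, ocr_parser)["method"])
```

Recording which path produced the rows, as the `method` key does, makes it easier to audit OCR-derived figures separately.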

---

## Full API Reference

### Live Data Functions

| Function | Description |
|---|---|
| `get_latest_wer(...)` | Scrape, download, and parse the most recent WER |
| `get_wer(volume, number, ...)` | Download and parse a specific WER issue |
| `get_dengue_timeseries(volume, ...)` | Full-year dengue time series |
| `list_available_wers(refresh=False)` | List all issues on epid.gov.lk |

### PDF Parsing

| Function | Description |
|---|---|
| `parse_wer_pdf(source, ...)` | Fast pdfplumber parser |
| `extract_dengue_data(source, ...)` | Hybrid parser with OCR fallback |

### Dengue Helpers

| Function | Description |
|---|---|
| `add_province(df)` | Add province column to dengue DataFrame |
| `weekly_national_total(df)` | Aggregate to national weekly totals |
| `top_districts(df, week_no, year, n)` | Top N districts by weekly cases |

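As a rough illustration of what the aggregation helpers compute, the sketch below reproduces a national weekly total and a top-N ranking in plain Python. The function names, the row layout, and the counts are all hypothetical, and the library's own pandas-based helpers may behave differently in detail.

```python
from collections import defaultdict

# Made-up rows: (rdhs, week_no, year, week_cases).
sample_rows = [
    ("Colombo", 17, 2026, 7),
    ("Gampaha", 17, 2026, 4),
    ("Colombo", 18, 2026, 9),
    ("Gampaha", 18, 2026, 2),
    ("Kandy", 18, 2026, 5),
]


def national_totals(rows):
    """Sum week_cases across districts for each (year, week_no)."""
    totals = defaultdict(int)
    for _district, week_no, year, cases in rows:
        totals[(year, week_no)] += cases
    return dict(sorted(totals.items()))


def top_n_districts(rows, week_no, year, n):
    """Return the n districts with the most cases in one week."""
    week = [(d, c) for d, w, y, c in rows if w == week_no and y == year]
    return sorted(week, key=lambda item: item[1], reverse=True)[:n]


print(national_totals(sample_rows))
print(top_n_districts(sample_rows, week_no=18, year=2026, n=2))
```

If the input also carries a national total row, that row would need to be excluded before summing to avoid double counting.
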
### Download Utilities

| Function | Description |
|---|---|
| `download_wer_pdf(volume, number, ...)` | Download one PDF (live URL) |
| `download_range(volume, start_no, end_no, ...)` | Batch download and parse range of issues |
| `build_dengue_timeseries(results, district)` | Combine parsed results |
| `to_csv(df, path)` | Export to CSV |

### Scraper Primitives

| Function | Description |
|---|---|
| `scrape_wer_index(url, ...)` | Fetch + parse the WER index page |
| `fetch_wer_index(url, ...)` | Download raw HTML of WER index |
| `parse_wer_links(html)` | Extract PDF URLs from HTML |
| `get_latest_wer_url(...)` | Return `((vol, no), url)` for the latest issue |

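To show the kind of work a link extractor does, here is a minimal standard-library sketch that collects `.pdf` links from an HTML string. It is not how `parse_wer_links` is implemented; the class name, the sample markup, and the `example.org` URLs are all hypothetical, and the real index page may be structured differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkCollector(HTMLParser):
    """Collect absolute URLs of <a href="..."> targets that end in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".pdf"):
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, href))


# Hypothetical HTML snippet; the real index page markup may differ.
sample_html = """
<a href="/storage/wer/vol_53_no_17.pdf">No 17</a>
<a href="/about">About</a>
<a href="https://example.org/vol_53_no_18.PDF">No 18</a>
"""

collector = PdfLinkCollector("https://example.org")
collector.feed(sample_html)
print(collector.links)
```
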
---

## District & Province Constants

```python
import srilanka_epi

# All 26 RDHS districts (+ SRILANKA national total)
print(srilanka_epi.DISTRICTS)

# District → province lookup, plus per-province district lists
print(srilanka_epi.PROVINCE_MAP)          # {district: province}
print(srilanka_epi.WESTERN_PROVINCE)     # ['Colombo', 'Gampaha', 'Kalutara']
print(srilanka_epi.NORTHERN_PROVINCE)    # ['Jaffna', 'Kilinochchi', 'Mannar', ...]
```

---

## Logging

The library uses Python's standard `logging` module. To see its log messages at `INFO` level and above, configure logging before calling it:

```python
import logging
logging.basicConfig(level=logging.INFO)

import srilanka_epi
result = srilanka_epi.get_latest_wer()
```
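
To raise verbosity for this package only, one option is to set the level on its logger instead of the root logger. The logger name `"srilanka_epi"` is an assumption based on the common `logging.getLogger(__name__)` convention, and `"srilanka_epi.scraper"` is a hypothetical submodule name used only to show level inheritance.

```python
import logging

# Keep other libraries at WARNING on the root logger.
logging.basicConfig(level=logging.WARNING, format="%(name)s %(levelname)s %(message)s")

# Raise verbosity only for this package (logger name is assumed, see above).
package_logger = logging.getLogger("srilanka_epi")
package_logger.setLevel(logging.DEBUG)

# Child loggers without their own level inherit the package logger's level.
child = logging.getLogger("srilanka_epi.scraper")  # hypothetical submodule name
print(child.getEffectiveLevel() == logging.DEBUG)
```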

---

## Data Source

All data is sourced from the **Epidemiology Unit, Ministry of Health & Indigenous Medicine, Sri Lanka**.

- Website: https://www.epid.gov.lk
- WER Index: https://www.epid.gov.lk/weekly-epidemiological-report/weekly-epidemiological-report

**Please acknowledge the Epidemiology Unit as the primary data source in any publications.**

---

## Contributing

Contributions are welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for setup instructions, the code style guide, and the PR process.

---

## Citation

```bibtex
@software{weerakoon2026srilankaepi,
  author    = {Weerakoon, R.B.H.G. Chathura Kavindu Bandara},
  title     = {srilanka-epi: Python library for live Sri Lanka Weekly Epidemiological Report data},
  year      = {2026},
  version   = {2.0.5},
  url       = {https://github.com/chathurakavinduweerakoon/srilanka-epi},
  note      = {ICBT Campus / Cardiff Metropolitan University}
}
```

---

## License

MIT License — see [LICENSE](LICENSE) for full text.
