Metadata-Version: 2.4
Name: jp-idwr-db
Version: 2026.5.13
Summary: Japanese IDWR infectious disease database and analytics toolkit built on Polars.
Project-URL: Homepage, https://github.com/AlFontal/jp-idwr-db
Project-URL: Repository, https://github.com/AlFontal/jp-idwr-db
Project-URL: Bug Tracker, https://github.com/AlFontal/jp-idwr-db/issues
Author: jp-idwr-db contributors
License: GPL-3.0-or-later
License-File: LICENSE
Keywords: epidemiology,infectious-disease,japan,polars,surveillance
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Medical Science Apps.
Requires-Python: >=3.10
Requires-Dist: fastexcel>=0.10
Requires-Dist: httpx>=0.27
Requires-Dist: openpyxl>=3.1
Requires-Dist: platformdirs>=4.2
Requires-Dist: polars>=0.20
Requires-Dist: pyarrow>=14.0
Provides-Extra: dev
Requires-Dist: mypy>=1.8; extra == 'dev'
Requires-Dist: pre-commit>=3.7; extra == 'dev'
Requires-Dist: pytest-cov>=5.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Provides-Extra: excel
Requires-Dist: fastexcel>=0.10; extra == 'excel'
Description-Content-Type: text/markdown

# jp-idwr-db
[![PyPI version](https://img.shields.io/pypi/v/jp-idwr-db)](https://pypi.org/project/jp-idwr-db/)
[![Python versions](https://img.shields.io/pypi/pyversions/jp-idwr-db)](https://pypi.org/project/jp-idwr-db/)
[![CI](https://img.shields.io/github/actions/workflow/status/AlFontal/jp-idwr-db/ci.yml?branch=main&label=CI)](https://github.com/AlFontal/jp-idwr-db/actions/workflows/ci.yml)
[![License: GPL-3.0-or-later](https://img.shields.io/badge/License-GPL--3.0--or--later-blue.svg)](https://github.com/AlFontal/jp-idwr-db/blob/main/LICENSE)

`jp-idwr-db` publishes Japan’s infectious disease surveillance data (NIID/JIHS IDWR) as a
versioned, language-agnostic data product: Parquet tables plus a machine-readable
`manifest.json` (and an optional DuckDB file with views).

The Python package adds a convenient API and local caching on top of those release assets.
Internally, data wrangling is Polars-first for speed and consistent transforms.

The goal is to spare you from chasing week-by-week files across changing archives and formats, so you can go straight to building time series and doing epidemiology.

The package provides an easier interface to the data, but you can also query the Parquet files directly with any tool that supports them (DuckDB, Arrow, Spark, etc.) using the `manifest.json` for file locations and schema. Direct-access examples are included below.

## Python Install

```bash
pip install jp-idwr-db
```

## Quick Start

To fetch the full unified dataset with a single call:

```python
import jp_idwr_db as jp
import polars as pl

df = (
    jp.load("unified", version="latest")
    .select(["date", "prefecture", "category", "disease", "count", "source"])
)
print(df)
```

```text
shape: (5_370_477, 6)
┌────────────┬────────────┬──────────┬─────────────────────────────┬───────┬────────────────────┐
│ date       ┆ prefecture ┆ category ┆ disease                     ┆ count ┆ source             │
│ ---        ┆ ---        ┆ ---      ┆ ---                         ┆ ---   ┆ ---                │
│ date       ┆ str        ┆ str      ┆ str                         ┆ f64   ┆ str                │
╞════════════╪════════════╪══════════╪═════════════════════════════╪═══════╪════════════════════╡
│ 1999-04-11 ┆ Aichi      ┆ total    ┆ AIDS                        ┆ 0.0   ┆ Confirmed cases    │
│ 1999-04-11 ┆ Aichi      ┆ total    ┆ Acute poliomyelitis         ┆ 0.0   ┆ Confirmed cases    │
│ 1999-04-11 ┆ Aichi      ┆ total    ┆ Acute viral hepatitis       ┆ 4.0   ┆ Confirmed cases    │
│ 1999-04-11 ┆ Aichi      ┆ total    ┆ Amebiasis                   ┆ 0.0   ┆ Confirmed cases    │
│ 1999-04-11 ┆ Aichi      ┆ total    ┆ Anthrax                     ┆ 0.0   ┆ Confirmed cases    │
│ …          ┆ …          ┆ …        ┆ …                           ┆ …     ┆ …                  │
│ 2026-02-09 ┆ Yamanashi  ┆ total    ┆ Viral hepatitis(excluding   ┆ 0.0   ┆ All-case reporting │
│            ┆            ┆          ┆ hepa…                       ┆       ┆                    │
│ 2026-02-09 ┆ Yamanashi  ┆ total    ┆ West Nile fever             ┆ 0.0   ┆ All-case reporting │
│ 2026-02-09 ┆ Yamanashi  ┆ total    ┆ Western equine encephalitis ┆ 0.0   ┆ All-case reporting │
│ 2026-02-09 ┆ Yamanashi  ┆ total    ┆ Yellow fever                ┆ 0.0   ┆ All-case reporting │
│ 2026-02-09 ┆ Yamanashi  ┆ total    ┆ Zika virus infection        ┆ 0.0   ┆ All-case reporting │
└────────────┴────────────┴──────────┴─────────────────────────────┴───────┴────────────────────┘
```

You can also filter at the source with `jp.get_data(...)`:

```python
import jp_idwr_db as jp

# Fetch only tuberculosis data for 2024 in Tokyo, Osaka, and Hokkaido
tb = (
    jp.get_data(
        disease="Tuberculosis",
        year=2024,
        prefecture=["Tokyo", "Osaka", "Hokkaido"],
        version="latest",
    )
    .select(["date", "prefecture", "disease", "count", "source"])
)
print(tb)
```

```text
shape: (156, 5)
┌────────────┬────────────┬──────────────┬───────┬────────────────────┐
│ date       ┆ prefecture ┆ disease      ┆ count ┆ source             │
│ ---        ┆ ---        ┆ ---          ┆ ---   ┆ ---                │
│ date       ┆ str        ┆ str          ┆ f64   ┆ str                │
╞════════════╪════════════╪══════════════╪═══════╪════════════════════╡
│ 2024-01-01 ┆ Hokkaido   ┆ Tuberculosis ┆ 2.0   ┆ All-case reporting │
│ 2024-01-01 ┆ Osaka      ┆ Tuberculosis ┆ 3.0   ┆ All-case reporting │
│ 2024-01-01 ┆ Tokyo      ┆ Tuberculosis ┆ 15.0  ┆ All-case reporting │
│ 2024-01-08 ┆ Hokkaido   ┆ Tuberculosis ┆ 4.0   ┆ All-case reporting │
│ 2024-01-08 ┆ Osaka      ┆ Tuberculosis ┆ 17.0  ┆ All-case reporting │
│ …          ┆ …          ┆ …            ┆ …     ┆ …                  │
│ 2024-12-16 ┆ Osaka      ┆ Tuberculosis ┆ 17.0  ┆ All-case reporting │
│ 2024-12-16 ┆ Tokyo      ┆ Tuberculosis ┆ 41.0  ┆ All-case reporting │
│ 2024-12-23 ┆ Hokkaido   ┆ Tuberculosis ┆ 5.0   ┆ All-case reporting │
│ 2024-12-23 ┆ Osaka      ┆ Tuberculosis ┆ 16.0  ┆ All-case reporting │
│ 2024-12-23 ┆ Tokyo      ┆ Tuberculosis ┆ 53.0  ┆ All-case reporting │
└────────────┴────────────┴──────────────┴───────┴────────────────────┘
```

```python
import jp_idwr_db as jp

# Sentinel-reported diseases in Tokyo for the year range 2024 to 2026
sentinel_df = (
    jp.get_data(
        source="sentinel",
        prefecture="Tokyo",
        year=(2024, 2026),
        version="latest",
    )
    .select(["date", "prefecture", "disease", "count", "per_sentinel"])
)
print(sentinel_df)
```

```text
shape: (2_052, 5)
┌────────────┬────────────┬─────────────────────────────────┬─────────┬──────────────┐
│ date       ┆ prefecture ┆ disease                         ┆ count   ┆ per_sentinel │
│ ---        ┆ ---        ┆ ---                             ┆ ---     ┆ ---          │
│ date       ┆ str        ┆ str                             ┆ f64     ┆ f64          │
╞════════════╪════════════╪═════════════════════════════════╪═════════╪══════════════╡
│ 2024-01-07 ┆ Tokyo      ┆ Acute hemorrhagic conjunctivit… ┆ null    ┆ null         │
│ 2024-01-07 ┆ Tokyo      ┆ Aseptic meningitis              ┆ null    ┆ null         │
│ 2024-01-07 ┆ Tokyo      ┆ Bacterial meningitis            ┆ null    ┆ null         │
│ 2024-01-07 ┆ Tokyo      ┆ COVID-19                        ┆ 1365.0  ┆ 3.38         │
│ 2024-01-07 ┆ Tokyo      ┆ Chickenpox                      ┆ 31.0    ┆ 0.12         │
│ …          ┆ …          ┆ …                               ┆ …       ┆ …            │
│ 2026-01-25 ┆ Tokyo      ┆ Influenza(excld. avian influen… ┆ 13082.0 ┆ 34.07        │
│ 2026-01-25 ┆ Tokyo      ┆ Mumps                           ┆ 30.0    ┆ 0.12         │
│ 2026-01-25 ┆ Tokyo      ┆ Mycoplasma pneumonia            ┆ 32.0    ┆ 1.28         │
│ 2026-01-25 ┆ Tokyo      ┆ Pharyngoconjunctival fever      ┆ 115.0   ┆ 0.47         │
│ 2026-01-25 ┆ Tokyo      ┆ Respiratory syncytial virus in… ┆ 242.0   ┆ 1.0          │
└────────────┴────────────┴─────────────────────────────────┴─────────┴──────────────┘
```

<details>
<summary><strong>Data Download Model</strong></summary>

- Package wheels do not ship the large parquet tables.
- On first call to `jp.load(..., version="latest")` (or `jp.get_data(..., version="latest")`), the package downloads parquet assets listed in the latest published release `manifest.json`.
- By default, the package uses the packaged data version that matches the installed wheel. Use `version="latest"` when you want the freshest published snapshot.
- Cache path defaults to:
  - macOS: `~/Library/Caches/jp_idwr_db/data/<version>/`
  - Linux: `~/.cache/jp_idwr_db/data/<version>/`
  - Windows: `%LOCALAPPDATA%\jp_idwr_db\Cache\data\<version>\`

Prefetch explicitly:

```bash
python -m jp_idwr_db data download
python -m jp_idwr_db data download --version latest --force
```

Environment overrides:

- `JPINFECT_DATA_VERSION`: choose a specific release tag, or `latest` for the newest published release
- `JPINFECT_DATA_BASE_URL`: override asset host base URL
- `JPINFECT_CACHE_DIR`: override local cache root
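
The lookup order implied by the cache paths and the `JPINFECT_CACHE_DIR` override can be sketched roughly as follows. This is an illustration only, not the package's code: `default_cache_root` and `resolve_cache_dir` are hypothetical names, the Linux branch ignores `XDG_CACHE_HOME`, and it is an assumption that the override replaces the cache root and that `data/<version>` is appended beneath it.

```python
import os
import sys
from collections.abc import Mapping
from pathlib import Path


def default_cache_root(platform: str, home: Path, env: Mapping[str, str]) -> Path:
    """Per-OS default cache root, mirroring the paths listed above (simplified)."""
    if platform == "darwin":
        return home / "Library" / "Caches" / "jp_idwr_db"
    if platform.startswith("win"):
        local = env.get("LOCALAPPDATA") or str(home / "AppData" / "Local")
        return Path(local) / "jp_idwr_db" / "Cache"
    # Linux and other POSIX systems
    return home / ".cache" / "jp_idwr_db"


def resolve_cache_dir(version: str, *, platform=None, home=None, env=None) -> Path:
    """Return the data directory for one release version (hypothetical helper)."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else home
    platform = sys.platform if platform is None else platform

    # Assumption: the override replaces the cache root, not the full data path.
    override = env.get("JPINFECT_CACHE_DIR")
    root = Path(override) if override else default_cache_root(platform, home, env)
    return root / "data" / version
```

Passing `platform`, `home`, and `env` explicitly keeps the function easy to check without touching the real environment; called with only a version string, it falls back to the current process settings.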
</details>

## Language-independent data access

Release data assets are published as:

- `manifest.json`
- one or more `.parquet` tables (including `unified.parquet`)
- optional `jp_idwr_db.duckdb` (views over the parquet files)

Manifest schema reference: [`docs/manifest.schema.json`](./docs/manifest.schema.json).

Fetch the manifest:

```bash
curl -L "https://github.com/AlFontal/jp-idwr-db/releases/latest/download/manifest.json"
```

Query with DuckDB CLI (when `jp_idwr_db.duckdb` and parquet files are in the same directory):

```bash
duckdb jp_idwr_db.duckdb -c "SELECT year, week, COUNT(*) AS rows FROM unified GROUP BY 1,2 ORDER BY 1 DESC, 2 DESC LIMIT 5;"
```

### Download assets for any language

```bash
BASE="https://github.com/AlFontal/jp-idwr-db/releases/latest/download"

mkdir -p jp-idwr-assets
cd jp-idwr-assets
curl -L -O "${BASE}/manifest.json"
curl -L -O "${BASE}/unified.parquet"
curl -L -O "${BASE}/jp_idwr_db.duckdb"
```

### R example (DuckDB, local)

This example opens the local `jp_idwr_db.duckdb` artifact (downloaded with the parquet files)
and queries the `unified` view. Run it from the directory where `jp_idwr_db.duckdb`
and the parquet files are located:

```r
con <- DBI::dbConnect(duckdb::duckdb(), "jp_idwr_db.duckdb", read_only = TRUE)

tb <- DBI::dbGetQuery(
  con,
  "SELECT date, prefecture, disease, count, source
   FROM unified
   WHERE year = 2024 AND disease = 'Tuberculosis'
   ORDER BY date, prefecture
   LIMIT 20"
)

print(tb)
DBI::dbDisconnect(con, shutdown = TRUE)
```

```text
        date prefecture      disease count             source
1 2024-01-01      Aichi Tuberculosis     5 All-case reporting
2 2024-01-01      Akita Tuberculosis     1 All-case reporting
3 2024-01-01     Aomori Tuberculosis     0 All-case reporting
4 2024-01-01      Chiba Tuberculosis     7 All-case reporting
5 2024-01-01      Ehime Tuberculosis     1 All-case reporting
6 2024-01-01      Fukui Tuberculosis     1 All-case reporting
...
```

### R example (Arrow, remote)

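You can also read the parquet file straight from the GitHub Release URL without saving it to disk first. Note that `arrow::read_parquet` loads the whole table into memory before the `dplyr` filters run: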

```r
library(magrittr)

url <- "https://github.com/AlFontal/jp-idwr-db/releases/latest/download/unified.parquet"

tb <- arrow::read_parquet(url) %>%
  dplyr::filter(year == 2024, disease == "Tuberculosis") %>%
  dplyr::select(date, prefecture, disease, count, source) %>%
  dplyr::arrange(date, prefecture)

print(as.data.frame(tb))
```

```text
        date prefecture      disease count             source
1 2024-01-01      Aichi Tuberculosis     5 All-case reporting
2 2024-01-01      Akita Tuberculosis     1 All-case reporting
3 2024-01-01     Aomori Tuberculosis     0 All-case reporting
4 2024-01-01      Chiba Tuberculosis     7 All-case reporting
5 2024-01-01      Ehime Tuberculosis     1 All-case reporting
6 2024-01-01      Fukui Tuberculosis     1 All-case reporting
...
```

## Main API

Top-level API exported by `jp_idwr_db`:

- `load(name)`
- `get_data(...)`
- `list_diseases(source="all")`
- `list_prefectures()`
- `get_latest_week()`
- `prefecture_map()`
- `attach_prefecture_id(df, prefecture_col="prefecture", id_col="prefecture_id")`
- `merge(...)`, `pivot(...)`
- `configure(...)`, `get_config()`


## Datasets

Use `jp.load(...)` with:

- `"sex"`: historical sex-disaggregated surveillance
- `"place"`: historical place-category surveillance
- `"bullet"`: modern all-case weekly reports (rapid zensu)
- `"sentinel"`: sentinel reports (teitenrui; 2012+ in release data assets)
- `"unified"`: deduplicated combined dataset (sex-total + modern bullet/sentinel, recommended)

Note: teitenrui CSVs report year-to-date cumulative counts. `jp-idwr-db` converts these to
weekly incidence (`count_t - count_{t-1}` within year/prefecture/disease; first week kept as-is).
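
The differencing rule can be illustrated with a small dependency-free sketch. It uses plain Python rather than the package's Polars code, the function name `cumulative_to_weekly` is hypothetical, and it assumes every row has a non-missing `count`; gaps or downward revisions in the cumulative series are not handled here.

```python
from collections import defaultdict


def cumulative_to_weekly(rows):
    """Convert year-to-date cumulative counts to weekly incidence.

    rows: iterable of dicts with keys year, week, prefecture, disease, count.
    """
    # Group rows so differencing never crosses year/prefecture/disease borders.
    groups = defaultdict(list)
    for row in rows:
        groups[(row["year"], row["prefecture"], row["disease"])].append(row)

    out = []
    for group in groups.values():
        previous = None
        for row in sorted(group, key=lambda r: r["week"]):
            # First week of each group is kept as-is; later weeks are differenced.
            weekly = row["count"] if previous is None else row["count"] - previous
            previous = row["count"]
            out.append({**row, "count": weekly})
    return out
```

For example, cumulative counts of 3, 5, and 9 over weeks 1 to 3 of one year become weekly counts of 3, 2, and 4, and the series starts fresh in the first week of the next year.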

Detailed schema and coverage are documented in [DATASETS.md](./docs/DATASETS.md).

## Raw Download and Parsing

Raw file workflows are available in `jp_idwr_db.io`:

- `jp_idwr_db.io.download(...)`
- `jp_idwr_db.io.download_recent(...)`
- `jp_idwr_db.io.read(...)`

These are useful for refreshing local raw weekly files or debugging parser behavior.

## Data Wrangling Examples

See [EXAMPLES.md](./docs/EXAMPLES.md) for data wrangling recipes (grouping, trends, regional slices, source-aware filtering).

Disease-by-disease temporal coverage is documented in [DISEASES.md](./docs/DISEASES.md).

## Data Source

NIID/JIHS infectious disease surveillance publications:

- Historical annual archive files (`Syu_01_1`, `Syu_02_1`)
- Rapid weekly CSV reports (`zensuXX.csv`, `teitenruiXX.csv`)

## Development

```bash
uv sync --all-extras --dev
uv run ruff check .
uv run mypy src
uv run pytest

# Build release data assets (manifest + duckdb + parquet metadata)
uv run --with duckdb --with jsonschema jp-idwr-db-build-assets \
  --data-dir data/parquet \
  --release-tag vYYYY.M.D \
  --base-url https://github.com/AlFontal/jp-idwr-db/releases/download/vYYYY.M.D \
  --schema-path docs/manifest.schema.json
```

## Security and Integrity

- Release assets include a `manifest.json` with SHA256 checksums and file sizes.
- `ensure_data()` verifies the checksum and size of each downloaded parquet file before marking the cache complete.
- For PyPI publishing, prefer Trusted Publishing (OIDC) over long-lived API tokens.
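
If you download assets yourself, a comparable check can be done with `hashlib`. This is a minimal sketch, not the code behind `ensure_data()`: `sha256_of` and `verify_asset` are hypothetical helpers, and the expected checksum and size are assumed to have been read from `manifest.json` (its exact field names are given in the manifest schema, not here).

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large parquet files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_asset(path, expected_sha256: str, expected_size: int) -> bool:
    """Return True only if both the file size and the SHA256 digest match."""
    path = Path(path)
    # Cheap size check first; skip hashing when the size is already wrong.
    if path.stat().st_size != expected_size:
        return False
    return sha256_of(path) == expected_sha256.lower()
```

Checking the size before hashing lets a truncated download fail fast without reading the whole file.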

## License

GPL-3.0-or-later. See [LICENSE](./LICENSE).
