Metadata-Version: 2.4
Name: insurance-spatial-conformal
Version: 0.1.0
Summary: Spatially weighted conformal prediction intervals for geographically calibrated insurance pricing
Project-URL: Homepage, https://github.com/burning-cost/insurance-spatial-conformal
Project-URL: Repository, https://github.com/burning-cost/insurance-spatial-conformal
Project-URL: Documentation, https://github.com/burning-cost/insurance-spatial-conformal#readme
Author-email: Burning Cost <pricing.frontier@gmail.com>
License: MIT
Keywords: actuarial,conformal-prediction,insurance,pricing,spatial-statistics,tweedie
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Mathematics
Requires-Python: >=3.10
Requires-Dist: matplotlib>=3.7
Requires-Dist: numpy>=1.24
Requires-Dist: pgeocode>=0.4
Requires-Dist: polars>=0.20
Requires-Dist: scikit-learn>=1.3
Requires-Dist: scipy>=1.10
Provides-Extra: dev
Requires-Dist: lightgbm>=4.0; extra == 'dev'
Requires-Dist: pandas>=2.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Provides-Extra: geo
Requires-Dist: geopandas>=0.14; extra == 'geo'
Requires-Dist: shapely>=2.0; extra == 'geo'
Description-Content-Type: text/markdown

# insurance-spatial-conformal

Spatially weighted conformal prediction intervals for geographically calibrated insurance pricing.

## The problem

You've built a home insurance pricing model. You evaluate it nationally — 90% coverage on the test set, right on target. Your actuary signs off. Your model goes live.

Then someone runs a postcode-level diagnostic and finds that coverage in Hackney is 73% and coverage in rural Devon is 96%. Your nationally correct model is systematically under-covering urban risks and over-covering rural ones.

This is the exchangeability problem in conformal prediction. Standard split conformal assumes that calibration scores and test scores are drawn from the same distribution — interchangeable. That's fine nationally, but it breaks geographically. A semi-detached in Hackney and a farmhouse in Devon have materially different loss distributions, and treating all calibration scores as equally relevant to both is wrong.

The fix is geographic kernel weighting: when computing the prediction interval for a test property, weight the calibration scores by proximity. Properties in Hackney get high weight from the Hackney calibration data, low weight from the Devon data. The quantile you use for the interval reflects local behaviour, not national average behaviour.

This library implements that fix for UK insurance pricing.
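
The core computation can be sketched in a few lines of numpy. This is an illustrative standalone version, not the library's implementation; in particular it omits the +∞ augmentation described under Design decisions:

```python
import numpy as np

def weighted_conformal_quantile(scores, dist_km, bandwidth_km=20.0, alpha=0.10):
    """Locally weighted (1 - alpha) quantile of calibration scores.

    scores  : non-conformity scores of the calibration set
    dist_km : distance from the test property to each calibration property
    """
    # Gaussian kernel: nearby calibration points get high weight
    w = np.exp(-0.5 * (dist_km / bandwidth_km) ** 2)
    w = w / w.sum()
    # Weighted empirical quantile: smallest score whose cumulative
    # kernel weight reaches 1 - alpha
    order = np.argsort(scores)
    cdf = np.cumsum(w[order])
    idx = np.searchsorted(cdf, 1 - alpha)
    return scores[order][min(idx, len(scores) - 1)]

# Two synthetic clusters: "urban" scores near the test point, "rural" far away
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(1.0, 0.1, 500), rng.normal(3.0, 0.1, 500)])
dist_km = np.concatenate([np.full(500, 2.0), np.full(500, 200.0)])
q_local = weighted_conformal_quantile(scores, dist_km)
q_national = np.quantile(scores, 0.90)
# q_local tracks the nearby cluster (~1.1); q_national is dragged to ~3.1
```

The local quantile is driven almost entirely by the nearby calibration scores, which is exactly what restores per-area coverage in the Hackney/Devon scenario above.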

## What it does

- **Spatially weighted conformal prediction** using Gaussian, Epanechnikov, or uniform (nearest-neighbour) spatial kernels
- **Tweedie Pearson non-conformity scores** — variance-stabilised scores for GLM and GBM models with Tweedie/compound Poisson objectives
- **Cross-validated bandwidth selection** using spatially blocked CV with a MACG objective, i.e. the bandwidth that minimises geographic coverage gaps
- **MACG diagnostic** (Mean Absolute Coverage Gap) across a spatial grid, plus per-region breakdown for FCA Consumer Duty reporting
- **UK postcode geocoding** via pgeocode with outward-code fallback

## Installation

```bash
pip install insurance-spatial-conformal
```

Optional geographic visualisation dependencies:

```bash
pip install insurance-spatial-conformal[geo]
```

## Quickstart

```python
from insurance_spatial_conformal import SpatialConformalPredictor

# Your fitted pricing model (LightGBM, XGBoost, sklearn, CatBoost — anything with predict())
# Already split your data into train / calibration / test

scp = SpatialConformalPredictor(
    model=fitted_lgbm,
    nonconformity='pearson_tweedie',
    tweedie_power=1.5,
    bandwidth_km=20.0,       # 20 km Gaussian kernel; or None to auto-select
)

# Calibrate on holdout set with coordinates
scp.calibrate(X_cal, y_cal, lat=lat_cal, lon=lon_cal)

# Predict intervals for new business
result = scp.predict_interval(X_test, lat=lat_test, lon=lon_test, alpha=0.10)

print(result.lower[:5])   # lower bounds
print(result.upper[:5])   # upper bounds
print(result.point[:5])   # point predictions from model
```

Using postcodes instead of coordinates:

```python
from insurance_spatial_conformal import PostcodeGeocoder

gc = PostcodeGeocoder()
lat_cal, lon_cal = gc.geocode(postcode_list_cal)
scp.calibrate(X_cal, y_cal, lat=lat_cal, lon=lon_cal)
```

Auto-selecting bandwidth via cross-validation:

```python
scp = SpatialConformalPredictor(model=fitted_model, bandwidth_km=None)
result = scp.calibrate(
    X_cal, y_cal, lat=lat_cal, lon=lon_cal,
    cv_candidates_km=[2, 5, 10, 20, 30, 50],
    cv_folds=5,
)
print(f"CV-selected bandwidth: {result.bandwidth_km} km")
```

## Coverage diagnostics

```python
import polars as pl

from insurance_spatial_conformal import SpatialCoverageReport

report = SpatialCoverageReport(scp)
result = report.evaluate(X_val, y_val, lat=lat_val, lon=lon_val, alpha=0.10)

print(report.summary())
# === Spatial Coverage Report ===
#   Validation set: 5,000 observations
#   Target coverage (1-alpha): 90.0%
#   Marginal coverage: 0.901
#   Coverage gap: -0.0010
#   MACG (312 grid cells): 0.0187
#   Bandwidth: 20.0 km
#   Kernel: gaussian

# Coverage map — green = on target, red = under/over covered
fig = report.coverage_map(resolution=20)
fig.savefig("coverage_by_postcode.png", dpi=150)

# FCA Consumer Duty table — coverage by segment
table = report.fca_consumer_duty_table(region_labels=county_labels)
print(table.filter(pl.col("flag") == "REVIEW"))
```
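
MACG is, roughly, the mean over grid cells of the absolute gap between empirical coverage and the 1 − α target. A standalone numpy sketch of that computation (illustrative; the library's grid construction and sparse-cell handling may differ):

```python
import numpy as np

def macg(covered, lat, lon, alpha=0.10, resolution=10, min_cell_n=20):
    """Mean Absolute Coverage Gap over a lat/lon grid.

    covered : boolean numpy array, True where y landed inside its interval
    """
    lat_bins = np.linspace(lat.min(), lat.max(), resolution + 1)
    lon_bins = np.linspace(lon.min(), lon.max(), resolution + 1)
    # Map each observation to a grid cell index
    i = np.clip(np.digitize(lat, lat_bins) - 1, 0, resolution - 1)
    j = np.clip(np.digitize(lon, lon_bins) - 1, 0, resolution - 1)
    cell = i * resolution + j
    # Per-cell |empirical coverage - target|, skipping sparse cells
    gaps = [abs(covered[cell == c].mean() - (1 - alpha))
            for c in np.unique(cell) if (cell == c).sum() >= min_cell_n]
    return float(np.mean(gaps))
```

A nationally correct model that over-covers some cells and under-covers others scores a high MACG even when marginal coverage sits exactly on target.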

## Non-conformity score choice

The score determines the shape of the prediction interval. For insurance pricing:

| Score | Use when | Interval shape |
|-------|----------|----------------|
| `pearson_tweedie` | Tweedie GLM/GBM (default) | Width scales as yhat^(p/2) |
| `pearson` | Poisson frequency model | Width scales as sqrt(yhat) |
| `scaled_absolute` | Two-model approach with spread model | Width scales with difficulty |
| `absolute` | Baseline only | Fixed width regardless of risk level |
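
The width columns correspond to scores of roughly the following shape (assumed from the table; the library's exact definitions may differ):

```python
import numpy as np

def nonconformity(y, yhat, kind="pearson_tweedie", p=1.5, spread=None):
    """Score shapes implied by the table above (illustrative)."""
    if kind == "pearson_tweedie":   # Tweedie variance function V(mu) = mu^p
        return np.abs(y - yhat) / yhat ** (p / 2)
    if kind == "pearson":           # Poisson: V(mu) = mu
        return np.abs(y - yhat) / np.sqrt(yhat)
    if kind == "scaled_absolute":   # spread = predicted |y - yhat|
        return np.abs(y - yhat) / spread
    return np.abs(y - yhat)         # plain absolute residual
```

Inverting the `pearson_tweedie` score at quantile q gives an interval of the form [yhat − q·yhat^(p/2), yhat + q·yhat^(p/2)], which is why width grows with the predicted burning cost.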

```python
# Tweedie power 1.5 = compound Poisson-Gamma (typical burning cost)
scp = SpatialConformalPredictor(
    model=model, nonconformity='pearson_tweedie', tweedie_power=1.5
)

# Two-model approach: a spread model predicts |y - yhat|
# (numpy and lightgbm are installed with the dev extra)
import numpy as np
from lightgbm import LGBMRegressor

spread_model = LGBMRegressor().fit(X_cal, np.abs(y_cal - yhat_cal))
scp = SpatialConformalPredictor(
    model=model, nonconformity='scaled_absolute', spread_model=spread_model
)
```

## API reference

### SpatialConformalPredictor

```python
SpatialConformalPredictor(
    model,                        # fitted sklearn-compatible model
    nonconformity='pearson_tweedie',
    tweedie_power=1.5,
    spatial_kernel='gaussian',    # 'gaussian' | 'epanechnikov' | 'uniform'
    bandwidth_km=None,            # None = CV-select; float = fixed
    spread_model=None,            # required for 'scaled_absolute'
    n_eff_min=30,                 # warn if effective N < this threshold
)

.calibrate(X_cal, y_cal, lat=..., lon=..., postcodes=..., exposure=...)
    → CalibrationResult

.predict_interval(X_test, lat=..., lon=..., postcodes=..., alpha=0.10)
    → IntervalResult  (.lower, .upper, .point, .n_effective, .bandwidth_km)

.spatial_coverage_report(X_val, y_val, lat=..., lon=...)
    → SpatialCoverageReport
```

### BandwidthSelector

```python
BandwidthSelector(
    candidates_km=[2, 5, 10, 15, 20, 30, 50],
    cv=5,
    n_eff_min=30,
    metric='macg',
    grid_resolution=10,
)

.select(scores, lat, lon, alpha=0.10) → BandwidthCVResult
```

### SpatialCoverageReport

```python
SpatialCoverageReport(predictor)

.evaluate(X_val, y_val, lat=..., lon=..., alpha=0.10, grid_resolution=20)
    → CoverageResult  (.marginal_coverage, .macg, .n_grid_cells)

.coverage_map(resolution=20) → matplotlib Figure
.fca_consumer_duty_table(region_labels=...) → polars DataFrame
.macg_by_region(region_labels) → polars DataFrame
.summary() → str
```

## Design decisions

**Haversine distance, not Euclidean.** At 55°N (central Scotland), a degree of longitude is ~64 km but a degree of latitude is ~111 km. Euclidean distance on decimal degrees would produce elliptical kernels skewed north-south by ~42%. All distance calculations use haversine.
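
A minimal haversine reproducing the numbers above (a sketch, not the library's internal implementation):

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between (lat, lon) points in degrees."""
    R = 6371.0  # mean Earth radius, km
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    dphi = np.radians(lat2 - lat1)
    dlmb = np.radians(lon2 - lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(dlmb / 2) ** 2
    return 2 * R * np.arcsin(np.sqrt(a))

# At 55N, one degree of longitude is far shorter than one degree of latitude
print(round(haversine_km(55.0, 0.0, 55.0, 1.0), 1))  # ~63.8 km
print(round(haversine_km(55.0, 0.0, 56.0, 0.0), 1))  # ~111.2 km
```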

**Bandwidth parameterised in km, not eta.** The Hjort et al. paper uses eta = bandwidth^2 internally. We expose the parameter in kilometres because that's what a pricing actuary can reason about: "20 km bandwidth" is meaningful, "eta = 400,000,000 m²" is not.

**Tibshirani et al. (2019) augmentation.** The finite-sample coverage guarantee requires augmenting the weighted calibration distribution with a point mass at +∞ carrying weight proportional to 1/(n+1). This is what makes the marginal guarantee of at least 1−α hold in finite samples, not just approximately.
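
The construction can be sketched as follows. In this simplified version the +∞ atom receives the mean calibration weight, which after normalisation is exactly 1/(n+1) of the total mass:

```python
import numpy as np

def augmented_quantile(scores, weights, alpha=0.10):
    """Weighted (1 - alpha) quantile of calibration scores augmented with
    a point mass at +inf. Returns +inf when the weights are too diffuse
    to certify the target level, i.e. the interval is uninformative.
    """
    s = np.append(scores, np.inf)
    w = np.append(weights, weights.mean())  # +inf atom = 1/(n+1) of total
    w = w / w.sum()
    order = np.argsort(s)
    cdf = np.cumsum(w[order])
    return s[order][np.searchsorted(cdf, 1 - alpha)]

# With uniform weights this recovers the classic split-conformal rule:
# the ceil((n + 1) * (1 - alpha))-th smallest score
q = augmented_quantile(np.arange(1.0, 11.0), np.ones(10))  # n = 10
print(q)  # 10.0, the ceil(11 * 0.9) = 10th smallest score
```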

**Spatial blocking CV, not random folds.** Random CV folds allow geographically proximate calibration and validation points into the same split, which leaks spatial information and makes the CV loss overly optimistic. K-means on coordinates gives spatially contiguous folds.
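
The fold assignment might look like this (a sketch; the library's fold construction may differ in detail):

```python
import numpy as np
from sklearn.cluster import KMeans

def spatial_folds(lat, lon, n_folds=5, seed=0):
    """Assign each observation to a spatially contiguous CV fold by
    k-means clustering on coordinates. A refinement would scale lon by
    cos(lat) first so clusters come out roughly round in km.
    """
    coords = np.column_stack([lat, lon])
    km = KMeans(n_clusters=n_folds, n_init=10, random_state=seed)
    return km.fit_predict(coords)
```

Holding out one spatial block at a time means the CV loss is computed on points that are genuinely far from their calibration data, mimicking deployment on a new area.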

**Kish effective N warning.** In rural areas with sparse data, a narrow bandwidth might have effective N < 30 at some test points. The predictor warns rather than erroring — the interval is still produced, but flagged. In practice, the CV bandwidth selector includes a floor on effective N.
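
The Kish effective sample size for a weight vector w is (Σw)² / Σw²:

```python
import numpy as np

def kish_n_eff(w):
    """Kish (1965) effective sample size: (sum w)^2 / sum(w^2)."""
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

print(kish_n_eff(np.ones(100)))         # 100.0: equal weights, full sample
print(kish_n_eff([1.0] + [1e-6] * 99))  # ~1.0: one point dominates
```

A narrow kernel in a sparse rural area concentrates nearly all weight on a handful of calibration points, pushing n_eff below `n_eff_min` and triggering the warning.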

**Polars output for DataFrames.** Diagnostics and the FCA table return Polars DataFrames rather than pandas. Polars is faster for the typical operations (group-by, filter, sort) and has cleaner null semantics. Call `.to_pandas()` if your downstream tools need pandas.

## References

Hjort, N. L., Jullum, M., & Løland, A. (2025). Uncertainty quantification in automated valuation models with spatially weighted conformal prediction. *International Journal of Data Science and Analytics* (Springer). doi:10.1007/s41060-025-00862-4. arXiv:2312.06531.

Tibshirani, R. J., Barber, R. F., Candès, E. J., & Ramdas, A. (2019). Conformal prediction under covariate shift. *NeurIPS 2019*.

Manna, S. et al. (2025). Distribution-free prediction sets for Tweedie regression. arXiv:2507.06921.

Kish, L. (1965). *Survey Sampling*. Wiley.

Roberts, D. R. et al. (2017). Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. *Ecography*, 40(8), 913-929.

## Licence

MIT
