Metadata-Version: 2.4
Name: insurance-composite
Version: 0.1.0
Summary: Composite (spliced) severity regression with covariate-dependent thresholds for insurance pricing
Author-email: Burning Cost <pricing.frontier@gmail.com>
License: MIT
Keywords: EVT,actuarial,composite,insurance,regression,severity,spliced
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Mathematics
Requires-Python: >=3.10
Requires-Dist: numpy>=1.24
Requires-Dist: scikit-learn>=1.3
Requires-Dist: scipy>=1.10
Provides-Extra: dev
Requires-Dist: numpy>=1.24; extra == 'dev'
Requires-Dist: pytest-cov; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: scikit-learn>=1.3; extra == 'dev'
Requires-Dist: scipy>=1.10; extra == 'dev'
Provides-Extra: formulas
Requires-Dist: patsy>=0.5; extra == 'formulas'
Provides-Extra: plotting
Requires-Dist: matplotlib>=3.7; extra == 'plotting'
Description-Content-Type: text/markdown

# insurance-composite

Composite (spliced) severity regression with covariate-dependent thresholds.

## The problem

Standard severity GLMs fit one distribution across the whole claim range. For most lines of business this is a poor fit, because attritional and large claims behave differently.

Motor bodily injury claims: 90% are soft tissue injuries under £20k. The other 10% are catastrophic injuries that cost £200k–£5m and follow a completely different distribution. Fitting a single Gamma misrepresents both ends.

The right approach is to split at a **threshold**: a light body distribution below, a heavy tail distribution above. This is called a spliced or composite model. It directly prices the large-loss loading you were previously fudging with a manual factor.
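
To make the splice concrete, here is a minimal sketch of a two-piece density with a lognormal body and a GPD tail, written directly with `scipy.stats`. It is not this library's implementation: the function name `spliced_pdf`, the body weight `pi`, and all parameter values are illustrative assumptions.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad


def spliced_pdf(x, u, pi, body, tail):
    """Two-piece density: truncated body on (0, u], shifted tail on (u, inf).

    pi is the probability mass assigned to the body; `body` and `tail` are
    frozen scipy.stats distributions (an illustrative choice).
    """
    x = np.asarray(x, dtype=float)
    # Body density renormalised so that it integrates to 1 on (0, u]
    below = pi * body.pdf(x) / body.cdf(u)
    # Tail density applied to the excess over the threshold
    above = (1.0 - pi) * tail.pdf(x - u)
    return np.where(x <= u, below, above)


# Hypothetical parameters, chosen only for illustration
u = 20_000.0
body = stats.lognorm(s=1.0, scale=np.exp(8.0))
tail = stats.genpareto(c=0.3, scale=50_000.0)
pi = 0.9


def integrand(t):
    return float(spliced_pdf(t, u, pi, body, tail))


mass_below, _ = quad(integrand, 0.0, u)
mass_above, _ = quad(integrand, u, np.inf)
print(f"mass below threshold: {mass_below:.4f}")
print(f"mass above threshold: {mass_above:.4f}")
```

The two masses recover `pi` and `1 - pi`, which is what makes the piecewise function a proper density regardless of where the threshold sits.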

The unsolved problem in Python (until now) is **regression**. Claim severity depends on covariates — vehicle type, driver age, occupation class. When your model has individual-specific covariates, the natural splice point should vary by policyholder too. A young driver with a high-powered vehicle doesn't belong at the same threshold as a standard commuter.

This library implements composite severity regression where covariates drive the tail scale parameter, which automatically makes the threshold covariate-dependent via mode-matching.

## What it does

Three composite models for V1:

| Model | Body | Tail | Threshold |
|---|---|---|---|
| `LognormalBurrComposite` | Lognormal | Burr XII | Mode-matching (data-driven, covariate-dependent) |
| `LognormalGPDComposite` | Lognormal | GPD | Fixed or profile likelihood |
| `GammaGPDComposite` | Gamma | GPD | Fixed or profile likelihood |

Plus `CompositeSeverityRegressor`, a scikit-learn-compatible wrapper that takes a feature matrix and makes per-policyholder severity predictions.

## Why mode-matching with Burr XII, not GPD

GPD is the canonical EVT tail distribution. But mode-matching requires the tail distribution to have a positive finite mode, and the GPD density with xi >= 0 is strictly decreasing, so its mode sits at 0. The xi >= 0 case covers all insurance heavy-loss scenarios (xi is typically 0.1–0.5 for UK lines), so GPD cannot be mode-matched in practice.

In the parameterisation where the Burr XII density is proportional to `(x/beta)^(delta-1) * [1 + (x/beta)^delta]^-(alpha+1)`, the mode is `beta * [(delta-1)/(delta*alpha+1)]^{1/delta}` for delta > 1. This is tractable and positive, making Burr XII the natural choice for mode-matching composite models. The covariate-dependent threshold then falls out automatically as each policyholder's Burr scale beta varies with their covariates.
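
The closed-form mode can be checked numerically. The sketch below uses `scipy.stats.burr12`, whose shape names `c` (inner power) and `d` (tail exponent) are SciPy's and may not map one-to-one onto this library's parameter names. In SciPy's notation the mode is `scale * ((c - 1) / (c * d + 1)) ** (1 / c)` for `c > 1`. The log link on the scale and the coefficient values are assumptions made purely for illustration.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar


def burr12_mode(c, d, scale):
    """Mode of scipy's burr12(c, d, scale=scale); requires c > 1."""
    if c <= 1:
        raise ValueError("Burr XII has a positive interior mode only for c > 1")
    return scale * ((c - 1.0) / (c * d + 1.0)) ** (1.0 / c)


# Check the closed form against a numerical maximisation of the density
c, d, scale = 2.5, 1.2, 30_000.0
dist = stats.burr12(c, d, scale=scale)
res = minimize_scalar(
    lambda x: -dist.pdf(x), bounds=(1.0, 10 * scale), method="bounded"
)
print(f"closed-form mode: {burr12_mode(c, d, scale):,.1f}")
print(f"numerical mode:   {res.x:,.1f}")

# Covariate-dependent threshold: with a log link on the scale (an assumption),
# each row of X gets its own scale and therefore its own mode.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # intercept + one covariate
coef = np.array([np.log(30_000.0), 0.25])           # hypothetical coefficients
scales = np.exp(X @ coef)
thresholds = burr12_mode(c, d, scales)
print(thresholds)
```

Because the mode is proportional to the scale, a unit change in the covariate multiplies the threshold by `exp(0.25)` in this toy setup.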

If you try to combine GPD with mode-matching, the library raises `ValueError` with a clear explanation rather than silently failing.

## Installation

```bash
pip install insurance-composite
```

With plotting support (quoted so that shells such as zsh do not expand the brackets):

```bash
pip install "insurance-composite[plotting]"
```

## Quick start

```python
import numpy as np
from insurance_composite import LognormalBurrComposite, CompositeSeverityRegressor

# claim_amounts: array of positive claim sizes (load your own data)
# Fit without covariates (mode-matching finds threshold automatically)
model = LognormalBurrComposite(threshold_method="mode_matching")
model.fit(claim_amounts)

print(f"Threshold: £{model.threshold_:,.0f}")
print(f"Body weight: {model.pi_:.2%} of claims below threshold")
print(model.summary(claim_amounts))

# Value at Risk and TVaR
print(f"99th percentile: £{model.var(0.99):,.0f}")
print(f"TVaR(99%): £{model.tvar(0.99):,.0f}")

# ILF at standard motor BI limits
for lim in [250_000, 500_000, 1_000_000]:
    print(f"ILF({lim:,}): {model.ilf(lim, basic_limit=250_000):.4f}")

model.plot_fit(claim_amounts)
```
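
For reference, an increased limit factor is a ratio of limited expected values: `ILF(L) = E[min(X, L)] / E[min(X, b)]`, with `b` the basic limit. The sketch below works this through for a plain lognormal severity rather than the composite model, using the standard closed form for the lognormal limited expected value. The helper names `lognormal_lev` and `ilf` and the parameter values are hypothetical.

```python
import numpy as np
from scipy.stats import norm


def lognormal_lev(limit, mu, sigma):
    """E[min(X, limit)] for X ~ Lognormal(mu, sigma)."""
    z = (np.log(limit) - mu) / sigma
    mean = np.exp(mu + 0.5 * sigma**2)
    # Expected loss below the limit plus the limit paid on claims above it
    return mean * norm.cdf(z - sigma) + limit * norm.sf(z)


def ilf(limit, basic_limit, mu, sigma):
    """Increased limit factor relative to basic_limit."""
    return lognormal_lev(limit, mu, sigma) / lognormal_lev(basic_limit, mu, sigma)


mu, sigma = 9.0, 1.8  # hypothetical parameters
for lim in [250_000, 500_000, 1_000_000]:
    print(f"ILF({lim:,}): {ilf(lim, 250_000, mu, sigma):.4f}")
```

The factor equals 1 at the basic limit and increases with the limit, flattening as the limit moves further into the tail.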

```python
# With covariates (covariate-dependent threshold)
reg = CompositeSeverityRegressor(
    composite=LognormalBurrComposite(threshold_method="mode_matching"),
    feature_cols=["vehicle_age", "driver_age", "region"],
)
reg.fit(X_train, y_train)

# Each policyholder gets their own threshold
thresholds = reg.predict_thresholds(X_test)
print(f"Threshold range: £{thresholds.min():,.0f} – £{thresholds.max():,.0f}")

# ILF schedule per policyholder
ilf = reg.compute_ilf(
    X_test,
    limits=[50_000, 100_000, 250_000, 500_000, 1_000_000],
    basic_limit=250_000,
)
```

```python
# GPD tail with fixed threshold (when you have a natural attachment point)
from insurance_composite import LognormalGPDComposite

model = LognormalGPDComposite(threshold=100_000.0, threshold_method="fixed")
model.fit(claims_above_deductible)
```

```python
# Profile likelihood threshold (data-driven, no mode-matching constraint)
from insurance_composite import GammaGPDComposite

model = GammaGPDComposite(threshold_method="profile_likelihood")
model.fit(claim_amounts)
print(f"Selected threshold: £{model.threshold_:,.0f}")
```
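
The idea behind profile-likelihood threshold selection can be sketched as a grid search: for each candidate threshold, fit the body to claims at or below it and a GPD to the excesses above it, then keep the threshold with the highest total log-likelihood. This is an illustration on simulated data using `scipy` only. The helper `composite_loglik`, the lognormal body, and the candidate grid are assumptions, and the library's own routine may differ.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize


def composite_loglik(x, u):
    """Maximised log-likelihood of a lognormal-GPD splice at threshold u."""
    below, excess = x[x <= u], x[x > u] - u
    pi = below.size / x.size  # MLE of the body weight

    # Body: lognormal truncated to (0, u], fitted by numerical MLE
    def neg_ll_body(theta):
        mu, sigma = theta[0], np.exp(theta[1])
        dist = stats.lognorm(s=sigma, scale=np.exp(mu))
        return -(dist.logpdf(below) - dist.logcdf(u)).sum()

    start = [np.log(below).mean(), np.log(np.log(below).std())]
    body = minimize(neg_ll_body, start, method="Nelder-Mead")

    # Tail: GPD fitted to the excesses, location pinned at zero
    xi, _, scale = stats.genpareto.fit(excess, floc=0)
    ll_tail = stats.genpareto.logpdf(excess, xi, loc=0, scale=scale).sum()

    return (
        below.size * np.log(pi) - body.fun
        + excess.size * np.log(1.0 - pi) + ll_tail
    )


# Simulated claims: lognormal body below 20k, GPD excesses above (illustrative)
rng = np.random.default_rng(0)
body_draws = stats.lognorm(s=1.0, scale=np.exp(8.0)).rvs(20_000, random_state=rng)
body_draws = body_draws[body_draws <= 20_000][:1_800]
tail_draws = 20_000 + stats.genpareto(c=0.3, scale=40_000).rvs(200, random_state=rng)
claims = np.concatenate([body_draws, tail_draws])

candidates = np.quantile(claims, [0.80, 0.85, 0.90, 0.95])
logliks = np.array([composite_loglik(claims, u) for u in candidates])
best = candidates[np.argmax(logliks)]
print(f"Selected threshold: {best:,.0f}")
```

Every candidate is scored on the full data set, so the log-likelihoods are directly comparable across thresholds. A finer grid trades run time for resolution.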

## Diagnostics

```python
from insurance_composite.diagnostics import (
    quantile_residuals,
    density_overlay_plot,
    qq_plot,
    mean_excess_plot,
)

# Randomized quantile residuals (Dunn & Smyth 1996)
# Should be N(0,1) under correct specification
resid = quantile_residuals(model, claim_amounts)

# Visual checks
density_overlay_plot(model, claim_amounts)   # fitted density on histogram
qq_plot(model, claim_amounts)               # model vs empirical quantiles
mean_excess_plot(claim_amounts)             # threshold guidance
```
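
For a continuous severity model, the quantile residual is the fitted CDF pushed through the standard normal quantile function, `r_i = Phi^{-1}(F(y_i))`. The randomisation in Dunn & Smyth's construction is only needed where the CDF has jumps. A minimal sketch, using a hypothetical helper `quantile_residuals_sketch` that accepts any CDF callable (it is not the library function imported above):

```python
import numpy as np
from scipy import stats


def quantile_residuals_sketch(cdf, y, eps=1e-12):
    """Normal quantile residuals for a continuous model with CDF `cdf`."""
    # Clip probabilities away from 0 and 1 so the residuals stay finite
    p = np.clip(cdf(np.asarray(y, dtype=float)), eps, 1.0 - eps)
    return stats.norm.ppf(p)


# Illustration: data simulated from the same lognormal used as the "model"
rng = np.random.default_rng(1)
model_dist = stats.lognorm(s=1.2, scale=np.exp(8.0))
y = model_dist.rvs(5_000, random_state=rng)

resid = quantile_residuals_sketch(model_dist.cdf, y)
print(f"mean={resid.mean():.3f}, sd={resid.std(ddof=1):.3f}")
```

When the model is correctly specified, as it is by construction here, the residual mean and standard deviation should sit close to 0 and 1. Systematic departures in the upper residuals point to a mis-specified tail.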

## UK insurance applications

**Motor BI**: Body captures typical whiplash/minor injury claims (£500–£20k). Tail captures serious injury (£50k–£5m). GPD tail index xi ≈ 0.2–0.4 for UK motor BI.

**Employers' Liability**: Occupation class enters the tail scale. Different industries have wildly different tail behaviour (asbestos claims vs. office injuries). The composite threshold shifts with occupation.

**PI/D&O**: The deductible is the natural threshold: set `threshold_method="fixed"` and pass the deductible as `threshold`. Fitting a GPD above the deductible is standard actuarial practice.

**Reinsurance pricing**: XL layer (d, d+L] pricing uses `model.limited_expected_value(d+L) - model.limited_expected_value(d)` directly from the fitted composite.
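
The layer formula follows from the identity that `E[min(X, d)]` is the integral of the survival function `S(x)` over `[0, d]`, so the expected loss to the layer is the integral of `S(x)` over `(d, d+L]`. A sketch using numerical integration on an arbitrary frozen `scipy.stats` distribution is shown below. The lognormal and its parameters are stand-ins for a fitted composite, and `lev` and `layer_expected_loss` are hypothetical helpers.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad


def lev(dist, limit):
    """Limited expected value E[min(X, limit)], integrating S(x) on [0, limit]."""
    value, _ = quad(dist.sf, 0.0, limit, limit=200)
    return value


def layer_expected_loss(dist, d, L):
    """Expected loss to the layer (d, d + L]."""
    return lev(dist, d + L) - lev(dist, d)


severity = stats.lognorm(s=1.8, scale=np.exp(9.0))  # hypothetical severity
d, L = 250_000.0, 750_000.0
layer_loss = layer_expected_loss(severity, d, L)
print(f"Expected loss to {L:,.0f} xs {d:,.0f}: {layer_loss:,.0f}")
```

Multiplying the result by the expected claim count gives the expected annual loss to the layer, before any loadings.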

## Comparison to R packages

| Capability | ReIns | evmix | insurance-composite |
|---|---|---|---|
| Composite body + tail | ME-Pareto | Various-GPD | LN/Gamma + Burr/GPD |
| Regression covariates | No | No | Yes |
| Covariate-dependent threshold | No | No | Yes (mode-matching) |
| Mode-matching threshold | No | No | Yes (Burr XII) |
| Profile likelihood threshold | No | Yes | Yes |
| scikit-learn API | No | No | Yes |
| Python | No | No | Yes |

The R packages handle univariate fitting well. This library's differentiator is regression with covariate-dependent thresholds, which neither R package offers.

## Methodology

Based on:
- Liu, Li, Shi (2024) — GBII composite regression with varying threshold. *Insurance: Mathematics and Economics*.
- Fung et al. (2022) — Lognormal-T composite regression. arXiv:2208.01262.
- Reynkens et al. (2017) — ME-Pareto splicing with censoring. *IME* 77:65-77.

The mode-matching approach sets the threshold equal to the tail distribution's mode. This guarantees C1 continuity at the splice point: both the density and its derivative are continuous. In practice it also stabilises estimation because the threshold is a function of the tail parameters (for Burr XII, the shapes and the covariate-driven scale) rather than a separately estimated quantity.

## V2 roadmap

- Mixed Erlang body (EM algorithm, dense nonparametric body)
- Soft splicing (Fung-Jeong-Tzougas 2024)
- Censored data support (Reynkens et al. 2017)
- Full GBII distribution family

## License

MIT
