Metadata-Version: 2.4
Name: insurance-severity
Version: 0.2.0
Summary: Insurance severity modelling: composite (spliced) models and Distributional Refinement Networks
Project-URL: Homepage, https://github.com/burning-cost/insurance-severity
Project-URL: Repository, https://github.com/burning-cost/insurance-severity
Project-URL: Bug Tracker, https://github.com/burning-cost/insurance-severity/issues
Project-URL: Documentation, https://github.com/burning-cost/insurance-severity#readme
Author-email: Burning Cost <pricing.frontier@gmail.com>
License: MIT
License-File: LICENSE
Keywords: EVT,actuarial,composite,distributional-forecasting,drn,gamma,glm,histogram,insurance,neural-network,regression,severity,solvency-ii,spliced
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Office/Business :: Financial
Classifier: Topic :: Scientific/Engineering :: Mathematics
Requires-Python: >=3.10
Requires-Dist: numpy>=1.24
Requires-Dist: pandas>=2.0.0
Requires-Dist: polars>=1.0
Requires-Dist: scikit-learn>=1.3
Requires-Dist: scipy>=1.10
Provides-Extra: all
Requires-Dist: catboost>=1.2.0; extra == 'all'
Requires-Dist: matplotlib>=3.7; extra == 'all'
Requires-Dist: patsy>=0.5; extra == 'all'
Requires-Dist: statsmodels>=0.14.5; extra == 'all'
Requires-Dist: torch>=2.0.0; extra == 'all'
Provides-Extra: catboost
Requires-Dist: catboost>=1.2.0; extra == 'catboost'
Provides-Extra: dev
Requires-Dist: matplotlib>=3.7; extra == 'dev'
Requires-Dist: pytest-cov; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: statsmodels>=0.14.5; extra == 'dev'
Provides-Extra: drn
Requires-Dist: torch>=2.0.0; extra == 'drn'
Provides-Extra: formulas
Requires-Dist: patsy>=0.5; extra == 'formulas'
Provides-Extra: glm
Requires-Dist: statsmodels>=0.14.5; extra == 'glm'
Provides-Extra: plotting
Requires-Dist: matplotlib>=3.7; extra == 'plotting'
Description-Content-Type: text/markdown

# insurance-severity

[![PyPI](https://img.shields.io/pypi/v/insurance-severity)](https://pypi.org/project/insurance-severity/)
[![Python](https://img.shields.io/pypi/pyversions/insurance-severity)](https://pypi.org/project/insurance-severity/)
[![Tests](https://img.shields.io/badge/tests-passing-brightgreen)]()
[![License](https://img.shields.io/badge/license-MIT-blue)]()

Comprehensive severity modelling for UK insurance pricing. Two complementary approaches in one package.

## The problem

Claim severity distributions don't behave like textbook Gamma distributions. You have a body of attritional losses and a heavy tail of large losses, and these two populations have different drivers. Standard GLMs smooth over this structure. This package gives you two principled ways to deal with it.

## Quick Start

```bash
pip install insurance-severity
```

```python
import numpy as np
from insurance_severity import LognormalBurrComposite

rng = np.random.default_rng(42)

# Synthetic severity: lognormal attritional body + heavy Pareto-like tail
attritional = rng.lognormal(mean=7.5, sigma=1.0, size=850)    # ~85% of claims
large_loss  = rng.pareto(a=2.5, size=150) * 40_000 + 8_000    # ~15% large losses
claims = np.concatenate([attritional, large_loss])
rng.shuffle(claims)

# Fit the composite model — mode matching selects the threshold automatically
model = LognormalBurrComposite(threshold_method="mode_matching")
model.fit(claims)

print(f"Threshold:       £{model.threshold_:,.0f}")
# body_params_ = [mu, sigma] for the lognormal; tail_params_ = [alpha, delta, beta] for Burr XII
mu, sigma = model.body_params_
alpha, delta, beta = model.tail_params_
print(f"Body lognormal:  mu={mu:.3f}, sigma={sigma:.3f}  (log-scale)")
print(f"Tail Burr XII:   alpha={alpha:.3f}, delta={delta:.3f}, beta={beta:,.0f}")
print(f"Body weight pi:  {model.pi_:.3f}  ({model.pi_*100:.1f}% of claims are attritional)")

# ILF: ratio of limited expected loss at the £500k limit to that at the £250k basic limit
ilf = model.ilf(limit=500_000, basic_limit=250_000)
print(f"ILF at £500k limit / £250k basic: {ilf:.4f}")

# Tail Value at Risk at 99.5th percentile (Solvency II capital proxy)
tvar = model.tvar(alpha=0.995)
print(f"TVaR 99.5%: £{tvar:,.0f}")
```
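For reference, the two tail quantities computed above follow the standard actuarial definitions (assuming the package uses the usual conventions), with $X$ the claim severity and $B$ the basic limit:

$$
\mathrm{ILF}(L) = \frac{\mathbb{E}[\min(X, L)]}{\mathbb{E}[\min(X, B)]},
\qquad
\mathrm{TVaR}_{\alpha} = \mathbb{E}\big[\,X \mid X > \mathrm{VaR}_{\alpha}(X)\,\big].
$$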

## What's in the package

### `insurance_severity.composite` — spliced severity models

Composite (spliced) models divide the claim distribution at a threshold into a body distribution (moderate claims) and a tail distribution (large losses). Each component is fitted separately and joined at the threshold.
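In the usual spliced parameterisation (a sketch of the general form; the package's exact weighting convention may differ), the density with threshold $u$ and body weight $\pi$ is

$$
f(x) =
\begin{cases}
\pi \, \dfrac{f_{\text{body}}(x)}{F_{\text{body}}(u)}, & x \le u, \\[6pt]
(1 - \pi) \, \dfrac{f_{\text{tail}}(x)}{1 - F_{\text{tail}}(u)}, & x > u,
\end{cases}
$$

so each component is a proper density on its own side of the threshold, and $\pi$ plays the role of `model.pi_` in the Quick Start.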

What this package adds over existing composite-model implementations (including the R packages): **covariate-dependent thresholds**. For a motor book, the threshold between attritional and large losses isn't the same for an HGV fleet as for a private motor policy. With mode-matching regression, the threshold varies by policyholder.

Supported combinations:
- Lognormal body + Burr XII tail (mode-matching supported)
- Lognormal body + GPD tail
- Gamma body + GPD tail

Features:
- Fixed threshold, profile likelihood, and mode-matching threshold methods
- Covariate-dependent tail scale regression (`CompositeSeverityRegressor`)
- ILF computation per policyholder
- TVaR (Tail Value at Risk)
- Quantile residuals, mean excess plots, Q-Q plots

```python
import numpy as np
from insurance_severity import LognormalBurrComposite, CompositeSeverityRegressor

rng = np.random.default_rng(42)
n = 1000

# Synthetic severity data: lognormal attritional body + Pareto-like large losses
attritional = rng.lognormal(mean=7.5, sigma=1.2, size=int(n * 0.85))  # ~£1,800 median
large_losses = rng.pareto(a=2.5, size=int(n * 0.15)) * 50_000 + 10_000
claim_amounts = np.concatenate([attritional, large_losses])
rng.shuffle(claim_amounts)

# Rating factors for regression example
vehicle_age = rng.integers(0, 15, n).astype(float)
driver_age = rng.integers(18, 75, n).astype(float)
ncd_years = rng.integers(0, 5, n).astype(float)
X = np.column_stack([vehicle_age, driver_age, ncd_years])
n_train = int(0.8 * n)
X_train, X_test = X[:n_train], X[n_train:]
y_train = claim_amounts[:n_train]

# No covariates -- mode-matching threshold
model = LognormalBurrComposite(threshold_method="mode_matching")
model.fit(claim_amounts)
print(model.threshold_)
print(model.ilf(limit=500_000, basic_limit=250_000))

# With covariates -- threshold varies by policyholder
reg = CompositeSeverityRegressor(
    composite=LognormalBurrComposite(threshold_method="mode_matching"),
)
reg.fit(X_train, y_train)
thresholds = reg.predict_thresholds(X_test)   # per-policyholder thresholds
ilf_matrix = reg.compute_ilf(X_test, limits=[100_000, 250_000, 500_000, 1_000_000])
```

### `insurance_severity.drn` — Distributional Refinement Network

The DRN (Avanzi, Dong, Laub, Wong 2024, arXiv:2406.00998) starts from a frozen GLM or GBM baseline and refines it into a full predictive distribution using a neural network. The network outputs bin-probability adjustments to a histogram representation of the distribution, not the mean.

The practical payoff: you keep the actuarial calibration of your existing GLM pricing model, and the neural network fills in the distributional shape that the GLM can't capture — skewness, heteroskedastic dispersion, tail behaviour by segment.
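As a rough illustration of the refinement step (a toy numpy sketch of the idea, not the package's implementation; the bin edges, baseline probabilities, and adjustments below are made up):

```python
import numpy as np

# The frozen baseline implies a probability for each histogram bin of the claim
# distribution; the refinement network outputs one log-adjustment per bin.
cutpoints = np.array([0.0, 1_000.0, 5_000.0, 20_000.0, 100_000.0])  # illustrative bin edges
baseline_probs = np.array([0.40, 0.35, 0.20, 0.05])                 # implied by the baseline GLM
log_adjustments = np.array([-0.20, 0.10, 0.05, 0.40])               # hypothetical network output

# Refined bin probabilities: baseline times exp(adjustment), renormalised to a valid pmf.
refined = baseline_probs * np.exp(log_adjustments)
refined /= refined.sum()

shift = refined - baseline_probs
top_bin = np.argmax(shift)
print(refined)
print(f"largest gain in bin [{cutpoints[top_bin]:,.0f}, {cutpoints[top_bin + 1]:,.0f})")
```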

**Note:** the DRN subpackage requires PyTorch (the `drn` extra); the example below also uses statsmodels for the baseline GLM (the `glm` extra):

```bash
pip install "insurance-severity[drn,glm]"
# or install a CPU-only PyTorch build explicitly, then add the glm extra:
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install "insurance-severity[glm]"
```

```python
import numpy as np
from insurance_severity import GLMBaseline, DRN
import statsmodels.formula.api as smf
import statsmodels.api as sm
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

# Synthetic severity data with covariates
vehicle_age = rng.integers(0, 15, n).astype(float)
driver_age = rng.integers(18, 75, n).astype(float)
ncd_years = rng.integers(0, 5, n).astype(float)
region = rng.choice(["London", "SE", "NW", "Midlands", "Scotland"], n)

# Lognormal claim amounts with some large losses
log_mu = 7.5 + 0.03 * vehicle_age - 0.005 * driver_age - 0.05 * ncd_years
claim_amounts = rng.lognormal(mean=log_mu, sigma=1.1 + 0.02 * vehicle_age)

n_train = int(0.8 * n)
df = pd.DataFrame({
    "claims": claim_amounts,
    "age": driver_age,
    "vehicle_age": vehicle_age,
    "region": region,
})
df_train = df.iloc[:n_train]
df_test = df.iloc[n_train:]
X_train = np.column_stack([vehicle_age[:n_train], driver_age[:n_train], ncd_years[:n_train]])
X_test = np.column_stack([vehicle_age[n_train:], driver_age[n_train:], ncd_years[n_train:]])
y_train = claim_amounts[:n_train]
y_test = claim_amounts[n_train:]

# Fit your existing GLM
glm = smf.glm(
    "claims ~ age + C(region) + vehicle_age",
    data=df_train,
    family=sm.families.Gamma(sm.families.links.Log()),
).fit()

# Wrap it as a baseline
baseline = GLMBaseline(glm)

# Refine with DRN (scr_aware=True enables SCR-aware bin cutpoints at 99.5th percentile)
drn = DRN(baseline, hidden_size=64, max_epochs=300, scr_aware=True)
drn.fit(X_train, y_train, verbose=True)

# Full predictive distribution per policyholder
dist = drn.predict_distribution(X_test)
print(dist.mean())           # (n,) expected claim
print(dist.quantile(0.995))  # 99.5th percentile -- Solvency II SCR
print(dist.crps(y_test))     # CRPS scoring
```
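The CRPS reported above is the continuous ranked probability score: for a predictive CDF $F$ and an observed claim $y$,

$$
\mathrm{CRPS}(F, y) = \int_{-\infty}^{\infty} \big(F(z) - \mathbf{1}\{z \ge y\}\big)^2 \, dz,
$$

so lower is better, and the score rewards the calibration and sharpness of the whole predictive distribution rather than just the mean.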

## Installation

```bash
pip install insurance-severity
```

With GLM support (statsmodels):

```bash
pip install insurance-severity[glm]
```

## Subpackages

Access subpackages directly if you only need one approach:

```python
from insurance_severity.composite import LognormalBurrComposite
from insurance_severity.drn import DRN
```

---


## Databricks Notebook

A ready-to-run Databricks notebook benchmarking this library against standard approaches is available in [burning-cost-examples](https://github.com/burning-cost/burning-cost-examples/blob/main/notebooks/insurance_severity_demo.py).

## Performance

Benchmarked against a **single Lognormal** on 2,500 train / 500 test synthetic claims with a known heavy-tailed DGP — Lognormal body (85%) below the splice point, Pareto tail (alpha=2.0, 15%) above. Numbers reflect the current implementation, in which the composite `predict()` returns the full mixture and pi is fitted from the data. Full script: `benchmarks/benchmark.py`.

DGP true quantiles: 95th=12,087 | 99th=24,701 | 99.5th=33,435.

| Metric | Single Lognormal | LognormalGPDComposite |
|--------|-----------------|----------------------|
| 95th percentile | 13,178 (err 1,091) | 11,586 (err 501) |
| 99th percentile | 27,702 (err 3,001) | 22,551 (err 2,150) |
| 99.5th percentile | 36,360 (err 2,925) | 29,465 (err 3,970) |
| Total tail error (sum of 3) | 7,017 | 6,621 |
| Test log-likelihood | -4,613.8 | -4,613.6 |
| Fit time | <1s | 2–5s (profile-likelihood grid) |

Overall tail quantile improvement: **5.6%** reduction in absolute error across the three quantile levels. The composite model fits a GPD tail above the profile-likelihood threshold (fitted ~5,022) with body weight pi=0.73.

The improvement is modest but directionally correct. On 3,000 observations with moderate tail heaviness (Pareto alpha=2.0), the profile-likelihood threshold selection has limited data in the tail: the composite's 99.5th estimate undershoots the true value by more than the lognormal overshoots it (error 3,970 vs 2,925), because the GPD is fitting from only ~375 tail observations. At larger sample sizes (20k+) the tail fit stabilises and the composite advantage grows.

**When to use:** XL reinsurance pricing (where the expected loss in a layer depends entirely on tail behaviour), ILF computation at high policy limits, and large loss loading in ground-up pricing where the true severity distribution is Pareto-like. Concrete situations: motor bodily injury, liability lines, property CAT perils.
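For context on the XL point: the expected loss to a layer $l$ xs $a$ depends only on the severity survival function $S(x) = 1 - F(x)$ above the attachment,

$$
\mathbb{E}\big[\min\big((X - a)_+,\, l\big)\big] = \int_{a}^{a+l} S(x)\, dx,
$$

which is why the quality of the fitted tail above the threshold drives layer pricing.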

**When NOT to use:** When claims are capped (loss-limited data). Also when the portfolio has fewer than a few thousand claims — the profile-likelihood threshold selection is unstable with sparse data and the composite model may overfit the tail. For homogeneous attritional loss books where a Gamma fits well (small commercial property), the added complexity is not warranted.


## Source repos

- `insurance-composite` — archived, merged into this package
- `insurance-drn` — archived, merged into this package

## Related Libraries

| Library | What it does |
|---------|-------------|
| [insurance-frequency-severity](https://github.com/burning-cost/insurance-frequency-severity) | Joint frequency-severity models — combines this library's severity component with frequency in a Sarmanov copula |
| [insurance-quantile](https://github.com/burning-cost/insurance-quantile) | Quantile GBM for tail risk — non-parametric alternative when parametric severity assumptions are not tenable |
| [insurance-dynamics](https://github.com/burning-cost/insurance-dynamics) | Loss development models — severity projections are a key input to dynamic reserve models |
