Metadata-Version: 2.4
Name: insurance-eqrn
Version: 0.1.0
Summary: Extreme Quantile Regression Neural Networks for insurance pricing — covariate-dependent GPD tail modelling
Author-email: Burning Cost <pricing.frontier@gmail.com>
License: MIT
Keywords: GPD,actuarial,extreme-value-theory,insurance,neural-networks,pytorch,quantile-regression,reinsurance
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Office/Business :: Financial
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: lightgbm>=4.0
Requires-Dist: matplotlib>=3.7
Requires-Dist: numpy>=1.24
Requires-Dist: pandas>=2.0
Requires-Dist: scikit-learn>=1.3
Requires-Dist: scipy>=1.10
Requires-Dist: torch>=2.0
Provides-Extra: dev
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Description-Content-Type: text/markdown

# insurance-eqrn

Extreme Quantile Regression Neural Networks for insurance pricing.

## The problem

Your EVT model gives you the 1-in-200 claim for the portfolio. EQRN gives you the 1-in-200 claim for the Kensington flat vs the Somerset farmhouse. That difference is your reinsurance margin.

The standard approach to extreme severity modelling — fit a GPD to all claims above a threshold, read off the 99.5th percentile — pools everything together. It gives you one shape parameter and one scale parameter for the whole book. If your TPBI claims have a heavier tail for younger injured parties and lighter for older ones, the pooled model averages those tails away. Your per-segment VaR is wrong and your XL pricing is wrong.

The solution is covariate-dependent GPD parameters: xi(x) and sigma(x) as functions of risk characteristics, not pooled scalars. This is what EQRN does.

EQRN (Pasche & Engelke 2024, *Annals of Applied Statistics*) is the first method to estimate covariate-dependent GPD parameters using a neural network. This library is the first Python implementation.
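The pooling effect described above can be illustrated with a small worked example. This is a sketch under stated assumptions, not part of the library: the two segments, their weights, and their GPD parameters are hypothetical, and the "pooled" figure is the quantile of the mixture that pooled data would follow.

```python
def gpd_survival(z, sigma, xi):
    """P(Z > z) for a GPD exceedance Z with scale sigma and shape xi > 0."""
    return (1.0 + xi * z / sigma) ** (-1.0 / xi)


def gpd_quantile(p, sigma, xi):
    """Level-p quantile of the exceedance distribution (xi > 0)."""
    return sigma / xi * ((1.0 - p) ** (-xi) - 1.0)


def pooled_quantile(p, segments, hi=1e9, tol=1e-9):
    """Level-p quantile of a mixture of GPD segments, found by bisection.

    segments: list of (weight, sigma, xi) with weights summing to 1.
    """
    target = 1.0 - p
    lo = 0.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        surv = sum(w * gpd_survival(mid, s, x) for w, s, x in segments)
        if surv > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical two-segment book: same scale, different tail heaviness.
light = (0.5, 1.0, 0.1)
heavy = (0.5, 1.0, 0.5)
# 97.5% of exceedances corresponds to 99.5% overall when tau_0 = 0.8.
p = 0.975

q_light = gpd_quantile(p, light[1], light[2])
q_heavy = gpd_quantile(p, heavy[1], heavy[2])
q_pooled = pooled_quantile(p, [light, heavy])
print(q_light, q_pooled, q_heavy)
```

The pooled quantile lands between the two segment quantiles, so it overstates the light-tailed segment and understates the heavy-tailed one.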

## What this library provides

- `EQRNModel` — two-step fitting: LightGBM intermediate quantile + GPD neural network
- `EQRNDiagnostics` — QQ plot, threshold stability, calibration, xi scatter
- Out-of-fold intermediate quantile estimation (prevents leakage into GPD step)
- Orthogonal GPD reparameterisation for stable gradient training
- `predict_quantile` — conditional VaR at any extreme level (0.99, 0.995, ...)
- `predict_tvar` — conditional TVaR / expected shortfall
- `predict_exceedance_prob` — P(claim > threshold | risk profile)
- `predict_xl_layer` — expected loss in per-risk XL layer (attachment, limit)

## Install

```bash
pip install insurance-eqrn
```

PyTorch is required. For CPU-only:

```bash
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install insurance-eqrn
```

## Quickstart

```python
import numpy as np
from insurance_eqrn import EQRNModel, EQRNDiagnostics

# X_*: covariate matrices (e.g. risk characteristics)
# y_*: claim severity values (above basic threshold)
# X_train, y_train, X_val, y_val and X_test are assumed to be loaded already.
model = EQRNModel(
    tau_0=0.85,             # intermediate quantile level
    hidden_sizes=(32, 16, 8),
    n_epochs=300,
    shape_fixed=False,      # covariate-dependent xi
    seed=42,
)
model.fit(X_train, y_train, X_val=X_val, y_val=y_val)

# Per-segment 99.5th percentile severity
var_995 = model.predict_quantile(X_test, q=0.995)

# TVaR for reinsurance pricing
tvar_99 = model.predict_tvar(X_test, q=0.99)

# XL layer: £500k xs £500k
xl_loss = model.predict_xl_layer(X_test, attachment=500_000, limit=500_000)

# Fitted GPD parameters per observation
params = model.predict_params(X_test)
# DataFrame with columns: xi, sigma, nu, threshold
```
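Given the per-observation `xi`, `sigma` and `threshold` columns, exceedance probabilities and TVaR follow from standard GPD closed forms. The sketch below shows those formulas for a single risk. The function names and the example values are hypothetical, and the library's own `predict_exceedance_prob` and `predict_tvar` may be implemented differently.

```python
import math


def gpd_exceedance_prob(y, threshold, sigma, xi, tau_0):
    """P(Y > y | x) for y at or above the intermediate quantile `threshold`."""
    z = y - threshold
    if abs(xi) < 1e-8:
        tail = math.exp(-z / sigma)  # exponential limit
    else:
        tail = max(1.0 + xi * z / sigma, 0.0) ** (-1.0 / xi)
    return (1.0 - tau_0) * tail


def gpd_tvar(threshold, sigma, xi, tau_0, tau):
    """TVaR at level tau > tau_0. The mean excess is finite only when xi < 1."""
    if xi >= 1.0:
        return math.inf
    ratio = (1.0 - tau_0) / (1.0 - tau)
    if abs(xi) < 1e-8:
        var = threshold + sigma * math.log(ratio)
    else:
        var = threshold + sigma / xi * (ratio ** xi - 1.0)
    return (var + sigma - xi * threshold) / (1.0 - xi)


# Hypothetical parameter row for one risk (values are illustrative only).
row = {"xi": 0.5, "sigma": 50_000.0, "threshold": 100_000.0}
p_500k = gpd_exceedance_prob(500_000.0, row["threshold"], row["sigma"], row["xi"], tau_0=0.8)
tvar_99 = gpd_tvar(row["threshold"], row["sigma"], row["xi"], tau_0=0.8, tau=0.99)
print(p_500k, tvar_99)
```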

## The two-step method

**Step 1: Intermediate quantile (LightGBM, out-of-fold)**

Fits a quantile regression at level tau_0 (default 0.8) using K-fold cross-validation. Out-of-fold predictions are mandatory here. In-sample predictions from a flexible learner overfit the training claims, so the thresholds sit too close to the observed values. The GPD network in Step 2 then trains on an exceedance set that no longer corresponds to a true tau_0-level quantile.
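The out-of-fold mechanics can be sketched without LightGBM. In the example below a deliberately trivial estimator (the empirical quantile of the training folds, ignoring covariates) stands in for the quantile regressor. All function names are hypothetical and the data are simulated.

```python
import random


def empirical_quantile(values, level):
    """Order-statistic quantile; a stand-in for a fitted quantile model."""
    ordered = sorted(values)
    idx = min(int(level * len(ordered)), len(ordered) - 1)
    return ordered[idx]


def out_of_fold_quantiles(y, level, n_folds=5, seed=0):
    """One threshold per observation, each from a fit that never saw it."""
    n = len(y)
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[k::n_folds] for k in range(n_folds)]
    oof = [None] * n
    for held_out in folds:
        held = set(held_out)
        train_y = [y[i] for i in indices if i not in held]
        q = empirical_quantile(train_y, level)  # "fit" on the other folds only
        for i in held_out:
            oof[i] = q  # "predict" for the held-out fold
    return oof


rng = random.Random(1)
y = [rng.expovariate(1.0) for _ in range(1000)]  # simulated severities
thresholds = out_of_fold_quantiles(y, level=0.8)
exceed_rate = sum(yi > ti for yi, ti in zip(y, thresholds)) / len(y)
print(round(exceed_rate, 3))
```

Because every threshold comes from data the observation was not part of, the share of exceedances stays close to 1 - tau_0.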

**Step 2: GPD neural network on exceedances**

Identifies observations above their predicted threshold (~20% of training data at tau_0=0.8). Trains a feedforward network mapping (X, Q_hat(tau_0)) → (nu(x), xi(x)) using the orthogonal GPD deviance loss.

The orthogonal parameterisation (nu = sigma * (xi + 1)) makes the Fisher information matrix diagonal, which stabilises Adam training substantially compared to the direct (sigma, xi) parameterisation.
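As a rough illustration of Step 2, the sketch below evaluates the GPD negative log-likelihood in the orthogonal parameters (nu, xi) on the observations that exceed their threshold. It assumes nu > 0 and xi > -1. The function names are hypothetical, and the library's deviance loss may differ from this by constants or scaling.

```python
import math


def gpd_nll_orthogonal(z, nu, xi):
    """Negative log-likelihood of one exceedance z > 0 under GPD(sigma, xi),
    written in the orthogonal parameters with sigma = nu / (xi + 1)."""
    sigma = nu / (xi + 1.0)
    if abs(xi) < 1e-8:
        return math.log(sigma) + z / sigma  # exponential limit
    arg = 1.0 + xi * z / sigma
    if arg <= 0.0:
        return math.inf  # z lies outside the GPD support
    return math.log(sigma) + (1.0 + 1.0 / xi) * math.log(arg)


def mean_exceedance_loss(y, q_hat, nu, xi):
    """Average NLL over observations that exceed their own threshold."""
    losses = [
        gpd_nll_orthogonal(yi - qi, ni, xii)
        for yi, qi, ni, xii in zip(y, q_hat, nu, xi)
        if yi > qi
    ]
    return sum(losses) / len(losses)


# Illustrative values only: the second claim sits below its threshold
# and therefore does not enter the loss.
loss = mean_exceedance_loss(
    y=[5.0, 1.0, 7.0],
    q_hat=[3.0, 3.0, 3.0],
    nu=[1.5, 1.5, 2.0],
    xi=[0.5, 0.5, 0.0],
)
print(loss)
```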

**Prediction**

For a new observation x at target level tau > tau_0:

```
Q_x(tau) = Q_hat_x(tau_0) + sigma(x)/xi(x) * [((1-tau_0)/(1-tau))^xi(x) - 1]
```

At xi ≈ 0 (exponential limit), this is `Q_hat + sigma * log((1-tau_0)/(1-tau))`.
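The formula and its exponential limit translate directly into a few lines of Python. This is a standalone sketch for a single observation, with a hypothetical function name and illustrative inputs. It is not the library's `predict_quantile`.

```python
import math


def extreme_quantile(q_hat, sigma, xi, tau_0, tau):
    """Extrapolate from the intermediate quantile q_hat (level tau_0)
    to the target level tau using the GPD tail."""
    if not tau_0 < tau < 1.0:
        raise ValueError("tau must satisfy tau_0 < tau < 1")
    ratio = (1.0 - tau_0) / (1.0 - tau)
    if abs(xi) < 1e-8:
        return q_hat + sigma * math.log(ratio)  # exponential limit
    return q_hat + sigma / xi * (ratio ** xi - 1.0)


# Illustrative inputs: intermediate quantile of 100k, sigma of 50k, xi of 0.5.
q_995 = extreme_quantile(100_000.0, 50_000.0, 0.5, tau_0=0.8, tau=0.995)
print(q_995)
```

The explicit branch near xi = 0 avoids dividing by a value close to zero.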

## Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| tau_0 | 0.8 | Intermediate quantile level. A lower value leaves more exceedances for the GPD step on small datasets; a higher value reduces threshold bias on large ones |
| hidden_sizes | (32, 16, 8) | Network hidden layer widths |
| n_epochs | 500 | Maximum training epochs |
| patience | 50 | Early stopping patience |
| shape_fixed | False | If True, xi is a single scalar shared by all risks. Start here before fitting the full model |
| l2_pen | 1e-4 | L2 weight decay |
| shape_penalty | 0 | Penalty on variance of xi(x) — smooths the shape surface |
| p_drop | 0 | Dropout probability. Try 0.1–0.2 for small datasets |
| n_folds | 5 | K-fold folds for OOF intermediate quantile |
| seed | None | Random seed |

## Diagnostics

```python
from insurance_eqrn import EQRNDiagnostics

diag = EQRNDiagnostics(model)

# GPD QQ plot — should track the diagonal if the tail model is correct
diag.qq_plot(X_test, y_test)

# Predicted vs empirical coverage at each quantile level
diag.calibration_plot(X_test, y_test, levels=[0.9, 0.95, 0.99, 0.995])

# Mean residual life plot — linearity onset shows where GPD approximation holds
diag.mean_residual_life_plot(y_train)

# Threshold stability — fit shape_fixed models at each tau_0, look for plateau
diag.threshold_stability_plot(X_train, y_train)

# Summary table: predicted vs empirical exceedance rates
diag.summary_table(X_test, y_test)
```
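The idea behind the QQ and calibration checks can be sketched in plain Python. If the fitted GPD is right, transformed exceedances follow a standard exponential distribution, and the share of claims below a predicted quantile matches its level. The helper names below are hypothetical, the inputs are assumed to lie inside the GPD support, and the library's plots may be built differently.

```python
import math


def gpd_exponential_residuals(z, sigma, xi):
    """Map exceedances to residuals that are Exp(1) if the GPD fit is right."""
    out = []
    for zi, si, xii in zip(z, sigma, xi):
        if abs(xii) < 1e-8:
            out.append(zi / si)
        else:
            out.append(math.log1p(xii * zi / si) / xii)
    return out


def qq_points(residuals):
    """Pairs (theoretical Exp(1) quantile, sorted residual) for a QQ plot."""
    n = len(residuals)
    theo = [-math.log(1.0 - (i + 0.5) / n) for i in range(n)]
    return list(zip(theo, sorted(residuals)))


def empirical_coverage(y, q_pred):
    """Share of observations at or below their predicted quantile."""
    return sum(yi <= qi for yi, qi in zip(y, q_pred)) / len(y)


# Exact GPD quantiles at the plotting positions land on the diagonal.
n, sigma0, xi0 = 50, 2.0, 0.3
z = [sigma0 / xi0 * ((1.0 - (i + 0.5) / n) ** (-xi0) - 1.0) for i in range(n)]
points = qq_points(gpd_exponential_residuals(z, [sigma0] * n, [xi0] * n))
print(points[:3])
```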

## Insurance applications

**Motor TPBI (Third-Party Bodily Injury)**

Young injured parties have longer annuity streams and heavier tails. EQRN lets you model xi(x) as a function of injured party age, claim type, and solicitor involvement. Output: P(claim > £500k | risk profile) per policy.

**Property large loss**

Commercial property fire severity varies by construction class, sum insured, sprinkler status. EQRN provides 1-in-200 loss conditional on risk characteristics — input to CAT reinsurance models.

**Per-risk XL pricing**

```python
# Price layer: £1M xs £500k, conditional on risk
xl = model.predict_xl_layer(X_test, attachment=500_000, limit=1_000_000)
```
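For intuition, the expected loss to a layer is the integral of the survival function between the attachment and the exhaustion point. The sketch below computes it per claim for one risk by midpoint integration of the GPD tail, assuming the attachment is at or above the intermediate quantile. Names and inputs are hypothetical, and `predict_xl_layer` may use a different definition (for example, conditional on exceeding the threshold) or a closed form.

```python
import math


def gpd_survival(y, threshold, sigma, xi, tau_0):
    """P(Y > y | x) for y >= threshold, under the fitted GPD tail."""
    z = y - threshold
    if abs(xi) < 1e-8:
        tail = math.exp(-z / sigma)
    else:
        tail = max(1.0 + xi * z / sigma, 0.0) ** (-1.0 / xi)
    return (1.0 - tau_0) * tail


def xl_layer_expected_loss(attachment, limit, threshold, sigma, xi, tau_0,
                           n_steps=10_000):
    """E[min(max(Y - attachment, 0), limit)] via midpoint integration
    of the survival function over [attachment, attachment + limit]."""
    if attachment < threshold:
        raise ValueError("this sketch assumes attachment >= threshold")
    h = limit / n_steps
    total = 0.0
    for k in range(n_steps):
        y = attachment + (k + 0.5) * h
        total += gpd_survival(y, threshold, sigma, xi, tau_0)
    return total * h


# Illustrative risk: £1M xs £500k, threshold 200k, sigma 150k, xi 0.5.
layer_loss = xl_layer_expected_loss(
    attachment=500_000.0, limit=1_000_000.0,
    threshold=200_000.0, sigma=150_000.0, xi=0.5, tau_0=0.8,
)
print(layer_loss)
```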

**Solvency II SCR**

EQRN provides per-segment 99.5th percentile severity, which is the correct input for simulation-based SCR calculations on heterogeneous portfolios. Compared with pooled EVT, segment-level conditional VaR is higher for heavy-tailed segments and lower for light-tailed ones, so capital is neither understated for the former nor overstated for the latter.

## When not to use EQRN

- **Frequency modelling**: EQRN models severity above a threshold. Frequency is a separate model.
- **Attritional claims**: Claims below the intermediate quantile Q_hat(tau_0) are not modelled by EQRN.
- **Small books** (n_exceedances < 200): Set shape_fixed=True as a minimum. Below ~100 exceedances, fall back to marginal EVT.
- **No covariates**: Use insurance-evt directly.
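
The exceedance-count rules above can be turned into a quick pre-flight check. The helper below is hypothetical and not part of the library. It estimates the number of exceedances as n_obs * (1 - tau_0) and applies the 100 and 200 cut-offs from the list.

```python
def suggest_eqrn_settings(n_obs, tau_0=0.8):
    """Rough guidance from the expected number of exceedances."""
    n_exceed = int(round(n_obs * (1.0 - tau_0)))
    if n_exceed < 100:
        # Too few exceedances for a covariate-dependent tail.
        return {"n_exceedances": n_exceed, "use_eqrn": False}
    if n_exceed < 200:
        # Enough for EQRN, but keep xi as a single scalar.
        return {"n_exceedances": n_exceed, "use_eqrn": True, "shape_fixed": True}
    return {"n_exceedances": n_exceed, "use_eqrn": True, "shape_fixed": False}


for n in (400, 750, 5000):
    print(n, suggest_eqrn_settings(n))
```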

## Reference

Pasche, O.C. & Engelke, S. (2024). "Neural networks for extreme quantile regression with an application to forecasting of flood risk." *Annals of Applied Statistics*, 18(4), 2818–2839. DOI:10.1214/24-AOAS1907.

R reference implementation: [opasche/EQRN](https://github.com/opasche/EQRN) (CRAN, March 2025).
