Metadata-Version: 2.4
Name: insurance-multilevel
Version: 0.1.5
Summary: Two-stage CatBoost + REML random effects model for insurance pricing with high-cardinality group factors
Project-URL: Homepage, https://github.com/burning-cost/insurance-multilevel
Project-URL: Repository, https://github.com/burning-cost/insurance-multilevel
Project-URL: Issues, https://github.com/burning-cost/insurance-multilevel/issues
Project-URL: Documentation, https://github.com/burning-cost/insurance-multilevel#readme
Author-email: Burning Cost <pricing.frontier@gmail.com>
License-Expression: MIT
License-File: LICENSE
Keywords: Buhlmann-Straub,REML,UK,actuarial,broker,catboost,credibility,hierarchical,insurance,multilevel,personal lines,pricing,random effects,scheme
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Office/Business :: Financial
Classifier: Topic :: Scientific/Engineering :: Mathematics
Requires-Python: >=3.10
Requires-Dist: catboost>=1.2
Requires-Dist: numpy>=1.25
Requires-Dist: polars>=1.0
Requires-Dist: pyarrow>=12.0
Requires-Dist: scipy>=1.10
Provides-Extra: dev
Requires-Dist: pyarrow>=12.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Description-Content-Type: text/markdown

# insurance-multilevel
[![Tests](https://github.com/burning-cost/insurance-multilevel/actions/workflows/tests.yml/badge.svg)](https://github.com/burning-cost/insurance-multilevel/actions/workflows/tests.yml)

Two-stage CatBoost + REML random effects for insurance pricing with high-cardinality group factors.

**Blog post:** [Your Group Factors Are Not All Worth Modelling](https://burning-cost.github.io/2027/03/15/multilevel-group-factors/)

## The Problem

UK personal lines pricing teams face a specific structural problem: portfolios are distributed through hundreds of brokers, schemes, and affinity partners. These groups differ systematically — broker A has an older, lower-risk customer base; broker B has young drivers; scheme C operates in flood-prone postcodes. But you cannot capture this by throwing broker IDs into a GBM.

The reasons one-hot encoding fails at scale:
- 500 brokers means 500 extra features, most with sparse data
- A GBM with 300 trees will overfit to the largest brokers and ignore the rest
- New brokers at prediction time have no training data

What you actually need is **shrinkage**: for a new or low-volume broker, trust the book-wide average. For a high-volume broker with years of data, trust their own experience. The crossover point is determined by how variable brokers are relative to within-group noise.
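The shrinkage trade-off can be sketched in a few lines. This is an illustration with invented variance components, not the library's API:

```python
# Illustrative variance components (invented for this sketch)
tau2 = 0.10    # between-broker variance
sigma2 = 0.25  # within-broker noise

def credibility(n: int) -> float:
    """Buhlmann credibility weight for a broker with n observations."""
    return tau2 / (tau2 + sigma2 / n)

book_mean = 0.0     # book-wide average (log scale)
broker_mean = 0.30  # this broker's raw experience (log scale)

# A thin broker is pulled most of the way back to the book average;
# a thick broker keeps most of its own experience.
for n in (2, 10, 100):
    z = credibility(n)
    shrunk = z * broker_mean + (1 - z) * book_mean
    print(f"n={n:3d}  Z={z:.2f}  shrunk estimate={shrunk:.3f}")
```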

This is classical Bühlmann-Straub credibility theory, reimplemented as a statistically principled REML random effects model. If you want the classical closed-form version — for scheme pricing or geographic rating without a GBM stage — see [insurance-credibility](https://github.com/burning-cost/insurance-credibility), which implements `BuhlmannStraub` and `HierarchicalBuhlmannStraub` directly.

## The Solution

Two stages, run sequentially:

**Stage 1: CatBoost on individual risk factors**
Group columns (broker, scheme) are deliberately excluded. CatBoost learns age bands, vehicle classes, postcode sectors, NCB — everything about the *individual* policy. Output: f̂ᵢ (CatBoost predicted premium).

**Stage 2: REML random intercepts on log-ratio residuals**
Compute rᵢ = log(yᵢ / f̂ᵢ) — the log-ratio of observed to CatBoost-predicted. Fit a one-way random effects model:

```
rᵢ = μ + b_g(i) + εᵢ
b_g ~ N(0, τ²)      (between-group variation)
εᵢ  ~ N(0, σ²)      (within-group noise)
```

REML estimates σ² and τ², then computes BLUPs for each group:

```
b̂_g = Z_g × (r̄_g - μ̂)         (shrunk group mean)
Z_g = τ² / (τ² + σ²/n_g)        (Bühlmann credibility weight)
```

**Final premium:** f̂(x) × exp(b̂_g)
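Putting the pieces together for a single policy, with all inputs invented for illustration (the library computes these internally):

```python
import math

# Stage 1: CatBoost base premium for one policy (invented)
f_hat = 420.0

# Stage 2 inputs for this policy's broker (invented)
r_bar_g = 0.18                   # broker's mean log-ratio residual
mu_hat = 0.0                     # grand mean of residuals
tau2, sigma2, n_g = 0.10, 0.25, 50

Z_g = tau2 / (tau2 + sigma2 / n_g)  # Buhlmann credibility weight
b_hat = Z_g * (r_bar_g - mu_hat)    # BLUP: shrunk broker effect
premium = f_hat * math.exp(b_hat)   # final premium

print(f"Z={Z_g:.3f}  b_hat={b_hat:.4f}  premium=£{premium:.2f}")
```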

## Installation

```bash
uv add insurance-multilevel
```

> 💬 Questions or feedback? Start a [Discussion](https://github.com/burning-cost/insurance-multilevel/discussions). Found it useful? A ⭐ helps others find it.

## Quick Start

```python
import polars as pl
from insurance_multilevel import MultilevelPricingModel

model = MultilevelPricingModel(
    catboost_params={"loss_function": "RMSE", "iterations": 500},
    random_effects=["broker_id", "scheme_id"],
    min_group_size=5,
)

model.fit(X_train, y_train, weights=exposure, group_cols=["broker_id", "scheme_id"])

premiums = model.predict(X_test, group_cols=["broker_id", "scheme_id"])
```

## Credibility Summary

```python
summary = model.credibility_summary()
print(summary)
```

```
# shape: (47, 11) — one row per (level, group) combination.
# With random_effects=["broker_id", "scheme_id"] and 30 brokers + 17 schemes,
# you get 47 rows (30 broker-level rows + 17 scheme-level rows), each with its
# own variance components estimated at that level.
shape: (47, 11)
┌───────────┬─────────────┬────────┬────────────┬─────────┬────────────┬────────────────────┬───────┬────────┬──────┬──────────┐
│ level     ┆ group       ┆ n_obs  ┆ group_mean ┆ blup    ┆ multiplier ┆ credibility_weight ┆ tau2  ┆ sigma2 ┆ k    ┆ eligible │
│ ---       ┆ ---         ┆ ---    ┆ ---        ┆ ---     ┆ ---        ┆ ---                ┆ ---   ┆ ---    ┆ ---  ┆ ---      │
│ str       ┆ str         ┆ f64    ┆ f64        ┆ f64     ┆ f64        ┆ f64                ┆ f64   ┆ f64    ┆ f64  ┆ bool     │
╞═══════════╪═════════════╪════════╪════════════╪═════════╪════════════╪════════════════════╪═══════╪════════╪══════╪══════════╡
│ broker_id ┆ broker_02   ┆ 342.0  ┆ 0.1823     ┆ 0.1691  ┆ 1.184      ┆ 0.928              ┆ 0.103 ┆ 0.248  ┆ 2.41 ┆ true     │
│ broker_id ┆ broker_07   ┆ 289.0  ┆ -0.1354    ┆ -0.1231 ┆ 0.884      ┆ 0.919              ┆ 0.103 ┆ 0.248  ┆ 2.41 ┆ true     │
│ ...       ┆ ...         ┆ ...    ┆ ...        ┆ ...     ┆ ...        ┆ ...                ┆ ...   ┆ ...    ┆ ...  ┆ ...      │
└───────────┴─────────────┴────────┴────────────┴─────────┴────────────┴────────────────────┴───────┴────────┴──────┴──────────┘
```

The `multiplier` column is what pricing teams use: `broker_02` has consistently worse-than-expected experience, so apply a 1.184 loading on top of the base premium for policies written through that broker.

The column names — `tau2` (τ², between-group variance), `sigma2` (σ², within-group variance), `k` (Bühlmann's k), `credibility_weight` (Z) — deliberately mirror the notation in classical Bühlmann-Straub theory. See [insurance-credibility](https://github.com/burning-cost/insurance-credibility) for the closed-form version of the same model, where these parameters are explained in detail.

## Variance Components

```python
vc = model.variance_components["broker_id"]
print(vc)
# VarianceComponents(sigma2=0.2481, tau2={broker_id=0.1032},
#                   k={broker_id=2.41}, log_likelihood=-412.3,
#                   converged=True, iterations=23)
```

The **Bühlmann k** tells you the crossover point: a broker needs k=2.41 claim-years of data before their own experience gets more than 50% credibility. With σ²=0.25 and τ²=0.10, this is a reasonable portfolio — brokers do vary, but not outrageously.
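The crossover is easy to verify: with Z = n/(n+k), a broker with exactly k claim-years gets exactly 50% credibility (k taken from the fitted components above):

```python
k = 2.41  # sigma2 / tau2 from the variance components above

def z(n: float) -> float:
    """Credibility as a function of group size: Z = n / (n + k)."""
    return n / (n + k)

print(z(k))    # 0.5 at the crossover
print(z(10))   # ≈ 0.81: ten claim-years already gets 81% credibility
print(z(100))  # ≈ 0.98
```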

## Diagnostics

```python
from insurance_multilevel import (
    icc,
    variance_decomposition,
    high_credibility_groups,
    groups_needing_data,
    lift_from_random_effects,
)

# Intraclass correlation coefficient
# "~29% of total variance is between-broker"
print(icc(vc, "broker_id"))  # 0.294

# Which brokers have enough data to trust?
hc = high_credibility_groups(model.credibility_summary(), min_z=0.7)

# How much more data does each broker need to reach 80% credibility?
needs = groups_needing_data(model.credibility_summary(), target_z=0.8)

# Does Stage 2 actually help?
stage1 = model.stage1_predict(X_test)
final = model.predict(X_test)
lift = lift_from_random_effects(y_test, stage1, final, weights=exposure_test)
print(lift["malr_improvement_pct"])  # e.g., 4.2% improvement
```

## Design Choices

**Why two stages instead of joint estimation?**
Joint approaches (GPBoost, MERF) are mathematically cleaner but have identifiability problems when group IDs are high-cardinality. If broker_id is in the CatBoost feature set, the tree can absorb *some* of the group signal, leaving underestimated τ² in Stage 2. Two-stage with group exclusion is simpler and avoids this.

**Why REML instead of ML?**
Maximum likelihood underestimates variance components because it doesn't account for the degrees of freedom consumed by fixed effects (the grand mean μ). REML conditions out μ first. For small numbers of groups (m < 30), the difference is material. For m > 100, ML and REML converge.
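The same bias appears in the textbook one-sample case: the ML variance estimator divides by n and ignores the degree of freedom spent on the mean, while the n−1 correction (the REML analogue) removes the bias. A quick simulation, purely illustrative:

```python
import random
import statistics

random.seed(42)
n = 5  # small sample, analogous to a small number of groups

ml_estimates, reml_estimates = [], []
for _ in range(20_000):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]  # true variance = 1.0
    m = sum(xs) / n
    ss = sum((x - m) ** 2 for x in xs)
    ml_estimates.append(ss / n)          # ML: biased low by factor (n-1)/n
    reml_estimates.append(ss / (n - 1))  # REML-style correction: unbiased

print(f"ML mean estimate:   {statistics.mean(ml_estimates):.3f}")   # ≈ 0.8
print(f"REML mean estimate: {statistics.mean(reml_estimates):.3f}") # ≈ 1.0
```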

**Why log-ratio residuals?**
Insurance premia are multiplicative. A broker loading of 1.15 applies regardless of whether the base premium is £300 or £1,200. By working on the log scale, we get additive random effects that translate cleanly to multiplicative adjustments.
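A two-line check of the multiplicative point (invented premiums): an additive effect on the log scale is the same percentage loading at any base premium.

```python
import math

b_hat = math.log(1.15)  # a 15% broker loading, expressed additively on the log scale

for base in (300.0, 1200.0):
    print(f"£{base:.0f} -> £{base * math.exp(b_hat):.2f}")  # 1.15x in both cases
```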

**Why min_group_size=5?**
With n=1, you cannot separate the group random effect from within-group noise: a single observation tells you nothing about how variable that group is. Its BLUP would either have to be 0 (no information) or be dominated by one possibly extreme observation. We therefore exclude groups below the threshold from variance estimation and give them Z=0, so they receive the book-wide premium. This is conservative and actuarially correct.

## API Reference

### `MultilevelPricingModel`

```python
MultilevelPricingModel(
    catboost_params: dict | None = None,
    random_effects: list[str] | None = None,
    min_group_size: int = 5,
    reml: bool = True,
)
```

Methods:
- `fit(X, y, weights, group_cols)` — fit two-stage model
- `predict(X, group_cols, allow_new_groups)` — return premium predictions
- `credibility_summary(group_col)` — Bühlmann-Straub summary DataFrame
- `stage1_predict(X)` — CatBoost predictions only (no random effects)
- `log_ratio_residuals(X, y)` — log(y / f_hat) for diagnostics
- `variance_components` — dict of VarianceComponents per group level
- `feature_importances` — Stage 1 CatBoost feature importances

### `RandomEffectsEstimator`

Lower-level class if you want to use REML variance components without CatBoost.

```python
est = RandomEffectsEstimator(reml=True, min_group_size=5)
vc = est.fit(residuals, group_ids, weights)
blups = est.predict_blup(group_ids)
```

### `VarianceComponents`

Dataclass holding σ², τ², k (Bühlmann), log-likelihood, convergence info.

## Scope (V1)

- Random intercepts only (no random slopes)
- Gaussian residuals on log-transformed response
- Two-stage estimation (not joint)
- Nested hierarchy supported: fit separate estimators per group level
- No crossed effects (e.g. broker × territory interactions)
- One-way random effects per group column

## Performance

Benchmarked on Databricks (2026-03-16) on a synthetic UK motor portfolio: 8,000 policies across 200 occupation codes (40 thick groups / 160 thin groups), true sigma_u=0.30 (between-group std), true sigma_e=0.50 (within-group std), ICC=0.36. Three approaches compared on a held-out 20% test set. See `benchmarks/benchmark.py` for the full script.

**Variance component recovery:**

| Parameter | True | Estimated |
|-----------|------|-----------|
| sigma2 (within-group) | 0.2500 | 0.1532 |
| tau2 (between-group) | 0.0900 | 0.0761 |
| ICC | 0.360 | 0.332 |

REML underestimates both variance components on this dataset. The within-group sigma2 comes in substantially below the truth, most likely because the Stage 1 CatBoost model absorbs part of the within-group variance before the REML step. The estimates land closer to the true values when Stage 1 is a weak model or is omitted entirely. Despite this attenuation, the Stage 2 lift is positive and substantial.

**Holdout gamma deviance (lower = better):**

| Method | Gamma Deviance |
|--------|----------------|
| No group effect (CatBoost on X only) | 0.338092 |
| One-hot encoding of group | 0.272967 |
| MultilevelPricingModel (CatBoost + REML) | 0.272280 |

Stage 2 lift (full model vs Stage 1 only): +15.93% deviance reduction.

MultilevelPricingModel beats one-hot encoding on deviance (marginally) and outperforms no-group by 19.4%. The Stage 2 lift of 15.93% is large because ICC=0.33 — when a third of the variance is attributable to group membership, ignoring group structure is expensive.

**MAPE by group thickness:**

| Method | Thin groups MAPE | Thick groups MAPE |
|--------|-----------------|-------------------|
| No group effect | 68.35% | 62.46% |
| One-hot encoding | 66.09% | 51.11% |
| MultilevelPricingModel (REML) | 63.55% | 52.85% |

MultilevelPricingModel achieves the lowest MAPE on thin groups (63.55% vs 66.09% for one-hot). One-hot encoding slightly wins on thick groups because with enough data the one-hot coefficients converge to the correct values, and the REML shrinkage marginally over-regularises. The difference is 1.7pp — within noise. On thin groups the gap is decisive: one-hot overfits noisy 10-policy estimates; REML shrinks them appropriately.

**Group effect recovery:** Pearson r(BLUP, true effect)=0.7291 (target: >0.6). BLUP adjustments correlate strongly with the planted true occupation effects.

**Credibility weight distribution:**
- Thin groups (<15 exposure): mean Z=0.38 (heavily shrunk to portfolio mean)
- Medium groups (15-80 exposure): mean Z=0.97
- Thick groups (>=80 exposure): mean Z=0.98

With ICC=0.33, a third of the residual variance sits at the group level, so the lift is substantial whenever occupation-code effects are real. On a portfolio where certain occupations over-perform by 15-20%, this pays for itself.

**When to use:** Portfolios where group-level (broker, scheme, territory, occupation) effects are material but group membership cannot simply be included as a GBM feature due to cardinality, sparsity, or new groups at prediction time.

**When NOT to use:** When group effects are small (ICC < 0.05), or when you have fewer than 3-5 observations per group — BLUPs collapse to the grand mean and Stage 2 adds no lift. If you have 10+ observations per group and no new-group problem, one-hot encoding is a reasonable substitute.

## Databricks Notebook

A ready-to-run Databricks notebook benchmarking this library against standard approaches is available in [burning-cost-examples](https://github.com/burning-cost/burning-cost-examples/blob/main/notebooks/insurance_multilevel_demo.py).

## Related libraries

| Library | Why it's relevant |
|---------|------------------|
| [insurance-credibility](https://github.com/burning-cost/insurance-credibility) | Bühlmann-Straub closed-form implementation — use this when you don't need a GBM Stage 1 |
| [shap-relativities](https://github.com/burning-cost/shap-relativities) | Extract rating relativities from the Stage 1 CatBoost model |
| [insurance-spatial](https://github.com/burning-cost/insurance-spatial) | BYM2 spatial random effects for territory — the geographic equivalent of broker random effects |
| [insurance-cv](https://github.com/burning-cost/insurance-cv) | Walk-forward cross-validation respecting IBNR structure, needed for validating the Stage 1 model |

[All Burning Cost libraries →](https://burning-cost.github.io)

## Read more

[Your Broker Adjustments Are Guesswork](https://burning-cost.github.io/blog/your-broker-adjustments-are-guesswork) — why ad hoc broker loadings fail and how REML random effects give you defensible, data-driven credibility weights.


## Licence

MIT
