Metadata-Version: 2.4
Name: insurance-cure
Version: 0.1.0
Summary: Mixture cure models for insurance non-claimer scoring: covariate-aware logistic incidence, parametric and semiparametric latency, EM estimation
Author-email: Burning Cost <pricing.frontier@gmail.com>
License: MIT
Keywords: Cox,EM algorithm,UK insurance,Weibull,actuarial,claims frequency,cure fraction,insurance,mixture cure model,motor insurance,non-claimer,survival analysis
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Office/Business :: Financial
Classifier: Topic :: Scientific/Engineering :: Mathematics
Requires-Python: >=3.10
Requires-Dist: joblib>=1.2
Requires-Dist: lifelines>=0.27
Requires-Dist: numpy>=1.23
Requires-Dist: pandas>=1.5
Requires-Dist: scikit-learn>=1.1
Requires-Dist: scipy>=1.10
Provides-Extra: dev
Requires-Dist: pytest-cov; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Description-Content-Type: text/markdown

# insurance-cure

Mixture cure models for insurance non-claimer scoring.

## The problem

Frequency GLMs treat all zero-claim policyholders the same. They do not distinguish between:

1. **Structural non-claimers** — policyholders who would never claim regardless of how long you observed them. A 60-year-old with 9 years NCB driving 5,000 miles a year.
2. **Lucky susceptibles** — policyholders who are genuinely at risk but happened not to claim this year.

These two groups behave differently over multi-year retention horizons. The structural immune cohort will never generate claim cost regardless of tenure. The low-hazard susceptible will eventually claim.

A Poisson GLM cannot tell them apart. A mixture cure model (MCM) can.

## What this library does

`insurance-cure` fits covariate-aware MCMs with a logistic incidence sub-model (who is susceptible?) and a parametric or semiparametric latency sub-model (when do susceptibles claim?). The primary output is a per-policyholder susceptibility score.

The population survival function:

```
S_pop(t | x, z) = pi(z) * S_u(t | x) + [1 - pi(z)]
```

- `pi(z)` = P(susceptible), logistic regression incidence sub-model
- `S_u(t | x)` = survival for susceptibles, Weibull/log-normal/Cox latency
- `[1 - pi(z)]` = cure fraction: P(never experiences event)
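
The formula can be evaluated by hand to see how the cure fraction sets the long-run plateau. The sketch below is illustrative only and is not the library's implementation: it assumes a Weibull latency without latency covariates, and the function names and coefficient values are hypothetical.

```python
import math

def susceptibility(z, gamma, intercept):
    """Logistic incidence: pi(z) = P(susceptible | z)."""
    eta = intercept + sum(g * v for g, v in zip(gamma, z))
    return 1.0 / (1.0 + math.exp(-eta))

def weibull_survival(t, shape, scale):
    """Latency survival for susceptibles: S_u(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))

def population_survival(t, pi, shape, scale):
    """S_pop(t) = pi * S_u(t) + (1 - pi)."""
    return pi * weibull_survival(t, shape, scale) + (1.0 - pi)

# Hypothetical coefficients: one covariate (NCB years) lowering susceptibility.
pi = susceptibility(z=[9.0], gamma=[-0.3], intercept=1.5)
for months in (12, 24, 36, 60, 600):
    print(months, round(population_survival(months, pi, shape=1.2, scale=36.0), 4))
```

As `t` grows, `S_u(t)` tends to zero and `S_pop(t)` levels off at the cure fraction `1 - pi(z)` rather than at zero.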

Estimation is by the EM algorithm (Peng & Dear 2000; Sy & Taylor 2000), with multiple restarts to handle multimodality. Bootstrap standard errors are available.

**No other pip-installable Python package that we know of provides covariate-aware MCMs with actuarial output.** R has smcure, flexsurvcure and cuRe; Python has had no equivalent. This library fills that gap.

## Installation

```bash
pip install insurance-cure
```

Dependencies: numpy, scipy, pandas, scikit-learn, lifelines, joblib.

## Quick start

```python
import pandas as pd
from insurance_cure import WeibullMixtureCure
from insurance_cure.diagnostics import sufficient_followup_test, CureScorecard
from insurance_cure.simulate import simulate_motor_panel

# Generate synthetic motor panel with known cure fraction 40%
df = simulate_motor_panel(n_policies=3000, cure_fraction=0.40, seed=42)

# ALWAYS check sufficient follow-up before fitting
qn = sufficient_followup_test(df["tenure_months"], df["claimed"])
print(qn.summary())

# Fit Weibull MCM
model = WeibullMixtureCure(
    incidence_formula="ncb_years + age + vehicle_age",
    latency_formula="ncb_years + age",
    n_em_starts=5,
)
model.fit(df, duration_col="tenure_months", event_col="claimed")
print(model.result_.summary())

# Outputs
cure_scores = model.predict_cure_fraction(df)      # P(immune) per policy
suscept = model.predict_susceptibility(df)          # 1 - cure_fraction
pop_surv = model.predict_population_survival(df, times=[12, 24, 36, 60])

# Validate with scorecard
scorecard = CureScorecard(model, bins=10).fit(df, duration_col="tenure_months", event_col="claimed")
print(scorecard.summary())
```

## Models

### WeibullMixtureCure (recommended)

Weibull AFT latency. Clean parametric extrapolation. Best default choice.

```python
from insurance_cure import WeibullMixtureCure

model = WeibullMixtureCure(
    incidence_formula="ncb_years + age + vehicle_age",
    latency_formula="ncb_years + age",
    n_em_starts=5,        # EM restarts — use >=5 for production
    bootstrap_se=True,    # Bootstrap SEs — slow but rigorous
    n_bootstrap=200,
    n_jobs=-1,
)
model.fit(df, duration_col="tenure_months", event_col="claimed")
```

### LogNormalMixtureCure

Log-normal latency. Better when the conditional hazard peaks then falls — sometimes fits pet or travel data better than Weibull.

```python
from insurance_cure import LogNormalMixtureCure

model = LogNormalMixtureCure(
    incidence_formula="pet_age + breed_risk + indoor",
    latency_formula="pet_age + breed_risk",
)
model.fit(df)
```
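
The difference in hazard shape can be checked numerically. The sketch below is independent of the library: it evaluates the textbook log-normal and Weibull hazards for hypothetical parameter values, to show the rise-then-fall pattern against the monotone Weibull hazard.

```python
import math

def lognormal_hazard(t, mu, sigma):
    """h(t) = f(t) / S(t) for a log-normal time-to-claim."""
    z = (math.log(t) - mu) / sigma
    density = math.exp(-0.5 * z * z) / (t * sigma * math.sqrt(2.0 * math.pi))
    survival = 0.5 * math.erfc(z / math.sqrt(2.0))
    return density / survival

def weibull_hazard(t, shape, scale):
    """h(t) = (shape / scale) * (t / scale) ** (shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)

# Hypothetical parameters chosen only to illustrate the shapes.
for months in (1, 10, 50, 200):
    print(
        months,
        round(lognormal_hazard(months, mu=3.0, sigma=1.0), 4),
        round(weibull_hazard(months, shape=1.2, scale=36.0), 4),
    )
```

With a shape parameter above 1 the Weibull hazard keeps rising, whereas the log-normal hazard climbs to a peak and then declines.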

### CoxMixtureCure

Semiparametric Cox PH latency. Nonparametric baseline hazard — most flexible. Cannot extrapolate beyond the observation window. Use for exploration, not production pricing projection.

```python
from insurance_cure import CoxMixtureCure

model = CoxMixtureCure(
    incidence_formula="ncb_years + age",
    latency_formula="ncb_years",
)
model.fit(df)
```

### PromotionTimeCure

Non-mixture (promotion time) cure model with a population-level proportional hazards structure. Included as a comparison model. The cure fraction emerges from the asymptote of the population survival function; there is no explicit incidence sub-model.

```python
from insurance_cure import PromotionTimeCure

model = PromotionTimeCure(formula="ncb_years + age + vehicle_age")
model.fit(df)
```
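
To make the asymptote concrete, the sketch below assumes the standard promotion time form `S_pop(t | x) = exp(-theta(x) * F(t))`, where `F` is a proper baseline CDF. The Weibull baseline, the function names and the value of `theta` are assumptions for illustration; the library's own parameterisation may differ.

```python
import math

def promotion_time_survival(t, theta, shape, scale):
    """S_pop(t) = exp(-theta * F(t)) with an assumed Weibull baseline CDF F."""
    baseline_cdf = 1.0 - math.exp(-((t / scale) ** shape))
    return math.exp(-theta * baseline_cdf)

def implied_cure_fraction(theta):
    """Limit of S_pop(t) as t grows: F(t) -> 1, so S_pop -> exp(-theta)."""
    return math.exp(-theta)

theta = math.exp(-0.1)  # hypothetical exp(linear predictor) for one policy
for months in (12, 60, 240, 2400):
    print(months, round(promotion_time_survival(months, theta, 1.2, 36.0), 4))
print("implied cure fraction:", round(implied_cure_fraction(theta), 4))
```

A larger `theta(x)` pushes the plateau down, so covariates act on the cure fraction and on timing through the same linear predictor.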

## Diagnostics

### Sufficient follow-up test

The Maller-Zhou Qn test is mandatory. If the observation window is too short, many censored policyholders are simply susceptibles who have not yet claimed, not structural non-claimers. The cure fraction estimate will be upwardly biased.

```python
from insurance_cure.diagnostics import sufficient_followup_test

result = sufficient_followup_test(df["tenure_months"], df["claimed"])
print(result.summary())
# Maller-Zhou Sufficient Follow-Up Test
# ========================================
#   Qn statistic      : 3.2194
#   p-value           : 0.0006
#   ...
#   Conclusion: Sufficient follow-up: evidence for a genuine cure fraction.
```
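
An informal companion to the formal test is to look at the tail of the Kaplan-Meier curve. The sketch below is not the Qn statistic and not part of the library: it is a plain-Python Kaplan-Meier estimator plus a hypothetical `plateau_summary` helper that reports how much follow-up sits beyond the last observed claim.

```python
def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct event time."""
    data = sorted(zip(durations, events))
    n_at_risk = len(data)
    surv, curve, i = 1.0, [], 0
    while i < len(data):
        t = data[i][0]
        claims = ties = 0
        while i < len(data) and data[i][0] == t:
            claims += data[i][1]
            ties += 1
            i += 1
        if claims:
            surv *= 1.0 - claims / n_at_risk
            curve.append((t, surv))
        n_at_risk -= ties
    return curve

def plateau_summary(durations, events):
    """Describe the flat tail of the curve after the last observed claim."""
    curve = kaplan_meier(durations, events)
    if not curve:
        raise ValueError("no events observed")
    last_event_time, plateau = curve[-1]
    n_after = sum(1 for t, e in zip(durations, events) if not e and t > last_event_time)
    return {
        "last_event_time": last_event_time,
        "plateau_survival": plateau,
        "max_follow_up": max(durations),
        "n_censored_after_last_event": n_after,
    }

print(plateau_summary([1, 2, 3, 10, 10, 10], [1, 1, 0, 0, 0, 0]))
```

A long, well-populated flat tail is consistent with a cure fraction. A curve that is still falling at the end of the window suggests the censored policies include susceptibles who have not yet claimed.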

### Cure scorecard

```python
from insurance_cure.diagnostics import CureScorecard

scorecard = CureScorecard(model, bins=10).fit(df)
print(scorecard.summary())
# Decile 1 (lowest cure) should have highest event rates.
# Decile 10 (highest cure) should have lowest event rates.
```
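
The idea behind the decile check can be sketched in a few lines. This is not the `CureScorecard` implementation: `decile_event_rates` is a hypothetical helper that sorts policies by predicted cure score, cuts them into equal-sized bins and reports the raw event rate in each, ignoring differences in exposure.

```python
def decile_event_rates(cure_scores, events, bins=10):
    """Event rate per bin, with bin 1 holding the lowest predicted cure scores."""
    order = sorted(range(len(cure_scores)), key=lambda i: cure_scores[i])
    n = len(order)
    table = []
    for b in range(bins):
        idx = order[b * n // bins:(b + 1) * n // bins]
        if not idx:
            continue  # fewer policies than bins
        table.append({
            "bin": b + 1,
            "mean_cure_score": sum(cure_scores[i] for i in idx) / len(idx),
            "event_rate": sum(events[i] for i in idx) / len(idx),
        })
    return table

# Toy data: the half with the lowest scores all claim, the rest never do.
scores = [i / 100 for i in range(100)]
claimed = [1 if i < 50 else 0 for i in range(100)]
for row in decile_event_rates(scores, claimed):
    print(row)
```

A well-ranked model shows event rates falling as the bin number rises; a flat or non-monotone pattern suggests the incidence sub-model is not separating the two groups.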

## Insurance applications

**UK motor:** First at-fault claim in policy tenure. Event = first claim, time axis = tenure in months. Incidence covariates: NCB years, driver age, vehicle age, occupation. A policyholder with 9 years NCB is a plausible structural non-claimer; a first-year policyholder is not.

**Pet insurance:** First claim by condition type. Breed, age, indoor/outdoor status drive susceptibility. Indoor cats in early life have very high cure fractions for accidental injury.

**Travel insurance:** Single-trip non-claimers. Destination, duration, age, trip type (business vs leisure) drive susceptibility.

**Where MCM does NOT apply:** Buildings perils such as flood and subsidence. Return periods exceed practical follow-up windows, so the Qn test will not find evidence of sufficient follow-up. Use flood zone categories as structural zero covariates in a standard GLM instead.

## Synthetic data

```python
from insurance_cure.simulate import simulate_motor_panel, simulate_pet_panel

# Motor panel: multi-year structure with NCB, age, vehicle age
df = simulate_motor_panel(
    n_policies=5000,
    n_years=5,
    cure_fraction=0.40,
    weibull_shape=1.2,
    weibull_scale=36.0,    # months to first claim for susceptibles
    censoring_rate=0.15,   # annual lapse rate
    seed=42,
)

# Pet panel: cross-sectional
df_pet = simulate_pet_panel(n_policies=2000, cure_fraction=0.35, seed=42)
```

The true latent immune status is included as `is_immune` for validation. This column is not available in real data.
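
For readers who want to see the data-generating idea without the library, the sketch below draws a minimal mixture cure panel using only the standard library. It is not how `simulate_motor_panel` works internally: `simulate_cure_panel`, its exponential lapse mechanism and its parameters are assumptions made for illustration.

```python
import random

def simulate_cure_panel(n, cure_fraction, shape, scale, max_follow_up, lapse_rate, seed=0):
    """Draw latent immune status, a Weibull claim time and a censoring time per policy."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        is_immune = rng.random() < cure_fraction
        # Immune policies never claim; susceptibles draw a Weibull time to first claim.
        claim_time = float("inf") if is_immune else rng.weibullvariate(scale, shape)
        # Assumed censoring: exponential lapse, capped at the end of the window.
        censor_time = min(max_follow_up, rng.expovariate(lapse_rate))
        rows.append({
            "tenure_months": min(claim_time, censor_time),
            "claimed": int(claim_time <= censor_time),
            "is_immune": int(is_immune),
        })
    return rows

panel = simulate_cure_panel(
    n=2000, cure_fraction=0.40, shape=1.2, scale=36.0,
    max_follow_up=60.0, lapse_rate=0.15 / 12, seed=1,
)
print(sum(r["claimed"] for r in panel) / len(panel))
```

Every immune policy ends up censored, but so do some susceptibles, which is exactly the ambiguity the model has to resolve.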

## EM algorithm details

Each EM iteration alternates an E-step, which updates the posterior susceptibility weights w_i, with an M-step that decouples into two standard sub-problems:

**E-step:** For censored observation i:
```
w_i = pi(z_i) * S_u(t_i|x_i) / [pi(z_i) * S_u(t_i|x_i) + (1 - pi(z_i))]
```
For observed events: w_i = 1 (certainly susceptible).
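
A small worked example of the weight formula, using a hypothetical helper name and made-up inputs, shows how a longer claim-free tenure pulls the posterior susceptibility down from the prior `pi(z_i)`:

```python
def posterior_susceptibility(pi, s_u, event):
    """E-step weight w_i = P(susceptible | data) for one policy."""
    if event:
        return 1.0  # a claim was observed, so the policy is certainly susceptible
    return pi * s_u / (pi * s_u + (1.0 - pi))

prior = 0.6
# Short claim-free tenure: S_u is still high, so the weight stays near the prior.
print(round(posterior_susceptibility(prior, s_u=0.9, event=False), 4))
# Long claim-free tenure: S_u is low, so the policy looks increasingly immune.
print(round(posterior_susceptibility(prior, s_u=0.1, event=False), 4))
```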

**M-step:**
1. Logistic regression for the incidence coefficients gamma, where pi(z) = 1 / (1 + exp(-z'gamma)), using w_i as soft labels
2. Weighted Weibull/log-normal MLE for latency parameters, using w_i as case weights
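
The two steps can be put together in a stripped-down sketch. It is not the library's fitting code: it assumes no covariates, so the logistic M-step collapses to `pi = mean(w_i)`, and it solves the weighted Weibull M-step by bisection on the profile score. All names are hypothetical.

```python
import math
import random

def weighted_weibull_mle(times, events, weights):
    """Latency M-step: events drive the hazard term, w_i weights the survival term."""
    logs = [math.log(t) for t in times]
    n_events = sum(events)
    mean_log_event = sum(lg for lg, e in zip(logs, events) if e) / n_events

    def profile_score(k):
        num = den = 0.0
        for t, lg, w in zip(times, logs, weights):
            term = w * t ** k
            num += term * lg
            den += term
        return 1.0 / k + mean_log_event - num / den

    lo, hi = 0.05, 10.0
    for _ in range(40):  # bisection: the profile score is decreasing in k
        mid = 0.5 * (lo + hi)
        if profile_score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    shape = 0.5 * (lo + hi)
    scale = (sum(w * t ** shape for t, w in zip(times, weights)) / n_events) ** (1.0 / shape)
    return shape, scale

def fit_em(times, events, n_iter=50):
    """Intercept-only EM for a Weibull mixture cure model."""
    pi, shape, scale = 0.5, 1.0, sum(times) / len(times)
    for _ in range(n_iter):
        # E-step: posterior susceptibility for each policy.
        weights = []
        for t, e in zip(times, events):
            s_u = math.exp(-((t / scale) ** shape))
            weights.append(1.0 if e else pi * s_u / (pi * s_u + 1.0 - pi))
        # M-step: incidence (closed form without covariates), then latency.
        pi = sum(weights) / len(weights)
        shape, scale = weighted_weibull_mle(times, events, weights)
    return pi, shape, scale

# Toy data: 60% susceptible, Weibull(shape=1.5, scale=12), follow-up capped at 60.
rng = random.Random(0)
times, events = [], []
for _ in range(800):
    t = rng.weibullvariate(12.0, 1.5) if rng.random() < 0.6 else float("inf")
    times.append(min(t, 60.0))
    events.append(1 if t <= 60.0 else 0)

pi_hat, shape_hat, scale_hat = fit_em(times, events)
print(round(pi_hat, 3), round(shape_hat, 3), round(scale_hat, 3))
```

With covariates, the `pi = mean(w_i)` line would be replaced by a logistic regression fitted to the soft labels, and the latency step by a weighted regression on the latency covariates.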

The w_i weights are interpretable posterior susceptibility probabilities. This transparency is a key advantage over direct MLE of the full log-likelihood, which converges less reliably and provides no intermediate interpretation.

## Design choices

**EM over direct MLE.** Direct MLE of the full MCM log-likelihood is prone to singular or ill-conditioned Hessians near the boundaries (cure fraction near 0 or 1). EM increases the observed-data log-likelihood at every iteration, and the M-step delegates to proven scipy/sklearn solvers for each sub-problem separately. This is the approach taken by smcure in R.
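
For reference, the observed-data log-likelihood that a direct optimiser would have to maximise can be written out as below. This is a sketch under simplifying assumptions (Weibull latency, no covariates, hypothetical function name), not code taken from the library.

```python
import math

def mixture_cure_loglik(times, events, pi, shape, scale):
    """Observed-data log-likelihood of a Weibull mixture cure model."""
    total = 0.0
    for t, e in zip(times, events):
        cum_hazard = (t / scale) ** shape  # so that S_u(t) = exp(-cum_hazard)
        if e:
            # Claim observed: log of pi * f_u(t), written in log space.
            log_hazard = math.log(shape / scale) + (shape - 1.0) * math.log(t / scale)
            total += math.log(pi) + log_hazard - cum_hazard
        else:
            # Censored: either immune, or susceptible and not yet claimed.
            total += math.log((1.0 - pi) + pi * math.exp(-cum_hazard))
    return total

print(round(mixture_cure_loglik([36.0, 36.0], [1, 0], pi=0.6, shape=1.0, scale=36.0), 4))
```

The censored term mixes `pi` with the latency parameters inside a single logarithm, which is what couples the two sub-models and makes the surface awkward for a generic optimiser.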

**Separate incidence and latency formulae.** Following smcure's `cureform` / `formula` convention. In practice, all covariates typically enter the incidence sub-model; only timing-relevant covariates enter the latency.

**Multiple restarts.** The MCM log-likelihood is multimodal, especially when the cure fraction is near 0 or 1. Five restarts (mix of smart and random initialisations) is a practical default. Increase for production models.
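
The restart logic itself is simple, as the generic sketch below suggests. It is not the library's code: `fit_once` is a hypothetical callable that runs one EM fit from a given starting susceptibility and returns `(log_likelihood, params)`, and `toy_fit` is a stand-in with two local optima.

```python
import random

def best_of_restarts(fit_once, n_starts=5, seed=0):
    """Run several fits from different starting points and keep the best one."""
    rng = random.Random(seed)
    best = None
    for start in range(n_starts):
        # First start is a fixed "smart" guess; the rest are random.
        init_pi = 0.5 if start == 0 else rng.uniform(0.05, 0.95)
        loglik, params = fit_once(init_pi)
        if best is None or loglik > best[0]:
            best = (loglik, params)
    return best

def toy_fit(init_pi):
    """Stand-in for an EM fit whose answer depends on where it starts."""
    if init_pi < 0.5:
        return -10.0, {"pi": 0.2}  # inferior local optimum
    return -8.0, {"pi": 0.7}       # better optimum

print(best_of_restarts(toy_fit))
```

Comparing fits on the final log-likelihood is what makes the restarts useful: a single run started in the wrong basin would report the inferior optimum without any warning.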

**Bootstrap SEs.** EM does not directly yield standard errors. The Louis (1982) observed information matrix requires second derivatives of the complete-data log-likelihood, which is numerically involved. The bootstrap is the approach smcure uses for variance estimation, and it is implemented here with joblib parallelism (B=200 in the examples above).
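
The mechanics of a nonparametric bootstrap are sketched below in plain Python. This is not the library's implementation: `bootstrap_se` is a hypothetical helper, and the estimator used in the demo (the raw share of policies without a claim) is a deliberately crude stand-in for refitting the full model on each resample.

```python
import random
import statistics

def bootstrap_se(rows, estimator, n_bootstrap=200, seed=0):
    """Standard error of estimator(rows) from resampling rows with replacement."""
    rng = random.Random(seed)
    n = len(rows)
    estimates = []
    for _ in range(n_bootstrap):
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(sample))
    return statistics.stdev(estimates)

def share_without_claim(rows):
    """Crude stand-in estimator: proportion of policies with no observed claim."""
    return sum(1 for _, claimed in rows if not claimed) / len(rows)

# Toy data: 500 (tenure, claimed) pairs, 40% of them claim-free.
rows = [(12.0, 0)] * 200 + [(6.0, 1)] * 300
print(round(bootstrap_se(rows, share_without_claim), 4))
```

Each replicate is independent of the others, which is why the loop parallelises naturally across workers.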

## References

- Farewell (1982), Biometrics 38:1041-1046 — canonical covariate MCM
- Maller & Zhou (1996), Survival Analysis with Long-Term Survivors, Wiley — identifiability, Qn test
- Peng & Dear (2000), Biometrics 56:237-243 — EM algorithm, semiparametric
- Sy & Taylor (2000), Biometrics 56:227-236 — EM algorithm, Cox latency
- Tsodikov (1998), JRSS-B 60:195-207 — promotion time / non-mixture model

---

*Burning Cost — actuarial Python for UK pricing teams.*
