2026-04-27 : Sphinx-ready docstring audit
- Audited every module's docstrings and reworked them for clean Sphinx /
  Napoleon rendering. No behavior changes; pure documentation.
- Added module-level docstrings to feature.py, gamdist.py,
  categorical_feature.py, linear_feature.py, spline_feature.py, and
  proximal_operators.py. Expanded gamdist/__init__.py's module docstring.
  Tightened the existing module docstrings on multi_task_features.py and
  multi_task_gamdist.py to use Sphinx cross-references.
- Reformatted GAM.__init__'s References section to use Sphinx footnote /
  citation syntax (``.. [label] description``) so [GAMADMM]_, [GAMr]_,
  [ADMM]_, etc. become real cross-references instead of indented prose.
  Promoted the bulk of __init__'s docstring to a class-level docstring on
  GAM, where Sphinx and Napoleon prefer it.
- Wrapped scattered Unicode-math snippets across categorical_feature,
  linear_feature, spline_feature, multi_task_features, gamdist (deviance,
  binomial overdispersion, _optimize, etc.) and proximal_operators in
  ``:math:`` roles or ``.. math::`` blocks; embedded ASCII tables
  (binomial covariate-class examples) are now reST grid tables.
- Expanded docstrings on the abstract _Feature interface methods so
  Sphinx-rendered subclasses inherit useful descriptions, and added
  full docstrings to every public-facing proximal-operator function in
  gamdist/proximal_operators.py (previously most had no docstring).
- Normalized NumPy-style parameter blocks to drop the leading-space
  indentation that confused Napoleon; standard-form ``name : type``
  with consistent alignment throughout.

2026-04-27 : Multi-task GAM (joint fit across K outcomes with convex coupling)
- Added MultiTaskGAM and a new feature module gamdist/multi_task_features.py
  for fitting K supervised-learning tasks jointly under a shared feature set
  with optional convex cross-task regularization. First slice in the design
  RFC at issue #39: per-task (family, link), per-task n_k (different lengths
  allowed), and one coupling penalty -- group-lasso across tasks on a linear
  feature -- which zeros the feature simultaneously across all K tasks once
  lambda is large enough. Coupling lives entirely inside the feature; the
  orchestrator never sees a penalty coefficient (CLAUDE.md seam principle).
- gamdist/multi_task_gamdist.py: MultiTaskGAM(families, links=None,
  dispersions=None, name=None) holds K parallel ADMM-state lists (f_bar,
  z_bar, u, offset, residual histories). add_feature(name, type='linear',
  ...) is restricted to linear features in this slice; categorical/spline
  can be added by analogy. fit([X_1,...,X_K], [y_1,...,y_K], weights=None,
  smoothing=1.0, max_its=100) runs K-coupled ADMM: per iteration, each
  multi-task feature's optimize_multi receives the K fpumz vectors and
  returns K contributions; the proximal step is K independent calls of the
  existing (family_k, link_k) prox via _optimize_task. Convergence requires
  every task's primal/dual residual to drop below its own tolerance;
  prim_res / dual_res accumulate the max across tasks for the trace.
- gamdist/multi_task_features.py: _MultiTaskFeature ABC adds initialize_multi,
  optimize_multi, predict_task, compute_dual_tol_task, num_params_task, and
  dof_task; the single-task ABC methods (initialize / optimize / predict)
  raise NotImplementedError so the orchestrator distinguishes the two paths
  unambiguously. _MultiTaskLinearFeature owns one slope per task plus shared
  data caches; without coupling the K subproblems decouple and reduce to K
  independent 1-D least-squares solves matching _LinearFeature task-by-task.
  regularization={"group_lasso_across_tasks": {"coef": lambda}} adds
  lambda * sqrt(sum_k m_k^2 * xtx_k), the K-task generalization of the
  single-task linear group-lasso. Closed-form prox in the change of
  variables eta_k = m_k * sqrt(xtx_k): the penalty becomes lambda * ||eta||_2
  and the per-task quadratic stays diagonal, so the standard L2
  group-lasso shrinkage drops out: eta = max(0, 1 - lambda/(rho * ||b||_2)) * b.
- predict accepts either a single DataFrame (same predictors for every task)
  or a length-K list (per-task predictors) and returns a length-K list of
  mean responses; deviance returns a length-K list and reuses the existing
  per-family deviance math, parameterized by task index.
- Out of scope for this slice (raise NotImplementedError on call): persistence
  (save_flag / load_from_file), summary / plot / aic / aicc / gcv / ubre /
  confidence_intervals, robust covariance, quantile / huber families,
  covariate classes, the estimate_overdispersion path, and per-feature
  tasks=[subset] routing. Other coupling penalties (nuclear / trace norm,
  network lasso on a task graph, hierarchical pooling) are also deferred.
- Tests in tests/test_multi_task_gam.py cover: feature-level validation
  (zero tasks, unsupported regularization keys, missing / negative coef),
  smoothing scaling, huge-lambda zeroing all K slopes simultaneously,
  lambda=0 decoupling matches per-task OLS closed form, MultiTaskGAM
  construction (empty / unknown / quantile families, mismatched links
  length, unsupported (family, link) pair, canonical-link defaults,
  per-task dispersions), and end-to-end fits: unequal n_k recovery, no-
  coupling parity with two independently-fit GAMs, group-lasso uniformly
  zeroing a noise feature in both tasks, mixed normal+identity /
  binomial+logit fit and predict, predict accepting per-task DataFrames,
  deviance(X, y) parity with the iterate-based form, predict-before-fit
  guard, and NotImplementedError on summary / aic / aicc / gcv / ubre /
  confidence_intervals plus the categorical feature-type rejection.
  Closes the first slice of #39; follow-up issues to file: other coupling
  penalties (#39 trace-norm / network-on-tasks / hierarchical pooling),
  multi-task categorical / spline features, persistence, summary / plot,
  and tasks=[subset] routing for mixed task-local / shared features.

2026-04-26 : Adaptive (weighted) lasso wrapper
- Added `fit_adaptive_lasso(gam, X, y, *, gamma=1.0, eps=1e-6, **fit_kwargs)`
  as a top-level helper exported from the `gamdist` package. It runs the
  standard two-stage adaptive lasso of Zou (2006): fit `gam` once with
  the user-configured L1 penalty to get pilot coefficients, rewrite each
  L1-regularized feature's per-coefficient weight to
  `base / (|pilot - prior| + eps) ** gamma`, then refit. Both stages are
  ordinary weighted-L1 fits at fixed weights, so each subproblem stays
  convex (CLAUDE.md North Star). Closes #31.
- _LinearFeature: gained `_apply_adaptive_l1(gamma, eps)` that rewrites
  the scalar `_coef1` from the pilot slope `_m`. The next `initialize()`
  call rescales it by the model's `smoothing` into `_lambda1`.
- _CategoricalFeature: gained `_apply_adaptive_l1(gamma, eps)` that
  rewrites `_coef1` as a per-category dict, naturally exercising the
  existing dict-coef support in `_compute_lambda`. When the user
  configured a prior, the adaptive deviation is `|p_c - prior_c|` and
  only categories named in the prior dict are rewritten; categories
  excluded by an explicit dict-form base coefficient stay at zero.
- _CategoricalFeature: made `initialize()` idempotent with respect to
  the prior dict-to-vector conversion (the second stage of adaptive
  lasso would otherwise call `.items()` on the previously-converted
  ndarray and crash). The user-supplied prior keys are now persisted in
  a separate `_prior_keys` set so `_compute_lambda` can apply uniform
  base-coef penalties only to the originally-named categories on
  re-initialize. Older pickles default `_prior_keys` to the categories
  with non-zero `_prior` values for backward compatibility.
- Tests in `tests/test_adaptive_lasso.py` cover: parameter validation
  (gamma <= 0, eps <= 0, no L1 features), the per-feature
  `_apply_adaptive_l1` behavior (linear scalar rewrite, no-op when no
  L1 / zero base coef, categorical dict rewrite, dict-base-coef
  category exclusion, prior-deviation logic and prior-key filtering),
  end-to-end recovery on a sparse linear-features fixture (adaptive
  shrinks noise harder than plain L1 and is closer to the true signal),
  noise features driven essentially to zero with gamma=1, and a
  five-level categorical with two active levels recovering sparsity
  better than plain L1.

2026-04-26 : Convex shape constraints on feature coefficients
- Added a `constraints={...}` keyword to `GAM.add_feature` and to each
  feature class. Supported keys: `sign` (`"nonnegative"` /
  `"nonpositive"`), `lower` / `upper` (floats), `monotonic`
  (`"increasing"` / `"decreasing"`), `convex`, and `concave`. All are
  linear inequality constraints, so every per-feature subproblem stays
  convex (CLAUDE.md North Star).
- _LinearFeature: only sign / lower / upper apply (a single slope has no
  meaningful monotonicity / curvature). The closed-form 1-D optimum is
  projected onto `[lower, upper]` -- a no-op when the unconstrained
  minimum is already feasible. No cvxpy on this path; the new step is a
  one-line clip after the existing soft-threshold / Huber math.
- _CategoricalFeature: drops linear-inequality terms into the existing
  cvxpy program. `lower` / `upper` clip every `q_c`. `monotonic` /
  `convex` / `concave` require an `order` list of category labels and
  emit consecutive-difference / second-difference constraints along
  that order. The implicit zero-sum constraint is preserved, so one-
  sided sign constraints collapse the entire vector to zero -- they
  exist for symmetry and for the regulated "no negative effects"
  interpretation. Categories appearing only in the order list are
  added to the feature's known category set so the optimize step can
  reference their indices.
- _SplineFeature: shape constraints land on the spline evaluated at the
  knots `xi`. `initialize` now also stores the basis matrix `N_xi` of
  shape (K, K) so the optimize step can express
  `f_xi = N_xi @ theta` cheaply. `lower` / `upper` clip the knot
  values; `monotonic` / `convex` / `concave` use `cvx.diff` to encode
  consecutive / second differences along the sorted knots. Constraint
  enforcement at the knots is a near-global proxy for a smooth cubic
  spline. `optimize()` routes through `_optimize_cvx` whenever
  constraints (or either group lasso) is active; the closed-form
  Cholesky path is preserved when none are active. The constrained
  QP can hit CLARABEL's numerical-tolerance failure on dense knot
  grids with many active inequality constraints, so `_optimize_cvx`
  catches `cvx.SolverError` and retries with SCS once before
  surfacing the failure.
- Save / load round-trips for all three feature types updated; older
  pickles default `has_constraints` to False with empty bounds /
  ordering. `_SplineFeature._load` rebuilds `N_xi` from the persisted
  knots when loading older pickles that don't store it.
- gamdist.add_feature gained docstring text describing the new kwarg.
- Tests in `tests/test_constraints.py` cover: API validation paths
  (unknown keys, sign / lower-upper exclusivity, sign on linear,
  monotonic on linear, missing order on categorical, duplicates /
  too-short order, mutually-exclusive convex / concave, inverted
  bounds), unconstrained-violates-constraint demonstrations for each
  constraint type, save/load round-trips for all three feature types,
  and end-to-end GAM fits (linear sign clipping, categorical
  monotonic-increasing across three levels, spline monotonic on a
  triangle-wave dose-response signal).

2026-04-26 : Huber penalty regularizer for linear and categorical features
- Added regularization={"huber": {"coef": lambda, "delta": delta}} to
  _LinearFeature and _CategoricalFeature. The penalty
  lambda * h_delta(coef) is the standard Huber loss applied to each
  per-feature coefficient: 0.5*c^2 for |c| <= delta, delta*|c| -
  0.5*delta^2 outside. Convex, smooth at the origin, with bounded
  per-coefficient gradient -- ridge-like for small magnitudes,
  L1-like for large ones. Closes #27.
- _LinearFeature: closed-form 1-D piecewise solve. Inside the band
  the penalty acts like ridge with strength lambda; outside it acts
  like L1 with strength lambda*delta. Combines additively with l2
  in closed form (denom_quad = xtx + (2*lambda2 + lambda)/rho inside
  the band, denom_lin = xtx + 2*lambda2/rho outside; the linear-
  region offset is lambda*delta/rho). Combinations with l1,
  group_lasso, or group_lasso_inf are rejected at __init__ with a
  clear error -- those penalize the same |m| and stack non-trivially
  with the Huber linear-region threshold; if combinations are needed
  in the future they should be implemented explicitly rather than
  silently doing the wrong thing. No cvxpy on the linear path.
- _CategoricalFeature: penalty term
  (1/rho) * sum_c lambda_c * cvx.huber(q_c, delta) is added to the
  cvxpy objective. cvxpy's huber atom is 2x the standard 0.5-leading
  Huber loss; the (2/rho) scaling factor used elsewhere in the
  categorical objective absorbs the 0.5, leaving (1/rho) on the
  cvxpy term. coef accepts a scalar (uniform across categories) or
  a per-category dict (matching the l2 pattern); delta is a single
  positive scalar. Combinations with all existing categorical
  penalties just work via cvxpy.
- Save/load round-trips updated for both feature types; older
  pickles default has_huber=False with zero coefficient.
- gamdist.add_feature docstring mentions the new key. Tests in
  tests/test_{linear,categorical}_huber.py cover: missing-coef /
  missing-delta / non-positive-delta error paths (and negative coef
  on linear), smoothing scaling for both scalar and dict coef,
  coef=0 matches unpenalized, large-coef zeros the contribution,
  the quadratic-region and linear-region closed-form math (linear),
  large-delta collapse to ridge with strength lambda/2 (both), the
  combination with l2 closed form (linear), the cvxpy branch's
  shrinkage geometry vs. ridge with the same lambda (categorical),
  rejection of combinations with l1/group_lasso variants on linear,
  save/load round-trip, and an end-to-end GAM fit demonstrating
  Huber's bounded-influence behavior on a single high-leverage point
  (linear) or a single extreme category (categorical).

2026-04-26 : L_inf-norm group lasso variant
- Added regularization={"group_lasso_inf": {"coef": lambda}} as a
  drop-in companion to the existing L2 group lasso on _LinearFeature,
  _CategoricalFeature, and _SplineFeature. The penalty is
  lambda * ||f_j||_inf, where ||.||_inf is the largest absolute
  per-row contribution in f_j. Convex by construction; produces a
  different shrinkage geometry from the L2 variant -- coefficient
  clipping rather than uniform contraction. Closes #45.
- _LinearFeature: f_j = m * (x - mean(x)), so ||f_j||_inf =
  |m| * max|x - mean(x)|. The penalty reduces to a 1-D
  soft-threshold on the slope with threshold
  lambda_inf * max|x - mean(x)| / rho. Stacks additively with the
  existing L1 and L2-group-lasso thresholds and combines with
  ridge as an elastic-net closed form (no cvxpy on the linear path).
  initialize() now also stores _x_inf = max|x - mean(x)|; older
  pickles recompute it on load.
- _CategoricalFeature: penalty term
  (2/rho) * lambda_inf * ||A q||_inf is added to the cvxpy program
  alongside any other regularizers. ||A q||_inf = max_{c with obs}
  |q_c|; categories that never appear in training data are masked
  out (via ccs > 0) so unused-but-network-connected levels can't
  inflate the penalty.
- _SplineFeature: penalty term gamma_inf * cvx.norm_inf(N theta)
  is added in _optimize_cvx alongside any L2 group-lasso term.
  optimize() routes through the cvxpy path whenever either
  group-lasso variant is active; the closed-form Cholesky path is
  preserved when both are off.
- Save/load round-trips updated for all three feature types; older
  pickles default has_group_lasso_inf=False with zero coefficient.
- gamdist.add_feature docstring mentions the new key. Tests in
  tests/test_{linear,categorical,spline}_group_lasso_inf.py cover:
  no-coef rejection, smoothing scaling, coef=0 matches unpenalized,
  large-coef zeros the contribution, threshold/closed-form for the
  linear path, the categorical clipping geometry vs. L2's uniform
  contraction, additive combination with the L2 variant, save/load
  round-trip, and end-to-end "drops the noise feature" within a
  full GAM fit for each feature type.

2026-04-26 : Quasi-likelihood families (quasi_poisson, quasi_binomial)
- Added two new family aliases: family="quasi_poisson" routes to the
  Poisson log-link prox, and family="quasi_binomial" routes to the
  binomial logistic-link prox. Point estimates therefore coincide with
  the corresponding full-likelihood family on identical data; only the
  dispersion parameter (and any inferential quantity that scales with
  it) differs.
- _Feature dispatch (prox in _optimize, _glm_irls_terms,
  _deviance_from_mu, _pearson_residuals, _unit_deviance_for_residuals,
  _anscombe_residuals, deviance-residual sign branch) now uses a new
  _base_family attribute that strips the "quasi_" prefix. The user-
  facing _family is preserved for save/load and dispersion routing.
  All existing single-family code paths see no behavior change since
  _base_family == _family for non-quasi families.
- GAM.dispersion() gained two cases. For quasi_poisson it always
  returns the Pearson chi-square / (n - p) estimate (or the user-
  supplied dispersion= value), reusing the existing
  _poisson_overdispersion path. For quasi_binomial it calls a new
  _binomial_pearson_dispersion that handles both Bernoulli (no
  covariate classes) and binomial-counts data; the formula matches
  the Pearson branch of _binomial_overdispersion but does not require
  the replication structure that estimator assumes.
- CANONICAL_LINKS, FAMILIES, and the Family Literal grew the two new
  entries; SUPPORTED_FAMILY_LINK_PAIRS picks them up automatically
  through CANONICAL_LINKS.items(). Non-canonical (quasi_*, link)
  combinations are rejected at construction time with the same error
  used for the existing convex-only allow-list.
- Tests in tests/test_quasi_families.py cover: canonical-link
  defaults, _base_family alias, point-estimate equivalence with the
  full-likelihood cousin on overdispersed (NB and beta-binomial)
  data, dispersion estimate above 1 on overdispersed data and
  exactly 1 on the standard Poisson/binomial path, the Pearson
  closed-form match, Bernoulli quasi-binomial fit, user-supplied
  dispersion respected, non-canonical link rejection, sandwich SEs
  agree with the Poisson sandwich at the same fitted mu (since phi
  drops out), and a save/load round-trip preserving _family /
  _base_family / dispersion. Closes #23.

2026-04-26 : L1 regularization for linear features
- _LinearFeature now accepts regularization={"l1": {"coef": lambda}}.
  The previous option-B rejection (added in #20) is removed; the
  matching error message in __init__ is replaced with the standard
  "no coefficient" check used by the other regularizers.
- The penalty lambda * |m| reduces the per-feature primal step to a
  closed-form 1-D soft-threshold on b = x_centered^T y with threshold
  lambda1 / rho. With ridge the elastic-net closed form drops out:
      m = soft(b, lambda1/rho) / (xtx + 2*lambda2/rho).
  When group_lasso is also active, its threshold (lambda_gl *
  sqrt(xtx) / rho) adds to the L1 threshold -- both penalize |m| at
  the origin and combine additively. No cvxpy on the linear path;
  the existing ridge-only fast path is unchanged when L1 is off.
- Smoothing scaling, save/load, and per-feature state plumbing for
  L1 (_has_l1, _coef1, _lambda1) were already in place from the
  pre-rejection era and are reused as-is.
- Tests in tests/test_linear_l1.py cover: missing-coef error,
  smoothing scaling, lambda=0 parity with the unpenalized solve,
  huge-lambda zeroing the slope, intermediate-lambda shrinkage,
  the soft-threshold closed form, the elastic-net closed form when
  combined with l2, the additive-threshold closed form when combined
  with group_lasso, save/load round-trip, and an end-to-end GAM test
  where the noise feature shrinks to <10% of the signal feature's
  norm. The rejection assertion in tests/test_linear_feature.py is
  flipped to assert the missing-coef path instead. Closes #53.

2026-04-26 : Reject non-convex (family, link) combinations up front
- GAM.__init__ now validates (family, link) against an explicit
  SUPPORTED_FAMILY_LINK_PAIRS allow-list (the canonical pairs from
  CANONICAL_LINKS) and raises ValueError naming the unsupported pair
  and the canonical link to use. Previously, non-canonical combinations
  routed through a silent scipy.optimize.minimize_scalar fallback in
  proximal_operators.py with no convexity guarantee, contradicting the
  convexity-only design (see CLAUDE.md).
- Removed the fallback prox functions _prox_normal, _prox_binomial,
  _prox_poisson, _prox_gamma, and _prox_inv_gaussian from
  proximal_operators.py. Each was the non-canonical-link sibling of the
  corresponding canonical-link solver and existed only to invoke
  minimize_scalar per observation, an interface that hides non-convex
  problems behind best-effort scalar minimization. The canonical-link
  solvers (_prox_normal_identity, _prox_binomial_logit,
  _prox_poisson_log, _prox_gamma_reciprocal, and
  _prox_inv_gaussian_reciprocal_squared) and the quantile / huber
  identity-link solvers are unchanged. The minimize_scalar fallback
  inside _prox_binomial_logit_scalar / _prox_poisson_log_scalar (after
  Newton fails to converge on the convex canonical objective) is
  preserved -- it operates on a pair that's in the allow-list.
- GAM._optimize now dispatches by family alone, since the link is
  forced to its canonical value by construction-time validation. The
  catch-all "Family X and Link Y not (yet) supported." branch is kept
  as a defensive backstop but is unreachable from public API.
- GAM._load also re-validates the (family, link) pair so older pickles
  saved with a non-canonical link fail loudly with the same error
  message users get at construction.
- Tests:
    * tests/test_proximal_non_canonical.py is removed; its canonical-
      link array-form / weighted-call coverage is folded into
      tests/test_proximal_operators.py
      (test_prox_poisson_log_array_with_weights,
       test_prox_gamma_reciprocal_with_weights,
       test_prox_inv_gaussian_reciprocal_squared_with_weights,
       and the eta-space robustness / reference-minimization tests).
    * tests/test_internals.py::test_link_function_round_trip drops the
      probit and complementary_log_log cases; only canonical pairs
      remain.
    * tests/test_gam_api.py adds test_unsupported_family_link_raises
      (parametrized across the previously silent fallback combinations:
       normal+log, normal+logistic, binomial+probit, binomial+cloglog,
       poisson+identity, gamma+log, gamma+identity, inv_gaussian+log),
      a message-content assertion that names the canonical link, and a
      check that family="exponential" + non-reciprocal link is rejected
      after the gamma collapse.
- Migration: callers who relied on a non-canonical link must switch to
  the canonical link for that family. Per CLAUDE.md, adding a new
  (family, link) entry to SUPPORTED_FAMILY_LINK_PAIRS requires both a
  convexity proof and a dedicated solver in proximal_operators.py.
  Closes #19.

2026-04-26 : Sandwich (Huber-White) covariance estimator
- GAM.robust_covariance(cov_type="HC0") returns the heteroscedasticity-
  consistent sandwich estimator
      V = (X' W X)^{-1} (X' diag(s_i^2) X) (X' W X)^{-1}
  for the parameter vector defined by GAM._design_matrix(). Per
  observation w_i = (dmu/deta)_i^2 / V(mu_i) is the IRLS weight and
  s_i = (y_i - mu_i) (dmu/deta)_i / V(mu_i) is the eta-scale score;
  mu is read off the fitted model. The dispersion drops out, so the
  estimator is identical for known and estimated dispersion. cov_type=
  "HC1" applies the n / (n - p) small-sample correction.
- _design_matrix(X=None) builds an intercept + raw-linear + treatment-
  contrast (drop lex-first level) categorical design that matches
  statsmodels / patsy parametrization, so robust_covariance lines up
  term-by-term with statsmodels.GLM(...).fit(cov_type="HC0").cov_params().
- Supported families: normal, binomial, poisson, gamma, inverse_gaussian.
  Spline features, binomial covariate classes, observation weights, and
  the quantile / huber families raise NotImplementedError up front --
  follow-on issues will extend coverage. Penalized fits return the
  naive sandwich at the penalized point estimate; a bread that includes
  the penalty Hessian is future work and is also flagged in the
  docstring.
- Foundational dependency for issue #35 (quasi-likelihood families):
  with sandwich SEs in place, quasi_normal / quasi_binomial /
  quasi_poisson can ship without inference being a separate workstream.
- Tests in tests/test_robust_covariance.py: (a) inject statsmodels'
  converged mu into gamdist and compare cov_params to numerical
  precision (rtol=1e-10) -- pins the sandwich formula independent of
  the optimizer, since gamdist's ADMM stops at eps_abs=eps_rel=1e-3
  while statsmodels' IRLS goes to machine precision; (b) end-to-end
  fit-and-compare with looser tolerances (rtol=5e-2) on the same three
  families; (c) HC0 -> HC1 scaling identity; (d) design-matrix layout
  invariants; (e) NotImplementedError gates for splines, the quantile
  family, unsupported cov_type, and unfitted-model access. statsmodels
  is gated via pytest.importorskip so the test module is silently
  skipped in environments without it (statsmodels is not added to the
  package's runtime or dev dependencies). Closes #34.

2026-04-26 : Huber (M-estimator) family for robust continuous regression
- GAM now accepts family="huber" with a required positive `delta`
  parameter in the units of `y`. Only link="identity" is supported
  (the M-estimator framing is convex only at the identity link);
  other links raise up front. Residuals with |y - mu| <= delta are
  penalized as 0.5 * r^2; larger residuals are penalized linearly,
  capping their per-observation influence. This is the standard
  bounded-influence robust regression loss.
- Closed-form prox in gamdist/proximal_operators.py:
  _prox_huber_identity is a clipped weighted-mean shrinkage,
      x* = v + w*delta/mu               if y - v >  delta*(mu+w)/mu
      x* = v - w*delta/mu               if y - v < -delta*(mu+w)/mu
      x* = (w*y + mu*v) / (w + mu)      otherwise.
  The inner branch matches the normal-identity prox; the outer
  branches saturate at v +/- w*delta/mu, which is what gives Huber
  its bounded influence. No ADMM-loop changes; per-feature primal
  steps see only fpumz and rho exactly as before.
- The training offset (which anchors the model's average prediction,
  since every feature is mean-zero) is set to np.median(y) for the
  huber family. The median is the L1 limit of the Huber M-estimator
  of location and a robust anchor regardless of delta.
- _deviance_from_mu reports 2 * sum(L_delta(y - mu)) for the huber
  family (the M-estimator analogue of -2 * log-likelihood used by the
  convergence trace and AIC/BIC summaries); dispersion() is fixed at
  1.0. huber is added to FAMILIES_WITH_KNOWN_DISPERSIONS so AIC/BIC
  don't add a phantom DOF for an estimated dispersion.
- delta is persisted in save/load; older pickles default delta to None.
- Tests in tests/test_proximal_operators.py cover the prox vs.
  scipy.optimize.minimize_scalar across delta values that span the
  inner / outer regimes, the quadratic-region collapse to the
  normal-identity prox at large delta, the outer-region corner
  saturation, the weighted variant, and continuity at the threshold.
  tests/test_huber.py adds an end-to-end test that Huber recovers the
  underlying slope on outlier-contaminated data more accurately than
  OLS, that delta -> infinity collapses to OLS on clean data, the
  delta-required / non-positive-delta / non-identity-link API
  validations, the deviance = 2 * sum(huber_loss) identity, and a
  save/load round-trip. Closes #21.

2026-04-26 : Quantile (pinball-loss) family for conditional quantile regression
- GAM now accepts family="quantile" with a required tau parameter in
  (0, 1). Only link="identity" is supported (the M-estimator framing
  is convex only at the identity link); other links raise up front.
  tau=0.5 fits the conditional median; tau in (0, 1) fits an arbitrary
  conditional quantile.
- Pinball loss rho_tau(r) = max(tau*r, (tau-1)*r) plugs into the ADMM
  loop as a new entry in the (family, link) prox dispatch:
  _prox_quantile_identity in gamdist/proximal_operators.py is the
  closed-form shifted-soft-threshold
      x* = v + w*tau/mu        if y > v + w*tau/mu
      x* = v - w*(1-tau)/mu    if y < v - w*(1-tau)/mu
      x* = y                   otherwise
  i.e. the same shape as the L1 prox with asymmetric thresholds.
  No ADMM-loop changes; per-feature primal steps see only fpumz and
  rho exactly as before.
- The training offset (which anchors the model's average prediction,
  since every feature is mean-zero) is set to np.quantile(y, tau)
  for the quantile family instead of mean(y). Without this the
  intercept is locked to the wrong target for tau != 0.5 and ADMM
  cannot reach the correct fit.
- _deviance_from_mu treats 2 * sum(rho_tau(y - mu)) as the family's
  "deviance" (the M-estimator analogue of -2 * log-likelihood used
  by the convergence trace and AIC/BIC summaries); dispersion() is
  fixed at 1.0. quantile is added to FAMILIES_WITH_KNOWN_DISPERSIONS
  so AIC/BIC don't add a phantom DOF for an estimated dispersion.
- tau is persisted in save/load; older pickles default tau to None.
- Tests in tests/test_proximal_operators.py cover the prox vs.
  scipy.optimize.minimize_scalar on the same scalar objective across
  tau values, the shifted-soft-threshold thresholds, the weighted
  variant, and the median collapse to the symmetric L1 prox.
  tests/test_quantile.py adds an end-to-end recovery test on
  heteroskedastic data y = x + (1+x)*N(0,1) for tau in {0.1, 0.5, 0.9}
  (true conditional quantile is linear in x), an empirical-coverage
  check on a held-out sample for tau in {0.2, 0.8}, API validation
  (missing / out-of-range tau, non-identity link), the deviance =
  2 * sum(pinball) identity, and a save/load round-trip. Closes #22.

2026-04-26 : Network ridge regularizer for categorical features
- _CategoricalFeature now accepts
      regularization={"network_ridge": {"coef": lambda, "edges": edges}}
  alongside the existing l1/l2/network_lasso/group_lasso options.
- Penalty is the smooth-shrinkage sibling of network_lasso:
      lambda * sum_{(i,j) in E} w_ij * (q_i - q_j)^2  =  lambda * q^T L q
  where L is the (weighted) graph Laplacian of the edges. The penalty
  is convex quadratic, so it lands as (2*lambda/rho) * L added to the
  AtA Hessian inside the categorical cvxpy program -- no new cvxpy
  variables or constraints. Lambda is scaled by the global `smoothing`
  argument (matching the other categorical penalties).
- Edges DataFrame uses the same node1 / node2 / weight schema accepted
  by network_lasso. Categories appearing only in the edge list (not in
  the training data) are still added to the feature's categories so
  they get coefficients shrunk toward their neighbors.
- Persisted across save/load. Older pickles default has_network_ridge
  to False on load.
- Tests in tests/test_categorical_network_ridge.py cover: missing-coef
  / missing-edges error paths, smoothing scaling, Laplacian structure
  (symmetric, row-sums-zero, PSD), lambda=0 parity with the
  unpenalized solve, empty edges DataFrame parity with the
  unpenalized solve, huge-lambda collapsing all coefficients to the
  zero-sum-constrained constant, intermediate-lambda shrinkage of
  neighbor differences, save/load round-trip, an end-to-end test
  where network ridge beats plain L2 on a noisily under-sampled
  region in a smooth-along-the-chain spatial signal, and a
  combined-with-L2 sanity check. Closes #36.

2026-04-26 : Group lasso for linear and spline features
- _LinearFeature and _SplineFeature now accept
      regularization={"group_lasso": {"coef": lambda}}
  matching the categorical-feature pattern added 2026-04-25.
- Linear: penalty is lambda * ||f_j||_2 = lambda * |m| * sqrt(xtx),
  which reduces to a 1-D soft-threshold on the slope. Implemented
  closed-form inside _LinearFeature.optimize alongside the existing
  ridge solve; no cvxpy dependency added on this path.
- Spline: penalty is lambda * ||N theta||_2. Group lasso is non-
  smooth at the origin and breaks the Cholesky-based fast path, so
  _SplineFeature.optimize routes through cvxpy (CLARABEL) when the
  regularizer is active and keeps the existing closed-form path
  otherwise. _SplineFeature.__init__ gains a `regularization` kwarg
  threaded through GAM.add_feature.
- Save / load plumbing parallels the categorical implementation;
  older pickles continue to load (has_group_lasso defaults to False).
- New tests in tests/test_linear_group_lasso.py and
  tests/test_spline_group_lasso.py mirror the categorical
  group-lasso suite: smoothing scaling, soft-threshold closed form,
  zero-coef matches unpenalized, large-coef zeros the contribution,
  save/load round-trip, and an end-to-end GAM test where the noise
  feature shrinks heavily relative to the signal feature. Closes #44.

2026-04-26 : Adopt ruff format
- Wire `ruff format` in as the project's formatter. `[tool.ruff]`
  line-length drops from 100 to 88 (ruff format / black default) and
  a `[tool.ruff.format]` block is added. CI (`.github/workflows/
  ci.yml`) gains a "Format check with ruff" step alongside the
  existing ruff lint / mypy / pytest steps; PRs that introduce
  unformatted code now fail. CLAUDE.md's Install / run and Tests /
  CI sections mention the formatter. Closes #55. One-shot reformat
  of the existing tree (12 files, ~300/-189) lands in the same
  commit so CI passes from the moment the check is enabled.

2026-04-26 : Reformulate inv_gaussian + reciprocal_squared prox in eta space
- _prox_inv_gaussian_reciprocal_squared_scalar now iterates Newton in
  eta = z^2, the natural convex coordinate for this prox. The Hessian
  in eta is 0.25*w*eta^(-3/2) + mu, strictly positive for w, mu, eta
  > 0, so Newton converges monotonically from any positive starting
  point. The damped-Newton-with-backtracking and minimize_scalar
  fallback added in #42 are gone — the only safeguard now is halving
  the Newton step when it would push eta non-positive. Function
  returns eta directly instead of z*z. The eta-space reference test
  added in #42 (test_prox_inv_gaussian_reciprocal_squared_matches_eta_minimization)
  continues to pass without modification, as does the robustness grid.
  Closes #52.

2026-04-26 : Reject l1 regularization on linear features
- _LinearFeature.__init__ now raises ValueError if the user passes
  regularization={"l1": ...}. The L1 coefficient was previously stored
  but silently dropped in optimize() — users got a plain ridge / no-
  penalty fit and no warning. The error message points at the
  categorical-feature L1 path for users who need L1 today, and at
  issue #53 for the follow-up that will implement L1 properly on
  linear features. Closes #20.

2026-04-26 : Damped Newton fallback for inv_gaussian + reciprocal_squared prox
- _prox_inv_gaussian_reciprocal_squared_scalar previously raised
  ValueError when undamped Newton failed to converge in 100 iterations,
  inconsistent with the binomial / poisson canonical-link solvers which
  fall back to scipy.optimize.minimize_scalar. The prox is convex in
  eta = z^2 but not in z, so the Newton step in z is not always a
  descent direction. Now uses damped Newton with backtracking on the
  objective (matching _prox_binomial_logit_scalar) and falls through to
  minimize_scalar if Newton can't make progress. Closes #42.

2026-04-26 : Rename network_lasso edge columns country1/country2 to node1/node2
- _CategoricalFeature now expects the edges DataFrame for
  regularization={"network_lasso": {"edges": ...}} to use the
  domain-neutral column names "node1" and "node2" (with the existing
  optional "weight"). The previous names hardcoded a country-pairs
  use case into a generic feature class. Clean rename, no backwards-
  compat path for the old names — callers with country-keyed edges
  need a one-line update. README example and tests updated.
  Closes #37.

2026-04-26 : Persist network_lasso edges DataFrame across save/load
- _CategoricalFeature._save now stores the original edges DataFrame
  alongside the derived difference matrix D, and _load restores it.
  Older pickles only stored D and silently dropped self._edges; loaded
  models now keep the adjacency for inspection or re-fitting. Older
  pickles without the new key continue to load (D-only path), matching
  the same older-pickle handling used for has_group_lasso. Closes #43.

2026-04-26 : Validate weights and covariate_class_sizes lengths in fit()
- GAM.fit() now raises ValueError immediately if `weights` or
  `covariate_class_sizes`, when provided, do not match len(y). Previously
  the mismatch propagated as a confusing broadcasting failure deep
  inside deviance computation. Closes #40.

2026-04-26 : Replace assert with raise in proximal-operator validation
- _prox_normal, _prox_binomial, _prox_poisson, _prox_gamma, and
  _prox_inv_gaussian previously used `assert inv_link is not None` to
  validate the required inv_link argument. Under `python -O` assertions
  are stripped, which would let a None inv_link propagate as a less-
  clear TypeError deeper in the call. Now raises ValueError with a
  clear message in all five locations. Closes #41.

2026-04-25 : Group lasso for categorical features
- _CategoricalFeature now accepts
      regularization={"group_lasso": {"coef": lambda}}
  alongside the existing l1/l2/network_lasso options.
- The penalty added to the cvxpy objective is
      lambda * ||A q||_2  =  lambda * sqrt(q^T diag(ccs) q)
  i.e. the L2 norm of the per-feature contribution vector. Driving
  this to zero zeros out the entire feature, giving categorical-
  variable selection: either every level participates in the model
  or none does. Lambda is scaled by the global `smoothing` argument
  (matching the other categorical penalties).
- Persisted across save/load. Older pickles default has_group_lasso
  to False on load.
- Tests cover: lambda=0 parity with the unpenalized solve, huge
  lambda zeroing the parameter vector, intermediate-lambda shrinkage,
  smoothing scaling, save/load round-trip, missing-coef error path,
  and an end-to-end variable-selection test (a noise feature shrinks
  to <10% of the signal feature's norm).
- Linear / spline group lasso and the l_inf variant remain on the
  to-do list for follow-up.

2026-04-25 : Parallel feature optimization
- GAM.fit() now accepts an n_jobs argument. With n_jobs > 1 (or
  n_jobs=-1 for os.cpu_count()), the per-feature primal step in each
  ADMM iteration is dispatched through a ThreadPoolExecutor created
  once at the start of fit() and shut down on return. NumPy / SciPy /
  cvxpy release the GIL during their numeric kernels, so threading
  produces real speedup -- expect roughly 2-4x on models with several
  non-trivial features (splines, categoricals via cvxpy). Default
  remains n_jobs=1 (serial) since pure linear-only models are usually
  faster serial.
- Refactor: extract the ADMM main loop and per-feature dispatch from
  fit() into _admm_loop and _optimize_features helper methods. This
  is what lets the pool lifecycle be expressed as a small try/finally
  in fit() without re-indenting ~70 lines of loop body.
- Serial and parallel runs accumulate f_new in self._features
  insertion order, so on identical inputs they produce bit-identical
  results (test_fit_n_jobs_2_matches_serial verifies).

2026-04-25 : Residuals vs predictor plots
- Add GAM.plot_residuals_vs_predictor(predictor, kind=..., name=...).
  Scatter of residuals against an arbitrary 1-D array of length
  num_obs. Useful for diagnosing unmodeled non-linearities (a
  predictor already in the model) or assessing whether to add a new
  predictor (one not currently in the model). Categorical
  (string/object) predictors get an automatic categorical x-axis.

2026-04-25 : Fix _gamma_dispersion convergence
- Replace the damped-Newton solver in _gamma_dispersion (fixed step
  beta=0.1, max_its=100, tol=1e-6) with scipy.optimize.brentq on a
  wide bracket. The likelihood equation
      2 * num_obs * (log nu - psi(nu)) - dof / nu = dev
  is monotonic in nu and has a unique positive root for any inputs
  with dev > 0 and 2 * num_obs > dof, so a bracket-based root
  finder converges reliably. The previous Newton iteration ran out
  of iterations on perfectly reasonable inputs (e.g. dof=3, dev=10,
  num_obs=50), causing GAM(family="gamma").dispersion() to spuriously
  raise ValueError after fitting.
- The function now raises ValueError only when the equation has no
  positive root (e.g. dev <= 0).

2026-04-25 : Anscombe residuals
- Add kind="anscombe" to GAM.residuals() (and GAM.plot_residuals()).
  Implements the family-specific variance-stabilizing transformation
  A(t) = integral V(s)^(-1/3) ds (McCullagh & Nelder 1989 Sec 2.4.1):
    * normal:           (y - mu) / sqrt(phi)  (== Pearson)
    * binomial(m):      sqrt(m) * (A(y/m) - A(mu)) / [mu(1-mu)]^(1/6)
                        / sqrt(phi), with A from the regularized
                        incomplete beta with parameters (2/3, 2/3)
    * poisson:          (3/2) * (y^(2/3) - mu^(2/3))
                        / mu^(1/6) / sqrt(phi)
    * gamma:            3 * (y^(1/3) - mu^(1/3)) / mu^(1/3) / sqrt(phi)
    * inverse_gaussian: (log y - log mu) / sqrt(mu * phi)

2026-04-25 : Cleanup follow-ups
- Categorical predict() now emits a UserWarning naming any categories
  not seen during fit (effect 0 fallback is unchanged).
- Harden the Newton iterations in _prox_binomial_logit_scalar and
  _prox_poisson_log_scalar: switch to scipy.special.expit /
  np.logaddexp for stability, add a backtracking line search, and
  fall back to scipy.optimize.minimize_scalar instead of raising
  ValueError when max_its is exhausted.
- Add GAM.residuals(kind="response"|"pearson"|"deviance") and
  GAM.plot_residuals(): a 1x2 figure with residuals-vs-fitted on the
  left and a normal Q-Q plot on the right.
- Implement Poisson overdispersion: GAM(family="poisson",
  estimate_overdispersion=True) now estimates phi via the Pearson
  chi-square / dof formula. Without the flag the dispersion is still
  fixed at 1.0.

2026-04-25 : BIC and pseudo-R^2
- Add GAM.bic() = deviance/dispersion + log(n) * p, matching the
  parameter-counting convention used by aic()/aicc().
- Add GAM.null_deviance() (deviance of the intercept-only model) and
  GAM.r_squared() (deviance-based pseudo-R^2; classical R^2 for the
  normal+identity case, NaN when null deviance is zero).
- Refactor GAM.deviance() onto a private _deviance_from_mu(y, mu, m, w)
  helper so null_deviance() reuses the per-family math.
- summary() now prints BIC and R^2.
- Remove "AICc, BIC, R-squared estimate" from the to-do list at the
  top of gamdist.py.

2026-04-25 : AICc
- Implement GAM.aicc() using the Hurvich-Tsai small-sample correction
  AICc = AIC + 2p(p+1)/(n-p-1), matching the convention used by aic()
  for counting parameters under estimated dispersion. Returns +inf
  when n <= p + 1 (overparameterized). summary() now prints AICc.

2026-04-25 : Doc cleanup
- Refresh CLAUDE.md to describe the Python 3.11+ codebase: drop the
  Python-2 framing, the obsolete pip/test.py instructions, and the
  references to the removed multiprocessing path and the long-gone
  Cython cdef remnants. Add a short Tests/CI section.
- Remove the dead "# self._rho = mv['rho']" line in GAM._load(); fit()
  resets _rho on every call, so persisting it had no effect.
- Drop the stale "AICc" line from GAM.summary()'s docstring; summary()
  has never printed AICc, and aicc() raises NotImplementedError.

2026-04-25 : Modernization (v0.2.0)
- Drop Python 2 support; require Python 3.11 or newer.
- Switch packaging from setup.py to pyproject.toml with the hatchling
  backend; manage development environments with uv.
- Add PEP 484 type hints across the public GAM API and the feature
  classes; promote _Feature to abc.ABC with abstractmethod stubs.
- Add a pytest test suite (96 tests, 84% line coverage) covering the
  proximal operators, feature classes, GAM API, save/load round-trip,
  link round-trip, and end-to-end regression varieties.
- Bug fixes:
    * deviance(X, y, covariate_class_sizes=ccs) used m=1 when ccs was
      provided (inverted condition); now uses ccs as intended.
    * Define SplineError so the existing raise sites in spline_feature
      no longer NameError when triggered.
    * confidence_intervals() and aicc() now raise NotImplementedError
      instead of silently returning None.
    * Pickle files are now opened in binary mode ('wb'/'rb') so save
      and load work on Python 3.
    * Iteration order over features is deterministic (sorted) wherever
      it affects covariate-class indexing.
    * Logistic link uses scipy.special.expit / logit to avoid exp()
      overflow on extreme linear predictors.
    * Binomial / Poisson / gamma deviance handle boundary mu and y
      values without spurious RuntimeWarnings.
- Default cvxpy solver for categorical features is CLARABEL (ECOS is
  no longer bundled with cvxpy >= 1.4).
- Removed: the unreachable multiprocessing branch in fit() and the
  associated _feature_wrapper helper. Parallel feature optimization
  remains a documented to-do item.
- Pickle compatibility: model files saved with v0.1 (Python 2) cannot
  be loaded by v0.2; refit and re-save.

2017-12-30 : Overdispersion
- Implemented over- (or under-) dispersion for binomial families.

2017-12-29 : Save/Load
- Added ability to save models after fitting has completed and load
  them for subsequent predictions.

2017-12-28 : Covariate classes and spline fix
- Adding ability to use covariate classes with categorical variables
  and binomial response.
- Fixing spline normalization to have mean zero over the training
  data.
