BOCD-GMM: Gaussian Mixture Model

Overview

The BOCD-GMM model is a particle-based, non-parametric approach to Bayesian online changepoint detection (BOCD) for data with complex distributions. It models the data with a Gaussian Mixture Model (GMM), which lets it represent multimodal data and makes it more robust to outliers.

When to Use BOCD-GMM

Best suited for:

  • Multimodal data
  • Heavy-tailed distributions
  • Outlier-prone data streams
  • Non-Gaussian distributions
  • Complex, heterogeneous data

Advantages:

  • Handles multimodal and non-Gaussian data
  • Robust to outliers through mixture components
  • Flexible distribution modeling
  • Better accuracy than BOCD-NIG on data that departs from a single Gaussian
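
The robustness claim can be illustrated without the library. The sketch below is not pybocd code; the function names and the example weights, means, and standard deviations are made up for illustration. It compares how surprising an outlier is under a single Gaussian and under a two-component mixture in which a broad, low-weight component absorbs extreme values.

```python
import math

def normal_logpdf(x, mean, std):
    """Log density of a univariate normal distribution."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)

def mixture_logpdf(x, weights, means, stds):
    """Log density of a Gaussian mixture, using log-sum-exp for stability."""
    terms = [math.log(w) + normal_logpdf(x, mu, s)
             for w, mu, s in zip(weights, means, stds)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

# Illustrative values only: a narrow main component plus a broad one.
outlier = 8.0
single = normal_logpdf(outlier, 0.0, 1.0)
mixed = mixture_logpdf(outlier, [0.95, 0.05], [0.0, 0.0], [1.0, 5.0])
print(f"single Gaussian: {single:.2f}, mixture: {mixed:.2f}")
```

Because the mixture assigns the outlier a much higher log density, one extreme observation is less likely to be mistaken for evidence of a changepoint.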

Limitations:

  • Computationally more expensive and slower than BOCD-NIG
  • Many hyperparameters to tune
  • Requires more data for stable estimates

Parameters

Initialization

from pybocd import BOCDGMM

model = BOCDGMM(
    # Component parameters
    alpha_0=2.0,
    beta_0=2.0,
    
    # Mean parameters
    m_0=0.0,
    kappa_0=1.0,
    
    # Precision parameters
    alpha_p_0=2.0,
    beta_p_0=2.0,
    
    # Mixture weight parameters
    mu_p_0=0.0,
    sigma_p_sq_0=1.0,
    
    # Jitter (smoothing) parameters
    jitter_mu=0.01,
    jitter_sigma_sq=0.01,
    jitter_tau_sq=0.01,
    jitter_pi=0.01,
    
    # Inference parameters
    l=200.0,           # Expected run length
    m=20,              # Number of mixture components
    n=200,             # Number of particles
    init_particle_n=50 # Initial number of particles
)

Parameter Descriptions

Parameter               Description
alpha_0, beta_0         Prior parameters for component weighting
m_0, kappa_0            Prior mean and precision for mixture component means
alpha_p_0, beta_p_0     Prior shape/rate for component precisions
mu_p_0, sigma_p_sq_0    Parameters for precision prior distribution
jitter_*                Smoothing parameters for particle updates
l                       Expected run length between changepoints
m                       Maximum number of mixture components
n                       Number of particles for sequential Monte Carlo
init_particle_n         Initial particle count before resampling
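
To build intuition for l, the sketch below assumes the standard BOCD formulation, in which the expected run length is the mean of a geometric prior and the per-step changepoint probability (hazard) is 1/l. Whether pybocd uses l in exactly this way is an assumption, and both helper functions are hypothetical.

```python
def constant_hazard(expected_run_length):
    """Per-step changepoint probability under a geometric run-length prior."""
    return 1.0 / expected_run_length

def survival_probability(expected_run_length, steps):
    """Prior probability that no changepoint occurs within `steps` observations."""
    hazard = constant_hazard(expected_run_length)
    return (1.0 - hazard) ** steps

hazard = constant_hazard(200.0)
survival = survival_probability(200.0, 200)
print(f"hazard per step: {hazard}")
print(f"P(no changepoint in 200 steps): {survival:.3f}")
```

Under this assumption, a smaller l makes the model expect changepoints more often, so it reacts faster but is more prone to false alarms.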

Usage Example

import numpy as np
from pybocd import BOCDGMM

# Generate synthetic data: four segments with different means
# (changepoints at t=100, 200, and 300)
np.random.seed(42)
data = np.concatenate([
    np.random.normal(-2, 0.5, 100),    # Segment 1: mean=-2
    np.random.normal(2, 0.5, 100),     # Segment 2: mean=2
    np.random.normal(0, 0.5, 100),     # Segment 3: mean=0
    np.random.normal(-2, 0.5, 100)     # Segment 4: mean=-2 (segment 1 regime returns)
])

# Add some outliers
outlier_indices = np.random.choice(len(data), 10, replace=False)
data[outlier_indices] += np.random.normal(0, 3, 10)

# Initialize GMM-based model
model = BOCDGMM(
    alpha_0=2.0, beta_0=2.0,
    m_0=0.0, kappa_0=1.0,
    alpha_p_0=2.0, beta_p_0=2.0,
    mu_p_0=0.0, sigma_p_sq_0=1.0,
    jitter_mu=0.01, jitter_sigma_sq=0.01,
    jitter_tau_sq=0.01, jitter_pi=0.01,
    l=200.0, m=20, n=200, init_particle_n=50
)

# Process data
for t, x in enumerate(data):
    model.add_data(x)
    
    if t % 50 == 0:
        print(f"Time {t}: Run length = {model.run_length:.1f}")
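
One simple way to turn a sequence of run-length estimates into changepoint locations is to look for sudden drops. The helper below is a hypothetical post-processing sketch, not part of pybocd; the min_drop threshold is an illustrative choice, and the trace is hand-built rather than model output.

```python
def detect_drops(run_lengths, min_drop=20):
    """Return the indices where the run-length estimate falls by at least min_drop."""
    changepoints = []
    for t in range(1, len(run_lengths)):
        if run_lengths[t - 1] - run_lengths[t] >= min_drop:
            changepoints.append(t)
    return changepoints

# Idealized trace: the run length grows steadily and resets at t=100 and t=200.
trace = list(range(100)) + list(range(100)) + list(range(50))
print(detect_drops(trace))  # [100, 200]
```

To apply it to real output, append model.run_length to a list inside the processing loop and pass that list to detect_drops.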

Accessing Results

# MAP estimate of run length
run_length = model.run_length

# Full posterior distribution
dist = model.run_length_dist

# Mixture component information
# (availability depends on implementation)
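
The posterior can be condensed into a few summary numbers. This sketch assumes run_length_dist behaves like a one-dimensional sequence in which entry r is the probability of run length r; that layout is an assumption about the library, and summarize_run_length_dist is a hypothetical helper.

```python
def summarize_run_length_dist(dist):
    """Summarize a run-length distribution given as a sequence of probabilities.

    Assumes dist[r] is the (possibly unnormalized) probability of run length r.
    """
    total = float(sum(dist))
    probs = [p / total for p in dist]
    map_r = max(range(len(probs)), key=probs.__getitem__)
    mean_r = sum(r * p for r, p in enumerate(probs))
    return {"map": map_r, "mean": mean_r, "p_change": probs[0]}

# Hand-made distribution, for illustration only.
example = [0.1, 0.05, 0.05, 0.8]
summary = summarize_run_length_dist(example)
print(summary)
```

Here p_change is the mass at run length 0, which under the stated assumption is the probability that a changepoint has just occurred.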

Tuning the GMM Model

Number of Particles (n)

More particles give a more accurate posterior approximation at a higher computational cost:

model = BOCDGMM(..., n=100)   # Fast, less accurate
model = BOCDGMM(..., n=500)   # Balanced
model = BOCDGMM(..., n=1000)  # Slow, more accurate
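
A common diagnostic for deciding whether a particle count is adequate is the effective sample size (ESS) of the particle weights. The sketch below shows the standard formula on made-up weights; the source does not say whether pybocd exposes its particle weights, so the inputs here are hypothetical.

```python
def effective_sample_size(weights):
    """ESS = 1 / sum of squared normalized weights."""
    total = sum(weights)
    normalized = [w / total for w in weights]
    return 1.0 / sum(w * w for w in normalized)

# Evenly weighted particles: every particle contributes.
uniform = [1.0] * 200
# Degenerate weights: one particle carries almost all of the mass.
skewed = [1.0] + [0.001] * 199

ess_uniform = effective_sample_size(uniform)
ess_skewed = effective_sample_size(skewed)
print(f"uniform: {ess_uniform:.1f}, skewed: {ess_skewed:.2f}")
```

An ESS far below the particle count means few particles carry the estimate, which is the situation a larger n (or resampling) is meant to address.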

Number of Components (m)

Controls mixture complexity:

model = BOCDGMM(..., m=5)     # Few components, simple patterns
model = BOCDGMM(..., m=20)    # Moderate complexity
model = BOCDGMM(..., m=50)    # High complexity, more flexible

Jitter Parameters

Jitter perturbs particles to maintain diversity:

# Less jitter (particles stay close to their current values)
model = BOCDGMM(..., jitter_mu=0.001, jitter_sigma_sq=0.001)

# More jitter (greater particle diversity)
model = BOCDGMM(..., jitter_mu=0.1, jitter_sigma_sq=0.1)
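
The general idea of jittering can be sketched as follows. This is a generic illustration, not pybocd's implementation: it assumes the jitter value is the variance of zero-mean Gaussian noise added to each particle, and jitter_particles is a hypothetical helper.

```python
import random

def jitter_particles(values, jitter_variance, rng):
    """Add zero-mean Gaussian noise with the given variance to each particle value."""
    std = jitter_variance ** 0.5
    return [v + rng.gauss(0.0, std) for v in values]

rng = random.Random(0)

# After resampling, many particles are exact copies of one another.
resampled = [1.5] * 8 + [-0.3] * 2

small = jitter_particles(resampled, 0.001, rng)
large = jitter_particles(resampled, 0.1, rng)
print(len(set(resampled)), len(set(small)), len(set(large)))
```

Resampling leaves only two distinct values; jittering spreads the copies apart so that the particle set can keep exploring nearby parameter values.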

Prior Settings

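Weak priors (the data dominate after a few observations):

model = BOCDGMM(..., alpha_0=1.0, beta_0=1.0, kappa_0=0.1)

Strong priors (the prior dominates until more data arrive):

model = BOCDGMM(..., alpha_0=10.0, beta_0=10.0, kappa_0=10.0)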
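
The effect of kappa_0 can be seen in the textbook conjugate update for a normal mean, where kappa_0 acts as a count of pseudo-observations located at m_0. The sketch assumes m_0 and kappa_0 play this conventional role; pybocd's particle-based updates may differ, and posterior_mean is a hypothetical helper.

```python
def posterior_mean(m_0, kappa_0, observations):
    """Conjugate posterior mean: prior mean and sample mean, weighted by kappa_0 and n."""
    n = len(observations)
    sample_mean = sum(observations) / n
    return (kappa_0 * m_0 + n * sample_mean) / (kappa_0 + n)

# Five made-up observations centred near 2.0, with the prior mean at 0.0.
obs = [2.0, 2.2, 1.8, 2.1, 1.9]

weak = posterior_mean(0.0, 0.1, obs)     # kappa_0=0.1: estimate stays near the data
strong = posterior_mean(0.0, 10.0, obs)  # kappa_0=10.0: estimate is pulled toward m_0
print(f"weak prior: {weak:.3f}, strong prior: {strong:.3f}")
```

With only five observations, the strong prior keeps the estimate much closer to m_0, which is why strong priors need more data before a new regime is reflected in the estimates.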

Performance Considerations

  • Memory: Grows linearly with number of particles and components
  • Speed: Slower than BOCD-NIG by 5-50x depending on parameters
  • Accuracy: Better on non-Gaussian, multimodal data
  • Stability: More stable with larger particle counts
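
Because the slowdown depends on the parameters, it is worth measuring on your own data. The helper below is a hypothetical timing sketch that accepts any per-observation update callable; it is demonstrated with a stand-in function rather than a real model.

```python
import time

def mean_update_seconds(update, data):
    """Average wall-clock seconds per call of update(x) over the given data."""
    start = time.perf_counter()
    for x in data:
        update(x)
    return (time.perf_counter() - start) / len(data)

# Stand-in update function; in practice pass a model's add_data method.
seen = []
per_update = mean_update_seconds(seen.append, [0.1] * 1000)
print(f"{per_update * 1e6:.2f} microseconds per update")
```

Passing model.add_data for two differently configured models gives a direct comparison of per-observation cost.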

Comparison with BOCD-NIG

Aspect                    BOCD-NIG              BOCD-GMM
Data Type                 Univariate, normal    Multimodal, complex
Speed                     Very fast             Slower
Robustness to outliers    Low                   High
Hyperparameters           Few (4-5)             Many (10+)
Computational Cost        O(1)                  O(n × m)
Best For                  Simple streams        Complex distributions

References

For theoretical details, see the original BOCD paper (Adams and MacKay, 2007, "Bayesian Online Changepoint Detection") and the literature on particle filtering (sequential Monte Carlo).