Metadata-Version: 2.4
Name: c3-charm
Version: 0.1.1
Summary: Python SDK for the CHARM time-series foundation model — embeddings, forecasting, and a downstream-task toolkit.
License-Expression: Apache-2.0
License-File: LICENSE
Keywords: time-series,embeddings,forecasting,foundation-model,anomaly-detection
Author: C3 AI
Author-email: opensource@c3.ai
Requires-Python: >=3.10
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Typing :: Typed
Provides-Extra: toolkit
Requires-Dist: datasetsforecast (>=0.1.0) ; extra == "toolkit"
Requires-Dist: dill (>=0.3.7) ; extra == "toolkit"
Requires-Dist: gin-config (>=0.5.0) ; extra == "toolkit"
Requires-Dist: httpx[http2] (>=0.25.0)
Requires-Dist: lightgbm (>=4.0.0) ; extra == "toolkit"
Requires-Dist: matplotlib (>=3.7.0) ; extra == "toolkit"
Requires-Dist: minisom (>=2.3.1) ; extra == "toolkit"
Requires-Dist: numpy (>=1.24.0)
Requires-Dist: optuna (>=3.0.0) ; extra == "toolkit"
Requires-Dist: pandas (>=2.0.0) ; extra == "toolkit"
Requires-Dist: pyarrow (>=14.0.0) ; extra == "toolkit"
Requires-Dist: python-dotenv (>=1.0.0)
Requires-Dist: requests (>=2.31.0)
Requires-Dist: scienceplots (>=2.0.0) ; extra == "toolkit"
Requires-Dist: scikit-learn (>=1.3.0) ; extra == "toolkit"
Requires-Dist: seaborn (>=0.13.0) ; extra == "toolkit"
Requires-Dist: tensordict (>=0.1.0) ; extra == "toolkit"
Requires-Dist: torch (>=2.0.0)
Requires-Dist: tqdm (>=4.60.0)
Project-URL: Documentation, https://github.com/c3ai/c3-charm#readme
Project-URL: Homepage, https://c3.ai
Project-URL: Issues, https://github.com/c3ai/c3-charm/issues
Project-URL: Repository, https://github.com/c3ai/c3-charm
Description-Content-Type: text/markdown

# c3-charm

[![PyPI version](https://img.shields.io/pypi/v/c3-charm.svg)](https://pypi.org/project/c3-charm/)
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Python](https://img.shields.io/pypi/pyversions/c3-charm.svg)](https://pypi.org/project/c3-charm/)

A Python SDK for the CHARM time-series foundation model. Provides **embeddings** (multivariate time series → vectors), **forecast/backcast** (quantile predictions), and a **toolkit** for downstream tasks (anomaly detection, retrieval, classification, reconstruction, forecasting).

## What is CHARM?

CHARM (CHannel Aware Representation Model) is a foundation model for **multivariate time series**. It ingests windows of (T, C) data — T timesteps, C channels — and produces dense embeddings that capture temporal patterns and cross-channel relationships. Channel names (descriptions) are part of the input, making the model channel-aware.

**No scaling required** — the model handles normalization internally. Send raw data directly.

---

## Installation

```bash
pip install c3-charm            # core SDK only (embeddings + forecast)
pip install "c3-charm[toolkit]" # adds toolkit dependencies (quotes keep zsh from expanding the brackets)
```

Or from source:

```bash
git clone https://github.com/c3ai/c3-charm.git
cd c3-charm
poetry install                    # core SDK only
poetry install --extras toolkit   # include toolkit dependencies (declared as the "toolkit" extra)
```

---

## Core SDK

### Client initialization

```python
from charm import CharmClient

client = CharmClient(
    base_url="http://your-server:8080",
    api_key="your-api-key",      # or set CHARM_API_KEY env var
    timeout=300,
    max_retries=3,
)
```

### Embeddings — `client.embeddings.create()`

Converts time series windows into dense vectors.

```python
response = client.embeddings.create(
    descriptions=[["sensor_A", "sensor_B"]],  # (N, C) channel names
    ts_array=[[[1.0, 2.0], [1.1, 2.1], ...]],  # (N, T, C) values
    batch_size=32,
    return_tensors="np",       # "list", "np", or "torch"
    aggregate=True,            # True → (N, D); False → (N, T_, C, D)
    progress=True,
)
embeddings = response.embeds  # shape (N, D) when aggregate=True
```

**`aggregate` parameter:**
- `True` (default): Returns flattened embeddings `(N, D)` — one vector per series. Best for retrieval, classification, clustering.
- `False`: Returns per-patch, per-channel embeddings `(N, T_, C, D)` where `T_ = T / patch_size`. Best for fine-grained tasks or custom heads.
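
With `aggregate=False` you can reduce the patch and channel axes yourself to build a custom representation. The sketch below uses random numbers in place of a real response. The patch size of 32 is taken from the input-constraints table, `D = 16` is an arbitrary placeholder, and mean pooling is only an illustrative choice, not necessarily what the server does when `aggregate=True`.

```python
import numpy as np

# Stand-in for response.embeds with aggregate=False (random values, not real embeddings).
N, T, C, D = 4, 128, 3, 16      # D is arbitrary here; the real value depends on the model
PATCH_SIZE = 32                 # assumed from the "T >= 32 (model patch size)" constraint
T_ = T // PATCH_SIZE            # number of patches along the time axis

rng = np.random.default_rng(0)
patch_embeds = rng.normal(size=(N, T_, C, D))

# Option 1: one vector per series (mean over patches and channels).
per_series = patch_embeds.mean(axis=(1, 2))      # (N, D)

# Option 2: one vector per channel, keeping channels separate for a custom head.
per_channel = patch_embeds.mean(axis=1)          # (N, C, D)

print(per_series.shape, per_channel.shape)
```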

**Async** (faster for large datasets):

```python
response = await client.embeddings.async_create(
    descriptions=descriptions,
    ts_array=ts_array,
    max_B_per_request=32,
    concurrency_per_call=8,
    return_tensors="np",
    aggregate=True,
)
```
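
Note that `await` only works inside a coroutine, so a plain script needs an event loop (for example via `asyncio.run`). The sketch below illustrates the general pattern that parameters such as `max_B_per_request` and `concurrency_per_call` suggest: split the series into chunks and cap the number of in-flight requests with a semaphore. `fake_embed` and `embed_all` are hypothetical stand-ins written for this example; this is not the SDK's actual implementation.

```python
import asyncio
import numpy as np

async def fake_embed(chunk: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for one embedding request: (B, T, C) -> (B, C)."""
    await asyncio.sleep(0.01)            # simulate network latency
    return chunk.mean(axis=1)            # placeholder "embedding"

async def embed_all(ts_array: np.ndarray, max_b: int, concurrency: int) -> np.ndarray:
    sem = asyncio.Semaphore(concurrency)

    async def one(chunk: np.ndarray) -> np.ndarray:
        async with sem:                  # at most `concurrency` requests in flight
            return await fake_embed(chunk)

    chunks = [ts_array[i:i + max_b] for i in range(0, len(ts_array), max_b)]
    # gather preserves input order, so results line up with the original series.
    parts = await asyncio.gather(*(one(c) for c in chunks))
    return np.concatenate(parts, axis=0)

data = np.arange(100 * 64 * 2, dtype=float).reshape(100, 64, 2)
result = asyncio.run(embed_all(data, max_b=32, concurrency=8))
print(result.shape)
```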

### Forecast / Backcast — `client.prediction.create()`

Zero-shot quantile predictions — no training required.

```python
response = client.prediction.create(
    descriptions=[["sensor_A", "sensor_B"]],
    ts_array=[[[1.0, 2.0], [1.1, 2.1], ...]],
    target_len=10,       # positive = forecast, negative = backcast
    return_tensors="np",
)
forecast = response.denormalized_predictions  # (N, 10, C, Q) — Q quantiles
median = response.median                      # (N, 10, C) — point forecast
```

**Backcast** (reconstruct past values):

```python
response = client.prediction.create(
    descriptions=descriptions,
    ts_array=ts_array,
    target_len=-8,  # reconstruct last 8 steps
    return_tensors="np",
)
```

### Input constraints

| Constraint | Limit |
|---|---|
| Timesteps per series | 1 ≤ T < 1500 |
| Channels per series | C < 1500 |
| Per-request size | N × C × T ≤ 500,000 |
| Batch consistency | All series in a request must share the same T and C |
| Minimum for good embeddings | T ≥ 32 (model patch size) |

The SDK handles client-side batching automatically when you set `batch_size` (sync) or `max_B_per_request` (async).
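
If you want to check an array against these limits before sending it, a small helper along the following lines may be useful. It is a sketch based only on the table above: the function name is hypothetical, the lower bound `C >= 1` is an assumption, and the SDK may already perform equivalent validation.

```python
import numpy as np

MAX_T = 1500             # exclusive upper bound on timesteps
MAX_C = 1500             # exclusive upper bound on channels
MAX_ELEMENTS = 500_000   # per-request cap on N * C * T

def max_series_per_request(ts_array: np.ndarray) -> int:
    """Validate an (N, T, C) array and return the largest N that fits in one request."""
    if ts_array.ndim != 3:
        raise ValueError(f"expected (N, T, C), got shape {ts_array.shape}")
    _, t, c = ts_array.shape
    if not 1 <= t < MAX_T:
        raise ValueError(f"T={t} is outside 1 <= T < {MAX_T}")
    if not 1 <= c < MAX_C:   # lower bound of 1 is an assumption
        raise ValueError(f"C={c} is outside 1 <= C < {MAX_C}")
    capacity = MAX_ELEMENTS // (t * c)
    if capacity == 0:
        raise ValueError(f"a single series with T*C={t * c} exceeds {MAX_ELEMENTS} elements")
    return capacity

windows = np.zeros((1000, 128, 8))
print(max_series_per_request(windows))  # 500000 // (128 * 8) = 488
```

The returned value is an upper bound for `batch_size` (sync) or `max_B_per_request` (async) with windows of that shape.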

### Output shapes

| Method | Output field | Shape |
|---|---|---|
| `embeddings.create(aggregate=True)` | `response.embeds` | (N, D) |
| `embeddings.create(aggregate=False)` | `response.embeds` | (N, T_, C, D) |
| `prediction.create(target_len > 0)` | `response.denormalized_predictions` | (N, target_len, C, Q) |
| `prediction.create(target_len < 0)` | `response.denormalized_predictions` | (N, abs(target_len), C, Q) |
| `prediction.create(...)` | `response.median` | (N, abs(target_len), C) |
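
To work with the quantile axis `Q`, you can slice out a prediction band and score it with the pinball (quantile) loss. The sketch below uses synthetic arrays in place of a real response. The quantile levels `[0.1, 0.5, 0.9]` are an assumption made for this example; check which levels your server actually returns.

```python
import numpy as np

def pinball_loss(y_true: np.ndarray, y_pred_q: np.ndarray, quantiles: np.ndarray) -> float:
    """Mean pinball loss; y_true is (N, H, C), y_pred_q is (N, H, C, Q)."""
    diff = y_true[..., None] - y_pred_q                  # (N, H, C, Q)
    loss = np.maximum(quantiles * diff, (quantiles - 1.0) * diff)
    return float(loss.mean())

# Assumed quantile levels, one per entry of the last axis.
quantiles = np.array([0.1, 0.5, 0.9])

rng = np.random.default_rng(0)
y_true = rng.normal(size=(2, 10, 3))                     # (N, H, C) observed values
# Stand-in for response.denormalized_predictions: (N, H, C, Q)
preds = y_true[..., None] + np.array([-1.0, 0.0, 1.0])

lower, upper = preds[..., 0], preds[..., -1]             # 10%-90% band, each (N, H, C)
coverage = float(np.mean((y_true >= lower) & (y_true <= upper)))
print(pinball_loss(y_true, preds, quantiles), coverage)
```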

### Channel descriptions

Descriptions are **required** and affect embedding quality. They tell the model what each channel represents.

**Good descriptions** — use meaningful, consistent names:
```python
descriptions = [["engine_temperature", "oil_pressure", "rpm"]]
```

**Acceptable** — short but informative:
```python
descriptions = [["temp", "pressure", "speed"]]
```

**Avoid** — generic or positional names reduce model effectiveness:
```python
descriptions = [["col_0", "col_1", "col_2"]]  # works but suboptimal
```

When working with pandas DataFrames, use column names directly:
```python
descriptions = [df.columns.tolist()] * N
```
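
If your data is one long recording rather than pre-cut windows, you first need to slice it into `(N, T, C)` windows. The following sketch does this with NumPy only; `make_windows` is a hypothetical helper, and the array stands in for something like `df.to_numpy()`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(values: np.ndarray, window_size: int, stride: int = 1) -> np.ndarray:
    """Cut a long (L, C) array into overlapping windows of shape (N, window_size, C)."""
    # sliding_window_view appends the window axis last: (L - window_size + 1, C, window_size)
    view = sliding_window_view(values, window_shape=window_size, axis=0)
    return view[::stride].transpose(0, 2, 1).copy()

column_names = ["engine_temperature", "oil_pressure", "rpm"]   # e.g. df.columns.tolist()
values = np.arange(300, dtype=float).reshape(100, 3)           # e.g. df.to_numpy()

windows = make_windows(values, window_size=32, stride=4)       # (N, 32, 3)
descriptions = [column_names] * len(windows)                   # (N, C) channel names
print(windows.shape, len(descriptions))
```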

### Scaling

**No pre-processing needed.** CHARM normalizes internally. Send raw data as-is. Do not apply StandardScaler, MinMaxScaler, or log transforms before calling the API.

### Error handling

```python
import time

from charm import CharmError, AuthenticationError, InvalidRequestError, RateLimitError

try:
    response = client.embeddings.create(...)
except AuthenticationError:
    # bad API key: retrying will not help, so re-raise
    raise
except InvalidRequestError as e:
    # shape violations, empty input
    print(f"Invalid request: {e}")
except RateLimitError:
    # back off before retrying the call
    time.sleep(5)
except CharmError as e:
    # catch-all for other SDK errors
    print(f"CHARM error: {e}")
```
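
For rate limits in particular, an application-level retry with exponential backoff is a common pattern on top of the client's own `max_retries`. The helper below is a generic sketch, not part of the SDK: `call_with_backoff` is a hypothetical name, and the demo uses a stand-in exception so that it runs without a server. With the SDK you would pass `retry_on=RateLimitError`.

```python
import random
import time

def call_with_backoff(fn, retry_on, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call fn(); on `retry_on` exceptions, wait with exponential backoff plus jitter and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise                                    # out of attempts: surface the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))  # jitter avoids synchronized retries

# Demo with a stand-in exception that clears on the third call.
class FakeRateLimit(Exception):
    pass

calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise FakeRateLimit()
    return "ok"

result = call_with_backoff(flaky, FakeRateLimit, sleep=lambda seconds: None)
print(result, calls["n"])
```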

---

## Toolkit — Downstream Tasks

The toolkit (`pip install c3-charm[toolkit]`) provides PyTorch models, dataset utilities, and training infrastructure for fine-tuning on top of CHARM embeddings.

### Retrieval — `charm_toolkit.retrieval`

Find similar time series by embedding similarity.

```python
from charm_toolkit.retrieval import (
    l2_normalize,
    cosine_similarity_matrix,
    knn_search,
    retrieval_metrics,
)

# Embed your data
response = client.embeddings.create(
    descriptions=descriptions,
    ts_array=windows_list,
    return_tensors="np",
)
embeddings = response.embeds  # (N, D)

# Similarity search
sim = cosine_similarity_matrix(embeddings, embeddings)

# kNN search
indices, scores = knn_search(query_emb, corpus_emb, k=5)

# Evaluation metrics
metrics = retrieval_metrics(
    query_emb=query_emb,
    corpus_emb=corpus_emb,
    query_labels=query_labels,
    corpus_labels=corpus_labels,
    k_values=[1, 3, 5, 10],
    exclude_self=True,
    query_ids=query_dataset_names,
    corpus_ids=corpus_dataset_names,
)
# Returns: precision@k, ndcg@k, hit_rate@k
```
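
For intuition about what a cosine kNN search and precision@k compute, here is a plain NumPy sketch on toy 2-D vectors. It is an independent illustration; the function names are hypothetical and the toolkit's own implementations (tie handling, `exclude_self`, and so on) may differ.

```python
import numpy as np

def cosine_knn(query: np.ndarray, corpus: np.ndarray, k: int):
    """Return (indices, scores) of the k most cosine-similar corpus rows per query row."""
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sim = q @ c.T                                   # (n_query, n_corpus)
    idx = np.argsort(-sim, axis=1)[:, :k]           # highest similarity first
    return idx, np.take_along_axis(sim, idx, axis=1)

def precision_at_k(idx: np.ndarray, query_labels: np.ndarray, corpus_labels: np.ndarray) -> float:
    """Fraction of retrieved neighbours whose label matches the query label."""
    return float((corpus_labels[idx] == query_labels[:, None]).mean())

corpus = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
corpus_labels = np.array([0, 0, 1, 1])
query = np.array([[1.0, 0.05], [0.05, 1.0]])
query_labels = np.array([0, 1])

idx, scores = cosine_knn(query, corpus, k=2)
print(idx, precision_at_k(idx, query_labels, corpus_labels))
```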

### Anomaly Detection — `charm_toolkit.anomaly_detection`

Detect anomalies via kNN distance scoring on windowed CHARM embeddings.

```python
from charm_toolkit.anomaly_detection import (
    sliding_window_embeddings,
    knn_anomaly_scores,
    window_scores_to_pointwise,
)

# 1. Embed sliding windows
train_emb = sliding_window_embeddings(
    client, train_data, descriptions,
    window_size=128, stride=1, batch_size=64,
)
test_emb = sliding_window_embeddings(
    client, test_data, descriptions,
    window_size=128, stride=1, batch_size=64,
)

# 2. Score test windows by distance to train
window_scores = knn_anomaly_scores(
    test_emb=test_emb,
    reference_emb=train_emb,
    k=5,
    distance="cosine",    # "cosine", "l2", "l1"
    aggregation="mean",   # "mean", "max"
)

# 3. Aggregate to per-timestep scores
pointwise_scores = window_scores_to_pointwise(
    window_scores=window_scores,
    window_size=128,
    stride=1,
    total_length=len(test_data),
    method="mean",  # "mean", "max", "last", "center"
)
```
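
The scoring idea in step 2 can be sketched in a few lines of NumPy: a test embedding is anomalous when even its nearest reference embeddings are far away. The function below is a hypothetical illustration of cosine distance with mean aggregation on synthetic vectors, not the toolkit's implementation.

```python
import numpy as np

def knn_scores(test_emb: np.ndarray, reference_emb: np.ndarray, k: int = 5) -> np.ndarray:
    """Mean cosine distance from each test embedding to its k nearest reference embeddings."""
    t = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)
    r = reference_emb / np.linalg.norm(reference_emb, axis=1, keepdims=True)
    dist = 1.0 - t @ r.T                                  # (n_test, n_ref) cosine distances
    nearest = np.partition(dist, k - 1, axis=1)[:, :k]    # k smallest per row, unordered
    return nearest.mean(axis=1)

rng = np.random.default_rng(0)
reference = rng.normal(loc=[5.0, 0.0], scale=0.1, size=(200, 2))   # "normal" behaviour
normal_pt = np.array([[5.0, 0.1]])
odd_pt = np.array([[0.0, 5.0]])                                      # points in a new direction

scores = knn_scores(np.vstack([normal_pt, odd_pt]), reference, k=5)
print(scores)   # the second score should be much larger than the first
```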

**Pointwise aggregation methods:**

When the stride is smaller than the window size, each timestep is covered by several overlapping windows. The `method` parameter controls how those window scores are combined into a single score per timestep:

| Method | Behavior | Use case |
|--------|----------|----------|
| `"mean"` | Average of all windows covering the point | Smooth, best for offline evaluation |
| `"max"` | Max score among covering windows | Conservative, catches isolated spikes |
| `"last"` | Score of the most recently *completed* window | Online/streaming — score only updates when a window finishes processing |
| `"center"` | Score of the window centered on each point | Minimal time-shift, tightest temporal alignment |
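
To make the `"mean"` and `"max"` rows concrete, the sketch below spreads window scores over the timesteps each window covers. It is a hypothetical re-implementation for illustration only; `"last"` and `"center"` are left out because their exact alignment rules are not specified here.

```python
import numpy as np

def to_pointwise(window_scores, window_size, stride, total_length, method="mean"):
    """Spread one score per window over the timesteps that window covers."""
    sums = np.zeros(total_length)
    counts = np.zeros(total_length)
    maxes = np.full(total_length, -np.inf)
    for i, score in enumerate(window_scores):
        start = i * stride
        end = min(start + window_size, total_length)
        sums[start:end] += score
        counts[start:end] += 1
        maxes[start:end] = np.maximum(maxes[start:end], score)
    covered = counts > 0
    out = np.full(total_length, np.nan)       # timesteps no window covers stay NaN
    if method == "mean":
        out[covered] = sums[covered] / counts[covered]
    elif method == "max":
        out[covered] = maxes[covered]
    else:
        raise ValueError(f"unsupported method: {method}")
    return out

scores = np.array([0.0, 0.0, 3.0, 0.0])       # 4 windows of size 3, stride 1 -> 6 timesteps
print(to_pointwise(scores, 3, 1, 6, "mean"))  # [0.  0.  1.  1.  1.5 0. ]
print(to_pointwise(scores, 3, 1, 6, "max"))   # [0. 0. 3. 3. 3. 0.]
```

The single high-scoring window is diluted under `"mean"` but kept at full strength under `"max"`, which is the trade-off the table describes.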

### ReconstructionModel — anomaly detection via learned head

```python
from charm_toolkit import (
    ReconstructionModel, create_reconstruction_datasets,
    collator, TrainerClass,
)
from torch.utils.data import DataLoader
import torch.nn as nn

train_ds, val_ds, test_ds = create_reconstruction_datasets(
    raw_data,           # (T, C) numpy array or torch tensor
    descriptions=channel_names,
    window_size=256,
    stride=1,
    train_ratio=0.7,
    val_ratio=0.15,
    sequential=True,
    scale=True,
)

model = ReconstructionModel(
    embedding_client=client,
    reconstructor="linear",  # "linear", "mlp", or custom nn.Module
    hidden_dim=128,
    dropout=0.1,
)

trainer = TrainerClass(
    model=model,
    train_loader=DataLoader(train_ds, batch_size=512, collate_fn=collator),
    val_loader=DataLoader(val_ds, batch_size=512, collate_fn=collator),
    epochs=1000,
    patience=5,
    lr=1e-3,
    criterion=nn.HuberLoss(),
)
trainer.fit()
```

### ForecastingModel — embedding-based forecasting

```python
from charm_toolkit import ForecastingModel, create_forecasting_datasets, collator, TrainerClass
from torch.utils.data import DataLoader

train_ds, val_ds, test_ds = create_forecasting_datasets(
    raw_data,
    descriptions=channel_names,
    train_horizon=96,
    test_horizon=96,
    train_ratio=0.7,
    val_ratio=0.15,
    sequential=True,
    scale=True,
)

model = ForecastingModel(
    embedding_client=client,
    horizon=96,
    input_size=96,
    head="linear",
    hidden_dim=128,
    mode="last",         # "last", "avg", "none"
    per_channel=True,
    num_channels=len(channel_names),
)

trainer = TrainerClass(
    model=model,
    train_loader=DataLoader(train_ds, batch_size=512, collate_fn=collator),
    val_loader=DataLoader(val_ds, batch_size=512, collate_fn=collator),
    epochs=1000,
    patience=10,
    lr=1e-2,
)
trainer.fit()
```

### ClassificationModel — time series classification

```python
from charm_toolkit import ClassificationModel, create_classification_datasets, collator, TrainerClass
from torch.utils.data import DataLoader
import torch.nn as nn

train_ds, val_ds, test_ds = create_classification_datasets(
    raw_data,          # (N, T, C)
    labels=labels,     # list of N integer labels
    descriptions=channel_names,
    train_ratio=0.7,
    val_ratio=0.15,
)

model = ClassificationModel(
    embedding_client=client,
    num_classes=num_classes,
    hidden_dim=128,
    pooling_over_t="mean",
    pooling_over_channels="mean",
    classifier_type="mlp",
)

trainer = TrainerClass(
    model=model,
    train_loader=DataLoader(train_ds, batch_size=32, collate_fn=collator),
    val_loader=DataLoader(val_ds, batch_size=32, collate_fn=collator),
    epochs=100,
    patience=10,
    lr=1e-3,
    criterion=nn.CrossEntropyLoss(),
)
trainer.fit()
```

### Precomputing embeddings (critical for training)

Toolkit models call the API on every forward pass, so each epoch would otherwise re-request embeddings for the same windows. When training on hundreds of windows per epoch, **precompute embeddings once**:

```python
from torch.utils.data import DataLoader
from charm_toolkit import precompute_dataset_embeddings, PrecomputedEmbeddingsDataset, collator

# Compute once, save to disk as memmap
train_shape = precompute_dataset_embeddings(
    client=client, dataset=train_ds,
    output_path="./outputs/train_embeddings.pt", memory_batch_size=8192
)
val_shape = precompute_dataset_embeddings(
    client=client, dataset=val_ds,
    output_path="./outputs/val_embeddings.pt", memory_batch_size=8192
)

# Wrap datasets — model skips API calls when "embeds" key present
train_ds = PrecomputedEmbeddingsDataset(train_ds, "./outputs/train_embeddings.pt", train_shape)
val_ds = PrecomputedEmbeddingsDataset(val_ds, "./outputs/val_embeddings.pt", val_shape)

# Training now uses cached embeddings — orders of magnitude faster
train_loader = DataLoader(train_ds, batch_size=512, shuffle=True, collate_fn=collator)
```
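
The underlying idea (compute once, write to a disk-backed array, read back lazily) can be sketched with NumPy alone. The code below is an illustration under stated assumptions, not the toolkit's implementation: `cache_embeddings` and `fake_embed` are hypothetical, and the file here is a `.npy` memmap rather than whatever format `precompute_dataset_embeddings` writes.

```python
import os
import tempfile
import numpy as np
from numpy.lib.format import open_memmap

def cache_embeddings(embed_fn, windows: np.ndarray, path: str, dim: int, batch_size: int = 256):
    """Embed `windows` batch by batch, writing results into a .npy memmap on disk."""
    out = open_memmap(path, mode="w+", dtype=np.float32, shape=(len(windows), dim))
    for start in range(0, len(windows), batch_size):
        batch = windows[start:start + batch_size]
        out[start:start + len(batch)] = embed_fn(batch)   # only one batch in RAM at a time
    out.flush()
    return out.shape

def fake_embed(batch: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for an embedding call: (B, T, C) -> (B, C)."""
    return batch.mean(axis=1)

windows = np.random.default_rng(0).normal(size=(1000, 32, 4)).astype(np.float32)
path = os.path.join(tempfile.mkdtemp(), "train_embeddings.npy")
shape = cache_embeddings(fake_embed, windows, path, dim=4)

cached = np.load(path, mmap_mode="r")    # later epochs read from disk, no embedding calls
print(shape, cached.shape)
```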

### Trainer API

```python
from charm_toolkit import TrainerClass

trainer = TrainerClass(
    model=model,
    train_loader=train_loader,
    val_loader=val_loader,
    test_loader=test_loader,     # optional
    lr=1e-3,
    weight_decay=1e-4,
    epochs=1000,
    patience=5,
    min_delta=1e-4,
    max_grad_norm=5.0,
    criterion=None,              # defaults to MSELoss
)
trainer.fit()
test_loss = trainer.evaluate(test_loader)
```
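
The `patience` and `min_delta` arguments follow the usual early-stopping convention. The sketch below shows that convention in plain Python; it assumes an improvement must exceed `min_delta` to reset the counter, which is the common definition but is not confirmed here for `TrainerClass`.

```python
class EarlyStopping:
    """Stop when validation loss has not improved by `min_delta` for `patience` epochs."""

    def __init__(self, patience: int = 5, min_delta: float = 1e-4):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=3, min_delta=0.01)
losses = [1.0, 0.8, 0.795, 0.797, 0.80, 0.78, 0.5]
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best)  # stops at epoch 4 with best loss 0.8
```

Note that the later, better losses (0.78, 0.5) are never seen: a small `patience` can stop training before a slow improvement arrives.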

### Dataset factory functions

All return `(train_dataset, val_dataset, test_dataset)`:

| Function | Input shape | Key args |
|---|---|---|
| `create_reconstruction_datasets(raw_data, ...)` | (T, C) | `window_size`, `stride`, `train_ratio`, `val_ratio` |
| `create_forecasting_datasets(raw_data, ...)` | (T, C) | `train_horizon`, `test_horizon`, `stride`, `train_ratio`, `val_ratio` |
| `create_classification_datasets(raw_data, labels, ...)` | (N, T, C) | `train_ratio`, `val_ratio` |

Reconstruction and forecasting expect a single long time series `(T, C)` split temporally. Classification expects pre-windowed `(N, T, C)`.
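
A temporal split keeps the oldest data for training and the newest for testing, so no future values leak into training. The sketch below shows the idea with `train_ratio` and `val_ratio`; `temporal_split` is a hypothetical helper, and the factory functions may handle boundaries and windowing differently.

```python
import numpy as np

def temporal_split(series: np.ndarray, train_ratio: float = 0.7, val_ratio: float = 0.15):
    """Split a (T, C) series into contiguous train/val/test segments, oldest data first."""
    if not 0 < train_ratio + val_ratio < 1:
        raise ValueError("train_ratio + val_ratio must be in (0, 1)")
    t = len(series)
    train_end = round(t * train_ratio)
    val_end = round(t * (train_ratio + val_ratio))
    return series[:train_end], series[train_end:val_end], series[val_end:]

series = np.arange(2000, dtype=float).reshape(1000, 2)
train, val, test = temporal_split(series)
print(len(train), len(val), len(test))
```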

### collator

All DataLoaders using toolkit datasets require `collator` as the `collate_fn`:

```python
from charm_toolkit import collator
# or equivalently:
from charm_toolkit.Datasets import collator
```

---

## Embeddings as features

CHARM embeddings work as drop-in feature vectors for any sklearn model:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression
from charm_toolkit.retrieval import cosine_similarity_matrix

response = client.embeddings.create(
    descriptions=descriptions,
    ts_array=windows_list,
    return_tensors="np",
)
X = response.embeds  # (N, D)

# Anomaly detection with isolation forest
clf = IsolationForest(contamination=0.05)
anomaly_labels = clf.fit_predict(X)

# Similarity search
sim = cosine_similarity_matrix(X, X)

# As features for any classifier
# (X_train: embeddings of your labeled windows, y_train: their labels)
clf = LogisticRegression().fit(X_train, y_train)
```

---

## Local Deployment

Deploy models locally from GitHub releases — no remote server needed:

```python
from charm import CharmClient

with CharmClient(tag="experiment-2026-03-15_10-30-00") as client:
    response = client.embeddings.create(...)
# Server shuts down automatically
```

When `tag` is provided:
1. Checks for GPU availability (falls back to CPU)
2. Clones repo at the specified tag (shallow clone)
3. Downloads model weights from the GitHub release
4. Launches the serving stack locally
5. Polls health endpoint until ready

Files are cached at `~/.charm/models/<tag>/`, so subsequent runs start faster.

```python
CharmClient(
    tag="experiment-tag",           # required for local mode
    repo_url="https://...",         # default: c3-e/research
    cache_dir="/path/to/cache",     # default: ~/.charm/models
    port=8080,                      # 0 = auto-select
)
```
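
Two of the mechanics mentioned above, auto-selecting a port and polling until the server is ready, are standard patterns that can be sketched with the standard library. The helpers below are hypothetical and say nothing about how `CharmClient` implements them; the health check is injected as a callable so the demo runs without a server.

```python
import socket
import time

def find_free_port() -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

def wait_until_ready(is_healthy, timeout: float = 60.0, interval: float = 0.5) -> bool:
    """Poll `is_healthy()` until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if is_healthy():
                return True
        except OSError:
            pass                       # server not accepting connections yet
        time.sleep(interval)
    return False

# Demo with a fake health check that succeeds on the third poll.
state = {"polls": 0}

def fake_health() -> bool:
    state["polls"] += 1
    return state["polls"] >= 3

port = find_free_port()
ready = wait_until_ready(fake_health, timeout=5.0, interval=0.01)
print(port, ready)
```

A port found this way can in principle be taken by another process before the server binds to it, so real code should be prepared to retry.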

---

## Decision guide

### When to use CHARM

- Multivariate time series (multiple channels measured over time)
- Each window has at least ~32 timesteps (model patch size)
- You want a strong starting point without feature engineering

### When to use classical methods instead

- Tabular data without a time dimension — use LightGBM, XGBoost
- Very short series (< 10 timesteps)
- Univariate series (a single channel): CHARM still works but may not outperform ARIMA/ETS

### Zero-shot vs fine-tuned

| Approach | When | Effort |
|---|---|---|
| `prediction.create(target_len=H)` | Quick forecast baseline, no labeled data | None — one API call |
| Embeddings + sklearn | Moderate data, combine with other features | Minutes |
| Embeddings + kNN (retrieval/AD) | Unlabeled anomaly detection or search | Minutes |
| Toolkit model (Reconstruction/Forecasting/Classification) | Have labeled data, want best performance | Train a small head (~minutes on CPU) |

---

## Testing

```bash
pip install pytest
python -m pytest tests/
python -m pytest tests/test_utils.py -v
```

## Documentation

The full API reference and usage guide is this README — it renders on the [PyPI page](https://pypi.org/project/c3-charm/).

## License

Apache License 2.0 — see [LICENSE](LICENSE).

