Metadata-Version: 2.4
Name: bclm-pytorch
Version: 1.0.7
Summary: Load and run BCLM language models with PyTorch.
Author: BCML Labs
License: Apache-2.0
Project-URL: Homepage, https://github.com/bcml-labs/bclm-pytorch
Project-URL: Repository, https://github.com/bcml-labs/bclm-pytorch
Project-URL: Documentation, https://github.com/bcml-labs/bclm-pytorch/blob/main/DOCS.md
Project-URL: Issues, https://github.com/bcml-labs/bclm-pytorch/issues
Keywords: language-model,pytorch,transformers,inference,bclm
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.1
Requires-Dist: safetensors>=0.4
Requires-Dist: tokenmonster>=1.0
Requires-Dist: huggingface_hub>=0.20
Provides-Extra: fast
Requires-Dist: xformers>=0.0.23; extra == "fast"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: ruff>=0.3; extra == "dev"
Dynamic: license-file

# BCLM-PyTorch Documentation

**bclm-pytorch** is the official Python library for loading and running inference with BCLM language models.

## Installation

```bash
pip install bclm-pytorch
```

The core dependencies (`torch`, `safetensors`, `tokenmonster`, and `huggingface_hub`) are installed automatically.
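
The package also declares two optional extras:

```bash
# Optimized CUDA attention kernels via xformers
pip install "bclm-pytorch[fast]"

# Development tools (pytest, ruff)
pip install "bclm-pytorch[dev]"
```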

## Quick Start

```python
import bclm

# Load a model from Hugging Face
model = bclm.load("bclm-1-small-preview")

# Multi-turn chat
chat = model.chat(system_prompt="You are a helpful assistant.")
print(chat.send("What is 2+2?"))
print(chat.send("Why is that the case?"))

# Streaming chat (real-time token output)
for chunk in chat.send_stream("Tell me a short story about a cat."):
    print(chunk, end="", flush=True)
print()
```

## Loading Models

`bclm.load()` is the single entry point for all model sources.

### From Hugging Face

```python
# Short form — resolves to huggingface.co/bclm/bclm-1-small-preview
model = bclm.load("bclm-1-small-preview")

# Explicit repo ID
model = bclm.load("bclm/bclm-1-small-preview")

# Any HF repo
model = bclm.load("your-org/your-model")
```

### From a Local Directory

Point to a directory containing `config.json` and `model.safetensors`:

```python
model = bclm.load("/path/to/model/directory")
```

### From a URL

Point to a directory served over HTTPS that contains `config.json` and `model.safetensors`:

```python
model = bclm.load("https://example.com/models/bclm-1-small/")
```

### Options

```python
import torch

model = bclm.load(
    "bclm-1-small-preview",
    device="cuda",          # "cpu", "cuda", "cuda:0", etc. (default: auto)
    dtype=torch.bfloat16,   # torch.float16, torch.float32 (default: bf16 on GPU, f32 on CPU)
    compile=False,          # Enable torch.compile for inference (default: False)
)
```

| Parameter | Default | Description |
|-----------|---------|-------------|
| `device`  | auto    | `"cpu"`, `"cuda"`, or a specific device string. Auto-selects CUDA if available. |
| `dtype`   | auto    | `torch.bfloat16` on CUDA, `torch.float32` on CPU. Weights ship as float16 but were trained in bfloat16. |
| `compile` | `False` | Wrap the model with `torch.compile`. Off by default to avoid warmup latency. |
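
For example, to override the defaults and force full-precision CPU inference on a machine without a GPU:

```python
import torch
import bclm

# Explicit CPU load in float32; on CUDA machines the defaults
# (device="cuda", dtype=torch.bfloat16) are selected automatically.
model = bclm.load("bclm-1-small-preview", device="cpu", dtype=torch.float32)
```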

## Chat Interface

The chat interface supports multi-turn conversations with automatic history management.

### Basic Usage

```python
chat = model.chat(
    system_prompt="You are a helpful assistant.",  # optional
    max_new_tokens=512,
    temperature=1.0,
    top_k=50,
    top_p=0.9,
    repetition_penalty=1.15,
    frequency_penalty=0.1,
)

# Each call appends to the conversation history
response = chat.send("Hello, who are you?")
print(response)

follow_up = chat.send("Can you elaborate?")
print(follow_up)
```

### Streaming

```python
for chunk in chat.send_stream("Write a poem about the ocean."):
    print(chunk, end="", flush=True)
print()
```

### Per-Message Overrides

Override generation parameters for a single message without changing the session defaults:

```python
response = chat.send(
    "Give me a one-word answer: is the sky blue?",
    max_new_tokens=10,
    temperature=0.1,
    repetition_penalty=1.0,   # disable for this message
    frequency_penalty=0.0,
)
```

### History Management

```python
# View conversation history
for msg in chat.messages:
    print(f"{msg['role']}: {msg['content'][:80]}")

# Clear history (keeps system prompt)
chat.clear()
```
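
Because `chat.messages` is a plain list of `{"role": ..., "content": ...}` dicts, a conversation can be persisted with nothing but the standard library. A minimal sketch (the file name is just an example):

```python
import json

# Save the conversation for later inspection or replay
with open("conversation.json", "w") as f:
    json.dump(chat.messages, f, indent=2)
```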

### Interactive CLI

Launch a blocking interactive chat loop in the terminal:

```python
chat.interactive()
```

Commands inside the interactive loop:
- `/clear` — reset conversation
- `/history` — print message list
- `/help` — show commands
- `Ctrl-C` — exit
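
Combined with `bclm.load()`, this gives a one-line REPL:

```python
import bclm

# Load a model and drop straight into the interactive loop
bclm.load("bclm-1-small-preview").chat().interactive()
```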

## Text Completion

For non-chat use cases (continuing a text prompt):

```python
text = model.complete("Once upon a time", max_new_tokens=200, temperature=0.8)
print(text)
```

### Streaming Completion

```python
for chunk in model.complete_stream("The quick brown fox", max_new_tokens=100):
    print(chunk, end="", flush=True)
print()
```

## Generation Parameters

All generation methods accept these parameters:

| Parameter            | Default | Description |
|----------------------|---------|-------------|
| `max_new_tokens`     | 512     | Maximum number of tokens to generate. |
| `temperature`        | 1.0     | Sampling temperature. 0 = greedy, higher = more random. |
| `top_k`              | 50      | Restrict sampling to the top-k most likely tokens. `None` to disable. |
| `top_p`              | 0.9     | Nucleus sampling threshold. `None` to disable. |
| `repetition_penalty` | 1.15    | Multiplicative penalty applied to every token already present in the context. Positive logits are divided by the penalty; negative logits are multiplied — both directions make the token less likely. 1.0 disables. Defaults are tuned for small language models, which are more prone to degenerate repetition. |
| `frequency_penalty`  | 0.1     | Additive penalty proportional to how many times each token has appeared. Logits are reduced by `frequency_penalty × count`, discouraging high-frequency tokens more strongly than low-frequency ones. 0.0 disables. |
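
To make the two penalties concrete, here is an illustrative sketch of the logit transforms the table describes. This is not the library's internal implementation; `apply_penalties` is a hypothetical helper:

```python
import torch

def apply_penalties(logits, context_ids, repetition_penalty=1.15, frequency_penalty=0.1):
    """Illustrative only: penalize tokens already present in the context."""
    ids, counts = torch.unique(torch.tensor(context_ids), return_counts=True)

    # Repetition penalty: divide positive logits, multiply negative ones,
    # so repeated tokens become less likely in both cases.
    seen = logits[ids]
    logits[ids] = torch.where(seen > 0, seen / repetition_penalty, seen * repetition_penalty)

    # Frequency penalty: subtract penalty * occurrence count, so tokens that
    # appear often are pushed down harder than tokens seen once.
    logits[ids] = logits[ids] - frequency_penalty * counts.to(logits.dtype)
    return logits

logits = torch.randn(32768)       # one logit per vocabulary entry
context = [42, 42, 7, 1001]       # token IDs already in the context
logits = apply_penalties(logits, context)
```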

## Model Information

```python
model = bclm.load("bclm-1-small-preview")

# Architecture name (e.g., "BCLM1Model")
print(model.architecture)

# Model config object
print(model.config)

# Parameter count
print(f"{model.num_parameters / 1e6:.1f}M parameters")

# Device and dtype
print(model.device, model.dtype)
```

## Advanced: Direct Access

For advanced use cases, you can access the underlying PyTorch modules:

```python
model = bclm.load("bclm-1-small-preview")

# Raw nn.Module (e.g., BCLM1Model)
raw = model.raw_model

# Inference wrapper (e.g., BCLM1ForGeneration)
gen = model.generator

# Tokenizer
tok = model.tokenizer
```
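
As a sketch of what direct access makes possible, the following runs one forward pass by hand. It assumes the raw module maps a `[batch, seq]` token-ID tensor to `[batch, seq, vocab]` logits, a common convention that this documentation does not guarantee:

```python
import torch
import bclm

model = bclm.load("bclm-1-small-preview")
tok, raw = model.tokenizer, model.raw_model

# Encode a prompt and run a single forward pass (assumed signature).
ids = torch.tensor([list(tok.encode("The capital of France is"))],
                   dtype=torch.long, device=model.device)
with torch.no_grad():
    logits = raw(ids)  # assumed shape: [batch, seq, vocab]

# Greedy next-token pick from the final position.
next_id = logits[0, -1].argmax().item()
print(tok.decode([next_id]))
```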

## Tokenizer

The current tokenizer backend is [TokenMonster](https://github.com/alasdairforsythe/tokenmonster). The tokenizer spec is embedded in each model's `config.json` (e.g., `"tokenmonster:english-32000-consistent-v1"`), so the correct tokenizer is loaded automatically.

```python
tok = model.tokenizer

ids = tok.encode("Hello world")
text = tok.decode(ids)
```
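
`encode` is also handy for counting prompt tokens before generation:

```python
prompt = "Summarize the following article in three sentences."
print(f"Prompt length: {len(tok.encode(prompt))} tokens")
```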

## Environment Variables

| Variable | Description |
|----------|-------------|
| `BCLM_TOKENIZER` | Override tokenizer spec (e.g., `tokenmonster:english-32000-consistent-v1`). |
| `BCLM_TOKENMONSTER_DIR` | Custom cache directory for TokenMonster vocab files. |
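
For example, in a shell profile (the cache directory here is illustrative):

```bash
export BCLM_TOKENIZER="tokenmonster:english-32000-consistent-v1"
export BCLM_TOKENMONSTER_DIR="$HOME/.cache/bclm-vocabs"
```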

## Config Format

Model directories must contain a `config.json`:

```json
{
  "architecture": "BCLM1Model",
  "model": {
    "vocab_size": 32768,
    "tokenizer": "tokenmonster:english-32000-consistent-v1",
    "embed_dim": 384,
    "n_layers": 12,
    "max_seq_len": 16384,
    "dropout": 0.0,
    "attn_heads": 6,
    "attn_kv_heads": 2,
    "local_attn_layers": [1, 5, 7, 11],
    "global_attn_layers": [3, 9],
    "attn_window_size": 1024,
    "conv_kernel_size": 4,
    "osc_n_pairs": 1,
    "osc_n_real": 16,
    "osc_clamp_min_decay": 1e-05,
    "bigram_table_factor": 5
  }
}
```

The `"architecture"` field determines which model class is instantiated. Weights should be in `model.safetensors` (safetensors format).

## Error Handling

```python
import bclm

try:
    model = bclm.load("nonexistent-model")
except FileNotFoundError:
    print("Model not found")
except ValueError as e:
    print(f"Invalid model: {e}")
except ImportError as e:
    print(f"Missing dependency: {e}")
```

## Requirements

- Python ≥ 3.9
- PyTorch ≥ 2.1
- `safetensors`
- `tokenmonster`
- `huggingface_hub`

Optional:
- `xformers` — enables optimized attention kernels on CUDA
