Metadata-Version: 2.4
Name: cal-context
Version: 1.3.0
Summary: Context Assembly Layer — intelligent, cache-aware context management for LLM applications
Author: MT Solutions LLC
License: MIT
Project-URL: Homepage, https://cal-context.com
Project-URL: Repository, https://github.com/mtsolutions/context-assembly-layer
Project-URL: Issues, https://github.com/mtsolutions/context-assembly-layer/issues
Keywords: llm,context,assembly,ai,token,management,caching
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Dynamic: license-file

# CAL — Context Assembly Layer

Intelligent, cache-aware context management for LLM applications. CAL dynamically selects, compresses, and assembles context chunks so your AI agent gets exactly the right information for each request — nothing more, nothing less — while preserving provider cache coherence.

Works with Anthropic, OpenAI, and Google Gemini out of the box.

### The Problem

You have an AI agent with tools, documents, conversation history, and user preferences. Stuffing everything into every request wastes tokens, increases latency, and costs money. But naively reducing tokens can bust your provider's prefix cache — making the "optimized" version cost MORE than the unoptimized one. We learned this the hard way in production.

### What CAL Does

**Selector** — Scores chunks by relevance using IDF-weighted triggers, summary matching, and conversation inheritance. Two-zone architecture: stable prefix (always cached) + dynamic chunks (deterministically ordered to maximize cache hits). Includes poison trigger suppression and require-any gates to prevent noise loading.

**Chunker** — Splits large documents into coherent pieces. Compresses intelligently when a chunk is relevant but too large.

**Tool Stubs** — Three-tier lazy tool loading with conversation history awareness. Provides lightweight stubs until the model signals intent to use a specific tool. Automatically preserves full schemas for tools already used in conversation history, preventing provider validation errors.

**Cost Engine** — Provider-aware savings calculator. Knows that Anthropic has 4 input tiers, OpenAI has automatic 90% cache discounts, and Google charges for cache storage. No more wrong math.

**Telemetry** — Logs token counts, cache hit rates, chunk overlap, and cost estimates per request. Trust production data, not benchmarks.

### Quick Start

```python
pip install cal-context

from cal import Selector, Chunker

selector = Selector(chunks_dir='./my_chunks', provider='anthropic')
chunker = Chunker(max_tokens=4096)

query = 'What is the status of Project Alpha?'
selected = selector.select(query, max_chunks=5)
compressed = chunker.process(selected)

# Context is assembled with cache-stable ordering
prompt = build_prompt(system=compressed, user=query)
response = call_llm(prompt)
```

### Why Cache-Stable Ordering Matters

Every major LLM provider caches by prefix. If the first N tokens of your request match the previous request, you get cheap cache-read pricing. If you sort chunks by relevance score, the same chunk can land at different positions between requests, breaking the prefix match. In our production test, this made the "optimized" version cost more than no optimization at all.

CAL fixes this by using scores only for selection (which chunks to include) and using deterministic alphabetical ordering for position. Overlapping chunks between requests stay in identical positions, maximizing cache hits.

### Two-Zone Architecture

```python
from cal import Selector
from cal.cache_hints import get_hint_provider

selector = Selector(chunks_dir='./my_chunks')
hints = get_hint_provider('anthropic')

assembled = selector.assemble('What is Project Alpha status?')
system = hints.build_system_message(assembled['zone1'], assembled['zone2'])

# Zone 1: Mandatory chunks (identity, rules) — always cached
# Zone 2: Dynamic chunks — alphabetically ordered for prefix stability
```

| Zone | Content | Order | Cache Behavior |
|---|---|---|---|
| Zone 1 (Stable) | Mandatory chunks: identity, tools, rules | Fixed — never changes | Always cache hit (prefix match) |
| Zone 2 (Dynamic) | Selected chunks: project data, recent context | Alphabetical by chunk ID | Cache hit when overlapping selections share prefix |
| User Message | Current user query | Always last | Never cached (always unique) |

### Noise Suppression (v1.2)

Real-world indexes have "poison triggers" — common words like dates, names, or generic terms that appear in dozens of unrelated chunks. Without suppression, these cause irrelevant chunks to load on nearly every request.

CAL v1.2 adds two defenses:

**IDF Floor** — Triggers appearing in 10+ chunks automatically get zero weight. Configurable via `idf_floor_doc_freq`.

```python
# Default: triggers in 10+ chunks get zero weight
selector = Selector(chunks_dir='./chunks')

# Custom threshold
selector = Selector(chunks_dir='./chunks', idf_floor_doc_freq=15)

# Disable entirely
selector = Selector(chunks_dir='./chunks', idf_floor_doc_freq=0)
```

**Require-Any Gates** — Lock a chunk behind specific topic keywords. The chunk only loads when at least one gate keyword appears in the query.

```json
{
  "chunk_id": "brookes_agent_setup",
  "triggers": ["agent", "setup", "server", "openclaw"],
  "negative_triggers": {
    "require_any": ["brooke"],
    "penalty": [],
    "hard_exclude": []
  }
}
```

Without the gate, this chunk loads on any query mentioning "agent" or "setup". With `require_any: ["brooke"]`, it only loads when Brooke is specifically mentioned.

In production, these two features moved us from **65% to 83% average token savings** — a +18 percentage point improvement from suppressing 4 noise chunks that were loading on 78-96% of all requests.

### History-Aware Tool Stubs (v1.3)

When an LLM conversation includes `tool_use` blocks (e.g. `read({path: "..."})`) in the message history, providers like Anthropic validate those historical calls against the current request's tool schemas. If you stub a tool that was already used, the schema mismatch causes a 400 error.

CAL v1.3 adds history-aware tool stub selection:

```python
from cal.tool_stubs import ToolStubs

stubs = ToolStubs(my_tool_schemas)

# Pass conversation messages for history awareness
schemas, meta = stubs.select_tools(
    "what about the second result?",
    messages=conversation_history,  # scans for tool_use blocks
)

# meta["history_protected"] shows which tools kept full schemas
# meta["tier"] shows which tier was selected (0, 1, or 2)
```

**Three tiers:**
| Tier | When | Behavior |
|---|---|---|
| 0 (No Tools) | Short conversational message, no history tools | Strip all tools — retry if model needs one |
| 1 (Shortlist) | No specific tool detected, but might need tools | Core tools as stubs only |
| 2 (Targeted) | Specific tool intent detected via triggers | Full schemas for detected tools, stubs for rest |

History-protected tools always keep full schemas regardless of tier.

### Provider Support

| Provider | Cache Type | CAL Behavior |
|---|---|---|
| Anthropic | Prefix + cache_control hint | Emits `cache_control: ephemeral` on Zone 1 |
| OpenAI | Automatic prefix + prompt_cache_key | Sets stable `prompt_cache_key` per workspace |
| Google Gemini | Implicit (auto) or Explicit (manual) | Stable prefix for implicit; explicit cache API optional |

### Cost Engine

```python
from cal.cost import estimate_savings, google_cache_breakeven

# How much does CAL actually save?
savings = estimate_savings(
    tokens_without_cal=23000,
    tokens_with_cal=5500,
    provider='anthropic',
    model='opus',
    cache_hit_rate=0.85,
    requests_per_day=100,
)
print(savings['note'])
# "76% token reduction. Saves ~$100/month at 100 req/day (anthropic/opus, 85% cache hit rate)."

# Is Google explicit caching worth it?
breakeven = google_cache_breakeven(
    cached_tokens=2000,
    uncached_rate=3.50,
    cache_read_rate=0.35,
)
print(f"Need {breakeven['breakeven_requests_per_hour']:.0f} req/hr to break even")
```

### Telemetry

```python
from cal.telemetry import Telemetry

telemetry = Telemetry(log_path='./cal_telemetry.jsonl', provider='anthropic', model='opus')

# Log each request
telemetry.record(
    original_tokens=23000,
    optimized_tokens=5500,
    chunks_selected=['identity', 'project_alpha', 'tools'],
    cached_tokens=4800,  # from provider response headers
)

# Get aggregate stats
stats = telemetry.get_stats()
print(f"Avg reduction: {stats['avg_reduction_pct']}%")
print(f"Avg cache overlap: {stats['avg_overlap_pct']}%")
```

### Production Benchmarks

Measured on Claude Opus 4, 103 chunks indexed, 250+ production requests:

| Metric | Without CAL | With CAL (v1.1) | With CAL (v1.2) | With CAL (v1.3) |
|---|---|---|---|---|
| Tokens per request | ~23,000 | ~7,800 | ~4,100 | ~4,100 |
| Chunks per request | 103 (all) | ~20 | ~6 | ~6 |
| Avg savings | — | 65% | 83% | 83% |
| Tool schema errors | N/A | Possible on multi-turn | Possible on multi-turn | 0 (history-aware) |
| Cost per request (cached) | $0.043 | $0.015 | $0.008 | $0.008 |
| Failsafe errors | N/A | 0 | 0 | 0 |

Important: The primary value is context quality, not cost. 4K relevant tokens produce better model responses than 23K tokens with noise. Cost savings are the bonus.

### Configuration

| Variable | Default | Description |
|---|---|---|
| CAL_PROVIDER | anthropic | Provider: anthropic, openai, or google |
| CAL_CHUNKS_DIR | ./chunks | Path to your chunks directory |
| CAL_MAX_TOKENS | 100000 | Max token budget for assembled context |
| CAL_COMPRESSION_THRESHOLD | 0.8 | Compress chunks above this % of budget |
| CAL_MODEL | (provider default) | Model for compression if needed |
| CAL_TELEMETRY_ENABLED | true | Enable/disable request logging |

### Contributing

PRs welcome. Open an issue first so we can discuss the approach. One feature or fix per PR.

### License

MIT — do whatever you want with it. See LICENSE.
