Metadata-Version: 2.4
Name: semvec
Version: 0.6.3
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: MacOS
Classifier: Operating System :: Microsoft :: Windows
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Rust
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Framework :: FastAPI
Classifier: Typing :: Typed
Requires-Dist: numpy>=1.23
Requires-Dist: requests>=2.31
Requires-Dist: tiktoken>=0.7
Requires-Dist: fastapi>=0.115 ; extra == 'api'
Requires-Dist: uvicorn[standard]>=0.32 ; extra == 'api'
Requires-Dist: sqlalchemy>=2.0 ; extra == 'api'
Requires-Dist: prometheus-client>=0.21 ; extra == 'api'
Requires-Dist: pydantic>=2.6 ; extra == 'api'
Requires-Dist: msgspec>=0.18 ; extra == 'api'
Requires-Dist: sentence-transformers>=3.0 ; extra == 'benchmarks'
Requires-Dist: fastmcp>=2.0 ; extra == 'coding'
Requires-Dist: cryptography>=42 ; extra == 'compliance'
Requires-Dist: ruff>=0.6 ; extra == 'dev'
Requires-Dist: mypy>=1.11 ; extra == 'dev'
Requires-Dist: pre-commit>=3.7 ; extra == 'dev'
Requires-Dist: pytest>=7 ; extra == 'dev'
Requires-Dist: httpx>=0.27 ; extra == 'dev'
Requires-Dist: mkdocs>=1.6 ; extra == 'docs'
Requires-Dist: mkdocs-material>=9.5 ; extra == 'docs'
Requires-Dist: pymdown-extensions>=10.7 ; extra == 'docs'
Requires-Dist: bm25s>=0.2 ; extra == 'hybrid'
Requires-Dist: nltk>=3.8 ; extra == 'hybrid'
Requires-Dist: pyjwt>=2.9 ; extra == 'jwt'
Requires-Dist: mem0ai>=0.1 ; extra == 'mem0'
Requires-Dist: faiss-cpu>=1.7 ; extra == 'mem0'
Provides-Extra: api
Provides-Extra: benchmarks
Provides-Extra: coding
Provides-Extra: compliance
Provides-Extra: cortex
Provides-Extra: dev
Provides-Extra: docs
Provides-Extra: hybrid
Provides-Extra: jwt
Provides-Extra: mem0
License-File: LICENSE
Summary: Semvec — patent-pending persistent semantic state engine
Keywords: llm,llm-agents,agent-memory,long-term-memory,chat-memory,vector-memory,memory-layer,context-compression,context-management,semantic-state,semantic-search,persistent-memory,embeddings,rag,retrieval-augmented-generation,agents,multi-agent,mcp,model-context-protocol,claude-code,cursor,openai,anthropic,mem0-alternative,letta-alternative,langchain-memory-alternative
Home-Page: https://www.semvec.io
Author-email: Michael Neuberger <vertrieb@versino.de>
License-Expression: LicenseRef-Proprietary
Requires-Python: >=3.10
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Changelog, https://semvec-docs.pages.dev/changelog/
Project-URL: Comparisons, https://semvec-docs.pages.dev/comparisons/
Project-URL: Documentation, https://semvec-docs.pages.dev
Project-URL: FAQ, https://semvec-docs.pages.dev/guides/faq/
Project-URL: Homepage, https://www.semvec.io
Project-URL: Pricing, https://www.semvec.io
Project-URL: PyPI, https://pypi.org/project/semvec/
Project-URL: Quickstart, https://semvec-docs.pages.dev/guides/quickstart/
Project-URL: REST API reference, https://semvec-docs.pages.dev/api/rest/
Project-URL: Support, https://www.semvec.io

# Semvec

[![PyPI](https://img.shields.io/pypi/v/semvec.svg?label=PyPI&color=blue)](https://pypi.org/project/semvec/)
[![Python versions](https://img.shields.io/pypi/pyversions/semvec.svg)](https://pypi.org/project/semvec/)
[![Wheel](https://img.shields.io/pypi/wheel/semvec.svg)](https://pypi.org/project/semvec/)
[![License](https://img.shields.io/badge/license-Proprietary-lightgrey.svg)](#license)
[![Docs](https://img.shields.io/badge/docs-semvec--docs.pages.dev-2ea44f.svg)](https://semvec-docs.pages.dev)
[![Patent applications pending](https://img.shields.io/badge/patents-applications%20pending-blue.svg)](#license--patents)

**Constant-cost semantic memory for LLM agents — drop-in alternative to mem0, Letta, and LangChain Memory.**

Semvec replaces unbounded conversation history with a **fixed-size semantic state vector** (384-d with the default MiniLM embedder; the examples below use 768-d mpnet) plus a tiered, content-aware memory. The input cost of every LLM call stays constant (turn 10 and turn 10 000 carry the same footprint), and the agent still has structured access to decisions, invariants, error patterns, and prior context across sessions.

```bash
pip install semvec
```

```python
from semvec import SemvecState, SemvecConfig
from semvec.token_reduction import SemvecStateSerializer

state = SemvecState(config=SemvecConfig(dimension=768))
for text, embedding in conversation:
    state.update(embedding, text)

context = SemvecStateSerializer().serialize(state, query_text="what did we decide?")
# `context` is a 150–350-token block — paste it into any LLM system prompt.
```

### Architectural differences vs. mem0, Letta, LangChain Memory

This table compares **architectural properties**, not measured performance. The benchmarks below were run head-to-head against mem0 only.

| Property | **semvec** | mem0 | Letta (MemGPT) | LangChain Memory |
|---|---|---|---|---|
| Per-turn input footprint | **O(1)** — fixed-size state | O(retrieved records) | O(in-context blocks) | depends on class (buffer ≈ O(n); summary ≈ bounded) |
| LLM calls during ingest | **0** (deterministic EMA) | LLM fact-extraction per turn | LLM-managed page-in/out | varies (none for buffer/vector; LLM for summary classes) |
| Recall procedure | Deterministic (vector + literal cache) | LLM-extracted facts | LLM-managed swap | Deterministic retrieval (when vector-based) |
| Numeric / exact-value safety | **Verbatim cache** with `Decimal` | Embedded → lossy | Embedded → lossy | Not addressed by the framework |
| Deployment options | Proprietary; self-hosted, air-gapped, or Versino-managed hosting | OSS, self-hosted | OSS, self-hosted | OSS, self-hosted |
| Patent protection | Applications pending (U.S. 19/269,195, 19/550,466; EP 25 188 105, EP 26 160 795) | — | — | — |
| Multi-agent coordination | Built-in (Cortex) | Manual | Manual | Manual |

→ Deep-dive comparisons: [vs mem0](https://semvec-docs.pages.dev/comparisons/vs-mem0/) · [vs Letta](https://semvec-docs.pages.dev/comparisons/vs-letta/) · [vs LangChain Memory](https://semvec-docs.pages.dev/comparisons/vs-langchain-memory/)

### Benchmarks (where we measured head-to-head)

- **LOCOMO 10-conv (1986 QAs, gpt-4o, T = 0.0)** — semvec BM25-hybrid + cross-encoder rerank scores **F1 0.495** including adversarial Cat 5 (vs **0.469** for the dense-only baseline, +2.6 pp). Strongest single-category lift: **multi-hop +5.3 pp**. Rank **2 of 8** on the published LOCOMO leaderboard, beating RAG @ k=5 (0.433), claude-3-sonnet 200K (0.428), gemini-1.0-pro 32K (0.391), and gpt-3.5-turbo 16K (0.359).
- **vs. mem0 on LOCOMO** — under the mem0 LLM-as-Judge prompt (verbatim), semvec clears mem0 by **~12 pp at the aggregate level** (mem0 paper excludes Cat 5; the LOCOMO paper does not). Wall-clock is **17× shorter** (2.77 h vs. ~47 h on the same suite) — semvec issues zero LLM calls during ingest.
- **Token efficiency** on the same LOCOMO config: **~93 % fewer input tokens per turn** vs gpt-4-turbo Full-Context 128K (~1.5–2 k vs ~26 k).

Reproduce with `pip install "semvec[benchmarks,hybrid,api,mem0]"` and `benchmarks/run_locomo.py`. We have not benchmarked against Letta or LangChain Memory directly; the comparison pages above describe the architectural differences, not measured performance gaps.

---

## Table of contents

- [What you get](#what-you-get)
- [Installation](#installation)
- [Choose your use case](#choose-your-use-case)
- [Token-reduced LLM context](#token-reduced-llm-context)
- [Drop-in chat proxy](#drop-in-chat-proxy)
- [Multi-agent coordination](#multi-agent-coordination)
- [Coding-agent compaction](#coding-agent-compaction)
- [REST API server](#rest-api-server)
- [Persistence](#persistence)
- [Configuration & environment variables](#configuration--environment-variables)
- [Error handling](#error-handling)
- [Licensing](#licensing)
- [Limitations & non-goals](#limitations--non-goals)
- [FAQ](#faq)
- [Telemetry](#telemetry)
- [Support](#support)
- [License](#license)

---

## What you get

| Capability | What it solves |
|---|---|
| **Constant-size compressed context** | Per-call LLM input cost stops growing with conversation length. ~93 % fewer input tokens per turn on LOCOMO vs gpt-4-turbo full-context. |
| **Tiered memory with selective forgetting** | Three tiers (short / medium / long term) with retention scoring — frequently-accessed older memories outlive never-touched newer ones. |
| **Domain anchors + resonance triggers** | Bias retrieval toward known domains or specific keywords without re-training. Lifts precision@3 from 86 % → 91.7 % on mixed-domain workloads. |
| **Drop-in chat proxy** | Wrap any OpenAI-compatible LLM and get compressed context for free. Works with vLLM, LiteLLM, OpenRouter, Ollama out of the box. |
| **Multi-agent coordination (Cortex)** | Run several agents that share an aggregated view, vote on proposals, and exchange checksummed state vectors. |
| **Coding-agent compaction** | Persistent memory across coding sessions — design decisions, invariants, error patterns, code-pointer index, anti-resonance checks. MCP server for Claude Code & Cursor included. |
| **REST API server** | `semvec serve` exposes the full surface over FastAPI: sessions, clusters, regions, observer, network, literal cache, Prometheus metrics. |
| **Compliance pack** | Append-only event store, deterministic replay, GDPR Art. 17 forget with signed certificates, HMAC request signing, RS256 user JWTs. |
| **Bring-your-own embedder** | Anything exposing `get_embedding(text) → np.ndarray` and `get_dimension() → int` works. SentenceTransformers, OpenAI, ONNX int8 — see the [embedders guide](https://semvec-docs.pages.dev/guides/embedders/). |
| **One wheel, all platforms** | Python 3.10–3.14 via stable ABI. Pre-built wheels for Linux glibc + Alpine musl (x86_64 + aarch64), macOS (x86_64 + arm64), Windows (x86_64). |

Per-release detail in the [changelog](https://semvec-docs.pages.dev/changelog/).

---

## Installation

```bash
# Core only
pip install semvec

# With multi-agent coordination
pip install "semvec[cortex]"

# With coding-agent compaction (FastMCP server, Claude Code hooks)
pip install "semvec[coding]"

# Compliance pack (event store, retention, GDPR forget, HMAC, RS256)
pip install "semvec[compliance]"
# When you also want the FastAPI compliance routes + middleware:
pip install "semvec[api,compliance]"

# REST API server
pip install "semvec[api]"
semvec serve --host 0.0.0.0 --port 8080

# Benchmark harness dependencies (SentenceTransformers)
pip install "semvec[benchmarks]"

# BM25-hybrid retrieval (LOCOMO +2.6 pp F1)
pip install "semvec[hybrid]"

# Optional Mem0 head-to-head baseline for benchmarks
pip install "semvec[mem0]"

# Developer tooling (ruff, mypy, pre-commit, pytest, httpx)
pip install "semvec[dev]"

# mkdocs-material for the documentation site
pip install "semvec[docs]"

# Everything the developers use
pip install "semvec[cortex,coding,api,compliance,hybrid,benchmarks,dev,docs]"
```

| Extra | Pulls in | When you need it |
|---|---|---|
| `[cortex]` | — (marker only) | multi-agent coordination is always available; the extra marks intent for future pip resolvers |
| `[coding]` | `fastmcp>=2.0` | MCP server + Claude Code hooks |
| `[compliance]` | `cryptography>=42` | Event store, retention sweeper, deletion-certificate signer, HMAC + RS256 signing. FastAPI routes need `[api]` on top. See the [Compliance guide](https://semvec-docs.pages.dev/guides/compliance/). |
| `[jwt]` | `pyjwt>=2.9` | Stand-alone licence-JWT decoding without the full `[api]` extra — handy for build pipelines or short scripts that only need to inspect a token. |
| `[api]` | `fastapi`, `uvicorn[standard]`, `sqlalchemy`, `prometheus-client`, `pydantic`, `msgspec` | REST API server (`semvec serve`) |
| `[benchmarks]` | `sentence-transformers` | running the LOCOMO bench runners under `benchmarks/` |
| `[hybrid]` | `bm25s>=0.2`, `nltk>=3.8` | BM25-hybrid retrieval — required to reproduce the LOCOMO +2.6 pp lift |
| `[mem0]` | `mem0ai>=0.1`, `faiss-cpu>=1.7` | head-to-head Mem0 comparison |
| `[dev]` | `ruff`, `mypy`, `pre-commit`, `pytest`, `httpx` | contributing — includes the FastAPI TestClient transport |
| `[docs]` | `mkdocs>=1.6`, `mkdocs-material>=9.5`, `pymdown-extensions` | building the documentation site (`mkdocs serve`) |

### Embedder requirement

Semvec is embedder-agnostic and refuses silent hash-based fallbacks — you bring your own. Any object exposing `get_embedding(text) → np.ndarray` and `get_dimension() → int` works.

```bash
pip install sentence-transformers
```

**Choose the embedder dimension carefully — Semvec's retrieval quality is bounded by what the embedder can separate.** Measured on 80 mixed-domain notes:

| Embedder | dimension | precision@3 | usable for |
|---|---|---|---|
| `all-MiniLM-L6-v2` | 384 | **66.67 %** | English-only, tight-domain prototypes only |
| `paraphrase-multilingual-mpnet-base-v2` | 768 | **86.11 %** | German / multilingual mixed-domain (recommended) |

The 384-dim MiniLM is the easy default, but on multilingual or domain-mixed text it confuses generic terms (e.g. "filter" → coffee filter vs. data filter). For German content, mixed-domain corpora, or anything where you need ≥ 80 % precision@3, treat the 768-d multilingual mpnet as the minimum.

```python
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer(
    "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
)
```
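
`SentenceTransformer` objects expose `encode()` rather than the `get_embedding` / `get_dimension` pair described above, so a thin adapter is one way to bridge the two. The following is a minimal sketch: the class name `SentenceTransformerEmbedder` is hypothetical and not part of Semvec, and `_StubModel` exists only so the example runs without downloading model weights.

```python
import numpy as np


class SentenceTransformerEmbedder:
    """Hypothetical adapter around any model that has an ``encode(text)`` method."""

    def __init__(self, model):
        self._model = model
        # Probe once so get_dimension() does not have to re-run the model.
        self._dimension = int(np.asarray(model.encode("dimension probe")).shape[-1])

    def get_embedding(self, text: str) -> np.ndarray:
        return np.asarray(self._model.encode(text), dtype=np.float32)

    def get_dimension(self) -> int:
        return self._dimension


class _StubModel:
    """Stand-in for SentenceTransformer; returns deterministic pseudo-vectors."""

    def encode(self, text: str) -> np.ndarray:
        rng = np.random.default_rng(len(text))
        return rng.standard_normal(768)


embedder = SentenceTransformerEmbedder(_StubModel())
vector = embedder.get_embedding("OData filter syntax")
```

In real use, pass the `SentenceTransformer(...)` instance from the snippet above instead of `_StubModel()`.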

---

## Choose your use case

| You want to… | Jump to |
|---|---|
| Compress conversation history for any LLM | [Token-reduced LLM context](#token-reduced-llm-context) |
| Drop-in replacement for `openai.chat.completions` | [Drop-in chat proxy](#drop-in-chat-proxy) |
| Coordinate many agents (analyst + planner + critic …) | [Multi-agent coordination](#multi-agent-coordination) |
| Give Claude Code / Cursor persistent memory across sessions | [Coding-agent compaction](#coding-agent-compaction) |
| Run as a service, talk to it over HTTP | [REST API server](#rest-api-server) |
| Process regulated data (GDPR, audit, retention) | [Compliance pack](https://semvec-docs.pages.dev/guides/compliance/) |

---

## Token-reduced LLM context

The single most-used path: produce a compact system-prompt block from any conversation, regardless of length.

```python
import openai

from semvec import SemvecState, SemvecConfig
from semvec.token_reduction import SemvecStateSerializer

state = SemvecState(config=SemvecConfig(dimension=768))

# `conversation` is your own iterable of (text, embedding) pairs.
for text, embedding in conversation:
    state.update(embedding, text)

serializer = SemvecStateSerializer()
context = serializer.serialize(state, query_text="what did we decide about auth?")

response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": context},
        {"role": "user",   "content": "what did we decide about auth?"},
    ],
)
```

Compared to raw history concatenation, the compressed context **does not grow with conversation length** — input cost converges to a constant. The serializer fits prior context into a 150–350-token block sized for a system prompt.
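
To see why the footprint stays constant, it helps to picture each turn being folded into a vector whose shape never changes. The comparison table describes ingest as a deterministic EMA (exponential moving average) but does not give the update rule, so the sketch below is an assumption for illustration only: the function `ema_update`, the smoothing factor `alpha`, and the normalisation step are all hypothetical.

```python
import numpy as np


def ema_update(state: np.ndarray, embedding: np.ndarray, alpha: float = 0.1) -> np.ndarray:
    """Blend one embedding into the state; output shape equals input shape."""
    blended = (1.0 - alpha) * state + alpha * embedding
    norm = np.linalg.norm(blended)
    # Keep the state on the unit sphere so its magnitude cannot drift.
    return blended / norm if norm > 0 else blended


rng = np.random.default_rng(0)
state = np.zeros(768)
for _ in range(10_000):  # ten thousand turns, same 768 floats at the end
    state = ema_update(state, rng.standard_normal(768))
```

However many turns are ingested, `state` holds the same number of floats, which is the property the serializer relies on to produce a bounded prompt block.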

The truncation budget is caller-controlled. Pass `SerializerConfig(max_memory_chars=N)` for any positive `N` (e.g. `10_000` to effectively disable per-memory truncation), and set `full_first=True` to keep the highest-ranked retrieved memory verbatim while the rest stay short:

```python
from semvec.token_reduction import SerializerConfig

cfg = SerializerConfig(top_k=5, max_memory_chars=200, full_first=True)
context = SemvecStateSerializer(cfg).serialize(state, query_text="...")
# Entry 1: full text. Entries 2..5: capped at 200 chars.
```

The same pattern is exposed on the REST API as `?max_text_chars=N&full_first=true` on `GET /v1/state/context`.

### Lift retrieval quality with anchors and triggers

The passive ingest above gives you retrieval that already beats sliding-window concatenation. To bias retrieval toward known domains or specific cues, register anchors and resonance triggers:

```python
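from semvec import SemvecState, SemvecConfig

# `embed(text)` stands for your embedder's text-to-vector call,
# e.g. `my_embedder.get_embedding`; `conversation` yields (text, vector) pairs.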

state = SemvecState(config=SemvecConfig(
    dimension=768,
    enable_topic_switch=True,
    auto_anchor_on_topic_switch=True,   # opt-in (default off)
))

# Anchors — bias retrieval toward your known domains.
for prototype in [
    "SAP Business One Service Layer OData REST API",
    "Python MCP Model Context Protocol Server",
    "italienische Kueche Kochen Pasta Pizza",
    "Kaffee Espresso Roesterei Brewing",
]:
    state.add_anchor(embed(prototype))

# Triggers — boost memories on a keyword OR vector match.
state.create_resonance_trigger(
    keyword="security review",
    embedding=embed("security audit threat model"),
    threshold=0.7,
)

for text, vec in conversation:
    state.update(vec, text)

# Retrieval is now anchor-biased: candidates aligned with one of
# your domain anchors win the tie-break against generic phrases.
top = state.memory.get_relevant_memories(embed("OData filter syntax"), top_k=3)
```

What each piece adds (measured on mpnet 768 d, 80 mixed German notes):

| Variant | precision@3 |
|---|---|
| passive `update()` only | 86.11 % |
| + 4 domain anchors | **91.67 % (+ 5.56 pp)** |
| + 4 resonance triggers | 86.11 % |
| anchors + triggers | 91.67 % |

Without anchors, the retrieval boost is a no-op — flipping these features on costs nothing if you do not need them. Anchors and triggers compete for the same boost slot (`max(...)`, not addition), so redundant signals do not double-count.

**Tuning rule of thumb:** keep `anchor_retrieval_boost ≥ trigger_retrieval_boost`, both in the `[0.1, 0.6]` range. Pushing either past `0.7` mostly stops moving the needle — spend your budget on better anchor prototypes or sharper trigger thresholds rather than dialling the boosts higher.
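
The "same boost slot" behaviour can be pictured as taking the larger of the two boosts instead of their sum. The scoring function below is a hypothetical sketch, not Semvec's ranking code; only the `max(...)` combination and the parameter names `anchor_retrieval_boost` / `trigger_retrieval_boost` come from the text above, and the default values are arbitrary picks from the suggested range.

```python
def boosted_score(
    base: float,
    anchor_similarity: float,
    trigger_hit: bool,
    anchor_retrieval_boost: float = 0.3,
    trigger_retrieval_boost: float = 0.2,
) -> float:
    """Hypothetical ranking score: base similarity plus ONE boost, never both."""
    anchor_term = anchor_retrieval_boost * max(anchor_similarity, 0.0)
    trigger_term = trigger_retrieval_boost if trigger_hit else 0.0
    # max(), not addition: a memory matching both signals is not double-counted.
    return base + max(anchor_term, trigger_term)


neither = boosted_score(0.5, anchor_similarity=0.0, trigger_hit=False)
anchor_only = boosted_score(0.5, anchor_similarity=0.8, trigger_hit=False)
both = boosted_score(0.5, anchor_similarity=0.8, trigger_hit=True)
```

Under this model `both` equals `anchor_only` whenever the anchor term dominates, which is consistent with the "anchors + triggers" row matching the anchors-only row in the table.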

---

## Drop-in chat proxy

`SemvecChatProxy` wraps any callable LLM behind compressed context and tracks both compressed and full-history token counts per turn:

```python
from semvec.token_reduction import SemvecChatProxy, create_llm_client

llm = create_llm_client("openai")  # reads OPENAI_BASE_URL/MODEL/API_KEY from env
proxy = SemvecChatProxy(
    llm_call=llm,
    system_prompt="You are a helpful assistant.",
    embedding_service=my_embedder,
)

for question in ["summarise Q3", "compare with Q2", "biggest miss?"]:
    result = proxy.chat(question)
    print(f"turn {result.turn_number}: {result.response}")
    print(f"  compressed tokens: {result.tokens.compressed}")
    print(f"  full-history tokens: {result.tokens.full_history}")

print(proxy.get_summary())
```

Built-in clients: `OpenAIClient` (works with the OpenAI API and any compatible endpoint such as vLLM, LiteLLM, OpenRouter) and `OllamaClient`. Any callable with the signature `(list[ChatMessage]) -> str` also works.

> **Break-even is around ten turns.** The compressed prompt carries a constant ~110-token header. For very short conversations (≤ 5 turns) plain history concatenation is cheaper; from ~10 turns onward the proxy undercuts naive concatenation, and the gap widens linearly with conversation length. Measured on the LOCOMO 10-conv suite: ~93 % fewer input tokens per turn vs a stateless full-history baseline.
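
A back-of-envelope check of the break-even note, assuming a compressed prompt of the ~110-token header plus a 300-token context block (inside the 150–350 range quoted earlier) and raw history that grows by 40 tokens per turn. The 300 and 40 figures are assumptions chosen for illustration, not measurements.

```python
import math

HEADER_TOKENS = 110    # constant header, from the note above
CONTEXT_TOKENS = 300   # assumed; the README quotes a 150-350 token block
TOKENS_PER_TURN = 40   # assumed average growth of raw history per turn


def compressed_cost(turn: int) -> int:
    """Input tokens with compressed context: independent of the turn number."""
    return HEADER_TOKENS + CONTEXT_TOKENS


def full_history_cost(turn: int) -> int:
    """Input tokens when the whole history is concatenated."""
    return turn * TOKENS_PER_TURN


# First turn at which full history costs at least as much as the compressed prompt.
break_even_turn = math.ceil((HEADER_TOKENS + CONTEXT_TOKENS) / TOKENS_PER_TURN)
```

With these assumptions the crossover lands at turn 11; shorter average turns push it later, longer ones pull it earlier.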

---

## Multi-agent coordination

Run several agents (analyst, planner, critic, …) that share an aggregated view, vote on proposals, and exchange checksummed state vectors.

```python
from semvec.cortex import SemvecAgentNetwork, AttentionAggregation

network = SemvecAgentNetwork(
    aggregation_strategy=AttentionAggregation(dimension=768),
    dimension=768,
)
network.add_local_instance("analyst")
network.add_local_instance("planner")

network.process_input("analyst", "quarterly revenue is up 23%")
network.process_input("planner", "we should redirect Q4 spend to retention")

state = network.get_network_state()
print(f"active agents: {state['active_instances']}/{state['total_instances']}")

# Pull per-agent feedback for the next turn (consensus-aware)
feedback = network.get_feedback_for_agent("analyst")
```

Aggregation strategies: `WeightedAverageAggregation`, `AttentionAggregation`. `ConsensusEngine` adds proposal voting with five levels (`SIMPLE_MAJORITY`, `QUALIFIED_MAJORITY`, `UNANIMOUS`, `WEIGHTED_VOTE`, `ADAPTIVE_THRESHOLD`); quorum is measured against the registered voter pool, not just votes-cast-so-far. `StateVectorPacket` round-trips bit-exactly via `serialize()`/`deserialize()` and `verify_integrity()` confirms byte equality.
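
The quorum rule matters most when only a few agents have voted. The sketch below illustrates a simple-majority check measured against the registered pool; `simple_majority_passes` is a hypothetical helper written for this example and is not the `ConsensusEngine` API.

```python
def simple_majority_passes(votes: dict[str, bool], registered_voters: set[str]) -> bool:
    """Pass only if 'yes' votes exceed half of ALL registered voters."""
    yes_votes = sum(
        1 for voter, approved in votes.items() if approved and voter in registered_voters
    )
    # Denominator is the registered pool, not len(votes).
    return yes_votes > len(registered_voters) / 2


pool = {"analyst", "planner", "critic", "reviewer", "auditor"}

# Two of two votes cast are "yes", but that is only 2 of 5 registered voters.
early_result = simple_majority_passes({"analyst": True, "planner": True}, pool)

# A third "yes" reaches 3 of 5.
later_result = simple_majority_passes(
    {"analyst": True, "planner": True, "critic": True}, pool
)
```

Counting against votes cast so far would have approved the proposal after the first two votes; counting against the pool defers the decision until a real majority exists.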

See the [Cortex API reference](https://semvec-docs.pages.dev/api/cortex/) for the full surface, the [Cortex overview](https://semvec-docs.pages.dev/guides/cortex/) for the in-process / service / REST decision tree, and [Cortex over REST API](https://semvec-docs.pages.dev/guides/cortex-rest/) for the cluster / region / observer / network endpoints with curl + httpx examples.

---

## Coding-agent compaction

Persistent memory across coding sessions for Claude Code, Cursor, Aider — code pointers, anti-resonance error patterns, structured handoff context.

→ Full integration guides: **[Claude Code](https://semvec-docs.pages.dev/guides/claude-code/)** (MCP + automatic `SessionStart` / `PreCompact` hooks) · **[Cursor](https://semvec-docs.pages.dev/guides/cursor/)** (MCP + project rule). The high-level [Coding overview](https://semvec-docs.pages.dev/guides/coding/) lays out the three usage paths (MCP, in-process API, REST API) and when to pick which.

```python
from semvec.coding import CodingEngine

engine = CodingEngine(state_dir="~/.semvec/project-x", embedder=my_embedder)
engine.ingest_transcript("path/to/claude_code_session.jsonl")

context = engine.get_compacted_context(
    "implement password reset flow",
    invariants=["never log plaintext passwords"],
)
```

### Multi-session memory via `LiteralCache`

Below the high-level `CodingEngine`, `state.literal_cache` is a structured memory of design decisions, error patterns, invariants, and per-checkpoint test results — anything you want to survive across sessions verbatim:

```python
import semvec

state = semvec.SemvecState(semvec.SemvecConfig(dimension=768))
cache = state.literal_cache

cache.record_decision("Use mpnet 768d for German content", checkpoint=1)
cache.record_error_pattern(
    pattern="catastrophic recency bias on blocked-domain ingest",
    example="500-note 4-domain blocked sequence",
    fix="raise long_term_size and use tier weights 1.0/0.95/0.9",
    checkpoint=1,
)
cache.add_invariant("State must round-trip via to_dict/from_dict")
cache.record_test_results(
    checkpoint=1,
    passed_tests=["test_a", "test_b", "test_c"],
    failed_tests=[],
)

# Build the LLM hand-off context for the next session
ctx = cache.build_handoff_context(next_checkpoint=2)
# ### INVARIANTS — Do NOT break these:
# - State must round-trip via to_dict/from_dict
#
# ### Test Status (CP1: 100%, 3/3)
#
# ### Known Error Patterns
# - `catastrophic recency bias on blocked-domain ingest` (x1): raise long_term_size...
#
# ### Design Decisions
# - [CP1] Use mpnet 768d for German content

# Persist + restore — round-trip preserves decisions, error_patterns,
# invariants, test_history, code_structures.
blob = state.to_bytes()
restored = semvec.SemvecState.from_bytes(blob)
assert restored.literal_cache.build_handoff_context(2) == ctx
```

`build_handoff_context()` produces a Markdown block ready for the system prompt of the next session. See the [Coding API reference](https://semvec-docs.pages.dev/api/coding/) for the full surface.

### Claude Code integration (MCP + hooks)

Wire it directly into Claude Code via the bundled FastMCP server and two lifecycle hooks. The settings below give you the bare minimum; for the full walk-through (what each hook does, `CLAUDE.md` setup, troubleshooting, end-to-end example session) see the [Claude Code guide](https://semvec-docs.pages.dev/guides/claude-code/).

Add to `.claude/settings.json`:

```json
{
  "mcpServers": {
    "semvec": {
      "command": "python",
      "args": ["-m", "semvec.coding.mcp_server"],
      "env": {
        "SEMVEC_STATE_DIR": ".semvec",
        "SEMVEC_EMBED_MODEL": "all-MiniLM-L6-v2"
      }
    }
  },
  "hooks": {
    "PreCompact":  [{"command": "python -m semvec.coding.hooks.pre_compact",  "timeout": 30000}],
    "SessionStart":[{"command": "python -m semvec.coding.hooks.session_start", "timeout": 10000}]
  }
}
```

The MCP server exposes six tools — `pss_get_context`, `pss_update`, `pss_check_anti_resonance`, `pss_register_code`, `pss_record_error`, `pss_save`. FastMCP is installed automatically via the `[coding]` extra.

The same FastMCP server plugs into **Cursor** via `.cursor/mcp.json` plus a Cursor Rule that replaces Claude Code's lifecycle hooks. Full step-by-step in the [Cursor guide](https://semvec-docs.pages.dev/guides/cursor/).

For multi-tenant server-side use (literal-cache endpoints over HTTP, JWT-gated, Postgres-backed metadata), see the [REST API reference](https://semvec-docs.pages.dev/api/rest/).

---

## REST API server

```bash
pip install "semvec[api]"

# Dev mode — anonymous community-tier auth, in-memory SQLite
SEMVEC_ALLOW_ANONYMOUS=1 semvec serve --host 0.0.0.0 --port 8080

# Production — license JWT required, Postgres-backed metadata
export SEMVEC_LICENSE_KEY="eyJhbGciOiJFZERTQSI..."
export DATABASE_URL="postgresql://user:pw@host/semvec"
semvec serve --host 0.0.0.0 --port 8080
```

Talk HTTP:

```bash
# Health check (no auth)
curl http://localhost:8080/v1/health

# Single turn
curl -X POST http://localhost:8080/v1/run \
  -H "Authorization: Bearer $SEMVEC_LICENSE_KEY" \
  -H "Content-Type: application/json" \
  -d '{"session_id": "demo", "query": "what was the Q3 miss?"}'

# Retrieve compressed context
curl "http://localhost:8080/v1/state/context?session_id=demo&top_k=5" \
  -H "Authorization: Bearer $SEMVEC_LICENSE_KEY"
```
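
The same two calls can be made from Python. This sketch uses only the standard library and builds the requests without sending them, so it runs without a server; the helper names are hypothetical, and the response schema is not shown here because this README does not document it.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8080"


def build_run_request(token: str, session_id: str, query: str) -> urllib.request.Request:
    """POST /v1/run with the same JSON body as the curl example."""
    body = json.dumps({"session_id": session_id, "query": query}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/run",
        data=body,
        method="POST",
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
    )


def build_context_request(token: str, session_id: str, top_k: int = 5) -> urllib.request.Request:
    """GET /v1/state/context with query parameters URL-encoded."""
    params = urllib.parse.urlencode({"session_id": session_id, "top_k": top_k})
    return urllib.request.Request(
        f"{BASE_URL}/v1/state/context?{params}",
        headers={"Authorization": f"Bearer {token}"},
    )


run_req = build_run_request("<jwt>", "demo", "what was the Q3 miss?")
ctx_req = build_context_request("<jwt>", "demo")
# To send against a running `semvec serve`: urllib.request.urlopen(run_req)
```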

Endpoint groups: **sessions** (CRUD + run/store/context), **session-control** (resonance triggers, anchors, isolation, export/import/verify), **clusters**, **regions** (consensus-driven realignment), **global observer** (anomaly detection across regions), **network** (state transfer, user partitioning, trust-based consensus), **literal cache**, **Prometheus `/metrics`**.

Auth is via `Authorization: Bearer <jwt>` or `X-API-Key: <jwt>` — same Ed25519-signed JWT as the in-process licensing system.

See the [REST API reference](https://semvec-docs.pages.dev/api/rest/) for every endpoint and the [CLI reference](https://semvec-docs.pages.dev/api/cli/) for `semvec serve` flags.

---

## Persistence

`state.to_dict()` is a JSON-safe checkpoint with embedded SHA-256 checksum — best when the snapshot has to round-trip through systems that only speak JSON.

`state.to_bytes(compress=True)` is the compact binary equivalent (gzip-compressed JSON, magic header, SHA-256 corruption check) — best for cold-storage checkpoints. `state.to_bytes(compress=False)` is the speed-optimised variant: same byte footprint as JSON, but kept as a self-describing binary blob with corruption check — best for hot-path persistence. Both paths preserve the full state on round-trip:

- the semantic state and its rolling histories
- all three memory tiers
- domain anchors and topic-switch history
- the **complete `LiteralCache`**: entities, decisions, error patterns, invariants, test history, code structures

Restore with `SemvecState.from_bytes(blob)`; the version byte distinguishes the two `to_bytes` modes automatically.
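
For readers who want a mental model of such a blob, the sketch below frames a JSON payload with a magic header, a version byte that records whether gzip was applied, and a SHA-256 digest. The byte layout, the `DEMO` magic value, and the version numbers are assumptions made for this example; Semvec's actual on-disk format is not specified in this README.

```python
import gzip
import hashlib
import json

MAGIC = b"DEMO"                    # hypothetical 4-byte magic header
VERSION_RAW, VERSION_GZIP = 1, 2   # hypothetical version bytes


def to_bytes(state: dict, compress: bool = True) -> bytes:
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    if compress:
        payload = gzip.compress(payload)
    version = VERSION_GZIP if compress else VERSION_RAW
    digest = hashlib.sha256(payload).digest()  # 32 bytes over the stored payload
    return MAGIC + bytes([version]) + digest + payload


def from_bytes(blob: bytes) -> dict:
    if blob[:4] != MAGIC:
        raise ValueError("bad magic header")
    version, digest, payload = blob[4], blob[5:37], blob[37:]
    if hashlib.sha256(payload).digest() != digest:
        raise ValueError("checksum mismatch: blob is corrupted")
    if version == VERSION_GZIP:
        payload = gzip.decompress(payload)
    elif version != VERSION_RAW:
        raise ValueError(f"unknown version byte {version}")
    return json.loads(payload)


snapshot = {"dimension": 768, "memories": ["auth uses RS256"] * 50}
restored_compressed = from_bytes(to_bytes(snapshot, compress=True))
restored_raw = from_bytes(to_bytes(snapshot, compress=False))
```

Because the version byte travels with the blob, the reader never needs to be told which mode wrote it.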

Practical sizing on mpnet 768 d:

| Memories | JSON (time / size) | `to_bytes(compress=True)` (time / size) | `to_bytes(compress=False)` (time / size) |
|---|---|---|---|
| 110 (small) | 18 ms / 8.8 kB / memory | 157 ms / 3.7 kB / memory | 36 ms / 8.8 kB / memory |
| 1 000 (extrapolated) | ~ 0.2 s / 9 MB | ~ 1.4 s / 3.7 MB | ~ 0.3 s / 9 MB |
| 100 000 | ~ 17 s / 0.9 GB | ~ 2.5 min / 400 MB | ~ 30 s / 0.9 GB |

Pick the variant by use case:
- **Cold-storage checkpoint** (occasional, durability matters) → `compress=True`. ~ 2.4× smaller than JSON; pay the gzip cost once.
- **Hot-path persistence** (every-turn or per-request) → `compress=False`. Same size as JSON, only ~ 1.9× slower than `json.dumps`, but kept as a self-describing binary blob with corruption check.

For very large footprints (> 100 k memories) wrap your own NPZ/Parquet around the embedding payload to save another factor.

---

## Configuration & environment variables

| Variable | Default | Used by |
|---|---|---|
| `SEMVEC_LICENSE_KEY` | — | Pro/Enterprise gates; REST API auth |
| `SEMVEC_ALLOW_ANONYMOUS` | unset | REST API: bypass auth (dev only) |
| `SEMVEC_STATE_DIR` | `.semvec` | `CodingEngine` state persistence |
| `SEMVEC_EMBED_MODEL` | `all-MiniLM-L6-v2` | MCP server / hooks default embedder (consider overriding to `paraphrase-multilingual-mpnet-base-v2` for German/multilingual) |
| `SEMVEC_EMBED_DEVICE` | `cpu` | MCP server / hooks: `cpu` or `cuda` |
| `DATABASE_URL` | `sqlite:///semvec.db` | REST API persistence (also accepts `postgresql://…`) |
| `METRICS_USER` / `METRICS_PASSWORD` | — | Basic Auth on Prometheus `/metrics` |
| `OPENAI_BASE_URL`, `OPENAI_API_KEY`, `OPENAI_MODEL` | — | `OpenAIClient` |
| `OLLAMA_BASE_URL`, `OLLAMA_MODEL` | `http://localhost:11434`, — | `OllamaClient` |

---

## Error handling

```python
import logging
import time

from semvec import RateLimitError, LicenseExpiredError, ConfigurationError

logger = logging.getLogger(__name__)

try:
    result = state.update(embedding, text)
except RateLimitError as e:
    # e.retry_after is a datetime.timedelta; e.upgrade_url is set
    time.sleep(e.retry_after.total_seconds())
    result = state.update(embedding, text)
except LicenseExpiredError as e:
    # Hard fail — re-import won't help. Renew at e.upgrade_url.
    logger.error("semvec license expired — renew at %s", e.upgrade_url)
    raise
except ConfigurationError as e:
    # Wrong dimension, missing embedder, malformed config, etc.
    raise
```

All Semvec exceptions inherit from `SemvecError`. The license-related exceptions `RateLimitError` and `LicenseExpiredError` inherit from `LicenseError`, which in turn inherits from `SemvecError`.

---

## Licensing

Three tiers; Community works without a key, Pro and Enterprise require a signed Ed25519 JWT:

| Tier | Rate limit | Retrieval modes |
|---|---|---|
| **Community** (no key) | 5 QPS sustained / 50 burst | Base retrieval |
| **Pro** | 200 QPS sustained / 2000 burst | Extended |
| **Enterprise** | Unthrottled | All |

JWTs have a 30-day TTL. Expiry is a hard fail — the next gated call raises `LicenseExpiredError` with the renewal URL in the message. Rate-limit exhaustion raises `RateLimitError` whose message names the tier and the retry-after delay.

The limiter is a token bucket per `SemvecState`. Both `update()` and the three `calculate_*` methods draw from the same bucket — the QPS budget is the combined operations-per-second on that state. Burst capacity gives every legitimate dev workload (conversational chat, MCP servers, smoke-tests, small `pytest` suites) plenty of headroom; sustained heavy load above the Community 5 QPS belongs in Pro. For background on the bucket plus the secondary probe-defence sliding window, see the [Licensing guide](https://semvec-docs.pages.dev/guides/licensing/).
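
A generic token bucket with the Community numbers (5 tokens per second, capacity 50) shows how burst and sustained limits interact. This is a textbook sketch, not Semvec's limiter: the `TokenBucket` class and its injectable `clock` are hypothetical, and the probe-defence sliding window is not modelled.

```python
import time


class TokenBucket:
    """Generic token bucket: refills at `rate` tokens/s up to `burst` tokens."""

    def __init__(self, rate: float, burst: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = float(burst)
        self.tokens = float(burst)   # start full, so an initial burst is allowed
        self._clock = clock
        self._last = clock()

    def try_acquire(self) -> bool:
        now = self._clock()
        # Refill for the elapsed time, capped at the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self._last) * self.rate)
        self._last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


fake_now = [0.0]  # controllable clock so the example is deterministic
bucket = TokenBucket(rate=5, burst=50, clock=lambda: fake_now[0])

burst_allowed = sum(bucket.try_acquire() for _ in range(60))     # 60 calls at t=0
fake_now[0] += 1.0                                               # one second later
refilled_allowed = sum(bucket.try_acquire() for _ in range(60))  # 60 more calls
```

In this model the first 50 of 60 back-to-back calls succeed, and after one second of idle time exactly 5 more do, which is the difference between the burst and sustained figures in the tier table.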

```bash
export SEMVEC_LICENSE_KEY="eyJhbGciOiJFZERTQSI..."
```

---

## Limitations & non-goals

Honest list of what Semvec **does not** do:

- **Not a vector database.** Long-term memory is bounded; if you need recall over a million documents, run a dedicated vector store and treat Semvec as a conversational compressor on top.
- **Not a drop-in for stateless completion.** The whole point is persistent state; if you only do single-shot prompts, you do not need Semvec.
- **No silent embedder fallback.** If you do not pass an embedder, methods that need one raise a descriptive `RuntimeError`. Intentional — silent hash fallbacks gave surprising failure modes in earlier iterations.
- **License gate is a licensing feature, not a hard security boundary.** Use it to enforce subscription tiers, not to keep determined adversaries out.
- **No mobile / WASM build today.** `abi3-py310` Linux/macOS/Windows only.
- **REST API persistence is metadata-only.** Hot semantic state lives in-memory per process; only session/cluster/member/region/audit metadata is persisted. Plan accordingly for restarts.

---

## FAQ

**Is this RAG?** Not in the usual sense. RAG retrieves documents at query time. Semvec compresses *the conversation itself* into a fixed-size state. They compose well — many users run Semvec for conversational signal + a vector DB for document retrieval.

**Does the state ever grow?** No, the state vector itself is fixed-size. The associated memory tiers are bounded by configured capacities — when full, the lowest-scoring entry is evicted (not the oldest).
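
The difference between score-based and age-based eviction can be sketched with a bounded min-heap. This is a generic illustration, not Semvec's memory tier: the class is hypothetical, and `score` is a placeholder because the README does not describe how entries are scored.

```python
import heapq
import itertools


class BoundedScoredMemory:
    """Keeps at most `capacity` entries; evicts the lowest score when full."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be >= 1")
        self.capacity = capacity
        self._heap = []                # min-heap of (score, seq, item)
        self._seq = itertools.count()  # tie-breaker so items are never compared

    def add(self, item, score):
        """Insert item; return the evicted item, or None if nothing was evicted."""
        entry = (score, next(self._seq), item)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
            return None
        # heappushpop pushes, then pops the smallest entry, so a new item that
        # scores below everything already stored is itself the one dropped.
        return heapq.heappushpop(self._heap, entry)[2]

    def items(self):
        """Stored items, highest score first."""
        return [item for _, _, item in sorted(self._heap, reverse=True)]
```

In this sketch an old entry with a high score outlives a newer entry with a low one, which is the behaviour the answer above describes. Breaking ties by insertion order is an arbitrary choice made here for determinism.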

**Can I run it offline / air-gapped?** Yes for Community tier. Pro/Enterprise tiers verify Ed25519 JWT signatures locally — no network call to a license server at runtime. Contact `vertrieb@versino.de` for offline-issued JWTs with custom TTLs.

**How fast is it?** Per-turn `update()` is sub-millisecond on a recent x86_64 CPU at dimension 384, dominated by NumPy/Rust matrix ops, not Python overhead. The whole point of the Rust port was to keep the math out of the GIL.

**Is the source available?** Compiled wheels are public on PyPI; the Rust source is closed. Source access is available under Enterprise terms; contact `vertrieb@versino.de`.

**GPU support?** Embedders run on whatever device you configure (`cuda`, `mps`, `cpu`); the Semvec core itself is CPU-only — the math is small enough that GPU offload would lose more in transfer than it gains.

---

## Telemetry

**None.** Semvec does not phone home. There is no init ping, no per-call event, no usage tracking, no machine pseudonym, no diversity sketch. License-JWT verification, inference, state updates, and retrieval all run locally — the package only contacts the network when you explicitly call something that does (the optional REST API server, the OpenAI / Ollama clients, your own embedder).

If you need install counts, [PyPI download statistics](https://pypistats.org/packages/semvec) (`pypistats overall semvec`) give you that without any client-side telemetry.

Earlier 0.x releases (≤ 0.5.2) shipped an opt-out anonymous init ping and a HyperLogLog "diversity sketch" intended to detect surrogate-cloning attempts. Both were removed in **0.5.3**: the trade-off was wrong, the GDPR Art. 6(1)(f) "patent-enforcement" basis was untenable, and the architecture matched the pattern of commercial spyware regardless of intent. If you're still on ≤ 0.5.2, upgrading to 0.5.3 or later removes the ping; you can also delete `~/.semvec/telemetry-salt` (it is no longer used).

---

## Support

- **Documentation**: https://semvec-docs.pages.dev
- **Pricing & licensing**: https://www.semvec.io
- **Pro / Enterprise support**: `support@versino.de` (priority response)
- **Security disclosures**: `security@versino.de` — please do not open public issues for vulnerabilities; coordinated disclosure with 48 h acknowledgement, fix-or-mitigation in 30 days for high-severity issues

## License & patents

Proprietary — all rights reserved. Commercial use requires a Pro or
Enterprise license. The full license text ships inside the wheel as
`LICENSE`; for procurement, see https://www.semvec.io.

**Patent applications pending:** U.S. non-provisional Nos. 19/269,195,
19/550,466; European EP 25 188 105, EP 26 160 795.

Per-application detail:

- `U.S. 19/269,195` — USPTO non-provisional; filed; pending.
- `U.S. 19/550,466` — USPTO non-provisional; filed; pending.
- `EP 25 188 105` — European Patent Office (filed 2025; pending).
- `EP 26 160 795` — European Patent Office (pending).

Until grant, references to "patent-protected" features describe claims
of pending applications, not enforceable exclusive rights.

Copyright © 2026 Michael Neuberger · Versino PsiOmega GmbH.

