Metadata-Version: 2.4
Name: ak-primus
Version: 0.2.1
Summary: Intelligent token compression + routing MCP server
Author: AK-Primus Contributors
License: MIT
Project-URL: Homepage, https://github.com/ak-primus/ak-primus
Project-URL: Repository, https://github.com/ak-primus/ak-primus
Project-URL: Issues, https://github.com/ak-primus/ak-primus/issues
Keywords: mcp,llm,compression,token,optimization,context-window
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: mcp>=1.0.0
Requires-Dist: tiktoken>=0.7.0
Requires-Dist: httpx>=0.27.0
Requires-Dist: pydantic>=2.7.0
Requires-Dist: anyio>=4.4.0
Requires-Dist: structlog>=24.0.0
Provides-Extra: compress
Requires-Dist: llmlingua>=0.2.0; extra == "compress"
Requires-Dist: transformers>=4.41.0; extra == "compress"
Requires-Dist: torch>=2.2.0; extra == "compress"
Provides-Extra: retrieval
Requires-Dist: sentence-transformers>=3.0.0; extra == "retrieval"
Requires-Dist: chromadb>=0.5.0; extra == "retrieval"
Requires-Dist: numpy>=1.26.0; extra == "retrieval"
Requires-Dist: scikit-learn>=1.5.0; extra == "retrieval"
Requires-Dist: openai>=1.30.0; extra == "retrieval"
Requires-Dist: anthropic>=0.29.0; extra == "retrieval"
Provides-Extra: optimize
Requires-Dist: dspy-ai>=2.4.0; extra == "optimize"
Requires-Dist: openai>=1.30.0; extra == "optimize"
Requires-Dist: evaluate>=0.4.0; extra == "optimize"
Provides-Extra: http
Requires-Dist: uvicorn[standard]>=0.29.0; extra == "http"
Requires-Dist: starlette>=0.37.0; extra == "http"
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.23.0; extra == "dev"
Requires-Dist: pytest-cov>=5.0.0; extra == "dev"
Requires-Dist: ruff>=0.4.0; extra == "dev"
Requires-Dist: mypy>=1.10.0; extra == "dev"
Provides-Extra: all
Requires-Dist: ak-primus[compress,http,optimize,retrieval]; extra == "all"
Dynamic: license-file

# AK-Primus v0.2.0

> Intelligent token compression + routing MCP server with adaptive self-healing.

Every LLM request passes through a 7-layer pipeline selected by an 8-class hybrid classifier. The right compression and retrieval stack runs automatically for each request type, and nothing is applied blindly. Cache hits and the adaptive compression profile lower the per-request token cost as usage accumulates.

```
Request → Classifier → Router → [Cache | Memory | Compress | Search | Prompt-Opt] → LLM
                                                                                       ↓
                                                        Quality Score ← Response ←────┘
                                                              ↓
                                                    Adaptive Profile Update
```

---

## What's new in 0.2.0

| Area | Change |
|---|---|
| Compression | Expansion guard — never inflates token count; LLMLingua-1 uses GPT-2 (was 7B Llama) |
| Classifier | ML hybrid (400-example logistic regression + rule fast-path); ~88% overall accuracy on 275-scenario suite |
| Cache | 3-level lookup: SHA-256 exact → ChromaDB HNSW ANN → SQLite cosine fallback |
| Memory | L1 working memory + L2 episodic (session consolidation) + L3 semantic (cross-session ChromaDB) |
| DSPy | 4 typed Signatures wired end-to-end; BootstrapFewShot + MIPROv2 with disk cache |
| Quality | ROUGE-L + BERTScore + LLM-as-judge blended score; `score_async()` variant |
| Transport | `AK_PRIMUS_TRANSPORT=http` — Starlette ASGI with `/health`, `/ready`, `/metrics`, SSE |
| Testing | 84 pytest tests + 275-scenario benchmark suite with per-class accuracy CI gate |
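
The expansion guard in the Compression row can be pictured as a wrapper that keeps the original text whenever a compressor fails to shrink it. The following is a minimal sketch, not the project's implementation: `count_tokens` and `guarded_compress` are hypothetical names, and the whitespace-based counter stands in for real tiktoken counting.

```python
from typing import Callable


def count_tokens(text: str) -> int:
    """Stand-in token counter (whitespace split); an assumption, not tiktoken."""
    return len(text.split())


def guarded_compress(text: str, compress: Callable[[str], str]) -> str:
    """Return the compressed text only if it has strictly fewer tokens."""
    candidate = compress(text)
    if count_tokens(candidate) < count_tokens(text):
        return candidate
    return text  # compression did not help, so keep the original


def keep_first_three(text: str) -> str:
    """Toy compressor: keep only the first three words."""
    return " ".join(text.split()[:3])


def pad(text: str) -> str:
    """Toy 'compressor' that inflates its input."""
    return text + " extra padding words"


print(guarded_compress("one two three four five", keep_first_three))
print(guarded_compress("one two", pad))
```

The second call returns its input unchanged because the toy compressor made it longer.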

---

## Architecture

```
ak-primus/
├── core/
│   └── ak_primus/
│       ├── classifier.py          # 8-class hybrid (rule + ML logistic regression)
│       ├── router.py              # Stack selection per request type
│       ├── server.py              # MCP server — 7 tools, stdio + HTTP/SSE
│       ├── layers/
│       │   ├── cache.py           # 3-level semantic cache (exact / HNSW / cosine)
│       │   ├── compression.py     # LLMLingua-2, LLMLingua-1, LongLLMLingua, SelectiveContext
│       │   ├── memory.py          # L1 working | L2 episodic | L3 semantic (ChromaDB)
│       │   ├── metrics.py         # tiktoken real token counting + cost accounting
│       │   ├── prompt_opt.py      # DSPy BootstrapFewShot + MIPROv2, OPRO, Medprompt
│       │   ├── quality.py         # ROUGE-L + BERTScore + LLM-as-judge + adaptive profile
│       │   └── search.py          # HyDE, RAPTOR, FLARE, ColBERT retrieval
│       ├── ml/
│       │   └── classifier_ml.py   # Embedding-based logistic regression (400 training examples)
│       └── storage/
│           ├── session_store.py   # SQLite WAL — sessions, cache, profiles, memory_facts
│           └── vector_store.py    # ChromaDB HNSW — semantic_cache + memory_facts
├── extension/
│   └── src/
│       ├── extension.ts           # VS Code extension entry point
│       ├── dashboard.ts           # Real-time metrics webview
│       └── mcp-client.ts          # MCP stdio bridge
├── tests/
│   ├── unit/                      # 70 unit tests (all layers)
│   └── integration/               # 14 integration + benchmark threshold tests
└── benchmarks/
    └── run_1200_scenarios.py      # 275-scenario accuracy + latency suite
```

---

## 7 MCP Tools

| Tool | Purpose |
|---|---|
| `classify_request` | Detect request type + return recommended stack with expected token reduction |
| `compress_history` | Apply compression stack to message history; returns compressed messages + savings |
| `build_context` | HyDE / RAPTOR / FLARE retrieval-augmented context building |
| `optimize_prompt` | DSPy BootstrapFewShot / OPRO / Medprompt prompt optimisation |
| `get_token_report` | Real tiktoken metrics, cost savings, session stats |
| `process_request` | **Master pipeline**: classify → cache → memory → compress → quality → adapt |
| `report_quality` | Feed quality signal (0–1) back into adaptive compression profile |

### `process_request` — master pipeline

```json
{
  "optimized_messages": [...],
  "tokens_before": 1240,
  "tokens_after": 487,
  "tokens_saved": 753,
  "savings_pct": 60.7,
  "request_type": "code",
  "confidence": 0.91,
  "quality_score": 0.876,
  "adapted_ratio": 0.382,
  "cache_hit": false,
  "session_id": "sess-abc123",
  "lifetime_savings": 41250
}
```
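
The savings fields in the sample response are consistent with simple arithmetic on the two token counts. The helper below is a sketch of that relationship as inferred from the sample values; `savings` is a hypothetical name, not a function the package is known to expose.

```python
def savings(tokens_before: int, tokens_after: int) -> dict[str, float]:
    """Derive tokens_saved and savings_pct from before/after token counts."""
    saved = tokens_before - tokens_after
    pct = round(100.0 * saved / tokens_before, 1) if tokens_before else 0.0
    return {"tokens_saved": saved, "savings_pct": pct}


print(savings(1240, 487))
```

With the sample counts this reproduces `tokens_saved` of 753 and `savings_pct` of 60.7.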

---

## 8 Request Types

| Type | Classifier Trigger | Default Stack |
|---|---|---|
| `code` | Code keywords + task verbs + C++/C# detection | SelectiveContext → PrefixCache |
| `rag_doc` | Documents present + QA-style question | HyDE retrieval + LLMLingua-2 |
| `agent_session` | Multi-turn history + follow-up phrases | WorkingMemory + L3 semantic |
| `domain_expert` | Legal / medical / finance domain terms | Medprompt + SelectiveContext |
| `multi_hop` | "relationship", "compare", "trace" patterns | RAPTOR + ChainOfThought |
| `math` | Equations, proof, calculate keywords | LLMLingua-1 (formula-aware) |
| `fixed_template` | Long system prompt (>600 tokens) | PrefixCache (no compression) |
| `simple_qa` | Short conversational question | Light SelectiveContext |

---

## Installation

```bash
# Minimal (core MCP server, no ML)
pip install ak-primus

# Full (all optional groups except dev)
pip install "ak-primus[all]"

# From source
git clone https://github.com/ak-primus/ak-primus
cd ak-primus
pip install -e "core[all]"
```

| Group | Installs | When to use |
|---|---|---|
| `compress` | LLMLingua, transformers, torch | Token compression |
| `retrieval` | sentence-transformers, chromadb, numpy, scikit-learn, openai, anthropic | Semantic cache + search |
| `optimize` | dspy-ai, openai, evaluate | DSPy + ROUGE/BERTScore |
| `http` | uvicorn, starlette | HTTP/SSE transport |
| `dev` | pytest, ruff, mypy | Development |
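
Because most functionality sits behind optional groups, it can help to check at runtime which groups are usable. The sketch below probes for the import names of each group's packages using only the standard library; `EXTRA_MODULES` and `available_extras` are hypothetical helpers written for this example, not part of the package.

```python
from importlib.util import find_spec

# Import names corresponding to the distributions in each optional group.
EXTRA_MODULES = {
    "compress": ["llmlingua", "transformers", "torch"],
    "retrieval": [
        "sentence_transformers", "chromadb", "numpy",
        "sklearn", "openai", "anthropic",
    ],
    "optimize": ["dspy", "openai", "evaluate"],
    "http": ["uvicorn", "starlette"],
}


def is_importable(module: str) -> bool:
    """True if the top-level module can be located without importing it."""
    try:
        return find_spec(module) is not None
    except (ImportError, ValueError):
        return False


def available_extras() -> dict[str, bool]:
    """Map each optional group to whether all of its modules are installed."""
    return {
        extra: all(is_importable(m) for m in modules)
        for extra, modules in EXTRA_MODULES.items()
    }


print(available_extras())
```

Using `find_spec` avoids importing heavy packages such as torch just to test for their presence.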

---

## Quick Start

### Claude Desktop

```json
{
  "mcpServers": {
    "ak-primus": {
      "command": "ak-primus",
      "args": ["serve"],
      "env": {
        "ANTHROPIC_API_KEY": "sk-ant-..."
      }
    }
  }
}
```

### Docker

```bash
# stdio (MCP)
docker build -f Dockerfile.akprimus --target runtime -t ak-primus:0.2.0 .
docker run -i --rm -e ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" -v ak_data:/data ak-primus:0.2.0

# HTTP/SSE with health probes
docker compose --profile http up
curl http://localhost:8080/health
```

---

## Self-Healing Compression

The adaptive profile tunes compression ratios over time per (workspace, request_type):

```
quality ≥ 0.85  → compress more aggressively next time
quality < 0.70  → back off compression
```

After roughly 50 samples per request type, the ratio settles near the most aggressive compression that still keeps quality above the back-off threshold.

```bash
# Feed quality signal back after reviewing LLM response
curl -s http://localhost:8080/mcp -H 'Content-Type: application/json' -d '{"tool":"report_quality","quality_score":0.92,"request_type":"code"}'
```
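
The two thresholds above can be read as a small feedback rule. The sketch below is one possible interpretation, not the project's code: it assumes the ratio is the fraction of tokens kept (so a lower value means more compression), leaves the ratio unchanged for scores between 0.70 and 0.85, and uses a step size and bounds chosen purely for illustration.

```python
def update_ratio(
    ratio: float,
    quality: float,
    step: float = 0.05,   # assumed adjustment per feedback sample
    lower: float = 0.2,   # assumed floor on the kept fraction
    upper: float = 1.0,   # 1.0 means no compression
) -> float:
    """Nudge the kept-token ratio based on one quality score in [0, 1]."""
    if quality >= 0.85:
        ratio -= step      # quality is high: compress more aggressively
    elif quality < 0.70:
        ratio += step      # quality is low: back off compression
    return min(upper, max(lower, ratio))


ratio = 0.5
for score in (0.92, 0.90, 0.65, 0.80):
    ratio = update_ratio(ratio, score)
    print(f"quality={score:.2f} -> ratio={ratio:.2f}")
```

Clamping keeps a run of very high or very low scores from pushing the ratio outside a usable range.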

---

## Running Tests

```bash
pytest tests/ -q                              # 84 tests
python benchmarks/run_1200_scenarios.py --quick   # accuracy + latency
```

### Benchmark (v0.2.0, 275 scenarios)

| Class | Accuracy |
|---|---|
| `math` | 96% |
| `fixed_template` | 100% |
| `agent_session` | 92% |
| `domain_expert` | 90% |
| `multi_hop` | 90% |
| `simple_qa` | 88% |
| `code` | 82% |
| `rag_doc` | 80% |
| **Overall** | **~88%** |

Classifier latency: p50 < 1ms, p95 < 5ms (rule-based path, no model load).
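
A per-class accuracy gate over the table above could look like the following sketch. The accuracy values are copied from the table; the function name `failing_classes` and the floor values are assumptions, since the thresholds used in CI are not stated here.

```python
# Per-class accuracy from the benchmark table, as fractions.
ACCURACY = {
    "math": 0.96,
    "fixed_template": 1.00,
    "agent_session": 0.92,
    "domain_expert": 0.90,
    "multi_hop": 0.90,
    "simple_qa": 0.88,
    "code": 0.82,
    "rag_doc": 0.80,
}


def failing_classes(accuracy: dict[str, float], floor: float) -> list[str]:
    """Return the classes whose accuracy falls below the given floor."""
    return sorted(name for name, acc in accuracy.items() if acc < floor)


print(failing_classes(ACCURACY, floor=0.80))
print(failing_classes(ACCURACY, floor=0.85))
```

A CI job would fail the build whenever the returned list is non-empty.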

---

## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `AK_PRIMUS_MODEL` | `claude-sonnet-4-6` | LLM model for DSPy / OPRO / judge |
| `AK_PRIMUS_TRANSPORT` | `stdio` | `stdio` or `http` |
| `AK_PRIMUS_HOST` | `127.0.0.1` | HTTP bind host |
| `AK_PRIMUS_PORT` | `8080` | HTTP bind port |
| `AK_PRIMUS_DB` | `~/.ak_primus/store.db` | SQLite path |
| `AK_PRIMUS_DSPY_CACHE` | `~/.ak_primus/dspy_cache` | DSPy program cache |

---

## License

MIT

