Metadata-Version: 2.4
Name: nvd-claude-proxy
Version: 1.4.4
Summary: Run Claude Code (and any Anthropic SDK client) on NVIDIA NIM models via a local proxy.
Author: nvd-claude-proxy contributors
License: MIT License
        
        Copyright (c) 2026 nvd-claude-proxy contributors
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Homepage, https://github.com/khiwn/nvd-claude-proxy
Project-URL: Bug Tracker, https://github.com/khiwn/nvd-claude-proxy/issues
Project-URL: Changelog, https://github.com/khiwn/nvd-claude-proxy/releases
Keywords: anthropic,claude,nvidia,nim,proxy,llm,openai-compatible,claude-code,nemotron
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Internet :: Proxy Servers
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Framework :: FastAPI
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: fastapi>=0.115
Requires-Dist: uvicorn[standard]>=0.32
Requires-Dist: httpx[http2]>=0.27
Requires-Dist: pydantic>=2.9
Requires-Dist: pydantic-settings>=2.6
Requires-Dist: pyyaml>=6.0
Requires-Dist: python-multipart>=0.0.12
Requires-Dist: tiktoken>=0.8
Requires-Dist: pillow>=11.0
Requires-Dist: structlog>=24.4
Requires-Dist: orjson>=3.10
Requires-Dist: typer>=0.12
Requires-Dist: rich>=13.0
Requires-Dist: json-repair>=0.30.0
Requires-Dist: json5>=0.9.28
Requires-Dist: sqlalchemy>=2.0
Requires-Dist: aiosqlite>=0.20
Provides-Extra: dev
Requires-Dist: pytest>=8.3; extra == "dev"
Requires-Dist: pytest-asyncio>=0.24; extra == "dev"
Requires-Dist: pytest-httpx>=0.33; extra == "dev"
Requires-Dist: ruff>=0.7; extra == "dev"
Requires-Dist: mypy>=1.13; extra == "dev"
Requires-Dist: respx>=0.22; extra == "dev"
Requires-Dist: jsonschema>=4.23; extra == "dev"
Provides-Extra: metrics
Requires-Dist: prometheus-client>=0.21; extra == "metrics"
Provides-Extra: pdf
Requires-Dist: pypdf>=4.0; extra == "pdf"
Provides-Extra: redis
Requires-Dist: redis[hiredis]>=5.0; extra == "redis"
Provides-Extra: full
Requires-Dist: prometheus-client>=0.21; extra == "full"
Requires-Dist: pypdf>=4.0; extra == "full"
Requires-Dist: redis[hiredis]>=5.0; extra == "full"
Dynamic: license-file

# nvd-claude-proxy

[![PyPI](https://img.shields.io/pypi/v/nvd-claude-proxy)](https://pypi.org/project/nvd-claude-proxy/)
[![Python](https://img.shields.io/pypi/pyversions/nvd-claude-proxy)](https://pypi.org/project/nvd-claude-proxy/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
[![Code Style: Ruff](https://img.shields.io/badge/code%20style-ruff-000000.svg)](https://github.com/astral-sh/ruff)

**Run Claude Code — and any Anthropic SDK client — on enterprise-grade NVIDIA NIM models.**

`nvd-claude-proxy` is a low-latency local HTTP proxy that translates between the [Anthropic Messages API](https://docs.anthropic.com/en/api/messages) and the NVIDIA NIM (OpenAI-compatible) API. The default runtime now uses the lightweight R2 path optimized for Claude Code responsiveness.

---

## 🚀 Key Features

- **Architectural Excellence**: Fully decoupled core translation logic from the transport layer.
- **Enterprise Resilience**: Built-in **Circuit Breakers** and automated failover chains to protect against upstream outages.
- **Idempotency Support**: Request deduplication and safe retries via `anthropic-idempotency-key` across Redis, SQLite, and Memory backends.
- **Scalable State**: Distributed session management via **Redis** (with SQLite and In-Memory fallbacks).
- **Official-Grade Security**: Unified `AuthMiddleware` protecting all endpoints with global API key enforcement.
- **Claude Code Optimized**: Specifically tuned for Claude Code's complex tool-calling and reasoning patterns.
- **Vision & Progressive Streaming**: Fine-grained progressive tool streaming and real-time multimodal (`image_url`) parity.
- **Modular Pipeline**: Event-driven streaming architecture for deterministic state management.

---

## 🛠 Deployment & Configuration

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `NVIDIA_API_KEY` | (Required) | Your NVIDIA NIM API key. |
| `PROXY_API_KEY` | None | Optional key to protect the proxy itself. |
| `STORAGE_ENGINE` | `sqlite` | Persistence backend: `redis`, `sqlite`, or `memory`. |
| `REDIS_URL` | None | Required if `STORAGE_ENGINE=redis` (e.g., `redis://localhost:6379`). |
| `PROXY_PORT` | `8788` | Local port for the proxy. |
| `RATE_LIMIT_RPM`| `0` | Global rate limit (requests per minute). `0` to disable. |

### Quick Start

```bash
# Install the proxy
pip install nvd-claude-proxy[full]

# Export your API key
export NVIDIA_API_KEY=nvapi-...

# Start the default low-latency runtime and launch Claude Code
ncp code
```

Then point your Claude Code at the proxy:
```bash
export ANTHROPIC_BASE_URL=http://localhost:8788
claude
```

---

## 🏗 Architecture

The proxy uses a **Chain of Responsibility** pattern for streaming events:
`MetadataProcessor -> TextProcessor -> ToolProcessor -> SafetyProcessor -> FinalizerProcessor`

This ensures that even complex interleaved reasoning and parallel tool calls are correctly reconstructed for the Anthropic SDK.

---
**Official-Grade Infrastructure for the AI Era.**


---

## Production Claude Code + NVIDIA NIM configuration

Use this proxy as the Anthropic-compatible endpoint for Claude Code:

    export NVIDIA_API_KEY=nvapi-...
    export PROXY_PORT=8788
    export MAX_REQUEST_BODY_MB=32
    export REQUEST_TIMEOUT_SECONDS=600
    export STORAGE_ENGINE=redis
    export REDIS_URL=redis://127.0.0.1:6379

    # Optional but strongly recommended for shared/devbox usage
    export PROXY_API_KEY=replace-with-a-long-random-secret

Run the proxy:

    uv run ncp run
    # or: ncp run

Point Claude Code at the proxy:

    export ANTHROPIC_BASE_URL=http://127.0.0.1:8788
    export ANTHROPIC_AUTH_TOKEN=dummy
    claude

### Recommended production notes

- Prefer `STORAGE_ENGINE=redis` for stable rate limiting, idempotency, and multi-session behavior.
- Keep `MAX_REQUEST_BODY_MB=32` to avoid pathological payloads while still supporting large Claude Code tool catalogs.
- Use the default streaming path; it emits early `message_start` and periodic `ping` events to reduce apparent latency and prevent idle timeouts.
- If tool calls appear slow or malformed upstream, start with `claude-sonnet-4-6` or `claude-haiku-4-5` mappings before moving to larger reasoning models.
- This proxy is translation-only: Claude Code executes tools locally; the proxy must preserve tool ordering, streamed JSON fragments, and Anthropic-compatible SSE grammar.


---

## R2 low-latency mode

Version 1.3.5 adds a lightweight hosted-catalog runtime inspired by the one-file
reference proxy. Use it when you care more about **fast first-token latency** and
minimal overhead than about the full production registry/session stack.

### Start R2 mode

```bash
ncp r2 --model nvidia/llama-3.3-nemotron-super-49b-v1.5
# or
nvd-claude-proxy-r2
```

Then point Claude Code at it:

```bash
M=nvidia/llama-3.3-nemotron-super-49b-v1.5
export ANTHROPIC_BASE_URL=http://127.0.0.1:8787
export ANTHROPIC_API_KEY=not-used
export ANTHROPIC_CUSTOM_MODEL_OPTION=$M
export ANTHROPIC_DEFAULT_HAIKU_MODEL=$M
export ANTHROPIC_DEFAULT_OPUS_MODEL=$M
export ANTHROPIC_DEFAULT_SONNET_MODEL=$M
export CLAUDE_CODE_SUBAGENT_MODEL=$M
claude
```

### Why use R2 mode

- eager `message_start` for lower perceived TTFT
- 15s ping heartbeat during silent reasoning phases
- simpler tool translation path
- direct NVIDIA model IDs, no alias registry required
- less overhead than the full production runtime


---

## Default runtime in 1.4.0

Starting with **1.4.0**, the default commands now use the low-latency R2 runtime:

- `ncp code` → starts the R2 runtime and launches Claude Code
- `ncp proxy` → starts the R2 runtime only
- `ncp r2` → explicit alias for the same default runtime
- `nvd-claude-proxy` → starts the R2 runtime when invoked as the package entrypoint

This change prioritizes:

- faster first-token latency
- simpler Claude Code model wiring
- lower runtime overhead
- direct NVIDIA model IDs

Use `NCP_DEFAULT_MODEL` to override the default hosted NVIDIA model used by `ncp code` and `ncp proxy`.


---

## Streaming quality and visualization

The default runtime now emphasizes Anthropic-style streaming quality:

- SSE `id:` field is emitted on every event
- early `message_start` for lower perceived TTFT
- keepalive `ping` events during silent upstream gaps
- progressive `message_delta` usage snapshots after content-block closes
- visualization side-channel events via `event: ncp_visualization`

### R2 streaming environment knobs

- `R2_PING_INTERVAL` — keepalive cadence in seconds
- `R2_TEXT_DELTA_CHARS` — max chunk size for text/thinking deltas
- `R2_STREAM_VISUALIZATION` — enable or disable visualization side-channel events
- `R2_MESSAGE_DELTA_EVERY_BLOCK` — emit progress usage snapshots after each content block stop

### Visualization endpoint

The runtime also exposes:

`GET /v1/stream/visualization`

This reports the currently active visualization behavior for dashboards or debugging tools.


### Stream dashboard

The low-latency runtime now ships with a beautiful live stream visualization UI.

Open:

`/dashboard/stream`

Features:

- glassmorphism dark UI
- live color-coded event timeline
- state graph lanes for lifecycle, content, tools, and diagnostics
- websocket-driven real-time visualization from the R2 stream side-channel
- usage progress counters and live request tracking

This UI is powered by the `ncp_visualization` side-channel and the websocket endpoint:

`/ws/stream-visualization`


### Default max tokens

The default R2 runtime now supports a built-in default output budget for upstream requests when the client does not explicitly send `max_tokens`.

Use either:

```bash
export NCP_DEFAULT_MAX_TOKENS=12000
```

or per launch:

```bash
ncp code --max-tokens 12000
```

This is especially useful for large codebase mapping tasks where Claude Code may otherwise request too much output for the selected model context window.


### Automatic fallback and context-safe retries

Before publishing, the default R2 runtime was further hardened to reduce Claude retry loops:

- automatic fallback across `NCP_FALLBACK_MODELS` when the primary model is retired, missing, rate-limited, or transiently failing
- automatic max-token reduction retry when NVIDIA returns context-length overflow style 400s
- startup diagnostics now print the dashboard and health URLs immediately

Override fallback models with:

```bash
export NCP_FALLBACK_MODELS="meta/llama-4-maverick-17b-128e-instruct,deepseek-ai/deepseek-v4-flash,qwen/qwen3-coder-480b-a35b-instruct"
```


### 1.4.1 stability upgrade

Version 1.4.1 adds:

- `NCP_DEFAULT_MAX_TOKENS` and `ncp code --max-tokens`
- `NCP_FALLBACK_MODELS` automatic model fallback
- context-safe retry when upstream rejects oversized context windows
- improved startup diagnostics for dashboard and health endpoints


---

## 1.4.2 classic R2 restore

Version **1.4.2** restores the smooth **1.3.5-style R2 hosted-catalog flow** as the primary experience, while keeping non-invasive improvements like the stream dashboard and streaming observability.

### Primary command

`ncp code` now uses the restored classic R2 path.

Recommended launch:

```bash
ncp code --model nvidia/llama-3.3-nemotron-super-49b-v1.5 --max-tokens 12000
```

### Permanent fix for context overflow

The classic R2 runtime now applies permanent budget guardrails:

- explicit client `max_tokens` is hard-clamped by `--max-tokens` / `NCP_HARD_MAX_TOKENS`
- omitted client `max_tokens` uses `NCP_DEFAULT_MAX_TOKENS`
- oversized input is rejected early with an actionable Claude-specific error
- combined input/output budget is reduced before upstream request when possible

This permanently fixes the common failure mode where Claude Code asks for too much output or continues a giant session until the provider rejects it.

---

## Official-grade Claude Code + NVIDIA NIM setup

Start the proxy:

```bash
export NVIDIA_API_KEY=nvapi-...
export PROXY_PORT=8787
export NCP_DEFAULT_MODEL=qwen/qwen3-coder-480b-a35b-instruct
export NCP_DEFAULT_MAX_TOKENS=12000
export NCP_HARD_MAX_TOKENS=16384
uv run ncp r2
```

Point Claude Code at it:

```bash
export ANTHROPIC_BASE_URL=http://127.0.0.1:8787
export ANTHROPIC_API_KEY=dummy
export ANTHROPIC_CUSTOM_MODEL_OPTION=qwen/qwen3-coder-480b-a35b-instruct
export ANTHROPIC_CUSTOM_MODEL_OPTION_NAME="Qwen3 Coder 480B via NVIDIA NIM"
export ANTHROPIC_DEFAULT_SONNET_MODEL_NAME=qwen/qwen3-coder-480b-a35b-instruct
export CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS=1
claude
```

Refresh candidate NVIDIA model capability entries:

```bash
uv run ncp models-refresh --output /tmp/nvidia-models.generated.yaml
```

Run conformance smoke test:

```bash
ANTHROPIC_BASE_URL=http://127.0.0.1:8787 scripts/conformance-anthropic.sh
```

Max-token and thinking behavior:

- The proxy never forwards Anthropic `thinking` directly to NVIDIA.
- For models with reasoning support, the proxy maps thinking to model-specific controls.
- If `thinking.budget_tokens` is present, the proxy ensures upstream `max_tokens > thinking.budget_tokens` or returns an Anthropic-compatible 400.
- The proxy clamps output tokens to model context windows to avoid NVIDIA context-length 400s.
