Metadata-Version: 2.4
Name: llmdoctor
Version: 0.1.1
Summary: Find LLM cost leaks before your bill does. Static analysis for Anthropic and OpenAI client code.
Project-URL: Homepage, https://github.com/Shahriyar-Khan27/llm-doctor
Project-URL: Repository, https://github.com/Shahriyar-Khan27/llm-doctor
Project-URL: Issues, https://github.com/Shahriyar-Khan27/llm-doctor/issues
Project-URL: Changelog, https://github.com/Shahriyar-Khan27/llm-doctor/blob/main/CHANGELOG.md
Author: llmdoctor contributors
License: MIT
License-File: LICENSE
Keywords: ai,anthropic,claude,cost,cost-optimization,gpt,linter,llm,openai,prompt-cache,static-analysis,tokens
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: System Administrators
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Software Development :: Quality Assurance
Classifier: Topic :: Utilities
Classifier: Typing :: Typed
Requires-Python: >=3.9
Requires-Dist: click>=8.1
Requires-Dist: rich>=13.0
Provides-Extra: dev
Requires-Dist: build>=1.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: twine>=5.0; extra == 'dev'
Description-Content-Type: text/markdown

# llmdoctor

**Find LLM cost leaks before your bill does.**

`llmdoctor doctor` is a static analyzer for Python code that calls Anthropic
or OpenAI. It catches the patterns that quietly burn money in production:

- Prompt-cache placement bugs that invalidate the cache on every call
  (the bug claude-mem itself shipped — their issue #1890)
- Missing `max_tokens` caps where output tokens cost 3–10× input
- Premium models (Opus, GPT-5) used for tiny prompts where a cheaper model
  would produce indistinguishable output
- Large static system prompts left uncached

It's an advisor, not a runtime patcher. It reads your code, prints findings
with rough cost-impact estimates, and exits.
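
For a concrete picture, here is the cache-placement bug (TS001) in miniature, sketched against the Anthropic Messages API. The model name and prompt contents are illustrative placeholders, not output from the tool.

```python
import anthropic

client = anthropic.Anthropic()
user_query = "How do refunds work?"                  # stand-in for per-request input
STATIC_POLICY = "You are a support agent. " * 400    # stand-in for a large static system prompt

# BAD: the dynamic block sits before the cache_control marker, so the cacheable
# prefix changes on every request and the cache is never reused.
client.messages.create(
    model="claude-sonnet-4-20250514",                # illustrative model name
    max_tokens=1024,
    system=[
        {"type": "text", "text": f"User said: {user_query}"},   # dynamic, first
        {"type": "text", "text": STATIC_POLICY,
         "cache_control": {"type": "ephemeral"}},                # static, marked for caching
    ],
    messages=[{"role": "user", "content": "Summarize the policy."}],
)

# GOOD: static content first (and cached); dynamic content moved into messages.
client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {"type": "text", "text": STATIC_POLICY,
         "cache_control": {"type": "ephemeral"}},
    ],
    messages=[{"role": "user", "content": f"User said: {user_query}"}],
)
```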

## Install

```bash
pip install llmdoctor
# or run without installing:
pipx run llmdoctor doctor .
```

## Usage

```bash
llmdoctor doctor .              # scan current directory
llmdoctor doctor src/agent.py   # scan one file
llmdoctor doctor . --json       # for CI / piping into other tools
llmdoctor doctor . --fail-on HIGH   # exit 1 if any HIGH-severity issue
```

## What it looks like

```
╭─ llmdoctor doctor ─────────────────────────────────────────────╮
│ Scanned 14 file(s) under src/                                    │
│ Found 3 issue(s)  ·  2 HIGH · 1 MEDIUM                           │
│ Estimated potential savings: ~$340/month  (rough estimate)       │
╰──────────────────────────────────────────────────────────────────╯

╭─ [HIGH] TS001 Dynamic content before cache_control invalidates the cache ─╮
│   file:  src/agent.py:42                                                  │
│   code:  {"type": "text", "text": f"User said: {user_query}"},            │
│   why:   System block at index 0 contains dynamic content but appears     │
│          BEFORE the first block with cache_control. ...                   │
│   fix:   Move static content BEFORE the cache_control marker. Move        │
│          dynamic content into the messages array.                         │
│   estimate: ~$135.00/month  (assuming: 3000-token system prompt, 100      │
│             calls/day, 30-day month, 0.1× cache-read pricing)             │
│   docs:  https://docs.anthropic.com/.../prompt-caching                    │
╰───────────────────────────────────────────────────────────────────────────╯
```

## Checks shipped in 0.1.0

| Code  | Severity | What it catches |
|-------|----------|-----------------|
| TS001 | HIGH     | Dynamic content placed before a `cache_control` marker (silently invalidates the prompt cache). |
| TS003 | MEDIUM   | Large static system prompt without `cache_control` (missed cache opportunity). |
| TS010 | HIGH     | OpenAI call with no `max_tokens` / `max_completion_tokens` (output cost unbounded). |
| TS011 | MEDIUM   | `max_tokens` set suspiciously high (likely a copy-paste default that enables runaway completions). |
| TS020 | MEDIUM   | Premium model (Opus, GPT-5, GPT-4-Turbo, GPT-4o) on a tiny static prompt where a cheaper tier would likely match quality. |
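
To make two of these rows concrete, here is a minimal sketch of the TS010 pattern and one plausible fix using the OpenAI SDK; the model names and the token cap are illustrative choices, not recommendations baked into the tool.

```python
from openai import OpenAI

client = OpenAI()
ticket = "App crashes when I tap export."

# TS010: no output cap, so one runaway completion bills whatever the model emits.
client.chat.completions.create(
    model="gpt-4o",                                   # illustrative premium tier
    messages=[{"role": "user", "content": f"Classify as bug/feature/question: {ticket}"}],
)

# Fixed: bound the output; for a one-word label a tiny cap is plenty, and a
# cheaper tier would likely produce the same answer (the TS020 observation).
client.chat.completions.create(
    model="gpt-4o-mini",                              # illustrative cheaper tier
    messages=[{"role": "user", "content": f"Classify as bug/feature/question: {ticket}"}],
    max_tokens=8,
)
```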

## How cost estimates are calculated

Estimates are **heuristic**, not invoice predictions. Each issue prints its
assumptions (e.g. *"100 calls/day, 30-day month, 3000-token system prompt"*).
Treat the numbers as order-of-magnitude estimates. The tool's value is the *finding* and
the *fix*; the dollar number is the attention-grabber.
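
As an illustration, the ~$135/month figure in the sample output above can be roughly reconstructed from its own printed assumptions; the per-token price below is an assumed premium-tier input rate, not a number the tool pins down.

```python
# Rough reconstruction of the TS001 estimate from its printed assumptions.
tokens_per_call = 3_000          # static system prompt re-sent on every call
calls_per_month = 100 * 30       # 100 calls/day, 30-day month
usd_per_mtok = 15.00             # assumed premium-tier input price, USD per million tokens

uncached = tokens_per_call * calls_per_month / 1_000_000 * usd_per_mtok
cached = uncached * 0.1          # 0.1x cache-read pricing
print(f"uncached ~ ${uncached:.2f}/month, cached ~ ${cached:.2f}/month")  # 135.00 vs 13.50
```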

The pricing table lives in `src/llmdoctor/pricing.py` (verified 2026-04-30). Submit
a PR if a model is missing or a price has changed.

## What this tool deliberately does NOT do (yet)

- It does not patch your code. It reports, you fix.
- It does not run your code. Static analysis only — safe on closed-source repos.
- It does not measure live traffic. That's a different product (the SDK,
  coming next). The doctor is the first wedge.
- It does not check JavaScript / TypeScript. Python only in 0.1.0.
- It does not flag retry-storm patterns yet (planned: TS030).
- It does not detect tool-definition duplication across calls (planned: TS040).

If your codebase doesn't import `anthropic` or `openai` directly (e.g. you
use LangChain, LiteLLM, or hit the HTTP API), the doctor will produce no
findings. Adapter checks for those frameworks are a next step.

## Self-audit

Before publishing, we audited the doctor itself for the categories of
failure most likely to make a measurement tool lose credibility:
checker correctness on edge cases, input safety (BOMs, huge files,
binary content, recursion bombs), reporter safety (markup injection),
and basic security threat modelling. Five concrete bugs were caught and
fixed before 0.1.0; eight intentional false-negatives are documented
with rationale.

Full report: [`AUDIT.md`](AUDIT.md).

## Development

```bash
git clone https://github.com/Shahriyar-Khan27/llm-doctor
cd llm-doctor
pip install -e ".[dev]"
pytest
```

## License

MIT.

## Why we built this

We were scoping a broader LLM-cost optimization SDK and surveyed the
landscape: LLMLingua-family compression, GPTCache-style semantic caching,
Mem0 / Letta / claude-mem memory frameworks, and Anthropic's prompt
caching. One finding kept resurfacing as the single highest-leverage gap:
**prompt-cache placement bugs are everywhere, mostly invisible, and cost
serious money.** Even a competent OSS project like claude-mem shipped one
to production (their issue #1890). Runtime tools catch this only after
weeks of wasted spend; static analysis catches the whole class in seconds.

So before building the bigger SDK, we shipped the diagnostic. That's
`llmdoctor`.
