Prompt Optimization Flow
Overview
This document describes a prompt optimization architecture that intercepts, analyzes, enhances, and validates user prompts before they reach Claude. A multi-stage pipeline of local LLMs, MCP tool chains, knowledge graph integration, and cloud-based optimization improves prompt quality while deferring any Claude API cost until the final prompt is sent.
Key Benefits
- Zero Initial API Cost: All optimization happens before hitting paid Claude API endpoints
- Intelligent Complexity Routing: Simple prompts bypass optimization for speed; complex prompts get full treatment
- Knowledge Graph Integration: Automatically enriches prompts with relevant context from CortexGraph
- Multi-Model Validation: Cross-validates optimizations using multiple LLMs to ensure quality
- Flexible Architecture: Local LLMs can be swapped with cloud providers as needed
- Metadata Enrichment: Adds confidence scores, similarity metrics, and processing metadata to prompts
Architecture Components
1. Proxy Server
- Central orchestration layer
- Handles routing decisions based on complexity
- Manages communication between all components
- Tracks confidence/similarity thresholds
2. Local LLMs
- Primary: Prompt optimization and tagging
- Validation: Multiple instances for cross-validation
- Can be replaced with cloud providers (OpenAI, Anthropic, etc.)
3. MCP Tool Chain
- CortexGraph: Knowledge graph for context retrieval
- STOPPER: Process control and validation
- Custom Tools: User-defined extensions
- Gemini Optimizer: Large context window for final assembly
4. Validation Layer
- Semantic similarity checks
- Confidence scoring
- Iterative refinement when scores fall below thresholds
Detailed Flow Description
Phase 1: Initial Intake
- User Input: User enters prompt in Claude Code interface
- Proxy Intercept: Proxy captures the prompt before it reaches Claude
- Complexity Analysis: NLP-based complexity rating determines routing strategy
Phase 2: Intelligent Routing
- Simple Path (Low Complexity):
  - Proxy applies basic formatting rules
  - Routes directly to Claude with minimal processing
  - Optimizes for speed and reduces overhead
- Complex Path (High Complexity):
  - Triggers full optimization pipeline
  - Proceeds to Phase 3
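
The routing decision in Phases 1 and 2 can be sketched as follows. The document does not specify the NLP-based analyzer, so `estimate_complexity` is a hypothetical heuristic standing in for it; only `COMPLEX_PROMPT_THRESHOLD` and its comparison rule come from the Configuration Options section.

```python
import re

COMPLEX_PROMPT_THRESHOLD = 0.4  # value from the Configuration Options section


def estimate_complexity(prompt: str) -> float:
    """Hypothetical heuristic: longer, multi-clause prompts score higher.

    A placeholder for the real NLP-based analyzer. Returns a value in [0, 1].
    """
    words = prompt.split()
    # Punctuation and conjunctions serve as a rough proxy for multi-part requests.
    clause_markers = len(
        re.findall(r"[,;:]|\b(?:and|then|also|but)\b", prompt.lower())
    )
    length_score = min(len(words) / 100.0, 1.0)
    clause_score = min(clause_markers / 10.0, 1.0)
    return round(0.6 * length_score + 0.4 * clause_score, 3)


def choose_path(prompt: str) -> str:
    """Return "complex" when the score exceeds the threshold, else "simple"."""
    score = estimate_complexity(prompt)
    return "complex" if score > COMPLEX_PROMPT_THRESHOLD else "simple"
```

A real analyzer would likely use a trained classifier; the weights 0.6 and 0.4 here are arbitrary illustration values.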
Phase 3: Prompt Optimization
- Local LLM Processing:
  - Adds semantic tags to categorize intent
  - Restructures prompt for optimal Claude comprehension
  - Formats according to Claude best practices
  - Extracts key entities and concepts
Phase 4: Validation & Refinement
- Multi-Model Validation:
  - Routes optimized prompt to 2-n additional local LLMs
  - Each validator scores the optimization independently
  - Can use semantic similarity algorithms instead of LLMs
  - Calculates confidence and similarity metrics
- Threshold Check:
  - If scores meet threshold: Proceed to Phase 5
  - If scores below threshold: Return to Phase 3 for reprocessing
  - Prevents low-quality optimizations from proceeding
- Tool Recommendation:
  - Proxy receives validated prompt with metadata
  - System suggests relevant MCP tools for the query
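
A minimal sketch of the Phase 3 and Phase 4 loop, using the threshold values from the Validation Settings section. Several details are assumptions for illustration only: validator scores are aggregated with a plain mean, the optimizer and validators are passed in as ordinary callables, and the loop falls back to the original prompt when the thresholds are never met.

```python
from statistics import mean
from typing import Callable, List, Tuple

CONFIDENCE_THRESHOLD = 0.75
SIMILARITY_THRESHOLD = 0.80
MAX_REFINEMENT_ITERATIONS = 3

# Hypothetical signature: (original, optimized) -> confidence in [0, 1]
Validator = Callable[[str, str], float]


def refine_until_valid(
    original: str,
    optimize: Callable[[str, int], str],
    validators: List[Validator],
    similarity: Callable[[str, str], float],
) -> Tuple[str, dict]:
    """Re-run the optimizer until both thresholds pass or attempts run out."""
    metadata: dict = {}
    for iteration in range(1, MAX_REFINEMENT_ITERATIONS + 1):
        candidate = optimize(original, iteration)
        # Assumption: aggregate validator scores with an unweighted mean.
        confidence = mean(v(original, candidate) for v in validators)
        sim = similarity(original, candidate)
        metadata = {
            "confidence": confidence,
            "similarity": sim,
            "iterations": iteration,
        }
        if confidence >= CONFIDENCE_THRESHOLD and sim >= SIMILARITY_THRESHOLD:
            metadata["validated"] = True
            return candidate, metadata
    # Thresholds never met: send the unmodified prompt instead.
    metadata["validated"] = False
    return original, metadata
```

The iteration cap is what keeps the "return to Phase 3" step from looping forever; `validators` must be non-empty, since `mean` raises on empty input.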
Phase 5: MCP Tool Chain Execution
- CortexGraph Search:
  - Searches knowledge graph for related concepts
  - Retrieves relevant memories and context
  - Returns similarity-scored results
- STOPPER Validation:
  - Process control checks
  - Safety and constraint validation
  - Prevents out-of-scope operations
- Additional Tools:
  - Routes to n other tools based on user preferences
  - Each tool contributes specialized context
  - Tools run in parallel for efficiency
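
The parallel tool execution described above can be sketched with `asyncio.gather`. The tool functions below are hypothetical stand-ins, not real CortexGraph or STOPPER clients; an actual implementation would call the tools through an MCP client.

```python
import asyncio
from typing import Awaitable, Callable, Dict

Tool = Callable[[str], Awaitable[dict]]


async def run_tools(prompt: str, tools: Dict[str, Tool]) -> Dict[str, dict]:
    """Run every tool concurrently and collect results keyed by tool name."""
    names = list(tools)
    results = await asyncio.gather(
        *(tools[name](prompt) for name in names),
        return_exceptions=True,  # one failing tool should not cancel the rest
    )
    collected: Dict[str, dict] = {}
    for name, result in zip(names, results):
        if isinstance(result, Exception):
            collected[name] = {"error": str(result)}
        else:
            collected[name] = result
    return collected


# Hypothetical stand-ins for real MCP tool calls.
async def fake_cortex_search(prompt: str) -> dict:
    await asyncio.sleep(0.01)
    return {"memories": [], "query": prompt}


async def fake_stopper_check(prompt: str) -> dict:
    await asyncio.sleep(0.01)
    return {"allowed": True}
```

Capturing exceptions per tool lets the chain continue with partial context when one tool is unavailable.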
Phase 6: Final Assembly
- Gemini Optimization:
  - Combines original prompt + optimizations + tool outputs
  - Leverages Gemini's large context window (2M tokens)
  - Uses generous free tier for cost optimization
  - Assembles coherent final prompt
- Quality Assurance:
  - Compares input to assembled output
  - Generates similarity score (drift detection)
  - Calculates final confidence rating
  - Appends metadata to prompt
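
Drift detection compares the original prompt with the assembled one. The sketch below uses a bag-of-words cosine similarity as a simple lexical stand-in; this is an assumption, since the document does not say which similarity measure is used, and an embedding-based measure would capture meaning better.

```python
import math
import re
from collections import Counter


def lexical_similarity(a: str, b: str) -> float:
    """Cosine similarity over word counts, clamped to [0.0, 1.0]."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    if norm == 0:
        return 0.0
    return min(1.0, dot / norm)  # clamp away floating-point overshoot


def has_drifted(original: str, assembled: str, threshold: float = 0.80) -> bool:
    """Flag the assembled prompt when similarity falls below the threshold."""
    return lexical_similarity(original, assembled) < threshold
```

Because the assembled prompt adds context and tool output, comparing the original against only the restated intent, and not the whole enriched prompt, is probably the fairer check.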
Phase 7: Claude Execution
- Final Prompt Delivery:
  - Proxy sends optimized prompt to Claude
  - First API cost incurred at this step
  - Prompt includes:
    - Original user intent (preserved)
    - Optimization tags and structure
    - Knowledge graph context
    - Tool outputs and recommendations
    - Confidence/similarity metadata
    - Processing history
- Normal Operation:
  - Claude processes the enriched prompt
  - Claude Code continues standard workflow
  - User receives high-quality response
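
One way to carry the parts listed under Final Prompt Delivery is a small container that renders them into a single prompt. `EnrichedPrompt` and its field names are hypothetical; the section labels loosely follow the Example Workflow at the end of this document.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EnrichedPrompt:
    """Hypothetical container for everything the proxy sends to Claude."""

    original: str
    intent: str
    context: List[str] = field(default_factory=list)
    tool_outputs: Dict[str, str] = field(default_factory=dict)
    metadata: Dict[str, object] = field(default_factory=dict)

    def render(self) -> str:
        """Serialize the parts into one prompt, skipping empty sections."""
        lines = [
            f"User Intent: {self.intent}",
            "",
            "Original Prompt:",
            self.original,
        ]
        if self.context:
            lines += ["", "Context (from CortexGraph):"]
            lines += [f"- {item}" for item in self.context]
        if self.tool_outputs:
            lines += ["", "Tool Outputs:"]
            lines += [f"- {name}: {out}" for name, out in self.tool_outputs.items()]
        if self.metadata:
            lines += ["", "Metadata:"]
            lines += [f"- {key}: {value}" for key, value in self.metadata.items()]
        return "\n".join(lines)
```

Keeping the original prompt verbatim in the rendered text preserves the user's intent even if the optimizer's restatement is imperfect.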
Sequence Diagram
```mermaid
sequenceDiagram
actor User
participant Claude Code Interface
participant Proxy
participant NLP Complexity Analyzer
participant Local LLM (Optimizer)
participant Local LLM 2 (Validator)
participant Local LLM N (Validator)
participant Semantic Similarity Engine
participant MCP Chain
participant CortexGraph
participant STOPPER
participant Custom Tools
participant Gemini
participant Claude API
%% Phase 1: Initial Intake
User->>Claude Code Interface: Enter prompt
Claude Code Interface->>Proxy: Forward prompt
Proxy->>NLP Complexity Analyzer: Analyze complexity
NLP Complexity Analyzer-->>Proxy: Complexity rating
%% Phase 2: Routing Decision
alt Low Complexity (Simple Prompt)
Proxy->>Proxy: Apply basic rules
Proxy->>Claude API: Route directly to Claude
Note over Proxy,Claude API: Fast path for simple queries
else High Complexity (Complex Prompt)
Note over Proxy: Trigger full optimization pipeline
%% Phase 3: Optimization
Proxy->>Local LLM (Optimizer): Optimize prompt
Note over Local LLM (Optimizer): - Add semantic tags<br/>- Format for Claude<br/>- Extract entities<br/>- Restructure query
Local LLM (Optimizer)-->>Proxy: Optimized prompt v1
%% Phase 4: Validation Loop
rect rgb(240, 240, 240)
Note over Proxy,Semantic Similarity Engine: Validation & Refinement Loop
par Parallel Validation
Proxy->>Local LLM 2 (Validator): Validate optimization
Proxy->>Local LLM N (Validator): Validate optimization
Proxy->>Semantic Similarity Engine: Check semantic similarity
end
Local LLM 2 (Validator)-->>Proxy: Confidence score 2
Local LLM N (Validator)-->>Proxy: Confidence score N
Semantic Similarity Engine-->>Proxy: Similarity score
Proxy->>Proxy: Aggregate scores
alt Below Confidence/Similarity Threshold
Note over Proxy,Local LLM (Optimizer): Quality check failed
Proxy->>Local LLM (Optimizer): Reprocess with feedback
Local LLM (Optimizer)-->>Proxy: Optimized prompt v2
Note over Proxy: Loop until threshold met or MAX_REFINEMENT_ITERATIONS reached
else Above Threshold
Note over Proxy: Quality validated, proceed
end
end
Proxy->>Proxy: Append recommendation metadata
%% Phase 5: MCP Tool Chain
Proxy->>MCP Chain: Route validated prompt + metadata
rect rgb(230, 245, 255)
Note over MCP Chain,Custom Tools: MCP Tool Execution (Parallel)
par Tool Execution
MCP Chain->>CortexGraph: Search knowledge graph
MCP Chain->>STOPPER: Validate constraints
MCP Chain->>Custom Tools: Execute user-defined tools
end
CortexGraph-->>MCP Chain: Context + memories (similarity scored)
STOPPER-->>MCP Chain: Validation results
Custom Tools-->>MCP Chain: Tool outputs
end
%% Phase 6: Final Assembly
MCP Chain->>Gemini: Assemble final prompt
Note over Gemini: - Combine all inputs<br/>- Optimize structure<br/>- 2M token context<br/>- Free tier usage
Gemini->>Gemini: Compare input vs output
Gemini->>Gemini: Calculate similarity & confidence
Gemini-->>MCP Chain: Final prompt + metadata
MCP Chain-->>Proxy: Return final prompt
%% Phase 7: Claude Execution
Note over Proxy,Claude API: 💰 First API cost incurred here
Proxy->>Claude API: Send final optimized prompt
Note over Claude API: Prompt includes:<br/>- Original intent<br/>- Optimizations<br/>- Knowledge graph context<br/>- Tool outputs<br/>- Metadata
end
%% Normal Operation
Claude API-->>Claude Code Interface: Process request
Claude Code Interface-->>User: Return response
Note over User,Claude Code Interface: Claude Code continues as normal
```
Configuration Options
Complexity Thresholds
```python
# Proxy configuration
# Prompts with complexity > COMPLEX_PROMPT_THRESHOLD follow the complex path, otherwise the simple path is used.
COMPLEX_PROMPT_THRESHOLD = 0.4
```
Validation Settings
```python
# Validation thresholds
CONFIDENCE_THRESHOLD = 0.75      # Minimum confidence to proceed
SIMILARITY_THRESHOLD = 0.80      # Minimum semantic similarity
MAX_REFINEMENT_ITERATIONS = 3    # Prevent infinite loops
```
Model Selection
```python
# Local LLMs (can be replaced with cloud providers)
OPTIMIZER_MODEL = "llama-3.1-70b"  # Primary optimizer
VALIDATOR_MODELS = [               # Validation ensemble
    "mixtral-8x7b",
    "qwen-2.5-72b",
    "deepseek-v2",
]

# Example using cloud providers (alternative to local)
# OPTIMIZER_MODEL = "openai:gpt-4"
# VALIDATOR_MODELS = ["anthropic:claude-3-opus", "openai:gpt-4"]
```
MCP Tools
```python
# Tool chain configuration
MCP_TOOLS = {
    "cortex_graph": {
        "enabled": True,
        "similarity_threshold": 0.7,
        "max_results": 10,
    },
    "stopper": {
        "enabled": True,
        "strict_mode": False,
    },
    "custom": {
        "user_preferences": True,
        "context_retrieval": True,
    },
}
```
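
As an illustration of how the `similarity_threshold` and `max_results` settings for `cortex_graph` might be applied, the sketch below filters and ranks search hits. The result shape (a dict with `text` and `score` keys) is an assumption; CortexGraph's actual response format is not described here.

```python
from typing import Dict, List, Union

# Hypothetical result shape: {"text": "...", "score": 0.0-1.0}
Result = Dict[str, Union[str, float]]


def filter_results(
    results: List[Result],
    similarity_threshold: float = 0.7,
    max_results: int = 10,
) -> List[Result]:
    """Keep hits at or above the threshold, best first, capped at max_results."""
    kept = [r for r in results if float(r["score"]) >= similarity_threshold]
    kept.sort(key=lambda r: float(r["score"]), reverse=True)
    return kept[:max_results]
```

The defaults mirror the configuration values above, so the function can be called as `filter_results(hits, **options)` once the `enabled` key is removed from the options dict.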
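Gemini Settings
```python
# Final assembly configuration
GEMINI_MODEL = "gemini-2.0-flash-exp"  # Free tier, large context
# Intended 2M-token context budget. Verify against the selected model's
# documented limit; some Gemini models accept about 1M input tokens.
GEMINI_MAX_TOKENS = 2000000
GEMINI_TEMPERATURE = 0.3               # Consistent assembly
```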
Performance Characteristics
Latency Profile
| Stage | Estimated Time | Notes |
|---|---|---|
| Complexity Analysis | 10-50ms | Fast NLP classification |
| Simple Path (total) | 50-100ms | Minimal processing overhead |
| Optimization | 200-500ms | Local LLM inference |
| Validation | 150-300ms | Parallel execution |
| MCP Tool Chain | 100-400ms | Depends on tool complexity |
| Gemini Assembly | 300-800ms | Large context processing |
| Complex Path (total) | 1-3 seconds | Full pipeline |
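
As a sanity check on the table, the per-stage estimates can be summed directly. The sketch below also shows the effect of refinement passes, assuming (an assumption, not stated in the table) that each extra pass repeats both the optimization and validation stages.

```python
from typing import Dict, Tuple

# (low, high) estimates in milliseconds, copied from the table above.
STAGE_LATENCY_MS: Dict[str, Tuple[int, int]] = {
    "complexity_analysis": (10, 50),
    "optimization": (200, 500),
    "validation": (150, 300),
    "mcp_tool_chain": (100, 400),
    "gemini_assembly": (300, 800),
}


def complex_path_range(passes: int = 1) -> Tuple[int, int]:
    """Sum stage estimates; optimization and validation repeat once per pass."""
    if passes < 1:
        raise ValueError("passes must be at least 1")
    low = high = 0
    for stage, (lo, hi) in STAGE_LATENCY_MS.items():
        repeat = passes if stage in ("optimization", "validation") else 1
        low += lo * repeat
        high += hi * repeat
    return low, high


print(complex_path_range(1))  # (760, 2050)
print(complex_path_range(3))  # (1460, 3650)
```

A single pass sums to 760-2050 ms, so the "1-3 seconds" total in the table reads as a rounded figure with some headroom. With three refinement passes the upper estimate reaches 3650 ms, which exceeds that range.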
Cost Analysis
Traditional Approach (direct to Claude):
- Every prompt hits Claude API immediately
- No optimization or context enrichment
- Cost: $X per request from first token

Optimized Approach (this architecture):
- Local LLMs: Free (self-hosted) or cheap (cloud)
- Gemini: Leverages the generous free tier for final assembly
- Claude API: Only hit after full optimization
- Cost: $0 in Claude API charges until Claude execution, then roughly the same $X (the enriched prompt adds input tokens) but better results

Net Effect:
- Comparable Claude API cost per request
- Significantly better prompt quality
- Higher success rate (fewer retries needed)
- Lower total cost due to reduced iterations
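
The "lower total cost due to reduced iterations" claim can be made concrete under a simple model. This is an illustrative assumption, not a measured result: each attempt succeeds independently with a fixed probability, and failed attempts are retried until one succeeds, so the expected number of requests is 1 / success_rate.

```python
def expected_cost_per_task(cost_per_request: float, success_rate: float) -> float:
    """Expected spend to get one successful answer under the retry model."""
    if cost_per_request < 0:
        raise ValueError("cost_per_request must be non-negative")
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_request / success_rate


def optimized_is_cheaper(
    base_cost: float,
    base_success: float,
    optimized_cost: float,
    optimized_success: float,
) -> bool:
    """True when the optimized path has the lower expected cost per task."""
    return expected_cost_per_task(
        optimized_cost, optimized_success
    ) < expected_cost_per_task(base_cost, base_success)
```

For example, with placeholder numbers, a request that costs 20% more but raises the first-try success rate from 0.5 to 0.9 comes out ahead (`optimized_is_cheaper(1.0, 0.5, 1.2, 0.9)` is `True`). A 50% costlier request that only lifts success from 0.9 to 0.95 does not.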
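Implementation Considerations
1. Local LLM Requirements
- GPU: 70B models need roughly 35-40 GB of VRAM even at 4-bit quantization, so a single 24 GB RTX 4090 requires CPU offloading; plan for multiple GPUs, a larger-memory card, or a smaller optimizer model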
- RAM: 64GB+ recommended
- Alternative: Use cloud inference APIs (Groq, Together.ai, OpenRouter)
2. Proxy Server
- Needs to be MCP-compatible
- Should support WebSocket for streaming
- Must handle concurrent validation requests
3. Knowledge Graph Integration
- CortexGraph needs to be populated with relevant data
- Index must be kept up-to-date
- Consider using CortexGraph for temporal memory
4. Error Handling
- Fallback to simple path if optimization fails
- Timeout protection (max 5s total processing)
- Graceful degradation if tools unavailable
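
The fallback and timeout rules above can be combined in one wrapper. This is a sketch under assumptions: the pipeline is modeled as a single async callable, and "fallback to simple path" is represented by returning the unmodified prompt.

```python
import asyncio
from typing import Awaitable, Callable, Tuple

MAX_PROCESSING_SECONDS = 5.0  # "max 5s total processing" from the list above


async def optimize_with_fallback(
    prompt: str,
    pipeline: Callable[[str], Awaitable[str]],
    timeout: float = MAX_PROCESSING_SECONDS,
) -> Tuple[str, str]:
    """Return (prompt_to_send, path); fall back to the raw prompt on failure."""
    try:
        optimized = await asyncio.wait_for(pipeline(prompt), timeout=timeout)
        return optimized, "complex"
    except asyncio.TimeoutError:
        # Pipeline exceeded its time budget: send the prompt as-is.
        return prompt, "simple (timeout)"
    except Exception:
        # Any optimizer or tool failure degrades to the simple path.
        return prompt, "simple (error)"
```

Returning the path label alongside the prompt makes fallback events easy to log.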
5. Monitoring & Observability
- Track optimization success rates
- Monitor confidence/similarity distributions
- Log processing times for each stage
- A/B test optimized vs non-optimized prompts
Future Enhancements
- Adaptive Thresholds: Learn optimal confidence/similarity thresholds per user
- Caching Layer: Cache optimizations for similar prompts
- User Feedback Loop: Incorporate user ratings to improve optimization
- Model Selection: Automatically choose best LLM based on prompt type
- Streaming Optimization: Stream partial results during processing
- Cost Tracking: Detailed cost accounting per stage
- A/B Testing Framework: Compare different optimization strategies
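
As one possible starting point for the Caching Layer item, the sketch below caches optimizations keyed by a hash of the normalized prompt. It only matches prompts that are identical after lowercasing and whitespace collapsing. Matching genuinely "similar" prompts would need something like embedding lookup, which is not shown. The class and method names are hypothetical.

```python
import hashlib
from typing import Callable, Dict


class OptimizationCache:
    """In-memory cache keyed by a hash of the normalized prompt."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(prompt: str) -> str:
        # Lowercase and collapse whitespace so trivial variants share a key.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get_or_compute(self, prompt: str, optimize: Callable[[str], str]) -> str:
        """Return the cached optimization, computing it on first use."""
        k = self.key(prompt)
        if k in self._store:
            self.hits += 1
            return self._store[k]
        self.misses += 1
        self._store[k] = optimize(prompt)
        return self._store[k]
```

A production cache would also need eviction and invalidation when the knowledge graph context changes.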
Security Considerations
- Prompt Injection: Validate all optimized prompts for injection attempts
- Data Privacy: Local LLMs keep sensitive data on-premise
- Rate Limiting: Prevent abuse of free tier services
- Access Control: Authenticate proxy requests
- Audit Trail: Log all prompt transformations
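
For the Rate Limiting item, a per-client sliding-window limiter is one simple option. The sketch below is an illustration with hypothetical names; the caller supplies the current time so the logic stays deterministic and easy to test.

```python
from collections import deque
from typing import Deque, Dict


class SlidingWindowLimiter:
    """Allow at most max_calls per client within any window_seconds span."""

    def __init__(self, max_calls: int, window_seconds: float) -> None:
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._calls: Dict[str, Deque[float]] = {}

    def allow(self, client_id: str, now: float) -> bool:
        """Record and permit the call, or reject it if the window is full."""
        calls = self._calls.setdefault(client_id, deque())
        # Drop timestamps that have aged out of the window.
        while calls and now - calls[0] >= self.window_seconds:
            calls.popleft()
        if len(calls) >= self.max_calls:
            return False
        calls.append(now)
        return True
```

In the proxy, `now` would typically come from `time.monotonic()`, and `client_id` from the authenticated request.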
Related Documentation
- CortexGraph Architecture - Integration with temporal memory
- CortexGraph Documentation - Knowledge graph features
- MCP Specification - Tool protocol details
- Prompt Injection Prevention - Security best practices
Example Workflow
Input Prompt
```
"Help me write a Python function to process user data"
```
After Optimization
```markdown
# Task: Python Function Development
User Intent: Create data processing function

Context (from CortexGraph):
- User prefers type hints (from memory: 2025-10-15)
- Uses pytest for testing (from memory: 2025-10-20)
- Prefers dataclasses over dicts (from memory: 2025-10-12)

Requirements:
1. Function should process user data
2. Follow user's Python style preferences
3. Include type hints and docstrings
4. Consider testing approach

Metadata:
- Confidence: 0.87
- Similarity: 0.92
- Optimization iterations: 1
- Tools used: CortexGraph, STOPPER
- Processing time: 1.2s
```
Result
Claude receives a rich, contextualized prompt that produces higher-quality output on the first try, reducing the need for follow-up iterations.
Built with Claude Code 🤖