// HOW SCORES ARE CALCULATED
Scoring Guide
Thread-based engineering metrics across 4 dimensions
OVERALL COMPOSITE SCORE
overall = (p_thread_score + l_thread_score + b_thread_score + z_thread_score) / 4
Each dimension score is 0-10 (capped). The overall score is the simple average. All four *_thread_score values are used directly — no additional normalization.
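As a minimal sketch of the composite (the function name is hypothetical; the input values are taken from the worked examples in this guide):

```python
def overall_score(p: float, l: float, b: float, z: float) -> float:
    """Simple average of the four dimension scores (each already capped at 0-10)."""
    return (p + l + b + z) / 4

# e.g. a session scoring P=5.0, L=6.88, B=8.0, Z=6.09:
print(round(overall_score(5.0, 6.88, 8.0, 6.09), 4))  # → 6.4925
```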
MORE (P-threads) — Parallelism
Max: 5+
Number of parallel execution paths
Score Formula
p_thread_score = max(max_concurrent_agents, peak_parallel_tools)
// max of concurrent agents and parallel tool calls, capped at 10
10/10: 10+ sub-agents running concurrently, OR 10+ parallel tool calls in a single message. Direct value, no log scale.
Sweep-Line Algorithm
Score = the larger of (concurrent agents) and (parallel tools in one message).
This is a direct value, NOT log-scaled. Score equals the raw count, capped at 10.
How max_concurrent_agents is computed:
1. Extract [first_seen, last_seen] time range for each sub-agent
2. Create +1 event at start, -1 event at end
3. Sort by time and sweep to find max overlap = max_concurrent_agents
How peak_parallel_tools is computed:
1. Count tool_use blocks in each assistant message; the maximum count = peak_parallel_tools
e.g. Claude calls Read, Grep, Glob simultaneously → peak = 3
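The sweep-line computation above can be sketched in Python (a minimal illustration with assumed function names, not the OMAS implementation):

```python
from typing import List, Tuple

def max_concurrent_agents(ranges: List[Tuple[float, float]]) -> int:
    """Sweep-line: +1 event at each [first_seen, last_seen] start, -1 at each
    end; sort by time and track the maximum overlap."""
    events = []
    for start, end in ranges:
        events.append((start, +1))
        events.append((end, -1))
    # Ties: process -1 before +1 so ranges that merely touch do not
    # count as concurrent (an assumption of this sketch).
    events.sort(key=lambda e: (e[0], e[1]))
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak

def p_thread_score(max_concurrent: int, peak_parallel_tools: int) -> int:
    """Direct value, not log-scaled: the larger count, capped at 10."""
    return min(max(max_concurrent, peak_parallel_tools), 10)

# Three sub-agents all active during t=2..3:
print(max_concurrent_agents([(0, 3), (1, 4), (2, 5)]))  # → 3
print(p_thread_score(5, 2))                             # → 5
```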
Examples
Simple Q&A (score: 1)
User asks a question, Claude answers with Read + Grep (sequential).
max_concurrent_agents = 0, peak_parallel_tools = 1
p_thread_score = max(0, 1) = 1
Parallel file reads (score: 3)
Claude reads 3 files simultaneously in one message: Read, Read, Read.
max_concurrent_agents = 0, peak_parallel_tools = 3
p_thread_score = max(0, 3) = 3
Team-based refactoring (score: 5)
Claude spawns 5 Agent sub-agents for parallel work.
Sweep-line detects 5 agents active at the same time.
max_concurrent_agents = 5, peak_parallel_tools = 2
p_thread_score = max(5, 2) = 5
Metrics
max_concurrent_agents: Maximum sub-agents active simultaneously. Computed via sweep-line algorithm over agent_progress event time ranges.
total_sub_agents: Total unique sub-agents created during the session.
peak_parallel_tools: Maximum number of tool_use blocks called simultaneously in a single assistant message.
Score Ranges
How to Improve
- Use the Agent tool to execute independent tasks simultaneously.
- Request parallel tool calls for independent operations like code search and file reads.
- "Analyze these 3 files simultaneously" → Claude spawns 3 Agents.
- Request concurrent test, build, and lint runs.
- For complex refactoring, separate modules into sub-agents for P-thread classification.
LONGER (L-threads) — Autonomy
Max: 10
Autonomous execution time without human intervention
Score Formula
l_thread_score = min(log1p(longest_stretch_minutes) * 2.0, 10)
// log1p(x) = ln(1+x)
10/10: ~148 minutes (≈2h 28m) of continuous autonomous work. Derived from ln(1+148) × 2 ≈ 10.0
Activity-Based Measurement
Uses log scale: log1p(x) = ln(1+x). This compresses large values so diminishing returns apply.
Why log1p instead of ln? → ln(0) = -∞ (crashes), but ln(1+0) = 0 (safe for zero input).
Key: Measures from human message to Claude's last activity (tool call).
Measures only Claude's actual working time, not until the next human message.
Example: Human(10:00) → Claude works → last tool(10:05) → [idle] → Human(12:00)
Without activity-based measurement: incorrectly 120 min. OMAS: correctly 5 min.
Segments: (1) before first human, (2) between humans, (3) after last human — max of each.
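A sketch of the activity-based measurement (the event format and function names are assumptions; the before-first-human segment is omitted for brevity):

```python
import math

def l_thread_score(longest_stretch_minutes: float) -> float:
    """log1p(x) = ln(1 + x), scaled by 2.0, capped at 10."""
    return min(math.log1p(longest_stretch_minutes) * 2.0, 10.0)

def longest_autonomous_stretch(events) -> float:
    """events: (timestamp_in_minutes, kind) with kind 'human' or 'tool'.
    Each stretch runs from a human message to Claude's LAST tool call before
    the next human message, so idle time waiting for the human is excluded."""
    best = 0.0
    anchor = None          # time of the most recent human message
    last_activity = None   # Claude's last activity since that message
    for t, kind in sorted(events):
        if kind == "human":
            if anchor is not None and last_activity is not None:
                best = max(best, last_activity - anchor)
            anchor, last_activity = t, None
        else:
            last_activity = t
    if anchor is not None and last_activity is not None:
        best = max(best, last_activity - anchor)
    return best

# Human(10:00) → tools until 10:05 → idle → Human(12:00): stretch is 5 min, not 120.
events = [(600, "human"), (602, "tool"), (605, "tool"), (720, "human")]
print(longest_autonomous_stretch(events))  # → 5
print(round(l_thread_score(5), 2))         # → 3.58
```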
Examples
Quick Q&A (score: 0)
User asks, Claude responds immediately. No tool calls between messages.
longest_autonomous_stretch = 0 min
l_thread_score = min(log1p(0) * 2.0, 10) = 0.0
5-minute autonomous work (score: 3.6)
User: "Fix the login bug"
Claude works for 5 minutes: Grep → Read → Edit → Bash(test) → done.
longest_autonomous_stretch = 5 min
l_thread_score = min(log1p(5) * 2.0, 10) = min(3.58, 10) = 3.58
30-minute feature implementation (score: 6.9)
User: "Implement the entire auth module with tests and docs"
Claude works for 30 min straight: 47 tool calls, no human intervention.
longest_autonomous_stretch = 30 min
l_thread_score = min(log1p(30) * 2.0, 10) = min(6.87, 10) = 6.87
Score Reference Table
Metrics
longest_autonomous_stretch_minutes: Max time (minutes) between a human message and Claude's last activity. Activity-based measurement excludes idle time.
max_tool_calls_between_human: Maximum number of tool calls between human messages.
session_duration_minutes: Total session length (first to last timestamp).
max_consecutive_assistant_turns: Maximum consecutive assistant messages.
Score Ranges
How to Improve
- Give clear and specific instructions at once so Claude runs autonomously longer.
- Don't break large tasks into small pieces; deliver all requirements at once.
- "Refactor this entire module. Write tests and create a PR too."
- Trust Claude to make its own decisions without interrupting mid-work.
- Write detailed project conventions in CLAUDE.md so Claude works longer without questions.
THICKER (B-threads) — Density
Max: 10+
Sub-agent scale and nesting depth
Score Formula
b_thread_score = total_sub_agents * max(1, max_sub_agent_depth)
// sub-agent count × nesting depth, capped at 10
line_bonus = min(ai_written_lines / 50000, 1.0)
// AI-written lines bonus: linear, max +1.0 at 50K lines
b_norm = min(b_thread_score + line_bonus, 10.0)
// final density score with line bonus applied
10/10: e.g. 5 sub-agents × depth 2 = 10, or 10 flat sub-agents × depth 1 = 10. Multiplicative: depth is the multiplier. AI-written lines add up to +1.0 bonus (50K lines = full bonus).
Sub-Agent Depth Detection & AI Lines Bonus
Direct multiplication — NOT log-scaled. Raw product is capped at 10.
Depth acts as a multiplier, rewarding nested agent architectures.
depth=0: No sub-agents → score always 0
depth=1: Flat sub-agents (don't spawn their own agents)
depth=2+: Nested — sub-agents spawn sub-agents (B-thread classification)
Detection: If a subagent's JSONL file contains agent_progress events, it's nested.
Examples: 3 agents × depth 2 = 6 | 5 agents × depth 2 = 10 | 10 agents × depth 1 = 10
AI Written Lines Bonus:
Counts lines written via Write (content), Edit (new_string), MultiEdit (edits[].new_string).
Linear bonus: 0 lines → +0.0, 5K lines → +0.1, 10K → +0.2, 50K+ → +1.0 (capped).
This is per-session, not cumulative. Typical sessions earn +0.0~0.2 bonus.
Impact on overall score: max +0.25 (since 4 dimensions are averaged).
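The density formula and line bonus, as a minimal sketch (function and parameter names are assumed):

```python
def b_thread_score(total_sub_agents: int, max_depth: int, ai_written_lines: int = 0) -> float:
    """Density = sub-agent count × nesting depth (depth floor of 1), capped at 10,
    then a linear AI-written-lines bonus, re-capped at 10."""
    base = min(total_sub_agents * max(1, max_depth), 10)  # 0 agents → always 0
    line_bonus = min(ai_written_lines / 50_000, 1.0)
    return min(base + line_bonus, 10.0)

print(b_thread_score(3, 1))           # → 3.0  (flat sub-agents)
print(b_thread_score(4, 2))           # → 8.0  (nested: B-thread)
print(b_thread_score(4, 2, 10_000))   # → 8.2  (+0.2 line bonus)
```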
Examples
No sub-agents (score: 0)
Simple session with direct tool calls only.
total_sub_agents = 0, max_sub_agent_depth = 0
b_thread_score = 0 * max(1, 0) = 0
3 flat sub-agents (score: 3)
Claude spawns 3 Explore agents for research (none spawn sub-agents).
total_sub_agents = 3, max_sub_agent_depth = 1
b_thread_score = 3 * max(1, 1) = 3
4 nested sub-agents (score: 8)
Claude spawns a team of 4 agents. One agent spawns its own sub-agent.
total_sub_agents = 4, max_sub_agent_depth = 2 (nested)
b_thread_score = 4 * max(1, 2) = 8 (B-thread!)
Metrics
tool_calls_per_minute: Tool calls per minute. total_tool_calls / max(duration_minutes, 0.1)
max_sub_agent_depth: Maximum sub-agent nesting depth. 0 = none, 1 = flat, 2+ = nested (B-thread).
total_tool_calls: Total tool calls performed across the entire session.
tokens_per_minute: Tokens consumed per minute. (input_tokens + output_tokens) / duration.
ai_written_lines: Total lines of code written by AI via Write, Edit, and MultiEdit tools in this session.
ai_line_bonus: Bonus added to density score. min(ai_written_lines / 50000, 1.0). Max +1.0 at 50K lines.
Score Ranges
How to Improve
- Use Team/Agent features for complex tasks with nested sub-agents.
- Request "organize a team for this" on large projects for B-thread classification.
- Encourage deep execution trees where sub-agents spawn sub-agents.
- Separate code analysis → implementation → testing → review into individual sub-agents.
- Combine with worktree isolation for even higher density.
- Write more code via Write/Edit tools to earn the AI-written lines bonus (up to +1.0).
FEWER (Trust) — Reduced Human Checkpoints
Max: 10
Human checkpoint reduction, trust level
Score Formula
effective_human = human_messages - trivial_delegations
// excludes trivial delegations (≤5 tool calls after human msg)
ratio_score = min(log1p(tool_calls / effective_human) * 2.0, 10)
// log1p(x) = ln(1+x), log-scaled ratio
ask_penalty = min(penalized_ask_ratio * 10.0, 3.0)
// only AskUserQuestion OUTSIDE plan mode (max -3 pts)
z_thread_score = max(ratio_score - ask_penalty, 0.0)
// final = ratio score - penalty (floor 0)
10/10: ~148+ tool calls per effective human message with no AskUser penalty. Trivial delegations like 'run tests' (≤5 tool calls) are excluded from the human count.
Trivial Delegation Filter & Penalty System
Three-part formula: filter trivial delegations → base ratio (log scale) → minus penalty.
Uses log1p(x) = ln(1+x) for the ratio — same diminishing returns as L-thread.
Step 1: Trivial Delegation Filter (NEW)
If a human message is followed by ≤ 5 tool calls before the next human message,
it is classified as a 'trivial delegation' (e.g. 'run tests', 'build it').
These are NOT genuine checkpoints — they're simple convenience requests.
effective_human_count = human_messages - trivial_delegations (min 1).
Example: 3 human messages → tool counts per segment: [2, 40, 3]
→ Segments with ≤ 5 tools: 2 (trivial). Segment with 40: real work.
→ effective_human = 3 - 2 = 1. Only the 40-tool segment counts.
Step 2: Penalty targets ONLY AskUserQuestion outside Plan Mode:
penalized_ask_ratio = penalized_ask_count / total_tool_calls
ask_penalty = min(penalized_ask_ratio × 10, 3.0) → maximum -3 points
AskUserQuestion is classified into two contexts:
1. Inside Plan Mode (EnterPlanMode ~ ExitPlanMode):
→ No penalty! Clarifying requirements during planning is good practice.
2. Outside Plan Mode (during implementation):
→ Penalty applied. Asking users mid-implementation signals uncertainty.
Example: Plan Mode 3 questions + implementation 1 question
→ plan_mode_ask_user_count = 3 (no penalty)
→ penalized_ask_user_count = 1 (only this penalized)
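Both steps combined, as a sketch (the per-segment tool-count input shape and the function name are assumptions):

```python
import math

def z_thread_score(segment_tool_counts, penalized_ask_count: int = 0) -> float:
    """segment_tool_counts[i] = tool calls following the i-th human message.
    Step 1: segments with <= 5 tool calls are trivial delegations and are
    dropped from the human count (floor 1). Step 2: penalize only
    AskUserQuestion calls made outside Plan Mode, at most -3 points."""
    humans = len(segment_tool_counts)
    trivial = sum(1 for n in segment_tool_counts if n <= 5)
    effective_human = max(humans - trivial, 1)
    total_tools = sum(segment_tool_counts)
    ratio_score = min(math.log1p(total_tools / effective_human) * 2.0, 10.0)
    ask_penalty = min(penalized_ask_count / total_tools * 10.0, 3.0) if total_tools else 0.0
    return max(ratio_score - ask_penalty, 0.0)

# 50 tool calls over 2 real segments, 3 AskUserQuestion outside plan mode:
print(round(z_thread_score([25, 25], penalized_ask_count=3), 2))  # → 5.92
```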
Examples
Frequent back-and-forth (score: 1.4)
10 tool calls, 10 human messages (user asks after every step).
ratio = 10 / 10 = 1.0, no AskUser penalty.
z_thread_score = min(log1p(1.0) * 2.0, 10) - 0 = 1.39
Autonomous implementation (score: 6.1)
User: "Build the entire API module"
Claude makes 40 tool calls with only 2 human messages.
ratio = 40 / 2 = 20.0, no AskUser outside plan mode.
z_thread_score = min(log1p(20) * 2.0, 10) = min(6.09, 10) = 6.09
With AskUser penalty (score: 5.9)
50 tool calls, 2 human messages. But 3 AskUserQuestion outside plan mode.
ratio = 50 / 2 = 25.0 → ratio_score = min(log1p(25) * 2.0, 10) = 6.52
penalty: penalized_ask_ratio = 3/50 = 0.06 → ask_penalty = min(0.6, 3.0) = 0.6
z_thread_score = max(6.52 - 0.6, 0) = 5.92
Plan Mode exception (no penalty)
Claude enters Plan Mode, asks 3 clarifying questions, exits, then executes.
100 tool calls, 1 human message. 3 AskUser in plan mode + 0 outside.
ratio = 100 / 1 = 100 → ratio_score = 9.23, penalty = 0
z_thread_score = 9.23 (plan mode questions NOT penalized!)
Trivial delegation filter (score boost)
3 human messages, but 2 are trivial ('run tests' → 1 tool, 'build' → 2 tools).
Only 1 message triggered real work (40 tool calls). Total = 43 tools.
Without filter: ratio = 43/3 ≈ 14.3 → score = 5.46
With filter: effective_human = 3 - 2 = 1, ratio = 43/1 = 43 → score = 7.57
z_thread_score = 7.57 (trivial delegations excluded from the human count!)
Score Reference Table (ratio_score before penalty)
Metrics
tool_calls_per_human_message: Tool calls per effective human message. total_tool_calls / max(effective_human_count, 1). Excludes trivial delegations.
assistant_per_human_ratio: Assistant-to-human message ratio. assistant_count / max(human_count, 1)
ask_user_count: Total AskUserQuestion invocations (including plan mode).
plan_mode_ask_user_count: AskUserQuestion count inside Plan Mode (between EnterPlanMode ~ ExitPlanMode). No penalty.
penalized_ask_user_count: AskUserQuestion count outside Plan Mode. Only these are penalized.
autonomous_tool_call_pct: Percentage of tool calls excluding penalized AskUser. (1 - penalized/total) * 100
trivial_delegation_count: Human messages classified as trivial delegations (≤5 tool calls in the following segment). Excluded from the trust ratio.
effective_human_count: Human messages actually used in the trust ratio. = human_messages - trivial_delegations (min 1).
Score Ranges
How to Improve
- Give clear instructions once so Claude handles everything autonomously.
- Write coding conventions, preferred patterns, and project structure in CLAUDE.md.
- Asking questions in Plan Mode is fine: no penalty!
- To reduce implementation questions, clarify requirements during the planning phase.
- Pre-approve permissions (auto-accept) so execution continues without interruption.
- Aim for Z-thread: automate entire feature implementation with a single command.
Thread Type Classification
Sessions are classified into one type using the following priority order (Z is highest):
1. human_messages <= 1 AND tool_calls >= 10 → Zero-touch (Z): minimal human input, maximum autonomous work. The most evolved form.
2. max_sub_agent_depth >= 2 → Big (B): sub-agents spawning sub-agents, nested execution.
3. autonomous_stretch > 30 min AND tool_calls > 50 → Long (L): 30+ minutes of autonomous execution without human intervention.
4. sub_agent_prompt_similarity > 70% (Jaccard) → Fusion: similar tasks distributed to multiple agents (Map-Reduce pattern).
5. max_concurrent_agents > 1 → Parallel (P): 2+ sub-agents running concurrently.
6. human_messages >= 3 AND each_gap_tool_calls >= 3 → Chained (C): human-AI conversation repeated in a chain pattern.
7. None of the above → Base: default conversational session. Short Q&A or simple tasks.
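The priority order can be expressed as a Python sketch (the metric field names are assumptions; only the thresholds and ordering come from the classification rules above):

```python
def classify_thread(m: dict) -> str:
    """Return the first matching type, checked from highest priority (Z) down."""
    if m["human_messages"] <= 1 and m["tool_calls"] >= 10:
        return "Zero-touch"           # Z
    if m["max_sub_agent_depth"] >= 2:
        return "Big"                  # B: nested sub-agents
    if m["autonomous_stretch_min"] > 30 and m["tool_calls"] > 50:
        return "Long"                 # L
    if m["sub_agent_prompt_similarity"] > 0.70:
        return "Fusion"               # Jaccard similarity of sub-agent prompts
    if m["max_concurrent_agents"] > 1:
        return "Parallel"             # P
    if m["human_messages"] >= 3 and m["min_gap_tool_calls"] >= 3:
        return "Chained"              # C
    return "Base"                     # default conversational session

print(classify_thread({
    "human_messages": 1, "tool_calls": 42, "max_sub_agent_depth": 1,
    "autonomous_stretch_min": 12, "sub_agent_prompt_similarity": 0.1,
    "max_concurrent_agents": 0, "min_gap_tool_calls": 0,
}))  # → Zero-touch
```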
Improvement Roadmap
Evolve in order: Base → C → P → L → B → Z. Key strategies for each step:
C (Chained): Continue conversations for 3+ turns, progressively building work. Ensure 3+ tool calls per turn.
P (Parallel): Request 2+ independent tasks simultaneously. Explicitly say 'use the Agent tool for parallel processing'.
L (Long): Describe requirements in detail and reduce mid-work intervention. A thorough CLAUDE.md enables 30+ minute autonomous runs.
B (Big): Use 'organize a team' or 'work in a worktree' to encourage deep sub-agent trees.
Z (Zero-touch): The ultimate goal: implement an entire feature with a single command. Auto-approved permissions + detailed project docs + one clear instruction.
Fair Comparison System
A system that filters and weights sessions for fair comparison. Prevents short test sessions or automation scripts from skewing overall scores.
Minimum Qualifying Thresholds
All criteria below must be met to be included in comparisons.
Weighted Scoring
Longer and more complex sessions receive proportionally higher weight.
weight(session) = log1p(total_tool_calls) * log1p(session_duration_minutes)
weighted_score = Σ(score_i * weight_i) / Σ(weight_i)
Consistency Score (0~10)
Measures consistency based on standard deviation of overall scores from the last 20 sessions.
consistency = max(0, min(10, 10 - std_dev * 3.33))
std_dev = 0 → 10.0 (perfect consistency) | std_dev ≥ 3 → ~0.0
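The weighting and consistency formulas, as a sketch (whether std_dev is population or sample is not stated; population standard deviation is assumed here):

```python
import math
from statistics import pstdev

def session_weight(total_tool_calls: int, duration_minutes: float) -> float:
    """weight = log1p(tool_calls) × log1p(duration): longer, denser sessions count more."""
    return math.log1p(total_tool_calls) * math.log1p(duration_minutes)

def weighted_score(scores_and_weights) -> float:
    """Σ(score_i × weight_i) / Σ(weight_i)"""
    total = sum(w for _, w in scores_and_weights)
    return sum(s * w for s, w in scores_and_weights) / total

def consistency_score(recent_overall_scores) -> float:
    """10 - std_dev × 3.33, clamped to [0, 10]."""
    sd = pstdev(recent_overall_scores)
    return max(0.0, min(10.0, 10.0 - sd * 3.33))

print(consistency_score([6.0, 6.0, 6.0]))                  # → 10.0
print(round(consistency_score([2.0, 8.0]), 2))             # → 0.01
print(round(weighted_score([(8.0, 2.0), (4.0, 1.0)]), 2))  # → 6.67
```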
Composite Rank Score
Final comparison rank score combining weighted score (80%) and consistency score (20%).
Data Source
Claude Code JSONL session logs: ~/.claude/projects/<hash>/<session>.jsonl
Sub-agent logs: <session-dir>/subagents/agent-<id>.jsonl
Run omas scan to scan all sessions, then omas export to generate JSON.