// HOW SCORES ARE CALCULATED

Scoring Guide

Thread-based engineering metrics across 4 dimensions

OVERALL COMPOSITE SCORE

overall = (p_thread_score + l_thread_score + b_thread_score + z_thread_score) / 4

Each dimension score is 0-10 (capped). The overall score is the simple average. All four *_thread_score values are used directly — no additional normalization.

MORE (P-threads) — Parallelism

Max: 10

Number of parallel execution paths

Score Formula

p_thread_score = min(max(max_concurrent_agents, peak_parallel_tools), 10)  // larger of concurrent agents and parallel tool calls, capped at 10

10/10: 10+ sub-agents running concurrently, OR 10+ parallel tool calls in a single message. Direct value — no log scale.

Sweep-Line Algorithm

Score = the larger of (concurrent agents) and (parallel tools in one message).

This is a direct value, NOT log-scaled. Score equals the raw count, capped at 10.

How max_concurrent_agents is computed:

1. Extract [first_seen, last_seen] time range for each sub-agent

2. Create +1 event at start, -1 event at end

3. Sort by time and sweep to find max overlap = max_concurrent_agents

How peak_parallel_tools is computed:

1. Count tool_use blocks in each assistant message; the maximum across all messages = peak_parallel_tools

e.g. Claude calls Read, Grep, Glob simultaneously → peak = 3
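
A minimal Python sketch of both computations. The (first_seen, last_seen) pairs and the message dicts are assumed, simplified representations of the JSONL log, not the actual schema:

def max_concurrent(agents):
    # Sweep-line over [first_seen, last_seen] ranges.
    # agents: list of (first_seen, last_seen) timestamp pairs.
    events = []
    for start, end in agents:
        events.append((start, +1))   # agent becomes active
        events.append((end, -1))     # agent finishes
    events.sort()                    # ends sort before starts at equal times
    active = peak = 0
    for _, delta in events:
        active += delta
        peak = max(peak, active)
    return peak

def peak_parallel_tools(messages):
    # Max number of tool_use blocks inside any single assistant message.
    return max(
        (sum(1 for block in msg["content"] if block.get("type") == "tool_use")
         for msg in messages if msg.get("role") == "assistant"),
        default=0,
    )

def p_thread_score(agents, messages):
    # Direct value, not log-scaled; capped at 10.
    return min(max(max_concurrent(agents), peak_parallel_tools(messages)), 10)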

Examples

Simple Q&A (score: 1)

User asks a question, Claude answers with Read + Grep (sequential).

max_concurrent_agents = 0, peak_parallel_tools = 1

p_thread_score = max(0, 1) = 1

Parallel file reads (score: 3)

Claude reads 3 files simultaneously in one message: Read, Read, Read.

max_concurrent_agents = 0, peak_parallel_tools = 3

p_thread_score = max(0, 3) = 3

Team-based refactoring (score: 5)

Claude spawns 5 Agent sub-agents for parallel work.

Sweep-line detects 5 agents active at the same time.

max_concurrent_agents = 5, peak_parallel_tools = 2

p_thread_score = max(5, 2) = 5

Metrics

max_concurrent_agents: Maximum sub-agents active simultaneously. Computed via sweep-line algorithm over agent_progress event time ranges.
total_sub_agents: Total unique sub-agents created during the session.
peak_parallel_tools: Maximum number of tool_use blocks called simultaneously in a single assistant message.

Score Ranges

0: No sub-agents, sequential execution
1: Sub-agents present but not concurrent
2~3: Moderate parallelism
4~5+: High parallelism (P-thread)

How to Improve

  • Use Agent tool to execute independent tasks simultaneously.
  • Request parallel tool calls for independent operations like code search and file reads.
  • "Analyze these 3 files simultaneously" → Claude spawns 3 Agents.
  • Request concurrent test, build, and lint runs.
  • For complex refactoring, separate modules into sub-agents for P-thread classification.

LONGER (L-threads) — Autonomy

Max: 10

Autonomous execution time without human intervention

Score Formula

l_thread_score = min(log1p(longest_stretch_minutes) * 2.0, 10)  // log1p(x) = ln(1+x)

10/10: ~148 minutes (≈2h 28m) of continuous autonomous work. Derived from ln(1+148) × 2 ≈ 10.0

Activity-Based Measurement

Uses log scale: log1p(x) = ln(1+x). This compresses large values so diminishing returns apply.

Why log1p instead of ln? → ln(0) = -∞ (crashes), but ln(1+0) = 0 (safe for zero input).

Key: Measures from human message to Claude's last activity (tool call).

Measures only Claude's actual working time, not until the next human message.

Example: Human(10:00) → Claude works → last tool(10:05) → [idle] → Human(12:00)

Without activity-based measurement: incorrectly 120 min. OMAS: correctly 5 min.

Segments considered: (1) before the first human message, (2) between human messages, (3) after the last human message. The longest stretch is the maximum across all segments.
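
A sketch of the stretch computation under the same caveat as above: the (minutes, kind) event tuples are an assumed, simplified representation of the log, and the handling of the segment before the first human message is an interpretation of rule (1):

import math

def longest_autonomous_stretch(events):
    # events: time-ordered (minutes, kind) pairs, kind in {"human", "tool"}.
    # Measures each stretch from a human message (or the first tool call,
    # for the segment before any human message) to the last tool activity
    # before the next human message; idle time afterwards is excluded.
    best = 0.0
    anchor = None      # start of the current segment
    last_tool = None   # most recent tool activity in the segment
    for t, kind in events:
        if kind == "human":
            if anchor is not None and last_tool is not None:
                best = max(best, last_tool - anchor)
            anchor, last_tool = t, None
        else:
            if anchor is None:
                anchor = t
            last_tool = t
    if anchor is not None and last_tool is not None:
        best = max(best, last_tool - anchor)
    return best

def l_thread_score(stretch_minutes):
    # math.log1p(0) == 0.0, so an empty stretch scores 0 instead of crashing
    return min(math.log1p(stretch_minutes) * 2.0, 10.0)

# The example above: human at 10:00 (600 min), last tool at 10:05 (605 min),
# next human at 12:00 -> stretch = 5 min, score ≈ 3.6.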

Examples

Quick Q&A (score: 0)

User asks, Claude responds immediately. No tool calls between messages.

longest_autonomous_stretch = 0 min

l_thread_score = min(log1p(0) * 2.0, 10) = 0.0

5-minute autonomous work (score: 3.6)

User: "Fix the login bug"

Claude works for 5 minutes: Grep → Read → Edit → Bash(test) → done.

longest_autonomous_stretch = 5 min

l_thread_score = min(log1p(5) * 2.0, 10) = min(3.58, 10) = 3.58

30-minute feature implementation (score: 6.9)

User: "Implement the entire auth module with tests and docs"

Claude works for 30 min straight: 47 tool calls, no human intervention.

longest_autonomous_stretch = 30 min

l_thread_score = min(log1p(30) * 2.0, 10) = min(6.88, 10) = 6.88

Score Reference Table

0 min → 0.0
1 min → 1.4
5 min → 3.6
10 min → 4.8
20 min → 6.1
30 min → 6.9
60 min → 8.2
120 min → 9.6
148 min → 10.0
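
The table follows directly from the score formula; a quick way to regenerate it in Python:

import math

for minutes in (0, 1, 5, 10, 20, 30, 60, 120, 148):
    # prints 0.0, 1.4, 3.6, 4.8, 6.1, 6.9, 8.2, 9.6, 10.0
    print(minutes, round(min(math.log1p(minutes) * 2.0, 10.0), 1))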

Metrics

longest_autonomous_stretch_minutes: Max time (minutes) between human message and Claude's last activity. Activity-based measurement excludes idle time.
max_tool_calls_between_human: Maximum number of tool calls between human messages.
session_duration_minutes: Total session length (first to last timestamp).
max_consecutive_assistant_turns: Maximum consecutive assistant messages.

Score Ranges

0~2: Short conversation, frequent intervention (<1 min)
2~5: Moderate autonomy (1~10 min autonomous)
5~7: High autonomy (10~30 min autonomous)
7~10: Very high autonomy (30+ min L-thread)

How to Improve

  • Give clear and specific instructions at once so Claude runs autonomously longer.
  • Don't break large tasks into small pieces — deliver all requirements at once.
  • "Refactor this entire module. Write tests and create a PR too."
  • Trust Claude to make its own decisions without interrupting mid-work.
  • Write detailed project conventions in CLAUDE.md so Claude works longer without questions.

THICKER (B-threads) — Density

Max: 10

Sub-agent scale and nesting depth

Score Formula

b_thread_score = total_sub_agents * max(1, max_sub_agent_depth)  // sub-agent count × nesting depth
line_bonus = min(ai_written_lines / 50000, 1.0)  // AI-written lines bonus: linear, max +1.0 at 50K lines
b_norm = min(b_thread_score + line_bonus, 10.0)  // final density score: bonus added, capped at 10

10/10: e.g. 5 sub-agents × depth 2 = 10, or 10 flat sub-agents × depth 1 = 10. Multiplicative — depth is the multiplier. AI-written lines add up to +1.0 bonus (50K lines = full bonus).

Sub-Agent Depth Detection & AI Lines Bonus

Direct multiplication — NOT log-scaled. Raw product is capped at 10.

Depth acts as a multiplier, rewarding nested agent architectures.

depth=0: No sub-agents → score always 0

depth=1: Flat sub-agents (don't spawn their own agents)

depth=2+: Nested — sub-agents spawn sub-agents (B-thread classification)

Detection: if a sub-agent's JSONL file contains agent_progress events, it is nested.

Examples: 3 agents × depth 2 = 6 | 5 agents × depth 2 = 10 | 10 agents × depth 1 = 10

AI Written Lines Bonus:

Counts lines written via Write (content), Edit (new_string), MultiEdit (edits[].new_string).

Linear bonus: 0 lines → +0.0, 5K lines → +0.1, 10K → +0.2, 50K+ → +1.0 (capped).

This is per-session, not cumulative. Typical sessions earn +0.0~0.2 bonus.

Impact on overall score: max +0.25 (since 4 dimensions are averaged).
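
The full density computation fits in a few lines; a sketch in Python (parameter names mirror the metrics listed below):

def density_score(total_sub_agents, max_sub_agent_depth, ai_written_lines):
    # Depth multiplies the agent count; zero sub-agents always scores 0.
    b_thread_score = total_sub_agents * max(1, max_sub_agent_depth)
    line_bonus = min(ai_written_lines / 50_000, 1.0)   # linear, +1.0 max at 50K
    return min(b_thread_score + line_bonus, 10.0)

# density_score(4, 2, 5_000) -> min(8 + 0.1, 10.0) == 8.1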

Examples

No sub-agents (score: 0)

Simple session with direct tool calls only.

total_sub_agents = 0, max_sub_agent_depth = 0

b_thread_score = 0 * max(1, 0) = 0

3 flat sub-agents (score: 3)

Claude spawns 3 Explore agents for research (none spawn sub-agents).

total_sub_agents = 3, max_sub_agent_depth = 1

b_thread_score = 3 * max(1, 1) = 3

4 nested sub-agents (score: 8)

Claude spawns a team of 4 agents. One agent spawns its own sub-agent.

total_sub_agents = 4, max_sub_agent_depth = 2 (nested)

b_thread_score = 4 * max(1, 2) = 8 (B-thread!)

Metrics

tool_calls_per_minute: Tool calls per minute. total_tool_calls / max(duration_minutes, 0.1)
max_sub_agent_depth: Maximum sub-agent nesting depth. 0=none, 1=flat, 2+=nested (B-thread).
total_tool_calls: Total tool calls performed across the entire session.
tokens_per_minute: Tokens consumed per minute. (input_tokens + output_tokens) / duration.
ai_written_lines: Total lines of code written by AI via Write, Edit, and MultiEdit tools in this session.
ai_line_bonus: Bonus added to density score. min(ai_written_lines / 50000, 1.0). Max +1.0 at 50K lines.

Score Ranges

0: No sub-agents
1~3: Basic sub-agent usage
4~6: Active sub-agent usage
7+: Nested sub-agents (B-thread, high density)

How to Improve

  • Use Team/Agent features for complex tasks with nested sub-agents.
  • Request "organize a team for this" on large projects for B-thread classification.
  • Encourage deep execution trees where sub-agents spawn sub-agents.
  • Separate code analysis → implementation → testing → review into individual sub-agents.
  • Combine with worktree isolation for even higher density.
  • Write more code via Write/Edit tools to earn the AI-written lines bonus (up to +1.0).

FEWER (Z-threads) — Trust

Max: 10

Human checkpoint reduction, trust level

Score Formula

effective_human = max(human_messages - trivial_delegations, 1)  // excludes trivial delegations (≤5 tool calls after a human msg), floor 1
ratio_score = min(log1p(tool_calls / effective_human) * 2.0, 10)  // log1p(x) = ln(1+x), log-scaled ratio
ask_penalty = min(penalized_ask_ratio * 10.0, 3.0)  // only AskUserQuestion OUTSIDE plan mode (max -3 pts)
z_thread_score = max(ratio_score - ask_penalty, 0.0)  // final = ratio score - penalty (floor 0)

10/10: ~148+ tool calls per effective human message with no AskUser penalty. Trivial delegations like 'run tests' (≤5 tool calls) are excluded from human count.

Trivial Delegation Filter & Penalty System

Three-part formula: filter trivial delegations → compute the base ratio (log scale) → subtract the penalty.

Uses log1p(x) = ln(1+x) for the ratio — same diminishing returns as L-thread.

Step 1: Trivial Delegation Filter (NEW)

If a human message is followed by ≤ 5 tool calls before the next human message, it is classified as a 'trivial delegation' (e.g. 'run tests', 'build it').

These are NOT genuine checkpoints — they're simple convenience requests.

effective_human_count = human_messages - trivial_delegations (min 1).

Example: 3 human messages → tool counts per segment: [2, 40, 3]

→ Segments with ≤ 5 tools: 2 (trivial). Segment with 40: real work.

→ effective_human = 3 - 2 = 1. Only the 40-tool segment counts.

Step 2: Penalty targets ONLY AskUserQuestion outside Plan Mode:

penalized_ask_ratio = penalized_ask_count / total_tool_calls

ask_penalty = min(penalized_ask_ratio × 10, 3.0) → maximum -3 points

AskUserQuestion is classified into two contexts:

1. Inside Plan Mode (EnterPlanMode ~ ExitPlanMode):

→ No penalty! Clarifying requirements during planning is good practice.

2. Outside Plan Mode (during implementation):

→ Penalty applied. Asking users mid-implementation signals uncertainty.

Example: Plan Mode 3 questions + implementation 1 question

→ plan_mode_ask_user_count = 3 (no penalty)

→ penalized_ask_user_count = 1 (only this penalized)
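
A sketch of all three steps in Python. Grouping tool calls into per-human-message segments and counting penalized AskUserQuestion calls is assumed to happen beforehand; the arithmetic follows the formulas above:

import math

def trust_score(segment_tool_counts, penalized_ask_count):
    # segment_tool_counts[i]: tool calls following the i-th human message.
    total_tools = sum(segment_tool_counts)
    # Step 1: segments with <= 5 tool calls are trivial delegations
    trivial = sum(1 for n in segment_tool_counts if n <= 5)
    effective_human = max(len(segment_tool_counts) - trivial, 1)
    # Step 2: log-scaled ratio, capped at 10
    ratio_score = min(math.log1p(total_tools / effective_human) * 2.0, 10.0)
    # Step 3: penalty for AskUserQuestion outside Plan Mode, at most -3
    ask_penalty = min(penalized_ask_count / max(total_tools, 1) * 10.0, 3.0)
    return max(ratio_score - ask_penalty, 0.0)

# Step 1 example: trust_score([2, 40, 3], 0) -> effective_human = 1,
# min(log1p(45) * 2, 10) ≈ 7.66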

Examples

Frequent back-and-forth (score: 1.4)

10 tool calls, 10 human messages (user asks after every step).

ratio = 10 / 10 = 1.0, no AskUser penalty.

z_thread_score = min(log1p(1.0) * 2.0, 10) - 0 = 1.39

Autonomous implementation (score: 6.1)

User: "Build the entire API module"

Claude makes 40 tool calls with only 2 human messages.

ratio = 40 / 2 = 20.0, no AskUser outside plan mode.

z_thread_score = min(log1p(20) * 2.0, 10) = min(6.09, 10) = 6.09

With AskUser penalty (score: 5.9)

50 tool calls, 2 human messages. But 3 AskUserQuestion outside plan mode.

ratio = 50 / 2 = 25.0 → ratio_score = min(log1p(25) * 2.0, 10) = 6.52

penalty: penalized_ask_ratio = 3/50 = 0.06 → ask_penalty = min(0.6, 3.0) = 0.6

z_thread_score = max(6.52 - 0.6, 0) = 5.92

Plan Mode exception (no penalty)

Claude enters Plan Mode, asks 3 clarifying questions, exits, then executes.

100 tool calls, 1 human message. 3 AskUser in plan mode + 0 outside.

ratio = 100 / 1 = 100 → ratio_score = 9.23, penalty = 0

z_thread_score = 9.23 (plan mode questions NOT penalized!)

Trivial delegation filter (score boost)

3 human messages, but 2 are trivial ('run tests' → 1 tool, 'build' → 2 tools).

Only 1 message triggered real work (40 tool calls). Total = 43 tools.

Without filter: ratio = 43/3 = 14.3 → score = 5.47

With filter: effective_human = 3 - 2 = 1, ratio = 43/1 = 43 → score = 7.56

z_thread_score = 7.56 (trivial delegations excluded from human count!)

Score Reference Table (ratio_score before penalty)

1 tool/human → 1.4
5 tools/human → 3.6
10 tools/human → 4.8
20 tools/human → 6.1
50 tools/human → 7.9
100 tools/human → 9.2
148 tools/human → 10.0

Metrics

tool_calls_per_human_message: Tool calls per effective human message. total_tool_calls / max(effective_human_count, 1). Excludes trivial delegations.
assistant_per_human_ratio: Assistant to human message ratio. assistant_count / max(human_count, 1)
ask_user_count: Total AskUserQuestion invocations (including plan mode).
plan_mode_ask_user_count: AskUserQuestion count inside Plan Mode (between EnterPlanMode ~ ExitPlanMode). No penalty.
penalized_ask_user_count: AskUserQuestion count outside Plan Mode. Only these are penalized.
autonomous_tool_call_pct: Percentage of tool calls excluding penalized AskUser. (1 - penalized/total) * 100
trivial_delegation_count: Human messages classified as trivial delegation (≤5 tool calls in following segment). Excluded from trust ratio.
effective_human_count: Human messages actually used in trust ratio. = human_messages - trivial_delegations (min 1).

Score Ranges

0~2: Low trust (frequent intervention, few tool calls per human)
2~5: Moderate trust (~5-10 tool calls per human)
5~8: High trust (~20-50 tool calls per human)
8~10: Very high trust (Z-thread level, 100+ tools/human)

How to Improve

  • Give clear instructions once so Claude handles everything autonomously.
  • Write coding conventions, preferred patterns, and project structure in CLAUDE.md.
  • Asking questions in Plan Mode is fine — no penalty!
  • To reduce implementation questions, clarify requirements during the planning phase.
  • Pre-approve permissions (auto-accept) so execution continues without interruption.
  • Aim for Z-thread: automate entire feature implementation with a single command.

Thread Type Classification

Sessions are classified into one type using the following priority order (Z is highest):

Z-thread
human_messages <= 1 AND tool_calls >= 10

Zero-touch: Minimal human input, maximum autonomous work. Most evolved form.

B-thread
max_sub_agent_depth >= 2

Big: Sub-agents spawning sub-agents — nested execution.

L-thread
autonomous_stretch > 30min AND tool_calls > 50

Long: 30+ minutes of autonomous execution without human intervention.

F-thread
sub_agent_prompt_similarity > 70% (Jaccard)

Fusion: Similar tasks distributed to multiple agents (Map-Reduce pattern).

P-thread
max_concurrent_agents > 1

Parallel: 2+ sub-agents running concurrently.

C-thread
human_messages >= 3 AND each_gap_tool_calls >= 3

Chained: Human-AI conversation repeated in a chain pattern.

Base
None of the above conditions met

Default conversational session. Short Q&A or simple tasks.
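
Expressed as code, the priority cascade looks like the following sketch. The metrics dict and its key names are assumptions modeled on the metric names used in this guide:

def classify_thread(m):
    # m: session metrics; checked top-down, first match wins (Z highest).
    if m["human_messages"] <= 1 and m["tool_calls"] >= 10:
        return "Z-thread"   # zero-touch
    if m["max_sub_agent_depth"] >= 2:
        return "B-thread"   # nested sub-agents
    if m["autonomous_stretch_minutes"] > 30 and m["tool_calls"] > 50:
        return "L-thread"   # long autonomous run
    if m["sub_agent_prompt_similarity"] > 0.70:
        return "F-thread"   # map-reduce fan-out (Jaccard similarity)
    if m["max_concurrent_agents"] > 1:
        return "P-thread"   # concurrent sub-agents
    if m["human_messages"] >= 3 and all(g >= 3 for g in m["gap_tool_calls"]):
        return "C-thread"   # chained conversation
    return "Base"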

Improvement Roadmap

Evolve in order: Base → C → P → L → B → Z. Key strategies for each step:

Base → C-thread

Continue conversations for 3+ turns, progressively building work. Ensure 3+ tool calls per turn.

C-thread → P-thread

Request 2+ independent tasks simultaneously. Explicitly say 'use Agent tool for parallel processing'.

P-thread → L-thread

Describe requirements in detail and reduce mid-work intervention. A thorough CLAUDE.md enables 30+ min autonomous runs.

L-thread → B-thread

Use 'organize a team' or 'work in a worktree' to encourage deep sub-agent trees.

B-thread → Z-thread

Ultimate goal: implement an entire feature with a single command. Auto-approve permissions + detailed project docs + clear single instruction.

Fair Comparison System

A system that filters and weights sessions for fair comparison. Prevents short test sessions or automation scripts from skewing overall scores.

Minimum Qualifying Thresholds

All criteria below must be met to be included in comparisons.

Min session duration: 5 min (session_duration_minutes)
Min tool call count: 10 calls (total_tool_calls)
Min human message count: 1 msg (total_human_messages)
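
As a predicate (a sketch using the metric names above):

def qualifies(m):
    # All three minimums must hold for the session to be compared.
    return (m["session_duration_minutes"] >= 5
            and m["total_tool_calls"] >= 10
            and m["total_human_messages"] >= 1)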

Weighted Scoring

Longer and more complex sessions receive proportionally higher weight.

weight(session) = log1p(total_tool_calls) * log1p(session_duration_minutes)

weighted_score = Σ(score_i * weight_i) / Σ(weight_i)

Consistency Score (0~10)

Measures consistency based on standard deviation of overall scores from the last 20 sessions.

consistency = max(0, min(10, 10 - std_dev * 3.33))

std_dev = 0 → 10.0 (perfect consistency) | std_dev ≥ 3 → ~0.0

Composite Rank Score

Final comparison rank score combining weighted score (80%) and consistency score (20%).

composite_rank = weighted_score * 0.8 + consistency * 0.2
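
A sketch of the whole fair-comparison pipeline. Whether the standard deviation is population or sample is not stated; population std-dev is assumed here:

import math
from statistics import pstdev

def weight(tool_calls, duration_minutes):
    # Longer, busier sessions weigh more, with log1p diminishing returns.
    return math.log1p(tool_calls) * math.log1p(duration_minutes)

def weighted_score(sessions):
    # sessions: (overall_score, total_tool_calls, duration_minutes) triples
    pairs = [(score, weight(tools, dur)) for score, tools, dur in sessions]
    return sum(s * w for s, w in pairs) / sum(w for _, w in pairs)

def consistency(recent_scores):
    # recent_scores: overall scores from the last 20 sessions
    return max(0.0, min(10.0, 10.0 - pstdev(recent_scores) * 3.33))

def composite_rank(w_score, cons):
    return w_score * 0.8 + cons * 0.2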

Data Source

Claude Code JSONL session logs: ~/.claude/projects/<hash>/<session>.jsonl

Sub-agent logs: <session-dir>/subagents/agent-<id>.jsonl

Run omas scan to scan all sessions; run omas export to generate JSON.