// HOW SCORES ARE CALCULATED

Scoring Guide

Thread-based engineering metrics across 4 dimensions

OVERALL COMPOSITE SCORE

overall = (p_thread_score + l_thread_score + min(b_thread_score + ai_line_bonus, 10) + z_thread_score) / 4

Each dimension score is 0-10 (capped). The overall score is the simple average of the four dimension scores. The density term includes the AI-written-lines bonus before its cap; no other normalization is applied.
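A minimal Python sketch of the composite (function and argument names are illustrative, not the actual omas API):

```python
def overall_score(p: float, l: float, b: float, ai_line_bonus: float, z: float) -> float:
    """Simple average of the four dimension scores (each capped at 10)."""
    density = min(b + ai_line_bonus, 10.0)  # line bonus applied before the cap
    return (p + l + density + z) / 4.0
```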

MORE (P-threads) — Parallelism

Max: 10

Concurrent sessions running simultaneously

Score Formula

p_thread_score = min(concurrent_sessions, 10.0)  // number of concurrent sessions, capped at 10

10/10: 10 or more Claude Code sessions running at the same time. Direct value — no log scale.

Cross-Session Sweep-Line Algorithm

Measures how many Claude Code sessions run simultaneously.

True parallelism means running multiple terminal sessions in parallel — each on an independent task.

This is different from sub-agent parallelism within a session (which is measured by Thicker).

How concurrent_sessions is computed (sweep-line algorithm):

1. Gather all sessions across all projects

2. Create events: +1 at each session start, -1 at each session end

3. Sort events chronologically and sweep to build a concurrency timeline

4. For each session, find the peak concurrent count during its active window

This avoids over-counting from pairwise overlap.

Example: A long session overlapping with 3 short non-concurrent sessions → peak 2, not 4.

This is a direct value, NOT log-scaled. Score equals the raw count, capped at 10.

Note: P-thread is computed during `omas scan` (which sees all sessions).

`omas analyze` (single session) defaults to P-thread = 1.
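A minimal Python sketch of the sweep-line computation described above, assuming each session is reduced to a (start, end) interval (the Session type and function names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Session:
    start: float  # epoch seconds; field names are illustrative
    end: float

def peak_concurrency(sessions: list[Session]) -> list[int]:
    """Sweep-line: peak concurrent session count during each session's window."""
    # +1 at each start, -1 at each end; ends sort before starts at equal times
    # so sessions that merely touch do not count as overlapping.
    events = sorted(
        [(s.start, +1) for s in sessions] + [(s.end, -1) for s in sessions],
        key=lambda e: (e[0], e[1]),
    )
    # Build the concurrency timeline.
    timeline = []
    active = 0
    for t, delta in events:
        active += delta
        timeline.append((t, active))
    # For each session, take the maximum count observed inside its window.
    return [
        max((count for t, count in timeline if s.start <= t < s.end), default=1)
        for s in sessions
    ]

def p_thread_score(concurrent_sessions: int) -> float:
    return min(float(concurrent_sessions), 10.0)  # direct value, capped at 10
```

Running this on the example above (one long session overlapping three short, non-overlapping sessions) yields a peak of 2 for every session, matching the "peak 2, not 4" behavior.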

Examples

Single session (score: 1)

Developer runs one Claude Code session at a time.

concurrent_sessions = 1

p_thread_score = min(1, 10) = 1

Three parallel terminals (score: 3)

Developer opens 3 terminals, each running Claude Code on different tasks.

All 3 sessions overlap in time. concurrent_sessions = 3

p_thread_score = min(3, 10) = 3

Full parallel workflow (score: 5)

Developer runs 5 Claude Code sessions simultaneously across multiple projects.

concurrent_sessions = 5

p_thread_score = min(5, 10) = 5

Metrics

concurrent_sessions: Peak number of Claude Code sessions running at the same time. Computed via cross-session sweep-line over all session time ranges.

Score Ranges

1 → Single session (no parallelism)
2~3 → Moderate parallelism
4~5 → High parallelism
6~10 → Expert-level parallel workflow (P-thread)

How to Improve

  • Open multiple terminals and run Claude Code sessions in parallel on independent tasks.
  • Decompose large features into independent sub-tasks and work on them in separate sessions.
  • Use tmux or terminal splits to manage multiple concurrent Claude Code sessions.
  • Each session should focus on a different module, file, or concern for true independence.
  • Within-session agent concurrency (Agent tool) now contributes to Thicker, not More.

LONGER (L-threads) — Autonomy

Max: 10

Autonomous execution time without human intervention

Score Formula

l_thread_score = min(log1p(P90_stretch_minutes) * 2.5, 10)  // P90 = 90th percentile of all autonomous stretches

10/10: ~54 minutes of P90 autonomous stretch. Derived from ln(1+54) × 2.5 ≈ 10.0

P90 Activity-Based Measurement

Uses the 90th percentile of all autonomous stretches within a session (not the absolute max).

P90 rewards near-longest stretches while excluding extreme outliers.

Uses log scale: log1p(x) = ln(1+x). This compresses large values so diminishing returns apply.

Why log1p instead of ln? → ln(0) = -∞ (crashes), but ln(1+0) = 0 (safe for zero input).

Key: Measures from human message to Claude's last activity (tool call).

Measures only Claude's actual working time, not until the next human message.

Example: Human(10:00) → Claude works → last tool(10:05) → [idle] → Human(12:00)

Without activity-based measurement: incorrectly 120 min. OMAS: correctly 5 min.

Segments: (1) before first human, (2) between humans, (3) after last human — P90 of all.

Idle Gap Capping (v0.6.0+):

Gaps > 30 min (IDLE_GAP_THRESHOLD) between consecutive activities are capped at 30 min.

This prevents idle periods (e.g. permission prompt left unanswered) from inflating the stretch.
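A minimal Python sketch of the stretch measurement and score, assuming activity timestamps are given in minutes and using a simple nearest-rank P90 (the exact percentile interpolation omas uses is an assumption here):

```python
import math

IDLE_GAP_THRESHOLD_MIN = 30.0  # v0.6.0+: gaps longer than this are capped

def stretch_minutes(activity_minutes: list[float]) -> float:
    """One autonomous stretch: human message through Claude's last activity,
    with idle gaps between consecutive activities capped at 30 min."""
    return sum(
        min(later - earlier, IDLE_GAP_THRESHOLD_MIN)
        for earlier, later in zip(activity_minutes, activity_minutes[1:])
    )

def p90(values: list[float]) -> float:
    """Nearest-rank 90th percentile (interpolation method is an assumption)."""
    if not values:
        return 0.0
    ordered = sorted(values)
    return ordered[min(round(0.9 * (len(ordered) - 1)), len(ordered) - 1)]

def l_thread_score(stretches: list[float]) -> float:
    return min(math.log1p(p90(stretches)) * 2.5, 10.0)
```

For the 5-minute example below: l_thread_score([5.0]) = min(log1p(5) * 2.5, 10) ≈ 4.48.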

Examples

Quick Q&A (score: 0)

User asks, Claude responds immediately. No tool calls between messages.

P90 autonomous stretch = 0 min

l_thread_score = min(log1p(0) * 2.5, 10) = 0.0

5-minute autonomous work (score: 4.5)

User: "Fix the login bug"

Claude works for 5 minutes: Grep → Read → Edit → Bash(test) → done.

P90 autonomous stretch = 5 min

l_thread_score = min(log1p(5) * 2.5, 10) = min(4.48, 10) = 4.48

30-minute feature implementation (score: 8.6)

User: "Implement the entire auth module with tests and docs"

Claude works for 30 min straight: 47 tool calls, no human intervention.

P90 autonomous stretch = 30 min

l_thread_score = min(log1p(30) * 2.5, 10) = min(8.59, 10) = 8.59

Score Reference Table (P90 stretch)

0 min → 0.0
1 min → 1.7
5 min → 4.5
10 min → 6.0
20 min → 7.6
30 min → 8.6
54 min → 10.0

Metrics

p90_autonomous_stretch_minutes: 90th percentile of autonomous stretch durations. Rewards near-longest stretches while excluding extreme outliers.
max_tool_calls_between_human: Maximum number of tool calls between human messages.
session_duration_minutes: Total session length (first to last timestamp).
max_consecutive_assistant_turns: Maximum consecutive assistant messages.

Score Ranges

0~2 → Short conversation, frequent intervention (<1 min)
2~5 → Moderate autonomy (1~10 min autonomous)
5~7 → High autonomy (10~30 min autonomous)
7~10 → Very high autonomy (30+ min L-thread)

How to Improve

  • Give clear and specific instructions at once so Claude runs autonomously longer.
  • Don't break large tasks into small pieces — deliver all requirements at once.
  • "Refactor this entire module. Write tests and create a PR too."
  • Trust Claude to make its own decisions without interrupting mid-work.
  • Write detailed project conventions in CLAUDE.md so Claude works longer without questions.

THICKER (B-threads) — Density

Max: 10+

Sub-agent scale and nesting depth

Score Formula

b_thread_score = min(total_agents, 10.0)  // total agents (team + sub), capped at 10. Linear scale.
ai_line_bonus = min(ai_written_lines / 50000, 1.0)  // AI-written lines bonus: linear, max +1.0 at 50K lines
b_norm = min(b_thread_score + ai_line_bonus, 10.0)  // density score with line bonus applied

10/10: 10 or more agents (team + sub) in a single session. Direct value — no log scale, no depth multiplier.

Total Agent Count & AI Lines Bonus

Linear scale: b_thread_score = min(total_agents, 10). Same approach as P-thread (concurrent sessions).

total_agents = all agents spawned in the session (team agents + sub-agents).

No depth multiplier — one nested sub-agent shouldn't double the entire score.

No orchestration breadth — within-session concurrency is just part of using agents.

Depth is still tracked for B-thread classification (depth 2+ = B-thread), but it does NOT affect the score calculation.

AI Written Lines Bonus:

Counts lines written via Write (content), Edit (new_string), MultiEdit (edits[].new_string).

Linear bonus: 0 lines → +0.0, 5K lines → +0.1, 10K → +0.2, 50K+ → +1.0 (capped).

This is per-session, not cumulative. Typical sessions earn +0.0~0.2 bonus.

Impact on overall score: max +0.25 (since 4 dimensions are averaged).
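A minimal Python sketch of the two formulas above (names mirror the formula variables; this is illustrative, not the omas source):

```python
def density_score(total_agents: int, ai_written_lines: int) -> float:
    """Thicker (B-thread) score with the AI-written-lines bonus applied."""
    b_thread_score = min(float(total_agents), 10.0)      # linear, no depth multiplier
    ai_line_bonus = min(ai_written_lines / 50_000, 1.0)  # +1.0 max at 50K lines
    return min(b_thread_score + ai_line_bonus, 10.0)     # b_norm, capped at 10
```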

Examples

No agents (score: 0)

Simple session with direct tool calls only.

total_agents = 0

b_thread_score = min(0, 10) = 0

3 agents (score: 3)

Claude spawns 3 Explore agents for research.

total_agents = 3

b_thread_score = min(3, 10) = 3

Team of 5 agents (score: 5)

Claude organizes a team of 5 agents for a feature implementation.

total_agents = 5

b_thread_score = min(5, 10) = 5

10+ agents (score: 10)

Large-scale team with 10 or more agents (team + sub-agents).

total_agents = 10

b_thread_score = min(10, 10) = 10 (max!)

Metrics

tool_calls_per_minute: Tool calls per minute. total_tool_calls / max(duration_minutes, 0.1)
max_sub_agent_depth: Maximum sub-agent nesting depth. 0=none, 1=flat, 2+=nested (B-thread classification).
total_tool_calls: Total tool calls performed across the entire session.
tokens_per_minute: Tokens consumed per minute. (input_tokens + output_tokens) / duration.
ai_written_lines: Total lines of code written by AI via Write, Edit, and MultiEdit tools in this session.
ai_line_bonus: Bonus added to density score. min(ai_written_lines / 50000, 1.0). Max +1.0 at 50K lines.

Score Ranges

0 → No agents
1~3 → Basic agent usage
4~6 → Active agent usage (team)
7~10 → Heavy agent usage (10 = max)

How to Improve

  • Use Team/Agent features to spawn multiple agents for complex tasks.
  • Request "organize a team for this" to get more agents working in parallel.
  • Separate code analysis → implementation → testing → review into individual agents.
  • More agents = higher score. 10 agents in one session = perfect score.
  • Write more code via Write/Edit tools to earn the AI-written lines bonus (up to +1.0).

FEWER (Trust) — Reduced Human Checkpoints

Max: 10

Human checkpoint reduction, trust level

Score Formula

effective_human = max(human_messages - trivial_delegations, 1)  // excludes trivial delegations (≤5 tool calls after human msg), floor 1
ratio_score = min(log1p(tool_calls / effective_human) * 2.0, 10)  // log1p(x) = ln(1+x), log-scaled ratio
ask_penalty = min(penalized_ask_ratio * 10.0, 3.0)  // only AskUserQuestion OUTSIDE plan mode (max -3 pts)
z_thread_score = max(ratio_score - ask_penalty, 0.0)  // final = ratio score - penalty (floor 0)

10/10: ~148+ tool calls per effective human message with no AskUser penalty. Trivial delegations like 'run tests' (≤5 tool calls) are excluded from human count.

Trivial Delegation Filter & Penalty System

Three-part formula: filter trivial delegations → base ratio (log scale) → minus penalty.

Uses log1p(x) = ln(1+x) for the ratio — same diminishing returns as L-thread.

Why ratio-only (no volume penalty):

Fewer measures the QUALITY of your instructions, not the AMOUNT of work.

20 tools / 1 human = ratio 20 is excellent agentic coding.

200 tools / 1 human = ratio 200 is even better. Volume is Thicker/Longer's job.

Step 1: Trivial Delegation Filter (NEW)

If a human message is followed by ≤ 5 tool calls before the next human message, it is classified as a 'trivial delegation' (e.g. 'run tests', 'build it').

These are NOT genuine checkpoints — they're simple convenience requests.

effective_human_count = human_messages - trivial_delegations (min 1).

Example: 3 human messages → tool counts per segment: [2, 40, 3]

→ Segments with ≤ 5 tools: 2 (trivial). Segment with 40: real work.

→ effective_human = 3 - 2 = 1. Only the 40-tool segment counts.

Step 2: Penalty targets ONLY AskUserQuestion outside Plan Mode:

penalized_ask_ratio = penalized_ask_count / total_tool_calls

ask_penalty = min(penalized_ask_ratio × 10, 3.0) → maximum -3 points

AskUserQuestion is classified into two contexts:

1. Inside Plan Mode (EnterPlanMode ~ ExitPlanMode):

→ No penalty! Clarifying requirements during planning is good practice.

2. Outside Plan Mode (during implementation):

→ Penalty applied. Asking users mid-implementation signals uncertainty.

Example: Plan Mode 3 questions + implementation 1 question

→ plan_mode_ask_user_count = 3 (no penalty)

→ penalized_ask_user_count = 1 (only this penalized)
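Putting the filter, ratio, and penalty together, a minimal Python sketch (segment_tool_counts holds the number of tool calls that follow each human message; all names are illustrative):

```python
import math

TRIVIAL_TOOL_THRESHOLD = 5  # ≤5 tool calls after a human message = trivial delegation

def z_thread_score(segment_tool_counts: list[int], penalized_ask_count: int) -> float:
    """Fewer (trust) score: filter trivial delegations, log-scale the ratio,
    subtract the out-of-plan-mode AskUserQuestion penalty."""
    human_messages = len(segment_tool_counts)
    total_tool_calls = sum(segment_tool_counts)
    trivial = sum(1 for n in segment_tool_counts if n <= TRIVIAL_TOOL_THRESHOLD)
    effective_human = max(human_messages - trivial, 1)
    ratio_score = min(math.log1p(total_tool_calls / effective_human) * 2.0, 10.0)
    ask_penalty = min(penalized_ask_count / max(total_tool_calls, 1) * 10.0, 3.0)
    return max(ratio_score - ask_penalty, 0.0)
```

For instance, z_thread_score([1, 2, 40], 0) reproduces the trivial-delegation example below: two segments are filtered, the ratio becomes 43/1, and the score is ≈ 7.57.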

Examples

Frequent back-and-forth (score: 1.4)

10 tool calls, 10 human messages (user asks after every step).

ratio = 10 / 10 = 1.0, no AskUser penalty.

z_thread_score = min(log1p(1.0) * 2.0, 10) - 0 = 1.39

Autonomous implementation (score: 6.1)

User: "Build the entire API module"

Claude makes 40 tool calls with only 2 human messages.

ratio = 40 / 2 = 20.0, no AskUser outside plan mode.

z_thread_score = min(log1p(20) * 2.0, 10) = min(6.09, 10) = 6.09

With AskUser penalty (score: 5.9)

50 tool calls, 2 human messages. But 3 AskUserQuestion outside plan mode.

ratio = 50 / 2 = 25.0 → ratio_score = min(log1p(25) * 2.0, 10) = 6.52

penalty: penalized_ask_ratio = 3/50 = 0.06 → ask_penalty = min(0.6, 3.0) = 0.6

z_thread_score = max(6.52 - 0.6, 0) = 5.92

Plan Mode exception (no penalty)

Claude enters Plan Mode, asks 3 clarifying questions, exits, then executes.

100 tool calls, 1 human message. 3 AskUser in plan mode + 0 outside.

ratio = 100 / 1 = 100 → ratio_score = 9.23, penalty = 0

z_thread_score = 9.23 (plan mode questions NOT penalized!)

Trivial delegation filter (score boost)

3 human messages, but 2 are trivial ('run tests' → 1 tool, 'build' → 2 tools).

Only 1 message triggered real work (40 tool calls). Total = 43 tools.

Without filter: ratio = 43/3 = 14.3 → score = 5.46

With filter: effective_human = 3 - 2 = 1, ratio = 43/1 = 43 → score = 7.57

z_thread_score = 7.57 (trivial delegations excluded from human count!)

Score Reference Table (ratio_score before penalty)

1 tool/human → 1.4
5 tools/human → 3.6
10 tools/human → 4.8
20 tools/human → 6.1
50 tools/human → 7.9
100 tools/human → 9.2
148 tools/human → 10.0

Metrics

tool_calls_per_human_message: Tool calls per effective human message. total_tool_calls / max(effective_human_count, 1). Excludes trivial delegations.
assistant_per_human_ratio: Assistant to human message ratio. assistant_count / max(human_count, 1)
ask_user_count: Total AskUserQuestion invocations (including plan mode).
plan_mode_ask_user_count: AskUserQuestion count inside Plan Mode (between EnterPlanMode ~ ExitPlanMode). No penalty.
penalized_ask_user_count: AskUserQuestion count outside Plan Mode. Only these are penalized.
autonomous_tool_call_pct: Percentage of tool calls excluding penalized AskUser. (1 - penalized/total) * 100
trivial_delegation_count: Human messages classified as trivial delegation (≤5 tool calls in following segment). Excluded from trust ratio.
effective_human_count: Human messages actually used in trust ratio. = human_messages - trivial_delegations (min 1).

Score Ranges

0~2 → Low trust (frequent intervention, few tool calls/human)
2~5 → Moderate level (~5-10 tool calls per human)
5~8 → High trust (~20-50 tool calls per human)
8~10 → Very high trust (Z-thread level, 100+ tools/human)

How to Improve

  • Give clear instructions once so Claude handles everything autonomously.
  • Write coding conventions, preferred patterns, and project structure in CLAUDE.md.
  • Asking questions in Plan Mode is fine — no penalty!
  • To reduce implementation questions, clarify requirements during the planning phase.
  • Pre-approve permissions (auto-accept) so execution continues without interruption.
  • Aim for Z-thread: automate entire feature implementation with a single command.

Thread Type Classification

Sessions are classified into one type using the following priority order (Z is highest):

Z-thread
human_messages <= 1 AND tool_calls >= 10

Zero-touch: Minimal human input, maximum autonomous work. Most evolved form.

B-thread
max_sub_agent_depth >= 2

Big: Sub-agents spawning sub-agents — nested execution.

L-thread
autonomous_stretch > 30min AND tool_calls > 50

Long: 30+ minutes of autonomous execution without human intervention.

F-thread
sub_agent_prompt_similarity > 70% (Jaccard)

Fusion: Similar tasks distributed to multiple agents (Map-Reduce pattern).

P-thread
max_concurrent_agents > 1

Parallel: 2+ sub-agents running concurrently.

C-thread
human_messages >= 3 AND each_gap_tool_calls >= 3

Chained: Human-AI conversation repeated in a chain pattern.

Base
None of the above conditions met

Default conversational session. Short Q&A or simple tasks.
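A minimal Python sketch of the priority cascade above (the SessionMetrics fields are illustrative; e.g. min_gap_tool_calls stands in for "every gap between human messages has ≥ 3 tool calls"):

```python
from dataclasses import dataclass

@dataclass
class SessionMetrics:
    """Field names mirror the metrics above; exact names in omas may differ."""
    human_messages: int
    tool_calls: int
    max_sub_agent_depth: int
    autonomous_stretch_minutes: float
    sub_agent_prompt_similarity: float  # Jaccard, 0.0~1.0
    max_concurrent_agents: int
    min_gap_tool_calls: int  # smallest tool-call count between human messages

def classify(m: SessionMetrics) -> str:
    """Priority-ordered classification (Z highest), per the table above."""
    if m.human_messages <= 1 and m.tool_calls >= 10:
        return "Z-thread"
    if m.max_sub_agent_depth >= 2:
        return "B-thread"
    if m.autonomous_stretch_minutes > 30 and m.tool_calls > 50:
        return "L-thread"
    if m.sub_agent_prompt_similarity > 0.70:
        return "F-thread"
    if m.max_concurrent_agents > 1:
        return "P-thread"
    if m.human_messages >= 3 and m.min_gap_tool_calls >= 3:
        return "C-thread"
    return "Base"
```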

Improvement Roadmap

Evolve in order: Base → C → P → L → B → Z. Key strategies for each step:

Base → C-thread

Continue conversations for 3+ turns, progressively building work. Ensure 3+ tool calls per turn.

C-thread → P-thread

Request 2+ independent tasks simultaneously. Explicitly say 'use Agent tool for parallel processing'.

P-thread → L-thread

Describe requirements in detail and reduce mid-work intervention. A thorough CLAUDE.md enables 30+ min autonomous runs.

L-thread → B-thread

Use 'organize a team' or 'work in a worktree' to encourage deep sub-agent trees.

B-thread → Z-thread

Ultimate goal: implement an entire feature with a single command. Auto-approve permissions + detailed project docs + clear single instruction.

Fair Comparison System

A system that filters and weights sessions for fair comparison. Prevents short test sessions or automation scripts from skewing overall scores.

Minimum Qualifying Thresholds

All criteria below must be met to be included in comparisons.

5 min → Min session duration (session_duration_minutes)
10 calls → Min tool call count (total_tool_calls)
1 msg → Min human message count (total_human_messages)

Weighted Scoring

Longer and more complex sessions receive proportionally higher weight.

weight(session) = log1p(total_tool_calls) * log1p(session_duration_minutes)

weighted_score = Σ(score_i * weight_i) / Σ(weight_i)

Consistency Score (0~10)

Measures consistency based on standard deviation of overall scores from the last 70 sessions.

consistency = max(0, min(10, 10 - std_dev * 3.33))

std_dev = 0 → 10.0 (perfect consistency) | std_dev ≥ 3 → ~0.0

Composite Rank Score

Final comparison rank score combining weighted score (80%) and consistency score (20%).

composite_rank = weighted_score * 0.8 + consistency * 0.2
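A minimal Python sketch of the fair-comparison pipeline (whether omas uses population or sample standard deviation is an assumption, noted in the code):

```python
import math
import statistics

def session_weight(total_tool_calls: int, session_duration_minutes: float) -> float:
    """Longer, more complex sessions get proportionally higher weight."""
    return math.log1p(total_tool_calls) * math.log1p(session_duration_minutes)

def weighted_score(scores: list[float], weights: list[float]) -> float:
    return sum(s * w for s, w in zip(scores, weights)) / max(sum(weights), 1e-9)

def consistency_score(recent_overall_scores: list[float]) -> float:
    """Std dev of the last 70 overall scores, mapped to 0~10."""
    if len(recent_overall_scores) < 2:
        return 10.0  # assumption: too few sessions counts as perfectly consistent
    std_dev = statistics.pstdev(recent_overall_scores)  # population vs sample is an assumption
    return max(0.0, min(10.0, 10.0 - std_dev * 3.33))

def composite_rank(weighted: float, consistency: float) -> float:
    return weighted * 0.8 + consistency * 0.2
```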

Data Source

Claude Code JSONL session logs: ~/.claude/projects/<hash>/<session>.jsonl

Sub-agent logs: <session-dir>/subagents/agent-<id>.jsonl

Use `omas scan` to scan all sessions and `omas export` to generate JSON.