// HOW SCORES ARE CALCULATED
Scoring Guide
Thread-based engineering metrics across 4 dimensions
OVERALL COMPOSITE SCORE
overall = (p_thread_score + l_thread_score + b_thread_score + z_thread_score) / 4
Each dimension score is 0-10 (capped). The overall score is the simple average. All four *_thread_score values are used directly — no additional normalization.
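A quick way to sanity-check the averaging (the four dimension scores below are hypothetical):

```python
def overall_score(p: float, l: float, b: float, z: float) -> float:
    """Simple average of the four dimension scores, each already capped at 10."""
    return (p + l + b + z) / 4

# Hypothetical dimension scores for one session
print(overall_score(3.0, 6.88, 5.0, 6.09))  # → 5.2425
```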
MORE (P-threads) — Parallelism
Max: 10
Concurrent sessions running simultaneously
Score Formula
p_thread_score = min(concurrent_sessions, 10.0)  // number of concurrent sessions, capped at 10
10/10: 10 or more Claude Code sessions running at the same time. Direct value — no log scale.
Cross-Session Sweep-Line Algorithm
Measures how many Claude Code sessions run simultaneously.
True parallelism means running multiple terminal sessions in parallel — each on an independent task.
This is different from sub-agent parallelism within a session (which is measured by Thicker).
How concurrent_sessions is computed (sweep-line algorithm):
1. Gather all sessions across all projects
2. Create events: +1 at each session start, -1 at each session end
3. Sort events chronologically and sweep to build a concurrency timeline
4. For each session, find the peak concurrent count during its active window
This avoids over-counting from pairwise overlap.
Example: A long session overlapping with 3 short non-concurrent sessions → peak 2, not 4.
This is a direct value, NOT log-scaled. Score equals the raw count, capped at 10.
Note: P-thread is computed during `omas scan` (which sees all sessions).
`omas analyze` (single session) defaults to P-thread = 1.
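The steps above can be sketched as a small Python function (an illustration of the sweep-line idea, not the actual omas implementation; timestamps are plain numbers here):

```python
def peak_concurrency(sessions):
    """sessions: list of (start, end) tuples. Returns each session's peak
    concurrent count, computed with a sweep line over start/end events."""
    # +1 at each start, -1 at each end; at equal times, ends sort before
    # starts so back-to-back sessions don't count as concurrent.
    events = []
    for start, end in sessions:
        events.append((start, 1))
        events.append((end, -1))
    events.sort(key=lambda e: (e[0], e[1]))

    # Build a timeline of (time, active_count) after each event
    timeline, active = [], 0
    for t, delta in events:
        active += delta
        timeline.append((t, active))

    # For each session, the peak count observed during its active window
    peaks = []
    for start, end in sessions:
        peak = max((c for t, c in timeline if start <= t < end), default=1)
        peaks.append(peak)
    return peaks

# One long session (0-100) overlapping three short, non-overlapping sessions:
sessions = [(0, 100), (10, 20), (30, 40), (50, 60)]
print(max(peak_concurrency(sessions)))  # peak is 2, not 4
```

This reproduces the example above: pairwise overlap would report 4, but the sweep line sees at most 2 sessions active at any instant.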
Examples
Single session (score: 1)
Developer runs one Claude Code session at a time.
concurrent_sessions = 1
p_thread_score = min(1, 10) = 1
Three parallel terminals (score: 3)
Developer opens 3 terminals, each running Claude Code on different tasks.
All 3 sessions overlap in time. concurrent_sessions = 3
p_thread_score = min(3, 10) = 3
Full parallel workflow (score: 5)
Developer runs 5 Claude Code sessions simultaneously across multiple projects.
concurrent_sessions = 5
p_thread_score = min(5, 10) = 5
Metrics
concurrent_sessions: Peak number of Claude Code sessions running at the same time. Computed via cross-session sweep-line over all session time ranges.
Score Ranges
How to Improve
- Open multiple terminals and run Claude Code sessions in parallel on independent tasks.
- Decompose large features into independent sub-tasks and work on them in separate sessions.
- Use tmux or terminal splits to manage multiple concurrent Claude Code sessions.
- Each session should focus on a different module, file, or concern for true independence.
- Within-session agent concurrency (Agent tool) now contributes to Thicker, not More.
LONGER (L-threads) — Autonomy
Max: 10
Autonomous execution time without human intervention
Score Formula
l_thread_score = min(log1p(longest_stretch_minutes) * 2.0, 10)  // log1p(x) = ln(1+x)
10/10: ~148 minutes (≈2h 28m) of continuous autonomous work. Derived from ln(1+148) × 2 ≈ 10.0.
Activity-Based Measurement
Uses log scale: log1p(x) = ln(1+x). This compresses large values so diminishing returns apply.
Why log1p instead of ln? → ln(0) = -∞ (crashes), but ln(1+0) = 0 (safe for zero input).
Key: Measures from human message to Claude's last activity (tool call).
Measures only Claude's actual working time, not until the next human message.
Example: Human(10:00) → Claude works → last tool(10:05) → [idle] → Human(12:00)
Without activity-based measurement: incorrectly 120 min. OMAS: correctly 5 min.
Segments: (1) before first human, (2) between humans, (3) after last human — max of each.
Idle Gap Capping (v0.6.0+):
Gaps > 30 min (IDLE_GAP_THRESHOLD) between consecutive activities are capped at 30 min.
This prevents idle periods (e.g. permission prompt left unanswered) from inflating the stretch.
Example: Tool(10:00)→Tool(10:05)→[3h idle]→Tool(13:05)→Tool(13:10)
Without capping: 190 min → 10.0 (inflated!)
With capping: 5 + 30(cap) + 5 = 40 min → 7.4 (accurate)
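The capping rule and the L-thread formula together look roughly like this (a sketch of the arithmetic described above, not the omas source; activity timestamps are given in minutes from the first tool call):

```python
import math

IDLE_GAP_THRESHOLD = 30.0  # minutes; gaps longer than this are capped

def capped_stretch(activity_times):
    """Sum the gaps between consecutive activities in one autonomous
    segment, capping each gap at IDLE_GAP_THRESHOLD minutes."""
    total = 0.0
    for prev, cur in zip(activity_times, activity_times[1:]):
        total += min(cur - prev, IDLE_GAP_THRESHOLD)
    return total

def l_thread_score(longest_stretch_minutes):
    """Log-scaled autonomy score, capped at 10."""
    return min(math.log1p(longest_stretch_minutes) * 2.0, 10.0)

# Tool(10:00) -> Tool(10:05) -> [3h idle] -> Tool(13:05) -> Tool(13:10),
# expressed as minutes from the first activity:
stretch = capped_stretch([0, 5, 185, 190])
print(stretch)                            # 40.0 (5 + 30 capped + 5)
print(round(l_thread_score(stretch), 1))  # 7.4
```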
Examples
Quick Q&A (score: 0)
User asks, Claude responds immediately. No tool calls between messages.
longest_autonomous_stretch = 0 min
l_thread_score = min(log1p(0) * 2.0, 10) = 0.0
5-minute autonomous work (score: 3.6)
User: "Fix the login bug"
Claude works for 5 minutes: Grep → Read → Edit → Bash(test) → done.
longest_autonomous_stretch = 5 min
l_thread_score = min(log1p(5) * 2.0, 10) = min(3.58, 10) = 3.58
30-minute feature implementation (score: 6.9)
User: "Implement the entire auth module with tests and docs"
Claude works for 30 min straight: 47 tool calls, no human intervention.
longest_autonomous_stretch = 30 min
l_thread_score = min(log1p(30) * 2.0, 10) = min(6.88, 10) = 6.88
Score Reference Table
Metrics
longest_autonomous_stretch_minutes: Max time (minutes) between a human message and Claude's last activity. Activity-based measurement excludes idle time.
max_tool_calls_between_human: Maximum number of tool calls between human messages.
session_duration_minutes: Total session length (first to last timestamp).
max_consecutive_assistant_turns: Maximum number of consecutive assistant messages.
Score Ranges
How to Improve
- Give clear and specific instructions at once so Claude runs autonomously longer.
- Don't break large tasks into small pieces — deliver all requirements at once.
- "Refactor this entire module. Write tests and create a PR too."
- Trust Claude to make its own decisions without interrupting mid-work.
- Write detailed project conventions in CLAUDE.md so Claude works longer without questions.
THICKER (B-threads) — Density
Max: 10
Sub-agent scale and nesting depth
Score Formula
b_thread_score = min(total_agents, 10.0)  // total agents (team + sub), capped at 10. Linear scale.
line_bonus = min(ai_written_lines / 50000, 1.0)  // AI-written lines bonus: linear, max +1.0 at 50K lines
b_norm = min(b_thread_score + line_bonus, 10.0)  // density score with line bonus applied
10/10: 10 or more agents (team + sub) in a single session. Direct value — no log scale, no depth multiplier.
Total Agent Count & AI Lines Bonus
Linear scale: b_thread_score = min(total_agents, 10). Same approach as P-thread (concurrent sessions).
total_agents = all agents spawned in the session (team agents + sub-agents).
No depth multiplier — one nested sub-agent shouldn't double the entire score.
No orchestration breadth — within-session concurrency is just part of using agents.
Depth is still tracked for B-thread classification (depth 2+ = B-thread), but it does NOT affect the score calculation.
AI Written Lines Bonus:
Counts lines written via Write (content), Edit (new_string), MultiEdit (edits[].new_string).
Linear bonus: 0 lines → +0.0, 5K lines → +0.1, 10K → +0.2, 50K+ → +1.0 (capped).
This is per-session, not cumulative. Typical sessions earn a bonus between +0.0 and +0.2.
Impact on overall score: max +0.25 (since 4 dimensions are averaged).
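Putting the formula and the bonus together (a sketch of the arithmetic described above, not the omas source):

```python
def density_score(total_agents: int, ai_written_lines: int) -> float:
    """b_norm = min(min(total_agents, 10) + min(lines / 50000, 1.0), 10)."""
    b_thread_score = min(total_agents, 10.0)
    line_bonus = min(ai_written_lines / 50_000, 1.0)
    return min(b_thread_score + line_bonus, 10.0)

print(density_score(3, 0))        # 3.0  (3 agents, no lines bonus)
print(density_score(3, 10_000))   # 3.2  (+0.2 bonus at 10K lines)
print(density_score(10, 50_000))  # 10.0 (already at cap; bonus cannot exceed it)
```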
Examples
No agents (score: 0)
Simple session with direct tool calls only.
total_agents = 0
b_thread_score = min(0, 10) = 0
3 agents (score: 3)
Claude spawns 3 Explore agents for research.
total_agents = 3
b_thread_score = min(3, 10) = 3
Team of 5 agents (score: 5)
Claude organizes a team of 5 agents for a feature implementation.
total_agents = 5
b_thread_score = min(5, 10) = 5
10+ agents (score: 10)
Large-scale team with 10 or more agents (team + sub-agents).
total_agents = 10
b_thread_score = min(10, 10) = 10 (max!)
Metrics
tool_calls_per_minute: Tool calls per minute. total_tool_calls / max(duration_minutes, 0.1)
max_sub_agent_depth: Maximum sub-agent nesting depth. 0 = none, 1 = flat, 2+ = nested (B-thread classification).
total_tool_calls: Total tool calls performed across the entire session.
tokens_per_minute: Tokens consumed per minute. (input_tokens + output_tokens) / duration.
ai_written_lines: Total lines of code written by the AI via the Write, Edit, and MultiEdit tools in this session.
ai_line_bonus: Bonus added to the density score. min(ai_written_lines / 50000, 1.0). Max +1.0 at 50K lines.
Score Ranges
How to Improve
- Use Team/Agent features to spawn multiple agents for complex tasks.
- Request "organize a team for this" to get more agents working in parallel.
- Separate code analysis → implementation → testing → review into individual agents.
- More agents = higher score. 10 agents in one session = perfect score.
- Write more code via Write/Edit tools to earn the AI-written lines bonus (up to +1.0).
FEWER (Trust) — Reduced Human Checkpoints
Max: 10
Human checkpoint reduction, trust level
Score Formula
effective_human = human_messages - trivial_delegations  // excludes trivial delegations (≤5 tool calls after human msg)
ratio_score = min(log1p(tool_calls / effective_human) * 2.0, 10)  // log1p(x) = ln(1+x), log-scaled ratio
ask_penalty = min(penalized_ask_ratio * 10.0, 3.0)  // only AskUserQuestion OUTSIDE plan mode (max -3 pts)
z_thread_score = max(ratio_score - ask_penalty, 0.0)  // final = ratio score - penalty (floor 0)
10/10: ~148+ tool calls per effective human message with no AskUser penalty. Trivial delegations like 'run tests' (≤5 tool calls) are excluded from the human count.
Trivial Delegation Filter & Penalty System
Three-part formula: filter trivial delegations → base ratio (log scale) → minus penalty.
Uses log1p(x) = ln(1+x) for the ratio — same diminishing returns as L-thread.
Why ratio-only (no volume penalty):
Fewer measures the QUALITY of your instructions, not the AMOUNT of work.
20 tools / 1 human = ratio 20 is excellent agentic coding.
200 tools / 1 human = ratio 200 is even better. Volume is Thicker/Longer's job.
Step 1: Trivial Delegation Filter (NEW)
If a human message is followed by ≤ 5 tool calls before the next human message,
it is classified as a 'trivial delegation' (e.g. 'run tests', 'build it').
These are NOT genuine checkpoints — they're simple convenience requests.
effective_human_count = human_messages - trivial_delegations (min 1).
Example: 3 human messages → tool counts per segment: [2, 40, 3]
→ Segments with ≤ 5 tools: 2 (trivial). Segment with 40: real work.
→ effective_human = 3 - 2 = 1. Only the 40-tool segment counts.
Step 2: Penalty targets ONLY AskUserQuestion outside Plan Mode:
penalized_ask_ratio = penalized_ask_count / total_tool_calls
ask_penalty = min(penalized_ask_ratio × 10, 3.0) → maximum -3 points
AskUserQuestion is classified into two contexts:
1. Inside Plan Mode (EnterPlanMode ~ ExitPlanMode):
→ No penalty! Clarifying requirements during planning is good practice.
2. Outside Plan Mode (during implementation):
→ Penalty applied. Asking users mid-implementation signals uncertainty.
Example: Plan Mode 3 questions + implementation 1 question
→ plan_mode_ask_user_count = 3 (no penalty)
→ penalized_ask_user_count = 1 (only this penalized)
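The three steps (trivial-delegation filter, log-scaled ratio, ask penalty) can be combined into one function (a sketch; representing the session as per-segment tool counts is an assumption, not the omas input format):

```python
import math

def fewer_score(segment_tool_counts, penalized_ask_count):
    """segment_tool_counts: tool calls after each human message.
    Segments with <= 5 tool calls are trivial delegations."""
    human_messages = len(segment_tool_counts)
    trivial = sum(1 for n in segment_tool_counts if n <= 5)
    effective_human = max(human_messages - trivial, 1)

    total_tool_calls = sum(segment_tool_counts)
    ratio_score = min(math.log1p(total_tool_calls / effective_human) * 2.0, 10.0)

    # Penalty only counts AskUserQuestion outside Plan Mode, capped at -3
    penalized_ratio = penalized_ask_count / max(total_tool_calls, 1)
    ask_penalty = min(penalized_ratio * 10.0, 3.0)
    return max(ratio_score - ask_penalty, 0.0)

# Filter example from above: segments [2, 40, 3] -> two trivial, effective_human = 1
print(round(fewer_score([2, 40, 3], 0), 2))

# Penalty example: 50 tools over 2 real messages, 3 penalized asks
print(round(fewer_score([25, 25], 3), 2))  # → 5.92
```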
Examples
Frequent back-and-forth (score: 1.4)
10 tool calls, 10 human messages (user asks after every step).
ratio = 10 / 10 = 1.0, no AskUser penalty.
z_thread_score = min(log1p(1.0) * 2.0, 10) - 0 = 1.39
Autonomous implementation (score: 6.1)
User: "Build the entire API module"
Claude makes 40 tool calls with only 2 human messages.
ratio = 40 / 2 = 20.0, no AskUser outside plan mode.
z_thread_score = min(log1p(20) * 2.0, 10) = min(6.09, 10) = 6.09
With AskUser penalty (score: 5.9)
50 tool calls, 2 human messages. But 3 AskUserQuestion outside plan mode.
ratio = 50 / 2 = 25.0 → ratio_score = min(log1p(25) * 2.0, 10) = 6.52
penalty: penalized_ask_ratio = 3/50 = 0.06 → ask_penalty = min(0.6, 3.0) = 0.6
z_thread_score = max(6.52 - 0.6, 0) = 5.92
Plan Mode exception (no penalty)
Claude enters Plan Mode, asks 3 clarifying questions, exits, then executes.
100 tool calls, 1 human message. 3 AskUser in plan mode + 0 outside.
ratio = 100 / 1 = 100 → ratio_score = 9.23, penalty = 0
z_thread_score = 9.23 (plan mode questions NOT penalized!)
Trivial delegation filter (score boost)
3 human messages, but 2 are trivial ('run tests' → 1 tool, 'build' → 2 tools).
Only 1 message triggered real work (40 tool calls). Total = 43 tools.
Without filter: ratio = 43/3 ≈ 14.3 → score = 5.46
With filter: effective_human = 3 - 2 = 1, ratio = 43/1 = 43 → score = 7.57
z_thread_score = 7.57 (trivial delegations excluded from the human count!)
Score Reference Table (ratio_score before penalty)
Metrics
tool_calls_per_human_message: Tool calls per effective human message. total_tool_calls / max(effective_human_count, 1). Excludes trivial delegations.
assistant_per_human_ratio: Assistant-to-human message ratio. assistant_count / max(human_count, 1)
ask_user_count: Total AskUserQuestion invocations (including plan mode).
plan_mode_ask_user_count: AskUserQuestion count inside Plan Mode (between EnterPlanMode and ExitPlanMode). No penalty.
penalized_ask_user_count: AskUserQuestion count outside Plan Mode. Only these are penalized.
autonomous_tool_call_pct: Percentage of tool calls excluding penalized AskUser. (1 - penalized/total) * 100
trivial_delegation_count: Human messages classified as trivial delegations (≤5 tool calls in the following segment). Excluded from the trust ratio.
effective_human_count: Human messages actually used in the trust ratio. = human_messages - trivial_delegations (min 1).
Score Ranges
How to Improve
- Give clear instructions once so Claude handles everything autonomously.
- Write coding conventions, preferred patterns, and project structure in CLAUDE.md.
- Asking questions in Plan Mode is fine — no penalty!
- To reduce implementation questions, clarify requirements during the planning phase.
- Pre-approve permissions (auto-accept) so execution continues without interruption.
- Aim for Z-thread: automate entire feature implementation with a single command.
Thread Type Classification
Sessions are classified into one type using the following priority order (Z is highest):
- Zero-touch (Z): human_messages <= 1 AND tool_calls >= 10. Minimal human input, maximum autonomous work. The most evolved form.
- Big (B): max_sub_agent_depth >= 2. Sub-agents spawning sub-agents — nested execution.
- Long (L): autonomous_stretch > 30 min AND tool_calls > 50. 30+ minutes of autonomous execution without human intervention.
- Fusion: sub_agent_prompt_similarity > 70% (Jaccard). Similar tasks distributed to multiple agents (Map-Reduce pattern).
- Parallel (P): max_concurrent_agents > 1. 2+ sub-agents running concurrently.
- Chained (C): human_messages >= 3 AND each_gap_tool_calls >= 3. Human-AI conversation repeated in a chain pattern.
- Base: none of the above conditions met. Default conversational session; short Q&A or simple tasks.
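The priority order above can be expressed as a simple cascade (illustrative only; the dict keys are assumptions based on the metric names used in this guide):

```python
def classify_thread(m: dict) -> str:
    """Classify a session by checking conditions in priority order (Z highest)."""
    if m["human_messages"] <= 1 and m["tool_calls"] >= 10:
        return "Zero-touch"
    if m["max_sub_agent_depth"] >= 2:
        return "Big"
    if m["autonomous_stretch_minutes"] > 30 and m["tool_calls"] > 50:
        return "Long"
    if m["sub_agent_prompt_similarity"] > 0.70:  # Jaccard similarity
        return "Fusion"
    if m["max_concurrent_agents"] > 1:
        return "Parallel"
    if m["human_messages"] >= 3 and m["each_gap_tool_calls"] >= 3:
        return "Chained"
    return "Base"

# A hypothetical session: one human message, 47 tool calls
session = {"human_messages": 1, "tool_calls": 47, "max_sub_agent_depth": 1,
           "autonomous_stretch_minutes": 12, "sub_agent_prompt_similarity": 0.2,
           "max_concurrent_agents": 1, "each_gap_tool_calls": 0}
print(classify_thread(session))  # → Zero-touch
```

Because Z is checked first, a nested-agent session with only one human message still classifies as Zero-touch, not Big.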
Improvement Roadmap
Evolve in order: Base → C → P → L → B → Z. Key strategies for each step:
- Base → C (Chained): Continue conversations for 3+ turns, progressively building work. Ensure 3+ tool calls per turn.
- C → P (Parallel): Request 2+ independent tasks simultaneously. Explicitly say 'use Agent tool for parallel processing'.
- P → L (Long): Describe requirements in detail and reduce mid-work intervention. A thorough CLAUDE.md enables 30+ min autonomous runs.
- L → B (Big): Use 'organize a team' or 'work in a worktree' to encourage deep sub-agent trees.
- B → Z (Zero-touch): The ultimate goal: implement an entire feature with a single command. Auto-approve permissions + detailed project docs + a clear single instruction.
Fair Comparison System
A system that filters and weights sessions for fair comparison. Prevents short test sessions or automation scripts from skewing overall scores.
Minimum Qualifying Thresholds
All criteria below must be met to be included in comparisons.
Weighted Scoring
Longer and more complex sessions receive proportionally higher weight.
weight(session) = log1p(total_tool_calls) * log1p(session_duration_minutes)
weighted_score = Σ(score_i * weight_i) / Σ(weight_i)
Consistency Score (0–10)
Measures consistency based on standard deviation of overall scores from the last 70 sessions.
consistency = max(0, min(10, 10 - std_dev * 3.33))
std_dev = 0 → 10.0 (perfect consistency) | std_dev ≥ 3 → ~0.0
Composite Rank Score
Final comparison rank score combining weighted score (80%) and consistency score (20%).
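The whole fair-comparison pipeline, as described, reduces to a few formulas (a sketch; whether omas uses population or sample standard deviation is not stated, so population is assumed here, and the session values are hypothetical):

```python
import math
import statistics

def session_weight(total_tool_calls: int, duration_minutes: float) -> float:
    """Longer, more complex sessions weigh more: log1p(tools) * log1p(minutes)."""
    return math.log1p(total_tool_calls) * math.log1p(duration_minutes)

def weighted_score(scores, weights) -> float:
    """Weighted average: sum(score_i * weight_i) / sum(weight_i)."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

def consistency_score(overall_scores) -> float:
    """10 at zero variance, approaching 0 once std dev reaches 3."""
    std_dev = statistics.pstdev(overall_scores)  # population std dev assumed
    return max(0.0, min(10.0, 10.0 - std_dev * 3.33))

def composite_rank(weighted: float, consistency: float) -> float:
    """80% weighted score, 20% consistency."""
    return 0.8 * weighted + 0.2 * consistency

# Three hypothetical sessions: (overall score, tool calls, duration in minutes)
sessions = [(5.0, 10, 5), (6.0, 40, 30), (7.0, 120, 90)]
scores = [s for s, _, _ in sessions]
weights = [session_weight(tc, dm) for _, tc, dm in sessions]
print(round(composite_rank(weighted_score(scores, weights),
                           consistency_score(scores)), 2))
```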
Data Source
Claude Code JSONL session logs: ~/.claude/projects/<hash>/<session>.jsonl
Sub-agent logs: <session-dir>/subagents/agent-<id>.jsonl
`omas scan` to scan all sessions → `omas export` to generate JSON