TechLens Signal Guide

Auto-generated from signal definitions. Run python scripts/generate_signal_docs.py to regenerate.

Pipeline Overview

The FULL_ANALYSIS pipeline runs 13 stages in order. Dashed borders indicate SLM-dependent stages that are skipped when no local model is available.

graph TB
  subgraph "File Discovery"
    direction LR
    S0["filter_files<br/>Walk repo, select source files"]
    S1["create_manager<br/>Instantiate per-file score accumulator"]
  end
  subgraph "Signal Collection"
    direction LR
    S2["git_density<br/>Git density, author count, co-change"]
    S3["import_graph<br/>Import indegree analysis"]
    S4["code_metrics<br/>Control flow, definitions, comments, glue, domain tokens"]
    S5["token_metrics<br/>Cyclomatic complexity, operands, operators, fanout"]
    S6["test_coverage<br/>Map test files to source files"]
    S7["derived_metrics<br/>Halstead, maintainability index, TIOBE"]
    S8["moat_signals<br/>API surface, data accumulation, realtime, governance, regulatory"]
  end
  subgraph "Scoring"
    direction LR
    S9["composite_scores<br/>Weighted composite scoring + top-files ranking"]
  end
  subgraph "SLM (optional)"
    direction LR
    S10["lm_code_value<br/>SLM: uniqueness, difficulty, quality"]
    S11["file_descriptions<br/>SLM: per-file descriptions + program synthesis"]
    S12["slm_post_filter<br/>SLM: outlier queries (code value + glue assessment)"]
  end
  S0 --> S1
  S1 --> S2
  S2 --> S3
  S3 --> S4
  S4 --> S5
  S5 --> S6
  S6 --> S7
  S7 --> S8
  S8 --> S9
  S9 --> S10
  S10 --> S11
  S11 --> S12
  style S10 stroke-dasharray: 5 5
  style S11 stroke-dasharray: 5 5
  style S12 stroke-dasharray: 5 5
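
The skip behaviour described above can be sketched in a few lines of Python. This is only an illustration: the stage names come from the diagram, but the STAGES table and the plan_stages helper are hypothetical and not taken from the TechLens code base.

```python
from typing import List, Tuple

# Stage names follow the diagram; the flag marks SLM-dependent stages.
# The table layout itself is an assumption made for this sketch.
STAGES: List[Tuple[str, bool]] = [
    ("filter_files", False),
    ("create_manager", False),
    ("git_density", False),
    ("import_graph", False),
    ("code_metrics", False),
    ("token_metrics", False),
    ("test_coverage", False),
    ("derived_metrics", False),
    ("moat_signals", False),
    ("composite_scores", False),
    ("lm_code_value", True),
    ("file_descriptions", True),
    ("slm_post_filter", True),
]


def plan_stages(slm_available: bool) -> List[str]:
    """Return the stage names that would run, in pipeline order."""
    return [name for name, needs_slm in STAGES if slm_available or not needs_slm]


if __name__ == "__main__":
    # Without a local model, only the non-SLM stages remain.
    print(plan_stages(slm_available=False))
```

Because the SLM stages sit at the end of the chain, skipping them leaves the earlier stages and the composite scoring untouched.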

Signal → Composite Score

Signals feed into composite scores with weighted contributions. Dashed arrows indicate inverted signals (lower raw value = higher contribution). ⚛ marks signals requiring a local SLM.

Legend: architecture, git, heuristic, slm, static_analysis
graph LR
  subgraph " "
    direction LR
    control_flow_density["control_flow_density"]:::heuristic
    definition_density["definition_density"]:::heuristic
    comment_ratio["comment_ratio"]:::heuristic
    test_coverage_boosted["test_coverage_boosted"]:::architecture
    duplication_ratio["duplication_ratio"]:::heuristic
  end
  engineering_score(["engineering_score"]):::composite
  subgraph " "
    direction LR
    lm_uniqueness["lm_uniqueness ⚛"]:::slm
    lm_complexity["lm_complexity ⚛"]:::slm
    lm_quality["lm_quality ⚛"]:::slm
  end
  lm_code_value_score(["lm_code_value_score"]):::composite
  subgraph " "
    direction LR
    domain_match_density["domain_match_density"]:::heuristic
    gluemarker_ratio["gluemarker_ratio"]:::heuristic
    external_call_ratio["external_call_ratio"]:::heuristic
    api_surface_ratio["api_surface_ratio"]:::architecture
    realtime_pattern_density["realtime_pattern_density"]:::architecture
    data_accumulation_pattern["data_accumulation_pattern"]:::architecture
    governance_workflow_density["governance_workflow_density"]:::architecture
    regulatory_token_density["regulatory_token_density"]:::architecture
  end
  moat_score(["moat_score"]):::composite
  control_flow_density -->|15%| engineering_score
  definition_density -->|15%| engineering_score
  comment_ratio -->|15%| engineering_score
  test_coverage_boosted -->|30%| engineering_score
  duplication_ratio -.->|25% inv| engineering_score
  lm_uniqueness -->|33%| lm_code_value_score
  lm_complexity -->|33%| lm_code_value_score
  lm_quality -->|33%| lm_code_value_score
  domain_match_density -->|20%| moat_score
  gluemarker_ratio -.->|15% inv| moat_score
  external_call_ratio -.->|12% inv| moat_score
  api_surface_ratio -->|6%| moat_score
  realtime_pattern_density -->|6%| moat_score
  data_accumulation_pattern -->|10%| moat_score
  governance_workflow_density -->|6%| moat_score
  regulatory_token_density -->|12%| moat_score
  classDef git fill:#4A90D9,color:#fff,stroke:#333
  classDef heuristic fill:#50C878,color:#fff,stroke:#333
  classDef architecture fill:#FF8C42,color:#fff,stroke:#333
  classDef static_analysis fill:#9B59B6,color:#fff,stroke:#333
  classDef slm fill:#E74C3C,color:#fff,stroke:#333
  classDef composite fill:#2C3E50,color:#fff,stroke:#fff,font-size:16px
    

Composite Weight Distribution

pie title engineering_score
  "control_flow_density" : 15
  "definition_density" : 15
  "comment_ratio" : 15
  "test_coverage_boosted" : 30
  "duplication_ratio (inv)" : 25

pie title lm_code_value_score
  "lm_uniqueness" : 33
  "lm_complexity" : 33
  "lm_quality" : 33

pie title moat_score
  "domain_match_density" : 20
  "gluemarker_ratio (inv)" : 15
  "external_call_ratio (inv)" : 12
  "api_surface_ratio" : 6
  "realtime_pattern_density" : 6
  "data_accumulation_pattern" : 10
  "governance_workflow_density" : 6
  "regulatory_token_density" : 12
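
The listed weights do not all total 100 (as listed, moat_score adds up to 87 and lm_code_value_score to 99). Since the final score is divided by the total weight of the signals that were available, a signal's effective share is its weight divided by that total. The sketch below works this out; the WEIGHTS table is copied from the charts, while effective_shares is a hypothetical helper written for this example.

```python
from typing import Dict

# Percent weights copied from the pie charts above.
WEIGHTS: Dict[str, Dict[str, float]] = {
    "engineering_score": {
        "control_flow_density": 15,
        "definition_density": 15,
        "comment_ratio": 15,
        "test_coverage_boosted": 30,
        "duplication_ratio": 25,
    },
    "lm_code_value_score": {
        "lm_uniqueness": 33,
        "lm_complexity": 33,
        "lm_quality": 33,
    },
    "moat_score": {
        "domain_match_density": 20,
        "gluemarker_ratio": 15,
        "external_call_ratio": 12,
        "api_surface_ratio": 6,
        "realtime_pattern_density": 6,
        "data_accumulation_pattern": 10,
        "governance_workflow_density": 6,
        "regulatory_token_density": 12,
    },
}


def effective_shares(weights: Dict[str, float]) -> Dict[str, float]:
    """Divide each weight by the total so the shares sum to 1.0."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


if __name__ == "__main__":
    for composite, weights in WEIGHTS.items():
        shares = effective_shares(weights)
        listed_total = sum(weights.values())
        print(composite, "listed total:", listed_total)
        for name, share in shares.items():
            print(f"  {name}: {share:.3f}")
```

The same helper can be applied to any subset of signals to see how the shares shift when some values are missing.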

Signal Glossary

Per-file signals produce one value per source file; repo-level signals produce a single value for the entire repository.

Per-file (26)

  • git_density (git)
    Normalized change activity score from git history
    Answers the four most common workshop questions at once: Where is active development focused? Who are the critical developers? What happens if they leave? Which components are secretly coupled? These findings directly influence retention clauses, earnout structures, and integration timelines.
  • author_count (git)
    Normalized count of distinct commit authors per file
    Answers the four most common workshop questions at once: Where is active development focused? Who are the critical developers? What happens if they leave? Which components are secretly coupled? These findings directly influence retention clauses, earnout structures, and integration timelines.
  • cochange (git)
    Normalized co-change coupling frequency per file
    Answers the four most common workshop questions at once: Where is active development focused? Who are the critical developers? What happens if they leave? Which components are secretly coupled? These findings directly influence retention clauses, earnout structures, and integration timelines.
  • import_indegree (architecture)
    Normalized count of internal imports pointing to this file
    Identifies architectural bottlenecks and single points of failure. A file imported by dozens of others is both the most valuable and the most dangerous — any bug there cascades across the entire system, and it cannot be replaced without coordinated refactoring.
  • comment_density_hot (heuristic)
    Per-file interaction signal: comment ratio * git density
    Documentation culture is a leading indicator of post-close maintenance risk. Near-zero comments + high change frequency = institutional knowledge trapped in developers' heads. Directly affects knowledge transfer timelines and key-person retention priorities.
  • test_coverage (architecture)
    Binary 1.0/0.0 per file indicating presence of an associated test file
    Reveals engineering discipline. Low test coverage tells an investor the team ships without a safety net. The absence of tests is one of the most common findings that changes rebuild estimates and post-close stabilization budgets.
  • test_coverage_boosted (architecture)
    Churn-weighted boost for tested files (test_coverage * git_density)
    Reveals engineering discipline. Low test coverage tells an investor the team ships without a safety net. The absence of tests is one of the most common findings that changes rebuild estimates and post-close stabilization budgets.
  • slm_value_score (slm)
    SLM-assessed value score per individual source file
    Turns 'Is this code worth acquiring?' into a scored, benchmarkable answer. Per-file IP identification gives workshops concrete material to work from, and AI-assessed complexity supports rebuild estimation.
  • cyclomatic_complexity (static_analysis)
    McCabe cyclomatic complexity per file
    Measures the number of independent execution paths through the code. High cyclomatic complexity means more test cases needed for full coverage and higher probability of latent defects — directly impacting post-close stabilization effort.
  • operands_sum (static_analysis)
    Total operand token count per file
    Raw input for Halstead metrics. Total operand count reflects data volume flowing through the code — higher counts indicate files doing substantive data transformation rather than simple pass-through.
  • operands_unique (static_analysis)
    Unique operand token count per file
    Vocabulary breadth of data elements. A file with many unique operands is manipulating a rich data model, suggesting domain-specific logic that is harder to replicate or replace with generic tooling.
  • operators_sum (static_analysis)
    Total operator token count per file
    Raw input for Halstead metrics. High operator counts relative to operands indicate dense control logic and data manipulation, contributing to implementation effort estimates.
  • operators_unique (static_analysis)
    Unique operator token count per file
    Indicates the breadth of language features and constructs used. Files leveraging more unique operators tend to implement more sophisticated algorithms that require deeper expertise to maintain.
  • definitions_count (static_analysis)
    Unique function and class definition count per file
    Counts distinct function and class definitions via Pygments tokens. High definition counts indicate files that declare substantial API surface — more definitions mean more contracts to maintain, test, and document.
  • halstead_volume (static_analysis)
    Halstead program volume
    Measures the information content of the code in bits. High volume files contain more logic to understand, test, and maintain — directly proportional to the knowledge transfer effort required post-acquisition.
  • halstead_difficulty (static_analysis)
    Halstead program difficulty
    Quantifies how hard the code is to write or understand. High difficulty files are error-prone to modify and require senior engineers — a factor in retention planning and post-close staffing decisions.
  • halstead_effort (static_analysis)
    Halstead implementation effort
    Enables rebuild estimation and t-shirt sizing. A finding such as 'the top 10 files account for 60% of total comprehension effort' gives concrete input for post-close staffing and timeline planning.
  • halstead_bugprop (static_analysis)
    Halstead estimated bug propensity
    Predicts the expected number of delivered defects based on code volume. Files with high bug propensity are where post-close quality issues will concentrate — informing QA resource allocation and stabilization timelines.
  • halstead_timerequired (static_analysis)
    Halstead estimated implementation time
    Estimates implementation time in seconds per file. High values flag modules that will take longest to rewrite or onboard — giving concrete input for rebuild timelines and staffing plans.
  • maintainability_index (static_analysis)
    SEI maintainability index (0-100)
    Ready-made quality score on a 0-100 scale that's immediately interpretable. 'This repo scores 35 on maintainability, bottom quartile across 200+ codebases we've assessed' is a finding that lands in investment committee presentations.
  • fanout_internal (static_analysis)
    Count of unique internal (relative) imports per file
    Measures internal coupling. Files that import many other project files are integration points — complex to modify and test because changes ripple across the dependency chain.
  • fanout_external (static_analysis)
    Count of unique external imports per file
    Measures external dependency at the file level. Files pulling in many third-party libraries are vulnerable to vendor changes and are often glue code rather than proprietary logic.
  • tiobe (static_analysis)
    TIOBE quality index (0-100)
    Industry-standard composite quality score combining complexity, code size, and duplication. Enables cross-repo benchmarking: 'This file scores 45/100, below the threshold for maintainable production code.'
  • pylint (static_analysis)
    Pylint-style quality score (0-100)
    Widely recognized quality score in the Python ecosystem. Provides a familiar benchmark that engineering teams and technical advisors already understand, reducing friction in due-diligence conversations.
  • file_role (static_analysis)
    Semantic role classification per file (generated, vendor, lock, build_artifact, fixture, test, migration, view, controller, model, service, repository, middleware, config, infra, util, source)
    Enables targeted analysis by distinguishing code that matters (business logic, controllers, models) from code that doesn't (generated, vendored, config). Downstream signals can weight files differently based on their role.
  • churn_risk (git)
    Per-file risk score: churn * (1 - protective_factors). Protective factors: has_tests (0.4), multiple authors (0.3), low complexity (0.3). High churn with good tests and multiple authors scores low risk. High churn with no tests, single author, and high complexity scores high risk.
    Churn alone is ambiguous — active development is healthy. Churn risk isolates files where high change velocity lacks protective factors (tests, shared ownership, manageable complexity), surfacing true maintenance debt and key-person risk.
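
The churn_risk definition above can be illustrated with a minimal sketch. The formula and the protective-factor weights (0.4, 0.3, 0.3) are taken from the glossary entry. How each factor is detected is not stated, so the function signature, the "more than one author" test, and the complexity threshold of 10 are assumptions made only for this example.

```python
def churn_risk(
    churn: float,
    has_tests: bool,
    author_count: int,
    complexity: float,
    low_complexity_threshold: float = 10.0,  # assumed cut-off, not from the source
) -> float:
    """Compute churn * (1 - protective_factors) for a single file."""
    protective = 0.0
    if has_tests:
        protective += 0.4  # has_tests
    if author_count > 1:
        protective += 0.3  # multiple authors (assumed to mean more than one)
    if complexity <= low_complexity_threshold:
        protective += 0.3  # low complexity
    return churn * (1.0 - protective)


if __name__ == "__main__":
    # Same churn, very different risk depending on the protective factors.
    print(churn_risk(0.8, has_tests=True, author_count=3, complexity=4.0))
    print(churn_risk(0.8, has_tests=False, author_count=1, complexity=25.0))
```

With all three factors present the risk drops to roughly zero; with none present it equals the raw churn value.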

Repo-level (21)

  • bus_factor_files (git)
    Count of business-logic files (source, model, service, controller, repository, middleware, util, view) with only one author (bus-factor risk). Excludes vendor, generated, lock, build artifact, fixture, test, migration, config, and infra files.
    Answers the four most common workshop questions at once: Where is active development focused? Who are the critical developers? What happens if they leave? Which components are secretly coupled? These findings directly influence retention clauses, earnout structures, and integration timelines.
  • control_flow_density (heuristic)
    Control-flow statements per 100 code lines
    Identifies where business rules concentrate. High control flow density in a few files means proprietary decision logic is packed into complex code, driving up rebuild cost. Low density across the repo may indicate the product is mostly glue with little original logic.
  • definition_density (heuristic)
    Definitions (functions, classes) per 100 code lines
    Indicates code modularity and reuse patterns. Low definition density suggests monolithic functions doing too much, increasing maintenance cost.
  • comment_ratio (heuristic)
    Repo-wide ratio of comment lines to total lines
    Documentation culture is a leading indicator of post-close maintenance risk. Near-zero comments + high change frequency = institutional knowledge trapped in developers' heads. Directly affects knowledge transfer timelines and key-person retention priorities.
  • test_coverage_ratio (architecture)
    Ratio of testable files (source, model, service, controller, repository, middleware, util) that have associated test files. Excludes views, vendor, generated, lock, build artifact, fixture, test, migration, config, and infra from the denominator.
    Reveals engineering discipline. Low test coverage tells an investor the team ships without a safety net. The absence of tests is one of the most common findings that changes rebuild estimates and post-close stabilization budgets.
  • domain_token_density (heuristic)
    Ratio of domain-specific vocabulary to generic tokens
    Unlocks the most strategically important signal. Function names like 'calculate_actuarial_reserve' or 'apply_regulatory_haircut' reveal the depth of proprietary domain knowledge that can't be replicated by hiring generic developers.
  • domain_match_density (heuristic)
    Ratio of domain-specific vocabulary matches (stemmed + prefix)
    Measures positive matches against curated domain dictionaries (fintech, healthtech, enterprise). Unlike generic density, this confirms the codebase actually uses specialized vocabulary — a strong indicator of proprietary domain logic.
  • gluemarker_ratio (heuristic)
    Weighted ratio of glue/boilerplate markers to total lines
    Quantifies how much of the codebase is commodity wiring vs. substantive logic. A product that's 70% glue code has a fundamentally different value proposition than one with deep proprietary processing. Directly answers: 'How much of this is replaceable by off-the-shelf tools or AI?'
  • external_call_ratio (heuristic)
    Ratio of external library calls to total function calls
    Measures how self-contained the product's value is. High external dependency means vulnerability to vendor pricing changes, API deprecation, or third-party outages. Inverse relationship to defensibility.
  • duplication_ratio (heuristic)
    Ratio of duplicated code blocks across files
    High duplication is a direct indicator of technical debt and rebuild cost. Copy-pasted code means bugs exist in multiple places, refactoring is more expensive than it appears, and the team's engineering practices are weak.
  • lm_uniqueness (slm)
    LM-assessed code uniqueness/novelty
    Turns 'Is this code worth acquiring?' into a scored, benchmarkable answer. Per-file IP identification gives workshops concrete material to work from, and AI-assessed complexity supports rebuild estimation.
  • lm_complexity (slm)
    LM-assessed implementation difficulty
    Turns 'Is this code worth acquiring?' into a scored, benchmarkable answer. Per-file IP identification gives workshops concrete material to work from, and AI-assessed complexity supports rebuild estimation.
  • lm_quality (slm)
    LM-assessed code quality
    Turns 'Is this code worth acquiring?' into a scored, benchmarkable answer. Per-file IP identification gives workshops concrete material to work from, and AI-assessed complexity supports rebuild estimation.
  • program_value_score (slm)
    SLM-synthesized holistic program value from per-file summaries
    Turns 'Is this code worth acquiring?' into a scored, benchmarkable answer. Per-file IP identification gives workshops concrete material to work from, and AI-assessed complexity supports rebuild estimation.
  • assessment_score (slm)
    Engineered-vs-assembled verdict from chunk+merge SLM pipeline
    Turns 'Is this code worth acquiring?' into a scored, benchmarkable answer. Per-file IP identification gives workshops concrete material to work from, and AI-assessed complexity supports rebuild estimation.
  • api_surface_ratio (architecture)
    Ratio of API-exposed files (controllers, routes, handlers, views, resolvers, resources). Measures what fraction of source files exposes external endpoints.
    Detects interface control. A product with a large API surface that third parties integrate with creates switching costs proportional to the number of external consumers. More endpoints = more ecosystem dependence = stickier product.
  • realtime_pattern_density (architecture)
    Ratio of realtime/event-driven dependencies to total dependencies
    Detects an AI-resilient moat. Low-latency and high-frequency domains (WebSocket, gRPC, streaming) are among the hardest for AI to replicate because they require real-time state management and performance engineering. Their presence signals sticky technical complexity.
  • data_accumulation_pattern (architecture, requires git)
    Migration/seed files + git history span indicating data accumulation moat
    Detects data flywheel effects. A history of 200+ migration files, for example, represents years of schema evolution driven by real-world usage. The accumulated data model and the logic to manage it are often more valuable than the application code itself.
  • governance_workflow_density (architecture)
    Density of governance/workflow paths and deps indicating process moat
    Detects one of the stickiest moats: governance authority. Products with deep approval workflows, RBAC, and audit trails have switching costs that compound over time. Customers can't easily rip out a system that enforces their compliance processes.
  • regulatory_token_density (architecture)
    Density of compliance/regulatory paths and deps indicating regulatory moat
    Regulated industries create natural moats because compliance requirements raise the barrier to entry for competitors and AI. A codebase with deep HIPAA, PCI, or SOX implementations reflects years of regulatory domain knowledge that can't be shortcut.
  • cochange_clusters (git)
    Clusters of files that frequently change together, detected via Louvain community detection
    Reveals hidden architectural coupling. Files that always change together may be tightly coupled even if they have no import relationship. This informs modularization effort estimates and highlights integration risk during platform migrations.
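
As an illustration of how the role-based exclusions in the repo-level definitions might work, here is a minimal sketch of the test_coverage_ratio denominator rule. The set of testable roles is copied from the glossary entry. The function name, the (role, has_test) input shape, and returning None for an empty denominator are assumptions made for this example.

```python
from typing import Iterable, Optional, Tuple

# Roles counted in the denominator, as listed in the test_coverage_ratio entry.
TESTABLE_ROLES = {
    "source", "model", "service", "controller",
    "repository", "middleware", "util",
}


def compute_test_coverage_ratio(
    files: Iterable[Tuple[str, bool]],
) -> Optional[float]:
    """files holds (file_role, has_associated_test) pairs, one per file."""
    testable = [has_test for role, has_test in files if role in TESTABLE_ROLES]
    if not testable:
        # Assumption: report "no value" instead of 0.0 when nothing is testable.
        return None
    return sum(testable) / len(testable)


if __name__ == "__main__":
    sample = [
        ("service", True),
        ("model", False),
        ("view", True),     # excluded: views are not in the denominator
        ("config", False),  # excluded
        ("util", True),
    ]
    print(compute_test_coverage_ratio(sample))
```

Views and config files do not affect the result in either direction, which matches the exclusion list in the definition.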

How compute_composite Works

  1. Collect contributions: iterate over all SignalDefinitions and find those with a CompositeContribution matching the target composite name.
  2. Skip missing: if a signal's value is None (e.g. SLM unavailable, no git history), skip it entirely. Its weight is excluded from both numerator and denominator, so the score degrades gracefully instead of being penalized for the missing signal.
  3. Normalize: if normalize_divisor is set, scale and clamp: v = min(raw / divisor, 1.0).
  4. Invert: if inverted=True (lower-is-better signals like duplication), flip: v = 1.0 - v.
  5. Accumulate: weighted_sum += v * weight, total_weight += weight.
  6. Re-normalize: final score = weighted_sum / total_weight, so the result stays within 0.0–1.0 (given per-signal values in that range) regardless of which signals were available.
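
The six steps can be sketched as follows. This is a simplified illustration, not the actual implementation: the class names SignalDefinition and CompositeContribution and the fields weight, inverted and normalize_divisor come from the description above, while the remaining field names, the function signature, and returning None when no signal is available are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CompositeContribution:
    composite: str                             # target composite name (assumed field name)
    weight: float
    inverted: bool = False
    normalize_divisor: Optional[float] = None  # None means the raw value is used as-is


@dataclass
class SignalDefinition:
    name: str
    contributions: List[CompositeContribution] = field(default_factory=list)


def compute_composite(
    target: str,
    definitions: List[SignalDefinition],
    values: Dict[str, Optional[float]],
) -> Optional[float]:
    weighted_sum = 0.0
    total_weight = 0.0
    for definition in definitions:
        for contrib in definition.contributions:
            if contrib.composite != target:     # step 1: collect matching contributions
                continue
            raw = values.get(definition.name)
            if raw is None:                     # step 2: skip missing signals
                continue
            v = raw
            if contrib.normalize_divisor:       # step 3: normalize and clamp
                v = min(raw / contrib.normalize_divisor, 1.0)
            if contrib.inverted:                # step 4: invert lower-is-better signals
                v = 1.0 - v
            weighted_sum += v * contrib.weight  # step 5: accumulate
            total_weight += contrib.weight
    if total_weight == 0.0:
        return None  # assumption: no available signal means no score
    return weighted_sum / total_weight          # step 6: re-normalize


ENGINEERING = [
    SignalDefinition("control_flow_density", [CompositeContribution("engineering_score", 0.15)]),
    SignalDefinition("definition_density", [CompositeContribution("engineering_score", 0.15)]),
    SignalDefinition("comment_ratio", [CompositeContribution("engineering_score", 0.15)]),
    SignalDefinition("test_coverage_boosted", [CompositeContribution("engineering_score", 0.30)]),
    SignalDefinition("duplication_ratio", [CompositeContribution("engineering_score", 0.25, inverted=True)]),
]

if __name__ == "__main__":
    # test_coverage_boosted is missing, so its 0.30 weight drops out entirely.
    sample_values = {
        "control_flow_density": 0.5,
        "definition_density": 0.5,
        "comment_ratio": 0.5,
        "test_coverage_boosted": None,
        "duplication_ratio": 0.2,
    }
    print(compute_composite("engineering_score", ENGINEERING, sample_values))
```

In the usage example the denominator shrinks from 1.00 to 0.70 because one signal is missing, so the remaining signals carry proportionally more weight instead of the score being dragged down.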