TechLens Signal Guide
Auto-generated from signal definitions. Run python scripts/generate_signal_docs.py to regenerate.
Pipeline Overview
The FULL_ANALYSIS pipeline runs 13 stages in order.
Dashed borders indicate SLM-dependent stages that are skipped when no local model is available.
graph TB
subgraph "File Discovery"
direction LR
S0["filter_files<br/>Walk repo, select source files"]
S1["create_manager<br/>Instantiate per-file score accumulator"]
end
subgraph "Signal Collection"
direction LR
S2["git_density<br/>Git density, author count, co-change"]
S3["import_graph<br/>Import indegree analysis"]
S4["code_metrics<br/>Control flow, definitions, comments, glue, domain tokens"]
S5["token_metrics<br/>Cyclomatic complexity, operands, operators, fanout"]
S6["test_coverage<br/>Map test files to source files"]
S7["derived_metrics<br/>Halstead, maintainability index, TIOBE"]
S8["moat_signals<br/>API surface, data accumulation, realtime, governance, regulatory"]
end
subgraph "Scoring"
direction LR
S9["composite_scores<br/>Weighted composite scoring + top-files ranking"]
end
subgraph "SLM (optional)"
direction LR
S10["lm_code_value<br/>SLM: uniqueness, difficulty, quality"]
S11["file_descriptions<br/>SLM: per-file descriptions + program synthesis"]
S12["slm_post_filter<br/>SLM: outlier queries (code value + glue assessment)"]
end
S0 --> S1
S1 --> S2
S2 --> S3
S3 --> S4
S4 --> S5
S5 --> S6
S6 --> S7
S7 --> S8
S8 --> S9
S9 --> S10
S10 --> S11
S11 --> S12
style S10 stroke-dasharray: 5 5
style S11 stroke-dasharray: 5 5
style S12 stroke-dasharray: 5 5
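The stage ordering and the SLM skip rule in the diagram can be sketched as a small runner. This is a minimal illustration, not the project's actual code: the `Stage` class, `run_pipeline`, `_noop`, and the `slm_available` flag are hypothetical names; only the stage names and their order come from the diagram.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Stage:
    """Hypothetical stage record: a name, a callable, and an SLM flag."""
    name: str
    run: Callable[[Dict], None]
    needs_slm: bool = False


def run_pipeline(stages: List[Stage], context: Dict, slm_available: bool) -> List[str]:
    """Run stages in order, skipping SLM-dependent ones when no model is available."""
    executed = []
    for stage in stages:
        if stage.needs_slm and not slm_available:
            continue  # dashed-border stages are skipped without a local model
        stage.run(context)
        executed.append(stage.name)
    return executed


def _noop(context: Dict) -> None:
    """Placeholder body; a real stage would read and update the context."""


# Stage names and order taken from the diagram (S0 through S12).
STAGE_NAMES = [
    "filter_files", "create_manager", "git_density", "import_graph",
    "code_metrics", "token_metrics", "test_coverage", "derived_metrics",
    "moat_signals", "composite_scores", "lm_code_value",
    "file_descriptions", "slm_post_filter",
]
SLM_STAGES = {"lm_code_value", "file_descriptions", "slm_post_filter"}
FULL_ANALYSIS = [Stage(name, _noop, name in SLM_STAGES) for name in STAGE_NAMES]

executed_without_slm = run_pipeline(FULL_ANALYSIS, {}, slm_available=False)
```

With `slm_available=False` the runner stops after `composite_scores`, so the three SLM stages never execute and their signals stay unset.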
Signal → Composite Score
Signals feed into composite scores with weighted contributions. Dashed arrows indicate inverted signals (lower raw value = higher contribution). ⚛ marks signals requiring a local SLM.
Legend: architecture, git, heuristic, slm, static_analysis
graph LR
subgraph " "
direction LR
control_flow_density["control_flow_density"]:::heuristic
definition_density["definition_density"]:::heuristic
comment_ratio["comment_ratio"]:::heuristic
test_coverage_boosted["test_coverage_boosted"]:::architecture
duplication_ratio["duplication_ratio"]:::heuristic
end
engineering_score(["engineering_score"]):::composite
subgraph " "
direction LR
lm_uniqueness["lm_uniqueness ⚛"]:::slm
lm_complexity["lm_complexity ⚛"]:::slm
lm_quality["lm_quality ⚛"]:::slm
end
lm_code_value_score(["lm_code_value_score"]):::composite
subgraph " "
direction LR
domain_match_density["domain_match_density"]:::heuristic
gluemarker_ratio["gluemarker_ratio"]:::heuristic
external_call_ratio["external_call_ratio"]:::heuristic
api_surface_ratio["api_surface_ratio"]:::architecture
realtime_pattern_density["realtime_pattern_density"]:::architecture
data_accumulation_pattern["data_accumulation_pattern"]:::architecture
governance_workflow_density["governance_workflow_density"]:::architecture
regulatory_token_density["regulatory_token_density"]:::architecture
end
moat_score(["moat_score"]):::composite
control_flow_density -->|15%| engineering_score
definition_density -->|15%| engineering_score
comment_ratio -->|15%| engineering_score
test_coverage_boosted -->|30%| engineering_score
duplication_ratio -.->|25% inv| engineering_score
lm_uniqueness -->|33%| lm_code_value_score
lm_complexity -->|33%| lm_code_value_score
lm_quality -->|33%| lm_code_value_score
domain_match_density -->|20%| moat_score
gluemarker_ratio -.->|15% inv| moat_score
external_call_ratio -.->|12% inv| moat_score
api_surface_ratio -->|6%| moat_score
realtime_pattern_density -->|6%| moat_score
data_accumulation_pattern -->|10%| moat_score
governance_workflow_density -->|6%| moat_score
regulatory_token_density -->|12%| moat_score
classDef git fill:#4A90D9,color:#fff,stroke:#333
classDef heuristic fill:#50C878,color:#fff,stroke:#333
classDef architecture fill:#FF8C42,color:#fff,stroke:#333
classDef static_analysis fill:#9B59B6,color:#fff,stroke:#333
classDef slm fill:#E74C3C,color:#fff,stroke:#333
classDef composite fill:#2C3E50,color:#fff,stroke:#fff,font-size:16px
Composite Weight Distribution
pie title engineering_score
"control_flow_density" : 15
"definition_density" : 15
"comment_ratio" : 15
"test_coverage_boosted" : 30
"duplication_ratio (inv)" : 25

pie title lm_code_value_score
"lm_uniqueness" : 33
"lm_complexity" : 33
"lm_quality" : 33

pie title moat_score
"domain_match_density" : 20
"gluemarker_ratio (inv)" : 15
"external_call_ratio (inv)" : 12
"api_surface_ratio" : 6
"realtime_pattern_density" : 6
"data_accumulation_pattern" : 10
"governance_workflow_density" : 6
"regulatory_token_density" : 12
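The listed weights do not all sum to 100: the moat_score weights add up to 87 and the lm_code_value_score weights to 99. Because the final score is divided by the total weight of the signals that were actually available, each signal's effective share is its weight over that total. The sketch below computes those shares; `effective_shares` is a hypothetical helper, and only the weight values come from the charts.

```python
from typing import Dict, Iterable, Optional

# Weights as listed in the pie charts above (percent points).
ENGINEERING_WEIGHTS = {
    "control_flow_density": 15,
    "definition_density": 15,
    "comment_ratio": 15,
    "test_coverage_boosted": 30,
    "duplication_ratio": 25,
}
MOAT_WEIGHTS = {
    "domain_match_density": 20,
    "gluemarker_ratio": 15,
    "external_call_ratio": 12,
    "api_surface_ratio": 6,
    "realtime_pattern_density": 6,
    "data_accumulation_pattern": 10,
    "governance_workflow_density": 6,
    "regulatory_token_density": 12,
}


def effective_shares(
    weights: Dict[str, float],
    available: Optional[Iterable[str]] = None,
) -> Dict[str, float]:
    """Return each available signal's share of the total available weight."""
    names = set(weights) if available is None else set(available) & set(weights)
    total = sum(weights[name] for name in names)
    if total == 0:
        return {}  # nothing available, so there are no shares to report
    return {name: weights[name] / total for name in weights if name in names}


moat_shares = effective_shares(MOAT_WEIGHTS)

# Same composite when test_coverage_boosted is unavailable.
engineering_without_tests = effective_shares(
    ENGINEERING_WEIGHTS,
    available=["control_flow_density", "definition_density",
               "comment_ratio", "duplication_ratio"],
)
```

Under this reading, domain_match_density carries 20/87 of moat_score rather than 20%, and duplication_ratio rises to 25/70 of engineering_score when test_coverage_boosted is missing.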
Signal Glossary
Each entry lists what a signal measures and why it matters. Per-file signals produce one value per source file; repo-level signals produce a single value for the entire repository.
Per-file (26)
git_density [git] [git]: Normalized change activity score from git history
Answers the four most common workshop questions at once: Where is active development focused? Who are the critical developers? What happens if they leave? Which components are secretly coupled? These findings directly influence retention clauses, earnout structures, and integration timelines.

author_count [git] [git]: Normalized count of distinct commit authors per file
Answers the four most common workshop questions at once: Where is active development focused? Who are the critical developers? What happens if they leave? Which components are secretly coupled? These findings directly influence retention clauses, earnout structures, and integration timelines.

cochange [git] [git]: Normalized co-change coupling frequency per file
Answers the four most common workshop questions at once: Where is active development focused? Who are the critical developers? What happens if they leave? Which components are secretly coupled? These findings directly influence retention clauses, earnout structures, and integration timelines.

import_indegree [architecture]: Normalized count of internal imports pointing to this file
Identifies architectural bottlenecks and single points of failure. A file imported by dozens of others is both the most valuable and the most dangerous: any bug there cascades across the entire system, and it cannot be replaced without coordinated refactoring.

comment_density_hot [heuristic]: Per-file interaction signal: comment ratio * git density
Documentation culture is a leading indicator of post-close maintenance risk. Near-zero comments + high change frequency = institutional knowledge trapped in developers' heads. Directly affects knowledge transfer timelines and key-person retention priorities.

test_coverage [architecture]: Binary 1.0/0.0 per file indicating presence of an associated test file
Reveals engineering discipline. Low test coverage tells an investor the team ships without a safety net. The absence of tests is one of the most common findings that changes rebuild estimates and post-close stabilization budgets.

test_coverage_boosted [architecture]: Churn-weighted boost for tested files (test_coverage * git_density)
Reveals engineering discipline. Low test coverage tells an investor the team ships without a safety net. The absence of tests is one of the most common findings that changes rebuild estimates and post-close stabilization budgets.

slm_value_score [SLM] [slm]: SLM-assessed value score per individual source file
Enables 'Is this code worth acquiring?' as a scored, benchmarkable answer. Per-file IP identification transforms workshops. Rebuild estimation with AI-assessed complexity.

cyclomatic_complexity [static_analysis]: McCabe cyclomatic complexity per file
Measures the number of independent execution paths through the code. High cyclomatic complexity means more test cases needed for full coverage and higher probability of latent defects, directly impacting post-close stabilization effort.

operands_sum [static_analysis]: Total operand token count per file
Raw input for Halstead metrics. Total operand count reflects data volume flowing through the code; higher counts indicate files doing substantive data transformation rather than simple pass-through.

operands_unique [static_analysis]: Unique operand token count per file
Vocabulary breadth of data elements. A file with many unique operands is manipulating a rich data model, suggesting domain-specific logic that is harder to replicate or replace with generic tooling.

operators_sum [static_analysis]: Total operator token count per file
Raw input for Halstead metrics. High operator counts relative to operands indicate dense control logic and data manipulation, contributing to implementation effort estimates.

operators_unique [static_analysis]: Unique operator token count per file
Indicates the breadth of language features and constructs used. Files leveraging more unique operators tend to implement more sophisticated algorithms that require deeper expertise to maintain.

definitions_count [static_analysis]: Unique function and class definition count per file
Counts distinct function and class definitions via Pygments tokens. High definition counts indicate files that declare substantial API surface; more definitions mean more contracts to maintain, test, and document.

halstead_volume [static_analysis]: Halstead program volume
Measures the information content of the code in bits. High volume files contain more logic to understand, test, and maintain, directly proportional to the knowledge transfer effort required post-acquisition.

halstead_difficulty [static_analysis]: Halstead program difficulty
Quantifies how hard the code is to write or understand. High difficulty files are error-prone to modify and require senior engineers, a factor in retention planning and post-close staffing decisions.

halstead_effort [static_analysis]: Halstead implementation effort
Enables rebuild estimation and t-shirt sizing. A finding such as 'the top 10 files account for 60% of total comprehension effort' gives concrete input for post-close staffing and timeline planning.

halstead_bugprop [static_analysis]: Halstead estimated bug propensity
Predicts the expected number of delivered defects based on code volume. Files with high bug propensity are where post-close quality issues will concentrate, informing QA resource allocation and stabilization timelines.

halstead_timerequired [static_analysis]: Halstead estimated implementation time
Estimates implementation time in seconds per file. High values flag modules that will take longest to rewrite or onboard, giving concrete input for rebuild timelines and staffing plans.

maintainability_index [static_analysis]: SEI maintainability index (0-100)
Ready-made quality score on a 0-100 scale that's immediately interpretable. 'This repo scores 35 on maintainability, bottom quartile across 200+ codebases we've assessed' is a finding that lands in investment committee presentations.

fanout_internal [static_analysis]: Count of unique internal (relative) imports per file
Measures internal coupling. Files that import many other project files are integration points, complex to modify and test because changes ripple across the dependency chain.

fanout_external [static_analysis]: Count of unique external imports per file
Measures external dependency at the file level. Files pulling in many third-party libraries are vulnerable to vendor changes and are often glue code rather than proprietary logic.

tiobe [static_analysis]: TIOBE quality index (0-100)
Industry-standard composite quality score combining complexity, code size, and duplication. Enables cross-repo benchmarking: 'This file scores 45/100, below the threshold for maintainable production code.'

pylint [static_analysis]: Pylint-style quality score (0-100)
Widely recognized quality score in the Python ecosystem. Provides a familiar benchmark that engineering teams and technical advisors already understand, reducing friction in due-diligence conversations.

file_role [static_analysis]: Semantic role classification per file (generated, vendor, lock, build_artifact, fixture, test, migration, view, controller, model, service, repository, middleware, config, infra, util, source)
Enables targeted analysis by distinguishing code that matters (business logic, controllers, models) from code that doesn't (generated, vendored, config). Downstream signals can weight files differently based on their role.

churn_risk [git] [git]: Per-file risk score: churn * (1 - protective_factors). Protective factors: has_tests (0.4), multiple authors (0.3), low complexity (0.3). High churn with good tests and multiple authors scores low risk. High churn with no tests, single author, and high complexity scores high risk.
Churn alone is ambiguous: active development is healthy. Churn risk isolates files where high change velocity lacks protective factors (tests, shared ownership, manageable complexity), surfacing true maintenance debt and key-person risk.
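The churn_risk formula above, churn * (1 - protective_factors), can be worked through in a few lines. The sketch assumes churn is already normalized to 0.0–1.0, that "multiple authors" means more than one distinct author, and that "low complexity" is decided by a cutoff on cyclomatic complexity. The function name, its parameters, and the default cutoff of 10 are illustrative assumptions; only the three factor weights come from the definition.

```python
def churn_risk(
    churn: float,
    has_tests: bool,
    author_count: int,
    cyclomatic_complexity: float,
    low_complexity_threshold: float = 10.0,  # assumed cutoff, not from the source
) -> float:
    """Scale churn down by whichever protective factors apply to the file."""
    protection = 0.0
    if has_tests:
        protection += 0.4  # has_tests
    if author_count > 1:
        protection += 0.3  # multiple authors
    if cyclomatic_complexity <= low_complexity_threshold:
        protection += 0.3  # low complexity
    # Guard against tiny negative values from floating-point rounding.
    return max(0.0, churn * (1.0 - protection))


# Fully protected file: tests, shared ownership, simple code.
protected = churn_risk(0.8, has_tests=True, author_count=3, cyclomatic_complexity=4.0)

# Unprotected file: no tests, single author, complex code.
exposed = churn_risk(0.8, has_tests=False, author_count=1, cyclomatic_complexity=25.0)
```

A file with every protective factor scores close to zero however often it changes, while a file with none keeps its full churn value as risk.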
Repo-level (21)
bus_factor_files [git] [git]: Count of business-logic files (source, model, service, controller, repository, middleware, util, view) with only one author (bus-factor risk). Excludes vendor, generated, lock, build artifact, fixture, test, migration, config, and infra files.
Answers the four most common workshop questions at once: Where is active development focused? Who are the critical developers? What happens if they leave? Which components are secretly coupled? These findings directly influence retention clauses, earnout structures, and integration timelines.

control_flow_density [heuristic]: Control-flow statements per 100 code lines
Identifies where business rules concentrate. High control flow density in a few files means proprietary decision logic is packed into complex code, driving up rebuild cost. Low density across the repo may indicate the product is mostly glue with little original logic.

definition_density [heuristic]: Definitions (functions, classes) per 100 code lines
Indicates code modularity and reuse patterns. Low definition density suggests monolithic functions doing too much, increasing maintenance cost.

comment_ratio [heuristic]: Repo-wide ratio of comment lines to total lines
Documentation culture is a leading indicator of post-close maintenance risk. Near-zero comments + high change frequency = institutional knowledge trapped in developers' heads. Directly affects knowledge transfer timelines and key-person retention priorities.

test_coverage_ratio [architecture]: Ratio of testable files (source, model, service, controller, repository, middleware, util) that have associated test files. Excludes views, vendor, generated, lock, build artifact, fixture, test, migration, config, and infra from the denominator.
Reveals engineering discipline. Low test coverage tells an investor the team ships without a safety net. The absence of tests is one of the most common findings that changes rebuild estimates and post-close stabilization budgets.

domain_token_density [heuristic]: Ratio of domain-specific vocabulary to generic tokens
Unlocks the most strategically important signal. Function names like 'calculate_actuarial_reserve' or 'apply_regulatory_haircut' reveal the depth of proprietary domain knowledge that can't be replicated by hiring generic developers.

domain_match_density [heuristic]: Ratio of domain-specific vocabulary matches (stemmed + prefix)
Measures positive matches against curated domain dictionaries (fintech, healthtech, enterprise). Unlike generic density, this confirms the codebase actually uses specialized vocabulary, a strong indicator of proprietary domain logic.

gluemarker_ratio [heuristic]: Weighted ratio of glue/boilerplate markers to total lines
Quantifies how much of the codebase is commodity wiring vs. substantive logic. A product that's 70% glue code has a fundamentally different value proposition than one with deep proprietary processing. Directly answers: 'How much of this is replaceable by off-the-shelf tools or AI?'

external_call_ratio [heuristic]: Ratio of external library calls to total function calls
Measures how self-contained the product's value is. High external dependency means vulnerability to vendor pricing changes, API deprecation, or third-party outages. Inverse relationship to defensibility.

duplication_ratio [heuristic]: Ratio of duplicated code blocks across files
High duplication is a direct indicator of technical debt and rebuild cost. Copy-pasted code means bugs exist in multiple places, refactoring is more expensive than it appears, and the team's engineering practices are weak.

lm_uniqueness [SLM] [slm]: LM-assessed code uniqueness/novelty
Enables 'Is this code worth acquiring?' as a scored, benchmarkable answer. Per-file IP identification transforms workshops. Rebuild estimation with AI-assessed complexity.

lm_complexity [SLM] [slm]: LM-assessed implementation difficulty
Enables 'Is this code worth acquiring?' as a scored, benchmarkable answer. Per-file IP identification transforms workshops. Rebuild estimation with AI-assessed complexity.

lm_quality [SLM] [slm]: LM-assessed code quality
Enables 'Is this code worth acquiring?' as a scored, benchmarkable answer. Per-file IP identification transforms workshops. Rebuild estimation with AI-assessed complexity.

program_value_score [SLM] [slm]: SLM-synthesized holistic program value from per-file summaries
Enables 'Is this code worth acquiring?' as a scored, benchmarkable answer. Per-file IP identification transforms workshops. Rebuild estimation with AI-assessed complexity.

assessment_score [SLM] [slm]: Engineered-vs-assembled verdict from chunk+merge SLM pipeline
Enables 'Is this code worth acquiring?' as a scored, benchmarkable answer. Per-file IP identification transforms workshops. Rebuild estimation with AI-assessed complexity.

api_surface_ratio [architecture]: Ratio of API-exposed files (controllers, routes, handlers, views, resolvers, resources). Measures what fraction of source files exposes external endpoints.
Detects interface control. A product with a large API surface that third parties integrate with creates switching costs proportional to the number of external consumers. More endpoints = more ecosystem dependence = stickier product.

realtime_pattern_density [architecture]: Ratio of realtime/event-driven dependencies to total dependencies
Detects a resilient AI moat. Low-latency and high-frequency domains (WebSocket, gRPC, streaming) are among the hardest for AI to replicate because they require real-time state management and performance engineering. Presence signals sticky technical complexity.

data_accumulation_pattern [git] [architecture]: Migration/seed files + git history span indicating data accumulation moat
Detects data flywheel effects. 200+ migration files represent years of schema evolution driven by real-world usage. The accumulated data model and the logic to manage it are often more valuable than the application code itself.

governance_workflow_density [architecture]: Density of governance/workflow paths and deps indicating process moat
Detects one of the stickiest moats: governance authority. Products with deep approval workflows, RBAC, and audit trails have switching costs that compound over time. Customers can't easily rip out a system that enforces their compliance processes.

regulatory_token_density [architecture]: Density of compliance/regulatory paths and deps indicating regulatory moat
Regulated industries create natural moats because compliance requirements raise the barrier to entry for competitors and AI. A codebase with deep HIPAA, PCI, or SOX implementations reflects years of regulatory domain knowledge that can't be shortcut.

cochange_clusters [git] [git]: Clusters of files that frequently change together, detected via Louvain community detection
Reveals hidden architectural coupling. Files that always change together may be tightly coupled even if they have no import relationship. This informs modularization effort estimates and highlights integration risk during platform migrations.
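Two of the repo-level signals above, bus_factor_files and test_coverage_ratio, are role-filtered counts, and they use slightly different role sets: views count as business logic for bus factor but are excluded from the test coverage denominator. The sketch below illustrates that difference. The function names, the input shapes (a path-to-role map, a path-to-authors map, a set of tested paths), and the choices to ignore files with no recorded author and to return None for an empty denominator are assumptions made for illustration.

```python
from typing import Dict, Optional, Set

# Role sets as listed in the glossary entries above.
BUSINESS_LOGIC_ROLES = {
    "source", "model", "service", "controller",
    "repository", "middleware", "util", "view",
}
TESTABLE_ROLES = BUSINESS_LOGIC_ROLES - {"view"}  # views leave the denominator


def count_bus_factor_files(roles: Dict[str, str], authors: Dict[str, Set[str]]) -> int:
    """Count business-logic files that have exactly one distinct author."""
    return sum(
        1
        for path, role in roles.items()
        if role in BUSINESS_LOGIC_ROLES and len(authors.get(path, set())) == 1
    )


def compute_test_coverage_ratio(roles: Dict[str, str], tested: Set[str]) -> Optional[float]:
    """Share of testable files that have an associated test file."""
    testable = [path for path, role in roles.items() if role in TESTABLE_ROLES]
    if not testable:
        return None  # assumed behavior when there is nothing to measure
    return sum(1 for path in testable if path in tested) / len(testable)


# Hypothetical repository used only to exercise the two functions.
roles = {
    "app/billing.py": "service",
    "app/views/home.py": "view",
    "app/models/user.py": "model",
    "vendor/lib.py": "vendor",
    "settings.py": "config",
}
authors = {
    "app/billing.py": {"ana"},
    "app/views/home.py": {"ana", "bo"},
    "app/models/user.py": {"bo"},
    "vendor/lib.py": {"ana"},
}
tested = {"app/billing.py"}

single_author_files = count_bus_factor_files(roles, authors)
coverage_ratio = compute_test_coverage_ratio(roles, tested)
```

In this toy repository the vendored file has a single author but is not counted, and the view is ignored when computing coverage, so only the service and model files form the denominator.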
How compute_composite Works
- Collect contributions: iterate all SignalDefinitions, find those with a CompositeContribution matching the target composite name.
- Skip missing: if a signal's value is None (e.g. SLM unavailable, no git history), skip it entirely. Its weight is excluded from both numerator and denominator, so the score degrades gracefully rather than penalizing.
- Normalize: if normalize_divisor is set, clamp: v = min(raw / divisor, 1.0).
- Invert: if inverted=True (lower-is-better signals like duplication), flip: v = 1.0 - v.
- Accumulate: weighted_sum += v * weight, total_weight += weight.
- Re-normalize: final score = weighted_sum / total_weight, so the result is always 0.0–1.0 regardless of which signals were available.
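The steps above can be sketched as follows. This is a minimal reading of the description, not the project's source: the dataclass fields (`composite`, `weight`, `inverted`, `normalize_divisor`, `contributions`), the function signature, and the choice to return None when no signal is available are assumptions. Weights are written as fractions of the percentages shown in the charts.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class CompositeContribution:
    composite: str                              # target composite name
    weight: float
    inverted: bool = False                      # lower raw value is better
    normalize_divisor: Optional[float] = None


@dataclass(frozen=True)
class SignalDefinition:
    name: str
    contributions: Tuple[CompositeContribution, ...] = ()


def compute_composite(
    target: str,
    definitions: List[SignalDefinition],
    values: Dict[str, Optional[float]],
) -> Optional[float]:
    """Weighted average over the signals that have a value for this composite."""
    weighted_sum = 0.0
    total_weight = 0.0
    for definition in definitions:
        for contribution in definition.contributions:
            if contribution.composite != target:
                continue
            raw = values.get(definition.name)
            if raw is None:
                continue  # skip missing: weight leaves numerator and denominator
            v = raw
            if contribution.normalize_divisor is not None:
                v = min(raw / contribution.normalize_divisor, 1.0)  # clamp
            if contribution.inverted:
                v = 1.0 - v
            weighted_sum += v * contribution.weight
            total_weight += contribution.weight
    if total_weight == 0:
        return None  # assumed: no available signals means no score
    return weighted_sum / total_weight


def _signal(name: str, weight: float, inverted: bool = False) -> SignalDefinition:
    """Helper for building the example definitions below."""
    return SignalDefinition(name, (CompositeContribution("engineering_score", weight, inverted),))


ENGINEERING_DEFINITIONS = [
    _signal("control_flow_density", 0.15),
    _signal("definition_density", 0.15),
    _signal("comment_ratio", 0.15),
    _signal("test_coverage_boosted", 0.30),
    _signal("duplication_ratio", 0.25, inverted=True),
]

# Illustrative values; test_coverage_boosted is unavailable here.
example_values = {
    "control_flow_density": 0.4,
    "definition_density": 0.6,
    "comment_ratio": 0.2,
    "test_coverage_boosted": None,
    "duplication_ratio": 0.1,
}
score = compute_composite("engineering_score", ENGINEERING_DEFINITIONS, example_values)
```

In the example the missing test_coverage_boosted value drops 0.30 from the denominator, so the score is 0.405 / 0.70 rather than 0.405 / 1.00.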