Metadata-Version: 2.4
Name: codeir_tools
Version: 0.2.0
Summary: Deterministic code compression and indexing for Python repositories
License: MIT
Keywords: code-analysis,ir,indexing,compression,llm,agents
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Code Generators
Classifier: Topic :: Software Development :: Compilers
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: eval
Requires-Dist: anthropic; extra == "eval"
Requires-Dist: openai; extra == "eval"
Requires-Dist: google-generativeai; extra == "eval"
Requires-Dist: numpy; extra == "eval"
Requires-Dist: sentence-transformers; extra == "eval"
Requires-Dist: scikit-learn; extra == "eval"
Provides-Extra: test
Requires-Dist: pytest; extra == "test"
Dynamic: license-file

# Why LLMs Struggle With Raw Codebases

LLMs work poorly on real code **not because the models are weak**, but because code is optimized for human readability, not transformer efficiency.

This mismatch produces **predictable failures**.

---

## 1. Raw Code Wastes Tokens and Buries Meaning

Code encodes intent through formatting, naming, indentation, imports, decorators, boilerplate, and syntactic rituals that carry **zero semantic weight** to a model.

LLMs must ingest all of this noise token-by-token before they can reach the behavior that actually matters.

### The Problem

```python
# What the LLM sees (high token count, low signal)
def get_user_profile_from_database_by_id(user_id: str) -> UserProfile:
    """
    Retrieves a user profile from the database given a user ID.

    Args:
        user_id: The unique identifier for the user

    Returns:
        UserProfile object containing user data
    """
    if user_id is None:
        raise ValueError("user_id cannot be None")

    # ... 50+ lines of boilerplate
```

```
# What actually matters (Behavior IR)
FN USRP C=DBQY,VALD F=EIR A=3 #DB #CORE
```

→ A function that calls `DBQY` and `VALD`, raises exceptions, has conditionals, returns a value. 3 assignments. That's the whole behavioral surface.
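This kind of compression can be sketched with Python's `ast` module. The grammar below (the `C=`/`F=`/`A=` fields, the flag letters E/I/R for raises/conditionals/returns, and the vowel-stripped ID) is inferred from the example above and is a hypothetical stand-in, not CodeIR's actual implementation:

```python
import ast

SRC = '''
def get_user_profile(user_id):
    if user_id is None:
        raise ValueError("user_id cannot be None")
    raw = db_query(user_id)
    profile = validate(raw)
    return profile
'''

def behavior_line(src: str) -> str:
    fn = ast.parse(src).body[0]
    calls, flags, assigns = [], set(), 0
    for node in ast.walk(fn):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            calls.append(node.func.id)          # direct callees (incl. raised exceptions)
        elif isinstance(node, ast.Raise):
            flags.add("E")                      # raises
        elif isinstance(node, ast.If):
            flags.add("I")                      # conditionals
        elif isinstance(node, ast.Return):
            flags.add("R")                      # returns a value
        elif isinstance(node, ast.Assign):
            assigns += 1
    # Placeholder ID scheme: uppercase, strip vowels/underscores, keep 4 chars.
    ident = "".join(c for c in fn.name.upper() if c not in "AEIOU_")[:4]
    flag_str = "".join(f for f in "AEIR" if f in flags)
    call_str = ",".join(dict.fromkeys(calls))   # dedupe, keep first-seen order
    return f"FN {ident} C={call_str} F={flag_str} A={assigns}"

print(behavior_line(SRC))
```

A few dozen lines of source collapse into one line whose every token carries behavioral signal.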

---

## 2. Context Limits Break Multi-File Reasoning

Large repositories exceed model context windows, which forces the model to reason over **fragments**.

Without global visibility, it cannot maintain stable understanding of:

- **Call relationships** — who calls whom?
- **Shared invariants** — what assumptions cross file boundaries?
- **Cross-file constraints** — which changes break what?
- **Architectural intent** — what's the actual design?

CodeIR's bearings file summarizes module-level architecture in ~200-400 tokens, and Behavior-level IR lets entire codebases fit into a context window where raw source cannot.
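As a rough illustration of the budget difference, with assumed numbers (a 128k-token context window, a 2,000-token average file; only the ~3-5% ratio comes from the text above):

```python
CONTEXT_TOKENS = 128_000   # assumed model context window
AVG_FILE_TOKENS = 2_000    # assumed average raw-source file size
IR_RATIO = 0.04            # Behavior IR at ~3-5% of source (midpoint)

raw_files = CONTEXT_TOKENS // AVG_FILE_TOKENS
ir_files = int(CONTEXT_TOKENS // (AVG_FILE_TOKENS * IR_RATIO))

print(f"raw source:  ~{raw_files} files fit in context")    # → ~64
print(f"behavior IR: ~{ir_files} files fit in context")     # → ~1600
```

At that ratio, a repository that overflows the window as raw source fits whole as IR, so cross-file relationships stay visible in one pass.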

---

## 3. Redundant Variation Inflates Complexity

Equivalent constructs written in different styles look **unrelated** unless normalized:

| Python | JavaScript | Swift |
|--------|------------|-------|
| `get_user()` | `fetchUser()` | `retrieveUser()` |
| `user_data` | `userData` | `userInfo` |
| `db_query()` | `queryDB()` | `databaseQuery()` |

LLMs treat syntactically different expressions of the same idea as **separate concepts**, fragmenting reasoning.

### After Compression

Stable entity IDs normalize naming:
```
FN USRG → "get user"         (regardless of casing/style)
FN DBQY → "database query"   (regardless of language)
```
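One way such normalization can work is sketched below; the regex-based word splitting and the tiny synonym table are hypothetical illustrations, not CodeIR's actual ID scheme:

```python
import re

def words(ident: str) -> tuple:
    """Split snake_case and camelCase identifiers into lowercase word tokens."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", ident.replace("_", " "))
    return tuple(w.lower() for w in spaced.split())

# Hypothetical synonym table; a real system would need a curated mapping.
SYNONYMS = {"fetch": "get", "retrieve": "get", "data": "info"}

def normalize(ident: str) -> tuple:
    return tuple(SYNONYMS.get(w, w) for w in words(ident))

# Three styles, one concept:
print(normalize("get_user"), normalize("fetchUser"), normalize("retrieveUser"))
# → ('get', 'user') ('get', 'user') ('get', 'user')
```

Once identifiers collapse to the same normalized key, the model sees one concept instead of three.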

---

## 4. Structural Information Is Implicit, Not Explicit

Architectural boundaries are **buried in syntax**:

- Stateful regions
- Async boundaries
- Platform-specific logic
- Critical error paths

LLMs see **sequential text, not structure**, unless forced into a better representation.

### Example: Behavior IR Makes Structure Explicit

```python
# LLM sees: "just another function"
async def process_payment(order_id):
    result = await db.query(...)
    if not result:
        raise PaymentError("not found")
    return result
```

```
# Behavior IR surfaces the structure
AMT PRCSPYMNT C=PaymentError,db.query F=AEIR A=1 #DB #CORE
```

→ Async method, awaits, raises, has conditionals, returns. The async boundary, error path, and DB dependency are all explicit.
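Surfacing those structural facts is mechanical once you inspect the AST instead of the text. A minimal sketch (the flag letters and the dotted-dependency format are assumptions based on the line above, not CodeIR's real extractor):

```python
import ast

SRC = '''
async def process_payment(order_id):
    result = await db.query(order_id)
    if not result:
        raise PaymentError("not found")
    return result
'''

def structure(src: str):
    fn = ast.parse(src).body[0]
    flags, deps = set(), []
    if isinstance(fn, ast.AsyncFunctionDef):
        flags.add("A")                                        # async boundary
    for node in ast.walk(fn):
        if isinstance(node, ast.Raise):
            flags.add("E")                                    # error path
        elif isinstance(node, ast.If):
            flags.add("I")                                    # conditional
        elif isinstance(node, ast.Return):
            flags.add("R")                                    # returns a value
        elif isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Attribute) and isinstance(func.value, ast.Name):
                deps.append(f"{func.value.id}.{func.attr}")   # e.g. db.query
            elif isinstance(func, ast.Name):
                deps.append(func.id)                          # e.g. PaymentError
    return "".join(f for f in "AEIR" if f in flags), deps

print(structure(SRC))
```

The async boundary, the error path, and the external dependencies fall out of a single traversal; nothing has to be inferred from surface text.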

---

## 5. Token-Heavy Regions Distort Importance

Large helper functions, repeated boilerplate, and verbose patterns **dominate attention** even when they are unimportant.

The model cannot prioritize high-impact architectural nodes.

### The Attention Problem

```python
# 300 lines of logging boilerplate
def setup_logging_configuration():
    # ... consumes 800+ tokens ...

# 5 lines of critical business logic
def validate_payment():
    # ... only 50 tokens, but this is what matters ...
```

The LLM spends most of its attention on **noise**, not **signal**.

### With IR Compression

```
FN STPLG F=A A=15 #CORE              # 8 tokens
FN VLDPYMNT C=BNKP,FRDCHK F=EIR #CORE  # 12 tokens
```

Equal representation regardless of verbosity. The model sees both at the same resolution.
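Using the token counts from the examples above (raw counts are the illustrative figures; IR counts come from the comments), the share of the budget the critical function actually gets:

```python
raw = {"setup_logging": 800, "validate_payment": 50}   # tokens in raw source
ir = {"setup_logging": 8, "validate_payment": 12}      # tokens in Behavior IR

def share(counts: dict, name: str) -> float:
    return counts[name] / sum(counts.values())

print(f"raw: validate_payment is {share(raw, 'validate_payment'):.0%} of tokens")  # → 6%
print(f"IR:  validate_payment is {share(ir, 'validate_payment'):.0%} of tokens")   # → 60%
```

In raw source the business-critical function is a rounding error in the token budget; in IR it is the larger of the two entries.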

---

## Why Semantic Compression Matters

Semantic compression makes structure **explicit** and collapses unnecessary variation, allowing LLMs to operate where they're strongest:

| Capability | Raw Code | With CodeIR |
|------------|----------|-------------|
| **Pattern detection** | Fragmented by syntax | Normalized and clear |
| **Architectural reasoning** | Implicit, buried | Explicit, structured |
| **Relational understanding** | Context-limited | Graph-based, complete |
| **Token efficiency** | Full source token cost | Behavior IR at ~3-5% of source |

Instead of parsing noise, the model operates on a **consistent, low-entropy substrate**.

---

## The CodeIR Approach

> **IR is the operating system.**
> **Raw code is just a UI.**

By transforming code into a deterministic, compressed, structure-first representation, we give LLMs the substrate they need to reason effectively about real-world software.

---

## Related Documentation

- [Main README](../README.md) — Project overview
- [IR Spec (As Built)](IR_spec_as_built_v0_2.md) — Technical details
- [Future Considerations](Future_Considerations.md) — Planned expansions
