Metadata-Version: 2.4
Name: rag_knowledge_preparation
Version: 1.0.3
Summary: RAG Knowledge Preparation in Python
Home-page: https://bitbucket.org/entinco/eic-aimodelknowledge-utils/src/master/lib-ragknowledgepreparation-python
Author: Enterprise Innovation Consulting LLC
Author-email: seroukhov@entinco.com
License: Commercial
Platform: any
Classifier: Intended Audience :: Developers
Classifier: License :: Other/Proprietary License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests<3.0.0,>=2.27.1
Requires-Dist: urllib3<2.0.0,>=1.26.8
Requires-Dist: tiktoken>=0.5.0
Requires-Dist: pathspec>=0.11.0
Requires-Dist: tree-sitter>=0.20.0
Requires-Dist: tree-sitter-python>=0.20.0
Requires-Dist: tree-sitter-javascript>=0.20.0
Requires-Dist: tree-sitter-typescript>=0.20.0
Requires-Dist: pygments>=2.15.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: chardet>=5.0.0
Requires-Dist: google-generativeai>=0.8.5
Requires-Dist: langchain-google-genai>=2.0.1
Requires-Dist: langchain-core>=0.3.21
Requires-Dist: pdf2image>=1.17.0
Requires-Dist: Pillow>=11.0.0
Requires-Dist: python-docx>=1.1.2
Requires-Dist: tenacity>=9.0.0
Requires-Dist: tqdm>=4.66.5
Requires-Dist: langchain-text-splitters>=0.3.0
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license
Dynamic: license-file
Dynamic: platform
Dynamic: requires-dist
Dynamic: summary

# RAG Knowledge Preparation Python

A comprehensive Python library for preparing knowledge bases for Retrieval-Augmented Generation (RAG) systems. This library now focuses on Gemini OCR-based PDF-to-Markdown conversion alongside intelligent codebase analysis.

## Features

### Document Processing Features

- **Multi-format Intake**: PDF, images (PNG/JPG/TIFF/BMP/GIF/WebP), DOCX, and text/Markdown/CSV
- **Gemini OCR**: Convert PDFs/images to Markdown via Gemini 2.5 Pro multimodal
- **Strict Markdown Output**: Page-by-page extraction with a table-aware prompt
- **Async & Parallel**: Concurrency controls for multi-page PDFs
- **Batch Processing**: Process multiple PDFs or entire folders efficiently
- **Configurable Quality**: Presets for fast, table-focused, or high-quality OCR

### Codebase Analysis Features

- **Comprehensive Analysis**: Extract structure, dependencies, and metadata from codebases
- **Multi-language Support**: Python, JavaScript, TypeScript and more
- **AI-Powered Summaries**: Generate intelligent code summaries using Google Gemini
- **Project-aware Metadata**: Capture project names, aliases, and file aliases for precise RAG context
- **Dependency Analysis**: Identify and categorize internal, external, and standard library dependencies
- **Structure Extraction**: Parse classes, functions, imports, and code organization
- **Token Estimation**: Accurate token counting for RAG optimization

### Configuration & Customization

- **Flexible Configuration**: Extensive configuration options for both document and codebase processing
- **Preset Configurations**: Pre-built configurations for common use cases
- **Custom Metadata**: Configurable metadata fields for different analysis needs
- **Performance Optimization**: Built-in performance modes for large-scale processing

## Installation

### Prerequisites
- Poppler (for pdf2image): `brew install poppler` (macOS) or `sudo apt-get install -y poppler-utils` (Debian/Ubuntu)
- Gemini API key: set the `GOOGLE_API_KEY` environment variable (needed for OCR on PDFs and images; AI-powered code summaries take a key via `gemini_api_key`)

```bash
pip install rag-knowledge-preparation-python
```
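
The prerequisites above can be checked before any OCR call is made. The following is a minimal sketch using only the standard library; the helper name `missing_ocr_prerequisites` is hypothetical and not part of this package. It assumes Poppler is detectable through its `pdftoppm` or `pdftocairo` binaries, which are the tools pdf2image invokes.

```python
import os
import shutil
from typing import Callable, Mapping, Optional


def missing_ocr_prerequisites(
    env: Optional[Mapping[str, str]] = None,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return human-readable problems; an empty list means OCR can run."""
    env = os.environ if env is None else env
    problems = []
    # pdf2image shells out to Poppler's pdftoppm / pdftocairo binaries.
    if which("pdftoppm") is None and which("pdftocairo") is None:
        problems.append("Poppler not found on PATH (pdftoppm/pdftocairo)")
    # An empty value is treated the same as a missing variable.
    if not env.get("GOOGLE_API_KEY"):
        problems.append("GOOGLE_API_KEY is not set")
    return problems
```

Calling `missing_ocr_prerequisites()` with no arguments inspects the current process environment and `PATH`; the two parameters exist only so the check can be exercised in tests.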

### Development Installation

```bash
git clone 
cd rag-knowledge-preparation-python
pip install -e ".[dev]"
```

## Quick Start

### Document Processing

```python
from rag_knowledge_preparation import (
    convert_document_to_markdown,
    convert_scanned_document_to_markdown,
    convert_documents_batch
)

# Convert a single document (GOOGLE_API_KEY env var must be set)
markdown_content = convert_document_to_markdown("document.pdf")

# Convert a scanned document with OCR
scanned_content = convert_scanned_document_to_markdown("scanned_document.pdf")

# Process multiple documents
results = convert_documents_batch(["doc1.pdf", "doc2.pdf"])

# DOCX/text/CSV/Markdown are handled locally (no API key needed)
docx_md = convert_document_to_markdown("report.docx")
notes_md = convert_document_to_markdown("notes.md")

# Images go through Gemini OCR (needs GOOGLE_API_KEY)
image_md = convert_document_to_markdown("whiteboard.png")
```

### Codebase Analysis

```python
from rag_knowledge_preparation import (
    export_codebase_to_markdown,
    analyze_codebase_structure,
    get_codebase_overview
)

# Export entire codebase to Markdown
output_file = export_codebase_to_markdown("./my_project", "codebase_export.md")

# Analyze codebase structure
structure = analyze_codebase_structure("./my_project")

# Get high-level overview
overview = get_codebase_overview("./my_project")
```

## Document Processing Details

### Supported Formats

- **PDF**: Gemini OCR (rasterized to images under the hood)
- **Images**: PNG, JPG/JPEG, TIFF, BMP, GIF, WebP (Gemini OCR)
- **DOCX**: Parsed to Markdown via `python-docx` (no OCR required)
- **Text/Markdown/CSV**: Read directly with encoding auto-detection (no OCR required)

### Processing Presets

#### Basic Processing

```python
from rag_knowledge_preparation import convert_document_to_markdown

# Basic, lightweight OCR (lower DPI + fewer tokens)
content = convert_document_to_markdown(
    "document.pdf", 
    processing_preset="basic"
)
```

#### Standard Document Processing

```python
# Balanced Gemini OCR (default prompt/DPI)
content = convert_document_to_markdown(
    "document.pdf", 
    processing_preset="standard"
)
```

#### OCR-Heavy Processing

```python
# Higher DPI and retries for tough scans
content = convert_document_to_markdown(
    "scanned_document.pdf", 
    processing_preset="ocr_heavy"
)
```

#### Table-Focused Processing

```python
# Table-aware prompt for documents with dense tabular content
content = convert_document_to_markdown(
    "data_heavy_document.pdf", 
    processing_preset="table_focused"
)
```

#### High-Quality Processing

```python
# Maximum quality with highest DPI and token limits
content = convert_document_to_markdown(
    "important_document.pdf", 
    processing_preset="high_quality"
)
```

### Custom Configuration

```python
from rag_knowledge_preparation import convert_document_to_markdown

# Custom configuration
content = convert_document_to_markdown(
    "document.pdf",
    processing_preset="standard",
    dpi=350,
    page_selection="1-5,8",
    temperature=0.15,
    max_output_tokens=5000
)
```
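
The `page_selection` string combines single pages and inclusive ranges. As an illustration of how such a value could be expanded, here is a small sketch; `parse_page_selection` is a hypothetical helper, and the assumption that pages are 1-based and ranges are inclusive is not confirmed by the library.

```python
def parse_page_selection(selection: str) -> list[int]:
    """Expand a string such as "1-5,8" into sorted, unique page numbers."""
    pages: set[int] = set()
    for part in selection.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"Descending range: {part!r}")
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    # Assumption: page numbering starts at 1.
    if any(page < 1 for page in pages):
        raise ValueError("Page numbers are assumed to start at 1")
    return sorted(pages)


print(parse_page_selection("1-5,8"))  # [1, 2, 3, 4, 5, 8]
```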

### Batch Processing

```python
from rag_knowledge_preparation import convert_documents_batch, convert_folder_to_markdown

# Process multiple files
results = convert_documents_batch([
    "document1.pdf",
    "document2.pdf"
])

# Process entire folder
folder_results = convert_folder_to_markdown("./documents/")
```

### Working with non-PDF inputs

```python
from rag_knowledge_preparation import convert_document_to_markdown

# Images -> Gemini OCR (needs GOOGLE_API_KEY)
image_markdown = convert_document_to_markdown("whiteboard.png")

# DOCX -> parsed locally, no OCR/API key required
docx_markdown = convert_document_to_markdown("report.docx")

# Text/Markdown/CSV -> pass-through
notes_markdown = convert_document_to_markdown("notes.txt")
```

Folder/batch helpers (`convert_documents_batch`, `convert_folder_to_markdown`) automatically pick up all supported extensions.
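
Because only PDFs and images go through Gemini, it can be useful to know up front which files in a batch will need an API key. The sketch below partitions paths by extension using the formats listed under "Supported Formats". The helper `split_by_route` and the exact extension sets (for example `.txt`, `.md`, `.csv` for text inputs) are assumptions for illustration, not the library's internal table.

```python
from pathlib import Path

# Extension groups read off the "Supported Formats" list above.
OCR_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".gif", ".webp"}
LOCAL_EXTENSIONS = {".docx", ".txt", ".md", ".csv"}


def split_by_route(paths):
    """Partition paths into (needs_ocr, local_only, unsupported) lists."""
    needs_ocr, local_only, unsupported = [], [], []
    for path in map(Path, paths):
        suffix = path.suffix.lower()  # match extensions case-insensitively
        if suffix in OCR_EXTENSIONS:
            needs_ocr.append(path)
        elif suffix in LOCAL_EXTENSIONS:
            local_only.append(path)
        else:
            unsupported.append(path)
    return needs_ocr, local_only, unsupported
```

If `needs_ocr` comes back empty, the batch should not require `GOOGLE_API_KEY` at all.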

## Codebase Analysis Usage

### Basic Analysis

```python
from rag_knowledge_preparation import analyze_codebase_structure

# Analyze codebase structure
structure = analyze_codebase_structure("./my_project")

print(f"Total files: {structure['total_files']}")
print(f"Total lines: {structure['total_lines']}")
print(f"Languages: {structure['languages']}")
```

### Export to Markdown

```python
from rag_knowledge_preparation import export_codebase_to_markdown

# Export with default settings
output_file = export_codebase_to_markdown("./my_project")

# Export with custom output file
output_file = export_codebase_to_markdown(
    "./my_project", 
    output_file="my_codebase.md"
)
```

### AI-Powered Analysis

```python
from rag_knowledge_preparation import export_codebase_to_markdown

# Export with AI summaries (requires Gemini API key)
output_file = export_codebase_to_markdown(
    "./my_project",
    gemini_api_key="your-google-api-key",
    gemini_model="gemini-2.5-flash"
)
```

### Codebase Processing Presets

#### Minimal Processing

```python
from rag_knowledge_preparation import export_codebase_to_markdown

# Minimal processing - basic analysis only
output_file = export_codebase_to_markdown(
    "./my_project", 
    processing_preset="minimal"
)
```

#### Standard Processing

```python
# Standard processing with full analysis
output_file = export_codebase_to_markdown(
    "./my_project", 
    processing_preset="standard"
)
```

#### Comprehensive Processing

```python
# Comprehensive processing with all features
output_file = export_codebase_to_markdown(
    "./my_project", 
    processing_preset="comprehensive"
)
```

### Configuration Options

```python
from rag_knowledge_preparation import (
    CodebaseProcessingConfig,
    MetadataConfig,
    export_codebase_to_markdown
)

# Custom configuration
config = CodebaseProcessingConfig(
    max_file_size_mb=2.0,
    include_test_files=False,
    include_documentation=True,
    enable_ai_summary=True,
    gemini_api_key="your-api-key",
    custom_ignore_patterns=["*.log", "temp/*"]
)

# Custom metadata configuration
metadata_config = MetadataConfig(
    include_file_path=True,
    include_language=True,
    include_purpose=True,
    include_dependencies=True,
    include_structure=True,
    include_summary=True
)

config.metadata_config = metadata_config

# Use custom configuration
output_file = export_codebase_to_markdown(
    "./my_project",
    processing_preset="standard",  # apply overrides on top of the standard preset
    **config.model_dump()
)
```

### Project-aware metadata & aliases

You can enrich every exported file with project context so downstream RAG systems can ground answers:

```python
from rag_knowledge_preparation import CodebaseProcessingConfig, MetadataConfig

config = CodebaseProcessingConfig(
    project_name="EIC AI Knowledge Utils",
    project_aliases=["EIC-AI", "Knowledge Utils"],
    project_description="Utilities that prep internal knowledge for RAG pipelines.",
    metadata_config=MetadataConfig(
        include_project_description=True,
        include_project_aliases=True,
        include_file_aliases=True
    )
)
```

The exporter now injects the project name, aliases, optional description, and a set of handy file aliases (for example, `Project::path/to/file.py`). The Gemini prompt receives this context, yet the summaries stay concise because the Metadata block already lists project and path information.

`MetadataConfig` ships with four new toggles: `include_project_name`, `include_project_aliases`, and `include_file_aliases` default to `True`, while `include_project_description` defaults to `False`. Disable any of them if you prefer leaner metadata blocks.

## Advanced Features

### Language Detection and Classification

The library automatically detects programming languages and classifies files by purpose:

```python
from rag_knowledge_preparation.codebase_processing.analysis import (
    get_language_from_extension,
    classify_file_by_purpose
)

# Detect language from file extension
language = get_language_from_extension("script.py")  # Returns "python"

# Classify file by purpose
purpose = classify_file_by_purpose("test_utils.py")  # Returns "Tests"
```

### Dependency Analysis

```python
from pathlib import Path
from rag_knowledge_preparation.codebase_processing.analysis import analyze_file_dependencies

# Analyze dependencies in a Python file
with open("main.py", "r", encoding="utf-8") as f:
    content = f.read()
dependencies = analyze_file_dependencies(content, Path("main.py"), "python")

print("External packages:", dependencies["external_packages"])
print("Standard library:", dependencies["standard_library"])
print("Internal modules:", dependencies["internal_modules"])
```

### Code Structure Extraction

```python
from pathlib import Path
from rag_knowledge_preparation.codebase_processing.analysis import extract_code_structure

# Extract structure from code file
code_content = """
class MyClass:
    def __init__(self):
        pass
    
    def method(self):
        pass
"""
structure = extract_code_structure(Path("example.py"), "python", code_content)

print("Classes:", structure["classes"])
print("Functions:", structure["functions"])
```
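
For Python sources, a comparable outline can be produced with the standard `ast` module. This is only a sketch of the idea: the library parses code with Tree-sitter, and the shape of the dictionary returned by the hypothetical `outline` function below is an assumption, not the format `extract_code_structure` actually returns.

```python
import ast


def outline(source: str) -> dict:
    """List class names plus methods and module-level functions."""
    tree = ast.parse(source)
    function_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    classes, functions = [], []
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            classes.append(node.name)
            # Methods are reported as "Class.method" to keep them distinct.
            functions.extend(
                f"{node.name}.{item.name}"
                for item in node.body
                if isinstance(item, function_types)
            )
        elif isinstance(node, function_types):
            functions.append(node.name)
    return {"classes": classes, "functions": functions}
```

Only top-level classes and functions are visited; nested definitions are ignored to keep the sketch short.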

### Token Estimation

```python
from rag_knowledge_preparation.codebase_processing.analysis import estimate_token_count

# Estimate tokens in text
token_count = estimate_token_count("Hello, world!")
print(f"Estimated tokens: {token_count}")

# Estimate tokens in code
code_tokens = estimate_token_count("""
def hello():
    print("Hello, world!")
""")
```
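
Token counts are typically used to keep chunks within a model's context budget. The sketch below shows one way to use them: greedily grouping items into batches that each stay under a budget. The function `pack_into_batches` is hypothetical, and the counts are hard-coded here; in practice they would come from `estimate_token_count`.

```python
def pack_into_batches(token_counts: dict, budget: int) -> list:
    """Greedily group items so each batch stays within `budget` tokens."""
    batches = []
    current = []
    used = 0
    for name, tokens in token_counts.items():
        if tokens > budget:
            raise ValueError(f"{name} alone needs {tokens} tokens (budget {budget})")
        # Start a new batch when adding this item would exceed the budget.
        if current and used + tokens > budget:
            batches.append(current)
            current, used = [], 0
        current.append(name)
        used += tokens
    if current:
        batches.append(current)
    return batches


counts = {"a.py": 400, "b.py": 500, "c.py": 300, "d.py": 700}
print(pack_into_batches(counts, budget=1000))  # [['a.py', 'b.py'], ['c.py', 'd.py']]
```

The greedy pass preserves input order, which keeps related files together at the cost of sometimes leaving a batch partly empty.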

## Configuration Reference

### Document Processing Configuration

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `model_name` | str | "gemini-2.5-pro" | Gemini multimodal model for OCR |
| `prompt` | str | Markdown prompt | Per-page OCR extraction prompt |
| `temperature` | float | 0.2 | Model temperature |
| `max_output_tokens` | int | 4096 | Max tokens per page generation |
| `dpi` | int | 300 | DPI used when rasterizing PDFs |
| `page_selection` | Optional[str] | None | Page ranges, e.g. `"1-3,5"` |
| `parallel_concurrency` | int | 5 | Pages processed concurrently |
| `max_retries` | int | 4 | Retry attempts for transient errors |
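
To illustrate how `parallel_concurrency` and `max_retries` can interact, here is a minimal `asyncio` sketch of bounded concurrency with retries. It is not the library's implementation: `process_pages`, `worker`, and `base_delay` are hypothetical names, the sketch retries on any exception rather than only transient ones, and it assumes `max_retries` counts attempts after the first.

```python
import asyncio


async def process_pages(pages, worker, parallel_concurrency=5, max_retries=4, base_delay=0.5):
    """Run `worker(page)` for every page with bounded concurrency and retries."""
    semaphore = asyncio.Semaphore(parallel_concurrency)

    async def run_one(page):
        async with semaphore:  # at most `parallel_concurrency` pages in flight
            for attempt in range(max_retries + 1):
                try:
                    return await worker(page)
                except Exception:
                    if attempt == max_retries:
                        raise  # retries exhausted, surface the error
                    # Exponential backoff: base_delay, then 2x, 4x, ...
                    await asyncio.sleep(base_delay * 2**attempt)

    # gather preserves input order, so results line up with `pages`.
    return await asyncio.gather(*(run_one(page) for page in pages))
```

A caller would supply an async `worker` that performs the per-page OCR request and run the whole thing with `asyncio.run(process_pages(pages, worker))`.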

### Codebase Processing Configuration

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `max_file_size_mb` | float | 1.0 | Maximum file size to process |
| `include_hidden_files` | bool | False | Include hidden files |
| `include_test_files` | bool | True | Include test files |
| `include_documentation` | bool | True | Include documentation files |
| `include_config_files` | bool | True | Include configuration files |
| `include_static_assets` | bool | False | Include binary/static assets (images, fonts, etc.) |
| `enable_structure_analysis` | bool | True | Enable code structure analysis |
| `enable_ai_summary` | bool | True | Enable AI-powered summaries |
| `gemini_api_key` | str | None | Google Gemini API key |
| `gemini_model` | str | "gemini-2.5-flash" | Gemini model to use |
| `custom_ignore_patterns` | List[str] | None | Custom ignore patterns |
| `project_name` | Optional[str] | None | Override for the primary project name used in summaries and metadata |
| `project_aliases` | List[str] | [] | Additional aliases that will also be emitted in metadata |
| `project_description` | Optional[str] | None | Short description included in metadata when enabled |
| `duplicate_tracker` | Optional[DuplicateTracker] | None | Track duplicate files across multiple exports |
| `duplicate_content_strategy` | `"full"`/`"link"` | "full" | `"full"` embeds repeated file content in every export; `"link"` replaces it with a link to the shared report |
| `exclude_directories` | List[str] | None | Directory names to skip entirely (case-insensitive) |
| `exclude_file_extensions` | List[str] | None | Extensions (e.g. `.log`) to skip |

### Codebase Presets

| Preset | Description | Key Differences |
|--------|-------------|-----------------|
| `minimal` | Focus on essential source files only | Skips tests/docs/configs, disables AI summaries, emits only file path + language metadata |
| `standard` | Balanced default for most repos | Includes tests/docs/configs, AI summaries enabled, metadata covers project/file aliases and structure |
| `comprehensive` | Deep dive for large audits | Higher size limit (5 MB), hidden files allowed, emits every metadata field (dates, encoding, git info, etc.) |

Use `processing_preset="<name>"` when calling `export_codebase_to_markdown`. You can still override any field via `**CodebaseProcessingConfig(...).model_dump()` if a preset needs tweaks.
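
The preset-plus-overrides behaviour can be pictured as a dictionary merge in which explicit keyword arguments win. The sketch below is an assumption about the merge order, not the library's code; `PRESETS` and `resolve_settings` are hypothetical, and the values are read off the table above. If the merge does work this way, note that a plain `model_dump()` includes every field, defaults included, so it would override the whole preset; pydantic's `model_dump(exclude_unset=True)` limits the dump to fields that were set explicitly.

```python
# Values taken from the preset table; the real presets may set more fields.
PRESETS = {
    "minimal": {
        "include_test_files": False,
        "include_documentation": False,
        "include_config_files": False,
        "enable_ai_summary": False,
    },
    "standard": {
        "include_test_files": True,
        "include_documentation": True,
        "include_config_files": True,
        "enable_ai_summary": True,
    },
    "comprehensive": {"max_file_size_mb": 5.0, "include_hidden_files": True},
}


def resolve_settings(preset: str, **overrides) -> dict:
    """Start from the preset, then let explicit keyword overrides win."""
    if preset not in PRESETS:
        raise ValueError(f"Unknown preset: {preset!r}")
    return {**PRESETS[preset], **overrides}


settings = resolve_settings("minimal", enable_ai_summary=True)
print(settings["enable_ai_summary"], settings["include_test_files"])  # True False
```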

### Tracking shared files

You can collect repeated scripts/configs that appear across many projects using `DuplicateTracker` and optionally replace duplicated content with shared references:

```python
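from pathlib import Path

from rag_knowledge_preparation import export_codebase_to_markdown, CodebaseProcessingConfig
from rag_knowledge_preparation.codebase_processing.utils import DuplicateTracker

tracker = DuplicateTracker(min_occurrences=2)

# Hypothetical project directories; replace with your own paths
projects = [Path("./project_a"), Path("./project_b")]

for project in projects: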
    export_codebase_to_markdown(
        project,
        output_file=f"{project.name}.md",
        processing_preset="standard",
        duplicate_tracker=tracker,
        duplicate_content_strategy="link",  # swap file bodies with references to shared digest
    )

if tracker.has_duplicates():
    tracker.write_markdown_report("shared_files.md")
```

The generated `shared_files.md` lists every repeated file, its digest, language, and all project locations, so you can link to a single canonical snippet instead of duplicating boilerplate. Duplicate detection uses a raw hash of the file contents (byte-for-byte match). Set `duplicate_content_strategy="link"` in `CodebaseProcessingConfig` to replace duplicate files in the per-project exports with a short sentence that points to the shared digest instead of embedding their full content.
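
Byte-for-byte duplicate detection of this kind can be sketched with the standard library alone. The example below groups files by a digest of their raw bytes; `find_duplicates` is a hypothetical helper, and SHA-256 is an assumption, since the hash algorithm used by `DuplicateTracker` is not stated here.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicates(paths, min_occurrences=2):
    """Map content digest -> paths for files whose bytes match exactly."""
    by_digest = defaultdict(list)
    for path in map(Path, paths):
        # SHA-256 is an assumption; any stable content hash would do.
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        by_digest[digest].append(path)
    # Keep only digests seen at least `min_occurrences` times.
    return {
        digest: group
        for digest, group in by_digest.items()
        if len(group) >= min_occurrences
    }
```

Because the comparison is on raw bytes, files that differ only in whitespace or line endings are treated as distinct.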

## Error Handling

The library provides comprehensive error handling with custom exceptions:

```python
from rag_knowledge_preparation import (
    convert_document_to_markdown,
    RAGKnowledgePreparationError,
    DocumentNotFoundError,
    ConfigurationError,
    ConversionError,
    UnsupportedFormatError
)

try:
    content = convert_document_to_markdown("nonexistent.pdf")
except DocumentNotFoundError as e:
    print(f"Document not found: {e}")
except ConversionError as e:
    print(f"Conversion failed: {e}")
except ConfigurationError as e:
    print(f"Configuration error: {e}")
```

## Performance Considerations

### Large File Processing

The library includes built-in optimizations for large files:

- **File Size Limits**: Configurable maximum file size limits
- **Memory Efficiency**: Streaming processing for large documents
- **Batch Processing**: Efficient processing of multiple files
- **Parallel Processing**: Concurrent processing where possible

### Performance Modes

```python
from rag_knowledge_preparation import CodebaseProcessingConfig

# Use performance-optimized settings
config = CodebaseProcessingConfig(
    max_file_size_mb=0.5,  # Smaller file limit
    enable_ai_summary=False,  # Disable AI for speed
    enable_structure_analysis=False  # Disable structure analysis
)
```

## Examples

### Complete Document Processing Pipeline

```python
from pathlib import Path

from rag_knowledge_preparation import (
    convert_folder_to_markdown,
    list_document_configs
)

# List available configurations
configs = list_document_configs()
print("Available configurations:", list(configs.keys()))

# Process entire document folder
results = convert_folder_to_markdown(
    "./documents/",
    processing_preset="high_quality"
)

# Save results (Path(...).name handles platform-specific separators)
for file_path, content in results.items():
    output_path = f"processed_{Path(file_path).name}.md"
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(content)
```

### Complete Codebase Analysis Pipeline

```python
from rag_knowledge_preparation import (
    export_codebase_to_markdown,
    analyze_codebase_structure,
    get_codebase_overview,
    list_available_codebase_configs
)

# List available configurations
configs = list_available_codebase_configs()
print("Available configurations:", list(configs.keys()))

# Get overview
overview = get_codebase_overview("./my_project")
print(f"Project: {overview['name']}")
print(f"Files: {overview['total_files']}")
print(f"Languages: {overview['languages']}")

# Analyze structure
structure = analyze_codebase_structure("./my_project")
print(f"Structure analysis complete: {structure['total_files']} files processed")

# Export to Markdown
output_file = export_codebase_to_markdown(
    "./my_project",
    output_file="project_analysis.md",
    gemini_api_key="your-api-key"
)
print(f"Exported to: {output_file}")
```

## Acknowledgments

- Gemini OCR stack powered by [LangChain Google Gemini](https://python.langchain.com/docs/integrations/chat/google_generative_ai)
- [Tree-sitter](https://tree-sitter.github.io/) for code parsing
- [Google Gemini](https://ai.google.dev/) for AI-powered summarization
- [Pygments](https://pygments.org/) for syntax highlighting and language detection

## Changelog

### Version 1.0.0

- Initial release
- Document processing with OCR support
- Codebase analysis and export
- AI-powered summarization
- Comprehensive configuration options
- Multi-language support
