Metadata-Version: 2.4
Name: toporag
Version: 0.1.0
Summary: A topological data analysis library for detecting knowledge gaps in RAG systems.
Project-URL: Homepage, https://github.com/MuLIAICHI/toporag
Author: Mustapha LIAICHI
License: MIT
Requires-Python: >=3.9
Requires-Dist: beautifulsoup4>=4.10.0
Requires-Dist: httpx>=0.20.0
Requires-Dist: numpy>=1.20.0
Requires-Dist: openai>=1.0.0
Requires-Dist: ripser>=0.6.0
Requires-Dist: scikit-learn>=1.0.0
Description-Content-Type: text/markdown

# TopoRAG

TopoRAG is a Topological Data Analysis (TDA) library that detects conceptual gaps and blind spots in text documents. It is particularly useful for evaluating the context retrieved in Retrieval-Augmented Generation (RAG) pipelines.

Using persistent homology (via Ripser), it identifies topological "holes" in the semantic embedding space of your text, then uses an LLM to label the concepts missing from those holes.

## Installation

```bash
pip install toporag
```

*(Note: the library is currently in development, so you may need to install from source instead:)*
```bash
git clone <repository>
cd toporag_lib
pip install -e .
```

## Setup

TopoRAG needs an OpenAI API key, which it uses for text embeddings (`text-embedding-3-small`) and gap labeling (`gpt-4o-mini`). Set it as an environment variable:

```bash
export OPENAI_API_KEY="sk-..."
```

## Basic Usage

```python
import asyncio
from toporag import TopoAnalyzer

async def main():
    analyzer = TopoAnalyzer()  # Automatically picks up OPENAI_API_KEY
    
    texto = """
    We started the project with great enthusiasm. The team was assembled, 
    requirements were gathered, and we had a solid plan for the architecture.
    
    Finally, we deployed the application to production and celebrated our success. 
    The customers loved the final result and our metrics improved dramatically.
    """
    
    gaps = await analyzer.analyze_text(texto)
    for gap in gaps:
        print(f"Missing Topic: {gap['topic_label']}")
        print(f"Explanation: {gap['explanation']}")
        print("---")

if __name__ == "__main__":
    asyncio.run(main())
```

## API

### `TopoAnalyzer.analyze_text(text: str, threshold: float = 0.15, max_holes: int = 5, generate_suggestions: bool = True)`
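Splits the text into segments, embeds each segment, keeps the topological gaps whose persistence exceeds `threshold`, and asks the LLM to label up to `max_holes` of them. The call must be awaited. Each returned gap exposes `topic_label` and `explanation` keys, as shown in Basic Usage.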
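
The interaction between `threshold` and `max_holes` can be sketched as a filter over (birth, death) pairs. This is an illustration only: the helper `select_gaps` is hypothetical, and the strict inequality and the most-persistent-first ordering are assumptions rather than confirmed library behavior.

```python
from typing import List, Tuple


def select_gaps(
    pairs: List[Tuple[float, float]],
    threshold: float = 0.15,
    max_holes: int = 5,
) -> List[Tuple[float, float]]:
    """Keep (birth, death) pairs whose persistence exceeds threshold.

    Hypothetical helper: results are ordered most persistent first and
    truncated to max_holes entries.
    """
    significant = [(b, d) for b, d in pairs if (d - b) > threshold]
    significant.sort(key=lambda p: p[1] - p[0], reverse=True)
    return significant[:max_holes]


# Made-up persistence pairs, purely for illustration.
example_pairs = [(0.10, 0.12), (0.20, 0.90), (0.30, 0.50), (0.05, 0.15)]
selected = select_gaps(example_pairs, threshold=0.15, max_holes=2)
print(selected)
```

Raising `threshold` discards weaker gaps before any LLM call is made, while `max_holes` caps how many of the remaining gaps are labeled.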

### `TopoAnalyzer.analyze_url(url: str, ...)`
Fetches the page at `url`, extracts its readable text, and runs the topological analysis on that text.
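
As a rough picture of what "readable text" extraction involves, the sketch below strips markup, scripts, and styles from an HTML string. It deliberately uses Python's standard-library `html.parser` rather than the project's `beautifulsoup4` dependency so that it runs without extra installs; the class and function names are hypothetical and do not reflect TopoRAG's actual scraper.

```python
from html.parser import HTMLParser


class ReadableTextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Only keep non-empty text that is outside skipped elements.
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_readable_text(html: str) -> str:
    """Return the visible text of an HTML document, one chunk per line."""
    parser = ReadableTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample_html = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Title</h1><p>First paragraph.</p>"
    "<script>var x = 1;</script></body></html>"
)
print(extract_readable_text(sample_html))
```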

### `TopoAnalyzer.analyze_segments(segments: List[str], ...)`
Runs the analysis directly on a pre-chunked list of strings.
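
If you already control chunking, for example because your RAG pipeline has its own splitter, a pre-chunked list might be produced along these lines. The helper `split_into_segments` and its `min_chars` parameter are hypothetical; TopoRAG's own segmentation strategy may differ.

```python
import re
from typing import List


def split_into_segments(text: str, min_chars: int = 20) -> List[str]:
    """Split text on blank lines and drop fragments shorter than min_chars."""
    paragraphs = re.split(r"\n\s*\n", text)
    # Collapse internal whitespace so each segment is a single clean line.
    cleaned = (" ".join(p.split()) for p in paragraphs)
    return [p for p in cleaned if len(p) >= min_chars]


document = """
We started the project with great enthusiasm. The team was assembled.

Finally, we deployed the application to production.

Ok.
"""
segments = split_into_segments(document)
print(len(segments))
```

The resulting `segments` list is what you would hand to `analyze_segments`, presumably awaited in the same way as `analyze_text` in the Basic Usage example.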
