Metadata-Version: 2.4
Name: knowledge-engine-backend
Version: 0.1.0
Summary: Pure-Python port of the Knowledge Engine backend
Author: Knowledge Engine
License: MIT
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: requests>=2.31
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"

# Knowledge Engine Backend

A powerful, pure-Python search and knowledge retrieval engine designed to run in-process. It combines focused web crawling, vector-based indexing, and LLM-powered answer generation.

---

## 👥 For Humans

### Features
- **In-Process Architecture**: Runs directly within your Python application. No external server orchestration required.
- **Smart Crawling**: Includes a politeness-aware focused crawler with frontier management.
- **Vector Search**: Local vector store for semantic search capabilities.
- **LLM Integration**: Seamless integration with **Ollama** (local) and **OpenAI** (cloud) for RAG (Retrieval-Augmented Generation).
- **Configurable**: Fully customizable via environment variables or configuration objects.

### Installation

1. **Clone the repository:**
   ```bash
   git clone <repository-url>
   cd knowledge_engine_backend
   ```

2. **Install dependencies:**
   Using a virtual environment is recommended.
   ```bash
   pip install -e .
   ```

   For development dependencies (currently `pytest` for testing):
   ```bash
   pip install -e ".[dev]"
   ```

### Quick Start

```python
from knowledge_engine_backend import KnowledgeEngineApp

# Initialize the application
app = KnowledgeEngineApp()

# 1. Start crawling a target website (runs in background)
app.start_crawl("https://example.com")

# 2. Search the indexed content
# Note: results may be empty until the background crawl has indexed some pages
results = app.search("example domain")
print(results)

# 3. Generate an answer using LLM (RAG)
# Note: Ensure LLM environment variables are set correctly
# answer = app.generate("What is the main purpose of this website?")
# print(answer)
```

### Configuration

Configure the engine using environment variables or a `Config` object.

| Variable | Default | Description |
|----------|---------|-------------|
| `LLM_PROVIDER` | `ollama` | Provider backend (`ollama` or `openai`) |
| `LLM_BASE_URL` | `http://localhost:11434` | Base URL of the LLM API (for Ollama, endpoints such as `/api/generate` live under it) |
| `LLM_MODEL` | `qwen3:1.7b` | Model name to use |
| `LLM_API_KEY` | - | API Key (required for OpenAI) |
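
The variables in the table can also be resolved from Python. The sketch below is only an illustration of the documented defaults: `load_llm_settings` is a hypothetical helper, not the package's actual `config.py` logic, and it assumes the settings are read from `os.environ`.

```python
import os

# Defaults copied from the configuration table above.
DEFAULTS = {
    "LLM_PROVIDER": "ollama",
    "LLM_BASE_URL": "http://localhost:11434",
    "LLM_MODEL": "qwen3:1.7b",
    "LLM_API_KEY": "",
}


def load_llm_settings(env=None):
    """Hypothetical helper: resolve LLM settings from a mapping of variables."""
    env = os.environ if env is None else env
    settings = {name: env.get(name, default) for name, default in DEFAULTS.items()}

    # The table lists only two providers.
    if settings["LLM_PROVIDER"] not in ("ollama", "openai"):
        raise ValueError(f"Unsupported LLM_PROVIDER: {settings['LLM_PROVIDER']!r}")

    # The table marks the API key as required for OpenAI.
    if settings["LLM_PROVIDER"] == "openai" and not settings["LLM_API_KEY"]:
        raise ValueError("LLM_API_KEY is required when LLM_PROVIDER is 'openai'")

    return settings
```

Calling `load_llm_settings()` with no arguments reads the real environment; passing a plain dictionary makes the behaviour easy to check in a test without touching `os.environ`.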

---

## 🤖 For AI Agents

This section provides structural and contextual information to assist AI agents in understanding, maintaining, and extending this codebase.

### Project Structure

- **`knowledge_engine_backend/`**: Core package source code.
  - **`app.py`**: High-level entry point (`KnowledgeEngineApp`). Facade for the system.
  - **`engine.py`**: Orchestrator. Defines the `KnowledgeEngine` class, which connects the components.
  - **`fetcher.py`**: Handles HTTP requests, content extraction, and parsing.
  - **`frontier.py`**: Manages crawl queues, prioritization, and visited sets.
  - **`storage.py`**: File-based persistence layer for raw data.
  - **`search.py`**: Vector embedding and similarity search logic.
  - **`provider.py`**: LLM interface implementations (OpenAI, Ollama).
  - **`config.py`**: Configuration schema and loading logic.
- **`scripts/`**: Utility and test scripts.
  - **`e2e_ollama.py`**: End-to-end script that exercises the full pipeline with Ollama.
- **`data/`**: Directory for storing raw JSON data artifacts from crawls.
- **`tests/`**: Pytest suite for unit and integration testing.
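
To make the role of `frontier.py` more concrete, here is a minimal sketch of a crawl frontier that combines a priority queue with a visited set. `ToyFrontier` and its methods are hypothetical names chosen for illustration; the real module may be structured quite differently.

```python
import heapq
from typing import Optional


class ToyFrontier:
    """Illustrative crawl frontier: a priority queue plus a visited set."""

    def __init__(self) -> None:
        self._heap: list[tuple[int, int, str]] = []
        self._seen: set[str] = set()
        self._counter = 0  # tie-breaker: equal priorities keep insertion order

    def add(self, url: str, priority: int = 0) -> bool:
        """Queue a URL at most once; lower priority values are served first."""
        if url in self._seen:
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, (priority, self._counter, url))
        self._counter += 1
        return True

    def pop(self) -> Optional[str]:
        """Return the next URL to crawl, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

The visited set is checked on insertion rather than on removal, so a URL discovered many times occupies only one queue slot.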

### Component Architecture

The `KnowledgeEngine` follows a modular component-based architecture:
1.  **Frontier** (`frontier.py`): Supplies URLs to be processed.
2.  **Fetcher** (`fetcher.py`): Downloads and parses HTML content.
3.  **Storage** (`storage.py`): Saves raw document data.
4.  **VectorStore** (`search.py`): Indexes document embeddings for retrieval.
5.  **LLMProvider** (`provider.py`): Interfaces with AI models for generation tasks.
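
The flow through these five components can be sketched with toy stand-ins. Everything below is an assumption made for illustration: the function names are hypothetical, the "embedding" is a plain bag-of-words count rather than a real model, and the LLM call is replaced by building the prompt that would be sent to a provider.

```python
import math
from collections import Counter, deque


def embed(text):
    """Toy embedding: term counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def run_pipeline(seed_urls, fetch):
    """Frontier -> fetcher -> storage -> vector index, in that order."""
    frontier = deque(seed_urls)        # 1. Frontier supplies URLs
    storage, index = {}, {}
    while frontier:
        url = frontier.popleft()
        text = fetch(url)              # 2. Fetcher returns parsed text
        storage[url] = text            # 3. Storage keeps the raw document
        index[url] = embed(text)       # 4. VectorStore indexes its embedding
    return storage, index


def search(index, query, top_k=3):
    """Rank indexed URLs by similarity to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda url: cosine(q, index[url]), reverse=True)
    return ranked[:top_k]


def build_prompt(question, storage, hits):
    """5. Context that would be handed to an LLM provider (model call omitted)."""
    context = "\n".join(storage[url] for url in hits)
    return f"Context:\n{context}\n\nQuestion: {question}"
```

Because `fetch` is passed in as a callable, the sketch can be exercised offline by supplying a dictionary lookup (for example `pages.get`) in place of a real HTTP fetcher.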

### Development & Maintenance Tasks

**Running Tests:**
Execute the test suite using `pytest`. Ensure `dev` dependencies are installed.
```bash
pytest
```

**Running E2E Scripts:**
To verify the full pipeline with Ollama (requires Ollama running locally):
```bash
python scripts/e2e_ollama.py "What is the capital of France?" --source "https://en.wikipedia.org/wiki/France"
```

**Code Style:**
- The project allows `from __future__ import annotations`.
- Type hinting is encouraged for all new methods.
