Metadata-Version: 2.4
Name: stfo-colbert
Version: 0.2.0
Summary: Straightforward ColBERT indexing and serving (via PyLate)
Author-email: Jakub Gajski <jakub.gajski@j-labs.pl>
Maintainer-email: Jakub Gajski <jakub.gajski@gmail.com>
License-Expression: MIT
Project-URL: Repository, https://github.com/j-labs/stfo-colbert
Project-URL: Homepage, https://github.com/j-labs/stfo-colbert
Requires-Python: >=3.12
Description-Content-Type: text/markdown
Requires-Dist: fastapi>=0.115
Requires-Dist: uvicorn>=0.30
Requires-Dist: pylate>=1.3.4
Requires-Dist: pymupdf>=1.26.5
Requires-Dist: langchain-text-splitters>=1.0.0

# stfo-colbert

> Straightforward ColBERT indexing and serving (if you need a development ColBERT server)

## Design Goals

- **Straightforward**: Single-command usage via CLI ("stfo" stands for "straightforward")
- **Minimal**: Readable, functional code with minimal default dependencies
- **Simple**: One HTTP endpoint only: `GET /search`
- **For development usage**: Suitable for anyone who needs an ad hoc semantic search server

## When to Use

Use **stfo-colbert** when you:
- Have a small-to-medium collection and want a simple way to build a ColBERT-style index (via PyLate) and query it over HTTP
- Prefer a one-shot CLI to index and serve, without additional orchestration

## Installation

### From PyPI

```bash
pip install stfo-colbert
```

### From source (development)

```bash
git clone <repository-url>
cd stfo-colbert
pip install -e .
```

## Quickstart

### 1. Install the package

```bash
pip install stfo-colbert
```

### 2. Run the CLI (index and serve)

```bash
stfo-colbert \
  --dataset-path /path/to/dataset.txt
```

### 3. Query the API

```bash
curl "http://127.0.0.1:8889/search?query=hello&k=2"
```

### 4. Example response

```json
{
  "query": "hello",
  "topk": [
    {
      "pid": "1",
      "rank": 0,
      "score": 0.92,
      "text": "Hello world! This is a sample document.",
      "prob": 0.51
    },
    {
      "pid": "2",
      "rank": 1,
      "score": 0.87,
      "text": "A friendly hello from another document.",
      "prob": 0.49
    }
  ]
}
```
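The endpoint can also be queried from Python. The sketch below uses only the standard library and assumes the server is running on the default port; the `search` helper is illustrative, not part of the package.

```python
import json
import urllib.request
from urllib.parse import urlencode

BASE_URL = "http://127.0.0.1:8889"  # default stfo-colbert port

def build_search_url(query: str, k: int = 10) -> str:
    """Build the /search URL with properly encoded parameters."""
    return f"{BASE_URL}/search?" + urlencode({"query": query, "k": k})

def search(query: str, k: int = 10) -> dict:
    """Query a running stfo-colbert server and return the decoded JSON."""
    with urllib.request.urlopen(build_search_url(query, k)) as resp:
        return json.load(resp)

if __name__ == "__main__":
    results = search("hello", k=2)
    for hit in results["topk"]:
        print(hit["rank"], round(hit["score"], 3), hit["text"][:60])
```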

## CLI Reference

```bash
stfo-colbert [options]
```

### Options

| Option | Description | Default |
|--------|-------------|---------|
| `--port` | Port to serve on | `8889` |
| `--model-name` | Hugging Face model id/name | `mixedbread-ai/mxbai-edge-colbert-v0-17m` |
| `--index-path` | Path to an existing PyLate index directory (mutually exclusive with `--dataset-path`) | - |
| `--dataset-path` | Path to dataset for index creation (file or directory; mutually exclusive with `--index-path`) | - |
| `--batch-size` | Batch size for encoding | `64` |
| `--chunk-size` | Number of documents to accumulate before encoding | `10000` |

## Usage Patterns

**Serve an existing index:**
```bash
stfo-colbert --index-path ./experiments/my_index --port 8889
```

**Build from a delimited TXT, then serve:**
```bash
stfo-colbert --dataset-path ./data/my_corpus.txt --port 8889
```

**Build from a directory of docs, then serve:**
```bash
stfo-colbert --dataset-path ./docs_dir --port 8889
```

## Dataset Formats

### 1. Delimited text file (default)

A plain text file where each document is separated by the delimiter: `\n\n--------\n\n`

**Example:**
```
Document one text

--------

Document two text
```

> **Note:** Any occurrences of the delimiter inside documents are removed during preprocessing to avoid boundary confusion.
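To make the format concrete, here is a minimal sketch of how such a file can be split into documents. The `split_documents` helper is illustrative and not the package's actual parser.

```python
# Delimiter used between documents in the TXT format
DELIMITER = "\n\n--------\n\n"

def split_documents(raw: str) -> list[str]:
    """Split a delimited corpus into individual documents,
    dropping empty entries and surrounding whitespace."""
    return [doc.strip() for doc in raw.split(DELIMITER) if doc.strip()]

corpus = "Document one text\n\n--------\n\nDocument two text"
print(split_documents(corpus))  # two documents
```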

### 2. Directory of document files

When `--dataset-path` points to a directory, stfo-colbert will scan for files and create a compressed cache file (`.stfo_colbert_cache.txt.xz`) in that directory. On later runs, this cache is reused instead of re-parsing all files, significantly speeding up initialization.

**Supported file types:**
- `.txt`, `.md`
- `.pdf`

**Cache behavior:**
- The cache file is automatically created after the first directory scan
- To force a re-scan, delete the `.stfo_colbert_cache.txt.xz` file from the dataset directory

## Index Format

stfo-colbert uses PyLate's PLAID index under the hood:
- Loads the model (default: `mixedbread-ai/mxbai-edge-colbert-v0-17m`)
- Encodes documents in chunks and builds an index incrementally
- Serves top-k retrieval via a simple HTTP API

The index directory contains:
- **PLAID index files**: The core PyLate index structure
- **`collection.db`**: A SQLite database mapping document IDs to their text content

### Streaming and Chunked Processing

To handle large datasets efficiently, stfo-colbert processes documents in chunks:
- Documents are streamed from the dataset (not loaded entirely into memory)
- Each chunk is encoded and added to the index incrementally
- The collection mapping is saved to SQLite progressively during indexing
- Default chunk size is 10,000 documents (configurable via `--chunk-size`)

This approach enables indexing of large datasets (e.g., entire Wikipedia) without running out of memory.

When you build an index from documents, stfo-colbert automatically creates the `collection.db` file to enable text retrieval in search results. If you pass `--index-path` with an existing index, search results will include text snippets only if `collection.db` is present in the index directory.

## HTTP API

### `GET /search`

**Parameters:**
- `query` (string, required): The search string
- `k` (integer, optional): Top-k results (default: `10`, max: `100`)

**Response:**
```json
{
  "query": "...",
  "topk": [
    {
      "pid": "<document_id>",
      "score": 0.95,
      "text": "...",
      "prob": 0.87
    }
  ]
}
```

> **Note:** The `text` field is included if the collection mapping is available (e.g., from a delimited TXT or `collection.db`).

## Design Notes

- **Functional approach**: Modules expose pure functions; the CLI composes them
- **Minimal dependencies**: FastAPI for the web layer, Uvicorn ASGI server, PyLate for model+index, PyMuPDF for PDF parsing
- **Persistent caching**: When processing directories, a compressed cache file (`.stfo_colbert_cache.txt.xz`) is saved in the dataset directory for faster subsequent runs

## Development

**Install in editable mode:**
```bash
pip install -e .
```

**Run tests:**
```bash
pip install pytest
pytest
```

## Examples

### Using the included example data

**Index Wikipedia summaries and query for specific topics:**
```bash
# Start the server with Wikipedia summaries
stfo-colbert --dataset-path example_data/wikipedia_summaries.txt

# Query for movies
curl "http://127.0.0.1:8889/search?query=Disney%20animated%20movies&k=3"

# Query for sports
curl "http://127.0.0.1:8889/search?query=Olympic%20track%20and%20field%20events&k=5"
```

**Index arXiv PDFs and search research papers:**
```bash
# Start the server with PDF directory
stfo-colbert --dataset-path example_data/arxiv_sample

# Search for AI/ML topics
curl "http://127.0.0.1:8889/search?query=machine%20learning%20transformers&k=5"

# Search for specific research areas
curl "http://127.0.0.1:8889/search?query=neural%20network%20architecture&k=3"
```

**Index large Wikipedia dataset:**
```bash
# First, download and prepare the Wikipedia 20231101.en dataset
# Note: This is a large dataset (~20 GB) and will take time to download
python example_data/wikipedia_20231101_en.py

# Index the Wikipedia dataset with streaming (handles large datasets efficiently)
# The data is processed in chunks to avoid memory issues, but indexing will still take a long time
stfo-colbert --dataset-path wikipedia_20231101_en_shuffled.txt --chunk-size 10000

# Search for topics in Wikipedia
curl "http://127.0.0.1:8889/search?query=machine%20learning%20history&k=5"
```

The `wikipedia_20231101_en.py` script:
- Downloads the Wikipedia 20231101.en dataset from Hugging Face
- Shuffles it with a buffer size of 100,000 (good for building index centroids)
- Formats it as a delimited text file compatible with stfo-colbert
- Uses streaming to avoid loading the entire dataset into memory

### General usage examples

**Index directory of Markdown notes and serve on port 7777:**
```bash
stfo-colbert --dataset-path ~/notes --port 7777
```

**Serve existing index folder:**
```bash
stfo-colbert --index-path ./experiments/wiki_index --port 8889
```
