Metadata-Version: 2.4
Name: retrievalbase
Version: 2.1.0
Author-email: jalal <jalalkhaldi3@gmail.com>
Requires-Python: <3.13,>=3.11
Requires-Dist: faiss-cpu<2.0.0,>=1.13.2
Requires-Dist: langchain<2.0.0,>=1.2.10
Requires-Dist: minio<8.0.0,>=7.2.20
Requires-Dist: numpy<3.0.0,>=2.4.2
Requires-Dist: openai<3.0.0,>=2.21.0
Requires-Dist: polars<2.0.0,>=1.38.1
Requires-Dist: pydantic-settings<3.0.0,>=2.13.0
Requires-Dist: qdrant-client<2.0.0,>=1.16.2
Requires-Dist: rank-bm25<0.3.0,>=0.2.2
Provides-Extra: torch
Requires-Dist: datasets<5.0.0,>=4.5.0; extra == 'torch'
Requires-Dist: sentence-transformers<6.0.0,>=5.1.2; extra == 'torch'
Requires-Dist: torch<3.0.0,>=2.10.0; extra == 'torch'
Requires-Dist: transformers<6.0.0,>=5.3.0; extra == 'torch'
Provides-Extra: transformers
Requires-Dist: datasets<5.0.0,>=4.5.0; extra == 'transformers'
Requires-Dist: sentence-transformers<6.0.0,>=5.1.2; extra == 'transformers'
Requires-Dist: transformers<6.0.0,>=5.3.0; extra == 'transformers'
Description-Content-Type: text/markdown

# retrievalbase

`retrievalbase` is a typed Python toolkit for building retrieval and evaluation workflows around
structured text datasets.

It provides:

- dataset connectors for loading and saving text corpora,
- Polars-based dataset abstractions,
- configurable preprocessing pipelines,
- retrieval components such as BM25, dense retrieval, reranking, and vector stores,
- evaluation components for scoring retrieval quality,
- a config-driven runtime model based on Pydantic settings and dynamic component loading.

The project is designed around explicit component contracts rather than a single monolithic pipeline.

## Why This Project Exists

Retrieval systems become brittle and hard to evolve when data loading, preprocessing, indexing,
retrieval, reranking, and evaluation are tightly coupled.

This repository separates those concerns into components with clear interfaces:

- connectors handle storage and transport,
- datasets handle schema-aware tabular text data,
- preprocessors transform text datasets,
- retrievers execute candidate selection,
- rerankers refine candidate ordering,
- evaluators measure retrieval quality.

That separation makes it easier to:

- swap backends without rewriting orchestration,
- test behavior in isolation,
- drive runtime composition from config,
- keep experimentation reproducible.

## Core Ideas

### 1. Config-Driven Components

Most runtime objects are built from Pydantic settings models derived from
`FromConfigMixinSettings`.

Each config carries a `module_path` pointing to the concrete runtime class.
The class is resolved dynamically with `retrievalbase.utils.load_class(...)` and instantiated
through `FromConfigMixin`.
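
For illustration, a minimal sketch of the pattern. The import locations, the `from_config` entry
point, and the `top_k` field are assumptions; only `module_path`, `load_class`, and the mixin
names come from the repository.

```python
from retrievalbase.settings import FromConfigMixinSettings  # assumed location
from retrievalbase.utils import load_class


class RetrieverSettings(FromConfigMixinSettings):
    # Dotted path to the concrete runtime class (hypothetical example path).
    module_path: str = "retrievalbase.evaluation.retrievers.BM25Retriever"
    top_k: int = 10  # hypothetical component-specific option


def build_retriever(settings: RetrieverSettings):
    # Resolve the class named by module_path, then construct it through the
    # FromConfigMixin-style entry point.
    cls = load_class(settings.module_path)
    return cls.from_config(settings)  # assumed FromConfigMixin constructor
```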

This pattern is used across:

- connectors,
- preprocessors,
- token counters,
- embedders,
- vector stores,
- rerankers,
- retrievers,
- evaluators,
- ingestion pipelines.

### 2. Typed Interfaces

The repository uses abstract base classes to define stable contracts for component categories.
Concrete implementations extend those contracts and provide backend-specific behavior.

### 3. Polars As The Dataset Backbone

Datasets are represented with Polars `DataFrame` or `LazyFrame` values wrapped in repository
dataset abstractions.

### 4. Text Dataset Contract

Text datasets are expected to contain:

- `page_content`
- `metadata`

Many higher-level components assume that schema.
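
For illustration, a plain Polars frame that satisfies this contract (the repository wraps such
frames in its dataset abstractions; `doc_id` is just an example metadata key):

```python
import polars as pl

# One string column for the text, one struct column for per-row metadata.
frame = pl.DataFrame(
    {
        "page_content": ["hello world", "retrieval base"],
        "metadata": [{"doc_id": "1"}, {"doc_id": "2"}],
    }
)
print(frame.schema)  # page_content: String, metadata: Struct
```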

## Repository Layout

```text
.
├── AGENTS.md
├── Makefile
├── README.md
├── pyproject.toml
├── src/
│   └── retrievalbase/
│       ├── connector/
│       ├── dataset/
│       │   └── preprocess/
│       ├── evaluation/
│       │   ├── evaluators/
│       │   │   └── python/
│       │   ├── retrievers/
│       │   │   └── dense/
│       │   ├── async_batcher.py
│       │   ├── embedders.py
│       │   ├── processors.py
│       │   ├── rerankers.py
│       │   ├── settings.py
│       │   └── vector_stores.py
│       ├── ingestion/
│       ├── constants.py
│       ├── enums.py
│       ├── exceptions.py
│       ├── mixins.py
│       ├── settings.py
│       ├── types.py
│       └── utils.py
└── tests/
    ├── conftest.py
    ├── fixtures/
    │   ├── components.py
    │   └── data.py
    ├── integration/
    │   ├── test_dataset/
    │   └── test_evaluation/
    └── unit/
        ├── test_config/
        ├── test_connector/
        ├── test_dataset/
        ├── test_evaluation/
        ├── test_ingestion/
        └── test_utils/
```

High-level responsibility split:

- `connector/`: load and persist datasets from external systems such as Parquet files and MinIO.
- `dataset/`: base dataset abstractions, Polars adapters, Hugging Face adapter, preprocessing, token counting.
- `evaluation/`: embedders, processors, async batching, vector stores, rerankers, retrievers, Python evaluators.
- `ingestion/`: ingestion pipelines that combine connectors and preprocessors.
- `tests/fixtures/`: reusable test data builders, fake components, and component factories.
- `tests/conftest.py`: global test setup shared across the suite.
- `tests/unit/test_*/`: source-aligned unit test groups for isolated behavior and edge cases.
- `tests/integration/test_*/`: multi-component integration tests grouped by module area.

Testing layout conventions:

- mirror source areas with module-oriented test directories such as `tests/unit/test_dataset` and `tests/integration/test_evaluation`,
- keep reusable component setup out of individual tests and build test components through shared factories in `tests/fixtures`,
- add a local `conftest.py` only when a test group shares setup that should not be global,
- prefer parametrized tests when the same behavior should be validated across multiple inputs or component variants.
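
A hedged sketch of the parametrized-test convention; the `bm25_retriever` fixture and the
`retrieve(query, limit=...)` signature are illustrative, not actual repository APIs:

```python
import pytest


@pytest.mark.parametrize(
    ("query", "expected_doc_id"),
    [
        ("hello", "1"),
        ("retrieval", "2"),
    ],
)
def test_top_result_matches_query(bm25_retriever, query, expected_doc_id):
    # Same behavior validated across multiple inputs with one test body.
    results = bm25_retriever.retrieve(query, limit=1)
    assert results[0].metadata["doc_id"] == expected_doc_id
```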

## Installation

### Requirements

- Python `>=3.11,<3.13`
- `uv` recommended for dependency management and command execution

### Install Production Dependencies

```bash
make install
```

### Install Developer Environment

```bash
make dev-install
```

This installs:

- development dependencies,
- optional extras,
- pre-commit hooks.

## Development Commands

The `Makefile` is the source of truth for local development tasks.

```bash
make format
make lint
make type-check
make security
make test
make test-cov
make ci
make ci-fast
make clean
```

Command meanings:

- `make format`: run `ruff format` and `ruff check --fix`
- `make lint`: run Ruff lint checks
- `make type-check`: run `ty check`
- `make security`: run Bandit
- `make test`: run the test suite
- `make test-cov`: run tests with coverage and enforce the 80% minimum coverage threshold
- `make ci`: the local CI equivalent
- `make ci-fast`: a faster loop that skips the security gate
- `make clean`: remove build artifacts and local caches

For narrow test runs during development, prefer targeting the relevant module directory, for example:

```bash
uv run pytest tests/unit/test_dataset
uv run pytest tests/integration/test_evaluation
```

## Architecture Overview

### Shared Infrastructure

Shared infrastructure lives in:

- `retrievalbase.mixins`
- `retrievalbase.settings`
- `retrievalbase.types`
- `retrievalbase.utils`

These modules provide:

- config loading,
- runtime factories,
- reusable type variables,
- dynamic module resolution,
- shared schema helpers.

### Connectors

Connectors are the storage boundary.

Base contract:

- `DatasetConnector`

Current implementations:

- `ParquetDatasetConnector`
- `MinioDatasetConnector`

Connector rules:

- `_load()` returns Polars data,
- `to(ds)` persists a dataset,
- connectors should not contain retrieval or preprocessing business logic.
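
A hedged sketch of a new connector honoring these rules; the import path, the constructor, and
the `ds.polars` accessor are assumptions for illustration:

```python
import polars as pl

from retrievalbase.connector import DatasetConnector  # assumed import path


class CsvDatasetConnector(DatasetConnector):  # hypothetical backend
    def __init__(self, path: str) -> None:
        self.path = path

    def _load(self) -> pl.LazyFrame:
        # Storage and transport only: no preprocessing, no retrieval logic.
        return pl.scan_csv(self.path)

    def to(self, ds) -> None:
        # Persist a dataset; assumes ds.polars exposes an eager DataFrame.
        ds.polars.write_csv(self.path)
```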

### Datasets

Base contracts:

- `Dataset`
- `TextDataset`

Concrete Polars implementations:

- `PolarsDataset`
- `PolarsTextDataset`

Dataset responsibilities:

- expose Polars-backed operations,
- validate required schema for text data,
- provide convenience conversions and iteration helpers.

### Preprocessing

Base contracts:

- `TextPreprocessor`
- `TokenCounter`

Current preprocessing components include token-based filters and preprocessing pipelines.

Design rules:

- preprocessors accept a `TextDataset` and return a `TextDataset`,
- token counters stay focused on counting,
- pipelines compose preprocessing steps instead of duplicating orchestration.
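
A hedged sketch of a preprocessor under these rules; the import paths, the `__call__` hook, and
the `PolarsTextDataset(frame)` constructor are assumptions:

```python
import polars as pl

from retrievalbase.dataset.polars import PolarsTextDataset
from retrievalbase.dataset.preprocess import TextPreprocessor  # assumed path


class LowercasePreprocessor(TextPreprocessor):  # hypothetical step
    def __call__(self, ds):  # the real hook name may differ
        # TextDataset in, TextDataset out; the contract columns survive.
        frame = ds.polars.with_columns(pl.col("page_content").str.to_lowercase())
        return PolarsTextDataset(frame)
```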

### Ingestion

Base runtime:

- `TextIngestionPipeline`

Typical flow:

```text
DatasetConnector -> TextDataset -> TextPreprocessor -> TextDataset
```
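
Sketched as direct wiring (the `load()` wrapper and constructor kwargs are assumptions; in
practice `TextIngestionPipeline` builds both pieces from config):

```python
from retrievalbase.connector import ParquetDatasetConnector  # assumed path

connector = ParquetDatasetConnector(path="data/corpus.parquet")  # assumed kwargs
dataset = connector.load()  # assumed public wrapper around _load()

# LowercasePreprocessor is the hypothetical step sketched in the
# Preprocessing section above; any TextPreprocessor fits here.
cleaned = LowercasePreprocessor()(dataset)

connector.to(cleaned)  # persist the processed TextDataset
```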

### Evaluation Stack

Important contracts:

- `Processor`
- `Embedder`
- `VectorStore`
- `Reranker`
- `Retriever`
- `Evaluator`

Typical dense retrieval flow:

```text
query -> Processor -> Embedder -> VectorStore -> Reranker -> results
```

Typical BM25 flow:

```text
query -> Retriever over TextDataset -> optional Reranker -> results
```
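
`rank-bm25` is a declared dependency, so the candidate-selection step presumably wraps it. Plain
library usage is shown below to make the flow concrete; this is not the repository's retriever
API:

```python
from rank_bm25 import BM25Okapi

corpus = ["hello world", "retrieval base", "vector stores and rerankers"]
tokenized = [doc.split() for doc in corpus]

bm25 = BM25Okapi(tokenized)
# Candidate selection; an optional Reranker would refine this ordering.
print(bm25.get_top_n("retrieval base".split(), corpus, n=2))
```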

Typical evaluation flow:

```text
dataset + retriever -> evaluator -> scores
```

Current evaluation coverage in the codebase includes:

- async batching helpers,
- BM25, dense, and hybrid retriever behavior,
- reranker and vector store contracts,
- Python evaluator runtime and score calculation paths.

## How Components Are Composed

Rather than hard-coding concrete classes, the system composes most runtime components from configuration.

Common pattern:

1. Define a settings model.
2. Include `module_path`.
3. Validate config with Pydantic.
4. Resolve the runtime class dynamically.
5. Instantiate the runtime object from config.

This allows nested configuration.

For example:

- a retriever config can include a reranker config,
- an evaluator config can include a retriever config and a dataset connector config,
- an ingestion pipeline can include both connector and preprocessor configs.
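
For illustration, a nested config might look like this; all class paths and keys besides
`module_path` are hypothetical:

```yaml
retriever:
  module_path: retrievalbase.evaluation.retrievers.dense.DenseRetriever
  top_k: 20
  reranker:
    module_path: retrievalbase.evaluation.rerankers.CrossEncoderReranker
    top_n: 5
```

A structure like this is what the later config-driven instantiation example resolves with
`key="retriever"`.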

## Minimal Example: Build A Text Dataset

```python
from retrievalbase.dataset.polars import PolarsTextDataset

ds = PolarsTextDataset.from_records(
    [
        ("hello world", {"doc_id": "1"}),
        ("retrieval base", {"doc_id": "2"}),
    ]
)

print(ds.polars)
```

## Minimal Example: Load Text Data From Parquet

```python
from retrievalbase.dataset.polars import PolarsTextDataset

ds = PolarsTextDataset.from_parquet("data/corpus.parquet", lazy=True)
print(len(ds))
```

## Minimal Example: Config-Driven Component Instantiation

```python
from retrievalbase.utils import comp

component = comp("config/component.yaml", key="retriever")
```

The YAML entry must include a valid `module_path`.

## Best Practices

### Code Design

- Prefer composition over deep inheritance.
- Use inheritance only for stable contracts such as connectors, retrievers, rerankers, and evaluators.
- Keep settings validation in settings models, not scattered through runtime logic.
- Keep external I/O at the boundaries. Storage code belongs in connectors, not datasets or retrievers.
- Keep public APIs typed and explicit.
- Make failure modes clear and actionable.

### Config Design

- Always include `module_path` for dynamically loaded components.
- Keep nested configs explicit instead of passing untyped dicts deep into the system.
- Put environment-sensitive values such as secrets in settings-compatible sources rather than hard-coding them.
- Reuse existing settings hierarchies before introducing parallel config models.

### Dataset Design

- Preserve the text dataset contract: `page_content` and `metadata`.
- Validate schema as early as possible.
- Prefer Polars-native transformations over row-by-row Python loops when possible.
- Use lazy execution when loading large parquet corpora unless the operation requires eager materialization.

### Retrieval And Evaluation

- Keep embedding, vector search, reranking, and scoring as separate concerns.
- Preserve batch ordering in async batch APIs.
- Close async resources when implementations own clients or sockets.
- Add tests for limit semantics, ordering guarantees, and empty input behavior.

### Testing

- Put fast isolated logic under `tests/unit`.
- Put multi-component behavior under `tests/integration`.
- Test contracts, not just implementation details.
- Add regression tests when fixing a bug.
- Use fixtures and fakes to isolate external systems.

### Dependency Hygiene

- Avoid circular dependencies between feature modules.
- Keep abstract interfaces backend-agnostic.
- Add optional backend imports lazily and raise helpful installation errors, as sketched below.
- Do not bypass the config-driven architecture with hard-coded concrete imports in orchestration layers unless there is a narrow local reason.
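
A minimal sketch of the lazy-import rule; the function name and error wording are illustrative,
while the extra name matches this package's metadata:

```python
def _require_sentence_transformers():
    # Import the optional backend only when a component actually needs it,
    # and point the user at the extra that provides it.
    try:
        from sentence_transformers import SentenceTransformer
    except ImportError as exc:
        raise ImportError(
            "This component requires the optional 'transformers' extra: "
            "pip install 'retrievalbase[transformers]'"
        ) from exc
    return SentenceTransformer
```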

## Recommended Workflow For Contributors

1. Install the dev environment with `make dev-install`.
2. Read [AGENTS.md](AGENTS.md) before making structural changes.
3. Make focused changes in the relevant package slice.
4. Add or update tests near the changed behavior.
5. Run `make ci` before considering the change done.

## Quality Bar

Changes should be considered complete only when they:

- follow the typed component architecture,
- preserve clean dependency direction,
- include tests for changed behavior,
- pass local CI expectations,
- remain understandable without hidden assumptions.

## Current Toolchain

Configured in `pyproject.toml` and `Makefile`:

- Ruff for formatting and linting
- Ty for static type checking
- Pytest for tests
- Pytest coverage with 80% minimum threshold
- Bandit for security scanning
- Hatchling for packaging
- UV for environment and command management

## Notes

- The default YAML config path in shared settings is `/config/config.yaml`.
- Some optional components require extra dependencies such as `transformers` or `torch`.
- When adding new backends, keep those dependencies optional and fail lazily with actionable guidance.
