Metadata-Version: 2.4
Name: labelrag
Version: 0.1.3
Summary: A label-driven RAG pipeline built on top of paralabelgen.
Project-URL: Homepage, https://github.com/HuRuilizhen/labelrag
Project-URL: Repository, https://github.com/HuRuilizhen/labelrag
Project-URL: Issues, https://github.com/HuRuilizhen/labelrag/issues
Author-email: huruilizhen <huruilizhen@gmail.com>
License: MIT
License-File: LICENSE
Keywords: evaluation,labels,llm,paragraphs,rag,retrieval
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.11
Requires-Dist: numpy>=1.26.0
Requires-Dist: paralabelgen==0.2.3
Requires-Dist: sentence-transformers>=3.0.0
Provides-Extra: dev
Requires-Dist: build>=1.2.0; extra == 'dev'
Requires-Dist: pyright>=1.1.390; extra == 'dev'
Requires-Dist: pytest-cov>=6.0.0; extra == 'dev'
Requires-Dist: pytest>=8.3.0; extra == 'dev'
Requires-Dist: ruff>=0.11.0; extra == 'dev'
Requires-Dist: twine>=6.1.0; extra == 'dev'
Description-Content-Type: text/markdown

# labelrag

`labelrag` is a Python library for label-driven retrieval-augmented generation
pipelines built on top of `paralabelgen`.

- PyPI distribution: `labelrag`
- Python import package: `labelrag`
- Core dependency target: `paralabelgen==0.2.3`
- Primary supported extraction path: `paralabelgen` LLM concept extraction
- First semantic-reranking embedding provider: `sentence-transformers`

## Install

```bash
pip install labelrag
```

If you want to use the spaCy-backed extraction path, install a compatible
English pipeline such as:

```bash
python -m spacy download en_core_web_sm
```

`en_core_web_sm` is a convenient local option, but the semantic-retrieval
release line targets the `paralabelgen==0.2.3` LLM concept-extraction pipeline
as the primary supported extraction path.

The upstream `paralabelgen==0.2.3` runtime also supports DeepSeek-backed
extraction through its own configuration surface. `labelrag` does not introduce
DeepSeek-specific APIs; it continues to pass extraction configuration through to
`paralabelgen`.

The first shipped semantic-reranking provider uses `sentence-transformers`.
Its model weights may be downloaded on first use if they are not already cached
locally.

## Quick Start

### Retrieval-only workflow

```python
from labelrag import (
    RAGPipeline,
    RAGPipelineConfig,
)

paragraphs = [
    "OpenAI builds language models for developers.",
    "Developers use language models in production systems.",
    "Production systems need monitoring and evaluation tooling.",
]

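config = RAGPipelineConfig()
# Extraction settings are passed through to paralabelgen.
config.labelgen.extractor_mode = "heuristic"
config.labelgen.use_graph_community_detection = False

pipeline = RAGPipeline(config)
# fit(...) analyzes the paragraphs and builds paragraph embeddings.
pipeline.fit(paragraphs)

# build_context(...) maps the question into the fitted label space
# and returns the retrieved context plus a retrieval trace.
retrieval = pipeline.build_context("How do developers use language models?")
print(retrieval.prompt_context)
print(retrieval.metadata)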
```

### Retrieval plus provider-backed answer generation

```python
from labelrag import (
    OpenAICompatibleAnswerGenerator,
    OpenAICompatibleConfig,
    RAGPipeline,
    RAGPipelineConfig,
)

paragraphs = [
    "OpenAI builds language models for developers.",
    "Developers use language models in production systems.",
    "Production systems need monitoring and evaluation tooling.",
]

config = RAGPipelineConfig()
config.labelgen.extractor_mode = "heuristic"
config.labelgen.use_graph_community_detection = False

pipeline = RAGPipeline(config)
pipeline.fit(paragraphs)

generator = OpenAICompatibleAnswerGenerator(
    OpenAICompatibleConfig(
        model="mistral-small-latest",
        api_key_env_var="MISTRAL_API_KEY",
        base_url="https://api.mistral.ai/v1",
    )
)

answer = pipeline.answer_with_generator(
    "How do developers use language models?",
    generator,
)
print(answer.answer_text)
print(answer.metadata)
```

## Retrieval Model

The current retrieval layer is deterministic and still label-driven at the
candidate-generation stage.

- `fit(...)` delegates paragraph analysis to `labelgen.LabelGenerator`
- `fit(...)` also builds paragraph embeddings
- `build_context(...)` maps the question into the fitted label space
- retrieval uses greedy coverage over query label IDs
- semantic similarity is used as a secondary ranking signal inside greedy
  selection
- label-free queries can use configurable fallback strategies
- `require_full_label_coverage=True` suppresses partial retrieval results while
  preserving the attempted-coverage trace in metadata

Greedy selection order is:

1. larger overlap with remaining query labels
2. larger semantic similarity
3. larger overlap on query concept IDs
4. larger total paragraph label count
5. lexicographically smaller `paragraph_id`
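
The ordering above can be read as a single composite sort key. The following is a minimal sketch of that idea, not the library's implementation: `Candidate` and `selection_key` are hypothetical names introduced only for illustration, and the similarity values are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate record; not a labelrag type."""

    paragraph_id: str
    label_ids: frozenset[str]
    concept_ids: frozenset[str]
    similarity: float  # semantic similarity to the query (illustrative value)


def selection_key(
    candidate: Candidate,
    remaining_labels: set[str],
    query_concepts: set[str],
) -> tuple[int, float, int, int, str]:
    """Smaller tuples win, so every 'larger is better' field is negated."""
    return (
        -len(candidate.label_ids & remaining_labels),  # 1. remaining-label overlap
        -candidate.similarity,  # 2. semantic similarity
        -len(candidate.concept_ids & query_concepts),  # 3. query-concept overlap
        -len(candidate.label_ids),  # 4. total label count
        candidate.paragraph_id,  # 5. lexicographically smaller ID
    )


candidates = [
    Candidate("p2", frozenset({"l1"}), frozenset({"c1"}), 0.9),
    Candidate("p1", frozenset({"l1", "l2"}), frozenset(), 0.2),
    Candidate("p3", frozenset({"l1"}), frozenset({"c1"}), 0.9),
]

# With two labels still uncovered, label overlap dominates similarity.
first = min(candidates, key=lambda c: selection_key(c, {"l1", "l2"}, {"c1"}))
# With only "l1" left, overlap ties and similarity, then ID, decide.
second = min(candidates, key=lambda c: selection_key(c, {"l1"}, {"c1"}))
print(first.paragraph_id, second.paragraph_id)
```

Because the key ends in `paragraph_id`, two candidates can never compare equal, which is one simple way to keep selection deterministic.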

In `0.1.3`, the default main-path strategy fills the remaining result slots
when label coverage completes early:

- it first builds the label-overlap candidate universe
- it still runs greedy label coverage first
- if coverage finishes before `max_paragraphs`, it backfills from the remaining
  label-overlap candidates by semantic similarity

This keeps the default path label-bounded while making top-k retrieval (up to
`max_paragraphs` results) less likely to collapse to a single paragraph for
single-label queries.
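
The two phases can be sketched in plain Python. This is an illustration under simplifying assumptions rather than the library's code: the function name `retrieve`, the dict-based inputs, and the reduced tie-breaking (overlap, similarity, ID) are all hypothetical, and the returned flag only stands in for the idea behind `semantic_backfill_used`.

```python
def retrieve(
    paragraph_labels: dict[str, set[str]],
    similarity: dict[str, float],
    query_labels: set[str],
    max_paragraphs: int,
) -> tuple[list[str], bool]:
    """Return (selected paragraph IDs, whether backfill added anything)."""
    # Candidate universe: paragraphs sharing at least one label with the query.
    universe = {
        pid for pid, labels in paragraph_labels.items() if labels & query_labels
    }
    selected: list[str] = []
    remaining = set(query_labels)

    # Phase 1: greedy label coverage with simplified tie-breaking.
    while remaining and len(selected) < max_paragraphs:
        pool = [
            pid
            for pid in universe
            if pid not in selected and paragraph_labels[pid] & remaining
        ]
        if not pool:
            break  # some query labels cannot be covered
        best = min(
            pool,
            key=lambda pid: (
                -len(paragraph_labels[pid] & remaining),
                -similarity[pid],
                pid,
            ),
        )
        selected.append(best)
        remaining -= paragraph_labels[best]

    # Phase 2: backfill only when coverage finished with slots left over.
    backfilled = False
    if not remaining and len(selected) < max_paragraphs:
        leftovers = sorted(
            universe - set(selected), key=lambda pid: (-similarity[pid], pid)
        )
        extra = leftovers[: max_paragraphs - len(selected)]
        selected.extend(extra)
        backfilled = bool(extra)
    return selected, backfilled


labels = {"p1": {"l1"}, "p2": {"l1"}, "p3": {"l1"}, "p4": {"l9"}}
scores = {"p1": 0.5, "p2": 0.8, "p3": 0.3, "p4": 0.99}
result = retrieve(labels, scores, {"l1"}, max_paragraphs=3)
print(result)
```

In this toy corpus `p4` has the highest similarity but shares no label with the query, so it never enters the candidate universe; that is the sense in which the path stays label-bounded.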

Since `0.1.2`, two main-path retrieval strategies are available:

- `greedy_label_coverage_semantic_rerank`
- `label_gate_semantic_rank`

`label_gate_semantic_rank` keeps label overlap as a candidate gate but lets
semantic similarity become the primary ranking signal inside that gated set.
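
As a rough illustration of "gate first, then rank", the sketch below filters by label overlap and sorts the survivors by similarity. The function name, the dict-based inputs, and the ID tie-break are assumptions made for this example; the library's actual ordering rules may differ.

```python
def gate_then_rank(
    paragraph_labels: dict[str, set[str]],
    similarity: dict[str, float],
    query_labels: set[str],
    max_paragraphs: int,
) -> list[str]:
    """Keep paragraphs that share a query label, ordered by similarity."""
    # Gate: label overlap decides membership, not rank.
    gated = [
        pid for pid, labels in paragraph_labels.items() if labels & query_labels
    ]
    # Rank: similarity is primary; the ID tie-break is an assumption.
    gated.sort(key=lambda pid: (-similarity[pid], pid))
    return gated[:max_paragraphs]


labels = {"p1": {"l1", "l2"}, "p2": {"l1"}, "p3": {"l3"}}
scores = {"p1": 0.1, "p2": 0.7, "p3": 0.9}
ranked = gate_then_rank(labels, scores, {"l1", "l2"}, max_paragraphs=2)
print(ranked)
```

Here `p2` outranks `p1` even though `p1` covers more query labels, while `p3` is excluded by the gate despite its high similarity.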

Since `0.1.2`, four label-free fallback strategies are available:

- `concept_overlap_only`
- `concept_overlap_semantic_rerank`
- `concept_gate_semantic_rank`
- `semantic_only`

The default is `concept_overlap_semantic_rerank`.

`concept_gate_semantic_rank` mirrors the main gated semantic-first behavior for
label-free queries:

- paragraph concepts must still intersect the query concepts to enter the
  candidate set
- semantic similarity is then the primary ranking signal inside that gated set

The default strategies remain unchanged in `0.1.3`:

- main path:
  - `greedy_label_coverage_semantic_rerank`
- label-free fallback:
  - `concept_overlap_semantic_rerank`

The retrieval trace now records a `retrieval_score_kind` for each result so
that the meaning of its `retrieval_score` is explicit, and the default greedy
main path reports whether semantic backfill ran through
`semantic_backfill_used`.

## OpenAI-Compatible Provider Notes

The built-in answer-generation adapter targets a minimal OpenAI-compatible
chat-completions API surface.

It supports:

- standard base URLs such as `https://api.openai.com/v1`
- full endpoint URLs such as `https://api.mistral.ai/v1/chat/completions`
- API key injection through explicit config or optional environment-variable
  lookup
- non-streaming text generation for `answer_with_generator(...)`

This adapter is intended to cover providers such as OpenAI, Mistral, and Qwen
when they expose an OpenAI-compatible endpoint shape.
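
To make the two accepted URL shapes and the key lookup concrete, here is a small standalone sketch. Both helper names are hypothetical, and the precedence shown (explicit key first, then the environment variable) is an assumption for illustration rather than documented adapter behavior.

```python
import os
from typing import Optional


def resolve_chat_completions_url(base_url: str) -> str:
    """Accept a base URL or a full endpoint URL; return the endpoint URL."""
    trimmed = base_url.rstrip("/")
    if trimmed.endswith("/chat/completions"):
        return trimmed  # already a full endpoint URL
    return f"{trimmed}/chat/completions"


def resolve_api_key(api_key: Optional[str], api_key_env_var: Optional[str]) -> str:
    """Prefer an explicit key, then fall back to an environment variable."""
    if api_key:
        return api_key
    if api_key_env_var:
        value = os.environ.get(api_key_env_var)
        if value:
            return value
    raise ValueError("no API key provided and no environment variable set")


print(resolve_chat_completions_url("https://api.openai.com/v1"))
print(resolve_chat_completions_url("https://api.mistral.ai/v1/chat/completions"))
```

Normalizing the URL once keeps the request code independent of which shape the caller supplied.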

## Public API

The main public entrypoints are:

- `RAGPipeline`
- `RAGPipelineConfig`, `RetrievalConfig`, `PromptConfig`
- `IndexedParagraph`, `LabelRecord`, `ConceptRecord`
- `QueryAnalysis`, `RetrievedParagraph`
- `RetrievalResult`, `RAGAnswerResult`
- `GeneratedAnswer`, `AnswerGenerator`
- `OpenAICompatibleAnswerGenerator`, `OpenAICompatibleConfig`
- convenience re-export: `Paragraph`

`RAGPipeline` also exposes record-oriented inspection helpers for
paragraph/label/concept lookup workflows:

- `get_paragraph(...)`
- `get_label(...)`
- `get_paragraph_labels(...)`
- `get_paragraph_concepts(...)`
- `get_label_paragraphs(...)`
- `get_concept_paragraphs(...)`

Lower-level ID-oriented helpers remain available when you only need stable IDs:

- `get_label_paragraph_ids(...)`
- `get_paragraph_label_ids(...)`
- `get_paragraph_concept_ids(...)`
- `get_concept_paragraph_ids(...)`

Detailed API notes are available in [`docs/public_api.md`](docs/public_api.md).

## Embedding Notes

- `RAGPipeline.fit(...)` now requires an embedding provider
- the default runtime path is `RAGPipeline(config)` and resolves the embedding
  provider from `config.embedding`
- explicit `embedding_provider=` is still available as an advanced override
- the first shipped provider is `SentenceTransformerEmbeddingProvider`
- the default model is `sentence-transformers/all-MiniLM-L6-v2`
- the model may be downloaded on first use
- offline environments should pre-cache the embedding model before running
  `fit(...)`
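
One way to pre-cache the default model is to load it once with `sentence-transformers` on a machine that has network access. The helper below is a sketch with a hypothetical name; it uses only the public `SentenceTransformer` constructor and `encode(...)`, and it imports the package lazily so that defining the function has no side effects.

```python
def precache_embedding_model(
    model_name: str = "sentence-transformers/all-MiniLM-L6-v2",
) -> int:
    """Load the model (downloading it if needed) and return its vector size."""
    # Imported lazily so this module can be imported without the package.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    # Encoding one sentence confirms the weights are usable, not just present.
    vector = model.encode("warm-up sentence")
    return int(vector.shape[0])
```

Calling `precache_embedding_model()` once while online should populate the local Hugging Face cache. Offline runs can then set `HF_HUB_OFFLINE=1` so that model loading does not attempt network access.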

Common runtime failures:

- missing `sentence-transformers` package:
  - reinstall the package dependencies, for example `pip install labelrag`, or
    `pip install -e .` from a source checkout
- model load/download failure:
  - verify the configured model name
  - ensure the model is already cached locally or that the environment can
    reach Hugging Face

## Examples

Runnable examples are available in [`examples/`](examples/):

- [`examples/basic_usage.py`](examples/basic_usage.py)
- [`examples/custom_config.py`](examples/custom_config.py)
- [`examples/inspection_api.py`](examples/inspection_api.py)
- [`examples/fallback_policies.py`](examples/fallback_policies.py)
- [`examples/gated_semantic_rank.py`](examples/gated_semantic_rank.py)
- [`examples/greedy_backfill.py`](examples/greedy_backfill.py)
- [`examples/semantic_rerank.py`](examples/semantic_rerank.py)
- [`examples/save_and_load.py`](examples/save_and_load.py)
- [`examples/provider_answer.py`](examples/provider_answer.py)

Example note:

- the runnable example scripts use a tiny local demo embedding provider so they
  stay runnable offline
- production usage should prefer `SentenceTransformerEmbeddingProvider`

## Persistence Notes

`save(path)` produces a human-inspectable directory containing:

- `manifest.json`
- `config.json`
- `label_generator.json`
- `corpus_index.json`
- `fit_result.json`
- `paragraph_embeddings.npz`

The persistence layer now supports:

- `json`
- `json.gz`

Compression is applied to the full saved snapshot rather than mixing compressed
and uncompressed artifacts in one directory.

Snapshots written by the current release include a lightweight manifest
describing the saved version, persistence format, and expected artifacts.
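
The "one format per snapshot" rule can be illustrated with the standard library alone. This sketch is not the library's persistence code: the helper names, the manifest keys, and the choice to name compressed files `*.json.gz` (including the manifest) are assumptions made for the example.

```python
import gzip
import json
import tempfile
from pathlib import Path


def write_json(path: Path, payload: dict, compressed: bool) -> None:
    """Write a JSON document as plain text or gzip-compressed text."""
    text = json.dumps(payload, indent=2, sort_keys=True)
    if compressed:
        with gzip.open(path, "wt", encoding="utf-8") as handle:
            handle.write(text)
    else:
        path.write_text(text, encoding="utf-8")


def read_json(path: Path) -> dict:
    """Read a JSON document, decompressing when the name ends in .gz."""
    if path.suffix == ".gz":
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            return json.load(handle)
    return json.loads(path.read_text(encoding="utf-8"))


def save_snapshot(directory: Path, artifacts: dict, fmt: str) -> None:
    """Write every artifact in one format, then a manifest listing them."""
    if fmt not in ("json", "json.gz"):
        raise ValueError(f"unsupported persistence format: {fmt}")
    directory.mkdir(parents=True, exist_ok=True)
    compressed = fmt == "json.gz"
    names = []
    for stem, payload in artifacts.items():
        name = f"{stem}.{fmt}"
        write_json(directory / name, payload, compressed)
        names.append(name)
    # Hypothetical manifest keys, chosen for illustration only.
    manifest = {"persistence_format": fmt, "artifacts": sorted(names)}
    write_json(directory / f"manifest.{fmt}", manifest, compressed)


with tempfile.TemporaryDirectory() as tmp:
    snapshot = Path(tmp) / "snapshot"
    save_snapshot(snapshot, {"config": {"max_paragraphs": 3}}, "json.gz")
    manifest = read_json(snapshot / "manifest.json.gz")
    restored = read_json(snapshot / "config.json.gz")
    print(manifest, restored)
```

Listing the expected artifacts in the manifest gives a loader something to check before it starts reading files, which is the kind of role the manifest described above could play.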

Public guarantee:

- a saved and reloaded pipeline should preserve retrieval behavior for the same
  fitted state, question, and config

Current update boundary:

- `fit(...)` is batch-only
- adding new paragraphs currently requires a full refit
- save/load restores a static fitted state rather than an incrementally
  updateable corpus state

Legacy snapshot note:

- loading pre-embedding snapshots remains a best-effort compatibility path
- when older snapshots are missing derived concept inspection tables, `load()`
  may rebuild them from paragraph-side concept data that is still present
- when older snapshots predate `paragraph_embeddings.npz`, `load()` may rebuild
  paragraph embeddings from persisted paragraph texts if an embedding provider
  is available
- persisted manifests include a non-empty `labelrag_version`
- `save()` fails explicitly if the current package version cannot be determined
  for manifest writing

## Configuration Notes

- `RetrievalConfig.max_paragraphs` sets the hard retrieval limit
- `RetrievalConfig.retrieval_strategy` selects one of:
  - `greedy_label_coverage_semantic_rerank`
  - `label_gate_semantic_rank`
- `RetrievalConfig.allow_label_free_fallback` enables deterministic concept
  fallback behavior for label-free queries
- `RetrievalConfig.label_free_fallback_strategy` selects one of:
  - `concept_overlap_only`
  - `concept_overlap_semantic_rerank`
  - `concept_gate_semantic_rank`
  - `semantic_only`
- `RetrievalConfig.require_full_label_coverage` suppresses partial retrieval
  output when not all query labels can be covered
- `PromptConfig.include_paragraph_ids` includes stable paragraph IDs in the
  rendered prompt context
- `PromptConfig.include_label_annotations` includes paragraph label annotations
  in rendered prompt context
- `PromptConfig.max_context_characters` applies a hard cap to rendered context
  length
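
The three `PromptConfig` options interact in a simple way that a toy renderer can show. Everything in this sketch is hypothetical: the function name, the output layout, and the choice to enforce the cap by truncating characters are assumptions, and the library's rendered format may look different.

```python
from typing import Optional


def render_context(
    paragraphs: list[tuple[str, str, set[str]]],
    include_paragraph_ids: bool = True,
    include_label_annotations: bool = False,
    max_context_characters: Optional[int] = None,
) -> str:
    """Render (paragraph_id, text, labels) triples into one context string."""
    blocks = []
    for paragraph_id, text, labels in paragraphs:
        header_parts = []
        if include_paragraph_ids:
            header_parts.append(f"[{paragraph_id}]")
        if include_label_annotations and labels:
            header_parts.append("labels: " + ", ".join(sorted(labels)))
        header = " ".join(header_parts)
        blocks.append(f"{header}\n{text}" if header else text)
    context = "\n\n".join(blocks)
    # Assumed cap semantics: plain character truncation.
    if max_context_characters is not None:
        context = context[:max_context_characters]
    return context


demo = [("p1", "Hello world", {"l2", "l1"})]
full = render_context(demo, include_label_annotations=True)
capped = render_context(demo, max_context_characters=4)
print(full)
print(capped)
```

A character cap is the simplest policy to reason about, although it can cut a paragraph mid-sentence; dropping whole paragraphs that do not fit would be an alternative design.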

## Development Checks

```bash
.venv/bin/ruff check . --fix
.venv/bin/pyright
.venv/bin/pytest
```

## Release Checks

```bash
.venv/bin/python -m build
.venv/bin/python -m twine check dist/*
```
