Metadata-Version: 2.4
Name: query-generator
Version: 2.0.0
Author-email: jalal <jalalkhaldi3@gmail.com>
Requires-Python: <3.13,>=3.11
Requires-Dist: pyyaml<7.0.0,>=6.0.0
Requires-Dist: retrievalbase<3.0.0,>=2.1.1
Description-Content-Type: text/markdown

# Query Generator

Query Generator is a typed Python library and CLI for generating search-query records from document rows. It is designed for retrieval and evaluation workflows where each source document should be expanded into natural-language queries while preserving the original document content and metadata.

## Highlights

- Configurable `QueryGenerator` contract for producing structured query output.
- Generator implementations for OpenAI and for Ollama (through Ollama's OpenAI-compatible API).
- Prompt abstraction for project-specific query-generation instructions.
- CLI for generating queries from a JSON row, file, or standard input.
- RetrievalBase `TextPreprocessor` integration for expanding datasets into query-enriched rows.
- Typed Pydantic settings models for config-driven component loading.
- Retry handling for empty, invalid, or provider-failed model responses.

## Overview

Query Generator turns document rows into structured query records. A row is expected to look like a RetrievalBase text row:

```json
{
  "page_content": "Retrieval augmented generation combines search with language models.",
  "metadata": {
    "source": "paper-1",
    "page": 3
  }
}
```

A generator renders a prompt from the row, calls a model provider, and returns JSON in this shape:

```json
{
  "queries": [
    {
      "query": "what is retrieval augmented generation?"
    }
  ]
}
```

When used as a `TextPreprocessor`, each generated query becomes a new dataset row with the original `page_content` and metadata plus a `query` metadata field. This is useful for retrieval evaluation datasets, synthetic search-query generation, query-document pair creation, and batch preparation before indexing or scoring.
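
The expansion described above can be sketched in plain Python. This is an illustration only, not the library's implementation: `expand_row` is a hypothetical helper, and the deep copy of metadata is an assumption about how output rows are kept independent.

```python
from copy import deepcopy
from typing import Any


def expand_row(row: dict[str, Any], generated: dict[str, Any]) -> list[dict[str, Any]]:
    """Hypothetical helper: build one output row per generated query."""
    expanded = []
    for item in generated["queries"]:
        # Copy metadata so each output row can be changed independently
        # and the input row is left untouched.
        metadata = deepcopy(row.get("metadata", {}))
        metadata["query"] = item["query"]
        expanded.append({"page_content": row["page_content"], "metadata": metadata})
    return expanded


row = {
    "page_content": "Retrieval augmented generation combines search with language models.",
    "metadata": {"source": "paper-1", "page": 3},
}
generated = {"queries": [{"query": "what is retrieval augmented generation?"}]}
rows = expand_row(row, generated)
```

Here `rows` holds a single row whose metadata carries `source`, `page`, and the new `query` field, while `row` itself is unchanged.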

## Installation

This project requires Python 3.11 or 3.12 (`>=3.11,<3.13`).

For local development from this repository, use `uv`:

```bash
uv sync --group dev --all-extras
```

Install production dependencies only:

```bash
make install
```

## Usage

### Define a Prompt

Prompts are application-specific. Implement `Prompt.render()` to turn a document row into model instructions.

```py
from typing import Any

from query_generator.prompt import Prompt
from query_generator.settings import PromptSettings


class RetrievalPrompt(Prompt[PromptSettings]):
    def render(self, row: dict[str, Any]) -> str:
        return (
            f"Generate {self.config.n_queries} search queries for this passage.\n"
            "Return JSON with a top-level 'queries' list. "
            "Each item must contain a string field named 'query'.\n\n"
            f"Passage:\n{row['page_content']}"
        )
```

### Generate with OpenAI

```py
from query_generator.generators.openai import OpenAIQueryGenerator
from query_generator.settings import OpenAIQueryGeneratorSettings, PromptSettings


generator = OpenAIQueryGenerator(
    OpenAIQueryGeneratorSettings(
        module_path="query_generator.generators.openai.OpenAIQueryGenerator",
        prompt=PromptSettings(
            module_path="your_package.prompts.RetrievalPrompt",
            n_queries=3,
        ),
        model_name="gpt-4.1-mini",
        temperature=0.2,
        max_retries=3,
    )
)

result = generator.generate(
    {
        "page_content": "Retrieval augmented generation combines search with language models.",
        "metadata": {"source": "paper-1", "page": 3},
    }
)
```

Set provider credentials through the environment variables that the OpenAI Python client reads, for example:

```bash
export OPENAI_API_KEY="..."
```

### Generate with Ollama

The Ollama generator uses Ollama's OpenAI-compatible API and automatically normalizes the base URL to include `/v1`.

```py
from query_generator.generators.ollama import OllamaQueryGenerator
from query_generator.settings import OllamaQueryGeneratorSettings, PromptSettings


generator = OllamaQueryGenerator(
    OllamaQueryGeneratorSettings(
        module_path="query_generator.generators.ollama.OllamaQueryGenerator",
        prompt=PromptSettings(
            module_path="your_package.prompts.RetrievalPrompt",
            n_queries=3,
        ),
        model="llama3.1",
        base_url="http://localhost:11434",
        temperature=0.2,
        max_retries=3,
    )
)

result = generator.generate(
    {
        "page_content": "A document passage to turn into search queries.",
        "metadata": {"source": "local-doc"},
    }
)
```
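
The base URL normalization mentioned above could look roughly like the following. This is a minimal sketch under stated assumptions: `normalize_base_url` is a hypothetical name, and the actual generator may handle trailing slashes or existing paths differently.

```python
def normalize_base_url(base_url: str) -> str:
    """Hypothetical helper: make sure the URL ends with the /v1 path segment."""
    # Drop trailing slashes so "host/" and "host" are treated the same.
    trimmed = base_url.rstrip("/")
    if trimmed.endswith("/v1"):
        return trimmed
    return f"{trimmed}/v1"


normalized = normalize_base_url("http://localhost:11434")
```

Under this sketch, `http://localhost:11434`, `http://localhost:11434/`, and `http://localhost:11434/v1/` all resolve to the same endpoint.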

### Use the CLI

Create a YAML config that resolves to a `QueryGenerator`:

```yaml
generator:
  module_path: query_generator.generators.openai.OpenAIQueryGenerator
  prompt:
    module_path: your_package.prompts.RetrievalPrompt
    n_queries: 3
  model_name: gpt-4.1-mini
  temperature: 0.2
  max_retries: 3
```

Generate queries from an inline JSON row:

```bash
query-generator generate \
  --config config.yaml \
  --config-key generator \
  --row-json '{"page_content":"Retrieval augmented generation combines search with language models.","metadata":{"source":"paper-1","page":3}}'
```

Generate from a row file:

```bash
query-generator generate \
  --config config.yaml \
  --config-key generator \
  --row-file row.json
```

Or pipe JSON through standard input:

```bash
cat row.json | query-generator generate --config config.yaml --config-key generator
```
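
A row file for the commands above can be written with the standard library. The sketch assumes that the file holds a single JSON object in the row shape shown in the Overview; `write_row` is a hypothetical helper and is not part of the package.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def write_row(path: Path, row: dict[str, Any]) -> None:
    """Hypothetical helper: serialize one document row as a JSON file."""
    path.write_text(json.dumps(row, ensure_ascii=False, indent=2), encoding="utf-8")


row = {
    "page_content": "Retrieval augmented generation combines search with language models.",
    "metadata": {"source": "paper-1", "page": 3},
}

# A temporary directory keeps the example free of side effects;
# in practice, write to any path and pass it to --row-file.
with tempfile.TemporaryDirectory() as tmp:
    row_path = Path(tmp) / "row.json"
    write_row(row_path, row)
    loaded = json.loads(row_path.read_text(encoding="utf-8"))
```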

### Use as a RetrievalBase Preprocessor

`QueryGeneratorPreprocessor` expands each input row into one output row per generated query. The output row keeps the original `page_content`; metadata is copied and augmented with `query`.

```py
from query_generator.preprocessor import QueryGeneratorPreprocessor
from query_generator.settings import (
    OpenAIQueryGeneratorSettings,
    PromptSettings,
    QueryGeneratorPreprocessorSettings,
)


settings = QueryGeneratorPreprocessorSettings[OpenAIQueryGeneratorSettings](
    module_path="query_generator.preprocessor.QueryGeneratorPreprocessor",
    kind="query_generator",
    query_generator=OpenAIQueryGeneratorSettings(
        module_path="query_generator.generators.openai.OpenAIQueryGenerator",
        prompt=PromptSettings(
            module_path="your_package.prompts.RetrievalPrompt",
            n_queries=2,
        ),
        model_name="gpt-4.1-mini",
        temperature=0.2,
        max_retries=3,
    ),
)

preprocessor = QueryGeneratorPreprocessor.from_config(settings)
expanded_dataset = preprocessor.apply(text_dataset)
```

## Expected Output Contract

Generators must return a dictionary with a top-level `queries` list. Each item must be a dictionary with a string `query` field:

```py
{
    "queries": [
        {"query": "first generated query"},
        {"query": "second generated query"},
    ]
}
```

The preprocessor raises `InvalidQueryGeneratorOutputError` when this shape is not met.
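
A shape check along these lines can be useful when writing a custom generator. It is a sketch, not the package's validation code: `validate_queries` is a hypothetical function, and it raises the built-in `ValueError` as a stand-in rather than `InvalidQueryGeneratorOutputError`.

```python
from typing import Any


def validate_queries(output: Any) -> list[str]:
    """Hypothetical check: return the query strings or raise ValueError."""
    if not isinstance(output, dict):
        raise ValueError("output must be a dictionary")
    queries = output.get("queries")
    if not isinstance(queries, list):
        raise ValueError("'queries' must be a list")
    result = []
    for index, item in enumerate(queries):
        # Each item must be a dictionary with a string 'query' field.
        if not isinstance(item, dict) or not isinstance(item.get("query"), str):
            raise ValueError(f"queries[{index}] must be a dict with a string 'query'")
        result.append(item["query"])
    return result


valid = validate_queries(
    {"queries": [{"query": "first generated query"}, {"query": "second generated query"}]}
)
```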

## Project Structure

```text
query-generator/
|-- src/query_generator/
|   |-- generators/
|   |   |-- openai.py       # OpenAI chat-completions generator
|   |   `-- ollama.py       # Ollama OpenAI-compatible generator
|   |-- prompt/             # Prompt base contract
|   |-- exceptions.py       # Package and CLI exceptions
|   |-- main.py             # query-generator CLI
|   |-- preprocessor.py     # RetrievalBase TextPreprocessor integration
|   |-- settings.py         # Typed component settings
|   `-- py.typed            # Type information marker
|-- tests/
|   |-- fixtures/           # Shared pytest fixtures and test components
|   |-- unit/               # Unit tests
|   `-- integration/        # Config-loading tests
|-- pyproject.toml
|-- Makefile
|-- uv.lock
`-- README.md
```

## Common Use Cases

- Generate synthetic queries for retrieval evaluation datasets.
- Expand document rows into query-document training or scoring records.
- Keep model-provider logic replaceable behind a common generator interface.
- Run query generation from YAML or JSON component configs.
- Use local Ollama models for development before switching to a hosted provider.
- Integrate query generation into RetrievalBase dataset preprocessing pipelines.

## Development

Run tests:

```bash
make test
```

Run formatting and linting:

```bash
make format
make lint
```

Run type checking:

```bash
make type-check
```

Run security checks:

```bash
make security
```

Run the local CI equivalent:

```bash
make ci
```

## Feedback and Contributing

Bug reports, feature requests, and implementation ideas are welcome. Include:

- What you expected to happen.
- What actually happened.
- A minimal config, row, or test case when possible.
- The Python version, provider, model, and relevant dependency versions.

Good contributions include new generator providers, reusable prompt implementations, validation improvements, tests, examples, and documentation updates.
