Metadata-Version: 2.4
Name: fine-tuning-dataset-preparation
Version: 0.1.0
Summary: Utilities for building code and document fine-tuning datasets.
Home-page: https://bitbucket.org/entinco/eic-aimodelknowledge-utils/src/master/lib-finetuningdatasetpreparation-python
Author: Your Name
Author-email: seroukhov@entinco.com
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: langchain<2.0,>=1.0.5
Requires-Dist: langchain-google-genai>=3.0.1
Requires-Dist: langchain-openai>=1.0.2
Requires-Dist: langchain-anthropic>=1.0.2
Requires-Dist: chardet>=5.2.0
Requires-Dist: tree_sitter>=0.25.2
Requires-Dist: tree_sitter_python>=0.25.0
Requires-Dist: tree_sitter_c_sharp>=0.23.1
Requires-Dist: tree_sitter_go>=0.25.0
Requires-Dist: tree_sitter_javascript>=0.25.0
Requires-Dist: tree_sitter_typescript>=0.23.2
Requires-Dist: pytest>=8.0.0
Dynamic: author-email
Dynamic: home-page
Dynamic: license-file

# fine_tuning_dataset_preparation

A toolkit for turning codebases and Markdown documentation into fine-tuning datasets for LLMs. It ships a code pipeline (tree-sitter extraction, instruction generation, optional paraphrasing), a document pipeline (Markdown segmentation, exhaustive Q/A generation, multi-file support), shared exporters for Gemini, OpenAI, and plain Q/A JSONL formats, and configurable prompts. The project is tested with pytest.

## Installation

```bash
pip install -e .
```

Python 3.10 or later is required. Before running the pipelines, set your LLM credentials:
- `GOOGLE_API_KEY` for Gemini (default)
- or `OPENAI_API_KEY` / `ANTHROPIC_API_KEY` if you switch providers
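
A small pre-flight check can catch a missing key before any LLM call is made. The sketch below is not part of the package: `PROVIDER_ENV_VARS` and `missing_credentials` are hypothetical names, and the provider strings `openai` and `anthropic` are assumed by analogy with the `gemini` value used in the examples in this README.

```python
import os

# Assumed mapping from provider name to the environment variable listed above.
# Only "gemini" appears as a provider value in this README; the other keys are guesses.
PROVIDER_ENV_VARS = {
    "gemini": "GOOGLE_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def missing_credentials(provider, environ=None):
    """Return the name of the unset (or empty) environment variable, or None if it is set."""
    env = os.environ if environ is None else environ
    var_name = PROVIDER_ENV_VARS[provider]
    return None if env.get(var_name) else var_name


if __name__ == "__main__":
    missing = missing_credentials("gemini")
    if missing:
        print(f"Set {missing} before running the pipelines.")
```

Passing `environ` explicitly keeps the check easy to test without touching the real environment.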

## Code pipeline

Create instruction datasets from a repository (single or multiple projects).

CLI (see `examples/run_code_pipeline.py`):
```bash
python examples/run_code_pipeline.py
```

Programmatic use:
```python
from fine_tuning_dataset_preparation.code_dataset import PromptConfig
from fine_tuning_dataset_preparation.code_dataset.pipeline import code_pipeline

code_pipeline(
    project_path="path/to/repo",
    multi_project=True,                # treat subfolders as projects
    dataset_path="dataset.jsonl",
    llm_provider="gemini",
    model_name="gemini-2.5-flash",
    instruction_concurrency=8,
    instruction_temperature=0.7,
    prompt_config=PromptConfig(
        instruction_hint="Keep instructions concise and actionable.",
        paraphrase_hint="Return one alternative phrasing.",
    ),
    paraphrase_variations=1,           # optional paraphrasing
    paraphrase_temperature=0.9,
    exports=[
        {"target": "gemini", "output_path": "gemini_dataset.jsonl", "options": {"jsonl": True}},
        {"target": "openai", "output_path": "openai_dataset.jsonl", "options": {"jsonl": True}},
    ],
)
```

Key arguments: `project_path`, `multi_project`, `instruction_concurrency`, `instruction_temperature`, `prompt_config`, optional `paraphrase_*`, and `exports` with targets `gemini` or `openai`.
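
Because every entry in `exports` has the same shape, the list can be built from the target names. This is a minimal sketch that assumes only the dict keys shown in the call above (`target`, `output_path`, `options`); `build_exports` and the `<target>_dataset.jsonl` naming are illustrative and not part of the package.

```python
from pathlib import Path

# Instruction targets named in this README.
INSTRUCTION_TARGETS = ("gemini", "openai")


def build_exports(targets, output_dir="."):
    """Build an `exports` list in the shape used by the example call above."""
    exports = []
    for target in targets:
        if target not in INSTRUCTION_TARGETS:
            raise ValueError(f"unknown instruction target: {target!r}")
        exports.append(
            {
                "target": target,
                "output_path": str(Path(output_dir) / f"{target}_dataset.jsonl"),
                "options": {"jsonl": True},
            }
        )
    return exports
```

The result can then be passed as `exports=build_exports(["gemini", "openai"], "out")`.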

## Document pipeline

Generate Q/A datasets from Markdown. You can point to a single file, a directory, or a list of files.

CLI (see `examples/run_document_pipeline.py`):
```bash
python examples/run_document_pipeline.py
```

Programmatic use:
```python
from fine_tuning_dataset_preparation.document_dataset import run_document_pipeline
from fine_tuning_dataset_preparation.document_dataset.dataset import DocumentPromptConfig

run_document_pipeline(
    markdown_dir="docs",               # or markdown_path="file.md" or markdown_paths=["a.md", "b.md"]
    output_path="document_dataset.json",
    min_total_pairs=1,
    llm_provider="gemini",
    model_name="gemini-2.5-pro",
    prompt_config=DocumentPromptConfig(
        system_message="Use only the provided docs; end with attribution.",
        instructions=[
            "Use only supplied documentation fragments.",
            "If missing, say it is not specified.",
            "End every answer with the attribution line.",
        ],
        attribution="Information sourced from ACME Docs © 2025.",
    ),
    exports=[
        {"target": "qa_openai", "output_path": "doc_openai.jsonl", "options": {"jsonl": True}},
        {"target": "qa_gemini", "output_path": "doc_gemini.jsonl", "options": {"jsonl": True}},
        {"target": "qa_jsonl", "output_path": "doc_pairs.jsonl"},
    ],
)
```

Key arguments: `markdown_path` or `markdown_dir` or `markdown_paths`, `min_total_pairs`, `prompt_config` for tone/attribution, and `exports` with Q/A targets `qa_openai`, `qa_gemini`, `qa_jsonl`.
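
When the input may be a file, a directory, or a list of files, a small helper can select the matching keyword argument. This is a sketch, assuming the three argument names listed above; `markdown_source_kwargs` is a hypothetical helper and not part of the package.

```python
from pathlib import Path


def markdown_source_kwargs(source):
    """Map a file, a directory, or a list of files to exactly one keyword argument."""
    if isinstance(source, (list, tuple)):
        return {"markdown_paths": [str(p) for p in source]}
    path = Path(source)
    if path.is_dir():
        return {"markdown_dir": str(path)}
    # Anything that is not an existing directory is treated as a single file path.
    return {"markdown_path": str(path)}
```

The returned dict can be unpacked into the call, for example `run_document_pipeline(**markdown_source_kwargs("docs"), output_path="document_dataset.json")`.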

## Exporters

All exporters live in `fine_tuning_dataset_preparation/common/exporters`. Use them directly or via the pipeline `exports` argument.

```python
from fine_tuning_dataset_preparation.common.exporters import export_dataset

export_dataset(
    target="qa_openai",                 # gemini | openai | qa_openai | qa_gemini | qa_jsonl
    output_path="out.jsonl",
    pairs=[{"question": "...", "answer": "..."}],  # or instruction records when using instruction targets
    options={"jsonl": True},
)
```

Instruction targets: `gemini`, `openai`. Q/A targets: `qa_openai`, `qa_gemini`, `qa_jsonl`.
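
To sanity-check an exported file, it can be read back line by line. The sketch below assumes only that a `.jsonl` export holds one JSON object per line, which is what the JSONL format means; it makes no assumption about the fields each exporter writes. `read_jsonl` is a hypothetical helper, not part of the package.

```python
import json


def read_jsonl(path):
    """Load a JSONL file into a list of Python objects, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

Comparing `len(read_jsonl("out.jsonl"))` with the number of records passed to the exporter is a quick way to spot truncated output.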

## Project structure

- `fine_tuning_dataset_preparation/code_dataset`: tree-sitter extraction, instruction generation, paraphrasing
- `fine_tuning_dataset_preparation/document_dataset`: Markdown ingestion, Q/A generation, prompt helpers
- `fine_tuning_dataset_preparation/common`: LLM utilities, text helpers, exporters
- `examples/`: runnable scripts for the code and document pipelines, plus an export helper
- `tests/`: pytest suite organized by domain

## Testing

```bash
# run the full suite
pytest
# run with a coverage report (requires the pytest-cov plugin)
pytest --cov=. --cov-report=term-missing
```

## Tips

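- Match `llm_provider` to the model you choose; the package requirements already install the LangChain integrations for Gemini, OpenAI, and Anthropic, along with the tree-sitter grammars for Python, C#, Go, JavaScript, and TypeScript.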
- Export your API key(s) before running examples to avoid partial or templated outputs.
- Pick models by task: fast models (Gemini 2.5 Flash, small OpenAI tiers) for bulk coverage and paraphrases; higher-fidelity models (Gemini 2.5 Pro, GPT-4.1) for final passes or sensitive Q/A. Raise the temperature (0.7–0.9) for paraphrasing; keep it lower (0.2–0.5) for more consistent instruction and Q/A generation.
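
The model and temperature guidance above can be captured as presets so that each run starts from consistent settings. The values below are one reading of those tips, not defaults shipped by the package; `TASK_PRESETS` and `preset_for` are hypothetical names.

```python
# Hypothetical presets derived from the tips above; tune them for your data.
TASK_PRESETS = {
    "instruction": {"model_name": "gemini-2.5-flash", "temperature": 0.3},
    "paraphrase": {"model_name": "gemini-2.5-flash", "temperature": 0.9},
    "qa": {"model_name": "gemini-2.5-pro", "temperature": 0.2},
}


def preset_for(task):
    """Return a copy of the preset for a task so callers can modify it safely."""
    return dict(TASK_PRESETS[task])
```

For example, `preset_for("paraphrase")["temperature"]` could supply `paraphrase_temperature` in the code pipeline call.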
