Metadata-Version: 2.4
Name: asynth
Version: 0.1.0
Summary: Synthetic data generation engine for building task models
Project-URL: Homepage, https://amortized.ai
Project-URL: Repository, https://github.com/amortized-ai/asynth
Project-URL: Issues, https://github.com/amortized-ai/asynth/issues
Author: Amortized AI
License: Apache-2.0
License-File: LICENSE
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Typing :: Typed
Requires-Python: >=3.11
Requires-Dist: jinja2>=3.1
Requires-Dist: jsonlines>=4.0
Requires-Dist: jsonschema>=4.20
Requires-Dist: litellm>=1.40
Requires-Dist: pandas>=2.0
Requires-Dist: pydantic>=2.7
Requires-Dist: pyyaml>=6.0
Requires-Dist: tiktoken>=0.7
Provides-Extra: dev
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: pytest-cov>=5.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.5; extra == 'dev'
Provides-Extra: docs
Requires-Dist: openpyxl>=3.1; extra == 'docs'
Requires-Dist: pdftext>=0.3; extra == 'docs'
Requires-Dist: python-docx>=1.0; extra == 'docs'
Provides-Extra: hf
Requires-Dist: datasets>=2.19; extra == 'hf'
Description-Content-Type: text/markdown

# asynth

Generate targeted training data to replace expensive LLM API calls with fast, specialized models.

[![Python 3.11+](https://img.shields.io/badge/python-3.11%2B-blue)](https://www.python.org/downloads/)
[![License: Apache 2.0](https://img.shields.io/badge/license-Apache%202.0-green)](LICENSE)

```python
from asynth import synthesize, SynthesisConfig, LiteLLMInferenceConfig
from asynth.configs import GeneralSynthesisParams
from asynth.configs.params.synthesis_params import GeneratedAttribute, TextMessage
from asynth.types.conversation import Role

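# Generate 10 samples with a single LLM-generated attribute.
results = synthesize(SynthesisConfig(
    num_samples=10,
    # Model strings use LiteLLM's "provider/model" format; LiteLLM normally
    # reads credentials such as OPENAI_API_KEY from the environment.
    inference_config=LiteLLMInferenceConfig(model="openai/gpt-4o-mini"),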
    strategy_params=GeneralSynthesisParams(
        generated_attributes=[
            GeneratedAttribute(
                id="qa_pair",
                instruction_messages=[
                    TextMessage(role=Role.SYSTEM, content="You are a trivia question writer."),
                    TextMessage(role=Role.USER, content="Write a trivia Q&A about science."),
                ],
            ),
        ],
    ),
))
```

> [!NOTE]
> asynth is the data engine behind [amortized](https://github.com/amortized-ai/amortized) — a platform for building and deploying task models that replace expensive LLM API calls with fast, cheap, specialized inference.

## Why asynth?

Large models are expensive to run on every request. The alternative: generate synthetic training data, fine-tune a small purpose-built model, and amortize the cost over time.

- **Build task models** — small models that do one thing well, at a fraction of the cost
- **Any LLM as teacher** — use GPT-4o, Claude, Gemini, or any [LiteLLM provider](https://docs.litellm.ai/docs/providers) to generate data; switching teachers only requires changing the model string
- **No heavy dependencies** — no torch, no transformers, no CUDA. Installs in seconds
- **Production pipeline** — attribute sampling, quality checks, conversation planning, and tool-use simulation in a single `synthesize()` call

## Install

```bash
pip install asynth
```

```bash
# Quote the extras so shells such as zsh do not expand the brackets
pip install "asynth[hf]"    # HuggingFace dataset loading
pip install "asynth[docs]"  # Document ingestion (PDF, DOCX, XLSX)
```

Requires Python >= 3.11.

## Features

### Data generation

- **Attribute-based synthesis** — combine sampled, generated, and transformed attributes in a single pipeline
- **Multi-turn conversations** — LLM-powered conversation planning with configurable turn counts and per-role personas
- **Tool-use simulation** — generate agentic conversations with tool calls grounded in environment definitions
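
The three attribute kinds can be pictured with a small standalone sketch. It does not use asynth's API: `build_sample`, the `generate` callable, and the `question` and `prompt` field names are hypothetical and stand in for whatever the real pipeline does internally.

```python
import random
from collections.abc import Callable


def build_sample(
    rng: random.Random,
    sampled: dict[str, list[str]],
    generate: Callable[[dict[str, str]], str],
    template: str,
) -> dict[str, str]:
    """Build one record from sampled, generated, and transformed attributes."""
    # 1. Sampled: draw one value per attribute from a fixed list.
    sample = {name: rng.choice(values) for name, values in sampled.items()}
    # 2. Generated: a teacher model (here any callable) fills a new attribute
    #    using the sampled values as context.
    sample["question"] = generate(sample)
    # 3. Transformed: derive a final field from the attributes built so far.
    sample["prompt"] = template.format(**sample)
    return sample


def fake_teacher(attributes: dict[str, str]) -> str:
    """Stand-in for an LLM call so the sketch runs offline."""
    return f"Ask about {attributes['topic']}"
```

Calling `build_sample(random.Random(0), {"topic": ["physics", "biology"]}, fake_teacher, "{topic}: {question}")` yields a dict with `topic`, `question`, and `prompt` keys; passing a seeded `random.Random` keeps runs reproducible.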

### Data sources

- **Documents** — PDF, DOCX, TXT, Markdown, HTML with token-based segmentation
- **Datasets** — JSONL, CSV, Parquet, TSV, XLSX, and HuggingFace datasets (`hf:org/dataset`)
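
As a rough illustration of token-based segmentation, the sketch below splits a token sequence into fixed-size windows. It is not asynth's implementation: `segment_tokens` and its `overlap` parameter are hypothetical, and whitespace-separated words stand in for real tokenizer output.

```python
def segment_tokens(
    tokens: list[str], max_tokens: int, overlap: int = 0
) -> list[list[str]]:
    """Split tokens into windows of at most max_tokens, sharing `overlap` tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    step = max_tokens - overlap  # how far each window advances
    segments: list[list[str]] = []
    for start in range(0, len(tokens), step):
        segments.append(tokens[start:start + max_tokens])
        # Stop once a window reaches the end, so no trailing fragment
        # made only of overlap tokens is emitted.
        if start + max_tokens >= len(tokens):
            break
    return segments
```

With `tokens = "a b c d e f g".split()`, `segment_tokens(tokens, 3, overlap=1)` produces three windows, each starting with the last token of the previous one. In practice the tokens would come from a real tokenizer rather than `str.split`.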

### Quality

- **Structural validation** — checks for role alternation, empty content, and tool-call consistency before output
- **LLM-as-a-Judge** — `SimpleJudge` and `RuleBasedJudge` with 15 pre-built evaluation configs (code quality, safety, truthfulness, etc.)
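
To make the structural checks concrete, here is a minimal sketch of what such a validator might look like. It assumes OpenAI-style message dicts (`role`, `content`, `tool_calls`, `tool_call_id`); the function name `validate_conversation` and that message layout are assumptions for illustration, not asynth's actual types.

```python
from typing import Any


def validate_conversation(messages: list[dict[str, Any]]) -> list[str]:
    """Return a list of problems; an empty list means the conversation passed."""
    problems: list[str] = []
    previous_role: str | None = None
    pending_calls: set[str] = set()  # tool calls still waiting for a result

    for index, message in enumerate(messages):
        role = message.get("role")
        content = message.get("content")
        tool_calls = message.get("tool_calls") or []

        # Empty content is tolerated only when the message carries tool calls.
        has_text = isinstance(content, str) and bool(content.strip())
        if not has_text and not tool_calls:
            problems.append(f"message {index}: empty content")

        # Two user or two assistant messages in a row break alternation.
        if role in ("user", "assistant") and role == previous_role:
            problems.append(f"message {index}: role {role!r} repeats")
        if role != "system":
            previous_role = role

        # Every tool result must answer a call the assistant actually made.
        if role == "assistant":
            pending_calls.update(call["id"] for call in tool_calls)
        elif role == "tool":
            call_id = message.get("tool_call_id")
            if call_id in pending_calls:
                pending_calls.discard(call_id)
            else:
                problems.append(f"message {index}: result for unknown call {call_id!r}")

    for call_id in sorted(pending_calls):
        problems.append(f"tool call {call_id!r} has no result")
    return problems
```

Returning a list of problems instead of raising on the first one lets a pipeline log every defect in a sample before deciding whether to drop it.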

### Infrastructure

- **Provider-agnostic** — OpenAI, Anthropic, Google, Azure, Together, Fireworks, Ollama, vLLM via LiteLLM
- **Concurrent generation** — async LLM calls with configurable concurrency limits
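
A common way to cap concurrent async calls is an `asyncio.Semaphore`. The sketch below shows that general pattern and is not asynth's internals: `generate_all`, `max_concurrency`, and `fake_model` are hypothetical names, and `fake_model` replaces a real LLM request so the example runs offline.

```python
import asyncio
from collections.abc import Awaitable, Callable


async def generate_all(
    prompts: list[str],
    call_model: Callable[[str], Awaitable[str]],
    max_concurrency: int = 4,
) -> list[str]:
    """Run call_model over all prompts with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(prompt: str) -> str:
        # Each task waits here until a slot is free.
        async with semaphore:
            return await call_model(prompt)

    # gather returns results in the same order as the input prompts.
    return await asyncio.gather(*(bounded(prompt) for prompt in prompts))


async def fake_model(prompt: str) -> str:
    """Stand-in for a network-bound LLM call."""
    await asyncio.sleep(0.01)
    return prompt.upper()
```

For example, `asyncio.run(generate_all(["a", "b", "c"], fake_model, max_concurrency=2))` returns `['A', 'B', 'C']` while never running more than two calls at once.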

## License

Apache 2.0
