Metadata-Version: 2.4
Name: teich
Version: 0.1.1a47
Summary: Turn coding agent traces into auditable supervised fine-tuning data
License: Apache-2.0
License-File: LICENSE
Requires-Python: >=3.10
Requires-Dist: datasets>=2.19.0
Requires-Dist: huggingface-hub>=0.23.0
Requires-Dist: pydantic>=2.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.0
Requires-Dist: typer>=0.12
Provides-Extra: dev
Requires-Dist: jinja2>=3.1; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Description-Content-Type: text/markdown

# Teich

Agent SFT data infrastructure for generation, normalization, chat-template rendering, response masking, and training audits.

Teich is not only a dataset generator. It is a bridge between messy agent/chat data and model-specific supervised fine-tuning.

Start from any of these:

- Fresh Codex or Pi traces
- Text-only chat generations
- Local JSONL files or folders
- Hugging Face datasets
- Already-loaded `datasets.Dataset` objects

Then Teich handles the training-sensitive parts:

- Normalize sources into OpenAI-style `messages` / `tools`
- Render through the target tokenizer chat template
- Preserve typed supervision spans before tokenization
- Apply response-only labels after trainer tokenization

That means the same package can:

- Generate new agent traces or chat-only distillation data.
- Load and normalize existing local or Hub datasets.
- Mix chat-only and tool-call datasets with explicit ratios.
- Preserve raw traces as source-of-truth artifacts.
- Render with arbitrary tokenizer chat templates.
- Supervise assistant reasoning, final answers, and tool calls while keeping prompts and tool responses masked.
- Audit labels before training so fully masked or misaligned rows fail early.

## Mental Model

```text
prompts / traces / JSONL / HF datasets / Dataset objects
        ↓
load_traces() or prepare_data()
        ↓
normalized messages + tools
        ↓
tokenizer chat template rendering
        ↓
trainer-friendly text + Teich supervision spans
        ↓
SFTTrainer tokenization
        ↓
mask_data()
        ↓
audited input_ids + labels
```

Use only the pieces you need:

- Already have a dataset? Skip generation and go straight to `prepare_data()`.
- Want raw trace preservation? Use the CLI.
- Want standard next-token training? Use `prepare_data(..., teich_masking=False)` and skip `mask_data()`.
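For the last case, a minimal sketch of plain next-token preparation (the dataset ID is a placeholder and `tokenizer` is any Hugging Face chat-template tokenizer):

```python
from teich import prepare_data

# Plain next-token training: no Teich supervision spans are kept, and there
# is no mask_data() step afterwards.
train_dataset = prepare_data(
    "username/my-chat-dataset",  # placeholder dataset ID
    tokenizer,
    teich_masking=False,
)
```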

## Entry Points

| Goal | Use |
| --- | --- |
| Generate coding-agent traces | `teich generate` with `agent.provider: codex` or `pi` |
| Generate text-only chat rows | `teich generate` with `agent.provider: chat` |
| Load raw traces manually | `load_traces()` |
| Prepare local/HF/mixed datasets for training | `prepare_data()` |
| Apply response-only labels after TRL/Unsloth tokenization | `mask_data()` |
| Inspect supervised vs masked tokens | `preview_sft_example()` / `trainer.train_dataset.preview()` |

## Install

```bash
pip install teich
```

To create a new generation project:

```bash
teich init my-project && cd my-project
teich generate -c config.yaml
```

Or use [astral-uv](https://docs.astral.sh/uv/getting-started/installation/):

```bash
uvx teich init my-project && cd my-project
uvx teich generate -c config.yaml
```

> Edit `config.yaml` and `prompts.jsonl` before running a real generation batch.

## Core Capabilities

- **Trace-first data collection**: Run real coding agents and keep raw session traces as the source of truth.
- **Dataset-first training**: Load existing JSONL files, folders, Hugging Face repos, or `datasets.Dataset` objects without using the generator.
- **Multi-provider generation**: Works with Docker-backed Codex/Pi and a direct OpenAI-compatible `chat` mode.
- **Structured conversion**: Converts traces into chat messages with tool calls, reasoning, tool results, metadata, and configured tool snapshots.
- **Universal masking surface**: Supports assistant reasoning, final answers, tool calls, user/system/developer text, and tool responses as independently configurable masking targets.
- **Multi-turn and tool-aware labels**: Avoids Unsloth-style single-span masking pitfalls by storing typed spans before tokenization and aligning them after trainer tokenization.
- **Source mixing**: Mix local paths, Hub datasets, and in-memory datasets; explicit percentages stay true by scaling to the limiting source instead of silently changing ratios.
- **Hugging Face integration**: Publishes dataset cards with embedded tool-schema snapshots, and loads local or Hub datasets through one API.

## 📥 Prerequisites

Requirements for agent trace generation:

- Docker
- OpenAI/OpenRouter API key (or local OpenAI-compatible endpoint)

`agent.provider: chat` does not require Docker.

The Python utilities also work without Docker if you already have traces or structured JSONL datasets.

Training examples use your existing fine-tuning stack. For the TRL example below, install compatible versions of `transformers`, `trl`, and your model-loading stack separately.

## Common Workflows

### Prepare an existing dataset for training

You do not need to generate data with Teich first.

If a local file, folder, Hugging Face dataset, or `datasets.Dataset` has a `messages` column, Teich can usually prepare it directly.

```python
from transformers import AutoTokenizer

from teich import prepare_data

# Any Hugging Face tokenizer with a chat template works; the model here is illustrative.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

train_dataset = prepare_data(
    "TeichAI/Claude-Opus-4.6-Reasoning-887x",
    tokenizer,
    max_length=32768,
    drop_oversized_examples=True,
    tokenize=True,
    chat_template_kwargs={"enable_thinking": True, "preserve_thinking": True},
)
```

`prepare_data()` returns rendered `text`, Teich span metadata, and optionally `input_ids` / `attention_mask`. Call `mask_data()` after constructing your trainer to convert those spans into labels.
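A quick sanity check of the returned rows (column names follow the description above; the exact span-metadata field names are Teich internals):

```python
row = train_dataset[0]
print(row["text"][:300])   # rendered chat-template text
print(sorted(row.keys()))  # includes input_ids / attention_mask when tokenize=True
```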

### Mix agent and chat datasets

```python
train_dataset = prepare_data(
    {
        "max_examples": 1000,
        "agent": {"source": "badlogicgames/pi-mono", "percentage": 80},
        "chat": {"source": "TeichAI/Claude-Opus-4.6-Reasoning-887x", "percentage": 20},
    },
    tokenizer,
    max_length=32768,
    drop_oversized_examples=True,
    tokenize=True,
    chat_template_kwargs={"enable_thinking": True, "preserve_thinking": True},
)
```

Explicit `percentage`, `proportion`, and `weight` values are treated as true ratios.

If one source cannot fill its share after filtering or context-window drops, Teich scales the total row count down instead of silently changing the realized mix.
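As a concrete example of that scaling rule (plain arithmetic, not a Teich API, assuming proportional scaling as described):

```python
# Requested: 1000 rows at an 80/20 agent/chat mix -> 800 agent + 200 chat.
# If only 400 agent rows survive filtering, the total scales down so the
# realized mix stays 80/20 instead of drifting.
agent_available = 400
total = int(agent_available / 0.80)  # 500 total rows
chat_rows = total - agent_available  # 100 -> 400:100 is still 80:20
```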

### Generate new data from prompts

```bash
# Initialize project
teich init my-project
cd my-project

# Add prompts to prompts.jsonl, then:
export OPENAI_API_KEY=sk-...
teich generate -c config.yaml
```

Outputs:

- `codex` / `pi`: raw traces in `output/`, sandboxes in `sandbox/`, and a `README.md`
- `chat`: text-only JSONL training rows in `output/` and a dataset `README.md`

If `publish.repo_id` is configured, Teich also creates or updates the matching Hugging Face **dataset** repo.

Uploaded artifacts include:

- Generated JSONL
- Dataset `README.md`
- Embedded tool-schema snapshot in the dataset card when tools are present

If a long run is interrupted, use:

```bash
teich generate -c config.yaml --resume
```

Teich will scan existing outputs and skip prompts that have already been converted into completed training examples.

Prompt files can be JSONL/NDJSON, JSON, CSV, or plain text.

JSONL is recommended because it handles long multiline prompts, repository metadata, and chat follow-up turns without CSV escaping problems.

Recommended `prompts.jsonl`:

```jsonl
{"prompt":"Build a simple todo list app in React"}
{"github_repo":"armand0e/perplexica-mcp","prompt":"Add a small usability improvement and update the tests"}
{"prompt":"Draft a compact project plan","follow_up_prompts":["Revise it for a solo developer","Add a risk checklist"]}
```

With `agent.provider: chat`, `follow_up_prompts` become real additional user turns within a single generated training row.

`codex` and `pi` currently run one non-interactive coding-agent prompt per trace. Keep those prompt rows single-turn until native interactive follow-ups are added.

### Generate a text-only chat dataset

```yaml
agent:
  provider: chat

model:
  model: gpt-4.1-mini

api:
  provider: openai
  wire_api: responses
```

Each generated JSONL line will look like:

```json
{"messages":[{"role":"system","content":"You are a helpful assistant","thinking":null},{"role":"user","content":"Hello","thinking":null},{"role":"assistant","content":"Hi!","thinking":"I should greet the user."}],"system":"You are a helpful assistant","prompt":"Hello","thinking":"I should greet the user.","response":"Hi!","model":"gpt-4.1-mini"}
```

With follow-ups, the same row contains:

- Alternating `user` and `assistant` messages
- `follow_up_prompts`
- Per-turn `responses`
- Final `response`
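Sketched as a Python dict (values are illustrative; field names follow the lists in this README):

```python
follow_up_row = {
    "messages": [
        {"role": "user", "content": "Draft a compact project plan"},
        {"role": "assistant", "content": "Plan v1 ..."},
        {"role": "user", "content": "Revise it for a solo developer"},
        {"role": "assistant", "content": "Plan v2 ..."},
    ],
    "prompt": "Draft a compact project plan",
    "follow_up_prompts": ["Revise it for a solo developer"],
    "responses": ["Plan v1 ...", "Plan v2 ..."],
    "response": "Plan v2 ...",
}
```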

### Train with Unsloth and TRL `SFTTrainer`

Use the trainer-first path:

1. `prepare_data` renders trainer-friendly `text` rows with Teich supervision metadata.
2. `SFTTrainer` tokenizes them.
3. `mask_data` applies multi-turn/tool-aware response-only labels to the trainer dataset.

```python
import os

from unsloth import FastLanguageModel
from trl import SFTConfig, SFTTrainer

from teich import mask_data, prepare_data

MAX_SEQ_LEN = 32768
MODEL_NAME = "unsloth/Qwen3.5-0.8B"
CHAT_TEMPLATE_KWARGS = {"enable_thinking": True}
PUSH_TO_HUB_REPO_ID = "username/teich-sft-model"
HF_TOKEN = os.environ.get("HF_TOKEN") or ""

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=MAX_SEQ_LEN,
    load_in_4bit=False,
    load_in_8bit=False,
    full_finetuning=False,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "out_proj"],
    lora_alpha=64,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)

train_dataset = prepare_data(
    "TeichAI/lordx64-claude-opus-4.7-max-cleaned",
    tokenizer,
    split="train",
    max_examples=500,
    chat_template_kwargs=CHAT_TEMPLATE_KWARGS,
    max_length=MAX_SEQ_LEN,
    drop_oversized_examples=True,
    tokenize=True,
    strict=True,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=None,
    args=SFTConfig(
        dataset_text_field="text",
        dataset_num_proc=1,
        max_length=MAX_SEQ_LEN,
        packing=False,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        num_train_epochs=1,
        learning_rate=2e-4,
        logging_steps=1,
        optim="muon",
        optim_target_modules="all-linear",
        weight_decay=0.001,
        lr_scheduler_type="linear",
        output_dir="outputs",
        seed=3407,
        report_to="none",
    ),
)
trainer = mask_data(
    trainer,
    tokenizer=tokenizer,
    train_on_reasoning=True,
    train_on_final_answers=True,
    train_on_tools=True,
)

trainer_stats = trainer.train(resume_from_checkpoint=False)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")

model.push_to_hub_merged(PUSH_TO_HUB_REPO_ID, tokenizer, save_method="merged_16bit", token=HF_TOKEN)
```

`prepare_data`:

- Loads local folders, local files, Hugging Face datasets, source mixes, or `datasets.Dataset` objects.
- Applies the tokenizer chat template.
- Optionally tokenizes only to drop rows above `max_length`.
- Returns trainer-friendly `text` rows with typed Teich span metadata.
- Supports `teich_masking=False` for plain next-token training without Teich response-only labels.

For Unsloth / TRL, pass `tokenize=True` so trainer setup treats the dataset as already tokenized and preserves Teich span metadata until `mask_data()` runs.

`mask_data`:

- Follows the same trainer-first shape as Unsloth's response-only helper.
- Uses Teich span metadata so multi-turn tool calls and tool responses are masked correctly.
- Trains on assistant reasoning, final answers, and tool calls by default.
- Keeps user/system/developer/tool-response text masked by default.
- Returns a compact trainer dataset with only `input_ids` and `labels`.

You can override the default policy with `train_on_reasoning`, `train_on_final_answers`, `train_on_tools`, `train_on_user`, `train_on_system`, `train_on_developer`, and `train_on_tool_responses`.

Keep `packing=False` for this flow because packed datasets merge row boundaries before masking. For long-context runs, `max_supervised_tokens` defaults to the trainer's `max_length` to cap the number of trainable answer tokens per row.
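For example, a sketch that supervises only final answers, caps trainable tokens, and audits the result (all kwargs are named above; the `preview()` call is the one from the entry-points table):

```python
trainer = mask_data(
    trainer,
    tokenizer=tokenizer,
    train_on_reasoning=False,      # mask chain-of-thought spans
    train_on_final_answers=True,   # supervise final answers only
    train_on_tools=False,          # mask tool-call spans
    max_supervised_tokens=8192,    # explicit cap instead of the max_length default
)

# Audit supervised vs masked tokens before training starts.
trainer.train_dataset.preview()
```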

To combine datasets, pass a list of dataset IDs, local paths, or loaded `datasets.Dataset` objects:

```python
train_dataset = prepare_data(
    ["username/chat-traces", "username/tool-traces"],
    tokenizer,
    max_length=MAX_SEQ_LEN,
    drop_oversized_examples=True,
    tokenize=True,
    chat_template_kwargs=CHAT_TEMPLATE_KWARGS,
)
```

Weighted mixes follow the same ratio rules as in "Mix agent and chat datasets" above: explicit `percentage`, `proportion`, and `weight` values are treated as true ratios, and if one source cannot fill its share after filtering or context-window drops, Teich scales the total row count down instead of silently changing the realized mix.

### Fallback manual flow with `load_traces`

Use `load_traces` directly when you want to own the rest of the training pipeline yourself:

- Chat-template rendering
- Filtering
- Tokenization
- Label masking
- Packing policy
- Auditing

```python
from transformers import AutoTokenizer

from teich import load_traces

# Any chat-template tokenizer; reuse the one from your training setup.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

dataset = load_traces("./output")
example = dataset[0]

rendered = tokenizer.apply_chat_template(
    example["messages"],
    tools=example.get("tools") or [],
    tokenize=False,
    add_generation_prompt=False,
    enable_thinking=True,
)
# Depending on the chat template, you may need add_special_tokens=False here
# to avoid duplicating a BOS token already rendered into the text.
tokenized = tokenizer(rendered, truncation=True, max_length=32768)
```
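Label masking is on you in this flow. A rough single-turn sketch is below; locating supervision boundaries across multi-turn and tool-call rows is exactly what `prepare_data()` + `mask_data()` automate. It assumes prefix-consistent tokenization and no extra special tokens:

```python
# Render everything before the final assistant reply and mask it with -100,
# the standard Hugging Face ignore index for the loss.
prompt_text = tokenizer.apply_chat_template(
    example["messages"][:-1],
    tokenize=False,
    add_generation_prompt=True,
)
prompt_len = len(tokenizer(prompt_text, add_special_tokens=False)["input_ids"])
labels = [-100] * prompt_len + tokenized["input_ids"][prompt_len:]
```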

## 📋 Configuration

`config.yaml`:

```yaml
agent:
  provider: codex  # or pi or chat

model:
  model: codex-mini-latest
  approval_policy: never
  sandbox: danger-full-access

prompts_file: prompts.jsonl

prompts: []
# For chat provider follow-up turns:
# prompts:
#   - prompt: "Draft a compact project plan"
#     follow_up_prompts:
#       - "Revise it for a solo developer"
#       - "Add a risk checklist"

output:
  traces_dir: ./output
  sandbox_dir: ./sandbox
  pretty_name: "My Agent Traces"

publish:
  repo_id: armand0e/my-dataset
  hf_token: hf_xxx
  private: false
```
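The same file can be loaded from Python. A sketch, assuming `Config` validates parsed YAML directly as a Pydantic model (`pydantic` is a declared dependency; check the class for a dedicated loader):

```python
import yaml

from teich import Config

with open("config.yaml") as f:
    config = Config(**yaml.safe_load(f))  # assumption: direct model validation
```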

Dataset tags are auto-generated from the provider and model:

- `codex` / `pi`: `agent-traces`, `<provider>`, `distillation`, `<model>`, `teich`
- `chat`: `conversational`, `distillation`, `teich`, `<model>`

If `publish.hf_token` is omitted, Teich also accepts `HF_TOKEN`, `HUGGINGFACE_HUB_TOKEN`, or `TEICH_HF_TOKEN` from the environment.

### Local providers (LM Studio, Ollama)

```bash
export TEICH_PROVIDER=LMstudio
export TEICH_MODEL=gemma-4
export TEICH_BASE_URL=http://localhost:1234/v1
export TEICH_API_KEY=llm

teich generate -c config.yaml
```

## 🏗️ Data Structure

Training examples include:

- `prompt`: initial task description
- `follow_up_prompts`: optional additional chat turns generated after the initial prompt
- `messages`: chat history (system, user, assistant, tool)
- `tools`: tool schemas used in the session
- `metadata`: session info, model, timestamps, and usage when available

Structured chat datasets can also include convenience top-level fields like:

- `system`
- `follow_up_prompts`
- `thinking`
- `response`
- `responses`
- `model`

Assistant messages capture:

- `content`: text response
- `reasoning_content`: chain-of-thought traces
- `tool_calls`: function calls with arguments
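Sketched as an OpenAI-style message (values are illustrative; `tool_calls` follows the OpenAI function-call shape the normalizer targets):

```python
assistant_message = {
    "role": "assistant",
    "content": "I'll inspect the repo layout first.",
    "reasoning_content": "Need to see the file tree before editing.",
    "tool_calls": [
        {
            "id": "call_0",
            "type": "function",
            "function": {"name": "bash", "arguments": "{\"command\": \"ls\"}"},
        }
    ],
}
```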

## 🔧 Python API

```python
from teich import (
    prepare_data,        # Recommended: render trainer-friendly text rows
    mask_data,           # Recommended: apply Teich labels after SFTTrainer tokenization
    load_traces,         # Fallback: load rows for fully manual processing
    preview_sft_example, # Preview supervised vs masked tokens
    Config,              # Load config.yaml
    TrainingExample,     # Typed training example
)
```

`README.md` is the package readme used for PyPI, so these examples are the canonical public package docs.

## 📦 Trace-First Workflow

Teich preserves the **raw agent session** as the source of truth:

1. **Collect**: Run agents on real tasks → raw `.jsonl` traces
2. **Inspect/Share**: Traces are human-readable and uploadable
3. **Convert**: Transform to structured examples when ready
4. **Prepare**: Use `prepare_data()` + `mask_data()` to apply model-specific templates and labels through the trainer-first flow

If you choose `agent.provider: chat`, Teich skips the trace-preservation step and writes structured text-only JSONL rows directly.

This means you can:

- Re-convert with different logic later
- Share raw traces before releasing training data
- Train on the same sessions with different model templates

## 🛠️ Development

```bash
uv pip install -e ".[dev]"
uv run pytest --ignore=tests/test_integration.py -q
```

## 📌 Status

Teich is **alpha**. The core workflow is stable and usable. APIs may evolve as more agent types and training workflows are added.

## 📄 License

Apache-2.0
