Metadata-Version: 2.4
Name: translate-lecture
Version: 0.1.0
Summary: Translate university lectures between languages using the Claude API
License: Apache-2.0
Project-URL: Repository, https://github.com/aieng-lab/translate-lecture
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: anthropic>=0.84
Requires-Dist: python-pptx>=1.0
Dynamic: license-file

# translate-lecture

Translates university lectures between languages using the Claude API.
All visual formatting and styling is preserved exactly — only the text changes.

> **Note:** The current release supports PowerPoint (`.pptx`) files only. Support for additional
> formats — including Jupyter notebooks and LaTeX files — is planned for future releases.

## Features

- Translates every text element: slide bodies, tables, speaker notes, grouped shapes, and elements inside `AlternateContent` XML wrappers
- Fully config-driven: one JSON file per lecture controls the language pair, file list, and all terminology
- Auto-generates an initial config by analysing your PPTX files with Claude
- Respects domain-specific terminology: keep technical terms in the source language, force specific translations, never translate acronyms or author names
- Batched API calls with automatic retry and exponential backoff on rate-limit errors
- Post-processing fixes compound-word spacing artifacts (e.g. `ClusteringAlgorithmen` → `Clustering-Algorithmen`)
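
The batching and retry behaviour listed above could look roughly like the sketch below. It is not the tool's actual implementation: `RateLimited`, `batches`, and `with_backoff` are hypothetical names, and the delay schedule (1 s, 2 s, 4 s, plus a little jitter) is an assumption chosen for illustration.

```python
import random
import time
from typing import Callable, Iterator, Sequence


class RateLimited(Exception):
    """Hypothetical stand-in for the rate-limit error raised by an API client."""


def batches(texts: Sequence[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive chunks of at most `size` texts."""
    for start in range(0, len(texts), size):
        yield list(texts[start:start + size])


def with_backoff(
    call: Callable[[], list[str]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    """Run `call`, retrying on RateLimited with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Delay doubles on every retry; jitter avoids retrying in lockstep.
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
    raise RuntimeError("max_attempts must be at least 1")
```

With a `batch_size` of 40, a deck with 100 text elements would be sent as three calls (40, 40, and 20 texts), each wrapped in `with_backoff`.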

## Setup

**Prerequisites:** Python 3.10+ and an [Anthropic API key](https://console.anthropic.com/).

Export your API key once per shell session before running any commands:

```bash
export ANTHROPIC_API_KEY=<your-key>
```

### Option A — install from PyPI

```bash
pip install translate-lecture
```

### Option B — clone and install locally

```bash
git clone https://github.com/...
cd translate-lecture
pip install -e .
```

Both options install dependencies and register two commands:

```bash
translate-lecture my_lecture.json
translate-lecture-init-config lecture1.pptx --target-language German
```

A virtual environment is recommended for either option:

```bash
python3 -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
```

## Workflow

### 1. Generate an initial config

Point `translate-lecture-init-config` at your source PPTX files and specify the target language.
Claude analyses the slide text and produces a JSON config pre-populated with terminology for your domain.

```bash
translate-lecture-init-config \
    lecture1.pptx lecture2.pptx \
    --target-language German \
    --output my_lecture.json
```

Config generation uses `claude-sonnet-4-6` by default. Override with `--model` if needed.

### 2. Review and edit the config

Open the generated JSON and adjust the term lists to match your lecture's conventions. See
[Config reference](#config-reference) below for what each field does.

### 3. Translate

```bash
translate-lecture my_lecture.json
```

Each source file is translated and saved to its configured target path. Slide counts are
verified after saving.

### Audit mode

Preview all extracted text without making any API calls:

```bash
translate-lecture my_lecture.json --audit
```

### Selecting a model

The default translation model is `claude-haiku-4-5-20251001` (fast and cheap). Override for
higher quality:

```bash
translate-lecture my_lecture.json --model claude-sonnet-4-6
```

## Config reference

```json
{
  "lecture": {
    "description": "machine learning and AI lecture",
    "source_language": "English",
    "target_language": "German"
  },
  "model": {
    "batch_size": 40
  },
  "files": [
    {"source": "01_Introduction.pptx", "target": "01_Introduction_DE.pptx"},
    {"source": "05_Clustering.pptx",   "target": "05_Clustering_DE.pptx"}
  ],
  "translation": {
    "skip_exact": [...],
    "keep_in_source_language": [...],
    "forced_translations": {...},
    "spacing_fix_prefixes": [...]
  }
}
```

| Field | Description |
|---|---|
| `lecture.description` | Used in the system prompt to set the translation context |
| `lecture.source_language` | Source language name, e.g. `"English"` |
| `lecture.target_language` | Target language name, e.g. `"German"` |
| `model.batch_size` | Texts per API call (default: 40) |
| `files` | Source/target PPTX pairs, resolved relative to the config file's directory |
| `translation.skip_exact` | Strings never sent for translation: acronyms, author names, single characters, digits |
| `translation.keep_in_source_language` | Multi-word technical terms kept in the source language (listed in the system prompt) |
| `translation.forced_translations` | Exact source→target mappings included in the system prompt |
| `translation.spacing_fix_prefixes` | Source-language prefixes that get incorrectly merged with target-language words; a hyphen is inserted automatically |

PPTX file paths are resolved relative to the **config file's directory**, so the config and
slides can live anywhere on disk.
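
As a minimal sketch of that resolution rule, the following helper reads a config and joins each `source` and `target` onto the config file's parent directory. The function name `resolve_file_pairs` is hypothetical and not part of the package; only the `files` layout is taken from the reference above.

```python
import json
from pathlib import Path


def resolve_file_pairs(config_path: Path) -> list[tuple[Path, Path]]:
    """Return (source, target) paths resolved against the config's directory."""
    config_path = Path(config_path).resolve()
    config = json.loads(config_path.read_text(encoding="utf-8"))
    base = config_path.parent  # everything is relative to this directory
    return [
        (base / entry["source"], base / entry["target"])
        for entry in config["files"]
    ]
```

Because the base is the config's directory rather than the current working directory, the same config gives the same paths regardless of where the command is launched from.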

## Example config

We recommend starting from the config generated by `translate-lecture-init-config` and adjusting
the term lists as needed. The example below shows what a generated config for a machine learning lecture might look like.

```json
{
  "lecture": {
    "description": "machine learning and AI lecture",
    "source_language": "English",
    "target_language": "German"
  },
  "model": {
    "batch_size": 40,
    "retry_batch_size": 25
  },
  "files": [
    {"source": "01_Introduction.pptx",      "target": "01_Introduction_DE.pptx"},
    {"source": "02_Foundations-of-ML.pptx", "target": "02_Foundations-of-ML_DE.pptx"}
  ],
  "translation": {
    "skip_exact": [
      "Prof. Dr. Steffen Herbold",
      "DBSCAN", "k-means", "SVM", "CNN", "LSTM", "LLM",
      "k", "n", "d", "m",
      "+", "-", "0", "1", "2"
    ],
    "keep_in_source_language": [
      "Deep Learning", "Clustering", "Backpropagation",
      "Gradient Descent", "Overfitting", "Underfitting",
      "Random Forest", "Neural Network", "Decision Tree",
      "Training", "Test", "Validation", "Feature", "Label",
      "Precision", "Recall", "F1", "Bias", "Variance"
    ],
    "forced_translations": {
      "Artificial Intelligence": "Künstliche Intelligenz (KI)",
      "Classification":          "Klassifikation",
      "Introduction":            "Einführung",
      "Overview":                "Überblick",
      "Summary":                 "Zusammenfassung",
      "Algorithm":               "Algorithmus"
    },
    "spacing_fix_prefixes": [
      "Clustering", "Learning", "Training", "Boosting", "Bagging"
    ]
  }
}
```

**`skip_exact`** — strings passed through unchanged without any API call. Use this for
acronyms, author names, single-letter variables, digits, and operators.
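
One way to picture this filter is the sketch below. The helper names are hypothetical, and two details are assumptions rather than documented behaviour: matching is case-sensitive, and surrounding whitespace is ignored when comparing.

```python
def split_for_translation(texts: list[str], skip_exact: list[str]) -> list[tuple[int, str]]:
    """Return (index, text) pairs that still need an API call."""
    skip = set(skip_exact)
    return [
        (i, text)
        for i, text in enumerate(texts)
        # Empty strings and exact matches are passed through untouched.
        if text.strip() and text.strip() not in skip
    ]


def merge_translations(
    texts: list[str], pending: list[tuple[int, str]], translated: list[str]
) -> list[str]:
    """Put translated strings back at their original positions."""
    result = list(texts)
    for (index, _), new_text in zip(pending, translated):
        result[index] = new_text
    return result
```

Note that only whole strings are skipped: `"Clustering with k-means"` is still sent for translation even though `"k-means"` is in the list.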

**`keep_in_source_language`** — multi-word technical terms that stay in English even inside
German text. They are listed in the system prompt so the model knows not to translate them.

**`forced_translations`** — explicit source→target pairs injected into the system prompt.
Use this when a term has one unambiguous, well-known translation.
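
To illustrate how these two lists might end up in a system prompt, here is a rough sketch. The prompt wording and the function name `build_system_prompt` are invented for illustration; the tool's real prompt is not shown in this README and may differ.

```python
def build_system_prompt(config: dict) -> str:
    """Assemble a system prompt from the lecture and translation settings."""
    lecture = config["lecture"]
    translation = config.get("translation", {})
    source, target = lecture["source_language"], lecture["target_language"]

    # Hypothetical wording: the real prompt text is an unknown.
    lines = [f"You translate slides of a {lecture['description']} from {source} to {target}."]

    keep = translation.get("keep_in_source_language", [])
    if keep:
        lines.append(f"Keep these terms in {source}: " + ", ".join(keep) + ".")

    forced = translation.get("forced_translations", {})
    if forced:
        lines.append("Always use these translations:")
        lines.extend(f"- {src} -> {dst}" for src, dst in forced.items())

    return "\n".join(lines)
```

Since both lists only shape the prompt, they act as instructions to the model rather than hard guarantees, which is a reason to spot-check the output for terms that matter.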

**`spacing_fix_prefixes`** — English prefixes that the model sometimes merges with the
following German word (e.g. `ClusteringAlgorithmen`). A hyphen is inserted automatically
to produce `Clustering-Algorithmen`.
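
A post-processing step of this kind could be sketched with a regular expression, as below. This is an assumption about how the fix might work, not the package's code: the hypothetical `fix_spacing` inserts a hyphen only when a prefix starts a word and is followed directly by an uppercase letter, which suits German noun capitalisation.

```python
import re


def fix_spacing(text: str, prefixes: list[str]) -> str:
    """Insert a hyphen between a listed prefix and a directly attached word."""
    for prefix in prefixes:
        # \b: prefix must start a word; lookahead: next char is uppercase,
        # so "Clustering Algorithmen" and "Clustering-Algorithmen" are untouched.
        pattern = re.compile(rf"\b{re.escape(prefix)}(?=[A-ZÄÖÜ])")
        text = pattern.sub(lambda match: match.group(0) + "-", text)
    return text


print(fix_spacing("ClusteringAlgorithmen", ["Clustering", "Learning"]))
# → Clustering-Algorithmen
```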
