Metadata-Version: 2.4
Name: doctr-synth-generator
Version: 0.3.0
Summary: A synthetic data generator for training OCR models
Author-email: Felix Dittrich <felixdittrich92@gmail.com>
Maintainer: Felix Dittrich
License:                                  Apache License
                                   Version 2.0, January 2004
                                http://www.apache.org/licenses/
        
           TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
        
           1. Definitions.
        
              "License" shall mean the terms and conditions for use, reproduction,
              and distribution as defined by Sections 1 through 9 of this document.
        
              "Licensor" shall mean the copyright owner or entity authorized by
              the copyright owner that is granting the License.
        
              "Legal Entity" shall mean the union of the acting entity and all
              other entities that control, are controlled by, or are under common
              control with that entity. For the purposes of this definition,
              "control" means (i) the power, direct or indirect, to cause the
              direction or management of such entity, whether by contract or
              otherwise, or (ii) ownership of fifty percent (50%) or more of the
              outstanding shares, or (iii) beneficial ownership of such entity.
        
              "You" (or "Your") shall mean an individual or Legal Entity
              exercising permissions granted by this License.
        
              "Source" form shall mean the preferred form for making modifications,
              including but not limited to software source code, documentation
              source, and configuration files.
        
              "Object" form shall mean any form resulting from mechanical
              transformation or translation of a Source form, including but
              not limited to compiled object code, generated documentation,
              and conversions to other media types.
        
              "Work" shall mean the work of authorship, whether in Source or
              Object form, made available under the License, as indicated by a
              copyright notice that is included in or attached to the work
              (an example is provided in the Appendix below).
        
              "Derivative Works" shall mean any work, whether in Source or Object
              form, that is based on (or derived from) the Work and for which the
              editorial revisions, annotations, elaborations, or other modifications
              represent, as a whole, an original work of authorship. For the purposes
              of this License, Derivative Works shall not include works that remain
              separable from, or merely link (or bind by name) to the interfaces of,
              the Work and Derivative Works thereof.
        
              "Contribution" shall mean any work of authorship, including
              the original version of the Work and any modifications or additions
              to that Work or Derivative Works thereof, that is intentionally
              submitted to Licensor for inclusion in the Work by the copyright owner
              or by an individual or Legal Entity authorized to submit on behalf of
              the copyright owner. For the purposes of this definition, "submitted"
              means any form of electronic, verbal, or written communication sent
              to the Licensor or its representatives, including but not limited to
              communication on electronic mailing lists, source code control systems,
              and issue tracking systems that are managed by, or on behalf of, the
              Licensor for the purpose of discussing and improving the Work, but
              excluding communication that is conspicuously marked or otherwise
              designated in writing by the copyright owner as "Not a Contribution."
        
              "Contributor" shall mean Licensor and any individual or Legal Entity
              on behalf of whom a Contribution has been received by Licensor and
              subsequently incorporated within the Work.
        
           2. Grant of Copyright License. Subject to the terms and conditions of
              this License, each Contributor hereby grants to You a perpetual,
              worldwide, non-exclusive, no-charge, royalty-free, irrevocable
              copyright license to reproduce, prepare Derivative Works of,
              publicly display, publicly perform, sublicense, and distribute the
              Work and such Derivative Works in Source or Object form.
        
           3. Grant of Patent License. Subject to the terms and conditions of
              this License, each Contributor hereby grants to You a perpetual,
              worldwide, non-exclusive, no-charge, royalty-free, irrevocable
              (except as stated in this section) patent license to make, have made,
              use, offer to sell, sell, import, and otherwise transfer the Work,
              where such license applies only to those patent claims licensable
              by such Contributor that are necessarily infringed by their
              Contribution(s) alone or by combination of their Contribution(s)
              with the Work to which such Contribution(s) was submitted. If You
              institute patent litigation against any entity (including a
              cross-claim or counterclaim in a lawsuit) alleging that the Work
              or a Contribution incorporated within the Work constitutes direct
              or contributory patent infringement, then any patent licenses
              granted to You under this License for that Work shall terminate
              as of the date such litigation is filed.
        
           4. Redistribution. You may reproduce and distribute copies of the
              Work or Derivative Works thereof in any medium, with or without
              modifications, and in Source or Object form, provided that You
              meet the following conditions:
        
              (a) You must give any other recipients of the Work or
                  Derivative Works a copy of this License; and
        
              (b) You must cause any modified files to carry prominent notices
                  stating that You changed the files; and
        
              (c) You must retain, in the Source form of any Derivative Works
                  that You distribute, all copyright, patent, trademark, and
                  attribution notices from the Source form of the Work,
                  excluding those notices that do not pertain to any part of
                  the Derivative Works; and
        
              (d) If the Work includes a "NOTICE" text file as part of its
                  distribution, then any Derivative Works that You distribute must
                  include a readable copy of the attribution notices contained
                  within such NOTICE file, excluding those notices that do not
                  pertain to any part of the Derivative Works, in at least one
                  of the following places: within a NOTICE text file distributed
                  as part of the Derivative Works; within the Source form or
                  documentation, if provided along with the Derivative Works; or,
                  within a display generated by the Derivative Works, if and
                  wherever such third-party notices normally appear. The contents
                  of the NOTICE file are for informational purposes only and
                  do not modify the License. You may add Your own attribution
                  notices within Derivative Works that You distribute, alongside
                  or as an addendum to the NOTICE text from the Work, provided
                  that such additional attribution notices cannot be construed
                  as modifying the License.
        
              You may add Your own copyright statement to Your modifications and
              may provide additional or different license terms and conditions
              for use, reproduction, or distribution of Your modifications, or
              for any such Derivative Works as a whole, provided Your use,
              reproduction, and distribution of the Work otherwise complies with
              the conditions stated in this License.
        
           5. Submission of Contributions. Unless You explicitly state otherwise,
              any Contribution intentionally submitted for inclusion in the Work
              by You to the Licensor shall be under the terms and conditions of
              this License, without any additional terms or conditions.
              Notwithstanding the above, nothing herein shall supersede or modify
              the terms of any separate license agreement you may have executed
              with Licensor regarding such Contributions.
        
           6. Trademarks. This License does not grant permission to use the trade
              names, trademarks, service marks, or product names of the Licensor,
              except as required for reasonable and customary use in describing the
              origin of the Work and reproducing the content of the NOTICE file.
        
           7. Disclaimer of Warranty. Unless required by applicable law or
              agreed to in writing, Licensor provides the Work (and each
              Contributor provides its Contributions) on an "AS IS" BASIS,
              WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
              implied, including, without limitation, any warranties or conditions
              of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
              PARTICULAR PURPOSE. You are solely responsible for determining the
              appropriateness of using or redistributing the Work and assume any
              risks associated with Your exercise of permissions under this License.
        
           8. Limitation of Liability. In no event and under no legal theory,
              whether in tort (including negligence), contract, or otherwise,
              unless required by applicable law (such as deliberate and grossly
              negligent acts) or agreed to in writing, shall any Contributor be
              liable to You for damages, including any direct, indirect, special,
              incidental, or consequential damages of any character arising as a
              result of this License or out of the use or inability to use the
              Work (including but not limited to damages for loss of goodwill,
              work stoppage, computer failure or malfunction, or any and all
              other commercial damages or losses), even if such Contributor
              has been advised of the possibility of such damages.
        
           9. Accepting Warranty or Additional Liability. While redistributing
              the Work or Derivative Works thereof, You may choose to offer,
              and charge a fee for, acceptance of support, warranty, indemnity,
              or other liability obligations and/or rights consistent with this
              License. However, in accepting such obligations, You may act only
              on Your own behalf and on Your sole responsibility, not on behalf
              of any other Contributor, and only if You agree to indemnify,
              defend, and hold each Contributor harmless for any liability
              incurred by, or claims asserted against, such Contributor by reason
              of your accepting any such warranty or additional liability.
        
           END OF TERMS AND CONDITIONS
        
           APPENDIX: How to apply the Apache License to your work.
        
              To apply the Apache License to your work, attach the following
              boilerplate notice, with the fields enclosed by brackets "[]"
              replaced with your own identifying information. (Don't include
              the brackets!)  The text should be enclosed in the appropriate
              comment syntax for the file format. We also recommend that a
              file or class name and description of purpose be included on the
              same "printed page" as the copyright notice for easier
              identification within third-party archives.
        
           Copyright [yyyy] [name of copyright owner]
        
           Licensed under the Apache License, Version 2.0 (the "License");
           you may not use this file except in compliance with the License.
           You may obtain a copy of the License at
        
               http://www.apache.org/licenses/LICENSE-2.0
        
           Unless required by applicable law or agreed to in writing, software
           distributed under the License is distributed on an "AS IS" BASIS,
           WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
           See the License for the specific language governing permissions and
           limitations under the License.
        
Project-URL: repository, https://github.com/felixdittrich92/docTR-Synth-Generator
Project-URL: tracker, https://github.com/felixdittrich92/docTR-Synth-Generator/issues
Project-URL: changelog, https://github.com/felixdittrich92/docTR-Synth-Generator/releases
Keywords: OCR,deep learning,computer vision,text recognition,synthetic data,data augmentation,image processing
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: <4,>=3.10.0
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: fonttools<5.0.0,>=4.50.0
Requires-Dist: numpy<3.0.0,>=2.0.0
Requires-Dist: Pillow>=11.0.0
Provides-Extra: testing
Requires-Dist: pytest>=5.3.2; extra == "testing"
Requires-Dist: coverage[toml]>=4.5.4; extra == "testing"
Provides-Extra: quality
Requires-Dist: ruff>=0.1.5; extra == "quality"
Requires-Dist: mypy>=0.812; extra == "quality"
Requires-Dist: pre-commit>=2.17.0; extra == "quality"
Provides-Extra: dev
Requires-Dist: fonttools<5.0.0,>=4.50.0; extra == "dev"
Requires-Dist: numpy<3.0.0,>=2.0.0; extra == "dev"
Requires-Dist: Pillow>=11.0.0; extra == "dev"
Requires-Dist: pytest>=5.3.2; extra == "dev"
Requires-Dist: coverage[toml]>=4.5.4; extra == "dev"
Requires-Dist: ruff>=0.1.5; extra == "dev"
Requires-Dist: mypy>=0.812; extra == "dev"
Requires-Dist: pre-commit>=2.17.0; extra == "dev"
Dynamic: license-file

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)
![Build Status](https://github.com/felixdittrich92/docTR-Synth-Generator/workflows/builds/badge.svg)
[![codecov](https://codecov.io/gh/felixdittrich92/docTR-Synth-Generator/graph/badge.svg?token=31MDR20JGI)](https://codecov.io/gh/felixdittrich92/docTR-Synth-Generator)
[![CodeFactor](https://www.codefactor.io/repository/github/felixdittrich92/doctr-synth-generator/badge)](https://www.codefactor.io/repository/github/felixdittrich92/doctr-synth-generator)
[![Pypi](https://img.shields.io/badge/pypi-v0.3.0a0-blue.svg)](https://pypi.org/project/docTR-Synth-Generator/)

# docTR-Synth-Generator
A tool to generate synthetic OCR datasets - made for docTR

![Examples: detection pages and recognition crops](https://github.com/felixdittrich92/docTR-Synth-Generator/raw/main/docs/examples_grid.png)

## Features

- **Zero-config**: generate a dataset with nothing but an output directory - real
  words, matching fonts *and* background images are downloaded automatically.
- **Multilingual by language code**: `languages=["de", "ru", "ar", ...]` resolves
  both the words *and* the fonts for each script (~85 languages), with correct
  complex-script shaping and right-to-left layout for Arabic/Hebrew.
- **No more dropped words**: any character a local font cannot render triggers an
  on-demand download of a font that can, instead of silently skipping the word.
- **Realistic output**: supersampled anti-aliasing, background-aware ink colour
  and contrast (dark-on-light and light-on-dark), faux-bold/outlines, and
  scanner/camera-style degradations (JPEG artifacts, sensor noise, blur).
- **Controllable balancing**: explicit per-language allocation, a stratified
  train/val split, optional character-coverage guarantees, and a balance report.
- **Recognition *and* detection**: produce word/line crops for recognition, or
  full document-like pages with per-word polygons for detection - both in the
  formats docTR's training references expect.
- **Fast & memory-bounded**: font objects and decoded backgrounds are cached, with
  a configurable cache size.

## Quickstart

One call - the words, fonts and backgrounds it needs are downloaded and cached
automatically. English by default:

```python
from generator import generate_dataset

generate_dataset("output_dataset", num_images=1000)
```

Multilingual generation only needs a list of ISO 639-1 codes; each code selects
its words *and* its script, so matching fonts are pulled in and complex scripts
are shaped correctly:

```python
generate_dataset("output_dataset", num_images=10000, languages=["en", "de", "ru", "el", "ar"])
```

Detection pages instead of recognition crops:

```python
generate_dataset("pages", num_images=5000, task="detection", languages=["en", "de"])
```

...or straight from the command line (installs as `doctr-synth`, also runnable as `python -m generator`):

```bash
doctr-synth output_dataset -n 10000 -l en de ru
doctr-synth pages -t detection -l en de --layout newspaper
```

> The first run downloads word lists, fonts and backgrounds from public mirrors
> and caches them; later runs are offline. To stay fully offline from the start,
> supply your own `wordlist_path` and `font_dir` (see below).

`generate_dataset(...)` is a thin wrapper over the `GenerationConfig` +
`SyntheticDatasetGenerator` pair you can still use directly for full control. The
config is organised into focused sub-configs (`core`, `resources`, `corpus`,
`balance`, `coverage`, `recognition`, `realism`, `detection`) - build it nested,
or use `GenerationConfig.flat(...)` to pass flat keyword names. Any of those
keywords can also be passed straight to `generate_dataset(...)`:

```python
from generator import GenerationConfig, CoreConfig, DetectionConfig, generate_dataset

# nested - group related options together
cfg = GenerationConfig(
    core=CoreConfig(num_images=10_000, task="detection", languages=["en", "de"]),
    detection=DetectionConfig(layout="newspaper"),
)

# ...or flat, routed into the sub-configs for you
cfg = GenerationConfig.flat(num_images=10_000, task="detection", det_layout="newspaper")

# ...or just the one-liner
generate_dataset("output_dataset", num_images=10_000, languages=["en", "de"], output_jpeg=True, num_workers=8)
```

### Bring your own resources

Passing your own `wordlist_path`, `font_dir` and/or `bg_image_dir` takes
precedence over the automatic downloads (e.g. point them at the bundled
`resources/` or the `fonts_v1` release):

```python
generate_dataset(
    "output_dataset",
    num_images=1000,
    wordlist_path="resources/corpus/latin_ext_balanced_words.txt",
    font_dir="resources/font",
    bg_image_dir="resources/background_images",
)
```

## Automatic resources

On the first run anything you don't provide is fetched from public mirrors and
cached (later runs are offline). Providing your own `font_dir` / `wordlist_path`
/ `bg_image_dir` takes precedence and skips the matching download.

- **Fonts** - when no local font covers every character of a word, a matching
  open-source [Noto](https://fonts.google.com/noto) font is downloaded, verified
  and cached, so words are never silently skipped (the usual cause of biased,
  latin-only datasets). Disable with `auto_download_fonts=False`.
- **Words** - with no `wordlist_path`, real frequency-ranked words come from
  [FrequencyWords](https://github.com/hermitdave/FrequencyWords) (~85 languages)
  and are cleaned (script filter, length bounds, punctuation removal). Two realism
  helpers are on by default: `casing_variant_prob` (0.3) adds Title/UPPERCASE
  variants, and `numeric_token_ratio` (0.05) mixes in numbers, dates and prices.
- **Backgrounds** - with no `bg_image_dir`, a curated background set is downloaded
  instead of blank pages. Disable with `auto_download_backgrounds=False`, or pass
  a `background_manifest_url` for your own collection.
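
The two realism helpers can be pictured with a small standalone sketch. It is not the package's implementation: the function name `add_realism_tokens`, the token formats and the choice to replace a word with its variant (rather than add a copy) are assumptions made for illustration; only the two parameter names and their defaults come from the list above.

```python
import random


def add_realism_tokens(words, casing_variant_prob=0.3, numeric_token_ratio=0.05, seed=0):
    """Return a new word list with casing variants and numeric tokens mixed in."""
    rng = random.Random(seed)
    out = []
    for word in words:
        if rng.random() < casing_variant_prob:
            # Swap the word for a Title-case or UPPERCASE variant.
            word = rng.choice([word.title(), word.upper()])
        out.append(word)
    # Append numeric tokens, sized relative to the number of input words.
    for _ in range(round(len(words) * numeric_token_ratio)):
        kind = rng.choice(["number", "date", "price"])
        if kind == "number":
            out.append(str(rng.randint(0, 99999)))
        elif kind == "date":
            out.append(f"{rng.randint(1, 28):02d}.{rng.randint(1, 12):02d}.{rng.randint(1990, 2030)}")
        else:
            out.append(f"{rng.randint(0, 999)}.{rng.randint(0, 99):02d}")
    return out
```

With the defaults, roughly a third of the words change case and about one numeric token is added per twenty words.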

## Dataset balancing

For multilingual runs the language mix is explicit and controllable instead of
being dominated by whichever language has the most words:

```python
config = GenerationConfig.flat(
    output_dir="output_dataset",
    num_images=30000,
    languages=["en", "de", "ru"],
    language_balance="balanced",  # "balanced" (default) or "proportional"
    # language_weights={"en": 0.6, "de": 0.3, "ru": 0.1},  # or set explicit weights
    min_char_coverage=20,  # ensure every character appears >= N times (0 = off)
)
```
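
To make the difference between the modes concrete, here is a minimal standalone sketch of how an image budget could be divided between languages. The helper `allocate_images` and its remainder handling are hypothetical and not taken from the package; only the mode names and the idea of explicit weights come from the options above.

```python
def allocate_images(num_images, corpus_sizes, mode="balanced", weights=None):
    """Split `num_images` across languages and return {language: count}."""
    langs = list(corpus_sizes)
    if weights is not None:
        # Explicit weights win over the mode.
        shares = {lang: weights.get(lang, 0.0) for lang in langs}
    elif mode == "balanced":
        shares = {lang: 1.0 for lang in langs}
    elif mode == "proportional":
        shares = {lang: float(corpus_sizes[lang]) for lang in langs}
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("shares must sum to a positive value")
    counts = {lang: int(num_images * shares[lang] / total) for lang in langs}
    # Hand out the rounding remainder one image at a time, largest share first.
    remainder = num_images - sum(counts.values())
    for lang in sorted(langs, key=lambda name: shares[name], reverse=True)[:remainder]:
        counts[lang] += 1
    return counts
```

In this sketch, `"balanced"` ignores corpus size entirely, while `"proportional"` lets a language with a larger word list take a larger share.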

The split is *stratified*: train and val share the same language mix and exact
words do not leak from train into val. A balance report is printed before
generation (per-language train/val counts, train/val overlap, distinct/rare
characters, word-length statistics); silence it with
`print_balance_report=False`.
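
As an illustration of what a stratified, leak-free split means, the following sketch splits unique words per language and then removes from train any word that ended up in val. The function `stratified_split` and its `val_ratio` parameter are hypothetical names; the package's own split may be implemented differently.

```python
import random


def stratified_split(words_by_lang, val_ratio=0.1, seed=0):
    """Split unique words per language so train and val share the language mix."""
    rng = random.Random(seed)
    train, val = [], []
    for lang, words in words_by_lang.items():
        unique = sorted(set(words))  # de-duplicate before splitting
        rng.shuffle(unique)
        n_val = max(1, round(len(unique) * val_ratio)) if len(unique) > 1 else 0
        val += [(lang, word) for word in unique[:n_val]]
        train += [(lang, word) for word in unique[n_val:]]
    # A word shared by two languages must not sit in both splits.
    val_words = {word for _, word in val}
    train = [(lang, word) for lang, word in train if word not in val_words]
    return train, val
```

Because the ratio is applied per language, a language that makes up a third of train also makes up about a third of val.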

## Supported languages

Pass ISO 639-1 codes in `languages=[...]`. Each code resolves three things
automatically: real words from a public frequency list (when one exists), a
script-matching open-source font (downloaded on demand), and the docTR vocab
used for coverage and for the recognition vocab restriction. Complex scripts are
shaped correctly - Arabic, Hebrew and Urdu run right-to-left, and Indic, Thai and
Myanmar clusters (consonant conjuncts, matras, medials, vowel signs and viramas)
are built as valid grapheme clusters, including in synthesised coverage tokens.

| Script | Languages (ISO 639-1 code) |
| --- | --- |
| Latin | Afrikaans (`af`), Azerbaijani (`az`), Catalan (`ca`), Czech (`cs`), Danish (`da`), Dutch (`nl`), English (`en`), Estonian (`et`), Basque (`eu`), Finnish (`fi`), French (`fr`), German (`de`), Hungarian (`hu`), Icelandic (`is`), Indonesian (`id`), Irish (`ga`), Italian (`it`), Latvian (`lv`), Lithuanian (`lt`), Maltese (`mt`), Norwegian (`no`/`nb`), Polish (`pl`), Portuguese (`pt`), Romanian (`ro`), Slovak (`sk`), Slovene (`sl`), Spanish (`es`), Albanian (`sq`), Swedish (`sv`), Croatian (`hr`), Turkish (`tr`), Vietnamese (`vi`) |
| Cyrillic | Belarusian (`be`), Bulgarian (`bg`), Macedonian (`mk`), Russian (`ru`), Ukrainian (`uk`) |
| Greek | Greek (`el`) |
| Perso-Arabic | Arabic (`ar`), Persian (`fa`), Urdu (`ur`) |
| Hebrew | Hebrew (`he`) |
| Armenian | Armenian (`hy`) |
| Georgian | Georgian (`ka`) |
| Devanagari | Hindi (`hi`), Marathi (`mr`) |
| Bengali | Bengali (`bn`) |
| Gujarati | Gujarati (`gu`) |
| Tamil | Tamil (`ta`) |
| Telugu | Telugu (`te`) |
| Kannada | Kannada (`kn`) |
| Malayalam | Malayalam (`ml`) |
| Oriya | Odia (`or`) |
| Sinhala | Sinhala (`si`) |
| Thai | Thai (`th`) |
| Myanmar | Burmese (`my`) |
| CJK | Japanese (`ja`), Korean (`ko`) - corpus-driven only (no fixed small vocab, so vocab-coverage synthesis is skipped) |

A few notes:

- A handful of languages have a vocab and a font but no public frequency list
  (e.g. Burmese `my`, Odia `or`). They render correctly and their full character
  set is still guaranteed through synthesised coverage tokens - you just won't
  get a real-word corpus unless you supply your own via `wordlist_path`.
- For **recognition**, any of the 214 keys in `VOCABS` (e.g. `"german"`,
  `"arabic"`, `"hindi"`) can be used as `target_vocab` / the `vocab` argument,
  and several may be combined for a multilingual model
  (`["german", "urdu", "odia"]`); every generated label is then restricted to
  that exact character set so a docTR model trained on the matching vocab never
  sees an un-encodable character (see the docTR training section).

## Vocabulary coverage (recognition)

A recognition model is trained against a fixed character set (docTR's `VOCABS`).
Real frequency corpora rarely contain *every* character of that set - rare
accented capitals (`ẞ`), currency signs, some punctuation - so a model trained
only on downloaded words never sees them. With `ensure_vocab_coverage=True`
(the default), each language is mapped to its docTR vocab and extra word-like
tokens are synthesised so **every renderable vocab character appears in both the
train and val splits**:

```python
config = GenerationConfig.flat(
    output_dir="dataset",
    num_images=50000,
    languages=["de"],  # mapped to the "german" vocab automatically
    ensure_vocab_coverage=True,  # default
    vocab_coverage_min_count=3,  # each vocab char appears in >= N train samples
)
```

- `target_vocab` overrides the per-language mapping - pass a `VOCABS` key
  (e.g. `"german"`) or a literal string of characters to cover. It also enables
  coverage when you supply your own `wordlist_path`.
- Coverage is enforced **after** the train/val split, so a rare character can
  never land in only one split. This makes `num_images` a *floor*: a small,
  bounded number of coverage samples (proportional to the vocab size, not the
  dataset) is appended on top.
- Languages with no fixed small vocab (CJK) are skipped automatically, and very
  large scripts (thousands of CJK ideographs / Hangul syllables) are left to the
  real corpus rather than synthesised.
- Every synthesised token stays within a single script (a Hebrew character is
  only ever placed in a Hebrew token, etc.), so each renders with one font. When
  several languages are generated together, coverage is computed over the union
  of their vocabs but tokens are never mixed across scripts.
- Coverage prefers repeating **real corpus words** that contain a rare
  character, so diacritic combinations are linguistically attested. Synthesis
  is only a fallback for characters absent from the corpus (rare punctuation,
  currency, capitals in a lower-cased corpus, or marks like Hebrew niqqud that
  real text omits) - and even then a combining mark is inserted into a real
  same-script word after a base letter, never rendered alone on a dotted circle.
- A character that **no available font can render** (e.g. `฿` inside a
  Latin-script vocab) is the one case that cannot be covered - that is a font
  limitation, reported in the log, not a logic gap.
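
If you want to check coverage on your own labels, a small standalone counter is enough. This is a sketch under assumptions, not part of the package: `coverage_gaps` is a hypothetical helper, and it counts a character once per label to match the "appears in >= N samples" wording above.

```python
from collections import Counter


def coverage_gaps(labels, vocab, min_count=1):
    """Return {char: count} for vocab characters seen in fewer than `min_count` labels."""
    seen = Counter()
    for label in labels:
        # Count each character once per label, i.e. per sample.
        seen.update(set(label))
    return {ch: seen[ch] for ch in vocab if seen[ch] < min_count}
```

Run it separately on the train and val labels: an empty result for both means every vocab character is represented in both splits.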

## Detection datasets

Set `task="detection"` to generate document-like **pages** with a 4-point
polygon for every word, ready for
[docTR detection training](https://github.com/mindee/doctr/tree/main/references/detection):

```python
from generator import GenerationConfig, SyntheticDatasetGenerator

config = GenerationConfig.flat(
    task="detection",
    output_dir="detection_dataset",
    num_images=5000,  # = number of pages
    languages=["en", "de"],  # words + fonts resolved automatically
    bg_image_dir="resources/background_images",
    output_jpeg=True,
)
SyntheticDatasetGenerator(config).generate_dataset()
```

Each split is written as `images/` plus a `labels.json` in the exact docTR
format (absolute pixel coordinates):

```json
{
  "00000.jpg": {
    "img_dimensions": [1462, 1056],
    "img_hash": "<sha256 of the image>",
    "polygons": [[[x1, y1], [x2, y2], [x3, y3], [x4, y4]], ...]
  }
}
```
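
A quick integrity check over a generated split might look like the sketch below. It assumes that `img_hash` is the hexadecimal SHA-256 digest of the raw image file bytes, which the format above suggests but does not spell out; `verify_split` is a hypothetical helper name.

```python
import hashlib
import json
from pathlib import Path


def verify_split(split_dir):
    """Yield (image_name, ok), where ok is True if the stored hash matches the file."""
    split_dir = Path(split_dir)
    labels = json.loads((split_dir / "labels.json").read_text(encoding="utf-8"))
    for name, entry in labels.items():
        # Assumption: the hash covers the encoded file as written to disk.
        data = (split_dir / "images" / name).read_bytes()
        yield name, hashlib.sha256(data).hexdigest() == entry["img_hash"]
```

For example, `[name for name, ok in verify_split("detection_dataset/train") if not ok]` would list images whose content no longer matches the labels; the `train` subfolder name here is only illustrative.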

The detection path reuses the same fonts, ink styling, contrast, backgrounds and
degradations as the recognition path. Each page is filled top to bottom until the
vertical space runs out, so the word count varies naturally with font size, and
words are recycled as needed so a page always fills regardless of how many
candidate words it is given.

### Real-world layouts

To better mimic real documents, the layout is chosen per page via `det_layout`:

- `"paragraph"` - multi-block running text with headings and indents.
- `"newspaper"` - a full-width masthead with a double rule and a dateline, then
  several narrow columns separated by vertical rules, each with article
  headlines, bylines and small, tightly-leaded body text (~500-1100 words on an
  A4-ish page). Tune density with `det_newspaper_columns_range` (default
  `(3, 6)`, clamped to the page width), `det_newspaper_font_size_range` (default
  `(9, 15)`) and `det_newspaper_line_spacing_range` (default `(1.05, 1.2)`).
- `"form"` - a title with a header rule, then `Label:` / value rows with either
  underlines or boxed fields, shaded section-header bars, and occasional
  checkbox rows.
- `"id_card"` - a card with a coloured issuing-authority header band (emblem +
  light title text), a photo placeholder, labelled field rows, a signature line
  and MRZ-style lines. Mirrors fully for right-to-left scripts.
- `"mixed"` (default) - a weighted blend of the above; tune via
  `det_layout_weights` (e.g. `{"paragraph": 0.4, "newspaper": 0.25, "form": 0.2,
  "id_card": 0.15}`).

Forms and ID cards always render on clean generated paper. All layouts emit the
same per-word polygons, and the optional small global page rotation
(`det_rotation_*`) rotates the polygons with the page for use with docTR's
`use_polygons=True`. Other layout knobs: `det_page_*_range`, `det_font_size_range`,
`det_max_blocks`, `det_margin_ratio`, `det_heading_prob`.
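
Rotating the polygons together with the page boils down to applying the same 2D rotation to every corner. The sketch below shows the idea only; the helper `rotate_polygons`, the rotation about the page centre and the sign convention of the angle are assumptions and may differ from what the generator does internally.

```python
import math


def rotate_polygons(polygons, angle_deg, page_width, page_height):
    """Rotate 4-point polygons about the page centre by `angle_deg`."""
    cx, cy = page_width / 2.0, page_height / 2.0
    cos_a = math.cos(math.radians(angle_deg))
    sin_a = math.sin(math.radians(angle_deg))
    rotated = []
    for poly in polygons:
        rotated.append([
            # Standard rotation matrix applied to the offset from the centre.
            [cx + (x - cx) * cos_a - (y - cy) * sin_a,
             cy + (x - cx) * sin_a + (y - cy) * cos_a]
            for x, y in poly
        ])
    return rotated
```

After a rotation the corners are no longer axis-aligned, which is why rotated pages are meant to be used with polygon targets rather than straight boxes.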

> **Backgrounds for detection:** only the words *you* place are labelled, so any
> text already printed in a background photo becomes an unlabelled false
> negative. `det_plain_background_prob` (0.4) mixes in clean generated paper;
> set it to `1.0` for all-paper pages, or point `bg_image_dir` at **text-free**
> textures (plain paper, surfaces, fabrics) only.

Non-Latin scripts work out of the box: words and fonts are resolved per language,
complex scripts are shaped correctly (Arabic joining, Indic conjuncts), and
right-to-left languages (Arabic, Hebrew, ...) are laid out right-to-left so pages
read naturally. For example `languages=["ar"]`, `["he"]`, `["zh"]` or `["hi"]`.

## Plug into docTR training (on-the-fly, in-RAM)

You can skip writing a dataset to disk entirely and feed freshly synthesised
samples straight into docTR's training scripts. `generator/doctr_dataset.py`
provides PyTorch `Dataset` wrappers that generate one sample per
`__getitem__`, matching docTR's dataset contract - `(image_tensor, target)` per
sample plus a static `collate_fn` - so they drop into the existing `DataLoader`
in
[`references/detection/train.py`](https://github.com/mindee/doctr/blob/main/references/detection/train.py)
and
[`references/recognition/train.py`](https://github.com/mindee/doctr/blob/main/references/recognition/train.py).

Targets are identical to docTR's own datasets, so the model transforms and loss
treat them the same: recognition yields the label string; detection yields
`{CLASS_NAME: geoms}` with absolute-pixel polygons of shape `(N, 4, 2)` when
`use_polygons=True`, otherwise straight boxes of shape `(N, 4)` as
`[xmin, ymin, xmax, ymax]`.
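
The relationship between the two geometry formats can be sketched in plain Python: a straight box is the axis-aligned bounding box of a polygon's four corners. `polygons_to_boxes` is a hypothetical helper shown for illustration, not a function of the package or of docTR.

```python
def polygons_to_boxes(polygons):
    """Convert (N, 4, 2) polygons to (N, 4) boxes as [xmin, ymin, xmax, ymax]."""
    boxes = []
    for poly in polygons:
        xs = [x for x, _ in poly]
        ys = [y for _, y in poly]
        boxes.append([min(xs), min(ys), max(xs), max(ys)])
    return boxes
```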

**Detection** - in `references/detection/train.py`, replace the
`DetectionDataset(...)` construction (keep the `DataLoader` lines):

```python
from generator.components import GenerationConfig
from generator.doctr_dataset import build_detection_datasets, synth_worker_init_fn

cfg = GenerationConfig.flat(
    task="detection",
    languages=["en", "de"],
    num_images=50_000,  # POOL size (word variety + vocab coverage)
    auto_download_backgrounds=True,
)
train_set, val_set = build_detection_datasets(
    cfg,
    train_samples=20_000,  # virtual epoch length (len(dataset))
    val_samples=2_000,
    use_polygons=args.rotation,  # straight boxes unless --rotation
    sample_transforms=batch_transforms,  # the script's existing transforms
)
```

**Recognition** - in `references/recognition/train.py`, replace the
`RecognitionDataset(...)` construction:

```python
from generator.components import GenerationConfig
from generator.doctr_dataset import build_recognition_datasets, synth_worker_init_fn

cfg = GenerationConfig.flat(task="recognition", num_images=100_000)
train_set, val_set = build_recognition_datasets(
    cfg,
    train_samples=50_000,
    val_samples=5_000,
    vocab=args.vocab,  # e.g. ["german", "urdu", "odia"] - the model's vocab
    img_transforms=img_transforms,  # the script's existing resize/aug
)
```

Pass `vocab` the **same vocab you train the model on** - a `VOCABS` key, a
literal charset, or a list of keys whose union is the model's vocab (e.g.
`["german", "urdu", "odia"]`). Every generated label is then guaranteed to
contain only characters in that vocab, so docTR's label encoder never hits an
out-of-vocab character (which would otherwise crash training on the first batch).
Corpus words outside the vocab are dropped, character coverage is guaranteed
*within* the vocab, and the corpus languages are derived from the vocab keys
automatically (`german` -> `de`); a key with no corpus (e.g. `odia`) still has
its characters covered via synthesis. Set `config.languages` explicitly to pull
different corpora. The same restriction applies to the offline generator via
`GenerationConfig.flat(target_vocab=[...])`.
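
The vocab restriction described above can be pictured with a small stand-alone sketch. It assumes a toy `TOY_VOCABS` mapping and two hypothetical helpers (`resolve_vocab`, `filter_words`); the package's real implementation and docTR's `VOCABS` contents are not shown here and may differ.

```python
# Toy stand-in for docTR's VOCABS mapping; these entries are made up for the example.
TOY_VOCABS = {
    "digits": "0123456789",
    "lower": "abcdefghijklmnopqrstuvwxyz",
}


def resolve_vocab(vocab) -> set[str]:
    """Accept a key, a literal charset, or a list of keys; return the character set."""
    if isinstance(vocab, str):
        # A known key resolves to its charset; any other string is a literal charset.
        return set(TOY_VOCABS.get(vocab, vocab))
    chars: set[str] = set()
    for key in vocab:
        chars |= set(TOY_VOCABS[key])  # union over all listed keys
    return chars


def filter_words(words, vocab):
    """Drop every word that contains a character outside the resolved vocab."""
    allowed = resolve_vocab(vocab)
    return [word for word in words if set(word) <= allowed]


kept = filter_words(["abc", "a1", "Hello", "42", "naïve"], ["digits", "lower"])
print(kept)  # ['abc', 'a1', '42']
```

`Hello` and `naïve` are dropped because `H` and `ï` are outside the union of the two toy charsets, which is the property that keeps a label encoder from meeting an unknown character.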

The `DataLoader` lines stay as they are - just keep
`collate_fn=train_set.collate_fn` and add `worker_init_fn=synth_worker_init_fn`
so every worker gets an independent RNG stream:

```python
train_loader = DataLoader(
    train_set,
    batch_size=args.batch_size,
    shuffle=True,
    drop_last=True,
    num_workers=args.workers,
    pin_memory=torch.cuda.is_available(),
    collate_fn=train_set.collate_fn,
    worker_init_fn=synth_worker_init_fn,
)
```

Notes:

- **Pool size vs epoch length.** `config.num_images` sizes the word *pool*
  (variety and per-split vocab coverage); `train_samples` / `val_samples` set
  the virtual epoch length (`len(dataset)`). Samples are generated fresh, so the
  epoch length is just how many iterations you want per epoch.
- **Seeding.** The train set draws a fresh random sample on every access (new
  data every epoch - the whole point of on-the-fly); the val set is a
  reproducible fixed virtual set (seeded per index) so metrics stay comparable.
- **Coverage carries over.** The recognition pools come from the same balancing
  and per-split character-coverage pipeline as the offline generator, so
  sampling from them covers the target vocab.
- **One-time setup.** Corpora, fonts and backgrounds are downloaded/resolved
  once when the datasets are built (in the parent process), not per worker.
- Requires PyTorch in your training environment (`pip install python-doctr`).
  Importing the rest of this package never requires torch. For lower-level
  control you can use `SyntheticDetectionDataset` / `SyntheticRecognitionDataset`
  directly instead of the `build_*` factories.
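
The pool-size and seeding notes above can be sketched with a toy dataset in plain Python. This is not the package's implementation: `ToySyntheticDataset` and its seed formula are assumptions chosen only to show a small pool, a much longer virtual epoch, fresh draws for training and per-index seeding for validation.

```python
import random


class ToySyntheticDataset:
    """Hypothetical stand-in for an on-the-fly dataset: picks one pool word per access."""

    def __init__(self, pool, num_samples, seed=None):
        self.pool = list(pool)
        self.num_samples = num_samples  # virtual epoch length, independent of pool size
        self.seed = seed  # None -> fresh draw on every access (train behaviour)

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        if not 0 <= idx < self.num_samples:
            raise IndexError(idx)
        if self.seed is None:
            rng = random.Random()  # seeded from OS entropy: new sample each time
        else:
            rng = random.Random(self.seed * 1_000_003 + idx)  # fixed per index
        return rng.choice(self.pool)


pool = ["invoice", "total", "date", "amount", "signature"]
train_set = ToySyntheticDataset(pool, num_samples=20_000)
val_set = ToySyntheticDataset(pool, num_samples=2_000, seed=42)

# Validation is reproducible: the same index always yields the same sample.
assert val_set[7] == val_set[7]
```

Because the validation generator is rebuilt from the index on each access, the result does not depend on access order or on which worker serves the index.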

## Realism

Rendered crops are meant to match real captured documents rather than clean
synthetic glyphs. The pipeline applies, all configurable:

- Supersampled rendering with high-quality downsampling for photographic
  anti-aliasing (`supersample`).
- Background-aware ink: dark-on-light **and** light-on-dark text, a controllable
  (often deliberately low) contrast range, neutral or coloured ink, variable
  opacity, faux-bold and outlines.
- Glyph-space augmentations before compositing (rotation, perspective, ink
  erosion) and image-space degradations after (Gaussian sensor noise, JPEG
  compression artifacts, blur, brightness/contrast jitter) - matching how a real
  capture degrades the whole frame.
- Optional JPEG output (`output_jpeg=True`) to match real document captures.
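
As a rough illustration of the image-space stage, the sketch below applies brightness/contrast jitter and additive Gaussian noise to a whole frame with NumPy. The function name `degrade`, its parameter names and its default ranges are assumptions for this example, not the package's options; blur and JPEG artifacts are left out to keep it short.

```python
import numpy as np


def degrade(
    image: np.ndarray,
    rng: np.random.Generator,
    noise_sigma: float = 8.0,
    contrast: tuple = (0.7, 1.1),
    brightness: tuple = (-20.0, 20.0),
) -> np.ndarray:
    """Hypothetical whole-frame degradation: contrast/brightness jitter plus sensor noise."""
    img = image.astype(np.float32)
    gain = rng.uniform(*contrast)  # contrast factor, applied around mid-grey
    shift = rng.uniform(*brightness)  # global brightness offset
    img = (img - 127.5) * gain + 127.5 + shift
    img += rng.normal(0.0, noise_sigma, size=img.shape)  # additive Gaussian noise
    return np.clip(img, 0, 255).astype(np.uint8)


rng = np.random.default_rng(0)
page = np.full((32, 128, 3), 230, dtype=np.uint8)  # flat light-grey "paper"
noisy = degrade(page, rng)
print(noisy.shape, noisy.dtype)
```

Applying these steps after compositing means text and background degrade together, as they would in a real capture.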

## Performance & memory

Font objects and decoded background images are cached, giving a large throughput
improvement over re-loading them per sample. Memory stays bounded and tunable:

- `bg_cache_size` (16): number of decoded backgrounds held in memory per worker.
  Lower it on memory-constrained machines or with many workers; raise it for more
  background variety. `bg_max_dimension` (2000) downscales very large backgrounds
  on load so the cache stays light regardless of source resolution.
- Caches are per worker process, so peak memory scales roughly with
  `num_workers`.
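
A back-of-the-envelope upper bound for the background cache follows from those two settings. The sketch assumes backgrounds are held as uncompressed 8-bit RGB arrays and that every cached image is square at the `bg_max_dimension` cap; both are assumptions made for the estimate, and real usage should be lower for smaller or non-square images. The helper name is hypothetical.

```python
def bg_cache_upper_bound_bytes(bg_cache_size=16, bg_max_dimension=2000, num_workers=4, channels=3):
    """Worst-case bytes held by background caches (assumes square uint8 images at the cap)."""
    per_image = bg_max_dimension * bg_max_dimension * channels
    per_worker = bg_cache_size * per_image
    return per_worker, per_worker * num_workers


per_worker, total = bg_cache_upper_bound_bytes()
print(f"{per_worker / 1e6:.0f} MB per worker, {total / 1e6:.0f} MB across workers")
# 192 MB per worker, 768 MB across workers
```

Halving `bg_cache_size` halves the bound, while halving `bg_max_dimension` cuts it by a factor of four.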

## Configuration reference

The config is organised into focused sub-configs - `core`, `resources`,
`corpus`, `balance`, `coverage`, `recognition`, `realism` and `detection` -
each a small dataclass you can construct on its own. Build `GenerationConfig`
nested (`GenerationConfig(detection=DetectionConfig(layout="form"))`), or pass
flat keyword names via `GenerationConfig.flat(...)` / `generate_dataset(...)`,
which route each keyword into the right sub-config.
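
The routing idea can be sketched with toy dataclasses. Everything here (`ToyCore`, `ToyDetection`, `ToyConfig`) is hypothetical and much simpler than the real classes: for instance, the toy uses the same field name in both styles, whereas the real nested and flat names may differ.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ToyCore:
    num_images: int = 1000
    task: str = "recognition"


@dataclass
class ToyDetection:
    det_layout: str = "mixed"


@dataclass
class ToyConfig:
    core: ToyCore = field(default_factory=ToyCore)
    detection: ToyDetection = field(default_factory=ToyDetection)

    @classmethod
    def flat(cls, **kwargs):
        """Route each flat keyword into the sub-config that declares a field of that name."""
        cfg = cls()
        for key, value in kwargs.items():
            for sub in fields(cfg):
                sub_cfg = getattr(cfg, sub.name)
                if key in {f.name for f in fields(sub_cfg)}:
                    setattr(sub_cfg, key, value)
                    break
            else:
                raise TypeError(f"unknown option: {key!r}")
        return cfg


nested = ToyConfig(detection=ToyDetection(det_layout="form"))
flat = ToyConfig.flat(det_layout="form", num_images=50)
assert nested.detection == flat.detection
```

Rejecting unknown keywords, as the toy does, is one way to surface typos early; whether the real `flat(...)` behaves that way is not stated here.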

Most runs need only a handful of options - the ones you are most likely to set
(as flat keywords):

| Option | Sub-config | Default | What it does |
| --- | --- | --- | --- |
| `output_dir` | core | - | where the dataset is written (`train/`, `val/`) |
| `num_images` | core | `1000` | total samples (split by `val_percent`) |
| `task` | core | `"recognition"` | `"recognition"` crops or `"detection"` pages |
| `languages` | core | `["en"]` | ISO 639-1 codes; resolves words, fonts and shaping |
| `val_percent` | core | `0.2` | validation fraction |
| `num_workers` | core | `4` | parallel worker processes |
| `output_jpeg` | core | `False` | write JPEG instead of PNG |
| `target_vocab` | coverage | `None` | recognition: restrict labels to a `VOCABS` key / list (the `vocab=` arg) |
| `det_layout` | detection | `"mixed"` | detection: `mixed`/`paragraph`/`newspaper`/`form`/`id_card` |
| `language_balance` | balance | `"balanced"` | `"balanced"` or `"proportional"` allocation across languages |
| `min_char_coverage` | balance | `0` | ensure every character appears >= N times (0 = off) |
| `wordlist_path` / `font_dir` / `bg_image_dir` | resources | `None` | bring your own resources (skips the matching download) |

For the complete set of options (realism, augmentation and detection-layout
knobs), see the sub-config dataclasses in `generator/components/config.py`.
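
To illustrate what a `min_char_coverage` threshold means, the sketch below counts characters over a set of labels and reports those that fall short. `undercovered_chars` is a hypothetical checker written for this example; it only inspects labels and does not reproduce how the generator enforces coverage.

```python
from collections import Counter


def undercovered_chars(labels, vocab, min_count):
    """Return {char: count} for vocab characters seen fewer than min_count times."""
    counts = Counter(ch for label in labels for ch in label)
    return {ch: counts[ch] for ch in vocab if counts[ch] < min_count}


labels = ["abba", "cab", "b"]
missing = undercovered_chars(labels, vocab="abcd", min_count=2)
print(missing)  # {'c': 1, 'd': 0}
```

An empty result means every character of the vocab appears at least `min_count` times in the checked labels.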

## Resources

- **fonts_v1**: A collection of fonts used for text rendering can be downloaded from [Fonts_v1](https://github.com/felixdittrich92/docTR-Synth-Generator/releases/download/v0.0.1/fonts_v1.zip).
- **background_images_v1**: A collection of background images that rendered text is composited onto can be downloaded from [Background_Images_v1](https://github.com/felixdittrich92/docTR-Synth-Generator/releases/download/v0.0.1/background_images_v1.zip).

## Citation

If you wish to cite this project, feel free to use this [BibTeX](http://www.bibtex.org/) reference:

```bibtex
@misc{docTR-Synth-Generator,
    title={docTR-Synth-Generator: A tool to generate synthetic OCR text datasets - made for docTR},
    author={{Dittrich, Felix}},
    year={2026},
    publisher = {GitHub},
    howpublished = {\url{https://github.com/felixdittrich92/docTR-Synth-Generator}}
}
```

The automatic word lists are derived from the
[FrequencyWords](https://github.com/hermitdave/FrequencyWords) project
(OpenSubtitles-based) and fonts from [Google Fonts / Noto](https://fonts.google.com/noto);
please respect their respective licenses when redistributing generated datasets.

## Development & tests

The test suite is fully offline - it builds a tiny in-memory font with
`fontTools` and monkeypatches the network downloads, so no fonts or corpora are
fetched while testing. Run it with:

```bash
make test      # pytest + coverage
make quality   # ruff + mypy
make style     # auto-format and fix
```

## Contributing

Contributions are what make the open-source community such an amazing place to learn, inspire, and create.

Any contributions you make are **greatly appreciated**.

1. Fork the Project
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
3. Add your Changes
4. Run the tests and quality checks (`make test`, `make style` and `make quality`)
5. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
6. Push to the Branch (`git push origin feature/AmazingFeature`)
7. Open a Pull Request

## License

Distributed under the Apache 2.0 License. See [`LICENSE`](https://github.com/felixdittrich92/docTR-Synth-Generator?tab=Apache-2.0-1-ov-file#readme) for more information.
