Metadata-Version: 2.4
Name: doctr-synth-generator
Version: 0.2.0
Summary: A synthetic data generator for training OCR models
Author-email: Felix Dittrich <felixdittrich92@gmail.com>
Maintainer: Felix Dittrich
License:                                  Apache License
                                   Version 2.0, January 2004
                                http://www.apache.org/licenses/
        
           TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
        
           1. Definitions.
        
              "License" shall mean the terms and conditions for use, reproduction,
              and distribution as defined by Sections 1 through 9 of this document.
        
              "Licensor" shall mean the copyright owner or entity authorized by
              the copyright owner that is granting the License.
        
              "Legal Entity" shall mean the union of the acting entity and all
              other entities that control, are controlled by, or are under common
              control with that entity. For the purposes of this definition,
              "control" means (i) the power, direct or indirect, to cause the
              direction or management of such entity, whether by contract or
              otherwise, or (ii) ownership of fifty percent (50%) or more of the
              outstanding shares, or (iii) beneficial ownership of such entity.
        
              "You" (or "Your") shall mean an individual or Legal Entity
              exercising permissions granted by this License.
        
              "Source" form shall mean the preferred form for making modifications,
              including but not limited to software source code, documentation
              source, and configuration files.
        
              "Object" form shall mean any form resulting from mechanical
              transformation or translation of a Source form, including but
              not limited to compiled object code, generated documentation,
              and conversions to other media types.
        
              "Work" shall mean the work of authorship, whether in Source or
              Object form, made available under the License, as indicated by a
              copyright notice that is included in or attached to the work
              (an example is provided in the Appendix below).
        
              "Derivative Works" shall mean any work, whether in Source or Object
              form, that is based on (or derived from) the Work and for which the
              editorial revisions, annotations, elaborations, or other modifications
              represent, as a whole, an original work of authorship. For the purposes
              of this License, Derivative Works shall not include works that remain
              separable from, or merely link (or bind by name) to the interfaces of,
              the Work and Derivative Works thereof.
        
              "Contribution" shall mean any work of authorship, including
              the original version of the Work and any modifications or additions
              to that Work or Derivative Works thereof, that is intentionally
              submitted to Licensor for inclusion in the Work by the copyright owner
              or by an individual or Legal Entity authorized to submit on behalf of
              the copyright owner. For the purposes of this definition, "submitted"
              means any form of electronic, verbal, or written communication sent
              to the Licensor or its representatives, including but not limited to
              communication on electronic mailing lists, source code control systems,
              and issue tracking systems that are managed by, or on behalf of, the
              Licensor for the purpose of discussing and improving the Work, but
              excluding communication that is conspicuously marked or otherwise
              designated in writing by the copyright owner as "Not a Contribution."
        
              "Contributor" shall mean Licensor and any individual or Legal Entity
              on behalf of whom a Contribution has been received by Licensor and
              subsequently incorporated within the Work.
        
           2. Grant of Copyright License. Subject to the terms and conditions of
              this License, each Contributor hereby grants to You a perpetual,
              worldwide, non-exclusive, no-charge, royalty-free, irrevocable
              copyright license to reproduce, prepare Derivative Works of,
              publicly display, publicly perform, sublicense, and distribute the
              Work and such Derivative Works in Source or Object form.
        
           3. Grant of Patent License. Subject to the terms and conditions of
              this License, each Contributor hereby grants to You a perpetual,
              worldwide, non-exclusive, no-charge, royalty-free, irrevocable
              (except as stated in this section) patent license to make, have made,
              use, offer to sell, sell, import, and otherwise transfer the Work,
              where such license applies only to those patent claims licensable
              by such Contributor that are necessarily infringed by their
              Contribution(s) alone or by combination of their Contribution(s)
              with the Work to which such Contribution(s) was submitted. If You
              institute patent litigation against any entity (including a
              cross-claim or counterclaim in a lawsuit) alleging that the Work
              or a Contribution incorporated within the Work constitutes direct
              or contributory patent infringement, then any patent licenses
              granted to You under this License for that Work shall terminate
              as of the date such litigation is filed.
        
           4. Redistribution. You may reproduce and distribute copies of the
              Work or Derivative Works thereof in any medium, with or without
              modifications, and in Source or Object form, provided that You
              meet the following conditions:
        
              (a) You must give any other recipients of the Work or
                  Derivative Works a copy of this License; and
        
              (b) You must cause any modified files to carry prominent notices
                  stating that You changed the files; and
        
              (c) You must retain, in the Source form of any Derivative Works
                  that You distribute, all copyright, patent, trademark, and
                  attribution notices from the Source form of the Work,
                  excluding those notices that do not pertain to any part of
                  the Derivative Works; and
        
              (d) If the Work includes a "NOTICE" text file as part of its
                  distribution, then any Derivative Works that You distribute must
                  include a readable copy of the attribution notices contained
                  within such NOTICE file, excluding those notices that do not
                  pertain to any part of the Derivative Works, in at least one
                  of the following places: within a NOTICE text file distributed
                  as part of the Derivative Works; within the Source form or
                  documentation, if provided along with the Derivative Works; or,
                  within a display generated by the Derivative Works, if and
                  wherever such third-party notices normally appear. The contents
                  of the NOTICE file are for informational purposes only and
                  do not modify the License. You may add Your own attribution
                  notices within Derivative Works that You distribute, alongside
                  or as an addendum to the NOTICE text from the Work, provided
                  that such additional attribution notices cannot be construed
                  as modifying the License.
        
              You may add Your own copyright statement to Your modifications and
              may provide additional or different license terms and conditions
              for use, reproduction, or distribution of Your modifications, or
              for any such Derivative Works as a whole, provided Your use,
              reproduction, and distribution of the Work otherwise complies with
              the conditions stated in this License.
        
           5. Submission of Contributions. Unless You explicitly state otherwise,
              any Contribution intentionally submitted for inclusion in the Work
              by You to the Licensor shall be under the terms and conditions of
              this License, without any additional terms or conditions.
              Notwithstanding the above, nothing herein shall supersede or modify
              the terms of any separate license agreement you may have executed
              with Licensor regarding such Contributions.
        
           6. Trademarks. This License does not grant permission to use the trade
              names, trademarks, service marks, or product names of the Licensor,
              except as required for reasonable and customary use in describing the
              origin of the Work and reproducing the content of the NOTICE file.
        
           7. Disclaimer of Warranty. Unless required by applicable law or
              agreed to in writing, Licensor provides the Work (and each
              Contributor provides its Contributions) on an "AS IS" BASIS,
              WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
              implied, including, without limitation, any warranties or conditions
              of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
              PARTICULAR PURPOSE. You are solely responsible for determining the
              appropriateness of using or redistributing the Work and assume any
              risks associated with Your exercise of permissions under this License.
        
           8. Limitation of Liability. In no event and under no legal theory,
              whether in tort (including negligence), contract, or otherwise,
              unless required by applicable law (such as deliberate and grossly
              negligent acts) or agreed to in writing, shall any Contributor be
              liable to You for damages, including any direct, indirect, special,
              incidental, or consequential damages of any character arising as a
              result of this License or out of the use or inability to use the
              Work (including but not limited to damages for loss of goodwill,
              work stoppage, computer failure or malfunction, or any and all
              other commercial damages or losses), even if such Contributor
              has been advised of the possibility of such damages.
        
           9. Accepting Warranty or Additional Liability. While redistributing
              the Work or Derivative Works thereof, You may choose to offer,
              and charge a fee for, acceptance of support, warranty, indemnity,
              or other liability obligations and/or rights consistent with this
              License. However, in accepting such obligations, You may act only
              on Your own behalf and on Your sole responsibility, not on behalf
              of any other Contributor, and only if You agree to indemnify,
              defend, and hold each Contributor harmless for any liability
              incurred by, or claims asserted against, such Contributor by reason
              of your accepting any such warranty or additional liability.
        
           END OF TERMS AND CONDITIONS
        
           APPENDIX: How to apply the Apache License to your work.
        
              To apply the Apache License to your work, attach the following
              boilerplate notice, with the fields enclosed by brackets "[]"
              replaced with your own identifying information. (Don't include
              the brackets!)  The text should be enclosed in the appropriate
              comment syntax for the file format. We also recommend that a
              file or class name and description of purpose be included on the
              same "printed page" as the copyright notice for easier
              identification within third-party archives.
        
           Copyright [yyyy] [name of copyright owner]
        
           Licensed under the Apache License, Version 2.0 (the "License");
           you may not use this file except in compliance with the License.
           You may obtain a copy of the License at
        
               http://www.apache.org/licenses/LICENSE-2.0
        
           Unless required by applicable law or agreed to in writing, software
           distributed under the License is distributed on an "AS IS" BASIS,
           WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
           See the License for the specific language governing permissions and
           limitations under the License.
        
Project-URL: repository, https://github.com/felixdittrich92/docTR-Synth-Generator
Project-URL: tracker, https://github.com/felixdittrich92/docTR-Synth-Generator/issues
Project-URL: changelog, https://github.com/felixdittrich92/docTR-Synth-Generator/releases
Keywords: OCR,deep learning,computer vision,text recognition,synthetic data,data augmentation,image processing
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: <4,>=3.10.0
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: fonttools<5.0.0,>=4.50.0
Requires-Dist: numpy<3.0.0,>=2.0.0
Requires-Dist: Pillow>=11.0.0
Provides-Extra: testing
Requires-Dist: pytest>=5.3.2; extra == "testing"
Requires-Dist: coverage[toml]>=4.5.4; extra == "testing"
Provides-Extra: quality
Requires-Dist: ruff>=0.1.5; extra == "quality"
Requires-Dist: mypy>=0.812; extra == "quality"
Requires-Dist: pre-commit>=2.17.0; extra == "quality"
Provides-Extra: dev
Requires-Dist: fonttools<5.0.0,>=4.50.0; extra == "dev"
Requires-Dist: numpy<3.0.0,>=2.0.0; extra == "dev"
Requires-Dist: Pillow>=11.0.0; extra == "dev"
Requires-Dist: pytest>=5.3.2; extra == "dev"
Requires-Dist: coverage[toml]>=4.5.4; extra == "dev"
Requires-Dist: ruff>=0.1.5; extra == "dev"
Requires-Dist: mypy>=0.812; extra == "dev"
Requires-Dist: pre-commit>=2.17.0; extra == "dev"
Dynamic: license-file

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)
![Build Status](https://github.com/felixdittrich92/docTR-Synth-Generator/workflows/builds/badge.svg)
[![codecov](https://codecov.io/gh/felixdittrich92/docTR-Synth-Generator/graph/badge.svg?token=31MDR20JGI)](https://codecov.io/gh/felixdittrich92/docTR-Synth-Generator)
[![CodeFactor](https://www.codefactor.io/repository/github/felixdittrich92/doctr-synth-generator/badge)](https://www.codefactor.io/repository/github/felixdittrich92/doctr-synth-generator)
[![Pypi](https://img.shields.io/badge/pypi-v0.2.0-blue.svg)](https://pypi.org/project/docTR-Synth-Generator/)

# docTR-Synth-Generator
A tool to generate synthetic OCR datasets - made for docTR

## Features

- **Zero-config**: generate a dataset with nothing but an output directory - real
  words, matching fonts *and* background images are downloaded automatically.
- **Multilingual by language code**: `languages=["de", "ru", "ar", ...]` resolves
  both the words *and* the fonts for each script (~85 languages), with correct
  complex-script shaping and right-to-left layout for Arabic/Hebrew.
- **No more dropped words**: any character a local font cannot render triggers an
  on-demand download of a font that can, instead of silently skipping the word.
- **Realistic output**: supersampled anti-aliasing, background-aware ink colour
  and contrast (dark-on-light and light-on-dark), faux-bold/outlines, and
  scanner/camera-style degradations (JPEG artifacts, sensor noise, blur).
- **Controllable balancing**: explicit per-language allocation, a stratified
  train/val split, optional character-coverage guarantees, and a balance report.
- **Recognition *and* detection**: produce word/line crops for recognition, or
  full document-like pages with per-word polygons for detection - both in the
  formats docTR's training references expect.
- **Fast & memory-bounded**: font objects and decoded backgrounds are cached, with
  a configurable cache size.

## Quickstart (zero configuration)

You no longer need to provide a wordlist or a font directory. With nothing but an
output directory and a count, the generator downloads real words for the
requested language(s) and automatically fetches matching open-source fonts:

```python
from generator import GenerationConfig, SyntheticDatasetGenerator

config = GenerationConfig(output_dir="output_dataset", num_images=1000)  # English by default
SyntheticDatasetGenerator(config).generate_dataset()
```

Multilingual is a one-liner - a language code selects both its words *and* its
script, so the correct fonts are pulled in for you:

```python
config = GenerationConfig(
    output_dir="output_dataset",
    num_images=10000,
    languages=["en", "de", "ru", "el", "ar"],  # words + fonts resolved automatically
    bg_image_dir="resources/background_images",  # optional; a curated set is downloaded otherwise
)
SyntheticDatasetGenerator(config).generate_dataset()
```

> The first run downloads word lists and fonts from public mirrors and caches
> them (`corpus_cache_dir` / `font_cache_dir`). Subsequent runs are offline. To
> run fully offline from the start, supply your own `wordlist_path` and
> `font_dir`.

## Bring your own resources (classic usage)

Supplying a `wordlist_path` and/or `font_dir` still works and takes precedence
over the automatic downloads:

```python
config = GenerationConfig(
    wordlist_path="resources/corpus/latin_ext_balanced_words.txt",
    font_dir="resources/font",  # e.g. the extracted fonts_v1 release
    bg_image_dir="resources/background_images",  # bundled with the repo
    output_dir="output_dataset",
    num_images=1000,
    val_percent=0.2,
    num_workers=6,
    # If a word contains characters none of your local fonts cover, download a
    # matching font instead of dropping the word (default: True):
    auto_download_fonts=True,
)
SyntheticDatasetGenerator(config).generate_dataset()
```

## Automatic fonts

When no local font covers every character of a word, a matching open-source font
(from the [Noto](https://fonts.google.com/noto) family, which spans the whole
Unicode range) is downloaded, verified for coverage and cached. This prevents
words from being silently skipped - the main cause of biased, latin-only
datasets. Disable with `auto_download_fonts=False`.
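
The fallback decision can be pictured as a coverage check over the code points each font supports. The following is a minimal sketch, not this package's implementation: `pick_font` and the coverage tables are hypothetical, and each font is assumed to be represented by the set of code points it maps (with fontTools such a set could be built from `TTFont(path).getBestCmap()`).

```python
def pick_font(word, fonts):
    """Return the name of the first font covering every character of `word`, else None."""
    needed = {ord(ch) for ch in word}
    for name, codepoints in fonts.items():
        if needed <= codepoints:
            return name
    return None  # the point at which a matching font would have to be downloaded


# Hypothetical coverage tables: font name -> supported code points.
local_fonts = {
    "latin-only": set(range(0x20, 0x7F)),
    "latin+cyrillic": set(range(0x20, 0x7F)) | set(range(0x0400, 0x0500)),
}

for word in ["hello", "привет", "مرحبا"]:
    print(word, "->", pick_font(word, local_fonts))
```

A `None` result corresponds to the case where, with `auto_download_fonts=False`, the word could not be rendered by any local font.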

## Automatic words

When no `wordlist_path` is given, real frequency-ranked words for `languages`
are downloaded (from the open
[FrequencyWords](https://github.com/hermitdave/FrequencyWords) project, ~85
languages) and cleaned (script filtering, length bounds, punctuation removal).
Two realism helpers are applied by default and can be tuned or disabled:

- `casing_variant_prob` (0.3): adds Title/UPPERCASE variants so the model sees
  capital letters (frequency lists are almost all lowercase).
- `numeric_token_ratio` (0.05): mixes in realistic numbers, dates, prices and
  codes - the kind of content real documents are full of.
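
To make the two knobs concrete, here is a rough sketch of how casing variants and numeric tokens could be mixed into a lowercase word list. `augment_words` and the three numeric formats are illustrative assumptions; only the parameter names and defaults are taken from the text above.

```python
import random


def augment_words(words, casing_variant_prob=0.3, numeric_token_ratio=0.05, seed=0):
    """Add Title/UPPERCASE variants and mix in numeric tokens (illustrative only)."""
    rng = random.Random(seed)
    out = []
    for word in words:
        out.append(word)
        if rng.random() < casing_variant_prob:
            out.append(rng.choice([word.title(), word.upper()]))
    # Add enough numeric tokens to make up roughly `numeric_token_ratio` of the result.
    n_numeric = round(len(out) * numeric_token_ratio / (1 - numeric_token_ratio))
    for _ in range(n_numeric):
        kind = rng.choice(["int", "price", "date"])
        if kind == "int":
            out.append(str(rng.randint(0, 99999)))
        elif kind == "price":
            out.append(f"{rng.randint(0, 999)}.{rng.randint(0, 99):02d}")
        else:
            out.append(f"{rng.randint(1, 28):02d}.{rng.randint(1, 12):02d}.{rng.randint(1990, 2030)}")
    rng.shuffle(out)
    return out


tokens = augment_words(["haus", "straße", "baum", "wasser"] * 50)
numeric = [t for t in tokens if t[0].isdigit()]
print(len(tokens), len(numeric))
```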

## Automatic backgrounds

When no `bg_image_dir` is given, a curated set of background images is downloaded
and cached automatically (instead of producing blank backgrounds). Supplying your
own `bg_image_dir` takes precedence and skips the download entirely - exactly like
fonts and word lists. Disable with `auto_download_backgrounds=False`, point
`background_cache_dir` somewhere persistent, or pass a `background_manifest_url`
(a newline-separated list of filenames/URLs) to use a different collection.
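
As an illustration of the manifest format, the sketch below turns a newline-separated list of filenames/URLs into absolute URLs. It assumes that bare filenames are resolved relative to the manifest URL, which is a guess rather than documented behaviour; `parse_manifest` and the example URLs are hypothetical.

```python
from urllib.parse import urljoin


def parse_manifest(text, base_url):
    """Turn a newline-separated manifest of filenames/URLs into absolute URLs."""
    urls = []
    for line in text.splitlines():
        entry = line.strip()
        if not entry:
            continue  # ignore blank lines
        urls.append(urljoin(base_url, entry))  # absolute URLs pass through unchanged
    return urls


manifest = "paper_01.jpg\n\nhttps://cdn.example.org/textures/wood.png\n"
print(parse_manifest(manifest, "https://example.com/backgrounds/manifest.txt"))
```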

## Dataset balancing

For multilingual runs the language mix is explicit and controllable instead of
being dominated by whichever language has the most words:

```python
config = GenerationConfig(
    output_dir="output_dataset",
    num_images=30000,
    languages=["en", "de", "ru"],
    language_balance="balanced",  # "balanced" (default) or "proportional"
    # language_weights={"en": 0.6, "de": 0.3, "ru": 0.1},  # or set explicit weights
    min_char_coverage=20,  # ensure every character appears >= N times (0 = off)
)
```

The split is *stratified*: train and val share the same language mix and exact
words do not leak from train into val. A balance report is printed before
generation (per-language train/val counts, train/val overlap, distinct/rare
characters, word-length statistics); silence it with
`print_balance_report=False`.
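
The difference between the two modes, and what "stratified" means here, can be sketched in a few lines. This is a simplified model under stated assumptions, not the generator's actual code: `allocate` and `stratified_split` are hypothetical helpers, and rounding leftovers are simply given to the first languages.

```python
import random


def allocate(num_images, words_per_lang, mode="balanced", language_weights=None):
    """Split `num_images` across languages according to the chosen balancing mode."""
    langs = list(words_per_lang)
    if language_weights is None:
        if mode == "balanced":
            language_weights = {lang: 1.0 for lang in langs}
        else:  # "proportional": follow the corpus sizes
            language_weights = {lang: float(len(words_per_lang[lang])) for lang in langs}
    total = sum(language_weights[lang] for lang in langs)
    counts = {lang: int(num_images * language_weights[lang] / total) for lang in langs}
    # Hand out what rounding down left over, one image per language.
    for lang in langs[: max(0, num_images - sum(counts.values()))]:
        counts[lang] += 1
    return counts


def stratified_split(words_per_lang, val_percent=0.2, seed=0):
    """Split each language's distinct words separately, so no exact word is in both splits."""
    rng = random.Random(seed)
    train, val = {}, {}
    for lang, words in words_per_lang.items():
        unique = sorted(set(words))
        rng.shuffle(unique)
        n_val = max(1, int(len(unique) * val_percent))
        val[lang], train[lang] = unique[:n_val], unique[n_val:]
    return train, val


corpus = {"en": [f"en{i}" for i in range(100)], "de": [f"de{i}" for i in range(10)]}
print(allocate(110, corpus, mode="balanced"))
print(allocate(110, corpus, mode="proportional"))
train, val = stratified_split(corpus)
```

In the proportional case the larger English corpus dominates, which is the situation the balanced default is meant to avoid.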

## Vocabulary coverage (recognition)

A recognition model is trained against a fixed character set (docTR's `VOCABS`).
Real frequency corpora rarely contain *every* character of that set - rare
accented capitals (`ẞ`), currency signs, some punctuation - so a model trained
only on downloaded words never sees them. With `ensure_vocab_coverage=True`
(the default), each language is mapped to its docTR vocab and extra word-like
tokens are synthesised so **every renderable vocab character appears in both the
train and val splits**:

```python
config = GenerationConfig(
    output_dir="dataset",
    num_images=50000,
    languages=["de"],  # mapped to the "german" vocab automatically
    ensure_vocab_coverage=True,  # default
    vocab_coverage_min_count=3,  # each vocab char appears in >= N train samples
)
```

- `target_vocab` overrides the per-language mapping - pass a `VOCABS` key
  (e.g. `"german"`) or a literal string of characters to cover. It also enables
  coverage when you supply your own `wordlist_path`.
- Coverage is enforced **after** the train/val split, so a rare character can
  never land in only one split. This makes `num_images` a *floor*: a small,
  bounded number of coverage samples (proportional to the vocab size, not the
  dataset) is appended on top.
- Languages with no fixed small vocab (CJK) are skipped automatically, and very
  large scripts (thousands of CJK ideographs / Hangul syllables) are left to the
  real corpus rather than synthesised.
- Every synthesised token stays within a single script (a Hebrew character is
  only ever placed in a Hebrew token, etc.), so each renders with one font. When
  several languages are generated together, coverage is computed over the union
  of their vocabs but tokens are never mixed across scripts.
- Coverage prefers repeating **real corpus words** that contain a rare
  character, so diacritic combinations are linguistically attested. Synthesis
  is only a fallback for characters absent from the corpus (rare punctuation,
  currency, capitals in a lower-cased corpus, or marks like Hebrew niqqud that
  real text omits) - and even then a combining mark is inserted into a real
  same-script word after a base letter, never rendered alone on a dotted circle.
- A character that **no available font can render** (e.g. `฿` inside a
  Latin-script vocab) is the one case that cannot be covered - that is a font
  limitation, reported in the log, not a logic gap.
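
To check coverage on a word list of your own, a small counter is enough. The sketch below is independent of this package: `missing_vocab_chars` is a hypothetical helper and the vocab string is a shortened stand-in, not docTR's actual `"german"` vocab.

```python
from collections import Counter


def missing_vocab_chars(words, vocab, min_count=1):
    """Return vocab characters that appear in fewer than `min_count` of the given words."""
    seen = Counter()
    for word in words:
        seen.update(set(word))  # count each character once per sample
    return sorted(ch for ch in set(vocab) if seen[ch] < min_count)


vocab = "abcdefghijklmnopqrstuvwxyzäöüßẞ"  # shortened stand-in for a real vocab
train_words = ["straße", "über", "schön", "mädchen"]
print(missing_vocab_chars(train_words, vocab))
```

Running the same check on the train and val word lists separately shows whether a rare character such as `ẞ` is missing from one of the splits.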

## Detection datasets

Set `task="detection"` to generate document-like **pages** with a 4-point
polygon for every word, ready for
[docTR detection training](https://github.com/mindee/doctr/tree/main/references/detection):

```python
config = GenerationConfig(
    task="detection",
    output_dir="detection_dataset",
    num_images=5000,  # = number of pages
    languages=["en", "de"],  # words + fonts resolved automatically
    bg_image_dir="resources/background_images",
    output_jpeg=True,
)
SyntheticDatasetGenerator(config).generate_dataset()
```

Each split is written as `images/` plus a `labels.json` in the exact docTR
format (absolute pixel coordinates):

```json
{
  "00000.jpg": {
    "img_dimensions": [1462, 1056],
    "img_hash": "<sha256 of the image>",
    "polygons": [[[x1, y1], [x2, y2], [x3, y3], [x4, y4]], ...]
  }
}
```

It reuses the same fonts, ink styling, contrast, backgrounds and degradations as
the recognition path. Pages are filled top-to-bottom by the available vertical
space (word count varies naturally with font size), and words are recycled as
needed so a page always fills regardless of how many candidate words it is given.
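
A quick sanity check of a generated `labels.json` could look like the sketch below. It only relies on the structure shown above; `check_labels` is a hypothetical helper and the inline JSON uses made-up coordinates and a placeholder hash.

```python
import json

labels_json = """
{
  "00000.jpg": {
    "img_dimensions": [1462, 1056],
    "img_hash": "placeholder",
    "polygons": [[[10, 20], [110, 20], [110, 50], [10, 50]],
                 [[130, 20], [200, 22], [199, 52], [129, 50]]]
  }
}
"""


def check_labels(labels):
    """Return the number of polygons per image, raising if a polygon is not 4 (x, y) points."""
    counts = {}
    for name, entry in labels.items():
        for poly in entry["polygons"]:
            if len(poly) != 4 or any(len(pt) != 2 for pt in poly):
                raise ValueError(f"{name}: expected 4 (x, y) points, got {poly!r}")
        counts[name] = len(entry["polygons"])
    return counts


print(check_labels(json.loads(labels_json)))
```

For a real dataset, replace `json.loads(labels_json)` with the parsed contents of the split's `labels.json`.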

### Real-world layouts

To better mimic real documents, the layout is chosen per page via `det_layout`:

- `"paragraph"` - multi-block running text with headings and indents.
- `"newspaper"` - a full-width masthead with a double rule and a dateline, then
  several narrow columns separated by vertical rules, each with article
  headlines, bylines and small, tightly-leaded body text (~500-1100 words on an
  A4-ish page). Tune density with `det_newspaper_columns_range` (default
  `(3, 6)`, clamped to the page width), `det_newspaper_font_size_range` (default
  `(9, 15)`) and `det_newspaper_line_spacing_range` (default `(1.05, 1.2)`).
- `"form"` - a title with a header rule, then `Label:` / value rows with either
  underlines or boxed fields, shaded section-header bars, and occasional
  checkbox rows.
- `"id_card"` - a card with a coloured issuing-authority header band (emblem +
  light title text), a photo placeholder, labelled field rows, a signature line
  and MRZ-style lines. Mirrors fully for right-to-left scripts.
- `"mixed"` (default) - a weighted blend of the above; tune via
  `det_layout_weights` (e.g. `{"paragraph": 0.4, "newspaper": 0.25, "form": 0.2,
  "id_card": 0.15}`).

Forms and ID cards always render on clean generated paper. All layouts emit the
same per-word polygons, and the optional small global page rotation
(`det_rotation_*`) rotates the polygons with the page for use with docTR's
`use_polygons=True`. Other layout knobs: `det_page_*_range`, `det_font_size_range`,
`det_max_blocks`, `det_margin_ratio`, `det_heading_prob`.
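
Two of the ideas above, drawing a layout from `det_layout_weights` and rotating polygons together with the page, can be sketched as follows. Both helpers are hypothetical; rotating about the page centre and the sign convention of the angle are assumptions, not documented behaviour of the generator.

```python
import math
import random


def choose_layout(weights, rng):
    """Draw one layout name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def rotate_polygon(polygon, angle_deg, center):
    """Rotate (x, y) points about `center` with the standard 2D rotation matrix."""
    cx, cy = center
    cos_a = math.cos(math.radians(angle_deg))
    sin_a = math.sin(math.radians(angle_deg))
    return [
        (cx + (x - cx) * cos_a - (y - cy) * sin_a, cy + (x - cx) * sin_a + (y - cy) * cos_a)
        for x, y in polygon
    ]


rng = random.Random(0)
weights = {"paragraph": 0.4, "newspaper": 0.25, "form": 0.2, "id_card": 0.15}
layout = choose_layout(weights, rng)

word_box = [(100, 100), (200, 100), (200, 130), (100, 130)]
rotated = rotate_polygon(word_box, 3.0, center=(528, 731))  # small page rotation
print(layout, rotated)
```

After a rotation the four points no longer form an axis-aligned rectangle, which is why rotated pages are meant to be used with `use_polygons=True`.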

> **Backgrounds for detection:** only the words *you* place are labelled, so any
> text already printed in a background photo becomes an unlabelled false
> negative. `det_plain_background_prob` (0.4) mixes in clean generated paper;
> set it to `1.0` for all-paper pages, or point `bg_image_dir` at **text-free**
> textures (plain paper, surfaces, fabrics) only.

Non-Latin scripts work out of the box: words and fonts are resolved per language,
complex scripts are shaped correctly (Arabic joining, Indic conjuncts), and
right-to-left languages (Arabic, Hebrew, ...) are laid out right-to-left so pages
read naturally. For example `languages=["ar"]`, `["he"]`, `["zh"]` or `["hi"]`.

## Plug into docTR training (on-the-fly, in-RAM)

You can skip writing a dataset to disk entirely and feed freshly synthesised
samples straight into docTR's training scripts. `generator/doctr_dataset.py`
provides PyTorch `Dataset` wrappers that generate one sample per
`__getitem__`, matching docTR's dataset contract - `(image_tensor, target)` per
sample plus a static `collate_fn` - so they drop into the existing `DataLoader`
in
[`references/detection/train.py`](https://github.com/mindee/doctr/blob/main/references/detection/train.py)
and
[`references/recognition/train.py`](https://github.com/mindee/doctr/blob/main/references/recognition/train.py).

Targets are identical to docTR's own datasets, so the model transforms and loss
treat them the same: recognition yields the label string; detection yields
`{CLASS_NAME: geoms}` with absolute-pixel polygons `(N, 4, 2)` when
`use_polygons=True` else straight boxes `(N, 4)` as `[xmin, ymin, xmax, ymax]`.
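
The relationship between the two detection target shapes is simply the axis-aligned bounding box of each polygon. A plain-Python sketch (the helper name is hypothetical, and lists are used here where the real targets would be arrays):

```python
def polygons_to_boxes(polygons):
    """Convert N polygons of 4 (x, y) points into N [xmin, ymin, xmax, ymax] boxes."""
    boxes = []
    for poly in polygons:
        xs = [x for x, _ in poly]
        ys = [y for _, y in poly]
        boxes.append([min(xs), min(ys), max(xs), max(ys)])
    return boxes


polygons = [
    [(10, 20), (110, 20), (110, 50), (10, 50)],
    [(130, 20), (200, 22), (199, 52), (129, 50)],
]
print(polygons_to_boxes(polygons))
```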

**Detection** - in `references/detection/train.py`, replace the
`DetectionDataset(...)` construction (keep the `DataLoader` lines):

```python
from generator.components import GenerationConfig
from generator.doctr_dataset import build_detection_datasets, synth_worker_init_fn

cfg = GenerationConfig(
    task="detection",
    languages=["en", "de"],
    num_images=50_000,  # POOL size (word variety + vocab coverage)
    auto_download_backgrounds=True,
)
train_set, val_set = build_detection_datasets(
    cfg,
    train_samples=20_000,  # virtual epoch length (len(dataset))
    val_samples=2_000,
    use_polygons=args.rotation,  # straight boxes unless --rotation
    sample_transforms=batch_transforms,  # the script's existing transforms
)
```

**Recognition** - in `references/recognition/train.py`, replace the
`RecognitionDataset(...)` construction:

```python
from generator.components import GenerationConfig
from generator.doctr_dataset import build_recognition_datasets, synth_worker_init_fn

cfg = GenerationConfig(task="recognition", languages=["en", "de"], num_images=100_000)
train_set, val_set = build_recognition_datasets(
    cfg,
    train_samples=50_000,
    val_samples=5_000,
    img_transforms=img_transforms,  # the script's existing resize/aug
)
```

The `DataLoader` lines stay as they are - just keep
`collate_fn=train_set.collate_fn` and add `worker_init_fn=synth_worker_init_fn`
so every worker gets an independent RNG stream:

```python
train_loader = DataLoader(
    train_set,
    batch_size=args.batch_size,
    shuffle=True,
    drop_last=True,
    num_workers=args.workers,
    pin_memory=torch.cuda.is_available(),
    collate_fn=train_set.collate_fn,
    worker_init_fn=synth_worker_init_fn,
)
```

Notes:

- **Pool size vs epoch length.** `config.num_images` sizes the word *pool*
  (variety and per-split vocab coverage); `train_samples` / `val_samples` set
  the virtual epoch length (`len(dataset)`). Samples are generated fresh, so the
  epoch length is just how many iterations you want per epoch.
- **Seeding.** The train set draws a fresh random sample on every access (new
  data every epoch - the whole point of on-the-fly); the val set is a
  reproducible fixed virtual set (seeded per index) so metrics stay comparable.
- **Coverage carries over.** The recognition pools come from the same balancing
  and per-split character-coverage pipeline as the offline generator, so
  sampling from them covers the target vocab.
- **One-time setup.** Corpora, fonts and backgrounds are downloaded/resolved
  once when the datasets are built (in the parent process), not per worker.
- Requires PyTorch in your training environment (`pip install python-doctr`).
  Importing the rest of this package never requires torch. For lower-level
  control you can use `SyntheticDetectionDataset` / `SyntheticRecognitionDataset`
  directly instead of the `build_*` factories.
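
The pool-versus-epoch-length and seeding notes can be illustrated with a toy dataset that needs neither torch nor this package. `VirtualDataset` is a hypothetical stand-in for the real wrappers, and deriving the per-index seed from `fixed_seed` and `index` is one possible scheme, not necessarily the one used here.

```python
import random


class VirtualDataset:
    """Toy on-the-fly dataset: every item is drawn from a word pool on access."""

    def __init__(self, pool, length, fixed_seed=None):
        self.pool = list(pool)
        self.length = length          # virtual epoch length, independent of the pool size
        self.fixed_seed = fixed_seed  # None -> fresh sample on every access (train)
        self._rng = random.Random()

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        if not 0 <= index < self.length:
            raise IndexError(index)
        if self.fixed_seed is None:
            rng = self._rng
        else:
            # Seeding per index makes item `index` identical on every epoch.
            rng = random.Random(self.fixed_seed * 1_000_003 + index)
        return rng.choice(self.pool)


pool = ["haus", "baum", "straße", "wasser", "licht"]
train_set = VirtualDataset(pool, length=20_000)              # new draws every epoch
val_set = VirtualDataset(pool, length=2_000, fixed_seed=42)  # reproducible

epoch_1 = [val_set[i] for i in range(50)]
epoch_2 = [val_set[i] for i in range(50)]
print(epoch_1 == epoch_2, len(train_set), len(val_set))
```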

## Realism

Rendered crops are meant to match real captured documents rather than clean
synthetic glyphs. The pipeline applies, all configurable:

- Supersampled rendering with high-quality downsampling for photographic
  anti-aliasing (`supersample`).
- Background-aware ink: dark-on-light **and** light-on-dark text, a controllable
  (often deliberately low) contrast range, neutral or coloured ink, variable
  opacity, faux-bold and outlines.
- Glyph-space augmentations before compositing (rotation, perspective, ink
  erosion) and image-space degradations after (Gaussian sensor noise, JPEG
  compression artifacts, blur, brightness/contrast jitter) - matching how a real
  capture degrades the whole frame.
- Optional JPEG output (`output_jpeg=True`) to match real document captures.
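
As a rough illustration of background-aware ink, the sketch below picks a grey level that contrasts with the background's luminance. It is an assumption-laden simplification: `pick_ink` and `contrast_range` are hypothetical names, and the real pipeline also handles coloured ink, opacity and outlines.

```python
import random


def pick_ink(background_rgb, contrast_range=(0.25, 0.9), rng=None):
    """Pick a grey ink level (0-255) that contrasts with the background luminance."""
    rng = rng or random.Random()
    r, g, b = background_rgb
    luminance = (0.299 * r + 0.587 * g + 0.114 * b) / 255.0  # BT.601 luma weights
    contrast = rng.uniform(*contrast_range)
    if luminance >= 0.5:
        ink = luminance - contrast * luminance          # dark ink on a light background
    else:
        ink = luminance + contrast * (1.0 - luminance)  # light ink on a dark background
    return round(ink * 255)


rng = random.Random(0)
print(pick_ink((250, 248, 240), rng=rng))  # paper-like background -> darker ink
print(pick_ink((20, 20, 30), rng=rng))     # dark background -> lighter ink
```

A lower bound near zero in `contrast_range` would produce the deliberately low-contrast samples mentioned above.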

## Performance & memory

Font objects and decoded background images are cached, giving a large throughput
improvement over re-loading them per sample. Memory stays bounded and tunable:

- `bg_cache_size` (16): number of decoded backgrounds held in memory per worker.
  Lower it on memory-constrained machines or with many workers; raise it for more
  background variety. `bg_max_dimension` (2000) downscales very large backgrounds
  on load so the cache stays light regardless of source resolution.
- Caches are per worker process, so peak memory scales roughly with
  `num_workers`.
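
The two memory bounds can be pictured as a small least-recently-used cache plus a size clamp. This is a sketch of the general technique; whether the generator evicts in LRU order is an assumption, and `BoundedCache` and `fit_within` are hypothetical names.

```python
from collections import OrderedDict


class BoundedCache:
    """Least-recently-used cache holding at most `max_size` decoded items."""

    def __init__(self, loader, max_size=16):
        self.loader = loader
        self.max_size = max_size
        self._items = OrderedDict()

    def get(self, key):
        if key in self._items:
            self._items.move_to_end(key)     # mark as most recently used
            return self._items[key]
        value = self.loader(key)
        self._items[key] = value
        if len(self._items) > self.max_size:
            self._items.popitem(last=False)  # evict the least recently used entry
        return value


def fit_within(width, height, max_dimension=2000):
    """Return a size whose longer side is at most `max_dimension`, keeping the aspect ratio."""
    scale = max_dimension / max(width, height)
    if scale >= 1.0:
        return width, height
    return max(1, round(width * scale)), max(1, round(height * scale))


loads = []
cache = BoundedCache(lambda path: loads.append(path) or path.upper(), max_size=2)
for path in ["a.jpg", "b.jpg", "a.jpg", "c.jpg", "a.jpg"]:
    cache.get(path)
print(loads, fit_within(4000, 3000))
```

Because each worker process would hold its own cache, the total footprint in this model is roughly `num_workers` times the per-worker bound.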

## Configuration reference

All behaviour is controlled through `GenerationConfig`; see the dataclass
docstring in `generator/components/config.py` for every field and its default.

## Resources

- **fonts_v1**: A collection of fonts used for text rendering can be downloaded from [Fonts_v1](https://github.com/felixdittrich92/docTR-Synth-Generator/releases/download/v0.0.1/fonts_v1.zip).
- **background_images_v1**: A collection of background images used for text rendering can be downloaded from [Background_Images_v1](https://github.com/felixdittrich92/docTR-Synth-Generator/releases/download/v0.0.1/background_images_v1.zip).

## Citation

If you wish to cite this project, feel free to use the following [BibTeX](http://www.bibtex.org/) reference:

```bibtex
@misc{docTR-Synth-Generator,
    title={docTR-Synth-Generator: A tool to generate synthetic OCR text datasets - made for docTR},
    author={{Dittrich, Felix}},
    year={2026},
    publisher = {GitHub},
    howpublished = {\url{https://github.com/felixdittrich92/docTR-Synth-Generator}}
}
```

The automatic word lists are derived from the
[FrequencyWords](https://github.com/hermitdave/FrequencyWords) project
(OpenSubtitles-based) and fonts from [Google Fonts / Noto](https://fonts.google.com/noto);
please respect their respective licenses when redistributing generated datasets.

## Development & tests

The test suite is fully offline - it builds a tiny in-memory font with
`fontTools` and monkeypatches the network downloads, so no fonts or corpora are
fetched while testing. Run it with:

```bash
make test      # pytest + coverage
make quality   # ruff + mypy
make style     # auto-format and fix
```

## Contributing

Contributions are what make the open-source community such an amazing place to learn, inspire, and create.

Any contributions you make are **greatly appreciated**.

1. Fork the Project
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
3. Add your Changes
4. Run the tests and quality checks (`make test`, `make style` and `make quality`)
5. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
6. Push to the Branch (`git push origin feature/AmazingFeature`)

## License

Distributed under the Apache 2.0 License. See [`LICENSE`](https://github.com/felixdittrich92/docTR-Synth-Generator?tab=Apache-2.0-1-ov-file#readme) for more information.
