Metadata-Version: 2.4
Name: blys
Version: 0.1.0
Summary: Utilities for building ML applications from the Google Fonts dataset
Author-email: Simon Cozens <simon@simon-cozens.org>
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: gftools
Requires-Dist: numpy
Requires-Dist: glyphsets
Requires-Dist: torch>=1.11.0
Requires-Dist: Pillow
Requires-Dist: fonttools
Requires-Dist: uharfbuzz
Requires-Dist: scikit-learn
Requires-Dist: torchmetrics
Requires-Dist: tensorboard
Requires-Dist: freetype-py
Provides-Extra: test
Requires-Dist: pytest; extra == "test"
Dynamic: license-file

# blys

`blys` is a utility library for building ML models that learn from font data.

It provides:

- structured access to Google Fonts metadata and font files
- robust glyph rasterization by GID/codepoint
- reusable PyTorch dataset and dataloader building blocks
- a compact, extensible training loop with checkpointing and TensorBoard logging

The package is designed to be imported by task-specific projects rather than prescribing a single model architecture.

## Installation

From PyPI:

```bash
pip install blys
```

## Data Source Expectations

Most dataset utilities expect a local clone of the Google Fonts repository with at least:

- `ofl/*/*.ttf`
- `tags/all/families.csv`

Example:

```bash
git clone https://github.com/google/fonts.git /data/google-fonts
```
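
Before constructing any dataset objects, it can help to confirm that the checkout contains the two entries listed above. The following is a minimal sketch using only the standard library; `missing_layout_entries` is a hypothetical helper for illustration, not part of `blys`.

```python
from pathlib import Path


def missing_layout_entries(repo_root):
    """Return the expected entries that are absent from a Google Fonts checkout.

    Hypothetical helper: it only checks the two paths this README lists.
    """
    root = Path(repo_root)
    missing = []
    # At least one TTF file is expected under ofl/<family>/.
    if not any(root.glob("ofl/*/*.ttf")):
        missing.append("ofl/*/*.ttf")
    # The family tag table is expected at this fixed relative path.
    if not (root / "tags" / "all" / "families.csv").is_file():
        missing.append("tags/all/families.csv")
    return missing
```

An empty list means both entries were found; anything else names what still needs to be present before the dataset utilities are likely to work.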

## Quick Start

### 1) Load Google Fonts and inspect metadata

```python
from blys.googlefonts import GoogleFonts

gf = GoogleFonts("/data/google-fonts")
print(f"Loaded {len(gf.fonts)} fonts")

font = gf.fonts[0]
print(font.family)
print(font.classification())
print(font.description_with_tags_and_display())
```

### 2) Render glyphs

```python
from blys.render import render_gid
from blys.googlefonts import GoogleFonts
import uharfbuzz as hb

gf = GoogleFonts("/data/google-fonts")
font = gf.fonts[0]

# Find GID for 'A'
gid = hb.Font(font.hb_face).get_nominal_glyph(ord("A"))

# CHW float image in [0, 1], shape (3, 128, 128)
image = render_gid(font.path, gid=gid, size=128)
print(image.shape, image.min(), image.max())
```

### 3) Build a task-specific DatasetMaker

`DatasetMaker` handles train/test font splits and DataLoader construction. Subclasses supply `collate_fn`, which turns a list of samples into a batch.

```python
from blys.dataset import DatasetMaker
import torch


class GlyphDatasetMaker(DatasetMaker):
    def collate_fn(self, batch):
        chars = torch.tensor([item["char"] for item in batch], dtype=torch.long)
        images = torch.stack(
            [
                torch.tensor(item["font"].render_char(item["char"], size=128), dtype=torch.float32)
                for item in batch
            ]
        )
        return {
            "char": chars,
            "image": images,
            "description": [item["font"].description_with_tags_and_display() for item in batch],
        }


maker = GlyphDatasetMaker(
    repo_url="/data/google-fonts",
    batch_size=16,
)
train_loader = maker.train_loader()
batch = next(iter(train_loader))
print(batch["image"].shape)
```

## Core Modules

### `blys.googlefonts`

- `GoogleFonts`: loads and filters fonts from a Google Fonts checkout
- `GoogleFont`: one font with metadata/tag/description helpers
- `StandaloneFont`: local-font fallback implementing the same interface
- `find_google_font_by_basename`: match a font by filename
- `compute_display_score`: derive display/text style centile from tags

### `blys.font`

- `Font`: abstract base with shared operations:
  - rendering by codepoint (`render_char`) and gid (`render_gid`)
  - codepoint queries
  - variable-axis sampling (`sample_axis_positions`)
  - empty-glyph checks (`has_non_empty_codepoint`, `has_non_empty_gid`)
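
To illustrate what variable-axis sampling can look like, the sketch below draws one uniform random position per axis. It is an assumption-laden illustration, not the implementation of `sample_axis_positions`: the function name `sample_axis_location`, the `(minimum, maximum)` input format, and the uniform distribution are all hypothetical choices.

```python
import random


def sample_axis_location(axes, rng=None):
    """Draw one uniform random position per variable-font axis.

    `axes` maps an axis tag to a (minimum, maximum) pair in user-space units.
    Hypothetical helper; the sampling strategy used by blys may differ.
    """
    rng = rng or random.Random()
    return {tag: rng.uniform(lo, hi) for tag, (lo, hi) in axes.items()}


# Example axis ranges chosen for illustration only.
axes = {"wght": (100.0, 900.0), "wdth": (75.0, 125.0)}
location = sample_axis_location(axes, rng=random.Random(0))
```

Passing a seeded `random.Random` keeps the sampled locations reproducible across runs.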

### `blys.render`

- `render_gid`: deterministic glyph rasterization by GID with optional variable-axis coordinates
- `is_blank_rendering`: utility to detect all-white/all-black outputs
- a small CLI entry point for local rendering/debugging

### `blys.dataset`

- constants for commonly used character sets:
  - `LATIN_CORE`
  - `LATIN_KERNEL`
- `DatasetMaker`: split and loader orchestration
- `Dataset`: `(font, char)` samples filtered by available codepoints
- `AllGidsDataset`: `(font, gid)` samples over non-empty glyphs
- `ClassBalancedBatchSampler`: class-balanced index batching by font classification
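
The idea behind class-balanced batching can be sketched without PyTorch: group sample indices by class label, then build each batch from an equal number of indices per chosen class. This is a generic illustration under stated assumptions, not the code of `ClassBalancedBatchSampler`; the function name, its parameters, and the example labels are hypothetical.

```python
import random
from collections import defaultdict


def class_balanced_batches(labels, classes_per_batch, samples_per_class, seed=0):
    """Yield index batches holding `samples_per_class` indices from each of
    `classes_per_batch` randomly chosen classes, without reusing an index."""
    rng = random.Random(seed)
    pools = defaultdict(list)
    for index, label in enumerate(labels):
        pools[label].append(index)
    for pool in pools.values():
        rng.shuffle(pool)
    while True:
        # Only classes with enough unused indices can fill their share.
        available = [c for c, pool in pools.items() if len(pool) >= samples_per_class]
        if len(available) < classes_per_batch:
            return
        batch = []
        for label in rng.sample(available, classes_per_batch):
            batch.extend(pools[label][:samples_per_class])
            del pools[label][:samples_per_class]
        yield batch


# Illustrative labels standing in for per-font classifications.
labels = ["serif"] * 6 + ["sans"] * 6 + ["display"] * 2
batches = list(class_balanced_batches(labels, classes_per_batch=2, samples_per_class=2))
```

Iteration stops once fewer than `classes_per_batch` classes can still contribute a full share, so some indices from larger classes may go unused in a given pass.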

### `blys.utils`

- `TrainingLoop`: minimal training harness with:
  - reproducibility setup
  - git cleanliness/commit tracking
  - TensorBoard logging
  - best-checkpoint saving
- helpers for device selection and CLI codepoint parsing
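
Best-checkpoint saving comes down to remembering the best validation metric seen so far and saving only when a new value beats it. The sketch below shows that bookkeeping in isolation. `BestMetricTracker` is a hypothetical class, not part of `blys`, and the `"lower"` direction is an assumption: the only value shown in this README is `"higher"`.

```python
class BestMetricTracker:
    """Remember the best validation metric seen so far (hypothetical helper)."""

    def __init__(self, direction="higher"):
        # "lower" is assumed here as the counterpart of "higher".
        if direction not in ("higher", "lower"):
            raise ValueError(f"unknown direction: {direction!r}")
        self.direction = direction
        self.best = None

    def update(self, value):
        """Record `value`; return True when it strictly beats the previous best."""
        improved = (
            self.best is None
            or (self.direction == "higher" and value > self.best)
            or (self.direction == "lower" and value < self.best)
        )
        if improved:
            self.best = value
        return improved


tracker = BestMetricTracker(direction="higher")
# Steps at which a checkpoint would be written for these validation scores.
saved_at = [step for step, acc in enumerate([0.61, 0.58, 0.70, 0.70]) if tracker.update(acc)]
```

Ties do not count as improvements in this sketch, which avoids rewriting a checkpoint when the metric plateaus.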

## Example Training Loop Usage

```python
import torch
from blys.utils import TrainingLoop, SaveLoadModel
from blys.dataset import DatasetMaker


class MyModel(SaveLoadModel):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(10, 1)

    def forward(self, x):
        return self.net(x)


class MyLoop(TrainingLoop):
    def post_init(self, train_args):
        self.model = MyModel().to(self.device)
        maker = DatasetMaker(
            repo_url=train_args.dataset_path,
            batch_size=train_args.batch_size,
            target_codepoints={ord("A"), ord("B"), ord("C")},
        )
        self.train_loader = maker.train_loader()
        self.test_loader = maker.test_loader()
        self.optimizer = torch.optim.Adam(self.model.parameters(), lr=1e-3)
        self.num_epochs = 1
        self.target_steps = train_args.target_steps
        self.validation_direction = "higher"

    def train_step(self, batch):
        # Replace with your real tensorization/model logic
        dummy_x = torch.randn(4, 10, device=self.device)
        dummy_y = torch.randn(4, 1, device=self.device)
        pred = self.model(dummy_x)
        loss = torch.nn.functional.mse_loss(pred, dummy_y)
        return loss, {"loss": loss}
```

## Testing

Run the test suite:

```bash
pip install -e ".[test]"
pytest -q
```

Some tests require a real Google Fonts checkout path and expect the `GOOGLE_FONTS_REPO` environment variable to be set; others run against `tests/dummy_repo`.
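
One way a test can resolve that environment variable is sketched below. `google_fonts_repo_from_env` is a hypothetical helper, not something the `blys` test suite is known to define; it returns `None` when the variable is unset or does not point at a directory.

```python
import os
from pathlib import Path


def google_fonts_repo_from_env(environ=None):
    """Return the checkout path named by GOOGLE_FONTS_REPO, or None if unusable."""
    environ = os.environ if environ is None else environ
    value = environ.get("GOOGLE_FONTS_REPO")
    if not value:
        return None
    path = Path(value)
    # Treat a path that is not an existing directory the same as an unset variable.
    return path if path.is_dir() else None
```

In a pytest fixture, a `None` result could be turned into `pytest.skip("GOOGLE_FONTS_REPO is not set")` so that tests needing a real checkout are skipped rather than failed.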

## License

This project is licensed under the Apache License 2.0. See the [LICENSE](LICENSE) file for details.
