Metadata-Version: 2.4
Name: libthx
Version: 0.1.5
Summary: Architecture experimentation and training infrastructure.
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: click<=8.2.1
Requires-Dist: flax>=0.12.2
Requires-Dist: jsonlines>=4.0.0
Requires-Dist: loguru>=0.7.3
Requires-Dist: numpy>=2.4.1
Requires-Dist: omegaconf>=2.3.0
Requires-Dist: orbax>=0.1.9
Requires-Dist: pydantic>=2.12.5
Requires-Dist: python-dotenv>=1.2.1
Requires-Dist: rich>=13.5.2
Requires-Dist: seaborn>=0.13.2
Requires-Dist: tiktoken>=0.12.0
Requires-Dist: torchax>=0.0.11
Requires-Dist: wandb>=0.24.1
Requires-Dist: datasets>=4.5.0
Provides-Extra: fever
Requires-Dist: wikipedia>=1.4.0; extra == "fever"
Provides-Extra: huggingface
Requires-Dist: tokenizers>=0.22.2; extra == "huggingface"
Requires-Dist: transformers>=5.1.0; extra == "huggingface"
Provides-Extra: cuda13
Requires-Dist: jax[cuda13]>=0.4.23; extra == "cuda13"
Requires-Dist: torch>=2.9.1; extra == "cuda13"
Requires-Dist: torchax>=0.0.11; extra == "cuda13"
Requires-Dist: tokenizers>=0.22.2; extra == "cuda13"
Requires-Dist: transformers>=5.1.0; extra == "cuda13"
Provides-Extra: cuda12
Requires-Dist: jax[cuda12]>=0.4.23; extra == "cuda12"
Requires-Dist: torch>=2.9.1; extra == "cuda12"
Requires-Dist: torchax>=0.0.11; extra == "cuda12"
Requires-Dist: tokenizers>=0.22.2; extra == "cuda12"
Requires-Dist: transformers>=5.1.0; extra == "cuda12"
Provides-Extra: tpu
Requires-Dist: jax[tpu]>=0.4.23; extra == "tpu"
Requires-Dist: torch>=2.9.1; extra == "tpu"
Requires-Dist: torchax>=0.0.11; extra == "tpu"
Requires-Dist: tokenizers>=0.22.2; extra == "tpu"
Requires-Dist: transformers>=5.1.0; extra == "tpu"
Provides-Extra: cpu
Requires-Dist: jax>=0.4.23; extra == "cpu"
Requires-Dist: torch>=2.9.1; extra == "cpu"
Requires-Dist: torchax>=0.0.11; extra == "cpu"
Requires-Dist: tokenizers>=0.22.2; extra == "cpu"
Requires-Dist: transformers>=5.1.0; extra == "cpu"
Provides-Extra: web
Requires-Dist: fastapi>=0.100.0; extra == "web"
Requires-Dist: uvicorn[standard]>=0.23.0; extra == "web"
Requires-Dist: jinja2>=3.1.0; extra == "web"
Requires-Dist: aiofiles>=23.0.0; extra == "web"
Requires-Dist: sse-starlette>=1.6.0; extra == "web"
Requires-Dist: pyyaml>=6.0.0; extra == "web"
Requires-Dist: python-multipart>=0.0.22; extra == "web"
Requires-Dist: bcrypt>=5.0.0; extra == "web"
Requires-Dist: itsdangerous>=2.2.0; extra == "web"
Requires-Dist: watchdog>=4.0.0; extra == "web"
Provides-Extra: dev
Requires-Dist: coverage>=7.9.1; extra == "dev"
Requires-Dist: coveralls>=4.0.1; extra == "dev"
Requires-Dist: pytest>=8.4.1; extra == "dev"
Requires-Dist: pytest-cov>=6.2.1; extra == "dev"
Requires-Dist: ruff>=0.12.1; extra == "dev"
Requires-Dist: pre-commit>=4.2.0; extra == "dev"
Requires-Dist: mypy>=1.16.1; extra == "dev"
Provides-Extra: docs
Requires-Dist: mkdocs>=1.6.1; extra == "docs"
Requires-Dist: mkdocs-material>=9.6.20; extra == "docs"
Requires-Dist: mkdocstrings[python]>=0.30.0; extra == "docs"
Requires-Dist: pymdown-extensions>=10.15; extra == "docs"
Requires-Dist: mkdocs-gen-files>=0.5.0; extra == "docs"
Requires-Dist: mkdocs-literate-nav>=0.6.0; extra == "docs"
Requires-Dist: mkdocs-section-index>=0.3.0; extra == "docs"
Dynamic: license-file

# theseus
Have you ever wanted to train a language model from scratch but hate writing boilerplate? Previously, the solution was to work at a frontier lab with Research Engineers:tm:.

Now the solution is to make Jack:tm: (and a cast of frontier coding models) do your research engineering. It will probably break a lot, but what the heck, at least I tried.

## Download

It depends on who gave you computors to make warm:

- cuda13: `uv sync --group all --group cuda13`
- cuda12: `uv sync --group all --group cuda12`
- you love Google: `uv sync --group all --group tpu`
- you bought your own computors: `uv sync --group all --group cpu`


## Quick Start
Use the CLI.

```bash
# List available jobs
theseus jobs

# Generate a config for data tokenization
theseus configure data/tokenize_variable_dataset tokenize.yaml \
    data.name=fineweb data.max_samples=1000000

# Run the tokenization locally
theseus run tokenize-fineweb tokenize.yaml ./output

# Generate a config for pretraining
theseus configure gpt/train/pretrain train.yaml \
    --chip h100 -n 8

# Run training locally
theseus run my-gpt-run train.yaml ./output
```

### Quick Start, but You Have Infra

Set up `~/.theseus.yaml` (see `examples/dispatch.yaml`), then submit jobs to remote clusters:

```bash
theseus submit my-run train.yaml --chip h100 -n 8
```
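If you haven't written a dispatch config before, a minimal sketch of `~/.theseus.yaml` might look like the following. The field names here are inferred from the JuiceFS section later in this README, so treat `examples/dispatch.yaml` as the authoritative schema:

```yaml
# ~/.theseus.yaml -- hypothetical minimal dispatch config.
# `root` and `work` match the fields shown in the JuiceFS section;
# anything else your clusters need should come from examples/dispatch.yaml.
clusters:
  hpc:
    root: /mnt/shared/theseus   # durable storage for job outputs
    work: /scratch/theseus      # fast local scratch
```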

## Quickish Start

For programmatic configuration and rapid prototyping:

```python
from theseus.quick import quick
from theseus.registry import JOBS

with quick("gpt/train/pretrain", "/path/to/output", "my-run") as j:
    j.config.training.per_device_batch_size = 16
    j.config.logging.checkpoint_interval = 4096
    j()  # run locally

# Or save config for later submission:
with quick("gpt/train/pretrain", "/path/to/output", "my-run") as j:
    j.config.training.per_device_batch_size = 16
    j.save("config.yaml", chip="h100", n_chips=8)
```

## Not Quick Start at All 

When you (or Claude) manage to find some time to chill, you can actually extend this package. The package is organized around `theseus.job.BasicJob`s, which can be extended with checkpointing and recovery tools.

The main entrypoint to start hacking:

1. take a look at how to compose a model together in `theseus.model.models.base`
2. bodge together anything you want to change and make a new model in the models folder (be sure to add it to `theseus.model.models.__init__`)
3. write an experiment, which is a `RestoreableJob`. A very basic one can just inherit the normal trainer, and that's about it. See `theseus.experiments.gpt` to get started (be sure to add it to `theseus.experiments.__init__`)

```python
# theseus/experiments/my_model.py
from theseus.training.base import BaseTrainer, BaseTrainerConfig
from theseus.model.models import MyModel

class PretrainMyModel(BaseTrainer[BaseTrainerConfig, MyModel]):
    MODEL = MyModel
    CONFIG = BaseTrainerConfig

    @classmethod
    def schedule(cls):
        return "wsd"
```

## JuiceFS Integration
When you are spread across many remote computors but bursty, you may go "aw shucks, I need to copy like 50TB of pretraining data around, that's so lame!"

Don't worry, we gotchu. If you use the `submit` API, you can ship your root directory around using [JuiceFS](https://juicefs.com/en/), a distributed filesystem.

In your `~/.theseus.yaml`, add the `mount` field to your cluster config:

```yaml
clusters:
  hpc:
    root: /mnt/juicefs/theseus
    work: /scratch/theseus
    mount: redis://:password@redis.example.com:6379/0
    cache_size: 100G
    cache_dir: /scratch/juicefs-cache
```

## (an incomplete list of) Features

- **CLI & Programmatic API**: Configure and run jobs via `theseus` CLI or the `quick()` Python API
- **Remote Dispatch**: Submit jobs to SLURM clusters or plain SSH hosts via `~/.theseus.yaml`
- **Checkpointing & Recovery**: Jobs are `RestoreableJob`s with built-in checkpoint/restore support
- **Data Pipelines**: Tokenize datasets (blockwise or streaming) with `data/tokenize_*` jobs
- **JuiceFS Integration**: Distributed filesystem support for sharing data across clusters
- **Multi-backend**: CUDA 12/13, TPU, and CPU via `uv sync --group`
- **Extensible**: Add models in `theseus.model.models`, experiments in `theseus.experiments`, and datasets in `theseus.data.datasets`
- **Dataclass Configs**: Type-safe configuration via dataclasses with OmegaConf, easy configuration with the `theseus.config.field` dataclass extension, and Hydra-style cheeky CLI overrides (`model.hidden_size=1024`)
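For the curious, the dot-path override in that last bullet boils down to walking attribute paths on a nested config. Here is a stdlib-only sketch of the idea; the real system routes overrides through OmegaConf, and `ModelConfig`/`Config` below are made-up stand-ins for actual theseus configs:

```python
from dataclasses import dataclass, field

@dataclass
class ModelConfig:
    hidden_size: int = 512
    num_layers: int = 8

@dataclass
class Config:
    model: ModelConfig = field(default_factory=ModelConfig)

def apply_override(cfg, dotted: str) -> None:
    """Apply a single Hydra-style `a.b.c=value` override in place."""
    path, raw = dotted.split("=", 1)
    *parents, leaf = path.split(".")
    target = cfg
    for name in parents:          # walk down to the owning config
        target = getattr(target, name)
    current = getattr(target, leaf)
    # Coerce the string to the field's current type (int here).
    setattr(target, leaf, type(current)(raw))

cfg = Config()
apply_override(cfg, "model.hidden_size=1024")
print(cfg.model.hidden_size)  # 1024
```

OmegaConf additionally validates overrides against the dataclass schema, which is where the "type-safe" part of the bullet comes from.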

---

<p align="center">
  <img src="https://www.jemoka.com/images/Logo_Transparent.png" width="32">
</p>
