Metadata-Version: 2.4
Name: toy-gpt-train-400-context-3
Version: 0.9.1
Summary: Toy GPT next-token prediction using a 3-token context window.
Author-email: Denise Case <dcase@nwmissouri.edu>
Project-URL: Documentation, https://toy-gpt.github.io/train-400-context-3/
Project-URL: Issues, https://github.com/toy-gpt/train-400-context-3/issues
Project-URL: Repository, https://github.com/toy-gpt/train-400-context-3
Keywords: toy,gpt,training,context window,language model
Classifier: Development Status :: 4 - Beta
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Education
Classifier: Topic :: Software Development :: Libraries
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Typing :: Typed
Requires-Python: >=3.14
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: datafun-toolkit>=0.9.5
Provides-Extra: dev
Requires-Dist: bandit>=1.9.2; extra == "dev"
Requires-Dist: deptry>=0.10.1; extra == "dev"
Requires-Dist: packaging>=25.0; extra == "dev"
Requires-Dist: pre-commit>=4.5.1; extra == "dev"
Requires-Dist: pytest>=9.0.2; extra == "dev"
Requires-Dist: pytest-cov>=7.0.0; extra == "dev"
Requires-Dist: pyright>=1.1.408; extra == "dev"
Requires-Dist: ruff>=0.14.13; extra == "dev"
Requires-Dist: twine>=6.2.0; extra == "dev"
Requires-Dist: validate-pyproject>=0.24.1; extra == "dev"
Provides-Extra: docs
Requires-Dist: mike>=2.1.3; extra == "docs"
Requires-Dist: mkdocs>=1.6.1; extra == "docs"
Requires-Dist: mkdocs-material>=9.7.1; extra == "docs"
Requires-Dist: mkdocs-static-i18n>=1.3.0; extra == "docs"
Requires-Dist: mkdocstrings[python]>=1.0.0; extra == "docs"
Dynamic: license-file

# Toy-GPT: train-400-context-3

[![PyPI version](https://img.shields.io/pypi/v/toy-gpt-train-400-context-3)](https://pypi.org/project/toy-gpt-train-400-context-3/)
[![Latest Release](https://img.shields.io/github/v/release/toy-gpt/train-400-context-3)](https://github.com/toy-gpt/train-400-context-3/releases)
[![Docs](https://img.shields.io/badge/docs-live-blue)](https://toy-gpt.github.io/train-400-context-3/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/license/MIT)
[![CI](https://github.com/toy-gpt/train-400-context-3/actions/workflows/ci-python-mkdocs.yml/badge.svg?branch=main)](https://github.com/toy-gpt/train-400-context-3/actions/workflows/ci-python-mkdocs.yml)
[![Deploy-Docs](https://github.com/toy-gpt/train-400-context-3/actions/workflows/deploy-mkdocs.yml/badge.svg?branch=main)](https://github.com/toy-gpt/train-400-context-3/actions/workflows/deploy-mkdocs.yml)
[![Check Links](https://github.com/toy-gpt/train-400-context-3/actions/workflows/links.yml/badge.svg)](https://github.com/toy-gpt/train-400-context-3/actions/workflows/links.yml)
[![Dependabot](https://img.shields.io/badge/Dependabot-enabled-brightgreen.svg)](https://github.com/toy-gpt/train-400-context-3/security)

> Demonstrates, at very small scale, how a language model is trained.

This repository is part of a series of toy training repositories plus a companion client repository:

- **Training repositories** produce pretrained artifacts (vocabulary, weights, metadata).
- The **client repository** loads those artifacts and provides an interactive prompt.

## Contents

- a small, declared text corpus
- a tokenizer and vocabulary builder
- a simple next-token prediction model
- a repeatable training loop
- committed, inspectable artifacts for downstream use
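
As a rough illustration of how the first pieces above fit together, the sketch below tokenizes a tiny corpus, builds a vocabulary, and pairs each 3-token context with the token that follows it. It is not the code in this repository: the whitespace tokenizer, the function names (`tokenize`, `build_vocab`, `make_pairs`), and the sample corpus are all assumptions made for the example.

```python
CONTEXT_SIZE = 3  # the 3-token context window this project is named for


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace (an assumed, deliberately simple rule)."""
    return text.lower().split()


def build_vocab(tokens: list[str]) -> dict[str, int]:
    """Map each distinct token to an integer ID; sorting keeps the IDs repeatable."""
    return {token: idx for idx, token in enumerate(sorted(set(tokens)))}


def make_pairs(
    ids: list[int], context_size: int = CONTEXT_SIZE
) -> list[tuple[tuple[int, ...], int]]:
    """Slide a window over the IDs, pairing each context with the next token."""
    return [
        (tuple(ids[i : i + context_size]), ids[i + context_size])
        for i in range(len(ids) - context_size)
    ]


corpus = "the cat sat on the mat"  # hypothetical corpus, not the project's
tokens = tokenize(corpus)
vocab = build_vocab(tokens)
ids = [vocab[token] for token in tokens]
pairs = make_pairs(ids)

for context, target in pairs:
    print(context, "->", target)
```

Each printed pair is one training example: the model sees three token IDs and is asked to predict the fourth. A six-token corpus therefore yields only three examples, which is why even a toy model needs a corpus somewhat longer than its context window.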

## Scope

This is:

- an intentionally inspectable training pipeline
- a next-token predictor trained on an explicit corpus

This is not:

- a production system
- a full Transformer implementation
- a chat interface
- a claim of semantic understanding

## Outputs

This repository produces and commits pretrained artifacts under `artifacts/`.

Training logs and evidence are written under `outputs/`
(for example, `outputs/train_log.csv`).
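
One way to take a first look at a training log is to list its columns and count its rows. The sketch below assumes only that the log is a CSV file with a header row; it makes no assumption about which columns the file contains, and `summarize_log` is a hypothetical helper, not part of this package.

```python
import csv
from pathlib import Path


def summarize_log(path: Path) -> tuple[list[str], int]:
    """Return the column names and the number of data rows in a CSV log."""
    with path.open(newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        # fieldnames is None for an empty file, so fall back to an empty list.
        return list(reader.fieldnames or []), len(rows)


log_path = Path("outputs/train_log.csv")
if log_path.exists():  # the log is only present after a training run
    columns, row_count = summarize_log(log_path)
    print(f"{row_count} rows; columns: {columns}")
```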

## Quick start

See `SETUP.md` for full setup and workflow instructions.

Run the full training script:

```shell
uv run python src/toy_gpt_train/d_train.py
```

Or run the scripts individually, in order:

- `a_tokenizer.py`, `b_vocab.py`, and `c_model.py` are demos; each can be run on its own.
- `d_train.py` produces the artifacts.
- `e_infer.py` consumes the artifacts.

```shell
uv run python src/toy_gpt_train/a_tokenizer.py
uv run python src/toy_gpt_train/b_vocab.py
uv run python src/toy_gpt_train/c_model.py
uv run python src/toy_gpt_train/d_train.py
uv run python src/toy_gpt_train/e_infer.py
```

## Provenance and Purpose

The primary corpus used for training is declared in `SE_MANIFEST.toml`.

This repository commits pretrained artifacts so the client can run
without retraining.

## Annotations

[ANNOTATIONS.md](./ANNOTATIONS.md) - REQ/WHY/OBS annotations used in this repository

## Citation

[CITATION.cff](./CITATION.cff)

## License

[MIT](./LICENSE)

## SE Manifest

[SE_MANIFEST.toml](./SE_MANIFEST.toml) - project intent, scope, and role
