Minimal mental flow

Raw text
   |
   v
UTF-8 encode to bytes
   |
   v
Start with tokens = individual bytes
   |
   v
Count adjacent token pairs
   |
   v
Pick most frequent pair
   |
   v
Assign new token id
   |
   v
Replace that pair everywhere
   |
   v
Repeat N merges
   |
   v
Done: vocab + merge rules


What to build first

Build the pieces in this order:

Phase 1

A function that turns a Python string into UTF-8 bytes.
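
As a minimal sketch of this phase (the name text_to_bytes is only an illustrative choice), the built-in str.encode already does the work:

```python
def text_to_bytes(text: str) -> bytes:
    """Encode a Python string as UTF-8 bytes (hypothetical helper name)."""
    return text.encode("utf-8")


print(text_to_bytes("hi"))      # b'hi'
print(len(text_to_bytes("é")))  # 2: non-ASCII characters take more than one byte
```

The second line is the thing to notice: token counts start from bytes, not characters.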

Phase 2

A representation of tokens as integers.
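
One simple representation, assuming a plain Python list is enough for a first version: each byte becomes an integer from 0 to 255. The helper name bytes_to_tokens and the FIRST_MERGE_ID constant are illustrative, not fixed by anything above.

```python
# Assumption: ids 0..255 are reserved for raw bytes, so merged tokens
# can start at 256 without colliding with them.
FIRST_MERGE_ID = 256


def bytes_to_tokens(data: bytes) -> list[int]:
    """Turn a bytes object into a list of integer token ids, one per byte."""
    return list(data)


tokens = bytes_to_tokens("hi é".encode("utf-8"))
print(tokens)  # [104, 105, 32, 195, 169]
```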

Phase 3

A function that counts adjacent pairs in a token sequence.
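
A possible sketch of the pair counter, using collections.Counter from the standard library (count_pairs is an illustrative name):

```python
from collections import Counter


def count_pairs(tokens: list[int]) -> Counter:
    """Count every adjacent (left, right) pair in the token sequence."""
    # zip(tokens, tokens[1:]) yields (t0, t1), (t1, t2), ...
    return Counter(zip(tokens, tokens[1:]))


counts = count_pairs([1, 2, 1, 2, 3])
print(counts.most_common(1))  # [((1, 2), 2)]
```

Note that overlapping pairs are all counted: in [1, 1, 1] the pair (1, 1) is counted twice, even though only one of them can actually be merged.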

Phase 4

A function that replaces one chosen pair with a new token id.
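
One way to write the replacement step is a single left-to-right scan. This is a sketch; merge_pair is an illustrative name.

```python
def merge_pair(tokens: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(tokens):
        # Match the pair at position i, if there is room for two tokens.
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2  # skip both members of the merged pair
        else:
            out.append(tokens[i])
            i += 1
    return out


print(merge_pair([1, 2, 1, 2, 3], (1, 2), 256))  # [256, 256, 3]
print(merge_pair([1, 1, 1], (1, 1), 256))        # [256, 1]
```

The second call shows the left-to-right rule: after the first two 1s merge, the remaining 1 has no partner.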

Phase 5

A training loop that repeats merges.
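
Putting phases 1 to 4 together, a training loop might look like the sketch below. The function names, the choice to store merges in a dict, and the rule that new ids start at 256 are all assumptions made for illustration. Ties between equally frequent pairs are broken by first occurrence here; other implementations may choose differently.

```python
from collections import Counter


def count_pairs(tokens):
    """Count adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))


def merge_pair(tokens, pair, new_id):
    """Replace non-overlapping occurrences of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train(text: str, num_merges: int) -> dict:
    """Learn up to `num_merges` merge rules; returns {(left, right): new_id}."""
    tokens = list(text.encode("utf-8"))
    merges = {}  # insertion order doubles as merge priority
    for step in range(num_merges):
        counts = count_pairs(tokens)
        if not counts:
            break  # fewer than two tokens left, nothing to merge
        pair = max(counts, key=counts.get)  # most frequent pair
        new_id = 256 + step                 # ids 0..255 are the raw bytes
        tokens = merge_pair(tokens, pair, new_id)
        merges[pair] = new_id
    return merges


merges = train("aaabdaaabac", 3)
print(merges)  # {(97, 97): 256, (256, 97): 257, (257, 98): 258}
```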

Phase 6

An encoder:

text -> bytes -> apply merges -> token ids
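
A simple, unoptimized encoder sketch: replay the learned merges in training order over the byte sequence. The merge table below is hand-written for the example, and the names encode and merge_pair are illustrative.

```python
def merge_pair(tokens, pair, new_id):
    """Replace non-overlapping occurrences of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def encode(text: str, merges: dict) -> list[int]:
    """text -> bytes -> apply merges in learned order -> token ids."""
    tokens = list(text.encode("utf-8"))
    # Dicts preserve insertion order, so this replays merges oldest first.
    for pair, new_id in merges.items():
        tokens = merge_pair(tokens, pair, new_id)
    return tokens


# Hand-written example table: (left, right) -> new id, in training order.
merges = {(97, 97): 256, (256, 97): 257, (257, 98): 258}
print(encode("aaab", merges))  # [258]
print(encode("abc", merges))   # [97, 98, 99]
```

This does one full pass per merge rule, which is fine for learning purposes but slow for large vocabularies.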

Phase 7

A decoder:

token ids -> expand merges back -> bytes -> text
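
For decoding, one option is to precompute the byte string behind every token id and then concatenate. The sketch below assumes the merge table is stored in training order, so both halves of each pair are already known when it is expanded. The names build_vocab and decode are illustrative.

```python
def build_vocab(merges: dict) -> dict:
    """Map every token id to the raw bytes it stands for."""
    vocab = {i: bytes([i]) for i in range(256)}  # base tokens: single bytes
    for (left, right), new_id in merges.items():
        # Relies on training order: left and right are already in vocab.
        vocab[new_id] = vocab[left] + vocab[right]
    return vocab


def decode(ids: list[int], merges: dict) -> str:
    """token ids -> expand merges back -> bytes -> text."""
    vocab = build_vocab(merges)
    data = b"".join(vocab[i] for i in ids)
    return data.decode("utf-8")  # strict: raises on invalid UTF-8


# Hand-written example table: (left, right) -> new id, in training order.
merges = {(97, 97): 256, (256, 97): 257, (257, 98): 258}
print(decode([258, 100], merges))  # aaabd
```

Strict decoding is a deliberate choice for this sketch: an arbitrary id sequence can produce invalid UTF-8, and an exception makes that visible. Passing errors="replace" to bytes.decode is an alternative if you prefer a best-effort string.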

Those seven phases, built in that order, are the correct foundation.

Best learning target

Your first tokenizer should be able to:

train on a small text corpus

produce merge rules

encode new text

decode back exactly

If decoding the encoded text does not return the original string exactly, there is a bug: byte-level BPE is lossless by construction.
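
A small round-trip check can make that test concrete. The sketch below takes any encode/decode pair as callables; the stand-in tokenizer used here is just raw bytes with no merges, so swap in your own functions. All names are illustrative.

```python
def check_round_trip(encode, decode, samples):
    """Return the samples for which decode(encode(s)) != s."""
    return [s for s in samples if decode(encode(s)) != s]


# Stand-in tokenizer (assumption for the demo): raw bytes, no merges.
def encode_bytes(text):
    return list(text.encode("utf-8"))


def decode_bytes(ids):
    return bytes(ids).decode("utf-8")


samples = ["", "hello", "naïve café", "日本語", "emoji 🙂", "line\nbreak"]
failures = check_round_trip(encode_bytes, decode_bytes, samples)
print(failures)  # []
```

Including the empty string, multi-byte characters, and whitespace in the samples tends to surface the edge cases early.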