Metadata-Version: 2.1
Name: engram-generator
Version: 0.1.0
Summary: Procedural synthetic dataset generator for training reasoning AI — 2,022 generators across 100+ scientific domains
Author: deepnet.one
License: MIT
Project-URL: Homepage, https://www.engram.one
Project-URL: Repository, https://github.com/alexge233/engram_generator
Keywords: machine-learning,curriculum,reasoning,logic,synthetic-data,infinite-dataset
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: License :: OSI Approved :: MIT License
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: sympy >=1.12
Requires-Dist: numpy >=1.24
Requires-Dist: rich >=13.0
Requires-Dist: pylatexenc >=2.10
Requires-Dist: unicodeit >=0.7
Provides-Extra: atoms
Requires-Dist: requests >=2.28 ; extra == 'atoms'
Requires-Dist: beautifulsoup4 >=4.12 ; extra == 'atoms'
Provides-Extra: dev
Requires-Dist: pytest >=8.0 ; extra == 'dev'
Requires-Dist: pytest-cov ; extra == 'dev'
Requires-Dist: ruff ; extra == 'dev'

```
  _____ _   _  ____ ____      _    __  __
 | ____| \ | |/ ___|  _ \    / \  |  \/  |
 |  _| |  \| | |  _| |_) |  / _ \ | |\/| |
 | |___| |\  | |_| |  _ <  / ___ \| |  | |
 |_____|_| \_|\____|_| \_\/_/   \_\_|  |_|
          G E N E R A T O R
```

**2,022 generators. 100+ scientific domains. 10^81 unique problems.**

A procedural dataset that encodes the breadth of human scientific knowledge -- mathematics, physics, chemistry, biology, computer science, engineering, quantum theory, earth sciences, economics, logic, and more -- as step-by-step reasoning problems. The goal is not to build a benchmark. The goal is to teach machines how humans reason, discover, and invent.

Every problem is generated on-the-fly. Every answer is correct by construction. There is no dataset file -- just code that writes an endless exam across every discipline humans have formalised.

> This repository contains AI-generated code, reviewed and directed by a human author.

```bash
pip install engram-generator
```

## The Problem

Models trained on static datasets learn to pattern-match, not to reason. Train a model on 10,000 addition problems and it learns a lookup table, not addition. Change the digit count and it breaks. That's memorisation pretending to be intelligence.

Human reasoning didn't develop by memorising answers. It developed by solving problems across domains -- by recognising that the same recursive structure appears in Fibonacci sequences, merge sort, and mathematical induction. That a conservation law works the same way in thermodynamics, circuit analysis, and chemical equilibria. That a proof by contradiction in logic uses the same mental move as a reducibility argument in computability theory.

Engram Generator encodes this cross-domain structure:

- **10^81 unique problems** -- more than the estimated number of atoms in the observable universe (~10^80)
- **Step-by-step solutions** -- show your work or fail
- **26 reasoning strategies** with balanced exposure -- no single trick works
- **100+ scientific domains** -- the breadth of formalised human knowledge
- **Adaptive difficulty** -- the curriculum escalates as the model improves
- **Must Level Up** -- advanced tasks are locked behind mastery of prerequisites
- **Correct by construction** -- every answer is computed by the same generator that produced the problem

## The Arc

A model trained on this curriculum climbs from counting to self-awareness:

```
Tier 0   "2 + 3 = 5"
  |
Tier 2   "d/dx(3x^2 + 2x) = 6x + 2"
  |
Tier 5   "curl F = (dFz/dy - dFy/dz, ...)"
  |
Tier 7   "This proof has an error in step 3. Here is the correction."
  |
Tier 8   "These two problems share an isomorphic structure."
  |
Tier 9   "To solve this class of problems, I would design the following algorithm."
  |
Tier 10  "My architecture struggles with length generalisation.
          Here is a proposed modification."
```

From following procedures to creating them. From solving problems to understanding what makes problems solvable.

## What's Inside

| Domain | Generators | Highlights |
|---|---|---|
| **Mathematics** | 730+ | Arithmetic through category theory, PDEs, algebraic geometry, measure theory, homological algebra |
| **Physics** | 200+ | Classical mechanics to quantum field theory, plasma physics, particle physics, general relativity |
| **Computer Science** | 230+ | Algorithms, cryptography, compilers, distributed systems, ML theory, formal verification |
| **Chemistry** | 80+ | General, organic, physical, spectroscopy, polymer science, electrochemistry |
| **Biology & Health** | 90+ | Genetics, biochemistry, epidemiology, neuroscience, systems biology, pharmacology |
| **Engineering** | 100+ | Signal processing, control theory, semiconductors, photonics, aerospace, structural |
| **Quantum** | 50+ | Formalism, information theory, field theory, error correction |
| **Earth & Space** | 30+ | Astronomy, geology, oceanography, geophysics, climate science |
| **Social & Cognitive** | 50+ | Economics, game theory, linguistics, causal inference, cognitive science |
| **Logic & Foundations** | 50+ | Formal logic, model theory, computability, proof theory, set theory |
| **Other** | 100+ | Music theory, financial maths, medical imaging, persistent homology, wavelet theory |

## Levelling System

You don't get to attempt RSA encryption until you can do modular arithmetic. You don't get to critique a proof until you can write one. The skill tree enforces this:

| Tier | Tasks | What You Unlock | Examples |
|---|---|---|---|
| 0 | 20 | **Fundamentals** | Addition, subtraction, sorting, boolean logic |
| 1 | 36 | **Building blocks** | Multiplication, Fibonacci, Caesar cipher |
| 2 | 47 | **Algebra & graphs** | Derivatives, quadratics, graph reachability |
| 3 | 95 | **Real maths** | Integrals, determinants, boolean algebra |
| 4 | 313 | **Applied science** | Physics, probability, dynamic programming |
| 5 | 730 | **Expert territory** | PDEs, cryptography, quantum mechanics |
| 6 | 521 | **Graduate level** | Topology, general relativity, information theory |
| 7 | 176 | **Meta-reasoning** | Proof strategy, error detection, generalisation |
| 8 | 31 | **Creative** | Conjecture, isomorphism detection |
| 9 | 29 | **Research** | Algorithm design, impossibility proofs |
| 10 | 24 | **Self-architecture** | Scaling laws, architecture search, loss design |
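
The gating rule can be illustrated with a minimal sketch. This is not the package's `SkillTree` implementation: the prerequisite graph, the task names, and the 0.9 mastery threshold below are all assumptions made for illustration.

```python
# Assumed mastery cut-off; the real threshold is not stated in this README.
MASTERY_THRESHOLD = 0.9

# Hypothetical prerequisite graph: task -> tasks that must be mastered first.
PREREQUISITES = {
    "addition": [],
    "multiplication": ["addition"],
    "modular_arithmetic": ["multiplication"],
    "rsa_encryption": ["modular_arithmetic"],
}


def unlocked_tasks(accuracy: dict[str, float]) -> set[str]:
    """Return every task whose prerequisites have all reached mastery."""
    mastered = {task for task, acc in accuracy.items() if acc >= MASTERY_THRESHOLD}
    return {
        task
        for task, reqs in PREREQUISITES.items()
        if all(req in mastered for req in reqs)
    }


# With no results yet, only tasks without prerequisites are available.
print(unlocked_tasks({}))
# Mastering addition opens multiplication; 0.85 on multiplication is not
# enough (under the assumed threshold) to open modular arithmetic.
print(unlocked_tasks({"addition": 0.97, "multiplication": 0.85}))
```

Under these assumptions, RSA encryption stays locked until the whole chain beneath it has been mastered.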

## Reasoning Balance

Without balancing, formula-substitution problems (1,188 of 2,022 generators, about 59%) would dominate training. The model would learn to plug numbers into equations and call it a day.

Instead, training is balanced across **26 reasoning strategies**. Each gets equal exposure regardless of how many generators belong to it:

| Pattern | Generators | What it teaches |
|---|---|---|
| Formula substitution | 1,188 | Plug values into known equations |
| Meta-reasoning | 112 | Proof strategy, error analysis, architecture design |
| Probabilistic reasoning | 87 | Bayes, distributions, expected values, stochastic processes |
| Differential equations | 76 | ODEs, PDEs, boundary value problems, numerical methods |
| Graph traversal | 73 | BFS, DFS, Dijkstra, flow networks, connectivity |
| Simulation trace | 60 | State machines, data structures, protocol execution |
| Symbolic manipulation | 49 | Differentiation, integration, algebraic simplification |
| Construction & verification | 39 | Group axioms, homomorphisms, topological invariants |
| Geometric computation | 34 | Areas, volumes, intersections, convex hulls |
| Conservation & balance | 33 | Thermodynamic laws, Kirchhoff, chemical equilibria |
| Linear algebra | 31 | Matrix decomposition, eigenvalues, null spaces |
| Counting & enumeration | 28 | Permutations, Catalan numbers, inclusion-exclusion |
| Statistical inference | 26 | Hypothesis testing, confidence intervals, regression |
| Logical deduction | 26 | Natural deduction, resolution, sequent calculus |
| Modular arithmetic | 22 | CRT, Euler's totient, discrete logarithms |
| Transform methods | 20 | Fourier, Laplace, Z-transform, wavelets |
| Dynamic programming | 19 | Optimal substructure, memoisation, alignment |
| Optimization | 19 | Gradient descent, KKT conditions, convex methods |
| Series & convergence | 19 | Ratio test, power series, uniform convergence |
| Encoding & decoding | 15 | RSA, Huffman, Reed-Solomon, stream ciphers |
| Approximation & numerical | 14 | Newton-Raphson, quadrature, interpolation |
| Recursive decomposition | 11 | Divide-and-conquer, Tower of Hanoi, merge sort |
| Comparison & ordering | 9 | Periodic trends, mineral identification, ranking |
| Dimensional analysis | 8 | Unit conversion, significant figures, calibration |
| Greedy selection | 3 | Interval scheduling, bin packing, set cover |
| Search & backtracking | 1 | A*, constraint satisfaction |

Formula substitution has 1,188 generators. Search & backtracking has 1. But during training, both patterns get the same share: 1/26, or about **3.8% of samples**. No free rides.
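
One way to get equal exposure per pattern is to give every pattern the same total probability and split it evenly among that pattern's generators. The sketch below shows the idea; it is not the code behind `get_pattern_weights`, and the generator and pattern names are hypothetical.

```python
from collections import Counter


def balanced_weights(pattern_of: dict[str, str]) -> dict[str, float]:
    """Map generator name -> sampling weight so every pattern gets equal mass."""
    counts = Counter(pattern_of.values())
    per_pattern = 1.0 / len(counts)  # equal share for each reasoning pattern
    return {
        name: per_pattern / counts[pattern]  # split the share within the pattern
        for name, pattern in pattern_of.items()
    }


# Hypothetical toy registry: three formula generators, one search generator.
toy = {
    "ohms_law": "formula_substitution",
    "ideal_gas": "formula_substitution",
    "kinetic_energy": "formula_substitution",
    "n_queens": "search_backtracking",
}
weights = balanced_weights(toy)

# Both patterns end up with half of the samples, whatever their generator count.
for name, weight in weights.items():
    print(f"{name}: {weight:.3f}")

# With 26 patterns, the per-pattern share is 1/26.
print(f"{100 / 26:.1f}% per pattern")
```

The resulting dictionary can be passed to `random.choices` as sampling weights. With the real counts, each formula-substitution generator would be drawn far less often than the single search generator, which is the intended effect.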

## Why Memorisation is Impossible

The entire curriculum is **1.85 MB of algorithms**. It produces **terabytes of unique instances**. That's a compression ratio of 1,250,000:1.

| Difficulty range | Unique problems | For scale... |
|---|---|---|
| d=1 only | ~10^12 | More than all Google searches ever |
| d=1-4 | ~10^41 | Grains of sand on Earth, squared |
| d=1-8 (full) | **~10^81** | **Atoms in the observable universe** |

Even the largest models can't put a dent in it:

| Model | Parameters | Can memorise | Coverage of 10^81 |
|---|---|---|---|
| GPT-2 | 124,000,000 | ~134,000 | 10^-76 |
| Llama-2 7B | 7,000,000,000 | ~7.5M | 10^-74 |
| Llama-2 70B | 70,000,000,000 | ~75M | 10^-73 |
| Llama-3.1 405B | 405,000,000,000 | ~438M | 10^-72 |
| GPT-4 (est. ~1.8T) | 1,800,000,000,000 | ~1.9B | 10^-72 |

GPT-4, estimated at 1.8 trillion parameters, could memorise roughly 2 billion samples. The dataset has 10^81. The gap is **72 orders of magnitude**. **The only winning strategy is to learn the algorithms.**

And here's the kicker: the algorithmic information (1.85 MB) fits inside even a 1M parameter model with 14x headroom. Models *can* store every algorithm. They *cannot* store even a billionth of the instances.
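
The coverage column and the compression figure can be checked with a few lines of arithmetic. This sketch takes the "can memorise" estimates from the table as given and assumes 1 MB means 10^6 bytes; it only verifies that the exponents follow from those inputs.

```python
import math

TOTAL_PROBLEMS_LOG10 = 81  # 10^81 unique problems, as stated above


def coverage_exponent(memorisable_samples: float) -> int:
    """Nearest power of ten for memorisable_samples / 10^81."""
    return round(math.log10(memorisable_samples) - TOTAL_PROBLEMS_LOG10)


# "Can memorise" estimates copied from the table.
estimates = {
    "GPT-2": 134_000,
    "Llama-2 7B": 7.5e6,
    "Llama-2 70B": 75e6,
    "Llama-3.1 405B": 438e6,
    "GPT-4 (est.)": 1.9e9,
}
for model, samples in estimates.items():
    print(f"{model}: 10^{coverage_exponent(samples)}")

# 1.85 MB of algorithms at a 1,250,000:1 ratio, assuming 1 MB = 10^6 bytes.
generated_bytes = 1_850_000 * 1_250_000
print(f"{generated_bytes / 1e12:.2f} TB")
```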

## Tokenizer

All mathematical notation is written in **LaTeX**. The model learns to read and write LaTeX as a native language -- fractions, integrals, matrices, Greek letters (spelled out), superscripts, subscripts, and nested expressions. This means a model trained on Engram Generator doesn't just learn to solve maths -- it learns the standard notation that humans use to communicate it.

```
\frac{d}{dx}(-x^2-2x-2) <step> -1*2x=-2x <step> -2*1=-2 <step> 0 <step> -2x-2

\begin{pmatrix} -5 & 3 \\ 2 & 2 \end{pmatrix} \times \begin{pmatrix} -1 & -2 \\ -3 & 8 \end{pmatrix}

\oint_{|z|=3} \frac{1}{z^{2}+z-6} dz <step> poles: z=-3, 2 <step> Res(f,2)=0.2 <step> 1.2566i
```

Engram Generator uses a **character-level tokenizer** -- every character maps to exactly one token. No subword merging. No BPE. No SentencePiece.

**Why?** Subword tokenizers destroy the structure that reasoning depends on:

- **Digit atomicity**: BPE can merge `"123"` into a single token, hiding the fact that the `3` is in the ones place and the `1` is in the hundreds place, which makes digit-wise arithmetic much harder to learn. Character-level tokenization keeps every digit separate, so carry operations and place-value reasoning work naturally.
- **LaTeX preservation**: LaTeX uses nested braces, superscripts, and subscripts (`\frac{d}{dx}`, `x^{2}`). Subword tokenizers split these unpredictably -- `\frac` might become `\fr` + `ac`, breaking the command boundary. Character-level tokenization preserves brace matching, command names, and operator structure exactly as written.
- **Deterministic alignment**: Every character is exactly one token. No ambiguity about tokenization boundaries. The model's attention patterns can align precisely with the mathematical structure of the problem.

**The character set** (132 characters + 3 special tokens = 135 vocab):

| Category | Characters |
|---|---|
| Digits (10) | `0 1 2 3 4 5 6 7 8 9` |
| Lowercase (26) | `a b c ... z` |
| Uppercase (26) | `A B C ... Z` |
| Greek (12) | `α β γ δ ε θ λ μ π σ φ ω` |
| Arithmetic (5) | `+ - * / ^` |
| Relations (4) | `≤ ≥ ≠ ≈` |
| Grouping (6) | `( ) [ ] { }` |
| Calculus & analysis (4) | `∂ ∫ √ ∞` |
| Set theory (5) | `∈ ⊂ ∅ ∩ ∪` |
| Logic (9) | `∀ ∃ ¬ ∧ ∨ ⊢ ⊨ ↔ ⊥` |
| Punctuation (9) | `= : ; ? . , ! ' "` |
| LaTeX & structure (7) | `\ _ \| ~ < > %` |
| Other (9) | `# @ $ & ° × — → (space)` |
| **Special tokens** (3) | `<pad> <eos> <step>` |

The `<step>` token separates solution steps in the target sequence. All generator output is constrained to use only characters in this set -- any generator that produces a character outside it is a bug and is caught by the test suite.
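
A character-level tokenizer over this vocabulary can be sketched in a few lines. This is an illustrative sketch rather than the package's tokenizer: the ID ordering (special tokens first) and the function names are assumptions. The character set itself is copied from the table above.

```python
SPECIAL_TOKENS = ["<pad>", "<eos>", "<step>"]

# The 132 characters from the table, one string per category.
CHARSET = (
    "0123456789"
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "αβγδεθλμπσφω"
    "+-*/^"
    "≤≥≠≈"
    "()[]{}"
    "∂∫√∞"
    "∈⊂∅∩∪"
    "∀∃¬∧∨⊢⊨↔⊥"
    "=:;?.,!'\""
    "\\_|~<>%"
    "#@$&°×\u2014→ "
)

# Assumed ID layout: special tokens take IDs 0-2, characters follow.
TOKEN_TO_ID = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + list(CHARSET))}
ID_TO_TOKEN = {i: tok for tok, i in TOKEN_TO_ID.items()}


def encode(text: str) -> list[int]:
    """One token per character; special tokens are matched before '<' and '>'."""
    ids, pos = [], 0
    while pos < len(text):
        for tok in SPECIAL_TOKENS:
            if text.startswith(tok, pos):
                ids.append(TOKEN_TO_ID[tok])
                pos += len(tok)
                break
        else:
            ch = text[pos]
            if ch not in TOKEN_TO_ID:
                raise ValueError(f"character {ch!r} is outside the vocabulary")
            ids.append(TOKEN_TO_ID[ch])
            pos += 1
    return ids


def decode(ids: list[int]) -> str:
    return "".join(ID_TO_TOKEN[i] for i in ids)


print(len(TOKEN_TO_ID))              # vocabulary size
print(len(encode(r"\frac{d}{dx}")))  # one token per character
```

Because `<` and `>` are themselves in the character set, the sketch checks for special tokens first, so `<step>` becomes a single ID rather than six characters.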

## Samples

```
Input:  add two 5 digit numbers
Target: 13278 + 46048 <step> 8+8=16 <step> 7+4+1=12 <step> 2+0+1=3 <step> 3+6=9 <step> 1+4=5 <step> 59326
```

- **Input**: natural language task description
- **Target**: problem, solution steps, and answer separated by `<step>` tokens
- Both capped at **512 characters**
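
To show how a target string like the one above can be built so that the answer is correct by construction, here is a minimal sketch of a column-addition generator. The function names and the seeding scheme are hypothetical and not the package's API; only the output format follows the sample.

```python
import random


def addition_sample(a: int, b: int) -> str:
    """Render a + b as column-wise steps, least significant digit first."""
    steps = []
    carry = 0
    x, y = a, b
    while x > 0 or y > 0:
        da, db = x % 10, y % 10
        total = da + db + carry
        terms = [str(da), str(db)] + ([str(carry)] if carry else [])
        steps.append(f"{'+'.join(terms)}={total}")
        carry = total // 10
        x //= 10
        y //= 10
    # A carry left over after the last column shows up only in the final answer.
    return " <step> ".join([f"{a} + {b}", *steps, str(a + b)])


def generate(n_digits: int, seed: int) -> tuple[str, str]:
    """Return (input_text, target_text); the same seed gives the same sample."""
    rng = random.Random(seed)
    lo, hi = 10 ** (n_digits - 1), 10**n_digits - 1
    a, b = rng.randint(lo, hi), rng.randint(lo, hi)
    return f"add two {n_digits} digit numbers", addition_sample(a, b)


print(addition_sample(13278, 46048))
```

Called with the operands from the sample, `addition_sample` reproduces the target string shown above. The answer is computed from the same operands that define the problem, so no separate labelling step is involved.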

## Usage

### Generate samples

```python
from engram_generator.curriculum.registry import get_generator

gen = get_generator("addition", min_difficulty=3, max_difficulty=5)
samples = gen.generate(100)

for sample in samples[:3]:
    print(f"Input:  {sample.input_text}")
    print(f"Target: {sample.target_text}")
    print(f"Answer: {sample.answer}")
```

### Use the skill tree

```python
from engram_generator.curriculum.registry import get_all_generators
from engram_generator.curriculum.skill_tree import SkillTree

generators = get_all_generators()
tree = SkillTree(generators, retention_ratio=0.1)

# See what's unlocked
print(tree.get_unlocked_tasks())

# Level up by proving mastery
events = tree.update({"addition": 0.97, "subtraction": 0.85})
```

### Balanced training

```python
from engram_generator.curriculum.reasoning_patterns import (
    get_pattern_weights, get_pattern_summary,
)
from engram_generator.curriculum.registry import get_all_generators

gens = get_all_generators()
weights = get_pattern_weights(gens)  # sampling weights for balanced training

# Each reasoning pattern gets equal training exposure (1/26, about 3.8%)
summary = get_pattern_summary(gens)
share = 100 / len(summary)
for pattern, count in sorted(summary.items(), key=lambda x: -x[1])[:5]:
    print(f"{pattern}: {count} generators -> {share:.1f}% of training")
```

### Validate

```bash
engram-validate --all --samples 20
engram-validate --skill-tree
engram-validate --task addition --difficulty 5 --samples 100
```

## Testing

```bash
python -m pytest tests/ -v
```

**6,326 tests** across 16 test modules:

- **Sanity** (6,066): every generator is checked at low difficulty, at high difficulty, and for determinism (3 tests × 2,022 generators)
- **Correctness** (75): independent mathematical verification
- **Structural** (185): no orphans, no dangling prerequisites, no backwards cross-tier deps
- **Coverage**: 99% (77,452 statements, 1,104 missed)
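
The structural checks can be pictured with a small sketch. The definitions used here are assumptions, since the README does not spell them out: an orphan is a task above tier 0 with no prerequisites, a dangling prerequisite names a task that does not exist, and a backwards dependency points to a higher tier. The task names are hypothetical.

```python
def structural_errors(
    tiers: dict[str, int], prereqs: dict[str, list[str]]
) -> list[str]:
    """List structural problems in a skill tree (assumed definitions)."""
    errors = []
    for task, tier in tiers.items():
        reqs = prereqs.get(task, [])
        if tier > 0 and not reqs:
            errors.append(f"orphan: {task}")
        for req in reqs:
            if req not in tiers:
                errors.append(f"dangling: {task} -> {req}")
            elif tiers[req] > tier:
                errors.append(f"backwards: {task} -> {req}")
    return errors


# Hypothetical tree with one problem of each kind.
tiers = {"addition": 0, "multiplication": 1, "derivatives": 2, "fibonacci": 1}
prereqs = {
    "multiplication": ["addition", "derivatives"],  # depends on a higher tier
    "derivatives": ["limits"],                      # 'limits' is not a task
    # 'fibonacci' sits at tier 1 with no prerequisites
}
for error in structural_errors(tiers, prereqs):
    print(error)
```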

## Roadmap

Current: **v0.1.0** -- 2,022 generators, 100+ domains, 26 reasoning patterns

Planned:

- **Code generation** -- generators that output executable code (Python, pseudocode), verified by sandboxed execution
- **Tool calling** -- generators that produce structured tool-call sequences from task descriptions
- **Agentic reasoning** -- multi-step observation-action-reward chains for planning and tool use
- **5,000+ generators** -- deeper coverage of existing domains, plus medicine, law, philosophy, and linguistics
- **Multi-language output** -- same algorithms, different natural language task descriptions
- **Difficulty auto-scaling** -- dynamic difficulty adjustment based on model accuracy curves

## License

MIT

## Organisation

[www.engram.one](https://www.engram.one) · [www.deepnet.one](https://www.deepnet.one)
