Open source · MIT License · v0.6.1
Fine-tuning silently degrades what your model already knows. pyrecall detects exactly what changed and lets you roll back specific skills before bad weights reach production.
The Problem
You take a model that's good at a lot of things. You fine-tune it on your data — customer support tickets, legal documents, whatever. It gets noticeably better at your task. You ship it.
Six weeks later something feels off. The model handles your task fine but everything else is worse. Reasoning is shakier. Code quality dropped. It's less careful about safety cases it used to handle cleanly. Your loss curve looked fine the whole time.
This is catastrophic forgetting. It's well documented in research and almost completely ignored by tooling. There's no alarm. Nothing in your training loop catches it. pyrecall does.
How It Works
Before training, pyrecall runs your model through 64 benchmark prompts across 8 skill categories and saves the scores. That's your baseline.
After training, the same benchmarks run again. Any skill that dropped beyond your threshold gets flagged. The report prints to your terminal in color — green held, red degraded.
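The baseline-versus-after comparison can be sketched in a few lines. This is an illustration of the idea only, not pyrecall's internals: the function names, the 0.05 threshold, the skill names, and the scores are all assumptions made up for the example.

```python
# Hypothetical sketch of threshold-based regression flagging.
# Nothing here is pyrecall's actual code; names and numbers are illustrative.
GREEN = "\033[32m"
RED = "\033[31m"
RESET = "\033[0m"


def flag_regressions(baseline, current, threshold=0.05):
    """Return {skill: drop} for every skill whose score fell by more than threshold."""
    flagged = {}
    for skill, before in baseline.items():
        drop = before - current[skill]
        if drop > threshold:
            flagged[skill] = drop
    return flagged


def format_report(baseline, current, threshold=0.05):
    """Build one colored line per skill: green if it held, red if it degraded."""
    flagged = flag_regressions(baseline, current, threshold)
    lines = []
    for skill, before in baseline.items():
        degraded = skill in flagged
        color = RED if degraded else GREEN
        status = "degraded" if degraded else "held"
        lines.append(
            f"{color}{skill:<12} {before:.2f} -> {current[skill]:.2f}  {status}{RESET}"
        )
    return lines


# Made-up scores for illustration only.
baseline = {"reasoning": 0.80, "coding": 0.75, "safety": 0.90}
current = {"reasoning": 0.62, "coding": 0.74, "safety": 0.88}

flagged = flag_regressions(baseline, current)
for line in format_report(baseline, current):
    print(line)
```

With these made-up numbers only reasoning crosses the threshold. Coding and safety dip slightly but stay within tolerance, which is why the threshold is configurable rather than fixed at zero.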
Every training run saves a LoRA adapter checkpoint, so you can roll back to any named snapshot instantly. A rollback restores only the adapter weights that drifted, not the whole model.
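A toy version of named snapshots shows why rollback can be cheap when only adapter weights are stored. The class below is a hypothetical stand-in, not pyrecall's implementation: it uses plain lists in place of LoRA matrices and, for simplicity, restores the entire adapter rather than only the parts that changed.

```python
import copy


class AdapterSnapshots:
    """Hypothetical store of named adapter copies; the base model is never copied."""

    def __init__(self):
        self._snapshots = {}

    def save(self, name, adapter_weights):
        # Deep copy so later training cannot mutate the stored snapshot.
        self._snapshots[name] = copy.deepcopy(adapter_weights)

    def restore(self, name):
        if name not in self._snapshots:
            raise KeyError(f"no snapshot named {name!r}")
        # Return a copy so the stored snapshot stays reusable.
        return copy.deepcopy(self._snapshots[name])


# Toy adapter weights: plain lists standing in for LoRA matrices.
adapter = {"layer0.lora_A": [0.1, 0.2], "layer0.lora_B": [0.0, 0.0]}

store = AdapterSnapshots()
store.save("before_v1", adapter)

# Simulate a training run that changes the adapter in place.
adapter["layer0.lora_B"][0] = 0.7

# Roll back: only the small adapter dict is replaced.
adapter = store.restore("before_v1")
```

Because the adapter is a small fraction of the model's parameters, keeping one copy per training run costs far less than checkpointing the full weights.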
The API
from pyrecall import Model

model = Model("meta-llama/Llama-3.2-1B", strategy="lora")
model.snapshot(name="before_v1")     # lock in current skill scores
model.learn("data.jsonl", epochs=3)  # fine-tune on new data
report = model.check()               # what got worse?
model.rollback(to="before_v1")       # fix it
The CLI
What Gets Tracked
Multi-step logic and problem solving.
Code generation, debugging, explanation.
Does it do what you asked?
Breadth of factual accuracy.
Handling sensitive and edge case prompts.