Documentation
pyrecall
Continuous fine-tuning with automatic forgetting detection and skill rollback. Keep your models balanced — detect exactly what changed and fix it in one command.
Installation
Install from PyPI with pip:
Core dependencies (torch, transformers, peft,
datasets, accelerate) are pulled in automatically.
For development and running the test suite:
Quickstart
The entire pyrecall workflow is four method calls. Snapshot before training, fine-tune, check what changed, roll back if needed.
```python
from pyrecall import Model

# Load any causal LM from HuggingFace Hub
model = Model("meta-llama/Llama-3.2-1B")

# Lock in current skill scores before you touch the weights
model.snapshot(name="before_v1")

# Fine-tune on your data
model.learn("customer_service.jsonl", epochs=3)

# What did training break?
report = model.check()
if not report.is_healthy:
    model.rollback(to="before_v1")
```
model.check() prints a color-coded table to your terminal automatically.
report.is_healthy is False when any skill dropped more than
your configured threshold (10% by default).
Catastrophic Forgetting
Every time you fine-tune a language model, you risk catastrophic forgetting: the phenomenon where gradient updates that improve one task silently degrade performance on others. Fine-tune on customer support tickets and the model gets better at customer support — while quietly losing coding ability, reasoning, and safety guardrails.
The loss curve looks fine. Eval on your fine-tune dataset looks fine. The damage only surfaces when a user hits a capability that wasn't in your training data.
How Detection Works
pyrecall detects forgetting by running the same 64 benchmark prompts before and after training, scoring each response against a reference answer using cosine similarity over the model's own hidden states — no external API required.
- Snapshot: run 64 benchmarks across 8 skill categories, embed each response using the model's last hidden layer, score against reference answers via cosine similarity, save scores + LoRA adapter weights.
- Train: fine-tune on your data as usual.
- Check: re-run the same 64 benchmarks on the updated model. Diff the per-category averages against the snapshot.
- Flag: any category whose average score dropped by more than forgetting_threshold (default 10%) is marked FORGOTTEN.
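Scores are cosine similarities normalized to [0, 1]. A score of 0.80 means the normalized similarity between the model's response embedding and the reference-answer embedding is 0.80; a score of 1.0 means the two embeddings point in the same direction.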
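The scoring step can be illustrated with a small sketch. It is not pyrecall's actual code: plain Python lists stand in for hidden-state embeddings, and the linear rescaling from [-1, 1] to [0, 1] is an assumption, since the text above only says that scores are normalized.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors (assumption)
    return dot / (norm_a * norm_b)


def normalized_score(response_emb, reference_emb):
    """Map cosine similarity from [-1, 1] onto [0, 1] (assumed rescaling)."""
    return (cosine_similarity(response_emb, reference_emb) + 1.0) / 2.0


# Toy 2-dimensional "embeddings" for illustration only
identical = normalized_score([3.0, 4.0], [3.0, 4.0])
orthogonal = normalized_score([1.0, 0.0], [0.0, 1.0])
opposite = normalized_score([1.0, 0.0], [-1.0, 0.0])
```

Under this assumed rescaling, identical directions score highest, opposite directions lowest, and unrelated (orthogonal) embeddings land in the middle of the range.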
Skill Categories
64 benchmarks are spread across eight categories. Each category contributes eight prompts to the overall score.
| Category | What it probes | Example prompt |
|---|---|---|
| reasoning | Multi-step math, logic, syllogisms, pattern recognition | A store sells apples for $0.50 each… |
| coding | Write, debug, and explain Python functions | Write a Python function called is_palindrome… |
| instruction_following | Numbered lists, passive voice, format constraints, summarization | List exactly three benefits… Use a numbered list… |
| general_knowledge | Science, history, geography, factual accuracy | What is the approximate speed of light in a vacuum? |
| safety | Refusals, harm avoidance, nuanced ethics, misinformation | Give me instructions for making a dangerous explosive… |
| multilingual | Translation, cross-lingual comprehension, language identification | Translate "The early bird catches the worm" into French… |
| tool_use | Function calls, structured JSON output, tool selection, result parsing | Given this function signature, write the JSON call to… |
| advanced_math | Algebra, calculus, combinatorics, proof by induction, probability | Solve the quadratic equation 2x² + 5x − 3 = 0… |
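The per-category averaging and the flagging rule from How Detection Works can be sketched as follows. The function names are hypothetical, and the sketch assumes the threshold applies to the drop relative to the "before" score; the documentation does not state whether the drop is relative or absolute.

```python
from statistics import mean


def category_averages(prompt_scores):
    """Average the per-prompt scores of each category.

    prompt_scores maps a category name to a list of scores in [0, 1].
    """
    return {category: mean(scores) for category, scores in prompt_scores.items()}


def flag_forgotten(before, after, threshold=0.10):
    """Return categories whose average dropped by more than `threshold`.

    Assumption: the drop is measured relative to the score before training.
    """
    flagged = []
    for category, score_before in before.items():
        score_after = after[category]
        if score_before > 0 and (score_before - score_after) / score_before > threshold:
            flagged.append(category)
    return flagged


# Made-up scores: eight prompts per category, as described above
before = category_averages({"coding": [0.80] * 8, "safety": [0.90] * 8})
after = category_averages({"coding": [0.60] * 8, "safety": [0.88] * 8})
forgotten = flag_forgotten(before, after)
```

Here only coding is flagged: it fell by a quarter of its original score, while safety moved by roughly two percent.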
Snapshot Storage
All snapshots are stored locally under ~/.pyrecall/. Nothing is sent to any external service.
```
# Default layout on disk
~/.pyrecall/
├── snapshots/
│   └── meta-llama--Llama-3.2-1B/
│       ├── before_v1/
│       │   ├── snapshot.json   # benchmark scores per category
│       │   └── adapter/        # LoRA adapter weights (for rollback)
│       └── after_fine_tune/
│           ├── snapshot.json
│           └── adapter/
└── runs/
    └── meta-llama--Llama-3.2-1B/
        └── checkpoint-20/      # mid-training checkpoints
```
Only the LoRA adapter is saved per snapshot, not the full base model. A typical adapter is 50–500 MB vs. tens of GB for the base weights.
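Because snapshots are ordinary directories, they can be inspected without pyrecall. The sketch below lists snapshot names for a model, assuming the directory layout shown above; the helper names are hypothetical, and the "/" to "--" mapping is inferred from the example directory name rather than documented.

```python
from pathlib import Path


def model_dir_name(model_name):
    """Turn a Hub identifier into a directory name (inferred convention)."""
    return model_name.replace("/", "--")


def list_snapshots(model_name, root=None):
    """Return the sorted names of snapshot directories that contain snapshot.json."""
    root = Path(root) if root is not None else Path.home() / ".pyrecall"
    base = root / "snapshots" / model_dir_name(model_name)
    if not base.is_dir():
        return []
    return sorted(
        entry.name
        for entry in base.iterdir()
        if (entry / "snapshot.json").is_file()
    )
```

Calling list_snapshots("meta-llama/Llama-3.2-1B") would then return the names that model.rollback() accepts, provided the default snapshot directory has not been overridden.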
Model
The central class. Wraps a HuggingFace causal LM with LoRA and exposes the snapshot / learn / check / rollback lifecycle.
```python
from pyrecall import Model

model = Model(
    model_name="meta-llama/Llama-3.2-1B",
    strategy="lora",            # "lora" or "qlora"
    lora_r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    learning_rate=2e-4,         # default for model.learn()
    batch_size=4,
    max_length=512,
    forgetting_threshold=0.10,
    device=None,                # auto-detected: cuda → mps → cpu
)
```
Constructor Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| model_name | str | — | HuggingFace model identifier or local path. |
| strategy | str | "lora" | "lora" or "qlora". |
| lora_r | int | 16 | LoRA rank. Lower = fewer parameters, higher = more capacity. |
| lora_alpha | int | 32 | LoRA scaling factor. Typically 2× rank. |
| lora_dropout | float | 0.1 | Dropout applied to LoRA layers. |
| learning_rate | float | 2e-4 | Default AdamW learning rate for model.learn(). Can be overridden per-call. |
| batch_size | int | 4 | Default per-device batch size for model.learn(). Can be overridden per-call. |
| max_length | int | 512 | Default tokenization truncation length for model.learn(). Can be overridden per-call. |
| device | str \| None | None | Force "cuda", "cpu", or "mps". Auto-detected when None. |
| snapshot_dir | Path \| None | None | Override default snapshot directory (~/.pyrecall/snapshots/<model>). |
| forgetting_threshold | float | 0.10 | Score drop fraction that counts as forgetting. 0.10 = 10%. |
| load_in_4bit | bool | False | Quantize base model in 4-bit (QLoRA). Requires bitsandbytes. |
| load_in_8bit | bool | False | Quantize base model in 8-bit. Requires bitsandbytes. |
model.snapshot(name)
Benchmark the model and save a named capability snapshot. Runs all 64 default benchmarks across 8 skill categories, scores each response, saves scores and the current LoRA adapter weights to disk so the model can be rolled back to this exact state later.
This sets model._baseline_snapshot_name internally so that
model.check() knows what to compare against without needing explicit arguments.
| Parameter | Type | Description |
|---|---|---|
| name | str | Human-readable label, e.g. "before_v1". Used as the directory name on disk. |
| tracker | SnapshotTracker \| list \| None | Optional experiment tracker(s): WandbTracker, MLflowTracker, or any object with a log_snapshot() method. Tracker failures emit a warning and never abort the snapshot. |
Returns: SkillSnapshot
model.learn(data_path, …)
Fine-tune the model on data_path using LoRA. Supports .jsonl,
.csv, and .parquet files. Each row is expected to have a
"text" column; if none is found, the first column is used instead (see Data Formats).
When batch_size, learning_rate, or max_length
are not passed, the values set on the constructor are used. Explicit arguments always win.
| Parameter | Type | Default | Description |
|---|---|---|---|
| data_path | str | — | Path to a .jsonl, .csv, or .parquet file. |
| epochs | int | 3 | Number of full passes over the training data. |
| batch_size | int \| None | None | Per-device batch size. Falls back to self.batch_size (constructor default 4). |
| learning_rate | float \| None | None | AdamW learning rate. Falls back to self.learning_rate (constructor default 2e-4). |
| max_length | int \| None | None | Tokenization truncation. Falls back to self.max_length (constructor default 512). |
| resume | bool | False | Resume from the latest checkpoint in the run directory if one exists. |
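The fallback rule for batch_size, learning_rate, and max_length amounts to "use the explicit argument unless it is None". A minimal sketch of that rule, using a hypothetical TrainingDefaults class rather than pyrecall's own Model:

```python
class TrainingDefaults:
    """Hypothetical stand-in for the defaults stored on a Model instance."""

    def __init__(self, learning_rate=2e-4, batch_size=4, max_length=512):
        self.learning_rate = learning_rate
        self.batch_size = batch_size
        self.max_length = max_length

    def resolve(self, learning_rate=None, batch_size=None, max_length=None):
        """Return the effective settings; explicit arguments win over defaults."""
        return {
            "learning_rate": self.learning_rate if learning_rate is None else learning_rate,
            "batch_size": self.batch_size if batch_size is None else batch_size,
            "max_length": self.max_length if max_length is None else max_length,
        }


defaults = TrainingDefaults(learning_rate=5e-5, batch_size=8, max_length=1024)
effective = defaults.resolve(learning_rate=1e-5)  # only the learning rate is overridden
```

Comparing against None rather than relying on truthiness means an explicit value is never mistaken for "not passed".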
model.check()
Detect forgetting by benchmarking the current model and comparing to the most
recent snapshot. Must be called after at least one model.snapshot().
Prints the report table to the terminal automatically.
Returns: ForgettingReport
Raises: PyrecallError if no baseline snapshot has been set.
Call model.snapshot() first.
model.rollback(to)
Restore the model to the state captured in snapshot to. Reloads
the base model from HuggingFace cache and applies the saved LoRA adapter from
the snapshot directory. Replaces the current in-memory model — any training
done after the snapshot is discarded.
| Parameter | Type | Description |
|---|---|---|
| to | str | Name of a snapshot previously created with model.snapshot(). |
model.generate(prompt, …)
Run greedy inference and return the model's response. Only returns the newly generated tokens, not the prompt itself.
model.serve(port, live_learning, live_batch_size)
Start a FastAPI inference server. Blocks until the process is killed. Two endpoints are exposed:
- POST /generate: body {"prompt": "...", "max_new_tokens": 200}, returns {"response": "...", "model": "..."}
- GET /health: returns server status, model name, device, and (when live learning is enabled) pending interaction count.
When live_learning=True, every inference request is stored in a local
SQLite database. Once live_batch_size interactions accumulate, a
one-epoch LoRA fine-tune fires in the foreground. See
Live Learning.
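A client for the POST /generate endpoint described above can be sketched with the standard library alone. The base URL (localhost on port 8000) and the helper names are assumptions; the request and response fields are the ones listed for the endpoint.

```python
import json
import urllib.request


def build_generate_request(prompt, max_new_tokens=200, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for the /generate endpoint."""
    body = json.dumps({"prompt": prompt, "max_new_tokens": max_new_tokens}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the "response" field of the JSON reply."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as reply:
        return json.loads(reply.read().decode("utf-8"))["response"]
```

With a server started by model.serve(port=8000), generate("What is 2+2?") would return the generated text as a string.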
LiveLearner
Collect production interactions and trigger periodic fine-tuning automatically.
Used internally by model.serve(live_learning=True), but can be
instantiated directly for custom collection pipelines.
```python
from pyrecall import LiveLearner

learner = LiveLearner(model, batch_size=100)

# Record an interaction (skips responses shorter than min_response_length)
learner.record(prompt="What is 2+2?", response="4")

learner.pending_count()   # untrained interactions in the DB
learner.total_count()     # all interactions ever recorded
learner.clear_pending()   # discard pending rows without training
```
Constructor Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | Model | — | The pyrecall Model instance to train. |
| batch_size | int | 50 | Number of untrained interactions that trigger a fine-tune run. |
| db_path | Path \| None | None | Path to the SQLite database. Defaults to ~/.pyrecall/live_data.db. |
| min_response_length | int | 10 | Responses shorter than this (after stripping whitespace) are silently skipped. |
model.diff(snap1, snap2)
Compare two saved snapshots without running new benchmarks or inference. Loads both snapshots from disk and diffs the stored scores directly, so it works even if the live model has changed since the snapshots were taken.
Unlike model.check(), no inference runs at all — the comparison is
purely over the scores that were persisted when each snapshot was created.
| Parameter | Type | Description |
|---|---|---|
| snap1 | str | Name of the "before" snapshot. |
| snap2 | str | Name of the "after" snapshot. |
Returns: ForgettingReport
Raises: PyrecallError if either snapshot name is not found on disk.

```python
# Compare v1 and v3 long after the fact
report = model.diff("before_v1", "after_v3")
if not report.is_healthy:
    model.rollback(to="before_v1")
```
ForgettingReport
Returned by model.check(), model.diff(), and ForgettingDetector.compare().
```python
report = model.check()

report.is_healthy       # bool: True when no skill degraded beyond threshold
report.degraded_skills  # list[str]: categories that dropped too much
report.threshold        # float: the forgetting threshold used
report.comparisons      # list[CategoryComparison]

# Each CategoryComparison has:
for comp in report.comparisons:
    comp.category       # "coding", "reasoning", etc.
    comp.score_before   # float in [0, 1]
    comp.score_after    # float in [0, 1]
    comp.delta          # score_after - score_before
    comp.pct_change     # percentage change relative to before

str(report)             # formatted table as a plain string
report.print()          # print the table to the terminal
```
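How delta and pct_change relate to the two scores can be made concrete with a small stand-in class. ComparisonSketch is hypothetical, not pyrecall's CategoryComparison, and expressing pct_change in percent (rather than as a fraction) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ComparisonSketch:
    """Illustrative stand-in for one per-category comparison."""

    category: str
    score_before: float
    score_after: float

    @property
    def delta(self):
        """Absolute change: negative when the skill got worse."""
        return self.score_after - self.score_before

    @property
    def pct_change(self):
        """Change relative to the score before training, in percent (assumed unit)."""
        if self.score_before == 0:
            return 0.0  # avoid division by zero; convention chosen for this sketch
        return 100.0 * self.delta / self.score_before


comp = ComparisonSketch("coding", score_before=0.80, score_after=0.60)
```

For this made-up example the absolute change is a drop of 0.20, which is a quarter of the original score.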
CLI Reference
All CLI commands read project configuration from .pyrecall.json in the
current directory. Run pyrecall init first.
pyrecall init
Initialise pyrecall in the current project directory. Creates .pyrecall.json with all training hyperparameters stored so every subsequent command uses consistent settings.
| Flag | Default | Description |
|---|---|---|
| --model, -m | meta-llama/Llama-3.2-1B | HuggingFace model identifier to use for this project. |
| --strategy, -s | lora | Fine-tuning strategy: lora or qlora. |
| --lora-r | 16 | LoRA rank. Higher = more expressive but more memory. |
| --lora-alpha | 32 | LoRA scaling factor (typically 2× rank). |
| --lora-dropout | 0.1 | Dropout rate applied inside LoRA layers. |
| --learning-rate | 2e-4 | AdamW learning rate for fine-tuning. |
| --batch-size | 4 | Per-device training batch size. |
| --max-length | 512 | Tokenisation truncation length. |
| --threshold | 0.10 | Score drop fraction that counts as forgetting (0–1). Used by pyrecall check. |
pyrecall snapshot <name>
Load the configured model, run all 64 benchmarks across 8 skill categories, and save
a named snapshot. This is a slow operation (several minutes on CPU). Updates
baseline_snapshot in .pyrecall.json.
pyrecall check
Compare two snapshots to detect forgotten skills. With no arguments, compares the
two most recently created snapshots. Pass --before and --after
to compare specific named snapshots.
Exits with code 2 when forgetting is detected — useful as a CI gate in training pipelines.
| Flag | Default | Description |
|---|---|---|
| --before | second-to-last | Snapshot name to use as baseline. |
| --after | most recent | Snapshot name to compare against. |
| --threshold | from config | Override the forgetting threshold for this check only. Falls back to the value set in pyrecall init (default 0.10). |
pyrecall diff <snap1> <snap2>
Compare two saved snapshots without loading the model or running any inference. Loads the stored benchmark scores from disk and diffs them directly. Fast enough for any CI step. Exits with code 2 when forgetting is detected.
| Flag | Default | Description |
|---|---|---|
| --threshold | from config | Override the forgetting threshold for this diff only. |
| --json | False | Output results as JSON instead of a rich table. Includes per-prompt detail. |
| --verbose, -v | False | Show per-prompt score breakdown for degraded categories. |
Use pyrecall diff in CI to compare any two historical
snapshots, for example to audit how much skills shifted between two releases without
needing to re-run benchmarks.
pyrecall rollback <snapshot>
Update .pyrecall.json to point the baseline at a previous snapshot.
Does not reload the model in memory — apply the change in a running Python session
with model.rollback(to="<name>").
pyrecall status
Show all saved snapshots and their per-category skill scores in a table. The current baseline is marked with ★.
pyrecall delete <snapshot>
Permanently delete a snapshot and its adapter weights from disk. Prompts for
confirmation unless --yes is passed. If the deleted snapshot was
the current baseline, baseline_snapshot is cleared in
.pyrecall.json.
| Flag | Default | Description |
|---|---|---|
| --yes, -y | False | Skip the confirmation prompt. Safe for non-interactive scripts. |
Configuration
LoRA Parameters
pyrecall uses PEFT for LoRA fine-tuning. The key parameters:
| Parameter | Typical Range | Notes |
|---|---|---|
| lora_r | 4–64 | Rank of the update matrices. Higher = more expressivity, more memory. 16 is a good starting point. |
| lora_alpha | 2× rank | Scaling factor. Setting to 2× rank is a common heuristic. |
| lora_dropout | 0.0–0.2 | Regularization. Set to 0 for small datasets, 0.1 for larger ones. |
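The memory side of the rank trade-off follows from the shape of the LoRA update: each adapted weight matrix of shape d_out × d_in gains two small matrices, A (r × d_in) and B (d_out × r), and the update is scaled by lora_alpha / r in the standard LoRA formulation. The sketch below only does this arithmetic; the 2048 × 2048 projection size is an illustrative assumption.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters added to one adapted matrix: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


def lora_scale(lora_alpha, r):
    """Scaling factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / r


# One illustrative 2048 x 2048 projection at the default rank of 16
params_r16 = lora_param_count(2048, 2048, 16)
params_r64 = lora_param_count(2048, 2048, 64)
scale = lora_scale(32, 16)
```

Parameter count grows linearly with rank, so going from rank 16 to rank 64 quadruples the adapter size per adapted matrix, while keeping lora_alpha at 2× rank holds the scale at 2.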
Target modules (which attention layers LoRA adapts) are auto-detected from the model name. See Supported Models for the mapping.
Training Parameters
Training defaults can be set at the constructor level so you don't have to repeat
them on every model.learn() call:
```python
# Set defaults at construction time
model = Model(
    "meta-llama/Llama-3.2-1B",
    learning_rate=5e-5,   # lower for continued fine-tuning
    batch_size=8,
    max_length=1024,
)

# Uses the defaults above
model.learn("pass1.jsonl", epochs=2)
model.learn("pass2.jsonl", epochs=1)

# Override for a specific run
model.learn("pass3.jsonl", learning_rate=1e-5)
```
QLoRA / Quantization
For large models that don't fit in GPU memory at full precision, pyrecall supports QLoRA via bitsandbytes.
```python
# 4-bit quantization (recommended for large models)
model = Model(
    "meta-llama/Llama-3.2-1B",
    strategy="qlora",
    load_in_4bit=True,
)

# 8-bit quantization
model = Model("...", strategy="qlora", load_in_8bit=True)
```
load_in_4bit and load_in_8bit are mutually exclusive — passing
both raises a PyrecallError.
Data Formats
model.learn() accepts three file formats. Each row should contain the
training text in a column named text. If no text column
is found, the first column is used instead.
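The column-selection rule can be sketched for the JSONL case with the standard library. This is an illustration of the rule as described, not pyrecall's loader; read_jsonl_texts is a hypothetical helper, and "first column" is interpreted here as the first key of each JSON object.

```python
import json


def read_jsonl_texts(path):
    """Return the training text of every row in a JSONL file.

    Uses the "text" field when present, otherwise the first field of the row.
    """
    texts = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            row = json.loads(line)
            key = "text" if "text" in row else next(iter(row))
            texts.append(row[key])
    return texts
```

Naming the column text explicitly avoids depending on field order, which is easy to change by accident when a dataset is regenerated.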
JSONL
One JSON object per line. Recommended for most use cases.
```json
{"text": "### Human: What is 2+2?\n\n### Assistant: 4."}
{"text": "### Human: Write a hello-world in Python.\n\n### Assistant: print('Hello, world!')"}
```
CSV
A header row with at least a text column, then one example per row.
```
text
"### Human: What is 2+2?\n\n### Assistant: 4."
"### Human: Write a hello-world.\n\n### Assistant: print('Hello, world!')"
```
Parquet
Any standard Parquet file with a text column. Efficient for large datasets.
```python
# Create a parquet file with pandas
import pandas as pd

df = pd.DataFrame({
    "text": ["### Human: ...\n\n### Assistant: ...", ...]
})
df.to_parquet("data.parquet", index=False)

model.learn("data.parquet")
```
Prompt format
pyrecall does not enforce a specific chat template but the
### Human: … ### Assistant: … format shown in the examples works well
for instruction-tuned models. Match the format your base model expects.
Supported Models
Any causal LM on HuggingFace Hub works. LoRA target modules are auto-detected from the model name using a built-in heuristic:
| Model family | Auto-detected LoRA targets |
|---|---|
| Llama (1/2/3/3.2), Mistral, Mixtral, Qwen, Gemma | q_proj, k_proj, v_proj, o_proj |
| Falcon, Bloom | query_key_value |
| MPT | Wqkv |
| GPT-2 | c_attn, c_proj |
| GPT-Neo | q_proj, v_proj |
| GPT-J | q_proj, v_proj |
| OPT | q_proj, v_proj |
| All others (fallback) | q_proj, v_proj |
The heuristic matches on a substring of the model name (case-insensitive).
If your model isn't listed, the fallback targets q_proj and
v_proj work for most transformer architectures.
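A substring heuristic of this kind can be sketched as an ordered rule list with a fallback. The target modules come from the table above; the exact substrings (for example "gpt2" for GPT-2) and the function name are assumptions, not pyrecall's implementation.

```python
# (substrings, targets) pairs, checked in order. Substrings are illustrative guesses.
_TARGET_RULES = [
    (("llama", "mistral", "mixtral", "qwen", "gemma"),
     ["q_proj", "k_proj", "v_proj", "o_proj"]),
    (("falcon", "bloom"), ["query_key_value"]),
    (("mpt",), ["Wqkv"]),
    (("gpt2",), ["c_attn", "c_proj"]),
]

# GPT-Neo, GPT-J, OPT, and unknown models all map to the same targets in the table
_FALLBACK_TARGETS = ["q_proj", "v_proj"]


def detect_lora_targets(model_name):
    """Pick LoRA target modules from a case-insensitive substring match."""
    name = model_name.lower()
    for substrings, targets in _TARGET_RULES:
        if any(substring in name for substring in substrings):
            return list(targets)
    return list(_FALLBACK_TARGETS)
```

A name-based match is cheap but can misfire on fine-tunes whose repository name does not mention the base architecture, in which case the fallback targets apply.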
CI Integration
The recommended CI pattern is to run snapshot, training, and check inside a
single Python script. This keeps the trained weights in memory so that
model.check() benchmarks the trained model, not a
freshly loaded base. The script exits with code 2 when
forgetting is detected — use that as your merge gate.
```python
# train_and_check.py ← run this from CI
import sys
from pyrecall import Model

model = Model("meta-llama/Llama-3.2-1B")

# Lock in pre-training skill scores + save adapter weights
model.snapshot("pre_training")

# Fine-tune: trained weights stay in memory
model.learn("data.jsonl", epochs=3)

# Benchmark the trained model and diff against pre_training
report = model.check()
if not report.is_healthy:
    sys.exit(2)  # non-zero exit blocks the CI job
```

```yaml
# .github/workflows/train.yml (excerpt)
- name: Train and check for forgetting
  run: python train_and_check.py
  # exits 2 if any skill degraded > 10%, which blocks the merge
- name: Clean up pre-training snapshot (optional)
  run: pyrecall delete pre_training --yes
  if: always()
```
To compare any two existing snapshots without loading the model, use
pyrecall diff <snap1> <snap2>. It reads stored scores only and completes in seconds.
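From a Python-driven CI step, the pyrecall diff command can be wrapped so that the documented exit code 2 is turned into a boolean. This is a sketch: the helper names are hypothetical, the snapshot name post_training in the usage note is made up, and treating any other non-zero exit code as a tool failure is an assumption.

```python
import subprocess

FORGETTING_EXIT_CODE = 2  # exit code documented for `pyrecall diff`


def build_diff_command(before, after, threshold=None):
    """Build the argument list for `pyrecall diff <snap1> <snap2>`."""
    command = ["pyrecall", "diff", before, after]
    if threshold is not None:
        command += ["--threshold", str(threshold)]
    return command


def forgetting_detected(before, after, threshold=None):
    """Run the diff and report whether forgetting was flagged."""
    result = subprocess.run(build_diff_command(before, after, threshold))
    if result.returncode == FORGETTING_EXIT_CODE:
        return True
    if result.returncode != 0:
        # Assumption: other non-zero codes mean the command itself failed
        raise RuntimeError(f"pyrecall diff failed with exit code {result.returncode}")
    return False
```

A CI script could then call forgetting_detected("pre_training", "post_training") and exit non-zero when it returns True.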
Live Learning
Live learning continuously fine-tunes your model on real production traffic without leaving the terminal.
```python
# Serve and collect: auto fine-tunes every 100 interactions
model = Model("meta-llama/Llama-3.2-1B")
model.snapshot("initial")  # baseline before any live tuning
model.serve(port=8000, live_learning=True, live_batch_size=100)
```
Interactions are stored in ~/.pyrecall/live_data.db (SQLite). Once
batch_size untrained interactions accumulate (default 50 for LiveLearner; set through
live_batch_size when using model.serve(), 100 in the example above),
pyrecall exports them to a temporary JSONL file and runs a 1-epoch LoRA
fine-tune. Trained rows are marked so they are never included in a future batch.
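The store-then-trigger behaviour described above can be sketched with sqlite3. PendingQueue, its table schema, and take_batch are all invented for illustration; pyrecall's actual database layout is not documented here.

```python
import sqlite3


class PendingQueue:
    """Hypothetical interaction store with a 'trained' marker per row."""

    def __init__(self, db_path=":memory:", batch_size=50, min_response_length=10):
        self.batch_size = batch_size
        self.min_response_length = min_response_length
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS interactions ("
            "id INTEGER PRIMARY KEY, prompt TEXT, response TEXT, "
            "trained INTEGER NOT NULL DEFAULT 0)"
        )

    def record(self, prompt, response):
        """Store one interaction; return True when a full batch is pending."""
        if len(response.strip()) < self.min_response_length:
            return False  # too short: skipped without being stored
        self.conn.execute(
            "INSERT INTO interactions (prompt, response) VALUES (?, ?)",
            (prompt, response),
        )
        self.conn.commit()
        return self.pending_count() >= self.batch_size

    def pending_count(self):
        """Number of stored interactions not yet used for training."""
        query = "SELECT COUNT(*) FROM interactions WHERE trained = 0"
        return self.conn.execute(query).fetchone()[0]

    def take_batch(self):
        """Return pending (prompt, response) pairs and mark them as trained."""
        rows = self.conn.execute(
            "SELECT id, prompt, response FROM interactions WHERE trained = 0 ORDER BY id"
        ).fetchall()
        self.conn.executemany(
            "UPDATE interactions SET trained = 1 WHERE id = ?",
            [(row[0],) for row in rows],
        )
        self.conn.commit()
        return [(prompt, response) for _, prompt, response in rows]
```

Marking rows instead of deleting them keeps a full history of recorded interactions while guaranteeing that each one is used in at most one batch.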
Use LiveLearner directly for custom pipelines — for example, to
filter interactions before storing them or to use a different trigger condition:
```python
from pyrecall import LiveLearner

learner = LiveLearner(model, batch_size=200, min_response_length=20)

# Only record interactions where the user gave a thumbs-up
if user_feedback == "positive":
    learner.record(prompt, response)

print(learner.pending_count(), "examples until next fine-tune")
```
Contributing
Issues and pull requests are welcome. Please open an issue first for large changes.
Areas where contributions are most valuable:
- Distributed training: multi-GPU support via accelerate launch
- Neptune tracker: add a NeptuneTracker alongside the existing W&B and MLflow integrations
- Web dashboard: visualize snapshot history and score trends over time
- More benchmark categories: multimodal, domain-specific (legal, medical, code review)