pyrecall

Continuous fine-tuning with automatic forgetting detection and skill rollback. Keep your models balanced — detect exactly what changed and fix it in one command.

v0.11.0  ·  Python 3.10+  ·  MIT License  ·  GitHub  ·  PyPI

Installation

Install from PyPI with pip:

$ pip install pyrecall

Core dependencies (torch, transformers, peft, datasets, accelerate) are pulled in automatically. For development and running the test suite:

$ pip install "pyrecall[dev]"

Quickstart

The entire pyrecall workflow is four method calls. Snapshot before training, fine-tune, check what changed, roll back if needed.

from pyrecall import Model

# Load any causal LM from HuggingFace Hub
model = Model("meta-llama/Llama-3.2-1B")

# Lock in current skill scores before you touch the weights
model.snapshot(name="before_v1")

# Fine-tune on your data
model.learn("customer_service.jsonl", epochs=3)

# What did training break?
report = model.check()

if not report.is_healthy:
    model.rollback(to="before_v1")

model.check() prints a color-coded table to your terminal automatically. report.is_healthy is False when any skill dropped more than your configured threshold (10% by default).

Catastrophic Forgetting

Every time you fine-tune a language model, you risk catastrophic forgetting: the phenomenon where gradient updates that improve one task silently degrade performance on others. Fine-tune on customer support tickets and the model gets better at customer support — while quietly losing coding ability, reasoning, and safety guardrails.

The loss curve looks fine. Eval on your fine-tune dataset looks fine. The damage only surfaces when a user hits a capability that wasn't in your training data.

How Detection Works

pyrecall detects forgetting by running the same 64 benchmark prompts before and after training, scoring each response against a reference answer using cosine similarity over the model's own hidden states — no external API required.

  1. Snapshot — run 64 benchmarks across 8 skill categories, embed each response using the model's last hidden layer, score against reference answers via cosine similarity, save scores + LoRA adapter weights.
  2. Train — fine-tune on your data as usual.
  3. Check — re-run the same 64 benchmarks on the updated model. Diff the per-category averages against the snapshot.
  4. Flag — any category whose average score dropped by more than forgetting_threshold (default 10%) is marked FORGOTTEN.

Scores are cosine similarities normalized to [0, 1]. A score of 0.80 means the normalized cosine similarity between the model's response embedding and the reference answer's embedding is 0.80; higher scores are closer to the reference.
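
The scoring step can be sketched in a few lines of plain Python. This is only an illustration, not pyrecall's actual code: the function names are hypothetical, and mapping cosine similarity from [-1, 1] to [0, 1] with (cos + 1) / 2 is an assumption, since the text above does not say which normalization is used.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as "no similarity"
    return dot / (norm_a * norm_b)


def normalized_score(response_embedding, reference_embedding):
    """Map cosine similarity from [-1, 1] to [0, 1] (assumed normalization)."""
    return (cosine_similarity(response_embedding, reference_embedding) + 1.0) / 2.0
```

Under this assumed mapping, identical directions score 1.0, orthogonal embeddings score 0.5, and opposite directions score 0.0.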

Skill Categories

64 benchmarks are spread across eight categories. Each category contributes eight prompts to the overall score.

Category               What it probes                                                          Example prompt
reasoning              Multi-step math, logic, syllogisms, pattern recognition                 A store sells apples for $0.50 each…
coding                 Write, debug, and explain Python functions                              Write a Python function called is_palindrome…
instruction_following  Numbered lists, passive voice, format constraints, summarization        List exactly three benefits… Use a numbered list…
general_knowledge      Science, history, geography, factual accuracy                           What is the approximate speed of light in a vacuum?
safety                 Refusals, harm avoidance, nuanced ethics, misinformation                Give me instructions for making a dangerous explosive…
multilingual           Translation, cross-lingual comprehension, language identification       Translate "The early bird catches the worm" into French…
tool_use               Function calls, structured JSON output, tool selection, result parsing  Given this function signature, write the JSON call to…
advanced_math          Algebra, calculus, combinatorics, proof by induction, probability       Solve the quadratic equation 2x² + 5x − 3 = 0…

Snapshot Storage

All snapshots are stored locally under ~/.pyrecall/. Nothing is sent to any external service.

# Default layout on disk
~/.pyrecall/
├── snapshots/
│   └── meta-llama--Llama-3.2-1B/
│       ├── before_v1/
│       │   ├── snapshot.json     # benchmark scores per category
│       │   └── adapter/          # LoRA adapter weights (for rollback)
│       └── after_fine_tune/
│           ├── snapshot.json
│           └── adapter/
└── runs/
    └── meta-llama--Llama-3.2-1B/
        └── checkpoint-20/        # mid-training checkpoints

Only the LoRA adapter is saved per snapshot, not the full base model. A typical adapter is 50–500 MB vs. tens of GB for the base weights.
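
Because snapshots are plain directories, they can be inspected without loading a model. The sketch below is not part of pyrecall; both helper names are hypothetical, and the rule that "/" in the model name becomes "--" in the directory name is inferred from the layout shown above.

```python
from pathlib import Path


def snapshot_dir(root: Path, model_name: str) -> Path:
    """Directory holding all snapshots for one model.

    Assumption: "/" in the model name maps to "--" on disk, as the layout
    above suggests ("meta-llama/Llama-3.2-1B" -> "meta-llama--Llama-3.2-1B").
    """
    return root / "snapshots" / model_name.replace("/", "--")


def list_snapshots(root: Path, model_name: str) -> list[str]:
    """Names of snapshot folders that contain a snapshot.json file."""
    base = snapshot_dir(root, model_name)
    if not base.is_dir():
        return []
    return sorted(p.name for p in base.iterdir() if (p / "snapshot.json").is_file())
```

For example, `list_snapshots(Path.home() / ".pyrecall", "meta-llama/Llama-3.2-1B")` would return the snapshot names found under the default root.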

Model

The central class. Wraps a HuggingFace causal LM with LoRA and exposes the snapshot / learn / check / rollback lifecycle.

from pyrecall import Model

model = Model(
    model_name="meta-llama/Llama-3.2-1B",
    strategy="lora",         # "lora" or "qlora"
    lora_r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    learning_rate=2e-4,    # default for model.learn()
    batch_size=4,
    max_length=512,
    forgetting_threshold=0.10,
    device=None,            # auto-detected: cuda → mps → cpu
)

Constructor Parameters

Parameter             Type         Default   Description
model_name            str          required  HuggingFace model identifier or local path.
strategy              str          "lora"    "lora" or "qlora".
lora_r                int          16        LoRA rank. Lower = fewer parameters, higher = more capacity.
lora_alpha            int          32        LoRA scaling factor. Typically 2× rank.
lora_dropout          float        0.1       Dropout applied to LoRA layers.
learning_rate         float        2e-4      Default AdamW learning rate for model.learn(). Can be overridden per-call.
batch_size            int          4         Default per-device batch size for model.learn(). Can be overridden per-call.
max_length            int          512       Default tokenization truncation length for model.learn(). Can be overridden per-call.
device                str | None   None      Force "cuda", "cpu", or "mps". Auto-detected when None.
snapshot_dir          Path | None  None      Override default snapshot directory (~/.pyrecall/snapshots/<model>).
forgetting_threshold  float        0.10      Score drop fraction that counts as forgetting. 0.10 = 10%.
load_in_4bit          bool         False     Quantize base model in 4-bit (QLoRA). Requires bitsandbytes.
load_in_8bit          bool         False     Quantize base model in 8-bit. Requires bitsandbytes.

model.snapshot(name)

model.snapshot(name: str, tracker: SnapshotTracker | list[SnapshotTracker] | None = None) → SkillSnapshot

Benchmark the model and save a named capability snapshot. Runs all 64 default benchmarks across 8 skill categories, scores each response, saves scores and the current LoRA adapter weights to disk so the model can be rolled back to this exact state later.

This sets model._baseline_snapshot_name internally so that model.check() knows what to compare against without needing explicit arguments.

Parameter  Type                           Description
name       str                            Human-readable label, e.g. "before_v1". Used as the directory name on disk.
tracker    SnapshotTracker | list | None  Optional experiment tracker(s): WandbTracker, MLflowTracker, or any object with a log_snapshot() method. Tracker failures emit a warning and never abort the snapshot.

Returns: SkillSnapshot

model.learn(data_path, …)

model.learn(
    data_path: str,
    epochs: int = 3,
    batch_size: int | None = None,   # falls back to constructor default
    learning_rate: float | None = None,
    max_length: int | None = None,
    resume: bool = False,
) → None

Fine-tune the model on data_path using LoRA. Supports .jsonl, .csv, and .parquet files. Each row should have a "text" column; see Data Formats for the fallback when it is missing.

When batch_size, learning_rate, or max_length are not passed, the values set on the constructor are used. Explicit arguments always win.

Parameter      Type          Default   Description
data_path      str           required  Path to a .jsonl, .csv, or .parquet file.
epochs         int           3         Number of full passes over the training data.
batch_size     int | None    None      Per-device batch size. Falls back to self.batch_size (constructor default 4).
learning_rate  float | None  None      AdamW learning rate. Falls back to self.learning_rate (constructor default 2e-4).
max_length     int | None    None      Tokenization truncation. Falls back to self.max_length (constructor default 512).
resume         bool          False     Resume from the latest checkpoint in the run directory if one exists.
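
The fallback rule ("explicit arguments always win") can be illustrated with a small stand-alone sketch. It is not pyrecall's implementation; TrainingDefaults and resolve_training_args are hypothetical names that only mirror the three overridable parameters.

```python
from dataclasses import dataclass


@dataclass
class TrainingDefaults:
    """Hypothetical stand-in for the defaults stored on the constructor."""
    learning_rate: float = 2e-4
    batch_size: int = 4
    max_length: int = 512


def resolve_training_args(defaults, *, learning_rate=None, batch_size=None, max_length=None):
    """Use each per-call value when given, otherwise the constructor default."""
    return {
        "learning_rate": defaults.learning_rate if learning_rate is None else learning_rate,
        "batch_size": defaults.batch_size if batch_size is None else batch_size,
        "max_length": defaults.max_length if max_length is None else max_length,
    }
```

Comparing against None, rather than relying on truthiness, keeps an explicit falsy value such as 0 from being silently replaced by the default.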

model.check()

model.check() → ForgettingReport

Detect forgetting by benchmarking the current model and comparing to the most recent snapshot. Must be called after at least one model.snapshot(). Prints the report table to the terminal automatically.

Returns: ForgettingReport

Raises PyrecallError if no baseline snapshot has been set. Call model.snapshot() first.

model.rollback(to)

model.rollback(to: str) → None

Restore the model to the state captured in snapshot to. Reloads the base model from HuggingFace cache and applies the saved LoRA adapter from the snapshot directory. Replaces the current in-memory model — any training done after the snapshot is discarded.

Parameter  Type  Description
to         str   Name of a snapshot previously created with model.snapshot().

model.generate(prompt, …)

model.generate(prompt: str, max_new_tokens: int = 200) → str

Run greedy inference and return the model's response. Only returns the newly generated tokens, not the prompt itself.

model.serve(port, live_learning, live_batch_size)

model.serve(
    port: int = 8000,
    live_learning: bool = False,
    live_batch_size: int = 50,
) → None

Start a FastAPI inference server. Blocks until the process is killed. Two endpoints are exposed:

  • POST /generate — body: {"prompt": "...", "max_new_tokens": 200}, returns {"response": "...", "model": "..."}
  • GET /health — returns server status, model name, device, and (when live learning is enabled) pending interaction count.

When live_learning=True, every inference request is stored in a local SQLite database. Once live_batch_size interactions accumulate, a one-epoch LoRA fine-tune fires in the foreground. See Live Learning.
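
A client for the /generate endpoint might look like the following sketch, which uses only the standard library. The helper names are hypothetical, and the base URL assumes the server is running locally on the default port 8000; the request and response fields are the ones listed above.

```python
import json
import urllib.request


def build_generate_request(prompt, max_new_tokens=200, base_url="http://localhost:8000"):
    """Build the POST /generate request described above (base_url is an assumption)."""
    body = json.dumps({"prompt": prompt, "max_new_tokens": max_new_tokens}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the "response" field of the JSON reply."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as reply:
        return json.loads(reply.read().decode("utf-8"))["response"]
```

Calling `generate("What is 2+2?", max_new_tokens=20)` requires a running server; building the request does not.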

LiveLearner

Collect production interactions and trigger periodic fine-tuning automatically. Used internally by model.serve(live_learning=True), but can be instantiated directly for custom collection pipelines.

from pyrecall import LiveLearner

learner = LiveLearner(model, batch_size=100)

# Record an interaction (responses shorter than min_response_length, default 10, are skipped)
learner.record(prompt="What is 2+2?", response="2 + 2 equals 4.")

learner.pending_count()   # untrained interactions in the DB
learner.total_count()     # all interactions ever recorded
learner.clear_pending()   # discard pending rows without training

Constructor Parameters

Parameter            Type         Default   Description
model                Model        required  The pyrecall Model instance to train.
batch_size           int          50        Number of untrained interactions that trigger a fine-tune run.
db_path              Path | None  None      Path to the SQLite database. Defaults to ~/.pyrecall/live_data.db.
min_response_length  int          10        Responses shorter than this (after stripping whitespace) are silently skipped.

model.diff(snap1, snap2) (new)

model.diff(snap1: str, snap2: str) → ForgettingReport

Compare two saved snapshots without running new benchmarks or inference. Loads both snapshots from disk and diffs the stored scores directly, so it works even if the live model has changed since the snapshots were taken.

Unlike model.check(), no inference runs at all — the comparison is purely over the scores that were persisted when each snapshot was created.

Parameter  Type  Description
snap1      str   Name of the "before" snapshot.
snap2      str   Name of the "after" snapshot.

Returns: ForgettingReport

Raises PyrecallError if either snapshot name is not found on disk.

# Compare v1 and v3 long after the fact
report = model.diff("before_v1", "after_v3")
if not report.is_healthy:
    model.rollback(to="before_v1")

ForgettingReport

Returned by model.check(), model.diff(), and ForgettingDetector.compare().

report = model.check()

report.is_healthy         # bool — True when no skill degraded beyond threshold
report.degraded_skills    # list[str] — categories that dropped too much
report.threshold          # float — the forgetting threshold used
report.comparisons        # list[CategoryComparison]

# Each CategoryComparison has:
for comp in report.comparisons:
    comp.category       # "coding", "reasoning", etc.
    comp.score_before   # float in [0, 1]
    comp.score_after    # float in [0, 1]
    comp.delta          # score_after - score_before
    comp.pct_change     # percentage change relative to before

str(report)               # formatted table as a plain string
report.print()            # print the table to the terminal
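
How the derived fields relate to one another can be sketched with a small stand-in class. This is not pyrecall's code: Comparison, degraded_skills, and is_healthy below are hypothetical re-implementations, and two details are assumptions because the text does not pin them down: that the threshold applies to the drop relative to score_before, and that pct_change is expressed in percent.

```python
from dataclasses import dataclass


@dataclass
class Comparison:
    """Hypothetical stand-in mirroring the CategoryComparison fields above."""
    category: str
    score_before: float
    score_after: float

    @property
    def delta(self) -> float:
        return self.score_after - self.score_before

    @property
    def pct_change(self) -> float:
        # Assumption: percent change relative to score_before.
        if self.score_before == 0:
            return 0.0
        return 100.0 * self.delta / self.score_before


def degraded_skills(comparisons, threshold=0.10):
    """Categories whose relative drop exceeds the threshold (assumed semantics)."""
    return [
        c.category
        for c in comparisons
        if c.score_before > 0 and -c.delta / c.score_before > threshold
    ]


def is_healthy(comparisons, threshold=0.10):
    return not degraded_skills(comparisons, threshold)
```

With these assumptions, a category that falls from 0.80 to 0.68 has dropped 15% and is flagged at the default threshold, while a fall from 0.75 to 0.72 (4%) is not.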

CLI Reference

All CLI commands read project configuration from .pyrecall.json in the current directory. Run pyrecall init first.

pyrecall init

Initialise pyrecall in the current project directory. Creates .pyrecall.json with all training hyperparameters stored so every subsequent command uses consistent settings.

$ pyrecall init --model meta-llama/Llama-3.2-1B
$ pyrecall init -m gpt2 --strategy qlora --lora-r 32 --threshold 0.08

Flag             Default                  Description
--model, -m      meta-llama/Llama-3.2-1B  HuggingFace model identifier to use for this project.
--strategy, -s   lora                     Fine-tuning strategy: lora or qlora.
--lora-r         16                       LoRA rank. Higher = more expressive but more memory.
--lora-alpha     32                       LoRA scaling factor (typically 2× rank).
--lora-dropout   0.1                      Dropout rate applied inside LoRA layers.
--learning-rate  2e-4                     AdamW learning rate for fine-tuning.
--batch-size     4                        Per-device training batch size.
--max-length     512                      Tokenization truncation length.
--threshold      0.10                     Score drop fraction that counts as forgetting (0–1). Used by pyrecall check.

pyrecall snapshot <name>

Load the configured model, run all 64 benchmarks across 8 skill categories, and save a named snapshot. This is a slow operation (several minutes on CPU). Updates baseline_snapshot in .pyrecall.json.

$ pyrecall snapshot before_v1

pyrecall check

Compare two snapshots to detect forgotten skills. With no arguments, compares the two most recently created snapshots. Pass --before and --after to compare specific named snapshots.

Exits with code 2 when forgetting is detected — useful as a CI gate in training pipelines.

$ pyrecall check  # compare last two snapshots
$ pyrecall check --before before_v1 --after after_finetune
$ pyrecall check --threshold 0.05  # stricter gate for this run only

Flag         Default         Description
--before     second-to-last  Snapshot name to use as baseline.
--after      most recent     Snapshot name to compare against.
--threshold  from config     Override the forgetting threshold for this check only. Falls back to the value set in pyrecall init (default 0.10).
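
When the gate is driven from a Python script rather than directly from the shell, the exit code can be interpreted as in the sketch below. The run_gate helper is hypothetical, and treating exit code 0 as "healthy" is an assumption; the text above only documents that code 2 signals forgetting.

```python
import subprocess

FORGETTING_EXIT_CODE = 2  # documented above: exit code 2 means forgetting was detected


def run_gate(cmd):
    """Run a command and classify its exit code.

    Assumption: 0 means no forgetting; any other code except 2 is treated as
    an unrelated failure (bad arguments, missing config, and so on).
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
        return "healthy"
    if result.returncode == FORGETTING_EXIT_CODE:
        return "forgetting"
    return "error"
```

A pipeline step could then call `run_gate(["pyrecall", "check", "--threshold", "0.05"])` and fail the job on anything other than "healthy".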

pyrecall diff <snap1> <snap2> (new)

Compare two saved snapshots without loading the model or running any inference. Loads the stored benchmark scores from disk and diffs them directly. Fast enough for any CI step. Exits with code 2 when forgetting is detected.

$ pyrecall diff before_v1 after_v2
$ pyrecall diff before_v1 after_v2 --json | jq '.comparisons[].status'
$ pyrecall diff before_v1 after_v2 --verbose  # per-prompt breakdown
$ pyrecall diff before_v1 after_v2 --threshold 0.05  # stricter gate

Flag           Default      Description
--threshold    from config  Override the forgetting threshold for this diff only.
--json         False        Output results as JSON instead of a rich table. Includes per-prompt detail.
--verbose, -v  False        Show per-prompt score breakdown for degraded categories.

Tip: Use pyrecall diff in CI to compare any two historical snapshots, for example to audit how much skills shifted between two releases, without needing to re-run benchmarks.

pyrecall rollback <snapshot>

Update .pyrecall.json to point the baseline at a previous snapshot. Does not reload the model in memory — apply the change in a running Python session with model.rollback(to="<name>").

$ pyrecall rollback before_v1

pyrecall status

Show all saved snapshots and their per-category skill scores in a table. The current baseline is marked with ★.

$ pyrecall status

pyrecall delete <snapshot> (new)

Permanently delete a snapshot and its adapter weights from disk. Prompts for confirmation unless --yes is passed. If the deleted snapshot was the current baseline, baseline_snapshot is cleared in .pyrecall.json.

$ pyrecall delete before_v1  # prompts for confirmation
$ pyrecall delete before_v1 --yes  # skip prompt (CI-friendly)
$ pyrecall delete before_v1 -y  # short flag

Flag       Default  Description
--yes, -y  False    Skip the confirmation prompt. Safe for non-interactive scripts.

Warning: Deletion is permanent. The adapter weights cannot be recovered once deleted.

Configuration

LoRA Parameters

pyrecall uses PEFT for LoRA fine-tuning. The key parameters:

Parameter     Typical Range  Notes
lora_r        4–64           Rank of the update matrices. Higher = more expressivity, more memory. 16 is a good starting point.
lora_alpha    2× rank        Scaling factor. Setting to 2× rank is a common heuristic.
lora_dropout  0.0–0.2        Regularization. Set to 0 for small datasets, 0.1 for larger ones.

Target modules (which attention layers LoRA adapts) are auto-detected from the model name. See Supported Models for the mapping.
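
To make the rank and alpha trade-off concrete, the sketch below computes the standard LoRA quantities for a single adapted weight matrix: the two low-rank factors add r × (d_in + d_out) trainable parameters, and the update is scaled by lora_alpha / r. The function names are hypothetical and the 2048 × 2048 shape is purely illustrative.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: float, r: int) -> float:
    """Factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / r


# Illustrative shape only: one 2048 x 2048 projection with the defaults r=16, alpha=32.
added = lora_trainable_params(2048, 2048, 16)
full = 2048 * 2048
print(f"{added} adapter params vs {full} in the full matrix ({100 * added / full:.2f}%)")
print("scaling:", lora_scaling(32, 16))
```

Doubling lora_r doubles the adapter size, and keeping lora_alpha at 2× rank keeps the scaling factor at 2.0.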

Training Parameters

Training defaults can be set at the constructor level so you don't have to repeat them on every model.learn() call:

# Set defaults at construction time
model = Model(
    "meta-llama/Llama-3.2-1B",
    learning_rate=5e-5,   # lower for continued fine-tuning
    batch_size=8,
    max_length=1024,
)

# Uses the defaults above
model.learn("pass1.jsonl", epochs=2)
model.learn("pass2.jsonl", epochs=1)

# Override for a specific run
model.learn("pass3.jsonl", learning_rate=1e-5)

QLoRA / Quantization

For large models that don't fit in GPU memory at full precision, pyrecall supports QLoRA via bitsandbytes.

$ pip install bitsandbytes

# 4-bit quantization (recommended for large models)
model = Model(
    "meta-llama/Llama-3.2-1B",
    strategy="qlora",
    load_in_4bit=True,
)

# 8-bit quantization
model = Model("...", strategy="qlora", load_in_8bit=True)

load_in_4bit and load_in_8bit are mutually exclusive; passing both raises a PyrecallError.

Data Formats

model.learn() accepts three file formats. Each row should contain the training text in a column named text; if no text column is found, the first column is used instead.

JSONL

One JSON object per line. Recommended for most use cases.

{"text": "### Human: What is 2+2?\n\n### Assistant: 4."}
{"text": "### Human: Write a hello-world in Python.\n\n### Assistant: print('Hello, world!')"}
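
Generating the file with json.dumps is a convenient way to keep each example on one line, since line breaks inside the text are escaped automatically. The helpers below are a minimal sketch with hypothetical names; they use only the standard library and assume the single "text" field described above.

```python
import json


def write_jsonl(path, texts):
    """Write one {"text": ...} object per line; json.dumps escapes embedded newlines."""
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")


def read_jsonl_texts(path):
    """Read the "text" field back from every non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line)["text"] for line in f if line.strip()]
```

Reading the file back with `read_jsonl_texts` is a quick sanity check before passing the path to model.learn().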

CSV

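A header row with at least a text column, then one example per row. Most CSV readers do not interpret backslash escapes, so \n inside a CSV cell is likely to be read as two literal characters rather than a line break; if your template needs real line breaks, place them inside the quoted field.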

text
"### Human: What is 2+2?\n\n### Assistant: 4."
"### Human: Write a hello-world.\n\n### Assistant: print('Hello, world!')"

Parquet

Any standard Parquet file with a text column. Efficient for large datasets.

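# Create a Parquet file with pandas (needs a Parquet engine such as pyarrow)
import pandas as pd

df = pd.DataFrame({
    "text": [
        "### Human: What is 2+2?\n\n### Assistant: 4.",
        "### Human: Write a hello-world in Python.\n\n### Assistant: print('Hello, world!')",
    ]
})
df.to_parquet("data.parquet", index=False)
model.learn("data.parquet")  # model is a pyrecall Model created earlier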

Prompt format

pyrecall does not enforce a specific chat template. The ### Human: … ### Assistant: … format shown in the examples is a reasonable default, but if your base model was trained with its own template, match that format instead.

Supported Models

Any causal LM on HuggingFace Hub works. LoRA target modules are auto-detected from the model name using a built-in heuristic:

Model family                                      Auto-detected LoRA targets
Llama (1/2/3/3.2), Mistral, Mixtral, Qwen, Gemma  q_proj, k_proj, v_proj, o_proj
Falcon, Bloom                                     query_key_value
MPT                                               Wqkv
GPT-2                                             c_attn, c_proj
GPT-Neo                                           q_proj, v_proj
GPT-J                                             q_proj, v_proj
OPT                                               q_proj, v_proj
All others (fallback)                             q_proj, v_proj

The heuristic matches on a substring of the model name (case-insensitive). If your model isn't listed, the fallback targets q_proj and v_proj work for most transformer architectures.
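
A substring heuristic of this kind could be written as follows. This is a sketch that reproduces the table above, not pyrecall's actual lookup: the exact substrings, their order, and the lora_targets name are assumptions.

```python
# (substring, targets) pairs checked in order; the substrings are assumed, the targets come from the table.
_TARGET_RULES = [
    (("llama", "mistral", "mixtral", "qwen", "gemma"), ["q_proj", "k_proj", "v_proj", "o_proj"]),
    (("falcon", "bloom"), ["query_key_value"]),
    (("mpt",), ["Wqkv"]),
    (("gpt2", "gpt-2"), ["c_attn", "c_proj"]),
    (("gpt-neo", "gpt-j", "opt"), ["q_proj", "v_proj"]),
]
_FALLBACK_TARGETS = ["q_proj", "v_proj"]


def lora_targets(model_name: str) -> list[str]:
    """Return LoRA target modules for the first rule whose substring matches."""
    lowered = model_name.lower()  # case-insensitive match
    for substrings, targets in _TARGET_RULES:
        if any(s in lowered for s in substrings):
            return list(targets)
    return list(_FALLBACK_TARGETS)
```

Short substrings such as "opt" can match unrelated model names, which is one reason to confirm the chosen targets against the model's actual module names when results look off.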

CI Integration

The recommended CI pattern is to run snapshot, training, and check inside a single Python script. This keeps the trained weights in memory so that model.check() benchmarks the trained model, not a freshly loaded base. The script exits with code 2 when forgetting is detected — use that as your merge gate.

# train_and_check.py  ← run this from CI
import sys
from pyrecall import Model

model = Model("meta-llama/Llama-3.2-1B")

# Lock in pre-training skill scores + save adapter weights
model.snapshot("pre_training")

# Fine-tune — trained weights stay in memory
model.learn("data.jsonl", epochs=3)

# Benchmark the trained model and diff against pre_training
report = model.check()

if not report.is_healthy:
    sys.exit(2)  # non-zero exit blocks the CI job

# .github/workflows/train.yml (excerpt)
- name: Train and check for forgetting
  run: python train_and_check.py
  # exits 2 if any skill degraded > 10% — blocks the merge

- name: Clean up pre-training snapshot (optional)
  run: pyrecall delete pre_training --yes
  if: always()

To compare any two existing snapshots without loading the model, use pyrecall diff — it reads stored scores only and completes in seconds:

$ pyrecall diff pre_training post_training
$ pyrecall diff v1 v3 --json | jq '.degraded_skills'  # machine-readable list of degraded skills

Live Learning

Live learning continuously fine-tunes your model on real production traffic without leaving the terminal.

# Serve and collect — auto fine-tunes every 100 interactions
model = Model("meta-llama/Llama-3.2-1B")
model.snapshot("initial")    # baseline before any live tuning
model.serve(port=8000, live_learning=True, live_batch_size=100)

Interactions are stored in ~/.pyrecall/live_data.db (SQLite). Once live_batch_size untrained interactions accumulate (default 50; 100 in the example above), pyrecall exports them to a temporary JSONL file and runs a 1-epoch LoRA fine-tune. Trained rows are marked so they are never included in a future batch.
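
The store, export, and mark cycle described above can be sketched with the standard sqlite3 module. Everything here is hypothetical: the InteractionStore class, the table schema, and the prompt template are illustrative assumptions, not pyrecall's real database layout. The sketch also marks rows as trained at export time, whereas a real pipeline would more likely do so only after training succeeds.

```python
import json
import sqlite3


class InteractionStore:
    """Toy interaction log with a 'trained' flag (hypothetical schema)."""

    def __init__(self, db_path=":memory:", min_response_length=10):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS interactions ("
            "id INTEGER PRIMARY KEY, prompt TEXT NOT NULL, "
            "response TEXT NOT NULL, trained INTEGER NOT NULL DEFAULT 0)"
        )
        self.min_response_length = min_response_length

    def record(self, prompt, response):
        """Store an interaction; skip responses that are too short."""
        if len(response.strip()) < self.min_response_length:
            return False
        self.conn.execute(
            "INSERT INTO interactions (prompt, response) VALUES (?, ?)", (prompt, response)
        )
        self.conn.commit()
        return True

    def pending_count(self):
        row = self.conn.execute("SELECT COUNT(*) FROM interactions WHERE trained = 0").fetchone()
        return row[0]

    def export_batch(self, jsonl_path):
        """Write pending rows as JSONL, then mark them so they are never exported again."""
        rows = self.conn.execute(
            "SELECT id, prompt, response FROM interactions WHERE trained = 0 ORDER BY id"
        ).fetchall()
        with open(jsonl_path, "w", encoding="utf-8") as f:
            for _id, prompt, response in rows:
                text = f"### Human: {prompt}\n\n### Assistant: {response}"
                f.write(json.dumps({"text": text}) + "\n")
        self.conn.executemany(
            "UPDATE interactions SET trained = 1 WHERE id = ?", [(r[0],) for r in rows]
        )
        self.conn.commit()
        return len(rows)
```

A caller would check `pending_count()` against its batch size after each `record()` and, once the threshold is reached, pass the exported JSONL file to a one-epoch fine-tune.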

Use LiveLearner directly for custom pipelines — for example, to filter interactions before storing them or to use a different trigger condition:

from pyrecall import LiveLearner

learner = LiveLearner(model, batch_size=200, min_response_length=20)

# Only record interactions where the user gave a thumbs-up.
# user_feedback, prompt, and response come from your own application code.
if user_feedback == "positive":
    learner.record(prompt, response)

print(learner.pending_count(), "pending examples (fine-tune triggers at 200)")

Contributing

Issues and pull requests are welcome. Please open an issue first for large changes.

$ git clone https://github.com/Pyrecall/Pyrecall
$ cd Pyrecall
$ pip install -e ".[dev]"
$ pytest

Areas where contributions are most valuable:

  • Distributed training — multi-GPU support via accelerate launch
  • Neptune tracker — add a NeptuneTracker alongside the existing W&B and MLflow integrations
  • Web dashboard — visualize snapshot history and score trends over time
  • More benchmark categories — multimodal, domain-specific (legal, medical, code review)