github · paper · design doc
A CLI coding agent that learns from your corrections in real time. It has tools (read, write, edit_lines, glob, grep, bash). You approve or deny each tool call. When you deny with a correction, it takes one gradient step on LoRA parameters via SDPO and retries with updated weights. No reward model, no critic, no external teacher — the model conditioned on your correction is the teacher.
You: "fix the test" Agent: write(test.py, ...) # overwrites the file You: n → "use edit_lines; don't overwrite" → SDPO update runs immediately → agent retries with updated weights Agent: edit_lines(test.py, 14, 17, ...) You: y
Approve — executes the tool call. No gradient step. The absence of correction is the signal.
Deny with correction — the primary learning event. Your text becomes privileged context for the self-teacher. One gradient step, then retry.
Edit — you modify the tool call's arguments directly. The diff becomes implicit correction text. Same gradient step, then execute.
Intermediary feedback — free-form context ("this project uses Poetry not pip"). Accumulates in session, strengthens the teacher signal on the next denial.
One tool call per turn. This gives clean credit assignment — each correction maps to exactly one set of generated tokens, exactly one gradient step.
When you deny with correction text, the system constructs a teacher by appending the failed attempt and your correction to the conversation. A single forward pass through the same model scores the same tokens under the richer context. The per-token advantage is the logprob gap:
advantage[t] = log π_teacher(token_t) − log π_student(token_t)
Tokens where the teacher assigns higher probability get positive advantage — produce more of those. Tokens where it assigns lower get negative — suppress those. This yields O(N) bits of learning signal from a single correction, versus O(1) from a scalar reward.
The update uses importance-sampled policy gradients (not PPO clipping — clipping causes token dropping on rare but critical tokens):
is_ratio = π_current(t) / π_old(t) # clamp [0.5, 2.0] pg_loss = −mean(is_ratio · advantage · log π(t)) kl_loss = β · KL(π_θ ‖ π_ref) # β ≈ 0.04 total = pg_loss + kl_loss # backward through LoRA only
The entire step — teacher forward pass, advantages, backward, optimizer step — adds ~2-3s latency on a single GPU with an 8B model. Imperceptible when you just typed a correction.
DPO needs preference pairs — a chosen and rejected completion. In a CLI you see one tool call. Constructing pairs requires generating a second candidate or using off-policy edits. Also operates at sequence level, no per-token credit.
GRPO needs multiple samples per prompt (DeepSeek uses 64). Presenting 64 candidates to a developer is absurd UX. Also requires a verifiable reward, which most tool calls lack.
PPO needs a critic network, doubling memory. Token dropping from clipping prevents learning on rare but critical tokens.
SFT on corrections is off-policy and causes forgetting. Forward KL is mode-covering — it shifts probability toward new data at the expense of everything else.
Self-distillation is the unique intersection: dense signal (per-token), on-policy (student's own generations), no extra models (teacher = student + context), mode-seeking stability (reverse KL preserves prior capabilities).
LoRA is not just a compute optimization — it's a regularization mechanism. The low-rank constraint limits updates to ~0.1% of base model parameters, physically bounding the weight drift that causes catastrophic forgetting. The base model stays frozen and provides general coding capability. The adapter encodes project-specific patterns from your corrections. Rank 16, applied to attention projections.
Tinker returns scalar logprobs, not full distributions — we're limited to reverse KL via the logprob gap (no forward KL, no JSD interpolation). No EMA teacher (Tinker doesn't expose weight-level ops) — teacher update is instant (τ=1.0 vs paper's τ=0.05), more aggressive but fine for single-step updates. Credit assignment assumes one tool call per message.
Context distillation (Askell et al. 2021) showed a model can internalize its own prompted behavior into weights. GKD (Agarwal et al. 2023) proved on-policy distillation eliminates the distribution mismatch that cripples SFT. SDPO (Hübotter et al. 2026) showed the model conditioned on environmental feedback can serve as its own teacher, reaching GRPO accuracy at 10x speed with 7x shorter traces. SDFT (Shenfeld et al. 2026) demonstrated that self-distillation enables continual learning without catastrophic forgetting. We fuse these into a single deployable loop for interactive coding.
pip install continualcode export TINKER_API_KEY=<your-key> continualcode
Config via key=value args:
continualcode enable_training=false # inference only continualcode model_name=Qwen/Qwen3-4B-Instruct-2507 lora_rank=64 continualcode save_every=10 # checkpoint every 10 steps
train.py — SDPO core: teacher prompt construction, logprob scoring, IS-weighted update, sampler refreshtui.py — interactive CLI: approve/deny/edit flow, correction prompt, /metricstools.py — tool implementations + structured feedbackbenchmarks/auto_train.py — automated LCB training loop with multi-rollout GRPO + SDPOdemo/ — tiny project for deny → train → retry end-to-end