continualcode

github · paper · design doc

A CLI coding agent that learns from your corrections in real time. It has tools (read, write, edit_lines, glob, grep, bash). You approve or deny each tool call. When you deny with a correction, it takes one gradient step on LoRA parameters via SDPO and retries with updated weights. No reward model, no critic, no external teacher — the model conditioned on your correction is the teacher.

You: "fix the test"
Agent: write(test.py, ...)       # overwrites the file
You: n → "use edit_lines; don't overwrite"
  → SDPO update runs immediately
  → agent retries with updated weights
Agent: edit_lines(test.py, 14, 17, ...)
You: y
screenshot — terminal interaction
deny → correct → retrain → retry, in one session

Four feedback types, one training signal

Approve — executes the tool call. No gradient step. The absence of correction is the signal.
Deny with correction — the primary learning event. Your text becomes privileged context for the self-teacher. One gradient step, then retry.
Edit — you modify the tool call's arguments directly. The diff becomes implicit correction text. Same gradient step, then execute.
Intermediary feedback — free-form context ("this project uses Poetry not pip"). Accumulates in session, strengthens the teacher signal on the next denial.

One tool call per turn. This gives clean credit assignment — each correction maps to exactly one set of generated tokens, exactly one gradient step.

The self-distillation step

When you deny with correction text, the system constructs a teacher by appending the failed attempt and your correction to the conversation. A single forward pass through the same model scores the same tokens under the richer context. The per-token advantage is the logprob gap:

advantage[t] = log π_teacher(token_t) − log π_student(token_t)

Tokens where the teacher assigns higher probability get positive advantage — produce more of those. Tokens where it assigns lower get negative — suppress those. This yields O(N) bits of learning signal from a single correction, versus O(1) from a scalar reward.

The update uses importance-sampled policy gradients (not PPO clipping — clipping causes token dropping on rare but critical tokens):

is_ratio  = π_current(t) / π_old(t)       # clamp [0.5, 2.0]
pg_loss   = −mean(is_ratio · advantage · log π(t))
kl_loss   = β · KL(π_θ ‖ π_ref)           # β ≈ 0.04
total     = pg_loss + kl_loss              # backward through LoRA only

The entire step — teacher forward pass, advantages, backward, optimizer step — adds ~2-3s latency on a single GPU with an 8B model. Imperceptible when you just typed a correction.

1Studentsamples on-policy
2Correctionyour feedback text
3Self-Teachermodel + feedback
6Retryupdated policy
5LoRA Updateimportance sampling
4Per-Token ΔKL at each position
Scalar reward (RL)
O(1) — same signal for all tokens
Per-token (SDPO)
O(N) — dense signal at each position

Why not the alternatives

DPO needs preference pairs — a chosen and rejected completion. In a CLI you see one tool call. Constructing pairs requires generating a second candidate or using off-policy edits. Also operates at sequence level, no per-token credit.
GRPO needs multiple samples per prompt (DeepSeek uses 64). Presenting 64 candidates to a developer is absurd UX. Also requires a verifiable reward, which most tool calls lack.
PPO needs a critic network, doubling memory. Token dropping from clipping prevents learning on rare but critical tokens.
SFT on corrections is off-policy and causes forgetting. Forward KL is mode-covering — it shifts probability toward new data at the expense of everything else.

Self-distillation is the unique intersection: dense signal (per-token), on-policy (student's own generations), no extra models (teacher = student + context), mode-seeking stability (reverse KL preserves prior capabilities).

Why LoRA specifically

LoRA is not just a compute optimization — it's a regularization mechanism. The low-rank constraint limits updates to ~0.1% of base model parameters, physically bounding the weight drift that causes catastrophic forgetting. The base model stays frozen and provides general coding capability. The adapter encodes project-specific patterns from your corrections. Rank 16, applied to attention projections.

per-token heatmap
token-level advantages over a tool call
training curves
loss + KL after corrections

Limitations

Tinker returns scalar logprobs, not full distributions — we're limited to reverse KL via the logprob gap (no forward KL, no JSD interpolation). No EMA teacher (Tinker doesn't expose weight-level ops) — teacher update is instant (τ=1.0 vs paper's τ=0.05), more aggressive but fine for single-step updates. Credit assignment assumes one tool call per message.

Research lineage

Context distillation (Askell et al. 2021) showed a model can internalize its own prompted behavior into weights. GKD (Agarwal et al. 2023) proved on-policy distillation eliminates the distribution mismatch that cripples SFT. SDPO (Hübotter et al. 2026) showed the model conditioned on environmental feedback can serve as its own teacher, reaching GRPO accuracy at 10x speed with 7x shorter traces. SDFT (Shenfeld et al. 2026) demonstrated that self-distillation enables continual learning without catastrophic forgetting. We fuse these into a single deployable loop for interactive coding.

Install

pip install continualcode
export TINKER_API_KEY=<your-key>
continualcode

Config via key=value args:

continualcode enable_training=false          # inference only
continualcode model_name=Qwen/Qwen3-4B-Instruct-2507 lora_rank=64
continualcode save_every=10                  # checkpoint every 10 steps

Code layout

screenshot — full session
complete session with /metrics readout

References


© 2026 github paper