1. SAE:L6:F10
trained SAE latent 10
causal records
layer 6
trained-sae
strong causal 0.224
importance=0.437, association=0.765, effect=0.226, specificity=0.224, stability=1.000, strong=0.224
Causal direction: promotes criterion (0.226)
Evidence: causal intervention records
Activation summary: trained SAE latent 10. Representative high-activation contexts include unit-4: activation=3.139, criterion_score=1.000 | The room is 14; unit-6: activation=2.445, criterion_score=1.000 | The recipe calls for 250.
Causal readout: suppressing this feature lowered the behavior score from 0.890 to 0.664 (directed effect 0.226), so the feature promotes the criterion; strong causal score 0.224.
SAE training: rows=8, latents=32, active=1.000, dead=0, val MSE=1.187
SAE training note: Training rows are fewer than latents; collect more activations or reduce latent_dim.
SAE training note: Validation reconstruction is much worse than train; broaden training prompts and keep a separate held-out eval set.
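The first training note above follows from comparing the row count with the dictionary size. A minimal sketch of that check is shown here; the function name `check_sae_training_size` is hypothetical and this is not interp-lab's own implementation, only an illustration using the reported rows=8 and latents=32.

```python
def check_sae_training_size(rows: int, latents: int) -> list[str]:
    """Return warnings when the training set is too small for the dictionary size.

    Hypothetical helper: mirrors the wording of the report's training note,
    but is not taken from interp-lab.
    """
    warnings = []
    if rows < latents:
        warnings.append(
            f"Training rows ({rows}) are fewer than latents ({latents}); "
            "collect more activations or reduce latent_dim."
        )
    return warnings


# Values reported in the SAE training line: rows=8, latents=32.
for message in check_sae_training_size(rows=8, latents=32):
    print(message)
```

With fewer rows than latents the SAE can memorize individual prompts, which is consistent with the second note about validation reconstruction being much worse than train.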
Interventions: n=1, mean directed effect=0.226, mean side effect=0.001, 95% CI=[0.226, 0.226]
Behavior score: target_token_probability_mass baseline mean=0.890, range=[0.890, 0.890], target tokens=16 (auto), sample=` feet`, ` ft`, ` meters`, ` metres`
Behavior note: Baseline score is already high; use a narrower target-token set, harder positive prompts, or a more specific behavior scorer.
unit-1: suppress baseline=0.890, intervention=0.664, directed_effect=0.226
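The directed effect in the record above is consistent with subtracting the post-intervention score from the baseline. A minimal sketch under that assumption follows; the function name `directed_effect` and the sign convention for amplify mode are assumptions, since the report only shows suppress records.

```python
def directed_effect(baseline: float, intervention: float, mode: str = "suppress") -> float:
    """Signed effect that is positive when the feature promotes the behavior.

    Assumed convention (not confirmed by the report):
      suppress: baseline - intervention (score drops when a promoting feature is removed)
      amplify:  intervention - baseline (score rises when a promoting feature is boosted)
    """
    if mode == "suppress":
        return baseline - intervention
    if mode == "amplify":
        return intervention - baseline
    raise ValueError(f"unknown mode: {mode!r}")


# unit-1 record: suppress baseline=0.890, intervention=0.664
print(round(directed_effect(0.890, 0.664, mode="suppress"), 3))  # 0.226
```

A negative value under the same convention would mean that suppressing the latent raised the behavior score slightly.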
Plan a suppression test for this SAE latent
interp-lab intervene --model distilgpt2 --criterion 'the next token should be a physical measurement unit' --dataset '<causal-prompts.jsonl>' --feature SAE:L6:F10 --sae '<sae.json>' --mode suppress --target-token auto --out '<interventions.jsonl>' --plan-out '<intervention-plan.json>' --dry-run --json
Requires: scored causal prompt JSONL, matching interp-lab SAE artifact
Plan an amplification test for this SAE latent
interp-lab intervene --model distilgpt2 --criterion 'the next token should be a physical measurement unit' --dataset '<causal-prompts.jsonl>' --feature SAE:L6:F10 --sae '<sae.json>' --mode amplify --target-token auto --out '<interventions.jsonl>' --plan-out '<intervention-plan.json>' --dry-run --json
Requires: scored causal prompt JSONL, matching interp-lab SAE artifact
unit-4: activation=3.139, criterion_score=1.000 | The room is 14
unit-6: activation=2.445, criterion_score=1.000 | The recipe calls for 250
unit-3: activation=2.418, criterion_score=1.000 | The trail climbs 600
2. SAE:L6:F30
trained SAE latent 30
causal records
layer 6
trained-sae
causal 0.016
importance=0.351, association=0.962, effect=0.016, specificity=0.016, stability=1.000, strong=0.016
Causal direction: near zero (-0.016)
Evidence: causal intervention records
Activation summary: trained SAE latent 30. Representative high-activation contexts include ordinary-prefix-4: activation=6.328, criterion_score=0.000 | A good onboarding message should; ordinary-prefix-1: activation=5.984, criterion_score=0.000 | Please write a friendly.
Causal readout: tested interventions produced a small or uncertain effect (strong causal score 0.016).
SAE training: rows=8, latents=32, active=1.000, dead=0, val MSE=1.187
SAE training note: Training rows are fewer than latents; collect more activations or reduce latent_dim.
SAE training note: Validation reconstruction is much worse than train; broaden training prompts and keep a separate held-out eval set.
Interventions: n=1, mean directed effect=-0.016, mean side effect=0.000, 95% CI=[-0.016, -0.016]
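The 95% CI above has identical endpoints because only one intervention was recorded. The sketch below shows why, assuming a normal-approximation interval (mean plus or minus 1.96 standard errors); how interp-lab actually computes its interval is not stated in the report, and `mean_with_ci` is a hypothetical helper.

```python
import math
import statistics


def mean_with_ci(effects: list[float], z: float = 1.96) -> tuple[float, tuple[float, float]]:
    """Mean directed effect with a normal-approximation confidence interval.

    Assumption: with fewer than two samples the spread cannot be estimated,
    so the interval collapses to the point estimate.
    """
    mean = statistics.fmean(effects)
    if len(effects) < 2:
        return mean, (mean, mean)
    standard_error = statistics.stdev(effects) / math.sqrt(len(effects))
    return mean, (mean - z * standard_error, mean + z * standard_error)


# Single record from the report: directed_effect=-0.016
mean, interval = mean_with_ci([-0.016])
print(mean, interval)  # -0.016 (-0.016, -0.016)
```

An interval of zero width here reflects n=1, not certainty; more intervention prompts are needed before the sign of an effect this small can be trusted.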
Behavior score: target_token_probability_mass baseline mean=0.890, range=[0.890, 0.890], target tokens=16 (auto), sample=` feet`, ` ft`, ` meters`, ` metres`
Behavior note: Baseline score is already high; use a narrower target-token set, harder positive prompts, or a more specific behavior scorer.
unit-1: suppress baseline=0.890, intervention=0.906, directed_effect=-0.016
Plan a suppression test for this SAE latent
interp-lab intervene --model distilgpt2 --criterion 'the next token should be a physical measurement unit' --dataset '<causal-prompts.jsonl>' --feature SAE:L6:F30 --sae '<sae.json>' --mode suppress --target-token auto --out '<interventions.jsonl>' --plan-out '<intervention-plan.json>' --dry-run --json
Requires: scored causal prompt JSONL, matching interp-lab SAE artifact
Plan an amplification test for this SAE latent
interp-lab intervene --model distilgpt2 --criterion 'the next token should be a physical measurement unit' --dataset '<causal-prompts.jsonl>' --feature SAE:L6:F30 --sae '<sae.json>' --mode amplify --target-token auto --out '<interventions.jsonl>' --plan-out '<intervention-plan.json>' --dry-run --json
Requires: scored causal prompt JSONL, matching interp-lab SAE artifact
ordinary-prefix-4: activation=6.328, criterion_score=0.000 | A good onboarding message should
ordinary-prefix-1: activation=5.984, criterion_score=0.000 | Please write a friendly
ordinary-prefix-3: activation=4.661, criterion_score=0.000 | Summarize the feedback into
3. SAE:L6:F13
trained SAE latent 13
causal records
layer 6
trained-sae
small causal 0.023
importance=0.346, association=0.924, effect=0.023, specificity=0.023, stability=1.000, strong=0.023
Causal direction: small directional effect (-0.023)
Evidence: causal intervention records
Activation summary: trained SAE latent 13. Representative high-activation contexts include ordinary-prefix-4: activation=7.192, criterion_score=0.000 | A good onboarding message should; ordinary-prefix-1: activation=6.169, criterion_score=0.000 | Please write a friendly.
Causal readout: tested interventions produced a small or uncertain effect (strong causal score 0.023).
SAE training: rows=8, latents=32, active=1.000, dead=0, val MSE=1.187
SAE training note: Training rows are fewer than latents; collect more activations or reduce latent_dim.
SAE training note: Validation reconstruction is much worse than train; broaden training prompts and keep a separate held-out eval set.
Interventions: n=1, mean directed effect=-0.023, mean side effect=0.000, 95% CI=[-0.023, -0.023]
Behavior score: target_token_probability_mass baseline mean=0.890, range=[0.890, 0.890], target tokens=16 (auto), sample=` feet`, ` ft`, ` meters`, ` metres`
Behavior note: Baseline score is already high; use a narrower target-token set, harder positive prompts, or a more specific behavior scorer.
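The behavior note above suggests narrowing the target-token set to lower the baseline. The sketch below illustrates the idea, assuming that `target_token_probability_mass` means the summed next-token probability over the target tokens; that reading, the helper functions, and the toy logits are all assumptions for illustration and do not come from distilgpt2 or interp-lab.

```python
import math


def softmax(logits: dict[str, float]) -> dict[str, float]:
    """Convert per-token logits to probabilities (numerically stable)."""
    peak = max(logits.values())
    exps = {token: math.exp(value - peak) for token, value in logits.items()}
    total = sum(exps.values())
    return {token: e / total for token, e in exps.items()}


def target_token_probability_mass(logits: dict[str, float], target_tokens: set[str]) -> float:
    """Sum the next-token probability assigned to any target token."""
    probs = softmax(logits)
    return sum(p for token, p in probs.items() if token in target_tokens)


# Hypothetical next-token logits over a toy vocabulary (not model output).
toy_logits = {" feet": 3.0, " ft": 1.5, " meters": 2.5, " metres": 1.0, " people": 0.5}

broad = target_token_probability_mass(toy_logits, {" feet", " ft", " meters", " metres"})
narrow = target_token_probability_mass(toy_logits, {" feet"})
print(f"broad={broad:.3f} narrow={narrow:.3f}")
```

A narrower target set leaves more headroom below the baseline, so a suppression or amplification effect has room to show up in the score.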
unit-1: suppress baseline=0.890, intervention=0.913, directed_effect=-0.023
Plan a suppression test for this SAE latent
interp-lab intervene --model distilgpt2 --criterion 'the next token should be a physical measurement unit' --dataset '<causal-prompts.jsonl>' --feature SAE:L6:F13 --sae '<sae.json>' --mode suppress --target-token auto --out '<interventions.jsonl>' --plan-out '<intervention-plan.json>' --dry-run --json
Requires: scored causal prompt JSONL, matching interp-lab SAE artifact
Plan an amplification test for this SAE latent
interp-lab intervene --model distilgpt2 --criterion 'the next token should be a physical measurement unit' --dataset '<causal-prompts.jsonl>' --feature SAE:L6:F13 --sae '<sae.json>' --mode amplify --target-token auto --out '<interventions.jsonl>' --plan-out '<intervention-plan.json>' --dry-run --json
Requires: scored causal prompt JSONL, matching interp-lab SAE artifact
ordinary-prefix-4: activation=7.192, criterion_score=0.000 | A good onboarding message should
ordinary-prefix-1: activation=6.169, criterion_score=0.000 | Please write a friendly
ordinary-prefix-5: activation=4.528, criterion_score=0.000 | The assistant should answer