1. SAE:L6:F30
trained SAE latent 30
feature evidence
layer 6
trained-sae
causal 0.000
importance 0.944
association 0.962
effect 0.962
specificity 0.818
stability 1.000
strong 0.000
Activation association: suppresses criterion (-0.962)
Activation summary: trained SAE latent 30. Representative high-activation contexts include ordinary-prefix-4: activation=6.328, criterion_score=0.000 | A good onboarding message should; ordinary-prefix-1: activation=5.984, criterion_score=0.000 | Please write a friendly.
SAE training: rows=8, latents=32, active=1.000, dead=0, val MSE=1.187
SAE training note: Training rows are fewer than latents; collect more activations or reduce latent_dim.
SAE training note: Validation reconstruction is much worse than train; broaden training prompts and keep a separate held-out eval set.
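The two training notes above can be turned into a quick sanity check to run before trusting any latent from this SAE. This is a minimal sketch and not interp-lab's actual logic: the function name `sae_training_warnings`, the `train_mse` argument, and the `max_val_train_ratio` threshold of 2.0 are assumptions made for illustration. The report itself only gives rows, latents, and val MSE.

```python
from typing import List, Optional


def sae_training_warnings(
    rows: int,
    latents: int,
    val_mse: float,
    train_mse: Optional[float] = None,
    max_val_train_ratio: float = 2.0,  # assumed threshold, not from the report
) -> List[str]:
    """Return human-readable warnings for an undersized or overfit SAE run."""
    warnings: List[str] = []
    # Fewer training rows than latents, as flagged in the report.
    if rows < latents:
        warnings.append(
            f"rows ({rows}) < latents ({latents}): collect more activations "
            "or reduce latent_dim"
        )
    # Validation error far above train error; only checked when train MSE is known.
    if train_mse is not None and train_mse > 0:
        if val_mse / train_mse > max_val_train_ratio:
            warnings.append(
                f"val MSE ({val_mse:.3f}) is more than {max_val_train_ratio}x "
                f"train MSE ({train_mse:.3f}): broaden training prompts"
            )
    return warnings


# Values from the report; train MSE is not reported, so it is left out.
for message in sae_training_warnings(rows=8, latents=32, val_mse=1.187):
    print(message)
```

With only 8 rows for 32 latents, every score in this report is best read as a lead to test, not as a finding.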
Plan a suppression test for this SAE latent
interp-lab intervene --model distilgpt2 --criterion 'the next token should be a physical measurement unit' --dataset '<causal-prompts.jsonl>' --feature SAE:L6:F30 --sae '<sae.json>' --mode suppress --target-token auto --out '<interventions.jsonl>' --plan-out '<intervention-plan.json>' --dry-run --json
Requires: scored causal prompt JSONL, matching interp-lab SAE artifact
Plan an amplification test for this SAE latent
interp-lab intervene --model distilgpt2 --criterion 'the next token should be a physical measurement unit' --dataset '<causal-prompts.jsonl>' --feature SAE:L6:F30 --sae '<sae.json>' --mode amplify --target-token auto --out '<interventions.jsonl>' --plan-out '<intervention-plan.json>' --dry-run --json
Requires: scored causal prompt JSONL, matching interp-lab SAE artifact
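The suppression and amplification commands differ only in `--feature` and `--mode`, so they can be generated instead of copied by hand. The sketch below assembles the same argument list using only the flags that appear in the commands above. The helper name `intervene_argv` is hypothetical, and the placeholder paths are kept exactly as the report shows them. Restricting `mode` to `suppress` and `amplify` is an assumption based on the two modes seen here; the CLI may accept others.

```python
import shlex
from typing import List

CRITERION = "the next token should be a physical measurement unit"


def intervene_argv(
    feature: str,
    mode: str,
    sae_path: str,
    dataset_path: str,
    out_path: str,
    plan_path: str,
) -> List[str]:
    """Build the dry-run `interp-lab intervene` argument list for one latent."""
    if mode not in ("suppress", "amplify"):  # only modes shown in the report
        raise ValueError(f"unsupported mode: {mode!r}")
    return [
        "interp-lab", "intervene",
        "--model", "distilgpt2",
        "--criterion", CRITERION,
        "--dataset", dataset_path,
        "--feature", feature,
        "--sae", sae_path,
        "--mode", mode,
        "--target-token", "auto",
        "--out", out_path,
        "--plan-out", plan_path,
        "--dry-run", "--json",
    ]


# Print both planning commands for the latent in this card.
for mode in ("suppress", "amplify"):
    argv = intervene_argv(
        "SAE:L6:F30", mode,
        "<sae.json>", "<causal-prompts.jsonl>",
        "<interventions.jsonl>", "<intervention-plan.json>",
    )
    print(shlex.join(argv))
```

Building a list and quoting it with `shlex.join` avoids shell-quoting mistakes in the criterion string; the same list can be passed directly to `subprocess.run` once the placeholders are replaced with real files.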
ordinary-prefix-4: activation=6.328, criterion_score=0.000 | A good onboarding message should
ordinary-prefix-1: activation=5.984, criterion_score=0.000 | Please write a friendly
ordinary-prefix-3: activation=4.661, criterion_score=0.000 | Summarize the feedback into
2. SAE:L6:F13
trained SAE latent 13
feature evidence
layer 6
trained-sae
causal 0.000
importance 0.903
association 0.924
effect 0.924
specificity 0.732
stability 1.000
strong 0.000
Activation association: suppresses criterion (-0.924)
Activation summary: trained SAE latent 13. Representative high-activation contexts include ordinary-prefix-4: activation=7.192, criterion_score=0.000 | A good onboarding message should; ordinary-prefix-1: activation=6.169, criterion_score=0.000 | Please write a friendly.
Plan a suppression test for this SAE latent
interp-lab intervene --model distilgpt2 --criterion 'the next token should be a physical measurement unit' --dataset '<causal-prompts.jsonl>' --feature SAE:L6:F13 --sae '<sae.json>' --mode suppress --target-token auto --out '<interventions.jsonl>' --plan-out '<intervention-plan.json>' --dry-run --json
Requires: scored causal prompt JSONL, matching interp-lab SAE artifact
Plan an amplification test for this SAE latent
interp-lab intervene --model distilgpt2 --criterion 'the next token should be a physical measurement unit' --dataset '<causal-prompts.jsonl>' --feature SAE:L6:F13 --sae '<sae.json>' --mode amplify --target-token auto --out '<interventions.jsonl>' --plan-out '<intervention-plan.json>' --dry-run --json
Requires: scored causal prompt JSONL, matching interp-lab SAE artifact
ordinary-prefix-4: activation=7.192, criterion_score=0.000 | A good onboarding message should
ordinary-prefix-1: activation=6.169, criterion_score=0.000 | Please write a friendly
ordinary-prefix-5: activation=4.528, criterion_score=0.000 | The assistant should answer
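The context lines above follow a regular `id: activation=…, criterion_score=… | text` layout, which makes them easy to load for further analysis. The parser below is a sketch written against the lines as they appear in this report. The `Context` type and `parse_context` function are hypothetical names, and the pattern is inferred from the examples, not taken from an interp-lab specification.

```python
import re
from typing import NamedTuple


class Context(NamedTuple):
    prompt_id: str
    activation: float
    criterion_score: float
    text: str


# Pattern inferred from the report's context lines.
_LINE = re.compile(
    r"^(?P<id>[^:]+): activation=(?P<act>-?\d+(?:\.\d+)?), "
    r"criterion_score=(?P<score>-?\d+(?:\.\d+)?) \| (?P<text>.*)$"
)


def parse_context(line: str) -> Context:
    """Parse one high-activation context line into its four fields."""
    match = _LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized context line: {line!r}")
    return Context(
        prompt_id=match["id"],
        activation=float(match["act"]),
        criterion_score=float(match["score"]),
        text=match["text"],
    )


# The three contexts listed for SAE:L6:F13.
lines = [
    "ordinary-prefix-4: activation=7.192, criterion_score=0.000 | A good onboarding message should",
    "ordinary-prefix-1: activation=6.169, criterion_score=0.000 | Please write a friendly",
    "ordinary-prefix-5: activation=4.528, criterion_score=0.000 | The assistant should answer",
]
for ctx in map(parse_context, lines):
    print(ctx.prompt_id, ctx.activation, ctx.criterion_score)
```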
3. SAE:L6:F10
trained SAE latent 10
feature evidence
layer 6
trained-sae
causal 0.000
importance 0.769
association 0.765
effect 0.765
specificity 0.637
stability 1.000
strong 0.000
Activation association: promotes criterion (0.765)
Activation summary: trained SAE latent 10. Representative high-activation contexts include unit-4: activation=3.139, criterion_score=1.000 | The room is 14; unit-6: activation=2.445, criterion_score=1.000 | The recipe calls for 250.
Plan a suppression test for this SAE latent
interp-lab intervene --model distilgpt2 --criterion 'the next token should be a physical measurement unit' --dataset '<causal-prompts.jsonl>' --feature SAE:L6:F10 --sae '<sae.json>' --mode suppress --target-token auto --out '<interventions.jsonl>' --plan-out '<intervention-plan.json>' --dry-run --json
Requires: scored causal prompt JSONL, matching interp-lab SAE artifact
Plan an amplification test for this SAE latent
interp-lab intervene --model distilgpt2 --criterion 'the next token should be a physical measurement unit' --dataset '<causal-prompts.jsonl>' --feature SAE:L6:F10 --sae '<sae.json>' --mode amplify --target-token auto --out '<interventions.jsonl>' --plan-out '<intervention-plan.json>' --dry-run --json
Requires: scored causal prompt JSONL, matching interp-lab SAE artifact
unit-4: activation=3.139, criterion_score=1.000 | The room is 14
unit-6: activation=2.445, criterion_score=1.000 | The recipe calls for 250
unit-3: activation=2.418, criterion_score=1.000 | The trail climbs 600
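The signed value in "promotes criterion (0.765)" reads like a correlation between latent activation and criterion score, with the sign deciding between "promotes" and "suppresses". The report does not say how interp-lab computes it, so the sketch below assumes a plain Pearson correlation. The function names are hypothetical and the example inputs are toy values, not data from this report.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def describe_association(activations: Sequence[float],
                         criterion_scores: Sequence[float]) -> str:
    """Format a signed association in the style used by the report."""
    r = pearson(activations, criterion_scores)
    verb = "promotes" if r >= 0 else "suppresses"
    return f"{verb} criterion ({r:.3f})"


# Toy values for illustration only; not taken from the report.
toy_activations = [3.0, 2.0, 2.5, 0.0, 0.1, 0.0]
toy_scores = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
print(describe_association(toy_activations, toy_scores))
```

A correlation of this kind is only associational, which is consistent with every card here showing `causal 0.000` until the planned interventions have been run.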
4. SAE:L6:F16
trained SAE latent 16
feature evidence
layer 6
trained-sae
causal 0.000
importance 0.729
association 0.731
effect 0.731
specificity 0.540
stability 1.000
strong 0.000
Activation association: promotes criterion (0.731)
Activation summary: trained SAE latent 16. Representative high-activation contexts include unit-3: activation=8.210, criterion_score=1.000 | The trail climbs 600; unit-4: activation=5.021, criterion_score=1.000 | The room is 14.
Plan a suppression test for this SAE latent
interp-lab intervene --model distilgpt2 --criterion 'the next token should be a physical measurement unit' --dataset '<causal-prompts.jsonl>' --feature SAE:L6:F16 --sae '<sae.json>' --mode suppress --target-token auto --out '<interventions.jsonl>' --plan-out '<intervention-plan.json>' --dry-run --json
Requires: scored causal prompt JSONL, matching interp-lab SAE artifact
Plan an amplification test for this SAE latent
interp-lab intervene --model distilgpt2 --criterion 'the next token should be a physical measurement unit' --dataset '<causal-prompts.jsonl>' --feature SAE:L6:F16 --sae '<sae.json>' --mode amplify --target-token auto --out '<interventions.jsonl>' --plan-out '<intervention-plan.json>' --dry-run --json
Requires: scored causal prompt JSONL, matching interp-lab SAE artifact
unit-3: activation=8.210, criterion_score=1.000 | The trail climbs 600
unit-4: activation=5.021, criterion_score=1.000 | The room is 14
unit-6: activation=4.498, criterion_score=1.000 | The recipe calls for 250
5. SAE:L6:F26
trained SAE latent 26
feature evidence
layer 6
trained-sae
causal 0.000
importance 0.705
association 0.710
effect 0.710
specificity 0.483
stability 1.000
strong 0.000
Activation association: promotes criterion (0.710)
Activation summary: trained SAE latent 26. Representative high-activation contexts include unit-2: activation=9.252, criterion_score=1.000 | The package weighs 3.2; unit-6: activation=5.267, criterion_score=1.000 | The recipe calls for 250.
Plan a suppression test for this SAE latent
interp-lab intervene --model distilgpt2 --criterion 'the next token should be a physical measurement unit' --dataset '<causal-prompts.jsonl>' --feature SAE:L6:F26 --sae '<sae.json>' --mode suppress --target-token auto --out '<interventions.jsonl>' --plan-out '<intervention-plan.json>' --dry-run --json
Requires: scored causal prompt JSONL, matching interp-lab SAE artifact
Plan an amplification test for this SAE latent
interp-lab intervene --model distilgpt2 --criterion 'the next token should be a physical measurement unit' --dataset '<causal-prompts.jsonl>' --feature SAE:L6:F26 --sae '<sae.json>' --mode amplify --target-token auto --out '<interventions.jsonl>' --plan-out '<intervention-plan.json>' --dry-run --json
Requires: scored causal prompt JSONL, matching interp-lab SAE artifact
unit-2: activation=9.252, criterion_score=1.000 | The package weighs 3.2
unit-6: activation=5.267, criterion_score=1.000 | The recipe calls for 250
unit-4: activation=1.751, criterion_score=1.000 | The room is 14
6. SAE:L6:F9
trained SAE latent 9
feature evidence
layer 6
trained-sae
causal 0.000
importance 0.692
association 0.689
effect 0.689
specificity 0.505
stability 1.000
strong 0.000
Activation association: promotes criterion (0.689)
Activation summary: trained SAE latent 9. Representative high-activation contexts include unit-6: activation=8.127, criterion_score=1.000 | The recipe calls for 250; unit-3: activation=5.627, criterion_score=1.000 | The trail climbs 600.
Plan a suppression test for this SAE latent
interp-lab intervene --model distilgpt2 --criterion 'the next token should be a physical measurement unit' --dataset '<causal-prompts.jsonl>' --feature SAE:L6:F9 --sae '<sae.json>' --mode suppress --target-token auto --out '<interventions.jsonl>' --plan-out '<intervention-plan.json>' --dry-run --json
Requires: scored causal prompt JSONL, matching interp-lab SAE artifact
Plan an amplification test for this SAE latent
interp-lab intervene --model distilgpt2 --criterion 'the next token should be a physical measurement unit' --dataset '<causal-prompts.jsonl>' --feature SAE:L6:F9 --sae '<sae.json>' --mode amplify --target-token auto --out '<interventions.jsonl>' --plan-out '<intervention-plan.json>' --dry-run --json
Requires: scored causal prompt JSONL, matching interp-lab SAE artifact
unit-6: activation=8.127, criterion_score=1.000 | The recipe calls for 250
unit-3: activation=5.627, criterion_score=1.000 | The trail climbs 600
unit-4: activation=2.678, criterion_score=1.000 | The room is 14
7. SAE:L6:F0
trained SAE latent 0
feature evidence
layer 6
trained-sae
causal 0.000
importance 0.653
association 0.651
effect 0.651
specificity 0.436
stability 1.000
strong 0.000
Activation association: suppresses criterion (-0.651)
Activation summary: trained SAE latent 0. Representative high-activation contexts include ordinary-prefix-1: activation=7.703, criterion_score=0.000 | Please write a friendly; ordinary-prefix-5: activation=2.962, criterion_score=0.000 | The assistant should answer.
Plan a suppression test for this SAE latent
interp-lab intervene --model distilgpt2 --criterion 'the next token should be a physical measurement unit' --dataset '<causal-prompts.jsonl>' --feature SAE:L6:F0 --sae '<sae.json>' --mode suppress --target-token auto --out '<interventions.jsonl>' --plan-out '<intervention-plan.json>' --dry-run --json
Requires: scored causal prompt JSONL, matching interp-lab SAE artifact
Plan an amplification test for this SAE latent
interp-lab intervene --model distilgpt2 --criterion 'the next token should be a physical measurement unit' --dataset '<causal-prompts.jsonl>' --feature SAE:L6:F0 --sae '<sae.json>' --mode amplify --target-token auto --out '<interventions.jsonl>' --plan-out '<intervention-plan.json>' --dry-run --json
Requires: scored causal prompt JSONL, matching interp-lab SAE artifact
ordinary-prefix-1: activation=7.703, criterion_score=0.000 | Please write a friendly
ordinary-prefix-5: activation=2.962, criterion_score=0.000 | The assistant should answer
ordinary-prefix-4: activation=2.761, criterion_score=0.000 | A good onboarding message should
8. SAE:L6:F11
trained SAE latent 11
feature evidence
layer 6
trained-sae
causal 0.000
importance 0.647
association 0.639
effect 0.639
specificity 0.453
stability 1.000
strong 0.000
Activation association: promotes criterion (0.639)
Activation summary: trained SAE latent 11. Representative high-activation contexts include unit-6: activation=8.391, criterion_score=1.000 | The recipe calls for 250; unit-3: activation=5.123, criterion_score=1.000 | The trail climbs 600.
Plan a suppression test for this SAE latent
interp-lab intervene --model distilgpt2 --criterion 'the next token should be a physical measurement unit' --dataset '<causal-prompts.jsonl>' --feature SAE:L6:F11 --sae '<sae.json>' --mode suppress --target-token auto --out '<interventions.jsonl>' --plan-out '<intervention-plan.json>' --dry-run --json
Requires: scored causal prompt JSONL, matching interp-lab SAE artifact
Plan an amplification test for this SAE latent
interp-lab intervene --model distilgpt2 --criterion 'the next token should be a physical measurement unit' --dataset '<causal-prompts.jsonl>' --feature SAE:L6:F11 --sae '<sae.json>' --mode amplify --target-token auto --out '<interventions.jsonl>' --plan-out '<intervention-plan.json>' --dry-run --json
Requires: scored causal prompt JSONL, matching interp-lab SAE artifact
unit-6: activation=8.391, criterion_score=1.000 | The recipe calls for 250
unit-3: activation=5.123, criterion_score=1.000 | The trail climbs 600
unit-4: activation=1.678, criterion_score=1.000 | The room is 14