Feature Report

Criterion: the next token should be a physical measurement unit

Model: distilgpt2 (2026-05-20T19:53:39+00:00)
Features: 8
Strong Causal: 0
With Interventions: 0
Layers: 1

Run Context

Report scope: ranked 32 kept features from 32 candidate features.

Evidence summary: 8 activation rows; 32 candidate features; criterion score mean=0.500, range=[0.000, 1.000]; positive rows=4, non-positive rows=4.
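As a rough illustration of how the summary figures above fit together, the sketch below recomputes them from a list of criterion scores. The score list is hypothetical (four rows scored 1.0 and four scored 0.0, which is consistent with the reported mean and counts but is not taken from the underlying records), and treating "positive" as "score greater than zero" is an assumption, since the report does not define it.

```python
from statistics import mean

# Hypothetical criterion scores for 8 activation rows:
# 4 rows scored 1.0 and 4 rows scored 0.0 (an assumption, not the real records).
scores = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]


def summarize(scores):
    """Summarize criterion scores in the style of the evidence summary line."""
    # Assumption: a row counts as "positive" when its score is greater than zero.
    positive = sum(1 for s in scores if s > 0)
    return {
        "rows": len(scores),
        "mean": mean(scores),
        "min": min(scores),
        "max": max(scores),
        "positive_rows": positive,
        "non_positive_rows": len(scores) - positive,
    }


summary = summarize(scores)
print(
    f"{summary['rows']} activation rows; "
    f"criterion score mean={summary['mean']:.3f}, "
    f"range=[{summary['min']:.3f}, {summary['max']:.3f}]; "
    f"positive rows={summary['positive_rows']}, "
    f"non-positive rows={summary['non_positive_rows']}"
)
```

With only eight rows split evenly between the two classes, any statistic computed against these scores rests on very few data points.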

Metric notes: Association is activation/criterion correlation. Effect is mean causal change from interventions. Specificity subtracts measured side effects. Strong causal score is the specificity-adjusted causal signal.
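The association metric is described above as an activation/criterion correlation. The report does not say which correlation coefficient is used, so the following is a minimal sketch assuming a plain Pearson correlation; the function name `pearson` and the activation values are hypothetical and chosen only to show a latent that fires mostly on non-unit prompts.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant series; this sketch returns 0.0.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: criterion score per row and one latent's activation per row.
criterion_scores = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
activations = [0.2, 0.0, 0.4, 0.1, 5.0, 6.0, 4.0, 5.5]

association = pearson(activations, criterion_scores)
print(f"association={association:.3f}")  # negative: the latent fires when the criterion is not met
```

A correlation of this kind describes co-occurrence only; it says nothing about whether changing the latent would change the model's output.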

Mechanism Sketch

Causal candidates: none. No feature crossed the current strong-effect threshold.

No intervention records were attached, so all causal claims are untested.

To find candidates, broaden the prompts, test more layers, or use graph attribution.

Agent Next Actions

Plan causal tests for the top report features
interp-lab intervene --model distilgpt2 --criterion 'the next token should be a physical measurement unit' --dataset '<causal-prompts.jsonl>' --report '<report.json>' --top-k 8 --mode suppress --target-token auto --out '<interventions.jsonl>' --plan-out '<intervention-plan.json>' --dry-run --json
Requires: inspection report JSON, scored causal prompt JSONL

Rebuild the report with intervention evidence
interp-lab inspect --model distilgpt2 --criterion 'the next token should be a physical measurement unit' --backend records --records '<activation-records.jsonl>' --interventions '<interventions.jsonl>' --out '<causal-report-dir>' --html-out '<causal-report-dir>/report.html'
Requires: activation records, intervention records

Build a graph from the causal report
interp-lab export-attribution-graph --report '<causal-report-dir>/report.json' --out '<graph.json>' --markdown-out '<graph.md>' --html-out '<graph.html>'
Requires: causal report JSON

Ranked Features

Rank  Feature                             Evidence          Importance  Association  Effect  Specificity  Strong Causal
1     SAE:L6:F30 (trained SAE latent 30)  feature evidence  0.944       0.962        0.962   0.818        0.000
2     SAE:L6:F13 (trained SAE latent 13)  feature evidence  0.903       0.924        0.924   0.732        0.000
3     SAE:L6:F10 (trained SAE latent 10)  feature evidence  0.769       0.765        0.765   0.637        0.000
4     SAE:L6:F16 (trained SAE latent 16)  feature evidence  0.729       0.731        0.731   0.540        0.000
5     SAE:L6:F26 (trained SAE latent 26)  feature evidence  0.705       0.710        0.710   0.483        0.000
6     SAE:L6:F9 (trained SAE latent 9)    feature evidence  0.692       0.689        0.689   0.505        0.000
7     SAE:L6:F0 (trained SAE latent 0)    feature evidence  0.653       0.651        0.651   0.436        0.000
8     SAE:L6:F11 (trained SAE latent 11)  feature evidence  0.647       0.639        0.639   0.453        0.000
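The Association column in the ranked table appears to hold magnitudes, while the Feature Details entries give a signed value with a direction label (for example, 0.962 in the table and "suppresses criterion (-0.962)" in the details). The sketch below shows one way the two could be related; the helper name `describe_association` and the handling of an exactly zero value are assumptions, not something the report specifies.

```python
def describe_association(signed_corr):
    """Map a signed correlation to a direction label and an unsigned magnitude."""
    if signed_corr > 0:
        direction = "promotes criterion"
    elif signed_corr < 0:
        direction = "suppresses criterion"
    else:
        # Assumption: the report does not show a zero case; this label is hypothetical.
        direction = "no association"
    return direction, abs(signed_corr)


# Signed values as they appear in the Feature Details entries.
for feature, signed in [("SAE:L6:F30", -0.962), ("SAE:L6:F10", 0.765)]:
    direction, magnitude = describe_association(signed)
    print(f"{feature}: {direction} ({signed:.3f}), table magnitude {magnitude:.3f}")
```

Ranking by magnitude means that features anti-correlated with the criterion can outrank features that correlate with it, as the top two rows here show.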

Feature Details

1. SAE:L6:F30 trained SAE latent 30

feature evidence layer 6 trained-sae causal 0.000
importance 0.944
association 0.962
effect 0.962
specificity 0.818
stability 1.000
strong 0.000

Activation association: suppresses criterion (-0.962)

Activation summary: trained SAE latent 30. Representative high-activation contexts include ordinary-prefix-4: activation=6.328, criterion_score=0.000 | A good onboarding message should; ordinary-prefix-1: activation=5.984, criterion_score=0.000 | Please write a friendly.

SAE training: rows=8, latents=32, active=1.000, dead=0, val MSE=1.187

SAE training note: Training rows are fewer than latents; collect more activations or reduce latent_dim.

SAE training note: Validation reconstruction is much worse than train; broaden training prompts and keep a separate held-out eval set.
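The two training notes above can be read as simple sanity checks on the SAE fit. The sketch below is a hypothetical re-implementation of such checks, not the tool's own logic: the function name, the `gap_ratio` threshold, and the train MSE value are all assumptions (the report states only rows=8, latents=32, and val MSE=1.187).

```python
def sae_training_warnings(rows, latents, train_mse, val_mse, gap_ratio=2.0):
    """Return warning strings for an under-determined or poorly generalizing SAE.

    gap_ratio is a hypothetical threshold: validation error more than
    gap_ratio times the train error is treated as "much worse".
    """
    warnings = []
    if rows < latents:
        warnings.append(
            "Training rows are fewer than latents; "
            "collect more activations or reduce latent_dim."
        )
    if val_mse > gap_ratio * train_mse:
        warnings.append(
            "Validation reconstruction is much worse than train; "
            "broaden training prompts and keep a separate held-out eval set."
        )
    return warnings


# rows, latents, and val_mse come from the report; train_mse=0.1 is a made-up value.
notes = sae_training_warnings(rows=8, latents=32, train_mse=0.1, val_mse=1.187)
for note in notes:
    print("SAE training note:", note)
```

With 8 rows and 32 latents the dictionary has more free directions than training examples, so individual latents may reflect single prompts rather than a reusable feature.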

Plan a suppression test for this SAE latent
interp-lab intervene --model distilgpt2 --criterion 'the next token should be a physical measurement unit' --dataset '<causal-prompts.jsonl>' --feature SAE:L6:F30 --sae '<sae.json>' --mode suppress --target-token auto --out '<interventions.jsonl>' --plan-out '<intervention-plan.json>' --dry-run --json
Requires: scored causal prompt JSONL, matching interp-lab SAE artifact

Plan an amplification test for this SAE latent
interp-lab intervene --model distilgpt2 --criterion 'the next token should be a physical measurement unit' --dataset '<causal-prompts.jsonl>' --feature SAE:L6:F30 --sae '<sae.json>' --mode amplify --target-token auto --out '<interventions.jsonl>' --plan-out '<intervention-plan.json>' --dry-run --json
Requires: scored causal prompt JSONL, matching interp-lab SAE artifact

Representative high-activation contexts:
ordinary-prefix-4: activation=6.328, criterion_score=0.000 | A good onboarding message should
ordinary-prefix-1: activation=5.984, criterion_score=0.000 | Please write a friendly
ordinary-prefix-3: activation=4.661, criterion_score=0.000 | Summarize the feedback into

2. SAE:L6:F13 trained SAE latent 13

feature evidence layer 6 trained-sae causal 0.000
importance 0.903
association 0.924
effect 0.924
specificity 0.732
stability 1.000
strong 0.000

Activation association: suppresses criterion (-0.924)

Activation summary: trained SAE latent 13. Representative high-activation contexts include ordinary-prefix-4: activation=7.192, criterion_score=0.000 | A good onboarding message should; ordinary-prefix-1: activation=6.169, criterion_score=0.000 | Please write a friendly.

SAE training: rows=8, latents=32, active=1.000, dead=0, val MSE=1.187

SAE training note: Training rows are fewer than latents; collect more activations or reduce latent_dim.

SAE training note: Validation reconstruction is much worse than train; broaden training prompts and keep a separate held-out eval set.

Plan a suppression test for this SAE latent
interp-lab intervene --model distilgpt2 --criterion 'the next token should be a physical measurement unit' --dataset '<causal-prompts.jsonl>' --feature SAE:L6:F13 --sae '<sae.json>' --mode suppress --target-token auto --out '<interventions.jsonl>' --plan-out '<intervention-plan.json>' --dry-run --json
Requires: scored causal prompt JSONL, matching interp-lab SAE artifact

Plan an amplification test for this SAE latent
interp-lab intervene --model distilgpt2 --criterion 'the next token should be a physical measurement unit' --dataset '<causal-prompts.jsonl>' --feature SAE:L6:F13 --sae '<sae.json>' --mode amplify --target-token auto --out '<interventions.jsonl>' --plan-out '<intervention-plan.json>' --dry-run --json
Requires: scored causal prompt JSONL, matching interp-lab SAE artifact

Representative high-activation contexts:
ordinary-prefix-4: activation=7.192, criterion_score=0.000 | A good onboarding message should
ordinary-prefix-1: activation=6.169, criterion_score=0.000 | Please write a friendly
ordinary-prefix-5: activation=4.528, criterion_score=0.000 | The assistant should answer

3. SAE:L6:F10 trained SAE latent 10

feature evidence layer 6 trained-sae causal 0.000
importance 0.769
association 0.765
effect 0.765
specificity 0.637
stability 1.000
strong 0.000

Activation association: promotes criterion (0.765)

Activation summary: trained SAE latent 10. Representative high-activation contexts include unit-4: activation=3.139, criterion_score=1.000 | The room is 14; unit-6: activation=2.445, criterion_score=1.000 | The recipe calls for 250.

SAE training: rows=8, latents=32, active=1.000, dead=0, val MSE=1.187

SAE training note: Training rows are fewer than latents; collect more activations or reduce latent_dim.

SAE training note: Validation reconstruction is much worse than train; broaden training prompts and keep a separate held-out eval set.

Plan a suppression test for this SAE latent
interp-lab intervene --model distilgpt2 --criterion 'the next token should be a physical measurement unit' --dataset '<causal-prompts.jsonl>' --feature SAE:L6:F10 --sae '<sae.json>' --mode suppress --target-token auto --out '<interventions.jsonl>' --plan-out '<intervention-plan.json>' --dry-run --json
Requires: scored causal prompt JSONL, matching interp-lab SAE artifact

Plan an amplification test for this SAE latent
interp-lab intervene --model distilgpt2 --criterion 'the next token should be a physical measurement unit' --dataset '<causal-prompts.jsonl>' --feature SAE:L6:F10 --sae '<sae.json>' --mode amplify --target-token auto --out '<interventions.jsonl>' --plan-out '<intervention-plan.json>' --dry-run --json
Requires: scored causal prompt JSONL, matching interp-lab SAE artifact

Representative high-activation contexts:
unit-4: activation=3.139, criterion_score=1.000 | The room is 14
unit-6: activation=2.445, criterion_score=1.000 | The recipe calls for 250
unit-3: activation=2.418, criterion_score=1.000 | The trail climbs 600

4. SAE:L6:F16 trained SAE latent 16

feature evidence layer 6 trained-sae causal 0.000
importance 0.729
association 0.731
effect 0.731
specificity 0.540
stability 1.000
strong 0.000

Activation association: promotes criterion (0.731)

Activation summary: trained SAE latent 16. Representative high-activation contexts include unit-3: activation=8.210, criterion_score=1.000 | The trail climbs 600; unit-4: activation=5.021, criterion_score=1.000 | The room is 14.

SAE training: rows=8, latents=32, active=1.000, dead=0, val MSE=1.187

SAE training note: Training rows are fewer than latents; collect more activations or reduce latent_dim.

SAE training note: Validation reconstruction is much worse than train; broaden training prompts and keep a separate held-out eval set.

Plan a suppression test for this SAE latent
interp-lab intervene --model distilgpt2 --criterion 'the next token should be a physical measurement unit' --dataset '<causal-prompts.jsonl>' --feature SAE:L6:F16 --sae '<sae.json>' --mode suppress --target-token auto --out '<interventions.jsonl>' --plan-out '<intervention-plan.json>' --dry-run --json
Requires: scored causal prompt JSONL, matching interp-lab SAE artifact

Plan an amplification test for this SAE latent
interp-lab intervene --model distilgpt2 --criterion 'the next token should be a physical measurement unit' --dataset '<causal-prompts.jsonl>' --feature SAE:L6:F16 --sae '<sae.json>' --mode amplify --target-token auto --out '<interventions.jsonl>' --plan-out '<intervention-plan.json>' --dry-run --json
Requires: scored causal prompt JSONL, matching interp-lab SAE artifact

Representative high-activation contexts:
unit-3: activation=8.210, criterion_score=1.000 | The trail climbs 600
unit-4: activation=5.021, criterion_score=1.000 | The room is 14
unit-6: activation=4.498, criterion_score=1.000 | The recipe calls for 250

5. SAE:L6:F26 trained SAE latent 26

feature evidence layer 6 trained-sae causal 0.000
importance 0.705
association 0.710
effect 0.710
specificity 0.483
stability 1.000
strong 0.000

Activation association: promotes criterion (0.710)

Activation summary: trained SAE latent 26. Representative high-activation contexts include unit-2: activation=9.252, criterion_score=1.000 | The package weighs 3.2; unit-6: activation=5.267, criterion_score=1.000 | The recipe calls for 250.

SAE training: rows=8, latents=32, active=1.000, dead=0, val MSE=1.187

SAE training note: Training rows are fewer than latents; collect more activations or reduce latent_dim.

SAE training note: Validation reconstruction is much worse than train; broaden training prompts and keep a separate held-out eval set.

Plan a suppression test for this SAE latent
interp-lab intervene --model distilgpt2 --criterion 'the next token should be a physical measurement unit' --dataset '<causal-prompts.jsonl>' --feature SAE:L6:F26 --sae '<sae.json>' --mode suppress --target-token auto --out '<interventions.jsonl>' --plan-out '<intervention-plan.json>' --dry-run --json
Requires: scored causal prompt JSONL, matching interp-lab SAE artifact

Plan an amplification test for this SAE latent
interp-lab intervene --model distilgpt2 --criterion 'the next token should be a physical measurement unit' --dataset '<causal-prompts.jsonl>' --feature SAE:L6:F26 --sae '<sae.json>' --mode amplify --target-token auto --out '<interventions.jsonl>' --plan-out '<intervention-plan.json>' --dry-run --json
Requires: scored causal prompt JSONL, matching interp-lab SAE artifact

Representative high-activation contexts:
unit-2: activation=9.252, criterion_score=1.000 | The package weighs 3.2
unit-6: activation=5.267, criterion_score=1.000 | The recipe calls for 250
unit-4: activation=1.751, criterion_score=1.000 | The room is 14

6. SAE:L6:F9 trained SAE latent 9

feature evidence layer 6 trained-sae causal 0.000
importance 0.692
association 0.689
effect 0.689
specificity 0.505
stability 1.000
strong 0.000

Activation association: promotes criterion (0.689)

Activation summary: trained SAE latent 9. Representative high-activation contexts include unit-6: activation=8.127, criterion_score=1.000 | The recipe calls for 250; unit-3: activation=5.627, criterion_score=1.000 | The trail climbs 600.

SAE training: rows=8, latents=32, active=1.000, dead=0, val MSE=1.187

SAE training note: Training rows are fewer than latents; collect more activations or reduce latent_dim.

SAE training note: Validation reconstruction is much worse than train; broaden training prompts and keep a separate held-out eval set.

Plan a suppression test for this SAE latent
interp-lab intervene --model distilgpt2 --criterion 'the next token should be a physical measurement unit' --dataset '<causal-prompts.jsonl>' --feature SAE:L6:F9 --sae '<sae.json>' --mode suppress --target-token auto --out '<interventions.jsonl>' --plan-out '<intervention-plan.json>' --dry-run --json
Requires: scored causal prompt JSONL, matching interp-lab SAE artifact

Plan an amplification test for this SAE latent
interp-lab intervene --model distilgpt2 --criterion 'the next token should be a physical measurement unit' --dataset '<causal-prompts.jsonl>' --feature SAE:L6:F9 --sae '<sae.json>' --mode amplify --target-token auto --out '<interventions.jsonl>' --plan-out '<intervention-plan.json>' --dry-run --json
Requires: scored causal prompt JSONL, matching interp-lab SAE artifact

Representative high-activation contexts:
unit-6: activation=8.127, criterion_score=1.000 | The recipe calls for 250
unit-3: activation=5.627, criterion_score=1.000 | The trail climbs 600
unit-4: activation=2.678, criterion_score=1.000 | The room is 14

7. SAE:L6:F0 trained SAE latent 0

feature evidence layer 6 trained-sae causal 0.000
importance 0.653
association 0.651
effect 0.651
specificity 0.436
stability 1.000
strong 0.000

Activation association: suppresses criterion (-0.651)

Activation summary: trained SAE latent 0. Representative high-activation contexts include ordinary-prefix-1: activation=7.703, criterion_score=0.000 | Please write a friendly; ordinary-prefix-5: activation=2.962, criterion_score=0.000 | The assistant should answer.

SAE training: rows=8, latents=32, active=1.000, dead=0, val MSE=1.187

SAE training note: Training rows are fewer than latents; collect more activations or reduce latent_dim.

SAE training note: Validation reconstruction is much worse than train; broaden training prompts and keep a separate held-out eval set.

Plan a suppression test for this SAE latent
interp-lab intervene --model distilgpt2 --criterion 'the next token should be a physical measurement unit' --dataset '<causal-prompts.jsonl>' --feature SAE:L6:F0 --sae '<sae.json>' --mode suppress --target-token auto --out '<interventions.jsonl>' --plan-out '<intervention-plan.json>' --dry-run --json
Requires: scored causal prompt JSONL, matching interp-lab SAE artifact

Plan an amplification test for this SAE latent
interp-lab intervene --model distilgpt2 --criterion 'the next token should be a physical measurement unit' --dataset '<causal-prompts.jsonl>' --feature SAE:L6:F0 --sae '<sae.json>' --mode amplify --target-token auto --out '<interventions.jsonl>' --plan-out '<intervention-plan.json>' --dry-run --json
Requires: scored causal prompt JSONL, matching interp-lab SAE artifact

Representative high-activation contexts:
ordinary-prefix-1: activation=7.703, criterion_score=0.000 | Please write a friendly
ordinary-prefix-5: activation=2.962, criterion_score=0.000 | The assistant should answer
ordinary-prefix-4: activation=2.761, criterion_score=0.000 | A good onboarding message should

8. SAE:L6:F11 trained SAE latent 11

feature evidence layer 6 trained-sae causal 0.000
importance 0.647
association 0.639
effect 0.639
specificity 0.453
stability 1.000
strong 0.000

Activation association: promotes criterion (0.639)

Activation summary: trained SAE latent 11. Representative high-activation contexts include unit-6: activation=8.391, criterion_score=1.000 | The recipe calls for 250; unit-3: activation=5.123, criterion_score=1.000 | The trail climbs 600.

SAE training: rows=8, latents=32, active=1.000, dead=0, val MSE=1.187

SAE training note: Training rows are fewer than latents; collect more activations or reduce latent_dim.

SAE training note: Validation reconstruction is much worse than train; broaden training prompts and keep a separate held-out eval set.

Plan a suppression test for this SAE latent
interp-lab intervene --model distilgpt2 --criterion 'the next token should be a physical measurement unit' --dataset '<causal-prompts.jsonl>' --feature SAE:L6:F11 --sae '<sae.json>' --mode suppress --target-token auto --out '<interventions.jsonl>' --plan-out '<intervention-plan.json>' --dry-run --json
Requires: scored causal prompt JSONL, matching interp-lab SAE artifact

Plan an amplification test for this SAE latent
interp-lab intervene --model distilgpt2 --criterion 'the next token should be a physical measurement unit' --dataset '<causal-prompts.jsonl>' --feature SAE:L6:F11 --sae '<sae.json>' --mode amplify --target-token auto --out '<interventions.jsonl>' --plan-out '<intervention-plan.json>' --dry-run --json
Requires: scored causal prompt JSONL, matching interp-lab SAE artifact

Representative high-activation contexts:
unit-6: activation=8.391, criterion_score=1.000 | The recipe calls for 250
unit-3: activation=5.123, criterion_score=1.000 | The trail climbs 600
unit-4: activation=1.678, criterion_score=1.000 | The room is 14