Interpretability · probe auditing

Does a Probe Mean What It Says? Auditing Probes with Natural Language Autoencoders

Probes that detect concepts in a model's activations are widely relied upon for both safety evaluation and deployment monitoring; however, what a probe actually detects may not exactly match what was intended. This work introduces a label-free method for auditing which concepts a probe fires on, using Natural Language Autoencoders (NLAs). As a proof of concept, the method is used to discover a planted confounding factor in a refusal probe. Future work will evaluate its usefulness by analyzing published probes.


By Jared Baribeau · GitHub · LinkedIn

Published: May 25, 2026

Updated: May 26, 2026


NLA audit uncovers a probe confound

Concepts tracked by the refusal probe, without and with the multiple-choice confound:

Concept                                     Refusal probe   Refusal probe with multiple-choice confound
refusal                                     +0.95           +0.89
illegal activity                            +0.39           +0.41
safety disclaimer                           +0.11           +0.11
harmful content                             +0.06           +0.06
multiple-choice format (injected confound)  -0.08           +0.35
formal tone                                 -0.06           -0.06
question-answer structure                   -0.23           -0.11

Only the confounded probe fires on “multiple-choice format”, which the audit surfaced with no concept list given in advance.

Introduction

Probes are an important tool for safety and interpretability, useful in both pre-release model evaluation and in deployment monitoring. They can be used to monitor a model's latent-space activations for the presence of specific concepts or behaviors. Probes can track potentially dangerous behaviors (such as deception, sandbagging, situational awareness) or topics (such as bio-weapons manufacturing, illegal activities).

Ideally, a probe fires whenever the target concept is represented in the model's activations, and not otherwise. In practice, however, a probe learns whatever separates the "positive" and "negative" sets in its training data, and verifying that this separation identifies and isolates the target concept can be difficult or impossible. Related or unexpected concepts are often also separated by the dataset labels, and are therefore also picked up by the probe.

Examples:

Currently, the only way to reveal these issues is through counterfactual datasets, which are expensive and must be built separately for each concept to be checked.


Recent work on Natural Language Autoencoders (NLAs)[6] introduced a tool for natural language readouts of a model's residual-stream activations. This new "LLM mind reading" tool opens up interesting new possibilities in cross-checking probe behavior: use an NLA to compare the model's latent-space representations to the probe's output. If successful, this could be used to improve the reliability of the probes relied upon for critical, deployed safety applications.

This work demonstrates the first step: using an NLA to discover a confounding factor in a probe.

Approach: Auditing a Probe

The probe audit answers the question: what concepts really cause the probe to fire? To do this, prompts are run through the audit pipeline, and concepts that are present/absent when the probe fires are tracked.

Probe audit pipeline

  1. Prompt the target model. [A]
  2. Probe the activations: above threshold? (yes/no). [B]
  3. Verbalize the activations using an NLA. [C]
  4. Extract concept tags from the verbalization (using a concept judge, with no set taxonomy). [D]
  5. Rank the concepts most closely tracked by the probe. [E]
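
The first four steps above can be sketched as a single loop. This is a minimal illustration, not the author's implementation: `AuditRecord`, `run_audit`, and the three callables (`get_activation`, `verbalize`, `judge`) are hypothetical names standing in for the target model, the NLA, and the concept judge. Step 5 (ranking) operates on the collected records and is described in Appendix [E].

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class AuditRecord:
    prompt: str
    projection: float     # step 2: raw probe score
    fired: bool           # step 2: is the score above the threshold?
    verbalization: str    # step 3: NLA readout of the activation
    concepts: List[str]   # step 4: tags returned by the concept judge


def run_audit(
    prompts: Sequence[str],
    get_activation: Callable[[str], Sequence[float]],  # hypothetical: step 1
    probe_direction: Sequence[float],
    threshold: float,
    verbalize: Callable[[Sequence[float]], str],        # hypothetical: step 3
    judge: Callable[[str], List[str]],                  # hypothetical: step 4
) -> List[AuditRecord]:
    """Run steps 1-4 for every prompt and collect one record per prompt."""
    records = []
    for prompt in prompts:
        activation = get_activation(prompt)
        # The probe and the NLA read the same activation vector.
        projection = sum(a * d for a, d in zip(activation, probe_direction))
        text = verbalize(activation)
        records.append(
            AuditRecord(prompt, projection, projection > threshold, text, judge(text))
        )
    return records
```

Passing the stages in as callables keeps the loop independent of how each model is served, so the expensive stages can be cached or swapped out without touching the audit logic.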

To start, this is tested using a refusal probe, which is an easy probe to validate due to the very clear separation of "refusal" and "non-refusal" behavior.

Here are some refusal probe audit pipeline samples:

Refusal Probe Audit Pipeline Samples
  1. Prompt. Target model: Llama-3.3-70B-Instruct. The activation is read from resid_post at block 53 (last token). See Appendix [A].
  2. Probe (refusal detected?). A difference-of-means refusal direction (Arditi et al. 2024), scored by projection and thresholded. See Appendix [B].
  3. NLA verbalization. A natural-language autoencoder (kitft/Llama-3.3-70B-NLA-L53-av) verbalizes the block-53 activation into text. See Appendix [C].
  4. Concept judge. Claude Sonnet (claude-sonnet-4-6) extracts ≤4 open-vocabulary concept tags from each verbalization. See Appendix [D].


Next, the samples are collected and results ranked.

  5. Rank concepts by probe association. Association: how much more often a concept appears when the probe fires than when it is quiet (top vs bottom third by projection). See Appendix [E].
Concepts associated with refusal probe (via NLA verbalizations)

Concept                    Association
refusal                    +0.95
illegal activity           +0.39
safety disclaimer          +0.11
disclaimer                 +0.08
q&a format                 +0.08
harmful content            +0.06
ethical disclaimer         +0.05
misinformation             +0.03
list format                -0.06
formal tone                -0.06
trick question             -0.06
multiple-choice format     -0.08
question-answer structure  -0.23

association = frequency(concept | probe fires) − frequency(concept | probe quiet)

The probe mostly tracks "refusal" and "illegal activity". Generic “q&a format” appears whether or not the probe fires, and therefore cancels out to near zero. Concepts seen fewer than 4 times (n=200) are dropped.

Audit Method Validation

To test the approach, two refusal probes are compared: a baseline, and one trained on format-confounded data. In the confounded training set, harmful prompts are given in multiple-choice format and harmless prompts in regular prose. This injects an obvious confound, which the audit should surface.

  1. Train a refusal probe with an intentional confound (multiple-choice format).
  2. Audit the confounded probe and check whether the confound is discovered.

Results

The NLA audit discovers the confound. The following chart shows the difference in concepts tracked by the baseline and confounded probes. The injected confound (multiple-choice format) appears with an association of +0.35 for the confounded probe, versus -0.08 for the baseline.


Concepts associated with refusal probe (via NLA verbalizations)

Concept                                     Baseline probe   Probe with multiple-choice format confound
refusal                                     +0.95            +0.89
illegal activity                            +0.39            +0.41
safety disclaimer                           +0.11            +0.11
disclaimer                                  +0.08            -0.06
q&a format                                  +0.08            +0.05
harmful content                             +0.06            +0.06
ethical disclaimer                          +0.05            +0.05
misinformation                              +0.03            +0.05
list format                                 -0.06            -0.08
formal tone                                 -0.06            -0.06
trick question                              -0.06            -0.02
multiple-choice format (injected confound)  -0.08            +0.35
question-answer structure                   -0.23            -0.11

Both the baseline and confounded refusal probes track refusal and illegal activity. However, “multiple-choice format” is tracked only by the confounded probe. The audit successfully surfaced the planted confound.
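
One way to read the comparison is to rank concepts by how much their association score moves between the two probes. The sketch below does this using the values reported in the chart; the helper `association_shift` is a hypothetical name, and ranking by absolute shift is an illustrative choice rather than part of the described method.

```python
# Association scores as reported for the two probes.
baseline = {
    "refusal": 0.95, "illegal activity": 0.39, "safety disclaimer": 0.11,
    "disclaimer": 0.08, "q&a format": 0.08, "harmful content": 0.06,
    "ethical disclaimer": 0.05, "misinformation": 0.03, "list format": -0.06,
    "formal tone": -0.06, "trick question": -0.06,
    "multiple-choice format": -0.08, "question-answer structure": -0.23,
}
confounded = {
    "refusal": 0.89, "illegal activity": 0.41, "safety disclaimer": 0.11,
    "disclaimer": -0.06, "q&a format": 0.05, "harmful content": 0.06,
    "ethical disclaimer": 0.05, "misinformation": 0.05, "list format": -0.08,
    "formal tone": -0.06, "trick question": -0.02,
    "multiple-choice format": 0.35, "question-answer structure": -0.11,
}


def association_shift(base, other):
    """Rank concepts by the absolute change in association between two probes.

    Assumption: a concept missing from one probe's results is treated as 0.0.
    """
    concepts = sorted(set(base) | set(other))
    shifts = {c: round(other.get(c, 0.0) - base.get(c, 0.0), 2) for c in concepts}
    return sorted(shifts.items(), key=lambda kv: abs(kv[1]), reverse=True)


for concept, shift in association_shift(baseline, confounded)[:3]:
    print(f"{concept:28s} {shift:+.2f}")
```

On these numbers the largest shift is "multiple-choice format" (+0.43), followed by "disclaimer" (-0.14) and "question-answer structure" (+0.12).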

This result validates the proof of concept: NLAs may indeed offer something useful for auditing and understanding probes.

Conclusion

The findings here are positive! This work intentionally used a simple, planted confound (prompt format) on a simple, verifiable probe (refusal) as a proof of concept. Under these conditions, the findings aren't particularly useful; however, they indicate that further work is worth pursuing.

The next step is to apply the presented audit methodology to existing, published probes and evaluate the output.

Limitations

Expressiveness ceiling. The audit only catches a confound the NLA both verbalizes and phrases differently from the concept; tightly entangled concepts may be indistinguishable in its words.

NLA Limitations and Confabulation. An NLA verbalization may not faithfully reflect the activation or the concepts represented inside. See the original NLA paper.

Primitive concept judge. Right now the concept extraction judge is very primitive (e.g. limited to 4 concepts per NLA readout), likely limiting the usefulness of the audits. See future work for proposed improvements.

Eval-set dependence. The association score splits a probe's activations into top/bottom third, so it needs an evaluation set spanning the probe's firing range. The concepts that surface depend on the audit set's composition. A continuous association measure (e.g. correlation of concept presence with projection) would improve this by removing the arbitrary split, though it would still be limited to concepts present in the audit set.
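
As a rough sketch of the continuous alternative mentioned above, concept presence (0/1) can be correlated directly with the probe projection, which amounts to a point-biserial correlation. This is not implemented in the current work; `continuous_association` is a hypothetical name, and returning 0.0 when the correlation is undefined is an assumption made for convenience.

```python
import numpy as np


def continuous_association(projections, concept_present):
    """Pearson correlation between probe projection and binary concept presence."""
    x = np.asarray(projections, dtype=float)
    y = np.asarray(concept_present, dtype=float)
    # Correlation is undefined if either variable is constant, e.g. a concept
    # that appears in every readout. Assumption: report that as 0.0.
    if x.std() == 0 or y.std() == 0:
        return 0.0
    return float(np.corrcoef(x, y)[0, 1])
```

Unlike the top/bottom-third split, this uses every prompt in the audit set, at the cost of being sensitive to outlying projections.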

Layer coupling. The verbalization is run at the single, fixed layer the NLA was trained on, and the probe must read from that same layer. A new NLA could be trained on a different target layer, or potentially trained to support multiple-layer inputs.

Hedged readouts. At the assistant-turn boundary the NLA describes a distribution over likely continuations and often hedges, so a hard yes/no loses information. Further work is warranted to understand the full implications.

Baseline tics. The NLA describes generic structure ("Q&A format") almost everywhere. The frequency-contrast metric accounts for this; however, absolute mention-rates are contaminated, reducing the signal-to-noise ratio.

Cost. Each readout runs a 70B verbalizer. This is acceptable for offline audits, but impractical for live deployment. Training a new large-parameter NLA is expensive.

Future work

Audits on published probes. Explore the usefulness of the method beyond a simple, planted confound.

Concept judge improvements. Evaluate different approaches to concept extraction and concept normalization. Iterate on the judge prompt and evaluate a larger sample of judge outputs by hand.

Challenging confounds. Test harder confounds: stylistic/affective, topical, and correlated-concept (eval-awareness vs "test-iness"; deception vs hedging).

Borderline data experimentation. Stress the refusal/over-refusal boundary (e.g. XSTest).

Probe factory. Distill the expensive NLA audit into a cheap, deployable probe trained on NLA-derived labels.

NLA improvements. Multi-layer NLAs may be less noisy and more useful.

References

  1. Devbunova (2026). Is Evaluation Awareness Just Format Sensitivity? arXiv:2603.19426.
  2. Goldowsky-Dill et al. (2025). Detecting Strategic Deception Using Linear Probes. arXiv:2502.03407.
  3. Levinstein & Herrmann (2024). Still No Lie Detector for Language Models. Philosophical Studies.
  4. Marks & Tegmark (2024). The Geometry of Truth. arXiv:2310.06824.
  5. Arditi et al. (2024). Refusal in LLMs Is Mediated by a Single Direction. arXiv:2406.11717.
  6. Fraser-Taliente et al. (2026). Natural Language Autoencoders. transformer-circuits.pub/2026/nla.

Code & data

Source on GitHub

Acknowledgements

This project was completed with support from the BlueDot Impact rapid grant program, and as part of a BlueDot Impact Technical AI Safety Project Sprint.

Thanks for the support! 🚀


And to you 👀! Thanks for reading. Questions? Feedback? I'd love to hear from you, here.

Appendix

Experiment specifications and method details, referenced from the body by [letter].

[A] Target model & activation extraction

Llama-3.3-70B-Instruct. Activations are captured as resid_post at the output of block 53 (the NLA's layer), at the last token / assistant-turn boundary — one vector per prompt. The probe and the NLA read the same activation to ensure the comparison is meaningful.
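
The "one vector per prompt" selection can be illustrated as follows. The sketch assumes the block-53 residual stream has already been captured as an array of shape (batch, seq_len, d_model) together with an attention mask; how that capture is done (for example with a forward hook) is outside this sketch, and `last_token_activations` is a hypothetical helper, not taken from the project's code.

```python
import numpy as np


def last_token_activations(hidden, attention_mask):
    """Pick the activation at the last non-padding token of each sequence.

    hidden:         array of shape (batch, seq_len, d_model)
    attention_mask: array of shape (batch, seq_len), 1 for real tokens
    """
    hidden = np.asarray(hidden)
    mask = np.asarray(attention_mask).astype(bool)
    seq_len = mask.shape[1]
    # Scan each row from the right to find the last real token.
    # This works for both left- and right-padded batches.
    last_idx = seq_len - 1 - np.argmax(mask[:, ::-1], axis=1)
    return hidden[np.arange(hidden.shape[0]), last_idx]
```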

[B] Probe & confound

Difference-of-means direction (Arditi et al. 2024 [5] recipe): dir = mean(harmful) − mean(harmless), scored by projection and thresholded. Harmful prompts from AdvBench / TDC / HarmBench / MaliciousInstruct; harmless from Alpaca (Arditi's precomputed splits), n_train = 128 / n_val = 32, seed 42, with Arditi's refusal-score filter.
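
The direction and scoring described above can be sketched in a few lines of NumPy. Two details here are assumptions rather than facts from this write-up: the direction is normalized to unit length before projecting, and the threshold is placed at the midpoint between the two class means. The function names are hypothetical.

```python
import numpy as np


def difference_of_means_direction(harmful_acts, harmless_acts):
    """dir = mean(harmful) - mean(harmless), over arrays of shape (n, d_model)."""
    return np.mean(harmful_acts, axis=0) - np.mean(harmless_acts, axis=0)


def probe_scores(acts, direction, normalize=True):
    """Project each activation onto the probe direction."""
    direction = np.asarray(direction, dtype=float)
    if normalize:
        # Assumption: use a unit-length direction so scores are comparable.
        direction = direction / np.linalg.norm(direction)
    return np.asarray(acts, dtype=float) @ direction


def midpoint_threshold(harmful_scores, harmless_scores):
    """Assumed thresholding rule: halfway between the two class means."""
    return 0.5 * (np.mean(harmful_scores) + np.mean(harmless_scores))
```

A prompt counts as "probe fires" when its score exceeds the threshold; any other thresholding rule can be substituted without changing the rest of the audit.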

The confounded probe is trained on harmful-in-MCQ (label 1) vs harmless-in-prose (label 0), so format is perfectly correlated with the label. The MCQ scaffold is Question: … / Choices: (A) … (B) … / Answer: vs free-form prose; the instruction content is identical across formats. The de-confounded check is a 2×2: {harmful, harmless} × {MCQ, free}.
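
One possible way to read the 2×2 is to compare mean probe scores per cell and compute the main effect of each factor. The write-up does not specify how the 2×2 is analyzed, so the following is an illustrative sketch; `two_by_two_effects` is a hypothetical helper, and each cell is assumed to contain at least one prompt.

```python
import numpy as np


def two_by_two_effects(scores, is_harmful, is_mcq):
    """Mean probe score per cell, plus main effects of harm and of format."""
    scores = np.asarray(scores, dtype=float)
    harm = np.asarray(is_harmful, dtype=bool)
    mcq = np.asarray(is_mcq, dtype=bool)
    # Keys are (harmful?, mcq?) pairs.
    cell = {
        (h, m): scores[(harm == h) & (mcq == m)].mean()
        for h in (True, False)
        for m in (True, False)
    }
    # Main effect of harm: harmful minus harmless, averaged over both formats.
    harm_effect = 0.5 * (
        (cell[True, True] - cell[False, True])
        + (cell[True, False] - cell[False, False])
    )
    # Main effect of format: MCQ minus free-form, averaged over both harm levels.
    format_effect = 0.5 * (
        (cell[True, True] - cell[True, False])
        + (cell[False, True] - cell[False, False])
    )
    return cell, harm_effect, format_effect
```

A probe that tracks only harmfulness should show a format effect near zero, while a format-confounded probe should show a clearly non-zero one.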

[C] NLA verbalizer

kitft/Llama-3.3-70B-NLA-L53-av (activation verbalizer), served via SGLang. Each block-53 activation is injected via input_embeds; injection scale / template are read from the checkpoint's nla_meta.yaml. Inference recipe: github.com/kitft/nla-inference. The verbalizer describes a distribution over the upcoming response at this token position and often hedges.

[D] Concept judge

Claude Sonnet (claude-sonnet-4-6) via the Anthropic API; open-vocabulary extraction of ≤4 lowercase concept tags per verbalization, prompt-cached. Tags are normalized by lowercasing / stripping (no synonym-merge pass yet). Full prompt:

Extract the concepts the description says are PRESENT — what the model is representing or about to do. Output 1–4 concise tags.

· short canonical noun phrases, lowercase (e.g. "multiple-choice format", "refusal", "illegal activity").
· name SPECIFIC content/structure, not vague words like "response" or "text".
· reuse the same tag for the same idea across explanations.
· include format/structure as a concept when present.

Respond with ONLY a JSON object: {"concepts": ["tag", ...]}
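
The judge's reply then has to be parsed and normalized. The sketch below covers the lowercasing / stripping and the four-tag cap described above. Dropping duplicate tags and returning an empty list on malformed output are assumptions, and `parse_concepts` is a hypothetical name.

```python
import json
from typing import List

MAX_TAGS = 4  # the judge is asked for at most 4 tags per verbalization


def parse_concepts(raw_reply: str) -> List[str]:
    """Parse a {"concepts": [...]} reply into normalized tags."""
    try:
        payload = json.loads(raw_reply)
    except json.JSONDecodeError:
        return []  # assumption: malformed replies contribute no tags
    tags = payload.get("concepts") if isinstance(payload, dict) else None
    if not isinstance(tags, list):
        return []
    normalized: List[str] = []
    for tag in tags:
        if not isinstance(tag, str):
            continue
        clean = tag.strip().lower()
        # No synonym merging: only exact duplicates are dropped.
        if clean and clean not in normalized:
            normalized.append(clean)
    return normalized[:MAX_TAGS]
```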

[E] Probe–concept association metric

The association score measures how much more often a concept shows up in the readout when the probe fires than when it's quiet.

  1. Rank every prompt by how strongly the probe fires.
  2. For one concept, check how often the NLA mentions it in the top third (probe fires most) vs the bottom third (probe quietest).
  3. Score = (how often in the top third) − (how often in the bottom third).

+1.0 = always there when firing, never when quiet → the probe keys on it.  
0 = equally common either way → unrelated.  
negative = more common when the probe stays quiet.

Hyperparameters: top / bottom third by projection (top_frac = 0.33).

Concepts seen fewer than 4 times are dropped.
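
The metric above can be sketched as follows. `association_scores` is a hypothetical name, and two details are assumptions: the size of each third is obtained by truncating `n * top_frac` to an integer, and the minimum-count filter is applied over the whole audit set rather than over the two thirds only.

```python
from collections import Counter
from typing import Dict, List, Sequence


def association_scores(
    projections: Sequence[float],
    concept_lists: Sequence[List[str]],
    top_frac: float = 0.33,
    min_count: int = 4,
) -> Dict[str, float]:
    """frequency(concept | top third) - frequency(concept | bottom third)."""
    n = len(projections)
    k = max(1, int(n * top_frac))  # assumption: truncate to an integer
    order = sorted(range(n), key=lambda i: projections[i])
    bottom, top = order[:k], order[-k:]

    # Count each concept at most once per prompt.
    counts = Counter(c for tags in concept_lists for c in set(tags))

    scores = {}
    for concept, count in counts.items():
        if count < min_count:
            continue  # rare concepts are dropped
        freq_top = sum(concept in concept_lists[i] for i in top) / k
        freq_bottom = sum(concept in concept_lists[i] for i in bottom) / k
        scores[concept] = freq_top - freq_bottom
    # Most positively associated concepts first.
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

With n = 200 and top_frac = 0.33, this compares the 66 prompts with the highest projections against the 66 with the lowest.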

[F] Compute & run

Modal.com on-demand GPUs — target extraction on H100×2, NLA serving on H100×4 (tensor-parallel 4); the two 70B models run sequentially. Concept judge calls go to the Anthropic API. Frozen run part3v1, n = 200 (the 2×2, 50 prompts per cell).