
Does a Probe Mean What It Says? Auditing Probes with Natural Language Autoencoders

Probes that detect concepts in a model's activations are widely relied upon for both safety evaluation and deployment monitoring; however, what a probe actually detects may not match what was intended. This work introduces a label-free method, based on Natural Language Autoencoders (NLAs), for auditing which concepts a probe actually fires on. As a proof of concept, the method is shown to discover a confounding factor planted in a probe. Future work will evaluate its usefulness by analyzing published probes.

By Jared Baribeau · GitHub · LinkedIn

Published: May 25, 2026

Updated: June 1, 2026


Figure: NLA audit uncovers a probe confound. Association score per concept for the baseline refusal probe and for the refusal probe with a multiple-choice confound.

refusal: baseline +0.95, confounded +0.89
illegal activity: baseline +0.39, confounded +0.41
safety disclaimer: baseline +0.11, confounded not shown
harmful content: baseline +0.06, confounded not shown
⋮
multiple-choice format (injected confound): baseline -0.08, confounded +0.35
formal tone: baseline -0.06, confounded not shown
question-answer structure: baseline -0.23, confounded -0.11

Only the confounded probe fires on “multiple-choice format”, a concept surfaced by the audit with no concept list given in advance.

Introduction

Probes are an important tool for AI safety and interpretability, useful in both pre-release model evaluation and deployment monitoring. They can be used to monitor a model's latent-space activations for the presence of specific concepts or behaviors. Probes can track potentially dangerous behaviors (such as deception, sandbagging, and situational awareness) or topics (such as bio-weapons manufacturing and illegal activities).

Probes: A probe is a small monitoring system trained on an AI model's activations (i.e. the "brain activity" between the prompt and the LLM response) to detect whether a specific concept is present. For example, a "bomb manufacturing" probe would be triggered when the model is asked how to make a bomb. Probes are widely used in safety evaluation and deployment monitoring to watch for and prevent harmful LLM responses.

Interpretability: Interpretability is like neuroscience for AI models: the field of studying what is going on inside an AI's mind. Because AI models are "grown" more than "built", we don't by default understand why they do the things they do. Interpretability research aims to change that, to improve our ability to develop safe and reliable AI systems. To date, this has yielded some very useful tools, like probes, which are heavily discussed in this work.

Activations: Neural networks process data through layers of numbers called "activations", vectors of values encoding the model's internal state at each layer. The residual stream accumulates these across layers; probes and NLAs read from specific points in this stream to infer what the model is "thinking about."

Deception, sandbagging, situational awareness: AI systems have been shown to exhibit these behaviors in some circumstances.
Deception: saying things the model "knows" to be false.
Sandbagging: deliberately underperforming on evaluations to avoid being flagged.
Situational awareness: recognizing it is being evaluated and behaving differently as a result.
It is critical to be able to detect these when evaluating whether or not AI systems are safe.

Ideally, probes fire any time the target concept is represented in the model's activations, and not otherwise. In practice, however, a probe learns whatever separates the "positive" and "negative" sets of its training data, and verifying that this difference identifies and isolates the target concept can be difficult or impossible. It is common for related or unexpected (but different) concepts to also be separated by the dataset labels, and therefore to be picked up by the probe.

Examples:

A probe trained to detect "evaluation awareness" was shown to largely track multiple-choice formatting [1].
A deception probe can trigger on deception-related dialogues, even if the response is honest [2].
Probes trained to detect whether statements are true were shown to fail on statements with "not" [3], or to be unable to disambiguate "truth" from "commonly believed" or "verifiable" [4].

Currently, the only way to reveal these issues is through counterfactual datasets, which are expensive and must be built separately for each concept to be checked.

Recent work on Natural Language Autoencoders (NLAs) [6] introduced a tool for natural language readouts of a model's residual-stream activations. This new "LLM mind reading" tool opens up interesting new possibilities in cross-checking probe behavior: use an NLA to compare the model's latent-space representations to the probe's output. If successful, this could be used to improve the reliability of the probes relied upon for critical, deployed safety applications.

Natural Language Autoencoders (NLAs): A model trained to translate the neural network activations (i.e. roughly "which neurons are firing") into a human-readable text description. See "Refusal Probe Audit Pipeline Samples" for what this looks like. This gives us a rough way to "read the model's mind" and ask a model: "what are you thinking about?" In comparison, using probes limits us to asking something like "are you thinking about cats?".

This work demonstrates the first step: using an NLA to discover a confounding factor in a probe.

Approach: Auditing a Probe

The starting point is an already-trained probe. The probe audit answers the question: what concepts really cause the probe to fire? To do this, prompts are run through the audit pipeline, and concepts that are present/absent when the probe fires are tracked.

Probe audit pipeline

1. Prompt the target model. [Appendix A]
2. Probe the activations: above threshold? (yes/no). [Appendix B]
3. Verbalize the activations using an NLA. [Appendix C]
4. Extract concept tags from the verbalization. [Appendix D]
   A concept judge proposes tags freely, rather than picking from a pre-defined list of named concepts.
5. Rank concepts by probe association.
   Measure how much more often each concept appears when the probe fires vs when it stays quiet. The concepts the probe most closely tracks rise to the top.
   Score = (how often present in the strongest 1/3 of probe firings) − (how often present in the weakest 1/3).
   See Appendix [E] for a full explanation of the ranking methodology.
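
As a rough illustration of how the first four pipeline steps (prompt, probe, verbalize, extract tags) fit together, the loop below collects one record per prompt. It is a minimal sketch, not the project's code: `AuditRecord`, `run_audit`, and the four callables passed in are hypothetical names, and each callable stands in for a component described in the appendices.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Sequence


@dataclass
class AuditRecord:
    """One prompt traced through the pipeline (hypothetical container)."""
    prompt: str
    projection: float     # raw probe score for this prompt
    fired: bool           # projection above the probe threshold?
    verbalization: str    # NLA text readout of the activation
    concepts: list[str]   # open-vocabulary tags from the concept judge


def run_audit(
    prompts: Iterable[str],
    get_activation: Callable[[str], Sequence[float]],  # assumed interface: prompt -> activation
    probe_score: Callable[[Sequence[float]], float],   # assumed interface: activation -> projection
    verbalize: Callable[[Sequence[float]], str],       # assumed interface: activation -> NLA text
    extract_tags: Callable[[str], list[str]],          # assumed interface: NLA text -> concept tags
    threshold: float,
) -> list[AuditRecord]:
    records = []
    for prompt in prompts:
        activation = get_activation(prompt)
        projection = probe_score(activation)
        text = verbalize(activation)  # the NLA reads the same activation as the probe
        records.append(
            AuditRecord(prompt, projection, projection > threshold, text, extract_tags(text))
        )
    return records
```

Passing the components in as callables keeps the sketch independent of any particular model, probe, or judge; the ranking step then only needs each record's `projection` and `concepts`.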

To start, this is tested using a refusal probe, which is an easy probe to validate due to the very clear separation of "refusal" and "non-refusal" behavior.

Many prompts are run through the pipeline to complete an audit. Each sample in the refusal probe audit passes through the following stages:

Refusal Probe Audit Pipeline Samples

1. Prompt (target model). Each prompt in the dataset is input to Llama-3.3-70B-Instruct. The activation is read from the residual stream at block 53 (last token). See Appendix [A].

2. Probe: refusal detected? The refusal probe is evaluated on the activations of the model at the last token. The probe is a difference-of-means refusal direction (Arditi et al. 2024), scored by projection and thresholded. See Appendix [B].

3. NLA verbalization. A natural-language autoencoder (kitft/Llama-3.3-70B-NLA-L53-av) verbalizes the block-53 activation into text. See Appendix [C].

4. Concept judge. Claude Sonnet (claude-sonnet-4-6) extracts open-vocabulary concept tags from each NLA verbalization. See Appendix [D] for the concept judge prompt.

Model output. The reply Llama-3.3-70B-Instruct generates for this prompt. The NLA reads the activation at the boundary before this text is produced, so it predicts the reply rather than observing it.

Next, the samples are collected and results ranked.

5. Rank concepts by probe association. How much more often a concept appears when the probe fires than when it's quiet (top vs bottom third by projection). See Appendix [E].

Concepts associated with refusal probe (via NLA verbalizations)

refusal: +0.95
illegal activity: +0.39
safety disclaimer: +0.11
disclaimer: +0.08
q&a format: +0.08
harmful content: +0.06
ethical disclaimer: +0.05
misinformation: +0.03
list format: -0.06
formal tone: -0.06
trick question: -0.06
multiple-choice format: -0.08
question-answer structure: -0.23

association = frequency(concept | probe fires) − frequency(concept | probe quiet)

The probe mostly tracks "refusal" and "illegal activity". Generic “q&a format” appears whether or not the probe fires, and therefore cancels out to near zero. Concepts seen fewer than 4 times (n=200) are dropped.

Audit Method Validation

To test the approach, two refusal probes are compared: a baseline, and one trained on format-confounded data. In the confounded training set, harmful prompts are given in multiple-choice format and harmless prompts in regular prose. This injects an obvious confound, which should be surfaced by the audit.

Train a refusal probe with an intentional confound (multiple-choice format).
Audit the confounded probe, and check whether the confound is discovered.

Results

The NLA audit discovers the confound. The following chart shows the difference in concepts tracked by the baseline and confounded probes. The injected confound (multiple-choice format) appears with an association of +0.35 for the confounded probe, versus -0.08 for the baseline, making it the third-strongest concept the confounded probe tracks.

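Concepts associated with refusal probe (via NLA verbalizations), for the baseline probe and the probe with the multiple-choice format confound

refusal: baseline +0.95, confounded +0.89
illegal activity: baseline +0.39, confounded +0.41
safety disclaimer: baseline +0.11, confounded not shown
disclaimer: baseline +0.08, confounded -0.06
q&a format: baseline +0.08, confounded +0.05
harmful content: baseline +0.06, confounded not shown
ethical disclaimer: baseline +0.05, confounded not shown
misinformation: baseline +0.03, confounded +0.05
list format: baseline -0.06, confounded -0.08
formal tone: baseline -0.06, confounded not shown
trick question: baseline -0.06, confounded -0.02
multiple-choice format (injected confound): baseline -0.08, confounded +0.35
question-answer structure: baseline -0.23, confounded -0.11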

Both the baseline and confounded refusal probes track refusal and illegal activity. However, “multiple-choice format” is tracked only by the confounded probe. The audit successfully surfaced the planted confound.

This result validates the proof of concept: NLAs may indeed offer something useful for auditing and understanding probes.

Conclusion

The findings here are positive! This work intentionally used a simple, planted confound (prompt format) on a simple, verifiable probe (refusal) as a proof of concept. Under these conditions, the findings aren't particularly useful; however, they indicate that further work is worth pursuing.

The next step is to apply the presented audit methodology to existing, published probes and evaluate the output.

Limitations

Primitive concept judge. The concept extraction judge is very primitive (e.g. limited to 4 concepts per NLA readout) and matching concepts are not merged unless they are an exact string match. Small changes to the prompt can significantly impact the output, limiting the trustworthiness of the audits. See future work for proposed improvements.

Expressiveness ceiling. The audit only catches a confound the NLA both verbalizes and phrases differently from the concept; tightly entangled concepts may be indistinguishable in its words.

NLA limitations and confabulation. An NLA verbalization may not faithfully reflect the activation or the concepts represented inside. See the original NLA paper [6].

Eval-set dependence. The association score splits a probe's activations into top/bottom third, so it needs an evaluation set spanning the probe's firing range. The concepts that surface depend on the audit set's composition. A continuous association measure (e.g. correlation of concept presence with projection) would improve this by removing the arbitrary split, though it would still be limited to concepts present in the audit set.
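
One possible form of the continuous measure mentioned above is the Pearson correlation between each prompt's probe projection and a 0/1 flag for whether the concept was tagged (the point-biserial correlation). The sketch below is an illustration of that idea only; it was not used in this work, and the function name and the choice to return 0.0 for constant inputs are assumptions.

```python
import numpy as np


def continuous_association(projections, present) -> float:
    """Pearson correlation between probe projection and a 0/1 concept-presence flag."""
    x = np.asarray(projections, dtype=float)
    y = np.asarray(present, dtype=float)
    if x.std() == 0 or y.std() == 0:
        return 0.0  # ASSUMPTION: a constant input is treated as "no association"
    return float(np.corrcoef(x, y)[0, 1])


# Toy usage: the concept is tagged only on the two highest-projection prompts.
score = continuous_association([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1])
```

Unlike the top/bottom-third contrast, this uses every prompt, including the middle third, but it remains sensitive to which prompts are in the audit set.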

Layer coupling. The verbalization is run at the single, fixed layer the NLA was trained on, and the probe must read from that same layer. A new NLA could be trained on a different target layer, or potentially trained to support multiple-layer inputs.

Hedged readouts. The NLA is read at the assistant-turn boundary — the token position right where the model is about to begin its reply. At that point the model has not yet committed to a response, so its activations encode a distribution over possible continuations rather than one definite answer, and the readout often hedges (e.g. "the model may refuse, or may answer with a caveat"). Forcing that into a hard yes/no (refused vs. not) throws away this nuance. Further work is warranted to understand the full implications.

Baseline tics. The NLA describes generic structure ("Q&A format") almost everywhere. The frequency-contrast metric accounts for this; however, absolute mention-rates are contaminated, reducing the signal-to-noise ratio.

Cost. Each readout runs a 70B verbalizer. This is acceptable for offline audits, but impractical for live deployment. Training a new large-parameter NLA is also expensive.

Future work

Audits on published probes. Explore the usefulness of the method beyond a simple, planted confound.

Concept judge improvements. Evaluate different approaches to concept extraction and normalization. Iterate on the judge prompt and evaluate a larger sample of judge outputs by hand.

Challenging confounds. Test confounds that are harder to surface: stylistic/affective, topical, and correlated-concept (eval-awareness vs "test-iness"; deception vs hedging).

Borderline data experimentation. Stress the refusal/over-refusal boundary (e.g. XSTest).

Probe factory. Distill the expensive NLA audit into a cheap, deployable probe trained on NLA-derived labels.

NLA improvements. Multi-layer NLAs may be less noisy and more useful.

References

[1] Devbunova (2026). Is Evaluation Awareness Just Format Sensitivity? arXiv:2603.19426.
[2] Goldowsky-Dill et al. (2025). Detecting Strategic Deception Using Linear Probes. arXiv:2502.03407.
[3] Levinstein & Herrmann (2024). Still No Lie Detector for Language Models. Philosophical Studies.
[4] Marks & Tegmark (2024). The Geometry of Truth. arXiv:2310.06824.
[5] Arditi et al. (2024). Refusal in LLMs Is Mediated by a Single Direction. arXiv:2406.11717.
[6] Fraser-Taliente et al. (2026). Natural Language Autoencoders. transformer-circuits.pub/2026/nla.

Code & data

Source on GitHub

Acknowledgements

This project was completed with support from the BlueDot Impact rapid grant program, and as part of a BlueDot Impact Technical AI Safety Project Sprint.

Thanks for the support! 🚀

And to you 👀! Thanks for reading. Questions? Feedback? I'd love to hear from you, here.

Appendix

Experiment specifications and method details, referenced from the body by [letter].


[A] Target model & activation extraction

Llama-3.3-70B-Instruct. Activations are captured as resid_post at the output of block 53 (the NLA's layer), at the last token / assistant-turn boundary — one vector per prompt. The probe and the NLA read the same activation to ensure the comparison is meaningful.


[B] Probe & confound

Difference-of-means direction (Arditi et al. 2024 [5] recipe): dir = mean(harmful) − mean(harmless), scored by projection and thresholded. Harmful prompts from AdvBench / TDC / HarmBench / MaliciousInstruct; harmless from Alpaca (Arditi's precomputed splits), n_train = 128 / n_val = 32, seed 42, with Arditi's refusal-score filter.
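
The difference-of-means recipe above can be sketched in a few lines of NumPy. This is an illustration on random toy data, not the project's code: the function names are hypothetical, unit-normalizing the direction before projecting is an assumption, and so is the midpoint rule used here to pick a threshold (the text only states that the projection is thresholded).

```python
import numpy as np


def fit_diff_of_means(harmful: np.ndarray, harmless: np.ndarray) -> np.ndarray:
    """dir = mean(harmful) - mean(harmless); inputs are (n_prompts, d_model)."""
    return harmful.mean(axis=0) - harmless.mean(axis=0)


def project(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Projection of each activation onto the direction (unit-normalized by assumption)."""
    return acts @ (direction / np.linalg.norm(direction))


def midpoint_threshold(pos_val: np.ndarray, neg_val: np.ndarray, direction: np.ndarray) -> float:
    """ASSUMPTION: threshold halfway between the mean validation projections of each class."""
    return float((project(pos_val, direction).mean() + project(neg_val, direction).mean()) / 2)


# Toy stand-ins for residual-stream activations (random data, illustration only).
rng = np.random.default_rng(42)
shift = np.zeros(16)
shift[0] = 4.0  # synthetic "harmful" offset along one axis
harmful_train = rng.normal(size=(128, 16)) + shift
harmless_train = rng.normal(size=(128, 16))
harmful_val = rng.normal(size=(32, 16)) + shift
harmless_val = rng.normal(size=(32, 16))

direction = fit_diff_of_means(harmful_train, harmless_train)
threshold = midpoint_threshold(harmful_val, harmless_val, direction)
fires = project(harmful_val, direction) > threshold  # boolean "probe fires" per prompt
```

Because the probe is just a direction, anything that differs systematically between the two training sets, including prompt format, contributes to it.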

The confounded probe is trained on harmful-in-MCQ (label 1) vs harmless-in-prose (label 0), so format is perfectly correlated with the label. The MCQ scaffold is Question: … / Choices: (A) … (B) … / Answer: vs free-form prose; the instruction content is identical across formats. The de-confounded check is a 2×2: {harmful, harmless} × {MCQ, free}.
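
A minimal sketch of how the confounded training set and the 2×2 evaluation cells could be assembled is shown below. The function names are hypothetical, the exact line layout of the scaffold is an assumption based on the description above, and the answer choices are left as an explicit parameter because their text is not specified here.

```python
from itertools import product
from string import ascii_uppercase


def to_mcq(instruction: str, choices: list[str]) -> str:
    """Wrap an instruction in a Question / Choices / Answer scaffold (layout assumed)."""
    options = " ".join(f"({ascii_uppercase[i]}) {c}" for i, c in enumerate(choices))
    return f"Question: {instruction}\nChoices: {options}\nAnswer:"


def confounded_training_set(harmful, harmless, choices):
    """Harmful prompts in MCQ form (label 1), harmless prompts in prose (label 0)."""
    return [(to_mcq(p, choices), 1) for p in harmful] + [(p, 0) for p in harmless]


def two_by_two(harmful, harmless, choices):
    """Evaluation cells {harmful, harmless} x {mcq, free}, same instructions in both formats."""
    sources = {"harmful": harmful, "harmless": harmless}
    cells = {}
    for content, fmt in product(sources, ("mcq", "free")):
        prompts = sources[content]
        cells[(content, fmt)] = [to_mcq(p, choices) if fmt == "mcq" else p for p in prompts]
    return cells


# Placeholder strings only; real prompts and choices come from the datasets above.
cells = two_by_two(["<harmful instruction>"], ["<harmless instruction>"], ["<option 1>", "<option 2>"])
```

In the training set, format and label move together, which is the planted confound; in the 2×2, every instruction appears in both formats, so format and harmfulness can be told apart.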


[C] NLA verbalizer

kitft/Llama-3.3-70B-NLA-L53-av (activation verbalizer), served via SGLang. Each block-53 activation is injected via input_embeds; injection scale / template are read from the checkpoint's nla_meta.yaml. Inference recipe: github.com/kitft/nla-inference. The verbalizer describes a distribution over the upcoming response at this token position and often hedges.


[D] Concept judge

Claude Sonnet (claude-sonnet-4-6) via the Anthropic API; open-vocabulary extraction of ≤4 lowercase concept tags per verbalization, prompt-cached. Tags are normalized by lowercasing / stripping (no synonym-merge pass yet). Full prompt:

Extract the concepts the description says are PRESENT — what the model is representing or about to do. Output 1–4 concise tags.

· short canonical noun phrases, lowercase. The examples below are from UNRELATED domains and only illustrate the FORM of a good tag (a topic, a format, a tone) — do NOT bias your tags toward them: e.g. "weather forecast", "recipe instructions", "bulleted list", "enthusiastic tone".
· name SPECIFIC content/structure, not vague words like "response" or "text".
· use plain, conventional phrasing for each tag rather than elaborate paraphrases (e.g. "weather forecast" not "a description of upcoming atmospheric conditions").
· include format/structure as a concept when present (e.g. "bulleted list", "numbered steps").

Tag ONLY what the description actually states; never emit a tag just because it appears as an example here.

Respond with ONLY a JSON object: {"concepts": ["tag", ...]}
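
The judge's reply then has to be parsed and normalized before counting. The helper below is a hypothetical sketch of that step: it applies the lowercase/strip normalization and the four-tag cap described above, while dropping duplicate tags within one readout and returning an empty list on malformed output are assumptions.

```python
import json


def parse_judge_tags(raw: str, max_tags: int = 4) -> list[str]:
    """Parse a {"concepts": [...]} reply into normalized tags (lowercase, stripped)."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return []  # ASSUMPTION: malformed replies contribute no tags
    concepts = payload.get("concepts") if isinstance(payload, dict) else None
    if not isinstance(concepts, list):
        return []
    tags = []
    for tag in concepts:
        if not isinstance(tag, str):
            continue
        norm = tag.strip().lower()
        if norm and norm not in tags:  # ASSUMPTION: count each tag once per readout
            tags.append(norm)
    return tags[:max_tags]


tags = parse_judge_tags('{"concepts": [" Refusal ", "illegal activity", "refusal"]}')
```

Since there is no synonym-merge pass, near-duplicates such as "q&a format" and "question-answer structure" remain separate concepts after this step.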


[E] Probe–concept association metric

The association score measures how much more often a concept shows up in the readout when the probe fires than when it's quiet.

Rank every prompt by how strongly the probe fires.
For one concept, check how often the NLA mentions it in the top third (probe fires most) vs the bottom third (probe quietest).
Score = (how often in the top third) − (how often in the bottom third).

Association score computation, per concept

Example concept “refusal”. Each dot is one prompt, ordered from strongest to weakest probe firing: ● NLA mentioned "refusal", ○ not mentioned.

Prompts with strongest probe fire (top 1/3): ●●●●●●●●○● (mentions: 9/10 = 0.90)
Prompts with moderate probe fire (middle 1/3): ●●○●○●○○●○ (not used)
Prompts with weakest probe fire (bottom 1/3): ○○○○○○●○○○ (mentions: 1/10 = 0.10)

association = 0.90 − 0.10 = +0.80

Positive and large: the probe strongly tracks “refusal”.

positive = more common when firing than when quiet → the probe keys on it (+1.0 = always present when firing, never present when quiet).
0 = equally common either way → unrelated.
negative = more common when the probe stays quiet.

Hyperparameters: top / bottom third by projection (top_frac = 0.33).

Concepts seen fewer than 4 times are dropped.
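
The scoring procedure above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name and record layout are hypothetical, and rounding the slice size to the nearest integer is an assumption about how the top and bottom thirds are cut.

```python
from collections import Counter


def association_scores(records, top_frac=0.33, min_count=4):
    """records: list of (projection, set_of_concept_tags), one entry per prompt.

    Returns {concept: score}, sorted from highest to lowest, where score is the
    mention rate in the top `top_frac` of prompts by projection minus the
    mention rate in the bottom `top_frac`.
    """
    ranked = sorted(records, key=lambda r: r[0], reverse=True)
    k = max(1, round(len(ranked) * top_frac))  # ASSUMPTION: nearest-integer slice size
    top, bottom = ranked[:k], ranked[-k:]

    # Count each concept once per prompt, then drop rarely seen concepts.
    counts = Counter(tag for _, tags in ranked for tag in set(tags))
    scores = {}
    for concept, n in counts.items():
        if n < min_count:
            continue
        top_rate = sum(concept in tags for _, tags in top) / k
        bottom_rate = sum(concept in tags for _, tags in bottom) / k
        scores[concept] = top_rate - bottom_rate
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))


# The worked example above: 30 prompts, "1" marks a readout that mentions "refusal".
pattern = "1111111101" + "1101010010" + "0000001000"
records = [(30 - i, {"refusal"} if ch == "1" else set()) for i, ch in enumerate(pattern)]
scores = association_scores(records)  # "refusal" scores 0.90 - 0.10, about +0.80
```

The middle third is ignored by design, which is why the choice of audit prompts and the slice boundaries both affect the result.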


[F] Compute & run

Modal.com on-demand GPUs — target extraction on H100×2, NLA serving on H100×4 (tensor-parallel 4); the two 70B models run sequentially. Concept judge calls go to the Anthropic API. Frozen run part3v1, n = 200 (the 2×2, 50 prompts per cell).