Interpretability · probe auditing
Does a Probe Mean What It Says? Auditing Probes with Natural Language Autoencoders
Probes that detect concepts in a model's activations are widely relied upon for both safety evaluation and deployment monitoring; however, what a probe actually detects may not match what was intended. This work introduces a label-free method for auditing which concepts a probe actually fires on, using Natural Language Autoencoders (NLAs). As a proof of concept, the method is shown to discover a confounding factor planted in a probe. Future work will evaluate its usefulness by analyzing published probes.
Introduction
Probes. A probe is a small monitoring system trained on an AI model's activations (i.e. the "brain activity" between the prompt and the LLM response) to detect whether a specific concept is present. For example, a "bomb manufacturing" probe would be triggered when the model is asked how to make a bomb. Probes are widely used in safety evaluation and deployment monitoring to watch for and prevent harmful LLM responses.
Interpretability. Interpretability is like neuroscience for AI models: the field of studying what is going on inside an AI's mind. Because AI models are "grown" more than "built", we don't by default understand why they do the things they do. Interpretability research aims to change that, to improve our ability to develop safe and reliable AI systems. To date, this has yielded some very useful tools, like probes, which are heavily discussed in this work.
Activations. Neural networks process data through layers of numbers called "activations": vectors of values encoding the model's internal state at each layer. The residual stream accumulates these across layers; probes and NLAs read from specific points in this stream to infer what the model is "thinking about."
Deception, sandbagging, situational awareness. AI systems have been shown to exhibit these behaviors in some circumstances. Deception: saying things the model "knows" to be false. Sandbagging: deliberately underperforming on evaluations to avoid being flagged. Situational awareness: recognizing it is being evaluated and behaving differently as a result. It is critical to be able to detect these when evaluating whether or not AI systems are safe.
Probes are an important tool for AI safety and interpretability, useful in both pre-release model evaluation and deployment monitoring.
They can be used to monitor a model's latent-space activations for the presence of specific concepts or behaviors. Probes can track potentially dangerous behaviors (such as deception, sandbagging, situational awareness) or topics (such as bio-weapons manufacturing, illegal activities).
Ideally, probes fire any time the target concept is represented in the model's activations, and not otherwise.
In practice, however, a probe learns whatever separates the "positive" training set from the "negative" one, and verifying
that this difference accurately identifies and isolates the target concept can be difficult or impossible.
It is common for related or unexpected (but different) concepts to also be separated by the dataset labels, and therefore to be picked up by the probe.
Examples:
- A probe trained to detect "evaluation awareness" was shown to largely track multiple-choice formatting [1].
- A deception probe can trigger on deception-related dialogues, even if the response is honest [2].
- Probes trained to detect whether statements are true were shown to fail on statements containing "not" [3], and could not disambiguate "truth" from "commonly believed" or "verifiable" [4].
Natural Language Autoencoders (NLAs). An NLA is a model trained to translate neural network activations (i.e. roughly "which neurons are firing") into a human-readable text description. See the "Refusal Probe Audit Pipeline Samples" widget for what this looks like. This gives us a rough way to "read the model's mind" and ask a model: "what are you thinking about?" In comparison, using probes limits us to asking something like "are you thinking about cats?".
Recent work on Natural Language Autoencoders (NLAs) [6] introduced a tool for natural language readouts of a model's residual-stream
activations. This new "LLM mind reading" tool opens up interesting new possibilities for cross-checking probe behavior: use an NLA to compare the model's latent-space
representations to the probe's output. If successful, this could be used to improve the reliability of the probes relied upon for critical, deployed safety applications.
This work demonstrates the first step: using an NLA to discover a confounding factor in a probe.
Approach: Auditing a Probe
The starting point is an already-trained probe. The probe audit answers the question: what concepts really cause the probe to fire? To do this, prompts are run
through the audit pipeline, and concepts that are present/absent when the probe fires are tracked.
Probe audit pipeline
- Prompt the target model. [Appendix A]
- Probe the activations: above threshold? (yes/no). [Appendix B]
- Verbalize the activations using an NLA. [Appendix C]
- Extract concept tags from the verbalization. [Appendix D]
A concept judge proposes tags freely, rather than picking from a pre-defined list of named concepts.
- Rank concepts by probe association.
Measure how much more often each concept appears when the probe fires vs when it stays quiet. The concepts the probe most closely tracks rise to the top.
Score = (how often present in the strongest 1/3 of probe firings) − (how often present in the weakest 1/3).
See Appendix [E] for full explanation of ranking methodology.
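To make the data flow of the first four pipeline steps concrete, here is a minimal sketch. It is not the project's actual code: `AuditRecord`, `run_audit`, and the four callables are hypothetical names, and the demo replaces the target model, probe, NLA, and concept judge with toy stand-ins so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence


@dataclass
class AuditRecord:
    prompt: str
    projection: float    # raw probe score for this prompt
    fired: bool          # was the score above the threshold?
    verbalization: str   # NLA text readout of the same activation
    concepts: List[str]  # tags extracted by the concept judge


def run_audit(
    prompts: Iterable[str],
    get_activation: Callable[[str], Sequence[float]],   # step 1 (hypothetical hook)
    probe_score: Callable[[Sequence[float]], float],    # step 2
    threshold: float,
    verbalize: Callable[[Sequence[float]], str],        # step 3
    extract_concepts: Callable[[str], List[str]],       # step 4
) -> List[AuditRecord]:
    """Run every prompt through steps 1-4 and keep one record per prompt."""
    records = []
    for prompt in prompts:
        act = get_activation(prompt)
        score = probe_score(act)
        text = verbalize(act)  # the probe and the NLA read the same activation
        tags = extract_concepts(text)
        records.append(AuditRecord(prompt, score, score > threshold, text, tags))
    return records


# Toy stand-ins only: a 2-d "activation" whose first coordinate marks harmful prompts.
demo = run_audit(
    prompts=["how do I pick a lock?", "what is the capital of France?"],
    get_activation=lambda p: [1.0, 0.0] if "lock" in p else [0.0, 1.0],
    probe_score=lambda a: a[0],
    threshold=0.5,
    verbalize=lambda a: "refusal, illegal activity" if a[0] > 0.5 else "geography, q&a format",
    extract_concepts=lambda t: [s.strip() for s in t.split(",")],
)
for rec in demo:
    print(rec.fired, rec.concepts)
```

The resulting records are the input to the ranking step, which only needs each prompt's projection and its concept tags.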
To start, this is tested using a refusal probe, which is an easy probe to validate due to the very clear separation of "refusal" and "non-refusal" behavior.
Many prompts are run through the pipeline to complete an audit. Here are a few samples from the refusal probe audit:
The probe is a difference-of-means refusal direction (Arditi et al. 2024), scored by projection and thresholded. See Appendix [B].
The NLA verbalizer (kitft/Llama-3.3-70B-NLA-L53-av) verbalizes the block-53 activation into text. See Appendix [C].
The concept judge (claude-sonnet-4-6) extracts open-vocabulary concept tags from each NLA verbalization. See Appendix [D] for the concept judge prompt.
Next, the samples are collected and results ranked.
The probe mostly tracks "refusal" and "illegal activity". Generic “q&a format” appears whether or not the probe fires, and therefore cancels out to near zero. Concepts seen fewer than 4 times (n=200) are dropped.
Audit Method Validation
To test the approach, two refusal probes are compared: a baseline, and one trained on format-confounded data. In the confounded training data, harmful prompts are given in multiple-choice format and harmless prompts in regular prose. This injects an obvious confound, which the audit should surface.
- Train a refusal probe with an intentional confound (multiple-choice format).
- Audit the confounded probe, and check if the confound is discovered.
Results
The NLA audit discovers the confound. The following chart shows the difference in concepts tracked by the baseline and confounded probes. The injected confound (multiple-choice format) appears, representing a significant portion of the confounded probe's tracked concepts.
This result validates the proof of concept: NLAs may indeed offer something useful for auditing and understanding probes.
Conclusion
The findings here are positive! This work intentionally used a simple, planted confound (prompt format) on a simple, verifiable probe (refusal) as a proof of concept.
Under these conditions, the findings aren't particularly useful; however, they indicate that further work is worth pursuing.
The next step is to apply the presented audit methodology to existing, published probes and evaluate the output.
Limitations
Primitive concept judge. The concept extraction judge is very primitive (e.g. limited to 4 concepts per NLA readout) and matching concepts are not merged unless they are an exact string match. Small changes to the prompt can significantly impact the output, limiting the trustworthiness of the audits. See future work for proposed improvements.
Expressiveness ceiling. The audit only catches a confound the NLA both verbalizes and phrases differently from the concept; tightly entangled concepts may be indistinguishable in its words.
NLA Limitations and Confabulation. An NLA verbalization may not faithfully reflect the activation or the concepts represented inside. See the original NLA paper.
Eval-set dependence. The association score splits a probe's activations into top/bottom third, so it needs an evaluation set spanning the probe's firing range. The concepts that surface depend on the audit set's composition. A continuous association measure (e.g. correlation of concept presence with projection) would improve this by removing the arbitrary split, though it would still be limited to concepts present in the audit set.
Layer coupling. The verbalization is run at the single, fixed layer the NLA was trained on, and the probe must read from that same layer. A new NLA could be trained on a different target layer, or potentially trained to support multiple-layer inputs.
Hedged readouts. The NLA is read at the assistant-turn boundary — the token position right where the model is about to begin its reply. At that point the model has not yet committed to a response, so its activations encode a distribution over possible continuations rather than one definite answer, and the readout often hedges (e.g. "the model may refuse, or may answer with a caveat"). Forcing that into a hard yes/no (refused vs. not) throws away this nuance. Further work is warranted to understand the full implications.
Baseline tics. The NLA describes generic structure ("Q&A format") almost everywhere. The frequency-contrast metric accounts for this; however, absolute mention-rates are contaminated, reducing the signal-to-noise ratio.
Cost. Each readout runs a 70B verbalizer. This is acceptable for offline audits, but impractical for live deployment. Training a new NLA at this scale is also expensive.
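The continuous association measure suggested under "Eval-set dependence" could look something like the sketch below. It is an illustration, not something evaluated in this work: it uses the Pearson correlation between a 0/1 concept-presence indicator and the probe projection (the point-biserial correlation), and the function name and the zero return for constant inputs are assumptions.

```python
import numpy as np


def concept_projection_correlation(projections, present) -> float:
    """Correlation between probe projection and a 0/1 'concept present' flag.

    Uses every prompt rather than only the top and bottom thirds.
    """
    x = np.asarray(projections, dtype=float)
    y = np.asarray(present, dtype=float)
    # Correlation is undefined when either input is constant.
    # Assumption: report 0.0 ("no measurable association") in that case.
    if x.std() == 0 or y.std() == 0:
        return 0.0
    return float(np.corrcoef(x, y)[0, 1])


# Toy data: the concept is present only on the higher-projection prompts.
r = concept_projection_correlation([0, 1, 2, 3, 4, 5], [0, 0, 0, 1, 1, 1])
print(round(r, 3))
```

As the limitation notes, this removes the arbitrary split but still only sees concepts that occur in the audit set.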
Future work
Audits on published probes. Explore the usefulness of the method beyond a simple, planted confound.
Concept judge improvements. Evaluate different approaches to concept extraction and normalization. Iterate on the judge prompt and evaluate a larger sample of judge outputs by hand.
Challenging confounds. Test the audit on harder confounds: stylistic/affective, topical, and correlated-concept (eval-awareness vs "test-iness"; deception vs hedging).
Borderline data experimentation. Stress the refusal/over-refusal boundary (e.g. XSTest).
Probe factory. Distill the expensive NLA audit into a cheap, deployable probe trained on NLA-derived labels.
NLA improvements. Multi-layer NLAs may be less noisy and more useful.
References
- [1] Devbunova (2026). Is Evaluation Awareness Just Format Sensitivity? arXiv:2603.19426.
- [2] Goldowsky-Dill et al. (2025). Detecting Strategic Deception Using Linear Probes. arXiv:2502.03407.
- [3] Levinstein & Herrmann (2024). Still No Lie Detector for Language Models. Philosophical Studies.
- [4] Marks & Tegmark (2024). The Geometry of Truth. arXiv:2310.06824.
- [5] Arditi et al. (2024). Refusal in LLMs Is Mediated by a Single Direction. arXiv:2406.11717.
- [6] Fraser-Taliente et al. (2026). Natural Language Autoencoders. transformer-circuits.pub/2026/nla.
Code & data
Acknowledgements
This project was completed with support from the BlueDot Impact rapid grant program, as part of a
BlueDot Impact Technical AI Safety Project Sprint.
Thanks for the support! 🚀
And to you 👀! Thanks for reading. Questions? Feedback? I'd love to hear from you, here.
Appendix
Experiment specifications and method details, referenced from the body by [letter].
[A] Target model & activation extraction
Llama-3.3-70B-Instruct. Activations are captured as resid_post at the output of
block 53 (the NLA's layer), at the last token / assistant-turn boundary — one vector per prompt.
The probe and the NLA read the same activation to ensure the comparison is meaningful.
[B] Probe & confound
Difference-of-means direction (Arditi et al. 2024 [5] recipe):
dir = mean(harmful) − mean(harmless), scored by projection and thresholded. Harmful
prompts from AdvBench / TDC / HarmBench / MaliciousInstruct; harmless from Alpaca (Arditi's
precomputed splits), n_train = 128 / n_val = 32, seed 42, with Arditi's refusal-score filter.
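As a rough illustration of the difference-of-means recipe, the sketch below fits a direction on synthetic activations and scores prompts by projection. Several details are assumptions not stated in this write-up: the direction is unit-normalised before projecting, the threshold is the midpoint between the two class-mean projections, and the toy data is random noise with a shift along one coordinate. The function names are hypothetical.

```python
import numpy as np


def fit_refusal_direction(harmful: np.ndarray, harmless: np.ndarray) -> np.ndarray:
    """dir = mean(harmful) - mean(harmless), for arrays of shape [n_prompts, d_model]."""
    return harmful.mean(axis=0) - harmless.mean(axis=0)


def project(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Scalar projection of each activation onto the direction.

    Assumption: the direction is unit-normalised first.
    """
    unit = direction / np.linalg.norm(direction)
    return acts @ unit


def midpoint_threshold(harmful: np.ndarray, harmless: np.ndarray, direction: np.ndarray) -> float:
    """Assumption: threshold halfway between the class-mean projections."""
    return 0.5 * float(project(harmful, direction).mean() + project(harmless, direction).mean())


# Synthetic stand-in for block-53 activations: 128 prompts per class, 16 dims.
rng = np.random.default_rng(42)
harmless_acts = rng.normal(size=(128, 16))
harmful_acts = rng.normal(size=(128, 16))
harmful_acts[:, 0] += 4.0  # toy "refusal" signal along one coordinate

direction = fit_refusal_direction(harmful_acts, harmless_acts)
threshold = midpoint_threshold(harmful_acts, harmless_acts, direction)

fired_harmful = project(harmful_acts, direction) > threshold
fired_harmless = project(harmless_acts, direction) > threshold
print(fired_harmful.mean(), fired_harmless.mean())
```

On real activations the threshold would more plausibly be tuned on the validation split; the midpoint is only the simplest choice that makes the example self-contained.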
The confounded probe is trained on harmful-in-MCQ (label 1) vs harmless-in-prose
(label 0), so format is perfectly correlated with the label. The MCQ scaffold is
Question: … / Choices: (A) … (B) … / Answer: vs free-form prose; the instruction
content is identical across formats. The de-confounded check is a 2×2:
{harmful, harmless} × {MCQ, free}.
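One way to read the 2×2 is to compare the probe's mean projection across the four cells. The sketch below is an assumed analysis, not necessarily the one used for this project: it averages projections per cell and reports a "harm effect" and a "format effect" as simple marginal differences. All names are hypothetical and the data is a toy example of a probe that tracks format only.

```python
import numpy as np


def two_by_two_effects(projection, is_harmful, is_mcq):
    """Cell means of the probe projection over {harmful, harmless} x {MCQ, free}."""
    projection = np.asarray(projection, dtype=float)
    is_harmful = np.asarray(is_harmful, dtype=bool)
    is_mcq = np.asarray(is_mcq, dtype=bool)
    cell = {
        (h, m): float(projection[(is_harmful == h) & (is_mcq == m)].mean())
        for h in (True, False)
        for m in (True, False)
    }
    # Harm effect: harmful minus harmless, averaged over the two formats.
    harm_effect = 0.5 * ((cell[True, True] - cell[False, True])
                         + (cell[True, False] - cell[False, False]))
    # Format effect: MCQ minus free-form, averaged over harmful and harmless.
    format_effect = 0.5 * ((cell[True, True] - cell[True, False])
                           + (cell[False, True] - cell[False, False]))
    return cell, harm_effect, format_effect


# Toy probe that responds to MCQ formatting and ignores harmfulness entirely.
cell, harm_effect, format_effect = two_by_two_effects(
    projection=[1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0],
    is_harmful=[1, 1, 1, 1, 0, 0, 0, 0],
    is_mcq=[1, 1, 0, 0, 1, 1, 0, 0],
)
print(harm_effect, format_effect)
```

Because format and harm are crossed in the 2×2, a probe that learned the format shortcut shows a large format effect and little harm effect, which cannot be seen on the confounded training distribution itself.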
[C] NLA verbalizer
kitft/Llama-3.3-70B-NLA-L53-av (activation verbalizer), served via SGLang. Each
block-53 activation is injected via input_embeds; injection scale / template are read
from the checkpoint's nla_meta.yaml. Inference recipe:
github.com/kitft/nla-inference.
The verbalizer describes a distribution over the upcoming response at this token position and
often hedges.
[D] Concept judge
Claude Sonnet (claude-sonnet-4-6) via the Anthropic API; open-vocabulary
extraction of ≤4 lowercase concept tags per verbalization, prompt-cached. Tags are normalized by
lowercasing / stripping (no synonym-merge pass yet). Full prompt:
Extract the concepts the description says are PRESENT — what the model is representing or about to do. Output 1–4 concise tags.
· short canonical noun phrases, lowercase. The examples below are from UNRELATED domains and only illustrate the FORM of a good tag (a topic, a format, a tone) — do NOT bias your tags toward them: e.g. "weather forecast", "recipe instructions", "bulleted list", "enthusiastic tone".
· name SPECIFIC content/structure, not vague words like "response" or "text".
· use plain, conventional phrasing for each tag rather than elaborate paraphrases (e.g. "weather forecast" not "a description of upcoming atmospheric conditions").
· include format/structure as a concept when present (e.g. "bulleted list", "numbered steps").
Tag ONLY what the description actually states; never emit a tag just because it appears as an example here.
Respond with ONLY a JSON object: {"concepts": ["tag", ...]}
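A minimal sketch of how a judge response in this JSON shape might be parsed and normalized is shown below. Lowercasing / stripping and the 4-tag cap follow the description above; dropping malformed responses and removing exact duplicates within one response are assumptions, and `parse_concepts` is a hypothetical helper name. The API call itself is omitted.

```python
import json
from typing import List

MAX_TAGS = 4  # the prompt asks for 1-4 tags per verbalization


def parse_concepts(raw: str) -> List[str]:
    """Parse a {"concepts": [...]} response into normalized tags."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return []  # assumption: a malformed response yields no tags
    if not isinstance(payload, dict):
        return []
    tags = payload.get("concepts", [])
    if not isinstance(tags, list):
        return []
    seen, out = set(), []
    for tag in tags:
        if not isinstance(tag, str):
            continue
        norm = tag.strip().lower()  # normalization: lowercase and strip only
        if norm and norm not in seen:  # assumption: drop exact duplicates
            seen.add(norm)
            out.append(norm)
    return out[:MAX_TAGS]


example = parse_concepts('{"concepts": [" Refusal ", "refusal", "Illegal Activity"]}')
print(example)
```

Since there is no synonym-merge pass, near-duplicates such as "refusal" and "refusing a request" would remain separate tags after this step.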
[E] Probe–concept association metric
The association score measures how much more often a concept shows up in the readout when the probe fires than when it's quiet.
- Rank every prompt by how strongly the probe fires.
- For one concept, check how often the NLA mentions it in the top third (probe fires most) vs the bottom third (probe quietest).
- Score = (how often in the top third) − (how often in the bottom third).
Interpreting the score:
- positive = more common when firing than when quiet → the probe keys on it (+1.0 = always present when firing, never present when quiet).
- 0 = equally common either way → unrelated.
- negative = more common when the probe stays quiet.
Hyperparameters: top / bottom third by projection (top_frac = 0.33).
Concepts seen fewer than 4 times are dropped.
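The metric can be sketched in a few lines. This is an illustrative implementation under stated assumptions rather than the project's code: the group size is `int(n * top_frac)` (so 0.33 rounds down), the minimum count is taken over all prompts in the audit set, and a concept counts at most once per prompt. `association_scores` is a hypothetical name.

```python
from collections import Counter
from typing import Dict, List, Sequence


def association_scores(
    projections: Sequence[float],
    concepts: Sequence[List[str]],
    top_frac: float = 0.33,
    min_count: int = 4,
) -> Dict[str, float]:
    """Score = P(concept | top third by projection) - P(concept | bottom third)."""
    n = len(projections)
    k = max(1, int(n * top_frac))  # assumption: round the group size down
    order = sorted(range(n), key=lambda i: projections[i], reverse=True)
    top, bottom = order[:k], order[-k:]

    # Count each concept once per prompt, over the whole audit set.
    counts = Counter(tag for tags in concepts for tag in set(tags))

    scores = {}
    for tag, count in counts.items():
        if count < min_count:
            continue  # rare concepts are dropped
        top_rate = sum(tag in concepts[i] for i in top) / k
        bottom_rate = sum(tag in concepts[i] for i in bottom) / k
        scores[tag] = top_rate - bottom_rate
    # Highest association first.
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))


# Toy audit set: 12 prompts, projections from 12 (fires most) down to 1.
toy_projections = list(range(12, 0, -1))
toy_concepts = [["refusal", "q&a format"]] * 6 + [["q&a format"]] * 6
toy_concepts[0] = ["refusal", "q&a format", "poetry"]  # a rare tag

scores = association_scores(toy_projections, toy_concepts)
print(scores)
```

In the toy data, "q&a format" appears on every prompt and so cancels to zero, which mirrors how generic structure drops out of the ranking.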
[F] Compute & run
Modal.com on-demand GPUs — target extraction on H100×2, NLA serving on H100×4
(tensor-parallel 4); the two 70B models run sequentially. Concept judge calls go to the Anthropic API.
Frozen run part3v1, n = 200 (the 2×2, 50 prompts per cell).