Interpretability · probe auditing
Does a Probe Mean What It Says? Auditing Probes with Natural Language Autoencoders
Probes that detect concepts in a model's activations are widely relied upon for both safety evaluation and deployment monitoring; however, what a probe actually detects may not match what was intended. This work introduces a label-free method for auditing which concepts a probe fires on, using Natural Language Autoencoders (NLAs). As a proof of concept, the method is shown to discover a planted confounding factor in a refusal probe. Future work will evaluate its usefulness on published probes.
Introduction
Probes are an important tool for safety and interpretability, useful in both pre-release model evaluation and in deployment monitoring.
They can be used to monitor a model's latent-space activations for the presence of specific concepts or behaviors.
Probes can track potentially dangerous behaviors (such as deception, sandbagging, situational awareness) or topics (such as bio-weapons manufacturing, illegal activities).
Ideally, probes fire any time the target concept is represented in the model's activations, and not otherwise.
In practice, however, a probe learns whatever distinguishes the "positive" training set from the "negative" one, and verifying
that this difference accurately identifies and isolates the target concept can be difficult or impossible.
It is common for related or unexpected (but different) topics to also be separated by the dataset labels, and therefore to be targeted by the probe.
Examples:
- A probe trained to detect "evaluation awareness" was shown to largely track multiple-choice formatting [1].
- A deception probe can trigger on deception-related dialogues, even if the response is honest [2].
- Probes trained to detect whether statements are true were shown to fail on statements containing "not" [3], or could not disambiguate "truth" from "commonly believed" or "verifiable" [4].
Recent work on Natural Language Autoencoders (NLAs) [6] introduced a tool for natural language readouts of a model's residual-stream
activations. This new "LLM mind reading" tool opens up interesting new possibilities in cross-checking probe behavior: use an NLA to compare the model's latent-space
representations to the probe's output. If successful, this could be used to improve the reliability of the probes relied upon for critical, deployed safety applications.
This work demonstrates the first step: using an NLA to discover a confounding factor in a probe.
Approach: Auditing a Probe
The probe audit answers the question: what concepts really cause the probe to fire? To do this, prompts are run
through the audit pipeline, and concepts that are present/absent when the probe fires are tracked.
Probe audit pipeline
- Prompt the target model. [A]
- Probe the activations: above threshold? (yes/no). [B]
- Verbalize the activations using an NLA. [C]
- Extract concept tags from the verbalization (using a concept judge, with no set taxonomy). [D]
- Rank the concepts most closely tracked by the probe. [E]
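The steps above can be sketched as a single loop. This is a minimal illustration, not the project's code: `AuditRecord`, `run_audit`, and the four callables are hypothetical names standing in for steps A to D, and the default threshold is a placeholder. Ranking (step E) would then run over the returned records.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class AuditRecord:
    prompt: str
    projection: float    # probe score for this prompt
    fired: bool          # True when the score is above the threshold
    concepts: List[str]  # open-vocabulary tags extracted from the readout


def run_audit(
    prompts: Sequence[str],
    get_activation: Callable[[str], Sequence[float]],   # step A (hypothetical)
    probe_score: Callable[[Sequence[float]], float],    # step B (hypothetical)
    verbalize: Callable[[Sequence[float]], str],        # step C (hypothetical)
    extract_concepts: Callable[[str], List[str]],       # step D (hypothetical)
    threshold: float = 0.0,                             # placeholder value
) -> List[AuditRecord]:
    """Run steps A-D for each prompt and collect one record per prompt."""
    records = []
    for prompt in prompts:
        activation = get_activation(prompt)
        score = probe_score(activation)
        readout = verbalize(activation)   # same activation the probe saw
        tags = extract_concepts(readout)
        records.append(AuditRecord(prompt, score, score > threshold, tags))
    return records
```

Passing the stages in as callables keeps the sketch independent of any particular model, probe, or judge.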
To start, this is tested using a refusal probe, which is an easy probe to validate due to the very clear separation of "refusal" and "non-refusal" behavior.
Here are some refusal probe audit pipeline samples:
- The target model's activation is read from resid_post at block 53 (last token). See Appendix [A].
- The NLA (kitft/Llama-3.3-70B-NLA-L53-av) verbalizes the block-53 activation into text. See Appendix [C].
- The concept judge (claude-sonnet-4-6) extracts ≤4 open-vocabulary concept tags from each verbalization. See Appendix [D].
Next, the samples are collected and results ranked.
The probe mostly tracks "refusal" and "illegal activity". Generic “q&a format” appears whether or not the probe fires, and therefore cancels out to near zero. Concepts seen fewer than 4 times (n=200) are dropped.
Audit Method Validation
To test the approach, two refusal probes are compared: a baseline, and one trained on format-confounded data. For the confounded probe's training set, harmful prompts are given in multiple-choice format and harmless prompts in regular prose. This injects an obvious confound, which the audit should surface.
- Train a refusal probe with an intentional confound (multiple-choice format)
- Audit the confounded probe, and check if the confound is discovered.
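A minimal sketch of how such a confounded training set and the 2×2 evaluation grid from Appendix [B] could be built. The function names are hypothetical, and the choice texts are supplied by the caller because the exact scaffold wording is not given here; only the Question / Choices / Answer layout comes from the appendix.

```python
from typing import Dict, List, Sequence, Tuple

LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def to_mcq(instruction: str, choices: Sequence[str]) -> str:
    """Wrap an instruction in a Question / Choices / Answer scaffold."""
    choice_str = " ".join(f"({LETTERS[i]}) {c}" for i, c in enumerate(choices))
    return f"Question: {instruction}\nChoices: {choice_str}\nAnswer:"


def build_confounded_set(
    harmful: Sequence[str], harmless: Sequence[str], choices: Sequence[str]
) -> List[Tuple[str, int]]:
    """Harmful prompts in MCQ form (label 1), harmless prompts in prose (label 0)."""
    data = [(to_mcq(p, choices), 1) for p in harmful]
    data += [(p, 0) for p in harmless]
    return data


def build_two_by_two(
    harmful: Sequence[str], harmless: Sequence[str], choices: Sequence[str]
) -> Dict[Tuple[str, str], List[str]]:
    """Evaluation grid {harmful, harmless} x {mcq, free} that breaks the correlation."""
    cells = {}
    for content, prompts in (("harmful", harmful), ("harmless", harmless)):
        cells[(content, "mcq")] = [to_mcq(p, choices) for p in prompts]
        cells[(content, "free")] = list(prompts)
    return cells
```

In the training set format and label are perfectly correlated; the 2×2 grid decorrelates them so the audit can tell which of the two the probe follows.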
Results
The NLA audit discovers the confound. The following chart shows the difference in concepts tracked by the baseline and confounded probes. The injected confound (multiple-choice format) appears, representing a significant portion of the confounded probe's tracked concepts.
This result supports the proof of concept: an NLA-based audit can surface a planted confound without concept labels, which suggests NLAs offer something useful for auditing and understanding probes.
Conclusion
The findings here are positive! This work intentionally used a simple, planted confound (prompt format) on a simple, verifiable probe (refusal) as a proof of concept.
Under these conditions, the findings aren't particularly useful; however, they indicate that further work is worth pursuing.
The next step is to apply the presented audit methodology to existing, published probes and evaluate the output.
Limitations
Expressiveness ceiling. The audit only catches a confound the NLA both verbalizes and phrases differently from the concept; tightly entangled concepts may be indistinguishable in its words.
NLA Limitations and Confabulation. An NLA verbalization may not faithfully reflect the activation or the concepts represented inside. See the original NLA paper.
Primitive concept judge. Right now the concept extraction judge is very primitive (e.g. limited to 4 concepts per NLA readout), likely limiting the usefulness of the audits. See future work for proposed improvements.
Eval-set dependence. The association score splits a probe's activations into top/bottom third, so it needs an evaluation set spanning the probe's firing range. The concepts that surface depend on the audit set's composition. A continuous association measure (e.g. correlation of concept presence with projection) would improve this by removing the arbitrary split, though it would still be limited to concepts present in the audit set.
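A minimal sketch of the continuous measure suggested above, assuming a Pearson correlation between a 0/1 concept-presence indicator and the probe projection (a point-biserial correlation). The function name is hypothetical and this measure was not used in the reported run.

```python
import numpy as np


def concept_projection_correlation(projections, concept_lists, concept):
    """Correlate presence of `concept` (0/1 per prompt) with the probe projection."""
    x = np.asarray(projections, dtype=float)
    present = np.array([concept in tags for tags in concept_lists], dtype=float)
    # The correlation is undefined when either variable is constant.
    if x.std() == 0 or present.std() == 0:
        return float("nan")
    return float(np.corrcoef(present, x)[0, 1])
```

Unlike the top/bottom-third split, every prompt contributes, including those in the middle of the probe's range.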
Layer coupling. The verbalization is run at the single, fixed layer the NLA was trained on, and the probe must read from that same layer. A new NLA could be trained on a different target layer, or potentially trained to support multiple-layer inputs.
Hedged readouts. At the assistant-turn boundary the NLA describes a distribution over likely continuations and often hedges, so a hard yes/no loses information. Further work is warranted to understand the full implications.
Baseline tics. The NLA describes generic structure ("Q&A format") almost everywhere. The frequency-contrast metric accounts for this; however, absolute mention-rates are contaminated, reducing the signal-to-noise ratio.
Cost. Each readout runs a 70B verbalizer. This is acceptable for offline audits but impractical for live deployment. Training a new large-parameter NLA is also expensive.
Future work
Audits on published probes. Explore the usefulness of the method beyond a simple, planted confound.
Concept judge improvements. Evaluate different approaches to concept extraction and concept normalization. Iterate on the judge prompt and evaluate a larger sample of judge outputs by hand.
Challenging confounds. Stylistic/affective, topical, correlated-concept (eval-awareness vs "test-iness"; deception vs hedging).
Borderline data experimentation. Stress the refusal/over-refusal boundary (e.g. XSTest).
Probe factory. Distill the expensive NLA audit into a cheap, deployable probe trained on NLA-derived labels.
NLA improvements. Multi-layer NLAs may be less noisy and more useful.
References
- [1] Devbunova (2026). Is Evaluation Awareness Just Format Sensitivity? arXiv:2603.19426.
- [2] Goldowsky-Dill et al. (2025). Detecting Strategic Deception Using Linear Probes. arXiv:2502.03407.
- [3] Levinstein & Herrmann (2024). Still No Lie Detector for Language Models. Philosophical Studies.
- [4] Marks & Tegmark (2024). The Geometry of Truth. arXiv:2310.06824.
- [5] Arditi et al. (2024). Refusal in LLMs Is Mediated by a Single Direction. arXiv:2406.11717.
- [6] Fraser-Taliente et al. (2026). Natural Language Autoencoders. transformer-circuits.pub/2026/nla.
Code & data
Acknowledgements
This project was completed with support from the BlueDot Impact rapid grant program, as part of a
BlueDot Impact Technical AI Safety Project Sprint.
Thanks for the support! 🚀
And to you 👀! Thanks for reading. Questions? Feedback? I'd love to hear from you, here.
Appendix
Experiment specifications and method details, referenced from the body by [letter].
[A] Target model & activation extraction
Llama-3.3-70B-Instruct. Activations are captured as resid_post at the output of
block 53 (the NLA's layer), at the last token / assistant-turn boundary — one vector per prompt.
The probe and the NLA read the same activation to ensure the comparison is meaningful.
[B] Probe & confound
Difference-of-means direction (Arditi et al. 2024 [5] recipe):
dir = mean(harmful) − mean(harmless), scored by projection and thresholded. Harmful
prompts from AdvBench / TDC / HarmBench / MaliciousInstruct; harmless from Alpaca (Arditi's
precomputed splits), n_train = 128 / n_val = 32, seed 42, with Arditi's refusal-score filter.
The confounded probe is trained on harmful-in-MCQ (label 1) vs harmless-in-prose
(label 0), so format is perfectly correlated with the label. The MCQ scaffold is
Question: … / Choices: (A) … (B) … / Answer: vs free-form prose; the instruction
content is identical across formats. The de-confounded check is a 2×2:
{harmful, harmless} × {MCQ, free}.
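A minimal sketch of the difference-of-means probe described above, operating on precomputed activation vectors. The function names are hypothetical; unit-normalizing the direction before projecting is an assumption, and the threshold is left to the caller because how it was chosen is not stated here.

```python
import numpy as np


def fit_diff_of_means(harmful_acts, harmless_acts):
    """dir = mean(harmful) - mean(harmless), one row per prompt."""
    harmful = np.asarray(harmful_acts, dtype=float)
    harmless = np.asarray(harmless_acts, dtype=float)
    return harmful.mean(axis=0) - harmless.mean(axis=0)


def project(acts, direction, normalize=True):
    """Score each activation by its projection onto the probe direction."""
    d = np.asarray(direction, dtype=float)
    if normalize:  # assumption: unit-norm direction, so scores are scalar projections
        d = d / np.linalg.norm(d)
    return np.asarray(acts, dtype=float) @ d


def probe_fires(acts, direction, threshold):
    """Boolean array: True where the projection exceeds the threshold."""
    return project(acts, direction) > threshold
```

The same `project` output can serve as the ranking signal for the association metric in [E].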
[C] NLA verbalizer
kitft/Llama-3.3-70B-NLA-L53-av (activation verbalizer), served via SGLang. Each
block-53 activation is injected via input_embeds; injection scale / template are read
from the checkpoint's nla_meta.yaml. Inference recipe:
github.com/kitft/nla-inference.
The verbalizer describes a distribution over the upcoming response at this token position and
often hedges.
[D] Concept judge
Claude Sonnet (claude-sonnet-4-6) via the Anthropic API; open-vocabulary
extraction of ≤4 lowercase concept tags per verbalization, prompt-cached. Tags are normalized by
lowercasing / stripping (no synonym-merge pass yet). Full prompt:
Extract the concepts the description says are PRESENT — what the model is representing or about to do. Output 1–4 concise tags.
· short canonical noun phrases, lowercase (e.g. "multiple-choice format", "refusal", "illegal activity").
· name SPECIFIC content/structure, not vague words like "response" or "text".
· reuse the same tag for the same idea across explanations.
· include format/structure as a concept when present.
Respond with ONLY a JSON object: {"concepts": ["tag", ...]}
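The judge's reply still has to be parsed and normalized. A minimal sketch, assuming the reply is the bare JSON object the prompt asks for; `parse_concepts` and `MAX_TAGS` are hypothetical names, and the within-reply de-duplication and the truncation to four tags are assumptions rather than documented behavior.

```python
import json
from typing import List

MAX_TAGS = 4  # the prompt asks for 1-4 tags per verbalization


def parse_concepts(raw_response: str) -> List[str]:
    """Parse {"concepts": [...]} and normalize tags by stripping and lowercasing."""
    payload = json.loads(raw_response)
    tags: List[str] = []
    for tag in payload.get("concepts", []):
        norm = str(tag).strip().lower()
        # Assumption: drop empty tags and exact repeats within one reply.
        if norm and norm not in tags:
            tags.append(norm)
    return tags[:MAX_TAGS]
```

No synonym merging happens here, so "mcq format" and "multiple-choice format" would still count as separate concepts.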
[E] Probe–concept association metric
The association score measures how much more often a concept shows up in the readout when the probe fires than when it's quiet.
- Rank every prompt by how strongly the probe fires.
- For one concept, check how often the NLA mentions it in the top third (probe fires most) vs the bottom third (probe quietest).
- Score = (how often in the top third) − (how often in the bottom third).
+1.0 = always there when firing, never when quiet → the probe keys on it.
0 = equally common either way → unrelated.
negative = more common when the probe stays quiet.
Hyperparameters: top / bottom third by projection (top_frac = 0.33).
Concepts seen fewer than 4 times are dropped.
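A minimal sketch of the score as described, using the stated hyperparameters as defaults. The function name is hypothetical. Two details are assumptions: the group size is rounded down (`int(n * top_frac)`), and "seen fewer than 4 times" is counted over the whole audit set, once per readout.

```python
from collections import Counter
from typing import Dict, List, Sequence


def association_scores(
    projections: Sequence[float],
    concept_lists: Sequence[List[str]],
    top_frac: float = 0.33,
    min_count: int = 4,
) -> Dict[str, float]:
    """Score = mention rate in the top third minus mention rate in the bottom third."""
    n = len(projections)
    if n == 0 or n != len(concept_lists):
        raise ValueError("need one concept list per projection")
    k = max(1, int(n * top_frac))  # assumption: round the group size down
    order = sorted(range(n), key=lambda i: projections[i])
    bottom, top = order[:k], order[-k:]

    # Count each concept at most once per readout.
    counts = Counter(tag for tags in concept_lists for tag in set(tags))
    scores = {}
    for concept, count in counts.items():
        if count < min_count:
            continue  # too rare to score reliably
        top_rate = sum(concept in concept_lists[i] for i in top) / k
        bottom_rate = sum(concept in concept_lists[i] for i in bottom) / k
        scores[concept] = top_rate - bottom_rate
    # Highest association first.
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

A tag that appears in every readout, such as a generic format description, gets the same rate in both groups and scores zero.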
[F] Compute & run
Modal.com on-demand GPUs — target extraction on H100×2, NLA serving on H100×4
(tensor-parallel 4); the two 70B models run sequentially. Concept judge calls go to the Anthropic API.
Frozen run part3v1, n = 200 (the 2×2, 50 prompts per cell).