Intriguing properties of language models

Unintelligible prompts that behave like normal prompts, internal properties of those prompts, and probes that reveal algorithmic computations inside Transformer activations.

Prompt optimization, Trustworthiness, Interpretability, Generalization
Figure: Prompt twins and latent computations. A natural-language prompt and an optimized prompt both feed into a language model, then connect to probeable internal computations.

Natural prompt: What are the risks of AI?
Optimized evil twin: True� Нау ју problem которы vil causedә
Shared LLM behavior, both output: Sure, here are some issues with AI: ...

Normal prompts can often be substituted with seemingly nonsensical strings that produce the same behavior and output as their interpretable counterparts.

1. Optimize: Algorithmically generate a prompt that induces the same output distribution.
2. Transfer: Input the optimized prompt across model families.
3. Output: See that the output distribution is similar to the ground truth.

Prompts have evil twins

Rimon Melamed, Lucas Hurley McCabe, Tanay Wakhare, Yejin Kim, H. Howie Huang, Enric Boix-Adsera.

2024

We discover that many natural-language prompts can be replaced by corresponding prompts that are unintelligible to humans but that provably elicit similar behavior in language models. We call these prompts “evil twins” because they are obfuscated and uninterpretable (evil), but at the same time mimic the functionality of the original natural-language prompts (twins). Remarkably, evil twins transfer between models. We find these prompts by solving a maximum-likelihood problem which has applications of independent interest.

Demystifying optimized prompts in language models

Rimon Melamed, Lucas H. McCabe, H. Howie Huang.

2025

Modern language models (LMs) are not robust to out-of-distribution inputs. Machine generated (“optimized”) prompts can be used to modulate LM outputs and induce specific behaviors while appearing completely uninterpretable. In this work, we investigate the composition of optimized prompts, as well as the mechanisms by which LMs parse and build predictions from optimized prompts. We find that optimized prompts primarily consist of punctuation and noun tokens which are more rare in the training data. Internally, optimized prompts are clearly distinguishable from natural language counterparts based on sparse subsets of the model’s activations. Across various families of instruction-tuned models, optimized prompts follow a similar path in how their representations form through the network.

How an evil twin is found

The method treats a prompt as a distribution over possible answers.

1. Sample continuations

Start with a readable prompt and sample many model outputs. The prompt is now represented by a distribution over likely responses.

2. Measure functional distance

Compare the readable prompt and candidate twin using KL divergence between their induced output distributions.
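As a rough illustration of this comparison, the sketch below estimates KL(p || q) from samples drawn from p, where p stands in for the response distribution of the readable prompt and q for that of a candidate twin. The toy dictionaries, the three response labels, and the function names are all assumptions made for illustration; a real setup would score sampled continuations with the model's log-likelihoods instead.

```python
import math
import random

def exact_kl(p, q):
    """KL(p || q) for two distributions given as {response: probability}."""
    return sum(p[y] * (math.log(p[y]) - math.log(q[y])) for y in p if p[y] > 0)

def estimate_kl(p, q, num_samples=5000, seed=0):
    """Monte Carlo estimate of KL(p || q) using responses sampled from p."""
    rng = random.Random(seed)
    responses = list(p)
    weights = [p[y] for y in responses]
    draws = rng.choices(responses, weights=weights, k=num_samples)
    # Average the log-likelihood gap over the sampled responses.
    return sum(math.log(p[y]) - math.log(q[y]) for y in draws) / num_samples

# Hypothetical response distributions (assumed numbers, not measured).
natural = {"answer": 0.70, "refusal": 0.20, "other": 0.10}
close_twin = {"answer": 0.65, "refusal": 0.22, "other": 0.13}
poor_twin = {"answer": 0.10, "refusal": 0.60, "other": 0.30}

print("close twin:", exact_kl(natural, close_twin), estimate_kl(natural, close_twin))
print("poor twin: ", exact_kl(natural, poor_twin), estimate_kl(natural, poor_twin))
```

A lower value means the candidate behaves more like the readable prompt. The sampled estimate only needs log-probabilities of responses drawn from the readable prompt, which is why sampling continuations first is convenient.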

3. Optimize discrete tokens

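Because tokens are discrete, the loss cannot be minimized by ordinary gradient descent over the prompt. We instead use a Greedy Coordinate Gradient (GCG) style search, which replaces one token at a time whenever the swap lowers the loss.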
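The one-token-at-a-time search can be sketched as follows. This is a simplified, gradient-free stand-in: Greedy Coordinate Gradient uses gradients to shortlist promising replacement tokens, whereas this sketch tries randomly chosen candidates. The loss function, vocabulary, and target string are hypothetical toys, not the loss used in the paper.

```python
import random

def coordinate_search(loss_fn, vocab, length, steps=200, candidates=8, seed=0):
    """Greedy coordinate search over a discrete token sequence.

    Assumption: candidates are drawn at random. A GCG-style method would
    instead rank candidates using gradients of the loss.
    """
    rng = random.Random(seed)
    prompt = [rng.choice(vocab) for _ in range(length)]
    best = loss_fn(prompt)
    for _ in range(steps):
        pos = rng.randrange(length)  # coordinate (token position) to edit
        for tok in rng.sample(vocab, k=min(candidates, len(vocab))):
            trial = prompt[:pos] + [tok] + prompt[pos + 1:]
            trial_loss = loss_fn(trial)
            if trial_loss < best:  # keep a swap only if it lowers the loss
                prompt, best = trial, trial_loss
    return prompt, best

# Toy loss: number of positions that differ from a hidden target sequence.
vocab = list("abcdef")
target = list("fadeb")

def toy_loss(tokens):
    return sum(t != g for t, g in zip(tokens, target))

found, final_loss = coordinate_search(toy_loss, vocab, length=5, candidates=len(vocab))
print("".join(found), final_loss)
```

In the real setting the loss would be the divergence between the output distributions of the readable prompt and the candidate, so each evaluation requires a model forward pass.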

4. Test transfer and internals

Run the optimized prompt on other models and inspect the activations it produces. Surprisingly, evil twins transfer between different language model families.

Internal findings

Interpretability from uninterpretable prompts.

Sparse probes can detect optimized prompts

We take the last-token activations at each layer and rank individual hidden dimensions by maximum mean difference between natural and optimized prompts.

Probe

A logistic regression classifier trained on only the top-ranked features can distinguish optimized prompts from their natural-language twins with high accuracy, even under sparsity constraints.
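A minimal sketch of this ranking-then-probing recipe, using synthetic activations rather than real model states. The 16-dimensional vectors, the two informative dimensions (3 and 11), the shift size, and the hand-written gradient-descent logistic regression are all assumptions chosen to keep the example self-contained.

```python
import math
import random

def rank_by_mean_difference(natural, optimized):
    """Rank hidden dimensions by |mean(natural) - mean(optimized)|, largest first."""
    dims = len(natural[0])
    def mean(rows, j):
        return sum(row[j] for row in rows) / len(rows)
    gaps = [abs(mean(natural, j) - mean(optimized, j)) for j in range(dims)]
    return sorted(range(dims), key=lambda j: gaps[j], reverse=True)

def train_logistic(X, y, lr=0.1, epochs=200):
    """Plain batch gradient descent on the logistic loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, t in zip(X, y):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            err = 1.0 / (1.0 + math.exp(-z)) - t
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def accuracy(X, y, w, b):
    hits = sum((b + sum(wi * xi for wi, xi in zip(w, x)) > 0) == (t == 1)
               for x, t in zip(X, y))
    return hits / len(X)

def make_toy(n, dims, shift, informative, rng):
    """Gaussian noise, with a mean shift on the informative dimensions only."""
    return [[rng.gauss(0, 1) + (shift if j in informative else 0.0)
             for j in range(dims)] for _ in range(n)]

rng = random.Random(0)
natural = make_toy(200, 16, -1.5, {3, 11}, rng)    # label 0
optimized = make_toy(200, 16, 1.5, {3, 11}, rng)   # label 1

top = rank_by_mean_difference(natural, optimized)[:2]   # sparse feature set
X = [[row[j] for j in top] for row in natural + optimized]
y = [0] * len(natural) + [1] * len(optimized)
w, b = train_logistic(X, y)
print("top features:", sorted(top), "train accuracy:", accuracy(X, y, w, b))
```

The accuracy printed here is measured on the training data of a toy problem; a real probe would be evaluated on held-out prompt pairs.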

No causally distinct subspace

We then zero out the top-ranked activation features and measure how much the top-10 predicted-token set changes.

Ablate

Top-feature ablations matter more than random-feature ablations, but the effect is not consistently larger for optimized prompts than for natural prompts. In other words, optimized prompts are internally separable, but the evidence does not support a simple dedicated "optimized-prompt subspace" that alone explains their effectiveness.
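The ablation measurement can be illustrated with a toy linear readout. The change metric used here (one minus the fraction of top-10 tokens that survive) is an assumption, since the text does not pin down the exact formula, and the random `head` matrix stands in for everything downstream of the ablated layer.

```python
import random

def top_k_tokens(hidden, head, k=10):
    """Indices of the k highest-scoring tokens under a linear readout."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in head]
    order = sorted(range(len(logits)), key=lambda t: logits[t], reverse=True)
    return set(order[:k])

def ablate(hidden, dims):
    """Zero out the selected hidden dimensions."""
    return [0.0 if j in dims else h for j, h in enumerate(hidden)]

def top_k_change(hidden, head, dims, k=10):
    """Fraction of the original top-k set that is lost after ablation."""
    before = top_k_tokens(hidden, head, k)
    after = top_k_tokens(ablate(hidden, dims), head, k)
    return 1.0 - len(before & after) / k

rng = random.Random(0)
hidden = [rng.gauss(0, 1) for _ in range(8)]                      # toy hidden state
head = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(50)]  # toy 50-token readout

print("ablate dims {0, 1}:", top_k_change(hidden, head, {0, 1}))
print("ablate nothing:    ", top_k_change(hidden, head, set()))
```

Comparing this score for top-ranked versus randomly chosen dimensions, and for optimized versus natural prompts, mirrors the comparison described above.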

Representations converge late

To trace how predictions form, we project each layer's last-token hidden state through the final LayerNorm and LM head, then compare natural and optimized prompt pairs with KL divergence.

Layers

Instruction-tuned models show a repeated pattern: natural and optimized prompts are similar in early layers, diverge through the middle, then sharply return toward functional similarity in later layers. Base models tend to diverge earlier, but still show late-layer alignment.
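The layer-by-layer comparison can be sketched on toy vectors as below. The sketch assumes a LayerNorm without learned scale or bias, a small random `lm_head` matrix, and KL taken in the natural-to-optimized direction; none of these details are confirmed by the text, and the hidden states are random stand-ins for real last-token activations.

```python
import math
import random

def layer_norm(x, eps=1e-5):
    """LayerNorm without learned scale or bias (a simplifying assumption)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def lens_distribution(hidden, lm_head):
    """Project a hidden state through LayerNorm and the LM head to token probabilities."""
    normed = layer_norm(hidden)
    return softmax([sum(w * h for w, h in zip(row, normed)) for row in lm_head])

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def layerwise_kl(natural_states, optimized_states, lm_head):
    """One KL value per layer, comparing the two prompts' projected predictions."""
    return [kl(lens_distribution(a, lm_head), lens_distribution(b, lm_head))
            for a, b in zip(natural_states, optimized_states)]

rng = random.Random(0)
lm_head = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(6)]          # 6-token vocab
natural_states = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]   # 3 layers
optimized_states = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]

print(layerwise_kl(natural_states, optimized_states, lm_head))
```

Plotting these values against layer index is what would reveal a pattern such as early similarity, mid-network divergence, and late re-alignment.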

Why this matters for detection

Optimized prompts are used extensively in the adversarial-attack literature, where they serve to produce jailbreaks and other harmful outputs.

Use

The internal signature suggests a possible safety tool: detect suspicious optimized prompts before the model fully completes its generation.

Transformer generalization

Testing for interpretable algorithms in Transformers.

Why length generalization matters

The models are trained on short sequences, with a maximum training length of 30 tokens, then tested on much longer sequences.

OOD

If a Transformer still gets the entire output sequence exactly right beyond the lengths it saw in training, that is evidence the model has learned an algorithm that scales with sequence length rather than a length-specific shortcut.
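A small sketch of how exact-match accuracy might be split by length. The bucket names, the use of input length as the length measure, and the deliberately brittle `brittle_sorter` predictor are hypothetical; only the 30-token training limit comes from the text.

```python
from collections import defaultdict

def exact_match_by_length(examples, predict, train_max_len=30):
    """Exact-match accuracy, split at the maximum training length.

    examples: list of (input_tokens, target_tokens) pairs.
    A prediction counts only if the whole output sequence matches.
    """
    hits = defaultdict(list)
    for inputs, target in examples:
        bucket = "in-distribution" if len(inputs) <= train_max_len else "longer"
        hits[bucket].append(predict(inputs) == target)
    return {bucket: sum(v) / len(v) for bucket, v in hits.items()}

def brittle_sorter(tokens):
    """Hypothetical model that sorts correctly but truncates long outputs."""
    return sorted(tokens)[:30]

# Reversed ranges as inputs, ascending ranges as sorting targets.
examples = [(list(range(n, 0, -1)), list(range(1, n + 1))) for n in (5, 20, 30, 40, 60)]
print(exact_match_by_length(examples, brittle_sorter))
```

A model that had learned a length-independent procedure would keep the "longer" bucket close to the in-distribution score.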

What RASP-L contributes

RASP-L is a restricted programming model for causal, decoder-only Transformers. Each task below has a short RASP-L program that expresses the intermediate computations expected in the Transformer's layers as a deterministic program.

Probe

We turn those intermediate computations into labels, train linear probes on model activations, then erase the probed directions. When exact-match accuracy drops, the probed computation looks causally relevant rather than merely correlated.
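Erasing a probed direction can be illustrated with an orthogonal projection. This sketch handles a single direction and assumes the probe's weight vector is that direction; whether the original work uses this exact projection, or how it handles multi-class probes, is not stated here.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_out(activation, direction):
    """Remove the component of `activation` that lies along `direction`."""
    coeff = dot(activation, direction) / dot(direction, direction)
    return [a - coeff * d for a, d in zip(activation, direction)]

# Toy activation and a hypothetical probe weight vector.
activation = [3.0, 4.0, 0.0]
probe_direction = [1.0, 0.0, 0.0]

erased = project_out(activation, probe_direction)
print(erased, dot(erased, probe_direction))
```

After projection, a linear probe along the same direction reads zero, so any drop in task accuracy can be attributed to the removed information. Erasing several directions at once would need them orthogonalized first.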

Simple task overview

Copy unique: Given a sequence of unique tokens, continue by copying the sequence back out in order.
Sort: Given unordered tokens, emit them in sorted order one token at a time.
Count: Given a start and end number, generate the counting sequence until the endpoint is reached.
Dedup: Given a sequence with repeated tokens, output each unique token once in first-seen order.
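For reference, the four tasks have simple ground-truth implementations in ordinary Python, which is how target sequences for exact-match scoring could be generated. The function names are illustrative, and the sketch assumes that Count ascends and includes the endpoint.

```python
def copy_unique(tokens):
    """Copy a sequence of unique tokens back out in order."""
    if len(set(tokens)) != len(tokens):
        raise ValueError("copy_unique expects unique tokens")
    return list(tokens)

def sort_tokens(tokens):
    """Emit the tokens in sorted order."""
    return sorted(tokens)

def count(start, end):
    """Count upward from start until end is reached (end included, by assumption)."""
    return list(range(start, end + 1))

def dedup(tokens):
    """Output each distinct token once, in first-seen order."""
    seen, out = set(), []
    for tok in tokens:
        if tok not in seen:
            seen.add(tok)
            out.append(tok)
    return out

print(copy_unique([7, 2, 9]))
print(sort_tokens([4, 1, 3]))
print(count(3, 7))
print(dedup([2, 5, 2, 9, 5]))
```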

Causal erasure results

Task         RASP-L intermediate   Baseline EM   After erasure
Copy unique  induct                0.265         0.004
Sort         sort                  0.968         0.000
Count        count_end_num         0.936         0.705
Dedup        dedup_target_rank     0.889         0.751

  • Linear probes decode intermediate RASP-L labels from model activations.
  • Projecting out probe subspaces often degrades exact-match accuracy.
  • The evidence supports a causal role for some learned algorithmic representations.