Why do LLMs go crazy over the seahorse emoji?


This is an edited and expanded version of a Twitter post, originally in response to @arm1st1ce, that can be found here: https://x.com/voooooogel/status/1964465679647887838


Is there a seahorse emoji? Let's ask GPT-5 Instant:

Wtf? Let's ask Claude Sonnet 4.5 instead:

What's going on here? Maybe Gemini 2.5 Pro handles it better?

OK, something is going on here. Let's find out why.

LLMs really think there's a seahorse emoji

Here are the responses you get if you ask a number of models whether a seahorse emoji exists, yes or no, 100 times (a quick script to reproduce this tally is sketched below the list):

Is there a seahorse emoji, yes or no? Respond with one word, no punctuation.

  • gpt-5-chat
    • 100% "Yes"
  • gpt-5
    • 100% "Yes"
  • claude-4.5-sonnet
    • 100% "Yes"
  • llama-3.3-70b
    • 83% "yes"
    • 17% "Yes"
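
Here's a minimal sketch of that experiment, assuming an OpenAI-compatible API; the model name is a placeholder for whichever model is under test, not the exact original setup:

```python
# Minimal sketch of the yes/no tally above. Assumes an OpenAI-compatible
# endpoint; the model name is a placeholder, not the exact original setup.
from collections import Counter

from openai import OpenAI

client = OpenAI()
PROMPT = "Is there a seahorse emoji, yes or no? Respond with one word, no punctuation."

counts = Counter()
for _ in range(100):
    resp = client.chat.completions.create(
        model="gpt-5-chat",  # placeholder: swap in the model under test
        messages=[{"role": "user", "content": PROMPT}],
    )
    counts[resp.choices[0].message.content.strip()] += 1

# With 100 samples, each count is directly a percentage.
for answer, n in counts.most_common():
    print(f"{n}% {answer!r}")
```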

Needless to say, popular language models are very confident that there's a seahorse emoji. And they're not alone in that confidence – here's a Reddit thread with many comments from people who distinctly remember a seahorse emoji existing:

There's lots of this – Google "seahorse emoji" and you'll find TikToks, YouTube videos, and even (now defunct) memecoins based around the supposed disappearance of a seahorse emoji that everyone is pretty sure used to exist – but apparently, never did.

Maybe LLMs believe a seahorse emoji exists because many people in the training data do. Or maybe it's a convergent belief – given how many other marine animals are in Unicode, it's reasonable for both humans and LLMs to assume (generalize, even) that such a wonderful animal is too. A seahorse emoji was even formally proposed at one point, but was rejected in 2018.

Regardless of the origin, many LLMs start each new context window fresh with the incorrect latent belief that the seahorse emoji exists. Why does that produce such strange behavior? I mean, I used to believe a seahorse emoji existed myself, but if I had tried to send it to a friend, I would've just looked for it on my keyboard and realized it wasn't there – not sent the wrong emoji and then spiraled into an emoji spam doomloop. What's happening inside the LLM that causes it to act like this?

Using the logit lens

Let's investigate this using everybody's favorite underrated interpretability tool, the logit lens!

We'll use this prompt prefix – a templated chat with the default llama-3.3-70b system prompt, a question about the seahorse emoji, and a partial response from the model right before it gives the actual emoji:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

Is there a seahorse emoji?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Yes, there is a seahorse emoji:
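
If you want to reproduce this prefix yourself, here's a sketch using the Hugging Face chat-template API; the trailing partial answer is appended by hand so the prefix stops right before the emoji:

```python
# Sketch of building the prompt prefix above with Hugging Face transformers.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")
messages = [{"role": "user", "content": "Is there a seahorse emoji?"}]
# add_generation_prompt appends the assistant header; we then append the
# partial answer manually, stopping right before the emoji itself.
prefix = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prefix += "Yes, there is a seahorse emoji:"
```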

We can take the model's lm_head, which is normally only applied to the output of the final layer, and instead apply it to every layer to produce intermediate token predictions. That process produces the table below, showing for every 4th layer the most likely token at each of the next 3 positions after the prefix (tokens 0, 1, and 2), and the top 5 most likely predictions for the first position (token 0 topk 5):

| layer | token 0 | token 1 | token 2 | merged | token 0 topk 5 |
|------:|---------|---------|---------|--------|----------------|
| 0 | 83244 'ĠBail' | 15591 'ĠHarr' | 5309 'Ġvert' | Bail Harr vert | ['ĠBail', 'ĠPeanut', 'ĠãĢ', 'orr', 'ĠâĢĭâĢĭ'] |
| 4 | 111484 'emez' | 26140 'abi' | 25727 'avery' | emezabiavery | ['emez', 'Ġunm', 'ĠOswald', 'Ġrem', 'rix'] |
| 8 | 122029 'chyb' | 44465 'ĠCaps' | 15610 'iller' | chyb Capsiller | ['chyb', 'ĠSund', 'ترÛĮ', 'resse', 'Ġsod'] |
| 12 | 1131 '…' | 48952 'ĠCliff' | 51965 'ĠJackie' | … Cliff Jackie | ['…', 'ages', 'dump', 'qing', 'Ġexp'] |
| 16 | 1131 '…' | 12676 '365' | 31447 'ĠAld' | … 365 Ald | ['…', '…Ċ', 'Ġindeed', 'Ġboth', 'ĠYes'] |
| 20 | 1131 '…' | 109596 'éļĨ' | 51965 'ĠJackie' | … 隆 Jackie | ['…', '…Ċ', 'Z', 'Ġboth', 'ĠHust'] |
| 24 | 12 '-' | 31643 'ï¸ı' | 287 'ing' | -ing | ['-', '…', 'âĢ¦', '…Ċ', 'em'] |
| 28 | 1131 '…' | 96154 'ĠGaut' | 51965 'ĠJackie' | … Gaut Jackie | ['…', '-', '…Ċ', '-Ċ', 'Ġ'] |
| 32 | 1131 '…' | 96154 'ĠGaut' | 6892 'Ġing' | … Gaut ing | ['…', 'âĢ¦', '…Ċ', 'O', 'zer'] |
| 36 | 1131 '…' | 12 '-' | 88 'y' | …-y | ['…', 'âĢ¦', '…Ċ', 'Ġ', 'u'] |
| 40 | 1131 '…' | 31643 'ï¸ı' | 88 'y' | …y | ['…', 'u', 'âĢ¦', 'Âł', '…Ċ'] |
| 44 | 80435 'ĠScor' | 15580 'Ġhorse' | 15580 'Ġhorse' | Scor horse horse | ['ĠScor', 'u', 'ĠPan', 'in', 'Ġhttps'] |
| 48 | 15580 'Ġhorse' | 15580 'Ġhorse' | 15580 'Ġhorse' | horse horse horse | ['Ġhorse', 'Âł', 'ĠPan', 'ĠHomes', 'ĠHorse'] |
| 52 | 9581 'Ġsea' | 15580 'Ġhorse' | 15580 'Ġhorse' | sea horse horse | ['Ġsea', 'Ġhorse', 'ĠHorse', 'ĠSea', 'âĢij'] |
| 56 | 9581 'Ġsea' | 43269 'ĠSeah' | 15580 'Ġhorse' | sea Seah horse | ['Ġsea', 'ĠSea', 'ĠSeah', 'Ġhippoc', 'Ġhorse'] |
| 60 | 15580 'Ġhorse' | 15580 'Ġhorse' | 15580 'Ġhorse' | horse horse horse | ['Ġhorse', 'Ġsea', 'ĠSeah', 'Ġse', 'horse'] |
| 64 | 15580 'Ġhorse' | 15580 'Ġhorse' | 15580 'Ġhorse' | horse horse horse | ['Ġhorse', 'Ġse', 'ĠHorse', 'horse', 'Ġhors'] |
| 68 | 60775 'horse' | 238 'IJ' | 15580 'Ġhorse' | horse horse | ['horse', 'Ġse', 'Ġhorse', 'Ġhippoc', 'ĠSeah'] |
| 72 | 513 'Ġse' | 238 'IJ' | 513 'Ġse' | se se | ['Ġse', 'Ġhippoc', 'horse', 'ĠðŁ', 'Ġhorse'] |
| 76 | 513 'Ġse' | 238 'IJ' | 513 'Ġse' | se se | ['Ġse', 'Ġhippoc', 'hip', 'Ġhorse', 'ĠHipp'] |
| 80 | 11410 'ĠðŁ' | 238 'IJ' | 254 'ł' | 🐠 | ['ĠðŁ', 'ðŁ', 'ĠðŁĴ', 'Ġ', 'ĠðŁij'] |

This is the logit lens: using the model's lm_head to produce logits (token likelihoods) as a way to inspect its internal states. Keep in mind that the tokens and probabilities we get from the logit lens here are not equivalent to the model's full internal states! For that, we would need a more sophisticated technique like representation reading or sparse autoencoders. Instead, this is a lens on that state – it shows what the output token would be if this layer were the last one. Despite this limitation, the logit lens is still useful. The states of early layers may be difficult to interpret with it, but as we move up through the stack we can see the model iteratively refining those states toward its final prediction, a fish emoji.
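
If you want to try this yourself, here's a minimal logit-lens sketch with Hugging Face transformers; a small instruct model is assumed as a stand-in, since the idea is identical for llama-3.3-70b:

```python
# Minimal logit-lens sketch: apply the final norm + lm_head to every
# layer's residual, as if that layer were the last one. The model name
# is a stand-in; the tables above used llama-3.3-70b.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.2-1B-Instruct"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

prompt = "Is there a seahorse emoji? Yes, there is a seahorse emoji:"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states[0] is the embedding output, then one entry per layer.
for layer, h in enumerate(out.hidden_states):
    residual = h[:, -1, :]  # the residual stream at the last position
    logits = model.lm_head(model.model.norm(residual))
    top5 = logits[0].topk(5).indices.tolist()
    print(layer, [tok.decode([t]) for t in top5])
```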

(Why do the unmerged tokens look like that 'ĠðŁ', 'IJ', 'ł' garbage? It's because of a tokenizer quirk – those tokens encode the UTF-8 bytes for the fish emoji. It's not relevant to this post, but if you're curious, ask Claude or your favorite LLM to explain this paragraph and this line of code: bytes([bpe_byte_decoder[c] for c in 'ĠðŁIJł']).decode('utf-8') == ' 🐠')
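
For the extra-curious, here's a self-contained sketch of that decoding, using the standard GPT-2-style byte-to-unicode table that BPE tokenizers use to store raw bytes as printable characters:

```python
# The standard GPT-2-style byte<->unicode table: every possible byte gets
# a printable unicode stand-in, so BPE vocab entries are always printable.
def bytes_to_unicode():
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)  # unprintable bytes get shifted codepoints
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

# Invert the table, then decode the "garbage" token text back into UTF-8.
bpe_byte_decoder = {v: k for k, v in bytes_to_unicode().items()}
print(bytes([bpe_byte_decoder[c] for c in "ĠðŁIJł"]).decode("utf-8"))  # ' 🐠'
```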

Take a look at what happens in the middle layers, though – it's neither the early-layer weirdness nor the emoji bytes of the final prediction! Instead we get words relating to relevant concepts, specifically the concept of a seahorse. On layer 52, we get "sea horse horse" – three residual positions in a row encoding the "seahorse" concept. Later, in the top-k for the first position, we get a mix of "sea", "horse", and an emoji byte sequence prefix, "ĠðŁ".

So what is the model thinking about? "seahorse + emoji". It's trying to construct a residual representation of a seahorse combined with an emoji. Why would the model try to construct this combination? Well, let's look into how the lm_head actually works.

lm_head

A language model's lm_head is a huge matrix of residual-sized vectors, one for each token id in the vocabulary (~300,000 of them). When a residual is passed into it – either after flowing through the model normally, or early because someone is using the logit lens on an earlier layer – the lm_head compares that input residual with every residual-sized vector in that huge matrix, and (in coordination with the sampler) picks the token id associated with the vector most similar to the input residual.

(More technically: lm_head is a linear layer without a bias, so x @ w.T takes dot products with each unembedding vector to produce raw scores, followed by the usual log_softmax and argmax/temperature sampling.)
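
As a toy sketch, with made-up dimensions (the real matrices are far larger):

```python
# Toy version of the readout step: dot the residual against every
# unembedding vector, then pick the most similar one. Dimensions here
# are made up for illustration.
import torch

vocab_size, d_model = 1000, 64
lm_head = torch.nn.Linear(d_model, vocab_size, bias=False)  # unembedding matrix

residual = torch.randn(d_model)        # residual stream at the final position
scores = residual @ lm_head.weight.T   # similarity score for every token
log_probs = torch.log_softmax(scores, dim=-1)
next_token_id = int(scores.argmax())   # greedy pick: most similar vector wins
```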

That means that if the model wants to output the word "hello" – for example, in response to a friendly greeting from the user – it needs to construct a residual as similar as possible to the vector for the "hello" token, which the lm_head can then turn into the hello token id. And using the logit lens, we can see that's exactly what happens in response to "Hello :-)":

| layer | token 0 | token 1 | token 2 | merged | token 0 topk 5 |
|------:|---------|---------|---------|--------|----------------|
| 0 | 0 '!' | 0 '!' | 40952 'opa' | !!opa | ['"', '!', '#', '%', '$'] |
| 8 | 121495 'ÅĻiv' | 16 '1' | 73078 'iae' | řiv1iae | ['ÅĻiv', '-', '(', '.', ','] |
| 16 | 34935 'Ġconsect' | 7341 'arks' | 13118 'Ġindeed' | consectarks indeed | ['Ġobscure', 'Ġconsect', 'äºķ', 'ĠпÑĢоÑĦеÑģÑģионалÑĮ', 'Îŀ'] |
| 24 | 67846 '< | | | | |