This is an edited and expanded version of a Twitter post, originally in reply to @arm1st1ce, which can be found here: https://x.com/voooooogel/status/1964465679647887838
Is there a seahorse emoji? Let’s ask GPT-5 Instant:
Wtf? Let’s ask Claude Sonnet 4.5 instead:
What’s going on here? Maybe Gemini 2.5 Pro handles it better?
OK, something is going on here. Let’s find out why.
LLMs actually believe there’s a seahorse emoji
Here are the responses you get if you ask a number of models whether a seahorse emoji exists, yes or no, 100 times:
Is there a seahorse emoji, yes or no? Respond with one word, no punctuation.
- gpt-5-chat: 100% ‘Yes’
- gpt-5: 100% ‘Yes’
- claude-4.5-sonnet: 100% ‘Yes’
- llama-3.3-70b: 83% ‘yes’, 17% ‘Yes’
Needless to say, popular language models are quite confident that there’s a seahorse emoji. And they’re not alone in that confidence – here’s a Reddit thread with numerous comments from people who vividly remember a seahorse emoji existing:
There’s a lot of this – Google “seahorse emoji” and you’ll find TikToks, YouTube videos, and even (now defunct) memecoins built around the supposed disappearance of a seahorse emoji that everyone is quite sure used to exist – but, of course, never did.
Maybe LLMs believe a seahorse emoji exists because so many people in the training data do. Or maybe it’s a convergent belief – given how many other marine animals are in Unicode, it’s reasonable for both humans and LLMs to assume (generalize, even) that such a wonderful animal is there too. A seahorse emoji was even formally proposed at one point, but was rejected in 2018.
Whatever the origin, many LLMs start each new context window fresh with the incorrect latent belief that the seahorse emoji exists. Why does that produce such strange behavior? I mean, I used to believe a seahorse emoji existed myself, but if I had tried to send it to a friend, I would’ve just looked for it on my keyboard and realized it wasn’t there, not sent the wrong emoji and then gone into an emoji spam doomloop. What’s happening inside the LLM that causes it to act like this?
Using the logit lens
Let’s investigate this using everyone’s favorite underrated interpretability tool, the logit lens!
Using this prompt prefix – a templated chat with the default llama-3.3-70b system prompt, a question about the seahorse emoji, and a partial answer from the model right before it gives the actual emoji:
<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>
Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024
<|eot_id|><|start_header_id|>user<|end_header_id|>
Is there a seahorse emoji?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Yes, there is a seahorse emoji:
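(For reference, here’s a sketch of how you could build a prefix like this with the tokenizer’s chat template; the continue_final_message flag keeps the partial assistant answer open rather than closing it with an end-of-turn token, and whether the default system prompt gets inserted automatically depends on the model’s template.)
```python
# Sketch: build a prompt prefix like the one above from a chat using the model's template.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")
messages = [
    {"role": "user", "content": "Is there a seahorse emoji?"},
    {"role": "assistant", "content": "Yes, there is a seahorse emoji:"},
]
# continue_final_message leaves the assistant turn unterminated, so the model
# will continue it (with the emoji) rather than start a new turn.
prefix = tok.apply_chat_template(messages, tokenize=False, continue_final_message=True)
print(prefix)
```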
We can take the model’s lm_head, which is normally only applied to the output of the final layer, and apply it to every layer to produce intermediate token predictions. That process produces this table, showing for every fourth layer the most likely token for the next three positions after the prefix (tokens 0, 1, and 2), plus the top five most likely predictions for the first position (token 0 topk 5):
[Table: logit lens predictions at every fourth layer – most likely tokens for positions 0, 1, and 2, plus the top-5 predictions for position 0]
This is the logit lens: using the model’s lm_head to produce logits (token probabilities) as a way to inspect its internal states. Note that the tokens and probabilities we get from the logit lens here are not equivalent to the model’s full internal states! For that, we would need a more involved technique like representation reading or sparse autoencoders. Instead, this is a lens on that state – it shows what the output token would be if this layer were the last one. Despite this limitation, the logit lens is still useful. The states of early layers can be difficult to interpret with it, but as we move up through the stack we can see that the model is iteratively refining those states toward its final prediction, a fish emoji.
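Here’s roughly what that looks like in code – a minimal logit lens sketch with Hugging Face transformers. The model name is a small stand-in rather than llama-3.3-70b, the prompt is abbreviated rather than the full templated prefix above, and applying the final norm before unembedding is a design choice that tends to make intermediate layers more readable.
```python
# Minimal logit lens sketch: unembed every layer's residual with lm_head.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B-Instruct"  # small stand-in model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tok("Yes, there is a seahorse emoji:", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the embedding output; [1:] are the transformer layers.
for layer, h in enumerate(out.hidden_states[1:], start=1):
    resid = model.model.norm(h[:, -1, :])   # final norm, as the real output path applies
    logits = model.lm_head(resid)           # unembed this layer's residual
    top5 = logits.topk(5, dim=-1).indices[0].tolist()
    print(layer, [tok.decode([t]) for t in top5])
```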
(Why do the unmerged tokens look like that ‘ĠðŁ’, ‘Ĳ’, ‘ł’ garbage? It’s because of a tokenizer quirk – those tokens encode the UTF-8 bytes for the fish emoji. It’s not relevant to this post, but if you’re curious, ask Claude or your favorite LLM to explain this paragraph and this line of code: bytes([bpe_byte_decoder[c] for c in 'ĠðŁĲł']).decode('utf-8') == ' 🐠'.)
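(Or, if you’d rather not ask an LLM: here’s a sketch of where that bpe_byte_decoder mapping comes from, assuming the standard GPT-2-style byte-to-unicode table used by byte-level BPE tokenizers.)
```python
# Sketch: byte-level BPE tokenizers map each raw byte to a printable unicode
# character (the GPT-2 byte table); bpe_byte_decoder inverts that mapping.
def bytes_to_unicode():
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:          # non-printable bytes get shifted up past 255
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

bpe_byte_decoder = {c: b for b, c in bytes_to_unicode().items()}
# Note: 'Ĳ' is the single ligature character U+0132, not "I" followed by "J".
print(bytes([bpe_byte_decoder[c] for c in 'ĠðŁĲł']).decode('utf-8'))  # ' 🐠'
```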
Take a look at what happens in the middle layers, though – it’s not the early-layer weirdness or the emoji bytes of the final prediction! Instead we get words relating to the relevant concepts, particularly the concept of a seahorse. On layer 52, we get “sea horse horse” – three residual positions in a row encoding the “seahorse” concept. Later, in the top-k for the first position, we get a mix of “sea”, “horse”, and an emoji byte sequence prefix, “ĠðŁ”.
What is the model thinking about? “seahorse + emoji”. It’s trying to construct a residual representation of a seahorse combined with an emoji. Why would the model try to construct this combination? Well, let’s look at how the lm_head actually works.
lm_head
A language model’s lm_head is a huge matrix of residual-sized vectors associated with token ids, one for each token in the vocabulary (~300,000). When a residual is fed into it – either after flowing through the model normally, or early because someone is using the logit lens on an earlier layer – the lm_head compares that input residual against each residual-sized vector in that huge matrix and (in coordination with the sampler) picks the token id associated with the vector most similar to the input residual.
(More technically: lm_head is a linear layer with no bias, so x @ w.T does dot products with each unembedding vector to produce raw scores, followed by the usual log_softmax and argmax/temperature sampling.)
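As a toy illustration of that description (random weights, with sizes chosen purely for illustration):
```python
# Toy sketch of the lm_head comparison: dot the residual against one
# unembedding vector per token id and pick the most similar one.
import torch

vocab_size, d_model = 300_000, 8192            # illustrative sizes only
w = torch.randn(vocab_size, d_model)           # lm_head weight: one residual-sized vector per token id
residual = torch.randn(d_model)                # residual from the final (or an earlier) layer

logits = residual @ w.T                        # raw scores: similarity to every token's vector
log_probs = torch.log_softmax(logits, dim=-1)
next_token_id = int(torch.argmax(log_probs))   # greedy "sampling": the most similar vector wins
print(next_token_id)
```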
That means if the model wants to output the word “hello” – for instance in response to a friendly greeting from the user – it needs to construct a residual as similar as possible to the vector for the “hello” token, which the lm_head can then turn into the hello token id. And using the logit lens, we can see that’s exactly what happens in response to “Hello :-)”:
[Table: logit lens predictions by layer for the “Hello :-)” prompt]