I've been playing a lot of [Everything.io](https://everything.io), a daily puzzle game where you complete little stories or bits of wordplay using emojis. Some are wordplay, like "Mozart + Cinderella = ? (Mozzarella)," and others are more conceptual, like "Pillow + Candy = ? (Marshmallow)."
![[Screenshot 2025-03-28 at 8.04.55 AM.png|300]]
The puzzles look deceptively simple. They require some cognitive flexibility just to recognize what kind of puzzle you're trying to solve. Sometimes subtraction comes into play, like "Key + Wheat - Tea = Kiwi" ("Wheat" minus the "Tea" sound leaves "Whea," and "Key" + "Whea" sounds like "Kiwi"). Just when you think you've got the pattern down, they throw you for a loop: "Ice + Ice = ?" The answer is "Baby," per the title and lyrics of Vanilla Ice's "Ice Ice Baby."
I'm writing out the words for each caption in this article, but EIO only captions emojis that don't clearly mean a single thing. In the example above, a rocket is about the best the emoji library can do for "startup," so it gets a caption, but a bag of money is pretty clear on its own.
![[Screenshot 2025-03-28 at 8.05.59 AM.png|300]]
This struck me as an interesting area for AI benchmarking, particularly because the process is creative but the answers are objectively right or wrong, if only because of the multiple-choice format. Granted, what I'm doing here is not in the same league as actual AI benchmarks: models are tested over many iterations and many questions, and I assume the testing is generally far more rigorous than anything I'm doing here. On the other hand... let's have some fun and just see what happens?
# Test setup
I used the March 27, 2025 game, which consisted of the following puzzles, written out in words (**spoiler alert**):
| Round | Puzzle                                | Answer |
| ----- | ------------------------------------- | ------ |
| 1     | INCH + ? + MILE = FREEDOM + UNIT      | CUP    |
| 2     | BRITISH + COMEDY = MISTER + ?         | BEAN   |
| 3     | [emoji] + TATTOO + [emoji] = ? + PUNK | STEAM  |
| 4     | PAPER + GRASS - SCHOOL = ? + HEAD     | POT    |
| 5     | CUP + BEAN + STEAM + POT = ?          | COFFEE |
The models will be tested in two ways:
1. Show the model an image of the game, the same way a human would see it
2. Give the model the puzzle equation as text, with a simple prompt and the available options
![[Screenshot 2025-03-28 at 8.28.39 AM.png|300]]
I'll first give the model this example puzzle. When testing with image input, I'll provide it as an image along with the options, the correct answer, and a rationale for the answer. Otherwise, I'll provide it in this format:
```
BODY + BRIBE = ? + MIDNIGHT
Options: [emoji options]
Correct answer: SHOVEL
Rationale: This is a story of someone being bribed to hide a body, so they do so with a shovel in the dark.
```
Since a human gets to guess again after a wrong answer, the AI does too.
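For reference, here's roughly how a puzzle gets sent to a model in each mode. This is a minimal sketch using the OpenAI Python SDK as one example backend; the prompt wording, function names, and file paths are illustrative placeholders, not the exact ones I used.

```python
import base64
from openai import OpenAI

client = OpenAI()

# One-shot example plus the puzzle to solve, in the text format shown above.
# The wording here is illustrative, not the exact prompt used.
TEXT_PROMPT = """You are playing an emoji puzzle game. Pick exactly one of the options.

Example:
BODY + BRIBE = ? + MIDNIGHT
Options: ...
Correct answer: SHOVEL
Rationale: Someone is bribed to hide a body, so they do it with a shovel in the dark.

Now solve:
BRITISH + COMEDY = MISTER + ?
Options: ...
Answer with a single option."""


def ask_text(model: str = "gpt-4o") -> str:
    """Send the puzzle as plain text."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": TEXT_PROMPT}],
    )
    return resp.choices[0].message.content


def ask_image(screenshot_path: str, model: str = "gpt-4o") -> str:
    """Send a screenshot of the game, the way a human would see it."""
    with open(screenshot_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Solve the emoji puzzle in this screenshot. Pick one of the options shown."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```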
## Scoring
At the end of each game, you're given a score and rewarded virtual money for different accomplishments. Because the dollar amounts are based on gameplay patterns like playing 5 games, coming back, or different ways of scoring, I thought it best to focus on the total score when comparing one LLM to another. At the same time, I also wanted to see how much better (or worse) an LLM does than a human, so I'll be tracking both the raw score and the difference from the average human score.
![[Screenshot 2025-03-28 at 8.16.46 AM.png|300]]
Here's how the scoring works: rounds 1-4 each start at 10 points, and 5 points are lost with each wrong guess. After two wrong guesses you get zero points and the round auto-advances. Round 5 is special: it starts at 20 points, and 10 points are lost with each wrong guess.
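To make that concrete, here's a minimal sketch of the per-round math, assuming all we record is how many wrong guesses were made before the correct answer (or before the round auto-advanced). The function name and structure are my own, for illustration; the point values come straight from the rules above.

```python
def round_score(round_number: int, wrong_guesses: int) -> int:
    """Score one round of Everything.io.

    Rounds 1-4 start at 10 points and lose 5 per wrong guess;
    round 5 starts at 20 and loses 10 per wrong guess.
    Two wrong guesses zero out the round and it auto-advances.
    """
    start, penalty = (20, 10) if round_number == 5 else (10, 5)
    return start - penalty * min(wrong_guesses, 2)


# A perfect game is 4 * 10 + 20 = 60 points.
assert sum(round_score(r, 0) for r in range(1, 6)) == 60
```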
## Models being tested
- OpenAI
- 4o
- 4.5 (some results missing due to account limits)
- o1
- o3-mini
- o3-mini-high
- Google
- Gemini 2.5 Pro Experimental
- Anthropic
- Claude 3.7 Sonnet with and without thinking
- Claude 3.5 Haiku
![[Screenshot 2025-03-28 at 8.53.53 PM.png]]
# Results
Here are the results of playing five games of five rounds each, with the scores averaged.
Green cells on the left show where the LLM scored perfectly. In the last "Diff" column, green means the model scored above the average human score, and red means the humans beat it.
![[Screenshot 2025-03-28 at 1.59.55 PM.png]]
Interestingly, the LLMs generally perform better when looking at a picture of the game than when given the puzzles as raw text. I would have expected the opposite.
Since humans also play the game by looking at it, I figured the next round of tests should focus on image input alone.
You'll notice 4.5 has some results missing; this is due to repeated outages. Each vendor's best model beats the human average on this game anyway, so rather than chase the missing data, I decided to focus on just the best models.
So for the next test, we'll reduce our pool to:
- o1
- o3-mini-high
- Claude 3.7 Sonnet + Thinking (Claude 3.7)
- Gemini-2.5-Pro-Experimental (Gemini-2.5)
Since we have fewer tests to run, let's try some different games. This gave us the following data:
![[Screenshot 2025-03-28 at 8.50.30 PM.png]]
# Final results
- Gemini 2.5 was the only model that consistently scored better than the average human score on Everything.io with image inputs.
- o3-mini-high had the highest variance, ranging from 22 points below to 17 points above the human average. On average, it is the only model that loses to humans, and by almost 6 points!
- Claude 3.7 Sonnet + Thinking was the most human in the sense that it hovered around the average.
The following table and chart average the per-game averages for each model (and the human baseline) across all the games.
![[Screenshot 2025-03-28 at 8.49.45 PM.png]]
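If you want to reproduce that aggregation, it's just an average of averages plus a diff against the human baseline. Here's a small sketch, assuming a hypothetical `scores.csv` with one row per game and player, where the "Human" rows hold the site's average human score; the file and column names are made up for illustration.

```python
import pandas as pd

# Hypothetical input: one row per (game, player) with that player's
# average score for the game. "Human" rows hold the average human score.
scores = pd.read_csv("scores.csv")  # columns: game, player, avg_score

# Average of the per-game averages for each player.
overall = scores.groupby("player")["avg_score"].mean()

# Difference from the average human score (positive = beats the humans).
diff = overall - overall["Human"]

print(pd.DataFrame({"avg_of_avgs": overall, "diff_vs_human": diff}))
```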
Thanks for reading! And check out today's [EIO](https://everything.io) daily puzzle if you haven't already.