Can GPT-4o solve these riddles?
We tested 6 frontier models on 7 logic puzzles. The results reveal the gap between System 1 (fast pattern matching) and System 2 (deliberate reasoning).
📖 Want the full analysis? Read the complete article with technical explanations, methodology, and references: System 1 vs System 2: Testing LLMs with Riddles →
Model Accuracy Scoreboard
7 Riddles Total
Key Findings
Search ≠ Accuracy
Perplexity (with search) struggled on logic traps, retrieving cached wrong answers (e.g., "Friday" for the Calendar riddle).
The "Nudge" Effect
Simple follow-up prompts can activate System 2 reasoning (as seen with Claude 4.5).
Missed Ambiguity
83% Partial
5 out of 6 models gave only one answer to the Mary's Daughter riddle, without recognizing that it is ambiguous (it has a valid reading for both male and female speakers).
The Riddle Arena
Try to solve them yourself, then see how the models performed.
Why This Matters
LLMs are probabilistic, not logical. Here are the critical technical takeaways.
The "Reasoning Theater" Trap
Case Study: DeepSeek. In the "Impossible Math" riddle, DeepSeek invented a base-5 number system to force an answer. This shows that Chain-of-Thought does not guarantee accuracy: models can use plausible-looking reasoning steps to rationalize a hallucination.
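A practical defense is to verify the final claim mechanically rather than trusting the reasoning trace. The sketch below assumes a common "impossible math" trap of this type (pick three odd numbers that sum to 30); the exact puzzle used in the test may differ, but the principle is the same: the stated constraints can be checked in plain base-10 arithmetic, and any claimed solution fails that check.

```python
def verify_claim(numbers: list[int], target: int = 30) -> bool:
    """Check a model's claimed solution in base 10, as the riddle states."""
    return (
        len(numbers) == 3
        and all(n % 2 == 1 for n in numbers)  # all three must be odd
        and sum(numbers) == target
    )

# The sum of three odd numbers is always odd, so no claim can pass:
print(verify_claim([11, 13, 5]))  # False (sum is 29)
print(verify_claim([15, 9, 7]))   # False (sum is 31)
```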
Triggering System 2
Most models default to System 1 (fast pattern matching). In the "Silent Ingredient" case, Claude initially guessed "Salt." A single nudge ("But it's made of water") forced a re-evaluation, and it corrected to "Ice." Takeaway: never accept a single-shot answer for high-stakes logic tasks.
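To make that concrete, here is a minimal sketch of the two-turn "nudge" protocol. The `ask` helper is a hypothetical stand-in for whatever chat-completion client you use, and the prompt strings are illustrative, not the exact prompts from the test.

```python
def ask(messages: list[dict]) -> str:
    """Hypothetical LLM call; replace with a real chat-completion client."""
    raise NotImplementedError("wire up your LLM client here")

def nudge_solve(riddle: str, constraint: str) -> tuple[str, str]:
    messages = [{"role": "user", "content": riddle}]
    first = ask(messages)  # System 1: fast, single-shot pattern match

    # Keep the first answer in context, then challenge it with the
    # constraint it ignored, forcing a deliberate second pass.
    messages += [
        {"role": "assistant", "content": first},
        {"role": "user", "content": f"But {constraint}. Re-check your answer."},
    ]
    second = ask(messages)  # System 2: re-evaluation under the constraint
    return first, second

# Illustrative usage (riddle paraphrased, not the original prompt):
# first, second = nudge_solve("What touches every dish but is never tasted?",
#                             "it's made of water")
```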
Search ≠ Logic
Perplexity has access to the entire web, yet it hallucinated "Friday" for the Calendar Trap. Why? It retrieved a *similar* riddle's answer instead of solving the specific logic presented. RAG systems inherit the biases of their search results.
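For date logic, the fix is to compute rather than retrieve. A minimal sketch, assuming a Calendar-Trap-style question of the form "what weekday falls N days after a given date" (the exact riddle is not reproduced here):

```python
from datetime import date, timedelta

def weekday_after(start: date, days: int) -> str:
    """Derive the weekday exactly, rather than recalling a similar riddle."""
    return (start + timedelta(days=days)).strftime("%A")

# 2024-01-01 was a Monday; 100 days later is a Wednesday (100 % 7 == 2).
print(weekday_after(date(2024, 1, 1), 100))  # "Wednesday"
```

A RAG pipeline that routes date arithmetic to a deterministic tool like this sidesteps the cached-answer failure mode entirely.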