Can GPT-4o solve these riddles?
We tested 6 frontier models on 7 logic puzzles. The results reveal the gap between System 1 (fast pattern matching) and System 2 (deliberate reasoning).
📖 Want the full analysis? Read the complete article with technical explanations, methodology, and references: System 1 vs System 2: Testing LLMs with Riddles →
Model Accuracy Scoreboard
7 Riddles Total
Key Findings
Search ≠ Accuracy
Perplexity (with search) struggled on logic traps, retrieving cached wrong answers (e.g., "Friday" for the Calendar riddle).
The "Nudge" Effect
Simple follow-up prompts can activate System 2 reasoning (as seen with Claude 4.5).
Missed Ambiguity
83% Partial
5 out of 6 models gave only one answer to the Mary's Daughter riddle, without recognizing that it is ambiguous (it has a valid reading for both male and female speakers).
The Riddle Arena
Try to solve them yourself, then see how the models performed.
Why This Matters
LLMs are probabilistic, not logical. Here are the critical technical takeaways.
The "Reasoning Theater" Trap
Case Study: DeepSeek. In the "Impossible Math" riddle, DeepSeek invented a base-5 number system to force an answer. This shows that Chain-of-Thought does not guarantee accuracy: models can use plausible-looking reasoning steps to rationalize a hallucination.
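A practical defense is to verify the final claim mechanically rather than trusting the reasoning trace. The sketch below assumes a common "impossible math" trap of this type (pick three odd numbers that sum to 30); the exact puzzle used in the test may differ, but the principle is the same: the stated constraints can be checked in plain base-10 arithmetic, and any claimed solution fails that check.

```python
def verify_claim(numbers: list[int], target: int = 30) -> bool:
    """Check a model's claimed solution in base 10, as the riddle states."""
    return (
        len(numbers) == 3
        and all(n % 2 == 1 for n in numbers)  # all three must be odd
        and sum(numbers) == target
    )

# The sum of three odd numbers is always odd, so no claim can pass:
print(verify_claim([11, 13, 5]))  # False (sum is 29)
print(verify_claim([15, 9, 7]))   # False (sum is 31)
```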
Triggering System 2
Most models default to System 1 (fast pattern matching). In the "Silent Ingredient" case, Claude initially guessed "Salt." A single nudge ("But it's made of water") forced a re-evaluation, and it corrected to "Ice." Takeaway: never accept a single-shot answer for high-stakes logic tasks.
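To make that concrete, here is a minimal sketch of the two-turn "nudge" protocol. The `ask` helper is a hypothetical stand-in for whatever chat-completion client you use, and the prompt strings are illustrative, not the exact prompts from the test.

```python
def ask(messages: list[dict]) -> str:
    """Hypothetical LLM call; replace with a real chat-completion client."""
    raise NotImplementedError("wire up your LLM client here")

def nudge_solve(riddle: str, constraint: str) -> tuple[str, str]:
    messages = [{"role": "user", "content": riddle}]
    first = ask(messages)  # System 1: fast, single-shot pattern match

    # Keep the first answer in context, then challenge it with the
    # constraint it ignored, forcing a deliberate second pass.
    messages += [
        {"role": "assistant", "content": first},
        {"role": "user", "content": f"But {constraint}. Re-check your answer."},
    ]
    second = ask(messages)  # System 2: re-evaluation under the constraint
    return first, second

# Illustrative usage (riddle paraphrased, not the original prompt):
# first, second = nudge_solve("What touches every dish but is never tasted?",
#                             "it's made of water")
```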
Search ≠ Logic
Perplexity has access to the entire web, yet it hallucinated "Friday" for the Calendar Trap. Why? It retrieved a *similar* riddle's answer instead of solving the specific logic presented. RAG systems inherit the biases of their search results.
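For date logic, the fix is to compute rather than retrieve. A minimal sketch, assuming a Calendar-Trap-style question of the form "what weekday falls N days after a given date" (the exact riddle is not reproduced here):

```python
from datetime import date, timedelta

def weekday_after(start: date, days: int) -> str:
    """Derive the weekday exactly, rather than recalling a similar riddle."""
    return (start + timedelta(days=days)).strftime("%A")

# 2024-01-01 was a Monday; 100 days later is a Wednesday (100 % 7 == 2).
print(weekday_after(date(2024, 1, 1), 100))  # "Wednesday"
```

A RAG pipeline that routes date arithmetic to a deterministic tool like this sidesteps the cached-answer failure mode entirely.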