Alex Goldhoorn


LLM Coding Failure Patterns


This is a running log of recurring failure patterns that I've encountered when using LLMs for coding work — across GitHub Copilot, Cursor, Claude Code, and others. These aren't bugs but tendencies that appear across tools and models. Knowing them makes reviews faster. I've run into each of these patterns several times, but some may have disappeared or will disappear with newer models.

#1

Over-engineering — solves the general case, not the specific one

Observed: 2025 (Cursor), 2026 (GitHub Copilot, Claude)

Given a concrete task, an LLM will often solve the abstract version of it. It won't reuse the helper function three files away — it'll write a new one. It'll add parameters for flexibility nobody asked for. It will handle edge cases that don't exist in the codebase.

The output is often correct but not appropriate. The feedback I give most often: make as few changes as possible, keep it short, and reuse the functions that are already there.
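A hypothetical sketch of the pattern (the function name, parameters, and task are invented for illustration): asked to normalize a list of email addresses, the model produces the configurable version, when the one-liner was the actual task.

```python
# Hypothetical example of over-engineering: the task was only
# "strip whitespace and lowercase a list of email addresses".

# What an LLM often produces: flags and options nobody asked for.
def normalize_emails(emails, lowercase=True, strip_whitespace=True,
                     deduplicate=False, validate=False):
    result = []
    for e in emails:
        if strip_whitespace:
            e = e.strip()
        if lowercase:
            e = e.lower()
        if validate and "@" not in e:
            continue  # edge case that doesn't exist in this codebase
        result.append(e)
    # dict.fromkeys preserves order while dropping duplicates
    return list(dict.fromkeys(result)) if deduplicate else result

# What the task actually needed:
def normalize_emails_minimal(emails):
    return [e.strip().lower() for e in emails]
```

Both are correct; only the second matches the scope of the request.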

Tags: code quality · review burden
#2

Clumsy solutions — right answer, wrong path

Observed: 2026

Sometimes the logic works but the approach is not very efficient. A concrete example: getting the maximum date from a DataFrame by sorting the column and taking the last row, instead of calling .max(). It arrives at the right answer via the scenic route.
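The example above as a minimal sketch (the DataFrame contents are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {"date": pd.to_datetime(["2026-01-03", "2026-01-01", "2026-01-02"])}
)

# The scenic route: sort the entire column, then take the last row.
latest_scenic = df.sort_values("date").iloc[-1]["date"]

# The direct route: a single vectorized reduction.
latest_direct = df["date"].max()

assert latest_scenic == latest_direct
```

Sorting is O(n log n) and touches every row twice; .max() is a single O(n) pass and states the intent directly.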

Tags: code quality · pandas
[Diagram: a correct but unnecessarily complex code path] The "scenic route" — correct destination, needlessly long path.
#3

Token budget surprises — runs out mid-flow

Observed: 2025 (Warp)

Monthly caps hit faster than you expect on an active project. You start a session with enthusiasm and hit a wall mid-task — sometimes mid-edit, leaving the file in an inconsistent state.

It's not a model failure, but it's a workflow hazard worth planning around. Starting big tasks early in a billing cycle and keeping local snapshots helps. Keeping track of remaining tokens and having fallback options ready also helps — a different model, the web interface, or a local model via Ollama.

Tags: workflow · cost
#4

Token leak / runaway output

Observed: Apr 2026 (GitHub Copilot in VS Code) · Rare

An endless stream of <s> tokens from GitHub Copilot — a special token leaking into the completion output, repeated until the session was killed. The only fix was to close and restart.

Rare, but a useful reminder that there's a probabilistic system under the hood. When a model starts producing obviously malformed output, stop and restart rather than trying to work around it.

Tags: model failure · copilot · rare
#5

Unit tests — confident but shallow

Observed: 2025, 2026

Unit test generation was one of the earlier uses of LLMs for coding — and it is genuinely useful. But the tests can be overly complex or too obvious: they test a clean, trivial input that will always pass, rather than the edge cases that actually break things.

Guide it explicitly: ask for edge cases, null inputs, boundary values. The default output looks like coverage but often isn't.
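A hypothetical illustration of the gap (the parser and both tests are invented for this sketch): the first test is what generation tends to produce by default, the second is what you get when you explicitly ask for edge cases.

```python
# Hypothetical function under test: parse a "key=value" config line.
def parse_line(line):
    key, _, value = line.partition("=")
    return key.strip(), value.strip()

# Typical generated test: one clean input that will always pass.
def test_parse_line_basic():
    assert parse_line("name=alex") == ("name", "alex")

# What to request explicitly: edge cases and boundary values.
def test_parse_line_edges():
    assert parse_line("name=") == ("name", "")        # empty value
    assert parse_line("=alex") == ("", "alex")        # empty key
    assert parse_line("a=b=c") == ("a", "b=c")        # '=' inside the value
    assert parse_line("  k  =  v  ") == ("k", "v")    # surrounding whitespace
```

The first test "covers" the function; only the second would catch a naive line.split("=") implementation.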

Tags: testing · code quality

Alex Goldhoorn is a freelance Senior Data Scientist. Find more at goldhoorn.net.