LLM Coding Failure Patterns
This is a running log of recurring failure patterns I've encountered when using LLMs for coding work — across GitHub Copilot, Cursor, Claude Code, and others. These aren't bugs so much as tendencies that appear across tools and models; knowing them makes reviews faster. I've seen each of these patterns several times, though some may fade, or already have, with newer models.
Over-engineering — solves the general case, not the specific one
Given a concrete task, an LLM will often solve the abstract version of it. It won't reuse the helper function three files away — it'll write a new one. It'll add parameters for flexibility nobody asked for. It will handle edge cases that don't exist in the codebase.
The output is often correct but not appropriate. The feedback I give most often: make as few changes as possible, keep it short, and reuse the functions that are already there.
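A hypothetical sketch of the pattern (both function names are invented for illustration): asked to format one timestamp, a model will often write a new, configurable helper rather than reuse the plain one already in the codebase.

```python
from datetime import datetime, timezone
from typing import Optional

# Existing helper already in the codebase (hypothetical).
def format_ts(ts: datetime) -> str:
    return ts.strftime("%Y-%m-%d")

# What an LLM often writes instead: a fresh, over-parameterized version
# with flexibility nobody asked for and edge cases that never occur here.
def format_timestamp(ts: Optional[datetime],
                     fmt: str = "%Y-%m-%d",
                     tz: Optional[timezone] = None,
                     fallback: str = "") -> str:
    if ts is None:              # this codebase never passes None
        return fallback
    if tz is not None:          # timezones aren't used anywhere here
        ts = ts.astimezone(tz)
    return ts.strftime(fmt)
```

Both produce the same output for the actual task; the second one just adds surface area to review and maintain.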
Clumsy solutions — right answer, wrong path
Sometimes the logic works but the approach is inefficient. A concrete example: getting the maximum date from a DataFrame by sorting the column and taking the last row, instead of calling .max().
It arrives at the right answer via the scenic route.
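The DataFrame example above, sketched with pandas (the column name is invented):

```python
import pandas as pd

df = pd.DataFrame(
    {"date": pd.to_datetime(["2024-01-05", "2024-03-01", "2024-02-10"])}
)

# The scenic route: sort the whole column, then take the last row.
latest_slow = df.sort_values("date")["date"].iloc[-1]

# The direct route: a single linear scan.
latest = df["date"].max()

assert latest == latest_slow
```

Both return the same timestamp, but the sort does O(n log n) work (and a copy) where a single `.max()` scan suffices.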
Token budget surprises — runs out mid-flow
Monthly caps hit faster than you expect on an active project. You start a session with enthusiasm and hit a wall mid-task — sometimes mid-edit, leaving the file in an inconsistent state.
It's not a model failure, but it's a workflow hazard worth planning around. Starting big tasks early in a billing cycle and keeping local snapshots helps. Keeping track of remaining tokens and having fallback options ready also helps — a different model, the web interface, or a local model via Ollama.
Token leak / runaway output
An endless stream of <s> tokens from GitHub Copilot — a special token leaking into
the completion output, repeated until the session was killed. The only fix was to close and restart.
Rare, but a useful reminder that there's a probabilistic system under the hood. When a model starts producing obviously malformed output, stop and restart rather than trying to work around it.
Unit tests — confident but shallow
Unit test generation was one of the earlier uses of LLMs for coding — and it is genuinely useful. But the tests can be overly complex or too obvious: they test a clean, trivial input that will always pass, rather than the edge cases that actually break things.
Guide it explicitly: ask for edge cases, null inputs, boundary values. The default output looks like coverage but often isn't.
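A sketch of the difference, with an invented function under test (`parse_price` and its behavior are assumptions for illustration, not from any real codebase):

```python
# A hypothetical function under test.
def parse_price(text: str) -> float:
    """Parse a price string like '$1,234.50' into a float."""
    cleaned = text.strip().lstrip("$").replace(",", "")
    if not cleaned:
        raise ValueError("empty price string")
    return float(cleaned)

# The kind of test a model writes by default: one clean, trivial input.
def test_happy_path():
    assert parse_price("$10.00") == 10.0

# The tests you have to ask for explicitly: boundaries and bad input.
def test_edge_cases():
    assert parse_price("$1,234.50") == 1234.5   # thousands separator
    assert parse_price("0") == 0.0              # boundary value
    try:
        parse_price("")                         # empty input must fail
    except ValueError:
        pass
    else:
        raise AssertionError("empty string should raise")

test_happy_path()
test_edge_cases()
```

The first test passes against almost any implementation; the second is where bugs actually surface.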
Alex Goldhoorn is a freelance Senior Data Scientist. Find more at goldhoorn.net.