Agentic failures and loops: why your AI is burning tokens and how to stop it

Benchmarks such as Terminal Bench 2.0 or SWE-Bench Pro measure an agent’s ability to produce a correct patch in a controlled environment. However, they systematically overlook long-session stability, destructive file management, and operational security.

While SemiAnalysis already describes Claude Code as a major market inflection point responsible for 4% of current GitHub commits, the reality in the terminal is more nuanced. This audit, based on incident reports from February 2026, documents real-world frictions that never appear in marketing benchmarks.

For a broader analysis of the topic, covering theoretical performance, production constraints, and economic implications, see our full report: AI Coding Agents: The Reality Beyond Benchmarks.

1. GPT-5.3 Codex: agentic instability and “Ghost Execution”

1.1 The danger of the “CAT pattern” and massive rewrites

One of the most critical flaws in Codex lies in its file management. Instead of using structured patch tools like apply_patch, the agent often falls back on a fragile pattern: using cat commands to rewrite an entire file for a minor modification.

Operational Impact: This “CAT pattern” drastically increases the risk of corruption in large files and can inadvertently overwrite changes made simultaneously by a human. It is a massive waste of tokens (token burn) that transforms a simple fix into a high-risk operation.
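To make the cost difference concrete, here is a minimal sketch that compares the size of a full-file rewrite with the size of a unified diff for the same one-line change. It uses Python's standard `difflib`, not Codex's own `apply_patch` tool; the function name, the file names, and the use of character counts as a rough stand-in for tokens are all assumptions made for illustration.

```python
import difflib


def rewrite_vs_patch_cost(original: str, modified: str) -> tuple[int, int]:
    """Return (chars emitted by a full rewrite, chars emitted by a unified diff).

    Character counts are only a rough proxy for token usage.
    """
    diff_lines = difflib.unified_diff(
        original.splitlines(keepends=True),
        modified.splitlines(keepends=True),
        fromfile="a/settings.py",  # hypothetical file name
        tofile="b/settings.py",
        n=3,  # lines of context around each change
    )
    patch = "".join(diff_lines)
    return len(modified), len(patch)


# A 500-line file in which a single line changes.
original = "".join(f"setting_{i} = {i}\n" for i in range(500))
modified = original.replace("setting_250 = 250\n", "setting_250 = 251\n")

full_cost, patch_cost = rewrite_vs_patch_cost(original, modified)
print(f"full rewrite: {full_cost} chars, patch: {patch_cost} chars")
```

Besides being smaller, the diff only touches the lines around the change, so a concurrent human edit elsewhere in the file is not silently overwritten.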

1.2 Intelligence degradation and “Stuck Approvals”

In long sessions exceeding 200k tokens, Codex exhibits a progressive qualitative decline. The AI becomes repetitive and favors “band-aid” solutions, such as timeouts or retries, rather than resolving underlying race conditions.

More seriously, technical reports from Penligent highlight the Stuck Approvals phenomenon: a state desynchronization in which the interface is still waiting for human approval while the agent has already proceeded to execution, a behavior known as Ghost Execution. It violates the human-in-the-loop control boundary.
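One way to picture the control boundary that Ghost Execution crosses is a gate that refuses to run anything whose approval has not been explicitly recorded. The sketch below is hypothetical: `ApprovalGate`, its methods, and the action identifiers are illustrative names, not part of Codex or any real agent framework.

```python
from enum import Enum
from typing import Callable, TypeVar

T = TypeVar("T")


class Approval(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    DENIED = "denied"


class ApprovalGate:
    """Hypothetical gate: an action runs only after approval is recorded."""

    def __init__(self) -> None:
        self._state: dict[str, Approval] = {}

    def request(self, action_id: str) -> None:
        # The agent asks for permission; nothing is executed yet.
        self._state[action_id] = Approval.PENDING

    def decide(self, action_id: str, approved: bool) -> None:
        # Only a pending request can be decided by the human.
        if self._state.get(action_id) is not Approval.PENDING:
            raise KeyError(f"no pending request for {action_id!r}")
        self._state[action_id] = Approval.APPROVED if approved else Approval.DENIED

    def execute(self, action_id: str, action: Callable[[], T]) -> T:
        # Execution checks the same state the human decision wrote to.
        if self._state.get(action_id) is not Approval.APPROVED:
            raise PermissionError(f"{action_id!r} has not been approved")
        del self._state[action_id]  # an approval is consumed, never reused
        return action()
```

Because the interface and the executor read the same state, the two cannot drift apart the way the reports describe: a pending or denied action raises `PermissionError` instead of running.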

1.3 The strategic paradox of security rerouting

Recently, users on Hacker News discovered that GPT-5.3 Codex was silently rerouting certain requests to the older 5.2 model under the guise of “cyber security”. This rerouting, often triggered by false positives, leads to a sudden drop in reasoning capabilities in the middle of a critical session.
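A client-side check can at least surface a mismatch between the model requested and the model that answered. The sketch below assumes the response payload carries a `model` field naming the serving model, which may not hold for every client or endpoint; the function name and the model identifiers in the usage lines are hypothetical.

```python
from typing import Mapping, Optional


def reroute_warning(requested: str, response: Mapping[str, object]) -> Optional[str]:
    """Return a warning if the response was not served by the requested model.

    Assumes the response exposes a "model" field (an assumption, not a
    documented guarantee). Prefix matching tolerates version suffixes.
    """
    served = response.get("model")
    if not isinstance(served, str):
        return "response does not name the serving model; rerouting cannot be ruled out"
    if not served.startswith(requested):
        return f"requested {requested!r} but response reports {served!r}"
    return None


# Hypothetical identifiers, for illustration only.
print(reroute_warning("gpt-5.3-codex", {"model": "gpt-5.3-codex"}))
print(reroute_warning("gpt-5.3-codex", {"model": "gpt-5.2"}))
```

Logging this warning per request makes a mid-session drop in reasoning quality traceable to a routing change rather than to the prompt.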

2. Claude Code: CLI instability and system friction

Anthropic’s approach is more transparent, but its CLI is not free from early-stage flaws documented through recent GitHub issues.

2.1 The “No Content” bug of the Bash Tool

The most significant blocker for macOS users concerns the Bash Tool. In versions 2.1.14+, simple commands like ls or pwd systematically return a (No content) message. This leaves the agent unable to perceive the project structure or run tests. This malfunction, linked to subprocess management on macOS 26.2, completely breaks the promised “terminal-first” workflow.
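A cheap guard against this failure mode is a health probe run before the agent starts real work: `pwd` always prints a path, so empty output means the shell tool is blind. The sketch below is an assumption-laden illustration, not Claude Code's internal mechanism; `run_in_shell` and `shell_tool_is_healthy` are hypothetical names, and the probe accepts any callable so it can wrap whatever tool the agent actually uses.

```python
import subprocess
from typing import Callable


def run_in_shell(command: str) -> str:
    """Default runner: execute a command in the system shell and return stdout."""
    completed = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=10
    )
    return completed.stdout


def shell_tool_is_healthy(run: Callable[[str], str] = run_in_shell) -> bool:
    """Return False if a command that must produce output returns nothing."""
    output = run("pwd").strip()
    # An empty string or the literal placeholder both mean the tool is blind.
    return bool(output) and output != "(No content)"
```

If the probe fails, stopping the session early is cheaper than letting the agent reason about a project it cannot see.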

2.2 Context loss and session instability

The /context command, intended to allow developers to view all files currently included in the context window, suffers from a recurring display bug where the window flashes and disappears instantly. Without this visibility, it becomes impossible to diagnose why the AI begins to hallucinate regarding files it should be able to “see”.

Furthermore, persistent authentication issues often force manual logouts after a system sleep, resulting in the loss of Model Context Protocol (MCP) server configurations and active session tokens.

3. Synthesis of documented incidents

| Incident | Model | Severity | Status | Source |
| --- | --- | --- | --- | --- |
| Bash “No Content” | Claude Code | Critical | Partially patched | GitHub #19663 |
| Stuck Approvals | Codex | High | Reported | Penligent |
| Disappearing /context | Claude Code | Medium | Flagged | GitHub #18562 |
| Rerouting 5.2 | Codex | Medium | Confirmed | Hacker News |

Beyond the “vibe”: operationalizing reliability

The failures observed in Claude Code and GPT-5.3 Codex remind us of a fundamental technical truth: the maturity of an AI agent is not measured by its success score on an isolated problem, but by its reliability within a complex DevOps environment.

The question for CTOs and Lead Devs is no longer whether AI can code, but whether we can grant it terminal keys without continuous supervision. The current answer is a definitive no. At this stage, human-in-the-loop oversight remains indispensable for critical production environments. To mitigate these instabilities and the resulting token burn, the engineering community is shifting toward more rigid, deterministic architectures.
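As one illustration of what a deterministic guard rail can look like, the sketch below stops a session when the agent repeats the same tool call too often or exceeds a token budget. `LoopGuard` and its thresholds are hypothetical; the 200,000-token default simply mirrors the session length at which degradation is reported above and is not a recommended value.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class LoopGuard:
    """Hypothetical guard: halt on repeated identical calls or budget overrun."""

    max_repeats: int = 3
    token_budget: int = 200_000
    tokens_used: int = 0
    _seen: Counter = field(default_factory=Counter)

    def check(self, tool: str, arguments: str, tokens: int) -> None:
        """Record one tool call; raise RuntimeError if a limit is exceeded."""
        self.tokens_used += tokens
        if self.tokens_used > self.token_budget:
            raise RuntimeError(
                f"token budget exceeded: {self.tokens_used} > {self.token_budget}"
            )
        key = (tool, arguments)
        self._seen[key] += 1
        if self._seen[key] > self.max_repeats:
            raise RuntimeError(
                f"possible loop: {tool} called {self._seen[key]} times "
                "with identical arguments"
            )
```

Calling `check` before every tool invocation turns a silent loop into an explicit failure that a human can inspect, which is cheaper than discovering it on the invoice.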

In our next deep dive, we will explore how the Model Context Protocol (MCP) and frameworks like smolagents are attempting to build the “safety rails” required for the next generation of agentic workflows.

