LLM “Plan Fatigue”: Why Late Validation Steps Got Skipped (and a Fix That Worked for Me)

I’ve been building a rule-heavy review agent for public-API governance changes. Early phases (discovery, candidate generation) work reliably. Later phases—especially rigorous validation against the full rule set—frequently get ignored, skimmed, or half-done, even with explicit, repeated instructions to treat them as mandatory.

This feels like a classic symptom: the model excels at creative/open-ended early work but treats final checks as low-priority cleanup. Ramping up prompt emphasis (“this is critical”, “double-check every finding”) helps marginally but never reaches consistent 100% adherence across runs.

What finally broke the pattern:

  1. Phased execution with progressive instruction reveal
    Split the workflow into isolated stages, using chained calls (or tool gates) to deliver fresh context and hidden plans only when needed:
    • Phase 1: Read artifacts + rules → generate candidates → append to file.
    • Phase 2: Separate call reveals a “deep audit” plan (source-level validation of findings). The agent must then request the next plan.
    • Phase 3: Final call reveals a “publish-ready” plan that re-reads rules and enforces per-finding checks before formatting output.
    This relieves context-window pressure and keeps the model from seeing the entire long-horizon sequence up front (a common trigger for agents to skip tail steps). Tool gating automates the handoff for unattended end-to-end runs; without tools, you could store the plans in files and have the agent read them on demand, but agents tend to load everything immediately if it's visible. A minimal sketch of the tool-gated setup follows after this list.
  2. Semantic framing: make validation sound non-negotiable
    Even with phasing, full compliance only arrived after renaming:
    • “validate” → deep_audit
    • “verify” → final_publish
    The inflated titles seem to signal higher importance to the model—similar to how humans respond to reframing tedious work as “mission-critical.” Combined with the phase resets, this pushed me to 100% rule-compliance in testing.
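To make the mechanics concrete, here is a minimal sketch of the tool-gated variant in Python. The plan wording, the findings.md/audit.md artifact names, and the commented-out agent.register_tool/agent.run hooks are placeholders for whatever agent framework you use, not a specific API; the only point is that the tail-step plans live behind tools with high-gravity names and are revealed one phase at a time.

```python
# Minimal sketch of tool-gated progressive instruction reveal, assuming a
# generic agent loop that can register plain Python functions as tools.
# Plan text, file names, and registration hooks are illustrative placeholders.

from pathlib import Path

# Only the phase-1 plan goes into the agent's initial context. The later
# plans stay hidden until the corresponding tool is called.
HIDDEN_PLANS = {
    "deep_audit": (
        "Re-read the full rule set. For every candidate finding in findings.md, "
        "validate it against the source artifacts and mark it confirmed or rejected; "
        "record the results in audit.md."
    ),
    "final_publish": (
        "Re-read the rules once more. Run the per-finding checklist on every "
        "confirmed finding, then format the publish-ready review."
    ),
}

def deep_audit() -> str:
    """High-gravity tool name. Reveals the audit plan only after phase 1 wrote its output."""
    if not Path("findings.md").exists():
        return "Not yet: finish candidate generation and append results to findings.md first."
    return HIDDEN_PLANS["deep_audit"]

def final_publish() -> str:
    """Reveals the publish plan only after the audit phase recorded its results."""
    if not Path("audit.md").exists():
        return "Not yet: complete deep_audit and write audit.md first."
    return HIDDEN_PLANS["final_publish"]

PHASE1_PROMPT = (
    "Read the changed API artifacts and the governance rules. Generate candidate "
    "findings and append them to findings.md. When done, call deep_audit() to "
    "receive your next plan; after completing the audit, call final_publish()."
)

# Registration is framework-specific; the key property is that the model sees
# only PHASE1_PROMPT plus the two tool names up front, never the tail-step plans.
# agent.register_tool(deep_audit)
# agent.register_tool(final_publish)
# agent.run(PHASE1_PROMPT)
```

The chained-call variant is the same idea without tools: three separate calls, each seeded with only its own plan, with the orchestrator running the file checks between calls.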

Result: 95%+ automation on my daily 10–30 reviews, with the core drift problem largely solved. Still refining rules and thresholds, but the fatigue is gone.

Reflections

“Plan fatigue” is my shorthand for observed patterns like long-horizon unreliability, late-step skipping, or context-driven quality drift in agent chains. If you’re seeing similar issues in validation-heavy or multi-step agents, try progressive instruction reveal + high-gravity naming for the tail steps.

Human parallel: when I’m mentally tired, I skip tedious verification too. A short break for fresh context and reframing the task as “final sign-off” helps me push through—just like it helps the LLM.

There’s a natural tension here: we value LLMs for their flexibility and emergent reasoning, yet when we prescribe every step rigidly, they deviate—just as humans push back against micromanagement. My phased reveal + semantic reframing gives the model fresh context and a sense of “ownership” over the tail steps without fully delegating planning. For strict rule enforcement, the imposed structure (delivered progressively) has proven more reliable in my runs than looser approaches.

Curious if others have hit this and what mitigations worked (phasing, reflection loops, different models, self-generated sub-plans, etc.). Happy to discuss in comments.
