A first draft
Most people talk about “getting better at prompting” as if the goal were to discover a perfect incantation that unlocks the model’s intelligence. That framing misses the actual control surface we have.
A modern LLM call is stateless. It does not “remember” anything from last time unless you send it again. Your application can simulate state (a chat transcript, tool outputs, a memory file), but the model itself is still just: input tokens → output tokens. The “context window” is not a mystical entity; it’s whatever you put in the HTTP request.
And once you accept that, a fairly strong claim becomes hard to unsee:
For a fixed model, agentic performance is a function of context quality, because context is the only lever you can directly control at inference time.
If you want a system that behaves “smarter,” you do not primarily ask for more clever outputs. You build feedback loops that reliably produce better context windows.
This is what I mean by hill climbing context.
A context window is the full set of tokens sent to the model for a given call (system instructions, user message, tool outputs, memory, retrieved documents, summaries, everything). It is the model’s entire world for that instant.
A chat interface creates the illusion of continuity by resending a growing transcript. An “agent” generalizes this by also adding tool results, rewriting summaries, pruning, and structuring the input—while still ultimately just sending tokens.
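To make the statelessness concrete, here is a minimal sketch of a "chat" as transcript resending. It assumes an OpenAI-style chat-completions endpoint over HTTP; the URL, model name, and response shape are placeholders rather than any specific vendor's API.

```python
import requests  # assumes the `requests` package; any HTTP client works

API_URL = "https://api.example.com/v1/chat/completions"  # placeholder endpoint

def call_model(messages: list[dict]) -> str:
    """One stateless call: the model sees only what is inside `messages`."""
    payload = {"model": "some-model", "messages": messages}
    resp = requests.post(API_URL, json=payload, timeout=60)
    resp.raise_for_status()
    # Response shape assumed to be OpenAI-style; adjust for your provider.
    return resp.json()["choices"][0]["message"]["content"]

# "Chat" is just the client resending a growing transcript on every turn.
transcript = [{"role": "system", "content": "You are a coding assistant."}]

for user_turn in ["What does main.py do?", "Now add a --verbose flag."]:
    transcript.append({"role": "user", "content": user_turn})
    reply = call_model(transcript)  # this list is the model's entire world
    transcript.append({"role": "assistant", "content": reply})
```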
Hill climbing is the general strategy of iteratively making a thing better, using feedback to decide the next step, until you reach a plateau (or decide “close enough”).
A simple example: if you generate a UI and repeatedly say “polish it more,” you can often push the output uphill—until improvements become lateral (different but not better) or regress.
That is hill climbing an output.
Hill climbing context is one level higher: rather than directly optimizing the output, you optimize the input substrate the model reasons over. You treat the context window itself as the object being improved.
In practice: you run a loop where each iteration does not just ask “can you do better,” but asks “what would make the next attempt obviously better,” and then you reshape the context accordingly.
A concise definition:
Hill climbing context = iteratively transforming the context window to increase expected task success under token and human-time constraints.
If you fix the model, there are only a few ways to change what comes out. At a high level, they all reduce to the quality of what goes in.
I find it useful to bucket context work into three operations, plus a 3B: delete incorrect context, add missing context, drop irrelevant context, and (3B) compress what remains.
Incorrect context is poison.
This includes:
- factual errors (“the API endpoint is X” when it’s not),
- wrong constraints (“do not modify file Y” when you must),
- false assumptions (“there is a diff tool that shows the trace mismatch” when no such tool exists),
- or stale reality (“tests were passing” when they aren’t anymore).
Incorrect context is not merely unhelpful—it creates tunnel vision. The model may visibly “try” to move toward the correct answer, but it keeps snapping back toward the false anchor because that anchor is inside its world-model.
There is a subtle but important exception: it’s fine to include wrong paths if you frame them as wrong, e.g.:
We tried X. It failed because Y. Don’t try X again.
That’s still correct context (it describes reality), and it’s extremely valuable because it compresses negative evidence into a form that prevents repeated mistakes.
But dumping an entire transcript of failed reasoning into the context window is usually counterproductive. Not because the words are “false,” but because they are an uncontrolled mixture of partial hypotheses, dead ends, and low-signal narrative. The right move is: distill, don’t transcribe.
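One way to make "distill, don't transcribe" operational is to compress a failed run into a short negative-evidence note before the next attempt. A sketch, where `complete` is a hypothetical stand-in for a single model call in whatever stack you use, and the prompt wording is purely illustrative:

```python
def complete(prompt: str) -> str:
    """Stand-in for one LLM call in your stack (API client, SDK, etc.)."""
    raise NotImplementedError

DISTILL_PROMPT = """\
Below is a transcript of a failed attempt. Do NOT repeat it.
Write at most 5 bullet points of the form:
"Tried X. It failed because Y. Don't try X again."
Keep only verified facts (error messages, failing commands), not guesses.

Transcript:
{transcript}
"""

def distill_failure(transcript: str) -> str:
    """Compress a messy failed run into a short list of verified dead ends."""
    return complete(DISTILL_PROMPT.format(transcript=transcript))

# The distilled note goes into the next context window; the raw transcript does not.
```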
Missing context is opportunity.
Most tasks that look like “the model isn’t smart enough” are actually “the model is under-informed.” The model can’t see your repo, your runtime state, your failing tests, your environment variables, your browser UI, your production logs, your intent, or the “obvious facts” you forgot to say because they’re in your head.
This is the core of tool use. A tool call is, functionally, a way to inject missing context:
- “Run tests. What failed?”
- “Open this file. What does it contain?”
- “Search the codebase for this symbol.”
- “Fetch the actual web page and convert it to structured text.”
- “Diff outputs from system A and system B.”
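Each of these is just a function whose return value becomes new context. A minimal sketch, assuming a Python-style agent harness; the tool names, commands, and paths are illustrative:

```python
import subprocess
from pathlib import Path

# Hypothetical tool functions: each turns "missing context" into text
# that gets appended to the next context window.

def run_tests(cmd: str = "pytest -q") -> str:
    """Run the test suite and return its output (what failed, and why)."""
    result = subprocess.run(cmd.split(), capture_output=True, text=True)
    return f"$ {cmd}\nexit={result.returncode}\n{result.stdout}{result.stderr}"

def read_file(path: str) -> str:
    """Return a file's contents so the model can actually see it."""
    return Path(path).read_text()

def search_code(symbol: str, root: str = ".") -> str:
    """Find where a symbol is defined or used."""
    result = subprocess.run(["grep", "-rn", symbol, root],
                            capture_output=True, text=True)
    return result.stdout or f"(no matches for {symbol})"

# Each tool result is just more context for the next call:
context = []
context.append(run_tests())
context.append(read_file("src/main.py"))  # illustrative path
```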
This is why agents matter: not because “agents are models,” but because agents are mechanisms for adding missing context in a loop.
Irrelevant context is drag.
There’s a reason “needle in a haystack” exists as a benchmark. If you bury one useful fact in a pile of junk, you can degrade performance. The junk consumes attention, increases ambiguity, and raises the chance the model fixates on the wrong thing.
Two subtypes matter:
- Benign noise: mostly just wastes tokens; sometimes mildly harms performance.
- Adversarial noise: actively harmful residue, like reusing a context window across unrelated tasks, or dumping giant logs that trigger wrong heuristics. In practice, cross-task residue can be devastating.
A simple operational rule that captures a lot of value:
When the task changes, clear the context.
Even if everything in your context window is “true” and “relevant,” it can still be inefficiently expressed.
Compression matters because tokens are a budget. Compression allows you to fit more signal in the window, and it reduces distraction. Converting verbose HTML to clean Markdown is not cosmetic; it is context optimization.
Compression has risk: summarize too aggressively and you drop the one detail that mattered. That’s not an argument against compression; it’s a reminder that hill climbing can go downhill. Compression is a probabilistic tradeoff: you want to remove low-value tokens without dropping high-value constraints.
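As a concrete example of the HTML-to-Markdown point, the sketch below uses the third-party html2text package (any HTML-to-text converter would do) and a crude character cap as the budget guard; both choices are illustrative, not a recommendation of a specific stack:

```python
import html2text  # third-party package: pip install html2text

def compress_page(raw_html: str, max_chars: int = 20_000) -> str:
    """Convert verbose HTML into Markdown-ish text, then cap its size.

    The goal is to keep the signal (headings, text, links) and drop the
    markup, scripts, and boilerplate that only consume tokens.
    """
    converter = html2text.HTML2Text()
    converter.ignore_images = True  # images rarely help a text-only model
    markdown = converter.handle(raw_html)
    # Crude budget guard: truncation is itself a lossy compression step,
    # which is exactly the tradeoff described above.
    return markdown[:max_chars]
```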
Now we can define an agent precisely in a way that removes most of the mystique:
An agent is a feedback loop that repeatedly improves the context window it is about to send to the model.
The model itself is not learning during your session. It isn’t becoming a better reasoner over time. What improves is the information and structure you are providing.
A minimal pseudocode sketch:
- Start with an initial context window C_0.
- Ask the model to propose an action A_i given C_i.
- Execute A_i (often via tools).
- Observe feedback F_i (test results, diffs, UI screenshots, runtime outputs).
- Produce the next context window C_{i+1} = T(C_i, A_i, F_i), where T is a set of transformations:
  - remove wrong assumptions,
  - add missing facts from feedback,
  - compress/structure,
  - drop irrelevant residue.
This loop is the “agent.” Everything else is implementation detail.
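Rendered as code, the loop looks roughly like this. The `propose` and `execute` callables stand in for your model call and your tool layer, and the transformation rules inside `transform` are deliberately toy ones:

```python
from typing import Callable, Optional

# Hypothetical plumbing: `propose` asks the model for the next action given
# the context; `execute` runs that action (tools) and returns feedback text.
Propose = Callable[[list[str]], Optional[str]]
Execute = Callable[[str], str]

def transform(context: list[str], action: str, feedback: str) -> list[str]:
    """T(C_i, A_i, F_i): delete what's wrong, add what was missing, keep it small."""
    kept = [c for c in context if "ASSUMPTION" not in c]  # toy rule: drop unverified guesses
    kept.append(f"Did: {action}\nObserved: {feedback[:2000]}")  # add grounded feedback, capped
    return kept[-20:]  # toy pruning: bound total size

def run_agent(initial_context: list[str], propose: Propose, execute: Execute,
              max_iters: int = 10) -> list[str]:
    """The loop itself: every iteration's real product is the next context window."""
    context = list(initial_context)
    for _ in range(max_iters):
        action = propose(context)   # C_i -> A_i
        if action is None:          # model signals "done"
            break
        feedback = execute(action)  # A_i -> F_i (tests, diffs, file contents, ...)
        context = transform(context, action, feedback)  # C_{i+1} = T(C_i, A_i, F_i)
    return context
```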
A natural question is: if the model has high potential, why not just let it run constantly and hill climb the product on its own?
Because raw intelligence dropped into chaos is not automatically productive.
If you took an extremely capable mind and dropped it into an information-starved environment with no instruments, no feedback signals, and unclear goals, it would not reliably produce value. A model in a context-poor environment is like a brilliant engineer locked in a room with no repo access, no runtime, no tests, and a vague request: it can generate plausible output, but it can’t reliably converge.
Today, we also lack true continual learning at the model level. Each context window is effectively onboarding a brand-new intelligent being onto your system “for the first time.” You must re-establish state, constraints, and ground truth each run.
So the practical strategy becomes:
- manufacture a high-quality context window,
- get a high-quality output,
- use feedback to improve the next context window,
- repeat.
In other words: your job is not to “talk to a model.” Your job is to produce high-quality context windows at high throughput.
The true objective is banal and correct:
Ship product value to users.
But “ship value” is too high-level to guide day-to-day mechanics. So we choose a proxy that is actionable and aligned:
Produce as many high-quality context windows as possible, prioritized by downstream product value.
That proxy makes two bottlenecks explicit:
- Context quality: how correct, complete, minimal, and well-structured the input is.
- Human bottleneck: how much human time/attention it takes per iteration.
Most discussions focus on quality alone. In practice, the breakthroughs come when you increase quality and decrease human involvement—without letting quality collapse.
A very common failure mode is accidental ambiguity: you name something in a way that implies a structure that isn’t real, and the model fills in the blanks.
Example: you have a folder ending in .bak that happens to share a name with your current project. You intended it as “some old thing I moved aside,” but the model interprets it as “a previous state of the same repo.” It then takes actions consistent with that mental model, such as restoring code from git history rather than copying from the intended source.
Nothing “mystical” happened. The model had missing context, guessed, and then reasoned coherently inside the wrong world.
The fix is also not mystical: add one sentence of disambiguation. In effect, you spend a few tokens to delete an entire class of wrong branches.
Now consider a more damaging example: you’re debugging something complex (say, a virtual machine implementation) against an authoritative test suite. You believe you have a tracer that diffs your execution trace against a reference implementation and shows the mismatch.
So your entire debugging strategy is built around “inspect the diff; repair the discrepancy.”
But the diff tool is broken, or missing, or unreliable.
This is not merely “missing context.” It is incorrect context: you are asserting the existence of a feedback signal that does not exist. The model will spend time looking for it, working around it, or even trying to fix the tool instead of fixing the underlying bug. Even worse, the loop itself is now optimized toward a phantom target.
If you run this for a day, you don’t just lose time; you burn the core resource: human attention and compute, channeled into the wrong gradient.
Once the diff tool actually works, performance can flip from “stuck” to “rapid convergence,” because you’ve replaced imagination with ground truth.
Most people treat generated code as the asset and the prompt as the disposable wrapper. For agentic work, that’s often backwards.
A common pattern:
- A run “works,” but you can tell it was messy: excessive tool churn, lots of debugging, sprawling context, brittle reasoning.
- Your instinct says: “It compiles; move on.”
But if your goal is high-quality outcomes at scale, a different strategy is frequently dominant:
- Read the final report for ~10 seconds.
- Decide if the run was clean or messy.
- If messy, do not review the code deeply.
- Distill what went wrong into the prompt (fix missing disambiguations, add constraints, improve the plan).
- Re-run from scratch.
- Compare results if necessary; often the second run is structurally cleaner and less fragile.
Why is this rational?
Because the actual product of an iteration is not the code—it is the improved context window. A messy run teaches you what information was missing and what assumptions were wrong. The correct next step is to encode that learning into the context, then rerun. You are converting a chaotic debugging journey into a clean, minimal specification that the model can execute without flailing.
This is also where async vs sync matters:
- If you are synchronously watching the model run, reruns feel expensive because they occupy your attention.
- If you run agents asynchronously and only audit outputs, reruns become cheap: you pay mainly for review and orchestration, not for waiting.
In other words, you are hill climbing not just context quality, but also the human bottleneck.
There’s a style of working where you become a manual tool call:
- copy/paste errors into the model,
- run tests yourself,
- eyeball the UI,
- nudge the model with short feedback.
This can be useful. It builds intuition (“model empathy”), helps you learn what’s possible, and is often the fastest path when stakes are low.
But it is not the same as context engineering.
I’d draw the boundary like this:
- Vibe coding optimizes for flow and low cognitive load. The human remains heavily in the loop, and quality is often “good enough.”
- Context engineering optimizes for producing high-quality, verified, minimal context windows that reliably yield high-quality outputs, with a deliberate focus on bottlenecks and failure modes.
Both have a place. The confusion happens when people call all of it “prompting,” and then conclude that success is about writing perfect prompts in one shot. In practice, the best practitioners are not prompt writers; they are feedback loop engineers.
The most useful habit I’ve found is to maintain a small, structured “kernel” that survives across iterations. Instead of carrying the whole transcript, you carry the minimal state that the next run needs.
A template:
Context Kernel
- Goal: one sentence.
- Non-goals: what not to do.
- Constraints: hard constraints only.
- Current state: what exists, what’s broken, what’s already true.
- Validation: exact commands, expected outputs.
- Key files / entry points: paths, functions, modules.
- Known failures: “Tried X; failed because Y. Don’t retry X.”
- Plan: short, checkable steps.
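If it helps, the kernel can be a literal data structure that you render into the top of every run's prompt. A sketch, with fields mirroring the template above:

```python
from dataclasses import dataclass, field

@dataclass
class ContextKernel:
    """The minimal state that survives between runs; everything else is disposable."""
    goal: str
    non_goals: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)
    current_state: str = ""
    validation: list[str] = field(default_factory=list)      # exact commands + expected output
    key_files: list[str] = field(default_factory=list)
    known_failures: list[str] = field(default_factory=list)  # "Tried X; failed because Y."
    plan: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the kernel as the opening block of the next context window."""
        def section(title: str, items: list[str]) -> str:
            if not items:
                return ""
            return f"{title}:\n" + "\n".join(f"- {item}" for item in items)

        parts = [
            f"Goal: {self.goal}",
            section("Non-goals", self.non_goals),
            section("Constraints", self.constraints),
            f"Current state: {self.current_state}" if self.current_state else "",
            section("Validation", self.validation),
            section("Key files", self.key_files),
            section("Known failures", self.known_failures),
            section("Plan", self.plan),
        ]
        return "\n\n".join(p for p in parts if p)
```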
This structure supports hill climbing because each iteration forces you to answer:
- What was wrong? Remove it.
- What was missing? Add it.
- What is now irrelevant? Drop it.
- What can be compressed? Compress it.
A complementary operational rule:
Prefer hard validation. Tests, typecheckers, linters, deterministic checks—these are high-quality context because they are grounded. When you lack hard validation, create a rubric or approximate scoring function, but recognize you’ve shifted to a softer hill.
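One way to keep this rule honest is to make hard validation an explicit, deterministic step whose raw output feeds the next context window. The specific commands below are illustrative:

```python
import subprocess

def hard_validate(commands: list[str]) -> tuple[bool, str]:
    """Grounded feedback: run deterministic checks and report exactly what they said."""
    report_lines = []
    all_passed = True
    for cmd in commands:
        result = subprocess.run(cmd.split(), capture_output=True, text=True)
        passed = result.returncode == 0
        all_passed = all_passed and passed
        report_lines.append(f"$ {cmd} -> {'PASS' if passed else 'FAIL'}\n"
                            f"{result.stdout}{result.stderr}")
    return all_passed, "\n".join(report_lines)

# Example usage (commands are illustrative):
#   ok, report = hard_validate(["pytest -q", "mypy src", "ruff check src"])
# The report goes straight into the next context window: it is ground truth,
# not the model's opinion of itself.
```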
Hill climbing only works if you know when you’re near the top.
Some plateau signals are objective:
- tests pass,
- diffs shrink,
- fewer tool calls per successful change,
- fewer retries,
- less entropy in the loop.
Others are subjective:
- UI “prettiness,”
- writing quality,
- product feel.
For subjective domains, you often need a manual review loop, at least until you can formalize what “better” means.
But even there, you can instrument proxies: time-to-first-acceptable output, number of revisions, user feedback. The important thing is to avoid mistaking motion for progress.
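If you want to instrument those proxies, even a crude per-iteration log plus a stall check is enough to separate motion from progress. The metric choice and the window size here are arbitrary:

```python
from dataclasses import dataclass

@dataclass
class IterationStats:
    tests_passed: int
    tool_calls: int
    retries: int

def plateaued(history: list[IterationStats], window: int = 3) -> bool:
    """Treat the run as plateaued if the last few iterations stopped improving."""
    if len(history) <= window:
        return False
    best_before = max(s.tests_passed for s in history[:-window])
    best_recent = max(s.tests_passed for s in history[-window:])
    return best_recent <= best_before  # motion without progress
```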
A model is a powerful engine, but without a feedback loop it’s closer to an idle generator than a worker. It doesn’t interact with reality unless you provide reality. It doesn’t self-correct unless you give it the signals that make correction possible.
So the skill ceiling is not “write better prompts.” It’s:
- be relentlessly context-aware,
- build feedback loops that delete wrong context, acquire missing context, and compress away noise,
- and learn to produce high-quality context windows at high throughput without becoming the bottleneck.
If you adopt one rule to start:
When you finish a task—or when the task meaningfully changes—clear your context.
Everything else builds on that.