### 1. Treat the harness as first-class product IP

The orchestrator is the product. Invest in:

- context selection
- task planning
- tool descriptions
- retries and recovery
- permission policies
- durable state and memory
- evaluation loops
### 2. Long-running agents need explicit state management

If an agent will span many turns or run in the background, it cannot rely on raw transcript accumulation. It needs:

- compact task state
- durable artifacts and handoff files
- summarized history
- selective retrieval of only relevant prior work
### 3. Safety needs multiple layers

The practical stack is not one feature. It is a combination of:

- conservative defaults
- scoped permissions
- sandboxing where possible
- action classification
- audit logs
- destructive-action testing
- prompt-injection defenses
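Several of these layers compose naturally in code. The sketch below is illustrative only: `ActionClass`, `PERMISSION_SCOPES`, and the keyword lists are assumptions, not part of any real agent framework, and a production classifier would parse commands rather than match substrings.

```python
from enum import Enum

class ActionClass(Enum):
    READ = "read"                 # safe by default
    WRITE = "write"               # requires a granted workspace scope
    DESTRUCTIVE = "destructive"   # conservative default: never pre-granted

# Hypothetical scoped permissions granted for this session.
PERMISSION_SCOPES = {"read": True, "write": True, "destructive": False}

def classify(command: str) -> ActionClass:
    """Crude keyword-based action classification; real systems parse the command."""
    if any(tok in command for tok in ("rm ", "drop ", "force-push")):
        return ActionClass.DESTRUCTIVE
    if any(tok in command for tok in (">", "mv ", "sed -i")):
        return ActionClass.WRITE
    return ActionClass.READ

def allowed(command: str) -> bool:
    """Conservative default: anything without an explicitly granted scope is denied."""
    return PERMISSION_SCOPES.get(classify(command).value, False)
```

The point is the layering: classification feeds a scope check, the scope check defaults closed, and every decision is cheap to log for the audit trail.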
### 4. Local agents create real endpoint risk

A coding agent with shell and filesystem access is effectively privileged software. That means release hygiene matters:

- do not ship source maps in production artifacts
- scan release bundles before publish
- use artifact signing / attestation
- minimize local plaintext retention where possible
- document what is logged, where, and why
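A pre-publish scan can be a few lines in CI. This is a minimal sketch under assumptions: the blocked patterns here (source maps, keys, env files) are examples, and a real check would cover far more.

```python
from pathlib import Path

# Hypothetical deny-list of file patterns that should never ship in a bundle.
BLOCKED_PATTERNS = ("*.map", "*.pem", ".env")

def scan_bundle(bundle_dir: str) -> list[str]:
    """Return paths of files that should never appear in a production artifact."""
    root = Path(bundle_dir)
    offenders: list[str] = []
    for pattern in BLOCKED_PATTERNS:
        offenders.extend(str(p) for p in root.rglob(pattern))
    return sorted(offenders)
```

Wire it into the release pipeline so that a non-empty result fails the build before anything is signed or published.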
## How to Be Effective with Context Engineering

Anthropic defines context engineering as curating and maintaining the right set of tokens and state around a model invocation, not just writing a better prompt. For an agentic CLI, the practical meaning is simpler: the system should always provide the model with enough context to take the next correct action, but not so much that it becomes distracted, expensive, or unsafe.
### A more useful working definition

For a coding agent, context is not just the system prompt. It is the full operating environment:

- the active task and constraints
- the current plan and stopping condition
- the relevant files, symbols, and diffs
- the available tools and their contracts
- the recent observations from shell commands and tests
- durable memory from earlier work
- the policy boundary around permissions and risky actions

If any of those are missing, stale, or too noisy, agent quality drops fast.
### The context stack a coding CLI should manage

Treat context as a layered stack, not a single blob:

1. **Stable policy layer**
   The non-negotiables: system rules, tool permissions, repo conventions, sandbox limits, output style, and safety constraints.

2. **Task layer**
   The user's request, the success condition, assumptions, and explicit non-goals. This should be short and durable.

3. **Working-state layer**
   The current plan, what has already been tried, what remains blocked, and which files or services are in scope.

4. **Evidence layer**
   The actual code snippets, command results, test failures, stack traces, and docs needed for the next decision.

5. **Memory layer**
   Reusable facts worth carrying across turns, such as build quirks, repo-specific commands, and previous failed approaches.

Most agent failures happen when these layers are mixed together without discipline.
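One way to keep the layers separate is to assemble them explicitly at prompt time. The sketch below is an assumption-laden illustration: the layer names mirror the stack above, but the field names, ordering, and character budget are invented, and the truncation is deliberately naive (a real system would summarize, not slice).

```python
def assemble_context(layers: dict[str, str], budget_chars: int = 8000) -> str:
    """Concatenate layers in priority order, truncating lower-priority layers
    once the budget runs out. Naive by design; real systems summarize instead."""
    order = ["policy", "task", "working_state", "evidence", "memory"]
    parts = []
    remaining = budget_chars
    for name in order:
        chunk = layers.get(name, "")
        if len(chunk) > remaining:
            chunk = chunk[:remaining]
        parts.append(f"## {name}\n{chunk}")
        remaining -= len(chunk)
    return "\n\n".join(parts)
```

Because policy comes first in the priority order, safety constraints survive even when the evidence layer gets squeezed, which is exactly the discipline the stack is meant to enforce.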
### Opinionated rules for agent and CLI design
#### 1. Keep the task state outside the transcript

Do not rely on the model to infer the current plan from chat history. Persist a compact state object or artifact containing:

- the objective
- current step
- files in scope
- known constraints
- open questions
- last meaningful result

The transcript is a bad database. Use it for conversation, not state recovery.
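One possible shape for that artifact, sketched as a dataclass persisted to JSON. The field names here are assumptions, not a standard schema; the point is only that the state is compact, structured, and survives independently of the transcript.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class TaskState:
    objective: str
    current_step: str
    files_in_scope: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)
    last_result: str = ""

    def save(self, path: str) -> None:
        """Persist the state as a durable artifact outside the transcript."""
        with open(path, "w") as f:
            json.dump(asdict(self), f, indent=2)

    @staticmethod
    def load(path: str) -> "TaskState":
        """Recover state from disk instead of re-deriving it from chat history."""
        with open(path) as f:
            return TaskState(**json.load(f))
```

After a restart or a handoff to a fresh context window, the agent reloads this file rather than replaying the conversation.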
#### 2. Retrieve code narrowly and late

Do not dump entire files or directories into context by default. Retrieve only what the next step needs:

- a specific symbol
- a failing test
- a diff hunk
- a bounded file region
- a targeted doc excerpt

Broad retrieval creates distraction and raises token cost without improving decisions.
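A bounded file-region read is the simplest of these. This minimal sketch returns only the requested line range, with line-number prefixes so the model can reference exact locations; the interface is an assumption, not any particular tool's API.

```python
def read_region(path: str, start: int, end: int) -> str:
    """Return lines start..end (1-indexed, inclusive) with line-number prefixes,
    instead of dumping the whole file into context."""
    with open(path) as f:
        lines = f.readlines()
    window = lines[max(start - 1, 0):end]
    return "".join(f"{start + i}: {line}" for i, line in enumerate(window))
```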
#### 3. Summarize after every expensive step

After a search pass, test run, or multi-command investigation, convert the result into a short structured summary before moving on. Good summaries should capture:

- what was learned
- what changed
- what remains uncertain
- what the next action should be

This keeps the working set fresh and prevents context drift across long sessions.
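Summaries only help if the raw output they replace is actually dropped. The sketch below is one assumed mechanism: once an expensive step has a summary, the oversized raw output is pruned from the rolling history. The field names and size threshold are invented for illustration.

```python
def compact(history: list[dict], max_raw: int = 200) -> list[dict]:
    """Replace oversized raw outputs with their structured summaries, keeping
    the rolling context fresh across long sessions."""
    out = []
    for entry in history:
        raw = entry.get("raw", "")
        if len(raw) > max_raw and "summary" in entry:
            out.append({"summary": entry["summary"]})
        else:
            out.append(entry)
    return out
```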
#### 4. Design tools to return decision-ready output

Tool output should help the model choose the next action, not force it to parse noise. Prefer:

- concise command output
- bounded file reads
- explicit exit codes
- normalized error messages
- machine-parseable fields where possible

If a tool returns pages of raw text, the tool is poorly designed for agent use.
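Concretely, a decision-ready tool result might look like the hypothetical shape below: an explicit exit code, bounded output, and a normalized error field. Everything here is an assumption sketched for illustration, not any real tool protocol.

```python
from dataclasses import dataclass

@dataclass
class ToolResult:
    exit_code: int
    stdout: str
    error: str = ""

def bounded(result_text: str, exit_code: int, limit: int = 1000) -> ToolResult:
    """Truncate long output and surface failure as a normalized field, so the
    model gets a decision-ready result instead of pages of raw text."""
    text = result_text if len(result_text) <= limit else result_text[:limit] + "\n[truncated]"
    error = "" if exit_code == 0 else f"command failed with exit code {exit_code}"
    return ToolResult(exit_code=exit_code, stdout=text, error=error)
```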
#### 5. Make memory write-worthy, not chatty

Persistent memory should be rare and high-value. Store only facts that are likely to matter later, such as:

- the right test command for this repo
- a non-obvious setup requirement
- a dangerous directory or workflow to avoid
- a service dependency that causes common failures

Do not store transient observations that belong in the current task state only.
#### 6. Separate planning context from execution context

The model needs different context when deciding what to do than when editing a file or running a command. A good CLI can tighten the context window for execution:

- include only the target file and local constraints for edits
- include only the exact command intent and safety policy for shell execution
- include only the relevant failure output for debugging

This reduces accidental spillover from stale earlier reasoning.
#### 7. Build explicit stop conditions

Agents burn time when they do not know when to stop. Every substantial task should carry one of these end states:

- requested change implemented
- tests passing or best-available verification complete
- blocked on missing permission or missing information
- unsafe to continue without user confirmation

Without a stop condition, context engineering degrades into aimless looping.
### Common failure modes to design against

These are the recurring context failures in coding agents:

- **Context poisoning:** irrelevant logs, stale plans, or old diffs dominate the prompt.
- **Context starvation:** the model is asked to act without the relevant file region, command result, or policy detail.
- **Context collision:** instructions from different phases conflict, such as planning guidance leaking into final output formatting.
- **Context amnesia:** the agent forgets prior discoveries because nothing durable was written down.
- **Context bloat:** every turn carries too much history, so quality drops and latency rises.

Your CLI should have explicit mechanisms to detect and correct each of these.
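Even crude detectors beat none. The heuristics below are pure assumptions sketched to show the shape of such a mechanism: the state keys and thresholds are invented, and each check maps to one failure mode from the list above.

```python
def diagnose(context: dict) -> list[str]:
    """Flag a few of the failure modes with simple, tunable heuristics."""
    problems = []
    if len(context.get("history", "")) > 50_000:
        problems.append("context bloat: history exceeds budget, summarize and prune")
    if not context.get("evidence"):
        problems.append("context starvation: no file region or command result attached")
    if context.get("plan_stale", False):
        problems.append("context poisoning: plan is older than the last repo change")
    return problems
```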
### A tactical operating loop

For a coding agent, a strong default loop looks like this:

1. Restate the goal and define success.
2. Gather only the minimum code and repo context needed to choose the next step.
3. Write or update compact task state.
4. Execute one meaningful action.
5. Summarize the result into durable working state.
6. Prune stale context before the next step.
7. Stop as soon as the success condition or block condition is reached.

This is the operational core behind most reliable agent behavior.
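The loop above can be sketched as a skeleton. Here `act`, `gather`, and `summarize` are stand-ins for real components, not a real API, and the step-count and pruning limits are arbitrary.

```python
def run_task(goal: str, act, gather, summarize, max_steps: int = 10) -> dict:
    """Skeleton of the operating loop: gather minimal context, act once,
    summarize, prune, and stop early on success or block."""
    state = {"goal": goal, "notes": [], "done": False}
    for _ in range(max_steps):
        evidence = gather(state)                   # minimum context for the next step
        result = act(state, evidence)              # one meaningful action
        state["notes"].append(summarize(result))   # durable working state
        state["notes"] = state["notes"][-5:]       # prune stale context
        if result.get("done") or result.get("blocked"):
            state["done"] = True                   # stop at success or block
            break
    return state
```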
### What the Claude Code leak suggests here

The leak matters because it reinforces that strong coding agents are mostly a context-management problem wrapped around a model:

- permission logic is context engineering
- tool orchestration is context engineering
- background execution is context engineering
- memory and handoff artifacts are context engineering
- safety boundaries are context engineering

That is the practical takeaway: do not hunt for a magic prompt. Build a system that keeps the right context available at the right time.
## Practical Takeaways

If the goal is to design a strong agentic CLI, the combined lesson is:

- Do not over-focus on prompt wording.
- Invest in context assembly, memory, tool quality, and evaluations.
- Keep the architecture simple until complexity is justified.
- Treat local execution and packaging as security-sensitive.
- Treat context as core infrastructure, not support work.
## Sources

- [Effective context engineering for AI agents | Anthropic](https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents)
- [Building Effective AI Agents | Anthropic](https://www.anthropic.com/research/building-effective-agents)
- [Writing effective tools for AI agents | Anthropic](https://www.anthropic.com/engineering/writing-tools-for-agents)
- [Best practices for prompt engineering with the OpenAI API | OpenAI Help Center](https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-the-openai-api)