AI-Assisted Development: A Practical Guide
Thoughts on incorporating AI coding agents into real engineering workflows: understanding LLMs, agents, context management, and prompt engineering, and what I've learned so far.
I'm currently introducing AI-assisted development at the company where I work. This article is my attempt to organize what I know, what I'm still figuring out, and which resources helped me the most.
Why Should You Care
Nolan Lawson wrote a great piece about the emotional side of this shift. He described it honestly: "The worst fact about these tools is that they work." He's not celebrating the new world, but he's not resisting it either.
Whether you like it or not, AI coding tools are changing how we work. Your junior colleagues are already using Cursor, Claude Code, Copilot. They write code faster. Not always better, but faster. And the tools keep improving.
The question isn't "should I use AI for coding?" anymore. The question is: how do I use it without making a mess?
The Risks Are Real
Jake Nations wrote about this in Vibe Coding Our Way to Disaster. His argument is based on Rich Hickey's ideas about simplicity vs. ease. The short version: vibe coding (just chatting with AI and letting it write whatever) is choosing ease over simplicity. It feels productive but creates tangled, complex systems.
Vibe Coding vs. Disciplined AI Coding
The key problems with naive AI coding:
- Context complexity becomes code complexity. When you have long conversations with AI, corrections and clarifications pile up. The AI starts making connections between unrelated parts of the conversation. Your code becomes a reflection of that mess.
- AI amplifies your approach. If you rush to code without understanding the problem, AI helps you build the wrong thing faster. If you think first, AI becomes a powerful implementation tool.
- Most critical bugs come from misunderstanding the problem, not from implementation errors. This was true before AI, and it's even more true now when AI can generate hundreds of lines of code from a vague prompt.
- The Stanford study found that AI tools often lead to rework. Code shipped with AI one week gets rewritten next week. In large established codebases, AI can actually make developers less productive.
This isn't a reason to avoid AI tools. It's a reason to use them with discipline.
Engineering Skills Still Matter
AI doesn't replace the need to understand your system. You still need to:
- Know how your codebase works before asking AI to change it
- Review generated code with the same rigor as human-written code
- Design systems that are simple, not just easy to generate
- Understand when the AI is wrong (and it'll be wrong sometimes)
As Dex Horthy (HumanLayer) puts it in 12-Factor Agents: the best production AI agents are "comprised of mostly just software." The LLM is a powerful component, but it's the engineering around it that makes it reliable.
The Leverage Pyramid
Where you spend your human attention matters. A mistake at the research level cascades into everything below it.
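Roughly, the levels stack like this (a sketch of the idea):
Research    ← highest leverage: misunderstand the problem here and everything below inherits it
Plan        ← a flawed plan produces flawed code, however clean the implementation
Implement   ← mistakes here are the cheapest to catch and fix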
You need to be able to read the research AI produces and tell when it's wrong. You need to be able to look at a plan and spot the flaw. The human review at research and planning stages is the highest-leverage intervention in the whole process.
Understanding LLMs
Before we talk about agents, it helps to understand what an LLM actually is.
LLM Is a Stateless Function
An LLM is a function. You give it text, it gives you text back. That's it.
There's no memory between calls. There's no hidden state. Every time you send a message, the model sees the entire conversation from scratch. What feels like a "conversation" is actually your client re-sending the full history every single time.
Call 1: f("What is 2+2?") → "4"
Call 2: f("What is 2+2?" + "4" + "Now multiply by 3") → "12"
Call 3: f("What is 2+2?" + "4" + "Now multiply by 3" + "12"
+ "What was the original number?") → "4"
The model didn't "remember" that the original number was 4. It saw the full conversation in the input and found the answer there. If you removed the earlier messages, it would have no idea.
This has practical consequences:
- Context is everything. The model only knows what you put in the input. If you don't include it, it doesn't exist.
- Longer conversations degrade. Every message adds tokens. At some point the input is so large that the model loses focus on what matters.
- You pay for every token, every time. The full conversation is re-sent on each call. A 50-message conversation means message 1 has been sent 50 times.
What the Model Actually Sees
When you type a message in ChatGPT or Claude, it looks like a simple chat. Behind the scenes, the API call looks more like this:
{
"messages": [
{ "role": "system", "content": "You are a helpful..." },
{ "role": "user", "content": "What is 2+2?" },
{ "role": "assistant", "content": "4" },
{ "role": "user", "content": "Now multiply by 3" }
]
}
// → model reads ALL of this, generates the next response
The model doesn't have a session. It doesn't "know" it already answered the first question. It receives the entire list of messages and produces the next one. The chat interface is an illusion maintained by the client.
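A minimal sketch of what the client does behind that JSON, assuming a hypothetical call_llm() function in place of a real SDK; the point is the re-sent history, not the specific API:
# call_llm() is a hypothetical stand-in for a real LLM API call.
def call_llm(messages: list[dict]) -> str:
    # A real implementation would send the full message list to a model and return its reply.
    return "<model reply>"

history = [{"role": "system", "content": "You are a helpful assistant."}]

def send(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    reply = call_llm(history)                      # the ENTIRE history goes out on every call
    history.append({"role": "assistant", "content": reply})
    return reply

send("What is 2+2?")        # the model sees 2 messages
send("Now multiply by 3")   # the model sees 4 messages: its "memory" is just re-sent text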
Temperature: Controlled Randomness
You may have heard that LLMs are "non-deterministic." This is half true. The randomness is a design choice, not a flaw.
At each step, the model predicts the probability of every possible next token. Temperature controls how it picks from those probabilities:
Prompt: "The capital of France is"
Token probabilities:
"Paris" → 92%
"Lyon" → 3%
"a" → 2%
"the" → 1%
...
Temperature = 0: Always picks "Paris" (highest probability)
Temperature = 0.7: Usually picks "Paris", sometimes surprises
Temperature = 1.0: More random, might pick "Lyon" or "a"
For coding tasks, lower temperature is almost always better. You want predictable, correct output, not creative variation. Most coding agents run at low temperature by default.
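Here's what temperature does to the sampling step, as a toy sketch (made-up scores, not a real model's output):
import math, random

# Toy scores for the next token after "The capital of France is"
scores = {"Paris": 5.0, "Lyon": 1.6, "a": 1.2, "the": 0.5}

def sample_token(scores: dict[str, float], temperature: float) -> str:
    if temperature == 0:
        return max(scores, key=scores.get)          # greedy: always the top token
    # Softmax with temperature: lower values sharpen the distribution, higher values flatten it
    weights = [math.exp(s / temperature) for s in scores.values()]
    return random.choices(list(scores), weights=weights)[0]

sample_token(scores, temperature=0)    # always "Paris"
sample_token(scores, temperature=0.7)  # usually "Paris", occasionally something else
sample_token(scores, temperature=1.0)  # noticeably more random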
Tokens, Not Characters
LLMs don't read characters or words. They read tokens. A token is roughly 3-4 characters in English, but it varies.
"Hello, world!" → ["Hello", ",", " world", "!"] = 4 tokens
"def fibonacci(n):" → ["def", " fibon", "acci", "(n", "):"] = 5 tokens
"東京" → ["東", "京"] = 2 tokens
This matters because:
- Context windows are measured in tokens. When Claude says 200k context, that's 200k tokens, not characters. Roughly 150k words, or about 500 pages of text.
- You pay per token. Both input and output. Reading a 5000-line file costs more than reading a 100-line file.
- Code is token-expensive. Variable names, syntax, and whitespace all consume tokens. A 200-line function might cost more tokens than a 200-word paragraph.
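If you want to see the split yourself, OpenAI's tiktoken library (one tokenizer among several) will show it. Counts vary by tokenizer, so treat the numbers above as estimates:
# pip install tiktoken -- counts differ between tokenizers, so treat them as estimates
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["Hello, world!", "def fibonacci(n):", "東京"]:
    tokens = enc.encode(text)
    print(f"{text!r}: {len(tokens)} tokens -> {[enc.decode([t]) for t in tokens]}")

# Quick rule of thumb when you don't want a dependency: ~4 characters per token in English
def rough_token_estimate(text: str) -> int:
    return max(1, len(text) // 4)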
Why This Matters for Agents
Everything in the rest of this article builds on these basics. If you remember one thing from this section: the LLM doesn't know anything you didn't tell it. Everything else follows from that.
The next sections cover how agents loop around this stateless function and why managing context is the most important skill you can develop.
How Agents Actually Work
I did an internal presentation at my company about how to write good agents, based on 12-Factor Agents by Dex Horthy. I didn't take all 12 factors because many of them are about building agent frameworks, which isn't what most of us do day-to-day. We use agents, we don't build runtimes for them. Claude Code and Copilot control the runtime; we can partially control the tools and fully control the prompts.
The Agent Loop
At its core, every agent is just this:
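Roughly, in Python (a sketch with stand-in functions, not any particular runtime's code):
def run_agent(system_prompt: str, task: str) -> list:
    context = [system_prompt, task]
    while True:
        step = call_llm(context)                        # stateless call: the model sees the whole context
        if step["type"] == "done":
            return context                              # the model decided the task is complete
        result = run_tool(step["tool"], step["args"])   # deterministic code executes the tool call
        context += [step, result]                       # every step and result gets appended, forever

def call_llm(context: list) -> dict:
    # Stand-in: a real call sends `context` to a model and parses its tool-call output.
    return {"type": "done"}

def run_tool(name: str, args: dict) -> str:
    # Stand-in: dispatch to Grep, Read, Edit, Bash, etc.
    return f"<result of {name}>"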
The problem: after many iterations, the context window fills up. The agent starts looping on the same broken approach. It forgets what it tried. Even as models support longer context, focused prompts always work better.
Four Components You Control
| Component | What it is | What you control |
|---|---|---|
| Prompt | Instructions for the LLM | Fully. You write it. |
| Context | Accumulated history of steps and results | Partially. You shape what goes in. |
| Tools | Actions the agent can take (read files, run commands, etc.) | Partially. You pick which tools are available. |
| Loop | Keep going until done | Partially. You define when to pause/stop. |
Five Factors That Matter for Prompt Engineering
From the original 12 factors, these five are most relevant when you write prompts for coding assistants:
Factor 1: Natural Language Becomes Tool Calls
Your words become structured tool calls. You type "find auth code," and the LLM decides: call Grep with pattern "auth", then Glob on **/auth/**. Specific tool, specific parameters, structured as JSON. Then deterministic code picks that up and runs it.
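For example, "find auth code" might come back from the model as something like this (a hypothetical shape; the exact tool-call format varies by provider):
# Hypothetical structured output for "find auth code".
# Free text became a specific tool with specific, typed parameters.
tool_calls = [
    {"tool": "Grep", "args": {"pattern": "auth", "path": "src/"}},
    {"tool": "Glob", "args": {"pattern": "**/auth/**"}},
]

# Deterministic code, not the model, actually runs these:
for call in tool_calls:
    print(f"running {call['tool']} with {call['args']}")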
Factor 2: Own Your Prompts
Don't outsource your prompt engineering to a black box. Some frameworks hide the actual prompt behind abstractions like "role," "goal," and "personality." Fine for prototyping. Hard to tune for production. You want to see and modify the exact tokens going to the model.
With Claude Code you can't control the internal system prompt. But you can own CLAUDE.md, custom commands in .claude/commands/, and agent definitions in .claude/agents/. That's your prompt surface area. Write it, version it, test it.
"Our library gives you the best output!" ... "SHOW ME THE PROMPT."
Factor 3: Own Your Context Window
Everything is context engineering. As covered above, LLMs are stateless functions. The only thing that affects output quality is the quality of the input. The original factor goes further: you don't have to stick with standard message-based formats. You can structure context however you want, pack more signal into fewer tokens. Andrej Karpathy popularized the term "context engineering" for this. Fill the context window with the right information, not just more information.
Factor 7: Contact Humans with Tool Calls
Build checkpoints into prompts so the agent knows when to stop and ask.
# From implement_plan.md:
"Phase [N] Complete - Ready for Verification.
Automated checks passed:
- [x] Tests pass
- [x] Lint clean
Please perform manual verification:
- [ ] Feature works in UI
- [ ] No regressions
Let me know when complete so I can proceed to Phase [N+1]."
Factor 10: Small, Focused Agents
Instead of one big agent, create small agents that each do one specific thing.
Practical Tips
These patterns come from real prompt engineering experience. They're not in the 12 Factors.
Tip 1: Negative Instructions
Tell the agent what NOT to do. This prevents drift.
# Bad: only positive instructions
"Analyze the codebase and describe what you find."
# Good: positive + negative instructions
"Analyze the codebase and describe what you find.
DO NOT suggest improvements.
DO NOT perform root cause analysis.
DO NOT critique the implementation.
ONLY describe what exists, how it works, and how components interact."
Without negative instructions, the agent starts "helping": suggesting improvements, critiquing code, going off on tangents. With them, it stays focused. So be explicit about what you don't want.
Tip 2: Output Templates
Define exact format for consistent, parseable results.
# In codebase-analyzer.md:
## Analysis: [Component Name]
### Overview
[2-3 sentence summary]
### Entry Points
- `file.ts:45` - description of what's there
### Core Implementation
#### 1. [Step name] (`file.ts:15-32`)
- What it does
- How it connects to the next step
### Data Flow
1. Request arrives at `api/routes.ts:45`
2. Routed to `handlers/webhook.ts:12`
3. Validated at `handlers/webhook.ts:15-32`
Without a template, every response looks different. But with a template, results are predictable and can be parsed by other agents.
Tip 3: Tool Selection Controls Capability
Limit tools to limit what the agent CAN do. This is a physical constraint, not just instructions.
| Agent | Tools | What it CAN do | What it CANNOT do |
|---|---|---|---|
| codebase-locator | Grep, Glob, LS | Find files | Read file contents |
| codebase-analyzer | Read, Grep, Glob, LS | Read and analyze | Run commands, edit files |
| web-researcher | WebSearch, WebFetch, Read | Search the web | Modify local files |
If the agent doesn't have the Edit tool, it physically can't edit files. Not "please don't" but "literally impossible."
Tip 4: Read Before Spawn
The orchestrator must understand context before delegating to sub-agents.
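An illustrative orchestrator instruction (my paraphrase of the idea, not a quote from research_codebase.md):
"Before spawning any sub-agents:
1. Read the ticket or task description fully
2. Read any files the user mentioned directly
3. Only then decide which sub-agents to spawn and what each one should look for.
A sub-agent spawned from a vague question returns a vague answer."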
Tip 5: No Open Questions
Stop and ask instead of guessing. Five seconds to clarify saves hours of rework.
# From create_plan.md:
"If you encounter open questions during planning, STOP.
Research or ask for clarification immediately.
Do NOT write the plan with unresolved questions."
# From implement_plan.md:
"When things don't match the plan:
Issue in Phase [N]:
Expected: [what the plan says]
Found: [actual situation]
Why this matters: [explanation]
How should I proceed?"
Context Management Is Everything
Dex Horthy's Advanced Context Engineering for Coding Agents explains why context management is the most important skill for working with AI coding tools. The key insight: the context window is the only lever you have to affect the quality of the output.
What Eats Up Your Context
Tool outputs, full file reads, error messages, and failed attempts all accumulate in the conversation. A few large file reads and a couple of wrong turns can burn through most of a context window before the real work starts.
Frequent Intentional Compaction
Design your entire workflow around context management. Keep utilization in the 40-60% range. Split work into roughly three phases (sometimes you skip research and go straight to planning, sometimes you do multiple research passes before you're ready):
1. Research: understand what exists, where it lives, and how it works
2. Plan: decide what to change, in which order, and how each step will be verified
3. Implement: execute the plan phase by phase
Each phase starts with a fresh context window. The output of one phase becomes a compact input for the next. This is the core idea: instead of one long messy conversation, you have three focused sessions.
Sub-Agents for Context Control
Sub-agents aren't about role-playing. They're about using a fresh context window for searching and summarizing, so the main agent stays clean.
The .claude Prompts
The prompts I reference throughout this article are from the humanlayer/humanlayer/.claude repository. You can look at the originals to understand the full picture. They are a good example of "prompts as code" that you can version control, test, and share.
| Category | File | What it does |
|---|---|---|
| Agents | codebase-analyzer.md | Reads and explains code |
| | codebase-locator.md | Finds files (no Read tool!) |
| | codebase-pattern-finder.md | Finds code patterns |
| | web-search-researcher.md | Searches the web |
| Commands | commit.md | Simple: analyze changes, commit |
| | create_plan.md | Workflow: research, plan, iterate |
| | describe_pr.md | Simple: generate PR description |
| | implement_plan.md | Workflow: execute plan phase by phase |
| | iterate_plan.md | Workflow: update existing plans |
| | research_codebase.md | Orchestrator: spawn agents, synthesize |
Three Types of Prompts
Not all prompts are the same. Here's how they differ (see commit.md, implement_plan.md, research_codebase.md):
| Type | Who does the work | Sub-agents | Human interaction |
|---|---|---|---|
| Simple | Agent directly | None | Confirm then execute |
| Workflow | Agent, phase by phase | Optional | Gates between phases |
| Orchestrator | Sub-agents | Core mechanism | Minimal (review synthesis) |
Rule of thumb: Start simple. Add workflow when you need human checkpoints between phases. Add orchestrator when you need parallel research.
Writing Good Prompts for Existing Agents
You can't just tell the agent "use ticket NUMBER-123 and research." That's too vague. The agent won't know what to look for, what's important, or when to stop.
Bad vs. Good Prompts
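An illustrative contrast (the ticket details here are made up):
# Bad: vague, no scope, no stopping condition
"Use ticket NUMBER-123 and research."
# Good: specific scope, starting points, and expected output
"Research how webhook retries work in our system.
Context: ticket NUMBER-123 reports duplicate deliveries after receiver errors.
Start from the webhook handler and the retry queue.
Produce a research document. Do NOT propose fixes yet."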
The Pattern
Every good prompt to an existing agent follows roughly the same structure:
1. The goal: what you want done, in one sentence
2. Specific context: the ticket, the relevant files or directories, prior research
3. Scope and constraints: what is in, what is out, what must not change
4. Expected output: a research doc, a plan, a commit, a PR description
The prompts in .claude/commands/ already have good structure built in (negative instructions, output templates, step-by-step strategies, human checkpoints). Your job is to give them specific context to work with, not vague directions.
Anatomy of a Well-Written Agent Prompt
Here's what makes the prompts in .claude/agents/ effective. Using codebase-analyzer.md as an example:
---
name: codebase-analyzer
tools: Read, Grep, Glob, LS # Limited tools = limited scope
model: sonnet # Cheaper model for focused tasks
---
# Role (one sentence)
"You are a specialist at understanding HOW code works."
# Negative instructions (prevent drift)
"DO NOT suggest improvements"
"DO NOT critique the implementation"
"ONLY describe what exists"
# Step-by-step strategy (how to do the job)
Step 1: Read Entry Points
Step 2: Follow the Code Path
Step 3: Document Key Logic
# Output template (consistent format)
## Analysis: [Name]
### Overview
### Entry Points
- `file:line` - description
### Core Implementation
### Data Flow
# Closing reminder
"REMEMBER: You are a documentarian, not a critic."
This structure works because each part prevents a specific failure mode:
- Limited tools prevent the agent from doing things outside its scope
- Negative instructions prevent it from drifting into "helpful" suggestions
- Step-by-step strategy prevents random, inconsistent analysis
- Output template prevents unparseable responses
- Closing reminder reinforces the constraints (LLMs pay attention to the end of prompts)
What AI Makes Possible
Some things are just hard to do without AI tools:
- Navigating unfamiliar codebases. Dex Horthy shipped a solo bug fix to BAML, a 300k LOC Rust codebase he'd never touched. Then he and Vaibhav paired for 7 hours and shipped 35k LOC (cancellation support + WASM compilation). The BAML team estimated each of those at 3-5 days for a senior engineer.
- Parallel research. You can spawn multiple focused agents to investigate different parts of the codebase at the same time. One finds files, another analyzes code, another checks the database schema. The orchestrator synthesizes everything.
- Once you have a good plan, the implementation phase is straightforward. The agent follows the spec, and the code style matches your existing codebase because the agent read it first. Consistent code generation without the usual drift.
- Onboarding. An intern at HumanLayer shipped 2 PRs on his first day and 10 on his 8th day. Research prompts let new team members get up to speed fast.
- And there's mental alignment. Instead of reading 2000 lines of code in a PR, you read 200 lines of a well-written implementation plan. You know what's being built and why.
These are real benefits. They don't make you 10x faster at everything. But they make some previously painful tasks easier.
Credits and References
This work is heavily based on and inspired by other people's work. I want to give proper credit.
12-Factor Agents by Dex Horthy (HumanLayer, YC24). The foundation for understanding how to build reliable AI agents. This article adapts 5 of the 12 factors for prompt engineering use. The original content is licensed under CC BY-SA 4.0.
Advanced Context Engineering for Coding Agents by Dex Horthy. The article about frequent intentional compaction and the research/plan/implement workflow.
We Mourn Our Craft by Nolan Lawson. An honest and emotional piece about accepting the AI shift in software development.
Vibe Coding Our Way to Disaster by Jake Nations. About the risks of unstructured AI coding, based on Rich Hickey's ideas about simplicity vs. ease.
Context Engineering — term popularized by Andrej Karpathy for the art of providing all the context needed for a task to be plausibly solvable by an LLM.
The .claude prompts referenced in this article are from humanlayer/humanlayer/.claude, created by Dex Horthy for use with Claude Code inside the CodeLayer IDE.
Specs Are the New Code by Sean Grove. The idea that specifications will become the real source code.
Stanford Study on AI's Impact on Developer Productivity — research showing that AI tools sometimes reduce productivity in established codebases.
Content licensed under CC BY-SA 4.0. Code examples licensed under Apache 2.0.