
AI-Assisted Development: A Practical Guide

Thoughts on incorporating AI coding agents into real engineering workflows — understanding LLMs, agents, context management, and prompt engineering.


Thoughts on incorporating AI coding agents into real engineering workflows, and what I've learned so far.

I'm currently introducing AI-assisted development at the company I work at. This article is my attempt to organize what I know, what I'm still figuring out, and what resources helped me the most.

Why Should You Care

Nolan Lawson wrote a great piece about the emotional side of this shift. He described it honestly: "The worst fact about these tools is that they work." He's not celebrating the new world, but he's not resisting it either.

Whether you like it or not, AI coding tools are changing how we work. Your junior colleagues are already using Cursor, Claude Code, Copilot. They write code faster. Not always better, but faster. And the tools keep improving.

The question isn't "should I use AI for coding?" anymore. The question is: how do I use it without making a mess?

The Risks Are Real

Jake Nations wrote about this in Vibe Coding Our Way to Disaster. His argument is based on Rich Hickey's ideas about simplicity vs. ease. The short version: vibe coding (just chatting with AI and letting it write whatever) is choosing ease over simplicity. It feels productive but creates tangled, complex systems.

Vibe Coding vs. Disciplined AI Coding

Vibe Coding (ease):
  • You: "make a login page" → AI writes 200 lines
  • You: "it doesn't work, fix it" → AI rewrites 150 lines
  • You: "now add validation" → AI patches on top of patches
  • You: "why is everything broken?" → ...
  • Result: a tangled mess of corrections

Disciplined AI Coding (simplicity):
  • You: research the auth flow in the codebase → AI maps existing patterns
  • You: review the research, plan the approach → AI creates an implementation plan
  • You: review the plan, approve → AI implements following the plan
  • You: review the code, run tests → ship with confidence
  • Result: clean code that fits the codebase

The key problems with naive AI coding:

  • Context complexity becomes code complexity. When you have long conversations with AI, corrections and clarifications pile up. The AI starts making connections between unrelated parts of the conversation. Your code becomes a reflection of that mess.
  • AI amplifies your approach. If you rush to code without understanding the problem, AI helps you build the wrong thing faster. If you think first, AI becomes a powerful implementation tool.
  • Most critical bugs come from misunderstanding the problem, not from implementation errors. This was true before AI, and it's even more true now when AI can generate hundreds of lines of code from a vague prompt.
  • The Stanford study found that AI tools often lead to rework. Code shipped with AI one week gets rewritten next week. In large established codebases, AI can actually make developers less productive.

This isn't a reason to avoid AI tools. It's a reason to use them with discipline.

Engineering Skills Still Matter

AI doesn't replace the need to understand your system. You still need to:

  • Know how your codebase works before asking AI to change it
  • Review generated code with the same rigor as human-written code
  • Design systems that are simple, not just easy to generate
  • Understand when the AI is wrong (and it'll be wrong sometimes)

As Dex Horthy (HumanLayer) puts it in 12-Factor Agents: the best production AI agents are "comprised of mostly just software." The LLM is a powerful component, but it's the engineering around it that makes it reliable.

The Leverage Pyramid

Where you spend your human attention matters. A mistake at the research level cascades into everything below it.

Research        → 1 bad line of research = thousands of bad lines of code
Plan            → 1 bad line of a plan = hundreds of bad lines of code
Implementation  → 1 bad line of code = 1 bad line of code

Human effort goes here: review research and plans, not just code.

You need to be able to read the research AI produces and tell when it's wrong. You need to be able to look at a plan and spot the flaw. The human review at research and planning stages is the highest-leverage intervention in the whole process.

Understanding LLMs

Before we talk about agents, it helps to understand what an LLM actually is.

An LLM Is a Stateless Function

An LLM is a function. You give it text, it gives you text back. That's it.

f(input_text) → output_text

There's no memory between calls. There's no hidden state. Every time you send a message, the model sees the entire conversation from scratch. What feels like a "conversation" is actually your client re-sending the full history every single time.

Call 1:  f("What is 2+2?")                                         → "4"
Call 2:  f("What is 2+2?" + "4" + "Now multiply by 3")             → "12"
Call 3:  f("What is 2+2?" + "4" + "Now multiply by 3" + "12"
           + "What was the original number?")                       → "4"

The model didn't "remember" that the original number was 4. It saw the full conversation in the input and found the answer there. If you removed the earlier messages, it would have no idea.

This has practical consequences:

  • Context is everything. The model only knows what you put in the input. If you don't include it, it doesn't exist.
  • Longer conversations degrade. Every message adds tokens. At some point the input is so large that the model loses focus on what matters.
  • You pay for every token, every time. The full conversation is re-sent on each call. A 50-message conversation means message 1 has been sent 50 times.

What the Model Actually Sees

When you type a message in ChatGPT or Claude, it looks like a simple chat. Behind the scenes, the API call looks more like this:

{
  "messages": [
    { "role": "system",    "content": "You are a helpful..." },
    { "role": "user",      "content": "What is 2+2?" },
    { "role": "assistant", "content": "4" },
    { "role": "user",      "content": "Now multiply by 3" }
  ]
}
// → model reads ALL of this, generates the next response

The model doesn't have a session. It doesn't "know" it already answered the first question. It receives the entire list of messages and produces the next one. The chat interface is an illusion maintained by the client.
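
To make the illusion concrete, here is a minimal sketch of the client side. It assumes the OpenAI Python SDK purely as an example; any provider with a chat-style API works the same way, and the model name is just a placeholder.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
history = [{"role": "system", "content": "You are a helpful assistant."}]

while True:
    user_text = input("you> ")
    history.append({"role": "user", "content": user_text})

    # The ENTIRE conversation is re-sent on every call. The "memory"
    # lives here in the client, not in the model.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=history,
    )
    reply = response.choices[0].message.content

    history.append({"role": "assistant", "content": reply})
    print("assistant>", reply)

Delete an entry from history and, on the next call, the model has never seen it.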

Temperature: Controlled Randomness

You may have heard that LLMs are "non-deterministic." This is half true. The randomness is a design choice, not a flaw.

At each step, the model predicts the probability of every possible next token. Temperature controls how it picks from those probabilities:

Prompt: "The capital of France is"

Token probabilities:
  "Paris"    → 92%
  "Lyon"     → 3%
  "a"        → 2%
  "the"      → 1%
  ...

Temperature = 0:    Always picks "Paris" (highest probability)
Temperature = 0.7:  Usually picks "Paris", sometimes surprises
Temperature = 1.0:  More random, might pick "Lyon" or "a"

For coding tasks, lower temperature is almost always better. You want predictable, correct output, not creative variation. Most coding agents run at low temperature by default.
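
If you want to see the mechanism rather than take it on faith, here is a toy sketch of the sampling step. It is not how any particular provider implements decoding, but dividing the logits by the temperature is mathematically equivalent to the reweighting below.

import random

def sample_token(probs: dict[str, float], temperature: float) -> str:
    """Pick the next token from a {token: probability} distribution."""
    if temperature == 0:
        # Greedy decoding: always the most likely token.
        return max(probs, key=probs.get)

    # Reweight each probability by p ** (1/T) and renormalize.
    # T < 1 sharpens the distribution, T > 1 flattens it.
    weights = {t: p ** (1.0 / temperature) for t, p in probs.items()}
    total = sum(weights.values())
    tokens = list(weights)
    return random.choices(tokens, weights=[weights[t] / total for t in tokens])[0]

probs = {"Paris": 0.92, "Lyon": 0.03, "a": 0.02, "the": 0.01}
print(sample_token(probs, temperature=0))    # always "Paris"
print(sample_token(probs, temperature=1.0))  # usually "Paris", occasionally not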

Tokens, Not Characters

LLMs don't read characters or words. They read tokens. A token is roughly 3-4 characters in English, but it varies.

"Hello, world!"       → ["Hello", ",", " world", "!"]           = 4 tokens
"def fibonacci(n):"   → ["def", " fibon", "acci", "(n", "):"]   = 5 tokens
"東京"                 → ["東", "京"]                              = 2 tokens

This matters because:

  • Context windows are measured in tokens. When Claude says 200k context, that's 200k tokens, not characters. Roughly 150k words, or about 500 pages of text.
  • You pay per token. Both input and output. Reading a 5000-line file costs more than reading a 100-line file.
  • Code is token-expensive. Variable names, syntax, and whitespace all consume tokens. A 200-line function might cost more tokens than a 200-word paragraph.
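
If you want to check token counts like the examples above yourself, OpenAI's open-source tiktoken library is an easy way to experiment. Exact splits differ between models and providers, so treat the numbers as approximations.

# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["Hello, world!", "def fibonacci(n):", "東京"]:
    tokens = enc.encode(text)
    pieces = [enc.decode([t]) for t in tokens]
    print(f"{text!r}: {len(tokens)} tokens {pieces}")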

Why This Matters for Agents

Everything in the rest of this article builds on these basics. If you remember one thing from this section: the LLM doesn't know anything you didn't tell it. Everything else follows from that.

The next sections cover how agents loop around this stateless function and why managing context is the most important skill you can develop.

How Agents Actually Work

I did an internal presentation at my company about how to write good agents, based on 12-Factor Agents by Dex Horthy. I didn't take all 12 factors because many of them are about building agent frameworks, which isn't what most of us do day-to-day. We use agents, we don't build runtimes for them. Claude Code and Copilot control the runtime; we can partially control the tools and fully control the prompts.

The Agent Loop

At its core, every agent is just this:

context window → LLM decides next action → tool call → execute action
→ result feeds back into context → repeat until "done"

The problem: after many iterations, the context window fills up. The agent starts looping on the same broken approach. It forgets what it tried. Even as models support longer context, focused prompts always work better.
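
In code, the loop is only a few lines. This is a deliberately minimal sketch: call_llm is a placeholder for a real model call that returns either a tool call or a final answer, and real agents like Claude Code wrap far more machinery around this skeleton.

import json
import subprocess

TOOLS = {
    "read_file": lambda path: open(path).read(),
    "run_command": lambda cmd: subprocess.run(
        cmd, shell=True, capture_output=True, text=True
    ).stdout,
}

def call_llm(context: list[dict]) -> dict:
    """Placeholder: send the context to a model and parse its reply into
    {"tool": ..., "args": {...}} or {"done": True, "answer": ...}."""
    raise NotImplementedError

def run_agent(task: str, max_steps: int = 20) -> str:
    context = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        decision = call_llm(context)          # LLM decides the next action
        if decision.get("done"):
            return decision["answer"]
        result = TOOLS[decision["tool"]](**decision["args"])  # execute it
        # The result feeds back into the context for the next iteration,
        # which is exactly how the window fills up over time.
        context.append({"role": "tool",
                        "content": json.dumps({"tool": decision["tool"],
                                               "result": result[:2000]})})
    return "Stopped: hit the step limit without finishing."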

Four Components You Control

Component   What it is                                                    What you control
Prompt      Instructions for the LLM                                      Fully. You write it.
Context     Accumulated history of steps and results                      Partially. You shape what goes in.
Tools       Actions the agent can take (read files, run commands, etc.)   Partially. You pick which tools are available.
Loop        Keep going until done                                         Partially. You define when to pause/stop.

Five Factors That Matter for Prompt Engineering

From the original 12 factors, these five are most relevant when you write prompts for coding assistants:

Factor 1: Natural Language Becomes Tool Calls

Your words become structured tool calls. You type "find auth code," and the LLM decides: call Grep with pattern "auth", then Glob on **/auth/**. Specific tool, specific parameters, structured as JSON. Then deterministic code picks that up and runs it.

What you type                What the agent actually does
/commit                      git status → git diff → git add <files> → git commit -m "..."
"find auth code"             Grep: "auth" → Glob: **/auth/** → LS: src/services/auth/
"explain the login flow"     Read: src/auth/login.ts → Read: src/auth/middleware.ts → trace calls between files

Factor 2: Own Your Prompts

Don't outsource your prompt engineering to a black box. Some frameworks hide the actual prompt behind abstractions like "role," "goal," and "personality." Fine for prototyping. Hard to tune for production. You want to see and modify the exact tokens going to the model.

With Claude Code you can't control the internal system prompt. But you can own CLAUDE.md, custom commands in .claude/commands/, agent definitions in .claude/agents/. That's your prompt surface area. Write it, version it, test it.

"Our library gives you the best output!" ... "SHOW ME THE PROMPT."

Factor 3: Own Your Context Window

Everything is context engineering. As covered above, LLMs are stateless functions. The only thing that affects output quality is the quality of the input. The original factor goes further: you don't have to stick with standard message-based formats. You can structure context however you want, pack more signal into fewer tokens. Andrej Karpathy popularized the term "context engineering" for this. Fill the context window with the right information, not just more information.

Context window (200k tokens): [system prompt] [documents] [conversation] [tools]

  40% full → good
  60% full → OK
  80% full → danger
  95% full → lost

More noise = worse output. Focused context = better output.
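
A crude way to keep an eye on this, using the rough four-characters-per-token estimate from the Tokens section (real tools count with the model's actual tokenizer):

CONTEXT_WINDOW = 200_000  # tokens

def estimate_utilization(messages: list[dict]) -> float:
    # ~4 characters per token is a rough English-text heuristic.
    chars = sum(len(m["content"]) for m in messages)
    return (chars / 4) / CONTEXT_WINDOW

def context_health(messages: list[dict]) -> str:
    used = estimate_utilization(messages)
    if used < 0.40:
        return f"{used:.0%} used: good"
    if used < 0.60:
        return f"{used:.0%} used: OK"
    if used < 0.80:
        return f"{used:.0%} used: danger, compact or start fresh"
    return f"{used:.0%} used: the agent is probably lost"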

Factor 7: Contact Humans with Tool Calls

Build checkpoints into prompts so the agent knows when to stop and ask.

# From implement_plan.md:

"Phase [N] Complete - Ready for Verification.
 Automated checks passed:
 - [x] Tests pass
 - [x] Lint clean

 Please perform manual verification:
 - [ ] Feature works in UI
 - [ ] No regressions

 Let me know when complete so I can proceed to Phase [N+1]."
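
One way to wire this up, sketched in Python rather than taken from the 12-Factor repo: make asking the human an ordinary tool, so the loop pauses on the model's own initiative instead of guessing.

def ask_human(question: str) -> str:
    """Block until a person answers. A real system might post to Slack or
    a review UI instead of reading stdin."""
    print(f"\n[agent needs input]\n{question}")
    return input("your answer> ")

TOOLS = {
    "ask_human": ask_human,
    # ...plus Read, Grep, Edit, and the rest of the toolset
}

# Inside the loop, the checkpoint above becomes an ordinary tool call the
# model emits when it reaches the end of a phase, e.g.:
#   {"tool": "ask_human",
#    "args": {"question": "Phase 1 complete, tests pass, lint clean. "
#                         "Please verify the feature in the UI."}}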

Factor 10: Small, Focused Agents

Instead of one big agent, create small agents that each do one specific thing.

BAD: one universal agent

  Universal Researcher, tools: ALL OF THEM
  Finds files, analyzes code, queries the database, understands patterns, synthesizes findings
  50+ steps, huge context, gets lost

GOOD: focused micro agents

  codebase-locator    Tools: Grep, Glob, LS              Job: find files
  codebase-analyzer   Tools: Read, Grep, Glob, LS        Job: explain code
  web-researcher      Tools: WebSearch, WebFetch, Read   Job: find docs

  Each: 5-10 steps, stays focused

Practical Tips

These patterns come from real prompt engineering experience. They're not in the 12 Factors.

Tip 1: Negative Instructions

Tell the agent what NOT to do. This prevents drift.

# Bad: only positive instructions
"Analyze the codebase and describe what you find."

# Good: positive + negative instructions
"Analyze the codebase and describe what you find.
 DO NOT suggest improvements.
 DO NOT perform root cause analysis.
 DO NOT critique the implementation.
 ONLY describe what exists, how it works, and how components interact."

Without negative instructions, the agent starts "helping": suggesting improvements, critiquing code, going off on tangents. With them, it stays focused. So be explicit about what you don't want.

Tip 2: Output Templates

Define exact format for consistent, parseable results.

# In codebase-analyzer.md:

## Analysis: [Component Name]

### Overview
[2-3 sentence summary]

### Entry Points
- `file.ts:45` - description of what's there

### Core Implementation
#### 1. [Step name] (`file.ts:15-32`)
- What it does
- How it connects to the next step

### Data Flow
1. Request arrives at `api/routes.ts:45`
2. Routed to `handlers/webhook.ts:12`
3. Validated at `handlers/webhook.ts:15-32`

Without a template, every response looks different. But with a template, results are predictable and can be parsed by other agents.
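
Predictable also means checkable. A small sketch of what a downstream consumer might do with the analyzer's report (section names taken from the template above):

import re

def split_sections(report: str) -> dict[str, str]:
    """Split a templated analysis report into {section title: body}."""
    sections: dict[str, str] = {}
    current = None
    for line in report.splitlines():
        heading = re.match(r"^###\s+(.*)", line)
        if heading:
            current = heading.group(1).strip()
            sections[current] = ""
        elif current is not None:
            sections[current] += line + "\n"
    return sections

REQUIRED = ["Overview", "Entry Points", "Core Implementation", "Data Flow"]

def missing_sections(report: str) -> list[str]:
    """Return the template sections the agent forgot to produce."""
    present = split_sections(report)
    return [name for name in REQUIRED if name not in present]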

Tip 3: Tool Selection Controls Capability

Limit tools to limit what the agent CAN do. This is a physical constraint, not just instructions.

Agent               Tools                        What it CAN do     What it CANNOT do
codebase-locator    Grep, Glob, LS               Find files         Read file contents
codebase-analyzer   Read, Grep, Glob, LS         Read and analyze   Run commands, edit files
web-researcher      WebSearch, WebFetch, Read    Search the web     Modify local files

If the agent doesn't have the Edit tool, it physically can't edit files. Not "please don't" but "literally impossible."
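
In code, this is capability by construction: each agent is handed only its own slice of the tool registry, so an out-of-scope call fails instead of being politely declined. The tool functions here are illustrative stubs.

ALL_TOOLS = {
    "Read":      lambda path: open(path).read(),
    "Grep":      lambda pattern, path=".": f"<grep {pattern!r} in {path}>",
    "Glob":      lambda pattern: f"<files matching {pattern!r}>",
    "LS":        lambda path=".": f"<listing of {path}>",
    "Edit":      lambda path, old, new: f"<edited {path}>",
    "WebSearch": lambda query: f"<search results for {query!r}>",
    "WebFetch":  lambda url: f"<contents of {url}>",
}

TOOLSETS = {
    "codebase-locator":  {"Grep", "Glob", "LS"},
    "codebase-analyzer": {"Read", "Grep", "Glob", "LS"},
    "web-researcher":    {"WebSearch", "WebFetch", "Read"},
}

def tools_for(agent: str) -> dict:
    return {name: fn for name, fn in ALL_TOOLS.items() if name in TOOLSETS[agent]}

# The locator has no Edit entry at all: calling it is a KeyError in the
# dispatcher, not a disobeyed instruction.
assert "Edit" not in tools_for("codebase-locator")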

Tip 4: Read Before Spawn

The orchestrator must understand context before delegating to sub-agents.

WRONG: spawn immediately

  User asks question → Agent 1, Agent 2, Agent 3 spawned right away
  Agents get vague tasks → results are unfocused
  Wasted context, poor results

RIGHT: read first, then delegate

  User asks question → READ mentioned files first
  Plan sub-tasks based on what you read
  Spawn sub-agents with specific tasks → precise tasks, focused results

Tip 5: No Open Questions

Stop and ask instead of guessing. Five seconds to clarify saves hours of rework.

# From create_plan.md:
"If you encounter open questions during planning, STOP.
 Research or ask for clarification immediately.
 Do NOT write the plan with unresolved questions."

# From implement_plan.md:
"When things don't match the plan:
 Issue in Phase [N]:
   Expected: [what the plan says]
   Found: [actual situation]
   Why this matters: [explanation]

 How should I proceed?"

Context Management Is Everything

Dex Horthy's Advanced Context Engineering for Coding Agents explains why context management is the most important skill for working with AI coding tools. The key insight: the context window is the only lever you have to affect the quality of the output.

What Eats Up Your Context

Context window filling up:

  system prompt              ~5%
  user message               ~2%
  grep results               ~15%   (searching for files)
  file contents              ~25%   (reading code)
  edit attempts              ~15%   (trial and error)
  test output + error logs   ~25%   (debugging)

87% full. The agent is lost. It forgot the original goal 40 messages ago.

Frequent Intentional Compaction

Design your entire workflow around context management. Keep utilization in the 40-60% range. Split work into roughly three phases (sometimes you skip research and go straight to planning, sometimes you do multiple research passes before you're ready):

Phase 1: Research (fresh context)
  Input: ticket / issue, codebase
  The agent searches, reads, and maps the codebase
  Output: research.md
  Human review: is the research correct? (highest leverage)

Phase 2: Plan (fresh context)
  Input: research.md, ticket / issue
  The agent creates a step-by-step implementation plan
  Output: plan.md
  Human review: is the plan sound? (high leverage)

Phase 3: Implement (fresh context)
  Input: plan.md, codebase
  The agent follows the plan phase by phase
  Output: working code
  Human review: does the code work? (standard review)

Each phase starts with a fresh context window. The output of one phase becomes a compact input for the next. This is the core idea: instead of one long messy conversation, you have three focused sessions.
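
As a sketch, the whole workflow is three independent sessions connected by files. run_session is a placeholder for however you invoke your agent (Claude Code, a script against an API, etc.); the important part is that each call starts from an empty context.

from pathlib import Path

def run_session(system_prompt: str, inputs: list[str]) -> str:
    """Placeholder: start a FRESH agent session containing only these
    inputs and return its final output document."""
    raise NotImplementedError

def research(ticket: str) -> Path:
    out = run_session("Research the codebase for this ticket.", [ticket])
    Path("research.md").write_text(out)
    return Path("research.md")      # human reviews this before planning

def plan(ticket: str, research_file: Path) -> Path:
    out = run_session("Write a step-by-step implementation plan.",
                      [ticket, research_file.read_text()])
    Path("plan.md").write_text(out)
    return Path("plan.md")          # human reviews this before implementing

def implement(plan_file: Path) -> str:
    return run_session("Implement this plan phase by phase.",
                       [plan_file.read_text()])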

Sub-Agents for Context Control

Sub-agents aren't about role-playing. They're about using a fresh context window for searching and summarizing, so the main agent stays clean.

Main agent (orchestrator), context 35% used:
  system prompt + user question + compact summaries

  Sub-agent 1 (locator):    15 grep calls, 10 glob calls (80% used)   → returns 15 lines (file locations)
  Sub-agent 2 (analyzer):   reads 8 files, traces 3 flows (70% used)  → returns 40 lines (code analysis)
  Sub-agent 3 (researcher): 5 web searches, 3 page fetches (60% used) → returns 20 lines (documentation)

All that noise stays in the sub-agent contexts. The main agent only sees the compact summaries.
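
The same idea in miniature, with run_subagent standing in for however your tool spawns a sub-agent in its own fresh context:

def run_subagent(agent_name: str, task: str) -> str:
    """Placeholder: run the named agent in its OWN fresh context and
    return only its final report (a few dozen lines at most)."""
    raise NotImplementedError

def research_question(question: str) -> list[dict]:
    context = [{"role": "user", "content": question}]

    # All the grep noise, file dumps, and fetched pages stay inside each
    # sub-agent; only the compact summaries land in the main context.
    for agent, task in [
        ("codebase-locator",      f"Find files relevant to: {question}"),
        ("codebase-analyzer",     f"Explain how this works: {question}"),
        ("web-search-researcher", f"Find external docs about: {question}"),
    ]:
        summary = run_subagent(agent, task)
        context.append({"role": "user",
                        "content": f"[{agent} report]\n{summary}"})

    return context  # the question plus three short reports, nothing else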

The .claude Prompts

The prompts I reference throughout this article are from the humanlayer/humanlayer/.claude repository. You can look at the originals to understand the full picture. They are a good example of "prompts as code" that you can version control, test, and share.

File                          What it does

Agents:
  codebase-analyzer.md        Reads and explains code
  codebase-locator.md         Finds files (no Read tool!)
  codebase-pattern-finder.md  Finds code patterns
  web-search-researcher.md    Searches the web

Commands:
  commit.md                   Simple: analyze changes, commit
  create_plan.md              Workflow: research, plan, iterate
  describe_pr.md              Simple: generate PR description
  implement_plan.md           Workflow: execute plan phase by phase
  iterate_plan.md             Workflow: update existing plans
  research_codebase.md        Orchestrator: spawn agents, synthesize

Three Types of Prompts

Not all prompts are the same. Here's how they differ (see commit.md, implement_plan.md, research_codebase.md):

Simple prompt (commit.md):
  User: /commit → git status + diff → analyze changes → present plan →
  Human: "looks good" → execute commits
  One task. Linear. One human checkpoint.

Workflow prompt (implement_plan.md):
  User: /implement plan.md → read plan → execute Phase 1 → Human verifies
  Phase 1 → execute Phase 2 → Human verifies Phase 2 → ...
  Sequential. Multiple gates. Persistent state (the plan file).

Orchestrator prompt (research_codebase.md):
  User: /research "auth?" → READ mentioned files → spawn in parallel
  (locator finds files, analyzer explains code, researcher finds docs) →
  WAIT for all agents → synthesize results
  Delegates. Parallel. Synthesis.

Type           Who does the work        Sub-agents       Human interaction
Simple         Agent directly           None             Confirm, then execute
Workflow       Agent, phase by phase    Optional         Gates between phases
Orchestrator   Sub-agents               Core mechanism   Minimal (review synthesis)

Rule of thumb: Start simple. Add workflow when you need human checkpoints between phases. Add orchestrator when you need parallel research.

Writing Good Prompts for Existing Agents

You can't just tell the agent "use ticket NUMBER-123 and research." That's too vague. The agent won't know what to look for, what's important, or when to stop.

Bad vs. Good Prompts

BAD:  "Research ticket ENG-1234"
GOOD: "Research the payment processing flow. Focus on Stripe webhook handling.
       I need to understand how payment status gets updated in the database.
       Relevant code: src/services/payments/"

BAD:  "Fix the bug"
GOOD: "/create_plan eng_1234.md
       Think about the migration strategy. We cannot have downtime.
       Look at how we handled PR #456."

BAD:  "Implement the feature"
GOOD: "/implement plan.md
       Start with Phase 1 only. Run tests after each change.
       If something doesn't match the plan, stop."

The Pattern

Every good prompt to an existing agent follows this structure:

1. SCOPE: what exactly to work on
   "Research the payment processing flow in our codebase"

2. FOCUS: where to look
   "Relevant code is probably in src/services/payments/"

3. CONTEXT: what matters and why
   "We need to understand this because we are migrating to Stripe v3"

4. BOUNDARIES: when to stop or ask
   "If you find more than 3 services involved, stop and tell me"

The prompts in .claude/commands/ already have good structure built in (negative instructions, output templates, step-by-step strategies, human checkpoints). Your job is to give them specific context to work with, not vague directions.

Anatomy of a Well-Written Agent Prompt

Here's what makes the prompts in .claude/agents/ effective. Using codebase-analyzer.md as an example:

---
name: codebase-analyzer
tools: Read, Grep, Glob, LS            # Limited tools = limited scope
model: sonnet                           # Cheaper model for focused tasks
---
# Role (one sentence)
"You are a specialist at understanding HOW code works."

# Negative instructions (prevent drift)
"DO NOT suggest improvements"
"DO NOT critique the implementation"
"ONLY describe what exists"

# Step-by-step strategy (how to do the job)
Step 1: Read Entry Points
Step 2: Follow the Code Path
Step 3: Document Key Logic

# Output template (consistent format)
## Analysis: [Name]
### Overview
### Entry Points
  - `file:line` - description
### Core Implementation
### Data Flow

# Closing reminder
"REMEMBER: You are a documentarian, not a critic."

This structure works because each part prevents a specific failure mode:

  • Limited tools prevent the agent from doing things outside its scope
  • Negative instructions prevent it from drifting into "helpful" suggestions
  • Step-by-step strategy prevents random, inconsistent analysis
  • Output template prevents unparseable responses
  • Closing reminder reinforces the constraints (LLMs pay attention to the end of prompts)

What AI Makes Possible

Some things are just hard to do without AI tools:

  • Navigating unfamiliar codebases. Dex Horthy shipped a solo bug fix to BAML, a 300k LOC Rust codebase he'd never touched. Then he and Vaibhav paired for 7 hours and shipped 35k LOC (cancellation support + WASM compilation). The BAML team estimated each of those at 3-5 days for a senior engineer.
  • Parallel research. You can spawn multiple focused agents to investigate different parts of the codebase at the same time. One finds files, another analyzes code, another checks the database schema. The orchestrator synthesizes everything.
  • Predictable implementation. Once you have a good plan, the implementation phase is straightforward. The agent follows the spec, and the code style matches your existing codebase because the agent read it first. Consistent code generation without the usual drift.
  • Onboarding. An intern at HumanLayer shipped 2 PRs on his first day and 10 on his 8th day. Research prompts let new team members get up to speed fast.
  • Mental alignment. Instead of reading 2000 lines of code in a PR, you read 200 lines of a well-written implementation plan. You know what's being built and why.

These are real benefits. They don't make you 10x faster at everything. But they make some previously painful tasks easier.

Credits and References

This work is heavily based on and inspired by other people's work. I want to give proper credit.

12-Factor Agents by Dex Horthy (HumanLayer, YC24). The foundation for understanding how to build reliable AI agents. This article adapts 5 of the 12 factors for prompt engineering use. The original content is licensed under CC BY-SA 4.0.

Advanced Context Engineering for Coding Agents by Dex Horthy. The article about frequent intentional compaction and the research/plan/implement workflow.

We Mourn Our Craft by Nolan Lawson. An honest and emotional piece about accepting the AI shift in software development.

Vibe Coding Our Way to Disaster by Jake Nations. About the risks of unstructured AI coding, based on Rich Hickey's ideas about simplicity vs. ease.

Context Engineering — term popularized by Andrej Karpathy for the art of providing all the context needed for a task to be plausibly solvable by an LLM.

The .claude prompts referenced in this article are from humanlayer/humanlayer/.claude, created by Dex Horthy for use with Claude Code inside the CodeLayer IDE.

Specs Are the New Code by Sean Grove. The idea that specifications will become the real source code.

Stanford Study on AI's Impact on Developer Productivity — research showing that AI tools sometimes reduce productivity in established codebases.


Content licensed under CC BY-SA 4.0. Code examples licensed under Apache 2.0.