How AI Agents Work
Before diving into tools and workflows, it helps to understand the three building blocks that make all of this possible. Every tool on this page—Claude Code, Codex, Ralph Loop, agent teams—is built from these same primitives.
🧠 The LLM
A large language model (LLM) takes text in and produces text out. That's it. It has no memory between calls, no access to your filesystem, and no ability to run code. On its own, it's a very sophisticated autocomplete. Everything else is built on top of this.
🔧 Tool Calls
To make an LLM useful, you give it tools—structured actions it can request. Instead of just outputting text, the model can output: "I want to read src/app.ts" or "run npm test". The system executes that action and feeds the result back. This is called function calling or tool use.
🔄 The Agent Loop
An agent is an LLM running in a loop: think → tool call → observe result → think → tool call → ... until the task is done. This is what separates an agent from a chatbot. A chatbot answers once. An agent keeps working—reading files, running tests, fixing errors—until it reaches the goal or gets stuck.
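The loop is small enough to sketch end to end. Everything below (the stub model, the tool name, the `done:` convention) is illustrative, not any vendor's real API:

```python
# Minimal agent loop sketch: think -> tool call -> observe, repeated.
# `fake_model` stands in for a real LLM call; the filesystem is stubbed.
from typing import Callable

def read_file(path: str) -> str:
    # Stubbed filesystem for the example.
    return {"src/app.ts": "export const x = 1"}.get(path, "<not found>")

TOOLS: dict[str, Callable[[str], str]] = {"read_file": read_file}

def agent_loop(model: Callable[[list[str]], str], goal: str, max_steps: int = 10) -> str:
    history = [goal]                      # the growing conversation
    for _ in range(max_steps):
        action = model(history)           # think: model decides the next step
        if action.startswith("done:"):    # model signals completion
            return action.removeprefix("done:").strip()
        tool, _, arg = action.partition(" ")
        result = TOOLS[tool](arg)         # act: execute the requested tool
        history.append(result)            # observe: feed the result back
    return "gave up"

# A scripted "model": first read the file, then finish.
def fake_model(history: list[str]) -> str:
    return "read_file src/app.ts" if len(history) == 1 else "done: file contains x = 1"
```

The `max_steps` cap is the simplest version of "or gets stuck": a real harness also watches for repeated failures and asks the human for help.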
The Agent Loop Visualized
Reason
LLM reads the conversation so far and decides what to do next.
Tool Call
LLM requests an action: read file, write file, run command, search code.
Observe
The result of the action is fed back into the conversation as context.
Repeat
Back to Reason. The loop continues until the task is complete or the agent asks for help.
RAG: Retrieval-Augmented Generation
LLMs only know what was in their training data. RAG is the pattern of retrieving relevant information first, then generating a response grounded in that information. When Claude Code searches your codebase, reads your files, or looks up documentation—that's RAG. It's why the agent can work with code it has never seen before: it reads your code first, then reasons about it.
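The retrieval step can be sketched with naive keyword overlap. Real systems typically use embeddings and a vector index (or, as with Claude Code, plain file search); the corpus below is invented for the example:

```python
# Toy RAG: score documents by keyword overlap with the query, then build
# a prompt grounded in the top matches.
def retrieve(query: str, docs: dict[str, str], k: int = 2) -> list[str]:
    q = set(query.lower().split())
    scored = sorted(docs.items(), key=lambda kv: -len(q & set(kv[1].lower().split())))
    return [name for name, _ in scored[:k]]

def build_prompt(query: str, docs: dict[str, str]) -> str:
    context = "\n".join(f"--- {n} ---\n{docs[n]}" for n in retrieve(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

corpus = {
    "auth.md": "login uses session cookies and a refresh token",
    "billing.md": "invoices are generated monthly by a cron job",
}
```

Because the retrieved text lands in the prompt, the model reasons over your actual files rather than its training data.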
Every tool on this page is a different way of orchestrating the same agent loop. Claude Code is one agent looping. Agent teams are multiple agents looping in parallel. Ralph Loop is a bash script that restarts the agent loop when it finishes, pointed at the next task. The differences are in orchestration, not in kind. Once you understand think → act → observe, you understand all of them.
Foundation: CLAUDE.md & AGENTS.md
Both Claude Code and Codex rely on instruction files that act as persistent memory. These files are the single most important thing to get right—they compound knowledge across every session and every team member.
📋 CLAUDE.md
Claude Code uses a layered memory hierarchy: managed policy → user memory (~/.claude/CLAUDE.md) → project memory (./CLAUDE.md or .claude/CLAUDE.md) → modular rules (.claude/rules/*.md) → local memory (CLAUDE.local.md, gitignored) → auto memory. Use @path imports to pull in external files without bloating the root file. The entire team should contribute—every time Claude makes a mistake, add a rule so it never happens again.
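A sketch of what a project-level file might look like (the rules and paths are invented; the `@path` import lines are the mechanism described above):

```markdown
# CLAUDE.md (project memory, checked into git)

- Run `pnpm test` before claiming a task is done.
- Never edit generated files under `dist/`.

@docs/architecture.md
@.claude/rules/commit-style.md
```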
📜 AGENTS.md
AGENTS.md is an open standard under the Linux Foundation's Agentic AI Foundation, meaning it works across tools—not just Codex. Codex reads it using a layered discovery system: global (~/.codex/AGENTS.md), then project root to current directory. Files concatenate root-to-current, with closer files overriding earlier guidance. Supports AGENTS.override.md for temporary changes.
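The root-to-current concatenation order can be sketched as a pure function. The lookup rules below are an assumption based on the layered behavior described above (real discovery reads the filesystem and also includes `~/.codex/AGENTS.md`):

```python
# Sketch of layered AGENTS.md discovery: candidate files from the project
# root down to the current directory, in apply order, so files closer to
# cwd come last and can override earlier guidance.
def agents_md_chain(root: str, cwd: str, existing: set[str]) -> list[str]:
    """Paths of AGENTS.md files that apply in `cwd`, root first."""
    parts = [] if cwd == root else cwd[len(root):].strip("/").split("/")
    dirs = [root] + ["/".join([root] + parts[: i + 1]) for i in range(len(parts))]
    return [p for p in (d + "/AGENTS.md" for d in dirs) if p in existing]

def combined_instructions(root: str, cwd: str, files: dict[str, str]) -> str:
    return "\n".join(files[p] for p in agents_md_chain(root, cwd, set(files)))
```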
Key Differences
| Aspect | Claude Code | Codex |
|---|---|---|
| Instruction file | CLAUDE.md | AGENTS.md |
| Scope | Managed policy → user → project → .claude/rules/ → local | Global + layered per-directory |
| Override mechanism | CLAUDE.local.md + .claude/rules/*.md + @path imports | AGENTS.override.md files |
| Personal/local config | CLAUDE.local.md (gitignored) | AGENTS.override.md |
| Size limit | ~2.5k tokens recommended | 32 KiB default (configurable) |
| Team sharing | Checked into git | Checked into git |
Both files serve the same purpose: preventing repeated mistakes and encoding team knowledge. Maintain both if you use both tools. Keep them concise—treat them like code, not documentation. Every rule should earn its place.
Claude Code Mastery
Claude Code is an agentic coding tool that runs in your terminal. It can read files, execute commands, write code, and create pull requests. The key to using it well is understanding its core workflow patterns.
The Inner Loop
Plan Mode
Shift+Tab twice. Iterate on the approach before any code is written.
Implement
Switch to auto-accept. Claude executes the plan, typically in one shot.
Verify
Run tests, lint, build. Give Claude a feedback loop to self-correct.
Ship
Use /commit-push-pr or similar slash command to create the PR.
Core Concepts
⚡ Skills
Reusable prompts that automate repeated workflows—/commit-push-pr, /verify-app, /code-simplifier. Defined in .claude/skills/<name>/SKILL.md (legacy .claude/commands/ still works). Inline bash pre-computes context to avoid wasted model calls.
🤖 Subagents
Lightweight child agents that run focused tasks within your session. Use them for code simplification, build validation, architecture review. Results return to your main context. Lower token cost than agent teams.
🔒 Permissions
Use /permissions to pre-allow safe commands instead of --dangerously-skip-permissions. Share via .claude/settings.json in git so the whole team has consistent behavior.
🔌 MCP Servers
Connect Claude to external tools (Slack, databases, Sentry) via Model Context Protocol. Config lives in .mcp.json, checked into git for team consistency.
Hooks: Automated Guardrails
🛡 Pre/Post Tool Hooks
Shell commands triggered on PreToolUse and PostToolUse events. Auto-format after edits, validate parameters, enforce rules on every tool call. Hooks communicate via stdout/stderr and exit codes only—they can't trigger slash commands or tool calls directly.
🛑 Quality Gates
Use Stop and UserPromptSubmit hooks to block PR creation if tests fail, enforce no secrets in diffs, run linters before commits.
👥 Team Hooks
TeammateIdle and TaskCompleted hooks prevent agent team members from going idle or marking tasks done without passing checks. Exit code 2 blocks the action and feeds the error message (stderr) back to Claude as feedback. The programmatic equivalent of Mitchell's "engineer the harness."
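A hook is just a script that reads the event from stdin and signals its verdict through the exit code. The sketch below blocks dangerous Bash commands; the event field names (`tool_input`, `command`) are assumptions for illustration, so check the hooks documentation for the real schema:

```python
# Sketch of a PreToolUse hook: exit code 0 allows the tool call, exit
# code 2 blocks it and feeds stderr back to Claude as feedback.
import json, sys

BLOCKED = ("rm -rf", "git push --force")

def verdict(event: dict) -> tuple[int, str]:
    """Return (exit_code, stderr_message) for a tool-use event."""
    cmd = event.get("tool_input", {}).get("command", "")
    for bad in BLOCKED:
        if bad in cmd:
            return 2, f"Blocked: {bad!r} is not allowed here. Propose a safer command."
    return 0, ""

# A real hook script would end with:
#   code, msg = verdict(json.load(sys.stdin))
#   print(msg, file=sys.stderr); sys.exit(code)
```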
Parallel Sessions with Git Worktrees
Run multiple Claude Code instances simultaneously, each in its own git worktree to avoid file conflicts. This is the foundation of high-throughput development.
git worktree add -b feat/auth ../project-auth
git worktree add -b feat/dashboard ../project-dashboard
git worktree add -b fix/perf ../project-perf
# Launch Claude Code in each (separate tmux panes)
tmux new-session -s auth -c ../project-auth
tmux new-window -t auth -c ../project-dashboard
tmux new-window -t auth -c ../project-perf
Codex as Reviewer & Parallel Worker
OpenAI Codex CLI runs tasks locally in sandboxed containers with the full repository pre-loaded. This execution model makes it ideal for code review, quality assurance, and parallel task queues.
📦 Task Queue
Queue multiple tasks that run independently in sandboxed environments. Each one has the repo pre-loaded, can run tests, and presents a PR when done. This is Codex's killer feature for throughput.
🔍 Review Mode
Point Codex at a PR and ask it to review for security, performance, or correctness. Its sandboxed environment means it can actually run the code to verify claims, not just read it.
⚙ GitHub Action
Codex provides openai/codex-action@v1 to automatically review PRs and post feedback. This makes the cross-review pipeline fully automated—every PR gets agent-powered review without manual triggering.
AGENTS.md Configuration
# ~/.codex/AGENTS.md - Global
Always run tests before creating a PR.
Follow conventional commits for messages.
Never modify files outside the src/ directory.
# project/AGENTS.md - Project-specific
This is a TypeScript project using Next.js 15.
Run `pnpm test` for tests, `pnpm lint` for linting.
Database migrations go in prisma/migrations/.
The Cross-Review Workflow
The most powerful pattern: use Claude Code for implementation and Codex for independent review. This creates a two-model adversarial check that catches errors neither tool would find alone.
Claude Code
Implements the feature, runs tests, creates a PR.
Codex Reviews
Reviews the PR in a sandbox. Runs tests, checks quality, leaves feedback.
Claude Fixes
Addresses review feedback, updates the PR.
Human Review
Final approval with high confidence from dual-model QA.
Why This Works
- Different model architectures catch different classes of bugs. What one model overlooks, the other often finds.
- Sandboxed execution in Codex means the reviewer can actually run the code, not just read diffs.
- Automated feedback loop—Claude Code can pick up Codex's review comments and fix them without human intervention for routine issues.
- Human review becomes final verification rather than first-pass bug hunting, dramatically improving efficiency.
Set up a GitHub Action that triggers Codex review on every PR created by Claude Code. This makes the cross-review pipeline fully automated.
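A minimal workflow for that trigger might look like the following. The action name comes from the section above, but the input names are placeholders, so verify them against the action's README before use:

```yaml
# .github/workflows/codex-review.yml (sketch; input names are assumptions)
name: codex-review
on:
  pull_request:
jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: openai/codex-action@v1
        with:
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
          prompt: "Review this PR for correctness, security, and performance."
```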
Claude Code Agent Teams
Agent teams coordinate multiple Claude Code instances working together. One session is the team lead, spawning teammates that work independently with their own context windows, communicating via a shared task list and mailbox.
Agent teams are experimental. Enable via CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1 in your environment or settings.
"Agent teams are most effective for tasks where parallel exploration adds real value."
— Claude Code Documentation
Best Use Cases
🔎 Research & Review
Multiple teammates investigate different aspects simultaneously, then share and challenge each other's findings.
✨ New Features
Teammates each own a separate module or layer without stepping on each other's work.
🐛 Debugging
Test competing hypotheses in parallel. Teammates actively try to disprove each other's theories.
📚 Cross-Layer
Frontend, backend, and test changes each owned by a different teammate, coordinating via the shared task list.
Agent Teams vs. Subagents
| Aspect | Subagents | Agent Teams |
|---|---|---|
| Communication | Report back to main agent only | Message each other directly |
| Context | Results summarized back | Fully independent context windows |
| Token cost | Lower | Higher (scales with team size) |
| Best for | Focused tasks, quick results | Complex work needing collaboration |
| Display mode | Within main session | tmux split panes or in-process |
Delegate Mode
🎯 Delegate Mode
Lead focuses purely on coordination—spawning teammates, sending messages, managing the task list—without implementing tasks itself. This prevents the lead from burning context on implementation details. Toggle with Shift+Tab.
If you use tmux, set "teammateMode": "tmux" in your settings. Each teammate gets its own pane. Use tmux -CC in iTerm2 for the best experience, or standard tmux on Linux.
Why Specs Beat Prompts
Agents are autonomous executors that need a clear target. A prompt is a one-shot instruction that gets lost in context; a spec is a verifiable contract between you and the agent. It tells the agent what "done" looks like, gives it criteria to check its own work, and enables the outer loop (Ralph) to run unattended overnight. Writing the spec first is the single highest-leverage thing you can do before starting any agent-driven work.
✅ Testable Requirements
Every line in the spec should be verifiable—include test commands, expected behaviors, edge cases. If the agent can't check its own work, it can't self-correct.
📐 Right-Sized Stories
Break features into stories that fit in one context window. One story per iteration keeps the agent focused and prevents context exhaustion. Too large = wasted tokens. Too small = overhead.
🎯 What, Not How
Describe acceptance criteria, not implementation steps. Let the agent choose the approach. Over-specifying implementation constrains the agent and often produces worse results.
The spec IS the prompt. A well-written PRD replaces long, fragile prompt chains. Write the spec, point the agent at it, and let the loop run. Formats (PRD.md, prd.json) are covered in the Ralph Loop section.
Ralph Loop & PRD-Driven Development
Ralph Loop is a community-built bash script that runs Claude Code repeatedly until all requirements in a PRD are complete. It is not an Anthropic product—it's an open-source tool you run in your own terminal. Named after the "Ralph Wiggum" technique, it's designed for unattended operation—perfect for overnight feature development.
What It Really Is
Strip away the name and Ralph Loop is just a bash while loop. It runs Claude Code, waits for it to finish, checks if there's more work in the PRD, and starts Claude Code again. That's the entire trick.
while ! all_stories_pass; do
claude "Read PRD, implement next story, run tests"
done
The problem it solves: Claude Code's agent loop (think → act → observe) runs inside a single session with a finite context window. For large features, one session isn't enough—the context fills up, or the model finishes one story and stops. Ralph Loop solves this by giving the agent loop an outer loop. Each session gets a fresh context window, picks up where the last one left off by reading the PRD, and works on the next incomplete item. It turns a single-session tool into an overnight assembly line.
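That outer loop can be sketched as a small driver over a prd.json-style story list. The `run_agent` callable is a stand-in for launching Claude Code (e.g. via subprocess); the schema here is a simplified assumption:

```python
# Sketch of the Ralph outer loop: each pass is a fresh agent session,
# pointed at the first incomplete story, until every story passes.
from typing import Callable, Optional

def next_story(prd: dict) -> Optional[dict]:
    return next((s for s in prd["stories"] if not s.get("passes")), None)

def ralph(prd: dict, run_agent: Callable[[str], bool], max_iters: int = 50) -> dict:
    for _ in range(max_iters):
        story = next_story(prd)
        if story is None:   # all stories pass: loop exits, PR is ready
            break
        # Fresh session, fresh context window, one story per iteration.
        story["passes"] = run_agent(f"Read the PRD, implement and verify: {story['title']}")
    return prd

prd = {"stories": [{"title": "profile page", "passes": False},
                   {"title": "avatar upload", "passes": False}]}
```

The real script adds the details that matter in practice: running inside tmux, committing after each story, and writing a BLOCKED.md when a story cannot be completed.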
How It Works
Write PRD
Define requirements in PRD.md or prd.json with checkable items.
Start Loop
The bash script begins autonomous iteration.
Iterate
Claude reads PRD, implements next item, runs tests, marks done.
Complete
Loop exits when all items pass. PR ready for review.
PRD Formats
📝 PRD.md (Simple)
Markdown with checkboxes. Best for small features. Claude checks off items as it completes them. Each major item gets a commit.
- [ ] Add profile page route
- [ ] Create profile form component
- [ ] Add avatar upload
- [ ] Write tests
📊 prd.json (Structured)
JSON with stories. Best for large features. One story per iteration keeps context windows clean. Each story gets its own commit.
{ "stories": [
    { "title": "...",
      "passes": false }
  ]
}
The Overnight Workflow
- Before bed: Write a detailed PRD with clear, testable requirements. Include verification commands.
- Start Ralph Loop: Run the ralph script in a tmux session so it survives terminal disconnect.
- Morning review: Check the git log. Each completed story is a separate commit. Review the PR, run final QA.
- If blocked: Ralph creates a BLOCKED.md documenting what it couldn't resolve. Pick up from there.
"End-of-day agents handling research, exploration, and triage during low-energy periods."
— Mitchell Hashimoto, on using AI agents during off-hours
Boris Cherny's Workflow
Boris Cherny created Claude Code and uses it to ship ~100 PRs per week. His setup is surprisingly vanilla—no exotic hacks, just disciplined application of fundamentals at scale.
💻 Parallel Sessions
5 in terminal (numbered, with OS notifications), 5-10 on claude.ai, plus mobile sessions started in the morning. Each uses its own git checkout to avoid conflicts. Expects 10-20% abandonment rate.
🎯 Opus for Everything
Uses Opus with thinking exclusively. "Even though it's bigger & slower than Sonnet, since you have to steer it less, it is almost always faster in the end."
🗺 Plan First
Starts most sessions in Plan Mode (Shift+Tab twice). Iterates on the plan, then switches to auto-accept. "A good plan is really important!"
✅ Verify Everything
"The most important thing: give Claude a way to verify its work. This feedback loop 2-3x the quality of the final result."
Key Practices
- CLAUDE.md as institutional memory: The team updates it multiple times weekly. During code review, Boris tags @.claude to add learnings. Their CLAUDE.md is ~2.5k tokens.
- Skills for every repeated workflow: `/commit-push-pr` is used dozens of times daily. Skills live in `.claude/skills/<name>/SKILL.md` (legacy `.claude/commands/` still works) with inline bash for context.
- Strategic permissions: Use `/permissions` to pre-allow safe commands. Share settings in `.claude/settings.json` via git.
- MCP for external tools: Slack, BigQuery, Sentry connected via MCP servers. Config in `.mcp.json`, checked into git.
- Dedicated subagents: code-simplifier, verify-app, build-validator, code-architect, each focused on one job.
Mitchell Hashimoto's Framework
The co-founder of HashiCorp describes a deliberate six-step evolution from chatbot usage to continuous agent operation. The key insight: you must push through the inefficiency phase to reach transformation.
The Six Steps
- Abandon chatbots for agents. Conversational interfaces have limited utility. Agents that read files, execute programs, and loop are transformative.
- Reproduce work agentic-ally. Do work twice intentionally—once manually, once with the agent—to build expertise and calibrate trust.
- End-of-day agents. Use low-energy periods for agent-driven research, exploration, and triage. Check results the next morning.
- Delegate high-confidence tasks. Keep engaging deep work for yourself. Delegate well-defined, verifiable tasks to agents.
- Engineer the harness. Create instruction files (AGENTS.md or equivalent) and programmed tools. Document mistakes systematically to prevent recurrence.
- Continuous agent operation. Always have at least one agent running in the background. Target 10-20% of workday with agents active. While you focus on one task, an agent researches the next, reviews previous work, or explores alternatives. Disable notifications to control context-switching.
“Break down sessions into separate clear, actionable tasks. Don’t try to ‘draw the owl’ in one mega session.”
— Mitchell Hashimoto
Practical Takeaways
🎯 Task Isolation
One task per session. Don't overload context windows with unrelated work. Clear, focused prompts produce better results.
🧪 Verification
Give agents a way to check their own work. Tests, linters, type checkers—anything that provides automated feedback.
📓 Document Mistakes
When an agent makes a mistake, add it to CLAUDE.md/AGENTS.md. This compounds over time into increasingly reliable behavior.
♾ Always Have an Agent Running
Mitchell's principle: you should always have at least one agent working in the background. While you focus on one task, an agent researches the next, reviews previous work, or explores alternatives. Dead time is wasted compute.
Peter Steinberger's Workflow
Peter Steinberger ships solo at extreme velocity. His workflow is Codex-first and minimal-ceremony—conversations replace plan mode, commits go straight to main, and issue trackers didn't stick. He juggles 3–8 projects simultaneously, uses image-based prompting for UI iteration, and built a custom Oracle tool for when agents get stuck.
🧠 Codex-First
Uses GPT 5.2-codex with “high” reasoning. It silently reads files for 10–15 min before writing. 4x slower than Opus per task but needs fewer iterations, so net faster. Better for large features and refactors; Opus still preferred for smaller edits and his AI agent.
⌨ CLI Everything
Builds CLIs first, UIs second—agents can call CLIs directly and verify output. “Almost all MCPs really should be CLIs.” Custom tools for food delivery, tweets, email, cameras, home automation, music, bed temperature—all agent-accessible.
🖼 Image Prompting
“At least 50% of my prompts contain a screenshot.” Just a few words plus a screenshot is enough. Annotation improves results but isn’t required. His prompts have gotten much shorter over time—brief, image-supplemented requests beat lengthy descriptions.
🔮 Oracle
Custom CLI that gives agents access to GPT 5 Pro when they’re stuck. A “massive unlock” for research—agents can search across ~50 websites in one run. Used multiple times daily at first, now a few times per week as GPT 5.2 handles more independently.
📋 Task Queueing
Uses Codex’s built-in queue to pipeline tasks—“as I get a new idea, I add it to the pipeline.” He’s “usually the bottleneck,” not the model. Avoids multi-agent orchestration in favor of a simple, iterative approach.
Key Practices
- Just talk to it: “Don’t waste your time on stuff like RAG, subagents, Agents 2.0 or other things that are mostly just charade.” Instead: talk to the model, play with it, develop intuition. The more you work with agents, the better your results. Short prompts (1–2 sentences + an image) beat elaborate prompt engineering.
- Conversations replace planning: Calls plan mode “a hack that was necessary for older generations of models.” Instead, starts a conversation—asks a question, lets the model search the web, explore code, and create a plan together. When satisfied, writes “build” to trigger implementation. Commits to main directly for linear git history. Only works solo—“if you work in a bigger team that workflow obv won't fly.”
- Docs folder pattern: Each project maintains `docs/*.md` files. The model chooses filenames. Optimized for agent navigation, not human browsing. Markdown-first structure.
- Cross-project reference: Tells the agent “look at `../vibetunnel` and do the same” to reuse patterns directly via filesystem access instead of re-explaining.
- Skills for domain automation: Domain registration, DNS management, Tailscale network commands, remote Mac terminal access. Reduces context explanation on repeated tasks.
- Multi-machine setup: MacBook Pro + Mac Studio via Jump Desktop. Long-running tasks continue on Studio. Git sync for cross-machine edits. Travel flexibility with background processing.
- Opus for creative work: Despite being Codex-first for coding, uses Claude Opus for his Clawdis AI agent and general tasks. “My AI agent wouldn’t be half as fun running on GPT 5.” Picks the best tool per job, not one tool for everything.
- Ad-hoc refactoring: No more dedicated refactoring days. “Whenever prompts start taking too long or I see sth ugly flying by in the code stream, I’ll deal with it right away.” The agent’s output stream becomes the trigger for cleanup.
- Compaction as review: Token compaction (context summarization) is reframed as a feature. “Often acts like a review, and the model will find bugs when it looks at code again.” Tasks run across many compactions successfully.
- No orchestration layers, no RAG: Rejects agent orchestrators (Conductor, Terragon, Sculptor): “thin wrappers around Anthropic’s SDK + work tree management. There’s no moat.” Also dismisses RAG—modern models search well enough without a separate vector index. Prefers separate terminal windows for visibility and control over subagent abstractions.
- Blast radius thinking: Before every change, estimates how many files it will touch and how long it will take. Multiple small, atomic changes beat one massive refactor—easier to recover from errors and maintain clean git history.
“The amount of software I can create is now mostly limited by inference time and hard thinking.”
— Peter Steinberger
“Most software does not require hard thinking. Most apps shove data from one form to another, maybe store it somewhere, and then show it to the user.”
— Peter Steinberger
Codex Config
model = "gpt-5.2-codex"
model_reasoning_effort = "high"
tool_output_token_limit = 25000
model_auto_compact_token_limit = 233000
[features]
unified_exec = true
skills = true
web_search_request = true
Boris plans first, uses Opus exclusively, and runs parallel sessions with worktrees. Steinberger skips planning, uses Codex primarily, and commits to main linearly. Both ship at extreme velocity—the takeaway is that the tools matter less than having a consistent, disciplined workflow that plays to each tool's strengths.
The Unified Workflow
Here is the optimized workflow combining all tools and techniques for maximum efficiency and quality.
Daytime: Active Development
Design
Use Plan Mode to define the approach. Use agent teams for research if the problem is complex.
Implement
Claude Code in parallel worktrees via tmux. One feature per session.
Verify
Run /verify. Give Claude the feedback loop. Fix issues in the same session.
PR + Review
Claude creates PR. Codex auto-reviews. Fix feedback. Human approves.
While working through the Plan → Build → Test → Ship cycle, always keep a background agent running on the next task, reviewing the last PR, or researching an upcoming feature. The goal: zero idle compute.
Overnight: Autonomous Development
Write Spec
Create PRD.md or prd.json with clear, testable requirements and verification steps.
Start Loop
Ralph script in tmux. Claude iterates through stories autonomously.
Morning QA
Review git log, check completed stories. Codex reviews the overnight PR.
Polish
Address any review feedback. Merge and deploy.
The Complete Stack
| Tool | Role | When |
|---|---|---|
| Claude Code | Primary implementation, planning, shipping | Active development hours |
| Codex | Code review, parallel task queues, QA | After every PR, parallel tasks |
| Ralph Loop | Autonomous iteration through PRD stories | Overnight, long-running features |
| Agent Teams | Parallel research, multi-module features | Complex tasks needing coordination |
| tmux + Worktrees | Session management, parallel isolation | Always (infrastructure layer) |
| CLAUDE.md / AGENTS.md | Institutional memory, mistake prevention | Always (knowledge layer) |
Verification is everything. Every workflow above depends on giving the AI a way to check its own work. Tests, linters, type checkers, build scripts—these aren't just quality tools, they're the feedback loop that makes autonomous development possible. Invest in your test suite and CI pipeline before anything else.
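The verification feedback loop reduces to a few lines: run the check, and on failure hand the output back to the agent as its next prompt. The check command below runs Python itself so the sketch stays self-contained; in practice it would be your test, lint, or build command:

```python
# Sketch of a verification step that turns failures into agent feedback.
import subprocess, sys

def run_check(cmd: list[str]) -> tuple[bool, str]:
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def verify_then_feedback(cmd: list[str]) -> str:
    ok, output = run_check(cmd)
    if ok:
        return "PASS"
    # In a real loop, this string becomes the agent's next prompt.
    return f"Tests failed, fix and retry:\n{output}"
```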
Sources
This analysis is based on official documentation, creator insights, and practitioner experience.