Claude Code vs. Devin vs. OpenAI Agents: The Head-to-Head

Every agent comparison post you've read online follows the same pattern: list the features, quote the benchmarks, declare a winner. This one doesn't. We ran the same five real-world tasks through Claude Code, Devin, and a custom agent built on OpenAI's Agents SDK, then tracked what each one actually produced, what it cost, and how much hand-holding each required to get to a correct result. No benchmarks. No synthetic tasks. Just work — the kind that crosses a developer's desk on a normal week.

The unsurprising conclusion: there's no single winner. The surprising part is how clearly each tool carves out its own territory, and how rarely those territories overlap.

The Test Battery

Five tasks, each representing a different type of development work.

Task 1: Bug fix from issue. A GitHub issue describing a race condition in a Node.js WebSocket server. The bug report includes reproduction steps but no code pointers. The agent needs to find the relevant code, identify the bug, and produce a fix.

Task 2: Feature implementation. Add pagination to an existing REST API endpoint. The codebase has an established pattern for pagination on other endpoints, so the agent needs to find that pattern and replicate it. No ambiguity in the spec — standard offset/limit pagination with total count in the response.

Task 3: Test suite creation. Write unit tests for an existing authentication module with no tests. The module handles JWT token generation, validation, refresh, and revocation. The tests need to cover happy paths, error cases, and edge cases.

Task 4: Refactor. Extract a 400-line utility file into three focused modules, update all imports across the codebase, and ensure all existing tests still pass.

Task 5: Multi-file codebase question. "How does the billing system calculate prorated charges when a user upgrades mid-cycle?" No code changes — just an accurate, detailed explanation tracing through the relevant files.

Each task was run against a medium-sized codebase (roughly 50,000 lines across 200 files, a real project, not a toy). Each agent was given the same starting context: the repo and a one-sentence task description matching what you'd type into the tool naturally.

Claude Code Results

Claude Code's advantage showed up immediately and consistently: it reads codebases better than anything else in this test. For every task, its first step was methodical — grep for relevant terms, read the files it found, trace through the code paths, and build a mental model before making changes. The codebase navigation was fast, accurate, and thorough.

Task 1 (Bug fix): Claude Code found the race condition in about 90 seconds of file reading and produced a correct fix using a mutex pattern consistent with the rest of the codebase. It ran the existing tests to verify the fix didn't break anything. Total time: 4 minutes. Supervision required: none. The fix was correct on the first attempt.
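For readers unfamiliar with the pattern, an async mutex in Node.js serializes critical sections by chaining them on a promise, so concurrent handlers can't interleave a read-modify-write. This is a minimal sketch of the idea, not the project's actual mutex utility:

```typescript
// Minimal async mutex: each caller's critical section waits for the
// previous one to finish before running.
class Mutex {
  private tail: Promise<void> = Promise.resolve();

  runExclusive<T>(fn: () => Promise<T>): Promise<T> {
    const result = this.tail.then(fn);
    // Keep the chain alive even if fn rejects.
    this.tail = result.then(() => undefined, () => undefined);
    return result;
  }
}

// Guarding a shared counter that concurrent handlers update with a
// read-modify-write that would race without the mutex.
const mutex = new Mutex();
let connections = 0;

async function onConnect(): Promise<void> {
  await mutex.runExclusive(async () => {
    const current = connections;
    await new Promise((r) => setTimeout(r, 10)); // simulated async gap
    connections = current + 1;
  });
}

async function main() {
  await Promise.all([onConnect(), onConnect(), onConnect()]);
  console.log(connections); // 3 with the mutex; often 1 without it
}
main();
```

Without the mutex, all three handlers would read `connections` as 0 during the async gap and write back 1.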

Task 2 (Pagination): Found the existing pagination pattern in two other endpoints, replicated it exactly, added the total count query, updated the route handler, and modified the API docs inline. Total time: 6 minutes. Supervision required: none. One minor style inconsistency (it used totalCount where the project's snake_case convention calls for total_count), caught and fixed when pointed out.
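The pattern being replicated is standard offset/limit pagination with a total count in the response. A minimal sketch, with hypothetical names and an in-memory array standing in for the database query; note the snake_case total_count, the convention the first attempt missed:

```typescript
// Hypothetical pagination helper; rows stands in for a DB query result.
interface Page<T> {
  items: T[];
  total_count: number; // project convention: snake_case response keys
  offset: number;
  limit: number;
}

function paginate<T>(rows: T[], offset = 0, limit = 20): Page<T> {
  // Clamp inputs so a bad query string can't produce negative slices
  // or unbounded result sets.
  const safeOffset = Math.max(0, Math.floor(offset));
  const safeLimit = Math.min(100, Math.max(1, Math.floor(limit)));
  return {
    items: rows.slice(safeOffset, safeOffset + safeLimit),
    total_count: rows.length,
    offset: safeOffset,
    limit: safeLimit,
  };
}

const rows = Array.from({ length: 45 }, (_, i) => ({ id: i }));
console.log(paginate(rows, 40, 10).items.length); // 5 items on the last page
```

In a real endpoint the slice becomes a LIMIT/OFFSET query and total_count a separate COUNT query, which is the extra query Claude Code added.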

Task 3 (Test suite): This was Claude Code's strongest showing. It read the auth module, identified the public interface, and generated 23 tests covering token generation, validation (valid, expired, malformed, revoked), refresh logic, and error handling. The tests were well-structured, used the project's existing test utilities, and ran green on the first try. Total time: 8 minutes. Supervision required: none.
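The edge cases that mattered here (expiry at the exact boundary, revocation) are easier to picture with a toy validator. Everything below is a hypothetical stand-in for the real auth module, sketched only to show the shape of those checks:

```typescript
// Hypothetical claims shape and validator; not the project's auth code.
interface Claims { sub: string; exp: number } // exp = Unix seconds

const revoked = new Set<string>();

function validateToken(claims: Claims, now: number): "ok" | "expired" | "revoked" {
  if (revoked.has(claims.sub)) return "revoked";
  // Expiry is exclusive: a token is invalid the moment exp is reached.
  if (now >= claims.exp) return "expired";
  return "ok";
}

const now = 1_700_000_000;
console.log(validateToken({ sub: "u1", exp: now + 60 }, now)); // "ok"
// The boundary case a thorough suite covers: expired by exactly one second.
console.log(validateToken({ sub: "u1", exp: now - 1 }, now));  // "expired"
revoked.add("u2");
console.log(validateToken({ sub: "u2", exp: now + 60 }, now)); // "revoked"
```

The 23 generated tests covered exactly these boundary and state-interaction cases, which is where hand-written suites usually fall short.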

Task 4 (Refactor): Successfully split the utility file into three modules, created a barrel export for backwards compatibility, and updated imports across 14 files. Ran the test suite — two tests failed due to a missing re-export. Claude Code read the errors, fixed the barrel export, and re-ran. All green. Total time: 12 minutes. Supervision required: minimal — reviewed the module split plan before execution.
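A barrel export of the kind described is a single index file that re-exports the new modules so existing import paths keep resolving. A hypothetical sketch (module names invented for illustration); the missing re-export that failed two tests would be a line like the last one:

```typescript
// index.ts — hypothetical barrel for the three extracted modules.
export * from "./strings";
export * from "./dates";
export * from "./validation";

// A renamed export needs an explicit re-export to preserve the old name;
// omitting one of these is exactly the kind of gap that broke two tests.
export { formatDate as format } from "./dates";
```

The barrel keeps the refactor non-breaking for callers, at the cost of hiding which module a symbol actually lives in.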

Task 5 (Codebase question): Traced through four files to explain the proration logic accurately, including an edge case around timezone handling that wasn't obvious from reading any single file. The explanation was detailed, accurate, and referenced specific line numbers. Total time: 3 minutes. This is Claude Code's superpower — codebase comprehension — and it delivered.
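For readers unfamiliar with proration, the calculation being traced is roughly: charge the price difference for the fraction of the billing cycle remaining at upgrade time. The function below is an assumed sketch, not the project's billing code; the comment marks where the timezone edge case lives:

```typescript
// Assumed proration formula: (new - old) * fraction of cycle remaining,
// rounded to whole cents. All inputs are Date objects.
function proratedCharge(
  oldPriceCents: number,
  newPriceCents: number,
  cycleStart: Date,
  cycleEnd: Date,
  upgradeAt: Date,
): number {
  const cycleMs = cycleEnd.getTime() - cycleStart.getTime();
  const remainingMs = cycleEnd.getTime() - upgradeAt.getTime();
  const fraction = remainingMs / cycleMs;
  // Computing cycle boundaries in local time instead of UTC is exactly
  // where a timezone edge case like the one described above bites.
  return Math.round((newPriceCents - oldPriceCents) * fraction);
}

// Upgrade halfway through a 30-day cycle from $10 to $30: owes $10.
const charge = proratedCharge(
  1000, 3000,
  new Date(Date.UTC(2024, 0, 1)),
  new Date(Date.UTC(2024, 0, 31)),
  new Date(Date.UTC(2024, 0, 16)),
);
console.log(charge); // 1000
```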

Claude Code total cost across all five tasks: Approximately $3.80 on API billing [VERIFY — depends on model and pricing at time of testing].

Devin Results

Devin's advantage is different: it runs in a full sandboxed environment with a browser, terminal, and editor, which means it can do things the other tools can't — install dependencies, run servers, check browser output, interact with external services. Its disadvantage is speed. Every action goes through a more heavyweight execution pipeline than Claude Code's direct terminal access.

Task 1 (Bug fix): Devin took longer to locate the bug — about 5 minutes of browsing through files in its IDE before identifying the race condition. The fix was correct but used a different pattern than the rest of the codebase (a semaphore library rather than the project's existing mutex utility). Functionally correct, stylistically inconsistent. Total time: 14 minutes. Supervision required: pointed it toward the existing mutex utility, which it then used to revise the fix.

Task 2 (Pagination): Found the existing pattern and replicated it correctly. Notably, Devin also started the server and tested the endpoint through its browser to verify the pagination worked end-to-end — something neither of the other tools did. Total time: 18 minutes. Supervision required: none. The end-to-end verification was a genuine advantage.

Task 3 (Test suite): Generated 19 tests. Solid coverage of happy paths and basic error cases but missed several edge cases that Claude Code caught (expired token by exactly one second, revoked-then-refreshed token). The tests ran green. Total time: 22 minutes. Supervision required: requested additional edge cases after reviewing, which Devin added correctly.

Task 4 (Refactor): Completed the split correctly, but took longer than Claude Code, primarily because it ran the full test suite after each file modification rather than making all changes and running once. A more cautious approach, same correct result. Total time: 25 minutes. Supervision required: none, though the incremental approach was slower.

Task 5 (Codebase question): Accurate but less thorough than Claude Code's answer. Identified the core proration logic correctly but missed the timezone edge case. The explanation was structured as a walkthrough of the code rather than a synthesized answer — more "here's what each file does" and less "here's how the system works." Total time: 8 minutes.

Devin total cost: $500/month subscription. These five tasks took about 87 minutes of Devin time; to justify the subscription for this kind of work, you'd need to be running substantially more volume than that daily.

OpenAI Agents SDK Results

The OpenAI Agents SDK isn't a product you point at a codebase — it's a framework for building agents. So for this comparison, we built a coding agent using the SDK: a tool-calling loop with file read/write tools, a shell execution tool, and a search tool, running on GPT-4o. Setup took about two hours, which is itself a data point: the other two tools work out of the box; this one required building the agent first.
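For readers who haven't built one, the loop we assembled has roughly this shape. This is a schematic sketch, not the Agents SDK's actual API: callModel is a stub standing in for a real model call, and the tools return canned strings.

```typescript
// Schematic tool-calling loop: model proposes a tool call or a final
// answer; the runner executes tools and feeds results back.
type ToolCall = { name: string; args: Record<string, string> };
type ModelTurn = { toolCall?: ToolCall; finalAnswer?: string };

const tools: Record<string, (args: Record<string, string>) => string> = {
  read_file: ({ path }) => `contents of ${path}`, // stub
  search: ({ query }) => `matches for ${query}`,  // stub
};

function runAgent(
  callModel: (transcript: string[]) => ModelTurn,
  task: string,
  maxTurns = 10,
): string {
  const transcript = [`task: ${task}`];
  for (let i = 0; i < maxTurns; i++) {
    const turn = callModel(transcript);
    if (turn.finalAnswer !== undefined) return turn.finalAnswer;
    if (turn.toolCall) {
      const { name, args } = turn.toolCall;
      const output = tools[name]?.(args) ?? `unknown tool: ${name}`;
      transcript.push(`tool ${name} -> ${output}`);
    }
  }
  return "gave up: turn limit reached";
}

// A scripted "model" that reads one file, then answers.
const answer = runAgent(
  (t) =>
    t.length === 1
      ? { toolCall: { name: "read_file", args: { path: "src/ws.ts" } } }
      : { finalAnswer: "done" },
  "find the race condition",
);
console.log(answer); // "done"
```

Most of the two hours went into the parts this sketch omits: real tool implementations, prompt construction, and error handling around the model call.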

Task 1 (Bug fix): The custom agent found the bug after some searching — it didn't have Claude Code's instinct for where to look first. The fix was correct. Total time: 9 minutes (excluding the two hours of agent setup). Supervision required: none once running, but the agent's tool-use patterns were less efficient than Claude Code's — more redundant file reads, less strategic searching.

Task 2 (Pagination): Completed correctly. The agent found the existing pattern but took two attempts — the first implementation missed the total count query. The retry loop caught it via the test suite. Total time: 15 minutes. Supervision required: none.

Task 3 (Test suite): Generated 16 tests. Adequate coverage but noticeably less thorough than Claude Code's output. The test structure was more generic — less awareness of the project's testing conventions. Total time: 14 minutes. Supervision required: some manual review to add project-specific test utilities.

Task 4 (Refactor): This is where the custom agent struggled most. It attempted the split but made errors in import path resolution that cascaded across files. After two failed test runs, it self-corrected but left one file with a circular import that required manual intervention. Total time: 30 minutes plus human debugging. Supervision required: significant.

Task 5 (Codebase question): Produced a correct but surface-level explanation. Found the main proration function but didn't trace through the supporting utilities the way Claude Code did. Total time: 6 minutes.

OpenAI Agents SDK total cost: Approximately $2.40 in API calls (GPT-4o pricing) [VERIFY], plus the two hours of development time to build the agent. The per-task API cost is lower than Claude Code. The total cost — including your time building and maintaining the agent — is substantially higher unless you're running the agent repeatedly across many tasks.

Cost Per Task

Here's the math that matters, stripped of everything except dollars and minutes.

Task         Claude Code       Devin               OpenAI Agent
Bug fix      $0.60 / 4 min     ~$4.50* / 14 min    $0.35 / 9 min
Pagination   $0.80 / 6 min     ~$4.50* / 18 min    $0.45 / 15 min
Test suite   $0.90 / 8 min     ~$4.50* / 22 min    $0.55 / 14 min
Refactor     $1.00 / 12 min    ~$4.50* / 25 min    $0.70 / 30 min
Codebase Q   $0.50 / 3 min     ~$4.50* / 8 min     $0.35 / 6 min

*Devin cost estimated by dividing $500/month subscription across working hours. Actual per-task cost depends on utilization — the more you use it, the lower the effective per-task cost.
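The utilization point in the footnote can be made concrete. A minimal sketch, assuming only the flat $500/month price and even amortization across tasks:

```typescript
// Effective per-task cost of a flat subscription at a given volume.
function effectivePerTask(tasksPerMonth: number): number {
  return 500 / tasksPerMonth;
}

for (const n of [25, 100, 500]) {
  console.log(`${n} tasks/month -> $${effectivePerTask(n).toFixed(2)} each`);
}
// 25 tasks -> $20.00 each; 100 -> $5.00; 500 -> $1.00
```

At roughly 100 tasks a month the effective cost lands near the ~$4.50 estimate in the table; below that, pay-per-use tools win on price.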

Claude Code is the clear winner on cost-per-task for these coding tasks. Devin's subscription model only makes economic sense at high utilization — you need to be running it for hours daily to compete with pay-per-use. The OpenAI agent has the lowest API costs but the amortized development cost pushes it higher unless you're running thousands of tasks.

Supervision Required

This is the column that matters more than cost for most teams.

Claude Code needed the least supervision across all five tasks. One minor correction (naming convention), one plan review (refactor structure). Everything else ran correctly without intervention. For a developer who wants to hand off a task and review the output, Claude Code is the most reliable option in this test.

Devin needed moderate supervision. The fixes and implementations were functionally correct but sometimes stylistically inconsistent with the codebase. The end-to-end testing via browser was a genuine advantage that partially offsets the supervision cost. For well-scoped tickets with clear acceptance criteria, Devin can run more independently. For anything requiring codebase taste, expect to review and request revisions.

The OpenAI custom agent needed the most supervision. The refactor task required manual intervention. The test suite needed manual additions. This isn't an indictment of the SDK — it's the reality of a custom agent versus purpose-built products. Claude Code and Devin have been optimized specifically for coding tasks through thousands of iterations. A custom agent built in two hours hasn't had that optimization cycle.

The Verdict

Use Claude Code when you need fast, high-quality code changes with minimal supervision and you're comfortable in the terminal. It's the best option for daily development work — bug fixes, feature implementation, refactoring, test writing, codebase exploration. The tight integration with your local development environment and the codebase comprehension are hard to match.

Use Devin when you need an agent that can operate end-to-end in an isolated environment — installing dependencies, running servers, testing in a browser, interacting with external APIs. Devin's sandboxed environment is its unique advantage. For tasks where "works on my machine" isn't good enough and you need verified, environment-complete execution, Devin offers something the others don't. The price only makes sense at volume.

Use OpenAI Agents SDK when you need a custom agent tailored to a specific workflow that the off-the-shelf tools don't support. The SDK is not a coding agent — it's a toolkit for building agents. If your needs don't fit Claude Code's or Devin's patterns, the SDK gives you the flexibility to build exactly what you need. The cost is your development time, and it's only justified when the customization is genuinely necessary.

Do it yourself when the task takes less time to do manually than to explain to any of these tools. That threshold is lower than you think — roughly 10-15 minutes of manual work. Below that, opening the tool, writing the prompt, and reviewing the output is overhead, not help.

No agent in this test produced output that could be merged without human review. All three got closer to correct than they would have a year ago. None of them eliminated the need for a developer who understands the codebase. The gap between "AI generated this" and "this is ready for production" is real, consistent, and — for now — still your job.


This is part of CustomClanker's AI Agents series — reality checks on every major agent framework.