Two years ago I wrote about Spec-Driven Development — writing precise specifications before generating code. That was a workflow designed to get consistent quality out of LLMs that still needed careful prompting. The workflow is still relevant in 2026, but the context has changed substantially.
This post is an honest account of how I actually build production software today: what agents handle autonomously, what still requires human judgment, and where the rough edges are.
What Changed Between 2024 and 2026
In 2024, LLM-assisted coding meant:
- Write a detailed prompt → get a code suggestion → review and edit
- Loop over 5-10 cycles to get a working implementation
- Manual test writing (LLMs would suggest tests but they were often wrong)
- Deployment was entirely manual
In 2026, with agentic tools like Claude Code:
- Write a spec or describe the task → agent reads the codebase, plans, implements
- Agent runs tests, fixes failures, iterates without my input
- Agent can write tests against the actual implementation and verify they pass
- Agent can trigger deployments and verify they succeeded
The most significant change isn’t the code quality (which improved modestly) — it’s the delegation level. I can hand off a well-scoped task and come back to a complete implementation rather than reviewing each code generation step.
The Workflow: From Idea to Production
Here’s the end-to-end workflow for a recent feature I built: a financial alert system that notifies users when unusual patterns are detected in their transaction data.
Step 1: Spec (15 minutes)
I write the spec in Markdown. Same format as Spec-Driven Development — context, requirements, constraints, non-requirements.
```markdown
# Feature: Transaction Pattern Alert System

**Context**: We have a transaction processing API (Express.js/TypeScript, PostgreSQL, Redis). We want to send email alerts when we detect unusual patterns for an account.

**Requirements**:

1. After each transaction is processed, run pattern analysis for that account
2. Patterns to detect: (a) unusual merchant category, (b) amount > 3 SD from mean, (c) 5+ transactions in 1 hour, (d) weekend/late-night activity for B2B accounts
3. If any pattern detected, queue an alert (don't block the transaction response)
4. Alert sent via SendGrid to account owner email with: pattern description, transaction details, and a "this was me" confirmation link
5. Maximum 1 alert per account per 24 hours (dedup with Redis)
6. Unit tests for each pattern detector
7. Integration test for the full alert flow with SendGrid mocked

**Constraints**:

- Use existing Redis client (src/lib/redis.ts)
- Use existing pg pool (src/lib/db.ts)
- Use existing SendGrid client (src/lib/email.ts)
- Alert processing must be async (add to BullMQ queue, process in worker)
- TypeScript strict mode

**Non-requirements**:

- Do not add push notifications
- Do not add in-app notifications
- Do not implement "this was me" link processing (separate ticket)
```

Step 2: Agent Execution
I give the spec to Claude Code:
```
Implement the transaction pattern alert system per this spec.
Start by reading the relevant existing files, then implement incrementally:
1. Pattern detectors (unit-testable)
2. Queue integration
3. Worker
4. Tests
```

The agent:

- Reads `src/lib/redis.ts`, `src/lib/db.ts`, and `src/lib/email.ts` to understand existing interfaces
- Reads `src/services/transaction.ts` to understand where to hook in
- Reads existing tests to understand testing patterns
- Implements in the order specified
- Runs `npm test` after each step, fixes failures
Total time before returning to me with a complete implementation: 18 minutes.
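To make the detector step concrete, here is a minimal sketch of the velocity rule from requirement 2(c), written as a pure function over an account's recent timestamps so it is unit-testable in isolation. The name and signature are illustrative, not the agent's actual output:

```typescript
// Velocity rule from the spec: flag when an account has 5 or more
// transactions inside any rolling one-hour window.
// Timestamps are epoch milliseconds, assumed sorted ascending.
function detectHighVelocity(
  timestamps: number[],
  threshold = 5,
  windowMs = 60 * 60 * 1000
): boolean {
  let start = 0;
  for (let end = 0; end < timestamps.length; end++) {
    // Slide the window start forward until the span fits in windowMs.
    while (timestamps[end] - timestamps[start] > windowMs) start++;
    if (end - start + 1 >= threshold) return true;
  }
  return false;
}
```

Keeping each detector a pure function over an input slice, rather than something that reaches into the database itself, is what makes requirement 6 (a unit test per detector) cheap.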
Step 3: Review (30 minutes)
I review the implementation. Not line-by-line (the agent handled ~400 lines of code and 120 lines of tests) — I focus on:
- Architecture decisions: Did the agent make the right abstractions? Did it correctly understand the non-requirements?
- Security: Any SQL injection, unvalidated input, leaked credentials?
- Edge cases in pattern detection: The standard deviation calculation — is it using rolling history or lifetime? (Lifetime was wrong for onboarding accounts — I flagged this.)
- Test coverage: Are the integration tests actually testing the async path?
I found one logic error (point 3 above) and one missing test case. I described them in a follow-up message. The agent fixed both.
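For context on that logic error: lifetime statistics let a new account's first handful of transactions dominate the baseline. A rolling-window version of the amount detector looks something like the sketch below; the window size, minimum-history cutoff, and names are my assumptions, not the code the agent produced:

```typescript
// Amount rule: flag a transaction more than `sdThreshold` standard
// deviations from the mean of recent history. Using a rolling window
// (e.g. the last 90 transactions) instead of lifetime stats avoids
// skew from an account's earliest activity.
function isAmountOutlier(
  history: number[], // prior transaction amounts, oldest first
  amount: number,
  windowSize = 90,
  sdThreshold = 3
): boolean {
  const window = history.slice(-windowSize);
  if (window.length < 10) return false; // too little history to judge
  const mean = window.reduce((a, b) => a + b, 0) / window.length;
  const variance =
    window.reduce((sum, x) => sum + (x - mean) ** 2, 0) / window.length;
  const sd = Math.sqrt(variance);
  if (sd === 0) return amount !== mean; // flat history: any change is unusual
  return Math.abs(amount - mean) > sdThreshold * sd;
}
```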
Step 4: PR, CI, Deploy
```sh
git add -A && git commit -m "feat: transaction pattern alert system"
git push origin feature/pattern-alerts
```

CI runs: TypeScript check, unit tests, integration tests (with mocked SendGrid), security scan. All pass. I merge.
The GitHub Actions workflow deploys to staging automatically. I verify the feature works end-to-end with a real test account. Deploy to production.
Total time for a feature that would have taken 2-3 days two years ago: 4 hours.
What the Agent Does Well
Reading existing code: The agent builds a mental model of your codebase before writing anything. It finds the existing patterns, interfaces, and conventions. The output code looks like it was written by someone who’s been on the team for a year.
Iteration: The agent doesn’t give up after one failed test. It reads the error, forms a hypothesis, fixes, and re-runs. The iteration loop that used to take me 10 minutes of back-and-forth prompt engineering now happens autonomously in 2 minutes.
Repetitive implementations: CRUD endpoints, validation functions, data transformation pipelines, test fixtures — anything formulaic where the pattern is established and the variation is data-driven. Zero marginal thinking required.
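To make "formulaic" concrete: the spec's 24-hour dedup (requirement 5) is exactly this kind of code. A minimal sketch, with an in-memory map standing in for the single Redis `SET ... NX EX` call the real implementation would make; the names are mine, not the shipped code:

```typescript
const DEDUP_TTL_MS = 24 * 60 * 60 * 1000;

// In-memory stand-in for Redis. The production version would be one
// atomic `SET alert:<accountId> 1 NX EX 86400` on the shared client.
const lastAlertAt = new Map<string, number>();

// Returns true (and records the send) only if no alert has gone out
// for this account in the last 24 hours.
function tryClaimAlert(accountId: string, now: number): boolean {
  const prev = lastAlertAt.get(accountId);
  if (prev !== undefined && now - prev < DEDUP_TTL_MS) return false;
  lastAlertAt.set(accountId, now);
  return true;
}
```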
Documentation: The agent writes accurate docstrings and README sections because it can read what the code actually does, not what you intended it to do.
What the Agent Doesn’t Do Well
Architectural decisions: “Should we use an event-driven architecture or direct service calls for this?” The agent will make a choice, but it won’t necessarily be the right choice for your system’s specific scale, team structure, and operational constraints. These decisions still need human judgment.
Novel algorithm design: Implementing a known algorithm (BFS, Dijkstra, dynamic programming with a standard formulation) — excellent. Designing the algorithm itself for a novel problem — not reliable.
Debugging distributed system issues: When the bug is “sometimes requests fail under high load and the trace shows latency spikes but we can’t reproduce locally” — the agent can suggest hypotheses, but the investigation requires human intuition and production access it doesn’t have.
Cross-cutting concerns: Security architecture, observability strategy, data retention policies — these require holistic understanding of the business and risk tolerance that doesn’t fit in a context window.
Product decisions: “What should this feature do?” is not something you delegate.
The Skills That Became More Valuable
Specification writing: The ability to write a precise, unambiguous spec is now the primary leverage point. A good spec → good implementation. A vague spec → something that technically works but misses the point.
Review depth: You review more code than you write. Knowing what to look for — security issues, architectural misalignments, edge cases in business logic — became the critical skill.
System design: The agent is good at implementing within a design. The design itself requires human expertise.
Knowing when NOT to delegate: Some tasks are faster without the agent. Fixing a one-line bug. Refactoring a function you have in your head. Writing a quick script you’ll run once. The overhead of context-setting isn’t worth it.
The Rough Edges
Context window limits: Complex features that touch many files can exceed what fits in context. The agent may lose track of earlier decisions. Solution: break large tasks into independent phases with explicit state handoffs.
Hallucinated library APIs: The agent sometimes uses library methods that don’t exist (especially for newer APIs it wasn’t trained on). Solution: always run npm test after agent-generated code; type errors and runtime failures catch these quickly.
Overengineering: The agent sometimes adds abstractions that aren’t needed. “You won’t need that” is a judgment call that requires understanding the roadmap. Review for unnecessary complexity.
Security blind spots: The agent catches obvious issues (SQL injection, obvious XSS) but misses subtler ones (timing attacks, insecure deserialization in edge cases, authorization logic bugs). Security review is still human work.
The Mental Shift
The biggest change isn’t in how I write code — it’s in how I think about what I’m responsible for.
Two years ago: I was responsible for every line of code in my services.
Today: I’m responsible for the design, the specs, the review, and the system-level decisions. The implementation is usually delegated. I’m still responsible for what ships — the agent is a tool, not a co-author with separate accountability.
This requires a different discipline. You have to review more carefully than you think, especially in the early days of using an agent. The temptation is to trust the output because it looks professional and compiles. “Looks good” is not the standard; “provably correct and secure” is.
The engineers who thrive in this environment are the ones who can hold the design and the business logic clearly enough in their heads to delegate implementation confidently and review critically. That’s always been what makes a senior engineer — the agents just made it the explicit bottleneck.