Two years ago I wrote about Spec-Driven Development — writing precise specifications before generating code. That was a workflow designed to get consistent quality out of LLMs that still needed careful prompting. The workflow is still relevant in 2026, but the context has changed substantially.
This post is an honest account of how I actually build production software today: what agents handle autonomously, what still requires human judgment, and where the rough edges are.
What Changed Between 2024 and 2026
In 2024, LLM-assisted coding meant:
- Write a detailed prompt → get a code suggestion → review and edit
- Loop over 5-10 cycles to get a working implementation
- Manual test writing (LLMs would suggest tests but they were often wrong)
- Deployment was entirely manual
In 2026, with agentic tools like Claude Code:
- Write a spec or describe the task → agent reads the codebase, plans, implements
- Agent runs tests, fixes failures, iterates without my input
- Agent can write tests against the actual implementation and verify they pass
- Agent can trigger deployments and verify they succeeded
The most significant change isn’t the code quality (which improved modestly) — it’s the delegation level. I can hand off a well-scoped task and come back to a complete implementation rather than reviewing each code generation step.
The Workflow: From Idea to Production
Here’s the end-to-end workflow for a recent feature I built: a financial alert system that notifies users when unusual patterns are detected in their transaction data.
Step 1: Spec (15 minutes)
I write the spec in Markdown. Same format as Spec-Driven Development — context, requirements, constraints, non-requirements.
```markdown
# Feature: Transaction Pattern Alert System

**Context**: We have a transaction processing API (Express.js/TypeScript, PostgreSQL, Redis). We want to send email alerts when we detect unusual patterns for an account.

**Requirements**:

1. After each transaction is processed, run pattern analysis for that account
2. Patterns to detect: (a) unusual merchant category, (b) amount > 3 SD from mean, (c) 5+ transactions in 1 hour, (d) weekend/late-night activity for B2B accounts
3. If any pattern detected, queue an alert (don't block the transaction response)
4. Alert sent via SendGrid to account owner email with: pattern description, transaction details, and a "this was me" confirmation link
5. Maximum 1 alert per account per 24 hours (dedup with Redis)
6. Unit tests for each pattern detector
7. Integration test for the full alert flow with SendGrid mocked

**Constraints**:

- Use existing Redis client (src/lib/redis.ts)
- Use existing pg pool (src/lib/db.ts)
- Use existing SendGrid client (src/lib/email.ts)
- Alert processing must be async (add to BullMQ queue, process in worker)
- TypeScript strict mode

**Non-requirements**:

- Do not add push notifications
- Do not add in-app notifications
- Do not implement "this was me" link processing (separate ticket)
```

Step 2: Agent Execution
I give the spec to Claude Code:
```
Implement the transaction pattern alert system per this spec.
Start by reading the relevant existing files, then implement incrementally:
1. Pattern detectors (unit-testable)
2. Queue integration
3. Worker
4. Tests
```

The agent:

- Reads `src/lib/redis.ts`, `src/lib/db.ts`, and `src/lib/email.ts` to understand existing interfaces
- Reads `src/services/transaction.ts` to understand where to hook in
- Reads existing tests to understand testing patterns
- Implements in the order specified
- Runs `npm test` after each step, fixes failures
Total time before returning to me with a complete implementation: 18 minutes.
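To make the detector step concrete, here is a minimal sketch of the velocity rule from requirement 2(c), written as a pure function over an account's recent timestamps so it is unit-testable in isolation. The name and signature are illustrative, not the agent's actual output:

```typescript
// Velocity rule from the spec: flag when an account has 5 or more
// transactions inside any rolling one-hour window.
// Timestamps are epoch milliseconds, assumed sorted ascending.
function detectHighVelocity(
  timestamps: number[],
  threshold = 5,
  windowMs = 60 * 60 * 1000
): boolean {
  let start = 0;
  for (let end = 0; end < timestamps.length; end++) {
    // Slide the window start forward until the span fits in windowMs.
    while (timestamps[end] - timestamps[start] > windowMs) start++;
    if (end - start + 1 >= threshold) return true;
  }
  return false;
}
```

Keeping each detector a pure function over an input slice, rather than something that reaches into the database itself, is what makes requirement 6 (a unit test per detector) cheap.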
Step 3: Review (30 minutes)
I review the implementation. Not line-by-line (the agent handled ~400 lines of code and 120 lines of tests) — I focus on:
- Architecture decisions: Did the agent make the right abstractions? Did it correctly understand the non-requirements?
- Security: Any SQL injection, unvalidated input, leaked credentials?
- Edge cases in pattern detection: The standard deviation calculation — is it using rolling history or lifetime? (Lifetime was wrong for onboarding accounts — I flagged this.)
- Test coverage: Are the integration tests actually testing the async path?
I found one logic error (point 3 above) and one missing test case. I described them in a follow-up message. The agent fixed both.
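For context on that logic error: lifetime statistics let a new account's first handful of transactions dominate the baseline. A rolling-window version of the amount detector looks something like the sketch below; the window size, minimum-history cutoff, and names are my assumptions, not the code the agent produced:

```typescript
// Amount rule: flag a transaction more than `sdThreshold` standard
// deviations from the mean of recent history. Using a rolling window
// (e.g. the last 90 transactions) instead of lifetime stats avoids
// skew from an account's earliest activity.
function isAmountOutlier(
  history: number[], // prior transaction amounts, oldest first
  amount: number,
  windowSize = 90,
  sdThreshold = 3
): boolean {
  const window = history.slice(-windowSize);
  if (window.length < 10) return false; // too little history to judge
  const mean = window.reduce((a, b) => a + b, 0) / window.length;
  const variance =
    window.reduce((sum, x) => sum + (x - mean) ** 2, 0) / window.length;
  const sd = Math.sqrt(variance);
  if (sd === 0) return amount !== mean; // flat history: any change is unusual
  return Math.abs(amount - mean) > sdThreshold * sd;
}
```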
Step 4: PR, CI, Deploy
```sh
git add -A && git commit -m "feat: transaction pattern alert system"
git push origin feature/pattern-alerts
```

CI runs: TypeScript check, unit tests, integration tests (with mocked SendGrid), security scan. All pass. I merge.
The GitHub Actions workflow deploys to staging automatically. I verify the feature works end-to-end with a real test account. Deploy to production.
Total time for a feature that would have taken 2-3 days two years ago: 4 hours.
What the Agent Does Well
Reading existing code: The agent builds a mental model of your codebase before writing anything. It finds the existing patterns, interfaces, and conventions. The output code looks like it was written by someone who’s been on the team for a year.
Iteration: The agent doesn’t give up after one failed test. It reads the error, forms a hypothesis, fixes, and re-runs. The iteration loop that used to take me 10 minutes of back-and-forth prompt engineering now happens autonomously in 2 minutes.
Repetitive implementations: CRUD endpoints, validation functions, data transformation pipelines, test fixtures — anything formulaic where the pattern is established and the variation is data-driven. Zero marginal thinking required.
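To make "formulaic" concrete: the spec's 24-hour dedup (requirement 5) is exactly this kind of code. A minimal sketch, with an in-memory map standing in for the single Redis `SET ... NX EX` call the real implementation would make; the names are mine, not the shipped code:

```typescript
const DEDUP_TTL_MS = 24 * 60 * 60 * 1000;

// In-memory stand-in for Redis. The production version would be one
// atomic `SET alert:<accountId> 1 NX EX 86400` on the shared client.
const lastAlertAt = new Map<string, number>();

// Returns true (and records the send) only if no alert has gone out
// for this account in the last 24 hours.
function tryClaimAlert(accountId: string, now: number): boolean {
  const prev = lastAlertAt.get(accountId);
  if (prev !== undefined && now - prev < DEDUP_TTL_MS) return false;
  lastAlertAt.set(accountId, now);
  return true;
}
```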
Documentation: The agent writes accurate docstrings and README sections because it can read what the code actually does, not what you intended it to do.
What the Agent Doesn’t Do Well
Architectural decisions: “Should we use an event-driven architecture or direct service calls for this?” The agent will make a choice, but it won’t necessarily be the right choice for your system’s specific scale, team structure, and operational constraints. These decisions still need human judgment.
Novel algorithm design: Implementing a known algorithm (BFS, Dijkstra, dynamic programming with a standard formulation) — excellent. Designing the algorithm itself for a novel problem — not reliable.
Debugging distributed system issues: When the bug is “sometimes requests fail under high load and the trace shows latency spikes but we can’t reproduce locally” — the agent can suggest hypotheses, but the investigation requires human intuition and production access it doesn’t have.
Cross-cutting concerns: Security architecture, observability strategy, data retention policies — these require holistic understanding of the business and risk tolerance that doesn’t fit in a context window.
Product decisions: “What should this feature do?” is not something you delegate.
The Skills That Became More Valuable
Specification writing: The ability to write a precise, unambiguous spec is now the primary leverage point. A good spec → good implementation. A vague spec → something that technically works but misses the point.
Review depth: You review more code than you write. Knowing what to look for — security issues, architectural misalignments, edge cases in business logic — became the critical skill.
System design: The agent is good at implementing within a design. The design itself requires human expertise.
Knowing when NOT to delegate: Some tasks are faster without the agent. Fixing a one-line bug. Refactoring a function you have in your head. Writing a quick script you’ll run once. The overhead of context-setting isn’t worth it.
The Rough Edges
Context window limits: Complex features that touch many files can exceed what fits in context. The agent may lose track of earlier decisions. Solution: break large tasks into independent phases with explicit state handoffs.
Hallucinated library APIs: The agent sometimes uses library methods that don’t exist (especially for newer APIs it wasn’t trained on). Solution: always run npm test after agent-generated code; type errors and runtime failures catch these quickly.
Overengineering: The agent sometimes adds abstractions that aren’t needed. “You won’t need that” is a judgment call that requires understanding the roadmap. Review for unnecessary complexity.
Security blind spots: The agent catches obvious issues (SQL injection, obvious XSS) but misses subtler ones (timing attacks, insecure deserialization in edge cases, authorization logic bugs). Security review is still human work.
The Mental Shift
The biggest change isn’t in how I write code — it’s in how I think about what I’m responsible for.
Two years ago: I was responsible for every line of code in my services.
Today: I’m responsible for the design, the specs, the review, and the system-level decisions. The implementation is usually delegated. I’m still responsible for what ships — the agent is a tool, not a co-author with separate accountability.
This requires a different discipline. You have to review more carefully than you think, especially in the early days of using an agent. The temptation is to trust the output because it looks professional and compiles. “Looks good” is not the standard; “provably correct and secure” is.
The engineers who thrive in this environment are the ones who can hold the design and the business logic clearly enough in their heads to delegate implementation confidently and review critically. That’s always been what makes a senior engineer — the agents just made it the explicit bottleneck.