LLM APIs are not like REST APIs. A REST API is deterministic — given the same inputs, you get the same outputs. An LLM is probabilistic — outputs vary, latency varies (2-30 seconds is typical), and failure modes are different from HTTP errors. Writing production code that integrates with LLMs requires adapting patterns from async I/O, streaming, and probabilistic systems.
This post covers the four patterns I use in every production LLM integration, with TypeScript code using the Anthropic Claude API.
## Setup
```bash
npm install @anthropic-ai/sdk zod
```

```typescript
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY!,
});
```

## Pattern 1: Structured Outputs
The fundamental challenge: LLMs return text. Your application needs structured data. Naive approach: ask the model to “respond in JSON”. This fails ~10% of the time due to:
- Trailing commas
- Unescaped quotes in strings
- Model reasoning before the JSON block
- Markdown code fences around the JSON
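The last two failure modes can be absorbed before parsing with a tolerant extraction step. A minimal sketch (the `extractJson` helper name is mine, not from any library):

```typescript
// Strip markdown fences and any reasoning preamble, then parse.
// Handles fenced responses ("Here's the JSON: ...") and bare "{...}" responses.
function extractJson(text: string): unknown {
  const match =
    text.match(/```(?:json)?\s*([\s\S]*?)```/) || // fenced code block
    text.match(/(\{[\s\S]*\})/);                  // first "{" to last "}"
  if (!match) throw new Error(`No JSON found in: ${text.slice(0, 100)}`);
  return JSON.parse(match[1]);
}
```

Trailing commas and unescaped quotes will still fail `JSON.parse`; those cases need a retry rather than smarter extraction.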
The robust approach: combine prompt engineering with schema validation.
```typescript
import { z } from 'zod';

// Define your expected output schema
const TransactionAnalysisSchema = z.object({
  category: z.enum(['retail', 'food', 'transport', 'utilities', 'other']),
  risk_score: z.number().min(0).max(1),
  is_suspicious: z.boolean(),
  reasoning: z.string(),
  suggested_tags: z.array(z.string()).max(5),
});

type TransactionAnalysis = z.infer<typeof TransactionAnalysisSchema>;

async function analyzeTransaction(
  description: string,
  amount: number,
  merchant: string
): Promise<TransactionAnalysis> {
  const response = await client.messages.create({
    model: 'claude-opus-4-6',
    max_tokens: 1024,
    messages: [
      {
        role: 'user',
        content: `Analyze this financial transaction and respond with ONLY a JSON object matching this exact schema:

Schema:
\`\`\`json
{
  "category": "retail" | "food" | "transport" | "utilities" | "other",
  "risk_score": number between 0 and 1,
  "is_suspicious": boolean,
  "reasoning": "string explaining the analysis",
  "suggested_tags": ["array", "of", "tags"]
}
\`\`\`

Transaction:
- Description: ${description}
- Amount: €${amount}
- Merchant: ${merchant}

Respond with only the JSON object, no additional text.`,
      },
    ],
  });

  const text =
    response.content[0].type === 'text' ? response.content[0].text : '';

  // Extract JSON from response (handle markdown code fences)
  const jsonMatch =
    text.match(/```(?:json)?\s*([\s\S]*?)```/) || text.match(/(\{[\s\S]*\})/);

  if (!jsonMatch) {
    throw new Error(`No JSON found in LLM response: ${text}`);
  }

  const parsed = JSON.parse(jsonMatch[1]);

  // Validate against schema — throws if invalid
  return TransactionAnalysisSchema.parse(parsed);
}
```

### Structured JSON via System Prompts
Claude doesn’t have a native “JSON mode” toggle like some providers. The reliable approach is to use the system prompt to enforce JSON output (or use tool use with input_schema for guaranteed structure):
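The tool-use route looks roughly like this. A sketch, not a complete integration: the tool name `record_classification` and its schema are hypothetical, and `tool_choice` forces the model to answer via the tool, so the result arrives as already-parsed tool arguments instead of free text.

```typescript
// Hypothetical tool whose input_schema mirrors the desired output structure
const classifierTool = {
  name: 'record_classification',
  description: 'Record the classification of a financial transaction',
  input_schema: {
    type: 'object' as const,
    properties: {
      category: {
        type: 'string',
        enum: ['retail', 'food', 'transport', 'utilities', 'other'],
      },
      risk_score: { type: 'number' },
    },
    required: ['category', 'risk_score'],
  },
};

const params = {
  model: 'claude-opus-4-6',
  max_tokens: 1024,
  tools: [classifierTool],
  // Force the model to call this specific tool
  tool_choice: { type: 'tool' as const, name: 'record_classification' },
  messages: [
    { role: 'user' as const, content: 'Classify: COFFEE HOUSE BERLIN, €4.50' },
  ],
};

// Pass params to client.messages.create(), then find the tool_use content
// block: its .input property is an object, not text. Still validate it
// with Zod before trusting it.
```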
```typescript
// Use system prompt to enforce JSON-only responses
const response = await client.messages.create({
  model: 'claude-opus-4-6',
  max_tokens: 1024,
  system:
    'You are a financial transaction classifier. Always respond with valid JSON only, no markdown, no explanation outside the JSON structure.',
  messages: [
    {
      role: 'user',
      content: `Classify: ${description}`,
    },
  ],
});
```

### Retry Logic for Schema Failures
```typescript
async function analyzeWithRetry(
  description: string,
  amount: number,
  merchant: string,
  maxRetries = 3
): Promise<TransactionAnalysis> {
  let lastError: Error;

  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await analyzeTransaction(description, amount, merchant);
    } catch (err) {
      lastError = err as Error;

      if (err instanceof z.ZodError) {
        console.warn(`Attempt ${attempt}: Schema validation failed:`, err.errors);
        // The model returned JSON but wrong structure — retry
      } else if (err instanceof SyntaxError) {
        console.warn(`Attempt ${attempt}: Invalid JSON:`, err.message);
        // Model returned non-JSON — retry
      } else {
        throw err; // Non-recoverable error
      }
    }
  }

  throw lastError!;
}
```

## Pattern 2: Tool Calling (Function Calling)
Tool calling lets you give the LLM “tools” it can invoke — the model decides which tool to call with which arguments based on the user’s request. This is the core primitive for agentic systems.
```typescript
// Define available tools
const tools: Anthropic.Tool[] = [
  {
    name: 'get_account_balance',
    description: 'Get the current balance and recent transactions for a bank account',
    input_schema: {
      type: 'object' as const,
      properties: {
        account_id: {
          type: 'string',
          description: 'The account identifier',
        },
        include_transactions: {
          type: 'boolean',
          description: 'Whether to include recent transactions (default: false)',
        },
      },
      required: ['account_id'],
    },
  },
  {
    name: 'flag_transaction',
    description: 'Flag a transaction for manual review',
    input_schema: {
      type: 'object' as const,
      properties: {
        transaction_id: { type: 'string' },
        reason: { type: 'string' },
        severity: {
          type: 'string',
          enum: ['low', 'medium', 'high'],
        },
      },
      required: ['transaction_id', 'reason', 'severity'],
    },
  },
];

// Tool implementation (`db` stands in for your application's data layer)
async function executeTool(
  name: string,
  input: Record<string, unknown>
): Promise<string> {
  switch (name) {
    case 'get_account_balance': {
      const balance = await db.getAccountBalance(input.account_id as string);
      return JSON.stringify(balance);
    }

    case 'flag_transaction': {
      await db.flagTransaction({
        id: input.transaction_id as string,
        reason: input.reason as string,
        severity: input.severity as string,
      });
      return JSON.stringify({ success: true });
    }

    default:
      throw new Error(`Unknown tool: ${name}`);
  }
}

// Agentic loop — model calls tools until it has enough info to answer
async function runAgent(userMessage: string): Promise<string> {
  const messages: Anthropic.MessageParam[] = [
    { role: 'user', content: userMessage },
  ];

  while (true) {
    const response = await client.messages.create({
      model: 'claude-opus-4-6',
      max_tokens: 4096,
      tools,
      messages,
    });

    // Check if model wants to call tools
    if (response.stop_reason === 'tool_use') {
      // Add assistant's response to conversation
      messages.push({ role: 'assistant', content: response.content });

      // Execute all tool calls and collect results
      const toolResults: Anthropic.ToolResultBlockParam[] = [];

      for (const block of response.content) {
        if (block.type === 'tool_use') {
          console.log(`Calling tool: ${block.name}`, block.input);

          try {
            const result = await executeTool(
              block.name,
              block.input as Record<string, unknown>
            );

            toolResults.push({
              type: 'tool_result',
              tool_use_id: block.id,
              content: result,
            });
          } catch (err) {
            toolResults.push({
              type: 'tool_result',
              tool_use_id: block.id,
              content: `Error: ${(err as Error).message}`,
              is_error: true,
            });
          }
        }
      }

      // Add tool results and continue the loop
      messages.push({ role: 'user', content: toolResults });
    } else if (response.stop_reason === 'end_turn') {
      // Model is done — extract final text response
      const textBlock = response.content.find(b => b.type === 'text');
      return textBlock?.type === 'text' ? textBlock.text : '';
    } else {
      throw new Error(`Unexpected stop reason: ${response.stop_reason}`);
    }
  }
}
```

## Pattern 3: Streaming for Responsive UIs
LLM responses take 2-30 seconds for long outputs. Streaming lets you show the response as it’s generated — dramatically better UX.
```typescript
// Server: stream LLM response to HTTP client
import express, { Request, Response } from 'express';

const app = express();
app.use(express.json());

app.post('/analyze', async (req: Request, res: Response) => {
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

  try {
    const stream = client.messages.stream({
      model: 'claude-opus-4-6',
      max_tokens: 2048,
      messages: [
        {
          role: 'user',
          content: req.body.prompt,
        },
      ],
    });

    for await (const event of stream) {
      if (event.type === 'content_block_delta' && event.delta.type === 'text_delta') {
        // Send each text chunk as SSE
        res.write(`data: ${JSON.stringify({ text: event.delta.text })}\n\n`);
      }
    }

    // Signal completion
    res.write('data: [DONE]\n\n');
    res.end();
  } catch (err) {
    res.write(`data: ${JSON.stringify({ error: (err as Error).message })}\n\n`);
    res.end();
  }
});
```

```typescript
// Client: consume SSE stream
async function streamAnalysis(prompt: string, onChunk: (text: string) => void) {
  const response = await fetch('/analyze', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt }),
  });

  const reader = response.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    // Buffer across reads: an SSE event can be split between network chunks
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop() ?? ''; // Keep any partial line for the next read

    for (const line of lines) {
      if (!line.startsWith('data: ')) continue;
      const data = line.slice(6);
      if (data === '[DONE]') return;

      try {
        const { text } = JSON.parse(data);
        if (text) onChunk(text);
      } catch {
        // Skip malformed chunks
      }
    }
  }
}

// Usage
streamAnalysis(
  'Analyze the risk in my portfolio...',
  (text) => {
    document.getElementById('output')!.textContent += text;
  }
);
```

## Pattern 4: Cost and Rate Limit Management
LLM API costs are per-token. At scale, this matters:
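A back-of-envelope sketch makes the stakes concrete. The per-token prices are Opus-class rates; the traffic volume and token counts are illustrative numbers, not measurements:

```typescript
// Illustrative: 10,000 requests/day, ~2,000 input + 500 output tokens each,
// at $15/M input tokens and $75/M output tokens.
const INPUT_PER_M = 15.0;
const OUTPUT_PER_M = 75.0;

const perRequest =
  (2_000 / 1_000_000) * INPUT_PER_M +  // $0.03 input
  (500 / 1_000_000) * OUTPUT_PER_M;    // $0.0375 output

const perDay = perRequest * 10_000;    // $675/day
```

At that run rate a 10% cache hit rate or a shorter prompt is real money, which is why tracking belongs in the call path.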
```typescript
// Track token usage
class LLMUsageTracker {
  private usage = { inputTokens: 0, outputTokens: 0, requests: 0 };

  async create(params: Anthropic.MessageCreateParams): Promise<Anthropic.Message> {
    const response = await client.messages.create(params);

    this.usage.inputTokens += response.usage.input_tokens;
    this.usage.outputTokens += response.usage.output_tokens;
    this.usage.requests++;

    // Alert if costs exceed threshold (alertOpsTeam is your notification hook)
    const estimatedCost = this.estimateCost();
    if (estimatedCost > 100) { // $100 threshold
      await alertOpsTeam(`LLM cost exceeded $100: ${estimatedCost.toFixed(2)}`);
    }

    return response;
  }

  estimateCost(): number {
    // Hardcoded for Opus — use a pricing map keyed by model for multi-model systems
    // (see the multi-agent workflows post for that pattern)
    const inputCost = (this.usage.inputTokens / 1_000_000) * 15.0; // $15/M tokens
    const outputCost = (this.usage.outputTokens / 1_000_000) * 75.0; // $75/M tokens
    return inputCost + outputCost;
  }

  getStats() {
    return { ...this.usage, estimatedCostUSD: this.estimateCost() };
  }
}

// Rate limiting
import { RateLimiter } from 'limiter';

const rateLimiter = new RateLimiter({
  tokensPerInterval: 50, // 50 requests
  interval: 'minute', // per minute (check your tier)
});

async function rateLimitedCreate(params: Anthropic.MessageCreateParams) {
  await rateLimiter.removeTokens(1);
  return client.messages.create(params);
}
```

## Error Handling: LLM-Specific Failures
```typescript
async function robustLLMCall(params: Anthropic.MessageCreateParams): Promise<Anthropic.Message> {
  const maxRetries = 3;
  let delay = 1000; // Start with 1 second

  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await client.messages.create(params);
    } catch (err) {
      if (err instanceof Anthropic.RateLimitError) {
        // Rate limit — wait and retry with exponential backoff
        console.warn(`Rate limited. Waiting ${delay}ms before retry ${attempt}/${maxRetries}`);
        await sleep(delay);
        delay *= 2; // Exponential backoff
      } else if (err instanceof Anthropic.APIError && err.status === 529) {
        // Overloaded — longer wait (must check before >= 500 catch-all)
        await sleep(30_000);
      } else if (err instanceof Anthropic.APIError && (err.status ?? 0) >= 500) {
        // Server error — retry
        console.warn(`API server error ${err.status}. Retry ${attempt}/${maxRetries}`);
        await sleep(delay);
        delay *= 2;
      } else {
        // Non-retryable (400, 401, 403) — throw immediately
        throw err;
      }
    }
  }

  throw new Error(`LLM call failed after ${maxRetries} retries`);
}

const sleep = (ms: number) => new Promise(resolve => setTimeout(resolve, ms));
```

## Caching LLM Responses
Identical prompts produce similar (not identical) responses. Caching is worth it for:
- Read-heavy content analysis (same document analyzed by many users)
- Lookup-style queries with a small space of inputs
```typescript
import { createHash } from 'crypto';

const responseCache = new Map<string, { response: Anthropic.Message; cachedAt: number }>();
const CACHE_TTL_MS = 60 * 60 * 1000; // 1 hour

function getCacheKey(params: Anthropic.MessageCreateParams): string {
  // Deterministic key from model + messages + temperature
  const keyData = JSON.stringify({
    model: params.model,
    messages: params.messages,
    temperature: params.temperature ?? 1,
  });
  return createHash('sha256').update(keyData).digest('hex');
}

async function cachedCreate(params: Anthropic.MessageCreateParams): Promise<Anthropic.Message> {
  const key = getCacheKey(params);
  const cached = responseCache.get(key);

  if (cached && (Date.now() - cached.cachedAt) < CACHE_TTL_MS) {
    return cached.response;
  }

  const response = await client.messages.create(params);
  responseCache.set(key, { response, cachedAt: Date.now() });
  return response;
}
```

These four patterns — structured outputs with validation, tool calling for agentic behavior, streaming for UX, and robust error/cost handling — cover 90% of what production LLM integrations need. The rest is domain-specific prompt engineering, which is a separate discipline.
The most important shift in mindset: treat LLM calls like expensive I/O, not function calls. Design for latency (stream), design for failure (retry + fallback), design for cost (cache + track), and always validate outputs against a schema before trusting them.
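One way to make that mindset concrete is to compose the patterns as generic decorators around a single call function. This is a sketch of my own, not an SDK feature; the wrapper names and signatures are illustrative:

```typescript
type Call<P, R> = (params: P) => Promise<R>;

// Design for failure: retry with exponential backoff
function withRetry<P, R>(fn: Call<P, R>, maxRetries = 3, baseDelayMs = 1000): Call<P, R> {
  return async (params) => {
    let lastErr: unknown;
    for (let attempt = 0; attempt < maxRetries; attempt++) {
      try {
        return await fn(params);
      } catch (err) {
        lastErr = err;
        await new Promise(r => setTimeout(r, baseDelayMs * 2 ** attempt));
      }
    }
    throw lastErr;
  };
}

// Design for cost: in-memory cache keyed by a caller-supplied key function
function withCache<P, R>(fn: Call<P, R>, keyOf: (p: P) => string): Call<P, R> {
  const cache = new Map<string, R>();
  return async (params) => {
    const key = keyOf(params);
    const hit = cache.get(key);
    if (hit !== undefined) return hit;
    const result = await fn(params);
    cache.set(key, result);
    return result;
  };
}

// Stack them with cache outermost, so a hit skips the retries entirely:
// const hardened = withCache(withRetry(rawCall), p => JSON.stringify(p));
```

Each wrapper is independent of the SDK, which also makes the policies unit-testable against a fake call function.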