
LLM API Integration Patterns — Structured Outputs, Function Calling, Streaming

Posted on: July 14, 2025 at 10:00 AM

LLM APIs are not like REST APIs. A REST API is deterministic — given the same inputs, you get the same outputs. An LLM is probabilistic — outputs vary, latency varies (2-30 seconds is typical), and failure modes are different from HTTP errors. Writing production code that integrates with LLMs requires adapting patterns from async I/O, streaming, and probabilistic systems.

This post covers the four patterns I use in every production LLM integration, with TypeScript code using the Anthropic Claude API.


Setup

npm install @anthropic-ai/sdk zod
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY!,
});

Pattern 1: Structured Outputs

The fundamental challenge: LLMs return text, but your application needs structured data. The naive approach is to ask the model to “respond in JSON”. This fails ~10% of the time: the model wraps the JSON in markdown code fences, adds explanatory text before or after it, drifts from the requested field names and types, or emits invalid JSON outright.

The robust approach: combine prompt engineering with schema validation.

import { z } from 'zod';

// Define your expected output schema
const TransactionAnalysisSchema = z.object({
  category: z.enum(['retail', 'food', 'transport', 'utilities', 'other']),
  risk_score: z.number().min(0).max(1),
  is_suspicious: z.boolean(),
  reasoning: z.string(),
  suggested_tags: z.array(z.string()).max(5),
});

type TransactionAnalysis = z.infer<typeof TransactionAnalysisSchema>;

async function analyzeTransaction(
  description: string,
  amount: number,
  merchant: string
): Promise<TransactionAnalysis> {
  const response = await client.messages.create({
    model: 'claude-opus-4-6',
    max_tokens: 1024,
    messages: [
      {
        role: 'user',
        content: `Analyze this financial transaction and respond with ONLY a JSON object matching this exact schema:

Schema:
\`\`\`json
{
  "category": "retail" | "food" | "transport" | "utilities" | "other",
  "risk_score": number between 0 and 1,
  "is_suspicious": boolean,
  "reasoning": "string explaining the analysis",
  "suggested_tags": ["array", "of", "tags"]
}
\`\`\`

Transaction:
- Description: ${description}
- Amount: €${amount}
- Merchant: ${merchant}

Respond with only the JSON object, no additional text.`,
      },
    ],
  });

  const text = response.content[0].type === 'text' ? response.content[0].text : '';

  // Extract JSON from the response (handle markdown code fences)
  const jsonMatch = text.match(/```(?:json)?\s*([\s\S]*?)```/) ||
    text.match(/(\{[\s\S]*\})/);
  if (!jsonMatch) {
    throw new Error(`No JSON found in LLM response: ${text}`);
  }

  const parsed = JSON.parse(jsonMatch[1]);

  // Validate against schema — throws if invalid
  return TransactionAnalysisSchema.parse(parsed);
}

Structured JSON via System Prompts

Claude doesn’t have a native “JSON mode” toggle like some providers. The reliable approach is to use the system prompt to enforce JSON output (or use tool use with input_schema for guaranteed structure):

// Use the system prompt to enforce JSON-only responses
const response = await client.messages.create({
  model: 'claude-opus-4-6',
  max_tokens: 1024,
  system:
    'You are a financial transaction classifier. Always respond with valid JSON only, no markdown, no explanation outside the JSON structure.',
  messages: [
    {
      role: 'user',
      content: `Classify: ${description}`,
    },
  ],
});

Retry Logic for Schema Failures

async function analyzeWithRetry(
  description: string,
  amount: number,
  merchant: string,
  maxRetries = 3
): Promise<TransactionAnalysis> {
  let lastError: Error;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await analyzeTransaction(description, amount, merchant);
    } catch (err) {
      lastError = err as Error;
      if (err instanceof z.ZodError) {
        console.warn(`Attempt ${attempt}: Schema validation failed:`, err.errors);
        // The model returned JSON but the wrong structure — retry
      } else if (err instanceof SyntaxError) {
        console.warn(`Attempt ${attempt}: Invalid JSON:`, err.message);
        // The model returned non-JSON — retry
      } else {
        throw err; // Non-recoverable error
      }
    }
  }
  throw lastError!;
}

Pattern 2: Tool Calling (Function Calling)

Tool calling lets you give the LLM “tools” it can invoke — the model decides which tool to call with which arguments based on the user’s request. This is the core primitive for agentic systems.

// Define the available tools
const tools: Anthropic.Tool[] = [
  {
    name: 'get_account_balance',
    description: 'Get the current balance and recent transactions for a bank account',
    input_schema: {
      type: 'object' as const,
      properties: {
        account_id: {
          type: 'string',
          description: 'The account identifier',
        },
        include_transactions: {
          type: 'boolean',
          description: 'Whether to include recent transactions (default: false)',
        },
      },
      required: ['account_id'],
    },
  },
  {
    name: 'flag_transaction',
    description: 'Flag a transaction for manual review',
    input_schema: {
      type: 'object' as const,
      properties: {
        transaction_id: { type: 'string' },
        reason: { type: 'string' },
        severity: {
          type: 'string',
          enum: ['low', 'medium', 'high'],
        },
      },
      required: ['transaction_id', 'reason', 'severity'],
    },
  },
];

// Tool implementation (`db` here stands in for your own data-access layer)
async function executeTool(
  name: string,
  input: Record<string, unknown>
): Promise<string> {
  switch (name) {
    case 'get_account_balance': {
      const balance = await db.getAccountBalance(input.account_id as string);
      return JSON.stringify(balance);
    }
    case 'flag_transaction':
      await db.flagTransaction({
        id: input.transaction_id as string,
        reason: input.reason as string,
        severity: input.severity as string,
      });
      return JSON.stringify({ success: true });
    default:
      throw new Error(`Unknown tool: ${name}`);
  }
}
// Agentic loop — the model calls tools until it has enough info to answer
async function runAgent(userMessage: string): Promise<string> {
  const messages: Anthropic.MessageParam[] = [
    { role: 'user', content: userMessage },
  ];

  // In production, cap the number of iterations to guard against runaway loops.
  while (true) {
    const response = await client.messages.create({
      model: 'claude-opus-4-6',
      max_tokens: 4096,
      tools,
      messages,
    });

    // Check if the model wants to call tools
    if (response.stop_reason === 'tool_use') {
      // Add the assistant's response to the conversation
      messages.push({ role: 'assistant', content: response.content });

      // Execute all tool calls and collect results
      const toolResults: Anthropic.ToolResultBlockParam[] = [];
      for (const block of response.content) {
        if (block.type === 'tool_use') {
          console.log(`Calling tool: ${block.name}`, block.input);
          try {
            const result = await executeTool(
              block.name,
              block.input as Record<string, unknown>
            );
            toolResults.push({
              type: 'tool_result',
              tool_use_id: block.id,
              content: result,
            });
          } catch (err) {
            toolResults.push({
              type: 'tool_result',
              tool_use_id: block.id,
              content: `Error: ${(err as Error).message}`,
              is_error: true,
            });
          }
        }
      }

      // Add the tool results and continue the loop
      messages.push({ role: 'user', content: toolResults });
    } else if (response.stop_reason === 'end_turn') {
      // The model is done — extract the final text response
      const textBlock = response.content.find(b => b.type === 'text');
      return textBlock?.type === 'text' ? textBlock.text : '';
    } else {
      throw new Error(`Unexpected stop reason: ${response.stop_reason}`);
    }
  }
}

Pattern 3: Streaming for Responsive UIs

LLM responses take 2-30 seconds for long outputs. Streaming lets you show the response as it’s generated — dramatically better UX.

// Server: stream the LLM response to the HTTP client as SSE
import express, { Request, Response } from 'express';

const app = express();
app.use(express.json()); // needed so req.body.prompt is populated

app.post('/analyze', async (req: Request, res: Response) => {
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

  try {
    const stream = client.messages.stream({
      model: 'claude-opus-4-6',
      max_tokens: 2048,
      messages: [
        {
          role: 'user',
          content: req.body.prompt,
        },
      ],
    });

    for await (const event of stream) {
      if (event.type === 'content_block_delta' &&
          event.delta.type === 'text_delta') {
        // Send each text chunk as an SSE event
        res.write(`data: ${JSON.stringify({ text: event.delta.text })}\n\n`);
      }
    }

    // Signal completion
    res.write('data: [DONE]\n\n');
    res.end();
  } catch (err) {
    res.write(`data: ${JSON.stringify({ error: (err as Error).message })}\n\n`);
    res.end();
  }
});
// Client: consume the SSE stream. Note the buffer — a network chunk can end
// mid-line, so incomplete trailing data must be carried into the next read.
async function streamAnalysis(prompt: string, onChunk: (text: string) => void) {
  const response = await fetch('/analyze', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt }),
  });

  const reader = response.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop()!; // keep the last (possibly partial) line for the next read

    for (const line of lines) {
      if (!line.startsWith('data: ')) continue;
      const data = line.slice(6);
      if (data === '[DONE]') return;
      try {
        const { text } = JSON.parse(data);
        if (text) onChunk(text);
      } catch {
        // Skip malformed chunks
      }
    }
  }
}

// Usage
streamAnalysis(
  'Analyze the risk in my portfolio...',
  (text) => {
    document.getElementById('output')!.textContent += text;
  }
);

Pattern 4: Cost and Rate Limit Management

LLM API costs are per-token. At scale, this matters:

// Track token usage
class LLMUsageTracker {
  private usage = { inputTokens: 0, outputTokens: 0, requests: 0 };

  async create(params: Anthropic.MessageCreateParams): Promise<Anthropic.Message> {
    const response = await client.messages.create(params);
    this.usage.inputTokens += response.usage.input_tokens;
    this.usage.outputTokens += response.usage.output_tokens;
    this.usage.requests++;

    // Alert if costs exceed the threshold
    const estimatedCost = this.estimateCost();
    if (estimatedCost > 100) { // $100 threshold
      await alertOpsTeam(`LLM cost exceeded $100: ${estimatedCost.toFixed(2)}`);
    }

    return response;
  }

  estimateCost(): number {
    // Hardcoded for Opus — use a pricing map keyed by model for multi-model systems
    // (see the multi-agent workflows post for that pattern)
    const inputCost = (this.usage.inputTokens / 1_000_000) * 15.0; // $15/M tokens
    const outputCost = (this.usage.outputTokens / 1_000_000) * 75.0; // $75/M tokens
    return inputCost + outputCost;
  }

  getStats() {
    return { ...this.usage, estimatedCostUSD: this.estimateCost() };
  }
}
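The pricing map mentioned in the `estimateCost` comment can be this small. The Opus numbers match the ones used above, but treat every price here as an assumption to verify against the provider's current published pricing:

```typescript
// Illustrative per-million-token prices in USD. Verify before relying on them.
const PRICING: Record<string, { inputPerMTok: number; outputPerMTok: number }> = {
  'claude-opus-4-6': { inputPerMTok: 15.0, outputPerMTok: 75.0 },
};

// Cost estimate keyed by model, so multi-model systems stay accurate.
export function estimateCostFor(
  model: string,
  inputTokens: number,
  outputTokens: number
): number {
  const p = PRICING[model];
  if (!p) throw new Error(`No pricing entry for model: ${model}`);
  return (
    (inputTokens / 1_000_000) * p.inputPerMTok +
    (outputTokens / 1_000_000) * p.outputPerMTok
  );
}
```

Failing loudly on an unknown model is deliberate: silently estimating $0 for a new model is how cost alerts stop firing.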
// Rate limiting
import { RateLimiter } from 'limiter';

const rateLimiter = new RateLimiter({
  tokensPerInterval: 50, // 50 requests
  interval: 'minute',    // per minute (check your tier)
});

async function rateLimitedCreate(params: Anthropic.MessageCreateParams) {
  await rateLimiter.removeTokens(1);
  return client.messages.create(params);
}

Error Handling: LLM-Specific Failures

async function robustLLMCall(params: Anthropic.MessageCreateParams): Promise<Anthropic.Message> {
  const maxRetries = 3;
  let delay = 1000; // Start with 1 second

  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await client.messages.create(params);
    } catch (err) {
      if (err instanceof Anthropic.RateLimitError) {
        // Rate limit — wait and retry with exponential backoff
        console.warn(`Rate limited. Waiting ${delay}ms before retry ${attempt}/${maxRetries}`);
        await sleep(delay);
        delay *= 2; // Exponential backoff
      } else if (err instanceof Anthropic.APIError && err.status === 529) {
        // Overloaded — longer wait (must check before the >= 500 catch-all)
        await sleep(30_000);
      } else if (err instanceof Anthropic.APIError && (err.status ?? 0) >= 500) {
        // Server error — retry
        console.warn(`API server error ${err.status}. Retry ${attempt}/${maxRetries}`);
        await sleep(delay);
        delay *= 2;
      } else {
        // Non-retryable (400, 401, 403) — throw immediately
        throw err;
      }
    }
  }

  throw new Error(`LLM call failed after ${maxRetries} retries`);
}

const sleep = (ms: number) => new Promise(resolve => setTimeout(resolve, ms));

Caching LLM Responses

Identical prompts produce similar (not identical) responses, so a cache trades a little response variety for a large cost saving. It is worth it for repeated identical queries (classification, extraction, FAQ-style lookups) and for development and test runs, where live calls just burn budget:

import { createHash } from 'crypto';

const responseCache = new Map<string, { response: Anthropic.Message; cachedAt: number }>();
const CACHE_TTL_MS = 60 * 60 * 1000; // 1 hour

function getCacheKey(params: Anthropic.MessageCreateParams): string {
  // Deterministic key from model + messages + temperature
  const keyData = JSON.stringify({
    model: params.model,
    messages: params.messages,
    temperature: params.temperature ?? 1,
  });
  return createHash('sha256').update(keyData).digest('hex');
}

async function cachedCreate(params: Anthropic.MessageCreateParams): Promise<Anthropic.Message> {
  const key = getCacheKey(params);
  const cached = responseCache.get(key);
  if (cached && (Date.now() - cached.cachedAt) < CACHE_TTL_MS) {
    return cached.response;
  }

  const response = await client.messages.create(params);
  responseCache.set(key, { response, cachedAt: Date.now() });
  return response;
}

These four patterns — structured outputs with validation, tool calling for agentic behavior, streaming for UX, and robust error/cost handling — cover 90% of what production LLM integrations need. The rest is domain-specific prompt engineering, which is a separate discipline.

The most important shift in mindset: treat LLM calls like expensive I/O, not function calls. Design for latency (stream), design for failure (retry + fallback), design for cost (cache + track), and always validate outputs against a schema before trusting them.
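The fallback half of "retry + fallback" never got its own snippet above; a model-preference fallback can be this small. The strategy takes the actual call as an injected function, so it composes with any of the wrappers in this post (the helper name is illustrative):

```typescript
// Try each model in order of preference; return the first success, throw if all fail.
export async function withFallback<T>(
  models: string[],
  call: (model: string) => Promise<T>
): Promise<T> {
  let lastError: unknown;
  for (const model of models) {
    try {
      return await call(model);
    } catch (err) {
      lastError = err;
      console.warn(`Model ${model} failed, falling back:`, err);
    }
  }
  throw lastError;
}
```

Usage would look like `withFallback(preferredModels, model => robustLLMCall({ ...params, model }))`, so each model in the chain still gets its own retry budget before the next one is tried.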