LLM APIs are not like REST APIs. A REST API is deterministic — given the same inputs, you get the same outputs. An LLM is probabilistic — outputs vary, latency varies (2-30 seconds is typical), and failure modes are different from HTTP errors. Writing production code that integrates with LLMs requires adapting patterns from async I/O, streaming, and probabilistic systems.
This post covers the four patterns I use in every production LLM integration, with TypeScript code using the Anthropic Claude API.
## Setup
```bash
npm install @anthropic-ai/sdk zod
```

```typescript
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY!,
});
```

## Pattern 1: Structured Outputs
The fundamental challenge: LLMs return text. Your application needs structured data. Naive approach: ask the model to “respond in JSON”. This fails ~10% of the time due to:
- Trailing commas
- Unescaped quotes in strings
- Model reasoning before the JSON block
- Markdown code fences around the JSON
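The last two failure modes can be absorbed before parsing with a tolerant extraction step. A minimal sketch (the `extractJson` helper name is mine, not from any library):

```typescript
// Strip markdown fences and any reasoning preamble, then parse.
// Handles fenced responses ("Here's the JSON: ...") and bare "{...}" responses.
function extractJson(text: string): unknown {
  const match =
    text.match(/```(?:json)?\s*([\s\S]*?)```/) || // fenced code block
    text.match(/(\{[\s\S]*\})/);                  // first "{" to last "}"
  if (!match) throw new Error(`No JSON found in: ${text.slice(0, 100)}`);
  return JSON.parse(match[1]);
}
```

Trailing commas and unescaped quotes will still fail `JSON.parse`; those cases need a retry rather than smarter extraction.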
The robust approach: combine prompt engineering with schema validation.
```typescript
import { z } from 'zod';

// Define your expected output schema
const TransactionAnalysisSchema = z.object({
  category: z.enum(['retail', 'food', 'transport', 'utilities', 'other']),
  risk_score: z.number().min(0).max(1),
  is_suspicious: z.boolean(),
  reasoning: z.string(),
  suggested_tags: z.array(z.string()).max(5),
});

type TransactionAnalysis = z.infer<typeof TransactionAnalysisSchema>;

async function analyzeTransaction(
  description: string,
  amount: number,
  merchant: string
): Promise<TransactionAnalysis> {
  const response = await client.messages.create({
    model: 'claude-opus-4-6',
    max_tokens: 1024,
    messages: [
      {
        role: 'user',
        content: `Analyze this financial transaction and respond with ONLY a JSON object matching this exact schema:

Schema:
\`\`\`json
{
  "category": "retail" | "food" | "transport" | "utilities" | "other",
  "risk_score": number between 0 and 1,
  "is_suspicious": boolean,
  "reasoning": "string explaining the analysis",
  "suggested_tags": ["array", "of", "tags"]
}
\`\`\`

Transaction:
- Description: ${description}
- Amount: €${amount}
- Merchant: ${merchant}

Respond with only the JSON object, no additional text.`,
      },
    ],
  });

  const text =
    response.content[0].type === 'text' ? response.content[0].text : '';

  // Extract JSON from response (handle markdown code fences)
  const jsonMatch =
    text.match(/```(?:json)?\s*([\s\S]*?)```/) || text.match(/(\{[\s\S]*\})/);

  if (!jsonMatch) {
    throw new Error(`No JSON found in LLM response: ${text}`);
  }

  const parsed = JSON.parse(jsonMatch[1]);

  // Validate against schema — throws if invalid
  return TransactionAnalysisSchema.parse(parsed);
}
```

### Structured JSON via System Prompts
Claude doesn’t have a native “JSON mode” toggle like some providers. The reliable approach is to use the system prompt to enforce JSON output (or use tool use with input_schema for guaranteed structure):
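The tool-use route looks roughly like this. A sketch, not a complete integration: the tool name `record_classification` and its schema are hypothetical, and `tool_choice` forces the model to answer via the tool, so the result arrives as already-parsed tool arguments instead of free text.

```typescript
// Hypothetical tool whose input_schema mirrors the desired output structure
const classifierTool = {
  name: 'record_classification',
  description: 'Record the classification of a financial transaction',
  input_schema: {
    type: 'object' as const,
    properties: {
      category: {
        type: 'string',
        enum: ['retail', 'food', 'transport', 'utilities', 'other'],
      },
      risk_score: { type: 'number' },
    },
    required: ['category', 'risk_score'],
  },
};

const params = {
  model: 'claude-opus-4-6',
  max_tokens: 1024,
  tools: [classifierTool],
  // Force the model to call this specific tool
  tool_choice: { type: 'tool' as const, name: 'record_classification' },
  messages: [
    { role: 'user' as const, content: 'Classify: COFFEE HOUSE BERLIN, €4.50' },
  ],
};

// Pass params to client.messages.create(), then find the tool_use content
// block: its .input property is an object, not text. Still validate it
// with Zod before trusting it.
```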
```typescript
// Use system prompt to enforce JSON-only responses
const response = await client.messages.create({
  model: 'claude-opus-4-6',
  max_tokens: 1024,
  system:
    'You are a financial transaction classifier. Always respond with valid JSON only, no markdown, no explanation outside the JSON structure.',
  messages: [
    {
      role: 'user',
      content: `Classify: ${description}`,
    },
  ],
});
```

### Retry Logic for Schema Failures
```typescript
async function analyzeWithRetry(
  description: string,
  amount: number,
  merchant: string,
  maxRetries = 3
): Promise<TransactionAnalysis> {
  let lastError: Error;

  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await analyzeTransaction(description, amount, merchant);
    } catch (err) {
      lastError = err as Error;

      if (err instanceof z.ZodError) {
        console.warn(`Attempt ${attempt}: Schema validation failed:`, err.errors);
        // The model returned JSON but wrong structure — retry
      } else if (err instanceof SyntaxError) {
        console.warn(`Attempt ${attempt}: Invalid JSON:`, err.message);
        // Model returned non-JSON — retry
      } else {
        throw err; // Non-recoverable error
      }
    }
  }

  throw lastError!;
}
```

## Pattern 2: Tool Calling (Function Calling)
Tool calling lets you give the LLM “tools” it can invoke — the model decides which tool to call with which arguments based on the user’s request. This is the core primitive for agentic systems.
```typescript
// Define available tools
const tools: Anthropic.Tool[] = [
  {
    name: 'get_account_balance',
    description: 'Get the current balance and recent transactions for a bank account',
    input_schema: {
      type: 'object' as const,
      properties: {
        account_id: {
          type: 'string',
          description: 'The account identifier',
        },
        include_transactions: {
          type: 'boolean',
          description: 'Whether to include recent transactions (default: false)',
        },
      },
      required: ['account_id'],
    },
  },
  {
    name: 'flag_transaction',
    description: 'Flag a transaction for manual review',
    input_schema: {
      type: 'object' as const,
      properties: {
        transaction_id: { type: 'string' },
        reason: { type: 'string' },
        severity: {
          type: 'string',
          enum: ['low', 'medium', 'high'],
        },
      },
      required: ['transaction_id', 'reason', 'severity'],
    },
  },
];

// Tool implementation (`db` stands in for your application's data layer)
async function executeTool(
  name: string,
  input: Record<string, unknown>
): Promise<string> {
  switch (name) {
    case 'get_account_balance': {
      const balance = await db.getAccountBalance(input.account_id as string);
      return JSON.stringify(balance);
    }

    case 'flag_transaction': {
      await db.flagTransaction({
        id: input.transaction_id as string,
        reason: input.reason as string,
        severity: input.severity as string,
      });
      return JSON.stringify({ success: true });
    }

    default:
      throw new Error(`Unknown tool: ${name}`);
  }
}

// Agentic loop — model calls tools until it has enough info to answer
async function runAgent(userMessage: string): Promise<string> {
  const messages: Anthropic.MessageParam[] = [
    { role: 'user', content: userMessage },
  ];

  while (true) {
    const response = await client.messages.create({
      model: 'claude-opus-4-6',
      max_tokens: 4096,
      tools,
      messages,
    });

    // Check if model wants to call tools
    if (response.stop_reason === 'tool_use') {
      // Add assistant's response to conversation
      messages.push({ role: 'assistant', content: response.content });

      // Execute all tool calls and collect results
      const toolResults: Anthropic.ToolResultBlockParam[] = [];

      for (const block of response.content) {
        if (block.type === 'tool_use') {
          console.log(`Calling tool: ${block.name}`, block.input);

          try {
            const result = await executeTool(
              block.name,
              block.input as Record<string, unknown>
            );

            toolResults.push({
              type: 'tool_result',
              tool_use_id: block.id,
              content: result,
            });
          } catch (err) {
            toolResults.push({
              type: 'tool_result',
              tool_use_id: block.id,
              content: `Error: ${(err as Error).message}`,
              is_error: true,
            });
          }
        }
      }

      // Add tool results and continue the loop
      messages.push({ role: 'user', content: toolResults });
    } else if (response.stop_reason === 'end_turn') {
      // Model is done — extract final text response
      const textBlock = response.content.find(b => b.type === 'text');
      return textBlock?.type === 'text' ? textBlock.text : '';
    } else {
      throw new Error(`Unexpected stop reason: ${response.stop_reason}`);
    }
  }
}
```

## Pattern 3: Streaming for Responsive UIs
LLM responses take 2-30 seconds for long outputs. Streaming lets you show the response as it’s generated — dramatically better UX.
```typescript
// Server: stream LLM response to HTTP client
import express, { Request, Response } from 'express';

const app = express();
app.use(express.json());

app.post('/analyze', async (req: Request, res: Response) => {
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

  try {
    const stream = client.messages.stream({
      model: 'claude-opus-4-6',
      max_tokens: 2048,
      messages: [
        {
          role: 'user',
          content: req.body.prompt,
        },
      ],
    });

    for await (const event of stream) {
      if (event.type === 'content_block_delta' && event.delta.type === 'text_delta') {
        // Send each text chunk as SSE
        res.write(`data: ${JSON.stringify({ text: event.delta.text })}\n\n`);
      }
    }

    // Signal completion
    res.write('data: [DONE]\n\n');
    res.end();
  } catch (err) {
    res.write(`data: ${JSON.stringify({ error: (err as Error).message })}\n\n`);
    res.end();
  }
});
```

```typescript
// Client: consume SSE stream
async function streamAnalysis(prompt: string, onChunk: (text: string) => void) {
  const response = await fetch('/analyze', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt }),
  });

  const reader = response.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    // Buffer across reads: an SSE event can be split between network chunks
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop() ?? ''; // Keep any partial line for the next read

    for (const line of lines) {
      if (!line.startsWith('data: ')) continue;
      const data = line.slice(6);
      if (data === '[DONE]') return;

      try {
        const { text } = JSON.parse(data);
        if (text) onChunk(text);
      } catch {
        // Skip malformed chunks
      }
    }
  }
}

// Usage
streamAnalysis(
  'Analyze the risk in my portfolio...',
  (text) => {
    document.getElementById('output')!.textContent += text;
  }
);
```

## Pattern 4: Cost and Rate Limit Management
LLM API costs are per-token. At scale, this matters:
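A back-of-envelope sketch makes the stakes concrete. The per-token prices are Opus-class rates; the traffic volume and token counts are illustrative numbers, not measurements:

```typescript
// Illustrative: 10,000 requests/day, ~2,000 input + 500 output tokens each,
// at $15/M input tokens and $75/M output tokens.
const INPUT_PER_M = 15.0;
const OUTPUT_PER_M = 75.0;

const perRequest =
  (2_000 / 1_000_000) * INPUT_PER_M +  // $0.03 input
  (500 / 1_000_000) * OUTPUT_PER_M;    // $0.0375 output

const perDay = perRequest * 10_000;    // $675/day
```

At that run rate a 10% cache hit rate or a shorter prompt is real money, which is why tracking belongs in the call path.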
```typescript
// Track token usage
class LLMUsageTracker {
  private usage = { inputTokens: 0, outputTokens: 0, requests: 0 };

  async create(params: Anthropic.MessageCreateParams): Promise<Anthropic.Message> {
    const response = await client.messages.create(params);

    this.usage.inputTokens += response.usage.input_tokens;
    this.usage.outputTokens += response.usage.output_tokens;
    this.usage.requests++;

    // Alert if costs exceed threshold (alertOpsTeam is your notification hook)
    const estimatedCost = this.estimateCost();
    if (estimatedCost > 100) { // $100 threshold
      await alertOpsTeam(`LLM cost exceeded $100: ${estimatedCost.toFixed(2)}`);
    }

    return response;
  }

  estimateCost(): number {
    // Hardcoded for Opus — use a pricing map keyed by model for multi-model systems
    // (see the multi-agent workflows post for that pattern)
    const inputCost = (this.usage.inputTokens / 1_000_000) * 15.0; // $15/M tokens
    const outputCost = (this.usage.outputTokens / 1_000_000) * 75.0; // $75/M tokens
    return inputCost + outputCost;
  }

  getStats() {
    return { ...this.usage, estimatedCostUSD: this.estimateCost() };
  }
}

// Rate limiting
import { RateLimiter } from 'limiter';

const rateLimiter = new RateLimiter({
  tokensPerInterval: 50, // 50 requests
  interval: 'minute', // per minute (check your tier)
});

async function rateLimitedCreate(params: Anthropic.MessageCreateParams) {
  await rateLimiter.removeTokens(1);
  return client.messages.create(params);
}
```

## Error Handling: LLM-Specific Failures
```typescript
async function robustLLMCall(params: Anthropic.MessageCreateParams): Promise<Anthropic.Message> {
  const maxRetries = 3;
  let delay = 1000; // Start with 1 second

  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await client.messages.create(params);
    } catch (err) {
      if (err instanceof Anthropic.RateLimitError) {
        // Rate limit — wait and retry with exponential backoff
        console.warn(`Rate limited. Waiting ${delay}ms before retry ${attempt}/${maxRetries}`);
        await sleep(delay);
        delay *= 2; // Exponential backoff
      } else if (err instanceof Anthropic.APIError && err.status === 529) {
        // Overloaded — longer wait (must check before >= 500 catch-all)
        await sleep(30_000);
      } else if (err instanceof Anthropic.APIError && (err.status ?? 0) >= 500) {
        // Server error — retry
        console.warn(`API server error ${err.status}. Retry ${attempt}/${maxRetries}`);
        await sleep(delay);
        delay *= 2;
      } else {
        // Non-retryable (400, 401, 403) — throw immediately
        throw err;
      }
    }
  }

  throw new Error(`LLM call failed after ${maxRetries} retries`);
}

const sleep = (ms: number) => new Promise(resolve => setTimeout(resolve, ms));
```

## Caching LLM Responses
Identical prompts produce similar (not identical) responses. Caching is worth it for:
- Read-heavy content analysis (same document analyzed by many users)
- Lookup-style queries with a small space of inputs
```typescript
import { createHash } from 'crypto';

const responseCache = new Map<string, { response: Anthropic.Message; cachedAt: number }>();
const CACHE_TTL_MS = 60 * 60 * 1000; // 1 hour

function getCacheKey(params: Anthropic.MessageCreateParams): string {
  // Deterministic key from model + messages + temperature
  const keyData = JSON.stringify({
    model: params.model,
    messages: params.messages,
    temperature: params.temperature ?? 1,
  });
  return createHash('sha256').update(keyData).digest('hex');
}

async function cachedCreate(params: Anthropic.MessageCreateParams): Promise<Anthropic.Message> {
  const key = getCacheKey(params);
  const cached = responseCache.get(key);

  if (cached && (Date.now() - cached.cachedAt) < CACHE_TTL_MS) {
    return cached.response;
  }

  const response = await client.messages.create(params);
  responseCache.set(key, { response, cachedAt: Date.now() });
  return response;
}
```

These four patterns — structured outputs with validation, tool calling for agentic behavior, streaming for UX, and robust error/cost handling — cover 90% of what production LLM integrations need. The rest is domain-specific prompt engineering, which is a separate discipline.
The most important shift in mindset: treat LLM calls like expensive I/O, not function calls. Design for latency (stream), design for failure (retry + fallback), design for cost (cache + track), and always validate outputs against a schema before trusting them.
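One way to make that mindset concrete is to compose the patterns as generic decorators around a single call function. This is a sketch of my own, not an SDK feature; the wrapper names and signatures are illustrative:

```typescript
type Call<P, R> = (params: P) => Promise<R>;

// Design for failure: retry with exponential backoff
function withRetry<P, R>(fn: Call<P, R>, maxRetries = 3, baseDelayMs = 1000): Call<P, R> {
  return async (params) => {
    let lastErr: unknown;
    for (let attempt = 0; attempt < maxRetries; attempt++) {
      try {
        return await fn(params);
      } catch (err) {
        lastErr = err;
        await new Promise(r => setTimeout(r, baseDelayMs * 2 ** attempt));
      }
    }
    throw lastErr;
  };
}

// Design for cost: in-memory cache keyed by a caller-supplied key function
function withCache<P, R>(fn: Call<P, R>, keyOf: (p: P) => string): Call<P, R> {
  const cache = new Map<string, R>();
  return async (params) => {
    const key = keyOf(params);
    const hit = cache.get(key);
    if (hit !== undefined) return hit;
    const result = await fn(params);
    cache.set(key, result);
    return result;
  };
}

// Stack them with cache outermost, so a hit skips the retries entirely:
// const hardened = withCache(withRetry(rawCall), p => JSON.stringify(p));
```

Each wrapper is independent of the SDK, which also makes the policies unit-testable against a fake call function.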