Building Production-Ready Generative AI Applications
#ai #llm #openai #claude #langchain
After shipping several generative AI applications to production, I've learned that there's a significant gap between a proof of concept and a robust, scalable AI system. Here's what has made the difference.
The Architecture That Works
For most production AI applications, I've found this architecture to be reliable:
// app/lib/ai/chat-service.ts
import Anthropic from '@anthropic-ai/sdk';
import { createOpenAI } from '@ai-sdk/openai';
import { generateText } from 'ai';
interface ChatMessage {
role: 'user' | 'assistant' | 'system';
content: string;
}
export class ChatService {
private anthropic: Anthropic;
private openai: ReturnType<typeof createOpenAI>;
constructor() {
this.anthropic = new Anthropic({
apiKey: process.env.ANTHROPIC_API_KEY,
});
this.openai = createOpenAI({
apiKey: process.env.OPENAI_API_KEY,
});
}
async chat(messages: ChatMessage[], provider: 'anthropic' | 'openai' = 'anthropic') {
if (provider === 'anthropic') {
return this.chatWithClaude(messages);
}
return this.chatWithGPT(messages);
}
private async chatWithClaude(messages: ChatMessage[]) {
  // Anthropic takes the system prompt as a separate parameter, not as a message role
  const system = messages.find(msg => msg.role === 'system')?.content;
  const response = await this.anthropic.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 4096,
    system,
    messages: messages
      .filter(msg => msg.role !== 'system')
      .map(msg => ({
        role: msg.role as 'user' | 'assistant',
        content: msg.content,
      })),
  });
  const firstBlock = response.content[0];
  return firstBlock.type === 'text' ? firstBlock.text : '';
}
private async chatWithGPT(messages: ChatMessage[]) {
  // The AI SDK provider returns a model handle; generateText performs the actual call
  const { text } = await generateText({
    model: this.openai('gpt-4o'),
    messages,
  });
  return text;
}
}
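A quick usage sketch (the shared export and the example prompt are illustrative; later snippets assume a chatService instance like this one exists):
// Shared instance used by the later snippets in this post
export const chatService = new ChatService();

const reply = await chatService.chat([
  { role: 'user', content: 'Summarize the steps to reset a password.' },
]);
console.log(reply);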
Key Lessons Learned
1. Always Implement Fallbacks
LLM APIs can fail. Always have a fallback strategy:
async function robustChat(messages: ChatMessage[]) {
try {
// Try primary provider
return await chatService.chat(messages, 'anthropic');
} catch (error) {
console.error('Anthropic failed, falling back to OpenAI:', error);
try {
// Fallback to secondary provider
return await chatService.chat(messages, 'openai');
} catch (fallbackError) {
console.error('All providers failed:', fallbackError);
// Return graceful error message
return "I'm having trouble connecting right now. Please try again.";
}
}
}
2. Implement Proper Streaming
Users expect real-time responses. Streaming is essential:
import { streamText } from 'ai';
import { openai } from '@ai-sdk/openai';
export async function POST(request: Request) {
const { messages } = await request.json();
const result = streamText({
model: openai('gpt-4o'),
messages,
onFinish: async ({ text, usage }) => {
// Log usage for cost tracking
await logUsage({
  model: 'gpt-4o',
  inputTokens: usage.promptTokens,
  outputTokens: usage.completionTokens,
  // Map the AI SDK's usage fields onto our cost model
  cost: calculateCost(
    { inputTokens: usage.promptTokens, outputTokens: usage.completionTokens },
    'gpt-4o'
  ),
  timestamp: new Date(),
});
},
});
return result.toDataStreamResponse();
}
3. Cost Management is Critical
Track every request:
interface UsageLog {
timestamp: Date;
model: string;
inputTokens: number;
outputTokens: number;
cost: number;
userId: string;
}
function calculateCost(
  usage: { inputTokens: number; outputTokens: number },
  model: string
) {
  const pricing: Record<string, { input: number; output: number }> = {
    'gpt-4o': { input: 0.005, output: 0.015 }, // USD per 1K tokens
    'claude-3-5-sonnet': { input: 0.003, output: 0.015 },
  };
  const rates = pricing[model];
  if (!rates) return 0; // unknown model: don't guess at pricing
  return (
    (usage.inputTokens / 1000) * rates.input +
    (usage.outputTokens / 1000) * rates.output
  );
}
4. Implement Rate Limiting
Protect your API budget:
import { Ratelimit } from '@upstash/ratelimit';
import { Redis } from '@upstash/redis';
import { NextResponse, type NextRequest } from 'next/server';
const ratelimit = new Ratelimit({
redis: Redis.fromEnv(),
limiter: Ratelimit.slidingWindow(10, '1 m'), // 10 requests per minute
analytics: true,
});
export async function middleware(request: NextRequest) {
const ip = request.headers.get('x-forwarded-for') ?? 'anonymous';
const { success } = await ratelimit.limit(ip);
if (!success) {
return new Response('Rate limit exceeded', { status: 429 });
}
return NextResponse.next();
}
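In Next.js, I also scope the middleware so the rate limiter only runs on the AI endpoints (the path pattern below is illustrative):
// Only run this middleware for the AI route(s)
export const config = {
  matcher: '/api/chat/:path*',
};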
Prompt Engineering Best Practices
System Prompts Matter
const SYSTEM_PROMPT = `You are a helpful AI assistant for a customer support platform.
Guidelines:
- Be concise and professional
- Always verify information before stating facts
- If you don't know something, admit it
- Never make up product details or pricing
- Always prioritize user safety and privacy
Available tools:
- search_knowledge_base(query: string): Search internal documentation
- create_ticket(title: string, description: string): Create support ticket
- check_order_status(orderId: string): Look up order information`;
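Wiring it in is just a matter of prepending it as the system message on every request (using the ChatService from earlier; the function name is illustrative):
// Every support conversation starts from the same guidelines
async function supportChat(userMessages: ChatMessage[]) {
  return chatService.chat([
    { role: 'system', content: SYSTEM_PROMPT },
    ...userMessages,
  ]);
}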
Use Structured Outputs
import { z } from 'zod';
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
const ResponseSchema = z.object({
sentiment: z.enum(['positive', 'neutral', 'negative']),
category: z.enum(['technical', 'billing', 'general']),
urgency: z.enum(['low', 'medium', 'high']),
response: z.string(),
});
// generateObject (from the AI SDK) constrains the model to the schema
// and returns an already-validated object
const { object: parsed } = await generateObject({
  model: openai('gpt-4o'),
  schema: ResponseSchema,
  messages,
});
Error Handling Strategies
class AIError extends Error {
constructor(
message: string,
public readonly code: 'RATE_LIMIT' | 'INVALID_REQUEST' | 'API_ERROR',
public readonly retryable: boolean
) {
super(message);
this.name = 'AIError';
}
}
async function handleAIRequest(messages: ChatMessage[]) {
let attempts = 0;
const maxAttempts = 3;
while (attempts < maxAttempts) {
try {
return await chatService.chat(messages);
} catch (error: any) {
attempts++;
if (error.status === 429) {
// Rate limit - exponential backoff
await new Promise(resolve =>
setTimeout(resolve, Math.pow(2, attempts) * 1000)
);
continue;
}
if (error.status >= 500) {
// Server error - retry
if (attempts < maxAttempts) continue;
}
// Client error or max attempts reached
throw new AIError(
error.message,
error.status === 429 ? 'RATE_LIMIT' : 'API_ERROR',
error.status === 429 || error.status >= 500
);
}
  }
  // All attempts were consumed by retryable errors (e.g. repeated rate limits)
  throw new AIError('Max retry attempts exceeded', 'RATE_LIMIT', true);
}
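Callers can then branch on the typed error instead of string-matching messages; a small sketch (the recovery messages are illustrative):
async function answerWithRecovery(messages: ChatMessage[]) {
  try {
    return await handleAIRequest(messages);
  } catch (error) {
    if (error instanceof AIError && error.retryable) {
      // Transient failure (rate limit or 5xx): worth re-queueing or asking the user to retry
      return 'The assistant is overloaded right now. Please try again in a moment.';
    }
    // Non-retryable: log it and fail gracefully
    return "I wasn't able to process that request.";
  }
}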
Performance Optimization
Caching Responses
import { Redis } from '@upstash/redis';
import { createHash } from 'crypto';

const redis = Redis.fromEnv();

// Stable cache key for a given prompt
function hashPrompt(prompt: string) {
  return createHash('sha256').update(prompt).digest('hex');
}
async function getCachedOrGenerate(prompt: string) {
const cacheKey = `ai:response:${hashPrompt(prompt)}`;
// Check cache
const cached = await redis.get(cacheKey);
if (cached) {
return cached;
}
// Generate new response
const response = await chatService.chat([
{ role: 'user', content: prompt }
]);
// Cache for 1 hour
await redis.set(cacheKey, response, { ex: 3600 });
return response;
}
Parallel Processing
async function processMultipleQueries(queries: string[]) {
// Fire all queries in parallel (no concurrency cap here; see the batched version below)
const results = await Promise.all(
queries.map(query =>
chatService.chat([{ role: 'user', content: query }])
)
);
return results;
}
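Promise.all alone doesn't cap concurrency, so for larger batches I process the queries in chunks; the chunk size of 5 is arbitrary:
// Keep no more than `limit` requests in flight at once
async function processInBatches(queries: string[], limit = 5) {
  const results: string[] = [];
  for (let i = 0; i < queries.length; i += limit) {
    const chunk = queries.slice(i, i + limit);
    const chunkResults = await Promise.all(
      chunk.map(query => chatService.chat([{ role: 'user', content: query }]))
    );
    results.push(...chunkResults);
  }
  return results;
}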
Monitoring and Observability
Always track:
- Latency: How long does each request take?
- Token usage: Input vs output tokens
- Cost per request: Real-time cost tracking
- Error rates: By provider and error type
- User satisfaction: Thumbs up/down feedback
import { trace } from '@opentelemetry/api';
const tracer = trace.getTracer('ai-service');
async function trackedChat(messages: ChatMessage[]) {
return tracer.startActiveSpan('ai.chat', async (span) => {
const startTime = Date.now();
try {
const response = await chatService.chat(messages);
span.setAttributes({
'ai.model': 'claude-3-5-sonnet',
'ai.input_tokens': calculateTokens(messages),
'ai.latency_ms': Date.now() - startTime,
});
return response;
} catch (error) {
span.recordException(error);
throw error;
} finally {
span.end();
}
});
}
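For the user-satisfaction signal, a simple thumbs up/down endpoint is usually enough; a minimal sketch, again assuming Upstash Redis (the route path and key names are illustrative):
// app/api/feedback/route.ts (illustrative path)
import { Redis } from '@upstash/redis';

const redis = Redis.fromEnv();

export async function POST(request: Request) {
  const { messageId, rating } = await request.json(); // rating: 'up' | 'down'
  // Tally daily feedback so it can be charted next to cost, latency, and error rate
  const day = new Date().toISOString().slice(0, 10);
  await redis.hincrby(`feedback:${day}`, rating === 'up' ? 'up' : 'down', 1);
  await redis.set(`feedback:message:${messageId}`, rating);
  return Response.json({ ok: true });
}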
Conclusion
Building production AI applications requires much more than just calling an API. Focus on:
- Reliability: Fallbacks, retries, error handling
- Performance: Streaming, caching, parallel processing
- Cost management: Rate limiting, usage tracking
- User experience: Fast responses, graceful errors
- Observability: Comprehensive monitoring
The AI landscape is evolving rapidly, but these patterns have proven robust across multiple production deployments.