I've been building more and more tools that integrate with Large Language Models lately. From automating git commits using AI to creating a voice assistant using ChatGPT, I found myself writing the same integration code over and over. Each time I needed robust error handling, retries, and proper connection management. After the third or fourth implementation, I decided to build a proper package that would handle all of this out of the box.

Core Architecture and Design Philosophy

The package is built around a few key principles that I've found essential when working with LLMs in production:

  1. Make integration dead simple
  2. Support multiple LLM providers out of the box
  3. Include production-ready features by default
  4. Provide clear cost visibility
  5. Handle failures gracefully

Here's what a basic implementation looks like:

client, err := llm.NewClient(
    os.Getenv("OPENAI_API_KEY"),
    llm.WithProvider("openai"),
    llm.WithModel("gpt-4"),
    llm.WithTimeout(30 * time.Second),
)
if err != nil {
    log.Fatalf("failed to create client: %v", err)
}

resp, err := client.Chat(context.Background(), &types.ChatRequest{
    Messages: []types.Message{
        {
            Role:    types.RoleUser,
            Content: "What is the capital of France?",
        },
    },
})
if err != nil {
    log.Fatalf("chat request failed: %v", err)
}

Simple on the surface, but there's a lot happening underneath. Let's dive into the key components that make this production-ready.

Connection Management: Beyond Basic HTTP Clients

When building services that interact with LLMs, connection management becomes crucial. Opening a new connection for every request is wasteful and can lead to resource exhaustion. The connection pooling system is built to reuse connections efficiently:

type PoolConfig struct {
    MaxSize       int           // Maximum number of connections
    IdleTimeout   time.Duration // How long to keep idle connections
    CleanupPeriod time.Duration // How often to clean up idle connections
}

The pool manages connections through several key mechanisms:

Connection Lifecycle Management

The pool tracks both active and idle connections, implementing a cleanup routine that runs periodically:

func (p *ConnectionPool) cleanup() {
    ticker := time.NewTicker(p.config.CleanupPeriod)
    defer ticker.Stop()

    for range ticker.C {
        p.mu.Lock()
        now := time.Now()
        // Remove idle connections that have timed out
        // Keep track of active connections
        p.mu.Unlock()
    }
}
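
To make that concrete, here's a minimal, self-contained sketch of what the eviction pass might look like. The internals shown (pooledConn, idle, active, released, lastUsed) are my assumptions for illustration, not the package's actual fields:

// Hypothetical pool internals, for illustration only.
type pooledConn struct {
    conn     *http.Client
    lastUsed time.Time
}

type ConnectionPool struct {
    mu       sync.Mutex
    config   PoolConfig
    idle     []*pooledConn
    active   int
    released chan *pooledConn
}

// evictIdle drops idle connections that have been unused for longer than
// IdleTimeout; the cleanup ticker above would call this on each tick.
func (p *ConnectionPool) evictIdle(now time.Time) {
    p.mu.Lock()
    defer p.mu.Unlock()

    kept := p.idle[:0]
    for _, c := range p.idle {
        if now.Sub(c.lastUsed) < p.config.IdleTimeout {
            kept = append(kept, c)
        }
    }
    p.idle = kept
}

Holding the lock only for the short filtering pass keeps the cleanup routine from blocking requests that need a connection at the same time.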

Smart Connection Distribution

When a client requests a connection, the pool follows a specific hierarchy:

  1. Try to reuse an existing idle connection
  2. Create a new connection if under the max limit
  3. Wait for a connection to become available if at capacity

This prevents both resource wastage and connection starvation.
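
Here's a rough sketch of that acquisition order, reusing the hypothetical fields from the sketch above (idle, active, released); the real pool may well be structured differently:

// Get acquires a connection following the three-step order above.
func (p *ConnectionPool) Get(ctx context.Context) (*pooledConn, error) {
    p.mu.Lock()

    // 1. Reuse an existing idle connection.
    if n := len(p.idle); n > 0 {
        c := p.idle[n-1]
        p.idle = p.idle[:n-1]
        p.active++
        p.mu.Unlock()
        return c, nil
    }

    // 2. Create a new connection if we're under MaxSize.
    if p.active < p.config.MaxSize {
        p.active++
        p.mu.Unlock()
        return &pooledConn{conn: &http.Client{}, lastUsed: time.Now()}, nil
    }
    p.mu.Unlock()

    // 3. At capacity: block until a connection is released, or give up
    //    when the caller's context is cancelled.
    select {
    case c := <-p.released:
        return c, nil
    case <-ctx.Done():
        return nil, ctx.Err()
    }
}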

Robust Error Handling and Retries

LLM APIs can be unreliable. They might rate limit you, have temporary outages, or just be slow to respond. The retry system is designed to handle these cases gracefully:

type RetryConfig struct {
    MaxRetries      int
    InitialInterval time.Duration
    MaxInterval     time.Duration
    Multiplier      float64
}

The retry system implements exponential backoff with jitter to prevent thundering herd problems. Here's how it works:

  1. Initial attempt fails
  2. Wait for InitialInterval
  3. For each subsequent retry:
    • Add random jitter to prevent synchronization
    • Increase wait time by Multiplier
    • Cap at MaxInterval to prevent excessive waits
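
As a concrete sketch of that calculation (my own formulation of the standard exponential-backoff-with-jitter pattern, not code lifted from the package):

// nextBackoff computes the wait before retry attempt n (attempt 1 is the
// first retry): grow by Multiplier, add jitter, cap at MaxInterval.
func nextBackoff(cfg RetryConfig, attempt int) time.Duration {
    wait := float64(cfg.InitialInterval)
    for i := 1; i < attempt; i++ {
        wait *= cfg.Multiplier
    }
    // Up to 20% random jitter so many clients don't retry in lockstep.
    wait += wait * 0.2 * rand.Float64()
    if limit := float64(cfg.MaxInterval); wait > limit {
        wait = limit
    }
    return time.Duration(wait)
}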

This means your application can handle various types of failures:

  • Rate limiting (429 responses)
  • Temporary service outages (5xx responses)
  • Network timeouts
  • Connection reset errors

Cost Tracking and Budget Management

One of the most requested features was cost tracking. If you're building services on top of LLMs, you need to know exactly how much each request costs. The cost tracking system provides:

Per-Request Cost Tracking

type Usage struct {
    PromptTokens     int
    CompletionTokens int
    TotalTokens      int
    Cost             float64
}

func (ct *CostTracker) TrackUsage(provider, model string, usage Usage) error {
    cost := calculateCost(provider, model, usage)
    if cost > ct.config.MaxCostPerRequest {
        return ErrCostLimitExceeded
    }
    // Record the cost and token counts against running totals
    return nil
}

Budget Management

The system allows you to set various budget controls:

  • Per-request cost limits
  • Daily/monthly budget caps
  • Usage alerts at configurable thresholds
  • Cost breakdown by model and provider
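
I won't reproduce the full configuration surface here, but conceptually it maps to something like the struct below. MaxCostPerRequest matches the check shown in TrackUsage above; the other fields and names are illustrative:

// Illustrative only: one shape the budget configuration could take.
type BudgetConfig struct {
    MaxCostPerRequest float64 // reject any single request above this cost
    DailyBudget       float64 // hard cap on total spend per day
    MonthlyBudget     float64 // hard cap on total spend per month
    AlertThreshold    float64 // e.g. 0.8 raises an alert at 80% of a cap
}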

This becomes critical when you're running at scale. I've seen services rack up surprising bills because they didn't have proper cost monitoring in place. With this system, you can:

  1. Monitor costs in real-time
  2. Set hard limits to prevent runaway spending
  3. Get alerts before hitting budget thresholds
  4. Track costs per customer or feature

Streaming Support: Real-time Responses

Modern LLM applications often need streaming for a better user experience, and the package supports it out of the box:

streamChan, err := client.StreamChat(ctx, req)
if err != nil {
    return err
}

for resp := range streamChan {
    if resp.Error != nil {
        return resp.Error
    }
    fmt.Print(resp.Message.Content)
}

The streaming implementation handles several complex cases:

  • Graceful connection termination
  • Partial message handling
  • Error propagation
  • Context cancellation
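
For example, cancelling the context is how a consumer stops a stream mid-response. A sketch of what that looks like on the caller's side, using the client API shown earlier (the 30-second timeout and the strings.Builder accumulation are just my example):

// Bound the whole stream with a timeout and accumulate partial chunks.
// When ctx is cancelled or the stream ends, the channel is closed and
// the loop exits.
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()

streamChan, err := client.StreamChat(ctx, req)
if err != nil {
    return err
}

var full strings.Builder
for resp := range streamChan {
    if resp.Error != nil {
        return resp.Error // errors are propagated through the channel
    }
    full.WriteString(resp.Message.Content)
}
fmt.Println(full.String())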

Performance Metrics and Monitoring

Understanding how your LLM integration performs is crucial. The package includes comprehensive metrics:

Request Metrics

  • Request latency
  • Token usage
  • Error rates
  • Retry counts

Connection Pool Metrics

  • Active connections
  • Idle connections
  • Wait time for connections
  • Connection errors

Cost Metrics

  • Cost per request
  • Running totals
  • Budget utilization
  • Cost per model/provider
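
How you surface these depends on your monitoring stack; one way to picture them is as a single snapshot along these lines (purely illustrative, not the package's actual API):

// Illustrative only: the metric categories above, grouped as one snapshot.
type MetricsSnapshot struct {
    // Request metrics
    AvgLatency time.Duration
    TokensUsed int
    ErrorRate  float64
    Retries    int

    // Connection pool metrics
    ActiveConns  int
    IdleConns    int
    ConnWaitTime time.Duration
    ConnErrors   int

    // Cost metrics
    CostPerRequest float64
    TotalCost      float64
    BudgetUsedPct  float64
}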

Provider Management

The package currently supports multiple LLM providers:

OpenAI

  • GPT-3.5
  • GPT-4
  • Text completion models

Anthropic

  • Claude
  • Claude Instant

Each provider implementation handles its specific quirks while presenting a unified interface to your application.
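
I haven't shown the exact interface the package defines, but the idea is that every provider adapter satisfies a contract along these lines, so application code never touches provider-specific request formats (ChatResponse and StreamResponse are assumed names here):

// Illustrative: the kind of contract each provider adapter satisfies.
type Provider interface {
    Name() string
    Chat(ctx context.Context, req *types.ChatRequest) (*types.ChatResponse, error)
    StreamChat(ctx context.Context, req *types.ChatRequest) (<-chan types.StreamResponse, error)
}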

Real-World Applications

I've used this package in several production applications:

Automated Content Generation

A system generating thousands of product descriptions daily. Key features used:

  • Connection pooling for high throughput
  • Cost tracking for billing
  • Retries for reliability

Interactive Chat Applications

Real-time chat applications requiring:

  • Streaming responses
  • Low latency
  • Error resilience

Batch Processing Systems

Large-scale document processing using:

  • Multiple providers
  • Budget management
  • Detailed usage tracking

What's Next

While the package is already being used in production, there's more to come:

Short Term

  • Enhanced cost tracking across different pricing tiers
  • Better model handling and automatic selection
  • Support for more LLM providers
  • Improved metrics and monitoring

Long Term

  • Automatic provider failover
  • Smart request routing
  • Advanced budget controls
  • Performance optimization tools

Best Practices and Tips

From my experience using this package in production, here are some recommendations:

  1. Start with conservative retry settings
  2. Monitor your token usage closely
  3. Set up budget alerts well below your actual limits
  4. Use streaming for interactive applications
  5. Implement proper error handling in your application
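
For tips 1 and 3, I'd begin with something like the values below and loosen them once you've watched real traffic. The numbers are examples, not recommendations from the package docs, and BudgetConfig is the illustrative struct sketched earlier:

// Conservative starting point: few retries, generous spacing between them.
retryCfg := RetryConfig{
    MaxRetries:      2,
    InitialInterval: 1 * time.Second,
    MaxInterval:     10 * time.Second,
    Multiplier:      2.0,
}

// Alert well below the real cap so there's time to react.
budgetCfg := BudgetConfig{
    MonthlyBudget:  100.0, // USD
    AlertThreshold: 0.7,   // alert at 70% of the cap
}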

Conclusion

Building this package has significantly simplified my LLM integrations. Instead of rewriting the same boilerplate code for each project, I can focus on building the actual features I need. If you're working with LLMs in Go, feel free to check out the package and contribute.

Like my approach to CI/CD deployment, this is open source and available for anyone to use and improve. The more we can standardize these patterns, the better our LLM integrations will become.

The future of LLM integration is about making these powerful tools more accessible and reliable. With proper abstractions and production-ready features, we can focus on building innovative applications instead of worrying about the underlying infrastructure.