memos/KUBERNETES_SCALING.md

Kubernetes High Availability and Scaling Guide

This guide explains how to deploy Memos in a Kubernetes environment with proper session management for horizontal scaling and high availability.

Description

Up to v0.25.0, Memos had limitations when deployed as multiple pods in Kubernetes:

  1. Session Isolation: Each pod maintained its own in-memory session cache, causing authentication inconsistencies when load balancers directed users to different pods.

  2. SSO Redirect Issues: OAuth2 authentication flows would fail when:

    • User initiated login on Pod A
    • OAuth provider redirected back to Pod B
    • Pod B couldn't validate the session created by Pod A

  3. Cache Inconsistency: Session updates on one pod weren't reflected on other pods until cache expiry (10+ minutes).

Solution Overview

The solution implements a distributed cache system with the following features:

  • Redis-backed shared cache for session synchronization across pods
  • Hybrid cache strategy with local cache fallback for resilience
  • Event-driven cache invalidation for real-time consistency
  • Backward compatibility - works without Redis for single-pod deployments

Architecture

Production Architecture with External Services

┌─────────────────────────────────────────────────────────────┐
│                Load Balancer (Ingress)                     │
└─────────────┬─────────────┬─────────────┬─────────────────┘
              │             │             │
         ┌────▼────┐   ┌────▼────┐   ┌────▼────┐
         │  Pod A  │   │  Pod B  │   │  Pod C  │
         │         │   │         │   │         │
         └────┬────┘   └────┬────┘   └────┬────┘
              │             │             │
              └─────────────┼─────────────┘
                            │
              ┌─────────────┼─────────────┐
              │             │             │
    ┌─────────▼─────────┐   │   ┌─────────▼─────────┐
    │  Redis Cache      │   │   │  ReadWriteMany    │
    │  (ElastiCache)    │   │   │  Storage (EFS)    │
    │  Distributed      │   │   │  Shared Files     │
    │  Sessions         │   │   │  & Attachments    │
    └───────────────────┘   │   └───────────────────┘
                            │
                   ┌────────▼────────┐
                   │  External DB    │
                   │  (RDS/Cloud SQL)│
                   │  Multi-AZ HA    │
                   └─────────────────┘

Configuration

Environment Variables

Set these environment variables for Redis integration:

# Required: Redis connection URL
MEMOS_REDIS_URL=redis://redis-service:6379

# Optional: Redis configuration
MEMOS_REDIS_POOL_SIZE=20                    # Connection pool size
MEMOS_REDIS_DIAL_TIMEOUT=5s                 # Connection timeout
MEMOS_REDIS_READ_TIMEOUT=3s                 # Read timeout  
MEMOS_REDIS_WRITE_TIMEOUT=3s                # Write timeout
MEMOS_REDIS_KEY_PREFIX=memos                # Key prefix for isolation
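
These variables are typically wired into the Memos Deployment through a ConfigMap and a Secret. A minimal sketch, assuming the memos-config ConfigMap and memos-secrets Secret names used later in this guide, the official neosmemo/memos image, and the standard MEMOS_DRIVER/MEMOS_DSN variables for the database connection:

# Illustrative excerpt of the Memos Deployment (names and image tag are assumptions)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: memos
spec:
  replicas: 3
  selector:
    matchLabels:
      app: memos
  template:
    metadata:
      labels:
        app: memos
    spec:
      containers:
      - name: memos
        image: neosmemo/memos:stable
        envFrom:
        - configMapRef:
            name: memos-config        # provides MEMOS_REDIS_URL and the optional tuning variables
        env:
        - name: MEMOS_DRIVER
          value: postgres
        - name: MEMOS_DSN
          valueFrom:
            secretKeyRef:
              name: memos-secrets
              key: database-dsn       # stored in a Secret, as in the AWS example below
        ports:
        - containerPort: 5230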

Fallback Behavior

  • Redis Available: Uses hybrid cache (Redis + local fallback)
  • Redis Unavailable: Falls back to local-only cache (single pod)
  • Redis Failure: Gracefully degrades to local cache until Redis recovers

Deployment Options

1. Development/Testing Deployment

For testing with self-hosted database:

kubectl apply -f kubernetes-example.yaml

This creates:

  • Self-hosted PostgreSQL with persistent storage
  • Redis deployment with persistence
  • Memos deployment with 3 replicas
  • ReadWriteMany shared storage
  • Load balancer service and ingress
  • HorizontalPodAutoscaler
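
The exact manifests live in kubernetes-example.yaml, but the in-cluster Redis piece typically looks something like this sketch (names, image, and sizing are illustrative; only the Service name has to match MEMOS_REDIS_URL):

# Minimal in-cluster Redis for development/testing (not production-grade)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis:7-alpine
        args: ["--appendonly", "yes"]   # basic on-disk persistence
        ports:
        - containerPort: 6379
---
apiVersion: v1
kind: Service
metadata:
  name: redis-service                   # matches MEMOS_REDIS_URL=redis://redis-service:6379
spec:
  selector:
    app: redis
  ports:
  - port: 6379
    targetPort: 6379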

2. Production Deployment

For production with managed services:

# First, set up your managed database and Redis
# Then apply the production configuration:
kubectl apply -f kubernetes-production.yaml

This provides:

  • External managed database (AWS RDS, Google Cloud SQL, Azure Database)
  • External managed Redis (ElastiCache, Google Memorystore, Azure Cache)
  • ReadWriteMany storage for shared file access
  • Pod Disruption Budget for high availability (see the sketch after this list)
  • Network policies for security
  • Advanced health checks and graceful shutdown
  • Horizontal Pod Autoscaler with intelligent scaling

3. Cloud Provider Specific Examples

AWS Deployment with RDS and ElastiCache

# 1. Create RDS PostgreSQL instance
aws rds create-db-instance \
  --db-instance-identifier memos-db \
  --db-instance-class db.t3.medium \
  --engine postgres \
  --master-username memos \
  --master-user-password YourSecurePassword \
  --allocated-storage 100 \
  --vpc-security-group-ids sg-xxxxxxxx \
  --db-subnet-group-name memos-subnet-group \
  --multi-az \
  --backup-retention-period 7

# 2. Create ElastiCache Redis cluster
aws elasticache create-replication-group \
  --replication-group-id memos-redis \
  --description "Memos Redis cluster" \
  --node-type cache.t3.medium \
  --num-cache-clusters 2 \
  --port 6379

# 3. Update secrets with actual endpoints
kubectl create secret generic memos-secrets \
  --from-literal=database-dsn="postgres://memos:password@memos-db.xxxxxx.region.rds.amazonaws.com:5432/memos?sslmode=require"

# 4. Update ConfigMap with ElastiCache endpoint
kubectl create configmap memos-config \
  --from-literal=MEMOS_REDIS_URL="redis://memos-redis.xxxxxx.cache.amazonaws.com:6379"

# 5. Deploy Memos
kubectl apply -f kubernetes-production.yaml
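
If you prefer declarative manifests over kubectl create, steps 3 and 4 correspond roughly to the following (the endpoints are the same placeholders as above):

apiVersion: v1
kind: Secret
metadata:
  name: memos-secrets
type: Opaque
stringData:
  database-dsn: "postgres://memos:password@memos-db.xxxxxx.region.rds.amazonaws.com:5432/memos?sslmode=require"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: memos-config
data:
  MEMOS_REDIS_URL: "redis://memos-redis.xxxxxx.cache.amazonaws.com:6379"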

Google Cloud Deployment

# 1. Create Cloud SQL instance
gcloud sql instances create memos-db \
  --database-version=POSTGRES_15 \
  --tier=db-n1-standard-2 \
  --region=us-central1 \
  --availability-type=REGIONAL \
  --backup \
  --maintenance-window-day=SUN \
  --maintenance-window-hour=06

# 2. Create Memorystore Redis instance  
gcloud redis instances create memos-redis \
  --size=5 \
  --region=us-central1 \
  --redis-version=redis_7_0

# 3. Deploy with Cloud SQL Proxy (secure connection)
kubectl apply -f kubernetes-production.yaml
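
Step 3 refers to the Cloud SQL Auth Proxy sidecar pattern. A rough sketch of the extra container in the Memos pod spec; the image tag, instance connection name, and service-account/IAM wiring are assumptions to adapt:

# Sidecar excerpt for the Memos Deployment (Memos connects to 127.0.0.1:5432)
containers:
- name: memos
  image: neosmemo/memos:stable
  # ... Memos configuration as in the production manifest ...
- name: cloud-sql-proxy
  image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.11.0
  args:
  - "--port=5432"
  - "my-project:us-central1:memos-db"   # instance connection name (placeholder)
  securityContext:
    runAsNonRoot: true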

Azure Deployment

# 1. Create Azure Database for PostgreSQL
az postgres server create \
  --resource-group memos-rg \
  --name memos-db \
  --location eastus \
  --admin-user memos \
  --admin-password YourSecurePassword \
  --sku-name GP_Gen5_2 \
  --version 15

# 2. Create Azure Cache for Redis
az redis create \
  --resource-group memos-rg \
  --name memos-redis \
  --location eastus \
  --sku Standard \
  --vm-size C2

# 3. Deploy Memos
kubectl apply -f kubernetes-production.yaml

Monitoring and Troubleshooting

Cache Status Endpoint

Monitor cache health via the admin API:

curl -H "Authorization: Bearer <admin-token>" \
  https://your-memos-instance.com/api/v1/cache/status

Response includes:

{
  "user_cache": {
    "type": "hybrid",
    "size": 150,
    "local_size": 45,
    "redis_size": 150,
    "redis_available": true,
    "pod_id": "abc12345",
    "event_queue_size": 0
  },
  "user_setting_cache": {
    "type": "hybrid",
    "size": 89,
    "redis_available": true,
    "pod_id": "abc12345"
  }
}

Health Checks

Monitor these indicators:

  1. Redis Connectivity: Check redis_available in cache status
  2. Event Queue: Monitor event_queue_size for backlog
  3. Cache Hit Rates: Compare local_size vs redis_size
  4. Pod Distribution: Verify requests distributed across pods

Common Issues

Problem: Authentication fails after login

Symptoms: Users can log in but subsequent requests fail
Cause: Session created on one pod, request handled by another
Solution: Verify Redis configuration and connectivity

Problem: High cache misses

Symptoms: Poor performance, frequent database queries
Cause: Redis unavailable or misconfigured
Solution: Check Redis logs and connection settings

Problem: Session persistence issues

Symptoms: Users logged out unexpectedly
Cause: Redis data loss or TTL issues
Solution: Enable Redis persistence and verify TTL settings

Performance Considerations

External Database Requirements

PostgreSQL Sizing:

  • Small (< 100 users): 2 CPU, 4GB RAM, 100GB storage
  • Medium (100-1000 users): 4 CPU, 8GB RAM, 500GB storage
  • Large (1000+ users): 8+ CPU, 16GB+ RAM, 1TB+ storage

Redis Sizing:

  • Memory: Base 50MB + (2KB × active sessions) + (1KB × cached settings)
  • Small: 1GB (handles ~500K sessions)
  • Medium: 2-4GB (handles 1-2M sessions)
  • Large: 8GB+ (handles 4M+ sessions)
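
As a rough worked example of that formula: 100,000 active sessions and 50,000 cached settings come to about 50 MB + (2 KB × 100,000) + (1 KB × 50,000) ≈ 300 MB, so a 1 GB instance leaves comfortable headroom.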

Connection Pool Sizing:

  • Database: Start with max_connections = 20 × number_of_pods
  • Redis: Start with pool_size = 10 × number_of_pods
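
For example, a three-pod deployment would start from roughly 60 database connections (20 × 3) and 30 Redis connections (10 × 3) in total, then tune based on observed utilization.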

Scaling Guidelines

Horizontal Pod Autoscaler:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: memos-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: memos
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

Recommended Scaling:

  • Small (< 100 users): 2-3 pods, managed Redis, managed DB
  • Medium (100-1000 users): 3-8 pods, Redis cluster, Multi-AZ DB
  • Large (1000+ users): 8-20 pods, Redis cluster, read replicas
  • Enterprise: 20+ pods, Redis cluster, DB sharding

Security Considerations

Redis Security

  1. Network Isolation: Deploy Redis in private network
  2. Authentication: Use Redis AUTH if exposed
  3. Encryption: Enable TLS for Redis connections
  4. Access Control: Restrict Redis access to Memos pods only

Example with Redis AUTH:

MEMOS_REDIS_URL=redis://:password@redis-service:6379
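
For the network isolation and access control points above, a NetworkPolicy along these lines limits in-cluster Redis traffic to the Memos pods (the labels are assumptions and must match your deployments; managed Redis services use security groups or firewall rules instead):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: redis-access
spec:
  podSelector:
    matchLabels:
      app: redis               # the in-cluster Redis pods
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: memos           # only Memos pods may connect
    ports:
    - protocol: TCP
      port: 6379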

Session Security

  • Sessions remain encrypted in transit
  • Redis stores serialized session data
  • Session TTL honored across all pods
  • Admin-only access to cache status endpoint

Migration Guide

From Single Pod to Multi-Pod

Option 1: Rolling Migration

  1. Setup External Services: Deploy managed database and Redis
  2. Migrate Data: Export/import existing database to managed service
  3. Update Configuration: Add Redis and external DB environment variables
  4. Rolling Update: Update Memos deployment with new config
  5. Scale Up: Increase replica count gradually
  6. Verify: Check cache status and session persistence
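
For step 4, a rolling update that never takes all pods down at once helps keep existing sessions reachable. A minimal sketch, written as a patch that could be applied with kubectl patch deployment memos --patch-file rolling.yaml:

# rolling.yaml - update strategy for the Memos Deployment
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0        # keep current replicas serving while new ones start
      maxSurge: 1              # add one new pod at a time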

Option 2: Blue-Green Deployment

  1. Setup New Environment: Complete production setup in parallel
  2. Data Migration: Sync data to new environment
  3. DNS Cutover: Switch traffic to new environment
  4. Cleanup: Remove old environment after verification

Rollback Strategy

If issues occur:

  1. Scale Down: Reduce to single pod
  2. Remove Redis Config: Unset MEMOS_REDIS_URL and the other MEMOS_REDIS_* environment variables
  3. Restart: Pods will use local cache only

Best Practices

  1. Resource Limits: Set appropriate CPU/memory limits
  2. Health Checks: Implement readiness/liveness probes (see the probe sketch after this list)
  3. Monitoring: Track cache metrics and Redis health
  4. Backup: Regular Redis data backups
  5. Testing: Verify session persistence across pod restarts
  6. Gradual Scaling: Increase replicas incrementally
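
A minimal probe sketch for the Memos container; the /healthz path and port 5230 are assumptions, so confirm them against your Memos version:

# Probe excerpt for the Memos container (path and port are assumptions to verify)
readinessProbe:
  httpGet:
    path: /healthz
    port: 5230
  initialDelaySeconds: 5
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /healthz
    port: 5230
  initialDelaySeconds: 15
  periodSeconds: 30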

Support

For issues or questions:

  1. Check cache status endpoint first
  2. Review Redis and pod logs
  3. Verify environment variable configuration
  4. Test with single pod to isolate issues