mirror of https://github.com/usememos/memos.git
381 lines
12 KiB
Markdown
381 lines
12 KiB
Markdown
# Kubernetes High Availability and Scaling Guide
|
||
|
||
This guide explains how to deploy Memos in a Kubernetes environment with proper session management for horizontal scaling and high availability.
|
||
|
||
## Description
|
||
|
||
Till v0.25.0, Memos had limitations when deployed as multiple pods in Kubernetes:
|
||
|
||
1. **Session Isolation**: Each pod maintained its own in-memory session cache, causing authentication inconsistencies when load balancers directed users to different pods.
|
||
|
||
2. **SSO Redirect Issues**: OAuth2 authentication flows would fail when:
|
||
- User initiated login on Pod A
|
||
- OAuth provider redirected back to Pod B
|
||
- Pod B couldn't validate the session created by Pod A
|
||
|
||
3. **Cache Inconsistency**: Session updates on one pod weren't reflected on other pods until cache expiry (10+ minutes).
|
||
|
||
## Solution Overview
|
||
|
||
The solution implements a **distributed cache system** with the following features:
|
||
|
||
- **Redis-backed shared cache** for session synchronization across pods
|
||
- **Hybrid cache strategy** with local cache fallback for resilience
|
||
- **Event-driven cache invalidation** for real-time consistency
|
||
- **Backward compatibility** - works without Redis for single-pod deployments
|
||
|
||
## Architecture
|
||
|
||
### Production Architecture with External Services
|
||
|
||
```
|
||
┌─────────────────────────────────────────────────────────────┐
|
||
│ Load Balancer (Ingress) │
|
||
└─────────────┬─────────────┬─────────────┬─────────────────┘
|
||
│ │ │
|
||
┌────▼────┐ ┌────▼────┐ ┌────▼────┐
|
||
│ Pod A │ │ Pod B │ │ Pod C │
|
||
│ │ │ │ │ │
|
||
└────┬────┘ └────┬────┘ └────┬────┘
|
||
│ │ │
|
||
└─────────────┼─────────────┘
|
||
│
|
||
┌─────────────┼─────────────┐
|
||
│ │ │
|
||
┌─────────▼─────────┐ │ ┌─────────▼─────────┐
|
||
│ Redis Cache │ │ │ ReadWriteMany │
|
||
│ (ElastiCache) │ │ │ Storage (EFS) │
|
||
│ Distributed │ │ │ Shared Files │
|
||
│ Sessions │ │ │ & Attachments │
|
||
└───────────────────┘ │ └───────────────────┘
|
||
│
|
||
┌────────▼────────┐
|
||
│ External DB │
|
||
│ (RDS/Cloud SQL)│
|
||
│ Multi-AZ HA │
|
||
└─────────────────┘
|
||
```
|
||
|
||
## Configuration
|
||
|
||
### Environment Variables
|
||
|
||
Set these environment variables for Redis integration:
|
||
|
||
```bash
|
||
# Required: Redis connection URL
|
||
MEMOS_REDIS_URL=redis://redis-service:6379
|
||
|
||
# Optional: Redis configuration
|
||
MEMOS_REDIS_POOL_SIZE=20 # Connection pool size
|
||
MEMOS_REDIS_DIAL_TIMEOUT=5s # Connection timeout
|
||
MEMOS_REDIS_READ_TIMEOUT=3s # Read timeout
|
||
MEMOS_REDIS_WRITE_TIMEOUT=3s # Write timeout
|
||
MEMOS_REDIS_KEY_PREFIX=memos # Key prefix for isolation
|
||
```
|
||
|
||
### Fallback Behavior
|
||
|
||
- **Redis Available**: Uses hybrid cache (Redis + local fallback)
|
||
- **Redis Unavailable**: Falls back to local-only cache (single pod)
|
||
- **Redis Failure**: Gracefully degrades to local cache until Redis recovers
|
||
|
||
## Deployment Options
|
||
|
||
### 1. Development/Testing Deployment
|
||
|
||
For testing with self-hosted database:
|
||
|
||
```bash
|
||
kubectl apply -f kubernetes-example.yaml
|
||
```
|
||
|
||
This creates:
|
||
- Self-hosted PostgreSQL with persistent storage
|
||
- Redis deployment with persistence
|
||
- Memos deployment with 3 replicas
|
||
- ReadWriteMany shared storage
|
||
- Load balancer service and ingress
|
||
- HorizontalPodAutoscaler
|
||
|
||
### 2. Production Deployment (Recommended)
|
||
|
||
For production with managed services:
|
||
|
||
```bash
|
||
# First, set up your managed database and Redis
|
||
# Then apply the production configuration:
|
||
kubectl apply -f kubernetes-production.yaml
|
||
```
|
||
|
||
This provides:
|
||
- **External managed database** (AWS RDS, Google Cloud SQL, Azure Database)
|
||
- **External managed Redis** (ElastiCache, Google Memorystore, Azure Cache)
|
||
- **ReadWriteMany storage** for shared file access
|
||
- **Pod Disruption Budget** for high availability
|
||
- **Network policies** for security
|
||
- **Advanced health checks** and graceful shutdown
|
||
- **Horizontal Pod Autoscaler** with intelligent scaling
|
||
|
||
### 3. Cloud Provider Specific Examples
|
||
|
||
#### AWS Deployment with RDS and ElastiCache
|
||
|
||
```bash
|
||
# 1. Create RDS PostgreSQL instance
|
||
aws rds create-db-instance \
|
||
--db-instance-identifier memos-db \
|
||
--db-instance-class db.t3.medium \
|
||
--engine postgres \
|
||
--master-username memos \
|
||
--master-user-password YourSecurePassword \
|
||
--allocated-storage 100 \
|
||
--vpc-security-group-ids sg-xxxxxxxx \
|
||
--db-subnet-group-name memos-subnet-group \
|
||
--multi-az \
|
||
--backup-retention-period 7
|
||
|
||
# 2. Create ElastiCache Redis cluster
|
||
aws elasticache create-replication-group \
|
||
--replication-group-id memos-redis \
|
||
--description "Memos Redis cluster" \
|
||
--node-type cache.t3.medium \
|
||
--num-cache-clusters 2 \
|
||
--port 6379
|
||
|
||
# 3. Update secrets with actual endpoints
|
||
kubectl create secret generic memos-secrets \
|
||
--from-literal=database-dsn="postgres://memos:password@memos-db.xxxxxx.region.rds.amazonaws.com:5432/memos?sslmode=require"
|
||
|
||
# 4. Update ConfigMap with ElastiCache endpoint
|
||
kubectl create configmap memos-config \
|
||
--from-literal=MEMOS_REDIS_URL="redis://memos-redis.xxxxxx.cache.amazonaws.com:6379"
|
||
|
||
# 5. Deploy Memos
|
||
kubectl apply -f kubernetes-production.yaml
|
||
```
|
||
|
||
#### Google Cloud Deployment
|
||
|
||
```bash
|
||
# 1. Create Cloud SQL instance
|
||
gcloud sql instances create memos-db \
|
||
--database-version=POSTGRES_15 \
|
||
--tier=db-n1-standard-2 \
|
||
--region=us-central1 \
|
||
--availability-type=REGIONAL \
|
||
--backup \
|
||
--maintenance-window-day=SUN \
|
||
--maintenance-window-hour=06
|
||
|
||
# 2. Create Memorystore Redis instance
|
||
gcloud redis instances create memos-redis \
|
||
--size=5 \
|
||
--region=us-central1 \
|
||
--redis-version=redis_7_0
|
||
|
||
# 3. Deploy with Cloud SQL Proxy (secure connection)
|
||
kubectl apply -f kubernetes-production.yaml
|
||
```
|
||
|
||
#### Azure Deployment
|
||
|
||
```bash
|
||
# 1. Create Azure Database for PostgreSQL
|
||
az postgres server create \
|
||
--resource-group memos-rg \
|
||
--name memos-db \
|
||
--location eastus \
|
||
--admin-user memos \
|
||
--admin-password YourSecurePassword \
|
||
--sku-name GP_Gen5_2 \
|
||
--version 15
|
||
|
||
# 2. Create Azure Cache for Redis
|
||
az redis create \
|
||
--resource-group memos-rg \
|
||
--name memos-redis \
|
||
--location eastus \
|
||
--sku Standard \
|
||
--vm-size C2
|
||
|
||
# 3. Deploy Memos
|
||
kubectl apply -f kubernetes-production.yaml
|
||
```
|
||
|
||
## Monitoring and Troubleshooting
|
||
|
||
### Cache Status Endpoint
|
||
|
||
Monitor cache health via the admin API:
|
||
|
||
```bash
|
||
curl -H "Authorization: Bearer <admin-token>" \
|
||
https://your-memos-instance.com/api/v1/cache/status
|
||
```
|
||
|
||
Response includes:
|
||
```json
|
||
{
|
||
"user_cache": {
|
||
"type": "hybrid",
|
||
"size": 150,
|
||
"local_size": 45,
|
||
"redis_size": 150,
|
||
"redis_available": true,
|
||
"pod_id": "abc12345",
|
||
"event_queue_size": 0
|
||
},
|
||
"user_setting_cache": {
|
||
"type": "hybrid",
|
||
"size": 89,
|
||
"redis_available": true,
|
||
"pod_id": "abc12345"
|
||
}
|
||
}
|
||
```
|
||
|
||
### Health Checks
|
||
|
||
Monitor these indicators:
|
||
|
||
1. **Redis Connectivity**: Check `redis_available` in cache status
|
||
2. **Event Queue**: Monitor `event_queue_size` for backlog
|
||
3. **Cache Hit Rates**: Compare `local_size` vs `redis_size`
|
||
4. **Pod Distribution**: Verify requests distributed across pods
|
||
|
||
### Common Issues
|
||
|
||
#### Problem: Authentication fails after login
|
||
**Symptoms**: Users can log in but subsequent requests fail
|
||
**Cause**: Session created on one pod, request handled by another
|
||
**Solution**: Verify Redis configuration and connectivity
|
||
|
||
#### Problem: High cache misses
|
||
**Symptoms**: Poor performance, frequent database queries
|
||
**Cause**: Redis unavailable or misconfigured
|
||
**Solution**: Check Redis logs and connection settings
|
||
|
||
#### Problem: Session persistence issues
|
||
**Symptoms**: Users logged out unexpectedly
|
||
**Cause**: Redis data loss or TTL issues
|
||
**Solution**: Enable Redis persistence and verify TTL settings
|
||
|
||
## Performance Considerations
|
||
|
||
### External Database Requirements
|
||
|
||
**PostgreSQL Sizing**:
|
||
- **Small (< 100 users)**: 2 CPU, 4GB RAM, 100GB storage
|
||
- **Medium (100-1000 users)**: 4 CPU, 8GB RAM, 500GB storage
|
||
- **Large (1000+ users)**: 8+ CPU, 16GB+ RAM, 1TB+ storage
|
||
|
||
**Redis Sizing**:
|
||
- **Memory**: Base 50MB + (2KB × active sessions) + (1KB × cached settings)
|
||
- **Small**: 1GB (handles ~500K sessions)
|
||
- **Medium**: 2-4GB (handles 1-2M sessions)
|
||
- **Large**: 8GB+ (handles 4M+ sessions)
|
||
|
||
**Connection Pool Sizing**:
|
||
- Database: Start with `max_connections = 20 × number_of_pods`
|
||
- Redis: Start with `pool_size = 10 × number_of_pods`
|
||
|
||
### Scaling Guidelines
|
||
|
||
**Horizontal Pod Autoscaler**:
|
||
```yaml
|
||
apiVersion: autoscaling/v2
|
||
kind: HorizontalPodAutoscaler
|
||
metadata:
|
||
name: memos-hpa
|
||
spec:
|
||
scaleTargetRef:
|
||
kind: Deployment
|
||
name: memos
|
||
minReplicas: 2
|
||
maxReplicas: 10
|
||
metrics:
|
||
- type: Resource
|
||
resource:
|
||
name: cpu
|
||
target:
|
||
type: Utilization
|
||
averageUtilization: 70
|
||
```
|
||
|
||
**Recommended Scaling**:
|
||
- **Small (< 100 users)**: 2-3 pods, managed Redis, managed DB
|
||
- **Medium (100-1000 users)**: 3-8 pods, Redis cluster, Multi-AZ DB
|
||
- **Large (1000+ users)**: 8-20 pods, Redis cluster, read replicas
|
||
- **Enterprise**: 20+ pods, Redis cluster, DB sharding
|
||
|
||
## Security Considerations
|
||
|
||
### Redis Security
|
||
|
||
1. **Network Isolation**: Deploy Redis in private network
|
||
2. **Authentication**: Use Redis AUTH if exposed
|
||
3. **Encryption**: Enable TLS for Redis connections
|
||
4. **Access Control**: Restrict Redis access to Memos pods only
|
||
|
||
Example with Redis AUTH:
|
||
```bash
|
||
MEMOS_REDIS_URL=redis://:password@redis-service:6379
|
||
```
|
||
|
||
### Session Security
|
||
|
||
- Sessions remain encrypted in transit
|
||
- Redis stores serialized session data
|
||
- Session TTL honored across all pods
|
||
- Admin-only access to cache status endpoint
|
||
|
||
## Migration Guide
|
||
|
||
### From Single Pod to Multi-Pod
|
||
|
||
#### Option 1: Gradual Migration (Recommended)
|
||
1. **Setup External Services**: Deploy managed database and Redis
|
||
2. **Migrate Data**: Export/import existing database to managed service
|
||
3. **Update Configuration**: Add Redis and external DB environment variables
|
||
4. **Rolling Update**: Update Memos deployment with new config
|
||
5. **Scale Up**: Increase replica count gradually
|
||
6. **Verify**: Check cache status and session persistence
|
||
|
||
#### Option 2: Blue-Green Deployment
|
||
1. **Setup New Environment**: Complete production setup in parallel
|
||
2. **Data Migration**: Sync data to new environment
|
||
3. **DNS Cutover**: Switch traffic to new environment
|
||
4. **Cleanup**: Remove old environment after verification
|
||
|
||
### Rollback Strategy
|
||
|
||
If issues occur:
|
||
1. **Scale Down**: Reduce to single pod
|
||
2. **Remove Redis Config**: Environment variables
|
||
3. **Restart**: Pods will use local cache only
|
||
|
||
## Best Practices
|
||
|
||
1. **Resource Limits**: Set appropriate CPU/memory limits
|
||
2. **Health Checks**: Implement readiness/liveness probes
|
||
3. **Monitoring**: Track cache metrics and Redis health
|
||
4. **Backup**: Regular Redis data backups
|
||
5. **Testing**: Verify session persistence across pod restarts
|
||
6. **Gradual Scaling**: Increase replicas incrementally
|
||
|
||
## Additional Resources
|
||
|
||
- [Redis Kubernetes Operator](https://github.com/spotahome/redis-operator)
|
||
- [Kubernetes HPA Documentation](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/)
|
||
- [Session Affinity vs Distributed Sessions](https://kubernetes.io/docs/concepts/services-networking/service/#session-stickiness)
|
||
|
||
## Support
|
||
|
||
For issues or questions:
|
||
1. Check cache status endpoint first
|
||
2. Review Redis and pod logs
|
||
3. Verify environment variable configuration
|
||
4. Test with single pod to isolate issues
|
||
|