Health Checks
Production-ready health check endpoints for Kubernetes and Docker orchestration.
Overview
TONL-MCP Bridge v1.0.0 includes dedicated health check endpoints designed for container orchestration platforms. These endpoints enable:
- Kubernetes liveness and readiness probes
- Docker HEALTHCHECK instructions
- Load balancer health monitoring
- Zero-downtime deployments
- Automated failover
Endpoints
Liveness Probe: /health
Indicates if the server process is running and healthy.
Purpose:
- Kubernetes uses this to restart unhealthy pods
- Docker uses this to mark containers as unhealthy
- Load balancers use this to remove unhealthy instances
Response:
{
"status": "healthy",
"uptime": 3600.5,
"timestamp": "2025-12-07T19:00:00.000Z"
}
Characteristics:
- Response time: < 1ms
- No external dependencies
- Always returns 200 OK if process is running
- No authentication required
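The handler behind this endpoint is intentionally trivial. As a rough sketch of what such a dependency-free check looks like (using Node's built-in http module; the wiring shown here is illustrative, not the bridge's actual source):
import { createServer } from 'node:http';

// Illustrative liveness handler: no I/O, no dependencies, answers immediately.
const server = createServer((req, res) => {
  if (req.method === 'GET' && req.url === '/health') {
    res.writeHead(200, { 'Content-Type': 'application/json' });
    res.end(
      JSON.stringify({
        status: 'healthy',
        uptime: process.uptime(),          // seconds since the process started
        timestamp: new Date().toISOString(),
      })
    );
    return;
  }
  res.writeHead(404);
  res.end();
});

server.listen(3000);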
Example:
curl http://localhost:3000/health
# Expected: 200 OK
# {
# "status": "healthy",
# "uptime": 125.3,
# "timestamp": "2025-12-07T19:02:05.000Z"
# }
Readiness Probe: /ready
Indicates if the server is ready to accept traffic.
Purpose:
- Kubernetes uses this to route traffic only to ready pods
- Ensures server has completed initialization
- Can be extended to check database connections (future)
Response:
{
"status": "ready",
"timestamp": "2025-12-07T19:00:00.000Z"
}
Characteristics:
- Response time: < 1ms
- Returns 200 OK when ready for traffic
- Returns 503 Service Unavailable during startup (future)
- No authentication required
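The 503-during-startup behaviour is listed as future work; today the endpoint reports ready as soon as the process is listening. If it were added, gating the response on an initialization flag could look roughly like this (a sketch only; the startup steps are hypothetical):
import type { ServerResponse } from 'node:http';

// Sketch: flip a flag once startup work finishes, and let /ready answer 503 until then.
let initialized = false;

async function initialize(): Promise<void> {
  // Hypothetical startup work: load config, warm caches, open connections, ...
  initialized = true;
}

function handleReady(res: ServerResponse): void {
  res.writeHead(initialized ? 200 : 503, { 'Content-Type': 'application/json' });
  res.end(
    JSON.stringify({
      status: initialized ? 'ready' : 'starting',
      timestamp: new Date().toISOString(),
    })
  );
}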
Example:
curl http://localhost:3000/ready
# Expected: 200 OK
# {
# "status": "ready",
# "timestamp": "2025-12-07T19:02:05.000Z"
# }
Kubernetes Configuration
Deployment with Health Checks
apiVersion: apps/v1
kind: Deployment
metadata:
name: tonl-mcp-bridge
spec:
replicas: 3
selector:
matchLabels:
app: tonl-mcp-bridge
template:
metadata:
labels:
app: tonl-mcp-bridge
spec:
containers:
- name: tonl-server
image: ghcr.io/kryptomrx/tonl-mcp-bridge:latest
ports:
- name: http
containerPort: 3000
env:
- name: TONL_AUTH_TOKEN
valueFrom:
secretKeyRef:
name: tonl-secrets
key: auth-token
# Liveness probe
livenessProbe:
httpGet:
path: /health
port: http
initialDelaySeconds: 10
periodSeconds: 30
timeoutSeconds: 5
failureThreshold: 3
# Readiness probe
readinessProbe:
httpGet:
path: /ready
port: http
initialDelaySeconds: 5
periodSeconds: 10
timeoutSeconds: 3
failureThreshold: 3
resources:
requests:
memory: "256Mi"
cpu: "100m"
limits:
memory: "512Mi"
            cpu: "500m"
Probe Configuration Best Practices
Liveness Probe:
- initialDelaySeconds: 10 - Wait 10s after container start
- periodSeconds: 30 - Check every 30 seconds
- timeoutSeconds: 5 - Fail if no response in 5s
- failureThreshold: 3 - Restart after 3 consecutive failures
Readiness Probe:
- initialDelaySeconds: 5 - Start checking after 5s
- periodSeconds: 10 - Check every 10 seconds
- timeoutSeconds: 3 - Fail if no response in 3s
- failureThreshold: 3 - Mark unready after 3 failures
Docker Configuration
Dockerfile with HEALTHCHECK
FROM node:18-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --production
COPY dist ./dist
# Health check (curl is not part of the alpine base image, so install it first)
RUN apk add --no-cache curl
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
  CMD curl -f http://localhost:3000/health || exit 1
EXPOSE 3000
CMD ["node", "dist/mcp/index.js"]
Docker Compose
version: '3.8'
services:
tonl-server:
image: ghcr.io/kryptomrx/tonl-mcp-bridge:latest
ports:
- "3000:3000"
environment:
- TONL_AUTH_TOKEN=${TONL_AUTH_TOKEN}
- NODE_ENV=production
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
interval: 30s
timeout: 5s
retries: 3
start_period: 10s
    restart: unless-stopped
Load Balancer Configuration
nginx
upstream tonl_backend {
server localhost:3000 max_fails=3 fail_timeout=30s;
server localhost:3001 max_fails=3 fail_timeout=30s;
server localhost:3002 max_fails=3 fail_timeout=30s;
}
server {
listen 80;
server_name api.example.com;
# Health check endpoint (no logging)
location /health {
proxy_pass http://tonl_backend/health;
access_log off;
proxy_connect_timeout 3s;
proxy_read_timeout 3s;
}
# Application endpoints
location / {
proxy_pass http://tonl_backend;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_set_header Host $host;
        # Active health check for this backend (the health_check directive
        # requires NGINX Plus; open-source nginx relies on the passive
        # max_fails/fail_timeout settings in the upstream block above)
        health_check interval=10s fails=3 passes=2 uri=/health;
}
}
HAProxy
backend tonl_servers
balance roundrobin
option httpchk GET /health
http-check expect status 200
server tonl1 localhost:3000 check inter 10s fall 3 rise 2
server tonl2 localhost:3001 check inter 10s fall 3 rise 2
    server tonl3 localhost:3002 check inter 10s fall 3 rise 2
Monitoring & Alerting
Prometheus Alerts
groups:
- name: tonl_health
interval: 30s
rules:
- alert: TonlServerDown
expr: up{job="tonl-mcp-bridge"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "TONL server {{ $labels.instance }} is down"
description: "Instance {{ $labels.instance }} has been down for more than 1 minute"
- alert: TonlServerUnhealthy
expr: probe_success{job="tonl-health"} == 0
for: 2m
labels:
severity: warning
annotations:
summary: "TONL server {{ $labels.instance }} health check failing"
          description: "Health check failing for {{ $labels.instance }}"
Blackbox Exporter
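The module definition below tells the exporter how to probe. Prometheus also needs a scrape job that routes the health URLs through the exporter so the probe_success metric used by the alerts above is produced; a sketch of that job, where the exporter address and target URL are assumptions:
scrape_configs:
  - job_name: tonl-health
    metrics_path: /probe
    params:
      module: [http_tonl_health]
    static_configs:
      - targets:
          - http://tonl-server:3000/health   # target URL is illustrative
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115  # assumed exporter address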
modules:
http_tonl_health:
prober: http
timeout: 5s
http:
valid_status_codes: [200]
method: GET
      preferred_ip_protocol: ip4
Testing Health Checks
Manual Testing
# Test liveness
curl -f http://localhost:3000/health
echo $? # Should be 0
# Test readiness
curl -f http://localhost:3000/ready
echo $? # Should be 0
# Test with timeout
timeout 5s curl http://localhost:3000/health
# Load test health endpoint
ab -n 1000 -c 10 http://localhost:3000/health
Automated Testing
#!/bin/bash
# Health check in CI/CD
MAX_RETRIES=30
RETRY_DELAY=2
for i in $(seq 1 $MAX_RETRIES); do
if curl -sf http://localhost:3000/health > /dev/null; then
echo "Server is healthy"
exit 0
fi
echo "Waiting for server... ($i/$MAX_RETRIES)"
sleep $RETRY_DELAY
done
echo "Server failed to become healthy"
exit 1
Integration Tests
import { describe, it, expect, beforeAll, afterAll } from 'vitest';
import { startHttpServer } from '../src/mcp/server.js';
describe('Health Check Endpoints', () => {
let server;
let port = 3000;
beforeAll(() => {
server = startHttpServer(port);
});
afterAll(() => {
server.close();
});
it('should return 200 OK for /health', async () => {
const response = await fetch(`http://localhost:${port}/health`);
expect(response.status).toBe(200);
const data = await response.json();
expect(data.status).toBe('healthy');
expect(data).toHaveProperty('uptime');
expect(data).toHaveProperty('timestamp');
});
it('should return 200 OK for /ready', async () => {
const response = await fetch(`http://localhost:${port}/ready`);
expect(response.status).toBe(200);
const data = await response.json();
expect(data.status).toBe('ready');
expect(data).toHaveProperty('timestamp');
});
it('should respond quickly (< 100ms)', async () => {
const start = Date.now();
await fetch(`http://localhost:${port}/health`);
const duration = Date.now() - start;
expect(duration).toBeLessThan(100);
});
});
Troubleshooting
Health Check Failing
Symptom: Health check returns non-200 status or times out
Possible causes:
- Server not started
- Port not accessible
- Firewall blocking requests
- Server overloaded
Solutions:
# Check if server is running
ps aux | grep tonl
# Check port binding
netstat -an | grep 3000
# Test locally
curl -v http://localhost:3000/health
# Check logs
docker logs tonl-server
kubectl logs deployment/tonl-mcp-bridge
Kubernetes Pod Restart Loop
Symptom: Pods continuously restart
Possible causes:
- Liveness probe too aggressive
- Server startup time > initialDelaySeconds
- Application crash during startup
Solutions:
# Check pod events
kubectl describe pod <pod-name>
# Check logs before crash
kubectl logs <pod-name> --previous
# Increase initialDelaySeconds
# Edit deployment:
livenessProbe:
  initialDelaySeconds: 30  # Increase from 10
Load Balancer Not Routing Traffic
Symptom: No traffic reaching backend servers
Possible causes:
- Readiness probe failing
- All backends marked unhealthy
- Incorrect health check path
Solutions:
# Check backend status
kubectl get pods -l app=tonl-mcp-bridge
# Test health check directly
kubectl port-forward pod/<pod-name> 3000:3000
curl http://localhost:3000/ready
# Check load balancer config
kubectl describe service tonl-mcp-bridge
Best Practices
Probe Configuration
Separate Liveness and Readiness
- Liveness: Process health only
- Readiness: Ready to serve traffic
Appropriate Intervals and Timeouts
- Liveness: Longer intervals (30s)
- Readiness: Shorter intervals (10s)
Startup Period
- Use startupProbe for slow-starting apps (see the sketch after this list)
- Or increase initialDelaySeconds
Failure Threshold
- 3 failures is usually appropriate
- Balance between sensitivity and stability
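For the startup-period point above, a startupProbe holds off the liveness and readiness probes until its first success, which is usually cleaner than inflating initialDelaySeconds. A sketch with example thresholds (not tuned values):
# Add alongside livenessProbe/readinessProbe in the container spec.
startupProbe:
  httpGet:
    path: /health
    port: http
  periodSeconds: 5
  failureThreshold: 30   # allows up to 5s * 30 = 150s for startup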
Performance
Lightweight Checks
- No external dependencies
- No database queries
- Fast response (< 1ms)
No Authentication
- Health checks should not require auth
- Simplifies monitoring
Caching
- Consider caching health status
- Useful for expensive checks
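If a check ever grows an expensive dependency test (a database ping, for example), caching its result keeps the endpoint cheap. A minimal sketch, where checkDependencies is a hypothetical async function supplied by the caller:
// Sketch: reuse the last result for a few seconds instead of probing on every request.
const CACHE_TTL_MS = 5_000;

let lastResult = false;
let lastCheckedAt = 0;

async function cachedHealthCheck(
  checkDependencies: () => Promise<boolean>  // hypothetical expensive check
): Promise<boolean> {
  const now = Date.now();
  if (now - lastCheckedAt > CACHE_TTL_MS) {
    lastResult = await checkDependencies();
    lastCheckedAt = now;
  }
  return lastResult;
}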
Monitoring
Track Check Success Rate
- Alert on declining success rate
- Indicates infrastructure issues
Response Time Metrics
- Track P50, P95, P99
- Alert on degradation (example rules after this list)
Failure Patterns
- Correlate with deployments
- Identify systemic issues
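The success-rate and latency points above can both be expressed against the blackbox metrics scraped for the alerts earlier on this page; for example, rules like these could be added to the tonl_health group (thresholds are illustrative):
# Success rate: alert when fewer than 90% of probes succeeded over 5 minutes.
- alert: TonlHealthCheckSuccessRateLow
  expr: avg_over_time(probe_success{job="tonl-health"}[5m]) < 0.9
  for: 5m
  labels:
    severity: warning

# Latency: probe_duration_seconds is a gauge, so derive percentiles over time.
- record: tonl:health_probe_duration_seconds:p95
  expr: quantile_over_time(0.95, probe_duration_seconds{job="tonl-health"}[5m])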
See Also
- MCP Server - Server configuration
- Kubernetes Deployment - K8s setup
- Docker Deployment - Container configuration
- Production Deployment - Best practices
- Live Monitoring - Metrics dashboard