Health Checks

Production-ready health check endpoints for Kubernetes and Docker orchestration.

Overview

TONL-MCP Bridge v1.0.0 includes dedicated health check endpoints designed for container orchestration platforms. These endpoints enable:

  • Kubernetes liveness and readiness probes
  • Docker HEALTHCHECK instructions
  • Load balancer health monitoring
  • Zero-downtime deployments
  • Automated failover

Endpoints

Liveness Probe: /health

Indicates whether the server process is running and healthy.

Purpose:

  • Kubernetes uses this to restart unhealthy pods
  • Docker uses this to mark containers as unhealthy
  • Load balancers use this to remove unhealthy instances

Response:

json
{
  "status": "healthy",
  "uptime": 3600.5,
  "timestamp": "2025-12-07T19:00:00.000Z"
}

Characteristics:

  • Response time: < 1ms
  • No external dependencies
  • Always returns 200 OK if process is running
  • No authentication required

Example:

bash
curl http://localhost:3000/health

# Expected: 200 OK
# {
#   "status": "healthy",
#   "uptime": 125.3,
#   "timestamp": "2025-12-07T19:02:05.000Z"
# }
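
For orientation, here is a minimal sketch of a handler that produces this response shape, using Node's built-in http module and process.uptime(). It is an illustration only, not the bridge's actual implementation.

typescript
import { createServer } from 'node:http';

const server = createServer((req, res) => {
  if (req.method === 'GET' && req.url === '/health') {
    // Liveness: no external dependencies, always 200 while the process is up
    res.writeHead(200, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify({
      status: 'healthy',
      uptime: process.uptime(),            // seconds since the process started
      timestamp: new Date().toISOString(),
    }));
    return;
  }
  res.writeHead(404);
  res.end();
});

server.listen(3000);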

Readiness Probe: /ready

Indicates whether the server is ready to accept traffic.

Purpose:

  • Kubernetes uses this to route traffic only to ready pods
  • Ensures server has completed initialization
  • Can be extended to check database connections (future)

Response:

json
{
  "status": "ready",
  "timestamp": "2025-12-07T19:00:00.000Z"
}

Characteristics:

  • Response time: < 1ms
  • Returns 200 OK when ready for traffic
  • Returns 503 Service Unavailable during startup (future)
  • No authentication required

Example:

bash
curl http://localhost:3000/ready

# Expected: 200 OK
# {
#   "status": "ready",
#   "timestamp": "2025-12-07T19:02:05.000Z"
# }
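
Returning 503 during startup is listed above as a future extension. A hedged sketch of how a readiness handler could gate on an initialization flag (isReady and markReady are illustrative names, not part of the bridge's API):

typescript
import { createServer } from 'node:http';

// Illustrative flag: flip it once startup work (config load, warm-up, ...) is done.
let isReady = false;
export function markReady(): void {
  isReady = true;
}

const server = createServer((req, res) => {
  if (req.method === 'GET' && req.url === '/ready') {
    const statusCode = isReady ? 200 : 503; // 503 tells the orchestrator to hold traffic
    res.writeHead(statusCode, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify({
      status: isReady ? 'ready' : 'starting',
      timestamp: new Date().toISOString(),
    }));
    return;
  }
  res.writeHead(404);
  res.end();
});

server.listen(3000);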

Kubernetes Configuration

Deployment with Health Checks

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tonl-mcp-bridge
spec:
  replicas: 3
  selector:
    matchLabels:
      app: tonl-mcp-bridge
  template:
    metadata:
      labels:
        app: tonl-mcp-bridge
    spec:
      containers:
      - name: tonl-server
        image: ghcr.io/kryptomrx/tonl-mcp-bridge:latest
        ports:
        - name: http
          containerPort: 3000
        env:
        - name: TONL_AUTH_TOKEN
          valueFrom:
            secretKeyRef:
              name: tonl-secrets
              key: auth-token
        
        # Liveness probe
        livenessProbe:
          httpGet:
            path: /health
            port: http
          initialDelaySeconds: 10
          periodSeconds: 30
          timeoutSeconds: 5
          failureThreshold: 3
        
        # Readiness probe
        readinessProbe:
          httpGet:
            path: /ready
            port: http
          initialDelaySeconds: 5
          periodSeconds: 10
          timeoutSeconds: 3
          failureThreshold: 3
        
        resources:
          requests:
            memory: "256Mi"
            cpu: "100m"
          limits:
            memory: "512Mi"
            cpu: "500m"

Probe Configuration Best Practices

Liveness Probe:

  • initialDelaySeconds: 10 - Wait 10s after container start
  • periodSeconds: 30 - Check every 30 seconds
  • timeoutSeconds: 5 - Fail if no response in 5s
  • failureThreshold: 3 - Restart after 3 consecutive failures

Readiness Probe:

  • initialDelaySeconds: 5 - Start checking after 5s
  • periodSeconds: 10 - Check every 10 seconds
  • timeoutSeconds: 3 - Fail if no response in 3s
  • failureThreshold: 3 - Mark unready after 3 failures

Docker Configuration

Dockerfile with HEALTHCHECK

dockerfile
FROM node:18-alpine

WORKDIR /app

COPY package*.json ./
RUN npm ci --production

COPY dist ./dist

# curl is not included in node:18-alpine by default; install it for the HEALTHCHECK
RUN apk add --no-cache curl

# Health check
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
  CMD curl -f http://localhost:3000/health || exit 1

EXPOSE 3000

CMD ["node", "dist/mcp/index.js"]

Docker Compose

yaml
version: '3.8'
services:
  tonl-server:
    image: ghcr.io/kryptomrx/tonl-mcp-bridge:latest
    ports:
      - "3000:3000"
    environment:
      - TONL_AUTH_TOKEN=${TONL_AUTH_TOKEN}
      - NODE_ENV=production
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s
    restart: unless-stopped

Load Balancer Configuration

NGINX

nginx
upstream tonl_backend {
    server localhost:3000 max_fails=3 fail_timeout=30s;
    server localhost:3001 max_fails=3 fail_timeout=30s;
    server localhost:3002 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    server_name api.example.com;

    # Health check endpoint (no logging)
    location /health {
        proxy_pass http://tonl_backend/health;
        access_log off;
        proxy_connect_timeout 3s;
        proxy_read_timeout 3s;
    }

    # Application endpoints
    location / {
        proxy_pass http://tonl_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        
        # Active health check for this backend (health_check requires NGINX Plus;
        # open-source nginx uses the passive max_fails/fail_timeout set above)
        health_check interval=10s fails=3 passes=2 uri=/health;
    }
}

HAProxy

haproxy
backend tonl_servers
    balance roundrobin
    option httpchk GET /health
    http-check expect status 200
    
    server tonl1 localhost:3000 check inter 10s fall 3 rise 2
    server tonl2 localhost:3001 check inter 10s fall 3 rise 2
    server tonl3 localhost:3002 check inter 10s fall 3 rise 2

Monitoring & Alerting

Prometheus Alerts

yaml
groups:
- name: tonl_health
  interval: 30s
  rules:
  - alert: TonlServerDown
    expr: up{job="tonl-mcp-bridge"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "TONL server {{ $labels.instance }} is down"
      description: "Instance {{ $labels.instance }} has been down for more than 1 minute"

  - alert: TonlServerUnhealthy
    expr: probe_success{job="tonl-health"} == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "TONL server {{ $labels.instance }} health check failing"
      description: "Health check failing for {{ $labels.instance }}"

Blackbox Exporter

yaml
modules:
  http_tonl_health:
    prober: http
    timeout: 5s
    http:
      valid_status_codes: [200]
      method: GET
      preferred_ip_protocol: ip4

Testing Health Checks

Manual Testing

bash
# Test liveness
curl -f http://localhost:3000/health
echo $?  # Should be 0

# Test readiness
curl -f http://localhost:3000/ready
echo $?  # Should be 0

# Test with timeout
timeout 5s curl http://localhost:3000/health

# Load test health endpoint
ab -n 1000 -c 10 http://localhost:3000/health

Automated Testing

bash
#!/bin/bash
# Health check in CI/CD: wait for the server to report healthy before continuing
MAX_RETRIES=30
RETRY_DELAY=2

for i in $(seq 1 $MAX_RETRIES); do
  if curl -sf http://localhost:3000/health > /dev/null; then
    echo "Server is healthy"
    exit 0
  fi
  echo "Waiting for server... ($i/$MAX_RETRIES)"
  sleep $RETRY_DELAY
done

echo "Server failed to become healthy"
exit 1

Integration Tests

typescript
import { describe, it, expect, beforeAll, afterAll } from 'vitest';
import { startHttpServer } from '../src/mcp/server.js';

describe('Health Check Endpoints', () => {
  let server;
  let port = 3000;

  beforeAll(() => {
    server = startHttpServer(port);
  });

  afterAll(() => {
    server.close();
  });

  it('should return 200 OK for /health', async () => {
    const response = await fetch(`http://localhost:${port}/health`);
    expect(response.status).toBe(200);
    
    const data = await response.json();
    expect(data.status).toBe('healthy');
    expect(data).toHaveProperty('uptime');
    expect(data).toHaveProperty('timestamp');
  });

  it('should return 200 OK for /ready', async () => {
    const response = await fetch(`http://localhost:${port}/ready`);
    expect(response.status).toBe(200);
    
    const data = await response.json();
    expect(data.status).toBe('ready');
    expect(data).toHaveProperty('timestamp');
  });

  it('should respond quickly (< 100ms)', async () => {
    const start = Date.now();
    await fetch(`http://localhost:${port}/health`);
    const duration = Date.now() - start;
    
    expect(duration).toBeLessThan(100);
  });
});

Troubleshooting

Health Check Failing

Symptom: Health check returns non-200 status or times out

Possible causes:

  1. Server not started
  2. Port not accessible
  3. Firewall blocking requests
  4. Server overloaded

Solutions:

bash
# Check if server is running
ps aux | grep tonl

# Check port binding
netstat -an | grep 3000

# Test locally
curl -v http://localhost:3000/health

# Check logs
docker logs tonl-server
kubectl logs deployment/tonl-mcp-bridge

Kubernetes Pod Restart Loop

Symptom: Pods continuously restart

Possible causes:

  1. Liveness probe too aggressive
  2. Server startup time > initialDelaySeconds
  3. Application crash during startup

Solutions:

bash
# Check pod events
kubectl describe pod <pod-name>

# Check logs before crash
kubectl logs <pod-name> --previous

yaml
# Increase initialDelaySeconds in the deployment spec:
livenessProbe:
  initialDelaySeconds: 30  # Increase from 10

Load Balancer Not Routing Traffic

Symptom: No traffic reaching backend servers

Possible causes:

  1. Readiness probe failing
  2. All backends marked unhealthy
  3. Incorrect health check path

Solutions:

bash
# Check backend status
kubectl get pods -l app=tonl-mcp-bridge

# Test health check directly
kubectl port-forward pod/<pod-name> 3000:3000
curl http://localhost:3000/ready

# Check load balancer config
kubectl describe service tonl-mcp-bridge

Best Practices

Probe Configuration

  1. Separate Liveness and Readiness

    • Liveness: Process health only
    • Readiness: Ready to serve traffic
  2. Appropriate Timeouts

    • Liveness: Longer intervals (30s)
    • Readiness: Shorter intervals (10s)
  3. Startup Period

    • Use startupProbe for slow-starting apps
    • Or increase initialDelaySeconds
  4. Failure Threshold

    • 3 failures is usually appropriate
    • Balance between sensitivity and stability

Performance

  1. Lightweight Checks

    • No external dependencies
    • No database queries
    • Fast response (< 1ms)
  2. No Authentication

    • Health checks should not require auth
    • Simplifies monitoring
  3. Caching

    • Consider caching health status (see the sketch after this list)
    • Useful for expensive checks
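
If a health or readiness check ever grows an expensive dependency probe (a database ping, for example), caching the result behind a short TTL keeps the endpoint fast. A minimal sketch, assuming a hypothetical checkDatabase() helper:

typescript
// Hypothetical expensive dependency probe (e.g. a database ping); stubbed for illustration.
async function checkDatabase(): Promise<boolean> {
  return true;
}

const TTL_MS = 5_000;      // re-run the expensive probe at most every 5 seconds
let cachedHealthy = false;
let cachedAt = 0;

// Cheap wrapper a health/readiness handler can call on every request.
export async function isHealthy(): Promise<boolean> {
  const now = Date.now();
  if (now - cachedAt > TTL_MS) {
    cachedHealthy = await checkDatabase();
    cachedAt = now;
  }
  return cachedHealthy;
}

A TTL of a few seconds is usually enough to absorb probe bursts from multiple orchestrators without hiding real failures.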

Monitoring

  1. Track Check Success Rate

    • Alert on declining success rate
    • Indicates infrastructure issues
  2. Response Time Metrics

    • Track P50, P95, P99 (a polling sketch follows this list)
    • Alert on degradation
  3. Failure Patterns

    • Correlate with deployments
    • Identify systemic issues
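
Outside of Prometheus, the same signals can be collected with a small poller that records success and latency samples for later aggregation. The endpoint URL and interval below are illustrative:

typescript
// Poll /health periodically and keep simple success/latency samples.
const HEALTH_URL = 'http://localhost:3000/health'; // illustrative target
const samples: { ok: boolean; ms: number }[] = [];

async function poll(): Promise<void> {
  const start = Date.now();
  let ok = false;
  try {
    const res = await fetch(HEALTH_URL);       // Node 18+ global fetch
    ok = res.status === 200;
  } catch {
    ok = false;                                // network error counts as a failure
  }
  samples.push({ ok, ms: Date.now() - start });
}

// Percentile (e.g. 0.95 for P95) over recorded latencies.
function percentile(p: number): number {
  const sorted = samples.map((s) => s.ms).sort((a, b) => a - b);
  return sorted[Math.min(sorted.length - 1, Math.floor(p * sorted.length))] ?? 0;
}

setInterval(poll, 10_000); // check every 10 seconds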
