Health Checks for Kubernetes

Helium server provides HTTP health check endpoints designed for Kubernetes liveness and readiness probes. These endpoints run on a separate internal port (default: 9090) and are enabled for all worker modes.

Overview

Health checks help Kubernetes determine:

  • Liveness: Is the container alive? If the probe fails repeatedly, Kubernetes restarts the container.
  • Readiness: Is the container ready to handle requests? If not, Kubernetes stops routing traffic to it.

Helium implements both probe types on a dedicated HTTP server that runs alongside each worker mode.

Endpoints

Liveness Probe: /healthz

Returns 200 OK with a JSON response if the server is running:

{
  "status": "ok"
}

This endpoint always returns success if the health check server is responding. Kubernetes uses this to determine if the container should be restarted.
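
For a quick manual check (assuming the default port; the pod name is illustrative):

# Forward the health check port to your machine
kubectl port-forward pod/helium-grpc-xyz 9090:9090

# Expect HTTP 200 and {"status":"ok"}
curl -i http://localhost:9090/healthz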

Readiness Probe: /readyz

Checks connectivity to all dependencies before returning status:

Success Response (200 OK):

{
  "status": "ok",
  "database": "ok",
  "redis": "ok",
  "rabbitmq": "ok"
}

Failure Response (503 Service Unavailable):

{
  "status": "error",
  "database": "ok",
  "redis": "error",
  "rabbitmq": "ok",
  "error": "Redis error: Connection refused"
}

The readiness probe checks:

  • PostgreSQL: Executes a simple query (SELECT 1)
  • Redis: Sends a PING command
  • RabbitMQ: Validates connection pool status

All worker modes check the same three dependencies.
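
If you need to reproduce these checks by hand, roughly equivalent commands are shown below. This is a sketch, not the server's actual implementation; it assumes psql and redis-cli are installed and the standard connection-string variables are set:

# PostgreSQL: the same trivial query the probe uses
psql "$DATABASE_URL" -c "SELECT 1"

# Redis: expect PONG
redis-cli -u "$REDIS_URL" PING

# RabbitMQ: the probe inspects the internal connection pool, which has no
# direct CLI equivalent; checking that the broker port accepts connections
# is a rough approximation
nc -zv <rabbitmq-host> 5672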

Configuration

Health Check Port

Set the HEALTH_CHECK_PORT environment variable to customize the port (default: 9090):

export HEALTH_CHECK_PORT=9090

This port should be:

  • Internal only: Not exposed to external traffic
  • Accessible by Kubernetes: For probe requests
  • Different from main service ports: To avoid conflicts
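
To confirm the port is reachable from inside the cluster, you can probe it from a throwaway pod (a quick sanity check; the curlimages/curl image and the pod IP placeholder are illustrative):

# Find the pod's IP
kubectl get pod helium-grpc-xyz -o jsonpath='{.status.podIP}'

# Probe the health port from a temporary in-cluster pod
kubectl run debug --rm -it --image=curlimages/curl --restart=Never -- \
  curl -s http://<pod-ip>:9090/healthz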

Worker Modes

Health checks are available in all worker modes:

| Worker Mode   | Main Port | Health Check Port | Dependencies Checked      |
|---------------|-----------|-------------------|---------------------------|
| grpc          | 50051     | 9090              | Database, Redis, RabbitMQ |
| subscribe_api | 8080      | 9090              | Database, Redis, RabbitMQ |
| webhook_api   | 8081      | 9090              | Database, Redis, RabbitMQ |
| consumer      | N/A       | 9090              | Database, Redis, RabbitMQ |
| mailer        | N/A       | 9090              | Database, Redis, RabbitMQ |
| cron_executor | N/A       | 9090              | Database, Redis, RabbitMQ |

Kubernetes Deployment

Example Pod Configuration

Here’s how to configure health checks in your Kubernetes deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: helium-grpc
spec:
  replicas: 3
  selector:
    matchLabels:
      app: helium-grpc
  template:
    metadata:
      labels:
        app: helium-grpc
    spec:
      containers:
      - name: helium-grpc
        image: helium-server:latest
        env:
        - name: WORK_MODE
          value: "grpc"
        - name: LISTEN_ADDR
          value: "0.0.0.0:50051"
        - name: HEALTH_CHECK_PORT
          value: "9090"
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: helium-secrets
              key: database-url
        - name: REDIS_URL
          valueFrom:
            secretKeyRef:
              name: helium-secrets
              key: redis-url
        - name: MQ_URL
          valueFrom:
            secretKeyRef:
              name: helium-secrets
              key: mq-url
        ports:
        - name: grpc
          containerPort: 50051
          protocol: TCP
        - name: health
          containerPort: 9090
          protocol: TCP
        livenessProbe:
          httpGet:
            path: /healthz
            port: health
          initialDelaySeconds: 10
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /readyz
            port: health
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 5
          failureThreshold: 3
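
After applying the manifest, you can confirm that both probes are wired up correctly (standard kubectl commands; the file and pod names are illustrative):

# Apply the deployment and watch pods become Ready
kubectl apply -f helium-grpc-deployment.yaml
kubectl get pods -l app=helium-grpc -w

# Inspect the configured probes and recent events
kubectl describe pod helium-grpc-xyz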

Probe Configuration Guidelines

Liveness Probe:

  • initialDelaySeconds: 10-30 seconds (allow time for startup)
  • periodSeconds: 10-30 seconds (check periodically)
  • timeoutSeconds: 5 seconds
  • failureThreshold: 3 (restart after 3 consecutive failures)

Readiness Probe:

  • initialDelaySeconds: 5-10 seconds (faster than liveness)
  • periodSeconds: 5-10 seconds (check more frequently)
  • timeoutSeconds: 5 seconds
  • failureThreshold: 3 (mark unready after 3 failures)
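
If your workers are slow to start, a startup probe can replace a large initialDelaySeconds: liveness and readiness checks are suspended until it succeeds. A sketch, with illustrative threshold values, to place alongside the probes in the container spec above:

        startupProbe:
          httpGet:
            path: /healthz
            port: health
          periodSeconds: 5
          failureThreshold: 12  # up to 60s of startup before liveness applies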

Service Configuration

For API worker modes (grpc, subscribe_api, webhook_api), configure a Service:

apiVersion: v1
kind: Service
metadata:
  name: helium-grpc
spec:
  type: ClusterIP
  ports:
  - name: grpc
    port: 50051
    targetPort: grpc
    protocol: TCP
  selector:
    app: helium-grpc

Note: The health check port (9090) is not exposed in the Service. It’s only for Kubernetes probes.

Worker Mode Behavior

API Modes (grpc, subscribe_api, webhook_api)

For API modes, the health check server runs alongside the main API server:

  • When the main server exits, the health check server is immediately terminated
  • Process exits when either server fails
  • Ensures no “zombie” containers serving health checks without handling requests

Background Worker Modes (consumer, mailer, cron_executor)

For background workers, the health check server runs continuously:

  • Liveness probe confirms the worker process is alive
  • Readiness probe ensures dependencies are accessible
  • Worker loops indefinitely alongside health check server

Troubleshooting

Health Check Server Not Starting

Symptom: Probes fail immediately with connection errors

Solutions:

  1. Check logs for health check server errors
  2. Verify HEALTH_CHECK_PORT is not already in use
  3. Ensure the port is accessible within the pod
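
For example (pod name illustrative; the exact log text depends on Helium's log format):

# Look for health check server errors in the container logs
kubectl logs pod/helium-grpc-xyz | grep -i health

# Verify the server answers on the configured port inside the pod
# (assumes a shell and wget are available in the image)
kubectl exec pod/helium-grpc-xyz -- wget -qO- http://localhost:9090/healthz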

Readiness Probe Failing

Symptom: Pod remains in “Not Ready” state

Solutions:

  1. Check which dependency is failing in the /readyz response
  2. Verify connection strings (DATABASE_URL, REDIS_URL, MQ_URL)
  3. Ensure network policies allow pod access to dependencies
  4. Check if dependencies are healthy

Example debugging:

# Forward health check port to local machine
kubectl port-forward pod/helium-grpc-xyz 9090:9090

# Check readiness endpoint
curl http://localhost:9090/readyz

Liveness Probe Causing Restart Loop

Symptom: Pod repeatedly restarts with liveness probe failures

Solutions:

  1. Increase initialDelaySeconds (worker may need more startup time)
  2. Increase failureThreshold (allow more failures before restart)
  3. Check if worker is deadlocked or stuck (examine logs before restart)
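
The logs from the previous container instance and the probe events usually show why the restarts are happening:

# Logs from the container instance the liveness probe killed
kubectl logs pod/helium-grpc-xyz --previous

# Probe failure events for the pod
kubectl get events --field-selector involvedObject.name=helium-grpc-xyz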

Worker Exits But Pod Stays Running

Symptom: Container appears healthy but doesn’t process requests

This should not happen with the current implementation:

  • API workers: Health check is aborted when main server exits
  • Background workers: Return from execute_worker() causes process exit

If this occurs, file a bug report.

Security Considerations

Port Exposure

The health check port (9090) should never be exposed externally:

  • Don’t create Ingress rules for health check endpoints
  • Don’t expose the health check port in the Service definition
  • Use network policies to restrict access to Kubernetes control plane only
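
A NetworkPolicy along these lines keeps other pods away from the health port while still admitting main-service traffic. This is a sketch: kubelet probe traffic is typically exempt from NetworkPolicy enforcement, but the exact behavior depends on your CNI.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: helium-grpc-ingress
spec:
  podSelector:
    matchLabels:
      app: helium-grpc
  policyTypes:
  - Ingress
  ingress:
  - ports:
    - port: 50051        # main gRPC port only; 9090 is not listed
      protocol: TCP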

Sensitive Information

Health check responses contain minimal information:

  • No version numbers
  • No internal IPs or hostnames
  • No authentication tokens
  • Only dependency status (ok/error)

Note that the error field in a failing /readyz response, like the logs, may contain connection details. Restrict access to both appropriately.

Best Practices

  1. Use separate ports: Never combine health checks with main service endpoints
  2. Set appropriate timeouts: Balance between quick detection and false positives
  3. Monitor probe metrics: Track probe success rates in your observability stack
  4. Test locally: Use port-forwarding to verify health checks before deployment
  5. Account for sidecars: If using a sidecar proxy (Istio, Linkerd), configure startup probes, since probes can fail until the proxy is ready

Summary

Helium’s health check endpoints provide robust Kubernetes integration:

  • Liveness probe (/healthz): Detects unresponsive containers
  • Readiness probe (/readyz): Ensures dependencies are healthy
  • Separate port (default 9090): Isolated from main services
  • All worker modes: Consistent behavior across deployment types
  • Process lifecycle: Ensures clean exits, no zombie containers

Configure these probes in your Kubernetes deployments to enable automatic recovery and load balancing.