Health Checks for Kubernetes

Helium server provides HTTP health check endpoints designed for Kubernetes liveness and readiness probes. These endpoints run on a separate internal port (default: 9090) and are enabled for all worker modes.

Overview

Health checks help Kubernetes determine:

  • Liveness: Is the container alive? If the probe fails repeatedly, Kubernetes restarts the container.
  • Readiness: Is the container ready to handle requests? If not, Kubernetes stops routing traffic to it.

Helium implements both probe types on a dedicated HTTP server that runs alongside each worker mode.

Endpoints

Liveness Probe: /healthz

Returns 200 OK with a JSON response if the server is running:

{
  "status": "ok"
}

This endpoint always returns success if the health check server is responding. Kubernetes uses this to determine if the container should be restarted.
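
For a quick manual check (assuming the default port; the pod name is illustrative):

# Forward the health check port to your machine
kubectl port-forward pod/helium-grpc-xyz 9090:9090

# Expect HTTP 200 and {"status":"ok"}
curl -i http://localhost:9090/healthz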

Readiness Probe: /readyz

Checks connectivity to all dependencies before returning status:

Success Response (200 OK):

{
  "status": "ok",
  "database": "ok",
  "redis": "ok",
  "rabbitmq": "ok"
}

Failure Response (503 Service Unavailable):

{
  "status": "error",
  "database": "ok",
  "redis": "error",
  "rabbitmq": "ok",
  "error": "Redis error: Connection refused"
}

The readiness probe checks:

  • PostgreSQL: Executes a simple query (SELECT 1)
  • Redis: Sends a PING command
  • RabbitMQ: Validates connection pool status

All worker modes check the same three dependencies.
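
If you need to reproduce these checks by hand, roughly equivalent commands are shown below. This is a sketch, not the server's actual implementation; it assumes psql and redis-cli are installed and the standard connection-string variables are set:

# PostgreSQL: the same trivial query the probe uses
psql "$DATABASE_URL" -c "SELECT 1"

# Redis: expect PONG
redis-cli -u "$REDIS_URL" PING

# RabbitMQ: the probe inspects the internal connection pool, which has no
# direct CLI equivalent; checking that the broker port accepts connections
# is a rough approximation
nc -zv <rabbitmq-host> 5672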

Configuration

Health Check Port

Set the HEALTH_CHECK_PORT environment variable to customize the port (default: 9090):

export HEALTH_CHECK_PORT=9090

This port should be:

  • Internal only: Not exposed to external traffic
  • Accessible by Kubernetes: For probe requests
  • Different from main service ports: To avoid conflicts
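
To confirm the port is reachable from inside the cluster, you can probe it from a throwaway pod (a quick sanity check; the curlimages/curl image and the pod IP placeholder are illustrative):

# Find the pod's IP
kubectl get pod helium-grpc-xyz -o jsonpath='{.status.podIP}'

# Probe the health port from a temporary in-cluster pod
kubectl run debug --rm -it --image=curlimages/curl --restart=Never -- \
  curl -s http://<pod-ip>:9090/healthz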

Worker Modes

Health checks are available in all worker modes:

| Worker Mode   | Main Port | Health Check Port | Dependencies Checked      |
|---------------|-----------|-------------------|---------------------------|
| grpc          | 50051     | 9090              | Database, Redis, RabbitMQ |
| subscribe_api | 8080      | 9090              | Database, Redis, RabbitMQ |
| webhook_api   | 8081      | 9090              | Database, Redis, RabbitMQ |
| consumer      | N/A       | 9090              | Database, Redis, RabbitMQ |
| mailer        | N/A       | 9090              | Database, Redis, RabbitMQ |
| cron_executor | N/A       | 9090              | Database, Redis, RabbitMQ |

Kubernetes Deployment

Example Pod Configuration

Here’s how to configure health checks in your Kubernetes deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: helium-grpc
spec:
  replicas: 3
  selector:
    matchLabels:
      app: helium-grpc
  template:
    metadata:
      labels:
        app: helium-grpc
    spec:
      containers:
      - name: helium-grpc
        image: helium-server:latest
        env:
        - name: WORK_MODE
          value: "grpc"
        - name: LISTEN_ADDR
          value: "0.0.0.0:50051"
        - name: HEALTH_CHECK_PORT
          value: "9090"
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: helium-secrets
              key: database-url
        - name: REDIS_URL
          valueFrom:
            secretKeyRef:
              name: helium-secrets
              key: redis-url
        - name: MQ_URL
          valueFrom:
            secretKeyRef:
              name: helium-secrets
              key: mq-url
        ports:
        - name: grpc
          containerPort: 50051
          protocol: TCP
        - name: health
          containerPort: 9090
          protocol: TCP
        livenessProbe:
          httpGet:
            path: /healthz
            port: health
          initialDelaySeconds: 10
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /readyz
            port: health
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 5
          failureThreshold: 3
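
After applying the manifest, you can confirm that both probes are wired up correctly (standard kubectl commands; the file and pod names are illustrative):

# Apply the deployment and watch pods become Ready
kubectl apply -f helium-grpc-deployment.yaml
kubectl get pods -l app=helium-grpc -w

# Inspect the configured probes and recent events
kubectl describe pod helium-grpc-xyz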

Probe Configuration Guidelines

Liveness Probe:

  • initialDelaySeconds: 10-30 seconds (allow time for startup)
  • periodSeconds: 10-30 seconds (check periodically)
  • timeoutSeconds: 5 seconds
  • failureThreshold: 3 (restart after 3 consecutive failures)

Readiness Probe:

  • initialDelaySeconds: 5-10 seconds (faster than liveness)
  • periodSeconds: 5-10 seconds (check more frequently)
  • timeoutSeconds: 5 seconds
  • failureThreshold: 3 (mark unready after 3 failures)
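
If your workers are slow to start, a startup probe can replace a large initialDelaySeconds: liveness and readiness checks are suspended until it succeeds. A sketch, with illustrative threshold values, to place alongside the probes in the container spec above:

        startupProbe:
          httpGet:
            path: /healthz
            port: health
          periodSeconds: 5
          failureThreshold: 12  # up to 60s of startup before liveness applies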

Service Configuration

For API worker modes (grpc, subscribe_api, webhook_api), configure a Service:

apiVersion: v1
kind: Service
metadata:
  name: helium-grpc
spec:
  type: ClusterIP
  ports:
  - name: grpc
    port: 50051
    targetPort: grpc
    protocol: TCP
  selector:
    app: helium-grpc

Note: The health check port (9090) is not exposed in the Service. It’s only for Kubernetes probes.

Worker Mode Behavior

API Modes (grpc, subscribe_api, webhook_api)

For API modes, the health check server runs alongside the main API server:

  • When the main server exits, the health check server is immediately terminated
  • Process exits when either server fails
  • Ensures no “zombie” containers serving health checks without handling requests

Background Worker Modes (consumer, mailer, cron_executor)

For background workers, the health check server runs continuously:

  • Liveness probe confirms the worker process is alive
  • Readiness probe ensures dependencies are accessible
  • Worker loops indefinitely alongside health check server

Troubleshooting

Health Check Server Not Starting

Symptom: Probes fail immediately with connection errors

Solutions:

  1. Check logs for health check server errors
  2. Verify HEALTH_CHECK_PORT is not already in use
  3. Ensure the port is accessible within the pod
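
For example (pod name illustrative; the exact log text depends on Helium's log format):

# Look for health check server errors in the container logs
kubectl logs pod/helium-grpc-xyz | grep -i health

# Verify the server answers on the configured port inside the pod
# (assumes a shell and wget are available in the image)
kubectl exec pod/helium-grpc-xyz -- wget -qO- http://localhost:9090/healthz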

Readiness Probe Failing

Symptom: Pod remains in “Not Ready” state

Solutions:

  1. Check which dependency is failing in the /readyz response
  2. Verify connection strings (DATABASE_URL, REDIS_URL, MQ_URL)
  3. Ensure network policies allow pod access to dependencies
  4. Check if dependencies are healthy

Example debugging:

# Forward health check port to local machine
kubectl port-forward pod/helium-grpc-xyz 9090:9090

# Check readiness endpoint
curl http://localhost:9090/readyz

Liveness Probe Causing Restart Loop

Symptom: Pod repeatedly restarts with liveness probe failures

Solutions:

  1. Increase initialDelaySeconds (worker may need more startup time)
  2. Increase failureThreshold (allow more failures before restart)
  3. Check if worker is deadlocked or stuck (examine logs before restart)
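
The logs from the previous container instance and the probe events usually show why the restarts are happening:

# Logs from the container instance the liveness probe killed
kubectl logs pod/helium-grpc-xyz --previous

# Probe failure events for the pod
kubectl get events --field-selector involvedObject.name=helium-grpc-xyz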

Worker Exits But Pod Stays Running

Symptom: Container appears healthy but doesn’t process requests

This should not happen with the current implementation:

  • API workers: Health check is aborted when main server exits
  • Background workers: Return from execute_worker() causes process exit

If this occurs, file a bug report.

Security Considerations

Port Exposure

The health check port (9090) should never be exposed externally:

  • Don’t create Ingress rules for health check endpoints
  • Don’t expose the health check port in the Service definition
  • Use network policies to restrict access to Kubernetes control plane only
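
A NetworkPolicy along these lines keeps other pods away from the health port while still admitting main-service traffic. This is a sketch: kubelet probe traffic is typically exempt from NetworkPolicy enforcement, but the exact behavior depends on your CNI.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: helium-grpc-ingress
spec:
  podSelector:
    matchLabels:
      app: helium-grpc
  policyTypes:
  - Ingress
  ingress:
  - ports:
    - port: 50051        # main gRPC port only; 9090 is not listed
      protocol: TCP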

Sensitive Information

Health check responses contain minimal information:

  • No version numbers
  • No internal IPs or hostnames
  • No authentication tokens
  • Only dependency status (ok/error)

Note that the error field in a failing /readyz response, like the logs, may contain connection details. Restrict access to both appropriately.

Best Practices

  1. Use separate ports: Never combine health checks with main service endpoints
  2. Set appropriate timeouts: Balance between quick detection and false positives
  3. Monitor probe metrics: Track probe success rates in your observability stack
  4. Test locally: Use port-forwarding to verify health checks before deployment
  5. Account for sidecars: If using a sidecar proxy (Istio, Linkerd), configure startup probes, since probes can fail until the proxy is ready

Summary

Helium’s health check endpoints provide robust Kubernetes integration:

  • Liveness probe (/healthz): Detects unresponsive containers
  • Readiness probe (/readyz): Ensures dependencies are healthy
  • Separate port (default 9090): Isolated from main services
  • All worker modes: Consistent behavior across deployment types
  • Process lifecycle: Ensures clean exits, no zombie containers

Configure these probes in your Kubernetes deployments to enable automatic recovery and load balancing.