Kubernetes is a remarkable platform. On a cloud provider like DigitalOcean you will wonder how you ever lived without it, but if you happen to find yourself in an unstable virtualized environment where hardware resources are overcommitted, you may need to put in some work to get your services stable.
Kubernetes itself will run like a champ, but it may kill your services every 30 seconds. In a heavily overcommitted environment you can lose CPU time at random, and this will often cause your health checks to fail, especially httpGet checks. If the virtualization platform doesn’t schedule the VM’s CPU continuously, a service may close connections or fail to return a full response. If this happens three times in a row your service will be restarted, since by default a Kubernetes liveness probe runs every 10 seconds and tolerates only 3 consecutive failures.
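
To make the "three in a row" rule concrete, here is a minimal toy model in Python. It is not how the kubelet is implemented; the function name `restart_index` and the list of probe results are made up for illustration. It only shows that failures must be consecutive, and that a single successful check resets the count.

```python
def restart_index(results, failure_threshold=3):
    """Return the index of the probe that would trigger a restart, or None.

    `results` is a hypothetical list of probe outcomes (True = healthy).
    The default of 3 mirrors the Kubernetes default failureThreshold.
    """
    consecutive_failures = 0
    for index, healthy in enumerate(results):
        if healthy:
            # Any successful probe resets the failure count.
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= failure_threshold:
                return index
    return None


# A flaky but working service: two misses, one success, then three misses.
flaky = [True, False, False, True, False, False, False]

print(restart_index(flaky))                       # 6
print(restart_index(flaky, failure_threshold=4))  # None
```

With the default threshold the flaky service is restarted on its seventh check, while allowing just one more failure lets the same service survive this run of checks.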
So how do you stop Kubernetes from killing the services? They may actually be healthy and just having trouble responding to the health checks. Fortunately, the Kubernetes community made nearly all of the parameters configurable; it just isn’t very well documented outside the API reference.
Before I describe the Kubernetes settings adjustments, I want to mention that it would be preferable to fix the underlying problem. Badly overcommitted hardware is a serious problem and will lead to more trouble down the road if not fixed. This is why I prefer to run application containers on bare metal, as it is more efficient and stable, but that is a topic for another day, so on to fixing the immediate problem.
The first thing I do when I have this issue is increase the failureThreshold on the livenessProbe. This allows more consecutive failures before the service is killed. I like this option because Kubernetes still checks the service every 10 seconds, but more failures are required to cause a restart, giving the service more of a chance to return a valid response. In most cases this fixes the problem, even when the value is increased only a small amount.
I have noticed that this sometimes does not work for services that are under heavy load or that are also being health checked by something else, such as an external load balancer. In this case, you may need to increase the periodSeconds value, which defaults to 10. This controls how often health checks are performed. Setting it higher means Kubernetes does not ping the service as frequently, giving it more of a chance to do real work when it doesn’t have a lot of CPU time to spare.
It may be necessary to adjust both values, but I have yet to need to. For less critical services, it may be a good idea to run with a higher period to save on resource utilization. A modified example from the Kubernetes docs is below. Hopefully this will help someone out with stability issues.
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-http
spec:
  containers:
  - args:
    - /server
    image: gcr.io/google_containers/liveness
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
        httpHeaders:
        - name: X-Custom-Header
          value: Awesome
      periodSeconds: 20
      failureThreshold: 4
      initialDelaySeconds: 15
      timeoutSeconds: 1
    name: liveness
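
As a rough back-of-the-envelope check on what these two settings buy you, the Python sketch below compares the defaults with the values used in the example above. The `ProbeSettings` class and its method names are hypothetical helpers, not part of any Kubernetes client library. The "restart window" is only an approximation: periodSeconds multiplied by failureThreshold, ignoring timeoutSeconds and the time the probe itself takes.

```python
from dataclasses import dataclass


@dataclass
class ProbeSettings:
    # Defaults mirror the Kubernetes defaults for these two fields.
    period_seconds: int = 10
    failure_threshold: int = 3

    def restart_window_seconds(self) -> int:
        """Approximate time a service can stay unresponsive before a restart."""
        return self.period_seconds * self.failure_threshold

    def probes_per_minute(self) -> float:
        """How many health check requests the service has to answer per minute."""
        return 60 / self.period_seconds


defaults = ProbeSettings()
tuned = ProbeSettings(period_seconds=20, failure_threshold=4)

print(defaults.restart_window_seconds(), defaults.probes_per_minute())  # 30 6.0
print(tuned.restart_window_seconds(), tuned.probes_per_minute())        # 80 3.0
```

Under these assumptions the example gives a struggling service roughly 80 seconds instead of 30 to produce a good response, and halves the number of health check requests it has to serve. The trade-off is that a service that is truly dead also takes longer to be restarted.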