Highly Available Resilient Applications in Kubernetes 1 of 3

06 Oct 2017 [ kubernetes best-practices ]

This is the first post in a three-part series that outlines High Availability (HA) and application resilience best practices for running a custom or third-party application hosted inside a Kubernetes (K8s) cluster. High availability and resilience allow us to handle infrastructure and application failures, cloud outages where the Kubernetes cluster is still functional, rolling updates of K8s, and rolling updates of applications. One of the guiding principles of Kubernetes is HA fault tolerance, but Kubernetes only provides a platform on which to build applications that meet HA SLAs; it does not make applications fault tolerant by itself.

Whenever I think about architecture I follow a simple thought process: What do I need to do to design and deploy an application so that I am not woken up at 2am because a page goes off? And if I am woken up at 2am, how can I set up a system that will fail over and recover by itself by the time I log into the Kubernetes cluster?

TL;DR

These are not nice-to-haves, but must-haves for designing and deploying an application hosted in Kubernetes.

Make sure your application stops gracefully when it gets a SIGTERM signal. See the section Gracefully Handling Container Stop Signal Handling within Kubernetes.
Use an ENTRYPOINT with dumb-init so that signals are passed properly to your binary.
Use PostStart or PreStop hooks if your binary needs more TLC to start or stop gracefully.

Covered in Part Two

Use Deployment manifests for microservices, and use StatefulSets only if you need their features.
Jobs and DaemonSets do not provide out-of-the-box HA, but fill some use cases.
Persistent Volumes are the way to make data persistent.

Covered in Part Three

Use Liveness and Readiness Probes. Design your application to use and support them.
Use Affinity and Anti-Affinity Selectors if Pods need to be distributed across nodes.

Application Lifecycle Within Kubernetes

A container, which hosts an application, can be made aware of events in its lifecycle. This information is essential for a hosted application to be alerted that it has started, or notified that it is stopping. A stop notification occurs in various scenarios, including a Pod eviction from a Kubernetes node. Another such scenario is when a Kubernetes node is drained before that node is destroyed.

When an event occurs, kubelet calls into any registered container hook for that event. The hook calls are synchronous within the context of the Pod that contains the container. This means that for a PostStart hook, the container entry point and the hook fire asynchronously with respect to each other. Hooks also impact the state of a container within the Kubernetes system. For example, if a PostStart hook hangs or fails, the container will not reach the “running” state.

Container Hooks

Hooks execute as either an HTTP request or an execution of a command within the container. More detailed information is available in the Kubernetes documentation on Container Lifecycle Hooks.
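
To illustrate the HTTP variant from the application's side, here is a minimal Python sketch of an endpoint that a hook could call. It assumes the hook is configured as an HTTP GET against a path named /prestop; that path, the draining flag, and the function names are hypothetical and not part of Kubernetes or of the original setup.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Set once a hook request has asked the application to start draining.
draining = threading.Event()


class HookHandler(BaseHTTPRequestHandler):
    """Answers hook requests; /prestop is a hypothetical path chosen for this sketch."""

    def do_GET(self):
        if self.path == "/prestop":
            draining.set()
            self._reply(200, b"draining\n")
        else:
            self._reply(404, b"not found\n")

    def _reply(self, status, body):
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def start_hook_server(port=0):
    """Serve hook requests on a background thread; port 0 picks a free port.

    The sketch binds to 127.0.0.1; inside a Pod the server would need to
    listen on an address that kubelet can reach.
    """
    server = HTTPServer(("127.0.0.1", port), HookHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The rest of the application would check draining.is_set() and stop accepting new work once it is set.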

PostStart Hook

This hook fires after a container is created, and often runs at the same time as the container's entry point. Because the hook is called asynchronously, its timing relative to the entry point is not guaranteed.

PreStop Hook

This hook executes before a container is terminated, while the container's process is still running. The PreStop hook is blocking and must complete before the call to delete the container is sent to the Docker daemon.

Container Hook Use

To make an application more resilient, some application tasks may need to complete before shutdown, and some runtimes may need help with signal handling in order to stop. A container running a Java JVM, for example, often does not handle a shutdown gracefully.

Adding the following example PreStop hook to JVM-based containers is often helpful. This example lives within the container section of a Kubernetes manifest.

lifecycle:
  preStop:
    exec:
      command:
        # Send SIGTERM to the JVM, then wait until the process has exited.
        - /bin/bash
        - -c
        - PID=`pidof java` && kill -SIGTERM $PID && while ps -p $PID > /dev/null; do sleep 1; done;

Other applications, such as Nginx, shut down immediately when they receive a SIGTERM signal, without waiting for in-flight requests to finish. To stop Nginx gracefully, use the following preStop hook.

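# apps/v1 replaces extensions/v1beta1, which was removed in Kubernetes 1.16.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  # apps/v1 requires a selector that matches the Pod template labels.
  selector:
    matchLabels:
      app: nginx
  template: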
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
        lifecycle:
          preStop:
            exec:
              # use this command to gracefully shutdown
              command: ["/usr/sbin/nginx","-s","quit"]

Gracefully Handling Container Stop Signal Handling within Kubernetes

When Kubernetes shuts down a container, two different Unix signals may be sent: SIGTERM and SIGKILL. The following is an example of the workflow to stop a Pod and its container(s):

The Kubernetes API receives a call to delete a container or Pod.
Default grace period of 30s starts unless otherwise configured.
Pod status is set to “Terminating.”
Kubelet starts the Pod shutdown process.
If a preStop hook exists it executes.
The processes in the Pod’s containers are sent the SIGTERM signal.
If the processes are still running after the grace period expires, a SIGKILL signal is sent to the processes.
Kubelet updates the K8s API, removing the Pod, once it has finished deleting the Pod and its container(s).
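
The SIGTERM, grace period, SIGKILL ordering above can be sketched in a few lines of Python. This is only an illustration of the ordering, not how kubelet is implemented: kubelet works through the container runtime, and the preStop hook is omitted here. The names stop_with_grace and start_sleeper are hypothetical.

```python
import signal
import subprocess
import sys


def stop_with_grace(proc, grace_seconds=30.0):
    """Ask a child process to stop, and kill it if it ignores the request.

    Returns "SIGTERM" if the process exited within the grace period,
    or "SIGKILL" if it had to be killed after the grace period expired.
    """
    proc.send_signal(signal.SIGTERM)      # polite request to shut down
    try:
        proc.wait(timeout=grace_seconds)  # the grace period
        return "SIGTERM"
    except subprocess.TimeoutExpired:
        proc.kill()                       # SIGKILL cannot be caught or ignored
        proc.wait()
        return "SIGKILL"


def start_sleeper():
    """Start a stand-in workload: a child process that sleeps for a minute."""
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
```

A process that exits on SIGTERM never sees the second signal, which is the outcome an application should aim for.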

The application must receive the correct signals and handle them. Moreover, properly designed, properly behaving, and appropriately deployed applications should never get to the point where a SIGKILL signal is needed.
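
As a rough sketch of what handling SIGTERM can look like inside an application, the Python below records the signal in a flag and lets the main loop finish its current unit of work before cleaning up. GracefulShutdown, serve, do_work, and cleanup are hypothetical names used only for this illustration; a real service would plug in its own work and cleanup logic.

```python
import signal


class GracefulShutdown:
    """Remembers that SIGTERM arrived so the main loop can exit cleanly."""

    def __init__(self):
        self.requested = False

    def install(self):
        # signal.signal must be called from the main thread.
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        # Only record the request; the cleanup itself runs in the main loop.
        self.requested = True


def serve(shutdown, do_work, cleanup):
    """Call do_work() repeatedly until SIGTERM is seen, then call cleanup()."""
    while not shutdown.requested:
        do_work()
    cleanup()
```

Keeping the handler this small avoids doing real work inside a signal handler, and the cleanup step should finish well within the Pod's grace period.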

Complexity of Signal Handling in Containers

There is a well-known process ID 1 problem that can complicate signal handling within containers. Whether it shows up depends on the executable used in the Docker image's ENTRYPOINT.

TL;DR

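Process ID 1, or PID 1, is a special process ID that the kernel reserves for the init system. Because an init system is usually not used within containers, having an application run as PID 1 can cause unexpected and obscure-looking issues. For example, the kernel does not apply default signal actions to PID 1, so a SIGTERM is ignored unless the application explicitly installs a handler for it.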

Moreover, various implementations of the UNIX shell, /bin/sh, do not pass signals to their child processes. For instance, the default shell in the alpine base container does not forward signals to its child processes.

A simple solution is to use a binary that acts as a signal proxy and starts a child process as PID 2 inside a container.
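
To make the idea of a signal proxy concrete, here is a minimal Python sketch, assuming only that the proxy should start one child and pass SIGTERM, SIGINT, and SIGHUP on to it. It is not how any particular tool is implemented, it does not reap zombie processes, and run_with_signal_proxy is a hypothetical name.

```python
import signal
import subprocess

# Signals the proxy passes on to its child (an assumption for this sketch).
FORWARDED = (signal.SIGTERM, signal.SIGINT, signal.SIGHUP)


def run_with_signal_proxy(command):
    """Start command as a child process and forward selected signals to it.

    Returns the child's exit code (negative if a signal terminated it).
    """
    child = subprocess.Popen(command)

    def forward(signum, frame):
        child.send_signal(signum)

    # Install the forwarding handler and remember what was there before.
    previous = {sig: signal.signal(sig, forward) for sig in FORWARDED}
    try:
        while True:
            try:
                # Wake up regularly so pending signal handlers get to run.
                return child.wait(timeout=0.1)
            except subprocess.TimeoutExpired:
                continue
    finally:
        for sig, handler in previous.items():
            if handler is not None:
                signal.signal(sig, handler)
```

With a proxy like this in front of it, the application no longer runs as PID 1 and receives SIGTERM with its normal default behaviour.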

dumb-init

Various binaries exist that assist with PID management and signal proxying within containers. One such tool that is used within Trebuchet is dumb-init (https://github.com/Yelp/dumb-init). Yelp open sourced this small C-based binary to solve the two problems listed above, and more:

dumb-init starts as PID 1, and then starts the container application as PID 2.
dumb-init proxies any UNIX signals, such as SIGTERM, to its child process PID 2.
dumb-init reaps any zombie process created.

One of our base Docker images, dumb-init, contains the ENTRYPOINT for dumb-init.

ENTRYPOINT ["/sbin/dumb-init"]

If your application uses the above container as its base image, add a CMD instruction to your application's Dockerfile.

FROM "our-repo:dumb-init:0.ourversion"
CMD ["/my-app"]

When the above container executes its ENTRYPOINT, the CMD is passed to it as arguments, so dumb-init starts /my-app as its child process.
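
The way the two instructions combine can be modelled with a short Python sketch. It assumes both are written in exec (JSON array) form, as in the Dockerfiles above; container_argv is a hypothetical helper, not a Docker API.

```python
def container_argv(entrypoint, cmd):
    """Model how exec-form ENTRYPOINT and CMD combine: CMD is appended as arguments."""
    return list(entrypoint) + list(cmd)


# The base image supplies the ENTRYPOINT, the application image supplies the CMD.
argv = container_argv(["/sbin/dumb-init"], ["/my-app"])
print(argv)  # ['/sbin/dumb-init', '/my-app']
```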

The next posts will cover manifest types and controlling scheduling.