Highly Available Resilient Applications in Kubernetes 2 of 3

This is the second post in a three-part series that outlines High Availability (HA) and application resilience best practices for running a custom or third-party application hosted inside a Kubernetes (K8s) cluster.

Topics Covered

Kubernetes Controller Manifests

To host a container on a Kubernetes cluster, the Kubernetes API defines various controller types. These types are described by controller manifests, which are written as YAML or JSON documents. Because specific controller types provide specific capabilities, different manifest types are recommended for creating HA applications.

  1. Deployments - declarative packaging for Pods and ReplicaSets.
  2. StatefulSets - a Controller that provides multiple guarantees: unique identity, ordering of deployment, and storage.

Other controller types offer some features that can be used to create an application that is HA, but using these types is far more complex.

  1. Job - creates one or more Pods and ensures that a specified number of them terminate successfully.
  2. DaemonSet - ensures a single Pod runs on each node, as filtered by scheduling rules.

The following controller manifests do not follow HA patterns or have been replaced by newer manifest types.

  1. Pod - the packaging of one or more containers deployed to a single node.
  2. ReplicaSet - used internally by Deployments.
  3. ReplicationController - replaced by Deployments.

All of the above controller types provide the basic Kubernetes functionality of restarting failed containers. Restarting a failed container is the most basic pattern that assists with HA resiliency.

Deployments

When deploying a stateless microservice, a Deployment is typically the recommended controller. This manifest includes the capability to have multiple Pods that are deployed across multiple nodes. When a Deployment is paired with a Kubernetes Service, the containers within the Deployment are load balanced behind an internal or external endpoint. When clients communicate with that endpoint, an event such as a cloud availability zone (AZ) failure does not degrade application service.

Features include:

  1. Multiple replicas of pods within a deployment
  2. Capability for a rolling update of the deployment
  3. Ability to roll back a deployment
  4. Ability to scale the number of replicas in a deployment as needed

Two key principles must be observed when using Deployments: use more than one replica, and create an application that can function with more than one replica. Not following these principles only provides the capability for Pod restarts.

The Kubernetes API and kubectl provide the capability for both rolling updates and rollbacks. A Deployment’s rollout is triggered when the Deployment’s Pod template is updated.

Example

A very simple Deployment with an internal service.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx-deployment
  template:
    metadata:
      labels:
        app: nginx-deployment
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80
---
kind: Service
apiVersion: v1
metadata:
  name: nginx-deployment
spec:
  selector:
    app: nginx-deployment
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80

The following command will upgrade the above deployment to use nginx 1.9.1.

$ kubectl set image deployment/nginx-deployment nginx=nginx:1.9.1
deployment "nginx-deployment" image updated

Immediately after that command, the Deployment performs a rolling update, replacing each replica with a new Pod running nginx 1.9.1. This rollout does not cause service loss.

Deployments can also be rolled back. For example, if the wrong image is set:

$ kubectl set image deployment/nginx-deployment nginx=nginx:1.91
deployment "nginx-deployment" image updated

The new Pods will be stuck in an image pull error loop, but this will not impact the service level of the Deployment. The rollout status is available with the following command.

$ kubectl rollout status deployments nginx-deployment
Waiting for rollout to finish: 2 out of 3 new replicas have been updated...

To roll back to the previous version, execute the following command:

$ kubectl rollout undo deployment/nginx-deployment
deployment "nginx-deployment" rolled back

By default, a limited number of previous rollout revisions are kept in Kubernetes (configurable via the Deployment’s spec.revisionHistoryLimit field). Any revision that still exists in Kubernetes can be rolled back to. The revision history is accessible via the API.

$ kubectl rollout history deployment/nginx-deployment
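A rollback can also target a specific revision from that history. As a sketch, assuming a revision numbered 2 exists for this Deployment:

```shell
# Inspect the pod template recorded for revision 2
$ kubectl rollout history deployment/nginx-deployment --revision=2

# Roll back to that specific revision
$ kubectl rollout undo deployment/nginx-deployment --to-revision=2
```

Without --to-revision, kubectl rollout undo returns to the immediately previous revision.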

Deployments are the primary mechanism recommended for HA applications, including when hosting a stateless application inside of Kubernetes. At times Deployments have sufficient capability to run stateful applications inside of Kubernetes; otherwise, the next section covers StatefulSets, which were developed with features required by some distributed stateful applications.

Deployment Documentation

More documentation about Deployments is available here.

StatefulSets

Creating resilient stateless applications is very different from creating resilient stateful applications. Deployments can be used with stateful applications, but HA distributed stateful applications often have requirements that are not met by Deployments.

StatefulSets provide the following unique features.

  1. A stable hostname, available to clients in DNS. The hostname is based on the StatefulSet name plus an ordinal starting at zero, for example cassandra-0.
  2. An ordinal index of Pods. 0, 1, 2, 3, etc.
  3. Stable storage linked to the ordinal and hostname of the Pod.
  4. Peer discovery via DNS. For example, with Cassandra, the names of the peers are known before the Pods are created.
  5. Startup and teardown ordering. The ordinal of the next Pod to be created is known, as is which Pod will be destroyed when the Set size is reduced. This feature is useful for administrative tasks such as draining data from a Pod when reducing the size of a cluster.

Only choose StatefulSets when one of the above requirements is needed to run an application hosted inside of Kubernetes. Many applications such as Kafka, Elasticsearch, ZooKeeper, and Cassandra require one or more of the above capabilities.

As with Deployments, scaling is provided by StatefulSets, but at the time of this writing rolling updates and rollbacks are not supported.
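For example, a StatefulSet such as the Cassandra cluster below can be scaled imperatively (the target size here is illustrative):

```shell
# Shrink the StatefulSet from 3 replicas to 2; the highest-ordinal
# Pod (cassandra-2) is the one that will be terminated
$ kubectl scale statefulset cassandra --replicas=2
```

For a stateful application such as Cassandra, data should be drained (decommissioned) from the highest-ordinal Pod before scaling down, as noted in the teardown-ordering feature above.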

Example

The following example is for a Cassandra cluster hosted as a StatefulSet.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra
spec:
  serviceName: cassandra
  replicas: 3
  selector:
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      containers:
      - name: cassandra
        image: quay.io/vorstella/cassandra
        imagePullPolicy: Always
        ports:
        - containerPort: 7000
          name: intra-node
        - containerPort: 7001
          name: tls-intra-node
        - containerPort: 7199
          name: jmx
        - containerPort: 9042
          name: cql
        resources:
          limits:
            cpu: "500m"
            memory: 1Gi
          requests:
            cpu: "500m"
            memory: 1Gi
        securityContext:
          capabilities:
            add:
              - IPC_LOCK
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "PID=$(pidof java) && kill $PID && while ps -p $PID > /dev/null; do sleep 1; done"]
        env:
          - name: MAX_HEAP_SIZE
            value: 512M
          - name: HEAP_NEWSIZE
            value: 100M
          - name: CASSANDRA_SEEDS
            value: "cassandra-0.cassandra.default.svc.cluster.local,cassandra-1.cassandra.default.svc.cluster.local"
          - name: CASSANDRA_CLUSTER_NAME
            value: "K8Demo"
          - name: CASSANDRA_DC
            value: "DC1-K8Demo"
          - name: CASSANDRA_RACK
            value: "Rack1-K8Demo"
          - name: CASSANDRA_AUTO_BOOTSTRAP
            value: "false"
          - name: POD_IP
            valueFrom:
              fieldRef:
                fieldPath: status.podIP
          - name: POD_NAMESPACE
            valueFrom:
              fieldRef:
                fieldPath: metadata.namespace
        readinessProbe:
          exec:
            command:
            - /bin/bash
            - -c
            - /ready-probe.sh
          initialDelaySeconds: 15
          timeoutSeconds: 5
        # These volume mounts are persistent. They are like inline claims,
        # but not exactly because the names need to match exactly one of
        # the stateful pod volumes.
        volumeMounts:
        - name: cassandra-data
          mountPath: /cassandra_data
  # These are converted to volume claims by the controller
  # and mounted at the paths mentioned above.
  volumeClaimTemplates:
  - metadata:
      name: cassandra-data
    spec:
      storageClassName: fast
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi
---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: fast
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: cassandra
  name: cassandra
spec:
  clusterIP: None
  ports:
    - port: 9042
  selector:
    app: cassandra

StatefulSets Documentation

More documentation about StatefulSets is available here.

Jobs

A Job is a controller that only offers HA replay and restart patterns. This manifest can be used within Kubernetes, but it is a more complex model for deploying a resilient application. This controller does lend itself well to queue workers in CQRS architectures. The application containers must handle failure cases themselves, rather than relying on the Kubernetes platform to provide fault tolerance and recovery.

This controller creates one or more Pods and ensures that a specified number of them run to successful completion. A simple case is to create one Job that reliably runs one Pod to completion: the Job object starts a new Pod if the first Pod fails or is deleted, and retries can be bounded within a configurable duration. A Job can also run multiple Pods in parallel.
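A parallel Job is configured with the completions and parallelism fields of the Job spec. The sketch below (names and values illustrative) requires six successful completions, with at most two Pods running at a time:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: parallel-work
spec:
  completions: 6    # total successful Pod completions required
  parallelism: 2    # maximum Pods running at any one time
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: worker
        image: perl
        command: ["perl", "-e", "exit 0"]
```

The Job is considered complete once six Pods have terminated successfully.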

Job Termination and Cleanup

When a Job’s Pods fail, by default the Job will keep creating new Pods. Retrying forever can be a useful pattern: if an external dependency of the Job’s Pods is temporarily missing, the Job will keep trying until it completes.

However, in cases where a program should not retry forever, a deadline can be set on the Job, as in the example below.

Setting the Job’s spec.activeDeadlineSeconds field to a number of seconds controls cleanup. When the deadline passes, the Job’s status is updated with reason: DeadlineExceeded, no more Pods are created, and existing Pods are deleted.

Example

An example of a simple job running a Perl command.

apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  activeDeadlineSeconds: 20
  template:
    metadata:
      name: pi
    spec:
      restartPolicy: Never
      containers:
      - name: pi
        image: perl
        command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(2000)"]

Jobs Documentation

More documentation about Jobs is available here.

DaemonSets

Another controller that is very useful, but does not lend itself toward high availability, is the DaemonSet. Deploying an HA application as a DaemonSet is very complex.

This manifest ensures that a copy of a Pod runs on all nodes, or on a subset of nodes selected by scheduling rules.

Some typical uses of a DaemonSet are:

  • running a CNI networking provider
  • running a logs collection daemon
  • running a node monitoring daemon

A DaemonSet is analogous to having a systemd unit hosted on every node. Just as with Jobs, to be HA, applications hosted as DaemonSets must include functionality to handle failover and restarts. Often replay logs and other patterns are used.
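To restrict a DaemonSet to a subset of nodes, a nodeSelector can be added to the Pod template. The sketch below is illustrative: the log-collector name, the fluentd image, and the disktype: ssd node label are all assumptions, not part of the Weave example that follows.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-collector
spec:
  selector:
    matchLabels:
      name: log-collector
  template:
    metadata:
      labels:
        name: log-collector
    spec:
      nodeSelector:
        disktype: ssd   # hypothetical node label; only matching nodes run the Pod
      containers:
      - name: collector
        image: fluentd
```

Without the nodeSelector, the controller schedules one copy of the Pod on every eligible node.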

Example

The following is a DaemonSet for Weave.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: weave-net
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: weave-net
  template:
    metadata:
      labels:
        name: weave-net
    spec:
      tolerations:
        - key: dedicated
          operator: Equal
          value: master
          effect: NoSchedule
      hostNetwork: true
      hostPID: true
      containers:
        - name: weave
          image: weaveworks/weave-kube:latest
          imagePullPolicy: Always
          command:
            - /home/weave/launch.sh
          livenessProbe:
            initialDelaySeconds: 30
            httpGet:
              host: 127.0.0.1
              path: /status
              port: 6784
          securityContext:
            privileged: true
          volumeMounts:
            - name: weavedb
              mountPath: /weavedb
            - name: cni-bin
              mountPath: /host/opt
            - name: cni-bin2
              mountPath: /host/home
            - name: cni-conf
              mountPath: /host/etc
            - name: dbus
              mountPath: /host/var/lib/dbus
            - name: lib-modules
              mountPath: /lib/modules
          resources:
            requests:
              cpu: 10m
        - name: weave-npc
          image: weaveworks/weave-npc:latest
          imagePullPolicy: Always
          resources:
            requests:
              cpu: 10m
          securityContext:
            privileged: true
      restartPolicy: Always
      volumes:
        - name: weavedb
          emptyDir: {}
        - name: cni-bin
          hostPath:
            path: /opt
        - name: cni-bin2
          hostPath:
            path: /home
        - name: cni-conf
          hostPath:
            path: /etc
        - name: dbus
          hostPath:
            path: /var/lib/dbus
        - name: lib-modules
          hostPath:
            path: /lib/modules

DaemonSets Documentation

More documentation about DaemonSets is available here.

Kubernetes Storage

Storage in a container is ephemeral when the storage is not a mounted volume. Within Kubernetes, some volume types are transient: emptyDir and hostPath. Neither of these maintains persistence across Pod rescheduling, node restarts, or node failures.
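For illustration, an emptyDir volume is declared in the Pod spec and mounted by a container. Its contents survive container restarts but are lost when the Pod is removed from the node (the names here are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cache-pod
spec:
  containers:
  - name: app
    image: nginx
    volumeMounts:
    - name: scratch        # must match a volume name below
      mountPath: /cache    # path inside the container
  volumes:
  - name: scratch
    emptyDir: {}           # transient, node-local storage
```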

Kubernetes volumes, such as awsElasticBlockStore, have an explicit lifetime. Such a volume outlives container restarts, and its data persists between them. Destroying a Pod, however, can cause data loss. One of the features of StatefulSets is maintaining data persistence when a Pod is deleted.

EBS based Storage

An awsElasticBlockStore volume mounts an EBS Volume.

There are some limitations when using an awsElasticBlockStore volume:

  • nodes need to be in the same region and availability zone as the EBS volume
  • EBS only supports a single EC2 instance mounting a volume

These limitations must be kept in mind when designing for such occurrences as AZ failures.

Persistent Volumes

The PersistentVolume subsystem provides an API that abstracts the provisioning and consumption of storage. A PersistentVolume (PV) is a piece of storage in a K8s cluster, while a PersistentVolumeClaim (PVC) is a request for storage made on behalf of a controller. The third component is the StorageClass, which represents different classes of storage.

Unless dynamic provisioning through a StorageClass is used, persistent volumes and the underlying storage must exist before a controller is associated with the storage.
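As a sketch, a standalone PVC requesting storage from a StorageClass named fast (the claim name here is illustrative) looks like this:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim
spec:
  storageClassName: fast   # a matching StorageClass must exist
  accessModes:
    - ReadWriteOnce        # mountable read-write by a single node
  resources:
    requests:
      storage: 1Gi
```

A Pod or controller then references the claim by name in its volumes section, and the cluster binds the claim to a suitable PersistentVolume.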

Persistent Volume Claim Templates

A feature of the StatefulSet controller, volumeClaimTemplates is a list of storage claims that Pods reference. The StatefulSet controller maps network identities and persistent volume claims in a way that maintains the identity of each Pod. When a StatefulSet with such a template is created in a Kubernetes cluster, the persistent volumes are created automatically. The Cassandra example included in this document utilizes a volume claim template.

Documentation

More documentation about storage is available here.

The next post will continue with the topics of probes and affinity rules.