High availability best practices
Building reliable and highly available distributed systems is a non-trivial endeavor. In this section, we will review some of the best practices that enable a Kubernetes-based system to function reliably and remain available in the face of various failure categories. We will also dive deep and see how to go about constructing your own highly available clusters.
Note that you should roll your own highly available Kubernetes cluster only in very special cases. Tools such as Kubespray provide battle-tested ways to create highly available clusters. You should take advantage of all the work and effort that went into these tools.
Creating highly available clusters
To create a highly available Kubernetes cluster, the master components must be redundant. That means etcd must be deployed as a cluster (typically across three or five nodes) and the Kubernetes API server must be redundant. Auxiliary cluster-management services such as Heapster storage may be deployed redundantly too, if necessary. The following diagram depicts a typical reliable and highly available Kubernetes cluster in a stacked etcd topology. There are several load-balanced master nodes, each one containing the full set of master components as well as an etcd member:
Figure 3.1: A highly available cluster configuration
This is not the only way to configure highly available clusters. You may prefer, for example, to deploy a standalone etcd cluster to optimize the machines to their workload or if you require more redundancy for your etcd cluster than the rest of the master nodes.
The following diagram shows a Kubernetes cluster where etcd is deployed as an external cluster:
Figure 3.2: etcd used as an external cluster
Self-hosted Kubernetes clusters, where control plane components are deployed as pods and stateful sets in the cluster, are a great approach to simplify the robustness, disaster recovery, and self-healing of the control plane components by applying Kubernetes to Kubernetes.
Making your nodes reliable
Nodes will fail, or some components will fail, but many failures are transient. The basic guarantee is to make sure that the Docker daemon (or whatever the CRI implementation is) and the kubelet restart automatically in the event of a failure.
If you run CoreOS, a modern Debian-based OS (including Ubuntu >= 16.04), or any other OS that uses systemd as its init mechanism, then it's easy to deploy Docker and the kubelet as self-starting daemons:
systemctl enable docker
systemctl enable kubelet
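Note that systemctl enable only arranges for the services to start at boot. To have systemd also restart the kubelet automatically after a crash, you can add a drop-in override. The following is a minimal sketch; the drop-in path and values are assumptions and may differ on your distribution, and a similar override can be added for docker.service:
# /etc/systemd/system/kubelet.service.d/10-restart.conf
[Service]
Restart=always
RestartSec=5
After adding the drop-in, reload systemd and restart the service:
systemctl daemon-reload
systemctl restart kubelet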
For other operating systems, the Kubernetes project selected monit for their high-availability example, but you can use any process monitor you prefer. The main requirement is to make sure that those two critical components will restart in the event of failure, without external intervention.
Protecting your cluster state
The Kubernetes cluster state is stored in etcd. The etcd cluster was designed to be super reliable and distributed across multiple nodes. It's important to take advantage of these capabilities for a reliable and highly available Kubernetes cluster.
Clustering etcd
You should have at least three nodes in your etcd cluster. If you need more reliability and redundancy, you can go for five, seven, or any other odd number of nodes. The number of nodes must be odd to have a clear majority in the event of a network split. For example, a five-node cluster keeps its quorum of three and stays operational even if two nodes fail.
In order to create a cluster, the etcd nodes should be able to discover each other. There are several methods to accomplish this. I recommend using the excellent etcd operator from CoreOS:
Figure 3.3: The Kubernetes etcd operator logo
The operator takes care of many complicated aspects of etcd operation, such as the following:
- Create and destroy
- Resizing
- Failovers
- Rolling upgrades
- Backup and restore
Installing the etcd operator
The easiest way to install the etcd operator is using Helm – the Kubernetes package manager. If you don't have Helm installed yet, follow the instructions here: https://github.com/kubernetes/helm#install.
Next, save the following YAML to helm-rbac.yaml:
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: tiller
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: tiller
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - kind: ServiceAccount
    name: tiller
    namespace: kube-system
This creates a service account for Tiller and gives it a cluster admin role:
$ k apply -f helm-rbac.yaml
serviceaccount/tiller created
clusterrolebinding.rbac.authorization.k8s.io/tiller created
Then initialize Helm with the Tiller service account:
$ helm2 init --service-account tiller
$HELM_HOME has been configured at /Users/gigi.sayfan/.helm.
Tiller (the Helm server-side component) has been installed into your Kubernetes Cluster.
Please note: by default, Tiller is deployed with an insecure 'allow unauthenticated users' policy. To prevent this, run 'helm init' with the --tiller-tls-verify flag.
For more information on securing your installation see: https://docs.helm.sh/using_helm/#securing-your-helm-installation
Don't worry about the warnings at this point. We will dive deep into Helm in Chapter 9, Packaging Applications.
Now, we can finally install the etcd operator. I use x as a short release name to make the output less verbose. You may want to use more meaningful names:
$ helm2 install stable/etcd-operator --name x
NAME: x
LAST DEPLOYED: Thu May 28 17:33:16 2020
NAMESPACE: default
STATUS: DEPLOYED
RESOURCES:
==> v1/Pod(related)
NAME READY STATUS RESTARTS AGE
x-etcd-operator-etcd-backup-operator-dffcbd97-hfsnc 0/1 Pending 0 0s
x-etcd-operator-etcd-operator-669975754b-vhhq5 0/1 Pending 0 0s
x-etcd-operator-etcd-restore-operator-6b787cc5c-6dk77 0/1 Pending 0 0s
==> v1/Service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
etcd-restore-operator ClusterIP 10.43.182.231 <none> 19999/TCP 0s
==> v1/ServiceAccount
NAME SECRETS AGE
x-etcd-operator-etcd-backup-operator 1 0s
x-etcd-operator-etcd-operator 1 0s
x-etcd-operator-etcd-restore-operator 1 0s
==> v1beta1/ClusterRole
NAME AGE
x-etcd-operator-etcd-operator 0s
==> v1beta1/ClusterRoleBinding
NAME AGE
x-etcd-operator-etcd-backup-operator 0s
x-etcd-operator-etcd-operator 0s
x-etcd-operator-etcd-restore-operator 0s
==> v1beta2/Deployment
NAME READY UP-TO-DATE AVAILABLE AGE
x-etcd-operator-etcd-backup-operator 0/1 1 0 0s
x-etcd-operator-etcd-operator 0/1 1 0 0s
x-etcd-operator-etcd-restore-operator 0/1 1 0 0s
NOTES:
1. etcd-operator deployed.
If you would like to deploy an etcd-cluster set cluster.enabled to true in values.yaml
Check the etcd-operator logs
export POD=$(kubectl get pods -l app=x-etcd-operator-etcd-operator --namespace default --output name)
kubectl logs $POD --namespace=default
Now that the operator is installed, we can use it to create the etcd cluster.
Creating the etcd cluster
Save the following to etcd-cluster.yaml:
apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdCluster"
metadata:
name: "example-etcd-cluster"
spec:
size: 3
version: "3.2.13"
To create the cluster, use the following command:
$ k create -f etcd-cluster.yaml
etcdcluster.etcd.database.coreos.com/etcd-cluster created
Let's verify the cluster pods were created properly:
$ k get pods -o wide | grep etcd-cluster
etcd-cluster-2fs2lpz7p7 1/1 Running 0 2m53s 10.42.2.4 k3d-k3s-default-worker-1
etcd-cluster-58547r5f6x 1/1 Running 0 3m49s 10.42.1.5 k3d-k3s-default-worker-0
etcd-cluster-z7s4bfksdl 1/1 Running 0 117s 10.42.3.5 k3d-k3s-default-worker-2
As you can see, each etcd pod was scheduled to run on a different node. This is exactly what we want with a redundant datastore like etcd.
The -o wide output format for kubectl's get command provides additional information; for the get pods command, it includes the node each pod is scheduled on.
Verifying the etcd cluster
Once the etcd cluster is up and running, you can access it with the etcdctl tool to check on the cluster status and health. Kubernetes lets you execute commands directly inside pods or containers via the exec command (similar to docker exec).
Here is how to check if the cluster is healthy:
$ k exec etcd-cluster-2fs2lpz7p7 -- etcdctl cluster-health
member 1691519f36d795b7 is healthy: got healthy result from http://etcd-cluster-2fs2lpz7p7.etcd-cluster.default.svc:2379
member 1b67c8cb37fca67e is healthy: got healthy result from http://etcd-cluster-58547r5f6x.etcd-cluster.default.svc:2379
member 3d4cbb73aeb3a077 is healthy: got healthy result from http://etcd-cluster-z7s4bfksdl.etcd-cluster.default.svc:2379
cluster is healthy
Here is how to set and get key-value pairs:
$ k exec etcd-cluster-2fs2lpz7p7 -- etcdctl set test "Yeah, it works"
Yeah, it works
$ k exec etcd-cluster-2fs2lpz7p7 -- etcdctl get test
Yeah, it works
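The commands above use the etcd v2 API, which older versions of etcdctl default to. If your etcdctl defaults to the v3 API, or you set ETCDCTL_API=3 explicitly, the equivalents are endpoint health, put, and get. Here is a sketch, assuming the same pod as above and that the container image includes a shell:
$ k exec etcd-cluster-2fs2lpz7p7 -- sh -c "ETCDCTL_API=3 etcdctl endpoint health"
$ k exec etcd-cluster-2fs2lpz7p7 -- sh -c "ETCDCTL_API=3 etcdctl put test 'Yeah, it works'"
$ k exec etcd-cluster-2fs2lpz7p7 -- sh -c "ETCDCTL_API=3 etcdctl get test"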
Protecting your data
Protecting the cluster state and configuration is great, but even more important is protecting your own data. If somehow the cluster state gets corrupted, you can always rebuild the cluster from scratch (although the cluster will not be available during the rebuild). But if your own data is corrupted or lost, you're in deep trouble. The same rules apply: redundancy is king. But while the Kubernetes cluster state is very dynamic, much of your data may be less dynamic.
For example, a lot of historic data is often important and can be backed up and restored. Live data might be lost, but the overall system may be restored to an earlier snapshot and suffer only temporary damage.
You should consider Velero as a solution for backing up your entire cluster, including your own data. Heptio (now part of VMware) developed Velero, which is open source and may be a life-saver for critical systems.
Check it out at https://velero.io.
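For example, once Velero is installed in your cluster and configured with a backup storage location, a basic backup and restore flow could look something like the following sketch (the backup name and namespace here are just placeholders):
$ velero backup create test-backup --include-namespaces default
$ velero backup get
$ velero restore create --from-backup test-backup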
Running redundant API servers
API servers are stateless, fetching all the necessary data on the fly from the etcd cluster. This means that you can easily run multiple API servers without needing to coordinate between them. Once you have multiple API servers running, you can put a load balancer in front of them to make it transparent to clients.
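For example, a simple TCP load balancer such as HAProxy can sit in front of the API servers. Here is a minimal sketch of such a configuration; the master addresses are placeholders and the health checking is deliberately basic:
frontend kubernetes-api
    bind *:6443
    mode tcp
    default_backend kubernetes-masters
backend kubernetes-masters
    mode tcp
    balance roundrobin
    option tcp-check
    server master-0 10.0.0.10:6443 check
    server master-1 10.0.0.11:6443 check
    server master-2 10.0.0.12:6443 check
Clients (including the kubelets and kubectl) then talk to the load balancer's address instead of a specific API server.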
Running leader election with Kubernetes
Some master components, such as the scheduler and the controller manager, can't have multiple instances active at the same time. This would be chaos, as multiple schedulers would try to schedule the same pod onto multiple nodes or multiple times onto the same node. The correct way to have a highly available Kubernetes cluster is to have these components run in leader election mode. This means that multiple instances are running, but only one is active at a time; if it fails, another one is elected as leader and takes its place.
Kubernetes supports this mode via the --leader-elect flag. The scheduler and the controller manager can be deployed as pods by copying their respective manifests to /etc/kubernetes/manifests.
Here is a snippet from a scheduler manifest that shows the use of the flag:
command:
- /bin/sh
- -c
- /usr/local/bin/kube-scheduler --master=127.0.0.1:8080 --v=2 --leader-elect=true 1>>/var/log/kube-scheduler.log
2>&1
Here is a snippet from a controller manager manifest that shows the use of the flag:
- command:
- /bin/sh
- -c
- /usr/local/bin/kube-controller-manager --master=127.0.0.1:8080 --cluster-name=e2e-test-bburns
--cluster-cidr=10.245.0.0/16 --allocate-node-cidrs=true --cloud-provider=gce --service-account-private-key-file=/srv/kubernetes/server.key
--v=2 --leader-elect=true 1>>/var/log/kube-controller-manager.log 2>&1
image: gcr.io/google_containers/kube-controller-manager:fda24638d51a48baa13c35337fcd4793
There are several other flags to control leader election. All of them have reasonable defaults:
--leader-elect-lease-duration duration Default: 15s
--leader-elect-renew-deadline duration Default: 10s
--leader-elect-resource-lock string Default: "endpoints" ("configmaps" is the other option)
--leader-elect-retry-period duration Default: 2s
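With the default endpoints resource lock, you can see which instance currently holds the leadership by looking at the leader annotation on the corresponding Endpoints object in the kube-system namespace. This is a sketch; the exact annotation key may vary across Kubernetes versions:
$ k -n kube-system get endpoints kube-scheduler -o yaml | grep leader
$ k -n kube-system get endpoints kube-controller-manager -o yaml | grep leader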
Note that it is not possible to have these components restarted automatically by Kubernetes like other pods because these are exactly the Kubernetes components responsible for restarting failed pods, so they can't restart themselves if they fail. There must be a ready-to-go replacement already running.
Making your staging environment highly available
High availability is not trivial to set up. If you go to the trouble of setting up high availability, it means there is a business case for a highly available system. It follows that you want to test your reliable and highly available cluster before you deploy it to production (unless you're Netflix, where you test in production). Also, any change to the cluster may, in theory, break your high availability without disrupting other cluster functions. The essential point is that, just like anything else, if you don't test it, assume it doesn't work.
We've established that you need to test reliability and high availability. The best way to do this is to create a staging environment that replicates your production environment as closely as possible. This can get expensive. There are several ways to manage the cost:
- An ad hoc highly available staging environment: Create a large highly available cluster only for the duration of high availability testing.
- Compress time: Create interesting event streams and scenarios ahead of time, feed the input, and simulate the situations in rapid succession.
- Combine high availability testing with performance and stress testing: At the end of your performance and stress tests, overload the system and see how the reliability and high availability configuration handles the load.
Testing high availability
Testing high availability takes planning and a deep understanding of your system. The goal of every test is to reveal flaws in the system's design and/or implementation, and to provide good enough coverage that, if the tests pass, you'll be confident that the system behaves as expected.
In the realm of reliability, self-healing, and high availability, it means you need to figure out ways to break the system and watch it put itself back together.
This requires several elements, as follows:
- A comprehensive list of possible failures (including reasonable combinations)
- For each possible failure, it should be clear how the system should respond
- A way to induce the failure
- A way to observe how the system reacts
None of the elements are trivial. The best approach in my experience is to do it incrementally and try to come up with a relatively small number of generic failure categories and generic responses, rather than an exhaustive, ever-changing list of low-level failures.
For example, a generic failure category is node-unresponsive; the generic response could be rebooting the node; the way to induce the failure could be stopping the virtual machine (VM) of the node (if it's a VM). The observation should be that, while the node is down, the system still functions properly based on standard acceptance tests; the node is eventually up, and the system gets back to normal. There may be many other things you want to test, such as whether the problem was logged, whether relevant alerts went out to the right people, and whether various stats and reports were updated.
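If stopping the node's VM is inconvenient in your staging environment, draining the node is a lightweight way to simulate part of the same scenario. Here is a sketch that reuses a node name from the earlier example output; your node names will differ:
# Mark the node unschedulable and evict its pods
$ k cordon k3d-k3s-default-worker-1
$ k drain k3d-k3s-default-worker-1 --ignore-daemonsets
# Watch the evicted pods get rescheduled onto the remaining nodes
$ k get pods -o wide --watch
# Bring the node back into rotation when the test is done
$ k uncordon k3d-k3s-default-worker-1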
But beware of over-generalizing. In the case of the generic unresponsive node failure, a key component is detecting that the node is unresponsive. If your method of detection is faulty, then your system will not react properly. Use best practices like health checks and readiness checks.
Note that, sometimes, a failure can't be resolved in a single response. For example, in our unresponsive node case, if it's a hardware failure, then rebooting will not help. In this case, a second line of response comes into play and maybe a new node is provisioned to replace the failed node. In this case, you can't be too generic and you may need to create tests for specific types of pod/role that were on the node (such as etcd, master, worker, database, and monitoring).
If you have high quality requirements, be prepared to spend much more time setting up the proper testing environments and the tests than even the production environment.
One last important point is to try to be as unintrusive as possible. That means that, ideally, your production system will not have testing features that allow shutting down parts of it or cause it to be configured to run at reduced capacity for testing. The reason is that it increases the attack surface of your system and it could be triggered by accident by mistakes in configuration. Ideally, you can control your testing environment without resorting to modifying the code or configuration that will be deployed in production. With Kubernetes, it is usually easy to inject pods and containers with custom test functionality that can interact with system components in the staging environment, but will never be deployed in production.
In this section, we looked at what it takes to actually have a reliable and highly available cluster, including etcd, the API server, the scheduler, and the controller manager. We considered best practices for protecting the cluster itself, as well as your data, and paid special attention to the issue of staging environments and testing.