Signals and the "kubectl delete" command
October 18, 2021
Some colleagues and I were recently implementing a Chaos Monkey style test against a Kubernetes deployment. The goal was to forcibly kill an application to understand how it behaved. Specifically, we wanted to see whether the application's I/O operations were atomic and safe even if the process was terminated ungracefully while data was being processed. To do this, we needed to make sure the process (and, by extension, the container running the process) was forcibly terminated without an opportunity to gracefully run any shutdown routines.
For those who remember Ye Olden Days when we wrote and tested applications without wrapping them in a set of namespaces and cgroups that are created by a runtime controlled by a constantly-evolving API with an increasingly complex set of interfaces, you know this is generally an easy problem to solve: Just spin up many copies of the service and write a one-liner to kill -9 $PID && sleep 1 each of those processes. Or forcibly stop the VMs that the service is running on. Or walk over to a rack of servers and unplug it. It's not perfect, but it'll do the job if you're just looking to violently terminate processes and see how a system behaves.
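For illustration, here's a rough sketch of that old-school approach, assuming a hypothetical service whose processes match the name my-service:
# Hypothetical sketch: forcibly kill every process matching "my-service",
# pausing a second between kills
$ for pid in $(pgrep -f my-service); do kill -9 "$pid" && sleep 1; done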
But we’re #webscale now, so nothing can be simple.
Short version of this article: There's no way to accurately simulate a true failure in a Kubernetes environment unless you have access to the underlying nodes. Empirical evidence indicates that Kubernetes and/or kubectl don't offer a way to immediately send a SIGKILL to a pod. The documentation is very unclear, which adds to the confusion.
As an aside: If you want to play a fun game, log into that one single point of failure server that your company has at 4:30PM on a Friday and give the ole' SysAdmin Roulette a whirl: kill -9 $(ps -ef | tail -n +2 | awk '{ print $2 }' | shuf | head -n 1). This is also a great way to get your company's Very Important Person to stop chasing blockchain, machine learning, AI, or whatever other startup snakeoil they think you need, and convince them to actually fix problems in your current environment. Anyway…
Some Fundamentals
If you're unfamiliar with signals, here's a crash course: a signal is essentially a standardized message sent to a process. Processes can generally decide how they want to handle different signals (except SIGKILL and SIGSTOP), but there's some standardization. The signals that I'm interested in are SIGTERM and SIGKILL. Sending a SIGTERM to a process gives it a chance to gracefully terminate: the process will usually execute some cleanup tasks and then exit. Sending a SIGKILL to a process terminates it immediately, giving it no opportunity to clean up after itself.
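To make the difference concrete, here's a minimal sketch (not from the original experiment) of a shell script that traps SIGTERM and runs cleanup, while SIGKILL cannot be trapped at all; the script name and messages are made up for illustration:
$ cat > sig-demo.sh <<'EOF'
#!/bin/sh
# Run cleanup when SIGTERM arrives, then exit gracefully
trap 'echo "caught SIGTERM, cleaning up"; exit 0' TERM
echo "running as PID $$"
while true; do sleep 1; done
EOF
# Graceful: the trap runs and the script exits on its own terms
$ sh sig-demo.sh & sleep 1; kill -TERM $!
# Ungraceful: SIGKILL cannot be trapped, so the process dies immediately
$ sh sig-demo.sh & sleep 1; kill -KILL $!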
For my team's experiment, we ideally wanted to send a SIGKILL to our Kubernetes pods to test how they behave in ungraceful shutdown scenarios. The kubectl delete command is used to delete resources, such as pods. It provides a --grace-period flag, ostensibly allowing you to give a pod a certain amount of time to gracefully terminate (SIGTERM) before it's forcibly killed (SIGKILL). If you review the help menu for kubectl delete, you'll find the following relevant bits:
--force=false: If true, immediately remove resources from API and bypass graceful deletion.
Note that immediate deletion of some resources may result in inconsistency or data loss and requires
confirmation.
--grace-period=-1: Period of time in seconds given to the resource to terminate gracefully.
Ignored if negative. Set to 1 for immediate shutdown. Can only be set to 0 when --force is true
(force deletion).
This isn't really clear. Does --grace-period=1 result in immediate shutdown via a SIGKILL, or does it give the pod a 1 second grace period? Does --grace-period=0 --force=true send an immediate SIGKILL, or does it just remove the resource from the Kubernetes API? It's all entirely unclear from the docs, so I ran some experiments to find out more.
Test Setup
To figure out how this behavior works, I used the following setup:
- minikube version 1.23.2
- Kubernetes server version 1.22.2
- kubectl version 1.22.2
$ minikube version --components
minikube version: v1.23.2
commit: 0a0ad764652082477c00d51d2475284b5d39ceed
buildctl:
buildctl github.com/moby/buildkit v0.9.0 c8bb937807d405d92be91f06ce2629e6202ac7a9
containerd:
containerd github.com/containerd/containerd v1.4.9 e25210fe30a0a703442421b0f60afac609f950a3
crictl:
crictl version v1.21.0
crio:
crio version 1.22.0
crun:
error
ctr:
ctr github.com/containerd/containerd v1.4.9
docker:
Docker version 20.10.8, build 3967b7d
dockerd:
Docker version 20.10.8, build 75249d8
podman:
podman version 2.2.1
runc:
runc version 1.0.1
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.2", GitCommit:"8b5a19147530eaac9476b0ab82980b4088bbc1b2", GitTreeState:"clean", BuildDate:"2021-09-15T21:38:50Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.2", GitCommit:"8b5a19147530eaac9476b0ab82980b4088bbc1b2", GitTreeState:"clean", BuildDate:"2021-09-15T21:32:41Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
Testing with --grace-period=1
The documentation for kubectl specifically states that --grace-period should be "Set to 1 for immediate shutdown." To me, this would indicate that a grace period of 1 results in…well, immediate shutdown, like the documentation says. In the world of *nix, this means that the process is sent a SIGKILL and not given a chance to gracefully terminate.
Let’s put that to the test. First, I’ll start a simple busybox pod that just sleeps forever (I output timestamps on everything so that I can follow the flow):
$ date -u +%R:%S && kubectl run --image=busybox busybox sleep infinity
23:58:45
pod/busybox created
Next, I'll connect to my minikube host (via minikube ssh) and fire up an strace on the process ID of the sleeping container:
$ strace --absolute-timestamps -p $(docker ps | grep 'sleep infinity' | cut -f 1 -d ' ' | xargs docker inspect | jq .[0].State.Pid)
strace: Process 44880 attached
23:58:48 restart_syscall(<... resuming interrupted nanosleep ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
Finally, I'll just send a kubectl delete with a grace-period of 1 and observe the strace output:
# kubectl delete command
$ date -u +%R:%S && kubectl delete pod busybox --grace-period=1
23:59:04
pod "busybox" deleted
# strace output from minikube
23:59:04 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=0, si_uid=0} ---
23:59:04 restart_syscall(<... resuming interrupted restart_syscall ...>) = ?
23:59:05 +++ killed by SIGKILL +++
Notice that a SIGTERM is received, followed by a SIGKILL one second later. This is unexpected behavior: the docs indicate that --grace-period=1 results in immediate shutdown, which clearly isn't the case. The use of a SIGTERM gives the process a chance to gracefully exit, which is undesirable for my tests.
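To see why that SIGTERM window matters, imagine a (hypothetical) pod whose entrypoint traps SIGTERM and flushes state before exiting; the name and message below are made up for illustration:
# Hypothetical pod whose entrypoint catches SIGTERM and runs cleanup first
$ kubectl run trap-demo --image=busybox -- \
    sh -c 'trap "echo flushing state before exit" TERM; sleep infinity & wait'
Deleting that pod with --grace-period=1 still gives the trap a chance to fire before the follow-up SIGKILL arrives, so the "ungraceful" failure we wanted to simulate never actually happens.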
Testing with --grace-period=0 and --force=true
The other option provided by the documentation is to use --grace-period=0 and --force=true. Again, the docs are unclear about what will actually happen here. They state that --grace-period can only be set to 0 "when --force is true (force deletion)." The docs further explain that a force deletion will "immediately remove resources from API and bypass graceful deletion." Basically, this indicates that the resource will be removed from Kubernetes before it has received confirmation that the resource itself (e.g., a container) has actually been deleted.
Once again, the documentation is unclear about the behavior (does --grace-period=0 result in a SIGKILL?), so I tested it out with the same experiment:
# Create the pod
$ date -u +%R:%S && kubectl run --image=busybox busybox sleep infinity
00:01:59
pod/busybox created
# Trace the pid
$ strace -p $(docker ps | grep 'sleep infinity' | cut -f 1 -d ' ' | xargs docker inspect | jq .[0].State.Pid) --absolute-timestamps
strace: Process 45803 attached
00:02:08 restart_syscall(<... resuming interrupted nanosleep ...>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
# Force delete the pod with no grace period
$ date -u +%R:%S && kubectl delete pod busybox --grace-period=0 --force=true
00:02:18
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "busybox" force deleted
# Observe the strace output
00:02:18 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=0, si_uid=0} ---
00:02:18 restart_syscall(<... resuming interrupted restart_syscall ...>) = ?
00:02:48 +++ killed by SIGKILL +++
Whoa. Thirty seconds between receiving a SIGTERM and finally terminating via SIGKILL? That doesn't sound like --grace-period=0 to me. So it turns out that specifying --grace-period=0 and --force=true might actually provide more of a grace period than you would expect.
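One plausible explanation, which I haven't confirmed, is that the force deletion only short-circuits the API bookkeeping while the kubelet falls back to the pod's terminationGracePeriodSeconds, which defaults to 30 seconds and happens to match the gap in the strace output above. You can check what a pod was actually configured with:
# Pods created by "kubectl run" with no overrides get the default
# 30-second termination grace period
$ kubectl get pod busybox -o jsonpath='{.spec.terminationGracePeriodSeconds}'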
But why?
I now know that neither --grace-period=1 nor --grace-period=0 --force=true behaves "correctly" based on the documentation. The weirdest thing about this behavior is that it's totally unnecessary. Docker (and I imagine other runtimes, like containerd) supports sending SIGKILL to a container:
# Kill the container
$ date -u +%R:%S && docker kill $(docker ps | grep 'sleep infinity' | cut -f 1 -d ' ')
01:25:45
8c59ac684bf2
# Observe the strace output
$ strace --absolute-timestamps -p $(docker ps | grep 'sleep infinity' | cut -f 1 -d ' ' | xargs docker inspect | jq .[0].State.Pid)
strace: Process 9122 attached
01:25:32 restart_syscall(<... resuming interrupted nanosleep ...>) = ?
01:25:45 +++ killed by SIGKILL +++
Notice that the container is immediately terminated via a SIGKILL. This is expected behavior, and is what any person should expect for an "immediate shutdown."
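As a side note, docker kill sends SIGKILL by default, but it also accepts a --signal flag if you want to send something else, for example a SIGTERM for comparison:
# docker kill defaults to SIGKILL; --signal selects a different signal
$ docker kill --signal SIGTERM $(docker ps | grep 'sleep infinity' | cut -f 1 -d ' ')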
It turns out that I’m not the first one to come across this problem. An issue was opened almost 2 years ago pointing out that, at a minimum, the documentation should be corrected to accurately reflect the behavior of Kubernetes. The issue was largely ignored and then autoclosed.
Does it matter?
All of this sounds very academic: how often do administrators really care about the shutdown signals sent to their processes? And why can't you just run the underlying docker kill commands (or the equivalents in other runtimes)? Does it really matter that Kubernetes improperly implements "immediate shutdown" and then doesn't explain this in the documentation?
It matters to anyone looking to test their system to ensure it behaves properly in failure scenarios. If a Kubernetes node fails due to a hardware issue, it probably isn't going to use its dying breaths to politely send SIGTERMs to every pod. It's just going to fail, and you need to understand how your system will handle that failure. Without being able to actually simulate this behavior, you can't be confident that your system will degrade in the way you expect.
It's tempting to tell an administrator to just log into the underlying hosts and simulate a failure, either via a docker kill or by physically terminating the machine. Aside from this being silly (Kubernetes should just implement signals properly), it's not always possible: many organizations pay for hosted Kubernetes and have no access to the underlying nodes.
More broadly, Kubernetes is often billed by supporters as a “distributed operating system.” Process management is an integral part of an operating system, and if you can’t reason about how Kubernetes handles process termination, then it’s not much of an operating system. These are the kinds of “small things” that always end up mattering, so it’s probably just a good idea to implement them correctly from the beginning.
But really I’m just armchair quarterbacking: I’m not trying to take shots at the Kubernetes project. The goal of this article is just to spread awareness about the fact that you can’t simulate failure scenarios using only Kubernetes tooling. You need access to the underlying nodes, or your failure simulations won’t be accurate.