Kasper Siig, May 5, 2022
Pods, deployments, and services are just some of the concepts that you need to understand in order to work with Kubernetes. They’re the main building blocks of a working Kubernetes cluster. On top of that, you’ll likely also have to learn about ConfigMaps, ingress controllers, and other functions.
Even after you’re comfortable with these different resources and you’re deploying workloads inside a Kubernetes cluster, you’re still bound to run into errors. It’s important that you also understand what these errors mean.
In this article you’ll be introduced to the two most common errors in Kubernetes: `ImagePullBackOff` and `CrashLoopBackOff`. You’ll also learn how to diagnose these errors and get them resolved.
Knowing about these errors isn’t just a matter of increasing your skills as an engineer. It’s also a matter of planning and time management.
Depending on the organization you work in, whether it relies on structure and project management or focuses on getting stuff done as quickly as possible, you’re bound to be doing planning to some degree so that your coworkers and management know when your project will be deployed and running.
If you’ve worked with infrastructure or development in general, you know that things never go quite according to plan and that errors happen. This is just part of the industry. It becomes an issue, though, if you get stuck on trying to handle the error. If you have no idea what the error is or how to resolve it, that can cost a lot of time. Because of this, it’s important to be aware of at least the most common errors.
A platform at the scale of Kubernetes will contain many different error codes. You’ll encounter a lot of them at some point, but there are two that you’ll see more than any others. This article provides more information about these errors and what you should do when you see them.
If you want to follow along with the examples, you can find manifest files for all the examples in [this GitHub repo].
“ImagePullBackOff” is likely _the_ most common error you’ll encounter when working with Kubernetes. While there are many different causes for this error, it can be boiled down to a simple explanation: Kubernetes isn’t able to pull the image and will “back off” from trying. It’s important to note that it won’t _stop_ trying; instead, it will try to pull the image at increasing intervals.
In [the GitHub repo], you’ll find a file named `pod-typo-in-name.yaml` with the following contents:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: ngiinx:1.14.2
    ports:
    - containerPort: 80
```
This is a simple Nginx pod. However, on line eight there’s a typo: the image `ngiinx` doesn’t exist, so Kubernetes won’t be able to pull it. Try running `kubectl apply -f pod-typo-in-name.yaml && kubectl get pods --watch`. Within thirty seconds or so, you’ll see something like this:
```bash
$ kubectl apply -f pod-typo-in-name.yaml && kubectl get pods --watch
pod/nginx created
NAME READY STATUS RESTARTS AGE
nginx 0/1 ContainerCreating 0 0s
nginx 0/1 ErrImagePull 0 3s
nginx 0/1 ImagePullBackOff 0 14s
```
As you can see, Kubernetes starts out trying to create the container but quickly runs into an error and reports `ErrImagePull` as the status. Kubernetes will try a few more times to pull the image, but since it continues to fail, it will enter the `ImagePullBackOff` status. You already know that in this example it’s because of a typo in the name of the image, but in production use cases you’ll have to debug this properly.
To do so, you need to look at the events of the pod. You can do so by executing `kubectl describe pod nginx`, with `nginx` being the name of the pod. This will give you a lot of details about the pod, but in this case it’s the `Events` section at the bottom that’s relevant. That should look something like this:
```
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 7m24s default-scheduler Successfully assigned kubernetes-errors/nginx to docker-desktop
Normal Pulling 5m56s (x4 over 7m24s) kubelet Pulling image "ngiinx:1.14.2"
Warning Failed 5m54s (x4 over 7m22s) kubelet Failed to pull image "ngiinx:1.14.2": rpc error: code = Unknown desc = Error response from daemon: pull access denied for ngiinx, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
Warning Failed 5m54s (x4 over 7m22s) kubelet Error: ErrImagePull
Warning Failed 5m30s (x6 over 7m21s) kubelet Error: ImagePullBackOff
Normal BackOff 2m14s (x20 over 7m21s) kubelet Back-off pulling image "ngiinx:1.14.2"
```
You can see exactly what is happening when Kubernetes tries to spin up the container. It runs into an issue, tries to pull four more times, and finally goes into the `ImagePullBackOff` state. Unfortunately, Kubernetes isn’t clever enough to figure out exactly what’s wrong. Instead, it gives you a list of options for what could be the problem: `pull access denied for ngiinx, repository does not exist or may require 'docker login'`.
As you can see, nowhere does it state that the problem could be a typo in the name. This is something to be _very_ aware of while debugging Kubernetes errors.
Kubernetes may give you suggestions on what is wrong, but you still need to think for yourself, consider all possible causes, and perhaps interpret the suggestions. In this case, it’s technically true that Kubernetes doesn’t have access to `ngiinx`, because you can’t have access to something that doesn’t exist. It’s a strange way of thinking about it, but sometimes necessary.
Among other reasons, you may encounter the `ImagePullBackOff` error if the image doesn’t exist, you’re missing authentication (if you’re using a private registry, for instance), or you’re being rate limited by the registry.
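If the cause is a private registry, the usual fix is to create an image pull secret and reference it from the pod spec. Here’s a minimal sketch; the secret name, registry address, and credentials are placeholders, not values from the example repo:

```yaml
# Create the secret first (placeholder values):
#   kubectl create secret docker-registry my-registry-creds \
#     --docker-server=registry.example.com \
#     --docker-username=<user> --docker-password=<password>
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  imagePullSecrets:
  - name: my-registry-creds   # must match the secret created above
  containers:
  - name: nginx
    image: registry.example.com/nginx:1.14.2
    ports:
    - containerPort: 80
```

Without the `imagePullSecrets` entry, a pod pulling from a private registry will show the same `pull access denied` event you saw above, even when the image name is spelled correctly.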
Another error that’s nearly as common as `ImagePullBackOff` is `CrashLoopBackOff`. This error occurs when the container inside a pod crashes and it’s not possible for Kubernetes to get the application running again.
In the GitHub repo, you’ll find a file named `pod-crashloopbackoff.yaml` that has been configured with the command `exit 1`. As a reminder, in Linux an exit code of `0` means success, while everything else is an error. When running `kubectl apply -f pod-crashloopbackoff.yaml && kubectl get pods --watch`, you’ll get the following output after about a minute:
```bash
$ kubectl apply -f pod-crashloopbackoff.yaml && kubectl get pods --watch
pod/nginx created
NAME READY STATUS RESTARTS AGE
nginx 0/1 ContainerCreating 0 0s
nginx 0/1 RunContainerError 0 1s
nginx 0/1 RunContainerError 1 2s
nginx 0/1 CrashLoopBackOff 1 15s
nginx 0/1 RunContainerError 2 15s
nginx 0/1 CrashLoopBackOff 2 27s
nginx 0/1 RunContainerError 3 50s
```
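For reference, a manifest configured with `exit 1` as the container command might look like the following sketch (the exact file in the repo may differ):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx:1.14.2
    # The whole string is treated as a single executable name, which is
    # what produces the 'executable file not found' error shown later.
    command: ["exit 1"]
```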
As the output shows, Kubernetes starts out normally by trying to create the container, but the container crashes. It tries again, the container still crashes, and so it enters the `CrashLoopBackOff` state. When a pod is in this state, nothing is happening: Kubernetes is pausing before it tries to get the container running again. Once the waiting period is over it will try again, but with the container still failing it will go back into the `CrashLoopBackOff` state and wait once more. The wait time increases exponentially each time this happens, which you can see if you check the `AGE` column on the right.
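That back-off behavior can be sketched as a simple doubling delay, capped at five minutes. The starting delay and cap reflect Kubernetes’ documented restart back-off; the script itself is only an illustration, not Kubernetes code:

```shell
# Illustration of CrashLoopBackOff wait times: the delay doubles after
# each crash and is capped at five minutes (300s).
delay=10
for attempt in 1 2 3 4 5 6; do
  echo "restart attempt ${attempt}: waiting ${delay}s before retrying"
  delay=$((delay * 2))
  if [ "$delay" -gt 300 ]; then
    delay=300
  fi
done
```

Running this prints delays of 10s, 20s, 40s, 80s, 160s, and then 300s, which matches the growing gaps you see in the `AGE` column above.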
To diagnose this issue you once again need to look at the events of the pod by executing `kubectl describe pod nginx`, with `nginx` being the name of the pod. Doing this will give you an output resembling the previous one, but with a very useful line:
```
Warning Failed 7m25s (x5 over 8m54s) kubelet Error: failed to start container "nginx": Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: exec: "exit 1": executable file not found in $PATH: unknown
```
Here, the error isn’t even the expected exit code `1`. Rather, the command `exit` isn’t found in the container’s `$PATH`, since `exit` is a shell built-in rather than an executable. Whatever the error is for your application, this is where you’ll find the error message the runtime reports, so it’s where you can start to debug why your application is crashing (for your application’s own output, `kubectl logs` is the other place to look).
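This particular error usually means the command was given as a raw string instead of being run through a shell. A hedged sketch of the fix, assuming the manifest sets the container’s `command`: wrapping it in a shell makes the executable resolvable, after which the pod still crashes, but now with the real exit code `1` instead of the `$PATH` error.

```yaml
spec:
  containers:
  - name: nginx
    image: nginx:1.14.2
    # Run through a shell so 'exit' (a shell built-in) resolves; the
    # container still exits with code 1, so the pod keeps crash-looping,
    # but the events now report the actual exit code.
    command: ["/bin/sh", "-c", "exit 1"]
```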
If or when you find other errors inside Kubernetes, how do you get those resolved? Sadly, the Kubernetes documentation doesn’t offer a comprehensive list of the different status codes you might encounter.
However, the Kubernetes community is big and incredibly helpful. Opening up Google and searching for the error code should lead you to at least one or two articles describing the error at hand.
You’ve been introduced to the two most common errors inside Kubernetes: `ImagePullBackOff` and `CrashLoopBackOff`. Both of these errors are based around the principle of trying to get the pod created by continuously retrying, while exponentially increasing the interval between retries. As you’ve seen, the errors are simple in nature, but the cause can be more complex. However, if you know how to use `kubectl describe` and are careful to look at the right events, both errors become a lot easier to debug.