This is a continuation of the PodSecurityPolicy is Dead, Long live...? article, which looks at how to construct the most effective policy for your Kubernetes infrastructure. Haven't read that? Check it out first.
Based on that foundation, this article looks at how versioning policies streamline the developer experience to deliver features and minimise downtime whilst meeting compliance requirements.
“Policy as code” is one of the more recent 'as-code' buzzwords to enter the discourse since ‘infrastructure-as-code’ paved the way for the *-as-code term. The fundamental principles of it sound great: everything in version control, auditable, repeatable, etc. However, in practice, it can often fall apart when it comes to the day 2 operational challenges which are exacerbated by adopting ‘GitOps’.
We'll look at a common scenario and present a working example of versioned policy running through the entire process to address the issue.
Let’s start with a likely (simplified) scenario:
- Person (a) writes a change to a deployment yaml file locally, yaml appears valid, so
- Person (a) pushes it to a branch and raises a pull request to the main/master branch requesting a review
- Person (b) looks at the diff, agrees with the change and approves it
- Person (a/b) merges the change causing the change to now be in the main/master branch
- CI/CD picks up the change and successfully applies the changed deployment yaml to the Kubernetes cluster
- The deployment controller creates a replicaSet and submits it to the Kubernetes API (which is accepted by the api server)
- The replicaSet controller creates pods and submits them to the API; the API server rejects these pods because they violate a PodSecurityPolicy rule (or a similar admission controller).
- Unless you’re polling the API server for events on your deployment rollout, you won’t know it’s failed.
- Your main/master branch is now broken; you'll need to either roll back the change or roll forward a fix, whether by administering the cluster directly or by repeating the entire process from step 1.
That's just your 'business as usual' flow for all your devs.
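To make the "polling for events" point concrete, here is a minimal sketch of filtering cluster events for rejected pod creations. It assumes the events have already been fetched (for example as the items of `kubectl get events -o json`) and that the replicaSet controller reports the rejection as a Warning event with the reason FailedCreate, which is typical but worth confirming on your own cluster. The sample data and the function name are hypothetical.

```python
from typing import Any


def failed_pod_creations(events: list[dict[str, Any]]) -> list[str]:
    """Return one line per Warning event where a ReplicaSet failed to create pods."""
    failures = []
    for event in events:
        involved = event.get("involvedObject", {})
        if (
            event.get("type") == "Warning"
            and event.get("reason") == "FailedCreate"  # assumed reason, see lead-in
            and involved.get("kind") == "ReplicaSet"
        ):
            failures.append(f"{involved.get('name')}: {event.get('message')}")
    return failures


# Hypothetical sample data, trimmed to the fields used above.
sample_events = [
    {
        "type": "Normal",
        "reason": "ScalingReplicaSet",
        "involvedObject": {"kind": "Deployment", "name": "my-app"},
        "message": "Scaled up replica set my-app-5d9c to 1",
    },
    {
        "type": "Warning",
        "reason": "FailedCreate",
        "involvedObject": {"kind": "ReplicaSet", "name": "my-app-5d9c"},
        "message": 'Error creating: pods "my-app-5d9c-" is forbidden',
    },
]

for failure in failed_pod_creations(sample_events):
    print(failure)
```

Even with a check like this in place, the failure is only discovered after the merge, which is the core of the problem.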
What happens when you want to update the policy itself?
Your policy engine might allow you to ‘dry run’ before you ‘enforce’ a new policy rule by putting it in a ‘warning’ or ‘audit’ or ‘background’ mode where a warning response is returned in the event log when something breaks the new rule.
But that will only happen if the API server re-evaluates the resources, which usually only occurs when the pod reschedules. Again, someone needs to be monitoring the event logs and acting on them, which can introduce its own challenges in exposing those logs to your teams.
All of that activity is happening a long way from the developers that are going to do something about it.
Furthermore, communicating that policy update between the well-intentioned security team and developers is fraught with common bureaucratic concerns frequently found in organizations at scale. The security policy itself might be considered somehow sensitive as it may reveal potential weaknesses.
Consequently, reproducing that policy configuration in a local development environment may also prove impracticable. This is all made much, much worse with multiple clusters for development, staging, production and multi-tenancies with multiple teams and applications co-existing in the same cluster space all with their own varying needs.
So what can you do about all of this?
First and foremost, sharing the policy is imperative. Your organisation has to accept that the advantages of exposing policy and communicating it effectively to its developers far outweigh any potential security advantage gained through obscurity.
Along with sharing it, you need to articulate the benefits of each and every rule. After all, you’ve hopefully hired some smart people, and smart people will try to find workarounds when they don’t see value in the obstruction.
Explaining the policy should hopefully help you justify it to yourselves too. Rules should naturally become based less on emotion and anecdote and more on informed threat modeling that's easier to maintain as your threat landscape changes.
The next step is collecting the policy, codifying it and ensuring it is kept in version control. Once it's in version control, you can adopt the same semantic versioning strategy that your developers will be used to from elsewhere.
Quick recap: semantic versioning
- Version numbers look like 1.20.30
- The number of digits between the points is up to you (1.2.3 is fine, as is 1.200.30000), but leading zeros are not allowed (1.002.3 is not valid)
- Don’t be fooled by the decimal points, they’re not real (1.20.0 is greater than 1.3.0)
- The first number is the major version, for breaking changes where you make wholly incompatible changes (this will probably be the case with almost all your policy changes)
- The second number is the minor version, for changes that add functionality but are backwards compatible (this is less likely for policy changes)
- The third number is the patch version, for backwards compatible bug fixes (likely quite rare for your policy changes)
- For more detail see the Semantic Versioning website
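The recap above can be sketched in a few lines. This is a simplified illustration that only handles plain MAJOR.MINOR.PATCH strings (no pre-release or build metadata); the function names are made up for this example.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' into integers so comparison is numeric, not textual."""
    major, minor, patch = version.split(".")
    return int(major), int(minor), int(patch)


def change_type(old: str, new: str) -> str:
    """Name the most significant segment that differs between two versions."""
    segments = ("major", "minor", "patch")
    for name, before, after in zip(segments, parse_version(old), parse_version(new)):
        if before != after:
            return name
    return "none"


# The points are separators, not decimal points: 20 is greater than 3.
print(parse_version("1.20.0") > parse_version("1.3.0"))
# Comparing the raw strings gets this wrong.
print("1.20.0" > "1.3.0")
print(change_type("1.0.0", "2.0.0"))
print(change_type("2.0.0", "2.0.1"))
```

Comparing parsed tuples rather than raw strings is what keeps 1.20.0 ahead of 1.3.0.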
Great, so you’ve got your policy definitions in version control, tagged with semantic versioning. The next step is consuming that within your applications so your developers can test against it, locally to start with, then later in continuous integration.
Hopefully your developers will at least be used to this: they can treat your policy like they treat versioned dependencies.
Now that they’re testing locally, implementing the same check in CI should be straightforward. This ensures that peer reviews are only ever carried out on code that is known to pass your policy.
Given it’s now a dependency, you can use tools like Snyk, Dependabot or Renovate to automate pull requests that keep it updated and highlight to your developers when a policy update is not compatible with their app.
Awesome. Now for the really tricky bit...
Your runtime needs to support multiple policy versions 😱
From a risk perspective, your organisation needs to be comfortable with a transitionary period while old policy versions are retired, which comes down to communication between those setting the policy and those consuming or subject to it. A reasonable baseline is to support one version backwards and one forwards, so your runtime needs to support at least three significant versions.
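As a rough illustration of that window, the sketch below picks the versions a runtime would keep active around a chosen current version. It treats every release as one step; you may prefer to count only major versions. The function names and the release list are assumptions for this example.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' into integers for numeric ordering."""
    major, minor, patch = version.split(".")
    return int(major), int(minor), int(patch)


def supported_window(released: list[str], current: str) -> list[str]:
    """Return the current version plus one release either side, where they exist.

    Raises ValueError if `current` is not in `released`.
    """
    ordered = sorted(released, key=parse_version)
    index = ordered.index(current)
    return ordered[max(index - 1, 0): index + 2]


# Hypothetical release history, deliberately out of order.
releases = ["2.0.1", "1.0.0", "2.0.0"]
print(supported_window(releases, "2.0.0"))
print(supported_window(releases, "1.0.0"))
```

Anything that falls outside the returned window is a candidate for retirement, once the teams still on it have been told.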
Show me the code
I’ve put together a reference model of this in a dedicated GitHub organisation with a bunch of repositories. Renovate was used to make automated pull requests on policy updates, and you can see examples of that. The model uses:
- Kyverno as the policy engine, but any policy engine that lets you be selective with labels on the resources should work.
- GitHub Actions as the CI/CD, but anything similar that integrates version control with pull/merge requests should work.
- GitHub for version control, but any similar git service with a pull request capability and linked tests should work.
- KiND for the Kubernetes cluster, but any Kubernetes cluster should work; this just let me do all the testing quickly.
- Renovate automatically maintains the policy dependencies by raising pull requests for us.
Please, allow me to introduce you to Example Policy Org
Enter Example Policy org
This app is compliant with version 2.0.1 of the company policy.
Following on from 1.0.0, we found that the lack of consistency isn’t helping: some people are setting it to ‘hr’, others ‘human resources’. So a breaking policy change has been introduced (hence the major version bump) to require the value to be from a known, pre-determined list. So the
mycompany.com/department label must be one of
The policy team forgot a department! So now the mycompany.com/department label must be one of tech|accounts|servicedesk|hr|sales. This was a non-breaking, very minor change, so we’re going to consider it a patch update and only increment the last segment of the version number.
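To show how the same label can pass one policy version and fail another, here is a small sketch of the department rule. The reference model expresses these rules in Kyverno; this Python stand-in is only an illustration. It assumes 1.0.0 requires nothing more than the label being present, and it leaves out 2.0.0 rather than guess at its list of values.

```python
DEPARTMENT_LABEL = "mycompany.com/department"

# Allowed values as of policy version 2.0.1.
ALLOWED_2_0_1 = frozenset({"tech", "accounts", "servicedesk", "hr", "sales"})


def violations(labels: dict[str, str], policy_version: str) -> list[str]:
    """Return the rule violations for a resource's labels under one policy version."""
    if policy_version not in ("1.0.0", "2.0.1"):
        raise ValueError(f"unknown policy version: {policy_version}")
    value = labels.get(DEPARTMENT_LABEL, "")
    if not value:
        # Assumption: every version requires the label to be set.
        return [f"{DEPARTMENT_LABEL} label is required"]
    if policy_version == "2.0.1" and value not in ALLOWED_2_0_1:
        allowed = "|".join(sorted(ALLOWED_2_0_1))
        return [f"{DEPARTMENT_LABEL} must be one of {allowed}"]
    return []


free_text = {DEPARTMENT_LABEL: "human resources"}
print(violations(free_text, "1.0.0"))
print(violations(free_text, "2.0.1"))
print(violations({DEPARTMENT_LABEL: "hr"}, "2.0.1"))
```

An app labelled ‘human resources’ passes the older rule and fails the newer one, which is exactly the signal a developer wants in a pull request rather than in production.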
This is an example of everything coexisting on a single cluster. For simplicity, and to keep this free to run, I stand up the cluster each time using KiND, but this could just as well be a real cluster (or clusters).
This is a simple tool to help our developers test their apps. From within the app's directory, they can simply run
docker run --rm -ti -v $(pwd):/apps ghcr.io/example-policy-org/policy-checker
and it’ll test whether the app passes.
The location of the policy is intentionally hard coded; making this reusable outside of our example organisation would take some significant thought and is out of scope here.
What I haven’t done is require the mycompany.com/policy-version label. That’s probably part of the job of the policy-checker and CI process, and it is up to your cluster administrators what they do with things that don’t have the label. You might, for example, exclude anything in kube-system and otherwise require mycompany.com/policy-version >= 1.0.0, updating that minimum version as required. In reality it's just another rule, but separate from the rest of the policy codebase.
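One possible shape for that extra rule is sketched below: skip excluded namespaces, and otherwise require a parseable policy-version label at or above a minimum. The function name and the decision to reject unlabelled or unparseable resources are assumptions; your administrators may choose differently.

```python
POLICY_VERSION_LABEL = "mycompany.com/policy-version"
EXCLUDED_NAMESPACES = frozenset({"kube-system"})
MINIMUM_VERSION = (1, 0, 0)  # raise this as old policy versions are retired


def parse_version(version: str) -> tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' into integers; raises ValueError if malformed."""
    major, minor, patch = version.split(".")
    return int(major), int(minor), int(patch)


def admit(namespace: str, labels: dict[str, str]) -> bool:
    """Decide whether a resource satisfies the minimum policy version rule."""
    if namespace in EXCLUDED_NAMESPACES:
        return True
    raw = labels.get(POLICY_VERSION_LABEL)
    if raw is None:
        return False  # assumption: unlabelled resources are rejected
    try:
        return parse_version(raw) >= MINIMUM_VERSION
    except ValueError:
        return False  # assumption: unparseable versions are rejected


print(admit("kube-system", {}))
print(admit("default", {}))
print(admit("default", {POLICY_VERSION_LABEL: "2.0.1"}))
```

Keeping the minimum in one place means retiring an old policy version is a single, reviewable change.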
Now, it's your turn
You should be able to reuse the principles of what we've covered in this article to go forth and version your organization’s policy and make the dev experience a well-informed compliant breeze.
As you can see, this is far from the finished article. If you have thoughts to share, or think there's a better answer or more to it, tweet us!