@bigkevmcd
Last active July 6, 2020 07:13

Gated Deployments in Kubernetes

Getting to GitOps requires that the staff tasked with looking after business services are confident that changes deployed automatically meet certain standards.

How do we build that confidence?

Initial CRD example:

apiVersion: something/v1alpha1
kind: GatedRelease
metadata:
  name: release-our-app
spec:
  source:
    git: 
      uri: https://github.com/demo/demo.git
      ref: refs/tags/v273
    contextDir: deploy/service
  deploy:
    argocd-update:
      cluster: https://cluster.host
      application: our-app-name
  gates:
  - name: wait-to-deploy
    age: 30m
  - name: staging-metrics
    prometheus-alerts:
    - errors-alert
    - health-alert
    - latency-alert
  - name: office-hours
    scheduled-time:
      start: 05:00UTC
      end: 21:00UTC
  - name: global-blocks
    blocklist:
    - all-deploys
    - our-app
  - name: endpoint-check
    url: https://testing.example.com/health
  - name: image-security
    security:
      url: https://example.com/security
      cves:
        high: 1
  - name: do-something-in-a-container
    shell:
      image: my-container/image:latest
      cmd: ["testing.sh"]
      args: ["--test"]
status:
    sha: 9585191f37f7b0fb9444f35a9bf50de191beadc2
    state: Waiting
    gates:
    - name: wait-to-deploy
      state: closed
      meta:
        notBefore: 2020-07-02T19:33:54Z
    - name: staging-metrics
      state: unknown
    - name: global-blocks
      state: unknown
    - name: endpoint-check
      state: closed
      meta:
        lastError: "503 Service Unavailable"

Basic process

Something creates a GatedRelease resource.

Reconciliation of the GatedRelease starts by dereferencing the ref in source.git, i.e. if it's "refs/tags/v273", resolving the specific SHA that it references.
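As a rough illustration of the dereferencing step, here is a minimal sketch that resolves a ref against the text output of `git ls-remote` ("<sha>, tab, <refname>" per line). The function name `resolveRef` and the sample SHAs are hypothetical; a real controller might use a Git library instead of parsing command output.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// resolveRef scans `git ls-remote`-style output and returns the SHA for ref.
// For annotated tags, ls-remote also lists a peeled "<ref>^{}" entry that
// points at the commit itself, so that entry is preferred when present.
func resolveRef(lsRemoteOutput, ref string) (string, error) {
	var direct, peeled string
	sc := bufio.NewScanner(strings.NewReader(lsRemoteOutput))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue // skip blank or malformed lines
		}
		switch fields[1] {
		case ref:
			direct = fields[0]
		case ref + "^{}":
			peeled = fields[0]
		}
	}
	if err := sc.Err(); err != nil {
		return "", err
	}
	if peeled != "" {
		return peeled, nil
	}
	if direct != "" {
		return direct, nil
	}
	return "", fmt.Errorf("ref %q not found", ref)
}

func main() {
	// Hypothetical output: the first SHA stands for the tag object,
	// the second for the commit the tag points at.
	out := "1111111111111111111111111111111111111111\trefs/tags/v273\n" +
		"9585191f37f7b0fb9444f35a9bf50de191beadc2\trefs/tags/v273^{}\n"
	sha, err := resolveRef(out, "refs/tags/v273")
	fmt.Println(sha, err)
}
```

Pinning the release to a SHA means a tag that is moved later does not change what gets deployed.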

Reconciliation then spawns a goroutine to track the state of the gates.

If the gates have not all opened after some period, the release would "fail", and the GatedRelease resource would be flagged as such.

Restarting would need some sort of external intervention.

Age gates

First of all, if there's an age gate, this means that checking should not begin before a specific time.

This is to allow time for metrics to be significant before checking alerts.

After that, reconciliation enters a timed loop, checking the state of each of the gates.

This gate is failing if the time since the GatedRelease was created is less than the age.

Prometheus Alerts

By defining alerts on specific metrics, you can define the operational parameters for your service.

We can query the Prometheus Alertmanager API to ask for currently active alerts, and match on the set of names provided in the gate.

This gate is failing if any of the alert names in the configuration are in the list of active alerts from Alertmanager.

Scheduled Time

This is to prevent deployments outwith normal hours, where someone might be woken from their bed to respond to a page for a deployment.

This gate is failing if the current time is outside the scheduled-time period.

Blocklist

This is used to globally prevent deploys, or more specifically, to block deploys for specific components.

Some uses for this might be to turn off deployments for all components when some maintenance is being done, or perhaps over crucial business periods.

This should check something (TBD) to ask if any of the names in the configured blocklist are currently blocked.

This gate is failing if any of the names are blocked according to the (TBD) component.

URL checks

This is a generic catch-all: query the configured URL, and if it returns anything other than a 200 response, the check fails.

We have a proposal that would allow the user to specify the frequency at which gates are checked.

This may be applicable to certain gates only, but for example on the url gate, I'm immediately drawn to consider concepts like retryInterval, retryCount and so on.

This gate is failing if the configured URL is not returning a 200 response.

Security Gating

This should allow configuration of something that can determine the image security, possibly with configuration to allow deployments with specific overrides of the required level of security.

When all gates are successful

At this point, the release is "good to go", and the specific git reference can be applied using the deploy method above.

In this example, the basic idea of argocd-update is that it would update the referenced Argo CD application to change the version that it should be deploying.

@robpblake

For the definition of the Git repository, it might be nice to match the way that they are defined in a BuildConfig. Small syntactic difference to what you're proposing, but keeps things consistent across all of the Openshift related CRDs. See: https://docs.openshift.com/container-platform/4.4/builds/creating-build-inputs.html#source-code_creating-build-inputs

@robpblake

I wonder if the blocklist could refer to one or more global gate resources? So in the same way that you're defining a list of gates within the GatedRelease, there is also a stand-alone Gate resource which can check a global condition. I then reference those Gates in the blocklist

@robpblake

Again thinking out loud, is it useful to offer the ability to execute a container image as a gate and check the exit code? Might be as simple as offering a shell? It's a potentially simpler approach over hitting an endpoint and gives a lot of flexibility in terms of defining gates

- name: do-something-in-a-container
- shell:
    image: my-container-image
    cmd: "echo foo"

@robpblake

Would be nice to indicate in the Status sub-resource which gates are open and which are currently closed.

@robpblake

Personal preference, but I think the deploy could live as a sibling of gitReference and gates elements. So reading down the resource definition it would be:

  1. What am I going to deploy?
  2. How am I going to deploy it?
  3. What are the conditions that will allow the deployment to proceed?

@robpblake

Is there any value in allowing the user to specify the frequency at which gates are checked? This may be applicable to certain gates only, but for example on the url gate, I'm immediately drawn to consider concepts like retryInterval, retryCount and so on.

@robpblake

robpblake commented Jul 3, 2020

Is there value in considering inheritance of Gates as well? The use-case I'm trying to capture here is:

  1. Company-wide there is a defined set of Gates that all projects must adhere to (for example don't deploy at quarter-end)
  2. Product teams must inherit those Gates and can then extend the list of Gates with their own

So something like:

- gates:
  - name: defacto-gates
  - extend: `a-predefined-list-of-gates`

Or is it easier to push this into tooling around the assembly of your GatedRelease resource e.g. via Kustomize?

@bigkevmcd
Author

Or is it easier to push this into tooling around the assembly of your GatedRelease resource e.g. via Kustomize?

I suspect so, but, I definitely wouldn't rule this out, it might get complex as to "where do we get this list from?"

@bigkevmcd
Author

Would be nice to indicate in the Status sub-resource which gates are open and which are currently closed.

Yeah, I had just put in a case to highlight this.

I also expect Status to have an overall state i.e. "Released", "Waiting", "Errored" etc.

@bigkevmcd
Author

Again thinking out loud, is it useful to offer the ability to execute a container image as a gate and check the exit code? Might be as simple as offering a shell? It's a potentially simpler approach over hitting an endpoint and gives a lot of flexibility in terms of defining gates

- name: do-something-in-a-container
- shell:
    image: my-container-image
    cmd: "echo foo"

I like this idea, I can't think of any reason why not.

@bigkevmcd
Author

For the definition of the Git repository, it might be nice to match the way that they are defined in a BuildConfig. Small syntactic difference to what you're proposing, but keeps things consistent across all of the Openshift related CRDs. See: https://docs.openshift.com/container-platform/4.4/builds/creating-build-inputs.html#source-code_creating-build-inputs

Done.

@bigkevmcd
Author

Personal preference, but I think the deploy could live as a sibling of gitReference and gates elements. So reading down the resource definition it would be:

  1. What am I going to deploy?
  2. How am I going to deploy it?
  3. What are the conditions that will allow the deployment to proceed?

Yip...done
