- Kevin McDermott kmcdermo@redhat.com
- Rob Blake rblake@redhat.com
Getting to GitOps requires that the staff tasked with looking after business services are confident that changes deployed automatically meet certain standards.
How do we build that confidence?
Initial CRD example:
```yaml
apiVersion: something/v1alpha1
kind: GatedRelease
metadata:
  name: release-our-app
spec:
  source:
    git:
      uri: https://github.com/demo/demo.git
      ref: refs/tags/v273
    contextDir: deploy/service
  deploy:
    argocd-update:
      cluster: https://cluster.host
      application: our-app-name
  gates:
  - name: wait-to-deploy
    age: 30m
  - name: staging-metrics
    prometheus-alerts:
    - errors-alert
    - health-alert
    - latency-alert
  - name: office-hours
    scheduled-time:
      start: 05:00UTC
      end: 21:00UTC
  - name: global-blocks
    blocklist:
    - all-deploys
    - our-app
  - name: endpoint-check
    url: https://testing.example.com/health
  - name: image-security
    security:
      url: https://example.com/security
      cves:
        high: 1
  - name: do-something-in-a-container
    shell:
      image: my-container/image:latest
      cmd: ["testing.sh"]
      args: ["--test"]
status:
  sha: 9585191f37f7b0fb9444f35a9bf50de191beadc2
  state: Waiting
  gates:
  - name: wait-to-deploy
    state: closed
    meta:
      notBefore: 2020-07-02T19:33:54Z
  - name: staging-metrics
    state: unknown
  - name: global-blocks
    state: unknown
  - name: endpoint-check
    state: closed
    meta:
      lastError: "503 Service Unavailable"
```

Something creates a GatedRelease resource.
Reconciliation of the GatedRelease starts by dereferencing the ref for the gitReference, i.e. if it's "refs/tags/v273", resolving the specific SHA that this references.
A goroutine is then spawned to track the state.
After some period, the release would "fail", and the CRD resource would be flagged as such.
Restarting would need some sort of external intervention.
First of all, if there's an age gate, checking should not begin before a specific time.
This is to allow time for metrics to become significant before checking alerts.
The controller then enters a timed loop, checking the state of each of the gates.
This gate is failing if the time since the GatedRelease was created is less than the configured age.
By defining alerts on specific metrics, you can define the operational parameters for your service.
We can query Prometheus's AlertManager API to ask for currently active alerts, and match on the set of names provided in the gate.
This gate is failing if any of the alert names in the configuration are in the list of active alerts from AlertManager.
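A sketch of the matching side of that check. The JSON shape assumed here follows Alertmanager's `GET /api/v2/alerts` response (a list of alerts carrying a `labels` map with an `alertname` entry); the HTTP call itself is elided, and both function names are invented for this sketch:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// activeAlertNames pulls the alertname label out of the JSON body
// returned by Alertmanager's alerts endpoint.
func activeAlertNames(body []byte) ([]string, error) {
	var alerts []struct {
		Labels map[string]string `json:"labels"`
	}
	if err := json.Unmarshal(body, &alerts); err != nil {
		return nil, err
	}
	names := make([]string, 0, len(alerts))
	for _, a := range alerts {
		names = append(names, a.Labels["alertname"])
	}
	return names, nil
}

// alertsGateFailing reports whether any alert configured on the gate is
// among the currently firing alerts.
func alertsGateFailing(configured, active []string) bool {
	firing := make(map[string]bool, len(active))
	for _, n := range active {
		firing[n] = true
	}
	for _, n := range configured {
		if firing[n] {
			return true
		}
	}
	return false
}

func main() {
	body := []byte(`[{"labels":{"alertname":"latency-alert"}}]`)
	names, _ := activeAlertNames(body)
	fmt.Println(alertsGateFailing([]string{"errors-alert", "latency-alert"}, names)) // true
}
```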
This is to prevent deployments outwith normal hours, where someone might be woken from their bed to respond to a page for a deployment.
This gate is failing if the current time is outside the scheduled-time period.
This is used to globally prevent deploys, or more specifically, to block deploys for specific components.
Some uses for this might be to turn off deployments for all components when some maintenance is being done, or perhaps over crucial business periods.
This should check something (TBD) to ask if any of the names in the configured blocklist are currently blocked.
This gate is failing if any of the names are blocked according to the (TBD) component.
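Whatever the TBD component turns out to be, the gate-side logic is a simple membership check; a map stands in for the block store in this sketch, and `blocklistGateFailing` is an invented name:

```go
package main

import "fmt"

// blocklistGateFailing checks the gate's configured names against the
// set of names currently blocked. Where that set lives (the TBD
// component) is left open here.
func blocklistGateFailing(configured []string, blocked map[string]bool) bool {
	for _, n := range configured {
		if blocked[n] {
			return true
		}
	}
	return false
}

func main() {
	blocked := map[string]bool{"all-deploys": true}
	fmt.Println(blocklistGateFailing([]string{"all-deploys", "our-app"}, blocked)) // true
}
```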
This is a generic catch-all: query the configured URL, and if it returns anything other than a 200 response, the check fails.
We have a proposal that would allow the user to specify the frequency at which gates are checked.
This may be applicable to certain gates only; for example, on the url gate I'm immediately drawn to concepts like retryInterval, retryCount and so on.
This gate is failing if the configured URL is not returning a 200 response.
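A sketch of the url gate with the proposed retry settings. The probe is passed in as a function so it can be swapped out in tests; `urlGateOpen`, `fetchStatus`, `retryCount` and `retryInterval` are all names from the proposal or invented for this sketch, not an existing API:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// fetchStatus is the real probe: GET the URL and report the status code.
func fetchStatus(url string) (int, error) {
	resp, err := http.Get(url)
	if err != nil {
		return 0, err
	}
	resp.Body.Close()
	return resp.StatusCode, nil
}

// urlGateOpen retries the probe up to retryCount extra times, sleeping
// retryInterval between attempts; anything other than a 200 leaves the
// gate closed.
func urlGateOpen(probe func(string) (int, error), url string, retryCount int, retryInterval time.Duration) bool {
	for attempt := 0; attempt <= retryCount; attempt++ {
		if code, err := probe(url); err == nil && code == http.StatusOK {
			return true
		}
		time.Sleep(retryInterval)
	}
	return false
}

func main() {
	// A stub probe stands in for fetchStatus so the example runs offline.
	probe := func(string) (int, error) { return http.StatusOK, nil }
	fmt.Println(urlGateOpen(probe, "https://testing.example.com/health", 2, time.Second)) // true
}
```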
This should allow configuration of something that can determine the image's security, possibly with configuration that allows deployments with specific overrides on the severity levels.
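One way the override could work, assuming the configured security endpoint reports CVE counts per severity; `cveGateOpen` and the count maps are invented for this sketch:

```go
package main

import "fmt"

// cveGateOpen compares the per-severity CVE counts reported by the
// security scanning endpoint against the allowances in the gate's cves
// block, e.g. `high: 1` permits a single high-severity CVE.
func cveGateOpen(found, allowed map[string]int) bool {
	for severity, count := range found {
		if count > allowed[severity] {
			return false
		}
	}
	return true
}

func main() {
	allowed := map[string]int{"high": 1}
	fmt.Println(cveGateOpen(map[string]int{"high": 1}, allowed)) // true: within the override
	fmt.Println(cveGateOpen(map[string]int{"high": 2}, allowed)) // false: over the limit
}
```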
At this point, the release is "good to go", and the specific git reference can be applied using the deploy method above.
In this example, argocd-update, the basic idea is that it would update the referenced Argo CD application to change the version that it should be deploying.
For the definition of the Git repository, it might be nice to match the way that they are defined in a BuildConfig. It's a small syntactic difference to what you're proposing, but it keeps things consistent across all of the OpenShift-related CRDs. See: https://docs.openshift.com/container-platform/4.4/builds/creating-build-inputs.html#source-code_creating-build-inputs