Envoy Gateway on RKE2 + Cilium

Installation, Upgrades, and HA (No DNS/Deadlock Surprises)

Audience

Cluster operators running Envoy Gateway with:

  • NodePort exposure (no LB)
  • hostNetwork: true in the data-plane
  • Cloudflare → Origin (Envoy) TLS
  • RKE2 + Cilium

Problem history (what this runbook prevents)

Two common failure modes after upgrades/restarts:

  1. DNS breaks for hostNetwork pods. With hostNetwork=true and dnsPolicy=ClusterFirst, the pod uses host DNS and cannot resolve *.svc.cluster.local → Envoy can't reach xDS → the proxy never becomes Ready → NodePort traffic gets rejected.

  2. RollingUpdate deadlock with hostPorts. Proxy pods bind host ports (e.g. 19001/19003), so with hostNetwork=true you can only run one proxy per node. The default RollingUpdate strategy (surge) tries to run old and new pods side by side → "4 pods on 3 nodes" → one pod stays Pending forever.

This runbook makes both impossible by design.
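
The first failure mode follows a documented Kubernetes rule: a hostNetwork pod with dnsPolicy ClusterFirst falls back to the Default policy and inherits the node's resolver. A minimal sketch of that rule as a shell function is shown here; effective_dns is a hypothetical helper name for illustration, not part of kubectl or Envoy Gateway.

```shell
# Classify which resolver a pod ends up with, given its hostNetwork and
# dnsPolicy values. "cluster" means cluster DNS, "node" means the node's
# own resolv.conf, "custom" means whatever dnsConfig supplies.
effective_dns() {
  host_network="$1"   # "true" or "false"
  dns_policy="$2"     # ClusterFirst | ClusterFirstWithHostNet | Default | None
  case "$dns_policy" in
    ClusterFirstWithHostNet) echo "cluster" ;;
    ClusterFirst)
      # ClusterFirst falls back to Default for hostNetwork pods.
      if [ "$host_network" = "true" ]; then echo "node"; else echo "cluster"; fi ;;
    Default) echo "node" ;;
    None) echo "custom" ;;
    *) echo "unknown"; return 1 ;;
  esac
}
```

Calling effective_dns true ClusterFirst prints node, which is the broken combination described above; effective_dns true ClusterFirstWithHostNet prints cluster.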


Table of contents

  1. Prerequisites
  2. Install Envoy Gateway (control plane)
  3. Golden EnvoyProxy (HA + DNS fix + safe rollouts)
  4. Create Gateway + TLS listeners
  5. Post-install validation checklist
  6. Upgrade procedure (zero surprises)
  7. Troubleshooting playbook
  8. Monitoring canaries

Prerequisites

Cluster health

kubectl get nodes -o wide
kubectl get pods -A | head

Cilium health

cilium status

cert-manager (if used for TLS)

kubectl get pods -n cert-manager
kubectl get certificate -A

Install Envoy Gateway (control plane)

1) Create namespace

kubectl get ns gateway-system 2>/dev/null || kubectl create ns gateway-system

2) Install/upgrade CRDs (recommended)

Ensures CRDs are aligned with the chart version.

helm pull oci://docker.io/envoyproxy/gateway-helm --version v1.6.3 --untar

kubectl apply --server-side --force-conflicts -f ./gateway-helm/crds/gatewayapi-crds.yaml
kubectl apply --server-side --force-conflicts -f ./gateway-helm/crds/generated

3) Install control plane via Helm

helm upgrade --install eg oci://docker.io/envoyproxy/gateway-helm \
  --version v1.6.3 \
  -n gateway-system

Verify:

helm -n gateway-system list
kubectl get deploy -n gateway-system envoy-gateway -o wide
kubectl get pods -n gateway-system -l control-plane=envoy-gateway -o wide

Golden EnvoyProxy (HA + DNS fix + safe rollouts)

This is the single most important object to prevent outages.

Why this works

  • hostNetwork: true + dnsPolicy: ClusterFirstWithHostNet: ensures proxy pods use cluster DNS, not host resolvers.
  • replicas: 3: true HA on a 3-node cluster (one proxy per node).
  • maxSurge: 0 and maxUnavailable: 1: prevents the deadlock during upgrades with hostPorts.
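
The rollout arithmetic behind the last bullet can be sketched as follows. During a rolling update the peak pod count is replicas + maxSurge, and with one proxy per node the rollout only progresses if that peak fits on the available nodes. Kubernetes defaults maxSurge to 25%, rounded up, which gives a surge of 1 for 3 replicas. rollout_fits is a hypothetical helper name, and it assumes maxSurge has already been resolved to an absolute number.

```shell
# Decide whether a rolling update can progress when each node fits
# exactly one proxy pod (hostNetwork + fixed host ports).
rollout_fits() {
  replicas="$1"    # desired replicas
  max_surge="$2"   # absolute surge count (not a percentage)
  nodes="$3"       # schedulable nodes
  peak=$((replicas + max_surge))
  if [ "$peak" -le "$nodes" ]; then
    echo "ok: peak=$peak nodes=$nodes"
  else
    echo "deadlock: peak=$peak nodes=$nodes"
    return 1
  fi
}
```

rollout_fits 3 1 3 prints deadlock: peak=4 nodes=3 (the default strategy), while rollout_fits 3 0 3 prints ok: peak=3 nodes=3 (the golden config).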

Apply the golden EnvoyProxy

cat <<'EOF' | kubectl apply -f -
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: envoy-proxy
  namespace: gateway-system
spec:
  provider:
    type: Kubernetes
    kubernetes:
      envoyService:
        type: NodePort
        externalTrafficPolicy: Local
      envoyDeployment:
        replicas: 3
        patch:
          type: StrategicMerge
          value:
            spec:
              strategy:
                type: RollingUpdate
                rollingUpdate:
                  maxSurge: 0
                  maxUnavailable: 1
              template:
                spec:
                  hostNetwork: true
                  dnsPolicy: ClusterFirstWithHostNet
EOF

Create Gateway + TLS listeners

Example Gateway referencing the EnvoyProxy above:

cat <<'EOF' | kubectl apply -f -
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: main-gateway
  namespace: gateway-system
spec:
  gatewayClassName: envoy
  infrastructure:
    parametersRef:
      group: gateway.envoyproxy.io
      kind: EnvoyProxy
      name: envoy-proxy
  listeners:
  - name: https
    port: 443
    protocol: HTTPS
    tls:
      mode: Terminate
      certificateRefs:
      - kind: Secret
        name: erp-tls
      - kind: Secret
        name: links-tls
      - kind: Secret
        name: fallback-tls
    allowedRoutes:
      namespaces:
        from: All
EOF
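
The Gateway above refers to a GatewayClass named envoy, which this runbook does not create. If kubectl get gatewayclass envoy reports NotFound, a class along these lines may be needed. This is a sketch under that assumption: gatewayclass_manifest is a hypothetical helper that only prints the manifest, and the controllerName shown is the one Envoy Gateway registers by default.

```shell
# Print a GatewayClass manifest for Envoy Gateway on stdout.
# The class name defaults to "envoy" to match gatewayClassName above.
gatewayclass_manifest() {
  name="${1:-envoy}"
  cat <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: ${name}
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
EOF
}
```

To apply it, pipe the output to kubectl:

gatewayclass_manifest envoy | kubectl apply -f -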

Post-install validation checklist

1) Control plane healthy

kubectl get pods -n gateway-system -l control-plane=envoy-gateway -o wide
kubectl logs -n gateway-system deploy/envoy-gateway --since=10m

2) Data-plane Deployment template has critical fields

The generated Deployment name ends in a hash, and kubectl does not expand wildcards in resource names, so select the Deployment by label.

kubectl get deploy -n gateway-system -l app.kubernetes.io/component=proxy -o wide

kubectl get deploy -n gateway-system \
  -l gateway.envoyproxy.io/owning-gateway-name=main-gateway \
  -o jsonpath='{range .items[*]}replicas={.spec.replicas} hostNetwork={.spec.template.spec.hostNetwork} dnsPolicy={.spec.template.spec.dnsPolicy} maxSurge={.spec.strategy.rollingUpdate.maxSurge} maxUnavailable={.spec.strategy.rollingUpdate.maxUnavailable}{"\n"}{end}'

Expected:

  • replicas=3
  • hostNetwork=true
  • dnsPolicy=ClusterFirstWithHostNet
  • maxSurge=0
  • maxUnavailable=1
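
Comparing the printed line against these expected values by eye is error-prone, so the check can be scripted. The following is a minimal sketch; check_template is a hypothetical helper that takes the "key=value ..." line printed by the jsonpath query as its only argument.

```shell
# Compare one "key=value key=value ..." line against the golden values.
# Prints one MISMATCH line per missing pair and returns non-zero if any.
check_template() {
  line="$1"
  rc=0
  for want in replicas=3 hostNetwork=true dnsPolicy=ClusterFirstWithHostNet \
              maxSurge=0 maxUnavailable=1; do
    # Pad with spaces so each pair is matched as a whole word.
    case " $line " in
      *" $want "*) ;;
      *) echo "MISMATCH: expected $want"; rc=1 ;;
    esac
  done
  return "$rc"
}
```

A line that still shows dnsPolicy=ClusterFirst, for example, is reported as MISMATCH: expected dnsPolicy=ClusterFirstWithHostNet.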

3) Proxy pods are Ready and spread across nodes

kubectl get pods -n gateway-system -l app.kubernetes.io/component=proxy -o wide

Expected:

  • 3 pods, each 2/2 Running
  • one on each node (node1/node2/node3)
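
The "one on each node" expectation can also be checked mechanically. A sketch, assuming one node name per line on stdin; check_spread is a hypothetical helper name.

```shell
# Read one node name per line on stdin and fail if any node hosts
# more than one proxy pod.
check_spread() {
  dupes=$(sort | uniq -d)
  if [ -n "$dupes" ]; then
    echo "nodes hosting more than one proxy pod:"
    echo "$dupes"
    return 1
  fi
  echo "ok: one proxy pod per node"
}
```

The node names can be fed in from kubectl:

kubectl get pods -n gateway-system -l app.kubernetes.io/component=proxy -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | check_spread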

4) EndpointSlice has ready endpoints

If all endpoints become ready=false, kube-proxy may reject NodePort traffic.

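# Label selectors do not accept wildcards, so resolve the Service name first.
SVC=$(kubectl get svc -n gateway-system -l gateway.envoyproxy.io/owning-gateway-name=main-gateway \
  -o jsonpath='{.items[0].metadata.name}')
kubectl get endpointslice -n gateway-system \
  -l kubernetes.io/service-name="$SVC" \
  -o jsonpath='{range .items[*].endpoints[*]}{.addresses[0]}{" ready="}{.conditions.ready}{"\n"}{end}'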

Expected:

  • 3 endpoints, all ready=true
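
To turn this expectation into a pass/fail check, the "<address> ready=<bool>" lines can be counted. A minimal sketch; count_ready and require_ready are hypothetical helper names that read the EndpointSlice query output on stdin.

```shell
# Count endpoints reported as ready=true (one "<addr> ready=<bool>" per line).
count_ready() {
  grep -c ' ready=true$' || true   # grep -c still prints 0 when nothing matches
}

# Fail unless at least $1 endpoints are ready.
require_ready() {
  want="$1"
  got=$(count_ready)
  echo "ready endpoints: $got (want >= $want)"
  [ "$got" -ge "$want" ]
}
```

Piping the EndpointSlice output into require_ready 3 returns non-zero as soon as fewer than three endpoints are ready.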

5) NodePort reachability from nodes

Find NodePort:

kubectl get svc -n gateway-system -l gateway.envoyproxy.io/owning-gateway-name=main-gateway -o wide

Test:

nc -vz node1 <NODEPORT>
nc -vz node2 <NODEPORT>
nc -vz node3 <NODEPORT>
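
The three probes can be wrapped in a loop that reports per-node status and fails if any node is unreachable. This is a sketch: probe_nodes and the PROBE override are hypothetical names, the 3-second timeout is an arbitrary choice, and the node names are the placeholders used above.

```shell
# Probe one TCP port on each node and report per-node status.
# PROBE is the command used for a single check; it defaults to nc in
# zero-I/O mode with a 3-second timeout.
probe_nodes() {
  port="$1"; shift
  rc=0
  for node in "$@"; do
    # PROBE is intentionally unquoted so its arguments are split into words.
    if ${PROBE:-nc -z -w 3} "$node" "$port" >/dev/null 2>&1; then
      echo "$node:$port open"
    else
      echo "$node:$port UNREACHABLE"
      rc=1
    fi
  done
  return "$rc"
}
```

Usage, with <NODEPORT> taken from the Service listing:

probe_nodes <NODEPORT> node1 node2 node3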

Upgrade procedure (zero surprises)

1) Upgrade CRDs

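# Replace v1.6.3 here and in step 2 with the target version.
rm -rf ./gateway-helm   # helm pull --untar refuses to overwrite an existing directory
helm pull oci://docker.io/envoyproxy/gateway-helm --version v1.6.3 --untar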

kubectl apply --server-side --force-conflicts -f ./gateway-helm/crds/gatewayapi-crds.yaml
kubectl apply --server-side --force-conflicts -f ./gateway-helm/crds/generated

2) Upgrade Helm release (control plane)

helm upgrade eg oci://docker.io/envoyproxy/gateway-helm \
  --version v1.6.3 \
  -n gateway-system

Verify images:

kubectl get deploy -n gateway-system envoy-gateway \
  -o=jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'

3) Re-verify EnvoyProxy is still the source of truth

kubectl get envoyproxy.gateway.envoyproxy.io -n gateway-system envoy-proxy -o yaml | sed -n '1,260p'

4) Watch data-plane roll safely

Because maxSurge=0, updates should be sequential and never deadlock.

kubectl get pods -n gateway-system -l app.kubernetes.io/component=proxy -w

5) Run the post-upgrade validation checklist

Repeat the full "Post-install validation checklist" section above.


Troubleshooting playbook

Symptom A: Cloudflare error 525 (SSL handshake failed)

Start here:

kubectl get pods -n gateway-system -l app.kubernetes.io/component=proxy -o wide
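# Label selectors do not accept wildcards, so resolve the Service name first.
SVC=$(kubectl get svc -n gateway-system -l gateway.envoyproxy.io/owning-gateway-name=main-gateway \
  -o jsonpath='{.items[0].metadata.name}')
kubectl get endpointslice -n gateway-system \
  -l kubernetes.io/service-name="$SVC" \
  -o jsonpath='{range .items[*].endpoints[*]}{.addresses[0]}{" ready="}{.conditions.ready}{"\n"}{end}'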

If you see 0 ready endpoints, kube-proxy may reject NodePort and Cloudflare shows 525.

Symptom B: Envoy logs show xDS “no healthy upstream” or timeouts

Check DNS from inside an Envoy proxy pod:

kubectl debug -n gateway-system -it pod/<proxy-pod> \
  --image=busybox:1.36 --target=envoy -- sh -c \
  'cat /etc/resolv.conf; nslookup envoy-gateway.gateway-system.svc.cluster.local'

If you see public nameservers and NXDOMAIN:

  • dnsPolicy is wrong (should be ClusterFirstWithHostNet)

Fix: ensure golden EnvoyProxy is applied.
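
The resolv.conf printed by the debug command can be classified automatically. A pod on cluster DNS carries a search line that includes svc.<cluster-domain>; a pod on host DNS usually does not. This is a sketch, assuming the default cluster domain cluster.local; has_cluster_search is a hypothetical helper name.

```shell
# Read a resolv.conf on stdin and report whether it carries the cluster
# search domains. $1 overrides the cluster domain (default: cluster.local).
has_cluster_search() {
  domain="${1:-cluster.local}"
  if grep -q "^search .*svc\.${domain}"; then
    echo "cluster DNS search path present"
  else
    echo "no cluster search path: pod is likely using host DNS"
    return 1
  fi
}
```

For example, the contents of /etc/resolv.conf captured from the proxy pod can be piped into has_cluster_search.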

Symptom C: Upgrade creates 4 pods and one stays Pending forever

This is the hostNetwork + hostPorts + surge deadlock.

Confirm:

kubectl describe deploy -n gateway-system -l gateway.envoyproxy.io/owning-gateway-name=main-gateway | grep -Ei 'NewReplicaSet|OldReplicaSets'
kubectl describe pod -n gateway-system <pending-pod> | tail -n 40

Fix permanently:

  • Ensure maxSurge: 0 exists via the EnvoyProxy patch (golden config).

Emergency unblock:

  • Delete the remaining old ReplicaSet pod (the one occupying the “third node”) so the pending new pod can schedule.

Monitoring canaries

Canary 1: EndpointSlice ready endpoints

Alert if ready endpoints drop to 0 for > 60s.

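# Label selectors do not accept wildcards, so resolve the Service name first.
SVC=$(kubectl get svc -n gateway-system -l gateway.envoyproxy.io/owning-gateway-name=main-gateway \
  -o jsonpath='{.items[0].metadata.name}')
kubectl get endpointslice -n gateway-system \
  -l kubernetes.io/service-name="$SVC" \
  -o jsonpath='{range .items[*].endpoints[*]}{.addresses[0]}{" ready="}{.conditions.ready}{"\n"}{end}'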
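
The "> 60s" condition needs some debouncing so that a single bad poll does not page anyone. One way to sketch it: poll at a fixed interval, emit the ready-endpoint count once per poll, and alert only after more than N consecutive zero samples (N=6 at a 10-second interval approximates 60s). alert_on_sustained_zero is a hypothetical helper name, and the interval is an assumption.

```shell
# Read one ready-endpoint count per line on stdin (one sample per poll)
# and return non-zero once zero has been seen for more than $1
# consecutive samples.
alert_on_sustained_zero() {
  limit="$1"
  zeros=0
  while read -r count; do
    if [ "$count" -eq 0 ]; then
      zeros=$((zeros + 1))
    else
      zeros=0   # any healthy sample resets the streak
    fi
    if [ "$zeros" -gt "$limit" ]; then
      echo "ALERT: 0 ready endpoints for $zeros consecutive samples"
      return 1
    fi
  done
  echo "ok"
}
```

A brief dip that recovers within the limit prints ok; only a sustained run of zeros triggers the alert.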

Canary 2: Gateway Programmed condition

Alert if Programmed becomes False.

kubectl get gateway -n gateway-system main-gateway \
  -o jsonpath='{.status.conditions[?(@.type=="Programmed")].status}{"\n"}'
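
The status string printed by this query can be mapped to an exit code for a cron job or probe. In this sketch an empty or Unknown status is also treated as an alert, which goes slightly beyond the rule stated above and is an assumption; programmed_alert is a hypothetical helper name.

```shell
# Map the Programmed condition status to a message and an exit code.
programmed_alert() {
  status="$1"
  case "$status" in
    True)  echo "ok: Gateway is Programmed" ;;
    False) echo "ALERT: Gateway Programmed=False"; return 1 ;;
    *)     echo "ALERT: Programmed condition missing or Unknown (got '${status}')"; return 1 ;;
  esac
}
```

Pass the output of the kubectl query as the single argument; any non-zero exit can then drive the alert.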

Canary 3: DNS policy drift check (hostNetwork safety)

kubectl get deploy -n gateway-system \
  -l gateway.envoyproxy.io/owning-gateway-name=main-gateway \
  -o jsonpath='{range .items[*]}hostNetwork={.spec.template.spec.hostNetwork} dnsPolicy={.spec.template.spec.dnsPolicy}{"\n"}{end}'

Appendix: One-line “health summary”

echo "=== control plane ==="
kubectl get pods -n gateway-system -l control-plane=envoy-gateway -o wide
echo "=== data plane ==="
kubectl get pods -n gateway-system -l app.kubernetes.io/component=proxy -o wide
echo "=== endpoints ==="
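# Label selectors do not accept wildcards, so resolve the Service name first.
SVC=$(kubectl get svc -n gateway-system -l gateway.envoyproxy.io/owning-gateway-name=main-gateway \
  -o jsonpath='{.items[0].metadata.name}')
kubectl get endpointslice -n gateway-system \
  -l kubernetes.io/service-name="$SVC" \
  -o jsonpath='{range .items[*].endpoints[*]}{.addresses[0]}{" ready="}{.conditions.ready}{"\n"}{end}'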