Cluster operators running Envoy Gateway with:
- NodePort exposure (no LB)
- hostNetwork: true in the data plane
- Cloudflare → Origin (Envoy) TLS
- RKE2 + Cilium
Two common failure modes after upgrades/restarts:
- DNS breaks for hostNetwork pods: hostNetwork=true + dnsPolicy=ClusterFirst → pod uses host DNS, cannot resolve *.svc.cluster.local → Envoy can’t reach xDS → proxy never Ready → NodePort traffic is rejected.
- RollingUpdate deadlock with hostPorts: proxy pods bind host ports (e.g. 19001/19003). With hostNetwork=true, you can only run one proxy per node. Default RollingUpdate (surge) tries to run old + new pods → “4 pods on 3 nodes” → one Pending forever.
This runbook makes both impossible by design.
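The “4 pods on 3 nodes” arithmetic can be checked before a rollout. The sketch below assumes each node can host at most one proxy pod (the hostNetwork/hostPort constraint described above); the function name `rollout_fits` is illustrative and not part of any tool. For reference, the Deployment default `maxSurge` of 25% rounds up to 1 for 3 replicas, which is where the fourth pod comes from.

```shell
# Sketch: does a rolling update fit when each node can host only one proxy pod?
# Assumption: hostNetwork/hostPorts limit the data plane to one pod per node.

# rollout_fits NODES REPLICAS MAX_SURGE
# Peak pod count during a rollout is REPLICAS + MAX_SURGE.
rollout_fits() {
  local nodes=$1 replicas=$2 max_surge=$3
  local peak=$((replicas + max_surge))
  if [ "$peak" -le "$nodes" ]; then
    echo "ok: peak ${peak} pods on ${nodes} nodes"
  else
    echo "deadlock: peak ${peak} pods on ${nodes} nodes"
    return 1
  fi
}
```

Calling `rollout_fits 3 3 1` models the default surge on this cluster and reports a deadlock, while `rollout_fits 3 3 0` models the golden config and fits.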
- Prerequisites
- Install Envoy Gateway (control plane)
- Golden EnvoyProxy (HA + DNS fix + safe rollouts)
- Create Gateway + TLS listeners
- Post-install validation checklist
- Upgrade procedure (zero surprises)
- Troubleshooting playbook
- Monitoring canaries
kubectl get nodes -o wide
kubectl get pods -A | head
cilium status
kubectl get pods -n cert-manager
kubectl get certificate -A
kubectl get ns gateway-system 2>/dev/null || kubectl create ns gateway-system

Applying the CRDs from the pulled chart ensures they are aligned with the chart version.
helm pull oci://docker.io/envoyproxy/gateway-helm --version v1.6.3 --untar
kubectl apply --server-side --force-conflicts -f ./gateway-helm/crds/gatewayapi-crds.yaml
kubectl apply --server-side --force-conflicts -f ./gateway-helm/crds/generated
helm upgrade --install eg oci://docker.io/envoyproxy/gateway-helm \
--version v1.6.3 \
-n gateway-system

Verify:
helm -n gateway-system list
kubectl get deploy -n gateway-system envoy-gateway -o wide
kubectl get pods -n gateway-system -l control-plane=envoy-gateway -o wide

The golden EnvoyProxy is the single most important object for preventing outages. It sets:
- hostNetwork: true + dnsPolicy: ClusterFirstWithHostNet: ensures proxy pods use cluster DNS, not host resolvers.
- replicas: 3: true HA on a 3-node cluster (one per node).
- maxSurge: 0 and maxUnavailable: 1: prevents deadlock during upgrades with hostPorts.
cat <<'EOF' | kubectl apply -f -
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: envoy-proxy
  namespace: gateway-system
spec:
  provider:
    type: Kubernetes
    kubernetes:
      envoyService:
        type: NodePort
        externalTrafficPolicy: Local
      envoyDeployment:
        replicas: 3
        patch:
          type: StrategicMerge
          value:
            spec:
              strategy:
                type: RollingUpdate
                rollingUpdate:
                  maxSurge: 0
                  maxUnavailable: 1
              template:
                spec:
                  hostNetwork: true
                  dnsPolicy: ClusterFirstWithHostNet
EOF

Example Gateway referencing the EnvoyProxy above:
cat <<'EOF' | kubectl apply -f -
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: main-gateway
  namespace: gateway-system
spec:
  gatewayClassName: envoy
  infrastructure:
    parametersRef:
      group: gateway.envoyproxy.io
      kind: EnvoyProxy
      name: envoy-proxy
  listeners:
    - name: https
      port: 443
      protocol: HTTPS
      tls:
        mode: Terminate
        certificateRefs:
          - kind: Secret
            name: erp-tls
          - kind: Secret
            name: links-tls
          - kind: Secret
            name: fallback-tls
      allowedRoutes:
        namespaces:
          from: All
EOF

kubectl get pods -n gateway-system -l control-plane=envoy-gateway -o wide
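kubectl logs -n gateway-system deploy/envoy-gateway --since=10m

Replace the name if your generated deployment differs. The trailing -* in the commands below stands for the generated suffix; substitute the full name, because kubectl does not expand wildcards in resource names or label selectors.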
kubectl get deploy -n gateway-system -l app.kubernetes.io/component=proxy -o wide
kubectl get deploy -n gateway-system envoy-gateway-system-main-gateway-* \
-o jsonpath='replicas={.spec.replicas} hostNetwork={.spec.template.spec.hostNetwork} dnsPolicy={.spec.template.spec.dnsPolicy} maxSurge={.spec.strategy.rollingUpdate.maxSurge} maxUnavailable={.spec.strategy.rollingUpdate.maxUnavailable}{"\n"}'
Expected:
replicas=3 hostNetwork=true dnsPolicy=ClusterFirstWithHostNet maxSurge=0 maxUnavailable=1
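Comparing that summary line by eye is easy to get wrong, so a small wrapper can do it. This is a minimal sketch: `check_golden` is a hypothetical helper name, and the expected string simply restates the values this runbook prescribes.

```shell
# check_golden LINE
# Compare the one-line jsonpath summary with the values this runbook expects.
check_golden() {
  local expected='replicas=3 hostNetwork=true dnsPolicy=ClusterFirstWithHostNet maxSurge=0 maxUnavailable=1'
  if [ "$1" = "$expected" ]; then
    echo "golden config applied"
  else
    echo "drift detected: $1"
    return 1
  fi
}
```

Pass the output of the jsonpath command above as the single argument; a non-zero exit status means the generated Deployment no longer matches the golden EnvoyProxy.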
kubectl get pods -n gateway-system -l app.kubernetes.io/component=proxy -o wide
Expected:
- 3 pods, each 2/2 Running
- one on each node (node1/node2/node3)
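The "one on each node" expectation can be checked mechanically. The sketch below assumes the input is two whitespace-separated columns, pod name then node name, such as `kubectl get pods --no-headers -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName` produces; `nodes_covered` is an illustrative name.

```shell
# nodes_covered
# Read "POD NODE" pairs on stdin and print how many distinct nodes run a proxy pod.
# Pending pods show <none> in the NODE column and are not counted.
nodes_covered() {
  awk 'NF >= 2 && $2 != "<none>" && !seen[$2]++ { n++ } END { print n + 0 }'
}
```

On the 3-node cluster in this runbook the result should be 3; anything lower means a node has no proxy pod.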
If all endpoints become ready=false, kube-proxy may reject NodePort traffic.
kubectl get endpointslice -n gateway-system \
-l kubernetes.io/service-name=envoy-gateway-system-main-gateway-* \
-o jsonpath='{range .items[*].endpoints[*]}{.addresses[0]}{" ready="}{.conditions.ready}{"\n"}{end}'
Expected:
- 3 endpoints, all ready=true
Find NodePort:
kubectl get svc -n gateway-system -l gateway.envoyproxy.io/owning-gateway-name=main-gateway -o wide
Test:
nc -vz node1 <NODEPORT>
nc -vz node2 <NODEPORT>
nc -vz node3 <NODEPORT>

helm pull oci://docker.io/envoyproxy/gateway-helm --version v1.6.3 --untar
kubectl apply --server-side --force-conflicts -f ./gateway-helm/crds/gatewayapi-crds.yaml
kubectl apply --server-side --force-conflicts -f ./gateway-helm/crds/generated
helm upgrade eg oci://docker.io/envoyproxy/gateway-helm \
--version v1.6.3 \
-n gateway-system

Verify images:
kubectl get deploy -n gateway-system envoy-gateway \
-o=jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
kubectl get envoyproxy.gateway.envoyproxy.io -n gateway-system envoy-proxy -o yaml | sed -n '1,260p'

Because maxSurge=0, updates should be sequential and never deadlock.
kubectl get pods -n gateway-system -l app.kubernetes.io/component=proxy -w

Then repeat the full post-install validation checklist above.
Start here:
kubectl get pods -n gateway-system -l app.kubernetes.io/component=proxy -o wide
kubectl get endpointslice -n gateway-system \
-l kubernetes.io/service-name=envoy-gateway-system-main-gateway-* \
-o jsonpath='{range .items[*].endpoints[*]}{.addresses[0]}{" ready="}{.conditions.ready}{"\n"}{end}'
If you see 0 ready endpoints, kube-proxy may reject NodePort and Cloudflare shows 525.
Check DNS from inside an Envoy proxy pod:
kubectl debug -n gateway-system -it pod/<proxy-pod> \
--image=busybox:1.36 --target=envoy -- sh -c \
'cat /etc/resolv.conf; nslookup envoy-gateway.gateway-system.svc.cluster.local'
If you see public nameservers and NXDOMAIN:
- dnsPolicy is wrong (should be ClusterFirstWithHostNet)
Fix: ensure golden EnvoyProxy is applied.
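The resolv.conf check can be reduced to a yes/no test. The helper below is a heuristic sketch, not an official check: it assumes the cluster domain is cluster.local, as in the nslookup target above, and `uses_cluster_dns` is a made-up name.

```shell
# uses_cluster_dns
# Read a resolv.conf on stdin; succeed if a search domain is svc.cluster.local
# (or ends in it), which is what a pod with ClusterFirstWithHostNet receives.
# Assumption: the cluster domain is cluster.local.
uses_cluster_dns() {
  grep -Eq '^search[[:space:]].*svc\.cluster\.local([[:space:]]|$)'
}
```

Pipe the `cat /etc/resolv.conf` output from the debug container into it; a non-zero exit status suggests the pod is still resolving through the host's nameservers.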
A proxy pod stuck Pending during a rollout is the hostNetwork + hostPorts + surge deadlock.
Confirm:
kubectl describe deploy -n gateway-system envoy-gateway-system-main-gateway-* | grep -Ei 'NewReplicaSet|OldReplicaSets'
kubectl describe pod -n gateway-system <pending-pod> | tail -n 40
Fix permanently:
- Ensure maxSurge: 0 exists via the EnvoyProxy patch (golden config).
Emergency unblock:
- Delete the remaining old ReplicaSet pod (the one occupying the “third node”) so the pending new pod can schedule.
Alert if ready endpoints drop to 0 for > 60s.
kubectl get endpointslice -n gateway-system \
-l kubernetes.io/service-name=envoy-gateway-system-main-gateway-* \
-o jsonpath='{range .items[*].endpoints[*]}{.addresses[0]}{" ready="}{.conditions.ready}{"\n"}{end}'
Alert if Programmed becomes False.
kubectl get gateway -n gateway-system main-gateway \
-o jsonpath='{.status.conditions[?(@.type=="Programmed")].status}{"\n"}'
Alert if hostNetwork or dnsPolicy drifts from the golden values.
kubectl get deploy -n gateway-system envoy-gateway-system-main-gateway-* \
-o jsonpath='hostNetwork={.spec.template.spec.hostNetwork} dnsPolicy={.spec.template.spec.dnsPolicy}{"\n"}'

echo "=== control plane ==="
kubectl get pods -n gateway-system -l control-plane=envoy-gateway -o wide
echo "=== data plane ==="
kubectl get pods -n gateway-system -l app.kubernetes.io/component=proxy -o wide
echo "=== endpoints ==="
kubectl get endpointslice -n gateway-system \
-l kubernetes.io/service-name=envoy-gateway-system-main-gateway-* \
-o jsonpath='{range .items[*].endpoints[*]}{.addresses[0]}{" ready="}{.conditions.ready}{"\n"}{end}'
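
The ready-endpoint canary above can be turned into a status line for a cron job or monitoring probe. This is a sketch under stated assumptions: the input is the `<address> ready=<bool>` lines printed by the endpointslice command, 3 is the expected endpoint count from this runbook, and both function names are illustrative. The "> 60s" condition is left to whatever alerting system consumes the output.

```shell
# ready_count
# Read lines like "10.0.0.1 ready=true" on stdin and print how many are ready.
ready_count() {
  # grep -c exits non-zero when nothing matches but still prints 0.
  grep -c 'ready=true$' || true
}

# canary_status COUNT
# Map a ready-endpoint count to a status line (3 endpoints expected).
canary_status() {
  if [ "$1" -eq 0 ]; then
    echo "CRITICAL: 0 ready endpoints"
  elif [ "$1" -lt 3 ]; then
    echo "WARN: $1/3 ready endpoints"
  else
    echo "OK: $1 ready endpoints"
  fi
}
```

A probe would pipe the endpointslice output into `ready_count` and hand the result to `canary_status`.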