@mattmattox
Last active December 23, 2025 18:24
flannel-watchdog

Overview

This project provides a Flannel annotation watchdog for Kubernetes nodes. Every LOOP_DELAY seconds it checks whether the flannel.alpha.coreos.com/backend-data annotation is present on the local node. If the annotation is missing or empty (which typically indicates a Flannel restart or networking issue), the watchdog can restart the Flannel container and then wait for the annotation to reappear. A dry-run mode logs the detection without restarting anything.
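
The decision made on each pass can be condensed into a small pure function. This is an illustrative sketch rather than the script itself: decide_action is a hypothetical name, the annotation value shown is a placeholder, and the sketch assumes the annotation and the DRY_RUN setting have already been read.

```shell
# Hypothetical helper: map (annotation value, DRY_RUN) to the action the
# watchdog would take. Prints one of: ok, detect-only, restart.
decide_action() {
    annot_val="$1"
    dry_run="$2"

    if [ -n "$annot_val" ]; then
        echo "ok"              # annotation present: nothing to do
    elif [ "$dry_run" = "1" ]; then
        echo "detect-only"     # annotation missing, dry-run: log only
    else
        echo "restart"         # annotation missing: restart flannel
    fi
}

decide_action 'placeholder-backend-data' 0
decide_action '' 1
decide_action '' 0
```

The real script accepts more spellings than 1 for dry-run and also waits for recovery after a restart; the sketch leaves both out to keep the branching visible.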

Installation

Download the script and the systemd service file, then install and enable the watchdog.

1. Fetch the gist files

Use wget or curl to download the script and service files.

# Create a working directory
mkdir -p /opt/flannel-watchdog
cd /opt/flannel-watchdog

# Download the watchdog script
wget https://gist.githubusercontent.com/mattmattox/11689c8d2c6fc5fff7dbd78ed3baae69/raw/flannel-watchdog.sh

# Download the systemd service unit
wget https://gist.githubusercontent.com/mattmattox/11689c8d2c6fc5fff7dbd78ed3baae69/raw/flannel-watchdog.service

# Optional: create an environment file template
cat > flannel-watchdog.env << 'EOF'
# Polling cadence (seconds)
LOOP_DELAY=30

# Recovery wait behavior
WAIT_TIMEOUT=180
WAIT_INTERVAL=5

# Dry run: 1=detection only, 0=restart containers
DRY_RUN=1

# Pattern to match flannel containers in docker ps
MATCH_PATTERN=flannel
EOF

Or with curl:

# -f fails on HTTP errors instead of saving an error page; -L follows redirects
curl -fsSL -o flannel-watchdog.sh \
     https://gist.githubusercontent.com/mattmattox/11689c8d2c6fc5fff7dbd78ed3baae69/raw/flannel-watchdog.sh

curl -fsSL -o flannel-watchdog.service \
     https://gist.githubusercontent.com/mattmattox/11689c8d2c6fc5fff7dbd78ed3baae69/raw/flannel-watchdog.service

2. Install the script

install -m 0755 flannel-watchdog.sh /usr/local/sbin/flannel-watchdog.sh

3. Configure systemd

Copy the service unit into /etc/systemd/system/:

install -m 0644 flannel-watchdog.service /etc/systemd/system/flannel-watchdog.service

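Optionally, place an environment file with configuration overrides at /etc/sysconfig/flannel-watchdog, the path the service unit references (change EnvironmentFile= in the unit if you prefer another location). install -D creates /etc/sysconfig if it does not already exist, as on Debian-based systems:

install -D -m 0644 flannel-watchdog.env /etc/sysconfig/flannel-watchdog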

Adjust DRY_RUN in the environment file to 0 when you want the watchdog to perform actual restart actions.

4. Enable and start the service

Reload systemd and enable the watchdog:

systemctl daemon-reload
systemctl enable --now flannel-watchdog.service

Check its status and logs:

systemctl status flannel-watchdog.service
journalctl -u flannel-watchdog.service -f

Configuration Options

Variable       Description                                                Default
LOOP_DELAY     Seconds between annotation checks                          30
WAIT_TIMEOUT   Seconds to wait for flannel recovery after a restart       180
WAIT_INTERVAL  Seconds between polls while waiting for recovery           5
DRY_RUN        1 = detect only (no restart), 0 = enable restart           0 in the script; 1 in the sample environment file
MATCH_PATTERN  Case-insensitive pattern matched against docker ps output  flannel

Place an environment file (e.g., /etc/sysconfig/flannel-watchdog) to override these values without modifying the script.
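
As a rough illustration of how these overrides take effect: the script assigns each setting with the shell's ${VAR:-default} expansion, so a value that systemd exports from the environment file wins, and the built-in default applies only when the variable is unset or empty. The sketch below reproduces that pattern in isolation; show_loop_delay is a hypothetical name used only for this example.

```shell
show_loop_delay() {
    # Same expansion pattern the script uses for LOOP_DELAY.
    LOOP_DELAY="${LOOP_DELAY:-30}"
    echo "LOOP_DELAY=${LOOP_DELAY}"
}

# No override: the built-in default applies.
( unset LOOP_DELAY; show_loop_delay )

# Override supplied by the environment (as EnvironmentFile would do).
( LOOP_DELAY=10; show_loop_delay )
```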

Dry-Run Mode

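The sample environment file above sets DRY_RUN=1, so a watchdog installed with that file only logs detection events and never restarts a container. The script itself falls back to DRY_RUN=0 (restarts enabled) when the variable is unset, so install the environment file if you want detect-only behavior. Set DRY_RUN=0 once you want the watchdog to restart containers automatically.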
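
The script treats several spellings as "dry-run on", not only 1. The sketch below is a portable approximation of its is_truthy helper (it uses tr rather than the bash-only ${1,,} expansion) and prints which values enable dry-run; anything else, including an empty value, permits restarts.

```shell
is_truthy() {
    # Lowercase the argument, then compare against the accepted spellings.
    case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
        1|true|yes|y|on) return 0 ;;
        *) return 1 ;;
    esac
}

for value in 1 TRUE yes on 0 false ""; do
    if is_truthy "$value"; then
        echo "DRY_RUN='${value}' -> detect only"
    else
        echo "DRY_RUN='${value}' -> restarts allowed"
    fi
done
```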

Notes

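  • The script expects docker as the container runtime (it calls docker ps, docker inspect, and docker restart); adapt it if you use containerd/crictl.
  • The script queries the API server with kubectl and the node kubeconfig kubecfg-kube-node.yaml, which it looks for in /opt/rke/etc/kubernetes/ssl if that directory exists and in /etc/kubernetes/ssl otherwise. Ensure kubectl is on the PATH and that this kubeconfig is allowed to read the node object.
  • The node is looked up by its short hostname (hostname -s), so the Kubernetes node name must match it.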
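
For nodes that run containerd, the container lookup could be adapted along these lines. This is an untested sketch, not part of the gist: it assumes crictl is installed and configured for the node's runtime endpoint, and both function names are hypothetical. Unlike the docker version, crictl ps --name matches on the container name only, not the image. crictl also has no restart subcommand; stopping the container and letting the kubelet recreate it yields a new container ID, so the script's ID-based wait loop would need to change as well.

```shell
# Hypothetical containerd variant of find_flannel_container_ids.
# `crictl ps --name` filters running containers by a name regex and
# `-q` prints only the container IDs.
find_flannel_container_ids_crictl() {
    crictl ps --name "${MATCH_PATTERN:-flannel}" -q 2>/dev/null || true
}

# Hypothetical recovery step: stop each match and rely on the kubelet
# to start a replacement container.
stop_flannel_containers_crictl() {
    for cid in $(find_flannel_container_ids_crictl); do
        echo "Stopping container ${cid}"
        crictl stop "${cid}" >/dev/null
    done
}
```
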
# Polling cadence
LOOP_DELAY=30
# Recovery wait behavior
WAIT_TIMEOUT=180
WAIT_INTERVAL=5
# Dry run (1 = detect only, no restart)
DRY_RUN=1
# Container match pattern for docker ps (name/image)
MATCH_PATTERN=flannel
[Unit]
Description=Flannel annotation watchdog (restart flannel container if backend-data missing)
After=network-online.target docker.service
Wants=network-online.target
Requires=docker.service
[Service]
Type=simple
ExecStart=/usr/local/sbin/flannel-watchdog.sh
Restart=always
RestartSec=5
# Load optional configuration overrides
EnvironmentFile=-/etc/sysconfig/flannel-watchdog
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=full
ProtectHome=true
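# ProtectSystem=full mounts /usr, /boot and /etc read-only; kubectl only needs to read the kubeconfig.
# The Docker socket lives under /run (/var/run), which is listed here so it stays writable.
ReadWritePaths=/var/run /run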
[Install]
WantedBy=multi-user.target
#!/usr/bin/env bash
set -euo pipefail
###############################################################################
# Configuration (override via environment or systemd EnvironmentFile)
###############################################################################
LOOP_DELAY="${LOOP_DELAY:-30}" # Seconds between checks
WAIT_TIMEOUT="${WAIT_TIMEOUT:-180}" # Max wait for flannel recovery after restart
WAIT_INTERVAL="${WAIT_INTERVAL:-5}" # Poll interval during recovery wait
DRY_RUN="${DRY_RUN:-0}" # 1 = detect only, do not restart
MATCH_PATTERN="${MATCH_PATTERN:-flannel}" # docker ps match (name/image), case-insensitive
ANNOT_KEY="flannel.alpha.coreos.com/backend-data"
###############################################################################
# Helpers
###############################################################################
techo() {
  printf '%s %s\n' "$(date '+%Y-%m-%dT%H:%M:%S%z')" "$*"
}
is_truthy() {
  case "${1,,}" in
    1|true|yes|y|on) return 0 ;;
    *) return 1 ;;
  esac
}
###############################################################################
# Root check (service should run as root if it needs docker/kubeconfig access)
###############################################################################
if [[ "${EUID}" -ne 0 ]]; then
  techo "ERROR: This script must be run as root."
  exit 1
fi
###############################################################################
# Kubernetes SSL / kubeconfig detection
###############################################################################
if [ -d /opt/rke/etc/kubernetes/ssl ]; then
  K8S_SSLDIR=/opt/rke/etc/kubernetes/ssl
else
  K8S_SSLDIR=/etc/kubernetes/ssl
fi
KUBECONFIG="${K8S_SSLDIR}/kubecfg-kube-node.yaml"
NODE_NAME="$(hostname -s)"
###############################################################################
# Functions
###############################################################################
get_annotation_value() {
  local val
  val="$(
    kubectl --kubeconfig "${KUBECONFIG}" get node "${NODE_NAME}" -o \
      "jsonpath={.metadata.annotations['${ANNOT_KEY}']}" 2>/dev/null || true
  )"
  [[ "${val}" == "null" ]] && val=""
  printf '%s' "${val}" | tr -d '\r\n' | sed 's/^[[:space:]]*//;s/[[:space:]]*$//'
}
find_flannel_container_ids() {
  docker ps --format '{{.ID}} {{.Names}} {{.Image}}' \
    | grep -i -- "${MATCH_PATTERN}" \
    | awk '{print $1}' \
    || true
}
wait_for_containers_running() {
  local start now elapsed state cid
  local ids=("$@")
  start="$(date +%s)"
  while true; do
    local all_running=true
    for cid in "${ids[@]}"; do
      state="$(docker inspect -f '{{.State.Status}}' "${cid}" 2>/dev/null || true)"
      if [[ "${state}" != "running" ]]; then
        all_running=false
        techo "Waiting for container ${cid} (state=${state:-unknown})..."
      fi
    done
    if [[ "${all_running}" == "true" ]]; then
      techo "Container(s) matching '${MATCH_PATTERN}' are running."
      return 0
    fi
    now="$(date +%s)"
    elapsed="$((now - start))"
    if (( elapsed >= WAIT_TIMEOUT )); then
      techo "ERROR: Timed out waiting for container(s) to be running."
      return 1
    fi
    sleep "${WAIT_INTERVAL}"
  done
}
wait_for_annotation_present() {
  local start now elapsed val
  start="$(date +%s)"
  while true; do
    val="$(get_annotation_value || true)"
    if [[ -n "${val}" ]]; then
      techo "Annotation '${ANNOT_KEY}' is populated."
      return 0
    fi
    techo "Waiting for annotation '${ANNOT_KEY}'..."
    now="$(date +%s)"
    elapsed="$((now - start))"
    if (( elapsed >= WAIT_TIMEOUT )); then
      techo "ERROR: Timed out waiting for annotation '${ANNOT_KEY}'."
      return 1
    fi
    sleep "${WAIT_INTERVAL}"
  done
}
check_and_recover() {
  local annot_val cid
  local -a ids=()
  annot_val="$(get_annotation_value || true)"
  if [[ -z "${annot_val}" ]]; then
    if is_truthy "${DRY_RUN}"; then
      techo "DETECTED: Annotation missing/empty on node '${NODE_NAME}' (dry-run; no restart performed)."
      return 0
    fi
    techo "DETECTED: Annotation missing/empty on node '${NODE_NAME}'. Initiating recovery."
    mapfile -t ids < <(find_flannel_container_ids)
    if (( ${#ids[@]} == 0 )); then
      techo "WARNING: No containers matched pattern '${MATCH_PATTERN}' via 'docker ps'."
      return 0
    fi
    for cid in "${ids[@]}"; do
      techo "Restarting container ${cid}"
      docker restart "${cid}" >/dev/null
    done
    # errexit is ignored while this function runs on the left side of '||'
    # in the main loop, so wait failures must be propagated explicitly.
    # Otherwise "Recovery completed." would be logged even after a timeout.
    wait_for_containers_running "${ids[@]}" || return 1
    wait_for_annotation_present || return 1
    techo "Recovery completed."
  else
    techo "OK: Annotation present on node '${NODE_NAME}'."
  fi
}
###############################################################################
# Main loop
###############################################################################
techo "Starting watchdog (node=${NODE_NAME}, delay=${LOOP_DELAY}s, dry_run=${DRY_RUN}, match='${MATCH_PATTERN}')"
while true; do
  check_and_recover || techo "WARNING: iteration completed with errors."
  sleep "${LOOP_DELAY}"
done
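
One caveat with get_annotation_value as written: because kubectl errors are discarded, an unreachable API server looks the same as a missing annotation and would trigger a restart when dry-run is off. A possible refinement, sketched here under the assumption that treating a failed query as "unknown" is acceptable, is to return a distinct status when kubectl itself fails. The function names and the return code 2 are illustrative choices, not part of the script.

```shell
# Hypothetical variant: prints the annotation value on success and
# returns 2 when the kubectl query itself fails.
get_annotation_checked() {
    kubeconfig="$1"; node="$2"; key="$3"
    if ! val="$(kubectl --kubeconfig "$kubeconfig" get node "$node" \
            -o "jsonpath={.metadata.annotations['$key']}" 2>/dev/null)"; then
        return 2    # API unreachable or node lookup failed: state unknown
    fi
    printf '%s' "$val"
}

# Example caller: only a successful query with an empty value counts as missing.
check_once() {
    if val="$(get_annotation_checked "$@")"; then
        if [ -n "$val" ]; then echo "present"; else echo "missing"; fi
    else
        echo "unknown (query failed)"
    fi
}
```

A caller would invoke it as check_once "${KUBECONFIG}" "${NODE_NAME}" "${ANNOT_KEY}" and skip recovery when the result is unknown.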