| ------------------------ | |
| I am coding a CSI Driver in Golang that will mount a PVC into a pod volume mount. | |
| Can you write a golang set of functions that will mount a PVC in a pod? | |
| ------------------------ | |
| This is an important clarification: | |
| **A CSI driver does NOT mount a PVC into a Pod.** | |
| Kubernetes does that. | |
| Your CSI **Node plugin** only mounts a volume to a path that kubelet gives you (`target_path`). | |
| Kubelet then bind-mounts that into the Pod. | |
| So in `NodePublishVolume`, you: | |
| 1. Prepare the backing storage (if needed) | |
| 2. Mount it to `target_path` | |
| 3. Return success | |
| You never interact with Pods or PVC objects directly in the mount path. | |
| --- | |
| # What Actually Happens | |
| Flow: | |
| 1. User creates PVC | |
| 2. PVC bound to PV | |
| 3. Pod references PVC | |
| 4. Pod scheduled to Node | |
| 5. Kubelet calls: | |
| * `NodeStageVolume` (optional) | |
| * `NodePublishVolume` | |
| 6. CSI driver mounts to `target_path` | |
| 7. Kubelet bind-mounts into container | |
| So your job is simply: | |
| ``` | |
| Mount(source) -> targetPath | |
| ``` | |
| --- | |
| # Example: Minimal CSI NodePublishVolume Implementation in Go | |
| This example mounts a filesystem volume using `mount(8)` via Kubernetes mount utils. | |
| It assumes: | |
| * You are mounting a block device like `/dev/sdb` | |
| * Or mounting a directory | |
| * Linux node | |
| --- | |
| ## Dependencies | |
| ```go | |
| import ( | |
| "context" | |
| "fmt" | |
| "os" | |
| "github.com/container-storage-interface/spec/lib/go/csi" | |
| "k8s.io/mount-utils" | |
| ) | |
| ``` | |
| --- | |
| ## Driver Struct | |
| ```go | |
| type NodeServer struct { | |
| mounter mount.Interface | |
| } | |
| ``` | |
| Initialize it: | |
| ```go | |
| func NewNodeServer() *NodeServer { | |
| return &NodeServer{ | |
| mounter: mount.New(""), | |
| } | |
| } | |
| ``` | |
| --- | |
| # NodePublishVolume | |
| ```go | |
| func (ns *NodeServer) NodePublishVolume( | |
| ctx context.Context, | |
| req *csi.NodePublishVolumeRequest, | |
| ) (*csi.NodePublishVolumeResponse, error) { | |
| targetPath := req.GetTargetPath() | |
| volumeID := req.GetVolumeId() | |
| readOnly := req.GetReadonly() | |
| if targetPath == "" { | |
| return nil, fmt.Errorf("target path not provided") | |
| } | |
| // Example: assume source device path is based on volumeID | |
| source := fmt.Sprintf("/dev/%s", volumeID) | |
| // Ensure target directory exists | |
| if err := os.MkdirAll(targetPath, 0750); err != nil { | |
| return nil, fmt.Errorf("failed to create target path: %w", err) | |
| } | |
| // Check if already mounted | |
| notMnt, err := mount.IsNotMountPoint(ns.mounter, targetPath) | |
| if err != nil { | |
| return nil, fmt.Errorf("failed to check mount point: %w", err) | |
| } | |
| if !notMnt { | |
| // Already mounted | |
| return &csi.NodePublishVolumeResponse{}, nil | |
| } | |
| options := []string{} | |
| if readOnly { | |
| options = append(options, "ro") | |
| } | |
| // Perform mount | |
| if err := ns.mounter.Mount(source, targetPath, "ext4", options); err != nil { | |
| return nil, fmt.Errorf("mount failed: %w", err) | |
| } | |
| return &csi.NodePublishVolumeResponse{}, nil | |
| } | |
| ``` | |
| --- | |
| # NodeUnpublishVolume | |
| You must implement unmount: | |
| ```go | |
| func (ns *NodeServer) NodeUnpublishVolume( | |
| ctx context.Context, | |
| req *csi.NodeUnpublishVolumeRequest, | |
| ) (*csi.NodeUnpublishVolumeResponse, error) { | |
| targetPath := req.GetTargetPath() | |
| notMnt, err := mount.IsNotMountPoint(ns.mounter, targetPath) | |
| if err != nil { | |
| return nil, err | |
| } | |
| if notMnt { | |
| return &csi.NodeUnpublishVolumeResponse{}, nil | |
| } | |
| if err := ns.mounter.Unmount(targetPath); err != nil { | |
| return nil, err | |
| } | |
| return &csi.NodeUnpublishVolumeResponse{}, nil | |
| } | |
| ``` | |
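| To wire these handlers into a running driver, the Node service is registered on a gRPC server listening on the kubelet plugin socket. A minimal sketch, assuming the socket path shown here is a placeholder and that `NodeServer` satisfies the full `csi.NodeServer` interface (for example by embedding `csi.UnimplementedNodeServer`); it additionally needs `net` and `google.golang.org/grpc` imports: | |
| ```go | |
| // Sketch: serve the Node service on the kubelet plugin socket. | |
| // The driver directory/socket path below is a placeholder. | |
| func runNodeService() error { | |
|     lis, err := net.Listen("unix", "/var/lib/kubelet/plugins/mydriver.example.com/csi.sock") | |
|     if err != nil { | |
|         return fmt.Errorf("failed to listen on plugin socket: %w", err) | |
|     } | |
|     srv := grpc.NewServer() | |
|     // NodeServer is assumed to implement csi.NodeServer in full, | |
|     // e.g. by embedding csi.UnimplementedNodeServer. | |
|     csi.RegisterNodeServer(srv, NewNodeServer()) | |
|     return srv.Serve(lis) | |
| } | |
| ``` | |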
| --- | |
| # If Using NodeStageVolume (Recommended) | |
| If implementing staging: | |
| * `NodeStageVolume` -> mount device to global path | |
| * `NodePublishVolume` -> bind mount from staging to pod path | |
| Example bind mount: | |
| ```go | |
| err := ns.mounter.Mount( | |
| stagingPath, | |
| targetPath, | |
| "", | |
| []string{"bind"}, | |
| ) | |
| ``` | |
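| For reference, a minimal `NodeStageVolume` sketch in the same style as the publish example above; as before, deriving the device path from the volume ID and using `ext4` are assumptions for illustration: | |
| ```go | |
| // Sketch: stage the volume by mounting the device once to the staging path. | |
| // Deriving the device path from the volume ID is an illustrative assumption. | |
| func (ns *NodeServer) NodeStageVolume( | |
|     ctx context.Context, | |
|     req *csi.NodeStageVolumeRequest, | |
| ) (*csi.NodeStageVolumeResponse, error) { | |
|     stagingPath := req.GetStagingTargetPath() | |
|     volumeID := req.GetVolumeId() | |
|     if stagingPath == "" || volumeID == "" { | |
|         return nil, fmt.Errorf("staging target path and volume ID are required") | |
|     } | |
|     source := fmt.Sprintf("/dev/%s", volumeID) | |
|     // Ensure the staging directory exists | |
|     if err := os.MkdirAll(stagingPath, 0750); err != nil { | |
|         return nil, fmt.Errorf("failed to create staging path: %w", err) | |
|     } | |
|     // Skip if already staged | |
|     notMnt, err := mount.IsNotMountPoint(ns.mounter, stagingPath) | |
|     if err != nil { | |
|         return nil, fmt.Errorf("failed to check staging mount point: %w", err) | |
|     } | |
|     if !notMnt { | |
|         return &csi.NodeStageVolumeResponse{}, nil | |
|     } | |
|     if err := ns.mounter.Mount(source, stagingPath, "ext4", nil); err != nil { | |
|         return nil, fmt.Errorf("staging mount failed: %w", err) | |
|     } | |
|     return &csi.NodeStageVolumeResponse{}, nil | |
| } | |
| ``` | |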
| --- | |
| # Important: You DO NOT: | |
| * Create Pods | |
| * Modify Pods | |
| * Mount into container namespace directly | |
| * Touch PVC objects during publish | |
| Kubelet handles all of that. | |
| --- | |
| # If Your Storage Is NOT a Block Device | |
| If your CSI driver: | |
| * Mounts NFS -> use `type = "nfs"` | |
| * Mounts tmpfs -> use `type = "tmpfs"` | |
| * Mounts FUSE -> call external binary | |
| * Mounts custom backend -> prepare directory and bind mount | |
| Tell me your backend type and I can tailor the example. | |
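| For instance, an NFS-backed publish would skip the device path and mount the export directly inside `NodePublishVolume`; the server address and export path below are placeholders: | |
| ```go | |
| // Sketch: publish an NFS export to the target path (server and export path are placeholders). | |
| source := "nfs.example.com:/exports/mydata" | |
| options := []string{"ro"} | |
| if err := ns.mounter.Mount(source, targetPath, "nfs", options); err != nil { | |
|     return nil, fmt.Errorf("nfs mount failed: %w", err) | |
| } | |
| ``` | |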
| --- | |
| # Summary | |
| To "mount a PVC into a pod" in CSI: | |
| You only: | |
| ``` | |
| Mount(source storage) -> targetPath | |
| ``` | |
| Kubelet handles the rest. | |
| --- | |
| If you tell me: | |
| * Block or filesystem? | |
| * Local or network? | |
| * Using staging? | |
| * Linux only? | |
| I can give you a production-grade skeleton driver structure. | |
| ------------------------ | |
| I am writing a Kubernetes operator that needs to mount the same data into multiple | |
| workload pods running on different nodes using PVCs. If my cluster does not support | |
| a StorageClass with an access mode of ReadOnlyMany, I was planning on creating a PVC per | |
| node. I don't want the workload pods to have to know the name of the PVC on each node. | |
| Can I have the workload pod volume reference my CSI driver and have the CSI driver | |
| determine the PVC name and manage the mounting of the data into the workload pod? | |
| If CSI driver is not the best option for this use case, what other pattern should be | |
| used to simplify the pod deployment? | |
| ------------------------ | |
| This is a very good architectural question - and you're thinking in the right direction. | |
| Short answer: | |
| > No - a Pod cannot dynamically "ask" a CSI driver to choose a PVC at mount time. | |
| > The PVC must already be bound to a PV before the Pod starts. | |
| > So CSI is not the right abstraction layer for solving this problem. | |
| Let's break down why, and then I'll show you the correct pattern. | |
| --- | |
| # Why CSI Cannot Choose the PVC at Mount Time | |
| Volume resolution happens in this order: | |
| 1. Pod spec references **a PVC name** | |
| 2. PVC is bound to a PV | |
| 3. Scheduler ensures topology constraints | |
| 4. Kubelet calls CSI with: | |
| * `volume_id` | |
| * `target_path` | |
| * `volume_context` | |
| By the time `NodePublishVolume` is called: | |
| * The PVC is already resolved | |
| * The PV is already chosen | |
| * The driver gets a volume ID - not a PVC name | |
| * You cannot switch to another PVC | |
| So the CSI driver has no opportunity to "choose the PVC for that node." | |
| --- | |
| # Your Actual Requirement | |
| You want: | |
| * Same logical data | |
| * Multiple nodes | |
| * Underlying storage only supports RWO | |
| * Pods should not need to know node-specific PVC names | |
| * Operator should abstract this away | |
| That's a scheduling and orchestration problem - not a CSI problem. | |
| --- | |
| # Correct Pattern: Operator Owns the PVC Fan-Out | |
| This is a very common design. | |
| ## Architecture Pattern | |
| Your operator: | |
| 1. Detects which nodes run workload Pods | |
| 2. Creates one PVC per node: | |
| * `mydata-node-a` | |
| * `mydata-node-b` | |
| 3. Creates a Pod (or mutates it) to reference the correct PVC for that node | |
| The workload Pod spec should be generated or mutated by your operator. | |
| Pods themselves do not need to "know" the naming logic. | |
| --- | |
| # Clean Pattern #1 - Use a Mutating Admission Webhook | |
| This is the cleanest Kubernetes-native solution. | |
| Flow: | |
| 1. Workload Pod contains a logical volume: | |
| ```yaml | |
| volumes: | |
|   - name: mydata | |
|     persistentVolumeClaim: | |
|       claimName: mydata-placeholder | |
| ``` | |
| 2. Your mutating webhook: | |
| * Detects node assignment (or waits until scheduled) | |
| * Rewrites `claimName` to: | |
| ``` | |
| mydata-${spec.nodeName} | |
| ``` | |
| This keeps workload manifests simple. | |
| --- | |
| # Clean Pattern #2 - Operator Manages StatefulSet Instead | |
| If possible, convert workload into a StatefulSet: | |
| StatefulSets already create per-pod PVCs automatically: | |
| ```yaml | |
| volumeClaimTemplates: | |
|   - metadata: | |
|       name: mydata | |
| ``` | |
| Then each pod gets: | |
| ``` | |
| mydata-0 | |
| mydata-1 | |
| mydata-2 | |
| ``` | |
| No CSI changes needed. | |
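| A slightly fuller sketch of that claim template (the storage class name and size are placeholders): | |
| ```yaml | |
| volumeClaimTemplates: | |
|   - metadata: | |
|       name: mydata | |
|     spec: | |
|       accessModes: ["ReadWriteOnce"] | |
|       storageClassName: my-rwo-class   # placeholder | |
|       resources: | |
|         requests: | |
|           storage: 10Gi                # placeholder | |
| ``` | |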
| --- | |
| # Clean Pattern #3 - Use a DaemonSet + hostPath | |
| If the data is identical and node-local: | |
| 1. Operator deploys DaemonSet | |
| 2. Each pod mounts: | |
| ``` | |
| /var/lib/mydata | |
| ``` | |
| 3. Workloads use hostPath | |
| But this only works if: | |
| * Data can be replicated to each node | |
| * You're okay with hostPath semantics | |
| --- | |
| # Why Using CSI to Solve This Is a Bad Idea | |
| If you try to: | |
| > "Have Pod reference my CSI driver directly and let CSI decide PVC" | |
| You run into hard Kubernetes constraints: | |
| * Pod must reference PVC by name | |
| * Scheduler must validate topology | |
| * PVC must bind before scheduling | |
| * CSI only sees volumeID | |
| You cannot override that flow. | |
| CSI is a storage backend abstraction - not a volume routing layer. | |
| --- | |
| # The Real Core Problem | |
| You are trying to emulate: | |
| ``` | |
| ReadOnlyMany behavior | |
| ``` | |
| on top of: | |
| ``` | |
| ReadWriteOnce storage | |
| ``` | |
| That is fundamentally a data replication problem. | |
| So the real question is: | |
| How is your data synchronized between nodes? | |
| Because even if you create per-node PVCs: | |
| * They are separate volumes | |
| * They do not automatically share data | |
| You would need: | |
| * Replication | |
| * Snapshot clone | |
| * rsync | |
| * Or backend-level copy | |
| --- | |
| # Best Production-Grade Pattern | |
| If you need identical read-only data across nodes: | |
| ## Best Option: Snapshot + Clone per Node | |
| If storage supports: | |
| 1. Create master PVC | |
| 2. Create VolumeSnapshot | |
| 3. Clone PVC per node from snapshot | |
| 4. Mount clone locally | |
| Operator manages lifecycle. | |
| No CSI changes required. | |
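| If the backend's CSI driver and a snapshot controller are installed, the per-node clone might look roughly like this sketch; all names, the snapshot class, and the size are placeholders: | |
| ```yaml | |
| apiVersion: snapshot.storage.k8s.io/v1 | |
| kind: VolumeSnapshot | |
| metadata: | |
|   name: dataset-snap | |
| spec: | |
|   volumeSnapshotClassName: my-snapclass      # placeholder | |
|   source: | |
|     persistentVolumeClaimName: dataset-master | |
| --- | |
| apiVersion: v1 | |
| kind: PersistentVolumeClaim | |
| metadata: | |
|   name: dataset-node-a                       # one clone per node | |
| spec: | |
|   accessModes: ["ReadWriteOnce"] | |
|   storageClassName: my-rwo-class             # placeholder | |
|   resources: | |
|     requests: | |
|       storage: 10Gi | |
|   dataSource: | |
|     name: dataset-snap | |
|     kind: VolumeSnapshot | |
|     apiGroup: snapshot.storage.k8s.io | |
| ``` | |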
| --- | |
| ## Even Better: Use RWX Storage | |
| If possible, use: | |
| * NFS | |
| * CephFS | |
| * EFS | |
| * Longhorn RWX | |
| * Rook CephFS | |
| * Gluster | |
| Because this is exactly what RWX is designed for. | |
| --- | |
| # If I Were Designing This | |
| I would: | |
| 1. Create a CRD: `SharedDataset` | |
| 2. Operator: | |
| * Watches dataset | |
| * Creates per-node PVC clones | |
| * Injects correct PVC into Pods | |
| 3. Use owner references for cleanup | |
| CSI remains a dumb storage layer. | |
| Operator handles orchestration. | |
| --- | |
| # Final Answer | |
| > Can a Pod reference CSI and let CSI determine the PVC name? | |
| No. | |
| > Is CSI the right abstraction for this? | |
| No. | |
| > What should you use instead? | |
| One of: | |
| * Mutating admission webhook (cleanest abstraction) | |
| * StatefulSet with volumeClaimTemplates | |
| * Operator-managed per-node PVC creation | |
| * Snapshot+clone strategy | |
| * Or use proper RWX storage | |
| --- | |
| If you tell me: | |
| * What storage backend you're using | |
| * Whether data changes or is immutable | |
| * Whether replication is required | |
| * Whether pods are Deployments or StatefulSets | |
| I can design the cleanest architecture for your exact case. | |
| ------------------------ | |
| If I want to follow your "Clean Pattern #1 β Use a Mutating Admission Webhook" pattern above, | |
| how do I trigger the Mutating Admission Webhook on the pod creation? Does it run against all | |
| pods, or pods with a given label? Or some other mechanism? | |
| ------------------------ | |
| Great - this is exactly the right question to ask before implementing it. | |
| A **Mutating Admission Webhook** is triggered by the Kubernetes API server during object admission (CREATE, UPDATE, etc.). You control *when* it runs using a `MutatingWebhookConfiguration`. | |
| So the answer is: | |
| > It can run against **all Pods**, or **only specific Pods**, depending on how you configure the webhook rules and selectors. | |
| Let's break it down. | |
| --- | |
| # When Does a Mutating Webhook Run? | |
| It runs: | |
| * During `CREATE`, `UPDATE`, or other configured operations | |
| * Before the object is persisted | |
| * After defaulting | |
| * Before validation | |
| For your case: | |
| You want to mutate Pods during: | |
| ``` | |
| CREATE | |
| ``` | |
| --- | |
| # How Do You Scope Which Pods It Runs On? | |
| There are **three levels of filtering**: | |
| --- | |
| # 1. Rule Filtering (Resource + Operation) | |
| In your `MutatingWebhookConfiguration`: | |
| ```yaml | |
| rules: | |
|   - operations: ["CREATE"] | |
|     apiGroups: [""] | |
|     apiVersions: ["v1"] | |
|     resources: ["pods"] | |
| ``` | |
| This means: | |
| * Only Pods | |
| * Only on CREATE | |
| * Not Deployments, not StatefulSets | |
| * Not UPDATE | |
| So far, this matches *all Pods*. | |
| --- | |
| # 2. Namespace Selector (Filter by Namespace Labels) | |
| You can restrict to namespaces with a specific label: | |
| ```yaml | |
| namespaceSelector: | |
|   matchLabels: | |
|     my-operator/enabled: "true" | |
| ``` | |
| Now it only runs for Pods in namespaces labeled: | |
| ```yaml | |
| my-operator/enabled=true | |
| ``` | |
| Very clean multi-tenant pattern. | |
| --- | |
| # 3. Object Selector (Filter by Pod Labels) - Most Relevant for You | |
| This is probably what you want. | |
| ```yaml | |
| objectSelector: | |
|   matchLabels: | |
|     my-operator/inject-pvc: "true" | |
| ``` | |
| Now the webhook runs **only on Pods with that label**. | |
| Example Pod: | |
| ```yaml | |
| metadata: | |
|   labels: | |
|     my-operator/inject-pvc: "true" | |
| ``` | |
| All other Pods are untouched. | |
| --- | |
| # Recommended Pattern for Your Use Case | |
| Since you're building an operator and want to inject node-specific PVCs: | |
| Use: | |
| * `objectSelector` to match only workload Pods | |
| * A label like: | |
| ``` | |
| dataset.mycompany.com/name: mydataset | |
| ``` | |
| Then your webhook: | |
| 1. Checks label | |
| 2. Waits for scheduling (if needed) | |
| 3. Rewrites volume claimName | |
| --- | |
| # Important: Scheduling Problem | |
| Here's a key subtlety: | |
| When Pods are created: | |
| * `spec.nodeName` is NOT set yet | |
| * Scheduler sets it later | |
| If your PVC name depends on nodeName: | |
| You cannot compute it at CREATE time unless: | |
| * The Pod is pre-scheduled (e.g., DaemonSet) | |
| * Or you use a second mutation after scheduling | |
| --- | |
| # Two Common Solutions | |
| ## Option A - Use DaemonSet (Easiest) | |
| DaemonSet Pods: | |
| * Have nodeName set immediately | |
| * Scheduler pre-assigns node | |
| Webhook can safely use: | |
| ```go | |
| pod.Spec.NodeName | |
| ``` | |
| --- | |
| ## Option B - Use Wait-For-First-Consumer PVC Binding | |
| Better architectural pattern: | |
| 1. Create PVC per node with node affinity | |
| 2. Pod just references generic PVC | |
| 3. StorageClass uses: | |
| ``` | |
| volumeBindingMode: WaitForFirstConsumer | |
| ``` | |
| 4. Scheduler binds correct PV after scheduling | |
| This removes need to mutate based on node name. | |
| Often cleaner. | |
| --- | |
| # Minimal MutatingWebhookConfiguration Example | |
| ```yaml | |
| apiVersion: admissionregistration.k8s.io/v1 | |
| kind: MutatingWebhookConfiguration | |
| metadata: | |
|   name: dataset-injector | |
| webhooks: | |
|   - name: dataset.mycompany.com | |
|     admissionReviewVersions: ["v1"] | |
|     sideEffects: None | |
|     failurePolicy: Fail | |
|     rules: | |
|       - operations: ["CREATE"] | |
|         apiGroups: [""] | |
|         apiVersions: ["v1"] | |
|         resources: ["pods"] | |
|     objectSelector: | |
|       matchLabels: | |
|         dataset.mycompany.com/enabled: "true" | |
|     clientConfig: | |
|       service: | |
|         name: dataset-webhook-service | |
|         namespace: my-operator | |
|         path: /mutate-v1-pod | |
|       caBundle: <base64> | |
| ``` | |
| --- | |
| # How Kubebuilder Makes This Easy | |
| If you're using Kubebuilder: | |
| You can scaffold: | |
| ``` | |
| kubebuilder create webhook --group core --version v1 --kind Pod --type mutating | |
| ``` | |
| Then inside: | |
| ```go | |
| func (w *PodMutator) Handle(ctx context.Context, req admission.Request) admission.Response { | |
| pod := &corev1.Pod{} | |
| err := w.decoder.Decode(req, pod) | |
| ... | |
| } | |
| ``` | |
| Kubebuilder handles: | |
| * TLS | |
| * CA injection | |
| * Webhook configuration | |
| * Service wiring | |
| Much easier than manual. | |
| --- | |
| # Final Answer | |
| > Does it run against all Pods? | |
| Only if you configure it that way. | |
| You control scope using: | |
| 1. `rules` (resource + operation) | |
| 2. `namespaceSelector` | |
| 3. `objectSelector` (label-based filtering - best for you) | |
| --- | |
| # My Recommendation for Your Case | |
| Use: | |
| * `objectSelector` with label like: | |
| ``` | |
| dataset.mycompany.com/name=mydataset | |
| ``` | |
| * Restrict to CREATE | |
| * Use Kubebuilder webhook scaffolding | |
| --- | |
| If you'd like, I can now: | |
| * Show the exact Pod mutation logic to rewrite PVC name | |
| * Or design the full operator + webhook architecture for your per-node PVC system | |
| Just tell me which direction you're leaning. | |
| ------------------------ | |
| Using golang and kubebuilder, can you show me what the pod mutating webhook file would | |
| look like that filters on a pod with a given label and mutates the claimName in the | |
| persistentVolumeClaim in the volumes of the pod spec to modify the PVC to an updated name? | |
| ------------------------ | |
| Excellent - this is exactly the kind of thing Kubebuilder is good at. | |
| Below is a **realistic Kubebuilder mutating webhook example** that: | |
| 1. Runs on Pod CREATE | |
| 2. Checks for a specific label | |
| 3. Iterates over `spec.volumes` | |
| 4. Rewrites `persistentVolumeClaim.claimName` | |
| 5. Returns a JSON patch response | |
| This assumes: | |
| * Kubebuilder v3+ | |
| * controller-runtime webhook server | |
| * You already scaffolded a mutating webhook for Pods | |
| --- | |
| # 1. Scaffold (for reference) | |
| If not already done: | |
| ```bash | |
| kubebuilder create webhook \ | |
| --group core \ | |
| --version v1 \ | |
| --kind Pod \ | |
| --type mutating \ | |
| --programmatic-validation=false | |
| ``` | |
| You'll get something like: | |
| ``` | |
| internal/webhook/v1/pod_webhook.go | |
| ``` | |
| We'll replace that with real logic. | |
| --- | |
| # 2. Example: `pod_mutating_webhook.go` | |
| ```go | |
| package v1 | |
| import ( | |
| "context" | |
| "encoding/json" | |
| "fmt" | |
| "net/http" | |
| corev1 "k8s.io/api/core/v1" | |
| ctrl "sigs.k8s.io/controller-runtime" | |
| "sigs.k8s.io/controller-runtime/pkg/webhook" | |
| "sigs.k8s.io/controller-runtime/pkg/webhook/admission" | |
| ) | |
| const ( | |
| TargetLabelKey = "dataset.mycompany.com/enabled" | |
| TargetLabelValue = "true" | |
| VolumeNameToMutate = "mydata" // optional filter by volume name | |
| ) | |
| type PodMutator struct { | |
| decoder *admission.Decoder | |
| } | |
| func (m *PodMutator) SetupWebhookWithManager(mgr ctrl.Manager) error { | |
| // Register the raw admission.Handler on the manager's webhook server. | |
| // (The controller-runtime builder wires up defaulters/validators, so a | |
| // custom mutating handler is registered against its path directly.) | |
| mgr.GetWebhookServer().Register("/mutate-v1-pod", &webhook.Admission{Handler: m}) | |
| return nil | |
| } | |
| // +kubebuilder:webhook:path=/mutate-v1-pod,mutating=true,failurePolicy=Fail,sideEffects=None,groups="",resources=pods,verbs=create,versions=v1,name=mpod.kb.io,admissionReviewVersions=v1 | |
| func (m *PodMutator) Handle(ctx context.Context, req admission.Request) admission.Response { | |
| pod := &corev1.Pod{} | |
| if err := m.decoder.Decode(req, pod); err != nil { | |
| return admission.Errored(http.StatusBadRequest, err) | |
| } | |
| // Filter by label | |
| if pod.Labels[TargetLabelKey] != TargetLabelValue { | |
| return admission.Allowed("pod does not match target label") | |
| } | |
| // Track if we mutate | |
| mutated := false | |
| // Iterate volumes | |
| for i, vol := range pod.Spec.Volumes { | |
| if vol.PersistentVolumeClaim == nil { | |
| continue | |
| } | |
| // Optional: Only mutate specific volume | |
| if vol.Name != VolumeNameToMutate { | |
| continue | |
| } | |
| originalClaim := vol.PersistentVolumeClaim.ClaimName | |
| // Example mutation logic: | |
| // Append node name OR append suffix | |
| // NOTE: nodeName is usually empty at CREATE time | |
| newClaimName := fmt.Sprintf("%s-node-specific", originalClaim) | |
| pod.Spec.Volumes[i].PersistentVolumeClaim.ClaimName = newClaimName | |
| mutated = true | |
| } | |
| if !mutated { | |
| return admission.Allowed("no pvc volumes mutated") | |
| } | |
| // Return a patch computed from the raw original object and the mutated Pod | |
| marshaled, err := json.Marshal(pod) | |
| if err != nil { | |
| return admission.Errored(http.StatusInternalServerError, err) | |
| } | |
| return admission.PatchResponseFromRaw(req.Object.Raw, marshaled) | |
| } | |
| // InjectDecoder is called by older controller-runtime releases to supply the decoder; | |
| // newer releases drop injection, so there you construct it with admission.NewDecoder instead. | |
| func (m *PodMutator) InjectDecoder(d *admission.Decoder) error { | |
| m.decoder = d | |
| return nil | |
| } | |
| ``` | |
| --- | |
| # 3. What This Does | |
| If a Pod like this is created: | |
| ```yaml | |
| apiVersion: v1 | |
| kind: Pod | |
| metadata: | |
|   name: example | |
|   labels: | |
|     dataset.mycompany.com/enabled: "true" | |
| spec: | |
|   volumes: | |
|     - name: mydata | |
|       persistentVolumeClaim: | |
|         claimName: dataset | |
| ``` | |
| It will be mutated to: | |
| ```yaml | |
| spec: | |
|   volumes: | |
|     - name: mydata | |
|       persistentVolumeClaim: | |
|         claimName: dataset-node-specific | |
| ``` | |
| --- | |
| # 4. If You Want to Use NodeName | |
| Important: | |
| At CREATE time: | |
| ``` | |
| pod.Spec.NodeName == "" | |
| ``` | |
| Unless: | |
| * It's a DaemonSet | |
| * Or manually scheduled | |
| If you need nodeName-based mutation, you must: | |
| * Use a **post-scheduling mutation** (complex) | |
| * Or rely on PVC topology instead | |
| * Or run as DaemonSet | |
| --- | |
| # 5. How Filtering Really Works | |
| You should also configure `objectSelector` in the webhook configuration: | |
| ```yaml | |
| objectSelector: | |
|   matchLabels: | |
|     dataset.mycompany.com/enabled: "true" | |
| ``` | |
| That prevents unnecessary webhook calls. | |
| --- | |
| # 6. Production Improvements | |
| For production, you should: | |
| * Skip mutation if claim already rewritten | |
| * Handle multiple PVC volumes | |
| * Make mutation logic idempotent | |
| * Possibly read a CR to determine correct PVC name | |
| * Add logging | |
| Example idempotency check: | |
| ```go | |
| if strings.HasSuffix(originalClaim, "-node-specific") { | |
| continue | |
| } | |
| ``` | |
| --- | |
| # 7. Important Architectural Reminder | |
| Mutating PVC name at Pod CREATE works only if: | |
| * The target PVC already exists | |
| * The PVC binding rules allow it | |
| * Scheduling constraints are satisfied | |
| If you're trying to dynamically create per-node PVCs, | |
| you likely want your operator to: | |
| 1. Create PVCs | |
| 2. Ensure readiness | |
| 3. Then allow Pods | |
| --- | |
| # Summary | |
| This webhook: | |
| * Filters by Pod label | |
| * Iterates volumes | |
| * Rewrites `claimName` | |
| * Returns patch | |
| This is the correct Kubebuilder-style implementation. | |
| --- | |
| If you'd like next, I can show: | |
| * How to mutate based on a CR lookup | |
| * How to mutate only after scheduling | |
| * Or a full operator + webhook working pattern for your per-node PVC architecture | |
| Which direction do you want to go? | |
| ------------------------ | |
| I would like to see both "How to mutate based on a CR lookup" and "How to mutate only after scheduling". | |
| ------------------------ | |
| Perfect - these are the two *advanced* patterns, and they solve different architectural problems. | |
| I'll show you: | |
| 1. **Mutate based on a CR lookup** (clean, production-ready) | |
| 2. **Mutate only after scheduling** (trickier, but possible) | |
| Both examples assume: | |
| * Kubebuilder | |
| * controller-runtime | |
| * Mutating admission webhook for Pods | |
| --- | |
| # PART 1 - Mutate Based on a CR Lookup | |
| This is the **cleanest and most production-safe pattern**. | |
| ## Goal | |
| Pod has a label like: | |
| ```yaml | |
| labels: | |
|   dataset.mycompany.com/name: mydataset | |
| ``` | |
| Your webhook: | |
| 1. Reads the label | |
| 2. Fetches `SharedDataset` CR | |
| 3. Computes the correct PVC name | |
| 4. Mutates `claimName` | |
| --- | |
| ## Example CRD | |
| ```go | |
| type SharedDatasetSpec struct { | |
| BasePVCName string `json:"basePVCName"` | |
| } | |
| type SharedDataset struct { | |
| metav1.TypeMeta `json:",inline"` | |
| metav1.ObjectMeta `json:"metadata,omitempty"` | |
| Spec SharedDatasetSpec `json:"spec,omitempty"` | |
| } | |
| ``` | |
| --- | |
| ## Updated Webhook Struct | |
| We now inject a Kubernetes client: | |
| ```go | |
| type PodMutator struct { | |
| Client client.Client | |
| decoder *admission.Decoder | |
| } | |
| ``` | |
| --- | |
| ## Setup | |
| ```go | |
| func (m *PodMutator) SetupWebhookWithManager(mgr ctrl.Manager) error { | |
| m.Client = mgr.GetClient() | |
| // As before, register the handler on the manager's webhook server. | |
| mgr.GetWebhookServer().Register("/mutate-v1-pod", &webhook.Admission{Handler: m}) | |
| return nil | |
| } | |
| ``` | |
| --- | |
| ## Handle() with CR Lookup | |
| ```go | |
| func (m *PodMutator) Handle(ctx context.Context, req admission.Request) admission.Response { | |
| pod := &corev1.Pod{} | |
| if err := m.decoder.Decode(req, pod); err != nil { | |
| return admission.Errored(http.StatusBadRequest, err) | |
| } | |
| datasetName, ok := pod.Labels["dataset.mycompany.com/name"] | |
| if !ok { | |
| return admission.Allowed("no dataset label") | |
| } | |
| // Fetch CR | |
| dataset := &examplev1.SharedDataset{} | |
| err := m.Client.Get(ctx, client.ObjectKey{ | |
| Name: datasetName, | |
| Namespace: pod.Namespace, | |
| }, dataset) | |
| if err != nil { | |
| return admission.Errored(http.StatusBadRequest, err) | |
| } | |
| mutated := false | |
| for i, vol := range pod.Spec.Volumes { | |
| if vol.PersistentVolumeClaim == nil { | |
| continue | |
| } | |
| // Example logic: | |
| // PVC name = basePVCName + "-node-a" | |
| newPVC := dataset.Spec.BasePVCName + "-default" | |
| if pod.Spec.Volumes[i].PersistentVolumeClaim.ClaimName != newPVC { | |
| pod.Spec.Volumes[i].PersistentVolumeClaim.ClaimName = newPVC | |
| mutated = true | |
| } | |
| } | |
| if !mutated { | |
| return admission.Allowed("no mutation needed") | |
| } | |
| marshaled, err := json.Marshal(pod) | |
| if err != nil { | |
| return admission.Errored(http.StatusInternalServerError, err) | |
| } | |
| return admission.PatchResponseFromRaw(req.Object.Raw, marshaled) | |
| } | |
| ``` | |
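| For reference, an instance of the CR this handler looks up might look like the following; the API group and version are assumptions matching the `examplev1` import: | |
| ```yaml | |
| apiVersion: example.mycompany.com/v1   # assumed group/version | |
| kind: SharedDataset | |
| metadata: | |
|   name: mydataset | |
|   namespace: default | |
| spec: | |
|   basePVCName: dataset-prod | |
| ``` | |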
| --- | |
| ## Why This Pattern Is Good | |
| * Centralized dataset logic | |
| * PVC naming fully abstracted | |
| * Clean separation of concerns | |
| * No CSI hacks | |
| * Deterministic and safe | |
| --- | |
| # PART 2 - Mutate Only After Scheduling | |
| This is harder. | |
| Because: | |
| > At CREATE time, `pod.Spec.NodeName == ""` | |
| The scheduler sets it later. | |
| And: | |
| * Admission webhooks run before persistence | |
| * They do NOT run automatically after scheduling | |
| So how do we mutate after scheduling? | |
| --- | |
| # Option A (Recommended): Mutate on UPDATE When nodeName Is Set | |
| Modify your webhook rules: | |
| ```yaml | |
| rules: | |
|   - operations: ["CREATE", "UPDATE"] | |
|     apiGroups: [""] | |
|     apiVersions: ["v1"] | |
|     resources: ["pods"] | |
| ``` | |
| Then in code: | |
| ```go | |
| if pod.Spec.NodeName == "" { | |
| return admission.Allowed("pod not yet scheduled") | |
| } | |
| ``` | |
| Then compute: | |
| ```go | |
| newPVC := fmt.Sprintf("%s-%s", dataset.Spec.BasePVCName, pod.Spec.NodeName) | |
| ``` | |
| --- | |
| ## CRITICAL PROBLEM | |
| By the time NodeName is set: | |
| * Volumes are already resolved | |
| * PVC binding likely already occurred | |
| * Mutation may be rejected | |
| Kubernetes may reject spec changes to volumes after creation. | |
| So this often does **NOT work reliably**. | |
| --- | |
| # Option B (Better): Use WaitForFirstConsumer | |
| Instead of mutating after scheduling: | |
| 1. Create per-node PVCs | |
| 2. Add nodeAffinity to PV | |
| 3. Use StorageClass: | |
| ```yaml | |
| volumeBindingMode: WaitForFirstConsumer | |
| ``` | |
| Then: | |
| * Pod references generic PVC | |
| * Scheduler picks node | |
| * Binder picks correct PV for that node | |
| No webhook required. | |
| This is MUCH cleaner. | |
| --- | |
| # Option C (Advanced & Clean): Two-Phase Pod Pattern | |
| If you must mutate after scheduling: | |
| 1. Pod created with placeholder PVC | |
| 2. Scheduler assigns node | |
| 3. Your controller watches Pod updates | |
| 4. Controller deletes Pod | |
| 5. Recreates Pod with correct PVC | |
| But: | |
| * Disruptive | |
| * Complex | |
| * Not elegant | |
| --- | |
| # What I Recommend For Your Architecture | |
| Since you want: | |
| * Per-node PVCs | |
| * No RWX | |
| * Clean pod spec | |
| * No PVC awareness in workload | |
| Best pattern: | |
| ### Operator + CR Lookup + PVC Precreation | |
| Flow: | |
| 1. SharedDataset CR created | |
| 2. Operator creates: | |
| * `dataset-node-a` | |
| * `dataset-node-b` | |
| 3. Webhook mutates PVC name at CREATE | |
| 4. PVC already exists | |
| 5. Scheduler enforces node affinity | |
| No post-scheduling mutation required. | |
| --- | |
| # Summary Comparison | |
| | Pattern | Safe | Production Ready | Recommended | | |
| | ----------------------- | ----- | ---------------- | ----------- | | |
| | CR lookup at CREATE | Yes | Yes | Recommended | | |
| | Mutate after scheduling | Risky | No | Avoid | | |
| | WaitForFirstConsumer | Yes | Yes | Recommended | | |
| | Delete & recreate pod | No | No | No | | |
| --- | |
| # My Advice | |
| Use: | |
| * CR-driven dataset abstraction | |
| * Pre-created per-node PVCs | |
| * CREATE-time mutation only | |
| * Avoid nodeName-dependent logic in webhook | |
| --- | |
| If you'd like, next I can: | |
| * Design the full SharedDataset operator architecture | |
| * Show the PVC-per-node reconciliation logic | |
| * Show how to combine topology + webhook cleanly | |
| * Or draw the complete control-plane flow step-by-step | |
| Which direction would you like to go deeper into? | |
| ------------------------ | |
| In your "Part-1 - Mutate Based on a CR Lookup", the example code adds "-default" to the PVC name, | |
| but the comment says "-node-a". Because this is a CREATE and the Node is not known at create time, I | |
| don't understand how this helps to pick a node-specific PVC name. What am I missing? | |
| ------------------------ | |
| Excellent - you caught the inconsistency. | |
| You are not missing anything. | |
| You are correctly identifying the core constraint. | |
| Let's untangle this carefully. | |
| --- | |
| # The Core Truth | |
| At **Pod CREATE time**: | |
| ``` | |
| pod.Spec.NodeName == "" | |
| ``` | |
| So: | |
| > You CANNOT compute a node-specific PVC name during CREATE | |
| > unless the Pod is pre-scheduled (DaemonSet, static pod, etc.) | |
| So the earlier example that hinted at `-node-a` during CREATE was conceptually misleading in the context of Deployments. | |
| Good catch. | |
| --- | |
| # So How Does The CR Lookup Pattern Actually Help? | |
| The CR lookup pattern is useful when: | |
| * The PVC name is derived from a CR | |
| * But NOT from the node | |
| Example: | |
| ``` | |
| SharedDataset: | |
| spec: | |
| basePVCName: dataset-prod | |
| ``` | |
| Then: | |
| ``` | |
| dataset-prod-readonly | |
| dataset-prod-clone | |
| dataset-prod-version2 | |
| ``` | |
| That works. | |
| But it does NOT solve node-specific naming. | |
| --- | |
| # The Real Problem You're Trying to Solve | |
| You want: | |
| * RWO storage | |
| * Multiple nodes | |
| * One PVC per node | |
| * Pod does not know PVC name | |
| * PVC name depends on node | |
| This creates a circular dependency: | |
| | Step | Requires | | |
| | ------------ | ---------------------- | | |
| | Schedule Pod | PVC must already exist | | |
| | Choose PVC | Need to know node | | |
| | Know node | Must schedule Pod | | |
| Thatβs the deadlock. | |
| --- | |
| # The Correct Solution: Stop Encoding Node Name in PVC Name | |
| Instead of: | |
| ``` | |
| dataset-node-a | |
| dataset-node-b | |
| ``` | |
| You should create: | |
| ``` | |
| dataset | |
| ``` | |
| Multiple PVs behind it: | |
| * PV 1 -> node-a (nodeAffinity) | |
| * PV 2 -> node-b (nodeAffinity) | |
| * PV 3 -> node-c (nodeAffinity) | |
| And use: | |
| ``` | |
| volumeBindingMode: WaitForFirstConsumer | |
| ``` | |
| --- | |
| # How This Actually Works | |
| 1. Pod references generic PVC: | |
| ```yaml | |
| claimName: dataset | |
| ``` | |
| 2. StorageClass: | |
| ```yaml | |
| volumeBindingMode: WaitForFirstConsumer | |
| ``` | |
| 3. Scheduler: | |
| * Picks a node for Pod | |
| 4. PVC binder: | |
| * Selects PV whose nodeAffinity matches that node | |
| 5. PVC binds to correct PV | |
| 6. CSI mounts local RWO storage | |
| No webhook needed. | |
| No nodeName logic needed. | |
| No circular dependency. | |
| --- | |
| # Example Architecture | |
| ## PVC (single) | |
| ```yaml | |
| apiVersion: v1 | |
| kind: PersistentVolumeClaim | |
| metadata: | |
|   name: dataset | |
| spec: | |
|   accessModes: | |
|     - ReadWriteOnce | |
|   storageClassName: local-storage | |
|   resources: | |
|     requests: | |
|       storage: 10Gi   # placeholder size | |
| ``` | |
| --- | |
| ## PV for node-a | |
| ```yaml | |
| apiVersion: v1 | |
| kind: PersistentVolume | |
| metadata: | |
|   name: dataset-pv-node-a   # illustrative per-node name | |
| spec: | |
|   storageClassName: local-storage | |
|   capacity: | |
|     storage: 10Gi | |
|   accessModes: | |
|     - ReadWriteOnce | |
|   local: | |
|     path: /mnt/dataset | |
|   nodeAffinity: | |
|     required: | |
|       nodeSelectorTerms: | |
|         - matchExpressions: | |
|             - key: kubernetes.io/hostname | |
|               operator: In | |
|               values: | |
|                 - node-a | |
| ``` | |
| Repeat for node-b, node-c. | |
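| The `local-storage` class referenced above would be a no-provisioner StorageClass with delayed binding; a minimal sketch: | |
| ```yaml | |
| apiVersion: storage.k8s.io/v1 | |
| kind: StorageClass | |
| metadata: | |
|   name: local-storage | |
| provisioner: kubernetes.io/no-provisioner   # PVs are pre-created, not dynamically provisioned | |
| volumeBindingMode: WaitForFirstConsumer | |
| ``` | |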
| --- | |
| # Why This Works | |
| Because Kubernetes does: | |
| ``` | |
| Schedule first -> bind later | |
| ``` | |
| When using WaitForFirstConsumer. | |
| This breaks the circular dependency. | |
| --- | |
| # Why Mutating After Scheduling Is Wrong | |
| You might think: | |
| "Okay, I'll mutate on UPDATE when nodeName is set." | |
| But by then: | |
| * Volume scheduling already considered PVC | |
| * Changing PVC name is forbidden | |
| * You risk invalidating scheduling decisions | |
| Kubernetes intentionally prevents this pattern. | |
| --- | |
| # When Is CREATE-Time CR Lookup Useful? | |
| It is useful when: | |
| * Selecting dataset version | |
| * Selecting environment-specific PVC | |
| * Injecting precomputed PVC name | |
| * Multi-tenant routing | |
| But not for node-based routing. | |
| --- | |
| # Final Clarification | |
| You are not missing anything. | |
| The reality is: | |
| > You cannot safely select a node-specific PVC name during Pod CREATE unless the Pod is pre-scheduled. | |
| The correct pattern is: | |
| * One PVC | |
| * Multiple node-affine PVs | |
| * WaitForFirstConsumer | |
| * No nodeName-based mutation | |
| --- | |
| # If You'd Like | |
| I can now: | |
| * Design the full operator logic that auto-creates per-node PVs | |
| * Show how to replicate data per node cleanly | |
| * Or explain how CSI topology fits into this pattern | |
| Which direction do you want to explore next? | |
| ------------------------ | |
| So this only works for a single pod, right? If the workload is a Daemonset or Replicaset, multiple | |
| pods on different nodes would map to the same PVC, where the PVC would bind to different PVs on each | |
| node, which would fail. Is that correct? | |
| ------------------------ | |
| Excellent - this is exactly the right mental model to validate. | |
| And yes - your concern is correct. | |
| Let's walk through it precisely. | |
| --- | |
| # Key Fact: A PVC Can Bind To Only ONE PV | |
| A `PersistentVolumeClaim`: | |
| * Binds **once** | |
| * To exactly **one PersistentVolume** | |
| * For its lifetime | |
| Even with: | |
| ```yaml | |
| volumeBindingMode: WaitForFirstConsumer | |
| ``` | |
| The flow is: | |
| 1. First Pod gets scheduled | |
| 2. PVC binds to a matching PV (e.g., node-A) | |
| 3. PVC is now permanently bound to that PV | |
| 4. All future Pods using that PVC must use that same PV | |
| --- | |
| # What Happens With a ReplicaSet or DaemonSet? | |
| Let's say: | |
| * You have one PVC: `dataset` | |
| * You have 3 PVs: | |
| * PV-A (node-A) | |
| * PV-B (node-B) | |
| * PV-C (node-C) | |
| ### Pod #1 scheduled to node-A | |
| PVC binds to PV-A. | |
| All good. | |
| --- | |
| ### Pod #2 scheduled to node-B | |
| Now: | |
| * PVC is already bound to PV-A | |
| * PV-A has nodeAffinity=node-A | |
| * Pod is on node-B | |
| Result: | |
| * Pod cannot mount | |
| * Scheduler fails | |
| * You get a volume node affinity conflict | |
| --- | |
| # So Your Conclusion Is Correct | |
| > This pattern works only for one Pod using the PVC at a time. | |
| Or: | |
| * Multiple Pods on the **same node** | |
| * Not multiple nodes | |
| --- | |
| # So What Does This Mean For Your Original Goal? | |
| You want: | |
| * Same logical dataset | |
| * Multiple nodes | |
| * RWO storage | |
| * Multiple Pods | |
| * No PVC awareness in workload | |
| That means: | |
| > You must have multiple PVCs. | |
| There is no Kubernetes mechanism where: | |
| ``` | |
| One PVC dynamically binds to different PVs per Pod | |
| ``` | |
| PVC binding is cluster-level, not per-Pod. | |
| --- | |
| # So What Actually Works? | |
| There are only four real patterns: | |
| --- | |
| # Pattern 1 - One PVC Per Pod (StatefulSet) | |
| This is what Kubernetes is designed for. | |
| StatefulSet automatically creates: | |
| ``` | |
| dataset-0 | |
| dataset-1 | |
| dataset-2 | |
| ``` | |
| Each Pod gets its own PVC. | |
| Works perfectly with RWO. | |
| --- | |
| # Pattern 2 - Operator Creates One PVC Per Node | |
| For a DaemonSet: | |
| Operator creates: | |
| ``` | |
| dataset-node-a | |
| dataset-node-b | |
| dataset-node-c | |
| ``` | |
| Then: | |
| * Pod on node-A references dataset-node-a | |
| * Pod on node-B references dataset-node-b | |
| This requires: | |
| * Webhook mutation | |
| OR | |
| * Generating the DaemonSet yourself | |
| This is the correct abstraction if you insist on DaemonSet. | |
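| As a rough sketch of the operator side, a reconcile helper that ensures one PVC per node might look like the following. The reconciler type, package name, naming scheme (`<base>-<nodeName>`), and size are assumptions for illustration, not a fixed API: | |
| ```go | |
| package controller // placeholder package name | |
|  | |
| import ( | |
|     "context" | |
|     "fmt" | |
|  | |
|     corev1 "k8s.io/api/core/v1" | |
|     apierrors "k8s.io/apimachinery/pkg/api/errors" | |
|     "k8s.io/apimachinery/pkg/api/resource" | |
|     metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" | |
|     "sigs.k8s.io/controller-runtime/pkg/client" | |
| ) | |
|  | |
| // DatasetReconciler is a placeholder reconciler type holding a controller-runtime client. | |
| type DatasetReconciler struct { | |
|     Client client.Client | |
| } | |
|  | |
| // ensurePerNodePVCs creates one RWO PVC per node, named <base>-<nodeName>. | |
| // Naming scheme, namespace handling, and size are illustrative assumptions. | |
| func (r *DatasetReconciler) ensurePerNodePVCs(ctx context.Context, namespace, base string) error { | |
|     nodes := &corev1.NodeList{} | |
|     if err := r.Client.List(ctx, nodes); err != nil { | |
|         return err | |
|     } | |
|     for _, node := range nodes.Items { | |
|         pvc := &corev1.PersistentVolumeClaim{ | |
|             ObjectMeta: metav1.ObjectMeta{ | |
|                 Name:      fmt.Sprintf("%s-%s", base, node.Name), // e.g. dataset-node-a | |
|                 Namespace: namespace, | |
|             }, | |
|             Spec: corev1.PersistentVolumeClaimSpec{ | |
|                 AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce}, | |
|                 // With k8s.io/api v0.29+ this field is VolumeResourceRequirements; | |
|                 // older versions use ResourceRequirements. | |
|                 Resources: corev1.VolumeResourceRequirements{ | |
|                     Requests: corev1.ResourceList{ | |
|                         corev1.ResourceStorage: resource.MustParse("10Gi"), // placeholder size | |
|                     }, | |
|                 }, | |
|             }, | |
|         } | |
|         if err := r.Client.Create(ctx, pvc); err != nil && !apierrors.IsAlreadyExists(err) { | |
|             return err | |
|         } | |
|     } | |
|     return nil | |
| } | |
| ``` | |
| In a real operator you would also set an owner reference on each PVC (as suggested earlier) so the per-node claims are garbage-collected along with the dataset CR. | |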
| --- | |
| # Pattern 3 - True RWX Storage (The Cleanest) | |
| Use: | |
| * NFS | |
| * CephFS | |
| * EFS | |
| * Longhorn RWX | |
| * etc. | |
| Then: | |
| * Single PVC | |
| * Many nodes | |
| * No hacks | |
| This is what RWX is for. | |
| --- | |
| # Pattern 4 - Data Replication Layer | |
| If storage is RWO but data is read-only: | |
| Operator: | |
| 1. Creates per-node PVC | |
| 2. Clones from snapshot | |
| 3. Keeps them in sync if needed | |
| Pods use per-node PVC. | |
| This is effectively building your own distributed dataset layer. | |
| --- | |
| # Why Kubernetes Works This Way | |
| PVC binding is designed to guarantee: | |
| * Storage topology correctness | |
| * Scheduler correctness | |
| * Predictable data locality | |
| If PVCs dynamically rebound per Pod: | |
| * Data consistency guarantees would break | |
| * RWO semantics would be meaningless | |
| So Kubernetes intentionally prevents this. | |
| --- | |
| # Final Answer | |
| Yes. | |
| Your understanding is correct: | |
| > A single PVC cannot bind to different PVs for different Pods across nodes. | |
| So the WaitForFirstConsumer trick only works for: | |
| * Single Pod workloads | |
| * Or multi-pod same-node workloads | |
| Not ReplicaSets across nodes. | |
| Not DaemonSets across nodes. | |
| --- | |
| # What You Should Do | |
| Given everything you've described so far: | |
| The correct pattern for you is: | |
| > Operator-managed per-node PVCs | |
| > | |
| > * Pod mutation (or generated workload spec) | |
| Not CSI logic. | |
| Not PVC rebinding tricks. | |
| --- | |
| If you'd like, I can now: | |
| * Design the full per-node PVC operator pattern | |
| * Show how to safely mutate DaemonSet pods | |
| * Or propose a different architecture depending on whether the data changes or is immutable | |
| What does your dataset look like β immutable, periodically updated, or actively written? | |
| ------------------------ | |
| ------------------------ | |