@Billy99
Last active February 13, 2026 15:53
------------------------
I am coding a CSI driver in Golang that will mount a PVC into a pod volume mount.
Can you write a set of Golang functions that will mount a PVC in a pod?
------------------------
This is an important clarification:
πŸ‘‰ **A CSI driver does NOT mount a PVC into a Pod.**
Kubernetes does that.
Your CSI **Node plugin** only mounts a volume to a path that kubelet gives you (`target_path`).
Kubelet then bind-mounts that into the Pod.
So in `NodePublishVolume`, you:
1. Prepare the backing storage (if needed)
2. Mount it to `target_path`
3. Return success
You never interact with Pods or PVC objects directly in the mount path.
---
# πŸ”Ž What Actually Happens
Flow:
1. User creates PVC
2. PVC bound to PV
3. Pod references PVC
4. Pod scheduled to Node
5. Kubelet calls:
* `NodeStageVolume` (optional)
* `NodePublishVolume`
6. CSI driver mounts to `target_path`
7. Kubelet bind-mounts into container
So your job is simply:
```
Mount(source) β†’ targetPath
```
---
# βœ… Example: Minimal CSI NodePublishVolume Implementation in Go
This example mounts a filesystem volume using `mount(8)` via Kubernetes mount utils.
It assumes:
* You are mounting a block device like `/dev/sdb`
* Or mounting a directory
* Linux node
---
## Dependencies
```go
import (
    "context"
    "fmt"
    "os"

    "github.com/container-storage-interface/spec/lib/go/csi"
    mount "k8s.io/mount-utils" // module k8s.io/mount-utils, package name "mount"
)
```
---
## Driver Struct
```go
type NodeServer struct {
    mounter mount.Interface
}
```
Initialize it:
```go
func NewNodeServer() *NodeServer {
    return &NodeServer{
        mounter: mount.New(""),
    }
}
```
---
# πŸ”Ή NodePublishVolume
```go
func (ns *NodeServer) NodePublishVolume(
    ctx context.Context,
    req *csi.NodePublishVolumeRequest,
) (*csi.NodePublishVolumeResponse, error) {
    targetPath := req.GetTargetPath()
    volumeID := req.GetVolumeId()
    readOnly := req.GetReadonly()

    if targetPath == "" {
        return nil, fmt.Errorf("target path not provided")
    }

    // Example: assume the source device path is derived from the volume ID
    source := fmt.Sprintf("/dev/%s", volumeID)

    // Ensure the target directory exists
    if err := os.MkdirAll(targetPath, 0750); err != nil {
        return nil, fmt.Errorf("failed to create target path: %w", err)
    }

    // Check if already mounted
    notMnt, err := mount.IsNotMountPoint(ns.mounter, targetPath)
    if err != nil {
        return nil, fmt.Errorf("failed to check mount point: %w", err)
    }
    if !notMnt {
        // Already mounted
        return &csi.NodePublishVolumeResponse{}, nil
    }

    options := []string{}
    if readOnly {
        options = append(options, "ro")
    }

    // Perform the mount
    if err := ns.mounter.Mount(source, targetPath, "ext4", options); err != nil {
        return nil, fmt.Errorf("mount failed: %w", err)
    }

    return &csi.NodePublishVolumeResponse{}, nil
}
```
---
# πŸ”Ή NodeUnpublishVolume
You must implement unmount:
```go
func (ns *NodeServer) NodeUnpublishVolume(
    ctx context.Context,
    req *csi.NodeUnpublishVolumeRequest,
) (*csi.NodeUnpublishVolumeResponse, error) {
    targetPath := req.GetTargetPath()

    notMnt, err := mount.IsNotMountPoint(ns.mounter, targetPath)
    if err != nil {
        return nil, err
    }
    if notMnt {
        return &csi.NodeUnpublishVolumeResponse{}, nil
    }

    if err := ns.mounter.Unmount(targetPath); err != nil {
        return nil, err
    }

    return &csi.NodeUnpublishVolumeResponse{}, nil
}
```
---
# πŸ”Ή If Using NodeStageVolume (Recommended)
If implementing staging:
* `NodeStageVolume` β†’ mount device to global path
* `NodePublishVolume` β†’ bind mount from staging to pod path
Example bind mount:
```go
err := ns.mounter.Mount(
    stagingPath,
    targetPath,
    "",
    []string{"bind"},
)
```
---
# 🚨 Important: You DO NOT:
* Create Pods
* Modify Pods
* Mount into container namespace directly
* Touch PVC objects during publish
Kubelet handles all of that.
---
# 🧠 If Your Storage Is NOT a Block Device
If your CSI driver:
* Mounts NFS β†’ use `type = "nfs"`
* Mounts tmpfs β†’ use `type = "tmpfs"`
* Mounts FUSE β†’ call external binary
* Mounts custom backend β†’ prepare directory and bind mount
Tell me your backend type and I can tailor the example.
---
# 🎯 Summary
To β€œmount a PVC into a pod” in CSI:
You only:
```
Mount(source storage) β†’ targetPath
```
Kubelet handles the rest.
---
If you tell me:
* Block or filesystem?
* Local or network?
* Using staging?
* Linux only?
I can give you a production-grade skeleton driver structure.
------------------------
I am writing a Kubernetes operator that needs to mount the same data into multiple
workload pods running on different nodes using PVCs. If my cluster does not support
a StorageClass with an access mode of ReadOnlyMany, I was planning on creating a PVC per
node. I don't want the workload pods to have to know the name of the PVC on each node.
Can I have the workload pod volume reference my CSI driver and have the CSI driver
determine the PVC name and manage the mounting of the data into the workload pod?
If a CSI driver is not the best option for this use case, what other pattern should be
used to simplify the pod deployment?
------------------------
This is a very good architectural question β€” and you're thinking in the right direction.
Short answer:
> ❌ No β€” a Pod cannot dynamically β€œask” a CSI driver to choose a PVC at mount time.
> βœ… The PVC must already be bound to a PV before the Pod starts.
> πŸ‘‰ So CSI is not the right abstraction layer for solving this problem.
Let’s break down why β€” and then I’ll show you the correct pattern.
---
# 🚫 Why CSI Cannot Choose the PVC at Mount Time
Volume resolution happens in this order:
1. Pod spec references **a PVC name**
2. PVC is bound to a PV
3. Scheduler ensures topology constraints
4. Kubelet calls CSI with:
* `volume_id`
* `target_path`
* `volume_context`
By the time `NodePublishVolume` is called:
* The PVC is already resolved
* The PV is already chosen
* The driver gets a volume ID β€” not a PVC name
* You cannot switch to another PVC
So the CSI driver has no opportunity to β€œchoose the PVC for that node.”
---
# 🎯 Your Actual Requirement
You want:
* Same logical data
* Multiple nodes
* Underlying storage only supports RWO
* Pods should not need to know node-specific PVC names
* Operator should abstract this away
That’s a scheduling and orchestration problem β€” not a CSI problem.
---
# 🧠 Correct Pattern: Operator Owns the PVC Fan-Out
This is a very common design.
## Architecture Pattern
Your operator:
1. Detects which nodes run workload Pods
2. Creates one PVC per node:
* `mydata-node-a`
* `mydata-node-b`
3. Creates a Pod (or mutates it) to reference the correct PVC for that node
The workload Pod spec should be generated or mutated by your operator.
Pods themselves do not need to β€œknow” the naming logic.
---
# βœ… Clean Pattern #1 β€” Use a Mutating Admission Webhook
This is the cleanest Kubernetes-native solution.
Flow:
1. Workload Pod contains a logical volume:
```yaml
volumes:
  - name: mydata
    persistentVolumeClaim:
      claimName: mydata-placeholder
```
2. Your mutating webhook:
* Detects node assignment (or waits until scheduled)
* Rewrites `claimName` to:
```
mydata-${spec.nodeName}
```
This keeps workload manifests simple.
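For illustration, if the Pod ends up on a node named `node-a` (a hypothetical name), the mutated volume would read:
```yaml
volumes:
  - name: mydata
    persistentVolumeClaim:
      claimName: mydata-node-a   # rewritten by the webhook from mydata-placeholder
```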
---
# βœ… Clean Pattern #2 β€” Operator Manages StatefulSet Instead
If possible, convert workload into a StatefulSet:
StatefulSets already create per-pod PVCs automatically:
```yaml
volumeClaimTemplates:
  - metadata:
      name: mydata
```
Then each Pod gets its own claim, named `<template>-<statefulset-name>-<ordinal>`:
```
mydata-<statefulset-name>-0
mydata-<statefulset-name>-1
mydata-<statefulset-name>-2
```
No CSI changes needed.
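For reference, a minimal sketch of the full StatefulSet shape (the `workload` name, image, mount path, and storage size are placeholders):
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: workload
spec:
  serviceName: workload
  replicas: 3
  selector:
    matchLabels:
      app: workload
  template:
    metadata:
      labels:
        app: workload
    spec:
      containers:
        - name: app
          image: example.com/app:latest   # placeholder image
          volumeMounts:
            - name: mydata
              mountPath: /data            # placeholder mount path
  volumeClaimTemplates:
    - metadata:
        name: mydata
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi                 # placeholder size
```
With this spec the claims come out as `mydata-workload-0`, `mydata-workload-1`, and so on.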
---
# βœ… Clean Pattern #3 β€” Use a DaemonSet + hostPath
If the data is identical and node-local:
1. Operator deploys DaemonSet
2. Each pod mounts:
```
/var/lib/mydata
```
3. Workloads use hostPath
But this only works if:
* Data can be replicated to each node
* You're okay with hostPath semantics
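A minimal sketch of a workload consuming that node-local data via hostPath (the path and image are placeholders; the operator's DaemonSet is assumed to have populated `/var/lib/mydata` on every node):
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: workload
spec:
  containers:
    - name: app
      image: example.com/app:latest     # placeholder image
      volumeMounts:
        - name: mydata
          mountPath: /data
          readOnly: true
  volumes:
    - name: mydata
      hostPath:
        path: /var/lib/mydata           # populated by the operator's DaemonSet
        type: Directory
```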
---
# 🚫 Why Using CSI to Solve This Is a Bad Idea
If you try to:
> β€œHave Pod reference my CSI driver directly and let CSI decide PVC”
You run into hard Kubernetes constraints:
* Pod must reference PVC by name
* Scheduler must validate topology
* PVC must bind before scheduling
* CSI only sees volumeID
You cannot override that flow.
CSI is a storage backend abstraction β€” not a volume routing layer.
---
# 🧩 The Real Core Problem
You are trying to emulate:
```
ReadOnlyMany behavior
```
on top of:
```
ReadWriteOnce storage
```
That is fundamentally a data replication problem.
So the real question is:
πŸ‘‰ How is your data synchronized between nodes?
Because even if you create per-node PVCs:
* They are separate volumes
* They do not automatically share data
You would need:
* Replication
* Snapshot clone
* rsync
* Or backend-level copy
---
# πŸ— Best Production-Grade Pattern
If you need identical read-only data across nodes:
## πŸ₯‡ Best Option: Snapshot + Clone per Node
If your storage backend supports snapshots:
1. Create master PVC
2. Create VolumeSnapshot
3. Clone PVC per node from snapshot
4. Mount clone locally
Operator manages lifecycle.
No CSI changes required.
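A hedged sketch of the objects involved, assuming the CSI driver supports VolumeSnapshots (the class names, sizes, and the `dataset-master`/`dataset-node-a` names are illustrative):
```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: dataset-snap
spec:
  volumeSnapshotClassName: csi-snapclass   # placeholder snapshot class
  source:
    persistentVolumeClaimName: dataset-master
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dataset-node-a                     # one clone per node, created by the operator
spec:
  storageClassName: csi-sc                 # placeholder storage class
  dataSource:
    name: dataset-snap
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi                        # must be >= the snapshot source size
```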
---
## πŸ₯ˆ Even Better: Use RWX Storage
If possible, use:
* NFS
* CephFS
* EFS
* Longhorn RWX
* Rook CephFS
* Gluster
Because this is exactly what RWX is designed for.
---
# 🧠 If I Were Designing This
I would:
1. Create a CRD: `SharedDataset`
2. Operator:
* Watches dataset
* Creates per-node PVC clones
* Injects correct PVC into Pods
3. Use owner references for cleanup
CSI remains a dumb storage layer.
Operator handles orchestration.
---
# 🧾 Final Answer
> Can a Pod reference CSI and let CSI determine the PVC name?
No.
> Is CSI the right abstraction for this?
No.
> What should you use instead?
One of:
* Mutating admission webhook (cleanest abstraction)
* StatefulSet with volumeClaimTemplates
* Operator-managed per-node PVC creation
* Snapshot+clone strategy
* Or use proper RWX storage
---
If you tell me:
* What storage backend you're using
* Whether data changes or is immutable
* Whether replication is required
* Whether pods are Deployments or StatefulSets
I can design the cleanest architecture for your exact case.
------------------------
If I want to follow your "Clean Pattern #1 β€” Use a Mutating Admission Webhook" pattern above,
how do I trigger the Mutating Admission Webhook on the pod creation? Does it run against all
pods, or pods with a given label? Or some other mechanism?
------------------------
Great β€” this is exactly the right question to ask before implementing it.
A **Mutating Admission Webhook** is triggered by the Kubernetes API server during object admission (CREATE, UPDATE, etc.). You control *when* it runs using a `MutatingWebhookConfiguration`.
So the answer is:
> It can run against **all Pods**, or **only specific Pods**, depending on how you configure the webhook rules and selectors.
Let’s break it down.
---
# 🧠 When Does a Mutating Webhook Run?
It runs:
* During `CREATE`, `UPDATE`, or other configured operations
* Before the object is persisted
* After defaulting
* Before validation
For your case:
You want to mutate Pods during:
```
CREATE
```
---
# 🎯 How Do You Scope Which Pods It Runs On?
There are **three levels of filtering**:
---
# 1️⃣ Rule Filtering (Resource + Operation)
In your `MutatingWebhookConfiguration`:
```yaml
rules:
  - operations: ["CREATE"]
    apiGroups: [""]
    apiVersions: ["v1"]
    resources: ["pods"]
```
This means:
* Only Pods
* Only on CREATE
* Not Deployments, not StatefulSets
* Not UPDATE
So far, this matches *all Pods*.
---
# 2️⃣ Namespace Selector (Filter by Namespace Labels)
You can restrict to namespaces with a specific label:
```yaml
namespaceSelector:
  matchLabels:
    my-operator/enabled: "true"
```
Now it only runs for Pods in namespaces labeled:
```yaml
my-operator/enabled=true
```
Very clean multi-tenant pattern.
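Opting a namespace in is then just a label on the Namespace object (the namespace name is chosen for illustration):
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: workloads                # example namespace
  labels:
    my-operator/enabled: "true"
```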
---
# 3️⃣ Object Selector (Filter by Pod Labels) βœ… Most Relevant for You
This is probably what you want.
```yaml
objectSelector:
  matchLabels:
    my-operator/inject-pvc: "true"
```
Now the webhook runs **only on Pods with that label**.
Example Pod:
```yaml
metadata:
  labels:
    my-operator/inject-pvc: "true"
```
All other Pods are untouched.
---
# πŸ† Recommended Pattern for Your Use Case
Since you’re building an operator and want to inject node-specific PVCs:
Use:
* `objectSelector` to match only workload Pods
* A label like:
```
dataset.mycompany.com/name: mydataset
```
Then your webhook:
1. Checks label
2. Waits for scheduling (if needed)
3. Rewrites volume claimName
---
# ⚠️ Important: Scheduling Problem
Here’s a key subtlety:
When Pods are created:
* `spec.nodeName` is NOT set yet
* Scheduler sets it later
If your PVC name depends on nodeName:
You cannot compute it at CREATE time unless:
* The Pod is pre-scheduled (e.g., DaemonSet)
* Or you use a second mutation after scheduling
---
# 🧩 Two Common Solutions
## βœ… Option A β€” Use DaemonSet (Easiest)
DaemonSet Pods:
* Have nodeName set immediately
* Scheduler pre-assigns node
Webhook can safely use:
```go
pod.Spec.NodeName
```
---
## βœ… Option B β€” Use Wait-For-First-Consumer PVC Binding
Better architectural pattern:
1. Create PVC per node with node affinity
2. Pod just references generic PVC
3. StorageClass uses:
```
volumeBindingMode: WaitForFirstConsumer
```
4. Scheduler binds correct PV after scheduling
This removes need to mutate based on node name.
Often cleaner.
---
# πŸ— Minimal MutatingWebhookConfiguration Example
```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: dataset-injector
webhooks:
  - name: dataset.mycompany.com
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Fail
    rules:
      - operations: ["CREATE"]
        apiGroups: [""]
        apiVersions: ["v1"]
        resources: ["pods"]
    objectSelector:
      matchLabels:
        dataset.mycompany.com/enabled: "true"
    clientConfig:
      service:
        name: dataset-webhook-service
        namespace: my-operator
        path: /mutate-v1-pod
      caBundle: <base64>
```
---
# 🧠 How Kubebuilder Makes This Easy
If you're using Kubebuilder:
You can scaffold:
```
kubebuilder create webhook --group core --version v1 --kind Pod --defaulting
```
Then inside:
```go
func (w *PodMutator) Handle(ctx context.Context, req admission.Request) admission.Response {
    pod := &corev1.Pod{}
    err := w.decoder.Decode(req, pod)
    ...
}
```
Kubebuilder handles:
* TLS
* CA injection
* Webhook configuration
* Service wiring
Much easier than manual.
---
# 🎯 Final Answer
> Does it run against all Pods?
Only if you configure it that way.
You control scope using:
1. `rules` (resource + operation)
2. `namespaceSelector`
3. `objectSelector` (label-based filtering β€” best for you)
---
# πŸ’‘ My Recommendation for Your Case
Use:
* `objectSelector` with label like:
```
dataset.mycompany.com/name=mydataset
```
* Restrict to CREATE
* Use Kubebuilder webhook scaffolding
---
If you'd like, I can now:
* Show the exact Pod mutation logic to rewrite PVC name
* Or design the full operator + webhook architecture for your per-node PVC system
Just tell me which direction you're leaning.
------------------------
Using Golang and Kubebuilder, can you show me what the pod mutating webhook file would
look like that filters on a pod with a given label and mutates the claimName in the
persistentVolumeClaim in the volumes of the pod spec to modify the PVC to an updated name?
------------------------
Excellent β€” this is exactly the kind of thing Kubebuilder is good at.
Below is a **realistic Kubebuilder mutating webhook example** that:
1. Runs on Pod CREATE
2. Checks for a specific label
3. Iterates over `spec.volumes`
4. Rewrites `persistentVolumeClaim.claimName`
5. Returns a JSON patch response
This assumes:
* Kubebuilder v3+
* controller-runtime webhook server
* You already scaffolded a mutating webhook for Pods
---
# 1️⃣ Scaffold (for reference)
If not already done:
```bash
kubebuilder create webhook \
  --group core \
  --version v1 \
  --kind Pod \
  --defaulting
```
You’ll get something like:
```
internal/webhook/v1/pod_webhook.go
```
We’ll replace that with real logic.
---
# 2️⃣ Example: `pod_mutating_webhook.go`
```go
package v1

import (
    "context"
    "encoding/json"
    "fmt"
    "net/http"

    corev1 "k8s.io/api/core/v1"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/webhook"
    "sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

const (
    TargetLabelKey     = "dataset.mycompany.com/enabled"
    TargetLabelValue   = "true"
    VolumeNameToMutate = "mydata" // optional filter by volume name
)

type PodMutator struct {
    decoder admission.Decoder
}

func (m *PodMutator) SetupWebhookWithManager(mgr ctrl.Manager) error {
    // NOTE: depending on your controller-runtime version, admission.NewDecoder
    // returns admission.Decoder or *admission.Decoder; adjust the field type to match.
    // (Older releases injected the decoder via an InjectDecoder method instead.)
    m.decoder = admission.NewDecoder(mgr.GetScheme())

    // Register the raw admission handler on the manager's webhook server.
    mgr.GetWebhookServer().Register("/mutate-v1-pod", &webhook.Admission{Handler: m})
    return nil
}

// +kubebuilder:webhook:path=/mutate-v1-pod,mutating=true,failurePolicy=Fail,sideEffects=None,groups="",resources=pods,verbs=create,versions=v1,name=mpod.kb.io,admissionReviewVersions=v1

func (m *PodMutator) Handle(ctx context.Context, req admission.Request) admission.Response {
    pod := &corev1.Pod{}
    if err := m.decoder.Decode(req, pod); err != nil {
        return admission.Errored(http.StatusBadRequest, err)
    }

    // 🔎 Filter by label
    if pod.Labels[TargetLabelKey] != TargetLabelValue {
        return admission.Allowed("pod does not match target label")
    }

    // Track whether we mutate anything
    mutated := false

    // Iterate volumes
    for i, vol := range pod.Spec.Volumes {
        if vol.PersistentVolumeClaim == nil {
            continue
        }

        // Optional: only mutate a specific volume
        if vol.Name != VolumeNameToMutate {
            continue
        }

        originalClaim := vol.PersistentVolumeClaim.ClaimName

        // Example mutation logic:
        // Append node name OR append a suffix.
        // NOTE: nodeName is usually empty at CREATE time.
        newClaimName := fmt.Sprintf("%s-node-specific", originalClaim)

        pod.Spec.Volumes[i].PersistentVolumeClaim.ClaimName = newClaimName
        mutated = true
    }

    if !mutated {
        return admission.Allowed("no pvc volumes mutated")
    }

    // Return a patch computed against the original raw object
    marshaled, err := json.Marshal(pod)
    if err != nil {
        return admission.Errored(http.StatusInternalServerError, err)
    }
    return admission.PatchResponseFromRaw(req.Object.Raw, marshaled)
}
```
---
# 3️⃣ What This Does
If a Pod like this is created:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example
  labels:
    dataset.mycompany.com/enabled: "true"
spec:
  volumes:
    - name: mydata
      persistentVolumeClaim:
        claimName: dataset
```
It will be mutated to:
```yaml
spec:
  volumes:
    - name: mydata
      persistentVolumeClaim:
        claimName: dataset-node-specific
```
---
# 4️⃣ If You Want to Use NodeName
⚠ Important:
At CREATE time:
```
pod.Spec.NodeName == ""
```
Unless:
* It's a DaemonSet
* Or manually scheduled
If you need nodeName-based mutation, you must:
* Use a **post-scheduling mutation** (complex)
* Or rely on PVC topology instead
* Or run as DaemonSet
---
# 5️⃣ How Filtering Really Works
You should also configure `objectSelector` in the webhook configuration:
```yaml
objectSelector:
  matchLabels:
    dataset.mycompany.com/enabled: "true"
```
That prevents unnecessary webhook calls.
---
# 6️⃣ Production Improvements
For production, you should:
* Skip mutation if claim already rewritten
* Handle multiple PVC volumes
* Make mutation logic idempotent
* Possibly read a CR to determine correct PVC name
* Add logging
Example idempotency check:
```go
if strings.HasSuffix(originalClaim, "-node-specific") {
    continue
}
```
---
# 7️⃣ Important Architectural Reminder
Mutating PVC name at Pod CREATE works only if:
* The target PVC already exists
* The PVC binding rules allow it
* Scheduling constraints are satisfied
If you're trying to dynamically create per-node PVCs,
you likely want your operator to:
1. Create PVCs
2. Ensure readiness
3. Then allow Pods
---
# βœ… Summary
This webhook:
* Filters by Pod label
* Iterates volumes
* Rewrites `claimName`
* Returns patch
This is the correct Kubebuilder-style implementation.
---
If you'd like next, I can show:
* How to mutate based on a CR lookup
* How to mutate only after scheduling
* Or a full operator + webhook working pattern for your per-node PVC architecture
Which direction do you want to go?
------------------------
I would like to see both "How to mutate based on a CR lookup" and "How to mutate only after scheduling".
------------------------
Perfect β€” these are the two *advanced* patterns, and they solve different architectural problems.
I'll show you:
1. βœ… **Mutate based on a CR lookup** (clean, production-ready)
2. ⚠️ **Mutate only after scheduling** (trickier, but possible)
Both examples assume:
* Kubebuilder
* controller-runtime
* Mutating admission webhook for Pods
---
# βœ… PART 1 β€” Mutate Based on a CR Lookup
This is the **cleanest and most production-safe pattern**.
## 🎯 Goal
Pod has a label like:
```yaml
labels:
  dataset.mycompany.com/name: mydataset
```
Your webhook:
1. Reads the label
2. Fetches `SharedDataset` CR
3. Computes the correct PVC name
4. Mutates `claimName`
---
## 🧱 Example CRD
```go
type SharedDatasetSpec struct {
    BasePVCName string `json:"basePVCName"`
}

type SharedDataset struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec SharedDatasetSpec `json:"spec,omitempty"`
}
```
---
## 🧠 Updated Webhook Struct
We now inject a Kubernetes client:
```go
type PodMutator struct {
    Client  client.Client
    decoder admission.Decoder
}
```
---
## πŸ”§ Setup
```go
func (m *PodMutator) SetupWebhookWithManager(mgr ctrl.Manager) error {
    m.Client = mgr.GetClient()
    m.decoder = admission.NewDecoder(mgr.GetScheme())

    // Register the handler on the manager's webhook server.
    mgr.GetWebhookServer().Register("/mutate-v1-pod", &webhook.Admission{Handler: m})
    return nil
}
```
---
## πŸš€ Handle() with CR Lookup
```go
func (m *PodMutator) Handle(ctx context.Context, req admission.Request) admission.Response {
    pod := &corev1.Pod{}
    if err := m.decoder.Decode(req, pod); err != nil {
        return admission.Errored(http.StatusBadRequest, err)
    }

    datasetName, ok := pod.Labels["dataset.mycompany.com/name"]
    if !ok {
        return admission.Allowed("no dataset label")
    }

    // Fetch CR
    dataset := &examplev1.SharedDataset{}
    err := m.Client.Get(ctx, client.ObjectKey{
        Name:      datasetName,
        Namespace: pod.Namespace,
    }, dataset)
    if err != nil {
        return admission.Errored(http.StatusBadRequest, err)
    }

    mutated := false

    for i, vol := range pod.Spec.Volumes {
        if vol.PersistentVolumeClaim == nil {
            continue
        }

        // Example logic:
        // PVC name = basePVCName + "-node-a"
        newPVC := dataset.Spec.BasePVCName + "-default"

        if pod.Spec.Volumes[i].PersistentVolumeClaim.ClaimName != newPVC {
            pod.Spec.Volumes[i].PersistentVolumeClaim.ClaimName = newPVC
            mutated = true
        }
    }

    if !mutated {
        return admission.Allowed("no mutation needed")
    }

    // Compute the JSON patch against the original raw object
    marshaled, err := json.Marshal(pod)
    if err != nil {
        return admission.Errored(http.StatusInternalServerError, err)
    }
    return admission.PatchResponseFromRaw(req.Object.Raw, marshaled)
}
```
---
## πŸ† Why This Pattern Is Good
* Centralized dataset logic
* PVC naming fully abstracted
* Clean separation of concerns
* No CSI hacks
* Deterministic and safe
---
# ⚠️ PART 2 β€” Mutate Only After Scheduling
This is harder.
Because:
> At CREATE time, `pod.Spec.NodeName == ""`
The scheduler sets it later.
And:
* Admission webhooks run before persistence
* They do NOT run automatically after scheduling
So how do we mutate after scheduling?
---
# 🧠 Option A (Recommended): Mutate on UPDATE When nodeName Is Set
Modify your webhook rules:
```yaml
rules:
  - operations: ["CREATE", "UPDATE"]
    apiGroups: [""]
    apiVersions: ["v1"]
    resources: ["pods"]
```
Then in code:
```go
if pod.Spec.NodeName == "" {
    return admission.Allowed("pod not yet scheduled")
}
```
Then compute:
```go
newPVC := fmt.Sprintf("%s-%s", dataset.Spec.BasePVCName, pod.Spec.NodeName)
```
---
## 🚨 CRITICAL PROBLEM
By the time NodeName is set:
* Volumes are already resolved
* PVC binding likely already occurred
* Mutation may be rejected
Kubernetes may reject spec changes to volumes after creation.
So this often does **NOT work reliably**.
---
# 🧠 Option B (Better): Use WaitForFirstConsumer
Instead of mutating after scheduling:
1. Create per-node PVCs
2. Add nodeAffinity to PV
3. Use StorageClass:
```yaml
volumeBindingMode: WaitForFirstConsumer
```
Then:
* Pod references generic PVC
* Scheduler picks node
* Binder picks correct PV for that node
No webhook required.
This is MUCH cleaner.
---
# 🧠 Option C (Advanced & Clean): Two-Phase Pod Pattern
If you must mutate after scheduling:
1. Pod created with placeholder PVC
2. Scheduler assigns node
3. Your controller watches Pod updates
4. Controller deletes Pod
5. Recreates Pod with correct PVC
But:
* Disruptive
* Complex
* Not elegant
---
# πŸ† What I Recommend For Your Architecture
Since you want:
* Per-node PVCs
* No RWX
* Clean pod spec
* No PVC awareness in workload
Best pattern:
### πŸ₯‡ Operator + CR Lookup + PVC Precreation
Flow:
1. SharedDataset CR created
2. Operator creates:
* `dataset-node-a`
* `dataset-node-b`
3. Webhook mutates PVC name at CREATE
4. PVC already exists
5. Scheduler enforces node affinity
No post-scheduling mutation required.
---
# πŸ”¬ Summary Comparison
| Pattern | Safe | Production Ready | Recommended |
| ----------------------- | ---- | ---------------- | ----------- |
| CR lookup at CREATE | βœ… | βœ… | ⭐⭐⭐⭐⭐ |
| Mutate after scheduling | ⚠️ | ❌ | Avoid |
| WaitForFirstConsumer | βœ… | βœ… | ⭐⭐⭐⭐⭐ |
| Delete & recreate pod | ❌ | ❌ | No |
---
# πŸ’‘ My Advice
Use:
* CR-driven dataset abstraction
* Pre-created per-node PVCs
* CREATE-time mutation only
* Avoid nodeName-dependent logic in webhook
---
If you'd like, next I can:
* Design the full SharedDataset operator architecture
* Show the PVC-per-node reconciliation logic
* Show how to combine topology + webhook cleanly
* Or draw the complete control-plane flow step-by-step
Which direction would you like to go deeper into?
------------------------
In your "Part 1 — Mutate Based on a CR Lookup", the example code adds "-default" to the PVC name,
but the comment says "-node-a". Because this is a CREATE and the node is not known at create time, I
don't understand how this helps to pick a node-specific PVC name. What am I missing?
------------------------
Excellent β€” you caught the inconsistency πŸ‘
You are not missing anything.
You are correctly identifying the core constraint.
Let’s untangle this carefully.
---
# πŸ”΄ The Core Truth
At **Pod CREATE time**:
```
pod.Spec.NodeName == ""
```
So:
> You CANNOT compute a node-specific PVC name during CREATE
> unless the Pod is pre-scheduled (DaemonSet, static pod, etc.)
So the earlier example that hinted at `-node-a` during CREATE was conceptually misleading in the context of Deployments.
Good catch.
---
# 🧠 So How Does The CR Lookup Pattern Actually Help?
The CR lookup pattern is useful when:
* The PVC name is derived from a CR
* But NOT from the node
Example:
```
SharedDataset:
  spec:
    basePVCName: dataset-prod
```
Then:
```
dataset-prod-readonly
dataset-prod-clone
dataset-prod-version2
```
That works.
But it does NOT solve node-specific naming.
---
# πŸ”₯ The Real Problem You’re Trying to Solve
You want:
* RWO storage
* Multiple nodes
* One PVC per node
* Pod does not know PVC name
* PVC name depends on node
This creates a circular dependency:
| Step | Requires |
| ------------ | ---------------------- |
| Schedule Pod | PVC must already exist |
| Choose PVC | Need to know node |
| Know node | Must schedule Pod |
That’s the deadlock.
---
# πŸ† The Correct Solution: Stop Encoding Node Name in PVC Name
Instead of:
```
dataset-node-a
dataset-node-b
```
You should create:
```
dataset
```
Multiple PVs behind it:
* PV 1 β†’ node-a (nodeAffinity)
* PV 2 β†’ node-b (nodeAffinity)
* PV 3 β†’ node-c (nodeAffinity)
And use:
```
volumeBindingMode: WaitForFirstConsumer
```
---
# 🧠 How This Actually Works
1. Pod references generic PVC:
```yaml
claimName: dataset
```
2. StorageClass:
```yaml
volumeBindingMode: WaitForFirstConsumer
```
3. Scheduler:
* Picks a node for Pod
4. PVC binder:
* Selects PV whose nodeAffinity matches that node
5. PVC binds to correct PV
6. CSI mounts local RWO storage
No webhook needed.
No nodeName logic needed.
No circular dependency.
---
# 🧱 Example Architecture
## PVC (single)
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dataset
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-storage
  resources:
    requests:
      storage: 10Gi   # required field; sized to match the example PVs below
```
---
## PV for node-a
```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: dataset-node-a   # example name; one PV per node
spec:
  storageClassName: local-storage
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  local:
    path: /mnt/dataset
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node-a
```
Repeat for node-b, node-c.
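The `local-storage` class referenced above must also exist. For statically provisioned local PVs like these, a minimal StorageClass looks like:
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner   # static local PVs; no dynamic provisioning
volumeBindingMode: WaitForFirstConsumer
```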
---
# 🎯 Why This Works
Because Kubernetes does:
```
Schedule first β†’ bind later
```
When using WaitForFirstConsumer.
This breaks the circular dependency.
---
# 🚫 Why Mutating After Scheduling Is Wrong
You might think:
"Okay, I’ll mutate on UPDATE when nodeName is set."
But by then:
* Volume scheduling already considered PVC
* Changing PVC name is forbidden
* You risk invalidating scheduling decisions
Kubernetes intentionally prevents this pattern.
---
# 🧠 When Is CREATE-Time CR Lookup Useful?
It is useful when:
* Selecting dataset version
* Selecting environment-specific PVC
* Injecting precomputed PVC name
* Multi-tenant routing
But not for node-based routing.
---
# 🏁 Final Clarification
You are not missing anything.
The reality is:
> You cannot safely select a node-specific PVC name during Pod CREATE unless the Pod is pre-scheduled.
The correct pattern is:
βœ… One PVC
βœ… Multiple node-affine PVs
βœ… WaitForFirstConsumer
❌ No nodeName-based mutation
---
# πŸš€ If You’d Like
I can now:
* Design the full operator logic that auto-creates per-node PVs
* Show how to replicate data per node cleanly
* Or explain how CSI topology fits into this pattern
Which direction do you want to explore next?
------------------------
So this only works for a single pod, right? If the workload is a DaemonSet or ReplicaSet, multiple
pods on different nodes would map to the same PVC, and the PVC would need to bind to different PVs on
each node, which would fail. Is that correct?
------------------------
Excellent β€” this is exactly the right mental model to validate.
And yes β€” your concern is correct.
Let’s walk through it precisely.
---
# πŸ”΄ Key Fact: A PVC Can Bind To Only ONE PV
A `PersistentVolumeClaim`:
* Binds **once**
* To exactly **one PersistentVolume**
* For its lifetime
Even with:
```yaml
volumeBindingMode: WaitForFirstConsumer
```
The flow is:
1. First Pod gets scheduled
2. PVC binds to a matching PV (e.g., node-A)
3. PVC is now permanently bound to that PV
4. All future Pods using that PVC must use that same PV
---
# πŸ”₯ What Happens With a ReplicaSet or DaemonSet?
Let’s say:
* You have one PVC: `dataset`
* You have 3 PVs:
* PV-A (node-A)
* PV-B (node-B)
* PV-C (node-C)
### Pod #1 scheduled to node-A
PVC binds to PV-A.
All good.
---
### Pod #2 scheduled to node-B
Now:
* PVC is already bound to PV-A
* PV-A has nodeAffinity=node-A
* Pod is on node-B
Result:
❌ Pod cannot mount
❌ Scheduler fails
❌ You get volume node affinity conflict
---
# 🧠 So Your Conclusion Is Correct
> This pattern works only for one Pod using the PVC at a time.
Or:
* Multiple Pods on the **same node**
* Not multiple nodes
---
# 🎯 So What Does This Mean For Your Original Goal?
You want:
* Same logical dataset
* Multiple nodes
* RWO storage
* Multiple Pods
* No PVC awareness in workload
That means:
> You must have multiple PVCs.
There is no Kubernetes mechanism where:
```
One PVC dynamically binds to different PVs per Pod
```
PVC binding is cluster-level, not per-Pod.
---
# πŸ— So What Actually Works?
There are only four real patterns:
---
# βœ… Pattern 1 β€” One PVC Per Pod (StatefulSet)
This is what Kubernetes is designed for.
A StatefulSet with a `dataset` volumeClaimTemplate automatically creates one claim per Pod:
```
dataset-<statefulset-name>-0
dataset-<statefulset-name>-1
dataset-<statefulset-name>-2
```
Each Pod gets its own PVC.
Works perfectly with RWO.
---
# βœ… Pattern 2 β€” Operator Creates One PVC Per Node
For a DaemonSet:
Operator creates:
```
dataset-node-a
dataset-node-b
dataset-node-c
```
Then:
* Pod on node-A references dataset-node-a
* Pod on node-B references dataset-node-b
This requires:
* Webhook mutation
OR
* Generating the DaemonSet yourself
This is the correct abstraction if you insist on DaemonSet.
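For illustration, one of the per-node claims such an operator might create (the names, labels, class, and size are assumptions; each claim must end up pinned to its node via a node-affine PV or a node-local provisioner):
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dataset-node-a               # created by the operator, one per node
  labels:
    dataset.mycompany.com/name: dataset
    dataset.mycompany.com/node: node-a
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-storage    # example class; must pin the volume to node-a
  resources:
    requests:
      storage: 10Gi                  # example size
```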
---
# βœ… Pattern 3 β€” True RWX Storage (The Cleanest)
Use:
* NFS
* CephFS
* EFS
* Longhorn RWX
* etc.
Then:
* Single PVC
* Many nodes
* No hacks
This is what RWX is for.
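With an RWX backend the whole problem collapses to one claim shared by every Pod (the class name and size are illustrative):
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dataset
spec:
  accessModes:
    - ReadWriteMany                  # requires an RWX-capable backend (NFS, CephFS, EFS, ...)
  storageClassName: nfs-client       # example RWX storage class
  resources:
    requests:
      storage: 10Gi                  # example size
```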
---
# βœ… Pattern 4 β€” Data Replication Layer
If storage is RWO but data is read-only:
Operator:
1. Creates per-node PVC
2. Clones from snapshot
3. Keeps them in sync if needed
Pods use per-node PVC.
This is effectively building your own distributed dataset layer.
---
# 🧠 Why Kubernetes Works This Way
PVC binding is designed to guarantee:
* Storage topology correctness
* Scheduler correctness
* Predictable data locality
If PVCs dynamically rebound per Pod:
* Data consistency guarantees would break
* RWO semantics would be meaningless
So Kubernetes intentionally prevents this.
---
# 🎯 Final Answer
Yes.
Your understanding is correct:
> A single PVC cannot bind to different PVs for different Pods across nodes.
So the WaitForFirstConsumer trick only works for:
* Single Pod workloads
* Or multi-pod same-node workloads
Not ReplicaSets across nodes.
Not DaemonSets across nodes.
---
# πŸ† What You Should Do
Given everything you've described so far:
The correct pattern for you is:
> Operator-managed per-node PVCs,
> plus Pod mutation (or a generated workload spec)
Not CSI logic.
Not PVC rebinding tricks.
---
If you'd like, I can now:
* Design the full per-node PVC operator pattern
* Show how to safely mutate DaemonSet pods
* Or propose a different architecture depending on whether the data changes or is immutable
What does your dataset look like β€” immutable, periodically updated, or actively written?
------------------------