Answer: Container isolation is enforced by multiple independent kernel subsystems, not a single boundary:
- Namespaces restrict visibility (what a process can see)
- cgroups restrict resource consumption (what a process can exhaust)
- Capabilities restrict privileged operations
- Seccomp restricts kernel attack surface (syscalls)
- LSMs enforce mandatory access control
Isolation strength is therefore emergent and configuration-dependent. Any misconfiguration in one layer weakens the whole model.
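As a rough illustration of how configuration-dependent this is, here is a minimal Go sketch that spawns a child in new namespaces: each layer is a separate clone() flag, and omitting one silently weakens the model (Linux-only; needs root or a user namespace; cgroups, capabilities, and seccomp are separate steps not shown here).

```go
// Minimal sketch: namespace isolation is opt-in per clone() flag.
package main

import (
	"os"
	"os/exec"
	"syscall"
)

func main() {
	cmd := exec.Command("/bin/sh") // illustrative child process
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		// Each flag is one independent isolation layer; none is
		// implied by the others.
		Cloneflags: syscall.CLONE_NEWPID | // new PID view
			syscall.CLONE_NEWNS | // new mount table
			syscall.CLONE_NEWUTS | // new hostname
			syscall.CLONE_NEWNET, // new network stack
	}
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```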
Answer: Because containers do not virtualize the kernel. All containers share:
- The same syscall interface
- The same kernel memory
- The same kernel scheduler
A single kernel privilege escalation (e.g., Dirty Pipe) can allow:
- Container → host compromise
- Cross-container impact
- Node-wide breach in Kubernetes
This is fundamentally different from VM-based isolation.
Answer: Namespaces only affect perception, not authority:
- They hide objects (PIDs, mounts, interfaces)
- They do not prevent privileged kernel actions
If a process gains sufficient privileges (e.g., CAP_SYS_ADMIN), namespaces become largely bypassable.
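A hedged sketch of that bypass: if /proc exposes the host's PID 1 and the process holds CAP_SYS_ADMIN, a single setns(2) call joins the host's network namespace (paths illustrative; assumes golang.org/x/sys/unix).

```go
// Sketch: with CAP_SYS_ADMIN and a visible /proc/1, namespaces stop
// being a boundary. Joining the host network namespace is one call.
package main

import (
	"fmt"
	"runtime"

	"golang.org/x/sys/unix"
)

func main() {
	// setns affects the calling thread only, so pin it.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	// /proc/1/ns/net is the host's network namespace whenever the
	// host PID namespace (or its /proc) is visible to this process.
	fd, err := unix.Open("/proc/1/ns/net", unix.O_RDONLY, 0)
	if err != nil {
		panic(err)
	}
	defer unix.Close(fd)

	// Requires CAP_SYS_ADMIN over the target namespace.
	if err := unix.Setns(fd, unix.CLONE_NEWNET); err != nil {
		panic(err)
	}
	fmt.Println("now in the host network namespace")
}
```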
Answer: A common exploit chain:
- Container runs with CAP_SYS_ADMIN or in privileged mode
- Writable hostPath mount exists
- Attacker remounts host filesystem
- Modifies host binaries or runtime files
- Achieves persistent host compromise
Mount namespaces are therefore one of the highest-risk kernel interfaces in container environments.
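To make the chain concrete, this sketch shows the single call at its center: inside a privileged container, mounting the host's block device (device path hypothetical) needs nothing beyond CAP_SYS_ADMIN (assumes golang.org/x/sys/unix).

```go
// Sketch of step 3 of the chain above: one mount(2) call turns an
// "isolated filesystem view" into read-write access to the host root.
package main

import (
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	// The mount target must exist inside the container's view.
	if err := os.MkdirAll("/mnt/host", 0o755); err != nil {
		panic(err)
	}
	// Requires CAP_SYS_ADMIN; /dev/sda1 is a hypothetical host disk.
	if err := unix.Mount("/dev/sda1", "/mnt/host", "ext4", 0, ""); err != nil {
		panic(err)
	}
	// From here: modify host binaries, cron entries, kubelet config...
}
```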
Answer: User namespaces introduce:
- Complex UID/GID mappings
- Filesystem ownership complications
- Compatibility issues with legacy software and NFS
Despite being a major security improvement, operational friction has limited adoption.
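For reference, a minimal sketch of the mapping mechanics using Go's stdlib: in-container root maps to an unprivileged host range (the host ID range is illustrative; mapping an arbitrary range requires root or newuidmap/newgidmap).

```go
// Sketch: a user-namespaced child where "root" inside the namespace
// is an unprivileged UID on the host.
package main

import (
	"os"
	"os/exec"
	"syscall"
)

func main() {
	cmd := exec.Command("/usr/bin/id")
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUSER,
		// UID 0 inside == UID 100000 outside: an exploit that yields
		// container root still lands unprivileged on the host.
		UidMappings: []syscall.SysProcIDMap{
			{ContainerID: 0, HostID: 100000, Size: 65536},
		},
		GidMappings: []syscall.SysProcIDMap{
			{ContainerID: 0, HostID: 100000, Size: 65536},
		},
	}
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```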
Answer: Attackers can:
- Trigger repeated OOM kills to disrupt workloads
- Exhaust PID limits to cause node instability
- Abuse CPU shares to starve critical services
These are availability attacks, not escapes, but can be leveraged for lateral movement or incident masking.
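The corresponding mitigations are plain cgroup v2 control files. A stdlib-only sketch, with group name and limits illustrative (needs root and a cgroup2 mount at /sys/fs/cgroup):

```go
// Sketch: capping PIDs and memory for a cgroup v2 group.
package main

import (
	"os"
	"path/filepath"
)

func main() {
	cg := "/sys/fs/cgroup/demo" // hypothetical group
	if err := os.MkdirAll(cg, 0o755); err != nil {
		panic(err)
	}
	// pids.max blunts fork bombs; memory.max bounds OOM blast radius.
	limits := map[string]string{
		"pids.max":   "256",
		"memory.max": "268435456", // 256 MiB
	}
	for file, val := range limits {
		if err := os.WriteFile(filepath.Join(cg, file), []byte(val), 0o644); err != nil {
			panic(err)
		}
	}
}
```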
Answer: Because cgroups:
- Do not restrict privileges
- Do not isolate memory access
- Do not prevent kernel exploitation
They reduce blast radius but do not prevent compromise.
Answer: CAP_SYS_ADMIN is a catch-all capability that includes:
- Mounting filesystems
- Namespace manipulation
- Kernel tuning interfaces
Most historical container escapes rely on this capability.
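One concrete hardening step is removing CAP_SYS_ADMIN from the bounding set before the workload starts, as in this sketch (assumes golang.org/x/sys/unix; the caller needs CAP_SETPCAP):

```go
// Sketch: dropping CAP_SYS_ADMIN from the bounding set. The drop is
// irreversible for this process tree: even a setuid binary exec'd
// later cannot regain the capability.
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	if err := unix.Prctl(unix.PR_CAPBSET_DROP, uintptr(unix.CAP_SYS_ADMIN), 0, 0, 0); err != nil {
		panic(err)
	}
	fmt.Println("CAP_SYS_ADMIN removed from bounding set")
}
```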
Answer: Because UID ≠ privilege in Linux.
A non-root process with dangerous capabilities can still:
- Mount filesystems
- Reconfigure networking
- Abuse kernel interfaces
Capabilities, not UID, define what a process can do.
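You can verify this directly: a process's effective privileges live in its capability sets, readable from /proc. A stdlib-only sketch that checks two dangerous bits (the bit positions are the kernel's capability numbers):

```go
// Sketch: read CapEff from /proc/self/status and test specific bits.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		panic(err)
	}
	for _, line := range strings.Split(string(data), "\n") {
		if !strings.HasPrefix(line, "CapEff:") {
			continue
		}
		hex := strings.TrimSpace(strings.TrimPrefix(line, "CapEff:"))
		caps, err := strconv.ParseUint(hex, 16, 64)
		if err != nil {
			panic(err)
		}
		// CAP_NET_ADMIN = 12, CAP_SYS_ADMIN = 21
		fmt.Println("CAP_NET_ADMIN:", caps&(1<<12) != 0)
		fmt.Println("CAP_SYS_ADMIN:", caps&(1<<21) != 0)
	}
}
```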
Answer: Most kernel exploits require:
- Specific syscalls
- Specific argument patterns
By blocking those syscalls entirely, seccomp can:
- Break exploit primitives
- Convert RCE into DoS
- Force attackers into harder chains
This is exploit prevention, not just detection.
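A sketch of what such a filter looks like in practice, using github.com/seccomp/libseccomp-golang (requires libseccomp installed; the syscall list is illustrative, not a vetted profile):

```go
// Sketch: allow by default, deny specific escape/exploit primitives.
package main

import (
	"fmt"

	seccomp "github.com/seccomp/libseccomp-golang"
)

func main() {
	filter, err := seccomp.NewFilter(seccomp.ActAllow)
	if err != nil {
		panic(err)
	}
	for _, name := range []string{"mount", "umount2", "unshare", "setns", "keyctl"} {
		sc, err := seccomp.GetSyscallFromName(name)
		if err != nil {
			continue // syscall unknown on this kernel/arch
		}
		// EPERM (1) instead of SIGKILL keeps failures observable.
		if err := filter.AddRule(sc, seccomp.ActErrno.SetReturnCode(1)); err != nil {
			panic(err)
		}
	}
	if err := filter.Load(); err != nil {
		panic(err)
	}
	fmt.Println("filter loaded; mount(2) now returns EPERM")
}
```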
Answer: Because default profiles:
- Are generic
- Allow many legacy syscalls
- Optimize for compatibility over minimalism
High-security workloads require tailored profiles.
Answer: LSMs restrict:
- File access beyond DAC
- Process interactions
- Capability usage in context
They can block attacker actions after container compromise.
Answer: Because LSM policies:
- Are hard to write
- Break applications silently
- Require deep workload knowledge
Security teams often trade enforcement for operability.
Answer: OverlayFS operates across:
- Host filesystem
- Container layers
- Copy-on-write logic
Bugs here often lead to host filesystem access or corruption.
Answer: Because mounts are:
- Mutable at runtime
- Often writable
- Frequently misconfigured
Image layers are static; mounts are dynamic attack surfaces.
Answer: Because:
- Containers allow attacker-controlled code execution
- Shared kernel magnifies impact
- Exploits are often reliable and fast
Kernel patch latency directly maps to container risk.
Answer: Because successful exploitation often depends on:
- Available syscalls
- Capabilities
- LSM policies
- Kernel config hardening
Defense-in-depth can break exploit chains.
Answer: Linux was optimized for:
- Performance
- Multi-tenancy efficiency
- Backward compatibility
Security isolation was layered incrementally, not designed upfront.
Answer: When:
- Running untrusted tenant code
- High-value secrets share the node
- Regulatory isolation requirements exist
VMs or hardware isolation are more appropriate.
Answer: In practice:
- Enable user namespaces
- Drop all unnecessary capabilities
- Enforce seccomp profiles
- Enable SELinux/AppArmor
- Patch kernels aggressively
Together, these drastically reduce real-world exploitability.
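Most of this checklist maps onto a Kubernetes container SecurityContext. A sketch using k8s.io/api/core/v1 types (values are a reasonable baseline, not a universal policy):

```go
// Sketch: the hardening checklist above as a container SecurityContext.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func boolPtr(b bool) *bool { return &b }

func main() {
	sc := &corev1.SecurityContext{
		// Drop everything; add back only what the workload proves it needs.
		Capabilities: &corev1.Capabilities{
			Drop: []corev1.Capability{"ALL"},
		},
		RunAsNonRoot:             boolPtr(true),
		AllowPrivilegeEscalation: boolPtr(false),
		ReadOnlyRootFilesystem:   boolPtr(true),
		// Enforce the runtime's seccomp profile as a floor.
		SeccompProfile: &corev1.SeccompProfile{
			Type: corev1.SeccompProfileTypeRuntimeDefault,
		},
	}
	fmt.Printf("%+v\n", sc)
}
```

Note that user namespaces are configured at the pod level (hostUsers: false in the pod spec), not in the container SecurityContext.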
Answer: Without standardization, container behavior was tightly coupled to Docker’s implementation, creating:
- Vendor lock-in
- Inconsistent security guarantees
- Opaque runtime behavior
OCI introduced explicit contracts for image format and runtime behavior, making container execution auditable, portable, and analyzable across platforms.
Answer: The OCI runtime spec defines how isolation must be applied, including:
- Namespace configuration
- cgroup application
- Mount semantics
- Capability dropping
It does not guarantee security, but it eliminates ambiguity, which is critical for:
- Threat modeling
- Runtime hardening
- CVE impact assessment
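To see what "explicit contract" means, here is a sketch of the isolation knobs as runtime-spec Go structs, using github.com/opencontainers/runtime-spec/specs-go (a subset of fields; values illustrative):

```go
// Sketch: every namespace and capability is declared, so configs can
// be diffed and audited. The JSON output is (part of) a config.json.
package main

import (
	"encoding/json"
	"fmt"

	specs "github.com/opencontainers/runtime-spec/specs-go"
)

func main() {
	spec := specs.Spec{
		Process: &specs.Process{
			// Explicit capability sets: nothing is implicit.
			Capabilities: &specs.LinuxCapabilities{
				Bounding:  []string{"CAP_NET_BIND_SERVICE"},
				Effective: []string{"CAP_NET_BIND_SERVICE"},
				Permitted: []string{"CAP_NET_BIND_SERVICE"},
			},
			NoNewPrivileges: true,
		},
		Linux: &specs.Linux{
			Namespaces: []specs.LinuxNamespace{
				{Type: specs.PIDNamespace},
				{Type: specs.MountNamespace},
				{Type: specs.NetworkNamespace},
				{Type: specs.UserNamespace},
			},
		},
	}
	out, _ := json.MarshalIndent(spec, "", "  ")
	fmt.Println(string(out))
}
```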
Answer: Because runc:
- Runs with elevated privileges
- Directly configures namespaces, cgroups, mounts, and capabilities
- Executes the container’s initial process
Any flaw in runc can collapse all higher-level isolation, which is why runc CVEs often lead to container escape class vulnerabilities.
Answer: runc operates:
- Outside the container
- With host-level privileges
- At container creation time
A successful exploit can therefore impact the host and all containers on the node, not just a single workload.
Answer: Because containerd is a lifecycle manager, not an execution engine. It:
- Pulls and unpacks images
- Manages snapshots and metadata
- Delegates execution to runc
The actual isolation enforcement still occurs in runc and the kernel.
Answer: Although safer than runc, containerd:
- Becomes a high-value control-plane component
- Manages container lifecycle and state
- Exposes APIs that, if reachable, can allow container manipulation
Misconfiguration or exposure of containerd APIs can lead to host-level impact without kernel exploitation.
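A sketch of why reachable APIs matter: anyone who can open the containerd socket can enumerate and manipulate every container in a namespace, with no kernel exploit involved (assumes github.com/containerd/containerd; the socket path is the default).

```go
// Sketch: socket access == lifecycle control over all containers.
package main

import (
	"context"
	"fmt"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
)

func main() {
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		panic(err)
	}
	defer client.Close()

	// "k8s.io" is the namespace kubelet uses via CRI.
	ctx := namespaces.WithNamespace(context.Background(), "k8s.io")
	containers, err := client.Containers(ctx)
	if err != nil {
		panic(err)
	}
	for _, c := range containers {
		fmt.Println(c.ID()) // from here: exec, delete, reconfigure...
	}
}
```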
Answer: CRI-O:
- Implements only what Kubernetes requires
- Removes Docker legacy features
- Reduces code paths and attack surface
However, it still relies on runc, so kernel and runtime risks remain.
Answer: CRI expands the trust boundary by inserting:
- kubelet
- CRI
- runtime
Any weakness or misbehavior in this chain can affect all workloads on a node, making runtime integrity critical.
Q29. Why are runtime sockets (e.g., docker.sock, containerd.sock) considered critical security risks?
Answer: Because access to runtime sockets allows:
- Creating privileged containers
- Mounting host filesystems
- Escaping namespace boundaries without exploits
In practice, mounting docker.sock is equivalent to granting root on the host.
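A quick self-check for this misconfiguration is to scan the mount table from inside the container, as in this stdlib-only sketch (socket names are the common defaults):

```go
// Sketch: detect runtime sockets exposed inside a container.
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	data, err := os.ReadFile("/proc/self/mountinfo")
	if err != nil {
		panic(err)
	}
	for _, line := range strings.Split(string(data), "\n") {
		for _, sock := range []string{"docker.sock", "containerd.sock", "crio.sock"} {
			if strings.Contains(line, sock) {
				// Any hit means root-equivalent control of the host runtime.
				fmt.Println("exposed runtime socket:", line)
			}
		}
	}
}
```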
Answer: Registries introduce:
- Supply chain trust assumptions
- Remote code ingestion at scale
- Dependency confusion risks
A compromised registry or image can affect entire fleets, not just individual containers.
Answer: Because mutable images:
- Break provenance guarantees
- Undermine incident response
- Allow silent behavior changes
Immutability enables deterministic forensics and rollback.
Answer: The two classes differ in what they exploit:
- Kernel CVEs exploit shared execution primitives
- Runtime CVEs exploit privileged orchestration logic
Runtime CVEs often require less sophistication and are more reliable in real environments.
Answer: Because runtimes:
- Run on every node
- Are identical across clusters
- Sit below orchestration layers
A single unpatched runtime CVE can enable mass compromise.
Answer: High-signal indicators include:
- Unexpected container creation or deletion
- Runtime socket access from containers
- Mount or namespace syscalls during runtime
- Execution of shells or debugging tools in production containers
Answer: Standardization improves consistency and observability, but it also concentrates risk.
Security therefore depends on:
- Aggressive patching
- Minimal runtime exposure
- Strong runtime monitoring