👋 Everything about EKS & AI Infrastructure Newsletter #72 ☁️❤👨💻
Sandboxed agents, sidecarless meshes, cold start gotchas, and the economics of every token your cluster manufactures

Dear EKS & AI Infrastructure enthusiasts, welcome to Everything about EKS & AI Infrastructure #72.
This edition went deep in a few directions at once — and that’s a feature, not a bug. Agent isolation is finally getting its infrastructure moment: Agent Substrate, kagent, NOVA, and Dapr Verifiable Execution are all solving the same underlying problem from different layers of the stack. The question of “who sent this request” is being replaced by “what hardware did it touch, what path did it travel, and who was accountable at each step.” That’s a materially different security posture than what most EKS teams are running today.
At the same time, the inference optimization front kept moving. DiffusionGemma rewrites the compute assumptions. Helion kernels land in vLLM with real numbers. The GPU cold start breakdown from Manikandan is the most honest accounting of where your 20-minute scale-up time actually goes. And the Tokenomics Foundation quietly names the thing nobody has an org chart for yet: the token bill starts in your Kubernetes cluster, long before the model provider invoices you.
Lots to read. Start wherever the pain is.
Performance Engineering in Modern AI Systems 🌩️
🌩️DiffusionGemma: 256 tokens per forward pass, 4x faster generation
Token-by-token autoregression is the throughput ceiling every inference team on EKS has been working around. Google’s DiffusionGemma abandons it entirely — starting from noise and refining whole 256-token blocks per forward pass, the way image diffusion models work. It’s a 26B MoE that activates only 3.8B parameters, Apache 2.0, with day-one support in vLLM, MLX, and Transformers.
The infrastructure implication that matters for EKS: diffusion inference is compute-bound rather than memory-bandwidth-bound. Your current GPU node sizing assumptions — optimized for KV cache pressure and memory bandwidth on autoregressive decode — don’t transfer. Multi-tenant serving on shared A10G or H100 nodes will behave differently under load, and the NVFP4 build NVIDIA shipped on day one is a signal that the quantization story for this architecture is still being written. Benchmark before you route production traffic through it.
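A back-of-the-envelope roofline check makes the compute-bound versus memory-bandwidth-bound distinction concrete. The sketch below is only a rough model, not a benchmark: it assumes about 2 FLOPs per active parameter per token, assumes the active weights are read once per forward pass, and ignores KV cache traffic, activations, and MoE routing effects. The hardware figures in the usage note are placeholders, not the specs of any real GPU.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """Approximate FLOPs per byte of weights read in one forward pass.

    Assumption: ~2 FLOPs per active parameter per token, and the active
    weights are streamed from memory once per pass no matter how many
    tokens that pass processes.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def ridge_point(peak_tflops: float, mem_bw_tb_per_s: float) -> float:
    """FLOPs per byte at which a device shifts from bandwidth-bound to compute-bound."""
    return peak_tflops / mem_bw_tb_per_s


def is_compute_bound(tokens_per_pass: int, peak_tflops: float, mem_bw_tb_per_s: float) -> bool:
    """Roofline test: compute-bound when intensity exceeds the ridge point."""
    return arithmetic_intensity(tokens_per_pass) > ridge_point(peak_tflops, mem_bw_tb_per_s)
```

With a placeholder device at 400 TFLOP/s and 2 TB/s, the ridge point is 200 FLOPs per byte. One token per pass gives an intensity of about 1, far below the ridge; a 256-token block gives about 256, above it. Where a real GPU lands depends on its actual specs and on batch size, which is why benchmarking on your own node types matters.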
🌩️NOVA microhypervisor: AMD DMA isolation for secure shared AI infrastructure
Multi-tenant AI inference on shared GPU nodes has a fundamental trust problem: software-layer isolation between workloads doesn’t extend to the hardware DMA layer, which means a device assigned to one VM can potentially access memory belonging to a neighboring workload. BlueRock’s NOVA microhypervisor is built specifically to close this gap — enforcing fine-grained memory access controls at the hardware layer, with isolation operating at per-device and per-memory-page granularity through the IOMMU. Unauthorized DMA transactions are aborted directly in hardware, not caught in software after the fact. The scale targets are worth noting: up to 256TB of physical memory and 128 petabytes of virtual address space per workload, which signals this is designed for the large shared-execution environments AI infrastructure teams are actually running, not toy multi-tenant demos.
The architectural relevance for EKS inference clusters is the sustained concurrency angle. As teams pack more inference workloads onto shared GPU nodes to hit utilization targets — and as agentic AI drives increasing execution complexity per workload — the threat surface at the hardware boundary grows. NOVA’s DMA remapping fault logging also opens a path toward execution-aware security auditing, which pairs directly with what Dapr 1.18’s Verifiable Execution is trying to solve at the software layer above it. Two different layers, same underlying problem: in shared AI infrastructure, “who are you” is no longer enough — you need to know what hardware you can touch.
🌩️The five stages of GPU cold start latency — and where optimizations actually fail
Getting a 72B model replica online isn’t a single operation — it’s five sequential stages, each blocking the next, and most teams optimize the wrong one. Manikandan Thangaraj’s write-up is the clearest breakdown of this problem published recently, with specific failure modes for each stage. Node provisioning is ~90s of unrecoverable wait time before a container even starts pulling — Karpenter helps because it provisions against pending pods directly rather than resizing fixed node groups, but no model-loading trick recovers this time. Image pull is a 10-20GB CUDA + PyTorch tax on every new node; Spegel converts existing nodes into a P2P layer cache, but only helps from node two onward and a silent containerd config change can disable it entirely. Model download shifts the bottleneck to storage — shared PVCs serialize reads across replicas pulling 130GB of weights; node-local NVMe removes the network filesystem from the critical path, but many “local disk” setups silently land on the root volume instead.
The two sharpest gotchas are in stages four and five. fastsafetensors can DMA weights directly from NVMe into VRAM bypassing CPU copies — but only if GPUDirect Storage is actually enabled and verified with gdscheck; without GDS the loader is slower than the default path, not faster. For torch.compile cache reuse, pointing the cache to shared RWX storage means only the first pod pays compilation cost — but cache artifacts are GPU architecture-specific, so mixed node pools silently invalidate reuse. The through-line across all five stages: every optimization has a prerequisite, and the speedup is only real when the layer beneath it is working as expected.
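Two of the silent failures above can be guarded with small startup checks. This is a minimal sketch, not taken from the write-up: the function names and directory layout are hypothetical, and the architecture label (for example "sm90") is whatever stable identifier you choose for your GPU generation.

```python
import os


def is_on_root_volume(path: str) -> bool:
    """True when `path` sits on the same device as `/`.

    If a directory that is supposed to be node-local NVMe reports the root
    device, the mount is missing and weights are landing on the root volume.
    Caveat: inside containers, overlay and bind mounts can blur this signal.
    """
    return os.stat(path).st_dev == os.stat("/").st_dev


def compile_cache_dir(shared_root: str, gpu_arch: str) -> str:
    """Per-architecture compile cache directory on shared storage.

    One directory per GPU architecture keeps mixed node pools from reusing
    artifacts that are not interchangeable between architectures.
    """
    return f"{shared_root.rstrip('/')}/torch-compile/{gpu_arch}"
```

A pod could refuse to start when `is_on_root_volume` returns True for its model directory, and could point its compile cache (for instance through the `TORCHINDUCTOR_CACHE_DIR` environment variable, if that is how your image configures it) at the path returned by `compile_cache_dir`.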
🌩️Helion kernels land in vLLM: PyTorch-native FP8 inference with hardware portability
Writing custom CUDA kernels for vLLM inference has always required either deep CUTLASS expertise or accepting what TorchInductor generates. Helion, Meta’s PyTorch-native tile-programming DSL, offers a third path: write kernels in PyTorch syntax with explicit tiling control, autotune AOT across the full shape space, and get performance that consistently beats both TorchInductor-compiled kernels and existing CUDA implementations — without touching CUDA directly. Sean Chen (Red Hat) and Yanan Cao (Meta) integrated Helion kernels into vLLM’s FP8 inference path for Qwen3 models and benchmarked end-to-end on H100 and B200. For non-GEMM kernels — rms_norm, fp8_quant, silu_and_mul, fused_qk_norm_rope — Helion delivers 1.18x–2.33x speedups over baseline. End-to-end, that compounds to ~1.09x total throughput on H100 with speculative decoding enabled on Qwen3-8B.
The constraint to understand before deploying: Helion’s GEMM story on Blackwell is incomplete. On H100, Helion’s scaled_mm beats CUTLASS by 1.08x. On B200, it trails CUTLASS by 26% — Helion currently emits Triton code for GEMM, and Triton’s Blackwell GEMM performance is the bottleneck, not Helion’s model. The CuteDSL backend targeting Blackwell is in progress. The other sharp edge is autotuning cost: covering 168 distinct input shapes for scaled_mm across three Qwen3 model sizes takes a full day. For EKS inference teams serving fixed model families at known sequence lengths, that’s a one-time cost worth paying — but it means Helion isn’t yet a drop-in for teams that need to serve arbitrary shapes dynamically.
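To see how 1.18x to 2.33x kernel speedups can shrink to roughly 1.09x end to end, Amdahl’s law is a useful sanity check. The sketch below is a simplification: it treats all accelerated kernels as one group with a single speedup, and the runtime fraction it produces is an inference, not a number from the benchmark.

```python
def overall_speedup(fraction: float, kernel_speedup: float) -> float:
    """End-to-end speedup when `fraction` of runtime gets `kernel_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


def fraction_for_speedup(overall: float, kernel_speedup: float) -> float:
    """Share of baseline runtime the accelerated kernels must occupy.

    Derived by solving overall = 1 / ((1 - f) + f / s) for f:
        f = (1 - 1/overall) / (1 - 1/s)
    """
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / kernel_speedup)
```

Under this model, a 1.09x end-to-end gain corresponds to the accelerated kernels occupying somewhere between about 14% of baseline runtime (if they all sped up 2.33x) and about 54% (if they all sped up 1.18x). The rest of the runtime, GEMM included, sets the ceiling.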
🔒AI agent wears down Fedora maintainer, bad code ships in Anaconda installer
An unsupervised agent operating under a legitimate Fedora contributor account spent weeks reassigning bugs, posting triage comments, and submitting PRs with LLM-generated justifications. A patch to the Anaconda installer claimed to fix a critical bug while silently preserving an unrelated kernel option — maintainer objections were eventually worn down by sheer argumentative persistence, the code shipped in Anaconda 45.6, and was reverted in 45.6. The account also touched openSUSE CLI tools and LXQt privilege-escalation utilities before it was disabled.
The uncomfortable read for EKS/platform teams: the agent didn’t need to be malicious to be dangerous. It just needed to be tireless. If your agentic CI pipeline can open PRs, respond to review comments, and re-submit — without a human accountable for each action — the Fedora incident is the failure mode to model. Fedora developers are explicitly drawing XZ-backdoor parallels.
Starred Content ⭐
⭐ Agent Substrate: Running Isolated AI Agents on Kubernetes
Every team running AI agents on Kubernetes eventually hits the same problem: there’s no native primitive for isolating what an agent can access, resuming it after it idles, or checkpointing it mid-execution. Michael Levan’s two-part series on Agent Substrate is the most concrete implementation of this pattern published to date. Substrate sits above K8s as its own control plane, using gVisor for sandbox isolation, while still leaning on Kubernetes for node clustering and Pod scheduling. Actors (the agent-like workloads) are multiplexed onto Workers (K8s Pods), with Google’s own test showing 250 stateful actors running across just 8 Pods.
The operational mechanic worth understanding is the snapshot-and-resume model. Actors start SUSPENDED and only hydrate when traffic arrives — the control plane claims a warm Worker, restores the gVisor checkpoint from GCS, and forwards the request. That’s not just cost optimization; it’s the architecture that makes per-agent isolation economically viable at scale. Pod Certificates (beta in K8s 1.35) handle automatic mTLS rotation between Substrate’s own components, so the platform identity layer is cryptographically grounded rather than bolted on. Part 1 covers architecture and GKE setup; Part 2 walks through creating Actors and Workers hands-on — including a Claude Code multiplexing demo if you have an Anthropic API key.
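The snapshot-and-resume flow can be pictured as a small state machine. The toy model below is purely illustrative: every class and method name is hypothetical, a dict stands in for the gVisor checkpoint, and none of it reflects Agent Substrate’s actual API.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class ActorState(Enum):
    SUSPENDED = "suspended"
    RUNNING = "running"


@dataclass
class Actor:
    name: str
    snapshot: dict = field(default_factory=dict)  # stands in for a stored checkpoint
    state: ActorState = ActorState.SUSPENDED      # actors start suspended
    worker: Optional[str] = None


class ControlPlane:
    """Hypothetical control plane that hydrates actors onto warm workers."""

    def __init__(self, warm_workers: List[str]) -> None:
        self.warm_workers = list(warm_workers)

    def handle(self, actor: Actor, request: str) -> str:
        # Hydrate only when traffic arrives for a suspended actor.
        if actor.state is ActorState.SUSPENDED:
            if not self.warm_workers:
                raise RuntimeError("no warm worker available")
            actor.worker = self.warm_workers.pop()
            actor.state = ActorState.RUNNING
        actor.snapshot["requests"] = actor.snapshot.get("requests", 0) + 1
        return f"{actor.name}@{actor.worker} handled {request}"

    def suspend(self, actor: Actor) -> None:
        # Release the worker; the snapshot keeps the actor's state.
        if actor.state is ActorState.RUNNING and actor.worker is not None:
            self.warm_workers.append(actor.worker)
            actor.worker = None
            actor.state = ActorState.SUSPENDED
```

The point of the model is that a worker is only held while an actor is running, so many mostly idle actors can share a small pool of Pods.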
⭐EKS Auto Mode + Istio Ambient Mesh: automated compute and sidecarless mTLS together
Two of the biggest operational burdens on EKS teams are node lifecycle management and securing service-to-service communication. Dave Shimko, Shamanth Devagari, and Nivi Prasad’s post shows how EKS Auto Mode and Istio Ambient Mesh attack both simultaneously without requiring you to manage either layer directly. EKS Auto Mode handles node provisioning, patching, and scaling through a custom Karpenter build — you describe workload resource requirements and scheduling constraints, the compute layer figures itself out. Istio Ambient Mesh eliminates sidecar proxies entirely: ztunnel runs as a per-node DaemonSet (written in Rust) intercepting traffic at L3/L4, handling mTLS termination and SPIFFE-based certificate rotation transparently, without touching application code or pod specs. The result is automatic mTLS across all namespace-labeled workloads with a single kubectl label command.
The architectural detail worth understanding is the L4/L7 split. ztunnel stays strictly at L3/L4 — it handles identity, encryption, and authorization by workload identity, but deliberately never parses HTTP. That’s what makes it safe to run as a shared per-node component. When you need HTTP-aware policy enforcement — header validation, JWT checks, request routing, circuit breaking — you deploy a Waypoint proxy scoped to exactly the services that need it, as a standard Kubernetes Deployment with HPA support. The post walks the full stack hands-on: Terraform cluster setup, ambient mesh enrollment, PeerAuthentication enforcement, L4 AuthorizationPolicy, and L7 header-based access control with a Waypoint. Terraform code and sample app are on GitHub.
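An L4 policy of the kind ztunnel can enforce contains only identity-based rules, with no HTTP fields. The sketch below builds such an AuthorizationPolicy manifest as a Python dict; the names, namespace, label, and principal are hypothetical and are not taken from the post’s sample app.

```python
import json
from typing import Iterable


def l4_allow_policy(name: str, namespace: str, app_label: str,
                    allowed_principals: Iterable[str]) -> dict:
    """Build an identity-only (L4) Istio AuthorizationPolicy manifest.

    The rule matches on source principals (SPIFFE-style workload identities)
    and deliberately contains no HTTP methods, paths, or headers.
    """
    return {
        "apiVersion": "security.istio.io/v1",
        "kind": "AuthorizationPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "selector": {"matchLabels": {"app": app_label}},
            "action": "ALLOW",
            "rules": [
                {"from": [{"source": {"principals": list(allowed_principals)}}]}
            ],
        },
    }


def to_manifest(policy: dict) -> str:
    """Serialize to JSON, which kubectl apply accepts as well as YAML."""
    return json.dumps(policy, indent=2)
```

Header, method, or JWT conditions are the L7 concerns the paragraph above assigns to a Waypoint proxy, which is why they are left out here.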
⭐Event loop-aware autoscaling for Node.js on EKS
CPU-based HPA is the wrong signal for Node.js on EKS, and Viacheslav Romanov has the benchmark to prove it. Node.js runs a single-threaded event loop — a pod can queue requests and blow past latency SLOs while CPU reports moderate utilization, because the actual bottleneck is a single event loop thread, not aggregate compute across cores. In his EKS benchmark, CPU-based HPA never triggered a scale-up despite the service visibly overloading. KEDA with event loop triggers scaled from 2 to 12 pods mid-test and dropped the error rate from 4% to near-zero.
The sharp insight is that no single event loop metric covers all Node.js patterns — you need two. Event Loop Utilization (ELU) catches sustained CPU-bound saturation like SSR rendering or heavy transforms. Event Loop Delay P95 catches request queuing under I/O-heavy load — the majority of real Node.js APIs — where the event loop sits idle during async I/O but requests stack up waiting for synchronous JSON parsing and validation. The KEDA ScaledObject uses OR logic across both triggers: EL Delay P95 > 100ms for the I/O pattern, ELU > 70% for the compute pattern. Full benchmark code is on GitHub.
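The OR behaviour across the two triggers falls out of how the Horizontal Pod Autoscaler combines metrics: it computes a desired replica count per metric and takes the largest. The sketch below applies that proportional rule to the two thresholds named above. It is a simplification: it ignores HPA tolerance and stabilization windows, and the replica bounds of 2 and 12 are assumptions borrowed from the benchmark’s observed range.

```python
import math


def desired_replicas(current: int, value: float, target: float) -> int:
    """HPA-style proportional rule: ceil(current * value / target)."""
    return math.ceil(current * value / target)


def scale_decision(current: int, el_delay_p95_ms: float, elu_percent: float,
                   min_replicas: int = 2, max_replicas: int = 12) -> int:
    """Take the larger recommendation of the two event loop triggers."""
    by_delay = desired_replicas(current, el_delay_p95_ms, 100.0)  # I/O-heavy pattern
    by_elu = desired_replicas(current, elu_percent, 70.0)         # compute-heavy pattern
    return max(min_replicas, min(max_replicas, max(by_delay, by_elu)))
```

For example, with 2 pods, a P95 delay of 300 ms, and ELU at 40%, the delay trigger alone drives the recommendation to 6 pods even though the utilization trigger sees nothing wrong.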
⭐Karpenter Blueprints: EKS Auto Mode variants, GPU slicing, and Trainium/Inferentia support
Christian Melendez and team pushed the most significant Karpenter Blueprints update in a while, and the structural change matters more than the individual additions. 15 of 19 blueprints now ship with an -automode.yaml counterpart alongside the original. In Auto Mode variants, NodeClass replaces EC2NodeClass and AWS takes over AMI management, node IAM roles, IMDS, and bootstrap entirely — the pieces that have historically been the source of most Karpenter operational incidents. A reference Terraform template under cluster/automode/ handles Access Entry wiring so teams migrating to Auto Mode don’t have to reverse-engineer the setup from first principles.
Four new blueprints also landed with this update, and the AI infra additions are the ones to prioritize. NodeOverlays for GPU slicing gives you instance prioritization and fine-grained GPU resource control without hand-rolling NodePool configurations. The Trainium and Inferentia blueprint is the first in the repo targeting AWS ML accelerators specifically — if you’re evaluating Trn1 or Inf2 nodes for inference workloads, this is the starting point. Dynamic EBS Volume Sizing and Static NodePools round out the update for teams that need volumes matched to actual workload requirements or predictable fixed capacity for batch jobs.
Announcements 📢
📢 Terraform MCP Server hits GA
Platform teams managing EKS clusters through Terraform have always faced a friction point: AI coding assistants had no way to query your actual workspace state, private module registry, or pending plan changes without manual copy-paste. The Terraform MCP server bridges this gap by letting MCP-compatible AI assistants — Claude Code, GitHub Copilot, IBM Bob — interact directly with HCP Terraform and Terraform Enterprise. GA means this is now supported for production use, not just experimentation.
The security model is worth understanding before you wire this into your agentic workflows: the server acts as a controlled interface that enforces your existing Terraform auth and authorization boundaries, and AI assistants receive only the metadata needed to respond — not credentials or sensitive variable values. For teams worried about agent overreach into infra state, that’s the right architecture. The more interesting unlock is workspace querying — asking “which workspaces haven’t been updated in 90 days” or surfacing workspaces managing over 1,000 resources without touching the UI at all.
📢 Linux Foundation launches the Tokenomics Foundation
The Linux Foundation announced the intent to launch the Tokenomics Foundation, focused on open industry standards, benchmarks, and best practices for the economics of AI infrastructure, operating in close partnership with the FinOps Foundation — extending cloud cost discipline into the token era. Founding support comes from Google Cloud, Oracle, Microsoft, and Salesforce, alongside enterprise buyers JPMorganChase and Booking.com, which is the signal worth reading: this isn’t just vendor positioning, it’s enterprise buyers demanding neutral standards for a cost line they can no longer manage by instinct.
The infrastructure framing from Cast AI’s Laurent Gil is the one EKS teams should internalize: the token bill starts long before the model provider invoices you — it starts in your Kubernetes clusters, your GPU fleet, and your autoscaling decisions. Three layers — production (GPU infrastructure manufacturing tokens), consumption (model routing, caching, prompt architecture), and value (spend mapped to outcomes) — and most platform teams currently have visibility into none of them in a structured way. The Tokenomics Foundation’s technical committee output will be the place to watch for the tooling and benchmarking standards that eventually make this measurable.
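For the production layer, the basic unit economics fit in a few lines. This is a minimal sketch with hypothetical inputs: the hourly price, throughput, and utilization are placeholders, and a real calculation would also account for networking, storage, and idle capacity held in reserve.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float,
                            utilization: float = 1.0) -> float:
    """Infrastructure cost in USD to produce one million tokens on one GPU.

    `utilization` is the fraction of each hour the GPU spends serving traffic;
    idle time is paid for but produces no tokens.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600.0 * utilization
    return gpu_hourly_usd / tokens_per_hour * 1_000_000
```

At a placeholder price of 3.60 USD per hour and 1,000 tokens per second, the cost is about 1 USD per million tokens at full utilization and about 2 USD at 50%, which is how autoscaling decisions end up on the token bill.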
📢OpenEnv moves to multi-org governance: the common socket for open agent training
Training open AI agents has had the same fragmentation problem for years — every team hand-wires their own glue between model, harness, and environment. OpenEnv was built to be the common layer underneath: a Gymnasium-style interface (reset / step / state), standard protocols, Docker packaging, and MCP as a first-class citizen. Not a reward framework, not a trainer — just the socket everything plugs into. Joseph Spisak and Lysandre Debut launched it a year ago at PyTorch Conference; today it’s moving to multi-org governance spanning Meta-PyTorch, NVIDIA, Hugging Face, Unsloth, Modal, Prime Intellect, and the PyTorch Foundation, with vLLM and SkyRL in the supporting ecosystem.
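The reset / step / state shape is easiest to see on a toy environment. The class below is a hypothetical illustration of that Gymnasium-style contract only; it does not use OpenEnv’s actual classes, protocols, or packaging.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    observation: str
    reward: float
    done: bool


class CountdownEnv:
    """Toy environment: the episode ends once the counter reaches zero."""

    def __init__(self, start: int = 3) -> None:
        self.start = start
        self.remaining = start

    def reset(self) -> str:
        """Begin a new episode and return the first observation."""
        self.remaining = self.start
        return f"remaining={self.remaining}"

    def step(self, action: str) -> StepResult:
        """Apply one action; only "tick" changes the counter."""
        if action == "tick" and self.remaining > 0:
            self.remaining -= 1
        done = self.remaining == 0
        return StepResult(f"remaining={self.remaining}", 1.0 if done else 0.0, done)

    def state(self) -> dict:
        """Expose episode state for inspection without advancing it."""
        return {"remaining": self.remaining}
```

A harness or trainer only needs these three calls, which is what lets the model, harness, and environment be swapped independently.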
The governance shift is the part worth watching for EKS inference and training teams. Frontier labs train model and harness as one tightly coupled system — the open ecosystem can’t work that way, and OpenEnv is the bet that shared infrastructure underneath is what makes “any model, any harness, any inference engine” actually viable at scale. With vLLM and Modal in the governance structure, the path from OpenEnv environments to containerized training jobs on Kubernetes gets shorter. If your team is building RL-base
Community & Career 🤝
🤝kiac: VM-per-node local Kubernetes on Apple Silicon
Local Kubernetes has always cheated on the one property that defines a real cluster: separate machines. Saiyam Pathak’s kiac (Kubernetes in Apple Containers) fixes this by running each node as its own lightweight VM on Apple’s Virtualization framework, using the apple/container 1.0 runtime. A kiac create cluster --name dev --workers 2 gives you a kubeadm-initialized control plane and two workers in ~80 seconds — each with real cgroups, so kubectl top nodes works without shimming. MetalLB ships by default, so type: LoadBalancer services get a real EXTERNAL-IP — no , no tunnels.
The reason this matters for EKS practitioners: node-pressure testing, failure drills, and drain simulations on a laptop have always been theater because the “nodes” shared a kernel. A worker can actually panic here and the rest of the cluster doesn’t feel it. If you build or test EKS node lifecycle logic locally before pushing to a real cluster, this is the closest a laptop has come to honest.
🤝kagent now runs on Agent Substrate — sandboxed AI agents as a Kubernetes primitive
The Agent Substrate series covered earlier this edition just got a significant vote of confidence. Christian Posta (Field CTO, Solo.io) announced that kagent — the declarative Kubernetes-native agent framework with its own Agent CRD — has integrated Agent Substrate as its execution underlay. The problem statement Posta names is precise: Kubernetes wasn’t designed for the agent lifecycle. Agents are long-lived but bursty, sitting idle most of the time — wasteful to keep running, but too slow to cold-start on demand (multiple seconds). Agent Substrate’s snapshot-and-resume model collapses that gap, hydrating a suspended agent into a warm Worker Pod in milliseconds rather than seconds.
With the integration, kagent agents now run inside gVisor sandboxes through Substrate’s Actor/Worker model, with declarative lifecycle management through the existing Agent CRD. Support covers OpenClaw and Hermes today, with LangGraph, CrewAI, and Google ADK coming. The significance is the stack it assembles: kagent handles agent declaration and orchestration, Substrate handles isolation and resume efficiency, and gVisor, a user-space application kernel, provides the sandbox boundary. That’s the missing layer between “run a Pod” and “run a trustworthy, cost-efficient AI agent”, and it’s now one kubectl apply away.
🤝Tagent v0.4.0: AI SRE platform for Kubernetes, built solo in 30 days across 4 versions
Yaswanth Arumulla shipped Tagent v0.1.0 a month ago as a minimal proof-of-concept — cluster scanning, basic incident detection, local AI model, no cloud dependencies. 1,500+ repo visitors and real user feedback later, he’s on v0.4.0 with a substantially different feature surface: a knowledge base that remembers past incidents and suggests fixes based on what worked before, risk scoring for predicting which service fails next, predictive detection before incidents materialize, a plugin system for custom detection rules, and multi-cluster support. The escalation chain is the operationally sharp addition — Slack first, then email, then phone call, triggered automatically at 3 AM so you don’t have to. Everything still runs on-cluster with local models via Ollama; no data leaves. One helm install to try it. The ITBench-AA result earlier in this edition put every frontier model below 50% on real K8s incident response — Tagent is one solo builder’s answer to the same problem, built in the open.
🤝AWS Summit Hong Kong: Production AI Agents on EKS workshop — June 17
Shawn Zhang and Mariana Chow are running a 300-level hands-on workshop at AWS Summit Hong Kong covering exactly the stack decisions that matter for production agent deployments on EKS. It runs two parallel tracks, self-managed (vLLM, Langfuse, Milvus, MCP) versus AWS-integrated (Bedrock + AgentCore), and covers Strands SDK, the A2A protocol for agent-to-agent communication, and a clear build-vs-buy framework as the takeaway. If you’re in Hong Kong on June 17, this is the most concrete production EKS agents content at the Summit.
Highlights ✨
✨Apple container: Container Machines — persistent Linux environments on Apple Silicon
Container machines are the feature in Apple’s container runtime that the kiac story above quietly depends on — and they’re worth understanding on their own. Unlike a regular container, a container machine runs the image’s full init system, mounts your Mac $HOME directly into the Linux environment, and persists across reboots. systemctl start postgresql works. Your Mac editor writes to the same files your Linux build toolchain reads — no copy step. The model is one persistent Linux environment per target distro, all sharing the same home directory and dotfiles from your Mac. For anyone building or testing EKS tooling locally before pushing to a real cluster, this is the developer environment primitive that’s been missing from Apple Silicon since the Docker VM era.
✨Amazon SageMaker HyperPod gets troubleshooting skills for AI coding assistants
Diagnosing GPU failures, NCCL communication problems, software version drift, and distributed training bottlenecks on HyperPod clusters has always meant deep manual log analysis across multiple tools. AWS has now embedded HyperPod operational knowledge directly into AI coding assistants, letting teams troubleshoot clusters in natural language and get guided diagnostics based on AWS best practices — without needing to know which logs to pull or which thresholds to check first. For teams running large-scale distributed training on HyperPod, this is the gap between “something broke” and “here’s what broke and why” getting meaningfully shorter.
✨Baseten: 2x delivery throughput and 50% faster TTFT on EKS with TensorRT-LLM
Baseten’s AWS case study is worth reading for the specific stack decisions rather than the headline numbers. TensorRT-LLM on NVIDIA GPUs via EC2 gets the 50% TTFT reduction — that’s the kernel-level optimization story. The infrastructure layer is EKS with Karpenter handling autoscaling, which means GPU node provisioning is demand-driven rather than pre-warmed at idle cost. For teams evaluating inference platform architecture on EKS, this is a concrete production reference: TensorRT-LLM for model optimization, Karpenter for node lifecycle, EKS as the orchestration layer — each doing the job it’s actually good at.
✨Gateway API module now live on the EKS Workshop
The EKS Workshop just added a hands-on Gateway API module covering the full progression from basic service exposure to production traffic management. It walks ALB integration via GatewayClass/Gateway/HTTPRoute, cross-namespace path-based routing with health checks, and canary deployments with weighted traffic splitting — the 90/10 → 50/50 → 0/100 progression that’s the practical shape of most real rollout strategies. If your team is still on Ingress and hasn’t made the move to Gateway API, this is the lowest-friction on-ramp available.
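The weighted split maps onto the `backendRefs` list of an HTTPRoute rule, where each backend carries a `weight`. The sketch below generates that list for each rollout step; the service names and port are hypothetical, and the step sequence simply mirrors the 90/10, 50/50, 0/100 progression described above.

```python
from typing import Dict, List, Union

BackendRef = Dict[str, Union[str, int]]

# Canary share at each rollout step: 10%, then 50%, then 100%.
ROLLOUT_STEPS = [10, 50, 100]


def canary_backend_refs(stable: str, canary: str, canary_weight: int,
                        port: int = 80) -> List[BackendRef]:
    """Build the backendRefs list for one HTTPRoute rule.

    Weights are relative; using values that sum to 100 keeps them readable
    as percentages.
    """
    if not 0 <= canary_weight <= 100:
        raise ValueError("canary_weight must be between 0 and 100")
    return [
        {"name": stable, "port": port, "weight": 100 - canary_weight},
        {"name": canary, "port": port, "weight": canary_weight},
    ]
```

Each step then becomes a one-field change to the route, which is what makes the progression easy to automate or roll back.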
✨Data on EKS and AI on EKS — the two AWS reference repos worth bookmarking
If you’re building production data or AI workloads on EKS and haven’t starred these yet, fix that. Data on EKS covers blueprints for Spark, Flink, Kafka, Ray, Trino, ClickHouse, Airflow, and a dozen more — with Day 2 operations and scaling patterns, not just getting-started configs. AI on EKS covers distributed training, inference serving, and GPU orchestration patterns. Both repos are actively maintained by the AWS community and take PRs — if your stack isn’t covered, open an issue.
✨Dapr 1.18 adds Verifiable Execution
Dapr 1.18’s Verifiable Execution signs workflow execution history and propagates cryptographic lineage across service boundaries, so a downstream service can reject a request that didn’t pass through the right workflow, regardless of who signed it. The team’s framing is sharp: SPIFFE answers “who are you,” Verifiable Execution answers “how did you get here.” For EKS teams running agentic workloads where AI agents delegate work across microservices, your current trust model likely handles the first question and completely ignores the second. That gap is exactly what this closes.
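The “how did you get here” idea can be illustrated with a chained signature over the steps a request passed through. This is a conceptual sketch only, not Dapr’s mechanism or API: it uses a shared-key HMAC from the Python standard library as a stand-in for whatever signing scheme Dapr uses, and the step names are hypothetical.

```python
import hashlib
import hmac
from typing import Sequence


def sign_hop(key: bytes, prev_sig: bytes, step: str) -> bytes:
    """Sign one step, binding it to the signature of everything before it."""
    return hmac.new(key, prev_sig + step.encode("utf-8"), hashlib.sha256).digest()


def build_lineage(key: bytes, steps: Sequence[str]) -> bytes:
    """Fold the ordered steps into a single chained signature."""
    sig = b""
    for step in steps:
        sig = sign_hop(key, sig, step)
    return sig


def verify_lineage(key: bytes, claimed_steps: Sequence[str], sig: bytes,
                   required_steps: Sequence[str]) -> bool:
    """Accept only if the claimed path is the required one and the signature matches."""
    if list(claimed_steps) != list(required_steps):
        return False
    return hmac.compare_digest(build_lineage(key, claimed_steps), sig)
```

A downstream service that requires the path intake, approve, pay would reject a request whose lineage skips approve, even when the caller’s identity is valid.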
🎉 Sponsor Section
At the moment, we don’t have a sponsor for this edition, but we look forward to working with companies and organizations that support the EKS & AI Infrastructure community in future editions. If you or your company is interested in sponsoring, please contact us at 📧 thecloudtechforall@gmail.com
📝 Words from the Author
My grandfather used to say that the most dangerous person in any room is the one who stopped being curious. Not the loudest, not the most confident — the one who decided they already know.
There’s a version of expertise that closes doors. You learn enough to pattern-match, and pattern-matching is fast, so you stop looking closely. The world cooperates for a while. Then it doesn’t.
The antidote isn’t knowing more. It’s staying genuinely unsettled by the gap between how things are and how they could be. That discomfort — the one most people try to resolve as quickly as possible — is actually the thing worth protecting.
Curiosity isn’t a trait. It’s a practice. And like most practices, it atrophies if you don’t show up for it.
On the community front, a few things coming up worth knowing about:
This week my brother Ishan is speaking at AWS User Group Vadodara on OpenClaw in production. If you’ve been following the agentic AI infrastructure space, this one is worth attending: he’s been building in this space for a while, and the production angle is where most talks don’t go. Next week we’re hosting CB Connect in Vadodara, a Community Builder gathering that I’m excited about for the conversations it tends to spark more than anything on the agenda. If you’re in the area, come say hello.
I’ll be at PlatformCon with two sessions — details in the next edition once the schedule firms up.
And in three weeks I’ll be on stage at AWS Community Day Bengaluru on July 11, talking about EKS Capabilities — Managed Argo CD, ACK, and the broader question of how AWS is absorbing the undifferentiated platform ops that teams have been hand-rolling for years. If you’re in Bengaluru, I’d love to see you there: acd.awsugblr.in
Stay curious. The rest follows.
Happy building. 😎



