👋 Everything about EKS & AI Infrastructure Newsletter "#69" ☁️❤👨💻
Fleet-scale Kubernetes, LLM cold starts, KV cache hierarchies, and the infrastructure layer catching up to agentic AI

Dear EKS & AI Infrastructure enthusiasts, welcome to Everything about EKS & AI Infrastructure #69.
The past two weeks felt like the industry decided to ship everything at once. Model architecture papers, infrastructure announcements, community deep-dives, and a position paper that might be the most important Kubernetes read of the year all landed in the same window. Edition #69 reflects that — it’s one of the denser editions I’ve put together, and I think it earns the length.
The thread running through most of it is the same one that’s been building for months: the AI infrastructure story has moved past “schedule the GPU” and into harder problems — data movement at cold start, KV cache architecture for agentic workloads, topology-aware networking, and what it actually means to operate Kubernetes at fleet scale. The pieces in this edition don’t just describe those problems. Most of them propose concrete answers.
As always, if something here is useful, share it with someone building the same thing.
Performance Engineering in Modern AI Systems 🌩️
🌩️NetEase Cut LLM Cold Starts from 42 Minutes to 30 Seconds with Fluid on Kubernetes
The standard assumption in serverless GPU inference is that the bottleneck is compute provisioning. For 70B+ parameter models, that’s wrong — it’s data movement. Without a caching layer, loading a model means pulling hundreds of gigabytes from remote storage on every cold start, which makes autoscaling economically incoherent: you’re paying for GPU spin-up and then sitting idle waiting for weights to arrive.
NetEase’s Tmax AI platform used CNCF’s Fluid (currently incubating) to solve this at the architecture level. The numbers across approaches: raw cross-region access at 42 minutes → traditional cache at 14 minutes → Fluid with prefetch at 3 minutes → tuned Fluid under 30 seconds. The key architectural pattern is treating the model as a first-class Kubernetes resource with its own scheduling concerns: Fluid ties prefetch workflows to scheduled events, enables cross-namespace dataset sharing so multiple teams don’t re-cache the same foundation model, and places pods near already-cached content via data-aware scheduling. This is directly portable to EKS teams running SageMaker-adjacent workloads or self-managed vLLM on GPU node groups.
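As a quick sanity check on that progression, the sketch below computes each stage's speed-up relative to raw cross-region access. The durations are the ones quoted above; treating "under 30 seconds" as exactly 30 seconds is an assumption, so the last ratio is a lower bound.

```python
# Cold-start durations in seconds, taken from the figures quoted above.
# "under 30 seconds" is treated as exactly 30 s (an assumption), so the
# final speed-up is a lower bound.
cold_start_seconds = {
    "raw cross-region access": 42 * 60,
    "traditional cache": 14 * 60,
    "Fluid with prefetch": 3 * 60,
    "tuned Fluid": 30,
}


def speedups(durations: dict, baseline: str) -> dict:
    """Return how many times faster each stage is than the baseline stage."""
    base = durations[baseline]
    return {name: base / secs for name, secs in durations.items()}


if __name__ == "__main__":
    for name, factor in speedups(cold_start_seconds, "raw cross-region access").items():
        print(f"{name}: {factor:.0f}x vs. raw cross-region access")
```

The ratios work out to 3x, 14x, and at least 84x, which is why the tuned configuration changes the economics of scale-to-zero rather than just trimming it.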
🌩️Nemotron-Labs-Diffusion: Autoregressive, Diffusion, and Self-Speculation in a Single Set of Weights
The standard tradeoff in LLM serving is between throughput (diffusion and speculative decoding win at high concurrency) and latency (autoregressive wins at low concurrency with a single user). Most teams pick one and live with the costs at the other end. Nemotron-Labs-Diffusion, from Yonggan Fu and the NVIDIA research team, sidesteps this by encoding all three modes — AR, diffusion, and self-speculation — into a single weight set, switching between them purely via attention mask changes with no separate draft model required.
The self-speculation mode is the sharp detail: diffusion generates draft tokens, AR verifies them, decoding roughly 6× more tokens per forward pass than standard AR. On real hardware in SGLang, the 8B model hits 1,015 tok/s at concurrency 1 on GB200 — 4× faster than AR baseline — and 2.7× on DGX Spark. Against Eagle3 (the current speculative decoding benchmark in SGLang), it shows 2.3× speed-up with meaningfully better system-vs-per-user throughput trade-offs. For EKS inference deployments where you’re optimizing the same GPU pool across both interactive (low concurrency) and batch (high concurrency) workloads, a single model that adapts its decoding mode to the traffic shape changes the capacity planning math significantly.
🌩️PyTorch 2.11.0 Fixes aarch64 GPU Wheel Installation for GH200 and Grace Blackwell
The specific failure mode this fixes is insidious: on GB200/GB300/GH200 systems, pip install torch would silently resolve to a CPU-only wheel because CUDA-enabled aarch64 builds weren’t on the default PyPI index. No error. No warning. A GPU system running CPU inference, discovered only when throughput numbers looked wrong. vLLM shipped two workarounds in-tree (use_existing_torch.py and a tool.uv build-isolation passthrough) specifically because this kept catching new users — particularly at hackathons and first-time cluster setups on Grace Hopper hardware.
PyTorch 2.11.0, driven by NVIDIA and PyTorch core contributors (Alban Desmaison, Nikita Shulga, Andrey Talman, Piotr Bialecki), now publishes CUDA-enabled aarch64 wheels to the default PyPI index. The fix went from a 2024 hackathon bug report to a PyTorch Foundation TAC discussion to a merged wheel publishing change — the full story written up by Kaichao You (vLLM lead maintainer, co-founder Inferact). If you’re building EKS node images or Dockerfiles that install vLLM on Arm-based GPU instances, you can drop the --index-url workarounds and trust the standard install path from 2.11.0 onward.
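Because the failure mode was silent, it may still be worth asserting the build type in an image smoke test. The following is a minimal sketch, not an official check: `describe_torch_build` and `check_installed_torch` are hypothetical helper names, while `torch.__version__`, `torch.version.cuda`, and `torch.cuda.is_available()` are the values PyTorch itself exposes.

```python
from typing import Optional


def describe_torch_build(version: str, cuda_version: Optional[str], cuda_available: bool) -> str:
    """Classify a PyTorch install from the values torch reports about itself."""
    if cuda_version is None:
        # torch.version.cuda is None when no CUDA support was compiled in.
        # (ROCm builds also report None here; this sketch ignores them.)
        return f"torch {version}: no CUDA support compiled in"
    if not cuda_available:
        return f"torch {version}: built for CUDA {cuda_version}, but no usable GPU or driver found"
    return f"torch {version}: CUDA {cuda_version} build with a usable GPU"


def check_installed_torch() -> str:
    """Inspect the torch package in the current environment, if there is one."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"
    return describe_torch_build(torch.__version__, torch.version.cuda, torch.cuda.is_available())


if __name__ == "__main__":
    print(check_installed_torch())
```

Running this as the last step of a Dockerfile or node image build turns the silent CPU fallback into a visible line in the build log.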
🌩️Why Prefix Caching Breaks for Agents — and the KV Memory Hierarchy That Fixes It — William Chen
Prefix caching works on one assumption: the new request starts with the same token sequence as a prior one. Agentic workloads violate this constantly — tool results inject variable text, retrieval reorders chunks, browser output shifts token positions, and the same document reappears at a different offset. The content is repeated. The prefix isn’t. So turn 50 pays full prefill cost for content turn 3 already processed, TTFT climbs, and the GPU metrics look healthy the entire time.
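To make the "content is repeated, prefix isn't" point concrete, here is a toy sketch of prefix matching over chained block hashes. It assumes a simplified scheme in which each KV block's identity depends on its tokens and on the block before it; the block size, token values, and function names are all illustrative and not taken from any particular serving engine.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per KV block; deliberately tiny for illustration


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Hash each full block chained with the hash of the block before it."""
    hashes, parent = [], ""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        payload = parent + ":" + ",".join(str(t) for t in block)
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


def matched_prefix_blocks(cached, incoming):
    """Count leading blocks whose chained hashes agree."""
    matched = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        matched += 1
    return matched


system_prompt = [1, 2, 3, 4]
document = list(range(100, 116))              # 16 tokens of repeated content
turn_3 = system_prompt + document
turn_50 = system_prompt + [9, 9] + document   # a 2-token tool result shifts the document

if __name__ == "__main__":
    print(matched_prefix_blocks(block_hashes(turn_3), block_hashes(turn_3)))
    print(matched_prefix_blocks(block_hashes(turn_3), block_hashes(turn_50)))
```

With these inputs the identical request matches all 5 blocks, while the shifted one matches only the system-prompt block even though the 16-token document is unchanged.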
William Chen’s breakdown maps the full KV memory hierarchy worth building toward: local GPU KV (L0, fastest path when the same worker holds the prefix), LMCache MP turning host DRAM into shared L1 across processes, Mooncake making KV visible across workers and nodes so a local miss doesn’t become full recompute, and CacheBlend handling the non-prefix case — same content, different position. The routing layer matters too: a KV-aware router asks which worker already holds the context this request needs, not just which worker is free. The metrics to instrument are specific — matched KV blocks, recomputed tokens, cache-overlap score, L0/L1/L2 hit rates, transfer latency, and TTFT cached vs uncached at p95/p99 by turn index. Cache hit rate alone doesn’t tell you whether the hit arrived before recompute would have finished.
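The routing idea can be sketched in a few lines. This is an illustrative scoring function under stated assumptions, not the router from the article or from any named project: `Worker`, `pick_worker`, and the `load_penalty` weight are hypothetical, and a real router would also weigh transfer latency from L1/L2 tiers.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    cached_blocks: set = field(default_factory=set)  # block IDs held in local GPU KV
    active_requests: int = 0


def prefix_hits(worker: Worker, request_blocks: list) -> int:
    """Number of leading request blocks this worker already holds."""
    hits = 0
    for block in request_blocks:
        if block not in worker.cached_blocks:
            break
        hits += 1
    return hits


def pick_worker(workers: list, request_blocks: list, load_penalty: float = 0.5) -> Worker:
    """Prefer cached context, but let heavy load outweigh a cache hit."""
    return max(
        workers,
        key=lambda w: prefix_hits(w, request_blocks) - load_penalty * w.active_requests,
    )


if __name__ == "__main__":
    warm = Worker("warm", {"b0", "b1", "b2"}, active_requests=3)
    cold = Worker("cold", set(), active_requests=0)
    print(pick_worker([warm, cold], ["b0", "b1", "b2", "b3"]).name)
```

The trade-off lives in `load_penalty`: set it too low and hot workers queue up, set it too high and the router degrades into plain least-loaded balancing.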
Starred Content ⭐
⭐Kargo on EKS — A Deep Dive into GitOps Continuous Promotion — Shawn Zhang
Argo CD handles Git-to-cluster sync well. What it doesn’t handle is artifact promotion — the logic that decides when and how a new image moves from test to UAT to prod. Most teams fill that gap with CI scripts that open PRs to update image tags, which means your promotion pipeline lives outside GitOps entirely and is invisible to Argo CD’s audit trail.
Kargo sits between your Warehouse (image/Helm/Git sources) and your Stages (environments), treating promotion as a first-class Kubernetes resource with its own CRDs, RBAC, and verification gates. Shawn’s write-up includes a working PoC on EKS with a clear responsibility split: Argo CD owns reconciliation, Kargo owns progression. If your team is running Argo CD and your release process still involves a shell script committing an image tag to a branch, this is the architectural layer worth understanding.
⭐Zeta Processes 208 Million Credit Accounts in 40 Minutes on Amazon EKS
End-of-day processing in banking — billing, interest calculations, compliance updates across millions of accounts every night — is one of the unglamorous bottlenecks that still runs on mainframes at most institutions. Zeta’s engineering team built a cloud-native architecture on EKS combining Apache Flink for stream and batch processing, Kafka for high-throughput data feeds, S3 Express One Zone for low-latency large file transfers, and Amazon RDS — scaling to 23,000+ EC2 Spot Instances, launching over 4,200 EKS pods and 126 RDS instances in minutes. The result: 208 million credit card accounts processed in 40 minutes, 10,000 transaction authorizations per second, and 40 million statements produced in 5 minutes.
The cost story is the sharp detail worth sitting with. The architecture eliminates the need for a $90 million upfront on-premises infrastructure investment and reduces operating costs to under $500 per day. That delta is only possible because the EOD burst is ephemeral: Spot Instances at scale for the processing window, then back to zero. The S3 Express One Zone choice came directly from an AWS re:Invent discovery. Zeta’s team recognized its potential for handling large file transfers with single-digit millisecond latency, which is what makes Flink’s state management at this account volume tractable. If your team is designing any large-scale batch workload on EKS, the Flink + Kafka + S3 Express One Zone combination on Spot is the architecture to benchmark against.
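For a sense of the sustained rates behind those headline numbers, the arithmetic below converts them to per-second throughput. It uses only the figures quoted above and assumes the work is spread evenly across each window, which a real burst almost certainly is not.

```python
def per_second(count: int, minutes: float) -> float:
    """Average items per second if the work is spread evenly over the window."""
    return count / (minutes * 60)


# Figures quoted above; even distribution over the window is an assumption.
accounts_per_second = per_second(208_000_000, 40)
statements_per_second = per_second(40_000_000, 5)

if __name__ == "__main__":
    print(f"accounts:   {accounts_per_second:,.0f} per second")
    print(f"statements: {statements_per_second:,.0f} per second")
```

That is roughly 86,667 accounts and 133,333 statements per second on average, which gives a concrete target when benchmarking a comparable Flink pipeline.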
⭐Fleet-Scale Kubernetes: An Operating Model for Homogeneous Clusters with Decoupled Capacity — Lucy Sweet
The core diagnosis is that capacity is fragmented across per-cluster islands. Datadog’s data shows average CPU utilisation of ~18% across enterprise Kubernetes fleets, with overprovisioning factors of 2–5×. The root cause is that every existing autoscaler embeds a full scheduling simulation to decide what to provision: a parallel model that diverges from kube-scheduler, creates topology bugs, and at scale produces 600+ second decision times for ~2,000 pending pods. The paper’s answer is not a bigger cluster, a smarter federation layer, or a better autoscaler. It’s a different model: homogeneous clusters sized only to accommodate your single largest workload, with capacity decoupled from cluster identity and managed through a standard contract.
The contract is three CRDs and a protobuf: CapacityRequest (a pod declares what it needs), UpcomingNode (the autoscaler declares a node is on the way), and AvailableCapacity (an eventually-consistent hint). The operator rolls up thousands of CRs into ~15 aggregated entries per cluster and sends them — the autoscaler provisions against the diff without needing cluster inventory because it provisioned every node and already knows what’s there. The scheduling layer is untouched; topology spread, DRA, gang scheduling, and Kueue admission all work unchanged. The autoscaler just makes nodes appear. The scaling analysis goes to 100 million nodes across 20,000 clusters — at that scale the roll-up message is still ~40MB per cycle, the autoscaler needs sharding but the protocol doesn’t change, and the only real pressure point is etcd object count if CRs are created for all pods rather than only unschedulable ones. Read the whole thing — the sections on what you think you need (but don’t), the NLB SNAT trap equivalent for fleet management, and the Same topology operator are all worth the time.
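The roll-up and diff steps can be sketched as follows. This is a toy model under loud assumptions: the `Shape` fields, the function names, and the use of a simple counter are illustrative only and do not reflect the paper's actual CRD schema or protobuf.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Shape:
    """Hypothetical resource shape a pending pod asks for."""
    cpu: int
    memory_gib: int
    accelerator: str = ""


def roll_up(pending_shapes) -> Counter:
    """Collapse per-pod requests into one (shape, count) entry per distinct shape."""
    return Counter(pending_shapes)


def to_provision(requested: Counter, upcoming: Counter) -> Counter:
    """What is still missing; Counter subtraction drops non-positive counts."""
    return requested - upcoming


small, large, gpu = Shape(2, 8), Shape(16, 64), Shape(8, 64, "gpu")
pending = [small] * 1000 + [large] * 200 + [gpu] * 50

requested = roll_up(pending)                  # 1,250 requests become 3 entries
upcoming = Counter({small: 900, gpu: 50})     # capacity already on the way
remaining = to_provision(requested, upcoming)

if __name__ == "__main__":
    for shape, count in remaining.items():
        print(shape, count)
```

The point of the sketch is the size reduction: the autoscaler sees a handful of aggregated entries and a diff, not the per-pod objects.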
⭐Recent Developments in LLM Architectures: KV Sharing, mHC, and Compressed Attention — Sebastian Raschka
The pattern across April–May open-weight releases is consistent: Gemma 4 reduces KV cache memory via cross-layer KV sharing and adds capacity through per-layer embeddings; Laguna XS.2 varies query-head count per layer (more heads for sliding-window layers, fewer for expensive global-attention layers); ZAYA1-8B moves attention computation directly into a compressed latent space; DeepSeek V4 adds constrained parallel residual streams via mHC and compresses along the sequence dimension rather than the per-token dimension with CSA/HCA. Each is attacking the same bottleneck — long-context KV cache size — from a different angle, and the choices have direct implications for how much GPU memory your inference deployment actually needs.
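To see why the number of layers holding distinct KV matters for memory, here is the textbook sizing formula for standard multi-head or grouped-query attention. The configuration values are hypothetical and chosen for round numbers; they do not describe Gemma 4 or any other model named above, and latent or sequence-compressed schemes need a different formula.

```python
GIB = 2**30


def kv_cache_bytes(kv_layers: int, kv_heads: int, head_dim: int, seq_len: int,
                   bytes_per_value: int = 2) -> int:
    """KV cache size for one sequence; the factor 2 covers the K and V tensors."""
    return 2 * kv_layers * kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model: 32 layers, 8 KV heads of dimension 128, 128K context, 16-bit values.
baseline_gib = kv_cache_bytes(32, 8, 128, 131072) / GIB

# Same model if cross-layer sharing means only 16 layers store their own KV (assumption).
shared_gib = kv_cache_bytes(16, 8, 128, 131072) / GIB

if __name__ == "__main__":
    print(f"baseline: {baseline_gib:.1f} GiB per sequence")
    print(f"shared:   {shared_gib:.1f} GiB per sequence")
```

In this made-up configuration the cache drops from 16 GiB to 8 GiB per 128K-token sequence, which is the kind of change that moves a deployment between instance sizes.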
The DeepSeek V4 numbers are the sharpest signal: at 1M-token context, DeepSeek V4-Pro uses only 27% of single-token inference FLOPs and 10% of the KV cache size compared to DeepSeek V3.2, which already used MLA and sparse attention. DeepSeek V4-Flash is even more aggressive at 10% FLOPs and 7% KV cache size. For EKS inference teams sizing node memory for long-context deployments, these aren’t incremental improvements — they change what instance type you need. Raschka’s architecture diagrams and from-scratch code implementations make this one of the more practical deep dives available on what’s actually changing inside the transformer block right now.
Announcements 📢
📢etcd 3.7.0-beta.0 ships RangeStream and permanently drops v2store
etcd has quietly been a source of unpredictable latency in large-cluster reconciliation loops — not because of CPU, but because large range queries buffered their full result set before returning anything. RangeStream (contributed by Jeffrey Ying at Google) fixes this by streaming results in chunks over gRPC, which removes the spike that was hitting big-cluster operators during control plane reconciliation at scale.
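The difference between buffering and streaming a range can be illustrated with a toy in-memory store. This is a conceptual sketch only: the function names and chunk size are hypothetical, and nothing here is the etcd API or its gRPC wire format.

```python
from typing import Dict, Iterator, List, Tuple

KV = Tuple[str, str]


def range_buffered(store: Dict[str, str], prefix: str) -> List[KV]:
    """Build the entire result set before the caller sees anything."""
    return [(k, v) for k, v in sorted(store.items()) if k.startswith(prefix)]


def range_streamed(store: Dict[str, str], prefix: str, chunk_size: int = 2) -> Iterator[List[KV]]:
    """Yield results in fixed-size chunks so the caller can start work early."""
    chunk: List[KV] = []
    # A real server would walk an on-disk index instead of sorting in memory.
    for key, value in sorted(store.items()):
        if not key.startswith(prefix):
            continue
        chunk.append((key, value))
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


store = {f"/registry/pods/p{i}": f"spec-{i}" for i in range(5)}
store["/registry/nodes/n0"] = "node"

if __name__ == "__main__":
    for part in range_streamed(store, "/registry/pods/"):
        print(len(part), "items")
```

Both functions return the same keys in the same order; what changes is that the streamed version bounds how much is held in memory at once and lets the first chunk arrive before the last one is read.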
The v2 removal is a hard break: v2 discovery, bootstrap, API requests, and the v2 client library are all gone in 3.7. If you’re running anything pre-v3.6.11 or have any v2-dependent config left over from older cluster bootstraps, test against this beta before GA lands (expected late June or early July). v3.4 already hit EOL on May 15.
📢LiteLLM Agent Platform — Open Source Per-Session Kubernetes Sandboxes for Coding Agents
Running Claude Code, Codex, or any coding agent harness in production means solving two problems at once: isolation (agents shouldn’t share state or blast radius across sessions) and credential safety (agents need real API keys to do work, but handing them raw keys in environment variables is a supply chain risk). LiteLLM’s Agent Platform, open-sourced by Ishaan Jaffer (CTO, LiteLLM), addresses both: each agent session gets a fresh Kubernetes pod destroyed at session end, and the agent environment receives stub credentials (GITHUB_TOKEN=stub_github_a8f1) that the egress vault swaps for real keys on outbound TLS. The agent never sees the actual key.
The architecture worth paying attention to is the credential swap at egress rather than at pod creation — it means the compromise surface for a misbehaving agent is the stub, not the real credential, even if the pod’s environment is fully readable. Multi-harness support (Claude Code, Codex, opencode) and WebSocket-attached terminal access (claude-code-cli) are included. Deployment targets are kind (local), EKS, and Render — the EKS path makes this directly forkable for platform teams already running LiteLLM as their LLM gateway and wanting to extend it into agent execution infrastructure.
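The swap-at-egress idea reduces to a small rewrite step on outbound requests. The sketch below is an assumption-heavy illustration, not LiteLLM's implementation: `swap_stub_credentials` and the in-memory `vault` dict are hypothetical, and a real egress proxy also has to handle TLS, destination allow-listing, and secret storage.

```python
def swap_stub_credentials(headers: dict, vault: dict) -> dict:
    """Return a copy of the headers with any known stub replaced by its real secret."""
    swapped = {}
    for name, value in headers.items():
        for stub, real in vault.items():
            value = value.replace(stub, real)
        swapped[name] = value
    return swapped


# The agent's pod only ever holds the stub; the mapping lives outside the pod.
vault = {"stub_github_a8f1": "REAL-TOKEN-PLACEHOLDER"}
agent_headers = {"Authorization": "Bearer stub_github_a8f1", "Accept": "application/json"}

outbound_headers = swap_stub_credentials(agent_headers, vault)

if __name__ == "__main__":
    print(outbound_headers["Authorization"])
```

Because the original dict is left untouched, anything that dumps the agent's environment or request objects inside the pod still sees only the stub.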
📢DRANET Now Supported on EKS, AKS, and GKE — DRA-Based Accelerator Networking Goes Cross-Cloud
The old model for GPU-NIC topology alignment on Kubernetes was fragile by design: device plugins couldn’t express PCIe locality, workloads needed privileged containers to manage RDMA interfaces, and pairing a GPU with its topologically-closest NIC required custom scheduler extensions or manual configuration. DRANET replaces that with a DRA-native network driver that discovers RDMA-capable devices, publishes them as ResourceSlices with topology attributes (NUMA node, PCI address, RDMA device), and injects allocated interfaces into pods via NRI — fully compatible with existing CNI plugins and without privileged containers. The research paper accompanying the project reports up to 59.6% bandwidth improvement for all_gather and 58.1% for all_reduce in distributed AI/ML workloads through topology-aware GPU-NIC co-scheduling.
On EKS specifically: the EFA DRA driver — built on the upstream DRANET project — is now the recommended path for new EFA deployments on EKS 1.34+ with managed or self-managed node groups, enabling topology-aware allocation that pairs EFA interfaces with their PCIe-local GPUs, Trainium, or Inferentia devices, plus EFA interface sharing across pods on the same node. The one constraint worth noting: the EFA DRA driver is not yet supported with Karpenter or EKS Auto Mode — the EFA device plugin remains the path there for now. If your team runs GPU or Trainium workloads on EKS 1.34+, this is the networking layer to move to.
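Topology-aware pairing comes down to ranking candidate NICs by how close they sit to the GPU. The following is a simplified sketch with hypothetical types and attribute names (`Device`, `pci_root`, `closest_nic`); it is not the DRANET or EFA DRA driver logic, which is expressed through ResourceSlice attributes and DRA selectors rather than Python.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Device:
    name: str
    numa_node: int
    pci_root: str  # hypothetical stand-in for "shares a PCIe root with"


def closest_nic(gpu: Device, nics: List[Device]) -> Optional[Device]:
    """Prefer a NIC under the same PCIe root, then one on the same NUMA node."""
    def closeness(nic: Device):
        # Tuples compare element by element, so PCIe locality wins over NUMA locality.
        return (nic.pci_root == gpu.pci_root, nic.numa_node == gpu.numa_node)

    return max(nics, key=closeness, default=None)


gpu0 = Device("gpu0", numa_node=0, pci_root="0000:10")
nics = [
    Device("nic-far", numa_node=1, pci_root="0000:90"),
    Device("nic-same-numa", numa_node=0, pci_root="0000:20"),
    Device("nic-local", numa_node=0, pci_root="0000:10"),
]

if __name__ == "__main__":
    print(closest_nic(gpu0, nics).name)
```

A scheduler that ignores these attributes can land on `nic-far`, sending collective traffic across sockets on every step.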
Community & Career 🤝
🤝Dream Server — One Command to a Full Local AI Stack — Michael Bradley
Setting up local AI today means stitching together a dozen projects, writing Docker configs, and hoping everything talks to each other. Dream Server, built by Michael Bradley at Light Heart Labs, is the one-command alternative: a fully local stack covering LLM inference, chat, voice, agents, workflows, RAG, and image generation — no cloud, no subscriptions. The installer detects your GPU, picks the right model for your hardware, generates secure credentials, and launches everything — chat available in under 2 minutes via a bootstrap mode that starts a small model instantly while the full model downloads in the background.
The project has received AMD Featured Developer recognition and was selected as a May 2026 AMD Lemonade Developer Challenge winner. The GitHub journey Michael describes (500 stars in 5 months of quiet building, then a jump from 500 to 1,500 stars in 4 days after a single organic mention) is the one every open-source builder lives through. The work compounds invisibly until it doesn’t. Worth starring if you’re tracking the local inference and sovereign AI tooling space.
🤝n8n on EKS — AWS Cost Monitoring with Terraform, Lambda, and Slack Alerts — Ramalakshmi Mani
Running workflow automation on EKS rather than a managed SaaS layer is the right call for teams that want auditability and data residency — but most examples stop at “here’s a Helm chart.” Rama’s open-source project, extracted directly from her platform engineering work at BMW TechWorks, goes further: multi-region EKS cluster with Terraform modules, n8n on Kubernetes with persistent storage, a Lambda function hitting the Cost Explorer API, and automated Slack notifications when spend crosses a threshold.
The pattern worth noting is the Lambda-as-bridge: Cost Explorer doesn’t emit native EventBridge events on threshold breach, so the common approach is a scheduled Lambda that polls and pushes to n8n via webhook. It’s a clean seam between AWS-native cost data and a self-hosted workflow engine running on the same cluster you’re trying to monitor. If your team runs EKS and has FinOps visibility gaps, this is a forkable starting point rather than a blog post to read and forget.
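The Lambda-as-bridge pattern can be sketched as below. This is not the code from Rama's repository: the environment variable names (`COST_THRESHOLD_USD`, `N8N_WEBHOOK_URL`), the payload shape, and the helper names are assumptions, while `get_cost_and_usage` is the boto3 Cost Explorer call.

```python
import json
import urllib.request
from datetime import date, timedelta
from typing import Optional


def fetch_daily_cost(ce_client, day: date) -> float:
    """Ask Cost Explorer for one day's unblended cost in the account's currency."""
    response = ce_client.get_cost_and_usage(
        TimePeriod={"Start": day.isoformat(), "End": (day + timedelta(days=1)).isoformat()},
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
    )
    return float(response["ResultsByTime"][0]["Total"]["UnblendedCost"]["Amount"])


def build_alert(amount: float, threshold: float, day: str) -> Optional[dict]:
    """Return a webhook payload only when spend is over the threshold."""
    if amount <= threshold:
        return None
    return {"date": day, "amount_usd": round(amount, 2), "threshold_usd": threshold}


def post_to_webhook(url: str, payload: dict) -> int:
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


def handler(event, context):
    """Scheduled entry point: poll yesterday's cost, notify n8n on breach."""
    import os
    import boto3  # imported lazily so the helpers above stay testable without it

    day = date.today() - timedelta(days=1)
    cost = fetch_daily_cost(boto3.client("ce"), day)
    alert = build_alert(cost, float(os.environ["COST_THRESHOLD_USD"]), day.isoformat())
    if alert is not None:
        post_to_webhook(os.environ["N8N_WEBHOOK_URL"], alert)
    return {"alerted": alert is not None, "amount_usd": cost}
```

Keeping the threshold decision in a pure function means the alert logic can be unit-tested with a fake client, and the Slack formatting stays in the n8n workflow where it is easier to change.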
🤝EKS 1.33 → 1.34 with Terraform — Zero-Downtime Upgrade + Instance Type Migration — Naveena Ravi
Most EKS upgrade write-ups cover the version bump and stop there. Naveena’s 1.33→1.34 article does the upgrade and migrates the Karpenter instance family from i4i to r6a in the same change — driven by a product recommendation that produced a 35% reduction in EKS FinOps cost. The Terraform changes are concrete: exact addon versions pinned (coredns v1.13.2-eksbuild.4, vpc-cni v1.21.1-eksbuild.7, kube-proxy v1.34.6-eksbuild.2, plus EBS/EFS/S3 CSI drivers), AMI lookup via SSM Parameter Store for AL2023, and the full Karpenter-managed rolling node replacement via GitHub Actions.
The forward-looking note is worth catching: the 1.34 upgrade surfaced compatibility errors for the 1.35 path — specifically Amazon GuardDuty EKS Runtime Monitoring and EKS Pod Identity Agent needing upgrades before the next hop. If your cluster is on 1.33 now, this is the piece to read before you upgrade, not after. The series (1.31→1.32, 1.32→1.33 with AL2023 AMI migration, now 1.33→1.34 with instance rightsizing) is the closest thing to a peer-reviewed upgrade runbook in the community.
Highlights ✨
✨Building a Hardened EKS Cluster with Terraform — marianita_cloud
Most EKS security content covers what to enable — private API endpoints, KMS secrets encryption, IMDSv2, encrypted EBS, control plane audit logging — without showing how to verify it actually worked. marianita_cloud’s lab covers all of these controls via Terraform, then pairs each one with a specific AWS CLI command to validate it: describe-cluster queries for endpoint access and encryption config, kms get-key-rotation-status for KMS rotation state, describe-instances filtered by EKS cluster tag to check HttpTokens=required, and describe-volumes to confirm EBS encryption. Console cross-check steps are included alongside each CLI command — useful for audit evidence gathering, not just initial deployment.
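The same verify-after-apply idea can be automated against the JSON that `aws eks describe-cluster` returns. The sketch below is not from the lab: `hardening_findings` is a hypothetical helper, the sample document is hand-written with a placeholder key ARN, and only a few of the controls listed above are covered.

```python
def hardening_findings(describe_cluster_output: dict) -> list:
    """Return human-readable findings for a describe-cluster JSON document."""
    cluster = describe_cluster_output["cluster"]
    findings = []

    vpc = cluster.get("resourcesVpcConfig", {})
    if vpc.get("endpointPublicAccess", True):
        findings.append("public API endpoint is enabled")
    if not vpc.get("endpointPrivateAccess", False):
        findings.append("private API endpoint is disabled")

    # Secrets encryption shows up as an encryptionConfig entry covering "secrets".
    encrypted = any(
        "secrets" in cfg.get("resources", []) for cfg in cluster.get("encryptionConfig", [])
    )
    if not encrypted:
        findings.append("KMS encryption for Kubernetes secrets is not configured")

    # Control plane logging lists log types grouped by enabled/disabled.
    enabled_logs = {
        log_type
        for entry in cluster.get("logging", {}).get("clusterLogging", [])
        if entry.get("enabled")
        for log_type in entry.get("types", [])
    }
    if "audit" not in enabled_logs:
        findings.append("control plane audit logging is disabled")

    return findings


# Hand-written sample; the key ARN is a placeholder, not a real resource.
sample = {
    "cluster": {
        "resourcesVpcConfig": {"endpointPublicAccess": True, "endpointPrivateAccess": True},
        "encryptionConfig": [{"resources": ["secrets"], "provider": {"keyArn": "PLACEHOLDER"}}],
        "logging": {"clusterLogging": [{"types": ["api", "audit"], "enabled": True}]},
    }
}

if __name__ == "__main__":
    print(hardening_findings(sample))
```

In practice the input would come from saving the CLI's JSON output to a file and loading it with `json.load`, which makes the check easy to run in CI as audit evidence.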
The SSM node access section is worth reading specifically: the default EKS managed node group policies (AmazonEKSWorkerNodePolicy, AmazonEKS_CNI_Policy, AmazonEC2ContainerRegistryReadOnly) don’t include SSM permissions, so the SSM agent can’t register without explicitly attaching AmazonSSMManagedInstanceCore to the node role. That gap catches teams who assume SSM Session Manager works out of the box on EKS nodes. The repo is public at CloudSecMari/eks-hardened-terraform — forkable starting point for anyone building a security baseline from scratch.
✨Migrating EKS from AWS VPC CNI to Cilium with Zero Downtime — Ravindu Fernando
The design rule that shaped this entire migration: a node should belong to exactly one CNI generation. Workloads move between generations; nodes do not mutate in place. That single constraint ruled out an in-place swap and drove the blue/green node approach: old nodes labeled cni: aws-cni, new nodes labeled cni: cilium, with DaemonSet affinity rules keeping the two networking stacks isolated by node rather than by namespace. Cilium was deployed with ENI IPAM mode (firstInterfaceIndex: 1 to leave the primary node interface alone), prefix delegation enabled for pod density, native routing preserving the existing VPC subnet model, and kube-proxy replacement via eBPF, so during the overlap window kube-proxy ran only on the legacy nodes and never on the Cilium nodes.
The production detail worth reading carefully is the SNAT/masquerade section. ipv4NativeRoutingCIDR tells Cilium which CIDR is native-routed and should not be treated as external traffic — setting it too broadly (e.g., 10.0.0.0/8) can change SNAT behavior for destinations outside the cluster and blackhole traffic to peered VPCs or VPN endpoints that can’t route back to pod IPs. The other non-obvious trap: with NLB IP targets in Cilium ENI IPAM mode, preserve_client_ip.enabled=true broke their ingress setup — disabling it at the target group and enabling Proxy Protocol v2 instead kept client identity recoverable while making the return path predictable. The full annotated Helm values are in the article — worth keeping as a reference before attempting this on any active cluster.
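A pre-flight check for the first trap is easy to script with the standard library. This is a rule-of-thumb sketch, not something from the article: `native_routing_warnings` is a hypothetical helper, the CIDRs are examples, and a broader value can be legitimate when every peer really can route back to pod IPs.

```python
import ipaddress


def native_routing_warnings(native_cidr: str, vpc_cidrs: list) -> list:
    """Warn when the native-routing CIDR reaches beyond the VPC's own ranges."""
    native = ipaddress.ip_network(native_cidr)
    vpcs = [ipaddress.ip_network(cidr) for cidr in vpc_cidrs]

    # subnet_of() is True when `native` is fully contained in (or equal to) a VPC range.
    if any(native.subnet_of(vpc) for vpc in vpcs):
        return []
    return [
        f"{native} is not contained in any VPC CIDR {vpc_cidrs}; traffic to addresses "
        "in the extra range will leave with pod IPs as source instead of being masqueraded"
    ]


if __name__ == "__main__":
    print(native_routing_warnings("10.0.0.0/8", ["10.42.0.0/16"]))
    print(native_routing_warnings("10.42.0.0/16", ["10.42.0.0/16"]))
```

Running this against the planned Helm value and the VPC's CIDR associations before rollout catches the too-broad case while it is still a one-line change.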
✨Gated DeltaNet-2: Decoupled Erase and Write Gates in Linear Attention
Linear attention models compress the KV cache into a fixed-size recurrent state — the constraint being that every write to that state risks overwriting existing associations. Prior delta-rule models (Gated DeltaNet, KDA) used a single scalar gate to handle both erasing old content and writing new content, but these two operations act on different axes of the state. Gated DeltaNet-2 decouples them: a channel-wise erase gate picks which key-side coordinates to remove, a write gate picks which value-side coordinates to commit, implemented via a chunkwise WY algorithm with gate-aware backward fused in Triton. At 1.3B parameters trained on 100B tokens, it outperforms KDA and Mamba-3 on language modeling and commonsense reasoning, with the biggest gains on long-context RULER retrieval — S-NIAH-3 jumps from 63 to 90 over KDA. The architecture is worth tracking if your team is evaluating recurrent alternatives to transformer serving for memory-constrained inference deployments.
🎉 Sponsor Section
At the moment, we don’t have a sponsor for this edition, but we look forward to working with companies and organizations that support the EKS & AI Infrastructure community in future editions. If you or your company is interested in sponsoring, please contact us at 📧 thecloudtechforall@gmail.com
📝 Words from the Author
My grandmother never had a to-do list. Never had a productivity system. Never time-blocked her mornings or tracked her habits in a journal.
She also never seemed to run out of things to do, people to take care of, or stories to tell.
I’ve been thinking about that a lot lately. About how the obsession with systems for doing things can quietly replace the actual doing. The framework becomes the product. The process becomes the point. And somewhere in there, you stop noticing what’s actually in front of you.
This edition came together because a bunch of people just... shared what they built. No strategy. No brand consideration. Just “here’s what I learned, maybe it helps someone.” That’s it. That’s the whole thing.
My grandmother had it right. Do the work. Share it honestly. Let the systems figure themselves out.
Happy building. 😎



