👋 Everything about EKS & AI Infrastructure Newsletter "#71" ☁️❤👨‍💻

Dear EKS & AI Infrastructure enthusiasts,
Welcome to Everything about EKS & AI Infrastructure #71.
The primitives arrived this week.

A 550B open-weight model built for agent loops — native speculative decoding, 1M context, on hardware you control. The first standardized sandboxing primitive for EKS agent workloads, backed by AWS and the Kubernetes SIG community. A router that finally understands what an agent run actually is, not just the requests inside it. Kubernetes 1.36 on EKS with User Namespaces at GA — container root no longer means node-level access after a breakout.

On the data and inference side: a two-month upstream collaboration that turned an 11% Spark regression into a 32% win. A production OLAP benchmark from Amazon’s own FinTech team where the engines only diverge when you run what users actually query. A disaggregated inference result where 16 GPUs outperformed 32 under real concurrent load.

The community builders this week are doing the connective tissue work — agent orchestration with trajectory scoring and per-codebase memory, agentic IaC collapsing multi-cloud Kubernetes to a single prompt, automated EKS node patching that reduces human involvement to one PR review, a private AI platform kit that treats governance as a first-class requirement from day one.

The hard infrastructure problems underneath agentic AI are getting solved in the open. The primitives that were whiteboard architecture six months ago are Helm charts and GitHub repos today.

If you find something useful in here, you know what to do with it.

Performance Engineering in Modern AI Systems 🌩️

🌩️ Kueue v0.18: DRA hits beta for GPU workload scheduling

GPU scheduling on Kubernetes has relied on integer device counting — a workload requests N GPUs and gets them whole, with no mechanism to express finer-grained requirements like memory capacity, interconnect topology, or time-sharing across jobs. Dynamic Resource Allocation changes that: workloads declare what they need from an accelerator rather than which device, and the scheduler matches against actual device attributes and shares capacity intelligently across concurrent requests. Kueue v0.18 promotes DRA support to beta, which moves it from experimental to a credible option for production GPU platform teams building shared accelerator pools.

The MultiKueue change matters separately for teams running GPU capacity across multiple cloud regions. Automated quota aggregation removes the manual bookkeeping that previously made multi-region scheduling impractical — available capacity across regions now rolls up automatically, so the scheduler can place jobs against real supply rather than statically partitioned quotas. For EKS teams managing GPU clusters across regions or accounts, this is the release that makes MultiKueue worth evaluating seriously.

🔒 One-click VS Code flaw exfiltrates full-scope GitHub tokens

A researcher published a proof-of-concept that chains VS Code’s sandboxed webview message-passing to install a rogue extension inside github.dev, silently exfiltrate a user’s GitHub OAuth token, and enumerate every private repo they can reach. The stolen token carries full account privileges — not scoped to a single repository. If your developers use github.dev, treat this as a credential-rotation event before it becomes an incident response event.

Starred Content ⭐

⭐ Apache DataFusion Comet on EKS: from 11% slower to 32% faster in two months — Manabu McCloskey & Vara Bonthu, AWS

Running Apache DataFusion Comet against a 3TB TPC-DS benchmark on EKS, the AWS team initially found it 11% slower than vanilla Spark, not the result the local benchmarks had suggested. The culprits were four distinct issues. Comet’s native Rust layer was creating a fresh object store instance per Parquet file read, generating 5,000 DNS queries per second per pod (500x vanilla Spark) and hammering the Route 53 Resolver per-ENI limit of 1,024 packets per second. Memory usage ran 67% higher than vanilla Spark. And the biggest hit: Dynamic Partition Pruning wasn’t supported in the native scan, so DPP queries fell back to Spark with a planning bug that effectively dropped the partition filter entirely, causing full table scans on star-schema workloads.

Two months of tight upstream collaboration — reproducing issues at 3TB scale, filing reports with Comet’s explainFallback output and minimal reproductions, validating fixes against unreleased branches within hours — turned that into a 32% win over vanilla Spark by 0.16.0. On Iceberg tables the same workload runs 37% faster, with 90% of queries seeing 20%+ improvement. The DNS and S3 region fixes landed within a week of being reported; the DPP native support — which moved 78 queries from 30–50% native execution to 80–97% — took the deeper collaboration. Full reproduction kit and per-query results are on the Data on EKS benchmark page.
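The per-file object store problem is a general pattern, and a small Python sketch can show why reusing the client matters. This is not Comet's Rust code: `FakeObjectStore`, `get_store`, and the bucket name are hypothetical, and the instance counter only stands in for the DNS lookup each new client would trigger.

```python
from functools import lru_cache


class FakeObjectStore:
    """Hypothetical stand-in for an S3 client; each construction models one DNS lookup."""

    instances_created = 0

    def __init__(self, bucket: str):
        self.bucket = bucket
        FakeObjectStore.instances_created += 1

    def read(self, key: str) -> str:
        return f"s3://{self.bucket}/{key}"


def read_files_uncached(bucket, keys):
    # Anti-pattern: a fresh client (and therefore a fresh lookup) per file.
    return [FakeObjectStore(bucket).read(k) for k in keys]


@lru_cache(maxsize=None)
def get_store(bucket: str) -> FakeObjectStore:
    # One client per bucket, reused across every read.
    return FakeObjectStore(bucket)


def read_files_cached(bucket, keys):
    return [get_store(bucket).read(k) for k in keys]


keys = [f"part-{i:05d}.parquet" for i in range(1000)]

FakeObjectStore.instances_created = 0
read_files_uncached("tpcds", keys)
uncached_clients = FakeObjectStore.instances_created

FakeObjectStore.instances_created = 0
get_store.cache_clear()
read_files_cached("tpcds", keys)
cached_clients = FakeObjectStore.instances_created

print(uncached_clients, cached_clients)
```

The uncached path builds one client per file, while the cached path builds one per bucket, which is the shape of the fix regardless of language.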

⭐ Modernizing a legacy .NET monolith with EKS Windows containers — Tipalti

Most Windows container migrations on EKS get written up at the architecture level: pick a base image, configure the VPC CNI, done. Tipalti’s post from Danny Teller covers what actually breaks in production: SIGTERM signals not propagating from Windows nodes to containers due to a containerd maturity gap, an HNS race condition during rolling updates that left zombie pods in Terminating state and required a 3-hour node TTL as mitigation, and DNS resolution failures above 20 pods per node, traced to UDP Checksum Offload being misinterpreted by the virtual switch, which then dropped packets silently. The fix for the last one (disabling UDP Checksum Offload on the ENA via PowerShell in node user data) is exactly the kind of thing that takes days to find without a prior art reference.

The results after working through all of it: 50% performance improvement over EC2, 60% cost reduction through KEDA-driven autoscaling off RabbitMQ queue depth, and scale-up time cut from 11 minutes to under 7 via EBS throughput tuning (125 → 250 MB/s) and pre-cached base layers. If you have .NET Framework workloads that can’t be rewritten and are still on EC2, this is the case study to put in front of whoever owns the migration decision.

⭐ llm-d disaggregated inference on OCI: 16 GPUs outperforming 32 at scale

The naive inference serving model — identical replicas, round-robin routing, scale out when latency degrades — assumes LLM traffic behaves like a stateless web application. It doesn’t. Prefill is compute-heavy; decode is memory-bandwidth-bound. When the same replica must handle both, you end up overprovisioning to protect responsiveness. llm-d, the CNCF sandbox project originally founded by Red Hat, separates those phases so prefill and decode workers can be sized and optimized independently. Dennis Kennetz’s benchmarks on OCI with AMD MI300X GPUs make the case in concrete terms: as request rates scaled from 1 to 50 QPS, a 2-node 16-GPU disaggregated deployment maintained virtually flat inter-token latency while a 4-node 32-GPU aggregated deployment degraded — double the stability at half the hardware cost.

The pattern is Kubernetes-native and cloud-agnostic, which is the part that matters for EKS teams. In the mid-range throughput band that most enterprise workloads actually occupy (40 to 80 tokens per second per user under concurrent load), disaggregated prefill/decode (PD) consistently delivered 10–30% better GPU efficiency than aggregated serving on identical infrastructure. Upstream AMD-specific deployment guidance for llm-d is available on GitHub. If you’re sizing GPU clusters for multi-user inference and still running aggregated serving, this benchmark is the quantitative case for revisiting that decision.

⭐ Scaling StarRocks on EKS with KEDA and Karpenter for enterprise OLAP workloads — Vara Bonthu, Navaneeth Sagar & Apurva Sherke, AWS

Standard benchmarks flatter both engines equally — the gap between StarRocks and ClickHouse only opens up when you run against what financial analysts actually query: multi-table joins across star schemas, hierarchical drill-downs, high-cardinality filters applied simultaneously. Amazon’s WW Stores FinTech team built a Query Complexity Framework against their actual production datasets and found ClickHouse held a slight edge on no-JOIN aggregations, but StarRocks delivered 3–5x higher throughput and 80% lower P95 latency on multi-join patterns, and maintained 1.5x better P95 at 1,000 concurrent users. The Cost-Based Optimizer self-tuning query plans as data distributions change is what kept performance stable across complex patterns without manual intervention.

The architecture that makes elastic scaling practical is the hybrid deployment model: stateless Compute Nodes pull fact tables from S3 and scale instantly via KEDA with no data movement, while Backend nodes hold indexed dimension tables on EBS for join acceleration — the query planner routes workloads between tiers automatically. Karpenter provisions CN NodePools from Spot and FE/BE NodePools from On-Demand, with each role in a dedicated NodePool so scaling decisions don’t interfere across roles. Two operational callouts worth internalizing before you benchmark anything: EBS throughput at default gp3 settings (125 MB/s) turned a 25-minute load into 13 hours — tuning to 1,000 MB/s fixed it. And at 1,000 concurrent users, 40% query failure from memory exhaustion dropped to under 5% only after migrating BE nodes to memory-optimized instances and configuring StarRocks Resource Groups to queue rather than fail. Production blueprints and Helm configs are in the Data on EKS repo.

⭐ One Valkey, Multiple Tenants: Per-Database Isolation on Kubernetes — Ishan Jain

The default answer when a microservice needs a cache is its own Redis or Valkey instance — simple, isolated, and sitting at 2% utilization alongside a dozen others doing the same. The consolidation argument has always been obvious; the blocker has been isolation. Without a server-enforced boundary, a bug in one service can read or corrupt another’s keyspace, which makes shared instances a liability rather than an optimization. Valkey 9.1.0’s db= ACL selector removes that blocker: user app3 on >password3 ~* +@all db=3 locks that user to database 3 at the protocol level, and any attempt to SELECT 4 gets a hard NOPERM rejection from the server.

Ishan Jain benchmarked this across 10 services at 10,000 RPS aggregate — p95 at 34ms, zero errors across 1.77 million requests covering steady load, ramp, and spike scenarios. The honest caveat he flags, and it’s the right one to flag: db= gives access isolation, not resource isolation. A service flooding the instance with large keys raises latency for every other tenant — there’s no per-database CPU or memory quota. Two constraints also worth planning around before adopting this: databases are numbered, not named, and a single standalone instance caps you at 15 service slots. Both are workable for most teams, but they need to be in the architecture decision, not discovered after the Helm chart is deployed.
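As a rough illustration of how the per-tenant rules could be generated, here is a minimal Python sketch that emits ACL lines in the same shape as the example above. The helper names, the placeholder passwords, and numbering tenants from database 1 are assumptions for illustration; the 15-slot cap is the figure quoted in the write-up.

```python
MAX_SLOTS = 15  # per-instance service cap quoted in the write-up


def acl_rule(user: str, password: str, db: int) -> str:
    """Build one ACL line that pins `user` to a single numbered database."""
    if db < 0:
        raise ValueError("database index must be non-negative")
    return f"user {user} on >{password} ~* +@all db={db}"


def assign_databases(services, first_db=1):
    """Map service names to database numbers (numbering from 1 is an assumption)."""
    if len(services) > MAX_SLOTS:
        raise ValueError(f"{len(services)} services exceed {MAX_SLOTS} slots")
    return {name: first_db + i for i, name in enumerate(services)}


services = ["app1", "app2", "app3"]
db_map = assign_databases(services)

# Placeholder passwords only; a real deployment would read them from a secret store.
rules = [acl_rule(name, f"password{db}", db) for name, db in db_map.items()]

for rule in rules:
    print(rule)
```

Because databases are numbered rather than named, keeping the name-to-number mapping in one place like this is what stops two services from silently claiming the same slot.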

Announcements 📢

📢 Kubernetes 1.36 is now available on Amazon EKS

Kubernetes 1.36 landed on EKS on June 2, available across all regions including GovCloud. The feature worth acting on immediately is User Namespaces hitting GA — container root now maps to an unprivileged host user, so a container breakout no longer grants node-level access. That’s a meaningful change to the blast radius of a compromised workload, with no measurable performance cost. CEL-based Mutating Admission Policies also land here, replacing the webhook maintenance overhead for in-cluster mutation. In-Place Pod-Level Vertical Scaling lets you resize CPU and memory without a pod restart — relevant for anything running stateful workloads with variable resource demand.

Before upgrading, three breaking changes need attention. gitRepo volumes are removed — migrate to an init container or git-sync sidecar before bumping the version. StrictIPCIDRValidation is on by default, so leading zeros and ambiguous CIDRs in manifests, Helm charts, and automation will be rejected. SELinux volume labeling switched to mount-based context at GA — review seLinuxChangePolicy if you run SELinux-enforcing nodes. Run EKS Cluster Insights first; it flags these with remediation steps before the upgrade touches anything.
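For the StrictIPCIDRValidation change, a quick pre-flight scan of values pulled from manifests and Helm charts can surface likely rejections before the upgrade. The sketch below uses Python's standard `ipaddress` module and only approximates the two cases named above (leading zeros and CIDRs with host bits set). It is not the Kubernetes validator, and the function names are hypothetical.

```python
import ipaddress


def has_leading_zero(ipv4_text: str) -> bool:
    # "010.0.0.1" is ambiguous: some parsers read the first octet as octal.
    return any(len(part) > 1 and part.startswith("0") for part in ipv4_text.split("."))


def is_strict_ip(text: str) -> bool:
    """True if `text` parses as an IP address and has no leading-zero IPv4 octets."""
    if "." in text and ":" not in text and has_leading_zero(text):
        return False
    try:
        ipaddress.ip_address(text)
    except ValueError:
        return False
    return True


def is_strict_cidr(text: str) -> bool:
    """True if `text` is a CIDR whose address is the network address (no host bits set)."""
    addr, sep, _prefix = text.partition("/")
    if not sep or not is_strict_ip(addr):
        return False
    try:
        ipaddress.ip_network(text, strict=True)
    except ValueError:
        return False
    return True


candidates = ["10.0.0.1", "010.0.0.1", "10.0.0.0/24", "10.0.0.5/24"]
for value in candidates:
    ok = is_strict_cidr(value) if "/" in value else is_strict_ip(value)
    print(f"{value}: {'ok' if ok else 'likely rejected'}")
```

EKS Cluster Insights remains the authoritative check; a scan like this is only useful for catching obvious cases in repos and automation that the cluster has not seen yet.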

📢 NVIDIA Nemotron 3 Ultra: 550B Open Frontier Model for Agents

The “open frontier model” framing has been aspirational until now: models that are open-weight but not actually competitive with hosted APIs on agentic workloads. Nemotron 3 Ultra changes that calculus. It’s a 550B-total / 55B-active MoE with a hybrid Mamba-Attention backbone, native Multi-Token Prediction for speculative decoding built into the architecture rather than bolted on, and a 1M-token context window. NVIDIA’s benchmarks put it at 5.9x the throughput of GLM-5.1 and 1.6x that of Qwen-3.5 at roughly 30% lower cost. It is deployable on-prem, in the cloud, or at the edge, and the weights are on Hugging Face under OpenMDW-1.1.

The “for agents” claim is the part that matters for EKS GPU teams. Native speculative decoding and 1M context are exactly what long-horizon, tool-calling agent loops burn through — and an open-weight model with those properties means you can now run that loop on infrastructure you control. For teams already running vLLM or Dynamo on EKS, this is the model to benchmark next.

If you’re on AWS and want the shortest path to running it, Nemotron Ultra launched Day 0 on SageMaker AI: deploy on H100 (P5), B200 (P6), or Blackwell RTX 6000 (G7e) without standing up your own serving stack.

📢 ThunderAgent integrated into NVIDIA Dynamo as experimental agentic router

Every inference router today optimizes at the request level — it sees individual LLM calls, not the agent program running above them. That means a router can’t distinguish a tool-call wait from job completion, can’t coordinate GPU memory across multi-turn loops, and can’t prevent KV cache leakage as agent sessions accumulate state. ThunderAgent, developed by Hao Kang’s team at Georgia Tech in collaboration with Together AI, schedules at agent-run granularity instead — tracking program states and workflow dependencies explicitly so the router understands the full lifecycle of an agent run, not just the current request.

ThunderAgent is now integrated into Dynamo as an experimental router, and the benchmark numbers are worth paying attention to: 1.48–3.58x throughput improvement on serving workloads across SWE-Agent, OpenHands, and HLE-Bench, and 1.79–3.92x on RL rollouts across distributed GPU nodes. For EKS teams running agentic workloads at scale, this is the first router that treats the agent as the scheduling unit rather than an afterthought.

📢 Agent Sandbox on EKS merged into awslabs/ai-on-eks

Running coding agents or autonomous workloads on EKS has had no standardized sandboxing primitive — teams have been rolling their own isolation with varying degrees of rigor. Agent Sandbox on EKS, now merged into awslabs/ai-on-eks by Brian Hammons, is the first AWS-backed implementation of the kubernetes-sigs/agent-sandbox CRD — the vendor-neutral standard SIG-Apps is building around — with AWS-specific patterns layered as extensions. The implementation ships three composable blueprints: a smallest-viable sandbox, FQDN egress enforcement, and a Bedrock-backed reference agent with end-to-end conformance. The mode-aware egress enforcement is the operationally sharp piece — the same pod-level allowlist labels work whether you’re running Cilium or native VPC CNI ApplicationNetworkPolicy on EKS Auto Mode, so the security model is portable across compute modes without rewriting policies.

The two-tier SandboxTemplate model is designed to absorb future isolation tiers cleanly — Kata + Firecracker is next, with hardware-isolated runtimes on EC2 nested virtualization on the roadmap. For EKS teams already building toward agentic workloads, this is the implementation to start from rather than the one to build yourself. Deployed via Terraform + ArgoCD addons, validated on both Standard EKS and Auto Mode.

Community & Career 🤝

🤝 Orchestrating Secure AI Agents on Amazon EKS — Matt Camp, Unitary

The default way teams run Claude Code or Codex in production is a developer watching a terminal, approving tool calls, catching loops — which doesn’t scale and doesn’t work when nobody’s watching. Matt Camp’s team at Unitary already ran 1,000+ nodes on EKS for video inference, and when they needed agent orchestration they reached for the same primitives: pods for isolation, Jobs for lifecycle management, IRSA for credential scoping, NetworkPolicies for egress control. The result is Osmia, an open-source controller that translates incoming tasks into Kubernetes Jobs, now released under Apache 2.0.

The intelligence layer is what makes this worth studying beyond the basic Kubernetes scaffolding. Every tool call streams as NDJSON; the controller scores whether the agent is making progress or stuck — five run_tests calls with the same failure triggers intervention before budget burns out. Per-codebase memory extracts facts from each completed task and injects them into the next one on the same repo, decaying stale knowledge automatically. Two sharp operational findings from production: NetworkPolicies matter more than gVisor sandboxing for most threat models (egress control beats filesystem isolation), and Spot doesn’t work for agent workloads — a reclaimed 30-minute job loses all token spend, which makes interruption cost too high compared to short-lived inference.
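The "same failure five times" heuristic can be sketched in a few lines of Python. This is not Osmia's implementation: the NDJSON field names (`tool`, `error`) and the function name are assumptions, and only the threshold of five comes from the description above.

```python
import json


def is_stuck(ndjson_lines, threshold=5):
    """Return True once the same tool fails the same way `threshold` times in a row."""
    streak = 0
    last_failure = None
    for raw in ndjson_lines:
        raw = raw.strip()
        if not raw:
            continue
        event = json.loads(raw)
        error = event.get("error")
        if error is None:
            # Any successful call counts as progress and resets the streak.
            streak, last_failure = 0, None
            continue
        failure = (event["tool"], error)
        streak = streak + 1 if failure == last_failure else 1
        last_failure = failure
        if streak >= threshold:
            return True
    return False


failing = json.dumps({"tool": "run_tests", "error": "AssertionError: test_login"})
passing = json.dumps({"tool": "run_tests", "error": None})

stuck_run = [failing] * 5
recovering_run = [failing] * 4 + [passing, failing]

print(is_stuck(stuck_run), is_stuck(recovering_run))
```

A production scorer would presumably weigh more signals than consecutive identical failures, but even this check is enough to stop a loop before the budget is gone.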

🤝 Multi-Cloud Kubernetes with Kiro + Cilium Cluster Mesh — Ansley

Standing up Kubernetes across AWS and GCP has always meant weeks of networking work — VPN or interconnects, cross-cloud DNS, cluster mesh configuration — before a single workload runs. Ansley compressed that into a single natural language prompt, instructing AWS Kiro to act as a cloud-native platform engineer and deploy a cross-cloud EKS + GKE cluster connected via Isovalent’s Cilium Cluster Mesh. The full project is on GitHub.

What’s worth tracking here isn’t the multi-cloud topology itself — it’s what the workflow reveals about where agentic IaC is actually landing. Kiro generates the spec, Cilium handles the overlay, and the human’s role shrinks to intent definition. The gap between “write me a Terraform module” and “stand up a production-grade cross-cloud mesh” is closing faster than most platform teams have planned for.

🤝 AI-Driven EKS AMI Updates via Bedrock + Karpenter GitOps — Suryansh Gupta

EKS node patching is the task that lives permanently at the bottom of the backlog — not because it’s hard, but because the ritual is tedious enough that it always loses to whatever’s currently on fire. Suryansh Gupta automated the entire loop: EventBridge triggers twice daily, Lambda checks SSM for a new EKS-optimized AMI, Bedrock (Claude 3.5 Haiku) pulls the actual release notes from the awslabs/amazon-eks-ami repo and writes a structured risk assessment with CVE analysis, then opens a GitHub PR with the full analysis embedded. A human reads it and merges or closes. Karpenter handles the zero-downtime rollout.

The sharpest design decision is using GitHub PRs as the approval interface: no new tooling, a permanent audit trail, and the PR description is the change record. The Bedrock prompt is grounded in real release notes, not a hallucinated summary, and the risk score is a 1–10 JSON field so reviewers can triage at a glance. Full CloudFormation deploy in the repo.

🤝 Private AI Platform Kit — Ramazan Kara, OTTO

The gap between “running a model locally” and “running private AI like production infrastructure” is where most platform teams get stuck. Who can use which model? Are prompts leaking company secrets? What is each team spending, and can every request be audited back to a sandbox? Ramazan Kara’s Private AI Platform Kit is a Kubernetes-native reference implementation that treats those questions as first-class platform requirements from day one — not afterthoughts bolted on after the demo works. The stack covers an OpenAI-compatible inference gateway for Ollama and vLLM with API-key auth, model allowlists, Redis-backed sandbox budgets, and redacted audit logs; isolated agent workspaces with namespace RBAC, default-deny networking, and prompt secret detection before requests reach the runtime; and a full Day-2 operations layer including chaos drills, SBOMs, image signing, and restore runbooks.

The design decision worth paying attention to is the portable operating model — starts on kind locally, carries into customer-owned Kubernetes clusters with CPU, NVIDIA GPU, and AMD ROCm profiles using the same Argo CD, Helm, and Kyverno policy stack. That means the governance model you validate locally is the one you deploy to production, not a different thing you rebuild later. If you’re a platform team trying to get ahead of the “can we use LLMs internally” conversation before it turns into a compliance incident, this is the reference architecture to study.

Highlights ✨

✨ Migrating GitLab Runners from EKS Fargate to EKS Auto Mode: 40% Cost Reduction — Pawan Sawalani

Fargate’s per-second billing looks attractive until pipeline frequency grows — a typical CI job at 2 vCPU / 4GB for 15 minutes costs ~$0.034 on Fargate versus ~$0.007 on a Spot m6a.large. Pawan Sawalani ran that math on ~100 jobs/day and migrated to EKS Auto Mode with Spot, dropping monthly compute from $143 to ~$45 and total infra from $385 to ~$230. The 40% headline is conservative — compute savings alone are 68%. One hard-won operational call-out: exclude t3/t3a instances from your CI NodePool entirely — burst credit exhaustion on a Maven build turns a 10-minute job into 50 minutes.
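The savings percentages can be checked directly from the figures quoted above. This small Python sketch uses only those numbers; the function name is illustrative.

```python
def savings_pct(before: float, after: float) -> float:
    """Percentage reduction going from `before` to `after`."""
    return (before - after) / before * 100


compute_savings = savings_pct(143, 45)   # monthly compute, USD
total_savings = savings_pct(385, 230)    # monthly total infra, USD
per_job_ratio = 0.034 / 0.007            # Fargate vs Spot m6a.large, per job

print(f"compute: {compute_savings:.1f}%")
print(f"total:   {total_savings:.1f}%")
print(f"per-job cost ratio: {per_job_ratio:.1f}x")
```

The total-infra number lands just above 40% and the compute-only number in the high 60s, which is why the headline understates the compute win: the fixed costs that did not move dilute it.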

✨ Deploying EpicBook on AWS with Terraform: End-to-End DevOps Project — Bhupendra Bhati

Most EKS walkthroughs cover one layer and leave the rest as an exercise. Bhupendra Bhati’s EpicBook project is a full vertical slice — VPC, EKS cluster, application deployment, and pipeline wiring in one Terraform-managed repo. Useful as a skeleton to adapt when onboarding someone to the stack or when you need a working reference before building the production version.

✨ Linux Foundation stands up a Tokenomics Foundation

Until now there’s been no neutral body measuring token efficiency across models and vendors: GPU-hours had FinOps, tokens had nothing. The Linux Foundation’s new Tokenomics Foundation fills that gap, building open standards and benchmarks in partnership with the FinOps Foundation and extending the FOCUS spec into token-based spend. Goldman Sachs projects global token usage growing 24x between 2026 and 2030 to 120 quadrillion tokens a month. The unit of AI infrastructure cost has already shifted; the measurement tooling is just catching up.

✨ Amazon ECS Managed Instances now supports AWS Trainium and Inferentia

If you’re running training or inference workloads on ECS rather than EKS, AWS Trainium and Inferentia are now available on ECS Managed Instances — bringing purpose-built AI accelerator support to the ECS compute model without needing to manage the underlying EC2 instances directly. Worth knowing if your team runs mixed ECS/EKS workloads and wants to evaluate Trainium cost efficiency against GPU instances before committing to an EKS-based inference stack.

✨ AWS PCS-ready DLAMI: validated base AMI for Slurm clusters

Building custom AMIs for AWS Parallel Computing Service has always meant manually reconciling independent release cycles across kernel versions, EFA drivers, NVIDIA drivers, CUDA, and Lustre client — one version mismatch and networking, storage, or compute breaks silently. The new PCS-ready DLAMI ships on top of the Deep Learning Base GPU AMI (Ubuntu 24.04) with PCS Agent, Slurm, and EFS utilities pre-integrated. Multiple Slurm versions are bundled and the correct one activates automatically based on your cluster configuration. Latest AMI version is always discoverable via SSM, so IaC pipelines stay current without manual tracking.

✨ NVIDIA OpenShell: sandboxed runtime for AI coding agents

Giving an agent shell access means trusting it won’t quietly touch files it shouldn’t, leak credentials, or make network calls you didn’t authorize — which is exactly the gap OpenShell addresses. It runs Claude Code, Codex, GitHub Copilot CLI, and others inside a policy-enforced sandbox with declarative YAML policies covering filesystem, network, process, and inference layers. The sharp detail is hot-reloadable network and inference policies: you can adjust egress routing and model access at runtime without restarting the sandbox. Credentials are injected as environment variables by providers, never stored on disk. Written in Rust, runs on Docker, Podman, MicroVM, or Kubernetes (Helm charts experimental). Explicitly alpha — the README calls it “single-player mode, proof-of-life” — but the design direction is the right one if you’re building toward production agent environments where auditability matters. Apache 2.0.

🎉 Sponsor Section

At the moment, we don’t have a sponsor for this edition, but we look forward to working with companies and organizations that support the EKS & AI Infrastructure community in future editions. If you or your company is interested in sponsoring, please contact us at 📧 thecloudtechforall@gmail.com

📝 Words from the Author

Seattle, evening. The AWS Ambassador Summit wrapped today and I’m writing this from my hotel room with the kind of quiet that only comes after two days of back-to-back conversations about infrastructure, AI, and what comes next.

I spoke here on distributed LLM inferencing on AWS. The room was full of people building the same things I write about every week — and for the first time in a while, the conversations weren’t about whether to build agentic infrastructure. They were about how. The primitives have landed. The question is now operational.

That shift is what this edition is about. Agent sandboxing merged into awslabs. A router that understands agent runs, not just requests. A benchmark that finally puts a number on how badly current models handle real Kubernetes incidents. These aren’t research previews. They’re things you can deploy next week.

Edition 71 is going out from Seattle. Edition 1 went out from my desk with almost nobody reading it. The distance between those two things isn’t a strategy. It’s just showing up, week after week, until the work compounds.

To everyone I met this week at the Ambassador Summit — thank you for the conversations, the energy, and the reminder that the community is the whole point.

Happy building. 😎