👋 Everything about EKS & AI Infrastructure Newsletter "#66" ☁️❤👨‍💻

Dear EKS & AI Infrastructure enthusiasts,
Welcome to Everything about EKS & AI Infrastructure #66.

Inference just crossed two-thirds of all AI compute. That number, from CNCF’s Jonathan Bryce at KubeCon Amsterdam, is the quiet headline underneath everything in this edition.

Three years ago the conversation was about training — how to get enough GPUs, how to distribute across nodes, how to not run out of memory. That problem isn’t solved, but it’s understood. The frontier has moved. The new hard problems are inference at scale, agentic workloads in production, and everything the infrastructure community hasn’t fully caught up to yet.

This edition sits right at that frontier.

The performance engineering pieces go deep on disaggregated prefill/decode, heterogeneous CPU+GPU scheduling, and why the metrics you’re probably watching right now are lying to you. The security pieces make the case — clearly and uncomfortably — that Kubernetes was not designed for the threat model that LLM workloads introduce. The announcements show AWS moving fast on the agentic infrastructure layer: AgentCore, VPC egress for private APIs, HyperPod topology automation.

And running underneath all of it: the same question the best platform engineers are asking right now. Not “can we run this?” — we can run almost anything. But “do we actually understand what’s happening inside it?”

That’s what this edition is about. Let’s get into it. 👇

Performance Engineering in Modern AI Systems 🌩️

🌩️43,751 videos, 3,425 hours, curated in 4h 26m — Ray Data pipeline, by Suman Debnath

The insight buried in this post is about scheduling, not models. Most video curation pipelines run stage-by-stage: CPUs decode, then GPUs infer, then CPUs write — and at scale those idle gaps between stages compound into weeks of wasted compute. Suman Debnath (Technical Lead, ML at Anyscale) ran the HuggingFaceFV/finevideo dataset through a single streaming Ray Data pipeline on a mixed A10G + CPU-only cluster, with all five stages — video decode, fused CPU transforms (scene detection, quality filtering, keyframe extraction), Qwen2.5-VL-3B via vLLM for captioning, CLIP ViT-B/32 for 512-d embeddings, and parquet write — running concurrently with automatic backpressure between operators.

The result: both CPU and GPU pools stay busy through almost the entire 4h 26m run instead of taking turns. The framing worth internalizing for EKS-based ML pipelines — Ray on Karpenter with heterogeneous node pools — is that “GPU utilization” measured in isolation is the wrong metric. End-to-end throughput per dollar is what matters, and that requires the scheduler to see CPU and GPU as a single resource budget rather than two sequential queues.

🌩️Webinar: Multimodal data pipelines with Ray Data — how to keep GPUs busy across CPU+GPU stages

Multimodal pipelines — video, audio, PDFs, sensor streams — interleave CPU-bound preprocessing with GPU-bound inference. In traditional batch architectures those two phases run sequentially, which means GPUs sit idle more than 50% of the time waiting for CPU steps to finish. This Anyscale webinar covers why batch processing engines break under multimodal workloads, how to eliminate the I/O handoff between CPU and GPU stages that causes the idle gap, and how to architect a streaming CPU+GPU pipeline with Ray Data where both resource pools stay busy concurrently.

Starred Content ⭐

⭐NVIDIA Nemotron 3 Nano Omni — 30B-A3B hybrid MoE, single architecture for text, image, video, and audio

Most multimodal agentic stacks today are three models duct-taped together — a vision model, an audio model, and an LLM, each with its own serving stack, memory footprint, and failure mode. Nemotron 3 Nano Omni is NVIDIA’s answer to that: a single 30B-A3B hybrid MoE that handles text, image, video, and audio natively in one architecture. The MoE design combines Mamba layers for memory efficiency with transformer layers for reasoning, keeping active parameters at 3B while reaching 30B total capacity.

The throughput numbers are the headline for EKS inference sizing: NVIDIA reports ~9.2x throughput on video tasks and ~7.4x on multi-document reasoning versus comparable omnimodal open models. The deployment path is already wired in — vLLM, SGLang, and TensorRT-LLM all supported, available on HuggingFace, NIM, and AWS from day one. For platform teams building agentic pipelines on EKS where the current stack involves multiple specialized models per modality, this is the consolidation option worth benchmarking before your next node pool sizing conversation.

⭐llm-d distributed tracing + vLLM NIXL metrics — closing the observability gap in P/D mode

When prefill and decode instances are separated for throughput optimization, standard metrics lie to you. Local TTFT looks fine — but that number is measured at the decode instance and misses the full request journey from gateway through prefill pod to KV cache transfer to decode. Sally from llm-d SIG Observability demonstrates how to bridge this gap with two complementary layers: llm-d distributed tracing for request-level context — routing decisions, prefix cache hits, token usage per request — and vLLM NIXL metrics for real-time KV cache transfer visibility, including transfer times and average sizes that tell you whether your system actually needs RDMA-class networking.

The Gateway API Inference Extension tracing instrumentation now merged into llm-d tracks requests from the initial gateway through pod scoring all the way to final vLLM execution — giving platform engineers both the aggregate P50/P95 latency picture from metrics and the precise per-request story from traces. For EKS teams running disaggregated inference stacks, this is the observability layer that makes P/D mode production-viable rather than just benchmark-viable.

⭐Production-grade LLM inference at scale with KServe + llm-d + vLLM — Tesla + Red Hat engineering blog

Edition 65 covered the 3x output tokens/sec result from this stack. Here’s the full production story from the engineers at Tesla and Red Hat who built it. The naive starting point — vLLM in a Kubernetes StatefulSet — hit three immediate walls: NFS storage drag for models reaching hundreds of gigabytes, rigid node-to-pod affinity from local LVM PVCs that turned hardware failures into manual interventions, and round-robin load balancing that completely ignored KV-cache state on GPU. The fix wasn’t incremental tuning — it required rethinking the serving topology. The Register

The winning stack is KServe + llm-d + vLLM with Envoy and Gateway API Inference Extension for prefix-cache aware routing — the 3x improvement in output tokens/s and 2x reduction in TTFT was measured on Llama 3.1 70B across 4 MI300X AMD GPUs with tensor-parallel-size=4 and a 65K context window, and the performance jump is visible as a step change in the production chart at the moment routing was deployed. Running this at scale also surfaced upstream fixes now merged into KServe — storageInitializer made optional to support RunAI Model Streamer, and updated Gateway API Inference Extension support. The production edge cases that get upstreamed are the most useful signal for teams evaluating the same stack.

⭐Deloitte cuts EKS environment provisioning from 45 minutes to 5 — EKS Auto Mode + vCluster

The problem is familiar to any platform team running dedicated EKS clusters per QA environment: 30–45 minute provisioning times per cluster including ALB, Route 53, and monitoring agent setup, significant infrastructure duplication, and QA teams blocked on platform engineers for every new environment request. Deloitte’s solution is EKS Auto Mode as the host cluster with vCluster on top — platform services like ingress controllers and monitoring agents deploy once on the host and are shared across all virtual clusters, while QA teams get fully isolated Kubernetes environments that provision in under 5 minutes without platform team involvement.

The outcomes: 89% reduction in provisioning time, 500 hours reclaimed annually by the QA team, over 50 vCPUs and 200 GB of memory saved through resource consolidation at peak, up to 70% cost savings from Spot Instances via EKS Auto Mode autoscaling, and 50+ virtual clusters now running efficiently on a single shared host cluster. The architecture is reproducible — the walkthrough in the post covers the full Helm deployment, vCluster configuration YAML, and path-based ALB routing setup that makes a single load balancer serve traffic across multiple virtual clusters. Worth reading before your next conversation about cluster proliferation in your org.

⭐Implement SPIFFE/SPIRE on Amazon EKS — nested multi-cluster workload identity and mTLS authorization

The CNCF threat model paper earlier in this edition makes the case that Kubernetes pod isolation isn’t sufficient for LLM workloads — agents process untrusted input and make dynamic decisions that infrastructure-layer controls can’t see. SPIFFE/SPIRE is the concrete next layer: cryptographic workload identity that moves with pods across clusters and networks, not tied to IP addresses or network perimeters that shift.

This AWS blog walks through deploying SPIRE in a nested architecture across multiple EKS clusters — root SPIRE server managing child servers per cluster, all issuing SVIDs within the same trust domain, with Amazon Aurora as the persistent datastore for the root server. The nested chaining means a new cluster joins the trust domain by wiring up an additional child SPIRE server, without changing the root CA. The two concrete deliverables are short-lived auto-rotated X.509 certificates for mTLS between specific workloads, and JWT-SVIDs for scenarios where direct mTLS isn’t possible — like when an L7 load balancer sits between services or multiple workloads share an encrypted channel. For EKS platform teams now running agentic workloads that need to call internal APIs, this is the identity architecture worth understanding before you’re being asked about it in a security review.

Announcements 📢

📢Amazon Bedrock AgentCore — managed harness, CLI, and skills for faster agent prototyping

The pattern AWS is establishing with AgentCore is worth paying attention to: start with a fully managed harness where you supply a model, system prompt, and tools and get a running agent with zero orchestration code, then export to Strands-based Python when you need full control. That escape hatch matters — most managed agent platforms trap you in their abstraction layer. AgentCore’s export path means the managed harness is a prototyping accelerator, not a lock-in mechanism.

The AgentCore CLI deploys agents via AWS CDK today, with Terraform support coming — which means agentic workloads get the same IaC governance and auditability pipeline that infrastructure teams already run for EKS clusters and everything else. For platform teams building internal developer platforms on top of EKS, this is the pattern to watch: agent deployment as a first-class IaC workflow, not a one-off notebook-to-endpoint process.

📢AgentCore Gateway + Identity now support VPC egress — agents can reach private EKS-hosted MCP servers

Every agentic AI PoC at a regulated customer eventually hits the same wall: the agent can’t reach the API it needs to automate because that API lives inside a private VPC with no public internet access. AgentCore’s VPC egress support for both its Gateway and Identity services closes that gap. The architecture that’s now possible: AgentCore agent → managed VPC egress → private ALB → EKS-hosted MCP server → internal APIs — entirely within the AWS network, full CloudTrail audit trail, zero public endpoints.

Two egress modes: managed VPC egress handles most cases with just CLI config and no custom networking, while self-managed via VPC Lattice covers complex multi-VPC topologies. Available in ap-southeast-2 from launch. For platform teams building agentic workflows on EKS where the MCP servers or backend APIs are private — banking, healthcare, anything regulated — this is the architectural blocker that just got removed.

📢Meta just became one of AWS's largest Graviton customers.

Tens of millions of Graviton5 cores. For agentic AI. Not training. Not GPU workloads. The orchestration layer — real-time reasoning, multi-step agent coordination, code generation.

Graviton5 ships with 192 cores and a cache 5x larger than the previous generation. That cache improvement reduces inter-core communication latency by up to 33%. For agentic workloads that continuously reason through and execute multi-step tasks, that latency delta compounds at scale.

The EKS implication is specific: Graviton5 instances support Elastic Fabric Adapter. Which means low-latency, high-bandwidth communication between instances — the same networking primitive you'd use for distributed GPU training, now relevant for distributed agent orchestration. If you're designing node pools for agentic workloads on EKS, this is worth a closer look before you default to GPU instances for everything.

📢Amazon SageMaker HyperPod now auto-manages Slurm network topology

Network topology directly determines distributed training performance — when jobs land on topologically close nodes, NCCL collective operations run faster and training throughput improves. Until now, teams running Slurm on HyperPod had to manually maintain topology.conf and reconfigure it every time the cluster scaled or a node was replaced. That ops overhead compounds fast on large clusters where node churn is frequent.

HyperPod now inspects instance types at cluster creation, selects the right topology model — tree for hierarchical interconnects like p5/p5e/p5en, block for uniform high-bandwidth instances like the p6e-gb200 NVL72 — and keeps the configuration updated through every scale event automatically. Topology-aware scheduling is on by default with no config required, which means mixed-instance-type clusters also get a compatible topology selected without manual intervention.

📢SageMaker HyperPod adds G7e (NVIDIA RTX PRO 6000 Blackwell) and r5d.16xlarge instances

G7e instances are powered by NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs and deliver up to 2.3x better inference performance than G6e instances, with up to 768 GB of total GPU memory, up to 1.27x the TFLOPs, and up to 4x the GPU-to-GPU bandwidth compared to G6e. The memory headroom is the headline for inference teams — 768 GB total GPU memory means larger models or multiple concurrent models on a single endpoint without multi-node serving complexity. Available in us-east-1, us-east-2, ap-northeast-1, and us-west-2.

The r5d.16xlarge addition is the less flashy but equally useful part: 64 vCPUs, 512 GB memory, and 5×600 GB NVMe SSD, well suited for distributed training data preprocessing with Ray, large-scale feature engineering, and running memory-heavy orchestration services alongside GPU compute. For teams running HyperPod with EKS where the CPU preprocessing bottleneck is the constraint rather than GPU compute — exactly the heterogeneous pipeline problem covered in the Ray Data items in this edition — r5d.16xlarge in the same cluster is the right pairing.

Community & Career 🤝

🤝future-agi/future-agi — self-hostable AI agent observability and evals platform

Most teams running agentic workloads on EKS today have observability and evaluation stitched together across Langfuse, Braintrust, and some custom guardrails layer that never quite closes the loop back into the training pipeline. Future AGI is an Apache 2.0 platform — still in early/nightly release — that collapses tracing, evals, simulations, guardrails, LLM gateway, and prompt optimization into a single deployable stack with a Docker Compose or Kubernetes path out of the box.

The gateway is a Go binary claiming ~29k req/s on a t3.xlarge with P99 under 21ms when guardrails are active — and the architecture uses OTel OTLP for traces and an OpenAI-compatible HTTP interface for the gateway, which means it slots in front of any existing inference stack on EKS without rewriting client code. very early, but the feedback loop design — production traces feeding back into prompt optimization — is the right architecture for teams who want agents that actually improve over time, not just get monitored.

🤝EKS LLM benchmarking app — inference-perf + vLLM + SOCI, by Jeremy Cowan

Picking the right instance for an LLM isn’t guesswork anymore — but actually running structured benchmarks across multiple instance types, with real load generation and comparable metrics, is still more ops work than most teams want to do from scratch. Jeremy Cowan (Principal SA at AWS) built a benchmarking application on EKS that auto-recommends instances for a given model, runs benchmarks on them, collects performance metrics, and lets you compare up to four models side-by-side with exportable configs.

The stack is worth noting: inference-perf as the load generator, vLLM as the model runtime, model weights cached from HuggingFace to S3 to sidestep rate limits, run.ai’s model streamer for weight loading from cache, and SOCI parallel pulls for cold start reduction. The whole thing is moving to AWS Samples shortly for permanent maintenance — worth starring now before the repo moves.

🤝Free workshop: EKS Auto Mode + Kiro + MCP Server — hands-on, by Olawale Olaleye

Olawale Olaleye (Senior GenAI/ML Specialist SA at AWS, Containers SME, CNCF Kubestronaut) is running a free hands-on workshop covering EKS Auto Mode deployments accelerated with Kiro and an MCP server integration. If you’ve been meaning to get hands-on with the Kiro + EKS workflow beyond slide decks, this is a direct path in.

Highlights ✨

✨ Terraform Registry gets a Partner Premier tier — SBOM, ephemeral resources, and Day 2 actions required

HashiCorp has launched a Partner Premier tag on the Terraform Registry, a tier above the existing Partner tag, requiring providers to include a software bill of materials and implement at least one of three advanced features: ephemeral resources, Terraform search, or Terraform actions. The SBOM requirement is the quiet but meaningful one — it brings the same supply chain transparency to Terraform providers that container teams have been demanding from OCI images for the past two years.

Terraform actions are the more interesting feature signal: they allow providers to expose imperative Day 2 operations — reboots, snapshots, config commits, credential rotations — directly through the Terraform pipeline instead of requiring external scripts or manual steps outside IaC. For EKS platform teams who already manage cluster lifecycle through Terraform, this means node group operations, add-on upgrades, and similar tasks that currently live in runbooks could eventually land inside the same declarative workflow. Launch partners include Sidero Labs (Talos OS management), 1Password (ephemeral credential injection), MongoDB Atlas, Palo Alto Networks, and Cisco, among others.

✨CNCF: Kubernetes pod isolation is not sufficient for LLM workloads — the threat model is different

A well-configured EKS cluster — pods healthy, logs clean, resource usage stable — tells you nothing about whether a prompt should be allowed, whether a response contains sensitive data, or whether the model should have access to certain tools. Kubernetes did its job: it scheduled and isolated the workload. What it cannot do is enforce semantic controls over what the LLM decides to do inside that boundary.

Kubernetes isolates containers. It does not isolate decisions. Pod isolation, RBAC, network policy, resource limits, and admission control all operate at the infrastructure layer — even a genuinely hardened cluster with Cilium, Kyverno, and Falco has zero visibility into model behavior, prompts, or outputs. The threat model for an LLM workload is structurally different: the workload takes untrusted input and dynamically decides what actions to take, which is not a threat model Kubernetes was designed to contain. For EKS platform teams now running inference or agentic workloads in production, this CNCF paper is the right framing to bring into your next security review — before your security team brings it to you.

✨Poolside releases Laguna XS.2 — 33B-A3B open-weight agentic coding model, Apache 2.0

Poolside shipped two models into preview: Laguna M.1, a 225B total parameter MoE model with 23B active parameters built for long-horizon agentic coding, and Laguna XS.2, a 33B-A3B open-weight release under Apache 2.0. The XS.2 is the more immediately interesting for EKS deployments — 3B active parameters means it runs on a single GPU and fits inside the same node pool sizing you’d use for Qwen3.6-35B-A3B, which it benchmarks against directly.

XS.2 went from pre-training start to fully post-trained public release in five weeks, is available on HuggingFace, runs locally via Ollama with MLX support, and both models are accessible via the Poolside API and OpenRouter free for a limited time. For platform teams evaluating self-hosted coding agents on EKS — whether for internal developer tooling or agentic CI pipelines — XS.2 is the most efficient new entry in the Apache 2.0 agentic coding space right now alongside Devstral Small.

🎉 Sponsor Section

At the moment, we don’t have a sponsor for this edition, but we look forward to working with companies and organizations that support the EKS & AI Infrastructure community in future editions. If you or your company is interested in sponsoring, please contact us at 📧 thecloudtechforall@gmail.com

📝 Words from the Author

I sat for my AWS GenAI Developer Professional exam last week. Twelfth certification.

People ask me sometimes why I keep going. Twelve in, you’d think the answer would be obvious — career, credibility, the next thing on the list. But honestly, sitting in that room, I couldn’t have told you a clean reason. There’s something about the moment after you hit submit. The screen hasn’t changed yet. You’re just sitting there with yourself — with everything you studied, everything you skipped, every concept you thought you understood until someone asked you the hard version of it. Twelve times in, that moment still makes my heart do something. I hope it always does. The day it stops feeling like something, I’ll know I’ve started doing it for the wrong reasons.

The thing no one tells you about certifications — or about any of this, really — is that they’re not about the badge. They’re about what curiosity does to you when you let it run.

I didn’t study for this exam because I had to. I studied because somewhere in the process of reading, building, breaking things, and reading again — I found something I didn’t know before. And that feeling, that specific feeling of the world getting slightly larger, never gets old. Twelve certifications in, it still doesn’t. If anything, it gets sharper. Because the more you know, the more precisely you can see what you don’t know yet. And that gap, that visible gap between where you are and where understanding could take you — that’s not a source of anxiety for me. It’s the whole point.

I think about the newsletter, the user group, the talks, the platform we’re building — none of it was planned. It grew the way things grow when you stay genuinely curious long enough. One question leads to the next. One conversation opens a door. You follow what’s interesting, not what’s strategic, and one day you look up and there’s a body of work behind you that you couldn’t have designed from the start.

I spoke at AWS Summit Bengaluru for the third time this year. Standing on that stage, I didn’t feel like someone who had arrived anywhere. I felt like someone who is very much still in the middle of something — still learning, still finding the edges of what I understand, still occasionally wrong in ways that teach me more than being right does.

The people who think they’ve arrived have usually stopped being curious. I never want to arrive. I just want to keep finding the next thing worth understanding.

That’s why I keep going. That’s why any of us keep going.

Happy Building. 😎