👋 Everything about EKS & AI Infrastructure Newsletter #65 ☁️❤👨‍💻

Dear EKS & AI Infrastructure enthusiasts,
Welcome to Everything about EKS & AI Infrastructure #65.

Big week — and a big one personally too.

Kubernetes v1.36 shipped on Wednesday with 70 enhancements. Two removals need your attention before you upgrade: the gitRepo volume plugin is gone, and so is IPVS mode in kube-proxy. The Ingress-NGINX retirement is the bigger forcing function running alongside it: no more security patches, and the clock is ticking on migration to Gateway API.

On the AI infra side: SageMaker HyperPod gets flexible instance groups with Karpenter-native fallback, seven new IAM condition keys bring real SCP-enforced governance to EKS cluster creation, and Claude Platform on AWS removes the last procurement blocker for enterprise teams already inside AWS.

Performance Engineering is dense this week. Red Hat AI and Tesla published production results: 3x output tokens/sec with prefix-cache aware routing on vLLM. DeepSeek open-sourced Tile Kernels, GPU kernels built to approach hardware limits. And Moonshot AI dropped FlashKDA, with a 1.72x–2.22x prefill speedup as a drop-in backend swap.

I was also at AWS Summit Bengaluru this week. Spoke for the third time on that stage. Got stopped unexpectedly at a chip display. Had the conversation that made the whole trip worth it.

More on that at the end.

Let’s get into it. 👇

Performance Engineering in Modern AI Systems 🌩️

🌩️3x output tokens/sec with KServe + llm-d + vLLM prefix-cache aware routing — Red Hat AI + Tesla

Tesla and Red Hat AI’s engineering teams published a joint blog on solving LLM inference at production scale — three specific problems: model weights saturating storage, GPU cycles wasted on load balancers unaware of KV cache state, and node failure handling that fights the infrastructure instead of working with it. The stack they landed on is KServe for serving orchestration, llm-d for distributed inference coordination, and vLLM underneath, with prefix-cache aware routing as the key differentiator.

The routing piece is where the performance comes from. Naive load balancing sends requests to whichever replica has capacity — but if a replica already has the relevant prefix cached, routing there avoids recomputing it entirely. At scale, that cache hit rate compounds: 3x improvement in output tokens per second and 2x faster time to first token in production. For EKS teams running multi-replica vLLM deployments, this is the clearest published evidence that routing intelligence — not just more GPUs — is the next lever to pull on inference performance. The full blog has the architecture walkthrough worth reading before you reach for more p5 nodes.
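
To make the routing idea concrete, here is a minimal toy sketch of prefix-aware replica selection. It is not llm-d's scorer or vLLM's cache logic: the Replica class, the 4-token block size, and the tie-break on in-flight requests are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    in_flight: int = 0  # requests currently being served (assumed load signal)
    cached_prefixes: set = field(default_factory=set)  # block-aligned token prefixes


def longest_cached_prefix(replica, tokens, block=4):
    """Length of the longest block-aligned prefix of `tokens` cached on `replica`."""
    best = 0
    for end in range(block, len(tokens) + 1, block):
        if tuple(tokens[:end]) in replica.cached_prefixes:
            best = end
        else:
            break  # a longer prefix cannot be cached if a shorter one is missing
    return best


def pick_replica(replicas, tokens, block=4):
    """Prefer the longest cached prefix; break ties by the lowest load."""
    return max(
        replicas,
        key=lambda r: (longest_cached_prefix(r, tokens, block), -r.in_flight),
    )


def record(replica, tokens, block=4):
    """After serving a request, remember each block-aligned prefix of it."""
    for end in range(block, len(tokens) + 1, block):
        replica.cached_prefixes.add(tuple(tokens[:end]))


replicas = [Replica("a", in_flight=3), Replica("b", in_flight=1)]
system_prompt = list(range(8))  # shared 8-token prefix, e.g. a system prompt
record(replicas[0], system_prompt + [100, 101, 102, 103])

warm = pick_replica(replicas, system_prompt + [200, 201, 202, 203])
cold = pick_replica(replicas, [500, 501, 502, 503])
print(warm.name, cold.name)
```

The warm request goes to the busier replica because it already holds the shared prefix, while the cold request falls back to plain least-loaded selection. A real router would also have to weigh queue depth against cache benefit rather than always preferring the cache hit.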

🌩️ DeepSeek releases Tile Kernels — GPU kernels approaching hardware performance limits

DeepSeek shipped Tile Kernels, a set of optimized GPU kernels for LLM operations built with TileLang. The claim is direct: most kernels in this project approach the limits of hardware performance in terms of both compute intensity and memory bandwidth — and several are already running in DeepSeek’s internal training and inference pipelines. That last detail matters. These aren’t research artifacts; they’re production-validated kernels being open-sourced.

For EKS teams running GPU inference on p4d or p5 instances, kernel efficiency is the layer between raw GPU hardware and actual throughput. The difference between a well-optimized kernel and a naive implementation on an H100 can be the gap between needing two nodes and needing one. TileLang as the underlying DSL is worth understanding separately — it’s designed to make writing high-performance tile-based GPU programs more tractable than raw CUDA, which has implications for teams building custom inference infrastructure rather than relying entirely on vLLM or TensorRT-LLM defaults.

🌩️ FlashKDA — open-source CUTLASS implementation of Kimi Delta Attention kernels

Moonshot AI open-sourced FlashKDA, a CUTLASS-based implementation of Kimi Delta Attention kernels delivering 1.72x–2.22x prefill speedup over the flash-linear-attention baseline on H20 GPUs. It’s a drop-in backend for flash-linear-attention, which means the adoption path is low-friction: no architecture changes, just swap the backend and measure.

Prefill is where large context requests spend most of their time, and it’s the phase that most directly affects time to first token at scale. A 1.72x–2.22x speedup on H20 hardware without model changes is the kind of result that changes node capacity math on EKS inference deployments: fewer GPUs needed for the same prefill throughput, or the same GPU count handling significantly more concurrent long-context requests. Worth benchmarking against your workload before the next GPU procurement conversation.
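
As a back-of-the-envelope illustration of that capacity math, the sketch below assumes a fleet that is entirely prefill-bound and a hypothetical baseline of 16 GPUs. Neither assumption comes from the FlashKDA release; real fleets also spend time in decode, so treat the output as an upper bound on savings.

```python
import math


def gpus_needed(baseline_gpus, prefill_speedup):
    """GPUs required for the same prefill throughput after a kernel speedup."""
    return math.ceil(baseline_gpus / prefill_speedup)


BASELINE_GPUS = 16  # hypothetical fleet size, for illustration only

for speedup in (1.72, 2.22):  # the reported range on H20
    print(f"{speedup}x -> {gpus_needed(BASELINE_GPUS, speedup)} GPUs")
```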

Starred Content ⭐

⭐Containers from the Couch: EKS platform tooling and AI-driven incident response

The first episode covers three tools that are converging on EKS platform teams: ArgoCD handles GitOps reconciliation, ACK brings AWS resources — RDS, S3, IAM — into the Kubernetes control plane as native objects, and KRO sits on top letting you define higher-level resource compositions that bundle them together. The practical payoff is that a developer can submit a single Kubernetes manifest that provisions the app, the database, and the IAM role in one shot, all Git-tracked, all reconciled by the same control loop. No Terraform side-channel, no manual AWS console steps.

The second episode puts AWS DevOps Agent on an EKS-backed production scenario — a checkout latency spike — and walks through the agent querying CloudWatch metrics, application logs, distributed traces, and recent deployment history simultaneously rather than sequentially. The MCP layer is what makes this composable: any tool with an MCP server (Datadog, Splunk, PagerDuty) connects to the agent without custom integration work. RMIT University and CyberArk both appear as customer examples with measurable MTTR improvements. For EKS platform and SRE teams, the combination of these two episodes maps roughly to the full lifecycle — build the platform right, then let the agent handle what breaks at 2am.

⭐5,000 pods/second and 60% utilization with Gödel and Katalyst — KubeFM

The default Kubernetes scheduler is a single-process, pod-centric system. At ByteDance’s scale — 20,000 nodes, 1 million pods — that design hits a ceiling. Yue Yin walks through how ByteDance built Gödel, a distributed scheduler with a dispatcher-scheduler-binder split, where each scheduler shard handles 2,000 pods/second independently and multiple shards together reach 5,000 pods/second in production.

The more interesting half is Katalyst, the node-level resource management system that makes co-location actually work. Gödel needs NUMA-level visibility into what online workloads are currently consuming before it can safely place offline jobs on the same node. Without Katalyst feeding that real-time resource picture upward, the unified resource pool doesn’t exist — and ByteDance’s CPU utilization stays at 30% instead of 60%. The two systems are complementary, not interchangeable.

Both Gödel and Katalyst are open source under the kubewharf org. If you’re building on EKS and hitting scheduler throughput limits for batch/ML workloads, or trying to make online/offline co-location work without overprovisioning, this episode is worth the time.

Announcements 📢

📢Kubernetes v1.36 — Haru — released April 22

Kubernetes v1.36 shipped on April 22 with 70 enhancements: 18 stable, 25 beta, 25 alpha. Two items require action before upgrading: the gitRepo volume plugin is removed (a security risk, since it ran code as root on the node), and IPVS mode in kube-proxy is gone (deprecated in v1.35). If you’re using either, migrate before you touch the version. The externalIPs field in the Service spec is now deprecated, with removal planned for v1.43, so start auditing now if you rely on it.
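
A pre-upgrade audit for these three items can be sketched as a pure function over manifests already loaded as dictionaries (for example from kubectl get ... -o json). The function names and the sample manifests are hypothetical; the field names checked (volumes[].gitRepo, Service spec.externalIPs, and kube-proxy mode "ipvs") are the standard Kubernetes ones.

```python
def pod_spec(obj):
    """Return the pod spec for a Pod, or for a workload with a pod template."""
    spec = obj.get("spec", {})
    if obj.get("kind") == "Pod":
        return spec
    # Deployment, StatefulSet, DaemonSet, and Job keep it under spec.template.spec
    return spec.get("template", {}).get("spec", {})


def find_upgrade_blockers(objects, kube_proxy_mode=None):
    """List uses of gitRepo volumes, externalIPs, and kube-proxy IPVS mode."""
    findings = []
    for obj in objects:
        kind = obj.get("kind", "Unknown")
        name = obj.get("metadata", {}).get("name", "<unnamed>")
        for volume in pod_spec(obj).get("volumes", []):
            if "gitRepo" in volume:
                findings.append(f"{kind}/{name}: gitRepo volume {volume.get('name')!r}")
        if kind == "Service" and obj.get("spec", {}).get("externalIPs"):
            findings.append(f"Service/{name}: externalIPs is deprecated")
    if kube_proxy_mode == "ipvs":
        findings.append("kube-proxy: mode 'ipvs' is removed")
    return findings


# Hypothetical sample objects, for illustration only
manifests = [
    {"kind": "Pod", "metadata": {"name": "builder"},
     "spec": {"volumes": [{"name": "src",
                           "gitRepo": {"repository": "https://example.invalid/repo.git"}}]}},
    {"kind": "Service", "metadata": {"name": "edge"},
     "spec": {"externalIPs": ["203.0.113.10"]}},
    {"kind": "Deployment", "metadata": {"name": "api"},
     "spec": {"template": {"spec": {"volumes": [{"name": "tmp", "emptyDir": {}}]}}}},
]

for finding in find_upgrade_blockers(manifests, kube_proxy_mode="ipvs"):
    print(finding)
```

This sketch does not descend into CronJob templates, which nest the pod spec one level deeper, so extend pod_spec if you rely on those.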

The stable graduations worth noting for EKS and AI workloads: native user namespaces (rootless containers without third-party tooling, just hostUsers: false), SELinux volume mount optimization now GA (replaces per-file relabeling with a single mount-time label, a meaningful pod startup improvement on SELinux-enforcing nodes), HPA scale-to-zero now enabled by default (needs an external metric source like KEDA to scale back up), and Mutating Admission Policies via CEL going stable (no more webhook server to maintain for common mutation tasks). On the AI infra side, DRA device taints/tolerations and partitionable device support were both promoted to beta and are now enabled by default.

The Ingress-NGINX retirement from March 24 is the bigger forcing function running alongside this release. No further security patches. If you haven’t started evaluating Gateway API as a replacement, v1.36 is the right moment to prioritize it.

📢Claude Platform on AWS — IAM auth, CloudTrail, and consolidated billing (coming soon)

The friction in adopting Anthropic’s Claude Platform for enterprise teams hasn’t been the model — it’s been the separate credential chain, the second billing relationship, and the gap in audit coverage. Claude Platform on AWS closes all three: IAM credentials replace Anthropic-specific API keys, Claude Platform activity lands in CloudTrail alongside every other AWS service, and usage bills through the existing AWS account. No new contracts, no separate Anthropic account required.

The distinction from Amazon Bedrock is worth being precise about: Bedrock gives you Claude model access through AWS’s managed inference layer. Claude Platform on AWS gives you Anthropic’s native platform experience — the same APIs, console, and feature set as working with Anthropic directly — but surfaced through AWS identity and billing primitives. For platform teams that have been holding off on Claude adoption because of security team pushback on a second SaaS relationship, this removes the blocker cleanly.

📢 SageMaker HyperPod now supports flexible instance groups

Teams running training or inference on HyperPod with EKS have had to create a separate instance group per instance type per AZ — meaning every capacity fallback, patch cycle, and scaling event multiplied across N groups. Flexible instance groups collapse this into a single group with an ordered InstanceRequirements list and multiple subnets, so HyperPod tries the highest-priority instance type first and falls back automatically when capacity isn’t available.
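
The fallback behaviour described above can be sketched as an ordered search. This is an assumption-laden toy, not HyperPod's implementation: the place function, the capacity callback, and the choice to try every subnet for one instance type before moving to the next are all hypothetical.

```python
def place(instance_priority, subnets, has_capacity):
    """Return the first (instance_type, subnet) pair with capacity, or None."""
    for instance_type in instance_priority:  # highest priority first
        for subnet in subnets:
            if has_capacity(instance_type, subnet):
                return instance_type, subnet
    return None


# Hypothetical capacity snapshot: only the lower-priority type is available
available = {("ml.p4d.24xlarge", "subnet-b")}

choice = place(
    ["ml.p5.48xlarge", "ml.p4d.24xlarge"],
    ["subnet-a", "subnet-b"],
    lambda instance_type, subnet: (instance_type, subnet) in available,
)
print(choice)
```

The operational win is that this loop lives in one group definition instead of being reimplemented across N separate instance groups.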

The Karpenter integration is the sharper detail here: when you use Karpenter autoscaling against a flexible instance group, Karpenter reads the supported instance types directly from the group definition and picks the optimal type and AZ per pod’s requirements — no separate NodePool per instance type needed. For teams already using Karpenter on EKS and now adopting HyperPod, this aligns the provisioning model rather than forking it.

📢 EKS adds 7 IAM condition keys for cluster governance via SCPs

AWS added seven new IAM condition keys for EKS cluster APIs — covering private endpoints, KMS encryption, Kubernetes version restrictions, deletion protection, and zonal shift. The practical value is that these work with AWS Organizations SCPs, so platform teams can deny non-compliant cluster creation at the org level rather than auditing after the fact.
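
For a feel of what such a guardrail looks like, here is a small sketch that builds a deny-style SCP document. The policy structure (Version, Effect, Action, Resource, Condition) is standard IAM JSON, but the condition key in the usage line is a deliberate placeholder, not one of the seven real key names. Look those up in the EKS documentation before using anything like this.

```python
import json


def deny_create_cluster_when(operator, condition_key, value):
    """Build an SCP that denies eks:CreateCluster when the condition matches."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyNonCompliantEksClusters",
                "Effect": "Deny",
                "Action": ["eks:CreateCluster"],
                "Resource": "*",
                "Condition": {operator: {condition_key: value}},
            }
        ],
    }


# PLACEHOLDER key name, for illustration only. Substitute a real EKS condition key.
policy = deny_create_cluster_when("Bool", "eks:PLACEHOLDER_CONDITION_KEY", "false")
print(json.dumps(policy, indent=2))
```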

Previously, enforcing baseline cluster standards across accounts meant documentation, manual review, or custom Config rules. Now it’s a policy condition. For anyone running multi-account EKS at scale, this is worth wiring into your SCPs sooner rather than later.

Community & Career 🤝

🤝LLM Infra Planner — open-source GPU/VRAM sizing tool for LLM deployments (by Kishan Khatrani)

Before you spin up a p4d or g5 on EKS, you need to know whether your model actually fits — and whether it fits with enough headroom for KV-cache at your target concurrency. Most teams either over-provision out of caution or discover the answer at runtime. Kishan’s LLM Infra Planner is a browser-based, no-login calculator that estimates GPU memory requirements across model sizes (7B to 70B+) and quantization levels (FP16, INT8, INT4) before any hardware or cloud spend happens.

The tool is open-source and built to be extended — the repo accepts PRs for new models and GPU types, which makes it useful as a community-maintained reference rather than a point-in-time estimate. For platform teams sizing node pools for inference workloads on EKS, this fills the gap between “the model card says X GB” and “what does that actually mean for concurrent requests on a g5.12xlarge.”
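
The arithmetic behind that gap is simple enough to sketch. The following is a rough estimate, not the LLM Infra Planner's actual formula: it counts weights plus KV-cache only, reserves an assumed 10% of GPU memory for everything else, and uses an illustrative shape for a 7B-class model with full multi-head attention (32 layers, 32 KV heads, head dimension 128).

```python
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}
GIB = 2**30


def weights_gib(params_billion, quant):
    """Memory for model weights at the given quantization level."""
    return params_billion * 1e9 * BYTES_PER_PARAM[quant] / GIB


def kv_cache_gib(layers, kv_heads, head_dim, seq_len, concurrency, kv_bytes=2):
    """KV-cache memory: keys and values for every layer, token, and request."""
    return 2 * layers * kv_heads * head_dim * seq_len * concurrency * kv_bytes / GIB


def fits(gpu_gib, params_billion, quant, overhead=0.10, **kv_shape):
    """Return (fits?, required GiB), reserving `overhead` of GPU memory."""
    need = weights_gib(params_billion, quant) + kv_cache_gib(**kv_shape)
    return need <= gpu_gib * (1 - overhead), need


# Illustrative shape and load, both assumptions
shape = dict(layers=32, kv_heads=32, head_dim=128, seq_len=4096, concurrency=8)

for quant in ("fp16", "int4"):
    ok, need = fits(24, 7, quant, **shape)  # one 24 GiB GPU
    print(f"7B {quant}: needs {need:.1f} GiB, fits on 24 GiB: {ok}")
```

With these inputs the KV-cache alone is 16 GiB, more than the FP16 weights, which is exactly the part the model card does not tell you.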

Highlights ✨

✨ EKS Hybrid Nodes gateway — VXLAN bridging for cloud-to-on-prem pod connectivity

The three connectivity problems that made EKS Hybrid Nodes painful in production were: the control plane couldn’t reach admission webhooks running on-prem, ALBs and NLBs had no path to on-prem pod IPs, and pod-to-pod traffic across cloud and on-premises required manual route management. The EKS Hybrid Nodes gateway addresses all three — it deploys two gateway pods on EC2 nodes in your VPC, establishes VXLAN tunnels (VNI 2, UDP 8472) to Cilium-managed hybrid nodes on-prem, and automatically updates VPC route tables as nodes join or leave.
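
For anyone debugging this with a packet capture, the sketch below builds and parses the 8-byte VXLAN header as defined in RFC 7348, using the VNI and UDP port quoted above. It shows the generic wire format only and is not taken from the gateway's own code; the function names are hypothetical.

```python
import struct

VXLAN_UDP_PORT = 8472  # port quoted above; the IANA-assigned VXLAN port is 4789
VNI = 2                # VXLAN Network Identifier quoted above
I_FLAG = 0x08 << 24    # "VNI present" flag in the first 32-bit word


def vxlan_header(vni):
    """Pack the 8-byte VXLAN header: flags word, then 24-bit VNI plus reserved byte."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    return struct.pack("!II", I_FLAG, vni << 8)


def parse_vni(header):
    """Extract the VNI from a VXLAN header, checking that the I flag is set."""
    flags_word, vni_word = struct.unpack("!II", header[:8])
    if not flags_word & I_FLAG:
        raise ValueError("VNI flag not set")
    return vni_word >> 8


header = vxlan_header(VNI)
print(header.hex(), parse_vni(header))
```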

The hard constraint worth flagging to any team evaluating this: Cilium with VTEP support is the only CNI supported, and the VXLAN tunnels are unencrypted. If your on-prem connectivity runs over Direct Connect or a VPN that already encrypts the underlay, the unencrypted VXLAN is likely acceptable. If not, MACsec over Direct Connect is the recommended layering. One gateway deployment per cluster also means this isn’t horizontally scalable today — worth sizing the EC2 gateway nodes accordingly for high-traffic hybrid workloads.

✨How Much Do GPU Clusters Really Cost? — SemiAnalysis

Price-per-GPU-hour is the number everyone negotiates on and the number that matters least for actual spend. SemiAnalysis breaks down GPU cluster TCO across seven cost dimensions — compute, storage, networking, control plane, support, goodput loss from downtime, and engineering setup time — and shows that two providers at identical GPU pricing can diverge by 5–15% in real TCO on large training workloads. For fault-tolerant workloads like single-node inference, that gap collapses to near zero, which puts a dollar figure on the intuition that reliability premiums are workload-dependent.

The goodput framing is the sharpest part. A cluster with a 95% uptime SLA can contractually absorb 5% downtime with no credits, but the actual cost of that downtime compounds through checkpoint frequency, blast radius per failure, and job restart overhead. Their free Cluster TCO Calculator and Goodput Calculator let you plug in your own failure rates, job sizes, and storage tiers to see what a reliability delta actually costs. For teams sizing GPU node pools on EKS, whether on HyperPod, Karpenter-managed p4d/p5 nodes, or a neocloud, this framework is the right way to think about build-vs-buy and provider tradeoffs.
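
To see how failure rate turns into dollars, here is a deliberately simplified goodput model. It is not the SemiAnalysis calculator: it assumes each failure loses, on average, half a checkpoint interval of work plus a fixed restart time, and every number in the example is hypothetical.

```python
def goodput(failures_per_day, checkpoint_interval_h, restart_h):
    """Fraction of paid hours that produce useful work under the simple model."""
    lost_per_failure_h = checkpoint_interval_h / 2 + restart_h
    return max(0.0, 1 - failures_per_day * lost_per_failure_h / 24)


def effective_price(price_per_gpu_hour, goodput_fraction):
    """Price per useful GPU-hour once lost work is accounted for."""
    if goodput_fraction <= 0:
        raise ValueError("no useful work is completed at this failure rate")
    return price_per_gpu_hour / goodput_fraction


# Two hypothetical providers at the same list price
PRICE = 2.00
provider_a = goodput(failures_per_day=1.0, checkpoint_interval_h=1.0, restart_h=0.5)
provider_b = goodput(failures_per_day=0.25, checkpoint_interval_h=1.0, restart_h=0.5)

print(f"A: goodput {provider_a:.3f}, effective ${effective_price(PRICE, provider_a):.3f}/GPU-h")
print(f"B: goodput {provider_b:.3f}, effective ${effective_price(PRICE, provider_b):.3f}/GPU-h")
```

Shortening the checkpoint interval narrows the gap between the two providers, which is why checkpoint frequency belongs in the same conversation as the SLA.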

✨AWS DevOps Agent — autonomous incident response with 75% MTTR reduction

Production incidents on EKS clusters follow a predictable pattern: an alert fires, an engineer context-switches, then spends 30–90 minutes correlating logs, metrics, and recent deploys before identifying the root cause. AWS DevOps Agent targets that middle phase with autonomous investigation that pulls context across CloudWatch, X-Ray, and deployment history without waiting for a human to start the chain. Janardhan and Joseph Alioto’s session puts numbers on it: 75% MTTR reduction, 8.5x faster resolution, and 94% root cause accuracy from customer deployments.

The 6Cs framework they present (Context, Control, Convenience, Collaboration, Continuous Learning, Cost Effectiveness) is worth reading as a design philosophy, not just a marketing checklist — it describes how the agent is constrained to investigate and recommend rather than act autonomously in ways that could cause blast radius. For platform teams running EKS workloads where on-call burden is a real cost, this is the most concrete agentic AIOps capability AWS has shipped to date.

✨Qwen3.6-27B — dense 27B model matching frontier coding benchmarks

A fully dense 27B model from Alibaba’s Qwen team, Apache 2.0, hitting flagship-level coding performance. All 27B parameters active at inference — unlike the MoE variant, nothing is sparse, which means stronger reasoning depth but a heavier memory footprint. Fits on a single high-end GPU, which makes it a realistic candidate for self-hosted coding agents on EKS without multi-node inference setup. Two Qwen releases in a single month, both open weights, both competitive with models several times their size — if you’re benchmarking local models for platform tooling or internal coding agents, this one belongs on the list alongside Qwen3.6-35B-A3B and Gemma 4.

🎉 Sponsor Section

At the moment, we don’t have a sponsor for this edition, but we look forward to working with companies and organizations that support the EKS & AI Infrastructure community in future editions. If you or your company is interested in sponsoring, please contact us at 📧 thecloudtechforall@gmail.com

📝 Words from the Author

Something I noticed at AWS Summit Bengaluru

I’ve been to a lot of conferences. After a while they start to blur — same format, same agenda, same hallway conversations that feel important in the moment and fade by Monday.

But every now and then something small pulls you out of autopilot.

I was walking past a booth — not even planning to stop — and I ended up standing there for twenty minutes. They had AWS’s own chips on display. Trainium. Graviton. Physical things. Small. Quiet. Nothing about them looks like they should matter.

I don’t know why that got me. Maybe because so much of what we do lives in dashboards and terminals and abstract layers that it’s easy to forget there’s something real underneath all of it. Something someone actually built with their hands.

I also spoke at the summit this year. Third time on that stage. I talked about AI and platforms and letting systems do work that engineers shouldn’t have to do manually. The session went well. But honestly, the talk wasn’t the point.

The point was the person who came up afterward and said, “We tried something similar and it broke in exactly this way,” and then we stood there for forty minutes figuring out why.

That’s the thing no one tells you about communities. You think you’re there to present. You’re actually there to be interrupted in the best possible way.

Happy Building! 😎