👋 Everything about EKS & AI Infrastructure Newsletter #68 ☁️❤👨‍💻
EKS Scaling Stories, When Your Infrastructure Becomes Your Teacher

Dear EKS & AI Infrastructure enthusiasts,
Welcome to Everything about EKS & AI Infrastructure #68.
It’s been a weird couple of weeks. I’ve been bouncing between three customer calls, a newsletter deadline that snuck up on me, and trying to convince myself that “I’ll sleep when the book is done” is a sustainable life strategy. Spoiler: it’s not.
But you know what kept me sane? The EKS community keeps shipping. Real stories. Real problems. Real solutions. Not blog posts written by marketing teams — actual operators documenting what breaks and how they fixed it.
This edition is packed with that. CoreDNS stress-testing that shows you your own breaking point before production teaches it to you. SRE agents that learn from your incidents so you’re not repeating the same troubleshooting dance every quarter. Cost optimization patterns that actually stick because they’re baked into how you run clusters.
The through-line? EKS is growing up. It’s not just “deploy your containers” anymore — it’s about knowing your infrastructure so well that nothing surprises you.
Let’s dig in.
- Performance Engineering in Modern AI Systems 🌩️
🌩️When CoreDNS Becomes the Bottleneck: A Stress-Testing Story on EKS with kube-burner — Aadhith
Upload/download-heavy workloads on EKS generate thousands of simultaneous DNS queries — S3 bucket endpoints, internal service names, retry loops with cache-busting TTLs all funneling through a centralized CoreDNS. The cluster looks healthy until traffic spikes, then you get non-deterministic failures that look like network issues but trace back to DNS timeouts and CoreDNS OOMKills. The real problem: nobody has benchmarked their actual DNS saturation point, so every incident is a surprise.
Aadhith tackles this by stress-testing CoreDNS with kube-burner: 250 pods running parallel nslookup workers against external names (s3.amazonaws.com), in-cluster names (kubernetes.default.svc), and cache-busting random UUIDs to simulate bad ndots behavior. The results are quantified: at 3,000 req/s, P99 DNS latency hits 1.71 seconds, CoreDNS CPU spikes to 0.633 cores, and you have the exact Grafana curve and run UUID to prove it. The fix is NodeLocal DNSCache (a DaemonSet caching agent on every node), which drops CoreDNS query load dramatically and serves cache hits with sub-millisecond latency. For S3-heavy or microservice-dense workloads, this is the highest-leverage EKS optimization that requires zero application changes. The broader lesson is to treat benchmarks like code: version your config.yaml, archive results per UUID, run before and after upgrades, and build dashboards so you recognize degradation before it reaches production.
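To make the query mix and the P99 number concrete, here is a minimal Python sketch. It is not Aadhith's kube-burner setup. The function names, the `example.invalid` suffix for the random names, and the nearest-rank percentile method are all assumptions chosen for illustration.

```python
import math
import time
import uuid
from typing import Callable, Iterable, List


def build_query_names(n_random: int) -> List[str]:
    """Return the three kinds of names the stress test mixes together."""
    names = ["s3.amazonaws.com", "kubernetes.default.svc"]
    # Random, never-seen names cannot be served from cache, so each lookup
    # has to be answered upstream. The ".example.invalid" suffix is a
    # hypothetical placeholder, not taken from the original post.
    names += [f"{uuid.uuid4()}.example.invalid" for _ in range(n_random)]
    return names


def percentile(samples: Iterable[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample that has at least
    pct percent of all samples at or below it."""
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("no samples")
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def time_lookups(names: Iterable[str],
                 resolve: Callable[[str], object]) -> List[float]:
    """Time each lookup in seconds using the supplied resolver function."""
    latencies = []
    for name in names:
        start = time.perf_counter()
        try:
            resolve(name)
        except OSError:
            # Failed lookups (NXDOMAIN, timeouts) still cost time, so they
            # are recorded rather than dropped.
            pass
        latencies.append(time.perf_counter() - start)
    return latencies
```

Passing `socket.gethostbyname` as `resolve` and calling `percentile(latencies, 99)` gives a rough single-client P99. A real saturation test needs many parallel workers, which is the part kube-burner handles.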
- Starred Content ⭐
⭐LLM on EKS: Serving with vLLM — Daniel Pepuho
Running your own LLM in production sits between MLOps and platform engineering: you need GPU scheduling that doesn’t waste capacity, memory constraints that force hard choices, and cost discipline so a g4dn.xlarge doesn’t become a g4dn.12xlarge. Daniel builds Llama 3.1 8B (AWQ quantized) on EKS with vLLM, AWS CDK for infrastructure, and Streamlit for the UI. The quantization and instance choice aren’t accidents — 16GB of VRAM on a g4dn.xlarge is a real constraint, and AWQ quantization lets you run an 8B model without bleeding performance.
The post makes the tradeoff reasoning visible: why EKS over a standalone EC2 instance (cluster elasticity, workload isolation via taints/tolerations), why vLLM over raw transformers (batch processing, continuous batching, token streaming), and why quantization matters when your GPU memory is fixed. If you’re evaluating where to host your first LLM or building a self-hosted inference platform on EKS, this is the starting point.
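The memory argument is easy to check with back-of-the-envelope arithmetic. The sketch below assumes a round 8 billion parameters and 4-bit AWQ weights. Neither figure is stated in Daniel's post, and the calculation ignores quantization metadata, activations, and the KV cache.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory for the model weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 8e9     # assumption: "8B" taken as a round 8 billion parameters
GPU_VRAM_GB = 16   # the g4dn.xlarge figure quoted above

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # unquantized half precision
awq4_gb = weight_memory_gb(N_PARAMS, 4)   # assumption: 4-bit AWQ weights

print(f"FP16 weights: {fp16_gb:.1f} GB")
print(f"AWQ 4-bit weights: {awq4_gb:.1f} GB")
print(f"Left for KV cache and runtime: {GPU_VRAM_GB - awq4_gb:.1f} GB")
```

Under these assumptions the FP16 weights alone would fill the whole card, while the quantized weights leave most of the 16 GB free for the KV cache that vLLM's batching depends on.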
⭐Maximizing Value with Amazon EKS Auto Mode: Strategies for Visibility, Control, and Optimization
Platform engineering teams spend weeks per month managing cluster maintenance, capacity planning, OS updates, and AMI versioning—operational drag that bleeds budget and focus. EKS Auto Mode, powered by Karpenter, strips this away: automatic compute provisioning, dynamic node consolidation, health management, and OS patching happen without manual intervention. But Auto Mode costs money (a management fee on top of EC2 pricing), and the real win lives in how you instrument visibility and governance around it.
Goutham Annem and Hevert Brito walk through the full cost architecture: tagging strategies for cost allocation by namespace/team, consolidation settings that prevent overpaying for idle nodes, using Spot Instances for up to 90% savings, and right-sizing HPAs so Auto Mode scales to actual demand, not just unschedulable pods. The disruption budget configuration is sharp—overly conservative settings leave money on the table. For teams running non-production environments (dev/test/QA), they show how to scale to zero nodes on weekends via custom Node Pools. This is reference-grade cost engineering.
When a namespace gets deleted accidentally or a cluster upgrade fails, rebuilding by hand means recreating every deployment, service, and PVC from scratch. Velero solves this: it reads Kubernetes object definitions through the Kubernetes API (not from etcd directly, which EKS does not expose) and stores them in S3, and it backs up persistent volume data as EBS snapshots. The API-driven approach means you scope backups to namespaces, resource types, or labels rather than taking monolithic cluster-wide blobs. Cross-cluster restores and namespace remapping work out of the box, which makes Velero the pattern for disaster recovery on EKS.
The walkthrough by Sapeksh and Shalabh replaces Velero’s default cluster-admin binding with a least-privilege ClusterRole, uses EKS Pod Identity for credential management (no secrets management overhead), and demonstrates namespace mapping during restore (backing up myprimary, restoring to myrestore with PVC data intact). The key operational detail: set the disruption/consolidation budgets carefully in your Node Pools, or else Velero operations might get evicted mid-backup. For stateful workloads on EKS, this is essential reading.
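The namespace remapping is easier to picture as a manifest. Here is a hedged Python sketch that builds a Velero `Restore` object for the myprimary to myrestore case. The backup name `myprimary-backup` and the helper name `build_restore` are hypothetical, and the `velero` namespace assumes a default install.

```python
import json


def build_restore(backup_name: str, source_ns: str, target_ns: str) -> dict:
    """Build a Velero Restore manifest that restores one namespace
    under a different name."""
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Restore",
        "metadata": {
            "name": f"{backup_name}-to-{target_ns}",
            "namespace": "velero",  # assumption: default install namespace
        },
        "spec": {
            "backupName": backup_name,
            "includedNamespaces": [source_ns],
            # Objects backed up from source_ns are recreated in target_ns.
            "namespaceMapping": {source_ns: target_ns},
            # Also restore persistent volumes from their snapshots.
            "restorePVs": True,
        },
    }


# "myprimary-backup" is a hypothetical backup name used for illustration.
manifest = build_restore("myprimary-backup", "myprimary", "myrestore")
print(json.dumps(manifest, indent=2))
```

The printed JSON can be applied with kubectl like any other manifest. The Velero CLI expresses the same mapping through the `--namespace-mappings` flag of `velero restore create`.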
When your EKS clusters are spread across a dozen AWS accounts and multiple Regions, an incident means switching consoles, hunting log groups, and manually correlating metrics while customers experience degraded service. Multi-account monitoring fragments visibility exactly when you need it most. This post walks you through a hub-and-spoke architecture that unifies Container Insights and CloudWatch data without requiring existing infrastructure changes.
The solution has three layers: the EKS Dashboard for organization-wide strategic visibility (cluster health, upgrade readiness, cost projections updated every 12 hours); CloudWatch cross-account observability to replicate metrics, logs, and traces into a central monitoring account within the same Region (enabling queries without role assumption); and cross-account cross-Region dashboards that assume IAM roles to query data on-demand across Regions. Together, they eliminate context switching during incidents and shift capacity planning from reactive to preventive. The two-way authorization model (sinks in the monitoring account, links in source accounts) enforces explicit consent and maintains account isolation for compliance auditing.
- Announcements 📢
📢Canva achieves 58% cost savings on GPU inference with Amazon EKS Hybrid Nodes
EKS Hybrid Nodes let you extend the EKS control plane to on-premises infrastructure, which means you can manage GPU workloads across cloud and datacenter from a single Kubernetes API — no separate on-prem cluster to operate. Canva’s platform serves 265M monthly active users, and they needed to scale GPU capacity for AI workloads without fragmenting their orchestration layer. Hybrid Nodes solved it: they achieved 58% cost savings, rapid scaling during peak demand, and improved GPU utilization, all while keeping their internal apps unchanged.
The implementation detail that matters: Canva’s Runtime Platform team didn’t have to rewrite their platform or change workload placement logic — the hybrid nodes integrated directly into their existing EKS cluster as additional capacity. That’s the power of the model: it’s not a new operational paradigm, it’s an extension of what you already run. For teams managing GPU workloads at scale or exploring on-prem + cloud burstability, this is a reference architecture worth studying.
- Community & Career 🤝
🤝Building a SRE Agent for Amazon EKS with Amazon Bedrock AgentCore — Tolgahan Demirbaş
SRE troubleshooting today still requires context-switching: a human runs kubectl get pods, then checks Prometheus, then digs through CloudWatch logs, correlating signals manually. Tolgahan’s SRE agent changes that by embedding Claude 3.5 Sonnet into a Bedrock AgentCore runtime with Model Context Protocol (MCP) gateways connecting to Lambda-based tool integrations for Kubernetes, Prometheus, and CloudWatch. Instead of manual clicking, you ask the agent: “Find pods in CrashLoopBackOff and get their previous logs” or “Check if payment-service is meeting its 99.9% SLO target,” and it routes the request through the right tools, correlates the data, and returns actionable diagnosis with error budget status and recommendations.
The architecture is production-hardened: Cognito JWT auth on the gateway, VPC-scoped Lambda, least-privilege RBAC on the cluster, and AgentCore Memory that learns incident patterns across runs so the agent improves over time. The implementation covers the full stack (pod diagnostics, golden signals such as latency percentiles and error rates, SLO/SLI calculations with error budget logic, and CloudWatch Logs Insights queries), all orchestrated by the agent without human routing. The full source code is on GitHub and Terraform-ready. It is exactly the kind of operational automation that shifts your team from reactive firefighting to proactive pattern recognition.
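For readers new to error budgets, the arithmetic behind a question like "is payment-service meeting its 99.9% SLO target" can be sketched in a few lines of Python. This is not Tolgahan's code. The `ErrorBudget` type, the function name, and the request-count inputs are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    allowed_failures: float   # failures the SLO tolerates in this window
    consumed_fraction: float  # 1.0 means the budget is exactly used up
    slo_met: bool


def error_budget(total_requests: int, failed_requests: int,
                 slo_target: float = 0.999) -> ErrorBudget:
    """Compute error budget status for a request-based availability SLO."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed = total_requests * (1 - slo_target)
    availability = 1 - failed_requests / total_requests
    if allowed > 0:
        consumed = failed_requests / allowed
    else:
        # A 100% target leaves no budget: any failure exhausts it.
        consumed = 0.0 if failed_requests == 0 else float("inf")
    return ErrorBudget(allowed, consumed, availability >= slo_target)


status = error_budget(total_requests=1_000_000, failed_requests=400)
print(f"Budget consumed: {status.consumed_fraction:.0%}, "
      f"SLO met: {status.slo_met}")
```

With one million requests and a 99.9% target, roughly 1,000 failures are tolerated, so 400 failures use about 40% of the budget.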
🤝Global Capacity Orchestrator (GCO) on AWS — One API, Every Accelerator, Any Region
Most ML and inference teams think in regional silos: pick a cluster, pick a region, pick an endpoint. But accelerator capacity is fragmented across regions, which means you either overprovision locally or build a custom routing layer yourself. Jacob Mevorach spent two years building Global Capacity Orchestrator to solve this: one REST API and CLI to submit jobs, route work toward available capacity across regions, and deploy inference endpoints with automatic failover. It deploys EKS Auto Mode clusters, connects them with Global Accelerator, and handles capacity-aware scheduling for GPUs, Trainium, Inferentia, and Graviton workloads.
The architecture is clean: multi-region job submission, spot instance fallback, persistent outputs via EFS/FSx for Lustre, built-in CloudWatch observability, and one-command deploy/teardown. The MCP server with 44 tools is a nice touch — agents can learn and query the entire codebase. If you’re building a multi-region ML platform or batch orchestration layer, this is reference-grade code to study.
🤝eksup — EKS Cluster Upgrade Guidance Tool by ClowdHaus
Kubernetes releases a new minor version every four months, and EKS supports each for 14 months, which means you’re upgrading at least once a year. But each upgrade is different: new team members, deprecated APIs, removed in-tree components, component changes (kube-dns → CoreDNS, Docker → containerd), and shifts in your own architecture all create unique risk profiles per cycle. Without a repeatable framework, every upgrade feels like firefighting.
ClowdHaus’ eksup analyzes your cluster against the next Kubernetes version, surfaces findings (informational, recommended, or required), and generates a human-editable playbook tailored to your specific cluster. It’s intentionally read-only — you modify your infrastructure through Terraform, eksctl, or CloudFormation — but it gives your team the information needed to answer: “What will break on upgrade?” before you learn the hard way. The playbooks become historical artifacts that improve your process each cycle. Written in Rust, installable via Homebrew or cargo, and battle-tested on real cluster upgrades, this is the kind of unglamorous but indispensable infrastructure that operators ship once and use forever.
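To show the kind of check that answers "what will break on upgrade", here is a small Python sketch that flags resources still using API versions removed by the target Kubernetes release. It does not represent how eksup is implemented (eksup is written in Rust and checks far more). The table lists only a few well-known removals, and the function names and input shape are hypothetical.

```python
from typing import Dict, List, Tuple

# A few API removals from upstream Kubernetes, keyed by (apiVersion, kind).
# Deliberately incomplete: just enough to illustrate the idea.
REMOVED_APIS: Dict[Tuple[str, str], str] = {
    ("extensions/v1beta1", "Ingress"): "1.22",
    ("policy/v1beta1", "PodSecurityPolicy"): "1.25",
    ("batch/v1beta1", "CronJob"): "1.25",
    ("autoscaling/v2beta2", "HorizontalPodAutoscaler"): "1.26",
}


def parse_version(version: str) -> Tuple[int, int]:
    """Turn '1.25' into (1, 25) so versions compare numerically."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def find_removed_apis(resources: List[dict], target_version: str) -> List[str]:
    """Return one 'required' finding per resource whose API version is
    gone in target_version."""
    target = parse_version(target_version)
    findings = []
    for res in resources:
        removed_in = REMOVED_APIS.get((res["apiVersion"], res["kind"]))
        if removed_in and parse_version(removed_in) <= target:
            findings.append(
                f"required: {res['kind']}/{res['name']} uses "
                f"{res['apiVersion']}, removed in {removed_in}"
            )
    return findings


# Hypothetical inventory of resources found in a cluster.
inventory = [
    {"apiVersion": "batch/v1beta1", "kind": "CronJob", "name": "nightly"},
    {"apiVersion": "batch/v1", "kind": "CronJob", "name": "hourly"},
]
for finding in find_removed_apis(inventory, "1.25"):
    print(finding)
```

Like eksup itself, the sketch is read-only: it reports findings and leaves the fix to whatever tool manages your manifests.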
- Highlights ✨
✨Unlock GenAI Inference Anywhere with Amazon EKS Hybrid Nodes — ARC303, AWS Summit Sydney 2026
EKS Hybrid Nodes let you run Kubernetes pods on premises while keeping the control plane in AWS, which means you can push inference workloads closer to specialized hardware (GPUs, custom accelerators, low-latency infrastructure) without managing a separate on-prem Kubernetes cluster. For teams running GenAI inference at scale, this unlocks a critical pattern: burst to cloud when on-prem capacity is saturated, keep latency-sensitive inference local, and manage it all from a single EKS control plane.
The reference implementation walks through the architecture, tenant isolation, and operational mechanics of a hybrid inference layer. Frank Fan and Sheng Chen built an end-to-end demo showing how workload placement across hybrid nodes works in practice — the kind of hands-on session that’s worth studying if you’re evaluating on-prem + cloud GPU strategies or building a multi-cluster inference platform.
✨ECS vs. EKS vs. Lambda: The Decision Framework — Dheeraj Choudhary
ECS vs. EKS is rarely a technical question; it’s a bet on operational complexity and multi-cloud strategy. Dheeraj frames it clearly: ECS for AWS-first teams that want fast time-to-deployment and tight AWS integration (start here if you’re new), EKS for multi-cloud strategies and microservice architectures where you need Kubernetes portability, Lambda for event-driven workloads where you pay per invocation and don’t manage infrastructure. The framing, “start with ECS, graduate to EKS only when complexity demands it,” is honest and matches how teams actually grow.
🎉 Sponsor Section
At the moment, we don’t have a sponsor for this edition, but we look forward to working with companies and organizations that support the EKS & AI Infrastructure community in future editions. If you or your company is interested in sponsoring, please contact us at 📧 thecloudtechforall@gmail.com
📝 Words from the Author
I watched a street dog the other day. Just sitting there, completely unbothered, while the entire city moved around it. Honking. Chaos. Rush hour madness. And this dog? Just... existed. Peaceful. Content. No performance anxiety.
That dog had it figured out.
I’ve been thinking about this a lot lately — about how we’re all constantly in “go mode.” Ship faster. Build bigger. Learn more. Get certified. Speak at more conferences. Write the book. Grow the community. And somewhere in that noise, you forget why you started in the first place.
The weirdest part? I think that’s when the best work actually happens. Not when you’re grinding the hardest, but when you step back and remember that you genuinely like solving problems. That you actually enjoy the people you work with. That shipping something imperfect but real beats shipping nothing while waiting for perfect.
So this edition exists because people sent me incredible stories. Not because I chased them down. Not because I had some master plan. Just... people being generous with their knowledge. And that generosity is contagious.
Maybe the lesson is: be like that street dog. Do your thing. Stay unbothered. Let the good stuff come to you.
(Also I’m getting better at taking breaks. Key word: getting. 😅)
Happy Building. 😎



