By Claudiu Bota, Sr. Solutions Architect, and Vitalii Vlasov, DevOps Engineer, Automat-it – AWS Premier Tier Partner
In today’s AI-driven technology landscape, organizations face a critical challenge: securing sufficient GPU resources while maintaining cost efficiency and operational resilience.
The explosive growth in AI development has created unprecedented demand for these specialized processors, resulting in global shortages and skyrocketing costs. The stakes are equally high on the reliability side: training jobs that run for days or weeks can be derailed by a single infrastructure failure, wiping out weeks of progress and the compute spend behind them.
The GPU Challenge in Modern AI Development
The current AI boom has transformed GPUs from specialized gaming hardware into the lifeblood of machine learning operations. Organizations building AI solutions face three interconnected challenges:
- Scarcity: Global GPU shortages make capacity planning difficult
- Cost Pressure: Increasing prices for high-performance computing resources
- Resilience Requirements: The devastating impact of failures during long-running training jobs
Consider the reality of training large-scale foundation models: orchestrating hundreds of high-end GPUs continuously for weeks, where a single hardware failure could force a complete restart, losing substantial progress and significant computational investment.
AWS recognized these pain points and responded with Amazon SageMaker HyperPod, offering dedicated GPU clusters with substantial discounts in exchange for time commitments.
HyperPod’s failure recovery mechanisms address the resilience challenge: the service automatically detects node failures, replaces faulty instances, and resumes training from the last checkpoint, protecting organizations from catastrophic losses. HyperPod also integrates with Amazon Elastic Kubernetes Service (Amazon EKS) to leverage Kubernetes orchestration capabilities, but it has one significant limitation: it lacks native autoscaling for handling peak demand.
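On EKS-orchestrated HyperPod clusters, auto-resume is requested per training job. The sketch below shows one way this looks for a Kubeflow PyTorchJob; the sagemaker.amazonaws.com annotations reflect our reading of the HyperPod documentation, and the job name, image, and replica count are illustrative, so verify the exact annotation names and placement against the current docs:

# Minimal sketch: opting a Kubeflow PyTorchJob into HyperPod auto-resume
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: llm-pretrain   # illustrative name
  annotations:
    # Annotation names per our reading of the HyperPod docs; verify for your version
    sagemaker.amazonaws.com/enable-job-auto-resume: "true"
    sagemaker.amazonaws.com/job-max-retry-count: "3"
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 8
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch   # PyTorchJob expects the primary container to be named "pytorch"
              image: nvcr.io/nvidia/pytorch:23.10-py3   # illustrative
              command: ["python", "/workspace/train.py"]

With something like this in place, a failed node is replaced and the job restarts from its most recent checkpoint rather than from scratch.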
A Hybrid Solution for Flexibility with SageMaker HyperPod and EKS
When one of our customers at Automat-it approached us with this exact challenge, we recognized an opportunity to create something innovative. They needed a solution that combined the cost-effectiveness and reliability of reserved GPU capacity with the flexibility to scale dynamically during peak periods.
We designed a hybrid architecture that leverages the strengths of both Amazon SageMaker HyperPod and Amazon EKS with Karpenter (a flexible, high-performance Kubernetes cluster autoscaler):
- SageMaker HyperPod: Provides the baseline GPU capacity with discounted pricing through long-term commitments via Flexible Training Plans, ensuring constant availability for core workloads
- Amazon EKS with Karpenter: Delivers dynamic autoscaling capabilities to handle burst workloads during peak demand periods (see the NodePool sketch after this list)
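On the EKS side, the burst tier is defined through Karpenter NodePools. Below is a minimal sketch of what an on-demand GPU pool can look like, assuming the Karpenter v1 API; the pool name, instance type, and GPU limit are illustrative rather than our customer’s actual configuration:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-burst-on-demand   # illustrative name
spec:
  template:
    metadata:
      labels:
        # These labels line up with the nodeSelectors used by the Helm chart shown later
        node-type: eks-on-demand
        gpu-type: a100-40gb
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-nodes   # assumed EC2NodeClass defined elsewhere
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["p4d.24xlarge"]   # 8x A100 40GB per node
  limits:
    nvidia.com/gpu: "64"   # caps burst spend at 8 nodes' worth of GPUs
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m   # scale back in once burst nodes drain

A second pool with karpenter.sh/capacity-type set to "spot" and a matching taint covers the cheapest tier for interruptible work.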
Implementation Details: Bridging Two Worlds
The implementation required careful architecture design to create a seamless experience across both environments. For example, during a typical day, routine model training jobs would run on the cost-effective HyperPod baseline capacity, but when urgent inference requests spike during business hours or a critical model needs rapid iteration, the system automatically provisions additional EKS resources to handle the burst while keeping costs optimal. Our solution included:
- Unified Job Scheduling: A custom scheduler that intelligently routes workloads to either HyperPod or EKS based on resource availability, priority, and cost considerations
- Consistent Environment: Container images and runtime environments standardized across both platforms to ensure workloads can execute identically regardless of destination
- Centralized Monitoring: A unified observability layer using the SageMaker HyperPod task governance dashboard that provides visibility into GPU utilization, job status, and costs across the hybrid infrastructure. This enables teams to see, for instance, that long-running training jobs utilize 80% of HyperPod capacity while EKS has dynamically scaled up 15 additional nodes to handle a sudden inference workload spike
- Cost Optimization: Automated policies that maximize the usage of committed HyperPod resources while intelligently scaling EKS resources based on both urgency and cost-efficiency
Smart Workload Placement with Helm and Kubernetes Priorities
A critical component of our solution was implementing intelligent workload placement through Kubernetes priorities and a custom Helm chart. Below is a sample fragment from our values.yaml:
# ai-training-helm-chart/values.yaml
# Default values for ai-training.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

# Global configuration
global:
  environment: production

# Name overrides
nameOverride: ""
fullnameOverride: ""

# Job type to deploy (production, development, research)
jobType: development

# Container image configuration
image:
  repository: nvidia/pytorch
  pullPolicy: IfNotPresent
  # Overrides the image tag whose default is the chart appVersion.
  tag: "23.10-py3"

# Image pull secrets
imagePullSecrets: []

# Service account configuration
serviceAccount:
  # Specifies whether a service account should be created
  create: true
  # Annotations to add to the service account
  annotations: {}
  # The name of the service account to use.
  # If not set and create is true, a name is generated using the fullname template
  name: ""

# Job configuration
backoffLimit: 3
ttlSecondsAfterFinished: 86400

# Training command and arguments
command:
  - "python"
args:
  - "/workspace/train.py"

# Model type
modelType: "default"

# Environment variables
env:
  CUDA_VISIBLE_DEVICES: "all"
  NCCL_DEBUG: "INFO"

# Persistent Volume Claims
dataPvcName: "training-data-pvc"
checkpointPvcName: "model-checkpoints-pvc"

nodeSelectors:
  # Priority tiers for different node pools
  hyperpod:
    enabled: true
    labels:
      node-type: hyperpod
      gpu-type: a100-80gb
  eksDemand:
    enabled: true
    labels:
      node-type: eks-on-demand
      gpu-type: a100-40gb
  eksSpot:
    enabled: true
    labels:
      node-type: eks-spot
      gpu-type: a100-40gb

priorityClasses:
  # Priority class definitions based on job importance
  critical:
    value: 1000000
    description: "Critical production training jobs that cannot be interrupted"
  high:
    value: 800000
    description: "High priority training jobs for active development"
  medium:
    value: 600000
    description: "Standard training jobs"
  low:
    value: 400000
    description: "Experimental or research training jobs"
  preemptible:
    value: 200000
    description: "Jobs that can be interrupted if necessary"
  preemptionPolicy: PreemptLowerPriority

# Job configurations with different priorities and placement
jobTemplates:
  production:
    priorityClassName: critical
    nodeSelector:
      node-type: hyperpod
    tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "hyperpod"
        effect: "NoSchedule"
    resources:
      limits:
        nvidia.com/gpu: 8
      requests:
        nvidia.com/gpu: 8
        memory: "64Gi"
        cpu: "32"
  development:
    priorityClassName: high
    nodeSelector:
      node-type: eks-on-demand
    resources:
      limits:
        nvidia.com/gpu: 4
      requests:
        nvidia.com/gpu: 4
        memory: "32Gi"
        cpu: "16"
  research:
    priorityClassName: medium
    nodeSelector:
      node-type: eks-spot
    tolerations:
      - key: "spot"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
    resources:
      limits:
        nvidia.com/gpu: 2
      requests:
        nvidia.com/gpu: 2
        memory: "16Gi"
        cpu: "8"
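To show how these values become cluster objects, here is a minimal sketch of a template that could render the priorityClasses map into Kubernetes PriorityClass resources. It illustrates the pattern rather than reproducing the chart’s exact source (the kindIs guard skips the scalar preemptionPolicy key):

# templates/priorityclasses.yaml (illustrative sketch)
{{- range $name, $pc := .Values.priorityClasses }}
{{- if kindIs "map" $pc }}
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: {{ $name }}
value: {{ $pc.value }}
description: {{ $pc.description | quote }}
preemptionPolicy: {{ $.Values.priorityClasses.preemptionPolicy | default "PreemptLowerPriority" }}
globalDefault: false
---
{{- end }}
{{- end }}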
This Helm chart implements several key features (a hand-assembled rendering of the production template follows the list):
- Priority Classes: Defines a hierarchy of job priorities, ensuring that critical production workloads take precedence over experimental jobs
- Node Selectors: Direct each workload type strictly to the appropriate infrastructure (HyperPod for production, EKS on-demand for development, EKS spot for research)
- Resource Specifications: Tailors GPU, memory, and CPU requests based on workload type
- Tolerations: Ensures jobs can run on nodes with specific taints (like spot instances or dedicated HyperPod nodes)
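To make the mapping concrete, here is approximately what the chart renders for jobType: production, assembled by hand from the values above; the Job name, container name, and mount paths are illustrative:

apiVersion: batch/v1
kind: Job
metadata:
  name: ai-training-production   # illustrative name
spec:
  backoffLimit: 3
  ttlSecondsAfterFinished: 86400
  template:
    spec:
      restartPolicy: Never
      priorityClassName: critical     # from jobTemplates.production
      nodeSelector:
        node-type: hyperpod           # pins the job to HyperPod capacity
      tolerations:
        - key: "dedicated"
          operator: "Equal"
          value: "hyperpod"
          effect: "NoSchedule"
      containers:
        - name: trainer               # illustrative container name
          image: nvidia/pytorch:23.10-py3   # image.repository:image.tag
          command: ["python"]
          args: ["/workspace/train.py"]
          env:
            - name: CUDA_VISIBLE_DEVICES
              value: "all"
            - name: NCCL_DEBUG
              value: "INFO"
          resources:
            limits:
              nvidia.com/gpu: 8
            requests:
              nvidia.com/gpu: 8
              memory: "64Gi"
              cpu: "32"
          volumeMounts:
            - name: data
              mountPath: /workspace/data          # illustrative path
            - name: checkpoints
              mountPath: /workspace/checkpoints   # illustrative path
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: training-data-pvc          # dataPvcName
        - name: checkpoints
          persistentVolumeClaim:
            claimName: model-checkpoints-pvc      # checkpointPvcName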
When deployed, this configuration ensures:
- Production training jobs run on cost-effective, reliable HyperPod instances
- Development workloads use on-demand EKS instances when HyperPod is at capacity
- Research and experimental jobs leverage spot instances for maximum cost savings
- Critical jobs can preempt lower-priority workloads during resource constraints
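In practice, routing a job to the right tier is then a one-line decision at deploy time. With a standard Helm workflow, something like helm install nightly-train ./ai-training-helm-chart --set jobType=production (release and chart names illustrative) selects the matching entry from jobTemplates, so engineers never hand-edit node selectors, tolerations, or priorities.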
The Results: 50% Cost Reduction and Faster Development Cycles
The hybrid solution delivered impressive results for our customer:
- 50% Reduction in GPU Costs: By leveraging Flexible Training Plans for committed baseline capacity and optimizing the balance with on-demand resources
- Significantly Improved Job Reliability: HyperPod’s resilience features eliminated the lost work that previously common infrastructure failures had caused
- Faster Time to Market: The ability to dynamically scale during peak periods accelerated model development cycles
- Simplified Operations: A consistent management experience across both environments reduced operational complexity
Key Learnings and Best Practices
Through this implementation, we identified several best practices for organizations considering a similar approach:
- Start with Workload Analysis: Thoroughly understand your baseline and peak GPU requirements before committing to reserved capacity
- Design for Compatibility: Ensure your ML pipelines and workflows can operate identically in both environments
- Implement Smart Scheduling: Develop clear policies for workload routing that balance cost optimization with performance requirements
- Monitor Continuously: Implement comprehensive monitoring to identify optimization opportunities and prevent resource wastage
- Plan for Evolution: Create an architecture that can adapt as GPU availability, pricing, and AWS service capabilities evolve
Conclusion: Flexibility is the Future
In the rapidly evolving world of AI infrastructure, rigid solutions quickly become obsolete. Automat-it’s hybrid approach demonstrates that organizations don’t need to choose between cost optimization and scalability; they can have both by intelligently combining AWS services.
As an AWS Premier Tier Partner, Automat-it specializes in designing and implementing these kinds of innovative solutions that maximize the value of AWS services while addressing real-world business challenges. The successful implementation of this hybrid GPU architecture showcases our commitment to finding the optimal balance between cost, performance, and operational excellence.
Whether you’re just beginning your AI journey or looking to optimize an existing infrastructure, consider how a hybrid approach might help you navigate the challenging terrain of GPU availability while maintaining the flexibility to scale with your business needs.