GPU Scalability: Creating a Hybrid Amazon SageMaker HyperPod & EKS Architecture


By Claudiu Bota, Sr. Solutions Architect, and Vitalii Vlasov, DevOps Engineer, Automat-it – AWS Premier Tier Partner

In today’s AI-driven technology landscape, organizations face a critical challenge: securing sufficient GPU resources while maintaining cost efficiency and operational resilience.

The explosive growth in AI development has created unprecedented demand for these specialized processors, resulting in global shortages and skyrocketing costs. Meanwhile, the stakes couldn’t be higher, as AI training jobs that run for days or weeks can be devastated by infrastructure failures, potentially losing weeks of progress and valuable resources.

 

The GPU Challenge in Modern AI Development

 

The current AI boom has transformed GPUs from specialized gaming hardware into the lifeblood of machine learning operations. Organizations building AI solutions face three interconnected challenges:

  • Scarcity: Global GPU shortages make capacity planning difficult
  • Cost Pressure: Increasing prices for high-performance computing resources
  • Resilience Requirements: The devastating impact of failures during long-running training jobs

 

Consider the reality of training large-scale foundation models: orchestrating hundreds of high-end GPUs continuously for weeks, where a single hardware failure could force a complete restart, losing substantial progress and significant computational investment.

AWS recognized these pain points and responded with Amazon SageMaker HyperPod, offering dedicated GPU clusters with substantial discounts in exchange for time commitments.

HyperPod’s automatic failure recovery mechanisms address the resilience challenge by automatically handling node failures and resuming training from the last checkpoint, protecting organizations from catastrophic losses. While HyperPod integrates with Amazon Elastic Kubernetes Service (EKS) to leverage Kubernetes orchestration capabilities, it has one significant limitation: it lacks native autoscaling for handling peak demands.

 

A Hybrid Solution for Flexibility with SageMaker HyperPod and EKS

 

When one of our customers at Automat-it approached us with this exact challenge, we recognized an opportunity to create something innovative. They needed a solution that combined the cost-effectiveness and reliability of reserved GPU capacity with the flexibility to scale dynamically during peak periods.

We designed a hybrid architecture that leverages the strengths of both Amazon SageMaker HyperPod and Amazon EKS with Karpenter (a flexible, high-performance Kubernetes cluster autoscaler):

  • SageMaker HyperPod: Provides the baseline GPU capacity with discounted pricing through long-term commitments via Flexible Training Plans, ensuring constant availability for core workloads
  • Amazon EKS with Karpenter: Delivers dynamic autoscaling capabilities to handle burst workloads during peak demand periods (a sample Karpenter NodePool is sketched after this list)
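
To make the burst tier concrete, here is a minimal sketch of how the spot-backed GPU pool could be declared with Karpenter. It assumes the Karpenter v1 API and an EC2NodeClass named gpu-a100 defined elsewhere; the pool name, instance type, and GPU limit are illustrative rather than values from the customer environment. The labels and taint mirror the eksSpot node selectors and the spot toleration used in the Helm chart shown later.

# karpenter-nodepool-gpu-spot.yaml (illustrative sketch)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-a100-spot
spec:
  template:
    metadata:
      labels:
        # Match the eksSpot nodeSelectors used by the training Helm chart
        node-type: eks-spot
        gpu-type: a100-40gb
    spec:
      taints:
        # Match the "spot" toleration on research jobs
        - key: spot
          value: "true"
          effect: NoSchedule
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["p4d.24xlarge"]  # A100 40GB instances; adjust to the capacity you can obtain
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-a100  # assumed to be defined separately
  limits:
    nvidia.com/gpu: 32  # cap on total burst GPU capacity
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 300s

A second NodePool with karpenter.sh/capacity-type set to on-demand and the eks-on-demand labels would back the development tier in the same way.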

 

Implementation Details: Bridging Two Worlds

 

The implementation required careful architecture design to create a seamless experience across both environments. For example, during a typical day, routine model training jobs would run on the cost-effective HyperPod baseline capacity, but when urgent inference requests spike during business hours or a critical model needs rapid iteration, the system automatically provisions additional EKS resources to handle the burst while keeping costs optimal. Our solution included:

  • Unified Job Scheduling: A custom scheduler that intelligently routes workloads to either HyperPod or EKS based on resource availability, priority, and cost considerations
  • Consistent Environment: Container images and runtime environments standardized across both platforms to ensure workloads can execute identically regardless of destination
  • Centralized Monitoring: A unified observability layer using the SageMaker HyperPod task governance dashboard that provides visibility into GPU utilization, job status, and costs across the hybrid infrastructure. This enables teams to see, for instance, that long-running training jobs utilize 80% of HyperPod capacity while EKS has dynamically scaled up 15 additional nodes to handle a sudden inference workload spike
  • Cost Optimization: Automated policies that maximize the usage of committed HyperPod resources while intelligently scaling EKS resources based on both urgency and cost-efficiency

 

Smart Workload Placement with Helm and Kubernetes Priorities

 

A critical component of our solution was implementing intelligent workload placement through Kubernetes priorities and a custom Helm chart. Below is a sample fragment from our values.yaml:

# ai-training-helm-chart/values.yaml
# Default values for ai-training.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

# Global configuration
global:
  environment: production

# Name overrides
nameOverride: ""
fullnameOverride: ""

# Job type to deploy (production, development, research)
jobType: development

# Container image configuration
image:
  repository: nvidia/pytorch
  pullPolicy: IfNotPresent
  # Overrides the image tag whose default is the chart appVersion.
  tag: "23.10-py3"

# Image pull secrets
imagePullSecrets: []

# Service account configuration
serviceAccount:
  # Specifies whether a service account should be created
  create: true
  # Annotations to add to the service account
  annotations: {}
  # The name of the service account to use.
  # If not set and create is true, a name is generated using the fullname template
  name: ""

# Job configuration
backoffLimit: 3
ttlSecondsAfterFinished: 86400

# Training command and arguments
command:
  - "python"
args:
  - "/workspace/train.py"

# Model type
modelType: "default"

# Environment variables
env:
  CUDA_VISIBLE_DEVICES: "all"
  NCCL_DEBUG: "INFO"

# Persistent Volume Claims
dataPvcName: "training-data-pvc"
checkpointPvcName: "model-checkpoints-pvc"

nodeSelectors:
  # Priority tiers for different node pools
  hyperpod:
    enabled: true
    labels:
      node-type: hyperpod
      gpu-type: a100-80gb
  eksDemand:
    enabled: true
    labels:
      node-type: eks-on-demand
      gpu-type: a100-40gb
  eksSpot:
    enabled: true
    labels:
      node-type: eks-spot
      gpu-type: a100-40gb

priorityClasses:
  # Priority class definitions based on job importance
  critical:
    value: 1000000
    description: "Critical production training jobs that cannot be interrupted"
  high:
    value: 800000
    description: "High priority training jobs for active development"
  medium:
    value: 600000
    description: "Standard training jobs"
  low:
    value: 400000
    description: "Experimental or research training jobs"
  preemptible:
    value: 200000
    description: "Jobs that can be interrupted if necessary"
    preemptionPolicy: PreemptLowerPriority

# Job configurations with different priorities and placement
jobTemplates:
  production:
    priorityClassName: critical
    nodeSelector:
      node-type: hyperpod
    tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "hyperpod"
        effect: "NoSchedule"
    resources:
      limits:
        nvidia.com/gpu: 8
      requests:
        nvidia.com/gpu: 8
        memory: "64Gi"
        cpu: "32"

  development:
    priorityClassName: high
    nodeSelector:
      node-type: eks-on-demand
    resources:
      limits:
        nvidia.com/gpu: 4
      requests:
        nvidia.com/gpu: 4
        memory: "32Gi"
        cpu: "16"

  research:
    priorityClassName: medium
    nodeSelector:
      node-type: eks-spot
    tolerations:
      - key: "spot"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
    resources:
      limits:
        nvidia.com/gpu: 2
      requests:
        nvidia.com/gpu: 2
        memory: "16Gi"
        cpu: "8"

This Helm chart implements several key features:

  1. Priority Classes: Defines a hierarchy of job priorities, ensuring that critical production workloads take precedence over experimental jobs (see the template sketch after this list)
  2. Node Selectors: Strictly directs different types of workloads to the appropriate infrastructure (HyperPod for production, EKS on-demand for development, EKS spot for research)
  3. Resource Specifications: Tailors GPU, memory, and CPU requests based on workload type
  4. Tolerations: Ensures jobs can run on nodes with specific taints (like spot instances or dedicated HyperPod nodes)
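
As a sketch of how the priorityClasses values could be materialized, a template along the following lines (a hypothetical templates/priorityclasses.yaml, not the literal file from our chart) renders one Kubernetes PriorityClass per entry:

# ai-training-helm-chart/templates/priorityclasses.yaml (illustrative sketch)
{{- range $name, $class := .Values.priorityClasses }}
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: {{ $name }}
value: {{ $class.value }}
globalDefault: false
description: {{ $class.description | quote }}
{{- if $class.preemptionPolicy }}
preemptionPolicy: {{ $class.preemptionPolicy }}
{{- end }}
{{- end }}

The jobTemplates entries then reference these classes by name through priorityClassName, which is what allows the Kubernetes scheduler to preempt lower-priority pods when capacity is tight.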

 

When deployed, this configuration ensures:

  • Production training jobs run on cost-effective, reliable HyperPod instances (an example rendered Job follows this list)
  • Development workloads use on-demand EKS instances when HyperPod is at capacity
  • Research and experimental jobs leverage spot instances for maximum cost savings
  • Critical jobs can preempt lower-priority workloads during resource constraints
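
For illustration, rendering the production job template with the values above would yield a Job roughly like the following. This is a hand-written approximation of the helm template output, not the chart's actual template; the Job name and the /data and /checkpoints mount paths are assumptions.

# Approximate output of `helm template` for jobType: production (illustrative)
apiVersion: batch/v1
kind: Job
metadata:
  name: ai-training-production   # assumed name
spec:
  backoffLimit: 3
  ttlSecondsAfterFinished: 86400
  template:
    spec:
      restartPolicy: Never
      priorityClassName: critical
      nodeSelector:
        node-type: hyperpod
      tolerations:
        - key: "dedicated"
          operator: "Equal"
          value: "hyperpod"
          effect: "NoSchedule"
      containers:
        - name: trainer
          image: "nvidia/pytorch:23.10-py3"
          command: ["python"]
          args: ["/workspace/train.py"]
          env:
            - name: CUDA_VISIBLE_DEVICES
              value: "all"
            - name: NCCL_DEBUG
              value: "INFO"
          resources:
            limits:
              nvidia.com/gpu: 8
            requests:
              nvidia.com/gpu: 8
              memory: "64Gi"
              cpu: "32"
          volumeMounts:
            - name: training-data
              mountPath: /data          # assumed mount path
            - name: checkpoints
              mountPath: /checkpoints   # assumed mount path
      volumes:
        - name: training-data
          persistentVolumeClaim:
            claimName: training-data-pvc
        - name: checkpoints
          persistentVolumeClaim:
            claimName: model-checkpoints-pvc

Because this Job carries the critical priority class, the Kubernetes scheduler can evict lower-priority workloads to make room for it whenever the HyperPod nodes are contended.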

The Results: 50% Cost Reduction and Faster Development Cycles

The hybrid solution delivered impressive results for our customer:

  • 50% Reduction in GPU Costs: By leveraging Flexible Training Plans for committed baseline capacity and optimizing the balance with on-demand resources
  • Significantly Improved Job Reliability: HyperPod’s resilience features eliminated the previously common job failures
  • Faster Time to Market: The ability to dynamically scale during peak periods accelerated model development cycles
  • Simplified Operations: A consistent management experience across both environments reduced operational complexity

 

Key Learnings and Best Practices

 

Through this implementation, we identified several best practices for organizations considering a similar approach:

  1. Start with Workload Analysis: Thoroughly understand your baseline and peak GPU requirements before committing to reserved capacity
  2. Design for Compatibility: Ensure your ML pipelines and workflows can operate identically in both environments
  3. Implement Smart Scheduling: Develop clear policies for workload routing that balance cost optimization with performance requirements
  4. Monitor Continuously: Implement comprehensive monitoring to identify optimization opportunities and prevent resource wastage
  5. Plan for Evolution: Create an architecture that can adapt as GPU availability, pricing, and AWS service capabilities evolve

 

Conclusion: Flexibility is the Future

 

In the rapidly evolving world of AI infrastructure, rigid solutions quickly become obsolete. Automat-it’s hybrid approach demonstrates that organizations don’t need to choose between cost optimization and scalability; they can have both by intelligently combining AWS services.

As an AWS Premier Tier Partner, Automat-it specializes in designing and implementing these kinds of innovative solutions that maximize the value of AWS services while addressing real-world business challenges. The successful implementation of this hybrid GPU architecture showcases our commitment to finding the optimal balance between cost, performance, and operational excellence.

Whether you’re just beginning your AI journey or looking to optimize an existing infrastructure, consider how a hybrid approach might help you navigate the challenging terrain of GPU availability while maintaining the flexibility to scale with your business needs.