CIO Influence

Driving Efficiency in AI/ML: Unlocking Potential through a Dynamic Workload Scheduler

Dynamic Workload Scheduler: Optimizing resource access and economics for AI/ML workloads

Google Cloud has announced AI Hypercomputer, an architecture that integrates AI-optimized hardware, software, and consumption models so that enterprises worldwide can use Google’s AI/ML infrastructure.

Effective resource management becomes increasingly critical as demand for TPUs and NVIDIA GPUs surges. In response, Google Cloud has introduced the Dynamic Workload Scheduler, a simple yet powerful way to access GPUs and TPUs. This article is aimed at technical audiences and explains how the scheduler works and how it can be used right away.

Dynamic Workload Scheduler for AI Hypercomputer

The Dynamic Workload Scheduler is a resource management and job scheduling platform built for the AI Hypercomputer. It aims to improve access to AI/ML resources, optimize spend, and raise the efficiency of workloads such as training and fine-tuning by scheduling all the accelerators a job needs at the same time. The scheduler supports TPUs and NVIDIA GPUs and brings scheduling advances from Google’s ML fleet to Google Cloud customers. It is also integrated into several popular Google Cloud AI/ML services, including Compute Engine Managed Instance Groups, Google Kubernetes Engine, Vertex AI, and Batch, with further expansion planned.

Two Distinctive Modes of Dynamic Workload Scheduler

  1. Flex Start Mode

Enhancing Accessibility and Optimizing Economics

Flex Start mode is designed to give efficient, cost-effective access to GPUs and TPUs. It suits model fine-tuning, experimentation, shorter training jobs, distillation, offline inference, and batch jobs. Users request GPU and TPU capacity at the point when their jobs are ready to run.

To use Flex Start mode, users submit a GPU capacity request that specifies the quantity, duration, and preferred region for their AI/ML jobs. The Dynamic Workload Scheduler queues these requests and automatically provisions VMs when the requested capacity becomes available. Once provisioned, the workload runs continuously for the requested duration. Capacity requests can span up to seven days, and there is no minimum duration. Requests can be for minutes or hours, and the scheduler prioritizes shorter requests, so they are typically fulfilled sooner.

Users can also terminate the VMs as soon as a training job completes, which frees the resources and means they pay only for what they actually consumed. There is no need to hold idle resources for future use.
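The Flex Start rules described above (a quantity, a duration of at most seven days with no minimum, a region, shorter requests served first, and charges only for actual use) can be illustrated with a minimal Python sketch. Everything in it is hypothetical: `FlexStartRequest`, `shortest_first`, and `billed_accelerator_hours` are illustrative names and not part of any Google Cloud API, the region string is just an example value, and the plain sort by duration is a stand-in for the scheduler's real prioritization, which the article does not describe in detail.

```python
from dataclasses import dataclass
from datetime import timedelta

MAX_RUN_DURATION = timedelta(days=7)  # upper bound stated in the article


@dataclass(frozen=True)
class FlexStartRequest:
    """Hypothetical model of a Flex Start capacity request (not a real API)."""
    name: str
    accelerator_count: int
    run_duration: timedelta
    region: str

    def __post_init__(self):
        # No minimum duration applies, only a positive value and the 7-day cap.
        if self.accelerator_count < 1:
            raise ValueError("accelerator_count must be at least 1")
        if self.run_duration <= timedelta(0):
            raise ValueError("run_duration must be positive")
        if self.run_duration > MAX_RUN_DURATION:
            raise ValueError("run_duration cannot exceed seven days")


def shortest_first(requests):
    """Order pending requests so shorter durations come first (assumed policy)."""
    return sorted(requests, key=lambda r: r.run_duration)


def billed_accelerator_hours(request, actual_runtime):
    """Accelerator-hours consumed when VMs are terminated after actual_runtime."""
    used = min(actual_runtime, request.run_duration)
    return request.accelerator_count * used.total_seconds() / 3600


pending = [
    FlexStartRequest("pretrain", 64, timedelta(days=5), "us-central1"),
    FlexStartRequest("fine-tune", 8, timedelta(hours=6), "us-central1"),
    FlexStartRequest("eval", 2, timedelta(minutes=30), "us-central1"),
]
queue_order = [r.name for r in shortest_first(pending)]

# The fine-tune job asked for 6 hours but finished, and was terminated, after 4.
early_finish_hours = billed_accelerator_hours(pending[1], timedelta(hours=4))
```

In this sketch the 30-minute request sorts ahead of the 6-hour and 5-day requests, and the job that finishes early accounts for 8 accelerators over 4 hours rather than the full 6 hours it requested.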

  2. Calendar Mode

Reserved Start Times for Precise AI Workloads [Preview Q1’24]

Calendar mode addresses training and experimentation workloads that need a specific start time and a defined duration. It lets users reserve GPU capacity in fixed-duration blocks, initially supporting future reservations of 7 or 14 days that can be purchased up to 8 weeks in advance. Once availability is confirmed, users can request a reservation, and the capacity is delivered to their project on the specified start date. VMs can then be targeted to consume the reserved capacity block. When the defined duration ends, the VMs are terminated and the reservation is deleted automatically. Using Calendar mode involves creating a future reservation and running VMs with the matching reservation affinity through available APIs such as Compute Engine, Managed Instance Groups, or GKE.
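As a rough illustration of the Calendar mode constraints stated above (blocks of 7 or 14 days, booked at most 8 weeks ahead, with capacity released when the block ends), here is a small Python sketch. `CalendarReservation` and `validate` are hypothetical names made up for this example and do not correspond to any Compute Engine or GKE API; the dates are arbitrary sample values.

```python
from dataclasses import dataclass
from datetime import date, timedelta

ALLOWED_BLOCK_DAYS = (7, 14)       # block lengths named in the article
MAX_LEAD_TIME = timedelta(weeks=8)  # how far ahead a block can be purchased


@dataclass(frozen=True)
class CalendarReservation:
    """Hypothetical model of a Calendar mode reservation (not a real API)."""
    start: date
    block_days: int

    @property
    def end(self):
        # Date on which VMs are terminated and the reservation is deleted.
        return self.start + timedelta(days=self.block_days)


def validate(reservation, today):
    """Return a list of rule violations; an empty list means acceptable."""
    problems = []
    if reservation.block_days not in ALLOWED_BLOCK_DAYS:
        problems.append("block must be 7 or 14 days")
    if reservation.start < today:
        problems.append("start date is in the past")
    if reservation.start - today > MAX_LEAD_TIME:
        problems.append("start date is more than 8 weeks ahead")
    return problems


today = date(2024, 1, 8)
ok = CalendarReservation(start=date(2024, 2, 5), block_days=14)
rejected = CalendarReservation(start=date(2024, 4, 1), block_days=10)

ok_problems = validate(ok, today)
rejected_problems = validate(rejected, today)
```

Here the first reservation starts four weeks out with a 14-day block and passes every check, while the second fails both the block-length and the lead-time rules.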

Customer Testimonials: Dynamic Workload Scheduler’s Impact on AI Innovators

The Dynamic Workload Scheduler builds on Google Borg technology, which is known for real-time scheduling within the Google ML fleet and manages millions of jobs, including one of the largest distributed LLM training jobs in the world as of November 2023. Through its Flex Start and Calendar modes, the scheduler gives customers more flexibility, better access to GPUs and TPUs, more efficient resource usage, and lower costs. Early adopters among customers and partners are already seeing these benefits.

Linum AI, a company specializing in text-to-video generative AI, shared its experience:

“The new scheduling capabilities within Dynamic Workload Scheduler have revolutionized our ability to secure sufficient GPU capacity for training runs. The worry of idle GPUs draining resources while waiting for compute availability is no more.” Sahil Chopra, Co-Founder & CEO, Linum AI

Similarly, sudoAI, a 3D generative AI company, used Dynamic Workload Scheduler to train its latest generative model:

“The convenience of on-demand capacity without concerns has been remarkable. This allowed us to explore new ideas, iterate, and conduct extensive training sessions. Our latest 3D Gen AI model was successfully trained using Dynamic Workload Scheduler, enabling us to meet our launch deadlines.” Robin Han, Co-Founder and CEO, sudoAI

FAQs

1. What is the Dynamic Workload Scheduler, and how does it contribute to AI Hypercomputer’s efficiency?
Answer: The Dynamic Workload Scheduler is a resource management and job scheduling platform designed specifically for the AI Hypercomputer. It optimizes the utilization and cost-effectiveness of AI/ML resources, including TPUs and NVIDIA GPUs, by efficiently scheduling workloads like training and fine-tuning tasks.

2. How does Flex Start Mode optimize resource utilization and cost-effectiveness?
Answer: Flex Start Mode allows users to request GPU and TPU capacity for tasks like model fine-tuning and shorter training jobs when their jobs are ready to execute. The scheduler automatically provisions VMs when the requested capacity becomes available, prioritizing shorter requests. Users can terminate VMs after job completion, eliminating charges for idle resources.

3. How does the Dynamic Workload Scheduler leverage Google’s Borg technology?
Answer: The scheduler uses Google Borg technology, known for real-time scheduling within the Google ML fleet, where it manages millions of jobs. Building on that technology, the scheduler offers Flex Start and Calendar modes, which give customers more flexibility and more efficient resource usage.
