Google Cloud has announced a new portfolio of computing capabilities with the preview launch of its next-generation A3 GPU supercomputer. The Google A3 supercomputer will support a wide variety of machine learning models, including large language models (LLMs), generative AI, and diffusion models. Alongside the recently announced G2 VMs, which serve generative AI models through NVIDIA L4 Tensor Core GPUs, the A3 gives customers a complete range of GPU options for ML training and inference, with NVIDIA H100 GPUs at its heart.
Google Cloud built the A3 Supercomputer around NVIDIA H100 Tensor Core GPUs and Google's leading networking advancements. Customers can now deploy A3 VMs on Vertex AI, an end-to-end platform for building ML models on fully managed infrastructure that's purpose-built for low-latency serving and high-performance training.
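As an illustration, here is a minimal sketch of what submitting a training job to an A3 machine through the Vertex AI Python SDK could look like. The project, region, staging bucket, container image, and the exact machine-type and accelerator-type strings are assumptions for the sake of the example, not details confirmed in the announcement.

```python
# Hypothetical sketch: running a custom training job on an A3 VM via
# the Vertex AI SDK. Names and identifiers below are illustrative.
from google.cloud import aiplatform

aiplatform.init(
    project="my-project",              # assumption
    location="us-central1",            # assumption
    staging_bucket="gs://my-bucket",   # assumption
)

job = aiplatform.CustomJob(
    display_name="llm-training-on-a3",
    worker_pool_specs=[{
        "machine_spec": {
            "machine_type": "a3-highgpu-8g",         # assumed A3 machine type
            "accelerator_type": "NVIDIA_H100_80GB",  # 8 H100s per A3 VM
            "accelerator_count": 8,
        },
        "replica_count": 1,
        "container_spec": {
            # Placeholder training image
            "image_uri": "gcr.io/my-project/trainer:latest",
        },
    }],
)
job.run()  # blocks until the training job completes
```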
So, what is the Google A3 Supercomputer all about?
The Google A3 Supercomputer, powered by NVIDIA H100 GPUs, is the first production application of Google's custom 200 Gbps IPUs. Inter-GPU data transfers bypass the CPU host and travel over dedicated interfaces, delivering 10x higher network bandwidth than the existing A2 VMs, with lower latency and more stable bandwidth. The A3 also runs on Google's data center networking fabric, whose reconfigurable optical links can adapt to any topology within the data center.
By using A3 Supercomputer GPUs, AI and IT companies can reduce their operational costs and scale their workloads across a networking fabric of thousands of interconnected GPUs running at full bandwidth.
How powerful is the A3 Supercomputer?
If you are working with AI models and data training, the key ingredient to business success is the performance you can extract from your VMs and GPUs, and every other VM pales in front of the A3 Supercomputer's AI performance. At scale, the Google A3 supercomputer provides up to 26 exaFlops of AI performance, according to Google. That headroom significantly shortens the time needed to compute AI workloads and reduces the cost of training large ML models, letting you scale your computing process to achieve real-time AI results.
Here are the key features and capabilities of the Google A3 Supercomputer:
- 8 H100 GPUs utilizing NVIDIA’s Hopper architecture, delivering 3x the compute throughput of the previous generation
- 3.6 TB/s bisection bandwidth between A3’s 8 GPUs via NVIDIA NVSwitch and NVLink 4.0
- Next-generation 4th Gen Intel Xeon Scalable processors
- 2TB of host memory via 4800 MHz DDR5 DIMMs
- 10x greater networking bandwidth powered by Google’s hardware-enabled IPUs, a specialized inter-server GPU communication stack, and NCCL optimizations (see the sketch after this list)
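To make the last item concrete, below is an illustrative PyTorch sketch of the kind of NCCL-backed multi-GPU training that a single 8-GPU A3 node is built for. The script, placeholder model, and launch command are assumptions for illustration, not Google's own tooling.

```python
# Illustrative sketch: data-parallel training with the NCCL backend,
# the communication library the A3 stack optimizes.
# Launch with: torchrun --nproc_per_node=8 train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK for each of the 8 per-GPU worker processes
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")  # inter-GPU comms over NCCL

    model = torch.nn.Linear(4096, 4096).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])

    x = torch.randn(32, 4096, device=local_rank)  # placeholder batch
    loss = model(x).sum()
    loss.backward()  # gradient sync runs as NCCL all-reduce across the GPUs

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```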
What’s Next with A3 Supercomputer GPUs?
Once generally available, the A3 supercomputer could drastically improve AI delivery models, helping AI teams build and manage advanced ML models faster, particularly in the generative AI space. It could also strengthen Google’s own ability to produce new kinds of AI platforms that are now part of its enterprise offerings, such as Duet AI and the upcoming Gemini.
Why Do ML Models Need to Speed Up?
Businesses could lose millions of dollars due to lag in their ML models: if a competitor releases a similar AI product first, the delay translates into lost market opportunity. By considerably speeding up the computing process for advanced ML models, A3 supercomputers help businesses stay ahead in the AI competition.
The announcements were made at the ongoing Google I/O 2023, where the company also introduced new features and foundation models.
Where to deploy A3?
A3 Supercomputer GPUs can be deployed on Google Kubernetes Engine (GKE) and Compute Engine, letting users train and serve the latest foundation models without worrying about complex workload orchestration, with automatic upgrades handled for them. A Google customer, Noam Shazeer, CEO of Character.AI, said: “Google Cloud’s A3 VM instances provide us with the computational power and scale for our most demanding training and inference workloads. We’re looking forward to taking advantage of their expertise in the AI space and leadership in large-scale infrastructure to deliver a strong platform for our ML workloads.”
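For teams that prefer raw Compute Engine, a minimal sketch of provisioning an A3 VM with the Compute Engine Python client might look like the following. The project, zone, instance name, boot image, and machine-type string are all illustrative assumptions rather than confirmed details.

```python
# Hypothetical sketch: creating an A3 VM on Compute Engine.
# All identifiers below are placeholders.
from google.cloud import compute_v1

instance = compute_v1.Instance(
    name="a3-training-node",  # assumption
    machine_type="zones/us-central1-a/machineTypes/a3-highgpu-8g",  # assumed type
    # GPU machine types cannot live-migrate, so terminate on host maintenance
    scheduling=compute_v1.Scheduling(on_host_maintenance="TERMINATE"),
    disks=[
        compute_v1.AttachedDisk(
            boot=True,
            auto_delete=True,
            initialize_params=compute_v1.AttachedDiskInitializeParams(
                # Placeholder image; a GPU-ready image would be used in practice
                source_image="projects/debian-cloud/global/images/family/debian-12",
            ),
        )
    ],
    network_interfaces=[
        compute_v1.NetworkInterface(network="global/networks/default")
    ],
)

client = compute_v1.InstancesClient()
operation = client.insert(
    project="my-project",   # assumption
    zone="us-central1-a",   # assumption
    instance_resource=instance,
)
operation.result()  # block until provisioning finishes
```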