Enhance Training Performance with the Latest NVIDIA NeMo Framework Features and NVIDIA H200

NVIDIA has unveiled enhancements coming in the January release of the NeMo framework. These optimizations and new features promise significant performance improvements for NVIDIA AI Foundation Models such as Llama 2 and Nemotron-3, as well as other LLMs. Expanded model architecture support in NeMo and the introduction of a highly requested parallelism technique further bolster the NVIDIA AI platform's capabilities.

Amid the escalating demand for high-performance AI training, NVIDIA’s NeMo framework is a pivotal solution. This end-to-end, cloud-native framework is tailored for constructing, customizing, and deploying generative AI models. Its robust suite of advanced parallelism techniques enables efficient large language model (LLM) training at scale.

NeMo played a vital role in powering NVIDIA’s GPT-3 175B performance submissions, achieving outstanding results in the latest MLPerf Training industry benchmarks. With up to 797 TFLOPS per H100 GPU and a groundbreaking submission featuring 10,752 H100 Tensor Core GPUs, NeMo showcased record performance and impressive linear scaling.

Enhanced Llama 2 Performance in NeMo’s Latest Release

Experts in the field emphasize the remarkable improvements in the latest NeMo framework release, particularly in the performance of Llama 2, a widely used, open-source large language model developed by Meta.

The advancements are evident when comparing the prior NeMo release running on A100 GPUs with the current release on H200 GPUs, which shows a 4.2x speedup in Llama 2 pre-training and supervised fine-tuning. One notable enhancement is the introduction of mixed-precision implementations within the model optimizer's state. This reduces memory capacity requirements and increases the effective memory bandwidth for interactions with the model state by 1.8x.
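One way to picture a mixed-precision optimizer state: keep Adam's moment tensors in BF16, halving their memory footprint and the bandwidth needed to read and write them on every step, while performing the update math in FP32. The sketch below is a simplified PyTorch illustration of that idea, not NeMo's actual optimizer; the class name and hyperparameters are assumptions.

```python
import torch

class MixedPrecisionAdam(torch.optim.Optimizer):
    """Illustrative Adam variant with BF16 moment tensors.
    Bias correction is omitted for brevity."""

    def __init__(self, params, lr=1e-4, betas=(0.9, 0.95), eps=1e-8):
        super().__init__(params, dict(lr=lr, betas=betas, eps=eps))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            beta1, beta2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if not state:
                    # Moments live in BF16: half the bytes of FP32 state,
                    # so less memory traffic when touching optimizer state.
                    state["exp_avg"] = torch.zeros_like(p, dtype=torch.bfloat16)
                    state["exp_avg_sq"] = torch.zeros_like(p, dtype=torch.bfloat16)
                g = p.grad.float()
                # Up-cast to FP32 for the update math, then store back in BF16.
                m = state["exp_avg"].float().mul_(beta1).add_(g, alpha=1 - beta1)
                v = state["exp_avg_sq"].float().mul_(beta2).addcmul_(g, g, value=1 - beta2)
                state["exp_avg"].copy_(m.to(torch.bfloat16))
                state["exp_avg_sq"].copy_(v.to(torch.bfloat16))
                p.add_(m / (v.sqrt() + group["eps"]), alpha=-group["lr"])
```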

Moreover, there’s a substantial boost in the performance of rotary positional embedding (RoPE) operations—a critical element in modern LLM architectures. Concurrently, optimization efforts have targeted the Swish-Gated Linear Unit (SwiGLU) activation functions, commonly substituting the Gaussian Error Linear Unit (GELU) in contemporary LLMs, further enhancing overall performance.

The strides made in communication efficiency for tensor and pipeline parallelism have also been noteworthy. Together, these improvements drive a significant increase in Tensor Core utilization on GPUs built on the NVIDIA Hopper architecture, reaching 836 TFLOPS per H200 GPU for Llama 2 70B pre-training and supervised fine-tuning.

From a practical standpoint, these advancements translate into remarkable processing capability. A single eight-way NVIDIA HGX H200 system can fine-tune the 70B-parameter Llama 2 on sequences of length 4,096 at over 15,000 tokens per second. That means a supervised fine-tuning task over 1B tokens completes in just over 18 hours, a testament to the accelerated performance delivered by the latest NeMo framework enhancements.
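The back-of-the-envelope math behind that figure, as a quick check:

```python
tokens = 1_000_000_000        # 1B-token supervised fine-tuning dataset
tokens_per_second = 15_000    # eight-way HGX H200 throughput cited above
hours = tokens / tokens_per_second / 3600
print(f"{hours:.1f} hours")   # ~18.5 hours
```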

Advancing Performance with Fully Sharded Data Parallelism (FSDP)

Experts highlight Fully Sharded Data Parallelism (FSDP) as a prominent and widely favored feature in the deep learning community. It is available in major frameworks such as PyTorch, DeepSpeed, and JAX, and it caters to a wide range of model architectures.

FSDP and Model Pipelining

    • Model pipelining demands a highly regular model structure, often with identical layers repeated extensively.
    • FSDP introduces usability enhancements by distributing data and memory on a per-layer basis, accommodating regular and irregular neural network structures.

Ease of Implementation

    • FSDP offers improved usability with minimal performance compromise; applying it can be as simple as wrapping the model (see the sketch after this list).
    • It eliminates the need for intricate model partitioning, making it adaptable to new model architectures such as multi-modal LLMs.
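A minimal sketch of that wrapper pattern using PyTorch's native FSDP API (NeMo's own integration differs in its details, and MyTransformer is a hypothetical placeholder for any model class):

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# One process per GPU, typically launched via torchrun.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = MyTransformer().cuda()  # hypothetical model class

# Wrapping is the whole integration step: FSDP shards parameters,
# gradients, and optimizer state across ranks on a per-layer basis,
# with no manual partitioning of the model.
model = FSDP(model)

# Training then proceeds as with any PyTorch model.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```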

Performance and Adaptability

    • Delivers performance competitive with traditional tensor and pipeline parallelism, especially at scales smaller than the global batch size.
    • This underscores FSDP's efficacy in optimizing performance across a range of computational scales.

Advancing Model Capacity with Mixture of Experts (MoE)

Expanding model parameters to enhance information absorption and generalization capabilities has long been a strategy in generative AI models. However, the challenge arises as larger models demand increased computing for inference, elevating production costs.

Mixture of Experts (MoE) Significance

    • Addresses capacity augmentation without a proportionate increase in computing demands for training and inference.
    • Employs conditional computation, directing input tokens to specific expert neural network layers, thereby decoupling model capacity from the compute required for training and inference (a minimal illustration follows this list).
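A minimal sketch of that conditional-computation pattern, using a learned top-1 router for illustration (this is not NeMo's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoELayer(nn.Module):
    """Top-1 mixture-of-experts layer: each token visits exactly one
    expert MLP, so parameter count grows with num_experts while
    per-token compute stays roughly constant."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model); pick each token's best expert.
        probs = F.softmax(self.router(x), dim=-1)
        weight, expert_idx = probs.max(dim=-1)  # top-1 routing
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                # Only the tokens routed here flow through this expert.
                out[mask] = weight[mask, None] * expert(x[mask])
        return out
```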

NeMo’s Support for MoE-based LLM Architectures

    • The upcoming release officially supports MoE-based large language model (LLM) architectures with expert parallelism.
    • It uses an architecture similar to Balanced Assignment of Experts (BASE), applying algorithmic load balancing through Sinkhorn-based routing to distribute tokens across individual experts (a Sinkhorn-style balancing step is sketched below).
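The balancing idea can be sketched in a few lines: alternately normalize the token-to-expert affinity matrix over experts and over tokens so that no expert monopolizes a batch, then route each token to its highest-scoring expert. This is an illustrative simplification, not NeMo's exact routing code:

```python
import torch

def sinkhorn_balance(scores: torch.Tensor, n_iters: int = 3) -> torch.Tensor:
    """Balance a (num_tokens, num_experts) affinity matrix by
    alternating row and column normalization."""
    cost = torch.exp(scores)
    for _ in range(n_iters):
        cost = cost / cost.sum(dim=1, keepdim=True)  # normalize over experts
        cost = cost / cost.sum(dim=0, keepdim=True)  # normalize over tokens
    return cost

# Each token then goes to its highest-scoring expert after balancing:
# expert_idx = sinkhorn_balance(router_logits).argmax(dim=-1)
```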

Flexible Expert Parallelism Configuration in NeMo

    • NeMo offers adaptable expert parallelism configurations, allowing experts to be mapped across GPUs without restricting how many reside on each device.
    • It supports scenarios where the expert-parallel size is smaller than the data-parallel size, ensuring versatile setups (see the illustrative mapping below).
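To make the mapping concrete, here is a hypothetical sketch of how experts might be laid out when the expert-parallel size is smaller than the world size; the function and its grouping scheme are assumptions for illustration, not NeMo's API:

```python
def map_experts_to_gpus(num_experts: int, expert_parallel_size: int,
                        world_size: int) -> dict[int, list[int]]:
    """Split the world into expert-parallel groups; each group holds a
    full replica of the expert set, sharded across its ranks, so a
    single GPU can host several experts."""
    experts_per_rank = num_experts // expert_parallel_size
    mapping = {}
    for rank in range(world_size):
        slot = rank % expert_parallel_size  # position within the EP group
        mapping[rank] = list(range(slot * experts_per_rank,
                                   (slot + 1) * experts_per_rank))
    return mapping

# 8 experts, expert-parallel size 2, 8 GPUs: each GPU holds 4 experts,
# and the expert set is replicated 4x for data parallelism.
print(map_experts_to_gpus(num_experts=8, expert_parallel_size=2, world_size=8))
```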

Complementary Nature of NeMo’s Expert Parallelism

    • NeMo’s expert parallelism integrates seamlessly with various parallelism dimensions, including tensor, pipeline, and sequence parallelism.
    • Enables efficient training of models containing trillions of parameters across NVIDIA GPU clusters, showcasing a robust synergy.

TensorRT-LLM Integration in Reinforcement Learning from Human Feedback (RLHF)

Enhancements to NeMo's RLHF support introduce integration with TensorRT-LLM for inference within the RLHF loop. TensorRT-LLM significantly accelerates inference for the actor model, the primary output of the RLHF process. With the upcoming NeMo release, pipeline parallelism via TensorRT-LLM is enabled in RLHF, delivering better performance with fewer nodes, especially for larger models.

For instance, using TensorRT-LLM within the RLHF loop for the Llama 2 70B parameter model on H100 GPUs yields up to a 5.6x speedup compared to running RLHF without TensorRT-LLM on the same GPUs.
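Conceptually, TensorRT-LLM accelerates the rollout (generation) phase of the RLHF loop, which dominates wall-clock time, while the training framework handles the policy update. The sketch below shows where such an engine plugs in; every class and function here is a hypothetical stand-in, not a real NeMo or TensorRT-LLM API:

```python
# Hypothetical stand-ins for the three components of an RLHF loop.
class InferenceEngine:               # plays the TensorRT-LLM role
    def generate(self, prompts): return [p + " <response>" for p in prompts]
    def refresh_weights(self, actor): pass

class RewardModel:
    def score(self, prompts, responses): return [1.0] * len(prompts)

def ppo_update(actor, prompts, responses, rewards):
    pass  # gradient step on the actor (omitted)

actor, engine, reward_model = object(), InferenceEngine(), RewardModel()
prompts = ["Explain FSDP in one sentence."]
for step in range(1):
    # 1. Rollout: the actor generates responses. This inference-heavy
    #    phase is what TensorRT-LLM accelerates inside the loop.
    responses = engine.generate(prompts)
    # 2. Scoring: the reward model rates each response.
    rewards = reward_model.score(prompts, responses)
    # 3. Update: a PPO step adjusts the actor, whose refreshed weights
    #    are pushed back into the inference engine for the next rollout.
    ppo_update(actor, prompts, responses, rewards)
    engine.refresh_weights(actor)
```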

Accelerating Generative AI for Performance Advancements

NVIDIA NeMo constantly evolves to optimize generative AI training, integrating cutting-edge methods that enhance the performance and flexibility of the NVIDIA platform.

This platform’s versatility spans the AI workflow, from data preparation to model training and deployment. Recent TensorRT-LLM demonstrations on a single H200 GPU, with advanced 4-bit quantization, maintained 99% accuracy with the latest Falcon-180B model. For more details, explore the latest TensorRT-LLM post.

Continuously advancing, the NVIDIA AI platform drives performance, versatility, and innovation. It is the preferred choice for current generative AI applications and catalyzes future models and techniques.
