
NVIDIA's brand-new open-source model: triple the throughput, runs on a single card, and sets a new reasoning SOTA

NVIDIA has launched the Llama Nemotron Super v1.5 open-source model, designed specifically for complex reasoning and agent tasks. The model achieves state-of-the-art performance in fields such as science, mathematics, and programming, delivers up to three times the throughput of its predecessor, and runs efficiently on a single card. It uses Neural Architecture Search (NAS) to balance accuracy and efficiency while reducing operational costs, and its architecture includes skip-attention blocks and variable feedforward networks that improve both performance and efficiency.
As we all know, Jensen Huang not only sells shovels (GPUs) but also does the mining himself (builds models).
NVIDIA's latest open-source model, Llama Nemotron Super v1.5, is specifically designed for complex reasoning and agent tasks.
While reaching SOTA performance on science, mathematics, programming, and agent tasks, it also delivers three times the throughput of the previous generation and runs efficiently on a single card: more accurate, faster, and lighter, all at once.
How is this achieved?
Model Introduction
Llama Nemotron Super v1.5 is short for Llama-3.3-Nemotron-Super-49B-V1.5. It is an upgraded version of Llama-3.3-Nemotron-Super-49B-V1 (which is a derivative model of Meta's Llama-3.3-70B-Instruct), specifically designed for complex reasoning and agent tasks.
Model Architecture
Llama Nemotron Super v1.5 employs Neural Architecture Search (NAS), achieving a good balance between accuracy and efficiency, effectively converting throughput improvements into lower operational costs.
(Note: The goal of NAS is to find the optimal neural network structure from a large number of possible architectures through search algorithms, replacing manual design of neural network architectures with automated methods to improve model performance and efficiency.)
In Llama Nemotron Super v1.5, the NAS algorithm generates non-standard, non-repetitive network modules (blocks). Compared to traditional Transformers, it includes the following two types of changes:
- Skip attention mechanism: In certain modules, the attention layer is directly skipped, or only a linear layer is used as a substitute.
- Variable Feedforward Network (Variable FFN): In the feedforward network, different modules adopt different expansion/compression ratios.
As a result, the model reduces FLOPs by skipping attention or changing FFN width, so it runs more efficiently under resource constraints. The research team then performed block-wise distillation on the original Llama model (Llama 3.3 70B Instruct), constructing multiple variants of each block and searching over combinations of block structures to assemble the final model.
This allows it to meet the throughput and memory requirements on a single H100 80GB GPU while minimizing performance loss.
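To make the two block variants concrete, here is a minimal, illustrative PyTorch sketch (not NVIDIA's actual implementation; all class names, modes, and ratios are hypothetical). Each block either runs standard self-attention, replaces it with a plain linear layer, or skips it entirely, and each block's FFN gets its own expansion ratio; a NAS procedure would pick one such configuration per block under a throughput or memory budget.

```python
import torch
import torch.nn as nn

class VariableBlock(nn.Module):
    """One transformer block whose structure is chosen by NAS (illustrative).

    attn_mode: "full"   -> standard multi-head self-attention
               "linear" -> attention replaced by a single linear layer
               "skip"   -> attention sub-layer removed entirely
    ffn_ratio: per-block FFN expansion ratio (e.g. 4.0, 2.0, 0.5)
    """
    def __init__(self, d_model: int, n_heads: int, attn_mode: str, ffn_ratio: float):
        super().__init__()
        self.attn_mode = attn_mode
        if attn_mode == "full":
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        elif attn_mode == "linear":
            self.attn = nn.Linear(d_model, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        hidden = int(d_model * ffn_ratio)          # variable FFN width
        self.ffn = nn.Sequential(
            nn.Linear(d_model, hidden), nn.GELU(), nn.Linear(hidden, d_model)
        )

    def forward(self, x):
        if self.attn_mode == "full":
            h = self.norm1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
        elif self.attn_mode == "linear":
            x = x + self.attn(self.norm1(x))
        # attn_mode == "skip": no attention sub-layer at all
        return x + self.ffn(self.norm2(x))

# A NAS search would assign one (attn_mode, ffn_ratio) pair per block, trading
# accuracy (measured via block-wise distillation loss) against FLOPs and memory.
config = [("full", 4.0), ("skip", 2.0), ("linear", 1.0), ("full", 0.5)]
model = nn.Sequential(*[VariableBlock(512, 8, m, r) for m, r in config])
x = torch.randn(2, 16, 512)                        # (batch, seq_len, d_model)
print(model(x).shape)                              # torch.Size([2, 16, 512])
```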
Training and Dataset
The model was first trained via knowledge distillation (KD) on a total of 40 billion tokens drawn from three datasets: FineWeb, Buzz-V1.2, and Dolma, with a focus on single-turn and multi-turn English conversation.
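As a rough illustration of what a distillation objective looks like (a generic sketch, not NVIDIA's training code; the temperature and mixing weight are arbitrary placeholders), the student is trained to match the teacher's softened token distributions alongside the usual next-token cross-entropy:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """Blend soft-label KL distillation with hard-label cross-entropy (illustrative).

    student_logits, teacher_logits: (batch, seq_len, vocab)
    targets: (batch, seq_len) ground-truth token ids
    T: temperature that softens both distributions
    alpha: weight on the distillation term vs. the standard LM loss
    """
    # KL(teacher || student) on temperature-softened distributions
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Ordinary next-token cross-entropy against the ground-truth tokens
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), targets.view(-1)
    )
    return alpha * kd + (1.0 - alpha) * ce
```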
In the post-training phase, the model further improved its performance on key tasks such as coding, mathematics, reasoning, and instruction following by combining supervised fine-tuning (SFT) and reinforcement learning (RL) methods.
The data includes both questions from public corpora and synthetically generated Q&A samples; some prompts are paired with both reasoning-on and reasoning-off answers, to strengthen the model's ability to distinguish between the two reasoning modes.
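A purely hypothetical illustration of what such a paired sample might look like (the field names and reasoning-mode markers are invented for illustration and are not NVIDIA's actual data schema):

```python
# Hypothetical shape of one paired training sample; keys and markers are
# illustrative only, not taken from NVIDIA's dataset.
paired_sample = {
    "prompt": "What is 17 * 24?",
    "reasoning_on": (
        "<think> 17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408. </think> "
        "The answer is 408."
    ),
    "reasoning_off": "The answer is 408.",
}
```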
NVIDIA stated that the dataset will be released in the coming weeks.
Overall, Llama Nemotron Super V1.5 is a variant of Llama 3.3 70B Instruct that has been automatically optimized via NAS with a streamlined computational graph. Its structure is simplified for single-card operation, and it combines knowledge-distillation training with post-training, balancing high accuracy, high throughput, and low resource usage, which makes it particularly well suited to English conversational and programming tasks.
Additionally, on the deployment side, NVIDIA continues to play to its usual ecosystem advantage:
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By fully leveraging NVIDIA's hardware (such as GPU cores) and software frameworks (such as CUDA libraries), the model achieves significant speed improvements during training and inference compared to CPU-only solutions.
The model is now open source. Developers can experience Llama Nemotron Super v1.5 at build.nvidia.com or download the model directly from Hugging Face.
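For example, a single-card setup might load the checkpoint with the transformers library roughly as follows. This is a minimal sketch: the repository id is assumed from the model name, trust_remote_code is assumed because of the non-standard block structure, and the exact precision, quantization, and generation settings should be taken from the Hugging Face model card.

```python
# Minimal sketch of single-GPU inference with transformers (settings assumed).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Llama-3_3-Nemotron-Super-49B-v1_5"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # half precision; quantization (e.g. FP8) may be
    device_map="auto",            # needed to fit a single 80 GB card
    trust_remote_code=True,       # custom (NAS-derived) block structure
)

messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```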
One more thing
As NVIDIA's latest open-source large language model, Llama Nemotron Super v1.5 belongs to the NVIDIA Nemotron ecosystem, which integrates large language models, training and inference frameworks, optimization tools, and enterprise-level deployment solutions, aiming to achieve high performance, strong controllability, and easy scalability in generative AI application development.
To meet the needs of different scenarios and user positioning, NVIDIA has launched three differently positioned large language model series based on this ecosystem—Nano, Super, and Ultra.
Among them, the Nano series is aimed at cost-effectiveness and edge deployment, suitable for deployment on edge devices (such as mobile devices, robots, IoT devices, etc.) or cost-sensitive scenarios (such as local operation, offline scenarios, and commercial small model inference).
The Super series focuses on balancing accuracy and computational efficiency on a single GPU. It can run on one high-performance GPU (such as an H100) without multiple cards or large clusters. Its accuracy is higher than Nano's but lower than Ultra's, making it suitable for enterprise developers and medium-sized deployments. The Llama Nemotron Super v1.5 discussed above belongs to this series.
The Ultra series pursues maximum accuracy and is designed to run in data centers, on supercomputing clusters, and across multiple GPUs, targeting tasks that demand extremely high accuracy, such as complex reasoning, large-scale generation, and high-fidelity dialogue.
Currently, Nemotron has been adopted or integrated by companies such as SAP, ServiceNow, Microsoft, Accenture, CrowdStrike, and Deloitte, which use it to build AI agent platforms for enterprise-level process automation and complex problem solving.
Additionally, the Nemotron model can also be called through NVIDIA NIM microservices in the Amazon Bedrock Marketplace, simplifying the deployment process and supporting various operational solutions such as cloud and hybrid architectures.
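NIM endpoints expose an OpenAI-compatible chat API, so a hosted Nemotron model can be called roughly like this. The base URL and model name below are assumptions to be checked against the NVIDIA NIM, build.nvidia.com, or Amazon Bedrock Marketplace documentation.

```python
# Sketch of calling a hosted Nemotron NIM endpoint via its OpenAI-compatible API.
# Both base_url and the model name are assumptions; confirm them in the docs.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",   # assumed NIM endpoint
    api_key=os.environ["NVIDIA_API_KEY"],
)

response = client.chat.completions.create(
    model="nvidia/llama-3.3-nemotron-super-49b-v1.5",  # assumed model name
    messages=[{"role": "user", "content": "Summarize what NAS does in one sentence."}],
    temperature=0.6,
    max_tokens=256,
)
print(response.choices[0].message.content)
```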
Author of this article: Quantum Bit, Source: Quantum Bit, Original title: "NVIDIA's Brand New Open Source Model: Triple Throughput, Single Card Operation, and Achieving Inference SOTA"
Risk Warning and Disclaimer
Markets carry risk, and investment requires caution. This article does not constitute personal investment advice and does not take into account individual users' specific investment goals, financial situation, or needs. Users should consider whether any opinions, views, or conclusions in this article suit their particular circumstances. Investing on this basis is at your own risk.