OpenAI open-source model leak: Six technical details

Wallstreetcn
2025.08.01 23:40

Details of OpenAI's upcoming open-source large models have been leaked, including a 120-billion-parameter mixture-of-experts model and a 20-billion-parameter dense model. The former activates only about 5-6 billion parameters during inference, improving efficiency and reducing cost. The model may have been trained with Float4, run on NVIDIA Blackwell chips, use a clipped SwiGLU activation function, support a 128K context window, and adopt a sliding window attention mechanism.

Detailed technical specifications of the open-source large models that OpenAI may soon release have surfaced, based on leaked information.

Model Architecture: 120 Billion Parameter Mixture of Experts (MoE)

According to leaks, OpenAI may release two models:

One is a 120 billion (120B) parameter Mixture of Experts (MoE) model: it activates only about 5-6 billion (5B/6B) parameters during inference. This means it can achieve extremely high inference efficiency while maintaining a vast knowledge capacity, significantly reducing operational costs.

The other is a 20 billion (20B) parameter dense model: a more compact and easier-to-deploy version.

For now, these two models will focus on text processing and will not include multimodal capabilities.
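
As a rough illustration of how an MoE layer keeps the active parameter count small, here is a minimal top-k routing sketch in PyTorch. The dimensions, expert count, and top-k value are illustrative assumptions, not the leaked model's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Illustrative top-k mixture-of-experts layer (all sizes are made up)."""
    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # routing layer
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                             # x: (tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)      # (tokens, n_experts)
        weights, idx = gate.topk(self.top_k, dim=-1)  # pick k experts per token
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():                        # only routed tokens hit expert e
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(10, 64)
print(TinyMoE()(tokens).shape)  # torch.Size([10, 64]); only 2 of 8 experts run per token
```

Because each token passes through only a small fraction of the experts, the parameters actually exercised per forward pass stay far below the total parameter count, which is the effect the 120B/5-6B figures describe.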

Training Technology: Possibly Using Float4 and NVIDIA's Latest Blackwell Chips

In pursuit of maximum efficiency, the model may have been trained or quantized using Float4. This is a very aggressive quantization scheme that greatly compresses model size and speeds up computation.

It is speculated that this may have been accomplished using NVIDIA's newly released Blackwell architecture GPUs, as this series of chips natively supports Float4 operations. Another possibility is that the model was compressed to Float4 through Post-Training Quantization (PTQ) technology after training.
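For intuition, here is a minimal sketch of symmetric 4-bit post-training quantization in PyTorch. It uses a uniform integer grid to simulate the quantize/dequantize round trip; real Float4 on Blackwell GPUs uses a floating-point grid with its own exponent/mantissa layout, so this is only an approximation of the idea.

```python
import torch

def fake_quant_4bit(w: torch.Tensor) -> torch.Tensor:
    """Simulate symmetric 4-bit post-training quantization (PTQ) of a weight tensor.

    Illustration only: a single per-tensor scale and a uniform integer grid.
    Real FP4 uses a floating-point code space, and production PTQ typically
    uses per-channel or per-group scales.
    """
    qmax = 7                                      # 4-bit signed range is [-8, 7]
    scale = w.abs().max() / qmax                  # one scale for the whole tensor
    q = torch.clamp(torch.round(w / scale), -8, 7)
    return q * scale                              # dequantized weights used at inference

w = torch.randn(4, 4)
w_q = fake_quant_4bit(w)
print("max abs error:", (w - w_q).abs().max().item())
```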

Activation Function: Range-Limited SwiGLU

To accommodate Float4 quantization, the model may have adopted the SwiGLU activation function and clipped its output range, limiting it to between -7 and 7.

This is similar to the classic ReLU6 function, aimed at eliminating extreme outliers in activation values, ensuring a more stable numerical distribution, thereby reducing precision loss during quantization. This is crucial for low-precision formats like Float4.
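A minimal PyTorch sketch of a SwiGLU block with a clipped output is shown below. It assumes the clamp is applied to the gated hidden activations; the exact placement of the clamp and the layer sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClippedSwiGLU(nn.Module):
    """SwiGLU feed-forward block whose gated activations are clamped to [-7, 7].

    Bounding the activations removes extreme outliers, keeping the numeric
    range friendly to very low-precision formats such as Float4 (the same
    motivation as ReLU6's upper bound).
    """
    def __init__(self, d_model=64, d_ff=256, clip=7.0):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff)
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)
        self.clip = clip

    def forward(self, x):
        h = F.silu(self.gate(x)) * self.up(x)      # standard SwiGLU
        h = torch.clamp(h, -self.clip, self.clip)  # clip to the stated range
        return self.down(h)

print(ClippedSwiGLU()(torch.randn(3, 64)).shape)   # torch.Size([3, 64])
```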

Context Window: Expanded to 128K Through YaRN Technology

The model will have an ultra-long context window of 128K, but this was not trained from scratch. It is speculated that the model's base context window is 4K, which was then seamlessly extended to 128K during training using techniques such as YaRN.
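YaRN itself rescales different RoPE frequency bands by different amounts and adjusts the attention temperature; the sketch below shows only the simpler underlying idea of mapping a much longer position range onto the rotary scale the model was trained on. All numbers are illustrative assumptions.

```python
import torch

def rope_frequencies(head_dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE inverse frequencies for half of a head's dimensions."""
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

def scaled_positions(seq_len: int, trained_len: int = 4096, target_len: int = 131072):
    """Naive position interpolation: squeeze new positions into the trained range.

    YaRN is more nuanced (per-frequency-band scaling plus a temperature fix),
    but the core idea of addressing a 128K window with a 4K-trained rotary
    scale is the same.
    """
    scale = target_len / trained_len          # 32x extension in this example
    return torch.arange(seq_len).float() / scale

freqs = rope_frequencies(head_dim=64)
angles = torch.outer(scaled_positions(8), freqs)  # per-position rotary angles
print(angles.shape)                               # torch.Size([8, 32])
```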

Attention Mechanism: Sliding Window Attention (SWA) and Attention Sinks

To efficiently handle long texts of 128K, the model employs two key technologies:

Sliding Window Attention (SWA): The window size is 128. This means that when computing attention, each token only needs to attend to the 128 tokens nearest to it, reducing computational cost from quadratic in sequence length to linear.

Attention Sinks: To address SWA's tendency to forget important early information, the model introduces the attention sink technique. It forces the model to always attend to the first few (e.g., 4 or 8) tokens, ensuring that early context is not lost when processing long sequences. NVIDIA's TensorRT-LLM also supports this feature.
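
A minimal sketch of an attention mask combining a sliding window with attention sinks is shown below. The window size and sink count echo the figures above, but the implementation details are assumptions, not the leaked model's actual code.

```python
import torch

def swa_sink_mask(seq_len: int, window: int = 128, n_sinks: int = 4) -> torch.Tensor:
    """Boolean attention mask combining a sliding window with attention sinks.

    Each query may attend to (a) the `window` most recent tokens up to and
    including itself, and (b) the first `n_sinks` tokens of the sequence,
    which remain visible no matter how far the window has slid.
    """
    q = torch.arange(seq_len)[:, None]   # query positions
    k = torch.arange(seq_len)[None, :]   # key positions
    causal = k <= q                      # no attending to the future
    in_window = (q - k) < window         # sliding window of recent tokens
    is_sink = k < n_sinks                # always-visible sink tokens
    return causal & (in_window | is_sink)

mask = swa_sink_mask(seq_len=10, window=4, n_sinks=2)
print(mask.int())   # 1 = allowed to attend, 0 = masked out
```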

Underlying Architecture: Integrating Features of Llama/Mixtral and Using Bias Terms

The model's underlying architecture likely draws from successful open-source models like Llama and Mixtral. Key features include:

Merged QKV Matrices: The query (Q), key (K), and value (V) projections in the attention mechanism are fused into a single matrix to improve computational efficiency.

Extensive Use of Bias Terms: Unlike some models (such as Llama) that remove bias terms, this model seems to retain bias terms across all modules (including MLP, attention layers, and even the routing layers of MoE), which may help enhance the model's fitting ability.
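
As an illustration of both points, here is a minimal sketch of a fused QKV projection that keeps its bias term. The model dimension and head count are assumptions for the example.

```python
import torch
import torch.nn as nn

class FusedQKV(nn.Module):
    """Single fused projection producing Q, K and V in one matrix multiply.

    Keeping bias=True mirrors the leaked detail that bias terms are retained
    throughout; fusing the three projections means one GEMM instead of three.
    """
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=True)

    def forward(self, x):                          # x: (batch, seq, d_model)
        b, s, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)     # split the fused output
        shape = (b, s, self.n_heads, self.head_dim)
        return q.reshape(shape), k.reshape(shape), v.reshape(shape)

q, k, v = FusedQKV()(torch.randn(2, 5, 64))
print(q.shape, k.shape, v.shape)   # each torch.Size([2, 5, 4, 16])
```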

AI Cambrian, original title: "OpenAI Open Source Model Leak: Six Technical Details"

Risk Warning and Disclaimer

The market has risks, and investment requires caution. This article does not constitute personal investment advice and does not take into account individual users' specific investment goals, financial situations, or needs. Users should consider whether any opinions, views, or conclusions in this article align with their specific circumstances. Any investment made on this basis is at one's own risk.