FalconMamba 7B: A Transformer-Replacing Model with Enhanced Performance

Simply by replacing the Transformer architecture, performance improves across the board and the model immediately tops the charts among open-source models of the same scale! (The attention mechanism is gone entirely.)

This is the latest FalconMamba 7B model.

It uses the Mamba state space language model architecture to handle a variety of text generation tasks. By dropping the traditional attention mechanism, it sidesteps the computational inefficiency Transformers face on long sequences: it can process arbitrarily long sequences without increasing memory requirements, and no matter how long the context is, the time to generate each token stays essentially constant.
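
As a rough illustration of why the per-token cost stays flat (a toy sketch, not FalconMamba's actual implementation): an SSM-style decoder only carries a fixed-size recurrent state between steps, so each new token costs the same whether the context holds ten tokens or a hundred thousand, unlike attention, whose KV cache grows with the context.

```python
import numpy as np

# Toy illustration (not the real FalconMamba code): a recurrent state-space
# step keeps a fixed-size state, so each generation step costs the same
# regardless of how many tokens came before.
d_state, d_model = 16, 64
A = np.random.randn(d_state, d_state) * 0.01   # state transition (toy values)
B = np.random.randn(d_state, d_model) * 0.01   # input projection (toy values)
C = np.random.randn(d_model, d_state) * 0.01   # output projection (toy values)

state = np.zeros(d_state)                      # fixed-size memory of the past
for step in range(100_000):                    # arbitrarily long sequence
    x = np.random.randn(d_model)               # current token embedding
    state = A @ state + B @ x                  # O(1) update per token
    y = C @ state                              # output for this token
# Memory carried between steps is just `state` (d_state floats),
# independent of how long the sequence has become.
```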

The FalconMamba model thus comprehensively outperforms a range of Transformer-architecture models such as Llama 3.1 (8B), Mistral (7B), and Falcon 2 (11B).

The work comes from the Technology Innovation Institute (TII) in Abu Dhabi, UAE, the same team behind the Falcon models.

The series contains four models in total: the base version, the instruction fine-tuned version, the 4-bit version, and the instruction fine-tuned 4-bit version.
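
A minimal loading sketch with the Hugging Face transformers library; the base and instruct repository IDs below follow the published tiiuae/falcon-mamba-7b naming, while the 4-bit IDs are assumptions you should verify against the model cards:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repository IDs (the base and instruct names are published; the 4-bit names
# are assumptions -- verify them on the Hugging Face hub):
#   tiiuae/falcon-mamba-7b                 base
#   tiiuae/falcon-mamba-7b-instruct        instruction fine-tuned
#   tiiuae/falcon-mamba-7b-4bit            4-bit (assumed ID)
#   tiiuae/falcon-mamba-7b-instruct-4bit   instruction fine-tuned, 4-bit (assumed ID)
model_id = "tiiuae/falcon-mamba-7b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")  # needs `accelerate`
```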

The new models are released under the TII Falcon License 2.0, a permissive license based on Apache 2.0. Onlookers exclaimed: the rules of the game are about to change!

World's first open-source SSLM

In terms of performance, FalconMamba 7B comprehensively outperforms a host of open-source models.

It is based on the first-generation Mamba.

Mamba is a state space model (SSM). It combines characteristics of RNNs and CNNs, and by introducing a selection mechanism it lets the model selectively propagate or forget information depending on the current input, improving the efficiency of processing textual information.

At the same time, it uses a hardware-aware parallel algorithm that runs in recurrent mode, reducing IO between levels of the GPU memory hierarchy and improving computational efficiency.

Finally, it simplifies the architecture by merging the SSM layer and the Transformer's MLP block into a single block.
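
A heavily simplified sketch of these two ideas (my own illustration, not TII's or the original Mamba implementation, and omitting the discretization details and fused kernels of the real thing): the SSM parameters are made functions of the current input, so the state update can selectively keep or forget information, and the gated MLP-style path lives inside the same block.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySelectiveSSMBlock(nn.Module):
    """Illustrative Mamba-style block (hypothetical, simplified): B, C and the
    step size depend on the input (the "selection" mechanism), and the gated
    MLP-like path is folded into the same block rather than a separate sub-layer."""

    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_model)   # main branch + gate branch
        self.to_B = nn.Linear(d_model, d_state)          # input-dependent B
        self.to_C = nn.Linear(d_model, d_state)          # input-dependent C
        self.to_dt = nn.Linear(d_model, 1)               # input-dependent step size
        self.A = nn.Parameter(-torch.rand(d_state))      # negative diagonal -> decay
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                                # x: (batch, seq, d_model)
        u, gate = self.in_proj(x).chunk(2, dim=-1)
        h = x.new_zeros(x.shape[0], self.A.shape[0], u.shape[-1])  # (batch, d_state, d_model)
        outs = []
        for t in range(u.shape[1]):                      # sequential scan (the real model fuses this)
            ut = u[:, t]                                 # (batch, d_model)
            dt = F.softplus(self.to_dt(ut))              # (batch, 1), positive step size
            a = torch.exp(dt * self.A)                   # (batch, d_state), per-step decay
            Bt, Ct = self.to_B(ut), self.to_C(ut)        # selection: parameters depend on input
            h = a.unsqueeze(-1) * h + Bt.unsqueeze(-1) * ut.unsqueeze(1)   # keep-or-forget update
            outs.append((Ct.unsqueeze(-1) * h).sum(dim=1))                 # readout, (batch, d_model)
        y = torch.stack(outs, dim=1) * F.silu(gate)      # gated, MLP-like path in the same block
        return self.out_proj(y)

block = ToySelectiveSSMBlock(d_model=64)
print(block(torch.randn(2, 10, 64)).shape)               # torch.Size([2, 10, 64])
```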

Switching from the Transformer to Mamba lets the Falcon model handle arbitrarily long sequences without additional memory; in particular, it fits on a single A10 24GB GPU.

The research also discusses two different ways of processing sequences. Parallel prefill suits GPU parallel processing but has high memory requirements; sequential prefill suits SSM models and can handle arbitrarily long sequences without being limited by memory.
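
A rough sketch of the distinction on a toy recurrence (illustration only, not the actual FalconMamba kernels): parallel prefill keeps per-token hidden states for the whole prompt in memory at once, while sequential prefill feeds tokens one by one and only ever stores the fixed-size state.

```python
import numpy as np

d_state, d_model = 16, 64
A = 0.99 * np.eye(d_state)                        # toy state transition
B = np.random.randn(d_state, d_model) * 0.01      # toy input projection

def parallel_prefill(prompt_embeddings):
    """Process the whole prompt at once: fast on GPUs (done with a parallel
    scan in practice), but the per-token states (len(prompt) x d_state)
    must all fit in memory."""
    states = np.zeros((len(prompt_embeddings), d_state))   # memory grows with prompt length
    state = np.zeros(d_state)
    for t, x in enumerate(prompt_embeddings):
        state = A @ state + B @ x
        states[t] = state
    return states[-1]

def sequential_prefill(prompt_embeddings):
    """Process tokens one at a time: only the fixed-size state is kept, so an
    arbitrarily long prompt never exceeds a constant memory budget."""
    state = np.zeros(d_state)
    for x in prompt_embeddings:
        state = A @ state + B @ x                 # constant memory, length-independent
    return state

prompt = np.random.randn(1000, d_model)
assert np.allclose(parallel_prefill(prompt), sequential_prefill(prompt))
```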

To ensure stability during large-scale training, the FalconMamba model uses additional RMS normalization layers.

RMS normalization simplifies LayerNorm's computation and reduces the amount of calculation required.
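
For reference, a minimal RMSNorm in its usual form (a sketch of the standard operation, not FalconMamba's exact layer): it skips LayerNorm's mean subtraction and bias, normalizing only by the root-mean-square of the activations.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Standard RMS normalization: y = x / sqrt(mean(x^2) + eps) * weight.
    Unlike LayerNorm there is no mean subtraction and no bias, so fewer
    operations are needed per activation."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

print(RMSNorm(64)(torch.randn(2, 8, 64)).shape)   # torch.Size([2, 8, 64])
```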

The model is trained on roughly 5,500 GT (gigatokens) of data, drawn mainly from the RefinedWeb dataset plus public data. Training proceeds at a mostly constant rate, and a small portion of high-quality curated data is added in the later stage of training, which helps the model in its final phase of optimization.

On an H100, in a test generating tokens with a batch size of 1 and prompt lengths from 1 to 130k tokens, FalconMamba maintains stable throughput when generating new tokens, meaning its performance is unaffected by text length and it can handle long sequences without degradation.
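
A hedged sketch of how one might reproduce this kind of measurement with the transformers generation API (this is not the team's benchmark script; the prompt lengths are illustrative, and actual numbers depend on hardware and software versions):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-mamba-7b"        # base model on the Hugging Face hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

new_tokens = 128
for prompt_len in (1_000, 10_000, 100_000):            # illustrative prompt lengths
    # Synthetic prompt of the desired length (timing illustration only).
    input_ids = torch.randint(
        0, tokenizer.vocab_size, (1, prompt_len), device=model.device
    )
    start = time.time()
    model.generate(input_ids, max_new_tokens=new_tokens, do_sample=False)
    elapsed = time.time() - start
    print(f"prompt {prompt_len:>7} tokens: "
          f"{elapsed / new_tokens * 1000:.1f} ms/token (incl. prefill)")
```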

FalconMamba supports multiple Hugging Face APIs, including AutoModelForCausalLM and pipeline. An instruction-tuned version is also released, fine-tuned on an additional 5 billion tokens to make the model's instruction following more accurate.
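
A minimal usage sketch with the pipeline API mentioned above (the instruct repository ID follows the published naming; check the model card for details such as the recommended prompt format):

```python
from transformers import pipeline

# Text generation via the pipeline API with the instruction-tuned FalconMamba.
generator = pipeline(
    "text-generation",
    model="tiiuae/falcon-mamba-7b-instruct",
    device_map="auto",          # needs `accelerate`
)
prompt = "Explain state space language models in two sentences."
print(generator(prompt, max_new_tokens=128)[0]["generated_text"])
```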

The latest models are available on Hugging Face and GitHub.

Reference link:

This article is from the WeChat public account Quantum Bits (ID: QbitAI), author: Ming Min. Original title: "Replace the Transformer, and the 7B open-source model immediately tops the charts! Arbitrarily long sequences can be handled."
