Meta AI Introduces Transfusion, Unifying Language and Image Generation in AI

TapTechNews, August 24: Meta AI has introduced Transfusion, a new method that combines a language model and an image generation model in a single, unified AI system.

According to the team's introduction cited by TapTechNews, Transfusion combines the strengths of language models in handling discrete data such as text with the ability of diffusion models to generate continuous data such as images.

Meta explained that current image generation systems usually use a pre-trained text encoder to process the input prompt and then pass the result to a separate diffusion model that generates the image.

Many multimodal language models work similarly: they connect a pre-trained text model with dedicated encoders for other modalities.

Transfusion, by contrast, uses a single, unified Transformer architecture for all modalities, trained end-to-end on both text and image data. Text and images use different loss functions: next-token prediction for text and a diffusion objective for images.
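To make the two objectives concrete, here is a minimal pure-Python sketch of how a text loss and an image loss might be combined into one training signal. It assumes cross-entropy for next-token prediction and a mean-squared error on predicted noise for the diffusion part. The function names and the `image_weight` coefficient are illustrative assumptions, not details given in the article.

```python
import math

def text_loss(logits, target):
    """Cross-entropy for one next-token prediction.
    logits: list of unnormalized scores; target: index of the true next token."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def diffusion_loss(predicted_noise, true_noise):
    """Mean squared error between predicted and actual noise for one image patch
    (assumed noise-prediction form of the diffusion objective)."""
    return sum((p - t) ** 2 for p, t in zip(predicted_noise, true_noise)) / len(true_noise)

def combined_loss(text_terms, image_terms, image_weight=1.0):
    """Average each modality's loss, then add them.
    image_weight is a hypothetical balancing coefficient."""
    lm = sum(text_terms) / len(text_terms) if text_terms else 0.0
    dm = sum(image_terms) / len(image_terms) if image_terms else 0.0
    return lm + image_weight * dm

# Usage: one text position and one image patch in the same batch.
loss = combined_loss(
    [text_loss([2.0, 0.5, -1.0], target=0)],
    [diffusion_loss([0.1, -0.2], [0.0, 0.0])],
    image_weight=0.5,
)
```

Both terms come from the same Transformer forward pass, so a single backward pass updates shared parameters for both modalities.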


To handle text and images together, each image is converted into a sequence of image patches. The model can then process text tokens and image patches in a single sequence, and a special attention mask lets it capture the relationships within an image.
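The patch conversion and the attention mask can be sketched in a few lines of pure Python. The article does not spell out the mask, so this sketch assumes one plausible design: every position attends causally to earlier positions, and patches belonging to the same image can additionally attend to each other in both directions. The helper names and the `kinds` encoding are hypothetical.

```python
def patchify(image, patch):
    """Split a 2D list (H x W) into flattened patch vectors in row-major order.
    Assumes H and W are multiples of `patch`."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches

def attention_mask(kinds):
    """kinds[i] is None for a text token, or an image id for an image patch.
    mask[i][j] is True when position i may attend to position j:
    causal everywhere, plus bidirectional within the same image."""
    n = len(kinds)
    return [[j <= i or (kinds[i] is not None and kinds[i] == kinds[j])
             for j in range(n)]
            for i in range(n)]

# Usage: two text tokens, one 4x4 image cut into four 2x2 patches, one text token.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = patchify(image, 2)
kinds = [None, None] + [0] * len(patches) + [None]
mask = attention_mask(kinds)
```

Under this assumption, text tokens never see future positions, while each patch sees every other patch of its own image.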

Unlike Meta's earlier methods such as Chameleon, which converts images into discrete tokens and then processes them the same way as text, Transfusion keeps a continuous representation of the image and avoids the information loss caused by quantization.
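The information loss from quantization can be illustrated with a toy example. The sketch below assumes a simple nearest-neighbor codebook lookup, which is a generic form of vector quantization and not necessarily the tokenizer Chameleon uses. The codebook and the vector values are made up for illustration.

```python
def quantize(vector, codebook):
    """Return the index of the codebook entry closest to `vector`
    (squared Euclidean distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: dist(vector, codebook[i]))

def reconstruction_error(vector, codebook):
    """Squared error introduced by replacing `vector` with its discrete code."""
    nearest = codebook[quantize(vector, codebook)]
    return sum((x - y) ** 2 for x, y in zip(vector, nearest))

# Usage: a toy two-entry codebook and one continuous patch embedding.
codebook = [[0.0, 0.0], [1.0, 1.0]]
patch_embedding = [0.9, 0.7]
token_id = quantize(patch_embedding, codebook)
error = reconstruction_error(patch_embedding, codebook)
```

Any vector that is not exactly a codebook entry has a nonzero error after quantization. A model that consumes the continuous vector directly never incurs that error.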

Experiments also show that Transfusion scales more efficiently than comparable methods. In image generation, it achieves results similar to specialized models with significantly less compute. Surprisingly, including image data also improves text processing ability.


The researchers trained a 7-billion-parameter model on 2 trillion text and image tokens. Its image generation results were comparable to established systems such as DALL-E 2, and it can also handle text.

