New Breakthroughs in AI Technology: 70B Model and Inference Acceleration

A 70B model generating 1,000 tokens per second, which works out to roughly 4,000 characters! (Note: 1 token is approximately 4 characters.)

Researchers fine-tuned Llama3 and introduced an acceleration algorithm that makes it 13 times faster than the native version!

Not only is it fast, but its performance on code rewriting tasks even surpasses that of GPT-4o.

This achievement comes from Anysphere, the team behind the popular AI programming tool Cursor, whose backers include OpenAI.

For comparison, on Groq, an inference acceleration framework known for its speed, the 70B Llama3 reaches only a little over 300 tokens per second.

At this speed, Cursor achieves what amounts to near-instantaneous editing of complete code files.

Some people even excitedly asked: if Cursor's fine-tuned Llama3 were run on Groq, could it generate tens of thousands of tokens per second?

Introducing a New Inference Acceleration Algorithm

The acceleration method the authors designed targets a task called FastApply: quickly modifying code and applying the changes.

First, a clarification: although the end result of the task is a local modification of the code, the model does not output only the changed content; it rewrites the entire file.

The team made this choice after preliminary testing: they found that, with the exception of Claude-3-Opus, most models perform poorly on the true local modification task.

There are three main reasons for this. First, rewriting the file outright produces more output tokens, and therefore more forward passes in which to converge on the correct solution. Second, most of the models' training data consists of complete code, so they are relatively unfamiliar with local modifications. Third, large models are weak at arithmetic and cannot be trusted to handle line numbers correctly when outputting diffs. (The authors note, however, that this remains a potential direction for future research.)
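
To make the distinction concrete, here is a toy Python contrast between the two output formats. The snippet, the edit structure, and the line-number convention are invented for illustration; they are not Cursor's actual prompt or diff format.

```python
# The same change expressed as a "global rewrite" vs. a "local modification".
original = (
    "def greet(name):\n"
    '    print("hello " + name)\n'
    "\n"
    "def farewell(name):\n"
    '    print("bye " + name)\n'
)

# Global rewrite: the model re-emits the whole file with the change applied.
full_rewrite = (
    "def greet(name):\n"
    '    print(f"hello {name}")\n'
    "\n"
    "def farewell(name):\n"
    '    print("bye " + name)\n'
)

# Local modification: only the changed hunk is emitted, but the model must get
# the line number exactly right, which is the arithmetic large models handle poorly.
diff_edit = {"replace_line": 2, "with": '    print(f"hello {name}")\n'}

lines = original.splitlines(keepends=True)
lines[diff_edit["replace_line"] - 1] = diff_edit["with"]
assert "".join(lines) == full_rewrite
```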

Having settled on the global rewriting scheme, the Cursor team fine-tuned Llama3 on task-specific data drawn from two major sources, real edit data and synthetic data, mixed at a ratio of 1:4. The synthetic data consists of code editing suggestions generated by GPT-4, which other models then applied to the original code. To improve the quality of the dataset, the authors also down-sampled small files, duplicate files, and unchanged samples.
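
As a rough illustration, here is a minimal Python sketch of what such a data-preparation step might look like. The sample format, thresholds, and sampling probabilities are assumptions for illustration, not details disclosed by the Cursor team.

```python
import hashlib
import random

# Hypothetical sample format: {"original": str, "rewritten": str}.
def clean(samples, min_lines=5, keep_small=0.25, keep_unchanged=0.1):
    """Drop duplicates and down-sample small or unchanged files (assumed thresholds)."""
    seen, kept = set(), []
    for s in samples:
        digest = hashlib.sha256((s["original"] + s["rewritten"]).encode()).hexdigest()
        if digest in seen:          # duplicate file pair
            continue
        seen.add(digest)
        if s["original"] == s["rewritten"] and random.random() > keep_unchanged:
            continue                # down-sample unchanged samples
        if len(s["original"].splitlines()) < min_lines and random.random() > keep_small:
            continue                # down-sample small files
        kept.append(s)
    return kept

def build_mixture(real_edits, synthetic_edits, ratio=4):
    """Mix real and synthetic edit data at roughly 1:ratio after cleaning."""
    real = clean(real_edits)
    synthetic = clean(synthetic_edits)
    k = min(len(synthetic), ratio * len(real))
    mixed = real + random.sample(synthetic, k)
    random.shuffle(mixed)
    return mixed
```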

To evaluate the models' performance, the authors had them handle 450 code editing tasks (each file no longer than 400 lines) and scored the outputs with Claude3-Opus.
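
This is a standard LLM-as-judge setup; a minimal sketch follows. The prompt wording, the 0-10 scale, and `call_judge` (a stand-in for the judge model's API) are all assumptions, not the authors' actual harness.

```python
import re

JUDGE_PROMPT = """You are grading a code edit.
Original file:
{original}
Requested change:
{instruction}
Candidate rewrite:
{candidate}
Reply with a single integer score from 0 (wrong) to 10 (perfect)."""

def call_judge(prompt: str) -> str:
    # Placeholder: in practice this would call the judge model's API.
    return "8"

def score_task(task: dict, candidate: str) -> int:
    """Ask the judge model to grade one candidate rewrite and parse the score."""
    reply = call_judge(JUDGE_PROMPT.format(
        original=task["original"],
        instruction=task["instruction"],
        candidate=candidate,
    ))
    match = re.search(r"\d+", reply)
    return int(match.group()) if match else 0

def evaluate(tasks: list[dict], generate) -> float:
    """Average judge score of `generate(task)` over all evaluation tasks."""
    scores = [score_task(t, generate(t)) for t in tasks]
    return sum(scores) / len(scores)
```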

In the end, the authors' fine-tuned 70B Llama3 performs almost as well as Claude3-Opus-diff and outperforms GPT-4-Turbo and GPT-4o.

At this point, fine-tuning has solved the quality problem, but Llama3 is clearly still very slow, producing fewer than 300 characters per second (characters, note, not words or tokens).

The other secret weapon, the one that makes the rewrite extremely fast, is an algorithm the Cursor team introduced specifically for the code rewriting task: so-called speculative edits. It uses a prior algorithm to predict multiple upcoming tokens, which the main large model then verifies, cutting down the number of calls to the large model and therefore the amount of computation.

This prior algorithm exploits a characteristic of code: compared with other text, its vocabulary is smaller and its grammar, indentation rules, and so on are more deterministic, so upcoming tokens can be predicted more accurately from prior knowledge.

The approach has something in common with GPT-4 and Meta. Traditional language model inference is slow mainly because predicting the next token is autoregressive: to generate each token, the model must attend to all previously generated tokens. To reduce this computation, large models such as GPT-4 use an acceleration technique called speculative decoding, in which a small approximate model makes predictions in advance and the main large model then verifies them.

The difference between Cursor and GPT-4 is that Cursor's "small model" is a more deterministic algorithm, whereas GPT-4's is merely a scaled-down model that still makes probabilistic predictions. Meta, for its part, has proposed an algorithm that predicts multiple upcoming tokens at once, using n independent output heads to predict n future tokens in parallel; it turned out to perform especially well on programming tasks, since programming languages have a more rigorous logical structure and their knowledge is more tightly interconnected. Cursor exploits this characteristic even more fully: it does not use extra prediction heads at all, but a deterministic algorithm for multi-token prediction.

The end result: the speculation algorithm delivers a nearly 13-fold speedup for the 70B Llama3, with no loss in evaluation performance.
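
Here is a minimal, character-level sketch of the speculative-edits idea: the draft comes from deterministically copying the original file, and the "large model" (a toy stand-in here) verifies the draft, accepting the longest matching prefix and supplying one character of its own where it diverges. The chunk size, the resynchronization heuristic, and the `verify` interface are assumptions made for illustration, not Cursor's implementation.

```python
def speculative_edit(original: str, verify, chunk: int = 16) -> str:
    """Rewrite `original` using speculative drafts copied from the source text.

    `verify(prefix, draft)` models one verification call to the large model:
    it returns (accepted, next_char), where `accepted` is the longest prefix of
    `draft` the model agrees with and `next_char` is the model's own next
    character afterwards ("<eos>" to stop). In a real system this is a single
    batched forward pass, not a Python callback.
    """
    out, src = [], 0  # src: pointer into `original` used to build drafts
    while True:
        draft = original[src:src + chunk]        # deterministic draft: copy the source
        accepted, next_char = verify("".join(out), draft)
        out.append(accepted)
        src += len(accepted)
        if next_char == "<eos>":
            break
        out.append(next_char)                    # the model's own character at a divergence
        if src < len(original) and original[src] == next_char:
            src += 1                             # still in sync with the source
        else:
            hit = original.find(next_char, src)  # naive resync heuristic; real systems
            src = hit + 1 if hit != -1 else src  # would realign on line boundaries
    return "".join(out)


def make_toy_verifier(target: str):
    """Toy stand-in for the large model: it 'wants' to produce `target` exactly."""
    def verify(prefix: str, draft: str):
        want = target[len(prefix):]
        accepted = ""
        for a, b in zip(draft, want):
            if a != b:
                break
            accepted += a
        rest = want[len(accepted):]
        return accepted, (rest[0] if rest else "<eos>")
    return verify


before = 'print("hello " + name)\nprint("bye " + name)\n'
after = 'print(f"hello {name}")\nprint("bye " + name)\n'
assert speculative_edit(before, make_toy_verifier(after)) == after
```

The speedup in this scheme comes from the unchanged portions of the file, where the whole drafted chunk is accepted in a single verification call instead of one autoregressive step per token.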

In addition, the authors partnered with fireworks.ai, an enterprise AI model infrastructure platform, to further improve the model's running efficiency using its optimized inference engine and customized hardware environment. Going forward, the team plans to use knowledge distillation to transfer the speculative edits algorithm to the smaller 8B Llama3 and to expand to more programming languages and tasks. They also plan to improve the true local modification (diff) algorithm that they studied but did not adopt.

One More Thing

In the experiments, the authors used the speculation algorithm to accelerate not only Llama3 but also GPT-4-Turbo. However, they did not explain how this was achieved for GPT, instead leaving it as an exercise with prizes attached: anyone who answers correctly gets one month of Cursor membership, and anyone who implements speculative acceleration in vllm or TensorRT-LLM gets half a year or a full year of membership, respectively.

If you think you have an idea, you might as well take up the challenge (tongue firmly in cheek).

This article comes from the WeChat public account: Quantum Bits (ID: QbitAI), author: Kelei Xi.
