Two Chinese Manufacturers' Language Models Rank Among the Top Ten on Stanford's HELM Leaderboard

TapTechNews, June 22nd. On June 11th, Stanford University's Center for Research on Foundation Models (CRFM) published the results of the Massive Multitask Language Understanding (MMLU) evaluation on its HELM leaderboard. Among the top ten large language models in the overall ranking, two come from Chinese manufacturers: Qwen2 Instruct (72B) from Alibaba and Yi Large (Preview) from Zero-One Universe.

The MMLU evaluation on HELM follows the testing methodology proposed by Dan Hendrycks et al., which measures a text model's accuracy across multitask learning. The test covers 57 tasks in fields such as elementary mathematics, US history, computer science, and law. To score well, a model needs extensive world knowledge and strong problem-solving ability. TapTechNews attaches the ranking as follows:


1. Claude 3 Opus (20240229): Anthropic (US, backed by Amazon).

2. GPT-4o (2024-05-13): OpenAI (US).

3. Gemini 1.5 Pro: Google (US).

4. GPT-4 (0613): OpenAI (US).

5. Qwen2 Instruct (72B): Alibaba (China).

6. GPT-4 Turbo (2024-04-09): OpenAI (US).

7. Gemini 1.5 Pro (0409 preview): Google (US).

8. GPT-4 Turbo (1106 preview): OpenAI (US).

9. Llama 3 (70B): Meta (US).

10. Yi Large (Preview): Zero-One Universe (China).
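
For readers curious how the accuracy behind this ranking is computed, the sketch below outlines MMLU-style scoring under common assumptions: each of the 57 subjects contributes four-option multiple-choice questions, and the metric is the share answered correctly. The `ask_model` helper here is hypothetical and stands in for whichever model is being evaluated; HELM's own harness adds prompting and aggregation details not shown.

```python
# Minimal sketch of MMLU-style scoring: four-option multiple-choice questions,
# reported as plain accuracy. `ask_model` is a hypothetical stand-in for the
# model endpoint under evaluation.

def format_prompt(question: str, choices: list[str]) -> str:
    # Render the question with lettered options, ending with an answer cue.
    letters = "ABCD"
    options = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(choices))
    return f"{question}\n{options}\nAnswer:"

def ask_model(prompt: str) -> str:
    # Hypothetical: return the model's predicted letter ("A"-"D") for the prompt.
    raise NotImplementedError

def mmlu_accuracy(examples: list[dict]) -> float:
    # Each example: {"question": str, "choices": [4 strings], "answer": "A".."D"}
    correct = sum(
        1
        for ex in examples
        if ask_model(format_prompt(ex["question"], ex["choices"]))
        .strip()
        .upper()
        .startswith(ex["answer"])
    )
    return correct / len(examples)
```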

Qwen2 is an open-source large language model series developed by Alibaba and released on June 6th this year. The series includes pre-trained and instruction-tuned models in five sizes: Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, and Qwen2-72B. It is trained on data in 27 languages beyond English and Chinese, and Qwen2-7B-Instruct and Qwen2-72B-Instruct support context lengths of up to 128,000 tokens.
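
Because the Qwen2 weights are openly released, the instruction-tuned checkpoints can be run with the standard Hugging Face transformers API. The following is a minimal sketch, assuming the "Qwen/Qwen2-7B-Instruct" checkpoint on Hugging Face, a recent transformers release, and the accelerate package for automatic device placement.

```python
# Minimal sketch: chat with Qwen2-7B-Instruct via Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-7B-Instruct"  # assumed open-weight checkpoint on Hugging Face
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Build a chat-formatted prompt and generate a short reply.
messages = [{"role": "user", "content": "Briefly explain what the MMLU benchmark measures."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```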

Yi Large is a closed-source large model developed by Zero-One Universe. The Yi series starts from 6B and 34B pre-trained language models, which are then extended to chat models, long-context models (200,000 tokens), depth-upgraded models, and vision-language models. The company claims that Yi Large outperforms leading models such as GPT-4 and Claude 3 Opus on key benchmark scores.
