Alibaba Cloud's Tongyi Qianwen Open-Sources Two Speech Base Models

TapTechNews, July 9: Alibaba Cloud's Tongyi Qianwen team has open-sourced two speech base models, SenseVoice (for speech recognition) and CosyVoice (for speech generation).


SenseVoice focuses on high-precision multilingual speech recognition, emotion recognition, and audio event detection, and has the following characteristics:

Multilingual recognition: Trained on more than 400,000 hours of data and supporting more than 50 languages, it is reported to outperform the Whisper model in recognition accuracy.

Rich transcription: Offers strong emotion recognition, matching or exceeding the current best emotion recognition models on test data; also detects sound events common in human-computer interaction, such as music, applause, laughter, crying, coughing, and sneezing.

Efficient inference: The SenseVoice-Small model uses a non-autoregressive end-to-end framework with extremely low inference latency, taking only 70 ms to process 10 seconds of audio, about 15 times faster than Whisper-Large.

Fine-tuning customization: Provides convenient fine-tuning scripts and strategies so that users can address long-tail sample problems in their own business scenarios.

Service deployment: Provides a complete service deployment pipeline that supports multiple concurrent requests, with client support for Python, C++, HTML, Java, and C#.
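To put the latency figure quoted above in perspective, the short Python sketch below converts it into a real-time factor (processing time divided by audio duration). The helper name `real_time_factor` is hypothetical, and the Whisper-Large number is only what the "15 times" claim would imply if it refers to a latency ratio on the same 10-second clip; it is not a measured result.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Figures quoted in the article: 70 ms to process 10 seconds of audio.
SENSEVOICE_SMALL_LATENCY_MS = 70
AUDIO_SECONDS = 10.0

sensevoice_rtf = real_time_factor(SENSEVOICE_SMALL_LATENCY_MS / 1000, AUDIO_SECONDS)

# Assumption: "15 times" is read as a latency ratio on the same clip,
# so this is an implied value, not a benchmark.
implied_whisper_large_latency_ms = SENSEVOICE_SMALL_LATENCY_MS * 15

print(f"SenseVoice-Small real-time factor: {sensevoice_rtf:.3f}")
print(f"Implied Whisper-Large latency: {implied_whisper_large_latency_ms} ms")
```

Under these assumptions, the quoted figure corresponds to a real-time factor of roughly 0.007, meaning the model would process audio well over a hundred times faster than it plays back.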

Compared with open-source emotion recognition models, SenseVoice-Large achieves the best results on almost all datasets, and SenseVoice-Small also outperforms other open-source models on most datasets.


The CosyVoice model supports multilingual, timbre, and emotion control, and performs well at multilingual speech generation, zero-shot speech generation, cross-lingual voice cloning, and instruction following.

TapTechNews attached relevant links:

SenseVoice: https://github.com/FunAudioLLM/SenseVoice

CosyVoice: https://github.com/FunAudioLLM/CosyVoice
