TapTechNews reported on July 9 that Alibaba Cloud's Tongyi Qianwen has open-sourced two speech foundation models: SenseVoice, for speech recognition, and CosyVoice, for speech generation.
SenseVoice focuses on high-accuracy multilingual speech recognition, emotion recognition, and audio event detection. Its main features:
Multilingual recognition: trained on more than 400,000 hours of audio, it supports over 50 languages and outperforms the Whisper model in recognition accuracy.
Rich transcription: its emotion recognition matches or exceeds the best current emotion-recognition models on test data, and it detects sound events common in human-computer interaction, such as music, applause, laughter, crying, coughing, and sneezing.
Efficient inference: SenseVoice-Small uses a non-autoregressive end-to-end architecture with very low latency, transcribing 10 seconds of audio in just 70 ms, more than 15 times faster than Whisper-Large.
Fine-tuning customization: convenient fine-tuning scripts and strategies make it easy to fix long-tail recognition problems in specific business scenarios.
Service deployment: a complete deployment pipeline supports multiple concurrent requests, with clients available for Python, C++, HTML, Java, and C#.
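The feature list above maps to a fairly compact inference API. Below is a hedged sketch of running SenseVoice-Small through FunASR's AutoModel interface, following the usage pattern shown in the project's README; the ModelScope model ID, device string, and audio path are assumptions that depend on your local setup, and the funasr package and model weights must be installed and downloaded separately, so nothing runs at import time.

```python
def transcribe_file(audio_path, device="cpu"):
    """Sketch of SenseVoice-Small inference via FunASR's AutoModel.

    Follows the usage pattern in the SenseVoice README; requires
    `funasr` to be installed and downloads model weights on first
    use, so all heavyweight work is deferred into this function.
    """
    # Deferred imports: funasr is an optional, separately installed dependency.
    from funasr import AutoModel
    from funasr.utils.postprocess_utils import rich_transcription_postprocess

    model = AutoModel(
        model="iic/SenseVoiceSmall",  # ModelScope model ID used in the README
        trust_remote_code=True,
        device=device,                # e.g. "cuda:0" if a GPU is available
    )
    res = model.generate(
        input=audio_path,
        language="auto",   # auto-detect among the supported languages
        use_itn=True,      # inverse text normalization (numbers, punctuation)
    )
    # The raw output carries emotion/event tags; this helper renders plain text.
    return rich_transcription_postprocess(res[0]["text"])

# Usage (requires funasr plus a downloaded checkpoint):
# print(transcribe_file("example.wav"))
```

The deferred imports keep the sketch readable even without the dependencies installed; in a real service you would load the model once and reuse it across requests.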
Compared with open-source emotion recognition models, SenseVoice-Large achieves the best results on almost every dataset, and SenseVoice-Small also outperforms other open-source models on most datasets.
CosyVoice supports control over language, timbre, and emotion, and performs strongly in multilingual speech synthesis, zero-shot voice generation, cross-language voice cloning, and instruction following.
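Zero-shot cloning in CosyVoice is likewise exposed through a small Python API. The sketch below follows the zero-shot example published in the CosyVoice repository; the checkpoint directory, prompt clip, and sample rates are assumptions tied to the 300M pretrained model, and the package plus weights must exist locally for the function to actually run.

```python
def clone_voice(text, prompt_text, prompt_wav, out_path="cloned.wav"):
    """Sketch of zero-shot voice cloning with CosyVoice.

    Mirrors the repository's zero-shot example: a short reference clip
    plus its transcript conditions the timbre used to speak `text`.
    Requires the CosyVoice package and a downloaded checkpoint.
    """
    # Deferred imports: CosyVoice and torchaudio are separate installs.
    import torchaudio
    from cosyvoice.cli.cosyvoice import CosyVoice
    from cosyvoice.utils.file_utils import load_wav

    model = CosyVoice("pretrained_models/CosyVoice-300M")  # placeholder path
    prompt_speech_16k = load_wav(prompt_wav, 16000)  # reference clip at 16 kHz
    for i, out in enumerate(
        model.inference_zero_shot(text, prompt_text, prompt_speech_16k)
    ):
        # Output is yielded in chunks; save each at the model's 22.05 kHz rate.
        torchaudio.save(out_path.replace(".wav", f"_{i}.wav"),
                        out["tts_speech"], 22050)

# Usage (requires CosyVoice, a checkpoint, and a reference clip):
# clone_voice("Hello there.", "transcript of the prompt clip", "prompt.wav")
```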
TapTechNews has attached the relevant links:
SenseVoice: https://github.com/FunAudioLLM/SenseVoice
CosyVoice: https://github.com/FunAudioLLM/CosyVoice