Westlake Xinyu's Lingo Speech Model: A Breakthrough in China's Tech

TapTechNews August 24th news: Westlake Xinyu, a company backed by Jinke Tomcat, launched the Xinyu Lingo speech large model this August. It is the first end-to-end speech large model in China, and internal-testing reservations opened today (August 24th).


In the announcement released on August 21st, the company explained that, compared with traditional TTS, an end-to-end speech large model is a more comprehensive technology: it not only performs speech recognition but also integrates stages such as natural language processing, intent recognition, dialogue management, and speech synthesis, realizing a complete interaction loop from speech input to speech feedback and greatly enriching the depth and breadth of human-computer interaction.
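To make the contrast concrete, the sketch below compares a traditional cascaded pipeline with an end-to-end interface. All function names and stub logic are hypothetical placeholders for illustration, not Westlake Xinyu's actual components or API.

```python
# Conceptual sketch only; every function here is a hypothetical stand-in,
# not Westlake Xinyu's implementation.

def asr(audio: bytes) -> str:
    """Stub recognizer: speech in, plain text out (emotion, tone, ambient cues are lost here)."""
    return "what's the weather like"

def intent_recognition(text: str) -> str:
    """Stub intent classifier."""
    return "query_weather"

def dialogue_manager(intent: str) -> str:
    """Stub dialogue policy: intent in, reply text out."""
    return "It will be sunny today."

def tts(text: str) -> bytes:
    """Stub synthesizer: reply text in, waveform bytes out."""
    return b"\x00\x01" * 8000

def cascaded_pipeline(audio: bytes) -> bytes:
    """Traditional cascade: ASR -> NLP/intent -> dialogue management -> TTS.
    Each stage sees only the previous stage's text output."""
    text = asr(audio)
    intent = intent_recognition(text)
    reply = dialogue_manager(intent)
    return tts(reply)

def end_to_end_model(audio: bytes) -> bytes:
    """End-to-end alternative: a single model maps input speech directly to
    output speech, so paralinguistic features can shape the response.
    In a real system this would be one neural-network call; here it is a placeholder."""
    return b"\x00\x01" * 8000

if __name__ == "__main__":
    user_audio = b"\x00" * 32000  # stand-in for one second of 16 kHz, 16-bit PCM
    print(len(cascaded_pipeline(user_audio)), "bytes from the cascade")
    print(len(end_to_end_model(user_audio)), "bytes from the end-to-end model")
```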

According to the official press release cited by TapTechNews, the Xinyu Lingo speech model is the first model in China whose speech capabilities approach those of GPT-4o. Technically, it has three notable characteristics:

Native speech understanding: As an end-to-end model, Xinyu Lingo can not only recognize the textual content of speech but also accurately capture features such as emotion, tone, pitch, and even ambient sounds, helping the model understand speech more comprehensively and deliver a more natural and vivid interaction experience.

Multiple speech style expressions: Xinyu Lingo can adaptively adjust the speed, pitch, and intensity of its speech according to the context and user instructions, and can generate responses in various styles such as dialogue, singing, and cross-talk, effectively enhancing the model's flexibility and adaptability across application scenarios.

Speech modality super compression: Xinyu Lingo adopts a speech codec with a compression rate of hundreds of times, which compresses speech into an extremely short representation, helping the model generate high-quality speech while significantly reducing compute and storage costs.
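For a rough sense of what a compression rate of "hundreds of times" means, the back-of-the-envelope sketch below compares raw 16 kHz PCM audio with a hypothetical discrete codec emitting 50 tokens per second. The bitrates are illustrative assumptions, not published Xinyu Lingo parameters.

```python
# Back-of-the-envelope sketch; the bitrates below are assumed for illustration.

SAMPLE_RATE = 16_000   # Hz, a common rate for speech audio
BIT_DEPTH = 16         # bits per sample for raw PCM

raw_bitrate = SAMPLE_RATE * BIT_DEPTH  # 256,000 bits per second of raw audio

# Assume the codec emits 50 discrete tokens per second, each drawn from a
# 1024-entry codebook (10 bits per token) -- a hypothetical configuration.
tokens_per_second = 50
bits_per_token = 10
codec_bitrate = tokens_per_second * bits_per_token  # 500 bits per second

compression_ratio = raw_bitrate / codec_bitrate
print(f"raw PCM bitrate: {raw_bitrate} bit/s")
print(f"codec bitrate:   {codec_bitrate} bit/s")
print(f"compression:     {compression_ratio:.0f}x")  # ~512x, i.e. hundreds of times
```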
