Azure Neural Text-to-Speech (Neural TTS) is a powerful AIGC (AI Generated Content) service that allows users to turn text into lifelike speech. It has been applied to a wide range of scenarios, including voice assistants, content read-aloud capabilities, and accessibility uses. During the past months, Azure Neural TTS has achieved parity with natural human recordings (see details) and has been extended to support more than 140 languages and variants (see details). These highly natural voices are available in the cloud, and on premises through containers.

At the same time, we have received many customer requests to support Neural TTS on devices, especially for scenarios where devices have no network availability or an unstable network, and for scenarios that require extremely low latency or have privacy constraints. For example, users of screen readers for accessibility (such as the speech feature on Windows) are asking for a better voice experience through higher on-device TTS quality. Automobile manufacturers are requesting features to enable voice assistants in cars even when disconnected.

To address the need for high-quality embedded TTS, we developed a new-generation device neural TTS technology that significantly improves embedded TTS quality compared to traditional on-device TTS voices, e.g., those based on the legacy SPS (Statistical Parametric Speech Synthesis) technology. Thanks to this new technology, natural on-device voices have been released in Microsoft's flagship products such as Windows 11 Narrator, and are now available in Speech services for Azure customers.

A set of natural on-device voices recently became available with Narrator on Windows 11. Check the video below to hear how natural the new voices sound and how much better they are than the old-generation embedded voices.

[Video: Seamless switch between cloud TTS and device TTS with Azure device neural TTS technology]

This new generation of on-device neural TTS has three key advances: high quality, high efficiency, and high responsiveness.

Traditional on-device TTS voices are built with the legacy SPS technology, and their quality is significantly lower than cloud-based TTS, typically with a MOS (Mean Opinion Score) gap higher than 0.5. Now, with the new device neural TTS technology, we have closed the gap between device TTS and cloud TTS. Our MOS and CMOS (Comparative Mean Opinion Score) tests have shown that the device neural TTS voice quality is very close to the cloud TTS. Check the table below for a comparison of voice naturalness, output support and features available among traditional device TTS (SPS), embedded neural TTS and cloud neural TTS. Here 'traditional device TTS' is the device SPS technology we shipped on Windows 10 and previous Windows versions, which is also the major technology used for embedded TTS in the industry today.

[Table: comparison of voice naturalness, output support and features, including the MOS gap (on-device neural TTS as the base)]

As you can tell from the above comparison, with the new technology the naturalness of device neural TTS voices has reached near parity with the cloud version. Hear how close they sound with the samples below.

Deploying neural network models to IoT devices is a big challenge both for AI researchers and for many industries today. For device TTS scenarios and customers, the challenge is even bigger due to lower-end devices and the lower CPU usage reservation in the system, according to our customers' experience. So, we had to create a highly efficient solution for our device neural TTS.

Below are the metrics and the score card for our device neural TTS system. RTF, or Real-Time Factor, measures the time in seconds needed to generate 1 second of audio. 820A is a CPU that is broadly used in car systems today. It is a typical platform for device TTS to run on, and one that most customers can adopt, so we use this CPU as our platform for measurement. NPU, or Neural Processing Unit, is one of the critical components in the chip, especially for AI-related processing. It can accelerate neural network inference efficiently without increasing general CPU usage. Recently, more and more IoT device makers, such as car manufacturers, are using NPUs to accelerate their systems.

Overall, the system's efficiency is close to that of some traditional device TTS systems and can meet almost all customers' efficiency requirements.

High synthesis speed and low latency are critical factors affecting the user experience of a text-to-speech system. To ensure a highly responsive system, we designed the device neural TTS to synthesize in a streaming mode, which means that the latency is independent of the length of the input sentence. This allows for consistently low latency and a highly responsive experience when synthesizing.
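The RTF metric used in the score card is a simple ratio. As a minimal illustrative sketch (not product code; the function name is ours), an RTF below 1.0 means the engine produces audio faster than it takes to play it back:

```python
# Hypothetical sketch: computing the Real-Time Factor (RTF) of a TTS run.
# RTF = time spent synthesizing (seconds) / duration of generated audio (seconds).
# RTF < 1.0 means the engine synthesizes faster than real time.

def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return the RTF for one synthesis run."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds

# Example: 0.5 s of compute to produce 2.0 s of audio.
print(real_time_factor(0.5, 2.0))  # 0.25
```

For instance, an RTF of 0.25 means one CPU core keeps up with playback with three quarters of its time to spare.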
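The streaming mode described earlier can be sketched with a generator: audio is yielded chunk by chunk, so time-to-first-audio depends only on the first chunk, not on the total input length. This is an illustrative mock (the chunking scheme and names are our assumptions, not the actual engine):

```python
# Illustrative mock of streaming vs. batch synthesis. Real synthesis would
# produce PCM audio; here each text slice stands in for a synthesized chunk.
from typing import Iterator

CHUNK_CHARS = 10  # hypothetical amount of text synthesized per chunk

def synthesize_streaming(text: str) -> Iterator[bytes]:
    """Yield one (fake) audio chunk per slice of text as soon as it is ready,
    so the caller can start playback after the first chunk."""
    for start in range(0, len(text), CHUNK_CHARS):
        yield text[start:start + CHUNK_CHARS].encode("utf-8")

def synthesize_batch(text: str) -> bytes:
    """Non-streaming: nothing is available until the whole utterance is done."""
    return b"".join(synthesize_streaming(text))

# Streaming: playback can begin as soon as the first chunk arrives,
# regardless of how long the full sentence is.
first_chunk = next(synthesize_streaming("A fairly long input sentence for TTS."))
```

The point of the sketch is the interface shape: a generator lets latency track the first chunk only, which is why streaming keeps latency independent of sentence length.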