An estimated 1.2 billion people, roughly 16 percent of the world’s population, speak some form of Chinese. Many developers would like to tap into that market without necessarily having to translate their applications into Chinese.
Baidu’s four speech APIs address Long Utterance Speech Recognition, Far-Field Speech Recognition, Expressive Speech Synthesis and Wake Word. Long Utterance Speech Recognition transcribes long audio clips such as interviews, speeches and lectures. Far-Field Speech Recognition recognizes speech from audio sources up to 16 feet away. Expressive Speech Synthesis provides a collection of realistic voices that can be used to read text aloud. Wake Word lets developers create customized short words or phrases that can be spoken to turn devices on.
Sanjeev Satheesh, a research scientist at Baidu who specializes in machine learning, says Baidu is trying to drive globalization of applications by making speech APIs available along with other services such as facial recognition, optical character recognition and natural language processing.
“We’re not just talking about speech to text,” says Satheesh.
To drive that effort, Baidu has opted, in contrast to other API providers, to make the four speech APIs available to developers for free, Satheesh says. Longer term, he says, Baidu envisions speech replacing the traditional keyboard as the primary interface for accessing applications. In fact, Satheesh says Baidu expects natural language APIs to be used in concert with speech APIs and with language translation services enabled by a deep learning framework it developed to translate audio. Earlier this year, Baidu made that deep learning framework, dubbed PaddlePaddle, available as an open source project.