The New York Times's John Markoff says following the evolution of speech-recognition technology since the early 1980s “feels like watching paint dry.”
But David Nahamoo, IBM’s CTO for speech, takes issue with that assessment.
“I don’t believe it has been developing slowly,” Nahamoo says. “We’re trying to develop machines that expose intelligence by transforming sound into text that carries meaning, which is a very complicated problem we’re trying to solve. … We have been solving different aspects of the problem, but it’s not a simple problem.”
Some observers, though, now say the push to add speech recognition to mobile devices will drive improvements in the technology through 2009 and beyond.
Markoff sees a battle for control of the mobile voice search market by Google, Yahoo, Microsoft and others that’s tied to ad revenue from local businesses.
But Nahamoo foresees improvements in business technology as well as uses beyond mobile search.
What Did You Say?
Google’s voice search app for iPhone generated laughter in Britain last fall when users learned they had to essentially talk like a Texan to get it to work. Markoff notes that neither Yahoo, using IBM partner Vlingo’s technology, nor Microsoft has perfected it, either.
That only illustrates the problems, Nahamoo says.
“English is one language but is spoken at least 100 different ways in each country where you go,” he says. “Accent and even dialect, new words, maybe even the phraseology changes, such as between UK and United States. All those things affect building the models. For example, I can’t build a good speech-recognition system based on recorded Northeastern U.S. accents and expect it will do a good job for people from India who speak English. I have to record data on those people as well.”
Even when repeating the same sentence, people will vary the tempo or stressed words. Speech can be affected by emotion or even by different ways of holding the mouth. And with mobile devices, you have to add in background noise in the car, which will be different from that of walking down the sidewalk or entering a busy train station. Speech on the telephone differs from speech on television, and on and on.
IBM has been working on such systems for 100 or so popular languages with the idea that eventually these machines would be better at processing language than any multilingual human.
But Nahamoo concedes there’s plenty of work yet to be done. He says it takes three components: ideas, meaning the algorithms behind the models; computing power to sift through massive databases of sounds; and lots and lots of recorded speech samples. He says computing power is no longer the issue, pointing instead to algorithms and recorded data.
Nahamoo says 1,000 hours of recorded speech is considered the minimum database required for building a respectable system. And Markoff notes that Google, in a technical paper on building large models for machine translation, wrote that the system used 2 trillion “tokens,” or words.
Still, the technology’s problem over the years, as Jamie Bertasi, senior vice president of the Business Solutions division of Tellme, points out, has been that it doesn’t really work that well and adoption rates have lagged.
For a Self-Service Menu …
Bertasi sees Microsoft’s acquisition of Tellme providing a boost to Tellme’s widely used call-center technology. But the real driver for adoption will be simple economics. Tellme fields more than 2 billion calls a year for clients such as American Express, American Airlines and Domino’s Pizza. AT&T and other carriers use Tellme technology for automated directory assistance.
“Every single customer we talk to is under pressure to cut costs and one of their biggest cost items in their contact centers are their agents. … Using self-service technology can really drive down those costs. … At the same time, they’re trying to differentiate themselves through better service. Speech-based self-service can help them achieve both those goals,” she says.
And as companies look to update their systems, she says, they find value in on-demand services such as the one Tellme offers because the system is continually being improved, while an on-premise system’s speech recognition stays just as it was right out of the box.
Tellme’s system uses human editors to weigh in on the accuracy of the automated response. That feedback is put back into the system for what Bertasi calls continual “tuning.” With its linkup with Microsoft, she says, Tellme’s system needs only about 3 seconds of speech to adjust to that voice.
Though long the province of automated call centers, speech technology is venturing out – in car navigation systems, hands-free dialing, speech-enabled ATMs, kiosks and other devices. IBM also is researching ways for people to interact with e-commerce sites through speech. And Microsoft added a speech feature to Windows Vista, integrating it with word processing and e-mail applications.
In the future, Nahamoo sees broader use of speech technology in business intelligence and analytics.
“Any conversation with an agent would be transcribed to text and a lot of text analytics would be applied to it to learn ‘Was the agent doing his job properly?’ ‘Was the customer happy?’ ‘Are there things we can learn about the products or services we are selling?’” he says. “There is a gold mine of information in the voice of the customer and more of it is going to be analyzed as we go forward.”
Companies will use it to analyze broadcast news that mentions the company. Any voice recording – court reporting, medical transcription, insurance records – will be automated.
That’s already happening. Adobe recently added voice-recognition technology that enables its Creative Suite software to generate transcripts of video and audio recordings.
“The more accurate these things become, the more value we can get out of the text and adoption, the usage, will grow,” he says.
Call-center IVR (interactive voice response) systems will be moving to smaller and smaller devices as part of the push to mobile, he says.
And as hackers increasingly can figure out your mother’s maiden name and other security questions, he foresees using a voice signature as part of a biometric security system for transactions.