Automated transcription faces significant challenges.
There are so many amazing applications for artificial intelligence in areas such as industrial automation, medicine and consumer devices. And yet, we still haven’t gotten the whole long-form voice transcription thing figured out.
Anyone who holds a lot of info-gathering calls knows the struggle: You end up with an MP3 treasure trove, but the nuggets of wisdom are locked inside. There are, of course, transcription services, but when there's more than one person on the recording, long blocks of text, and technical jargon, you know you'll spend a good amount of time guessing what belongs in those "[inaudible]" gaps when you get your document back. (Side note: Sorry to those transcriptionists whose weeks I ruined!)
You can also tell yourself you’ll transcribe it (“It’ll help me process the info!”). But 30 minutes in, you realize that you’d rather be at the DMV than keep trying.
Beyond calls and meetings, the ability to transcribe long blocks of speech could make a big impact by creating searchable databases of historical recordings and videos (as well as YouTube content).
Researchers are working on advanced speech recognition systems, such as Deep Speech, hoping to automate the capability and bring down the error rate. “We’ve made some very good progress in Deep Speech with state-of-the-art speech systems in English and Chinese,” says Carl Case, a research scientist on the Machine Learning team at Baidu. “But I still think there’s work to do to go from ‘works for some people in some contexts’ to actually just works the same way you and I can have this conversation, having never met, over a relatively noisy phone line and have no problem understanding one another.”
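For the curious: Deep Speech-style systems are trained end to end with connectionist temporal classification (CTC), where the acoustic model emits one label (or a "blank") per audio frame, and decoding collapses that stream into text. Below is a minimal sketch of the greedy collapse step only; the per-frame labels, the blank symbol, and the function name are illustrative assumptions, not Baidu's actual code.

```python
# Hedged sketch of greedy CTC decoding, the collapse step behind
# Deep Speech-style end-to-end speech recognizers. The acoustic
# model is assumed to have already produced one label per frame.

BLANK = "_"  # assumed blank symbol separating repeated characters

def ctc_greedy_collapse(frame_labels):
    """Collapse consecutive repeats, then drop blank symbols."""
    out = []
    prev = None
    for label in frame_labels:
        # Keep a label only when it differs from the previous frame
        # and is not the blank.
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return "".join(out)

# Example: per-frame argmax output "hh_e_ll_llo" collapses to "hello" --
# the blank between the two "l" runs is what preserves the double letter.
```

The blank symbol is what lets the model output genuinely repeated characters (as in "hello"): without a blank between them, consecutive identical labels are merged into one.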
One expert says we're still decades away, calling long-form transcription "a much higher artificial intelligence problem that has really not been solved yet." If this is the kind of news that speaks to you, read more about the progress here.