Research conducted by Stanford, the University of Washington and Chinese web company Baidu presents results on the current state of Baidu’s Deep Speech 2 (DS2) speech recognition performance.
The test was run on short sentences, which suggests the results apply mainly to brief interactions. DS2 is compared with the state-of-the-art Apple iOS keyboard. Two metrics are discussed: input rate and error rate.
Regarding input rate, DS2 was found to be three times faster than the keyboard for English, and 2.8 times faster for Mandarin Chinese (typed through pinyin). In absolute terms, the average input rate for English was 53.46 wpm (words per minute) on the keyboard versus 161.2 wpm through speech, before corrections. For Mandarin Chinese, it was 38.78 wpm versus 108.43 wpm, with speech again coming out ahead, before corrections.
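The speed-up factors follow directly from the quoted averages. As a quick check, here is a minimal Python sketch that divides the speech rate by the keyboard rate; the dictionary layout and the `speedup` helper are illustrative names, not part of the study.

```python
# Average input rates in words per minute, as quoted above (before corrections).
rates_wpm = {
    "English": {"keyboard": 53.46, "speech": 161.2},
    "Mandarin Chinese": {"keyboard": 38.78, "speech": 108.43},
}


def speedup(keyboard_wpm: float, speech_wpm: float) -> float:
    """Return how many times faster speech input is than keyboard input."""
    return speech_wpm / keyboard_wpm


# Hypothetical helper structure: one speed-up ratio per language.
speedups = {
    lang: speedup(r["keyboard"], r["speech"]) for lang, r in rates_wpm.items()
}

for lang, ratio in speedups.items():
    print(f"{lang}: speech is about {ratio:.1f}x the keyboard input rate")
```

Rounded to one decimal place, the ratios come out at roughly 3.0 for English and 2.8 for Mandarin Chinese, matching the figures reported above.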
Regarding errors, DS2 also outperformed the keyboard in every scenario. In English, speech had an error rate of 2.93% versus 3.68% for the keyboard. In Chinese the gap was wider, 7.51% versus 20.54%. These percentages are relative to all processed input, which includes corrective commands (mostly saying “backspace” or hitting or swiping the backspace key). The high error rate for Mandarin Chinese keyboard input is noteworthy, and is likely a signal of a more pressing need in the Asian market.
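One way to compare the two gaps is to express each as a relative reduction in error rate. The article does not report this derived figure; the sketch below computes it from the quoted percentages only, and the `relative_reduction` helper is a hypothetical name used for illustration.

```python
# Error rates in percent, as quoted above (speech versus keyboard).
error_rates = {
    "English": {"speech": 2.93, "keyboard": 3.68},
    "Mandarin Chinese": {"speech": 7.51, "keyboard": 20.54},
}


def relative_reduction(keyboard_rate: float, speech_rate: float) -> float:
    """Fraction by which the speech error rate is lower than the keyboard one."""
    return (keyboard_rate - speech_rate) / keyboard_rate


# Derived metric (an assumption of this sketch, not a figure from the study).
reductions = {
    lang: relative_reduction(r["keyboard"], r["speech"])
    for lang, r in error_rates.items()
}

for lang, frac in reductions.items():
    print(f"{lang}: speech error rate is about {frac:.0%} lower than keyboard")
```

On these numbers the reduction is about a fifth for English and a little under two thirds for Mandarin Chinese, which is consistent with the wider gap described above.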
A caveat about the error rates points to another advantage of speech recognition. The variance of error rates in keyboard input is higher, as it depends on the keyboard skill of the human test subject. With speech recognition, apart from learning a few keywords, the learning curve is practically flat.
Looking forward, the question is how long it will be until speech recognition is a task better left to machines than to humans. Learning innovator Andrew Ng seems ready to answer:
“A significant shift from typing to speech might be imminent and impactful”.
Future research will involve optimizing interface design so that the gains in input rate observed under research conditions carry over to everyday use.