"The first [step] takes my words and finds the Chinese equivalents, and while non-trivial, this is the easy part," Rashid wrote. "The second reorders the words to be appropriate for Chinese, an important step for correct translation between languages. Of course, there are still likely to be errors in both the English text and the translation into Chinese, and the results can sometimes be humorous. Still, the technology has developed to be quite useful."
For the final, text-to-speech leg of the process, Microsoft had to record a few hours of a native Chinese speaker's speech, and around an hour of Rashid's own voice.
All common speech recognition and automatic translation systems are based on a statistical technique known as hidden Markov modelling, which has an error rate of 20 to 25%. According to Rashid, the new DNN technique reduces that rate to about 14 to 18%.
"This means that rather than having one word in four or five incorrect, now the error rate is one word in seven or eight," he wrote. "While still far from perfect, this is the most dramatic change in accuracy since the introduction of hidden Markov modelling in 1979, and as we add more data to the training we believe that we will get even better results."
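As a rough check on these figures, an error rate can be converted into "one word in N" by taking its reciprocal. The short sketch below assumes the quoted percentages are per-word error rates; the function name `words_per_error` is purely illustrative and does not come from Microsoft's work.

```python
def words_per_error(error_rate_percent: float) -> float:
    """Return N such that, on average, one word in N is incorrect.

    Assumes the rate is the percentage of words that are wrong.
    """
    if not 0 < error_rate_percent <= 100:
        raise ValueError("error rate must be in the range (0, 100]")
    return 100.0 / error_rate_percent


if __name__ == "__main__":
    # Ranges as quoted in the article: (label, low %, high %)
    quoted_ranges = [
        ("hidden Markov modelling", 20, 25),
        ("DNN", 14, 18),
    ]
    for label, low, high in quoted_ranges:
        # A higher error rate means fewer words between mistakes.
        worst = words_per_error(high)
        best = words_per_error(low)
        print(f"{label}: one word in {worst:.1f} to {best:.1f}")
```

Under that assumption, 20 to 25% works out to one word in four or five, and a 14% rate to roughly one word in seven, the better end of the quoted DNN range.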
"The results are still not perfect, and there is still much work to be done, but the technology is very promising, and we hope that in a few years we will have systems that can completely break down language barriers," Rashid added.