Hacking Chinese

A better way of learning Mandarin

How good is voice recognition for learning Chinese pronunciation?

In the previous article, we started exploring how speech recognition can be used to improve your Mandarin pronunciation. The main goal of that article was to investigate false negatives, in other words, cases where the speech recognition says something is wrong, but it’s actually correct.

Can speech recognition be used to learn Mandarin pronunciation?

The conclusion was that speech recognition was very good at identifying two-syllable words and sentences, but not as good when it came to single-syllable words. The takeaway for learners was that if your pronunciation is very good, the speech recognition on your phone will likely be able to identify what you say.

This article continues where the first one left off. If you haven’t read that article yet, I suggest that you do so before reading this one. I will try not to repeat too much from last time and will assume that you have read the first article. You can find it here:

Using speech recognition to improve Chinese pronunciation, part 1

How well does speech recognition handle non-native audio?

In this article, I will try to answer the following question:

If I say something and the voice recognition spits out exactly what I intended to say, does that really mean that my pronunciation is good, or could it be that the voice recognition is too lenient?

We will also look at the question the first article discussed, but now using non-native audio. That question was:

If I say something and the voice recognition spits out something else, does that really mean that my pronunciation is bad, or could it be that the voice recognition is wrong?

For more about the experiment setup and caveats regarding that, please see the first article. Just like last time, the results are split across monosyllabic words, disyllabic words and short phrases.

A) Monosyllabic words

The results of the first part, monosyllabic words, are presented in the table below. Each item is presented as follows:

  1. The number of the item
  2. The utterance in Pinyin with attached audio (click to play)
  3. My judgement: If correct, the intended word; if incorrect, problems are pointed out (T means tone, I initial, F final, X several issues at once; in the original pronunciation check, these were described and explained in detail, but here I have merely indicated what’s wrong)
  4. My score: 0 means “this is likely to be perceived as the wrong syllable” and 3 means “very likely to be perceived as the right syllable”.
  5. Google’s guess: Please note that the software can’t know which specific character the speaker is reading, so any character with the same pronunciation is considered correct.
  6. Google’s score: One point is earned for each correct identification, out of three possible.
  7. Apple’s guess: Please note that the software can’t know which specific character the speaker is reading, so any character with the same pronunciation is considered correct.
  8. Apple’s score: One point is earned for each correct identification, out of three possible.
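As a small illustration of how the speech recognition columns are scored, the homophone rule and the one-point-per-attempt tally can be sketched in a few lines of code. The pinyin lookup table below is a tiny hand-made stand-in (a real check could use a dictionary such as CC-CEDICT or the pypinyin library), and the function names are mine, not taken from any actual tool.

```python
# A sketch of the scoring used for the single-syllable items: a recognition
# attempt counts as correct if the returned character is ANY homophone of
# the target syllable, since the software can't know which character the
# speaker intended. The lookup table is a tiny hand-made stand-in.

PINYIN = {
    "走": "zou3", "奏": "zou3",  # homophones of zǒu (items A4/B4)
    "耳": "er3", "尔": "er3",    # homophones of ěr (items A5/B5)
    "治": "zhi4", "志": "zhi4",  # homophones of zhì (items A7/B7)
}

def attempt_correct(target_pinyin: str, recognised_char: str) -> bool:
    """True if the recognised character shares the target's pronunciation."""
    return PINYIN.get(recognised_char) == target_pinyin

def item_score(target_pinyin: str, attempts: list[str]) -> int:
    """One point per correct identification, out of three attempts."""
    return sum(attempt_correct(target_pinyin, ch) for ch in attempts)

# e.g. all three attempts for zǒu came back as homophones:
print(item_score("zou3", ["走", "奏", "走"]))  # 3
```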

I have analysed pronunciation from two students who both participated in my pronunciation course. This is only part of the material covered and my comments have been limited to fit the format of this article. I chose one female and one male student. To get more reliable results, more students would need to be included, but two should be good enough to get the discussion started.

Student A

Number Student Olle Score Google Score Apple Score
A1 3 3 3
A2 3 3 3
A3 T 1 2 1
A4 zǒu 3 3 3
A5 ěr 3 2 0
A6 3 2 3
A7 zhì 3 g* 0 0
A8 shā 2 0 0
A9 2 0 0
A10 péi X 0 0 0
Total 77% 50% 43%

*I tried many times but never managed to get an actual Mandarin syllable here.

Student B

Number Student Olle Score Google Score Apple Score
B1 3 1 0
B2 F 2 1 0
B3 X 2 0 0
B4 zǒu I 2 0 3
B5 ěr F 2 偶尔 0 偶尔 0
B6 3 0 2
B7 zhì F 0 0 0
B8 shā 3 0 0
B9 I 0 0 0
B10 péi X 1 0 1
Total 60% 7% 20%

Discussion

Average deviations across both students and both software providers:

  • Items where I gave a higher score: 65% (13/20)
  • Items where the speech recognition gave a higher score: 5% (1/20)
  • Items where both agreed about the score: 30% (6/20)
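For the curious, figures like these can be computed by pairing a human score with a machine score for each item and counting which side rated higher. Here is a minimal sketch; the function name is mine and the pairs are made-up examples, not the actual data set.

```python
# A sketch of how the deviation figures can be computed: pair the human
# score with the machine score for each item and count which side rated
# higher. The pairs below are made-up examples, not the actual data.

def deviation_stats(pairs: list[tuple[int, int]]) -> dict[str, float]:
    """pairs = (human_score, machine_score); returns the share of each outcome."""
    n = len(pairs)
    higher_human = sum(h > m for h, m in pairs)
    higher_machine = sum(m > h for h, m in pairs)
    return {
        "human higher": higher_human / n,
        "machine higher": higher_machine / n,
        "agreed": (n - higher_human - higher_machine) / n,
    }

print(deviation_stats([(3, 3), (3, 1), (2, 0), (0, 1)]))
# {'human higher': 0.5, 'machine higher': 0.25, 'agreed': 0.25}
```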

Looking at the results for these two students, we start seeing some interesting patterns. Naturally, a lot more data would need to be collected and analysed to draw any widely applicable conclusions, but the speech recognition seems to fail often for these single-syllable words unless the input audio is really good, both in terms of pronunciation and recording quality.

I consistently rate these students much higher than the speech recognition software does. In the case of student B, the difference between my appraisal and Google’s success rate is huge (60% correct vs. 7% correct). I suspect that audio quality plays a role here, though, as the recording quality for student B is not as good as for student A or the teacher audio used in the previous article. Background noise is pretty easy for a human to disregard, but I assume it’s much harder for a computer to do that!

What this means for you as a learner

The results here basically tell us that unless your pronunciation is already very clean, you can’t expect speech recognition to do a good job. It is very likely to judge you more harshly than you deserve. The conclusion for single-syllable words is that speech recognition software can tell you if you’re near-native, but unless you are, it’s not very useful and will misunderstand you more than a human would.

B) Disyllabic words

Two-syllable words are the backbone of Mandarin and something I often advise students to focus on when it comes to tones.

The columns are the same as above, but scoring works slightly differently. For each error that could cause a syllable to be perceived as another syllable, one point is deducted from a total of three.

For example, getting both tones wrong, but everything else right, would deduct two points (e.g. item A17), whereas getting only the initial wrong on one syllable would deduct one point (e.g. item A15). Again, T means tone, I initial, F final and X several errors at once. The number refers to which syllable the error is on.

For the speech recognition columns, a majority vote is used (out of three attempts) and the same scoring as described above is then applied. For example, item A13 was identified two out of three times as 水鸡 on iOS, which is wrong both because the tone on the first syllable should be rising rather than low, and because the initial should be s and not sh (two points deducted).
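This scoring procedure can be sketched in a few lines of code. The function names are my own, and treating each listed error code as exactly one deduction is an assumption on my part, not a rule stated anywhere in the software.

```python
from collections import Counter

# A sketch of the disyllabic scoring: take the majority transcription over
# the three attempts, then deduct one point from 3 for each error that
# could shift a syllable to another syllable. Error codes follow the
# article (T = tone, I = initial, F = final); counting each listed code as
# exactly one deduction is my assumption.

def majority_guess(attempts: list[str]) -> str:
    """Return the most common transcription among the attempts."""
    return Counter(attempts).most_common(1)[0][0]

def score(errors: list[str]) -> int:
    """Deduct one point per error from a maximum of 3, never below 0."""
    return max(0, 3 - len(errors))

# e.g. item A13 on iOS: 水鸡 wins the majority vote and carries two errors
# (tone on syllable 1, initial sh instead of s), so the score is 1.
guess = majority_guess(["水鸡", "水机", "水鸡"])
print(guess, score(["1:T", "1:I"]))  # 水鸡 1
```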

Student A

Number Student Olle Correct Google Correct Apple Correct
A11 nǚrén 1:T,2:T 1 你人 2 女人 3
A12 ěrduo 1:T 2 耳朵 3 耳朵 3
A13 suíjī 1:T 2 水机 1 水鸡 1
A14  xiàngxià 2:F 2 向上 1 向下 3
A15 pínqióng 2:F 2 贫穷 3 贫穷 3
A16 lǎoshī 老师 3 老师 3 老师 3
A17 qīngchu 1:I, 2:T 1 请出 0 秦琼 0
A18 liǎojiě 1:T 2 了解 3 了解 3
A19 rùnzé 2:T 2 日语词 0 引子 0
A20  bózi 1:T 2 波子 2 墨子 1
Total 63% 60% 67%

Student B

Number Student Olle Correct Google Correct Apple Correct
B11 nǚrén 1:F, 2T 1 女人 3 牛人 1
B12 ěrduo 1:T 2 儿子 0 儿歌 0
B13 suíjī 1:T 2 水机 0 水晶 0
B14  xiàngxià 向下 3 向下 3 向向 2
B15 pínqióng 贫穷 3 凭祥 0 贫穷 3
B16 lǎoshī 1:T, 2:F 1 狼蛇 0 狼神 0
B17 qīngchu 1:F, 2:T 1 青丘 0 清楚 3
B18 liǎojiě 2:I 2 表姐 2 了解 3
B19 rùnzé X 0 本子 0 卵子 0
B20  bózi 1:T, 2T 1 儿子 1 百色 0
Total 53% 30% 40%

Average deviations across both students and both software providers:

  • Items where I gave a higher score: 50% (10/20)
  • Items where the speech recognition gave a higher score: 35% (7/20)
  • Items where both agreed about the score: 15% (3/20)

Here we can see things evening out a bit, with my assessment being about the same as speech recognition for student A and just a bit higher for student B (not the order of magnitude difference we saw for single syllables).

One thing that probably influences the results quite a bit here, but which is very hard to control for, is whether there is a real word close to what the student actually says. When I listen to a student, I don’t have to guess at a word, I just write down what I hear, but the way speech recognition on smartphones works, it will always guess at something.

For example, in the case of A13 and B13, I hear nothing that indicates that they are saying sh rather than s, yet both iOS and Android hear sh. Or do they? Probably not; it’s just that there is no two-syllable word with that tone contour (the only common character with a low tone on the syllable sui is 髓, but that doesn’t make sense with a jī coming after it).

What this means for you as a learner

Using two-syllable words for pronunciation practice works a lot better than using single-syllable words (that’s true when you practise with humans as well). It doesn’t seem like the speech recognition is too lenient here, rather the opposite, i.e. small imperfections in pronunciation can throw it off completely, making it guess a different word far from what you said. So, you can probably use two-syllable words for practice, but treat the result only as a binary indicator (right or wrong), not as feedback on what you said incorrectly or how serious it is.

C) Phrases

Now we’re approaching the home territory where voice recognition software ought to be good at guessing. It’s not normal for people to ambush their phones by suddenly saying strange things like “glossy”, but it is normal to dictate a sentence or ask a question. Let’s see if it works as well as it ought to!

Scoring here works the same way as for the disyllabic words above, i.e. each error that could shift a syllable to a different meaning deducts one point. The maximum for each item is still 3.

Student A

A21 Tā shìbushì Wáng lǎoshī. Score
Olle Slight rise on second 是. 3
Google 他是不是王老师 3
Apple 他是不是王老师 3
A22 Máfan nǐ bǎ yán dì gěi wǒ. Score
Olle Wrong tones on 你, 盐, 递. 0
Google 麻烦您把验的给我 1
Apple 麻烦您把硬币给我 0
A23 Wǒ yídìng yào qù Měiguó. Score
Olle Wrong tones on 一, 美; initial+final on 去. 0
Google 我一定要去美国 3
Apple 我一定要去美国 3
A24 Qǐngwèn, wǒ kěyǐ jìnlai ma? Score
Olle Wrong tones on 进, 来. 1
Google 请问我可以尽卖吗 0
Apple 请问我可以进来吗 3
A25  Fángzū yígòng shì yìqiān wǔbǎi yīshí yuán. Score
Olle Wrong tone on 一十. 2
Google 房租一共是1510元 3
Apple  房租一共是1510元 3
  • Olle’s score: 40%
  • Google’s score: 66%
  • Apple’s score: 80%

Student B

B21 Tā shìbushì Wáng lǎoshī. Score
Olle Wrong tone on 王. 2
Google 他是不是王老师 3
Apple 嗯是不是王宝斯 0
B22 Máfan nǐ bǎ yán dì gěi wǒ. Score
Olle Wrong tones on 盐, wrong final on 我. 1
Google 麻烦你把盐递给我 3
Apple 麻烦你把颜宁几点我 0
B23 Wǒ yídìng yào qù Měiguó. Score
Olle Unclear tones on 要 and 去, wrong tone on 美, and wrong final on 去. 0
Google 我一定要去美国 3
Apple 我一定要娶你回国 0
B24 Qǐngwèn, wǒ kěyǐ jìnlai ma? Score
Olle Wrong tone on 进. 2
Google 请问啊可以循环吗 0
Apple 请问我可以去玩嘛 0
B25 Fángzū yígòng shì yìqiān wǔbǎi yīshí yuán. Score
Olle Wrong tones on 共 and 千. 1
Google 房租一共是1550元 2
Apple 房租一中学1510元 0
  • Olle’s score: 40%
  • Google’s score: 73%
  • Apple’s score: 0%

Discussion

Here we see the first major deviation between the two speech recognition providers. For student B, Google gave a score of 73%, compared with my rating of 40%, almost twice as high. But for the same student, Apple gave a score of 0%! My guess is that this is mostly because of the low audio quality and that the data from student B should probably be disregarded because of this.

What we can learn from this as students is that at least if you’re on an Apple phone, you need a good recording environment! That advice is of course applicable to any student, but here it probably made up most of the difference between Apple’s 0% score and Google’s much higher one.

For student A with higher audio quality, though, the result is quite expected: speech recognition is considerably more lenient than I am. This is again because of context and the fact that there are only so many sentences that make sense (even though some of the suggested sentences don’t make much sense either).

If anyone takes this up as a serious research project, audio quality should of course be kept constant to confirm whether my hypothesis here is correct.

What this means for you as a learner

If we disregard the items with low audio quality, speech recognition lets you believe that your pronunciation is better than it actually is. The scores from both Google and Apple are about 50% to 100% higher than my manual assessment. This ties in well with what I said in the introduction to the first article: speech recognition is not meant to give you a fair assessment of your pronunciation; it’s designed to understand what you want to say. The more clues it gets, the better it will guess.

General conclusion

In these two articles, I’ve tried to answer the question of whether speech recognition can be used to check your Mandarin pronunciation. As I have reminded readers throughout both articles, the results here can only be tentative, so take these notes with a pinch of salt.

  1. Speech recognition is next to useless for single-syllable words, unless you just want to verify that your pronunciation is native-like (and it still might not work; see the results for dipping third tones). Getting single syllables right means that your pronunciation is very good, but don’t be discouraged if your phone doesn’t understand you; it might not be your fault!
  2. Speech recognition works better for two-syllable words, which is nice since you should probably practise those anyway. Here, it’s certainly possible to get it 100% right if your pronunciation is good enough, but don’t be too discouraged if your phone thinks you’re saying something completely different, because even a small mistake can throw it off. However, if you say it right, it will probably recognise what you say.
  3. Speech recognition is probably too lenient for sentences. Provided that the sentence is fairly common, you need to make several errors at once to derail the speech recognition algorithm. You cannot assume that your pronunciation is good just because your phone writes out the right sentence, but you can be fairly sure your pronunciation is not good if it doesn’t understand you at all.

That’s it for now! I would be happy to hear about your experience with speech recognition. Perhaps you can try the words and phrases used in this article and see if your phone can transcribe them correctly? Please share the results in the comments!

It will be interesting to follow how this develops in the future. As speech recognition becomes even better at handling non-native accents, its usefulness for checking pronunciation should decrease, since the software will become better at understanding incorrectly pronounced sentences. This is probably true for words as well, but since there’s so little context to use there, I doubt the situation will change quickly in that area.
