Learning Chinese can sometimes lack structure and feel confusing, especially if you study on your own. There are few reliable reference points, and it’s easy to understand why many turn to standardised tests, not just for assessment, but for guidance as to what to study and when.
HSK (Hànyǔ Shuǐpíng Kǎoshì) is by far the most well-known such test, and there are many textbooks, courses and learning resources specifically geared towards taking students through levels of increasing difficulty. It’s not uncommon to hear about students who say that they’re “working their way through HSK3” and similar.
While I think using a proficiency test to guide your learning and as your main source of new vocabulary is a bit backward, I also understand why people do so, especially if they need the certificate to apply for a scholarship or a job that requires Chinese. If you care about your grades, your decisions should not only be guided by what makes sense from a language learning perspective.
So, we have a large number of students that for some reason focus heavily on HSK study materials in general and HSK word lists in particular. This raises an interesting question: If you focus on HSK, what other things would you miss? Or, more specifically, if you learn words mostly from the HSK lists, what common words would you miss?
This article will provide an answer to that question. If you’re just interested in checking out the words, you can click here to skip to the word lists at the end of the article. If you’re learning Chinese in Taiwan and are more interested in the TOCFL test, check this follow-up article about that very topic: What important words are missing from TOCFL?
For those of you who want to know a little bit more, I’ll go through the process in more detail before we get to the actual words.
What important words are missing from HSK?
It should be clear that HSK is not meant to be a representation of the most commonly used Chinese words. This is very obvious in the lower levels, where words like “train station” and “bus” are part of HSK1, which has only 150 words in total. Those words are nowhere near the top 150 words in Chinese in general, but they are of course important for foreigners visiting and travelling in China, which probably is why they are included.
Overall, I think the lower levels of HSK match the needs of foreign students quite well. I have spent dozens of hours poring over these lists when creating the sentence pack for my beginner course Unlocking Chinese, and in general, there aren’t that many weird decisions about which words to include.
In other words, the purpose of this article is not to complain about HSK, but rather to highlight some very common words that were left out in favour of other words. Most of them were left out for good reasons, but this doesn’t mean that you shouldn’t learn these words!
The biggest problem when discussing words in Chinese is that there is no clear definition of what a word actually is. Since there’s no spacing between words, figuring out what is a word and what isn’t is hard. 你 is a word, but is 你好 a word? Most dictionaries say no. What about 你们? Or if you think 你好 is a word, what about 你们好? What about 老师好?
I think you’ll agree that 你 is a word and that 老师好 is not a word, but where to draw the line is not obvious, especially if you have to rely on an automated method (needed to deal with databases with millions and millions of characters).
The question of wordhood in Chinese is complex, and something I can’t go into in this article, but the bottom line is that different methods of separating Chinese text into words (segmentation) will yield different results.
This means that it’s hard to compare a word frequency list to the HSK list directly, simply because they have different standards for what a word is. If you just check for things that appear in a frequency list, but not in HSK, many of the results you get will be things that are actually not words, such as 那个 and 出来.
What does “common” mean, anyway?
The next problem is what frequency list to choose. How do you decide what a “common” word is? There are many frequency lists, of course, but most are based on written Chinese, which is much more formal than the language most students encounter. If we compared one of these lists with HSK to see how they differed, the result is easy to predict: characters and words used in formal, written Chinese would appear high on the frequency list, but low, if at all, in HSK. That would be neither helpful nor interesting.
Instead, I chose to look at word frequencies from the SUBTLEX-CH corpus (Cai and Brysbaert, 2010), which consists of Chinese subtitles from movies and TV series. This is still not naturally spoken Chinese, but it’s a lot closer to that than books and newspapers are. For a thorough look at resources for word, character and component frequencies in Chinese, please refer to this article:
At first, I thought that the fact that the corpus includes foreign movies and TV series translated into Chinese would be a big disadvantage, but the more I worked on this project, the more I realised that it is actually a potential advantage.
Many of the words that are common in Chinese subtitles but missing from the HSK lists refer to things that are not Chinese, such as “baseball” and “jury”. If you’re a foreigner (why else would you study for HSK?), learning such words is useful, not because they have a natural place in China, but because they do in your home country, and you might want to talk about them in Chinese, especially if you aren’t living in China.
Plugging gaps in your Chinese vocabulary
Next, the goal is to identify holes in the vocabulary of a student who focuses on HSK vocabulary only, not to find any word that doesn’t exist in HSK. I normally advise students to only use word lists for plugging holes, not to expand vocabulary in general. The difference is that plugging holes is about finding words much more common than those you are currently learning, but which you have somehow missed.
For example, if you’re currently at HSK3 but somehow missed the word “train station”, that would be a hole in your vocabulary. It’s much easier than the HSK3 words you know, but you missed it somehow. However, if you don’t know the word for “elevator”, this can’t really be seen as a hole, because it’s on your level and something you can’t really say that you have “missed”.
Identifying common words missing from HSK
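For each HSK level, I looked at the top of the general frequency list, down to a rank equal to half the cumulative number of words at that level, and listed all the words that are missing from HSK. In other words, these words are at least twice as common as the HSK level in question would suggest.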
For example, for HSK1-3, which contains 600 words, I checked the top 300 words in the frequency list, and noted all that did not appear in HSK1-3. This means that if you’ve completed HSK3, you might have missed these words. For HSK5, which contains a cumulative total of 2500 words, I checked the top 1250 words in the frequency list to see which were missing. This makes sure we’re talking about actual holes in your vocabulary.
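The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script used for this article: the function name `find_gaps` is hypothetical, and the toy frequency list and HSK set are made up to keep the example short.

```python
def find_gaps(frequency_ranked, hsk_words, cumulative_size):
    """Return frequent words that are not in the given HSK word set.

    frequency_ranked: words sorted from most to least frequent
    hsk_words: set of HSK words up to and including the level being checked
    cumulative_size: cumulative number of HSK words at that level
                     (for example 600 for HSK1-3)
    """
    # Only look at ranks up to half the cumulative HSK size,
    # e.g. the top 300 words when checking HSK1-3.
    cutoff = cumulative_size // 2
    return [word for word in frequency_ranked[:cutoff] if word not in hsk_words]


# Toy data for illustration only: not real frequency ranks or a real HSK list.
toy_frequency = ["的", "我", "你", "这里", "是", "美国", "火车站", "电梯"]
toy_hsk = {"的", "我", "你", "是", "火车站", "电梯"}

gaps = find_gaps(toy_frequency, toy_hsk, cumulative_size=12)
print(gaps)
```

With a real frequency list and a real HSK list, the same call with `cumulative_size=600` would correspond to the HSK1-3 check, and `cumulative_size=2500` to the HSK5 check.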
This generated a list of roughly 1000 words that were missing from all HSK levels. I then manually went through the whole list, deciding which of these were actually words students might want to learn. Here are the principles I followed when deciding which words to include, but you can get the full list at the end of the article if you prefer:
- Words that are also part of words that are in the HSK are included. Example: 但是 is in HSK, but 但 on its own is not. I included 但 because it’s deemed to be a word. Some cases are less obvious, such as 唱歌, which is in HSK, while 唱 and 歌 are not there separately and might not be obvious to students.
- Combinations of words that are in HSK and form phrases are excluded. Example: 这 and 个 are in HSK, but 这个 is not. 这个 is excluded because it’s not deemed to be a word.
- Words plus particles that are in HSK are excluded. Example: 你们 is a combination of a word and a particle, and can be assumed to be known, even if it’s not in HSK.
- Verbs plus complements are excluded if the meaning is obvious from the parts. Example: 找到 is ignored because it’s assumed that you know what it means if you know what 找 means and how 到 works.
- Single-character words that are in HSK only as part of longer words are excluded if the meaning is obvious. Example: It’s assumed that you know what 前 means if you know what 前面 means.
- Duplications of words that are in HSK are excluded. Example: 看看 is not counted as a word, since 看 is in HSK.
- Adverbs plus verbs are excluded if the meaning is obvious from the constituent parts, and those parts are in HSK. Example: 只是 is not included because its meaning is obvious from knowing 只 and 是.
- All negated words are excluded, so 不要 or 不能 are not included, because these are normally not considered to be words. If the meaning is deemed non-obvious to students, such as 无法, it is included, though.
- Characters that can’t be used as words on their own are excluded. For example, 者 is hardly ever used as a word on its own and is not included. It would only appear as part of words.
- Phrases and expressions are not deemed to be words and are excluded. For example, 怎么样 and 没什么 are not included.
- Logical extensions of words that are in the HSK are excluded, so even if 以前 is in HSK, but 以后 is not, 以后 is still not included.
- All erhua (儿化音) variants are excluded. Example: 一点儿 is excluded if 一点 is included.
Remember, the goal here is to generate missing words in HSK that you might want to learn. Thus, it makes no sense to include 不要 in the list, because no one would regard that as a new word you actually need to learn. Similarly, if you know 饭馆儿, it doesn’t make sense to treat 饭馆 as a new word either.
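The sorting for this article was done by hand, but a few of the principles above (duplication, negation and erhua) are mechanical enough that they could be flagged automatically before a manual pass. The sketch below is only an assumption about how that might look: `exclusion_reason`, `NEGATION_PREFIXES` and the toy word set are hypothetical, and the rules deliberately cover only the simplest cases.

```python
# Hypothetical helper: flags candidates that match one of the purely
# mechanical exclusion rules. Everything else still needs manual review.
NEGATION_PREFIXES = ("不", "没")


def exclusion_reason(word, known_words):
    """Return the name of the rule that excludes `word`, or None."""
    # Duplication: 看看 is excluded when 看 is already known.
    if len(word) == 2 and word[0] == word[1] and word[0] in known_words:
        return "reduplication"
    # Negation: 不要 is excluded when 要 is already known.
    if len(word) > 1 and word[0] in NEGATION_PREFIXES and word[1:] in known_words:
        return "negation"
    # Erhua: 一点儿 is excluded when 一点 is already known.
    if len(word) > 1 and word.endswith("儿") and word[:-1] in known_words:
        return "erhua"
    return None


# Toy set for illustration only, not a real HSK list.
toy_known = {"看", "要", "能", "一点"}

for candidate in ["看看", "不要", "一点儿", "无法", "棒球"]:
    print(candidate, exclusion_reason(candidate, toy_known))
```

Note that 无法 is not flagged here, which matches the decision above to keep it in the list. Rules that depend on whether a meaning is "obvious", such as verbs plus complements, can't be decided this way and still require human judgement.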
Types of words left out of the HSK word lists
This culling resulted in a list of roughly 650 words (meaning that I manually removed around 300 based on the principles described above). These are words that I think there’s a real chance you might genuinely want to learn as a student.
I identified several categories of words that were missing from HSK, presented below with some examples:
- Many single-character words are missing – I included these only when they didn’t violate any of the principles above, and when they can actually be used on their own. I think most students will know what 饭 means, even if they have only learnt 吃饭, but I chose to include such words because it’s not obvious that you can use them independently. If you’re the kind of student who only learns characters in the context of words, you should definitely learn these at least. Other such words missing from HSK: 话，山，车，美.
- Names of places and countries are missing – These are highly relevant for students, but are not part of HSK. Most textbooks have them, but if you focus solely on HSK, you will miss important names like 美国，英国 and 日本. There are also common Western personal names, but these can be ignored.
- Regional variants are missing – This might sound like it’s a good thing at first, but extremely common regionally preferred words are excluded entirely, mainly those being used in Mandarin spoken in the south. You should definitely learn these, even if you live in the north. Here are some examples of missing words: 这里，哪里，讲话，老公，礼拜一.
- Profanity is missing entirely – This is not hard to understand, but if you look at the most common words in TV dramas and movies, there’s going to be a lot of swearing. None of that is in HSK, not even the mild ones. Examples: 傻瓜，笨蛋.
- Foreign things are mostly missing – Things, places and phenomena that aren’t that common in China are not included in HSK. Examples: 棒球，女王，骑士. Most vocabulary related to religion is also missing: 教堂，上帝，圣诞节.
- Particles in informal language are missing – While some are included, many are not: 哟，耶，哦，嗯. These are extremely common and it’s nice to know them.
Words that are significantly delayed in HSK
The above discussion is mostly about what’s left out of HSK entirely, but there are also words that have been significantly delayed. Prioritising words suitable for learners also means that other words that are very common have been pushed further down the lists. Which are they?
I have not attempted to sort these words into categories, but many of them are more formal or written expressions that are common in Chinese, but tend to be left out in learning materials, or at least delayed until written, formal language is introduced. This is true even if the frequency list I used for this project uses spoken language. I have included words that are significantly delayed in HSK as separate lists below.
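To make the difference between "missing" and "delayed" concrete, here is a minimal sketch. It assumes that a delayed word is one that falls within the frequency cutoff for a level but only appears at a higher HSK level; that definition, the function name `classify_gaps` and the toy levels below are my assumptions for illustration, not taken from the actual lists.

```python
def classify_gaps(frequency_ranked, hsk_level_of, level, cumulative_size):
    """Split frequent words into those missing from HSK and those delayed.

    hsk_level_of: dict mapping each HSK word to the level where it first appears
    level: the HSK level being checked
    cumulative_size: cumulative number of HSK words at that level
    """
    cutoff = cumulative_size // 2
    missing, delayed = [], []
    for word in frequency_ranked[:cutoff]:
        word_level = hsk_level_of.get(word)
        if word_level is None:
            missing.append(word)   # not in HSK at any level
        elif word_level > level:
            delayed.append(word)   # in HSK, but only at a later level
    return missing, delayed


# Toy data: the ranks and HSK levels here are invented for the example.
toy_frequency = ["的", "我", "是", "而", "美国", "电梯"]
toy_levels = {"的": 1, "我": 1, "是": 1, "而": 4, "电梯": 3}

missing, delayed = classify_gaps(toy_frequency, toy_levels, level=3, cumulative_size=10)
print("missing:", missing)
print("delayed:", delayed)
```

In this toy example, 美国 ends up as missing and 而 as delayed, while 电梯 is ignored because it falls outside the frequency cutoff.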
Lists of missing and delayed words in all levels of HSK
I will now share the complete lists, including the raw list of missing words before my manual sorting, for those who want to have a go themselves. Most students, though, can simply check any HSK level at or below their current level to see what words they might have missed.
You will probably find that you know most of these, but you can safely assume that those that you don’t know would be good to know, at least if movie and TV subtitles are a good guide to spoken word frequency, which is shown to be the case in the paper linked to earlier (Cai and Brysbaert, 2010).
Please note that this sorting was done manually and probably contains some inconsistencies. My goal was to include words that students at each level might want to know, and that you would have a fair chance of missing if you only focused on HSK. I have also created a deck with all these words in Skritter for your convenience!
- Words missing in HSK1-3 (39)
- Words delayed in HSK1-3 (42)
- Words missing in HSK4 (67)
- Words delayed in HSK4 (28)
- Words missing in HSK5 (143)
- Words delayed in HSK5 (37)
- Words missing in HSK6 (409)
- All missing words by level in Skritter
- All missing words by level (CSV)
- All missing words by level, raw unsorted (CSV)
If you have any questions or suggestions for how to use this material, please leave a comment below!
References and further reading
Cai, Q., & Brysbaert, M. (2010). SUBTLEX-CH: Chinese word and character frequencies based on film subtitles. PloS one, 5(6), e10729.
Words delayed in HSK6 (0)
There are by definition no words delayed in HSK6, because there’s no higher level to delay them to. This will probably change in 2021, so I will likely revisit the topic of missing and delayed words in HSK then!