Learning Chinese can sometimes lack structure and feel confusing, especially if you study on your own. There are few reliable reference points, and it’s easy to understand why many turn to standardised tests, not just for assessment, but for guidance as to what to study and when.
HSK (Hànyǔ Shuǐpíng Kǎoshì) is by far the most well-known such test, and there are many textbooks, courses and learning resources specifically geared towards taking students through levels of increasing difficulty. It’s not uncommon to hear about students who say that they’re “working their way through HSK3” and similar.
While I think the idea of using a proficiency test to guide your learning and as the main source of new vocabulary is a bit backward, I also understand why people do so, especially if they need the certificate to apply for a scholarship or a job that requires Chinese. If you care about your grades, your decisions should not only be guided by what makes sense from a language learning perspective.
So, we have a large number of students that for some reason focus heavily on HSK study materials in general and HSK word lists in particular. This raises an interesting question: If you focus on HSK, what other things would you miss? Or, more specifically, if you learn words mostly from the HSK lists, what common words would you miss?
This article will provide an answer to that question. If you’re just interested in checking out the words, you can click here to skip to the word lists at the end of the article. If you’re learning Chinese in Taiwan and are more interested in the TOCFL test, check this follow-up article about that very topic: What important words are missing from TOCFL?
For those of you who want to know a little bit more, I’ll go through the process in more detail before we get to the actual words.
What important words are missing from HSK?
It should be clear that HSK is not meant to be a representation of the most commonly used Chinese words. This is very obvious in the lower levels, where words like “train station” and “bus” are part of HSK1, which has only 150 words in total. Those words are nowhere near the top 150 words in Chinese in general, but they are of course important for foreigners visiting and travelling in China, which probably is why they are included.
Overall, I think the lower levels of HSK match the needs of foreign students quite well. I have spent dozens of hours poring over these lists when creating the sentence pack for my beginner course Unlocking Chinese, and in general, there aren’t that many weird decisions about which words to include.
In other words, the purpose of this article is not to complain about HSK, but rather to highlight some very common words that were left out in favour of other words. Most of them were left out for good reasons, but this doesn’t mean that you shouldn’t learn these words!
The biggest problem when discussing words in Chinese is that there is no clear definition of what a word actually is. Since there’s no spacing between words, figuring out what is a word and what isn’t is hard. 你 is a word, but is 你好 a word? Most dictionaries say no. What about 你们? Or if you think 你好 is a word, what about 你们好? What about 老师好?
I think you’ll agree that 你 is a word and that 老师好 is not a word, but where to draw the line is not obvious, especially if you have to rely on an automated method (needed to deal with databases with millions and millions of characters).
The question of wordhood in Chinese is complex, and something I can’t go into in this article, but the bottom line is that different methods of separating Chinese text into words (segmentation) will yield different results.
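To make the segmentation problem concrete, here is a minimal Python sketch of one simple approach, greedy forward maximum matching, run with two different word lists. The function name, the `max_len` parameter and both lexicons are assumptions made up for this illustration; this is not the method used to build any real frequency list.

```python
def segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching: at each position, take the longest lexicon entry."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Two hypothetical lexicons that disagree on whether 你好 and 你们 are words.
strict_lexicon = {"你", "好", "们", "老师"}
generous_lexicon = strict_lexicon | {"你好", "你们"}

for sentence in ["你好", "你们好", "老师好"]:
    print(sentence, segment(sentence, strict_lexicon), segment(sentence, generous_lexicon))
```

With the strict lexicon, 你们好 comes out as three units (你, 们, 好), while the generous one gives two (你们, 好). Any word count built on top of the segmentation inherits that difference.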
This means that it’s hard to compare a word frequency list to the HSK list directly, simply because they have different standards for what a word is. If you just check for things that appear in a frequency list, but not in HSK, many of the results you get will be things that are actually not words, such as 那个 and 出来.
What does “common” mean, anyway?
The next problem is what frequency list to choose. How do you decide what a “common” word is? There are many frequency lists, of course, but most are based on written Chinese, which is much more formal than the language most students encounter. If we compared one of these lists with HSK to see how they differed, the result is easy to predict: characters and words used in formal, written Chinese would appear high on the frequency list, but low, if at all, in HSK. That would be neither helpful nor interesting.
Instead, I chose to look at word frequencies from the SUBTLEX-CH corpus (Cai and Brysbaert, 2010), which consists of Chinese subtitles from movies and TV series. This is still not naturally spoken Chinese, but it’s a lot closer to that than books and newspapers are. For a thorough look at resources for word, character and component frequencies in Chinese, please refer to this article:
The most common Chinese words, characters and components for language learners and teachers
At first, I thought that the fact that the corpus includes foreign movies and TV series translated into Chinese would be a big disadvantage, but the more I worked on this project, the more I realised that it is actually a potential advantage.
Many of the words that are common in Chinese subtitles but missing from the HSK lists refer to non-Chinese things, such as “baseball” and “jury”. If you’re a foreigner (why else would you study HSK?), learning such words is useful, not because they have a natural place in China, but because they do in your home country, and you might want to talk about them in Chinese, especially if you aren’t living in China.
Plugging gaps in your Chinese vocabulary
Next, the goal is to identify holes in the vocabulary of a student who focuses on HSK vocabulary only, not to find any word that doesn’t exist in HSK. I normally advise students to only use word lists for plugging holes, not to expand vocabulary in general. The difference is that plugging holes is about finding words much more common than those you are currently learning, but which you have somehow missed.
For example, if you’re currently at HSK3 but somehow missed the word “train station”, that would be a hole in your vocabulary. It’s much easier than the HSK3 words you know, but you missed it somehow. However, if you don’t know the word for “elevator”, this can’t really be seen as a hole, because it’s on your level and something you can’t really say that you have “missed”.
Identifying common words missing from HSK
For each HSK level, I checked the general frequency list for words that were at least twice as common as that level would suggest, meaning words ranked within the top half of the level’s cumulative word count, and listed all those missing from HSK.
For example, for HSK1-3, which contains 600 words, I checked the top 300 words in the frequency list, and noted all that did not appear in HSK1-3. This means that if you’ve completed HSK3, you might have missed these words. For HSK5, which contains a cumulative total of 2500 words, I checked the top 1250 words in the frequency list to see which were missing. This makes sure we’re talking about actual holes in your vocabulary.
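The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name is made up, the frequency order in the toy data is invented, and the toy "HSK" set is a tiny stand-in rather than a real level.

```python
def missing_words(frequency_ranked, hsk_cumulative):
    """Return common words that are absent from the HSK words studied so far.

    frequency_ranked: words ordered from most to least frequent.
    hsk_cumulative:   set of all HSK words up to and including the level.
    """
    # Only look at the top half of the level's size, e.g. top 300 for 600 words.
    cutoff = len(hsk_cumulative) // 2
    return [word for word in frequency_ranked[:cutoff] if word not in hsk_cumulative]


# Toy data: the ranking is invented, and the set stands in for a real HSK level.
toy_frequency = ["的", "我", "但", "是", "你", "这里", "好", "美国"]
toy_hsk = {"的", "我", "是", "好", "你", "火车站", "公共汽车", "但是"}

print(missing_words(toy_frequency, toy_hsk))  # ['但']
```

Note that 这里 and 美国 are not in the toy set either, but they fall below the cutoff of four words and are therefore not reported. The output is only a list of candidates; the manual sorting still has to happen afterwards.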
This generated a list of roughly 1000 words that were missing from all HSK levels. I then manually went through the whole list, deciding which of these were actually words students might want to learn. Here are the principles I used when deciding which words to include, but you can get the full list at the end of the article if you prefer:
- Words that are also part of words that are in the HSK are included. Example: 但是 is in HSK, but 但 on its own is not. I included 但 because it’s deemed to be a word. Some cases are less obvious, such as 唱歌, which is in HSK, but 唱 and 歌 are not there separately and might not be obvious for students.
- Combinations of words that are in HSK and form phrases are excluded. Example: 这 and 个 are in HSK, but 这个 is not. 这个 is excluded because it’s not deemed to be a word.
- Words plus particles that are in HSK are excluded. Example: 你们 is a combination of a word and a particle, and can be assumed to be known, even if it’s not in HSK.
- Verbs plus complements are excluded if the meaning is obvious from the parts. Example: 找到 is ignored because it’s assumed that you know what it means if you know what 找 means and how 到 works.
- Single-character words that are in HSK only as part of longer words are excluded if the meaning is obvious. Example: It’s assumed that you know what 前 means if you know what 前面 means.
- Duplications of words that are in HSK are excluded. Example: 看看 is not counted as a word, since 看 is in HSK.
- Adverbs plus verbs are excluded if the meaning is obvious from the constituent parts, and those parts are in HSK. Example: 只是 is not included because its meaning is obvious from knowing 只 and 是.
- All negated words are excluded, so 不要 or 不能 are not included, because these are normally not considered to be words. If the meaning is deemed non-obvious to students, such as 无法, it is included, though.
- Characters that aren’t words that can be used on their own are excluded. For example, 者 is hardly ever used as a word on its own and is not included. It would only appear as part of words.
- Phrases and expressions are not deemed to be words and are excluded. For example, 怎么样 and 没什么 are not included.
- Logical extensions of words that are in the HSK are excluded, so even if 以前 is in HSK, but 以后 is not, 以后 is still not included.
- All erisation (儿化音) is excluded. Example: 一点儿 is excluded if 一点 is included.
Remember, the goal here is to generate missing words in HSK that you might want to learn. Thus, it makes no sense to include 不要 in the list, because no one would regard that as a new word you actually need to learn. Similarly, if you know 饭馆儿, it doesn’t make sense to treat 饭馆 as a new word either.
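The sorting itself was done by hand, but some of the more mechanical principles could be approximated in code to flag candidates for review. The sketch below is a hypothetical first pass in Python, not the process actually used: the function name, the reason labels and the small `known` set are all assumptions, and rules that depend on judging whether a meaning is "obvious" cannot be automated this way.

```python
def exclusion_reason(word, known):
    """Return a short reason if `word` looks excludable, otherwise None.

    known: set of words the student is assumed to know already (for example HSK).
    """
    if len(word) == 2 and word[0] == word[1] and word[0] in known:
        return "reduplication"          # 看看 when 看 is known
    if len(word) > 1 and word[0] == "不" and word[1:] in known:
        return "negation"               # 不要 when 要 is known
    if len(word) > 1 and word.endswith("儿") and word[:-1] in known:
        return "erhua"                  # 一点儿 when 一点 is known
    if len(word) > 1 and word.endswith("们") and word[:-1] in known:
        return "word plus particle"     # 你们 when 你 is known
    # A split into two known words (这+个, 找+到, 只+是) is a hint, not proof.
    for i in range(1, len(word)):
        if word[:i] in known and word[i:] in known:
            return "combination of known words"
    return None


# Hypothetical stand-in for the words a student already knows.
known = {"看", "不", "要", "你", "这", "个", "找", "到", "只", "是", "一点", "但是"}

for candidate in ["看看", "不要", "一点儿", "你们", "这个", "找到", "但", "无法", "棒球"]:
    print(candidate, exclusion_reason(candidate, known))
```

Here 但, 无法 and 棒球 return `None` and would stay on the list, which matches the principles above. A flagged word should still be checked by a person, since a combination of known words can have a meaning that is not obvious from its parts.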
Types of words left out of the HSK word lists
This culling resulted in a list of roughly 650 words (meaning that I manually removed around 300 based on the principles described above). These are words that I think there’s a real chance you might genuinely want to learn as a student.
I identified several categories of words that were missing from HSK, presented below with some examples:
- Many single-character words are missing – I included these only when they didn’t violate any of the principles above, and when they can actually be used on their own. I think most students will know what 饭 means, even if they have only learnt 吃饭, but I chose to include these because it’s not obvious that you can use these independently. If you’re the kind of student that only learns characters in the context of words, you should definitely learn these at least. Other such words missing from HSK: 话，山，车，美.
- Names of places and countries are missing – These are highly relevant for students, but are not part of HSK. Most textbooks have them, but if you focus solely on HSK, you will miss important names like 美国，英国 and 日本. There are also common Western personal names, but these can be ignored.
- Regional variants are missing – This might sound like it’s a good thing at first, but extremely common regionally preferred words are excluded entirely, mainly those being used in Mandarin spoken in the south. You should definitely learn these, even if you live in the north. Here are some examples of missing words: 这里，哪里，讲话，老公，礼拜一.
- Profanity is missing entirely – This is not hard to understand, but if you look at the most common words in TV dramas and movies, there’s going to be a lot of swearing. None of that is in HSK, not even the mild ones. Examples: 傻瓜，笨蛋.
- Foreign things are mostly missing – Things, places and phenomena that aren’t that common in China are not included in HSK. Examples: 棒球，女王，骑士. Most vocabulary related to religion is also missing: 教堂，上帝，圣诞节.
- Particles in informal language are missing – While some are included, many are not: 哟，耶，哦，嗯. These are extremely common and it’s nice to know them.
Words that are significantly delayed in HSK
The above discussion is mostly about what’s left out of HSK entirely, but there are also words that have been significantly delayed. Prioritising words suitable for learners also means that other words that are very common have been pushed further down the lists. Which are they?
I have not attempted to sort these words into categories, but many of them are more formal or written expressions that are common in Chinese, but tend to be left out in learning materials, or at least delayed until written, formal language is introduced. This is true even if the frequency list I used for this project uses spoken language. I have included words that are significantly delayed in HSK as separate lists below.
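One way to read "delayed" is that a word falls within the frequency cutoff for a level but is only introduced at a higher HSK level. The Python sketch below assumes that reading and the same half-size cutoff used for missing words; both are assumptions. The function name is made up, and the placeholder words `w1` to `w6` and their levels are invented so as not to misstate the level of any real word.

```python
def delayed_words(frequency_ranked, hsk_level_of, level, cumulative_size):
    """Return common words that HSK introduces later than `level`.

    frequency_ranked: words ordered from most to least frequent.
    hsk_level_of:     dict mapping each HSK word to the level that introduces it.
    level:            the HSK level the student has completed.
    cumulative_size:  number of HSK words up to and including `level`.
    """
    cutoff = cumulative_size // 2
    return [
        word for word in frequency_ranked[:cutoff]
        if word in hsk_level_of and hsk_level_of[word] > level
    ]


# Placeholder data: w5 is absent from HSK entirely, so it is "missing", not "delayed".
toy_frequency = ["w1", "w2", "w3", "w4", "w5", "w6"]
toy_levels = {"w1": 1, "w2": 5, "w3": 2, "w4": 6, "w6": 4}

print(delayed_words(toy_frequency, toy_levels, level=3, cumulative_size=8))  # ['w2', 'w4']
```

Under this reading, nothing can be delayed at the highest level, because no word has a level above it.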
Lists of missing and delayed words in all levels of HSK
I will now share the complete lists, including the raw list of missing words before my manual sorting for those who want to have a go themselves. Most students, though, can simply check any HSK level at or below their current level to see which words they might have missed.
You will probably find that you know most of these, but you can safely assume that those that you don’t know would be good to know, at least if movie and TV subtitles are a good guide to spoken word frequency, which is shown to be the case in the paper linked to earlier (Cai and Brysbaert, 2010).
Please note that this sorting was done manually and probably contains some inconsistencies. My goal was to include words that students at this level might want to know and that there is a fair chance that you’d miss if you only focus on HSK. I have also created a deck with all these words in Skritter for your convenience!
- Words missing in HSK1-3 (39)
- Words delayed in HSK1-3 (42)
- Words missing in HSK4 (67)
- Words delayed in HSK4 (28)
- Words missing in HSK5 (143)
- Words delayed in HSK5 (37)
- Words missing in HSK6 (408)
- All missing words by level in Skritter
- All missing words by level (CSV)
- All missing words by level, raw unsorted (CSV)
If you have any questions or suggestions for how to use this material, please leave a comment below!
References and further reading
Cai, Q., & Brysbaert, M. (2010). SUBTLEX-CH: Chinese word and character frequencies based on film subtitles. PloS one, 5(6), e10729.
The images used for the HSK levels for this article are from Skritter and are used here with permission.
Words missing from HSK1-3 (39)
Words delayed in HSK1-3 (42)
Words missing from HSK4 (67)
Words delayed in HSK4 (28)
Words missing from HSK5 (143)
Words delayed in HSK5 (37)
Words missing from HSK6 (408)
Words delayed in HSK6 (0)
There are by definition no words delayed in HSK6, because there’s no higher level to delay them to. This will probably change in 2021, so I will likely revisit the topic of missing and delayed words in HSK then!
“While I think the idea of using a proficiency test to guide your learning is a bit backward”
I think one advantage of the HSK lists is that they bring a level of standardisation to learning materials that is missing from other languages.
For learning materials based around the HSK word lists it’s easier to switch between learning materials because the person designing it can assume that anyone picking up, say, their HSK4 material already knows the HSK1-3 words, whereas otherwise it’s not only more difficult to pick the right level, but you’ll find that material at a particular level is assuming knowledge taught at the lower levels that you happen to not have picked up yet from whatever resources you were using.
This obviously applies more to the lower levels, but even at the higher levels it can be useful for a teacher or author to know with some confidence that students are familiar with all the words at least up to HSK5 or so. This could be helpful when deciding which words to include definitions of, or to ensure that new vocab is introduced at an appropriate pace while having to make fewer guesses about which words people know.
I know with my iTalki tutor, for instance, being able to say “I’ve learned all the words up to HSK5, and now I’m working on the HSK6 list” immediately gives her a good idea where my vocab is at, and if she’s reasonably familiar with the lists (which a lot of mainland teachers are, at least up to HSK4 or so) it helps her know which words I’m likely not to know, even though I do know a lot of words not on the list as well.
Similarly for students, if you want to, say, enrol in a course aimed at the HSK5 level, you might decide to make sure you’ve learned all the words up to HSK4, whereas otherwise it would be difficult to identify what level you should enter or what gaps you’d need to fill in before starting the class.
In short, while the HSK lists have their limits, I do think they help to rationalise the Chinese language ‘industry’ as a whole, particularly for learners at a lower level.
Of course they aren’t perfect, so many thanks for your work putting together these additional lists!
Yes, I agree, which is why I in the next sentence said that I understand why people are doing it. 🙂 I also used the HSK list as a basis for the sentences in my own course, which I wouldn’t do if I didn’t think they were any good. Like you say, this is mostly true for beginners, though, as the guidance received from word lists decreases a lot for each level and is basically useless at HSK6. Still, I don’t think it’s a good idea to use online word lists as the main source of vocabulary (many people are doing that, well beyond HSK1-3). I wrote more about which words to learn and where to find them in this article: Which words you should learn and where to find them. I’ll update the article a bit to clarify what I mean!
Incredibly useful! Thank you for putting this together.
They’re revising the HSK exam, and maybe they’ll fix those problems. There is some information on the Wikipedia page (English version) on HSK.
From what I’ve seen, they are very unlikely to add most of these words to the lists. I think there will be fairly small changes to the beginner part of the lists, even though some reshuffling is quite likely. The thing is, most of these words are not on HSK for a reason. It’s not that the creators of the previous lists overlooked them; they deliberately left them out. While a new structure could change that, I don’t think it will. I actually thought about this a bit before posting, because I realise that I will have to make a second post when the new lists are out next year, but considering that most of the work is already done now, it should be pretty easy to post an update later!
Do you have any similar assessment for the Taiwanese equivalent? Obviously, 棒球 and 民主 aren’t going to be omitted there, and certain expressions don’t mean the same (小姐, 土豆), but are there any obvious omissions from learning materials?
Good question! Now that I have done it once for HSK, it should be a pretty easy task to do it for TOCFL as well. Maybe a companion article in a week or two, we’ll see! This was obviously done with simplified characters only, as trying to include both would essentially double the work and make it almost impossible to verify since mappings between simplified and traditional would become a huge problem. But it wouldn’t be too hard if both lists are traditional only. Thanks for the suggestion!
I have now published a follow-up article about TOCFL! I used a different frequency list, but the principles remain the same. It’s interesting that for many words with different regional versions, such as 腳踏車/自行車, both versions are listed on the test. You can check the article here: https://www.hackingchinese.com/what-important-words-are-missing-from-tocfl/
Thank you for these lists, they are very useful! 🙂
Would you also be able to give the meanings for each word and/or recommend how to choose which meanings to learn?
I ask as many of the words (especially the one-character words) have multiple meanings. Further, it is often not clear how common each of these meanings are and/or which meanings are referring to the meaning of the character as used alone or as used as part of another word.
An example of what I mean is 么, which can be part of 什么 and pronounced “me”, while it could also be used as either the interrogative final particle (i.e. replacing 吗) or the exclamatory final particle (i.e. replacing 嘛) and pronounced as “ma” or “ma2”, respectively. The latter two “meanings” are arguably “words”, while the former is more referring to how the character is used as parts of other words.
中 is another example that has the common meaning “middle” with pinyin zhong1, but can also mean “to hit the mark, to be hit/affected by, to win a prize/lottery” with pinyin zhong4. I know the former meaning is common, but I don’t know how common the latter is (or even if it is just a “character” meaning and not an independent “word” meaning).
Connected with learning characters and words that are commonly encountered rather than random words (as you discuss in another article), I would rather not spend lots of time in the early stages learning rare meanings of words/characters that 99% of the time have their common meaning. (Though, knowing which common words also have a rare meaning is useful to keep in the back of one’s mind, even if that rare meaning is not learned yet.)
Do you have any recommendations for how to deal with this (both for these lists and in general)?
This is not a problem that I can solve easily, as it would entail going through the whole list manually, and then I would only guess at why that word is there (I only have the frequency data associated with it). In theory, this should be possible at least partly, though, as long as the corpus data is properly segmented. For example, if there is information about 中 indicating either an action or a location, it ought to be possible to see which is most common. I have not seen anyone do that for large numbers of words, though, and it would only work if the different meanings belong to different categories of words, such as 中. It would not work if a word could be two different adverbs or two different nouns, for example.
Skritter usually lists the most common meaning first, but only if the pronunciation is different, and this is also the result of manual work and edits, meaning it’s not necessarily 100% reliable.
The easiest way to get the definitions, no matter which definitions we’re talking about, is to paste the words into MDBG, which will give you a list with all of them. I realised that it’s actually possible to link to such a list, so I have added links for each section for your convenience. Enjoy!
Thank you! This is an amazing way to get the meanings 🙂
I had been always getting the meanings for words or characters by going through each one at a time on Wiktionary; this is definitely much faster.
I actually found that, after clicking on the MDBG link, I can get a text version that I can paste directly into excel, with the appropriate column separations. I will put the steps here for anyone who wants to do this:
1. Click on the “View meanings in MDBG” link for the list you want.
2. Click on “Look up All Chinese Words in a Text” near the top left of the page
3. Change the second drop-down box to “Create a vocabulary list”
4. Click on “Go”
5. Click on “Print a vocabulary list” and hit cancel on the Print window
-> This creates a pop-up window with the list in text format.
6. Highlight all the text (ctrl-A seems to work properly), copy it and paste it directly into Excel.
-> The text should be separated into columns automatically
-> Note that a “By MDBG” link may be at the bottom of what was just pasted into Excel.
Now, with the list in Excel, one can format it for import into Anki 🙂
Thank you again, Olle!
Great, thanks for sharing the detailed steps! I’m sure other people who want to import it to various programs and platforms will benefit, too.
As an alternative to the above, I use the following page to generate word lists to import into Anki:
Your list of words missing from HSK6 has only 408 entries, not 409.
Good catch! I’ve updated it to say 408 now, thanks!
Hi Olle, thanks for this resource as well as the other great stuff in your articles and book!
Is there a version of this list that is independent from the HSK? Meaning, you distilled the frequency list down to what you thought are the most useful entries, but because the article is about what’s missing from HSK, you also removed the words occurring in HSK. So is there a version of your filtered list with the HSK words still included? I think this would be VERY useful to those wanting to make the most of the frequency list who are also not HSK learners.
I’m not sure I understand what you mean! What I did here was to compare a frequency list with HSK and then do some analysis to check what high-frequency words were either delayed or left out of HSK. What you ask for would simply be a frequency list. I did not try to subjectively decide what words are useful! The only subjective parts here are 1) my judgement of what is a word and 2) the grouping of words into categories (such as profanity, foreign things, etc.).
Sorry for the confusion! It’s my understanding that you:
1. Started with the SUBTLEX-CH list
2. Removed the actual HSK words
3. Removed additional entries based on your stated criteria
4. Were left with the words in this article that are “missing” from HSK
What I’m asking is, can we skip step 2? Essentially, a version of SUBTLEX-CH refined according to your criteria, but with any HSK words still included.
Basically, I feel it would be very useful to have a frequency list that has been edited according to an expert (you). Your criteria above makes a lot of sense and I believe this would be a very efficient study tool if it was applied to the frequency list independently from the idea of what’s missing from HSK. I hope this makes sense, it’s been a difficult idea to explain via text! Thanks again.
Hm.. I’m afraid I still don’t see what the purpose of this would be. If you take the list I have produced and you add HSK back in (which is a trivial copy and paste), don’t you get exactly what you describe? Step 3 has almost no impact on HSK words anyway, the whole point of step 3 was to remove things that could have been on HSK but aren’t because the authors didn’t consider them eligible candidates.
In short, it seems to me that if you remove step 2 in the process, the only difference is that step 3 doesn’t get applied to the HSK words, but step 3 wouldn’t have any effect on the HSK words anyway, no?
Do you plan on doing a similar analysis with the new hsk vocab?
Yes! I did maybe half, but then ran out of time. I’ve been too busy with other commitments recently, but it’s definitely in the pipeline!