Hacking Chinese

A better way of learning Mandarin

The most common Chinese words, characters and components for language learners and teachers

How much Chinese you learn depends on three things: how you study (method), how much you study (time) and what you study (content). Naturally, if you want to learn as much as possible, you need to maximise all three. Of these three factors, I think content is the one most often overlooked: many learners care a great deal about how they learn and how much time they spend learning, but give little thought to what they are actually learning. I’ve written about this problem before: Three factors that decide how much Chinese you learn.

So, what should you learn? This is easy to answer in the abstract: you should study whatever is the most useful to you, based on your goals for learning Chinese. In practice, the question is much harder to answer, but fortunately, we can use frequency as a proxy for usefulness. This is based on the assumption that something that occurs often in the language ought to be learnt before something that is less frequent, simply because you’re likely to both use and encounter it more.

Please note that frequency is not the only proxy for usefulness. Other ways of accessing lists of useful characters and words include vocabulary lists from textbooks, standardised proficiency tests and more. For a discussion about that, please check the article below.

Vocabulary lists that help you learn Chinese and how to use them

In this article, I will only talk about frequency.

Frequency resources for learning and teaching Chinese

The problem is that students, and sometimes teachers, have a simplistic view of what frequency lists are and what they can or should be used for. It’s easy to grab a list off the internet and use it as an absolute reference, believing it to show the most common components, characters and words.

It probably does, but it’s important to remember that all frequency data is based on a small fraction of all language. Here are some questions that many students never even consider:

  • Is the data based on written or spoken language? Maybe both?
  • If written language, what were the sources? Newspapers? Novels? Blogs?
  • If spoken language, what were the sources? Actual conversations? Chat? Movies?
  • Where does the data come from geographically?

The answers to these questions matter. Below, I have listed the top 10 characters in two character frequency lists that are obviously based on different sources. Look at them and see if you can draw some conclusions regarding the corpus, i.e. what kind of language was used to generate these lists.

[Table: the top 10 characters (ranks 1 to 10) in List A and List B; the characters themselves are not available here.]

While some characters appear on both lists, there are important differences. What do you think List A is based on? What do you think List B is based on? Which one do you think would be most suitable to use as a student? Think about this for a minute before you check the answer below.

  • List A is based on formal, written language, which you can see because of the lack of pronouns high up on the list. The exact source is Chinese language Wikipedia, which is also something you might have guessed because of the very high frequency of characters used in dates (it could have been any encyclopaedia, of course).
  • List B is based on spoken language. More precisely, it’s based on movie subtitles. You can see it’s probably spoken language because people like to talk about themselves and people in their vicinity (hence all the common pronouns at the top). Of course, movies don’t contain naturally-produced spoken language, but at least the goal of most movie dialogue is to make it sound as if it were, which is good enough in many cases.

I hope this example has convinced you not to just grab a random list you find after searching for one minute. I will introduce a number of frequency sources in this article and help you choose one that suits you!

Words, characters and components

Unlike many other languages, Chinese can be broken down into much smaller pieces while still retaining some meaning. For example, if you take a two-syllable word and break it down, each syllable will be represented in the written language by a character that means something. Often, these characters can then be broken down into components that in themselves also mean something.

Frequency data can be used on all levels, but because of methodological issues, character frequency is what most people talk about. It is, after all, unambiguous what a character is since it occupies exactly one square on a page and it’s easy to calculate frequencies using characters as the unit. However, if you want to focus on spoken Chinese, you’re not much helped by character frequencies, but would much rather use word frequencies. If you focus on learning Chinese characters, you might also want to break down characters into components and see which are the most important to learn first.
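To make the point about characters being easy to count concrete, here is a minimal Python sketch. The two sample strings are made up for illustration and stand in for real corpora, the function names are hypothetical, and restricting the count to the basic CJK Unified Ideographs block is a simplifying assumption.

```python
from collections import Counter


def is_cjk(ch: str) -> bool:
    """Return True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def char_frequencies(text: str) -> Counter:
    """Count how often each Chinese character occurs, ignoring everything else."""
    return Counter(ch for ch in text if is_cjk(ch))


# Two tiny made-up samples (assumptions, not real corpus data):
# one encyclopaedia-like, one dialogue-like.
formal_sample = "中国的首都是北京。北京是中国的政治中心。"
spoken_sample = "我不知道你在说什么,你是不是在开玩笑?"

formal_counts = char_frequencies(formal_sample)
spoken_counts = char_frequencies(spoken_sample)

# Print the three most frequent characters in each sample.
print(formal_counts.most_common(3))
print(spoken_counts.most_common(3))
```

Even with samples this small, pronouns only show up in the dialogue-like text, which is the kind of difference the choice of corpus produces at scale.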

In this article, I will introduce frequency data for words, characters and components, in that order.

Frequency resources for Chinese words

For spoken language (and perhaps written language too), words are the most interesting unit, since people use spoken words, not written characters, to communicate. Coming from a language like English, this might seem straightforward enough: just calculate the frequencies of words, right?

Not really. Chinese has no spaces between words, which means that it’s far from easy to figure out where one word stops and the next one begins. 红 (red) is a word, so is 灯 (light), but 红灯 is a word too (red light, as in “stopping at a red light”). But what about 黄灯 (yellow light)? Or 紫灯 (purple light)? As you can probably see, this is not easy.

There are three ways of dealing with this problem:

  1. The easiest way for students and teachers is to take an extensive dictionary and simply say that everything listed in there is a word, and everything that is not listed isn’t a word. In that case, 红灯 and 黄灯 are words, but 紫灯 is not (I used 现代汉语词典 to check this). There might still be ambiguities regarding which character belongs to which word, though. I don’t know of any frequency lists that use this method.
  2. Another way is to not care about what actually is a word as defined by linguists, but instead focus on characters that often appear together (called a bigram if it’s a pair of characters) and simply say that each such combination is a word. The problem is that things like negated verbs then rank very high, such as 不能, 不会 and so on, along with numbers in combination with measure words, like 一个.
  3. The most demanding way is to make sure the texts being used have been segmented into words in some way by a human or, more often, by a combination of humans and clever tools that help a lot, but which don’t get everything right. Still, there are often “words” like 不能 and 一个, probably because someone decided that they actually are words for the purpose of the corpus.
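The dictionary-based approach (the first item in the list above) can be sketched with greedy forward maximum matching. This is only one simple way to apply a dictionary, not a description of how any particular frequency list was built, and the tiny dictionary below is a hypothetical stand-in for a real one such as 现代汉语词典.

```python
def segment_max_match(text: str, dictionary: set[str], max_len: int = 4) -> list[str]:
    """Greedy forward maximum matching.

    At each position, take the longest dictionary entry that fits;
    if nothing matches, fall back to a single character.
    """
    words = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical mini-dictionary; a real one has tens of thousands of entries.
toy_dictionary = {"红灯", "黄灯", "看见", "我"}

print(segment_max_match("我看见红灯", toy_dictionary))  # ['我', '看见', '红灯']
print(segment_max_match("我看见紫灯", toy_dictionary))  # ['我', '看见', '紫', '灯']
```

Because 紫灯 is not in the dictionary, it is split into two single characters, so whether something counts as a word depends entirely on what the dictionary happens to list.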

So, in essence, while you would expect the results to differ depending on how carefully the material was segmented (if at all), in practice there really isn’t that much of a difference.

We’re now ready to look at the available resources for word frequency in Chinese. I have tried to put broadly useful resources at the top, but haven’t paid close attention to exact ordering:

  1. SUBTLEX-CH (movie subtitles) This list is based on Chinese movie subtitles and is thus as close to natural spoken language as you can get. 100,000 simplified words without definition or Pinyin. You will find some entries like 这个 and 不能, but it should still be very useful. Note that there are two files, one for character frequencies and one for word frequencies. Article describing the underlying research project here.
  2. K-5 Word Frequency Dictionary for Chinese L2 Learners This list is somewhat unique in that it draws on materials for people who learn Chinese as a second language, i.e. textbooks, graded readers and so on (read more about the methodology here). The site presents a search interface, but you can also download Excel spreadsheets. Simplified Chinese with Pinyin. This list comes very close to what I would consider most useful for second language learners.
  3. BLCU balanced corpus frequency lists These lists are based on a ridiculous 15 billion (simplified) character corpus, composed of news, literature, blogs and much more. It is probably the biggest, most comprehensive dataset available. You can access the corpus online here and read more about the project here (in Chinese). The ZIP-file linked to above contains text files for each part of the corpus, as well as a global file. If you can’t view the text files, a user over at Pleco’s forum posted UTF-8 encoded versions that work well for me. The lists contain some oddities, such as 第 coming out on top and some kana from Japanese showing up; please see the discussion over at Pleco for cleaned-up versions if you want to remove these.
  4. University of Leeds: Internet Word Frequencies This frequency list is based on the Leeds corpus of internet Chinese (90 million tokens from 2005). Simplified characters with no frills. You can search the corpus directly online, which is handy.
  5. 6000 Chinese Words: A Vocabulary Frequency Handbook, by James Erwin Dew – This is one of the few books I recommend. It lists words with traditional characters. Since it’s a printed book, you can’t do much analysis of your own, but it does have the data both in alphabetical order (good for looking things up) and in frequency order. It also offers lists of prolific characters, meaning characters that are used in common words.
  6. Mandarin Chinese Word Frequency Dictionary This list presents word frequencies based on the Academia Sinica Balanced Corpus of Modern Chinese, “balanced” meaning that it draws from a number of different kinds of texts. The data here is very useful, but it’s not very accessible because it’s in PDF format and not sorted by frequency. I haven’t been able to find this data in a more accessible format! Traditional characters.
  7. Leiden Weibo Corpus (and related frequency data) –  This list is based on Weibo messages, which makes it interesting for informal writing (and indirectly speaking). It also has a rather unique feature in that messages are coded for which city/province they were posted from. However, to make the most of this resource, you need some database and/or text manipulation skills. The word list is pretty straightforward, but not sorted by frequency. To get the geographical breakdown, you need to combine the word IDs in one file with the actual words in another, so not for the idly curious. Simplified characters, of course.
  8. Jun Da: Bigram frequency list of the general fiction sub-corpus – As the name indicates, this is an analysis of pairs of characters occurring together in the general fiction part of the corpus used by Jun Da. See also Bigram frequency list of the news sub-corpus. Both are somewhat configurable (you can set frequency and mutual information values) and spit out simplified characters (varying numbers depending on what values you set, of course).
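To illustrate what frequency and mutual information thresholds do to a bigram list, here is a rough Python sketch. It uses pointwise mutual information in bits as the measure, which is an assumption and not necessarily the exact formula behind Jun Da's lists; the sample text, function names and default thresholds are all hypothetical.

```python
import math
import re
from collections import Counter


def bigram_stats(text: str) -> dict[str, tuple[int, float]]:
    """Return {bigram: (count, pointwise mutual information in bits)}.

    Bigrams are only formed inside unbroken runs of Chinese characters,
    so punctuation never ends up in the middle of a pair.
    """
    runs = re.findall(r"[\u4e00-\u9fff]+", text)
    char_counts = Counter(ch for run in runs for ch in run)
    bigram_counts = Counter(
        run[i:i + 2] for run in runs for i in range(len(run) - 1)
    )
    n_chars = sum(char_counts.values())
    n_bigrams = sum(bigram_counts.values())

    stats = {}
    for bigram, count in bigram_counts.items():
        p_xy = count / n_bigrams
        p_x = char_counts[bigram[0]] / n_chars
        p_y = char_counts[bigram[1]] / n_chars
        stats[bigram] = (count, math.log2(p_xy / (p_x * p_y)))
    return stats


def filter_bigrams(stats, min_count=2, min_pmi=0.0):
    """Keep bigrams that meet both thresholds, most frequent first."""
    kept = [
        b for b, (count, pmi) in stats.items()
        if count >= min_count and pmi >= min_pmi
    ]
    return sorted(kept, key=lambda b: stats[b][0], reverse=True)


# Made-up sample text (an assumption, not corpus data).
sample = "我不能去。他不能来。我不会说。"
stats = bigram_stats(sample)
print(filter_bigrams(stats))
```

In this toy sample, both 不能 and 我不 pass the thresholds even though 我不 is not a word, which is why bigram lists need to be read with some care.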

Phew, that’s a lot! Have I missed anything, including easy-to-access derivatives of the above? Leave a comment!

Frequency resources for Chinese characters

As I mentioned above, character frequencies are very easy to obtain (no segmentation needed) and there are numerous lists available online. The only thing you might want to care about is that they are based on a large enough corpus and that the sample roughly matches what you’re after and isn’t very specialised in some area you’re not interested in. I have tried to order them with the most useful at the top:

Have I missed any useful frequency list? Leave a comment below!

Frequency resources for character components

When looking at character components, the problem is similar to that discussed above regarding words. There is no unified definition of what a component is and there’s often more than one way to break down a character.

For example, the character 行 was originally a pictograph showing a road intersection. So, in a sense, it can’t be broken down and should be treated as a component itself. However, if you just look at the modern form, you can visually break the character down into 彳, 一 and 丁. Or take 想 as another example. Do you break it down into 心 and 相 and be done with it? Or do you then break 相 down into 木 and 目 and count each?
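The 想 example can be written out as a small Python sketch that shows how the choice of breakdown depth changes which components get counted. The breakdown table is a hypothetical two-entry illustration, not data from any published decomposition list, and treating anything missing from the table as atomic is an assumption.

```python
# Hypothetical one-level breakdowns for illustration only.
BREAKDOWN = {
    "想": ["相", "心"],
    "相": ["木", "目"],
}


def decompose(char: str, full: bool) -> list[str]:
    """Break a character into components.

    With full=False, stop after one level. With full=True, keep breaking
    components down until none of them appears in BREAKDOWN.
    Characters without an entry are treated as atomic.
    """
    parts = BREAKDOWN.get(char)
    if parts is None:
        return [char]
    if not full:
        return list(parts)
    result = []
    for part in parts:
        result.extend(decompose(part, full=True))
    return result


print(decompose("想", full=False))  # ['相', '心']
print(decompose("想", full=True))   # ['木', '目', '心']
print(decompose("行", full=True))   # ['行'] (no entry, so treated as atomic)
```

A frequency list built with the shallow breakdown would credit 相, while one built with the full breakdown would credit 木 and 目 instead, so two lists can disagree without either being wrong.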

In fact, this kind of data is even trickier to deal with than words, because with words there is at least some consensus on what ought to count as one; it’s just not easy to apply computationally to a large dataset. When it comes to components, it’s even hard to find data!

As a shortcut, people sometimes use radicals instead of components, but there are many more components than there are radicals, so that doesn’t give the whole picture. Other times, people break things down as far as the result can be written with Unicode, damn the consequences. This leads to frequency lists where single strokes are the most common “components”.

Here’s my list created using the radical approach described above:

Kickstart your character learning with the 100 most common radicals

This list at HanziCraft lists productive character components in order of frequency. It was created by checking how often a component appears in breakdowns, weighted by the frequency of the characters it appears in. It’s a pretty good effort if you’re after visual breakdowns.
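As a rough illustration of weighting components by the frequency of the characters they appear in, here is a minimal Python sketch. It is not HanziCraft's actual implementation; the frequency numbers are invented, the breakdowns are hypothetical, and counting a component only once per character is a design assumption.

```python
from collections import Counter

# Invented character frequencies (think "occurrences per million").
CHAR_FREQ = {"想": 900, "相": 600, "林": 100, "忘": 200}

# Hypothetical one-level visual breakdowns.
BREAKDOWN = {
    "想": ["相", "心"],
    "相": ["木", "目"],
    "林": ["木", "木"],
    "忘": ["亡", "心"],
}


def component_scores(char_freq: dict[str, int],
                     breakdown: dict[str, list[str]]) -> Counter:
    """Score each component by summing the frequencies of the characters
    whose breakdown contains it (counted once per character)."""
    scores = Counter()
    for char, freq in char_freq.items():
        for component in set(breakdown.get(char, [])):
            scores[component] += freq
    return scores


scores = component_scores(CHAR_FREQ, BREAKDOWN)
for component, score in scores.most_common():
    print(component, score)
```

With this weighting, a component that appears in a few very common characters can outrank one that appears in many rare characters, which is usually what a learner wants.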

We also have this list over at HanziCraft, which lists phonetic components. There’s no frequency information here, but they are commonly used, so I guess it’s better than nothing. This list also includes information about how they influence pronunciation, which is nice.

If you know of more extensive and/or scientific lists for character components, please let me know.

Other frequency resources

Here are some more frequency-related resources that don’t really fit in any specific category above, but that I thought someone might find interesting:

Conclusion

That’s probably more frequency resources for learning and teaching Chinese than you ever wanted to see! Still, I suspect there are important resources I have overlooked. If you know of one that really should be included here, please let me know and I will update the article!



3 comments

  1. Fen Ma says:

    I found a list of non-decomposable Chinese characters (fatizi), but I don’t know what to do with it. I guess I know about 75% of these characters today.

    https://wenku.baidu.com/view/a941b1e784254b35eefd3483.html
    http://www.moe.gov.cn/ewebeditor/uploadfile/2015/01/13/20150113090418639.pdf

    The characters should fall into a category like easy characters, or characters that can themselves serve as components. I found the list by searching for useful components that are also characters in their own right. I haven’t used it so far, though, and it contains no frequency information.

    That brings me to another point. You wonder about the usefulness of the number of strokes in a character. You can take this number as a measure of how easy a character is (as a rule of thumb – it’s far from perfect). With that, you can create a ranking that takes both easiness and frequency into account.

    1. Olle Linge says:

      I’ll check out the list you suggested! Regarding number of strokes, I think difficulty is not related to number of strokes at all. More strokes often make it easier to learn a character, not harder, at least for non-beginners. The real problem is not so much to learn a new character, which very likely consists of known components if it has a large number of strokes, but to keep the new characters separate from ones already known. Lots of strokes makes it harder to write, of course, but not harder to remember (in general, there are of course characters with many strokes that are also hard to remember). My comment here assumes that the student already knows the most common components, of course, otherwise characters with many strokes are obviously harder because they contain more unknown components.

  2. Birgit says:

    OK, learning new characters or words from a list can be done fast and it is not too boring. But they also need to be reviewed several times. Where do those frequency lists provide interesting review material, at the right level, at the right time?

    For beginners and intermediate learners, I recommend checking out the official word lists over at wordswing; interesting review can be done by playing games, viewing all words on one page etc.
