How much Chinese you learn depends on three things: how you study (method), how much you study (time) and what you study (content). Naturally, if you want to learn as much as possible, you need to maximise all three. Of these three factors, I think content is often overlooked, meaning that many care much about how they learn and how much time they spend learning, but don’t really think too much about what they are actually learning. I’ve written about this problem before: Three factors that decide how much Chinese you learn.
So, what should you learn? This is easy to answer in the abstract: you should study whatever is the most useful to you, based on your goals for learning Chinese. In practice, the question is much harder to answer, but fortunately, we can use frequency as a proxy for usefulness. This is based on the assumption that something that occurs often in the language ought to be learnt before something that is less frequent, simply because you’re likely to both use and encounter it more.
Please note that frequency is not the only proxy for usefulness. Other ways of accessing lists of useful characters and words include vocabulary lists from textbooks, standardised proficiency tests and more. For a discussion about that, please check the article below.
In this article, I will only talk about frequency.
Frequency resources for learning and teaching Chinese
The problem is that students, and sometimes teachers, have a simplistic view of what frequency lists are and what they can or should be used for. It’s easy to grab a list off the internet and use it as an absolute reference, believing it to show the most common components, characters and words.
It probably does, but it’s important to remember that all frequency data is based on a small fraction of all language. Here are some questions that many students never even consider:
- Is the data based on written or spoken language? Maybe both?
- If written language, what were the sources? Newspapers? Novels? Blogs?
- If spoken language, what were the sources? Actual conversations? Chat? Movies?
- Where does the data come from geographically?
The answers to these questions matter. Below, I have listed the top 10 characters in two character frequency lists that are obviously based on different sources. Look at them and see if you can draw some conclusions regarding the corpus, i.e. what kind of language was used to generate these lists.
|Rank||List A||List B|
While some characters appear on both lists, there are important differences. What do you think List A is based on? What do you think List B is based on? Which one do you think would be most suitable to use as a student? Think about this for a minute before you check the answer below.
- List A is based on formal, written language, which you can see because of the lack of pronouns high up on the list. The exact source is Chinese language Wikipedia, which is also something you might have guessed because of the very high frequency of characters used in dates (it could have been any encyclopaedia, of course).
- List B is based on spoken language. More precisely, it’s based on movie subtitles. You can see it’s probably spoken language because people like to talk about themselves and people in their vicinity (hence all the common pronouns at the top). Of course, movies don’t contain naturally-produced spoken language, but at least the goal of most movie dialogue is to make it sound as if it were, which is good enough in many cases.
I hope this example has convinced you not to just grab a random list you find after searching for one minute. I will introduce a number of frequency sources in this article and help you choose one that suits you!
Words, characters and components
Unlike many other languages, Chinese can be broken down into much smaller pieces while still retaining some meaning. For example, if you take a two-syllable word and break it down, each syllable will be represented in the written language by a character that means something. Often, these characters can then be broken down into components that in themselves also mean something.
Frequency data can be used on all levels, but because of methodological issues, character frequency is what most people talk about. It is, after all, unambiguous what a character is since it occupies exactly one square on a page and it’s easy to calculate frequencies using characters as the unit. However, if you want to focus on spoken Chinese, you’re not much helped by character frequencies, but would much rather use word frequencies. If you focus on learning Chinese characters, you might also want to break down characters into components and see which are the most important to learn first.
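To show why characters are such a convenient unit, here is a minimal Python sketch of a character frequency count. The sample sentence and the `is_han` helper are my own illustration, and the helper only checks the main CJK Unified Ideographs block, so rare characters outside that range would be skipped.

```python
from collections import Counter


def is_han(ch: str) -> bool:
    """Rough check: is this character in the main CJK Unified Ideographs block?"""
    return "\u4e00" <= ch <= "\u9fff"


def character_frequencies(text: str) -> list[tuple[str, int]]:
    """Count Chinese characters in text and return them, most frequent first."""
    counts = Counter(ch for ch in text if is_han(ch))
    return counts.most_common()


# Made-up sample text; a real list would be built from a large corpus.
sample = "我想学中文,你也想学中文吗?"
for char, count in character_frequencies(sample)[:5]:
    print(char, count)
```

No segmentation step is needed: every character is simply counted on its own, and punctuation is filtered out.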
In this article, I will introduce frequency data for words, characters and components, in that order.
Frequency resources for Chinese words
For spoken language (and perhaps written language too), words are the most interesting unit, since people use spoken words, not written characters, to communicate. Coming from a language like English, this might seem straightforward enough: just calculate the frequencies of words, right?
Not really. Chinese has no spaces between words, which means that it’s far from easy to figure out where one word stops and the next one begins. 红 (red) is a word, so is 灯 (light), but 红灯 is a word too (red light, as in “stopping at a red light”). But what about 黄灯 (yellow light)? Or 紫灯 (purple light)? As you can probably see, this is not easy.
There are three ways of dealing with this problem:
- The easiest way for students and teachers is to take an extensive dictionary and simply say that everything listed in there is a word, and everything that is not listed isn’t a word. In that case, 红灯 and 黄灯 are words, but 紫灯 is not (I used 现代汉语词典 to check this). There might still be ambiguities regarding which character belongs to which word, though. I don’t know of any frequency lists that use this method.
- Another way is to not care about what actually is a word as defined by linguists, but instead focus on characters that appear together often (called a bigram if it’s a pair of characters) and simply say that that’s a word. The problem is that then things like negated verbs rank very high, such as 不能, 不会 and so on, along with numbers in combination with measure words, like 一个.
- The most demanding way is to make sure the texts being used have been segmented into words in some way by a human. Or, more often, a combination of humans and clever tools that help a lot, but which don’t really get everything right. Still, there are often “words” like 不能 and 一个, probably because someone decided that they actually are words for the purpose of the corpus.
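The dictionary approach from the list above can be sketched as greedy longest-match segmentation. This is only an illustration: `toy_dictionary` is a tiny made-up word list standing in for a real dictionary, and `max_len` is an assumed upper bound on word length.

```python
def segment(text: str, dictionary: set[str], max_len: int = 4) -> list[str]:
    """Greedy forward longest-match segmentation against a word list."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink; a single
        # character is always accepted as a fallback.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words


# Toy word list (an assumption for illustration, not a real dictionary).
toy_dictionary = {"红灯", "黄灯", "红", "黄", "紫", "灯"}
print(segment("红灯黄灯紫灯", toy_dictionary))
```

Since 紫灯 is not in the word list, it falls apart into 紫 and 灯, while 红灯 and 黄灯 are kept whole. Greedy matching also just picks one reading when several are possible, which is where the ambiguities mentioned above come from.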
So, in essence, while the care with which the material was segmented (if at all) ought to make a big difference to the resulting list, in practice it doesn’t matter that much.
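The bigram approach is the easiest one to try yourself. The sketch below simply counts every pair of adjacent Chinese characters in a made-up sample text. It doesn't apply any statistical filtering, such as the mutual information values some tools let you set, so treat it as a rough illustration only.

```python
from collections import Counter


def bigram_frequencies(text: str) -> Counter:
    """Count every pair of adjacent Chinese characters in text."""
    counts = Counter()
    for first, second in zip(text, text[1:]):
        # Skip pairs that include punctuation or other non-Chinese symbols.
        if all("\u4e00" <= ch <= "\u9fff" for ch in (first, second)):
            counts[first + second] += 1
    return counts


# Made-up sample text for illustration.
sample = "我不能去。他也不能去。一个人不能去。"
print(bigram_frequencies(sample).most_common(3))
```

Even in this tiny sample, 不能 ends up at the top, which is exactly the kind of result described in the list above.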
We’re now ready to look at the available resources for word frequency in Chinese. I have tried to put broadly useful resources at the top, but haven’t paid close attention to exact ordering:
- SUBTLEX-CH (movie subtitles) – This list is based on Chinese movie subtitles and is thus as close to natural spoken language as you can get. 100,000 simplified words without definition or Pinyin. You will find some 这个 and 不能, but it should still be very useful. Note that there are two files, one for character frequencies and one for word frequencies. Article describing the underlying research project here.
- K-5 Word Frequency Dictionary for Chinese L2 Learners – This list is somewhat unique in that it draws on materials for people who learn Chinese as a second language, i.e. textbooks, graded readers and so on (read more about the methodology here). The site presents a search interface, but you can also download excel spreadsheets. Simplified Chinese with Pinyin. This list comes very close to what I would consider most useful for second language learners.
- BLCU balanced corpus frequency lists – These lists are based on a ridiculous 15 billion (simplified) character corpus, composed of news, literature, blogs and much more. It is probably the biggest, most comprehensive dataset available. You can access the corpus online here and read more about the project here (in Chinese). The ZIP-file linked to above contains text files for each part of the corpus, as well as a global file. If you can’t view the text files, a user over at Pleco’s forum posted UTF-8 encoded versions that work well for me. The lists contain some oddities, such as 第 coming out on top and some kana from Japanese showing up; please see the discussion over at Pleco for cleaned-up versions if you want to remove these.
- University of Leeds: Internet Word Frequencies – This frequency list is based on the Leeds corpus of internet Chinese (90 million tokens from 2005). Simplified characters with no frills. You can search the corpus directly online, which is handy.
- 6000 Chinese Words: A Vocabulary Frequency Handbook, by James Erwin Dew – This is one of the few books I recommend. It lists words with traditional characters. Since it’s a printed book, you can’t do much analysis of your own, but it does have the data both in alphabetical order (good for looking things up) and in frequency order. It also offers lists of prolific characters, meaning characters that are used in common words.
- Mandarin Chinese Word Frequency Dictionary – This list presents word frequencies based on the Academia Sinica Balanced Corpus of Modern Chinese, “balanced” meaning that it draws from a number of different kinds of texts. This list uses traditional Chinese. The data here is very useful, but it’s not very accessible because it’s in PDF format and not sorted by frequency. I haven’t been able to find this data in a more accessible format!
- Leiden Weibo Corpus (and related frequency data) – This list is based on Weibo messages, which makes it interesting for informal writing (and indirectly speaking). It also has a rather unique feature in that messages are coded for which city/province they were posted from. However, to make the most of this resource, you need some database and/or text manipulation skills. The word list is pretty straightforward, but not sorted by frequency. To get the geographical breakdown, you need to combine the word IDs in one file with the actual words in another, so not for the idly curious. Simplified characters, of course.
- Jun Da: Bigram frequency list of the general fiction sub-corpus – As the name indicates, this is an analysis of pairs of characters occurring together in the general fiction part of the corpus used by Jun Da. See also Bigram frequency list of the news sub-corpus. Both are somewhat configurable (you can set frequency and mutual information values) and spit out simplified characters (varying numbers depending on what values you set, of course).
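Some of the lists above are not sorted by frequency, or keep the words in one file and the counts in another, linked by IDs. Here is a minimal sketch of how to combine and sort such data in Python. The two-column, tab-separated layout and the sample rows are assumptions I made up for illustration; check the actual files before adapting this.

```python
import csv
import io


def load_pairs(handle) -> dict[str, str]:
    """Read two-column, tab-separated rows into a dict keyed on the first column."""
    return {row[0]: row[1] for row in csv.reader(handle, delimiter="\t") if len(row) >= 2}


def words_by_frequency(words_file, counts_file) -> list[tuple[str, int]]:
    """Join word IDs to counts and return (word, count) pairs, most frequent first."""
    id_to_word = load_pairs(words_file)
    id_to_count = load_pairs(counts_file)
    joined = [
        (id_to_word[word_id], int(count))
        for word_id, count in id_to_count.items()
        if word_id in id_to_word
    ]
    return sorted(joined, key=lambda pair: pair[1], reverse=True)


# Made-up stand-ins for the two files; real files would be opened
# with open(path, encoding="utf-8") instead.
words_file = io.StringIO("1\t我们\n2\t今天\n3\t北京\n")
counts_file = io.StringIO("1\t120\n2\t45\n3\t300\n")
print(words_by_frequency(words_file, counts_file))
```

The same `sorted` call also works on its own for lists that already have words and counts in the same file, but in the wrong order.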
Phew, that’s a lot! Have I missed anything, including easy-to-access derivatives of the above? Leave a comment!
Frequency resources for Chinese characters
As I mentioned above, character frequencies are very easy to obtain (no segmentation needed) and there are numerous lists available online. The only things you might want to care about are that they are based on a large enough corpus, that the sample roughly matches what you’re after and that it isn’t very specialised in some area you’re not interested in. I have tried to order them with the most useful at the top:
- Jun Da Chinese text computing: Character frequency list of Modern Chinese – This list from 2005 is based on written Chinese (both fiction and non-fiction). It contains 10,000 simplified characters, with Pinyin and definition. The same data is also available as two separate lists, with one for fiction and one for non-fiction.
- SUBTLEX-CH (movie subtitles) – This list is based on Chinese movie subtitles and is thus as close to natural spoken language as you can get. 6,000 simplified characters without definition or Pinyin. Note that there are two files, one for character frequencies and one for word frequencies. Article describing the underlying research project here.
- Patrick Zein: The most common Chinese characters in order of frequency – This list is based on Jun Da’s research, but contains further explanations, as well as definitions for variant pronunciations. 3,000 simplified characters, but with notes about traditional usage. Uses GB2312 encoding, which might cause problems for some. PDF version available here.
- Hanzicraft: Chinese Character Frequency List – Around 6,000 simplified characters without definition or Pinyin, but with clickable links to more information about each character. Not specified what data this list is based on.
- Chinese Character Frequencies (Chinese Wikipedia) – This is the list used in the introduction to illustrate the results of using different datasets. This one is based on Wikipedia. 10,000 traditional characters without definition or Pinyin.
- Jun Da: Character frequency list of Classical Chinese – Same as the first resource listed, but for Classical Chinese (i.e. not based on texts in Modern Chinese).
- 6000 Chinese Words: A Vocabulary Frequency Handbook, by James Erwin Dew – This was mentioned already for its word frequency lists, but I include it here as well for the list of prolific single characters.
- Far East 3000 Chinese Characters Dictionary – This book lists 3000 common traditional characters in alphabetical order (which makes it much less useful). It does contain example words and has a layout that is quite inviting. I wrote about using this dictionary in an earlier article.
- 國語辭典簡編本編輯資料字詞頻統計報告 – This lists both the most frequent characters and words, based on a balanced mix of sources in traditional Chinese, totalling around two million characters. This is the best resource based on traditional samples I’ve been able to find. Note that the executable file needs to be downloaded and run, which produces a text file that I was able to view properly in Chrome.
Have I missed any useful frequency list? Leave a comment below!
Frequency resources for character components
When looking at character components, the problem is similar to that discussed above regarding words. There is no unified definition of what a component is and there’s often more than one way to break down a character.
For example, the character 行 was originally a pictograph showing a road intersection. So, in a sense, it can’t be broken down and should be treated as a component itself. However, if you just look at the modern form, you can visually break the character down into 彳, 一 and 丁. Or take 想 as another example. Do you break it down into 心 and 相 and be done with it? Or do you then break 相 down into 木 and 目 and count each?
In fact, this kind of data is even trickier to deal with than words, because with words, there is at least some consensus on what ought to count as one; it’s just not easy to apply computationally to a large dataset. When it comes to components, it’s even hard to find data!
As a shortcut, people sometimes use radicals instead of components, but there are many more components than there are radicals, so that doesn’t give the whole picture. Other times, people break things down as far as the result can be written with Unicode, damn the consequences. This leads to frequency lists where single strokes are the most common “components”.
Here’s my list created using the radical approach described above:
This list at HanziCraft lists productive character components in order of frequency. It was created by checking how often a component appears in breakdowns, weighted by the frequency of the characters it appears in. It’s a pretty good effort if you’re after visual breakdowns.
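To make the idea of weighting concrete, here is a rough Python sketch of one way to score components: add up the frequency of every character a component appears in. This is my own illustration, not HanziCraft's actual code, and both the breakdowns and the frequency counts are invented toy data. Counting a component only once per character is also just one possible choice.

```python
from collections import Counter


def component_frequencies(
    breakdowns: dict[str, list[str]], char_freq: dict[str, int]
) -> list[tuple[str, int]]:
    """Score each component by the summed frequency of the characters containing it."""
    scores = Counter()
    for char, components in breakdowns.items():
        weight = char_freq.get(char, 0)
        for component in set(components):  # count a component once per character
            scores[component] += weight
    return scores.most_common()


# Toy data: both the breakdowns and the counts are invented for illustration.
breakdowns = {
    "想": ["心", "相"],
    "相": ["木", "目"],
    "林": ["木", "木"],
    "思": ["田", "心"],
}
char_freq = {"想": 500, "相": 300, "林": 100, "思": 200}
print(component_frequencies(breakdowns, char_freq))
```

With weighting, a component that appears in a few very common characters can outrank one that appears in many rare characters, which is usually what you want as a learner.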
We also have this list over at HanziCraft, which lists phonetic components. There’s no frequency information here, but they are commonly used, so I guess it’s better than nothing. This list also includes information about how they influence pronunciation, which is nice.
If you know of more extensive and/or scientific lists for character components, please let me know.
Here are some more frequency-related resources that don’t really fit in any specific category above, but that I thought someone might find interesting:
- This list contains syllable frequency, listing each syllable, with Pinyin and Zhuyin, as well as a sample character. It’s not sorted in order of frequency, but the frequency data is there. Also, please note that the text file is encoded in BIG5, so make sure you select that when trying to open it.
- Here is a frequency list that also lists the number of strokes for each character. Not sure why this would be useful, but there are probably some applications. The same data is also available in another form, sorted by number of strokes (fewest first). Both are traditional Chinese.
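Several of the files mentioned in this article use older encodings such as BIG5 or GB2312, which can show up as gibberish if opened as UTF-8. If your text editor can't handle them, a few lines of Python can re-save a file as UTF-8. This is a minimal sketch; the file names are hypothetical, and the demonstration writes its own small BIG5 file to a temporary folder rather than using any of the real lists.

```python
import tempfile
from pathlib import Path


def convert_to_utf8(source: Path, target: Path, source_encoding: str) -> None:
    """Re-save a text file as UTF-8, decoding it from the given legacy encoding."""
    # errors="replace" keeps going if a byte sequence can't be decoded.
    text = source.read_text(encoding=source_encoding, errors="replace")
    target.write_text(text, encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    legacy = Path(tmp) / "syllables_big5.txt"  # hypothetical file name
    legacy.write_bytes("中文".encode("big5"))
    converted = Path(tmp) / "syllables_utf8.txt"
    convert_to_utf8(legacy, converted, "big5")
    print(converted.read_text(encoding="utf-8"))
```

For a GB2312 file, pass `"gb2312"` as the encoding instead of `"big5"`.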
That’s probably more frequency resources for learning and teaching Chinese than you ever wanted to see! Still, I suspect there are important resources I have overlooked. If you know of one that really should be included here, please let me know and I will update the article!