Friday, August 1, 2008

Corpus Linguistics and Chinese

Word frequency has been on the tips of tongues and typed hundreds of times by those in linguistics, especially the applied linguists. Unfortunately, they're usually talking about English in the context of second language acquisition.

So what about Chinese? More than a billion people speak the language (and more than 23 million use the traditional, full form (thankfully!))...there must be a corpus in use to calculate the high-frequency words so learners know which words are important. And book publishers would know about these lists, so their books would teach those top words and learners wouldn't waste their time on archaic words no one says anymore, right? Right?

W R O N G .

Integrated Chinese still uses words like {Na-lee}, which is an old China-Chinese form of "Oh, you're too much! Stop embarrassing me!" Practical Audio-Visual Chinese teaches {ku1}, which means "to cry," in Unit 24 (of 26 in Book 1).

Sinosplice links to a top 1000 list, which looks great at first, since the first entries are truly common words. But look a little further down, around the late 900s and early 1000s. Yes, that's right: 魚 {u3} appears at 971, 爹 {dai1} is at 965 whereas 爸 is at 991, 汽 {cheee1} is at 1117, and the list goes on. I question this list, and wonder where Patrick, the creator, got his stats from.

Sinosplice links to another list, this one created by Jun Da and used by yellowbridge.com in their pay-for-its-convenience service. The left-hand menu bar mentions info, and the site is a university (edu) site...but wait! 爸 is 1698? What kind of data are these sites using? I'm guessing very few spoken instances and mostly classical written texts. But before I really start criticizing this, I should read the introductory letter...but I can't just now, since the link opens a PDF that only shows the even pages...but you're more than welcome to read it in the meantime and maybe let me know what the odd pages say.
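
For the curious, here is a rough Python sketch of why the source texts matter so much for a list like this. It is not how Patrick or Jun Da built their lists (I don't know their methods); the function name `char_ranks` and the two tiny sample strings are made up purely for illustration. It just counts characters and ranks them, once for a chatty sample and once for a formal written one.

```python
from collections import Counter

def char_ranks(text):
    """Return {character: rank}, where rank 1 is the most frequent character."""
    # Keep only characters in the basic CJK Unified Ideographs block,
    # which drops punctuation, spaces, and Latin letters.
    chars = [ch for ch in text if "\u4e00" <= ch <= "\u9fff"]
    counts = Counter(chars)
    # Sort by descending count, then by character so ties come out the same every run.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {ch: rank for rank, (ch, _) in enumerate(ordered, start=1)}

# Hypothetical toy samples, NOT real corpus data.
spoken_sample = "爸爸，你好嗎？爸爸今天吃魚。"
written_sample = "魚者，水中之物也。汽車行於道，魚游於水。"

spoken = char_ranks(spoken_sample)
written = char_ranks(written_sample)

# The same character can rank very differently, or not appear at all,
# depending on which texts were counted.
print("爸 in spoken sample:", spoken.get("爸"))
print("爸 in written sample:", written.get("爸"))
```

With a real corpus the idea is the same, only bigger: if the texts are mostly formal or classical writing, everyday spoken words like 爸 will sink down the list.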
