Chinese Text Project

2012-09-24 07:32:19 Word clouds
Posted by: admin (CTP Admin)
Word clouds have been added for all texts on the site. These show a graphical representation of unexpectedly frequent words and characters in a particular text, with more frequent and/or unusual words in larger fonts. To display the word cloud for a particular text, click on the icon shown on its contents page under "Media".

2012-09-25 12:35:05 Word clouds
Posted by: bao_pu (Scott Barnwell)
The description says "word clouds on this site highlight unusually frequent words in a text, rather than simply frequent words." When looking at the Daodejing, the largest word was Tianxia 天下, which surprised me, and after looking at the Xunzi, Mozi, Zhuangzi, Mengzi, Guanzi, Huainanzi, Lüshi Chunqiu and Shenzi, it does not seem to be that unusual of a word. But I assume that grouped with all the others (the classics, histories, etymology, military, etc.) it IS an unusual word. This is a drawback of taking such a large group of texts to draw data from, but I still think the Word Clouds are interesting.

2012-09-25 14:24:40 Word clouds
Posted by: dsturgeon (Donald Sturgeon)
It's true that "天下" is a relatively common word. But at the same time, if you look at the number of times that the word "天下" appears in a text versus the total number of characters in that text, of all the pre-Qin and Han texts on the site, only Shen Bu Hai exceeds the Dao De Jing in terms of frequency of "天下". In the Dao De Jing, we have 61 天下's in 5278 characters, so the frequency of 天下 in the Dao De Jing is about 61/5278 * 100% =~ 1.16%. In pre-Qin and Han texts, we have about 10608 天下's in around 5.5 million characters, so overall the frequency of 天下 in pre-Qin and Han texts is closer to 10608/5520329 * 100% =~ 0.19%. 1.16% is about six times 0.19%, which is why it is statistically significant that the Dao De Jing should mention "天下" so often relative to other texts.

The Mozi certainly talks about "天下" a lot, but statistically speaking, the Mozi has this word appearing 523 times in 80421 characters, which gives us 0.65% - well above average, but by no means as frequent as the same word in the Dao De Jing. The Shuo Yuan is an example of a text in which the word "天下" appears about as often as it does generally in pre-Qin and Han text; in the word cloud for Shuo Yuan, "天下" does appear, but in a fairly small font: ctext.org/media.pl?if=en&id=7 .
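
For anyone who wants to redo the arithmetic, here is a minimal Python sketch of the same calculation. The counts are the ones quoted in this post; the function name `relative_frequency` is just an illustrative choice and has nothing to do with how the site itself computes its word clouds.

```python
def relative_frequency(occurrences: int, total_chars: int) -> float:
    """Occurrences of a term per character of text, as a percentage."""
    return occurrences / total_chars * 100


# (occurrences of 天下, total characters), as quoted in the post above
counts = {
    "Dao De Jing": (61, 5278),
    "Mozi": (523, 80421),
    "All pre-Qin and Han texts": (10608, 5520329),
}

for name, (occurrences, total_chars) in counts.items():
    print(f"{name}: {relative_frequency(occurrences, total_chars):.2f}%")

# How many times more frequent is 天下 in the Dao De Jing than in the corpus overall?
ratio = relative_frequency(61, 5278) / relative_frequency(10608, 5520329)
print(f"Dao De Jing vs. corpus: {ratio:.1f}x")
```

Dividing by the length of the text is what lets a short text like the Dao De Jing be compared with a long one like the Mozi.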

The idea is something like this: if you were to pick, at random, a contiguous selection of pre-Qin and Han Chinese text of the same length as the Dao De Jing, it would be an extremely likely result that the text you chose would contain fewer occurrences of the word "天下" than the Dao De Jing. Typically, there just aren't that many 天下's. In fact, you'd pretty much have to have chosen the Shen Bu Hai or the Dao De Jing itself to get the opposite result, since all other texts contain fewer occurrences of this word than the Dao De Jing. In this sense, the Dao De Jing uses the word "天下" disproportionately often as compared to other texts.
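
To put a rough number on "extremely likely", one option is a back-of-the-envelope model. The sketch below assumes that occurrences of 天下 are scattered independently through the corpus at the overall rate, so that the count in a random selection is approximately Poisson distributed. That is an assumption made purely for illustration: real texts cluster their vocabulary, and this is not the calculation behind the site's word clouds. The helper `poisson_tail` is a hypothetical name.

```python
import math


def poisson_tail(k: int, lam: float, terms: int = 200) -> float:
    """Approximate P(X >= k) for X ~ Poisson(lam) by summing `terms` terms of the upper tail."""
    total = 0.0
    for i in range(k, k + terms):
        # Work in log space so large factorials do not overflow.
        log_pmf = -lam + i * math.log(lam) - math.lgamma(i + 1)
        total += math.exp(log_pmf)
    return total


corpus_rate = 10608 / 5520329        # 天下 per character across pre-Qin and Han texts
sample_length = 5278                 # characters in the Dao De Jing
expected = corpus_rate * sample_length

# Chance that a selection of this length has at least as many 天下's as the Dao De Jing (61)
p_at_least_61 = poisson_tail(61, expected)

print(f"Expected 天下 count in a selection of this length: {expected:.1f}")
print(f"P(count >= 61) under the independence assumption: {p_at_least_61:.3g}")
```

Under this simplified model a selection of the Dao De Jing's length would be expected to contain only about ten 天下's, and reaching 61 by chance is vanishingly unlikely. Because the independence assumption is unrealistic, the exact probability should not be taken literally, only the order of magnitude of the gap.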

This may be surprising, and of course there can be many different explanations as to the reason for it, but it does appear to be a statistical fact that on the face of it seems worthy of attention.

If you want to look at the actual numbers, you can do a normal search on the site and then click on "Statistics" to see numbers of occurrences of a character or word in different texts. If you then click on the link for "Export data", you'll get a table that you should be able to copy and paste into Excel (or import into any other spreadsheet program) to process further. This functionality is still under construction, but might already be useful if you're interested in looking at character and word frequencies.
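
If you'd rather process the exported table in a script than in a spreadsheet, something along these lines might work. It is only a sketch: it assumes the export is tab-separated with quoted fields and uses the column names "Section/title" and "Text length" that are quoted later in this thread. The sample rows are not real export output; they are built from the figures given above purely to show the shape of the data, and `frequencies` is a hypothetical helper name.

```python
import csv
import io

# Stand-in for text pasted from "Export data" (layout assumed, values taken from this thread).
SAMPLE_EXPORT = (
    '"Section/title"\t"天下"\t"Text length"\n'
    '"Dao De Jing"\t"61"\t"5278"\n'
    '"Mozi"\t"523"\t"80421"\n'
)


def frequencies(export_text: str, term: str) -> dict[str, float]:
    """Map each row's title to the term's frequency as a percentage of text length."""
    reader = csv.DictReader(io.StringIO(export_text), delimiter="\t")
    result = {}
    for row in reader:
        length = int(row["Text length"])
        if length == 0:
            continue  # skip empty sections rather than divide by zero
        result[row["Section/title"]] = int(row[term]) / length * 100
    return result


freq = frequencies(SAMPLE_EXPORT, "天下")
for title, percent in sorted(freq.items(), key=lambda item: item[1], reverse=True):
    print(f"{title}: {percent:.2f}%")
```

If the real export turns out to use a different separator, changing the `delimiter` argument should be the only adjustment needed.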

2012-09-26 15:40:53 Word clouds
Posted by: bao_pu (Scott Barnwell)
Thanks Donald,

You've given the Daodejing as containing 5278 characters. Where can I find the total number of characters in each text?

2012-09-26 16:01:14 Word clouds
Posted by: dsturgeon (Donald Sturgeon)
That functionality hasn't been officially released yet either; at the moment the character counts only appear in the data you get when you click on "Export data" as described above, in the column currently labeled "Text length". The counts do not include punctuation or any emendations applied to the texts. Please don't rely on these numbers at present, though, as this feature has not yet been properly tested and some data may be incorrect.

2012-09-26 19:58:33 Word clouds
Posted by: bao_pu (Scott Barnwell)
It seems it is not fully functional yet, as you said. When I search the Daodejing for 天 and click export data, all I see in the right-hand box is:
"Section/title" "天" "Text length"

I get that same result for some other texts as well.

2012-09-27 00:09:12 Word clouds
Posted by: dsturgeon (Donald Sturgeon)
The "stats" function is intended to give a summary of the number of occurrences of a search term in subdivisions of a text (e.g. chapters, volumes, etc.). Since the Daodejing doesn't have any subdivisions, searching it in stats mode doesn't give you any data. Try searching in "Daoism" instead.



