Chinese Text Project

Semantic annotation

Note: this documentation describes a feature currently under development. The information provided here may be incomplete, and substantial changes may be made before the feature is publicly released.

Introduction

Semantic annotation involves adding computer-readable data about the meaning of words and phrases in their given context to a text. This enables further processing, and allows the system to display additional relevant information. For example, in the following passage, the semantically annotated version (left) provides useful contextual information about dates, people, and written works:

With annotation:
1 夏四月乙巳吕夷简上《景佑法宝新录》。甲子吕夷简王曾宋绶蔡齐罢,以王随门下侍郎同中书门下平章事昭文馆大学士陈尧佐同中书门下平章事集贤殿大学士盛度知枢密院事韩亿程琳石中立参知政事王鬷同知枢密院事

Without annotation:
1 夏四月乙巳,吕夷简上《景佑法宝新录》。甲子,吕夷简、王曾、宋绶、蔡齐罢,以王随为门下侍郎、同中书门下平章事、昭文馆大学士,陈尧佐同中书门下平章事、集贤殿大学士,盛度知枢密院事,韩亿、程琳、石中立参知政事,王鬷同知枢密院事。

General principles

Semantic annotation in the Chinese Text Project involves creating three types of closely related data:

  1. Annotations. An annotation locates a short region of text - usually a word or short phrase - and provides information about what that word or phrase means in the particular context in which it occurs. For example, in the sentence "孔子适齐。" we might want to add an annotation for the word "孔子" indicating that in this sentence, "孔子" refers to a particular person: the historical individual Confucius.
    Two types of annotation are supported in ctext:
    • Entity annotations - indicate that the annotated text refers to a particular entity, such as "ctext:855132" (王安石).
    • Date annotations - indicate that the annotated text refers to a particular historical date. The date is specified by recording the era (or ruler) to which the date belongs, such as "ctext:27110" (天禧 era), as well as data about the meaning of the date, such as "year 1, month 2".
  2. Entity records. An entity record represents a unique thing. This may be a concrete object - such as a person, or a physical building - or an abstract or constructed object, like a bureaucratic office. For example, factual and fictional historical people - like Wang Anshi - have entity records; so do works - like the History of Song - and dynasties - like Northern Song. Entity records are used to contain information about entities, and as a reference point for annotations: the annotation of "孔子" in the example above would point to the entity record for Confucius. Entity records help distinguish between different things that sometimes have the same name, and identify the same thing when it is referred to by different names. Every entity record has a unique identifier, e.g. "ctext:27110" (天禧 era). Using these identifiers allows us to precisely distinguish between entities with the same name - such as "ctext:474358" for the 绍兴 era of the Song dynasty, and "ctext:63988" for the 绍兴 era of the Western Liao dynasty. The page for each entity lists its identifier immediately below the title.
  3. Knowledge claims. A knowledge claim represents one piece of information about an entity; entity records are made up of knowledge claims about that entity. A knowledge claim primarily connects three things: a subject (the entity the claim relates to), a verb or relation, and an object or target of the relation. For example, a knowledge claim about Wang Anshi might connect Wang Anshi (subject) and Wang Yi (object), with the verb "father" - thus recording the fact that Wang Anshi's father is Wang Yi. As a second example, we might connect Wang Anshi and the office Hanlin Academic through the relation "held-office", to indicate that Wang Anshi held this particular bureaucratic office.
    Sometimes it is useful to record additional information about a claim. This can be done by adding one or more qualifiers to the claim. A qualifier is an additional part of a claim which connects that claim with two other pieces of information: an additional verb (the qualifier), and an additional object. For example, while it is true to say that Wang Anshi held the office of Hanlin Academic, it is useful to further explain this by indicating that he held the office starting from a particular date - this is done by adding the from-date qualifier to the claim, together with an object representing that particular date.
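
The relationship between annotations, entity records, knowledge claims and qualifiers can be sketched as a small data model. This is only an illustrative sketch, not the storage format ctext actually uses: the class and field names are assumptions, and every identifier beginning with "example:" is a hypothetical placeholder, since only the identifier for 王安石 is given above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Qualifier:
    verb: str  # the additional verb, e.g. "from-date"
    obj: str   # the additional object, e.g. a date


@dataclass
class Claim:
    subject: str  # identifier of the entity the claim is about
    verb: str     # relation, e.g. "father" or "held-office"
    obj: str      # target of the relation
    qualifiers: List[Qualifier] = field(default_factory=list)


@dataclass
class EntityRecord:
    identifier: str
    name: str
    claims: List[Claim] = field(default_factory=list)


@dataclass
class EntityAnnotation:
    start: int   # offset of the first annotated character
    end: int     # offset just past the last annotated character
    entity: str  # identifier of the entity record referred to


WANG_ANSHI = "ctext:855132"         # identifier taken from the text above
WANG_YI = "example:wang-yi"         # hypothetical placeholder
HANLIN = "example:hanlin-academic"  # hypothetical placeholder
CONFUCIUS = "example:confucius"     # hypothetical placeholder

record = EntityRecord(WANG_ANSHI, "王安石")
record.claims.append(Claim(WANG_ANSHI, "father", WANG_YI))
record.claims.append(
    Claim(WANG_ANSHI, "held-office", HANLIN,
          [Qualifier("from-date", "example:some-date")])  # hypothetical date
)

# An annotation marks the region "孔子" in the sentence and points at an entity.
sentence = "孔子适齐。"
annotation = EntityAnnotation(start=0, end=2, entity=CONFUCIUS)


def objects_of(entity: EntityRecord, verb: str) -> List[str]:
    """Return the objects of every claim on the record that uses this verb."""
    return [claim.obj for claim in entity.claims if claim.verb == verb]
```

Here `objects_of(record, "father")` returns the placeholder identifier for Wang Yi, and `sentence[annotation.start:annotation.end]` recovers the annotated text "孔子".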

Citations

Citations are required for most types of claim. A citation is a specific textual reference in ctext citation format. A citation is composed of two parts: a URN identifying a particular chapter of one edition of a text, and the literal content of the text being cited (in Traditional Chinese); these two parts are combined using the symbol "@". For example:

The citation should be chosen to be a complete sentence or meaningful sentence fragment that justifies the claim. Context does not need to be cited, because the text will be linked directly to its source.
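
As a rough illustration of the two-part citation format, the following sketch joins and splits a citation at the "@" symbol. The function names and the URN shown are hypothetical placeholders, and the sketch assumes that the URN itself never contains "@", which this documentation does not state.

```python
from typing import NamedTuple


class Citation(NamedTuple):
    urn: str   # identifies a chapter of one edition of a text
    text: str  # literal cited content, in Traditional Chinese


def format_citation(urn: str, text: str) -> str:
    """Join a URN and the cited text with "@"."""
    if not urn or not text:
        raise ValueError("both a URN and the cited text are required")
    if "@" in urn:
        # Assumption: URNs never contain "@", so the first "@" is the separator.
        raise ValueError("URN must not contain '@'")
    return f"{urn}@{text}"


def parse_citation(citation: str) -> Citation:
    """Split a citation at the first "@" into its URN and cited text."""
    urn, separator, text = citation.partition("@")
    if not separator or not urn or not text:
        raise ValueError("expected a citation of the form URN@text")
    return Citation(urn, text)


# Hypothetical URN, used only to show the shape of a citation.
EXAMPLE_URN = "urn:example:some-chapter"
example = format_citation(EXAMPLE_URN, "孔子適齊。")
```

Parsing `example` with `parse_citation` gives back the same URN and cited text.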

Most claims require evidence, with the following exceptions:

Annotation conventions

In order to promote consistency in the data and facilitate effective automated processing, please observe the following conventions when marking up texts:

Dates

Dates are important pieces of historical data that need to be annotated carefully. A date annotation connects a date in a text (e.g. "二月") with enough additional data to make the date unambiguous - for example, the information that the date refers to a particular year and month within some specific era. The annotation client provides a mechanism to input this information, by connecting each date annotation to an era. In many cases, dates in a text do not directly contain all of this information, as it is provided contextually - as in the following passage:

1 开宝九年冬十月癸丑太祖崩,帝遂即皇帝位。乙卯,大赦,常赦所不原者咸除之。

The first of the two dates in the above passage is "complete": it directly contains enough information, taken together with the era, to unambiguously point to a particular date - specifically, the information year 9, month 10, day 癸丑. The second date ("乙卯") does not directly contain this information because the information is implied by the context. Date annotation involves explicitly recording these separate values, so that digital systems can correctly process the date.
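
The idea of recording separate values for each date, and filling in those left implicit from the preceding date, can be sketched as follows. This is a simplified illustration and not the annotation client's actual logic: the class, its field names, and the rule that a date inherits only the fields above the first one it states explicitly are all assumptions.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class DateAnnotation:
    text: str                    # the date as written in the text
    era: Optional[str] = None    # era name; a real record would use an identifier
    year: Optional[int] = None
    month: Optional[int] = None
    day: Optional[str] = None    # sexagenary day, e.g. "癸丑"


def fill_from_context(date: DateAnnotation, context: DateAnnotation) -> DateAnnotation:
    """Inherit era, year and month from context until the date states one itself."""
    values = {}
    inheriting = True
    for name in ("era", "year", "month"):
        own = getattr(date, name)
        if own is not None:
            inheriting = False  # fields below an explicit one are not inherited
        values[name] = getattr(context, name) if inheriting else own
    return replace(date, **values)


def resolve(dates: List[DateAnnotation]) -> List[DateAnnotation]:
    """Resolve each date against the date that precedes it in the text."""
    resolved = []
    context = None
    for date in dates:
        if context is not None:
            date = fill_from_context(date, context)
        resolved.append(date)
        context = date
    return resolved


# The two dates from the passage above.
complete = DateAnnotation("开宝九年冬十月癸丑", era="开宝", year=9, month=10, day="癸丑")
partial = DateAnnotation("乙卯", day="乙卯")
resolved = resolve([complete, partial])
```

After resolution, the second date carries the era 开宝, year 9 and month 10 from the first date, together with its own day 乙卯.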

The annotation client will attempt to suggest appropriate values; however, these will sometimes be incorrect. It is important to pay attention to the contextual flow of information when annotating dates, especially where parenthetical references to other years and eras do not affect the interpretation of dates later in the text. For example, in the following passage, purple arrows indicate the correct contextual flow of date information:

The annotation client will help by suggesting the correct values automatically for most cases - e.g. suggesting that "乙卯" refers to year 9 month 10 of the 开宝 era - but in this example will incorrectly propose that "十一月癸亥" refers to the 11th month of year 8 of 开宝, due to year 8 having been referenced immediately prior. In cases like these it is important to pay attention to the date flow: if "十一月癸亥" is marked as referring to year 8, then the annotation client will infer that 甲子 and 庚午 should also be marked as year 8, whereas in this passage they actually refer to year 9. Mistakes of this kind easily cascade to affect many dates in historical texts because much of the date information is implied contextually.
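
The cascade described above can be illustrated with a small sketch that compares two ways of carrying the year forward. It is a hypothetical model rather than the annotation client's algorithm: the sequence is assembled from the dates named in this section, the text of the parenthetical reference is a placeholder, and the `parenthetical` flag is an assumption introduced for illustration.

```python
from typing import List, Optional, Tuple

# (text of the date, explicitly stated year or None, is it a parenthetical reference?)
Mention = Tuple[str, Optional[int], bool]


def assign_years(mentions: List[Mention], skip_parenthetical: bool) -> List[Tuple[str, Optional[int]]]:
    """Give each date a year, carrying the year forward from earlier dates.

    When skip_parenthetical is True, a parenthetical reference keeps its own
    year but does not change the context used for the dates that follow it.
    """
    context_year = None
    assigned = []
    for text, own_year, parenthetical in mentions:
        year = own_year if own_year is not None else context_year
        assigned.append((text, year))
        if not (parenthetical and skip_parenthetical):
            context_year = year
    return assigned


mentions: List[Mention] = [
    ("开宝九年冬十月癸丑", 9, False),
    ("乙卯", None, False),
    ("(parenthetical reference to year 8)", 8, True),  # placeholder text
    ("十一月癸亥", None, False),
    ("甲子", None, False),
    ("庚午", None, False),
]

naive = assign_years(mentions, skip_parenthetical=False)
careful = assign_years(mentions, skip_parenthetical=True)
```

In `naive`, the year 8 from the parenthetical reference propagates to 十一月癸亥, 甲子 and 庚午; in `careful`, all three are assigned year 9, which is the reading described above.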

Texts and editions

Only one edition of each text should be annotated. This should normally be the representative edition.

Some annotations have been added to the following texts; please use the editions linked below when adding or correcting annotations:

Standard Histories

  1. 史记
  2. 汉书
  3. 后汉书
  4. 三国志
  5. 晋书
  6. 宋书
  7. 南齐书
  8. 梁书
  9. 陈书
  10. 魏书
  11. 北齐书
  12. 周书
  13. 南史
  14. 北史
  15. 隋书
  16. 旧唐书
  17. 新唐书
  18. 旧五代史
  19. 新五代史
  20. 宋史
  21. 辽史
  22. 金史
  23. 元史
  24. 明史
  25. 清史稿

Other historical works

Bibliographic works and catalogs

The above are only partial lists; other texts can also be annotated, provided that: