Chinese Text Project
Semantic annotation
Introduction
Semantic annotation involves adding computer-readable data about the meaning of words and phrases in their given context to a text. This enables further processing, and allows the system to display additional relevant information. For example, in the following passage, the semantically annotated version (left) provides useful contextual information about dates, people, and written works:
[Example passage shown side by side: with annotation (left) and without annotation (right).]
General principles
Semantic annotation in the Chinese Text Project involves creating three types of closely related data:
- Annotations. An annotation locates a short region of text - usually a word or short phrase - and provides information about what that word or phrase means in the particular context in which it occurs. For example, in the sentence "孔子適齊。" we might want to add an annotation for the word "孔子" indicating that in this sentence, "孔子" refers to a particular person: the historical individual Confucius.
Two types of annotation are supported in ctext:
- Entity annotations - indicate that the annotated text refers to a particular entity, such as "ctext:855132" (王安石).
- Date annotations - indicate that the annotated text refers to a particular historical date. The date is specified by recording the era (or ruler) to which the date belongs, such as "ctext:27110" (天禧 era), as well as data about the meaning of the date, such as "year 1, month 2".
- Entity records. An entity record represents a unique thing. This may be a concrete object - such as a person, or a physical building - or an abstract or constructed object, like a bureaucratic office. For example, factual and fictional historical people - like Wang Anshi - have entity records; so do works - like the History of Song - and dynasties - like Northern Song. Entity records are used to contain information about entities, and as a reference point for annotations: the annotation of "孔子" in the example above would point to the entity record for Confucius. Entity records help distinguish between different things that sometimes have the same name, and identify the same thing when it may be referred to by different names. Every entity record has a unique identifier, e.g. "ctext:27110" (天禧 era). Using these identifiers allows us to precisely distinguish between entities with the same name - such as "ctext:474358" for the 紹興 era of the Song dynasty, and "ctext:63988" for the 紹興 era of the Western Liao dynasty. The page for each entity lists its identifier immediately below the title.
- Knowledge claims. A knowledge claim represents one piece of information about an entity; entity records are made up of knowledge claims about that entity. A knowledge claim primarily connects three things: a subject (the entity the claim relates to), a verb or relation, and an object or target of the relation. For example, a knowledge claim about Wang Anshi might connect Wang Anshi (subject) and Wang Yi (object), with the verb "father" - thus recording the fact that Wang Anshi's father is Wang Yi. As a second example, we might connect Wang Anshi and the office Hanlin Academic through the relation "held-office", to indicate that Wang Anshi held this particular bureaucratic office.
Sometimes it is useful to record additional information about a claim. This can be done by adding one or more qualifiers to the claim. A qualifier is an additional part of a claim which connects that claim with two other pieces of information: an additional verb (the qualifier), and an additional object. For example, while it is true to say that Wang Anshi held the office of Hanlin Academic, it is useful to further explain this by indicating that he held the office starting from a particular date - this is done by adding the from-date qualifier to the claim, together with an object representing that particular date.
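As a rough illustration of how annotations, entity records, claims and qualifiers relate to one another, the sketch below models them as plain Python dataclasses. The class and field names are hypothetical and do not reflect the actual ctext schema; only the identifiers "ctext:855132" and "ctext:27110" and the verbs father, held-office and from-date are taken from the text above. Objects whose identifiers are not given here are left as angle-bracket placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Qualifier:
    verb: str   # the additional verb, e.g. "from-date"
    obj: str    # the additional object, e.g. a date


@dataclass
class Claim:
    subject: str                 # entity the claim is about
    verb: str                    # relation, e.g. "father"
    obj: str                     # target of the relation
    qualifiers: List[Qualifier] = field(default_factory=list)


@dataclass
class EntityRecord:
    identifier: str              # unique, e.g. "ctext:855132"
    claims: List[Claim] = field(default_factory=list)

    def objects_of(self, verb: str) -> List[str]:
        """Return the objects of all claims using the given verb."""
        return [c.obj for c in self.claims if c.verb == verb]


@dataclass(frozen=True)
class EntityAnnotation:
    text: str     # annotated region of the source text
    entity: str   # identifier of the entity it refers to


@dataclass(frozen=True)
class DateAnnotation:
    text: str                    # annotated region of the source text
    era: str                     # identifier of the era (or ruler)
    year: Optional[int] = None
    month: Optional[int] = None


# Wang Anshi's record with the two example claims from the text.
wang_anshi = EntityRecord("ctext:855132")
wang_anshi.claims.append(
    Claim("ctext:855132", "father", "<entity id of Wang Yi>")
)
office = Claim("ctext:855132", "held-office", "<entity id of Hanlin Academic>")
office.qualifiers.append(Qualifier("from-date", "<date he took office>"))
wang_anshi.claims.append(office)

# An entity annotation and a date annotation ("year 1, month 2" of 天禧).
person = EntityAnnotation(text="王安石", entity="ctext:855132")
date = DateAnnotation(text="元年二月", era="ctext:27110", year=1, month=2)
```

Keeping qualifiers attached to the claim, rather than to the entity, mirrors the description above: the from-date information says something about the office-holding claim, not about Wang Anshi in general.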
Citations
Citations are required for most types of claim. A citation is a specific textual reference in ctext citation format. A citation is composed of two parts: a URN identifying a particular chapter of one edition of a text, and the literal content of the text being cited (in Traditional Chinese); these two parts are combined using the symbol "@". For example:
- ctp:ws740739@父益,都官員外郎。 - this citation identifies the text "父益,都官員外郎。" occurring in the 列傳第八十六 chapter of the 摛藻堂四庫全書薈要 edition of the 宋史.
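The two-part structure of a citation can be handled with a few lines of code. The following is a minimal sketch, not part of any ctext tooling: the function names are hypothetical, and it assumes that the URN itself never contains "@", so the string is split at the first "@".

```python
from typing import NamedTuple


class Citation(NamedTuple):
    urn: str    # identifies a chapter of one edition of a text
    text: str   # literal cited text (Traditional Chinese)


def parse_citation(citation: str) -> Citation:
    """Split 'URN@text' into its two parts at the first '@'."""
    urn, sep, text = citation.partition("@")
    if not sep or not urn or not text:
        raise ValueError(f"not in URN@text form: {citation!r}")
    return Citation(urn, text)


def format_citation(urn: str, text: str) -> str:
    """Combine a URN and the cited text using '@'."""
    return f"{urn}@{text}"


example = parse_citation("ctp:ws740739@父益,都官員外郎。")
print(example.urn)   # ctp:ws740739
print(example.text)  # 父益,都官員外郎。
```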
Most claims require evidence, with the following exceptions:
- Claims where the verb is: type, authority-..., or link-... - citations should never be added in these cases
- Claims where the verb is: name - citations should only be added where they explicitly record that two names are used for the same entity
- Claims where the verb is: born or died - these are temporary placeholders containing dates/years as recorded in other projects like Wikidata or Wikipedia. If evidence is available, these claims should be replaced with claims of type born-date and died-date respectively, together with machine-readable dates and evidence.
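The exceptions above amount to a small lookup on the claim's verb. The sketch below is one hypothetical way to express them when checking data; the function name and the returned labels are invented for illustration, and treating every other verb as "required" is a simplifying assumption, since the text only says that most claims require citations.

```python
def citation_policy(verb: str) -> str:
    """Classify a claim verb by how citations should be handled."""
    # type, authority-... and link-... claims never take citations.
    if verb == "type" or verb.startswith(("authority-", "link-")):
        return "never"
    # name claims take a citation only when the cited text explicitly
    # records that two names are used for the same entity.
    if verb == "name":
        return "only-if-names-equated"
    # born / died are temporary placeholders imported from other projects.
    if verb in ("born", "died"):
        return "placeholder"
    # Assumption: all remaining verbs are treated as needing evidence.
    return "required"


# "link-example" is a made-up verb used only to exercise the prefix rule.
for verb in ("type", "link-example", "name", "born", "father", "held-office"):
    print(verb, "->", citation_policy(verb))
```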
Annotation conventions
In order to promote consistency in the data and facilitate effective automated processing, please observe the following conventions when marking up texts:
- Dates - Whenever an era appears as part of a date reference, the era should be marked up as one entity, and the remainder of the date (year/month/day) as a separate reference. The annotation client is designed to make this easy to do.
[Table: correct and incorrect markup of the example dates 開慶元年二月 and 開慶元年,二月.]
- Offices and titles - Whenever an office or title exists in various forms - commonly as distinct offices of the same type attached to different places - the combined place-and-office should be treated as an office title.
[Table: correct and incorrect markup of the example title 利州節度使.]
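The date convention above (era as one entity, remainder of the date as a separate reference) can be sketched as a simple prefix split. This is only an illustration under stated assumptions: the function name is hypothetical, the small set of era names is taken from examples on this page rather than from ctext data, and the annotation client's own logic is not documented here.

```python
from typing import Iterable, Optional, Tuple


def split_era(date_text: str, known_eras: Iterable[str]) -> Tuple[Optional[str], str]:
    """Split a date reference into (era, remainder).

    Returns (None, date_text) when the text does not begin with a known era.
    Longer era names are tried first so that a longer name is never cut short
    by a shorter one sharing its first characters.
    """
    for era in sorted(known_eras, key=len, reverse=True):
        if date_text.startswith(era):
            return era, date_text[len(era):]
    return None, date_text


# Hypothetical mini-list of era names, for illustration only.
ERAS = {"開慶", "開寶", "天禧", "紹興"}

era, remainder = split_era("開慶元年二月", ERAS)
print(era)        # 開慶  (annotate as the era entity)
print(remainder)  # 元年二月  (annotate as a separate date reference)
```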
Dates
Dates are important pieces of historical data that need to be annotated carefully. A date annotation connects a date in a text (e.g. "二月") with enough additional data to make the date unambiguous - for example, the information that the date refers to a particular year and month within some specific era. The annotation client provides a mechanism to input this information, by connecting each date annotation to an era. In many cases, dates in a text do not directly contain all of this information, as it is provided contextually - as in the following passage:
1 開寶九年冬十月癸丑,太祖崩,帝遂即皇帝位。乙卯,大赦,常赦所不原者咸除之。
The first of the two dates in the above passage is "complete": it directly contains enough information, taken together with the era, to unambiguously point to a particular date - specifically, the information year 9, month 10, day 癸丑. The second date ("乙卯") does not directly contain this information because the information is implied by the context. Date annotation involves explicitly recording these separate values, so that digital systems can correctly process the date.
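The difference between a complete date and a contextually implied one can be illustrated with a short sketch. The code below is not the annotation client's algorithm: the class, the function names and the simple inheritance rule are assumptions made for illustration, and the era is written by name because its identifier is not given here. It fills in the values that "乙卯" inherits from the preceding date, and uses the standard sixty-day stem-and-branch cycle to show that 乙卯 falls two days after 癸丑, which fits a reading in the same month.

```python
from dataclasses import dataclass, replace
from typing import List, Optional

STEMS = "甲乙丙丁戊己庚辛壬癸"
BRANCHES = "子丑寅卯辰巳午未申酉戌亥"


@dataclass(frozen=True)
class DateRef:
    text: str
    era: Optional[str] = None
    year: Optional[int] = None
    month: Optional[int] = None
    day: Optional[str] = None


def fill_from_context(refs: List[DateRef]) -> List[DateRef]:
    """Fill missing era/year/month from the most recent earlier values."""
    era = year = month = None
    filled = []
    for ref in refs:
        if ref.era is not None:          # a new era resets year and month
            era, year, month = ref.era, None, None
        if ref.year is not None:         # a new year resets the month
            year, month = ref.year, None
        if ref.month is not None:
            month = ref.month
        filled.append(replace(ref, era=era, year=year, month=month))
    return filled


def cycle_index(day: str) -> int:
    """Position (0-59) of a stem-branch day name in the sixty-day cycle."""
    stem, branch = STEMS.index(day[0]), BRANCHES.index(day[1])
    for n in range(60):
        if n % 10 == stem and n % 12 == branch:
            return n
    raise ValueError(f"not a valid stem-branch pair: {day!r}")


passage = [
    DateRef("開寶九年冬十月癸丑", era="開寶", year=9, month=10, day="癸丑"),
    DateRef("乙卯", day="乙卯"),
]
complete, implied = fill_from_context(passage)
days_apart = (cycle_index(implied.day) - cycle_index(complete.day)) % 60
print(implied.era, implied.year, implied.month, implied.day)  # 開寶 9 10 乙卯
print(days_apart)  # 2
```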
The annotation client will attempt to suggest appropriate values, but these will sometimes be incorrect. It is important to pay attention to the contextual flow of information when annotating dates, especially where parenthetical references to other years and eras do not affect the interpretation of dates later in the text. For example, in the following passage, purple arrows indicate the correct contextual flow of date information:
The annotation client will help by suggesting the correct values automatically for most cases - e.g. suggesting that "乙卯" refers to year 9 month 10 of the 開寶 era - but in this example will incorrectly propose that "十一月癸亥" refers to the 11th month of year 8 of 開寶, due to year 8 having been referenced immediately prior. In cases like these it is important to pay attention to the date flow: if "十一月癸亥" is marked as referring to year 8, then the annotation client will infer that 甲子 and 庚午 should also be marked as year 8, whereas in this passage they actually refer to year 9. Mistakes of this kind easily cascade to affect many dates in historical texts because much of the date information is implied contextually.
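The cascade described above can be reproduced with a toy model. The sketch below is a hypothetical simplification, not the annotation client's actual logic: all dates are assumed to belong to the 開寶 era, the sequence of references is a reconstruction from the description (the text of the parenthetical reference is a placeholder), and the "parenthetical" flag is an invented marker for references that should not change the running context.

```python
def propagate(refs, respect_parentheticals):
    """Resolve (year, month) for each reference from running context.

    When respect_parentheticals is False, every reference updates the
    context, so a passing mention of another year leaks into later dates.
    """
    ctx_year = ctx_month = None
    resolved = []
    for ref in refs:
        if "year" in ref:
            year, month = ref["year"], ref.get("month")
        else:
            year, month = ctx_year, ref.get("month", ctx_month)
        resolved.append((ref["text"], year, month))
        if not (respect_parentheticals and ref.get("parenthetical")):
            ctx_year, ctx_month = year, month
    return resolved


refs = [
    {"text": "開寶九年冬十月癸丑", "year": 9, "month": 10},
    {"text": "乙卯"},
    {"text": "<parenthetical mention of year 8>", "year": 8, "parenthetical": True},
    {"text": "十一月癸亥", "month": 11},
    {"text": "甲子"},
    {"text": "庚午"},
]

naive = propagate(refs, respect_parentheticals=False)
aware = propagate(refs, respect_parentheticals=True)

# One wrong inference (十一月癸亥 as year 8) carries over to 甲子 and 庚午.
for (text, wrong_year, _), (_, right_year, _) in zip(naive[3:], aware[3:]):
    print(text, "naive year:", wrong_year, "correct year:", right_year)
```

In this model a single misjudged reference changes three resolved dates, which is why checking the date flow at the first suspicious suggestion is cheaper than correcting each downstream date afterwards.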
Texts and editions
Only one edition of each text should be annotated. This should normally be the representative edition.
Some annotations have been added to the following texts; please use the editions linked below when adding or correcting annotations:
Standard Histories
Other historical works
- 資治通鑑
- 續資治通鑑
- 廿二史劄記
- 十六國春秋
- 十六國春秋別傳
- 宋史紀事本末
- 明史紀事本末
- 欽定平定台灣紀略
- 曾文正公年譜
- 鴉片事略
- 庚子國變記
- 大越史記全書
- 越史略
- 三國史記
- 全唐文
- 全上古三代秦漢三國六朝文
- 十國春秋
Bibliographic works and catalogs
The above are only partial lists; other texts can also be annotated, provided that:
- A scanned source exists in the Library and is linked to the transcription.
- The work is non-fictional.
- The transcription is largely correct.
- The transcription is largely punctuated.