Chinese Text Project


About Unicode characters

Characters used on computers, in word processing and other applications as well as on the internet, are generally represented by numbers: each character is assigned a different number, and that number is used internally to represent the character. For example, the number 65 represents the English letter A. The Unicode standard extends this idea to cover the majority of characters in common use around the world, including common Chinese characters and many ancient character forms.
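
As a brief illustration of the character-to-number mapping (a sketch in Python, which is not something this site prescribes), the built-in functions ord and chr convert between a character and its Unicode number:

```python
# Convert between characters and their Unicode numbers (code points).
print(ord("A"))       # 65
print(chr(65))        # A

bu = "\u4e0d"         # the Chinese character written with the escape for U+4E0D
print(ord(bu))        # 19981 (decimal)
print(hex(ord(bu)))   # 0x4e0d (the same number in hexadecimal)
```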

Dictionary entries on this site list the Unicode number for a character directly beneath the character itself. This number is written in hexadecimal notation and is preceded by "U+" to indicate that it is a Unicode number. To find out which Chinese character a Unicode number represents, you can enter the number directly in the dictionary search box.
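
The "U+" notation can be produced and read back with a few lines of Python. This is a minimal sketch; the helper names to_u_notation and from_u_notation are made up for illustration and do not describe how the dictionary search box is implemented:

```python
def to_u_notation(char: str) -> str:
    """Format a single character as 'U+' followed by at least four hex digits."""
    return f"U+{ord(char):04X}"


def from_u_notation(text: str) -> str:
    """Return the character named by a string such as 'U+4E0D'."""
    if not text.upper().startswith("U+"):
        raise ValueError(f"not in U+ notation: {text!r}")
    # The part after 'U+' is the Unicode number in hexadecimal.
    return chr(int(text[2:], 16))


print(to_u_notation("A"))                 # U+0041
print(to_u_notation("\u4e0d"))            # U+4E0D
print(from_u_notation("U+0041"))          # A
```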

Duplicate characters

One problem with Unicode is that certain identical or near-identical characters have been assigned more than one Unicode number. From a user's perspective, this means that two characters which look identical or near-identical when displayed may actually be represented by different Unicode numbers, and so be treated by computer software as different characters. Sometimes the two characters appear identical in one font and yet differ from each other in another.

For example, according to the Unicode standard, the numbers U+4E0D and U+F967 both represent a character which should be displayed as "不" (in practice, one of the two may not be displayed on all systems, since not all fonts include all characters). This causes a problem if some occurrences of the character are stored under one number and other occurrences under the other. Because software treats characters with different numbers as different characters, a search for "不 (U+4E0D)" finds only the occurrences stored under that number and misses those stored as "不 (U+F967)".
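
The failed search can be reproduced in a short Python sketch. The sample string is an assumption chosen for illustration; the two characters are written as escapes so that they cannot be confused on screen:

```python
unified = "\u4e0d"   # U+4E0D
compat = "\uf967"    # U+F967, which should display the same way

# A sample passage in which the character happens to be stored as U+F967.
text = "學而時習之" + compat + "亦說乎"

print(unified == compat)   # False: different numbers, so different characters
print(unified in text)     # False: a search for U+4E0D misses the occurrence
print(compat in text)      # True: only the exact number matches
```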


To get around this problem, the Chinese Text Project normalizes characters which have such identical or near-identical pairs by allowing only one representation of each to occur in the database. For example, in the above case, every occurrence of the character "不" is stored as "不 (U+4E0D)", and no occurrences of "不 (U+F967)" are allowed. "不 (U+F967)" can still be looked up in the dictionary, where it displays a warning message indicating that this character has been normalized to "不 (U+4E0D)". If you search the textual database for "不 (U+F967)", it will automatically be translated into "不 (U+4E0D)", and so should give the expected result.
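
The idea of normalizing both the stored text and the search query can be sketched in Python as follows. This is only an illustration under stated assumptions: NORMALIZATION_TABLE, normalize_text and search are hypothetical names, the table holds just the one pair discussed above, and nothing here describes the site's actual implementation.

```python
import unicodedata

# Hypothetical one-entry table: maps a disallowed number to the form that is kept.
NORMALIZATION_TABLE = {"\uf967": "\u4e0d"}
_TRANSLATION = str.maketrans(NORMALIZATION_TABLE)


def normalize_text(text: str) -> str:
    """Replace every character in the table with its normalized form."""
    return text.translate(_TRANSLATION)


def search(query: str, stored_text: str) -> bool:
    """Normalize the query first, so either form finds the stored form."""
    return normalize_text(query) in stored_text


# Stored text contains only the normalized form, U+4E0D.
stored = normalize_text("\uf967\u4ea6\u8aaa\u4e4e")

print(search("\u4e0d", stored))   # True
print(search("\uf967", stored))   # True: the query is translated to U+4E0D

# For this particular pair, Unicode's own NFC normalization gives the same result.
print(unicodedata.normalize("NFC", "\uf967") == "\u4e0d")   # True
```

A site-specific table is used here rather than relying on unicodedata.normalize alone, since a project may choose to normalize pairs that the Unicode standard's normalization forms do not cover.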

Normalized characters

All characters currently normalized in the CTP are displayed below. Depending upon the fonts installed on your computer, some of these characters may be very similar but distinct, and some may not display at all (see the Font Test Page for more details).

The main rationale in defining this list is one of utility: if it is not useful for the system to distinguish between two different characters, then they should be normalized. In particular, if the differences between two semantically equivalent character forms are so small that they cannot be reliably distinguished in preserved texts, then the characters will be normalized even though they do represent distinct character forms.

