Chinese Text Project

Optical Character Recognition

OCR abstracts character forms

The Chinese Text Project primarily deals with digital texts in two distinct types of representation. The first is computer-encoded text, which can be typed, copied, and pasted, as seen in texts in the textual database and Wiki. The second is image data, which cannot be manipulated digitally as ordinary text but which provides an accurate facsimile of a printed work, as seen in texts in the Library.

Each of these forms has unique advantages when compared to the other, and neither form alone is suitable for all purposes.

Optical Character Recognition (OCR) refers to an automated process for converting text represented as an image into computer-encoded text. On the Chinese Text Project, OCR is performed on transmitted copies of Chinese texts such as those from the Sikuquanshu and other collections, in order to provide better ways of working with these transmitted texts.

Texts linked by OCR

Image and transcription

When both a computer-encoded transcription and a scanned edition of the text it is based upon are available, it is possible to use OCR data to link the existing textual copy to its precise location in the scanned edition. This makes possible a simple visual comparison of the transcription with the original edition itself.
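The matching procedure used by the site is not described here. Purely as an illustration of the idea, the sketch below aligns a clean transcription against noisy OCR output using Python's standard `difflib`, so that each matched character in the transcription can be traced to a position in the OCR text (and from there, in principle, to a location in the scan). The function name `align_to_ocr` and the sample strings are assumptions made for this example, not part of the site.

```python
from difflib import SequenceMatcher


def align_to_ocr(transcription: str, ocr_text: str) -> dict[int, int]:
    """Map character indices in the transcription to indices in the OCR text.

    Only characters that fall inside a matching block are mapped; characters
    the OCR misread (or missed) are simply absent from the result.
    """
    # autojunk=False disables the popularity heuristic, which could otherwise
    # discard frequently repeated characters in long texts.
    matcher = SequenceMatcher(None, transcription, ocr_text, autojunk=False)
    mapping: dict[int, int] = {}
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            mapping[block.a + k] = block.b + k
    return mapping


# Hypothetical sample: the OCR output has misread the third character.
transcription = "天地玄黃宇宙洪荒"
ocr_text = "天地亥黃宇宙洪荒"
mapping = align_to_ocr(transcription, ocr_text)
```

In this sample every character except the misread one receives a position, which is enough to locate the surrounding passage in the scan and compare it by eye.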

Where this information is available for a paragraph of text, it is indicated by the icon to the left of the paragraph. Clicking this icon opens the corresponding page of the scanned text in the library. To highlight a specific word or phrase, search for it in the textual edition before clicking the icon.

Raw OCR results

Where no existing digital transcription of a text is available, OCR can be used to create a rough draft of the text. The resulting text will typically contain a large number of errors, especially where parts of the source material are unclear, damaged, or incomplete.

At the same time, transcriptions created using OCR on this site have the advantage of being linked line by line to the scan of the corresponding edition. Thus even where the transcription contains errors, it provides a way to locate, almost instantly, information in the scanned text that might otherwise be hard or impractical to find, and therefore also a way to verify the accuracy of the transcription.

As with texts linked by OCR, clicking the icon to the left of a paragraph of text opens the corresponding page of the scanned text in the library. To highlight a specific word or phrase, search for it in the textual edition before clicking the icon.

Searching OCR-enabled texts

All linked texts, regardless of whether they derive from matching of an existing edition or from automated transcription of a new edition, are represented in the Chinese Text Project as ordinary texts that can be searched in exactly the same way as all other texts. The advantage of OCR-enabled texts is that having searched or navigated to any particular part of the text, it is possible to immediately locate the corresponding part of the relevant scanned text in the library.

To search an OCR-enabled text, first locate an OCR-enabled edition of the text you wish to work with in the Textual Database, Wiki, or Library; almost all texts in the Library are now linked to at least one OCR-enabled text. You are then free to browse the text: for Textual Database or Wiki resources, navigate by chapter or other means; for Library resources, navigate by page or (where available) chapter. Searching an OCR-enabled text displays textual search results in the usual format; clicking the icon beside any paragraph opens the appropriate page of the corresponding Library resource with the search term highlighted.

For example, suppose that you have searched a Wiki text and obtained this result:

Clicking on the icon shown will then take you to this page of the library:

Correcting mistakes

Minor corrections

Minor corrections to texts created or matched using OCR can be made directly from the corresponding Library page. Clicking the "Quick edit" link beneath the transcription of the page lets you edit that page in a format simpler than the full Wiki editing view described below.

In the "Quick edit" view, each line corresponds to a column of text, and paragraph breaks are indicated using the code "<p>". Do not attempt to add other tags or markup relating to scan links using the "Quick edit" view.
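As a rough sketch of this format, the following Python function splits "Quick edit" text into paragraphs, each holding the column lines that belong to it. It assumes that "<p>" may appear anywhere in a line, which the description above does not specify; the function name `parse_quick_edit` and the sample text are hypothetical.

```python
def parse_quick_edit(text: str) -> list[list[str]]:
    """Split Quick edit text into paragraphs of column segments.

    Each input line is one column of the scanned page. A "<p>" code starts
    a new paragraph at the point where it occurs (assumed: anywhere in a line).
    """
    paragraphs: list[list[str]] = [[]]
    for line in text.splitlines():
        parts = line.split("<p>")
        for i, part in enumerate(parts):
            if i > 0:
                # Every "<p>" encountered opens a new paragraph.
                paragraphs.append([])
            if part:
                paragraphs[-1].append(part)
    # Drop paragraphs left empty by leading or repeated "<p>" codes.
    return [p for p in paragraphs if p]


# Hypothetical sample: three columns, with a paragraph break inside the second.
sample = "<p>子曰學而時習之\n不亦說乎<p>有朋自遠方來\n不亦樂乎"
paragraphs = parse_quick_edit(sample)
```

Because a paragraph break can fall in the middle of a column, a single column may contribute one segment to each of two paragraphs, as the second line of the sample does.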

For example, the "Quick edit" view for the page shown above looks like this:

Full editing

All texts created using OCR are stored in the Wiki section of the site so that they can be collaboratively edited to correct errors. This section describes editing issues specific to texts that contain OCR data. If you have not edited texts in the Wiki before, please refer to the Wiki instructions for a general introduction to editing texts in the Wiki before reading the rest of this section.

A number of special codes are used within the Wiki to link OCR-enabled texts to their corresponding scans. These are:

| Function | Example usage | Description |
| --- | --- | --- |
| Start of a scanned page | <scanbegin file="1234" page="5" y="6" /> | This indicates that the character immediately following this code is the first character from the transcription that appears on the specified page of the specified file of the scanned text. If the "y" parameter is specified, it indicates how many characters down the page the first character appears in the scan. |
| End of a scanned page | <scanend file="1234" page="5" /> | This indicates that the character immediately preceding this code is the last character from the transcription that appears on the specified page of the specified file of the scanned text. |
| Column break within a scanned page | <scanbreak file="1234" y="1" /> | This indicates that the character immediately following this code should appear at the top of the next column on the current page of the scanned text. If the "y" parameter is specified, it indicates how many characters down the page the first character appears in the scan. |
| Blank space in a column | <scanskip file="1234" y="1" /> | This indicates that the character immediately following this code should be moved down the page a distance corresponding to "y" characters compared to where it would normally appear. |

Normally there will be no need to alter the contents of any of these codes, and please do not attempt to do so unless you are sure that you understand their function. Please do not add these codes to texts that have not yet been automatically linked to a scan.
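For readers who want to inspect these codes outside the Wiki, the sketch below shows one possible way to pull them out of a piece of Wiki text in Python, returning the plain transcription together with each code's type, parameters, and position. The regular expressions, the `ScanCode` class, and the function name are assumptions made for this example; they are not taken from the site's own software, and the sample text is invented.

```python
import re
from dataclasses import dataclass

# Matches the four self-closing scan codes described above.
SCAN_TAG = re.compile(r'<(scanbegin|scanend|scanbreak|scanskip)\b([^>]*?)/>')
# Matches name="value" parameters inside a code.
ATTR = re.compile(r'(\w+)="([^"]*)"')


@dataclass
class ScanCode:
    kind: str              # "scanbegin", "scanend", "scanbreak" or "scanskip"
    attrs: dict[str, str]  # parameters such as file, page, y
    offset: int            # index in the plain text where the code occurred


def extract_scan_codes(wiki_text: str) -> tuple[str, list[ScanCode]]:
    """Return the text with scan codes removed, plus the codes found."""
    plain_parts: list[str] = []
    codes: list[ScanCode] = []
    pos = 0        # position in wiki_text after the previous code
    plain_len = 0  # length of plain text collected so far
    for match in SCAN_TAG.finditer(wiki_text):
        chunk = wiki_text[pos:match.start()]
        plain_parts.append(chunk)
        plain_len += len(chunk)
        attrs = dict(ATTR.findall(match.group(2)))
        codes.append(ScanCode(match.group(1), attrs, plain_len))
        pos = match.end()
    plain_parts.append(wiki_text[pos:])
    return "".join(plain_parts), codes


# Invented sample: one page containing two columns of four characters each.
sample = (
    '<scanbegin file="1234" page="5" y="6" />天地玄黃'
    '<scanbreak file="1234" y="1" />宇宙洪荒'
    '<scanend file="1234" page="5" />'
)
plain, codes = extract_scan_codes(sample)
```

The offset recorded for each code is the index of the character that follows it in the plain text, which matches how "scanbegin", "scanbreak", and "scanskip" are defined; for "scanend" the character it refers to is the one just before that offset.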

Editing the page shown above in the Wiki looks like this (the region corresponding to the page has been selected for clarity):