Chinese Text Project

The Late Imperial Primer Literacy Sieve: A digital tool for approximating how the primer-literate read texts

Sarah Schneewind and Joshua Day

Rationale and contents

In late imperial times, perhaps 5% of the population was fully classically literate, yet the theoretically meritocratic civil service examination system and a vibrant market in print materials of all degrees of complexity, among other factors, have fueled a scholarly consensus that many more men and women in Ming society were partially literate, in varying degrees. Scholars have been studying the vernacular literature and other texts written for Ming and Qing people who had acquired what historian Benjamin Elman calls "primer literacy," but have had no systematic way to ascertain which texts these readers could read and how well they could understand them. The Primer Literacy Sieve is a computer program that removes (grays out) from any text a researcher inputs any character that does not appear in the primer(s) the researcher selects. Studying what remains gives a rough idea of what a primer-literate reader would have read in the text. This permits a closer approximation to audience reception of all kinds of late-imperial texts than has been possible before.

The question of which primers people studied from is surprisingly hard to answer, not only because it appears to have changed and varied, but also because we know more about what reformers promoted than what people actually used. For this project, we settled on:

After consultation with colleagues, we also added the Heart Sutra and the Guanyin Sutra as primers, and a file with the digits and directions. The program is set up so that each researcher can decide which primers to include. The Analects and Mencius are long and complex texts with a large number of characters, and we have not included them as primers in our project, but if researchers wished to run them against a specific text, that should be possible.

To demonstrate, we have included only a couple of target texts (in this case stele texts) - please see the example output generated by the program using these two texts.

Development and workings of the Sieve

The goals of the experiment were fixed at the time of the design of the software; the design of the experiment, however, was not. The first purpose of the software, then, was to clarify which questions to ask of the data. Since the interactive nature of the output made it possible to recognize when a certain use case was producing more interesting results than had been expected, or when a certain form of output was simply too confusing or its meaning was too contingent on circumstantial details, certain approaches that would have been technically interesting were rejected as requiring interpretations too opaque for responsible historical use. (In future, patterns may emerge that allow more sophisticated processing to be undertaken responsibly.)

To simplify the interpretation, then, a question was asked that could be given a clear historical interpretation. Under the assumption that a learner will have memorized a primer in its entirety, or not memorized it at all, the software transforms all input primers and target texts into a common format (stripping modern punctuation and formatting), allows the user to select a custom set of primers which a hypothetical late-imperial reader will have studied, and grays out in the target text any character not in the selected primer(s). The researcher can then "read" what remains, possibly incorporating grayed-out characters that he or she thinks the hypothetical reader might know from daily use, should be able to deduce from the radical and/or phonetic, or might guess (rightly or wrongly) from context. The Sieve, in other words, is a tool that permits researchers to make nuanced readings and translations that may shed light on how the partially literate read texts whose full nuances and structure they might not be able to make out.
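The procedure described above can be sketched in a few lines. What follows is a minimal Python illustration under stated assumptions, not the authors' Lua implementation: the function names, the primer keys `primer_a` and `primer_b`, the use of "□" to stand in for a grayed-out character, and the choice to treat every non-letter character as modern punctuation are all made up for the example. The sample strings are the opening lines of the Three Character Classic and the Thousand Character Classic, used here only as sample data.

```python
import unicodedata


def normalize(text: str) -> str:
    """Reduce a text to a common format by keeping only letter characters.

    Assumption: Chinese characters fall in Unicode category 'Lo', while
    punctuation, whitespace and digits fall in other categories and are dropped.
    """
    return "".join(ch for ch in text if unicodedata.category(ch).startswith("L"))


def known_characters(primers: dict[str, str], selected: list[str]) -> set[str]:
    """Collect every character in the selected primers.

    This encodes the all-or-nothing assumption: a selected primer is treated
    as memorized in its entirety, an unselected one as not memorized at all.
    """
    known: set[str] = set()
    for name in selected:
        known.update(normalize(primers[name]))
    return known


def sieve(target: str, known: set[str]) -> list[tuple[str, bool]]:
    """Pair each character of the target text with True (known) or False."""
    return [(ch, ch in known) for ch in normalize(target)]


def render(marked: list[tuple[str, bool]]) -> str:
    """Plain-text view: unknown characters are replaced by a placeholder."""
    return "".join(ch if ok else "□" for ch, ok in marked)


# Hypothetical sample data; the keys are illustrative names only.
primers = {
    "primer_a": "人之初，性本善。性相近，習相遠。",
    "primer_b": "天地玄黃，宇宙洪荒。",
}
target = "天地之性，人為貴"

print(render(sieve(target, known_characters(primers, ["primer_a"]))))
# prints: □□之性人□□
print(render(sieve(target, known_characters(primers, ["primer_a", "primer_b"]))))
# prints: 天地之性人□□
```

Selecting a second primer changes what the hypothetical reader "sees" in the same target text, which is the comparison the Sieve is meant to support. Keeping the unknown characters as marked pairs, rather than deleting them, leaves the researcher free to reinstate any character the reader might plausibly have known from daily use or guessed from context.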

The software presently consists of two independent but interlocking parts. The first part, which is written in Lua and runs under LuaJIT, is meant to be the seat of all time-consuming processing that needs to be run only once for any set of documents. The advantages of LuaJIT are that code can be produced for it inexpensively and that its speed matches, and can exceed, that of C on relevant benchmarks. The speed gain made it possible to develop the software incrementally, even in early phases where the computation was more complicated and the execution times were significantly longer. The current, streamlined form was made possible by that incremental process, and lends itself to the kinds of improvements that future investigation will suggest. The second part is the output from the Lua preprocessor: an HTML document that can be viewed and manipulated on the local machine (without the need for, but permitting the use of, a separate server). It is designed to allow a natural exploration of the data, with an intuitive set of tools for restricting the set of source documents that will be presented, and a panel devoted to giving context about the characters and their use in the source documents.
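To make the idea of a preprocessor that emits a self-contained HTML document more concrete, here is a small Python sketch. It is an assumption-laden illustration rather than a description of the Sieve's actual output: the function name `to_html`, the CSS class names `known` and `unknown`, the gray color value, and the page title are all hypothetical, and the interactive primer selection and context panel of the real output are not reproduced.

```python
from html import escape


def to_html(marked: list[tuple[str, bool]], title: str = "Sieve output") -> str:
    """Build a standalone HTML page from (character, is_known) pairs.

    Unknown characters are wrapped in a span that a stylesheet grays out,
    so the page can be opened from the local disk without a server.
    """
    spans = []
    for ch, known in marked:
        css_class = "known" if known else "unknown"
        spans.append(f'<span class="{css_class}">{escape(ch)}</span>')
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">'
        f"<title>{escape(title)}</title>"
        "<style>.unknown { color: #bbb; }</style></head>"
        f"<body><p>{''.join(spans)}</p></body></html>\n"
    )


if __name__ == "__main__":
    # Hypothetical marked text: True means the character is in a selected primer.
    sample = [("天", False), ("地", False), ("之", True), ("性", True), ("人", True)]
    # The file name here is illustrative only.
    with open("sketch_output.html", "w", encoding="utf-8") as handle:
        handle.write(to_html(sample))
```

Doing the expensive marking once and writing the result into a static page is one way to keep later exploration fast, since the browser only has to restyle text that has already been classified.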

Installation and usage

This program requires the LuaJIT interpreter to be installed to run.

Once you have downloaded and installed LuaJIT, download and extract either one of the following two files:

Once you have downloaded and extracted the files, you can run the program using one of the following:

The program will produce its output in a file called "output.html". To view the output, open this file in any web browser.

Other texts can be used with the program by saving the target text in plain text format in the "steles" or "primers" directory as required. For further details, please refer to the INSTRUCTIONS.TXT file contained in the archive.


This project was funded by a grant from the University of California, San Diego Academic Senate. The development team also included Jenny Huangfu and Leo Tindall.

The Sieve is copyright © Joshua Day and Sarah Schneewind, 2013. It may be freely downloaded and used by anyone, but we disclaim responsibility for any errors that may result.


The Sieve is published under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 license.

You are free to share and adapt, but not for commercial purposes, this material, under the following terms:

  1. With respect to attribution, you must credit Joshua Day, Sarah Schneewind, and the University of California, San Diego; provide a link to; and indicate what changes, if any, you made; and you may give that credit in any reasonable manner that does not suggest that the licensor endorses you or your use.
  2. With respect to ShareAlike provisions, if you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original. You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

Contact details

Please write to Sarah Schneewind with inquiries concerning citation and redistribution, or to Joshua Day with technical inquiries and requests.