Building a Graphetic Kanji Dictionary with SVG

Julien Quint <quint@nii.ac.jp> and
Ulrich Apel <ulrich_apel@t-online.de>
National Institute of Informatics
Hitotsubashi 2-1-2, Chiyoda-ku, Tokyo 101-8430 Japan

The Difficulty of Reading and Writing Kanji. The Japanese writing system is very complex, built around a set of thousands of characters borrowed from China, known as kanji (漢字). Since an educated person is expected to know around 3,000 individual characters, learning them is an enormous task for Japanese people and foreigners alike. To make matters worse, good learning material is difficult to find.

Among the challenges that kanji pose for the student is the difficulty of looking up a word or a character in a dictionary: one needs to know details about the character, such as its number of strokes or its pronunciation, which are difficult for a beginner to guess. Another major issue is the difficulty of writing a kanji, as it may involve dozens of strokes whose shape, order and direction all matter to the legibility of the character.

A Graphetic Approach. Trying to build better teaching material leads us to consider kanji as graphic entities composed of graphemes, the smallest meaning-distinguishing units. Among many possibilities, a sensible one is to take the brush or pen stroke as the basic unit of a character. Our analysis of strokes uses 25 basic forms, distinguished by direction, bending, endings, and so on.

We use two different XML vocabularies to describe a kanji. The first one is an ad-hoc vocabulary for the description of strokes, stroke groups, radicals, etc. Kanji are complex objects which, except in the simplest cases, can be divided into several groups of strokes inherited from other kanji. Some parts are radicals that determine the meaning and/or pronunciation of the character: for instance, 梅 (ume) has the radical 木 for tree (left part) determining its meaning and 毎 (mai) for the pronunciation. In any case, it is important to (1) classify the different types of strokes and (2) clearly tag the different stroke groups and radicals, so that a character can be effectively decomposed into smaller constituents that help the learner remember how to read or write it.

The second vocabulary that we use is SVG. SVG allows us to describe the drawing of a kanji precisely by associating every stroke with a path (and every stroke group with a group of paths). SVG also enables many interesting applications beyond the simple display of characters, such as animation (as described below) or linking radicals to the actual kanji that they come from (with XLink).
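As an illustration of the stroke-to-path association, the following minimal Python sketch builds an SVG document with one g element per stroke group and one path element per stroke. The function name, the id scheme, the canvas size, the styling attributes and the sample path data are all assumptions made for this example; they are not taken from the dictionary's actual data.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def build_kanji_svg(groups, size=100):
    """Build an SVG string from stroke groups (hypothetical layout).

    groups: list of (group_id, [path_data, ...]) given in stroke order.
    Each stroke becomes one <path>; each stroke group becomes one <g>.
    """
    ET.register_namespace("", SVG_NS)
    svg = ET.Element(f"{{{SVG_NS}}}svg", {"viewBox": f"0 0 {size} {size}"})
    stroke_no = 0
    for group_id, strokes in groups:
        g = ET.SubElement(svg, f"{{{SVG_NS}}}g", {"id": group_id})
        for d in strokes:
            stroke_no += 1  # strokes are numbered across the whole character
            ET.SubElement(g, f"{{{SVG_NS}}}path", {
                "id": f"{group_id}-s{stroke_no}",  # assumed id scheme
                "d": d,
                "fill": "none",
                "stroke": "black",
                "stroke-width": "3",
                "stroke-linecap": "round",
            })
    return ET.tostring(svg, encoding="unicode")


# Made-up path data standing in for two stroke groups of a character.
example_groups = [
    ("left", ["M10,30 C20,30 30,30 40,28", "M25,10 C25,40 25,70 25,95"]),
    ("right", ["M55,20 C65,20 80,20 90,18"]),
]
svg_text = build_kanji_svg(example_groups)
```

Giving every group and stroke its own id is one way to make individual parts addressable, for instance as link targets or as animation targets.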

Note that there are several styles of calligraphy in use in Japan; as in most other languages, handwritten characters may differ considerably from their typeset counterparts. We are mostly interested in the handwritten characters here.

Automatic Animation of Path Data. The SVG data is static, but animating the strokes is an effective way to show the stroke order and the direction of each stroke when teaching writing. Since the dictionary consists of thousands of characters, the animation has to be generated automatically from the static data currently available.

Each stroke is represented by a path element which is mostly composed of cubic Bézier curves (path commands C, c, S and s). Animating the drawing of a path goes further than what SVG 1.1 proposes; therefore, we have devised a simple animation scheme that consists of segmenting every curve in every stroke into a certain number of steps, and then drawing the curve in small time increments. The time increment and the number of segmentation steps can be modified to tweak the appearance of the final animation. Since we are dealing with Bézier paths, they are simple to segment into a number of parts that is a power of two, by repeatedly splitting each curve in half. The placement of the control points of the curves also greatly influences the animation (some parts of a stroke are drawn fast while others are drawn more slowly) and can make it look very natural. Coupling the SVG information with the other XML information makes it possible to enforce the correct order and direction of the paths.
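The segmentation step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used for the dictionary: it handles a single curve given by four control points (an absolute C command), splits it at the parameter midpoint with de Casteljau's construction, and the function names are invented for this example.

```python
def split_cubic(p0, p1, p2, p3):
    """Split a cubic Bézier curve at t = 1/2 using de Casteljau's construction.

    Returns two curves (each a tuple of four control points) that together
    trace exactly the same shape as the original.
    """
    def mid(a, b):
        return ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)

    a, b, c = mid(p0, p1), mid(p1, p2), mid(p2, p3)
    d, e = mid(a, b), mid(b, c)
    m = mid(d, e)  # point on the curve at t = 1/2
    return (p0, a, d, m), (m, e, c, p3)


def segment_cubic(curve, depth):
    """Return 2**depth consecutive sub-curves by repeated halving."""
    pieces = [curve]
    for _ in range(depth):
        pieces = [half for piece in pieces for half in split_cubic(*piece)]
    return pieces


def animation_frames(curve, depth):
    """Return one path-data string per frame; frame k draws the first k pieces."""
    pieces = segment_cubic(curve, depth)
    start = pieces[0][0]
    d = f"M{start[0]:g},{start[1]:g}"
    frames = []
    for _, c1, c2, end in pieces:
        d += f" C{c1[0]:g},{c1[1]:g} {c2[0]:g},{c2[1]:g} {end[0]:g},{end[1]:g}"
        frames.append(d)
    return frames


# A made-up stroke: one cubic curve, cut into 2**3 = 8 pieces.
stroke = ((0, 0), (0, 8), (8, 8), (8, 0))
frames = animation_frames(stroke, depth=3)
```

Showing the frames one after another at a fixed time increment draws the stroke progressively. Because the pieces are equal in parameter range rather than in length, the apparent drawing speed depends on where the control points lie.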

Conclusion. As a result, we have collected (and are still collecting) an unprecedented amount of information about the shape and writing of kanji, for educational or reference purposes. SVG is a key technology in the presentation of this data, be it for static display or animation.