Online Aggregation and Visualization of Census Data: Population Mapping with SVG, XML, and Free Software

Yi-Hong Chang, Tyng-Ruey Chuang

Institute of Information Science

Academia Sinica

Nankang, Taipei 115, Taiwan

yhchang, trc@iis.sinica.edu.tw

Phone: +886 2 2788-3799 ext. 1608

Fax: +886 2 2782-4814

January 15, 2002

Abstract

Open standards (SVG, XML, etc.) allow data to be freely exchanged between systems while maintaining its semantic integrity. Free software tools (Apache web server, xt XSLT processor, etc.) provide the means to freely build and maintain systems for data exchange. We report our experience in integrating free software tools for online aggregation and visualization of large census datasets. We started this project aiming to evaluate currently and freely available software tools for XML document processing, and to build from these tools a Web-based data exploration system for the 1990 Taiwan census dataset and the associated administration map.

Our experience has been very encouraging. In the short period of time – 5 man-months, including the time to study SVG and XML – during which this project was undertaken, we have produced a system that is based on open standards and free tools, and proves rather useful for population studies. We outline in this paper the approaches we took, the lessons we learned, as well as the remaining challenges we continue to face in building such systems.

Keywords: Data Visualization, Online Aggregation, Population Statistics and Mapping, SVG, Web, XML.

1. Project Description

Taiwan Household and Residence Censuses were conducted 7 times, in 1956, 1966, 1970, 1975, 1980, 1990, and 2000. During each survey, every household in Taiwan was interviewed for its household composition (household size, personal data for each household member, etc.) and residence status (residence size, rented or owned, etc.). The 1990 dataset includes 19,974,329 records of de-identified survey results, with one record for each surveyed individual. Each record contains 31 personal attributes, ranging from the individual's gender, year-of-birth, marital status, education level, to the individual's residence area code.

The residence area code is an 8-digit numeric code used only for conducting the census. However, the code can be uniquely mapped to a standard "city, town, county, village, and precinct district code" often used in Taiwan local government administration. Taiwan has 319 local government districts (cities, towns, and counties), each consisting of a few dozen precincts (for a city or town) or villages (for a county). In total, Taiwan has 7738 disjoint precincts and villages, each with a population in the thousands.

An Administration District Map of Taiwan, in ArcView format and containing the locations and contours of all the precincts and villages in Taiwan, was previously built by the Computer Centre of Academia Sinica, and has been used in other research projects. The 1990 census data, when collated with the district map, thus becomes a rich source of detailed population profiles of Taiwan in 1990. Our project goal is to use open standards and free tools to build a Web-based system to help map and explore this rich source of population data.

2. System Overview

The following diagram provides an overview of the current system. The system has two main parts: one that aggregates from the census dataset a selected district population profile (e.g., the population profile of the Nankang district, grouped by marital status, nationality, and gender), and one that assembles from a map database a selected district map. The first produces an XML file that conforms to a generic tabulation DTD designed by us. The second produces an SVG file representing the district map. The two results are then combined by two XSLT programs to produce an HTML file, with associated SVG and ECMAScript files, ready for display and exploration in a Web browser.
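As a rough illustration of the first result, a population profile could be serialized into a tabulation document along the following lines. The DTD itself is not reproduced in this paper, so the element and attribute names (`tabulation`, `dimensions`, `dimension`, `cell`, `value`) and the function `profile_to_xml` are hypothetical. Python is used only for this sketch; the system itself relies on XSLT for the later combination step.

```python
import xml.etree.ElementTree as ET


def profile_to_xml(district, group_by, counts):
    """Serialize a grouped population profile as a small tabulation document.

    district -- district name (used as an attribute on the root element)
    group_by -- tuple of attribute names the profile is grouped by
    counts   -- dict mapping a tuple of attribute values to a head count

    All element and attribute names are assumptions made for illustration.
    """
    root = ET.Element("tabulation", {"district": district})

    # Declare the grouping attributes once, in order.
    dims = ET.SubElement(root, "dimensions")
    for name in group_by:
        ET.SubElement(dims, "dimension", {"name": name})

    # Emit one cell per combination of attribute values.
    for key, count in sorted(counts.items()):
        cell = ET.SubElement(root, "cell", {"count": str(count)})
        for name, value in zip(group_by, key):
            ET.SubElement(cell, "value", {"dimension": name}).text = str(value)

    return ET.tostring(root, encoding="unicode")
```

Because every profile shares the same shape regardless of which attributes it is grouped by, a single stylesheet can in principle render any of them, which is the point of keeping the tabulation format generic.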

Note that the raw census dataset (in plain text format) and the raw district map (in ArcView format) were pre-processed and stored in a relational database system (RDBS) for easy access. Also note that, after the pre-processing, the aggregation of population profiles and the assembly of district maps are driven by user queries and performed online. The pre-aggregation step extracts from the census database the population profiles for each of the 7738 villages and precincts. Each profile is grouped by the 7 most-used personal attributes: gender, year-of-birth, marital status, nationality, education status, educational attainment, and health condition. The pre-aggregation allows a population profile of a particular district, grouped by a particular subset of these attributes, to be quickly generated from the pre-aggregated profiles without going back to the entire census RDBS for aggregation.
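The two-stage scheme can be sketched as follows. This is a minimal in-memory illustration in Python, not the system's actual implementation (which stores the data in an RDBS); the field names in `ATTRIBUTES`, the `area_code` key, and the functions `pre_aggregate` and `roll_up` are assumptions made for the example.

```python
from collections import Counter

# Hypothetical field names for the 7 pre-aggregated personal attributes.
ATTRIBUTES = (
    "gender", "year_of_birth", "marital_status", "nationality",
    "education_status", "educational_attainment", "health_condition",
)


def pre_aggregate(records):
    """Offline step: count individuals per village/precinct and per
    combination of the 7 attributes. Each record is a dict."""
    profiles = {}
    for rec in records:
        key = tuple(rec[a] for a in ATTRIBUTES)
        profiles.setdefault(rec["area_code"], Counter())[key] += 1
    return profiles


def roll_up(profiles, area_codes, group_by):
    """Online step: derive the profile of a district (a set of area codes)
    grouped by a subset of the attributes, using only the pre-aggregated
    counts and never the individual records."""
    idx = [ATTRIBUTES.index(a) for a in group_by]
    result = Counter()
    for code in area_codes:
        for key, count in profiles.get(code, {}).items():
            result[tuple(key[i] for i in idx)] += count
    return result
```

A query such as `roll_up(profiles, nankang_codes, ("marital_status", "gender"))`, where `nankang_codes` stands for the precinct codes of one district, touches only the pre-aggregated counts for those precincts. The trade-off is that attributes outside the fixed set of 7 cannot be queried online.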

The following is a screenshot of a population map in a user's browser. Click on the image to access the live HTML document and explore it yourself. (We regret that the map currently works only in Microsoft Internet Explorer on Windows systems. There are some ECMAScript/SVG compatibility problems in Netscape Navigator which we have yet to sort out.)

TSM System Screen Shot

3. Related and Future Work

Vienna – Social Patterns and Structures by Andreas Neumann is a classic SVG-based population map. Our system differs from his in that ours is a 3-tier system: the presentation of data is separated from the preparation of data, and the preparation of data is in turn separated from the data source itself. We can easily modify the parts that handle online aggregation of population profiles, online assembly of district maps, and online preparation of presentation files, without seriously affecting one another. XML is the data exchange format that connects the three tiers. We also designed a generic tabulation DTD to structure population profiles so that presentation files for different population profiles can be uniformly generated using simple XSLT scripts.

Future work includes dealing with multiple census datasets (e.g., the 1980 and 2000 datasets) and with changing district maps (districts often merge and/or split over time).

4. Remaining Challenges

There remain many challenges in building a high-performance population mapping system for large census datasets using only freely available software tools. The RDBS we use, MySQL, can hardly perform online aggregation of 20 million records over an unrestricted number of record attributes. This limitation prompted us to pre-aggregate the census data over a fixed number of record attributes, and to allow only those online aggregations that can be directly derived from the pre-aggregated data. Likewise, free XSLT processors (in fact, all XML processors based on the DOM API) do not work well with large XML documents, because they must hold the whole document in memory. This forced us to use an RDBS, instead of an XML document, to store the entire cleansed census dataset. Free "XML data stores" that can efficiently manage gigabyte-sized XML documents do not exist. Also, free XML tools often do not support multi-lingual document processing very well. For example, many free XML processors do not support UTF-16 (or native) coding of the "big5" Chinese character set, so extra effort is needed to process XML documents marked up with Chinese element names.
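To make the DOM limitation concrete, the following sketch contrasts it with event-based (streaming) processing, which reads a document one element at a time instead of building the whole tree. This is not the approach our system takes (we store the data in an RDBS), and the document layout assumed here (a root element with `record` children carrying a `gender` attribute) is invented purely for illustration. It uses Python's standard `xml.etree.ElementTree.iterparse`.

```python
import xml.etree.ElementTree as ET


def count_by_attribute(source, tag="record", attribute="gender"):
    """Tally one attribute over all <tag> elements without keeping the
    whole document in memory. `source` is a file name or file object.
    Assumes the counted elements are direct children of the root."""
    counts = {}
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            value = elem.get(attribute)
            counts[value] = counts.get(value, 0) + 1
            # Drop the children processed so far so memory use stays bounded.
            root.clear()
    return counts
```

Memory use then depends on the size of a single record rather than on the size of the document, at the cost of giving up the random access that tree-based XSLT processing relies on.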

We also realize that online data aggregation and visualization is only a small part of a successful population mapping system. We have not addressed issues of census data quality, standard survey codebooks, multiple socio-economic data sources, and personal privacy protection, among others.

5. Towards A Taiwan Social Map

We have named our current system Taiwan Social Map, and aim to gradually build up its collection of population datasets, as well as to improve its performance and usability. The web site is located at

http://quad.iis.sinica.edu.tw/~tsm

An English version specially prepared for the workshop is available at

http://quad.iis.sinica.edu.tw/~tsm/0.11b