Theme Mountain: an SVG-based Visual Data Mining Tool

Keywords: SVG, watermarking, fourier descriptor

Dr. Sebastiano Battiato
D.M.I. -Universita' di Catania
Via Andrea Doria, 6
9125, Catania
Italy
battiato@dmi.unict.it

Ph.D. Gianpiero Di Blasi
D.M.I. -Universita' di Catania
Via Andrea Doria, 6
9125, Catania
Italy
gdiblasi@dmi.unict.it

Prof. Giovanni Gallo
D.M.I. -Universita' di Catania
Via Andrea Doria, 6
9125, Catania
Italy
gallo@dmi.unict.it

Greco Sebastiano
D.M.I. -Universita' di Catania
Via Andrea Doria, 6
9125, Catania
Italy
seby.greco@libero.it


Abstract


To see is to know. When you understand what you see, then you can trust what you are seeing. In this paper we present ThemeMountain, an SVG-based Visual Data Mining tool. ThemeMountain is inspired by ThemeRiver [Hav02], a tool that identifies sequential patterns, trends and temporal relationships within large collections of documents and analyzes them over time. In ThemeRiver, a collection of documents or other data is displayed as a river, a ribbon of different colours that flows across a period of time. Within the river, colour-coded currents widen or narrow according to the relative strength of the themes they represent.


Table of Contents


1. Introduction
2. Design Goal and Mountain Metaphor
3. ThemeMountain
4. Experimental Results
Bibliography

1. Introduction

Data Mining is an information extraction activity able to discover hidden facts in large databases. Using a combination of machine learning, statistical analysis, modelling techniques and database technology, Data Mining finds patterns and subtle relationships in data, inferring rules that allow the prediction of future results. Typical applications include market segmentation, customer profiling, fraud detection, evaluation of retail promotions and credit risk analysis.

Visual Data Mining (VDM) is an approach to Data Mining that combines the human ability to interpret images with the computational power of the computer in order to create a powerful discovery environment. VDM presents information through visual patterns in such a way that the user can easily pick up underlying regularities and mathematical models. These patterns reveal flows, relations, structures and anomalies in the data that are useful to verify and confirm a priori knowledge and hypotheses. Moreover, patterns can suggest new questions, leading the user to new conclusions. A huge amount of information is generated every day, and exploring and analyzing these data is increasingly difficult. Information Visualization can improve VDM capabilities: visual data exploration has the great advantage of involving the user directly in the data mining process.

SVG can be used to craft effective visualization tools, thanks to its scalability and portability. Recently SVG has also been used to render real raster images (see for example [Bat05a], [Bat05b], [Bat05c]).

ThemeMountain is a new VDM tool inspired by ThemeRiver [Hav02]. It uses the metaphor of a mountain instead of a river. In the proposed system each peak represents the strength with which a theme is present in a given period of time. ThemeMountain is a server-side interactive visualization system. The figures below show the overall pipeline together with a typical SVG output.

The paper is structured as follows: Section 2 describes the design goal of a Visual Data Mining tool, Section 3 describes the ThemeMountain approach, and Section 4 presents experimental results.

Figure 1: The overall pipeline (schema.png)

Figure 2: A typical SVG output (output.png)

2. Design Goal and Mountain Metaphor

The main goal in designing new data visualization tools is to enable the user to find patterns quickly and easily, using the power of the human perceptual system. This is achieved through familiar metaphors that help the user to understand the data presentation. Adding contextual information, such as time lines and event annotations, makes it possible to connect patterns in the content with events or time intervals.

A visual metaphor facilitates discovery by presenting data in a way that is intuitive from both a perceptual and a cognitive point of view. Metaphors are wired into our understanding of particular concepts, as evidenced by common linguistic expressions.

We use a mountain metaphor to convey several key notions about collections of textual documents evolving in time. The time evolution of the document collection, the selected thematic content, and the thematic strength are associated respectively with the mountain's directed flow, composition and changing width.

The directed flow from left to right is interpreted as movement through time. The horizontal distance between two points on the mountain defines a time interval.

As in a histogram, ThemeMountain uses variations in width to represent variations in strength or degree of representation. At any point in time, the total vertical distance (the width of the mountain) indicates the collective strength of the selected themes.
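The relation between per-theme strength and the total vertical extent of the mountain can be illustrated with a minimal Python sketch. It assumes that themes are stacked on top of one another at each time step, which is one reading of the description above and not necessarily how ThemeMountain is implemented; the function name `stack_layers` and the sample themes are hypothetical.

```python
from typing import Dict, List, Tuple


def stack_layers(
    strengths: Dict[str, List[float]],
) -> Tuple[Dict[str, List[Tuple[float, float]]], List[float]]:
    """Stack themes vertically (an assumed layout, not the paper's code).

    `strengths` maps a theme to its strength at each time step.
    Returns, for each theme, the (lower, upper) bounds of its layer at
    each time step, plus the total height at each time step.
    """
    n_steps = len(next(iter(strengths.values()))) if strengths else 0
    baseline = [0.0] * n_steps
    layers: Dict[str, List[Tuple[float, float]]] = {}
    for theme, values in strengths.items():
        if len(values) != n_steps:
            raise ValueError(f"theme {theme!r} has a different number of time steps")
        upper = [b + v for b, v in zip(baseline, values)]
        layers[theme] = list(zip(baseline, upper))
        baseline = upper  # the next theme sits on top of this one
    return layers, baseline


# Hypothetical strengths for two themes over three time steps.
layers, total = stack_layers({"economy": [1, 3, 2], "sport": [2, 0, 4]})
```

Here `total` is the collective strength of the selected themes at each time step, and each entry of `layers` gives the vertical band occupied by one theme.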

3. ThemeMountain

ThemeMountain is a Data Mining visualization tool inspired by ThemeRiver [Hav02] that uses the mountain chain metaphor. ThemeMountain lets the user inspect a particular theme, its occurrences and its time range simply by moving the mouse over a mountain peak. The visualization is performed in real time. Using a simple web form, the user enters the time range and the themes of interest to explore within a database of textual documents. Databases that have proved suitable for ThemeMountain exploration include, for example, collections of articles published in daily newspapers over a period of time.
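The paper does not say how the mouse-over information is attached to a peak. One standard SVG mechanism is a `<title>` child element, which most browsers display as a tooltip. The Python sketch below builds such a marker under that assumption; the function name `peak_marker`, the use of a `<circle>` and the tooltip wording are hypothetical.

```python
import xml.etree.ElementTree as ET


def peak_marker(theme, day, occurrences, x, y, radius=4):
    """Return an SVG <circle> whose <title> child carries the hover text.

    Assumption: a <title> tooltip is used for the mouse-over details;
    the original tool may rely on scripting instead.
    """
    circle = ET.Element("circle", {"cx": str(x), "cy": str(y), "r": str(radius)})
    title = ET.SubElement(circle, "title")
    title.text = f"{theme}: {occurrences} occurrences on {day}"
    # ElementTree escapes special characters in the theme text for us.
    return ET.tostring(circle, encoding="unicode")


# Hypothetical peak: the theme "budget" occurs twice on 2003-11-11.
marker = peak_marker("budget", "2003-11-11", 2, x=0, y=160)
```

Building the element with an XML library rather than string concatenation keeps the output well-formed even when a theme contains characters such as `&` or `<`.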

The tool works as follows:

  1. send the request to the PHP module;
  2. send the query to the MySQL database;
  3. receive the answer from the MySQL database;
  4. create the SVG file;
  5. redirect the file to the browser.
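Steps 2 to 4 above can be sketched as follows. This is an illustration only: it is written in Python with an in-memory SQLite database standing in for the PHP module and the MySQL database, and the table `articles`, its columns `published` and `body`, and the functions `count_theme_per_day` and `render_svg` are hypothetical names, not taken from the actual system. Steps 1 and 5 (receiving the web request and returning the file to the browser) are omitted.

```python
import sqlite3
from xml.sax.saxutils import escape


def count_theme_per_day(conn, theme, start, end):
    """Steps 2-3: query the database, return [(day, occurrences), ...].

    Days on which no article mentions the theme are simply absent.
    """
    query = (
        "SELECT published, COUNT(*) FROM articles "
        "WHERE published BETWEEN ? AND ? AND body LIKE ? "
        "GROUP BY published ORDER BY published"
    )
    return conn.execute(query, (start, end, f"%{theme}%")).fetchall()


def render_svg(theme, counts, step=40, unit=20, height=200):
    """Step 4: draw the counts as one closed polygon (a chain of peaks).

    Simplification: points are spaced by row index, not by calendar date.
    """
    width = max(len(counts) - 1, 0) * step
    points = [f"0,{height}"]  # start on the ground line
    points += [f"{i * step},{height - n * unit}" for i, (_, n) in enumerate(counts)]
    points.append(f"{width},{height}")  # return to the ground line
    joined = " ".join(points)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
        f"<title>{escape(theme)}</title>"
        f'<polygon points="{joined}" fill="steelblue"/>'
        "</svg>"
    )


# Hypothetical sample data standing in for the news database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (published TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [
        ("2003-11-11", "budget vote in parliament"),
        ("2003-11-11", "budget talks continue"),
        ("2003-11-12", "football results"),
        ("2003-11-13", "new budget approved"),
    ],
)
counts = count_theme_per_day(conn, "budget", "2003-11-11", "2003-11-13")
svg = render_svg("budget", counts)
```

The query uses bound parameters instead of string interpolation, so a theme typed into the web form cannot alter the SQL statement.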

ThemeMountain is a server-side interactive tool. Using a web form, the user enters the time range and the theme. The system then returns an SVG file representing the answer to the submitted query. The SVG file format makes ThemeMountain platform independent, so that any user can access and use the system. Moreover, the modular nature of ThemeMountain allows it to be used with any textual database. The software uses a MySQL database containing textual papers, an Apache web server and a PHP module. The system is very easy to use: the user simply selects the time range and enters the theme (Figure 3). The system then creates the required SVG file (Figure 1).

Figure 3: Time range and theme selection (selection.png)

4. Experimental Results

The overall system has been tested by collecting and analyzing data stored in public news databases; in particular, we considered data from "Il Manifesto" (http://www.ilmanifesto.it/) from November 11, 2003 to March 10, 2004. Experiments confirm the effectiveness of the proposed visual data mining approach. Further details, an online demo and results can be found at the following web address: http://svg.dmi.unict.it/

Bibliography

[Hav02]
Havre S., Hetzler E., Whitney P., Nowell L. ThemeRiver: visualizing thematic changes in large document collections. IEEE Transactions on Visualization and Computer Graphics, Volume: 8, Issue: 1, Jan.-March 2002
[Bat05a]
Battiato S., Di Blasi G., Gallo G., Messina G., Nicotra S. SVG Rendering of Digital Images: an Overview. In poster proceedings of ACM/WSCG2005
[Bat05b]
Battiato S., Costanzo A., Di Blasi G., Gallo G., Nicotra S. SVG Rendering by Watershed Decomposition. In proceedings of IS&T/SPIE Electronic Imaging 2005
[Bat05c]
Battiato S., Barbera G., Di Blasi G., Gallo G., Messina G. Advanced SVG Triangulation/Polygonalization of Digital Images. In proceedings of IS&T/SPIE Electronic Imaging 2005
