Interactive Visualizations for Text Exploration

by Seth Raphael <raphael@nitle.org>
Software Researcher
National Institute for Technology and Liberal Education

Overview

The task of navigating large unstructured collections of documents to find information that is relevant, related, or connected to some topic can be very difficult. At the National Institute for Technology and Liberal Education (NITLE), we develop tools that use semantic analysis to correlate similar documents and to facilitate the exploration and discovery of connections between them. The tools range from statistical analysis, computational linguistics, and clustering algorithms to graph-theoretic models of text. These tools by themselves lead to interesting interactions with the text, but the experience can be improved upon through the use of interactive visualizations.

The use of dynamically generated representative graphics allows users of the NITLE Semantic Engine to understand a collection of documents in a new way. The visualizations can help elucidate the connections between documents, show the relationships between key concepts, and help characterize documents' content. Dynamically generated graphics can be created through a number of different technologies, but SVG has several features that make it particularly suited to these purposes.

Data-driven graphics created with the SVG format can interact robustly with the surrounding web-based environment, have the power of a built-in programming language, and can dynamically update themselves over a network. These abilities allow rich visualizations of content to not only be created on the fly, but navigated on the fly as well. The ability for an SVG image to interact with the underlying systems it represents makes it a very powerful solution for text and relationship visualization.
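As a rough illustration of what "data-driven" means here, the sketch below builds a small SVG document from a table of document positions and similarity links. It is not the NITLE Semantic Engine's code: the function name render_document_graph, the input layout, and the styling values are all assumptions made for the example, and it uses only Python's standard XML library.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def render_document_graph(documents, links, width=400, height=300):
    """Return an SVG string with documents as circles and similarities as lines.

    documents: dict mapping a document id to an (x, y) position (hypothetical layout)
    links: list of (id_a, id_b, similarity) tuples, similarity assumed in [0, 1]
    """
    # Serialize SVG elements without a namespace prefix.
    ET.register_namespace("", SVG_NS)
    svg = ET.Element(f"{{{SVG_NS}}}svg", {"width": str(width), "height": str(height)})

    # Draw links first so the circles are painted on top of them.
    for id_a, id_b, similarity in links:
        (x1, y1), (x2, y2) = documents[id_a], documents[id_b]
        ET.SubElement(svg, f"{{{SVG_NS}}}line", {
            "x1": str(x1), "y1": str(y1), "x2": str(x2), "y2": str(y2),
            "stroke": "gray",
            # Stronger similarity gives a thicker line.
            "stroke-width": f"{1 + 4 * similarity:.1f}",
        })

    for doc_id, (x, y) in documents.items():
        circle = ET.SubElement(svg, f"{{{SVG_NS}}}circle", {
            "id": doc_id, "cx": str(x), "cy": str(y), "r": "8", "fill": "steelblue",
        })
        # A <title> child is typically shown as a tooltip by SVG viewers.
        title = ET.SubElement(circle, f"{{{SVG_NS}}}title")
        title.text = doc_id

    return ET.tostring(svg, encoding="unicode")
```

Calling render_document_graph with a dictionary of positions and a list of links returns markup that a web server could send directly to an SVG-capable browser. Because each circle carries the document id, a script running in the viewer could use it to request more detail about that document.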

Our Visualizations

Information visualization is a growing field with many active areas of research. Within the context of the NITLE Semantic Engine, a suite of tools for managing and discovering knowledge in unstructured collections, there are many potential uses of visualization. Visualization tools can be conceived for just about every process. From depicting an entire collection of thousands of documents and their relationships to revealing the way the content of a single document fluctuates and changes over time, these graphical representations can reveal both macro and micro patterns.

We have experimented with several different visualizations and will present them here. Examples include showing how different clustering algorithms reveal the structure of a collection, showing graphically how many topics are discussed in a collection, and visualizing the interactions between characters in a document and how those interactions change over time.

Challenges

In the course of implementing our various experiments in SVG, we have discovered tricks, developed toolkits, and designed workflows to solve our problems. In our work, we have run into the need to reuse custom content to reduce the complexity of our SVG files and to keep them consistent. In the process of designing a toolkit for that purpose, we discovered a need for debugging tools. We also had to find the best process for quickly delivering dynamically generated SVG images in a production environment in a way that accounts for server load, bandwidth, and site design. We developed a number of open source libraries that address these issues and will discuss them in this paper.

Our Toolkit

The NITLE SVG Toolkit is a collection of tools that we have developed to facilitate our use of interactive SVG. These include drag-and-drop libraries, debugging utilities, and libraries for getting text input from the user and for communicating with a server. The main library is the RCC library, which allows the creation of a framework of widgets, each using its own namespace. Elements in those namespaces are automatically replaced with full SVG content, so a simple SVG file can reuse code and objects across projects and serve as scaffolding for more complex graphics. The library not only allows the instantiation of objects, but also allows robust interactions and event handlers to be defined in a manageable way. The libraries are described in this paper.
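The replacement step can be pictured with a small sketch. The code below is not the RCC library or its API; it is a Python illustration of the general idea, in which elements from a custom namespace are swapped for copies of an SVG template. The namespace URI, the "button" widget, its x, y, and label attributes, and the function name expand_widgets are all hypothetical.

```python
import copy
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
WIDGET_NS = "http://example.org/widgets"  # hypothetical widget namespace


def expand_widgets(root, templates):
    """Replace each widget element, in place, with a copy of its SVG template.

    root: parsed SVG tree that may contain elements in WIDGET_NS
    templates: dict mapping a widget's local name to a template <g> element
    """
    prefix = f"{{{WIDGET_NS}}}"
    # Snapshot the tree first so that replacing children does not disturb iteration.
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            if not child.tag.startswith(prefix):
                continue
            local_name = child.tag[len(prefix):]
            group = copy.deepcopy(templates[local_name])
            # Position the copy where the widget element asked to be placed.
            x, y = child.get("x", "0"), child.get("y", "0")
            group.set("transform", f"translate({x},{y})")
            # Fill every <text> in the template with the widget's label.
            for text in group.iter(f"{{{SVG_NS}}}text"):
                text.text = child.get("label", "")
            group.tail = child.tail
            parent[index] = group
    return root
```

With a template such as a group holding a rectangle and an empty text element, a source file only needs one short widget element per button. This sketch does not expand widgets nested inside templates and attaches no event handlers; it shows only the content substitution.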

Dynamic and Live

Not only are our visualizations often dynamically generated, but they sometimes need to be live. This means that the data reflected in the graphics is not static. In fact, the SVG images themselves might even effect changes in the underlying models. Bugs in current browsers have made some of this functionality difficult to implement, but this paper describes methods of working around them.

Open Areas for Research

The paper will conclude with a discussion of open areas that are ripe for further exploration.