Keywords: Web Services, REST
Paul Prescod is Director of Application Development for the products division of Blast Radius. Paul is well-known in the XML community as a standards advocate, software developer and writer.
As application developers come to understand SVG's highly productive approach to graphics, it will naturally become more and more common to use SVG as a visualization technology for XML data sources. It is natural for SVG developers to wonder how they can make high-quality visualizations of these sources. There are many essays and papers on these techniques. This paper is not one of them.
This paper will ask the opposite question: "How can Web services developers make Web services that are easy to use as data sources for SVG visualizations?" This may seem as if it puts the cart before the horse, but it is actually a very productive way to judge the extent to which a Web service is integrated with XML and Web technologies in general. If an XML data source works well with SVG then it will likely also work well with the DOM, XLink, XInclude, RDF and so forth. XML and the Web have an architecture, and Web services tend either to fit the architecture and work well with all of these technologies or to resist it and work poorly with all of them. This paper describes how to create services that integrate seamlessly.
Evolution of SVG Visualization Services
Atavi Genealogy Viewing Service
Client-side Bound Remote Data Style
Web-centric Web services
Designing a data source
SOAP/WSDL Request/Response Style
SOAP Data Source for SVG Service
Web Data Source for SVG Viewer
Comparing the approaches
Protocol ease of use
The Scalable Vector Graphics language promises to revolutionize the Web through the introduction of standards-based vector graphics, animation and interactivity. It is also being incorporated into many important non-Web applications from drawing programs to handheld phones to windowing systems and soon into printers.
One of the key applications of SVG is data visualization. SVG can be used to visualize everything from geographical information [CLEOPATRA] to genealogies [ATAVI] to census data [TAIWANCENUSUS] . These days, the data being visualized is often exposed to the visualization application as XML to loosen the coupling between the application and the database.
This paper will discuss strategies for using Web techniques and infrastructure to minimize the dependence of the visualizer on the interface to any particular data source. Following the guidelines described here will also open up a data source to integration with Web-centric specifications such as the DOM, XLink, XInclude, RDF and so forth.
The paper starts out by describing the architecture of SVG data visualization applications ("visualizers"). Then it will focus on the XML data-delivery part of the problem: how can developers connect the visualizer to its data source? SOAP and WSDL arise as obvious answers, but the paper will describe the impedance mismatch between SVG and these technologies. It will instead propose a more Web-centric approach to implementing Web services. This approach resolves the impedance mismatch, improves performance and is generally more elegant than current SOAP/WSDL techniques. As a bonus, the paper will delve into the deep reasons that SVG and the Web-centric approach work so well together.
The "Classic" style is the most common way to build SVG data visualizers. There is a back-end data source. That source can be queried to generate XML. The XML is fed to an XSLT transformation on the server side. Finally, the transformation generates SVG. The SVG is fed to the SVG/(X)HTML browser for user consumption. This is a powerful pipelined three-tier architecture (browser, transform, back-end data) with XML potentially involved in each step.
Atavi and Cleopatra are two interesting services in this style.
Atavi ( [ATAVI] ) is an SVG Web service that displays genealogical diagrams. It uses XSLT to transform XML genealogies into Scalable Vector Graphics (SVG) images. Each person has two resources, each with a URI. One resource shows the ancestors of the person and the other shows the descendants. For instance, http://mycgiserver.com/servlet/genviewer.RestServer/ancestors/I14 ( Figure 1 ) is the ancestors URI for George V (Elizabeth's grandfather) and http://mycgiserver.com/servlet/genviewer.RestServer/descendants/I14 ( Figure 2 ) is the descendants URI. Using this mechanism it is possible to navigate through a large series of related people. References to individual views can be bookmarked, emailed or stored in a database. If there were dozens of Atavi servers around the Internet, they could all link to each other, creating a network of genealogical database views of staggering size and complexity.
An Atavi genealogy consists of many such documents generated dynamically from XSLT. These could be saved to disk and re-generated every time the data file changed but it is much more space efficient to generate them on demand. A good compromise between space and CPU efficiency could be achieved by caching the most important views (e.g. Queen Elizabeth) leaving more obscure views to be generated on demand.
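The compromise described above, caching the most popular views while generating obscure ones on demand, can be sketched in a few lines. This is an illustrative sketch only; the function name and the trivial SVG body are invented, standing in for Atavi's real XSLT pipeline.

```python
from functools import lru_cache


@lru_cache(maxsize=128)  # keep up to 128 rendered views in memory
def render_ancestor_view(person_id: str) -> str:
    # A real service would run the XSLT transform over the genealogy
    # data here; a trivial SVG string stands in for that output.
    return (f'<svg xmlns="http://www.w3.org/2000/svg">'
            f'<text>ancestors of {person_id}</text></svg>')


# The first request for a view computes it; repeated requests for
# popular people (e.g. Queen Elizabeth) are served from the cache.
svg = render_ancestor_view("I14")
```

The `maxsize` parameter is exactly the space-versus-CPU dial the text describes: a larger cache trades disk or memory for fewer regenerations.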
In some circumstances the visualization service may wish to work with data that is not entirely stored in one database at one site, controlled by one organization. Rather the service may be used to visualize data spread out across the Web.
For instance, in a few years the Web may be full of genealogies expressed in GEDCOM 6 XML format on hundreds of web sites. It would be great to be able to view these in a future version of the Atavi genealogy visualizer. One solution might be to download them from their current hosts and upload them to the Atavi database. But this implies that the Atavi database has massive storage. That certainly is not the case today. It also means that it becomes the end-user's responsibility to download the data from its host site and upload it to Atavi. It would be much more convenient to just paste a URI into a form and have Atavi display it. In other words, Atavi could be constructed so that it could work with any URI-addressable genealogy database. Similarly, one could imagine pointing Cleopatra at URI-addressable GML databases and configuration files.
In this case, the service becomes a computational mediator between the browser client and the third-party data source. For instance, "foafnaut" [FOAF] is an SVG service for viewing "Friend Of A Friend" ("FOAF") relationships expressed in XML/RDF. It can incorporate any FOAF data into its database through a URI.
Another example of this style is the W3C's RDF graph visualization and validation service [RDFVALIDATOR] . It can display any RDF data source as a graph of arcs and nodes.
The remote data style is most appropriate for building views of data aggregated from multiple data sources. It is a very decentralized style wherein data providers can maintain their own fragment of the data with minimal coordination. Similarly, the creator of the visualizer does not have to coordinate with the maintainers of the data sources. The user can provide the connection between the two at runtime with a URI. The visualizer can create views of data sources that the programmer had never heard of. In addition, there could be many competing or complementary viewers for the same data sources.
The only thing that data providers and visualizer creators must coordinate (usually indirectly) is that they use the same data formats and protocols. For instance, a mediator that provides views of the connections between weblog posts could work with any weblog that adhered to the standard RSS format and Trackback protocol. Of course this "coordination" is usually done through standards bodies or ad hoc standardization.
Remote data does have trade-offs. First, it can worsen reliability and latency. A single client request depends upon the availability and performance of the mediator, the remote data source and the network connection between them.
Nevertheless, caching can be used to manage both latency and availability. As an analogy, it is relatively common to use the Google web page cache to retrieve pages that are either no longer available or have unacceptable latency. SVG mediators can also apply this technique. By choosing how much space to allocate to caching they can increase or decrease service reliability and performance at a cost of disk space.
Caching also has its own dangers. It must be done carefully; otherwise it introduces the possibility of stale data. This is a serious concern for some applications but not for others.
The SVG 1.2 specification points to another interesting model. SVG is increasingly capable of working with remote XML data sources from the client-side. SVG 1.2 has features like "getUrl", "parseXml" and RAX which are perfect for this sort of thing. This suggests that many mediators could be moved to the client side. One can imagine applications where the user enters a URI to start the viewer and then uses the viewer to browse a variety of URI-addressable third-party data sources.
An SVG visualizer could navigate from data source to data source, following links. Essentially it would behave as a browser within a browser! So a client-side Atavi would be a "genealogy browser" and a client-side Cleopatra would be a "map browser".
This architecture minimizes user-perceived latency and the load on the mediator's server. It can serve up the viewer application as a series of static documents. From then on, the client talks directly to the data sources. Once the user has downloaded some data, they can quickly switch between views of that data without requiring a page reload.
Before discussing data source design in detail, it is worth describing the architecture of the Web. For one thing, this is the environment SVG was designed to work in.
The Web's architecture was always designed to support a wide variety of data types and application requirements. Not only is it perfectly capable of carrying XML in general and SVG in particular, it is actually designed specifically to handle data types like XML which can be easily generated programmatically and are rich in hyperlinks. The Web's underlying architecture is termed REST (Representational State Transfer) and it is optimized for data publishing applications.
In the Web-centric style of computing, every logical object is given its own URI. Each object can describe itself by responding to the "GET" method. An HTTP "GET" request is said to return a "representation". As much as possible, these representations should be delivered in widely understood and deployed XML vocabularies. In other words, REST encourages people to use 1) standard addressing scheme, 2) a standard deployed protocol and 3) standardized XML vocabularies.
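The pattern above, one URI per logical object with GET returning an XML representation, can be made concrete with a small sketch. Everything here is hypothetical: the path layout, the element names and the sample data are invented for illustration, not taken from any published service.

```python
# Minimal illustration of the Web-centric style: each person is a
# resource with its own URI, and a conceptual GET on that URI returns
# an XML representation containing links to related resources.

PEOPLE = {
    "I14": {"name": "George V", "mother": "I15", "father": "I16"},
}


def get_representation(path: str) -> str:
    """Handle a conceptual GET /person/{id}, returning XML."""
    kind, _, person_id = path.lstrip("/").partition("/")
    if kind != "person" or person_id not in PEOPLE:
        raise KeyError(path)  # would surface as a 404 in real HTTP
    p = PEOPLE[person_id]
    return (
        f'<person id="{person_id}" name="{p["name"]}">'
        f'<mother href="/person/{p["mother"]}"/>'
        f'<father href="/person/{p["father"]}"/>'
        "</person>"
    )
```

The important design point is that the representation itself carries hyperlinks to other resources, so a client needs nothing beyond the vocabulary and a URI to keep navigating.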
Of these three goals, the most strict is the standardized addressing scheme: the Web's URI addressing scheme is sufficiently flexible that there is seldom a need to use something else. In fact, for the kind of applications described in this paper, it would be lunacy to invent a new addressing scheme competitive with URIs.
The most flexible of the guidelines is the encouragement to use standardized formats. Of course you will get the most bang for your standardization buck if you can use a vocabulary that has many tools developed for it (e.g. RSS, SVG, XHTML, OFX). But on the other hand, if you are the first person to try to build a web service for encoding computational astrological divinations, you will probably find that no existing XML vocabulary exactly meets your needs. Sometimes you need to innovate and in the REST model, data representation is where innovation happens. This is not unique to the REST model, by the way: good database designers spend most of their time worrying about data representation (as tables and columns) also. The difference between a financial database and a customer relationship database is primarily in the data representations possible.
But innovation for its own sake is not productive: smart developers only develop new data representations and XML vocabularies when older ones are unacceptable for one reason or another. If at all possible, the XML vocabularies used for the data source should not be specific to that data source but reusable across a variety of them. RSS is an example of an XML vocabulary that is used by thousands of different data sources from version control systems to newspapers to weblogs. RSS can even be used to deliver stock feeds. It is rare to find an XML vocabulary that is that reusable but XML vocabularies live on a spectrum of reusability and you should try to push your data source design as far as possible to the reusable side.
Putting this together: A Web-centric data source web service has the following properties: It uses URIs pervasively. It almost certainly uses HTTP. It uses standardized XML representations when possible but innovates where necessary.
We can see these forces at work in the Web-as-we-know-it. The most popular HTTP URI syntax has been essentially static for years. It just works and requires very little changing. Other URI schemes have arisen and fallen away but the main HTTP scheme remains dominant and relatively static. The HTTP protocol has grown and evolved more. HTTP 1.1 had some important performance improvements. Variants like WebDAV have arisen. HTTP does not change as quickly as newer protocols like SOAP or Jabber but it does evolve.
On the other hand, consider the velocity of change in the data HTTP transmits: HTML, XHTML, SVG, Flash, PDF, RDF, RSS, etc. The data representation realm bounces quickly between innovation and standardization. This is a Good Thing.
Applications consuming a REST service navigate from resource to resource in one of two ways. The first way is by following links (XLinks, in the case of SVG and many other XML vocabularies) from one resource to another. Each resource has a URI so it is easy to create these links.
<person>
  <mother href="http://./genealogy-server/person2.xml"/>
  <father href="http://./genealogy-server/person3.xml"/>
</person>
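A client that understands only this data representation can navigate purely by extracting the hrefs and dereferencing them. A sketch of that link-following step, using the document above:

```python
import xml.etree.ElementTree as ET

# The same person document as in the example above.
doc = """<person>
  <mother href="http://./genealogy-server/person2.xml"/>
  <father href="http://./genealogy-server/person3.xml"/>
</person>"""

root = ET.fromstring(doc)
# Each child element carries a link; these URIs are what the client
# would dereference next to continue navigating the genealogy.
links = [child.get("href") for child in root]
```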
The second navigational mechanism is to construct a query that searches for appropriate resources. Queries are usually done using "query parameters" embedded in URIs.
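Constructing such a query URI is a one-liner in most languages. In this sketch the base URI and parameter names are invented; only the mechanism (parameters URL-encoded into the query string) is the point.

```python
from urllib.parse import urlencode


def search_uri(base: str, **params: str) -> str:
    # Sort the parameters so the same query always yields the same URI,
    # which helps with caching and bookmarking.
    return base + "?" + urlencode(sorted(params.items()))


uri = search_uri("http://example.org/genealogy/search",
                 surname="Windsor", born_after="1850")
# The resulting URI is itself a resource: it can be bookmarked,
# emailed, or embedded in an XLink like any other URI.
```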
Queries have one huge advantage over navigation: they are typically much more efficient. For instance, one can find many web pages through the Google directory, which organizes them as a hierarchy of linked pages, or one can search Google using the HTML form (a query). It is clearly much quicker to find things through the query. But Google then returns a list of links, so you can see how these two mechanisms can work together.
But link-based navigation has advantages over queries. First, a client that follows links does not need to know the details of the query syntax, the types of the parameters and so forth. It is much easier and more common to standardize pure data representations than it is to standardize the full semantics of a dynamic service. The queries one site offers can differ substantially from the ones another site offers but their data representation can still be common.
Second, in order to do a query you need to know what question to ask. Directories like Yahoo and Google directories are still popular on the Web because if you do not know exactly what you are looking for, navigating around can help you figure it out. Sometimes a client application merely wants to know: "What is available?" The best REST services support both modes.
Ideally an XML data source will be designed so that it can easily support any of these three service styles. In particular, organizations in the business of publishing information should publish their information in a form that will allow third parties to provide a variety of illustrative views over the information. Providing a view is just one kind of processing: designing a data source for easy consumption by visualizers will usually also improve that site's availability to all sorts of applications.
The paper will compare two strategies for building these highly flexible services. First it will discuss the typical uses of the standards most strongly associated with Web services, SOAP and WSDL. Then it will propose an alternate model based upon the pre-existing Web architecture.
By design, SOAP and WSDL standards are extremely loose and "flexible" in what they allow. This makes it difficult to talk about what they can and cannot do. Nevertheless, there are some dominant patterns in the deployment of services based upon these standards.
The basic model of a traditional SOAP/WSDL web service is that the service is delivered through XML messages using SOAP envelopes. Each service has a single URI. These messages may or may not use the "SOAP Encoding" for the body of the messages. Although SOAP allows unidirectional, bidirectional and other forms of messaging, any particular service will typically work in only one mode, and in fact many SOAP implementations only support one mode. A request-response pattern is most common and is by far the simplest way to implement a data source service.
In the SOAP style of development, there are no "built-in" methods. Each site defines all methods from scratch. WSDL specifications describe the legal inputs and outputs to each method. In essence, SOAP and WSDL provide a substrate for defining your own domain-specific protocol.
How would a developer apply SOAP and WSDL to the problem of creating data sources for SVG views? First, define a protocol (service interface) for the data source. It should consist of an XML vocabulary for the data (described in XML Schema) and method names for retrieving it (described in WSDL). This would be implemented in an object oriented programming language and exposed through some kind of Web services toolkit.
Then, call this service from either a server-side mediator or a client-side SVG-based data browser application. At this point the developer would have to consider issues of compatibility between the client library and the server library. This would be especially worrisome for browser-based applications.
For brevity, this paper will not go into the detailed design of the WSDL interface and SOAP messages.
In the Web-centric style, our data source would consist of a series of URI-addressable virtual XML documents. Each would be connected to others using hyperlinks. For performance, the service would also have query interfaces. This interface could be implemented using all of the same tools and techniques used to build Web sites: Apache, IIS, Servlets, CGI, Squid, etc.
In the Web-centric architecture, we assign a distinct URI to each possible view of a data object. Just as with an application like Google, we do not actually generate a huge list of these URIs somewhere. Rather we encode queries in the URIs. For instance a URI for a map of a particular geographic region could encode the latitude and longitude of the top-left and bottom-right corners. The XML representation of each geographic resource could contain links to adjacent resources for adjacent geographic regions.
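The map example can be sketched concretely. The host name and parameter names below are invented for illustration; what matters is that the bounding box is encoded in the URI and that each region's representation can link to its neighbours.

```python
from urllib.parse import urlencode


def region_uri(top, left, bottom, right):
    """URI for the map view whose bounding box is given in degrees."""
    q = urlencode([("top", top), ("left", left),
                   ("bottom", bottom), ("right", right)])
    return "http://example.org/map?" + q


def neighbour_links(top, left, bottom, right):
    """URIs for the four adjacent regions, suitable for embedding as
    XLinks in the region's XML representation."""
    w, h = right - left, top - bottom
    return {
        "east":  region_uri(top, right, bottom, right + w),
        "west":  region_uri(top, left - w, bottom, left),
        "north": region_uri(top + h, left, top, right),
        "south": region_uri(bottom, left, bottom - h, right),
    }
```

No list of region URIs is ever materialized; the URI space is generated lazily by the query encoding, exactly as with a search engine.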
Now imagine two different types of clients. One knows both the query syntax and the data representation syntax. The other knows only the data representation syntax. The first can navigate based upon either queries or links. It can efficiently jump from one part of the map to a very distant part. The second client understands only the XML representation. It can navigate by following the links.
Note that the links have value even to a client that usually does queries. If parts of the map are uncharted, it would be inefficient to try each possible coordinate pair one by one. You could do a query to ask: "What map coordinates are available?" but the result of that query would likely be a list of links. This is even more clearly true in the Atavi and FOAF examples where the resources are clearly distinct (representing individual people).
Note that because SOAP-based web services so seldom have support for first-class links, they usually represent them as if they were queries: "give me the object with such and such an identifier". But that is ultimately just an obscure way of doing a link.
Now that we have described the alternatives, we can compare them according to a variety of factors.
Obviously there must be a communications channel between the viewer or mediator and the data source: a protocol. If we put aside the limitations of browsers, one could argue that any protocol is usable. It is merely a question of writing the code to use the protocol. But protocols live on a large spectrum of ease of development. Some are so difficult to work with that the problem is more like decrypting than coding. Others are easy to understand, already popular and widely deployed.
It would take much more space than is available here to discuss the issue of whether SOAP is an easy to use protocol. This is covered in all of its gory, controversial detail elsewhere. Nevertheless, there are a few issues specific to SVG that deserve special consideration.
First, it seems likely that XML data sources may already embed SVG fragments. For instance, a data source representing corporations might include corporate logos and stock charts. But SOAP is actually a poor way to transmit SVG. Compare the effort required to use SVG over SOAP versus over simple HTTP. Over SOAP, the SVG message is mixed into the same XML DOM as the SOAP envelope. There is no way to ask an SVG viewer to render this SVG inside of the envelope. You must separate it out into its own DOM and/or merge the elements into your own DOM. In contrast, if SVG is delivered through a simple HTTP request then it can be incorporated into a client-side view with a single DOM mutation (e.g. changing the "href" attribute on a "use" element).
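The "single DOM mutation" point can be illustrated with a sketch. In a browser this would be one line of ECMAScript against the live SVG DOM; here Python's minidom stands in, and the document content and logo URI are invented for the example.

```python
import xml.dom.minidom as minidom

# A client-side view whose <use> element will pull in a remotely
# served SVG fragment by reference.
view = minidom.parseString(
    '<svg xmlns="http://www.w3.org/2000/svg" '
    'xmlns:xlink="http://www.w3.org/1999/xlink">'
    '<use id="logo" xlink:href=""/>'
    '</svg>'
)

use = view.getElementsByTagName("use")[0]
# One mutation is all it takes to incorporate an HTTP-delivered SVG
# resource into the view; no envelope unwrapping or DOM merging needed.
use.setAttribute("xlink:href", "http://example.org/corp/acme/logo.svg#logo")
```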
Calling the service from the server-side mediator could be easy or difficult depending on the programming language it is implemented in. XSLT certainly cannot make SOAP calls directly so we will have to rely on some other language like Java, Python or Perl. Because there are varying versions and variants of SOAP, we may need to be careful about our choice of language and toolkit.
If you choose to use SOAP to deliver your data, you will have to define your own SOAP-based protocol. You will have to choose whether it is RPC-style or not, what its underlying protocol(s) will be, what its method names are, what their parameters are and so forth. This increases the difficulty of communicating the interface of your service to potential clients. It also means that if your protocol is to be better than HTTP you will likely need to be a fairly sophisticated protocol designer!
There must be an easy way to tell the SVG data visualizer what data source to use. Ideally this would be done through some form of URI. But what does the URI identify? The details of the data source design matter quite a bit. Consider a data source like today's UDDI repository. The repository stores a variety of different object types: tModel objects, businessEntity objects and so forth. A particular UDDI repository could have thousands or tens of thousands of these objects. The problem is that all of these objects share a single URI address. In order to differentiate between them you need both the URI and a Globally Unique Identifier (GUID).
It is essentially impossible to tell current SOAP and WSDL-based tools to generate a separate URI per data object. This is the reason that SOAP based services do not take a more intelligent approach to URIs. It is frankly a mystery why WSDL and its toolkits are so poor in this area but in the writer's opinion, the weakness emanates more from the way the standards evolved than from any conscious technical decisions.
In contrast, the design of dynamic Web development tools is organized around the need to give URIs to data objects. It does not matter whether the tool is a J2EE Servlet Engine, Python's Zope or Perl's Mason.
The data source should be fragmented into (virtual) XML documents that can be reasonably downloaded at runtime. If a data source is a single monolithic document then downloading it will cause unacceptable latencies no matter what the architecture. It stands to reason that each fragment must be individually addressable or at least accessible through some sort of query.
Either SOAP or HTTP can be used to transmit highly granular XML fragments. But there is a difference in how they approach it. Given a URI, it is always possible to dereference it without caring about the context of the referenced resource. For instance, the CNet site may be organized totally differently than the Slashdot site, and they may have totally different search (query) facilities, but given a link to an RSS document you can put it in any RSS syndicator and that syndicator will do the right thing. The context (CNet on Vignette vs. Slashdot on Slashcode) is not relevant. The individual RSS data object's interface is simple, well-defined and not tightly integrated with the interface for the rest of the system.
But SOAP/WSDL sites invariably combine the data fetching protocol with the query protocol into a single monolithic interface. Once again, this is because of the design of WSDL and of WSDL-based tools. As described before, it is essentially impossible to directly address individual data objects so they must always be accessed through queries. Similarly, every SOAP service uses a unique, service-specific addressing mechanism so that links between these fragments are service-specific instead of generic as is the case with URIs/XLink.
The XML data source must be designed so that it can scale to its likely user base. This can often be handled by throwing copious amounts of hardware at the problem but proper design can reduce the cost of this investment. In particular, query result caching can reduce the computational load.
The Web architecture has strong support for caching. To start with, HTTP defines the cacheability of each of its methods explicitly. GET is cacheable (unless you are told otherwise explicitly). POST is not (ditto). Furthermore, HTTP has explicit cache control headers so that a service can say whether any particular response is cacheable regardless of the method. HTTP also has the "HEAD" method, which allows the client or a caching intermediary to ask whether some data has changed without requesting a re-send of the data.
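A caching intermediary's decision procedure follows directly from those rules. This is a deliberately simplified sketch (real HTTP caching has many more directives and subtleties), but it shows the two-level logic: explicit Cache-Control directives override the default cacheability implied by the method.

```python
def is_cacheable(method: str, headers: dict) -> bool:
    """Simplified HTTP cacheability check for a response."""
    cache_control = headers.get("Cache-Control", "").lower()
    if "no-store" in cache_control or "no-cache" in cache_control:
        return False  # the service explicitly opted out
    if "public" in cache_control or "max-age" in cache_control:
        return True   # the service explicitly opted in
    # Default inferred from the method alone: GET is cacheable,
    # POST and other methods are not.
    return method == "GET"
```

Nothing comparable exists for a SOAP endpoint, where every method is just an opaque, service-defined name.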
On the other hand, SOAP does not have well-known methods like "GET" and "POST". This means that there is no way for a client to infer cacheability from the method name. There is also no standard way to transmit that information in either a WSDL specification or a SOAP message.
It should be clear by now that an approach built upon Web Architecture (REST) has many technical advantages over a SOAP/WSDL approach. Beyond this there are the many non-technical benefits of using tools that are mature, like Apache, and technologies that are well-understood, like HTTP.
The Web was designed to be a universal interface to data sources. It was intended that these data sources would be exposed to standards-based clients using a well-defined addressing model (URIs), a well-defined protocol (HTTP) and a variety of standard or proprietary data representations with hyperlinks between them. This is also a perfect model for Web services that will feed data to SVG visualization programs.