SVG Bioinformatics Collaboration Tool

Keywords: Visualization, Collaboration, Annotation, SVG, XML, Bioinformatics

W. Rutherford, Ph.D.
Research Faculty
Group for Advanced Information Technology
Vancouver
British Columbia
Canada
wrutherf@bcit.ca
http://www.rutherford-research.ca

Biography

William Rutherford received his Ph.D. in Electrical and Computer Engineering from the University of Alberta in 1995, and his M.A.Sc. in Electrical and Computer Engineering in 1986 from the University of British Columbia. His Ph.D. thesis, on design and simulation of compound semiconductor based heterostructure devices, was written while William was director of Rutherford Research between 1989 and 1995. He has spent fourteen years as director of Rutherford Research with offices in Edmonton and Vancouver.

R. Nikolic, M.A.
Research Associate
Group for Advanced Information Technology
Vancouver
British Columbia
Canada
rnikolic@bcit.ca
http://www.bcit.ca

Biography

Radina has extensive professional experience in Information Technology. She has proven invaluable as a team member of the CA*net 3 Internet Route Registry Project and also the Genops Bioinformatics Inc. project. Her wide range of experience encompasses business and systems analysis, application design, database design, project management, software development and implementation, technical and user documentation, and training. Prior to joining us, Radina worked on complex Electronic Commerce projects, where she gained in-depth knowledge and experience in many aspects of Electronic Commerce, particularly in Electronic Data Interchange.

Radina holds a BA and MA in Economics from the University of Belgrade, Yugoslavia. She is also an Oracle Certified Professional and her main areas of interest now include Distributed Parallel Systems, Software Systems Architecture and Internet Engineering.

B. Sayo, CST
Research Assistant
Group for Advanced Information Technology
Vancouver
British Columbia
Canada
bsayo@bcit.ca
http://www.bcit.ca

Biography

Ben is an excellent problem-solver with a background in data communications, network design and advanced system administration. He has proven invaluable as a team member of the CA*net 3 Internet Route Registry Project and also the Genops Bioinformatics Inc. project. For development, his languages of choice are often Java and C++; however, he is also adept with VB and MS-Access for prototyping and has experience developing web applications with ASP/JSP. His programming experience includes working with the Win32 API, Winsock 2.0, Berkeley Sockets and X Window System Motif. Ben has also worked with JNI and Netscape's LDAP API. Ben was actively involved in the early stages of setting up the GAIT Internet Engineering Lab at the BCIT Technology Centre.

Ben is comfortable with all phases of the software development life cycle. He enjoys a challenge and strives continually to find the best solutions.


Abstract


The presentation will be on the design and implementation of a bioinformatics collaboration, annotation and visualization tool based on SVG. Open standards allow data to be freely exchanged between systems while maintaining its semantic integrity. This paper outlines the design and development approaches we took, functionality of the tool, lessons we learned, as well as some of the remaining issues we continue to research in order to build more efficient SVG solutions.

The input to the tool is an XML file compliant with the NCBI Blast Output DTD for both amino acid and nucleic acid sequences. The resulting SVG file is a standalone application to which several authors can successively add sequence annotations and comments, each saving their changes for the others to view. The file is created in a scalable manner by streaming over the XML input with SAX, while the event handler tree streams out and accumulates the components of the SVG file's visualization, navigation and mouse-driven information systems. Interactivity is achieved through ECMAScript and a custom naming system for the SVG file's data objects. The main view window of the tool allows the user to display sequence search results in graphical format, rank ordered by HSP score, and to zoom to the underlying textual sequence information as required.

Because the output from the BLAST search is sometimes a disordered array of XML-formatted hits, the generation software, written in Java, does a preliminary sort using a limited DOM representation of the critical parameters only (HSPs). This process is iterated over an arbitrary number of input XML files related by a discovery tree, in which a result sequence of a previous search can be used as the input parameter for a subsequent search. The result is several panels of graphical data, each limited in the number of hits displayed by a cutoff parameter. A panel can be selected either through a separate navigation tree that is part of the SVG module or by clicking a special embedded icon, which displays the search descendant as an overlay structure; overlays can be nested in the same way to arbitrary depth.

The top-level SVG file is composed of a number of sub-files, some of which are static and others of which are generated by the handler tree. The top two SVG windows contain the initial search sequence, which is annotatable in detail and can be zoomed to the underlying textual reference sequence data. The window directly below this is the current reference sequence for the particular search result selected. The window to the right of this is an information display block that is populated by mouse events generated as the pointer passes over the related structure in any of the other sub-windows.

The right mouse button is used to access a context-sensitive popup menu, which is customized for each sub-window. This menu can be used to save the state of the entire SVG file, including current views and all annotations. The annotation popup is accessed from the context-sensitive popup menu while in the top window containing the initial search sequence. This modal dialogue allows one to embed colour-coded scalable symbols by numerical sequence index range in the reference panel and to include an attached text title, comments and the researcher's initials.

Numerous scalability problems were encountered in the course of the project including the arbitrarily large number of input search result files and the number of hits in each of these. Iteration over the SVG elements in the file is also a significant scalability issue due to sluggish performance as the number of elements increases. In particular zooming from graphical to textual representation consumes more resources and causes a delay. The performance of scrolling and panning also suffers somewhat when in text view. We are currently conducting research on some of these issues.


Table of Contents


1. Introduction
2. Design Approach
3. Implementation
4. Functionality
5. Future Research
Bibliography

1. Introduction

The Group for Advanced Information Technology (GAIT) was contracted to develop a self-contained visualization front-end module for an existing thin client GUI as part of an integrated sequence analysis platform. This module would preferably be able to graphically portray complex sequence relationships [1] called homology search results in a graphical context called an annotation composer with the option of zooming to the underlying detail.

The presentation format would be on a pseudo 3D grid with the homology results arranged in rank order of significance from most to least, by tier, with an optionally limited number of complex search hits per tier. Notably a key requirement was that it should be able to preserve copies of itself and also be self-contained so that it could be easily shared between collaborators.

The main goal was to develop, in a reasonable time frame and hence at reasonable cost, the source code to automatically generate the self-contained module from a collection of source files produced by the launcher and the underlying Unix cluster used as a search engine. The data sets available from the currently selected homology search sequences required a navigation tree, and so the self-contained module needed to replicate this functionality. One of the main collaboration requirements was the ability to annotate the source sequence with respect to the search result sequences, so that several participants could make comments and share the results easily.

The source files passed from the launch context include synthetically combined XML-formatted versions of the homology sequence search output from a search engine, for example BLAST (Basic Local Alignment Search Tool) from NCBI [2] and several others along similar lines [1] [3]. The automatic generation of graphical content therefore involved a preliminary XML sort with summary parameterization, followed by parsing to SVG graphical elements [4] with optional factors [5], including related information for mouse-over and zooming. Given this, the primary goal of maintaining scalability for possibly very large input files required parsimonious use of preserved state in the sorting model.

There are bioinformatics tools [1] [2] that accomplish similar graphical representation; however, they do not support the combined level of navigation, overlay, interaction, zooming, panning and mouse-over information tagging that is possible using SVG with ECMAScript [6] [7] [8]. A further primary benefit of this tool is that the SVG context can save its state, including all of the researchers' annotations in detail by search sequence range, combined with the view selected at the time of saving. In summary, this tool offers features that are novel and unique in the bioinformatics research community.

Please note that this preliminary tool is a very rough prototype that only barely implements the basic functionality for proof of concept. It will be expanded by the contracting organization's developers at a later date.

It should also be noted that, while one may wish to run the tool on a PC, it can also be run on an X Window client for a supercomputer, which changes the scalability considerations somewhat favorably.

2. Design Approach

The launch context was implemented in Java, and as such we chose to extend this model by using a Java based subsystem to generate and store the standalone SVG application.

Initially the main design challenge was to develop a preliminary SAX-style tag event handler tree for multiple XML input files compliant with the DTDs for both amino acid and nucleic acid sequences [2]. On the fly, the handler tree created output component files and parameters that could later be used to generate the entire standalone SVG application.

In retrospect, it seems the most critical part of the design was breaking down the final SVG application into fixed and generated components. Because some of the fixed components reference items in the generated components, the developers jointly agreed on indexed naming conventions that are both machine and human readable, which helped to simplify the overall structure.
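As an illustration of what an indexed, machine and human readable naming convention can look like, here is a minimal Java sketch. The paper does not give the actual scheme, so the id layout `s<search>_h<hit>_p<hsp>` and the class name `ElementId` are assumptions made purely for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Hypothetical id scheme "s<search>_h<hit>_p<hsp>", for example "s0_h12_p3".
 * The layout is an assumption for illustration, not the tool's actual convention.
 */
final class ElementId {
    private static final Pattern FORMAT = Pattern.compile("s(\\d+)_h(\\d+)_p(\\d+)");

    final int search; // index of the search result file (panel)
    final int hit;    // rank of the hit within that search
    final int hsp;    // index of the HSP within that hit

    ElementId(int search, int hit, int hsp) {
        if (search < 0 || hit < 0 || hsp < 0) {
            throw new IllegalArgumentException("indices must be non-negative");
        }
        this.search = search;
        this.hit = hit;
        this.hsp = hsp;
    }

    /** Builds the id string written into a generated SVG fragment. */
    String format() {
        return "s" + search + "_h" + hit + "_p" + hsp;
    }

    /** Recovers the indices from an id; rejects strings outside the scheme. */
    static ElementId parse(String id) {
        Matcher m = FORMAT.matcher(id);
        if (!m.matches()) {
            throw new IllegalArgumentException("not an element id: " + id);
        }
        return new ElementId(Integer.parseInt(m.group(1)),
                             Integer.parseInt(m.group(2)),
                             Integer.parseInt(m.group(3)));
    }
}
```

With a scheme of this kind, a fixed script component can compute the id of a generated element from its indices alone, for instance `new ElementId(0, 12, 3).format()`, without the generator having to emit a lookup table.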

The file components were broken out first into immutable items, followed by regularly changeable items. In all, a minimum of 15 separate files are produced in various forms during the sorting and parsing runs, then recombined at the end using parametric indexing into a single SVG application. Notably, all of the ECMAScript is collected in one file, with most of the other files forming concatenation-ready SVG fragments. For a complex series of homology sequence searches, the number of intermediate files increases linearly, with a sorted and transformed version produced for each additional search engine result.

This provides an arbitrary number of intermediate SVG files of graphical data, each limited in the number of hits displayed by a cutoff parameter. A file can be selected either through a separate navigation tree that is part of the SVG module or by clicking a special embedded icon, which displays the search descendant as an overlay structure; overlays can be nested in the same way to arbitrary depth.
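The relationship between successive searches and the navigation tree can be pictured with a small tree structure. The following Java sketch is only an assumption about how such a discovery tree might be modelled; the class name `SearchNode` and the file names used with it are hypothetical and do not come from the tool's source.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical node of the discovery tree: one search result file per node. */
final class SearchNode {
    final String resultFile;
    private final List<SearchNode> children = new ArrayList<>();

    SearchNode(String resultFile) {
        this.resultFile = resultFile;
    }

    /** Registers a follow-up search seeded by one of this search's result sequences. */
    SearchNode addChild(String childResultFile) {
        SearchNode child = new SearchNode(childResultFile);
        children.add(child);
        return child;
    }

    /** Depth-first listing, one entry per panel, indented two spaces per nesting level. */
    List<String> navigationEntries() {
        List<String> entries = new ArrayList<>();
        collect(this, 0, entries);
        return entries;
    }

    private static void collect(SearchNode node, int depth, List<String> entries) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            line.append("  ");
        }
        entries.add(line.append(node.resultFile).toString());
        for (SearchNode child : node.children) {
            collect(child, depth + 1, entries);
        }
    }
}
```

The same depth-first order could drive both the navigation tree entries and the nesting of overlay panels, since a child panel is only reachable through its parent.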

The resulting SVG Bioinformatics Collaboration Tool document is composed of the minimal set of file fragments indicated in Figure 1, assuming only one XML search engine result file. Please note that the static hard coded or fixed file segments are colored in blue and the other, white ones, are dynamically generated, depending on the section of the homology discovery tree selected in the launch context.

fig01.png

Figure 1: SVG Application Components
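The recombination of the script file and the concatenation-ready fragments into one SVG document can be sketched as follows. This is a minimal Java illustration under stated assumptions: the class name `SvgAssembler`, the fixed root element attributes and the ordering of the parts are hypothetical, and the real tool combines at least 15 files using parametric indexing that is not reproduced here.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.util.List;

/** Hypothetical assembler: streams a script and SVG fragments into one document. */
final class SvgAssembler {
    static void assemble(Reader script, List<? extends Reader> fragments, Writer out)
            throws IOException {
        out.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        out.write("<svg xmlns=\"http://www.w3.org/2000/svg\" width=\"100%\" height=\"100%\">\n");
        // All ECMAScript lives in one file; CDATA avoids escaping '<' and '&'.
        // The script itself must not contain the sequence "]]>".
        out.write("<script type=\"text/ecmascript\"><![CDATA[\n");
        copy(script, out);
        out.write("\n]]></script>\n");
        // Fragments are appended in the order given, one after another.
        for (Reader fragment : fragments) {
            copy(fragment, out);
            out.write("\n");
        }
        out.write("</svg>\n");
    }

    /** Copies through a fixed-size buffer so memory use does not grow with file size. */
    private static void copy(Reader in, Writer out) throws IOException {
        char[] buffer = new char[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    }
}
```

Because each part is copied through a small buffer, the memory needed for assembly stays constant regardless of how many hits the generated fragments contain.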

3. Implementation

The input to the tool is an array of XML files compliant with the NCBI Blast Output DTD [2] for both amino acid and nucleic acid sequences, or other closely related types. The intermediate files are created in a scalable manner by applying the well-known streaming model to each XML input file in turn, with SAX-based tag parsing driving the event handler tree to stream out and accumulate the components of the SVG file's visualization, navigation and mouse-driven information systems. The top abstract class of the handler tree is shown in Figure 2.

fig02.png

Figure 2: BlastHandler Abstract Class
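To make the streaming model concrete, the following Java sketch shows a single SAX handler that emits one SVG rectangle per HSP as the parser passes over the input. It is not the BlastHandler class of Figure 2: the class name `HspRectHandler`, the id layout, the row height and the choice of one `<rect>` per HSP are all assumptions for illustration. The element names `Hit`, `Hsp`, `Hsp_query-from` and `Hsp_query-to` are those of the NCBI BLAST XML output.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.Writer;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: streams one SVG rect per HSP without building a DOM. */
final class HspRectHandler extends DefaultHandler {
    private static final int ROW_HEIGHT = 12; // assumed vertical spacing per hit

    private final int searchIndex;
    private final Writer out;
    private final StringBuilder text = new StringBuilder();
    private int hitIndex = -1;
    private int hspIndex;
    private int queryFrom;
    private int queryTo;

    HspRectHandler(int searchIndex, Writer out) {
        this.searchIndex = searchIndex;
        this.out = out;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0); // begin collecting character data for this element
        if (qName.equals("Hit")) {
            hitIndex++;
            hspIndex = 0;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (qName.equals("Hsp_query-from")) {
            queryFrom = Integer.parseInt(text.toString().trim());
        } else if (qName.equals("Hsp_query-to")) {
            queryTo = Integer.parseInt(text.toString().trim());
        } else if (qName.equals("Hsp")) {
            writeRect();
            hspIndex++;
        }
    }

    /** Writes the rectangle for the HSP that has just ended. */
    private void writeRect() throws SAXException {
        int x = Math.min(queryFrom, queryTo);
        int width = Math.abs(queryTo - queryFrom) + 1;
        try {
            out.write("<rect id=\"s" + searchIndex + "_h" + hitIndex + "_p" + hspIndex
                    + "\" x=\"" + x + "\" y=\"" + (hitIndex * ROW_HEIGHT)
                    + "\" width=\"" + width + "\" height=\"8\"/>\n");
        } catch (IOException e) {
            throw new SAXException(e);
        }
    }

    /** Parses one BLAST XML stream and writes the resulting fragment to out. */
    static void convert(int searchIndex, InputStream blastXml, Writer out)
            throws ParserConfigurationException, SAXException, IOException {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(blastXml, new HspRectHandler(searchIndex, out));
    }
}
```

Only a handful of integers are held between events, so memory use is independent of the number of hits. Real BLAST output files usually carry a DOCTYPE declaration, and a parser may try to fetch the referenced DTD; supplying an `EntityResolver` is one way to control that, although it is left out of this sketch.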

Interactivity is achieved through embedded ECMAScript combined with a custom indexed naming convention for the SVG file's data objects. An example script function is shown in Figure 3. The main view window of the tool allows the user to display sequence search results in graphical format, rank ordered by numerical sequence similarity score, and to zoom in or out arbitrarily to the underlying textual sequence information as required.

Because the output of the BLAST search, run on a cluster search engine, is a disordered array of XML-formatted hits, the generation software does a preliminary sort. The key factor is that the sort is done on a PC, not a supercomputer or cluster, so we minimize persistent state by using a very limited DOM representation of the critical parameters only to form a sorted set, and then dynamically stream the resultant XML file to the next process in the pipe. This general process is iterated over an arbitrary number of input XML files, related by a successive homology search discovery tree [9], where a result sequence of a previous search can be used as the reference input parameter for a subsequent search, rather than the initial reference sequence.
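The idea of sorting on a minimal summary rather than on full hit records can be sketched in Java as below. The names `HitSummary` and `HitRanker`, the choice of the best HSP score as the single sort key and the tie-breaking rule are assumptions for illustration; the paper states only that a very limited representation of the critical parameters is kept and that a cutoff limits the hits displayed.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical summary: the only per-hit state kept in memory for sorting. */
final class HitSummary {
    final int inputIndex;   // position of the hit in the unsorted input file
    final double bestScore; // highest HSP score seen for the hit

    HitSummary(int inputIndex, double bestScore) {
        this.inputIndex = inputIndex;
        this.bestScore = bestScore;
    }
}

final class HitRanker {
    /** Returns at most cutoff summaries, highest score first; ties keep input order. */
    static List<HitSummary> rank(List<HitSummary> hits, int cutoff) {
        if (cutoff < 0) {
            throw new IllegalArgumentException("cutoff must be non-negative");
        }
        List<HitSummary> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble((HitSummary h) -> h.bestScore)
                .reversed()
                .thenComparingInt(h -> h.inputIndex));
        return new ArrayList<>(sorted.subList(0, Math.min(cutoff, sorted.size())));
    }
}
```

The returned list of input positions is all a second streaming pass would need in order to write the hits out in rank order, which keeps the retained state to two numbers per hit.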

fig03.png

Figure 3: ECMA Script Example

4. Functionality

In the application window, indicated in Figure 4, the top SVG frame always contains the initial search sequence and can easily be zoomed to the underlying textual reference sequence data.

fig04.png

Figure 4: View Of SVG Bioinformatics Collaboration Tool When First Opened

The window directly below the initial search sequence shows the current reference search sequence for the particular search result selected in the navigation tree on the lower right side. The window at the top right is an information display block that is dynamically populated when the pointer enters the hot area associated with a structure in any of the other sub-windows, assuming there is a valid information frame entry.

Although the initial design used a drop-down menu in the top frame, we found this to be somewhat problematic and opted instead for a more elegant context-sensitive popup menu activated by the right mouse button, indicated in Figure 5, which is customized for each sub-window.

Notably the commands available on the right click menu, indicated in Figure 5, allow the user to save the state of the entire SVG file including currently selected views and all accumulated annotations.

fig05.png

Figure 5: Zoomed Initial Search Sequence and Right Click Menu

Considerable developer time and effort was spent coding the annotation popup modal dialogue, indicated in Figure 6, which is accessed by the context sensitive right click popup menu if one is in the top window containing the initial search sequence.

This popup modal dialogue allows one to embed colour-coded scalable symbols, indicated in Figure 7, in the initial search sequence reference panel, positioned by the integer index range of the nucleic or amino acid residues of interest.
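The mapping from a sequence index range to a scalable symbol can be illustrated with a short Java sketch. In the tool itself the annotation is created by script inside the SVG file, so this is only an illustration of the arithmetic and markup involved; the class name `AnnotationSymbol`, the 1-based inclusive range, the fixed symbol height and the use of `<title>` and `<desc>` for the attached text are assumptions.

```java
/** Hypothetical builder for an annotation symbol covering residues from..to. */
final class AnnotationSymbol {
    static String toSvg(int from, int to, double unitsPerResidue, String colour,
                        String title, String comment, String initials) {
        if (from < 1 || to < from) {
            throw new IllegalArgumentException("expected 1 <= from <= to");
        }
        // Residue i occupies [(i - 1) * unit, i * unit) along the x axis.
        double x = (from - 1) * unitsPerResidue;
        double width = (to - from + 1) * unitsPerResidue;
        return "<g class=\"annotation\">"
                + "<title>" + escape(title) + "</title>"
                + "<desc>" + escape(comment) + " (" + escape(initials) + ")</desc>"
                + "<rect x=\"" + x + "\" y=\"0\" width=\"" + width
                + "\" height=\"6\" fill=\"" + escape(colour) + "\"/>"
                + "</g>";
    }

    /** Escapes the characters that are unsafe in XML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }
}
```

Because the position is expressed in user units derived from residue indices, the symbol scales with the sequence when the view is zoomed, and the text carried in the group remains available to whatever displays the information frame.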

fig06.png

Figure 6: Modal Dialogue Annotation

When the initial sequence annotation symbol is in place the attached text title is visible and any information frame items such as comments or initials of the current researcher/collaborator for reference by subsequent collaborators are easily displayed by pointing to the annotation.

fig07.png

Figure 7: Annotation Symbol

Once the annotations are entered, it is necessary to save the state of the entire file, indicated in Figure 8, possibly using a versioned SVG file naming convention, in order to preserve the annotations and current view selections. The lower text-level zoom, also indicated in Figure 8, shows the same alignment analysis as the graphical summary, but at the level of a sequence-to-sequence mapping. This allows the user to view the actual relative alignment of the homology sequence search results at the text level.

Once the desired annotations are saved, possibly together with the current view at the zoom level of interest, the entire SVG application file can be sent to another collaborator with a text message indicating the homology context and purpose of the search, if these are not self-evident from the search names in the navigation tree.

fig08.png

Figure 8: Zoom To Text and Save SVG As...

Advanced functionality also includes the ability to graphically overlay related searches from an embedded triangular icon at the end of the grid line on which the homology sequence search result hit components reside.

5. Future Research

Numerous scalability problems were encountered in the course of the project, primarily the arbitrarily large number of input search result files of arbitrary length sequences.

Iteration over the SVG graphic elements is a significant scalability issue on personal computer architectures because graphics rendering can become sluggish as the number of elements grows relative to the limited processor throughput.

In particular, zooming from a graphical sequence element to the textual nucleic acid or amino acid representation consumes more resources and causes a significant rendering delay.

This is mitigated to some degree, however, for the X terminal client on a supercomputer with numerous processors, and as such we are proceeding with preliminary testing.

The ultimate, long term, goal of this research, from our perspective, is to port the integrated functionality implied by life sciences homology sequence research to a novel portable collaborative environment with standalone and supercomputer capability from the SVG application level.

Clearly as the focus in the bioinformatics community transitions to real physics structural models [10] the ability to supply a convincing 3D rendition combined with interactive capability will be extremely valuable. This is particularly true for the myriad of bioinformatics related database front-end web servers that have appeared in the last decade or so.

Bibliography

[1]
BAXEVANIS, A. D., OUELLETTE, B. F. F., Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, 2001, New York, John Wiley and Sons, Inc.
[2]
National Center for Biotechnology Information, www.ncbi.nlm.nih.gov
[3]
GRANT, G. R., EWENS, W. J., Statistical Methods in Bioinformatics, 2001, Springer Verlag
[4]
FERRAIOLO, J., editor, Scalable Vector Graphics (SVG) 1.0 Specification, W3C Recommendation, 4 September 2001, www.w3.org/TR/SVG/
[5]
EISENBERG, J. D., SVG Essentials, 2002, Sebastopol, CA, O'Reilly and Associates, Inc.
[6]
FLANAGAN, D., JavaScript: The Definitive Guide, 2002, Sebastopol, CA, O'Reilly and Associates, Inc.
[7]
CAGLE, K., SVG Programming: The Graphical Web, 2002, Apress
[8]
WATT, A., Designing SVG Web Graphics, 2001, Prentice Hall
[9]
BANASZAK, L. J., Foundations of Structural Biology, 2000, Academic Press
[10]
BOURNE, P. E., WEISSIG, H., Structural Bioinformatics, 2003, New York, John Wiley and Sons, Inc.
