\documentclass[acmnow]{acmtrans2m}
\usepackage{graphicx}

\newtheorem{theorem}{Theorem}[section]
\newtheorem{conjecture}[theorem]{Conjecture}
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{lemma}[theorem]{Lemma}
\newdef{definition}[theorem]{Definition}
\newdef{remark}[theorem]{Remark}

\markboth{Nigel Stanger}{...}

\title{Scalability of Techniques for Online Geovisualization of Web Site Hits}
\author{NIGEL STANGER \\ University of Otago}

\begin{abstract}
A useful approach to visualizing the geographical distribution of web site hits is to geolocate the IP addresses of hits and plot them on a world map. This can be achieved by dynamic generation and display of map images at the server and/or the client. This paper compares the scalability with respect to source data size of four techniques for dynamic map generation and display: generating a single composite map image, overlaying transparent images on an underlying base map, overlaying CSS-enabled HTML on an underlying base map and generating a map using Google Maps. These four techniques embody a mixture of different display technologies and distribution styles. The results show that all four techniques are suitable for small data sets, but that the latter two techniques scale poorly to larger data sets.
\end{abstract}

\category{C.4}{Performance of Systems}{Performance attributes}
\category{C.2.4}{Computer-Communication Networks}{Distributed Systems}[distributed applications]
\category{H.3.5}{Information Storage and Retrieval}{Online Information Services}[web-based services]
\terms{Experimentation, Measurement, Performance}
\keywords{downloads, geolocation, geovisualization, scalability, Google Maps, distribution style, dynamic map generation}

\begin{document}
\bibliographystyle{acmtrans}
\begin{bottomstuff}
Author's address: N. Stanger, Department of Information Science, University of Otago, PO Box 56, Dunedin 9054, New Zealand.
\end{bottomstuff} \maketitle \section{Introduction} \label{sec-introduction} When administering a web site, it is quite reasonable to want information on the nature of traffic to the site. Information on the geographic sources of traffic can be particularly useful in the right context. For example, an e-commerce site might wish to determine the geographical distribution of visitors to the site, so as to decide where best to target marketing resources. One approach to doing so is to plot geographical locations on a map. Geographical information systems (GIS) were already being used for these kinds of purposes prior to the advent of the World Wide Web \cite{Beau-JR-1991-GIS}, and it is a natural extension to apply these ideas to online visualization of web site hits. The author's interest in this area derives from implementing a pilot digital institutional repository for the University of Otago School of Business\footnote{\url{http://eprints.otago.ac.nz/}} in November 2005 \cite{Stan-N-2006-running}, using the GNU EPrints\footnote{\url{http://www.eprints.org/}} repository management software. This repository quickly attracted interest from around the world and the number of abstract views and document downloads began to steadily increase. There was great interest within the University in tracking this increase, particularly with respect to where in the world the hits were coming from. The EPrints statistics management software developed at the University of Tasmania \cite{Sale-A-2006-stats} proved very useful in this regard, providing detailed per-eprint and per-country download statistics; an example of the latter is shown in Figure~\ref{fig-tas-stats}. However, while this display provides an ordered ranking of the number of hits from each country, it does not provide any greater detail beyond the country level, nor does it provide any visual clues as to the spatial distribution of hit sources around the globe. 
\begin{figure} \centering \includegraphics[width=\textwidth,keepaspectratio]{tasmania_stats} \caption{A portion of the by-country display for the Otago EPrints repository, generated by the Tasmania statistics software \protect\cite{Sale-A-2006-stats}.} \label{fig-tas-stats} \end{figure} The author therefore began to explore possible techniques for plotting repository hit data onto a world map, with the aim of adding this capability to the Tasmania statistics package. Preference was given to techniques that could be used within a modern web browser without the need to manually install additional client software, so as to make the new feature available to the widest possible audience and reduce the impact of wide variation in client hardware and software environments \cite[pp.\ 27--28]{Offu-J-2002-quality}. There have been several prior efforts to geovisualize web activity. \citeN{Lamm-SE-1996-webvis} developed a sophisticated system for real-time visualization of web traffic on a 3D globe, but this was intended for use within a virtual reality environment, thus limiting its general applicability. \citeN{Papa-N-1998-Palantir} described a similar system called \emph{Palantir}, which was written as a Java applet and thus able to be run within a web browser, assuming that a Java virtual machine was available. \citeN[pp.\ 100--103]{Dodg-M-2001-cybermap} describe these and several other related systems for mapping Web and Internet traffic. These early systems suffered from a distinct limitation in that there was no public infrastructure in place for geolocating IP addresses (that is, translating them into latitude/longitude coordinates). They generally used \texttt{whois} lookups or parsed the domain name in an attempt to guess the country of origin, with fairly crude results \cite{Lamm-SE-1996-webvis}. 
Locations outside the United States were typically aggregated by country and mapped to the capital city \cite{Lamm-SE-1996-webvis,Papa-N-1998-Palantir,Jian-B-2000-cybermap}. Reasonably accurate and detailed databases were commercially available at the time \cite[p.\ 1466]{Lamm-SE-1996-webvis}, but were not generally available to the public at large, thus limiting their utility. The situation has improved considerably in the last five years, however, with the advent of freely available and reasonably accurate geolocation services\footnote{Such as \url{http://www.maxmind.com/} or \url{http://www.ip2location.com/}.} with worldwide coverage and city-level resolution. For example, Maxmind's \emph{GeoLite City} database is freely available and claims to provide ``60\% accuracy on a city level for the US within a 25 mile radius'' \cite{Maxm-G-2006-GeoLiteCity}. Their commercial \emph{GeoIP City} database claims 80\% accuracy for the same parameters. The techniques used by these prior systems can generally be divided into two classes. The first class of techniques generate a single bitmap image that contains both the map and the graphics representing web hits. This can be achieved by programmatically plotting points onto a base map image; the composite image is then displayed at the client. This class of techniques shall henceforth be referred to as \emph{single-layer} techniques. The second class of techniques separately return both a base map image and some kind of overlay containing the plotted points. The overlay and the base map are then displayed as separate items at the client. This class of techniques shall henceforth be referred to as \emph{multi-layer} techniques. Both classes of techniques have been used in the aforementioned systems, but multi-layer techniques appear to have been the most prevalent. 
For example, Palantir used a multi-layer technique, in which a Java applet running at the client overlaid graphic elements onto a base map image retrieved from the now-defunct Xerox online map server \cite{Papa-N-1998-Palantir}. A more recent example is the Google Maps API \cite{Goog-M-2006-maps}, which enables web developers to easily embed interactive maps within web pages. Google Maps is a dynamic multi-layer technique that has only become feasible relatively recently with the advent of widespread support for CSS positioning and Ajax technologies \cite{Garr-JJ-2005-Ajax} in many browsers. Multi-layer techniques enjoy a particular advantage over single-layer techniques, in that they provide the potential for a more flexible GIS-like interaction with the map, with multiple layers that can be activated and deactivated as desired. This flexibility could explain why such techniques appear more prevalent in the literature. As will be seen shortly, however, web-based multi-layer techniques tend to rely either on the installation of additional client-side software, or on more recent web technologies such as CSS and Ajax. Single-layer techniques, in contrast, typically do not rely on these things. Single-layer techniques should therefore be portable to a wider range of client and server environments. Each map generation and display technique comprises a specific technology or collection of technologies (such as transparent bitmap overlays + CSS positioning), implemented using a specific distribution style. For example, a particular single-layer technique might be implemented completely server-side while another might use a mixture of server-side and client-side processing. Similarly, multi-layer techniques may adopt different distribution styles, and the overlays themselves might take the form of transparent images, absolutely positioned HTML elements, dynamically generated graphics, etc. 
Given the wide variety of possible techniques that were available, the next question was which techniques would be most suitable? Ideally, a technique should not only efficiently fulfil the task of plotting repository hits on a map, but also provide tangible benefits to end-users. Scalability is a key issue for web applications in general \cite[p.\ 28]{Offu-J-2002-quality}, and for online activity visualization in particular \cite[p.\ 50]{Eick-SG-2001-sitevis}, so techniques that could scale to a large number of points were of particular interest. For example, at the time of writing the Otago EPrints repository had been accessed from 13,000 distinct IP addresses, each potentially representing a distinct geographical location. Separating out the type of hit (abstract view versus document download) increased that figure to nearly 16,000. Informal testing with these data suggested that a single-layer composite map image would perform well with this volume of data, taking at most a few seconds to load and display. Conversely, it appeared that Google Maps would not perform well, taking on the order of minutes to load and display a map containing a few thousand points. The range of techniques was first narrowed down to just four (server-side image generation, server-side image overlay, server-side HTML overlay and Google Maps); the selection process and details of the techniques chosen are discussed in Section~\ref{sec-techniques}. The scalability of these four techniques was then tested to determine how well each technique handled large numbers of points. A series of experiments was conducted on each technique with progressively larger synthetic data sets, and the data set volume, elapsed time and memory usage were measured. The experimental design is discussed in Section~\ref{sec-experiment}. 
Informal tests suggested that the server-side image generation and the server-side image overlay techniques would scale best, and this was borne out by the results of the experiments, which show that both techniques scale reasonably well to very large numbers of points. The other two techniques proved to be reasonable for relatively small numbers of points (generally less than about 500--1,000), but their performance deteriorated rapidly beyond this. The results are discussed in more detail in Section~\ref{sec-results}. It should be noted that the intent of the experiments was not to identify statistically significant differences in performance across the four techniques. It was expected that variations across techniques would be reasonably clear-cut, and the experiments were designed to test this expectation. However, the two best performing techniques, server-side image generation and server-side image overlay, produced very similar results, so a more formal statistical analysis of these techniques may be warranted. This and other possible future directions are discussed in Section~\ref{sec-conclusion}. \section{Technique selection} \label{sec-techniques} In this section the four techniques that were chosen for testing are discussed in more detail, along with the reasons for choosing these particular techniques. First, the impact of distribution style on the choice of technique is discussed. This is followed by an examination of how each technique works in practice, its implementation requirements, its relative advantages and disadvantages, and any other issues peculiar to the technique. \subsection{Distribution style} \label{sec-distribution} \citeN{Wood-J-1996-vis} and \citeN{MacE-AM-1998-GIS} identified four distribution styles for web-based geographic visualization software. The \emph{data server} style is where the server only supplies raw data, and all manipulation, display and analysis takes place at the client. 
In other words, this is primarily a client-side processing model, as illustrated in Figure~\ref{fig-distribution-styles}(a). For example, Palantir implemented a multi-layer technique using this distribution style \cite{Papa-N-1998-Palantir}, where the source data were generated at the server and the map was generated, displayed and manipulated by a Java applet running at the client. The data server distribution style can provide a very dynamic and interactive environment to the end user, but clearly requires support for executing application code within the web browser, typically using something like JavaScript, Java applets or Flash. JavaScript is now tightly integrated into most browsers, but the same cannot be said for either Java or Flash. That is, the existence of a Java virtual machine or Flash plugin cannot necessarily be guaranteed in every browser, which violates the requirement to avoid manual installation of additional client-side software. Java- or Flash-based data server techniques can therefore be eliminated from consideration, but JavaScript-based data server techniques are feasible. Indeed, Google Maps is an example of such a technique (see Section~\ref{sec-overlay}). \begin{figure} \centering \begin{tabular}{ccc} \includegraphics[scale=0.9]{data_server} & \qquad & \includegraphics[scale=0.9]{image_server} \\ \footnotesize (a) Data server & \qquad & \footnotesize (b) Image server \\ \\ \\ \includegraphics[scale=0.9]{model_interaction} & \qquad & \includegraphics[scale=0.9]{shared} \\ \footnotesize (c) Model interaction environment & \qquad & \footnotesize (d) Shared environment \\ \end{tabular} \caption{Distribution styles for web-based geographic visualization \protect\cite{Wood-J-1996-vis}. (F = filtering, M = mapping, R = rendering.)} \label{fig-distribution-styles} \end{figure} Conversely, the \emph{image server} style is where the display is created and manipulated entirely at the server and is only passively viewed at the client. 
In other words, this is primarily a server-side processing model, as illustrated in Figure~\ref{fig-distribution-styles}(b). Consequently, techniques that use this style require no additional client-side software. The downside is that the resultant visualization can tend to be very static and non-interactive in nature, as it is typically just a simple bitmap image. The \emph{model interaction environment} style is where a model created at the server can be explored at the client, as illustrated in Figure~\ref{fig-distribution-styles}(c). \citeN{MacE-AM-1998-GIS} calls this the ``3D model interaction'' style, but this seems slightly out of place in the current context. \citeN{Wood-J-1996-vis} originally intended this distribution style to apply to VRML models for GIS applications, but it could be equally applied to any situation where an interactive model is generated at the server, then downloaded to and manipulated at the client. This is very similar to what happens with many Flash-based applications, for example. ``Model interaction environment'' therefore seems a more appropriate name for this style. The key distinguishing feature of this style is that there is no further interaction between the client and server after the model has been downloaded. This means that while the downloaded model can be very dynamic and interactive, changing the underlying data requires a new model to be generated at the server and downloaded to the client. Similar restrictions apply to techniques using this style as to the data server style, so Java- and Flash-based model interaction environment techniques can be eliminated from consideration. For similar reasons, solutions such as VRML or SVG that require external browser plugins can also be eliminated (although native support for SVG is beginning to appear in some browsers). It may be possible to implement this distribution style using only client-side JavaScript, but it is presently unclear as to how effective this might be. 
Finally, the \emph{shared environment} style is where data manipulation is done at the server, but control of that manipulation, and rendering and display all occur at the client, as illustrated in Figure~\ref{fig-distribution-styles}(d). This is similar to the model interaction environment style, but with the addition of a feedback loop from the client to the server, thus enabling a more flexible and dynamic interaction. Ajax technologies can easily support this kind of distribution style. For example, \citeN{Saya-A-2006-GISWS} use Ajax to integrate Google Maps with existing GIS visualization web services. Specific shared environment techniques can be eliminated from consideration based on the same criteria as were applied to the other three styles (e.g., no Java- or Flash-based techniques). \subsection{Single-layer techniques} \label{sec-image-gen} As noted earlier, single-layer techniques work by directly plotting geolocated IP addresses onto a base map image, then displaying the composite image at the client. A typical example of the kind of output that might be produced is shown in Figure~\ref{fig-image}. Such techniques require two specific components: software to programmatically create and manipulate bitmap images (for example, the GD image library\footnote{\url{http://www.boutell.com/gd/}}); and software to transform latitude/longitude coordinates into projected map coordinates on the base map (for example, the PROJ.4 cartographic projections library\footnote{\url{http://www.remotesensing.org/proj/}}). \begin{figure} \centering \includegraphics[width=\textwidth,keepaspectratio]{ImageGeneration-full} \caption{Sample output from the (single-layer) server-side image generation technique.} \label{fig-image} \end{figure} Single-layer techniques could use any of the distribution styles discussed in Section~\ref{sec-distribution}. 
However, all but the image server style would require the installation of additional client-side software for generating images and performing cartographic projection operations, so only single-layer techniques that use the image server distribution style (or \textbf{server-side image generation} techniques) are considered here. For this research, the author chose a representative server-side image generation technique based on the GD and PROJ.4 libraries. Server-side image generation techniques provide some distinct advantages. They are relatively simple to implement and are fast at producing the final image, mainly because they use existing, well-established technologies. They are also bandwidth efficient, because the size of the generated map image is determined by its pixel dimensions and the compression method used, rather than by the number of points to be plotted. The amount of data to be sent to the client should therefore remain more or less constant, regardless of the number of points plotted. Server-side image generation techniques also have some disadvantages, however. First, a suitable base map image must be acquired. This could be generated from a GIS, but if this is not an option an appropriate image must be obtained from a third party. Care must be taken in the latter case to avoid copyright issues. Second, the compression method used to produce the final composite map image can have a significant impact on visual quality. For example, lossy compression methods such as JPEG can make the points plotted on the map appear distinctly fuzzy or ``muddy'', even at high quality levels. Lossless compression methods such as PNG avoid this problem, but may produce larger files for the same image. Finally, it is harder to provide interactive map manipulation features with server-side image generation techniques, as the output is a simple static image. 
Anything that changes the content of the map (such as panning or changing the visibility of certain points) will require the entire image to be regenerated. Zooming could be achieved if a very high resolution base map image was available, but the number of possible zoom levels is likely to be restricted. \subsection{Multi-layer techniques} \label{sec-overlay} Multi-layer techniques also involve plotting points onto a base map image, but they differ from single-layer techniques in that the points are not plotted directly onto the base map image. Rather, the points are displayed as an independent overlay on top of the base map image. This provides a significant advantage over single-layer techniques, as it enables the possibility of multiple independent layers that can be individually shown or hidden. This is very similar to the multi-layer functionality provided by a GIS, and is an effective way to provide interactive visualizations of geographic data \cite{Wood-J-1996-vis,MacE-AM-1998-GIS}. There is still the problem of finding a suitable base map image, however. Until relatively recently, implementing multi-layer techniques would likely have required additional software to be installed at the client, but most modern browsers now support absolute positioning of HTML elements using CSS. This enables the creation of a map overlay using nothing more than HTML, CSS and a few bitmap images. The author has identified two main alternatives for producing such an overlay, which can be termed \emph{image overlay} and \emph{HTML overlay}. An image overlay comprises a transparent bitmap image into which the points are plotted, which is then overlaid on the base map image (in the author's implementation, the output looks essentially identical to that shown in Figure~\ref{fig-image} on page~\pageref{fig-image}). This requires the overlay image to be in either PNG or GIF format, as JPEG does not support transparency. 
The overlay image is likely to contain considerable ``white space'', which compresses very well, so use of a lossless compression method should not be an issue. This also eliminates the image quality issue noted earlier. The size of the image overlay will generally be proportional to the number of points to be plotted, but the image compression should moderate this. As noted in Section~\ref{sec-image-gen}, generating images at the client would require additional software to be installed, so only the image server distribution style will be considered here for image overlays (i.e., \textbf{server-side image overlay}). That is, both the base map image and the overlay(s) are generated at the server.

An HTML overlay comprises a collection of HTML elements corresponding to the points to be plotted, which are positioned over the base map image using CSS absolute positioning. There is considerable flexibility as to the types of elements that could be used to construct the overlay. One possibility is to use \verb|<IMG>| elements to place icons on the base map, which appears to be the approach adopted by Google Maps (see Figure~\ref{fig-google}). Another possibility is to use appropriately sized and colored \verb|<DIV>| elements, which then appear as colored blocks ``floating'' over the base map image (in the author's implementation, the output looks essentially identical to that shown in Figure~\ref{fig-image} on page~\pageref{fig-image}).

\begin{figure}
\centering
\includegraphics[width=\textwidth,keepaspectratio]{GoogleMap-full.png}
\caption{Sample output from the (multi-layer) Google Maps technique.}
\label{fig-google}
\end{figure}

HTML overlays may be generated at either the server or the client. Unlike the techniques discussed previously, however, HTML overlays can be generated at the client without the need for additional software, because only HTML (i.e., text) is being generated, not images.
This can be easily achieved using client-side JavaScript, so HTML overlays can use any of the distribution styles discussed in Section~\ref{sec-distribution} without violating the requirement to avoid additional client-side software. Two representative HTML overlay techniques have thus been adopted for the experiments: \textbf{server-side HTML overlays} (using the image server distribution style) and \textbf{Google Maps} (using the data server distribution style). Since Google Maps uses \verb|<IMG>| elements, \verb|<DIV>| elements have been used for the server-side HTML overlay. Server-side HTML overlays are actually slightly simpler to implement than either server-side image generation or image overlays, because it is not necessary to write any code to generate or manipulate images (the base map image is static and thus requires no additional processing). All that is required is code to transform latitude/longitude coordinates into projected map coordinates and generate corresponding \verb|<DIV>| elements. Google Maps \cite{Goog-M-2006-maps} is a more complex proposition. This technique uses the data server distribution style, where JavaScript code running within the browser enables the client to manipulate the base map and its overlays. Data and map images are requested asynchronously from the server as required using Ajax technologies, which seems to imply that Google Maps in fact uses the shared environment distribution style. However, the server has no involvement beyond simply supplying data to the client. In the shared environment distribution style, the server is directly involved in manipulating the map, under the control of the client. This is clearly not the case with Google Maps. The primary advantage of Google Maps is the powerful and compelling functionality it provides for generating and interacting with the map. Users may pan the map in any direction and zoom to many different levels of detail. A satellite imagery view is also available. 
In addition, further information about each point plotted (such as the name of the city) can be displayed in a callout attached to the point, as shown in Figure~\ref{fig-google}. Google Maps also has a proven record for visualization of network resources. For example, \citeN{Gibb-H-2006-Gridscape} use Google Maps to visualize and manage world-wide computing grids.

However, there are also some significant disadvantages to the Google Maps technique\footnote{Interestingly, the Google Earth application addresses many of these issues, but since it is not a browser-based solution it falls outside the scope of this research.}. First, it is a distributed application, thus making it more complex to implement, test and debug \cite{Bates-PC-1995-distdebug,Ensl-PH-1978-distributed}. Second, the server must have a registered API key from Google, which is verified every time that a page attempts to use the API. Similarly, the client must connect to Google's servers in order to download the API's JavaScript source. This means that the technique requires an active Internet connection in order to work. Finally, the Google Maps API does not currently provide any way to toggle the visibility of markers on the map, so it is not possible to implement the interactive ``layers'' mentioned at the start of this section. (It is possible, of course, that Google may implement this feature in a future version of the API.)

The most significant disadvantage of all HTML overlay techniques, however, is that the size of the HTML overlay is directly proportional to the number of points to be plotted. There will be one overlay element (\verb|<DIV>| or \verb|<IMG>|) per point, so a very large number of points will produce a correspondingly large amount of HTML source. It is expected that this will lead to excessive browser memory usage for large data sets, and consequently that these techniques will not scale well at the high end.
However, they may still be appropriate for smaller data sets that require interactive manipulation. \section{Experimental design} \label{sec-experiment} After some preliminary testing with live data from the Otago School of Business repository, a series of experiments was undertaken to test the scalability of the four chosen techniques. Each technique was tested using the same collection of progressively larger synthetic data sets. The first data set comprised one point at the South Pole. A regular grid of points at one degree intervals was then constructed by progressively incrementing the latitude and longitude, with each data set being twice the size of its predecessor. A total of twenty-one data sets were created in this way, with the number of points ranging from one to 1,048,576 (\(=2^{20}\)). The result of plotting the 16,384-point data set is shown in Figure~\ref{fig-grid-points}. The grid spacing used meant that 64,800 points were sufficient to fill the entire map, so the five largest data sets had many duplicate points. This does not affect the results of the experiments, however, as it is the total number of points that is significant, not their location. \begin{figure} \centering \includegraphics[width=\textwidth,keepaspectratio]{16384_points} \caption{The 16,384-point data set plotted on the base map.} \label{fig-grid-points} \end{figure} The focus on scalability meant that the primary measures of interest were page load time, memory usage and the volume of data generated (which impacts on both storage and network bandwidth). Page load time can be further broken down into the time taken to generate the map data, the time taken to transfer the map data and other ancillary material to the client across the network, and the time taken by the client to display the map. Unfortunately, as noted in Section~\ref{sec-overlay}, the Google Maps technique requires an active Internet connection, so the experiments could not be run on an isolated network. 
This meant that traffic on the local network was a potential confounding factor. It was therefore decided to eliminate network performance from the equation by running both the server and the client on the same machine\footnote{A Power Macintosh G5 1.8\,GHz (single processor) with 1\,GB RAM, running Mac OS X 10.4.7, Apache 2.0.55, PHP 4.4 and Perl 5.8.6.}. This in turn enabled independent measurement of the times for data generation and page display, thus simplifying the process of data collection and also ensuring that the client and server processes did not unduly interfere with each other, despite running on the same machine. It could be argued that network performance would still have a confounding effect on the Google Maps technique, but this would only be likely for the initial download of the API (comprising about 235\,kB of JavaScript source and images), which would be locally cached thereafter. The API key verification does occur every time a map is loaded, but the amount of data involved is very small, so it is less likely that this would be significantly affected by network performance. Any such effect would also be immediately obvious as it would simply block the server from proceeding. For each data set generated, its size, the time taken to generate it, the time taken to display the resultant map in the browser, and the amount of real and virtual memory used by the browser during the test were recorded. It was also intended to measure the memory usage of the server, but this proved more difficult to isolate than expected, and was thus dropped from the experiments. The data set generation time and browser memory usage were measured using the \texttt{time} and \texttt{top} utilities respectively (the latter was run after each test run to avoid interference). 
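
The construction of the synthetic data sets described at the start of this section can be sketched in a few lines of Python. This is an illustrative sketch only, not the scripts used in the experiments: the names \verb|grid_points|, \verb|data_set|, \verb|GRID_CELLS| and \verb|SIZES|, the row-by-row traversal order and the starting longitude of \(-180\) are all assumptions, since the text specifies only the starting point, the one-degree spacing and the doubling of data set sizes.

\begin{verbatim}
from itertools import cycle, islice

# Assumption: latitudes -90..89 and longitudes -180..179,
# giving 180 * 360 = 64,800 distinct grid points.
GRID_CELLS = 180 * 360

# Twenty-one data set sizes: 1, 2, 4, ..., 2**20.
SIZES = [2 ** k for k in range(21)]


def grid_points():
    """Yield (lat, lon) pairs on a one-degree grid,
    starting at the South Pole (hypothetical ordering)."""
    for lat in range(-90, 90):
        for lon in range(-180, 180):
            yield (lat, lon)


def data_set(n):
    """Return n points, wrapping around the grid (and so
    producing duplicates) once all grid points are used."""
    return list(islice(cycle(grid_points()), n))


if __name__ == "__main__":
    for n in SIZES:
        distinct = min(n, GRID_CELLS)
        print(n, "points,", distinct, "distinct")
\end{verbatim}

Under these assumptions the grid holds \(360 \times 180 = 64{,}800\) distinct points, so the five data sets from \(2^{16}\) to \(2^{20}\) points necessarily repeat locations, which is consistent with the description above.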
The map display time was measured using the ``page load test'' debugging feature of Apple's Safari web browser, which can repetitively load a set of pages while recording various statistics, in particular the time taken to load the page. Tests were run up to twenty times each where feasible, in order to reduce the impact of random variations. Some tests were run fewer times because they took an excessive amount of time to complete. Further testing for a particular technique was generally halted when a single test run took longer than about five minutes, as by this stage performance had already deteriorated well beyond usable levels. The web browser was quit and reloaded afresh before each group of tests. \subsection{Technique implementation} As noted in Sections~\ref{sec-image-gen} and \ref{sec-overlay}, the server-side image generation, server-side image overlay and server-side HTML overlay techniques were all implemented using the image server distribution style. A separate dispatcher page was written in PHP for each technique, which enabled arguments---such as the number of points to be plotted---to be passed from the client to a corresponding Perl script for each technique. The final page was then constructed as follows: \begin{description} \item[server-side image generation] The dispatcher page included a standard \verb|<IMG>| element that called the Perl script. This script loaded a base map PNG image, plotted points directly onto it, and returned the composite map to the client as a JPEG image (with the ``quality'' parameter set to 90). \item[server-side image overlay] The dispatcher page included two \verb|<IMG>| elements, the first for the base map and the second for the overlay, both with identical CSS positioning attributes. The first \verb|<IMG>| simply loaded a static JPEG image representing the base map. The second \verb|<IMG>| called the Perl script, which generated and returned the overlay as a transparent PNG image. 
\item[server-side HTML overlay] The dispatcher page included an \verb|<IMG>| element for the base map and a \verb|<DIV>| element for the overlay, both with identical CSS positioning attributes. The \verb|<IMG>| simply loaded a static JPEG image representing the base map. The \verb|<DIV>| contained inline PHP code that called the Perl script. This in turn generated and returned the overlay as a collection of CSS-positioned \verb|<DIV>| elements, nested within the top-level \verb|<DIV>| element. \end{description} For all of these techniques, the base map image was 1,024 \(\times\) 520 pixels. In PNG format it occupied approximately 1.2\,MB (but this version was never returned to the client), while in JPEG format (Q=90) it occupied approximately 180\,kB. The base map image was derived from an original 3,599 \(\times\) 1,826 pixel image, which was part of a collection of maps released into the public domain by the \citeN{CIA-WFB-2006}. All three techniques used the PROJ.4 cartographic projections library to convert latitude/longitude pairs into projected map coordinates, while the first two techniques also used the GD graphics library to programmatically generate and manipulate images. The Google Maps technique was implemented using the data server distribution style. As with the other three techniques, a PHP dispatcher page was used. This time, however, the page included client-side JavaScript code to load and initialize the Google Maps API, create the base map, and build the map overlay. The first two steps were achieved using standard Google Maps API calls. For the last step, the client used an \texttt{XMLHttpRequest} object to asynchronously call a server-side Perl script, which generated and returned to the client an XML data set containing the points to be plotted. The client then looped through this data set and used Google Maps API calls to create a marker on the base map corresponding to each point. 
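
The sizes of the synthetic data sets described at the start of this section can be summarized in a short worked calculation. This is only a sketch: the symbols \(k\), \(n_k\) and \(N_{\mathrm{grid}}\) are introduced here purely for illustration, and the factorization of 64,800 as \(360 \times 180\) is an assumption about how the one-degree grid was laid out.
\[
n_k = 2^{k}, \quad k = 0, 1, \ldots, 20, \qquad N_{\mathrm{grid}} = 360 \times 180 = 64{,}800 .
\]
Since
\[
2^{15} = 32{,}768 < 64{,}800 < 65{,}536 = 2^{16},
\]
the data sets with \(k \le 15\) fit within the grid without repetition, whereas the five data sets with \(k = 16, \ldots, 20\) necessarily revisit grid positions. For the largest data set, each grid position would be plotted about \(2^{20} / 64{,}800 \approx 16\) times on average, assuming that the points wrap around the grid evenly.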
\section{Results} \label{sec-results}

As noted in the introduction, the intent of these experiments was not to do a full analysis and statistical comparison of the performance of the different techniques, but rather to identify broad trends. No statistical analysis has therefore been carried out on the results. The remainder of this section discusses the results for data size, page load time and memory usage. Because the number of points in each data set increases in powers of two, log-log scales have been used for all plots.

\subsection{Data size}

For each data set, the data generated by the server was saved to a file and its size in bytes recorded. In the case of the server-side image generation and server-side image overlay techniques, the file comprised a bitmap image, whereas for the server-side HTML overlay and Google Maps techniques, the file comprised HTML or XML text, respectively. In addition to the generated data, there was a certain amount of fixed overhead associated with each technique tested, as summarized in Table~\ref{tab-overhead}. This overhead comprised static files that were always downloaded to the client regardless of the number of points to be plotted. Examples of fixed overhead items include the base map image, various icons, the PHP source of the dispatcher page and the JavaScript source for the Google Maps API.

\begin{acmtable}{11cm}
\centering
\begin{tabular}{lll}
Technique & Fixed overhead & Content \\
\hline
Server-side image generation & 629\,bytes & \textbf{dispatcher (PHP)}\medskip \\
Server-side image overlay & \(\approx\) 181\,kB & dispatcher (PHP) \\
 & & \textbf{base map image (JPEG)}\medskip \\
Server-side HTML overlay & \(\approx\) 181\,kB & dispatcher (PHP) \\
 & & \textbf{base map image (JPEG)}\medskip \\
Google Maps & \(\approx\) 235\,kB & dispatcher (PHP) \\
 & & base map image tiles (PNG) \\
 & & \textbf{API (JavaScript)} \\
 & & various icons (PNG) \\
\end{tabular}
\caption{Fixed overhead for each technique.
The largest contributing item for each technique is shown in \textbf{bold face}.} \label{tab-overhead} \end{acmtable} \begin{figure} \centering \includegraphics[scale=0.55]{data_size} \caption{Comparison of generated data size for each technique (log-log scale).} \label{fig-data-size} \end{figure} The volume of data generated by each technique, including fixed overhead, is shown in Figure~\ref{fig-data-size}. It is immediately apparent from these results that there is a divergence between the two techniques that generate images (server-side image generation and server-side image overlay), and the two techniques that generate text (server-side HTML overlay and Google Maps). Both the server-side image generation and server-side image overlay techniques scale particularly well with regard to the amount of data generated. Interestingly, the amount of data generated by the image generation technique increases by about 8\,kB up to the 8,192-point data set, but then \emph{drops} by about 90\,kB over the next three data sets. This occurs because the number of points plotted has become sufficient to cover most of the base map. This means that a large portion of the composite map image is a single color (see Figure~\ref{fig-grid-points} on page~\pageref{fig-grid-points} for an example), which compresses more efficiently. The amount of data generated by the image overlay technique appears constant, but actually increases by about 2\,kB across the range. This has important implications for the ability of this technique to handle multiple layers. Because the overlay images are quite small (less than 2\,kB for up to one million points), it should be feasible to pre-load several overlay images into a client-side array and switch them on and off as desired. 
The server-side HTML overlay and Google Maps techniques clearly do not scale well with respect to data size, and begin to visibly diverge from the other two techniques once the amount of data generated exceeds about 5\% of the fixed overhead. For the HTML overlay technique this occurs somewhere between 64 and 128 points, whereas for Google Maps it occurs somewhere between 256 and 512 points. The divergence increases rapidly for both techniques beyond these points, with the HTML overlay technique suffering the most. This is because the HTML overlay technique needs to generate additional CSS attributes (i.e., more text) in order to correctly position the \verb|<DIV>| elements, whereas the Google Maps technique needs only to return a more compact list of latitude/longitude coordinates.

\subsection{Page load time}

For each test run, both the time taken to generate the data at the server and the time taken to display the page in the client browser were recorded. The former is illustrated in Figure~\ref{fig-data-generation-time} and the latter in Figure~\ref{fig-page-load-time}. The combined time (data generation + display time) is shown in Figure~\ref{fig-combined-time}.

\subsubsection{Data generation time}

\begin{figure} \centering \includegraphics[scale=0.55]{data_generation_time} \caption{Comparison of data generation time for each technique (log-log scale).} \label{fig-data-generation-time} \end{figure}

The results (see Figure~\ref{fig-data-generation-time}) show that the length of time taken to generate the source data increases in proportion to the number of points to be plotted, as expected. It is interesting to note the differences in data generation time for each technique, however. Data generation for both of the ``text-based'' techniques (HTML overlay and Google Maps) is generally faster than for the ``image-based'' techniques (image generation and image overlay).
The results show that server-side image generation generally takes the longest to generate its data. This is because it not only has to map points from latitude/longitude into projected map coordinates, but also must plot these points onto the base map image, then compress the composite image as a JPEG. The image to be compressed is also moderately complex, which only adds to the data generation time. Server-side image overlay performs somewhat better because it uses a less complex compression method (PNG) and the image to be compressed is much simpler (a collection of colored points on a blank background). The server-side HTML overlay technique appears faster at generating data than either of the two image-based techniques at the low end, but is similar in performance at the high end. In this technique the server only needs to map latitude/longitude to projected map coordinates; no images need to be generated and there is no compression to deal with. At the high end, however, this advantage is clearly offset by the significant volume of data being generated. Google Maps is faster again, because almost all processing is carried out on the client; the server's only involvement is to generate a simple list of latitude/longitude coordinates. In terms of data generation, it appears that all techniques tested scale reasonably well. The image-based techniques perform worse at the low end because they involve more complex processing than the text-based techniques, but this is offset at the high end by the relatively constant amount of data generated. Conversely, the text-based techniques perform better at the low end, but are negatively impacted at the high end by the sheer volume of data produced (tens or hundreds of megabytes vs.\ hundreds of kilobytes). 
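
The choice of log-log axes makes the proportional growth reported above easy to recognize. As a brief sketch, suppose that the generation time \(T\) for \(n\) points follows a power law with constant \(c\) and exponent \(a\); these symbols are illustrative assumptions and have not been fitted to the measured data.
\[
T(n) = c\,n^{a} \quad\Longrightarrow\quad \log T = \log c + a \log n .
\]
On log-log axes such a relationship appears as a straight line with slope \(a\), and strictly proportional growth corresponds to \(a = 1\). Because successive data sets double in size, they are evenly spaced along the horizontal axis, and under this model each doubling multiplies \(T\) by a constant factor:
\[
\frac{T(2n)}{T(n)} = 2^{a} .
\]
A curve that is flat at the low end and approaches unit slope at the high end would be consistent with a fixed per-request cost \(T_0\) plus a per-point cost, that is, \(T(n) \approx T_0 + c\,n\), although this form is offered only as a possible reading of the plots and not as a fitted model.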
\subsubsection{Map display time} \begin{figure} \centering \includegraphics[scale=0.55]{page_load_time} \caption{Comparison of map display time for each technique (log-log scale).} \label{fig-page-load-time} \end{figure} These results (see Figure~\ref{fig-page-load-time}) reveal quite a spectacular difference between the image-based and text-based techniques. The time taken to display the map is essentially constant for both of the image-based techniques, regardless of the number of points to be plotted. This is not surprising given that the size of the generated data is also essentially constant, and that the browser is simply loading and displaying static images. The image overlay technique appears slightly slower than the image generation technique. This is probably because the image overlay technique has to load two images from the server (the base map and the overlay), compared to one image for the image generation technique. In contrast, the text-based techniques clearly do not scale well with respect to map display time. Google Maps suffers particularly in this regard, with display time exceeding ten seconds shortly past 512 points. Testing was abandoned at 4,096 points, with a single test run taking over seven minutes. The HTML overlay technique fares better, exceeding ten seconds somewhere between 4,096 and 8,192 points. Testing was abandoned at 32,768 points, with a single test run taking almost ten minutes. \subsubsection{Combined time} \begin{figure} \centering \includegraphics[scale=0.55]{combined_time} \caption{Comparison of combined page load time for each technique (log-log scale).} \label{fig-combined-time} \end{figure} Combining the data generation and map display times (see Figure~\ref{fig-combined-time}) yields little change in the curves for the text-based techniques, because the data generation times are very small compared to the map display times. 
There is a more obvious impact on the image-based techniques, with both techniques remaining more or less constant up to about 2,048 points, then slowing as the number of points increases beyond that. However, the slowdown is nowhere near as dramatic as for the text-based techniques; even the largest data set only takes about nineteen seconds overall. The image overlay technique does display a slight advantage of about half a second over the image generation technique for the largest data set, but further experiments will be required to determine whether this difference is statistically significant. \subsection{Memory usage} Both the real and virtual memory usage of the browser were measured by running the \texttt{top} utility after each test run and observing the memory usage in each category. This provided the size of both the current ``working set'' and the total memory footprint of the browser process after it had completed a test run. The real memory results are shown in Figure~\ref{fig-real-memory} and the virtual memory results are shown in Figure~\ref{fig-virtual-memory}. \begin{figure} \centering \includegraphics[scale=0.55]{real_memory} \caption{Comparison of browser real memory usage for each technique (log-log scale).} \label{fig-real-memory} \end{figure} \begin{figure} \centering \includegraphics[scale=0.55]{virtual_memory} \caption{Comparison of browser virtual memory usage for each technique (log-log scale).} \label{fig-virtual-memory} \end{figure} While both sets of results display similar trends, the real memory data proved somewhat problematic. Real memory usage was generally consistent across test runs, but would also frequently fluctuate upwards by a factor of nearly two for no readily apparent reason. This is particularly apparent with the HTML overlay technique beyond 1,024 points. It seems likely that this was a result of other processes on the test machine interacting with the browser process in unexpected ways. 
There is therefore some doubt as to the validity of the real memory data, but they are at least broadly consistent with the virtual memory data. The virtual memory data proved more consistent overall, as the virtual memory footprint of a process is less likely to be impacted by other running processes. The results show that the two image-based techniques have essentially constant virtual memory usage of about 170\,MB regardless of the number of points plotted. This is to be expected, given that the size of the generated data is also essentially constant. The text-based techniques, however, clearly begin to diverge as the number of points increases. The HTML overlay technique starts to visibly diverge somewhere between 2,048 and 4,096 points, reaching a maximum of about 216\,MB at the point at which testing was terminated. Google Maps starts to visibly diverge between 64 and 128 points, reaching a maximum of about 264\,MB at the point at which testing was terminated. This is in line with the initial expectation for these techniques, namely that memory usage would increase in proportion to the number of points to be plotted.

\section{Conclusion and future work} \label{sec-conclusion}

In this research, the scalability of four techniques for online geovisualization of web site hits was tested, with respect to the number of points to be plotted on the map. The four techniques tested were server-side image generation, server-side image overlay, server-side HTML overlay and Google Maps. The results clearly show that the server-side image generation and server-side image overlay techniques scale the best from small to large data sets. The HTML overlay and Google Maps techniques work well for small data sets, but their performance rapidly deteriorates as the size of the data set increases, to the point where they become unusable. Despite this clear difference in scalability, some interesting questions remain.
The model interaction environment distribution style was not investigated in this research, as it was unclear whether this could be achieved using only client-side JavaScript. This is clearly an avenue for further investigation. In addition, the appearance of native SVG support in browsers means that this may become a viable option for implementing this distribution style in future. It was somewhat surprising that the server-side HTML overlay and Google Maps techniques exhibited no obvious consistency in where the different measures (data size, map display time and virtual memory usage) diverged, as shown in Table~\ref{tab-divergence}. Google Maps appears to exhibit greater consistency than HTML overlay in this respect, but it seems logical to expect some form of correlation, so further research will be required to investigate this. One possibility might be to implement an instrumented web browser and server in order to gather more precise data. \begin{acmtable}{11cm} \centering \begin{tabular}{lccc} Technique & Data size & Map display time & Virtual memory \\ \hline Server-side HTML overlay & 64--128 & 128--256 & 2,048--4,096 \\ Google Maps & 256--512 & 64--128 & 64--128 \\ \end{tabular} \caption{Approximate number of points at which each measure begins to visibly diverge, for the HTML overlay and Google Maps techniques.} \label{tab-divergence} \end{acmtable} Shortly after completing the experiments, the author discovered \emph{msCross Web\-gis}\footnote{\url{http://datacrossing.crs4.it/en_Documentation_mscross.html}}, an open source Google Maps clone. Its documentation implies that it may be possible to build a fully self-contained implementation that requires no external network access. This would enable testing on an isolated network with the client and server running on different machines. Measurements of network transfer time could then be included, and any issues arising from running the client and server on the same machine would be eliminated. 
This would require a distributed measurement infrastructure similar to that developed by \citeN{Barf-P-1999-webperf}. The overall aim of this work was to identify the best technique for plotting the large number of downloads and abstract views from the Otago School of Business digital repository. Based on the results, both the server-side HTML overlay and Google Maps techniques are clearly inappropriate for this task. This leaves a choice between two very similarly performing techniques: server-side image generation and server-side image overlay. However, multi-layer techniques offer many practical advantages over single-layer techniques, such as the ability to dynamically show and hide multiple overlays. These advantages provide greater flexibility and a more dynamic experience for end-users. Taking these end-user benefits into consideration, the server-side image overlay technique is the clear winner in this case.

\begin{acks}
The author would like to acknowledge Dr.\ Antoni Moore and Prof.\ George Benwell for their input into this research.
\end{acks}

\bibliography{Map_Visualisation}

\begin{received} ... \end{received}

\end{document}