- Tweaked abstract slightly.
- Added labels to all sections.
- Added figures.
Williamson_2005
1 parent e30bff8 commit 2d98376d8da362236ecb341c335aa27a13b04819 nstanger authored on 16 Aug 2005

Atom_updates.tex
\documentclass{CRPITStyle}
\usepackage{harvard}
\usepackage{graphicx}
\pagestyle{empty}
\thispagestyle{empty}
\title{Lightweight Update Propagation using Atom}
\author{David W.\ Williamson \and Nigel J.\ Stanger}
\affiliation{Department of Information Science, \\ University of Otago, \\ PO Box 56, Dunedin, New Zealand \\ Email:~\texttt{\{dwilliamson,nstanger\}@infoscience.otago.ac.nz}}
\begin{document}
\maketitle
\begin{abstract}
There are many situations where some form of automated update propagation across disparate databases may be beneficial. For example, a retailer could automatically retrieve the latest pricing data from their suppliers' databases, and use these data to update their own internal database. Doing so at regular intervals ensures that the retailer always has current pricing information in their database. Electronic Data Interchange (EDI) tools that provide such features already exist but can be expensive to implement, particularly for small to medium enterprises (SMEs). In this paper we propose a lightweight approach for propagating updates from one database to another using the Atom XML syndication format, thus providing a simpler, cost-effective technology for facilitating data integration. This approach enables a target database to regularly poll one or more source databases for updates, which are then applied to the target database (alternatively, updates could be ``pushed'' to the target from the sources). It can be used in typical data integration scenarios where the data sources are updated at irregular intervals, such as the aforementioned retailer example, or when extracting data from multiple data sources for loading into a data warehouse. In the paper we discuss the underlying principles and motivation for the approach, outline possible architectures, and describe an early prototype implementation.
\end{abstract}
\vspace{.1in}
\noindent {\em Keywords:} update propagation, data integration, Atom, SME, lightweight architecture, Semantic Web, B2B

\section{Introduction} \label{sec-intro}
The ability to integrate data from multiple heterogeneous sources is becoming a key issue for modern businesses, and yet the number of businesses implementing data integration solutions is smaller than we might expect \cite{Beck-R-2002-Bled,vaHe-E-1999-EDI}. This is particularly true for small to medium enterprises (SMEs), for whom the cost of implementing an enterprise-scale data integration solution can often be prohibitive \cite{Beck-R-2002-Bled,Guo-J-2003-DocEng,Somm-RA-2002-SIGMOD}.

In this paper, we propose a lightweight architecture for propagating updates from one database to another using the Atom XML syndication format. This architecture could provide a cost-effective alternative technology for SMEs to facilitate data integration rather than having to purchase expensive enterprise-grade systems. We have implemented a basic proof of concept of this architecture, and are currently evaluating it using three case studies.

The body of this paper comprises four main sections. In Section~\ref{sec-background} we provide some general background information regarding data integration and the Atom syndication format. In Section~\ref{sec-motivation} we discuss the motivation behind our proposed architecture. We then discuss the proposed architecture and the goals of our research in Section~\ref{sec-architecture}, and present some possible directions for future work in Section~\ref{sec-future-work}. The paper concludes in Section~\ref{sec-conclusion}.

\section{Background} \label{sec-background}
In this section, we briefly discuss the concepts and technologies that underlie our proposed architecture. In Section~\ref{sec-data-integration} we provide a brief overview of data integration, especially in the context of SMEs attempting to implement a data integration solution. This is followed in Section~\ref{sec-atom-overview} by a brief discussion of the development of Atom and related technologies such as RSS and RDF.

\subsection{Data Integration} \label{sec-data-integration}
Data integration is a term used to describe the combining of data residing in different sources to provide the user with a unified view of those data \cite{Bati-C-1986,Yu-C-2004-SIGMOD}. This activity is becoming increasingly important to modern business operations as more and more organizations rely upon applications that support staff in making informed decisions \cite{Calv-D-1998-CoopIS,Yu-C-2004-SIGMOD}. Data integration has been a topic of research for some time \cite{Beck-R-2002-Bled,Wied-G-1993-SIGMOD}; today it is of no less significance, with many organizations requiring the aggregation of data from multiple and often heterogeneous sources for a wide variety of applications \cite{Haas-LM-1999-DEB}.

\citeasnoun{Bati-C-1986} illustrated three common scenarios for integration environments:
\begin{itemize}
\item homogeneous, where all the sources of data share the same schema;
\item heterogeneous, where data must be integrated from sources that may use different schemas or platforms (e.g., a combination of relational and hierarchical databases); and
\item federated, where integration is facilitated by the use of a common export schema over all data sources.
\end{itemize}

A typical example of data integration from heterogeneous sources can be found in the arena of business-to-business (B2B) commerce, where, for example, a manufacturer may have to interact with multiple suppliers or temporary contractors, each of whom may have completely different data structures and data exchange formats \cite{Ston-M-2001-SIGMOD}. With the introduction of cheaper web-based technology, many additional organizations have been able to undertake projects to facilitate data integration. However, the costs associated with such technology are still prohibitive for the many smaller companies and organizations that comprise the majority of most countries' economies.
Many initiatives have been put forward to try to alleviate this situation, one of the more recent being the OASIS Universal Business Language (UBL) standard \cite{Mead-B-2004-UBL}, which is a project to standardize common business documentation (invoices, purchase orders, etc.) so that it is easier for companies to establish and maintain automated transactions with other parties. UBL has been designed to operate with ebXML.

XML has been widely adopted as a standard platform for exchanging data between organizations, and many specialist standards, such as the aforementioned ebXML, have been developed to cater to the unique needs of particular business sectors. In addition to XML-based language specifications, other standards such as EDIFACT and EXPRESS have been defined to facilitate the transmission of information from various sources so that it may be integrated with other data.

\subsection{The Atom Syndication Format} \label{sec-atom-overview}
In this section we provide a brief overview of the Atom syndication format and the technologies that led to its development.

\subsubsection{RDF, RSS and the Semantic Web} \label{sec-rdf-rss}
The World Wide Web (WWW) as it stands today consists mostly of documents intended for humans to read, i.e., ``\ldots{}a medium of documents for people rather than for data and information that can be processed automatically\ldots{}'' \cite{Bern-T-2001-SciAm}, which provides minimal opportunity for computers to perform additional interpretation or processing on them \cite{Bern-T-1999-WWW,Bern-T-2001-SciAm}. In essence, computers in use on the Web today are primarily concerned with parsing elementary layout information (for example, headers, graphics, or text) and with simple processing such as handling user input forms \cite{Bern-T-1999-W3C,Bern-T-2001-SciAm}.
There are few means by which computers can perform more powerful processing or manipulation on web resources \cite{Bern-T-2001-SciAm,Fens-D-2003}, most often because the additional semantics required do not exist or are not in a form that can be interpreted by computers \cite{Koiv-MR-2001-W3C}. The motivation for the adoption of semantics in Web documents can be made evident simply by using a contemporary search engine to look for an ``address''. This search may well return a plethora of results ranging from street addresses and email addresses to public addresses made by important individuals through the ages. This kind of scenario is one of the reasons for the W3C's Semantic Web project \cite{Koiv-MR-2001-W3C}. In the words of its creator, Tim Berners-Lee, its goal is to: \begin{quotation} ``\ldots{}develop enabling standards and technologies designed to help machines understand more information on the Web so that they can support richer discovery, data integration, navigation, and automation of tasks. With Semantic Web we not only receive more exact results when searching for information, but also know when we can integrate information from different sources, know what information to compare, and can provide all kinds of automated services in different domains from future home and digital libraries to electronic business and health services.'' \cite{Koiv-MR-2001-W3C} \end{quotation} In other words, the Semantic Web will provide a space where more intelligent searching and processing of information will be made possible by further extending the existing capabilities of the World Wide Web (WWW). RDF is a technology that is an integral part of the W3C Semantic Web initiative, as the following excerpt from the W3C Semantic Web activity statement will attest: \begin{quotation} ``The Resource Description Framework (RDF) is a language designed to support the Semantic Web, in much the same way that HTML is the language that helped initiate the original Web. 
RDF is a framework for supporting resource description, or metadata (data about data), for the Web. RDF provides common structure that can be used for interoperable XML data exchange.'' \cite{Powe-S-2003-RDF}
\end{quotation}
What RDF does in the context of the Semantic Web is to provide the capability of recording data in a way that can be interpreted easily by machines, which in turn provides an avenue to ``\ldots{}more efficient and sophisticated data interchange, searching, cataloguing, navigation, classification and so on\ldots{}'' \cite{Powe-S-2003-RDF}.

Since its inception in the late 1990s, the RDF specification has spawned several applications, RSS being but one example. RDF Site Summary (RSS) is an XML application, of which versions 0.9 and 1.0 conform to the W3C's RDF specification. It is a format intended for metadata description and content syndication \cite{Mano-F-2004-RDF}. Originally developed by Netscape as a means to syndicate content from multiple sources onto one page \cite{Nott-M-2005-Atom}, RSS has been embraced by other individuals and organizations, resulting in the spawning of multiple versions. At its simplest, the information provided in an RSS document comprises the description of a ``channel'' (which could cover a specific topic such as current events, sport, or the weather) consisting of URL-linked items. Each item consists of a title, a link to the actual content, and a brief description or abstract.

Because of the proliferation of differing RSS standards and the associated compatibility problems, a group of service providers, vendors and developers have initiated the development of a separate syndication standard named Atom, which will, according to the Atom Publishing Format and Protocol (Atompub) Working Group, be heavily influenced by the lessons learned in the evolution of RSS.

\subsubsection{Atom} \label{sec-atom-detail}
The Atom specification is an XML-based document format that has been designed to describe lists of related information \cite{Nott-M-2005-Atom}. These lists are known as ``feeds''. Feeds are made up of multiple items, known as ``entries''; each entry can have an extensible set of attached metadata \cite{Nott-M-2005-Atom}. Atom as a technology comprises four key related components: a conceptual model of a resource, a well-defined syntax for this model, the Atom feed format itself, and the editing protocol. Both the feed format and the editing protocol make use of the aforementioned syntax.

In addition to these features, the Atompub Working Group have outlined several design objectives for the feed format and the editing protocol. The feed format must be able to represent the following: a resource that is a weblog entry or article, a feed or channel of entries, a complete archive of all entries within a feed, existing well-formed XML (especially XHTML) content, and additional information in a user-extensible manner. The editing protocol must support creating, deleting or editing feed entries, multiple authors for a single feed, user authentication, user management, and the ability to create, obtain and configure complementary material such as comments or templates.

The latest specification of Atom, which at the time of writing is still in draft form, states that the main purpose Atom is intended to address is ``\ldots{}the syndication of Web content such as Weblogs and news headlines to Web sites as well as directly to user agents'' \cite{Nott-M-2005-Atom}. The specification also suggests that Atom should not be limited to web-based content syndication, but may be adapted for other uses or content types. The Atompub Working Group aim to submit the Atom feed format and editing protocol to the IETF for consideration as a proposed standard in early April 2005.
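
To make the feed and entry structure concrete, the following Python sketch parses a small Atom feed and applies only those entries updated since the last poll to an in-memory stand-in for a target database. It is illustrative only and is not the PHP prototype discussed in this paper. It assumes the element names and namespace of the Atom 1.0 format as later published in RFC 4287, which may differ in detail from the draft cited above. The supplier namespace, the \texttt{price} extension element, the sample data, and the function names \texttt{entries\_since} and \texttt{apply\_updates} are all hypothetical.

\begin{verbatim}
```python
import xml.etree.ElementTree as ET
from datetime import datetime

ATOM = "{http://www.w3.org/2005/Atom}"
# Hypothetical extension namespace for supplier data.
SUP = "{http://example.org/supplier}"

# Hypothetical sample feed: two entries, each with
# a price carried as extension metadata.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:s="http://example.org/supplier">
  <title>Supplier price updates</title>
  <id>urn:example:supplier-a:prices</id>
  <updated>2005-08-16T12:00:00Z</updated>
  <author><name>Supplier A</name></author>
  <entry>
    <title>Widget</title>
    <id>urn:example:supplier-a:product:1001</id>
    <updated>2005-08-16T11:30:00Z</updated>
    <s:price currency="NZD">12.50</s:price>
  </entry>
  <entry>
    <title>Gadget</title>
    <id>urn:example:supplier-a:product:1002</id>
    <updated>2005-08-10T09:00:00Z</updated>
    <s:price currency="NZD">7.95</s:price>
  </entry>
</feed>"""


def parse_timestamp(text):
    """Parse the UTC timestamps used in the sample."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")


def entries_since(feed_xml, last_poll):
    """Return entries updated after last_poll."""
    root = ET.fromstring(feed_xml)
    updates = []
    for entry in root.findall(ATOM + "entry"):
        stamp = entry.findtext(ATOM + "updated")
        updated = parse_timestamp(stamp)
        if updated > last_poll:
            updates.append({
                "id": entry.findtext(ATOM + "id"),
                "title": entry.findtext(ATOM + "title"),
                "price": float(
                    entry.findtext(SUP + "price")),
                "updated": updated,
            })
    return updates


def apply_updates(target, updates):
    """Upsert each update into the target mapping."""
    for u in updates:
        target[u["id"]] = (u["title"], u["price"])
    return len(updates)


# The dict stands in for the target database.
target = {
    "urn:example:supplier-a:product:1001":
        ("Widget", 11.00),
}
changed = entries_since(FEED, datetime(2005, 8, 15))
apply_updates(target, changed)
```
\end{verbatim}

Filtering on each entry's \texttt{updated} element is only one possible way for a polling target to avoid reapplying old data; a real deployment would also need to handle deletions and time zones, which this sketch ignores.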

\section{Motivation} \label{sec-motivation}
One example domain of data integration is Electronic Data Interchange (EDI), a concept used by companies to exchange information such as goods procurement documentation. EDI is not new \cite{Beck-R-2002-Bled,Medj-B-2003-VLDB}, and has been used for many years by various organizations to reduce costs by replacing more traditional paper-based systems. It is interesting to note, however, that in surveys regarding the extent of adoption of EDI, only a fraction of the companies that might be expected to benefit from such technology have actually implemented or attempted to implement it \cite{Beck-R-2002-Bled,vaHe-E-1999-EDI}.

This naturally raises the question of why. We can refine this question further by asking why so few smaller companies (SMEs) have adopted EDI or indeed other technologies that rely on accurate automated data integration, such as data warehousing. Perhaps the most important reason is cost: to a small company the perceived benefits of introducing the technology may not be sufficient to justify the expense \cite{Beck-R-2002-Bled,Guo-J-2003-DocEng,Somm-RA-2002-SIGMOD}. When a decision has been made to implement new technology, it is often the case that the SME in question has been forced into an investment that is, to them, an expensive solution, perhaps due to demands imposed by larger clients and partners, or as a response to competitors in an attempt to maintain market position \cite{Beck-R-2002-Bled,vaHe-E-1999-EDI}.

Attempts have been made to make EDI more cost-effective by introducing EDI on a web-based platform \cite{Beck-R-2002-Bled}, and through the development of standards such as the recently sanctioned OASIS Universal Business Language (UBL) standard \cite{Mead-B-2004-UBL}.
While UBL is new and has probably not had sufficient time to make a substantial impact, the fact remains that the underlying reason these types of technologies are still not attractive enough to SMEs is cost \cite{Beck-R-2002-Bled,Guo-J-2003-DocEng,Somm-RA-2002-SIGMOD,vaHe-E-1999-EDI}.

To summarize, data integration technologies are often not readily or willingly implemented by SMEs because of the perceived high costs involved, and at best are implemented only if they are deemed vitally important to the continued survival of the organization in the marketplace. This leads us to the conclusion that there is a need for an alternative data integration solution that is cost-effective, enabling SMEs to embrace the benefits of applications that use data integration technologies, such as data warehousing, EDI networks or e-catalogues. This identified need provides the motivation for our proposed architecture, which we will discuss in the next section.

\section{Proposed Architecture and Research Goals} \label{sec-architecture}
To address the lack of SME adoption of data integration technologies, we propose a lightweight data integration architecture based on Atom, as illustrated in Figure~\ref{fig-basic}. Atom was chosen as the underlying technology because of its XML heritage, and because the Atom community is trying to encourage different uses for the format beyond the traditional application of weblog syndication \cite{Nott-M-2005-Atom}. Although the standard has yet to be officially ratified, it already has a large user and development community.

\begin{figure}[htb]
\fbox{\parbox[b]{.99\linewidth}{%
\vskip 0.5cm%
\centerline{\includegraphics[scale=0.9]{Architecture_basic}}%
\vskip 0.5cm%
}}
\caption{Overview of the basic architecture}
\label{fig-basic}
\end{figure}

We are currently implementing a basic proof of concept of this architecture, and will evaluate its cost-effectiveness and performance compared to other data integration technologies.
The prototype builds upon existing software available for processing Atom feeds, and adds a module (written in PHP) for integrating incoming data from different feeds. The integration module takes as input Atom feeds from multiple data sources, which simulate incoming data from client or supplier data sets. (For the initial prototype we have assumed that the data feeds are homogeneous; this will need to be extended to heterogeneous feeds in later versions.) After the Atom feeds have been collected, the integration module will integrate the data supplied by the feeds into a schema that matches that of the target database, as shown in Figure~\ref{fig-basic}. A transaction simulator will be employed to simulate workload and updates to the source databases, in order to recreate a day-to-day production environment.

In order to evaluate the prototype, we will implement three different simulated scenarios derived from actual use cases of previous projects. All three case studies follow a similar structure, whereby data will be exported as Atom feeds from the source database(s), which are then consumed by the integration module before being sent to the target database for insertion.

The first scenario will simulate the integration of product data from multiple suppliers into a vendor's product information database. The product information database is used to populate the vendor's online product catalogue, which clients use to make decisions regarding goods procurement. The Atom feeds in this scenario represent flows of product data from the supplier to the vendor.

The second scenario follows on from an earlier research project to develop a kiosk system for the sale and distribution of music in digital format. The database that the kiosks use will be populated with information from vendors who have agreed to supply content (e.g., a record label's collection of music files).
What is needed is a mechanism to integrate all the music data from each supplier into the music kiosk system's own database. The Atom feeds in this scenario are used to maintain an up-to-date database that holds the location and description of each music track available for sale in the system.

The third scenario will simulate the implementation of a data warehousing solution for a computer components distributor.

Preliminary results from the case study evaluations are expected to be available by June 2005. Our primary goal with the initial prototype is to prove the feasibility of our approach. We will compare our proposed architecture against existing data integration solutions by means of a cost/benefit analysis. We may also investigate measuring various software quality characteristics as defined by the ISO 9126 standard \cite{ISO-2001-9126-1}.

\section{Future Work} \label{sec-future-work}
As the initial prototype is intended as a basic proof of concept of our proposed architecture, it has been kept as simple as possible in order to facilitate implementation and evaluation. There are several obvious extensions to the basic prototype that will be investigated in later iterations of the architecture.

The initial prototype assumes that all data sources are largely homogeneous, that is, that they all share similar semantics and can therefore be integrated relatively easily. An obvious extension is to permit heterogeneous data sources that have differing semantics. Such an extension would require the addition of an ontology management module between the Atom feed processor and the integration module. This module will probably be based around the W3C's Web Ontology Language (OWL) \cite{McGu-DL-2004-OWL}.

\begin{figure}[htb]
\fbox{\parbox[b]{.99\linewidth}{%
\vskip 0.5cm%
\centerline{\includegraphics[scale=0.9]{Architecture_extended}}%
\vskip 0.5cm%
}}
\caption{Overview of the extended architecture}
\label{fig-extended}
\end{figure}

The initial prototype also assumes only a single ``author'' per Atom feed, that is, there is only a single database underlying each feed (as implied by Figure~\ref{fig-basic}). We can envisage a situation where what appears to be a single data source is actually a view layered on top of a collection of underlying databases (e.g., a supplier might draw data for their Atom feed from multiple databases within their organization). It would therefore be useful to investigate the possibility of multiple ``authors'' per Atom feed. This could imply an additional layer of data integration within the data source itself.

The data flows shown in Figure~\ref{fig-basic} imply that the proposed architecture is one-way only (i.e., from the data sources to the target database), but this may not be true in general. It would therefore be interesting to investigate extending the architecture to allow for two-way data transfers, i.e., allowing data to flow from the target back to the sources.

\section{Conclusion} \label{sec-conclusion}
In this paper, we discussed a lightweight data integration architecture based on the Atom XML syndication format. Cost is a major factor in the slow adoption of data integration technologies by small to medium enterprises, so the proposed architecture could provide a cost-effective alternative for implementing data integration infrastructures in small business environments. We are currently developing a basic proof-of-concept prototype system that will be evaluated using a series of realistic case studies. We expect to have preliminary results from these evaluations by June 2005.

\section{Acknowledgements} \label{sec-acknowledgements}
The authors would like to thank Dr. Colin Aldridge and Dr.
Stephen Cranefield for their helpful comments on an early draft of this paper.

\bibliographystyle{agsm}
\bibliography{Atom_updates}
\end{document}
The specification also suggests that Atom should not be limited to just web based content syndication but in fact may be adapted for other uses or content types. The Atompub Working Group aim to submit the Atom feed format and editing protocol to the IETF for consideration as a proposed standard in early April 2005. \section{Motivation} One of the example domains of data integration is that of Electronic Data Interchange (EDI), a concept used by companies to exchange information such as goods procurement documentation. EDI is not new \cite{Beck-R-2002-Bled,Medj-B-2003-VLDB}, and has been used for many years by various organizations to reduce costs by replacing more traditional paper based systems. It is interesting to note, however, that in surveys regarding the extent of adoption of EDI, only a fraction of the companies that might be perceived as beneficiaries of such technology have actually implemented or attempted to implement it \cite{Beck-R-2002-Bled,vaHe-E-1999-EDI}. This naturally raises the question of why? We can refine this question further by asking why so few smaller companies (SME's) have adopted EDI or indeed other technologies that rely on accurate automated data integration, such as data warehousing. Perhaps the most important reason is that of cost: to a small company the perceived benefits of introducing the technology may not be sufficient to justify the expense \cite{Beck-R-2002-Bled,Guo-J-2003-DocEng,Somm-RA-2002-SIGMOD}. When a decision has been made to implement new technology, it is often the case that the SME in question has been forced into an investment that is, to them, an expensive solution, perhaps due to demands imposed by larger clients and partners, or as a response to competitors in an attempt to maintain market position \cite{Beck-R-2002-Bled,vaHe-E-1999-EDI}. 
Attempts have been made to make EDI more cost effective by introducing EDI on a web-based platform \cite{Beck-R-2002-Bled}, and through the development of standards such as the recently sanctioned OASIS Universal Business Language (UBL) standard \cite{Mead-B-2004-UBL}. While UBL is new and has probably not had sufficient time to make a substantial impact, the fact remains that the underlying reason these types of technologies are still not attractive enough to SME's is cost \cite{Beck-R-2002-Bled,Guo-J-2003-DocEng,Somm-RA-2002-SIGMOD,vaHe-E-1999-EDI}. To summarize, data integration related technologies are often not readily or willingly implemented by SME's because of the perceived high costs involved, and at best are implemented only if it is deemed vitally important to the continued survival of the organization in the marketplace. Such a situation leads us to the conclusion that there is an apparent need for an alternative data integration solution that is cost effective, enabling SME's to embrace the benefits of applications that use data integration technologies, such as data warehousing, EDI networks or e-catalogues. This identified need provides the motivation for our proposed architecture, which we will discuss in the next section. \section{Proposed Architecture and Research Goals} To address the issue of lack of SME adoption of data integration technologies, we propose a lightweight data integration architecture based on Atom, as illustrated in Figure 1. Atom was chosen as the underlying technology because of its XML heritage, and because the Atom community is trying to encourage different uses for the format beyond the traditional application of weblog syndication \cite{Nott-M-2005-Atom}. Although the standard has yet to be officially ratified, it already has a large user and development community. 
We are currently implementing a basic proof of concept of this architecture, and will evaluate its cost-effectiveness and performance compared to other data integration technologies. The prototype builds upon existing software available for processing Atom feeds, and adds a module (written in PHP) for integrating incoming data from different feeds. The integration module takes as input Atom feeds from multiple data sources, which simulate incoming data from client or supplier data sets. (For the initial prototype we have assumed that the data feeds are homogeneous; obviously this will need to be extended to heterogeneous feeds in later versions.) After the Atom feeds have been collected, the integration module will integrate the data supplied by the feeds into a schema that matches that of the target database, as shown in Figure 1. A transaction simulator will be employed to simulate workload and updates to the source databases, in order to recreate a day-to-day production environment. In order to evaluate the prototype, we will implement three different simulated scenarios derived from actual use cases of previous projects. All three case studies follow a similar structure whereby data will be exported as Atom feeds from the source database(s), which are then consumed by the integration module before being sent to the target database for insertion. The first scenario will simulate the integration of product data from multiple suppliers into a vendor's product information database. The product information database is used to populate the vendor's online product catalogue, which clients use to make decisions regarding goods procurement. The Atom feeds in this scenario represent flows of product data from the supplier to the vendor. The second scenario follows on from an earlier research project to develop a kiosk system for the sale and distribution of music in digital format. 
The database the kiosk(s) use will be populated with information from vendors who have agreed to supply content (e.g., a record label's collection of music files). What is needed is a mechanism to integrate all the music data from each supplier into the music kiosk system's own database. The Atom feeds in this scenario are used to maintain an up to date database that has the location and description of each available music track for sale in the system. The third scenario will simulate the implementation of a data warehousing solution for a computer components distributor. Preliminary results from the case study evaluations are expected to be available by June 2005. Our primary goal with the initial prototype is to prove the feasibility of our approach. We will compare our proposed architecture against existing data integration solutions by means of a cost/benefit analysis. We may also investigate measuring various software quality characteristics as defined by the ISO 9126 standard \cite{ISO-2001-9126-1}. % Figure 1. Proposed architecture showing integration module \section{Future Work} As the initial prototype is intended as a basic proof of concept of our proposed architecture, it has been kept as simple as possible in order to facilitate the implementation and evaluation. There are several obvious extensions to the basic prototype that will be investigated in later iterations of the architecture. The initial prototype assumes that all data sources are largely homogeneous, that is, that they all share similar semantics and can therefore be relatively easily integrated. An obvious extension is to permit heterogeneous data sources that have differing semantics. Such an extension would require the addition of an ontology management module between the Atom feed processor and the integration module. This module will probably be based around the W3C's Web Ontology Language (OWL) \cite{McGu-DL-2004-OWL}. 
\end{document}

\documentclass{CRPITStyle}

\usepackage{harvard}
\usepackage{graphicx}

\pagestyle{empty}
\thispagestyle{empty}

\title{Lightweight Update Propagation using Atom}
\author{David W.\ Williamson \and Nigel J.\ Stanger}
\affiliation{Department of Information Science, \\
	University of Otago, \\
	PO Box 56, Dunedin, New Zealand \\
	Email:~\texttt{\{dwilliamson,nstanger\}@infoscience.otago.ac.nz}}

\begin{document}

\maketitle

\begin{abstract}
There are many situations where some form of automated update
propagation across disparate databases may be beneficial. For example, a
retailer could automatically retrieve the latest pricing data from their
suppliers' databases, and use these data to update their own internal
database. Doing so at regular intervals ensures that the retailer always
has current pricing information in their database. Electronic Data
Interchange (EDI) tools that provide such features already exist but can
be expensive to implement, particularly for small to medium enterprises
(SMEs). In this paper we propose a lightweight approach for propagating
updates from one database to another using the Atom XML syndication
format, thus providing a simpler, cost-effective technology for
facilitating data integration. This approach enables a target database
to regularly poll one or more source databases for updates, which are
then applied to the target database (alternatively, updates could be
``pushed'' to the target from the sources). This approach can be used in
typical data integration scenarios where the data sources are updated at
irregular intervals, such as the aforementioned retailer example, or
when extracting data from multiple data sources for loading into a data
warehouse. In the paper we discuss the underlying principles and
motivation for the approach, discuss possible architectures, and
describe an early prototype implementation.
\end{abstract}
\vspace{.1in}

\noindent {\em Keywords:} update propagation, data integration, Atom,
SME, lightweight architecture, Semantic Web, B2B

\section{Introduction}
\label{sec-intro}

The ability to integrate data from multiple heterogeneous sources is
becoming a key issue for modern businesses, and yet the number of
businesses implementing data integration solutions is smaller than we
might expect \cite{Beck-R-2002-Bled,vaHe-E-1999-EDI}. This is
particularly true for small to medium enterprises (SMEs), for whom the
cost of implementing an enterprise-scale data integration solution can
often be prohibitive
\cite{Beck-R-2002-Bled,Guo-J-2003-DocEng,Somm-RA-2002-SIGMOD}.

In this paper, we propose a lightweight architecture for propagating
updates from one database to another using the Atom XML syndication
format. This architecture could provide a cost-effective alternative
technology for SMEs to facilitate data integration rather than having
to purchase expensive enterprise-grade systems. We have implemented a
basic proof of concept of this architecture, and are currently
evaluating it using three case studies.

The body of this paper comprises four main sections. In
Section~\ref{sec-background} we provide some general background
information regarding data integration and the Atom syndication format.
In Section~\ref{sec-motivation} we discuss the motivation behind our
proposed architecture. We then discuss the proposed architecture and the
goals of our research in Section~\ref{sec-architecture}, and present
some possible directions for future work in
Section~\ref{sec-future-work}. The paper concludes in Section~6.

\section{Background}
\label{sec-background}

In this section, we briefly discuss the concepts and technologies that
underlie our proposed architecture. In
Section~\ref{sec-data-integration} we provide a brief overview of data
integration, especially in the context of SMEs attempting to implement a
data integration solution. This is followed in
Section~\ref{sec-atom-overview} by a brief discussion of the development
of Atom and related technologies such as RSS and RDF.

\subsection{Data Integration}
\label{sec-data-integration}

Data integration is the combining of data residing in different sources
to provide the user with a unified view of these data
\cite{Bati-C-1986,Yu-C-2004-SIGMOD}. This activity is becoming
increasingly important to modern business operations as more and more
organizations rely upon applications that support staff in making
informed decisions \cite{Calv-D-1998-CoopIS,Yu-C-2004-SIGMOD}.

Data integration has been a topic of research for some time
\cite{Beck-R-2002-Bled,Wied-G-1993-SIGMOD}, and it is no less
significant today, with many organizations needing to aggregate data
from multiple and often heterogeneous sources for a wide variety of
applications \cite{Haas-LM-1999-DEB}.
\citeasnoun{Bati-C-1986} illustrated three common scenarios for
integration environments:

\begin{itemize}

\item homogeneous, where all the sources of data share the same
	schema;

\item heterogeneous, where data must be integrated from sources that
	may use different schemas or platforms (e.g., a combination of
	relational and hierarchical databases); and

\item federated, where integration is facilitated by the use of a
	common export schema over all data sources.

\end{itemize}

A typical example of data integration from heterogeneous sources can be
found in the arena of business-to-business (B2B) commerce, where, for
example, a manufacturer may have to interact with multiple suppliers or
temporary contractors each of whom may have completely different data
structures and data exchange formats \cite{Ston-M-2001-SIGMOD}. With the
introduction of cheaper web-based technology, many additional
organizations have been able to undertake projects to facilitate data
integration. However, the costs associated with such technology are
still prohibitive for the many smaller companies and organizations that
comprise the majority of most countries' economies.

Many initiatives have been put forward to try to alleviate this
situation. One of the more recent is the OASIS Universal Business
Language (UBL) standard \cite{Mead-B-2004-UBL}, a project to standardize
common business documents (invoices, purchase orders, etc.)\ so that it
is easier for companies to establish and maintain automated transactions
with other parties. UBL has been designed to operate with ebXML.

XML has been widely adopted as a standard platform for exchanging data
between organizations, and many specialist standards (such as the
aforementioned ebXML) have been developed to cater to the particular
needs of certain business sectors. In addition to XML-based language
specifications, other standards such as EDIFACT and EXPRESS have been
defined to facilitate the transmission of information from various
sources so that it may be integrated with other data.

\subsection{The Atom Syndication Format}
\label{sec-atom-overview}

In this section we provide a brief overview of the Atom syndication
format and the technologies that led to its development.

\subsubsection{RDF, RSS and the Semantic Web}
\label{sec-rdf-rss}

The World Wide Web (WWW) as it stands today consists mostly of documents
intended for humans to read, i.e., ``\ldots{}a medium of documents for
people rather than for data and information that can be processed
automatically\ldots'' \cite{Bern-T-2001-SciAm}, which provides minimal
opportunity for computers to perform additional interpretation or
processing on them \cite{Bern-T-1999-WWW,Bern-T-2001-SciAm}. In essence,
computers in use on the Web today are primarily concerned with parsing
elementary layout information (for example, headers, graphics or text)
and with processing tasks such as handling user input forms
\cite{Bern-T-1999-W3C,Bern-T-2001-SciAm}.

There are few means by which computers can perform more powerful
processing or manipulation on web resources
\cite{Bern-T-2001-SciAm,Fens-D-2003}, most often because the additional
semantics required do not exist or are not in a form that can be
interpreted by computers \cite{Koiv-MR-2001-W3C}. The motivation for the
adoption of semantics in Web documents can be made evident simply by
using a contemporary search engine to look for an ``address''. This
search may well return a plethora of results ranging from street
addresses and email addresses to public addresses made by important
individuals through the ages.

This kind of scenario is one of the reasons for the W3C's Semantic Web
project \cite{Koiv-MR-2001-W3C}. In the words of its creator, Tim
Berners-Lee, its goal is to:

\begin{quotation}
	``\ldots{}develop enabling standards and technologies designed to help
	machines understand more information on the Web so that they can
	support richer discovery, data integration, navigation, and
	automation of tasks. With Semantic Web we not only receive more
	exact results when searching for information, but also know when we
	can integrate information from different sources, know what
	information to compare, and can provide all kinds of automated
	services in different domains from future home and digital libraries
	to electronic business and health services.'' \cite{Koiv-MR-2001-W3C}
\end{quotation}

In other words, the Semantic Web will provide a space where more
intelligent searching and processing of information will be made
possible by further extending the existing capabilities of the World
Wide Web (WWW).

RDF is a technology that is an integral part of the W3C Semantic Web
initiative, as the following excerpt from the W3C Semantic Web activity
statement will attest:

\begin{quotation}
	``The Resource Description Framework (RDF) is a language designed to
	support the Semantic Web, in much the same way that HTML is the
	language that helped initiate the original Web. RDF is a framework
	for supporting resource description, or metadata (data about data),
	for the Web. RDF provides common structure that can be used for
	interoperable XML data exchange.'' \cite{Powe-S-2003-RDF}
\end{quotation}

What RDF does in the context of the Semantic Web is to provide the
capability of recording data in a way that can be interpreted easily by
machines, which in turn provides an avenue to ``\ldots{}more efficient and
sophisticated data interchange, searching, cataloguing, navigation,
classification and so on\ldots{}'' \cite{Powe-S-2003-RDF}.

Since its inception in the late 1990s, the RDF specification has
spawned several applications, RSS being but one example. RDF Site
Summary (RSS) is an XML application, of which versions 0.9 and 1.0
conform to the W3C's RDF specification. It is a format intended for
metadata description and content syndication \cite{Mano-F-2004-RDF}.
Originally developed by Netscape as a means to syndicate content from
multiple sources onto one page \cite{Nott-M-2005-Atom}, RSS has since
been embraced by other individuals and organizations, resulting in
multiple versions.

At its simplest, the information provided in an RSS document comprises
the description of a ``channel'' (which could cover a specific topic
such as current events, sport or the weather) consisting of URL-linked
items. Each item consists of a title, a link to the actual content and a
brief description or abstract.

Because of the proliferation of differing RSS standards and associated
problems with compatibility, a group of service providers, vendors and
developers have initiated the development of a separate syndication
standard named Atom, which will, according to the Atom Publishing Format
and Protocol (Atompub) Working Group, be heavily influenced by the
lessons learned in the evolution of RSS.

\subsubsection{Atom}
\label{sec-atom-detail}

Atom is an XML-based document format that has been designed to describe
lists of related information \cite{Nott-M-2005-Atom}. These lists are
known as ``feeds''. Feeds are made up of multiple items, known as
``entries''; each entry can have an extensible set of attached metadata
\cite{Nott-M-2005-Atom}.
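
As a minimal sketch of the feed and entry structure described above, the
following Python fragment parses a small Atom document using the
standard \texttt{xml.etree.ElementTree} module. The feed content, the
\texttt{urn:example} identifiers and the \texttt{parse\_entries} helper
are hypothetical and serve only as an illustration. The namespace shown
is the one used by Atom 1.0, which is an assumption here and may differ
from that of the draft cited above.

```python
import xml.etree.ElementTree as ET

# Assumption: the Atom 1.0 namespace; earlier drafts may use another one.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical feed: a supplier publishing price updates as Atom entries.
FEED_XML = """<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example supplier price updates</title>
  <id>urn:example:supplier-a:prices</id>
  <updated>2005-08-16T10:00:00Z</updated>
  <entry>
    <title>Widget</title>
    <id>urn:example:supplier-a:product:1001</id>
    <updated>2005-08-16T09:30:00Z</updated>
    <summary>Price changed to 12.50</summary>
  </entry>
  <entry>
    <title>Gadget</title>
    <id>urn:example:supplier-a:product:1002</id>
    <updated>2005-08-15T14:00:00Z</updated>
    <summary>Price changed to 7.25</summary>
  </entry>
</feed>
"""


def parse_entries(feed_xml):
    """Return a list of dicts, one per Atom entry in the feed."""
    # Encode to bytes so the XML declaration's encoding is honoured.
    root = ET.fromstring(feed_xml.encode("utf-8"))
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        entries.append({
            "id": entry.findtext("atom:id", namespaces=ATOM_NS),
            "title": entry.findtext("atom:title", namespaces=ATOM_NS),
            "updated": entry.findtext("atom:updated", namespaces=ATOM_NS),
            "summary": entry.findtext("atom:summary", namespaces=ATOM_NS),
        })
    return entries
```

Each dictionary returned by \texttt{parse\_entries} corresponds to one
entry, and further metadata elements could be read in the same way.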

Atom as a technology comprises four key related components: a conceptual
model of a resource, a well-defined syntax for this model, the Atom feed
format itself and the editing protocol. Both the feed format and the
editing protocol make use of the aforementioned syntax.

In addition to these features, the Atompub Working Group have outlined
several design objectives for the feed format and the editing protocol.
The feed format must be able to represent the following: a resource that
is a weblog entry or article, a feed or channel of entries, a complete
archive of all entries within a feed, existing well-formed XML
(especially XHTML) content, and additional information in a
user-extensible manner.

The editing protocol must support creating, deleting or editing feed
entries, multiple authors for a single feed, user authentication, user
management and the ability to create, obtain and configure complementary
material such as comments or templates.

The latest specification of Atom, which at the time of writing is still
in draft form, states that the main purpose of Atom is ``\ldots{}the
syndication of Web content such as Weblogs and news headlines to Web
sites as well as directly to user agents'' \cite{Nott-M-2005-Atom}. The
specification also suggests that Atom need not be limited to web-based
content syndication, and may be adapted for other uses or content types.
The Atompub Working Group aim to submit the Atom feed format and editing
protocol to the IETF for consideration as a proposed standard in early
April 2005.

\section{Motivation}
\label{sec-motivation}

One example domain of data integration is Electronic Data Interchange
(EDI), which companies use to exchange information such as goods
procurement documentation. EDI is not new
\cite{Beck-R-2002-Bled,Medj-B-2003-VLDB}, and has been used for many
years by various organizations to reduce costs by replacing more
traditional paper-based systems. It is interesting to note, however,
that in surveys regarding the extent of adoption of EDI, only a fraction
of the companies that might be perceived as beneficiaries of such
technology have actually implemented or attempted to implement it
\cite{Beck-R-2002-Bled,vaHe-E-1999-EDI}. This naturally raises the
question of why. We can refine this question further by asking why so
few smaller companies (SMEs) have adopted EDI or indeed other
technologies that rely on accurate automated data integration, such as
data warehousing.

Perhaps the most important reason is that of cost: to a small company
the perceived benefits of introducing the technology may not be
sufficient to justify the expense
\cite{Beck-R-2002-Bled,Guo-J-2003-DocEng,Somm-RA-2002-SIGMOD}. When a
decision has been made to implement new technology, it is often the case
that the SME in question has been forced into an investment that is, to
them, an expensive solution, perhaps due to demands imposed by larger
clients and partners, or as a response to competitors in an attempt to
maintain market position \cite{Beck-R-2002-Bled,vaHe-E-1999-EDI}.

Attempts have been made to make EDI more cost-effective by introducing
EDI on a web-based platform \cite{Beck-R-2002-Bled}, and through the
development of standards such as the recently sanctioned OASIS Universal
Business Language (UBL) standard \cite{Mead-B-2004-UBL}. While UBL is
new and has probably not had sufficient time to make a substantial
impact, the fact remains that the underlying reason these types of
technologies are still not attractive enough to SMEs is cost
\cite{Beck-R-2002-Bled,Guo-J-2003-DocEng,Somm-RA-2002-SIGMOD,vaHe-E-1999-EDI}.

To summarize, data integration technologies are often not readily or
willingly implemented by SMEs because of the perceived high costs
involved, and at best are implemented only if they are deemed vitally
important to the continued survival of the organization in the
marketplace.

Such a situation leads us to the conclusion that there is an apparent
need for an alternative data integration solution that is
cost-effective, enabling SMEs to embrace the benefits of applications
that use data integration technologies, such as data warehousing, EDI
networks or e-catalogues.

This identified need provides the motivation for our proposed
architecture, which we will discuss in the next section.

\section{Proposed Architecture and Research Goals}
\label{sec-architecture}

To address the low adoption of data integration technologies by SMEs,
we propose a lightweight data integration architecture based on Atom, as
illustrated in Figure~\ref{fig-basic}. Atom was chosen as the underlying
technology because of its XML heritage, and because the Atom community
is trying to encourage different uses for the format beyond the
traditional application of weblog syndication \cite{Nott-M-2005-Atom}.
Although the standard has yet to be officially ratified, it already has
a large user and development community.

\begin{figure*}[htb]
	\fbox{\parbox[b]{.99\linewidth}{%
		\vskip 0.5cm%
		\centerline{\includegraphics[scale=0.9]{Architecture_basic}}%
		\vskip 0.5cm%
	}}
	\caption{Overview of the basic architecture}
	\label{fig-basic}
\end{figure*}

We are currently implementing a basic proof of concept of this
architecture, and will evaluate its cost-effectiveness and performance
compared to other data integration technologies. The prototype builds
upon existing software available for processing Atom feeds, and adds a
module (written in PHP) for integrating incoming data from different
feeds.

The integration module takes as input Atom feeds from multiple data
sources, which simulate incoming data from client or supplier data sets.
(For the initial prototype we have assumed that the data feeds are
homogeneous; obviously this will need to be extended to heterogeneous
feeds in later versions.) After the Atom feeds have been collected, the
integration module will integrate the data supplied by the feeds into a
schema that matches that of the target database, as shown in
Figure~\ref{fig-basic}. A
transaction simulator will be employed to simulate workload and updates
to the source databases, in order to recreate a day-to-day production
environment.

In order to evaluate the prototype, we will implement three different
simulated scenarios derived from actual use cases of previous projects.
All three case studies follow a similar structure, whereby data are
exported as Atom feeds from the source database(s) and then consumed by
the integration module before being sent to the target database for
insertion.
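
The flow common to the three case studies can be sketched as follows.
This is an illustrative Python fragment under stated assumptions, not
the PHP integration module itself: the \texttt{product} table, the
\texttt{make\_feed} and \texttt{integrate} helpers, the use of the
\texttt{content} element to carry a price, and the Atom 1.0 namespace
are all hypothetical choices made for the example. It assumes
homogeneous feeds, as the initial prototype does, and applies an entry
only when its \texttt{updated} timestamp is newer than the stored one.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Assumption: the Atom 1.0 namespace.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def make_feed(supplier, rows):
    """Build a tiny Atom feed for one hypothetical supplier.

    rows is a list of (code, name, price, updated) tuples.
    """
    entries = "".join(
        "<entry><id>urn:example:%s:%s</id><title>%s</title>"
        "<updated>%s</updated><content type='text'>%s</content></entry>"
        % (supplier, code, name, updated, price)
        for code, name, price, updated in rows
    )
    return (
        "<feed xmlns='http://www.w3.org/2005/Atom'>"
        "<title>%s</title><id>urn:example:%s</id>"
        "<updated>2005-08-16T00:00:00Z</updated>%s</feed>"
        % (supplier, supplier, entries)
    )


def integrate(feeds, conn):
    """Apply entries from several feeds to the target database.

    Returns the number of rows inserted or updated.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS product ("
        "entry_id TEXT PRIMARY KEY, name TEXT, price REAL, updated TEXT)"
    )
    applied = 0
    for feed_xml in feeds:
        root = ET.fromstring(feed_xml)
        for entry in root.findall("atom:entry", ATOM_NS):
            entry_id = entry.findtext("atom:id", namespaces=ATOM_NS)
            name = entry.findtext("atom:title", namespaces=ATOM_NS)
            updated = entry.findtext("atom:updated", namespaces=ATOM_NS)
            price = float(entry.findtext("atom:content", namespaces=ATOM_NS))
            row = conn.execute(
                "SELECT updated FROM product WHERE entry_id = ?", (entry_id,)
            ).fetchone()
            # Assumption: all timestamps share one UTC format, so plain
            # string comparison orders them correctly.
            if row is not None and row[0] >= updated:
                continue  # the target already holds this or a newer version
            conn.execute(
                "INSERT OR REPLACE INTO product "
                "(entry_id, name, price, updated) VALUES (?, ?, ?, ?)",
                (entry_id, name, price, updated),
            )
            applied += 1
    conn.commit()
    return applied
```

Calling \texttt{integrate} repeatedly with the same feeds leaves the
target unchanged, which is the behaviour one would want from a target
that polls its sources at regular intervals.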

The first scenario will simulate the integration of product data from
multiple suppliers into a vendor's product information database. The
product information database is used to populate the vendor's online
product catalogue, which clients use to make decisions regarding goods
procurement. The Atom feeds in this scenario represent flows of product
data from the supplier to the vendor.

The second scenario follows on from an earlier research project to
develop a kiosk system for the sale and distribution of music in digital
format. The database the kiosk(s) use will be populated with information
from vendors who have agreed to supply content (e.g., a record label's
collection of music files). What is needed is a mechanism to integrate
all the music data from each supplier into the music kiosk system's own
database. The Atom feeds in this scenario are used to maintain an
up-to-date database that has the location and description of each available
music track for sale in the system.

The third scenario will simulate the implementation of a data
warehousing solution for a computer components distributor.

Preliminary results from the case study evaluations are expected to be
available by June 2005. Our primary goal with the initial prototype is
to prove the feasibility of our approach. We will compare our proposed
architecture against existing data integration solutions by means of a
cost/benefit analysis. We may also investigate measuring various
software quality characteristics as defined by the ISO 9126 standard
\cite{ISO-2001-9126-1}.

\section{Future Work}
\label{sec-future-work}

As the initial prototype is intended as a basic proof of concept of our
proposed architecture, it has been kept as simple as possible in order
to facilitate the implementation and evaluation. There are several
obvious extensions to the basic prototype that will be investigated in
later iterations of the architecture.

The initial prototype assumes that all data sources are largely
homogeneous, that is, that they all share similar semantics and can
therefore be relatively easily integrated. An obvious extension is to
permit heterogeneous data sources that have differing semantics. Such an
extension would require the addition of an ontology management module
between the Atom feed processor and the integration module. This module
will probably be based around the W3C's Web Ontology Language (OWL)
\cite{McGu-DL-2004-OWL}.

\begin{figure*}[htb]
	\fbox{\parbox[b]{.99\linewidth}{%
		\vskip 0.5cm%
		\centerline{\includegraphics[scale=0.9]{Architecture_extended}}%
		\vskip 0.5cm%
	}}
	\caption{Overview of the extended architecture}
	\label{fig-extended}
\end{figure*}

The initial prototype also assumes only a single ``author'' per Atom feed,
that is, there is only a single database underlying each feed (as
implied by Figure~\ref{fig-basic}). We can envisage a situation where what appears to
be a single data source is actually a view layered on top of a
collection of underlying databases (e.g., a supplier might draw data for
their Atom feed from multiple databases within their organization). It
would therefore be useful to investigate the possibility of multiple
``authors'' per Atom feed. This could imply an additional layer of data
integration within the data source itself.
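
One way such a feed might be consumed is sketched below (in Python, as an illustration only). It groups entry identifiers by author name, treating each author as a stand-in for one underlying database; this mapping is our assumption for the example. Entries without their own author fall back to the feed-level author, as in Atom 1.0. The function name and namespace URI are likewise assumptions.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumption: feeds use the Atom 1.0 namespace.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def entries_by_author(xml_text):
    """Map each author name to the ids of the entries attributed to it."""
    root = ET.fromstring(xml_text)
    # Feed-level authors apply to any entry that names no author of its own.
    feed_authors = [
        n.text for n in root.findall("atom:author/atom:name", ATOM_NS)
    ]
    groups = defaultdict(list)
    for entry in root.findall("atom:entry", ATOM_NS):
        authors = [
            n.text for n in entry.findall("atom:author/atom:name", ATOM_NS)
        ] or feed_authors
        entry_id = entry.findtext("atom:id", default="", namespaces=ATOM_NS)
        for name in authors:
            groups[name].append(entry_id)
    return dict(groups)
```

Grouping in this way would let the integration module apply source-specific handling to each underlying database while still polling a single feed.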

The data flows shown in Figure~\ref{fig-basic} imply that the proposed architecture is
one-way only (i.e., from the data sources to the target database), but
this may not be true in general. It would therefore be interesting to
investigate extending the architecture to allow for the possibility of
two-way data transfers, i.e., allowing data to flow from the target back
to the sources.

\section{Conclusion}
\label{sec-conclusion}

In this paper, we discussed a lightweight data integration architecture
based on the Atom XML syndication format. Cost is a major factor in the
slow adoption of data integration technologies by small to medium
enterprises, so the proposed architecture could provide a cost-effective
alternative for implementing data integration infrastructures in small
business environments. We are currently developing a basic
proof-of-concept prototype system that will be evaluated using a series
of realistic case studies. We expect to have preliminary results from
these evaluations by June 2005.

\section{Acknowledgements}
\label{sec-acknowledgements}

The authors would like to thank Dr.\ Colin Aldridge and Dr.\ Stephen
Cranefield for their helpful comments on an early draft of this paper.

\bibliographystyle{agsm}
\bibliography{Atom_updates}

\end{document}

\documentclass{CRPITStyle}

\usepackage{harvard}
\usepackage{graphicx}

\pagestyle{empty}
\thispagestyle{empty}

\title{Lightweight Update Propagation using Atom}
\author{David W.\ Williamson \and Nigel J.\ Stanger}
\affiliation{Department of Information Science, University of Otago, \\
	PO Box 56, Dunedin, New Zealand \\
	Email:~\texttt{\{dwilliamson,nstanger\}@infoscience.otago.ac.nz}}

\begin{document}

\maketitle

\begin{abstract}
There are many situations where some form of automated update
propagation across disparate databases may be beneficial. For example, a
retailer could automatically retrieve the latest pricing data from their
suppliers' databases, and use these data to update their own internal
database. Doing so at regular intervals ensures that the retailer always
has current pricing information in their database. Electronic Data
Integration (EDI) tools that provide such features already exist but can
be expensive to implement, particularly for small to medium enterprises
(SMEs). In this paper we propose a lightweight approach for propagating
updates from one database to another using the Atom XML syndication
format, thus providing a simpler, cost-effective technology for
facilitating data integration. This approach enables a target database
to regularly poll one or more source databases for updates, which are
then applied to the target database (alternatively, updates could be
``pushed'' to the target from the sources). This approach can be used in
typical data integration scenarios where the data sources are updated at
irregular intervals, such as the aforementioned retailer example, or
when extracting data from multiple data sources for loading into a data
warehouse. In the paper we discuss the underlying principles and
motivation for the approach, describe the architecture that we have
used, and describe an early prototype implementation.
\end{abstract}
\vspace{.1in}

\noindent {\em Keywords:} update propagation, data integration, Atom,
SME, lightweight architecture, Semantic Web, B2B

\section{Introduction}

In this paper, we propose a lightweight data integration architecture
based on the Atom XML syndication format, which may provide a
cost-effective alternative technology for SMEs to facilitate data
integration rather than having to purchase expensive enterprise-grade
systems. We are currently implementing a basic proof of concept of this
architecture, and plan to evaluate it using three case studies.

The body of this paper comprises three main sections. In Section 2 we
provide some general background information regarding data integration
and the Atom syndication format. In Section 3 we discuss the motivation
behind our proposed architecture. We then discuss the proposed
architecture and the goals of our research in Section 4, and present
some possible directions for future work in Section 5. The paper
concludes in Section 6.

\section{Background}

\subsection{Data Integration}

\begin{itemize}

\item homogeneous, where all the sources of data share the same
	schema;

\item heterogeneous, where data must be integrated from sources that
	may use different schemas or platforms (e.g., a combination of
	relational and hierarchical databases); and

\item federated, where integration is facilitated by the use of a
	common export schema over all data sources.

\end{itemize}

\subsection{The Atom Syndication Format}

In this section we provide a brief overview of the Atom syndication
format and the technologies that led to its development.

\subsubsection{RDF, RSS and the Semantic Web}

This kind of scenario is one of the reasons for the W3C's Semantic Web
project \cite{Koiv-MR-2001-W3C}. In the words of its creator, Tim
Berners-Lee, its goal is to:

RDF is a technology that is an integral part of the W3C Semantic Web
initiative, as the following excerpt from the W3C Semantic Web activity
statement will attest:

\subsubsection{Atom}

\section{Motivation}

This identified need provides the motivation for our proposed
architecture, which we will discuss in the next section.

\section{Proposed Architecture and Research Goals}

The third scenario will simulate the implementation of a data
warehousing solution for a computer components distributor.

%   Figure 1. Proposed architecture showing integration module

\section{Future Work}

\section{Conclusion}

\section{Acknowledgements}

The authors would like to thank Dr.\ Colin Aldridge and Dr.\ Stephen
Cranefield for their helpful comments on an early draft of this paper.

\bibliographystyle{agsm}
\bibliography{Atom_updates}

\end{document}
