\documentclass[12pt,pdftex,a4paper,titlepage]{article}


\usepackage[T1]{fontenc}
\usepackage{textcomp}
\usepackage{lmodern}
\usepackage{mathpazo}
\usepackage{graphicx}
\usepackage[margin=1in]{geometry}
\usepackage{pifont}
\usepackage{url}


\graphicspath{{images/}}


\renewcommand{\ttdefault}{blg}


\title{\textsf{\textbf{OARiNZ DIY Repository Solution}}}
\author{\textsf{\textbf{Nigel Stanger}}}
\date{\textsf{\textbf{October/November 2006}}%
	\linebreak\linebreak\linebreak\linebreak\linebreak%
	\includegraphics[scale=0.4]{OU-Logo-Colour}}


\begin{document}


\maketitle


\tableofcontents

\vfill

{\small\sffamily
\begin{tabular}{|c|c|l|c|}
	\multicolumn{3}{l}{\textbf{Document history}}	\\
	\hline
	\textbf{Version}	&	\textbf{Date}	&	\textbf{Description}	&	\textbf{Author(s)}	\\
	\hline\hline
	1.1					&	12/09/2006		&	Original version from Eduforge wiki.	&	NS	\\
	\hline
	1.2					&	14/09/2006		&	First release for feedback.	&	NS	\\
	\hline
	1.3					&	20/10/2006		&	Incorporated feedback from John Rankin.	&	NS	\\
	\hline
	1.4					&	09/11/2006		&	Final candidate for sign-off.	&	NS	\\
	\hline
\end{tabular}
}


\newpage


\section{Introduction}

Implementing a digital repository, using a typical open source solution such as GNU EPrints or DSpace, is currently a complex proposition that requires a reasonable level of technical expertise in order to find, download and install all the required software, then separately configure these components appropriately for the target operating system. Ongoing maintenance of the repository configuration can also be complex. Both tasks can be simplified, in particular by removing the need to manually find, download, install and configure multiple separate components. Instead, separate higher-level installer and configuration tools could manage these tasks.

Objective 7 of the OARiNZ project aims to address this need. The stated aim of this objective is to ``produce a freely distributable, easy to install CD-ROM containing pre-configured (or self-configuring) open source software for use by institutions looking for entry-level assistance with developing their own shareable digital repository''\footnote{\url{http://www.oarinz.ac.nz/objectives.php#seven}}. This document outlines a specification for such a solution.

The nature of currently available open source repository software makes it unlikely that we can completely eliminate the need for some technical expertise. Most such software targets a LAMP environment (Linux/Unix, Apache, MySQL, Perl/PHP), and several installation and configuration tasks require administrator-level access, so the solution cannot be fully automated. Regardless, the solution will enable repository implementers to quickly install and configure a complete digital repository, either from ``bare metal'' on a new server or on an existing system (i.e., one that already has an operating system installed, and preferably the required LAMP components). In addition, the level of required technical expertise and the complexity of the installation and configuration process will be reduced, thus lowering the bar for implementing a digital repository.

In the spirit of ``lowering the bar'', a key aim should be to automate or abstract as much of the repository installation and configuration process as possible, focusing attention instead on only those elements that \emph{require} human intervention. In other words, repository implementers will not be forced to type in arcane commands unless it is absolutely unavoidable, nor will they be forced to read many pages of dense and obscure documentation before they start, or be burdened with byzantine installation procedures. A laudable, but perhaps overly optimistic, goal would be to make the installation process as easy as installing software under Mac OS X or Windows.

With regard to repository maintenance, the ideal would be to produce a high-level configuration tool that is able to configure a repository (or collection of repositories) without requiring the administrator to manually edit configuration files. This will certainly be feasible with repositories that are installed by the proposed installer tool, and may even be feasible, with some restrictions, for manually installed repositories.

The following key deliverables are therefore proposed:
\begin{enumerate}

	\item A ``bare metal'' installer for creating completely new repositories on new hardware, that includes an operating system, the required LAMP components and all the required repository software.
	
	\item A standalone tool for installing an EPrints repository on an existing server, i.e., a server with an already installed operating system and preferably the required LAMP components.
	
	\item A pre-packaged EPrints distribution in \texttt{.deb} (and possibly \texttt{.rpm}) format for use by the two previous deliverables.

	\item A standalone tool for configuring an EPrints repository.

\end{enumerate}
These deliverables would be distributed in the form of a CD-ROM (or equivalent medium) containing all the required software and a ``shell'' for managing the installation and configuration process. Downloadable disk images would also be made available.

The remainder of this document discusses various design and implementation options, typical usage scenarios, and the implementation plan.


\section{Design and implementation options}


\subsection{Repository software}

Ultimately it would be nice to provide a solution for both GNU EPrints and DSpace, which are the two major open source solutions for smaller-scale repository implementation. However, we currently have little expertise at Otago with DSpace, so the initial focus will be on delivering a solution for EPrints. This is supported by the \emph{Technical Evaluation of Selected Open Access Repositories in New Zealand} report (OARiNZ Deliverable 2), which recommended EPrints for a ``self-configuring solution''. We will focus on the current stable release of EPrints (2.3.13.1). EPrints 3.0 is still in development and not yet available to the general community.

The Tasmania EPrints statistics software will also be included as a standard component, so that any repositories installed using the proposed solution generate download statistics out of the box.


\subsection{Operating systems}

EPrints repositories are typically run on Unix-based systems (e.g., Linux, BSD, Mac OS X), and we have experience at Otago with installing EPrints on Debian Linux, FreeBSD, Mac OS X and Ubuntu Linux. Unix-based systems will therefore be our primary target for implementation. Note that the EPrints web site currently states that there are ``no plans for a version to run under Microsoft Windows''\footnote{\url{http://www.eprints.org/documentation/tech/php/intro.php#what_will_it_run_on}}.

For bare metal installations, a complete operating system distribution will also be required. It is clearly not feasible to provide an installation disk for every possible Unix platform, nor for proprietary operating systems such as Mac OS X. The bare metal installer can therefore realistically only support one operating system platform. The easiest way to achieve this is to pick a Unix-based operating system that provides a bootable ``live CD''.

We have experience at Otago with installing EPrints repositories under Ubuntu Linux\footnote{\url{http://www.ubuntu.com/}}, which provides a live CD feature, so this is an obvious choice. The Ubuntu live CD is also easily customisable, so a custom live CD could be created that installed not only the base operating system but also the required packages for installing EPrints and our configurator software. We will restrict ourselves to the x86 architecture in order to minimise complexity.

Installation of the repository software could be incorporated directly into the operating system installation process, implying that the standalone repository installer would not be required for bare metal installs. Alternatively, the standalone installer could be provided on a separate CD. The OS installer could then say something like ``please insert the CD labelled `EPrints Installer'{}'' and simply call the standalone installer once the CD has been inserted. The latter option should be easier to achieve and avoids any potential duplication of effort in both the bare metal and standalone installers.


\subsection{Package installation}
\label{sec-installation}

Unfortunately Unix-based environments do not provide as much uniformity of operating environment as we would like. There is wide variation even across different Linux distributions, with regard to package installation and management, system environment and standard toolsets. The process for installing a required package is completely different under Mac OS X, Debian Linux, Red Hat Linux and FreeBSD, for example, and there are even sometimes multiple package management mechanisms available within the same operating system distribution.

It therefore needs to be considered whether the standalone repository installer for existing systems should use the native package management software (e.g., Red Hat's \texttt{rpm} or Debian's \texttt{dpkg}), or independent installer software. If the native route is taken, the installer will need to detect the operating system version and then look for appropriate package management tools, which of course makes implementation more complex. The non-native route will lead to a simpler implementation, but would lose the significant advantage of having packages managed by the operating system, which is particularly useful for dependency management and upgrades. The native option is therefore preferred.
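
As a rough illustration of the native route, the following Python sketch probes the search path for a known package management tool. It is a minimal sketch under stated assumptions rather than a proposed implementation: the function name \texttt{detect\_package\_tool}, the \texttt{PACKAGE\_TOOLS} table and its order of preference are all hypothetical, and a real installer would also need to check the operating system version.

```python
import shutil

# Hypothetical table of (tool, package format) pairs, in order of preference.
PACKAGE_TOOLS = [
    ("dpkg", "deb"),   # Debian, Ubuntu
    ("rpm", "rpm"),    # Red Hat and derivatives
    ("fink", "fink"),  # Mac OS X with Fink installed
]


def detect_package_tool(which=shutil.which):
    """Return (tool, format) for the first tool found on the PATH, or None."""
    for tool, package_format in PACKAGE_TOOLS:
        if which(tool) is not None:
            return tool, package_format
    # No native tool found: the caller must fall back to a non-native install.
    return None


if __name__ == "__main__":
    print(detect_package_tool())
```

The lookup function is passed in as a parameter so that the sketch can be exercised on a machine that has none of these tools installed.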

Another consideration is how to handle pre-existing EPrints installations, whether they were installed manually or by the DIY installer. For ongoing sustainability, the installer should install EPrints in a way that later versions of the installer can recognise and build upon, so that future upgrades to the EPrints software can be carried out in a reasonably transparent manner. Things become a bit murky, however, when installing over the top of a pre-existing manual EPrints installation. EPrints does use a standard directory and file structure, so as long as the installation has not been radically restructured, it should in theory be feasible to install over the top (it could make sense, however, to limit any such capability to a particular range of EPrints versions).

A much safer option in general, however, would be for the installer to create a new installation alongside the existing one, then perhaps offer to copy across any customised files. This would give the repository administrator the opportunity to thoroughly test the new installation before going live, which would probably just be a matter of swapping the ``new'' and ``old'' installation directories.
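
The ``swap'' step could be as simple as three renames. The Python sketch below illustrates the idea only; the function name \texttt{swap\_installations} and the \texttt{.swap} suffix are hypothetical, and it assumes that both directories live on the same filesystem and that the web server has been stopped while the swap takes place.

```python
import os


def swap_installations(live_dir, new_dir):
    """Exchange the contents of two installation directories by renaming."""
    live_dir = os.path.abspath(live_dir)
    new_dir = os.path.abspath(new_dir)
    holding = live_dir + ".swap"  # hypothetical temporary name
    if os.path.exists(holding):
        raise FileExistsError(holding)
    os.rename(live_dir, holding)  # the old installation steps aside
    os.rename(new_dir, live_dir)  # the new installation goes live
    os.rename(holding, new_dir)   # the old installation is kept for rollback
```

Because the old installation is retained under the other name, calling the function a second time with the same arguments rolls the change back.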


\subsection{Repository installation and configuration interface}

The kind of interface to present to the person performing the repository installation and configuration process also needs to be considered. Since the installation and configuration tools will be separate, it also makes sense to present separate interfaces for each phase (this fits the typical ``install then configure'' model that occurs with most software). Note that this discussion applies only to the standalone repository installer and configuration tools, not the bare metal operating system installer. 

There are three obvious options for the installation step:
\begin{description}

	\item[Operating system-provided installer] The native installer program supplied by the operating system could be used (if such exists), such as the Mac OS X installer application. While this would provide an installation experience that is consistent with the user's interface expectations, it would almost certainly require the development of separate installers for each operating system platform, with a consequent increase in development and maintenance complexity. It is also unclear whether such tools would be able to integrate with any native OS package management tools.
	
	\item[Cross-platform GUI installer] There are many cross-platform installer tools available that could be used to build the repository installer. Many of these tools are written in Java, which could enable the installation user interface to look reasonably ``native'' for each platform. Non-Java-based tools may impose a particular look and feel that could be visually jarring on different platforms. As with the native installer option, it is also currently unclear whether any of these tools are able to integrate with the native operating system package management tools.
	
	\item[Shell-based installer] This is the lowest common denominator for all Unix-based systems. Almost any Unix-based system will have some variant of the Bourne shell available, or at least something compatible. The interface will not be very ``pretty'', but will be relatively simple to implement, and can easily handle issues like integrating with package managers and prompting for administrator-level access. If implemented in a modular fashion, the installer should be readily portable to other Unix-based operating systems.

\end{description}
The main issue with the first two options is clearly the ability to interface with native package management tools. Any installer tool that is able to do so would be a suitable candidate, but if no such tool can be found, then a shell-based installer may be the only option.

There are two obvious options for the configuration step:
\begin{description}
	
	\item[Web-based configurator] A web interface could be used to manage the configuration process. This would require an active web server with some sort of back-end scripting support. There is also the issue of gaining administrator level access in order to install and configure many of the components. This is not insurmountable, however, as web-based system administration tools like Webmin can already do this. The big advantage of using a web browser is that it should work on almost any platform if web standards are adhered to, and it will provide a reasonably ``native'' user interface experience in all cases.
	
	\item[Shell-based configurator] As described above. If implemented in a modular fashion, the configurator should be readily portable to other Unix-based operating systems. Furthermore, a shell-based configurator could even act as a back-end application layer behind a web-based front end, solving two problems at once.

\end{description}
The web-based option provides a more consistent user experience across platforms and, particularly when combined with a shell-based back end, retains the flexibility needed to interface with native package management tools.
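
To illustrate the layering of a web front end over a command-line back end, the Python sketch below shows how a request handler might hand validated settings to a back-end program. It is a sketch only: the function name \texttt{run\_backend} and the \texttt{KEY=VALUE} argument convention are hypothetical, and the web request handling and privilege escalation are omitted entirely.

```python
import subprocess


def run_backend(command, settings):
    """Run a back-end command, passing settings as KEY=VALUE arguments.

    Returns the exit status and whatever the back end wrote to standard output.
    """
    # Arguments are passed as a list, so no shell ever interprets user input.
    args = list(command) + [f"{key}={value}" for key, value in sorted(settings.items())]
    result = subprocess.run(args, capture_output=True, text=True, check=False)
    return result.returncode, result.stdout
```

The same back-end program could then be run directly from a terminal, which is what would make the shell-based option usable as a fallback.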

In all cases, consideration should be given to alternate language interfaces (M\={a}ori in particular). Regardless of the interface method used, users should be able to easily select their preferred language. Some installer tools provide this capability already, and the web-based configuration interface should be designed in such a way as to support language templates (similar in concept to what EPrints provides).


\subsection{Distribution media}

While the discussion so far has been about distribution on CD-ROMs, there is no particular reason to limit the solution to only this medium. For example, the solution could also be made available in DVD form and as downloadable disk images. This will provide repository implementers with a choice of installation media to suit the vagaries of their particular installation environment.

Furthermore, it is likely that the CD-ROM version would actually comprise more than just a single CD-ROM. A bare metal install would not only need the operating system files, but also pre-compiled versions of all the EPrints prerequisite software in a package format appropriate for that operating system. An existing system install could reasonably assume a pre-existing functional LAMP installation, but would still need to include copies of other EPrints prerequisites such as libraries, Perl modules, etc., in appropriate formats for the various supported package management tools. Combined, this could easily run to at least two CD-ROMs, but would definitely fit onto a single DVD.

It is also recommended that there should be separate disks for the bare metal install and the existing system install options, for the following reasons:
\begin{itemize}

	\item People with existing systems would not want to download an unnecessary operating system distribution just to get the repository software.
	
	\item The bare metal installer would minimally need only the base operating system installer and the repository configurator, as the repository software installation could be incorporated into the base operating system installation process.
	
	\item Keeping the two separate simplifies the installation instructions. If the disks were combined the instructions might read something like this: ``If you want to install a complete operating system and repository from scratch, boot from this CD and follow the instructions. If you want to install the repository on an existing system, insert the CD and run XXX.'' This is long-winded and potentially confusing.
	
	With separate disks, the instructions could read more like this: ``To install the operating system and repository software, boot from this CD and follow the instructions'' (bare metal install disk), and ``To install the repository software, insert the CD and run XXX'' (existing system install disk). As noted earlier, the first CD could also automatically request the user to insert the second CD.
	
	\item A combined installer would probably not fit on one CD-ROM, whereas a separate CD-ROM for each installer should be feasible.
	
\end{itemize}


\subsection{Items to be configured}
\label{sec-configure}

The basic repository configuration includes things like its internal identifier, domain name, HTTP port number and so on. All of these items are required as part of the base configuration and will need to be included in the configurator. Configuration of the Tasmania EPrints statistics software would also be included here.

In addition to these compulsory items, there are also numerous optional aspects of EPrints itself that can be configured, such as enabling the editorial buffer, required document formats, etc. These will be included as optional items within the configuration process, accessed via an ``advanced configuration'' page. The list of advanced configuration items should be easily extensible, probably via some form of XML specification, so as to cater for future developments. (The same mechanism could also be used to specify compulsory configuration items.)
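
One possible shape for such a specification is sketched below in Python. The XML element and attribute names (\texttt{item}, \texttt{required}, \texttt{default}) and the function \texttt{load\_items} are assumptions made purely for illustration; no format has yet been decided.

```python
import xml.etree.ElementTree as ET

# Hypothetical specification; element and attribute names are assumptions.
SPEC = """
<configuration>
  <item id="repository_id" label="Repository identifier" required="true"/>
  <item id="hostname" label="Domain name" required="true"/>
  <item id="port" label="HTTP port number" required="true" default="80"/>
  <item id="editorial_buffer" label="Enable editorial buffer" required="false"/>
</configuration>
"""


def load_items(xml_text):
    """Split the items in a specification into (compulsory, advanced) lists."""
    compulsory, advanced = [], []
    for node in ET.fromstring(xml_text).iter("item"):
        item = {
            "id": node.get("id"),
            "label": node.get("label"),
            "default": node.get("default"),  # None when no default is given
        }
        target = compulsory if node.get("required") == "true" else advanced
        target.append(item)
    return compulsory, advanced
```

Under this arrangement, adding a new configuration item would mean adding one line to the specification, with no change to the configurator code.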

One optional configuration item of particular relevance to the OARiNZ project is configuration of the EPrints OAI-PMH interface. While it is recommended that this remain an optional configuration (as some preparation is required to set it up properly), an unconfigured OAI-PMH subsystem should be prominently highlighted within the configurator interface, preferably on the main page. This gives repository implementers the option to forgo initial configuration of OAI-PMH, while gently encouraging them to eventually do so.

On this note, there is no reason why the configurator should be limited to once-only use when the repository software is first installed. Rather, it should be installed alongside the repository software and used as a general management tool for creating and configuring repositories on that server. The configurator should keep an internal cache of the configuration settings for each repository that it creates, which will make it easier to re-configure repositories at any time. The configurator should also check the cached configuration against the actual configuration files when opened, in case someone manually edits the original files.
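
The check against manual edits could be done by storing a digest of each configuration file alongside the cached settings. The following Python sketch shows one way of doing so; the function names and the use of a JSON cache file are assumptions, not part of the specification.

```python
import hashlib
import json
import os


def file_digest(path):
    """Return the SHA-256 digest of a file's contents as a hex string."""
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()


def save_cache(cache_path, config_files):
    """Record the current digest of every configuration file."""
    digests = {path: file_digest(path) for path in config_files}
    with open(cache_path, "w") as handle:
        json.dump(digests, handle, indent=2)


def changed_since_cache(cache_path):
    """List configuration files that are missing or differ from the cache."""
    with open(cache_path) as handle:
        digests = json.load(handle)
    return sorted(
        path
        for path, digest in digests.items()
        if not os.path.exists(path) or file_digest(path) != digest
    )
```

A non-empty result on start-up would tell the configurator that someone has edited the files by hand, so that it can warn the administrator before overwriting anything.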

Another consideration is whether the configurator should be able to configure pre-existing manual EPrints installations. We have excluded this from the solution on the grounds that it would introduce considerable complexity. For example, the configurator would need to be able to detect the version of EPrints that was installed and keep a database of which configuration items apply to which version. Additional problems would arise if the pre-existing EPrints was installed in a non-standard manner. We may consider this capability for a future version of the configurator.

The configurator will not assist with the process of customising the look and feel of the repository web pages, simply because there are too many possible permutations of how to do so. The configurator could, however, provide information on which files need to be changed in order to achieve this. This information would also be included in the OARiNZ knowledge base wiki.


\subsection{Other items}

The repository installer will include the M\={a}ori and Pacific Island language packs for EPrints that were developed at Wintec. No special handling is required for these; they will simply be included as standard components in the EPrints installation.

Another issue not yet discussed is when new versions of the DIY solution should be released. An obvious approach is to key the version number of the DIY solution to the corresponding EPrints version, and release a new version of the DIY solution every time a new version of EPrints is released. For minor version number changes the differences should be small. Major version number changes may require more substantial re-engineering. This approach will ensure that the correct version of the configurator is always used with the correct version of EPrints. (This could be subverted by a manual upgrade, so the configurator should always check the installed EPrints version number before proceeding.)
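
The version check itself could be a simple comparison of dotted version numbers, as in the Python sketch below. The names \texttt{CONFIGURATOR\_TARGET}, \texttt{parse\_version} and \texttt{check\_version} are hypothetical, as are the three result labels; how the installed EPrints version number is actually obtained is outside the scope of the sketch.

```python
# EPrints release that this (hypothetical) configurator build is keyed to.
CONFIGURATOR_TARGET = "2.3.13.1"


def parse_version(text):
    """Turn a dotted version string such as '2.3.13.1' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def check_version(installed, target=CONFIGURATOR_TARGET):
    """Classify the installed version relative to the configurator's target."""
    have, want = parse_version(installed), parse_version(target)
    if have == want:
        return "ok"
    if have[0] == want[0]:
        return "minor-mismatch"  # same major version: differences should be small
    return "major-mismatch"      # different major version: refuse to proceed
```

For example, \texttt{check\_version("2.3.12")} would report a minor mismatch, while \texttt{check\_version("3.0")} would report a major one.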

Newer versions of the configurator should either maintain backwards compatibility with older versions of EPrints, or it should be possible to install multiple versions of the configurator in parallel. Otherwise, it might be impossible to maintain an older version repository after a configurator version upgrade.


\subsection{Summary of design recommendations}

\subsubsection*{Repository software}

\begin{itemize}

	\item GNU EPrints 2.3.13.1
	
	\item Tasmania statistics software

\end{itemize}

\subsubsection*{Target operating system platform}

\begin{itemize}

	\item Unix-based operating systems that have functional Apache, MySQL and Perl/PHP components already installed

	\item Ubuntu Linux (server distribution, x86 architecture) for the bare metal install option
	
	\item Base OS installer to call standalone repository installer (on separate CD)

\end{itemize}
	
\subsubsection*{Package installation}
	
\begin{itemize}

	\item Use native package management tools provided by the operating system wherever possible
	
	\item For prior EPrints installations, install alongside rather than overwrite

\end{itemize}

\subsubsection*{Repository installation \& configuration interface}

\begin{itemize}

	\item Cross-platform GUI installation interface (if feasible), with shell-based option as fallback

	\item Web-based configuration interface, with shell-based option (ideally usable as a back-end CGI script) as fallback
	
	\item Alternate language options

\end{itemize}

\subsubsection*{Distribution media}

\begin{itemize}

	\item CD-ROM

	\item DVD

	\item Downloadable disk images in standard formats

	\item At least two disks for bare metal installs: base operating system (disk 1) + repository software (disk 2, including configurator)

	\item One disk (or set of disks) for existing system installs: repository software only (including installer and configurator)

\end{itemize}

\subsubsection*{Items to be configured}

\begin{itemize}

	\item All required EPrints, etc., configuration items
	
	\item OAI-PMH configuration optional but strongly encouraged
	
	\item Other optional configuration items
	
	\item Extensible specification of configuration items

\end{itemize}

\subsubsection*{Other items}

\begin{itemize}

	\item M\={a}ori and Pacific Island language packs for EPrints to be included
	
	\item Synchronise EPrints and DIY solution version numbers; release new versions corresponding to EPrints releases.
	
	\item Maintain configurator backwards compatibility or enable parallel version installs.

\end{itemize}


\section{Typical usage scenarios}


\subsection{Bare metal installation}

\begin{figure}
	\begin{center}
		\includegraphics{bare_metal}
		\caption{Installation on a new system (``bare metal'')}
		\label{fig-bare-metal}
	\end{center}
\end{figure}

In this scenario, shown in Figure~\ref{fig-bare-metal}, a repository implementer wishes to bootstrap a complete repository installation on new hardware (this includes virtualisation environments such as VMware or Virtual PC). They boot from the repository live CD (\ding{'300}), which installs the Ubuntu operating system along with all the required packages for EPrints (\ding{'301}). The latter will probably also include the repository configurator and configuration items list, as implied by the dashed arrows at bottom right. After the base installation completes (a reboot may be required), the operating system (\ding{'302}) and repository configurators (\ding{'303}) are executed in sequence. The repository configuration is saved for future reference.


\subsection{Installation on existing system}

\begin{figure}
	\begin{center}
		\includegraphics{existing_system}
		\caption{Installation on an existing system (including version upgrades)}
		\label{fig-existing}
	\end{center}
\end{figure}

In this scenario, shown in Figure~\ref{fig-existing}, a repository implementer wishes to install a repository on an existing server, which already has an operating system and associated software installed (including the various LAMP components), but does not already have EPrints installed. They insert the installation CD and launch the repository installer application (\ding{'300}). The installer installs all the necessary packages to support EPrints (those that are not installed already), including the repository configurator and configuration items list, as implied by the dashed arrows at bottom right. After the installation completes, the installer opens the repository configurator in a web browser (\ding{'301}). The repository configuration is saved for future reference.

Note the separation of the repository configurator into a web user interface and back-end shell for executing installation tasks.


\subsection{Reconfiguring an existing system}

\begin{figure}
	\begin{center}
		\includegraphics{reconfigure}
		\caption{Reconfiguring an existing installation}
		\label{fig-reconfigure}
	\end{center}
\end{figure}

In this scenario, shown in Figure~\ref{fig-reconfigure}, a repository administrator wishes to reconfigure their installation\footnote{As noted in Section~\ref{sec-configure}, this only applies to EPrints installations that were installed by the DIY installer, not to manual EPrints installations.}, for example, to create a new repository or to change the settings of an existing repository. They launch the repository configurator that was installed on the server during the original installation process (\ding{'300}). This reads the existing repository configuration (\ding{'301}) and the configuration items list (\ding{'302}) and uses these to initialise the configurator. When complete, the new configuration is saved for future reference.


\subsection{Upgrading to a new version}

In this scenario, a repository administrator wishes to upgrade an existing EPrints installation. The process is essentially the same as that shown in Figure~\ref{fig-existing}, with a couple of additions. The existing installation could have been installed either manually or by an earlier version of the DIY solution. Regardless, the installer checks for prior EPrints installations in standard locations (e.g., \texttt{/usr/local/eprints}). If the installer cannot find any, it asks the administrator to specify the location of any prior installations. The installer then offers to create a new installation alongside the existing one, as discussed in Section~\ref{sec-installation}. After the installation completes, the installer opens the repository configurator in a web browser. The repository configuration is saved for future reference. As a final step, either the installer or the configurator may offer to copy across certain files from the prior installation.
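
The search for prior installations might be sketched in Python as follows. Only \texttt{/usr/local/eprints} comes from the text above; the second standard location, the marker subdirectories used to recognise an EPrints installation, and the function names are all assumptions made for illustration.

```python
import os

# Only the first entry comes from this document; the second is an assumption.
STANDARD_LOCATIONS = ["/usr/local/eprints", "/opt/eprints"]


def looks_like_eprints(path, markers=("archives", "perl_lib")):
    """Guess whether a directory holds an EPrints installation.

    The marker subdirectory names are assumptions, not a verified layout.
    """
    return all(os.path.isdir(os.path.join(path, marker)) for marker in markers)


def find_prior_installations(candidates=STANDARD_LOCATIONS, extra=()):
    """Return candidate directories, plus any the administrator supplied,
    that appear to contain EPrints."""
    return [path for path in list(candidates) + list(extra) if looks_like_eprints(path)]
```

If the returned list is empty, the installer would prompt the administrator for a location and pass it in through the \texttt{extra} parameter.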


\section{Implementation plan}

A phased implementation approach will be adopted, with each phase building on the outputs from the previous phase. However, not all of the tasks are sequential in nature and some can be carried out in parallel. Indeed, some of the phases overlap and share work. Estimated start and finish dates are provided, but may be subject to change as work progresses. All components will be fully documented (both user and technical) on the OARiNZ knowledge base wiki, which will provide a base for future support.


\subsection{Pre-packaged EPrints distribution}
\label{sec-package}

\noindent \textbf{Start:} Mid-October 2006	\\
\textbf{Finish:} 30 November 2006

\begin{itemize}

	\item Initially in \texttt{.deb} format, possibly \texttt{.rpm} later.
	
	\item To be used by the various installers.
	
	\item Specify appropriate dependencies on prerequisite packages.
	
	\item Build as separate packages any prerequisite items that are not already available in pre-packaged form (such as required Perl modules, etc.).
	
	\item Standalone testing of \texttt{.deb} packages on Ubuntu and Debian systems (Red Hat for \texttt{.rpm} if necessary).

\end{itemize}


\subsection{Web-based EPrints configurator}
\label{sec-configurator}

\noindent \textbf{Start:} 1 November 2006	\\
\textbf{Finish:} 31 December 2006

\begin{itemize}

	\item Web-based UI as front end.

	\item Shell-based configurator as a back end.

	\item Standalone testing.

\end{itemize}


\subsection{EPrints installer for existing systems}

\noindent \textbf{Start:} 1 December 2006	\\
\textbf{Finish:} 31 January 2007

\begin{itemize}

	\item Initially target Ubuntu/Debian Linux.
	
	\item Modular implementation so that it is readily generalisable to other platforms. Likely target platforms are Debian, Mac OS X (investigate installation and use of Fink package manager), FreeBSD, others?

	\item Infrastructure for specifying configuration items and saving repository configuration information.

	\item Must be able to obtain administrator level access.

	\item Can be run either standalone or called by another application.
	
	\item Standalone test with dummy EPrints and configurator packages, followed by integration testing with EPrints package from Section~\ref{sec-package} and completed configurator from Section~\ref{sec-configurator}.

\end{itemize}


\subsection{EPrints installer for new systems (bare metal)}

\noindent \textbf{Start:} Mid-October 2006	\\
\textbf{Finish:} Mid-February 2007

\begin{itemize}

	\item Customise Ubuntu live CD with required packages for repository installation.

	\item Integrate repository configurator into Ubuntu installation process.

	\item Standalone test with dummy EPrints and configurator packages, followed by integration testing with EPrints package from Section~\ref{sec-package} and completed configurator from Section~\ref{sec-configurator}.
	
\end{itemize}


\subsection{Testing}

\noindent \textbf{Start:} 1 December 2006	\\
\textbf{Finish:} ongoing

\begin{itemize}

	\item Ongoing standalone testing of the various components in a throwaway testing environment.
	
	\item On-site testing with typical ``client'' organisations and users.
	
\end{itemize}


\section{Conclusion}

This document has discussed the implementation of a DIY repository solution consistent with Objective 7 of the OARiNZ project. The proposed solution covers four main usage scenarios: installing a repository from scratch on new hardware, installing a repository on an existing system, reconfiguring an existing repository, and performing a version upgrade. In all cases, the solution will reduce the complexity of the process and thus make it considerably easier for repository implementers to get up and running.
	

\vfill {\scriptsize \hfill \verb+$Id$+}


\end{document}