\documentclass[12pt,pdftex,a4paper,titlepage]{article}


\usepackage[T1]{fontenc}
\usepackage{textcomp}
\usepackage{lmodern}
\usepackage{mathpazo}
\usepackage{graphicx}
\usepackage[margin=1in]{geometry}
\usepackage{pifont}
\usepackage{url}


\graphicspath{{images/}}


\renewcommand{\ttdefault}{blg}


\title{\textsf{\textbf{OARiNZ DIY Repository Solution}}}
\author{\textsf{\textbf{Nigel Stanger}}}
\date{\textsf{\textbf{August/September 2006}}%
	\linebreak\linebreak\linebreak\linebreak\linebreak%
	\includegraphics[scale=0.4]{OU-Logo-Colour}}


\begin{document}


\maketitle


\tableofcontents

\vfill

{\small\sffamily
\begin{tabular}{|c|c|l|c|}
	\multicolumn{3}{l}{\textbf{Document history}}	\\
	\hline
	\textbf{Version}	&	\textbf{Date}	&	\textbf{Description}	&	\textbf{Author(s)}	\\
	\hline\hline
	1.1					&	12/09/2006		&	Original version from Eduforge wiki.	&	NS	\\
	\hline
	1.2					&	14/09/2006		&	First release for feedback.	&	NS	\\
	\hline
	1.3					&	20/10/2006		&	Incorporated feedback from John Rankin.	&	NS	\\
	\hline
	1.4					&	09/11/2006		&	Final candidate for sign-off.	&	NS	\\
	\hline
\end{tabular}
}


\newpage


\section{Introduction}

Implementing a digital repository, using a typical open source solution such as GNU EPrints or DSpace, is currently a complex proposition that requires a reasonable level of technical expertise in order to find, download and install all the required software, then separately configure these components appropriately for the target operating system. Ongoing maintenance of the repository configuration can also be complex. Both tasks can be simplified, in particular by removing the need to manually find, download, install and configure multiple separate components. Instead, separate higher-level installer and configuration tools could manage these tasks.

Objective 7 of the OARiNZ project aims to address this need. The stated aim of this objective is to ``produce a freely distributable, easy to install CD-ROM containing pre-configured (or self-configuring) open source software for use by institutions looking for entry-level assistance with developing their own shareable digital repository''\footnote{\url{http://www.oarinz.ac.nz/objectives.php#seven}}. This document outlines a specification for such a solution.

The nature of currently available open source repository software makes it unlikely that we can completely eliminate the need for some technical expertise. Most such software targets a LAMP environment (Linux/Unix, Apache, MySQL, Perl/PHP), and several installation and configuration tasks require administrator-level access, so the solution cannot be fully automated. Regardless, the solution will enable repository implementers to quickly install and configure a complete digital repository, either from ``bare metal'' on a new server or on an existing system (i.e., one that already has an operating system installed, and preferably the required LAMP components). In addition, the level of required technical expertise and the complexity of the installation and configuration process will be reduced, thus lowering the bar for implementing a digital repository.

In the spirit of ``lowering the bar'', a key aim should be to automate or abstract as much of the repository installation and configuration process as possible, focusing attention instead on only those elements that \emph{require} human intervention. In other words, repository implementers will not be forced to type in arcane commands unless it is absolutely unavoidable, nor will they be forced to read many pages of dense and obscure documentation before they start, or be burdened with byzantine installation procedures. A laudable (but perhaps overly optimistic) goal would be to make the installation process as easy as installing software under Mac OS X or Windows.

With regard to repository maintenance, the ideal would be to produce a high-level configuration tool that is able to configure a repository (or collection of repositories) without requiring the administrator to manually edit configuration files. This will certainly be feasible with repositories that are installed by the proposed installer tool, and may be feasible, with some restrictions, for manually installed repositories.

The following key deliverables are therefore proposed:
\begin{enumerate}

	\item A ``bare metal'' installer for creating completely new repositories on new hardware, that includes an operating system, the required LAMP components and all the required repository software.
	
	\item A standalone tool for installing an EPrints repository on an existing server, i.e., a server with an already installed operating system and preferably the required LAMP components.
	
	\item A pre-packaged EPrints distribution in \texttt{.deb} (and possibly \texttt{.rpm}) format for use by the two previous deliverables.

	\item A standalone tool for configuring an EPrints repository.

\end{enumerate}
These deliverables would be distributed in the form of a CD-ROM (or equivalent medium) containing all the required software and a ``shell'' for managing the installation and configuration process. Downloadable disk images would also be made available.

The remainder of this document discusses various design and implementation options, typical usage scenarios, and the implementation plan.


\section{Design and implementation options}


\subsection{Repository software}

Ultimately it would be nice to provide a solution for both GNU EPrints and DSpace, which are the two major open source solutions for smaller-scale repository implementation. However, we currently have little expertise at Otago with DSpace, so the initial focus will be on delivering a solution for EPrints. The Tasmania EPrints statistics software will also be included as a standard component, so that any repositories installed using this solution generate download statistics out of the box.


\subsection{Operating systems}

EPrints repositories are typically run on Unix-based systems (e.g., Linux, BSD, Mac OS X), and we have experience at Otago with installing EPrints on Debian Linux, FreeBSD, Mac OS X and Ubuntu Linux. Unix-based systems will therefore be our primary target for implementation. Note that the EPrints web site currently states that there are ``no plans for a version to run under Microsoft Windows''\footnote{\url{http://www.eprints.org/documentation/tech/php/intro.php#what_will_it_run_on}}.

For bare metal installations, a complete operating system distribution will also be required. It is clearly not feasible to provide an installation disk for every possible Unix platform, nor for proprietary operating systems such as Mac OS X. The bare metal installer can therefore realistically only support one operating system platform. The easiest way to achieve this is to pick a Unix-based operating system that provides a bootable ``live CD''.

We have experience at Otago with installing EPrints repositories under Ubuntu Linux\footnote{\url{http://www.ubuntu.com/}}, which provides a live CD feature, so this is an obvious choice. The Ubuntu live CD is also easily customisable, so a custom live CD could be created that installed not only the base operating system but also the required packages for installing EPrints and our configurator software. We will limit ourselves to the x86 architecture in order to keep things simpler.

Installation of the repository software could be incorporated directly into the operating system installation process, implying that the standalone repository installer would not be required for bare metal installs. Alternatively, the standalone installer could be provided on a separate CD. The OS installer could then say something like ``please insert the CD labelled `EPrints Installer'{}'' and simply call the standalone installer once the CD has been inserted. The latter option should be easier to achieve and avoids any potential duplication of effort in both the bare metal and standalone installers.


\subsection{Package installation}

Unfortunately Unix-based environments do not provide as much uniformity of operating environment as we would like. There is wide variation even across different Linux distributions, with regard to package installation and management, system environment and standard toolsets. The process for installing a required package is completely different under Mac OS X, Debian Linux, Red Hat Linux and FreeBSD, for example, and there are even sometimes multiple package management mechanisms available within the same operating system distribution.

It therefore needs to be considered whether the standalone repository installer for existing systems should use the native package management software (e.g., Red Hat's \texttt{rpm} or Debian's \texttt{dpkg}), or independent installer software. If the native route is taken, the installer will need to detect the operating system version and then look for appropriate package management tools, which of course makes implementation more complex. The non-native route will lead to a simpler implementation, but would lose the significant advantage of having packages managed by the operating system, which is particularly useful for dependency management and upgrades. The native option is therefore preferred.
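
As a rough illustration of the native route, the following Python sketch checks which of the two package tools named above is present on the system. It is a minimal sketch under stated assumptions, not the proposed implementation: the function name \texttt{detect\_package\_tool}, the table \texttt{CANDIDATE\_TOOLS} and the order of preference are all hypothetical, and a real installer would also need to identify the operating system version.

\begin{verbatim}
import shutil

# Hypothetical lookup table: candidate tools in an assumed order of
# preference, each paired with the package format it handles.
CANDIDATE_TOOLS = [
    ("dpkg", "deb"),
    ("rpm", "rpm"),
]


def detect_package_tool(which=shutil.which):
    """Return (tool, package_format) for the first tool found on the
    PATH, or None if no known package tool is available."""
    for tool, package_format in CANDIDATE_TOOLS:
        if which(tool) is not None:
            return tool, package_format
    return None
\end{verbatim}

A result of \texttt{None} would signal that the installer has to fall back to some other mechanism. Supporting a further platform would then be a matter of adding an entry to the table rather than changing the detection logic.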

Another consideration is how to handle pre-existing EPrints installations, whether they were installed manually or by the DIY installer. For ongoing sustainability, the installer should install EPrints in a way that future versions of the installer can recognise and update, thus enabling future upgrades to the EPrints software in a reasonably transparent manner. Things become a bit murky when installing over the top of a pre-existing manual EPrints installation, however. EPrints does use a standard directory and file structure, so as long as the installation has not been radically restructured, it should in theory be feasible to install over the top (it could make sense to limit any such capability to a particular range of EPrints versions, however).

A much safer option in general, however, would be for the installer to create a new installation alongside the existing one, then perhaps offer to copy across any customised files. This would give the repository administrator the opportunity to thoroughly test the new installation before going live, which would probably just be a matter of swapping the ``new'' and ``old'' installation directories.
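
The directory swap described above could look something like the following Python sketch. The function name \texttt{swap\_installations} and its three path arguments are hypothetical, and the sketch assumes that all three paths are on the same file system, so that each step is a plain rename rather than a copy.

\begin{verbatim}
import os


def swap_installations(live_dir, new_dir, backup_dir):
    """Make new_dir the live installation and keep the previous live
    installation as backup_dir.

    Assumes all three paths are on the same file system.
    """
    if os.path.exists(backup_dir):
        raise FileExistsError("backup already exists: " + backup_dir)
    os.rename(live_dir, backup_dir)      # retire the old installation
    try:
        os.rename(new_dir, live_dir)     # promote the tested installation
    except OSError:
        os.rename(backup_dir, live_dir)  # roll back if promotion fails
        raise
\end{verbatim}

Keeping the old installation under a backup name, rather than deleting it, means that the administrator can revert by swapping the directories back.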


\subsection{Repository installation and configuration interface}

The kind of interface to present to the person performing the repository installation and configuration process also needs to be considered. Since the installation and configuration tools will be separate, it probably also makes sense to present separate interfaces for each phase (this also fits with the typical ``install then configure'' model that occurs with most software). Note that this discussion applies only to the standalone repository installer and configuration tools, not the bare metal operating system installer.

There are three obvious options for the installation step:
\begin{description}

	\item[Operating system-provided installer] The native installer program supplied by the operating system could be used (if such exists), such as the Mac OS X installer application. While this would provide an installation experience that is consistent with the user's interface expectations, it would almost certainly require the development of separate installers for each operating system platform, with a consequent increase in development and maintenance complexity. It is also unclear whether such tools would be able to integrate with any native OS package management tools.
	
	\item[Cross-platform GUI installer] There are many cross-platform installer tools available that could be used to build the repository installer. Many of these tools are written in Java, which could enable the installation user interface to look reasonably ``native'' for each platform. Non Java-based tools may impose a particular look and feel which could be visually jarring on different platforms. As with the native installer option, it is also currently unclear whether any of these tools are able to integrate with the native operating system package management tools.
	
	\item[Shell-based installer] This is the lowest common denominator for all Unix-based systems. Almost any Unix-based system will have some variant of the Bourne shell available, or at least something compatible. The interface will not be very ``pretty'', but will be relatively simple to implement, and can easily handle issues like integrating with package managers and prompting for administrator-level access. If implemented in a modular fashion, the installer should be readily portable to other Unix-based operating systems.

\end{description}
The main issue with the first two options is clearly the ability to interface with native package management tools. Any installer tool that is able to do so would be a suitable candidate, but if no such tool can be found, then a shell-based installer may be the only option.

There are two obvious options for the configuration step:
\begin{description}
	
	\item[Web-based interface] A web interface could be used to manage the configuration process. This would require an active web server with some sort of back-end scripting support. There is also the issue of gaining administrator level access in order to install and configure many of the components. This is not insurmountable, however, as web-based system administration tools like Webmin can already do this. The big advantage of using a web browser is that it should work on almost any platform if web standards are adhered to, and it will provide a reasonably ``native'' user interface experience in all cases.
	
	\item[Shell-based interface] As described above. If implemented in a modular fashion, the configurator should be readily portable to other Unix-based operating systems. Furthermore, a shell-based configurator could even act as a back-end application layer behind a web-based front end, solving two problems at once.

\end{description}
The web-based option provides a more consistent cross-platform user experience with the flexibility required to provide a cross-platform solution that can interface with native package management tools (especially when combined with the shell-based option).
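
The idea of a single tool that runs either standalone or as the back end of a web front end can be sketched as follows. This Python fragment is purely illustrative and the function names are hypothetical. It relies on the \texttt{GATEWAY\_INTERFACE} environment variable, which CGI/1.1 web servers set for the scripts they launch, to decide how to format its output.

\begin{verbatim}
import os


def running_as_cgi(environ=os.environ):
    """CGI/1.1 servers set GATEWAY_INTERFACE for scripts they launch."""
    return "GATEWAY_INTERFACE" in environ


def render(message, environ=os.environ):
    """Format one status message for whichever front end is in use."""
    if running_as_cgi(environ):
        # A CGI response is a header block, a blank line, then the body.
        return "Content-Type: text/plain\r\n\r\n" + message + "\n"
    return message + "\n"
\end{verbatim}

Under this arrangement the installation and configuration logic would be written once, with only the thin output layer differing between the terminal and the web server.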

In all cases, consideration should be given to alternate language interfaces (M\={a}ori in particular). Regardless of the interface method used, users should be able to easily select their preferred language. Some installer tools provide this capability already, and the web-based configuration interface should be designed in such a way as to support language templates.


\subsection{Distribution media}

While the discussion so far has been about distribution on CD-ROMs, there is no particular reason to limit the solution to only this medium. For example, the solution could also be made available in DVD form and as downloadable disk images. This will provide repository implementers with a choice of installation media to suit the vagaries of their particular installation environment.

Furthermore, it is likely that the CD-ROM version would actually comprise more than just a single CD-ROM. A bare metal install would not only need the operating system files, but also pre-compiled versions of all the EPrints prerequisite software in a package format appropriate for that operating system. An existing system install could reasonably assume a pre-existing functional LAMP installation, but would still need to include copies of other EPrints prerequisites such as libraries, Perl modules, etc., in appropriate formats for the various supported package management tools. Combined, this could easily run to at least two CD-ROMs, but would definitely fit onto a single DVD.

It is also recommended that there should be separate disks for the bare metal install and the existing system install options, for the following reasons:
\begin{itemize}

	\item People with existing systems would not want to download an unnecessary operating system distribution in order to get just the repository software.
	
	\item The bare metal installer would minimally need only the base operating system installer and the repository configurator, as the repository software installation could be incorporated into the base operating system installation process.
	
	\item Keeping the two separate simplifies the installation instructions. If the disks were combined the instructions might read something like this: ``If you want to install a complete operating system and repository from scratch, boot from this CD and follow the instructions. If you want to install the repository on an existing system, insert the CD and run XXX.'' This is long-winded and potentially confusing.
	
	With separate disks, the instructions could read more like this: ``To install the operating system and repository software, boot from this CD and follow the instructions'' (bare metal install disk), and ``To install the repository software, insert the CD and run XXX'' (existing system install disk).
	
	\item A combined installer would probably not fit on one CD-ROM, whereas a separate CD-ROM for each installer might be feasible.
	
\end{itemize}


\subsection{Items to be configured}
\label{sec-configure}

The basic repository configuration includes things like its internal identifier, domain name, HTTP port number and so on. All of these items are required as part of the base configuration and will need to be included in the configurator. Configuration of the Tasmania EPrints statistics software would also be included here.

In addition to these compulsory items, there are also numerous optional aspects of EPrints itself that can be configured, such as enabling the editorial buffer, required document formats, etc. These will be included as optional items within the configuration process, accessed via an ``advanced configuration'' page. The list of advanced configuration items should be easily extensible, probably via some form of XML specification, so as to cater for future developments. (The same mechanism could also be used to specify compulsory configuration items.)
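
To make the idea of an XML specification more concrete, the following Python sketch parses a small list of configuration items and separates the compulsory items from the advanced ones. The XML element and attribute names, the item names and the helper functions are all invented for illustration; they are not part of EPrints and do not represent a settled design.

\begin{verbatim}
import xml.etree.ElementTree as ET

# Hypothetical specification format, invented for illustration only.
SPEC = """
<config-items>
  <item name="repository_id" required="yes" label="Internal identifier"/>
  <item name="hostname" required="yes" label="Domain name"/>
  <item name="http_port" required="yes" label="HTTP port" default="80"/>
  <item name="oai_pmh" required="no" label="OAI-PMH interface"/>
</config-items>
"""


def load_items(xml_text):
    """Parse the specification into a list of dictionaries."""
    items = []
    for element in ET.fromstring(xml_text).findall("item"):
        items.append({
            "name": element.get("name"),
            "label": element.get("label"),
            "required": element.get("required") == "yes",
            "default": element.get("default"),
        })
    return items


def split_items(items):
    """Separate compulsory items from advanced (optional) ones."""
    required = [i for i in items if i["required"]]
    advanced = [i for i in items if not i["required"]]
    return required, advanced
\end{verbatim}

With a specification of this kind, adding a new configuration item would mean adding one line to the XML file, with no change to the configurator code itself.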

One optional configuration item of particular relevance to the OARiNZ project is configuration of the EPrints OAI-PMH interface. While it is recommended that this remain an optional configuration (as some thought is required to set it up properly), an unconfigured OAI-PMH subsystem should be prominently highlighted within the configurator interface, preferably on the main page. This gives repository implementers the option to forgo initial configuration of OAI-PMH, while gently encouraging them to eventually do so.

On this note, there is no reason why the configurator should be limited to once-only use when the repository software is first installed. Rather, it should be installed alongside the repository software and used as a general management tool for creating and configuring repositories on that server. The configurator should keep an internal record of the configuration settings for each repository that it creates, which will make it easier to re-configure repositories at any time. The configurator should probably also check the saved configuration against the actual configuration files when opened, in case someone manually edits them.
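
One possible way to detect manual edits, sketched here in Python, is to record a digest of each configuration file when the configurator writes it and to compare these digests against the files on disk when the configurator is next opened. The function names and the shape of the saved record are assumptions made for this example only.

\begin{verbatim}
import hashlib


def file_digest(path):
    """Return the SHA-256 digest of a file's contents as a hex string."""
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()


def find_manual_edits(saved_digests):
    """Compare saved digests against the files currently on disk.

    saved_digests maps each configuration file path to the digest
    recorded when the configurator last wrote that file. Returns a
    sorted list of paths that have changed or no longer exist.
    """
    changed = []
    for path, digest in saved_digests.items():
        try:
            current = file_digest(path)
        except FileNotFoundError:
            current = None
        if current != digest:
            changed.append(path)
    return sorted(changed)
\end{verbatim}

An empty result would indicate that the saved configuration still matches the files on disk. Otherwise the configurator could warn the administrator before overwriting any of the listed files.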

Another consideration is whether the configurator should be able to configure pre-existing manual EPrints installations. We have excluded this from the solution on the grounds that it would introduce considerable complexity. For example, the configurator would need to be able to detect the version of EPrints that was installed and keep a database of which configuration items apply to which version. Additional problems would arise if the pre-existing EPrints was installed in a non-standard manner. We may consider this capability for a future version of the configurator.

The configurator will not assist with the process of customising the look and feel of the repository web pages, simply because there are too many possible permutations of how to modify the look and feel. The configurator could, however, provide information on which files need to be changed in order to achieve this. This information would also be included in the OARiNZ knowledge base wiki.


\subsection{Other items}

The repository installer will include the M\={a}ori and Pacific Island language packs for EPrints that were developed at Wintec. No special handling is required for these; they will simply be included as standard components in the EPrints installation.


\subsection{Summary of design recommendations}

\subsubsection*{Repository software}

\begin{itemize}

	\item GNU EPrints

\end{itemize}

\subsubsection*{Target operating system platform}

\begin{itemize}

	\item Unix-based operating systems that have functional Apache, MySQL and Perl/PHP components already installed

	\item Ubuntu Linux (server distribution, x86 architecture) for the bare metal install option
	
	\item Base OS installer to call standalone repository installer (on separate CD)

\end{itemize}
	
\subsubsection*{Package installation}
	
\begin{itemize}

	\item Use native package management tools provided by the operating system wherever possible
	
	\item For prior EPrints installations, install alongside rather than overwrite

\end{itemize}

\subsubsection*{Repository installation \& configuration interface}

\begin{itemize}

	\item Shell-based option (ideally usable as a back-end CGI script), as the ultimate fallback

	\item Cross-platform GUI installation interface (if feasible)

	\item Web-based configuration interface
	
	\item Alternate language options

\end{itemize}

\subsubsection*{Distribution media}

\begin{itemize}

	\item CD-ROM

	\item DVD

	\item Downloadable disk images in standard formats

	\item At least two disks for bare metal installs: base operating system (disk 1) + repository software (disk 2, including configurator)

	\item One disk (or set of disks) for existing system installs: repository software only (including installer and configurator)

\end{itemize}

\subsubsection*{Items to be configured}

\begin{itemize}

	\item All required EPrints, etc., configuration items
	
	\item OAI-PMH configuration optional but encouraged
	
	\item Other optional configuration items
	
	\item Extensible specification of configuration items

\end{itemize}

\subsubsection*{Other items}

\begin{itemize}

	\item M\={a}ori and Pacific Island language packs for EPrints to be included

\end{itemize}


\section{Typical usage scenarios}


\subsection{Bare metal installation}

\begin{figure}
	\begin{center}
	\includegraphics{bare_metal}
		\caption{Installation on a new system}
		\label{fig-bare-metal}
	\end{center}
\end{figure}

In this scenario, shown in Figure~\ref{fig-bare-metal}, a repository implementer wishes to bootstrap a complete repository installation on new hardware (this includes virtualisation environments such as VMware or Virtual PC). They boot from the repository live CD (\ding{'300}), which installs the Ubuntu operating system along with all the required packages for EPrints (\ding{'301}). The latter will probably also include the repository configurator and configuration items list, as implied by the dashed arrows at bottom right. After the base installation completes (a reboot may be required), the operating system (\ding{'302}) and repository configurators (\ding{'303}) are executed in sequence. The repository configuration is saved for future reference.


\subsection{Installation on existing system}

\begin{figure}
	\begin{center}
		\includegraphics{existing_system}
		\caption{Installation on an existing system}
		\label{fig-existing}
	\end{center}
\end{figure}

In this scenario, shown in Figure~\ref{fig-existing}, a repository implementer wishes to install a repository on an existing server, which already has an operating system and associated software installed (including the various LAMP components). They insert the installation CD and launch the repository installer application (\ding{'300}). The installer installs all the necessary packages to support EPrints (those that are not installed already), including the repository configurator and configuration items list, as implied by the dashed arrows at bottom right. After the installation completes, the installer opens the repository configurator in a web browser (\ding{'301}). The repository configuration is saved for future reference.

Note the separation of the repository configurator into a web user interface and back-end shell for executing installation tasks.


\subsection{Reconfiguring an existing system}

\begin{figure}
	\begin{center}
	\includegraphics{reconfigure}
		\caption{Reconfiguring an existing installation}
		\label{fig-reconfigure}
	\end{center}
\end{figure}

In this scenario, shown in Figure~\ref{fig-reconfigure}, a repository administrator wishes to reconfigure their installation\footnote{As noted in Section~\ref{sec-configure}, this only applies to EPrints installations that were installed by the DIY installer, not to manual EPrints installations.}, for example, to create a new repository or to change the settings of an existing repository. They launch the repository configurator that was installed on the server during the original installation process (\ding{'300}). This reads the existing repository configuration (\ding{'301}) and the configuration items list (\ding{'302}) and uses these to initialise the configurator. When complete, the new configuration is saved for future reference.


\subsection{Updating to a new version}

\ldots{}blah blah\ldots


\section{Implementation plan}

A phased implementation approach will be adopted, with each phase building on the outputs from the previous phase. However, not all of the tasks are sequential in nature, and some may be able to be carried out in parallel. Estimated start and finish dates are provided, but may be subject to change as work progresses.


\subsection{Phase 1: Build shell-based repository installer/configurator}

\noindent \textbf{Start:} Mid-September 2006	\\
\textbf{Finish:} 31 October 2006

\begin{itemize}

	\item Initially for Ubuntu Linux only.

	\item Modular implementation so that it is readily generalisable to
	other platforms.

	\item Infrastructure for specifying configuration items and saving
	repository configuration information.

	\item Must be able to obtain administrator level access.

	\item Can be run either standalone or as a CGI script.
	
	\item Test.

\end{itemize}


\subsection{Phase 2: Build web-based installer/configurator interface}

\noindent \textbf{Start:} 1 October 2006	\\
\textbf{Finish:} Mid-November 2006

\begin{itemize}

	\item Use shell-based installer/configurator as a back end.

	\item Investigate feasibility of a web-based UI for the installation
	step (e.g., by providing an Apache executable on the CD).

	\item Test.

\end{itemize}


\subsection{Phase 3: Build bare metal installer (live CD)}

\noindent \textbf{Start:} Mid-October 2006	\\
\textbf{Finish:} 30 November 2006

\begin{itemize}

	\item Create \texttt{.deb} packages for EPrints and other associated
	software that are not available in this format.

	\item Customise Ubuntu live CD with required packages for repository
	installation.

	\item Integrate repository configurator into Ubuntu installation
	process.

	\item Test.
	
\end{itemize}


\subsection{Phase 4: Port standalone installer/configurator to other platforms}

\noindent \textbf{Start:} 1 November 2006	\\
\textbf{Finish:} 31 December 2006

\begin{itemize}

	\item Debian Linux

	\item Mac OS X (investigate installation and use of Fink package
	manager).

	\item FreeBSD

	\item Others?
	
\end{itemize}


\section{Conclusion}

This document has discussed the implementation of a DIY repository solution consistent with Objective 7 of the OARiNZ project. The proposed solution covers three main usage scenarios: installing a repository from scratch on new hardware, installing a repository on an existing system, and reconfiguring an existing repository. In all cases, the solution will reduce the complexity of the process and thus make it considerably easier for repository implementers to get up and running.
	

\vfill {\scriptsize \hfill \verb+$Id$+}


\end{document}