\documentclass[12pt]{article}


\usepackage[margin=1in]{geometry}
\usepackage{fontspec}
\usepackage{harvard}

\setmainfont{Minion Pro}
\setmonofont{Letter Gothic 12 Pitch}


\title{Automatic classification of \\ old digital document formats}
\author{Nigel Stanger \and Brendon Woodford \and William Sanson}

\pagestyle{empty}

\begin{document}

\maketitle
\thispagestyle{empty}

\begin{abstract}
It is now common to be unable to open older digital documents, either because the software that created them has been discontinued or because it no longer supports their format. Worse, modern versions of software may open old documents but lose elements of the original (e.g., formatting). More precise identification of the software version that created a file would enable better recovery or migration of the file. This paper describes preliminary work on attempts to extract such information from old Microsoft Word documents.
\end{abstract}

\section{Introduction}

The world is awash with digital documents going back several decades, with many of significant historical, cultural, scientific, or legal importance. Most people will probably access only very recent documents during normal activities, but there are many reasons to access older digital documents, such as studying the works of important writers \cite{Kolowich.S-2009a-Archiving}, re-analysing old research data using new methods \cite{Mount.R-2009a-Data,Pringle.H-2010a-NASA}, and finding new evidence for a “cold case” through forensic examination. However, digital documents tend to become progressively more difficult to open over time as the software that created them evolves, or is even discontinued. Even documents younger than 20 years are not safe: e.g., Microsoft Word 2013 cannot open documents created by Word 95 or earlier \cite{Microsoft-2014a-Word2013}, and it can be hard to even find a computer that can run such “antique” software. Vinton \citeasnoun{Cerf.V-2015a-Digital}, one of the creators of the Internet, recently warned that we are in danger of a “forgotten generation, or even a forgotten century” due to this “bit rot”.

A less obvious, but in many ways more significant, issue is that even when modern software \emph{can} open an old document, it may not accurately reproduce the document’s original form due to changes in functionality. This is analogous to human languages, where grammar and meaning change significantly over time. A person who knows only modern English will struggle to accurately comprehend Chaucer. Similarly, Word 2013 may struggle to accurately interpret a document created by Word 98. This issue arises much sooner with digital documents due to the rapid pace of software evolution, and may manifest as anything from subtle layout changes through to entire elements (e.g., graphics) being omitted. Thus, when we open an old document with modern software \emph{we cannot guarantee that it truly represents the original in both appearance and content}. This is a significant issue with historical or cultural material \cite{Kolowich.S-2009a-Archiving}, and is extremely dangerous in a forensic or legal context, where the ability to accurately reproduce a document in its original form may be crucial \cite{Gillespie.J-2004a-Coping}. Imagine, for example, that the Treaty of Waitangi or the Declaration of Independence had originally been created in digital form but, when opened 20 years later, important parts were either not displayed at all or were formatted differently. This could completely change the meaning of the document.

It is therefore essential from a preservation perspective to open a digital document with the same version of the software that was used to create it, or the nearest possible version. Unfortunately, identifying the correct version is not always a simple task for older digital documents. There are several resources for identifying document formats and extracting useful metadata from them, including the Unix \texttt{file} tool, \citeasnoun{JHOVE-2009a}, DROID \cite{Brown.A-2006a-Automatic}, the UK National Archives’ PRONOM database \cite{Brown.A-2006a-PRONOM}, and the \possessivecite{NatLib.NZ-2007a-Metadata} Metadata Extraction Tool. Most of these use well-known patterns or “signatures” specific to particular document formats. Signature-based methods can typically identify at least the broad class of document format (e.g., Microsoft Word), and can sometimes be more specific (e.g., Word 6/95 vs.\ Word 97–2003). They cannot, however, identify the specific software sub-version used (e.g., Word 95 version 1.1, or even Word 6 vs.\ Word 95), except in very limited cases. This is because the key differences across software versions are more likely to relate to the functionality offered (e.g., a new version of a word processor might add “tables”) than to the document format, which may be the same across several different software versions.

Using features or characteristics that indicate specific functionality to identify the range of possible software versions is a classification problem that may be amenable to automated machine learning. Machine learning techniques have previously been used in digital forensics to identify document formats, but only in the contexts of more reliably identifying the \emph{general type} of a document \cite{Mokhov.S-2008a-File} rather than which specific software version created it, and identifying the format of file \emph{fragments} rather than complete documents \cite{Li.Q-2010a-SVM,Roussev.V-2009a-File}. We therefore set out to determine whether it was possible to apply similar techniques to Microsoft Word documents in “.doc” format. Unfortunately, this proved to be a tougher problem than we initially expected. While we were able to extract somewhat more version information from these documents than is possible with existing tools, we were unable to identify useful features for machine learning due to the complexity of the file format.


\section{Method}


\section{Results}


\section{Conclusion and further work}


\bibliographystyle{dcu}
\bibliography{DP}

\end{document}