Dissertation: Advanced ‘diff’ for OpenDocument files

I created the software as the final-year project for my MEng Computer Science degree at the University of Bristol. The project dissertation (thesis) explains the project in detail, including some of the design choices behind it and how it works.

The executive summary/abstract of the dissertation is shown below.

Download the full dissertation (PDF format, 1.45 MB).

Executive summary

Office documents are some of the most widely used general-purpose files in business and home computing environments. Recently there has been a shift away from proprietary binary office document file formats and de facto standards towards publicly designed, open, XML-based standards, in particular the OpenDocument file format.

The problem of document comparison and change detection is an old one; however, existing differencing algorithms do not lend themselves well to office documents. Those aimed at structured documents (such as XML) tend to focus on obtaining a short edit script for efficient storage in a version control system rather than on displaying changes in a way that is clear and meaningful to a human user. The latter is often an important business requirement, for instance when comparing two versions of a contract, but the comparison tools built into word processors and those offered by third-party document comparison applications are not good at fulfilling it.

Here I discuss the OpenDocument file format along with the problem of file comparison and various differencing algorithms that have been published in the last 30 or so years. I explain what makes the problem of change detection in office documents significant and why current document differencing algorithms are unsuitable for it, and I propose a product brief for an office document differencing application.

I describe the structure of an OpenDocument file, discuss the design decisions necessary when developing an OpenDocument differencing algorithm, and present an algorithm to match paragraphs in OpenDocument text files based on an extension to Paul Heckel’s 1978 technique for isolating differences between files [1]. I explain some of the design and implementation issues I faced when producing a graphical OpenDocument differencing application and conclude by demonstrating the application, assessing it against the product brief, and considering ways in which the project could be taken further.
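As a rough illustration of the starting point only, the following Python sketch shows the core of Heckel's basic technique [1] applied to two lists of paragraph strings. It is not the extended algorithm presented in the dissertation. The function name heckel_match and the sample paragraphs are hypothetical, and a paragraph is assumed to be represented by its plain text.

```python
from collections import Counter


def heckel_match(old, new):
    """Match paragraphs of `new` to paragraphs of `old` (basic Heckel passes).

    Returns a dict mapping an index in `new` to the matched index in `old`.
    Illustrative sketch only; names and data layout are assumptions.
    """
    old_count = Counter(old)
    new_count = Counter(new)
    # Position of each paragraph in `old`; only consulted for unique paragraphs.
    old_index = {text: j for j, text in enumerate(old)}

    # Virtual "begin" and "end" paragraphs are treated as already matched.
    new_to_old = {-1: -1, len(new): len(old)}
    old_to_new = {-1: -1, len(old): len(new)}

    # Paragraphs that occur exactly once in each version must be the same one.
    for i, text in enumerate(new):
        if new_count[text] == 1 and old_count[text] == 1:
            j = old_index[text]
            new_to_old[i] = j
            old_to_new[j] = i

    def extend(i, j, step):
        """Link the neighbours of a matched pair if they are identical and free."""
        ni, oj = i + step, j + step
        if (0 <= ni < len(new) and 0 <= oj < len(old)
                and ni not in new_to_old and oj not in old_to_new
                and new[ni] == old[oj]):
            new_to_old[ni] = oj
            old_to_new[oj] = ni

    # Ascending pass: grow matches forwards from each matched paragraph.
    for i in range(-1, len(new)):
        if i in new_to_old:
            extend(i, new_to_old[i], +1)

    # Descending pass: grow matches backwards from each matched paragraph.
    for i in range(len(new), -1, -1):
        if i in new_to_old:
            extend(i, new_to_old[i], -1)

    # Drop the virtual begin/end entries before returning.
    return {i: j for i, j in sorted(new_to_old.items()) if 0 <= i < len(new)}


old_paragraphs = ["Title", "Intro", "A clause", "A clause", "End"]
new_paragraphs = ["Title", "New intro", "A clause", "A clause", "End"]
print(heckel_match(old_paragraphs, new_paragraphs))
# prints {0: 0, 2: 2, 3: 3, 4: 4}
```

In this example the repeated paragraph "A clause" cannot be matched by uniqueness alone; it is picked up by the descending pass, which extends the match from the unique "End" paragraph. The rewritten introduction (index 1) stays unmatched, so a tool built on this idea would report it as changed.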

[1] P Heckel. A technique for isolating differences between files. Communications of the ACM, vol. 21, no. 4, pp. 264-268, 1978.