Saturday, December 08, 2012

Converting Microsoft Word Docs to XML and Processing with PHP's SimpleXML

In this post, I'll describe how to do automated processing on a set of Microsoft Word documents. The task is to compare an old version of a document to a new version section by section, and generate an output table listing sections that were modified, added, deleted, or stayed the same.

You can do this sort of thing manually in Word by opening the document and telling Word to compare it with the older version. Word will show you the diff marks, and you could then scroll through the document and record each section number and whether it changed. That is labor intensive and tedious. In this case there are dozens of pairs of documents to compare, and the whole process has to be repeated every month as new versions of each document are created. Clearly the manual approach would not work; this has to be automated.

One way to go about this is with Word macros, which can be programmed in Visual Basic. But that still requires you to sit at a PC, launch Word, and run the macro. A better approach, one that lets you completely automate this on a Linux server, is to convert the documents to XML and process them with PHP.

Converting a Word doc to XML is pretty easy if you know about AbiWord, an open source word processor for Linux. AbiWord provides a full-fledged GUI, but it can also be run from the command line to do format conversions, including Word to XML and Word to HTML. Note that there are some very complex Word documents that AbiWord can't quite handle, and it will give you an error message to let you know, but for most documents it works just fine.

Here's what I did:

On my Ubuntu machine I installed abiword:
 sudo apt-get install abiword

Then from the command line I converted a document like this:
 abiword --to=xml WordExample.doc

The result is a file called WordExample.xml, which can be easily parsed by PHP's SimpleXML library.
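
With dozens of documents to convert every month, the conversion step is worth scripting too. Here's a minimal sketch of how that might look, assuming all the .doc files sit in one directory and that AbiWord writes the .xml file next to the source document as described above. The names abiwordCommand() and convertAll() are my own illustrative helpers, not part of AbiWord or PHP.

```php
<?php
// Build the shell command for one document, using the same
// --to=xml option as the command shown above.
function abiwordCommand($docPath) {
  return "abiword --to=xml " . escapeshellarg($docPath);
}

// Convert every .doc file in $dir and return the paths for which
// no .xml file exists afterwards.
// Assumption: a missing .xml file means the conversion failed. A stale
// .xml file left over from an earlier run would hide a failure.
function convertAll($dir) {
  $failed = array();
  $docs = glob(rtrim($dir, "/") . "/*.doc") ?: array();
  foreach ($docs as $docPath) {
    exec(abiwordCommand($docPath));
    $xmlPath = preg_replace('/\.doc$/', '.xml', $docPath);
    if (!file_exists($xmlPath)) {
      $failed[] = $docPath;
    }
  }
  return $failed;
}
```

Calling convertAll("/path/to/docs") from a cron job would then leave one XML file per document, and the returned list tells you which documents need a closer look.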

Once I had all the documents in XML format I used SimpleXML to process them, easily pulling out the parts I needed using xpath queries.  Finally I generated HTML tables to summarize the results.
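
To give a feel for the xpath step, here's a minimal sketch run against a tiny hand-written fragment. The fragment's structure (section elements with a role attribute, and title and para/phrase children) is an assumption inferred from the queries used in the full listing later in this post; real AbiWord output is more verbose. listHeadingSections() is just an illustrative name.

```php
<?php
// Hypothetical, simplified stand-in for an AbiWord-generated XML file.
$xml = <<<XML
<book>
  <section role="Heading 1">
    <title><phrase>1.1 Scope</phrase></title>
    <para><phrase>This document covers</phrase><phrase>the basics.</phrase></para>
  </section>
  <section role="Normal">
    <title><phrase>Not a heading</phrase></title>
  </section>
</book>
XML;

// Return the title and joined paragraph text of every section whose
// role attribute starts with "Heading".
function listHeadingSections($xmlString) {
  $doc = simplexml_load_string($xmlString);
  $result = array();
  foreach ($doc->xpath("//section[starts-with(@role,'Heading')]") as $section) {
    $titles = $section->xpath("title/*");
    $parts = array();
    foreach ($section->xpath("para/phrase") as $phrase) {
      // Casting a SimpleXMLElement to string yields its text content.
      $parts[] = (string)$phrase;
    }
    $result[] = array(
      'title' => isset($titles[0]) ? trim((string)$titles[0]) : "",
      'text'  => implode(' ', $parts),
    );
  }
  return $result;
}

print_r(listHeadingSections($xml));
```

Only the first section is returned, because the second one's role does not start with "Heading".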

Here is an example Word document, and the resulting XML file.

And here's the PHP code that will pull out the headings as well as the text content of each section of the document using SimpleXML and a few xpath queries, and then generate a couple of HTML tables showing the changes in the document headings, as well as the actual text content:

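<?php
// Use $argv[] for running from the Linux command line:
$file1 = $argv[1];
$file2 = $argv[2];
// Get headings and content of both files
// (the call order and variable names below are assumed):
list($headings1,$content1) = getDocSections($file1);
list($headings2,$content2) = getDocSections($file2);
echo "Heading Comparison:\n";
outputStart();
compareSections($headings1,$headings2);
outputEnd();
echo "Content Comparison:\n";
outputStart();
compareSections($content1,$content2);
outputEnd();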

// Get document headings and section content:
function getDocSections($filepath) {
  $x = simplexml_load_file($filepath);
  if ($x === false) {
    die("Could not parse $filepath\n");
  }
  $headings = array();
  $content = array();
  $sections = $x->xpath("//*/section[starts-with(@role,'Heading')]");
  foreach ($sections as $section) {
    $title_a = $section->xpath("title/*/text()");
    if (!empty($title_a[0])) {
      // Split the title into section number and heading text,
      // on a tab if there is one, otherwise on the first space:
      $title = (string)$title_a[0];
      $a = explode("\t",$title,2);
      if (count($a) == 1) {
        $a = explode(" ",$title,2);
      }
      $key = trim($a[0]);
      $headings[$key] = isset($a[1]) ? trim($a[1]) : "";
      // Join the text of all paragraphs in this section:
      $content_a = $section->xpath("para/phrase/text()");
      $content[$key] = trim(implode(' ',$content_a));
    }
  }
  return array($headings,$content);
}

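// Compare two arrays and generate table summarizing differences.
// The output() calls are reconstructed; the type names follow the CSS
// classes in outputStart(), and "same" is an assumed name for unchanged rows.
function compareSections($old,$new) {
  reset($old);
  reset($new);
  // Walk both arrays in step until one of them runs out:
  while (key($old) !== null && key($new) !== null) {
    $ov = current($old);
    $ok = key($old);
    $nv = current($new);
    $nk = key($new);
    if ($ok === $nk) {
      // Keys match.  See if contents do:
      if ($ov === $nv) {
        output($nk,$nv,"same");
      } else {
        output($nk,$nv,"modified",$ok,$ov);
      }
      next($new);
      next($old);
    } else {
      // Keys do not match.
      if (!isset($new[$ok])) {
        // Old key not present in new array, so this is a delete:
        output($ok,$ov,"delete");
        next($old);
      } else {
        // Old key appears later in new array, so the new key is an add:
        output($nk,$nv,"add");
        next($new);
      }
    }
  }
  // Anything left over in the new array was added:
  while (key($new) !== null) {
    output(key($new),current($new),"add");
    next($new);
  }
  // Anything left over in the old array was deleted:
  while (key($old) !== null) {
    output(key($old),current($old),"delete");
    next($old);
  }
}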

function outputStart() {
  echo "<html>\n";
  echo "<style type='text/css'>\n";
  echo "table { border-collapse: collapse; }\n";
  echo "td { padding: 2px; border: 1px solid #999; }\n";
  echo ".add { background: lightgreen; }\n";
  echo ".delete { background: pink; }\n";
  echo ".modified { background: gold; }\n";
  echo "</style>\n";
  echo "<body>\n";
  echo "<table class='diff'>\n";
}

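function output($k,$v,$type,$ok="",$ov="") {
  // $ok and $ov (the old key and value) are accepted but not displayed here.
  // Escape the text so that <, > and & in the document don't break the table:
  echo "<tr class='$type'>\n";
  echo "  <td>".htmlspecialchars((string)$k)."</td><td>".htmlspecialchars((string)$v)."</td>\n";
  echo "</tr>\n";
}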

function outputEnd() {
  echo "</table>\n";
}
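
If the pointer-walking comparison in the listing is hard to follow, the same classification can be sketched more compactly with two passes over the arrays. This is an alternative illustration, not the code I ran; classifySections() is a hypothetical helper, and both arguments are assumed to map section number to text.

```php
<?php
// Classify every section key as "same", "modified", "add" or "delete".
function classifySections($old, $new) {
  $status = array();
  // First pass: everything in the new document.
  foreach ($new as $key => $text) {
    if (!array_key_exists($key, $old)) {
      $status[$key] = "add";
    } elseif ($old[$key] === $text) {
      $status[$key] = "same";
    } else {
      $status[$key] = "modified";
    }
  }
  // Second pass: sections that exist only in the old document.
  foreach ($old as $key => $text) {
    if (!array_key_exists($key, $new)) {
      $status[$key] = "delete";
    }
  }
  return $status;
}

$old = array("1.1" => "Scope", "1.2" => "Terms", "1.3" => "History");
$new = array("1.1" => "Scope", "1.2" => "Definitions", "1.4" => "Contacts");
print_r(classifySections($old, $new));
```

The trade-off is ordering: deleted sections all end up at the bottom of the result, whereas walking both arrays in step keeps deleted rows in their original document position.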
