The Microsoft Office Word Team blog runs down how you can see inside the contents of a Word 2007 file (essentially, renaming it to a .zip extension and then looking inside at the collection of XML files; the details are in the Appendix). The same technique can be used on any Office 2007 file, and could prove useful if a file gets corrupted and you’re trying to extract some key data. It also provides an insight into how Office files are structured, though casually parsing XML is not for the faint of heart.
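
If you'd rather not rename anything, the same peek inside can be done with a few lines of Python. This is only a rough sketch using the standard zipfile module. The helper names list_parts and read_part are made up for illustration, and it assumes the main document lives at word/document.xml and is stored as UTF-8, which is how Word 2007 normally writes it but is not guaranteed for every file.

```python
import zipfile


def list_parts(package):
    """Return the names of the files stored inside an Office 2007 package.

    `package` can be a path to a .docx/.xlsx/.pptx file or a file-like
    object. No renaming to .zip is needed: zipfile looks at the bytes,
    not at the extension.
    """
    with zipfile.ZipFile(package) as archive:
        return archive.namelist()


def read_part(package, part_name="word/document.xml"):
    """Return one part of the package as text.

    Assumption: the part is UTF-8 encoded XML, and the default part name
    is where Word usually puts the main document body.
    """
    with zipfile.ZipFile(package) as archive:
        return archive.read(part_name).decode("utf-8")
```

Calling list_parts("report.docx") on a file of your own (the filename here is just a placeholder) gives you the list of XML files, and read_part("report.docx") hands back the raw XML of the document body, which you can then search for the text you're trying to rescue. A badly corrupted file may not open as a zip at all, in which case zipfile raises BadZipFile.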


  • The interesting thing (well, interesting to me anyway) is that even though .docx is just a file container for the XML, it’s still not very accessible and definitely very difficult for a human to read. On the other hand, .odt (Open Document), while also a container for XML and not exactly light reading, is much easier to follow (you can see screenshots comparing the two in a post I wrote ages ago here –

    I think being human readable is important when it comes to documents being produced and archived with public money (e.g. Government documents), as the easier it is for a human to parse the XML, the easier it is to ensure the information is appropriately stored and accessed into the future. File formats are so fragile and can disappear in a relatively short period of time. Anything that makes information easier to preserve for future generations has to be a good thing.

