Export from OpenOffice to clean HTML

OpenOffice’s HTML export produces very crufty HTML because it tries to make the output look as much like the original document as possible. Most of the time, I just want clean HTML. Here’s one way to get it:

  1. Export your OpenOffice document to HTML (I used the XHTML strict option)
  2. Install Ruby and the Sanitize gem (gem install sanitize)
  3. Download this handy script
  4. Run like so:
    ruby sanitize_oo_html.rb < unwashed.html > pretty.html

The script’s custom Sanitize filter is very simple, so it may not meet your needs. If it doesn’t, feel free to tweak it. The Sanitize docs should help with that.

