Microsoft Word Save as HTML sucks
In general a lot of tools do a terrible job of converting to html. I'm picking on Word because it is popular (and on the off chance that Scoble or someone like him sees this and asks the Office team to fix it).
In general html is simple. Want to make a paragraph? OK, open a paragraph tag <p>, type your paragraph text, then close the paragraph tag </p>. That's the essence of html. If you want to make it fancier you can add a style sheet. This isn't good enough for Word though. I opened up the first Word doc I could find on my drive and did Save as Web Page. A typical paragraph looks like this:
<p class=MsoNormal align=center style='text-align:center'>
Well, first there is a reference to a class MsoNormal, which uses a general margin of zero, a bottom margin of 0.0001pt (I guess zero isn't good enough), and sets the default font to Times New Roman and the font-size to 12 point. Why is this so bad? First, the creator of the Word document didn't intend to use any special font formatting - they wanted the default. Word should produce an unformatted paragraph in this case: <p>.
Second, the above is just one of many heinous examples in the document in question. It's so bad that tools like Dreamweaver and HTML Tidy have commands to clean up Word html. Here is what Dreamweaver MX 2004 reports after running its clean up Word html command:
Cleanup Word HTML Results:
1 metatag removed
166 empty paragraphs removed
103 margin defines removed
78 instances of unneeded inline CSS removed
4 instances of unused CSS style definitions removed
Source Formatting applied
I think the results speak for themselves, and even after this cleanup there is still a lot of junk left over.
Last, why doesn't Microsoft make Word more useful? For example, instead of only having a word processor for letters, it would be nice to have a word processor for different environments (letters, email, blog post, html document, etc.). Then it would be really nice if Word could create clean output for each type of writing. It might even make the next version of MS Office worth buying. It would certainly be a better feature than the animated paper clip.
In the meantime, here is a regex to help combat the problem. Run from a unix command prompt, it strips out all html tag attributes, starting from the current directory and traversing downward:
$ find . -name '*htm*' -exec perl -i.bak -p -e 's#</?([a-zA-Z]+)\s+[^>]*>#<$1>#g' {} \;
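Reading the command from left to right:
1. find . -name '*htm*' starts in the current directory and works downward, matching every file with htm in its name
2. -exec ... {} \; runs a perl one liner once per file found - {} stands for each file from the find command
3. -i.bak edits the file in place and backs up the original as myfile.html.bak, -p runs the code on every line and prints the result, -e gives perl the code on the command line
4. the regex searches for alpha characters inside <> that are followed by a space and anything else up to the closing >, and replaces the whole tag with just those alpha characters inside of <>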
In a tool like jEdit (www.jedit.org), you can put this in the find field: </?([a-zA-Z]+)\s+[^>]*>
And this in the replace field: <$1>
At least you are on your way to having clean html now.
The only problem is making it match html tags that span multiple lines. Hopefully someone else can advise me on how to do that. I thought adding ms to the end of 's#</?([a-zA-Z]+)\s+[^>]*>#<$1>#g' would do it, but no dice.