Wednesday, May 04, 2005

Microsoft Word Save as HTML sucks

A lot of tools do a terrible job of converting to HTML. I'm picking on Word because it is popular (and on the outside chance that Scoble or someone like him sees this and asks the Office team to fix it).

HTML itself is simple. Want to make a paragraph? Open a paragraph tag, <p>, type your paragraph text, then close the tag with </p>. That's the essence of HTML. If you want to make it fancier, you can add a style sheet. This isn't good enough for Word, though. I opened the first Word doc I could find on my drive and chose Save as Web Page. A typical paragraph looks like this:
<p class=MsoNormal align=center style='text-align:center'>

Well, first there is a reference to a class, MsoNormal, which sets a general margin of zero and a bottom margin of 0.0001pt (I guess zero isn't good enough), and sets the default font to Times New Roman at 12 point. Why is this so bad? First, the creator of the Word document didn't intend to use any special font formatting - they wanted the default. Word should produce an unformatted paragraph in this case: <p>.

Second, the above is just one of many heinous examples in the document in question. It's so bad that tools like Dreamweaver and HTML Tidy have commands to clean up Word HTML. Here is what Dreamweaver MX 2004 reports after running the Clean Up Word HTML command:

Cleanup Word HTML Results:
1 metatag removed
166 empty paragraphs removed
103 margin defines removed
78 instances of unneeded inline CSS removed
4 instances of unused CSS style definitions removed
Source Formatting applied

I think the results speak for themselves, and even after this cleanup there is still a lot of junk left over.

Last, why doesn't Microsoft make Word more useful? For example, instead of only having a word processor for letters, it would be nice to have a word processor for different environments (letters, email, blog posts, HTML documents, etc.). Then it would be really nice if Word could create clean output for each type of writing. It might even make the next version of MS Office worth buying. It would certainly be a better feature than the animated paper clip.

In the meantime, here is a regex to help combat the problem. Run from a Unix command prompt, it strips out all HTML tag attributes, starting from the current directory and traversing downward:

$ find . -name '*htm*' -exec perl -i.bak -p -e 's#</?([a-zA-Z]+)\s+[^>]*>#<$1>#g' {} \;

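How the command breaks down:
1. find . -name '*htm*' searches the current directory and everything below it for HTML documents.
2. -exec runs a Perl one-liner on each file found; {} stands for each file name from the find command and \; ends the command.
3. -i.bak edits each file in place and keeps a backup as myfile.html.bak; -p loops over the file one line at a time and prints each line after the substitution; -e supplies the code to run.
4. The regex searches for any alpha characters inside <> that are followed by whitespace and attributes, and replaces the whole tag with just those alpha characters inside <>.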

In a tool like jEdit, you can put this in the find field: </?([a-zA-Z]+)\s+[^>]*>

And this in the replace field: <$1>

At least you are on your way to having clean HTML now.
The only problem is making it match HTML tags that span multiple lines. Hopefully someone else can advise me on how to do that. I thought adding ms to the end of 's#</?([a-zA-Z]+)\s+[^>]*>#<$1>#g' would do it, but no dice.
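One likely reason the m and s flags make no difference: perl -p hands the regex one line at a time, so a tag that wraps onto the next line is never seen whole, whatever the flags. Perl's -0777 switch reads each file in a single gulp instead. The same idea can be sketched in Python, assuming you don't mind a script in place of a one-liner. The function names here are my own, and two things differ from the regex above on purpose: the tag name may contain digits (so h1 is handled), and the slash of a closing tag is kept. Like the original, it strips every attribute, href and src included.

```python
import re
from pathlib import Path

# A tag name followed by whitespace and anything up to the closing '>'.
# [^>] also matches newlines, so tags that wrap across lines are caught
# as long as the whole file is searched at once.
TAG_WITH_ATTRS = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)\s+[^>]*>")


def strip_attributes(html: str) -> str:
    """Return html with the attributes removed from every tag."""
    return TAG_WITH_ATTRS.sub(r"<\1\2>", html)


def clean_tree(root: str = ".") -> None:
    """Clean every .htm/.html file under root, keeping a .bak copy."""
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix.lower() not in {".htm", ".html"}:
            continue
        # latin-1 maps every byte to a character and back again, so the
        # file round-trips unchanged whatever encoding Word actually used.
        original = path.read_text(encoding="latin-1")
        cleaned = strip_attributes(original)
        if cleaned != original:
            backup = path.with_name(path.name + ".bak")
            backup.write_text(original, encoding="latin-1")
            path.write_text(cleaned, encoding="latin-1")
```

Calling clean_tree(".") from the top of your site does roughly what the find command does, backups included. Try it on a copy first.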


Sunday, May 01, 2005

QuickTime 7 a Watershed for Digital Media

There has been some hype from Apple about QuickTime 7 and the new H.264 codec. Well, the proof is in the pudding, and IMHO the hype is justified. If you can, download and install QT 7 (caution: requires purchasing a new Pro key if you have QT Pro 6.x) and view the HD wildlife video. It is breathtaking: the quality you would expect to see on showroom hi-def TVs at Best Buy, compressed so much that it works as a downloadable web video. This is not a progressive download, so you have to wait for the whole thing to download.

The video mentioned above is formatted like so:
Audio: AAC, Stereo (L R), 44.100 kHz
Video: H.264, 960 x 540, Millions
Frames Per Second: 30
Data Rate: 2126.73 kbits/sec
File size: 35.73 MB

Amazing stuff. The bad news is that my 700 MHz G4 eMac and my Dad's 800 MHz G4 iMac can't play the video mentioned above without dropping frames. Both have 640 MB of RAM.

What I found just as exciting is that the H.264 and AAC codecs are available via iMovie (and presumably Final Cut Pro and Express)! I wanted to see what I could do with an H.264 encoded video. I had a 34.5 minute video of my great uncle recounting his World War II experiences as a doctor in India.

After a lot of trial and error, I compressed it with these settings:
Audio: AAC, Mono, 32.000 kHz
Video: H.264, 320 x 240, Millions
Frames Per Second: 30
Data Rate: 110.86 kbits/sec
File size: 27.6 MB!
Quality very good - sharp looking.

Under one MB per minute at 30 frames per second and 320 x 240, which is about half the dimensions of standard broadcast video! Compare this to what was previously available in QuickTime and you'll see why this is so impressive:
See: - this is not a progressive download so you have to wait for the whole thing to download.

A previous video I did in QuickTime was much larger and lower quality:
File size: 112.3 MB
Length: 23 minutes
Video: MPEG-4, 320 x 240, Millions
Frames Per Second: 24
Audio: MPEG-4, Mono, 22.050 kHz

So the H.264 video is longer, higher quality, and about 75% smaller in file size!
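Since the two videos are different lengths, a per-minute comparison is arguably fairer than comparing total file size. Here is a quick back-of-the-envelope check in Python, using only the sizes and running times quoted above (the variable names are just mine):

```python
def mb_per_minute(size_mb: float, minutes: float) -> float:
    """Average file size per minute of video."""
    return size_mb / minutes


# Figures quoted in this post.
h264_size_mb, h264_minutes = 27.6, 34.5
mpeg4_size_mb, mpeg4_minutes = 112.3, 23

h264_rate = mb_per_minute(h264_size_mb, h264_minutes)
mpeg4_rate = mb_per_minute(mpeg4_size_mb, mpeg4_minutes)

# Reduction in total file size, ignoring the difference in length.
total_reduction = 1 - h264_size_mb / mpeg4_size_mb
# Reduction per minute of video, which accounts for the length.
per_minute_reduction = 1 - h264_rate / mpeg4_rate

print(f"H.264:  {h264_rate:.2f} MB per minute")
print(f"MPEG-4: {mpeg4_rate:.2f} MB per minute")
print(f"Total file size reduction: {total_reduction:.1%}")
print(f"Per-minute reduction:      {per_minute_reduction:.1%}")
```

By this measure the H.264 version comes out at 0.80 MB per minute against 4.88 MB per minute for the older MPEG-4 video, so the per-minute saving (about 84%) is even bigger than the 75% difference in total file size.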

The bad news: it takes some serious CPU power. When I tried to compress this movie at 640 x 480, it first said it would take 54 minutes, and the estimate kept increasing. After about an hour of processing, it said this 34.5 minute video would take more than 7 hours to complete, and every time I checked, the time had gone up again. After several attempts I had to choose single-pass encoding and a 320 x 240 size, and then iMovie was able to process it in about an hour or so on my 867 MHz G4 12 inch PowerBook.

I'll try a 640 x 480 version later if I can and update this post with a link.

I'm not sure how QuickTime 7 compares to the current offerings from Real and Windows Media. If you do, please leave a comment.

One more comment: one has to think that the ability to move video of this quality over the Web has implications for the TV/movie industry.
