02 June 2011

HTML5ify your existing code base

Do you love HTML5’s simplified syntax for markup?

Is your old code base littered with long doctypes and verbose tag attributes?

Are there too many files to edit by hand?

Well, don’t rage about it! Use these handy-dandy shell scripts to convert your old files in a jiffy:

# doctype
find . -regex ".*\.\(html\|py\)$" -type f -exec sed -i "s/<\!DOCTYPE\s\+html[^>]*>/<\!DOCTYPE html>/gi" {} \;

# meta charset
find . -regex ".*\.\(html\|py\)$" -type f -exec sed -i "s/<meta[^>]*content=[\"'][^\"']*utf-8[\"'][^>]*>/<meta charset=\"utf-8\">/gi" {} \;

# script text/javascript
find . -regex ".*\.\(html\|py\)$" -type f -exec sed -i "s/\(<script[^>]*\)\(\stype=[\"']text\/javascript[\"']\)\(\s\?[^>]*>\)/\1\3/gi" {} \;

# style text/css
find . -regex ".*\.\(html\|py\)$" -type f -exec sed -i "s/\(<style[^>]*\)\(\stype=[\"']text\/css[\"']\)\(\s\?[^>]*>\)/\1\3/gi" {} \;

# html xmlns
find . -regex ".*\.\(html\|py\)$" -type f -exec sed -i "s/\(<html[^>]*\)\(\sxmlns=[\"'][^\"']*[\"']\)\(\s\?[^>]*>\)/\1\3/gi" {} \;

# html xml:lang
find . -regex ".*\.\(html\|py\)$" -type f -exec sed -i "s/\(<html[^>]*\)\(\sxml:lang=[\"'][^\"']*[\"']\)\(\s\?[^>]*>\)/\1\3/gi" {} \;

What to expect

Here are examples of HTML5 simplifications that the above scripts will make:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!DOCTYPE html>

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<html lang="en">

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta charset="utf-8">

<script type="text/javascript">...</script>
<script>...</script>

<script type="text/javascript" src="foo.js">...</script>
<script src="foo.js">...</script>

<style type="text/css">...</style>
<style>...</style>

HTML5 is new, but these syntax changes are supported by virtually all browsers; even ornery old IE6 will accept them. The scripts will not strip out any other attributes from your tags (e.g., a class or id on the html tag), and they’re idempotent: you can safely run them as many times as you want on the same files.
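Here is a small way to check those two claims on a throwaway file. It is a sketch that assumes GNU sed; the html5ify helper name, the temp file, and the extra class attribute are made up for the experiment and are not part of the scripts above.

```shell
# Throwaway sample file (the path comes from mktemp, not from the post).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" class="no-js">
<script type="text/javascript" src="foo.js"></script>
EOF

# Hypothetical helper: applies four of the expressions above to a single file.
html5ify() {
  sed -i \
    -e 's/<!DOCTYPE\s\+html[^>]*>/<!DOCTYPE html>/gi' \
    -e "s/\(<script[^>]*\)\(\stype=[\"']text\/javascript[\"']\)\(\s\?[^>]*>\)/\1\3/gi" \
    -e "s/\(<html[^>]*\)\(\sxmlns=[\"'][^\"']*[\"']\)\(\s\?[^>]*>\)/\1\3/gi" \
    -e "s/\(<html[^>]*\)\(\sxml:lang=[\"'][^\"']*[\"']\)\(\s\?[^>]*>\)/\1\3/gi" \
    "$1"
}

# Run it twice and keep a copy of the file contents after each pass.
html5ify "$sample"
first_pass=$(cat "$sample")
html5ify "$sample"
second_pass=$(cat "$sample")

echo "$first_pass"
[ "$first_pass" = "$second_pass" ] && echo "second run changed nothing"
```

The class attribute on the html tag should survive, and the second pass should leave the file exactly as the first pass left it.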

Notes & Disclaimers

I’m aware that these scripts don’t catch tags that span multiple lines (the most common case is a doctype with a line break in it). I’m leaving this case as an exercise for the reader.
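One possible approach to that exercise, offered as a sketch rather than something the scripts above do: newer GNU sed releases have a -z option that splits input on NUL bytes instead of newlines, so a normal text file is handled as one long record and [^>]* can cross a line break. The temp file below is just for illustration.

```shell
# Sample file with a doctype broken across two lines.
multi=$(mktemp)
printf '%s\n' \
  '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"' \
  '    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">' \
  '<html>' > "$multi"

# -z (GNU sed only): the whole file is one record, so the match may span lines.
sed -z -i 's/<!DOCTYPE\s\+html[^>]*>/<!DOCTYPE html>/gi' "$multi"
cat "$multi"
```

Because the whole file becomes a single record, a tag that is missing its closing > can swallow far more text than intended, so a diff review matters even more with this variant.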

I only work with HTML and Python files. If necessary, update the \(html\|py\) parts to include the extensions of any other file types in your project that may contain HTML.
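As a sketch of that change, assuming a hypothetical project that also keeps HTML in .php and .erb templates, the extension list can live in one variable. The list_targets name is invented here; it only lists files, so you can look at the matches before adding any -exec sed part.

```shell
# Hypothetical extension list: the original html and py, plus php and erb.
exts='html\|py\|php\|erb'

# List the files a run would touch, starting from the given directory (default: .).
list_targets() {
  find "${1:-.}" -regex ".*\.\($exts\)$" -type f
}
```

Running list_targets . from the project root prints the candidate files, one per line.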

This was written for GNU find and sed. They ship with most Linux systems, but not Mac OS X (which has the BSD variants). If you’re on a Mac, you can install them using Homebrew (recommended) or whatever other package manager you’re running.
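A rough sketch of how a script could pick the right tools on either system. It assumes the Homebrew formulae findutils and gnu-sed, which install the GNU versions under the names gfind and gsed; the FIND and SED variable names are only for this example.

```shell
# Prefer Homebrew's g-prefixed GNU tools when present, else fall back to find and sed.
FIND=$(command -v gfind || command -v find)
SED=$(command -v gsed || command -v sed)

# Rough check: GNU sed answers --version, while BSD sed rejects the flag.
if "$SED" --version >/dev/null 2>&1; then
  echo "using GNU sed at $SED"
else
  echo "$SED does not look like GNU sed" >&2
fi
```

The one-liners can then call "$FIND" and "$SED" in place of find and sed.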

Finally, make sure to review changes that are made with these scripts. I can’t guarantee they will work perfectly on your personal setup, but I did safely run them a few months ago on both mrcoles.com and hunch.com. Let me know if you find any issues or want to offer some improvements.
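If the files are not under revision control, one way to review changes is to preview them before using -i. This sketch pipes sed's output into diff for the doctype expression only; the preview_doctype name is hypothetical.

```shell
# Hypothetical helper: show what the doctype expression would change in one file,
# as a unified diff, without writing anything back to the file.
preview_doctype() {
  sed 's/<!DOCTYPE\s\+html[^>]*>/<!DOCTYPE html>/gi' "$1" | diff -u "$1" -
}
```

Call it as preview_doctype path/to/file.html. diff exits with status 1 when it finds differences, which is expected here and not an error.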

Update: thanks Brent for finding an issue with extra quotes within template logic in the html tag attributes.
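For the curious, here is a sketch of a more tolerant xml:lang expression. It is not what the scripts above use, and it has only been reasoned through for double-quoted values that contain whole {{ ... }} blocks, like the Django line in the comments (shown here without its xmlns attribute).

```shell
# Django-style html tag with quotes nested inside the template expressions.
line='<html lang="{{ LANGUAGE_CODE|default:"en-us" }}" xml:lang="{{ LANGUAGE_CODE|default:"en-us" }}" {% if LANGUAGE_BIDI %}dir="rtl"{% endif %}>'

# Attribute value = any mix of whole {{ ... }} blocks and characters that are
# neither a double quote nor an opening brace. Double-quoted values only.
fixed=$(printf '%s\n' "$line" |
  sed 's/\(<html[^>]*\)\sxml:lang="\({{[^}]*}}\|[^"{]\)*"\(\s\?[^>]*>\)/\1\3/gi')
echo "$fixed"
```

Other template syntaxes, single-quoted values, or a lone { inside a value are not handled, so a diff review is still the safety net.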

That’s it. This post doesn’t cover the new markup tags (section, header, footer, etc.) or any of the more advanced features of HTML5; it’s just a quick way to clean up your old code with some snazzy HTML5ification! Enjoy!

*images jacked from reddit fffffffuuuuuuuuuuuu

Comments (13)

1. Andrew Pennebaker wrote:

That's extremely useful. Thanks!

Posted on 7 June 2011 at 6:06 PM  |  permalink

2. Fred wrote:

You don't need to have quotes in html5 so <meta charset="utf-8"> can read <meta charset=utf-8>

Posted on 8 June 2011 at 12:06 AM  |  permalink

3. Ioannis Cherouvim wrote:

Great.

In your example you have a typo: javscript

Posted on 8 June 2011 at 4:06 AM  |  permalink

4. peter wrote:

@fred, you’re absolutely right. The quotes are optional.

@Ioannis, thanks!

Posted on 8 June 2011 at 10:06 AM  |  permalink

5. Schalk wrote:

IE will break

Posted on 8 June 2011 at 10:06 AM  |  permalink

6. peter wrote:

@schalk — no, it won’t. These simplifications are all backwards compatible, even the doctype change.

Posted on 8 June 2011 at 11:06 AM  |  permalink

7. Tamas wrote:

The code highlighter you are using, it is evil. I made the mistake of being curious about the end of the lines, so I made an attempt to grab the horizontal scrollbar under the code box. Poof, whole box expanded, making the scroll bar disappear just to reappear for the whole browser window (as the expanded box did not fit in the window). Which of course disappeared as soon as I left the code box, so I couldn't reach that one either using the mouse. The only way I found to view the end of the lines is to move the mouse inside the code box, and then scroll with the cursor keys. Annoying.

Posted on 8 June 2011 at 1:06 PM  |  permalink

8. peter wrote:

Hey Tamas, sorry about that, I need to fix up the expander. I wrote it real quick a while ago and forgot to fix it up for the few browsers that have issues with it. What browser/OS are you using—are you by chance that guy from reddit who’s using Chromium 11 on Arch Linux? http://www.reddit.com/r/programming/comments/httlv/html5ify_your_existing_code_base/c1yb9uo

I’m busy at work right now… so, I’ll just remove the js file that does the expanding for now…

Posted on 8 June 2011 at 3:06 PM  |  permalink

9. Tamas wrote:

No, it wasn't me on reddit. I tried it with both Firefox 3.6 and Chromium 10 xor 11 on Ubuntu 10.04. And I was using a monitor in portrait mode, so the window width was quite small.

Posted on 8 June 2011 at 6:06 PM  |  permalink

10. peter wrote:

Got it, thanks for the followup, now I understand what was going on. I’ll put in something better in the future that doesn’t have this problem.

Posted on 8 June 2011 at 8:06 PM  |  permalink

11. Jose wrote:

This is great! Thanks for sharing.

Posted on 27 June 2011 at 12:06 AM  |  permalink

12. Brent Tubbs wrote:

Seems to get a bit confused with this line from the Django admin base template:

<html xmlns="http://www.w3.org/1999/xhtml" lang="{{ LANGUAGE_CODE|default:"en-us" }}" xml:lang="{{ LANGUAGE_CODE|default:"en-us" }}" {% if LANGUAGE_BIDI %}dir="rtl"{% endif %}>

It turned it into this:

<html lang="{{ LANGUAGE_CODE|default:"en-us" }}"en-us" }}" {% if LANGUAGE_BIDI %}dir="rtl"{% endif %}>

Posted on 27 June 2011 at 1:06 PM  |  permalink

13. peter wrote:

Thanks for sharing this bug, Brent! The regex looks for the next quote after lang=" and ends up messing it up for this particular Django template syntax. I’m going to leave the scripts as is since I don’t have the desire to write and test a more complicated regex to solve this problem.

Fortunately, the html tag should—in practice—be the only tag for which this templating bug will show up with my scripts and there should rarely be more than one html tag in any project that uses templates. So, in addition to quickly checking everything with a git or svn diff (assuming everyone uses revision control), it wouldn’t hurt to run something like: svn diff | grep '<html' to check for any needed manual adjustments. You clearly already did something like this :)

Also, it’s probably a good idea to only run this on your own code base, not on other projects like Django, since you should just update those when they release newer versions.

Posted on 27 June 2011 at 5:06 PM  |  permalink

Peter Coles

is a software engineer who lives in NYC, worked at Hunch/eBayNYC, and blogs here.