xah lee @ 2008-12-28 18:59:00
Entry tags: html, validator, w3c, web dev, xml
HTML Correctness and Validators
• http://xahlee.org/js/html_correctness.h
Xah Lee, 2008-12-28
Some notes about html correctness and html validators.
Condition Of Website Correctness
My website “xahlee.org” has close to 4000 html files. All are valid html files. “Valid” here means passing the w3c's validator at http://validator.w3.org/. Being a programming and correctness nerd, i consider correct html important. (Correct markup has important, practical benefits, such as machine parsing and transformation, as picked up by the XML movement. Ultimately, it is a foundation of the semantic web.)
In programming lang communities, the programmer tech geekers are fanatical about their fav lang's superiority, and in the case of functional langs, they are often proud of their correctness features. However, look at their official docs or websites: they are ALL invalid html, with errors in just about every 10 lines of html source code. It is fucking ridiculous.
In the web development geeker communities, you can see how they are tight-assed about correct use of HTML/CSS, etc, with frequent and heated debates about the propriety of semantic markup, and they don't hesitate to ridicule the Microsoft Internet Explorer browser, or the average HTML content producer. However, look at the html they produce: almost none of it is valid either.
Bad html also appears in the vast majority of docs produced by standards organizations, such as the Unicode Consortium and the IETF. For example, if you run w3c's validator on the IETF's home page, there are 32 errors, including “no doctype found”, and if you validate unicode's http://www.unicode.org/faq/utf_bom.h
In about 2006, i spent a few hours researching which major websites produce valid html. To this date, I know of only one major site that produces valid html, and that is Wikipedia. This is fantastic. Wikipedia is produced by the MediaWiki engine, written in PHP. Many other wiki sites also run MediaWiki, so they are undoubtedly also valid. As far as i know, a few other wiki or forum software packages also produce valid html, though they are more the exception than the norm. (i did check 7 random pages from “w3.org”; it looks like they are all valid today.)
Personal Need For Validator
My personal need is to validate typically hundreds of files on my local drive. Every month or so, i do a systematic regex find-replace operation on a dir. This often results in over a hundred changed files. Every now and then, i improve my css or html markup semantics site-wide, so the find-replace is on all 4000 files. Usually the find-replace is carefully crafted with attention to correctness, or done in emacs interactively, so the chance of a regex screwup is minimal, but i still wish to validate by batch after the operation.
Batch validation is useful because, if i screw up my regex, the result is usually badly formed html, which html validation can catch.
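The find-replace step described above can be sketched roughly as follows. This is only an illustration in Python, not the emacs or perl workflow actually used for the site; the function name `batch_replace` and its parameters are made up for this example. The point is that the operation returns the list of changed files, which is exactly the list that needs validating afterwards. Files are opened with `newline=""` so that existing line endings are left untouched.

```python
import re
from pathlib import Path

def batch_replace(root, pattern, replacement, suffix=".html"):
    """Apply one regex find-replace to every matching file under root.

    Returns the list of paths whose content actually changed.
    (Hypothetical helper, for illustration only.)
    """
    regex = re.compile(pattern)
    changed = []
    for path in sorted(Path(root).rglob("*" + suffix)):
        if not path.is_file():
            continue
        # newline="" keeps the file's own line endings as they are
        with open(path, encoding="utf-8", newline="") as f:
            old = f.read()
        new = regex.sub(replacement, old)
        if new != old:
            with open(path, "w", encoding="utf-8", newline="") as f:
                f.write(new)
            changed.append(path)
    return changed
```

A call such as `batch_replace("some/dir", r"<b>(.*?)</b>", r"<strong>\1</strong>")` rewrites the files in place, so it should only be run on a dir that is backed up or under version control.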
In 2008, i converted most of my site from html 4 transitional to html 4 strict. The process is quite a manual pain, even though the files i started with were valid.
Here are some examples. In html4strict:
* “‹br›” must be inside block level tags.
* Image tag “‹img ...›” needs to be enclosed in a block level tag such as “‹div›”.
* Content inside blockquote must be wrapped with a block level tag. e.g. “‹blockquote›Time Flies‹/blockquote›” would be invalid in html4strict; you must have “‹blockquote›‹p›Time Flies‹/p›‹/blockquote›”.
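The rules above can be turned into a rough pre-check. The sketch below uses Python's standard `html.parser` to flag an “img” or “br”, or bare text, sitting directly inside “body” or “blockquote”. It is not a validator and does not read the DTD: the class name `StrictHints`, the function `strict_hints`, and the tag sets are assumptions made for this example, and html with omitted end tags (such as an unclosed “p”) will confuse its simple tag stack. The code uses real angle brackets, not the “‹ ›” used in the text here.

```python
from html.parser import HTMLParser

# Elements with no end tag; they are never pushed on the stack.
VOID = {"img", "br", "hr", "input", "meta", "link", "area", "base", "col", "param"}
# Parents that (per the examples above) must contain block level tags only.
BLOCK_ONLY_PARENTS = {"body", "blockquote"}
# Inline tags this sketch checks for.
INLINE_CHECKED = {"img", "br"}

class StrictHints(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []      # currently open elements
        self.problems = []   # (line number, message) pairs

    def _parent(self):
        return self.stack[-1] if self.stack else None

    def handle_starttag(self, tag, attrs):
        if tag in INLINE_CHECKED and self._parent() in BLOCK_ONLY_PARENTS:
            line, _ = self.getpos()
            self.problems.append((line, f"<{tag}> directly inside <{self._parent()}>"))
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # pop back to the matching open tag
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if data.strip() and self._parent() in BLOCK_ONLY_PARENTS:
            line, _ = self.getpos()
            self.problems.append((line, f"text directly inside <{self._parent()}>"))

def strict_hints(html):
    """Return a list of (line, message) hints for the given html text."""
    parser = StrictHints()
    parser.feed(html)
    parser.close()
    return parser.problems

print(strict_hints("<blockquote>Time Flies</blockquote>"))
print(strict_hints("<blockquote><p>Time Flies</p></blockquote>"))
```

The first call reports the bare text inside the blockquote; the second reports nothing. A real validator is still needed for the final say.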
Let's look at the image tag example. You might think it is trivial to transform, because you can simply use regex to wrap a “‹div›” around image tags. However, it's not that simple. For example, often i have this form:
‹img src="pretty.jpg" alt="pretty girl" width="565" height="809"›
‹p›above: A pretty girl.‹/p›
The “p” tag immediately below an “img” tag functions as the image's caption. I have css set up so that this caption has no gap to the image above it, like this:
img + p {margin-top:0px;width:100%} /* img caption */
I have the “width:100%” because normally “p” has “width:80ex” for normal paragraphs.
Now, if i simply wrap a “div” tag around all my “img” tags, i will end up with this form:
‹div›‹img src="pretty.jpg" alt="pretty girl" width="565" height="809"›‹/div›
‹p›above: A pretty girl.‹/p›
Now this screws up my caption css: the “p” is now a sibling of the “div”, not of the “img”, and there's no way to match a “p” that comes after a “div › img”.
Also, sometimes i have a sequence of images. Wrapping each with a “div” would introduce gaps between them.
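The caption problem can be seen in a small experiment. The Python sketch below does the naive regex wrap and then tests, also by regex, whether any “p” still directly follows an “img”. That regex test is only a crude stand-in for the css rule “img + p”, and both function names are made up for this example. The code uses real angle brackets.

```python
import re

IMG = re.compile(r"<img\b[^>]*>")

def wrap_images_naively(html):
    """Wrap every img tag in a div, with no regard for what follows it."""
    return IMG.sub(lambda m: "<div>" + m.group(0) + "</div>", html)

def caption_rule_matches(html):
    """True if some <p> directly follows an <img> (rough proxy for 'img + p')."""
    return re.search(r"<img\b[^>]*>\s*<p\b", html) is not None

before = ('<img src="pretty.jpg" alt="pretty girl" width="565" height="809">\n'
          '<p>above: A pretty girl.</p>')
after = wrap_images_naively(before)

print(caption_rule_matches(before))  # True
print(caption_rule_matches(after))   # False
```

After the wrap, the closing “div” tag sits between the image and its caption, so the caption style no longer applies.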
This is just a simplified example. In short, converting from html4transitional to html4strict while hoping to retain appearance or markup semantics in practical ways is pretty much a manual pain. (The ultimate reason is that html4transitional is far from being a good semantic markup. (html4strict is a bit better.))
Validators
In my work i need a batch validator. What i want is a command line utility, that can batch validate all files in a dir. Here are some solutions related to html validation.
* The standard validator service by w3c: http://validator.w3.org/ (see also: W3C Markup Validation Service). The problem with this is that it can't validate local files, and can't work in batch. Using it to validate 4000 files thru the network (with the help of a perl script) would not be acceptable, since each job means massive web traffic. (my site is near 754 Mebibytes.)
* FireFox has a “Html Validator” add-on by Marc Gueury. https://addons.mozilla.org/en-US/firefo
* FireFox has a “Web Developer” add-on by Chris Pederick. https://addons.mozilla.org/en-US/firefo
I rely heavily on the above 2 FireFox tools. However, the FireFox tools do not let me do batch validation. Over the years i've searched for batch validation tools. Here's a list:
* HTML Tidy: a batch tool primarily for cleaning up html markup. I didn't find it useful for batch validation purposes, nor for html conversion jobs. It doesn't do well for my html conversion needs because the tool is incapable of retaining your html formatting (i.e. retaining your newline locations). I do a lot of regex based text processing on my html files, so i need to rely on assumptions about where the newlines are in my html files. If i use tidy on my site, that means i have to abandon regex based text processing, and instead have to treat my files using html and dom parsers, which makes most practical text processing needs much more complex and cumbersome.
* A perl module “HTML::Lint”, at http://search.cpan.org/~petdance/HTML-L
* http://htmlhelp.com/tools/validator/off
* OpenJade and OpenSP. http://openjade.sourceforge.net/ Seems a good tool. Haven't looked into it.
* Emacs's nxml mode http://www.thaiopensource.com/nxml-m
One semi solution for batch validation i found is: “Validator S.A.C.”, at http://habilis.net/validator-sac/
Here's the perl script:
# perl
# 2008-06-20
# Validates a given dir's html files recursively.
# Requires the mac os x app Validator-SAC.app at http://habilis.net/validator-sac/
use strict;
use warnings;
use File::Find;

my $dirPath = q(/Users/xah/web/emacs);
my $validator = q(/Applications/Validator-SAC.app/Conten

# called by find() for each entry; $_ is the file name, $File::Find::name the full path
sub wanted {
  if ($_ =~ m{\.html$} && !-d $File::Find::name) {
    # keep only the status header from the validator's output
    my $output = qx{$validator "$File::Find::name" | head -n 11 | grep 'X-W3C-Validator-Status:'};
    if ($output ne qq(X-W3C-Validator-Status: Valid\n)) {
      print q(Problem: ), $File::Find::name, "\n";
    } else {
      print qq(Good: $_), "\n";
    }
  }
}

find(\&wanted, $dirPath);
print q(Done.), "\n";
However, for some reason, “Validator S.A.C.” took nearly 2 seconds to check each file. In contrast, the FireFox html validator add-on took a fraction of a second while also rendering the whole page completely. For example, suppose i have 20 files in a dir i need to validate. It's faster if i just open all of them in FireFox and eyeball the validity indicator than to run “Validator SAC” on them.
I wrote to its author Chuck Houpt about this. It seems that the validator uses Perl and loads about 20 heavy duty web related perl modules to do its job, and overall is wrapped as a Common Gateway Interface (CGI) program. Perhaps there is a way to avoid these wrappers and call the parser or validator directly.
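At nearly 2 seconds per file, a full pass over 4000 files comes to roughly 8000 seconds, which is over 2 hours. One partial workaround, which does nothing about the per-file startup cost itself, is to run several validator processes at once. The Python sketch below is only that, a sketch: `validator_cmd` is a placeholder for whatever command line launches the validator, the function names are made up, and whether it helps at all depends on the machine having idle cores. The only thing taken from the perl script above is the “X-W3C-Validator-Status:” header line.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

STATUS_PREFIX = "X-W3C-Validator-Status:"

def status_of(output):
    """Extract the status value from validator output, or None if absent."""
    for line in output.splitlines():
        if line.startswith(STATUS_PREFIX):
            return line[len(STATUS_PREFIX):].strip()
    return None

def check_file(validator_cmd, path):
    """Run one validator process on one file and return (path, status)."""
    result = subprocess.run(list(validator_cmd) + [str(path)],
                            capture_output=True, text=True)
    return path, status_of(result.stdout)

def check_all(validator_cmd, paths, workers=4):
    """Validate many files, keeping up to `workers` validator processes running."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as paths
        return list(pool.map(lambda p: check_file(validator_cmd, p), paths))
```

Each result is a pair of the path and its status string, so files whose status is anything other than “Valid” are the ones to look at.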
I'm still looking for a fast, batch, html validation tool.