Time to resurface a few good comments I made at Tim's place last year:
> if an electronic-trading system receives an XML message for a transaction valued at €2,000,000, and there's a problem with a missing end tag, you do not want the system guessing what the message meant
You [Tim] have used this example, or variations of it, since 1997. I think I can finally express why it irritates me so much: you are conflating "non-draconian error handling" with "non-deterministic error handling". It is true that there are some non-draconian formats which do not define an error handling mechanism, and it is true that this leads to non-interoperable implementations, but it is not true that non-draconian error handling implies "the system has to guess." It is possible to specify a deterministic algorithm for graceful (non-draconian) error handling; this is one of the primary things WHATWG is attempting to do for HTML 5.
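A deterministic, non-draconian error handler is easy to sketch. The following toy parser is my own illustration, not the actual HTML 5 parsing algorithm: it applies one fixed recovery rule for mismatched and missing end tags, and because the rule is fixed, every client that implements it builds the same tree from the same broken input. No guessing required.

```python
def parse(tokens):
    """Toy tree-builder over pre-tokenized input like ['<a>', 'text', '</b>'].
    Recovery rule: an end tag that matches an open element deterministically
    closes everything up to and including it; a stray end tag is ignored;
    elements still open at end of input are closed implicitly."""
    root = ('#root', [])
    stack = [root]
    for tok in tokens:
        if tok.startswith('</'):
            name = tok[2:-1]
            if name in [n for n, _ in stack[1:]]:
                # pop up to and including the matching open element
                while stack[-1][0] != name:
                    stack.pop()
                stack.pop()
            # else: stray end tag, deterministically ignored
        elif tok.startswith('<'):
            node = (tok[1:-1], [])
            stack[-1][1].append(node)
            stack.append(node)
        else:
            stack[-1][1].append(tok)
    return root

# '<b>' is never closed; the rule closes it implicitly, identically everywhere.
tree = parse(['<a>', '<b>', 'hello', '</a>'])
```

Two independent implementations of this rule cannot disagree about the resulting tree, which is the whole point: graceful error handling and deterministic error handling are not opposites.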
If any format (including an as-yet-unspecified format named "XML 2.0") allows the creation of a document that two clients can parse into incompatible representations, and both clients have an equal footing for claiming that their way is correct, then that format has a serious bug. Draconian error handling is one way to solve such a bug, but it is not the only way, and for 10 years you've been using an overly simplistic example that misleadingly claims otherwise.
And, in the same thread but on a different note:
I would posit that, for the vast majority of feed producers, feedvalidator.org *is* RSS (and Atom). People only read the relevant specs when they want to argue that the validator has a false positive (which has happened, and results in a new test) or a false negative (which has also happened, and also results in a new test). Around the time that RFC 4287 was published, Sam rearranged the tests by spec section. This is why specs matter. The validator service lets morons be efficient morons, and the tests behind it let the assholes be efficient assholes. More on this in a minute.
> A simpler specification would require a smaller and finite amount of test cases.
The only thing with a "finite amount of test cases" is a dead fish wrapped in yesterday's newspaper.
On October 2, 2002, the service that is now hosted at feedvalidator.org came bundled with 262 tests. Today it has 1707. That ain't all Atom. To a large extent, the increase in tests parallels an increase in understanding of feed formats and feed delivery mechanisms. The world understands more about feeds in 2007 than it did in 2002, and much of that knowledge is embodied in the validator service.
If a group of people want to define an XML-ish format with robust, deterministic error handling, then they will charge ahead and do so. Some in that group will charge ahead to write tests and a validator, which (one would hope) will be available when the spec finally ships. And then they will spend the next 5-10 years refining the validator, and its tests, based on the world's collective understanding. It will take this long to refine the tests into something bordering on comprehensive *regardless of how simple the spec is* in the first place.
In short, you're asking the wrong question: "How can we reduce the number of tests that we would need to ship with the spec in order to feel like we had complete coverage?" That's a pernicious form of premature optimization. The tests you will actually need (and, hopefully, will actually *have*, 5 years from now) bear no relationship to the tests you can dream up now. True "simplicity" emerges over time, as the world's understanding grows and the format proves that it won't drown you in "gotchas" and unexpected interactions. XML is over 10 years old now. How many XML parsers still don't support RFC 3023? How many do support it if you only count the parts where XML is served as "application/xml"?
I was *really proud* of those 262 validator tests in 2002. But if you'd forked the validator on October 3rd, 2002, and never synced it, you'd have something less than worthless today. Did the tests rot? No; the world just got smarter.
On a somewhat related note, I've cobbled together a firehose which tracks comments (like these) that I make on other selected sites. Many thanks to Sam for teaching me about Venus filters, which make it all possible. If you've been thinking "Gee, I just can't get enough of that Pilgrim guy, I wish there were a way that I could stalk him without being overly creepy about it," then this firehose is for you.
Tim Bray is learning Python and using my feed parser to parse the feeds at Planet Sun. I am suitably flattered, and I sincerely hope that one of the 57 lines in Tim's first Python program checks the bozo bit so Tim can ignore the 13 Planet Sun feeds which are not well-formed XML.
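For the record, the bozo bit is real API: `d = feedparser.parse(url)` sets `d.bozo` when the feed is not well-formed and stashes the error in `d.bozo_exception`, so you can keep the parsed data and still know it came from a bozo. Here's a stdlib-only sketch of the same idea (my illustration, not the feed parser's actual code):

```python
import xml.etree.ElementTree as ET

def parse_with_bozo(data):
    """Miniature version of the bozo bit: parse, and record -- rather
    than fatally raise -- any well-formedness error."""
    result = {'bozo': 0, 'bozo_exception': None, 'tree': None}
    try:
        result['tree'] = ET.fromstring(data)
    except ET.ParseError as e:
        result['bozo'] = 1
        result['bozo_exception'] = e
    return result

good = parse_with_bozo('<feed><title>ok</title></feed>')
bad = parse_with_bozo('<feed><title>oops</feed>')  # mismatched end tag
```

The application, not the parser, decides what to do with `bozo == 1`: skip the feed, log it, or shame its author publicly.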
One is served as text/plain, which means it can never be well-formed.
Ten (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) are served as text/xml with no charset parameter. Clients are required to parse such feeds as us-ascii, but the feeds contain non-ASCII characters and are therefore not well-formed XML.
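If you want to see what that us-ascii rule means in practice, here's a short illustration (mine, in Python) of what RFC 3023 obliges a conforming client to do with a feed served as text/xml and no charset parameter:

```python
# The server sent UTF-8 bytes, but said only "text/xml" with no charset.
body = '<title>Caf\u00e9</title>'.encode('utf-8')

# RFC 3023: absent a charset parameter, text/xml defaults to us-ascii.
# Any non-ASCII byte then makes the document undecodable, hence not
# well-formed, no matter how pretty it looked in the author's editor.
try:
    text = body.decode('us-ascii')
    well_formed_so_far = True
except UnicodeDecodeError:
    well_formed_so_far = False
```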
On a positive note, it's nice to see that Norman Walsh has an Atom feed (#10 in that list). Pity it's not well-formed. I'm sure he'll fix that in short order. He's no bozo.
You know what I want for Christmas? Markup Barbie. You pull a string and she says "XML is tough."
But obviously, the fault for this interoperability nightmare lies entirely in my flawed personality.
In related news, Universal Feed Parser 3.0 beta 19 is out. It supports textInput. It works around a Heisenbug with libxml2. It works around a widely reported segfault with expat when a feed contains an invalid numeric or character entity. It turns off mxTidy support until I can track down a segfault in that library. It is the mental projection of my digital self, and a perfect reflection of my flawed personality.
I suspect that most of the people discussing liberal XML parsing today are unaware that Tim Bray was the singular force behind the "fail on first error" behavior of XML. Virtually everyone in the XML working group disagreed with him, and many people pleaded for a sane method of error recovery, or at least the application-specific option to provide error recovery that was suitable for the application. (XML is uniquely suited for such error-tolerant applications. Because it is text-based and has so much redundant information, like verbose end tags, it provides easier re-entry points to recover after a parsing error, unlike most binary formats.)
In the end, Tim basically said "there are two camps here, they both have good points, we aren't going to convince each other on this one" and then proceeded to compromise by doing it his way. Seven years later, we are still paying the price for his dogmatic draconianism.
Update: Tim agrees with the following timeline but disagrees with my conclusion. I would tend to believe him, since he was, you know, there. But we agree on my fundamental point: XML's error handling has always been controversial, and lots of smart people disagreed with it from the beginning for lots of good reasons.
April 18, 1997. Tim Bray: Error Handling in XML
Well-formedness should be easy for a document to attain. In XML, documents will carry a heavy load of semantics and formatting, attached to elements and attributes, probably with significant amounts of indirection. Can any application hope to accomplish meaningful work in this mode if the document does not even manage to be well-formed!?!?
I suggest that we add language to section 5, "conformance", which says:
"An XML processor which encounters a violation of the constraints of well-formedness must not thereafter pass any information about text or markup to the application. It must pass to the application a notification of the first such violation encountered. It MAY thereafter, at user option, pass to the application information about well-formedness violations encountered after the first."
[or in English: you gotta tell the app about the first syntax botch you hit; you're allowed to send the app more error messages, but you're not allowed to send anything but error messages after you've detected an error]
April 19, 1997. Sean McGrath: Re: Error Handling in XML
Programming languages that barf on a syntax error do so because a partial executable image is a useless thing. A partial document is *not* a useless thing. One of the cool things about XML as a document format is that some of the content can be recovered even in the face of error. Compare this to our binary document friends where a blown byte can render the entire content inaccessible.
As I said in a previous post, I can think of a number of useful apps that can work sensibly with broken XML. I think the problem with M [Microsoft] and N [Netscape] is that there is no way to say "warnings = high" and get told about WF [well-formedness] problems.
April 19, 1997. Paul Prescod: Re: Error Handling In XML
I would like to weigh in on the side of moderation: require the user agent to alert the user that the parse was invalid, but don't require it to throw away the rest of the data. Vendors will just ignore that rule anyhow.
Error recovery in HTML is a product differentiator. No matter how much they bitch moan and complain, nobody would ever unilaterally move to a "validate or reject" model. And if they had started out with that model, some product would just have removed the "rejection" part in the race to be the most "flexible" and "user friendly" and the rest would have inevitably followed.
April 20, 1997. Tim Bray: Error handling: yes, I did mean it
The vendors and *serious* information providers are at one in wanting to create a non-HTML-like culture of publish-it-right on the Net; one way to do this is to shout, loudly, that there are a few (simple, thank goodness) rules, and they must be obeyed.
April 21, 1997. James Clark: Re: Error handling: yes, I did mean it
If the parser tells you about the error, then you, as an application builder, can choose to ignore any data sent by the parser after the error. The parser may even provide you with a way to do that automatically (nsgmls -E1 will stop after the first error). I think users and application builders should have a choice with what they do with invalid data. I cannot see how a user or application builder can be disadvantaged by being provided with this choice, and I therefore plan to continue to provide it even if the spec says that this is non-conforming.
April 22, 1997. Paul Prescod: Re: Error handling: yes, I did mean it
Being strict on export is laudable. Being strict on import is a hassle. I don't want the spec. to REQUIRE that you cause me a hassle. Nor do I want it to require Netscape to cause me a hassle when some bozo leaves out some easily implied quotes. I want it to notify me that he is a bozo, but let me at the data anyhow. I think that in that scenario everybody wins.
April 22, 1997. Paul Prescod: Re: Error handling: yes, I did mean it
People must have the option to decide for themselves. They have different applications and different needs. Hopefully the business-critical application people know how to capture stderr and know how to pipe to /dev/null if that's what they decide is best. Let's please leave the whole class of business-mission-life-critical applications out of this discussion because those people can take care of themselves. If they can't we have much bigger problems than well-formedness.
Tim's policy is not a strengthening of XML's well-formedness, but a discarding of its ability to resynchronise after an error. The ability to resynchronise, by not having context dependent delimiters or CDATA and RCDATA declared content types or STAGO in text, was always, to me, not so much to allow a simpler production rule, but also to allow robustness, a major fault in SGML. I *really* hope this is not being abandoned.
April 26, 1997. Michael Sperberg-McQueen: Re: Error handling: yes I did mean it
The arguments of the Draconian camp are all centered around the unquestioned observations that
- there are applications where ill-formed data is useless or worse than useless, and where ill-formedness must be detected
- by their unwillingness to issue error messages, and their determination to provide attractive displays even of badly ill-formed documents, HTML browser makers have made their own lives very difficult
Neither of these observations supports a blanket ban on error recovery by XML processors.
April 29, 1997. Bill Smith: Re: Error handling: yes, I did mean it
The draconian XML model says religion is more important than ease-of-use. That's backwards.
April 28, 1997. Dave Hollander: Re: Sudden death: request for missing input
The argument seems to be, 'don't worry. Since most if not all XML documents will be machine generated they will all be well formed.' I don't buy it! Programmers are human too and make as many errors as prose authors.
April 29, 1997. Terry Allen: Re: Error handling: yes, I did mean it
We cannot play Canute. XML is envisioned as the data format for an unimaginable range of applications, and some of those will benefit from error recovery. Humans do error recovery almost continuously (I know it's one of my specialties), why should not their software? and if it's useful, what chance have we of forbidding it successfully?
April 29, 1997. Tim Bray: Re: Sudden death: request for missing input
We went to a lot of work to make well-formedness easy. It is a very low bar to get over... much easier than producing valid HTML. I cannot for the life of me see why so many people here are willing to tolerate gross error, and run the risk of another race-to-the-bottom a la HTML, when the standard required to achieve reliable interoperability is so easy to explain and to achieve.
April 30, 1997. Murray Altheim: Re: Sudden death: request for missing input
I don't think anyone is advocating tolerance for gross error, as we've all seen what that has done with HTML. I think some of us are simply trying to leave *exactly* what happens up to the vendors. Some sort of error notification is essential, but in certain applications the method of error "recovery" may require sending the XML source on through; in others, sudden death makes sense.
... Error notification is a "must", but *how* it is done is application-specific. Error recovery is a "maybe", depending on the application.
The race to the bottom can be prevented several ways. One way is simply that XML is simply more useful if it's correct -- and people can always fall back to HTML if they don't care.
Why is specifying mandatory error notification harder to enforce than specifying mandatory refusal to process erroneous documents?
May 6, 1997. Terry Allen: Re: Jon on Error
Anyone who has a single error in his document is a bozo? Ahem. I don't buy any of this.
May 6, 1997. Paul Prescod: Re: Jon on Error
If [Microsoft and Netscape] want to "solve the HTML problem" they can. They can launch a "Web Correctness Initiative" within W3C. They will get lots of good press in the trade rags. They can agree to add validators to both of their HTML browser products. They can agree that their editor products will not make bad HTML. This is all entirely within their power and does not require any new specifications.
May 6, 1997. Tim Bray: Final words, I think, on error handling
I think that the draconians and the tolerants really do understand each others' positions, and at the same time can't fathom why each other can possibly think the way they do.
... Bottom line: we aren't going to convince each other on this one.
Browsers do not just need a well-formed XML document. They need a well-formed XML document with a stylesheet in a known location that is syntactically correct and *semantically correct* (actually applies reasonable styles to the elements so that the document can be read). They need valid hyperlinks to valid targets and pretty soon they may need some kind of valid SGML catalog. There is still so much room for a document author to screw up that well-formedness is a very minor step down the path. The idea that well-formedness-or-die will create a "culture of quality" on the Web is totally bogus. People will become extremely anal about their well-formedness and transfer their laziness to some other part of the system.
The basic point against the Draconian case is that a single (monolithic?) policy towards error handling is a recipe for failure. ... The Good News for XML is that DTD conformance is not an (immediate) issue; the Bad News is that there are nevertheless enough merely lexical/syntactic gotchas to be fertile sources of errors -- and not every XML document put on the wire will be the output of a smart editor.
Norman Walsh (invalid XML), Danny Ayers (invalid XML), Brent Simmons (invalid XML), Nick Bradbury (invalid XML), and Joe Gregorio (invalid XML claiming to be HTML) have all denounced me as a heretic for pointing out that, perhaps, rejecting invalid XML on the client side is a bad idea. The reason I know that they have denounced me is that I read what they had to say, and the reason I was able to read what they had to say is that my browser is very forgiving of all their various XML wellformedness and validity errors.
Tim Bray has chimed in by calling all of those people names, stating in no uncertain terms that anyone who can't make ... well-formed XML is an incompetent fool. Well, technically he was only talking about syndication feeds, but since XHTML is as much XML as Atom or RSS, I'm pretty sure he would apply the same measurement. So if you can't make well-formed XML, don't despair; you may be a fool, but you are, if nothing else, in outstanding company.
Rather than call people names, I'd like to propose a thought experiment.
Imagine, if you will, that all web browsers use strict XML parsers. That is, whenever they encounter an XHTML page, they parse it with a conforming XML parser and refuse to display the page unless it is well-formed XML. This part of the thought experiment is not terribly difficult to imagine, since Mozilla actually works this way under certain circumstances (if the server sends an XHTML page with the MIME type application/xhtml+xml, instead of the normal text/html). But imagine that all browsers worked this way, regardless of MIME type.
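The two modes are easy to demonstrate side by side. Here's a sketch (mine, using only Python's standard library as a stand-in for the browser engines) of the strict route versus the forgiving route on the same tag soup:

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

page = '<html><body><p>hello<br></body></html>'  # fine as HTML, fatal as XML

# Strict route (what application/xhtml+xml triggers): one well-formedness
# error, and nothing gets displayed at all.
try:
    ET.fromstring(page)
    strict_ok = True
except ET.ParseError:
    strict_ok = False

# Forgiving route (what text/html gets): the parser soldiers on and the
# reader still sees the content.
class TextCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text = []
    def handle_data(self, data):
        self.text.append(data)

collector = TextCollector()
collector.feed(page)
```

Same bytes, two outcomes: a blank error page, or "hello".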
Now imagine that you were using a publishing tool that prided itself on its standards compliance. All of its default templates were valid XHTML. It incorporated a nifty layout editor to ensure that you couldn't introduce any invalid XHTML into the templates yourself. It incorporated a nifty validating editor to ensure that you couldn't introduce any invalid XHTML into your authored content. It was all very nifty.
Imagine that you posted a long rant about how this is the way the world should work, that clients should be the gatekeepers of wellformedness, and strictly reject any invalid XML that comes their way. You click 'Publish', you double-check that your page validates, and you merrily close your laptop and get on with your life.
A few hours later, you start getting email from your readers that your site is broken. Some of them are nice enough to include a URL, others simply scream at you incoherently and tell you that you suck. (This part of the thought experiment should not be terribly difficult to imagine either, for anyone who has ever dealt with end-user bug reports.) You test the page, and lo and behold, they are correct: the page that you so happily and validly authored is now not well-formed, and it is not showing up at all in any browser. You try validating the page with a third-party validator service, only to discover that it gives you an error message you've never seen before and that you don't understand.
You pore through the raw source code of the page and find what you think is the problem, but it's not in your content. In fact, it's in an auto-generated part of the page that you have no control over. What happened was, someone linked to you, and when they linked to you they sent a trackback with some illegal characters (illegal for you, not for them, since they declare a different character set than you do). But your publishing tool had a bug, and it automatically inserted their illegal characters into your carefully and validly authored page, and now all hell has broken loose.
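That failure mode is trivially reproducible. Here's a minimal Python illustration (mine, not Typepad's actual code): a page that declares utf-8, plus one windows-1252-encoded byte from somebody else's trackback, equals not-well-formed:

```python
import xml.etree.ElementTree as ET

# Your page declares utf-8; the trackback arrives as windows-1252 bytes.
# In windows-1252, 'é' is the single byte 0xE9 -- which is not a valid
# byte sequence in utf-8.
page = b'<?xml version="1.0" encoding="utf-8"?><p>caf\xe9</p>'

try:
    ET.fromstring(page)
    still_well_formed = True
except ET.ParseError:
    still_well_formed = False
```

One byte you never typed, and your entire carefully validated page is now toxic to a strict parser.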
The emails are really pouring in now. You desperately jump to your administration page to delete the offending trackback, but oh no! The administration page itself tries to display the trackbacks you've received, and you get an XML processing error. The same bug that was preventing your readers from reading your published page is now preventing you from fixing it! You're caught in a catch-22. And what's worse, your site is part of a completely hosted solution, so you can't even dig into the source or the underlying database and fix it yourself; all the code is locked away on someone else's server, beyond your control. There's nothing you can do now but fire off a desperate email to your hosting provider and hope they can fix the underlying problem and clean up your bad data. You know, whenever they get around to it.
All the while, your page is completely inaccessible and visibly broken, and readers are emailing you telling you this over and over again. And of course the discussion you were trying to start with your eloquent words has come to a screeching halt; no new comments can be added because your comment form is on the same broken page.
Here's the thing: that wasn't a thought experiment; it all really happened. It's a funny story, actually, because it happened to Nick Bradbury, on the very page where he was explaining why it was so important for clients to reject non-wellformed XML. His original post was valid XHTML, and his surrounding page was valid XHTML, but a trackback came in with a character that wasn't in his character set, and Typepad didn't catch it, and suddenly his page became non-wellformed XML.
Except that none of the rest of it happened, because Nick is not publishing his page as application/xhtml+xml, and browsers are forgiving by default. And although I did happen to notice the XML wellformedness error and report it, it wasn't a critical problem for him. I was even able to report the error using the comment form on the broken page itself. And while SixApart should certainly fix the error, they are free to fix it on their own schedule; it is not a crisis for them or for Nick, no one is being flooded with angry emails, no one is caught in a catch-22 of invalidity.
Want another thought experiment? Imagine Nick was serving Google ads on that page. Not so funny anymore, is it?
The client is the wrong place to enforce data integrity. It's just the wrong place. I hear otherwise intelligent people claim that if everyone did it, it would be okay. No, if everyone did it, it would be even worse. If you want to do it, of course I can't stop you. But think about who it will hurt.