Google (my current employer) has finally open sourced protocol buffers, the data interchange format we use for internal server-to-server communication. The blogosphere's response? "No wireless. Less space than a Nomad. Lame."
Protocol buffers are "just" cross-platform data structures. All you have to write is the schema (a `.proto` file), then generate bindings in C++, Java, or Python. (Or Haskell. Or Perl.) The `.proto` file is just a schema; it doesn't contain any data except default values. All getting and setting is done in code. The serialized over-the-wire format is designed to minimize network traffic, and deserialization (especially in C++) is designed to maximize performance. I can't begin to describe how much effort Google spends maximizing performance at every level. We would tear down our data centers and rewire them with $500 ethernet cables if you could prove that it would reduce latency by 1%.
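A schema is just a list of typed, numbered fields. A minimal sketch, using proto2 syntax (the message and field names here are mine, not from any real Google service):

```proto
// search.proto -- a hypothetical schema. Run it through protoc
// to generate C++, Java, or Python bindings.
message SearchRequest {
  required string query = 1;              // tag numbers identify fields on the wire
  optional int32 page = 2 [default = 1];  // defaults live in the schema, not the data
  repeated string filters = 3;            // zero or more values
}
```

The tag numbers, not the field names, are what get serialized, which is most of why the wire format is so compact.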
Besides being blindingly fast, protocol buffers have lots of neat features. A zero-size PB returns default values. You can nest PBs inside each other. And most importantly, PBs are both backward and forward compatible, which means you can upgrade servers gradually and they can still talk to each other in the interim. (When you have as many machines as Google has, it's always the interim somewhere.)
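The compatibility trick is that fields are identified by tag number, and a parser skips tags it doesn't recognize. So a gradual upgrade looks roughly like this (again, hypothetical names):

```proto
// v2 of the hypothetical SearchRequest. Old servers skip tag 4,
// which they've never heard of (forward compatible); new servers
// see the default when tag 4 is absent (backward compatible).
message SearchRequest {
  required string query = 1;
  optional int32 page = 2 [default = 1];
  repeated string filters = 3;
  optional bool safe_search = 4 [default = false];  // new in v2
}
// Rule of thumb: never reuse or renumber an existing tag, and
// only add new fields as optional or repeated.
```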
Comparisons to other data formats were, I suppose, inevitable. Old-timers may remember ASN.1 or IIOP. Kids these days seem to compare everything to XML or JSON. Protocol buffers are actually closer to Facebook's Thrift (written by ex-Googlers) or SQL Server's TDS. They won't kill XML (no matter how much you wish they would), nor will they replace JSON, ASN.1, or carrier pigeon. But they're simple and they're fast and they scale like crazy, and that's the way Google likes it.
Time to resurface a few good comments I made at Tim's place last year:
> if an electronic-trading system receives an XML message for a transaction valued at €2,000,000, and there's a problem with a missing end tag, you do not want the system guessing what the message meant
You [Tim] have used this example, or variations of it, since 1997. I think I can finally express why it irritates me so much: you are conflating "non-draconian error handling" with "non-deterministic error handling". It is true that there are some non-draconian formats which do not define an error handling mechanism, and it is true that this leads to non-interoperable implementations, but it is not true that non-draconian error handling implies "the system has to guess." It is possible to specify a deterministic algorithm for graceful (non-draconian) error handling; this is one of the primary things WHATWG is attempting to do for HTML 5.
If any format (including an as-yet-unspecified format named "XML 2.0") allows the creation of a document that two clients can parse into incompatible representations, and both clients have an equal footing for claiming that their way is correct, then that format has a serious bug. Draconian error handling is one way to solve such a bug, but it is not the only way, and for 10 years you've been using an overly simplistic example that misleadingly claims otherwise.
And, in the same thread but on a different note:
I would posit that, for the vast majority of feed producers, feedvalidator.org *is* RSS (and Atom). People only read the relevant specs when they want to argue that the validator has a false positive (which has happened, and results in a new test) or a false negative (which has also happened, and also results in a new test). Around the time that RFC 4287 was published, Sam rearranged the tests by spec section. This is why specs matter. The validator service lets morons be efficient morons, and the tests behind it let the assholes be efficient assholes. More on this in a minute.
> A simpler specification would require a smaller and finite amount of test cases.
The only thing with a "finite amount of test cases" is a dead fish wrapped in yesterday's newspaper.
On October 2, 2002, the service that is now hosted at feedvalidator.org came bundled with 262 tests. Today it has 1707. That ain't all Atom. To a large extent, the increase in tests parallels an increase in understanding of feed formats and feed delivery mechanisms. The world understands more about feeds in 2007 than it did in 2002, and much of that knowledge is embodied in the validator service.
If a group of people want to define an XML-ish format with robust, deterministic error handling, then they will charge ahead and do so. Some in that group will charge ahead to write tests and a validator, which (one would hope) will be available when the spec finally ships. And then they will spend the next 5-10 years refining the validator, and its tests, based on the world's collective understanding. It will take this long to refine the tests into something bordering on comprehensive *regardless of how simple the spec is* in the first place.
In short, you're asking the wrong question: "How can we reduce the number of tests that we would need to ship with the spec in order to feel like we had complete coverage?" That's a pernicious form of premature optimization. The tests you will actually need (and, hopefully, will actually *have*, 5 years from now) bear no relationship to the tests you can dream up now. True "simplicity" emerges over time, as the world's understanding grows and the format proves that it won't drown you in "gotchas" and unexpected interactions. XML is over 10 years old now. How many XML parsers still don't support RFC 3023? How many do support it if you only count the parts where XML is served as "application/xml"?
I was *really proud* of those 262 validator tests in 2002. But if you'd forked the validator on October 3rd, 2002, and never synced it, you'd have something less than worthless today. Did the tests rot? No; the world just got smarter.
On a somewhat related note, I've cobbled together a firehose which tracks comments (like these) that I make on other selected sites. Many thanks to Sam for teaching me about Venus filters, which make it all possible. If you've been thinking "Gee, I just can't get enough of that Pilgrim guy, I wish there were a way that I could stalk him without being overly creepy about it," then this firehose is for you.
Tim Bray is learning Python and using my feed parser to parse the feeds at Planet Sun. I am suitably flattered, and I sincerely hope that one of the 57 lines in Tim's first Python program checks the bozo bit so Tim can ignore the 13 Planet Sun feeds which are not well-formed XML.
One is served as `text/plain`, which means it can never be well-formed. Ten (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) are served as `text/xml` with no `charset` parameter. Clients are required to parse such feeds as `us-ascii`, but the feeds contain non-ASCII characters and are therefore not well-formed XML.
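The rule those ten feeds are tripping over comes from RFC 3023. A rough sketch of the charset logic (simplified — I'm ignoring BOM sniffing, and the helper name is mine):

```python
def effective_charset(media_type, charset_param=None, xml_decl=None):
    """Simplified RFC 3023 charset determination for XML media types."""
    if charset_param:
        # An explicit charset parameter in the HTTP headers always wins.
        return charset_param
    if media_type.startswith("text/"):
        # text/* with no charset defaults to us-ascii, no matter
        # what the XML declaration inside the document claims.
        return "us-ascii"
    # application/xml and application/*+xml: fall back to the
    # XML declaration, then to the XML default of utf-8.
    return xml_decl or "utf-8"

print(effective_charset("text/xml"))                             # us-ascii
print(effective_charset("application/xml", xml_decl="iso-8859-1"))  # iso-8859-1
```

That first case is the trap: a perfectly good `encoding="utf-8"` declaration inside the document is silently overridden by the `text/xml` media type.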
On a positive note, it's nice to see that Norman Walsh has an Atom feed (#10 in that list). Pity it's not well-formed. I'm sure he'll fix that in short order. He's no bozo.
You know what I want for Christmas? Markup Barbie. You pull a string and she says "XML is tough."
I was walking across a bridge one day, and I saw a man standing on the edge, about to jump off. So I ran over and said, "Stop! Don't do it!"
"I can't help it," he cried. "I've lost my will to live."
"What do you do for a living?" I asked.
He said, "I create web services specifications."
"Me too!" I said. "Do you use REST web services or SOAP web services?"
He said, "REST web services."
"Me too!" I said. "Do you use text-based XML or binary XML?"
He said, "Text-based XML."
"Me too!" I said. "Do you use XML 1.0 or XML 1.1?"
He said, "XML 1.0."
"Me too!" I said. "Do you use UTF-8 or UTF-16?"
He said, "UTF-8."
"Me too!" I said. "Do you use Unicode Normalization Form C or Unicode Normalization Form KC?"
He said, "Unicode Normalization Form KC."
"Die, heretic scum!" I shouted, and I pushed him over the edge.
(with apologies to Emo Philips)
The release makes a significant change: if XML parsing fails due to character encoding problems, the parser will attempt to auto-determine the character encoding and re-parse with a real XML parser. This is noted in the results as `results['bozo'] = 1` and `results['bozo_exception'] = feedparser.CharacterEncodingOverride`. `results['encoding']` will contain the encoding that was actually used to parse the feed (not the original declared encoding).
This release makes another significant change: Unicode support for ill-formed feeds. All individual data values will be returned as Unicode strings if they can be converted using the document's character encoding. I had a flash of insight and suddenly the entirety of Python's Unicode support became clear to me. I coded madly for several hours until it faded. It's entirely possible that that's just the LSD talking, but thanks to the magic of open source, everyone can now share in my good trip.
This release also makes significant changes to internal classes. If you were subclassing or accessing these classes, your code will likely break. If you were just using the public `parse()` function, you will not notice any change.
My change reporting history has been lax throughout the 3.0 beta process, so I went back and recreated it from file timestamps, comments, and judicious use of `diff`. Full user documentation is coming next.