
Google (my current employer) has finally open sourced protocol buffers, the data interchange format we use for internal server-to-server communication. The blogosphere's response? "No wireless. Less space than a Nomad. Lame."


Protocol buffers are "just" cross-platform data structures. All you have to write is the schema (a .proto file), then generate bindings in C++, Java, or Python. (Or Haskell. Or Perl.) The .proto file is just a schema; it doesn't contain any data except default values. All getting and setting is done in code. The serialized over-the-wire format is designed to minimize network traffic, and deserialization (especially in C++) is designed to maximize performance. I can't begin to describe how much effort Google spends maximizing performance at every level. We would tear down our data centers and rewire them with $500 ethernet cables if you could prove that it would reduce latency by 1%.

Besides being blindingly fast, protocol buffers have lots of neat features. A zero-size PB returns default values. You can nest PBs inside each other. And most importantly, PBs are both backward and forward compatible, which means you can upgrade servers gradually and they can still talk to each other in the interim. (When you have as many machines as Google has, it's always the interim somewhere.)
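To make the "minimize network traffic" and "forward compatible" claims concrete, here is a minimal sketch of the publicly documented varint encoding, plus a reader that skips field numbers it does not know. It is an illustration only: real code uses the bindings generated from a .proto file, the helper names below are hypothetical, and the sketch handles integer (varint) fields and nothing else.

```python
def encode_varint(n):
    """Encode a non-negative integer as a base-128 varint (7 bits per byte)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf, pos):
    """Decode one varint starting at buf[pos]; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_int_field(field_number, value):
    """Encode one integer field: a key varint followed by a value varint."""
    key = (field_number << 3) | 0  # wire type 0 means "varint"
    return encode_varint(key) + encode_varint(value)


def decode_known_fields(buf, known):
    """Decode varint fields, keeping only the field numbers listed in `known`.

    `known` maps field number -> name (a hypothetical stand-in for a schema).
    Fields an older reader has never heard of are read and then dropped,
    which is what lets old and new servers keep talking to each other.
    """
    result = {}
    pos = 0
    while pos < len(buf):
        key, pos = decode_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type != 0:
            raise ValueError("this sketch only handles varint fields")
        value, pos = decode_varint(buf, pos)
        if field_number in known:
            result[known[field_number]] = value
    return result
```

A message written by a "new" server with fields 1 and 2 can be read by an "old" server that only knows field 1: `decode_known_fields(encode_int_field(1, 150) + encode_int_field(2, 7), {1: "id"})` gives back just the `id`. The whole message is five bytes.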

Comparisons to other data formats were, I suppose, inevitable. Old-timers may remember ASN.1 or IIOP. Kids these days seem to compare everything to XML or JSON. Protocol buffers are actually closer to Facebook's Thrift (written by ex-Googlers) or SQL Server's TDS. They won't kill XML (no matter how much you wish they would), nor will they replace JSON, ASN.1, or carrier pigeon. But they're simple and they're fast and they scale like crazy, and that's the way Google likes it.



Time to resurface a few good comments I made at Tim's place last year:

> if an electronic-trading system receives an XML message for a transaction valued at €2,000,000, and there's a problem with a missing end tag, you do not want the system guessing what the message meant

You [Tim] have used this example, or variations of it, since 1997. I think I can finally express why it irritates me so much: you are conflating "non-draconian error handling" with "non-deterministic error handling". It is true that there are some non-draconian formats which do not define an error handling mechanism, and it is true that this leads to non-interoperable implementations, but it is not true that non-draconian error handling implies "the system has to guess." It is possible to specify a deterministic algorithm for graceful (non-draconian) error handling; this is one of the primary things WHATWG is attempting to do for HTML 5.

If any format (including an as-yet-unspecified format named "XML 2.0") allows the creation of a document that two clients can parse into incompatible representations, and both clients have an equal footing for claiming that their way is correct, then that format has a serious bug. Draconian error handling is one way to solve such a bug, but it is not the only way, and for 10 years you've been using an overly simplistic example that misleadingly claims otherwise.
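As a toy illustration of "deterministic but not draconian", the sketch below parses a made-up tag language with three fixed recovery rules. It is not XML, and it is not the WHATWG algorithm; the grammar, the rules, and the name `parse_leniently` are all assumptions chosen to keep the example short. The point is only that every implementation following the same rules builds the same tree from the same broken input, so nobody has to guess.

```python
import re

# Toy grammar: <name>, </name>, and runs of text. No attributes, no entities.
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>|([^<]+)")


def parse_leniently(text):
    """Parse `text` into nested (name, children) tuples, never raising.

    Recovery rules (fixed, so the result is reproducible):
      1. An end tag closes the nearest open element with that name,
         implicitly closing everything opened inside it.
      2. An end tag that matches no open element is ignored.
      3. At end of input, every element still open is closed.
    """
    root = ("#root", [])
    stack = [root]
    for match in TOKEN.finditer(text):
        closing, name, chars = match.groups()
        if chars is not None:
            stack[-1][1].append(chars)
        elif not closing:
            node = (name, [])
            stack[-1][1].append(node)
            stack.append(node)
        elif name in [open_node[0] for open_node in stack[1:]]:
            # Rule 1: pop until the matching element has been closed.
            while stack.pop()[0] != name:
                pass
        # Rule 2: otherwise the stray end tag is dropped.
    # Rule 3: nothing to do; open elements are already attached to the tree.
    return root[1]
```

Under these rules `<a><b>x</a>y` always becomes an `a` element containing a `b` element containing `x`, followed by the text `y`. Whether those are the right rules for a trading system is a separate argument; the parse itself is not a guess.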

And, in the same thread but on a different note:

I would posit that, for the vast majority of feed producers, the validator *is* RSS (and Atom). People only read the relevant specs when they want to argue that the validator has a false positive (which has happened, and results in a new test) or a false negative (which has also happened, and also results in a new test). Around the time that RFC 4287 was published, Sam rearranged the tests by spec section. This is why specs matter. The validator service lets morons be efficient morons, and the tests behind it let the assholes be efficient assholes. More on this in a minute.

> A simpler specification would require a smaller and finite amount of test cases.

The only thing with a "finite amount of test cases" is a dead fish wrapped in yesterday's newspaper.

On October 2, 2002, the service that is now hosted at came bundled with 262 tests. Today it has 1707. That ain't all Atom. To a large extent, the increase in tests parallels an increase in understanding of feed formats and feed delivery mechanisms. The world understands more about feeds in 2007 than it did in 2002, and much of that knowledge is embodied in the validator service.

If a group of people want to define an XML-ish format with robust, deterministic error handling, then they will charge ahead and do so. Some in that group will charge ahead to write tests and a validator, which (one would hope) will be available when the spec finally ships. And then they will spend the next 5-10 years refining the validator, and its tests, based on the world's collective understanding. It will take this long to refine the tests into something bordering on comprehensive *regardless of how simple the spec is* in the first place.

In short, you're asking the wrong question: "How can we reduce the number of tests that we would need to ship with the spec in order to feel like we had complete coverage?" That's a pernicious form of premature optimization. The tests you will actually need (and, hopefully, will actually *have*, 5 years from now) bear no relationship to the tests you can dream up now. True "simplicity" emerges over time, as the world's understanding grows and the format proves that it won't drown you in "gotchas" and unexpected interactions. XML is over 10 years old now. How many XML parsers still don't support RFC 3023? How many do support it if you only count the parts where XML is served as "application/xml"?

I was *really proud* of those 262 validator tests in 2002. But if you'd forked the validator on October 3rd, 2002, and never synced it, you'd have something less than worthless today. Did the tests rot? No; the world just got smarter.

On a somewhat related note, I've cobbled together a firehose which tracks comments (like these) that I make on other selected sites. Many thanks to Sam for teaching me about Venus filters, which make it all possible. If you've been thinking "Gee, I just can't get enough of that Pilgrim guy, I wish there were a way that I could stalk him without being overly creepy about it," then this firehose is for you.


Tim Bray is learning Python and using my feed parser to parse the feeds at Planet Sun. I am suitably flattered, and I sincerely hope that one of the 57 lines in Tim's first Python program checks the bozo bit so Tim can ignore the 13 Planet Sun feeds which are not well-formed XML.

One is served as text/plain, which means it can never be well-formed.

Two (a, b) contain invalid XML characters.

Ten (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) are served as text/xml with no charset parameter. Clients are required to parse such feeds as us-ascii, but the feeds contain non-ASCII characters and are therefore not well-formed XML.
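The three failure modes above follow from how RFC 3023 ties character encoding to the Content-Type header. A rough sketch of the core rules, under some loud assumptions: the function names are hypothetical, byte order mark detection is left out, and `xml_declared_encoding` stands for whatever the document's own XML declaration says (extracting it is not shown).

```python
import re


def rfc3023_encoding(content_type, xml_declared_encoding=None):
    """Return the encoding a client should use, or None if this is not XML.

    Simplified reading of RFC 3023:
      - text/xml (and text/*+xml): the charset parameter wins; without one
        the default is us-ascii, whatever the document itself declares.
      - application/xml (and application/*+xml): the charset parameter wins;
        without one, use the document's declaration, defaulting to utf-8.
      - anything else (text/plain, for example) is not an XML media type.
    """
    media_type, _, params = content_type.partition(";")
    media_type = media_type.strip().lower()
    match = re.search(r'charset\s*=\s*"?([^";\s]+)', params, re.IGNORECASE)
    charset = match.group(1).lower() if match else None

    is_xml_suffix = media_type.endswith("+xml")
    if media_type == "text/xml" or (media_type.startswith("text/") and is_xml_suffix):
        return charset or "us-ascii"
    if media_type == "application/xml" or (
        media_type.startswith("application/") and is_xml_suffix
    ):
        return charset or xml_declared_encoding or "utf-8"
    return None


def is_decodable(body, content_type, xml_declared_encoding=None):
    """True if `body` (bytes) decodes cleanly under the RFC 3023 encoding."""
    encoding = rfc3023_encoding(content_type, xml_declared_encoding)
    if encoding is None:
        return False
    try:
        body.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        return False
    return True
```

A UTF-8 feed containing a single accented character fails `is_decodable` when served as plain `text/xml`, and passes when served as `application/atom+xml` or as `text/xml; charset=utf-8`. Same bytes, different header, different verdict.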

On a positive note, it's nice to see that Norman Walsh has an Atom feed (#10 in that list). Pity it's not well-formed. I'm sure he'll fix that in short order. He's no bozo.

You know what I want for Christmas? Markup Barbie. You pull a string and she says "XML is tough."


I was walking across a bridge one day, and I saw a man standing on the edge, about to jump off. So I ran over and said, "Stop! Don't do it!"

"I can't help it," he cried. "I've lost my will to live."

"What do you do for a living?" I asked.

He said, "I create web services specifications."

"Me too!" I said. "Do you use REST web services or SOAP web services?"

He said, "REST web services."

"Me too!" I said. "Do you use text-based XML or binary XML?"

He said, "Text-based XML."

"Me too!" I said. "Do you use XML 1.0 or XML 1.1?"

He said, "XML 1.0."

"Me too!" I said. "Do you use UTF-8 or UTF-16?"

He said, "UTF-8."

"Me too!" I said. "Do you use Unicode Normalization Form C or Unicode Normalization Form KC?"

He said, "Unicode Normalization Form KC."

"Die, heretic scum!" I shouted, and I pushed him over the edge.

(with apologies to Emo Philips)


3.0 beta 22 of my Universal Feed Parser is out. This release fixes all known bugs, and I hope it will be the last beta before 3.0 final. After all, this is getting a bit ridiculous.

The release makes a significant change: if XML parsing fails due to character encoding problems, the parser will attempt to auto-determine the character encoding and re-parse with a real XML parser. This is noted in the results as results['bozo'] = 1 and results['bozo_exception'] = feedparser.CharacterEncodingOverride. results['encoding'] will contain the encoding that was actually used to parse the feed (not the original declared encoding).
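The fallback described above can be sketched in a few lines. This is not the feed parser's actual code: the candidate list, the function name `parse_with_override`, and the local `CharacterEncodingOverride` class are stand-ins invented for the example, and the standard library's ElementTree plays the part of the "real XML parser". Only the shape of the result (`bozo`, `bozo_exception`, `encoding`) follows the description above.

```python
import xml.etree.ElementTree as ET


class CharacterEncodingOverride(Exception):
    """Local stand-in for feedparser.CharacterEncodingOverride."""


# Assumed list of fallbacks; the real parser's candidates may differ.
CANDIDATE_ENCODINGS = ("utf-8", "windows-1252")


def parse_with_override(data, declared_encoding):
    """Parse `data` (bytes), retrying with other encodings if needed.

    Returns a dict with 'root', 'bozo', 'encoding' and, when the declared
    encoding had to be overridden, 'bozo_exception'.
    """
    result = {"root": None, "bozo": 0, "encoding": declared_encoding}
    for encoding in (declared_encoding,) + CANDIDATE_ENCODINGS:
        try:
            # Decode first, then hand the text to a real XML parser.
            root = ET.fromstring(data.decode(encoding))
        except (UnicodeDecodeError, LookupError, ET.ParseError):
            continue
        result["root"] = root
        if encoding != declared_encoding:
            result["bozo"] = 1
            result["bozo_exception"] = CharacterEncodingOverride(
                "declared as %s, parsed as %s" % (declared_encoding, encoding)
            )
            result["encoding"] = encoding  # the encoding actually used
        return result
    # Nothing worked; still flag the document rather than raising.
    result["bozo"] = 1
    return result
```

Callers that care about strictness check `result["bozo"]` and walk away; callers that just want the data use `result["root"]` and read `result["encoding"]` to learn what was really used.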

This release makes another significant change: Unicode support for ill-formed feeds. All individual data values will be returned as Unicode strings if they can be converted using the document's character encoding. I had a flash of insight and suddenly the entirety of Python's Unicode support became clear to me. I coded madly for several hours until it faded. It's entirely possible that that's just the LSD talking, but thanks to the magic of open source, everyone can now share in my good trip.

This release also makes significant changes to internal classes. If you were subclassing or accessing these classes, your code will likely break. If you were just using the public parse() function, you will not notice any change.

My change reporting history has been lax throughout the 3.0 beta process, so I went back and recreated it from file timestamps, comments, and judicious use of diff. Full user documentation is coming next.

3.0b3 - 1/23/2004 - MAP
  • parse entire feed with real XML parser (if available)
  • added several new supported namespaces
  • fixed bug tracking naked markup in description
  • added support for enclosure
  • added support for source
  • re-added support for cloud which got dropped somehow
  • added support for expirationDate
3.0b4 - 1/26/2004 - MAP
  • fixed xml:lang inheritance
  • fixed multiple bugs tracking xml:base URI, one for documents that don't define one explicitly and one for documents that define an outer and an inner xml:base that goes out of scope before the end of the document
3.0b5 - 1/26/2004 - MAP
  • fixed bug parsing multiple links at feed level
3.0b6 - 1/27/2004 - MAP
  • added feed type and version detection, result["version"] will be one of SUPPORTED_VERSIONS.keys() or empty string if unrecognized
  • added support for creativeCommons:license and cc:license
  • added support for full Atom content model in title, tagline, info, copyright, summary
  • fixed bug with gzip encoding (not always telling server we support it when we do)
3.0b7 - 1/28/2004 - MAP
  • support Atom-style author element in author_detail (dictionary of "name", "url", "email")
  • map author to author_detail if author contains name + email address
3.0b8 - 1/28/2004 - MAP
  • added support for contributor
3.0b9 - 1/29/2004 - MAP
  • fixed check for presence of dict function
  • added support for full Atom content model in summary
3.0b10 - 1/31/2004 - MAP
  • incorporated ISO-8601 date parsing routines from xml.util.iso8601
3.0b11 - 2/2/2004 - MAP
  • added 'rights' to list of elements that can contain dangerous markup
  • fiddled with decodeEntities (not right)
  • liberalized date parsing even further
3.0b12 - 2/6/2004 - MAP
  • fiddled with decodeEntities (still not right)
  • added support for Atom 0.2 subtitle
  • added support for Atom content model in copyright
  • better sanitizing of dangerous HTML elements with end tags (script, frameset)
3.0b13 - 2/8/2004 - MAP
  • better handling of empty HTML tags (br, hr, img, etc.) in embedded markup, in either HTML or XHTML form (<br>, <br/>, <br />)
3.0b14 - 2/8/2004 - MAP
  • fixed CDATA handling in non-wellformed feeds under Python 2.1
3.0b15 - 2/11/2004 - MAP
  • fixed bug resolving relative links in wfw:commentRSS
  • fixed bug capturing author and contributor URL
  • fixed bug resolving relative links in author and contributor URL
  • fixed bug resolving relative links in generator URL
  • added support for recognizing RSS 1.0 in results['version']
  • passed Simon Fell's namespace tests, and included them permanently in the test suite with his permission
  • fixed namespace handling under Python 2.1
3.0b16 - 2/12/2004 - MAP
  • fixed support for RSS 0.90 (broken in b15)
3.0b17 - 2/13/2004 - MAP
  • determine character encoding as per RFC 3023
3.0b18 - 2/17/2004 - MAP
  • always map description to summary_detail (Andrei)
  • use libxml2 (if available)
3.0b19 - 3/15/2004 - MAP
  • fixed bug exploding author information when author name was in parentheses
  • removed ultra-problematic mxTidy support
  • patch to work around crash in PyXML/expat when encountering invalid entities (MarkMoraes)
  • support for textinput/textInput
3.0b20 - 4/7/2004 - MAP
  • added CDF support
3.0b21 - 4/14/2004 - MAP
  • added Hot RSS support
3.0b22 - 4/19/2004 - MAP
  • map 'channel' to 'feed', 'items' to 'entries' in results dict (old keys still work)
  • changed results dict to allow getting values with results.key as well as results[key]
  • work around embedded ill-formed HTML with half a DOCTYPE
  • work around malformed Content-Type header
  • if character encoding is wrong, try several common ones before falling back to regexes (if this works, bozo_exception is set to CharacterEncodingOverride)
  • fixed character encoding issues in BaseHTMLProcessor by tracking encoding and converting from Unicode to raw strings before feeding data to sgmllib.SGMLParser
  • convert each value in results to Unicode (if possible), even if using regex-based parsing
  • re-added mxTidy support, but off by default; install mxTidy and set feedparser.TIDY_MARKUP=1 to enable it
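
The first two 3.0b22 changes (renamed keys that still answer to their old names, and results.key access) can be pictured with a small dict subclass. This is a sketch of the idea, not the parser's internal class: `ResultsDict` and its `ALIASES` table are hypothetical names.

```python
class ResultsDict(dict):
    """A dict that accepts old key names and allows attribute-style reads."""

    # Old key -> new key, per the changelog entry above.
    ALIASES = {"channel": "feed", "items": "entries"}

    def __getitem__(self, key):
        return dict.__getitem__(self, self.ALIASES.get(key, key))

    def __setitem__(self, key, value):
        dict.__setitem__(self, self.ALIASES.get(key, key), value)

    def __contains__(self, key):
        return dict.__contains__(self, self.ALIASES.get(key, key))

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so real dict
        # methods such as .items() and .keys() keep working.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None
```

One wrinkle in this sketch: `results["items"]` reaches the entries through the alias, but `results.items` is still the ordinary dict method, because Python finds the method before it ever consults `__getattr__`. Attribute access is a convenience for the new names, not a full replacement for key access.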