
Most of my recent writing has happened elsewhere.

That last article came about during the creation of mimesniff, my open source Python 3 library that implements the HTML5 Content-Type detection and character encoding detection algorithms.

If none of that is your cup of tea, here is a picture of my dog Beauregard, enjoying the beautiful North Carolina summer weather.

[Photo: Beauregard on the deck]


Google (my current employer) has finally open sourced protocol buffers, the data interchange format we use for internal server-to-server communication. The blogosphere's response? "No wireless. Less space than a Nomad. Lame."


Protocol buffers are "just" cross-platform data structures. All you have to write is the schema (a .proto file), then generate bindings in C++, Java, or Python. (Or Haskell. Or Perl.) The .proto file is just a schema; it doesn't contain any data except default values. All getting and setting is done in code. The serialized over-the-wire format is designed to minimize network traffic, and deserialization (especially in C++) is designed to maximize performance. I can't begin to describe how much effort Google spends maximizing performance at every level. We would tear down our data centers and rewire them with $500 ethernet cables if you could prove that it would reduce latency by 1%.
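To make the "just a schema" point concrete, here is a minimal .proto sketch. The message and field names are my own invention for illustration, not from any real Google service; the syntax is proto2, which is what shipped in the open source release.

```proto
// Hypothetical schema. Run it through protoc to generate
// bindings in C++, Java, or Python.
message Person {
  required string name  = 1;  // numbered tags identify fields on the wire
  optional int32  id    = 2 [default = 0];  // defaults live in the schema
  repeated string email = 3;  // repeated fields act like lists
}
```

The generated classes give you typed getters and setters plus serialization; the .proto file itself never holds data.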

Besides being blindingly fast, protocol buffers have lots of neat features. A zero-size PB returns default values. You can nest PBs inside each other. And most importantly, PBs are both backward and forward compatible, which means you can upgrade servers gradually and they can still talk to each other in the interim. (When you have as many machines as Google has, it's always the interim somewhere.)
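The compatibility story falls out of those numbered field tags. Continuing the hypothetical schema above (again, my own example): old code skips tags it doesn't recognize, and new code gets default values for tags that aren't on the wire.

```proto
message Person {
  required string name  = 1;
  optional int32  id    = 2 [default = 0];
  repeated string email = 3;
  // Added in a later release. Old servers that receive this field
  // skip it unharmed (forward compatible); new servers reading old
  // data see the default value instead (backward compatible).
  optional string phone = 4 [default = ""];
}
```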

Comparisons to other data formats were, I suppose, inevitable. Old-timers may remember ASN.1 or IIOP. Kids these days seem to compare everything to XML or JSON. Protocol buffers are actually closer to Facebook's Thrift (written by ex-Googlers) or SQL Server's TDS. They won't kill XML (no matter how much you wish they would), nor will they replace JSON, ASN.1, or carrier pigeon. But they're simple and they're fast and they scale like crazy, and that's the way Google likes it.



Despite a complete lack of fanfare or self-promotion, much of the Python-loving world seems to have found my Universal Encoding Detector, a pure-Python port of Mozilla's character encoding detection code. UED is used in a variety of end-user applications and other developer libraries, including:

And probably some others I don't know about.

This is what it feels like to be an upstream author. And I use the term "author" loosely, since all I did was port somebody else's wicked-smart algorithm, introduce new bugs, and write a few incoherent pages of documentation. But still, it is humbling to step back and observe the enormous worldwide community that is constantly packaging, updating, integrating, and distributing this stuff.

Anyway, version 1.0.1 is out, with a whopping two bugs fixed. Sorry it's so late, but I was busy practicing witchcraft and becoming a lesbian.

Yeah, I didn't see that coming either.


[Dive Into Python]

Please buy 4000 copies so I can pay back my advance. Thank you.


Universal Feed Parser 3.3 is out. You can download it at SourceForge. The package no longer includes the 2,700+ unit tests; they are now available separately.

The major new feature in this release is improved performance, thanks to a patch from Juri Pakaste. Under Python 2.2, this version runs twice as fast as previous versions. Under Python 2.3, it runs five times as fast. No kidding. Thanks, Juri. Juri is the project lead of Straw, a desktop aggregator for Linux, which uses the Universal Feed Parser.

Other changes in this release:

  • Refactored the date parsing routines, and added a new public function registerDateHandler().
  • Added support for parsing more kinds of dates, including Korean, Greek, Hungarian, and MSSQL-style dates. Thanks to ytrewq1 for numerous patches and help refactoring the date handling code.
  • In the "things nobody cares about but me" department, UFP now detects feeds served over HTTP with a non-XML Content-Type header (such as text/plain) and sets bozo_exception to NonXMLContentType. Such feeds can never be well-formed XML; in fact, they should not be treated as XML at all. (Note that not everyone shares this view.)
  • Documented UFP's relative link resolution.
  • Fixed problem tracking xml:base and xml:lang when one element declares it, its child doesn't override it, its first grandchild does override it, but then its second grandchild doesn't.
  • Use Content-Language HTTP header as the default language, if no xml:lang attribute, <language> element, or <dc:language> element is present.
  • Optimized EBCDIC to ASCII conversion.
  • Added zopeCompatibilityHack(), which makes the parse() routine return a regular dict instead of a subclass. I have been told that this is required for Zope compatibility (hence the name). It also makes command-line debugging easier, since the pprint module inexplicably pretty-prints real dictionaries differently from dict subclasses.
  • Support xml:lang="" for setting the current language to "unknown." This behavior is straight from the XML specification. Anyone who tells you that good specs don't matter is lying, or ignorant, or trying to sell you a bad one, or... hey look, shiny objects!
  • Recognize RSS 1.0 feeds as version="rss10" even when the RSS 1.0 namespace is not the default namespace.
  • Expose the status code on HTTP 303 redirects.
  • Don't overwrite the final status on redirects, in the case where redirecting to a URL returns a 304, or another redirect, or any non-200 status code.
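The Content-Type check behind the bozo_exception bullet above can be sketched roughly like this. This is a simplification of my own, not UFP's actual code, which handles more cases:

```python
# Rough sketch: could this HTTP Content-Type header plausibly
# carry well-formed XML? (Not UFP's actual implementation.)
XML_CONTENT_TYPES = ('application/xml', 'text/xml',
                     'application/xml-dtd',
                     'application/xml-external-parsed-entity')

def is_xml_content_type(content_type):
    # Strip parameters like "; charset=utf-8" and normalize case.
    mime_type = content_type.split(';', 1)[0].strip().lower()
    return (mime_type in XML_CONTENT_TYPES or
            mime_type.endswith('+xml'))  # e.g. application/atom+xml
```

A feed served as text/plain fails this check, so it can never be treated as well-formed XML, which is exactly when UFP sets bozo_exception to NonXMLContentType.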
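The language-fallback rules from the bullets above (xml:lang first, then <language> or <dc:language>, then the Content-Language HTTP header, with xml:lang="" explicitly resetting to "unknown") could be sketched as follows. The helper name and signature are hypothetical, not UFP's API:

```python
def resolve_language(xml_lang, feed_language, http_content_language):
    """Pick the effective language, in UFP's precedence order (sketch)."""
    # xml:lang wins when present; the empty string explicitly means
    # "language unknown" (straight from the XML spec) and overrides
    # every fallback below.
    if xml_lang is not None:
        return xml_lang or None  # '' -> None (unknown)
    # Fall back to <language> / <dc:language> from the feed itself...
    if feed_language:
        return feed_language
    # ...and finally to the Content-Language HTTP header.
    return http_content_language
```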