BlogML 2.0 Impressions

subtext, blog, xml

I’ve been looking for a while to migrate off this infernal pMachine blog engine I’m on. The major problem is how to migrate my data to the new platform. Enter BlogML.

BlogML is an XML format for the contents of a blog. You can read about it and download it on the CodePlex BlogML site. They’re currently at version 2.0, which implies there was a 1.0 somewhere along the line that I missed.

Anyway, the general idea is that you can export blog contents in BlogML from one blog engine and import into another blog engine, effectively migrating your content. Thus began my journey down the BlogML road.

If you download BlogML from the site it comes with an XSD schema for BlogML, a sample BlogML export file, a .NET API, and a schema validator.

I didn’t use the .NET API because pMachine is in PHP and all of the routines for extracting data are already in PHP, so I wrote my pMachine BlogML exporter in - wait for it - PHP. As such, I can’t really lend any commentary to the quality of the API’s functionality. That said, a quick perusal of the source shows that there are almost no comments and the rest looks a lot like generated XmlSerializable style code.

The schema validator is a pretty basic .NET application that can validate any XML against any schema - you select the schema and the XML files manually and it just runs validation. This actually makes it troublesome to use; you’d think the schema would be embedded by default. If you have some other schema validation tool, feel free to ignore the one that comes with BlogML.

The real meat of BlogML is the schema. That’s where the value of BlogML is - in defining the standard format for the blog contents.

The overall format of things seems to have been thought out pretty well. The schema accounts for posts, comments and trackbacks on each post, categories, attachments, and authors. I was pretty easily able to map the blog contents of pMachine into the requisite structure for BlogML.

There are three downsides with the schema:

First, the schema could really stand to be cleaned up. This may not be obvious if you’re editing the thing in a straight text editor, but when you throw it into something like XMLSpy, you can see the issues. Things could be made simpler by better use of common base types that get extended. There are odd things like an empty, hanging element sequence in one of the types. Generally speaking, a good tidy-up might make it a lot easier to use, because…

Second, the documentation is super duper light. I think there are like 10 lines of documentation in the schema, tops, and there’s nothing outside the schema that explains it, either. Without going back and forth between the schema and the sample document, I’d have no idea what exactly was supposed to be where, what the format of things needed to be, etc.

Third, and admittedly this may be more pMachine-specific, there’s no notion of distinguishing between a “trackback” and a “pingback.” There’s only a “trackback” entity in the schema, so if your blog supports the notion of a “pingback,” you will lose the differentiation when you export.
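To make the mapping (and that third point) concrete, here is a minimal Python sketch of an exporter's core loop. It is not the pMachine exporter described in this post, which was written in PHP. The element and attribute names (blog, posts, post, title, content, trackbacks, trackback, post-url) and the namespace URI are assumptions drawn from a reading of the BlogML 2.0 sample, so check them against the XSD before relying on them. The shape of the post dictionaries is purely hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed BlogML 2.0 namespace URI; verify against your copy of the XSD.
BLOGML_NS = "http://www.blogml.com/2006/09/BlogML"


def q(tag):
    """Qualify a tag name with the BlogML namespace."""
    return f"{{{BLOGML_NS}}}{tag}"


def build_blogml(blog_title, posts):
    """Build a BlogML-style element tree from a list of post dicts (hypothetical shape)."""
    ET.register_namespace("", BLOGML_NS)
    blog = ET.Element(q("blog"))
    ET.SubElement(blog, q("title"), {"type": "text"}).text = blog_title
    posts_el = ET.SubElement(blog, q("posts"))
    for post in posts:
        post_el = ET.SubElement(
            posts_el, q("post"), {"id": str(post["id"]), "post-url": post["url"]}
        )
        ET.SubElement(post_el, q("title"), {"type": "text"}).text = post["title"]
        ET.SubElement(post_el, q("content"), {"type": "text"}).text = post["body"]
        trackbacks_el = ET.SubElement(post_el, q("trackbacks"))
        # The schema only has a trackback entity, so pingbacks are folded in
        # here and the distinction is lost in the export.
        for ping in post.get("trackbacks", []) + post.get("pingbacks", []):
            ET.SubElement(
                trackbacks_el, q("trackback"), {"id": str(ping["id"]), "url": ping["url"]}
            )
    return blog
```

Calling ET.tostring(build_blogml(...), encoding="unicode") gives the XML text. Anything written this way should still be run through a schema validator, since this sketch skips the authors, categories, and required attributes a real export would need.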

Anyway, I planned on importing my blog into Subtext, so I set up a test site on my development machine, ran the export on my pMachine blog (through a utility I wrote; I’m going to do some fine-tuning and release it for all you stranded pMachine users) and did the import. This is where I started noticing the real shortcomings in BlogML proper. These fall into two categories:

Shortcoming 1: Links. If you’ve had a blog for any length of time, you’ve got posts that link to other posts. That works great if your link format doesn’t change. If I’m moving from pMachine to Subtext, though, I don’t want to have to keep my old PHP blog around (hence “moving”), and, if possible, I’d like to have any intra-site links get updated. There doesn’t seem to be any notion in BlogML of pre-defining a “new link mapping” (like being able to say “for this post here, its new link will be here”) so that import engines can convert content on the fly. There’s also no notion of a response from an import engine that says “Here’s the old post ID, here’s the new one” so you can write your own redirection setup (which you will have to do, regardless of whether you update the links inside the posts).

I think there needs to be a little more with respect to link and post ID handling. BlogML might be great for defining the contents of a blog from an exported standpoint, but it doesn’t really help from an imported standpoint. Maybe offering a second schema for old-ID-to-new-ID mapping (or even old-ID-to-new-post-URL) that blog import engines could return when they finish importing… something to address the mapping issue. As it stands, I’m going to be doing some manual calculation and post-import work.
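As a rough illustration of that post-import work, here is a Python sketch under some loud assumptions: the old permalink format (comments.php?id=N on example.com) is made up for the example, and the old-ID-to-new-URL mapping is something you would have to assemble yourself, because BlogML and the import engine don't hand it to you. The function names are hypothetical.

```python
import re

# Hypothetical old permalink format; adjust to whatever your old engine used.
OLD_LINK = re.compile(r"http://example\.com/blog/comments\.php\?id=(?P<id>\d+)")


def rewrite_links(html, id_to_url, pattern=OLD_LINK):
    """Swap old intra-site links for new URLs, leaving unmapped links alone."""
    def swap(match):
        return id_to_url.get(match.group("id"), match.group(0))
    return pattern.sub(swap, html)


def redirect_pairs(id_to_url, old_path_template="/blog/comments.php?id={id}"):
    """Produce (old path, new URL) pairs to feed into a redirection setup."""
    return [
        (old_path_template.format(id=old_id), new_url)
        for old_id, new_url in sorted(id_to_url.items())
    ]
```

The first function updates links inside post bodies. The second produces the list you would need anyway to redirect inbound links from other sites.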

Shortcoming 2: Non-Text Content. If you’ve got images or downloads or other non-text content on your blog posts, it’s most likely stored in some proprietary folder hierarchy for the blog engine you’re on… and if you’re moving, you won’t be having that hierarchy anymore, will you? That means you’ve got to not only move the text content, but the rest of the content into the new blog engine.

There is a notion of attachments in BlogML, but it’s not clear that solves the issue. You can apparently even embed “attachments” for each entry as a base64 encoded entity right in the BlogML. It’s unclear, however, how this attachment relates back to the entry and, further, unclear how the BlogML import will handle it. This could probably be remedied with some documentation, but like I said, there really isn’t any.
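For what it's worth, producing an embedded attachment is the easy half. The Python sketch below assumes the attachment nests under its post inside an attachments element and carries embedded, url, and mime-type attributes. Those names are a guess from the sample file, not something the documentation confirms, and namespaces are left out for brevity. How an importer ties the attachment back to the post body remains the open question.

```python
import base64
import xml.etree.ElementTree as ET


def embed_attachment(post_el, url, data, mime_type):
    """Attach binary data to a post element as base64 text (assumed layout)."""
    attachments = post_el.find("attachments")
    if attachments is None:  # explicit None check: an empty Element is falsy
        attachments = ET.SubElement(post_el, "attachments")
    attachment = ET.SubElement(
        attachments,
        "attachment",
        # Assumed attribute names; url is the address the post body refers to.
        {"embedded": "true", "url": url, "mime-type": mime_type},
    )
    attachment.text = base64.b64encode(data).decode("ascii")
    return attachment
```

Keeping the original URL on the attachment at least gives an importer something to match against the links in the post content, if it chooses to.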

This sort of leaves you with one of two options: You can leave the non-text content where it is and leave the proprietary folder structure in place… or you can move the non-text content and process all of the links in all of your posts to point to the new location. One way is less work but also less clean; the other is cleaner but a lot of work. Lose-lose.
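The “lot of work” option can at least be scripted. Here is a Python sketch that inventories every src or href under the old folder and rewrites it to a new prefix in a single pass. The prefixes are hypothetical, and it assumes the attribute values appear in the raw HTML exactly as the parser reports them (so no entity-escaped URLs).

```python
import re
from html.parser import HTMLParser


class AssetCollector(HTMLParser):
    """Collect src/href values that start with the old asset prefix."""

    def __init__(self, old_prefix):
        super().__init__()
        self.old_prefix = old_prefix
        self.assets = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("src", "href") and value and value.startswith(self.old_prefix):
                self.assets.append(value)


def move_assets(html, old_prefix, new_prefix):
    """Return (rewritten html, sorted list of old asset URLs that were found)."""
    collector = AssetCollector(old_prefix)
    collector.feed(html)
    found = set(collector.assets)
    if not found:
        return html, []
    # Longest first so a URL that is a prefix of another can't win the match;
    # a single substitution pass avoids rewriting the same URL twice.
    ordered = sorted(found, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(url) for url in ordered))
    rewritten = pattern.sub(
        lambda m: new_prefix + m.group(0)[len(old_prefix):], html
    )
    return rewritten, sorted(found)
```

The returned list doubles as the set of files you have to physically copy to the new location.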

Anyway, the end result of my working with BlogML: I like the idea and I’ll be using it as a part of a fairly complex multi-step process to migrate off pMachine. That said, I think it’s got a long way to go for widespread use.