As part of creating a service platform, i.e. a platform focused on the creation and consumption of services by other services, we realized that we needed a consistent way to model messages within our system. This isn't about creating some universal message model for everyone, everywhere. This is strictly about creating a message model for our use. But I think the issues we are dealing with are fairly universal so I thought it would be interesting to share our current thinking. Please keep in mind that this is all very preliminary and subject to change without notice.
[Note: Updated to add a section on extensibility]
The Problem
As part of our work in Live land we are trying to figure out how to represent information in messages, i.e. we need an infoset. In our case the work we are doing is focused exclusively on machine to machine communication. This isn't about, say, markup languages, which are primarily focused on adding machine processable semantics to primarily human focused data (e.g. strings decorated with elements). But even in machine to machine communication we share many of the same issues around extensibility and multi-platform support that markup languages have to deal with.
To make matters more complex it's pretty obvious to us that we need to have support for multiple data serializations. At a minimum we think we have to support XML and JSON but following the old computer science rule that there are only three numbers "0, 1 and infinity" this means we will eventually have to support more. So we want an infoset that will allow us to serialize across these various formats. Inevitably this means either taking a lowest common denominator approach or 'tunneling' our infoset through other infosets. Tunneling is particularly scary because tunneling is what leads to things like the massive stack of WS-* specs. But, to a certain extent, that's just tough for us. We have to support multiple serializations and for sanity's sake we really can only have one infoset, so we'll try to create one that strives for lowest common denominator, but there is no doubt in my mind that we will end up tunneling. When the tunneling gets too painful we will have to revisit our infoset and/or the serializations that we support. There are no magic solutions.
I don't expect to see any code written directly to the infoset specified here. Rather, in another article, I will publish our thinking around a schema based on this infoset. So the infoset is not something our users would run into on a regular basis. But what kinds of structures we can and cannot put into our messages will be controlled by the infoset we choose.
The Proposed Solution
The Live infoset consists of two information items, the element information item and the string information item. Each information item has the properties specified below. When we refer to Unicode it is to the abstract Unicode character planes rather than to any specific encoding (e.g. UTF-8, UTF-16, UTF-32).
Element Information Item
Name – A globally unique name which MUST consist of a reverse DNS path. Note, however, that RFC 3490 encoding is not necessary; Unicode characters may be used directly.
Parent – A pointer to the parent element of the current element or null if the current element is the root of an infoset
Children – A list of element information item pointers
Ordered – A Boolean value. If true the list of children is ordered, if false then it is not.
The parent/children relationship MUST form an acyclic, single-rooted tree. That is, each element can be listed in at most one children list and the pointed-at element's parent MUST be the same element whose children list the element appears in.
String Information Item
Value – Contains an ordered series of zero or more Unicode characters.
Parent – A pointer to the element that contains the string.
A string MUST appear in exactly one children list of an element.
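To make the two information items and their tree invariants concrete, here is a minimal sketch as Python data structures. The class and field names are my own illustration, not part of the spec; the append method enforces the single-parent rule described above.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union

@dataclass
class StringItem:
    """String information item: an ordered series of Unicode characters."""
    value: str
    parent: Optional["ElementItem"] = None

@dataclass
class ElementItem:
    """Element information item with a reverse-DNS globally unique name."""
    name: str                     # e.g. "com.example.schemas.firstName"
    parent: Optional["ElementItem"] = None
    ordered: bool = False         # children are unordered unless stated
    children: List[Union["ElementItem", StringItem]] = field(default_factory=list)

    def append(self, child: Union["ElementItem", StringItem]) -> None:
        # Enforce the single-parent rule that keeps the tree acyclic
        # and single rooted: an item may appear in at most one children list.
        if child.parent is not None:
            raise ValueError("item already has a parent")
        child.parent = self
        self.children.append(child)

# Build a tiny tree: a person element containing a first-name element
# that wraps a string.
root = ElementItem("com.example.schemas.person")
name = ElementItem("com.example.schemas.firstName")
root.append(name)
name.append(StringItem("Ada"))
```

Attempting to append an item that already has a parent raises, which is what keeps the parent/children relationship a proper tree.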
Extensibility
Discussion About The Proposed Solution
Below I walk through the decision making that went into creating our infoset. Yes, I know, it has two components, how complex can its creation be? But in reality we started with the XML infoset and then cut things out. The discussion below explains what we left out and why.
Diffing Off the XML Infoset
The obvious place to start is the XML infoset because, well, it exists and lots of people have had time to review it. In going through the infoset we can throw out the processing instruction information item, the unexpanded entity reference information item, the document type declaration information item, the unparsed entity information item and the notation information items without a second thought. These are all XMLisms that are of no relevance to anything we are trying to do.
I believe we can also throw out the comment information item because comments are not something we are going to make available in our infoset. They might be in the serializations but they aren't part of our processing model. That is, we will not specify any semantics that directly or indirectly rely on the use of comments.
I would also throw out the namespace information items. The split between 'local names' and 'namespace names' in XML is an artifact of XML's history and not something that we would want in our infoset (see here for some background). Elements have globally unique names and that should just be that.
The character information items make some sense to me, especially because markup languages are all about interspersing text and elements. But our infoset will be simpler. For reasons explored later on we prefer to have a string information item rather than character information items.
A subject of more than a little debate around here has been the attribute information item. I was discussing this with Mark Nottingham from Yahoo and he made the point that attributes are great so long as no one but him is allowed to use them. The point being that there are certainly cases where having an attribute available makes the data model simpler (think IDs, for example) but that in practice people abuse attributes. The fundamental problem with attributes is that they are not, strictly speaking, necessary (e.g. you can always encode in elements what you can express in attributes) and they are not extensible since they are just strings. After much discussion our opening position is that our infoset won't support attributes. On balance we think attributes cause more harm than good. We'll see in practice how well this holds up.
The document information item contains lots of interesting data that isn't relevant to us. It has things like processing instruction information items, comments, DTDs, etc. It also has version information, but this is version information for the serialization, not for the XML infoset. The two are not necessarily the same. I will talk about infoset versioning issues in a later section but in practical terms we do not need the document information item.
If the logic holds up this means that our internal infoset needs exactly two information items, a string information item and an element information item.
Element Information Item – Naming, Order, Banning Markup and Such
The element information item has a number of properties in the XML Infoset. Since our infoset gives each element a globally unique name we can throw away all the detritus of the hacked in XML namespace model. So the namespace name, local name, prefix, namespace attributes and in-scope namespaces properties can all be quickly discarded and replaced with a single name property that provides the globally unique name for the element.
For simplicity's sake we will name our elements using DNS as a mechanism to both provide uniqueness and still keep the result human readable. Specifically we will use the reverse DNS convention. For example, a first name element could be named com.microsoft.live.schemas.firstName. For cases where we want to serialize to XML we can trivially break this up into a namespace (e.g. data:com.microsoft.live.schemas) and a local name (e.g. firstName).
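The split described above is mechanical: break the reverse-DNS name at its last dot. A sketch, using the article's example name and its illustrative data: prefix:

```python
def split_name(infoset_name: str) -> tuple:
    """Split a reverse-DNS element name into an XML (namespace, local name)
    pair by breaking at the last dot. The 'data:' prefix follows the
    article's example; it is not a registered URI scheme."""
    namespace, _, local = infoset_name.rpartition(".")
    return (f"data:{namespace}", local)

ns, local = split_name("com.microsoft.live.schemas.firstName")
# ns == "data:com.microsoft.live.schemas", local == "firstName"
```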
As we aren't supporting attributes we can get rid of the attributes property. The base URI property won't be supported because none of our early scenarios require it but I am confident that we will end up adding it back in as one of our first extensions. We will need the parent property although we can restrict it so that its only legal value is an element.
This leaves the children property. In the XML infoset the contents of the children property are ordered. But in our infoset most data will be unordered. Order is a very big deal for markup languages because the human understandable semantics are embedded in the ordering (e.g. "putting off" does not mean the same thing as "off putting"). But in a machine focused language order is typically not all that relevant. If one thinks of, say, a buddy list, the actual order of the buddies is usually a secondary consideration that deals more with human needs than machine ones. For example, if we are returning a buddy list the IM client receiving it could display the contents in any number of orders, so the order that the list is sent in over the wire doesn't really matter.
There is however one major counter example, search order. When an expensive search operation is requested it is common to also specify the order in which the results are to be returned. Such an ordering can always be explicitly encoded into the results (e.g. each result member could have an explicit order number associated with it) but this doesn't seem to happen in practice.
I suspect the subject of ordering needs more investigation but for now our intention is to add an explicit ordering property with a Boolean value that, if true, means the children property's value is ordered; otherwise it is unordered.
The children property itself is restricted so that its value must be either pointers to a series of elements or a pointer to a single string. The justification for this restriction is that we are only interested in self-describing data and strings are not self-describing (at least not to a machine). So any time a string is used we need it to be wrapped in an element in order to provide machine processable semantics. Since we are only worried about machine processing this seems a reasonable restriction.
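The restriction above, which rules out markup-style mixed content, can be checked mechanically. A hypothetical sketch, using plain dicts to stand in for element information items:

```python
def valid_children(children: list) -> bool:
    """Check the infoset restriction on a children list: it must be either
    a series of elements (dicts here, standing in for element items) or a
    single string. Mixed element/string content is disallowed."""
    if len(children) == 1 and isinstance(children[0], str):
        return True
    return all(isinstance(c, dict) for c in children)

valid_children(["Ada"])                         # one wrapped string: OK
valid_children([{"name": "a"}, {"name": "b"}])  # elements only: OK
valid_children([{"name": "a"}, "Ada"])          # mixed content: rejected
```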
String Information Item – What To Do About Whitespace?
The XML Infoset character information item has three properties: character code, element content whitespace and parent. In our case we will use a string, not a single character, and we will allow all Unicode characters without restriction. We view it as a serialization problem to escape the Unicode characters in order to fit any particular serialization format. We will address such serialization issues on a serialization by serialization basis (e.g. our solution for JSON will be different than our solution for XML because they have different reserved characters). All Unicode characters are relevant in our infoset so we don't have the concept of whitespace handling, vis-à-vis XML. But when we specify an XML serialization we will have to address the whitespace issue.
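The point about each serialization having its own reserved characters is easy to see side by side; here is a small illustration using Python's standard library escapers (the value is made up):

```python
import json
from xml.sax.saxutils import escape

value = 'a < b & "c"'

# XML must escape < and & (saxutils.escape handles both by default)...
xml_text = escape(value)
# ...while JSON escapes double quotes and backslashes but leaves
# < and & alone. Same string, two different escaping rules.
json_text = json.dumps(value)
```

The same abstract string comes out differently in each format, which is exactly why escaping belongs to the serialization layer, not the infoset.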
Graphs
There seems to be a fairly obvious pattern in the growth of data formats over time. First, there was ASCII (ANSI, EBCDIC, etc.) which was linear. Then there was HTML (SGML, XML, etc.) which is hierarchical. So the next step would seem to be something graph based. Yes, I know, RDF will save us all. But until we can truly welcome our new RDF overlords my suspicion is that we will just have to make do with hierarchical rather than graph based data formats. For all of the use cases we have floating around hierarchical is more than sufficient. Until we start to get compelling use cases for a graph format within our system we will require our infoset to be hierarchical.
Versioning the Infoset
One of the cardinal rules of system design is not to paint oneself into a corner. So it would seem natural to worry about versioning the infoset itself. But what would such versioning mean? Would we put an identifier for the infoset into each message we serialize? That would be our first example of tunneling. What would such an identifier achieve? The only time I can see such an identifier being important is if we implicitly rather than explicitly include semantics in a message that are derived from the infoset.
For example, let's say that we extend the infoset to specify that all relative URLs in the infoset are relative to the root element. But when we serialize an infoset instance into, say, JSON, we include relative URLs but we do not include any object or other identifier that explicitly states "relative URLs are relative to the root object". Someone trying to resolve the URLs in the message won't be able to unless they are using the same version of the infoset and so understand how to resolve them.
In a case such as the above it would make sense for us to include an identifier for which infoset we are using so that systems could understand when a message may have semantics they don't support.
But I actually think this approach would be a mistake. After all, Live needs to talk to lots of non-Microsoft third parties and we certainly cannot expect those third parties to be using our infoset. Therefore I would argue that any time infoset semantics leak into a message (e.g. ordering, base URLs, whatever) those semantics need to be explicitly and individually marked in the message. So, for example, in the case of relative URLs we would either need to introduce our own JSON object saying "this message interprets relative URLs relative to the root" or, much better, we would work with the JSON community to come up with a community standard to handle relative URL resolution. But in no case should we put ourselves in a position where third parties need to know which version of our infoset a particular service is using in order to figure out how to use that service.
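An explicitly marked message might look like the following sketch. The "base" member name and the URLs are hypothetical, not a community standard; the point is only that the resolution rule travels with the message instead of living in an infoset version:

```python
import json
from urllib.parse import urljoin

# A message that states its URL-resolution rule explicitly rather than
# relying on the receiver knowing which infoset version produced it.
# The "base" member name is made up for illustration.
message = json.loads("""
{
  "base": "http://example.com/services/",
  "buddy": {"href": "buddies/42"}
}
""")

resolved = urljoin(message["base"], message["buddy"]["href"])
# resolved == "http://example.com/services/buddies/42"
```

Any receiver that understands the marker can resolve the URL, regardless of what infoset, if any, the sender uses internally.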
So long as we stick to the rule that the infoset's semantics are explicitly made known in the message there is no need for us to explicitly version the infoset itself.
I like that ordering is explicit; that’s one of the biggest mistakes of XML, IMO.
My point about attributes was (I hope ;) a tiny bit more subtle than that. Attributes should be reserved for standard use only; they’re a convenience when you need to add things like typing, serialisation details, ordering, etc., but in the hands of end users, are almost always abused. They’re a syntactic honey trap.
Even then I wonder if it’s really good to allow them to intrude into the data model.
Consider the RDF data model (yeah, yeah, I know, insert cynicism about SW here). They came up with this wonderful, beautifully simple data model — just nodes and arcs, baby — and then went and screwed it up by adding warts for datatyping and i18n. Urgh.
Speaking of which, what about profiling RDF? If you constrain it to rooted subgraphs, it’s really intuitive, and maps to most languages well (try playing with sparta). Having explicitly named properties is a good thing.
Finally, how do you see this relating to feeds (RSS and Atom)? Would you try to model the whole feed as this, or treat a feed as a container of infosets? Off the top of my head, the latter seems more interesting, but I’m still thinking about it.
Cheers,
P.S. Make your edit box bigger!
Hi Yaron,
wrt. data:com.microsoft.live.schemas: I think it would be good to have a plan to come up with a URI that uses a properly registered scheme. The "tag" URI scheme (RFC 4151) comes to mind, but it requires decorating the DNS name with the date when the URI was issued (this to prevent URIs from becoming ambiguous when ownership changes). Of course that could be compensated for by always using the date of the Live Infoset spec, for instance tag:com.microsoft.live.schemas,2006. In general, however, I think mapping to http would be better; it provides a simple way to optionally provide documentation for the element.
Regarding the other simplifications compared to the XML Infoset: they all seem to make sense, although I would rethink leaving comments out. One would need to specify that they don't have semantics, that's it. Keep in mind how useful they can be in debugging issues. If they are left out, people are likely to use element information items instead (which may be ok if there's a mustIgnore extension rule).
Best regards, Julian
XML as it should have been designed 10 years ago. Remember SML (Simple Markup Language)? Back then those thoughts constituted heresy. Today they are chic :-)
One question though: if E4X was widely implemented cross-browser, would you even care? You would just stick to using the subset of XML you propose (which is basically the SOAP infoset minus attributes), and wouldn’t ever need something like JSON.
The really sad part is that I did propose an infoset just like this to the XML folks. We used to argue about s-expressions. But there was this insane belief (and I explicitly declared it as being insane at the time) that compatibility with SGML mattered. I remember XML folks saying “There are billions of SGML documents, we have to have compatibility!” to which I responded “yeah, and they are all locked into their own proprietary closets and aren’t coming out so who cares?” Sigh.. what a waste.
In any case, I like E4X a lot and if it were widely implemented then I do believe that JSON would lose a lot of its allure. Unfortunately until 80% or so of browsers support E4X we can’t justify supporting it in Windows Live Land. Hence the allure of JSON. You can validate that a JSON structure from a third party is ‘safe’ using regex (which is built into modern Javascript so it’s reasonably fast) and then eval it in order to load it (which is also reasonably fast). The perf boost is just nuts. The bandwidth reduction is also nice.
Sorry Julian, somehow I didn't see your post in my mail and only caught it later. I apologize for the delay in posting and responding to it.
I like the idea of the Tag URI scheme. I wasn’t aware of it. Thanks for pointing it out. It certainly would be a better choice than data.
I also think it's not a big deal to map the DNS names into http names. The infoset name could be "com.microsoft.schemaspace.live.schema" and the HTTP equivalent could be http://schema.live.schemaspace.microsoft.com. We could write a trivial piece of code (probably stolen from our custom domains group) to automatically create domain names in some reserved part of our DNS range (e.g. schemaspace.microsoft.com) that could provide info on the schema and work in XML.
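The trivial piece of code mentioned above amounts to reversing the dot-separated labels and prepending the scheme; a minimal sketch using the names from this comment thread:

```python
def to_http(infoset_name: str) -> str:
    """Map a reverse-DNS infoset name to an http URL by reversing the
    dot-separated labels, per the convention sketched in this thread."""
    labels = infoset_name.split(".")
    return "http://" + ".".join(reversed(labels))

to_http("com.microsoft.schemaspace.live.schema")
# → "http://schema.live.schemaspace.microsoft.com"
```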
As for comments. My feeling is that comments shouldn’t be in the infoset because they aren’t and shouldn’t be processable. But that does not stop them from being in the serialization! If we serialize to XML then XML can contain a comment and if we serialize to JSON then JSON can contain (albeit slightly illegally) a Javascript comment.
But I think the real thing to keep in mind is that unlike the XML infoset, our infoset is really meant to be theoretical. I have no intention nor desire to ever create something like the XML DOM that attempts to manifest the infoset directly. That's why I want schemas. Programmers can annotate the schema with information on how they want to read the data in and out of their language of choice.
Mark – I owe you the same apology I just sent out to Julian. Somehow both of your posts went down the same rat hole and I just found them. As with Julian, I apologize for the delay in posting and responding.
I haven’t really worried much about RDF because it doesn’t buy Windows Live enough for it to matter. XML and JSON support are widespread on our primary target platforms, RDF isn’t. This isn’t about technical supremacy, it’s about business. Right now the business case for RDF doesn’t appear to be there so I don’t worry about it.
As for Feeds (e.g. RSS/ATOM), I tend to see them as result sets. In other words a feed is, to my mind, a bunch of independent documents that have been pulled together using RSS/ATOM as a wrapper. Obviously I’ve been heavily influenced by GDATA.
It will still be necessary to model the feed in the infoset so it can be directly manipulated (which, ironically, is so far the only scenario we have where it would make sense to mark an element’s contents as being ordered – remember, we explicitly don’t support markup) but semantically I see it as a container for data rather than data itself. I know I’m slicing this one pretty damn fine.
As for the text box, I agree, I’ll add that to my infinite ‘to do’ list. :( The top of which is to move to WordPress 2.