Saturday, January 14, 2006

Data model for XML processing

A recent discussion on the public-xml-processing-wg mailing list raises the interesting question of which data model to choose for an XML processing language. Most XML processing languages have historically bet on simplicity. In the case of XPL, XML components (called XML processors in XPL terminology) exchange complete XML information sets, that is, basically, XML documents. A component's input reads a complete XML document, and a component's output produces a complete XML document. That's it.

It is interesting to contemplate this scenario: what if components could exchange not only full infosets, but any sequence of items as defined by the XQuery 1.0 and XPath 2.0 Data Model (XDM)? This clearly increases complexity, but basing the processing data model on the XDM does bring some benefits:

  • It has proven useful in the past to be able to pass text or binary data in an XML pipeline. One use case is sending an XQuery query, which is plain text rather than XML, to a component. To solve this problem, the OPS XPL implementation currently allows you to embed textual and binary information within an XML infoset.

    Using the XDM, you can easily pass text information (as xs:string) and even binary information (as xs:base64Binary) using types native to the data model. Arguably, this is a cleaner solution than embedding text and binary data within an XML document.

  • You can pass multiple documents in sequence (as document-node()+). Whether you really need this is an open question, but more generally you could pass sequences of elements and other items, opening the door to components that are more versatile than those that only process XML infosets.

  • You no longer need a special XML pipeline concept for so-called "parameters" (think XSLT stylesheet parameters): pipeline steps can now consume and produce such parameters in the same way they would read full XML infosets, possibly with a simple boolean flag allowing a component such as an XSLT processor to discriminate between regular inputs and parameter inputs.

  • You can do simple type-checking, possibly even static type-checking, between components if they declare the types they exchange (with something like XSLT 2.0's as attribute). Without mandating it, you also open the door to full XML Schema validation of the data exchanged by components, and therefore to more useful error reports at runtime.

  • You do not reinvent the wheel, and you leverage an existing specification (the XDM recommendation). This will tend to make the XML processing language specification itself simpler and shorter.
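To make the points above a little more concrete, here is a minimal sketch in Python. It is not XPL and not a real XDM implementation: the names Item, Step, check_static and run_pipeline are hypothetical, and only three item types are modeled (element(), xs:string, xs:base64Binary). It assumes each step declares a single item type for its input and output sequences, which lets the pipeline check the wiring before running and check the actual items while running.

```python
from dataclasses import dataclass
from typing import Callable, List, Union
import xml.etree.ElementTree as ET

# Hypothetical stand-in for an XDM item: a node or an atomic value.
Item = Union[ET.Element, str, bytes]


def item_type(item: Item) -> str:
    """Map a Python value to the XDM-style type name used in this sketch."""
    if isinstance(item, ET.Element):
        return "element()"
    if isinstance(item, str):
        return "xs:string"
    if isinstance(item, bytes):
        return "xs:base64Binary"
    raise TypeError(f"unsupported item: {item!r}")


@dataclass
class Step:
    name: str
    accepts: str   # declared item type of the input sequence
    produces: str  # declared item type of the output sequence
    run: Callable[[List[Item]], List[Item]]


def check_static(steps: List[Step]) -> None:
    """Check declared types between adjacent steps without running anything."""
    for prev, nxt in zip(steps, steps[1:]):
        if prev.produces != nxt.accepts:
            raise TypeError(
                f"{prev.name} produces {prev.produces}, "
                f"but {nxt.name} accepts {nxt.accepts}"
            )


def check_dynamic(seq: List[Item], declared: str, where: str) -> None:
    """Check every item of a sequence against a declared type at runtime."""
    for item in seq:
        actual = item_type(item)
        if actual != declared:
            raise TypeError(f"{where}: expected {declared}, got {actual}")


def run_pipeline(steps: List[Step], seq: List[Item]) -> List[Item]:
    check_static(steps)
    for step in steps:
        check_dynamic(seq, step.accepts, f"{step.name} input")
        seq = step.run(seq)
        check_dynamic(seq, step.produces, f"{step.name} output")
    return seq


# Two illustrative steps: one consumes elements and produces strings,
# the other consumes and produces strings.
extract_titles = Step(
    "extract-titles", "element()", "xs:string",
    lambda seq: [e.findtext("title") for e in seq],
)
shout = Step(
    "shout", "xs:string", "xs:string",
    lambda seq: [s.upper() for s in seq],
)

books = [
    ET.fromstring("<book><title>xml</title></book>"),
    ET.fromstring("<book><title>xdm</title></book>"),
]
result = run_pipeline([extract_titles, shout], books)
```

Here the input is a sequence of two elements rather than a single document, and the steps exchange plain strings without wrapping them in XML. Swapping the two steps would be rejected by check_static before any step runs, because the declared types no longer line up.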

So, is the solution worth the added complexity? The risk is turning XML pipelines into something that goes well beyond the scope of an "XML processing language". However, if done in the spirit of XML (and I think building on the XDM qualifies), it could work.
