Tuesday, February 19, 2013

Faster XML search with the id() function

Photo by Jason Pearce
Consider the following XML document in an XForms instance:
<xf:instance id="my-book">
    <book id="simple_book">
        <chapter id="chapter_1">
            <title>Chapter 1</title>
            <para>Hello world!</para>
        <chapter id="chapter_2">
Some key elements in this document have id attributes which identify the element they are placed on. This means that if you have the value of an id, you can find the associated element. For example the following XPath expression finds the second chapter element:
//*[@id = 'chapter_2']
The downside of using plain XPath here is that if you have a very large document, the expression might have to scan large portions of the document, possily the entire document. This can be costly. In theory, the XPath engine can back the XML document with indexes, but this is typically done only by XML databases such as eXist.

However XPath specifies an id() function exactly for the purpose of improving on searching element with a given id. It is in fact very similar to the JavaScript getElementById() function now implemented natively (and efficiently) by web browsers.

The id() function is not strictly required to be faster than a plain document scan, but it can obviously be backed by an index (a hash map), and this is exactly how we just implemented it![1]
The index is enabled only for a specific instance with:
<xf:instance id="my-book" xxf:index="id">
When support is enabled for an instance, all ids are indexed the first time the id() function is called. Then, for each mutation of the document (insertion, deletion, replacement, change of value), the index is updated. [2] At this time, only elements with a local name of id are indexed. This covers plain id and xml:id attributes.[3]

You can use the id() function this way:
  ref="id('chapter_2', instance('my-book'))"
  origin="xf:element('chapter', xf:attribute('id', 'chapter_3'))"/>
This XForms action inserts a new empty chapter element with id chapter_3 after the chapter element with id chapter_2.

Form Builder benefits from this change a lot, because:
  • The entire form being edited is stored in an XForms instance.
  • All controls and other important elements are assigned ids.
  • Finding elements by id is a very common operation.
This is available in nightly builds and will be part of Orbeon Forms 4.0.1.

  1. Until this change, calling id() on regular (not readonly) instances in Orbeon Forms always returned an empty result. However, calling it on readonly instances already behaved as expected, as we are relying on the Saxon TinyTree implementation for readonly instances.  ↩
  2. The implementation is just slightly non-trivial, because the spec says that when more than one element has a given id, the first in document order must be returned. So we must properly track multiple ids for that reason.  ↩
  3. In theory, id() must support any attribute or element with type xs:ID but Orbeon Forms doesn’t implement this yet. In the future, we might also support indexing elements by other aspects, such as CSS classes or element names, as HTML does.  ↩

No comments:

Post a Comment