The idea behind HTML screen-scraping is to fetch an HTML page from the web and extract information from it. The DeveloperWorks article proposes two techniques:
- Using JTidy to convert the HTML to well-formed XML
- Using XQuery to extract and reformat the data
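To make the two steps concrete, here is a minimal sketch in Python. It is not the article's code: the standard library's `html.parser` stands in for JTidy and ElementTree path expressions stand in for XQuery. The names `TidyingParser`, `html_to_tree` and `headlines`, the sample markup, and the `headline` class are all made up for illustration, and the clean-up rules are far cruder than what JTidy does.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # never take a close tag
IMPLIED_END = {"li", "p", "tr", "td", "th", "option"}     # closed by a same-name sibling


class TidyingParser(HTMLParser):
    """Builds a well-formed element tree from loosely written HTML (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        # A new <li> directly inside an open <li> implicitly closes the previous one.
        if tag in IMPLIED_END and self.open_tags and self.open_tags[-1] == tag:
            self.builder.end(self.open_tags.pop())
        self.builder.start(tag, {name: value or "" for name, value in attrs})
        if tag in VOID_TAGS:
            self.builder.end(tag)
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray close tag: ignore it
        while True:
            # Close everything left open inside the element being closed.
            closed = self.open_tags.pop()
            self.builder.end(closed)
            if closed == tag:
                break

    def handle_data(self, data):
        if self.open_tags:
            self.builder.data(data)

    def finish(self):
        self.close()
        while self.open_tags:  # close whatever the page never closed
            self.builder.end(self.open_tags.pop())
        return self.builder.close()


def html_to_tree(html):
    """Step 1: turn tag soup into a well-formed tree."""
    parser = TidyingParser()
    parser.feed(html)
    return parser.finish()


def headlines(root):
    """Step 2: query the tree, here with an ElementTree path expression."""
    return ["".join(li.itertext()).strip()
            for li in root.findall(".//li[@class='headline']")]


page = ("<html><body><ul><li class=headline>First story"
        "<li class=headline>Second story</ul><p>Unclosed paragraph</body></html>")
print(headlines(html_to_tree(page)))
```

The point of the split is that once the page is well-formed, the extraction step no longer needs to know anything about broken markup.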
I have to admit that I was not familiar with the term "screen-scraping", but in fact we've had examples of this technique in OPS for a very long time. In particular, the URL Generator example retrieves the latest CNN headline by fetching the HTML of the CNN web site, and the Google Spell-Checker example performs even more complex HTML-based interaction with Google.
The URL generator processor in OPS is able to produce XML from data fetched from a URL that you pass it. That source data may be of four different types:
- XML. In this case, it is just parsed
- HTML. In this case, it is automatically cleaned up (with JTidy!) and converted into XML
- Text. In this case, the text is encapsulated within XML
- Binary. In this case, the binary is Base64-encoded and encapsulated within XML
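The four cases above amount to a dispatch on the content type. The following Python sketch illustrates that logic only; it is not OPS code. The function names `to_xml` and `wrap`, the `document` wrapper element, its `content-type` attribute, and the UTF-8 decoding are all assumptions made for the example, and the HTML branch simply delegates to a caller-supplied `clean_html` function that plays the role of JTidy.

```python
import base64
import xml.etree.ElementTree as ET


def wrap(content_type, text):
    """Encapsulate text in a wrapper element (name and attribute are hypothetical)."""
    doc = ET.Element("document", {"content-type": content_type})
    doc.text = text
    return doc


def to_xml(body, content_type, clean_html):
    """Turn fetched bytes into an XML element, depending on the content type."""
    kind = content_type.split(";")[0].strip().lower()
    if kind in ("text/xml", "application/xml") or kind.endswith("+xml"):
        return ET.fromstring(body)                 # XML: just parse it
    if kind == "text/html":
        return clean_html(body.decode("utf-8"))    # HTML: clean up, then use as XML
    if kind.startswith("text/"):
        return wrap(kind, body.decode("utf-8"))    # text: encapsulate as-is
    # Anything else is treated as binary: Base64-encode, then encapsulate.
    return wrap(kind, base64.b64encode(body).decode("ascii"))


# Placeholder cleaner that only accepts markup that is already well-formed.
strict = ET.fromstring
print(ET.tostring(to_xml(b"hello", "text/plain", strict), encoding="unicode"))
print(ET.tostring(to_xml(b"\x00\x01", "application/octet-stream", strict), encoding="unicode"))
```

Whatever the source type, the caller always gets XML back, which is what lets the rest of a pipeline treat every URL the same way.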
So fetching an existing HTML page from within OPS is trivial, with XPL code like this:
Now, the one thing we did not do in the URL generator example was use XQuery: we used XSLT instead. In fact, XQuery and XSLT are very similar in what they can accomplish in such a use case, the XQuery syntax being a little lighter. OPS already had an XQuery processor as well, illustrated by the XQuery sandbox example. We were using a fairly old version of Qexo, which implemented a version of the XQuery spec that was not quite up to date. I took the opportunity to move to Saxon. Saxon is primarily known as an XSLT transformer, but because XQuery 1.0 is so close to XPath 2.0 and XSLT 2.0, Saxon also implements XQuery. Since Saxon is already the default XSLT transformer in OPS, implementing an XQuery processor based on it was a breeze.
Here is the Yahoo! Weather example from the article, written in XPL:
As usual with XPL, the XML pipeline language of OPS, notice how easy it is to connect small components and make them work together without writing a single line of Java and without going through compilation and deployment.
Now here is how you could write the XQuery fragment with XSLT:
This Yahoo! Weather example using XQuery is now in CVS and should have already shown up in the unstable builds.
The Yahoo! Weather Screen-Scraping Example in OPS