Wednesday, March 30, 2005

Screen-Scraping with OPS

A couple of days ago, I read an article by Brian Goetz on IBM DeveloperWorks about screen-scraping with XQuery.

The idea behind HTML screen-scraping is to access an HTML page on the web and extract information from it. The DeveloperWorks article proposes two ideas:

  1. Using JTidy to convert the HTML to well-formed XML
  2. Using XQuery to extract and reformat the data
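
Step 1 is easy to try on its own. As a minimal sketch, here is what the JTidy cleanup could look like in Java, assuming the Yahoo! Weather URL used later in this post; the configuration flags are just one reasonable choice:

  import org.w3c.tidy.Tidy;
  import java.io.InputStream;
  import java.net.URL;

  public class TidyCleanup {
      public static void main(String[] args) throws Exception {
          // Fetch the raw (usually non-well-formed) HTML page
          try (InputStream in = new URL("http://weather.yahoo.com/").openStream()) {
              Tidy tidy = new Tidy();
              tidy.setXmlOut(true);        // emit well-formed XML instead of HTML
              tidy.setQuiet(true);         // suppress the summary report
              tidy.setShowWarnings(false); // keep stderr clean
              // The cleaned-up XML goes to stdout, ready for XQuery or XSLT
              tidy.parse(in, System.out);
          }
      }
  }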

I have to admit that I was not familiar with the term "screen-scraping", but in fact we've had some examples of this technique in OPS for a very long time. In particular, the URL Generator example was retrieving the latest CNN headline by accessing the HTML of the CNN web site, and the Google Spell-Checker example was doing even more complex HTML-based interaction with Google.

The URL generator processor in OPS produces XML from data fetched from a URL that you pass it. The source data may be of four different types:

  • XML: the data is simply parsed
  • HTML: the data is automatically cleaned up (with JTidy!) and converted into XML
  • Text: the text is encapsulated within an XML document
  • Binary: the data is Base64-encoded and encapsulated within an XML document

So it is trivial to retrieve an existing HTML page from OPS with code like this in XPL:

  <p:processor name="oxf:url-generator" xmlns:p="http://www.orbeon.com/oxf/pipeline">
      <p:input name="config">
          <config>
              <url>http://weather.yahoo.com/</url>
              <content-type>text/html</content-type>
          </config>
      </p:input>
      <p:output name="data" id="page"/>
  </p:processor>

Now, the one thing we did not do in the URL generator example was use XQuery: we used XSLT instead. In fact, XQuery and XSLT are very similar in what they can accomplish for such a use case, the XQuery syntax being a little lighter. OPS has had an XQuery processor as well, illustrated by the XQuery sandbox example, but it was based on a fairly old version of Qexo, which implemented an outdated draft of the XQuery spec. I used the opportunity to move to Saxon. Saxon is primarily known as an XSLT transformer, but because XQuery 1.0 is so close to XPath 2.0 and XSLT 2.0, Saxon implements XQuery as well. Since Saxon is already the default XSLT transformer in OPS, implementing an XQuery processor based on it was a breeze. Here is the Yahoo! Weather example from the article, written in XPL:

  <p:config xmlns:p="http://www.orbeon.com/oxf/pipeline">
      <p:processor name="oxf:url-generator">
          <p:input name="config">
              <config>
                  <url>http://weather.yahoo.com/</url>
                  <content-type>text/html</content-type>
              </config>
          </p:input>
          <p:output name="data" id="page" debug="page"/>
      </p:processor>
      <p:processor name="oxf:xquery">
          <p:input name="config">
              <xquery>
                  <html>
                      <body>
                          <table>
                          {
                              for $d in //td[contains(a/small/text(), "New York, NY")]
                              return
                                  for $row in $d/parent::tr/parent::table/tr
                                  where contains($d/a/small/text()[1], "New York")
                                  return
                                      <tr>
                                          <td>{data($row/td[1])}</td>
                                          <td>{data($row/td[2])}</td>
                                          <td>{$row/td[3]//img}</td>
                                      </tr>
                          }
                          </table>
                      </body>
                  </html>
              </xquery>
          </p:input>
          <p:input name="data" href="#page"/>
          <p:output name="data" id="html-page"/>
      </p:processor>
      <p:processor name="oxf:html-serializer">
          <p:input name="config">
              <config/>
          </p:input>
          <p:input name="data" href="#html-page"/>
      </p:processor>
  </p:config>
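
As an aside, here is roughly what the XQuery step looks like standalone, outside of any pipeline. This is only a sketch against Saxon's s9api interface (an API more recent than the Saxon version discussed here), where page.xml stands for a saved copy of the tidied page and the query is a slightly simplified version of the one above:

  import net.sf.saxon.s9api.*;
  import javax.xml.transform.stream.StreamSource;
  import java.io.File;

  public class WeatherQuery {
      public static void main(String[] args) throws SaxonApiException {
          Processor processor = new Processor(false); // non-schema-aware Saxon

          // The same query as in the pipeline above, slightly simplified
          String query =
              "<html><body><table>{" +
              "  for $d in //td[contains(a/small/text(), 'New York, NY')]" +
              "  for $row in $d/parent::tr/parent::table/tr" +
              "  return" +
              "    <tr>" +
              "      <td>{data($row/td[1])}</td>" +
              "      <td>{data($row/td[2])}</td>" +
              "      <td>{$row/td[3]//img}</td>" +
              "    </tr>" +
              "}</table></body></html>";

          // 'page.xml' stands for the tidied page from the URL generator
          XdmNode page = processor.newDocumentBuilder()
                  .build(new StreamSource(new File("page.xml")));

          XQueryEvaluator evaluator =
                  processor.newXQueryCompiler().compile(query).load();
          evaluator.setContextItem(page);

          // Serialize the query result as HTML to stdout
          Serializer out = processor.newSerializer(System.out);
          out.setOutputProperty(Serializer.Property.METHOD, "html");
          evaluator.run(out);
      }
  }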

As usual with XPL, the XML pipeline language of OPS, notice how easy it is to connect small components and make them work together without writing a single line of Java and without going through compilation and deployment.

Now here is how you could write the equivalent of the XQuery fragment in XSLT:

  <p:processor name="oxf:xslt" xmlns:p="http://www.orbeon.com/oxf/pipeline">
      <p:input name="config">
          <html xsl:version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
              <body>
                  <table>
                      <xsl:for-each select="//td[contains(a/small/text(), 'New York, NY')]">
                          <xsl:variable name="d" select="."/>
                          <xsl:if test="contains($d/a/small/text()[1], 'New York')">
                              <xsl:for-each select="$d/parent::tr/parent::table/tr">
                                  <xsl:variable name="row" select="."/>
                                  <tr>
                                      <td><xsl:value-of select="$row/td[1]"/></td>
                                      <td><xsl:value-of select="$row/td[2]"/></td>
                                      <td><xsl:copy-of select="$row/td[3]//img"/></td>
                                  </tr>
                              </xsl:for-each>
                          </xsl:if>
                      </xsl:for-each>
                  </table>
              </body>
          </html>
      </p:input>
      <p:input name="data" href="#page"/>
      <p:output name="data" ref="data"/>
  </p:processor>

This Yahoo! Weather example using XQuery is now in CVS and should have already shown up in the unstable builds.

[Screenshot: The Yahoo! Weather Screen-Scraping Example in OPS]

4 comments:

  1. Erik, you're too modern! Screen-scraping is actually an acceptable way to create an interface to a legacy app. I can't find the information on the IBM team working on this, but here is an example of pushing AS/400 green screens to the web via scraping: http://www.jacada.com/Products/Legacy_Extension.htm. It was a way of putting an interface on a legacy app while you re-architected it, but I'm not sure that once companies created a thin-client layer, they had much incentive to rewrite the application in, say, WebSphere.

  2. Thanks for the clarification! The approach is similar with HTML screen-scraping: you could build "modern" web services on top of legacy HTML-based applications. If you really need to interact with the application, you would also need to be able to post forms.

  3. There is also a tool which uses the same principle and must release a new version every so often, when eBay changes the layout of its pages. If you can, you're better off creating a SOAP or REST interface to your legacy apps than depending on HTML views of your data. Trust me, I've been there (and back ;).

  4. Another article on screen-scraping. It's not too clear from the text, but it sounds like the strategy is to create WSDL descriptions for things like CICS mainframe applications. If there really is a way to streamline that, it would be great. The main point is that scraping is an option where the data and code are difficult to separate.

    http://idevnews.com/PrintVersion.asp?ID=161
