Tangled in the Threads
Jon Udell, October 22, 2001Trees of Data
Some Zope and MSXML tricks for working with tree-structured data
In object databases and XML documents, data wants to be hierarchical. Here are some tools to help you work productively with trees of data.One of the great themes of modern IT, it seems to me, is the emergence of tools and techniques for working with tree-structured, as well as row-and-column-structured, data. The standard tool for working with rectangular datasets -- SQL -- struggles when applied to tree-like datasets. Or at least, so we commonly suppose, as seen in this posting to the BYTE.com databases newsgroup:
Ron Peterson:
SQL has several well-know shortcomings. Most notable, in my opinion, is the inability to work with variable depth hierarchies. Or more generally, graphs.Not so, argued database guru Ken North:
Actually that's one of the common myths discussed in this article:
"Text Indexing, XML Searches, And Other Database Tricks" http://www.webtechniques.com/archives/2000/02/data/index.shtml
After the class, I spoke with Dick Beck of Information Builders about the origin of the row-and-column myth. Dick developed much of the logic to optimize SQL queries for EDA/SQL, which can query hierarchical, relational, CODASYL, and flat-file data. I asked him, "Why do you think people have a problem with the concept of using SQL to operate on XML documents and trees? It's not rocket science. Developers have been using SQL for years to do bill-of-materials processing, which is also hierarchical data." Dick's reply was insightful. He said, "Most of the early articles published about relational databases talked about rows and columns, instead of tuples and attributes. People see the row-and-column model of spreadsheets, and think that's how the data is physically laid out in a database."
In reality, the physical organization of data in a database (physical model) is separate from the logical model .....
Chapter 26 of Joe Celko's SQL FOR SMARTIES (Morgan Kaufmann) discusses tree structures and directed graphs. Joe also wrote an SQL column for DBMS and those columns discussed solutions such as the nested set model and adjacency model for tree processing.
Excellent points! And yet, although I have (and revere) Joe Celko's book, I confessed to the newsgroup that I am not clever enough to usefully apply the SQL tree processing techniques he outlines in that chapter. While such processing is clearly possible, I would argue that it is not an easy or natural way to use SQL.
Subsequent discussion turned to the impending merger -- in major SQL products from Oracle, Microsoft, and others -- of methods for handling tree-like XML documents with methods for handling conventional SQL records. The trend toward such hybridization is really exciting and, as I noted in my report from the XML DevCon, is well underway.
What marks a real technological sea-change, though, isn't just a few high-end commercial products. Rather, it's broad-based adoption of new idioms. Last week, in two different situations -- one involving extensions to a Zope application, the other transformation of XML using XSLT -- it struck me that new ways of managing tree-like data are becoming routine.
Extending Zope's <dtml-tree> tag
Last time I mentioned Zope, I focused on its innovative scheme for environmental acquisition. The application discussed there, a simple content management system, continues to evolve. This week, I finally tackled and solved a nagging problem with the HTML tree control that's used to display and interact with the ZODB (Zope object database) datastore at the core of this application. Here's a snapshot of the tree:
2001-12some articlefinal draft, Oct 14 2001, 10:37pm, Jane ManagingEditorfirst draft, Oct 12 2001, 11:14am, John Editorfigure one, Oct 11 2001, 12:14pm, Joan Artistfigure two, Oct 11 2001, 12:15pm, Joan Artistanother article2001-11this articlethat articleThe crux of the problem was that different sorts were desirable at different levels of the tree. In particular, the top-level folders should appear in reverse order by name, but the leaves should appear in reverse order by date (so that collaborators can easily track recent changes).
As typically used, the engine that displays the tree, Zope's <dmtl-tree> tag, can support one or another of these strategies, but can't mix them. Here, for example, is how you do a reverse sort by name using <dmtl-tree>:
<dtml-tree RootFolder branches_expr="objectValues(['Folder', 'File','Image'])" reverse sort=id> &dtml-if "meta_type=='Folder'"> <-- apply node logic --> &/dtml-if> &dtml-if "meta_type=='File'"> <-- apply leaf logic --> &/dtml-if> &/dtml-tree>The objectValues API call, attached to <dtml-tree>'s branches_expr, selects three types of objects: Folders, Files, Images. And the sort expression -- reverse sort=id -- puts everything into reverse order by name.
To achieve more flexible sorting, you can replace the objectValues call with your own function. Like objectValues, it will be invoked repeatedly during tree traversal, when enumeration of a folder's contents is required. Unlike objectValues, which returns unordered results that <dtml-tree> must arrange, an alternative branches_expr function can return ordered results -- and can control them in folder- or level-specific ways.
Here's how the <dmtl-tree> tag looks now:
<dtml-tree RootFolder branches_expr="sortFoldersByNameFilesById()"> <-- apply node and leaf logic --> <dtml-tree>And here's the branch function, implemented as a Zope Python Script:
objects = context.objectValues(['Folder','File','Image']) objects = list(objects) types = '' for i in range(len(objects)): types = types + objects[i].meta_type import string if ( string.find(types,'Folder') != -1 ): # contains nodes objects.sort(lambda x, y: cmp(y.id, x.id)) else: # contains only leaves objects.sort(lambda x, y: cmp(y.bobobase_modification_time(), x.bobobase_modification_time())) return tuple(objects)In other words:
Use the bound variable context as a handle to the current node.
Convert the immutable tuple returned by calling objectValues on the current node into a mutable (and therefore sortable) list.
Enumerate the types of objects in the list.
If Folder objects appear, sort by name, otherwise sort by modification time.
Convert the reordered list back to a tuple, and return it.
It's a bit lame, I'll admit, to have to assemble and search strings like "FolderFolderFolder," "FolderFolderFileFolder," and "FileFileImageFileImage," in order to decide how to arrange a node's contents. Why not just fabricate Zope properties to label things more directly, and attach those properties to objects?
You can certainly do that. But if you use application logic to create nonstandard properties, then only that application logic is aware of them. Here's why that can be a problem. As this application has evolved, it has gradually taken on more of the capabilities built into Zope's native web management UI. However, that UI remains available to privileged users, and that's one reason why I find the whole environment so productive. The application encapsulates common operations -- creating standard subtrees from templates, adding files and images to folders -- in a simplified UI for ordinary users. But rather than reimplement advanced operations -- cutting and pasting subtrees, creating nonstandard folders -- I reuse the Zope management UI by making it available to privileged users. This works nicely, and yields a lot of powerful behavior for free, but that wouldn't be possible if the application came to depend on nonstandard object properties.
The classic drawback to using an object database such as ZODB is the lack of a declarative update syntax. As a result, I find it prudent to leverage the native system as fully as possible, and to extend it judiciously. Someday, I hope, it will become possible to update ZODB without writing procedural code, perhaps using a syntax like XML-Query.
Using Perl and regular expressions in XSLT transforms
The tension between declarative and procedural ways of working with tree-like data appeared in another guise last week. As I have elsewhere confessed, I've been a late and reluctant adopter of XSLT. Although I do a lot of web content management based on XML data, I haven't until recently made much use of XSLT. Instead, I've tended to write scripts that parse and process XML content. That's because the HTML content I generate from XML sources is rarely a straight one-to-one structure transform. Rather, the HTML typically requires complex string processing, and external logic, in order to weave sets of pages into a coherent information system.
XSLT specializes in the declarative search and rearrangement of tree-like content. It has procedural capability, but you can't natively use regular expressions, or functions written in another scripting language. "So what?" say XSLT purists. "You can do anything in XSLT." And that's true. It's equally true that you can write a flight simulator in Cobol. But should you? Some of the XSLT programming examples I've seen strike me as almost that silly. When notions of purity lead to abuse of common sense, I hunt for pragmatism.
The pragmatic solution is not, in my view, to throw everything and the kitchen sink into XSLT. Rather, why not enable XSLT to gracefully make use of resources -- such as Perl, JavaScript, regular expressions -- that complement XSLT's strengths?
Microsoft's XML parser, MSXML, has had this capability for some time. Because of the stigma of nonstandardization, I've avoided it until now. But when confronted with a simple problem last week that just begged for an XSLT solution, I took a closer look at the situation.
Here's the setup. An XML datastore contains, among other things, the ISBNs and titles of books, like so:
...preceding cruft... <ISBN>1-2341-2345-555 </ISBN> ...intervening cruft... <Title>Title of Book </Title> ...trailing cruft...The desired HTML output goes like this:
<a href="/url_pattern/book=123412345555">Title of Book</a>XSLT is hands down the best way to winnow these scraps of information from the unneeded cruft surrounding them. It's not, however, my preferred way to:
Remove the hyphens from the ISBN.
Remember the de-hyphenated ISBN for later use in conjunction with the Title element.
Since this transformation doesn't need to run in heterogenous client-side environments -- it's perfectly happy to be a standalone server-side batch process -- I decided to see how MSXML's nonstandard XSLT-cum-script capability would work. Here's a solution that worked in MSXML2, an older but still common version of MSXML:
<?xml version='1.0'?> <xsl:stylesheet xmlns:xsl="http://www.w3.org/TR/WD-xsl"> <xsl:script><![CDATA[ var strippedIsbn; function stripIsbn(e) { isbn = e.text; strippedIsbn = isbn.replace (/\-/i, ''); return; } function getStrippedIsbn() { return strippedIsbn; } ]]></xsl:script> <xsl:template match="ISBN"> <xsl:eval>stripIsbn(this)</xsl:eval> <div><b>ISBN:</b> <xsl:value-of /></div> </xsl:template> <xsl:template match="Title"> <div><b>Title: </b> <a><xsl:attribute name="HREF">/url_pattern/book=<xsl:eval>getStrippedIsbn()"/> </xsl:attribute><xsl:value-of /></a> </div> </xsl:template>In other words:
When an ISBN node appears in the stream, call a JavaScript function that uses a regexp to de-hyphenate the node's content, and save the result in a variable.
When Title appears in the stream, call another JavaScript to retrieve the de-hyphenated ISBN.
This, I think, is a great way to marry the strengths of XSLT, regular expressions, and script. What about that nagging standards issue, though? It would be nice to relieve guilt on that point. So I upgraded my parser -- not only to the current MSXML3, but on to the still-experimental MSXML4. Naturally, everything broke. When I finally got things put back together again, here's what the same stylesheet fragment looked like:
<?xml version='1.0'?> <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:msxsl="urn:schemas-microsoft-com:xslt" xmlns:user="jon"> <msxsl:script language="PerlScript" implements-prefix="user"><![CDATA[ my $strippedIsbn = ''; sub stripIsbn { my ($nodelist) = (@_); $strippedIsbn = $nodelist->nextNode->{text}; $strippedIsbn =~ s/\-//g; return ''; } sub getStrippedIsbn { preturn $strippedIsbn; } ]]></msxsl:script> <xsl:template match="ISBN"> <xsl:value-of select="user:stripIsbn(.)"/> <div><b>ISBN</b><xsl:text>: </xsl:text> <xsl:value-of select="normalize-space(.)"/></div> </xsl:template> <xsl:template match="Title"> <div><b>Title: </b> <a><xsl:attribute name="HREF">/usl_pattern/book=<xsl:value-of select="user:getStrippedIsbn()"/> </xsl:attribute><xsl:value-of select="normalize-space(.)"/> </a> </div> </xsl:template>Here are some of the salient changes:
The XSL namespace is the official one.
The script tag, not included in that official namespace, lives instead in a provisional Microsoft namespace.
Functions defined in script must live in a user-defined namespace.
Calls to script functions must refer through the appropriate user-defined namespace.
No more xsl:eval. Invocation of scripts is by way of xsl:value-of. (Also, in general, an empty xsl:value-of tag is no longer allowed; explicit use of select="..." is required.)
Script functions receive the current node as a nodeset.
The idioms are more complex, but also more rigorous, and closely resemble the mechanisms defined in the XSLT 1.1 working draft.
Pure? I guess not, until and unless XSLT 1.1 is approved. Pragmatic? You bet. And here's the icing on the cake. The second (MSXML4) example uses Perl for its script functions. In this example, that's not a huge win. JavaScript/ECMAScript/JScript, to which the environment defaults in the absence of a language="..." attribute, would be just as handy. But for more complex management of tree-like data, I'm really looking forward to the combined power of XSLT, Perl, and regular expressions.
Jon Udell (http://udell.roninhouse.com/) was BYTE Magazine's executive editor for new media, the architect of the original www.byte.com, and author of BYTE's Web Project column. He is the author of Practical Internet Groupware, from O'Reilly and Associates. Jon now works as an independent Web/Internet consultant. His recent BYTE.com columns are archived at http://www.byte.com/tangled/
This work is licensed under a Creative Commons License.