20081123

How to query an XML file as if it was a relational database

For a couple of times I've spent my blogging time explaining how you can use MetaModel to query CSV files and Excel spreadsheets just as if they where regular databases which enable filtering, grouping, sorting etc. in their query languages (usually some form of SQL with more or less dialect).

Today let's take a look at MetaModels support for XML content. XML is, to say the least, very different from relational databases. Whereas relational databases consist of tables as the basic structural element, the XML structure is hierarchical and not table-based. You could say that the structure of XML is potentially a lot more dynamic because metaphorically XML supports nesting tables inside each others to emphasize the structure and relations. So dealing with XML in a framework like MetaModel was not a no-brainer in terms of design considerations and decisions. The main theme of MetaModel have always been to provide a consistent model that was the same even though you had to work with different kinds o datastores. So how did we do this when XML and other table-based datastores are so inherently different? I'll show you using an example...

Consider this compressed XML example, ie. an RSS newsfeed (coincidentally taken from the eobjects.org newsfeed):

<?xml version="1.0"?><rss version="2.0">
  <channel>
  <title>eobjects - Blog</title>
  <link>http://eobjects.org/trac/blog
  <description>Welcome to the new eobjects.dk website</description>
  <item>
    <title>MetaModel 1.1 released!</title>
    <pubdate>Wed, 05 Nov 2008 14:01:02 GMT</pubdate>
    <link>http://eobjects.org/trac/blog/metamodel-1.1-released
    <description>text goes here...</description>
    <category>release</category><category>license</category><category>metamodel</category>
  </item>
  <item>
    <title>DataCleaner 1.5 "snapshot" released</title>
    <pubdate>Mon, 13 Oct 2008 07:00:13 GMT</pubdate>
    <link>http://eobjects.org/trac/blog/datacleaner-1.5-snapshot-released
    <description>text goes here...</description>
    <category>release</category><category>datacleaner</category><category>license</category>
  </item>
  </channel>
</rss>

I want you to notice a few things about the RSS structure so that we get a full understanding on how MetaModel will interpret the XML file and map it to a table based structure. The resulting MetaModel structure is illustrated to the right (you will perhaps also notice that I have edited out a lot of the details in the XML code above, for brevity):

  • Notice that the <channel> element does not contain any data but only acts as what I would call a "wrapper tag". There are some different inner elements inside <channel> which causes MetaModel to name tables according to this shared root node. Hence the table names "channel_item", "channel_title" etc.

  • A lot of values are contained in inner tags within the <item> elements, like the "title", "pubDate" and "link" elements. Since these appear only once for each <item> they are considered columns in a general <item> table: "channel_item".

  • Notice the multiplicity of the <category> tags inside the <item>'s. This causes a seperate table to be created to handle multiplicity: "channel_item_category".

As you can see we do a rather intuitive "XML to table-based model" mapping in MetaModel. This is done automatically through the process that we call auto-flattening of tables. If you prefer, you can also flatten the tables manually if you wish to compose the table model yourself (but I'll have to dedicate a seperate blog entry to that topic sometime).
The XML-elements to tables mapping enables you to do all the cool querying stuff with your XML files, that I have shown you so many times before. Let's do a simple example:
File file = new File("newsfeed.rss");
DataContext dc = DataContextFactory.createXmlDataContext(file, true, true);
Schema schema = dc.getDefaultSchema();

Table table = schema.getTableByName("channel_item_category");
Column col = table.getColumnByName("category");

// We'll make a query to show us the most popular newsitem categories:
// SELECT category, COUNT(*) as num_items FROM channel_item_category GROUP BY category ORDER BY num_items DESC

SelectItem countItem = SelectItem.getCountAllItem();
countItem.setAlias("num_items");

Query q = new Query().select(column).select(countItem).from(table).groupBy(column).orderBy(countItem, Direction.DESC);

DataSet dataSet = dc.executeQuery(q)
// Do something with the dataset!

If you're just interested in finding out how your XML format looks like after being autoflattened by MetaModel, I recommend downloading DataCleaner (from which the screenshot to the right was taken from). Over and out, enjoy MetaModel...

1 comment:

zcocorporation said...
This comment has been removed by the author.