How to query an XML file as if it was a relational database

A couple of times now I've spent my blogging time explaining how you can use MetaModel to query CSV files and Excel spreadsheets just as if they were regular databases that support filtering, grouping, sorting etc. in their query language (usually some dialect of SQL).

Today let's take a look at MetaModel's support for XML content. XML is, to say the least, very different from relational databases. Whereas relational databases consist of tables as the basic structural element, the XML structure is hierarchical and not table-based. You could say that the structure of XML is potentially a lot more dynamic, because metaphorically XML supports nesting tables inside each other to express structure and relations. So dealing with XML in a framework like MetaModel was not a no-brainer in terms of design considerations and decisions. The main theme of MetaModel has always been to provide a consistent model that stays the same even though you work with different kinds of datastores. So how did we do this when XML and table-based datastores are so inherently different? I'll show you using an example...

Consider this compressed XML example, i.e. an RSS newsfeed (coincidentally taken from the eobjects.org newsfeed):

<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>eobjects - Blog</title>
    <description>Welcome to the new eobjects.dk website</description>
    <item>
      <title>MetaModel 1.1 released!</title>
      <pubDate>Wed, 05 Nov 2008 14:01:02 GMT</pubDate>
      <description>text goes here...</description>
      <category>...</category>
      <category>...</category>
    </item>
    <item>
      <title>DataCleaner 1.5 "snapshot" released</title>
      <pubDate>Mon, 13 Oct 2008 07:00:13 GMT</pubDate>
      <description>text goes here...</description>
      <category>...</category>
    </item>
  </channel>
</rss>

I want you to notice a few things about the RSS structure so that we get a full understanding of how MetaModel will interpret the XML file and map it to a table-based structure. The resulting MetaModel structure is illustrated to the right (you will perhaps also notice that I have edited out a lot of the details in the XML code above, for brevity):

  • Notice that the <channel> element does not contain any data but only acts as what I would call a "wrapper tag". There are several different inner elements inside <channel>, which causes MetaModel to name tables according to this shared root node. Hence the table names "channel_item", "channel_title" etc.

  • A lot of values are contained in inner tags within the <item> elements, like the "title", "pubDate" and "link" elements. Since these appear only once for each <item> they are considered columns in a general <item> table: "channel_item".

  • Notice the multiplicity of the <category> tags inside the <item>'s. This causes a separate table to be created to handle multiplicity: "channel_item_category".

As you can see, we do a rather intuitive "XML to table-based model" mapping in MetaModel. This is done automatically through a process that we call auto-flattening of tables. If you prefer, you can also flatten the tables manually if you wish to compose the table model yourself (but I'll have to dedicate a separate blog entry to that topic sometime).
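To make the mapping concrete, here's a plain-Java sketch of the two flattened tables. This is not MetaModel code, and the item ids and category values are made up for illustration: each <item> becomes a row in channel_item with a synthetic id, and each <category> occurrence becomes a row in channel_item_category pointing back to its parent item.

```java
import java.util.List;

// Plain-Java illustration of the auto-flattened tables (not MetaModel API).
public class FlattenedModel {
    // One row per <item> element; the id is synthetic.
    record ChannelItem(int id, String title, String pubDate) {}
    // One row per <category> occurrence, linked to its parent <item>.
    record ChannelItemCategory(int channelItemId, String category) {}

    public static List<ChannelItem> items() {
        return List.of(
            new ChannelItem(1, "MetaModel 1.1 released!", "Wed, 05 Nov 2008 14:01:02 GMT"),
            new ChannelItem(2, "DataCleaner 1.5 \"snapshot\" released", "Mon, 13 Oct 2008 07:00:13 GMT"));
    }

    public static List<ChannelItemCategory> categories() {
        // hypothetical category values - the real feed's categories are omitted above
        return List.of(
            new ChannelItemCategory(1, "news"),
            new ChannelItemCategory(1, "release"),
            new ChannelItemCategory(2, "release"));
    }

    public static void main(String[] args) {
        System.out.println("channel_item rows: " + items().size());
        System.out.println("channel_item_category rows: " + categories().size());
    }
}
```

The synthetic foreign key is what lets the one-to-many relationship between items and categories survive the flattening.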
The XML-elements-to-tables mapping enables you to do all the cool querying stuff with your XML files that I have shown you so many times before. Let's do a simple example:
File file = new File("newsfeed.rss");
DataContext dc = DataContextFactory.createXmlDataContext(file, true, true);
Schema schema = dc.getDefaultSchema();

Table table = schema.getTableByName("channel_item_category");
Column column = table.getColumnByName("category");

// We'll make a query to show us the most popular newsitem categories:
// SELECT category, COUNT(*) as num_items FROM channel_item_category GROUP BY category ORDER BY num_items DESC

SelectItem countItem = SelectItem.getCountAllItem();

Query q = new Query().select(column).select(countItem).from(table).groupBy(column).orderBy(countItem, Direction.DESC);

DataSet dataSet = dc.executeQuery(q);
// Do something with the dataset!
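For reference, here's what that GROUP BY / ORDER BY query computes, expressed as a self-contained plain-Java sketch. The category values are made up, and MetaModel does this aggregation for you when it executes the query; this just spells out the semantics:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Plain-Java equivalent of:
// SELECT category, COUNT(*) AS num_items FROM channel_item_category
// GROUP BY category ORDER BY num_items DESC
public class CategoryCounts {
    public static LinkedHashMap<String, Long> topCategories(List<String> categories) {
        return categories.stream()
            // count the occurrences of each category value
            .collect(Collectors.groupingBy(c -> c, Collectors.counting()))
            .entrySet().stream()
            // order by count, most frequent first
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                    (a, b) -> a, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        // hypothetical category column values
        System.out.println(topCategories(List.of("release", "news", "release")));
    }
}
```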

If you're just interested in finding out what your XML format looks like after being auto-flattened by MetaModel, I recommend downloading DataCleaner (from which the screenshot to the right was taken). Over and out, enjoy MetaModel...


Setting goals for volunteer Open Source projects and DataCleaner in particular

As you probably know I'm the founder and main developer of the Open Source project called DataCleaner. While there have been some notable contributions from outsiders, I don't think it's arrogant to say that the crucial and main parts of the code-base for DataCleaner were written by me. But in a lot of ways I often think of myself as a medium of change, the bringer of the code, not the real source of the goal-setting and planning of the project. Also, my main job in regards to DataCleaner has been to try and attract more developers and broaden the interest among users for DataCleaner, but that's not really relevant for this topic.

In this blog post I'll investigate the relationship between me, the medium, and the community, which represents the real decision-making entity of DataCleaner. I suspect that a lot of the time, people who participate in a community don't realize the powerful position that they possess and how they should utilize the mediums of change.

Ok, so here's my main point that I want to stress:

The goals of DataCleaner are socially constructed by the desires of the community
What does this mean? It means that I don't set the goals for the project myself. Actually I have been doing so a lot, but only because no one else did. If the community were to say that they wanted the project to go in a particular direction, I would be perfectly happy to help them get there.

What sparked me to write this blog entry was actually my friend and co-worker at eobjects.org, Asbjørn Leeth. When I was discussing some more or less technical changes that I was considering for DataCleaner, he said:
"I think you should carefully consider the overall purpose of DataCleaner and where you want it to go. Who are the users and how should DataCleaner be used in a broader context?"
I absolutely agree with this quote, but the thing is that I wouldn't be the one to make those decisions! I would probably have an opinion, but ultimately it's not my decision, because I'm not the user, I'm just the medium of change. I think the DataCleaner community should get better at involving themselves in these crucial themes, but I don't blame them, because this is perhaps not a role that we are yet familiar with as Open Source actors.

We need to recognize why developers do Open Source software and when they are rewarded. This varies a lot from community to community, but in the case of DataCleaner I personally get a kick out of it every time someone tells me that they are using the product and that they think it rocks! This is my greatest reward. It's a lot greater than the times that people pay me money to help them with their problems (which may or may not be DataCleaner related). What does it mean that the greatest reward is the recognition of others? It means that you could effectively steer the development of DataCleaner simply by putting out your own goals and ideas for people like me to realize! Easy! This requires involvement and gratitude from the community, but given that, the community will be able to use the mediums of change in a far more effective way.

One example of how this works is when Ben Bor some months ago sent me a list of some 20 changes and fixes that he wanted in DataCleaner. Within a few days I had effectively fixed around half of them, and the rest were submitted to the roadmap, so we'll work our way through them as we go. My point here is that while I may be the one with the easiest access to making changes (because I know the product and code so well), I may not be the one who knows what the user wants. That being said, put very shortly, here are a couple of things that I have been thinking of in regards to high-level changes in the strategy of DataCleaner:
  • Change the name of the application! We don't provide data cleansing. Rather we mostly do profiling and our validation engine is also rather OK, so perhaps we should think of more fitting names.
  • Remove the separation of profiling and validation in the User Interface. The User Interface should rather reflect the process and provide convenient tools to the user instead of representing the internal entities of the application.
  • These changes would definitely imply a change to version 2.0 of DataCleaner, because they would mean fundamental changes both to the User Interface and the core module.
Those are just my 25 cents in the goal-setting debate. I think for Open Source to really prosper we need user-based communities who understand that they are not just "takers of software", they are also "givers of opinions".


Query your Excel spreadsheet with Java

The MetaModel project lets you do wonderful and advanced things like filtering, ordering, grouping etc. when working with otherwise static content in CSV files, Excel spreadsheets, XML files etc. Many times I have banged my head against the wall when trying to get simple summary data out of my Excel spreadsheets, or simply filtering on single rows in the sheets.

An example: Recently I was working on my tax return statement, and throughout the year I had recorded all relevant activities in a simple spreadsheet. I had marked the activities on various accounts such as representation, spendings, earnings, whatever. Simplified, it looked something like this:

Activity               | Amount | Account
Asus EEE               | 1.600  | Spendings
Advisory for Company X | 4.000  | Earnings

One thing I think is funny when dealing with Excel is that over time you've learned how to set up your spreadsheet for easy hacking instead of dynamic querying. What I mean by this is that if I wanted to get the SUM for each account in my spreadsheet, I would probably have fundamentally changed my (otherwise nice and pure) spreadsheet in order to make it easy to perform the SUM function on singular accounts. For example, I might have introduced an amount column for each account. But alas, adding a new account would then be pretty cumbersome.

Enter MetaModel! With MetaModel I can write queries in a SQL-like manner. When I say "SQL-like" you may fret and think "oh, don't teach me another SQL dialect, just give me the real deal!". The answer to this is that what I'll teach you now will effectively replace all SQL dialects, because the same model (or MetaModel ;)) is usable for all kinds of datastores: Excel, CSV/TSV files, XML and SQL databases. Let's have a look at some code. First we'll load the Excel file into MetaModel, get a hold of our schema and table structure, and identify our columns of interest:

// Use the factory to create a DataContext and automatically (true) narrow column types
DataContext dataContext = DataContextFactory.createExcelDataContext(new File("my_tax_return_activities.xls"), true);
Schema schema = dataContext.getDefaultSchema();
// A simple way of getting the table object is just to pick the first one (the first sheet in the spreadsheet)
Table table = schema.getTables()[0];
Column amountColumn = table.getColumnByName("Amount");
Column accountColumn = table.getColumnByName("Account");

Now that we have the columns represented and the DataContext which we can use to perform queries, here's how we make a couple of interesting queries using the MetaModel query API:

// SELECT SUM(Amount) FROM sheet WHERE Account = "Spendings"
Query q = new Query().select(new SelectItem(FunctionType.SUM, amountColumn)).from(table).where(accountColumn, OperatorType.EQUALS_TO, "Spendings");

// SELECT Account, SUM(Amount) FROM sheet GROUP BY Account
q = new Query().select(accountColumn).select(new SelectItem(FunctionType.SUM, amountColumn)).from(table).groupBy(accountColumn);
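To be explicit about what the GROUP BY query computes, here's a self-contained plain-Java sketch of the same aggregation. The sample rows mirror the spreadsheet above; MetaModel performs this for you when it executes the query:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Plain-Java equivalent of:
// SELECT Account, SUM(Amount) FROM sheet GROUP BY Account
public class AccountSums {
    record Activity(String description, double amount, String account) {}

    public static Map<String, Double> sumByAccount(List<Activity> activities) {
        // group the rows by account and sum the amounts within each group
        return activities.stream().collect(Collectors.groupingBy(
                Activity::account, Collectors.summingDouble(Activity::amount)));
    }

    public static void main(String[] args) {
        List<Activity> rows = List.of(
                new Activity("Asus EEE", 1600, "Spendings"),
                new Activity("Advisory for Company X", 4000, "Earnings"));
        System.out.println(sumByAccount(rows));
    }
}
```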

Now that we've authored the queries, we can let MetaModel execute them and deal with the result as appropriate. If we just want to print out the results to the console, we'll do something like this:

DataSet dataSet = dataContext.executeQuery(q);
while (dataSet.next()) {
    System.out.println(dataSet.getRow());
}


Querying a CSV file!

Today I'm going to demonstrate some of the functionality of MetaModel 1.1, which provides a datastore-transparent querying API for Java. Actually, what I'll show you isn't all that new - we've been able to do most of the querying stuff for CSV files (and other datastores) roughly since MetaModel 1.0 - but it seems to me that too few realize what a powerful tool we have with MetaModel, so I'm just going to repeat myself a bit here ;) Also, I'll demonstrate one of the new cool features of MetaModel 1.1: column type detection, narrowing and value transformation.

For this example I'll use some real data: I've extracted this CSV file from the eobjects.org trac system. It contains a list of all the tickets (issues, bugs, tasks, whatever) that are active at the time of writing... You'll notice if you take a look at the file, that it's not exactly a simple CSV file - a lot of text spans multiple lines and there are quite a lot of different data types.

Ok, so let's get started. I've saved the data to a file: tickets.csv. Now I want to read the file and let MetaModel generate a metadata model based on it. I will also let MetaModel try to detect the column types (since CSV natively contains only text types), which will automatically transform all values to the most fitting and narrow type that MetaModel can find (this is indicated by the 'true' parameter in the code below). Here's how we get a hold of the datastore model for the file:

File file = new File("tickets.csv");
DataContext dataContext = DataContextFactory.createCsvDataContext(file, true);

Once we have a DataContext object, we are ready for our datastore-transparent way of querying. What we do is: we get a hold of the schema of our CSV file and locate the table of interest. Since CSV is a single-table datastore type, getting the table of interest can be done in two ways:

Schema schema = dataContext.getDefaultSchema();
Table table;

//get table by name
table = schema.getTableByName("tickets");

//get table by index
table = schema.getTables()[0];

Now we can go ahead and investigate the structure of the CSV file. Since we turned on automatic column type narrowing, you will see that the 'ticket' column has been converted from a text-based column to an INTEGER type. Also, as MetaModel can verify that 'ticket' values are never missing, it is asserted that the column is not nullable:

Column ticketColumn = table.getColumnByName("ticket");

//Will print: "Ticket column type: INTEGER"
System.out.println("Ticket column type: " + ticketColumn.getType());

//Will print: "Ticket column nullable: false"
System.out.println("Ticket column nullable: " + ticketColumn.isNullable());
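Conceptually, the type detection works by checking whether every value in a column parses as a narrower type. The following is not MetaModel's actual implementation, just a minimal sketch of the idea for the integer case:

```java
import java.util.List;

// Minimal sketch of column type narrowing (not MetaModel's real code):
// a text column whose values all parse as integers can be narrowed to INTEGER.
public class TypeNarrowing {
    public static String narrowType(List<String> values) {
        // check that every value matches an (optionally signed) integer literal
        boolean allIntegers = values.stream().allMatch(v -> v.matches("-?\\d+"));
        return allIntegers ? "INTEGER" : "VARCHAR";
    }

    public static void main(String[] args) {
        System.out.println(narrowType(List.of("1", "2", "42")));   // INTEGER
        System.out.println(narrowType(List.of("1", "2", "foo")));  // VARCHAR
    }
}
```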

And now for the fun and impressive part... Let's try to make some queries! Here are a couple of examples:

Query q;

//SELECT * FROM tickets ORDER BY ticket
q = new Query().select(table.getColumns()).from(table).orderBy(ticketColumn);

//SELECT SUM(ticket) FROM tickets
q = new Query().select(FunctionType.SUM, ticketColumn).from(table);

//SELECT _reporter AS rep_name, COUNT(*) AS num_tickets FROM tickets GROUP BY rep_name
Column reporterColumn = table.getColumnByName("_reporter");
q = new Query().select(reporterColumn, "rep_name").selectCount().from(table).groupBy(reporterColumn);

Executing the queries is very simple - just ask your DataContext object to execute the query object. MetaModel will then analyze the query and produce the result behind the scenes:

DataSet dataSet = dataContext.executeQuery(q);

We can process a MetaModel DataSet in quite a few ways. A typical way is to iterate through it, similar to a JDBC ResultSet:

while (dataSet.next()) {
    Row row = dataSet.getRow();
    //Extract values or do something similar with the row
    System.out.println("row: " + row);
}

... Or we can transform the DataSet into a TableModel object:

TableModel tableModel = dataSet.toTableModel();

... Or we can transform it into a list of Object arrays:

List<Object[]> objectArrays = dataSet.toObjectArrays();

As you can see, with the MetaModel API a lot of things that used to be difficult are now really, really easy!



MetaModel 1.1 released!

I just ditched my master's thesis today to work on good ol' MetaModel! I spent my time straightening out the last remaining tasks on the 1.1 release, which I have been looking forward to for some time now. So I'm happy to announce that

MetaModel 1.1 has just been released!

Head over to the eobjects.org news site to learn more about what this release is all about!