How to query an XML file as if it were a relational database

A couple of times now I've spent my blogging time explaining how you can use MetaModel to query CSV files and Excel spreadsheets just as if they were regular databases, which enables filtering, grouping, sorting etc. in their query languages (usually some form of SQL, with more or less dialect).

Today let's take a look at MetaModel's support for XML content. XML is, to say the least, very different from relational databases. Whereas relational databases consist of tables as the basic structural element, the XML structure is hierarchical and not table-based. You could say that the structure of XML is potentially a lot more dynamic, because metaphorically XML supports nesting tables inside each other to express structure and relations. So dealing with XML in a framework like MetaModel was not a no-brainer in terms of design considerations and decisions. The main theme of MetaModel has always been to provide a consistent model that stays the same even though you work with different kinds of datastores. So how did we do this when XML and other table-based datastores are so inherently different? I'll show you using an example...

Consider this compressed XML example, i.e. an RSS newsfeed (coincidentally taken from the eobjects.org newsfeed):

<?xml version="1.0"?>
<rss version="2.0">
 <channel>
  <title>eobjects - Blog</title>
  <description>Welcome to the new eobjects.dk website</description>
  <item>
   <title>MetaModel 1.1 released!</title>
   <pubDate>Wed, 05 Nov 2008 14:01:02 GMT</pubDate>
   <description>text goes here...</description>
  </item>
  <item>
   <title>DataCleaner 1.5 "snapshot" released</title>
   <pubDate>Mon, 13 Oct 2008 07:00:13 GMT</pubDate>
   <description>text goes here...</description>
  </item>
 </channel>
</rss>

I want you to notice a few things about the RSS structure so that we get a full understanding of how MetaModel will interpret the XML file and map it to a table-based structure. The resulting MetaModel structure is illustrated to the right (you will perhaps also notice that I have edited out a lot of the details in the XML code above, for brevity):

  • Notice that the <channel> element does not contain any data itself but only acts as what I would call a "wrapper tag". The different inner elements inside <channel> cause MetaModel to name tables according to this shared root node, hence the table names "channel_item", "channel_title" etc.

  • A lot of values are contained in inner tags within the <item> elements, like the "title", "pubDate" and "link" elements. Since these appear only once for each <item>, they are considered columns in a general <item> table: "channel_item".

  • Notice the multiplicity of the <category> tags inside the <item> elements. This causes a separate table to be created to handle the multiplicity: "channel_item_category".

As you can see, MetaModel performs a rather intuitive "XML to table-based model" mapping. This is done automatically through a process that we call auto-flattening of tables. If you prefer, you can also flatten the tables manually and compose the table model yourself (but I'll have to dedicate a separate blog entry to that topic sometime).
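To give a feel for what auto-flattening produces, here's a little JDK-only sketch (this is not MetaModel's actual implementation, just an illustration) that pulls the once-per-<item> elements out of an RSS document and turns them into rows of a hypothetical "channel_item" table:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class FlattenSketch {

    // Collect a (title, pubDate) row for every <item> element - roughly what
    // a "channel_item" table would contain after auto-flattening.
    static List<String[]> flattenItems(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String[]> rows = new ArrayList<>();
            NodeList items = doc.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                rows.add(new String[] {
                        item.getElementsByTagName("title").item(0).getTextContent(),
                        item.getElementsByTagName("pubDate").item(0).getTextContent() });
            }
            return rows;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String rss = "<rss version=\"2.0\"><channel>"
                + "<item><title>MetaModel 1.1 released!</title>"
                + "<pubDate>Wed, 05 Nov 2008 14:01:02 GMT</pubDate></item>"
                + "<item><title>DataCleaner 1.5 snapshot released</title>"
                + "<pubDate>Mon, 13 Oct 2008 07:00:13 GMT</pubDate></item>"
                + "</channel></rss>";
        for (String[] row : flattenItems(rss)) {
            System.out.println(row[0] + " | " + row[1]);
        }
    }
}
```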
The XML-elements-to-tables mapping enables you to do all the cool querying stuff with your XML files that I have shown you so many times before. Let's do a simple example:
File file = new File("newsfeed.rss");
DataContext dc = DataContextFactory.createXmlDataContext(file, true, true);
Schema schema = dc.getDefaultSchema();

Table table = schema.getTableByName("channel_item_category");
Column column = table.getColumnByName("category");

// We'll make a query to show us the most popular newsitem categories:
// SELECT category, COUNT(*) as num_items FROM channel_item_category GROUP BY category ORDER BY num_items DESC

SelectItem countItem = SelectItem.getCountAllItem();

Query q = new Query().select(column).select(countItem).from(table).groupBy(column).orderBy(countItem, Direction.DESC);

DataSet dataSet = dc.executeQuery(q);
// Do something with the dataset!
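Conceptually, what the query above does is just a tally and a sort. Here's the same GROUP BY / COUNT(*) / ORDER BY ... DESC logic written out in plain Java over a made-up list of category values (the values are hypothetical, for illustration only):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class CategoryCounts {

    // GROUP BY category, COUNT(*), ORDER BY the count descending - in plain Java
    static List<Map.Entry<String, Long>> topCategories(List<String> categories) {
        return categories.stream()
                .collect(Collectors.groupingBy(c -> c, Collectors.counting()))
                .entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // hypothetical category values from a newsfeed
        List<String> categories = Arrays.asList("metamodel", "datacleaner", "metamodel", "release");
        for (Map.Entry<String, Long> entry : topCategories(categories)) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
```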

If you're just interested in finding out what your XML format looks like after being auto-flattened by MetaModel, I recommend downloading DataCleaner (from which the screenshot to the right was taken). Over and out, enjoy MetaModel...


Setting goals for volunteer Open Source projects and DataCleaner in particular

As you probably know, I'm the founder and main developer of the Open Source project DataCleaner. While there have been some notable contributions from outsiders, I don't think it's arrogant to say that the crucial and main parts of the code-base for DataCleaner were written by me. But in a lot of ways I often think of myself as a medium of change, the bringer of the code, not the real source of the goal-setting and planning of the project. Also, my main job in regard to DataCleaner has been to try to attract more developers and broaden the interest among users, but that's not really relevant for this topic.

In this blog post I'll investigate the relationship between me, the medium, and the community, which represents the real decision-making entity of DataCleaner. I suspect that a lot of the time, people who participate in a community don't realize the powerful position that they possess and how they could utilize the mediums of change.

Ok, so here's my main point that I want to stress:

The goals of DataCleaner are socially constructed by the desires of the community
What does this mean? It means that I don't set the goals for the project myself. Actually I have been doing so a lot, but only because no one else did. If the community were to say that they wanted the project to go in a certain direction, I would be perfectly happy to help them.

What sparked me to write this blog entry was actually my friend and co-worker at eobjects.org, Asbjørn Leeth. When I was discussing some more or less technical changes that I was considering for DataCleaner, he said:
"I think you should carefully consider the overall purpose of DataCleaner and where you want it to go. Who are the users and how should DataCleaner be used in a broader context?"
I absolutely agree with this quote, but the thing is that I shouldn't be the one to make those decisions! I would probably have an opinion, but ultimately it's not my decision because I'm not the user; I'm just the medium of change. I think that the DataCleaner community should get better at involving themselves in these crucial themes, but I don't blame them, because this is perhaps not a role that we are yet familiar with as Open Source actors.

We need to recognize why developers do Open Source software and when they are rewarded. This varies a lot from community to community, but in the case of DataCleaner I personally get a kick out of it every time people tell me that they are using the product and that they think it rocks! This is my greatest reward. It's a lot greater than the times that people pay me money to help them with their problems (which may or may not be DataCleaner related). What does it mean that the greatest reward is the recognition of others? It means that you could effectively steer the development of DataCleaner simply by putting out your own goals and ideas for people like me to realize! Easy! This requires involvement and gratitude from the community, but given that, the community will be able to use the mediums of change in a far more effective way.

One example of how this works is when Ben Bor some months ago sent me a list of some 20 changes and fixes that he wanted in DataCleaner. Within a few days I had effectively fixed around half of them, and the rest were submitted to the roadmap, so we'll work our way through them as we go. My point here is that while I may be the one with the easiest access to making changes (because I know the product and code so well), I may not be the one who knows what the user wants. That being said, put very shortly, here are a couple of things that I have been thinking of in regard to high-level changes in the strategy of DataCleaner:
  • Change the name of the application! We don't provide data cleansing. Rather we mostly do profiling and our validation engine is also rather OK, so perhaps we should think of more fitting names.
  • Remove the separation of profiling and validation in the User Interface. The User Interface should rather reflect the process and provide convenient tools to the user instead of represent the internal entities of the application.
  • These changes would definitely imply a change to version 2.0 of DataCleaner because they would mean fundamental changes both to the User Interface and the core module.
Those are just my 25 cents in the goal-setting debate. I think for Open Source to really prosper we need user-based communities who understand that they are not just "takers of software", they are also "givers of opinions".


Query your Excel spreadsheet with Java

The MetaModel project lets you do wonderful and advanced things like filtering, ordering, grouping etc. when working with otherwise static content in CSV files, Excel spreadsheets, XML files etc. Many times I have banged my head against the wall trying to get simple summary data out of my Excel spreadsheets or simply filter on single rows in the sheets.

An example: Recently I was working on my tax return statement, and throughout the year I had recorded all relevant activities in a simple spreadsheet. I had marked the activities with various accounts such as representation, spendings, earnings, whatever. Simplified, it looked something like this:

Asus EEE | 1.600 | Spendings
Advisory for Company X | 4.000 | Earnings

One thing I think is funny when dealing with Excel is that throughout time you've learned how to set up your spreadsheet for easy hacking instead of dynamic querying. What I mean by this is that if I wanted to get the SUM for each account in my spreadsheet, I would probably have fundamentally changed my (otherwise nice and pure) spreadsheet in order to make it easy to perform the SUM function on singular accounts. For example, I might have introduced an amount column for each account. But alas, adding a new account would then be pretty cumbersome.

Enter MetaModel! With MetaModel I can write queries in a SQL-like manner. When I say "SQL-like" you may fret and think "oh, don't teach me another SQL dialect, just give me the real deal!". The answer to this is that what I'll teach you now will effectively replace all SQL dialects, because the same model (or MetaModel ;)) is usable for all kinds of datastores: Excel, CSV/TSV files, XML and SQL databases. Let's have a look at some code. First we'll load the Excel file into MetaModel, get a hold of our schema and table structure and identify our columns of interest:

// Use the factory to create a DataContext and automatically (true) narrow column types
DataContext dataContext = DataContextFactory.createExcelDataContext(new File("my_tax_return_activities.xls"), true);
Schema schema = dataContext.getDefaultSchema();
// A simple way of getting the table object is just to pick the first one (the first sheet in the spreadsheet)
Table table = schema.getTables()[0];
Column amountColumn = table.getColumnByName("Amount");
Column accountColumn = table.getColumnByName("Account");

Now that we have the columns represented and the DataContext which we can use to perform queries, here's how we make a couple of interesting queries using the MetaModel query API:

// SELECT SUM(Amount) FROM sheet WHERE Account = "Spendings"
Query q = new Query().select(new SelectItem(FunctionType.SUM, amountColumn)).from(table).where(accountColumn, OperatorType.EQUALS_TO, "Spendings");

// SELECT Account, SUM(Amount) FROM sheet GROUP BY Account
q = new Query().select(accountColumn).select(new SelectItem(FunctionType.SUM, amountColumn)).from(table).groupBy(accountColumn);
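For intuition, here's what the GROUP BY/SUM query computes, sketched with plain Java (JDK only, no MetaModel) over the two sample rows from above. Note that "1.600" is the Danish way of writing 1600:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AccountSums {

    // SELECT Account, SUM(Amount) ... GROUP BY Account - written out by hand.
    // Each row is {description, amount, account}, mirroring the sample sheet.
    static Map<String, Double> sumByAccount(Object[][] rows) {
        Map<String, Double> sums = new LinkedHashMap<>();
        for (Object[] row : rows) {
            String account = (String) row[2];
            double amount = ((Number) row[1]).doubleValue();
            sums.merge(account, amount, Double::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        Object[][] rows = {
                { "Asus EEE", 1600d, "Spendings" },
                { "Advisory for Company X", 4000d, "Earnings" } };
        System.out.println(sumByAccount(rows));
    }
}
```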

Now that we've authored the queries, we can let MetaModel execute them and deal with the result as appropriate. If we just want to print the results to the console, we'll do something like this:

DataSet dataSet = dataContext.executeQuery(q);
while (dataSet.next()) {
    System.out.println(dataSet.getRow());
}


Querying a CSV file!

Today I'm going to demonstrate some of the functionality of MetaModel 1.1, which provides a datastore-transparent querying API for Java. Actually, what I'll show you isn't all that new; we've been able to do most of the querying stuff for CSV files (and other datastores) roughly since MetaModel 1.0, but it seems to me that too few realize what a powerful tool we have with MetaModel, so I'm just going to repeat myself a bit here ;) Also, I'll demonstrate one of the cool new features of MetaModel 1.1 - column type detection, narrowing and value transformation.

For this example I'll use some real data: I've extracted this CSV file from the eobjects.org trac system. It contains a list of all the tickets (issues, bugs, tasks, whatever) that are active at the time of writing... You'll notice if you take a look at the file, that it's not exactly a simple CSV file - a lot of text spans multiple lines and there are quite a lot of different data types.

Ok, so let's get started. I've saved the data to a file: tickets.csv. Now I want to read the file and let MetaModel generate a metadata model based on it. I will also let MetaModel try to detect the column types (since CSV natively only contains text types), which will automatically transform all values to the most fitting and narrow type that MetaModel can find (this is indicated by the 'true' parameter in the code below). Here's how we get a hold of the datastore model for the file:

File file = new File("tickets.csv");
DataContext dataContext = DataContextFactory.createCsvDataContext(file, true);

Once we have a DataContext object we are ready for our datastore-transparent way of querying. What we do is: we get a hold of the schema of our CSV file and locate the table of interest. Since CSV is a single-table datastore type, getting the table of interest can be done in two ways:

Schema schema = dataContext.getDefaultSchema();
Table table;

//get table by name
table = schema.getTableByName("tickets");

//get table by index
table = schema.getTables()[0];

Now we can go ahead and investigate the structure of the CSV file. Since we turned on automatic column type narrowing, you will see that the 'ticket' column has been converted from a text-based column to an INTEGER type. Also, since MetaModel can verify that 'ticket' values are never missing, it is asserted that the column is not nullable:

Column ticketColumn = table.getColumnByName("ticket");

//Will print: "Ticket column type: INTEGER"
System.out.println("Ticket column type: " + ticketColumn.getType());

//Will print: "Ticket column nullable: false"
System.out.println("Ticket column nullable: " + ticketColumn.isNullable());
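Under the hood, type narrowing boils down to trying a narrow type on every value and falling back to text when a value doesn't fit. The following is my own simplified sketch of that idea (not MetaModel's actual code), including the nullability check:

```java
import java.util.Arrays;
import java.util.List;

public class NarrowSketch {

    // Guess the narrowest column type for a set of CSV string values:
    // INTEGER if every value parses as an integer, otherwise VARCHAR.
    static String narrowType(List<String> values) {
        for (String value : values) {
            try {
                Integer.parseInt(value);
            } catch (NumberFormatException e) {
                return "VARCHAR";
            }
        }
        return "INTEGER";
    }

    // A column is considered nullable if any value is missing or empty.
    static boolean nullable(List<String> values) {
        for (String value : values) {
            if (value == null || value.isEmpty()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> tickets = Arrays.asList("1", "2", "42");
        System.out.println(narrowType(tickets) + ", nullable: " + nullable(tickets));
    }
}
```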

And now for the fun and impressive part... Let's try to make some queries! Here are a couple of examples:

Query q;

//SELECT * FROM tickets ORDER BY ticket
q = new Query().select(table.getColumns()).from(table).orderBy(ticketColumn);

//SELECT SUM(ticket) FROM tickets
q = new Query().select(FunctionType.SUM, ticketColumn).from(table);

//SELECT _reporter AS rep_name, COUNT(*) AS num_tickets FROM tickets GROUP BY rep_name
Column reporterColumn = table.getColumnByName("_reporter");
q = new Query().select(reporterColumn, "rep_name").selectCount().from(table).groupBy(reporterColumn);

Executing the queries is very simple - just ask your DataContext object to execute the query object. MetaModel will then analyze the query and process the result behind the scenes:

DataSet dataSet = dataContext.executeQuery(q);

We can process MetaModel DataSets in quite a few ways. A typical way would be to iterate through it, similar to a ResultSet:

while (dataSet.next()) {
    Row row = dataSet.getRow();
    //Extract values or do something similar with the row
    System.out.println("row: " + row);
}

... Or we can transform the DataSet into a TableModel object:

TableModel tableModel = dataSet.toTableModel();

... Or we can transform it into a list of Object arrays:

List<Object[]> objectArrays = dataSet.toObjectArrays();

As you can see, with the MetaModel API a lot of things that used to be difficult are now really, really easy!

Click here to display the full example.


MetaModel 1.1 released!

I just ditched my master's thesis today to work on good ol' MetaModel! I spent my time straightening out the last remaining tasks on the 1.1 release, which I have been looking forward to for some time now. So I'm happy to announce that

MetaModel 1.1 has just been released!

Head over to the eobjects.org news site to learn more about what this release is all about!


Open Source acknowledged by the Data Quality community?

Here in Denmark I often feel that the Data Quality, Master Data Management and Business Intelligence field is pretty fearful of Open Source software. I think this largely has to do with lack of presence, a lot of prejudice and few established consultancy firms in this part of the world.

This is also why it's always a great surprise, and a good one, when you catch the interest of the international Data Quality scene. Last week I was in correspondence with the guys over at Data Quality Pro, who were building a new Open Source Data Quality page, and we had a nice chat about the tools available and the opportunities out there. I'm very glad that people are showing interest, and hopefully the Danish BI scene will also adapt to the wonderful world of Open Source software as we go along...

Speaking of getting the word out, where should you "advertise" your Open Source product to the business world? I personally think it's hard work to market your product, even though you're giving it away for free and you would think that people would automatically rush to your website ;) Of course everyone needs to be aware that the software is out there, and I try a lot to put the word out at conferences, Wikipedia, SourceForge, Freshmeat, Ohloh etc. But let me raise this question to everybody involved with marketing software for no cost: How do you do it?


Why can't my master's thesis be more like my Open Source project?

Being a hard-working student I sometimes have to question (and from time to time applaud) the practices of academia and the tools that we use to foster innovation, creativity and knowledge-sharing. My master's thesis subject is concerned with the development methods of open source communities and companies that try to enable community-based development of their products. Yesterday I was considering this quote:

"In the cathedral-builder view of programming, bugs and development problems are tricky, insidious, deep phenomena. It takes months of scrutiny by a dedicated few to develop confidence that you've winkled them all out." - Eric S. Raymond, The Cathedral and the Bazaar

Looking past the fact that it deals with programming and not writing a paper (and trying to grow global awareness and knowledge on a specific topic), I thought to myself that the cathedral-builder process is pretty similar to the process of writing a master's thesis. There are pretty strict guidelines to follow, a lot of scrutiny involved and planning by a dedicated few - in my case myself and my supervisor.

So is there room for another way, a more open process with distributed peers, continuous redesign, short release cycles etc.? Obviously there are things like Wikipedia that provide this for topics of interest to the general public, but needless to say science projects often go beyond that level of information and have to deal with experiments, not just facts of life such as those in an encyclopedia.

Also there's the issue of academic culture. Ego and elitism doubtlessly play a big part in maintaining a high degree of secrecy and closedness in scientific endeavor. I'd love to see scientists work in a community-enabling fashion, and then I'd love to contribute to one of those communities (or create my own for my master's thesis). Let's for example try something like this out and we'll be well under way:

  • "Bugtrackers" for all the items that need investigation
  • "Source control management" and versioning systems for revisions of the paper(s)
  • Chatty mailing lists for peer review and discussions
  • "Continuous integration" for managing/matching references and terms within the paper(s)
  • Free availability to all underlying data, not just the published parts

The beauties of this would be similar to the beauties of open source. And particularly in academia there's a strong need to be able to track down who's done what, and source control management and reference management would greatly improve on that account. For evaluators it would be possible to see the actual changes made by each student in group work; for group-working students it would be possible to track the actual changes to the project (as opposed to having to read it all over again every time you exchange documents)... Perhaps a more "open" science would be just what we need?


A day of releases!

I saw this morning that the new OpenOffice 3 is out! Congratulations to the OO.o crew, I'm enjoying using your product for my upcoming master's thesis :)

Today is also the day that DataCleaner 1.5 "snapshot" has been released. Here's the press release:

As we're moving steadily along towards the release of DataCleaner 1.5, we are fixing a few bugs and enhancing a lot of features. This has led to the desire to release our work, since practically nothing has undergone changes that could destabilize the application since the 1.4 release. So today we're releasing DataCleaner 1.5 "snapshot". This also marks the first release under our new LGPL license.

Here are the changes from 1.4 so far:
  • Change of license to LGPL.
  • New profile: Date mask matcher.
  • New profile: Regex matcher.
  • More file types supported (.dat, .txt)
  • XML file support improved (.xml)
Although this is in principle a development/beta release, we feel that it is worth working with for most of your profiling needs. So... Go on, download it, tell us what you think and we'll see you around!

I hope you all enjoy the new version of DataCleaner!


Fast as lightning!

Whoa! I just got through my Lightning Speak about eobjects.org and DataCleaner at the Open Source Days '08 conference about an hour ago. It was a great experience - very fun and kinda stressful (in the good, "get to the point" kinda way) to have an alarm clock counting down your 15 minutes of fame!

And indeed my presentation was very much to the point. I wanted to tell people about the great creative projects at eobjects.org, and especially about DataCleaner and the MetaModel project, which I dubbed a "derivative" project. My speak also quickly sketched the domain of data quality, and people were nodding when I concluded that they should all download DataCleaner and give their datasources a quick profile the next time they worked on their projects.

You can download my slides here: http://eobjects.org/resources/download/opensourcedays.pdf

Unfortunately the format of the Lightning Speak didn't allow much time for comments and questions from the audience, but I hope and think that they had a good time!


Presenting DataCleaner at Open Source Days '08

The Open Source Days conference in Copenhagen has been growing steadily for the last couple of years. Some of you may know it by its former name, LinuxForum, though. This year the conference features a new concept: Lightning Speaks. Lightning Speaks are less comprehensive talks by less profiled speakers and are not necessarily based upon a paper. This doesn't mean that the talks will be less relevant; on the contrary, I think we will see a lot of hyper-interesting "up and coming" speakers delivering early keynotes on major topics to come.

As you may have guessed by my enthusiasm about the Lightning Speak concept, I am speaking at this year's Open Source Days conference on Sunday the 4th of October at 14:00! The topic will be DataCleaner and how the eobjects.dk (and now eobjects.org) community has delivered one of the leading open source Data Quality solutions within less than a year! Here's the official speak statement:

DataCleaner is the most advanced open source data quality solution available. You can use DataCleaner to profile, validate and compare your data in an intuitive graphical environment. The application is compatible with lots of datasources such as JDBC databases, CSV files, Excel spreadsheets, OpenOffice.org database files and XML files. Among the interesting features are identification of string patterns, value distributions, dictionary lookups, JavaScripted validation rules and regular expression validation, along with traditional metrics used for data profiling and analysis. Come to this lightning speak to listen to Kasper Sørensen, the founder of eobjects.dk and main developer of DataCleaner, who will demonstrate DataCleaner and give a quick overview of what's going on in open source data quality.
I hope that if you're joining the conference you'll come and hear me out. And if you're not planning to come - then please reconsider ;) I'm very much looking forward to showing off our great product, and I think it will be a lovely experience to get some more focus on data quality as well.


We're moving eobjects.dk to a new server

Hello everybody. This is a practical announcement ...

We're in the process of moving eobjects.dk to a new server.

Please return shortly to access the new and improved eobjects.dk!


We're finally finished, but the IP address change will need some hours to cascade worldwide. If you're still being redirected to this page it's because the DNS changes haven't set in yet. The new eobjects.dk website is online! Impatient people who are suffering from the slow DNS cascades can also access the website directly by its IP address (minor glitches are to be expected this way, though).


By now, all DNS changes should be complete, so go ahead and enjoy the new eobjects.dk website!


Development/snapshot release of DataCleaner 1.4

We've released a development/snapshot release of DataCleaner 1.4 in order to get early reactions to all the improvements and new features, as well as to support our users with up-to-date functionality. In my own opinion the development release is just as stable and "safe to use" as 1.3, but of course it lacks a bit of the manual testing that we put into the real releases.

You can download the development release at our sourceforge download site.

Here's a short list of fixes since DataCleaner 1.3:

  • Better memory handling and garbage collection
  • Reference columns in drill-to-details windows
  • Better error handling when loading schemas
  • Quoting of string values in visualized tables (in order to distinguish empty strings and white spaces)
  • New profile: Value Distribution, which is an improved version of the Repeated Values profile. The Value Distribution profile has an option to configure the top/bottom n values to include in the result.
  • Better control of profile result column width.
  • Bugfix: Copy to clipboard functions now work properly.
  • Bugfix: Scrollbars added to visualized tables.


New eobjects.dk website

Hi everybody,

I'm anticipating the release of a new eobjects.dk website design. The website will be launched pretty soon, I hope - we're doing it as part of a general server move. The move is to a laboratory at Copenhagen Business School (CBS) called Business of Open Source Software and Standards (BOSSS). With BOSSS we'll have much better bandwidth and a better performing server, as well as better physical security conditions.

The new website will be based on trac 0.11 (currently we use 0.10) and will feature a lot of improvements for visitors, users and contributors:

  • A better theming engine has enabled us to use a more flexible website design, with wiki pages appearing as menu items.
  • A whole new news page which will be used to make announcements on the progress of our projects.
  • A lot of other improvements caused by the trac upgrade.
Here's a little screenshot of the new webpage design (work in progress):

Any comments are welcome!


Considering MetaModel functions

This blog entry could just as well have been a feature request, but I'm going to kick-start it with a couple of thoughts I have about one of the crucial improvements to MetaModel that I've been dreaming about.

The last couple of weeks have brought considerable interest in MetaModel, largely thanks to articles posted on TheServerSide and InfoQ. It's been great to get the message out, and it's also sparked a lot of great ideas from users/evaluators on the discussion forum. A couple of them have been requests that we build more advanced SELECT items into the query model. In this post I'm going to discuss type-casting and extraction functions and how they can be made possible using the new IQueryRewriter interface.

The idea of query rewriting had been brewing for some time, of course inspired by Hibernate's dialects. The thing was, though, that for a start I wanted to skip dialect handling completely in order to find out how far one could actually go without having to do any "hacking" in SQL. It worked out quite well, but now that we need to incorporate more advanced, non-standardised features, we will of course need to be able to manipulate the standard output. This is what the query rewriter is for, and in particular what the AbstractQueryRewriter class helps you do. I've made my first query rewriting "hack" today - using the TOP function for limiting the result set size, which is (as far as I know) only available in a few databases such as Microsoft SQL Server.

What we need to do now is expand the Query model API. We need to incorporate type casting. My thoughts are:

  • We must take an interface-first approach - how would one most appropriately like to type-cast a select item in a query? I'm thinking that we should add a "castAs(ColumnType type)" method on SelectItem.
  • Because not all of the ColumnTypes are supported by all databases, we should consider making a more abstract type enum - something that will only contain a couple of basic types like String, Integer, Decimal, Date, Boolean.
  • We should use the query rewriting approach to generate the actual SQL cast syntax. Some databases use the CAST(x AS y) function, others use special-purpose functions like TO_NUMBER(x).
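To make the interface-first idea concrete, here's a pseudocode sketch of how a query using the proposed castAs method might read (nothing here exists yet; castAs, the abstract type enum and priceColumn are all hypothetical):

```
// hypothetical API - none of this is in MetaModel today
Query q = new Query()
    .select(new SelectItem(priceColumn).castAs(ColumnType.INTEGER))
    .from(table);

// a dialect-aware query rewriter would then render the select item as
// CAST(price AS INTEGER) on one database and TO_NUMBER(price) on another
```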
Another feature that I want to include in MetaModel is functions for calculating or extracting something based on a column. Let's take for example the YEAR(x) function (or, in some databases, the EXTRACT(YEAR FROM x) function).
  • One would initially think that we should just add this function to the FunctionType enum and take it from there. But actually it's quite a different type of function. While SUM, COUNT etc. are aggregate functions, the YEAR function is a single-value function, i.e. you can't call YEAR on a set of values.
  • Therefore we should consider renaming FunctionType to AggregateFunction and creating a new enum, CalculationFunction (or maybe we can come up with a better name?).
  • We can use the same approach as before (query rewriting) to handle different dialects, but we need to make sure that we pick function names that are widely accepted and understandable to the user. Personally I prefer YEAR(x) over EXTRACT(YEAR FROM x) as the syntax is clearer and there are no constants inside the parameter, which is more java-ish. The downside is that we will then also need a MONTH(x), DAY(x) etc. function but that's not a biggie I think.
One last note - we should also consider if it's reasonable to keep using enums. Maybe we should switch to interfaces (and constants in the interface to ensure no API changes) for the sake of extensibility.


DataCleaner 1.3 and MetaModel 1.0.3 released

Today is the day that we've released DataCleaner 1.3 and MetaModel 1.0.3. Here's a summary of changes:

DataCleaner 1.3
This release contains many new features and improvements.

  • You can save and load work in the Profiler and Validator.
  • Support for working with data quality in XML files.
  • Functionality to swap/dice columns and rows in profiler result tables.
  • A more modular and plugin-friendly UI.
  • Various bugfixes, optimizations, and wider database support.
MetaModel 1.0.3
A minor release. The only new feature is the capability to parse and map XML files to the schema-based model. The release also includes some documentation improvements and minor bugfixes.


What to remember when presenting OS BI

As I wrote in my previous blog post, I went to Danish IT the other day to talk about Open Source BI. I've spent a lot of time the last two days contemplating the presentation and how the audience perceived OS BI as pretty immature and insecure... Somehow some of my points about how one could take advantage of Open Source instead of seeing it as a threat and a risk got lost in the mix. I will try to sum up some of the thoughts I've had on what went wrong and how to present OS BI products to people who are unfamiliar with Open Source.

  • First off, the presentation wasn't really a "sales meeting", so I took on the academic perspective and showed the audience a broad view of the OS BI arena, including two alternatives for each product group: databases, ETL, reporting and OLAP. This was a really bad idea: presenting the alternatives within OS was simply too much. Instead of being impressed that there was room for rivalry between and within the OS communities, the audience took this as a bad sign - fragmentation, instability etc. So instead, just show the business people a single suite of products, a silver bullet, even though we all know that this does not exist (neither does it in the commercial world, which is a good thing).
  • Taking too feature-driven a product focus was not the best idea either. Sure, you should point out all the good features of OS BI products, but in my demonstrations I focused on the great computational power and advanced features instead of showing some nice user interfaces. In selecting what to present I would definitely recommend using more user-friendly products like Eclipse BIRT, Talend, OpenLaszlo, etc. Instead I showed them the Pentaho suite, which for a large part has a somewhat boring theme.
  • Remember to give them a list of companies that have already undergone OS BI initiatives. I forgot this completely and it was a big mistake. I hope the attendees followed my advice to go check it out themselves at Pentaho's and JasperSoft's websites.
  • Stress that participation does not necessarily require coding skills. Show them how support forums work and that the communication part is just as important. Without useful information the developers are lost and will probably not focus on the exact same things as the customers do.
  • Get more authority into the room. Hard as it is to admit, I have a hard time commanding authority in a room of business people because I'm more of the academic type. So bring along colleagues and trustees to help convince the audience that your message is legitimate. This will also help potential customers understand that you're not the only one in the business who cares about OS BI - it tells them that there are others, and certainly enough to get started with consulting, training and recruiting.
  • Show them the numbers of Open Source. David Wheeler's article should give you some good starting points. Address the fact that there are plenty of developers and tell them what motivates them (these points only seemed to kick in when I told them about my own OS projects, so I must not have made it clear from the beginning)... Learning, reputation, ideology and "real" work/sponsored developers.
On a positive note, however (seeing that the points above may leave you with the impression that the meeting went all wrong, which is not the case), there were several positive experiences. It helped a lot to show the actual community webpages and how the development process worked. I did this as my last topic, but it should have come earlier to give the audience a feeling for the underlying ideas. Also, supplementing with notes on related Open Source products helps people understand that this is not just a "crazy idea" that popped up for BI. Show them JBoss, Apache, Linux, OpenOffice, Eclipse, Mozilla etc. And then tell them about integration, SOA etc., which are architectural challenges that are very suitably overcome using Open Source based solutions.


Addressing Open Source BI and data quality at Danish IT

I've been invited to address the Danish IT (Dansk IT) networking group on Business Intelligence next Wednesday. I'll concentrate on Open Source business models, pitfalls and opportunities, and an overview of the Open Source BI market including demos of various tools - including DataCleaner, of course. I'm thinking this is a great opportunity to get people involved with OS BI - an area that has been somewhat overlooked, at least in Denmark, I think.

Update: Just got home from the networking group and it was a very interesting day, packed with lots of discussions and perspectives on BI products. I can't say that everybody was convinced about going Open Source for their BI solutions, but they definitely got an impression of what goes on, and most people there were very interested in DataCleaner - perhaps because applying our data quality solution has so few implications for the rest of the BI portfolio.

Update: You can now download my slides about Open Source Business Intelligence and please let me know what you think.


Report on DataCleaner development process

As some of you may know, DataCleaner started as an academic project for me, investigating how Open Source projects are established, managed and developed. I've been waiting a long time for the evaluation of the project, but yesterday I finally got the results, and I'm proud to announce that I got the top grade for the assignment! (12 in the Danish grading system, which spans from -2 to 12 - I'll have to blog about the grading system some time, it's hilarious).

The project received notable credit for the explorative style of development, and this is something I'm very proud to keep on practicing. I'm publishing the report for free download here, but I'm afraid it's in Danish, so if you don't understand it you'll have to ... learn Danish ;-)

I would appreciate any kind of feedback on my research, and I don't mind criticism now that I have the acknowledgment of Copenhagen Business School, heh.


Fluent interfaces in MetaModel

Just spent a couple of hours making the Query class of MetaModel implement fluent interfaces. Damn this syntax looks great:

new Query()
    .select(myColumn)
    .selectCount()
    .from(myTable)
    .where(myColumn, OperatorType.EQUALS_TO, "foobar")
    .groupBy(anotherColumn)
    .orderBy(myColumn);

This was one of the last TODO's with MetaModel in this round so I think we're going to release version 1.0 pretty soon! Stay tuned.
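For those wondering what makes this chaining tick: each mutator simply returns this, so calls can be strung together. Here is a toy sketch of the pattern - not the real MetaModel Query class, just the fluent-interface trick in isolation:

```java
// Toy illustration of the fluent-interface pattern: every mutator
// returns "this" so calls can be chained. Not the real MetaModel Query.
public class FluentQuery {
    private final StringBuilder sql = new StringBuilder();

    public FluentQuery select(String column) {
        sql.append("SELECT ").append(column);
        return this;
    }

    public FluentQuery from(String table) {
        sql.append(" FROM ").append(table);
        return this;
    }

    public FluentQuery where(String condition) {
        sql.append(" WHERE ").append(condition);
        return this;
    }

    public String toSql() {
        return sql.toString();
    }

    public static void main(String[] args) {
        String sql = new FluentQuery()
                .select("name").from("persons").where("age > 30")
                .toSql();
        System.out.println(sql); // SELECT name FROM persons WHERE age > 30
    }
}
```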


DataCleaner looks...

Been working a couple of days on a new great look for DataCleaner, check it out:

I'm starting to get real excited about releasing version 1.2, it'll be a radical improvement in terms of both visual experience and functionality I think!


How to process millions of resultset rows in java

I'm so excited, since I just think that we've solved a very common problem in java applications that have to deal with huge amounts of data. Here's the trouble:

  1. Even though the JDBC spec defines a way to specify the fetch size when executing queries, some drivers do not implement this feature, which means your program will run out of memory if you query e.g. a couple of million records.
  2. Even if your driver works as it is supposed to (a reasonable assumption in most cases), there's still no effective way to optimize the computation of the many records through multithreading, since the data is streamed through a single connection.
Because of the power of the MetaModel schema and query model we've been able to create a generic mechanism for splitting up a query into other queries that each yield fewer rows but the same collective result. The way we do this is by identifying attributes that can be used to filter in WHERE clauses, for example:
  • Consider we want to split up the query: "SELECT name, email FROM persons"
  • We will investigate the persons table and find columns that can be used to split the total resultset. We might find a reasonable age column for this, so the query could be split into:
  1. SELECT name, email FROM persons WHERE age < 30 OR age IS NULL
  2. SELECT name, email FROM persons WHERE age > 30 OR age = 30
Depending on the desired size of the partial queries we will split up by finding further columns or by defining finer intervals to split by. Here's how it works:
DataContext dc = ...
Query q = ...

QuerySplitter qs = new QuerySplitter(dc, q);
List<Query> queries = qs.splitQueries();
I'd love to know what you all think of this. Personally I think it's a lovely way to optimize memory consumption, and it offers new ways to utilize grid computing by distributing partial queries to different nodes in the grid for remote processing. Also, a lot of databases (MySQL for example) dedicate only a single thread per query - so by splitting the queries one could further optimize multithreading on the database side.
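The partitioning idea can be sketched independently of MetaModel. The toy version below is hypothetical and works on raw SQL strings rather than Query objects (the real QuerySplitter rewrites the query model itself); it carves a numeric column's value range into slices, routing NULLs into the first slice as in the example above:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the partitioning idea behind QuerySplitter: carve a numeric
// column's value range into intervals and emit one WHERE clause per slice.
// Every row lands in exactly one slice, so the queries' combined result
// equals the original query's result.
public class RangeSplitDemo {
    static List<String> split(String baseSql, String column, int min, int max, int partitions) {
        List<String> queries = new ArrayList<>();
        int step = (max - min) / partitions;
        for (int i = 0; i < partitions; i++) {
            int lower = min + i * step;
            String where;
            if (i == 0) {
                // The first slice also picks up NULLs so no row is lost.
                where = column + " < " + (lower + step) + " OR " + column + " IS NULL";
            } else if (i == partitions - 1) {
                // The last slice is open-ended to catch the remainder.
                where = column + " >= " + lower;
            } else {
                where = column + " >= " + lower + " AND " + column + " < " + (lower + step);
            }
            queries.add(baseSql + " WHERE " + where);
        }
        return queries;
    }

    public static void main(String[] args) {
        List<String> parts = split("SELECT name, email FROM persons", "age", 0, 90, 3);
        parts.forEach(System.out::println);
    }
}
```

Each resulting query can then be shipped to a separate connection, thread or grid node.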


Community choice awards

I just signed DataCleaner up for SourceForge's Community Choice Awards 08 :) I definitely don't count on us winning anything though... Not because we don't deserve it (we do) but because no one knows about us (yet). Anyhoot, you can all help change that - go nominate us for "best new project" and "best project for the enterprise"!

Click the banner to the right to give us your vote!


Drill-to-detail in profiling results

I've just added a new feature to DataCleaner that I'm very happy with. Now when you see a profiling result you can interact with it and drill to details on measures of interest, to gain insight into how those measures are composed. Consider the screenshot below (sorry for the ugly window borders, I'm running Linux here and we're still trying to tweak the UI to look sweet with the JGoodies look'n'feel).

What you're seeing is that the pattern finder identified a couple of patterns for the ADDRESSLINE2 column, and I've just clicked the "aaaaa 999" pattern to find out which values were categorized as fitting this pattern. This queries (drills to details in) the datastore for these columns. Similarly I could have wanted to find out which values yielded the "max chars", "min chars" or other measures in the profile result matrix above. If I do that, it will similarly query the values that originated the measures.

So how is this possible? Thanks to MetaModel it's quite easy to model such data explorations! Even though some of these queries cannot be transformed into SQL, it is possible to post-process them using row filters that filter out only the correct rows based on an initial query. Let's look at an example: how to drill to details on the "max words" measure in String analysis:

//For each column result...
MatrixValue mv = matrixValues[7]; // 7 is the index of the "max words" measure

//Construct a query that gets as close as possible to the desired result
Query q = new Query();
q.addGroupBy(column).addSelect(new SelectItem(FunctionType.COUNT, "*", null));

//Set the query as the source of the details

//Set a post processing row filter to include only those
//rows that have the right number of words
//(something that cannot be expressed in a query)

mv.addDetailRowFilter(new IRowFilter() {
  public boolean accept(Row row) {
    Object value = row.getValue(column);
    if (value != null
          && new StringTokenizer(value.toString()).countTokens() == numWords) {
      return true;
    }
    return false;
  }
});

The code above is a slight rewrite of the real code for String analysis, and it shows how easy it is (and ought to be) to attach a query to a value in a profile result! Drill-to-detail features are now a common thing in DataCleaner, and in my opinion they lift the profiler functionality to a whole new level.
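The essence of this post-processing approach can be shown without the DataCleaner/MetaModel types. In the toy sketch below, plain Strings stand in for rows and the helper names are mine, not DataCleaner API; the point is that the word-count predicate runs after a broader query has already returned its values:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Self-contained sketch of the post-processing filter idea: keep only
// the values with a given number of words, applied client-side because
// "number of words" cannot be expressed in the SQL query itself.
public class WordCountFilterDemo {
    static List<String> filterByWordCount(List<String> values, int numWords) {
        List<String> result = new ArrayList<>();
        for (String value : values) {
            if (value != null && new StringTokenizer(value).countTokens() == numWords) {
                result.add(value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> values = List.of("Main Street 123", "Suite 5", "PO Box 42");
        System.out.println(filterByWordCount(values, 3)); // [Main Street 123, PO Box 42]
    }
}
```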



Demonstrating the mutable query

Wanting to manipulate queries in an object-oriented manner is not, I think, just my personal ambition; the lack of such a facility has been a strong barrier for a lot of data-centric applications out there. Considering the many dialects of SQL, it is also striking how few attempts there have been to unify these dialects using a common meta-layer. Sure, there are very successful (and good!) ORM frameworks like Hibernate and JDO that have built-in dialect translation, but when you think about it they actually apply to a very limited scenario where you know in advance how the database is structured and which objects you want to store the content of the database in. Consider instead a model of a database that is not a mapping to domain objects, but rather an object-oriented wrapper for the database - since the database is the domain. The advantages of an object-oriented domain model for the database are obvious; here are a couple of pointers:

  • Traversing a database (schemas, tables, columns, relationships etc.) becomes as easy as it is to traverse any other domain model.
  • Building and manipulating queries can be a continuous and type-safe process. This gives us the opportunity to do meaningful, fault-tolerant query modifications in order to recover from database insufficiencies, and to interpret the query not only as SQL but as a generic query built for non-SQL datastores.
  • It is possible to mimic a database with other data technologies, for example flat/CSV files, Excel spreadsheets, web services or hardcoded datastores.
  • Building systems to support the database structure (for example metadata applications) becomes possible without extensive JDBC hacking.
Anyhoot, look no further, because we have invested considerable time into these issues by developing eobjects MetaModel. Let me show you how you build queries using the MetaModel framework in a totally type-safe manner!

java.sql.Connection connection = ... //Obtain a JDBC connection
DataContext dataContext = DataContextFactory.createJdbcDataContext(connection);
Schema[] schemas = dataContext.getSchemas();
Table customerTable = schemas[0].getTableByName("customers");

Query q = new Query();
q.addFrom(customerTable);
q.addSelect(new SelectItem(customerTable.getColumnByName("name")));
DataSet dataSet = dataContext.executeQuery(q);
while (dataSet.next()) {
  Row row = dataSet.getRow();
  //... process the row
}

Of course in this example a couple of hardcoded values are entered, such as the "customers" literal, but I could just as well have traversed the schema by going through every table without any hard references.

So I guess the most striking question is: doesn't it cause inefficiencies to generalize queries like this? And can I still execute my custom SQL-string query? The answer to the latter is "no, because that goes against the whole idea of MetaModel". You shouldn't be able to do anything that is not portable to other datastores. If you want that, you can just as well do it "the old" JDBC way, and of course we don't prohibit you from doing so. But let's instead take a look at how to do regular (portable and standards-compliant) SQL queries in MetaModel:

Query q = new Query();
q.addFrom(new FromItem(schema.getTable("products")).setAlias("p"));

Column productTypeColumn = schema.getTable("products").getColumn("product_type");
q.addSelect(new SelectItem(productTypeColumn));
q.addGroupBy(productTypeColumn);

//yields: "SELECT p.product_type FROM products p GROUP BY p.product_type"

Notice here that there are multiple overloaded addFrom() methods in Query. In the first example we added a table as the argument; in this example we used a FromItem and gave it an alias. The same principle applies to the other parts of the query. Let's see it by adding an aggregate function to our query:

Column priceColumn = schema.getTable("products").getColumn("price");
q.addSelect(new SelectItem(FunctionType.SUM, priceColumn));
//yields: "SELECT p.product_type, SUM(p.price) FROM products p GROUP BY p.product_type"

Okay, it seems pretty straightforward, eh? That's because it is. So get started and use MetaModel. Here are some more examples for ya, so you can see how easy it is to build even complex queries:

Column marketYearColumn = schema.getTable("products").getColumn("market_year");
SelectItem yearSelect = new SelectItem(marketYearColumn);

q.addWhere(new FilterItem(yearSelect, OperatorType.HIGHER_THAN, 2004));
//yields: "SELECT p.product_type, SUM(p.price) FROM products p WHERE p.market_year > 2004 GROUP BY p.product_type"

q.addOrderBy(new OrderByItem(q.getSelectClause().getItem(0), false));
//yields: "SELECT p.product_type, SUM(p.price) FROM products p WHERE p.market_year > 2004 GROUP BY p.product_type ORDER BY p.product_type"


A meta domain model

In the field of object orientation it has for some years been common to talk about modelling the domain; to create domain models. This makes perfect sense if you're a car dealer or a pet shop, i.e. a business with a specific domain. But what about those development projects that do not apply to a specific domain, but to all kinds of domains?

This was the situation when I decided to start the DataCleaner project some months ago. What I needed was a domain model concerned with data sources, their structures and the data they contain. I searched the Open Source offerings for this and, to my surprise, didn't find any solid attempts. Sure, there are object-relational frameworks like Hibernate which enable you to create your own domain model objects and "easily" map them to a database, but it's not possible to map the database itself to objects that represent the structure of the database.

Initially I just started making such a model myself - I mean, how hard could it be? The result was the DataCleaner metadata & data layer, which works fine in DataCleaner but wasn't really developed for reuse in other applications. So now I've started a new project, which I'm calling eobjects.dk MetaModel. The MetaModel project takes off where the DataCleaner metadata & data layer stops. We have classes like Schema, Table, Column etc., but we will try to remove (encapsulate) any tie to JDBC, because if you want to see a messy API (or more correctly: messy implementations), then look at JDBC. MetaModel will also take advantage of "new" language constructs in Java like enumerations, generics etc.

Here are some of the plans for MetaModel:

  • Schema model: Schema (and unification of the very ambiguous Catalog and Schema terminology in JDBC), Table, Column, ColumnType (enum), TableType (enum) etc.
  • Query model: Query, SelectClause, FromClause, WhereClause, GroupByClause, HavingClause, OrderByClause etc.
  • Data model: Dataset (with Streaming and keep-in-memory options), Row
My hope for this is to make an API which makes it possible to interact with your database in a type-safe manner and avoid query problems, hardcoded literals in the code etc.


Querying data with DataCleaner-core

I promised the other day that I would return on the topic of using the metadata & data layer of DataCleaner-core. So here's what we'll do;

1) Open up a connection to the database (this is plain old JDBC). Here's how to do it with Derby, but any database could do:

Connection con = DriverManager.getConnection("jdbc:derby:my_database;");

2) Let's create a schema object for an easier, object-oriented way of accessing the data.

JdbcSchemaFactory schemaFactory = new JdbcSchemaFactory();
ISchema[] schemas = schemaFactory.getSchemas(con);

Note that by default the JDBC schema factory only retrieves relations of type "TABLE" from the database. In some situations, though, you may wish to broaden this restriction, for example to include views:

JdbcSchemaFactory schemaFactory = new JdbcSchemaFactory();
schemaFactory.setTableTypes(new String[] {"TABLE","VIEW"});
ISchema[] schemas = schemaFactory.getSchemas(con);

3) Let's try exploring our metadata, consisting of schemas, tables and columns.

ITable productTable = schemas[0].getTableByName("Products");
IColumn[] productColumns = productTable.getColumns();
IColumn productCodeColumn = productTable.getColumnByName("product_code");

//This next int represents one of the constants in java.sql.Types.
int productCodeType = productCodeColumn.getColumnType();
boolean isProductCodeLiteral = MetadataHelper.isLiteral(productCodeType);
boolean isProductCodeNumber = MetadataHelper.isNumber(productCodeType);

4) Time to make a query or two. Let's start off by just querying the whole table and then querying two specific columns.

JdbcDataFactory dataFactory = new JdbcDataFactory();
IData someDataThatWeWillDiscard = dataFactory.getData(con, productTable);
IData data = dataFactory.getData(con, productTable, productCodeColumn, anotherColumn);
while (data.next()) {
  IRow row = data.getRow();
  int count = data.getCount();
  System.out.println("Observed " + count + " rows with product code: " + row.getValue(productCodeColumn));
}


Notice the IData.getCount() method, which is crucial to understand. The data factory will try to generate a GROUP BY query to optimize the traffic load between database server and client. Sometimes this is not possible, though (for example for TEXT types in Derby, where GROUP BY is not allowed). The getCount() method returns how many occurrences there are of this distinct combination of values, represented by the IRow interface. So make sure to always check the count; there may be fewer rows in the result than in the actual database, because the results have been compressed!

Observe in general what a strongly typed API this is. In other data-oriented APIs one would have to type the same column name in several places (at least in the query and when iterating through the results), but with the DataCleaner-core metadata and data layer we get a completely object-oriented and type-safe way to do this. The amazing thing about this API is that we could just as well have done the same thing with flat files or other data source types.
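A toy illustration of the compressed-result semantics (hypothetical values, no JDBC involved): counting the result rows and summing the per-row counts give different answers, and only the sum reflects the actual database.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the "compressed" result idea: each distinct value appears
// once, together with its occurrence count (the result of a generated
// GROUP BY query). The true number of underlying rows is the sum of
// the counts, not the number of result rows.
public class CompressedResultDemo {
    static int totalRowCount(Map<String, Integer> compressed) {
        return compressed.values().stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        Map<String, Integer> compressed = new LinkedHashMap<>();
        compressed.put("MM-1001", 3); // product code "MM-1001" occurred 3 times
        compressed.put("MM-1002", 1);
        compressed.put("MM-1003", 5);

        System.out.println(compressed.size() + " result rows, "
                + totalRowCount(compressed) + " database rows"); // 3 result rows, 9 database rows
    }
}
```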


Getting uispec4j to run in Hudson on a headless linuxbox

I've been fiddling around with this problem for some time now and I finally got all the pieces together so I guess I'd better share my newfound knowledge on these obscure topics that I hope I'll never have to encounter again.

It all started with a fine and dandy new test framework called uispec4j that we wanted to use for DataCleaner's GUI. Uispec4j is supposedly "Java GUI testing made simple", and it caught our attention because the code coverage of DataCleaner GUI was not that impressive (yet - if you read this blog post and a lot of time has passed, it may hopefully be looking better).

So we started off by creating some neat unit tests for DataCleaner GUI using uispec4j. Hurray. They worked fine and dandy on our Windows development machines, so we uploaded them to the repository and into the Continuous Integration loop. This is where hell broke loose.

First off, our Continuous Integration server was headless (i.e. no screens, monitors, displays, whatever - just a remote console). Surely this wouldn't do, because uispec4j requires a window manager to use for emulating the Java GUI. Fair enough, I installed X with the Xfce window manager:

apt-get install xorg xfce4

Then came the next problem. When starting X a fatal error occurred, telling me that no screens were installed. That seems fairly reasonable, but what the heck should I do about it? I decided to install a VNC server to host a remote screen. This would hopefully rid me of my troubles, since I didn't have the (physical) room for installing a monitor for the damn thing.
apt-get install vncserver

After configuring the vncserver I tried running my tests... Next obstacle: telling Java which screen to use. This required setting the DISPLAY environment variable in /etc/profile:
export DISPLAY=:1

Now came the time for some mind-boggling uispec4j errors. I found out that uispec4j only works with Motif on Linux, so you have to append "-Dawt.toolkit=sun.awt.motif.MToolkit" to your command line like this:
mvn install -Dawt.toolkit=sun.awt.motif.MToolkit

every time you need to build the damn thing. Sigh, this wasn't something that my Continuous Integration system (Hudson) was built for, so I started editing various batch scripts to see if appending the damn "-Dawt.toolkit=sun.awt.motif.MToolkit" parameter to my container's startup script would work, but no. Instead I found out that you could set the MAVEN_OPTS environment variable, so I did that in /etc/profile:
export MAVEN_OPTS="-Dawt.toolkit=sun.awt.motif.MToolkit"

But that didn't work either, because Hudson doesn't comply with the damn thing :( I tried to set the "awt.toolkit" system property using some static initializers (which I generally think is a poor, poor, poor thing to do in Java), but guess what? Uispec4j is filled with static initializers as well, so that brought me no guarantee that my static initializer would run first. (edit: Apparently I might be wrong in this claim about uispec4j, check out the comments for more details).

Finally I got a new version of Hudson that had a per-project configuration of MAVEN_OPTS and that did the job. The last issue was actually a JVM issue. I had to change the runtime user of my J2EE container to be the same user that hosts the VNC server instance. If you try to access another users desktop, the JVM turns fatal. So don't touch my desktop or you'll get your fingers burnt!

Ah, and a last thing about GUI testing: make sure to set the Locale in your JUnit setUp methods, or else the unit tests won't be portable between computers with different languages when you assert on the labels of UI elements.
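Here is a minimal sketch of that Locale pinning (plain Java; in a real suite the two methods would live in JUnit's setUp()/tearDown()):

```java
import java.util.Locale;

// Sketch of pinning the JVM default Locale before UI-label assertions,
// so tests behave the same on machines with different system languages.
public class LocalePinningDemo {
    private static Locale originalLocale;

    static void setUp() {
        originalLocale = Locale.getDefault();
        Locale.setDefault(Locale.ENGLISH); // labels now render in English everywhere
    }

    static void tearDown() {
        Locale.setDefault(originalLocale); // restore, so other tests are unaffected
    }

    public static void main(String[] args) {
        setUp();
        try {
            System.out.println(Locale.getDefault().getLanguage()); // en
        } finally {
            tearDown();
        }
    }
}
```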

I once heard a very wise colleague and fellow developer say:
"You should test functionality and domain models through unittesting and test UI through UI!"



Maven-ing your way around DataCleaner

There seem to be quite a few frustrations for old ANT users switching to Maven, so I thought I would make a small post about the main differences and various hacks that are useful to know as a Maven user. The good thing about ANT is that you can always hack your way around a problem, and it's quite easy to find the problems that are stopping you. The bad thing is, of course, that the build files seem to grow enormously and that you have to support your project infrastructure with some kind of common JAR-download area like an FTP server or similar. In contrast, Maven focuses not on the build as a process but on the content of the build, because 99% of the time the process of building a Java project is pretty much the same - so why not omit the "how" completely and only focus on the "what" of your build? This "what" is configured in the pom.xml file!

Admitted, that was not my primary reason for choosing Maven! :) The thing that won me over was of course the dependency handling system which I really love and loathe a bit at the same time. What you need to be aware of about the dependencies is this:

  • Maven automagically creates a local repository for all the JARs you use in your projects.

  • There's also a central repository where maven will download the JARs from, if they are not found in the local one.

  • If you are working offline or behind a proxy and you need a new JAR, you're bound to mess this up :( When Maven can't find its JARs in the central repository or locally, it will blacklist them!

  • You can however undo the blacklisting by removing (part of) the local repository, which is found in ~/.m2/repository:

    windows: C:/Documents and Settings/[username]/.m2/repository

    or linux: /home/[username]/.m2/repository).

OK, so that was the background knowledge you had to know - now for some of the build goals. The most used Maven goal is "install", often prefixed with "clean", like this:

mvn clean install

The install goal will build the project, run the unit tests, verify that everything worked and then install the resulting JAR/WAR/whatever into your local repository. This means that you can then use the project as a dependency in another project, smart eh?
And now for some other commonly used goals:

mvn site

Creates a nifty project site with all sorts of nice information and reports (javadoc, unit tests, code coverage etc., depending on your configuration).

mvn install -Dmaven.test.skip=true

Ah, the "skip test" parameter. I spent a long time figuring that one out. This is handy if you're working with several projects at the same time and you've (consciously) broken the build and want to keep on using the dependency.

mvn jetty:run

My new DataCleaner-webmonitor favourite. This will bring up a Jetty container with DataCleaner-webmonitor running on localhost. This of course requires a little configuration in pom.xml; I'm sure you can figure it out, just find the plugin elements that have to do with Jetty :)


Traversing schemas with DataCleaner-core

DataCleaner being a new framework, a lot of you probably wonder how to use it as a Java API. Unlike a lot of other tools I've seen in the Business Intelligence domain, DataCleaner was built bottom-up from a developer's perspective, and the User Interface was added afterward - so using the DataCleaner-core API can be a real pleasure ... (so much for user orientation, I'll have to elaborate on that another time)

Let's take a look at how to get the data of a CSV file. This will give us the data of the file using the same interfaces as JDBC-databases, excel-files and possibly other data sources.

ISchemaFactory schemaFactory = new CsvSchemaFactory();
File file = new File("my_file.csv");
ISchema[] schemas = schemaFactory.getSchemas(file);

Or in the case of a JDBC connection:

ISchemaFactory schemaFactory = new JdbcSchemaFactory();
Connection connection = DriverManager.getConnection("jdbc:my:database://localhost/foobar");
ISchema[] schemas = schemaFactory.getSchemas(connection);

The schemas retrieved here can be accessed in a very natural way and with a strong domain model, unlike traversing schemas in JDBC. Here are some examples:

ITable[] tables = schemas[0].getTables();
IColumn[] columns = tables[0].getColumns();
String columnName = columns[0].getName();

Handling the schemas this way serves an obvious purpose. We can now design our profiles, our validation rules etc. in a very uniform way that can be reused across data source types. We'll talk about that next time :)