May 2008

20080515

Community choise awards

I just signed DataCleaner up for sourceforge's community choise awards 08 :) I definately don't count on us winning anything though... Not because we don't deserve it (we do) but because noone knows about us (yet). Anyhoot, you can all help change that - go nominate us for "best new project" and "best project for the enterprise"!

Click the banner to the right to give us your vote!

20080514

Drill-to-detail in profiling results

I've just added a new feature to DataCleaner that I'm very happy with. Now when you see a profiling result you can interact with it and drill to details on measures of interest to gain insight into how those measures are composed. Consider the screenshot below (sorry for the ugly window borders, I'm running linux here and we're still trying to tweak the UI to look sweet with the jgoodies look'n'feel).

What you're seeing is that the pattern finder identified a couple of patterns for the ADDRESSLINE2 column and I've just clicked the "aaaaa 999" pattern to find out which value where categorized to fit this pattern. This queries (drills to details) the datastore for these columns. Similarly I could have wanted to find out which values yielded the "max chars", "min chars" or other measures in the profile result matrix above. If I do that it will similarly query the values that originated the measures.

So how is this possible? Thanks to MetaModel it's quite easy to model such data explorations! Even though some of these queries cannot be transformed into SQL it is possible to post-process them using row filters that filter out only the correct rows based on an initial query. Let's look at an example, how to drill to details on the "max words" measure in String analysis:

//For each column result...
MatrixValue mv = matrixValues[7]; // 7 is the index of the "max words" measure

//Construct a query that get's as close as possible to the desired result
Query q = new Query();
q.addSelect(column);
q.addFrom(column.getTable())
q.addGroupBy(column).addSelect(new SelectItem(FunctionType.COUNT,"*",null));

//Set the query as the source of the details
mv.setDetailSource(q);

//Set a post processing row filter to include only those
//rows that have the right number of words
//(something that cannot be expressed in a query)
mv.addDetailRowFilter(new IRowFilter() {
public boolean accept(Row row) {
Object value = row.getValue(column);
if (value != null &&
new StringTokenizer(value.toString()).countTokens() == numWords) {
return true;
}
return false;
}
});

The code above is a slight rewrite of the real code for String analysis and it shows how easy it is (and ought to be) to attach a query to a value in a profile result! Drill-to-detail features are now a common thing in DataCleaner and in my oppinion it leverages the use of the profiler functionality to a new level.

20080507

That was easy

This week's work...
MetaModel is now a part of the DataCleaner stack!

20080503

Demonstrating the mutable query

Having wanted for a long time to be able to manipulate queries in an object-oriented manner is not something that I think has been just my personal ambition, but a strong barrier for a lot of data-centric applications out there. Considering the many dialects of SQL it is also striking how few attempts there has been to unify these dialects using a common meta-layer. Sure there are very succesfull (and good!) ORM frameworks like Hibernate and JDO that have built in dialect-translation but when you think about it they actually apply to a very limited scenario where you know in advance how the database is structured and what objects you want to store the content of the database. Consider instead a model of a database that is not a mapping to domain objects, but more an object-oriented wrapper for the database since the database is the domain. The advantages of an object-oriented domain model for the database is obvious, here are a couple of pointers;

Traversing a database (schemas, tables, columns, relationships etc.) becomes as easy as it is to traverse any other domain model.
Building and manipulating queries can be a continous and type-safe process. This gives us the opportunity to do meaningful fault-tolerant query-modifications in order to recover from database-insufficiencies and to enterpret the query not only as SQL, but as a generic query built for non-SQL datastores.
It is possible now to mimic a database with other data technologies, for examples flat/csv files, excel spreadsheets, web services or hardcoded datastores.
Building systems to support the database structure (for example metadata applications) becomes possible without extensive JDBC hacking.

Anyhoot, look no further, because we have invested considerable time into these issues by developing eobjects MetaModel. Let me show you how you build queries using the MetaModel framework in a totally type-safe manner!

java.sql.Connection connection = ... //Obtain a JDBC connection
DataContext dataContext = DataContextFactory.createJdbcDataContext(connection);
Schema[] schemas = dataContext.getSchemas();
Table customerTable = schemas[0].getTableByName("customers");

Query q = new Query();
q.addFrom(customerTable);
q.addSelect(customerTable.getColumns());
DataSet dataSet = dataContext.executeQuery(q);
while (dataSet.next()) {
Row row = dataSet.getRow();
System.out.println(row.toString());
}

Of course in this example a couple of hardcoded values are entered such as the "customers" literal, but I could have just as well traversed the schema by going through every table without any hard references.

So I guess the most striking question is: Doesn't it cause some ineffeciencies to generalize queries like this? And can I still execute my custom SQL-string query? The answer is "no, because that goes against the whole idea of MetaModel". You shouldn't be able to do anything that is not portable to other datastores. Then you could just as well do it "the old" JDBC way and of course we don't prohibit you from doing so. But let's instead take a look at how to do regular (portable and standards-compliant) SQL queries in MetaModel:

Query q = new Query();
q.addFrom(new FromItem(schema.getTable("products")).setAlias("p");
q.addSelect(schema.getTable("products").getColumn("product_type"));
q.addGroupBy(schema.getTable("products").getColumn("product_type"));

System.out.println(q.toString());
//yields: "SELECT p.product_type FROM products p GROUP BY p.productType"

Notice here that there are multiple overloaded addFrom() methods in Query. In the first example we added a table as the argument, in this next example we used a FromItem and gave it an alias. The same principle applies to the other parts of the query, lets see it by adding an aggregate function to our query:

Column priceColumn = schema.getTable("products").getColumn("price");
q.addSelect(new SelectItem(FunctionType.SUM, priceColumn);
System.out.println(q.toString());
//yields: "SELECT p.product_type, SUM(p.price) FROM products p GROUP BY p.productType"

Okay, it seems pretty straight forward eh? That's because it is. So get started and use MetaModel. Here's some more examples for ya so you can see how easy it is to build even complex queries:

Column marketYearColumn = schema.getTable("products").getColumn("market_year");
SelectItem yearSelect = new SelectItem(marketYearColumn);

q.addWhere(new FilterItem(yearSelect , OperatorType.HIGHER_THAN, 2004));
System.out.println(q.toString());
//yields: "SELECT p.product_type, SUM(p.price) FROM products p WHERE p.market_year > 2004 GROUP BY p.productType"

q.addOrderBy(new OrderByItem(q.getSelectClause().getItem(0),false));
System.out.println(q.toString());
//yields: "SELECT p.product_type, SUM(p.price) FROM products p WHERE p.market_year > 2004 GROUP BY p.productType ORDER BY p.product_type"

kasper's source