The value of data quality

The other day I stumbled into an interesting blog post about software quality, and this brilliant quote (by Alan Weiss) kept flashing on my internal billboard over and over again...

Quality [...] is not the absence of something in management's eyes, that is, defects, but the presence of something in the consumer's eyes, that is, value.

What struck me is that this explains a pattern I often notice when people tell me they are using DataCleaner or competing tools: they find value in even some of the simplest functionalities in there. Even the simplest of features can provide a fortune of value. Coming from the world of tools and product development, we tend to look a lot at feature comparisons and technical capabilities, and customers use such arguments for choosing a tool as well. Of course that makes a lot of sense, because quality, in the sense of the amount of value provided to the consumer, is very hard to measure and compare.

So how can a software product deliver high quality, in the sense of consumer value? I believe it is a mix of making the product fairly easy to use and solving concrete problems for the consumer. In DataCleaner we've done a lot of work to make the tool work as a generic Data Quality Analysis (DQA) tool for a wide variety of data types. But maybe we should also consider building more domain-targeted packages that let you easily do a "data value assessment" (to twist the words of Alan Weiss a little) for particular domains, e.g. customer data, product data, geographic data and more.

What do you think? Should data profiling and DQA stay generic, or should it target specific domains? Can it do both?


Speed up your JDBC queries with MetaModel

Although the recent MetaModel version 1.7.1 was a minor release, there is something about it that excites me very much: our new FETCH_SIZE calculator. What is that, you might ask? Well, here's a little background.

In JDBC there is an option to set the so-called FETCH_SIZE of a query statement, like this:

Statement st = connection.createStatement();
st.setFetchSize(20000);
ResultSet rs = st.executeQuery(...);

What this does is tell the database how many rows to fetch at a time whenever more rows are needed. In other words: a buffer size, measured in number of rows.

Cool – so this means you should set a rather high fetch size if you have plenty of memory? Say 20,000 records? Well, sometimes yes, but not always!

The people over at Oracle have done exemplary work documenting the memory consumption of their latest JDBC driver. From this you can learn that if you set the fetch size too high, you might run out of memory! But on the other hand, you would really like to optimize the buffering of your result sets. The trouble is that the in-memory row size differs hugely depending on whether you select 1, 2, 10 or 300 columns in your query! Furthermore, a VARCHAR(4000) value in the buffer is about 8,000 times larger than a BIT value! These findings led to a very nice new feature in MetaModel: dynamically setting an appropriate fetch size depending on the query you are executing...

How does this work? Well, basically MetaModel has all the meta-information needed for a strictly mathematical calculation. MetaModel knows the data types of all your columns, it knows their column sizes and, most importantly, it knows which columns you are querying! So if we want to allow e.g. up to 8 megs of memory to be used for the query buffer, it is very easy for us to figure out the needed FETCH_SIZE – in principle we just divide the available memory by the size of a single row in the buffer.
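To make the arithmetic concrete, here is a minimal sketch of the idea. This is not MetaModel's actual implementation; the class name, the byte-size estimates per column and the clamping bounds are all my own assumptions, chosen just to illustrate dividing a memory budget by an estimated row size.

```java
// A simplified sketch (NOT MetaModel's actual code): estimate the size of
// one buffered row from the selected columns' approximate in-memory value
// sizes, then divide the memory budget by that row size to get a fetch size.
public class FetchSizeCalculator {

    private static final long MAX_BUFFER_BYTES = 8L * 1024 * 1024; // e.g. 8 megs
    private static final int MIN_FETCH_SIZE = 1;
    private static final int MAX_FETCH_SIZE = 25000; // arbitrary upper bound

    /**
     * @param columnByteSizes estimated in-memory size, in bytes, of each
     *                        selected column's value (assumed estimates)
     * @return a fetch size that keeps the buffer within MAX_BUFFER_BYTES
     */
    public static int getFetchSize(int... columnByteSizes) {
        long rowSize = 1; // start at 1 to avoid division by zero
        for (int size : columnByteSizes) {
            rowSize += size;
        }
        long fetchSize = MAX_BUFFER_BYTES / rowSize;
        // clamp to a sane range
        return (int) Math.max(MIN_FETCH_SIZE, Math.min(MAX_FETCH_SIZE, fetchSize));
    }

    public static void main(String[] args) {
        // a narrow query (a few small columns) gets a large fetch size...
        System.out.println(getFetchSize(4, 8, 16));
        // ...while a query with several VARCHAR(4000)-sized values gets a small one
        System.out.println(getFetchSize(8000, 8000, 8000, 8000));
    }
}
```

The clamping matters in practice: without an upper bound, a query selecting a single BIT column would ask the driver for millions of rows per round trip.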

So far I’ve tested it together with some friends of DataCleaner who have both small tables and tables with 200-300 columns, and they see a dramatic improvement in performance; it also prevents a few OutOfMemory issues!

A final remark – I haven’t seen any other Java frameworks that do something like this. Even Hibernate only has the option to specify a constant FETCH_SIZE:
<property name="hibernate.jdbc.fetch_size">10000</property>

But as explained in this blog entry, you will hopefully agree that such an approach is not optimal unless your application executes only a single query again and again!

For more information on MetaModel, please visit the website.