20111223

Push down query optimization in DataCleaner

As a follow-up to my previous post, The DataCleaner engine explained, about how we make DataCleaner super-fast by applying some nice multi-threading tricks, I would now like to touch upon another performance booster: push down query optimization.

To my knowledge, "push down query optimization" is a trick that only very few tools support, since it requires a flow model that was actually built for it. The idea is that by inspecting an execution flow, the tool may be able to identify steps at the beginning or end of the flow that can be replaced by query modifications.

For example, your data flow might begin with a filtering action that removes all records of a given type, or that restricts further processing to only the first 1000 records. Most tools simply require you to write some SQL yourself, which is doable, but as I've said before on this blog, I think writing SQL is a barrier to productivity, creativity and good data quality results. So in DataCleaner we do not offer this option, because we have something that is much, much nicer. That solution is push down query optimization!

Let me illustrate. I will be using the Sakila example database for MySQL:

Say you want to do a simple pattern finding of film titles in the Sakila database. You would select the title column and get a result like this:

In the DataCleaner logs we can see what queries are actually fired to the database. Open up the log file (in the logs folder) and inspect datacleaner.log. You will find a line like this:

Executing query: SELECT `nicer_but_slower_film_list`.`title` FROM sakila.`nicer_but_slower_film_list`

That's fine. You could inspect the results more closely, but that's not what this topic is about, so I'll carry on... Now let's say you want to refine your job. Let's instead see what the pattern distribution looks like if we only consider a few categories of films. So I add an 'Equals' filter to only select horror, sports and action movies and apply it to my pattern finder:

If we run the job and inspect the log file again, we now see this entry:

Executing query: SELECT `nicer_but_slower_film_list`.`title`, `nicer_but_slower_film_list`.`category` FROM sakila.`nicer_but_slower_film_list` WHERE (`nicer_but_slower_film_list`.`category` = 'Horror' OR `nicer_but_slower_film_list`.`category` = 'Action' OR `nicer_but_slower_film_list`.`category` = 'Sports')

What's surprising here is that the filter actually got query optimized. Not all filters have this ability, since some of them have richer functionality than can be expressed as a query modification. But some of them do, and typically these are the small functions that make a big difference.

Let's also apply a Max rows filter that limits the analysis to only 20 records, and chain it so that it depends on the Equals filter:

If we now run the job, both filters will have been applied to the query:

Executing query: SELECT `nicer_but_slower_film_list`.`title`, `nicer_but_slower_film_list`.`category` FROM sakila.`nicer_but_slower_film_list` WHERE (`nicer_but_slower_film_list`.`category` = 'Horror' OR `nicer_but_slower_film_list`.`category` = 'Action' OR `nicer_but_slower_film_list`.`category` = 'Sports') LIMIT 20

That means that we do as much as we can to optimize the query, without ever having to ask the user to help us. So if you modify the logical job, the physical queries are automatically adapted! This is why push down query optimization is a superior optimization technique to raw SQL. Happy data cleaning!

Additional information for developers: If you're developing plugins for DataCleaner and want to make a query optimized filter, simply make sure you implement the QueryOptimizedFilter interface! Happy coding!
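To give an impression of what that can look like, here is a minimal sketch of a 'max rows'-style filter against the 2.x plugin API. I'm assuming the QueryOptimizedFilter interface with its isOptimizable/optimizeQuery methods on top of the regular Filter interface; treat the exact package names, annotations and signatures as illustrative and check them against the developers guide for your version:

import org.eobjects.analyzer.beans.api.Filter;
import org.eobjects.analyzer.beans.api.QueryOptimizedFilter;
import org.eobjects.analyzer.data.InputRow;
import org.eobjects.metamodel.query.Query;

// Sketch of a filter with two outcomes, of which one can be pushed down to the query.
// In a real plugin the class would also carry the appropriate filter annotations.
public class MaxRowsSketchFilter implements QueryOptimizedFilter<MaxRowsSketchFilter.Category> {

    public enum Category { VALID, INVALID }

    private final int maxRows = 20;
    private int rowsSeen = 0;

    @Override
    public Category categorize(InputRow inputRow) {
        // Row-by-row fallback, used when the query could not be optimized
        // (for example because the datastore is not a SQL database).
        rowsSeen++;
        return rowsSeen <= maxRows ? Category.VALID : Category.INVALID;
    }

    @Override
    public boolean isOptimizable(Category category) {
        // Only the VALID outcome can be expressed as a query modification.
        return category == Category.VALID;
    }

    @Override
    public Query optimizeQuery(Query query, Category category) {
        // Push the limit down into the physical query (becomes LIMIT 20 on MySQL).
        query.setMaxRows(maxRows);
        return query;
    }
}

The idea is the same as in the log entries above: when the optimization applies, the engine rewrites the query instead of evaluating the filter row by row; when it doesn't, categorize() still does the job.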

20111129

The DataCleaner engine explained

For this blog entry I have decided to record a short video instead of write till my fingers fall off :) So I present to you: My videoblog entry about DataCleaner's data quality processing engine, and how it compares to traditional ETL engines.

The DataCleaner engine was created from the ground up to be optimized for Data Quality projects. It outperforms any other engine that we've looked at, which I think is a pretty nice achievement. In the video I try to explain what makes it different!


Enjoy using DataCleaner :)

20111030

Standardize the date formats in your data

One of the things that I sometimes see is that web forms cause unstandardized data in your database. Text fields in web forms have no native way to specify the type of the data, so what if you have a field that is supposed to be a date? For example the birthdate of your web users? A lot of web applications do not perform real validation of the format and content of the data entered into such fields. I think this typically happens because it was not considered important when the initial web page was designed. But maybe it becomes important at some point, eg. if you want to analyze the age groups of your users! The trouble is that later in the application's lifecycle you get stuck: you cannot conform the data to a new standardized format, because a lot of different date formats are already represented. For example:

  • 2011-10-30
  • 20111030
  • 30th of October, 2011
  • 30/10/11
And maybe some even more exotic ones...
In this blog entry I will show you how to solve that migration issue with the use of DataCleaner.

1. Date mask matching
The first thing we should do is to analyze which date patterns are present in the data. To do this you need to combine two components: The Date mask matcher and the Boolean analyzer. Here are the steps involved.
  1. First set up your datastore in the welcome screen of DataCleaner.
  2. Click the "Analyze!" button to begin composing your job.
  3. In the tree to the left, select the columns of interest - in our example at least the birthdate column.
  4. Click "Add transformer -> Matching and standardization -> Date mask matcher".
Your screen will now look something like this:
In the middle of the screen you see a list of date masks. Each of these produces a boolean output column (seen below). The idea of the Date mask matcher is that by creating a boolean column per mask, you can even assert whether a particular date string is parseable by several date masks. That's because a single date string like "080910" can be understood in many ways!
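As a plain-Java illustration of that ambiguity (just a sketch using SimpleDateFormat, not DataCleaner's actual Date mask matcher), the same string matches several masks:

import java.text.ParseException;
import java.text.SimpleDateFormat;

public class DateMaskAmbiguity {

    // Returns true if the value can be parsed strictly with the given mask.
    static boolean matches(String value, String mask) {
        SimpleDateFormat format = new SimpleDateFormat(mask);
        format.setLenient(false); // reject nonsense dates like month 13
        try {
            format.parse(value);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String value = "080910";
        // Both masks match, so the string alone does not tell us which date it is.
        System.out.println("ddMMyy: " + matches(value, "ddMMyy")); // 8 September 2010
        System.out.println("yyMMdd: " + matches(value, "yyMMdd")); // 10 September 2008
    }
}

This is exactly why the matcher gives you one boolean column per mask instead of making the choice for you.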

2. Analyzing matches
Moving on, we want to see how well our dates match against the date masks. Since all the matches are now stored in boolean columns, we can apply the Boolean analyzer. Here are the steps involved:
  1. Click "Add analyzer -> Boolean analyzer".
  2. Make sure all the transformed boolean columns are checked.
  3. Click the "Run analysis" button.
  4. Wait for the analysis to run.
Your screen will now contain an analysis result like this:
The result has two parts: The Column statistics and the Frequency of combinations.
In the column statistics you can see how often the individual date masks have been matched. In our example we can see that 4 of our date masks (no. 2, 3, 5 and 6) are not matched at all, so we may consider removing them from the Date mask matcher.
In the frequency of combinations we get a view of the rows and which match combinations are frequent and less frequent. The most frequent combination is that our date mask no. 1 is the only valid mask. The second most frequent combination (Combination 1) is that none of the date masks apply. If you click the green arrow to the right of the combination you will see which records fall into that category. In our example that looks like this:
This gives us a good hint about which date masks we need to add to our date mask matcher.
The "1982.03.21" date is a simple case - we should simply create a date mask like this: YYYY.MM.dd
The "11th of march, 1982" date is a bit more complex. We need to allow the date mask to have a literal string part (the "th of" part) and it needs to recognize the month by name ("march"), not by number. Fortunately this is still possible, the date mask looks like this: dd'th of' MMMMM, YYYY

3. Converting to dates
While we could continue to refine the analysis, this is a blog, not a reference manual and I want to cut to the chase - the actual migration to standardized dates!
So let us look at how you can convert your date strings to actual date fields which you can then choose to format using a standardized format. To do this, click "Add transformer -> Conversion -> Convert to date". You will now see a configuration panel like this:
Here you also see a list of example date masks. Click the plus button to add additional date masks to convert by. The converter tries the masks from the top and uses the first one that can convert the value, so in ambiguous cases like "091011" you have to make your choice here (I would recommend basing it on your analysis).
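In plain Java, that first-match-wins behaviour corresponds to something like the following sketch (again just an illustration with SimpleDateFormat, not DataCleaner's actual converter):

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class ConvertToDateSketch {

    // The masks are tried in order; the first one that parses the value wins.
    private static final String[] MASKS = { "yyyy-MM-dd", "yyyyMMdd", "dd/MM/yy", "yyyy.MM.dd" };

    static Date convert(String value) {
        for (String mask : MASKS) {
            SimpleDateFormat format = new SimpleDateFormat(mask);
            format.setLenient(false);
            try {
                return format.parse(value);
            } catch (ParseException e) {
                // Not this mask - try the next one.
            }
        }
        return null; // unrecognized values end up as null, just like in the job
    }

    public static void main(String[] args) {
        System.out.println(convert("2011-10-30"));
        System.out.println(convert("30/10/11"));
        System.out.println(convert("30th of October, 2011")); // null with these masks
    }
}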
We add the few masks that are relevant for our example:
And we verify that there are no immediate unrecognized dates, by clicking the "Preview data" button:
If a date is not recognized and converted, then the output column will have a null instead of a date. Therefore a good practice would be to look for null values and eg. save them to an error handling file. To do this, here's what we do:
  1. Go to the Filters tab.
  2. Click "Add filter -> Not null".
  3. Select only your converted column (in our example "birthdate (as date)").
  4. Click "INVALID -> Write to CSV file".
  5. Select the desired columns for the error handling file.
  6. Optionally right click the "Write to CSV file" tab and select "Rename component" to give it a name like "write errors".
  7. Go back to the Filters tab and click the VALID button to write the valid records to a CSV file or a spreadsheet.
After these steps, you should be able to inspect your job flow by clicking the Visualize button:
Now your date standardization job is ready to run again and again to enforce standardized dates!

20110926

Data Profiling SQLized. Uh oh...

What do 'Scrooge' and 'Kasper' have in common? Not much, according to my SQL data profiler.

Some months back (admittedly, more than a couple) I explained how I think people tend to do "home made data profiling" too often because it apparently seems easy to do in SQL. I went on to promise that I would also try to play the devil's advocate and show a few examples of "copy paste queries" that you could use for such a tool. In this blog post I will try to do so. But let me first say:


Don't do this at home kids!

But let's start with the first query that I would add to my home baked profiling SQL script. We'll do what anyone who hasn't really understood what profiling is all about will tell you to do: a column analysis based on the metrics that are easily available in all SQL implementations:

SELECT MAX(column), MIN(column), COUNT(column), COUNT(*)
FROM table;

This is a good query, especially for number columns. Here I would typically check whether the MIN value is below zero or not. Is the COUNT(column) equal to the COUNT(*)? If not, it means that there are nulls in the column. Why not just do separate queries - wouldn't that be more readable? Yes, but it also makes my script larger and gives me more stuff to maintain. But let's try it anyway; we can actually improve it by adding a few metrics:

SELECT MAX(column) AS highest_value FROM table;
SELECT MIN(column) AS lowest_positive_value FROM table WHERE column > 0;
SELECT MIN(column) AS lowest_negative_value FROM table WHERE column < 0;
SELECT COUNT(*) AS num_values FROM table WHERE column IS NOT NULL;
SELECT COUNT(*) AS num_nulls FROM table WHERE column IS NULL;

Now let's continue with some string columns, because I think that more often than not, this is where data profiling turns out to be really valuable. Something that I often see as an inconsistency in structured string data is case differences. Such inconsistencies make reporting and analysis of the data cumbersome and error prone, because grouping and filtering will ultimately be imprecise. So let's do a case analysis:

SELECT COUNT(*) AS num_lowercase FROM table WHERE LCASE(column) = column;
SELECT COUNT(*) AS num_uppercase FROM table WHERE UCASE(column) = column;
SELECT COUNT(*) AS num_mixed_case FROM table WHERE LCASE(column) <> column AND UCASE(column) <> column;

And then on to querying the always popular "first letter is capitalized" type of strings. This one really depends on the database, because substring functions have not been standardized across the major SQL implementations. I'll show a few:

INITCAP-based approach (eg. PostgreSQL and Oracle):

SELECT COUNT(*) AS num_first_letter_capitalized FROM table WHERE INITCAP(column) = column;

SUBSTRING-based approach (for databases without INITCAP, eg. Microsoft SQL Server - adjust the exact function names and substring syntax to your dialect):

SELECT COUNT(*) AS num_first_letter_capitalized FROM table
WHERE UCASE(SUBSTRING(column FROM 1 FOR 1)) = SUBSTRING(column FROM 1 FOR 1)
AND LCASE(SUBSTRING(column FROM 2)) = SUBSTRING(column FROM 2);
A bit cumbersome, but it gets the job done. Being the devil's advocate, I'm still not convinced that I should throw out my home baked SQL just yet. So I'm ready for another challenge!

Let's have a look at pattern finding through SQL. Again this is perfectly possible. I've even heard many people telling me that we should rewrite DataCleaner's Pattern Finder to make it SQL optimized. Read on and judge for yourself :-)

To match tokens by pattern we apply the simplest possible configuration of DataCleaner's pattern finder: all letters are replaced by 'A' or 'a' (depending on case) and all digits are replaced by '9'. This makes for a nice pattern based matcher, like this:
Mickey Mouse -> 'Aaaaaa Aaaaa'
Minnie Mouse -> 'Aaaaaa Aaaaa'
Joachim von And -> 'Aaaaaaa aaa Aaa'
kasper@eobjects.dk -> 'aaaaaa@aaaaaaaa.aa'
(Random fact: 'Joachim von And' is the Danish name for Scrooge McDuck)
As you can see from the patterns, this is a good preliminary way to determine whether string values have the same form and syntax - we immediately see that the email address is odd, and that although all the other values look like valid names, some have lowercase tokens (prefixes) in between.
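In plain Java, that tokenization is essentially three chained replacements (a sketch, not the actual Pattern Finder code):

public class SimplePatternSketch {

    // Replace lowercase letters with 'a', uppercase letters with 'A' and digits with '9'.
    static String toPattern(String value) {
        return value.replaceAll("[a-z]", "a")
                    .replaceAll("[A-Z]", "A")
                    .replaceAll("[0-9]", "9");
    }

    public static void main(String[] args) {
        System.out.println(toPattern("Mickey Mouse"));       // Aaaaaa Aaaaa
        System.out.println(toPattern("kasper@eobjects.dk")); // aaaaaa@aaaaaaaa.aa
    }
}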

In PostgreSQL for example, this would look like:

SELECT
  regexp_replace(regexp_replace(regexp_replace(column, '[a-z]', 'a', 'g'), '[A-Z]', 'A', 'g'), '[0-9]', '9', 'g') AS pattern,
  COUNT(*) AS pattern_count
FROM table
GROUP BY pattern;

This actually works like a charm and returns:

pattern            pattern_count
Aaaaaa Aaaaa       2
Aaaaaaa aaa Aaa    1
aaaaaa@aaaaaaaa.aa 1
So why even use a profiling tool for finding patterns? All this seems to be possible through raw SQL?

I will now stop playing the devil's advocate... Cuz' seriously... this is nonsense! Having worked for some years on a pretty good data quality analysis tool, this approach absolutely disgusts me. Here are just a few random reasons why, off the top of my head:

  • We still haven't scratched the surface when it comes to supporting eg. non-ASCII characters in patterns.
  • Some tokens in patterns should be matched regardless of string length, some shouldn't. In our case we never matched strings with unequal lengths (eg. Scrooge and Mickey). This is a setting that you will want to play around with! (For examples, check out our Pattern Finder documentation)
  • Each metric in the previous analyses required its own query. This means that if you want to analyze a hundred metrics, you would need to query (at least) a hundred times.
  • A lot of metrics are simply not possible to express in SQL. Some examples: diacritic character counts, max/min number of words, matches against reference data and more.
  • Often you will want to preprocess data before (or actually I would argue, as a part of) your profiling. This can be for example to extract information from composite values or to replace known inconsistencies with standardized values.
  • All the examples offer no drill-to-detail behaviour, so further analysis is more or less impossible. And drill-to-detail is not offered through SQL, so there is for example no way to express in our pattern finder SQL that we want to keep some samples of various pattern matches for later inspection.
  • All in all, using SQL for data profiling makes for a terribly unexplorative approach. It's a pain having to write and modify such an amount of SQL to get simple things done, so don't rely on it, because it will make you lazy and then you'll not investigate properly!
  • And of course, SQL only applies to databases that support SQL! If you're looking to profile data in other formats, then you're out of luck with this approach.

20110815

Get your data right... First Time Right!

In my blog I mostly talk about data quality tools like DataCleaner that are diagnostic and treating, rather than preventive. Such tools have a lot of merit and strengths, but for a total view on data quality it is crucial that you also include tools that are preventive of poor data ever entering your system. In this blog post I want to talk a bit about a project that I have been involved with at Human Inference which is just that - our First Time Right JavaScript solution.

The idea is that we provide a subscription-based JavaScript API where you can easily decorate any HTML contact form with a lot of rich features for on-the-fly verification, validation, auto correction and helpful features for automatic filling of derived fields.

For example, the API allows you to enter (or copy/paste) a full name, including titles, salutation, initials and more - and have these items parsed and placed into the corresponding fields of a detailed contact form. It will even automatically detect the gender of the contact and apply it in gender fields. We have similar data entry aids for address input, email input, phone numbers and contact duplicate checking.

Take a look at the video below, which demonstrates most of the features:



Now this is quite exciting functionality, but this is also a technical blog, so I'll talk a bit about the technology involved.

We built the project based on Google Web Toolkit (GWT). GWT enables us to build a very rich application, entirely in JavaScript, so that it can be embedded on any website - no matter if it's PHP based, ASP.NET based, Java based or whatever. Of course we do have a server-side piece that the JavaScript communicates with, but that is all hosted on Human Inference's cloud platform. So in other words: the deployment of our First Time Right solution is a breeze!

Since AJAX applications normally require the JavaScript to be served from the same origin as the server it communicates with, we've had to overcome quite some issues to allow the JavaScript to be external to the deployment sites. This is crucial, as we want upgrades and improvements to be performed on our premises, not at individual customer sites. This way we can really leverage the cloud- and subscription-based approach to data quality. Our solution to the locality problem has been the JSONP approach, which is an alternative technique for implementing AJAX behaviour. JSONP is a rather clever construct where, instead of issuing actual XMLHttpRequests, you insert new <script> elements into the HTML DOM at runtime! This means that the browser will perform a new request simply because the <script> element refers to a new JavaScript source. It's not "pretty" to tackle the error handling and asynchronicity that this approach brings, but we've done a lot of work to get it right, and it works like a charm! I hope to share some of our design patterns later, to demonstrate how it works.

Another challenge was security. Obviously you will want to make sure that the JavaScript is only available to subscribers, and only for the websites they've subscribed it for (because otherwise the JavaScript could simply be copied to another website). Our way around this resembles how, for example, Google manages subscriptions to Google Maps and other subscription services, where you need a site-specific API key. Very clever.

A few optional features may require some local add-on deployment. In particular, deduplication requires us to know the contact data to use as the source for detecting if a new contact is a duplicate. Here we have two options: On-premise installation of the deduplication engine or hooking up with our cloud-based deduplication engine, which can be configured to sync with your datastores.

All in all I am quite enthusiastic about the FTR solution and the technology behind the solution. I also think that our FTR API is an example of a lightweight approach to implementing Data Quality, which complements DataCleaner very well. Both tools are extremely useful for ensuring a high level of data quality, and both tools are very intuitive and flexible in the way you can deploy them.

20110804

Eye candy in Java 7: New javadoc style!

By now most of you have probably heard that Java 7 is out, and there are a lot of discussions about new features, the loop optimization bug and general adoption.

But one of the things in Java 7 which has escaped most people's attention (I think) is the new javadoc style.

Check it out:


And see it live - we've just published updated API documentation for MetaModel.

20110801

Unit test your data

In modern software development, unit testing is widely used as a way to check the quality of your code. For those of you who are not software developers, the idea of unit testing is that you define rules for your code that you check again and again, to verify that your code works, and keeps on working.
Unit testing and data quality have quite a lot in common in my opinion. Both code and data change over time, so there is a constant need to keep checking that your code/data has the desired characteristics. This was something that I was recently reminded of by a DataCleaner user on our forums.
I am happy to see that data stewards and the like are picking up this idea, as it has been maturing for quite some time in the software development industry. It also got me thinking: in software development we have a lot of related methods and practices around unit testing. Let me try to list a few that are very important, and that we can perhaps also apply to data:
Code -> Data:
  • Compile-time checking (ensuring correct syntax) -> Database constraints
  • Unit testing (checking a single unit of code) -> Validating data profiling?
  • Continuous integration (running all tests periodically) -> Data Quality monitoring?
  • Bug tracking (maintaining records of all code issues) -> ?
  • Static code analysis (a la FindBugs) -> Explorative data profiling?
  • Refactoring (changing code without breaking functionality) -> ETL with applied DQ rules?

For explanation of the various data profiling and monitoring types, please refer to my previous post, Two types of data profiling.
Of course not all metaphors here map one-to-one, but in my opinion it is a pretty good metaphor. For me, as a software product developer, I think it also points out some of the weak and strong points of current Data Quality tools. In software development the tool support for unit testing, continuous integration, bug tracking and more is incredible. In the data world I feel that many tools focus only on one or two of the above areas of quality control. Of course you can combine tools, but as I've argued before, switching tools also comes at a large price.
So what do I suggest? Well, fellow product developers, let's make better tools that integrate more disciplines of data quality! I know that this has been and still will be my aim for DataCleaner.
Update: Further actions in this direction have been taken with the plans for DataCleaner 3.0, see this blog post for more information.

20110714

A colorful value distribution

A few weeks ago I was devoting a bit of attention to the charts in DataCleaner. Of special interest are the value distribution charts, which have caused some discussions...

Anyways, here's a proposal which includes nicer (IMO) coloring, a "distinct count" measure, a dedicated "<blank>" keyword and a few other niceties.

Value distribution chart proposals


You can expect to see this live in DataCleaner 2.3, which is expected in August.

20110708

Proposal for writing data in MetaModel 2.0

Hi everyone,

For a long time we've had a lot of people asking "can I use MetaModel to not only read varying data formats, but also to write data?". So far the answer has been "no, MetaModel is a read-only API". But lately we've been working at Human Inference on a proposal for an API to write to the same DataContexts as you read from, in MetaModel.

Here's a glimpse of the API, by example. Currently we have fluent APIs for creating tables and inserting rows:

UpdateableDataContext dc = ...
Schema schema = dc.getDefaultSchema();
Table table = dc.createTable(schema, "my_table")
   .withColumn("id").ofType(INTEGER)
   .withColumn("name").ofType("VARCHAR").ofSize(255)
   .execute();
dc.insertInto(table).value("id",1).value("name","john doe").execute();
dc.insertInto(table).value("id",2).value("name","jane doe").execute();

This API has so far been implemented successfully for Excel spreadsheets, CSV files and JDBC databases - our 3 most used datastore types.

You can find the work-in-progress of the proposal in SVN at:
http://eobjects.org/svn/MetaModel/branches/2.0-writable-datacontext/

We would like to get your reactions to the API proposal. Does it suit your needs and do you like the approach? Will it be acceptable to launch 2.0 with just these "CREATE TABLE" and "INSERT" operations, or will other operations (such as DELETE, UPDATE, DROP, ALTER) be needed before it makes up a valid solution for you guys?

Best regards,
Kasper

Update 2011-07-11
A few people have provided feedback (thank you for that), and some performance tests on our side also revealed that we need to apply a more batch-friendly approach, which additionally has better encapsulation and isolation properties for multiple and large updates. So we've instead applied a pattern similar to Spring's template pattern or Akka's atomic STM pattern. The idea is that the user supplies an UpdateScript which will be executed in isolation, like this:
UpdateableDataContext dc = ...
final Schema schema = dc.getDefaultSchema();
dc.executeUpdate(new UpdateScript() {
  public void run(UpdateCallback callback) {
    Table table = callback.createTable(schema, "my_table")
      .withColumn("id").ofType(INTEGER)
      .withColumn("name").ofType("VARCHAR").ofSize(255)
      .execute();
    callback.insertInto(table).value("id",1).value("name","john doe").execute();
    callback.insertInto(table).value("id",2).value("name","jane doe").execute();
  }
});

On first sight it might not look quite as elegant, but I think that in the big picture this pattern is actually a lot nicer. First of all because it gives you a very clear understanding of exactly where in your code you modify your data. It also makes it a lot easier to write eg. fallback-scripts in case something goes wrong with your update. For datastore types that support transactions (eg. JDBC databases) it also makes it possible for us to easily demarcate the transactional boundaries.

... Please keep posting feedback!

20110627

Developing DataCleaner extensions

Today has been all about the release of DataCleaner 2.2. This is a significant release of our Data Quality Analysis product which I think is becoming more and more mature and capable.

One of the really neat things in DataCleaner 2.2 is its extensibility. Lots of applications are extensible, but few are, in my opinion, as easy to approach as DataCleaner. We expose a limited API which is nonetheless extremely flexible. This makes it easy for developers to explore the opportunities and the architecture.

Another great strength of DataCleaner's extension architecture is the ExtensionSwap. With a click in the browser you can install an extension onto a running application. Personally I think it's a quite jaw-dropping effect when you see the seamlessness of the integration.

I've recorded this webcast demonstration for developers who want to get started, or who are just curious about how our developer API works.

Also, a few nice resources for you to investigate further:

  • Our reference documentation now contains a "developers guide" with lots of nice info on extension packaging and more.
  • The Develop page on the website contains links to various previous blog entries and instructions. These are still valid even though the 2.2 API has been elaborated.
  • And of course, check out the javadoc API documentation for DataCleaner.

Lastly I want to point out that not only is DataCleaner 2.2 extensible, it is also embeddable. You can now grab DataCleaner from the central Maven repositories and bootstrap the application in a really easy fashion:

BootstrapOptions bootstrapOptions = new DefaultBootstrapOptions(args);
Bootstrap bootstrap = new Bootstrap(bootstrapOptions);
bootstrap.run();

For more info, check out the chapter "Embedding DataCleaner" in the reference documentation.

20110620

SassyReader - Open Source reader of SAS data sets for Java

I'm quite excited to announce the first release of a brand new eobjects.org project: SassyReader. SassyReader is, in my opinion, indeed something sassy, as it fills a gap that has long existed in open source applications that deal with data management (ETL tools, tools like DataCleaner and the like). SassyReader is a library for reading data in the sas7bdat format, aka. the format that the SAS statistical software uses! It is written entirely in Java and reads the files in their binary format (ie. it is not a connector to the SAS system, but a reader of the raw data).

So why is this important? Well, first of all because it is very difficult to create systems that interoperate with SAS. SAS does ship a JDBC driver, but its compliance with JDBC is actually very limited. Even creating a connection will typically require use of SAS's proprietary classes, so you cannot go the standard JDBC way. There is also no JDBC metadata support, and you need to set up a server-side SAS/SHARE option to even expose the connection. Furthermore this is an add-on product from SAS which costs additional money if you're just a base SAS user. So doing trivial things like connecting and querying a data set requires a lot of work and money. In my opinion this is poor practice - a legacy way of trying to lock people in to using only a particular brand of software, simply because interoperability is a big pain.

All in all I see a great benefit in a project like SassyReader for those who simply want a way of reading the data that is stored in SAS files.

I cannot take a whole lot of credit for this project though. Most of the really challenging stuff was created by Matt Shotwell, aka. BioStatMatt, who founded the sas7bdat project, which is written in R. My contribution was to port it to Java and fix a few issues along the way. Matt put together a lot of scattered works that describe various findings about the sas7bdat format. In other words this is a completely reverse engineered library, based on analysis of actual sas7bdat files. During the last months we've had a good conversation going, fixing some of the remaining issues in parallel and bringing additions to each other's code.

Today we've released version 0.1 of SassyReader. It's not yet ready for mission critical use, as there are still quirks in the format that we haven't figured out. Also, there are different shapes and sizes within the format that apparently vary depending on (I'm guessing a bit here) the number of columns and the operating system that the file was written with. The good thing is that we have a quite extensive test set, and of the files that I had lying around and wanted to work with, the reader managed to read all but one (11 out of 12)!

Please visit the SassyReader website for more details, and let me know your feedback!

20110531

The value of data quality

The other day I stumbled upon an interesting blog post about software quality, and this brilliant quote (by Alan Weiss) kept flashing on my internal billboard over and over again...

Quality [...] is not the absence of something in management's eyes, that is, defects, but the presence of something in the consumer's eyes, that is, value.
What struck me was that this explains one of the findings I quite often get when people tell me that they are using DataCleaner or competing tools: they find value in even some of the simplest functionalities in there - that even the simplest of features can provide a fortune of value. Coming from the world of tools and product development, we tend to look a lot at feature comparisons and technical capabilities. And customers also use such arguments for choosing a tool. And of course it makes a lot of sense, because quality, as in the amount of value provided for the consumer, is very hard to measure and compare.

So how can a software product deliver high quality, in the sense of consumer value? I believe it is a mix of making the product fairly easy to use and solving concrete problems for the consumer. In DataCleaner we've done a lot of work to make the tool work as a generic Data Quality Analysis (DQA) tool for a wide variety of data types. But maybe we should also consider building more domain targeted packages where you can easily do a "data value assessment" (to twist the words of Alan Weiss a little) for particular domains, eg. customer data, product data, geographic data and more.

What do you think? Should data profiling and DQA stay generic, or should it target specific domains? Can it do both?

20110513

Speed up your JDBC queries with MetaModel

Although the recent MetaModel version 1.7.1 was a minor release, there is something about this release that excites me very much! That thing is our new FETCH_SIZE calculator. So what is that, you might ask? Well, here's a little background.

In JDBC there is an option to set the so-called FETCH_SIZE of a query statement, like this:

Statement st = connection.createStatement();
st.setFetchSize(10000);
ResultSet rs = st.executeQuery(...);

What this does is tell the database how many rows should be fetched at a time when more rows are needed. In other words: a buffer size, measured in number of rows.

Cool - so this means you should set a rather high fetch size if you have plenty of memory? Say 20,000 records? Well, sometimes yes, but not always!

The guys over at Oracle have done exemplary work documenting the memory consumption of their latest JDBC driver. From this you can learn that if you set the fetch size too high, you might run out of memory! But on the other hand you would really like to optimize the buffering of your result sets! The trouble is that the row size in memory is hugely different depending on whether you select 1, 2, 10 or 300 columns in your query! Furthermore, the size of a VARCHAR(4000) value in the buffer is about 8000 times larger than a BIT value! These findings led to a very nice new feature in MetaModel: dynamically setting an appropriate fetch size, depending on the query you are executing...

How does this work? Well, basically MetaModel has all the needed meta-information for doing a strictly mathematical calculation. MetaModel knows the data types of all your columns, it knows their column sizes and, most importantly, it knows which columns you are querying! So if we want to allow eg. up to 8 megs of memory to be used for the query buffer, then it's very easy for us to figure out the needed FETCH_SIZE: in principle we just divide the available memory by the size of a single row in the buffer.
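As a back-of-the-envelope sketch of that division (not MetaModel's actual implementation, which weighs the column types more carefully; SelectedColumn below is a made-up stand-in for the column metadata MetaModel already has), the calculation boils down to:

import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

public class FetchSizeSketch {

    /** Hypothetical stand-in for the metadata of a column in the SELECT clause. */
    interface SelectedColumn {
        long estimatedByteSize(); // e.g. roughly 2 bytes per char for a VARCHAR(4000)
    }

    // Divide the memory budget by a rough per-row byte estimate of the selected columns.
    static int calculateFetchSize(List<SelectedColumn> selectedColumns, long bufferMemoryBytes) {
        long bytesPerRow = 0;
        for (SelectedColumn column : selectedColumns) {
            bytesPerRow += column.estimatedByteSize();
        }
        if (bytesPerRow == 0) {
            return 10000; // fall back to a plain default
        }
        long fetchSize = bufferMemoryBytes / bytesPerRow;
        // Keep the result within a sane range.
        return (int) Math.max(1, Math.min(fetchSize, 25000));
    }

    static void apply(Statement statement, int fetchSize) throws SQLException {
        statement.setFetchSize(fetchSize); // standard JDBC call, as in the snippet above
    }
}

So a query selecting a single BIT column gets a huge fetch size, while a query selecting hundreds of wide VARCHAR columns gets a small one.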

So far I've tested it out together with some friends of DataCleaner who have both small tables and 200-300 column tables, and they see a dramatic improvement in performance - and it also prevents a few OutOfMemory issues!

A final remark – I haven’t seen any other Java frameworks that do something like this. Even Hibernate only has the option to specify a constant FETCH_SIZE:
<property name="hibernate.jdbc.fetch_size">10000</property>

But as explained in this blog entry you will hopefully agree that such an approach is not optimal unless your application only executes a single query again and again!

For more information on MetaModel, please visit the website.

20110413

Two types of data profiling

Recently I've been blogging about how I see that DataCleaner is what I would dub a 'Data Quality Analysis (DQA) tool' more than anything else. This leads me to an explanation of what I mean by DQA tool, profiling tool and more.

So...

Data profiling is, in my worldview, the activity of extracting (and possibly refining) a set of analysis metrics from your data. As such it is a quite boring and trivial task that you can even automate rather easily.

The interesting question is not what data profiling is, but why it is! I see two main reasons and they have quite different characteristics as to how you would use a profiling tool!

The (chronologically) first reason that you would apply data profiling is to perform an analysis. Not a technical analysis, but an analysis where you apply your human reasoning. You investigate the metrics to discover your data. If you're a good analyst you will also continuously refine your analysis, challenge it and change settings to see what happens. For this a profiling tool enables you to go below "the tip of the iceberg" (a common phrase about profiling) in your datastores.

The second reason is for monitoring your data quality. A profiling tool has the power to extract the metrics so you will often see that people use profiling tools to perform a set of data quality validation tasks. It is used as a way to retain a quality level. In this case of data profiling you execute the same analysis again and again - only the data changes (over time).

Do you see other applications of data profiling tools?

20110319

Coming up in DataCleaner: Single window UI

Lately I've been working on a fundamental improvement to the UI of DataCleaner that I think is quite important, and I would really like to present it to gather some reactions and ideas for further development.

The idea was to change the UI metaphor with regards to windows. Currently you might say that the left hand window is the "environment" window - it contains management of datastores and reference data (dictionaries, synonyms, patterns) as well as the file menu, which includes links to various global dialogs etc. The issue with this window is that it is an additional window that you always have to carry with you, even when you want to focus on a particular job. My aim was to find a way to remove the window but retain all its abilities, and make job creation the main focus of the UI.

So here's what we've done so far:

  • Moved datastore selection (and management) to the workbench (analysis job) window.
  • Moved the file menu to the workbench window.
  • Created a separate dialog for management of reference data, accessible through the file menu.

Here's a screenshot of the first concept, datastore selection and management. This is what you'll see when the app starts up (the cursor is hovering over the MySQL icon, which is why it has an "information bubble"):


Once a datastore has been selected and the "Analyze!" button is clicked, the schema tree will become visible and will correspond to the analysis job window as you already know it.

In the next screenshots you can see the contents of the new "Reference data" menu in the top of the screen:


... and the dialog that it links to:


So what do you think? I hope that it's a bit more intuitive of course. The metaphor was to make something similar to a typical office application, where you begin with an empty document which you can build, or you can choose to open a saved document if you want to resume previous work.

If you want to try it out, simply check out the trunk of AnalyzerBeans and DataCleaner and build it! ... Here's the command line summary:

svn co http://eobjects.org/svn/AnalyzerBeans/trunk/ AnalyzerBeans
cd AnalyzerBeans
mvn install
cd ..
svn co http://eobjects.org/svn/DataCleaner/trunk/ DataCleaner
cd DataCleaner
mvn install
cd target
java -jar DataCleaner.jar

Enjoy and please provide feedback :)

20110303

The complexity of cleansing a name

Following up on Winfried van Holland's blog post about having to cleanse 180 million names, I thought I'd share a cool screenshot from a funny moment at the Human Inference office today, when we were experimenting with good ways to visualize our execution flow in DataCleaner...

The diagram shows 1) the steps that we go through in our filtering process, 2) the dependencies between them, and 3) a circle layout that we tried out and dismissed ;-)

Anyways, a funny moment and quite a nice view of just how much execution power you can extract from a tool like DataCleaner.

20110217

Pre-processing in DataCleaner 2: Why?

Last Monday we released the new DataCleaner 2.0, and one of the major new features in it is the ability to transform and filter your data using the tool. Previously, the common answer to someone asking about transformation in DataCleaner has been that "there are already good tools for this available elsewhere, so use them". So why did we choose to focus on data processing in DataCleaner 2.0? And what are the capabilities of DataCleaner in terms of transformations? Is DataCleaner over time going to evolve into a full-fledged ETL tool? I'll try to answer these questions.

All your data quality functions in one place

The obvious answer is that we want to provide more data quality functionality, and transformations are something that a lot of people need. While doing data profiling it is often necessary to do data adjustments, eg. to tokenize values, extract certain information, filter which rows get profiled etc. You can do all this by applying an ETL tool, or maybe even by creating database VIEWs. The problem with such an approach is that it will eventually get in your way, because you're trying to get 2-3 independent tools to work nicely together instead of just having these functions available where you need them.

Transformations for the DQ domain

Possibly even more important is that the transformations you want to employ in data quality analysis are typically quite different from those that come out-of-the-box in database scripts and ETL tools. Such tools are typically quite generic and will provide general purpose tokenizers etc., but will typically lack transformations pertaining to the DQ domain, such as date mask matchers, dictionary lookups, synonym replacements and standardization of email addresses, names and URLs.

Non-persistent transformations

When you do pre-processing in separate tools, you also need to persist your transformed datasets. In data quality analysis this is just a waste of resources and provides poor performance. If you need to perform transformations, apply filtering etc. for the purpose of analyzing, profiling and exploring your data, it is much more feasible to perform these transformations on demand instead of storing them up front. This also allows for a much freer user experience where you can actually experiment with your data and your analysis instead of having to overthink it.

Is DataCleaner's validator gone? No, it's just filtering with added output handling!

DataCleaner 1.x was known to have a strict separation between the activities "profile" and "validate". Since this separation is gone in 2.0, one might ask: "Is DataCleaner's validator gone? I can only see analysis jobs!". But the answer is no - we just consider validation a type of analysis (thus, analysis is a broader term, comprising profiling, validation and more). You can easily perform all the validation operations of DataCleaner 1.x, but the approach is a bit different because you basically apply filters instead of "validators". There is even an example of this in the "Introduction to analyzing, cleansing and filtering data" webcast, available on DataCleaner's website.

20110214

DataCleaner 2 released, Human Inference acquires eobjects.org, I have a new job and dataqualitypro.com even publishes a nice article about it

Earlier I promised you some "Big news" today, and here it is... Hmm, where to start, so much to say.


OK, let's start with the software, after all that's what I think most of my blog readers care most about:


DataCleaner 2.0 was released!

To me this is the biggest news for the DataCleaner community in a LONG time. DataCleaner 2.0 is a major release that I and my new employer (read more below) have put a lot of effort into. I just had a look at some source code statistics, and the 2.0 release is actually larger (in terms of lines of code, source code commits, contributions etc.) than all previous DataCleaner releases together. I don't want to say a lot about the new functionality here, because it's all presented quite well at the DataCleaner website.

More information: Watch out, dirty data! DataCleaner 2.0 is in town!


MetaModel 1.5 was released!

My other lovechild, MetaModel, has also just been released, in version 1.5! MetaModel 1.5 is also a quite big improvement on the previous 1.2 version. The release contains a lot of exciting new features for querying and datastore exploration, as well as a lot of maturity bugfixes.

More information: What's new in MetaModel 1.5?


And then let's move on to a major announcement that I definitely think will affect the eobjects.org community positively:


Human Inference acquires eobjects.org

This might come as a surprise to quite a lot of you, so let me explain a bit. For some years DataCleaner and the other eobjects.org projects have been independent open source projects that I've invested a lot of time in. The projects have grown nicely in terms of users and the ideas have been manifold. My ambitions for the projects have always been high, but they suffered from the fact that I was mostly working on them in my free time. One of the many fun things about doing these projects was that I've gotten to meet a lot of exciting people who thought my projects were interesting. At one point I met some people from the data quality vendor Human Inference, who thought DataCleaner was great and wanted to know if they could in some way use it in collaboration with their commercial offerings. From my end of the table, I was on the other hand thinking that their products offered some features that would be an excellent addition to DataCleaner's functionality. So what we did was a deal to try and raise the value for both parties. And with that in mind, here's the press release about it:

More information: Human Inference completes acquisition of DataCleaner and eobjects.org

I have a new job

I now work with Human Inference to actively grow the DataCleaner project, MetaModel, as well as Human Inference's commercial products. We're building really exciting cloud-based data quality services that I think will complement the open source offering nicely. Of course it's not all going to be free, but I promise that even for the users who don't want to pay for the additional services, the acquisition and my new job will be beneficial anyway, because we're adding a lot of new resources to the projects, improving both the open source parts and the commercial plug-ins.



And in the end I also want to mention that Data Quality Pro has a great article about a lot of this news, including an interview with me and Sabine Palinckx, the CEO of Human Inference.

More information: Open Source DataCleaner gets a major update, Human Inference enters the Open Source Data Quality Market

20110204

Big news ...


20110118

It’s very easy to make your own data profiling logic – but wrong!



In the world of data quality I often see that people tend to think that a profiling application isn't that important, because the user can easily analyze his data using SQL and similar methods already. This can actually make it quite difficult to convince users to try out a tool built for the job.

Let me tell you about a situation that I have been in quite a few times: if I am to do a quick demonstration of my favourite data profiling application, then I will begin with something like a Value distribution chart. The respondent watching the demo would then reply:

"... But I could easily do this with just a simple GROUP BY query."

And my answer would be YES, you can.

So then I will proceed to demonstrate various other metrics such as null counts, empty string counts, maximum/minimum lengths etc. An SQL-hefty respondent might reply:

"... Sure, but I can also use AVG, WHERE x IS NULL, length functions etc. in SQL".

And my answer would be YES, you can.

If at this point I dive into the discussion on these terms, then I would demonstrate hard-to-query measures such as diacritic counts, word counts, non-letter characters and case distribution. But ultimately I would myself have misunderstood the point of using a data profiling application, because the point isn't (so much) that it contains measures that are not possible to express as queries.

The point is that the measures, and their impact on your data's profile, are not always known to you in advance. If you know just what you're looking for, then you'll find it and nothing more.

So while it’s quite easy (or at least possible) to design your own profiling plan using SQL and various other querying extensions, it’s not going to help you much in terms of profiling, because profiling should be an exploratory, analytical and interactive discipline.

To scare you off, I think I’ll try creating a few SQL queries that could compensate for most of the functionality in DataCleaner. I imagine they’ll be quite amusing, so hold on for an update...