20120123

Now you can build your own DQ monitoring solution with DataCleaner

In the cover of night we've released a new version of DataCleaner today (version 2.4.2). Officially it's a minor release because for the User Interface very few things have changed, only a few bugfixes and minor enhancements have been introduced. But one potentially major feature have been added in the inner workings of DataCleaner: The ability to persist the results of your DQ analysis jobs. Although this feature still has very limited User Interface support, it has full support in the command line interface, which I would argue is actually sufficient for the purposes of establishing a data quality monitoring solution. Later on I do expect there to be full (and backwards compatible) support in the UI as well.

So what is it, and how does it work?
Well basically it is simply two new parameters to the command line interface:

 -of (--output-file) FILE                          : File in which to save the result of the job
 -ot (--output-type) [TEXT | HTML | SERIALIZED]    : How to represent the result of the job

Here's an example of how to use it. Notice that I use the file extension .analysis.result.dat, which is the one thing that is currently implemented and recognized in the UI as a result file.

> DataCleaner-console.exe -job examples\employees.analysis.xml\
 -ot SERIALIZED\
 -of employees.analysis.result.dat

Now start up DataCleaner's UI, and select "File -> Open analysis job..." - you'll suddenly see that the produced file can be opened:

And when you open the file, the result will be displayed just like a job you've run inside the application:

Since files like this are generally easy to archive and to append eg. timestamps etc., it should be really easy to build a DIY data quality monitoring solution based scheduled jobs and this approach to execution. Or you can get in contact with Human Inference if you want something more sophisticated ;-)

Notice also that there's a HTML output type, which is also quite neat and easy to parse with an XML parser. The SERIALIZED format is more rich though, and includes information needed for more refined, programmatic access to the results. For instance, you might deserialize the whole file using the regular Java serialization API and access it, as an AnalysisResult instance. Thereby you could eg. create a timeline of a particular metric and track changes to the data that you are monitoring.

20111223

Push down query optimization in DataCleaner

As a follow-up to my previous post about how we make DataCleaner super-fast by applying some nice multi-threading tricks, The DataCleaner engine explained, I would now like to touch upon another performance booster: Push down query optimization.

To my knowledge "push down query optimization" is a trick that only very few tools support, since it requires a flow model that was actually built for it. The idea is that by inspecting an execution flow the tool might be able to identify steps in the beginning or in the end of the flow that can be replaced by query modifications.

For example, if your data flow begins with a filtering action that removes all records of a given type or restricts the further processing to only be the first 1000 records or something like that. Most tools simply require you to write some SQL yourself, which is also doable, but as I've said before on this blog, I think writing SQL is a barrier to productivity, creativity and good data quality results. So in DataCleaner we do not offer this option, because we have something that is much, much nicer. That solution is push down query optimization!

Let me illustrate. I will be using the Sakila example database for MySQL:

Say you want to do a simple pattern finding of film titles in the Sakila database, you would select the title column and you would get a result like this:

In the DataCleaner logs we can see what queries are actually fired to the database. Open up the log file (in the logs folder) and inspect datacleaner.log. You will find a line like this:

Executing query: SELECT `nicer_but_slower_film_list`.`title` FROM sakila.`nicer_but_slower_film_list`

That's fine. You can inspect the results closer, but that's not what this topic is about, so I'll carry on... Now let's say you want to refine your job. Let's instead see how the pattern distribution is if we want to only look at a few categories of films. So I add a 'Equals' filter to only select horror, sports and action movies and apply it to my pattern finder:

If we run the job, and inspect the log file again, we see now this entry:

Executing query: SELECT `nicer_but_slower_film_list`.`title`, `nicer_but_slower_film_list`.`category` FROM sakila.`nicer_but_slower_film_list` WHERE (`nicer_but_slower_film_list`.`category` = 'Horror' OR `nicer_but_slower_film_list`.`category` = 'Action' OR `nicer_but_slower_film_list`.`category` = 'Sports')

What's surprising here is that the filter actually got query optimized. Not all filters have this ability, since some of them have richer functionality than can be expressed as a query modification. But some of them do, and typically these are the small functions that make a big difference.

Let's also apply a Max rows filter that limits the analysis for only 20 records and chain it so that it depends on the Equals filter:

If we now run the job, both filters will have been applied to the query:

Executing query: SELECT `nicer_but_slower_film_list`.`title`, `nicer_but_slower_film_list`.`category` FROM sakila.`nicer_but_slower_film_list` WHERE (`nicer_but_slower_film_list`.`category` = 'Horror' OR `nicer_but_slower_film_list`.`category` = 'Action' OR `nicer_but_slower_film_list`.`category` = 'Sports') LIMIT 20

That means that we do as much as we can to optimize the query, without ever having to ask the user to help us. So if you modify the logical job, the physical queries are automatically adapted! This is why push down query optimization is a superior optimization technique to raw SQL. Happy data cleaning!

Additional information for developers: If you're developing plugins to DataCleaner and want to make a query optimized filter, then simply make sure you implement the QueryOptimizedFilter interface! Happy coding!

20111129

The DataCleaner engine explained

For this blog entry I have decided to record a short video instead of write till my fingers fall off :) So I present to you: My videoblog entry about DataCleaner's data quality processing engine, and how it compares to traditional ETL engines.

The DataCleaner engine was created from the ground-up to be optimized for Data Quality projects. It performs superiorly to any other engine that we've looked at, which I think is a pretty nice archievement. In the video I try to explain what makes it different!

The Camtasia Studio video content presented here requires JavaScript to be enabled and the latest version of the Adobe Flash Player. If you are using a browser with JavaScript disabled please enable it now. Otherwise, please update your version of the free Adobe Flash Player by downloading here.

Enjoy using DataCleaner :)

20111030

Standardize the date formats in your data

One of the things that I see sometimes is that web forms cause unstandardized data in your database. For example, text fields in web forms do not have a native way to specify the type of the data. So what if you have a field that is supposed to be a date? For example the birthdate of your web users? A lot of web applications are not performing real validations of the format and content of the data entered into such fields. I think this typically occurs because it was not thought of as important at the time of designing the initial web page. But maybe it will become important at a point in time if eg. you want to analyze the age groups of your users! The trouble is that later on in the applications lifecycle, a state of unchangeability enters because you're stuck with a bunch of unstandardized data that you cannot conform to a new standardized data format. This is because you will have a lot of different date formats represented. For example:

  • 2011-10-30
  • 20111030
  • 30th of October, 2011
  • 30/10/11
And maybe some even more exotic ones...
In this blog entry I will show you how to solve that migration issue with the use of DataCleaner.

1. Date mask matching
The first thing we should do is to analyze which date patterns are present in the data. To do this you need to combine two components: The Date mask matcher and the Boolean analyzer. Here are the steps involved.
  1. First set up you datastore in the welcome screen of DataCleaner.
  2. Click the "Analyze!" button to begin composing your job.
  3. In the tree to the left, select the columns of interest - in our example at least the birthdate column.
  4. Click "Add transformer -> Matching and standardization -> Date mask matcher".
Your screen will now look something like this:
In the middle of the screen you see a list of date masks. Each of these produce a boolean output column (seen below). The idea of the Date mask matcher is that it creates the boolean columns so that you can even assert if a particular date is parseable by using several date masks. That's because a single date string like "080910" can be understood in many ways!

2. Analyzing matches
Moving on, we want to see how well our dates match against the date masks. Since all the matches are now stored in boolean columns, we can apply the Boolean analyzer. Here are the steps involved:
  1. Click "Add analyzer -> Boolean analyzer".
  2. Make sure all the transformed boolean columns are checked.
  3. Click the "Run analysis" button.
  4. Wait for the analysis to run.
Your screen will now contain an analysis result like this:
The result has two parts: The Column statistics and the Frequency of combinations.
In the column statistics you can see how much individual date masks have been matched. In our example we can see that 4 of our date masks (no. 2, 3, 5 and 6) are not matched at all, so we may consider removing them from the Date mask matcher.
In the frequency of combinations we get a view of the rows and which match combinations are frequent and less frequent. The most frequent combination is that our date mask no. 1 is the only valid mask. The second most frequent combination (Combination 1) is that none of the date masks apply. If you click the green arrow to the right of the combination you will see which records fall into that category. In our example that looks like this:
This gives us a good hint about which date masks we need to add to our date mask matcher.
The "1982.03.21" date is a simple case - we should simply create a date mask like this: YYYY.MM.dd
The "11th of march, 1982" date is a bit more complex. We need to allow the date mask to have a literal string part (the "th of" part) and it needs to recognize the month by name ("march"), not by number. Fortunately this is still possible, the date mask looks like this: dd'th of' MMMMM, YYYY

3. Converting to dates
While we could continue to refine the analysis, this is a blog, not a reference manual and I want to cut to the chase - the actual migration to standardized dates!
So let us look at how you can convert your date strings to actual date fields which you can then choose to format using a standardized format. To do this, click "Add transformer -> Conversion -> Convert to date". You will now see a configuration panel like this:
In here you also see a list of example date masks. Click the plus-button to add additional date masks to convert by. The converter will try from the top to convert if it can, so in case you have cases like "091011" then you have to make your choice here (I would recommend based on your analysis).
We add the few masks that are relevant for our example:
And we verify that there are no immediate unrecognized dates, by clicking the "Preview data" button:
If a date is not recognized and converted, then the output column will have a null instead of a date. Therefore a good practice would be to look for null values and eg. save them to an error handling file. To do this, here's what we do:
  1. Go to the Filters tab.
  2. Click "Add filter -> Not null".
  3. Select only your converted column (in our example "birthdate (as date)").
  4. Click "INVALID -> Write to CSV file".
  5. Select the desired columns for the error handling file.
  6. Optionally right click the "Write to CSV file" tab and select "Rename component" to give it a name like "write errors".
  7. Go back to the Filters tab and click the VALID button to write the valid records to a CSV file or a spreadsheet.
After these steps, you should be able to inspect your job flow by clicking the Visualize button:
Now your date standardization job is ready to run again and again to enforce standardized dates!

20110926

Data Profiling SQLized. Uh oh...

What does 'Scrooge' and 'Kasper' have in common? Not much according to my SQL data profiler.

Some months back (admittedly, more than a couple) I was explaining how I think people tend do "home made data profiling" too often because it apparently seems easy to do in SQL. I went on to promise that I would also try to play the devil's advocate and show a few examples of "copy paste queries" that you can use for such a tool. In this blog post I will try to do so. But let me first say:


Don't do this at home kids!

But let's start with the first query that I would add to my home baked profiling SQL script. We'll do what anyone who hasn't really understood what profiling is all about will tell you to do: Do a column analysis based on the metrics that are easily available through all SQL implementations:

SELECT MAX(column), MIN(column), COUNT(column), COUNT(*)
FROM table;

This is a good query, especially for number columns. Here I would typically look of the MIN value is below zero or not. Is the COUNT(column) equal to the COUNT(*)? If not, it means that there are nulls in the column. Why not just do two separate queries, that would be more readable? Yes, but it also makes my script larger and I will have more stuff to maintain. But let's try it, we can actually improve it also by adding a few metrics:

SELECT MAX(column) AS highest_value FROM table;
SELECT MIN(column) AS lowest_positive_value FROM table WHERE column > 0;
SELECT MIN(column) AS lowest_negative_value FROM table WHERE column < 0;
SELECT COUNT(*) AS num_values FROM table WHERE column IS NOT NULL;
SELECT COUNT(*) AS num_nulls FROM table WHERE column IS NULL;

Now let's continue with some string columns, because I think more often than not, this is where data profiling turns out to be really valuable. Something that I often see as an inconsistency in structured string data is case differences. Such inconsistencies makes reporting and analysis of the data cumbersome and error prone because grouping and filtering will ultimately be inprecise. So let's do a case analysis:

SELECT COUNT(*) AS num_lowercase FROM table WHERE LCASE(column) = column;
SELECT COUNT(*) AS num_uppercase FROM table WHERE UCASE(column) = column;
SELECT COUNT(*) AS num_mixed_case FROM table WHERE LCASE(column) <> column AND UCASE(column) <> column;

And then on to query the always popular "first letter is capitalized" type of strings. This one really depends on the database, because substring functions have not been standardized across major SQL implementations. I'll show a few:

INITCAP-based approach (eg. PostgreSQL and Oracle):

SELECT COUNT(*) AS num_first_letter_capitalized FROM table WHERE INITCAP(column) = column;

SUBSTRING-based approach (eg. Microsoft SQL Server):

SELECT COUNT(*) AS num_first_letter_capitalized FROM table
WHERE UCASE(SUBSTR(column FROM 0 FOR 1)) = SUBSTR(column FROM 0 FOR 1)
AND LCASE(SUBSTR(column FROM 1)) = SUBSTR(column FROM 1)
A bit cumbersome, but get's the job done. Being the devil's advocate, I'm still not convinced that I should throw out my home baked SQL just yet. So I'm ready for another challenge!

Let's have a look at pattern finding through SQL. Again this is perfectly possible. I've even heard many people telling me that we should rewrite DataCleaner's Pattern Finder to make it SQL optimized. Read on and judge for yourself :-)

To match tokens by pattern we apply the simplest possible configuration in DataCleaner's pattern finder: All letters are replaced by 'A' or 'a' and all numbers are replaced by '9'. This makes for a nice pattern based matcher, like this:
Mickey Mouse -> 'Aaaaaa Aaaaa'
Minnie Mouse -> 'Aaaaaa Aaaaa'
Joachim von And -> 'Aaaaaaa aaa Aaa'
kasper@eobjects.dk -> 'aaaaaa@aaaaaaa.aa'
(Random fact: 'Joachim von And' is the Danish name for Scrooge McDuck)
As you can see from the patterns, this is a good preliminary way to determine if string values have the same form and syntax - we immediately see that the email address is odd and that although all other values look like valid names, som have lowercase tokens (prefixes) inbetween.

In PostgreSQL for example, this would look like:

SELECT regexp_replace(regexp_replace(regexp_replace(column, '[a-z]','a','g'), '[A-Z]','A','g'), '[0-9]','9','g') as pattern, COUNT(*) as pattern_count from table GROUP BY pattern;

This actually works like a charm and returns:

patternpattern_count
Aaaaaa Aaaaa2
Aaaaaaa aaa Aaa1
aaaaaa@aaaaaaaa.aa1
So why even use a profiling tool for finding patterns? All this seems to be possible through raw SQL?

I will now stop playing the devil's advocate... Cuz' seriously... This is nonsense! Having worked for some years on a pretty good data quality analysis tool, this approach absolutely disgusts me. Here's just a few random reasons why, off the top of my head:

  • We still haven't scratched the surface when it comes to supporting eg. non-ASCII characters in patterns.
  • Some tokens in patterns should be matched regardless of string length, some shouldn't. In our case we never matched strings with unequal lengths (eg. Scrooge and Mickey). This is a setting that you will want to play around with! (For examples, check out our Pattern Finder documentation)
  • Each metric in the previous analyses required their own query. This means that if you want to analyze a hundred metrics, you would need to query (at least) a hundred times.
  • A lot of metrics are simply not possible to express in SQL. Some examples: Diacritic character count, max/min amount of words, matches against reference data and more.
  • Often you will want to preprocess data before (or actually I would argue, as a part of) your profiling. This can be for example to extract information from composite values or to replace known inconsistencies with standardized values.
  • All the examples offer no drill-to-detail behaviour, so further analysis is more or less impossible. And drill-to-detail is not offered through SQL, so there is for example no way to express in our pattern finder SQL that we want to keep some samples of various pattern matches for later inspection.
  • All in all, using SQL for data profiling makes for a terribly unexplorative approach. It's a pain having to write and modify such an amount of SQL to get simple things done, so don't rely on it, because it will make you lazy and then you'll not investigate properly!
  • And of course, SQL only applies to databases that support SQL! If you're looking to profile data in other formats, then you're out of luck with this approach.

20110815

Get your data right... First Time Right!

In my blog I mostly talk about data quality tools like DataCleaner that are diagnostic and treating, rather than preventive. Such tools have a lot of merit and strengths, but for a total view on data quality it is crucial that you also include tools that are preventive of poor data ever entering your system. In this blog post I want to talk a bit about a project that I have been involved with at Human Inference which is just that - our First Time Right JavaScript solution.

The idea is that we provide a subscription-based JavaScript API where you can easily decorate any HTML contact form with a lot of rich features for on-the-fly verification, validation, auto correction and helpful features for automatic filling of derived fields.

For example, the API allows you to enter (or copy/paste) a full name, including titulation, salutation, initials and more - and get these items parsed and placed into corresponding fields on a detailed contact form. It will even automatically detect what the gender of the contact is, and apply this in gender fields. We have similar data entry aids for address input, email input, phone numbers and contact duplicate checking.

Take a look at the video below, which demonstrate most of the features:



Now this is quite exciting functionality, but this is also a technical blog, so I'll talk a bit about the technology involved.

We built the project based on Google Web Toolkit (GWT). GWT enables us to build a very rich application, entirely in JavaScript, so that it can be embedded on any website - no matter if it's PHP based, ASP.NET based, Java based or whatever. Of course we do have a server-side piece that the JavaScript communicates with, but that is all hosted at Human Inferences cloud platform. So in other words: The deployment of our First Time Right principle is a breeze!

Since AJAX applications require locality of the server that it is communicating with, we've had to overcome quite some issues to allow the JavaScript to be external from the deployment sites. This is crucial as we want upgrades and improvements to be performed on our premises, not at individual customer sites. This way we can really leverage the cloud- and subscription-based approach to data quality. Our solution to the locality problem has been the JSONP approach, which is an alternative protocol for implementing AJAX behaviour. JSONP is a rather clever construct where instead of issuing actual HTTP requests, you insert new <script> elements into the HTML DOM at runtime! This means that the browser will perform a new request simply because the <script> element refers a new JavaScript source. It's not "pretty" to tackle errorhandling and the asynchronicity that this approach brings on, but we've done a lot of work to get it right, and it works like a charm! I hope to share some of our design patterns later, to demonstrate how it works.

Another challenge was of security. Obviously you will want to make sure that the JavaScript is only available for subscribers. And only for the websites that they've subscribed to (because otherwise the JavaScript can simply be copied to another website). Our way around this resembles how for example Google manages their subscriptions to Google Maps and other subscription services, where you need a site-specific API key. Very clever.

A few optional features may require some local add-on deployment. In particular, deduplication requires us to know the contact data to use as the source for detecting if a new contact is a duplicate. Here we have two options: On-premise installation of the deduplication engine or hooking up with our cloud-based deduplication engine, which can be configured to sync with your datastores.

All in all I am quite enthusiastic about the FTR solution and the technology behind the solution. I also think that our FTR API is an example of a lightweight approach to implementing Data Quality, which complements DataCleaner very well. Both tools are extremely useful for ensuring a high level of data quality, and both tools are very intuitive and flexible in the way you can deploy them.

20110804

Eye candy in Java 7: New javadoc style!

By now most of you've probably heard that Java 7 is out and there's a lot of discussions about new features, the loop optimization bug and general adoption.

But one of the things in Java 7 which has escaped most people attention (I think) is the new javadoc style.

Check it out:


And see it live - we've just published an updated API documentation for MetaModel.