20120417

Data quality monitoring with Kettle and DataCleaner

We've just announced a great thing - the cooperation with Pentaho and DataCleaner which brings DataCleaners profiling features to all users of Pentaho Data Integration (aka. Kettle)! Not only is this something I've been looking forward to for a long time because it is a great exposure for us, but it also opens up new doors in terms of functionality. In this blog post I'll describe something new: Data monitoring with Kettle and DataCleaner.

While DataCleaner is perfectly capable of doing continuous data profiling, we lack the deployment platform that Pentaho has. With Pentaho you get orchestration and scheduling, and even with a graphical editor.

A scenario that I often encounter is that someone wants to execute a daily profiling job, archive the results with a timestamp and have the results emailed to the data steward. Previously we would set this sorta thing up with DataCleaner's command line interface, which is still quite a nice solution, but if you have more than just a few of these jobs, it can quickly become a mess.

So alternatively, I can now just create a Kettle job like this:


Here's what the example does:

  1. Starts the job (duh!)
  2. Creates a timestamp which needs to be used for archiving the result. This is done using a separate transformation, which you can do either using the "Get System Info" step or the "Formula" step. The result is put into a variable called "today".
  3. Executes the DataCleaner job. The result filename is set to include the "${today}" variable!
  4. Emails the results to the data steward.
  5. If everything went well without errors, the job is succesful.
Pretty neat and something I am extremely happy about!

In the future I imagine to have even more features built like this. For example an ability to run multiple DataCleaner jobs with configuration options stored as data in the ETL flow. Or the ability to treat the stream of data in a Kettle transformation as the input of the DataCleaner job. Do you guys have any other wild ideas?

20120409

Implementing a custom datastore in DataCleaner

A question I am often asked by super-users, partners and developers of DataCleaner is: How do you build a custom datastore in DataCleaner for my system/file-format XYZ? Recently I've dealt with this for the use in the upcoming integration with Pentaho Kettle, for a Human Inference customer who had a home grown database proxy system, and just today while it was asked on the DataCleaner forum. In this blog post I will guide you through this process, which requires some basic Java programming skills, but if that's in place it isn't terribly complicated.

Just gimme the code ...

First of all I should say (to those of you who prefer "just the code") that there is already an example of how to do this in the sample extension for DataCleaner. Take a look at the org.eobjects.datacleaner.sample.SampleDatastore class. Once you've read, understood and compiled the Java code, all you need to do is register the datastore in DataCleaner's conf.xml file like this (within the <datastore-catalog> element):

<custom-datastore class-name="org.eobjects.datacleaner.sample.SampleDatastore">
  <property name="Name" value="My datastore" />
</custom-datastore>

A bit more explanation please!

OK, so if you wanna really know how it works, here goes...

First of all, a datastore in DataCleaner needs to implement the Datastore interface. But instead of implementing the interface directly, I would suggest using the abstract implementation called the UsageAwareDatastore. This abstract implementation handles concurrent access to the datastore, reusing existing connections and more. What you still need to provide when extending the UsageAwareDatastore class is primarily the createDatastoreConnection() method which is invoked when a (new) connection is requested. Let's see how an initial new Datastore implementation will look like:

public class ExampleDatastore extends UsageAwareDatastore<DataContext> {

 private static final long serialVersionUID = 1L;
 
 public ExampleDatastore() {
  super("My datastore");
 }

 @Override
 protected UsageAwareDatastoreConnection createDatastoreConnection() {
  // TODO Auto-generated method stub
  return null;
 }

 @Override
 public PerformanceCharacteristics getPerformanceCharacteristics() {
  // TODO Auto-generated method stub
  return null;
 }
}

Notice that I have created a no-arg constructor. This is REQUIRED for custom datastores, since the datastore will be instantiated by DataCleaner. Later we will focus on how to make the name ("My datastore") adjustable.

First we want to have a look at the two unimplemented methods:

  • createDatastoreConnection() is used to create a new connection. DataCleaner builds upon the MetaModel framework for data access. You will need to return a new DatastoreConnectionImpl(...). This class takes an important parameter, namely your MetaModel DataContext implementation. Often times there will already be a DataContext that you can use given some configuration, eg. a JdbcDataContext, a CsvDataContext, ExcelDataContext, MongoDbDatacontext or whatever.
  • getPerformanceCharacteristics() is used by DataCleaner to figure out the query plan when executing a job. You will typically just return a new PerformanceCharacteristics(false);. Read the javadoc for more information :)

Parameterizable properties, please

By now you should be able to implement a custom datastore, which hopefully covers your basic needs. But maybe you want to reuse the datastore class with eg. different files, different hostnames etc. In other words: Maybe you want to let your user define certain properties of the datastore.

To your rescue is the @Configured annotation, which is an annotation widely used in DataCleaner. It allows you to annotate fields in your class which should be configured by the user. The types of the fields can be Strings, Integers, Files etc., you name it. Let's see how you would expose the properties of a typical connection:

public class ExampleDatastore extends UsageAwareDatastore<DataContext> {
 // ...

 @Configured
 String datastoreName;

 @Configured
 String hostname;

 @Configured
 Integer port;

 @Configured
 String systemId;

 // ...
}
And how you would typically use them to implement methods:
public class ExampleDatastore extends UsageAwareDatastore<DataContext> {
 // ...

 @Override
 public String getName() {
  return datastoreName;
 }

 @Override
 protected UsageAwareDatastoreConnection createDatastoreConnection() {
  DataContext dataContext = createDataContext(hostname, port, systemId);
  return new DatastoreConnectionImpl(dataContext, this);
 }
}

If I wanted to configure a datastore using the parameters above, I could enter it in my conf.xml file like this:

<custom-datastore class-name="foo.bar.ExampleDatastore">
  <property name="Datastore name" value="My datastore" />
  <property name="Hostname" value="localhost" />
  <property name="Port" value="1234" />
  <property name="System id" value="foobar" />
</custom-datastore>

Notice that the names of the properties are inferred by reversing the camelCase notation which Java uses, so that "datastoreName" becomes "Datastore name" and so on. Alternatively you can provide an explicit name in the @Configured annotation.

I hope this introduction tutorial makes some sense for you. Once again I urge you to take a look at the Sample DataCleaner extension, which also includes a build setup (Maven based), and a custom MetaModel DataContext implementation.

20120123

Now you can build your own DQ monitoring solution with DataCleaner

In the cover of night we've released a new version of DataCleaner today (version 2.4.2). Officially it's a minor release because for the User Interface very few things have changed, only a few bugfixes and minor enhancements have been introduced. But one potentially major feature have been added in the inner workings of DataCleaner: The ability to persist the results of your DQ analysis jobs. Although this feature still has very limited User Interface support, it has full support in the command line interface, which I would argue is actually sufficient for the purposes of establishing a data quality monitoring solution. Later on I do expect there to be full (and backwards compatible) support in the UI as well.

So what is it, and how does it work?
Well basically it is simply two new parameters to the command line interface:

 -of (--output-file) FILE                          : File in which to save the result of the job
 -ot (--output-type) [TEXT | HTML | SERIALIZED]    : How to represent the result of the job

Here's an example of how to use it. Notice that I use the file extension .analysis.result.dat, which is the one thing that is currently implemented and recognized in the UI as a result file.

> DataCleaner-console.exe -job examples\employees.analysis.xml\
 -ot SERIALIZED\
 -of employees.analysis.result.dat

Now start up DataCleaner's UI, and select "File -> Open analysis job..." - you'll suddenly see that the produced file can be opened:

And when you open the file, the result will be displayed just like a job you've run inside the application:

Since files like this are generally easy to archive and to append eg. timestamps etc., it should be really easy to build a DIY data quality monitoring solution based scheduled jobs and this approach to execution. Or you can get in contact with Human Inference if you want something more sophisticated ;-)

Notice also that there's a HTML output type, which is also quite neat and easy to parse with an XML parser. The SERIALIZED format is more rich though, and includes information needed for more refined, programmatic access to the results. For instance, you might deserialize the whole file using the regular Java serialization API and access it, as an AnalysisResult instance. Thereby you could eg. create a timeline of a particular metric and track changes to the data that you are monitoring.

20111223

Push down query optimization in DataCleaner

As a follow-up to my previous post about how we make DataCleaner super-fast by applying some nice multi-threading tricks, The DataCleaner engine explained, I would now like to touch upon another performance booster: Push down query optimization.

To my knowledge "push down query optimization" is a trick that only very few tools support, since it requires a flow model that was actually built for it. The idea is that by inspecting an execution flow the tool might be able to identify steps in the beginning or in the end of the flow that can be replaced by query modifications.

For example, if your data flow begins with a filtering action that removes all records of a given type or restricts the further processing to only be the first 1000 records or something like that. Most tools simply require you to write some SQL yourself, which is also doable, but as I've said before on this blog, I think writing SQL is a barrier to productivity, creativity and good data quality results. So in DataCleaner we do not offer this option, because we have something that is much, much nicer. That solution is push down query optimization!

Let me illustrate. I will be using the Sakila example database for MySQL:

Say you want to do a simple pattern finding of film titles in the Sakila database, you would select the title column and you would get a result like this:

In the DataCleaner logs we can see what queries are actually fired to the database. Open up the log file (in the logs folder) and inspect datacleaner.log. You will find a line like this:

Executing query: SELECT `nicer_but_slower_film_list`.`title` FROM sakila.`nicer_but_slower_film_list`

That's fine. You can inspect the results closer, but that's not what this topic is about, so I'll carry on... Now let's say you want to refine your job. Let's instead see how the pattern distribution is if we want to only look at a few categories of films. So I add a 'Equals' filter to only select horror, sports and action movies and apply it to my pattern finder:

If we run the job, and inspect the log file again, we see now this entry:

Executing query: SELECT `nicer_but_slower_film_list`.`title`, `nicer_but_slower_film_list`.`category` FROM sakila.`nicer_but_slower_film_list` WHERE (`nicer_but_slower_film_list`.`category` = 'Horror' OR `nicer_but_slower_film_list`.`category` = 'Action' OR `nicer_but_slower_film_list`.`category` = 'Sports')

What's surprising here is that the filter actually got query optimized. Not all filters have this ability, since some of them have richer functionality than can be expressed as a query modification. But some of them do, and typically these are the small functions that make a big difference.

Let's also apply a Max rows filter that limits the analysis for only 20 records and chain it so that it depends on the Equals filter:

If we now run the job, both filters will have been applied to the query:

Executing query: SELECT `nicer_but_slower_film_list`.`title`, `nicer_but_slower_film_list`.`category` FROM sakila.`nicer_but_slower_film_list` WHERE (`nicer_but_slower_film_list`.`category` = 'Horror' OR `nicer_but_slower_film_list`.`category` = 'Action' OR `nicer_but_slower_film_list`.`category` = 'Sports') LIMIT 20

That means that we do as much as we can to optimize the query, without ever having to ask the user to help us. So if you modify the logical job, the physical queries are automatically adapted! This is why push down query optimization is a superior optimization technique to raw SQL. Happy data cleaning!

Additional information for developers: If you're developing plugins to DataCleaner and want to make a query optimized filter, then simply make sure you implement the QueryOptimizedFilter interface! Happy coding!

20111129

The DataCleaner engine explained

For this blog entry I have decided to record a short video instead of write till my fingers fall off :) So I present to you: My videoblog entry about DataCleaner's data quality processing engine, and how it compares to traditional ETL engines.

The DataCleaner engine was created from the ground-up to be optimized for Data Quality projects. It performs superiorly to any other engine that we've looked at, which I think is a pretty nice archievement. In the video I try to explain what makes it different!

The Camtasia Studio video content presented here requires JavaScript to be enabled and the latest version of the Adobe Flash Player. If you are using a browser with JavaScript disabled please enable it now. Otherwise, please update your version of the free Adobe Flash Player by downloading here.

Enjoy using DataCleaner :)

20111030

Standardize the date formats in your data

One of the things that I see sometimes is that web forms cause unstandardized data in your database. For example, text fields in web forms do not have a native way to specify the type of the data. So what if you have a field that is supposed to be a date? For example the birthdate of your web users? A lot of web applications are not performing real validations of the format and content of the data entered into such fields. I think this typically occurs because it was not thought of as important at the time of designing the initial web page. But maybe it will become important at a point in time if eg. you want to analyze the age groups of your users! The trouble is that later on in the applications lifecycle, a state of unchangeability enters because you're stuck with a bunch of unstandardized data that you cannot conform to a new standardized data format. This is because you will have a lot of different date formats represented. For example:

  • 2011-10-30
  • 20111030
  • 30th of October, 2011
  • 30/10/11
And maybe some even more exotic ones...
In this blog entry I will show you how to solve that migration issue with the use of DataCleaner.

1. Date mask matching
The first thing we should do is to analyze which date patterns are present in the data. To do this you need to combine two components: The Date mask matcher and the Boolean analyzer. Here are the steps involved.
  1. First set up you datastore in the welcome screen of DataCleaner.
  2. Click the "Analyze!" button to begin composing your job.
  3. In the tree to the left, select the columns of interest - in our example at least the birthdate column.
  4. Click "Add transformer -> Matching and standardization -> Date mask matcher".
Your screen will now look something like this:
In the middle of the screen you see a list of date masks. Each of these produce a boolean output column (seen below). The idea of the Date mask matcher is that it creates the boolean columns so that you can even assert if a particular date is parseable by using several date masks. That's because a single date string like "080910" can be understood in many ways!

2. Analyzing matches
Moving on, we want to see how well our dates match against the date masks. Since all the matches are now stored in boolean columns, we can apply the Boolean analyzer. Here are the steps involved:
  1. Click "Add analyzer -> Boolean analyzer".
  2. Make sure all the transformed boolean columns are checked.
  3. Click the "Run analysis" button.
  4. Wait for the analysis to run.
Your screen will now contain an analysis result like this:
The result has two parts: The Column statistics and the Frequency of combinations.
In the column statistics you can see how much individual date masks have been matched. In our example we can see that 4 of our date masks (no. 2, 3, 5 and 6) are not matched at all, so we may consider removing them from the Date mask matcher.
In the frequency of combinations we get a view of the rows and which match combinations are frequent and less frequent. The most frequent combination is that our date mask no. 1 is the only valid mask. The second most frequent combination (Combination 1) is that none of the date masks apply. If you click the green arrow to the right of the combination you will see which records fall into that category. In our example that looks like this:
This gives us a good hint about which date masks we need to add to our date mask matcher.
The "1982.03.21" date is a simple case - we should simply create a date mask like this: YYYY.MM.dd
The "11th of march, 1982" date is a bit more complex. We need to allow the date mask to have a literal string part (the "th of" part) and it needs to recognize the month by name ("march"), not by number. Fortunately this is still possible, the date mask looks like this: dd'th of' MMMMM, YYYY

3. Converting to dates
While we could continue to refine the analysis, this is a blog, not a reference manual and I want to cut to the chase - the actual migration to standardized dates!
So let us look at how you can convert your date strings to actual date fields which you can then choose to format using a standardized format. To do this, click "Add transformer -> Conversion -> Convert to date". You will now see a configuration panel like this:
In here you also see a list of example date masks. Click the plus-button to add additional date masks to convert by. The converter will try from the top to convert if it can, so in case you have cases like "091011" then you have to make your choice here (I would recommend based on your analysis).
We add the few masks that are relevant for our example:
And we verify that there are no immediate unrecognized dates, by clicking the "Preview data" button:
If a date is not recognized and converted, then the output column will have a null instead of a date. Therefore a good practice would be to look for null values and eg. save them to an error handling file. To do this, here's what we do:
  1. Go to the Filters tab.
  2. Click "Add filter -> Not null".
  3. Select only your converted column (in our example "birthdate (as date)").
  4. Click "INVALID -> Write to CSV file".
  5. Select the desired columns for the error handling file.
  6. Optionally right click the "Write to CSV file" tab and select "Rename component" to give it a name like "write errors".
  7. Go back to the Filters tab and click the VALID button to write the valid records to a CSV file or a spreadsheet.
After these steps, you should be able to inspect your job flow by clicking the Visualize button:
Now your date standardization job is ready to run again and again to enforce standardized dates!

20110926

Data Profiling SQLized. Uh oh...

What does 'Scrooge' and 'Kasper' have in common? Not much according to my SQL data profiler.

Some months back (admittedly, more than a couple) I was explaining how I think people tend do "home made data profiling" too often because it apparently seems easy to do in SQL. I went on to promise that I would also try to play the devil's advocate and show a few examples of "copy paste queries" that you can use for such a tool. In this blog post I will try to do so. But let me first say:


Don't do this at home kids!

But let's start with the first query that I would add to my home baked profiling SQL script. We'll do what anyone who hasn't really understood what profiling is all about will tell you to do: Do a column analysis based on the metrics that are easily available through all SQL implementations:

SELECT MAX(column), MIN(column), COUNT(column), COUNT(*)
FROM table;

This is a good query, especially for number columns. Here I would typically look of the MIN value is below zero or not. Is the COUNT(column) equal to the COUNT(*)? If not, it means that there are nulls in the column. Why not just do two separate queries, that would be more readable? Yes, but it also makes my script larger and I will have more stuff to maintain. But let's try it, we can actually improve it also by adding a few metrics:

SELECT MAX(column) AS highest_value FROM table;
SELECT MIN(column) AS lowest_positive_value FROM table WHERE column > 0;
SELECT MIN(column) AS lowest_negative_value FROM table WHERE column < 0;
SELECT COUNT(*) AS num_values FROM table WHERE column IS NOT NULL;
SELECT COUNT(*) AS num_nulls FROM table WHERE column IS NULL;

Now let's continue with some string columns, because I think more often than not, this is where data profiling turns out to be really valuable. Something that I often see as an inconsistency in structured string data is case differences. Such inconsistencies makes reporting and analysis of the data cumbersome and error prone because grouping and filtering will ultimately be inprecise. So let's do a case analysis:

SELECT COUNT(*) AS num_lowercase FROM table WHERE LCASE(column) = column;
SELECT COUNT(*) AS num_uppercase FROM table WHERE UCASE(column) = column;
SELECT COUNT(*) AS num_mixed_case FROM table WHERE LCASE(column) <> column AND UCASE(column) <> column;

And then on to query the always popular "first letter is capitalized" type of strings. This one really depends on the database, because substring functions have not been standardized across major SQL implementations. I'll show a few:

INITCAP-based approach (eg. PostgreSQL and Oracle):

SELECT COUNT(*) AS num_first_letter_capitalized FROM table WHERE INITCAP(column) = column;

SUBSTRING-based approach (eg. Microsoft SQL Server):

SELECT COUNT(*) AS num_first_letter_capitalized FROM table
WHERE UCASE(SUBSTR(column FROM 0 FOR 1)) = SUBSTR(column FROM 0 FOR 1)
AND LCASE(SUBSTR(column FROM 1)) = SUBSTR(column FROM 1)
A bit cumbersome, but get's the job done. Being the devil's advocate, I'm still not convinced that I should throw out my home baked SQL just yet. So I'm ready for another challenge!

Let's have a look at pattern finding through SQL. Again this is perfectly possible. I've even heard many people telling me that we should rewrite DataCleaner's Pattern Finder to make it SQL optimized. Read on and judge for yourself :-)

To match tokens by pattern we apply the simplest possible configuration in DataCleaner's pattern finder: All letters are replaced by 'A' or 'a' and all numbers are replaced by '9'. This makes for a nice pattern based matcher, like this:
Mickey Mouse -> 'Aaaaaa Aaaaa'
Minnie Mouse -> 'Aaaaaa Aaaaa'
Joachim von And -> 'Aaaaaaa aaa Aaa'
kasper@eobjects.dk -> 'aaaaaa@aaaaaaa.aa'
(Random fact: 'Joachim von And' is the Danish name for Scrooge McDuck)
As you can see from the patterns, this is a good preliminary way to determine if string values have the same form and syntax - we immediately see that the email address is odd and that although all other values look like valid names, som have lowercase tokens (prefixes) inbetween.

In PostgreSQL for example, this would look like:

SELECT regexp_replace(regexp_replace(regexp_replace(column, '[a-z]','a','g'), '[A-Z]','A','g'), '[0-9]','9','g') as pattern, COUNT(*) as pattern_count from table GROUP BY pattern;

This actually works like a charm and returns:

patternpattern_count
Aaaaaa Aaaaa2
Aaaaaaa aaa Aaa1
aaaaaa@aaaaaaaa.aa1
So why even use a profiling tool for finding patterns? All this seems to be possible through raw SQL?

I will now stop playing the devil's advocate... Cuz' seriously... This is nonsense! Having worked for some years on a pretty good data quality analysis tool, this approach absolutely disgusts me. Here's just a few random reasons why, off the top of my head:

  • We still haven't scratched the surface when it comes to supporting eg. non-ASCII characters in patterns.
  • Some tokens in patterns should be matched regardless of string length, some shouldn't. In our case we never matched strings with unequal lengths (eg. Scrooge and Mickey). This is a setting that you will want to play around with! (For examples, check out our Pattern Finder documentation)
  • Each metric in the previous analyses required their own query. This means that if you want to analyze a hundred metrics, you would need to query (at least) a hundred times.
  • A lot of metrics are simply not possible to express in SQL. Some examples: Diacritic character count, max/min amount of words, matches against reference data and more.
  • Often you will want to preprocess data before (or actually I would argue, as a part of) your profiling. This can be for example to extract information from composite values or to replace known inconsistencies with standardized values.
  • All the examples offer no drill-to-detail behaviour, so further analysis is more or less impossible. And drill-to-detail is not offered through SQL, so there is for example no way to express in our pattern finder SQL that we want to keep some samples of various pattern matches for later inspection.
  • All in all, using SQL for data profiling makes for a terribly unexplorative approach. It's a pain having to write and modify such an amount of SQL to get simple things done, so don't rely on it, because it will make you lazy and then you'll not investigate properly!
  • And of course, SQL only applies to databases that support SQL! If you're looking to profile data in other formats, then you're out of luck with this approach.