20120417

Data quality monitoring with Kettle and DataCleaner

We've just announced a great thing: a cooperation between Pentaho and DataCleaner which brings DataCleaner's profiling features to all users of Pentaho Data Integration (aka Kettle)! Not only is this something I've been looking forward to for a long time because it gives us great exposure, but it also opens up new doors in terms of functionality. In this blog post I'll describe something new: data quality monitoring with Kettle and DataCleaner.

While DataCleaner is perfectly capable of doing continuous data profiling, we lack the deployment platform that Pentaho has. With Pentaho you get orchestration and scheduling, complete with a graphical editor.

A scenario that I often encounter is that someone wants to execute a daily profiling job, archive the results with a timestamp and have them emailed to the data steward. Previously we would set this sort of thing up with DataCleaner's command line interface, which is still quite a nice solution, but if you have more than just a few of these jobs, it can quickly become a mess.

So alternatively, I can now just create a Kettle job like this:


Here's what the example does:

  1. Starts the job (duh!)
  2. Creates a timestamp which will be used for archiving the result. This is done in a separate transformation, using either the "Get System Info" step or the "Formula" step. The result is put into a variable called "today".
  3. Executes the DataCleaner job. The result filename is set to include the "${today}" variable!
  4. Emails the results to the data steward.
  5. If everything went well without errors, the job is successful.

Pretty neat and something I am extremely happy about!

In the future I imagine having even more features built like this. For example, the ability to run multiple DataCleaner jobs with configuration options stored as data in the ETL flow. Or the ability to treat the stream of data in a Kettle transformation as the input of the DataCleaner job. Do you guys have any other wild ideas?
Update: In fact we are now taking steps to provide more elaborate data quality monitoring features to the community. Go to my blog entry about the plans for DataCleaner 3.0 for more information.

20120409

Implementing a custom datastore in DataCleaner

A question I am often asked by super-users, partners and developers of DataCleaner is: How do you build a custom datastore in DataCleaner for my system/file-format XYZ? Recently I've dealt with this for the upcoming integration with Pentaho Kettle, for a Human Inference customer who had a home-grown database proxy system, and just today when it was asked on the DataCleaner forum. In this blog post I will guide you through the process, which requires some basic Java programming skills, but if that's in place it isn't terribly complicated.

Just gimme the code ...

First of all I should say (to those of you who prefer "just the code") that there is already an example of how to do this in the sample extension for DataCleaner. Take a look at the org.eobjects.datacleaner.sample.SampleDatastore class. Once you've read, understood and compiled the Java code, all you need to do is register the datastore in DataCleaner's conf.xml file like this (within the <datastore-catalog> element):

<custom-datastore class-name="org.eobjects.datacleaner.sample.SampleDatastore">
  <property name="Name" value="My datastore" />
</custom-datastore>

A bit more explanation please!

OK, so if you really wanna know how it works, here goes...

First of all, a datastore in DataCleaner needs to implement the Datastore interface. But instead of implementing the interface directly, I would suggest extending the abstract implementation called UsageAwareDatastore. This abstract implementation handles concurrent access to the datastore, reuse of existing connections and more. What you still need to provide when extending the UsageAwareDatastore class is primarily the createDatastoreConnection() method, which is invoked when a (new) connection is requested. Let's see what an initial Datastore implementation looks like:

public class ExampleDatastore extends UsageAwareDatastore<DataContext> {

 private static final long serialVersionUID = 1L;
 
 public ExampleDatastore() {
  super("My datastore");
 }

 @Override
 protected UsageAwareDatastoreConnection createDatastoreConnection() {
  // TODO Auto-generated method stub
  return null;
 }

 @Override
 public PerformanceCharacteristics getPerformanceCharacteristics() {
  // TODO Auto-generated method stub
  return null;
 }
}

Notice that I have created a no-arg constructor. This is REQUIRED for custom datastores, since the datastore will be instantiated by DataCleaner. Later we will focus on how to make the name ("My datastore") adjustable.

First we want to have a look at the two unimplemented methods:

  • createDatastoreConnection() is used to create a new connection. DataCleaner builds upon the MetaModel framework for data access. You will need to return a new DatastoreConnectionImpl(...). This class takes an important parameter, namely your MetaModel DataContext implementation. Oftentimes there will already be a DataContext that you can use given some configuration, e.g. a JdbcDataContext, a CsvDataContext, an ExcelDataContext, a MongoDbDataContext or whatever.
  • getPerformanceCharacteristics() is used by DataCleaner to figure out the query plan when executing a job. You will typically just return a new PerformanceCharacteristics(false);. Read the javadoc for more information :) A small sketch of both methods follows right after this list.
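
Just to make it concrete, here is a minimal sketch of how the two stubs from the skeleton above might be filled in, assuming the datastore is backed by a plain CSV file (the file name is of course just an example, and DataContextFactory is MetaModel's standard factory class):

public class ExampleDatastore extends UsageAwareDatastore<DataContext> {
 // ...

 @Override
 protected UsageAwareDatastoreConnection createDatastoreConnection() {
  // create a MetaModel DataContext - in this sketch simply one based on a CSV file
  DataContext dataContext = DataContextFactory.createCsvDataContext(new File("customers.csv"));
  return new DatastoreConnectionImpl(dataContext, this);
 }

 @Override
 public PerformanceCharacteristics getPerformanceCharacteristics() {
  // as mentioned above, 'false' is the typical value here (see the javadoc for details)
  return new PerformanceCharacteristics(false);
 }
}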

Parameterizable properties, please

By now you should be able to implement a custom datastore, which hopefully covers your basic needs. But maybe you want to reuse the datastore class with e.g. different files, different hostnames, etc. In other words: maybe you want to let your user define certain properties of the datastore.

To the rescue comes the @Configured annotation, which is widely used in DataCleaner. It allows you to annotate fields in your class which should be configured by the user. The types of the fields can be Strings, Integers, Files and so on - you name it. Let's see how you would expose the properties of a typical connection:

public class ExampleDatastore extends UsageAwareDatastore<DataContext> {
 // ...

 @Configured
 String datastoreName;

 @Configured
 String hostname;

 @Configured
 Integer port;

 @Configured
 String systemId;

 // ...
}

And here is how you would typically use them to implement the datastore's methods:

public class ExampleDatastore extends UsageAwareDatastore<DataContext> {
 // ...

 @Override
 public String getName() {
  return datastoreName;
 }

 @Override
 protected UsageAwareDatastoreConnection createDatastoreConnection() {
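  // createDataContext(...) is assumed to be your own helper that builds a MetaModel DataContext from the configured properties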
  DataContext dataContext = createDataContext(hostname, port, systemId);
  return new DatastoreConnectionImpl(dataContext, this);
 }
}

If I wanted to configure a datastore using the parameters above, I could enter it in my conf.xml file like this:

<custom-datastore class-name="foo.bar.ExampleDatastore">
  <property name="Datastore name" value="My datastore" />
  <property name="Hostname" value="localhost" />
  <property name="Port" value="1234" />
  <property name="System id" value="foobar" />
</custom-datastore>

Notice that the names of the properties are inferred from the Java field names by expanding the camelCase notation, so that "datastoreName" becomes "Datastore name" and so on. Alternatively you can provide an explicit name in the @Configured annotation.
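
If you would rather control the property name yourself, the @Configured annotation can carry the explicit name. A small, hypothetical sketch (the label is entirely up to you):

public class ExampleDatastore extends UsageAwareDatastore<DataContext> {
 // ...

 @Configured("Hostname or IP address")
 String hostname;

 // ...
}

The corresponding <property name="..."> element in conf.xml then has to match that explicit name.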

I hope this introductory tutorial makes sense to you. Once again I urge you to take a look at the Sample DataCleaner extension, which also includes a Maven-based build setup and a custom MetaModel DataContext implementation.