Data quality monitoring with Kettle and DataCleaner

We've just announced a great thing - the cooperation between Pentaho and DataCleaner, which brings DataCleaner's profiling features to all users of Pentaho Data Integration (a.k.a. Kettle)! Not only is this something I've been looking forward to for a long time because it is great exposure for us, but it also opens up new doors in terms of functionality. In this blog post I'll describe something new: data monitoring with Kettle and DataCleaner.

While DataCleaner is perfectly capable of doing continuous data profiling, we lack the deployment platform that Pentaho has. With Pentaho you get orchestration and scheduling, complete with a graphical editor.

A scenario that I often encounter is that someone wants to execute a daily profiling job, archive the results with a timestamp and have the results emailed to the data steward. Previously we would set this sort of thing up with DataCleaner's command line interface, which is still quite a nice solution, but if you have more than just a few of these jobs, it can quickly become a mess.
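To make the command-line approach concrete, here is a minimal sketch of how such a daily run could be scripted. The script path, job file and flag names are assumptions for illustration (check your DataCleaner installation's console script for the exact options); the sketch only builds the command and the timestamped archive path:

```python
from datetime import date

def build_profiling_command(console_script: str, job_file: str, result_dir: str) -> list:
    """Compose a hypothetical DataCleaner CLI invocation that archives
    the result under a date-stamped filename. Flag names are assumptions."""
    today = date.today().strftime("%Y-%m-%d")
    result_file = f"{result_dir}/profile-{today}.analysis.result.dat"
    # The actual invocation would be run with e.g. subprocess.run(cmd)
    return [console_script, "-job", job_file, "-of", result_file]

cmd = build_profiling_command(
    "DataCleaner-console.sh",           # hypothetical script name
    "jobs/customers.analysis.xml",      # hypothetical job file
    "/var/dq/results",
)
```

Scheduling one such script per job via cron works fine for a handful of jobs, but as the post notes, it becomes hard to maintain at scale - which is exactly the gap the Kettle integration fills.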

So alternatively, I can now just create a Kettle job like this:

Here's what the example does:

  1. Starts the job (duh!)
  2. Creates a timestamp which will be used for archiving the result. This is done in a separate transformation, using either the "Get System Info" step or the "Formula" step. The result is put into a variable called "today".
  3. Executes the DataCleaner job. The result filename is set to include the "${today}" variable!
  4. Emails the results to the data steward.
  5. If everything went well without errors, the job is successful.

Pretty neat and something I am extremely happy about!
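The interesting bit is step 2 and 3: a date value is stored in a variable and then substituted into the result filename. A minimal sketch of that substitution logic (the base directory and job name here are hypothetical, and Kettle of course does the `${today}` expansion itself):

```python
from datetime import date

def result_filename(base_dir: str, job_name: str) -> str:
    """Build a timestamped result path, mirroring what the Kettle job does
    when it expands the ${today} variable in the DataCleaner result filename."""
    today = date.today().strftime("%Y-%m-%d")  # the value step 2 stores in "today"
    return f"{base_dir}/{job_name}-{today}.analysis.result.dat"

path = result_filename("/var/dq/results", "customers")
```

Because the filename changes every day, each run archives its own result instead of overwriting yesterday's, which is what makes the emailed history useful to the data steward.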

In the future I imagine even more features being built like this. For example, the ability to run multiple DataCleaner jobs with configuration options stored as data in the ETL flow. Or the ability to treat the stream of data in a Kettle transformation as the input of the DataCleaner job. Do you guys have any other wild ideas?
Update: In fact we are now taking steps to provide more elaborate data quality monitoring features to the community. Go to my blog entry about the plans for DataCleaner 3.0 for more information.


techie said...

I have a doubt: I downloaded DataCleaner and performed profiling as well, but I cannot see the profiling job component which you have mentioned above.

techie said...

Now I can see the Execute DataCleaner job step, but it is failing sometimes with "file doesn't exist" and sometimes with "No such datastore".

Kasper Sørensen said...

Hi techie,

Yea, it's been a while since anyone paid much attention to this plugin, I think. And it's not really backed by the professional edition anymore, so our best chance is maybe to invigorate a bit of community activity here.

The source for the Pentaho plugin is here: https://github.com/datacleaner/pdi-datacleaner
We should start by upgrading to the latest version of both PDI and DataCleaner, I think.

techie said...

Oh, surprising. I like DataCleaner a lot. I am using Pentaho 7.0 and the latest DataCleaner version, 5.1.5. I was able to resolve the "NoSuchDatastore" issue. Now I am facing a data conversion issue. DataCleaner generates an XML file which has datatype VARCHAR for all string fields, whereas the file input field has datatype STRING. So I'm getting an error: expected VARCHAR but got STRING. Very weird.