We've just announced a great thing: a cooperation between Pentaho and DataCleaner, which brings DataCleaner's profiling features to all users of Pentaho Data Integration (a.k.a. Kettle)! I've been looking forward to this for a long time, not only because it is great exposure for us, but also because it opens new doors in terms of functionality. In this blog post I'll describe something new: data monitoring with Kettle and DataCleaner.
While DataCleaner is perfectly capable of doing continuous data profiling, we lack the deployment platform that Pentaho has. With Pentaho you get orchestration and scheduling, and you even get a graphical editor for setting them up.
A scenario that I often encounter is that someone wants to execute a daily profiling job, archive the results with a timestamp and have the results emailed to the data steward. Previously we would set this sort of thing up with DataCleaner's command line interface, which is still quite a nice solution, but if you have more than just a few of these jobs, it can quickly become a mess.
So alternatively, I can now just create a Kettle job like this:
Here's what the example does:
- Starts the job (duh!)
- Creates a timestamp that is used for archiving the result. This is done in a separate transformation, using either the "Get System Info" step or the "Formula" step. The result is put into a variable called "today".
- Executes the DataCleaner job. The result filename is set to include the "${today}" variable!
- Emails the results to the data steward.
- If everything went well without errors, the job is successful.
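To make the timestamp step a bit more concrete, here is a minimal sketch in Python of what the "${today}" substitution amounts to. It is not part of Kettle or DataCleaner; the function name `archive_filename`, the date format and the example filename are assumptions chosen purely for illustration. Python's `string.Template` happens to use the same `${name}` placeholder syntax as Kettle variables, which keeps the example close to the job above.

```python
from datetime import date
from string import Template


def archive_filename(template: str, run_date: date) -> str:
    """Expand a ${today} placeholder into a dated result filename.

    Hypothetical helper: it mimics what the Kettle job does when it puts
    a timestamp into the "today" variable and uses it in the result
    filename of the DataCleaner job entry.
    """
    # Assumed format: ISO-style dates sort chronologically in a folder.
    today = run_date.strftime("%Y-%m-%d")
    return Template(template).substitute(today=today)


if __name__ == "__main__":
    # Example filename only; use whatever naming your archive needs.
    print(archive_filename("profiling-${today}.analysis.result.dat", date.today()))
```

Using a year-month-day format means the archived results list in chronological order when sorted by name, which is handy once a daily job has been running for a while.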
4 comments:
I have a question. I downloaded DataCleaner and ran a profiling job with it, but I cannot see the profiling job component that you mention above.
Now I can see the "Execute DataCleaner job" entry, but it fails, sometimes with "file doesn't exist" and sometimes with "No such datastore".
Hi techie,
Yeah, it's been a while since anyone paid much attention to this plugin, I think. And it's not really backed by the professional edition anymore, so our best chance is maybe to invigorate a bit of community activity here.
The source for the Pentaho plugin is here: https://github.com/datacleaner/pdi-datacleaner
We should start by upgrading to the latest version of both PDI and DataCleaner, I think.
Oh, surprising. I like DataCleaner a lot. I am using Pentaho 7.0 and the latest DataCleaner version, 5.1.5. I was able to resolve the "No such datastore" issue. Now I am facing a data conversion issue: the XML file that DataCleaner generates declares the data type VARCHAR for all string fields, whereas the file input fields have the data type STRING. So I get an error saying it expected VARCHAR but got STRING. Very weird.