20120123

Now you can build your own DQ monitoring solution with DataCleaner

Under the cover of night we've released a new version of DataCleaner today (version 2.4.2). Officially it's a minor release, because very little has changed in the User Interface: only a few bugfixes and minor enhancements have been introduced. But one potentially major feature has been added to the inner workings of DataCleaner: the ability to persist the results of your DQ analysis jobs. Although this feature still has very limited User Interface support, it has full support in the command line interface, which I would argue is actually sufficient for the purposes of establishing a data quality monitoring solution. Later on I do expect there to be full (and backwards compatible) support in the UI as well.

So what is it, and how does it work?
Well, basically it is just two new parameters to the command line interface:

 -of (--output-file) FILE                          : File in which to save the result of the job
 -ot (--output-type) [TEXT | HTML | SERIALIZED]    : How to represent the result of the job
Here's an example of how to use it. Notice that I use the file extension .analysis.result.dat, which is the one thing that is currently implemented and recognized in the UI as a result file.
> DataCleaner-console.exe -job examples\employees.analysis.xml\
 -ot SERIALIZED\
 -of employees.analysis.result.dat
Now start up DataCleaner's UI, and select "File -> Open analysis job..." - you'll suddenly see that the produced file can be opened:
And when you open the file, the result will be displayed just like a job you've run inside the application:
Since files like this are generally easy to archive and to tag with e.g. timestamps, it should be really easy to build a DIY data quality monitoring solution based on scheduled jobs and this approach to execution. Or you can get in contact with Human Inference if you want something more sophisticated ;-)
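As a rough illustration of the scheduled-job idea, here is a minimal Java sketch that builds a timestamped result file name and the matching command line. The -job, -ot and -of flags, the executable name and the .analysis.result.dat extension are taken from the example above. The class name, the method names and the timestamp pattern are my own assumptions for illustration, not anything DataCleaner prescribes.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.List;

// Hypothetical helper, not part of DataCleaner.
final class TimestampedRun {

    // Assumed naming scheme: a sortable timestamp placed before the extension.
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyyMMdd-HHmmss");

    // Keeps the .analysis.result.dat suffix so the UI still recognizes the file.
    static String resultFileName(String jobName, LocalDateTime when) {
        return jobName + "." + STAMP.format(when) + ".analysis.result.dat";
    }

    // Assembles the argument list using the flags shown in the post.
    static List<String> command(String executable, String jobFile, String outputFile) {
        return List.of(executable, "-job", jobFile, "-ot", "SERIALIZED", "-of", outputFile);
    }

    public static void main(String[] args) {
        String out = resultFileName("employees", LocalDateTime.now());
        List<String> cmd = command("DataCleaner-console.exe",
                "examples\\employees.analysis.xml", out);
        // Only prints the command; to launch it, something like
        // new ProcessBuilder(cmd).inheritIO().start().waitFor() could be used.
        System.out.println(String.join(" ", cmd));
    }
}
```

Run from a scheduler (cron, Windows Task Scheduler or similar), each execution would then leave behind its own result file, and the file names sort chronologically.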
Notice also that there's an HTML output type, which is also quite neat and easy to parse with an XML parser. The SERIALIZED format is richer though, and includes the information needed for more refined, programmatic access to the results. For instance, you might deserialize the whole file using the regular Java serialization API and access it as an AnalysisResult instance. That way you could e.g. create a timeline of a particular metric and track changes to the data that you are monitoring.

Update: Please read my follow-up blog post about the plans to include a full Data Quality monitoring solution as of DataCleaner 3.0.
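To make the deserialization idea a bit more concrete, here is a minimal sketch using only the standard Java serialization API. It returns a plain Object so that it compiles without DataCleaner on the classpath. Actually reading a real result file requires the DataCleaner jars to be available at runtime, and the cast to AnalysisResult is only hinted at in a comment. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of DataCleaner.
final class ResultReader {

    // Reads the single top-level object stored in a Java-serialized file.
    static Object readResult(Path file) throws IOException, ClassNotFoundException {
        try (InputStream in = Files.newInputStream(file);
             ObjectInputStream objects = new ObjectInputStream(in)) {
            return objects.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("Usage: ResultReader <file.analysis.result.dat>");
            return;
        }
        Object result = readResult(Path.of(args[0]));
        // With DataCleaner on the classpath, the object could presumably be
        // cast here: AnalysisResult analysisResult = (AnalysisResult) result;
        System.out.println("Deserialized a " + result.getClass().getName());
    }
}
```

Applied to a series of archived result files, a loop over readResult would be the starting point for the metric timeline described above. As with any Java deserialization, only feed it files you produced yourself.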

4 comments:

Lonnie said...

Kasper,

I have a simple Google contacts CSV example with a distribution and pattern finder in a job. It works fine interactively. When I run it as a job and open the data file, it looks great until I click on the button for detailed results. The popup is empty. Am I doing something wrong?

Lonnie

Kasper Sørensen said...

Hi Lonnie,

No, you are not doing anything wrong. Sample records (the drill-to-detail information) are currently not persisted in the file. But that's just an arbitrary design decision, to be honest. Do you feel it is important to save them as well?

Kasper

Lonnie said...

Kasper,

For this purpose having the option to persist has value. In this case, I have dozens of tables being imported from external sources--some customers and vendors. I am using DC to produce profiling reports in jobs and the drill down would allow a recipient of the reports to determine if a range condition warrants a response.

Since the serialized report is interactive (or even the HTML for that matter) I can have a cron job toss the results file over the wall to my data stewards. Saves us a lot of manual work.

Lonnie

Kasper Sørensen said...

Hi Lonnie,

More plans have been published to make DataCleaner 3.0 become a full data quality monitoring solution which will also allow you to share your results in a more natural fashion, as HTML documents.

Read my blog entry about it for more information.