20120123

Now you can build your own DQ monitoring solution with DataCleaner

In the cover of night we've released a new version of DataCleaner today (version 2.4.2). Officially it's a minor release because for the User Interface very few things have changed, only a few bugfixes and minor enhancements have been introduced. But one potentially major feature have been added in the inner workings of DataCleaner: The ability to persist the results of your DQ analysis jobs. Although this feature still has very limited User Interface support, it has full support in the command line interface, which I would argue is actually sufficient for the purposes of establishing a data quality monitoring solution. Later on I do expect there to be full (and backwards compatible) support in the UI as well.

So what is it, and how does it work?
Well basically it is simply two new parameters to the command line interface:

 -of (--output-file) FILE                          : File in which to save the result of the job
 -ot (--output-type) [TEXT | HTML | SERIALIZED]    : How to represent the result of the job
Here's an example of how to use it. Notice that I use the file extension .analysis.result.dat, which is the one thing that is currently implemented and recognized in the UI as a result file.
> DataCleaner-console.exe -job examples\employees.analysis.xml\
 -ot SERIALIZED\
 -of employees.analysis.result.dat
Now start up DataCleaner's UI, and select "File -> Open analysis job..." - you'll suddenly see that the produced file can be opened:
And when you open the file, the result will be displayed just like a job you've run inside the application:
Since files like this are generally easy to archive and to append eg. timestamps etc., it should be really easy to build a DIY data quality monitoring solution based scheduled jobs and this approach to execution. Or you can get in contact with Human Inference if you want something more sophisticated ;-)
Notice also that there's a HTML output type, which is also quite neat and easy to parse with an XML parser. The SERIALIZED format is more rich though, and includes information needed for more refined, programmatic access to the results. For instance, you might deserialize the whole file using the regular Java serialization API and access it, as an AnalysisResult instance. Thereby you could eg. create a timeline of a particular metric and track changes to the data that you are monitoring. Update: Please read my follow-up blog post about the plans to include a full Data Quality monitoring solution as of DataCleaner 3.0.