20100523

Using DataCleaner's API to run jobs as a part of your Java applications

Yesterday someone asked me if there were any examples around of how to set up scheduled DataCleaner jobs in a Java EE environment. While the common case has been to just use the Command-Line Interface (CLI) for DataCleaner together with cron jobs or Windows' scheduled tasks, he had a point - for some organizations this kind of solution would be insufficient. Invocation through code would be better if you already have a lot of Java applications running (e.g. in a Java EE environment).

So here's my response to that request - I'll try to walk you through the process of invoking DataCleaner through its Java API. I'll start out with an example of a profiling job - validation is quite similar, but I'll cover that in another blog-post later. It's my ambition that these walkthroughs will eventually end up in the DataCleaner online docs.

The package dk.eobjects.datacleaner.execution holds the main entry points for setting up and running a DataCleaner job. First you need a DataCleanerExecutor - in this case we want to execute profiling jobs, so we'll use a factory method to set up our executor accordingly:

DataCleanerExecutor<ProfilerJobConfiguration,IProfileResult,IProfile> executor = ProfilerExecutorCallback.createExecutor();
Notice the three type parameters. They dictate that this executor handles ProfilerJobConfigurations, produces IProfileResults and executes using IProfiles.

Now it's time to create some profiling jobs. We do this by adding configuration-objects that describe the tasks at hand (the executor will handle the lifecycle of the actual profilers for us). In this example we'll configure a ValueDistributionProfile:
// for this purpose we never use the "displayName" param for anything, so we just enter "valuedist" or whatever
IProfileDescriptor descriptor = new BasicProfileDescriptor("valuedist", ValueDistributionProfile.class);

ProfilerJobConfiguration jobConfiguration = new ProfilerJobConfiguration(descriptor);

// by convention, all properties are declared as constants with a PROPERTY_ prefix in their profile class
jobConfiguration.addProfileProperty(ValueDistributionProfile.PROPERTY_TOP_N, "5");
jobConfiguration.addProfileProperty(ValueDistributionProfile.PROPERTY_BOTTOM_N, "5");
Also we need to select which columns to profile as a part of our job configuration. DataCleaner uses MetaModel for its datastore connectivity, so we need to retrieve our column definitions using a MetaModel DataContext. I'll exemplify with typical MySQL database connection values, but there are a lot of other options in the DataContextSelection class:
DataContextSelection dcs = new DataContextSelection();
dcs.selectDatabase("jdbc:mysql://localhost/mydb", null, "username", "password", new TableType[] {TableType.VIEW, TableType.TABLE});
DataContext dc = dcs.getDataContext();
Table[] tables = dc.getDefaultSchema().getTables();

// I'll just add all columns from all tables!
List<Column> allColumns = new LinkedList<Column>();
for (Table table : tables) {
    allColumns.addAll(Arrays.asList(table.getColumns()));
}

jobConfiguration.setColumns(allColumns);
Finally, we add our job configuration to the executor:
executor.addJobConfiguration(jobConfiguration);
If we want to, we can add our own observers to receive notifications as the job progresses. For example, in the DataCleaner GUI we use an observer for updating the on-screen progress indicators.
executor.addProgressObserver(...);
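If the observer idea is new to you, here is a minimal, self-contained sketch of the pattern. Note that ProgressListener and FakeJob are hypothetical stand-ins that I made up for illustration - they are not DataCleaner's actual observer interface or executor, so check the javadocs for the real method signatures.

```java
import java.util.ArrayList;
import java.util.List;

public class ProgressObserverSketch {

    // Hypothetical callback interface, NOT DataCleaner's real observer type.
    interface ProgressListener {
        void rowsProcessed(String tableName, long processed, long total);
    }

    // Hypothetical stand-in for an executor that notifies its listeners in batches.
    static class FakeJob {
        private final List<ProgressListener> listeners = new ArrayList<>();

        void addListener(ProgressListener listener) {
            listeners.add(listener);
        }

        void run(String tableName, long total, long batchSize) {
            // Report after every batch; the last batch may be smaller than batchSize.
            for (long done = batchSize; done < total + batchSize; done += batchSize) {
                long processed = Math.min(done, total);
                for (ProgressListener listener : listeners) {
                    listener.rowsProcessed(tableName, processed, total);
                }
            }
        }
    }

    // Registers a listener that records a percentage for each notification.
    static List<String> demo() {
        List<String> log = new ArrayList<>();
        FakeJob job = new FakeJob();
        job.addListener((table, processed, total) ->
                log.add(table + ": " + (100 * processed / total) + "%"));
        job.run("customers", 250, 100);
        return log;
    }

    public static void main(String[] args) {
        for (String line : demo()) {
            System.out.println(line);
        }
    }
}
```

In a server application the listener would typically write to a log or update a job-status record rather than a progress bar.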
Another optional feature is to set the execution options through an ExecutionConfiguration object. As an example we can configure our job to use multithreading by assigning more than one connection and/or by allowing more than one query to execute at a time on each connection (the example below has a max thread count of 2*5 = 10):
ExecutionConfiguration conf = new ExecutionConfiguration();
conf.setMaxQueriesPerConnection(2);
conf.setMaxConnections(5);
executor.setExecutionConfiguration(conf);
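To make the arithmetic explicit, here is a tiny sketch of how the two settings combine into an upper bound on concurrent queries. ThreadBudgetSketch and maxThreads are hypothetical names of my own and not part of the DataCleaner API; the sketch only restates the 2*5 = 10 calculation above.

```java
public class ThreadBudgetSketch {

    // Upper bound on concurrent queries: every connection may run
    // maxQueriesPerConnection queries at the same time.
    static int maxThreads(int maxConnections, int maxQueriesPerConnection) {
        if (maxConnections < 1 || maxQueriesPerConnection < 1) {
            throw new IllegalArgumentException("both settings must be at least 1");
        }
        return Math.multiplyExact(maxConnections, maxQueriesPerConnection);
    }

    public static void main(String[] args) {
        // Same values as the configuration above: 5 connections, 2 queries each.
        System.out.println(maxThreads(5, 2)); // prints 10
    }
}
```

When picking these values, keep in mind that the database has to be able to serve that many simultaneous queries.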
And now it's time to kick off the executor! When we do this we provide our DataContextSelection object, which holds the connection information needed to spawn connections to the datastore.
executor.execute(dcs);
Alternatively you can start the execution asynchronously by calling:
executor.execute(dcs, false);
And now ... you're done. All you have to do now is investigate the results. You retrieve these by calling:
List<IProfileResult> results = executor.getResults();
Consider using one of the result exporters in the DataCleaner API (providing support for CSV, XML and HTML export), or use some custom code to retrieve just the metrics you're interested in by traversing the IProfileResult model.
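Since the original question was about scheduling, here is a rough sketch of how the steps above could be wrapped in a recurring task using plain java.util.concurrent. ScheduledJobSketch, scheduleRepeating and runsAtLeast are hypothetical names of my own, and the Runnable is just a placeholder for the executor set-up and execute(...) call shown earlier. In a Java EE 7 or later container you would probably prefer a container-managed scheduler (such as ManagedScheduledExecutorService) over creating your own threads.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ScheduledJobSketch {

    // Runs the job immediately and then at a fixed rate; returns the scheduler
    // so the caller can shut it down when the application stops.
    static ScheduledExecutorService scheduleRepeating(Runnable job, long period, TimeUnit unit) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        Runnable guarded = () -> {
            try {
                job.run();
            } catch (RuntimeException e) {
                // scheduleAtFixedRate cancels all later runs if one run throws,
                // so failures are reported here instead of being propagated.
                System.err.println("Job run failed: " + e);
            }
        };
        scheduler.scheduleAtFixedRate(guarded, 0, period, unit);
        return scheduler;
    }

    // Demo helper: schedules a placeholder job and waits (up to 5 seconds)
    // until it has run the requested number of times.
    static boolean runsAtLeast(int runs, long periodMillis) {
        CountDownLatch remaining = new CountDownLatch(runs);
        // Placeholder: a real job would build the executor and call execute(...) here.
        ScheduledExecutorService scheduler =
                scheduleRepeating(remaining::countDown, periodMillis, TimeUnit.MILLISECONDS);
        try {
            return remaining.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("Job ran three times: " + runsAtLeast(3, 10));
    }
}
```

For a nightly profiling run you would use a period of one day instead of a few milliseconds, and store or export the results at the end of each run.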

I hope this walkthrough has shed some light on the subject of invoking DataCleaner through its Java API. It's the first time I've sat down and tried to explain this part of the application, so I might have missed some points, but I think the major ideas are present. Let me know what you think - suggestions for improving the API are always welcome.

A couple of notes on the use of DataCleaner's execution API:
  • Notice in the javadocs that almost all the classes covered in this blog-post have a serialize() and a static deserialize(...) method. These are used for saving and loading the configuration to/from XML documents. So if you've already created your jobs using DataCleaner's GUI, you can save these jobs (as .dcp or .dcv files) and restore them using deserialize(...). That might be an easier and quicker path to solving your problems if you're not keen on setting up everything in code.
  • If you want a shortcut for setting up the ProfileDescriptors and ValidationRuleDescriptors, take a look at DataCleaner's bundled XML files: datacleaner-config.xml, datacleaner-profiler-modules.xml and datacleaner-validator-modules.xml. These are Spring Framework based files that are currently used by DataCleaner as a convenient way to serve these descriptors. You should be able to load the objects easily using Spring, and then you'll have the descriptors set up automatically.
