Using DataCleaner's API to run jobs as a part of your Java applications

Yesterday someone asked me if there were any examples around of how to set up scheduled DataCleaner jobs in a Java EE environment. While the common approach has been to just use DataCleaner's Command-Line Interface (CLI) together with cron jobs or Windows' scheduled tasks, he had a point - for some organizations this kind of solution would be insufficient, and invocation through code would be better if you already have a lot of Java applications running (e.g. in a Java EE environment).

So here's my response to that request - I'll try to walk you through the process of invoking DataCleaner through its Java API. I'll start out with an example of a profiling job - validation is quite similar, but I'll cover that in another blog post later. It's my ambition that these walkthroughs will eventually end up in the DataCleaner online docs.

The package dk.eobjects.datacleaner.execution holds the main entry points for setting up and running a DataCleaner job. First you need a DataCleanerExecutor - in this case we want to execute profiling jobs, so we'll use a factory method to set up our executor accordingly:

DataCleanerExecutor<ProfilerJobConfiguration,IProfileResult,IProfile> executor = ProfilerExecutorCallback.createExecutor();
Notice the three type parameters. They dictate that this executor handles ProfilerJobConfigurations, that it produces IProfileResults and that it executes using IProfile instances.

Now it's time to create some profiling jobs. We do this by adding configuration objects that describe the tasks at hand (the executor will handle the lifecycle of the actual profilers for us). In this example we'll configure a ValueDistributionProfile:
// for this purpose we never use the "displayName" param for anything, so we just enter "valuedist" or whatever
IProfileDescriptor descriptor = new BasicProfileDescriptor("valuedist", ValueDistributionProfile.class);

ProfilerJobConfiguration jobConfiguration = new ProfilerJobConfiguration(descriptor);

// all properties are by convention placed as constants with a PROPERTY_ prefix in their profile class
jobConfiguration.addProfileProperty(ValueDistributionProfile.PROPERTY_TOP_N, "5");
jobConfiguration.addProfileProperty(ValueDistributionProfile.PROPERTY_BOTTOM_N, "5");
We also need to select which columns to profile as part of our job configuration. DataCleaner uses MetaModel for its datastore connectivity, so we need to retrieve our column definitions using a MetaModel DataContext. I'll exemplify with typical MySQL database connection values, but there are a lot of other options in the DataContextSelection class:
DataContextSelection dcs = new DataContextSelection();
dcs.selectDatabase("jdbc:mysql://localhost/mydb", null, "username", "password", new TableType[] {TableType.VIEW, TableType.TABLE});
DataContext dc = dcs.getDataContext();
Table[] tables = dc.getDefaultSchema().getTables();

// I'll just add all columns from all tables!
List<Column> allColumns = new LinkedList<Column>();
for (Table table : tables) {
    allColumns.addAll(Arrays.asList(table.getColumns()));
}
jobConfiguration.setColumns(allColumns);

Finally, we add our job configuration to the executor:
executor.addJobConfiguration(jobConfiguration);
If we want to, we can add our own observers to receive notifications as the job progresses. For example, in the DataCleaner GUI we use an observer to update the on-screen progress indicators.
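The actual observer interface in DataCleaner isn't shown in this post, so here's a self-contained sketch of the pattern it follows - the `ProgressObserver` and `ProgressNotifier` names and their methods are hypothetical, not DataCleaner's API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical observer contract: implementors get told how far a job is.
interface ProgressObserver {
    void notifyProgress(String tableName, int rowsProcessed);
}

// Hypothetical notifier: the executing job fires progress events to all
// registered observers (this is what a GUI progress bar would hook into).
class ProgressNotifier {
    private final List<ProgressObserver> observers = new ArrayList<ProgressObserver>();

    public void addObserver(ProgressObserver observer) {
        observers.add(observer);
    }

    public void fireProgress(String tableName, int rowsProcessed) {
        for (ProgressObserver observer : observers) {
            observer.notifyProgress(tableName, rowsProcessed);
        }
    }
}

public class ObserverSketch {
    public static void main(String[] args) {
        ProgressNotifier notifier = new ProgressNotifier();
        notifier.addObserver(new ProgressObserver() {
            public void notifyProgress(String tableName, int rowsProcessed) {
                System.out.println(tableName + ": " + rowsProcessed + " rows");
            }
        });
        notifier.fireProgress("CUSTOMERS", 500);
    }
}
```

The real executor exposes a similar registration point; consult the javadocs of the execution package for the exact interface.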
Another optional feature is to set execution options through an ExecutionConfiguration object. For example, we can configure our job to use multithreading by assigning more than one connection and/or by allowing more than one query to execute at a time (the example below has a max thread count of 2*5 = 10):
ExecutionConfiguration conf = new ExecutionConfiguration();
conf.setMaxConnections(2);
conf.setMaxQueriesPerConnection(5);
executor.setExecutionConfiguration(conf);
And now it's time to kick off the executor! When we do this we provide our DataContextSelection object, which holds the connection information needed to spawn connections to the datastore:
executor.execute(dcs);
Alternatively you can start the execution asynchronously by calling:
executor.execute(dcs, false);
And now ... you're done. All that's left is to investigate the results, which you retrieve by calling:
List<IProfileResult> results = executor.getResults();
Consider using one of the result exporters in the DataCleaner API (providing support for CSV, XML and HTML export), or use some custom code to retrieve just the metrics of interest by traversing the IProfileResult model.

I hope this walkthrough has shed some light on the subject of invoking DataCleaner through its Java API. It's the first time I've sat down and tried to explain this part of the application, so I might have missed some points, but I think the major ideas are present. Let me know what you think - and suggestions for improving the API are always welcome.

A couple of notes on the use of DataCleaner's execution API:
  • Notice in the javadocs that almost all the classes covered in this blog post have a serialize() and a static deserialize(...) method. These are used for saving and loading the configuration to/from XML documents. So if you've already created your jobs using DataCleaner's GUI, you can save these jobs (as .dcp or .dcv files) and restore them using deserialize(...). That might be an easier and quicker path to solving your problem if you're not keen on setting everything up in code.
  • If you want a shortcut for setting up the ProfileDescriptors and ValidationRuleDescriptors, take a look at DataCleaner's bundled XML files: datacleaner-config.xml, datacleaner-profiler-modules.xml and datacleaner-validator-modules.xml. These are Spring Framework-based files that are currently used by DataCleaner as a convenient way to serve these descriptors. You should be able to load the objects easily using Spring, and then you'll have the descriptors set up automatically.
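To picture the serialize()/deserialize() round trip from the first note without pulling in DataCleaner itself, here is a sketch using the JDK's own bean-to-XML tools - the `JobConfig` bean and its fields are made up for illustration, and the real API uses its own XML format, not java.beans:

```java
import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

public class XmlRoundTrip {

    // A hypothetical stand-in for a job configuration bean.
    public static class JobConfig {
        private String profileName;
        private int topN;

        public String getProfileName() { return profileName; }
        public void setProfileName(String profileName) { this.profileName = profileName; }
        public int getTopN() { return topN; }
        public void setTopN(int topN) { this.topN = topN; }
    }

    // Serialize the configuration to an XML document in memory.
    public static byte[] toXml(JobConfig config) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        XMLEncoder encoder = new XMLEncoder(out);
        encoder.writeObject(config);
        encoder.close();
        return out.toByteArray();
    }

    // Restore the configuration from the XML document.
    public static JobConfig fromXml(byte[] xml) {
        XMLDecoder decoder = new XMLDecoder(new ByteArrayInputStream(xml));
        JobConfig config = (JobConfig) decoder.readObject();
        decoder.close();
        return config;
    }

    public static void main(String[] args) {
        JobConfig config = new JobConfig();
        config.setProfileName("valuedist");
        config.setTopN(5);
        JobConfig restored = fromXml(toXml(config));
        System.out.println(restored.getProfileName() + "/" + restored.getTopN());
    }
}
```

The point is simply that a job definition is plain data that survives a trip through XML - which is why reusing .dcp/.dcv files saved from the GUI is often the path of least resistance.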


Watch out for manual flushing in JBoss Seam

I've done quite a lot of development in JBoss Seam over the last six months and overall I'm quite enthusiastic. I'm also looking forward to using some of the features of Seam in their new Java EE 6 incarnations (in short: @Inject instead of @In, @Produces instead of @Factory, @Unwrap and @Out, and @ConversationScoped instead of @Scope(CONVERSATION) ;-)).

One key feature of Seam is its persistence strategy, and at first glance it's quite a cool thing. The idea is to use an extended persistence context, which means that your entities are kept managed across transactions. The extended persistence context is very important, as Seam wraps each request in transactions, and all changes to entities caused by actions in the request will then be automatically propagated to the database. The extended persistence context saves you from having to call merge(...) to reattach your entities all the time. Calling merge(...) is a heavy operation, so this is good.

This pattern makes a lot of sense right up until the point where you want to let the user edit an entity across a few requests but then discard the changes (because the user changes his/her mind). To make this use case possible the Seam guys advocate "MANUAL flushing", which means that Hibernate won't flush updates to the database unless you programmatically tell it to. Seems smart - here's the idea: Hibernate will keep track of all changes made in transactions (requests) but won't flush them. At some point the user will typically hit a "Save changes" button and then everything will be flushed.

Apart from the fact that MANUAL flushing is a Hibernate-specific feature not available with other JPA persistence providers, this pattern has three very serious flaws:

  1. Any query fired will cause an implicit flush - even if the flush mode is MANUAL. This means that if your conversation involves a query, your changes will be flushed even though you haven't invoked the flush method yourself. This almost certainly rules out MANUAL flushing in just about any conversation I can imagine (especially if you want to enable navigation by nested conversations). Queries are a good example of something that used to be a side-effect-free function but is now something that can impose a lot of unintended changes in state.
    NOTE: I stand (a bit) corrected here - I was advised that this behaviour can be avoided by setting the flush mode of the query to COMMIT, and it seems to work.
  2. While we're on the topic - if you want to enable nested conversations, you will have to write a lot of plumbing code to make sure that the nested parts don't invoke the flush method and end up flushing on behalf of the parent conversation as well. It IS possible to code your way around this flaw, but it's a serious impediment to composing your application out of reusable nested conversations.
  3. The Seam guys seem to have failed to realize that transactions are used for other purposes than saving entities. For example, if you're using JMS, you would send messages at commit time, which means that developers of the JMS dispatch code will assume that if a commit takes place, data has been persisted. If a message contains, for example, the IDs of updated entities, the message handler will access these entities before any updates have taken place - because the updates haven't been flushed!
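To make the first flaw concrete, here's a toy model of the behaviour - none of this is Hibernate or Seam API, just a self-contained simulation of a session that queues changes and, like Hibernate's default flush mode, flushes them implicitly before a query unless the query runs in COMMIT mode:

```java
import java.util.ArrayList;
import java.util.List;

public class FlushDemo {

    enum FlushMode { AUTO, COMMIT }

    // A toy session: pending changes are queued and only written on flush().
    static class ToySession {
        final List<String> pendingChanges = new ArrayList<String>();
        final List<String> database = new ArrayList<String>();

        void update(String change) { pendingChanges.add(change); }

        void flush() {
            database.addAll(pendingChanges);
            pendingChanges.clear();
        }

        // In AUTO mode a query flushes first so it sees up-to-date data;
        // COMMIT mode skips the implicit flush and leaves changes pending.
        List<String> query(FlushMode mode) {
            if (mode == FlushMode.AUTO) {
                flush();
            }
            return new ArrayList<String>(database);
        }
    }

    public static void main(String[] args) {
        ToySession session = new ToySession();
        session.update("rename customer 42");
        // An AUTO-mode query silently pushes the un-flushed change through:
        System.out.println(session.query(FlushMode.AUTO).size());   // 1

        ToySession session2 = new ToySession();
        session2.update("rename customer 42");
        // A COMMIT-mode query leaves the change pending, as intended:
        System.out.println(session2.query(FlushMode.COMMIT).size()); // 0
    }
}
```

The first query is exactly the "side effect" described above: nobody called flush(), yet the change hit the database.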
I think these flaws make it utterly hard to develop applications using MANUAL flushing because of the constraints it imposes on the flow of your application. In this light, I'm quite pleased that they didn't include manual flushing in Java EE 6 (or rather, JPA 2).