20120409

Implementing a custom datastore in DataCleaner

A question I am often asked by super-users, partners and developers of DataCleaner is: How do you build a custom datastore in DataCleaner for my system/file-format XYZ? Recently I've dealt with this for the use in the upcoming integration with Pentaho Kettle, for a Human Inference customer who had a home grown database proxy system, and just today while it was asked on the DataCleaner forum. In this blog post I will guide you through this process, which requires some basic Java programming skills, but if that's in place it isn't terribly complicated.

Just gimme the code ...

First of all I should say (to those of you who prefer "just the code") that there is already an example of how to do this in the sample extension for DataCleaner. Take a look at the org.eobjects.datacleaner.sample.SampleDatastore class. Once you've read, understood and compiled the Java code, all you need to do is register the datastore in DataCleaner's conf.xml file like this (within the <datastore-catalog> element):

<custom-datastore class-name="org.eobjects.datacleaner.sample.SampleDatastore">
  <property name="Name" value="My datastore" />
</custom-datastore>

A bit more explanation please!

OK, so if you wanna really know how it works, here goes...

First of all, a datastore in DataCleaner needs to implement the Datastore interface. But instead of implementing the interface directly, I would suggest using the abstract implementation called the UsageAwareDatastore. This abstract implementation handles concurrent access to the datastore, reusing existing connections and more. What you still need to provide when extending the UsageAwareDatastore class is primarily the createDatastoreConnection() method which is invoked when a (new) connection is requested. Let's see how an initial new Datastore implementation will look like:

public class ExampleDatastore extends UsageAwareDatastore<DataContext> {

 private static final long serialVersionUID = 1L;
 
 public ExampleDatastore() {
  super("My datastore");
 }

 @Override
 protected UsageAwareDatastoreConnection createDatastoreConnection() {
  // TODO Auto-generated method stub
  return null;
 }

 @Override
 public PerformanceCharacteristics getPerformanceCharacteristics() {
  // TODO Auto-generated method stub
  return null;
 }
}

Notice that I have created a no-arg constructor. This is REQUIRED for custom datastores, since the datastore will be instantiated by DataCleaner. Later we will focus on how to make the name ("My datastore") adjustable.

First we want to have a look at the two unimplemented methods:

  • createDatastoreConnection() is used to create a new connection. DataCleaner builds upon the MetaModel framework for data access. You will need to return a new DatastoreConnectionImpl(...). This class takes an important parameter, namely your MetaModel DataContext implementation. Often times there will already be a DataContext that you can use given some configuration, eg. a JdbcDataContext, a CsvDataContext, ExcelDataContext, MongoDbDatacontext or whatever.
  • getPerformanceCharacteristics() is used by DataCleaner to figure out the query plan when executing a job. You will typically just return a new PerformanceCharacteristics(false);. Read the javadoc for more information :)

Parameterizable properties, please

By now you should be able to implement a custom datastore, which hopefully covers your basic needs. But maybe you want to reuse the datastore class with eg. different files, different hostnames etc. In other words: Maybe you want to let your user define certain properties of the datastore.

To your rescue is the @Configured annotation, which is an annotation widely used in DataCleaner. It allows you to annotate fields in your class which should be configured by the user. The types of the fields can be Strings, Integers, Files etc., you name it. Let's see how you would expose the properties of a typical connection:

public class ExampleDatastore extends UsageAwareDatastore<DataContext> {
 // ...

 @Configured
 String datastoreName;

 @Configured
 String hostname;

 @Configured
 Integer port;

 @Configured
 String systemId;

 // ...
}
And how you would typically use them to implement methods:
public class ExampleDatastore extends UsageAwareDatastore<DataContext> {
 // ...

 @Override
 public String getName() {
  return datastoreName;
 }

 @Override
 protected UsageAwareDatastoreConnection createDatastoreConnection() {
  DataContext dataContext = createDataContext(hostname, port, systemId);
  return new DatastoreConnectionImpl(dataContext, this);
 }
}

If I wanted to configure a datastore using the parameters above, I could enter it in my conf.xml file like this:

<custom-datastore class-name="foo.bar.ExampleDatastore">
  <property name="Datastore name" value="My datastore" />
  <property name="Hostname" value="localhost" />
  <property name="Port" value="1234" />
  <property name="System id" value="foobar" />
</custom-datastore>

Notice that the names of the properties are inferred by reversing the camelCase notation which Java uses, so that "datastoreName" becomes "Datastore name" and so on. Alternatively you can provide an explicit name in the @Configured annotation.

I hope this introduction tutorial makes some sense for you. Once again I urge you to take a look at the Sample DataCleaner extension, which also includes a build setup (Maven based), and a custom MetaModel DataContext implementation.

No comments: