20100907

Developing a value transformer using the DataCleaner Java API

In this blog entry I will demonstrate how to use the DataCleaner Java API to create transformers, i.e. components for transforming, converting, tokenizing or generating new values based on the existing values of a dataset. You will need Java programming skills to follow this tutorial.
I find that the easiest way to explain this process is to walk through an example. So here's my case: I want to transform the birthdates of persons (represented as Date fields) into age fields (represented as number fields). A scenario is depicted below:

After the transformation I will be able to process the age field independently, e.g. with a number analysis, a value distribution or a business rule that depends on age.
The requirements for building a transformer class are the following:
  • The class must implement the Transformer interface.
  • The class must be annotated with the (javax.inject) @Named annotation. The annotation takes an argument: the readable name of the transformer. We will therefore annotate: @Named("Date to age")
  • In order to read from the incoming fields we need to inject an InputColumn<E> instance (or alternatively an array of these), where <E> is the data type of the incoming fields. To inject we use the @Configured annotation. In our example this translates to: @Configured InputColumn<Date> dateColumn;
After these steps our code will look something like this:
@Named("Date to age")
public class DateToAgeTransformer implements Transformer {

  @Configured
  InputColumn<Date> dateColumn;

  @Override
  public OutputColumns getOutputColumns() {
    // TODO
    return null;
  }

  @Override
  public Object[] transform(InputRow inputRow) {
    // TODO
    return null;
  }
}
As we see, the Transformer interface defines two methods that we need to implement. They are:
  • getOutputColumns(): This method is called by the framework to determine which virtual columns will be produced by the transformer. In our case it is quite simple: The transformer creates virtual columns for age (both in days and in years, just to make it more flexible). The method body should therefore just be:
    return new OutputColumns(Integer.class, "Age in days", "Age in years");
  • transform(InputRow): This method will be called for each row with values to be transformed. The return type of the method is an Object array representing the new values of the row. The indexes of the returned array should match the output columns, i.e. index 0 is for "Age in days" and index 1 is for "Age in years". Let's have a look at the method's implementation:
    // Note: 'today' is not declared in this snippet. It is assumed to be a
    // field of the transformer holding the current date, e.g. new Date().
    Integer[] result = new Integer[2];
    Date date = inputRow.getValue(dateColumn);
    
    // rows without a birthdate keep null in both output columns
    if (date != null) {
      long diffMillis = today.getTime() - date.getTime();
      int diffDays = (int) (diffMillis / (1000 * 60 * 60 * 24));
    
      result[0] = diffDays;
    
      // use Joda time to easily calculate the diff in years
      int diffYears = Years.yearsBetween(new DateTime(date), new DateTime(today)).getYears();
      result[1] = diffYears;
    }
    
    return result;
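If you want to experiment with the age calculation on its own, outside DataCleaner, here is a minimal sketch. Note that it is not the transformer's actual code: it swaps Joda time for the JDK's java.time classes, and the class name AgeSketch, the method age and the explicit zone parameter are my own assumptions for illustration. It keeps the same contract as above: index 0 is the age in days, index 1 is the age in years, and a missing date yields nulls.

```java
import java.time.LocalDate;
import java.time.ZoneId;
import java.time.temporal.ChronoUnit;
import java.util.Date;

// Hypothetical stand-alone helper, not part of the DataCleaner API.
class AgeSketch {

    // Returns {ageInDays, ageInYears}, or {null, null} when birthDate is null.
    static Integer[] age(Date birthDate, Date today, ZoneId zone) {
        Integer[] result = new Integer[2];
        if (birthDate == null) {
            return result;
        }
        // Convert both dates to calendar dates in the given time zone.
        LocalDate from = birthDate.toInstant().atZone(zone).toLocalDate();
        LocalDate to = today.toInstant().atZone(zone).toLocalDate();
        result[0] = (int) ChronoUnit.DAYS.between(from, to);
        result[1] = (int) ChronoUnit.YEARS.between(from, to);
        return result;
    }

    public static void main(String[] args) {
        ZoneId utc = ZoneId.of("UTC");
        Date birth = Date.from(LocalDate.of(1980, 9, 7).atStartOfDay(utc).toInstant());
        Date today = Date.from(LocalDate.of(2010, 9, 7).atStartOfDay(utc).toInstant());
        Integer[] result = age(birth, today, utc);
        // prints: 10957 days, 30 years
        System.out.println(result[0] + " days, " + result[1] + " years");
    }
}
```

Passing today in as a parameter, rather than reading the clock inside the method, is what makes a calculation like this easy to unit test with fixed dates.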
Of course I didn't do all the work of writing this tutorial without checking in the code, so you can try it in action. The code for the "Date to age" transformer is available here, and there's also a unit test available here that works well as a demonstration of how to unit test transformers. I hope some of you will try developing transformers and let me know how it turns out. In my next blog post I'll explain how to build Analyzers, which are the obvious next step when developing components for DataCleaner.
There are a few other good examples of transformers that might be of interest:
  • The Convert to date transformer, which will try to convert any value to a date. This is useful in combination with the transformer that I've just explained in this tutorial: the two transformers may need to be chained if, for example, the birth date to be transformed is stored in a String-based field.
  • The Tokenizer transformer, because it has a variable number of output columns based on the user's configuration. Notice the @Configured Integer numTokens that is used in getOutputColumns() for this purpose.
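To illustrate the idea of a configurable number of output columns, here is a rough stand-alone sketch. It is not the Tokenizer's actual source and it does not use the DataCleaner API: the class TokenizerSketch, its methods and the whitespace-splitting rule are assumptions of mine. The point is only that the configured numTokens drives both the column names and the size of the array returned for each row.

```java
import java.util.Arrays;

// Hypothetical illustration, not the DataCleaner Tokenizer transformer.
class TokenizerSketch {

    // Stand-in for the configured numTokens property mentioned above.
    private final int numTokens;

    TokenizerSketch(int numTokens) {
        this.numTokens = numTokens;
    }

    // One column name per token, so the column count follows the configuration.
    String[] outputColumnNames() {
        String[] names = new String[numTokens];
        for (int i = 0; i < numTokens; i++) {
            names[i] = "Token " + (i + 1);
        }
        return names;
    }

    // Always returns exactly numTokens values: missing tokens stay null,
    // surplus tokens are dropped.
    String[] transform(String value) {
        String[] result = new String[numTokens];
        if (value == null || value.trim().isEmpty()) {
            return result;
        }
        String[] tokens = value.trim().split("\\s+");
        System.arraycopy(tokens, 0, result, 0, Math.min(tokens.length, numTokens));
        return result;
    }

    public static void main(String[] args) {
        TokenizerSketch tokenizer = new TokenizerSketch(3);
        // prints: [Token 1, Token 2, Token 3]
        System.out.println(Arrays.toString(tokenizer.outputColumnNames()));
        // prints: [John, Doe, null]
        System.out.println(Arrays.toString(tokenizer.transform("John Doe")));
    }
}
```

Keeping the returned array the same length as the list of column names is the same index-matching rule that the "Date to age" transformer follows with its two fixed columns.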

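The chaining of a "convert to date" step and a "date to age" step for String-based birth dates can be sketched in plain Java as well. Again, this is only an illustration under my own assumptions: ChainSketch, toDate and toAgeInYears are hypothetical names, the input is assumed to be in ISO format (yyyy-MM-dd), and java.time is used instead of the real transformers.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.temporal.ChronoUnit;
import java.util.function.Function;

// Hypothetical illustration of chaining two conversion steps.
class ChainSketch {

    // Step 1: stand-in for a "convert to date" step.
    // Returns null when the text cannot be read as an ISO date.
    static LocalDate toDate(String text) {
        if (text == null) {
            return null;
        }
        try {
            return LocalDate.parse(text.trim(), DateTimeFormatter.ISO_LOCAL_DATE);
        } catch (DateTimeParseException e) {
            return null;
        }
    }

    // Step 2: stand-in for the "date to age" step (years only).
    static Integer toAgeInYears(LocalDate birthDate, LocalDate today) {
        if (birthDate == null) {
            return null;
        }
        return (int) ChronoUnit.YEARS.between(birthDate, today);
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(2010, 9, 7);
        // The output of step 1 becomes the input of step 2.
        Function<String, Integer> chain = text -> toAgeInYears(toDate(text), today);
        System.out.println(chain.apply("1980-09-07")); // prints: 30
        System.out.println(chain.apply("not a date")); // prints: null
    }
}
```

Values that cannot be converted simply flow through the chain as null, which mirrors the null check in the transform method earlier in this post.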