In this blog entry I will demonstrate how to use the Java API of DataCleaner to create transformers, i.e. components that transform, convert, tokenize or generate new values based on the existing values of a dataset. You will need Java programming skills to follow this tutorial.
I find that the easiest way to explain this process is to work through an example. So here's my case: I want to transform the birthdates of persons (represented as Date fields) into age fields (represented as number fields). A scenario is depicted below:
After the transformation I will be able to process the age field independently, e.g. with a number analysis, a value distribution, or a business rule that depends on age.
The requirements for building a transformer class are the following:
- The class must implement the Transformer interface.
- The class must be annotated with the (javax.inject) @Named annotation. The annotation takes one argument: the readable name of the transformer. We will therefore annotate the class with: @Named("Date to age")
- In order to read from the incoming fields we need to inject an InputColumn<E> instance (or alternatively an array of these), where <E> is the data type of the incoming fields. To inject we use the @Configured annotation. In our example this translates to: @Configured InputColumn<Date> dateColumn;
    @Named("Date to age")
    public class DateToAgeTransformer implements Transformer {

        @Configured
        InputColumn<Date> dateColumn;

        @Override
        public OutputColumns getOutputColumns() {
            // TODO
            return null;
        }

        @Override
        public Object[] transform(InputRow inputRow) {
            // TODO
            return null;
        }
    }
As we see, there are two methods defined by the Transformer interface that we need to implement. They are:
- getOutputColumns(): This method is called by the framework to determine which virtual columns will be produced by the transformer. In our case it is quite simple: the transformer creates virtual columns for age (both in days and in years, just to make it more flexible). The method body should therefore just be:
return new OutputColumns(Integer.class, "Age in days", "Age in years");
- transform(InputRow): This method will be called for each row with values to be transformed. The return type of the method is an Object array representing the new values of the row. The indexes of the returned array should match the output columns, i.e. index 0 is for "Age in days" and index 1 is for "Age in years". Let's have a look at the method's implementation:
    // Note: 'today' is not declared in this snippet; it is assumed to be a
    // Date field of the transformer that holds the current date.
    Integer[] result = new Integer[2];
    Date date = inputRow.getValue(dateColumn);
    if (date != null) {
        long diffMillis = today.getTime() - date.getTime();
        int diffDays = (int) (diffMillis / (1000 * 60 * 60 * 24));
        result[0] = diffDays;

        // use Joda time to easily calculate the diff in years
        int diffYears = Years.yearsBetween(new DateTime(date), new DateTime(today)).getYears();
        result[1] = diffYears;
    }
    return result;
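If you want to try the age calculation outside of DataCleaner, the following is a minimal standalone sketch. It is not the transformer itself: it swaps Joda-Time for the JDK's java.time classes, and the class and method names (AgeSketch, age, toLocalDate) are hypothetical names chosen for illustration. It keeps the same contract as above: index 0 holds the age in days, index 1 the age in years, and both stay null when the birthdate is missing.

```java
import java.time.LocalDate;
import java.time.ZoneId;
import java.time.temporal.ChronoUnit;
import java.util.Date;

// Standalone sketch: not part of DataCleaner; all names here are hypothetical.
class AgeSketch {

    // Mirrors the two output columns: index 0 = "Age in days", index 1 = "Age in years".
    static Integer[] age(LocalDate birthDate, LocalDate today) {
        Integer[] result = new Integer[2];
        if (birthDate != null) {
            // Calendar-based differences (whole days and whole years)
            result[0] = (int) ChronoUnit.DAYS.between(birthDate, today);
            result[1] = (int) ChronoUnit.YEARS.between(birthDate, today);
        }
        return result;
    }

    // Converts a java.util.Date (the type used by the input column) to a LocalDate.
    static LocalDate toLocalDate(Date date, ZoneId zone) {
        return date.toInstant().atZone(zone).toLocalDate();
    }

    public static void main(String[] args) {
        Integer[] r = age(LocalDate.of(2000, 1, 1), LocalDate.of(2010, 1, 1));
        System.out.println(r[0] + " days, " + r[1] + " years"); // prints "3653 days, 10 years"
    }
}
```

Because this sketch counts calendar days rather than dividing milliseconds by 24 hours, its day count may differ by one from the millisecond-based version around daylight-saving changes.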
There are a few other good examples of transformers that might be of interest:
- The Convert to date transformer which will try to convert any value to a date. This is perhaps useful in combination with the transformer that I've just explained in this tutorial. In other words: These two transformers may need to be chained if for example the birth date to be transformed is stored in a String-based field.
- The Tokenizer transformer, because it has a flexible number of output columns based on the user's configuration. Notice the @Configured Integer numTokens that is used in getOutputColumns() for this purpose.
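To illustrate the idea of a configurable number of output columns, here is a small plain-Java sketch. It does not use DataCleaner types and is not the actual Tokenizer source: the class name, the method names, the "Token N" column naming and the whitespace splitting are all assumptions made for the example. The point is only that the same numTokens value drives both the list of column names and the length of the array returned for each row, so the indexes always line up.

```java
import java.util.Arrays;

// Standalone sketch: class, method and column names are hypothetical, not DataCleaner's.
class TokenizerSketch {

    // Plays the role of getOutputColumns(): one column name per configured token.
    static String[] columnNames(int numTokens) {
        String[] names = new String[numTokens];
        for (int i = 0; i < numTokens; i++) {
            names[i] = "Token " + (i + 1);
        }
        return names;
    }

    // Plays the role of transform(InputRow): always returns exactly numTokens values.
    // Missing tokens are left as null and surplus tokens are dropped, so the array
    // indexes keep matching the column names.
    static String[] tokenize(String value, int numTokens) {
        String[] result = new String[numTokens];
        if (value == null || value.trim().isEmpty()) {
            return result;
        }
        String[] tokens = value.trim().split("\\s+");
        for (int i = 0; i < numTokens && i < tokens.length; i++) {
            result[i] = tokens[i];
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(columnNames(3)));          // [Token 1, Token 2, Token 3]
        System.out.println(Arrays.toString(tokenize("John Doe", 3))); // [John, Doe, null]
    }
}
```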