20100926

Developing an analyzer using the AnalyzerBeans Java API

Previously I've posted about developing a value transformer using the AnalyzerBeans Java API. Now it's time to see how to develop an analyzer, which is a component that consumes data and turns it into a result that is human-readable and hopefully useful. The Javadocs for the Java API are located here. There are lots of different analyzers in AnalyzerBeans already, and they are worth a look when you decide that you want to develop your own:

  • For typical measures there are analyzers like the Number analyzer and String analyzer. These analyzers calculate standardized measures for these data types.
  • There's the Value distribution analyzer, which is interesting because it uses a backing database (using the @Provided annotation) for counting unique values if the values exceed the amount of free memory.
  • The Date gap analyzer is also a good example because it has named input columns, used for building a timeline of from- and to-dates.
  • There's also the Pattern finder analyzer, which you can read a lot more about in one of my previous blog posts.
So let's begin with a simple example. Say you want to build a very simple analyzer that consumes date or time based values and determines the value distribution based on day-of-week (i.e. how the values are distributed across Monday, Tuesday, Wednesday etc.). While this is a rather naive example of an analyzer, it will work well as just that - an example.
We'll begin with the requirements for building an analyzer:
  • You need to define a class that implements the Analyzer<R> interface. The generic 'R' argument defines the result type of the analyzer. We can reuse a built-in result-type or write our own.
  • The class needs to be annotated with the @AnalyzerBean annotation. This annotation takes one argument: the display name of the analyzer.
  • You need to inject one or more InputColumn<E>'s using the @Configured annotation in order to consume the incoming data. The <E> type-parameter defines the datatype of interest, which is also used to determine which kinds of data types the analyzer supports. In our case we'll use Date as the InputColumn type, because we want our analyzer to consume date values.
So here is our class when it has been created in accordance with the requirements above:
@AnalyzerBean("Average date analyzer")
public class AverageDateAnalyzer implements Analyzer<CrosstabResult> {

      @Configured
      InputColumn<Date>[] dateColumns;

      public void run(InputRow row, int distinctCount) { ... }
      public CrosstabResult getResult() { ... }
}
Notice that we're using the built-in result type CrosstabResult, which represents a result consisting of a dimensional crosstab. We could have used other built-in result types or we could have created our own result-class - the only requirement is that it implements the AnalyzerResult interface.
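To give an idea of what a custom result class could look like, here is a minimal sketch that compiles without AnalyzerBeans on the classpath. Everything in it is hypothetical: AnalyzerResultStandIn is a local stand-in for the real AnalyzerResult interface (which may declare methods you have to implement), and WeekdayCountResult is a made-up result type that simply holds weekday counts.

```java
import java.io.Serializable;
import java.util.Calendar;
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

public class CustomResultSketch {

    // Stand-in for the real AnalyzerResult interface (an assumption of this
    // sketch); the real interface may declare methods that must be implemented.
    public interface AnalyzerResultStandIn extends Serializable {
    }

    // Hypothetical result type: a read-only snapshot of weekday number -> count.
    public static final class WeekdayCountResult implements AnalyzerResultStandIn {
        private static final long serialVersionUID = 1L;
        private final TreeMap<Integer, Integer> counts;

        public WeekdayCountResult(Map<Integer, Integer> counts) {
            // defensive copy, so later changes to the caller's map don't leak in
            this.counts = new TreeMap<Integer, Integer>(counts);
        }

        public Map<Integer, Integer> getCounts() {
            return Collections.unmodifiableMap(counts);
        }
    }

    public static void main(String[] args) {
        Map<Integer, Integer> counts = new TreeMap<Integer, Integer>();
        counts.put(Calendar.MONDAY, 2);
        WeekdayCountResult result = new WeekdayCountResult(counts);
        System.out.println(result.getCounts());
    }
}
```

A dedicated class like this trades the flexibility of a crosstab for a smaller, more specific API, which can be handy if other code needs to read the counts programmatically.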

The rest of the Analyzer should be "plain old Java" but of course using the APIs that are available in AnalyzerBeans. I've explained most of these things before, but I'll go through them again.

So now to consider how to implement the concrete analyzer logic. We'll use a regular map to hold the distribution values. We'll map the weekday numbers to counts in this map. But we'll need to keep a count for each column that we're analyzing, so it's going to be a nested map:
private Map<InputColumn<Date>, Map<Integer, Integer>> distributionMap;
To initialize the map we need to have the InputColumns injected first, so the constructor won't do. Instead we can annotate a method with the @Initialize annotation, which will make AnalyzerBeans invoke the method when the bean has been properly initialized.
@Initialize
public void init() {
  distributionMap = new HashMap<InputColumn<Date>, Map<Integer, Integer>>();
  for (InputColumn<Date> col : dateColumns) {
    Map<Integer, Integer> countMap = new HashMap<Integer, Integer>(7);
    for (int i = Calendar.SUNDAY; i <= Calendar.SATURDAY; i++) {
      // put a count of 0 for each day of the week
      countMap.put(i, 0);
    }
    distributionMap.put(col, countMap);
  }
}
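If you want to try the nested-map layout outside AnalyzerBeans, the following is a minimal sketch that uses plain column names (strings) as stand-ins for the injected InputColumn<Date> objects. The names DistributionMapSketch and newDistributionMap are made up for this example. Pre-filling every weekday with 0 means that a later countMap.get(dayOfWeek) never returns null, so the auto-unboxing to int cannot fail.

```java
import java.util.Calendar;
import java.util.HashMap;
import java.util.Map;

public class DistributionMapSketch {

    // Builds column name -> (weekday number -> count), every count starting at 0.
    // The String keys are stand-ins for InputColumn<Date> (assumption of this sketch).
    public static Map<String, Map<Integer, Integer>> newDistributionMap(String... columnNames) {
        Map<String, Map<Integer, Integer>> distributionMap =
                new HashMap<String, Map<Integer, Integer>>();
        for (String name : columnNames) {
            Map<Integer, Integer> countMap = new HashMap<Integer, Integer>();
            // Calendar.SUNDAY is 1 and Calendar.SATURDAY is 7
            for (int day = Calendar.SUNDAY; day <= Calendar.SATURDAY; day++) {
                countMap.put(day, 0);
            }
            distributionMap.put(name, countMap);
        }
        return distributionMap;
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, Integer>> map =
                newDistributionMap("Order date", "Shipment date");
        System.out.println(map.get("Order date").size()); // prints 7
    }
}
```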
Now that the map has been initialized we can proceed to implement the run(...) method:
@Override
public void run(InputRow row, int distinctCount) {
  for (InputColumn<Date> col : dateColumns) {
    Date value = row.getValue(col);
    if (value != null) {
      Calendar c = Calendar.getInstance();
      c.setTime(value);
      int dayOfWeek = c.get(Calendar.DAY_OF_WEEK);
      Map<Integer, Integer> countMap = distributionMap.get(col);
      int count = countMap.get(dayOfWeek);
      count += distinctCount;
      countMap.put(dayOfWeek, count);
    }
  }
}
This should be pretty much "Java as usual". The only thing that should be new to you if you're an experienced Java developer is the way you extract values from the InputRow using the InputColumns as qualifiers:
Date value = row.getValue(col);
Notice that the value variable has the Date type. The AnalyzerBeans API takes advantage of type-safety to a large extent. Since the injected InputColumns are defined as Date-columns, we can safely assume that the values in the incoming row are also of the Date type. Furthermore, the Date-column type will be used to verify the configuration of AnalyzerBeans jobs and to give early error messages to users who try to configure this particular Analyzer with a non-Date column.
Now on to creating the result. As stated earlier we will use the CrosstabResult for this. The crosstab result is a pretty dynamic result type that can be used for a lot of purposes. Its metaphor is similar to DataCleaner's result matrices but with added features. Here's how we build our crosstab:
@Override
public CrosstabResult getResult() {
  CrosstabDimension columnDimension = new CrosstabDimension("Column");
  CrosstabDimension weekdayDimension = new CrosstabDimension("Weekday");
  weekdayDimension.addCategory("Sunday").addCategory("Monday")
    .addCategory("Tuesday").addCategory("Wednesday").addCategory("Thursday")
    .addCategory("Friday").addCategory("Saturday");

  Crosstab<Integer> crosstab = new Crosstab<Integer>(Integer.class, columnDimension, weekdayDimension);
  for (InputColumn<Date> col : dateColumns) {
    columnDimension.addCategory(col.getName());
    CrosstabNavigator<Integer> nav = crosstab.where(columnDimension, col.getName());
    Map<Integer, Integer> countMap = distributionMap.get(col);
    nav.where(weekdayDimension, "Sunday").put(countMap.get(Calendar.SUNDAY));
    nav.where(weekdayDimension, "Monday").put(countMap.get(Calendar.MONDAY));
    nav.where(weekdayDimension, "Tuesday").put(countMap.get(Calendar.TUESDAY));
    nav.where(weekdayDimension, "Wednesday").put(countMap.get(Calendar.WEDNESDAY));
    nav.where(weekdayDimension, "Thursday").put(countMap.get(Calendar.THURSDAY));
    nav.where(weekdayDimension, "Friday").put(countMap.get(Calendar.FRIDAY));
    nav.where(weekdayDimension, "Saturday").put(countMap.get(Calendar.SATURDAY));
  }
  return new CrosstabResult(getClass(), crosstab);
}
Now we're done. You can take a look at the final result here. When I run this analyzer with a small sample of data in three columns the result looks like this:
             Order date Shipment date Delivery date
Sunday                0             0             0
Monday                2             0             1
Tuesday               0             2             1
Wednesday             0             0             0
Thursday              1             0             0
Friday                1             1             2
Saturday              0             1             0
You can also check out the unit test for this analyzer here.
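The counting logic can also be exercised on its own, without AnalyzerBeans on the classpath. The sketch below re-implements the weekday counting from run(...) for a single column, using a plain int array instead of the nested map. WeekdayCountingSketch is a made-up class, the sample dates are invented and are not the data behind the table above, and distinctCount is assumed here to be the number of identical rows that a single call represents.

```java
import java.util.Arrays;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;

public class WeekdayCountingSketch {

    // index 0 is Sunday, index 6 is Saturday
    private final int[] counts = new int[7];

    // Mirrors the body of run(...) for one column: nulls are skipped and the
    // weekday's count grows by distinctCount (assumed: number of identical rows).
    public void run(Date value, int distinctCount) {
        if (value == null) {
            return;
        }
        Calendar c = Calendar.getInstance();
        c.setTime(value);
        counts[c.get(Calendar.DAY_OF_WEEK) - Calendar.SUNDAY] += distinctCount;
    }

    public int[] getCounts() {
        return counts.clone();
    }

    public static void main(String[] args) {
        WeekdayCountingSketch sketch = new WeekdayCountingSketch();
        // invented sample values, at noon local time
        sketch.run(new GregorianCalendar(2010, Calendar.SEPTEMBER, 26, 12, 0).getTime(), 1); // a Sunday
        sketch.run(new GregorianCalendar(2010, Calendar.SEPTEMBER, 27, 12, 0).getTime(), 2); // a Monday, twice
        sketch.run(new GregorianCalendar(2010, Calendar.OCTOBER, 1, 12, 0).getTime(), 1);    // a Friday
        sketch.run(null, 1); // ignored
        System.out.println(Arrays.toString(sketch.getCounts())); // prints [1, 2, 0, 0, 0, 1, 0]
    }
}
```

Note that the weekday is evaluated in the JVM's default time zone, so a timestamp close to midnight can land on a different weekday on another machine.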
