20100926

Developing an analyzer using the AnalyzerBeans Java API

Previously I've posted about developing a value transformer using the AnalyzerBeans Java API. Now it's time to see how to develop an analyzer, which is a component that consumes data and turns it into a result that is human-readable and hopefully useful. The Javadocs for the Java API are located here. There are lots of different analyzers in AnalyzerBeans already, which may be worth a look when you decide to develop your own:

  • For typical measures there are analyzers like the Number analyzer and String analyzer. These analyzers calculate standardized measures for these data types.
  • There's the Value distribution analyzer, which is interesting because it uses a backing database (via the @Provided annotation) for counting unique values when they exceed the amount of free memory.
  • The Date gap analyzer is also a good example because it has named input columns, used for building a timeline of from- and to-dates.
  • The Pattern finder analyzer, which you can read a lot more about in one of my previous blog posts.
So let's begin with a simple example. Say you want to build a very simple analyzer that consumes date or time based values and determines the value distribution based on day-of-week (i.e. how the values are distributed across Monday, Tuesday, Wednesday and so on). While this is a rather naive example of an analyzer, it will work well as just that - an example.
We'll begin with the requirements for building an analyzer:
  • You need to define a class that implements the Analyzer<R> interface. The generic 'R' argument defines the result type of the analyzer. We can reuse a built-in result-type or write our own.
  • The class needs to be annotated with the @AnalyzerBean annotation. This annotation takes an argument: The display name of the analyzer.
  • You need to inject one or more InputColumn<E>'s using the @Configured annotation in order to consume the incoming data. The <E> type parameter defines the data type of interest, which is also used to determine which kinds of columns the analyzer supports. In our case we'll use Date as the InputColumn type (and inject an array of such columns), because we want our analyzer to consume date values.
So here is our class when it has been created in accordance with the requirements above:
@AnalyzerBean("Average date analyzer")
public class AverageDateAnalyzer implements Analyzer<CrosstabResult> {

      @Configured
      InputColumn<Date>[] dateColumns;

      public void run(InputRow row, int distinctCount) { ... }
      public CrosstabResult getResult() { ... }
}
Notice that we're using the built-in result type CrosstabResult, which represents a result consisting of a dimensional crosstab. We could have used other built-in result types or we could have created our own result-class - the only requirement is that it implements the AnalyzerResult interface.
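For illustration, here's a minimal sketch of what such a custom result class could look like. The class below is hypothetical (it is not part of AnalyzerBeans) and it assumes that AnalyzerResult is essentially a marker interface for serializable value objects:
// Hypothetical custom result type - a sketch only, not an actual AnalyzerBeans class.
public class DayOfWeekDistributionResult implements AnalyzerResult {

  // serialVersionUID assumes AnalyzerResult extends Serializable
  private static final long serialVersionUID = 1L;

  // the same nested map structure that the analyzer builds up while running
  private final Map<InputColumn<Date>, Map<Integer, Integer>> distributionMap;

  public DayOfWeekDistributionResult(Map<InputColumn<Date>, Map<Integer, Integer>> distributionMap) {
    this.distributionMap = distributionMap;
  }

  public Map<InputColumn<Date>, Map<Integer, Integer>> getDistributionMap() {
    return distributionMap;
  }
}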

The rest of the Analyzer should be "plain old Java", but of course using the APIs that are available in AnalyzerBeans. I've explained most of these things before, but I'll go through it again.

Now let's consider how to implement the concrete analyzer logic. We'll use a regular map to hold the distribution values, mapping weekday numbers to counts. But we'll need to keep a count for each column that we're analyzing, so it's going to be a nested map:
private Map<InputColumn<Date>, Map<Integer, Integer>> distributionMap;
To initialize the map we need to have the InputColumns injected first, so the constructor won't do. Instead we can annotate a method with the @Initialize annotation, which will make AnalyzerBeans invoke the method once the bean's configured properties have been injected.
@Initialize
public void init() {
  distributionMap = new HashMap<InputColumn<Date>, Map<Integer, Integer>>();
  for (InputColumn<Date> col : dateColumns) {
    Map<Integer, Integer> countMap = new HashMap<Integer, Integer>(7);
    for (int i = Calendar.SUNDAY; i <= Calendar.SATURDAY; i++) {
      // put a count of 0 for each day of the week
      countMap.put(i, 0);
    }
    distributionMap.put(col, countMap);
  }
}
Now that the map has been initialized we can proceed to implement the run(...) method:
@Override
public void run(InputRow row, int distinctCount) {
  for (InputColumn<Date> col : dateColumns) {
    Date value = row.getValue(col);
    if (value != null) {
      Calendar c = Calendar.getInstance();
      c.setTime(value);
      int dayOfWeek = c.get(Calendar.DAY_OF_WEEK);
      Map<Integer, Integer> countMap = distributionMap.get(col);
      int count = countMap.get(dayOfWeek);
      count += distinctCount;
      countMap.put(dayOfWeek, count);
    }
  }
}
This should be pretty much "Java as usual". The only thing that should be new to you if you're an experienced Java developer is the way you extract values from the InputRow using the InputColumns as qualifiers:
Date value = row.getValue(col);
Notice that the value variable has the Date type. The AnalyzerBeans API takes advantage of type-safety to a large extent. Since the injected InputColumns are defined as Date columns, we can safely assume that the values in the incoming row are also of the Date type. Furthermore, the Date columns will be used to verify the configuration of AnalyzerBeans jobs and give early error messages to the user if they try to configure this particular analyzer with a non-Date column. Now on to creating the result. As stated earlier we will use the CrosstabResult for this. The crosstab result is a pretty dynamic result type that can be used for a lot of purposes. Its metaphor is similar to DataCleaner's result matrices, but with added features. Here's how we build our crosstab:
@Override
public CrosstabResult getResult() {
  CrosstabDimension columnDimension = new CrosstabDimension("Column");
  CrosstabDimension weekdayDimension = new CrosstabDimension("Weekday");
  weekdayDimension.addCategory("Sunday").addCategory("Monday")
    .addCategory("Tuesday").addCategory("Wednesday").addCategory("Thursday")
    .addCategory("Friday").addCategory("Saturday");

  Crosstab crosstab = new Crosstab(Integer.class, columnDimension, weekdayDimension);
  for (InputColumn col : dateColumns) {
    columnDimension.addCategory(col.getName());
    CrosstabNavigator nav = crosstab.where(columnDimension, col.getName());
    Map countMap = distributionMap.get(col);
    nav.where(weekdayDimension, "Sunday").put(countMap.get(Calendar.SUNDAY));
    nav.where(weekdayDimension, "Monday").put(countMap.get(Calendar.MONDAY));
    nav.where(weekdayDimension, "Tuesday").put(countMap.get(Calendar.TUESDAY));
    nav.where(weekdayDimension, "Wednesday").put(countMap.get(Calendar.WEDNESDAY));
    nav.where(weekdayDimension, "Thursday").put(countMap.get(Calendar.THURSDAY));
    nav.where(weekdayDimension, "Friday").put(countMap.get(Calendar.FRIDAY));
    nav.where(weekdayDimension, "Saturday").put(countMap.get(Calendar.SATURDAY));
  }
  return new CrosstabResult(getClass(), crosstab);
}
Now we're done. You can take a look at the final result here. When I run this analyzer with a small sample of data in three columns the result looks like this:
             Order date Shipment date Delivery date
Sunday                0             0             0
Monday                2             0             1
Tuesday               0             2             1
Wednesday             0             0             0
Thursday              1             0             0
Friday                1             1             2
Saturday              0             1             0
You can also check out the unit test for this analyzer here.
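If you want to exercise the analyzer without setting up a whole job, a test can drive it directly. Below is a rough sketch of what such a test could look like; it assumes the MockInputColumn and MockInputRow helper classes, so consult the actual unit test linked above for the exact classes used:
// A rough test sketch - MockInputColumn and MockInputRow are assumed helpers,
// see the linked unit test for the real setup.
public class AverageDateAnalyzerTest extends TestCase {

  public void testSimpleScenario() throws Exception {
    InputColumn<Date> col = new MockInputColumn<Date>("Order date", Date.class);

    AverageDateAnalyzer analyzer = new AverageDateAnalyzer();
    // the @Configured field is package-private, so a test in the same package can set it directly
    analyzer.dateColumns = new InputColumn[] { col };
    analyzer.init();

    // feed a couple of rows into the analyzer (a distinct count of 1 per row)
    analyzer.run(new MockInputRow().put(col, new GregorianCalendar(2010, Calendar.SEPTEMBER, 20).getTime()), 1);
    analyzer.run(new MockInputRow().put(col, new GregorianCalendar(2010, Calendar.SEPTEMBER, 21).getTime()), 1);

    CrosstabResult result = analyzer.getResult();
    assertNotNull(result);

    // print the crosstab for manual inspection
    System.out.println(result);
  }
}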

20100913

Pattern finder 2.0 - the latest feature in DataCleaner



I'm happy to be able to present a feature in this blog post that I know a lot of you have been asking for: A new and improved "Pattern finder" (as known in DataCleaner).

The new Pattern finder works similarly to the old one. The new thing is that it supports a wide variety of configuration options (and it has been designed so that it will be significantly easier to add more options, if needed). Here are the currently available options:

  • Discriminate text case (default: true): Sets whether or not text tokens that are upper case and lower case should be treated as different types of tokens.
  • Discriminate negative numbers (default: false): Sets whether or not negative numbers should be treated as different token types than positive numbers.
  • Discriminate decimals (default: true): Sets whether or not decimal numbers should be treated as different token types than integers.
  • Enable mixed tokens (default: true): Enables the "mixed" token type (denoted as '?' in the output). This type of token will occur when numbers and letters occur without separation by whitespace.
  • Ignore repeated spaces (default: false): Sets whether or not repeated whitespaces should be ignored (i.e. matched as single whitespaces).
  • Decimal separator (default: ','*): The separator used to identify decimal numbers.
  • Thousands separator (default: '.'*): The character used as a thousands separator in large numbers.
  • Minus sign (default: '-'*): The character used to denote negative numbers.
  • Predefined token name (default: none): Can be used to define an anticipated "predefined token" that should be replaced before any subsequent pattern recognition. Requires that the "Predefined token regexes" property is also set. An example of a name could be "Titulation".
  • Predefined token regexes (default: none): Defines a set of regular expressions for the "predefined token". Requires that the "Predefined token name" property is also set. An example value for these regular expressions could be "[Mr,Mrs,Miss,Mister]" (which would correspond to the "Titulation" name).

* = Depends on locale; the value shown is the typical one.

This may all seem complicated, but rest assured that the default values are reasonable and almost exactly resemble what you would expect from the Pattern finder in DataCleaner (except for the "Discriminate text case" property, which is inherently turned off in DataCleaner).

Here's how it works with a set of different inputs (job title, email, name) and configurations:
> java -jar target/AnalyzerBeans.jar -conf examples/conf.xml -job examples/patternfinder_job.xml 

RESULT:
                            Match count Sample      
Aaaaa Aaa                            17 Sales Rep   
AA Aaaaaaaaa                          2 VP Sales    
Aaaaa Aaaaaaa (AAAA)                  2 Sale Manager (EMEA) 
Aaaaa Aaaaaaa (AAAAA, AAAA)           1 Sales Manager (JAPAN, APAC) 
Aaaaaaaaa                             1 President   


RESULT:
                                  Match count Sample      
aaaaaa.aaa@aaaaaaa[Domain suffix]           4 foo.bar@company.com 
aaaaaaa@aaaaaaaa[Domain suffix]             3 santa@claus.com 


RESULT:
                 Match count Sample      
aaaaaaa aaaaa              4 Jane Doe    
aaaaaaaa, aaaaaa           2 Bar, Foo    
aaa. aaaaaa aaa            1 Mrs. Foobar Foo
Notice that in the email example the two patterns end with "[Domain suffix]". This is because I've registered a corresponding "Predefined token" for this:
<analyzer>
  <descriptor ref="Pattern finder" />
  <properties>
    <property name="Predefined token name" value="Domain suffix" />
    <property name="Predefined token regexes" value="[\.com,\.org]" />
  </properties>
  <input ref="col_email" />
</analyzer>
So now that you've seen the new Pattern finder... Does it meet all your expectations? Let me know if you've got any ideas or unresolved issues!

20100912

Join the DataCleaner group at LinkedIn

I've opened up a new LinkedIn group for DataCleaner and I would like to invite anyone with an interest in DataCleaner and open source data quality to join.

If you've read my blog lately you will know that we are currently in heavy development of a new engine for the application (called "AnalyzerBeans"), and the group's focus right now is also on this development. But it is also for sharing experience and discussing features, issues and solutions.

Join the group to help us gather a bit of traction for the project.

20100911

More instructions for authoring AnalyzerBeans jobs

I've previously posted a blog entry about how you could now download and run a simple example of AnalyzerBeans in the shell. I've updated the example and improved the command-line interface so that it will further assist you if you are interested in using the tool.

First of all, the command-line tool now has a reasonable usage screen:

> java -jar target/AnalyzerBeans.jar
-conf (-configuration, --configuration-file) FILE
      : XML file describing the configuration of AnalyzerBeans
-ds (-datastore, --datastore-name) VAL
      : Name of datastore when printing a list of schemas, tables or columns
-job (--job-file) FILE
      : An analysis job XML file to execute
-list [ANALYZERS | TRANSFORMERS | DATASTORES | SCHEMAS | TABLES | COLUMNS ]
      : Used to print a list of various elements available in the configuration
-s (-schema, --schema-name) VAL
      : Name of schema when printing a list of tables or columns
-t (-table, --table-name) VAL
      : Name of table when printing a list of columns

As you can see, you can now for example list all available analyzers (there are a lot, so I'm only posting the relevant parts for my upcoming example here; the rest have been replaced with "..."):
> java -jar target/AnalyzerBeans.jar -conf examples/conf.xml -list ANALYZERS
Analyzers:
----------
...
name: String analyzer
- Consumes multiple input columns
...
name: Value distribution
- Consumes a single input column
- Property: name=Record unique values, type=boolean, required=true
- Property: name=Bottom n most frequent values, type=Integer, required=false
- Property: name=Top n most frequent values, type=Integer, required=false
...
name: Number analyzer
- Consumes multiple input columns

I'll help you read this output: There are three analyzers listed. The String analyzer and Number analyzer both consume multiple columns, which means that they can be configured to have multiple inputs. Value distribution is another analyzer which only consumes a single column and has three configurable properties: Record unique values, Bottom n most frequent values and Top n most frequent values.

You can similarly list available transformers:
> java -jar target/AnalyzerBeans.jar -conf examples/conf.xml -list TRANSFORMERS
...
or datastores:
> java -jar target/AnalyzerBeans.jar -conf examples/conf.xml -list DATASTORES
...
or tables and columns in a particular datastore:
> java -jar target/AnalyzerBeans.jar -conf examples/conf.xml -ds employees_csv -list TABLES
...
> java -jar target/AnalyzerBeans.jar -conf examples/conf.xml -ds employees_csv -t employees -list COLUMNS
...
So now you have all the details that enable you to author an XML-based AnalyzerBeans job yourself. Let's take a look at the example. I'm going to post a few snippets from the employees_job.xml file which I also used in my previous post. Notice that this file has been updated since my last post, so if you followed my previous tutorial you will need to run an "svn update" in order to get up-to-date code and data.

The file starts with a little metadata, which we're not going into detail with. Then there's the <source> part:
<source>
    <data-context ref="employees_csv" />
    <columns>
      <column id="col_name" path="employees.csv.employees.name" />
      <column id="col_email" path="employees.csv.employees.email" />
      <column id="col_birthdate" path="employees.birthdate" />
    </columns>
</source>
The content is almost self-explanatory. There's a reference to the employees_csv datastore and the three columns defined in the CSV file: name, email and birthdate. Notice the id attributes of these three columns - they will be referenced further down in the XML file.

The next major part of the XML file is the transformation part. Let's have a look at one of the transformations:
<transformer>
  <descriptor ref="Email standardizer" />
  <input ref="col_email" />
  <output id="col_username" name="Email username" />
  <output id="col_domain" name="Email domain" />
</transformer>
This snippet defines that the Email standardizer transformer consumes a single column (col_email) and generates two new virtual columns: col_username and col_domain. With this in place, the final part of the XML file should be pretty obvious. Let's have a look at one of the analyzers defined in the <analysis> part:
<analyzer>
  <descriptor ref="Value distribution" />
  <input ref="col_username" />
</analyzer>
It simply maps the (virtual) col_username column to a Value distribution analyzer which is then executed (along with all the other analyzers defined in the file) when you run the job from the command line:
> java -jar target/AnalyzerBeans.jar -conf examples/conf.xml -job examples/employees_job.xml
...
Value distribution for column: Email username
Null count: 0
Unique values:
- asbjorn
- foo.bar
- foobar.foo
- jane.doe
- john.doe
- kasper
- santa
...
I hope that you find this XML format pretty straightforward to author. Of course we will be implementing a graphical user interface as well, but for the moment I am actually quite satisfied with this early user interface.

20100907

Developing a value transformer using the DataCleaner Java API

In this blog entry I will demonstrate the Java API of DataCleaner to create transformers, i.e. components for transforming/converting/tokenizing/generating new values based on the existing values of a dataset. You will need Java programming skills to follow this tutorial.
I find that the easiest way to explain this process is by running an example. So here's my case: I want to transform birthdates of persons (represented as Date fields) into age fields (represented as number fields). A scenario is depicted below:

After the transformation I will be able to independently process the age field, e.g. with a number analysis, a value distribution or some business rule that depends on age.
The requirements for building a transformer class are the following:
  • The class must implement the Transformer interface.
  • The class must be annotated with the (javax.inject) @Named annotation. The annotation takes an argument: The readable name of the transformer. We will thus annotate: @Named("Date to age")
  • In order to read from the incoming fields we need to inject an InputColumn<E> instance (or alternatively an array of these), where <E> is the data type of the incoming fields. To inject we use the @Configured annotation. In our example this translates to: @Configured InputColumn<Date> dateColumn;
After these steps our code will look something like this:
@Named("Date to age")
public class DateToAgeTransformer implements Transformer {

  @Configured
  InputColumn<Date> dateColumn;

  @Override
  public OutputColumns getOutputColumns() {
    // TODO
    return null;
  }

  @Override
  public Object[] transform(InputRow inputRow) {
    // TODO
    return null;
  }
}
As we can see, there are two methods defined by the Transformer interface that we need to implement. They are:
  • getOutputColumns(): This method is called by the framework to determine which virtual columns will be produced by the transformer. In our case it is quite simple: The transformer creates virtual columns for age (both in days and in years, just to make it more flexible). The method body should therefore just be:
    return new OutputColumns(Integer.class, "Age in days", "Age in years");
  • transform(InputRow): This method will be called for each row with values to be transformed. The return type of the method is an Object array representing the new values of the row. The indexes of the returned array should match the output columns, i.e. index 0 is for "Age in days" and index 1 is for "Age in years". Let's have a look at the method's implementation:
    Integer[] result = new Integer[2];
    // "today" could also be kept as a field; it's declared here to keep the snippet self-contained
    Date today = new Date();
    Date date = inputRow.getValue(dateColumn);
    
    if (date != null) {
      long diffMillis = today.getTime() - date.getTime();
      int diffDays = (int) (diffMillis / (1000 * 60 * 60 * 24));
    
      result[0] = diffDays;
    
      // use Joda Time to easily calculate the diff in years
      int diffYears = Years.yearsBetween(new DateTime(date), new DateTime(today)).getYears();
      result[1] = diffYears;
    }
    
    return result;
Of course I didn't do all the work of writing this tutorial without checking in the code so you can try it in action. The code for the "Date to age" transformer is available here and there's also a unit test available here, which is pretty usable as a demonstration of how to unit test transformers. I hope some of you engage in developing transformers - let me know how it turns out. In my next blog post I'll explain how to build Analyzers, which are the obvious next step when developing components for DataCleaner.
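To give an impression of what such a test could look like, here's a rough sketch (not the actual unit test linked above). It assumes the MockInputColumn and MockInputRow helper classes, and that the test lives in the same package as the transformer so it can set the @Configured field directly:
// A rough test sketch - MockInputColumn and MockInputRow are assumed helpers,
// see the linked unit test for the real setup.
public class DateToAgeTransformerTest extends TestCase {

  public void testTransform() throws Exception {
    MockInputColumn<Date> col = new MockInputColumn<Date>("birthdate", Date.class);

    DateToAgeTransformer transformer = new DateToAgeTransformer();
    // the @Configured field is package-private, so a test in the same package can set it directly
    transformer.dateColumn = col;

    // a birthdate 10 days ago should yield an age of 10 days and 0 years
    Date birthdate = new Date(System.currentTimeMillis() - 10L * 24 * 60 * 60 * 1000);
    Object[] result = transformer.transform(new MockInputRow().put(col, birthdate));

    assertEquals(2, result.length);
    assertEquals(Integer.valueOf(10), result[0]);
    assertEquals(Integer.valueOf(0), result[1]);
  }
}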
There are a few other good examples of transformers that might be of interest:
  • The Convert to date transformer which will try to convert any value to a date. This is perhaps useful in combination with the transformer that I've just explained in this tutorial. In other words: These two transformers may need to be chained if for example the birth date to be transformed is stored in a String-based field.
  • The Tokenizer transformer, because it has a flexible amount of output columns based on the user's configuration. Notice the @Configured Integer numTokens that is used in getOutputColumns() for this purpose.