Yet Another DataCleaner 2.0 Screenshot :)

This is basically just an addition to my previous post about richer reporting and charts in DataCleaner 2.0.

What you see is our new Date Gap Analyzer which can be used to plot a timeline based on FROM and TO dates in a dataset. The analyzer will display gaps in the timeline as well as overlaps (periods where more than one record exists). This should be pretty useful for finding errors in datasets that contain continuous activities.

The chart is zoomable and scrollable, so it can display quite a lot of data without harming the visual appearance.
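The underlying idea can be sketched in a few lines of plain Java. This is not the actual Date Gap Analyzer code, just a minimal illustration assuming intervals are given as [from, to] pairs of long timestamps, sorted by their FROM values:

```java
import java.util.ArrayList;
import java.util.List;

// A minimal sketch of gap/overlap detection over FROM/TO intervals. The real
// Date Gap Analyzer works on Date columns; here intervals are simply
// [from, to] pairs of long timestamps, sorted by 'from'.
public class TimelineGaps {

    // Returns the periods not covered by any interval.
    public static long[][] findGaps(long[][] intervals) {
        List<long[]> gaps = new ArrayList<long[]>();
        long coveredUntil = intervals[0][1];
        for (long[] interval : intervals) {
            if (interval[0] > coveredUntil) {
                // nothing covers [coveredUntil, interval[0]] -> a gap
                gaps.add(new long[] { coveredUntil, interval[0] });
            }
            coveredUntil = Math.max(coveredUntil, interval[1]);
        }
        return gaps.toArray(new long[gaps.size()][]);
    }

    // Returns the periods where two consecutive intervals overlap,
    // i.e. where more than one record exists.
    public static long[][] findOverlaps(long[][] intervals) {
        List<long[]> overlaps = new ArrayList<long[]>();
        for (int i = 1; i < intervals.length; i++) {
            long prevEnd = intervals[i - 1][1];
            if (intervals[i][0] < prevEnd) {
                overlaps.add(new long[] { intervals[i][0],
                        Math.min(prevEnd, intervals[i][1]) });
            }
        }
        return overlaps.toArray(new long[overlaps.size()][]);
    }
}
```

For the intervals [0,10], [5,12], [20,30] this finds one overlap ([5,10]) and one gap ([12,20]).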


Match! Boardgame about the heuristics in data matching(?)

This morning I was enjoying a bit of Good Clean Family Christmas TV You Can Trust and one of the subjects covered was a new Danish board game that you could spend your Christmas vacation playing. It's called Match! and here I will try to outline the rules as I understand them:

  • Each player has 3 game cards, each with a picture of something, in their hand.
  • A picture is shown to all players and they now have to match that picture with one of the pictures in their hand.
  • In the example shown on TV there was a picture of some sausages on a grill. The example matches of the three players were:
    • Danish politician Pia Kjærsgaard - both she and the sausages represent something very Danish and something they'd like to put on a grill!
    • A used roll of toilet paper - related to a different kind of "sausage"!
    • A crowd at a musical festival - a place where you'd love to eat a grilled sausage.

The matches themselves were not the best I've seen, but they do point out an important feature of a good matching engine: using simple similarity checks is not enough. You need to understand not only the spelling, phonetics etc. of the things you are trying to match, but also the semantics. A good example of this in Denmark is our country's biggest company: "Maersk", which can rather easily be matched with "Mærsk", but it's more difficult to get the synonym "A.P. Møller" into the matching rules unless you hardcode it somehow. And if matching goes beyond just names, other associative matching rules might apply.
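To make the point a bit more concrete, here is a hedged sketch of what a synonym table in front of a spelling-based matcher could look like. The class and the table contents are made up for illustration; no real matching product is this simple:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: names are normalized through a hand-maintained
// synonym table before any spelling-based comparison. The table entries
// here are examples from the text above, not a real matching rule set.
public class SynonymMatcher {

    private final Map<String, String> synonyms = new HashMap<String, String>();

    public SynonymMatcher() {
        // hardcoded semantic knowledge, as discussed above
        synonyms.put("maersk", "mærsk");
        synonyms.put("a.p. møller", "mærsk");
    }

    // Reduce a name to its canonical form: lowercase, then synonym lookup.
    public String canonicalize(String name) {
        String key = name.toLowerCase().trim();
        String canonical = synonyms.get(key);
        return canonical != null ? canonical : key;
    }

    public boolean matches(String a, String b) {
        return canonicalize(a).equals(canonicalize(b));
    }
}
```

With this in place, "Maersk" and "A.P. Møller" match even though no similarity measure would ever relate the two strings.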

Well... Can't wait to play Match! It sounds like a fun game and it will definitely be in the back of my head to try and record some of the interesting heuristics applied there.


Richer reporting and charts in DataCleaner 2

One of the important new features of DataCleaner 2 will be a much richer reporting module than the old one. In DataCleaner 2 the result of an analysis is not limited to the crosstabular view that a lot of you know from DataCleaner 1.x. In this blog post I will provide you with a preview of some of the exciting reports that have been added lately.

Charts in Value distribution
The Value distribution component is well-known to most DataCleaner users. It provides a simple but crucial look into the distribution of values in a column. In DataCleaner 2.0 we are enhancing the experience of working with the Value distribution by applying visually pleasing charts as well as grouping of values with similar frequencies. Take a look at this example result on a country column:

Now you might think: "Looks nice, but that's going to be messy for columns with very oddly distributed values". And you're right. Except that we have applied a rather intelligent grouping mechanism that makes sure we never go above a certain number of slices in a chart. To accomplish this we may need to group some values together by their frequencies, which also communicates another important fact: when repeated values occur, how many times do they occur? Take a look at this next example of the value distribution of a customer number column:

As you can see, even though there's a very large number of customer numbers, we group them together by frequency. This is a principle that is actually already known from the <unique> group, except that we now also apply it to further frequencies: <group=2>, <group=3> etc.
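The grouping idea can be sketched roughly like this (illustrative code only, not the actual DataCleaner implementation):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// A rough sketch of the grouping idea: instead of one chart slice per value,
// values sharing the same frequency are collapsed into groups labelled
// <unique>, <group=2>, <group=3> and so on.
public class FrequencyGrouper {

    // Returns a map from group label to the number of distinct values in it.
    public static Map<String, Integer> group(String[] values) {
        Map<String, Integer> valueCounts = new HashMap<String, Integer>();
        for (String value : values) {
            Integer count = valueCounts.get(value);
            valueCounts.put(value, count == null ? 1 : count + 1);
        }
        Map<String, Integer> groups = new TreeMap<String, Integer>();
        for (Integer frequency : valueCounts.values()) {
            String label = frequency == 1 ? "<unique>" : "<group=" + frequency + ">";
            Integer size = groups.get(label);
            groups.put(label, size == null ? 1 : size + 1);
        }
        return groups;
    }

    // Convenience lookup: how many distinct values fall in a given group?
    public static int groupSize(String[] values, String label) {
        Integer size = group(values).get(label);
        return size == null ? 0 : size;
    }
}
```

For the input {a, b, b, c, c, d} this yields two <unique> values (a and d) and two values in <group=2> (b and c).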

Notice also the green arrows in the table to the right. Using this button (or by clicking the slices of the pie chart) you will be able to drill down to view the actual values that make up a given group.

Navigation tree in Phonetic similarity finder

Another application of richer reporting in DataCleaner is for the new Phonetic similarity finder. In short this analyzer will apply a mix of well-known algorithms for similarity checking such as Soundex, Metaphone and Levenshtein distance to produce a set of groups of similar sounding values. What you get is a tree of groups from where you can see the rows that are similar or maybe even identical:

The big news here is of course that this kind of result would be practically impossible to display in a crosstabular result of DataCleaner 1.x - which is also why DataCleaner 1.x doesn't have this feature. I hope that my message with this is clear: DataCleaner 2 will not only be a substantial improvement to the existing data profiling tool, but it will also open up a lot of new doors for more interactive (and interesting) analyses.
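Of the algorithms mentioned above, Levenshtein distance is the easiest to show in a few lines. This is the textbook dynamic-programming version, not AnalyzerBeans' actual implementation:

```java
// Minimum number of single-character edits (insert, delete, substitute)
// needed to turn one string into the other - the classic DP formulation.
public class Levenshtein {

    public static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) {
            d[i][0] = i; // delete all of a's first i chars
        }
        for (int j = 0; j <= b.length(); j++) {
            d[0][j] = j; // insert all of b's first j chars
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(
                        d[i - 1][j] + 1,          // deletion
                        d[i][j - 1] + 1),         // insertion
                        d[i - 1][j - 1] + cost);  // substitution
            }
        }
        return d[a.length()][b.length()];
    }
}
```

For example, distance("kitten", "sitting") is 3. The similarity finder combines measures like this with phonetic encodings (Soundex, Metaphone) to group values.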


The last thing that I would like to point out in this blog entry is the fact that the rendering mechanism in DataCleaner 2.0 is pluggable. This means that you can very easily, using modular Java code, enhance the existing result renderers or implement your own, and simply plug it into the application. Just remember to contribute it back to the community :)


Java 6 compilation problems

If you're trying to build DataCleaner 2 and get problems like this ...

...src\main\java\org\eobjects\datacleaner\widgets\properties\ChangeRequirementButton.java:[94,6] inconvertible types
found : org.eobjects.analyzer.job.builder.AbstractBeanWithInputColumnsBuilder<capture#10 of ?,capture#279 of ?,capture#140 of ?>
required: org.eobjects.analyzer.job.builder.FilterJobBuilder<?,?>

...src\main\java\org\eobjects\datacleaner\widgets\properties\MultipleInputColumnsPropertyWidget.java:[116,61] inconvertible types
found : org.eobjects.analyzer.job.builder.AbstractBeanJobBuilder<capture#0 of ?,capture#51 of ?,capture#202 of ?>
required: org.eobjects.analyzer.job.builder.TransformerJobBuilder<?>

...src\main\java\org\eobjects\datacleaner\widgets\properties\SingleInputColumnPropertyWidget.java:[73,61] inconvertible types
found : org.eobjects.analyzer.job.builder.AbstractBeanJobBuilder<capture#29 of ?,capture#564 of ?,capture#109 of ?>
required: org.eobjects.analyzer.job.builder.TransformerJobBuilder<?>
... then I just wanted to let you know that it's actually a compiler bug, not a source code error. The Java 6 compiler (pre update 18, I believe) seems unable to cope with subtypes of generic interfaces that declare different type parameters than the interface. Of course this should be (and is, in newer compiler versions) possible, because a subtype may implement an interface with certain type parameters and define a new set of type parameters of its own. So ... update to the newest JDK please :)
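For the curious, here is a minimal sketch of the kind of code that triggers the problem: a subtype that implements a generic interface while declaring its own, different type parameters, and a cast from a wildcard-typed reference to it. The class names are made up; they just mimic the shape of the builder types in the error messages above:

```java
// A generic interface, and a subtype that declares its own type parameters.
interface JobBuilder<B> {
    B toResult();
}

class FilterBuilder<F, C> implements JobBuilder<String> {
    public String toResult() {
        return "filter";
    }
}

public class CaptureCastDemo {
    public static String describe(JobBuilder<?> builder) {
        if (builder instanceof FilterBuilder) {
            // Pre-update-18 Java 6 compilers rejected casts like this with
            // "inconvertible types"; newer compilers accept them.
            FilterBuilder<?, ?> fb = (FilterBuilder<?, ?>) builder;
            return fb.toResult();
        }
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(describe(new FilterBuilder<Integer, Long>())); // prints "filter"
    }
}
```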


Preview of the UI in DataCleaner 2

Lately I've been blogging a lot about AnalyzerBeans, which is the name of the new engine of DataCleaner from version 2.0 and onwards. As AnalyzerBeans is nearing a state where it is usable and maturing, I have now also taken the first steps of development on the roadmap for DataCleaner 2. As a techie I would like to place as much emphasis on the technical capabilities of AnalyzerBeans as possible, but honestly it doesn't do much good without a good user interface. So just as AnalyzerBeans was/is an attempt to rewrite the functional/logical part of DataCleaner, the new UI will be an attempt to deliver a user experience that feels new, exciting, more responsive and interactive. The "sketches" for the new UI are being drawn these days - I'll take you through a few examples.

In the two screenshots below you can see the source data selection and a transformation of this source data. The source selection is pretty similar to the existing DataCleaner UI but notice the new transformation-oriented features. In the example below I want to use a "Name standardizer" transformation which will turn my "real_name" column into four (virtual) columns: First name, Last name, Middle name and Titulation. Similarly I can convert data types, concatenate, tokenize, parse etc.

Another thing that is much needed in the existing DataCleaner UI is more elaborate configuration options for the various profiles. In the screenshot below you'll see the new and improved version of the Pattern finder which includes a new set of configuration options. Notice that both my physical columns (real_name) and my virtual columns (as mentioned before) are available for the Pattern finder.
There are a lot of other exciting things going into the new DataCleaner version but I will save some news for later :) For now, I can only invite everyone to try it out. All you have to do is:
> mkdir datacleaner_dev
> cd datacleaner_dev
> svn co http://eobjects.org/svn/AnalyzerBeans/trunk AnalyzerBeans
> cd AnalyzerBeans
> mvn install
> cd ..
> svn co http://eobjects.org/svn/DataCleaner/trunk DataCleaner
> cd DataCleaner
> mvn install
> java -jar target/DataCleaner-2.0-SNAPSHOT.jar
Good luck and let us know what you think :-)

PS: Maybe I should note that even though the new version is usable there are still a lot of things NOT working. If you're wondering whether something odd is a bug or a feature that has simply not been implemented yet - don't hesitate to ask.


Developing an analyzer using the AnalyzerBeans Java API

Previously I've posted about developing a value transformer using the AnalyzerBeans Java API. Now it's time to see how to develop an analyzer, which is a component for consuming data and turning it into a result that is humanly readable and hopefully useful. The Javadocs for the Java API are located here. There are lots of different analyzers in AnalyzerBeans already, which could be interesting to have a look at when you decide that you want to develop your own:

  • For typical measures there are analyzers like the Number analyzer and String analyzer. These analyzers calculate standardized measures for these data types.
  • There's the Value distribution analyzer, which is interesting because it uses a backing database (using the @Provided annotation) for counting unique values if the values exceed the amount of free memory.
  • The Date gap analyzer is also a good example because it has named input columns, used for building a timeline of from- and to-dates.
  • The Pattern finder analyzer which you can read a lot more about in one of my previous blog posts.
So let's begin with a simple example. Say you want to build a very simple analyzer that consumes date or time based values and determines the value distribution based on day-of-week (i.e. how the values are distributed across Monday, Tuesday, Wednesday etc.). While this is a rather naive example of an analyzer, it will work well as just that - an example.
We'll begin with the requirements for building an analyzer:
  • You need to define a class that implements the Analyzer<R> interface. The generic 'R' argument defines the result type of the analyzer. We can reuse a built-in result-type or write our own.
  • The class needs to be annotated with the @AnalyzerBean annotation. This annotation takes an argument: The display name of the analyzer.
  • You need to inject one or more InputColumn<E>'s using the @Configured annotation in order to consume the incoming data. The <E> type-parameter defines the datatype of interest, which is also used to determine which kinds of data types the analyzer supports. In our case we'll use Date as the InputColumn type, because we want our analyzer to consume date values.
So here is our class when it has been created in accordance with the requirements above:
@AnalyzerBean("Average date analyzer")
public class AverageDateAnalyzer implements Analyzer<CrosstabResult> {

      @Configured
      InputColumn<Date>[] dateColumns;

      public void run(InputRow row, int distinctCount) { ... }

      public CrosstabResult getResult() { ... }
}
Notice that we're using the built-in result type CrosstabResult, which represents a result consisting of a dimensional crosstab. We could have used other built-in result types or we could have created our own result-class - the only requirement is that it implements the AnalyzerResult interface.

The rest of the analyzer should be "plain old Java", but of course using the APIs that are available in AnalyzerBeans. I've explained most of these things before, but I'll go through it again.

Now let's consider how to implement the concrete analyzer logic. We'll use a regular map to hold the distribution values, mapping weekday numbers to counts. But we'll need a count for each column that we're analyzing, so it's going to be a nested map:
private Map<InputColumn<Date>, Map<Integer, Integer>> distributionMap;
To initialize the map we need to have the InputColumns injected first, so the constructor won't do. Instead we can annotate a method with the @Initialize annotation, which will make AnalyzerBeans invoke the method when the bean has been properly initialized.
@Initialize
public void init() {
  distributionMap = new HashMap<InputColumn<Date>, Map<Integer, Integer>>();
  for (InputColumn<Date> col : dateColumns) {
    Map<Integer, Integer> countMap = new HashMap<Integer, Integer>(7);
    for (int i = Calendar.SUNDAY; i <= Calendar.SATURDAY; i++) {
      // put a count of 0 for each day of the week
      countMap.put(i, 0);
    }
    distributionMap.put(col, countMap);
  }
}
Now that the map has been initialized we can proceed to implement the run(...) method:
public void run(InputRow row, int distinctCount) {
  for (InputColumn<Date> col : dateColumns) {
    Date value = row.getValue(col);
    if (value != null) {
      Calendar c = Calendar.getInstance();
      c.setTime(value); // point the calendar at the actual date value
      int dayOfWeek = c.get(Calendar.DAY_OF_WEEK);
      Map<Integer, Integer> countMap = distributionMap.get(col);
      int count = countMap.get(dayOfWeek);
      count += distinctCount;
      countMap.put(dayOfWeek, count);
    }
  }
}
This should be pretty much "Java as usual". The only thing that should be new to you if you're an experienced Java developer is the way you extract values from the InputRow using the InputColumns as qualifiers:
Date value = row.getValue(col);
Notice that the value variable has the Date type. The AnalyzerBeans API takes advantage of type-safety to a large extent. Since the injected InputColumns are defined as Date-columns, we can safely assume that the values in the incoming row are also of the Date type. Furthermore, the Date-columns will be used to verify the configuration of AnalyzerBeans jobs and to provide early error messages to the user if he tries to configure this particular analyzer with a non-Date column.

Now on to creating the result. As stated earlier we will use the CrosstabResult for this. The crosstab result is a pretty dynamic result type that can be used for a lot of purposes. Its metaphor is similar to DataCleaner's result matrices but with added features. Here's how we build our crosstab:
public CrosstabResult getResult() {
  CrosstabDimension columnDimension = new CrosstabDimension("Column");
  CrosstabDimension weekdayDimension = new CrosstabDimension("Weekday");

  Crosstab crosstab = new Crosstab(Integer.class, columnDimension, weekdayDimension);
  for (InputColumn<Date> col : dateColumns) {
    CrosstabNavigator nav = crosstab.where(columnDimension, col.getName());
    Map<Integer, Integer> countMap = distributionMap.get(col);
    nav.where(weekdayDimension, "Sunday").put(countMap.get(Calendar.SUNDAY));
    nav.where(weekdayDimension, "Monday").put(countMap.get(Calendar.MONDAY));
    nav.where(weekdayDimension, "Tuesday").put(countMap.get(Calendar.TUESDAY));
    nav.where(weekdayDimension, "Wednesday").put(countMap.get(Calendar.WEDNESDAY));
    nav.where(weekdayDimension, "Thursday").put(countMap.get(Calendar.THURSDAY));
    nav.where(weekdayDimension, "Friday").put(countMap.get(Calendar.FRIDAY));
    nav.where(weekdayDimension, "Saturday").put(countMap.get(Calendar.SATURDAY));
  }
  return new CrosstabResult(getClass(), crosstab);
}
Now we're done. You can take a look at the final result here. When I run this analyzer with a small sample of data in three columns the result looks like this:
             Order date Shipment date Delivery date
Sunday                0             0             0
Monday                2             0             1
Tuesday               0             2             1
Wednesday             0             0             0
Thursday              1             0             0
Friday                1             1             2
Saturday              0             1             0
You can also check out the unit test for this analyzer here.


Pattern finder 2.0 - the latest feature in DataCleaner

I'm happy to be able to present a feature in this blog post that I know a lot of you have been asking for: A new and improved "Pattern finder" (as known in DataCleaner).

The new Pattern finder works similarly to the old one. The new thing is that it supports a wide variety of configuration options (and it has been designed so that it will be significantly easier to add more options, if needed). Here are the current available options:

  • Discriminate text case (default: true): Sets whether or not text tokens that are upper case and lower case should be treated as different types of tokens.
  • Discriminate negative numbers (default: false): Sets whether or not negative numbers should be treated as different token types than positive numbers.
  • Discriminate decimals (default: true): Sets whether or not decimal numbers should be treated as different token types than integers.
  • Enable mixed tokens (default: true): Enables the "mixed" token type (denoted as '?' output). This type of token will occur when numbers and letters occur without separation by whitespace.
  • Ignore repeated spaces (default: false): Sets whether or not repeated whitespaces should be ignored (i.e. matched with single whitespaces).
  • Decimal separator (default: ,*): The separator used to identify decimal numbers.
  • Thousands separator (default: .*): The character used as a thousands separator in large numbers.
  • Minus sign (default: -*): The character used to denote negative numbers.
  • Predefined token name (default: none): Can be used to define an anticipated "predefined token" that should be replaced before any subsequent pattern recognition. Requires that the "Predefined token regexes" property is also set. An example of a name could be "Titulation".
  • Predefined token regexes (default: none): Defines a set of regular expressions for the "predefined token". Requires that the "Predefined token name" property is also set. An example value for these regular expressions could be "[Mr,Mrs,Miss,Mister]" (which would correspond to the "Titulation" name).

* = Depending on locale, the shown value is the typical one.

This may all seem complicated, but rest assured that the default values are reasonable and almost exactly resemble what you would expect from the Pattern finder in DataCleaner (except for the "Discriminate text case" property, which is inherently turned off in DataCleaner).
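To give an impression of what the tokenization does, here is a deliberately simplified sketch: with case discrimination on, upper-case letters become 'A', lower-case letters 'a', digits '9', and everything else passes through. The real Pattern finder does much more (token types, separators, predefined tokens):

```java
// A deliberately simplified sketch of pattern tokenization, not the
// actual Pattern finder implementation.
public class SimplePatternizer {

    public static String pattern(String value) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (Character.isUpperCase(c)) {
                sb.append('A');       // upper-case letter
            } else if (Character.isLowerCase(c)) {
                sb.append('a');       // lower-case letter
            } else if (Character.isDigit(c)) {
                sb.append('9');       // digit
            } else {
                sb.append(c);         // separators pass through
            }
        }
        return sb.toString();
    }
}
```

With this sketch, "Sales Rep" becomes "Aaaaa Aaa", just like in the output below.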

Here's how it works with a set of different inputs (job title, email, name) and configurations:
> java -jar target/AnalyzerBeans.jar -conf examples/conf.xml -job examples/patternfinder_job.xml 

                            Match count Sample      
Aaaaa Aaa                            17 Sales Rep   
AA Aaaaaaaaa                          2 VP Sales    
Aaaaa Aaaaaaa (AAAA)                  2 Sales Manager (EMEA) 
Aaaaa Aaaaaaa (AAAAA, AAAA)           1 Sales Manager (JAPAN, APAC) 
Aaaaaaaaa                             1 President   

                                  Match count Sample      
aaaaaa.aaa@aaaaaaa[Domain suffix]           4 foo.bar@company.com 
aaaaaaa@aaaaaaaa[Domain suffix]             3 santa@claus.com 

                 Match count Sample      
aaaaaaa aaaaa              4 Jane Doe    
aaaaaaaa, aaaaaa           2 Bar, Foo    
aaa. aaaaaa aaa            1 Mrs. Foobar Foo
Notice that in the email example the two patterns end with "[Domain suffix]". This is because I've registered a corresponding "Predefined token" for this:
  <descriptor ref="Pattern finder" />
    <property name="Predefined token name" value="Domain suffix" />
    <property name="Predefined token regexes" value="[\.com,\.org]" />
  <input ref="col_email" />
So now that you've seen the new Pattern finder... Does it meet all your expectations? Let me know if you've got any ideas or unresolved issues!


Join the DataCleaner group at LinkedIn

I've opened up a new LinkedIn group for DataCleaner and I would like to invite anyone with an interest in DataCleaner and open source data quality to join.

If you've read my blog lately you will know that we are currently in heavy development of a new engine for the application (called "AnalyzerBeans") and the group's focus right now is also on this development. But it is also for sharing experience and discussing features, issues and solutions.

Join the group to help us gather a bit of traction for the project.


More instructions for authoring AnalyzerBeans jobs

I've previously posted a blog entry about how you could now download and run a simple example of AnalyzerBeans in the shell. I've updated the example and improved the command-line interface so that it will further assist you if you are interested in using the tool.

First of all, the command-line tool now has a reasonable usage screen:

> java -jar target/AnalyzerBeans.jar
-conf (-configuration, --configuration-file) FILE
      : XML file describing the configuration of AnalyzerBeans
-ds (-datastore, --datastore-name) VAL
      : Name of datastore when printing a list of schemas, tables or columns
-job (--job-file) FILE
      : An analysis job XML file to execute
-list VAL
      : Used to print a list of various elements available in the configuration
-s (-schema, --schema-name) VAL
      : Name of schema when printing a list of tables or columns
-t (-table, --table-name) VAL
      : Name of table when printing a list of columns

As you can see, you can now for example list all available analyzers (there are a lot, so I'm only posting the parts relevant for my upcoming example here; the rest have been replaced with "..."):
> java -jar target/AnalyzerBeans.jar -conf examples/conf.xml -list ANALYZERS
name: String analyzer
- Consumes multiple input columns
name: Value distribution
- Consumes a single input column
- Property: name=Record unique values, type=boolean, required=true
- Property: name=Bottom n most frequent values, type=Integer, required=false
- Property: name=Top n most frequent values, type=Integer, required=false
name: Number analyzer
- Consumes multiple input columns

I'll help you read this output: There are three analyzers listed. The String analyzer and Number analyzer both consume multiple columns, which means that they can be configured to have multiple inputs. Value distribution is another analyzer which only consumes a single column and has three configurable properties: Record unique values, Bottom n most frequent values and Top n most frequent values.

You can similarly list available transformers:
> java -jar target/AnalyzerBeans.jar -conf examples/conf.xml -list TRANSFORMERS
or datastores:
> java -jar target/AnalyzerBeans.jar -conf examples/conf.xml -list DATASTORES
or tables and columns in a particular datastore
> java -jar target/AnalyzerBeans.jar -conf examples/conf.xml -ds employees_csv -list TABLES
> java -jar target/AnalyzerBeans.jar -conf examples/conf.xml -ds employees_csv -t employees -list COLUMNS
So now you have all the details that enable you to author an XML-based AnalyzerBeans job yourself. Let's take a look at the example. I'm going to post a few snippets from the employees_job.xml file which I also used in my previous post. Notice that this file has been updated since my last post so you will need to run an "svn update" if you followed my previous tutorial, in order to get up-to-date code and data.

The file starts up with a little metadata. We're not going into detail with that. Then there's the <source> part:
    <data-context ref="employees_csv" />
      <column id="col_name" path="employees.csv.employees.name" />
      <column id="col_email" path="employees.csv.employees.email" />
      <column id="col_birthdate" path="employees.birthdate" />
The content is almost self-explanatory. There's a reference to the employees_csv datastore and the three columns defined in the CSV file: name, email, birthdate. Notice the ids of these three columns. These ids will be referenced further down in the XML file.

The next major part of the XML file is the transformation part. Let's have a look at one of the transformations:
  <descriptor ref="Email standardizer" />
  <input ref="col_email" />
  <output id="col_username" name="Email username" />
  <output id="col_domain" name="Email domain" />
This snippet defines that the Email standardizer transformer consumes a single column (col_email) and generates two new virtual columns: col_username and col_domain. Now understanding the final part of the XML file will be pretty obvious. Let's have a look at one of the analyzers defined in the <analysis> part:
  <descriptor ref="Value distribution" />
  <input ref="col_username" />
It simply maps the (virtual) col_username column to a Value distribution analyzer which is then executed (along with all the other analyzers defined in the file) when you run the job from the command line:
> java -jar target/AnalyzerBeans.jar -conf examples/conf.xml -job examples/employees_job.xml
Value distribution for column: Email username
Null count: 0
Unique values:
- asbjorn
- foo.bar
- foobar.foo
- jane.doe
- john.doe
- kasper
- santa
I hope that you find this XML format pretty straightforward to author. Of course we will be implementing a graphical user interface as well, but for the moment I am actually quite satisfied with this early user interface.


Developing a value transformer using the DataCleaner Java API

In this blog entry I will demonstrate the Java API of DataCleaner to create transformers, i.e. components for transforming/converting/tokenizing/generating new values based on the existing values of a dataset. You will need Java programming skills to follow this tutorial.
I find that the easiest way to explain this process is by running an example. So here's my case: I want to transform birthdates of persons (represented as Date fields) into age fields (represented as a number field). A scenario is depicted below:

After the transformation I will be able to independently process the age field, e.g. with a number analysis, a value distribution or some business rule that depends on age.
The requirements for building a transformer class are the following:
  • The class must implement the Transformer interface.
  • The class must be annotated with the (javax.inject) @Named annotation. The annotation takes an argument: the readable name of the transformer. We will therefore annotate: @Named("Date to age")
  • In order to read from the incoming fields we need to inject an InputColumn<E> instance (or alternatively an array of these), where <E> is the data type of the incoming fields. To inject we use the @Configured annotation. In our example this translates to: @Configured InputColumn<Date> dateColumn;
After these steps our code will look something like this:
@Named("Date to age")
public class DateToAgeTransformer implements Transformer {

  @Configured
  InputColumn<Date> dateColumn;

  public OutputColumns getOutputColumns() {
    // TODO
    return null;
  }

  public Object[] transform(InputRow inputRow) {
    // TODO
    return null;
  }
}
As we see, there are two methods defined by the Transformer interface that we need to implement. They are:
  • getOutputColumns(): This method is called by the framework to determine which virtual columns will be produced by the transformer. In our case it is quite simple: The transformer creates virtual columns for age (both in days and in years, just to make it more flexible). The method body should therefore just be:
    return new OutputColumns(Integer.class, "Age in days", "Age in years");
  • transform(InputRow): This method will be called for each row with values to be transformed. The return type of the method is an Object-array representing the new values of the row. The indexes of the returned array should match the output columns, i.e. index 0 is for "Age in days" and index 1 is for "Age in years". Let's have a look at the method's implementation:
    Integer[] result = new Integer[2];
    Date date = inputRow.getValue(dateColumn);
    if (date != null) {
      Date today = new Date();
      long diffMillis = today.getTime() - date.getTime();
      int diffDays = (int) (diffMillis / (1000 * 60 * 60 * 24));
      result[0] = diffDays;
      // use Joda-Time to easily calculate the diff in years
      int diffYears = Years.yearsBetween(new DateTime(date), new DateTime(today)).getYears();
      result[1] = diffYears;
    }
    return result;
Of course I didn't do all the work of writing this tutorial without checking in the code so you could try it in action. The code for the "Date to age" transformer is available here and there's also a unit test available here, which is pretty usable as a demonstration of how to unit test transformers. I hope some of you engage in developing transformers - let me know how it turns out. In my next blog post I'll explain how to build analyzers, which are the obvious next step when developing components for DataCleaner.
There are a few other good examples of transformers that might be of interest:
  • The Convert to date transformer which will try to convert any value to a date. This is perhaps useful in combination with the transformer that I've just explained in this tutorial. In other words: These two transformers may need to be chained if for example the birth date to be transformed is stored in a String-based field.
  • The Tokenizer transformer, because it has a flexible number of output columns based on the user's configuration. Notice the @Configured Integer numTokens that is used in getOutputColumns() for this purpose.


Now you can run AnalyzerBeans (from the shell)

Lately I've been blabbering a lot about the marvels of AnalyzerBeans - the project that is aimed at re-implementing an engine for data analysis based on my experience from DataCleaner.

An important milestone in any development project, especially those like AnalyzerBeans that are implemented bottom-up, is when it is actually possible to use the application without having any developer skills. So far the development of AnalyzerBeans has been focused on making it work from a unit-testing perspective, but now we've reached a point where it is also possible to invoke the engine from the command line.

Since we haven't released AnalyzerBeans yet, you will still have to check out the code and build it yourself. It's rather easy - it just requires Subversion and Maven. First, check out the code:

> svn co http://eobjects.org/svn/AnalyzerBeans/trunk AnalyzerBeans

Now build it:

> cd AnalyzerBeans
> mvn install

And now run the example job that's in there:

> java -jar target/AnalyzerBeans.jar \
> -configuration examples/conf.xml -job examples/employees_job.xml

The job will transform/standardize the "full name" and "email address" columns of a CSV-file (located in the examples-folder) and then print out value distribution and string analysis results for the standardized tokens: First name, Last name, Email username, Email domain.

If you've gone this far, you've probably also tried opening the xml-files employees_job.xml and conf.xml in the examples-folder. Maybe you've even figured out that the conf.xml describes the application setup and that the employees_job.xml file describes the job contents. You can edit these files as you please to further explore the application. I will be sure to update my blog soon with some more examples. Also one of the next features of the command line interface will be to print the available Analyzers and Transformers in order to make it easier to author the xml job-files.

If you're trying this out now and getting excited about AnalyzerBeans, here are my previous blog posts on the subject. Please don't hesitate to let me know what you think.


A nice abstraction over regular expressions

Often when you're developing data profiling, matching or cleansing software, you're dealing with expression matching, typically through regular expressions (regexes). One thing I find is that it is often a tedious and error-prone task to define and reuse regexes or parts of regexes. In AnalyzerBeans there's a huge need for easier and reusable pattern matching. To meet this requirement I've come up with a helper class, NamedPattern, which you can use to match and identify tokens in patterns in a type-safe and easy way. Here's a short example of matching and tokenizing names based on three simple patterns:

// First define an enum with the tokens in the pattern(s)
public enum NamePart { TITULATION, FIRSTNAME, LASTNAME }

// The three patterns
NamedPattern<NamePart> p1 = new NamedPattern<NamePart>("TITULATION. FIRSTNAME LASTNAME", NamePart.class);
NamedPattern<NamePart> p2 = new NamedPattern<NamePart>("FIRSTNAME LASTNAME", NamePart.class);
NamedPattern<NamePart> p3 = new NamedPattern<NamePart>("LASTNAME, FIRSTNAME", NamePart.class);

// notice the type parameter <NamePart> - the match result type is typesafe!
NamedPatternMatch<NamePart> match = p1.match("Sørensen, Kasper");
assert match == null;

match = p2.match("Sørensen, Kasper");
assert match == null;

// here's a match!
match = p3.match("Sørensen, Kasper");
assert match != null;

String firstName = match.get(NamePart.FIRSTNAME);
String lastName = match.get(NamePart.LASTNAME);
// TITULATION is not part of p3, so this will be null
String titulation = match.get(NamePart.TITULATION);

All in all I think that the NamedPattern class (and the NamedPatternMatch) in combination with your own enums is a pretty elegant way to do string pattern matching. There's also a way to specify how the underlying regular expression will be built by letting the enum implement the HasGroupLiteral interface.
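To make the mechanics concrete, here is a minimal, self-contained sketch of how such an abstraction can be built on top of java.util.regex named groups. This is a hypothetical reimplementation for illustration only - the class and method names are mine, not AnalyzerBeans':

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedPatternSketch {

    // Hypothetical stand-in for the NamePart enum used in the post
    public enum NamePart { TITULATION, FIRSTNAME, LASTNAME }

    // Turns a pattern string like "LASTNAME, FIRSTNAME" into a regex where
    // each enum token becomes a named capture group matching a single word.
    public static <E extends Enum<E>> Pattern compile(String pattern, Class<E> tokenClass) {
        String regex = Pattern.quote(pattern);
        for (E token : tokenClass.getEnumConstants()) {
            // break out of the \Q...\E quoting around each token
            regex = regex.replace(token.name(), "\\E(?<" + token.name() + ">\\w+)\\Q");
        }
        return Pattern.compile(regex);
    }

    public static void main(String[] args) {
        Pattern p = compile("LASTNAME, FIRSTNAME", NamePart.class);
        Matcher match = p.matcher("Sorensen, Kasper");
        if (match.matches()) {
            System.out.println(match.group("FIRSTNAME")); // Kasper
            System.out.println(match.group("LASTNAME"));  // Sorensen
        }
    }
}
```

The real NamedPattern adds type-safety on top of this idea by keying the match result with the enum constants themselves rather than string group names.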

Developers can dive into the details of these classes and interfaces at the Javadoc / API Documentation for AnalyzerBeans (package org.eobjects.analyzer.util).


Visualizations and API documentation for AnalyzerBeans

I've spent a few hours trying to capture some of the basic principles of data flow and execution in my new favourite spare-time project AnalyzerBeans. Here are the results, which you will also find in the API Documentation.

The first image shows the relationship between analyzers, transformers and the data that they consume:

Data flow

The second image shows a "close-up" of a row of data. Some of the values originate from the actual datastore, while others may be virtual, generated by a chain of transformers:


Enjoy :)


Data transformation added to AnalyzerBeans

I have been doing a lot of improvements to the API of AnalyzerBeans - a sandbox project that I am very passionate about. In short it is a new Data Profiling/Analysis engine that I think will eventually replace the core parts of DataCleaner. So here's a bit about "what's cookin'":

  • The largest of the new features is that it is now possible to transform data before it is analyzed. The idea here is that it should be possible to tokenize/split/convert/etc. values before they enter the analysis. This means one fundamental change to analyzers, namely that they consume data through an intermediary input-column type which can be virtual (to represent eg. a token) or physical (to represent a "regular" column in a datastore). The new component type, "transformer beans", will support all the same cool stuff that I've already introduced for the analyzer components, like dependency injection, persistent/scalable collections, annotation-driven composition and registration, etc.
  • Another neat thing that I'm currently finishing up is an Analysis Job Builder. The idea is that analysis jobs should be immutable, because this makes it a lot safer to parallelize the execution of the jobs. Immutable structures are very good to work with when you are executing, but they tend to be tedious when you're building the structure. So I'm also adding an API for building the jobs which will emphasize type-safety and syntactic neatness, to make it easy to programmatically manage and verify the jobs you're building. This will make it a lot easier to build a good UI for AnalyzerBeans.
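The mutable-builder/immutable-job split described above can be sketched in miniature. All names here are hypothetical - the real AnalyzerBeans API is richer:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Miniature sketch of the builder idea (hypothetical names, not the real
// AnalyzerBeans API): the builder is mutable and convenient to use, while
// the job it produces is immutable and therefore safe to parallelize.
public class AnalysisJobSketch {
    private final List<String> columnNames;

    private AnalysisJobSketch(List<String> columnNames) {
        // defensive copy + unmodifiable wrapper = true immutability
        this.columnNames = Collections.unmodifiableList(
                new ArrayList<String>(columnNames));
    }

    public List<String> getColumnNames() { return columnNames; }

    public static class Builder {
        private final List<String> columnNames = new ArrayList<String>();

        public Builder addColumn(String name) {
            columnNames.add(name); // mutable while building
            return this;
        }

        public AnalysisJobSketch build() {
            return new AnalysisJobSketch(columnNames);
        }
    }

    public static void main(String[] args) {
        AnalysisJobSketch job = new Builder()
                .addColumn("full name")
                .addColumn("email address")
                .build();
        System.out.println(job.getColumnNames().size()); // 2
    }
}
```

The point of the defensive copy in build() is that once a job is handed to the execution engine, no other thread can alter it - which is exactly what makes parallel execution safe.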


Using DataCleaner's API to run jobs as a part of your Java applications

Yesterday someone asked me if there were any examples of how to set up scheduled DataCleaner jobs in a Java EE environment. While the common case has been to just use the Command-Line Interface (CLI) for DataCleaner together with cron jobs or Windows' scheduled tasks, he had a point - for some organizations this kind of solution would be insufficient. Invocation through code would be better if you already have a lot of Java applications running (eg. in a Java EE environment).

So here's my response to that request - I'll try to walk you through the process of invoking DataCleaner through its Java API. I'll start out with an example of a profiling job - validation is quite similar, but I'll cover that in another blog post later. It's my ambition that these walkthroughs will eventually end up in the DataCleaner online docs.

The package dk.eobjects.datacleaner.execution holds the main entry points for setting up and running a DataCleaner job. First you need a DataCleanerExecutor - in this case we wanna execute profiling jobs, so we'll use a factory method to set up our executor accordingly:

DataCleanerExecutor<ProfilerJobConfiguration,IProfileResult,IProfile> executor = ProfilerExecutorCallback.createExecutor();

Notice the three type parameters. They dictate that this executor handles ProfilerJobConfigurations, produces IProfileResults and executes using IProfiles.

Now it's time to create some profiling jobs. We do this by adding configuration-objects that describe the tasks at hand (the executor will handle the lifecycle of the actual profilers for us). In this example we'll configure a ValueDistributionProfile:
// for this purpose we never use the "displayName" param for anything, so we just enter "valuedist" or whatever
IProfileDescriptor descriptor = new BasicProfileDescriptor("valuedist", ValueDistributionProfile.class);

ProfilerJobConfiguration jobConfiguration = new ProfilerJobConfiguration(descriptor);

// all properties are by convention placed as constants with a PROPERTY_ prefix in their profile class
jobConfiguration.addProfileProperty(ValueDistributionProfile.PROPERTY_TOP_N, "5");
jobConfiguration.addProfileProperty(ValueDistributionProfile.PROPERTY_BOTTOM_N, "5");
Also we need to select which columns to profile as part of our job configuration. DataCleaner uses MetaModel for its datastore connectivity, so we need to retrieve our column definitions using a MetaModel DataContext. I'll exemplify with typical MySQL database connection values, but there are a lot of other options in the DataContextSelection class:
DataContextSelection dcs = new DataContextSelection();
dcs.selectDatabase("jdbc:mysql://localhost/mydb", null, "username", "password", new TableType[] {TableType.VIEW, TableType.TABLE});
DataContext dc = dcs.getDataContext();
Table[] tables = dc.getDefaultSchema().getTables();

// I'll just add all columns from all tables!
List<Column> allColumns = new LinkedList<Column>();
for (Table table : tables) {
    allColumns.addAll(Arrays.asList(table.getColumns()));
}
// note: setter name recalled from memory - check the javadocs
jobConfiguration.setColumns(allColumns);

Finally, we add our job configuration to the executor:

// method name approximated - see the dk.eobjects.datacleaner.execution javadocs
executor.addJobConfiguration(jobConfiguration);
If we want to, we can add our own observers to receive notifications as the job progresses. For example, in the DataCleaner GUI we use an observer to update the on-screen progress indicators.
Another optional feature is to set execution options through an ExecutionConfiguration object. For example, we can configure our job to use multithreading by assigning more than one connection and/or by allowing more than one query to execute at a time (the example below has a max thread count of 2*5 = 10):
ExecutionConfiguration conf = new ExecutionConfiguration();
// setter names approximated - 2 connections * 5 queries per connection = 10 threads
conf.setMaxConnections(2);
conf.setMaxQueriesPerConnection(5);
And now it's time to kick off the executor! When we do this we provide our DataContextSelection object, which holds the connection information needed to spawn connections to the datastore:
// overload approximated - runs the job synchronously
executor.execute(dcs);
Alternatively you can start the execution asynchronously by calling:
executor.execute(dcs, false);
And now ... you're done. All you have to do now is investigate the results. You retrieve these by calling:
List<IProfileResult> results = executor.getResults();
Consider using one of the result exporters in the DataCleaner API (providing support for CSV, XML and HTML export) or use some custom code to retrieve just the metrics of your interest by traversing the IProfileResult model.

I hope this walkthrough has shed some light on the subject of invoking DataCleaner through its Java API. It's the first time I've sat down and tried to explain this part of the application, so I might have missed some points, but I think the major ideas are present. Let me know what you think - and suggestions for improving the API are always welcome.

A couple of notes to the use of DataCleaner's execution API:
  • Notice in the javadocs that almost all the classes covered in this blog post have a serialize() and a static deserialize(...) method. These are used for saving and loading the configuration to/from XML documents. So if you've already created your jobs using DataCleaner's GUI, you can save these jobs (as .dcp or .dcv files) and restore them using deserialize(...). That might be an easier and quicker path to solving your problems if you're not keen on setting up everything in code.
  • If you want a shortcut for setting up the ProfileDescriptors and ValidationRuleDescriptors, then take a look at DataCleaner's bundled XML files: datacleaner-config.xml, datacleaner-profiler-modules.xml and datacleaner-validator-modules.xml. These are Spring Framework-based files that DataCleaner currently uses as a convenient way to serve these descriptors. You should be able to load the objects easily using Spring, and then you'll have the descriptors set up automatically.


Watch out for manual flushing in JBoss Seam

I've done quite a lot of development in JBoss Seam over the last six months and overall I'm quite enthusiastic. I'm also looking forward to using some of the features of Seam in their new Java EE 6 incarnations (in short: @Inject instead of @In, @Produces instead of @Factory, @Unwrap and @Out, and @ConversationScoped instead of @Scope(CONVERSATION) ;-)).

One key feature of Seam is its persistence strategy, and at first glance it's quite a cool thing. The idea is to use an extended persistence context, which means that your entities are kept managed across transactions. The extended persistence context is very important, as Seam wraps each request in a transaction, and all changes to entities caused by actions in the request will then be automatically propagated to the database. The extended persistence context saves you from having to call merge(...) to reattach your entities all the time. Calling merge(...) is a heavy operation, so this is good.

This pattern makes a lot of sense right up until the point where you want to let the user edit an entity over a few requests but then discard the changes (because the user changes his/her mind). To make this use case possible, the Seam guys advocate "MANUAL flushing", which means that Hibernate won't flush updates to the database unless you programmatically tell it to. Seems smart - here's the idea: Hibernate will keep track of all changes made in transactions (requests) but won't flush them. At a certain point the user will typically hit a "Save changes" button, and then everything will be flushed.

Apart from the fact that MANUAL flushing is a Hibernate-specific feature not available with other JPA persistence providers, this pattern has three very serious flaws:

  1. Any query fired will cause an implicit flush - even if the flush mode is MANUAL. This means that if your conversation involves a query, your changes will be flushed even though you haven't invoked the flush method yourself. Again - this almost certainly rules out the possibility of using MANUAL flushing in just about any conversation I can imagine (especially if you want to enable navigation by nested conversations). Queries are a good example of something that used to be a side-effect-free operation but is now something that can impose a lot of unintended changes in state.
    NOTE: I stand (a bit) corrected here - I was advised that this behaviour can be avoided by setting the flush mode of the query to COMMIT, and it seems to work.
  2. While we're on the topic - if you want to enable nested conversations, you will have to write a lot of plumbing code to make sure that the nested parts don't invoke the flush method and end up flushing on behalf of the parent conversation as well. It IS possible to code your way around this flaw, but it's a serious impediment to composing your application out of reusable nested conversations.
  3. The Seam guys seem to have failed to realize that transactions are used for other purposes than saving entities. For example, if you're using JMS, you would send messages at commit time, which means that developers of the JMS dispatch code will assume that if a commit takes place, data has been persisted. If a message contains, for example, the IDs of updated entities, the message handler will access these entities before any updates have taken place, because the updates haven't been flushed!
I think that these flaws make it extremely hard to develop applications using MANUAL flushing because of the constraints it imposes on the flow of your application. In this light, I'm quite pleased that they didn't include manual flushing in Java EE 6 (or rather, JPA 2).


Query multiple datastores with MetaModel 1.2

I am currently packaging and distributing the new version of MetaModel - version 1.2. In this blog post I'll introduce what I think is the most exciting thing in this version: composite DataContexts, aka "query multiple datastores with a single query". Or in plain English: you can now treat multiple datastores as if they were one.

An example: imagine that you want to match a database table with the contents of an Excel spreadsheet. You can easily create a query that reads from both datastores and does all the joining, filtering, etc. that is possible with regular MetaModel queries.

DataContext database = DataContextFactory.createJdbcDataContext( myConnection );
DataContext spreadsheet = DataContextFactory.createExcelDataContext(new File("my_spreadsheet.xls"));

Table dbTable = database.getDefaultSchema().getTableByName("my_db_table");
Column dbPkColumn = dbTable.getColumnByName("my_primary_key");
Table excelTable = spreadsheet.getDefaultSchema().getTableByName("my_sheet");
Column excelFkColumn = excelTable.getColumnByName("my_foreign_key");

// now we create a composite DataContext which enables us
// to explore and query both DataContexts transparently
// through the same DataContext reference!

DataContext composite = DataContextFactory.createCompositeDataContext( database, spreadsheet );

// example query with cartesian product and cross-datastore where clause
Query q = new Query();
// from/select clauses sketched in - check the Query javadocs for exact signatures
q.from(dbTable).from(excelTable);
q.select(dbPkColumn, excelFkColumn);
q.where(dbPkColumn, OperatorType.EQUALS_TO, excelFkColumn);

DataSet ds = composite.executeQuery(q);
// read the result

... How cool is that?

Of course, if a query posted to the composite DataContext spans multiple underlying DataContexts, it will most likely result in a case of "client side joining", which will not perform well compared to co-locating the datastores. But often that is not possible (or practical, if it's just a case of ad-hoc analysis), so I believe that the new composite DataContext feature can add some real value to a lot of projects!
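To illustrate what "client side joining" means, here's a toy sketch - plain Java, not MetaModel's actual implementation - of the kind of work a composite DataContext has to do when a query spans two datastores:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ClientSideJoin {
    // Toy illustration: fetch both row sets from their respective
    // datastores, then join them in memory with a nested loop (O(n*m)).
    static List<String[]> join(List<String[]> left, List<String[]> right,
                               int leftKey, int rightKey) {
        List<String[]> result = new ArrayList<String[]>();
        for (String[] l : left) {
            for (String[] r : right) {
                if (l[leftKey].equals(r[rightKey])) {
                    // concatenate the matching rows
                    String[] row = new String[l.length + r.length];
                    System.arraycopy(l, 0, row, 0, l.length);
                    System.arraycopy(r, 0, row, l.length, r.length);
                    result.add(row);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // rows from a database table: (pk, name)
        List<String[]> db = Arrays.asList(
                new String[] {"1", "Alice"}, new String[] {"2", "Bob"});
        // rows from a spreadsheet: (fk, amount)
        List<String[]> xls = Arrays.asList(
                new String[] {"2", "100"}, new String[] {"3", "50"});
        List<String[]> joined = join(db, xls, 0, 0);
        System.out.println(joined.size()); // prints 1 (only key "2" matches)
    }
}
```

This is exactly why co-locating the datastores is faster when you can: a database can use indexes and hash joins, while the client-side fallback has to pull all candidate rows over the wire first.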

Other notable news in MetaModel 1.2: we now support MS Access databases and dBase (.dbf) database files.