20101229

YADC2S

Yet Another DataCleaner 2.0 Screenshot :) This is basically just an addition to my previous post about richer reporting and charts in DataCleaner 2.0.


What you see is our new Date Gap Analyzer, which can be used to plot a timeline based on FROM and TO dates in a dataset. The analyzer will display gaps in the timeline as well as overlaps (periods where more than one record exists). This should be pretty useful for finding errors in datasets that contain continuous activities.

The chart is zoomable and scrollable, so it can display quite a lot of data without harming the visual appearance.
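
To make the gap and overlap idea a bit more concrete, here is a minimal sketch in Java. It is not the actual Date Gap Analyzer code; the class and method names are made up for illustration, and I am assuming that each record is a half-open period (FROM inclusive, TO exclusive).

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: these names are hypothetical, not DataCleaner's API.
class DateGapSketch {

    /** A half-open period [from, to): 'from' is included, 'to' is not. */
    static final class Period {
        final LocalDate from;
        final LocalDate to;

        Period(LocalDate from, LocalDate to) {
            this.from = from;
            this.to = to;
        }

        @Override
        public String toString() {
            return from + ".." + to;
        }
    }

    private static List<Period> sortedByFrom(List<Period> records) {
        List<Period> sorted = new ArrayList<>(records);
        sorted.sort((a, b) -> a.from.compareTo(b.from));
        return sorted;
    }

    /** Periods between the first FROM and the last TO that no record covers. */
    static List<Period> findGaps(List<Period> records) {
        List<Period> gaps = new ArrayList<>();
        LocalDate coveredUntil = null;
        for (Period p : sortedByFrom(records)) {
            if (coveredUntil != null && p.from.isAfter(coveredUntil)) {
                gaps.add(new Period(coveredUntil, p.from));
            }
            if (coveredUntil == null || p.to.isAfter(coveredUntil)) {
                coveredUntil = p.to;
            }
        }
        return gaps;
    }

    /**
     * Periods where a record starts before the records seen so far have ended.
     * Overlaps are reported per record and are not merged with each other.
     */
    static List<Period> findOverlaps(List<Period> records) {
        List<Period> overlaps = new ArrayList<>();
        LocalDate coveredUntil = null;
        for (Period p : sortedByFrom(records)) {
            if (coveredUntil != null && p.from.isBefore(coveredUntil)) {
                LocalDate end = p.to.isBefore(coveredUntil) ? p.to : coveredUntil;
                overlaps.add(new Period(p.from, end));
            }
            if (coveredUntil == null || p.to.isAfter(coveredUntil)) {
                coveredUntil = p.to;
            }
        }
        return overlaps;
    }
}
```

Given three records covering January 1 to February 1, January 20 to March 1 and April 1 to May 1, this sketch would report one overlap (January 20 to February 1) and one gap (March 1 to April 1).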

20101227

Match! Boardgame about the heuristics in data matching(?)

This morning I was enjoying a bit of Good Clean Family Christmas TV You Can Trust and one of the subjects covered was a new Danish board game that you could spend your Christmas vacation playing. It's called Match! and here I will try to outline the rules as I understand them:

  • Each player has 3 game cards in their hand, each with a picture of something.
  • A picture is shown to all players and they now have to match that picture with one of the pictures in their hand.
  • In the example shown on the TV there was a picture of some sausages on a grill. The example matches of the three players were:
    • Danish politician Pia Kjærsgaard - both she and the sausages represent something very Danish and something they'd like to put on a grill!
    • A used roll of toilet paper - related to a different kind of "sausage"!
    • A crowd at a music festival - a place where you'd love to eat a grilled sausage.


The matches themselves were not the best I've seen, but they do point out an important feature of a good matching engine: using simple similarity checks is not enough. You need to understand not only the spelling, phonetics etc. of the things you are trying to match, but also the semantics. Of course a good example of this in Denmark is our country's biggest company: "Maersk", which can rather easily be matched with "Mærsk", but it's more difficult to get the synonym "A.P. Møller" into the matching rules unless you hardcode it somehow. And if matching goes beyond just names, other associative matching rules might apply.
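
Here is a minimal Java sketch of the Maersk example, just to illustrate the difference between the two kinds of rules. It is an assumption of mine about how one could do it, not how any particular matching engine works: the spelling variant is handled by a general normalization step, while the synonym has to live in a hand-maintained (hardcoded) table. All names in the code are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only: a normalization step plus a hardcoded synonym table.
class SynonymMatchSketch {

    // Hand-maintained synonyms: normalized alias -> normalized canonical name.
    static final Map<String, String> SYNONYMS = new HashMap<>();
    static {
        SYNONYMS.put("ap moeller", "maersk");
    }

    /** Lower-cases, transliterates the Danish letters and strips punctuation. */
    static String normalize(String name) {
        String s = name.toLowerCase(Locale.ROOT)
                .replace("\u00e6", "ae")   // ae ligature
                .replace("\u00f8", "oe")   // o with stroke
                .replace("\u00e5", "aa");  // a with ring
        return s.replaceAll("[^a-z0-9\\s]", "")
                .replaceAll("\\s+", " ")
                .trim();
    }

    /** Normalizes the name and then resolves it through the synonym table. */
    static String canonical(String name) {
        String normalized = normalize(name);
        String mapped = SYNONYMS.get(normalized);
        return mapped != null ? mapped : normalized;
    }

    static boolean matches(String a, String b) {
        return canonical(a).equals(canonical(b));
    }
}
```

With this sketch, "Maersk" and "Mærsk" match because of the normalization alone, whereas "A.P. Møller" only matches because somebody put it in the table. That is exactly the part that does not scale without some understanding of semantics.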

Well... Can't wait to play Match! It sounds like a fun game and it will definitely be in the back of my head to try and record some of the interesting heuristics applied there.

20101214

Richer reporting and charts in DataCleaner 2

One of the important new features of DataCleaner 2 will be a much richer reporting module than the old one. In DataCleaner 2 the result of an analysis is not limited to the crosstabular view that a lot of you know from DataCleaner 1.x. In this blog post I will provide you with a preview of some of the exciting reports that have been added lately.

Charts in Value distribution
The Value distribution component is well-known to most DataCleaner users. It provides a simple but crucial look into the distribution of values for a column. In DataCleaner 2.0 we are enhancing the experience of working with the Value Distribution by applying visually pleasant charts as well as grouping of values with similar frequencies. Take a look at this example result on a country-column:


Now you might think: "Looks nice, but that's going to be messy for columns with very oddly distributed values". And you're right. Except that we have applied a rather intelligent grouping mechanism that makes sure we never exceed a certain number of slices in a chart. To accomplish this we may need to group some values together by their frequencies, which communicates another important fact: when repeated values occur, how many times they occur. Take a look at this next example of the value distribution of a customer number column:


As you can see, even though there is a very large number of customer numbers, we are grouping them together by frequency. This is a principle that is actually already known from the <unique> group, except that we now also apply it to further frequencies: <group=2>, <group=3> etc.

Notice also the green arrows in the table to the right. Using these buttons (or by clicking the slices of the pie chart) you will be able to drill down to the details and view the actual values that make up a given group.
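
The grouping by frequency described above can be sketched in a few lines of Java. This is a simplified illustration and not the actual Value distribution code: it always groups every value by its frequency, whereas the real component only groups as much as needed to stay below the slice limit. The class and method names are hypothetical; only the <unique> and <group=N> labels come from the tool.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: groups distinct values by how often they occur.
class ValueDistributionSketch {

    /** Counts the occurrences of each value, in order of first appearance. */
    static Map<String, Integer> countValues(List<String> values) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String value : values) {
            counts.merge(value, 1, Integer::sum);
        }
        return counts;
    }

    /** Maps a group label (<unique>, <group=2>, ...) to the values in that group. */
    static Map<String, List<String>> groupByFrequency(List<String> values) {
        // TreeMap keeps the frequencies in ascending order.
        Map<Integer, List<String>> byFrequency = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : countValues(values).entrySet()) {
            byFrequency.computeIfAbsent(entry.getValue(), k -> new ArrayList<>())
                    .add(entry.getKey());
        }
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (Map.Entry<Integer, List<String>> entry : byFrequency.entrySet()) {
            int frequency = entry.getKey();
            String label = frequency == 1 ? "<unique>" : "<group=" + frequency + ">";
            groups.put(label, entry.getValue());
        }
        return groups;
    }
}
```

The lists kept per group are what a drill-to-detail view would show: the label becomes the slice in the chart, and the list holds the actual values behind it.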

Navigation tree in Phonetic similarity finder

Another application of richer reporting in DataCleaner is the new Phonetic similarity finder. In short, this analyzer will apply a mix of well-known algorithms for similarity checking, such as Soundex, Metaphone and Levenshtein distance, to produce a set of groups of similar-sounding values. What you get is a tree of groups from where you can see the rows that are similar or maybe even identical:


The big news here is of course that this kind of result would be practically impossible to display in a crosstabular result of DataCleaner 1.x - which is also why DataCleaner 1.x doesn't have this feature. I hope that my message with this is clear: DataCleaner 2 will not only be a substantial improvement to the existing data profiling tool, but it will also open up a lot of new doors for more interactive (and interesting) analyses.
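
Of the algorithms mentioned above, Levenshtein distance is the easiest one to show in a few lines. The following Java sketch is a textbook implementation and not the code used by the Phonetic similarity finder; the class name, the `similar` helper and its threshold parameter are my own assumptions for illustration.

```java
// Illustrative sketch only: textbook Levenshtein (edit) distance.
class LevenshteinSketch {

    /** Minimum number of single-character inserts, deletes and substitutions. */
    static int distance(String a, String b) {
        // Only two rows of the dynamic programming table are kept in memory.
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(current[j - 1], previous[j]) + 1;
                current[j] = Math.min(insertOrDelete, previous[j - 1] + cost);
            }
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }

    /** Hypothetical helper: treats two values as similar within a distance limit. */
    static boolean similar(String a, String b, int maxDistance) {
        return distance(a, b) <= maxDistance;
    }
}
```

A phonetic algorithm such as Soundex or Metaphone would typically be used to find candidate groups first, with an edit distance like this one used to judge how close the values within a group really are.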

Pluggability

The last thing that I would like to point out in this blog entry is the fact that the rendering mechanism in DataCleaner 2.0 is pluggable. This means that you can very easily, using modular Java code, enhance the existing result renderers or implement your own, and simply plug it into the application. Just remember to contribute it back to the community :)
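
To give an idea of what "pluggable" means here, below is a generic Java sketch of the pattern. Please note that none of these types exist in DataCleaner; the interface, the registry and the example renderer are all hypothetical and only illustrate how a renderer can be plugged in without touching the code that asks for a rendering.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only: hypothetical types, not DataCleaner's renderer API.
class RendererSketch {

    /** A pluggable renderer that turns some kind of result into text. */
    interface Renderer {
        boolean accepts(Object result);

        String render(Object result);
    }

    /** Holds the registered renderers and picks the first one that accepts a result. */
    static class Registry {
        private final List<Renderer> renderers = new ArrayList<>();

        /** Later registrations take precedence over earlier ones. */
        void register(Renderer renderer) {
            renderers.add(0, renderer);
        }

        String render(Object result) {
            for (Renderer renderer : renderers) {
                if (renderer.accepts(result)) {
                    return renderer.render(result);
                }
            }
            // Fallback when no plugged-in renderer accepts the result.
            return String.valueOf(result);
        }
    }

    /** Example plug-in: renders text results in upper case. */
    static class UpperCaseTextRenderer implements Renderer {
        @Override
        public boolean accepts(Object result) {
            return result instanceof String;
        }

        @Override
        public String render(Object result) {
            return ((String) result).toUpperCase(Locale.ROOT);
        }
    }
}
```

The point of the design is that the registry never needs to know which renderers exist: registering a new one is enough to change how a given type of result is displayed.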