One of the important new features of DataCleaner 2 will be a much richer reporting module than the old one. In DataCleaner 2 the result of an analysis is not limited to the crosstabular view that a lot of you know from DataCleaner 1.x. In this blog post I will provide you with a preview of some of the exciting reports that have been added lately.
Charts in Value distribution
The Value distribution component is well-known to most DataCleaner users. It provides a simple but crucial look into the distribution of values for a column. In DataCleaner 2.0 we are enhancing the experience of working with the Value distribution by applying visually pleasant charts as well as grouping of values with similar frequencies. Take a look at this example result on a country column:
Now you might think: "Looks nice, but that's going to be messy for columns with very oddly distributed values". And you're right. Except that we have applied a rather intelligent grouping mechanism that makes sure we never go above a certain number of slices in a chart. To accomplish this we may need to group together some values by their frequencies, which communicates another important fact:
When repeated values occur, how many times do they occur? Take a look at this next example of the value distribution of a customer number column:
As you can see, even though there is a very large number of customer numbers, we group them together by frequency. This principle is already known from the <unique> group, except that we now also apply it to further frequencies: <group=2>, <group=3> etc.
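To make the grouping idea a bit more concrete, here is a minimal sketch in Java. It is not DataCleaner's actual implementation: the class name `ValueDistributionGrouper`, the method `slices` and the parameter `maxIndividualSlices` are made up for illustration, and the choice to size a group slice by the number of rows it covers is my own assumption. Note also that this sketch only caps the number of individually shown values, which is simpler than a true cap on the total slice count.

```java
import java.util.*;

// Illustrative sketch only; names and rules are assumptions, not DataCleaner's API.
class ValueDistributionGrouper {

    // Returns chart slices (label -> number of rows) in display order.
    static Map<String, Integer> slices(List<String> values, int maxIndividualSlices) {
        // Count how often each distinct value occurs.
        Map<String, Integer> counts = new HashMap<>();
        for (String value : values) {
            counts.merge(value, 1, Integer::sum);
        }

        // Most frequent values first; ties are broken alphabetically.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        Map<String, Integer> slices = new LinkedHashMap<>();
        // frequency -> number of distinct values sharing that frequency
        TreeMap<Integer, Integer> grouped = new TreeMap<>(Comparator.reverseOrder());
        int individual = 0;
        for (Map.Entry<String, Integer> entry : entries) {
            if (entry.getValue() > 1 && individual < maxIndividualSlices) {
                // Frequent enough and still room: the value gets its own slice.
                slices.put(entry.getKey(), entry.getValue());
                individual++;
            } else {
                // Otherwise the value is folded into the group for its frequency.
                grouped.merge(entry.getValue(), 1, Integer::sum);
            }
        }

        for (Map.Entry<Integer, Integer> group : grouped.entrySet()) {
            int frequency = group.getKey();
            String label = frequency == 1 ? "<unique>" : "<group=" + frequency + ">";
            // Slice size = rows covered by the group (frequency times distinct values).
            slices.put(label, frequency * group.getValue());
        }
        return slices;
    }
}
```

With the nine values DK, DK, DK, US, US, NL, NL, DE, FR and `maxIndividualSlices` set to 1, this yields three slices: DK with 3 rows, <group=2> with 4 rows and <unique> with 2 rows.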
Notice also the green arrows in the table to the right. Using these buttons (or by clicking the slices of the pie chart) you can drill to detail and view the actual values that make up a given group.
Navigation tree in Phonetic similarity finder
Another application of richer reporting in DataCleaner is for the new Phonetic similarity finder. In short, this analyzer applies a mix of well-known similarity-checking algorithms such as Soundex, Metaphone and Levenshtein distance to produce a set of groups of similar-sounding values. What you get is a tree of groups in which you can see the rows that are similar or maybe even identical:
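For readers who have not met these algorithms before, here is a small Java sketch of two of them. How the analyzer actually combines the algorithms is not covered in this post, so the combination below is purely an assumption: values are grouped by their Soundex code, and Levenshtein distance is available to judge how close two members of a group are. Metaphone is left out to keep it short, and the class name `PhoneticGroups` is made up for the example.

```java
import java.util.*;

// Illustrative sketch only; not the analyzer's actual implementation.
class PhoneticGroups {

    // Soundex digit for each letter A..Z ('0' means the letter is not coded).
    private static final String CODES = "01230120022455012623010202";

    // Classic four-character American Soundex code, e.g. "Robert" -> "R163".
    static String soundex(String s) {
        StringBuilder out = new StringBuilder();
        char last = '0';
        for (char c : s.toUpperCase(Locale.ROOT).toCharArray()) {
            if (c < 'A' || c > 'Z') continue;            // ignore non-letters
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c);                           // the first letter is kept as-is
                last = code;
            } else {
                if (c == 'H' || c == 'W') continue;      // H and W do not separate equal codes
                if (code != '0' && code != last) out.append(code);
                last = code;                             // vowels reset the previous code
            }
            if (out.length() == 4) break;
        }
        if (out.length() == 0) return "";
        while (out.length() < 4) out.append('0');        // pad short codes with zeros
        return out.toString();
    }

    // Edit distance: minimum number of single-character inserts, deletes and substitutions.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Groups values by Soundex code and keeps only groups with at least two members.
    static Map<String, List<String>> group(List<String> values) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String value : values) {
            groups.computeIfAbsent(soundex(value), k -> new ArrayList<>()).add(value);
        }
        groups.values().removeIf(members -> members.size() < 2);
        return groups;
    }
}
```

Calling `PhoneticGroups.group` on Robert, Rupert, Smith, Smyth and Jones gives two groups, one for R163 and one for S530, while Jones ends up alone and is dropped. That is the same tree-of-groups shape described above, just without the user interface.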
The big news here is of course that this kind of result would be practically impossible to display in a crosstabular result of DataCleaner 1.x - which is also why DataCleaner 1.x doesn't have this feature. I hope that my message with this is clear: DataCleaner 2 will not only be a substantial improvement to the existing data profiling tool, but it will also open up a lot of new doors for more interactive (and interesting) analyses.
Pluggability
The last thing that I would like to point out in this blog entry is the fact that the rendering mechanism in DataCleaner 2.0 is pluggable. This means that you can very easily, using modular Java code, enhance the existing result renderers or implement your own, and simply plug it into the application. Just remember to contribute it back to the community :)
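To give a feel for what "pluggable" means in practice, here is a tiny Java sketch of the general pattern. Everything in it is hypothetical: `ResultRenderer`, `FrequencyTextRenderer` and `RendererRegistry` are names invented for this example and should not be mistaken for the real DataCleaner 2.0 interfaces.

```java
import java.util.*;

// Hypothetical extension point; NOT DataCleaner's actual renderer API.
interface ResultRenderer<R> {
    boolean canRender(Object result);   // can this renderer handle the given result?
    String render(R result);            // produce the rendered output
}

// Example plug-in: renders a label -> count map as plain text lines.
class FrequencyTextRenderer implements ResultRenderer<Map<String, Integer>> {
    public boolean canRender(Object result) {
        return result instanceof Map;
    }

    public String render(Map<String, Integer> result) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : result.entrySet()) {
            sb.append(e.getKey()).append(": ").append(e.getValue()).append('\n');
        }
        return sb.toString();
    }
}

// Hypothetical registry that the application would consult when displaying a result.
class RendererRegistry {
    private final List<ResultRenderer<?>> renderers = new ArrayList<>();

    void register(ResultRenderer<?> renderer) {
        renderers.add(renderer);
    }

    @SuppressWarnings("unchecked")
    String render(Object result) {
        // Later registrations win, so a plug-in can replace a built-in renderer.
        for (int i = renderers.size() - 1; i >= 0; i--) {
            ResultRenderer<?> candidate = renderers.get(i);
            if (candidate.canRender(result)) {
                return ((ResultRenderer<Object>) candidate).render(result);
            }
        }
        throw new IllegalArgumentException("No renderer for " + result.getClass());
    }
}
```

The point of the pattern is that the application only talks to the registry, so a new renderer can be added without touching the code that produces the result.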