20110217

Pre-processing in DataCleaner 2: Why?

Last Monday we released the new DataCleaner 2.0, and one of its major new features is the ability to transform and filter your data within the tool. Previously, the common answer to anyone asking about transformation in DataCleaner was that "there are already good tools for this available elsewhere, so use them". So why did we choose to focus on data processing in DataCleaner 2.0? What are DataCleaner's capabilities in terms of transformations? And is DataCleaner going to evolve into a full-fledged ETL tool over time? I'll try to answer these questions.

All your data quality functions in one place

The obvious answer is that we want to provide more data quality functionality, and transformations are something a lot of people need. While doing data profiling you often need to make data adjustments, e.g. to tokenize values, extract certain information, filter which rows get profiled, etc. You can also do all of this by applying an ETL tool or maybe even by creating database VIEWs. The problem with such an approach is that it will eventually get in your way, because you're trying to get 2-3 independent tools to work nicely together instead of just having these functions available where you need them.

Transformations for the DQ domain

Possibly even more important is that the transformations you want to employ in data quality analysis are typically quite different from those that come out of the box in database scripts and ETL tools. Such tools are typically quite generic and will provide general-purpose tokenizers and the like, but they usually lack transformations pertaining to the DQ domain, such as date mask matchers, dictionary lookups and synonym replacements, and standardization of email addresses, names and URLs.
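
To make the distinction a bit more concrete, here is a minimal sketch of what a date mask matcher boils down to. This is an illustration only, with class and method names of my own invention (not DataCleaner's actual implementation), assuming plain Java and SimpleDateFormat:

    import java.text.ParsePosition;
    import java.text.SimpleDateFormat;
    import java.util.Arrays;
    import java.util.Date;
    import java.util.List;

    // A minimal, hypothetical date mask matcher (illustration only, not
    // DataCleaner's actual implementation): for each value it reports which
    // of a set of candidate date masks the value conforms to.
    public class DateMaskMatcher {

        private final List<String> masks;

        public DateMaskMatcher(String... masks) {
            this.masks = Arrays.asList(masks);
        }

        public boolean[] match(String value) {
            boolean[] result = new boolean[masks.size()];
            for (int i = 0; i < masks.size(); i++) {
                SimpleDateFormat format = new SimpleDateFormat(masks.get(i));
                format.setLenient(false); // only valid calendar dates count
                ParsePosition pos = new ParsePosition(0);
                Date date = format.parse(value, pos);
                // a mask matches only if parsing succeeded and consumed the whole value
                result[i] = date != null && pos.getIndex() == value.length();
            }
            return result;
        }

        public static void main(String[] args) {
            DateMaskMatcher matcher = new DateMaskMatcher("yyyy-MM-dd", "dd.MM.yyyy");
            System.out.println(Arrays.toString(matcher.match("2011-02-17")));
        }
    }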

Non-persistent transformations

When you do pre-processing in separate tools, you also need to persist your transformed datasets. In data quality analysis this is simply a waste of resources and leads to poor performance. If you need to perform transformations, apply filtering, etc. for the purpose of analyzing, profiling and exploring your data, it is much more practical to perform these transformations when needed instead of storing the results up front. This also allows for a much freer user experience, where you can actually experiment with your data and your analysis instead of having to overthink it.

Is DataCleaner's validator gone? No, it's just filtering with added output handling!

DataCleaner 1.x was known for its strict separation between the activities "profile" and "validate". Since this separation is gone in 2.0, one might ask: "Is DataCleaner's validator gone? I can only see analysis jobs!" The answer is no; we simply consider validation a type of analysis (analysis being the broader term, comprising profiling, validation and more). You can easily perform all the validation operations of DataCleaner 1.x, but the approach is a bit different, because you basically apply filters instead of "validators". There is even an example of this in the "Introduction to analyzing, cleansing and filtering data" webcast, available on DataCleaner's website.
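
To give a rough idea of the filtering approach, here is a hypothetical sketch (the interface and names below are my own, not DataCleaner's actual API): a "validator" essentially becomes a filter that assigns each row to an outcome, and you then decide what to do with each outcome.

    // Rough illustration of validation-as-filtering (hypothetical API, not
    // DataCleaner's actual interfaces): each value is categorized as VALID or
    // INVALID, and downstream components subscribe to one of the outcomes.
    enum ValidationOutcome { VALID, INVALID }

    interface RowFilter {
        ValidationOutcome categorize(String value);
    }

    class NotNullFilter implements RowFilter {
        @Override
        public ValidationOutcome categorize(String value) {
            return (value == null || value.trim().isEmpty())
                    ? ValidationOutcome.INVALID
                    : ValidationOutcome.VALID;
        }
    }

Rows in the INVALID outcome could, for example, be written to a file for later correction while the VALID rows flow on to profiling; that is roughly the "added output handling" the heading above hints at.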

20110214

DataCleaner 2 released, Human Inference acquires eobjects.org, I have a new job and dataqualitypro.com even publishes a nice article about it

Earlier I promised you some "big news" today, and here it is... Hmm, where to start, there is so much to say.


OK, let's start with the software. After all, that's what I think most of my blog readers care about most:


DataCleaner 2.0 was released!

To me this is the biggest news for the DataCleaner community in a LONG time. DataCleaner 2.0 is a major release that I and my new employer (read more below) have put a lot of effort into. I just had a look at some source code statistics, and the 2.0 release is actually larger (in terms of lines of code, source code commits, contributions, etc.) than all previous DataCleaner releases together. I don't want to say a lot about the new functionality here, because it is all presented quite well on the DataCleaner website.

More information: Watch out, dirty data! DataCleaner 2.0 is in town!


MetaModel 1.5 was released!

My other lovechild, MetaModel, has also just been released in version 1.5! MetaModel 1.5 is quite a big improvement on the previous 1.2 version as well. The release contains a lot of exciting new features for querying and datastore exploration, as well as a lot of maturity bugfixes.

More information: What's new in MetaModel 1.5?


And then let's move on to a major announcement that I definitely think will affect the eobjects.org community positively:


Human Inference acquires eobjects.org

This might come as a surprise to quite a lot of you, so let me explain a bit. For some years DataCleaner and the other eobjects.org projects have been independent open source projects that I've invested a lot of time in. The projects have grown nicely in terms of users, and the ideas have been manifold. My ambitions for the projects have always been high, but they suffered from the fact that I was mostly working on them in my free time. One of the many fun things about doing these projects is that I've gotten to meet a lot of exciting people who thought my projects were interesting. At one point I met some people from the data quality vendor Human Inference, who thought DataCleaner was great and wanted to know if they could somehow use it in collaboration with their commercial offerings. From my end of the table, I was thinking that their products offered some features that would be an excellent addition to DataCleaner's functionality. So we made a deal to try to raise the value for both parties. And with that in mind, here's the press release about it:

More information: Human Inference completes acquisition of DataCleaner and eobjects.org

I have a new job

I now work with Human Inference to actively grow the DataCleaner project and MetaModel, as well as Human Inference's commercial products. We're building really exciting cloud-based data quality services that I think will complement the open source offering nicely. Of course it's not all going to be free, but I promise that even for users who don't want to pay for the additional services, the acquisition and my new job will be beneficial, because we're adding a lot of new resources to the projects, improving both the open source parts and the commercial plug-ins.



And in the end I also want to mention that Data Quality Pro has a great article about a lot of this news, including an interview with me and Sabine Palinckx, the CEO of Human Inference.

More information: Open Source DataCleaner gets a major update, Human Inference enters the Open Source Data Quality Market

20110204

Big news ...


20110118

It’s very easy to make your own data profiling logic – but wrong!



In the world of data quality I often see that people tend to think that a profiling application isn't that important, because they can easily analyze their data using SQL and similar methods already. This attitude can actually make it quite difficult to convince users to try out a tool built for the job.

Let me tell you about a situation I have been in quite a few times. If I am doing a quick demonstration of my favourite data profiling application, I will begin with something like a value distribution chart. The respondent watching the demo will then reply:

"... But I could easily do this with just a simple GROUP BY query."

And my answer would be YES, you can.

So then I will proceed to demonstrate various other metrics such as null counts, empty string counts, maximum/minimum lengths, etc. An SQL-savvy respondent might reply:

"... Sure, but I can also use AVG, WHERE x IS NULL, length functions etc. in SQL".

And my answer would be YES, you can.

If at this point I dive into the discussion on these terms, then I will demonstrate hard-to-query measures such as diacritic counts, word counts, non-letter characters and case distribution. But ultimately I will myself have misunderstood the point of using a data profiling application, because the point isn't (so much) that it contains measures that cannot be expressed as queries.
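
To illustrate why such measures are awkward in plain SQL, here is a small, self-contained Java sketch of three of them (diacritic count, word count and case distribution). It is only an illustration of the idea, not DataCleaner's implementation:

    import java.text.Normalizer;

    // A small illustration (not DataCleaner's implementation) of measures that
    // are awkward to express in plain SQL: diacritic count, word count and
    // upper/lower case distribution for a single value.
    public class HardToQueryMeasures {

        public static int diacriticCount(String value) {
            // Decompose characters so that diacritics become separate combining marks
            String decomposed = Normalizer.normalize(value, Normalizer.Form.NFD);
            int count = 0;
            for (char c : decomposed.toCharArray()) {
                if (Character.getType(c) == Character.NON_SPACING_MARK) {
                    count++;
                }
            }
            return count;
        }

        public static int wordCount(String value) {
            String trimmed = value.trim();
            return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        }

        public static int[] caseDistribution(String value) {
            int upper = 0, lower = 0;
            for (char c : value.toCharArray()) {
                if (Character.isUpperCase(c)) upper++;
                else if (Character.isLowerCase(c)) lower++;
            }
            return new int[] { upper, lower };
        }

        public static void main(String[] args) {
            String value = "Café au lait";
            System.out.println(diacriticCount(value)); // 1 (the accent on 'é')
            System.out.println(wordCount(value));      // 3
            int[] cases = caseDistribution(value);
            System.out.println(cases[0] + " upper, " + cases[1] + " lower"); // 1 upper, 9 lower
        }
    }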

The point is that the measures and their impact on your data's profile are not always known to you in advance. If you know just what you're looking for, then you'll find it and nothing more.

So while it’s quite easy (or at least possible) to design your own profiling plan using SQL and various other querying extensions, it’s not going to help you much in terms of profiling, because profiling should be an exploratory, analytical and interactive discipline.

To scare you off, I think I’ll try creating a few SQL queries that could compensate for most of the functionality in DataCleaner. I imagine they’ll be quite amusing, so hold on for an update...

20101229

YADC2S

Yet Another DataCleaner 2.0 Screenshot :) This is basically just an addition to my previous post about richer reporting and charts in DataCleaner 2.0.


What you see is our new Date Gap Analyzer, which can be used to plot a timeline based on FROM and TO dates in a dataset. The analyzer will display gaps in the timeline as well as overlaps (periods where more than one record exists). This should be pretty useful for finding errors in datasets that contain continuous activities.
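
To illustrate the underlying idea (this is not the analyzer's actual code, just a sketch with made-up names), detecting gaps and overlaps essentially boils down to sweeping the FROM/TO intervals in sorted order and comparing each interval with the latest TO value seen so far:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    // Illustration only (not the actual Date Gap Analyzer code): detect gaps
    // and overlaps in a list of FROM/TO intervals by sweeping them in sorted
    // order. Timestamps are plain longs to keep the sketch short.
    public class DateGapSketch {

        static class Interval {
            final long from;
            final long to;
            Interval(long from, long to) { this.from = from; this.to = to; }
        }

        public static void analyze(List<Interval> intervals) {
            List<Interval> sorted = new ArrayList<>(intervals);
            sorted.sort(Comparator.comparingLong(i -> i.from));

            long latestTo = Long.MIN_VALUE;
            for (Interval interval : sorted) {
                if (latestTo != Long.MIN_VALUE) {
                    if (interval.from > latestTo) {
                        System.out.println("Gap from " + latestTo + " to " + interval.from);
                    } else if (interval.from < latestTo) {
                        System.out.println("Overlap from " + interval.from + " to "
                                + Math.min(latestTo, interval.to));
                    }
                }
                latestTo = Math.max(latestTo, interval.to);
            }
        }
    }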

The chart is zoomable and scrollable, so it is able to display quite a lot of data without harming the visual appearance.

20101227

Match! Boardgame about the heuristics in data matching(?)

This morning I was enjoying a bit of Good Clean Family Christmas TV You Can Trust and one of the subjects covered was a new Danish board game that you could spend your Christmas vacation playing. It's called Match! and here I will try to outline the rules as I understand them:

  • Each player has 3 game cards in their hand, each with a picture of something on it.
  • A picture is shown to all players, and they now have to match that picture with one of the pictures in their hand.
  • In the example shown on TV there was a picture of some sausages on a grill. The example matches of the three players were:
    • Danish politician Pia Kjærsgaard - both she and the sausages represent something very Danish and something they'd like to put on a grill!
    • A used roll of toilet paper - related to a different kind of "sausage"!
    • A crowd at a musical festival - a place where you'd love to eat a grilled sausage.


The matches themselves were not the best I've seen, but they do point out an important feature of a good matching engine: using simple similarity checks is not enough. You need to understand not only the spelling, phonetics, etc. of the things you are trying to match, but also the semantics. A good example of this in Denmark is our country's biggest company, "Maersk", which can rather easily be matched with "Mærsk", but it's more difficult to get the synonym "A.P. Møller" into the matching rules, except by hardcoding it somehow. And when matching goes beyond just names, other associative matching rules may apply.
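
To sketch what that could look like in code (a toy example, not a real matching engine), one can combine a cheap similarity measure with a synonym/alias lookup, so that "Mærsk", "Maersk" and "A.P. Møller" all resolve to the same canonical name. The synonym table and the distance threshold here are made up for illustration:

    import java.util.HashMap;
    import java.util.Map;

    // Toy illustration of semantic matching: a synonym/alias lookup combined
    // with a simple edit-distance check. Not a real matching engine.
    public class SemanticMatcherSketch {

        private final Map<String, String> synonyms = new HashMap<>();

        public SemanticMatcherSketch() {
            // hand-maintained (or dictionary-based) aliases
            synonyms.put("a.p. møller", "maersk");
            synonyms.put("mærsk", "maersk");
        }

        public boolean matches(String a, String b) {
            String ca = canonicalize(a);
            String cb = canonicalize(b);
            return ca.equals(cb) || levenshtein(ca, cb) <= 1;
        }

        private String canonicalize(String value) {
            String key = value.trim().toLowerCase();
            return synonyms.containsKey(key) ? synonyms.get(key) : key;
        }

        private static int levenshtein(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;
            for (int i = 1; i <= a.length(); i++) {
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                            d[i - 1][j - 1] + cost);
                }
            }
            return d[a.length()][b.length()];
        }

        public static void main(String[] args) {
            SemanticMatcherSketch matcher = new SemanticMatcherSketch();
            System.out.println(matcher.matches("Maersk", "A.P. Møller")); // true, via the alias table
        }
    }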

Well... I can't wait to play Match! It sounds like a fun game, and it will definitely be in the back of my head to try to record some of the interesting heuristics applied there.

20101214

Richer reporting and charts in DataCleaner 2

One of the important new features of DataCleaner 2 will be a much richer reporting module than the old one. In DataCleaner 2 the result of an analysis is not limited to the crosstabular view that a lot of you know from DataCleaner 1.x. In this blog post I will provide you with a preview of some of the exciting reports that have been added lately.

Charts in Value distribution
The Value distribution component is well known to most DataCleaner users. It provides a simple but crucial look into the distribution of values in a column. In DataCleaner 2.0 we are enhancing the experience of working with the Value distribution by applying visually pleasing charts as well as grouping of values with similar frequencies. Take a look at this example result on a country column:


Now you might think: "Looks nice, but that's going to be messy for columns with very oddly distributed values." And you're right, except that we have applied a rather intelligent grouping mechanism that makes sure we never go above a certain number of slices in a chart. To accomplish this we may need to group some values together by their frequencies, which also communicates another important fact: when values are repeated, how many times they occur. Take a look at this next example of the value distribution of a customer number column:


As you can see, even though there is a very large number of customer numbers, we group them together by frequency. This is a principle already known from the <unique> group, except that we now also apply it to further frequencies: <group=2>, <group=3> etc.
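
As a back-of-the-envelope illustration of the grouping idea (not the actual implementation), the values can be bucketed by how many times they occur, so that the chart only needs one slice per frequency group:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Back-of-the-envelope illustration (not the actual implementation) of
    // grouping a value distribution by frequency: all values occurring exactly
    // once end up in one group (<unique>), all values occurring twice in
    // another (<group=2>), and so on.
    public class FrequencyGroupingSketch {

        public static Map<Integer, List<String>> groupByFrequency(List<String> values) {
            // count how many times each value occurs
            Map<String, Integer> counts = new HashMap<>();
            for (String value : values) {
                Integer count = counts.get(value);
                counts.put(value, count == null ? 1 : count + 1);
            }

            // invert the map: frequency -> values occurring with that frequency
            Map<Integer, List<String>> groups = new HashMap<>();
            for (Map.Entry<String, Integer> entry : counts.entrySet()) {
                List<String> group = groups.get(entry.getValue());
                if (group == null) {
                    group = new ArrayList<>();
                    groups.put(entry.getValue(), group);
                }
                group.add(entry.getKey());
            }
            return groups;
        }
    }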

Notice also the green arrows in the table to the right. Using this button (or by clicking the slices of the pie chart) you will be able to drill to details and view the actual values that make up a given group.

Navigation tree in Phonetic similarity finder

Another application of richer reporting in DataCleaner is the new Phonetic similarity finder. In short, this analyzer applies a mix of well-known algorithms for similarity checking, such as Soundex, Metaphone and Levenshtein distance, to produce a set of groups of similar-sounding values. What you get is a tree of groups from which you can see the rows that are similar or maybe even identical:
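
For a simplified picture of what such grouping involves (not the analyzer's actual code), here is a sketch that groups values sharing the same Soundex code, using the Soundex implementation from the Apache Commons Codec library; the real analyzer combines several algorithms rather than relying on just one:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.commons.codec.language.Soundex;

    // Simplified illustration of phonetic grouping (not the analyzer's actual
    // code): values that share the same Soundex code end up in the same group.
    public class PhoneticGroupingSketch {

        public static Map<String, List<String>> groupBySoundex(List<String> values) {
            Soundex soundex = new Soundex();
            Map<String, List<String>> groups = new HashMap<>();
            for (String value : values) {
                String code = soundex.soundex(value);
                List<String> group = groups.get(code);
                if (group == null) {
                    group = new ArrayList<>();
                    groups.put(code, group);
                }
                group.add(value);
            }
            return groups;
        }
    }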


The big news here is of course that this kind of result would be practically impossible to display in a crosstabular result in DataCleaner 1.x - which is also why DataCleaner 1.x doesn't have this feature. I hope the message is clear: DataCleaner 2 will not only be a substantial improvement to the existing data profiling tool, it will also open up a lot of new doors for more interactive (and interesting) analyses.

Pluggability

The last thing I would like to point out in this blog entry is that the rendering mechanism in DataCleaner 2.0 is pluggable. This means that you can very easily, using modular Java code, enhance the existing result renderers or implement your own and simply plug it into the application. Just remember to contribute it back to the community :)
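
To give an impression of what plugging in a renderer could look like, here is a hypothetical sketch; the interface and class names are made up for illustration and are not DataCleaner's actual API:

    import java.util.Map;

    // Hypothetical sketch of a pluggable result renderer; names are
    // illustrative, not DataCleaner's actual API. The idea is that a renderer
    // maps a specific analyzer result type to some presentation format.
    interface AnalyzerResult {
    }

    class ValueCountResult implements AnalyzerResult {
        private final Map<String, Integer> counts;

        ValueCountResult(Map<String, Integer> counts) {
            this.counts = counts;
        }

        Map<String, Integer> getCounts() {
            return counts;
        }
    }

    interface ResultRenderer<R extends AnalyzerResult, P> {
        P render(R result);
    }

    // A custom renderer turning a value/count result into a simple HTML list
    class HtmlValueCountRenderer implements ResultRenderer<ValueCountResult, String> {

        @Override
        public String render(ValueCountResult result) {
            StringBuilder html = new StringBuilder("<ul>");
            for (Map.Entry<String, Integer> entry : result.getCounts().entrySet()) {
                html.append("<li>").append(entry.getKey())
                    .append(": ").append(entry.getValue()).append("</li>");
            }
            return html.append("</ul>").toString();
        }
    }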