It’s very easy to make your own data profiling logic

In the world of data quality I often see that people tend to think that a profiling application is that important because the user can easily analyze his data using SQL and similar methods already. It can actually make it quite difficult to convince users to try out a tool built for the job.

Let me tell you about a situation that I have been in quite a few times: If I am to do a quick demonstration of my favourite data profiling application then I will begin with something like a Value distribution chart. The responding watching the demo would then reply:

"... But I could easily do this with just a simple GROUP BY query."

And my answer would be YES, you can.

So then I will proceed to demonstrate various other metrics such as null counts, empty string counts, length maximum/minimum etc. An SQL-hefty respondent might reply:

"... Sure, but I can also use AVG, WHERE x IS NULL, length functions etc. in SQL".

And my answer would be YES, you can.

If at this point I dive into the discussion on these terms, then I would demonstrate hard-to-query measures such as diacritic count, word count, non-letter chars and case distribution. But ultimately I will also myself have misunderstood the point of using a data profiling application because the point isn’t (so much) that it contains measures that are not possible to express as queries.

The point is that the measures and their impact to your data's profile are not always known to you in advance. If you know just what you’re looking for, then you’ll find it and nothing more.

So while it’s quite easy (or at least possible) to design your own profiling plan using SQL and various other querying extensions, it’s not going to help you much in terms of profiling, because profiling should be an exploratory, analytical and interactive discipline.

To scare you off, I think I’ll try creating a few SQL queries that could compensate for most of the functionality in DataCleaner. I imagine they’ll be quite amusing, so hold on for an update...

kasper's source

20110118

It’s very easy to make your own data profiling logic – but wrong!

No comments:

Blog-archive

My links

eobjects - news