September 2011

What does 'Scrooge' and 'Kasper' have in common? Not much according to my SQL data profiler.

Some months back (admittedly, more than a couple) I was explaining how I think people tend do "home made data profiling" too often because it apparently seems easy to do in SQL. I went on to promise that I would also try to play the devil's advocate and show a few examples of "copy paste queries" that you can use for such a tool. In this blog post I will try to do so. But let me first say:

Don't do this at home kids!

But let's start with the first query that I would add to my home baked profiling SQL script. We'll do what anyone who hasn't really understood what profiling is all about will tell you to do: Do a column analysis based on the metrics that are easily available through all SQL implementations:

SELECT MAX(column), MIN(column), COUNT(column), COUNT(*)
FROM table;

This is a good query, especially for number columns. Here I would typically look of the MIN value is below zero or not. Is the COUNT(column) equal to the COUNT(*)? If not, it means that there are nulls in the column. Why not just do two separate queries, that would be more readable? Yes, but it also makes my script larger and I will have more stuff to maintain. But let's try it, we can actually improve it also by adding a few metrics:

SELECT MAX(column) AS highest_value FROM table;
SELECT MIN(column) AS lowest_positive_value FROM table WHERE column > 0;
SELECT MIN(column) AS lowest_negative_value FROM table WHERE column < 0;
SELECT COUNT(*) AS num_values FROM table WHERE column IS NOT NULL;
SELECT COUNT(*) AS num_nulls FROM table WHERE column IS NULL;

Now let's continue with some string columns, because I think more often than not, this is where data profiling turns out to be really valuable. Something that I often see as an inconsistency in structured string data is case differences. Such inconsistencies makes reporting and analysis of the data cumbersome and error prone because grouping and filtering will ultimately be inprecise. So let's do a case analysis:

SELECT COUNT(*) AS num_lowercase FROM table WHERE LCASE(column) = column;
SELECT COUNT(*) AS num_uppercase FROM table WHERE UCASE(column) = column;
SELECT COUNT(*) AS num_mixed_case FROM table WHERE LCASE(column) <> column AND UCASE(column) <> column;

And then on to query the always popular "first letter is capitalized" type of strings. This one really depends on the database, because substring functions have not been standardized across major SQL implementations. I'll show a few:

INITCAP-based approach (eg. PostgreSQL and Oracle):

SELECT COUNT(*) AS num_first_letter_capitalized FROM table WHERE INITCAP(column) = column;

SUBSTRING-based approach (eg. Microsoft SQL Server):

SELECT COUNT(*) AS num_first_letter_capitalized FROM table
WHERE UCASE(SUBSTR(column FROM 0 FOR 1)) = SUBSTR(column FROM 0 FOR 1)
AND LCASE(SUBSTR(column FROM 1)) = SUBSTR(column FROM 1)

A bit cumbersome, but get's the job done. Being the devil's advocate, I'm still not convinced that I should throw out my home baked SQL just yet. So I'm ready for another challenge!

Let's have a look at pattern finding through SQL. Again this is perfectly possible. I've even heard many people telling me that we should rewrite DataCleaner's Pattern Finder to make it SQL optimized. Read on and judge for yourself :-)

To match tokens by pattern we apply the simplest possible configuration in DataCleaner's pattern finder: All letters are replaced by 'A' or 'a' and all numbers are replaced by '9'. This makes for a nice pattern based matcher, like this:

Mickey Mouse -> 'Aaaaaa Aaaaa'
Minnie Mouse -> 'Aaaaaa Aaaaa'
Joachim von And -> 'Aaaaaaa aaa Aaa'
kasper@eobjects.dk -> 'aaaaaa@aaaaaaa.aa'

(Random fact: 'Joachim von And' is the Danish name for Scrooge McDuck)
As you can see from the patterns, this is a good preliminary way to determine if string values have the same form and syntax - we immediately see that the email address is odd and that although all other values look like valid names, som have lowercase tokens (prefixes) inbetween.

In PostgreSQL for example, this would look like:

SELECT regexp_replace(regexp_replace(regexp_replace(column, '[a-z]','a','g'), '[A-Z]','A','g'), '[0-9]','9','g') as pattern, COUNT(*) as pattern_count from table GROUP BY pattern;

This actually works like a charm and returns:

pattern	pattern_count
Aaaaaa Aaaaa	2
Aaaaaaa aaa Aaa	1
aaaaaa@aaaaaaaa.aa	1

So why even use a profiling tool for finding patterns? All this seems to be possible through raw SQL?

I will now stop playing the devil's advocate... Cuz' seriously... This is nonsense! Having worked for some years on a pretty good data quality analysis tool, this approach absolutely disgusts me. Here's just a few random reasons why, off the top of my head:

We still haven't scratched the surface when it comes to supporting eg. non-ASCII characters in patterns.
Some tokens in patterns should be matched regardless of string length, some shouldn't. In our case we never matched strings with unequal lengths (eg. Scrooge and Mickey). This is a setting that you will want to play around with! (For examples, check out our Pattern Finder documentation)
Each metric in the previous analyses required their own query. This means that if you want to analyze a hundred metrics, you would need to query (at least) a hundred times.
A lot of metrics are simply not possible to express in SQL. Some examples: Diacritic character count, max/min amount of words, matches against reference data and more.
Often you will want to preprocess data before (or actually I would argue, as a part of) your profiling. This can be for example to extract information from composite values or to replace known inconsistencies with standardized values.
All the examples offer no drill-to-detail behaviour, so further analysis is more or less impossible. And drill-to-detail is not offered through SQL, so there is for example no way to express in our pattern finder SQL that we want to keep some samples of various pattern matches for later inspection.
All in all, using SQL for data profiling makes for a terribly unexplorative approach. It's a pain having to write and modify such an amount of SQL to get simple things done, so don't rely on it, because it will make you lazy and then you'll not investigate properly!
And of course, SQL only applies to databases that support SQL! If you're looking to profile data in other formats, then you're out of luck with this approach.

kasper's source

20110926

Data Profiling SQLized. Uh oh...

Blog-archive

My links

eobjects - news