20110413

Two types of data profiling

Recently I've been blogging about how I see DataCleaner as what I would dub a 'Data Quality Analysis (DQA) tool' more than anything else. That calls for an explanation of what I mean by DQA tool, profiling tool and more.

So...

Data profiling is, in my worldview, the activity of extracting (and possibly refining) a set of analysis metrics from your data. As such it is a rather boring and trivial task that you can even automate quite easily.
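To make "extracting a set of analysis metrics" a bit more concrete, here is a minimal sketch in plain Python. It is not DataCleaner's API; the function names (`profile_column`, `profile_table`) and the choice of metrics are my own assumptions, and a real profiling tool would compute many more.

```python
from collections import Counter


def profile_column(values):
    """Return a few simple profiling metrics for one column of values.

    Hypothetical helper: None and empty strings are both treated as missing.
    """
    non_null = [v for v in values if v is not None and v != ""]
    counts = Counter(non_null)
    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(counts),
        "most_common": counts.most_common(1)[0][0] if counts else None,
    }


def profile_table(rows):
    """Profile every column of a table given as a list of dicts."""
    columns = list(rows[0].keys()) if rows else []
    return {col: profile_column([row.get(col) for row in rows]) for col in columns}


# Made-up sample data, only for illustration.
rows = [
    {"name": "Ann", "country": "DK"},
    {"name": "Bob", "country": ""},
    {"name": "Ann", "country": "DK"},
]
metrics = profile_table(rows)
```

Nothing in this requires human judgement, which is why this part is so easy to automate. The judgement comes in when you decide what the numbers mean.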

The interesting question is not what data profiling is, but why you would do it! I see two main reasons, and they differ quite a bit in how you would use a profiling tool.

The (chronologically) first reason that you would apply data profiling is to perform an analysis. Not a technical analysis, but an analysis where you apply your human reasoning. You investigate the metrics to discover your data. If you're a good analyst you will also continuously refine your analysis, challenge it and change settings to see what happens. For this a profiling tool enables you to go below "the tip of the iceberg" (a common phrase about profiling) in your datastores.

The second reason is monitoring your data quality. A profiling tool has the power to extract the metrics, so you will often see people use profiling tools to perform a set of data quality validation tasks. Here profiling is used as a way to retain a quality level. With this kind of data profiling you execute the same analysis again and again - only the data changes (over time).
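The monitoring case could be sketched roughly like this: one fixed check, run unchanged against successive snapshots of the data. Again this is plain Python and not how DataCleaner does it; the function names, the threshold and the snapshot data are all assumptions made up for illustration.

```python
def null_ratio(values):
    """Fraction of values that are missing (None or empty string)."""
    if not values:
        return 0.0
    missing = sum(1 for v in values if v is None or v == "")
    return missing / len(values)


def check_quality(values, max_null_ratio=0.1):
    """Run the same validation every time; only the input data changes."""
    ratio = null_ratio(values)
    return {"null_ratio": ratio, "passed": ratio <= max_null_ratio}


# Hypothetical snapshots of the same column, taken a week apart.
snapshots = {
    "2011-04-01": ["a", "b", "c", "d"],
    "2011-04-08": ["a", "", None, "d"],
}

# The analysis itself stays fixed; we only feed it new data over time.
results = {
    day: check_quality(values, max_null_ratio=0.25)
    for day, values in snapshots.items()
}
```

The point of the design is that the check and its threshold are settled once, so a failed result tells you the data changed and not the analysis.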

Do you see other applications of data profiling tools?