Unit test your data

In modern software development, unit testing is widely used as a way to check the quality of your code. For those of you who are not software developers, the idea in unit testing is that you define rules for your code and check them again and again, to verify that your code works and keeps on working.
Unit testing and data quality have quite a lot in common, in my opinion. Both code and data change over time, so there is a constant need to keep checking that your code or data has the desired characteristics. This was something that I was recently reminded of by a DataCleaner user on our forums.
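To make the parallel concrete, here is a minimal sketch in Python of what a unit test for data could look like. Everything in it is an assumption made up for illustration: the customer records, the list of allowed countries and the three rule names. The point is only that the rules are written down once and can then be run again and again as the data changes.

```python
import unittest

# Hypothetical sample records; in practice these would come from your datastore.
CUSTOMERS = [
    {"id": 1, "email": "anna@example.com", "country": "DK"},
    {"id": 2, "email": "bob@example.com", "country": "NL"},
    {"id": 3, "email": "carla@example.com", "country": "DK"},
]

# Assumed reference list of valid country codes.
ALLOWED_COUNTRIES = {"DK", "NL", "DE"}


def find_violations(records):
    """Return (record id, rule name) pairs for every rule a record breaks."""
    violations = []
    seen_ids = set()
    for record in records:
        # Rule 1: ids must be unique.
        if record["id"] in seen_ids:
            violations.append((record["id"], "unique_id"))
        seen_ids.add(record["id"])
        # Rule 2: a very rough email sanity check.
        if "@" not in record["email"]:
            violations.append((record["id"], "email_has_at_sign"))
        # Rule 3: the country must be in the reference list.
        if record["country"] not in ALLOWED_COUNTRIES:
            violations.append((record["id"], "known_country"))
    return violations


class CustomerDataTest(unittest.TestCase):
    def test_no_rule_violations(self):
        # The test fails as soon as any record breaks any rule.
        self.assertEqual(find_violations(CUSTOMERS), [])
```

Saved as a module, this can be run with `python -m unittest`, just like a test of ordinary code. Returning the list of violations, rather than a plain pass or fail, keeps the failing records visible when the test breaks.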
I am happy to see that data stewards and the like are picking up this idea, as it has been maturing for quite some time in the software development industry. It also got me thinking: in software development we have a lot of related methods and practices around unit testing. Let me try to list a few that are very important, and that we can perhaps also apply to data:
Compile-time checking (Ensuring correct syntax): Database constraints
Unit testing (Checking a single unit of code): Validating data profiling?
Continuous integration (Running all tests periodically): Data Quality monitoring?
Bug tracking (Maintaining records of all code issues)
Static code analysis (a la FindBugs): Explorative data profiling?
Refactoring (Changing code without breaking functionality): ETL with applied DQ rules?
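The pairing of compile-time checking with database constraints can be sketched in a few lines of Python using the standard `sqlite3` module. The `customer` table, its columns and the `insert_customer` helper are hypothetical names chosen for this example. Just as a compiler refuses code with invalid syntax, the database refuses rows that break a declared constraint, so bad data never gets in.

```python
import sqlite3


def insert_customer(conn, customer_id, email):
    """Try to insert a row; return True if accepted, False if a constraint rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO customer (id, email) VALUES (?, ?)",
                (customer_id, email),
            )
        return True
    except sqlite3.IntegrityError:
        # Raised for PRIMARY KEY, NOT NULL and CHECK violations alike.
        return False


# In-memory database with a hypothetical table for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customer (
        id INTEGER PRIMARY KEY,
        email TEXT NOT NULL CHECK (email LIKE '%@%')
    )
    """
)

accepted = insert_customer(conn, 1, "anna@example.com")
rejected_email = insert_customer(conn, 2, "not-an-email")
rejected_duplicate = insert_customer(conn, 1, "bob@example.com")
row_count = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
```

Constraints like these only cover rules that can be declared up front in the schema. Rules that depend on context or change over time are the ones left for the profiling and monitoring steps in the list.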

For an explanation of the various data profiling and monitoring types, please refer to my previous post, Two types of data profiling.
Of course not every item here maps one-to-one, but in my opinion the metaphor holds up pretty well. For me, as a software product developer, it also points out some of the weak and strong points of current Data Quality tools. In software development the tool support for unit testing, continuous integration, bug tracking and more is incredible. In the data world I feel that many tools focus on only one or two of the above areas of quality control. Of course you can combine tools, but as I've argued before, switching tools also comes at a large price.
So what do I suggest? Well, fellow product developers, let's make better tools that integrate more disciplines of data quality! I know that this has been and still will be my aim for DataCleaner.
Update: Further actions in this direction have been taken with the plans for DataCleaner 3.0, see this blog post for more information.
