Get your data right... First Time Right!

In my blog I mostly talk about data quality tools like DataCleaner that diagnose and treat problems rather than prevent them. Such tools have a lot of merit and strengths, but for a total view on data quality it is crucial that you also include tools that prevent poor data from ever entering your system. In this blog post I want to talk a bit about a project that I have been involved with at Human Inference which is just that - our First Time Right JavaScript solution.

The idea is that we provide a subscription-based JavaScript API where you can easily decorate any HTML contact form with a lot of rich features for on-the-fly verification, validation, auto correction and helpful features for automatic filling of derived fields.

For example, the API allows you to enter (or copy/paste) a full name, including titulation, salutation, initials and more - and get these items parsed and placed into corresponding fields on a detailed contact form. It will even automatically detect what the gender of the contact is, and apply this in gender fields. We have similar data entry aids for address input, email input, phone numbers and contact duplicate checking.

Take a look at the video below, which demonstrates most of the features:

Now this is quite exciting functionality, but this is also a technical blog, so I'll talk a bit about the technology involved.

We built the project on Google Web Toolkit (GWT). GWT compiles Java code to JavaScript, which enables us to build a very rich application that runs entirely as JavaScript in the browser, so that it can be embedded on any website, no matter if it's PHP based, ASP.NET based, Java based or whatever. Of course we do have a server-side piece that the JavaScript communicates with, but that is all hosted on Human Inference's cloud platform. So in other words: The deployment of our First Time Right principle is a breeze!

Since the browser's same-origin policy requires an AJAX application to talk to the server that the page was loaded from, we've had to overcome quite a few issues to allow the JavaScript to be hosted outside the deployment sites. This is crucial, as we want upgrades and improvements to be performed on our premises, not at individual customer sites. This way we can really leverage the cloud- and subscription-based approach to data quality. Our solution to the locality problem has been the JSONP approach, which is an alternative technique for implementing AJAX behaviour. JSONP is a rather clever construct where instead of issuing XMLHttpRequest calls, you insert new <script> elements into the HTML DOM at runtime! The browser then performs a new request simply because the <script> element refers to a new JavaScript source, and script sources are not restricted to the page's own server. It's not "pretty" to tackle the error handling and the asynchronicity that this approach brings, but we've done a lot of work to get it right, and it works like a charm! I hope to share some of our design patterns later, to demonstrate how it works.
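
To give a rough idea of the mechanism, here is a minimal JSONP sketch in plain JavaScript. It is not the GWT-based implementation described above: the function name jsonp, the "callback" query parameter, the options argument and the URL in the usage note are all assumptions made for illustration.

```javascript
// Minimal JSONP sketch (illustrative only; names are hypothetical).
let jsonpCounter = 0;

function jsonp(url, params, onSuccess, onError, options = {}) {
  const doc = options.document || globalThis.document;
  const timeoutMs = options.timeoutMs || 10000;
  const callbackName = "__jsonp_cb_" + (jsonpCounter++);
  const script = doc.createElement("script");
  let finished = false;

  // Remove the global callback, the timer and the <script> element again.
  function cleanup() {
    finished = true;
    clearTimeout(timer);
    delete globalThis[callbackName];
    if (script.parentNode) script.parentNode.removeChild(script);
  }

  // The server is assumed to answer with JavaScript of the form:
  //   callbackName({ ...json... });
  globalThis[callbackName] = function (data) {
    if (finished) return;
    cleanup();
    onSuccess(data);
  };

  // A <script> element exposes no HTTP status code, so failures can only be
  // detected through the onerror event and a timeout.
  script.onerror = function () {
    if (finished) return;
    cleanup();
    onError(new Error("JSONP request failed: " + url));
  };
  const timer = setTimeout(function () {
    if (finished) return;
    cleanup();
    onError(new Error("JSONP request timed out: " + url));
  }, timeoutMs);

  // Build the query string and tell the server which callback to wrap the data in.
  const query = Object.keys(params)
    .map((k) => encodeURIComponent(k) + "=" + encodeURIComponent(params[k]))
    .concat("callback=" + callbackName)
    .join("&");
  script.src = url + (url.indexOf("?") === -1 ? "?" : "&") + query;

  // Appending the element is what makes the browser fetch the script.
  doc.head.appendChild(script);
}
```

A call could then look like jsonp("https://dq.example.invalid/email", { email: "a@b.c" }, showResult, showError), where the URL and both handlers are placeholders. The timeout is there because a script that never loads and never fires onerror would otherwise leave the caller waiting forever.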

Another challenge was security. Obviously you will want to make sure that the JavaScript is only available to subscribers, and only on the websites that they've subscribed for (because otherwise the JavaScript could simply be copied to another website). Our way around this resembles how, for example, Google manages subscriptions to Google Maps and other subscription services, where you need a site-specific API key. Very clever.
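
As a rough illustration of the general idea of a site-specific key (not a description of how Human Inference or Google actually implement it), a server could check each incoming request against the hostnames registered for the key. The subscriptions table, the key value and the function name isRequestAllowed below are all made up for this sketch, and it assumes the server reads the page address from the request's Referer header.

```javascript
// Hypothetical subscription registry: API key -> hostnames the key is valid for.
const subscriptions = new Map([
  ["demo-key-123", ["www.example.com", "shop.example.com"]],
]);

// Returns true only if the key exists and the referring page is on a
// hostname registered for that key.
function isRequestAllowed(apiKey, refererUrl) {
  const allowedHosts = subscriptions.get(apiKey);
  if (!allowedHosts || !refererUrl) return false;
  let host;
  try {
    host = new URL(refererUrl).hostname;
  } catch (e) {
    return false; // Malformed Referer value: reject.
  }
  return allowedHosts.includes(host);
}

console.log(isRequestAllowed("demo-key-123", "https://www.example.com/contact"));
console.log(isRequestAllowed("demo-key-123", "https://copycat.example.org/form"));
```

Note that the Referer header can be forged by non-browser clients, so a check like this mainly stops the script from being copied onto another website, which is the scenario described above.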

A few optional features may require some local add-on deployment. In particular, deduplication requires us to know the contact data to use as the source for detecting if a new contact is a duplicate. Here we have two options: On-premise installation of the deduplication engine or hooking up with our cloud-based deduplication engine, which can be configured to sync with your datastores.

All in all I am quite enthusiastic about the FTR solution and the technology behind the solution. I also think that our FTR API is an example of a lightweight approach to implementing Data Quality, which complements DataCleaner very well. Both tools are extremely useful for ensuring a high level of data quality, and both tools are very intuitive and flexible in the way you can deploy them.


Eye candy in Java 7: New javadoc style!

By now most of you have probably heard that Java 7 is out, and there is a lot of discussion about new features, the loop optimization bug and general adoption.

But one of the things in Java 7 which has escaped most people's attention (I think) is the new javadoc style.

Check it out:

And see it live - we've just published an updated API documentation for MetaModel.


Unit test your data

In modern software development unit testing is widely used as a way to check the quality of your code. For those of you who are not software developers, the idea in unit testing is that you define rules for your code that you check again and again, to verify that your code works, and keeps on working.
Unit testing and data quality have quite a lot in common in my opinion. Both code and data change over time, so there is a constant need to keep checking that your code/data has the desired characteristics. This was something that I was recently reminded of by a DataCleaner user on our forums.
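
To make the analogy a bit more concrete, here is a small sketch of what a "unit test for data" could look like in plain JavaScript. It does not use DataCleaner; the contact records, the two rules and the function name runDataTests are invented for the example.

```javascript
// Hypothetical contact records; in practice these would come from your datastore.
const contacts = [
  { id: 1, email: "anna@example.com", country: "DK" },
  { id: 2, email: "not-an-email", country: "NL" },
  { id: 3, email: "bob@example.com", country: "" },
];

// Each "data test" is a named rule that every record is expected to satisfy.
const rules = [
  { name: "email looks valid", check: (c) => /^[^@\s]+@[^@\s]+\.[^@\s]+$/.test(c.email) },
  { name: "country is filled in", check: (c) => c.country.trim().length > 0 },
];

// Runs every rule against every record and collects the violations,
// much like a test runner collects failing assertions.
function runDataTests(records, rulesToCheck) {
  const failures = [];
  for (const record of records) {
    for (const rule of rulesToCheck) {
      if (!rule.check(record)) {
        failures.push({ id: record.id, rule: rule.name });
      }
    }
  }
  return failures;
}

const failures = runDataTests(contacts, rules);
console.log(failures);
```

Just as with code, the value comes from running the same rules again and again as the data changes, so that a record which breaks a rule is caught as soon as it appears.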
I am happy to see that data stewards and the like are picking up this idea, as it has been maturing for quite some time in the software development industry. It also got me thinking: In software development we have a lot of related methods and practices around unit testing. Let me try to list a few, which are very important, and which we can perhaps also apply to data?
Compile-time checking (ensuring correct syntax): Database constraints
Unit testing (checking a single unit of code): Validating data profiling?
Continuous integration (running all tests periodically): Data Quality monitoring?
Bug tracking (maintaining records of all code issues)
Static code analysis (a la FindBugs): Explorative data profiling?
Refactoring (changing code without breaking functionality): ETL with applied DQ rules?

For explanation of the various data profiling and monitoring types, please refer to my previous post, Two types of data profiling.
Of course not all metaphors here map one-to-one, but in my opinion it is a pretty good metaphor. For me, as a software product developer, I think it also points out some of the weak and strong points of current Data Quality tools. In software development the tool support for unit testing, continuous integration, bug tracking and more is incredible. In the data world I feel that many tools focus only on one or two of the above areas of quality control. Of course you can combine tools, but as I've argued before, switching tools also comes at a large price.
So what do I suggest? Well, fellow product developers, let's make better tools that integrate more disciplines of data quality! I know that this has been and still will be my aim for DataCleaner.
Update: Further actions in this direction have been taken with the plans for DataCleaner 3.0, see this blog post for more information.