20130422

What's the fuss about National Identifier matching?

The topic of National Identification numbers (also sometimes referred to as social security numbers) is something that can spawn a heated debate at my workplace. Coming from Denmark, but having an employer in the Netherlands, I am exposed to two very different ways of thinking about the subject. But regardless of our differences on the subject, when you develop a product for MDM of customer data, as Human Inference does, you need to understand the implications of both approaches - I would almost call them the "with" and "without" national identifiers implementations of MDM!

In Denmark we use our national identifiers a lot - the CPR numbers for persons and the CVR numbers for companies. Previously I wrote a series of blog posts on how to use CVR and CPR for data cleansing. Private companies in Denmark are allowed to collect these identifiers, although consumers can opt to say "no thanks" in most cases. But all in all, it means that we have our IDs out in society, not locked up in our bedroom closet.

In the Netherlands the use of such identifiers is prohibited for almost all organizations. While talking to my colleagues I get the sense that there's a deep-rooted view of this ID as more of a password than a key. Commercial use of the ID would be like giving up your basic privacy liberties, and most people don't know their number by heart. In contrast, Danish citizens typically do share their credentials, but are quite aware of the privacy laws that companies are obligated to follow when receiving this information.

So what is the end result for data quality and MDM? Well, it highly affects how organizations will be doing two of the most complex operations in an MDM solution: Data cleansing and data matching (aka deduplication).

I recently had my second daughter, and immediately after her birth I could not help but notice that the hospital crew gave us a sheet of paper with her CPR number. This was just an hour or two after she took her first breath, and a long time before she even had an official name! In data quality terms, this was a nicely designed first-time-right (FTR, a common principle in DQ and MDM) system in action!

When I work with Danish customers it usually means that I spend a lot of time on verifying that the IDs of persons and companies are in fact the correct IDs. Like any other attribute, an ID might have typos, formatting glitches etc. And since we have multiple registries and number types (CVR numbers, P numbers, EAN numbers etc.), I also spend quite some time on inferring "what is this number?". You would typically look up a company's name in the public CVR registry and make sure it matches the name in your own data. If it doesn't, the ID is probably wrong and you need to delete it or obtain a correct one. While finding duplicates you can typically standardize the formatting of the IDs and do exact matching for the most part, except for those cases where the ID is missing.
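To make the "standardize, then match exactly" idea a bit more concrete, here is a minimal sketch in Python. It is not how DataCleaner or any Human Inference product does it; the function names and the record layout (a dict with an "id" field) are made up for illustration. The length-based guess assumes 8 digits for a CVR number, 10 for a CPR or P number and 13 for an EAN number, and it cannot tell a CPR number from a P number on its own.

```python
import re
from collections import defaultdict


def normalize_id(raw):
    """Keep only the digits, so 'DK-12 34 56 78' and '12345678' compare equal."""
    if raw is None:
        return None
    digits = re.sub(r"\D", "", raw)
    return digits or None


def guess_id_type(digits):
    """Rough guess at 'what is this number?' based on length alone (assumed lengths)."""
    lengths = {8: "CVR", 10: "CPR or P number", 13: "EAN"}
    return lengths.get(len(digits), "unknown")


def group_by_id(records):
    """Group records by normalized ID; records without a usable ID are returned separately."""
    groups = defaultdict(list)
    missing_id = []
    for record in records:
        key = normalize_id(record.get("id"))
        if key is None:
            missing_id.append(record)  # these need attribute-based matching instead
        else:
            groups[key].append(record)
    return dict(groups), missing_id
```

Any group with more than one record is a duplicate candidate. The records that end up in the second return value are the ones where the ID is missing, and those still need matching on name, address and so on.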

When I work with Dutch customers it usually means that we cleanse the individual attributes of a customer in a much more rigorous manner. The name is cleansed on its own. Then the address. Then the phone number, and then the email and other similar fields. You'll end up knowing if each element is valid or not, but not if the whole record is actually a cohesive chunk of data. You can apply a lot of cool inferential techniques to check that the data is cohesive (for instance it is plausible that I, Kasper Sørensen, have the email i.am.kasper.sorensen@gmail.com), but you won't know if it is also my address that you can find in the data, or if it's just some valid address.
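The name-versus-email plausibility check is one of the simpler inferential techniques to sketch. The Python below is only an illustration under my own assumptions: the function names are made up, and the rule (every part of the name must appear as a token in the part of the email before the @) is deliberately naive. Letters like ø are mapped by hand because Unicode normalization does not decompose them.

```python
import re
import unicodedata

# Letters that have no Unicode decomposition and need an explicit mapping.
_SPECIAL = str.maketrans({"ø": "o", "æ": "ae", "å": "a", "ß": "ss"})


def ascii_tokens(text):
    """Lowercase, strip diacritics and split on anything that is not a-z or 0-9."""
    text = text.lower().translate(_SPECIAL)
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return [token for token in re.split(r"[^a-z0-9]+", text) if token]


def email_matches_name(name, email):
    """True if every name token appears as a token in the email's local part."""
    local_part = email.split("@", 1)[0]
    local_tokens = set(ascii_tokens(local_part))
    name_tokens = ascii_tokens(name)
    return bool(name_tokens) and all(token in local_tokens for token in name_tokens)
```

With this rule, "Kasper Sørensen" and i.am.kasper.sorensen@gmail.com are considered plausible together, while initials or nicknames in the email would be missed. And as said, a positive result only tells you that the name and the email fit each other, not that the address on the same record belongs to the same person.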

Of course the grass isn't quite as green in Denmark as I present it here. Unfortunately we do also have problems with CPR and CVR, and in general I doubt that there will be one single source of the truth that we can do reference data checks on in the near future. For instance, a change of address is typically quite delayed in the national registries, whereas it is much quicker at the post agencies. And although I think you can share your email and similar attributes through CPR - in practice that's not what people do. So actually you need an MDM hub which connects to several sources of data and then picks and chooses from the ones that you trust the most for individual pieces of the data. The great thing is that in Denmark we have a much clearer way to do data interchange between data services, since we do have a common type of key for the basic entities. This paves the way for very interesting reference data hubs such as iDQ, which in turn makes it easier for us to consolidate some of our integration work.
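The "pick and choose from the sources you trust the most" part can be sketched as a per-attribute trust ranking. This is a minimal Python illustration, not a description of any particular MDM hub; the source names ("national_registry", "post_agency", "crm") and the function name are hypothetical, and it assumes the records from the different sources have already been linked through a shared key such as the CPR or CVR number.

```python
def build_golden_record(source_records, trust_order):
    """For each attribute, take the value from the most trusted source that has one.

    source_records: {source name: {attribute: value}} for one already-linked entity.
    trust_order:    {attribute: [source names, most trusted first]}.
    """
    golden = {}
    for attribute, ranked_sources in trust_order.items():
        for source in ranked_sources:
            value = source_records.get(source, {}).get(attribute)
            if value:  # skip sources where the attribute is missing or empty
                golden[attribute] = {"value": value, "source": source}
                break
    return golden


# Hypothetical example: the post agency is trusted first for the address,
# the national registry for the name, and our own CRM for the email.
records = {
    "national_registry": {"name": "Kasper Sørensen", "address": "Old Street 1"},
    "post_agency": {"address": "New Street 2"},
    "crm": {"name": "K. Sorensen", "email": "i.am.kasper.sorensen@gmail.com"},
}
trust = {
    "name": ["national_registry", "crm"],
    "address": ["post_agency", "national_registry"],
    "email": ["national_registry", "crm"],
}
golden = build_golden_record(records, trust)
```

Keeping the source next to each value makes it possible to explain later why, for instance, the address came from the post agency and not from the national registry.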

Coming back to the more ethical question: Are the Danish national identifiers a threat to our privacy? Or are they just a more modern and practical way to reference and share basic data? For me the winning argument for the Danish model is in the term "basic data". We do share basic data through CPR/CVR, but you can't access any of my compromising or transactional data. In comparison, I fear much more for my privacy when sharing data through Facebook, Google and so on. Sure, if you had my CPR number, you would also be able to find out where I live. I wouldn't share my CPR number with you if I did not want to provide that information though, and after all, sharing information, be it CPR, addresses or anything else, always comes with the risk of leaking that information to other parties. Finally, as an MDM professional I must say - combining information from multiple sources - be it public or private registries - isn't exactly something new, so privacy concerns are in my opinion largely the same in all countries.

But it does mean that implementations of MDM are highly likely to differ a lot when you cross national borders. Denmark and the Netherlands are perhaps striking examples of different national systems, but given how much we have in common in general, I am sure there are tons of black swans out there for me yet to discover. As an MDM vendor, and as the lead for DataCleaner, I always need to ensure that our products cater to international - and thereby highly varied - data and ways of processing data.