Introducing Apache MetaModel

Recently we where able to announce an important milestone in the life of our project MetaModel - it is being incubated into the Apache Foundation! Obviously this generates a lot of new attention to the project, and causing lots and lots of questions on what MetaModel is good for. We didn't grant this project to Apache just for fun, but because we wanted to maximize it's value, both for us and for the industry as a whole. So in this post I'll try and explain the scope of MetaModel, how we use it at Human Inference, and what you might use it for in your products or services.

First, let's recap the one-liner for MetaModel:
MetaModel is a library that encapsulates the differences and enhances the capabilities of different datastores.
In other words - it's all about making sure that the way you work with data is standardized, reusable and smart.

But wait, don't we already have things like Object-Relational-Mapping (ORM) frameworks to do that? After all, a framework like OpenJPA or Hibernate will allow you to work with different databases without having to deal with the different SQL dialects etc. The answer is of course yes, you can use such frameworks for ORM, but MetaModel is by choice not an ORM! An ORM assumes an application domain model, whereas MetaModel, as its name implies, is treating the datastore's metadata as its model. This not only allows for much more dynamic behaviour, but it also makes MetaModel applicable only to a range of specific application types that deal with more or less arbitrary data, or dynamic data models, as their domain.

At Human Inference we build just this kind of software products, so we have great use of MetaModel! The two predominant applications that use MetaModel in our products are:
  • HIquality Master Data Management (MDM)
    Our MDM solution is built on a very dynamic data model. With this application we want to allow multiple data sources to be consolidated into a single view - typically to create an aggregated list of customers. In addition, we take third party sources in and enrich the source data with this. So as you can imagine there's a lot of mapping of data models going on in MDM, and also quite a wide range of database technologies. MetaModel is one of the cornerstones to making this happen. Not only does it mean that we onboard data from a wide range of sources. It also means that these source can vary a lot from eachother and that we can map them using metadata about fields, tables etc.
  • DataCleaner
    Our open source data quality toolkit DataCleaner is obviously also very dependent on MetaModel. Actually MetaModel started as a kind of derivative project from the DataCleaner project. In DataCleaner we allow the user to rapidly register new data and immediately build analysis jobs using it. We wanted to avoid building code that's specific to any particular database technology, so we created MetaModel as the abstraction layer for reading data from almost anywhere. Over time it has grown into richer and richer querying capabilities as well as write access, making MetaModel essentially a full CRUD framework for ... anything.
I've often pondered on the question of What could other people be using MetaModel for? I can obviously only provide an open answer, but some ideas that have popped up (in my head or in the community's) are:
  • Any application that needs to onboard/intake data of multiple formats
    Oftentimes people need to be able to import data from many sources, files etc. MetaModel makes it easy to "code once, apply on any data format", so you save a lot of work. This use-case is similar to what the Quipu project is using MetaModel for.
  • Model Driven Development (MDD) tools
    Design tools for MDD are often used to build domain models and at some point translate them to the physical storage layer. MetaModel provides not only "live" metadata from a particular source, but also in-memory structures for e.g. building and mutating virtual tables, schemas, columns and other metadata. By encompassing both the virtual and the physical layer, MetaModel provides a lot of the groundwork to build MDD tools on top of it.
  • Code generation tools
    Tools that generate code (or other digital artifacts) based on data and metadata. Traversing metadata in MetaModel is very easy and uniform, so you could use this information to build code, XML documents etc. to describe or utilize the data at hand.
  • An ORM for anything
    I said just earlier that MetaModel is NOT an ORM. But if you look at MetaModel from an architectural point of view, it could very well serve as the data access layer of an ORM's design. Obviously all the object-mapping would have to be built on top of it, but then you would also have an ORM that maps to not just JDBC databases, like most ORMs do, but also to file formats, NoSQL databases, Salesforce and more!
  • Open-ended analytical applications
    Say you want to figure out e.g. if particular words appear in a file, a database or whatever. You would have to know a lot about the file format, the database fields or similar constructs. But with MetaModel you can instead automate this process by traversing the metadata and querying whatever fields match your predicates. This way you can build tools that "just takes a file" or "just takes a database connection" and let them loose to figure out their own query plan and so on.
If I am to point at a few buzzwords these days, I would say MetaModel can play a critical role in implementing services for things such as data federation, data virtualization, data consolidation, metadata management and automation.

And obviously this is also part of our incentive to make MetaModel available for everyone, under the Apache license. We are in this industry to make our products better and believe that cooperation to build the best foundation will benefit both us and everyone else that reuses it and contributes to it.


What's the fuzz about National Identifier matching?

The topic of National Identification numbers (also sometimes referred to as social security numbers) is something that can spawn a heated debate at my workplace. Coming from Denmark, but having an employer in the Netherlands, I am exposed to two very different ways of thinking about the subject. But regardless of our differences on the subject - while developing a product for MDM of customer data, like Human Inference is doing, you need to understand the implifications of both approaches - I would almost call them the "with" and "without" national identifiers implementation of MDM!

In Denmark we use our National identifiers a lot - the CPR numbers for persons and the CVR numbers for companies. Previously I wrote a series of blog posts on how to use CVR and CPR for data cleansing. Private companies are allowed in Denmark to collect these identifiers, although consumers can opt to say "no thanks" in most cases. But all in all, it means that we have our IDs out in the society, not locked up in our bedroom closet.

In the Netherlands the use of such identifiers is prohibited for almost all organizations. While talking to my colleagues I get the sense that there's a profound thinking of this ID as more a password than a key. Commercial use of the ID would be like giving up your basic privacy liberties and most people don't remember it by heart. In contrast, Danish citizens typically do share their credentials, but are quite aware about the privacy laws that companies are obligated to follow when receiving this information.

So what is the end result for data quality and MDM? Well, it highly affects how organizations will be doing two of the most complex operations in an MDM solution: Data cleansing and data matching (aka deduplication).

I recently had my second daughter, and immediately after her birth I could not help but noticing that the hospital crew gave us a sheet of paper with her CPR number. This was just an hour or two after she had her first breath, and a long time before she even had an official name! In data quality terms, this was nicely designed first-time-right (FTR, a common principle in DQ and MDM) system in action!

When I work with Danish customers it usually means that you spend a lot of time on verifying that the IDs of persons and companies are in fact the correct IDs. Like any other attribute, you might have typos, formatting glitches etc. And since we have multiple registries and number types (CVR numbers, P numbers, EAN numbers etc.), you also spend quite some time on infering "what is this number?". You would typically look up the companies' names in the public CVR registry and make sure it matches the name in your own data. If not - probably the ID is wrong and you need to delete it or obtain a correct one. While finding duplicates you can typically standardize the formatting of the IDs and do exact matching for the most part, except for those cases where the ID is missing.

When I work with Dutch customers it usually means that we cleanse the individual attributes of a customer in a much more rigorous manner. The name is cleansed on it's own. Then the address. Then the phone number, and then the email and other similar fields. You'll end up knowing if each element is valid or not, but not if the whole record is actually a cohesive chunk of data. While you can apply a lot of cool inferential techniques to check that the data is cohesive (for instance it is plausible that I, Kasper Sørensen, have the email i.am.kasper.sorensen@gmail.com) but you won't know if it is also my address that you can find in the data, or if it's just some valid address.

Of course the grass isn't that much greener in Denmark as I present it here. Unfortunately we also do have problems with CPR and CVR, and in general I disbelieve that there will be one single source of the truth that we can do reference data checks on in the near future. For instance, change of address typically is quite delayed in the national registries, whereas it is much quicker at the post agencies. And although I think you can share your email and similar attributes through CPR - in practice that's not what people do. So actually you need a MDM hub which connects to several sources of data and then pick and choose from the ones that you trust the most for individual pieces of the data. The great thing is that in Denmark we have a much clearer way to do data interchange inbetween data services, since we do have a common type of key for the basic entities. This gives way for very interesting reference data hubs like for instance iDQ, which in turn makes it easier for us to consolidate some of our integration work.

Coming back to the more ethical question: Is the Danish National Identifiers a threat to our privacy? Or is it just a more modern and practical way to reference and share basic data? For me the winning argument for the Danish model is in the term "basic data". We do share basic data through CPR/CVR, but you can't access any of my compromising or transactional data. In comparison, I fear much more for my privacy when sharing data through Facebook, Google and so on. Sure, if you had my CPR number, you would also be able to find out where I live. I wouldn't share my CPR number with you if I did not want to provide that information though, and after all sharing information, CPR, addresses or anything else, always comes at the risk of leaking that information to other parties. Finally, as an MDM professional I must say - combining information from multiple sources - be it public or private registries - isn't exactly something new, so privacy concerns are in my opinion largely the same in all countries.

But it does mean that implementations of MDM are highly likely to differ a lot when you cross national borders. Denmark and the Netherlands are maybe profound examples of different national systems, but given how much we have in common in general, I am sure there are tons of black swans out there for me yet to discover. As a MDM vendor, and as the lead for DataCleaner, I always need to ensure that our products caters to international - and thereby highly varied - data and ways of processing data.


Cleaning Danish customer data with the CVR registry and DataCleaner

I am doing a series of 3 blog entries over at Data Value Talk about the political and practical side of the Danish government's recent decision to open up basic public data to everyone. I thought it would be nice to also share the word here on my personal blog!
Click the images above to navigate to the three chapters on my blog:
Cleaning Danish customer data with the CVR registry and DataCleaner.