20130422

What's the fuss about National Identifier matching?

The topic of national identification numbers (also sometimes referred to as social security numbers) is something that can spawn a heated debate at my workplace. Coming from Denmark, but having an employer in the Netherlands, I am exposed to two very different ways of thinking about the subject. But regardless of our differences on the subject - while developing a product for MDM of customer data, as Human Inference does, you need to understand the implications of both approaches. I would almost call them the "with" and "without" national identifier implementations of MDM!

In Denmark we use our national identifiers a lot - the CPR numbers for persons and the CVR numbers for companies. Previously I wrote a series of blog posts on how to use CVR and CPR for data cleansing. Private companies in Denmark are allowed to collect these identifiers, although consumers can opt to say "no thanks" in most cases. All in all, it means that we have our IDs out in society, not locked up in our bedroom closets.

In the Netherlands the use of such identifiers is prohibited for almost all organizations. While talking to my colleagues I get the sense that the ID is thought of more as a password than as a key. Commercial use of the ID would be like giving up your basic privacy liberties, and most people don't know it by heart. In contrast, Danish citizens typically do share their credentials, but are quite aware of the privacy laws that companies are obliged to follow when receiving this information.

So what is the end result for data quality and MDM? Well, it strongly affects how organizations carry out two of the most complex operations in an MDM solution: data cleansing and data matching (aka deduplication).

I recently had my second daughter, and immediately after her birth I could not help noticing that the hospital crew gave us a sheet of paper with her CPR number. This was just an hour or two after she took her first breath, and a long time before she even had an official name! In data quality terms, this was a nicely designed first-time-right (FTR, a common principle in DQ and MDM) system in action!

When I work with Danish customers, it usually means that you spend a lot of time verifying that the IDs of persons and companies are in fact the correct IDs. Like any other attribute, you might have typos, formatting glitches etc. And since we have multiple registries and number types (CVR numbers, P numbers, EAN numbers etc.), you also spend quite some time inferring "what is this number?". You would typically look up the companies' names in the public CVR registry and make sure they match the names in your own data. If not, the ID is probably wrong and you need to delete it or obtain a correct one. While finding duplicates you can typically standardize the formatting of the IDs and do exact matching for the most part, except for those cases where the ID is missing.
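To give a flavour of what that verification work looks like in practice, here is a minimal sketch of a CVR plausibility check. It is my own illustration, not production code from any product, and the modulus-11 weights are an assumption based on commonly published descriptions of the CVR check digit - a real implementation would of course also look the number up in the CVR registry:

public class CvrPlausibilityCheck {

    // Weights commonly cited for the Danish CVR modulus-11 check (an assumption;
    // verify against the official CVR documentation before relying on this).
    private static final int[] WEIGHTS = { 2, 7, 6, 5, 4, 3, 2, 1 };

    public static boolean isPlausibleCvr(String input) {
        // Standardize formatting first: strip whitespace, dashes and an optional "DK" prefix.
        String digits = input.trim().toUpperCase().replaceFirst("^DK", "").replaceAll("[\\s-]", "");
        if (!digits.matches("\\d{8}")) {
            return false;
        }
        int sum = 0;
        for (int i = 0; i < 8; i++) {
            sum += (digits.charAt(i) - '0') * WEIGHTS[i];
        }
        return sum % 11 == 0;
    }
}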

When I work with Dutch customers, it usually means that we cleanse the individual attributes of a customer in a much more rigorous manner. The name is cleansed on its own. Then the address. Then the phone number, and then the email and other similar fields. You'll end up knowing whether each element is valid or not, but not whether the whole record is actually a cohesive chunk of data. You can apply a lot of cool inferential techniques to check that the data is cohesive (for instance, it is plausible that I, Kasper Sørensen, have the email i.am.kasper.sorensen@gmail.com), but you won't know if it is also my address that you find in the data, or just some valid address.
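To illustrate what such an inferential cohesion check could look like, here is a deliberately naive sketch - my own illustration, not Human Inference's actual matching logic:

public class CohesionHeuristic {

    // Naive heuristic: does the local part of the email contain all (crudely
    // transliterated) tokens of the name? E.g. "Kasper Sørensen" vs.
    // "i.am.kasper.sorensen@gmail.com" passes, while a random valid email likely won't.
    public static boolean emailMatchesName(String name, String email) {
        String localPart = email.toLowerCase().split("@")[0];
        String normalized = name.toLowerCase()
                .replace("ø", "o").replace("æ", "ae").replace("å", "aa");
        for (String token : normalized.split("\\s+")) {
            if (!localPart.contains(token)) {
                return false;
            }
        }
        return true;
    }
}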

Of course the grass isn't quite as green in Denmark as I present it here. Unfortunately we also have problems with CPR and CVR, and in general I doubt that there will be one single source of truth that we can do reference data checks against in the near future. For instance, a change of address is typically registered with quite some delay in the national registries, whereas it is much quicker at the postal agencies. And although I think you can share your email and similar attributes through CPR, in practice that's not what people do. So you actually need an MDM hub which connects to several sources of data and then picks and chooses from the ones that you trust the most for individual pieces of the data. The great thing is that in Denmark we have a much clearer way to do data interchange between data services, since we do have a common type of key for the basic entities. This gives way to very interesting reference data hubs such as iDQ, which in turn makes it easier for us to consolidate some of our integration work.

Coming back to the more ethical question: are the Danish national identifiers a threat to our privacy? Or are they just a more modern and practical way to reference and share basic data? For me the winning argument for the Danish model is in the term "basic data". We do share basic data through CPR/CVR, but you can't access any of my compromising or transactional data. In comparison, I fear much more for my privacy when sharing data through Facebook, Google and so on. Sure, if you had my CPR number, you would also be able to find out where I live. But I wouldn't share my CPR number with you if I did not want to provide that information, and after all, sharing information - CPR numbers, addresses or anything else - always comes with the risk of leaking that information to other parties. Finally, as an MDM professional I must say that combining information from multiple sources - be it public or private registries - isn't exactly something new, so privacy concerns are in my opinion largely the same in all countries.

But it does mean that implementations of MDM are highly likely to differ a lot when you cross national borders. Denmark and the Netherlands are perhaps striking examples of different national systems, but given how much we have in common in general, I am sure there are tons of black swans out there for me yet to discover. As an MDM vendor, and as the lead for DataCleaner, I always need to ensure that our products cater to international - and thereby highly varied - data and ways of processing data.

20130117

Cleaning Danish customer data with the CVR registry and DataCleaner

I am doing a series of 3 blog entries over at Data Value Talk about the political and practical side of the Danish government's recent decision to open up basic public data to everyone. I thought it would be nice to also share the word here on my personal blog!
Click the images above to navigate to the three chapters on my blog:
Cleaning Danish customer data with the CVR registry and DataCleaner.

20121207

How to build a Groovy DataCleaner extension

In this blog entry I'll go through the process of developing a DataCleaner extension: The Groovy DataCleaner extension (just published today). The source code for the extension is available on GitHub if you wish to check it out or even fork and improve it!

First step: You have an idea for your extension. My idea was to integrate the Groovy language with DataCleaner, to offer an advanced scripting option similar to the existing JavaScript transformer - just a lot more powerful. The task would give me the chance to 1) get acquainted with the Groovy language, 2) solve some of the more advanced uses of DataCleaner by giving a completely open-ended scripting option, and 3) blog about it. The third point is important to me, because right now we have a Community Contributor Contest, and I'd like to invite extension developers to participate.

Second step: Build a quick prototype. This usually starts by identifying which type of component(s) you want to create. In my case it was a transformer, but in some cases it might be an analyzer. The choice between these is essentially: does your extension pre-process or transform the data in a way that makes it part of a flow of operations? Then it's a Transformer. Or is it something that consumes the records (potentially after they have been pre-processed) and generates some kind of analysis result or writes the records somewhere? Then it's an Analyzer.

The API for DataCleaner was designed to be very easy to use. The idiom has been: 1) the obligatory functionality is provided by the interface that you implement, 2) the user-configured parts are injected using the @Configured annotation, and 3) the optional parts can be injected if you need them. In other words, this is very much inspired by the idea of Convention-over-Configuration.
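For contrast with the Transformer I'm about to build, here is roughly what an Analyzer skeleton could look like. This is a sketch from memory of the API, so the @AnalyzerBean annotation, the method signatures and the NumberResult class should be treated as assumptions and checked against the actual DataCleaner reference documentation:

@AnalyzerBean("Row counter (example)")
public class RowCountAnalyzer implements Analyzer<NumberResult> {

    @Configured
    InputColumn<?> column;

    private int count = 0;

    public void run(InputRow row, int distinctCount) {
        // consumes each record; distinctCount indicates how many identical rows this call represents
        count += distinctCount;
    }

    public NumberResult getResult() {
        // the analysis result, produced once the batch has finished
        return new NumberResult(count);
    }
}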

So, I wanted to build a Transformer. This was my first prototype, which I could put together quite quickly after reading the Embedding Groovy documentation; simply implementing the Transformer interface revealed what I needed to provide for DataCleaner to operate:
@TransformerBean("Groovy transformer (simple)")
public class GroovySimpleTransformer implements Transformer {

    @Configured
    InputColumn[] inputs;

    @Configured
    String code;

    private static final Logger logger = LoggerFactory.getLogger(GroovySimpleTransformer.class);

    private GroovyObject _groovyObject;

    public OutputColumns getOutputColumns() {
        return new OutputColumns("Groovy output");
    }

    public String[] transform(InputRow inputRow) {
        if (_groovyObject == null) {
            _groovyObject = compileCode();
        }
        final Map map = new LinkedHashMap();
        for (InputColumn input : inputs) {
            map.put(input.getName(), inputRow.getValue(input));
        }
        final Object[] args = new Object[] { map };
        String result = (String) _groovyObject.invokeMethod("transform", args);

        logger.debug("Transformation result: {}", result);
        return new String[] { result };
    }
    
    private GroovyObject compileCode() {
        // omitted
    }
}
Third step: Start testing. I believe a lot in unit testing your code, even at a very early stage. So the next thing I did was to implement a simple unit test. Notice that I make use of the MockInputColumn and MockInputRow classes from DataCleaner - these make it possible for me to test the Transformer as a unit without having to do integration testing (in that case I would have to start an actual batch job, which takes a lot more effort from both me and the machine):
public class GroovySimpleTransformerTest extends TestCase {

    public void testScenario() throws Exception {
        GroovySimpleTransformer transformer = new GroovySimpleTransformer();

        InputColumn col1 = new MockInputColumn("foo");
        InputColumn col2 = new MockInputColumn("bar");

        transformer.inputs = new InputColumn[] { col1, col2 };
        transformer.code = 
          "class Transformer {\n" +
          "  String transform(map){println(map); return \"hello \" + map.get(\"foo\")}\n" +
          "}";

        String[] result = transformer.transform(new MockInputRow().put(col1, "Kasper").put(col2, "S"));
        assertEquals(1, result.length);
        assertEquals("hello Kasper", result[0]);
    }
}
Great - this verifies that our Transformer is actually working.

Fourth step: Do the polishing that makes it look and feel like a usable component. It's time to build the extension and see how it works in DataCleaner. When the extension is bundled in a JAR file, you can simply click Window -> Options, select the Extensions tab and click Add extension package -> Manually install JAR file:


After registering your extension you will be able to find it in DataCleaner's Transformation menu (or if you built an Analyzer, in the Analyze menu).

In my case I discovered several sub-optimal aspects of my extension. Here's a list of them, and how I solved each one:
Problem: My transformer had only a default icon.
Solution: Icons can be defined by providing a PNG icon (32x32 pixels) with the same name as the transformer class, in the JAR file. In my case the transformer class was GroovySimpleTransformer.java, so I made an icon available at GroovySimpleTransformer.png.
Problem: The 'Code' text field was a single-line field and did not look like a code editing field.
Solution: Since the API is designed for Convention-over-Configuration, exposing the Groovy code as a plain String property was maybe a bit naive. There are two strategies to pursue if you have properties which need special rendering in the UI: provide more metadata about the property (quite easy), or build your own renderer for it (most flexible, but also more complex). In this case I was able to simply provide more metadata, using the @StringProperty annotation:
@Configured
@StringProperty(multiline = true, mimeType = "text/groovy")
String code
The default DataCleaner string property widget will then provide a multi-line text field with syntax coloring for the specific mime-type:
Problem: The compilation of the Groovy class was done when the first record hit the transformer, but ideally we would want to do it before the batch even begins.
Solution: This point is actually quite important, also to avoid race conditions in concurrent code and other nasty scenarios. Additionally, it helps DataCleaner validate the configuration before actually kicking off the batch job.

The trick is to add a method with the @Initialize annotation. If you have multiple items you need to initialize, you can even add more. In our case, it was quite simple:
@Initialize
public void init() {
    _groovyObject = compileCode();
}
Problem: The transformer was placed in the root of the Transformation menu.
Solution: This was fixed by applying the following annotation on the class, moving it into the Scripting category:
@Categorized(ScriptingCategory.class)
Problem: The transformer had no description text while hovering over it.
Solution: The description was added in a similar fashion, with a class-level annotation:
@Description("Perform a data transformation with the use of the Groovy language.")
Problem: After execution it would be good to clean up resources used by Groovy.
Solution: Similarly to the @Initialize annotation, I can also create one or more destruction methods, annotated with @Close. In the case of the Groovy transformer, there are some classloader-related items that can be cleared after execution this way.
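In code, the cleanup hook mirrors the @Initialize example above. A minimal sketch - the body here is just a placeholder, while the real extension releases Groovy classloader-related resources at this point:
@Close
public void close() {
    // placeholder cleanup; the actual extension clears Groovy classloader-related resources here
    _groovyObject = null;
}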
Problem: In a more advanced edition of the same transformer, I wanted to support multiple output records.
Solution: DataCleaner does support transformers that yield multiple (or even zero) output records. To achieve this, you can inject an OutputRowCollector instance into the Transformer:
public class MyTransformer implements Transformer<...> {
  @Inject
  @Provided
  OutputRowCollector collector;

  public void transform(InputRow row) {
    // output two records, each with two new values
    collector.putValues("foo", "bar");
    collector.putValues("hello", "world");
  }
}
Side-note: users of Hadoop might recognize the OutputRowCollector as similar to the output collector used by mappers in MapReduce. Transformers, like mappers, are indeed quite capable of executing in parallel.


Fifth step: When you're satisfied with the extension, publish it on the ExtensionSwap. Simply click the "Register extension" button and follow the instructions on the form. Your extension will then be available to everyone and make others in the community happy!

I hope you found this blog post useful as a way to get into DataCleaner extension development. I would be interested in any kind of comment regarding the extension mechanism in DataCleaner - please speak up and let me know what you think!

20121106

Data quality of Big Data - challenge or potential?

If you've been dealing with data in the last few years, you've inevitably also come across the concept of Big Data and NoSQL databases. What do these terms mean for data quality efforts? I believe they can have a great impact. But I also think they are difficult concepts to juggle because the understanding of them varies a lot.

For the purpose of this post, I will simply say that in my opinion Big Data and NoSQL expose us as data workers to two new general challenges:
  • Much more data than we're used to working with, leading to more computing time and requiring new techniques just to handle those amounts of data.
  • More complex data structures that are not necessarily based on a relational way of thinking.
So who will be exposed to these challenges? As a developer of tools, I am definitely exposed to both of them. But will the end user be challenged by both? I hope not. But people need good tools in order to do clever things with data quality.

The more-data challenge is primarily one of storage and performance. And since storage isn't that expensive, I mostly consider it a performance challenge. For the sake of argument, let's just assume that the tool vendors should be the ones mostly concerned about performance - if the tools and solutions are well designed by the vendors, then at least most of the performance issues should be tackled there.

But the complex data structures challenge is a different one. It has surfaced for a lot of reasons, including:
  • Favoring performance by e.g. eliminating the need to join database tables.
  • Favoring ease of use by allowing logical grouping of related data even though granularity might differ (e.g. storing order lines together with the order itself).
  • Favoring flexibility by not having a strict schema to validate against upon insertion.
Typically the structure of such records isn't tabular. Instead it is often based on key/value maps, like this:
{
  "orderNumber": 123,
  "priceSum": 500,
  "orderLines": [
    { "productId": "abc", "productName": "Foo", "price": 200 },
    { "productId": "def", "productName": "Bar", "price": 300 },
  ],
  "customer": { "id": "xyz", "name": "John Doe", "country": "USA" }
}
Looking at this you will notice some data structure features which are alien to the relational world:
  • The order lines and customer information are contained within the same record as the order itself. In other words: we have data types like arrays/lists (the "orderLines" field) and key/value maps (each order line element, and the "customer" field).
  • Each order line has a "productId" field, which probably refers to a detailed product record. But it also contains a "productName" field, since this is probably the most wanted piece of information when presenting/reading the record. In other words: redundancy is added for ease of use and for performance.
  • The "priceSum" field is also redundant, since we could easily derive it by visiting all the order lines and summing their prices. But here, too, redundancy is added for performance.
  • Some of the same principles apply to the "customer" field: it is a key/value type, and it contains both an ID and some redundant information about the customer.
Support for such data structures has traditionally been difficult for database tools. As tool producers, we're used to dealing with relational data, joins and the like. But in the world of Big Data and NoSQL we're up against the challenge of complex data structures, which requires traversal and selection within a single record.

In DataCleaner, these issues are primarily handled using the "Data structures" transformations, which I feel deserve a little attention:
Using these transformations you can both read and write the complex data structures that we see in Big Data projects. To explain the 8 "Data structures" transformations, I will try to define a few terms used in their naming:
  • To "Read" a data structure such as a list or a key/value map means to view each element in the structure as a separate record. This enables you to profile eg. all the orderlines in the order-record example.
  • To "Select" from a data structure such as a list or a key/value map means to retreive specific elements while retaining the current record granularity. This enables you to profile eg. the customer names of your orders.
  • Similarly you can "Build" these data structures. The record granularity of the built data structures is the same as the current granularity in your data processing flow.
  • And lastly I should mention JSON, because it is in the screenshot. JSON is probably the most common literal format for representing data structures like these, and indeed the example above is written in JSON. Obviously we support reading and writing JSON objects directly in DataCleaner.
Using these transformations we can overcome the challenge of handling the new data structures in Big Data and NoSQL.
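To make the "Read" and "Select" semantics concrete, here is a small plain-Java illustration operating on the order record from above. Note that this is not the DataCleaner API itself, just a sketch of what the two operations do to record granularity:

import java.util.List;
import java.util.Map;

public class DataStructureSemantics {

    // "Read" the orderLines list: each element becomes a separate record,
    // changing the granularity from one-record-per-order to one-record-per-orderline.
    public static void readOrderLines(Map<String, Object> order) {
        @SuppressWarnings("unchecked")
        List<Map<String, Object>> orderLines = (List<Map<String, Object>>) order.get("orderLines");
        for (Map<String, Object> line : orderLines) {
            System.out.println("order line record: " + line);
        }
    }

    // "Select" the customer name: a specific element is retrieved while the
    // current one-record-per-order granularity is retained.
    public static Object selectCustomerName(Map<String, Object> order) {
        @SuppressWarnings("unchecked")
        Map<String, Object> customer = (Map<String, Object>) order.get("customer");
        return customer.get("name");
    }
}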

Another important aspect that comes up whenever there's a whole new generation of databases is that the way you interface with them is different. In DataCleaner (or rather, in DataCleaner's little-sister project MetaModel) we've done a lot of work to make sure that the abstractions you already know can be reused. This means that querying and inserting records into a NoSQL database works exactly the same as with a relational database. The only difference, of course, is that new data types may be available, but that builds nicely on top of the existing relational concepts of schemas, tables, columns etc. At the same time we do all we can to optimize performance, so that the choice of querying technique is not left to the end user but is applied according to the situation. Maybe I'll try to explain how that works under the covers some other time!
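As a small sketch of what that reuse of abstractions looks like with MetaModel's fluent query API: the same code can run against a DataContext backed by a CSV file, a JDBC database or a NoSQL store. The package names are the current Apache MetaModel ones (the project has been renamed over time, so treat them as an assumption), and the table and column names are of course made up:

import org.apache.metamodel.DataContext;
import org.apache.metamodel.data.DataSet;
import org.apache.metamodel.data.Row;

public class MetaModelQuerySketch {

    // Works the same whether 'dataContext' wraps a relational database,
    // a CSV file or a NoSQL store - the table/column names here are made up.
    public static void printCustomerNames(DataContext dataContext) {
        DataSet dataSet = dataContext.query()
                .from("orders")
                .select("customer")
                .execute();
        try {
            while (dataSet.next()) {
                Row row = dataSet.getRow();
                System.out.println(row.getValue(0));
            }
        } finally {
            dataSet.close();
        }
    }
}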

20121004

Video: Introducing Data Quality monitoring

Every time we publish some work that I've been involved in, I breathe a big sigh of relief. This time it's not software but a video - and I think the end result is quite good, even though we are not movie-making professionals at Human Inference :)

Here's the stuff: Data Quality monitoring with DataCleaner 3

Enjoy! And do let us know what you think on our discussion forum or in our LinkedIn group.

20120920

DataCleaner 3 released!

Dear friends, users, customers, developers, analysts, partners and more!

After an intense period of development and a long wait, it is our pleasure to finally announce that DataCleaner 3 is available. We at Human Inference invite you all to our celebration! Impatient to try it out? Go download it right now!

So what is all the fuss about? Well, in all modesty, we think that with DataCleaner 3 we are redefining 'the premier open source data quality solution'. With DataCleaner 3 we've embraced a whole new functional area of data quality, namely data monitoring.

Traditionally, DataCleaner has its roots in data profiling. Over the years we've added several related functions: transformations, data cleansing, duplicate detection and more. With data monitoring we basically deliver all of the above, but in a continuous environment for analyzing, improving and reporting on your data. Furthermore, we deliver these functions in a centralized web-based system.

So how will the users benefit from this new data monitoring environment? We've tried to answer this question using a series of images:

Monitor the evolution of your data:

Share your data quality analysis with everyone:

Continuously monitor and improve your data's quality:

Connect DataCleaner to your infrastructure using web services:


The monitoring web application is a fully fledged environment for data quality, covering several functional and non-functional areas:
  • Display of timeline and trends of data quality metrics
  • Centralized repository for managing and containing jobs, results, timelines etc.
  • Scheduling and auditing of DataCleaner jobs
  • Providing web services for invoking DataCleaner transformations
  • Security and multi-tenancy
  • Alerts and notifications when data quality metrics are out of their expected comfort zones.

Naturally, the traditional desktop application of DataCleaner continues to be the tool of choice for expert users and one-time data quality efforts. We've even enhanced the desktop experience quite substantially:
  • There is a new Completeness analyzer which is very useful for simply identifying records that have incomplete fields.
  • You can now export DataCleaner results to nice-looking HTML reports that you can give to your manager, or send to your XML parser!
  • The new monitoring environment is also closely integrated with the desktop application. Thus, the desktop application now has the ability to publish jobs and results to the monitor repository, and to be used as an interactive editor for content already in the repository.
  • New date-oriented transformations are now available: Date range filter, which allows you to subset datasets based on date ranges, and Format date, which allows you to format a date using a date mask.
  • The Regex Parser (which was previously only available through the ExtensionSwap) has now been included in DataCleaner. This makes it very convenient to parse and standardize rich text fields using regular expressions.
  • There's a new Text case transformer available. With this transformation you can easily convert between upper/lower case and proper capitalization of sentences and words.
  • Two new search/replace transformations have been added: Plain search/replace and Regex search/replace.
  • The user experience of the desktop application has been improved. We've added several in-application help messages, made the colors look brighter and clearer and improved the font handling.

More than 50 features and enhancements were implemented in this release, in addition to incorporating several hundred upstream improvements from the projects we depend on.

We hope you will enjoy everything that is new about DataCleaner 3. And do watch out for follow-up material in the coming weeks and months. We will be posting more and more online material and examples to demonstrate the wonderful new features that we are very proud of.

20120819

Gartner's Magic Quadrant for Data Quality Tools


Over the last two years I've been obsessed with Gartner and their analyses of the data quality market. It's not that they're always right in every aspect, but a lot of the time they are, and in addition they are very prominent opinion makers. And in all honesty, a huge part of one's personal reasons for doing open source development is to get recognition from one's peers.

While leading the DataCleaner development and other projects at Human Inference, we definitely pay a lot of attention to what Gartner is saying about us. Gartner was important to me even before I joined Human Inference. Their mention of DataCleaner in their Who's who in Open Source Data Quality report from 2009 was the initial trigger for contact with a lot of people, including Winfried van Holland, CEO of Human Inference. So on a personal level I have a lot to thank Gartner for.

Therefore it is with great pride that I see that Human Inference has been promoted to the visionary quadrant of Gartner's annual Magic Quadrant for Data Quality Tools report, which just came out (get it from Gartner here). That's exactly where I think we deserve to be.

At the same time I see that they mention DataCleaner specifically as one of the strong points for Human Inference. This is because of the licensing model it allows us to offer, and because of the easy-to-use interface it provides to data quality professionals. Additionally, our integration with Pentaho Data Integration (read more) and the application of our cloud data quality platform (read more) are mentioned as strong points.

This is quite a recognition compared to our last review by Gartner. In their 2012 update of the Who's who in Open Source Data Quality (get it from Gartner here), Gartner criticized the DataCleaner project with a generally negative attitude and on a range of grounds I consider false. In particular, my feeling is that certain misunderstandings about the integration of DataCleaner with the rest of Human Inference's offerings caused our Gartner rating to be undervalued at that time. Hopefully the next update to that report will reflect their recent, more enlightened view. I should mention that the two reports are rather independent of each other, so this is just speculation on my side.

As such, the Gartner reports have proven to be a wonderful source of information about competing products and market demand. Our current plans for DataCleaner 3 are indeed also influenced by Gartner's (and our own) description of data monitoring as a key functional area in the space of data quality tools.

I am deeply grateful for Gartner's treatment of DataCleaner over time. "You've got to take the good with the bad", and for me that's been a great way of continuously improving our products. A good reception in the end makes all the trouble worthwhile.