<h1>kasper's source</h1>
<p><i>Random thoughts, examples, tutorials and ideas on open source software, data quality, data warehousing, Java programming, querying and more. By Kasper Sørensen.</i></p>
<hr />
<h1>DataCleaner 4 - a high-quality user experience to provide high-quality data</h1>
<p><i>2015-02-27</i></p>
<div dir="ltr" style="text-align: left;" trbidi="on">
It's been a long time since I've blogged about <a href="http://datacleaner.github.io/">DataCleaner</a>, and that's partly because I've been busy on other projects - partly because the thing that I'm going to blog about now has been a very long time in the making. We're right now in the final stages of building DataCleaner version 4, which is (in my opinion) going to be a pretty disruptive move for the tool - especially for its usability and user experience.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-ZPpXMAMSlrI/VPCJKqhY9VI/AAAAAAAAIco/E510EYn54Dg/s1600/welcome.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="353" src="https://4.bp.blogspot.com/-ZPpXMAMSlrI/VPCJKqhY9VI/AAAAAAAAIco/E510EYn54Dg/s1600/welcome.png" width="400" /></a></div>
<br />
The UI of DataCleaner is changing in many ways. The moment you start DataCleaner 4 you will see that the initial start screen has been simplified, beautified and is in general a lot less busy than in previous versions. We also focus on what you want to get done by offering quick start options, such as answering the questions "Are my addresses correct and up-to-date?" or, more openly, "What can you tell me about my data?". More such options are going to be added, by the way...<br />
<br />
Registering and selecting your data in DataCleaner 4 is also a whole lot easier.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-l6BQY30S-30/VPCKNpb_LPI/AAAAAAAAIcw/J63fKW3chBk/s1600/select_ds.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="291" src="https://3.bp.blogspot.com/-l6BQY30S-30/VPCKNpb_LPI/AAAAAAAAIcw/J63fKW3chBk/s1600/select_ds.png" width="400" /></a></div>
<br />
When you start building your job, the way of working with it has undergone a drastic change... For the better! We've introduced a graph-based canvas, which means that what you work with is a process flow that, in my opinion (and in fact in the opinion of <i>everyone</i> we've talked to about this), is a lot more intuitive and matches the mental model of our users.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-iRgVHazrab0/VPCKtRYKICI/AAAAAAAAIc4/i4-LpayILm4/s1600/job_building.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="291" src="https://1.bp.blogspot.com/-iRgVHazrab0/VPCKtRYKICI/AAAAAAAAIc4/i4-LpayILm4/s1600/job_building.png" width="400" /></a></div>
<br />
The components/functions that you want to apply to your job are positioned in the left-side tree and can now just be dragged on to the canvas. Draw lines between them and you start to design the data quality analysis job that you need. It's quite simple really.<br />
<br />
There's a bunch more we want to do, of course - that's why it's not released yet. But for curious minds, you can already get it as an <a href="http://datacleaner.github.io">early access download here</a>. I hope to get some good review/feedback remarks. Let us know how we're doing :-)</div>
<hr />
<h1>MetaModel 4.2 to offer 'Schema Inference'</h1>
<p><i>2014-07-14</i></p>
<div dir="ltr" style="text-align: left;" trbidi="on">
For the upcoming version 4.2 of <a href="http://metamodel.incubator.apache.org/">Apache MetaModel (incubating)</a> I have been working on adding a JSON module to the already quite broad data access library. The purpose of the new JSON module is to be able to read text files with JSON contents and present them as if they were a database you can explore and query. This (again) presented an issue in MetaModel:<br />
<b><br /></b>
<b>How do we present a queryable schema for a datastore which does not intrinsically have a schema?</b><br />
<br />
A couple of times (for instance when implementing the MongoDB, CouchDB, HBase and XML modules) we have faced this issue, and the answers have varied a little depending on the particular datastore. This time I didn't feel like creating another one-off strategy for building the schema. Rather, it felt like a good time to introduce a common abstraction - called the 'schema builder' - which allows for several standard implementations as well as pluggable ways of building or inferring the schema structure.<br />
<br />
The new version of MetaModel will thus have a SchemaBuilder interface which you can implement yourself to plug in a mechanism for building/defining the schema. Even better, there are a couple of quite useful standard implementations.<br />
<br />
To explain this I need some examples. Assume you have these documents in your JSON file:<br />
<br />
<pre class="prettyprint lang-js">{"name":"Kasper Sørensen", "type":"person", "country":"Denmark"}
{"name":"Apache MetaModel", "type":"project", "community":"Apache"}
{"name":"Java", "type":"language"}
{"name":"JavaScript", "type":"language"}
</pre>
<br />
Here we have 4 documents/rows containing various things. All have a <b>name</b> and a <b>type</b> field, but other fields such as <b>country</b> or <b>community</b> seem optional or situational.<br />
<br />
In many situations you would want a schema model containing a single table with all documents. That is possible in two ways. Either in a columnized form (implemented via <b>SingleTableInferentialSchemaBuilder</b>):<br />
<br />
<table cellpadding="2" cellspacing="0" class="pretty-table">
<tbody>
<tr><th>name</th><th>type</th><th>country</th><th>community</th></tr>
<tr><td>Kasper Sørensen</td><td>person</td><td>Denmark</td><td></td></tr>
<tr><td>Apache MetaModel</td><td>project</td><td></td><td>Apache</td></tr>
<tr><td>Java</td><td>language</td><td></td><td></td></tr>
<tr><td>JavaScript</td><td>language</td><td></td><td></td></tr>
</tbody></table>
<br />
Or having a single column of type MAP (implemented via <b>SingleMapColumnSchemaBuilder</b>):<br />
<br />
<table cellpadding="2" cellspacing="0" class="pretty-table">
<tbody>
<tr><th>value</th></tr>
<tr><td>{name=Kasper Sørensen, type=person, country=Denmark}</td></tr>
<tr><td>{name=Apache MetaModel, type=project, community=Apache}</td></tr>
<tr><td>{name=Java, type=language}</td></tr>
<tr><td>{name=JavaScript, type=language}</td></tr>
</tbody></table>
<br />
The latter approach of course has the advantage that it allows for the same polymorphic behaviour as the JSON documents themselves, but it doesn't offer the nice SQL-like query syntax of the first approach. The first approach, on the other hand, needs to do an initial analysis to infer the structure based on a sample of observed documents.<br />
<br />
A third approach is to build multiple virtualized tables by splitting on distinct values of the "type" field. In our example, it seems that this "type" field is there to distinguish separate types of documents, which means we can build tables like this (implemented via <b>MultiTableInferentialSchemaBuilder</b>):<br />
<br />
Table 'person'
<table cellpadding="2" cellspacing="0" class="pretty-table">
<tbody>
<tr><th>name</th><th>country</th></tr>
<tr><td>Kasper Sørensen</td><td>Denmark</td></tr>
</tbody></table>
<br /><br />
Table 'project'
<table cellpadding="2" cellspacing="0" class="pretty-table">
<tbody>
<tr><th>name</th><th>community</th></tr>
<tr><td>Apache MetaModel</td><td>Apache</td></tr>
</tbody></table>
<br /><br />
Table 'language'
<table cellpadding="2" cellspacing="0" class="pretty-table">
<tbody>
<tr><th>name</th></tr>
<tr><td>Java</td></tr>
<tr><td>JavaScript</td></tr>
</tbody></table>
<br/>
<br/>
As you can see, the standard implementations of the schema builder offer quite some flexibility. The feature is still new and can certainly be improved and made more elaborate. But I think this is a great addition to <a href="http://metamodel.incubator.apache.org/">MetaModel</a> - to be able to dynamically define and inject rules for accessing schema-less structures.
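<br />
<br />
To give an impression of how this looks in code, here is a minimal sketch of combining the JSON module with one of the inferential schema builders. Note that the exact constructor signatures (for instance whether the multi-table builder takes the discriminator field name as an argument) are my assumptions based on the description above - check the javadocs of the released 4.2 API:<br />
<br />
<pre class="prettyprint lang-java">import org.apache.metamodel.data.DataSet;
import org.apache.metamodel.json.JsonDataContext;
import org.apache.metamodel.schema.builder.MultiTableInferentialSchemaBuilder;
import org.apache.metamodel.util.FileResource;

// infer one virtual table per distinct "type" value
// (the constructor argument is an assumption - see the note above)
JsonDataContext dataContext = new JsonDataContext(
        new FileResource("documents.json"),
        new MultiTableInferentialSchemaBuilder("type"));

// query the inferred 'person' table with the regular MetaModel query API
DataSet dataSet = dataContext.query()
        .from("person")
        .select("name", "country")
        .execute();
try {
    while (dataSet.next()) {
        System.out.println(dataSet.getRow());
    }
} finally {
    dataSet.close();
}</pre>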
</div>
<hr />
<h1>Using Apache MetaModel for web applications, some experiences</h1>
<p><i>2014-06-09</i></p>
<div dir="ltr" style="text-align: left;" trbidi="on">
<p>Recently I’ve been involved in the development of two separate web applications, both built with <a href="http://metamodel.incubator.apache.org/">Apache MetaModel</a> as the data access library. In this blog post I want to share some of the good and bad things about that experience.</p>
</div>
<h2>How we configure the webapp</h2>
<p>Before we look at what went well and what went badly, let’s first go quickly through the anatomy of the web apps. Both of them have a "primary" database and some "periphery" databases or files for storage of purpose-specific items. We use MetaModel to access both types of databases, but obviously the demands are quite different.</p>
<p>To make the term "periphery" database understandable, let’s take an example: In one of the applications the user submits data which may contain country names. We have a large catalog of known synonyms for country names (like UK, Great Britain, United Kingdom, England etc.) but every once in a while we are unable to match some input – that input is stored in a periphery CSV file so that we can analyze it once in a while and figure out if the input was garbage or if we’re missing some synonyms. Similarly, other periphery databases may be used to monitor the success rate of <a href="http://en.wikipedia.org/wiki/A/B_testing">A/B tests</a> (usability experiments) or similar things. None of this data do we want to put into the "primary" database, since the life-cycle of the data is very different.</p>
<p>In the cases I’ve worked on, we’ve been using a RDBMS (<a href="http://www.postgresql.org/">PostgreSQL</a> to be specific) as the "primary" database and CSV files, Salesforce.com and CouchDB as "periphery" databases.</p>
<p>The primary database is configured using Spring. One of the requirements we have is the ability to externalize all connection information, and we leverage Spring’s property placeholder for this:</p>
<pre class="prettyprint lang-xml">
<context:property-placeholder location="file:///${user.home}/datastore.properties" />
<bean class="org.apache.metamodel.spring.DataContextFactoryBean">
<property name="type" value="${datastore.type}" />
<property name="driverClassName" value="${datastore.driver}" />
<property name="url" value="${datastore.url}" />
<property name="username" value="${datastore.username}" />
<property name="password" value="${datastore.password}" />
</bean>
</pre>
<p>If you prefer, you can also configure a traditional javax.sql.DataSource and inject it into this factory bean under the property name ‘dataSource’. But with the approach above even the 'type' of the datastore is externalized, meaning that we could potentially switch our web application’s primary database to MongoDB or something like that, just by changing the properties.</p>
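<p>For example, such a DataSource-based configuration could look like the following sketch (using Commons DBCP's BasicDataSource here, but any javax.sql.DataSource implementation would work the same way):</p>
<pre class="prettyprint lang-xml">
<bean id="dataSource" class="org.apache.commons.dbcp.BasicDataSource">
  <property name="driverClassName" value="${datastore.driver}" />
  <property name="url" value="${datastore.url}" />
  <property name="username" value="${datastore.username}" />
  <property name="password" value="${datastore.password}" />
</bean>

<bean class="org.apache.metamodel.spring.DataContextFactoryBean">
  <property name="dataSource" ref="dataSource" />
</bean>
</pre>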
<h2>Implementing a backend component using MetaModel</h2>
<p>When we need to get access to data, we simply inject that DataContext. If we also need to modify the data, the sub-interface UpdateableDataContext is injected instead. Here’s an example:</p>
<pre class="prettyprint lang-java">
@Component
public class UserDao {
private final UpdateableDataContext _dataContext;
@Autowired
public UserDao(UpdateableDataContext dataContext) {
_dataContext = dataContext;
}
public User getUserById(Number userId) {
DataSet ds = _dataContext.query()
.from("users")
.select("username", "name")
.where("id").eq(userId)
.execute();
try {
if (!ds.next()) {
return null;
}
Row row = ds.getRow();
return new User(userId, row.getValue(0), row.getValue(1));
} finally {
ds.close();
}
}
}
</pre>
<p>In reality we use a couple of String-constants and so on here, to avoid typos slipping into e.g. column names. One of the good parts here is that we’re completely in control of the query plan, the query building is type-safe and neat, and the DataContext object being injected is not tied to any particular backend. We can run this query in a test using a completely different type of backend without issues. More about testing in a jiffy.</p>
<h2>Automatically creating the schema model</h2>
<p>To make deployment as easy as possible, we also ensure that our application can automatically build the tables it needs upon startup. In existing production environments the tables will already be there, but for new deployments and testing, this capability is great. We’ve implemented this as a Spring bean that has a @PostConstruct method to do the bootstrapping of new DataContexts. Here’s how we could build a "users" table:</p>
<pre class="prettyprint lang-java">
@Component
public class DatastoreBootstrap {
private final UpdateableDataContext _dataContext;
@Autowired
public UserDao(UpdateableDataContext dataContext) {
_dataContext = dataContext;
}
@PostConstruct
public void initialize() {
Schema schema = _dataContext.getDefaultSchema();
if (schema.getTable("users") == null) {
CreateTable createTable = new CreateTable(schema, “users”);
createTable.withColumn("id").ofType(ColumnType.INTEGER).asPrimaryKey();
createTable.withColumn("username").ofType(ColumnType.VARCHAR).ofSize(64);
createTable.withColumn("password_hash").ofType(ColumnType.VARCHAR).ofSize(64);
createTable.withColumn("name").ofType(ColumnType.VARCHAR).ofSize(128);
_dataContext.executeUpdate(createTable);
}
_dataContext.refreshSchemas();
}
}
</pre>
<p>So far we’ve demoed stuff that honestly can also be done by many, many other persistence frameworks. But now it gets exciting, because in terms of testability I believe Apache MetaModel has something to offer which almost no other data access library has...</p>
<h2>Testing your components</h2>
<p>Using a PojoDataContext (a DataContext based on in-memory Java objects) we can bootstrap a virtual environment for our testcase that has an extremely low footprint compared to normal integration testing. Let me demonstrate how we can test our UserDao:</p>
<pre class="prettyprint lang-java">
@Test
public void testGetUserById() {
// set up test environment.
// the test datacontext is an in-memory POJO datacontext
UpdateableDataContext dc = new PojoDataContext();
new DatastoreBootstrap(dc).initialize();
// insert a few user records for the test only.
Table usersTable = dc.getDefaultSchema().getTable("users");
dc.executeUpdate(new InsertInto(usersTable).value("id", 1233).value("name", "John Doe"));
dc.executeUpdate(new InsertInto(usersTable).value("id", 1234).value("name", "Jane Doe"));
// perform test operations and assertions.
UserDao userDao = new UserDao(dc);
User user = userDao.getUserById(1234);
assertEquals(1234, user.getId();
assertEquals("Jane Doe", user.getName());
}
</pre>
<p>This is in my opinion a real strength of MetaModel. First we insert some records (physically represented as Java objects) into our data context, and then we can test the querying and everything without even having a real database engine running.</p>
<h2>Further evaluation</h2>
<p>The examples above are of course just part of the experiences of building a few webapps on top of Apache MetaModel. I noted down a lot of stuff during the development, of which I can summarize in the following pros and cons list.</p>
<p>Pros:</p>
<ul>
<li>Testability with POJOs</li>
<li>It’s also easy to facilitate integration testing or manual monkey testing using e.g. <a href="http://db.apache.org/derby/">Apache Derby</a> or the <a href="http://www.h2database.com/">H2 database</a>. We have a main method in our test-source that will launch our webapp and have it running within a second or so.</li>
<li>We use the same API for different kinds of databases.
<ul><li>When we do A/B testing, the metrics we are interested in change a lot. So those results are stored in a <a href="http://couchdb.apache.org/">CouchDB</a> database instead, because of its dynamic schema nature.</li>
<li>For certain unexpected scenarios or exceptional values, we store data for debugging and analytical needs in CSV files, also using MetaModel. Having stuff like that in files makes it easy to share with people who want to manually inspect what is going on.</li></ul>
</li>
<li>Precise control over queries and over the scope of update scripts (transactions) is a great benefit. Compared with many Object-Relational-Mapping (ORM) frameworks, this feels like returning home and not having to worry about cascading effects of your actions. You get what you ask for and nothing more.</li>
<li>Concerns like SQL injection are already taken care of by MetaModel. Proper escaping and handling of difficult cases in query literals is not your concern.</li>
</ul>
<p>Cons:</p>
<ul>
<li>At this point Apache MetaModel does not have <i>any</i> object mapping facilities at all. While we do not want an ORM framework, we could definitely use some basic "Row to Object" mapping to reduce boilerplate (see the sketch after this list).</li>
<li>There’s currently no API in MetaModel for altering tables. This means that we still have a few cases where we do this using plain SQL, which of course is not portable and therefore not as easily testable. But we manage to isolate this to just a few places in the code, since altering tables is quite unusual.</li>
<li>In one of our applications we have the unusual situation that one of the databases can be altered at runtime by a different application. Since MetaModel caches the schema structure of a DataContext, such a change is not automatically reflected in our code. We can call DataContext.refreshSchemas() to flush our cached model, but obviously that needs to happen intelligently. This is <i>only</i> a concern for databases that have tables altered at runtime, though (which is in my opinion quite rare).</li></ul>
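<p>To illustrate the first point, the kind of "Row to Object" mapping we end up hand-writing could be captured by a tiny helper like the one below. This is purely hypothetical - neither the RowMapper interface nor the mapAll method exist in MetaModel - but it shows the convenience we are missing:</p>
<pre class="prettyprint lang-java">
// hypothetical helper - not part of the MetaModel API
public interface RowMapper<T> {
    T map(Row row);
}

public static <T> List<T> mapAll(DataSet dataSet, RowMapper<T> mapper) {
    List<T> result = new ArrayList<T>();
    try {
        // map every row of the data set and make sure it gets closed afterwards
        while (dataSet.next()) {
            result.add(mapper.map(dataSet.getRow()));
        }
    } finally {
        dataSet.close();
    }
    return result;
}
</pre>
<p>With something like this available, the getUserById method shown earlier would shrink to a query plus a one-line mapper.</p>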
<p>I hope this may be useful for you to evaluate the library. If you have questions or remarks, or just feel like getting closer to the Apache MetaModel project, I strongly encourage you to join our <a href="http://metamodel.incubator.apache.org/#community">dev mailing list</a> and raise all your ideas, concerns and questions!</p>
<hr />
<h1>Introducing Apache MetaModel</h1>
<p><i>2013-06-18</i></p>
<div dir="ltr" style="text-align: left;" trbidi="on">
Recently we were able to announce an important milestone in the life of our project <a href="http://metamodel.eobjects.org/" target="_blank">MetaModel</a> - it is being <a href="http://eobjects.org/trac/blog/apache-metamodel-incubation" target="_blank">incubated into the Apache Software Foundation</a>! Obviously this generates a lot of new attention for the project, and causes lots and lots of questions on what MetaModel is good for. We didn't donate this project to Apache just for fun, but because we wanted to maximize its value, both for us and for the industry as a whole. So in this post I'll try to explain the scope of MetaModel, how we use it at Human Inference, and what you might use it for in your products or services.<br />
<br />
<div class="separator" style="border: none; clear: both; text-align: center;">
<img border="0" src="http://2.bp.blogspot.com/-dhuwa7AkfQQ/UcAoXMo1P0I/AAAAAAAAB7U/yrskyz2QUJA/s1600/apache-metamodel.png" style="border: none;" /></div>
<br />
First, let's recap the one-liner for MetaModel:<br />
<blockquote class="tr_bq">
<span style="background-color: #f6f6f6; font-family: Ubuntu, sans-serif; line-height: 16.890625px;">MetaModel is a library that </span><b style="background-color: #f6f6f6; font-family: Ubuntu, sans-serif; font-size: small; line-height: 16.890625px;">encapsulates</b><span style="background-color: #f6f6f6; font-family: Ubuntu, sans-serif; line-height: 16.890625px;"> the </span><i style="background-color: #f6f6f6; font-family: Ubuntu, sans-serif; line-height: 16.890625px;">differences</i><span style="background-color: #f6f6f6; font-family: Ubuntu, sans-serif; line-height: 16.890625px;"> and </span><b style="background-color: #f6f6f6; font-family: Ubuntu, sans-serif; line-height: 16.890625px;">enhances</b><span style="background-color: #f6f6f6; font-family: Ubuntu, sans-serif; line-height: 16.890625px;"> the </span><i style="background-color: #f6f6f6; font-family: Ubuntu, sans-serif; line-height: 16.890625px;">capabilities</i><span style="background-color: #f6f6f6; font-family: Ubuntu, sans-serif; line-height: 16.890625px;"> of different datastores.</span></blockquote>
In other words - it's all about making sure that the way you work with data is standardized, reusable and smart.<br />
<br />
But wait, don't we already have things like Object-Relational-Mapping (ORM) frameworks to do that? After all, a framework like <a href="http://openjpa.apache.org/" target="_blank">OpenJPA</a> or <a href="http://www.hibernate.org/" target="_blank">Hibernate</a> will allow you to work with different databases without having to deal with the different SQL dialects etc. The answer is of course <i>yes</i>, you can use such frameworks for ORM, but MetaModel is by choice not an ORM! An ORM assumes an application domain model, whereas MetaModel, as its name implies, treats the datastore's metadata as its model. This not only allows for much more dynamic behaviour, but it also means MetaModel is mainly applicable to a range of specific application types that deal with more or less arbitrary data, or dynamic data models, as their domain.<br />
<br />
At <a href="http://www.humaninference.com/" target="_blank">Human Inference</a> we build just this kind of software products, so we have great use of MetaModel! The two predominant applications that use MetaModel in our products are:<br />
<ul style="text-align: left;">
<li><b>HIquality Master Data Management (MDM)</b><br />Our <a href="http://www.humaninference.com/master-data-management" target="_blank">MDM solution</a> is built on a very dynamic data model. With this application we want to allow multiple data sources to be consolidated into a single view - typically to create an aggregated list of customers. In addition, we take in third-party sources and enrich the source data with them. So as you can imagine there's a lot of mapping of data models going on in MDM, and also quite a wide range of database technologies. MetaModel is one of the cornerstones to making this happen. Not only does it mean that we can onboard data from a wide range of sources; it also means that these sources can vary a lot from each other and that we can map them using metadata about fields, tables etc.</li>
<li><b>DataCleaner</b><br />Our open source data quality toolkit <a href="http://datacleaner.org/" target="_blank">DataCleaner</a> is obviously also very dependent on MetaModel. Actually MetaModel started as a kind of derivative project from the DataCleaner project. In DataCleaner we allow the user to rapidly register new data and immediately build analysis jobs using it. We wanted to avoid building code that's specific to any particular database technology, so we created MetaModel as the abstraction layer for reading data from almost anywhere. Over time it has grown into richer and richer querying capabilities as well as write access, making MetaModel essentially a full CRUD framework for ... anything.</li>
</ul>
<div>
I've often pondered on the question of <i>What could other people be using MetaModel for</i>? I can obviously only provide an open answer, but some ideas that have popped up (in my head or in the community's) are:</div>
<div>
<ul style="text-align: left;">
<li><b>Any application that needs to onboard/intake data of multiple formats</b><br />Oftentimes people need to be able to import data from many sources, files etc. MetaModel makes it easy to "code once, apply on any data format", so you save a lot of work. This use-case is similar to what the <a href="http://www.datawarehousemanagement.org/" target="_blank">Quipu project</a> is using MetaModel for.</li>
<li><b>Model Driven Development (MDD) tools</b><br />Design tools for MDD are often used to build domain models and at some point translate them to the physical storage layer. MetaModel provides not only "live" metadata from a particular source, but also in-memory structures for e.g. building and mutating virtual tables, schemas, columns and other metadata. By encompassing both the virtual and the physical layer, MetaModel provides a lot of the groundwork to build MDD tools on top of it.</li>
<li><b>Code generation tools</b><br />Tools that generate code (or other digital artifacts) based on data and metadata. Traversing metadata in MetaModel is very easy and uniform, so you could use this information to build code, XML documents etc. to describe or utilize the data at hand.</li>
<li><b>An ORM for anything</b><br />I said just earlier that MetaModel is <i>NOT</i> an ORM. But if you look at MetaModel from an architectural point of view, it could very well serve as the data access layer of an ORM's design. Obviously all the object-mapping would have to be built on top of it, but then you would also have an ORM that maps to not just JDBC databases, like most ORMs do, but also to file formats, NoSQL databases, Salesforce and more!</li>
<li><b>Open-ended analytical applications</b><br />Say you want to figure out e.g. whether particular words appear in a file, a database or whatever. You would have to know a lot about the file format, the database fields or similar constructs. But with MetaModel you can instead automate this process by traversing the metadata and querying whatever fields match your predicates. This way you can build tools that "just take a file" or "just take a database connection" and let them loose to figure out their own query plan and so on (see the sketch after this list).</li>
</ul>
</div>
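<div>
A minimal sketch of what such uniform metadata traversal looks like (written against the Apache-era, 4.x-style API in which the metadata getters return arrays; at the time of this post the packages still lived under org.eobjects.metamodel, and the CSV factory call is just an example - any other DataContext works the same way):</div>
<pre class="prettyprint lang-java">
import java.io.File;
import org.apache.metamodel.DataContext;
import org.apache.metamodel.DataContextFactory;
import org.apache.metamodel.schema.Column;
import org.apache.metamodel.schema.Schema;
import org.apache.metamodel.schema.Table;

public class MetadataExplorer {
    public static void main(String[] args) {
        // works the same for CSV files, JDBC databases, NoSQL stores etc.
        DataContext dataContext = DataContextFactory.createCsvDataContext(new File("anything.csv"));
        for (Schema schema : dataContext.getSchemas()) {
            for (Table table : schema.getTables()) {
                for (Column column : table.getColumns()) {
                    System.out.println(table.getName() + "." + column.getName() + " : " + column.getType());
                }
            }
        }
    }
}
</pre>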
<div>
If I am to point at a few buzzwords these days, I would say MetaModel can play a critical role in implementing services for things such as <b>data federation</b>, <b>data virtualization</b>, <b>data consolidation</b>, <b>metadata management</b> and <b>automation</b>.<br />
<br />
And obviously this is also part of our incentive to make MetaModel available for everyone, under the Apache license. We are in this industry to make our products better and believe that cooperation to build the best foundation will benefit both us and everyone else that reuses it and contributes to it.</div>
</div>
<hr />
<h1>What's the fuzz about National Identifier matching?</h1>
<p><i>2013-04-22</i></p>
<div dir="ltr" style="text-align: left;" trbidi="on">
The topic of national identification numbers (also sometimes referred to as social security numbers) is something that can spawn a heated debate at my workplace. Coming from Denmark, but having an employer in the Netherlands, I am exposed to two very different ways of thinking about the subject. But regardless of our differences on the subject - while developing a product for MDM of customer data, like Human Inference is doing, you need to understand the implications of both approaches - I would almost call them the "with" and "without" national identifiers implementations of MDM!<br />
<div style="float: right;">
<img border="0" src="http://1.bp.blogspot.com/-FNgv7rfq2CM/UXWFFGWG8FI/AAAAAAAABTs/TmoMPYWIyF8/s1600/passport.png" style="border: none;" /></div>
<br />
In Denmark we use our national identifiers a lot - the CPR numbers for persons and the CVR numbers for companies. Previously I wrote <a href="http://kasper.eobjects.org/2013/01/cleaning-danish-customer-data-with-cvr.html" target="_blank">a series of blog posts on how to use CVR and CPR for data cleansing</a>. In Denmark, private companies are allowed to collect these identifiers, although consumers can opt to say "no thanks" in most cases. But all in all, it means that we have our IDs out in society, not locked up in our bedroom closet.<br />
<br />
In the Netherlands the use of such identifiers is prohibited for almost all organizations. While talking to my colleagues I get the sense that this ID is thought of more as a password than as a key. Commercial use of the ID would be like giving up your basic privacy liberties, and most people don't remember it by heart. In contrast, Danish citizens typically do share their credentials, but are quite aware of the privacy laws that companies are obligated to follow when receiving this information.<br />
<br />
So what is the end result for data quality and MDM? Well, it highly affects how organizations will be doing two of the most complex operations in an MDM solution: Data cleansing and data matching (aka deduplication).<br />
<br />
I recently had my second daughter, and immediately after her birth I could not help noticing that the hospital crew gave us a sheet of paper with her CPR number. This was just an hour or two after she drew her first breath, and a long time before she even had an official name! In data quality terms, this was a nicely designed first-time-right (FTR, a common principle in DQ and MDM) system in action!<br />
<br />
When I work with Danish customers it usually means that you spend a lot of time on verifying that the IDs of persons and companies are in fact the <i>correct</i> IDs. Like any other attribute, you might have typos, formatting glitches etc. And since we have multiple registries and number types (CVR numbers, P numbers, EAN numbers etc.), you also spend quite some time on inferring "what is this number?". You would typically look up the companies' names in the public CVR registry and make sure they match the names in your own data. If not, the ID is probably wrong and you need to delete it or obtain a correct one. While finding duplicates you can typically standardize the formatting of the IDs and do exact matching for the most part, except for those cases where the ID is missing.<br />
<br />
When I work with Dutch customers it usually means that we cleanse the individual attributes of a customer in a much more rigorous manner. The name is cleansed on its own. Then the address. Then the phone number, and then the email and other similar fields. You'll end up knowing whether each element is valid, but not whether the whole record is actually a cohesive chunk of data. You can apply a lot of cool inferential techniques to check that the data is cohesive (for instance it is plausible that I, Kasper Sørensen, have the email i.am.<b>kasper.sorensen</b>@gmail.com), but you won't know if it is also my address that you can find in the data, or if it's just <i>some</i> valid address.<br />
<br />
Of course the grass isn't quite as green in Denmark as I present it here. Unfortunately we also have problems with CPR and CVR, and in general I doubt that there will be one single source of truth that we can do reference data checks against in the near future. For instance, a change of address typically shows up quite delayed in the national registries, whereas it is registered much quicker at the post agencies. And although I think you can share your email and similar attributes through CPR - in practice that's not what people do. So actually you need an MDM hub which connects to several sources of data and then picks and chooses from the ones that you trust the most for individual pieces of the data. The great thing is that in Denmark we have a much clearer way to do data interchange between data services, since we do have a common type of key for the basic entities. This opens the way for very interesting reference data hubs like for instance <a href="http://instantdq.com/" target="_blank">iDQ</a>, which in turn makes it easier for us to consolidate some of our integration work.<br />
<br />
Coming back to the more ethical question: Are the Danish national identifiers a threat to our privacy? Or are they just a more modern and practical way to reference and share basic data? For me the winning argument for the Danish model is in the term "basic data". We do share basic data through CPR/CVR, but you can't access any of my compromising or transactional data. In comparison, I fear much more for my privacy when sharing data through Facebook, Google and so on. Sure, if you had my CPR number, you would also be able to find out where I live. But I wouldn't share my CPR number with you if I did not want to provide that information, and after all, sharing information - CPR, addresses or anything else - <i>always</i> comes with the risk of leaking that information to other parties. Finally, as an MDM professional I must say: combining information from multiple sources - be it public or private registries - isn't exactly something new, so privacy concerns are in my opinion largely the same in all countries.<br />
<br />
But it does mean that implementations of MDM are highly likely to differ a lot when you cross national borders. Denmark and the Netherlands are perhaps profound examples of different national systems, but given how much we have in common in general, I am sure there are tons of black swans out there for me yet to discover. As an MDM vendor, and as the lead for <a href="http://www.datacleaner.org/" target="_blank">DataCleaner</a>, I always need to ensure that our products cater to international - and thereby highly varied - data and ways of processing data.</div>
<hr />
<h1>Cleaning Danish customer data with the CVR registry and DataCleaner</h1>
<p><i>2013-01-17</i></p>
<div dir="ltr" style="text-align: left;" trbidi="on">
I am doing a series of 3 blog entries over at <a href="http://datavaluetalk.com/">Data Value Talk</a> about the political and practical side of the Danish government's recent decision to open up basic public data to everyone. I thought it would be nice to also share the word here on my personal blog!</div>
<div>
<a href="http://datavaluetalk.com/data-quality/cleaning-danish-customer-data-with-the-cvr-registry-and-datacleaner-part-1-of-3/"><img border="0" src="http://3.bp.blogspot.com/-bdbiEXuYEeM/UPe6S0bMR6I/AAAAAAAAAnU/SnYKdPUIeIc/s1600/cvr-blog-strip-2.png" style="border: none; border-right: 1px solid black; float: left; margin: 0px; padding: 0px;" /></a>
<a href="http://datavaluetalk.com/data-quality/cleaning-danish-customer-data-with-cvr-and-datacleaner-part-2-of-3/"><img border="0" src="http://2.bp.blogspot.com/-a6pYZEe4JQs/UPe6SyfN0-I/AAAAAAAAAnc/4ZmV2CHE4Es/s1600/cvr-blog-strip-1.png" style="border: none; border-right: 1px solid black; border-left: 1px solid black; float: left; margin: 0px; padding: 0px;" /></a>
<a href="http://datavaluetalk.com/data-quality/cleaning-danish-customer-data-with-the-cvr-registry-and-datacleaner-part-3-of-3/"><img border="0"src="http://3.bp.blogspot.com/-ZqWJpYFUwZ0/UQD1b3p9mZI/AAAAAAAAApA/F1L1JavQwqU/s400/cvr-blog-strip-3.png" style="border: none; border-left: 1px solid black; float: left; margin: 0px; padding: 0px;" /></a>
</div>
<div style="clear: both;">
<div>
Click the images above to navigate to the three chapters of the series:<br />
<i>Cleaning Danish customer data with the CVR registry and DataCleaner</i>.</div>
</div>
<hr />
<h1>How to build a Groovy DataCleaner extension</h1>
<p><i>2012-12-07</i></p>
<div dir="ltr" style="text-align: left;" trbidi="on">
In this blog entry I'll go through the process of developing a DataCleaner extension: <a href="http://datacleaner.org/extension/Groovy-DataCleaner/" target="_blank">The Groovy DataCleaner extension</a> (just published today). The source code for the extension is <a href="https://github.com/kaspersorensen/GroovyDataCleanerExtension" target="_blank">available on GitHub</a> if you wish to check it out or even fork and improve it!<br />
<img alt="Groovy" border="0" src="http://datacleaner.org/ws/extension/Groovy-DataCleaner/screenshot/1/300x300" style="border: none; float: right; margin-left: 10px;" title="Groovy" />
<br />
<i><br /></i>
<i>First step:</i> You have <b>an idea</b> for your extension. My idea was to get the <a href="http://groovy.codehaus.org/" target="_blank">Groovy language</a> integrated with DataCleaner, to offer an advanced scripting language option, similar to the <a href="http://datacleaner.org/resources/docs/3.0.3/html/ch05s02.html" target="_blank">existing JavaScript transformer</a> - just a lot more powerful. The task would give me the chance to 1) get acquainted with the Groovy language, 2) solve some of the more advanced uses of DataCleaner by giving a completely open-ended scripting option, and 3) blog about it. The third point is important to me, because right now we have a <a href="http://datacleaner.org/newsitem/community-contributor-contest-2012" target="_blank">Community Contributor Contest</a>, and I'd like to invite extension developers to participate.<br />
<br />
<i>Second step:</i> Build a quick <b>prototype</b>. This usually starts by identifying which type of component(s) you want to create. In my case it was a transformer, but in some cases it might be an analyzer. The choice between these is essentially: Does your extension pre-process or transform the data in a way that should become part of a flow of operations? Then it's a Transformer. Or is it something that will consume the records (potentially after being pre-processed) and generate some kind of analysis result or write the records somewhere? Then it's an Analyzer.<br />
<br />
The API for DataCleaner was designed to be very easy to use. The idiom has been: 1) The obligatory functionality is provided in the interface that you implement. 2) The user-configured parts are injected using the @Configured annotation. 3) The optional parts can be injected if you need them. In other words, this is very much inspired by the idea of <a href="http://en.wikipedia.org/wiki/Convention_over_configuration" target="_blank">Convention-over-Configuration</a>.<br />
<br />
So, I wanted to build a Transformer. This was my first prototype, which I could hitch together quite quickly after reading the <a href="http://groovy.codehaus.org/Embedding+Groovy">Embedding Groovy documentation</a> - just implementing the Transformer interface revealed what I needed to provide for DataCleaner to operate:<br />
<pre class="prettyprint">@TransformerBean("Groovy transformer (simple)")
public class GroovySimpleTransformer implements Transformer<string> {
@Configured
InputColumn[] inputs;
@Configured
String code;
private GroovyObject _groovyObject;
public OutputColumns getOutputColumns() {
return new OutputColumns("Groovy output");
}
public String[] transform(InputRow inputRow) {
if (_groovyObject == null) {
_groovyObject = compileCode();
}
final Map<string object="object"> map = new LinkedHashMap<string object="object">();
for (InputColumn input : inputs) {
map.put(input.getName(), inputRow.getValue(input));
}
final Object[] args = new Object[] { map };
String result = (String) _groovyObject.invokeMethod("transform", args);
logger.debug("Transformation result: {}", result);
return new String[] { result };
}
private GroovyObject compileCode() {
// omitted
}
</string></string></string></pre>
<div>
<i>Third step: </i>Start <b>testing</b>. I believe a lot in unit testing your code, also at a very early stage. So the next thing I did was to implement a simple unit test. Notice that I make use of the MockInputColumn and MockInputRow classes from DataCleaner - these make it possible for me to test the Transformer as a <i>unit</i> and not have to do integration testing (in that case I would have to start an actual batch job, which takes a lot more effort from both me and the machine):</div>
<pre class="prettyprint">public class GroovySimpleTransformerTest extends TestCase {
public void testScenario() throws Exception {
GroovySimpleTransformer transformer = new GroovySimpleTransformer();
InputColumn<string> col1 = new MockInputColumn<string>("foo");
InputColumn<string> col2 = new MockInputColumn<string>("bar");
transformer.inputs = new InputColumn[] { col1, col2 };
transformer.code =
"class Transformer {\n" +
" String transform(map){println(map); return \"hello \" + map.get(\"foo\")}\n" +
"}";
String[] result = transformer.transform(new MockInputRow().put(col1, "Kasper").put(col2, "S"));
assertEquals(1, result.length);
assertEquals("hello Kasper", result[0]);
}
}
</string></string></string></string></pre>
<div>
Great - this verifies that our Transformer is actually working.<br />
<br />
<i>Fourth step:</i> Do the <b>polishing</b> that makes it look and feel like a usable component. It's time to build the extension and see how it works in DataCleaner. When the extension is bundled in a JAR file, you can simply click Window -> Options, select the Extensions tab and click Add extension package -> Manually install JAR file:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-cu8gK7rwxyA/UMHnllBZ48I/AAAAAAAAAUQ/e3ATQtP4GH4/s1600/extension_install_local.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-cu8gK7rwxyA/UMHnllBZ48I/AAAAAAAAAUQ/e3ATQtP4GH4/s1600/extension_install_local.png" /></a></div>
<br />
After registering your extension you will be able to find it in DataCleaner's <i>Transformation</i> menu (or if you built an Analyzer, in the <i>Analyze</i> menu).<br />
<br />
In my case I discovered several sub-optimal features of my extension. Here's a list of them, and how I solved each one:</div>
<table border="0" cellpadding="0" cellspacing="0" class="pretty-table">
<tbody>
<tr><th>What?</th><th>How?</th></tr>
<tr><td valign="top">My transformer had only a default icon</td><td>Icons can be defined by providing PNG icon (32x32 pixels) with the same name as the transformer class, in the JAR file. In my case the transformer class was GroovySimpleTransformer.java, so I made an icon available at GroovySimpleTransformer.png.</td></tr>
<tr><td valign="top">The 'Code' text field was a single line field and did not look like a code editing field.</td><td>Since the API is designed for Convention-over-Configuration, putting a plain String property as the groovy code was maybe a bit naive. There are two strategies to pursue if you have properties which need special rendering on the UI: Provide more metadata about the property (Quite easy), or build your own renderer for it (most flexible, but also more complex). In this case I was able to simply provide more metadata, using the @StringProperty annotation:
<br />
<pre class="prettyprint">@Configured
@StringProperty(multiline = true, mimeType = "text/groovy")
String code</pre>
The default DataCleaner string property widget will then provide a multi-line text field with syntax coloring for the specific mime-type:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-9nysziHs7_0/UMHqvLVaCbI/AAAAAAAAAUk/Y-MvOArONYw/s1600/groovy_syntax_coloring.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="139" src="http://2.bp.blogspot.com/-9nysziHs7_0/UMHqvLVaCbI/AAAAAAAAAUk/Y-MvOArONYw/s640/groovy_syntax_coloring.png" width="640" /></a></div>
</td></tr>
<tr><td valign="top">The compilation of the Groovy class was done when the first record hits the transformer, but ideally we would want to do it before the batch even begins.</td><td>This point is actually quite important, also to avoid race-conditions in concurrent code and other nasty scenarios. Additionally it will help DataCleaner validation the configuration before actually kicking off a the batch job.<br />
<br />
The trick is to add a method with the @Initialize annotation. If you have multiple items you need to initialize, you can even add more. In our case, it was quite simple:<br />
<pre class="prettyprint">@Initialize
public void init() {
_groovyObject = compileCode();
}</pre>
</td></tr>
<tr><td valign="top">The transformer was placed in the root of the <i>Transformation</i> menu.</td><td>This was fixed by applying the following annotation on the class, moving it into the <i>Scripting</i> category:
<br />
<pre class="prettyprint">@Categorized(ScriptingCategory.class)</pre>
</td></tr>
<tr><td valign="top">The transformer had no description text while hovering over it.</td><td>The description was added in a similar fashion, with a class-level annotation:
<br />
<pre class="prettyprint">@Description("Perform a data transformation with the use of the Groovy language.")</pre>
</td></tr>
<tr><td valign="top">After execution it would be good to clean up resources used by Groovy.</td><td>Similarly to the @Initialize annotation, I can also create one or more descruction methods, annotated with @Close. In the case of the Groovy transformer, there are some classloader-related items that can be cleared after execution this way.</td></tr>
<tr><td valign="top">In a more advanced edition of the same transformer, I wanted to support multiple output records.</td><td>DataCleaner does support transformers that yield multiple (or even zero) output records. To archieve this, you can inject an OutputRowCollector instance into the Transformer:
<br />
<pre class="prettyprint">public class MyTransformer implements Transformer<...> {
@Inject
@Provided
OutputRowCollector collector;
public void transform(InputRow row) {
// output two records, each with two new values
collector.putValues("foo", "bar");
collector.putValues("hello", "world");
}
}
</pre>
Side-note - Users of <a href="http://hadoop.apache.org/">Hadoop</a> might recognize the OutputRowCollector as similar to mappers in Map-Reduce. Transformers, like mappers, are indeed quite capable of executing in parallel.</td></tr>
</tbody></table>
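<br />
A small sketch of what such a close method could look like. The _groovyClassLoader field is my assumption - it presumes the transformer keeps a reference to the GroovyClassLoader used in compileCode() - while GroovyClassLoader.clearCache() is a real Groovy API:<br />
<pre class="prettyprint">@Close
public void close() {
    // release classloader-related resources held by the compiled Groovy class
    if (_groovyClassLoader != null) {
        _groovyClassLoader.clearCache();
        _groovyClassLoader = null;
    }
    _groovyObject = null;
}</pre>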
<br />
<br />
<i>Fifth step:</i> When you're satisfied with the extension, <b>Publish</b> it on the <a href="http://datacleaner.org/extensions">ExtensionSwap</a>. Simply click the "Register extension" button and follow the instructions on the form. Your extension will now be available to everyone and make others in the community happy!<br />
<br />
I hope you found this blog useful as a way to get into DataCleaner extension development. I would be interested in any kind of comment regarding the extension mechanism in DataCleaner - please speak up and let me know what you think!
</div>
<hr />
<h1>Data quality of Big Data - challenge or potential?</h1>
<p><i>2012-11-06</i></p>
<div dir="ltr" style="text-align: left;" trbidi="on">
If you've been dealing with data in the last few years, you've inevitably also come across the concept of <b>Big Data</b> and <b>NoSQL</b> databases. What do these terms mean for data quality efforts? I believe they can have a great impact. But I also think they are difficult concepts to juggle because the understanding of them varies a lot.<br />
<br />
For the purpose of this writing, I will simply say that in my opinion Big Data and NoSQL expose us as data workers to two new general challenges:<br />
<ul style="text-align: left;">
<li>Much <b>more data</b> than we're used to working with, leading to more computing time and new techniques for even handling those amounts of data.</li>
<li>More <b>complex data structures</b> that are not necessarily based on a relational way of thinking.</li>
</ul>
<div>
So who will be exposed to these challenges? As a developer of tools I am definitely exposed to both. But will the end user be challenged by both of them? I hope not. But people need good tools in order to do clever things with data quality.</div>
<div>
<br /></div>
<div>
The <i>more data</i> challenge is primarily one of storage and performance. And since storage isn't that expensive, I mostly consider it a performance challenge. And for the sake of argument, let's just assume that the tool vendors should be the ones mostly concerned about performance - if the tools and solutions are well designed by the vendors, then at least most of the performance issues should be tackled there.</div>
<div>
<br /></div>
<div>
But the <i>complex data structures</i> challenge is a different one. It has surfaced for a lot of reasons, including:</div>
<div>
<ul style="text-align: left;">
<li>Favoring performance by e.g. eliminating the need to join database tables.</li>
<li>Favoring ease of use by allowing logical grouping of related data even though granularity might differ (e.g. storing orderlines together with the order itself).</li>
<li>Favoring flexibility by not having a strict schema to perform validation upon inserting.</li>
</ul>
<div>
Typically the structure of such records isn't tabular. Instead it is often based on key/value maps, like this:</div>
</div>
<pre class="prettyprint">{
"orderNumber": 123,
"priceSum": 500,
"orderLines": [
{ "productId": "abc", "productName": "Foo", "price": 200 },
{ "productId": "def", "productName": "Bar", "price": 300 },
],
"customer": { "id": "xyz", "name": "John Doe", "country": "USA" }
}</pre>
<div>
Looking at this you will notice some data structure features which are alien to the relational world:<br />
<ul style="text-align: left;">
<li>The orderline and customer information is contained within the same record as the order itself. In other words: we have data types like arrays/lists (the "orderLines" field) and key/value maps (each orderline element, and the "customer" field).</li>
<li>Each orderline has a "productId" field, which probably refers to a detailed product record. But it also contains a "productName" field, since this is probably the most wanted information when presenting/reading the record. In other words: redundancy is added for ease of use and for performance.</li>
<li>The "priceSum" field is also redundant, since we could easily compute it by visiting all orderlines and summing their prices. But here too, redundancy is added for performance.</li>
<li>Some of the same principles apply to the "customer" field: it is a key/value type, and it contains both an ID and some redundant information about the customer.</li>
</ul>
<div>
Support for such data structures is traditionally difficult for database tools. As tool producers, we're used to dealing with relational data, joins and those things. But in the world of Big Data and NoSQL we're up against the challenges of complex data structures, which include traversal and selection within the same record.</div>
</div>
<div>
<br /></div>
<div>
In <a href="http://datacleaner.org/" target="_blank">DataCleaner</a>, these issues are primarily handled using the "Data structures" transformations, which I feel I need to give a little attention:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-GusL_5--ssI/UJkkHaSH93I/AAAAAAAAAMk/g71FUUYt6R0/s1600/datastructures.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-GusL_5--ssI/UJkkHaSH93I/AAAAAAAAAMk/g71FUUYt6R0/s1600/datastructures.png" /></a></div>
<div>
Using these transformations you can both read and write the complex data structures that we see in Big Data projects. To explain the 8 "Data structures" transformations, I will try to define a few terms used in their naming:</div>
<div>
<ul style="text-align: left;">
<li>To "Read" a data structure such as a list or a key/value map means to view each element in the structure as a separate record. This enables you to profile eg. all the orderlines in the order-record example.</li>
<li>To "Select" from a data structure such as a list or a key/value map means to retreive specific elements while retaining the current record granularity. This enables you to profile eg. the customer names of your orders.</li>
<li>Similarly you can "Build" these data structures. The record granularity of the built data structures is the same as the current granularity in your data processing flow.</li>
<li>And lastly I should mention JSON, because it is in the screenshot. JSON is probably the most common literal format for representing data structures like these, and indeed the example above is written in JSON. Obviously we support reading and writing JSON objects directly in DataCleaner.</li>
</ul>
<div>
Using these transformations we can overcome the challenge of handling the new data structures in Big Data and NoSQL.</div>
</div>
<div>
<br /></div>
<div>
Another important aspect that comes up whenever there's a whole new generation of databases is that the way you interface with them is different. In DataCleaner (or rather, in DataCleaner's little-sister project <a href="http://metamodel.eobjects.org/" target="_blank">MetaModel</a>) we've done a lot of work to make sure that the abstractions already known can be reused. This means that querying and inserting records into a NoSQL database works exactly the same as with a relational database. The only difference of course is that new data types may be available, but that builds nicely on top of the existing relational concepts of schemas, tables, columns etc. And at the same time we do all we can to optimize performance, so that the choice of querying technique is not up to the end user but is applied according to the situation. Maybe I'll try and explain how that works under the covers some other time!</div>
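<div>
<br /></div>
<div>
To make that concrete, here is a minimal sketch of the uniform query API. It is written with the later Apache package names (at the time of this post the packages lived under org.eobjects.metamodel), and the CSV factory call simply stands in for any backend - swapping it for e.g. a MongoDB or CouchDB DataContext leaves the query code unchanged. Treat the filter method name as representative of the query builder rather than exact:</div>
<pre class="prettyprint">import java.io.File;
import org.apache.metamodel.DataContext;
import org.apache.metamodel.DataContextFactory;
import org.apache.metamodel.data.DataSet;
import org.apache.metamodel.schema.Table;

public class QueryAnything {
    public static void main(String[] args) {
        // swap this factory call for another backend - the query below stays the same
        DataContext dc = DataContextFactory.createCsvDataContext(new File("orders.csv"));
        Table table = dc.getDefaultSchema().getTables()[0];
        DataSet ds = dc.query().from(table)
                .select("orderNumber", "priceSum")
                .where("priceSum").greaterThan(400)
                .execute();
        try {
            while (ds.next()) {
                System.out.println(ds.getRow());
            }
        } finally {
            ds.close();
        }
    }
}</pre>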
</div>
<hr />
<h1>Video: Introducing Data Quality monitoring</h1>
<p><i>2012-10-04</i></p>
<p>Every time we publish some work that I've been working on, I breathe a big sigh of relief. This time it's not software but a video - and I think the end result is quite good, although we are not movie-making professionals at Human Inference :)</p>
<p>Here's the stuff: Data Quality monitoring with DataCleaner 3</p>
<div style="text-align:center;">
<iframe width="640" height="360" src="http://www.youtube.com/embed/T5vImfOPBGo" frameborder="0" allowfullscreen></iframe>
</div>
<p>Enjoy! And do let us know what you think on our <a href="http://datacleaner.org/forum/1">discussion forum</a> or our <a href="http://www.linkedin.com/groups/DataCleaner-open-source-data-quality-3352784">LinkedIn group</a>.</p>
<hr />
<h1>DataCleaner 3 released!</h1>
<p><i>2012-09-20</i></p>
<div dir="ltr" style="text-align: left;" trbidi="on">
Dear friends, users, customers, developers, analysts, partners and more!<br/>
<br/>
After an intense period of development and a long wait, it is our pleasure to finally announce that DataCleaner 3 is available. We at Human Inference invite you all to our celebration! Impatient to try it out? Go <a target="_blank" href="http://datacleaner.org/downloads">download it</a> right now!<br/>
<br/>
So what is all the fuss about? Well, in all modesty, we think that with DataCleaner 3 we are redefining 'the premier open source data quality solution'. With DataCleaner 3 we've embraced a whole new functional area of data quality, namely <i>data monitoring</i>.<br/>
<br/>
Traditionally, DataCleaner has its roots in data profiling. Over the years, we've added several related functions: transformations, data cleansing, duplicate detection and more. With data monitoring we basically deliver all of the above, but in a continuous environment for analyzing, improving and reporting on your data. Furthermore, we will deliver these functions in a centralized web-based system.<br/>
<br/>
So how will users benefit from this new data monitoring environment? We've tried to answer this question using a series of images:<br/>
<div style="text-align:center">
<p>Monitor the evolution of your data:</p>
<img src="http://datacleaner.org/resources/infographic_part_evolution.png" alt=""/>
<p>Share your data quality analysis with everyone:</p>
<img src="http://datacleaner.org/resources/infographic_part_share.png" alt=""/>
<p>Continuously monitor and improve your data's quality:</p>
<img src="http://datacleaner.org/resources/infographic_part_improve.png" alt=""/>
<p>Connect DataCleaner to your infrastructure using web services:</p>
<img src="http://datacleaner.org/resources/infographic_part_connect.png" alt=""/>
</div>
<br/>The monitoring web application is a fully fledged environment for data quality, covering several functional and non-functional areas:<br/><ul><li> Display of timeline and trends of data quality metrics</li><li> Centralized repository for managing jobs, results, timelines etc.</li><li> Scheduling and auditing of DataCleaner jobs</li><li> Providing web services for invoking DataCleaner transformations</li><li> Security and multi-tenancy</li><li> Alerts and notifications when data quality metrics are out of their expected comfort zones.</li></ul><br/>Naturally, the traditional desktop application of DataCleaner continues to be the tool of choice for expert users and one-time data quality efforts. We've even enhanced the desktop experience quite substantially:<br/><ul><li> There is a new Completeness analyzer which is very useful for simply identifying records that have incomplete fields.</li><li> You can now export DataCleaner results to nice-looking HTML reports that you can give to your manager, or send to your XML parser!</li><li> The new monitoring environment is also closely integrated with the desktop application. Thus, the desktop application now has the ability to publish jobs and results to the monitor repository, and to be used as an interactive editor for content already in the repository.</li><li> New date-oriented transformations are now available: Date range filter, which allows you to subset datasets based on date ranges, and Format date, which allows you to format a date using a date mask.</li><li> The Regex Parser (which was previously only available through <a target="_blank" href="http://datacleaner.org/extensions">the ExtensionSwap</a>) has now been included in DataCleaner. This makes it very convenient to parse and standardize rich text fields using regular expressions.</li><li> There's a new Text case transformer available. With this transformation you can easily convert between upper/lower case and proper capitalization of sentences and words.</li><li> Two new search/replace transformations have been added: Plain search/replace and Regex search/replace.</li><li> The user experience of the desktop application has been improved. We've added several in-application help messages, made the colors look brighter and clearer and improved the font handling.</li></ul><br/>More than 50 features and enhancements were implemented in this release, in addition to incorporating several hundred upstream improvements from dependent projects.<br/><br/>We hope you will enjoy everything that is new about DataCleaner 3. And do watch out for follow-up material in the coming weeks and months. We will be posting more and more online material and examples to demonstrate the wonderful new features that we are very proud of.
</div>Kasper Sørensenhttp://www.blogger.com/profile/03281134453771284315noreply@blogger.com0tag:blogger.com,1999:blog-1301643147176033927.post-7949972319825052702012-08-19T10:36:00.001+02:002019-02-12T04:23:14.080+01:00Gartner's Magic Quadrant for Data Quality Tools<br />
<div style="margin-bottom: 0in;">
<span style="font-size: medium;">Over the last two year
I've been obsessed with Gartner and their analyses of the Data Quality market. It's not because they're always right in
every aspect, but a lot of times they are, and in addition they are very prominent opinionmakers. And
in all honesty, a huge part of one's personal reasons to do open
source development is to get recognition from your peers.</span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-mMEK1HA-CZY/UDCjw-qJj1I/AAAAAAAAABc/NGX3N2uB_q8/s1600/gartner.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="178" src="http://3.bp.blogspot.com/-mMEK1HA-CZY/UDCjw-qJj1I/AAAAAAAAABc/NGX3N2uB_q8/s320/gartner.png" width="320" /></a></div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
<span style="font-size: medium;">While leading the
<a href="http://datacleaner.github.io/">DataCleaner</a> development and other projects at Human Inference, we
definately are paying a lot of attention to what Gartner is saying
about us. Also before I joined Human Inference, Gartner was important
to me. Their mentioning of DataCleaner in their <i>Who's who in Open
Source Data Quality</i> report from 2009 was the initial trigger for
contact to a lot of people, including Winfried van Holland, CEO of
Human Inference. So on a personal level I have a lot to thank Gartner for.</span></div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
<span style="font-size: medium;">Therefore it is with great
proudness that I see that Human Inference has been promoted to the
<i>visionary </i>quadrant of Gartner's <i>Magic Quadrant for Data
Quality Tools </i>annual report,
which just came out (<a href="http://www.gartner.com/id=2111217">get it from Gartner here</a>). That's exactly where I think we deserve
to be.</span></div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
<span style="font-size: medium;">At the same time I see
they are mentioning DataCleaner specifically as one of the strong
points for Human Inference. This is because of the <i>licensing model
</i>that we lend ourselves to with it, and for the <i>easy-to-use
interface </i>which it provides to data quality professionals.
Additionally our integration with <i>Pentaho Data Integration</i> (<a href="https://github.com/datacleaner/pdi-datacleaner">read more</a>) and
the application of our <i>cloud data quality</i> platform (<a href="http://www.easydq.com/">read more</a>) are
mentioned as strong points.</span></div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
<span style="font-size: medium;">This is quite a
recognition since our last review by Gartner. In their 2012 update
for the <i>Who's who in Open Source Data Quality</i> (<a href="http://www.gartner.com/id=1898014">get it form Gartner here</a>)<i> </i>Gartner
critizised the DataCleaner project on a general negative attitude and
a range of false grounds. In particular, my feeling is that certain
misunderstandings about the integration of DataCleaner with the rest
of Human Inference's offerings caused our Gartner rating to be
undervalued at that time. Hopefully the next update to that report
will reflect their recent, more enlightened view. I should mention
that the two reports are rather independent of each other, so these
are just speculations from my side.</span></div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
<span style="font-size: medium;">As such the Gartner
reports have shown to be a wonderful source of information about
competing products and market demand. Our <a href="http://kasper.eobjects.org/2012/06/revealing-dq-monitor-datacleaner-30.html">current plans for DataCleaner 3</a> are in deed also influenced by Gartner's (and our own)
description of <i>data monitoring </i>being a key functional area in
the space of data quality tools.</span></div>
<div style="margin-bottom: 0in;">
<br /></div>
<div style="margin-bottom: 0in;">
<span style="font-size: medium;">I am deeply grateful for
Gartner's treatment of DataCleaner over time. “You've got to take
the good with the bad” and for me that's been a great way of
continously improving our products. A good reception in the end makes
all the trouble worthwhile.</span></div>
Kasper Sørensenhttp://www.blogger.com/profile/03281134453771284315noreply@blogger.com0tag:blogger.com,1999:blog-1301643147176033927.post-8961287868259892362012-07-11T00:01:00.000+02:002012-07-11T09:31:18.928+02:00Query and update Databases, CSV files, Excel spreadsheets, MongoDB, CouchDB and more in Java<h2>Introduction to MetaModel 3</h2>
The other day we released version 3 of <a href="http://metamodel.eobjects.org/">MetaModel</a>, a project that I have thoroughly enjoyed working on lately. Let me share a bit of my enthusiasm and try to convince you that this is the greatest data access library there is for Java.<br />
<br />
First let me also say, just to make it clear: MetaModel is <i>NOT</i> an ORM (Object-Relational Mapping) framework. MetaModel does not do <i>any</i> mapping to your domain object model. On the contrary, MetaModel is oriented towards working with the data model that already exists in your datastores (databases, files etc.) as it is physically represented. So the model of MetaModel is indeed a <i>meta</i> model, just as it is a <i>metadata</i> model - it works with the concepts of tables, columns, rows, schemas, relationships etc. But it is also oriented towards abstracting away all the cruft of having to deal with the physical interactions with each individual data storage technology. So unlike most ORMs, MetaModel allows you to work with arbitrary data models, stored in arbitrary technologies such as relational (JDBC) databases, text file formats, Excel spreadsheets, NoSQL databases (currently CouchDB and MongoDB) and more.<br />
<br />
Here's an overview of the scope of MetaModel, depicted in our "module diagram".<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://metamodel.eobjects.org/modules.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="385" src="http://metamodel.eobjects.org/modules.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
It's important to be able to access, query and update all of these different datastore technologies in the same way. This is basically the cross-platform story all over again - except that with MetaModel the cross-platform frontier moves to also cover datastore technologies, whereas it was previously mostly about freedom of operating system.</div>
<div class="separator" style="clear: both; text-align: left;">
In addition it is important that this common data access abstraction is elegant, scalable and flexible. I hope to convince you of this with a few examples.</div>
<h2>Code examples</h2>
<div class="separator" style="clear: both; text-align: left;">
Everything you do with MetaModel always starts with a <a href="http://metamodel.eobjects.org/apidocs/org/eobjects/metamodel/DataContext.html">DataContext</a> object. A DataContext represents the basis for any operation with your datastore. Typically one would use the <a href="http://metamodel.eobjects.org/apidocs/org/eobjects/metamodel/DataContextFactory.html">DataContextFactory</a> to get an instance for the datastore of interest. We'll assume you are working on an Excel spreadsheet "people.xlsx":</div>
<pre class="prettyprint lang-java">
DataContext dc = DataContextFactory.createExcelDataContext(new File("people.xlsx"));</pre>
<div class="separator" style="clear: both; text-align: left;">
Easy. Now let's explore the structure of this spreadsheet. We can do so either generically by traversing the graph of schemas, tables and columns - or by names if we already know what we are looking for:</div>
<pre class="prettyprint lang-java">
// getting column by path
Column customerNameColumn = dc.getColumnByQualifiedLabel("customers.name"); </pre>
<pre class="prettyprint lang-java">
// traversing all schemas, tables, columns
Schema schema = dc.getDefaultSchema();
Table[] tables = schema.getTables();
Column[] columns = tables[0].getColumns(); </pre>
<pre class="prettyprint lang-java">
// step-wise getting specific elements based on names
Table customersTable = schema.getTableByName("customers");
Column customerBalanceColumn = customersTable.getColumnByName("balance");</pre>
<h3>Queries</h3>
<div class="separator" style="clear: both; text-align: left;">
Now let's fire some queries. This is where it gets interesting! Our approach builds upon basic knowledge of SQL, but without all the dialects and runtime differences. Technically we express queries in a completely type safe manner by using the traversed metadata objects above. But we can also put in String literals and more when it is convenient to get the job done.</div>
<pre class="prettyprint lang-java">
// Simple query: Get <b>all</b> customer fields for customers with a credit balance <b>greater than</b> 10000.
DataSet ds = dc.query().from(customersTable).select(customersTable.getColumns())
    .where(customerBalanceColumn).greaterThan(10000).execute();</pre>
<pre class="prettyprint lang-java">
// Slightly more advanced query: <b>Join</b> customers with their associated sales representatives
// and <b>group</b> the result to <b>count</b> which sales reps have the most customers
Column salesRepId = customersTable.getColumnByName("sales_rep_id");
Column employeeId = dc.getColumnByQualifiedLabel("employees.id");
DataSet ds = dc.query()
    .from(customersTable).innerJoin("employees").on(salesRepId, employeeId)
    .selectCount().and("employees.name")
    .groupBy("employees.name").execute();</pre>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
You can even grab the <a href="http://metamodel.eobjects.org/apidocs/org/eobjects/metamodel/query/Query.html">Query</a> as an object and pass it on to methods and compositions which will eg. modify it for optimization or other purposes.</div>
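<div class="separator" style="clear: both; text-align: left;">
As a hedged sketch of what I mean - toQuery() and setMaxRows(...) reflect the API as I recall it, so do verify against the apidocs:</div>
<pre class="prettyprint lang-java">
// Grab the Query as an object, modify it, then execute it
Query q = dc.query().from(customersTable)
    .select(customerNameColumn).toQuery();
q.setMaxRows(100); // e.g. a modification applied by some other component
DataSet ds = dc.executeQuery(q);</pre>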
<h3>Updates and changes</h3>
<div class="separator" style="clear: both; text-align: left;">
The last missing piece of the puzzle is making changes to your data. We've spent a lot of time creating an API that is best suited for this task across all types of datastores. Since there's a big difference in how different datastores treat updates, and MetaModel tries to unify that, we wanted an API which clearly demarcates <i>when </i>you're doing an update, so that there is no doubt about transactional bounds, scope of a batch update and so on. This is why you need to provide all your updates in a closure-style object called an <a href="http://metamodel.eobjects.org/apidocs/org/eobjects/metamodel/UpdateScript.html">UpdateScript</a>.</div>
Let's do a series of updates.<br />
<pre class="prettyprint lang-java">
// Batch #1: Create a table and insert a few records
dc.executeUpdate(new UpdateScript() {
public void run(UpdateCallback cb) {
<span style="text-align: -webkit-auto;"> </span> Table muppets = cb.createTable(schema, "muppets")
.withColumn("name").ofType(VARCHAR)
.withColumn("profession").ofType(VARCHAR).execute();
cb.insertInto(muppets).value("name","Kermit the frog")<br /> .value("profession","TV host").execute();
cb.insertInto(muppets).value("name","Miss Piggy")
.value("profession","Diva").execute();
}
});</pre>
<div class="separator" style="clear: both; text-align: left;">
</div>
<pre class="prettyprint lang-java">
// Batch #2: Update and delete a record
dc.executeUpdate(new UpdateScript() {
public void run(UpdateCallback cb) {
cb.update("muppets").value("profession","Theatre host")
.where("name").equals("Kermit the frog").execute();
cb.deleteFrom("muppets")
.where("name").like("%Piggy").execute();
}
});</pre>
<div class="separator" style="clear: both;">
</div>
<pre class="prettyprint lang-java">
// Batch #3: Drop the table
dc.executeUpdate(new UpdateScript() {
public void run(UpdateCallback cb) {
cb.dropTable("muppets").execute();
}
});</pre>
<div class="separator" style="clear: both;">
<span style="background-color: white;">As you can see, using the </span><a href="http://metamodel.eobjects.org/apidocs/org/eobjects/metamodel/UpdateScript.html" style="background-color: white;">UpdateScript</a><span style="background-color: white;">s we've encapsulated each batch operation. If the datastore supports transactions, this is also the point of transactional control. If not, MetaModel will provide appropriate synchronization to avoid race conditions, so you can safely perform concurrent updates even on Excel spreadsheets and CSV files.</span></div>
<div class="separator" style="clear: both;">
</div>
<h3>Wrapping up...</h3>
<div class="separator" style="clear: both;">
I am extremely happy working both as a developer and as a consumer of <a href="http://metamodel.eobjects.org">MetaModel</a>. I hope you felt this blog/tutorial was a good kick-start and that you got excited about the library. Please give MetaModel a spin and share your thoughts and impressions.</div>Kasper Sørensenhttp://www.blogger.com/profile/03281134453771284315noreply@blogger.com6tag:blogger.com,1999:blog-1301643147176033927.post-3058025981938411202012-06-14T16:05:00.000+02:002012-06-14T16:05:43.811+02:00Seeing sporadic OptionalDataExceptions in your Java code?This is basically to record a valuable mental note I made today:<br />
<div>
<br />
I was seeing very sporadic OptionalDataExceptions while doing deserialization of certain objects. I had wondered for a looong time why, and today I took the time (many hours) to sit down and figure it out.</div>
<div>
<br /></div>
<div>
After spending a lot of time creating custom ObjectInputStreams, debugging readObject() methods and more, I finally tracked it down to an elementary (but extremely hard to identify) issue: Some of my collections (HashMaps, HashSets and more) were exposed to multiple threads, and since these basic collection types are neither concurrent nor synchronized, that was the root cause.</div>
<div>
<br /></div>
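<div>
To illustrate the kind of fix, here's a minimal sketch. The field names are made up for the example - the point is the choice between a concurrent implementation and a synchronized wrapper:</div>
<pre class="prettyprint lang-java">
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class SharedState {
    // BAD: a plain HashMap shared between threads may be observed in an
    // inconsistent state - e.g. while it is being serialized
    private final Map<String, Long> counters = new HashMap<String, Long>();

    // GOOD: use a concurrent implementation ...
    private final Map<String, Long> safeCounters = new ConcurrentHashMap<String, Long>();

    // ... or wrap the collection in a synchronized view
    private final Set<String> names = Collections.synchronizedSet(new HashSet<String>());
}</pre>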
<div>
In essence: If you're seeing this issue, double-check all your HashSets, HashMaps, LinkedHashMaps, LinkedHashSets and IdentityHashMaps! If they're exposed to multithreading, then make them concurrent (or synchronized).</div>Kasper Sørensenhttp://www.blogger.com/profile/03281134453771284315noreply@blogger.com0tag:blogger.com,1999:blog-1301643147176033927.post-41771999157710584682012-06-04T13:50:00.000+02:002012-09-07T11:37:34.024+02:00Revealing the "DQ monitor" - DataCleaner 3.0Development of <a href="http://datacleaner.eobjects.org/">DataCleaner</a> has been a bit quiet lately (where 'lately' refers to the last month or so ;-)) but now I want to share an idea that we have been working on at Human Inference: <b>Data quality monitoring</b>.<br />
<br />
But first some motivation: I think by now DataCleaner has a pretty firm reputation as one of the leading open source data quality applications. We can, and we will, continue to improve its core functionality, but it is also time to take a look at the next step for it - the next big version, <b>3.0</b>. For this we are picking up a major piece of functionality that is crucial to any DQ and MDM project - <b>data quality monitoring</b>. We all know the old saying that you <i>cannot manage what you cannot measure</i>. This is a statement that has a lot of general truth to it, not least when it comes to data quality. Data profiling tools traditionally focus on a one-time measurement of various metrics. But to truly manage your data quality, you need to monitor it over time and be able to act on not only the current status, but also the <i>progress</i> you're making. Doing this requires something new, which none of our open source competitors provide - a monitoring application for tracking data quality levels over time! We also want to make it easy to <i>share</i> this intelligence with everyone in your organization - so it has to be web based.<br />
<br />
Based on this we're setting forth the roadmap for DataCleaner 3.0. And we're already able to show very good results. In the following I will disclose some of what is there. Do not take it for anything more than it is: A snapshot from our current work-in-progress. But we do plan to deliver a final version within a few months.<br />
<br />
The applications main purpose is to present the history of data quality measurements. Here's a screenshot of our current prototype's timeline view:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-PAqHku5uSiQ/T8ycYiaUS5I/AAAAAAAAA2I/KKG0buVUnQg/s1600/dq_monitor_screenshot_view_timelines.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="640" src="http://2.bp.blogspot.com/-PAqHku5uSiQ/T8ycYiaUS5I/AAAAAAAAA2I/KKG0buVUnQg/s640/dq_monitor_screenshot_view_timelines.jpg" width="403" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
What is shown there are the metrics collected by a DataCleaner job, recorded and displayed in a timeline.
<br />
<br />
It's really easy to customize the timeline, or create new timeline views. All you need to do is select the metrics you're interested in. Notice in the screenshot below that there are different types of metrics, and for those that are queryable you have a nice autocompletion/suggestion interface:
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-i6QujAZnWk0/T8yci2WCOOI/AAAAAAAAA2Q/GsbmKv4kiso/s1600/dq_monitor_screenshot_create_timeline.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="398" src="http://2.bp.blogspot.com/-i6QujAZnWk0/T8yci2WCOOI/AAAAAAAAA2Q/GsbmKv4kiso/s400/dq_monitor_screenshot_create_timeline.jpg" width="400" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
If you click a point in the timeline, you'll get the option to drill to the point-in-time profiling result:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-o-1Hp5Xbz4A/T8yck7T93ZI/AAAAAAAAA2Y/-zAjOJQZCPE/s1600/dq_monitor_screenshot_drill_dialog.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="280" src="http://3.bp.blogspot.com/-o-1Hp5Xbz4A/T8yck7T93ZI/AAAAAAAAA2Y/-zAjOJQZCPE/s640/dq_monitor_screenshot_drill_dialog.jpg" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
If you drill, you will be given a full data profiling result (like you know them from the current version of DataCleaner, but in the browser). Our prototype is still a bit simplified on this point, but most analysis result features actually render nicely:
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-7FrYSnmAwtg/T8ycl9vbhYI/AAAAAAAAA2c/E0eUPR5LaUI/s1600/dq_monitor_screenshot_drill_result.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="287" src="http://1.bp.blogspot.com/-7FrYSnmAwtg/T8ycl9vbhYI/AAAAAAAAA2c/E0eUPR5LaUI/s400/dq_monitor_screenshot_drill_result.jpg" width="400" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
So how is it going to work in a larger context? Let me disclose a few of the ideas:
<br />
<ul>
<li><b>Queryable metrics:</b> Consider if you have a gender column and you always expect it to have the values "MALE" or "FEMALE". But you also know that there is quite some dirt in there, and you wish to measure the progress of eliminating the dirt. How would you define the metric then? What you will need is a metric definition that takes a parameter/query, saying that you wish to monitor the number of occurring values that are NOT "MALE" or "FEMALE". Similar cases exist for many other use-cases like the Pattern finder, Dictionary matchers etc. In the DataCleaner monitor, the user will be able to define such queries using IN [...] and NOT IN [...] clauses - for example NOT IN ["MALE","FEMALE"] - like very simple SQL.</li>
<li><b>Scheduling:</b> The monitoring application will include a scheduler to let you automatically run your periodic data quality assessments. The scheduler will allow you to set up both periodic and trigger-based scheduling events. For instance, you might have a workflow where you wish to trigger data profiling and monitoring in a wider context, such as ETL jobs, business processes etc.</li>
<li><b>Desktop integration:</b> You will also be able to run the jobs in the regular DataCleaner (desktop) application and then upload/synchronize your results with the monitoring application. This will make it easy for you to share your findings when you are working interactively with the desktop application.</li>
<li><b>Email alerting:</b> You will be able to set up expected ranges of particular metrics, and in the case that values are recorded outside the allowed ranges, the data steward will be alerted by email.</li>
<li><b>Repository:</b> All jobs, results, configuration data etc. is stored in a central repository. The repository allows you to centrally manage connection information, job definitions etc. that your team is using. The idea of a repository also opens up the door to concepts like versioning, personal workspaces, security, multi-tenancy and more. I think it will become a great organizing factor and a "hub" for the DataCleaner users.</li>
<li><b>Daily snapshots: </b>It is easy to define a profiling job that profiles a complete table. But when dealing with periodic data profiling, it is likely that you only wish to analyze the latest increment. Therefore DataCleaner's handling of date filters has been improved a lot. This is to ensure that you can easily request a profiling job of "yesterday's and today's data" and thereby see only profiling results based on the data that was entered/changed in your system recently.</li>
</ul>
Sounds exciting? We certainly think so. And this idea actually has been talked about for quite some time (I found mentions of a "DC web monitor" application as old as 2009!). So I am happy that we are finally putting this idea to the test. I'd love to hear from you if you have any thoughts, remarks or additional ideas.<br />
<br />
If you're interested in contributing or just taking a closer look at the development of DataCleaner 3.0, we've already started working on the <a href="http://eobjects.org/trac/query?milestone=DataCleaner+3.0">new milestone</a> in our issue tracking system. The code is located in DataCleaner's <a href="http://eobjects.org/svn/DataCleaner/trunk">source code repository</a> and we are eagerly awaiting you if you wish to pitch in!
<br/><br/>
<b>Update</b><br/><br/>
We now have an <a href="http://sourceforge.net/projects/datacleaner/files/datacleaner%20%28unstable%29/3.0-alpha/">alpha version available that you can play around with</a>. If you feel like being an early adopter, we would really appreciate any feedback from this!
<br/><br/>
<b>Update (2)</b><br/><br/>
We now have a <a href="http://sourceforge.net/projects/datacleaner/files/datacleaner%20%28unstable%29/3.0-beta/">beta version available that you can play around with</a>. This is starting to look feature complete, so please check it out and let us know what you think!Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-1301643147176033927.post-17035489585432391472012-04-17T15:48:00.000+02:002012-06-23T13:40:48.758+02:00Data quality monitoring with Kettle and DataCleanerWe've just announced a great thing - the <a href="http://datacleaner.eobjects.org/newsitem/data-profiling-and-data-quality-for-pentaho" target="_blank">cooperation between Pentaho and DataCleaner</a>, which brings DataCleaner's profiling features to all users of Pentaho Data Integration (aka. Kettle)! Not only is this something I've been looking forward to for a long time because it is a great exposure for us, but it also opens up new doors in terms of functionality. In this blog post I'll describe something new: <b>Data monitoring with Kettle and DataCleaner</b>.<br />
<br />
While DataCleaner is perfectly capable of doing continuous data profiling, we lack the deployment platform that Pentaho has. With Pentaho you get orchestration and scheduling, even with a graphical editor.<br />
<br />
A scenario that I often encounter is that someone wants to execute a daily profiling job, archive the results with a timestamp and have the results emailed to the data steward. Previously we would set this sorta thing up with <a href="http://datacleaner.eobjects.org/resources/docs/2.5.1/html/pt05.html" target="_blank">DataCleaner's command line interface</a>, which is still quite a nice solution, but if you have more than just a few of these jobs, it can quickly become a mess.<br />
<br />
So alternatively, I can now just create a Kettle job like this:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-Tt1ZEEtrgqI/T41xx4q1muI/AAAAAAAAAvs/hcNFLFuLnZ0/s1600/monitoring_example.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="69" src="http://4.bp.blogspot.com/-Tt1ZEEtrgqI/T41xx4q1muI/AAAAAAAAAvs/hcNFLFuLnZ0/s400/monitoring_example.png" width="400" /></a></div>
<br />
Here's what the example does:<br />
<br />
<ol>
<li>Starts the job (duh!)</li>
<li>Creates a timestamp which needs to be used for archiving the result. This is done using a separate transformation, which you can do either using the "Get System Info" step or the "Formula" step. The result is put into a variable called "today".</li>
<li>Executes the DataCleaner job. The result filename is set to include the "${today}" variable!</li>
<li>Emails the results to the data steward.</li>
<li>If everything went well without errors, the job is successful.</li>
</ol>
<div>
Pretty neat and something I am extremely happy about!</div>
<div>
<br /></div>
<div>
In the future I imagine to have even more features built like this. For example an ability to run multiple DataCleaner jobs with configuration options stored as data in the ETL flow. Or the ability to treat the stream of data in a Kettle transformation as the input of the DataCleaner job. Do you guys have any other wild ideas?</div>
<div><b>Update:</b> In fact we are now taking actions to provide more elaborate data quality monitoring features to the community. Go to <a href="http://kasper.eobjects.org/2012/06/revealing-dq-monitor-datacleaner-30.html">my blog entry about the plans for DataCleaner 3.0</a> for more information.</div>Anonymousnoreply@blogger.com4tag:blogger.com,1999:blog-1301643147176033927.post-83208696639116153532012-04-09T20:16:00.002+02:002012-04-09T20:20:13.542+02:00Implementing a custom datastore in DataCleaner<p>A question I am often asked by super-users, partners and developers of DataCleaner is: <b>How do you build a custom datastore in DataCleaner for my system/file-format XYZ</b>? Recently I've dealt with this for the upcoming <a href="http://wiki.pentaho.com/display/EAI/Human+Inference">integration with Pentaho Kettle</a>, for a Human Inference customer who had a home-grown database proxy system, and just today when it was asked on the <a href="http://datacleaner.eobjects.org/topic/305/Reading-web-server-log-files">DataCleaner forum</a>. In this blog post I will guide you through this process, which requires some basic Java programming skills, but if that's in place it isn't terribly complicated.</p>
<h3>Just gimme the code ...</h3>
<p>First of all I should say (to those of you who prefer "just the code") that there is already an example of how to do this in the <a href="http://eobjects.org/svn/DataCleaner/tags/DataCleaner-2.5/extensions/sample/">sample extension</a> for DataCleaner. Take a look at the <a href="http://eobjects.org/resources/view-doc.html?doc=/svn//DataCleaner/tags/DataCleaner-2.5/extensions/sample/src/main/java/org/eobjects/datacleaner/sample/SampleDatastore.java">org.eobjects.datacleaner.sample.SampleDatastore</a> class. Once you've read, understood and compiled the Java code, all you need to do is register the datastore in DataCleaner's conf.xml file like this (within the <datastore-catalog> element):</p>
<pre class="prettyprint lang-xml">
<custom-datastore class-name="org.eobjects.datacleaner.sample.SampleDatastore">
<property name="Name" value="My datastore" />
</custom-datastore>
</pre>
<h3>A bit more explanation please!</h3>
<p>OK, so if you wanna really know how it works, here goes...</p>
<p>First of all, a datastore in DataCleaner needs to implement the <a href="http://analyzerbeans.eobjects.org/apidocs/org/eobjects/analyzer/connection/Datastore.html">Datastore</a> interface. But instead of implementing the interface directly, I would suggest using the abstract implementation called the <a href="http://analyzerbeans.eobjects.org/apidocs/org/eobjects/analyzer/connection/UsageAwareDatastore.html">UsageAwareDatastore</a>. This abstract implementation handles concurrent access to the datastore, reusing existing connections and more. What you still need to provide when extending the UsageAwareDatastore class is primarily the <b>createDatastoreConnection()</b> method which is invoked when a (new) connection is requested. Let's see how an initial new Datastore implementation will look like:</p>
<pre class="prettyprint lang-java">
public class ExampleDatastore extends UsageAwareDatastore<DataContext> {
private static final long serialVersionUID = 1L;
public ExampleDatastore() {
super("My datastore");
}
@Override
protected UsageAwareDatastoreConnection<DataContext> createDatastoreConnection() {
// TODO Auto-generated method stub
return null;
}
@Override
public PerformanceCharacteristics getPerformanceCharacteristics() {
// TODO Auto-generated method stub
return null;
}
}
</pre>
<p>Notice that I have created a no-arg constructor. This is REQUIRED for custom datastores, since the datastore will be instantiated by DataCleaner. Later we will focus on how to make the name ("My datastore") adjustable.</p>
<p>First we want to have a look at the two unimplemented methods:</p>
<ul>
<li><b>createDatastoreConnection()</b> is used to create a new connection. DataCleaner builds upon the <a href="http://metamodel.eobjects.org">MetaModel</a> framework for data access. You will need to return a new DatastoreConnectionImpl(...). This class takes an important parameter, namely your MetaModel DataContext implementation. Oftentimes there will already be a DataContext that you can use given some configuration, eg. a JdbcDataContext, a CsvDataContext, an ExcelDataContext, a MongoDbDataContext or whatever.</li>
<li><b>getPerformanceCharacteristics()</b> is used by DataCleaner to figure out the query plan when executing a job. You will typically just return a <i>new PerformanceCharacteristics(false);</i>. Read the <a href="http://analyzerbeans.eobjects.org/apidocs/org/eobjects/analyzer/connection/PerformanceCharacteristics.html">javadoc</a> for more information :)</li>
</ul>
<h3>Parameterizable properties, please</h3>
<p>By now you should be able to implement a custom datastore, which hopefully covers your basic needs. But maybe you want to reuse the datastore class with eg. different files, different hostnames etc. In other words: Maybe you want to let your user define certain properties of the datastore.</p>
<p>To your rescue is the <a href="http://analyzerbeans.eobjects.org/apidocs/org/eobjects/analyzer/beans/api/Configured.html">@Configured</a> annotation, which is an annotation widely used in DataCleaner. It allows you to annotate fields in your class which should be configured by the user. The types of the fields can be Strings, Integers, Files etc., <a href="http://analyzerbeans.eobjects.org/apidocs/org/eobjects/analyzer/beans/api/Configured.html">you name it</a>. Let's see how you would expose the properties of a typical connection:</p>
<pre class="prettyprint lang-java">
public class ExampleDatastore extends UsageAwareDatastore<DataContext> {
// ...
@Configured
String datastoreName;
@Configured
String hostname;
@Configured
Integer port;
@Configured
String systemId;
// ...
}
</pre>
And how you would typically use them to implement methods:
<pre class="prettyprint lang-java">
public class ExampleDatastore extends UsageAwareDatastore<DataContext> {
// ...
@Override
public String getName() {
return datastoreName;
}
@Override
protected UsageAwareDatastoreConnection<DataContext> createDatastoreConnection() {
DataContext dataContext = createDataContext(hostname, port, systemId);
return new DatastoreConnectionImpl<DataContext>(dataContext, this);
}
}
</pre>
<p>If I wanted to configure a datastore using the parameters above, I could enter it in my conf.xml file like this:</p>
<pre class="prettyprint lang-xml">
<custom-datastore class-name="foo.bar.ExampleDatastore">
<property name="Datastore name" value="My datastore" />
<property name="Hostname" value="localhost" />
<property name="Port" value="1234" />
<property name="System id" value="foobar" />
</custom-datastore>
</pre>
<p>Notice that the names of the properties are inferred by reversing the camelCase notation which Java uses, so that "<b>datastoreName</b>" becomes "<b>Datastore name</b>" and so on. Alternatively you can provide an explicit name in the <b>@Configured</b> annotation.</p>
<p>I hope this introductory tutorial makes sense for you. Once again I urge you to take a look at <a href="http://eobjects.org/svn/DataCleaner/tags/DataCleaner-2.5/extensions/sample/">the Sample DataCleaner extension</a>, which also includes a build setup (Maven based) and a custom MetaModel DataContext implementation.</p>Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-1301643147176033927.post-4467219818736986192012-01-23T21:42:00.000+01:002012-06-23T13:42:15.038+02:00Now you can build your own DQ monitoring solution with DataCleanerUnder the cover of night we've released a new version of <a href="http://datacleaner.eobjects.org/">DataCleaner</a> today (version 2.4.2). Officially it's a minor release because very few things have changed in the User Interface - only a few bugfixes and minor enhancements have been introduced. But one potentially <i>major feature</i> has been added in the inner workings of DataCleaner: The ability to <b>persist the results of your DQ analysis jobs</b>. Although this feature still has very limited User Interface support, it has full support in the command line interface, which I would argue is actually sufficient for the purposes of establishing a <b>data quality monitoring</b> solution. Later on I do expect there to be full (and backwards compatible) support in the UI as well.<br />
<br />
So what is it, and how does it work?<br />
Well basically it is simply two new parameters to the command line interface:<br />
<pre> -of (--output-file) FILE : File in which to save the result of the job
-ot (--output-type) [TEXT | HTML | SERIALIZED] : How to represent the result of the job</pre>
Here's an example of how to use it. Notice that I use the file extension <b>.analysis.result.dat</b>, which is the extension that is currently recognized by the UI as a result file.<br />
<pre>> DataCleaner-console.exe -job examples\employees.analysis.xml -ot SERIALIZED -of employees.analysis.result.dat</pre>
Now start up DataCleaner's UI, and select "<i>File -> Open analysis job...</i>" - you'll suddenly see that the produced file can be opened:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-bvkpvoRtC9Q/Tx3EcHDZAXI/AAAAAAAAAcg/XpuJyHsMUAM/s1600/dc.analysis.result.filechooser.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="290" src="http://1.bp.blogspot.com/-bvkpvoRtC9Q/Tx3EcHDZAXI/AAAAAAAAAcg/XpuJyHsMUAM/s320/dc.analysis.result.filechooser.png" width="320" /></a></div>
And when you open the file, the result will be displayed just like a job you've run inside the application:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-vaRXyaiyV8M/Tx3ExHLKx7I/AAAAAAAAAcs/2b3m5a4idVw/s1600/dc.analysis.result.chart.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="270" src="http://2.bp.blogspot.com/-vaRXyaiyV8M/Tx3ExHLKx7I/AAAAAAAAAcs/2b3m5a4idVw/s320/dc.analysis.result.chart.jpg" width="320" /></a></div>
Since files like this are generally easy to archive and to append eg. timestamps to, it should be really easy to build a DIY data quality monitoring solution based on scheduled jobs and this approach to execution. Or you can get in contact with Human Inference if you want something more sophisticated ;-)<br />
Notice also that there's an HTML output type, which is also quite neat and easy to parse with an XML parser. The SERIALIZED format is richer though, and includes information needed for more refined, programmatic access to the results. For instance, you might deserialize the whole file using the regular Java serialization API and access it as an <a href="http://analyzerbeans.eobjects.org/apidocs/org/eobjects/analyzer/result/AnalysisResult.html" target="_blank">AnalysisResult</a> instance. Thereby you could eg. create a timeline of a particular metric and track changes to the data that you are monitoring.
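A minimal sketch of such programmatic access could look like this (hedged: check the exact AnalysisResult accessors against the apidocs linked above):
<pre class="prettyprint lang-java">
import java.io.FileInputStream;
import java.io.ObjectInputStream;
import org.eobjects.analyzer.result.AnalysisResult;

public class ReadResultExample {
    public static void main(String[] args) throws Exception {
        ObjectInputStream in = new ObjectInputStream(
                new FileInputStream("employees.analysis.result.dat"));
        try {
            // the file written with -ot SERIALIZED contains an AnalysisResult
            AnalysisResult result = (AnalysisResult) in.readObject();
            System.out.println("Created: " + result.getCreationDate());
            System.out.println("Component results: " + result.getResults().size());
        } finally {
            in.close();
        }
    }
}</pre>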
<b>Update:</b> Please read my follow-up blog post about the <a href="http://kasper.eobjects.org/2012/06/revealing-dq-monitor-datacleaner-30.html">plans to include a full Data Quality monitoring solution</a> as of DataCleaner 3.0.Anonymousnoreply@blogger.com4tag:blogger.com,1999:blog-1301643147176033927.post-53020416608181992282011-12-23T16:11:00.000+01:002011-12-23T20:58:04.198+01:00Push down query optimization in DataCleaner<p>As a follow-up to my previous post about how we make DataCleaner super-fast by applying some nice multi-threading tricks, <a href="http://kasper.eobjects.org/2011/11/datacleaner-engine-explained.html">The DataCleaner engine explained</a>, I would now like to touch upon another performance booster: <b>Push down query optimization</b>.</p>
<p>To my knowledge "push down query optimization" is a trick that only very few tools support, since it requires a flow model that was actually built for it. The idea is that by inspecting an execution flow the tool might be able to identify steps at the beginning or end of the flow that can be replaced by query modifications.</p>
<p>For example, your data flow might begin with a filtering action that removes all records of a given type, or restricts further processing to only the first 1000 records, or something like that. Most tools simply require you to write some SQL yourself, which is also doable, but as I've said before on this blog, I think <a href="http://kasper.eobjects.org/2011/09/data-profiling-sqlized-uh-oh.html">writing SQL is a barrier to productivity, creativity and good data quality results</a>. So in DataCleaner we do not offer this option, because we have something that is <i>much, much nicer</i>: push down query optimization!</p>
<p>Let me illustrate. I will be using the <a target="_blank" href="http://dev.mysql.com/doc/sakila/en/sakila.html#sakila-installation">Sakila example database for MySQL</a>:</p>
<p>Say you want to do a simple pattern finding of film titles in the Sakila database, you would select the title column and you would get a result like this:</p>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-Kohc9pltOJ4/TvSSQVqCZsI/AAAAAAAAALU/LshXyh5FeJw/s1600/pushdown0.jpg" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="222" width="320" src="http://2.bp.blogspot.com/-Kohc9pltOJ4/TvSSQVqCZsI/AAAAAAAAALU/LshXyh5FeJw/s320/pushdown0.jpg" /></a></div>
<p>In the DataCleaner logs we can see what queries are actually fired to the database. Open up the log file (in the <i>logs</i> folder) and inspect <i>datacleaner.log</i>. You will find a line like this:</p>
<blockquote>
Executing query: SELECT `nicer_but_slower_film_list`.`title` FROM sakila.`nicer_but_slower_film_list`
</blockquote>
<p>That's fine. You can inspect the results more closely, but that's not what this topic is about, so I'll carry on... Now let's say you want to refine your job. Let's see what the pattern distribution looks like if we only consider a few categories of films. So I add an 'Equals' filter to select only <b>horror</b>, <b>sports</b> and <b>action</b> movies and apply it to my pattern finder:</p>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-mJkbLj9VcVo/TvSVGExoh_I/AAAAAAAAALg/kITKrvCSS2w/s1600/pushdown1.jpg" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="133" width="320" src="http://3.bp.blogspot.com/-mJkbLj9VcVo/TvSVGExoh_I/AAAAAAAAALg/kITKrvCSS2w/s320/pushdown1.jpg" /></a></div>
<p>If we run the job, and inspect the log file again, we see now this entry:</p>
<blockquote>
Executing query: SELECT `nicer_but_slower_film_list`.`title`, `nicer_but_slower_film_list`.`category` FROM sakila.`nicer_but_slower_film_list` WHERE (`nicer_but_slower_film_list`.`category` = '<b>Horror</b>' OR `nicer_but_slower_film_list`.`category` = '<b>Action</b>' OR `nicer_but_slower_film_list`.`category` = '<b>Sports</b>')
</blockquote>
<p>What's surprising here is that the filter actually got query optimized. Not all filters have this ability, since some of them have richer functionality than can be expressed as a query modification. But some of them do, and typically these are the small functions that make a big difference.</p>
<p>Let's also apply a Max rows filter that limits the analysis to only <i>20 records</i> and chain it so that it depends on the Equals filter:</p>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-BNIEAW3Rwd8/TvSYKa5EqDI/AAAAAAAAALs/FzhD09PrrHc/s1600/pushdown2.jpg" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="133" width="320" src="http://4.bp.blogspot.com/-BNIEAW3Rwd8/TvSYKa5EqDI/AAAAAAAAALs/FzhD09PrrHc/s320/pushdown2.jpg" /></a></div>
<p>If we now run the job, both filters will have been applied to the query:</p>
<blockquote>
Executing query: SELECT `nicer_but_slower_film_list`.`title`, `nicer_but_slower_film_list`.`category` FROM sakila.`nicer_but_slower_film_list` WHERE (`nicer_but_slower_film_list`.`category` = 'Horror' OR `nicer_but_slower_film_list`.`category` = 'Action' OR `nicer_but_slower_film_list`.`category` = 'Sports') <b>LIMIT 20</b>
</blockquote>
<p>That means that we do as much as we can to optimize the query, without ever having to ask the user to help us. So if you modify the logical job, the physical queries are automatically adapted! This is why push down query optimization is a superior optimization technique to raw SQL. Happy data cleaning!</p>
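<p>If you're curious what that looks like from the inside, here's a rough sketch of a hypothetical 'max rows'-style query optimized filter. The interface methods (isOptimizable/optimizeQuery) follow the AnalyzerBeans plugin API as I recall it, so do double-check the details against the apidocs - see the developer note below for the official pointer:</p>
<pre class="prettyprint lang-java">
// Hypothetical sketch - not the actual built-in Max rows filter
@FilterBean("Max rows (sketch)")
public class MaxRowsSketchFilter implements QueryOptimizedFilter<MaxRowsSketchFilter.Category> {

    public static enum Category { VALID, INVALID }

    @Configured
    int maxRows = 1000;

    private final AtomicInteger counter = new AtomicInteger();

    // Fallback: row-by-row categorization, used when push down is not possible
    public Category categorize(InputRow row) {
        return counter.incrementAndGet() <= maxRows ? Category.VALID : Category.INVALID;
    }

    // Only the VALID outcome can be expressed as a query modification
    public boolean isOptimizable(Category category) {
        return category == Category.VALID;
    }

    // Push the restriction down into the query: SELECT ... LIMIT maxRows
    public Query optimizeQuery(Query query, Category category) {
        query.setMaxRows(maxRows);
        return query;
    }
}</pre>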
<p><b>Additional information for developers</b>: If you're developing plugins to DataCleaner and want to make a query optimized filter, then simply make sure you implement the <a target="_blank" href="http://analyzerbeans.eobjects.org/apidocs/org/eobjects/analyzer/beans/api/QueryOptimizedFilter.html">QueryOptimizedFilter</a> interface! Happy coding!</p>Anonymousnoreply@blogger.com2Copenhagen, Denmark55.6760968 12.568337155.604469300000005 12.4104086 55.7477243 12.726265600000001tag:blogger.com,1999:blog-1301643147176033927.post-54821313915706466862011-11-29T13:35:00.001+01:002011-11-29T23:04:55.235+01:00The DataCleaner engine explained<img src="http://datacleaner.eobjects.org/resources/screenshots/dc_2.1_c_small.png" alt="" style="float: right; border: none;"/>
<p>For this blog entry I have decided to record a short video instead of writing till my fingers fall off :) So I present to you: My <b>videoblog</b> entry about <b>DataCleaner's data quality processing engine</b>, and how it compares to traditional <b>ETL engines</b>.</p>
<p>The DataCleaner engine was created from the ground up to be optimized for Data Quality projects. It performs better than any other engine that we've looked at, which I think is a pretty nice achievement. In the video I try to explain what makes it different!</p>
<div style="clear:both;"></div>
<div>
<object classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000" width="640" height="498" id="csSWF">
<param name="movie" value="http://datacleaner.eobjects.org/resources/webcasts/etlightweight_controller.swf" />
<param name="quality" value="best" />
<param name="bgcolor" value="#1a1a1a" />
<param name="allowfullscreen" value="true" />
<param name="scale" value="showall" />
<param name="allowscriptaccess" value="always" />
<param name="flashvars" value="autostart=false&thumb=http://datacleaner.eobjects.org/resources/webcasts/etlightweight_firstframe.png&thumbscale=45&color=0x000000,0x000000" />
<!--[if !IE]>-->
<object type="application/x-shockwave-flash" data="http://datacleaner.eobjects.org/resources/webcasts/etlightweight_controller.swf" width="640" height="498">
<param name="quality" value="best" />
<param name="bgcolor" value="#1a1a1a" />
<param name="allowfullscreen" value="true" />
<param name="scale" value="showall" />
<param name="allowscriptaccess" value="always" />
<param name="flashvars" value="autostart=false&thumb=http://datacleaner.eobjects.org/resources/webcasts/etlightweight_firstframe.png&thumbscale=45&color=0x000000,0x000000" />
<!--<![endif]-->
<div id="noUpdate">
<p>The Camtasia Studio video content presented here requires JavaScript to be enabled and the latest version of the Adobe Flash Player. If you are using a browser with JavaScript disabled please enable it now. Otherwise, please update your version of the free Adobe Flash Player by <a href="http://www.adobe.com/go/getflashplayer">downloading here</a>. </p>
</div>
<!--[if !IE]>-->
</object>
<!--<![endif]-->
</object>
</div>
<p>Enjoy using <a href="http://datacleaner.eobjects.org">DataCleaner</a> :)</p>Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-1301643147176033927.post-19872852290666511652011-10-30T10:33:00.000+01:002011-10-30T10:33:07.156+01:00Standardize the date formats in your dataOne of the things that I see sometimes is that web forms cause unstandardized data in your database. For example, text fields in web forms do not have a native way to specify the type of the data. So what if you have a field that is supposed to be a date? For example the birthdate of your web users? A lot of web applications are not performing real validations of the format and content of the data entered into such fields. I think this typically occurs because it was not thought of as important at the time of designing the initial web page. But maybe it will become important at a point in time if eg. you want to analyze the age groups of your users! The trouble is that later on in the applications lifecycle, a state of unchangeability enters because you're stuck with a bunch of unstandardized data that you cannot conform to a new standardized data format. This is because you will have a lot of different date formats represented. For example:<br />
<div>
<ul>
<li>2011-10-30</li>
<li>20111030</li>
<li>30th of October, 2011</li>
<li>30/10/11</li>
</ul>
<div>
<div>
And maybe some even more exotic ones...</div>
<div>
In this blog entry I will show you how to solve that migration issue with the use of <a href="http://datacleaner.eobjects.org/">DataCleaner</a>.</div>
<div>
<br /></div>
</div>
</div>
<div>
<span class="Apple-style-span" style="font-size: large;">1. Date mask matching</span></div>
<div>
The first thing we should do is to analyze which date patterns are present in the data. To do this you need to combine two components: The <b>Date mask matcher </b>and the <b>Boolean analyzer</b>. Here are the steps involved.</div>
<div>
<ol>
<li>First set up you datastore in the welcome screen of DataCleaner.</li>
<li>Click the "Analyze!" button to begin composing your job.</li>
<li>In the tree to the left, select the columns of interest - in our example at least the <i>birthdate</i> column.</li>
<li>Click "Add transformer -> Matching and standardization -> Date mask matcher".</li>
</ol>
<div>
Your screen will now look something like this:</div>
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-0cL2g6Cm-HA/Tq0Rwxv4etI/AAAAAAAAAB8/6v5Flyer_Sg/s1600/datestd1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="271" src="http://2.bp.blogspot.com/-0cL2g6Cm-HA/Tq0Rwxv4etI/AAAAAAAAAB8/6v5Flyer_Sg/s320/datestd1.png" width="320" /></a></div>
<div>
In the middle of the screen you see a list of date masks. Each of these produces a boolean output column (seen below). The idea of the Date mask matcher is that it creates these boolean columns so that you can even check whether a particular date is parseable by several date masks at once. After all, a single date string like "080910" can be understood in many ways!</div>
<div>
<br /></div>
<div>
<span class="Apple-style-span" style="font-size: large;">2. Analyzing matches</span></div>
<div>
Moving on, we want to see how well our dates match against the date masks. Since all the matches are now stored in boolean columns, we can apply the Boolean analyzer. Here are the steps involved:</div>
<div>
<ol>
<li>Click "Add analyzer -> Boolean analyzer".</li>
<li>Make sure all the transformed boolean columns are checked.</li>
<li>Click the "Run analysis" button.</li>
<li>Wait for the analysis to run.</li>
</ol>
<div>
Your screen will now contain an analysis result like this:</div>
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-6Su83jYmPyY/Tq0TTT1EOvI/AAAAAAAAACE/OiWgQRJg7N4/s1600/stddates2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="150" src="http://3.bp.blogspot.com/-6Su83jYmPyY/Tq0TTT1EOvI/AAAAAAAAACE/OiWgQRJg7N4/s320/stddates2.png" width="320" /></a></div>
<div>
The result has two parts: The <i>Column statistics</i> and the <i>Frequency of combinations</i>.</div>
<div>
In the column statistics you can see how often the individual date masks have been matched. In our example we can see that 4 of our date masks (no. 2, 3, 5 and 6) are not matched at all, so we may consider removing them from the Date mask matcher.</div>
<div>
In the frequency of combinations we get a view of the rows and which match combinations are frequent and less frequent. The most frequent combination is that our date mask no. 1 is the only valid mask. The second most frequent combination (<i>Combination 1</i>) is that none of the date masks apply. If you click the green arrow to the right of the combination you will see which records fall into that category. In our example that looks like this:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-xO91UlD13cY/Tq0UYxHusUI/AAAAAAAAACM/hAA61vTCgvk/s1600/stddates3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="115" src="http://2.bp.blogspot.com/-xO91UlD13cY/Tq0UYxHusUI/AAAAAAAAACM/hAA61vTCgvk/s320/stddates3.png" width="320" /></a></div>
<div>
This gives us a good hint about which date masks we need to add to our date mask matcher.</div>
<div>
The "1982.03.21" date is a simple case - we should simply create a date mask like this: <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">YYYY.MM.dd</span></div>
<div>
The "11th of march, 1982" date is a bit more complex. We need to allow the date mask to have a literal string part (the "th of" part) and it needs to recognize the month by name ("march"), not by number. Fortunately this is still possible, the date mask looks like this: <span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">dd'th of' MMMMM, YYYY</span></div>
<div>
<br /></div>
<div>
<span class="Apple-style-span" style="font-size: large;">3. Converting to dates</span></div>
<div>
While we could continue to refine the analysis, this is a blog, not a reference manual and I want to cut to the chase - the actual migration to standardized dates!</div>
<div>
So let us look at how you can convert your date strings to actual date fields which you can then choose to format using a standardized format. To do this, click "Add transformer -> Conversion -> Convert to date". You will now see a configuration panel like this:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-T-H9VnDx2f0/Tq0Wb5-wJkI/AAAAAAAAACU/-KeYZmSOPJU/s1600/stddates4.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="298" src="http://3.bp.blogspot.com/-T-H9VnDx2f0/Tq0Wb5-wJkI/AAAAAAAAACU/-KeYZmSOPJU/s320/stddates4.png" width="320" /></a></div>
<div>
Here you also see a list of example date masks. Click the plus button to add additional date masks to convert by. The converter tries the masks from the top down and uses the first one that matches, so for ambiguous values like "091011" the ordering is your choice to make (I would recommend basing it on your analysis).</div>
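<div>
In pseudo-code, the converter's top-down semantics amount to something like this (a minimal sketch of the behaviour, not DataCleaner's actual implementation):</div>
<pre>
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class ConvertToDateSketch {
    // Masks are tried in order; for ambiguous input like "091011" the first match wins
    private static final String[] MASKS = {"yyyy.MM.dd", "dd'th of' MMMM, yyyy", "yyMMdd"};

    static Date convert(String value) {
        for (String mask : MASKS) {
            try {
                SimpleDateFormat format = new SimpleDateFormat(mask);
                format.setLenient(false); // reject nonsense like month 13
                return format.parse(value);
            } catch (ParseException e) {
                // no match - fall through to the next mask
            }
        }
        return null; // unrecognized values become null in the output column
    }
}
</pre>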
<div>
We add the few masks that are relevant for our example:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-ZMtw85qvbto/Tq0XV8PoEhI/AAAAAAAAACc/y9eC58yTwA0/s1600/stddates5.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="227" src="http://2.bp.blogspot.com/-ZMtw85qvbto/Tq0XV8PoEhI/AAAAAAAAACc/y9eC58yTwA0/s320/stddates5.png" width="320" /></a></div>
<div>
And we verify, by clicking the "Preview data" button, that no dates immediately go unrecognized:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-S1_6MtdCofo/Tq0Xf_VjvsI/AAAAAAAAACk/9txEUJDjp7s/s1600/stddates6.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="http://1.bp.blogspot.com/-S1_6MtdCofo/Tq0Xf_VjvsI/AAAAAAAAACk/9txEUJDjp7s/s320/stddates6.png" width="299" /></a></div>
<div>
If a date is not recognized and converted, then the output column will contain a null instead of a date. A good practice is therefore to look for null values and e.g. save them to an error handling file. To do this, here's what we do:</div>
<div>
<ol>
<li>Go to the <i>Filters</i> tab.</li>
<li>Click "Add filter -> Not null".</li>
<li>Select only your converted column (in our example "birthdate (as date)").</li>
<li>Click "INVALID -> Write to CSV file".</li>
<li>Select the desired columns for the error handling file.</li>
<li>Optionally right click the "Write to CSV file" tab and select "Rename component" to give it a name like "write errors".</li>
<li>Go back to the <i>Filters</i> tab and click the VALID button to write the valid records to a CSV file or a spreadsheet.</li>
</ol>
<div>
After these steps, you should be able to inspect your job flow by clicking the <i>Visualize</i> button:</div>
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-fjwWo0T1cgQ/Tq0Y7cVJkLI/AAAAAAAAACs/wSY0S9PXRpo/s1600/stddates7.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="90" src="http://3.bp.blogspot.com/-fjwWo0T1cgQ/Tq0Y7cVJkLI/AAAAAAAAACs/wSY0S9PXRpo/s320/stddates7.png" width="320" /></a></div>
<div>
Now your date standardization job is ready to run again and again to enforce standardized dates!</div>Anonymousnoreply@blogger.com1tag:blogger.com,1999:blog-1301643147176033927.post-6816969856976443182011-09-26T20:37:00.000+02:002019-02-12T04:22:05.098+01:00Data Profiling SQLized. Uh oh...<div style="clear: both; float: right; width: 260px; background-color: #f0f0f0; border: 1px solid gray; padding: 5px; margin-left: 5px;">
<img height="320" src="http://3.bp.blogspot.com/-KBvH5qyXKbc/ToDELopw8tI/AAAAAAAAAB4/hGUEv4R8Vck/s320/scrooge" width="249" />
<p>What do 'Scrooge' and 'Kasper' have in common? Not much, according to my SQL data profiler.</p></div>
Some months back (admittedly, more than a couple) I was <a href="http://kasper.eobjects.org/2011/01/its-very-easy-to-make-your-own-data.html">explaining</a> how I think people tend to do "home made data profiling" too often because it apparently seems easy to do in SQL. I went on to promise that I would also play the devil's advocate and show a few examples of "copy paste queries" that you could use for such a tool. In this blog post I will try to do so. But let me first say:<br />
<br /><br />
<i>Don't do this at home, kids!</i><br />
<br /><p>But let's start with the first query that I would add to my home baked profiling SQL script. We'll do what anyone who hasn't really understood what profiling is all about will tell you to do: a column analysis based on the metrics that are readily available in all SQL implementations:</p><blockquote>
SELECT MAX(column), MIN(column), COUNT(column), COUNT(*)<br/>FROM table;</blockquote>
<p>This is a good query, especially for number columns. Here I would typically look at whether the MIN value is below zero or not. Is the COUNT(column) equal to the COUNT(*)? If not, there are nulls in the column. Why not do separate queries, which would be more readable? Sure, but that also makes my script larger and gives me more to maintain. Still, let's try it - we can actually improve it by adding a few metrics:</p><blockquote style="clear: both;">
SELECT MAX(column) AS <b>highest_value</b> FROM table;<br />
SELECT MIN(column) AS <b>lowest_positive_value</b> FROM table WHERE column > 0;<br />
SELECT MIN(column) AS <b>lowest_negative_value</b> FROM table WHERE column < 0;<br />
SELECT COUNT(*) AS <b>num_values</b> FROM table WHERE column IS NOT NULL;<br />
SELECT COUNT(*) AS <b>num_nulls</b> FROM table WHERE column IS NULL;</blockquote><p>Now let's continue with some string columns, because I think more often than not, this is where data profiling turns out to be really valuable. Something that I often see as an inconsistency in structured string data is case differences. Such inconsistencies makes reporting and analysis of the data cumbersome and error prone because grouping and filtering will ultimately be inprecise. So let's do a case analysis:</p><blockquote>
SELECT COUNT(*) AS <b>num_lowercase</b> FROM table WHERE LCASE(column) = column;<br />
SELECT COUNT(*) AS <b>num_uppercase</b> FROM table WHERE UCASE(column) = column;<br />
SELECT COUNT(*) AS <b>num_mixed_case</b> FROM table WHERE LCASE(column) <> column AND UCASE(column) <> column;</blockquote><p>And then on to querying the always popular "first letter is capitalized" type of strings. This one really depends on the database, because substring functions have not been standardized across the major SQL implementations. I'll show a few:</p>
<p>INITCAP-based approach (e.g. PostgreSQL and Oracle):</p><blockquote>
SELECT COUNT(*) AS <b>num_first_letter_capitalized</b> FROM table WHERE INITCAP(column) = column;</blockquote>
<p>SUBSTRING-based approach (e.g. MySQL; on Microsoft SQL Server the same idea is spelled with UPPER, LOWER and SUBSTRING(column, 1, 1)). Note that string positions in SQL start at 1, not 0:</p><blockquote>
SELECT COUNT(*) AS <b>num_first_letter_capitalized</b> FROM table<br />
WHERE UCASE(SUBSTRING(column FROM 1 FOR 1)) = SUBSTRING(column FROM 1 FOR 1)<br />
AND LCASE(SUBSTRING(column FROM 2)) = SUBSTRING(column FROM 2)</blockquote>
A bit cumbersome, but gets the job done. Being the devil's advocate, I'm still not convinced that I should throw out my home baked SQL just yet. So I'm ready for another challenge!<br />
<br />
Let's have a look at pattern finding through SQL. Again this is perfectly possible. I've even heard many people telling me that we should rewrite <a href="http://datacleaner.github.io" target="_blank">DataCleaner</a>'s Pattern Finder to make it SQL optimized. Read on and judge for yourself :-)<br />
<br />
To match tokens by pattern we apply the simplest possible configuration in DataCleaner's pattern finder: All letters are replaced by 'A' or 'a' and all numbers are replaced by '9'. This makes for a nice pattern based matcher, like this:<br />
<blockquote>
Mickey Mouse -> 'Aaaaaa Aaaaa'<br />
Minnie Mouse -> 'Aaaaaa Aaaaa'<br />
Joachim von And -> 'Aaaaaaa aaa Aaa'<br />
kasper@eobjects.dk -> 'aaaaaa@aaaaaaaa.aa'</blockquote>
<i>(Random fact: 'Joachim von And' is the Danish name for Scrooge McDuck)</i><br />
As you can see from the patterns, this is a good preliminary way to determine if string values have the same form and syntax - we immediately see that the email address is the odd one out, and that although all the other values look like valid names, some have lowercase tokens (prefixes) in between.<br />
<br />
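In plain Java, by the way, this simplest configuration boils down to three chained replacements - a minimal sketch which, like the SQL below, only handles ASCII letters:<br />
<pre>
public class PatternSketch {
    // Uppercase letters become 'A', lowercase 'a', digits '9';
    // everything else (spaces, '@', '.') is kept as-is
    static String toPattern(String value) {
        return value.replaceAll("[A-Z]", "A")
                    .replaceAll("[a-z]", "a")
                    .replaceAll("[0-9]", "9");
    }

    public static void main(String[] args) {
        System.out.println(toPattern("Joachim von And")); // Aaaaaaa aaa Aaa
    }
}
</pre>
<br />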
In PostgreSQL for example, this would look like:<br />
<br />
<blockquote>
SELECT regexp_replace(regexp_replace(regexp_replace(column, '[a-z]','a','g'), '[A-Z]','A','g'), '[0-9]','9','g') AS <b>pattern</b>, COUNT(*) AS <b>pattern_count</b> FROM table GROUP BY pattern;</blockquote>
<br />
This actually works like a charm and returns:<br />
<br />
<table border="0" cellpadding="0" cellspacing="0" class="pretty-table">
<tbody>
<tr><th>pattern</th><th>pattern_count</th></tr>
<tr><td>Aaaaaa Aaaaa</td><td>2</td></tr>
<tr><td>Aaaaaaa aaa Aaa</td><td>1</td></tr>
<tr><td>aaaaaa@aaaaaaaa.aa</td><td>1</td></tr>
</tbody></table>
So why even use a profiling tool for finding patterns? All this seems to be possible through raw SQL?<br />
<br />
I will now stop playing the devil's advocate... Cuz' seriously... This is nonsense! Having worked for some years on a pretty good <a href="http://datacleaner.github.io" target="_blank">data quality analysis tool</a>, this approach absolutely disgusts me. Here are just a few reasons why, off the top of my head:<br />
<br />
<ul>
<li>We still haven't scratched the surface when it comes to supporting e.g. non-ASCII characters in patterns.</li>
<li>Some tokens in patterns should be matched regardless of string length, some shouldn't. In our case we never matched strings of unequal lengths (e.g. Scrooge and Mickey). This is a setting that you will want to play around with! (For examples, check out our <a href="https://datacleaner.github.io/docs/5.4.0/components/pattern_finder.html" target="_blank">Pattern Finder documentation</a>.)</li>
<li>Each metric in the previous analyses required its own query. This means that if you want to analyze a hundred metrics, you need to run (at least) a hundred queries.</li>
<li>A lot of metrics are simply not possible to express in SQL. Some examples: diacritic character count, max/min number of words, matches against reference data and more.</li>
<li>Often you will want to preprocess data before (or actually I would argue, as a part of) your profiling. This can be for example to extract information from composite values or to replace known inconsistencies with standardized values.</li>
<li>None of the examples offer drill-to-detail behaviour, so further analysis is more or less impossible. There is, for example, no way to express in our pattern finder SQL that we want to keep a few samples of each pattern match for later inspection.</li>
<li>All in all, using SQL for data profiling makes for a terribly unexplorative approach. It's a pain having to write and modify this much SQL to get simple things done - don't rely on it, because it will make you lazy, and then you won't investigate properly!</li>
<li>And of course, SQL only applies to databases that support SQL! If you're looking to profile data in other formats, then you're out of luck with this approach.</li>
</ul>
Anonymousnoreply@blogger.com7Copenhagen, Denmark55.6760968 12.568337155.604469300000005 12.4104086 55.7477243 12.726265600000001tag:blogger.com,1999:blog-1301643147176033927.post-12488947666151742432011-08-15T15:19:00.000+02:002011-08-15T15:19:47.229+02:00Get your data right... First Time Right!In my blog I mostly talk about data quality tools like <a href="http://datacleaner.eobjects.org">DataCleaner</a> that are <i>diagnostic</i> and <i>treating</i>, rather than <i>preventive</i>. Such tools have a lot of merits and strengths, but for a total view on data quality it is crucial that you also include tools that prevent poor data from ever entering your system. In this blog post I want to talk a bit about a project that I have been involved with at <a href="http://www.humaninference.com">Human Inference</a> which is just that - our <a href="http://www.humaninference.com/solutions/first-time-right">First Time Right</a> JavaScript solution.
<br />
<br />The idea is that we provide a subscription-based JavaScript API with which you can easily decorate any HTML contact form with rich features for on-the-fly verification, validation and auto-correction, as well as automatic filling of derived fields.
<br />
<br />For example, the API allows you to enter (or copy/paste) a full name, including titles, salutation, initials and more - and have these items parsed and placed into the corresponding fields on a detailed contact form. It will even automatically detect the gender of the contact and apply it to gender fields. We have similar data entry aids for address input, email input, phone numbers and contact duplicate checking.
<br />
<br />Take a look at the video below, which demonstrates most of the features:
<br />
<br /><iframe width="640" height="390" src="http://www.youtube.com/embed/BN80Ezyo2WY?rel=0" frameborder="0" allowfullscreen></iframe>
<br />
<br />Now this is quite exciting functionality, but this is also a technical blog, so I'll talk a bit about the technology involved.
<br />
<br />We built the project based on <a href="http://code.google.com/webtoolkit/">Google Web Toolkit</a> (GWT). GWT enables us to build a very rich application, entirely in JavaScript, so that it can be embedded on any website - no matter if it's PHP based, ASP.NET based, Java based or whatever. Of course we do have a server-side piece that the JavaScript communicates with, but that is all hosted on Human Inference's cloud platform. So in other words: The deployment of our First Time Right principle is a breeze!
<br />
<br />Since the browser's same-origin policy normally restricts an AJAX application to communicating with the server its page came from, we've had to overcome quite some issues to allow the JavaScript to be external to the deployment sites. This is crucial, as we want upgrades and improvements to be performed on our premises, not at individual customer sites. This way we can really leverage the cloud- and subscription-based approach to data quality. Our solution to the locality problem has been the <a href="http://en.wikipedia.org/wiki/JSONP">JSONP</a> approach, which is an alternative protocol for implementing AJAX behaviour. JSONP is a rather clever construct where instead of issuing actual XMLHttpRequests, you insert new <script> elements into the HTML DOM at runtime! This means that the browser will perform a new request simply because the <script> element refers to a new JavaScript source. Tackling the error handling and asynchronicity that this approach brings is not "pretty", but we've done a lot of work to get it right, and it works like a charm! I hope to share some of our design patterns later, to demonstrate how it works.
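<br />
<br />To illustrate the principle (a generic sketch, not Human Inference's actual API), the server side of a JSONP service simply wraps its JSON payload in a call to a callback function chosen by the client, so that loading the response as a <script> source executes it:
<pre>
import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class JsonpServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        // The client appends e.g. ?callback=handleResult to the script URL
        // (a real implementation should whitelist the callback name)
        String callback = req.getParameter("callback");
        String json = "{\"gender\": \"male\"}"; // payload computed server-side
        resp.setContentType("text/javascript");
        // The response is not JSON but a JavaScript statement:
        // handleResult({"gender": "male"});
        resp.getWriter().write(callback + "(" + json + ");");
    }
}
</pre>
The client-side half then consists of defining the callback function and inserting a <script> element pointing at such a servlet - plus, of course, all the error handling mentioned above.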
<br />
<br />Another challenge was security. Obviously you want to make sure that the JavaScript is only available to subscribers - and only on the websites they've subscribed for (because otherwise the JavaScript could simply be copied to another website). Our way around this resembles how, for example, Google manages subscriptions to Google Maps and other subscription services: you need a site-specific API key. Very clever.
<br />
<br />A few optional features may require some local add-on deployment. In particular, deduplication requires us to know the contact data to use as the source for detecting if a new contact is a duplicate. Here we have two options: On-premise installation of the deduplication engine or hooking up with our cloud-based deduplication engine, which can be configured to sync with your datastores.
<br />
<br />All in all I am quite enthusiastic about the FTR solution and the technology behind the solution. I also think that our FTR API is an example of a lightweight approach to implementing Data Quality, which complements DataCleaner very well. Both tools are extremely useful for ensuring a high level of data quality, and both tools are very intuitive and flexible in the way you can deploy them.Anonymousnoreply@blogger.com1tag:blogger.com,1999:blog-1301643147176033927.post-59297107533537928612011-08-04T13:46:00.008+02:002011-08-05T16:44:40.426+02:00Eye candy in Java 7: New javadoc style!By now most of you've probably heard that Java 7 is out and there's a lot of discussions about new features, the loop optimization bug and general adoption.<br /><br />But one of the things in Java 7 which has escaped most people attention (I think) is the new javadoc style.<br /><br />Check it out:<br /><img style="display:block; margin:0px auto 10px; text-align:center;width: 400px; height: 273px;" src="http://3.bp.blogspot.com/-1Ha-DY-m8H4/TjqHNfBFdvI/AAAAAAAAAB0/-xSPHAjg4r0/s400/mm-apidocs.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5636966549341697778" /><br /><br />And see it live - we've just published an updated <a href="http://metamodel.eobjects.org/apidocs/">API documentation for MetaModel</a>.Anonymousnoreply@blogger.com1tag:blogger.com,1999:blog-1301643147176033927.post-91420090618364546292011-08-01T21:59:00.013+02:002012-06-23T13:39:15.679+02:00Unit test your data<div>
<img alt="" border="0" id="BLOGGER_PHOTO_ID_5635990927739223698" src="http://1.bp.blogspot.com/-K_DhJcZF5Qs/TjcP41U4fpI/AAAAAAAAABs/s7OHbwhJfG4/s400/testing.jpg" style="float: right; margin: 10px; width: 200px;" />In modern software development <a href="http://en.wikipedia.org/wiki/Unit_testing">unit testing</a> is widely used as a way to check the quality of your code. For those of you who are not software developers, the idea in unit testing is that you define rules for your code that you check again and again, to verify that your code works, and keep on working.</div>
<div>
Unit testing and data quality have quite a lot in common, in my opinion. Both code and data change over time, so there is a constant need to keep checking that your code/data has the desired characteristics. This was something that I was <a href="http://datacleaner.eobjects.org/topic/206/2-1-1-Pattern-Finder-predefined-token">recently</a> reminded of by a <a href="http://datacleaner.eobjects.org/">DataCleaner</a> user on our forums.</div>
<div>
I am happy to see that data stewards and the like are picking up this idea, as it has been maturing for quite some time in the software development industry. It also got me thinking: In software development we have a lot of related methods and practices around unit testing. Let me try to list a few of the most important ones, which we can perhaps also apply to data:</div>
<table border="0" cellpadding="0" cellspacing="0" class="pretty-table"><tbody>
<tr><th>Code</th><th>Data</th></tr>
<tr><td>Compile-time checking<br />
(Ensuring correct syntax)</td><td>Database constraints</td></tr>
<tr><td>Unit testing<br />
(Checking a single unit of code)</td><td>Validating data profiling?</td></tr>
<tr><td>Continuous integration<br />
(Running all tests periodically)</td><td>Data Quality monitoring?</td></tr>
<tr><td>Bug tracking<br />
(Maintaining records of all code issues)</td><td>?</td></tr>
<tr><td>Static code analysis<br />
(a la <a href="http://findbugs.sourceforge.net/" target="_blank">FindBugs</a>)</td><td>Explorative data profiling?</td></tr>
<tr><td>Refactoring<br />
(Changing code without breaking functionality)</td><td>ETL with applied DQ rules?</td></tr>
</tbody></table>
<br />
<div>
<small>For explanation of the various data profiling and monitoring types, please refer to my previous post, <a href="http://kasper.eobjects.org/2011/04/two-types-of-data-profiling.html">Two types of data profiling</a>.</small></div>
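<div>
To make the analogy concrete, here is what a "unit test for data" could look like - a minimal sketch using JUnit and JDBC, where the connection details, table and column are made up for the example:</div>
<pre>
import static org.junit.Assert.assertEquals;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.junit.Test;

public class CustomerDataTest {

    @Test
    public void birthdateShouldNeverBeNull() throws Exception {
        // The same data rule is checked again and again, like a code unit test
        Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost/crm", "user", "secret");
        try {
            Statement st = con.createStatement();
            ResultSet rs = st.executeQuery(
                    "SELECT COUNT(*) FROM customers WHERE birthdate IS NULL");
            rs.next();
            assertEquals("customers with null birthdate", 0, rs.getInt(1));
        } finally {
            con.close();
        }
    }
}
</pre>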
<div>
Of course not all metaphors here map one-to-one, but in my opinion the metaphor holds up pretty well. For me, as a software product developer, it also points out some of the weak and strong points of current Data Quality tools. In software development the tool support for unit testing, continuous integration, bug tracking and more is incredible. In the data world I feel that many tools focus on only one or two of the above areas of quality control. Of course you can combine tools, but as I've argued before, <a href="http://www.datavaluetalk.com/2011/02/17/data-quality-analysis-%E2%80%93-it-requires-a-bit-of-all-worlds/" target="_blank">switching tools also comes at a large price</a>.</div>
<div>
So what do I suggest? Well, fellow product developers, let's make better tools that integrate more disciplines of data quality! I know that this has been and still will be my aim for <a href="http://datacleaner.eobjects.org/">DataCleaner</a>.</div>
<div><b>Update:</b> Further actions in this direction have been taken with the plans for DataCleaner 3.0, see <a href="http://kasper.eobjects.org/2012/06/revealing-dq-monitor-datacleaner-30.html">this blog post</a> for more information.</div>Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-1301643147176033927.post-86568617663434531782011-07-14T21:59:00.003+02:002011-07-14T22:03:06.103+02:00A colorful value distributionA few weeks ago I was devoting a bit of attention to the charts in <a href="http://datacleaner.eobjects.org">DataCleaner</a>. Of special interest are the <a href="http://datacleaner.eobjects.org/topic/200/DataCleaner-2-1-3D-charts-" target="_blank">value distribution charts</a>, which have caused some discussions...<br /><br />Anyways, here's a proposal which includes nicer (IMO) coloring, a "distinct count" measure, a dedicated "<blank>" keyword and a few other niceties.<br /><br /><table style="width:194px;"><tr><td align="center" style="height:194px;background:url(https://picasaweb.google.com/s/c/transparent_album_background.gif) no-repeat left"><a href="https://picasaweb.google.com/115504339658300272373/ValueDistributionChartProposals?authuser=0&feat=embedwebsite"><img src="https://lh6.googleusercontent.com/-JsEpTTpZINo/ThQ9q01vOWE/AAAAAAAAABA/apkqt2qn45k/s160-c/ValueDistributionChartProposals.jpg" width="160" height="160" style="margin:1px 0 0 4px;"></a></td></tr><tr><td style="text-align:center;font-family:arial,sans-serif;font-size:11px"><a href="https://picasaweb.google.com/115504339658300272373/ValueDistributionChartProposals?authuser=0&feat=embedwebsite" style="color:#4D4D4D;font-weight:bold;text-decoration:none;">Value distribution chart proposals</a></td></tr></table><br /><br />You can expect to see this live in DataCleaner 2.3, which is expected in August.Anonymousnoreply@blogger.com0