20140714

MetaModel 4.2 to offer 'Schema Inference'

For the upcoming version 4.2 of Apache MetaModel (incubating) I have been working on adding a JSON module to the already quite broad data access library. The purpose of the new JSON module is to be able to read text files with JSON contents and present it as if it's a database you can explore and query. This presented (again) an issue in MetaModel:

How do we present a queryable schema for a datastore which does not intrinsically have a schema?

A couple of times (for instance in implementing MongoDB, CouchDB, HBase or XML modules) we have faced this issue, and answers have varied a little depending on the particular datastore. This time I didn't feel like creating another one-off strategy for building the schema. Rather it felt like a good time to introduce a common abstraction - called the 'schema builder' - which allows for several standard implementations as well as pluggable ways of building or infering the schema structure.

The new version of MetaModel will thus have a SchemaBuilder interface which you can implement yourself to plug in a mechanism for building/defining the schema. Even better, there's a couple of quite useful standard implementations.

To explain this I need some examples. Assume you have these documents in your JSON file:

{"name":"Kasper Sørensen", "type":"person", "country":"Denmark"}
{"name":"Apache MetaModel", "type":"project", "community":"Apache"}
{"name":"Java", "type":"language"}
{"name":"JavaScript", "type":"language"}

Here we have 4 documents/rows containing various things. All have a name and a type field, but other fields such as country or community seems optional or situational.

In many situations you would want to have a schema model containing a single table with all documents. That would be possible in two ways. Either as a columnized form (implemented via SingleTableInferentialSchemaBuilder):

nametypecountrycommunity
Kasper SørensenpersonDenmark
Apache MetaModelprojectApache
Javalanguage
JavaScriptlanguage

Or having a single column of type MAP (implemented via SingleMapColumnSchemaBuilder):

value
{name=Kasper Sørensen, type=person, country=Denmark}
{name=Apache MetaModel, type=project, community=Apache}
{name=Java, type=language}
{name=JavaScript, type=language}

The latter approach of course has the advantage that it allows for the same polymorphic behaviour as the JSON documents itself, but it doesn't allow for a great SQL-like query syntax like the first approach. The first approach needs to do an initial analysis to infer the structure based on observed documents in a sample.

A third approach is to build multiple virtualized tables by splitting by distinct values of the "type" field. In our example, it seems that this "type" field is there to distinguish separate types of documents, so that means we can build tables like this (implemented via MultiTableInferentialSchemaBuilder):

Table 'person'
namecountry
Kasper SørensenDenmark


Table 'project'
namecommunity
Apache MetaModelApache


Table 'language'
name
Java
JavaScript


As you can see, the standard-implementations of schema builder offer quite some flexibility. The feature is still new and can certainly be improved and made more elaborate. But I think this is a great addition to MetaModel - to be able to dynamically define and inject rules for accessing schema-less structures.

20140609

Using Apache MetaModel for web applications, some experiences

Recently I’ve been involved in the development of two separate web applications, both built with Apache MetaModel as the data access library. In this blog I want to share some experiences of good and bad things about this experience.

How we configure the webapp

Before we look at what went good and what went bad, let’s first go quickly through the anatomy of the web apps. Both of them have both a "primary" database and some "periphery" databases or files for storage of purpose-specific items. We use MetaModel to access both types of databases, but obviously the demands are quite different.

To make the term "periphery" database understandable, let’s take an example: In one of the applications the user submits data which may contain country names. We have large catalog of known synonyms for country names (like UK, Great Britain, United Kingdom, England etc.) but every once in a while we are unable to match some input – that input is stored in a periphery CSV file so that we can analyze it once in a while and figure out if the input was garbage or if we’re missing some synonyms. Similarly other periphery databases may be to monitor the success rate of A/B tests (usability experiments) or similar things. All of this data we do not want to put into the "primary" database since the life-cycle of the data is very different.

In the cases I’ve worked on, we’ve been using a RDBMS (PostgreSQL to be specific) as the "primary" database and CSV files, Salesforce.com and CouchDB as "periphery" databases.

The primary database is configured using Spring. One of the requirements we have is the ability to externalize all connection information, and we leverage Spring’s property placeholder for this:

<context:property-placeholder location="file:///${user.home}/datastore.properties" />

<bean class="org.apache.metamodel.spring.DataContextFactoryBean">
    <property name="type" value="${datastore.type}" />
    <property name="driverClassName" value="${datastore.driver}" />
    <property name="url" value="${datastore.url}" />
    <property name="username" value="${datastore.username}" />
    <property name="password" value="${datastore.password}" />
</bean>

If you prefer, you can also configure a traditional java.sql.DataSource and inject it into this factory bean with the name ‘dataSource’. But with the approach above the 'type' of the datastore is even externalizable, meaning that we could potentially switch our web application’s primary database to MongoDB or something like that, just by changing the properties.

Implementing a backend component using MetaModel

When we need to get access to data, we simply inject that DataContext. If we also need to modify the data the sub-interface UpdateableDataContext is injected. Here’s an example:

@Component
public class UserDao {
    private final UpdateableDataContext _dataContext;

    @Autowired
    public UserDao(UpdateableDataContext dataContext) {
        _dataContext = dataContext;
    }

    public User getUserById(Number userId) {
        DataSet ds = _dataContext.query()
                        .from("users")
                        .select("username", "name")
                        .where("id").eq(userId)
                        .execute();
        try {
            if (!ds.next()) {
                return null;
            }
            Row row = ds.getRow();
            return new User(userId, row.getValue(0), row.getValue(1));
        } finally {
            ds.close();
        }
    }
}

In reality we use a couple of String-constants and so on here, to avoid typos slipping into e.g. column names. One of the good parts here is that we’re completely in control of the query plan, the query building is type-safe and neat, and the DataContext object being injected is not tied to any particular backend. We can run this query in a test using a completely different type of backend without issues. More about testing in a jiffy.

Automatically creating the schema model

To make deployment as easy as possible, we also ensure that our application can automatically build the tables needed upon starting the application. In existing production environments the tables will already be there, but for new deployments and testing, this capability is great. We’ve implemented this as a spring bean that has a @PostConstruct method to do the bootstrapping of new DataContexts. Here’s how we could build a “users” table:

@Component
public class DatastoreBootstrap {

    private final UpdateableDataContext _dataContext;

    @Autowired
    public UserDao(UpdateableDataContext dataContext) {
        _dataContext = dataContext;
    }

    @PostConstruct
    public void initialize() {
        Schema schema = _dataContext.getDefaultSchema();
        if (schema.getTable("users") == null) {
            CreateTable createTable = new CreateTable(schema, “users”);
            createTable.withColumn("id").ofType(ColumnType.INTEGER).asPrimaryKey();
            createTable.withColumn("username").ofType(ColumnType.VARCHAR).ofSize(64);
            createTable.withColumn("password_hash").ofType(ColumnType.VARCHAR).ofSize(64);
            createTable.withColumn("name").ofType(ColumnType.VARCHAR).ofSize(128);
            _dataContext.executeUpdate(createTable);
        }
        _dataContext.refreshSchemas();
    }
}

So far we’ve demoed stuff that honestly can also be done by many many other persistence frameworks. But now it gets exciting, because in terms of testability I believe Apache MetaModel has something to offer which almost no other...

Testing your components

Using a PojoDataContext (a DataContext based on in-memory Java objects) we can bootstrap a virtual environment for our testcase that has an extremely low footprint compared to normal integration testing. Let me demonstrate how we can test our UserDao:

@Test
public void testGetUserById() {
    // set up test environment.
    // the test datacontext is an in-memory POJO datacontext
    UpdateableDataContext dc = new PojoDataContext();
    new DatastoreBootstrap(dc).initialize();

    // insert a few user records for the test only.
    Table usersTable = dc.getDefaultSchema().getTable("users");
    dc.executeUpdate(new InsertInto(usersTable).value("id", 1233).value("name", "John Doe"));
    dc.executeUpdate(new InsertInto(usersTable).value("id", 1234).value("name", "Jane Doe"));

    // perform test operations and assertions.
    UserDao userDao = new UserDao(dc);
    User user = userDao.getUserById(1234);
    assertEquals(1234, user.getId();
    assertEquals("Jane Doe", user.getName());
}

This is in my opinion a real strength of MetaModel. First we insert some records (physically represented as Java objects) into our data context, and then we can test the querying and everything without even having a real database engine running.

Further evaluation

The examples above are of course just part of the experiences of building a few webapps on top of Apache MetaModel. I noted down a lot of stuff during the development, of which I can summarize in the following pros and cons list.

Pros:

  • Testability with POJOs
  • It’s also easy to facilitate integration testing or manual monkey testing using e.g. Apache Derby or the H2 database. We have a main method in our test-source that will launch our webapp and have it running within a second or so.
  • We use the same API for different kinds of databases.
    • When we do A/B testing, the metrics we are interested in changes a lot. So our results are stored in a CouchDB database instead, because of its dynamic schema nature.
    • For certain unexpected scenarios or exceptional values, we store data for debugging and analytical needs in CSV files, also using MetaModel. Having stuff like that in files makes it easy to share with people who wants to manually inspect what is going on.
  • Precise control over queries and update scripts (transaction) scope is a great benefit. Compared with many Object-Relational-Mapping (ORM) frameworks, this feels like returning home and not having to worry about cascading effects of your actions. You get what you ask for and nothing more.
  • Concerns like SQL injection is already taken care of by MetaModel. Proper escaping and handling of difficult cases in query literals is not your concern.

Cons:

  • At this point Apache MetaModel does not have any object mapping facilities at all. While we do not want a ORM framework, we could definitely use some basic "Row to Object" mapping to reduce boilerplate.
  • There’s currently no API in MetaModel for altering tables. This means that we still have a few cases where we do this using plain SQL, which of course is not portable and therefore not as easily testable. But we can manage to isolate this to just a very few places in the code since altering tables is quite unusual.
  • In one of our applications we have the unusual situation that one of the databases can be altered at runtime by a different application. Since MetaModel caches the schema structure of a DataContext, such a change is not automatically reflected in our code. We can call DataContext.refreshSchemas() to flush our cached model, but obviously that needs to happen intelligently. This is only a concern for database that have tables altered at runtime though (which is in my opinion quite rare).

I hope this may be useful for you to evaluate the library. If you have questions or remarks, or just feel like getting closer to the Apache MetaModel project, I strongly encourage you to join our dev mailing list and raise all your ideas, concerns and questions!

20130618

Introducing Apache MetaModel

Recently we where able to announce an important milestone in the life of our project MetaModel - it is being incubated into the Apache Foundation! Obviously this generates a lot of new attention to the project, and causing lots and lots of questions on what MetaModel is good for. We didn't grant this project to Apache just for fun, but because we wanted to maximize it's value, both for us and for the industry as a whole. So in this post I'll try and explain the scope of MetaModel, how we use it at Human Inference, and what you might use it for in your products or services.


First, let's recap the one-liner for MetaModel:
MetaModel is a library that encapsulates the differences and enhances the capabilities of different datastores.
In other words - it's all about making sure that the way you work with data is standardized, reusable and smart.

But wait, don't we already have things like Object-Relational-Mapping (ORM) frameworks to do that? After all, a framework like OpenJPA or Hibernate will allow you to work with different databases without having to deal with the different SQL dialects etc. The answer is of course yes, you can use such frameworks for ORM, but MetaModel is by choice not an ORM! An ORM assumes an application domain model, whereas MetaModel, as its name implies, is treating the datastore's metadata as its model. This not only allows for much more dynamic behaviour, but it also makes MetaModel applicable only to a range of specific application types that deal with more or less arbitrary data, or dynamic data models, as their domain.

At Human Inference we build just this kind of software products, so we have great use of MetaModel! The two predominant applications that use MetaModel in our products are:
  • HIquality Master Data Management (MDM)
    Our MDM solution is built on a very dynamic data model. With this application we want to allow multiple data sources to be consolidated into a single view - typically to create an aggregated list of customers. In addition, we take third party sources in and enrich the source data with this. So as you can imagine there's a lot of mapping of data models going on in MDM, and also quite a wide range of database technologies. MetaModel is one of the cornerstones to making this happen. Not only does it mean that we onboard data from a wide range of sources. It also means that these source can vary a lot from eachother and that we can map them using metadata about fields, tables etc.
  • DataCleaner
    Our open source data quality toolkit DataCleaner is obviously also very dependent on MetaModel. Actually MetaModel started as a kind of derivative project from the DataCleaner project. In DataCleaner we allow the user to rapidly register new data and immediately build analysis jobs using it. We wanted to avoid building code that's specific to any particular database technology, so we created MetaModel as the abstraction layer for reading data from almost anywhere. Over time it has grown into richer and richer querying capabilities as well as write access, making MetaModel essentially a full CRUD framework for ... anything.
I've often pondered on the question of What could other people be using MetaModel for? I can obviously only provide an open answer, but some ideas that have popped up (in my head or in the community's) are:
  • Any application that needs to onboard/intake data of multiple formats
    Oftentimes people need to be able to import data from many sources, files etc. MetaModel makes it easy to "code once, apply on any data format", so you save a lot of work. This use-case is similar to what the Quipu project is using MetaModel for.
  • Model Driven Development (MDD) tools
    Design tools for MDD are often used to build domain models and at some point translate them to the physical storage layer. MetaModel provides not only "live" metadata from a particular source, but also in-memory structures for e.g. building and mutating virtual tables, schemas, columns and other metadata. By encompassing both the virtual and the physical layer, MetaModel provides a lot of the groundwork to build MDD tools on top of it.
  • Code generation tools
    Tools that generate code (or other digital artifacts) based on data and metadata. Traversing metadata in MetaModel is very easy and uniform, so you could use this information to build code, XML documents etc. to describe or utilize the data at hand.
  • An ORM for anything
    I said just earlier that MetaModel is NOT an ORM. But if you look at MetaModel from an architectural point of view, it could very well serve as the data access layer of an ORM's design. Obviously all the object-mapping would have to be built on top of it, but then you would also have an ORM that maps to not just JDBC databases, like most ORMs do, but also to file formats, NoSQL databases, Salesforce and more!
  • Open-ended analytical applications
    Say you want to figure out e.g. if particular words appear in a file, a database or whatever. You would have to know a lot about the file format, the database fields or similar constructs. But with MetaModel you can instead automate this process by traversing the metadata and querying whatever fields match your predicates. This way you can build tools that "just takes a file" or "just takes a database connection" and let them loose to figure out their own query plan and so on.
If I am to point at a few buzzwords these days, I would say MetaModel can play a critical role in implementing services for things such as data federation, data virtualization, data consolidation, metadata management and automation.

And obviously this is also part of our incentive to make MetaModel available for everyone, under the Apache license. We are in this industry to make our products better and believe that cooperation to build the best foundation will benefit both us and everyone else that reuses it and contributes to it.

20130422

What's the fuzz about National Identifier matching?

The topic of National Identification numbers (also sometimes referred to as social security numbers) is something that can spawn a heated debate at my workplace. Coming from Denmark, but having an employer in the Netherlands, I am exposed to two very different ways of thinking about the subject. But regardless of our differences on the subject - while developing a product for MDM of customer data, like Human Inference is doing, you need to understand the implifications of both approaches - I would almost call them the "with" and "without" national identifiers implementation of MDM!

In Denmark we use our National identifiers a lot - the CPR numbers for persons and the CVR numbers for companies. Previously I wrote a series of blog posts on how to use CVR and CPR for data cleansing. Private companies are allowed in Denmark to collect these identifiers, although consumers can opt to say "no thanks" in most cases. But all in all, it means that we have our IDs out in the society, not locked up in our bedroom closet.

In the Netherlands the use of such identifiers is prohibited for almost all organizations. While talking to my colleagues I get the sense that there's a profound thinking of this ID as more a password than a key. Commercial use of the ID would be like giving up your basic privacy liberties and most people don't remember it by heart. In contrast, Danish citizens typically do share their credentials, but are quite aware about the privacy laws that companies are obligated to follow when receiving this information.

So what is the end result for data quality and MDM? Well, it highly affects how organizations will be doing two of the most complex operations in an MDM solution: Data cleansing and data matching (aka deduplication).

I recently had my second daughter, and immediately after her birth I could not help but noticing that the hospital crew gave us a sheet of paper with her CPR number. This was just an hour or two after she had her first breath, and a long time before she even had an official name! In data quality terms, this was nicely designed first-time-right (FTR, a common principle in DQ and MDM) system in action!

When I work with Danish customers it usually means that you spend a lot of time on verifying that the IDs of persons and companies are in fact the correct IDs. Like any other attribute, you might have typos, formatting glitches etc. And since we have multiple registries and number types (CVR numbers, P numbers, EAN numbers etc.), you also spend quite some time on infering "what is this number?". You would typically look up the companies' names in the public CVR registry and make sure it matches the name in your own data. If not - probably the ID is wrong and you need to delete it or obtain a correct one. While finding duplicates you can typically standardize the formatting of the IDs and do exact matching for the most part, except for those cases where the ID is missing.

When I work with Dutch customers it usually means that we cleanse the individual attributes of a customer in a much more rigorous manner. The name is cleansed on it's own. Then the address. Then the phone number, and then the email and other similar fields. You'll end up knowing if each element is valid or not, but not if the whole record is actually a cohesive chunk of data. While you can apply a lot of cool inferential techniques to check that the data is cohesive (for instance it is plausible that I, Kasper Sørensen, have the email i.am.kasper.sorensen@gmail.com) but you won't know if it is also my address that you can find in the data, or if it's just some valid address.

Of course the grass isn't that much greener in Denmark as I present it here. Unfortunately we also do have problems with CPR and CVR, and in general I disbelieve that there will be one single source of the truth that we can do reference data checks on in the near future. For instance, change of address typically is quite delayed in the national registries, whereas it is much quicker at the post agencies. And although I think you can share your email and similar attributes through CPR - in practice that's not what people do. So actually you need a MDM hub which connects to several sources of data and then pick and choose from the ones that you trust the most for individual pieces of the data. The great thing is that in Denmark we have a much clearer way to do data interchange inbetween data services, since we do have a common type of key for the basic entities. This gives way for very interesting reference data hubs like for instance iDQ, which in turn makes it easier for us to consolidate some of our integration work.

Coming back to the more ethical question: Is the Danish National Identifiers a threat to our privacy? Or is it just a more modern and practical way to reference and share basic data? For me the winning argument for the Danish model is in the term "basic data". We do share basic data through CPR/CVR, but you can't access any of my compromising or transactional data. In comparison, I fear much more for my privacy when sharing data through Facebook, Google and so on. Sure, if you had my CPR number, you would also be able to find out where I live. I wouldn't share my CPR number with you if I did not want to provide that information though, and after all sharing information, CPR, addresses or anything else, always comes at the risk of leaking that information to other parties. Finally, as an MDM professional I must say - combining information from multiple sources - be it public or private registries - isn't exactly something new, so privacy concerns are in my opinion largely the same in all countries.

But it does mean that implementations of MDM are highly likely to differ a lot when you cross national borders. Denmark and the Netherlands are maybe profound examples of different national systems, but given how much we have in common in general, I am sure there are tons of black swans out there for me yet to discover. As a MDM vendor, and as the lead for DataCleaner, I always need to ensure that our products caters to international - and thereby highly varied - data and ways of processing data.

20130117

Cleaning Danish customer data with the CVR registry and DataCleaner

I am doing a series of 3 blog entries over at Data Value Talk about the political and practical side of the Danish government's recent decision to open up basic public data to everyone. I thought it would be nice to also share the word here on my personal blog!
Click the images above to navigate to the three chapters on my blog:
Cleaning Danish customer data with the CVR registry and DataCleaner.

20121207

How to build a Groovy DataCleaner extension

In this blog entry I'll go through the process of developing a DataCleaner extension: The Groovy DataCleaner extension (just published today). The source code for the extension is available on GitHub if you wish to check it out or even fork and improve it!
Groovy

First step: You have an idea for your extension. My idea was to get the Groovy language integrated with DataCleaner, to offer an advanced scripting language option, similar to the existing JavaScript transformer - just a lot more powerful. The task would give me the chance to 1) get acquainted with the Groovy language, 2) solve some of the more advanced uses of DataCleaner by giving a completely open-ended scripting option and 3) blog about it. The third point is important to me, because we right now have a Community Contributor Contest, and I'd like to invite extension developers to participate.

Second step: Build a quick prototype. This usually starts by identifying which type of component(s) you want to create. In my case it was a transformer, but in some cases it might be an analyzer. The choice between these are essentially: Does your extension pre-process or transform the data in a way that it should become a part of a flow of operations? Then it's a Transformer. Or is it something that will consume the records (potentially after being pre-processed) and generate some kind of analysis result or write the records somewhere? Then it's a Analyzer.

The API for DataCleaner was designed to be very easy to use. The ideom has been: 1) The obligatory functionality is provided in the interface that you implement. 2) The user-configured parts are injected using the @Configured annotation. 3) The optional parts can be injected if you need them. In other words, this is very much inspired by the idea of Convention-over-Configuration.

So, I wanted to build a Transformer. This was my first prototype, which I could hitch together quite quickly after reading the Embedding Groovy documentation and just implementing the Transformer interface revealed what I needed to provide for DataCleaner to operate:
@TransformerBean("Groovy transformer (simple)")
public class GroovySimpleTransformer implements Transformer {

    @Configured
    InputColumn[] inputs;

    @Configured
    String code;

    private GroovyObject _groovyObject;

    public OutputColumns getOutputColumns() {
        return new OutputColumns("Groovy output");
    }

    public String[] transform(InputRow inputRow) {
        if (_groovyObject == null) {
            _groovyObject = compileCode();
        }
        final Map map = new LinkedHashMap();
        for (InputColumn input : inputs) {
            map.put(input.getName(), inputRow.getValue(input));
        }
        final Object[] args = new Object[] { map };
        String result = (String) _groovyObject.invokeMethod("transform", args);

        logger.debug("Transformation result: {}", result);
        return new String[] { result };
    }
    
    private GroovyObject compileCode() {
        // omitted 
    }
Third step: Start testing. I believe a lot in unittesting your code, also at a very early stage. So the next thing I did was to implement a simple unittest. Notice that I take use of the MockInputColumn and MockInputRow classes from DataCleaner - these make it possible for me to test the Transformer as a unit and not have to do integration testing (in that case I would have to start an actual batch job, which takes a lot more effort from both me and the machine):
public class GroovySimpleTransformerTest extends TestCase {

    public void testScenario() throws Exception {
        GroovySimpleTransformer transformer = new GroovySimpleTransformer();

        InputColumn col1 = new MockInputColumn("foo");
        InputColumn col2 = new MockInputColumn("bar");

        transformer.inputs = new InputColumn[] { col1, col2 };
        transformer.code = 
          "class Transformer {\n" +
          "  String transform(map){println(map); return \"hello \" + map.get(\"foo\")}\n" +
          "}";

        String[] result = transformer.transform(new MockInputRow().put(col1, "Kasper").put(col2, "S"));
        assertEquals(1, result.length);
        assertEquals("hello Kasper", result[0]);
    }
}
Great - this verifies that our Transformer is actually working.

Fourth step: Do the polishing that makes it look and feel like a usable component. It's time to build the extension and see how it works in DataCleaner. When the extension is bundled in a JAR file, you can simply click Window -> Options, select the Extensions tab and click Add extension package -> Manually install JAR file:


After registering your extension you will be able to find it in DataCleaner's Transformation menu (or if you built an Analyzer, in the Analyze menu).

In my case I discovered several sub-optimal features of my extensions. Here's a list of them, and how I solved it:
What?How?
My transformer had only a default iconIcons can be defined by providing PNG icon (32x32 pixels) with the same name as the transformer class, in the JAR file. In my case the transformer class was GroovySimpleTransformer.java, so I made an icon available at GroovySimpleTransformer.png.
The 'Code' text field was a single line field and did not look like a code editing field.Since the API is designed for Convention-over-Configuration, putting a plain String property as the groovy code was maybe a bit naive. There are two strategies to pursue if you have properties which need special rendering on the UI: Provide more metadata about the property (Quite easy), or build your own renderer for it (most flexible, but also more complex). In this case I was able to simply provide more metadata, using the @StringProperty annotation:
@Configured
@StringProperty(multiline = true, mimeType = "text/groovy")
String code
The default DataCleaner string property widget will then provide a multi-line text field with syntax coloring for the specific mime-type:
The compilation of the Groovy class was done when the first record hits the transformer, but ideally we would want to do it before the batch even begins.This point is actually quite important, also to avoid race-conditions in concurrent code and other nasty scenarios. Additionally it will help DataCleaner validation the configuration before actually kicking off a the batch job.

The trick is to add a method with the @Initialize annotation. If you have multiple items you need to initialize, you can even add more. In our case, it was quite simple:
@Initialize
public void init() {
    _groovyObject = compileCode();
}
The transformer was placed in the root of the Transformation menu.This was fixed by applying the following annotation on the class, moving it into the Scripting category:
@Categorized(ScriptingCategory.class)
The transformer had no description text while hovering over it.The description was added in a similar fashion, with a class-level annotation:
@Description("Perform a data transformation with the use of the Groovy language.")
After execution it would be good to clean up resources used by Groovy.Similarly to the @Initialize annotation, I can also create one or more descruction methods, annotated with @Close. In the case of the Groovy transformer, there are some classloader-related items that can be cleared after execution this way.
In a more advanced edition of the same transformer, I wanted to support multiple output records.DataCleaner does support transformers that yield multiple (or even zero) output records. To archieve this, you can inject an OutputRowCollector instance into the Transformer:
public class MyTransformer implements Transformer<...> {
  @Inject
  @Provided
  OutputRowCollector collector;

  public void transform(InputRow row) {
    // output two records, each with two new values
    collector.putValues("foo", "bar");
    collector.putValues("hello", "world");
  }
}
Side-note - Users of Hadoop might recognize the OutputRowCollector as similar to mappers in Map-Reduce. Transformers, like mappers, are in deed quite capable of executing in parallel.


Fifth step: When you're satisfied with the extension, Publish it on the ExtensionSwap. Simply click the "Register extension" button and follow the instructions on the form. Your extension will now be available to everyone and make others in the community happy!

I hope you found this blog useful as a way to get into DataCleaner extension development. I would be interested in any kind of comment regarding the extension mechanism in DataCleaner - please speak up and let me know what you think!

20121106

Data quality of Big Data - challenge or potential?

If you've been dealing with data in the last few years, you've inevitably also come across the concept of Big Data and NoSQL databases. What do these terms mean for data quality efforts? I believe they can have a great impact. But I also think they are difficult concepts to juggle because the understanding of them varies a lot.

For the purpose of this writing, I will simply say that in my opinion Big Data and NoSQL exposes us as data workers to two new general challenges:
  • Much more data than we're used to working with, leading to more computing time and new techniques for even handling those amounts of data.
  • More complex data structures that are not nescesarily based on a relational way of thinking.
So who will be exposed to these challenges? As a developer of tools I am definately be exposed to both challenges. But will the end-user be challenged by both of them? I hope not. But people need good tools in order to do clever things with data quality.

The more data challenge is primarily one of storage and performance. And since storage isn't that expensive, I mostly concern it to be a performance challenge. And for the sake of argument, let's just assume that the tool vendors should be the ones mostly concerned about performance - if the tools and solutions are well designed by the vendors, then at least most of the performance issues should be tackled here.

But the complex data structures challenge is a different one. It has surfaced for a lot of reasons, including:
  • Favoring performance by eg. elminating the need to join database tables.
  • Favoring ease of use by allowing logical grouping of related data even though granularity might differ (eg. storing orderlines together with the order itself).
  • Favoring flexibility by not having a strict schema to perform validation upon inserting.
Typically the structure of such records isn't tabular. Instead it is often based on key/value maps, like this:
{
  "orderNumber": 123,
  "priceSum": 500,
  "orderLines": [
    { "productId": "abc", "productName": "Foo", "price": 200 },
    { "productId": "def", "productName": "Bar", "price": 300 },
  ],
  "customer": { "id": "xyz", "name": "John Doe", "country": "USA" }
}
Looking at this you will notice some data structure features which are alien to the relational world:
  • The orderlines and customer information is contained within the same record as the order itself. In other words: We have data types like arrays/lists (the "orderLines" field) and key/value maps (each orderline element, and the "customer" field).
  • Each orderline have a "productId" field, which probably refers to a detailed product record. But it also contains a "productName" field since this information is probably often the most wanted information when presenting/reading the record. In other words: Redundancy is added for ease of use and for performance.
  • The "priceSum" field is also redundant, since we could easily deduct it by visiting all orderlines and summing them. But redundancy is also added for performance in this case.
  • Some of the same principles apply to the "customer" field: It is a key/value type, and it contains both an ID as well as some redundant information about the customer.
Support for such data structures is traditionally difficult for database tools. As tool producers, we're used to dealing with relational data, joins and these things. But in the world of Big Data and NoSQL we're up against challenges of complex data structures which includes traversal and selection within the same record.

In DataCleaner, these issues are primarily handled using the "Data structures" transformations, which I feel I need to give a little attention:
Using these transformations you can both read and write the complex data structures that we see in Big Data projects. To explain the 8 "Data structures" transformations, I will try to define a few terms used in their naming:
  • To "Read" a data structure such as a list or a key/value map means to view each element in the structure as a separate record. This enables you to profile eg. all the orderlines in the order-record example.
  • To "Select" from a data structure such as a list or a key/value map means to retreive specific elements while retaining the current record granularity. This enables you to profile eg. the customer names of your orders.
  • Similarly you can "Build" these data structures. The record granularity of the built data structures is the same as the current granularity in your data processing flow.
  • And lastly I should mention JSON because it is in the screenshot. JSON is probably the most common format of literal representing data structures like these. And in deed the example above is written in JSON format. Obviously we support reading and writing JSON objects directly in DataCleaner.
Using these transformations we can overcome the challenge of handling the new data structures in Big Data and NoSQL.

Another important aspect that comes up whenever there's a whole new generation of databases is that the way you interface them is different. In DataCleaner (or rather, in DataCleaner's little-sister project MetaModel) we've done a lot of work to make sure that the abstractions already known can be reused. This means that querying and inserting records into a NoSQL database works exactly the same as with a relational database. The only difference of course is that new data types may be available, but that builds nicely on top of the existing relational concepts of schemas, tables, columns etc. And at the same time we do all we can to optimize performance so that querying techniques is not up to the end-user but will be applied according to the situation. Maybe I'll try and explain how that works under the covers some other time!