20121207

How to build a Groovy DataCleaner extension

In this blog entry I'll go through the process of developing a DataCleaner extension: The Groovy DataCleaner extension (just published today). The source code for the extension is available on GitHub if you wish to check it out or even fork and improve it!

First step: You have an idea for your extension. My idea was to get the Groovy language integrated with DataCleaner, to offer an advanced scripting option, similar to the existing JavaScript transformer - just a lot more powerful. The task would give me the chance to 1) get acquainted with the Groovy language, 2) solve some of the more advanced uses of DataCleaner by offering a completely open-ended scripting option and 3) blog about it. The third point is important to me, because right now we have a Community Contributor Contest, and I'd like to invite extension developers to participate.

Second step: Build a quick prototype. This usually starts by identifying which type of component(s) you want to create. In my case it was a transformer, but in some cases it might be an analyzer. The choice between these is essentially: Does your extension pre-process or transform the data in a way that should become part of a flow of operations? Then it's a Transformer. Or does it consume the records (potentially after they have been pre-processed) and generate some kind of analysis result or write the records somewhere? Then it's an Analyzer.

The API for DataCleaner was designed to be very easy to use. The idiom has been: 1) The obligatory functionality is provided by the interface that you implement. 2) The user-configured parts are injected using the @Configured annotation. 3) The optional parts can be injected if you need them. In other words, this is very much inspired by the idea of Convention-over-Configuration.

So, I wanted to build a Transformer. This was my first prototype, which I could put together quite quickly after reading the Embedding Groovy documentation. Simply implementing the Transformer interface revealed what I needed to provide for DataCleaner to operate:
@TransformerBean("Groovy transformer (simple)")
public class GroovySimpleTransformer implements Transformer {

    @Configured
    InputColumn[] inputs;

    @Configured
    String code;

    private static final Logger logger = LoggerFactory.getLogger(GroovySimpleTransformer.class);

    private GroovyObject _groovyObject;

    public OutputColumns getOutputColumns() {
        return new OutputColumns("Groovy output");
    }

    public String[] transform(InputRow inputRow) {
        if (_groovyObject == null) {
            _groovyObject = compileCode();
        }
        final Map map = new LinkedHashMap();
        for (InputColumn input : inputs) {
            map.put(input.getName(), inputRow.getValue(input));
        }
        final Object[] args = new Object[] { map };
        String result = (String) _groovyObject.invokeMethod("transform", args);

        logger.debug("Transformation result: {}", result);
        return new String[] { result };
    }
    
    private GroovyObject compileCode() {
        // omitted
    }
}
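The compileCode() body is omitted above (the full version is in the GitHub repository), but to give an idea of what embedding Groovy involves, a minimal sketch could look roughly like the following - note that this is my own sketch of the Groovy embedding API, not necessarily the extension's exact implementation:

private GroovyObject compileCode() {
    // parse the user-supplied Groovy source into a Class and instantiate it
    GroovyClassLoader classLoader = new GroovyClassLoader(getClass().getClassLoader());
    try {
        Class<?> groovyClass = classLoader.parseClass(code);
        return (GroovyObject) groovyClass.newInstance();
    } catch (Exception e) {
        throw new IllegalStateException("Could not compile or instantiate Groovy class", e);
    }
}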
Third step: Start testing. I'm a firm believer in unit testing your code, even at a very early stage. So the next thing I did was to implement a simple unit test. Notice that I make use of the MockInputColumn and MockInputRow classes from DataCleaner - these make it possible for me to test the Transformer as a unit and not have to do integration testing (in that case I would have to start an actual batch job, which takes a lot more effort from both me and the machine):
public class GroovySimpleTransformerTest extends TestCase {

    public void testScenario() throws Exception {
        GroovySimpleTransformer transformer = new GroovySimpleTransformer();

        InputColumn col1 = new MockInputColumn("foo");
        InputColumn col2 = new MockInputColumn("bar");

        transformer.inputs = new InputColumn[] { col1, col2 };
        transformer.code = 
          "class Transformer {\n" +
          "  String transform(map){println(map); return \"hello \" + map.get(\"foo\")}\n" +
          "}";

        String[] result = transformer.transform(new MockInputRow().put(col1, "Kasper").put(col2, "S"));
        assertEquals(1, result.length);
        assertEquals("hello Kasper", result[0]);
    }
}
Great - this verifies that our Transformer is actually working.

Fourth step: Do the polishing that makes it look and feel like a usable component. It's time to build the extension and see how it works in DataCleaner. When the extension is bundled in a JAR file, you can simply click Window -> Options, select the Extensions tab and click Add extension package -> Manually install JAR file:


After registering your extension you will be able to find it in DataCleaner's Transformation menu (or if you built an Analyzer, in the Analyze menu).

In my case I discovered several sub-optimal aspects of my extension. Here's a list of them, and how I solved each one:
Problem: My transformer had only a default icon.
Solution: Icons can be defined by providing a PNG icon (32x32 pixels) with the same name as the transformer class, in the JAR file. In my case the transformer class was GroovySimpleTransformer.java, so I made an icon available at GroovySimpleTransformer.png.
Problem: The 'Code' text field was a single-line field and did not look like a code editing field.
Solution: Since the API is designed for Convention-over-Configuration, exposing the Groovy code as a plain String property was maybe a bit naive. There are two strategies to pursue if you have properties that need special rendering in the UI: provide more metadata about the property (quite easy), or build your own renderer for it (most flexible, but also more complex). In this case I was able to simply provide more metadata, using the @StringProperty annotation:
@Configured
@StringProperty(multiline = true, mimeType = "text/groovy")
String code;
The default DataCleaner string property widget will then provide a multi-line text field with syntax coloring for the specific mime-type:
Problem: The compilation of the Groovy class was done when the first record hit the transformer, but ideally we would want to do it before the batch even begins.
Solution: This point is actually quite important, also to avoid race conditions in concurrent code and other nasty scenarios. Additionally, it helps DataCleaner validate the configuration before actually kicking off the batch job.

The trick is to add a method with the @Initialize annotation. If you have multiple items you need to initialize, you can even add more. In our case, it was quite simple:
@Initialize
public void init() {
    _groovyObject = compileCode();
}
Problem: The transformer was placed in the root of the Transformation menu.
Solution: This was fixed by applying the following annotation on the class, moving it into the Scripting category:
@Categorized(ScriptingCategory.class)
Problem: The transformer had no description text when hovering over it.
Solution: The description was added in a similar fashion, with a class-level annotation:
@Description("Perform a data transformation with the use of the Groovy language.")
Problem: After execution it would be good to clean up resources used by Groovy.
Solution: Similar to the @Initialize annotation, I can also create one or more destruction methods, annotated with @Close. In the case of the Groovy transformer, there are some classloader-related items that can be cleared after execution this way.
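Just to illustrate the shape of such a method, here is a sketch (the _groovyClassLoader field is hypothetical - the real extension's cleanup may differ):

@Close
public void close() {
    // assuming the transformer kept a reference to the GroovyClassLoader used for
    // compilation, its caches can be cleared here once the job is done
    if (_groovyClassLoader != null) {
        _groovyClassLoader.clearCache();
        _groovyClassLoader = null;
    }
    _groovyObject = null;
}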
Problem: In a more advanced edition of the same transformer, I wanted to support multiple output records.
Solution: DataCleaner does support transformers that yield multiple (or even zero) output records. To achieve this, you can inject an OutputRowCollector instance into the Transformer:
public class MyTransformer implements Transformer<...> {
  @Inject
  @Provided
  OutputRowCollector collector;

  public void transform(InputRow row) {
    // output two records, each with two new values
    collector.putValues("foo", "bar");
    collector.putValues("hello", "world");
  }
}
Side-note - Users of Hadoop might recognize the OutputRowCollector as similar to mappers in MapReduce. Transformers, like mappers, are indeed quite capable of executing in parallel.


Fifth step: When you're satisfied with the extension, publish it on the ExtensionSwap. Simply click the "Register extension" button and follow the instructions on the form. Your extension will now be available to everyone and make others in the community happy!

I hope you found this blog useful as a way to get into DataCleaner extension development. I would be interested in any kind of comment regarding the extension mechanism in DataCleaner - please speak up and let me know what you think!

20121106

Data quality of Big Data - challenge or potential?

If you've been dealing with data in the last few years, you've inevitably also come across the concept of Big Data and NoSQL databases. What do these terms mean for data quality efforts? I believe they can have a great impact. But I also think they are difficult concepts to juggle because the understanding of them varies a lot.

For the purpose of this writing, I will simply say that in my opinion Big Data and NoSQL expose us, as data workers, to two new general challenges:
  • Much more data than we're used to working with, leading to more computing time and a need for new techniques just to handle those amounts of data.
  • More complex data structures that are not necessarily based on a relational way of thinking.
So who will be exposed to these challenges? As a developer of tools, I am definitely exposed to both. But will the end user be challenged by both of them? I hope not. But people need good tools in order to do clever things with data quality.

The 'more data' challenge is primarily one of storage and performance. And since storage isn't that expensive, I mostly consider it a performance challenge. And for the sake of argument, let's just assume that the tool vendors should be the ones mostly concerned about performance - if the tools and solutions are well designed by the vendors, then at least most of the performance issues should be tackled there.

But the complex data structures challenge is a different one. It has surfaced for a lot of reasons, including:
  • Favoring performance by eg. eliminating the need to join database tables.
  • Favoring ease of use by allowing logical grouping of related data even though granularity might differ (eg. storing orderlines together with the order itself).
  • Favoring flexibility by not having a strict schema to perform validation upon inserting.
Typically the structure of such records isn't tabular. Instead it is often based on key/value maps, like this:
{
  "orderNumber": 123,
  "priceSum": 500,
  "orderLines": [
    { "productId": "abc", "productName": "Foo", "price": 200 },
    { "productId": "def", "productName": "Bar", "price": 300 },
  ],
  "customer": { "id": "xyz", "name": "John Doe", "country": "USA" }
}
Looking at this you will notice some data structure features which are alien to the relational world:
  • The orderlines and customer information are contained within the same record as the order itself. In other words: we have data types like arrays/lists (the "orderLines" field) and key/value maps (each orderline element, and the "customer" field).
  • Each orderline has a "productId" field, which probably refers to a detailed product record. But it also contains a "productName" field, since this is probably the most wanted information when presenting/reading the record. In other words: redundancy is added for ease of use and for performance.
  • The "priceSum" field is also redundant, since we could easily deduct it by visiting all orderlines and summing them. But redundancy is also added for performance in this case.
  • Some of the same principles apply to the "customer" field: It is a key/value type, and it contains both an ID as well as some redundant information about the customer.
Support for such data structures is traditionally difficult for database tools. As tool producers, we're used to dealing with relational data, joins and the like. But in the world of Big Data and NoSQL we're up against the challenge of complex data structures, which involves traversal and selection within the same record.

In DataCleaner, these issues are primarily handled using the "Data structures" transformations, which I feel deserve a little attention:
Using these transformations you can both read and write the complex data structures that we see in Big Data projects. To explain the 8 "Data structures" transformations, I will try to define a few terms used in their naming:
  • To "Read" a data structure such as a list or a key/value map means to view each element in the structure as a separate record. This enables you to profile eg. all the orderlines in the order-record example.
  • To "Select" from a data structure such as a list or a key/value map means to retreive specific elements while retaining the current record granularity. This enables you to profile eg. the customer names of your orders.
  • Similarly you can "Build" these data structures. The record granularity of the built data structures is the same as the current granularity in your data processing flow.
  • And lastly I should mention JSON because it is in the screenshot. JSON is probably the most common literal format for representing data structures like these, and indeed the example above is written in JSON. Obviously we support reading and writing JSON objects directly in DataCleaner.
Using these transformations we can overcome the challenge of handling the new data structures in Big Data and NoSQL.

Another important aspect that comes up whenever there's a whole new generation of databases is that the way you interface with them is different. In DataCleaner (or rather, in DataCleaner's little-sister project MetaModel) we've done a lot of work to make sure that the abstractions already known can be reused. This means that querying and inserting records into a NoSQL database works exactly the same as with a relational database. The only difference, of course, is that new data types may be available, but that builds nicely on top of the existing relational concepts of schemas, tables, columns etc. And at the same time we do all we can to optimize performance, so that choosing a querying technique is not up to the end user but is applied according to the situation. Maybe I'll try to explain how that works under the covers some other time!
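To make that point a bit more concrete, here is a small sketch using MetaModel (the file, table and column names are made up for the example): the querying code stays exactly the same regardless of whether the DataContext was created from a relational database, a file or a NoSQL store - only the factory call differs.

// dc could just as well have been created from a MongoDB, CouchDB or JDBC factory method
DataContext dc = DataContextFactory.createExcelDataContext(new File("orders.xlsx"));

// traversal and querying work the same across all supported datastore technologies
Table orders = dc.getDefaultSchema().getTableByName("orders");
DataSet dataSet = dc.query().from(orders).select(orders.getColumns()).execute();
while (dataSet.next()) {
    System.out.println(dataSet.getRow());
}
dataSet.close();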

20121004

Video: Introducing Data Quality monitoring

Every time we publish some work that I've been involved in, I breathe a big sigh of relief. This time it's not software but a video - and I think the end result is quite good, although we are not movie-making professionals at Human Inference :)

Here's the stuff: Data Quality monitoring with DataCleaner 3

Enjoy! And do let us know what you think on our discussion forum or our LinkedIn group.

20120920

DataCleaner 3 released!

Dear friends, users, customers, developers, analysts, partners and more!

After an intense period of development and a long wait, it is our pleasure to finally announce that DataCleaner 3 is available. We at Human Inference invite you all to our celebration! Impatient to try it out? Go download it right now!

So what is all the fuss about? Well, in all modesty, we think that with DataCleaner 3 we are redefining 'the premier open source data quality solution'. With DataCleaner 3 we've embraced a whole new functional area of data quality, namely data monitoring.

Traditionally, DataCleaner has its roots in data profiling. Over the years, we've added several related functions: transformations, data cleansing, duplicate detection and more. With data monitoring we basically deliver all of the above, but in a continuous environment for analyzing, improving and reporting on your data. Furthermore, we will deliver these functions in a centralized web-based system.

So how will the users benefit from this new data monitoring environment? We've tried to answer this question using a series of images:

Monitor the evolution of your data:

Share your data quality analysis with everyone:

Continuously monitor and improve your data's quality:

Connect DataCleaner to your infrastructure using web services:


The monitoring web application is a fully fledged environment for data quality, covering several functional and non-functional areas:
  • Display of timeline and trends of data quality metrics
  • Centralized repository for managing and containing jobs, results, timelines etc.
  • Scheduling and auditing of DataCleaner jobs
  • Providing web services for invoking DataCleaner transformations
  • Security and multi-tenancy
  • Alerts and notifications when data quality metrics are out of their expected comfort zones.

Naturally, the traditional desktop application of DataCleaner continues to be the tool of choice for expert users and one-time data quality efforts. We've even enhanced the desktop experience quite substantially:
  • There is a new Completeness analyzer which is very useful for simply identifying records that have incomplete fields.
  • You can now export DataCleaner results to nice-looking HTML reports that you can give to your manager, or send to your XML parser!
  • The new monitoring environment is also closely integrated with the desktop application. Thus, the desktop application now has the ability to publish jobs and results to the monitor repository, and to be used as an interactive editor for content already in the repository.
  • New date-oriented transformations are now available: Date range filter, which allows you to subset datasets based on date ranges, and Format date, which allows you to format a date using a date mask.
  • The Regex Parser (which was previously only available through the ExtensionSwap) has now been included in DataCleaner. This makes it very convenient to parse and standardize rich text fields using regular expressions.
  • There's a new Text case transformer available. With this transformation you can easily convert between upper/lower case and proper capitalization of sentences and words.
  • Two new search/replace transformations have been added: Plain search/replace and Regex search/replace.
  • The user experience of the desktop application has been improved. We've added several in-application help messages, made the colors look brighter and clearer and improved the font handling.

More than 50 features and enhancements were implemented in this release, in addition to incorporating several hundred upstream improvements from the projects we depend on.

We hope you will enjoy everything that is new about DataCleaner 3. And do watch out for follow-up material in the coming weeks and months. We will be posting more and more online material and examples to demonstrate the wonderful new features that we are very proud of.

20120819

Gartner's Magic Quadrant for Data Quality Tools


Over the last two years I've been obsessed with Gartner and their analyses of the Data Quality market. It's not because they're always right in every aspect, but a lot of the time they are, and in addition they are very prominent opinion makers. And in all honesty, a huge part of one's personal reason to do open source development is to get recognition from your peers.

While leading the DataCleaner development and other projects at Human Inference, we definitely pay a lot of attention to what Gartner is saying about us. Also before I joined Human Inference, Gartner was important to me. Their mention of DataCleaner in their Who's who in Open Source Data Quality report from 2009 was the initial trigger for contact with a lot of people, including Winfried van Holland, CEO of Human Inference. So on a personal level I have a lot to thank Gartner for.

Therefore it is with great pride that I see that Human Inference has been promoted to the visionary quadrant of Gartner's Magic Quadrant for Data Quality Tools annual report, which just came out (get it from Gartner here). That's exactly where I think we deserve to be.

At the same time I see they are mentioning DataCleaner specifically as one of the strong points for Human Inference. This is because of the licensing model it allows us to offer, and the easy-to-use interface it provides to data quality professionals. Additionally, our integration with Pentaho Data Integration (read more) and the application of our cloud data quality platform (read more) are mentioned as strong points.

This is quite a recognition since our last review by Gartner. In their 2012 update of the Who's who in Open Source Data Quality (get it from Gartner here), Gartner criticized the DataCleaner project in a generally negative tone and on a range of grounds I consider false. In particular, my feeling is that certain misunderstandings about the integration of DataCleaner with the rest of Human Inference's offerings caused our Gartner rating to be undervalued at that time. Hopefully the next update to that report will reflect their recent, more enlightened view. I should mention that the two reports are rather independent of each other, so these are just speculations from my side.

As such, the Gartner reports have proven to be a wonderful source of information about competing products and market demand. Our current plans for DataCleaner 3 are indeed also influenced by Gartner's (and our own) description of data monitoring as a key functional area in the space of data quality tools.

I am deeply grateful for Gartner's treatment of DataCleaner over time. “You've got to take the good with the bad” and for me that's been a great way of continuously improving our products. A good reception in the end makes all the trouble worthwhile.

20120711

Query and update Databases, CSV files, Excel spreadsheets, MongoDB, CouchDB and more in Java

Introduction to MetaModel 3

The other day we released version 3 of MetaModel, a project that I have thoroughly enjoyed working on lately. Let me share a bit of my enthusiasm and try to convince you that this is the greatest data access library there is for Java.

First let me also say, just to make it clear: MetaModel is NOT an ORM (Object-Relational Mapping) framework. MetaModel does not do any mapping to your domain object model. On the contrary, MetaModel is oriented towards working with the data model that already exists in your datastores (databases, files etc.) as it is physically represented. So the model of MetaModel is indeed a meta model, just as it is a metadata model - it works with the concepts of tables, columns, rows, schemas, relationships etc. But it is also oriented towards abstracting away all the cruft of having to deal with the physical interactions with each individual data storage technology. So unlike most ORMs, MetaModel allows you to work with arbitrary data models, stored in arbitrary technologies such as relational (JDBC) databases, text file formats, Excel spreadsheets, NoSQL databases (currently CouchDB and MongoDB) and more.

Here's an overview of the scope of MetaModel, depicted in our "module diagram".

It's important to be able to access, query and update all of these different datastore technologies in the same way. This is basically the cross-platform story all over again. But with MetaModel the cross-platform frontier is moved to also encompass datastore technologies, whereas it was previously mostly about freedom of operating system.
In addition it is important that this common data access abstraction is elegant, scalable and flexible. I hope to convince you of this with a few examples.

Code examples

Everything you do with MetaModel always starts with a DataContext object. A DataContext represents the basis for any operation with your datastore. Typically one would use the DataContextFactory to get an instance for the datastore of interest. We'll assume you are working on an Excel spreadsheet "people.xlsx":
DataContext dc = DataContextFactory.createExcelDataContext(new File("people.xlsx"));
Easy. Now let's explore the structure of this spreadsheet. We can do so either generically by traversing the graph of schemas, tables and columns - or by names if we already know what we are looking for:
// getting column by path
Column customerNameColumn = dc.getColumnByQualifiedLabel("customers.name"); 
// traversing all schemas, tables, columns
Schema schema = dc.getDefaultSchema();
Table[] tables = schema.getTables();
Column[] columns = tables[0].getColumns(); 
// step-wise getting specific elements based on names
Table customersTable = schema.getTableByName("customers");
Column customerBalanceColumn = customersTable.getColumnByName("balance");

Queries

Now let's fire off some queries. This is where it gets interesting! Our approach builds upon basic knowledge of SQL, but without all the dialects and runtime differences. Technically we express queries in a completely type-safe manner by using the traversed metadata objects above. But we can also put in String literals and more when it is convenient to get the job done.
// Simple query: Get all customer fields for customers with a credit balance greater than 10000.
DataSet ds = dc.query().from(customersTable).select(customersTable.getColumns())
    .where(customerBalanceColumn).greaterThan(10000).execute();
// Slightly more advanced query: Join customers with their associated sales representatives
// and group the result to count which sales reps have the most customers
Column salesRepId = customersTable.getColumnByName("sales_rep_id");
Column employeeId = dc.getColumnByQualifiedLabel("employees.id");
DataSet ds = dc.query()
    .from(customersTable).innerJoin("employees").on(salesRepId, employeeId)
    .selectCount().and("employees.name")
    .groupBy("employees.name").execute();

You can even grab the Query as an object and pass it on to methods and compositions which will eg. modify it for optimization or other purposes.
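For example, something along these lines (a sketch - I am assuming the toQuery() and executeQuery(...) methods of MetaModel's query builder and DataContext here):

// build the query, but grab it as an object instead of executing it right away
Query query = dc.query().from(customersTable)
    .select(customersTable.getColumns())
    .where(customerBalanceColumn).greaterThan(10000).toQuery();

// another method can now modify it, e.g. cap the number of returned rows
query.setMaxRows(100);

// ... and finally execute it
DataSet ds = dc.executeQuery(query);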

Updates and changes

The last missing piece of the puzzle is making changes to your data. We've spent a lot of time creating an API that is best suited for this task across all types of datastores. Since there's a big difference in how different datastores treat updates, and MetaModel tries to unify that, we wanted an API which clearly demarcates when you're doing an update, so that there is no doubt about transactional bounds, the scope of a batch update and so on. This is why you need to provide all your updates in a closure-style object called an UpdateScript.
Let's do a series of updates.
// Batch #1: Create a table and insert a few records
dc.executeUpdate(new UpdateScript() {
    public void run(UpdateCallback cb) {
        Table muppets = cb.createTable(schema, "muppets")
                .withColumn("name").ofType(VARCHAR)
                .withColumn("profession").ofType(VARCHAR).execute();
        cb.insertInto(muppets).value("name","Kermit the frog")
                .value("profession","TV host").execute();         cb.insertInto(muppets).value("name","Miss Piggy")                 .value("profession","Diva").execute();     } });
// Batch #2: Update and delete a record
dc.executeUpdate(new UpdateScript() {
    public void run(UpdateCallback cb) {
        cb.update("muppets").value("profession","Theatre host")
               .where("name").equals("Kermit the frog").execute();
        cb.deleteFrom("muppets")
               .where("name").like("%Piggy").execute();
    }
});
// Batch #3: Drop the table
dc.executeUpdate(new UpdateScript() {
    public void run(UpdateCallback cb) {
       cb.dropTable("muppets").execute();
    }
});
As you can see, using the UpdateScripts we've encapsulated each batch operation. If the datastore supports transactions, this is also the point of transactional control. If not, MetaModel will provide appropriate synchronization to avoid race conditions, so you can safely perform concurrent updates even on Excel spreadsheets and CSV files.

Wrapping up...

I am extremely happy working both as a developer of and a consumer of MetaModel. I hope you felt this blog/tutorial was a good kick-start and that you got excited by the library. Please give MetaModel a spin and share your thoughts and impressions.

20120614

Seeing sporadic OptionalDataExceptions in your Java code?

This is basically to uphold a valuable mental note I made today:


I was seeing very sporadic OptionalDataExceptions while doing deserialization of certain objects. I had wondered for a looong time why that was, and today I took the time (many hours) to sit down and figure it out.

After spending a lot of time creating custom ObjectInputStreams, debugging readObject() methods and more, I finally tracked it down to an elementary (but extremely hard to identify) issue: some of my collections (HashMaps, HashSets and more) were exposed to multiple threads, and since these basic collection types are neither concurrent nor synchronized, that was the problem.

In essence: if you're seeing this issue, double-check all your HashSets, HashMaps, LinkedHashMaps, LinkedHashSets and IdentityHashMaps! If they're exposed to multithreading, then make them concurrent (or synchronized).
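For the record, the fix itself is trivial once the cause is found - swap in a concurrent implementation or a synchronized wrapper from the standard library:

// instead of a plain HashMap shared between threads ...
Map<String, Object> unsafe = new HashMap<String, Object>();

// ... use a concurrent implementation ...
Map<String, Object> concurrent = new ConcurrentHashMap<String, Object>();

// ... or wrap the existing collection in a synchronized view
Map<String, Object> syncView = Collections.synchronizedMap(new LinkedHashMap<String, Object>());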

20120604

Revealing the "DQ monitor" - DataCleaner 3.0

Development of DataCleaner has been a bit quiet lately (where 'lately' refers to the last month or so ;-)) but now I want to share an idea that we have been working on at Human Inference: data quality monitoring.

But first some motivation: I think by now DataCleaner has a pretty firm reputation as one of the leading open source data quality applications. We can, and we will, continue to improve its core functionality, but it is also time to take a look at the next step for it - the next big version, 3.0. For this we are picking up a major piece of functionality that is crucial to any DQ and MDM project - data quality monitoring. We all know the old saying that you cannot manage what you cannot measure. This is a statement that has a lot of general truth to it, also when it comes to data quality. Data profiling tools traditionally focus on a one-time measurement of various metrics. But to truly manage your data quality, you need to monitor it over time and be able to act on not only the current status, but also the progress you're making. Doing this is something new, which none of our open source competitors provide - a monitoring application for tracking data quality levels over time! We also want to make it easy to share this intelligence with everyone in your organization - so it has to be web based.

Based on this we're setting forth the roadmap for DataCleaner 3.0. And we're already able to show very good results. In the following I will disclose some of what is there. Do not take it for anything more than it is: A snapshot from our current work-in-progress. But we do plan to deliver a final version within a few months.

The application's main purpose is to present the history of data quality measurements. Here's a screenshot of our current prototype's timeline view:



What is shown there is the metrics collected by a DataCleaner job, recorded and displayed in a timeline.

It's really easy to customize the timeline, or create new timeline views. All you need to do is select the metrics you're interested in. Notice in the screenshot below that there are different types of metrics, and for those that are queryable you have a nice autocompletion/suggestion interface:

If you click a point in the timeline, you'll get the option to drill to the point-in-time profiling result:


If you drill, you will be given a full data profiling result (like you know them from the current version of DataCleaner, but in the browser). Our prototype is still a bit simplified on this point, but most analysis result features actually render nicely:


So how is it going to work in a larger context? Let me disclose a few of the ideas:
  • Queryable metrics: Consider if you have a gender column and you always expect it to have the values "MALE" or "FEMALE". But you also know that there is quite some dirt in there, and you wish to measure the progress of eliminating the dirt. How would you define the metric then? What you will need is a metric definition that you pass a parameter/query to, saying that you wish to monitor the number of occurring values that are NOT "MALE" or "FEMALE". Similar cases exist for many other use cases like the Pattern finder, Dictionary matchers etc. In the DataCleaner monitor, the user will be able to define such queries using IN [...] and NOT IN [...] clauses, like very simple SQL.
  • Scheduling: The monitoring application will include a scheduler to let you automatically run your periodic data quality assessments. The scheduler will allow you to set up both periodic and trigger-based scheduling events. For instance, it might be that you have a workflow where you wish to trigger data profiling and monitoring in a wider context, such as ETL jobs, business processes etc.
  • Desktop integration: You will also be able to run the jobs in the regular DataCleaner (desktop) application and then upload/synchronize your results with the monitoring application. This will make it easy to share your findings when you are working interactively with the desktop application.
  • Email alerting: You will be able to set up expected ranges of particular metrics, and in the case that values are recorded outside the allowed ranges, the data steward will be alerted by email.
  • Repository: All jobs, results, configuration data etc. is stored in a central repository. The repository allows you to centrally manage connection information, job definitions etc. that your team is using. The idea of a repository also opens up the door to concepts like versioning, personal workspaces, security, multi-tenancy and more. I think it will become a great organizing factor and a "hub" for the DataCleaner users.
  • Daily snapshots: It is easy to define a profiling job that profiles a complete table. But when dealing with periodic data profiling, it is likely that you only wish to analyze the latest increment. Therefore DataCleaner's handling of date filters has been improved a lot. This is to ensure that you can easily request a profiling job of "yesterday and today's data" and thereby see eg. only profiling results based on the data that was entered/changed in your system recently.
Sounds exciting? We certainly think so. And this idea has actually been talked about for quite some time (I found some mentions of a "DC web monitor" application as old as 2009!). So I am happy that we are finally putting this idea to the test. I would love to hear from you if you have any thoughts, remarks or additional ideas.

If you're interested in contributing or just taking a closer look at the development of DataCleaner 3.0, we've already started working on the new milestone in our issue tracking system. The code is located in DataCleaner's source code repository and we are eagerly awaiting you if you wish to pitch in!

Update

We now have an alpha version available that you can play around with. If you feel like being an early adopter, we would really appreciate any feedback from this!

Update (2)

We now have a beta version available that you can play around with. This is starting to look feature complete, so please check it out and let us know what you think!

20120417

Data quality monitoring with Kettle and DataCleaner

We've just announced a great thing - the cooperation between Pentaho and DataCleaner, which brings DataCleaner's profiling features to all users of Pentaho Data Integration (aka Kettle)! Not only is this something I've been looking forward to for a long time because it is great exposure for us, but it also opens up new doors in terms of functionality. In this blog post I'll describe something new: data monitoring with Kettle and DataCleaner.

While DataCleaner is perfectly capable of doing continuous data profiling, we lack the deployment platform that Pentaho has. With Pentaho you get orchestration and scheduling, and even with a graphical editor.

A scenario that I often encounter is that someone wants to execute a daily profiling job, archive the results with a timestamp and have the results emailed to the data steward. Previously we would set this sorta thing up with DataCleaner's command line interface, which is still quite a nice solution, but if you have more than just a few of these jobs, it can quickly become a mess.

So alternatively, I can now just create a Kettle job like this:


Here's what the example does:

  1. Starts the job (duh!)
  2. Creates a timestamp which needs to be used for archiving the result. This is done using a separate transformation, which you can do either using the "Get System Info" step or the "Formula" step. The result is put into a variable called "today".
  3. Executes the DataCleaner job. The result filename is set to include the "${today}" variable!
  4. Emails the results to the data steward.
  5. If everything went well without errors, the job is successful.
Pretty neat and something I am extremely happy about!

In the future I imagine having even more features built like this. For example, the ability to run multiple DataCleaner jobs with configuration options stored as data in the ETL flow. Or the ability to treat the stream of data in a Kettle transformation as the input of a DataCleaner job. Do you guys have any other wild ideas?
Update: In fact we are now taking actions to provide more elaborate data quality monitoring features to the community. Go to my blog entry about the plans for DataCleaner 3.0 for more information.

20120409

Implementing a custom datastore in DataCleaner

A question I am often asked by super-users, partners and developers of DataCleaner is: how do you build a custom datastore in DataCleaner for my system/file format XYZ? Recently I've dealt with this for use in the upcoming integration with Pentaho Kettle, for a Human Inference customer who had a home-grown database proxy system, and just today when it was asked on the DataCleaner forum. In this blog post I will guide you through the process, which requires some basic Java programming skills, but if that's in place it isn't terribly complicated.

Just gimme the code ...

First of all I should say (to those of you who prefer "just the code") that there is already an example of how to do this in the sample extension for DataCleaner. Take a look at the org.eobjects.datacleaner.sample.SampleDatastore class. Once you've read, understood and compiled the Java code, all you need to do is register the datastore in DataCleaner's conf.xml file like this (within the <datastore-catalog> element):

<custom-datastore class-name="org.eobjects.datacleaner.sample.SampleDatastore">
  <property name="Name" value="My datastore" />
</custom-datastore>

A bit more explanation please!

OK, so if you wanna really know how it works, here goes...

First of all, a datastore in DataCleaner needs to implement the Datastore interface. But instead of implementing the interface directly, I would suggest using the abstract implementation called UsageAwareDatastore. This abstract implementation handles concurrent access to the datastore, reusing existing connections and more. What you still need to provide when extending the UsageAwareDatastore class is primarily the createDatastoreConnection() method, which is invoked when a (new) connection is requested. Let's see what an initial Datastore implementation looks like:

public class ExampleDatastore extends UsageAwareDatastore<DataContext> {

 private static final long serialVersionUID = 1L;
 
 public ExampleDatastore() {
  super("My datastore");
 }

 @Override
 protected UsageAwareDatastoreConnection createDatastoreConnection() {
  // TODO Auto-generated method stub
  return null;
 }

 @Override
 public PerformanceCharacteristics getPerformanceCharacteristics() {
  // TODO Auto-generated method stub
  return null;
 }
}

Notice that I have created a no-arg constructor. This is REQUIRED for custom datastores, since the datastore will be instantiated by DataCleaner. Later we will focus on how to make the name ("My datastore") adjustable.

First we want to have a look at the two unimplemented methods:

  • createDatastoreConnection() is used to create a new connection. DataCleaner builds upon the MetaModel framework for data access. You will need to return a new DatastoreConnectionImpl(...). This class takes an important parameter, namely your MetaModel DataContext implementation. Oftentimes there will already be a DataContext that you can use given some configuration, eg. a JdbcDataContext, a CsvDataContext, an ExcelDataContext, a MongoDbDataContext or whatever (see the sketch after this list).
  • getPerformanceCharacteristics() is used by DataCleaner to figure out the query plan when executing a job. You will typically just return a new PerformanceCharacteristics(false);. Read the javadoc for more information :)
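Putting those two methods together, a filled-in version of the skeleton above could look like this - a sketch which assumes, just to have something concrete, that the datastore simply wraps a CSV file:

public class ExampleDatastore extends UsageAwareDatastore<DataContext> {

 private static final long serialVersionUID = 1L;

 public ExampleDatastore() {
  super("My datastore");
 }

 @Override
 protected UsageAwareDatastoreConnection createDatastoreConnection() {
  // assumption for the example: the datastore wraps a CSV file
  DataContext dataContext = DataContextFactory.createCsvDataContext(new File("my_data.csv"));
  return new DatastoreConnectionImpl(dataContext, this);
 }

 @Override
 public PerformanceCharacteristics getPerformanceCharacteristics() {
  // the typical choice, as mentioned above - see the javadoc for details
  return new PerformanceCharacteristics(false);
 }
}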

Parameterizable properties, please

By now you should be able to implement a custom datastore, which hopefully covers your basic needs. But maybe you want to reuse the datastore class with eg. different files, different hostnames etc. In other words: Maybe you want to let your user define certain properties of the datastore.

To your rescue is the @Configured annotation, which is an annotation widely used in DataCleaner. It allows you to annotate fields in your class which should be configured by the user. The types of the fields can be Strings, Integers, Files etc., you name it. Let's see how you would expose the properties of a typical connection:

public class ExampleDatastore extends UsageAwareDatastore<DataContext> {
 // ...

 @Configured
 String datastoreName;

 @Configured
 String hostname;

 @Configured
 Integer port;

 @Configured
 String systemId;

 // ...
}
And how you would typically use them to implement methods:
public class ExampleDatastore extends UsageAwareDatastore<DataContext> {
 // ...

 @Override
 public String getName() {
  return datastoreName;
 }

 @Override
 protected UsageAwareDatastoreConnection createDatastoreConnection() {
  DataContext dataContext = createDataContext(hostname, port, systemId);
  return new DatastoreConnectionImpl(dataContext, this);
 }
}

If I wanted to configure a datastore using the parameters above, I could enter it in my conf.xml file like this:

<custom-datastore class-name="foo.bar.ExampleDatastore">
  <property name="Datastore name" value="My datastore" />
  <property name="Hostname" value="localhost" />
  <property name="Port" value="1234" />
  <property name="System id" value="foobar" />
</custom-datastore>

Notice that the names of the properties are inferred by reversing the camelCase notation which Java uses, so that "datastoreName" becomes "Datastore name" and so on. Alternatively you can provide an explicit name in the @Configured annotation.
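For instance, the hostname property above could be given an explicit display name like this (my reading of the annotation's syntax - the name goes in as the annotation value):

@Configured("Hostname")
String hostname;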

I hope this introductory tutorial makes sense to you. Once again I urge you to take a look at the Sample DataCleaner extension, which also includes a build setup (Maven based) and a custom MetaModel DataContext implementation.

20120123

Now you can build your own DQ monitoring solution with DataCleaner

Under the cover of night we've released a new version of DataCleaner today (version 2.4.2). Officially it's a minor release because very few things have changed in the user interface; only a few bugfixes and minor enhancements have been introduced. But one potentially major feature has been added to the inner workings of DataCleaner: the ability to persist the results of your DQ analysis jobs. Although this feature still has very limited user interface support, it has full support in the command line interface, which I would argue is actually sufficient for the purposes of establishing a data quality monitoring solution. Later on I do expect there to be full (and backwards compatible) support in the UI as well.

So what is it, and how does it work?
Well basically it is simply two new parameters to the command line interface:

 -of (--output-file) FILE                          : File in which to save the result of the job
 -ot (--output-type) [TEXT | HTML | SERIALIZED]    : How to represent the result of the job
Here's an example of how to use it. Notice that I use the file extension .analysis.result.dat, which is currently the one extension that is recognized in the UI as a result file.
> DataCleaner-console.exe -job examples\employees.analysis.xml\
 -ot SERIALIZED\
 -of employees.analysis.result.dat
Now start up DataCleaner's UI, and select "File -> Open analysis job..." - you'll suddenly see that the produced file can be opened:
And when you open the file, the result will be displayed just like a job you've run inside the application:
Since files like this are generally easy to archive and to append eg. timestamps to, it should be really easy to build a DIY data quality monitoring solution based on scheduled jobs and this approach to execution. Or you can get in contact with Human Inference if you want something more sophisticated ;-)
Notice also that there's an HTML output type, which is also quite neat and easy to parse with an XML parser. The SERIALIZED format is richer though, and includes information needed for more refined, programmatic access to the results. For instance, you might deserialize the whole file using the regular Java serialization API and access it as an AnalysisResult instance. Thereby you could eg. create a timeline of a particular metric and track changes to the data that you are monitoring. Update: Please read my follow-up blog post about the plans to include a full Data Quality monitoring solution as of DataCleaner 3.0.
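As a small sketch of that programmatic access (assuming plain Java serialization as described above):

public AnalysisResult readResult(File resultFile) throws IOException, ClassNotFoundException {
    ObjectInputStream in = new ObjectInputStream(new FileInputStream(resultFile));
    try {
        // the file produced with -ot SERIALIZED can be read back as an AnalysisResult
        return (AnalysisResult) in.readObject();
    } finally {
        in.close();
    }
}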