DataCleaner 1.5 - a heavy league Data Profiler

Often when I speak to data quality professionals and people from the business intelligence world, I get the notion that most people think of Open Source tools as slightly immature when it comes to heavy processing, large loads, millions-of-rows-kinda-stuff. And this has had some truth to it. I don't want to name names, but I have heard a lot of stories about Open Source data integration / ETL tools that weren't up for the job when you had millions of rows to transform. So I guess this notion has stuck to Open Source data profilers and data quality applications too...

In DataCleaner 1.5 I want this notion demystified and eradicated! Here are some of the things we are working on to make this release a truly enterprise-ready, performance-oriented and scalable application:

  • Multi-threaded, multi-connection, multi-query execution engine
    The execution engine in DataCleaner has been thoroughly refactored to support multithreading, multiple connections and query splitting, which load-balances the work across threads and connections. This really boosts performance for large jobs and, I think, sets the bar for processing large result sets in Open Source tools.
  • On-disk caching for memory-intensive profiles and validation rules
    Some of the profiles and validation rules are almost inherently memory-intensive. We are doing a lot of work to optimize them as much as we can, but some things simply cannot be changed. As an example, a Value Distribution profile simply HAS to know all distinct values of each column being profiled - if it doesn't, then it's not a value distribution profile. So we are implementing various degrees of on-disk caching to make this work without flooding memory. This means that the stability of DataCleaner is improved to a heavy league level.
  • Batch processing and scheduling
    The last (but not least important) feature that I'm going to mention is the new command line interface for DataCleaner. By providing a command line interface for executing DataCleaner jobs you are able to introduce DataCleaner into a grand architecture for data quality, data warehousing, master data management or whatever it is that you are using it for. You can schedule it using any scheduling tool that you like and you can save the results to automate reporting and result analysis.
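To give an idea of what query splitting with load balancing looks like, here is a minimal sketch in Java. It is not DataCleaner's actual engine - the class and method names are made up for illustration - but it shows the core idea: divide a large row range into chunks, hand each chunk to a worker thread (which in real code would run its own query on its own connection), and merge the partial results at the end.

```java
import java.util.*;
import java.util.concurrent.*;

// Illustrative sketch of query splitting, not DataCleaner's real API.
public class QuerySplitter {

    // Split [0, totalRows) into roughly equal ranges, one per worker.
    static List<long[]> split(long totalRows, int workers) {
        List<long[]> ranges = new ArrayList<>();
        long chunk = (totalRows + workers - 1) / workers;
        for (long start = 0; start < totalRows; start += chunk) {
            ranges.add(new long[]{start, Math.min(start + chunk, totalRows)});
        }
        return ranges;
    }

    // Process every range on its own thread and merge the partial results.
    static long profileInParallel(long totalRows, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (long[] range : split(totalRows, workers)) {
                // A real worker would open its own connection and run e.g.
                // "SELECT ... LIMIT <chunk> OFFSET <start>"; here we just
                // count the rows in the range to keep the sketch runnable.
                results.add(pool.submit(() -> range[1] - range[0]));
            }
            long total = 0;
            for (Future<Long> f : results) {
                total += f.get(); // merge each worker's partial result
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because each chunk is independent, slow connections only delay their own chunk, and the thread pool keeps all workers busy until the last range is done.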
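The on-disk caching idea for a Value Distribution profile can be sketched as classic external aggregation: count distinct values in memory until a threshold is reached, spill the partial counts to a temp file, and merge all spills at the end. This is a simplified illustration under my own assumptions (tab-separated spill files, values without tabs or newlines), not DataCleaner's actual caching implementation.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

// Sketch of a disk-spilling value distribution counter (illustrative only).
public class SpillingDistribution {
    private final int maxInMemory;                       // spill threshold
    private final Map<String, Long> counts = new HashMap<>();
    private final List<Path> spills = new ArrayList<>();

    public SpillingDistribution(int maxInMemory) {
        this.maxInMemory = maxInMemory;
    }

    public void add(String value) {
        counts.merge(value, 1L, Long::sum);
        if (counts.size() >= maxInMemory) {
            spill();
        }
    }

    // Write the current partial counts to a temp file and free the memory.
    private void spill() {
        try {
            Path file = Files.createTempFile("distribution", ".spill");
            try (BufferedWriter w = Files.newBufferedWriter(file)) {
                for (Map.Entry<String, Long> e : counts.entrySet()) {
                    w.write(e.getKey() + "\t" + e.getValue());
                    w.newLine();
                }
            }
            spills.add(file);
            counts.clear();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    // Merge the in-memory map with all spill files into the final result.
    public Map<String, Long> result() {
        Map<String, Long> merged = new HashMap<>(counts);
        for (Path file : spills) {
            try (BufferedReader r = Files.newBufferedReader(file)) {
                String line;
                while ((line = r.readLine()) != null) {
                    int tab = line.lastIndexOf('\t');
                    merged.merge(line.substring(0, tab),
                            Long.parseLong(line.substring(tab + 1)), Long::sum);
                }
            } catch (IOException ex) {
                throw new UncheckedIOException(ex);
            }
        }
        return merged;
    }
}
```

Memory use is bounded by the threshold rather than by the number of distinct values, at the cost of a merge pass over the spill files when the final distribution is requested.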


Dylan Jones said...

The last one is really attractive, I can imagine it forming the sanity check of many a batch or ETL process.

Great initiatives Kasper, best of luck.

Kasper Sørensen said...

Thank you Dylan. It's the interest of people such as yourself that drives the development of DataCleaner so I'm very glad that you think the ideas are useful :-)

psydonim said...

Hey Kasper,

I was wondering how you integrate a Django frontend with Java backend applications. Is there an easy way to make Java and Python talk to one another using an API? Or do you do all of the interaction at the database level?

Thank you,

Kasper Sørensen said...

Hi psydonim,

Actually, most of the eobjects.org website is Python only. We do have a (Java) Hudson server running, but it's pretty much independent of the other services.

Also, there's a little bit of XML/HTTP-based integration between the usage statistics within the DataCleaner application and our Python/Django website.