Often when I speak to data quality professionals and people from the business intelligence world I get the notion that most people think of Open Source tools as slightly immature when it comes to heavy processing, large loads, millions-of-rows-kinda-stuff. And this has had some truth to it. I don't want to name names, but at least I have heard a lot of stories about Open Source data integration / ETL tools that wasn't up for the job when you had millions of rows to transform. So I guess this notion have stuck to Open Source data profilers and data quality applications too...
In DataCleaner 1.5 I want this notion demystified and eradicated! Here are some of the things we are working on to make this release a truly enterprise-ready, performance-oriented and scalable application:
- Multi-threaded, multi-connection, multi-query execution enging
The execution engine in DataCleaner have been thoroughly refactored to support multithreading, multiple connections and query-splitting to perform loadbalancing on the threads and connections. This really boosts performance for large jobs and sets the bar for processing large result sets in Open Source tools I think.
- On-disk caching for memory-intensive profiles and validation rules
Some of the profiles and validation rules are almost inherently memory intensive. We are doing a lot of work optimizing them as much as we can but some thing are simply not possible to change. As an example, a Value Distribution profile simply HAS to know all distinct values of each column that is being profiled. If it doesn't - then it's not a value distribution profile. So we are implementing various degrees of on-disk caching to make this work without flooding memory. This means that the stability of DataCleaner is improved to a heavy league level.
- Batch processing and scheduling
The last (but not least important) feature that I'm going to mention is the new command line interface for DataCleaner. By providing a command line interface for executing DataCleaner jobs you are able to introduce DataCleaner into a grand architecture for data quality, data warehousing, master data management or whatever it is that you are using it for. You can schedule it using any scheduling tool that you like and you can save the results to automate reporting and result analysis.