20090125

Free OLAP cube icon

Okay this is a bit off-topic compared to my normal posts, but here goes.

When I do websites or GUI design I usually look for free/open source icon packages such as Tango or Crystal. I have been looking for a nice icon to represent an OLAP cube, preferably in a style and coloring similar to the Tango icon set. Sorry to say, I didn't find any, so I went ahead and spent some hours creating a new icon on my own. Here's the result:


And a plain version without the sum/count text elements (good if you need to resize it to very small sizes):


I'm giving this away under a beerware license, so if you want it, it's yours.

20090120

DataCleaner 1.5 - a heavy league Data Profiler

Often when I speak to data quality professionals and people from the business intelligence world, I get the notion that most people think of Open Source tools as slightly immature when it comes to heavy processing, large loads, millions-of-rows kind of stuff. And this has had some truth to it. I don't want to name names, but I have heard a lot of stories about Open Source data integration / ETL tools that weren't up for the job when you had millions of rows to transform. So I guess this notion has stuck to Open Source data profilers and data quality applications too...

In DataCleaner 1.5 I want this notion dispelled and eradicated! Here are some of the things we are working on to make this release a truly enterprise-ready, performance-oriented and scalable application:

  • Multi-threaded, multi-connection, multi-query execution engine
    The execution engine in DataCleaner has been thoroughly refactored to support multithreading, multiple connections and query-splitting to perform load balancing across the threads and connections (see the sketch after this list). This really boosts performance for large jobs and, I think, sets the bar for processing large result sets in Open Source tools.
  • On-disk caching for memory-intensive profiles and validation rules
    Some of the profiles and validation rules are almost inherently memory-intensive. We are doing a lot of work optimizing them as much as we can, but some things are simply not possible to change. As an example, a Value Distribution profile simply HAS to know all distinct values of each column that is being profiled. If it doesn't - then it's not a value distribution profile. So we are implementing various degrees of on-disk caching to make this work without flooding memory. This means that the stability of DataCleaner is improved to a heavy league level.
  • Batch processing and scheduling
    The last (but not least important) feature that I'm going to mention is the new command line interface for DataCleaner. By providing a command line interface for executing DataCleaner jobs, you can fit DataCleaner into a larger architecture for data quality, data warehousing, master data management or whatever it is that you are using it for. You can schedule it with any scheduling tool that you like, and you can save the results to automate reporting and result analysis.
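
To give a feel for what the query-splitting does, here is a rough sketch of the general idea in Python. This is not DataCleaner's actual code (DataCleaner is written in Java), and the customers table, its id and email columns and the example.db file are all made up for illustration:

    import sqlite3
    from concurrent.futures import ThreadPoolExecutor

    DB_PATH = "example.db"  # made-up database file
    SPLITS = 4              # number of id ranges / worker threads

    def count_null_emails(lo, hi):
        # each worker gets its own connection and only scans its own id range
        conn = sqlite3.connect(DB_PATH)
        try:
            rows = conn.execute(
                "SELECT email FROM customers WHERE id >= ? AND id < ?", (lo, hi))
            return sum(1 for (email,) in rows if email is None)
        finally:
            conn.close()

    def profile_null_emails():
        # find the id range once and split it into SPLITS chunks...
        conn = sqlite3.connect(DB_PATH)
        (max_id,) = conn.execute("SELECT MAX(id) FROM customers").fetchone()
        conn.close()
        step = max_id // SPLITS + 1
        ranges = [(i * step, (i + 1) * step) for i in range(SPLITS)]

        # ...then run the split queries in parallel and merge the partial counts
        with ThreadPoolExecutor(max_workers=SPLITS) as pool:
            return sum(pool.map(lambda r: count_null_emails(*r), ranges))

DataCleaner's real engine does the equivalent with multiple JDBC connections in Java, but the splitting idea is the same.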

20090107

Using Python and Django to build the new DataCleaner website

I have for a long time been a dedicated Java developer and in many ways still am. But developing the new website for DataCleaner has been quite an eye-opener to the potential of dynamic languages, and Python in particular. There are so many things about that language that I love, and I must say that doing the same thing in Java would have taken at least twice the time! And that's even though I'm not an inexperienced Java developer.

OK, so what's the big difference? Well, deployment is one very crucial difference. J2EE servers are great for stability and system administration, but as a web developer I often find that I don't need all of those things that much - I just need a server that always runs and tells me what I am doing wrong. Django (which has been my Python web framework) has been excellent at this, so I can kick-start my application in a matter of seconds.

Type-safety is another big difference. Java is type-safe; Python (and other dynamic languages) is not. For back-end development I am a big advocate of type-safety, but in front-end development dynamic classes are such a great treat! An example of this is when transferring data from Controllers to the View in the Django framework's Model-View-Controller architecture. If you want to present some domain objects that are related in the view, but not in the domain model (or perhaps the domain model has some details to it that you want to skip for understandability), then you just attach a completely new attribute to the domain object! In Java or other type-safe languages such as C#, you would typically have to create a Map for storing that particular association and then do a lookup in the view to resolve it. This means more "logic" in the view and code that is harder to comprehend.
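
To illustrate, here is a small, hypothetical Django view in that spirit. The Order and Customer models and the customer_name attribute are made up for the example and are not from the actual DataCleaner site:

    # attach an attribute that the Order model doesn't declare, so the
    # template can use it directly instead of looking it up in a Map
    from django.shortcuts import render_to_response
    from myapp.models import Customer, Order  # assumed models

    def order_list(request):
        # Order and Customer share an email column but have no modeled relation
        orders = list(Order.objects.all())
        customers = dict((c.email, c) for c in Customer.objects.all())
        for order in orders:
            # plain Python attribute, invented on the spot just for this view
            order.customer_name = customers[order.email].name
        return render_to_response("orders.html", {"orders": orders})

The template can then simply write {{ order.customer_name }}, whereas the Java version would be passing an extra Map to the view and doing the lookup there.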

All in all I'm very happy to use Django. I would have liked a few more features in its QuerySet API (especially aggregation queries, which should be on the way), but then again - for the typical website it is pretty sufficient and allows fallback to native SQL. Thank you, Django.
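
As an aside, that native SQL fallback can be as simple as using Django's raw database connection. The downloads table and its country column here are made up for the example:

    from django.db import connection

    def downloads_per_country():
        # an aggregation query expressed in plain SQL
        cursor = connection.cursor()
        cursor.execute(
            "SELECT country, COUNT(*) FROM downloads GROUP BY country")
        return cursor.fetchall()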

Note: This is not to say that I am abandoning Java, not at all! I love Java for its stability and superior integration capabilities, but in some cases I simply want something that is fast and more in tune with the user experience and prototyping process of building websites.