20090628

Introducing AnalyzerBeans

It's been some time now since I first designed the core APIs of the DataCleaner project, and as time has gone on, some of my initial assumptions about the design of profilers, validation rules and so on have turned out to be less than optimal in terms of flexibility and scalability for the application. This is why yesterday I decided to make a major change in the roadmap for the project:

  • The idea of the "webmonitor" application (DataCleaner 2.0) has been cancelled for now. It is still something that I am very much interested in if anyone wants to realize it, but as you will see, I have found that other priorities are more important.
  • A new project has been founded - for now as a "sandbox" project: AnalyzerBeans. AnalyzerBeans is a rethought architecture for datastore profiling, validation etc. - in one word: "analysis". When this project is stable and mature we will probably be ready for something I like to think of as a new DataCleaner 2.0.

So why rethink datastore analysis? Because the "old way" has proven very cumbersome for some tasks whose importance I did not initially realise. The current DataCleaner design assumes that all profiles, validation rules etc. do serial processing of rows. This is not always the best way to do processing, although it simplifies optimization of the execution mechanism, because all components execute in the same way and can thus share result sets etc. In AnalyzerBeans we want the best of both worlds: flexibility to do all sorts of unusual processing, and rigidity for the many profilers that actually do process rows serially.

The solution is a new annotation-based component model. Profilers, validation rules etc. will no longer have to implement specific interfaces, because we can now mix and match annotations to suit each type of analysis component - each "AnalyzerBean". There are a lot more interesting features available when we introduce an annotation-based model, but let me first give you a simple example of what a regular row-processing, DataCleaner-style profile would look like:
@AnalyzerBean(name="Row counter", execution=ExecutionType.ROW_PROCESSING)
public class MySerialCounter {

    // Type-safe configuration: the framework injects the selected table.
    @Configured("Table to count")
    private Table table;

    private long count = 0L;

    // Called for each row; the 'count' parameter holds how many times
    // the row occurs (rows may arrive grouped).
    @Run
    public void run(Row row, long count) {
        this.count += count;
    }
}
Now this is not so impressive. I've just replaced the IProfile interface of DataCleaner's APIs with some annotations. But notice how I've gotten rid of the ProfileDescriptor class, which was used to hold metadata about the profiler. Instead the annotations represent the class metadata - which is exactly what annotations are for :-) Also notice that I've gained a type-safe configuration property using the @Configured annotation. This means that I don't have to parse a string, ask for a Table of the corresponding name etc. And the UI will become a LOT easier to develop because of type-safe facilities like this.
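
To illustrate, here is a rough sketch of what the annotation declarations themselves might look like. To be clear: this is my own guess at a minimal version - the names and attributes of the real AnalyzerBeans annotations may end up differing:

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Sketch: marks a class as an analysis component and holds its metadata.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface AnalyzerBean {
    String name();
    ExecutionType execution();
}

// Sketch: marks a field as a type-safe, user-configurable property.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface Configured {
    String value();
}

// Sketch: marks the method that the framework invokes during execution.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Run {
}

enum ExecutionType {
    ROW_PROCESSING, EXPLORING
}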

But an even more exciting use of the new API is creating a whole new type of profiler, the exploring AnalyzerBean:
@AnalyzerBean(name="Row counter", execution=ExecutionType.EXPLORING)
public class MyExploringCounter {

    @Configured("Table to count")
    private Table table;

    private Number count;

    // Exploring beans receive the full DataContext and fire their own queries.
    @Run
    public void run(DataContext dc) {
        DataSet ds = dc.executeQuery(new Query().selectCount().from(table));
        ds.next();
        this.count = (Number) ds.getRow().getValue(0);
        ds.close();
    }
}
Now this is something totally new: a component that can gain total control of the DataContext and create its own query based on some @Configured parameters. I imagine that this programming model will give us complete flexibility to do exciting new things that were impossible in the DataCleaner framework: join testing, non-serial Value Distribution etc.
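
Just to sketch how such a model could be driven, here is a hypothetical runner that uses reflection to read the annotations and invoke an exploring bean. None of this is actual AnalyzerBeans API - the AnalyzerBeanRunner class and its runExploringBean method are purely illustrative, building on the annotation sketch above, and DataContext refers to the MetaModel class used in the examples:

import java.lang.reflect.Method;

class AnalyzerBeanRunner {

    // Hypothetical helper: reads the annotations on a bean and invokes
    // its @Run method. A real framework would also inject the @Configured
    // fields first; that part is left out here for brevity.
    static void runExploringBean(Object bean, DataContext dataContext) throws Exception {
        AnalyzerBean metadata = bean.getClass().getAnnotation(AnalyzerBean.class);
        if (metadata == null || metadata.execution() != ExecutionType.EXPLORING) {
            throw new IllegalArgumentException(bean + " is not an exploring AnalyzerBean");
        }
        for (Method method : bean.getClass().getMethods()) {
            if (method.isAnnotationPresent(Run.class)) {
                // Exploring beans get the whole DataContext and query on their own.
                method.invoke(bean, dataContext);
                return;
            }
        }
        throw new IllegalStateException("No @Run method found on " + bean.getClass());
    }
}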

There are a few other annotations available to AnalyzerBean developers, but I will take a look at them in a more in-depth blog entry later. For now, let me know if you like the ideas and if you have any comments. Anyone who would like to help out with the development of the AnalyzerBeans project should visit our wiki page on the subject.

Update (2010-09-12)

A lot has happened to AnalyzerBeans since this blog entry. Here's a list of blog entries (in chronological order) that will help interested readers dive deeper into the development of AnalyzerBeans:

20090613

Performance benchmark: DataCleaner thrives on lower column counts

Today I've conducted an experiment. After fixing a bug related to CSV file reading in DataCleaner, I wondered how performance was impacted by different kinds of CSV file compositions. I suspected an impact because CSV files with many columns require a somewhat larger chunk of memory to hold a single row than files with fewer columns do. In older versions of DataCleaner we discovered that using 200 or more columns would actually make the application run out of memory! Fortunately that bug is fixed, but there is still a significant performance penalty, as this blog post will hopefully show.

I auto-generated three files for the benchmark: "huge.csv" with 2,000 columns and 16,000 rows, "long.csv" with 250 columns and 128,000 rows, and "slim.csv" with only 10 columns and a roaring 3,200,000 rows. Each file thus contains 32,000,000 cells to be profiled. I set up a profiler job with the profiles Standard measures and String analysis on all columns.
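
The exact generator I used isn't published, but if you want to reproduce a similar benchmark, a small sketch along these lines (with assumed column naming and synthetic values) should produce comparable files:

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

public class CsvTestFileGenerator {

    // Writes a comma-separated file with the given dimensions,
    // e.g. generate("slim.csv", 10, 3200000) for the "slim" benchmark file.
    public static void generate(String filename, int columns, int rows) throws IOException {
        try (PrintWriter writer = new PrintWriter(new FileWriter(filename))) {
            // Header row: col_0, col_1, ...
            for (int c = 0; c < columns; c++) {
                writer.print(c == 0 ? "col_" + c : ",col_" + c);
            }
            writer.println();
            // Data rows with simple synthetic values.
            for (int r = 0; r < rows; r++) {
                for (int c = 0; c < columns; c++) {
                    writer.print(c == 0 ? "value_" + r : ",value_" + (r % 100));
                }
                writer.println();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        generate("huge.csv", 2000, 16000);
        generate("long.csv", 250, 128000);
        generate("slim.csv", 10, 3200000);
    }
}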

Here are the (surprising?) results:

filename   rows       columns   start time   end time   total time (mm:ss)
huge.csv   16,000     2,000     18:54:48     19:40:28    45:40
long.csv   128,000    250       19:44:53     19:52:31     7:38
slim.csv   3,200,000  10        19:53:46     19:55:03     1:17


So the bottom line is: lowering the number of columns has a very significant, positive impact on performance. Having a lot of columns means that you need to hold a lot more data in memory, and needless to say you will have to replace this large chunk of memory many times during the execution of a large profiler job. Going all the way from 45 minutes down to 1½ minutes is quite an improvement - so don't pre-join tables or anything like that before you run them through your profiler.

20090605

eobjects.org @ JavaOne

I am currently hanging out at the lovely JavaOne conference in San Francisco, checking out cool new Java technology and meeting interesting people. Of course I'm here as a representative of my employer, and I also do some blogging (in Danish).

Yesterday I saw an interesting session about JFreeChart and surviving as an Open Source professional. Dave Gilbert told us how he has managed to make a living from his hobby as a JFreeChart developer, about cool new features of the excellent charting API, and about the struggles of making money on Open Source. Very fascinating stuff, and I hope that everybody in the chart-consuming business will give it a try. I was quite happily surprised to see the new interactive chart functionality that has been put into the API - I wonder how it had escaped my attention until now! It gave rise to a couple of ideas (or rather: sparked my motivation) for me to try to implement charting in DataCleaner.