2009-06-28

Introducing AnalyzerBeans

It's been some time now since I first designed the core APIs of the DataCleaner project, and as time has gone on, some of my initial assumptions about the design of profilers, validation rules and so on have proven less than optimal with regard to the flexibility and scalability of the application. This is why I decided yesterday to make a major change to the roadmap for the project:

  • The idea of the "webmonitor" application (DataCleaner 2.0) has been cancelled for now. If anyone wants to realize this idea, it's still something that I am very much interested in, but as you will see I have found that other priorities are more important.
  • A new project has been founded - for now as a "sandbox" project: AnalyzerBeans. AnalyzerBeans is a rethought architecture for datastore profiling, validation etc. - in one word: "Analysis". When this project is stable and mature we will probably be ready for something I like to think of as a new DataCleaner 2.0.
So why rethink datastore analysis? Because the "old way" has proven to be very cumbersome for some tasks that I did not initially realise would be important. The current DataCleaner design assumes that all profiles, validation rules etc. process rows serially. This is not always the best way to do processing, although it simplifies optimization of the execution mechanism because all components execute in the same way and can thus share result sets etc. In AnalyzerBeans we want the best of both worlds: flexibility to do all sorts of weird processing, and rigidity for the many profilers that actually do process rows serially.

The solution is a new annotation-based component model. Profilers, validation rules etc. will no longer have to implement particular interfaces, because we can now mix and match annotations to suit the specific type of analysis component - each "AnalyzerBean". There are a lot more interesting features available when we introduce an annotation-based model, but let me first give you a simple example of what a regular row-processing DataCleaner-style profile would look like:
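@AnalyzerBean(name="Row counter", execution=ExecutionType.ROW_PROCESSING)
public class MySerialCounter {

    // Injected by the framework as a typed Table, not as a table name
    @Configured("Table to count")
    private Table table;
    private long count = 0L;

    // Called for each row; the count parameter is added to the running total
    @Run
    public void run(Row row, long count) {
        this.count += count;
    }
}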
Now this is not so impressive. I've just replaced the IProfile interface of DataCleaner's API with some annotations. But notice how I've gotten rid of the ProfileDescriptor class, which was used to hold metadata about the profiler. Instead, the annotations represent the class metadata. This is actually exactly what annotations are for :-) Also notice that I've gotten a type-safe configuration property using the @Configured annotation. This means that I don't have to parse a string, ask for a Table of the corresponding name etc. And the UI will become a LOT easier to develop because of type-safe facilities like this.
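To make the metadata and configuration idea a bit more concrete, here is a minimal, self-contained sketch of how a framework could read such annotations through plain Java reflection. It is not the AnalyzerBeans implementation: the annotations below are simplified stand-ins declared locally, and the names AnnotationSketch, MyCounter, displayName and configure, as well as the tiny Table class, are assumptions made up for this illustration.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.Map;

class AnnotationSketch {

    // Hypothetical, simplified stand-ins for the annotations in the post.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface AnalyzerBean {
        String name();
    }

    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Configured {
        String value();
    }

    // Stand-in for a real table type (assumption for this sketch).
    static class Table {
        final String name;

        Table(String name) {
            this.name = name;
        }
    }

    @AnalyzerBean(name = "Row counter")
    static class MyCounter {
        @Configured("Table to count")
        private Table table;

        String tableName() {
            return table == null ? null : table.name;
        }
    }

    // The class-level annotation plays the role of a descriptor class.
    static String displayName(Class<?> beanClass) {
        AnalyzerBean meta = beanClass.getAnnotation(AnalyzerBean.class);
        if (meta == null) {
            throw new IllegalArgumentException("Not an AnalyzerBean: " + beanClass);
        }
        return meta.name();
    }

    // Injects values into @Configured fields, looked up by the annotation's label.
    // The declared field type is checked, so the bean never has to parse strings.
    // (Primitive fields are not handled in this sketch.)
    static void configure(Object bean, Map<String, Object> values) {
        for (Field field : bean.getClass().getDeclaredFields()) {
            Configured configured = field.getAnnotation(Configured.class);
            if (configured == null) {
                continue;
            }
            Object value = values.get(configured.value());
            if (value != null && !field.getType().isInstance(value)) {
                throw new IllegalArgumentException("Wrong type for '" + configured.value() + "'");
            }
            try {
                field.setAccessible(true);
                field.set(bean, value);
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
    }

    public static void main(String[] args) {
        MyCounter bean = new MyCounter();
        configure(bean, Map.of("Table to count", new Table("customers")));
        System.out.println(displayName(MyCounter.class) + " -> " + bean.tableName());
    }
}
```

A UI built on top of something like this could list the @Configured fields of a bean, look at each field's declared type and offer a matching input widget (a table picker for a Table field, for instance), which is roughly the kind of type-safe facility meant above.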

But an even more exciting way to use the new API is when creating a whole new type of profiler, an exploring AnalyzerBean:
@AnalyzerBean(name="Row counter", execution=ExecutionType.EXPLORING)
public class MySerialCounter {

    @Configured("Table to count")
    private Table table;
    private Number count;

    // Called once with the whole DataContext; the bean builds its own query
    @Run
    public void run(DataContext dc) {
        DataSet ds = dc.executeQuery(new Query().selectCount().from(table));
        ds.next();
        // The COUNT(*) result is the first value of the current row
        Row row = ds.getRow();
        this.count = (Number) row.getValue(0);
        ds.close();
    }
}
Now this is something totally new: a component that can gain total control of the DataContext and create its own query based on some @Configured parameters. I imagine that this programming model will give us complete flexibility to do exciting new things that were impossible in the DataCleaner framework: join testing, non-serial Value Distribution etc.
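As a rough illustration of how an engine might tell the two execution styles apart, the sketch below reads the execution type from the class annotation and either feeds rows one at a time to the @Run method or hands over the whole data context in a single call. Everything in it is an assumption for illustration only: the annotations are local stand-ins, rows are plain strings, InMemoryContext is a toy replacement for a real DataContext, and each row is passed with a count of 1.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.List;

class ExecutionSketch {

    enum ExecutionType { ROW_PROCESSING, EXPLORING }

    // Hypothetical, simplified stand-ins for the annotations in the post.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface AnalyzerBean {
        String name();
        ExecutionType execution();
    }

    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface Run {
    }

    // Toy replacement for a DataContext: just a list of rows held in memory.
    static class InMemoryContext {
        final List<String> rows;

        InMemoryContext(List<String> rows) {
            this.rows = rows;
        }
    }

    @AnalyzerBean(name = "Serial counter", execution = ExecutionType.ROW_PROCESSING)
    static class SerialCounter {
        long count = 0L;

        @Run
        public void run(String row, long rowCount) {
            count += rowCount;
        }
    }

    @AnalyzerBean(name = "Exploring counter", execution = ExecutionType.EXPLORING)
    static class ExploringCounter {
        long count = 0L;

        @Run
        public void run(InMemoryContext dc) {
            // The bean decides for itself how to obtain its result.
            count = dc.rows.size();
        }
    }

    // Dispatches on the execution type declared in the class annotation.
    static void execute(Object bean, InMemoryContext dc) {
        AnalyzerBean meta = bean.getClass().getAnnotation(AnalyzerBean.class);
        if (meta == null) {
            throw new IllegalArgumentException("Not an AnalyzerBean: " + bean.getClass());
        }
        Method run = findRunMethod(bean.getClass());
        try {
            if (meta.execution() == ExecutionType.ROW_PROCESSING) {
                // The engine owns the iteration; the bean sees one row per call.
                for (String row : dc.rows) {
                    run.invoke(bean, row, 1L);
                }
            } else {
                // The bean gets the whole context exactly once.
                run.invoke(bean, dc);
            }
        } catch (IllegalAccessException | InvocationTargetException e) {
            throw new IllegalStateException(e);
        }
    }

    static Method findRunMethod(Class<?> beanClass) {
        for (Method method : beanClass.getDeclaredMethods()) {
            if (method.isAnnotationPresent(Run.class)) {
                method.setAccessible(true);
                return method;
            }
        }
        throw new IllegalArgumentException("No @Run method on " + beanClass);
    }

    public static void main(String[] args) {
        InMemoryContext dc = new InMemoryContext(List.of("a", "b", "c"));
        SerialCounter serial = new SerialCounter();
        ExploringCounter exploring = new ExploringCounter();
        execute(serial, dc);
        execute(exploring, dc);
        System.out.println(serial.count + " / " + exploring.count);
    }
}
```

Because the row-processing branch keeps the loop inside the engine, several serial beans could in principle share a single pass over the same rows, while exploring beans are free to do whatever querying they need.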

There are a few other annotations available to AnalyzerBean developers, but I will take a look at them in a more in-depth blog entry later. For now - let me know if you like the ideas and if you have any comments. Anyone who would like to help out in the development of the AnalyzerBeans project should visit our wiki page on the subject.

Update (2010-09-12)

A lot has happened to AnalyzerBeans since this blog entry. Here's a list of blog entries (in chronological order) that will help interested readers dive deeper into the development of AnalyzerBeans:

1 comment:

DBA_Alex said...

I know it's a little late to post comments, but this sounds very interesting and promising for the DataCleaner project.

Looking forward to its continued success.