It's been some time now since I first designed the core APIs of the DataCleaner project, and as time goes on, some of my initial assumptions about the design of profilers, validation rules and so on have proven to be less than optimal with regard to the flexibility and scalability of the application. This is why I decided yesterday to make a major change in the roadmap for the project:
- The idea of the "webmonitor" application (DataCleaner 2.0) has been cancelled for now. If anyone wants to realize this idea, it's still something that I am very much interested in, but as you will see I have found that other priorities are more important.
- A new project has been founded - for now as a "sandbox" project: AnalyzerBeans. AnalyzerBeans is a rethought architecture for datastore profiling, validation etc. - in one word: "Analysis". When this project is stable and mature we will probably be ready for something I like to think of as a new DataCleaner 2.0.
The solution is a new annotation-based component model. Profilers, validation rules etc. will no longer have to implement certain interfaces, because we can now mix and match annotations to suit the specific type of analysis component - each "AnalyzerBean". There are a lot more interesting features available when we introduce an annotation-based model, but let me first give you a simple example of what a regular row-processing DataCleaner-style profile would look like:
    @AnalyzerBean(name="Row counter", execution=ExecutionType.ROW_PROCESSING)
    public class MySerialCounter {

        // Type-safe configuration property: a Table, not a table name
        @Configured("Table to count")
        private Table table;

        private long count = 0L;

        // Called for each row; 'count' is the number of occurrences of that row
        @Run
        public void run(Row row, long count) {
            this.count += count;
        }
    }

Now this is not so impressive. I've just replaced the IProfile interface of DataCleaner's APIs with some annotations. But notice how I've gotten rid of the ProfileDescriptor class, which was used to hold metadata about the profiler. Instead the annotations represent the class metadata. This is actually exactly what annotations are for :-) Also notice that I've gotten a type-safe configuration property using the @Configured annotation. This means that I don't have to parse a string, ask for a Table of the corresponding name etc. And the UI will become a LOT easier to develop because of type-safe facilities like this.
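To make the point about annotations replacing a descriptor class a bit more concrete, here is a minimal, self-contained sketch of how a framework could read that metadata through reflection. Everything in it is an assumption for illustration: the annotation declarations are hypothetical stand-ins (not the actual AnalyzerBeans sources), the describe method is made up, and a plain String stands in for the Table type so the example runs without any libraries.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

public class AnnotationMetadataSketch {

    // Hypothetical stand-ins for the AnalyzerBeans annotations (assumed, not the real declarations)
    enum ExecutionType { ROW_PROCESSING, EXPLORING }

    @Retention(RetentionPolicy.RUNTIME) // must be retained at runtime to be visible to reflection
    @Target(ElementType.TYPE)
    @interface AnalyzerBean {
        String name();
        ExecutionType execution();
    }

    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Configured {
        String value();
    }

    @AnalyzerBean(name = "Row counter", execution = ExecutionType.ROW_PROCESSING)
    static class MySerialCounter {
        @Configured("Table to count")
        private String table; // a String stands in for the Table type in this sketch

        private long count = 0L; // not annotated, so not reported as a property
    }

    // Builds a one-line description of a bean class from its annotations alone,
    // which is roughly the job a separate descriptor class used to do.
    static String describe(Class<?> beanClass) {
        AnalyzerBean bean = beanClass.getAnnotation(AnalyzerBean.class);
        if (bean == null) {
            throw new IllegalArgumentException(beanClass.getName() + " is not an @AnalyzerBean");
        }
        List<String> properties = new ArrayList<>();
        for (Field field : beanClass.getDeclaredFields()) {
            Configured configured = field.getAnnotation(Configured.class);
            if (configured != null) {
                // The declared field type tells a UI which kind of input to offer
                properties.add(configured.value() + " (" + field.getType().getSimpleName() + ")");
            }
        }
        return bean.name() + " [" + bean.execution() + "] " + properties;
    }

    public static void main(String[] args) {
        System.out.println(describe(MySerialCounter.class));
    }
}
```

The interesting part is that the field's declared type travels along with the property name, so a UI could offer a table picker instead of a free-text field without any extra descriptor code.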
But an even more exciting way to use the new API is when creating a whole new type of profiler, an exploring AnalyzerBean:
    @AnalyzerBean(name="Row counter", execution=ExecutionType.EXPLORING)
    public class MySerialCounter {

        @Configured("Table to count")
        private Table table;

        private Number count;

        // Called once with full access to the DataContext
        @Run
        public void run(DataContext dc) {
            // Let the datastore do the counting with a COUNT(*) query
            DataSet ds = dc.executeQuery(new Query().selectCount().from(table));
            try {
                if (ds.next()) {
                    Row row = ds.getRow();
                    this.count = (Number) row.getValue(0);
                }
            } finally {
                ds.close();
            }
        }
    }

Now this is something totally new: A component that can gain total control of the DataContext and create its own query based on some @Configured parameters. I imagine that this programming model will give us complete flexibility to do exciting new things that were impossible in the DataCleaner framework: Join testing, non-serial Value Distribution etc.
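The difference between the two execution types can be sketched from the framework's side as well. The following is only an illustration under loud assumptions: the annotations, the execute method and both counter classes are hypothetical, a Map of table name to rows stands in for the DataContext, and a String stands in for both Table and Row. It is not how AnalyzerBeans actually schedules its beans, just one way the @Run dispatch could work.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ExecutionDispatchSketch {

    // Hypothetical stand-ins for the AnalyzerBeans annotations (assumed, not the real declarations)
    enum ExecutionType { ROW_PROCESSING, EXPLORING }

    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.TYPE)
    @interface AnalyzerBean { String name(); ExecutionType execution(); }

    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.FIELD)
    @interface Configured { String value(); }

    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD)
    @interface Run { }

    @AnalyzerBean(name = "Row counter", execution = ExecutionType.ROW_PROCESSING)
    static class RowProcessingCounter {
        @Configured("Table to count") String table;
        long count = 0L;

        @Run
        public void run(String row, long count) { this.count += count; }
    }

    @AnalyzerBean(name = "Row counter", execution = ExecutionType.EXPLORING)
    static class ExploringCounter {
        @Configured("Table to count") String table;
        Number count;

        @Run
        public void run(Map<String, List<String>> dataContext) {
            // The bean decides for itself how to query the (stand-in) data context
            this.count = dataContext.get(table).size();
        }
    }

    // Injects the table into every @Configured field, then calls the @Run method
    // either once per row or once with the whole data context.
    static void execute(Object bean, String table, Map<String, List<String>> dataContext) {
        Class<?> type = bean.getClass();
        try {
            for (Field field : type.getDeclaredFields()) {
                if (field.isAnnotationPresent(Configured.class)) {
                    field.setAccessible(true);
                    field.set(bean, table); // a real framework would match values by property name
                }
            }
            Method run = null;
            for (Method method : type.getDeclaredMethods()) {
                if (method.isAnnotationPresent(Run.class)) {
                    run = method;
                }
            }
            if (run == null) {
                throw new IllegalStateException("No @Run method on " + type.getName());
            }
            run.setAccessible(true);
            AnalyzerBean meta = type.getAnnotation(AnalyzerBean.class);
            if (meta.execution() == ExecutionType.ROW_PROCESSING) {
                for (String row : dataContext.get(table)) {
                    run.invoke(bean, row, 1L); // each row occurs once in this sketch
                }
            } else {
                run.invoke(bean, dataContext);
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Runs both counters against the same three-row table and returns their counts
    static long[] demo() {
        Map<String, List<String>> dataContext = new HashMap<>();
        dataContext.put("customers", Arrays.asList("alice", "bob", "carol"));
        RowProcessingCounter serial = new RowProcessingCounter();
        ExploringCounter exploring = new ExploringCounter();
        execute(serial, "customers", dataContext);
        execute(exploring, "customers", dataContext);
        return new long[] { serial.count, exploring.count.longValue() };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo()));
    }
}
```

Both beans end up with the same count, but only the row-processing one depends on the framework feeding it rows. The exploring one is handed the data context and is free to ask whatever it wants, which is where things like join testing would become possible.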
There are a few other annotations available to the AnalyzerBean-developers but I will take a look at them in a more in-depth blog-entry later. For now - let me know if you like the ideas and if you have any comments. Anyone who would like to help out in the development of the AnalyzerBeans project should visit our wiki page on the subject.
Update (2010-09-12)
A lot has happened to AnalyzerBeans since this blog entry. Here's a list of blog entries (in chronological order) that will help interested readers dive deeper into the development of AnalyzerBeans: