20100719

Data transformation added to AnalyzerBeans

I have been making a lot of improvements to the API of AnalyzerBeans, a sandbox project that I am very passionate about. In short, it is a new Data Profiling/Analysis engine that I think will eventually replace the core parts of DataCleaner. So here's a bit about "what's cookin'":

  • The largest of the new features is that it is now possible to transform data before it is analyzed. The idea is that it should be possible to tokenize, split, convert, etc. values before they enter the analysis. This means one fundamental change to analyzers: they now consume data through an intermediary input-column type, which can be virtual (to represent e.g. a token) or physical (to represent a "regular" column in a datastore). The new component type, "Transformer Beans", will support all the same cool stuff that I've already introduced to the analyzer components, like dependency injection, persistent/scalable collections, annotation-driven composition and registration, etc.
  • Another neat thing that I'm currently finishing up is an Analysis Job Builder. The idea is that analysis jobs should be immutable, because this makes it a lot safer to parallelize the execution of the jobs. Immutable structures are very good to work with when you are executing, but they tend to be tedious when you're building the structure. So I'm also adding an API for building the jobs, which will emphasize type-safety and syntactic neatness to make it easy to programmatically manage and verify the jobs you're building. This will make it a lot easier to build a good UI for AnalyzerBeans.
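
To make the two ideas above a bit more concrete, here is a rough Java sketch of how they could fit together: an input-column type that is either physical or virtual, a tokenizing transformer that produces virtual columns, and a mutable builder that verifies the setup before handing out an immutable job. Every name in it (InputColumn, PhysicalColumn, VirtualColumn, TokenizerTransformer, AnalysisJob, AnalysisJobBuilder) is made up for illustration and should not be read as the actual AnalyzerBeans API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// NOTE: all types below are hypothetical illustrations, not the AnalyzerBeans API.

/** A column an analyzer can consume, whether it exists in the datastore or not. */
interface InputColumn {
    String getName();
    boolean isVirtual();
}

/** A "regular" column that physically exists in a datastore. */
final class PhysicalColumn implements InputColumn {
    private final String name;
    PhysicalColumn(String name) { this.name = name; }
    public String getName() { return name; }
    public boolean isVirtual() { return false; }
}

/** A column produced by a transformer, e.g. a single token. */
final class VirtualColumn implements InputColumn {
    private final String name;
    VirtualColumn(String name) { this.name = name; }
    public String getName() { return name; }
    public boolean isVirtual() { return true; }
}

/** Splits a value on whitespace into a fixed number of token columns. */
final class TokenizerTransformer {
    private final InputColumn source;
    private final List<InputColumn> outputs;

    TokenizerTransformer(InputColumn source, int tokenCount) {
        this.source = source;
        List<InputColumn> columns = new ArrayList<>();
        for (int i = 1; i <= tokenCount; i++) {
            columns.add(new VirtualColumn(source.getName() + " (token " + i + ")"));
        }
        this.outputs = Collections.unmodifiableList(columns);
    }

    InputColumn getSource() { return source; }
    List<InputColumn> getOutputColumns() { return outputs; }

    /** Returns one value per output column; missing tokens stay null. */
    String[] transform(String value) {
        String[] result = new String[outputs.size()];
        if (value == null || value.trim().isEmpty()) {
            return result;
        }
        String[] tokens = value.trim().split("\\s+");
        for (int i = 0; i < result.length && i < tokens.length; i++) {
            result[i] = tokens[i];
        }
        return result;
    }
}

/** Immutable job description; safe to share between executing threads. */
final class AnalysisJob {
    private final List<InputColumn> analyzedColumns;
    AnalysisJob(List<InputColumn> columns) {
        // Defensive copy so later changes to the builder cannot leak in.
        this.analyzedColumns = Collections.unmodifiableList(new ArrayList<>(columns));
    }
    List<InputColumn> getAnalyzedColumns() { return analyzedColumns; }
}

/** Mutable builder that verifies the job while it is being put together. */
final class AnalysisJobBuilder {
    private final List<InputColumn> available = new ArrayList<>();
    private final List<InputColumn> analyzed = new ArrayList<>();

    AnalysisJobBuilder addSourceColumn(PhysicalColumn column) {
        available.add(column);
        return this;
    }

    AnalysisJobBuilder addTransformer(TokenizerTransformer transformer) {
        requireAvailable(transformer.getSource());
        available.addAll(transformer.getOutputColumns());
        return this;
    }

    AnalysisJobBuilder analyze(InputColumn column) {
        requireAvailable(column);
        analyzed.add(column);
        return this;
    }

    AnalysisJob toAnalysisJob() {
        if (analyzed.isEmpty()) {
            throw new IllegalStateException("No columns selected for analysis");
        }
        return new AnalysisJob(analyzed);
    }

    private void requireAvailable(InputColumn column) {
        if (!available.contains(column)) {
            throw new IllegalStateException("Unknown column: " + column.getName());
        }
    }
}

class JobBuilderSketch {
    public static void main(String[] args) {
        PhysicalColumn fullName = new PhysicalColumn("full_name");
        TokenizerTransformer tokenizer = new TokenizerTransformer(fullName, 2);
        AnalysisJob job = new AnalysisJobBuilder()
                .addSourceColumn(fullName)
                .addTransformer(tokenizer)
                .analyze(tokenizer.getOutputColumns().get(0))
                .toAnalysisJob();
        System.out.println(job.getAnalyzedColumns().get(0).getName());
        System.out.println(Arrays.toString(tokenizer.transform("John Doe")));
    }
}
```

The point of the split is that all the fiddly, stateful work (registering columns, wiring transformers, checking that an analyzed column actually exists) happens in the builder, while the job that comes out of it can no longer change and can therefore be handed to several threads without extra locking.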
