Lately I've been blabbering a lot about the marvels of AnalyzerBeans - the project that aims to re-implement an engine for data analysis based on my experience from DataCleaner.
An important milestone in any development project, especially one that, like AnalyzerBeans, is implemented bottom-up, is the point where it becomes possible to use the application without having any developer skills. So far the development of AnalyzerBeans has focused on making it work from a unit-testing perspective, but now we've reached a point where it is also possible to invoke the engine from the command line.
Since we haven't released AnalyzerBeans yet, you will still have to check out the code and build it yourself. It's rather easy - it just requires Subversion and Maven. First, check out the code:
> svn co http://eobjects.org/svn/AnalyzerBeans/trunk AnalyzerBeans
Now build it:
> cd AnalyzerBeans
> mvn install
And now run the example job that's in there:
> java -jar target/AnalyzerBeans.jar \
> -configuration examples/conf.xml -job examples/employees_job.xml
The job will transform/standardize the "full name" and "email address" columns of a CSV file (located in the examples folder) and then print out value distribution and string analysis results for the standardized tokens: First name, Last name, Email username, Email domain.
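To give a feel for what "standardize, then analyze" means here, the following is a minimal standalone Java sketch of the idea: split an email address into username and domain, then count how often each domain occurs. It does not use the AnalyzerBeans API; the class name StandardizeSketch, the methods splitEmail and valueDistribution, and the sample addresses are all made up for illustration, and the real transformers and analyzers may well behave differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: none of these names come from AnalyzerBeans.
class StandardizeSketch {

    // Splits "user@domain" into {username, domain}.
    // Returns null unless there is exactly one '@' with text on both sides.
    static String[] splitEmail(String email) {
        int at = email.indexOf('@');
        if (at <= 0 || at != email.lastIndexOf('@') || at == email.length() - 1) {
            return null;
        }
        return new String[] { email.substring(0, at), email.substring(at + 1) };
    }

    // Counts how often each distinct value occurs (a simple value distribution).
    // A TreeMap keeps the keys sorted so the printed result is stable.
    static Map<String, Integer> valueDistribution(List<String> values) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String value : values) {
            counts.merge(value, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for the "email address" column.
        List<String> emails = Arrays.asList(
                "john.doe@example.com", "jane@example.com", "kasper@example.org");

        List<String> domains = new ArrayList<>();
        for (String email : emails) {
            String[] parts = splitEmail(email);
            if (parts != null) {
                domains.add(parts[1]); // keep only the standardized "Email domain" token
            }
        }

        // prints {example.com=2, example.org=1}
        System.out.println(valueDistribution(domains));
    }
}
```

The example job does the same kind of thing declaratively: the transformation and analysis steps are described in the job XML file rather than hard-coded as above.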
If you've come this far, you've probably also tried opening the XML files employees_job.xml and conf.xml in the examples folder. Maybe you've even figured out that conf.xml describes the application setup and that employees_job.xml describes the job contents. You can edit these files as you please to further explore the application. I will be sure to update my blog soon with some more examples. Also, one of the next features of the command line interface will be to print the available Analyzers and Transformers, to make it easier to author the XML job files.
If you're just trying this out now and you're getting excited about AnalyzerBeans, here are my previous blog posts on the subject. Please don't hesitate to let me know what you think.