20100829

Now you can run AnalyzerBeans (from the shell)

Lately I've been blabbering a lot about the marvels of AnalyzerBeans - the project aimed at re-implementing an engine for data analysis, based on my experience from DataCleaner.

An important milestone in any development project, especially one like AnalyzerBeans that is implemented bottom-up, is when it actually becomes possible to use the application without any developer skills. So far the development of AnalyzerBeans has focused on making it work from a unit-testing perspective, but we've now reached a point where it is also possible to invoke the engine from the command line.

Since we haven't released AnalyzerBeans yet, you will still have to check out the code and build it yourself. It's rather easy - it just requires Subversion and Maven. First, check out the code:

> svn co http://eobjects.org/svn/AnalyzerBeans/trunk AnalyzerBeans

Now build it:

> cd AnalyzerBeans
> mvn install

And now run the example job that's in there:

> java -jar target/AnalyzerBeans.jar \
> -configuration examples/conf.xml -job examples/employees_job.xml

The job will transform/standardize the "full name" and "email address" columns of a CSV-file (located in the examples-folder) and then print out value distribution and string analysis results for the standardized tokens: First name, Last name, Email username, Email domain.
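To give an idea of what that standardization produces, here is a small illustrative Java sketch - not the actual AnalyzerBeans transformer, and the input value is made up - of the kind of tokenization applied to the "email address" column:

// hypothetical input value from the CSV file
String email = "jane.doe@example.com";

// split the address into the two standardized tokens
int atIndex = email.indexOf('@');
String emailUsername = email.substring(0, atIndex); // "jane.doe"
String emailDomain = email.substring(atIndex + 1);  // "example.com"

The value distribution and string analysis results are then calculated on tokens like these, rather than on the raw column values.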

If you've gone this far, you've probably also tried opening the xml-files employees_job.xml and conf.xml in the examples-folder. Maybe you've even figured out that conf.xml describes the application setup and that employees_job.xml describes the job contents. You can edit these files as you please to further explore the application. I will be sure to update my blog soon with some more examples. Also, one of the next features of the command line interface will be to print the available Analyzers and Transformers, in order to make it easier to author the xml job-files.

If you're just trying this out now and are getting excited about AnalyzerBeans, here are my previous blog posts on the subject. Please don't hesitate to let me know what you think.

20100827

A nice abstraction over regular expressions

Often when you're developing data profiling, matching or cleansing software, you're dealing with expression matching, typically through regular expressions (regexes). I find that defining and reusing regexes, or parts of regexes, is often a tedious and error-prone task. In AnalyzerBeans there's a huge need for easier and reusable pattern matching. To address this I've come up with a helper class, NamedPattern, which you can use to match and identify tokens in patterns in a type-safe and easy way. Here's a short example for matching and tokenizing names based on a few simple patterns:

// First define an enum with the tokens in the pattern(s)
public enum NamePart { FIRSTNAME, LASTNAME, TITULATION }

// The three patterns
NamedPattern<NamePart> p1 = new NamedPattern<NamePart>("TITULATION. FIRSTNAME LASTNAME", NamePart.class);
NamedPattern<NamePart> p2 = new NamedPattern<NamePart>("FIRSTNAME LASTNAME", NamePart.class);
NamedPattern<NamePart> p3 = new NamedPattern<NamePart>("LASTNAME, FIRSTNAME", NamePart.class);

// notice the type parameter <NamePart> - the match result type is typesafe!
NamedPatternMatch<NamePart> match = p1.match("Sørensen, Kasper");
assert match == null;

match = p2.match("Sørensen, Kasper");
assert match == null;

// here's a match!
match = p3.match("Sørensen, Kasper");
assert match != null;

String firstName = match.get(NamePart.FIRSTNAME);
String lastName = match.get(NamePart.LASTNAME);
String titulation = match.get(NamePart.TITULATION);

All in all I think that the NamedPattern class (and the NamedPatternMatch) in combination with your own enums is a pretty elegant way to do string pattern matching. There's also a way to specify how the underlying regular expression will be built by letting the enum implement the HasGroupLiteral interface.
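Roughly speaking, that could look something like the sketch below - the method name is my shorthand here, so consult the Javadoc for the exact signature of HasGroupLiteral:

// a sketch of an enum that supplies its own regex group literals
public enum NamePart implements HasGroupLiteral {
    FIRSTNAME, LASTNAME, TITULATION;

    public String getGroupLiteral() {
        if (this == TITULATION) {
            // only match a few known titulations
            return "(Mr|Mrs|Ms|Dr)";
        }
        // match a single word of letters (including accented characters)
        return "([\\p{L}]+)";
    }
}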

Developers can dive into the details of these classes and interfaces at the Javadoc / API Documentation for AnalyzerBeans (package org.eobjects.analyzer.util).

20100809

Visualizations and API documentation for AnalyzerBeans

I've spent a few hours trying to capture some of the basic principles of data flow and execution in my new favourite spare-time project, AnalyzerBeans. Here are the results, which you will also find in the API Documentation.

The first image shows the relationship between analyzers, transformers and the data that they consume:

[Image: Data flow]

The second image shows a "close-up" of a row of data. Some of the values originate from the actual datastore, while others may be virtual, generated by a chain of transformers:

[Image: InputRow]
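To make the idea concrete in code, here's a minimal conceptual sketch - not the actual AnalyzerBeans interfaces, which may differ in naming and detail - of a row whose values can be either physical or virtual:

// a conceptual sketch of the InputRow idea: columns are either physical
// (read directly from the datastore) or virtual (produced by a transformer)
public interface InputColumn<E> {
    String getName();
    boolean isPhysicalColumn();
}

public interface InputRow {
    // values are looked up per column, regardless of whether the column
    // is physical or virtual
    <E> E getValue(InputColumn<E> column);
}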

Enjoy :)