20100911

More instructions for authoring AnalyzerBeans jobs

In a previous post I described how to download and run a simple AnalyzerBeans example from the shell. I've since updated the example and improved the command-line interface so that it gives you more help when you use the tool.

First of all, the command-line tool now has a reasonable usage screen:

> java -jar target/AnalyzerBeans.jar
-conf (-configuration, --configuration-file) FILE
      : XML file describing the configuration of AnalyzerBeans
-ds (-datastore, --datastore-name) VAL
      : Name of datastore when printing a list of schemas, tables or columns
-job (--job-file) FILE
      : An analysis job XML file to execute
-list [ANALYZERS | TRANSFORMERS | DATASTORES | SCHEMAS | TABLES | COLUMNS ]
      : Used to print a list of various elements available in the configuration
-s (-schema, --schema-name) VAL
      : Name of schema when printing a list of tables or columns
-t (-table, --table-name) VAL
      : Name of table when printing a list of columns

As you can see, you can now list, for example, all available analyzers. There are a lot of them, so I'm only posting the parts relevant to my upcoming example here; the rest have been replaced with "...":
> java -jar target/AnalyzerBeans.jar -conf examples/conf.xml -list ANALYZERS
Analyzers:
----------
...
name: String analyzer
- Consumes multiple input columns
...
name: Value distribution
- Consumes a single input column
- Property: name=Record unique values, type=boolean, required=true
- Property: name=Bottom n most frequent values, type=Integer, required=false
- Property: name=Top n most frequent values, type=Integer, required=false
...
name: Number analyzer
- Consumes multiple input columns

I'll help you read this output: There are three analyzers listed. The String analyzer and Number analyzer both consume multiple columns, which means that they can be configured to have multiple inputs. Value distribution is another analyzer which only consumes a single column and has three configurable properties: Record unique values, Bottom n most frequent values and Top n most frequent values.
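If you want to process this listing in a script, the reading described above can be sketched in a few lines of Python. This is only a rough sketch: it assumes the output keeps exactly the line shapes shown in the excerpt ("name: ...", "- Consumes ...", "- Property: name=..., type=..., required=..."), and the function name parse_listing is my own, not part of AnalyzerBeans.

```python
import re

# Sample text copied from the listing above ("..." lines are elisions in the post).
SAMPLE = """\
Analyzers:
----------
...
name: String analyzer
- Consumes multiple input columns
...
name: Value distribution
- Consumes a single input column
- Property: name=Record unique values, type=boolean, required=true
- Property: name=Bottom n most frequent values, type=Integer, required=false
- Property: name=Top n most frequent values, type=Integer, required=false
...
name: Number analyzer
- Consumes multiple input columns
"""

# Assumed line shape for a property, based only on the excerpt above.
PROPERTY = re.compile(r"- Property: name=(.+), type=(\w+), required=(true|false)$")


def parse_listing(text):
    """Return {analyzer name: {"multiple_inputs": bool, "properties": [...]}}."""
    analyzers = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("name: "):
            current = {"multiple_inputs": None, "properties": []}
            analyzers[line[len("name: "):]] = current
        elif current is None:
            continue  # header lines before the first analyzer
        elif line.startswith("- Consumes"):
            current["multiple_inputs"] = "multiple" in line
        else:
            match = PROPERTY.match(line)
            if match:
                name, type_name, required = match.groups()
                current["properties"].append(
                    {"name": name, "type": type_name, "required": required == "true"}
                )
    return analyzers


if __name__ == "__main__":
    for name, info in parse_listing(SAMPLE).items():
        print(name, info)
```

Lines that match none of the known shapes (such as the "..." elisions) are simply skipped, so the sketch tolerates extra output it does not understand.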

You can similarly list available transformers:
> java -jar target/AnalyzerBeans.jar -conf examples/conf.xml -list TRANSFORMERS
...
or datastores:
> java -jar target/AnalyzerBeans.jar -conf examples/conf.xml -list DATASTORES
...
or tables and columns in a particular datastore:
> java -jar target/AnalyzerBeans.jar -conf examples/conf.xml -ds employees_csv -list TABLES
...
> java -jar target/AnalyzerBeans.jar -conf examples/conf.xml -ds employees_csv -t employees -list COLUMNS
...
So now you have all the details you need to author an XML-based AnalyzerBeans job yourself. Let's take a look at the example. I'm going to post a few snippets from the employees_job.xml file, which I also used in my previous post. Note that this file has been updated since then, so if you followed my previous tutorial you will need to run an "svn update" to get up-to-date code and data.

The file starts with a little metadata, which we won't go into here. Then there's the <source> part:
<source>
  <data-context ref="employees_csv" />
  <columns>
    <column id="col_name" path="employees.csv.employees.name" />
    <column id="col_email" path="employees.csv.employees.email" />
    <column id="col_birthdate" path="employees.birthdate" />
  </columns>
</source>
The content is almost self-explanatory. There's a reference to the employees_csv datastore and the three columns defined in the CSV file: name, email and birthdate. Notice the id attributes of these three columns; these ids are referenced further down in the XML file.
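To make the id-to-path mapping concrete, here is a small Python sketch that reads the <source> snippet with the standard library's xml.etree.ElementTree and collects the datastore reference and the column ids. The helper name read_source is hypothetical, and the sketch assumes the elements carry no XML namespace, as in the snippet; if the real job file declares one, the tag lookups would need to include it.

```python
import xml.etree.ElementTree as ET

# The <source> fragment exactly as shown in the post.
SOURCE_XML = """
<source>
    <data-context ref="employees_csv" />
    <columns>
      <column id="col_name" path="employees.csv.employees.name" />
      <column id="col_email" path="employees.csv.employees.email" />
      <column id="col_birthdate" path="employees.birthdate" />
    </columns>
</source>
"""


def read_source(xml_text):
    """Return (datastore name, {column id: column path}) for a <source> fragment."""
    source = ET.fromstring(xml_text)
    datastore = source.find("data-context").get("ref")
    columns = {col.get("id"): col.get("path") for col in source.iter("column")}
    return datastore, columns


if __name__ == "__main__":
    datastore, columns = read_source(SOURCE_XML)
    print("datastore:", datastore)
    for column_id, path in columns.items():
        print(column_id, "->", path)
```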

The next major part of the XML file is the transformation part. Let's have a look at one of the transformations:
<transformer>
  <descriptor ref="Email standardizer" />
  <input ref="col_email" />
  <output id="col_username" name="Email username" />
  <output id="col_domain" name="Email domain" />
</transformer>
This snippet declares that the Email standardizer transformer consumes a single column (col_email) and generates two new virtual columns: col_username and col_domain. With that in place, the final part of the XML file is easy to understand. Let's have a look at one of the analyzers defined in the <analysis> part:
<analyzer>
  <descriptor ref="Value distribution" />
  <input ref="col_username" />
</analyzer>
It simply maps the (virtual) col_username column to a Value distribution analyzer which is then executed (along with all the other analyzers defined in the file) when you run the job from the command line:
> java -jar target/AnalyzerBeans.jar -conf examples/conf.xml -job examples/employees_job.xml
...
Value distribution for column: Email username
Null count: 0
Unique values:
- asbjorn
- foo.bar
- foobar.foo
- jane.doe
- john.doe
- kasper
- santa
...
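Since every input ref has to point either at a column id from the <source> part or at an output id produced by a transformer, a hand-written job file can be sanity-checked before running it. The following Python sketch does that for the snippets quoted in this post. Note the assumptions: the <job> root element and the <transformation> wrapper are placeholder names I made up to combine the fragments into one document, the function unresolved_inputs is hypothetical, and the check is not something AnalyzerBeans itself provides in this form.

```python
import xml.etree.ElementTree as ET

# Fragments copied from the post. The <job> root and the <transformation>
# wrapper are hypothetical names, used only to form a single document.
JOB_XML = """
<job>
  <source>
    <data-context ref="employees_csv" />
    <columns>
      <column id="col_email" path="employees.csv.employees.email" />
    </columns>
  </source>
  <transformation>
    <transformer>
      <descriptor ref="Email standardizer" />
      <input ref="col_email" />
      <output id="col_username" name="Email username" />
      <output id="col_domain" name="Email domain" />
    </transformer>
  </transformation>
  <analysis>
    <analyzer>
      <descriptor ref="Value distribution" />
      <input ref="col_username" />
    </analyzer>
  </analysis>
</job>
"""


def unresolved_inputs(xml_text):
    """Return the sorted input refs that match no column id or transformer output id."""
    root = ET.fromstring(xml_text)
    # iter() searches the whole tree, so the wrapper element names do not matter.
    known = {col.get("id") for col in root.iter("column")}
    known |= {out.get("id") for out in root.iter("output")}
    used = {inp.get("ref") for inp in root.iter("input")}
    return sorted(used - known)


if __name__ == "__main__":
    missing = unresolved_inputs(JOB_XML)
    print("unresolved refs:", missing if missing else "none")
```

A mistyped ref (say col_user instead of col_username) would show up in the returned list, which is usually quicker to spot than an error at job execution time.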
I hope that you find this XML format pretty straightforward to author. Of course we will be implementing a graphical user interface as well, but for the moment I am actually quite satisfied with this early user interface.

4 comments:

Asbjørn Leeth said...

Wow, can't wait until I get some time to try it out. Everything I have seen so far regarding AnalyzerBeans looks really promising.
Keep up the good work, no, excellent work, for all of us :)

tania said...

Can you please make the text color a bit darker so it is more visible against the gray background? Thank you!

nice blog !

- Tanya