I previously posted a blog entry about how to download and run a simple example of AnalyzerBeans in the shell. Since then I've updated the example and improved the command-line interface, so that it gives you more help if you are interested in using the tool.
First of all, the command-line tool now has a reasonable usage screen:
> java -jar target/AnalyzerBeans.jar
-conf (-configuration, --configuration-file) FILE
: XML file describing the configuration of AnalyzerBeans
-ds (-datastore, --datastore-name) VAL
: Name of datastore when printing a list of schemas, tables or columns
-job (--job-file) FILE
: An analysis job XML file to execute
-list [ANALYZERS | TRANSFORMERS | DATASTORES | SCHEMAS | TABLES | COLUMNS ]
: Used to print a list of various elements available in the configuration
-s (-schema, --schema-name) VAL
: Name of schema when printing a list of tables or columns
-t (-table, --table-name) VAL
: Name of table when printing a list of columns
As you can see, you can now, for example, list all available analyzers (there are a lot, so I'm only posting the parts relevant to my upcoming example here; the rest have been replaced with "..."):
> java -jar target/AnalyzerBeans.jar -conf examples/conf.xml -list ANALYZERS
Analyzers:
----------
...
name: String analyzer
- Consumes multiple input columns
...
name: Value distribution
- Consumes a single input column
- Property: name=Record unique values, type=boolean, required=true
- Property: name=Bottom n most frequent values, type=Integer, required=false
- Property: name=Top n most frequent values, type=Integer, required=false
...
name: Number analyzer
- Consumes multiple input columns
I'll help you read this output: There are three analyzers listed. The String analyzer and Number analyzer both consume multiple columns, which means that they can be configured to have multiple inputs. Value distribution is another analyzer which only consumes a single column and has three configurable properties: Record unique values, Bottom n most frequent values and Top n most frequent values.
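If you want to process this listing in a script rather than read it by eye, the layout above is regular enough to parse. The following is only a sketch based on the excerpt shown in this post: the sample text is copied from it, the function and field names are my own, and the real output may contain line types that this parser simply ignores.

```python
import re

# Sample text copied from the excerpt of the "-list ANALYZERS" output above.
LISTING = """\
Analyzers:
----------
...
name: String analyzer
- Consumes multiple input columns
...
name: Value distribution
- Consumes a single input column
- Property: name=Record unique values, type=boolean, required=true
- Property: name=Bottom n most frequent values, type=Integer, required=false
- Property: name=Top n most frequent values, type=Integer, required=false
...
name: Number analyzer
- Consumes multiple input columns
"""

# Matches lines such as "- Property: name=X, type=Y, required=true".
PROPERTY = re.compile(r"^- Property: name=(.+), type=(\w+), required=(true|false)$")


def parse_listing(text):
    """Return {analyzer name: {"multiple_inputs": bool, "properties": [...]}}."""
    analyzers = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("name: "):
            current = {"multiple_inputs": None, "properties": []}
            analyzers[line[len("name: "):]] = current
        elif current is None:
            continue  # header lines before the first analyzer
        elif line.startswith("- Consumes"):
            current["multiple_inputs"] = "multiple" in line
        else:
            match = PROPERTY.match(line)
            if match:  # anything else (e.g. "...") is skipped
                name, type_name, required = match.groups()
                current["properties"].append(
                    {"name": name, "type": type_name, "required": required == "true"}
                )
    return analyzers
```

Calling `parse_listing(LISTING)` gives one entry per analyzer, with the three Value distribution properties collected under its `properties` list.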
You can similarly list available transformers:
> java -jar target/AnalyzerBeans.jar -conf examples/conf.xml -list TRANSFORMERS
...
or datastores:
> java -jar target/AnalyzerBeans.jar -conf examples/conf.xml -list DATASTORES
...
or tables and columns in a particular datastore:
> java -jar target/AnalyzerBeans.jar -conf examples/conf.xml -ds employees_csv -list TABLES
...
> java -jar target/AnalyzerBeans.jar -conf examples/conf.xml -ds employees_csv -t employees -list COLUMNS
...
So now you have all the details that enable you to author an XML-based AnalyzerBeans job yourself. Let's take a look at the example. I'm going to post a few snippets from the employees_job.xml file which I also used in my previous post. Notice that this file has been updated since my last post, so you will need to run an "svn update" if you followed my previous tutorial, in order to get up-to-date code and data.
The file starts with a little metadata, which we won't go into here. Then there's the <source> part:
<source>
  <data-context ref="employees_csv" />
  <columns>
    <column id="col_name" path="employees.csv.employees.name" />
    <column id="col_email" path="employees.csv.employees.email" />
    <column id="col_birthdate" path="employees.birthdate" />
  </columns>
</source>
The content is almost self-explanatory. There's a reference to the employees_csv datastore and the three columns defined in the CSV file: name, email and birthdate. Notice the ids (the id attributes) of these three columns. These ids will be referenced further down in the XML file.
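Since the column ids only matter because later elements refer back to them, a quick consistency check can catch typos before you run a job. The sketch below is not part of AnalyzerBeans. It assembles a fragment from the snippets quoted in this post, wrapped in a `<job>` root and a `<transformation>` element whose names are placeholders of mine (only `<source>` and `<analysis>` are named in the post), and reports every `<input ref="...">` that matches neither a `<column id>` nor a transformer `<output id>`.

```python
import xml.etree.ElementTree as ET

# Fragment assembled from the snippets in this post. The <job> root and the
# <transformation> wrapper are placeholder names, not taken from the real file.
JOB_XML = """\
<job>
  <source>
    <data-context ref="employees_csv" />
    <columns>
      <column id="col_name" path="employees.csv.employees.name" />
      <column id="col_email" path="employees.csv.employees.email" />
      <column id="col_birthdate" path="employees.birthdate" />
    </columns>
  </source>
  <transformation>
    <transformer>
      <descriptor ref="Email standardizer" />
      <input ref="col_email" />
      <output id="col_username" name="Email username" />
      <output id="col_domain" name="Email domain" />
    </transformer>
  </transformation>
  <analysis>
    <analyzer>
      <descriptor ref="Value distribution" />
      <input ref="col_username" />
    </analyzer>
  </analysis>
</job>
"""


def local_name(tag):
    """Drop an XML namespace prefix, e.g. '{uri}column' -> 'column'."""
    return tag.rsplit("}", 1)[-1]


def unresolved_inputs(xml_text):
    """Return the sorted <input ref> values that match no column or output id."""
    root = ET.fromstring(xml_text)
    defined, used = set(), set()
    for element in root.iter():
        name = local_name(element.tag)
        if name in ("column", "output") and "id" in element.attrib:
            defined.add(element.attrib["id"])
        elif name == "input" and "ref" in element.attrib:
            used.add(element.attrib["ref"])
    return sorted(used - defined)
```

For the fragment above `unresolved_inputs(JOB_XML)` returns an empty list. The namespace stripping in `local_name` is a precaution in case the real job file declares an XML namespace.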
The next major part of the XML file is the transformation part. Let's have a look at one of the transformations:
<transformer>
  <descriptor ref="Email standardizer" />
  <input ref="col_email" />
  <output id="col_username" name="Email username" />
  <output id="col_domain" name="Email domain" />
</transformer>
This snippet defines that the Email standardizer transformer consumes a single column (col_email) and generates two new virtual columns: col_username and col_domain. With that in place, the final part of the XML file is easy to understand. Let's have a look at one of the analyzers defined in the <analysis> part:
<analyzer>
  <descriptor ref="Value distribution" />
  <input ref="col_username" />
</analyzer>
It simply maps the (virtual) col_username column to a Value distribution analyzer, which is then executed (along with all the other analyzers defined in the file) when you run the job from the command line:
> java -jar target/AnalyzerBeans.jar -conf examples/conf.xml -job examples/employees_job.xml
...
Value distribution for column: Email username
Null count: 0
Unique values:
- asbjorn
- foo.bar
- foobar.foo
- jane.doe
- john.doe
- kasper
- santa
...
I hope that you find this XML format pretty straightforward to author. Of course we will be implementing a graphical user interface as well, but for the moment I am actually quite satisfied with this early user interface.
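To make the flow of the job concrete (a transformer derives a virtual column, and an analyzer then consumes it), here is a rough sketch of the same idea outside the tool. It is not how AnalyzerBeans implements the Email standardizer or Value distribution components. The function names, the splitting rule and the sample addresses are all assumptions made for illustration.

```python
from collections import Counter


def standardize_email(email):
    """Split an address at its last '@' into (username, domain).

    Returns (None, None) when the value contains no '@'.
    """
    if "@" not in email:
        return None, None
    username, _, domain = email.rpartition("@")
    return username, domain


def value_distribution(values):
    """Summarize values as a null count, repeated-value counts and unique values."""
    counts = Counter(v for v in values if v is not None)
    return {
        "null_count": sum(1 for v in values if v is None),
        "repeated": {v: n for v, n in counts.items() if n > 1},
        "unique": sorted(v for v, n in counts.items() if n == 1),
    }


# Made-up sample addresses; the real employees.csv is not reproduced here.
emails = ["john.doe@example.com", "jane.doe@example.com", "kasper@example.org"]

# "Transformer" step: derive the virtual username column from the email column.
usernames = [standardize_email(e)[0] for e in emails]

# "Analyzer" step: run the value distribution over the derived column.
report = value_distribution(usernames)
```

Here `report["unique"]` lists every username that occurs exactly once, which corresponds to the "Unique values" section of the output shown above.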