20110627

Developing DataCleaner extensions

Today has been all about the release of DataCleaner 2.2. This is a significant release of our Data Quality Analysis product which I think is becoming more and more mature and capable.

One of the really neat things in DataCleaner 2.2 is its extensibility. Lots of applications are extensible, but few are, in my opinion, as easy to approach as DataCleaner. We expose a limited API that is nevertheless extremely flexible. This makes it easy for developers to explore the opportunities and the architecture.

Another great strength of DataCleaner's extension architecture is the ExtensionSwap. With a click in the browser you can install an extension onto a running application. Personally, I think it's quite a jaw-dropping effect when you see the seamlessness of the integration here.

I've recorded this webcast demonstration for developers who want to get started, or who are just curious about how our developer API works.

Also, a few nice resources for you to investigate further:

  • Our reference documentation now contains a "developers guide" with lots of nice info on extension packaging and more.
  • The Develop page on the website contains links to various previous blog entries and instructions. These are still valid even though the API has been extended in 2.2.
  • And of course, check out the javadoc API documentation for DataCleaner.

Lastly, I want to point out that not only is DataCleaner 2.2 extensible, it is also embeddable. You can now grab DataCleaner from the central Maven repository and bootstrap the application quite easily:

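// 'args' are the command-line arguments, e.g. those passed to your own main method
BootstrapOptions bootstrapOptions = new DefaultBootstrapOptions(args);
Bootstrap bootstrap = new Bootstrap(bootstrapOptions);
// bootstraps the DataCleaner application
bootstrap.run();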

For more info, check out the chapter "Embedding DataCleaner" in the reference documentation.

20110620

SassyReader - Open Source reader of SAS data sets for Java

I'm quite excited to announce the first release of a brand new eobjects.org project: SassyReader. SassyReader is, in my opinion, indeed something sassy, as it fills a gap that has long existed in open source applications that deal with data management (ETL tools, tools like DataCleaner and the like). SassyReader is a library for reading data in the sas7bdat format, a.k.a. the format that the SAS statistical software uses! It is written entirely in Java and reads the files directly from their binary format (i.e. it's not a connector to the SAS system, but a reader of the raw data).

So why is this important? Well, first of all because it is very difficult to create systems that interoperate with SAS. SAS does ship a JDBC driver, but its compliance with JDBC is actually very limited. Even creating a connection will typically require use of SAS's proprietary classes, so you cannot go the standard JDBC way. There is also no JDBC metadata support, and you need to set up a server-side SAS/SHARE option to even expose the connection. Furthermore, this is an add-on product from SAS which costs additional money if you're just a base SAS user. So doing trivial things like connecting and querying a data set requires a lot of work and money. In my opinion this is poor practice - a legacy way of trying to lock people into using only a particular brand of software, simply because interoperability is a big pain.
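To make it concrete what I mean by "the standard JDBC way", here is a minimal sketch that uses nothing but the java.sql interfaces: a connection obtained from DriverManager, and table names read through DatabaseMetaData. The class name, the method name and the JDBC URL are made up for illustration; this is not the SAS driver's API, just the kind of vendor-neutral code that a compliant driver would normally let you write.

```java
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class StandardJdbcSketch {

    // Lists table names using only java.sql interfaces, no vendor-specific classes.
    public static List<String> listTableNames(String jdbcUrl) throws SQLException {
        List<String> names = new ArrayList<>();
        // DriverManager picks whichever registered driver accepts the URL
        try (Connection connection = DriverManager.getConnection(jdbcUrl)) {
            DatabaseMetaData metaData = connection.getMetaData();
            // null catalog/schema and "%" as the name pattern mean "any"
            try (ResultSet tables = metaData.getTables(null, null, "%", new String[] { "TABLE" })) {
                while (tables.next()) {
                    names.add(tables.getString("TABLE_NAME"));
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical URL (an assumption); pass a real JDBC URL as the first argument
        String jdbcUrl = args.length > 0 ? args[0] : "jdbc:hypothetical://localhost/example";
        try {
            for (String name : listTableNames(jdbcUrl)) {
                System.out.println(name);
            }
        } catch (SQLException e) {
            System.err.println("Could not list tables: " + e.getMessage());
        }
    }
}
```

With a driver that needs proprietary classes to connect and has no metadata support, neither the getConnection call nor the getTables call can be relied on, and that is where generic tools run into trouble.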

All in all I see a great benefit in a project like SassyReader for those who simply want a way of reading the data that is stored in SAS files.

I cannot take a whole lot of credit for this project, though. Most of the really challenging work was done by Matt Shotwell, a.k.a. BioStatMatt, who founded the sas7bdat project, which is written in R. My contribution was to port it to Java and fix a few issues along the way. Matt put together a lot of fragmented pieces of work that describe various findings about the sas7bdat format. In other words, this is a completely reverse-engineered library, based on analysis of actual sas7bdat files. Over the last months we've had a good conversation going, fixing some of the remaining issues in parallel and contributing additions to each other's code.

Today we've released version 0.1 of SassyReader. It's not yet ready for mission-critical use, as there are still quirks in the format that we haven't figured out. There are also different shapes and sizes within the format that apparently vary depending on (I'm guessing a bit here) the number of columns and the operating system that the file was written on. The good news is that we have quite an extensive test set, and of the files I had lying around that I wanted to work with, the reader managed to read all but one (11 out of 12)!

Please visit the SassyReader website for more details, and let me know your feedback!