20110620

SassyReader - Open Source reader of SAS data sets for Java

I'm quite excited to announce the first release of a brand new eobjects.org project: SassyReader. In my opinion SassyReader is indeed something sassy, as it fills a gap that has long existed in open source applications that deal with data management (ETL tools, tools like DataCleaner and the like). SassyReader is a library for reading data in the sas7bdat format, i.e. the format that the SAS statistical software uses! It is written entirely in Java and reads the files directly from their binary format (i.e. it's not a connector to the SAS system, but a reader of the raw data).
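
To make "a reader of the raw data" a bit more concrete, here is a minimal sketch of the very first step such a reader might take: checking that a file starts with the 32-byte signature that the public reverse-engineering notes report for sas7bdat files. This is not SassyReader's actual code or API; the class and method names (Sas7bdatMagicCheck, looksLikeSas7bdat) are made up for illustration, and a real reader has far more to do after this check.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of SassyReader.
class Sas7bdatMagicCheck {

    // The 32-byte signature reported at the start of sas7bdat files
    // (taken from the community's reverse-engineering notes, not from SAS).
    static final byte[] MAGIC = {
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00, (byte) 0xc2, (byte) 0xea, (byte) 0x81, 0x60,
        (byte) 0xb3, 0x14, 0x11, (byte) 0xcf, (byte) 0xbd, (byte) 0x92, 0x08, 0x00,
        0x09, (byte) 0xc7, 0x31, (byte) 0x8c, 0x18, 0x1f, 0x10, 0x11
    };

    // Reads the first 32 bytes of the stream and compares them to MAGIC.
    // Returns false if the stream ends before 32 bytes are available.
    static boolean looksLikeSas7bdat(InputStream in) throws IOException {
        byte[] header = new byte[MAGIC.length];
        int read = 0;
        while (read < header.length) {
            int n = in.read(header, read, header.length - read);
            if (n < 0) {
                return false; // file is shorter than the signature
            }
            read += n;
        }
        return Arrays.equals(header, MAGIC);
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java Sas7bdatMagicCheck <file>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.println(looksLikeSas7bdat(in)
                    ? "Looks like a sas7bdat file"
                    : "Does not look like a sas7bdat file");
        }
    }
}
```

The point of the sketch is only that everything happens on plain bytes from the file: no SAS installation, driver or server is involved.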

Visit the SassyReader website

So why is this important? Well, first of all because it is very difficult to create systems that interoperate with SAS. SAS does ship a JDBC driver, but its compliance with JDBC is actually very limited. Even creating a connection will typically require use of SAS's proprietary classes, so you cannot go the standard JDBC way. There is also no JDBC metadata support, and you need to set up a server-side SAS/SHARE option to even expose the connection. Furthermore, this is an add-on product from SAS which costs additional money if you're just a base SAS user. So doing trivial things like connecting to and querying a data set requires a lot of work and money. In my opinion this is poor practice - a legacy way of trying to lock people in to using only a particular brand of software, simply because interoperability is a big pain.

All in all I see a great benefit in a project like SassyReader for those who simply want a way of reading the data that is stored in SAS files.

I cannot take a whole lot of credit for this project though. Most of the really challenging stuff was created by Matt Shotwell, aka BioStatMatt, who founded the sas7bdat project, which is written in R. My contribution was to port it to Java and fix a few issues along the way. Matt put together a lot of fragmented works that describe various findings about the sas7bdat format. In other words, this is a completely reverse engineered library, based on analysis of actual sas7bdat files. Over the last few months we've had a good conversation going, fixing some of the remaining issues in parallel and contributing additions to each other's code.

Today we've released version 0.1 of SassyReader. It's not yet ready for mission critical use, as there are still quirks in the format that we haven't figured out. Also, there are different shapes and sizes within the format that apparently vary depending on (I'm guessing a bit here) the number of columns and the operating system that the file was written on. The good thing is that we have quite an extensive test set, and of the files that I had lying around and wanted to work with, the reader managed to read all but one (11 out of 12)!

Please visit the SassyReader website for more details, and let me know your feedback!

6 comments:

Matt Casters said...

Great work Kasper! I'm looking forward to creating a new "SAS File Input" step with this library in a next Kettle release (4.3).

Matt

Mark Hall said...

Very cool Kasper! Creating a SAS loader package using this library for the forthcoming Weka 3.7.4 release is on my "top 10" list.

Cheers,
Mark.

Cool Administrator said...

Great work Kasper! You have made a great contribution to this industry while keeping everything open source, and that's the best thing to do!


Sqiar BI said...

SQIAR (http://www.sqiar.com/solutions/technology/tableau) is a leading global consultancy which provides innovative business intelligence services to small and medium size (SMEs) businesses. Our agile approach provides organizations with breakthrough insights and powerful data visualizations to rapidly analyse multiple aspects of their business in perspectives that matter most.

Zack said...

Looks very cool, but a quick question. It appears that it only has the ability to 'read' the dataset. If I wanted to query, I'm guessing I'd have to slurp up the dataset into a queryable object.

Kasper Sørensen said...

Hi Zack. Exactly, that's why we're also making SassyReader available through a MetaModel DataContext. That makes it queryable :-)