20110620

SassyReader - Open Source reader of SAS data sets for Java

I'm quite excited to announce the first release of a brand new eobjects.org project: SassyReader. SassyReader is, in my opinion, indeed something sassy, as it fills a gap that has long existed in open source applications that deal with data management (ETL tools, tools like DataCleaner and the like). SassyReader is a library for reading data in the sas7bdat format, i.e. the format that the SAS statistical software uses! It is written entirely in Java and reads the files directly from their binary format (i.e. it's not a connector to the SAS system, but a reader of the raw data).
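
To make "reads the files from their binary format" a bit more concrete, here is a minimal sketch of the kind of low-level step any binary file reader performs: decoding a 4-byte integer at a known offset in a byte buffer. The class name, the helper method, the buffer contents and the offset are all illustrative assumptions and not SassyReader's actual code or the real sas7bdat layout.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class BinaryFieldSketch {

    // Hypothetical helper (not SassyReader's API): reads a 4-byte integer
    // from `data` starting at `offset`, interpreting it in the given byte order.
    static int readInt(byte[] data, int offset, ByteOrder order) {
        return ByteBuffer.wrap(data, offset, 4).order(order).getInt();
    }

    public static void main(String[] args) {
        // A made-up 24-byte "page" with the bytes 03 00 00 00 at offset 20.
        byte[] page = new byte[24];
        page[20] = 0x03;

        // The same four bytes decode differently depending on byte order.
        System.out.println(readInt(page, 20, ByteOrder.LITTLE_ENDIAN)); // prints 3
        System.out.println(readInt(page, 20, ByteOrder.BIG_ENDIAN));    // prints 50331648
    }
}
```

The two printed values differ only because of the byte order passed in, which is a general reason why a reader of raw binary data has to know (or work out) how the file was written before it can trust any number it decodes.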

Visit the SassyReader website

So why is this important? Well, first of all because it is very difficult to create systems that interoperate with SAS. SAS does ship a JDBC driver, but its compliance with JDBC is actually very limited. Even creating a connection will typically require use of SAS's proprietary classes, so you cannot go the standard JDBC way. There is also no JDBC metadata support, and you need to set up a server-side SAS/SHARE option to even expose the connection. Furthermore, this is an add-on product from SAS which costs additional money if you're just a base SAS user. So doing trivial things like connecting to and querying a data set requires a lot of work and money. In my opinion this is poor practice - a legacy way of trying to lock people in to using only a particular brand of software, simply because interoperability is a big pain.

All in all I see a great benefit in a project like SassyReader for those who simply want a way of reading the data that is stored in SAS files.

I cannot take a whole lot of credit for this project though. Most of the really challenging stuff was created by Matt Shotwell, aka BioStatMatt, who founded the sas7bdat project, which is written in R. My contribution was to port it to Java and fix a few issues on the way. Matt put together a lot of fragmented works that describe various findings about the sas7bdat format. In other words, this is a completely reverse-engineered library, based on analysis of actual sas7bdat files. During the last months we've had a good conversation going, fixing some of the remaining issues in parallel and bringing additions to each other's code.

Today we've released version 0.1 of SassyReader. It's not yet ready for mission-critical use, as there are still quirks in the format that we haven't figured out. Also, there are different shapes and sizes within the format that apparently vary depending on (I'm guessing a bit here) the number of columns and the operating system that the file was written on. The good thing is that we have quite an extensive test set, and of the files that I had lying around and wanted to work with, the reader managed to read all but one (11 out of 12)!

Please visit the SassyReader website for more details, and let me know your feedback!

9 comments:

Unknown said...

Great work Kasper! I'm looking forward to creating a new "SAS File Input" step with this library in a next Kettle release (4.3).

Matt

Mark Hall said...

Very cool Kasper! Creating a SAS loader package using this library for the forthcoming Weka 3.7.4 release is on my "top 10" list

Cheers,
Mark.

Unknown said...

Great work Kasper! You have made a great contribution to this industry while keeping everything open source, which is the best thing to do!



Zack said...

Looks very cool, but a quick question. It appears that it only has the ability to 'read' the dataset. If I wanted to query, I'm guessing I'd have to slurp up the dataset into a queryable object.

Kasper Sørensen said...

Hi Zack. Exactly, that's why we're also making SassyReader available through a MetaModel DataContext. That makes it queryable :-)

Jerry Hannan said...

SAS and all other SAS Institute Inc product or service names are registered trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.

Unknown said...

Great and very unique solution, Kasper! I have a question: the code worked for some demo sas7bdat files, but now I got some SAS files from our IT department and it seems that your code does not work on them. The result of int subhCount = IO.readInt(pageData, 20); is always 0. Any idea what could be the cause?

Kasper Sørensen said...

Hi Tobe,

TBH I don't know what that could mean. I suggest raising it as a bug on https://github.com/datacleaner/metamodel_extras (and please include as much description as you can, maybe even a sample file).

Saamrat said...

Is there any Java library to read XPT format? Or to convert XPT format to SAS7BDAT format?