Open Source acknowledged by the Data Quality community?

Here in Denmark I often feel that the Data Quality, Master Data Management and Business Intelligence fields are pretty wary of Open Source software. I think this largely has to do with a lack of presence, a lot of prejudice and few established consultancy firms in this part of the world.

This is also why it's always a great surprise, and a good one, when you catch the interest of the international Data Quality scene. Last week I corresponded with the guys over at Data Quality Pro, who were building a new Open Source Data Quality page, and we had a nice chat about the tools available and the opportunities out there. I'm very glad that people are showing interest, and hopefully the Danish BI scene will also adapt to the wonderful world of Open Source software as we go along...

Talking about getting the word out, where should you "advertise" your Open Source product to the business world? I personally think it's hard work to market your product, even though you're giving it away for free and you would think that people would automatically rush to your website ;) Of course everyone needs to be aware that the software is here, and I try hard to put the word out there at conferences and on Wikipedia, SourceForge, Freshmeat, Ohloh etc. But let me raise this question to everybody involved with marketing software that costs nothing: How do you do it?


Why can't my master's thesis be more like my Open Source project?

Being a hard-working student, I sometimes have to question (and from time to time applaud) the practices of academia and the tools that we use to foster innovation, creativity and knowledge-sharing. My master's thesis is concerned with the development methods of open source communities and of companies that try to enable community-based development of their products. Yesterday I was considering this quote:

"In the cathedral-builder view of programming, bugs and development problems are tricky, insidious, deep phenomena. It takes months of scrutiny by a dedicated few to develop confidence that you've winkled them all out." - Eric S. Raymond - The Cathedral and the Bazaar

Setting aside the fact that it deals with programming and not with writing a paper (and trying to grow global awareness and knowledge on a specific topic), I thought to myself that the cathedral-builder process is pretty similar to the process of writing a master's thesis. There are pretty strict guidelines to follow, a lot of scrutiny involved and planning by a dedicated few - in my case myself and my supervisor.

So is there room for another way, a more open process with distributed peers, continuous redesign, short release spans etc.? Obviously there are things like Wikipedia that provide this for topics of interest to the general public, but needless to say science projects often go beyond that level of information and have to deal with experiments, not just facts of life such as those in an encyclopedia.

Also there's the issue of academic culture. Ego and elitism doubtlessly play a big part in maintaining a high degree of secrecy and closedness in scientific endeavor. I'd love to see scientists work in a community-enabling fashion, and then I'd love to contribute to one of those communities (or create my own for my master's thesis). Let's for example try something like this out and we'll be well under way:

  • "Bugtrackers" for all the items that need investigation
  • "Source control management" and versioning systems for revisions of the paper(s)
  • Chatty mailing lists for peer review and discussions
  • "Continuous integration" for managing/matching references and terms within the paper
  • Free availability to all underlying data, not just the published parts

The beauties of this would be similar to the beauties of open source. Academia in particular has a strong need to track who has done what, and source control and reference management would greatly improve on that account. For evaluators it would be possible to see the actual changes made by each student in group work; for students working in groups it would be possible to track the actual changes to the project (as opposed to having to read it all over again every time you exchange documents)... Perhaps a more "open" science would be just what we need?


A day of releases!

I saw this morning that the new OpenOffice 3 is out! Congratulations to the OO.o crew, I'm enjoying using your product for my upcoming master's thesis :)

Today is also the day that DataCleaner 1.5 "snapshot" has been released. Here's the press release:

As we're moving steadily along towards the release of DataCleaner 1.5 we are fixing a few bugs and enhancing a lot of features. This leads to the desire to release our work since practically nothing has undergone changes that could destabilize the application since the 1.4 release. So today we're releasing DataCleaner 1.5 "snapshot". This also marks the first release under our new LGPL license.

Here are the changes from 1.4 so far:
  • Change of license to LGPL.
  • New profile: Date mask matcher.
  • New profile: Regex matcher.
  • More file types supported (.dat, .txt)
  • XML file support improved (.xml)
Although this is in principle a development/beta release, we feel that it would be worth working with for most of your profiling needs. So... Go on, download it, tell us what you think and we'll see you around!

I hope you all enjoy the new version of DataCleaner!


Fast as lightning!

Whoa! I just got through my Lightning Speak about eobjects.org and DataCleaner at the Open Source Days '08 conference about an hour ago. It was a great experience - very fun and kinda stressful (in the good, "get to the point" kinda way) to have an alarm clock counting down your 15 minutes of fame!

And indeed my presentation was very much to the point. I wanted to tell people about the great creative projects at eobjects.org and especially about DataCleaner and the MetaModel project, which I dubbed a "derivative" project. My talk also quickly sketched the domain of data quality, and people were nodding when I concluded that they should all download DataCleaner and give their datasources a quick profile the next time they worked on their projects.

You can download my slides here: http://eobjects.org/resources/download/opensourcedays.pdf

Unfortunately the format of the Lightning Speak didn't allow much time for comments and questions from the audience, but I hope and think that they had a good time!