What to remember when presenting OS BI

As I wrote in my previous blog post, I went to Danish IT the other day to talk about Open Source BI. I've spent a lot of time over the last two days contemplating the presentation and how the audience perceived OS BI as pretty immature and insecure... Somehow some of my points about how one could take advantage of Open Source instead of seeing it as a threat and a risk got lost in the mix. I will try to sum up some of the thoughts I've had on what went wrong and how to present BI products to people who are unfamiliar with Open Source.

  • First off, the presentation wasn't really a "sales meeting", so I took an academic perspective and showed the audience a broad view of the OS BI arena, including two alternatives for each product group: databases, ETL, reporting and OLAP. This was a really bad idea: presenting the alternatives within OS was simply too much. Instead of being impressed that there was enough volume for rivalry between and within the OS communities, the audience saw this as a bad thing - fragmentation, instability etc. So instead, just show the business people a single suite of products, a silver bullet, even though we all know that this does not exist (it doesn't in the commercial world either, which is a good thing).
  • Taking too feature-driven a product focus was not the best idea. Sure, you should point out all the good features of OS BI products, but in my demonstrations I focused on the great computational power and advanced features instead of showing some nice user interfaces. In selecting what to present I would definitely recommend using more user-friendly products like Eclipse BIRT, Talend, OpenLaszlo, etc. Instead I showed them the Pentaho suite, which for a large part has a somewhat boring theme.
  • Remember to give them a list of companies that have already undergone OS BI initiatives. I forgot this completely and it was a big mistake. I hope the attendees followed my advice to go check it out themselves on Pentaho's and JasperSoft's websites.
  • Stress that participation does not necessarily require coding skills. Show them how support forums work and that the communication part is just as important. Without useful information the developers are lost and will probably not focus on the exact same things as the customers do.
  • Get more authority into the room. It's hard to admit, but I have a hard time commanding authority in a room of business people because I'm more of the academic type. So bring along colleagues and trusted associates to help convince the audience that your message is legitimate. This will also help potential customers understand that you're not the only one in the business who cares about OS BI - it will tell them that there are others, and certainly enough to get started with consulting, training and recruiting.
  • Show them the numbers behind Open Source. David Wheeler's article should give you some good starting points. Address the fact that there are plenty of developers and tell them what motivates them: learning, reputation, ideology and "real" work/sponsored development. (These points only seemed to kick in when I told them about my own OS projects, so I must not have made them clear from the beginning.)
On a positive note, however (seeing that the points above may leave you with the impression that the meeting went all wrong, which is not the case), there were several positive experiences. It helped a lot to show the actual community webpages and how the development process worked. I did this as my last topic, but it should have come earlier to help the audience get a feeling for the underlying ideas. Also, supplementing with notes on related Open Source products helps people understand that this is not just a "crazy idea" that popped up for BI. Show them JBoss, Apache, Linux, OpenOffice, Eclipse, Mozilla etc. And then tell them about integration, SOA etc., which are architectural challenges that are very suitably overcome using Open Source based solutions.


Addressing Open Source BI and data quality at Danish IT

I've been invited to address the Danish IT (Dansk IT) networking group on Business Intelligence next Wednesday. I'll concentrate on Open Source business models, pitfalls and opportunities, and an overview of the Open Source BI market, including demos of various tools - DataCleaner among them, of course. I think this is a great opportunity to get people involved with OS BI - an area that has been somewhat overlooked, at least in Denmark.

Update: Just got home from the networking group, and it was a very interesting day sparked with lots of discussions and perspectives on BI products. I can't say that everybody was convinced about going Open Source for their BI solutions, but they definitely got an impression of what goes on, and most people there were very interested in DataCleaner, perhaps because applying our data quality solution has so few implications for the rest of the BI portfolio.

Update: You can now download my slides about Open Source Business Intelligence and please let me know what you think.


Report on DataCleaner development process

As some of you may know, DataCleaner started as an academic project for me, investigating how Open Source projects are established, managed and developed. I've been waiting a long time for the evaluation of the project, but yesterday I finally got the results, and I'm proud to announce that I got the top grade for the assignment! (12 in the Danish grading system, which spans from -2 to 12 - I'll have to blog about the grading system some time; it's hilarious.)

The project received notable credit for the explorative style of development, and this is something that I'm very proud to keep practicing. I'm publishing the report for free download here, but I'm afraid it's in Danish, so if you don't understand it you'll have to ... learn Danish ;-)

I would appreciate any kind of feedback on my research, and I don't mind criticism now that I have the acknowledgment of Copenhagen Business School, heh.


Fluent interfaces in MetaModel

Just spent a couple of hours making the Query class of MetaModel implement a fluent interface. Damn, this syntax looks great:

new Query().select(myColumn)
    .selectCount()
    .from(myTable)
    .where(myColumn, OperatorType.EQUALS_TO, "foobar")
    .groupBy(anotherColumn)
    .orderBy(myColumn);

This was one of the last TODOs for MetaModel in this round, so I think we're going to release version 1.0 pretty soon! Stay tuned.
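As an aside, the pattern itself is simple: each builder method mutates the query and returns `this`, which is what enables the chaining. Here's a minimal, self-contained sketch of the idea (not MetaModel's actual implementation - the class and method names here are purely illustrative):

```java
// Minimal fluent-interface sketch: every builder method appends to the
// query and returns `this`, so calls can be chained in a single expression.
public class FluentQueryDemo {

    static class Query {
        private final StringBuilder sql = new StringBuilder();

        Query select(String column) {
            sql.append("SELECT ").append(column);
            return this; // returning `this` is what enables chaining
        }

        Query from(String table) {
            sql.append(" FROM ").append(table);
            return this;
        }

        Query where(String column, String operator, String value) {
            sql.append(" WHERE ").append(column).append(' ')
               .append(operator).append(" '").append(value).append('\'');
            return this;
        }

        @Override
        public String toString() {
            return sql.toString();
        }
    }

    public static void main(String[] args) {
        // One chained expression instead of repeated statements on a variable
        String sql = new Query()
                .select("name")
                .from("persons")
                .where("name", "=", "foobar")
                .toString();
        System.out.println(sql);
    }
}
```

The nice property is that the chain reads almost like the SQL it produces, and there's no temporary variable juggling between the calls.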


DataCleaner looks...

Been working for a couple of days on a great new look for DataCleaner - check it out:

I'm starting to get really excited about releasing version 1.2 - I think it'll be a radical improvement in terms of both visual experience and functionality!


How to process millions of resultset rows in java

I'm so excited, since I think we've just solved a very common problem in Java applications that have to deal with huge amounts of data. Here's the trouble:

  1. Even though the JDBC spec defines a way to specify the fetch size when executing queries, some drivers do not implement this feature, which means your program will run out of memory if you query e.g. a couple of million records.
  2. Even if your driver works as it is supposed to (a reasonable assumption in most cases), there's still no effective way to optimize the computation of the many records through multithreading, since the data is streamed through a single connection.
Because of the power of the MetaModel schema and query model, we've been able to create a generic mechanism for splitting up a query into other queries that each yield fewer rows but the same collective result. The way we do this is by identifying attributes that can be used to filter in WHERE clauses. For example:
  • Consider we want to split up the query: "SELECT name, email FROM persons"
  • We will investigate the persons table and find columns that can be used to split the total resultset. We might find a reasonable age column for this, so the query could be split into:
  1. SELECT name, email FROM persons WHERE age < 30 OR age IS NULL
  2. SELECT name, email FROM persons WHERE age > 30 OR age = 30
Depending on the desired size of the partial queries, we will split further by finding additional columns or by defining finer intervals to split by. Here's how it works:
DataContext dc = ... // obtain a DataContext for your datastore
Query q = ... // the (potentially huge) query to split

QuerySplitter qs = new QuerySplitter(dc, q);
List<Query> queries = qs.splitQueries();
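To illustrate the splitting idea in isolation, here's a sketch of how interval-based partial queries over a numeric column could be generated. This is just a conceptual demo, not MetaModel's actual algorithm - the class and method names are made up:

```java
// Conceptual sketch: partition a query into interval-based partial queries
// over a numeric column, so each partial query yields fewer rows but the
// queries collectively cover the whole resultset.
import java.util.ArrayList;
import java.util.List;

public class QuerySplitDemo {

    static List<String> split(String baseQuery, String column,
                              int min, int max, int parts) {
        List<String> queries = new ArrayList<>();
        int step = (max - min) / parts;
        for (int i = 0; i < parts; i++) {
            int lo = min + i * step;
            // The last partition is open-ended and also absorbs NULL values,
            // so no row is lost between the partial queries.
            String where = (i == parts - 1)
                    ? column + " >= " + lo + " OR " + column + " IS NULL"
                    : column + " >= " + lo + " AND " + column + " < " + (lo + step);
            queries.add(baseQuery + " WHERE " + where);
        }
        return queries;
    }

    public static void main(String[] args) {
        for (String q : split("SELECT name, email FROM persons", "age", 0, 90, 3)) {
            System.out.println(q);
        }
    }
}
```

Run against the example above, this prints three partial queries whose WHERE clauses partition the age range, mirroring the two-way split shown earlier.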
I'd love to know what you all think of this. Personally, I think it's a lovely way to optimize memory consumption, and it offers new ways to utilize grid computing by distributing partial queries to different nodes in the grid for remote processing. Also, a lot of databases (MySQL for example) only dedicate a single thread per query - so by splitting the queries one could further optimize multithreading on the database side.
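The multithreading idea can be sketched like this. The partial "queries" and the row counting below are stand-ins, not MetaModel's real API - each partial query is simply submitted to its own worker thread, and the partial results are combined at the end:

```java
// Sketch: process partial queries concurrently and combine their results.
// The query strings and processQuery() are placeholders for real execution
// against a database through something like MetaModel's DataContext.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelPartialQueries {

    public static void main(String[] args) throws Exception {
        // Stand-ins for the partial queries produced by a query splitter
        List<String> partialQueries = List.of(
                "SELECT name, email FROM persons WHERE age < 30 OR age IS NULL",
                "SELECT name, email FROM persons WHERE age > 30 OR age = 30");

        ExecutorService pool = Executors.newFixedThreadPool(partialQueries.size());
        List<Future<Integer>> results = new ArrayList<>();
        for (String query : partialQueries) {
            // Each partial query is executed and processed on its own thread
            results.add(pool.submit(() -> processQuery(query)));
        }

        int totalRows = 0;
        for (Future<Integer> f : results) {
            totalRows += f.get(); // combine the partial results
        }
        pool.shutdown();
        System.out.println("total rows: " + totalRows);
    }

    // Placeholder: pretend each partial query yields a fixed number of rows
    static int processQuery(String query) {
        return 10;
    }
}
```

Since each partial query runs on its own connection and thread, this works around both problems above: no single resultset has to fit in memory, and the database gets to parallelize across queries.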