20091130

Setting up Mondrian for JNDI DataSources, XML/A and custom CSS styles

The other day I decided that I wanted to set up Mondrian as an ad hoc analysis package for Lund&Bendsen's intranet application, "Yacs". I didn't want to install a large application like Pentaho for just this specific need - rather I wanted to deploy just a simple cube schema, reuse the Java EE datasource definition that the intranet application was already using, and apply some basic styling to comply with the corporate profile. These steps turned out to be a lot more complex than I first imagined, primarily because I think the examples in the standalone Mondrian distribution are overly complex and poorly designed, encapsulation-wise. Here is a list of steps I recommend for setting up Mondrian "the right way":

  1. Deploy your Java EE datasource in a container/database-specific way. If you use JBoss and MySQL like I do, here's an example datasource descriptor; place it in the deploy folder with a filename like "mydatasource.xml" (replace the connection details with your specific configuration):
    <datasources>
       <local-tx-datasource>
         <jndi-name>MyDataSource</jndi-name>
         <connection-url>jdbc:mysql://localhost/mydatabase</connection-url>
         <driver-class>com.mysql.jdbc.Driver</driver-class>
         <user-name>username</user-name>
         <password>password</password>
         <min-pool-size>5</min-pool-size>
         <max-pool-size>20</max-pool-size>
         <valid-connection-checker-class-name>
           org.jboss.resource.adapter.jdbc.vendor.MySQLValidConnectionChecker
         </valid-connection-checker-class-name>
         <metadata>
           <type-mapping>mySQL</type-mapping>
         </metadata>
       </local-tx-datasource>
    </datasources>
  2. Unzip the mondrian.war archive so you can edit the application.
  3. Add a container-specific mapping of the DataSource for this application. In JBoss this is done by placing a file called "jboss-web.xml" in the WEB-INF folder with this content:
    <jboss-web>
       <resource-ref>
         <res-ref-name>MyDataSource</res-ref-name>
         <res-type>javax.sql.DataSource</res-type>
         <jndi-name>java:/MyDataSource</jndi-name>
       </resource-ref>
    </jboss-web>
  4. Now edit the WEB-INF/web.xml file and add the following entry inside the <web-app> element (a short lookup sketch after this list shows how this name is resolved at runtime):
    <resource-ref>
       <res-ref-name>MyDataSource</res-ref-name>
       <res-type>javax.sql.DataSource</res-type>
       <res-auth>Container</res-auth>
    </resource-ref>
  5. Also change the mapping of the JPivot filter so it goes like this:
    <filter-mapping>
       <filter-name>JPivotController</filter-name>
       <url-pattern>/*</url-pattern>
    </filter-mapping>
  6. Create a schema file and save it under WEB-INF/mycatalog.xml. I won't give instructions on writing schemas - Mondrian's documentation covers this quite well.
  7. If you want to enable XML/A support, use this as a template for your WEB-INF/datasources.xml file (notice that we use the application-local JNDI string here, including the java:comp/env/ prefix):
    <DataSources>
       <DataSource>
         <DataSourceName>Provider=Mondrian;DataSource=MyDataSource;</DataSourceName>
         <DataSourceDescription>My example datasource</DataSourceDescription>
         <URL>http://localhost:8888/mondrian/xmla</URL>
         <DataSourceInfo>Provider=mondrian;DataSource=java:comp/env/MyDataSource;
         </DataSourceInfo>
         <ProviderName>Mondrian</ProviderName>
         <ProviderType>MDP</ProviderType>
         <AuthenticationMode>Unauthenticated</AuthenticationMode>
         <Catalogs>
           <Catalog name="MyCatalog">
             <Definition>/WEB-INF/mycatalog.xml</Definition>
           </Catalog>
         </Catalogs>
       </DataSource>
    </DataSources>
  8. The views on the cube must now be created as individual JSP pages. One of the things that is really lacking in the Mondrian bundle is reasonable JSP pages with less complexity and sensible reuse of datasources. Here's how I build mine (you can more or less put this into the testpage.jsp page, and then you're not dependent on all the JSP include stuff). Notice that the datasource reference entered here is just the local name, i.e. the res-ref-name from web.xml:
    <% if (session.getAttribute("query01") == null) { %>
      <jp:mondrianQuery id="query01" dataSource="MyDataSource" catalogUri="/WEB-INF/mycatalog.xml">
       <!-- Initial MDX query goes here -->
      </jp:mondrianQuery>
    <% } %>
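For reference, here is roughly what the JNDI wiring from steps 1, 3 and 4 amounts to at runtime: the application looks the DataSource up under the local name from web.xml (java:comp/env/MyDataSource), and jboss-web.xml maps that name to the container-wide java:/MyDataSource binding. A minimal, illustrative Java sketch (not part of the Mondrian setup itself, just to show what the mapping resolves to):

import java.sql.Connection;
import javax.naming.InitialContext;
import javax.sql.DataSource;

public class DataSourceLookupExample {

    public Connection openConnection() throws Exception {
        // "java:comp/env/MyDataSource" is the application-local name declared in web.xml;
        // jboss-web.xml maps it to the container-wide "java:/MyDataSource" binding.
        InitialContext ctx = new InitialContext();
        DataSource ds = (DataSource) ctx.lookup("java:comp/env/MyDataSource");
        return ds.getConnection(); // pooled connection managed by the container
    }
}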
Hooray! Now you've got connection pooling, sharing and all the other cool stuff that Java EE DataSources provide. Next step: add styling. First, remove all the stylesheets that come with Mondrian. You don't need them because it's actually quite a lot easier to add your own than to try and modify the existing ones. Here's the result of 10 minutes of styling:



The important CSS IDs and classes are:

  • To control the styling of the pivot table and its cells, it's important that you add a <div> around this tag in the JSP:
    <wcf:render ref="query01" xslUri="/WEB-INF/jpivot/table/mdxtable.xsl" xslCache="true" />
    You can then use your div's id-attribute to target cells, headers etc. in your stylesheet.

  • The class .heading-heading: Used for headings of headings, i.e. the top-level blue cells in the screenshot above.

  • The classes .column-heading-span, .column-heading-even, .column-heading-odd: Used for column headers, i.e. the gray cells above the pivot table content.

  • The classes .row-heading-span, .row-heading-even, .row-heading-odd: Used for row headers, i.e. the gray cells to the left of the pivot table content.

  • The classes .cell-even and .cell-odd: Used for the cells on even and odd rows.

I encourage the Mondrian crew to clean up the reference application, but I'm guessing they are using the messy configuration to convince people to switch to a full Pentaho deployment :-)

20091109

JPA and the N+1 select problem

Warning to readers: This blog entry contains references to articles only available in Danish. So if you keep on reading be prepared to weep if you want to follow my suggestions ;-)

Lately I've been working hard on Lund&Bendsen's intranet and the processes around it. I've been using JBoss Seam for the most part, and overall I'm quite thrilled about this choice of web framework. One of the cool parts about Seam is the way it integrates with the Java Persistence API (JPA)/Hibernate and handles my persistence context even when I'm rendering the views for the intranet.

While I have been developing the intranet features in Seam and JPA, my colleague Kenn Sano wrote an excellent article about the N+1 Select Problem in JPA. Here's what it all comes down to (translated from Danish):

"Man kan [med JPA] "vandre rundt" i en objektgraf og på magisk vis hentes data, som stilles til rådighed i takt med, at vi traverserer - dvs. objekters tilstand indlæses fra databasen alt imens vi bevæger os rundt i objektgrafen. Hvis man ikke er opmærksom på, hvordan JPA fungerer, kan det resultere i mange SQL-kald mod databasen, hvilket kan have stor negativ indvirkning på performance."
I was doing this plentifully in Seam. When presenting a list of courses or a list of students, Seam lets me easily traverse the items in the list using EL expressions such as "#{course.location.address}", which incurred several N+1 performance penalties. For instance, on the list of all planned courses, including their locations, enrolled students etc., I observed a full N*4+1 query penalty. So there's no doubt you need to be aware of the impact of your querying strategy.
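The usual cure is to fetch the associations you know the view will need in the query itself, so one query replaces the N+1. A minimal sketch using a JPQL fetch join - the entity and association names (Course, location, participants) are made up for illustration and not from the intranet's actual model:

import java.util.List;
import javax.persistence.EntityManager;

public class CourseQueries {

    // One query instead of 1 + N: the location and the participants collection are
    // fetched together with the courses, so later traversal in the view (e.g.
    // #{course.location.address}) triggers no additional SQL.
    @SuppressWarnings("unchecked")
    public List<Course> findPlannedCoursesWithDetails(EntityManager em) {
        return em.createQuery(
                "select distinct c from Course c "
                + "join fetch c.location "
                + "left join fetch c.participants")
            .getResultList();
    }
}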

Note: I'm not blaming JBoss Seam for this behavior ... Seam makes everything a whole lot easier and when everything is easy you just tend to forget to think yourself ;-) Anyways - go read the article if you're interested in JPA and understand Danish.

20090908

New book on Open Source Business Intelligence tells the DataCleaner-story

About half a year ago we received an exciting inquiry from Jos van Dongen on behalf of him and his co-author Roland Bouman, telling us that they were writing a new book about Open Source Business Intelligence and in particular Pentaho-based solutions. And for this they were looking into DataCleaner for the data profiling section of the book!

The book is now out! It's called "Pentaho Solutions" and it's published by Wiley Publishing. You can read about it and buy it on their website as well.

The book contains a walkthrough for building a data warehouse using Open Source tools, applying DataCleaner along the way for the important job of profiling and validation.

We congratulate Roland Bouman and Jos van Dongen for their great work to promote Open Source Business Intelligence and thank them for mentioning DataCleaner while they're at it!

20090714

Open Source Data Quality with DataCleaner 1.5.2

Today I've announced the release of DataCleaner 1.5.2, yay! I'm pretty excited about this release as I think this is probably the biggest of the minor releases to date. In particular I hope that our new "single jar file" distribution option will attract new users. Go read the announcement for more details now :)

20090628

Introducing AnalyzerBeans

It's been some time now since I first designed the core APIs of the DataCleaner project, and as time goes on, some of my initial assumptions about the design of profilers, validation rules and so on have proven to be less than optimal with regard to flexibility and scalability for the application. This is why yesterday I decided to make a major change in the roadmap for the project:

  • The idea of the "webmonitor" application (DataCleaner 2.0) has been cancelled for now. If anyone wants to realize this idea it's still something that I am very much interested in, but as you will see I have found that other priorities are more important.
  • A new project has been founded - for now as a "sandbox" project: AnalyzerBeans. AnalyzerBeans is a rethought architecture for datastore profiling, validation etc. - in one word: "analysis". When this project is stable and mature we will probably be ready for something I like to think of as a new DataCleaner 2.0.
So why rethink datastore analysis? Because the "old way" has proven to be very cumbersome for some tasks that I did not initially realise would be important. The current DataCleaner design assumes that all profiles, validation rules etc. do serial processing of rows. This is not always the best way to do processing, although it simplifies optimization of the execution mechanism, because all components execute in the same way and can thus share result sets etc. In AnalyzerBeans we want the best of both worlds: flexibility to do all sorts of weird processing, and rigidity for the many profilers which actually do process rows serially.

The solution is a new annotation-based component model. Each profiler, validation rule etc. will not have to implement certain interfaces, because we can now mix and match annotations for the specific type of analysis component - each "AnalyzerBean". There are a lot more interesting features available when we introduce an annotation-based model, but let me first give you a simple example of what a regular row-processing DataCleaner-style profile would look like:
@AnalyzerBean(name="Row counter", execution=ExecutionType.ROW_PROCESSING)
public class MySerialCounter {

    @Configured("Table to count")
    private Table table;
    private long count = 0L;

    @Run
    public void run(Row row, long count) {
        this.count += count;
    }
}
Now this is not so impressive. I've just replaced the IProfile interface of DataCleaner's APIs with some annotations. But notice how I've gotten rid of the ProfileDescriptor class which was used to hold metadata about the profiler. Instead the annotations represent the class metadata. This is actually exactly what annotations are for :-) Also notice that I've got a type-safe configuration property thanks to the @Configured annotation. This means that I don't have to parse a string, ask for a Table of the corresponding name etc. And the UI will become a LOT easier to develop because of type-safe facilities like this.
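To make the idea concrete, here is a rough sketch of the general mechanism behind such a component model - not AnalyzerBeans' actual implementation, and it assumes a Configured annotation with a single String value() element like the one used above: the framework can discover @Configured fields with plain reflection and only assign values of a matching type, which is what enables both type-safe configuration and auto-generated UIs.

import java.lang.reflect.Field;
import java.util.Map;

public class ConfiguredInjector {

    // Assigns configuration values to all @Configured fields of a component instance.
    // The values map is keyed by the annotation's name, e.g. "Table to count".
    public static void inject(Object component, Map<String, Object> values)
            throws IllegalAccessException {
        for (Field field : component.getClass().getDeclaredFields()) {
            Configured configured = field.getAnnotation(Configured.class);
            if (configured == null) {
                continue;
            }
            Object value = values.get(configured.value());
            if (value != null && field.getType().isInstance(value)) {
                field.setAccessible(true);
                field.set(component, value); // only assigned when the types match
            }
        }
    }
}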

But an even more exciting way to use the new API is when creating a whole new type of profiler, an exploring AnalyzerBean:
@AnalyzerBean(name="Row counter", execution=ExecutionType.EXPLORING)
public class MySerialCounter {

    @Configured("Table to count")
    private Table table;
    private Number count;

    @Run
    public void run(DataContext dc) {
        DataSet ds = dc.executeQuery(new Query().selectCount().from(table));
        ds.next();
        this.count = (Number) ds.getRow().getValue(0);
        ds.close();
    }
}
Now this is something totally new: a component that can gain total control of the DataContext and create its own query based on some @Configured parameters. I imagine that this programming model will give us complete flexibility to do exciting new things that were impossible in the DataCleaner framework: join testing, non-serial Value Distribution etc.

There are a few other annotations available to the AnalyzerBean-developers but I will take a look at them in a more in-depth blog-entry later. For now - let me know if you like the ideas and if you have any comments. Anyone who would like to help out in the development of the AnalyzerBeans project should visit our wiki page on the subject.

Update (2010-09-12)

A lot has happened to AnalyzerBeans since this blog entry. Here's a list of blog entries (in chronological order) that will help interested readers dive deeper into the development of AnalyzerBeans:

20090613

Performance benchmark: DataCleaner thrives on lower column counts

Today I've conducted an experiment. After fixing a bug related to CSV-file-reading in DataCleaner, I was wondering how performance was impacted by different kinds of CSV file compositions. The reason that I suspected that this could impact performance is that CSV files with many columns will require a somewhat larger chunk of memory in order to keep a single row in memory compared to CSV files with fewer columns. In the older versions of DataCleaner we discovered that using 200 or more columns would actually make the application run out of memory! Fortunately, this bug is fixed, but there is still a significant performance penalty, as this blog post will hopefully show.

I auto-generated three files for the benchmark: "huge.csv" with 2.000 columns and 16.000 rows, "long.csv" with 250 columns and 128.000 rows, and "slim.csv" with only 10 columns and a roaring 3.200.000 rows. Each file thus contains 32.000.000 cells to be profiled (for example 2.000 × 16.000 = 32.000.000). I set up a profiler job with the profiles Standard measures and String analysis on all columns.

Here are the (surprising?) results:

filename    rows       columns   start time   end time   total time
huge.csv    16000      2000      18:54:48     19:40:28   45:40
long.csv    128000     250       19:44:53     19:52:31   7:38
slim.csv    3200000    10        19:53:46     19:55:03   1:17


So the bottom line is: lowering the number of columns has a very significant, positive impact on performance. Having a lot of columns means that you will need to hold a lot more data in memory, and needless to say you will have to replace this large chunk of memory a lot of times during the execution of a large profiler job. Going all the way from 45 minutes to 1½ minutes is quite an improvement - so don't pre-join tables or anything like that before you run them through your profiler.
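A back-of-the-envelope illustration of why row width matters - the per-cell byte count below is an assumed average, not a measured figure, so only the relative difference between the files is meaningful:

public class RowFootprint {

    public static void main(String[] args) {
        // Assumption: each cell is buffered as a short String costing roughly
        // 60 bytes including object overhead. The absolute number is a guess.
        long assumedBytesPerCell = 60;
        print("huge.csv", 2000, assumedBytesPerCell); // ~120 KB held per row
        print("long.csv", 250, assumedBytesPerCell);  // ~15 KB held per row
        print("slim.csv", 10, assumedBytesPerCell);   // ~600 bytes held per row
    }

    private static void print(String file, int columns, long bytesPerCell) {
        System.out.println(file + ": ~" + (columns * bytesPerCell) + " bytes per row in memory");
    }
}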

20090605

eobjects.org @ JavaOne

I am currently hanging out at the lovely JavaOne conference in San Francisco, checking out cool new Java technology and meeting interesting people. Of course, since I'm here as a representative of my employer, I also do some blogging (in Danish).

Yesterday I saw an interesting session about JFreeChart and surviving as an Open Source professional. Dave Gilbert told us about how he has managed to live off his hobby as a JFreeChart developer, about cool new features of the excellent charting API, and about the struggles of making money on Open Source. Very fascinating stuff, and I hope that everybody in the chart-consuming business will give it a try. I was pleasantly surprised to see the new interactive chart functionality that has been put into the API - I'm wondering how that hadn't caught my attention before now! It gave rise to a couple of ideas (or rather: sparked my motivation) for me to try and implement charting in DataCleaner.

20090429

Seam, EJB's and EAR-packaging in Maven

Lately I've been designing a new course on the splendid JBoss Seam (2.1) web and Java EE framework. One thing that strikes me, being an enterprise Java developer, is that almost no good examples of setting up Seam using Maven exist. I realize that Seam's appeal comes from a tradition of wanting things to be nice and easy, so the seam-gen tool has absolute merits for creating your project, but to me the downside of this approach is that you're bound to use Ant as a build tool, and for a lot of reasons I won't go into here, I strongly prefer using Maven.

The examples of using Maven to build Seam applications that I have been able to track down have been limited in a variety of ways:

  • Most of them required you to use a specific Maven archetype which altered the default project layout.
  • Almost all of them were out of date and were bound to a very "custom" Maven repository, which I don't think is suitable for enterprise application infrastructure.
  • Those that didn't fall into the former two categories were restricted to web applications (WAR files) only and thus didn't support using EJBs in the Seam applications.
So this blog post is going to be a walk-through of how I managed (after quite some effort) to configure Maven to build an EAR file containing EJBs and a web application that is able to utilize these EJBs as Seam components. I'm only highlighting the most interesting parts - you can download the complete example for free - and in contrast to the examples I have seen, it's very bare-boned and won't take more than a few seconds to alter to your needs.

Project structure
OK - to meet the demand of packaging an EAR file I've created a Maven project consisting of three modules. It's possible that this part can be optimized a bit, since I use a separate module for doing all the packaging (which to my understanding is the way to do it). The modules are:
  • ejbs - containing the EJBs and possibly other "backend" stuff like JPA classes and so on.
  • web - containing the web application
  • packaging - for packaging the EAR file containing the two other modules (and Seam)
Which module depends on what?
One of the big issues that I faced was figuring out how to configure Maven's dependency management framework to generate a correct set of EAR, JAR and WAR files. If this is done wrong it won't work and you'll waste hours (or at least I did) trying to figure out what is wrong.

All provided dependencies go into the parent project. These are all the dependencies provided by your container (plus Seam itself which will be packaged within the EAR file). You will need to add JBoss's Maven repository to resolve these dependencies. Here are the important parts of my parent project pom:
<repositories>
   <repository>
     <id>repository.jboss.org</id>
     <name>JBoss Repository</name>
     <url>http://repository.jboss.org/maven2</url>
   </repository>
</repositories>
<modules>
   <module>ejbs</module>
   <module>web</module>
   <module>packaging</module>
</modules>
<dependencies>
   <dependency>
     <groupId>org.jboss.seam</groupId>
     <artifactId>jboss-seam</artifactId>
     <version>2.1.1.GA</version>
     <scope>provided</scope>
   </dependency>
   <dependency>
     <groupId>javax.faces</groupId>
     <artifactId>jsf-api</artifactId>
     <version>1.2_02</version>
     <scope>provided</scope>
   </dependency>
   <dependency>
     <groupId>org.hibernate</groupId>
     <artifactId>hibernate-entitymanager</artifactId>
     <version>3.4.0.GA</version>
     <scope>provided</scope>
   </dependency>
   <dependency>
     <groupId>org.hibernate</groupId>
     <artifactId>hibernate-validator</artifactId>
     <version>3.1.0.GA</version>
     <scope>provided</scope>
   </dependency>
   <dependency>
     <groupId>javax.servlet</groupId>
     <artifactId>servlet-api</artifactId>
     <version>2.5</version>
     <scope>provided</scope>
   </dependency>
   <dependency>
     <groupId>javax.servlet.jsp</groupId>
     <artifactId>jsp-api</artifactId>
     <version>2.1</version>
     <scope>provided</scope>
   </dependency>
   <dependency>
     <groupId>javax.ejb</groupId>
     <artifactId>ejb-api</artifactId>
     <version>3.0</version>
   </dependency>
</dependencies>
In the ejbs module you will not need any dependencies!

In the web module you will need to add seam-ui (excluding the "core" seam, because it is provided within the EAR file), facelets and other web-dependencies such as RichFaces, seam-pdf or whatever. Here are the dependencies of my web module pom:
<dependencies>
   <dependency>
     <groupId>com.sun.facelets</groupId>
     <artifactId>jsf-facelets</artifactId>
     <version>1.1.11</version>
   </dependency>
   <dependency>
     <groupId>org.jboss.seam</groupId>
     <artifactId>jboss-seam-ui</artifactId>
     <version>2.1.1.GA</version>
     <exclusions>
       <exclusion>
         <groupId>org.jboss.seam</groupId>
         <artifactId>jboss-seam</artifactId>
       </exclusion>
     </exclusions>
   </dependency>
</dependencies>
If you want to write Seam components that pertain only to the web module, then you will probably also need to include the ejbs module as a dependency in the web module pom. But remember to set the scope as provided, because it's all within the same EAR.

Last but not least you will need to write the packaging module pom. The Seam documentation has been used as a guide to how this pom is structured: the ejbs module needs to be a registered EJB module. The web module needs to be a registered WAR module. And Seam itself needs to be a registered EJB module as well! JBoss EL needs to be placed in the /lib directory of the EAR, and you need to make sure to exclude the EL API dependency from several artifacts - otherwise you'll get weird classpath issues when deploying. The important parts of the packaging module pom look like this (the module declarations and exclusions are the parts to pay attention to):
<build>
   <plugins>
     <plugin>
       <artifactId>maven-ear-plugin</artifactId>
       <configuration>
         <modules>
           <jarModule>
             <groupId>org.jboss.el</groupId>
             <artifactId>jboss-el</artifactId>
             <includeInApplicationXml>false</includeInApplicationXml>
             <bundleDir>lib</bundleDir>
           </jarModule>
           <webModule>
             <!-- add web module groupId and artifactId here -->
             <contextRoot>/seam-ejb-ex</contextRoot>
           </webModule>
         </modules>
       </configuration>
     </plugin>
   </plugins>
</build>
<dependencies>
   <dependency>
     <!-- add ejbs module groupId, artifactId and version here -->
     <type>ejb</type>
   </dependency>
   <dependency>
     <!-- add web module groupId, artifactId and version here -->
     <type>war</type>
   </dependency>
   <dependency>
     <groupId>org.jboss.seam</groupId>
     <artifactId>jboss-seam</artifactId>
     <version>2.1.1.GA</version>
     <type>ejb</type>
     <exclusions>
       <exclusion>
         <groupId>javax.el</groupId>
         <artifactId>el-api</artifactId>
       </exclusion>
     </exclusions>
   </dependency>
   <dependency>
     <groupId>org.jboss.el</groupId>
     <artifactId>jboss-el</artifactId>
     <version>1.0_02.CR2</version>
     <type>jar</type>
     <exclusions>
       <exclusion>
         <groupId>javax.el</groupId>
         <artifactId>el-api</artifactId>
       </exclusion>
     </exclusions>
   </dependency>
</dependencies>
So that's all the Maven configuration. Now for some "gotchas" with regard to Seam configuration. A lot of these things are "disguised" by seam-gen, so consider this a list of things you might have forgotten if you're building the project from the bottom up:
  • Remember to put empty seam.properties files into the resource folders of both the ejbs and web modules.
  • Remember to register the Seam interceptors (as described in the Seam documentation) in the ejb-jar.xml file, located in the resources/META-INF folder of the ejbs module.
  • The components.xml file should be located in the web module's WEB-INF folder. But you can't just copy the file from a seam-gen project, because seam-gen dynamically replaces some "magic strings" in it to ensure container compliance. Instead you will have to add this static entry to the components.xml file yourself:
    <core:init pattern="seam-ejb-ex/#{ejbName}/local"></core:init>
    Note the seam-ejb-ex part of that static JNDI pattern string. You will have to replace this part of the string with the context root entry from the packaging module pom! (A small bean example illustrating how this pattern is resolved follows after this list.)
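To make the #{ejbName} part of the JNDI pattern more concrete, here is a hedged sketch of what a Seam-enabled session bean in the ejbs module might look like. The names (Authenticator, AuthenticatorAction, "authenticator") are purely illustrative and not taken from the downloadable example:

import javax.ejb.Local;
import javax.ejb.Stateless;
import org.jboss.seam.annotations.Name;

@Local
interface Authenticator {
    boolean authenticate();
}

// Seam resolves the component #{authenticator} to this bean. With the jndi-pattern
// above, the lookup becomes "seam-ejb-ex/AuthenticatorAction/local"
// (prefix / EJB name / local interface).
@Stateless
@Name("authenticator")
public class AuthenticatorAction implements Authenticator {

    public boolean authenticate() {
        // credential check against your user store would go here
        return true;
    }
}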

I hope this blog entry has cleared up some of the confusion that I met when I tried building Maven-based Seam/EJB projects. Please let me know if it works out for you, and remember that you can download the full example and use it as a reference as you like.

20090420

DataCleaner 1.5.1 released

I'm happy to announce the release of DataCleaner version 1.5.1. This release is a minor release, but it nevertheless contains a few nice features - especially for the users who are enjoying the exporting features that were introduced in 1.5:

  • An additional HTML export format has been added to the built-in export formats (usable when exporting Profiler results in the desktop app and when executing the runjob command-line tool).
  • The export format is now selectable directly in the desktop app.
  • Four new measures were added to the String Analysis profile: avg. chars and max/min/avg white spaces.

The new version of DataCleaner is (as always) downloadable for free on the downloads page, and feedback from users is greatly appreciated.

We hope that you all enjoy DataCleaner 1.5.1.

20090211

Data quality pro interview

Dylan Jones over at Data Quality Pro is working on a feature about DataCleaner, and I'm very thankful for his work already (and pretty excited to see the final result). The feature has just been started with an interview with a very important person ... me! :-) For all of those who take an interest in DataCleaner, the visions of the product and its story, I hope that you will head over there and read the article.

To be continued with more posts on the data quality pro articles.

20090125

Free OLAP cube icon

Okay this is a bit off-topic compared to my normal posts, but here goes.

When I do websites or GUI design I usually look for free/open source icon packages such as Tango or Crystal. I have been looking for a nice-looking icon to represent an OLAP cube, preferably in a style and coloring similar to the Tango icon set. Sorry to say, I didn't find any, so I went ahead and spent some hours creating a new icon of my own. Here's the result:


And a plain version without the sum/count text elements (good if you need to resize it to very small sizes):


I'm giving this away under a beerware license, so if you want it, it's yours.

20090120

DataCleaner 1.5 - a heavy league Data Profiler

Often when I speak to data quality professionals and people from the business intelligence world, I get the notion that most people think of Open Source tools as slightly immature when it comes to heavy processing, large loads, millions-of-rows-kinda-stuff. And this has had some truth to it. I don't want to name names, but I have heard a lot of stories about Open Source data integration / ETL tools that weren't up for the job when you had millions of rows to transform. So I guess this notion has stuck to Open Source data profilers and data quality applications too...

In DataCleaner 1.5 I want this notion demystified and eradicated! Here are some of the things we are working on to make this release a truly enterprise-ready, performance-oriented and scalable application:

  • Multi-threaded, multi-connection, multi-query execution engine
    The execution engine in DataCleaner has been thoroughly refactored to support multithreading, multiple connections and query-splitting to perform load balancing across the threads and connections (a rough sketch of the query-splitting idea follows after this list). This really boosts performance for large jobs and, I think, sets the bar for processing large result sets in Open Source tools.
  • On-disk caching for memory-intensive profiles and validation rules
    Some of the profiles and validation rules are almost inherently memory-intensive. We are doing a lot of work optimizing them as much as we can, but some things are simply not possible to change. As an example, a Value Distribution profile simply HAS to know all distinct values of each column that is being profiled. If it doesn't - then it's not a value distribution profile. So we are implementing various degrees of on-disk caching to make this work without flooding memory. This means that the stability of DataCleaner is improved to a heavy league level.
  • Batch processing and scheduling
    The last (but not least important) feature that I'm going to mention is the new command line interface for DataCleaner. By providing a command line interface for executing DataCleaner jobs you are able to introduce DataCleaner into a grand architecture for data quality, data warehousing, master data management or whatever it is that you are using it for. You can schedule it using any scheduling tool that you like and you can save the results to automate reporting and result analysis.
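As promised above, here is a rough, illustrative sketch of the query-splitting idea. It is not DataCleaner's actual implementation, just the general technique of dividing a table's key range into chunks so that each thread/connection can profile its own slice of the rows:

import java.util.ArrayList;
import java.util.List;

public class QuerySplitter {

    // Splits a table's numeric id range into roughly equal chunks and produces one
    // SELECT per chunk; each query can then run on its own thread and connection,
    // and the partial profiling results are merged afterwards.
    public static List<String> splitByIdRange(String table, long minId, long maxId, int chunks) {
        List<String> queries = new ArrayList<String>();
        long chunkSize = Math.max(1, (maxId - minId + 1) / chunks);
        for (long low = minId; low <= maxId; low += chunkSize) {
            long high = Math.min(low + chunkSize - 1, maxId);
            queries.add("SELECT * FROM " + table + " WHERE id BETWEEN " + low + " AND " + high);
        }
        return queries;
    }
}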

20090107

Using Python and Django to build the new DataCleaner website

I have for a long time been a dedicated Java developer and in many ways still am. But developing the new website for DataCleaner has been quite an eye-opener for the potential of dynamic languages, and Python in particular. There are so many things about that language that I love, and I must say that doing the same thing in Java would have taken at least twice the time! And that's even though I'm not an inexperienced Java developer.

OK, so what's the big difference? Well, deployment is one very crucial difference. J2EE servers are great for stability and system administration, but often I find myself, as a web developer, not needing all those things that much - I just need a server that always runs and will tell me what I am doing wrong. Django (which has been my Python web framework) has been excellent at doing this for me, so I can kick-start my application in a matter of seconds.

Type-safety is another big difference. Java is type-safe; Python (like other dynamic languages) is not. For back-end development I am a big advocate of type-safety, but in front-end development dynamic classes are such a great treat! An example of this is when transferring data from controllers to the view in the Django framework's Model-View-Controller architecture. If you want to present some domain objects that are related in the view, but not in the domain model (or perhaps the domain model has some details to it that you want to skip for understandability), then you just attach a completely new attribute to the domain object! In Java or other type-safe languages such as C#, you would typically have to create a Map for storing the new particular association and then do a lookup in the view to resolve it. This means more "logic" in the view and code that is harder to comprehend.
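To illustrate the Java side of that comparison, here is a small, made-up example of the Map workaround described above - the Project and StatisticsService types are invented for the illustration:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ProjectOverviewController {

    // The view-only association (project -> download count) does not exist in the
    // domain model, so in a type-safe language it has to travel in a separate Map
    // and the view must do an explicit lookup per domain object.
    public Map<Project, Integer> downloadCountsFor(List<Project> projects, StatisticsService stats) {
        Map<Project, Integer> downloadCounts = new HashMap<Project, Integer>();
        for (Project project : projects) {
            downloadCounts.put(project, stats.countDownloads(project));
        }
        return downloadCounts;
    }
}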

All in all I'm very happy to use Django. I would have liked a few more features in their QuerySet API (especially aggregation queries, which should be on the way) but then again - for the typical website it is pretty sufficient and allows fallback to native SQL. Thank you Django.

Note: This is not to say that I am abandoning Java, not at all! I love Java for its stability and superior integration capabilities, but in some cases I simply want something that is fast and more in tune with the user-experience and prototyping process of building websites.