20080514

Drill-to-detail in profiling results

I've just added a new feature to DataCleaner that I'm very happy with. Now when you see a profiling result you can interact with it and drill to details on measures of interest to gain insight into how those measures are composed. Consider the screenshot below (sorry for the ugly window borders, I'm running linux here and we're still trying to tweak the UI to look sweet with the jgoodies look'n'feel).

What you're seeing is that the pattern finder identified a couple of patterns for the ADDRESSLINE2 column and I've just clicked the "aaaaa 999" pattern to find out which value where categorized to fit this pattern. This queries (drills to details) the datastore for these columns. Similarly I could have wanted to find out which values yielded the "max chars", "min chars" or other measures in the profile result matrix above. If I do that it will similarly query the values that originated the measures.

So how is this possible? Thanks to MetaModel it's quite easy to model such data explorations! Even though some of these queries cannot be transformed into SQL it is possible to post-process them using row filters that filter out only the correct rows based on an initial query. Let's look at an example, how to drill to details on the "max words" measure in String analysis:


//For each column result...
MatrixValue mv = matrixValues[7]; // 7 is the index of the "max words" measure

//Construct a query that get's as close as possible to the desired result
Query q = new Query();
q.addSelect(column);
q.addFrom(column.getTable())
q.addGroupBy(column).addSelect(new SelectItem(FunctionType.COUNT,"*",null));

//Set the query as the source of the details
mv.setDetailSource(q);

//Set a post processing row filter to include only those
//rows that have the right number of words
//(something that cannot be expressed in a query)

mv.addDetailRowFilter(new IRowFilter() {
  public boolean accept(Row row) {
      Object value = row.getValue(column);
      if (value != null &&
            new StringTokenizer(value.toString()).countTokens() == numWords) {
            return true;
      }
      return false;
  }
});


The code above is a slight rewrite of the real code for String analysis and it shows how easy it is (and ought to be) to attach a query to a value in a profile result! Drill-to-detail features are now a common thing in DataCleaner and in my oppinion it leverages the use of the profiler functionality to a new level.

No comments: