20101229

YADC2S

Yet Another DataCleaner 2.0 Screenshot :) This is basically just an addition to my previous post about richer reporting and charts in DataCleaner 2.0.


What you see is our new Date Gap Analyzer, which can be used to plot a timeline based on FROM and TO dates in a dataset. The analyzer will display gaps in the timeline and overlaps (periods where more than one record exists). This should be pretty useful for finding errors in datasets that contain continuous activities.

The chart is zoomable and scrollable, so it can display quite a lot of data without harming the visual appearance.

7 comments:

Asbjørn Leeth said...

Hi Kasper
Is it possible to get a simple normal distribution graph instead? For me that would be much easier to read.
/Asbjørn

Kasper Sørensen said...

Hmm, for the value distribution I guess that would make a lot of sense actually... But for the Date Gap Analysis a normal distribution would not be a good fit, since the point of interest is a bit different - it's mostly used to find gaps in a continuity that is expected.

Asbjørn Leeth said...

I have actually just made such an analysis today using SQL and Excel. I didn't care about the exact gaps, but I wanted to know the normal distribution of any gaps.
Of course I could make a view with that attribute and then use value distribution, which by the way only gives me a pie, not a normal distribution ;), but it would be nice if I could do it all in the tool.

Kasper Sørensen said...

Hmm I think we're maybe talking about different things. The gaps that the Date Gap analyzer will find are not the periods between from and to dates. Rather it lays out a complete timeline based on all from and to dates in a dataset and identifies the gaps that are not covered by some activity. For example in a PM registration system, I might register the following periods:

from | to
03-01-2011 | 05-01-2011
07-01-2011 | 10-01-2011

That would create a timeline spanning from 03-01-2011 to 10-01-2011. And there would be a single gap in it, i.e. from 05-01-2011 to 07-01-2011.
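To make the timeline idea concrete, here is a minimal Python sketch of how such gaps and overlaps could be found. It is not DataCleaner's actual implementation: the function name `find_gaps_and_overlaps` is hypothetical, and the dates above are assumed to be in dd-mm-yyyy format.

```python
from datetime import date


def find_gaps_and_overlaps(periods):
    """Hypothetical helper: lay (from, to) periods on one timeline.

    Returns (gaps, overlaps), each a list of (start, end) tuples.
    """
    gaps, overlaps = [], []
    ordered = sorted(periods)  # sort by from-date, then to-date
    if not ordered:
        return gaps, overlaps

    covered_until = ordered[0][1]  # end of the timeline covered so far
    for start, end in ordered[1:]:
        if start > covered_until:
            # No record covers the span between the two dates.
            gaps.append((covered_until, start))
        elif start < covered_until:
            # This record begins before the covered timeline ends.
            overlaps.append((start, min(end, covered_until)))
        covered_until = max(covered_until, end)
    return gaps, overlaps


# The two periods from the example, assuming dd-mm-yyyy dates.
periods = [
    (date(2011, 1, 3), date(2011, 1, 5)),
    (date(2011, 1, 7), date(2011, 1, 10)),
]
gaps, overlaps = find_gaps_and_overlaps(periods)
print(gaps)
print(overlaps)
```

For the two periods above this reports one gap, from 05-01-2011 to 07-01-2011, and no overlaps. Periods that merely touch (one ends on the date the next begins) are treated as neither a gap nor an overlap in this sketch.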

In this case I don't think a normal distribution would add much value, would it? I mean, any gap could likely be an inconsistency or DQ issue so getting any distribution of the gaps would not make sense (at least in the scope that I'm thinking, but please tell me otherwise if I'm wrong).

On the other hand, if you have a different understanding of the word "gap" you might consider each period to be a gap (which I think is probably what is causing confusion in this discussion). So maybe the wording "date gap" is ambiguous or not explained very well. At least I DO see the relevance of creating a normal distribution of the length between from and to dates! Something to start with here would be to apply the "Date to age transformer" and then do a Number analysis. The number analysis does not yet show a normal distribution, but it contains both the mean and the std. deviation, so you could probably produce one yourself from those two measures.
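As a rough illustration of producing a normal distribution from those two measures, the Python sketch below computes the length of each period in days and fits a normal curve from the mean and standard deviation. It runs outside DataCleaner and uses only the Python standard library; the helper name `period_lengths_in_days` is hypothetical, and the dates are again assumed to be dd-mm-yyyy.

```python
from datetime import date
from statistics import NormalDist, mean, stdev


def period_lengths_in_days(periods):
    """Hypothetical helper: days between each from-date and to-date."""
    return [(end - start).days for start, end in periods]


# The two periods from the example above, assuming dd-mm-yyyy dates.
periods = [
    (date(2011, 1, 3), date(2011, 1, 5)),
    (date(2011, 1, 7), date(2011, 1, 10)),
]

lengths = period_lengths_in_days(periods)

# Mean and (sample) standard deviation are all a normal curve needs.
fitted = NormalDist(mu=mean(lengths), sigma=stdev(lengths))

print(lengths)
print(fitted.mean, fitted.stdev)
print(fitted.pdf(fitted.mean))  # height of the curve at its peak
```

With only two records the fitted curve says very little, of course; the point is only that the mean and standard deviation are enough to draw it.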

Kasper Sørensen said...

Hmm was just looking further into this. I see that actually I remembered wrong regarding that last workaround idea with the "Date to age" transformer. In order to get the difference between dates it's actually a bit more complicated (but you can do it with a JavaScript transformer) so I think I will create a separate transformer for just this.

Kasper Sørensen said...

There's now a Date diff / Period length transformer available. I think that will fit your purpose?

Asbjørn Leeth said...

Very nice. :)
But I'm still thinking a normal distribution profile would be a good feature request. This profile would take a single column, draw a nice little distribution, calculate the mode and median, and perhaps even a standard deviation.