20111030

Standardize the date formats in your data

One of the things that I see sometimes is that web forms cause unstandardized data in your database. For example, text fields in web forms do not have a native way to specify the type of the data. So what if you have a field that is supposed to be a date? For example the birthdate of your web users? A lot of web applications are not performing real validations of the format and content of the data entered into such fields. I think this typically occurs because it was not thought of as important at the time of designing the initial web page. But maybe it will become important at a point in time if eg. you want to analyze the age groups of your users! The trouble is that later on in the applications lifecycle, a state of unchangeability enters because you're stuck with a bunch of unstandardized data that you cannot conform to a new standardized data format. This is because you will have a lot of different date formats represented. For example:

  • 2011-10-30
  • 20111030
  • 30th of October, 2011
  • 30/10/11
And maybe some even more exotic ones...
In this blog entry I will show you how to solve that migration issue with the use of DataCleaner.

1. Date mask matching
The first thing we should do is to analyze which date patterns are present in the data. To do this you need to combine two components: The Date mask matcher and the Boolean analyzer. Here are the steps involved.
  1. First set up you datastore in the welcome screen of DataCleaner.
  2. Click the "Analyze!" button to begin composing your job.
  3. In the tree to the left, select the columns of interest - in our example at least the birthdate column.
  4. Click "Add transformer -> Matching and standardization -> Date mask matcher".
Your screen will now look something like this:
In the middle of the screen you see a list of date masks. Each of these produce a boolean output column (seen below). The idea of the Date mask matcher is that it creates the boolean columns so that you can even assert if a particular date is parseable by using several date masks. That's because a single date string like "080910" can be understood in many ways!

2. Analyzing matches
Moving on, we want to see how well our dates match against the date masks. Since all the matches are now stored in boolean columns, we can apply the Boolean analyzer. Here are the steps involved:
  1. Click "Add analyzer -> Boolean analyzer".
  2. Make sure all the transformed boolean columns are checked.
  3. Click the "Run analysis" button.
  4. Wait for the analysis to run.
Your screen will now contain an analysis result like this:
The result has two parts: The Column statistics and the Frequency of combinations.
In the column statistics you can see how much individual date masks have been matched. In our example we can see that 4 of our date masks (no. 2, 3, 5 and 6) are not matched at all, so we may consider removing them from the Date mask matcher.
In the frequency of combinations we get a view of the rows and which match combinations are frequent and less frequent. The most frequent combination is that our date mask no. 1 is the only valid mask. The second most frequent combination (Combination 1) is that none of the date masks apply. If you click the green arrow to the right of the combination you will see which records fall into that category. In our example that looks like this:
This gives us a good hint about which date masks we need to add to our date mask matcher.
The "1982.03.21" date is a simple case - we should simply create a date mask like this: YYYY.MM.dd
The "11th of march, 1982" date is a bit more complex. We need to allow the date mask to have a literal string part (the "th of" part) and it needs to recognize the month by name ("march"), not by number. Fortunately this is still possible, the date mask looks like this: dd'th of' MMMMM, YYYY

3. Converting to dates
While we could continue to refine the analysis, this is a blog, not a reference manual and I want to cut to the chase - the actual migration to standardized dates!
So let us look at how you can convert your date strings to actual date fields which you can then choose to format using a standardized format. To do this, click "Add transformer -> Conversion -> Convert to date". You will now see a configuration panel like this:
In here you also see a list of example date masks. Click the plus-button to add additional date masks to convert by. The converter will try from the top to convert if it can, so in case you have cases like "091011" then you have to make your choice here (I would recommend based on your analysis).
We add the few masks that are relevant for our example:
And we verify that there are no immediate unrecognized dates, by clicking the "Preview data" button:
If a date is not recognized and converted, then the output column will have a null instead of a date. Therefore a good practice would be to look for null values and eg. save them to an error handling file. To do this, here's what we do:
  1. Go to the Filters tab.
  2. Click "Add filter -> Not null".
  3. Select only your converted column (in our example "birthdate (as date)").
  4. Click "INVALID -> Write to CSV file".
  5. Select the desired columns for the error handling file.
  6. Optionally right click the "Write to CSV file" tab and select "Rename component" to give it a name like "write errors".
  7. Go back to the Filters tab and click the VALID button to write the valid records to a CSV file or a spreadsheet.
After these steps, you should be able to inspect your job flow by clicking the Visualize button:
Now your date standardization job is ready to run again and again to enforce standardized dates!