20100913

Pattern finder 2.0 - the latest feature in DataCleaner



I'm happy to present a feature that I know a lot of you have been asking for: a new and improved "Pattern finder" (as known from DataCleaner).

The new Pattern finder works similarly to the old one. What's new is that it supports a wide variety of configuration options (and it has been designed so that adding more options, if needed, will be significantly easier). Here are the currently available options:

Property | Default value | Description
--- | --- | ---
Discriminate text case | true | Sets whether or not upper case and lower case text tokens should be treated as different types of tokens.
Discriminate negative numbers | false | Sets whether or not negative numbers should be treated as a different token type than positive numbers.
Discriminate decimals | true | Sets whether or not decimal numbers should be treated as a different token type than integers.
Enable mixed tokens | true | Enables the "mixed" token type (denoted as '?' in the output). This type of token occurs when numbers and letters appear together without separating whitespace.
Ignore repeated spaces | false | Sets whether or not repeated whitespace should be ignored (i.e. matched as a single whitespace).
Decimal separator | ,* | The character used as the decimal separator in numbers.
Thousands separator | .* | The character used as a thousands separator in large numbers.
Minus sign | -* | The character used to denote negative numbers.
Predefined token name | (none) | Can be used to define an anticipated "predefined token" that should be replaced before any subsequent pattern recognition. Requires that the "Predefined token regexes" property is also set. An example of a name could be "Titulation".
Predefined token regexes | (none) | Defines a set of regular expressions for the "predefined token". Requires that the "Predefined token name" property is also set. An example value for these regular expressions could be "[Mr,Mrs,Miss,Mister]" (which would correspond to the "Titulation" name).

* = Locale-dependent; the value shown is the typical one.

This may all seem complicated, but rest assured that the default values are reasonable and almost exactly resemble what you would expect from the Pattern finder in DataCleaner (except for the "Discriminate text case" property, which is inherently turned off in DataCleaner).
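
As a rough sketch of how one might override the defaults to get closer to the DataCleaner behaviour described above, the fragment below switches off text case discrimination and also collapses repeated spaces. It follows the shape of the analyzer element shown later in this post; the property names are taken from the table, while the "true"/"false" value syntax and the column id "col_name" are assumptions made for illustration.

```xml
<analyzer>
  <descriptor ref="Pattern finder" />
  <properties>
    <!-- Treat upper case and lower case text tokens as the same token type -->
    <property name="Discriminate text case" value="false" />
    <!-- Match repeated whitespace as a single whitespace -->
    <property name="Ignore repeated spaces" value="true" />
  </properties>
  <!-- "col_name" is a hypothetical column id, not taken from the example job -->
  <input ref="col_name" />
</analyzer>
```

Properties that are left out should simply keep the default values listed in the table.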

Here's how it works with a set of different inputs (job title, email, name) and configurations:
> java -jar target/AnalyzerBeans.jar -conf examples/conf.xml -job examples/patternfinder_job.xml 

RESULT:
                            Match count Sample      
Aaaaa Aaa                            17 Sales Rep   
AA Aaaaaaaaa                          2 VP Sales    
Aaaaa Aaaaaaa (AAAA)                  2 Sale Manager (EMEA) 
Aaaaa Aaaaaaa (AAAAA, AAAA)           1 Sales Manager (JAPAN, APAC) 
Aaaaaaaaa                             1 President   


RESULT:
                                  Match count Sample      
aaaaaa.aaa@aaaaaaa[Domain suffix]           4 foo.bar@company.com 
aaaaaaa@aaaaaaaa[Domain suffix]             3 santa@claus.com 


RESULT:
                 Match count Sample      
aaaaaaa aaaaa              4 Jane Doe    
aaaaaaaa, aaaaaa           2 Bar, Foo    
aaa. aaaaaa aaa            1 Mrs. Foobar Foo

Notice that in the email example both patterns end with "[Domain suffix]". This is because I've registered a corresponding "Predefined token" for it:
<analyzer>
  <descriptor ref="Pattern finder" />
  <properties>
    <property name="Predefined token name" value="Domain suffix" />
    <property name="Predefined token regexes" value="[\.com,\.org]" />
  </properties>
  <input ref="col_email" />
</analyzer>
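
The same mechanism could presumably be applied to the name column, using the "Titulation" example from the options table. This is only a sketch: the token name and regex value are copied from the table, but the column id "col_name" is hypothetical and I have not shown what the resulting patterns would look like.

```xml
<analyzer>
  <descriptor ref="Pattern finder" />
  <properties>
    <!-- Both properties must be set together for a predefined token -->
    <property name="Predefined token name" value="Titulation" />
    <property name="Predefined token regexes" value="[Mr,Mrs,Miss,Mister]" />
  </properties>
  <!-- "col_name" is a hypothetical column id -->
  <input ref="col_name" />
</analyzer>
```
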
So now that you've seen the new Pattern finder... Does it meet all your expectations? Let me know if you've got any ideas or unresolved issues!
