I'm happy to present a feature in this blog post that I know a lot of you have been asking for: a new and improved "Pattern finder" (as known from DataCleaner).
The new Pattern finder works much like the old one. What's new is that it supports a wide variety of configuration options (and it has been designed so that adding more options later will be significantly easier, if needed). Here are the currently available options:
Property | Default value | Description |
---|---|---|
Discriminate text case | true | Sets whether or not text tokens that are upper case and lower case should be treated as different types of tokens. |
Discriminate negative numbers | false | Sets whether or not negative numbers should be treated as different token types than positive numbers. |
Discriminate decimals | true | Sets whether or not decimal numbers should be treated as different token types than integers. |
Enable mixed tokens | true | Enables the "mixed" token type (denoted as '?' in the output). This type of token occurs when numbers and letters appear together without separating whitespace. |
Ignore repeated spaces | false | Sets whether or not repeated whitespace should be ignored (i.e. matched with a single whitespace). |
Decimal separator | ,* | The separator used to identify decimal numbers. |
Thousands separator | .* | The character used as a thousands separator in large numbers. |
Minus sign | -* | The character used to denote negative numbers. |
Predefined token name | (none) | Can be used to define an anticipated "predefined token" that should be replaced before any subsequent pattern recognition. Requires that the "Predefined token regexes" property is also set. An example of a name could be "Titulation". |
Predefined token regexes | (none) | Defines a set of regular expressions for the "predefined token". Requires that the "Predefined token name" property is also set. An example value for these regular expressions could be "[Mr,Mrs,Miss,Mister]" (which would correspond to the "Titulation" name). |
* = Depends on locale; the value shown is the typical one.
This may all seem complicated, but rest assured that the default values are reasonable and almost exactly resemble what you would expect from the Pattern finder in DataCleaner (except for the "Discriminate text case" property, which is inherently turned off in DataCleaner).
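To make a few of these options more concrete, here is a rough Python sketch of the general idea. It is not the AnalyzerBeans implementation: the function names and parameters are hypothetical, the '#' symbol for digits is an assumption (only the '?' for mixed tokens comes from the table), and the number-related options (decimals, negative numbers, separators) are left out entirely. The real Pattern finder also appears to group values whose tokens differ only in length (in the sample output further down, `foo.bar@company.com` is listed under `aaaaaa.aaa@aaaaaaa[Domain suffix]`), which this sketch does not attempt.

```python
import re


def to_pattern(value, discriminate_case=True, enable_mixed=True,
               ignore_repeated_spaces=False,
               predefined_name=None, predefined_regexes=()):
    """Hypothetical helper: map one string to a simplified pattern."""
    if ignore_repeated_spaces:
        # Treat a run of spaces as a single space.
        value = re.sub(r" {2,}", " ", value)

    if predefined_name and predefined_regexes:
        # Both properties must be set. Assumes the regexes themselves
        # contain no capturing groups.
        combined = "|".join("(?:%s)" % rx for rx in predefined_regexes)
        pieces = re.split("(%s)" % combined, value)
    else:
        pieces = [value]

    out = []
    for index, piece in enumerate(pieces):
        if index % 2 == 1:
            # re.split places the captured matches at odd indexes.
            out.append("[%s]" % predefined_name)
            continue
        # Alphanumeric runs become tokens; other characters are kept as-is.
        for token in re.findall(r"[^\W_]+|[\W_]", piece):
            has_letter = any(c.isalpha() for c in token)
            has_digit = any(c.isdigit() for c in token)
            if enable_mixed and has_letter and has_digit:
                out.append("?" * len(token))
            else:
                out.append("".join(_symbol(c, discriminate_case)
                                   for c in token))
    return "".join(out)


def _symbol(char, discriminate_case):
    """Map a single character to its pattern symbol."""
    if char.isdigit():
        return "#"  # assumed digit symbol, not taken from the post
    if char.isalpha():
        return "A" if discriminate_case and char.isupper() else "a"
    return char


print(to_pattern("Sales Rep"))                          # Aaaaa Aaa
print(to_pattern("Jane Doe", discriminate_case=False))  # aaaa aaa
print(to_pattern("santa@claus.com",
                 predefined_name="Domain suffix",
                 predefined_regexes=[r"\.com", r"\.org"]))
# aaaaa@aaaaa[Domain suffix]
```

Replacing the predefined token before any other tokenization mirrors the description in the table: the matched text never reaches the letter and digit handling, so it shows up in the pattern only as its bracketed name.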
Here's how it works with a set of different inputs (job title, email, name) and configurations:

    > java -jar target/AnalyzerBeans.jar -conf examples/conf.xml -job examples/patternfinder_job.xml

    RESULT:
                                  Match count  Sample
    Aaaaa Aaa                     17           Sales Rep
    AA Aaaaaaaaa                  2            VP Sales
    Aaaaa Aaaaaaa (AAAA)          2            Sale Manager (EMEA)
    Aaaaa Aaaaaaa (AAAAA, AAAA)   1            Sales Manager (JAPAN, APAC)
    Aaaaaaaaa                     1            President

    RESULT:
                                       Match count  Sample
    aaaaaa.aaa@aaaaaaa[Domain suffix]  4            foo.bar@company.com
    aaaaaaa@aaaaaaaa[Domain suffix]    3            santa@claus.com

    RESULT:
                       Match count  Sample
    aaaaaaa aaaaa      4            Jane Doe
    aaaaaaaa, aaaaaa   2            Bar, Foo
    aaa. aaaaaa aaa    1            Mrs. Foobar Foo

Notice that in the email example the two patterns end with "[Domain suffix]". This is because I've registered a corresponding "Predefined token" for this:

    <analyzer>
      <descriptor ref="Pattern finder" />
      <properties>
        <property name="Predefined token name" value="Domain suffix" />
        <property name="Predefined token regexes" value="[\.com,\.org]" />
      </properties>
      <input ref="col_email" />
    </analyzer>

So now that you've seen the new Pattern finder... Does it meet all your expectations? Let me know if you've got any ideas or unresolved issues!