2010-08-27

A nice abstraction over regular expressions

Often when you're developing data profiling, matching or cleansing software, you're dealing with expression matching, typically through regular expressions (regexes). I find that defining and reusing regexes, or parts of regexes, is often tedious and error-prone. AnalyzerBeans has a real need for pattern matching that is easier to write and reuse. To meet that need I've come up with a helper class, NamedPattern, which you can use to match and identify tokens in patterns in a type-safe and easy way. Here's a short example that matches and tokenizes names based on three simple patterns:

// First define an enum with the tokens in the pattern(s)
public enum NamePart { FIRSTNAME, LASTNAME, TITULATION }

// The three patterns
NamedPattern<NamePart> p1 = new NamedPattern<NamePart>("TITULATION. FIRSTNAME LASTNAME", NamePart.class);
NamedPattern<NamePart> p2 = new NamedPattern<NamePart>("FIRSTNAME LASTNAME", NamePart.class);
NamedPattern<NamePart> p3 = new NamedPattern<NamePart>("LASTNAME, FIRSTNAME", NamePart.class);

// Notice the type parameter <NamePart>: the match result is type-safe!
NamedPatternMatch<NamePart> match = p1.match("Sørensen, Kasper");
assert match == null;

match = p2.match("Sørensen, Kasper");
assert match == null;

// here's a match!
match = p3.match("Sørensen, Kasper");
assert match != null;

String firstName = match.get(NamePart.FIRSTNAME);
String lastName = match.get(NamePart.LASTNAME);
// p3 contains no TITULATION token, so presumably no value is available for it
String titulation = match.get(NamePart.TITULATION);

All in all I think that the NamedPattern class (and NamedPatternMatch), in combination with your own enums, is a pretty elegant way to do string pattern matching. You can also control how the underlying regular expression is built by letting the enum implement the HasGroupLiteral interface.
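To give a feel for the idea, here is a minimal, self-contained sketch of how an enum-driven pattern could be translated into a regular expression on top of java.util.regex. It is not the AnalyzerBeans implementation: SimpleNamedPattern, the GroupLiteral interface and its groupLiteral() method are hypothetical names made up for this illustration, and the real NamedPattern and HasGroupLiteral may work differently. The sketch assumes that each enum constant name found in the pattern becomes one capturing group, and that an enum can supply its own regex fragment for that group.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedPatternSketch {

    /** Hypothetical hook: lets a token supply the regex fragment for its group. */
    interface GroupLiteral {
        String groupLiteral(); // must not contain capturing groups of its own
    }

    enum NamePart implements GroupLiteral {
        FIRSTNAME("[\\p{L}\\p{N}]+"),
        LASTNAME("[\\p{L}\\p{N}]+"),
        TITULATION("Mr|Mrs|Ms|Dr");

        private final String literal;

        NamePart(String literal) { this.literal = literal; }

        @Override
        public String groupLiteral() { return literal; }
    }

    /** Hypothetical stand-in for NamedPattern, for illustration only. */
    static final class SimpleNamedPattern<E extends Enum<E>> {
        private static final String DEFAULT_LITERAL = "[\\p{L}\\p{N}]+";

        private final Class<E> enumClass;
        private final List<E> groupOrder = new ArrayList<>();
        private final Pattern regex;

        SimpleNamedPattern(String pattern, Class<E> enumClass) {
            this.enumClass = enumClass;

            // Build a regex that finds enum constant names as whole words.
            StringBuilder names = new StringBuilder();
            for (E token : enumClass.getEnumConstants()) {
                if (names.length() > 0) names.append('|');
                names.append(Pattern.quote(token.name()));
            }
            Matcher tokens = Pattern.compile("\\b(" + names + ")\\b").matcher(pattern);

            // Replace each token with a capturing group; quote everything else.
            StringBuilder sb = new StringBuilder();
            int last = 0;
            while (tokens.find()) {
                E token = Enum.valueOf(enumClass, tokens.group(1));
                String literal = (token instanceof GroupLiteral)
                        ? ((GroupLiteral) token).groupLiteral()
                        : DEFAULT_LITERAL;
                sb.append(Pattern.quote(pattern.substring(last, tokens.start())));
                sb.append('(').append(literal).append(')');
                groupOrder.add(token);
                last = tokens.end();
            }
            sb.append(Pattern.quote(pattern.substring(last)));
            this.regex = Pattern.compile(sb.toString());
        }

        /** Returns the matched tokens, or null if the whole input does not match. */
        Map<E, String> match(String input) {
            Matcher m = regex.matcher(input);
            if (!m.matches()) return null;
            Map<E, String> result = new EnumMap<>(enumClass);
            for (int i = 0; i < groupOrder.size(); i++) {
                result.put(groupOrder.get(i), m.group(i + 1));
            }
            return result;
        }
    }

    public static void main(String[] args) {
        SimpleNamedPattern<NamePart> p2 =
                new SimpleNamedPattern<>("FIRSTNAME LASTNAME", NamePart.class);
        SimpleNamedPattern<NamePart> p3 =
                new SimpleNamedPattern<>("LASTNAME, FIRSTNAME", NamePart.class);

        // The name is written with a Unicode escape so the source stays ASCII.
        String input = "S\u00f8rensen, Kasper";
        System.out.println(p2.match(input));                          // prints "null"
        System.out.println(p3.match(input).get(NamePart.FIRSTNAME)); // prints "Kasper"
    }
}
```

In this sketch the literal parts of the pattern (the dot, the comma, the spaces) are quoted, so only the enum tokens behave like regex groups. A token that does not appear in the pattern is simply absent from the returned map.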

Developers can dive into the details of these classes and interfaces in the Javadoc / API Documentation for AnalyzerBeans (package org.eobjects.analyzer.util).
