20140714

MetaModel 4.2 to offer 'Schema Inference'

For the upcoming version 4.2 of Apache MetaModel (incubating) I have been working on adding a JSON module to the already quite broad data access library. The purpose of the new JSON module is to be able to read text files with JSON contents and present it as if it's a database you can explore and query. This presented (again) an issue in MetaModel:

How do we present a queryable schema for a datastore which does not intrinsically have a schema?

A couple of times (for instance in implementing MongoDB, CouchDB, HBase or XML modules) we have faced this issue, and answers have varied a little depending on the particular datastore. This time I didn't feel like creating another one-off strategy for building the schema. Rather it felt like a good time to introduce a common abstraction - called the 'schema builder' - which allows for several standard implementations as well as pluggable ways of building or infering the schema structure.

The new version of MetaModel will thus have a SchemaBuilder interface which you can implement yourself to plug in a mechanism for building/defining the schema. Even better, there's a couple of quite useful standard implementations.

To explain this I need some examples. Assume you have these documents in your JSON file:

{"name":"Kasper Sørensen", "type":"person", "country":"Denmark"}
{"name":"Apache MetaModel", "type":"project", "community":"Apache"}
{"name":"Java", "type":"language"}
{"name":"JavaScript", "type":"language"}

Here we have 4 documents/rows containing various things. All have a name and a type field, but other fields such as country or community seems optional or situational.

In many situations you would want to have a schema model containing a single table with all documents. That would be possible in two ways. Either as a columnized form (implemented via SingleTableInferentialSchemaBuilder):

nametypecountrycommunity
Kasper SørensenpersonDenmark
Apache MetaModelprojectApache
Javalanguage
JavaScriptlanguage

Or having a single column of type MAP (implemented via SingleMapColumnSchemaBuilder):

value
{name=Kasper Sørensen, type=person, country=Denmark}
{name=Apache MetaModel, type=project, community=Apache}
{name=Java, type=language}
{name=JavaScript, type=language}

The latter approach of course has the advantage that it allows for the same polymorphic behaviour as the JSON documents itself, but it doesn't allow for a great SQL-like query syntax like the first approach. The first approach needs to do an initial analysis to infer the structure based on observed documents in a sample.

A third approach is to build multiple virtualized tables by splitting by distinct values of the "type" field. In our example, it seems that this "type" field is there to distinguish separate types of documents, so that means we can build tables like this (implemented via MultiTableInferentialSchemaBuilder):

Table 'person'
namecountry
Kasper SørensenDenmark


Table 'project'
namecommunity
Apache MetaModelApache


Table 'language'
name
Java
JavaScript


As you can see, the standard-implementations of schema builder offer quite some flexibility. The feature is still new and can certainly be improved and made more elaborate. But I think this is a great addition to MetaModel - to be able to dynamically define and inject rules for accessing schema-less structures.