20080419

Querying data with DataCleaner-core

I promised the other day that I would return on the topic of using the metadata & data layer of DataCleaner-core. So here's what we'll do;

1) Open up a connection to the database (this is plain old JDBC). Here's how to do it with Derby, but any database could do:


Class.forName("org.apache.derby.jdbc.EmbeddedDriver");
Connection con = DriverManager.getConnection("jdbc:derby:my_database;");


2) Let's create a schema object for an easier, object-oriented way of accessing the data.

JdbcSchemaFactory schemaFactory = new JdbcSchemaFactory();
ISchema[] schemas = schemaFactory.getSchemas(con);

Note that by default the JDBC schema factory only retrieves relations from the database of type "TABLE". You could in some situations though wish to broaden this restriction, for example to enable views:

JdbcSchemaFactory schemaFactory = new JdbcSchemaFactory();
schemaFactory.setTableTypes(new String[] {"TABLE","VIEW"});
ISchema[] schemas = schemaFactory.getSchemas(con);


3) Let's try exploring our metadata, consisting of schemas, tables and columns.

ITable productTable = schemas[0].getTableByName("Products");
IColumn[] productColumns = productTable.getColumns();
IColumn productCodeColumn = productTable.getColumnByName("product_code");

//This next int represents one of the constants in java.sql.Types.
int productCodeType = productCodeColumn.getColumnType();
boolean isProductCodeLiteral = MetadataHelper.isLiteral(productCodeType);
boolean isProductCodeNumber = MetadataHelper.isNumber(productCodeType);


4) Time to make a query or two. Let's start off by just querying the whole table and then querying two specific columns.


JdbcDataFactory dataFactory = new JdbcDataFactory();
IData someDataThatWeWillDiscard = dataFactory.getData(con, productTable);
IData data = dataFactory.getData(con, productTable, productCodeColumn, anotherColumn);
while (data.next()) {

IRow row = data.getRow();
int count = data.getCount();
System.out.println("Observed " + count + " rows with product code: " + row.getValue(productCodeColumn));

}


Notice the IData.getCount() which is crucial to understand. The data factory will try to generate a group by query to optimize the load on traffic between database server and client. Sometimes this is not possible though (for example for TEXT types in Derby, where GROUP BY is not allowed). The getCount() method returns how many occurances there are of this distinct combination of values, represented by the IRow interface. So make sure to always check the count, maybe there are less rows in the result than in the actual database, because the results have been compressed!

Observe in general how strongly typed an API this is. In other data-oriented API's one would have to type in the same column name several places (at least in the query and when iterating through the results) but with the DataCleaner-core metadata and data layer we get a completely object oriented and type safe way to do this. The amazing thing about this API is also that we could have just as well done the same thing with flat files or other data source types.

No comments: