20081111

Querying a CSV file!

Today I'm going to demonstrate some of the functionality of MetaModel 1.1, which provides a datastore-transparent querying API for Java. Actually what I'll show you isn't all that new, we've been able to do most of the querying stuff for CSV files (and other datastores) roughly since MetaModel 1.0, but it seems to me that too few realize what a powerful tool we have with MetaModel, so I'm just going to repeat myself a bit here ;) Also, I'll demonstrate on of the new cool features of MetaModel 1.1 - column type detection, narrowing and value transformation.

For this example I'll use some real data: I've extracted this CSV file from the eobjects.org trac system. It contains a list of all the tickets (issues, bugs, tasks, whatever) that are active at the time of writing... You'll notice if you take a look at the file, that it's not exactly a simple CSV file - a lot of text spans multiple lines and there are quite a lot of different data types.

Ok, so let's get started. I've saved the data to a file: tickets.csv. Now I want to read the file and let MetaModel generate a metadata model based on it. I will also let MetaModel try and detect the column types (since CSV only contains text-types natively) which will automatically transform all values to the most fitting and narrow type that MetaModel can find (this is indicated by the 'true' parameter in the code  below). Here's how we get a hold of the datastore model for the file:

File file = new File("tickets.csv");
DataContext dataContext = DataContextFactory.createCsvDataContext(file, true);

Once we have a DataContext object we are ready to go for our datastore-transparent way of querying. What we do is: We get a hold of the schema of our CSV file and we locate the table of interest. Since CSV is a single-table datastore type, getting the table of interest can be done in two ways:

Table table;

//get table by name
table = schema.getTableByName("tickets");

//get table by index
table = schema.getTables()[0];

Now we can go ahead and investigate the structure of the CSV file. Since we turned on automatic column type narrowing you will see that the 'ticket' column have been converted from a text-based column to an INTEGER type. Also, as MetaModel can verify that 'ticket'-values are never missing, it is asserted that the column is not nullable:

Column ticketColumn = table.getColumnByName("ticket");

//Will print: "Ticket column type: INTEGER"
System.out.println("Ticket column type: " + ticketColumn.getType());

//Will print: "Ticket column nullable: false"
System.out.println("Ticket column nullable: " + ticketColumn.isNullable());

And now for the fun and impressing part... Let's try to make some queries! Here are a couple of examples:

Query q;

//SELECT * FROM tickets ORDER BY ticket
q = new Query().select(table.getColumns()).from(table).orderBy(ticketColumn);

//SELECT SUM(ticket) FROM tickets
q = new Query().select(FunctionType.SUM, ticketColumn).from(table);

//SELECT _reporter AS rep_name, COUNT(*) AS num_tickets FROM tickets GROUP BY rep_name
Column reporterColumn = table.getColumnByName("_reporter");
q = new Query().select(reporterColumn,"rep_name).selectCount().from(table).groupBy(reporterColumn);

To execute the queries is very simple - just ask your DataContext object to execute the query object. MetaModel will then analyze the query and process the result behind the scenes:

DataSet dataSet = dataContext.executeQuery(q);

We can process MetaModel DataSets in quite a few ways. A typical way would be to iterate through it, similar to a ResultSet:

while (dataSet.next()) {
    Row row = dataSet.getRow();
    //Extract values or do something similar with the row
    System.out.println("row: " + row);
}

... Or we can transform the DataSet into a TableModel object:

TableModel tableModel = dataSet.toTableModel();

... Or we can transform it into a list of Object arrays:

List<Object[]> objectArrays = dataSet.toObjectArrays();

As you can see, with the MetaModel API a lot of things that used to be difficult is now really, really easy!

Click here to display the full example.

13 comments:

Gajanan Patil said...

do we have any mechanism to query csv file similarly we do for database?

e.g
if i pass query like select * from abc.csv to get all records in csv
or select col1,col2,col3 from abc.csv I should get DataSet returned.

Kasper Sørensen said...

Hi Gajanan,

That's exactly what MetaModel provides :-)

Ravishankar Arivazhagan said...

Kasper, With the latest version - I see all columns getting detected as String. The code seems to have changed:

https://github.com/apache/metamodel/blob/master/csv/src/main/java/org/apache/metamodel/csv/CsvTable.java#L127

Will this feature to auto detect column types be revived anytime in the future ?

Gajanan Patil said...

Thank you Kasper for your update
but when I use CsvDataContext in 4.6 version to execute query it doesn't offer the same feature.? may be I am missing something here. do you have any code snippet that illustrat this in details.

Thank you for all your extended help.

GP

Kasper Sørensen said...

Hmm I don't think the CSV adaptor in MetaModel ever did auto-conversion to numbers. Not sure why you have that impression?

The SUM function is capable of parsing the strings as numbers, but I believe it will throw if a non-number occurs in the data set. A CSV gives no guarantees like that so our choice so far was to not try and assume anything.

You may however decorate your CsvDataContext explicitly using the conversions available in org.apache.metamodel.convert.Converts.

Ravishankar Arivazhagan said...

Thanks for the prompt response, Kasper.

I was specifically referring to this section in your blogpost:

//Will print: "Ticket column type: INTEGER"
System.out.println("Ticket column type: " + ticketColumn.getType());

I will definitely checkout the conversions.

I have just started exploring Metamodel and form what I have seen so far - it is very useful & interesting. :)

Gajanan Patil said...

Hello Kasper

Do we support any logical/ mathematical operator other than "=" operator in Joins as we have below query that includes "<>" operator which gets failed in parser gets failed in finding the indexOf "=" from private FromItem parseJoinItem() function in FromItemParser.java. So it seems exiting implementation doesn't supports any other operator than "=".

query is : SELECT e.FirstName, d.DeptName FROM emp.employees e RIGHT JOIN emp.dept d ON d.DeptNo <> e.DeptNo WHERE d.DeptNo='ACC013' and d.DeptName='Admin'

Please let me know if that so. then are we planning this in future release.

Thanks
Gajanan

Kasper Sørensen said...

Hi Gajanan,

You are correct, MetaModel does not have other operators supported in ON clauses than equals (=). You could phrase the query as a carthesian product instead and use WHERE though.

Gajanan Patil said...

Thank you Kasper, for your update

Gajanan Patil said...

Hello Kasper,

Do we have any method that take care of comparing 2 DataSets?

Thank you
Gajanan

Kasper Sørensen said...

I don't know of a ready-to-use comparator for it, but I suggest you ask on the user@ mailing list to see if anyone has a good idea for it.

Gajanan Patil said...

Thank you Kasper

I found the one below

1. Converting both the datasets to List say lst1, lst2.
2. then removing matching records from lst1.removeAll(lst2) or vice versa.

Gajanan Patil said...

may be I am missing here something but my observation for MySQL JdbcDataContext is metamodel really don't consider Information_Schema for querying?

or is there any other way to get it?

Thank you
GP