20080601

How to process millions of resultset rows in Java

I'm excited, because I think we've just solved a very common problem in Java applications that have to deal with huge amounts of data. Here's the trouble:

  1. Even though the JDBC spec defines a way to specify the fetch size when executing queries, some drivers do not implement this feature, which means your program will run out of memory if you query, e.g., a couple of million records (see the sketch just after this list).
  2. Even if your driver works as it's supposed to (a reasonable assumption in most cases), there is still no effective way to parallelize the processing of the many records with multithreading, since the data is streamed through a single connection.
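
For reference, here is a minimal sketch of how the fetch size is set through plain JDBC, assuming a driver that honors it (the connection URL, credentials and table are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

Connection con = DriverManager.getConnection("jdbc:mysql://localhost/mydb", "user", "pass");
Statement st = con.createStatement();
// Hint to the driver: stream rows in batches of 1000 instead of
// materializing the full resultset in memory.
st.setFetchSize(1000);
ResultSet rs = st.executeQuery("SELECT name, email FROM persons");
while (rs.next()) {
    // process rs.getString("name") and rs.getString("email") here
}
rs.close();
st.close();
con.close();

Note that some drivers need special treatment before they actually stream - MySQL's driver, for instance, requires a fetch size of Integer.MIN_VALUE.
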
Because of the power of the MetaModel schema and query model, we've been able to create a generic mechanism for splitting up a query into several queries that each yield fewer rows but the same collective result. We do this by identifying attributes that can be used to filter in WHERE clauses, for example:
  • Say we want to split up the query: "SELECT name, email FROM persons"
  • We will investigate the persons table and find columns that can be used to split the total resultset. We might find a reasonable age column for this, so the query could be split into:
  1. SELECT name, email FROM persons WHERE age < 30 OR age IS NULL
  2. SELECT name, email FROM persons WHERE age > 30 OR age = 30
Depending on the desired size of the partial queries, we split further by finding additional columns or by defining finer intervals to split by. Here's how it works:
DataContext dc = ...; // a DataContext for your data source
Query q = ...;        // the original (large) query

QuerySplitter qs = new QuerySplitter(dc, q);
List<Query> queries = qs.splitQueries();
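
To illustrate how the partial queries might then be consumed, here is a sketch assuming MetaModel's usual DataContext.executeQuery(...) and DataSet API (the row-processing body is a placeholder):

for (Query partialQuery : queries) {
    DataSet dataSet = dc.executeQuery(partialQuery);
    while (dataSet.next()) {
        Row row = dataSet.getRow();
        // process the row here, eg. row.getValue(0) for the name column
    }
    dataSet.close();
}

Each partial query yields only a fraction of the rows, so the memory footprint stays bounded no matter how large the total resultset is.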
I'd love to know what you all think of this. Personally I think it's a lovely way to optimize memory consumption, and it opens up new ways to utilize grid computing by distributing the partial queries to different nodes in the grid for remote processing. Also, a lot of databases (MySQL for example) dedicate only a single thread per query - so by splitting the queries one could further improve multithreading on the database side.
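
To sketch the multithreading idea: since the partial queries are independent, they can be submitted to a thread pool. This assumes your DataContext implementation can safely be shared across threads - verify that for your particular driver:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final DataContext dc = ...; // the same DataContext as above
ExecutorService executor = Executors.newFixedThreadPool(4);
for (final Query partialQuery : queries) {
    executor.submit(new Runnable() {
        public void run() {
            DataSet dataSet = dc.executeQuery(partialQuery);
            while (dataSet.next()) {
                // process dataSet.getRow() here
            }
            dataSet.close();
        }
    });
}
executor.shutdown();

The same pattern generalizes to a grid: instead of submitting to a local thread pool, each partial query could be shipped to a remote node for processing.
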

3 comments:

Doron said...

Hello,

Here is a way to move or update hundreds of millions of records using an MS SQL script in less than 1 hour. You can basically track the progress of 1 million records at a time and also figure out the time it takes to process all records. See this article at:
http://www.dfarber.com/computer-consulting-blog/2011/1/14/processing-hundreds-of-millions-records-got-much-easier.aspx
It works like a charm for me.

Regards,

Doron
http://www.dfarber.com

Jale said...

See the explanation at the link below; I think it will help you:
http://afewdoubts.blogspot.in/2013/02/difference-between-prepared-and.html

Ramesh said...

We can do this easily using pagination on pseudo-columns. No need to split the query or do any special processing.