Performance benchmark: DataCleaner thrives on lower column counts

Today I've conducted an experiment. After fixing a bug related to CSV-file-reading in DataCleaner, I was wondering how performance was impacted by different kinds of CSV file compositions. The reason that I suspected that this could impact performance is that CSV files with many columns will require a somewhat larger chunk of memory in order to keep a single row in memory compared to CSV files with fewer columns. In the older versions of DataCleaner we discovered that using 200 or more columns would actually make the application run out of memory! Fortunately, this bug is fixed, but there is still a significant performance penalty, as this blog post will hopefully show.

I auto-generated three files for the benchmark: "huge.csv" with 2.000 columns and 16.000 rows, "long.csv" with 250 columns and 128.000 rows and "slim.csv" with only 10 columns and a roaring 3.200.000 rows. Together, each file has 32.000.000 cells to be profiled. I set up a profiler job with the profiles Standard measures and String analysis on all columns.

Here are the (surprising?) results:

filename	rows	columns	start time	end time	total time
huge.csv	16000	2000	18:54:48	19:40:28	45:40
long.csv	128000	250	19:44:53	19:52:31	7:38
slim.csv	3200000	10	19:53:46	19:55:03	1:17

So the bottom line is: Lowering the number of columns has a very significant, positive impact on performance. Having a lot of columns means that you will need to hold a lot more data in memory and needless to say you will have to replace this large chunk of memory a lot of times during the execution of a large profiler job. Going all the way from 45 minutes to 1½ is quite an improvement - so don't pre-join tables or anything like that before you run them through your profiler.

kasper's source

20090613

Performance benchmark: DataCleaner thrives on lower column counts

No comments:

Blog-archive

My links

eobjects - news