Today I conducted a little experiment. After fixing a bug related to CSV file reading in DataCleaner, I was wondering how performance is affected by different kinds of CSV file compositions. The reason I suspected an impact is that a CSV file with many columns requires a somewhat larger chunk of memory to hold a single row than a CSV file with fewer columns does. In older versions of DataCleaner we discovered that using 200 or more columns would actually make the application run out of memory! Fortunately that bug is fixed, but there is still a significant performance penalty, as this blog post will hopefully show.
I auto-generated three files for the benchmark: "huge.csv" with 2,000 columns and 16,000 rows, "long.csv" with 250 columns and 128,000 rows, and "slim.csv" with only 10 columns and a roaring 3,200,000 rows. That way, each file contains the same 32,000,000 cells to be profiled. I set up a profiler job with the Standard measures and String analysis profiles on all columns.
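The generator itself isn't part of this post, but a minimal sketch of how such test files could be produced might look like the following. Only the file names and dimensions come from the benchmark; the class name and the placeholder cell values are my own illustration:

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;

public class CsvTestFileGenerator {

    // Writes a CSV file with the given shape, filling every cell with a short dummy value.
    public static void generate(String filename, int columns, int rows) throws IOException {
        try (BufferedWriter writer = new BufferedWriter(new FileWriter(filename))) {
            // header row: col_1, col_2, ..., col_n
            StringBuilder header = new StringBuilder();
            for (int c = 0; c < columns; c++) {
                if (c > 0) {
                    header.append(',');
                }
                header.append("col_").append(c + 1);
            }
            writer.write(header.toString());
            writer.newLine();

            // data rows with placeholder values
            for (int r = 0; r < rows; r++) {
                StringBuilder row = new StringBuilder();
                for (int c = 0; c < columns; c++) {
                    if (c > 0) {
                        row.append(',');
                    }
                    row.append("value").append(c);
                }
                writer.write(row.toString());
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        generate("huge.csv", 2000, 16000);   // 32,000,000 cells
        generate("long.csv", 250, 128000);   // 32,000,000 cells
        generate("slim.csv", 10, 3200000);   // 32,000,000 cells
    }
}
```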
Here are the (surprising?) results:
filename | rows | columns | start time | end time | total time (min:sec)
---------|------|---------|------------|----------|---------------------
huge.csv | 16,000 | 2,000 | 18:54:48 | 19:40:28 | 45:40
long.csv | 128,000 | 250 | 19:44:53 | 19:52:31 | 7:38
slim.csv | 3,200,000 | 10 | 19:53:46 | 19:55:03 | 1:17
So the bottom line is: lowering the number of columns has a very significant, positive impact on performance. Having a lot of columns means you have to hold a lot more data in memory for each row, and needless to say you will have to replace that large chunk of memory many times during the execution of a large profiler job. Going all the way from 45 minutes down to 1½ minutes is quite an improvement, so don't pre-join your tables or anything like that before you run them through the profiler.
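Since all three files hold the same 32,000,000 cells, the timings can also be read as throughput. A quick back-of-the-envelope calculation based on the total times in the table above (the class is just for illustration):

```java
public class ThroughputCalc {

    public static void main(String[] args) {
        long cells = 32000000L; // same cell count for all three files

        // total times from the table, converted to seconds
        printThroughput("huge.csv", cells, 45 * 60 + 40); // 45:40
        printThroughput("long.csv", cells, 7 * 60 + 38);  //  7:38
        printThroughput("slim.csv", cells, 1 * 60 + 17);  //  1:17
    }

    private static void printThroughput(String filename, long cells, long seconds) {
        System.out.printf("%s: ~%,d cells/second%n", filename, cells / seconds);
    }
}
```

That works out to roughly 12,000 cells per second for huge.csv versus more than 400,000 cells per second for slim.csv, even though the profiles being run are exactly the same.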