Performance of Raw Files vs. Flat Files

It’s pretty well accepted that raw files in SSIS are a very fast means of getting data in and out of the pipeline. Jamie Thomson has referenced the use of raw files a number of times and Todd McDermid recently posted about using them for staging data. It occurred to me, that even though I’d always heard they were faster than other options, I’d never actually tested it to see exactly how much of a difference it would make. So, below I’ve posted some admitted unscientific performance testing between raw files and flat (or text) files.

I tested two variations of flat files, delimited and ragged right. The delimited file was configured with a vertical bar (|) as the column delimiter and CR/LF as the row delimiter. The ragged right file was configured as a fixed width with row delimiters – each column had a fixed width, and a final, zero-width column was appended with CR/LF as the delimiter. The same data was used for each test, the following columns being defined:

Name Data Type Precision Scale Length
TestInt32 DT_I4 0 0 0
TestString DT_STR 0 0 50
TestBool DT_BOOL 0 0 0
TestCurrency DT_CY 0 0 0
TestDBTimestamp DT_DBTIMESTAMP 0 0 0
TestWString DT_WSTR 0 0 50
TestNumeric DT_NUMERIC 18 6 0

One thing to note is that when importing from flat files, everything was imported as strings, to avoid any data conversion issues. This is one of the strengths of raw files – no data conversion necessary. But for this test, I was primarily looking at speed of getting the data on and off disk. I also looked at the difference in file sizes between the formats.

I tested each option with 500,000, 1 million, and 10 million rows. I ran each one 4 times for each row count, and discarded the first run to offset the effects of file caching. The results of the runs were averaged for comparison.

When writing files, there’s no big surprises between the options. raw files are faster on 10 million rows by 9.8 seconds. The difference on smaller numbers of rows is pretty insignificant. Here’s a chart showing the times (the raw data is at the end of the post):

image

Reading files did show a difference that I didn’t expect. Read speeds on raw files and delimited files are fairly comparable, with raw files still having the edge in speed. However, reads on ragged right files are significantly slower – well over twice as slow when compared to raw files.

image

File sizes were also as expected, with delimited files having a slight edge over raw files, likely because the string values I used were not all 50 characters in length.

image

In summary, it’s clear that raw files have an advantage in speed. However, the differences weren’t as large as I was expecting, except in the case of ragged right files. So, in general, using raw files are best for performance, but if you are dealing with row counts of less than 1 million rows, it’s not a huge difference unless you are really concerned with performance. Of course, there are plenty of other differences between the formats, and I’d encourage you to research them before making a decision.

Here’s the raw data on the number of seconds to produce each file:

  500,000 1,000,000 10,000,000
Write To Delimited 2.61 5.16 47.02
Write To Ragged 2.66 5.31 49.03
Write To Raw 2.21 4.23 39.21
  500,000 1,000,000 10,000,000
Read From Delimited 0.77 1.52 16.59
Read From Ragged 2.74 5.89 35.39
Read From Raw 0.60 1.08 10.03

and the file size in KB for each:

  500,000 1,000,000 10,000,000
Delimited 44,624 89,792 946,745
Ragged 92,286 184,571 1,845,704
Raw 47,039 94,402 973,308

Please let me know if you’d like more details or have any questions.

This entry was posted in Uncategorized. Bookmark the permalink.

Comments are closed.