It’s pretty well accepted that raw files in SSIS are a very fast means of getting data in and out of the pipeline. Jamie Thomson has referenced the use of raw files a number of times and Todd McDermid recently posted about using them for staging data. It occurred to me that even though I’d always heard they were faster than other options, I’d never actually tested it to see exactly how much of a difference it would make. So, below I’ve posted some admittedly unscientific performance testing between raw files and flat (or text) files.
I tested two variations of flat files, delimited and ragged right. The delimited file was configured with a vertical bar (|) as the column delimiter and CR/LF as the row delimiter. The ragged right file was configured as fixed width with row delimiters: each column had a fixed width, and a final, zero-width column was appended with CR/LF as the delimiter (a quick sketch of the two layouts follows the column table below). The same data was used for each test, with the following columns defined:
| Name | Data Type | Precision | Scale | Length |
| --- | --- | --- | --- | --- |
| TestInt32 | DT_I4 | 0 | 0 | 0 |
| TestString | DT_STR | 0 | 0 | 50 |
| TestBool | DT_BOOL | 0 | 0 | 0 |
| TestCurrency | DT_CY | 0 | 0 | 0 |
| TestDBTimestamp | DT_DBTIMESTAMP | 0 | 0 | 0 |
| TestWString | DT_WSTR | 0 | 0 | 50 |
| TestNumeric | DT_NUMERIC | 18 | 6 | 0 |
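To make the two flat file layouts concrete, here’s a minimal sketch in plain Python (not SSIS itself) of how a single row would be laid out in each format. The sample values are made up, and the column widths other than the two 50-character string columns are assumptions for illustration only:

```python
# One sample row; all values rendered as strings, matching the test setup.
row = {
    "TestInt32": "12345",
    "TestString": "hello",
    "TestBool": "True",
    "TestCurrency": "19.9900",
    "TestDBTimestamp": "2009-06-01 12:00:00",
    "TestWString": "world",
    "TestNumeric": "123456.789012",
}

# Delimited: a vertical bar between columns, CR/LF at the end of the row.
delimited = "|".join(row.values()) + "\r\n"

# Ragged right: every column padded to a fixed width, with a zero-width
# final column carrying the CR/LF row delimiter. These widths are
# hypothetical except the 50-character string columns.
widths = [10, 50, 5, 20, 25, 50, 20]
ragged = "".join(v.ljust(w) for v, w in zip(row.values(), widths)) + "\r\n"

# The fixed-width padding is why the ragged right files come out so
# much larger in the size comparison later in the post.
print(len(delimited), len(ragged))
```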
One thing to note is that when importing from flat files, everything was imported as strings, to avoid any data conversion issues. This is one of the strengths of raw files – no data conversion necessary. But for this test, I was primarily looking at speed of getting the data on and off disk. I also looked at the difference in file sizes between the formats.
I tested each option with 500,000, 1 million, and 10 million rows. I ran each one 4 times for each row count, and discarded the first run to offset the effects of file caching. The results of the runs were averaged for comparison.
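For reference, here’s a rough sketch of that discard-and-average approach. The actual timings came from SSIS package executions; `run_package` below is just a hypothetical stand-in for kicking off one run:

```python
import time

def average_timing(run_package, runs=4):
    """Run the operation `runs` times, discard the first run to offset
    the effects of file caching, and average the remainder."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_package()
        timings.append(time.perf_counter() - start)
    return sum(timings[1:]) / (runs - 1)
```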
When writing files, there are no big surprises between the options. Raw files are faster at 10 million rows, beating delimited files by 7.8 seconds and ragged right files by 9.8 seconds. The difference at smaller row counts is pretty insignificant. (The full timing data is in the tables at the end of the post.)
Reading files did show a difference that I didn’t expect. Read speeds on raw files and delimited files are fairly comparable, with raw files still having the edge. However, reads on ragged right files are significantly slower: around three and a half times slower than raw files at 10 million rows.
File sizes were also as expected, with delimited files having a slight edge over raw files, likely because the string values I used were not all 50 characters in length. The ragged right files are nearly twice the size of the others, since every string column is padded out to its full fixed width.
In summary, it’s clear that raw files have an advantage in speed. However, the differences weren’t as large as I was expecting, except in the case of ragged right files. So, in general, raw files are the best choice for performance, but at row counts under 1 million, the difference is small enough that it may not matter unless performance is a major concern. Of course, there are plenty of other differences between the formats, and I’d encourage you to research them before making a decision.
Here’s the raw data on the number of seconds to write and read each file:
| Operation | 500,000 rows | 1,000,000 rows | 10,000,000 rows |
| --- | --- | --- | --- |
| Write To Delimited | 2.61 | 5.16 | 47.02 |
| Write To Ragged | 2.66 | 5.31 | 49.03 |
| Write To Raw | 2.21 | 4.23 | 39.21 |
| Operation | 500,000 rows | 1,000,000 rows | 10,000,000 rows |
| --- | --- | --- | --- |
| Read From Delimited | 0.77 | 1.52 | 16.59 |
| Read From Ragged | 2.74 | 5.89 | 35.39 |
| Read From Raw | 0.60 | 1.08 | 10.03 |
and the file size in KB for each:
| Format | 500,000 rows | 1,000,000 rows | 10,000,000 rows |
| --- | --- | --- | --- |
| Delimited | 44,624 | 89,792 | 946,745 |
| Ragged | 92,286 | 184,571 | 1,845,704 |
| Raw | 47,039 | 94,402 | 973,308 |
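As a quick sanity check on those numbers, dividing the 10-million-row file sizes by the row count gives the approximate bytes per row for each format:

```python
# KB per file (from the table above) divided by the row count gives
# approximate bytes per row at 10 million rows.
sizes_kb = {"Delimited": 946_745, "Ragged": 1_845_704, "Raw": 973_308}
for fmt, kb in sizes_kb.items():
    print(f"{fmt}: ~{kb * 1024 / 10_000_000:.0f} bytes/row")
# Delimited: ~97, Ragged: ~189, Raw: ~100 -- the fixed-width padding
# roughly doubles the size of the ragged right files.
```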
Please let me know if you’d like more details or have any questions.