Performance of Raw Files vs. Flat Files

It’s pretty well accepted that raw files in SSIS are a very fast means of getting data in and out of the pipeline. Jamie Thomson has referenced the use of raw files a number of times, and Todd McDermid recently posted about using them for staging data. It occurred to me that even though I’d always heard they were faster than other options, I’d never actually tested them to see how much of a difference it would make. So, below I’ve posted some admittedly unscientific performance testing of raw files versus flat (or text) files.

I tested two variations of flat files: delimited and ragged right. The delimited file was configured with a vertical bar (|) as the column delimiter and CR/LF as the row delimiter. The ragged right file was configured as fixed width with row delimiters: each column had a fixed width, and a final, zero-width column was appended with CR/LF as the delimiter. The same data was used for each test, with the following columns defined:

Name             Data Type        Precision   Scale   Length
TestInt32        DT_I4                    0       0        0
TestString       DT_STR                   0       0       50
TestBool         DT_BOOL                  0       0        0
TestCurrency     DT_CY                    0       0        0
TestDBTimestamp  DT_DBTIMESTAMP           0       0        0
TestWString      DT_WSTR                  0       0       50
TestNumeric      DT_NUMERIC              18       6        0
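
As mentioned in the comments below, a script generated the test rows. A minimal sketch of such a script source component might look like this (the Output0Buffer property names assume an output configured with the columns above, and the sample values and row count are illustrative):

    // Minimal sketch of an SSIS script source component (C#).
    // Assumes Output 0 defines the seven columns above; the
    // Output0Buffer properties are generated from those column names.
    public override void CreateNewOutputRows()
    {
        for (int i = 0; i < 500000; i++)  // row count varied per test
        {
            Output0Buffer.AddRow();
            Output0Buffer.TestInt32 = i;                               // DT_I4
            Output0Buffer.TestString = "Row " + i.ToString();          // DT_STR, length 50
            Output0Buffer.TestBool = (i % 2 == 0);                     // DT_BOOL
            Output0Buffer.TestCurrency = 1234.56m;                     // DT_CY surfaces as Decimal
            Output0Buffer.TestDBTimestamp = DateTime.Now;              // DT_DBTIMESTAMP
            Output0Buffer.TestWString = "Unicode row " + i.ToString(); // DT_WSTR, length 50
            Output0Buffer.TestNumeric = 9876.543210m;                  // DT_NUMERIC (18,6)
        }
    }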

One thing to note is that when importing from flat files, everything was imported as strings to avoid any data conversion issues. This is one of the strengths of raw files: no data conversion is necessary. But for this test, I was primarily looking at the speed of getting the data on and off disk. I also looked at the difference in file sizes between the formats.

I tested each option with 500,000, 1 million, and 10 million rows. I ran each one 4 times for each row count, and discarded the first run to offset the effects of file caching. The results of the runs were averaged for comparison.
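
For the curious, a timing harness along these lines could capture the numbers. This is a hypothetical sketch, not the exact harness I used; the package path is illustrative, and the SSIS runtime calls simply load and execute a saved package:

    // Hypothetical timing harness (C#): runs a package repeatedly,
    // discards the first run to offset file caching, and averages the rest.
    using System;
    using System.Diagnostics;
    using Microsoft.SqlServer.Dts.Runtime;

    class TimingHarness
    {
        static void Main()
        {
            Application app = new Application();
            // Path is illustrative; point this at the test package.
            Package pkg = app.LoadPackage(@"C:\Tests\WriteToRaw.dtsx", null);

            const int runs = 4;
            double totalSeconds = 0;
            for (int i = 0; i < runs; i++)
            {
                Stopwatch sw = Stopwatch.StartNew();
                pkg.Execute();
                sw.Stop();
                if (i > 0)  // discard the first (cache-priming) run
                    totalSeconds += sw.Elapsed.TotalSeconds;
            }
            Console.WriteLine("Average over {0} runs: {1:F2} s",
                runs - 1, totalSeconds / (runs - 1));
        }
    }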

When writing files, there are no big surprises between the options. Raw files are faster than ragged right files by 9.8 seconds on 10 million rows, and faster than delimited files by 7.8 seconds. The difference on smaller row counts is pretty insignificant. Here’s a chart showing the times (the raw data is at the end of the post):

[Chart: average write times, in seconds, by file format and row count]

Reading files did show a difference that I didn’t expect. Read speeds on raw files and delimited files are fairly comparable, with raw files still having the edge. However, reads on ragged right files are significantly slower: roughly three and a half times slower than raw files.

[Chart: average read times, in seconds, by file format and row count]

File sizes were also as expected, with delimited files having a slight edge over raw files, likely because the string values I used were not all 50 characters in length.

[Chart: file sizes, in KB, by file format and row count]

In summary, it’s clear that raw files have an advantage in speed. However, the differences weren’t as large as I was expecting, except in the case of ragged right files. So, in general, raw files are best for performance, but at row counts under 1 million, the difference isn’t huge unless performance is a primary concern. Of course, there are plenty of other differences between the formats, and I’d encourage you to research them before making a decision.

Here’s the raw data on the number of seconds to write and read each file:

                      500,000   1,000,000   10,000,000
Write To Delimited       2.61        5.16        47.02
Write To Ragged          2.66        5.31        49.03
Write To Raw             2.21        4.23        39.21

                      500,000   1,000,000   10,000,000
Read From Delimited      0.77        1.52        16.59
Read From Ragged         2.74        5.89        35.39
Read From Raw            0.60        1.08        10.03

and the file size in KB for each:

              500,000   1,000,000   10,000,000
Delimited      44,624      89,792      946,745
Ragged         92,286     184,571    1,845,704
Raw            47,039      94,402      973,308

Please let me know if you’d like more details or have any questions.

5 Comments

  1. siddhumehta says:

    Nice article, John. But if you consider using them in a BI solution, RAW files also carry quite a lot of dependencies. They can’t easily be used outside the Data Flow engine, if at all.

    They can’t be read by any tool or viewer (except a third-party RAW file viewer utility). Also, storing data in RAW files makes you dependent on SSIS, as to the best of my knowledge there’s no import-export wizard or BCP-style tool that can import data from RAW files.

    But I agree with your idea that a comparative analysis between different data types and file formats does help when making a choice during low-level design.

    –Siddharth.

  2. jwelch says:

    Siddharth, I agree. There are many variables that affect the choice of where to stage your data. I thought Todd’s article did a nice job of comparing them, though I’ve chosen to use SQL Server as my primary staging area, rather than RAW files. Look for an upcoming post on the performance of the SQL option as compared to RAW files.

  3. Phil Brammer says:

    John,
    I’m curious – for the writes, what was the source (flat file, raw, OLE DB, script)?

    In the end, writing to any of the options relies on the disk almost 100%, with the only differences being the provider used and the file format. In the case of the ragged right/fixed-width setup, more data must be written for each row, so it should be slower to read and write. Delimited data likely has to go through a bit more work to concatenate the delimiter in with the data as compared to the raw file format, which I like to think is a dump of the already assembled data flow buffer.

    Also, what if you reran the test again after populating your three test files in the write scenario, and this time set to overwrite the data, hence using an already allocated file? (I don’t think there will be an improvement, but it was a fleeting thought I had.)

    Phil

  4. jwelch says:

    I used a script to generate the rows for the write. I haven’t tried the overwrite option, in part because Windows file caching also affects this, and there’s no simple way to disable it.

  5. [...] most other data source and destination formats.  For some statistics on the speed benefit see this post by John Welch (blog | [...]
