T-SQL Tuesday #006 – LOBs in the SSIS Dataflow

The topic for T-SQL Tuesday #006 is LOB data, selected by this month’s host, Michael Coles. If you aren’t familiar with T-SQL Tuesdays, Michael has a nice summary in his post.

In database terms, LOBs are Large OBjects, sometimes collectively referred to as BLOBs (Binary Large OBjects). These are data types that can exceed the maximum row size of 8 KB in SQL Server. Most often, they are used for storing large amounts of text or binary data in the database – for example, an application that stores documents in a database table. SSIS has some special features for handling LOBs in the data flow, and this post will provide an overview of them, along with a few tips on using them wisely and some approaches to keep them from killing your data flow performance.

Data Types for LOBs

LOBs are represented by three data types in SSIS: DT_TEXT, DT_NTEXT, and DT_IMAGE. These map to the SQL Server types VARCHAR(MAX), NVARCHAR(MAX), and VARBINARY(MAX), respectively. Each of these types can hold up to 2,147,483,647 bytes, or almost 2 GB of data. Columns with these data types have some special rules.

First, if you want to use these columns in a Derived Column expression, you’ll probably need to cast them to another type, such as DT_STR or DT_WSTR, because most of the built-in functions don’t support the LOB types. One thing to be careful of here is that the entire contents of the LOB may not fit in the data type you are casting to, which will cause a truncation error. You can either set the Derived Column transformation to ignore truncation errors, or redirect error rows so you can handle the longer values differently.
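
As a quick sketch (the column name here is made up), a cast in a Derived Column expression looks like this; any value longer than 4000 characters will trigger the truncation behavior described above:

```
(DT_WSTR,4000)MyLobColumn
```

DT_WSTR tops out at 4000 characters (and DT_STR at 8000), so pick the largest length your downstream components can handle if you want to minimize truncation errors.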

If you are accessing LOBs in a Script component, you have to use the AddBlobData(), GetBlobData(), and ResetBlobData() methods on the column to get to the data.

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // DT_NTEXT is Unicode data, so use UnicodeEncoding (UTF-16) to convert
    // between strings and the raw bytes stored in the blob column
    System.Text.UnicodeEncoding encoding = new System.Text.UnicodeEncoding();

    // Write to the LOB column by appending bytes
    Row.LOBNTEXT.AddBlobData(encoding.GetBytes("My Test String"));

    // Read the LOB column back out as a byte array, then decode it to a string
    byte[] bytes = Row.LOBNTEXT.GetBlobData(0, (int)Row.LOBNTEXT.Length);
    string myString = encoding.GetString(bytes);
}

Note that you should use the System.Text.ASCIIEncoding class in the above code if you are working with non-Unicode (DT_TEXT) data.

Transforms for LOB Data

There are two transformations in SSIS specifically designed for working with LOB data: Import Column and Export Column. The Import Column transformation is used to import the contents of a file from disk into a column in the data flow, as a LOB. The Export Column transformation performs the reverse operation – that is, it saves the contents of a LOB column to a file on disk. Since you can process one file per row, these transformations can be really useful for moving multiple files in and out of a database.

There are a few good walkthroughs on using these transformations, so rather than reposting the same information, here are a few links:

The Import Column Transformation, by James Beresford (aka BI Monkey)

The Export Column Transformation, by James Beresford (aka BI Monkey)

Importing Files Using SSIS, by me

Performance

Because these data types can hold so much data, SSIS handles them a little differently than the standard data types. They are allocated separately from regular data in the buffers, and when there is memory pressure, SSIS spools buffers to disk. The potential size of LOB data makes it very likely to be spooled, which can be a major performance bottleneck. To minimize the impact, you want to make sure that the LOB data is being spooled to a fast disk that doesn’t have a lot of contention. To control which disk it is spooled to, you can set the BLOBTempStoragePath property on the Data Flow task. Note that this is a different setting than BufferTempStoragePath, which controls where regular data is spooled. Ideally, each of these settings should point to a different physical disk.
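
You can set these properties in the package designer, or override them when the package runs. As a sketch (the package name, task name, and drive letters below are made up for illustration), dtexec’s /SET option lets you point each property at a different disk at execution time:

```
dtexec /F "MyPackage.dtsx" ^
    /SET "\Package\Data Flow Task.Properties[BLOBTempStoragePath]";"E:\SSISBlobTemp" ^
    /SET "\Package\Data Flow Task.Properties[BufferTempStoragePath]";"F:\SSISBufferTemp"
```

Using property expressions on the Data Flow task works just as well if you’d rather drive the paths from package variables or configurations.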

The other item to consider is whether you actually need LOB data. In some scenarios, the LOB data types are used for convenience, or because a small percentage of the data might exceed the maximum length of a regular data type. If you are processing a lot of data with LOB types, and most of the LOB values are small enough to fit in a regular data type, then you may be better off using a two-pass approach. In the first pass, select only the rows with LOB values small enough to fit in a standard data type, and cast them in the SELECT statement. SSIS will process this set of data using the normal buffer approach, which should be faster. Then, do a second pass where you select only the rows that are too large for a regular data type. This minimizes the amount of data that has to be processed using LOB data types. To figure out the length of a LOB column in SQL Server, you can use the DATALENGTH function.

SELECT DATALENGTH(MyLOBColumn)
FROM MyLOBTable
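
Putting that together, the two passes might look something like this (the table and column names are made up, and keep in mind that DATALENGTH returns bytes, so for NVARCHAR data the character cutoff is half the byte count):

```sql
-- Pass 1: rows that fit in a regular data type, cast in the SELECT
-- so SSIS sees a normal DT_STR column instead of a LOB
SELECT CAST(MyLOBColumn AS VARCHAR(4000)) AS MyLOBColumn
FROM MyLOBTable
WHERE DATALENGTH(MyLOBColumn) <= 4000;

-- Pass 2: only the rows that genuinely need the LOB type
SELECT MyLOBColumn
FROM MyLOBTable
WHERE DATALENGTH(MyLOBColumn) > 4000;
```

Each pass can feed its own data flow (or its own source in the same data flow), so only the second, much smaller set pays the LOB handling cost.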

As usual with performance, your mileage may vary on this one. It’s worked well for me in a few scenarios, but there’s some overhead involved in using two passes. So, as with any performance tips, test it in your environment, with a solid representative sample of your data, before implementing it in production.

There it is – a whirlwind tour of LOBs in SSIS. If you have any questions or comments, please post them.

4 Comments

  1. http:// says:

    I’m having a lot of trouble using SSIS to import data from Active Directory. I know importing AD data is probably out of scope with your post, but I think there are real correlations. I have the connection working just fine, but certain fields being imported appear, as you describe, to be DT_NTEXT and simply show < > in the fields. I’ve tried using two data conversion components to cast them to DT_TEXT. However, once imported, all I get is System.Object[]. I’ll now try the getblobdata() method next, but do you have any advice in this area?

    There are several AD fields, but a few are GUID, Description, memberOf.

    Thank you.

  2. jwelch says:

    What’s your target data type?

    I did look at this a while back (in the 2005 days) and if I recall correctly, some of the AD results were arrays of values, which posed some issues for importing into SSIS. The best place to get help on this is probably on the MSDN forums: http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/threads. I’m on there, along with a number of people much brighter than me.

  3. http:// says:

    Nice, thanks for the referral. Though it’s pretty funny that plenty of posts about using AD with SSIS point back to your post on the subject.

    http://agilebi.com/cs/blogs/jwelch/archive/2007/04/07/retrieving-information-from-active-directory-with-ssis.aspx

    In fact, that’s the post I used to get as far as I have. Except my query string looks like this:

    SELECT objectGUID,sAMAccountName,GivenName,sn,mail,employeeNumber,pwdLastSet,description,displayName,department,streetAddress,l,st,postalCode,telephoneNumber,facsimileTelephoneNumber,memberOf,distinguishedName,modifyTimeStamp,accountExpires,useraccountcontrol
    From 'LDAP://dc1'
    WHERE objectClass='user' AND objectCategory='Person'

    Most of the fields causing me problems are set to varchar(MAX) or varchar(100) if I’m sure it’s something short like accountExpires or description.

    I currently have a VB script that does all of this and exports to a CSV and then a task to import the file. But it takes the script almost 7 minutes to run for about 4500 users. Even though I couldn’t import all the fields correctly using your tips on using an ADO.NET Connection Manager, I could tell it was worlds different. Import completed in under a minute.
