SSIS 2008 – Incrementally Updating the Lookup Cache File

There are a number of new features in the 2008 Lookup Transform. Jamie Thomson posted a good overview of the features, and I recommend reading it prior to this article. One of the new features in SSIS 2008 is the ability to cache the data from the Lookup to a local file. This should help improve performance, as the lookup data would only need to be retrieved once, stored locally, and reused. Internally, the lookup cache file (.caw) uses a RAW file structure (.raw), which is very fast to read and write from disk.


One of the first things that I wanted to try with the new caching features was to dynamically update the cache file. That proved to be a bit of a challenge, as Jamie pointed out in his article. One of the main reasons for this is that, while the cache file is very similar to a RAW file, the cache file stores some additional metadata – hash values for each row, and some header information. This means that while you can use a RAW Source component to read the data from a cache file, you must use the Cache Transform to actually create a cache file.


Unfortunately, the Cache Transform does not support incremental updates to the cache file. Each time you use it, it recreates the file completely. However, with a little creativity, you can minimize the amount of data you have to process. I’ve created a sample package that illustrates one approach to this.


The control flow looks like this (the numbers refer to the steps below):


image


The major components of the process are:



  1. Retrieve the last modified date from the Lookup reference table, by using an Execute SQL Task.

  2. Use a Script Task to determine if the cache file already exists. This leverages the System.IO.File class, as discussed in Checking for the Existence of a File. Execution can proceed to either step 3, 4, or 7, based on precedence constraints with expressions.

  3. If the file does not exist, it is created through a data flow. This data flow is straightforward – it retrieves all the rows from the reference table and runs them through a Cache Transform.
      image

  4. If the cache file does exist and the reference table has been modified since the last run of the package, a File System task is used to rename it. This done so that the cache file can be recreated in the next step. It cannot be read from and written to at the same time, so there has to be two copies of it for a short time.

  5. Now the cache file is updated within a data flow task. The OLE DB Source retrieves only the modified rows from the reference table. The rows that haven’t been changed are pulled in from the cache file, using the Raw File Source. The lookup is used to determine if an existing row from the cache file is also included in the set of modified rows from reference table. The lookup uses an expression to set the reference query, so that it contains the same set of rows as the OLE DB Source. If the row matches one from the OLE DB Source, the row is discarded. Only non-matching rows are passed through the lookup. Finally, the rows are combined using a Union All and written back out to a new cache file (with the same name as the original).
    image

  6. The copied version of the cache file is deleted, now that it is no longer needed.

  7. This data flow is used to test the resulting cache file. It is not necessary in a production package.

The Lookup table for this example looks like:


 



CREATE TABLE [dbo].[StatusLookup](
    [ValidValue] [varchar](50) NULL,
    [AlternateValue] [varchar](50) NULL,
    [LastUpdate] [datetime] NULL
) ON [PRIMARY]
GO

This package uses a number of variables:


image


and several expressions (some of which are on the variables above):


image


These are used to make the package a bit more dynamic. The LastRunDate variable controls when the package thinks it was last executed. In a production application, this would need to be set from a parent package or an Execute SQL Task. The LookupTable variable controls the name of the cache files.


I encourage you to download the sample and walk through it to get a in-depth understanding. I’ve really only covered the high points in this article. It’s posted on my SkyDrive.


This was done against CTP6 of SQL Server 2008. The RTM release may necessitate some changes to the pattern and/or the code described here.

This entry was posted in Uncategorized. Bookmark the permalink.

Comments are closed.