In SSIS 2005, one of the challenges with the Lookup transform was that the SQL used to populate the cache was fixed at design time. You could dynamically pass in parameters to the SQL, but only if you turned on the advanced caching options, which puts the lookup into row-by-row operation. With smaller tables (<100,000 rows) this wasn’t too much of an issue, but with larger tables it could be a problem. A common scenario where this functionality would be nice is with Type 2 slowly changing dimensions (SCD). If you are extracting a single day of source records (doing a daily incremental load from the source system), then the lookup cache only needs to contain records that were in effect on the date of the extract. In SSIS 2005, there was no easy way to accomplish this. However, the functionality has been added in 2008.
To illustrate this, I’ve created a simple package with a single data flow that uses AdventureWorks and AdventureWorksDW. The OLE DB Source extracts Sales Order Detail records for a single day from AdventureWorks, based on the RunDate variable.
The Product Lookup is getting its rows from the DimProduct table in AdventureWorksDW. This is a Type 2 SCD table, with effective dates. The Lookup is set up normally. Initially, I had to specify a default query that returned the columns needed from DimProduct, so that the metadata for the data flow was set up properly.
To add an expression to the Lookup, I went back to the Control Flow view, and selected the Data Flow Task. Now, in the Properties window, you can see two properties associated with the Lookup:
SqlCommand is used to override the inner SQL statement used by the Lookup, while SqlCommandParam can be used to modify the entire statement. Normally, SqlCommand is enough for most purposes. I have an expression set on the property, that builds a SELECT statement dynamically, including a WHERE clause that filters by the RunDate variable.
The primary advantage to doing this is a reduction in the number of rows in the lookup cache. For the AdventureWorks sample, filtering the set of dimension rows down to just the ones in effect as of the Sales Order date reduces the cached rows from 606 to 339. Not really a significant amount in this example, but a similar percentage reduction for a million row dimension table would lead to significant savings.
I’ve uploaded the sample package to my SkyDrive.