Preparing A List of IDs for an IN Condition

There have been a few questions on the MSDN Forums recently about using a list of keys in an IN condition in a WHERE clause. There are a few scenarios where this can be a bit tricky in SSIS, particularly if you need to create the list of keys dynamically. As an example, let’s look at the AdventureWorksDW database. Suppose that you need to extract rows from the FactInternetSales table based on the customer key. You may need to use a SELECT like this, where the ? should be replaced with a list of keys:

SELECT
ProductKey, OrderDateKey, CustomerKey, SalesOrderNumber, SalesOrderLineNumber,
OrderQuantity, UnitPrice, TotalProductCost, SalesAmount, TaxAmt, Freight
FROM
FactInternetSales
WHERE
CustomerKey IN (?)

There are a number of ways to do this. First off, if the list of keys is coming from a table (the key table), and both tables are in the same database, you may be able to add a JOIN to the SELECT. This works if the list of keys can be determined through a WHERE clause.

SELECT
ProductKey, OrderDateKey, FactInternetSales.CustomerKey, SalesOrderNumber, SalesOrderLineNumber,
OrderQuantity, UnitPrice, TotalProductCost, SalesAmount, TaxAmt, Freight
FROM
FactInternetSales
JOIN DimCustomer ON FactInternetSales.CustomerKey = DimCustomer.CustomerKey
WHERE
DimCustomer.LastName LIKE 'Z%'

This is the simplest approach, but let’s assume that we need to get the list of keys from a separate database, and that it is not practical to join between the tables. This could be because they are on two different database engines (SQL Server and Oracle, for example) or because of the performance impact of a cross-server join. The next approach to consider is using a temporary table to hold the list of keys. You would use one Data Flow to move the list of IDs from the key table to a temporary table in the same database as the sales table. Then, in a second Data Flow, the sales table can be joined to the temporary table, using the same JOIN technique shown above.


Another approach can be used in situations where creating a temporary table isn’t desirable. An Execute SQL task can be used to retrieve the list of keys into a recordset object. There are two options once the keys are in a recordset. A For Each loop can be used to iterate through the keys and run the Data Flow once for each key in the recordset. A drawback with this approach is that you are processing each key individually, which can slow down the overall processing. The other option is to use a Script task to create a concatenated list of the keys. This list can be used in the Data Flow to process all the keys at once. Both approaches require that the Data Flow source component use a variable or an expression to get the SQLCommand to execute.


The example I am including shows using the Script task to create the list of IDs. The script is:

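    Public Sub Main()
        Dim vars As Variables = Nothing
        Dim ds As DataSet
        Dim dr As DataRow
        Dim delimitedKeyList As String = String.Empty

        ' Read the DataSet that the Execute SQL task stored in KeyList
        Dts.VariableDispenser.LockOneForRead("KeyList", vars)
        ds = CType(vars("KeyList").Value, DataSet)
        vars.Unlock()

        ' Append each key (first column of the first table) plus a delimiter
        For Each dr In ds.Tables(0).Rows
            delimitedKeyList += (dr(0).ToString() + ", ")
        Next

        ' Remove the last delimiter (skipped when no keys were returned,
        ' where Remove would otherwise throw on a negative start index)
        If delimitedKeyList.Length >= 2 Then
            delimitedKeyList = delimitedKeyList.Remove(delimitedKeyList.Length - 2, 2)
        End If

        ' Write the finished list to INList for the Data Flow expression
        Dts.VariableDispenser.LockOneForWrite("INList", vars)
        vars("INList").Value = delimitedKeyList
        vars.Unlock()

        Dts.TaskResult = Dts.Results.Success
    End Sub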


It takes the KeyList variable (populated with a DataSet object by the Execute SQL task) and writes a delimited list of key values to the INList variable. The INList variable is used in an expression set on the Data Flow:

"SELECT
ProductKey, OrderDateKey, FactInternetSales.CustomerKey, SalesOrderNumber, SalesOrderLineNumber,
OrderQuantity, UnitPrice, TotalProductCost, SalesAmount, TaxAmt, Freight
FROM FactInternetSales
WHERE CustomerKey IN ("
+ @[User::INList] + ")"

Be aware that if you are using an expression, the SELECT string can’t be over 4000 characters in length. For a long list of keys being appended to a long SELECT, that can be a problem. You can work around it by creating the entire SELECT statement in the Script task, and assigning it to a variable there. Another item to note is that the script expects an ADO.NET DataSet object. The Execute SQL task is using an ADO.NET connection, so that is what is passed in. If you are using an OLE DB Connection, the code will need to be updated to convert the ADO recordset (what you get with OLE DB) to a DataTable. Also, you may want to set DelayValidation to TRUE on the Data Flow task, to avoid validation errors when the package initially loads.
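The two workarounds just mentioned (building the whole SELECT in the Script task, and coping with an OLE DB recordset) could be combined roughly as follows. This is a minimal sketch, not the code from the sample package. It assumes a String variable named SelectStatement (a hypothetical name) exists in the package, that the Script task lists KeyList as a read variable and SelectStatement as a write variable, and that the class header matches the simple ScriptMain class used by the script above; newer Script task templates generate a different class header and result enum, so adapt those lines to whatever your template produces. The fallback to IN (NULL) for an empty key list is also my own choice, not something the original script does.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Data
Imports System.Data.OleDb
Imports Microsoft.SqlServer.Dts.Runtime

Public Class ScriptMain

    ' Returns the key rows as a DataTable, whichever connection type filled KeyList.
    Private Function ToKeyTable(ByVal resultSet As Object) As DataTable
        Dim ds As DataSet = TryCast(resultSet, DataSet)
        If ds IsNot Nothing Then
            Return ds.Tables(0)   ' ADO.NET connection: already a DataSet
        End If

        ' OLE DB connection: the variable holds an ADO recordset, which
        ' OleDbDataAdapter.Fill can copy into a DataTable.
        Dim table As New DataTable()
        Dim adapter As New OleDbDataAdapter()
        adapter.Fill(table, resultSet)
        Return table
    End Function

    Public Sub Main()
        Dim vars As Variables = Nothing

        Dts.VariableDispenser.LockOneForRead("KeyList", vars)
        Dim keyTable As DataTable = ToKeyTable(vars("KeyList").Value)
        vars.Unlock()

        ' Collect the first column of every row as text
        Dim keys As New List(Of String)
        For Each dr As DataRow In keyTable.Rows
            keys.Add(dr(0).ToString())
        Next

        ' IN () is not valid SQL, so fall back to IN (NULL), which matches no rows
        ' (an assumption about the desired behavior when no keys come back).
        Dim inList As String = "NULL"
        If keys.Count > 0 Then
            inList = String.Join(", ", keys.ToArray())
        End If

        Dim selectStatement As String = _
            "SELECT ProductKey, OrderDateKey, CustomerKey, SalesOrderNumber, " & _
            "SalesOrderLineNumber, OrderQuantity, UnitPrice, TotalProductCost, " & _
            "SalesAmount, TaxAmt, Freight " & _
            "FROM FactInternetSales " & _
            "WHERE CustomerKey IN (" & inList & ")"

        ' SelectStatement is a hypothetical variable name; use your own.
        Dts.VariableDispenser.LockOneForWrite("SelectStatement", vars)
        vars("SelectStatement").Value = selectStatement
        vars.Unlock()

        Dts.TaskResult = Dts.Results.Success
    End Sub

End Class
```

With something like this in place, the Data Flow would take its command text from the SelectStatement variable rather than from an expression that concatenates INList. Because the keys here are values read from a key column, plain concatenation is tolerable; if the list could ever contain free-form text, it would need to be validated or quoted before being placed in the statement.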


A sample package showing the Script approach is available here:



Hopefully this provides information on a number of the options for accomplishing this in SSIS. There are other ways that I didn’t cover in this post, but these options cover most scenarios.
