Archive for September 2007

Preparing A List of IDs for an IN Condition

There have been a few questions on the MSDN Forums recently about using a list of keys in an IN condition in a WHERE clause. There are a few scenarios where this can be a bit tricky in SSIS, particularly if you need to dynamically create the list of keys. As an example, let’s look at the AdventureWorksDW database. Suppose that you need to extract rows from the FactInternetSales table based on the customer key. You may need to use a SELECT like this, where the ? should be replaced with a list of keys:

SELECT
ProductKey, OrderDateKey, CustomerKey, SalesOrderNumber, SalesOrderLineNumber,
OrderQuantity, UnitPrice, TotalProductCost, SalesAmount, TaxAmt, Freight
FROM
FactInternetSales
WHERE
CustomerKey IN (?)

There are a number of ways to do this. First off, if the list of keys is coming from a table (the key table), and both tables are in the same database, you may be able to add a JOIN to the SELECT. This works if the list of keys can be determined through a WHERE clause.

SELECT
ProductKey, OrderDateKey, FactInternetSales.CustomerKey, SalesOrderNumber, SalesOrderLineNumber,
OrderQuantity, UnitPrice, TotalProductCost, SalesAmount, TaxAmt, Freight
FROM
FactInternetSales
JOIN DimCustomer ON FactInternetSales.CustomerKey = DimCustomer.CustomerKey
WHERE
DimCustomer.LastName LIKE 'Z%'

This is the simplest approach, but let’s assume that we need to get the list of keys from a separate database, and that it is not practical to join between the tables. This could be because they are two different database engines (SQL Server and Oracle, for example) or because of the performance impact of a cross-server join. The next approach to consider is using a temporary table to hold the list of keys. You would use one Data Flow to move the list of IDs from the key table to a temporary table in the same database as FactInternetSales. Then, in a second Data Flow, the join technique discussed above can be applied.
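A minimal sketch of the temporary table approach, assuming the keys are landed in a staging table (KeyStage is a name I’m using just for illustration):

CREATE TABLE KeyStage (CustomerKey INT PRIMARY KEY)

-- after the first Data Flow loads KeyStage, the second Data Flow's source can use:
SELECT
ProductKey, OrderDateKey, FactInternetSales.CustomerKey, SalesOrderNumber, SalesOrderLineNumber,
OrderQuantity, UnitPrice, TotalProductCost, SalesAmount, TaxAmt, Freight
FROM
FactInternetSales
JOIN KeyStage ON FactInternetSales.CustomerKey = KeyStage.CustomerKey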


Another approach can be used in situations where creating a temporary table isn’t desirable. An Execute SQL task can be used to retrieve the list of keys into a recordset object. There are two options once the keys are in a recordset. A Foreach Loop can be used to iterate through the keys and run the Data Flow once for each key in the recordset. A drawback with this approach is that you are processing each key individually, which can slow down the overall processing. The other option is to use a Script task to create a concatenated list of the keys, which can be used in the Data Flow to process all the keys at once. Both approaches require that the Data Flow source component use a variable or an expression to supply the SQL command to execute.
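For the Foreach Loop option, the source query inside the loop is just a single-key version of the SELECT, with the loop’s current key mapped to the parameter (a sketch; the parameter mapping itself is configured on the source component):

SELECT
ProductKey, OrderDateKey, CustomerKey, SalesOrderNumber, SalesOrderLineNumber,
OrderQuantity, UnitPrice, TotalProductCost, SalesAmount, TaxAmt, Freight
FROM
FactInternetSales
WHERE
CustomerKey = ?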




The example I am including shows using the Script task to create the list of IDs. The script is:

Public Sub Main()
    Dim vars As Variables
    Dim ds As DataSet
    Dim dr As DataRow
    Dim delimitedKeyList As String = String.Empty

    'read the DataSet that the Execute SQL task stored in KeyList
    Dts.VariableDispenser.LockOneForRead("KeyList", vars)
    ds = CType(vars("KeyList").Value, DataSet)
    vars.Unlock()

    'concatenate the first column of each row into a comma-delimited string
    For Each dr In ds.Tables(0).Rows
        delimitedKeyList += (dr(0).ToString() + ", ")
    Next

    'remove the last delimiter
    delimitedKeyList = delimitedKeyList.Remove(delimitedKeyList.Length - 2, 2)

    Dts.VariableDispenser.LockOneForWrite("INList", vars)
    vars("INList").Value = delimitedKeyList
    vars.Unlock()

    Dts.TaskResult = Dts.Results.Success
End Sub


It takes the KeyList variable (populated with a DataSet object by the Execute SQL task) and writes a delimited list of the key values to the INList variable. The INList variable is used in an expression set on the Data Flow:

"SELECT
ProductKey, OrderDateKey, FactInternetSales.CustomerKey, SalesOrderNumber, SalesOrderLineNumber,
OrderQuantity, UnitPrice, TotalProductCost, SalesAmount, TaxAmt, Freight
FROM FactInternetSales
WHERE CustomerKey IN (" + @[User::INList] + ")"

Be aware that if you are using an expression, the SELECT string can’t be over 4000 characters in length. For a long list of keys being appended to a long SELECT, that can be a problem. You can work around it by creating the entire SELECT statement in the Script task, and assigning it to a variable there. Another item to note is that the script expects an ADO.NET DataSet object. The Execute SQL task is using an ADO.NET connection, so that is what is passed in. If you are using an OLE DB Connection, the code will need to be updated to convert the ADO recordset (what you get with OLE DB) to a DataTable. Also, you may want to set DelayValidation to TRUE on the Data Flow task, to avoid validation errors when the package initially loads.
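As a rough sketch of that OLE DB variation (the variable names match the sample; OleDbDataAdapter.Fill is the standard way to copy an ADO recordset into a DataTable):

Public Sub Main()
    Dim vars As Variables
    Dim keyTable As New System.Data.DataTable()
    Dim adapter As New System.Data.OleDb.OleDbDataAdapter()
    Dim row As System.Data.DataRow
    Dim delimitedKeyList As String = String.Empty

    'with an OLE DB connection, the Execute SQL task stores an ADO recordset,
    'which OleDbDataAdapter.Fill can copy into a DataTable
    Dts.VariableDispenser.LockOneForRead("KeyList", vars)
    adapter.Fill(keyTable, vars("KeyList").Value)
    vars.Unlock()

    'build the delimited list, exactly as in the DataSet version
    For Each row In keyTable.Rows
        delimitedKeyList += (row(0).ToString() + ", ")
    Next

    'remove the last delimiter
    delimitedKeyList = delimitedKeyList.Remove(delimitedKeyList.Length - 2, 2)

    Dts.VariableDispenser.LockOneForWrite("INList", vars)
    vars("INList").Value = delimitedKeyList
    vars.Unlock()

    Dts.TaskResult = Dts.Results.Success
End Sub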


A sample package showing the Script approach is available here:



Hopefully this provides a good overview of the options for accomplishing this in SSIS. There are other ways that I didn’t cover here, but these options handle most scenarios.

Changes Coming for SSIS 2008

At the PASS Summit this week, I heard a few interesting bits about SSIS 2008 that should be in the next CTP.

One, ADO.NET will be fully supported, with an ADO.NET Source (renamed from the DataReader Source) and an ADO.NET Destination. Since ADO.NET has an ODBC provider, we should finally have the ability to use an ODBC database as a destination. And they will both have custom UIs, so no more messing around in the Advanced Editor.

Two, there’s a new data profiling task. This does a really nice job of processing a table and doing all the standard data profiling activities. It’s based on something out of Microsoft Research, so it has some pretty cool capabilities, like a pattern recognition function that will spit out regular expressions that match the contents of the column.

Three, since scripts will be run through VSTA instead of VSA, it will be pretty easy to create a web service transform. Just create a Script component, reference the web service, and all the proxy code will be created for you. (It’s not news that VSTA is replacing VSA, but I hadn’t thought about how that would impact web services until now.)

Four, data flow threading will be much better. Previously, a single execution tree was always single threaded. That’s why in 2005, if you have a long chain of synchronous transforms, you may get better performance by introducing a Union All transform into the sequence. It breaks up the execution tree and allows the engine to run multiple threads. In 2008, the data flow engine will be able to introduce new threads itself.

Five, there will be a Watson-style dump tool available. It will be invoked automatically on a crash, or you can invoke it on demand. It will dump the current state of the package out to a text file, including variable values, data flow progress, etc.

And finally, lookups are going to be drastically enhanced. They can be sourced from ADO.NET or a flat file. We’ll have the ability to cache the lookups and reuse them across data flows in the same package. We’ll also be able to persist the cache between runs of the package. And, interestingly enough, the persisted file will look a lot like a RAW file. There should be some interesting possibilities in that. :) There will also be a “missed rows” output, instead of having to use the error output to capture records that didn’t have a match.

Dynamically Pivoting Rows to Columns

Previously, I posted about dynamically un-pivoting columns to rows. In this post, we’ll look at the reverse scenario. Suppose that you have a list of sales orders, in this format:

Sales Order Number    Product
SO-001                Hammer
SO-001                Table Saw
SO-001                Cordless Screwdriver
SO-002                Table Saw
SO-002                Wrench
SO-003                Flashlight

If there is a (reasonable) maximum number of products, you can create the maximum number of columns needed, and use the Pivot transform. However, if the number of products per order varies significantly, you might want to use something a little more flexible. This can be accomplished by storing the values in a single column, either as XML or as a delimited set of values.


For this example, we’ll take the input described above and transform it to this:

Sales Order Number    Product
SO-001                Hammer|Table Saw|Cordless Screwdriver
SO-002                Table Saw|Wrench
SO-003                Flashlight

The data flow for this is straightforward – a Flat File Source, a Script component, and a Multicast (so that a data viewer can be added to see the results of the script). The Script component is where the bulk of the work is done.




The script transform is configured with a single asynchronous output, by setting Output 0’s SynchronousInputID property to “None”. Two columns are added to Output 0, one for the Sales Order Number, and one for the list of products.


image


The script itself is listed below. Essentially, the script checks each row to see if the sales order number has changed from the previous row. If it is the same, it appends the product to a list. If the sales order number is different, the currently stored values for sales order number and the product list are sent to the asynchronous output.


There is some special handling for the first and last rows. On the first row, we don’t want to output anything, as we need to check the next row first. When the input buffer is finished, we need to send the last values for the sales order number and product list, and then call SetEndOfRowset on the output buffer. To make sure we’ve processed all the rows, we need to check EndOfRowset on the input buffer. This should be checked in the Input0_ProcessInput routine, as it will never be true in the Input0_ProcessInputRow method.





Public Class ScriptMain
    Inherits UserComponent

    Private orderNum As String
    Private productList As String
    Private blnFirstRow As Boolean = True

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)

        If Row.SalesOrderNumber = orderNum Then
            'append this item to the list
            productList = productList + "|" + Row.Product
        Else
            If blnFirstRow Then
                blnFirstRow = False
            Else
                'output the current values
                Output0Buffer.AddRow()
                Output0Buffer.SalesOrderNumber = orderNum
                Output0Buffer.ProductList = productList
            End If

            'reset the control variables
            orderNum = Row.SalesOrderNumber
            productList = Row.Product
        End If
    End Sub

    Public Overrides Sub Input0_ProcessInput(ByVal Buffer As Input0Buffer)
        MyBase.Input0_ProcessInput(Buffer)
        If Buffer.EndOfRowset Then
            Output0Buffer.AddRow()
            Output0Buffer.SalesOrderNumber = orderNum
            Output0Buffer.ProductList = productList
            Output0Buffer.SetEndOfRowset()
        End If
    End Sub
End Class
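For the XML option mentioned earlier, only the concatenation logic needs to change. Here is a rough sketch of Input0_ProcessInputRow with the delimiter swapped for element markup (the Product element name is mine, just for illustration; the rest of the class stays the same):

Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)

    If Row.SalesOrderNumber = orderNum Then
        'append this item as an XML element instead of a delimited value
        productList = productList + "<Product>" + Row.Product + "</Product>"
    Else
        If blnFirstRow Then
            blnFirstRow = False
        Else
            'output the current values
            Output0Buffer.AddRow()
            Output0Buffer.SalesOrderNumber = orderNum
            Output0Buffer.ProductList = productList
        End If

        'reset the control variables, wrapping the first product for this order
        orderNum = Row.SalesOrderNumber
        productList = "<Product>" + Row.Product + "</Product>"
    End If
End Sub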

This can be useful if you need to pivot a varying number of rows into a single column. The same technique can also be used anywhere you want to process a set of rows and change the output based on previous row values. If you’d like to take a more in-depth look, the sample files are on my SkyDrive:




Using the WMI Data Reader in SSIS

A recent post on the MSDN forums asked about using the WMI Data Reader Task to retrieve information and send it to a SQL Server destination. I wasn’t very familiar with the task, so I decided to try it out. I wasn’t expecting much, given some of my experiences with the WMI Event Watcher (which I have found fairly difficult to get working consistently). However, it was very easy to configure and use.


WMI (Windows Management Instrumentation) allows you to get information about your hardware, software, and network. I’m not going to cover it in depth (I’m not anywhere close to being knowledgeable about it), but if you want more information, there is an article from Microsoft on using the WMI Query Language here (http://www.microsoft.com/technet/scriptcenter/resources/begin/ss1206.mspx).


As a sample, I am querying the Win32_Service class to get a list of the services on my computer, and the start mode for each of them. The query for this is:

SELECT NAME, STARTMODE FROM Win32_Service

The control flow contains two tasks, the WMI Data Reader Task and a data flow to write the results out. For sample purposes, I am only sending it to a Multicast, but it could be sent to any destination.


The WMI Data Reader Task is configured with the query in the WqlQuerySource property, and the OutputType is set to “Data table”. The results are put into a variable (User::WMIVar) of type Object.




The task stores the result of the WMI query in the specified variable as an ADO.NET DataTable object. This variable is used in the data flow, within a Script source component. The Script component is configured with two output columns, one for the service name and one for the startup type. Both columns use the DT_WSTR(50) data type. The WMIVar variable is passed in as a read-only variable.




The Script Source shreds the DataTable into the data flow.

Public Overrides Sub CreateNewOutputRows()
    Dim dataTable As System.Data.DataTable
    Dim dataRow As System.Data.DataRow

    'retrieve the DataTable that the WMI Data Reader task stored in the variable
    dataTable = CType(Me.Variables.WMIVar, Data.DataTable)

    For Each dataRow In dataTable.Rows
        Output0Buffer.AddRow()

        Output0Buffer.Name = dataRow.Item("NAME").ToString()
        Output0Buffer.StartupType = dataRow.Item("STARTMODE").ToString()
    Next

    Output0Buffer.SetEndOfRowset()
End Sub


That’s all that was necessary to use the results from the WMI Data Reader task. It was pretty straightforward to configure, and it runs quickly as well.