Posts tagged ‘Examples’

Insert Parent and Child Tables with SSIS Part 2

Yesterday I posted a simple approach for inserting data into tables with a parent-child relationship (like Order Header to Order Detail) where you need to know the generated identity key to tie the records together. That approach relied on having a business value that could be used to consistently tie the parent and child data together. I ended that post with a question – what happens if you don’t have that value?

One example of this might be an order feed that comes from an external company. Each row contains both the order header and the order detail information. Each of these rows should be broken up and inserted into two tables, one for the header data, the other for the detail data. Assuming that there is no business value that can tie these together, you need to get the identity key for the parent row as soon as it is inserted.
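
For reference, here’s a minimal sketch of the kind of schema involved; the table and column names (parent, child, order_id, and so on) are assumptions chosen to line up with the INSERT statements later in this post:

CREATE TABLE parent (
    order_id    int IDENTITY(1,1) PRIMARY KEY,
    order_date  datetime NOT NULL,
    customer_id int NOT NULL
);

CREATE TABLE child (
    order_detail_id int IDENTITY(1,1) PRIMARY KEY,
    order_id        int NOT NULL REFERENCES parent (order_id),
    product_id      int NOT NULL,
    quantity        int NOT NULL
);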

One approach for this involves retrieving a result set using an Execute SQL task, and shredding it with a For Each Loop. This approach does have some downsides – please review the Pros and Cons section below for more details. Within the For Each Loop, two Execute SQL Tasks are used – one to insert the header row and retrieve its ID value, the other to insert the child row.

Retrieving the result set in the initial Execute SQL task is straightforward – just retrieve all the order rows. You can use a For Each loop to shred the retrieved result set by selecting the For Each ADO enumerator, and mapping the columns in the result set to variables. The For Each will run once per row in the result set, setting the variable values to match the current row.
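
The query itself can be a simple sketch like the one below (staging_orders and its column names are hypothetical), with the Execute SQL task’s ResultSet property set to Full result set and the result mapped to an Object variable:

SELECT customer_id, product_id, quantity
FROM staging_orders;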

The next Execute SQL task is a little more interesting. It inserts the header row and uses the T-SQL OUTPUT clause to return the identity value for that row.

INSERT INTO parent (order_date, customer_id)
        OUTPUT INSERTED.order_id
        VALUES (GETDATE(), ?)

If you aren’t using a database engine that supports OUTPUT or similar functionality, you can execute a second SQL statement in the same task to get the identity value, using SCOPE_IDENTITY(), @@IDENTITY (be aware that it isn’t scope-safe, so triggers can skew the result), or the appropriate function for your database.
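
As a sketch, the SCOPE_IDENTITY() variant might look like this, using the same hypothetical parent table as above:

INSERT INTO parent (order_date, customer_id)
        VALUES (GETDATE(), ?);
SELECT CAST(SCOPE_IDENTITY() AS int) AS order_id;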

The Execute SQL task maps the returned identity value to another package variable (order_id in this example).

The next Execute SQL task can then use that variable when inserting the child record into the order details table.
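
That statement might look like the following sketch, with the first parameter mapped to the order_id variable captured in the previous task, and the remaining parameters mapped to the detail variables from the For Each Loop (the child column names are assumptions):

INSERT INTO child (order_id, product_id, quantity)
        VALUES (?, ?, ?)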

Pros and Cons

This approach has the advantage of guaranteeing that your child records are matched to the correct parent records. This can be a big plus if you don’t have a reliable way to match the records when making two passes.

There are a couple of downsides, though. The biggest one is speed. This approach forces you into RBAR (row by agonizing row) mode, so it’s going to be slow, especially compared to a batch load operation. The second problem is that if you are dealing with a lot of columns, creating all the variables and maintaining the mappings isn’t a fun experience.

Overall, I prefer the approach I mentioned in the first article, and I tend to avoid this one if possible. In some cases, even if you don’t have a reliable key, you may be able to work around it by staging some data to a working table or modifying the data source to provide a reliable key.

The sample package for this post is on my SkyDrive.

Inserting Parent and Child Tables with SSIS

A relatively frequent occurrence in data integration is the need to insert data into a parent table (like Order Header) and insert related records into a child table (Order Details). If the data is already populated with appropriate keys, and you are just copying it, this isn’t too complex – just copy the parent table first, then the child. What if the new tables use identity keys, though? You need to get the new identity key for each header row before you can insert the child row. This post is going to walk through one pattern for doing this, and I’ll show an alternate approach in my next post.

The first approach assumes that you have some common piece of information to link the data. In this case, often the simplest approach is to use two Data Flow tasks, run in sequence. This is my preferred solution to this problem, because it’s fast and it’s usually straightforward to implement.

The first Data Flow loads the parent records, and the second loads the child records. Note that the second Data Flow can’t run until the first succeeds.

The first Data Flow is pretty straightforward, and simply retrieves the parent (order header) data and inserts it into the parent table. The most important item here is that the source component retrieves the right data – that is, one row per order header – and that it includes some information that can be used to uniquely identify the order. In the sample package I’ve linked to below, you’ll see that the source of the order records is a single table, where a given row includes both header and detail information. The source query for the data flow selects and groups on customer ID, as that uniquely identifies the order in this scenario (one order per customer, per day).
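
As a sketch, the source query might look something like this, where staging_orders is a hypothetical staging table with one combined header/detail row per order line:

SELECT customer_id, order_date
FROM staging_orders
GROUP BY customer_id, order_date;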

The second data flow retrieves the order detail rows from the same source table. It then uses a Lookup transform to retrieve the correct order ID (the identity key) from the parent table. The Lookup just needs enough data to make a unique match – in this case, that’s the current date and the customer ID.
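
The Lookup’s reference query only needs the key columns plus the generated identity – something like this sketch:

SELECT order_id, customer_id, order_date
FROM parent;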

That’s really all there is to the simple pattern. You can find a sample package that illustrates this on my SkyDrive. But there can be more complex scenarios where you still need to handle a Parent / Child insert. For example, what if there is no reliable key to tie the order detail rows to the order header? In this case you can’t use the lookup. Stay tuned for the next post, where I’ll discuss a different pattern that can handle this scenario, but involves some tradeoffs.

Passing an Object from a Parent Package To a Child

Occasionally, you may run into the need to pass values between packages. In most cases, you can use a Parent Package Variable configuration to pass the value of a variable in the calling (or parent) package to the called (or child) package. However, Parent Package Variable configurations can’t be set up on variables of type Object. Fortunately, you can work around this pretty easily, thanks to the fact that the calling package variables are inherently accessible in the called packages. 

I’ve set up a sample parent and child package to illustrate this. The parent package is straightforward. It uses an Execute SQL task to populate an object variable named TestVar with a recordset object, and then calls the child package.

The child package has a Foreach Loop Container to iterate over the recordset object, and a Script task that copies the parent package’s variable (TestVar) to a local variable named LocalVar. The local variable is the one the Foreach Loop is configured to use. Why copy the value? If you don’t have a local variable to reference in the Foreach Loop, it won’t validate properly.

The script in Copy Variable is pretty simple. It relies on the fact that you can reference parent package variables inherently, as they are included in the collection of variables accessible in the local package. The script just copies the value from one variable to the other, so that the Foreach Loop will have something to do.

public void Main()
{
    Variables vars = null;
    Dts.VariableDispenser.LockForWrite("User::LocalVar");
    Dts.VariableDispenser.LockForRead("User::TestVar");
    Dts.VariableDispenser.GetVariables(ref vars);

    vars["User::LocalVar"].Value = vars["User::TestVar"].Value;
    vars.Unlock();

    Dts.TaskResult = (int)ScriptResults.Success;
}

Please note that for this to work, you cannot have a variable in the child package with the same name as the variable in the parent package. If you do, the local variable will hide the parent variable. Outside of that, this works really well for passing object values between packages. The same technique can also be used in reverse to send values back to the parent package, if you have that need.

The sample has been uploaded to my SkyDrive. Let me know if you have any questions.

Continuing a Loop After an Error

A common question on the SSIS forums is how to continue a loop after an error occurs. For example, you might have a For Each Loop container that iterates through a set of files in a folder. If one file is invalid, you want to continue processing the remaining files. Fortunately, there is a simple pattern that allows this to be accomplished.

I created a package to illustrate this. Basically, the package implements the scenario above. It consists of a For Each Loop to iterate through all the files in a folder. There is a data flow task that processes each input file, and runs it through a row count (just for illustration purposes). The Script task is there to illustrate that an error-handling task could be attached to the data flow, but it’s empty in the example package. It would need to be replaced with an appropriate task to handle a failure, such as a File System task to move the file to another folder.

The key to handling the errors is the Sequence container, or more specifically, how the Sequence container is set up. The OnError event handler for the Sequence container has been enabled. You can do this by selecting the Sequence container, then clicking the Event Handlers tab. Select the OnError event in the drop-down, then click the link in the designer to create an empty event handler. It’s not necessary to add anything to the event handler.

Once it has been created, go to the Variables window, and select the option to show System variables. Select the Propagate variable, and set the value to FALSE. This prevents any errors occurring on tasks inside the Sequence container from being raised to the For Each loop.

This is all that really needs to be done, but if you run the package with only this set, you’ll notice that the Data Flow task and the Sequence container both report an error. To prevent the Sequence Container from reporting an error, set the MaximumErrorCount property of the container to 0.

Note that you do not have to use a sequence container to accomplish this. You could make the same settings on individual tasks inside the For Each loop. However, the Sequence container gives you a common place to make the settings, and you can have as many tasks inside it as you’d like.

The example package is posted on my SkyDrive. The zip file contains a folder called Import. The package expects this folder to be located in C:\Temp. It contains 4 text files for the package to process. The second file contains an error (one of its rows has fewer columns than it should), so that you can see the error behavior.

Calculating a Running Total Part 2

To finish up the topic that I started in my previous post, I wanted to post a more complex example of running total calculations. This is based on a request on the MSDN forums for a running total calculation that would “slide” along as additional rows were processed. In the poster’s example, the running total needed to reflect the total for the last 10 weeks.
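
(As an aside: if the data lives in a relational source and you’re on SQL Server 2012 or later, a sliding total like this can also be expressed directly in T-SQL with a window frame. The table and column names below are placeholders.)

SELECT week_ending, order_qty,
       SUM(order_qty) OVER (ORDER BY week_ending
                            ROWS BETWEEN 9 PRECEDING AND CURRENT ROW) AS SlidingTotal
FROM weekly_sales;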

To show how to accomplish this, I’m going to update the same package that was used in the previous example.

The script component has the same RunningTotal column defined on its output.

The primary difference is in the script itself. Since I want to keep a list of the previous 10 records, I am using a Queue from the System.Collections.Generic namespace. This collection makes it easy to remove the oldest item, so it works very well for this example. In the script, if the queue already holds 10 items, the oldest item is removed before the new one is added. The running total is produced by iterating through all the items in the queue, and the result is stored in the RunningTotal output column.

Public Class ScriptMain
    Inherits UserComponent
    Dim runningTotalCollection As Queue(Of Integer) = New Queue(Of Integer)
 
 
    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        If runningTotalCollection.Count >= 10 Then
            runningTotalCollection.Dequeue()
        End If
 
        runningTotalCollection.Enqueue(Row.OrderQty)
 
        Dim x As Integer
        Dim sumX As Integer
        For Each x In runningTotalCollection
            sumX += x
        Next
 
        Row.RunningTotal = sumX
    End Sub
 
End Class

This is another example of the flexibility of the Script component. It also highlights that by using the Script component, you can store values from previous rows and reference them later. The example package is posted on my SkyDrive.

Calculating a Running Total In SSIS

Running totals are a common calculation for reporting, and are occasionally needed in ETL processes to populate aggregate tables. One way to accomplish this in SSIS, if you are retrieving data from a relational database, is to do the running total in the SELECT statement. There are a number of resources on the Internet that illustrate how to accomplish this. However, this may not work for you if you are retrieving information from a flat file. Also, the logic to implement a running total may not be straightforward in SQL, particularly if you need sliding windows for the totals (I’ll explain this later).
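
For example, on SQL Server 2012 and later the straightforward case can be handled with a windowed SUM like the sketch below; older versions typically rely on correlated subqueries or self-joins instead:

SELECT SalesOrderID, ProductID, OrderQty,
       SUM(OrderQty) OVER (PARTITION BY SalesOrderID
                           ORDER BY SalesOrderDetailID
                           ROWS UNBOUNDED PRECEDING) AS RunningTotal
FROM Sales.SalesOrderDetail;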

Fortunately, there are other ways to accomplish this in SSIS. A very flexible method is to use a Script component, as this lets you customize the logic however you prefer. As an illustration, I’ll walk through a couple of examples from AdventureWorks.

First, the more straightforward example. The AdventureWorks database has a Sales.SalesOrderDetail table that lists the itemized purchases for each order. Note – I’m only filtering the results for example purposes. The actual code in the package works with the whole table.

If I were to calculate a running total of the OrderQty for each SalesOrderID, I’d expect to see:

SalesOrderID  ProductID  OrderQty  RunningTotal
43685         765        3         3
43685         763        1         4
43685         754        1         5
43685         725        1         6
43686         758        3         3
43686         762        1         4
43686         770        1         5
43687         768        1         1
43687         765        2         3

Notice that the running total resets each time the SalesOrderID changes. To implement this, I’ve created a new Script component and added an output column to hold the running total.

Inside the script component, I’ve added a variable to store the value of the control number (SalesOrderID), and another variable to capture the running total. When the control number changes, the running total is reset.

Public Class ScriptMain
    Inherits UserComponent
    Dim runningTotal As Integer
    Dim previousValue As Integer
 
 
    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        If Not previousValue = Row.SalesOrderID Then
            runningTotal = 0
            previousValue = Row.SalesOrderID
        End If
 
        runningTotal += Row.OrderQty
 
        Row.RunningTotal = runningTotal
    End Sub
 
End Class

This is an easy method to calculate running totals in the data flow. One thing to note is that the data must be sorted by the control number (SalesOrderID in the example) in order for this to work properly.
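
In this example, that just means adding an ORDER BY to the source query (or using a Sort transform if the source can’t deliver ordered data); a sketch:

SELECT SalesOrderID, ProductID, OrderQty
FROM Sales.SalesOrderDetail
ORDER BY SalesOrderID;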

The example package is on my SkyDrive here. I will follow up this post with another sample showing a more complex "running total" problem and how it can be solved in SSIS.

Bundling Errors in the Data Flow

I’ve posted about handling multiple errors in SSIS before, but that post was addressing it in the context of the control flow. What if you want to capture a series of data errors from the data flow? An example of this might be a requirement to capture and email all the rows that fail in a Lookup transform, so that the reference table can be updated with new values.

There are a number of ways to do this. You can send the error output of the Lookup to a Flat File destination, and then attach that to an email message using the Send Mail task. If you want to avoid writing anything to disk, you could send the error rows to a Recordset destination, and then use a Script Task to put them into a string variable to send. Or you could just write them directly to a string, using a Script destination.

I’ve put together a sample package that shows how to send the error rows to either a flat file or a string variable, and then how to email both. The control flow consists of a Data Flow task, and two Send Mail tasks (you won’t need both, they are just there for illustration purposes).

The data flow is fairly basic. The OLE DB Source uses a hard-coded query to return the numbers 1 through 5. The Lookup uses a hard-coded query to return a reference table with 1 and 2, so we expect 3, 4, and 5 to fail. The Row Count transform is used to capture the number of rows in error, so that we know if an email needs to be sent. The Multicast is used so that we can show both methods of handling the error rows. The Write Errors to File destination simply sends the error rows to a flat file.
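
The two hard-coded queries are along these lines (a sketch; the exact text in the sample may differ):

-- OLE DB Source: the numbers 1 through 5
SELECT 1 AS KeyValue
UNION ALL SELECT 2
UNION ALL SELECT 3
UNION ALL SELECT 4
UNION ALL SELECT 5;

-- Lookup reference table: only 1 and 2, so 3, 4, and 5 fail the lookup
SELECT 1 AS KeyValue
UNION ALL SELECT 2;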

The Write Errors to String Script Destination uses a StringBuilder class to capture each row, then writes it to the errorRows variable at the end of data flow processing. The StringBuilder is more efficient for handling large strings, which is why it is used here. If the package was processing a large number of rows, the error row string could get quite large.

Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper
Imports System.Text
 
Public Class ScriptMain
    Inherits UserComponent
    Dim errorMessages As StringBuilder = New StringBuilder()
 
    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        errorMessages.AppendLine(String.Format("Key value {0} not found in the reference table.", Row.KEYVALUE))
    End Sub
 
    Public Overrides Sub PostExecute()
        Dim vars As IDTSVariables90
 
        Me.VariableDispenser.LockOneForWrite("errorRows", vars)
        vars("errorRows").Value = errorMessages.ToString()
        vars.Unlock()
    End Sub
 
End Class

Back in the control flow, each of the Send Mail tasks has a precedence constraint that also has an expression set on it. The expression is just checking to see if there were any errors. If not, we don’t need to send the emails.

The Send Mail – No Attachment task uses the string variable populated by the Script destination in the data flow as the Message Source for the email.

The Send Mail – Attachment task, on the other hand, attaches a file to the message. There is something to point out with this method. The package has a string variable that holds the path and filename of the error file. This variable is used in an expression on the ConnectionString property of the flat file connection manager, and in an expression on the FileAttachments property of this Send Mail task. This lets us set the path in one place, and avoids changing it in multiple locations if it ever needs to be updated.

That’s all there is to it. I’ve uploaded the sample package to my SkyDrive, located here. To run it, you’ll need to update the SMTP server address in the SMTP connection manager, and the ErrorFile variable to point to a valid path on your machine.

Adding Headers and Footers to Flat Files

A common question with SSIS is how to handle adding headers and footers to a flat file when their format is different from the data rows. This gets more difficult when the header or footer needs to contain data based on the actual data rows, such as a total or a record count.


Basically, this can be accomplished by creating the header or footer in a separate task, and then prefixing or appending it to the data. I’ve prepared a sample package that shows a couple of different ways to accomplish this. For footers, it shows how to use a second data flow to append a footer row onto the data file. There are two methods shown for adding headers. Method 1 uses a Script task to add the header by loading the file into a string, adding the header, and saving the file again. Method 2 creates a separate header file, and uses a file copy operation to concatenate the two files.


There are a couple of things to note about the package. There are three flat file connection managers defined in the package.


Two of the connection managers (Destination and DestinationWithFooter) point to the same file, but have different column definitions. All three have their connection strings set by expressions that point to variables. If you want to change the target files, you need to modify the variables, not the connection managers.


The DestinationFile variable holds the path to a working file, which holds the data before the headers and footers are added. The DestinationHeader variable holds the path to a working file used only for Method 2, which holds the header row before it is concatenated into the final file. The FinalFile variable holds the path to the final output file, which includes the header, data, and footer rows.


The Generate File data flow uses a Data Generator component from Konesans (a free download from www.sqlis.com). It simply generates some data, captures the number of rows using a Row Count transform, and outputs it to the Destination connection manager.


The Append Footer and Create a Header File data flows are very similar. They both use a Script source to generate a single row containing the footer or header data, respectively. The script is below.



Public Overrides Sub CreateNewOutputRows()
    Dim recordCount As Integer
    Dim vars As IDTSVariables90

    'Get the record count
    Me.VariableDispenser.LockOneForRead("RecordCount", vars)
    recordCount = CType(vars("RecordCount").Value, Integer)
    vars.Unlock()

    'Output one row with record count
    Output0Buffer.AddRow()
    Output0Buffer.FooterRow = String.Format("Footer Row Count: {0}", recordCount)
    Output0Buffer.SetEndOfRowset()
End Sub

The two differ in the way the destination is handled. In the Append Footer data flow, the Flat File destination is set to append to the file, rather than overwriting it. It uses the DestinationWithFooter connection manager, which points to the same file that the data was sent to in the Generate File data flow. The DestinationWithFooter connection manager defines the file as having a single column. That makes it easier to append the footer row directly to the file. After the Append Footer data flow task has run, there is a single file that has the data rows and a footer row.


In the Create a Header File data flow, the header row is sent to a new file. Then, in the Append Files Execute Process task, the header file and the data file are concatenated together using a copy command (along the lines of copy header.txt + data.txt final.txt, with the actual paths supplied from the package variables). This is Method 2 for adding the header.


For Method 1, the Add Header Script task uses the System.IO.File class and the System.Text.StringBuilder class to create a new file containing the header row and the data rows (which already include the footer).



Public Sub Main()
    Dim fileContents As New StringBuilder()
    Dim vars As Variables
    Dim recordCount As Integer
    Dim finalFile As String

    'Get the record count
    Dts.VariableDispenser.LockForRead("RecordCount")
    Dts.VariableDispenser.LockForRead("FinalFile")
    Dts.VariableDispenser.GetVariables(vars)
    recordCount = CType(vars("RecordCount").Value, Integer)
    finalFile = CType(vars("FinalFile").Value, String)
    vars.Unlock()

    'Write header, then append file contents and write back out.
    fileContents.AppendLine(String.Format("Header Row Count 1: {0}", recordCount))
    fileContents.Append(File.ReadAllText(Dts.Connections("Destination").ConnectionString))
    File.WriteAllText(finalFile, fileContents.ToString())

    Dts.TaskResult = Dts.Results.Success
End Sub

Hopefully this example is helpful if you are trying to sort through the options for adding headers and footers. The example is available on my SkyDrive here.

Handling Multiple Errors in SSIS Revisited

Previously, I posted about how multiple errors in SSIS can be combined into a single unit for emailing or logging. Recently, a thread on the MSDN forums pointed out that the example wasn’t quite complete. While it does show how to collect all the error messages and get them into a single string, it doesn’t show how to email or log that string only once. So I’m giving it another shot, and showing two options for handling this. This post builds on the previous one, so please read it before proceeding.


If you download the sample package from the previous post and examine it, you’ll see that there is a Script task in the OnPostExecute event handler. This Script task is where the error messages are assembled, and where the code to deliver the bundled error message would be placed. Please note – the script does not actually email or log the message; that was left open for the reader to implement.


If you run the package, you’ll notice that the OnPostExecute event fires twice. It’s firing once for the data flow, and once for the package. Since the OnPostExecute event handler is where the script is located, this causes the message to be delivered twice – not what we want. To work around this, we need to make sure the OnPostExecute only fires once. One way to handle this is to set the Propagate system variable to false. The Propagate variable controls whether the event is raised to the next container. In this case, if Propagate is set to false on the data flow’s OnPostExecute event, the package’s OnPostExecute will only be fired once.


However, this approach requires that every task in the package have the Propagate variable set to false in its OnPostExecute event handler. To work around this, I recommend incorporating a single Sequence Container, with the Propagate variable set to false on its OnPostExecute. Any tasks in the package should be added inside the Sequence Container. If you do this, the Propagate variable only needs to be set once.


Another approach is to eliminate the use of the OnPostExecute event altogether. To do this, the Script task needs to be moved to the control flow, and the OnPostExecute event handler can be deleted. A Sequence Container should still be used to hold all of the tasks that need to be grouped for error handling. The Script task should be connected to the Sequence Container with a Failure constraint.


This approach also ensures that the Script Task will only execute once, if there is a failure in the Sequence Container.


Hopefully, this helps clarify how to leverage collecting the error messages to only send a single error result. I’ve uploaded the samples here, so please feel free to take a look at them and let me know if you have questions.

Changing Destination Tables With the OLE DB Destination

The OLE DB Destination supports using a variable to supply the name of the target table. This allows you to switch the destination table dynamically, but there is a caveat: the columns in each table that you will use must match, or the OLE DB Destination will fail validation with a VS_NEEDSNEWMETADATA error.


What’s a potential way that you might use this capability? Well, one recent scenario posed on a forum was this: A set of rows should be loaded to Table A, unless Table A already had rows in it. If it did, the rows should be loaded to Table B. This is fairly straightforward to implement in SSIS, thanks to expressions and the OLE DB Destination working with a variable.


To set this up, I started with an Execute SQL task. It was configured to retrieve the record count for Table A into a variable called User::RecordCount. A second variable, User::TableName, was created to hold the table name. The value for this variable is determined from the following expression:


@[User::RecordCount] == 0 ? "TableA" : "TableB"


This expression evaluates the RecordCount variable. If it is equal to 0, Table A is used, otherwise Table B is used.
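
For completeness, the query in the initial Execute SQL task is just a row count, with the single-row result mapped to User::RecordCount; a sketch, assuming the first table is dbo.TableA:

SELECT COUNT(*) AS RecordCount FROM dbo.TableA;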


The Execute SQL task is connected to a data flow, with an OLE DB Source and the OLE DB Destination. The OLE DB Destination is configured to use the variable for the table name.


That’s all there is to it. I’ve posted a sample package to my SkyDrive here.


A few notes about the sample: There are three tables involved, the source (Numbers) and two destinations (TableA and TableB). All three tables have a single Integer column (named ID in Numbers, and ColumnA in TableA and TableB). The scripts for these are not included in the sample.
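
If you want to recreate them, a minimal sketch along these lines should work; note that both destination tables use the same column name, which keeps the metadata identical so the destination can switch between them:

CREATE TABLE dbo.Numbers (ID int);
CREATE TABLE dbo.TableA (ColumnA int);
CREATE TABLE dbo.TableB (ColumnA int);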