Using For Each to Iterate a Resultset

Update: Right after I posted this, I found a great example of the same technique from Kirk Haselden: http://sqljunkies.com/WebLog/knight_reign/archive/2005/03/25/9588.aspx


There have been a number of questions in the SSIS forums recently about iterating through a collection of items and performing an operation for each one, so I put together a quick example of doing this.


The example package retrieves a list of parent rows from one table, then uses a Foreach Loop to iterate through each parent row. It uses the ID from the parent row to retrieve the matching child records from a second table.



This example assumes the following tables in your database:


CREATE TABLE [dbo].[ChildTable](
    [ID] [int] IDENTITY(1,1) NOT NULL,
    [ParentID] [int] NOT NULL,
    [ChildName] [varchar](25) NULL,
    CONSTRAINT [PK_ChildTable] PRIMARY KEY CLUSTERED ( [ID] ASC )
)


CREATE TABLE [dbo].[ParentTable](
    [ID] [int] IDENTITY(1,1) NOT NULL,
    [ParentName] [varchar](25) NOT NULL,
    CONSTRAINT [PK_ParentTable] PRIMARY KEY CLUSTERED ( [ID] ASC )
)

It also assumes that ParentID in ChildTable references the ID column in ParentTable.
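If you want to enforce that relationship, a foreign key along these lines will do it (the constraint name here is just my choice):

ALTER TABLE [dbo].[ChildTable]
ADD CONSTRAINT [FK_ChildTable_ParentTable]
FOREIGN KEY ([ParentID]) REFERENCES [dbo].[ParentTable] ([ID])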

The attached package starts with an Execute SQL task that retrieves the records from ParentTable and stores them in a variable named “parentResultset”. The Foreach Loop container is set to use the “parentResultset” variable as the source variable for enumeration. It maps the ID column from the “parentResultset” variable to the “parentID” variable on the Variable Mappings tab.
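The attached package has the actual query, but anything that returns the parent rows will work. Note that “parentResultset” needs to be an Object-typed variable, and the task’s ResultSet property should be set to “Full result set”:

-- Parent query for the first Execute SQL task
SELECT ID, ParentName
FROM dbo.ParentTable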

There is also a variable named “select”, which uses an expression to build the following SELECT statement: “SELECT ID, ParentID, ChildName FROM ChildTable WHERE ParentID = ” plus the value of the “parentID” variable. The “select” variable is then used as the SQL statement for a second Execute SQL task inside the loop.
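For reference, the expression on the “select” variable (with EvaluateAsExpression set to True) would look something like this, assuming “parentID” is an Int32 variable:

"SELECT ID, ParentID, ChildName FROM ChildTable WHERE ParentID = " + (DT_WSTR, 12) @[User::parentID]

The (DT_WSTR, 12) cast is needed because the expression language won’t concatenate a numeric value to a string without an explicit conversion.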

Hopefully this helps illustrate how to use the Foreach Loop.


Comparing methods for translating values in SSIS

A question came up recently on the SSIS forums on translating lookup codes into a description within the data flow. That is, given code “A”, how do you translate it into a more descriptive value, such as “Apple”?


As a general rule of thumb, if the codes and values might change over time, or if there are a significant number of them, I would store them in a database table and use the Lookup component to handle the translation. This makes maintenance much easier, and Lookup components have great performance. In fact, I do this in most cases, since even things that aren’t supposed to change have a habit of doing so.


The question posed on the forum was for a small set of 4 translations. In this case, assuming these values were not going to change often or at all, the suggestions were to use one of three methods: the script component, the derived column component, or the lookup component with a hard-coded list of values (suggested by Phil Brammer http://www.ssistalk.com).


I decided to run each method through its paces to determine how each one performs. I was fairly confident that the script component would be slower than the other options, but I was less sure about the lookup and derived column. So I set up a test package for each scenario, and ran some rows through them to see how they did.


The testing methodology is described below, if you’re interested, but I’m going to jump right into the results. I ran each package 5 times for 500,000 rows and another 5 times for 1,000,000 rows. The lookup performed the best in each test, and the script performed the worst. The derived column transform came close to the lookup, and actually beat it once in the million-row test, but the lookup still came out ahead based on the overall average. We are only talking about a couple of seconds over a million rows, but every little bit helps.



So, it looks safe to continue using the lookup transform for any type of translation, even when the values are hard-coded.


Scenarios for Testing


Each package has a single data flow in it. In the script package, the script component contains this code:


Select Case Row.LOOKUPVALUE
    Case "A"
        Row.Description = "APPLE"
    Case "B"
        Row.Description = "BANANA"
    Case "C"
        Row.Description = "CANTALOUPE"
    Case "D"
        Row.Description = "DATE"
    Case "E"
        Row.Description = "EGGPLANT"
    Case Else
        Row.Description = "UNKNOWN"
End Select


In the derived column package, the Derived Column transform uses this expression:


LOOKUP_VALUE == "A" ? "APPLE" : LOOKUP_VALUE == "B" ? "BANANA" : LOOKUP_VALUE == "C" ? "CANTALOUPE" : LOOKUP_VALUE == "D" ? "DATE" : LOOKUP_VALUE == "E" ? "EGGPLANT" : "UNKNOWN"


In the lookup package, the Lookup transform uses this query to generate the reference values:


SELECT 'A' AS VALUE, 'APPLE' AS DESCR
UNION ALL
SELECT 'B', 'BANANA'
UNION ALL
SELECT 'C', 'CANTALOUPE'
UNION ALL
SELECT 'D', 'DATE'
UNION ALL
SELECT 'E', 'EGGPLANT'


Unknown values are handled by ignoring failures in the lookup, and then setting the value in a derived column transform:


TRIM(LOOKUP_VALUE) == "" ? "UNKNOWN" : DESCRIPTION


I used 6 values (A, B, C, D, E, and an empty string) in the source. A, B, C, and D were each used in 20% of the source rows, and E and the empty string accounted for 10% of the source rows each.
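The source query isn’t included in the post, but a generator along these lines would produce the same distribution (the numbers-table trick and the LOOKUP_VALUE column name here are just for illustration):

-- Hypothetical test-row generator: A-D at 20% each, E and '' at 10% each
WITH Numbers AS (
    SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
    FROM sys.all_objects a CROSS JOIN sys.all_objects b
)
SELECT CASE
           WHEN n % 10 IN (0, 1) THEN 'A'
           WHEN n % 10 IN (2, 3) THEN 'B'
           WHEN n % 10 IN (4, 5) THEN 'C'
           WHEN n % 10 IN (6, 7) THEN 'D'
           WHEN n % 10 = 8 THEN 'E'
           ELSE ''
       END AS LOOKUP_VALUE
FROM Numbers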


Sending Header and Detail to different destinations

A question came up today on the SSIS forum about processing flat files that include header and detail information. The poster wanted to send the header information to one destination and the detail information to another. The header rows have two columns; the detail rows have five:


CUSTOMER|ABC
COST|1000
ITEMS
C1|12345|123|123|123
C2|12345|123|123|123
C3|12345|123|123|123
C4|12345|123|123|123


One way to solve this is with a script component that has two outputs and redirects each row after programmatically determining what type of row it is. The advantages of this approach are:



  • You only read the source once
  • It's completely flexible as to what logic you use to identify header and detail rows
  • It handles files that have repeated header / detail sections

A downside is that you will probably have to treat each row from the source as a single column and break the columns out yourself. Project REAL has a great example of this approach (though they show it in custom components, the same concepts apply), so rather than try to cover it again here, I’ll provide a link to their documentation: http://msdn2.microsoft.com/en-us/library/aa964134.aspx
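As a rough sketch of what the per-row logic looks like (the output and column names here are hypothetical; this assumes the file is read as a single column named Line, and that both outputs are synchronous to the input and share an exclusion group):

' Route each row based on how many columns it splits into
Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
    Dim cols() As String = Row.Line.Split("|"c)

    If cols.Length <= 2 Then
        ' CUSTOMER, COST, and ITEMS rows have one or two columns
        Row.DirectRowToHeaderOutput()
    Else
        ' Detail rows have five columns
        Row.DirectRowToDetailOutput()
    End If
End Sub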


An alternative approach is to read the same flat file twice: once to get the header rows, and once to get the detail. This works well when you will only have one header/detail combination per input file.


To do this for the file defined above, you need to create two connections to the flat file: one for the header, defining two columns, and the other for the detail, with five columns. For the detail connection, set the header rows to skip to 3. The detail source then needs no special handling.


The header source flow does need a few additions. First, add a script component. In the script component, add a new column to the output (I named mine Rows) to hold the row count. Use the following code to add the row count:


' Runs once per row; the Static variable keeps its value across calls
Static RowCount As Integer = 0

RowCount = RowCount + 1
Row.Rows = RowCount


Next, add a conditional split component, with a condition of Rows <= 2 so that only the first two rows (the header values we want) pass through. This flow can then be written to the appropriate destination.


Rather than using a simple row count, you could have the script component do pattern matching using Like, or other techniques, to identify the header rows, and output a simple Boolean flag indicating whether the row should be included.
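For example (a sketch; Column0 and IsHeader are hypothetical names for the first input column and the new output column):

' Identify header rows by pattern instead of position
Row.IsHeader = (Row.Column0 Like "CUSTOMER*" OrElse Row.Column0 Like "COST*")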


I’ve attached a simple sample package that demonstrates this. It uses the Konesans Trash Destination Adapter (http://www.sqlis.com/56.aspx), so you’ll need that to run the sample.


Welcome!

Welcome. I’m John Welch, a managing consultant with Mariner (www.mariner-usa.com), a business intelligence consulting firm in North Carolina. I’m starting this blog to record some of my random thoughts about BI and data warehousing. I have a background in more traditional application development, and was very interested in applying agile methodologies and development techniques in that world. I’ve found that a lot of data warehousing projects don’t take advantage of agile practices, so that will likely be a common topic in this blog.


I’m currently working with the Microsoft technology stack (SQL Server, Analysis Services, Integration Services, etc.), so those topics will also be highlighted.


Thanks for reading.
