Archive for March 2007

Open Source SQL Server Stuff (OSSSS)

I was on the CodePlex (www.codeplex.com) site today looking for the Analysis Services Stored Procedure project, and found that they have a number of SQL Server-related projects available. First, all of the SQL Server samples have been uploaded to the site, which is nice if you only want to grab one or two of them.


There is also the aforementioned AS Stored Procedure project (http://www.codeplex.com/ASStoredProcedures), which has some great examples of extending AS functionality with .NET-based stored procedures.


Then there is the SQL Inserter project, which can generate INSERT statements from database tables. It’s handy for quickly moving small amounts of data, if you don’t want to use the Import/Export Wizard.


There are also a couple of test data generators (http://www.codeplex.com/datagenerator and http://www.codeplex.com/TdGen). Data Generator is designed for use in .NET-based development, while TdGen is aimed at databases.

Writing a Resultset to a Flat File

This is another example package. This one retrieves a resultset and stores it in a variable. It then uses a data flow with a custom script source to read each row of the resultset and output it as a delimited string. Why would you want to do this? Well, it enables you to treat a resultset generically from a data flow perspective. There is only a single “column” defined in the metadata. So this works well if you want to export an unknown number of columns from a resultset to a flat file.



First, use an Execute SQL task to retrieve a resultset and store it in a variable. Then connect it to the data flow.
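For reference, here’s roughly how that task might be configured (the query is just a placeholder; any SELECT will work, since the data flow doesn’t care about the columns):

ResultSet: Full result set
SQLStatement: SELECT FirstName, LastName, City FROM Person.Contact
Result Set tab: Result Name 0 mapped to a variable of type Object (e.g. User::resultset)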


A custom script component acts as the source for the data flow. When adding the script component, make sure to select Source as the script component type. Then select Inputs and Outputs, add a single output column to the existing output, and set the column’s type to DT_STR with a length of 1000 (or something long enough to hold your widest row). The script below assumes the output is named Output and the column is named DelimitedRow.


Select the Script option next, and add the resultset variable to the ReadOnlyVariables property. Click the Design Script button, and add the following to the CreateNewOutputRows function (script adapted from Jamie Thomson’s example at http://blogs.conchango.com/jamiethomson/archive/2006/01/04/SSIS_3A00_-Recordsets-instead-of-raw-files.aspx):

' Load the resultset variable into a DataTable so we can iterate its rows.
Dim oledbAdapter As New OleDb.OleDbDataAdapter
Dim dataTable As New System.Data.DataTable
Dim dataRow As Data.DataRow
Dim delimitedValues As String
Dim i As Integer

oledbAdapter.Fill(dataTable, Me.Variables.resultset)

For Each dataRow In dataTable.Rows
    OutputBuffer.AddRow()

    ' Concatenate every column in the row into one delimited string.
    delimitedValues = String.Empty
    For i = 0 To dataRow.ItemArray.GetUpperBound(0)
        delimitedValues = delimitedValues + dataRow.Item(i).ToString()
        ' Add the delimiter after every column except the last.
        If i <> dataRow.ItemArray.GetUpperBound(0) Then
            delimitedValues = delimitedValues + "|"
        End If
    Next

    OutputBuffer.DelimitedRow = delimitedValues
Next


You’ll also need to add a reference to the System.XML assembly (select Project, then Add Reference, from the menu). If you want to use a different delimiter, just change the "|" constant in the code.


Now you can connect this to a flat file destination with a single column defined, and output your resultset to it without worrying about the number or type of columns.

Using For Each to Iterate a Resultset

Update: Kirk Haselden has a great example of this that I found right after I posted this. http://sqljunkies.com/WebLog/knight_reign/archive/2005/03/25/9588.aspx 


There have been a number of questions in the SSIS forums recently about iterating through a collection of items and performing an operation on each one, so I put together a quick example of doing this.


The example package retrieves a list of parent rows from one table, then uses a Foreach Loop to iterate through each parent row. It uses the ID from the parent row to retrieve the matching child records from a second table.



This example assumes the following tables in your database:


CREATE TABLE [dbo].[ChildTable](
  [ID] [int] IDENTITY(1,1) NOT NULL,
  [ParentID] [int] NOT NULL,
  [ChildName] [varchar](25) NULL,
  CONSTRAINT [PK_ChildTable] PRIMARY KEY CLUSTERED ( [ID] ASC )
)

CREATE TABLE [dbo].[ParentTable](
  [ID] [int] IDENTITY(1,1) NOT NULL,
  [ParentName] [varchar](25) NOT NULL,
  CONSTRAINT [PK_ParentTable] PRIMARY KEY CLUSTERED ( [ID] ASC )
)

It also assumes that ParentID in ChildTable is a foreign key pointing to ID in ParentTable.
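If you want to enforce that relationship in the database, something along these lines would do it:

ALTER TABLE [dbo].[ChildTable] ADD CONSTRAINT [FK_ChildTable_ParentTable]
    FOREIGN KEY ([ParentID]) REFERENCES [dbo].[ParentTable] ([ID])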

The attached package starts with an Execute SQL task that retrieves the records from ParentTable and stores them in a variable named “parentResultset”. The Foreach Loop container is set to use the “parentResultset” variable as the source variable for enumeration. It maps the ID column from the “parentResultset” variable to the “parentID” variable on the Variable Mappings tab.
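For reference, the relevant Foreach Loop settings look roughly like this (reconstructed from the description above):

Collection tab:
  Enumerator: Foreach ADO Enumerator
  ADO object source variable: User::parentResultset
  Enumeration mode: Rows in the first table

Variable Mappings tab:
  Variable: User::parentID    Index: 0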

There is also a variable named “select”, which uses an expression to build the following statement: “SELECT ID, ParentID, ChildName FROM ChildTable WHERE ParentID = ” plus the value of the “parentID” variable. The “select” variable is then used in another Execute SQL task.
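A sketch of what that expression might look like, assuming “parentID” is an integer variable (the cast is needed because you can’t concatenate an integer directly to a string in an SSIS expression):

"SELECT ID, ParentID, ChildName FROM ChildTable WHERE ParentID = " + (DT_WSTR, 10) @[User::parentID]

Remember to set EvaluateAsExpression to True on the “select” variable so the expression is actually evaluated.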

Hopefully this helps illustrate how to use the Foreach Loop.

Comparing methods for translating values in SSIS

A question came up recently on the SSIS forums on translating lookup codes into a description within the data flow. That is, given code “A”, how do you translate it into a more descriptive value, such as “Apple”?


As a general rule of thumb, if the codes and values might change over time, or if there are a significant number of them, I would store them in a database table and use the Lookup component to handle the translation. This makes maintenance much easier, and the Lookup component performs very well. In fact, I do this in most cases, since even things that aren’t supposed to change have a habit of doing so.


The question posed on the forum involved a small set of four translations. In this case, assuming these values were not going to change often or at all, the suggestions were to use one of three methods: the script component, the derived column component, or the lookup component with a hard-coded list of values (suggested by Phil Brammer, http://www.ssistalk.com).


I decided to run each method through its paces to determine how each one performs. I was fairly confident that the script component would be slower than the other options, but I was less sure about the lookup and derived column. So I set up a test package for each scenario, and ran some rows through them to see how they did.


The testing methodology is described below, if you’re interested, but I’m going to jump right into the results. I ran each package 5 times for 500,000 rows and another 5 times for 1,000,000 rows. The lookup performed the best in each test, and the script performed the worst. The derived column transform came close to the lookup, and actually beat it once in the million-row test, but the lookup still came out ahead based on the overall average. We’re only talking about a couple of seconds over a million rows, but every little bit helps.



So, it looks safe to continue using the lookup transform for any type of translation, even when the values are hard-coded.


Scenarios for Testing


Each package has a single data flow in it. The script package’s data flow runs the rows through a script component containing this code:


Select Case Row.LOOKUPVALUE
    Case "A"
        Row.Description = "APPLE"
    Case "B"
        Row.Description = "BANANA"
    Case "C"
        Row.Description = "CANTALOUPE"
    Case "D"
        Row.Description = "DATE"
    Case "E"
        Row.Description = "EGGPLANT"
    Case Else
        Row.Description = "UNKNOWN"
End Select


The derived column package uses a Derived Column transform with this expression:


LOOKUP_VALUE == "A" ? "APPLE" :
LOOKUP_VALUE == "B" ? "BANANA" :
LOOKUP_VALUE == "C" ? "CANTALOUPE" :
LOOKUP_VALUE == "D" ? "DATE" :
LOOKUP_VALUE == "E" ? "EGGPLANT" : "UNKNOWN"


The lookup package uses a Lookup transform, with this query generating the reference values:


SELECT 'A' AS VALUE, 'APPLE' AS DESCR
UNION ALL
SELECT 'B', 'BANANA'
UNION ALL
SELECT 'C', 'CANTALOUPE'
UNION ALL
SELECT 'D', 'DATE'
UNION ALL
SELECT 'E', 'EGGPLANT'


The unknown values are dealt with by setting the lookup to ignore failures, then filling in the value with a derived column transform:


TRIM(LOOKUP_VALUE) == "" ? "UNKNOWN" : DESCRIPTION


I used 6 values (A, B, C, D, E, and an empty string) in the source. A, B, C, and D were each used in 20% of the source rows, and E and the empty string accounted for 10% of the source rows each.
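If you want to reproduce a similar test, a script source along these lines would generate the same distribution (this is my own reconstruction, not the original test harness; it assumes an output column named LOOKUPVALUE):

Public Overrides Sub CreateNewOutputRows()
    ' A, B, C, and D each appear twice per 10 rows (20%); E and the
    ' empty string appear once each (10%).
    Dim values() As String = New String() {"A", "A", "B", "B", "C", "C", "D", "D", "E", ""}
    Dim i As Integer

    For i = 0 To 999999
        OutputBuffer.AddRow()
        OutputBuffer.LOOKUPVALUE = values(i Mod 10)
    Next
End Sub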

Sending Header and Detail to different destinations

A question came up today on the SSIS forum about processing flat files that include both header and detail information. The poster wanted to send header information to one destination and detail information to another. The header rows have two columns; the detail rows have five.


CUSTOMER|ABC
COST|1000
ITEMS
C1|12345|123|123|123
C2|12345|123|123|123
C3|12345|123|123|123
C4|12345|123|123|123


One way to solve this is with a script component that has two outputs, redirecting rows based on programmatically determining what type of row is being dealt with. The advantages of this approach are:



  • You only read the source once

  • Completely flexible for whatever logic you want to use to identify header and detail rows

  • It handles files that have repeated header/detail sections

A downside is that you will probably have to treat each row from the source as a single column, and break the columns out yourself. Project REAL has a great example of this approach (though they show it in custom components, the same concepts apply), so rather than try to cover it again here, I’ll provide a link to their documentation: http://msdn2.microsoft.com/en-us/library/aa964134.aspx
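For the simple file above, a minimal version of that script might look something like the following. This is just a sketch; it assumes the source hands each row over as a single column named Line, and that the component has two asynchronous outputs (SynchronousInputID set to None) named HeaderOutput and DetailOutput, each with one Line column.

Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
    ' Header rows in the sample file have at most two fields;
    ' detail rows have five.
    If Row.Line.Split("|"c).Length <= 2 Then
        HeaderOutputBuffer.AddRow()
        HeaderOutputBuffer.Line = Row.Line
    Else
        DetailOutputBuffer.AddRow()
        DetailOutputBuffer.Line = Row.Line
    End If
End Sub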


An alternative approach is to read the same flat file twice: once to get the header rows, and once to get the detail. This works well when you will only have one header/detail combination per input file.


To do this for the file defined above, you need to create two connections to the flat file: one for the header, defining two columns, and one for the detail, defining five. For the detail connection, set the header rows to skip to 3. The detail source then needs no special handling.
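In the two flat file connection managers, the key settings would look roughly like this:

Header connection: Delimited format, column delimiter |, 2 columns, header rows to skip = 0
Detail connection: Delimited format, column delimiter |, 5 columns, header rows to skip = 3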


The header source flow does need a few additions. First, add a script component. In the script component, add a new column to the output (I named mine Rows) to hold the row count. Use the following code to add the row count:


' Static keeps its value between calls, so this counts rows across the whole file.
Static RowCount As Integer = 0

RowCount = RowCount + 1
Row.Rows = RowCount


Next, add a conditional split component. Add a condition for everything with a Rows value of 2 or less (Rows <= 2), so that only the two header rows we want come through. This flow can now be written to the appropriate destination.


Rather than using a simple row count, you could use the script component to do pattern matching with Like, or other techniques, to identify the header rows, and output a simple boolean flag indicating whether the row should be included.
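For example, something like this in the script would flag header rows by content instead of position (again a sketch, assuming the row arrives in a column named Line and the output has a Boolean column named IsHeader):

Row.IsHeader = (Row.Line Like "CUSTOMER|*") OrElse (Row.Line Like "COST|*")

The conditional split would then just test the IsHeader flag instead of the row count.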


I’ve attached a simple sample package that demonstrates this. It uses the Konesans Trash Destination Adapter (http://www.sqlis.com/56.aspx), so you’ll need that to run the sample.

Welcome!

Welcome. I’m John Welch, a managing consultant with Mariner (www.mariner-usa.com), a business intelligence consulting firm in North Carolina. I’m starting this blog to record some of my random thoughts about BI and data warehousing. I have a background in more traditional application development, and was very interested in applying agile methodologies and development techniques in that world. I’ve found that a lot of data warehousing projects don’t take advantage of agile practices, so that will likely be a common topic in this blog.


I’m currently working with the Microsoft technology stack (SQL Server, Analysis Services, Integration Services, etc.), so those topics will also be highlighted.


Thanks for reading.