Archive for May 2007

Dynamically Unpivoting Columns to Rows

There have been a few questions on the forums recently about taking column values and pivoting them to rows. For example, I might have an input file with 5 columns: a key column and 4 category columns.


Key;C1;C2;C3;C4
1;1;2;3;4
2;1;2;3;4
3;1;2;3;4
4;1;2;3;4


and I’d like an output that has 2 columns (the key and the category value):


1;1
1;2
1;3
1;4
2;1
2;2
2;3


Since my input has a set number of columns, I can use the Unpivot Transform to get the desired result.



But what if I have a variable number of columns to pivot? In that case the input file might look like this:


1;1;2;3
2;1;2;3;4;5
3;1
4;1;2;3;4;5;6;7;8
5;1;2;3;4;5;6;7;8;9
6;1;2;3


The first column still indicates the key, but there are a variable number of categories. In Handling Flat Files with Varying Numbers of Columns, I showed how to handle the varying number of columns by treating each row as a single column. This post is going to expand on that technique and show how to incorporate a pivot into the script.


The connection manager and the flat file source should be set up the same way as in the previous post, so I won’t cover that again. The difference is in the script component. Since I want to output more rows than I’m getting as inputs, I’m setting the script up with an asynchronous output. That is done by setting the SynchronousInputID property to None. Two columns were added to the output, one for the key, and one for the category value.



This is the code inside the script component:


Dim Values() As String
Dim i As Integer

Values = Row.Column0.Split(CChar(";"))

'Array is zero-based, but the zero index is the key value,
'so start with 1
For i = 1 To Values.GetUpperBound(0)
    Output0Buffer.AddRow()
    Output0Buffer.Key = CInt(Values(0))
    Output0Buffer.Value = CInt(Values(i))
Next

The Split function breaks the string up into an array with one element per column. The code loops through the array, starting with the second element, since the first one represents the key. For each iteration through the loop, a row is added to the output buffer, and the key and value are set on the new row.


That’s pretty much all there is to it. This technique can be used to handle flat files and columns that contain multiple items (in XML, or as a delimited list).
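
For instance, if the multiple items arrive in a dedicated column rather than making up the whole line, the same asynchronous script pattern applies. Here’s a minimal sketch, assuming hypothetical input columns named Key and Items (where Items holds a pipe-delimited list) and a string output column named Value:

Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
    'Emit one output row per item in the pipe-delimited list
    For Each item As String In Row.Items.Split(CChar("|"))
        Output0Buffer.AddRow()
        Output0Buffer.Key = Row.Key
        Output0Buffer.Value = item
    Next
End Sub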

BIDS Helper 0.9 Beta is Released

The 0.9 beta of BIDS Helper has been released on CodePlex. It’s got a lot of great features for working with Analysis Services, and some future enhancements will include SSIS.

Handling Varying Columns, Part 2

In Handling Flat Files with Varying Numbers of Columns, I showed an example of parsing a flat file with an inconsistent number of columns. I used a script component, but Jamie commented that the same thing could be accomplished through a Conditional Split and Derived Column transform. So, here’s part 2.


I added a new data flow to the same package. The data flow for this version is a bit more complicated.



The Conditional Split determines what type of row I’m dealing with, and passes it to the appropriate output. It does this by checking how many delimiters appear in the row. The FINDSTRING function returns 0 if the specified string is not found, or if it occurs fewer times than the occurrence number requested.
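
For example, with the comma-delimited file from the previous post read into a single column named Line, the conditions on the two outputs might look something like this (the output names are just illustrative):

ThreeColumns: FINDSTRING(Line,",",2) > 0
TwoColumns: FINDSTRING(Line,",",1) > 0 && FINDSTRING(Line,",",2) == 0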



Now that I know how many columns I need to parse, I use a Derived Column transform to split the columns out of the main string.



The expression for the first column looks for the first occurrence of the delimiter.


SUBSTRING(Line,1,FINDSTRING(Line,",",1) - 1)


For the second column, the expression is a bit more complicated. It has to start after the first delimiter and stop at the second. Since the SUBSTRING function needs a length, the expression calculates the difference between the positions of the first and second delimiters. In addition, it casts the result to an integer.


(DT_I4)(SUBSTRING(Line,FINDSTRING(Line,",",1) + 1,FINDSTRING(Line,",",2) - FINDSTRING(Line,",",1) - 1))


The third expression finds the second delimiter and takes the rest of the string. I’m taking a shortcut by using the full length of the line as the length argument; if the length argument exceeds what remains of the string, SUBSTRING simply returns the rest of the string.


(DT_DBTIMESTAMP)(SUBSTRING(Line,FINDSTRING(Line,",",2) + 1,LEN(Line)))


Finally, a Union All is used to combine the data back into a single flow.


Technically, this could be accomplished without the Conditional Split. However, the logic required for the Derived Column transform would be much more complex, as each column parsing expression would have to be wrapped in a conditional expression to see if that column actually existed for the row.
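
For example, to parse the date column safely without the Conditional Split, the third expression above would need a guard along these lines (a sketch only):

FINDSTRING(Line,",",2) > 0 ? (DT_DBTIMESTAMP)(SUBSTRING(Line,FINDSTRING(Line,",",2) + 1,LEN(Line))) : NULL(DT_DBTIMESTAMP)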


In SSIS, there are usually at least two ways to accomplish anything, which is one of the things I like about it. However, there are differing advantages to the two approaches covered here and in the previous post. In general, I favor using the script component for the following reasons:



  • Easier (at least in my opinion) to introduce complex logic for parsing the columns

  • Simpler data flow

However, the Derived Column approach is a better fit if you aren’t comfortable with .NET coding, and it keeps the parsing logic visible in the data flow itself.


I’ve attached the updated sample package at the end of this post.

Handling Flat Files with Varying Numbers of Columns

5/15 Update – I added Part 2 to show how to do the same thing with a Conditional Split and a Derived Column transform, per Jamie’s feedback (see the comments on this post).
A common question on the forums has been how to handle flat files that have a varying number of columns. For example, one row contains 3 columns, and another row may contain on two columns. The example below shows a sample file that uses a comma to delimit the columns, and a cursor return / line feed to delimit the row.
TestValue1,100,12/01/2007
TestValue2,200
TestValue3,300,12/01/2007
TestValue4,400,12/01/2007
TestValue5,500
TestValue6,600,12/01/2007
TestValue7,700,12/01/2007
TestValue8,800
TestValue9,900,12/01/2007
TestValue0,1000,12/01/2007

SSIS does not handle this scenario easily, due to the way it parses flat files. It parses by looking for the next column delimiter; the row delimiter is just the column delimiter for the last defined column. So, on the second line of the sample file, SSIS is looking for a comma instead of a CR/LF. The result is that the third row ends up combined with the second row: the second column absorbs the line break and the start of the next line, giving something like TestValue2 | 200{CR/LF}TestValue3 | 300,12/01/2007.

I’m not going to go into a discussion about whether this is good or bad. This article is about how to work around it. If you’d like to see it changed in future versions of SSIS, please go to Connect (https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=124478) and vote for it to be changed.
Now, onto the workaround. First, I’ve defined a flat file connection manager that treats each row as one column. I’m using the row delimiter (CR/LF) as the column delimiter.

If you are following along, your flat file should preview like this:
 
Next, in a data flow, I’ve added a flat file source that uses the connection manager. It is connected to a script component that is set as a Transform. The Line column is checked as an input.

In the Inputs and Outputs area, I’ve added three columns, for the three real columns in my flat file, and set the data types appropriately.

Finally, I added the following script to the component:
Public Class ScriptMain
    Inherits UserComponent

    Private columnDelimiter() As Char = CType(",", Char())

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        Dim rowValues As String()

        rowValues = Row.Line.Split(columnDelimiter)
        If rowValues.GetUpperBound(0) < 2 Then
            'Row is not complete - handle error
            Row.Name_IsNull = True
            Row.Number_IsNull = True
            Row.Date_IsNull = True
        Else
            Row.Name = rowValues.GetValue(0).ToString()
            Row.Number = Convert.ToInt32(rowValues.GetValue(1))
            Row.Date = Convert.ToDateTime(rowValues.GetValue(2))
        End If
    End Sub

End Class
The columnDelimiter variable holds the value for the column delimiter – a comma in my case. The Split function parses the value contained in Line (the single column defined in the connection manager) and returns an array containing one element for each column in it. Since I’m expecting 3 columns, I’m performing a check to see if the array contains all three columns (.NET uses 0-based array indexes). If columns are missing, I have an error that needs to be handled. In this example, I am simply setting all my column values to NULL. The error handling could be enhanced by redirecting the rows to an error output, but I wanted to keep things simple. With this method, I could use a conditional split to filter out the rows with NULL.
Finally, if the correct number of columns is present, I set the output columns created earlier to the values from the array. Notice that the Convert calls are necessary to make sure each value is the correct type.
That’s pretty much it. Depending on your needs, you may need to customize the script a bit to better handle error conditions, or to parse the columns differently. I’ve attached the sample package and text file below. The sample uses the Trash Destination from Konesans, which you can download from www.SQLIS.com.

Sample files here

As always, feedback is appreciated.

Handling Multiple Errors in SSIS

[edited on 12/14/2007 to correct an error in the text around string handling - the samples were not modified]
One actual failure in SSIS can trigger a whole series of error messages. For example, failure to convert a column value from a string to an integer in a Derived Column transform generates the following messages:
[Data Conversion [70]] Error: Data conversion failed while converting column "Fiscal year" (18) to column "NumericFiscalYear" (83). The conversion returned status value 2 and status text "The value could not be converted because of a potential loss of data.".
[Data Conversion [70]] Error: SSIS Error Code DTS_E_INDUCEDTRANSFORMFAILUREONERROR. The "output column "NumericFiscalYear" (83)" failed because error code 0xC020907F occurred, and the error row disposition on "output column "NumericFiscalYear" (83)" specifies failure on error. An error occurred on the specified object of the specified component. There may be error messages posted before this with more information about the failure.
[DTS.Pipeline] Error: SSIS Error Code DTS_E_PROCESSINPUTFAILED. The ProcessInput method on component "Data Conversion" (70) failed with error code 0xC0209029. The identified component returned an error from the ProcessInput method. The error is specific to the component, but the error is fatal and will cause the Data Flow task to stop running. There may be error messages posted before this with more information about the failure.
[DTS.Pipeline] Error: SSIS Error Code DTS_E_THREADFAILED. Thread "WorkThread0" has exited with error code 0xC0209029. There may be error messages posted before this with more information on why the thread has exited.
If you are logging errors to a flat file or an error logging table, then recording each error may be fine. However, if you’re writing the errors to the Windows event log, or sending them via email, you may not want to record multiple messages each time an error occurs. You might want to record only the first message, or you might want to group all the errors into a single log entry or email. Fortunately, the event model in SSIS allows you to easily customize how errors are handled.
I’ve put together a small sample package that shows how you might accomplish this. The package contains a single data flow that loads a text file, attempts to convert a column from string to numeric, and writes it to a Trash destination (see www.SQLIS.com to get this component).
The text file has an invalid value in one of the columns, which will cause the data flow to fail, and generate the four messages listed above. The package is set up to capture all of the error messages generated, store them in a collection, and concatenate them into a single string when the package is finished executing. Once that is done, the resulting string could be emailed or recorded to a log.
As mentioned, the data flow is very straightforward:

I’ve also created two variables at the package level: errorMessages as an Object, and emailText as a String. I’ll explain why later in the post.
The real work occurs in the event handlers. SSIS raises events for all executables (packages and tasks are both executables). The event we’re interested in is the OnError event, which is raised once for each error that occurs.
You get to the event handlers by selecting the Event Handlers tab in the SSIS designer. Once there, the Executable for which you want to capture events needs to be selected.

Since I want to handle errors for anything in the package, I’m setting the executable to CaptureErrors (the name of the package). By default, any event raised by a child executable (that is, an executable that is nested inside another executable) will also be raised in its parent. You can disable that behavior by setting the Propagate system variable, but that’s a topic for another post. I’m also using “OnError” from the list of events and have added a Script Task to the event handler.

The Script Task has two variables passed in: System::ErrorDescription, which contains the text of the error message, and User::errorMessages, which I’ll use to keep track of all the error messages.

Here’s the script used in the Script Task:
Dim messages As Collections.ArrayList

Try
    messages = CType(Dts.Variables("errorMessages").Value, Collections.ArrayList)
Catch ex As Exception
    messages = New Collections.ArrayList()
End Try

messages.Add(Dts.Variables("ErrorDescription").Value.ToString())
Dts.Variables("errorMessages").Value = messages
Dts.TaskResult = Dts.Results.Success
I’m first attempting to retrieve the ArrayList from the errorMessages variable. If the value of the variable can’t be cast to an ArrayList, it hasn’t been initialized yet, so a new ArrayList is created. Either way, the error description is then added to the ArrayList. This handles capturing the list of all error messages.
The next step is to process all the messages in order to email or log them. Since I only want to do this once, I’m using the OnPostExecute event, which fires when the executable is finished running.

There is another Script Task present in this event handler. This one has the User::errorMessages and User::emailText variables passed in.

The script in this task is concatenating a long string based on the error messages captured and returning it in the emailText variable:
Dim errorDesc As String
Dim messages As Collections.ArrayList

Try
    messages = CType(Dts.Variables("errorMessages").Value, Collections.ArrayList)
Catch ex As Exception
    'If there is an exception, the object was never initialized, so there were no errors
    Return
End Try

For Each errorDesc In messages
    Dts.Variables("emailText").Value = Dts.Variables("emailText").Value.ToString() + errorDesc + vbCrLf
Next

Dts.TaskResult = Dts.Results.Success
Once that has been done, the resulting string could be emailed or logged as desired. Since SSIS can generate fairly verbose error messages, I chose to store the messages in an ArrayList object. I then use the ArrayList to build the actual string message.
Hopefully, this information is helpful to anyone who wants to customize SSIS event handling. I’ve attached the sample package, and the text file used in the sample. If you have any feedback or suggestions for improvement, please leave them in the comments.

Checking for Overlapping Rows in a Slowly Changing Dimension

A question was posted on the MSDN forums about validating that there are no overlapping rows in a slowly changing dimension (SCD). It is common to use a start and end date in a type 2 SCD to designate when each version of the dimension member was in effect. However, it can create problems if the date ranges for one version of the row overlap with other versions of the row.


A simple way to validate that there is no overlap in an existing dimension table is to run the following query:


SELECT
    *
FROM
    DimTable A
    LEFT JOIN DimTable B ON (A.BusinessKey = B.BusinessKey AND A.DimID <> B.DimID)
WHERE
    (A.BeginDate BETWEEN B.BeginDate AND B.EndDate
     OR A.EndDate BETWEEN B.BeginDate AND B.EndDate)
ORDER BY
    A.DimID


It simply joins the table to itself to compare the date range of a given row with other rows that have the same business key. It assumes that the dimension table contains a surrogate key, a business key, and a begin and end date.


That’s great if you are validating the table after the fact, but what if you want to check a new row before you insert it? In some cases, especially if you are dealing with dimension rows that arrive out of sequence, you may need to validate that the effective range for a new row doesn’t overlap with existing rows. If you want to do this in a data flow, the method above doesn’t work, nor does it lend itself to a simple Lookup. You’d need to perform a BETWEEN in the Lookup, which means that caching has to be disabled. On large lookup tables, this can result in poor performance.
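
To illustrate, a Lookup doing the range match directly would need a parameterized query along these lines (a sketch, using the DimTable columns from the query above), which is only possible with caching disabled:

SELECT DimID
FROM DimTable
WHERE BusinessKey = ? AND ? BETWEEN BeginDate AND EndDate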


What is really needed is a way to perform an equi-join on the business key and each of the effective dates, so that we can use the caching mode of the Lookup. Fortunately, we can do this in SQL Server 2005 using a recursive common table expression (CTE). Using the CTE, we can generate a row for each “covered” date, that is, each day between the start and end effective dates. Once we have that, we can perform one Lookup to match on the business key and begin date, and a second Lookup to match on the business key and end date. If either of the Lookups hits a match, then the date range for the new row overlaps an existing row.
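
Conceptually, each Lookup then performs a plain equi-join against the covered dates, something like this (a sketch; NewRows and CoveredDates are stand-ins for the data flow input and the CTE output, and the second Lookup matches on EndDate instead of BeginDate):

SELECT c.DimID
FROM NewRows n
    INNER JOIN CoveredDates c ON n.BusinessKey = c.BusinessKey
        AND n.BeginDate = c.CoveredDate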


This is the SQL for the lookups:


WITH DIM_CTE (
    DimID
    ,BusinessKey
    ,CoveredDate
    ,BeginDate
    ,EndDate
    ,Days) AS
(
    SELECT
        dim.DimID
        ,dim.BusinessKey
        ,dim.BeginDate AS CoveredDate
        ,dim.BeginDate
        ,dim.EndDate
        ,1 AS Days
    FROM
        DimTable dim


    UNION ALL

    SELECT
        dim.DimID
        ,dim.BusinessKey
        ,DATEADD(day, DIM_CTE.Days, dim.BeginDate) AS CoveredDate
        ,dim.BeginDate
        ,dim.EndDate
        ,DIM_CTE.Days + 1 AS Days
    FROM
        DIM_CTE INNER JOIN DimTable dim
            ON (DIM_CTE.DimID = dim.DimID)
    WHERE
        DIM_CTE.Days <= DATEDIFF(day, dim.BeginDate, dim.EndDate)
)
SELECT
    A.DimID, A.CoveredDate, A.BusinessKey
FROM
    DIM_CTE A
OPTION (MAXRECURSION 400)


An important thing to note about the CTE is the use of OPTION (MAXRECURSION 400). This query hint tells SQL Server how many levels of recursion to allow. Since the CTE recurses once for each day between the effective start and end dates of a row, you need to make sure that MAXRECURSION is set to at least the maximum number of days between the start and end dates in the dimension table. You can use the following SQL to determine what value to use:

SELECT
    MAX(DATEDIFF(day, dim.BeginDate, dim.EndDate))
FROM
    DimTable dim

The Lookups in the SSIS package are both set to ignore failures. A Conditional Split is then used to catch the overlapping rows, by checking whether either Lookup populated its ID column, using this expression: !(ISNULL(BeginDateID) && ISNULL(EndDateID))


A caveat about this technique: the CTE can generate a lot of rows, which can impact performance. Depending on your scenario, it may make more sense to insert the rows, then perform a cleanup routine afterward, or to validate the date ranges in a batch operation before starting the data flow.

If you have any suggestions for optimizing or improving this, please leave a comment.