Agile BI

BIDS Helper 0.9 Beta is Released

Posted on May 16, 2007 by jwelch

The 0.9 beta of BIDS Helper has been released on CodePlex. It’s got a lot of great features for working with Analysis Services, and some future enhancements will include SSIS.

Posted in Uncategorized | Comments Off

Handling Varying Columns, Part 2

Posted on May 16, 2007 by jwelch

In Handling Flat Files with Varying Numbers of Columns, I showed an example of parsing a flat file with an inconsistent number of columns. I used a script component, but Jamie commented that the same thing could be accomplished through a Conditional Split and Derived Column transform. So, here’s part 2.

I added a new data flow to the same package. The data flow is a bit more complicated for this.

The Conditional Split determines what type of row I’m dealing with, and passes it to the appropriate output. It does this by checking how many delimiters appear in the row. The FindString function will return a 0 if the string specified is not found, or if the string specified occurs less than the number of occurrences specified.

Now that I know how many columns I need to parse, I’m use a Derived Column transform to split the columns from the main string.

The expression for the first column looks for the first occurrence of the delimiter.

SUBSTRING(Line,1,FINDSTRING(Line,”,”,1) – 1)

For the second column, the expression is a bit more complicated. It has start from the first delimiter, and stop at the second. Since the SubString function needs the length, the expression is calculating the difference between the first and second delimiter. In addition, it is casting the result to an integer.

(DT_I4)(SUBSTRING(Line,FINDSTRING(Line,”,”,1) + 1,FINDSTRING(Line,”,”,2) – FINDSTRING(Line,”,”,1) – 1))

Finally, the third expression finds the second delimiter, and gets the rest of the string. I’m taking a shortcut by using the full value for the length, since if the length argument is exceeds the length of the string, the rest of the string is returned.

(DT_DBTIMESTAMP)(SUBSTRING(Line,FINDSTRING(Line,”,”,2) + 1,LEN(Line)))

Finally, a Union All is used to combine the data back into a single flow.

Technically, this could be accomplished without the Conditional Split. However, the logic required for the Derived Column transform would be much more complex, as each column parsing expression would have to be wrapped in a conditional expression to see if that column actually existed for the row.

In SSIS, there are usually at least two ways to accomplish anything, which is one of the things I like about it. However, there are differing advantages to the two approaches covered here and in the previous post. In general, I favor using the script component for the following reasons:

Easier (at least in my opinion) to introduce complex logic for parsing the columns

Simpler data flow

However, the Derived Column is easier if you aren’t comfortable with .NET coding, and makes it easier to interpret what is happening in the data flow.

I’ve attached the updated sample package at the end of this post.

Posted in Uncategorized | Comments Off

Handling Flat Files with Varying Numbers of Columns

Posted on May 8, 2007 by jwelch

5/15 Update – I added Part 2 to show how to do the same thing with a Conditional Split and a Derived Column transform, per Jamie’s feedback (see the comments on this post).�
A common question on the forums has been how to handle flat files that have a varying number of columns. For example, one row contains 3 columns, and another row may contain on two columns. The example below shows a sample file that uses a comma to delimit the columns, and a cursor return / line feed to delimit the row.�
TestValue1,100,12/01/2007
TestValue2,200
TestValue3,300,12/01/2007
TestValue4,400,12/01/2007
TestValue5,500
TestValue6,600,12/01/2007
TestValue7,700,12/01/2007
TestValue8,800
TestValue9,900,12/01/2007
TestValue0,1000,12/01/2007�
SSIS does not handle this scenario easily, due to the way it parses flat files. It parses by looking for the next column delimiter. The row delimiter is just the column delimiter for the last defined column. So, on our second line in the sample file, SSIS is looking for a comma instead of a CR/LF. The result of this is that the third row ends up combined with the second row, and we get something that looks like this:�
�
I’m not going to go into a discussion about whether this is good or bad. This article is about how to work around it. If you’d like to see it changed in future versions of SSIS, please go to Connect (https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=124478) and vote for it to be changed.�
Now, onto the workaround. First, I’ve defined a flat file connection manager that treats each row as one column. I’m using the row delimiter (CR/LF) as the column delimiter.�
�
If you are following along, your flat file should preview like this:�
�
Next, in a data flow, I’ve added a flat file source that uses the connection manager. It is connected to a script component that is set as a Transform. The Line column is checked as an input.�
�
In the Inputs and Outputs area, I’ve added three columns, for the three real columns in my flat file, and set the data types appropriately.�
�
Finally, I added the following script to the task:�
Public Class ScriptMain
    Inherits UserComponent
    Private columnDelimiter() As Char = CType(“,”, Char()) �
    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As
        Input0Buffer)
        Dim rowValues As String()
        rowValues = Row.Line.Split(columnDelimiter)
        If rowValues.GetUpperBound(0) < 2 Then
            ‘Row is not complete – Handle error
            Row.Name_IsNull = True
            Row.Number_IsNull = True
            Row.Date_IsNull = True
        Else
            Row.Name = rowValues.GetValue(0).ToString()
            Row.Number = Convert.ToInt32(rowValues.GetValue(1))
            Row.Date = Convert.ToDateTime(rowValues.GetValue(2))
        End If
    End Sub
End Class
The columnDelimiter variable holds the value for the column delimiter – a comma in my case. The Split function parses the value contained in Line (the single column defined in the connection manager) and returns an array containing one element for each column in it. Since I’m expecting 3 columns, I’m performing a check to see if the array contains all three columns (.NET uses 0-based array indexes). If columns are missing, I have an error that needs to be handled. In this example, I am simply setting all my column values to NULL. The error handling could be enhanced by redirecting the rows to an error output, but I wanted to keep things simple. With this method, I could use a conditional split to filter out the rows with NULL.�
Finally, if the correct number of columns are present, I’m setting the output columns created earlier with the values from the array. Notice that the Convert is necessary to make sure the value is the correct type.�
That’s pretty much it. Depending on your needs, you may need to customize the script a bit to better handle error conditions, or reparsing the columns. I’ve attached the sample package and text file below. The sample is using the Trash Destination from Konesans, which you can download from www.SQLIS.com

Sample files here

As always, feedback is appreciated.

Posted in Uncategorized | Tagged Examples, Flat Files, SSIS | Comments Off

Handling Multiple Errors in SSIS

Posted on May 5, 2007 by jwelch

[edited on 12/14/2007 to correct an error in the text around string handling – the samples were not modified]
One actual failure in SSIS can trigger a whole series of error messages. For example, failure to convert a column value from a string to an integer in a Derived Column transform generates the following messages:
[Data Conversion [70]] Error: Data conversion failed while converting column “Fiscal year” (18) to column “NumericFiscalYear” (83). The conversion returned status value 2 and status text “The value could not be converted because of a potential loss of data.”.
[Data Conversion [70]] Error: SSIS Error Code DTS_E_INDUCEDTRANSFORMFAILUREONERROR. The “output column “NumericFiscalYear” (83)” failed because error code 0xC020907F occurred, and the error row disposition on “output column “NumericFiscalYear” (83)” specifies failure on error. An error occurred on the specified object of the specified component. There may be error messages posted before this with more information about the failure.
[DTS.Pipeline] Error: SSIS Error Code DTS_E_PROCESSINPUTFAILED. The ProcessInput method on component “Data Conversion” (70) failed with error code 0xC0209029. The identified component returned an error from the ProcessInput method. The error is specific to the component, but the error is fatal and will cause the Data Flow task to stop running. There may be error messages posted before this with more information about the failure.
[DTS.Pipeline] Error: SSIS Error Code DTS_E_THREADFAILED. Thread “WorkThread0” has exited with error code 0xC0209029. There may be error messages posted before this with more information on why the thread has exited.
If you are logging errors to a flat file or an error logging table, then recording each error may be fine. However, if you’re writing the errors to the Windows event log, or sending them via email, you may not want to record multiple messages each time an error occurs. You might want to record only the first message, or you might want to group all the errors into a single log entry or email. Fortunately, the event model in SSIS allows you to easily customize how errors are handled.
I’ve put together a small sample package that shows how you might accomplish this. The package contains a single data flow that loads a text file, attempts to convert a column from string to numeric, and writes it to a Trash destination (see www.SQLIS.com to get this component).
The text file has an invalid value in one of the columns, which will cause the data flow to fail, and generate the four messages listed above. The package is set up to capture all of the error messages generated, store them in a collection, and concatenate them into a single string when the package is finished executing. Once that is done, the resulting string could be emailed or recorded to a log.
As mentioned, the data flow is very straightforward:
�
I’ve also created two variables at the package level: errorMessages as an Object, and emailText as a String. I’ll explain why later in the post.
The real work occurs in the event handlers. SSIS raises events for all executables(packages and tasks are both executables). The event we’re interested in is the OnError event, which is raised once for each error that occurs.
You get to the event handlers by selecting the Event Handlers tab in the SSIS designer. Once there, the Executable for which you want to capture events needs to be selected.

Since I want to handle errors for anything in the package, I’m setting the executable to CaptureErrors (the name of the package). By default, any event raised by a child executable (that is, an executable that is nested inside another executable) will also be raised in its parent. You can disable that behavior by setting the Propagate system variable, but that’s a topic for another post. I’m also using “OnError” from the list of events and have added a Script Task to the event handler.

The Script Task has two variables passed in: the System::ErrorDescription, which contains the text of the error message, and User:errorMessages, which I’ll use to keep track of all the error messages.

Here’s the script used in the Script Task:
Dim messages As Collections.ArrayList
Try
messages = CType(Dts.Variables(“errorMessages”).Value, Collections.ArrayList)
Catch ex As Exception
messages = New Collections.ArrayList()
End Try
messages.Add(Dts.Variables(“ErrorDescription”).Value.ToString())
Dts.Variables(“errorMessages”).Value = messages
Dts.TaskResult = Dts.Results.Success
I’m first attempting to retrieve the ArrayList from the errorMessages variable. If the value of the variable can’t be cast to an ArrayList, it indicates that it hasn’t been initialized yet. Once that has been handled, the error description is added to the ArrayList. This handles capturing the list of all error messages.
The next step is to process all the messages in order to email or log them. Since I only want to do this once, I’m using the OnPostExecute event, which fires when the executable is finished running.

There is another Script Task present in this event handler. This one has the User::errorMessages and User:emailText variables passed in.

The script in this task is concatenating a long string based on the error messages captured and returning it in the emailText variable:
Dim errorDesc As String
Dim messages As Collections.ArrayList
Try
messages = CType(Dts.Variables(“errorMessages”).Value, Collections.ArrayList)
Catch ex As Exception
‘If there is an exception – the object was never initialized, so there were no errors
Return
End Try
For Each errorDesc In messages
Dts.Variables(“emailText”).Value = Dts.Variables(“emailText”).Value.ToString + errorDesc + vbCrLf
Next
Dts.TaskResult = Dts.Results.Success
Once that has been done, the resulting string could be emailed or logged as desired. Since SSIS can generate fairly verbose error messages, I chose to store the messages in an ArrayList object. I then use the ArrayList to build the actual string message.
Hopefully, this information is helpful to anyone who wants to customize SSIS event handling. I’ve attached the sample package, and the text file used in the sample. If you have any feedback or suggestions for improvement, please leave them in the comments.

Posted in Uncategorized | Tagged Error Handling, Examples, SSIS | Comments Off

Checking for Overlapping Rows in a Slowly Changing Dimension

Posted on May 5, 2007 by jwelch

A question was posted on the MSDN forums about validating that there are no overlapping rows in a slowly changing dimension (SCD). It is common to use a start and end date in a type 2 SCD to designate when each version of the dimension member was in effect. However, it can create problems if the date ranges for one version of the row overlap with other versions of the row.

A simple way to validate that there is no overlap in an existing dimension table is to run the following query:

SELECT
*
FROM
DimTable A
LEFT JOIN DimTable B ON (A.BusinessKey = B.BusinessKey AND A.DimID <> B.DimID)
WHERE
(A.BeginDate BETWEEN B.BeginDate AND B.EndDate
OR A.EndDate BETWEEN B.BeginDate AND B.EndDate)
ORDER BY
A.DimID

It simply joins the table to itself to compare the date range of a given row with other rows that have the same business key. It assumes that the dimension table contains a surrogate key, a business key, and a begin and end date.

That’s great if you are validating the table after the fact, but what if you want check a new row before you insert it? In some cases, especially if you are dealing with dimension rows that arrive out of sequence, you may need to validate that the effective range for a new row doesn’t overlap with existing rows. If you want to do this in a data flow, the method above doesn’t work, nor does it lend itself to a simple Lookup. You’d need to perform a BETWEEN in the Lookup, which means that caching needs to be disabled. On large lookup tables, this can result in poor performance.

What is really needed is a way to perform an equi-join between the business key and each of the effective dates, so that we can use the caching mode of the Lookup. Fortunately we can do this in SQL Server 2005 using a recursive common table expression (CTE). Using the CTE, we can generate a row for each “covered” date, that is, each day between the start and end effective dates. Once we have that, we can perform one lookup to match on the business key and begin date, and a second lookup to match on the business key and end date. If either of the lookups hits a match, then the date range for the new row overlaps an existing row.

This is the SQL for the lookups:

WITH DIM_CTE (
    DimID
    ,BusinessKey
    ,CoveredDate
    ,BeginDate
    ,EndDate
    ,Days) AS
(
    SELECT
        dim.DimID
        ,dim.BusinessKey
        ,dim.BeginDate AS CoveredDate
        ,dim.BeginDate
        ,dim.EndDate
        ,1 AS Days
    FROM
        DimTable dim

UNION ALL

    SELECT
        dim.DimID
        ,dim.BusinessKey
        ,DATEADD(day, DIM_CTE.Days, dim.BeginDate) AS CoveredDate
        ,dim.BeginDate
        ,dim.EndDate
        ,DIM_CTE.Days + 1 AS Days
    FROM
        DIM_CTE INNER JOIN DimTable dim
            ON (DIM_CTE.DimID = dim.DimID)
    WHERE
        DIM_CTE.Days <= DATEDIFF(day, dim.BeginDate, dim.EndDate)
)
SELECT
    A.DimID, A.CoveredDate, A.BusinessKey
FROM
    DIM_CTE A
OPTION (MAXRECURSION 400)

An important thing to note about the CTE is the use of the OPTION (MAXRECURSION 400). This is a query hint that tells SQL Server how many levels of recursion to allow. Since the CTE recurses once for each day between the effective start date and end date for the rows in the dimension table, you need to make sure that MAXRECURSION is set to the max days between start and end dates. You can use the following SQL to determine what value to use:

SELECT
MAX(DATEDIFF(day, dim.BeginDate, dim.EndDate))
FROM
DimTable dim

The Lookups in the SSIS package are both set to ignore failures. A conditional split is used to determine whether both dates fall outside a valid range (by checking whether an ID column was populated by the Lookup using this expression: “!(ISNULL(BeginDateID) && ISNULL(EndDateID))”.

A caveat about this technique: the CTE can generate a lot of rows, which can impact performance. Depending on your scenario, it may make more sense to insert the rows, then perform a cleanup routine afterward, or to validate the date ranges in a batch operation before starting the data flow.

If you have any comments about optimizing or improving this, please leave a comment.

Posted in Uncategorized | Comments Off

Package Configuration Editor

Posted on April 23, 2007 by jwelch

Well, after a few weeks of playing around with it, I finally finished an initial version of a tool for editing SSIS configurations. It’s pretty basic, but it is functional. It supports editing configurations stored in XML files and databases. You can download it from here.

The initial version supports editing existing configuration sources. That is, you can’t create a new configuration file from scratch with it, but you can add new configurations, and edit and delete existing configurations. The nice part about this is that if you add a new configuration to an existing file, packages that use the configuration file will automatically make use of the new configuration, without you have to change the package at all. There is one caveat to this behavior, the package path value in the configuration has to be valid in the package that is using the configuration file. Otherwise, you will get a warning message when the package loads.

The same thing holds true for database configurations. If you add a new package path, but reuse an existing configuration filter, it will be applied to any packages that use that configuration filter. The database editor treats the configuration filter and package path as the key for a configuration entry, so you have to delete and re-insert a row to update either of those columns.

Some future improvements that I would like to make:

Add some validation to the values. Right now there is no checking to ensure the data being entered can successfully be converted to the appropriate data type.
Add a package browser for the package path.

I’ve tested this a bit myself, but I’m definitely interested in feedback on it, whether it is bugs or suggestions for improvement. You can leave comments on this post, on the file listing, or you can email me at jwelch@agilebi.com.

Posted in Uncategorized | Comments Off

SSIS Presentation at CITPG

Posted on April 12, 2007 by jwelch

I presented an “SSIS 101” session to one of our local user groups yesterday, the Carolina IT Professionals Group (http://www.carolinait.org/). It was an interesting presentation to do, as the audience is fairly broad, covering networking professionals, application developers, and hardware people. Out of the roughly 100 people in attendence, only seven had actually used SSIS, and there were two or three more that had used DTS. Over the past few years, most of my presentations have been done to audiences that are familiar with data warehousing, so it was a little different to present to a group that wasn’t ETL centric.

However, it was a good experience. The audience asked a lot of good questions. A number of people followed up after the presentation to say that they hadn’t been familiar with SSIS prior to the presentation, but now that they had seen it, they see a lot of use for it in automating some data movement tasks. One person told me that he currently spent one to two hours every morning babysitting a set of cursor based stored procedures that moved data from one server to another, and now he was going to automate the process with SSIS.

I’ve recently done a series of presentations around SSIS that focused on more advanced topics, and on some product development that Mariner (my employer) is doing around SSIS. I’ve been acting as a product manager on that, and it’s been a lot of fun. It’s given me the opportunity to apply a lot of the experiences I have had with SSIS toward creating something that will help companies manage their SSIS implementions over time. I promise not to turn this blog into a running advertisement, but as we get closer to launch I will include some links to the product information, as I’d love to get feedback from the community on it.

Posted in Uncategorized | Comments Off

Retrieving Information from Active Directory with SSIS

Posted on April 7, 2007 by jwelch

I was recently asked to help someone retrieve information from Active Directory for use in SSIS. My initial thought was “This shouldn’t be too difficult. I know there is an OLE DB driver for Active Directory, so it should just be a matter of hooking up a few components.” As it turns out, it took a few hours of painful trial and error to get something working. Also, at the end of it all, I found a different approach that was as simple as I had originally envisioned it. Hopefully, this will save some time for the next person needing to do this.
I’ll start with a summary of the results, in case you don’t feel like wading through the rest. It does work, but there are a few caveats. You can use either an OLE DB connection manager, or an ADO.NET connection manager with the .NET OLE DB wrapper. I recommend the ADO.NET connection manager, as it will allow you to use access Active Directory as a source in your data flows. If you are using the OLE DB connection manager, you will be limited to using Execute SQL tasks and passing the recordset into the data flow. There is a sample package attached that shows both approaches.
I started out using an OLE DB connection manager and choose the OLE DB Provider for Microsoft Directory Services. I put in the server name, left the authentication as Windows Integrated, and clicked Test Connection. The test was successful, so I moved on to the next step.

I attempted to set up an OLE DB data source in my data flow, using my OLE DB connection manager. However, nothing worked. So I backed up, and tried an Execute SQL task. Still nothing. My first thought was that I didn’t have the syntax right for the query, but after trying it successfully outside of SSIS, I was looking at the connection again. Eventually I found a reference in MSDN to setting the ADSI Flag to 1 on the connection properties. After updating that, my Execute SQL task started working.

The Execute SQL was set up to return the full resultset to an object variable. The query used was “SELECT cn FROM ‘LDAP://DC=DOMAINCONTROLLER,DC=local’ WHERE objectClass=’User'”. This retrieves all the users from Active Directory.

I then went back to using an OLE DB Source in the data flow, but it still wouldn’t work. The OLE DB Source didn’t seem like it was able to execute the query or get any metadata from the provider. So I took the approach of using a script source to process the recordset in the data flow. (See this post for more information on handling recordsets in scripts.) That worked fine, but I was wasn’t completely satisfied.
I went back and tried creating a new connection manager. This time I used the ADO.NET connection manager, but still used the OLE DB Provider for Microsoft Directory Services through the .NET Provider for OLE DB. Just for grins, I didn’t bother setting the ADSI Flag on this connection.
�
I then created an ADO.NET Reader data source in my data flow, set the connection to the connection manager I just created and put the same LDAP query in it.

When I ran the updated package, it worked first time through (without the ADSI Flag=1). Since it’s been my understanding that it is always preferable to use the OLE DB connection manager when working with OLE DB providers, this behavior was a little frustrating. It seems like you should get the same behavior whether you are using the OLE DB connection manager or the ADO.NET connection manager with the OLE DB wrapper.
Anyway, that was my journey through SSIS with Active Directory. I’d definitely recommend using the ADO.NET connection manager, as it make the process much easier. Please check out the sample package attached if you have further questions.

Posted in Uncategorized | Tagged AD, SSAS | Comments Off

Open Source SQL Server Stuff (OSSSS)

Posted on March 30, 2007 by jwelch

I was on the CodePlex (www.codeplex.com) site today looking for the Analysis Services Stored Procedure project, and found that they have a number of SQL Server related projects available. First, all of the SQL Server samples have been uploaded to the site, which is nice if you only want to grab one or two of them.

There is also the aforementioned AS Stored Procedure project (http://www.codeplex.com/ASStoredProcedures) , which has some great examples of extending AS functionality with .NET based stored procedures.

Then there is the SQL Inserter project, which can generate INSERT statements from database tables. It’s handy for quickly moving small amounts of data, if you don’t want to use the Import / Export Wizard.

There are also a couple of test data generators (http://www.codeplex.com/datagenerator and http://www.codeplex.com/TdGen). Data Generator is designed to be used in .NET based development, and TdGen is for databases.

Posted in Uncategorized | Comments Off

Writing a Resultset to a Flat File

Posted on March 22, 2007 by jwelch

This is another example package. This one retrieves a resultset and stores it in a variable. It then uses a data flow with a custom script source to read each row of the resultset and output it as a delimited string. Why would you want to do this? Well, it enables you to treat a resultset generically from a data flow perspective. There is only a single “column” defined in the metadata. So this works well if you want to export an unknown number of columns from a resultset to a flat file.

First, use an Execute SQL to retrieve and store a resultset in a variable. Then execution should continue to the data flow.

A custom script component acts as the source for the data flow. When adding the script component, make sure to source as the type of the script. Select Inputs and Outputs, and add a single output column to the existing output, and set the type of the column to DT_STR and set the length to 1000 (or something long enough to hold your widest resultset).

Select the Script option next, and add the resultset variable to the ReadOnly Variables property. Click the Design Script button, and add the following to the CreateNewOutputRows function: (Script adapted from Jamie Thomson’s example at http://blogs.conchango.com/jamiethomson/archive/2006/01/04/SSIS_3A00_-Recordsets-instead-of-raw-files.aspx)

Dim oledbAdapter As New OleDb.OleDbDataAdapter
Dim dataTable As New System.Data.DataTable
Dim dataRow As Data.DataRow
Dim delimitedValues As String
Dim i As Integer

oledbAdapter.Fill(dataTable, Me.Variables.resultset)
For Each dataRow In dataTable.Rows
OutputBuffer.AddRow()

delimitedValues = String.Empty

    For i = 0 To dataRow.ItemArray.GetUpperBound(0)
      delimitedValues = delimitedValues + dataRow.Item(i).ToString()
      If i <> dataRow.ItemArray.GetUpperBound(0) Then
        delimitedValues = delimitedValues + “|”
      End If
    Next

OutputBuffer.DelimitedRow = delimitedValues
Next

You’ll also need to add a reference to the System.XML assembly (Select Project..Add Reference from the toolbar). If you want to use a different delimiter, just change the constant “|”.

Now you can connect this to a flat file destination with a single column defined, and output your recordset to that without worrying about the number or type of columns.

Posted in Uncategorized | Comments Off

BIDS Helper 0.9 Beta is Released

Handling Varying Columns, Part 2

Handling Flat Files with Varying Numbers of Columns

Handling Multiple Errors in SSIS

Checking for Overlapping Rows in a Slowly Changing Dimension

Package Configuration Editor

SSIS Presentation at CITPG

Retrieving Information from Active Directory with SSIS

Open Source SQL Server Stuff (OSSSS)

Writing a Resultset to a Flat File

Recent Posts

Recent Comments

Archives

Categories

Meta