
Going to the PASS Business Analytics Conference

I found out recently that I’ll be able to attend the PASS Business Analytics Conference this year, which I’m pretty excited about. Also, I’m not presenting at this conference, so I will actually get to relax and enjoy the sessions by other speakers. If you haven’t registered yet, now’s a good time*.

There’s a lot of great content at this conference, and it’s a bit challenging in some time slots to decide on exactly what I want to see most. However, there are 3 sessions that I will definitely be attending:

AV-400-M – Deep Dive into Power Query Formula Language

Power Query is a cool technology for data transformation – one that I believe (hope) will continue to evolve and become relevant outside of just Excel. And its usefulness for quickly mashing up some data inside Excel is outstanding. This is a focused session on the formula language, which I’m interested in, and it’s being delivered by Matt Masson and Theresa Palmer-Boroski. Matt does a great job on these types of presentations. I haven’t seen Theresa present yet, but I’m confident she’ll do a great job, and this will be a good session.

ID-100 – Creating an End-To-End Power View Reporting Solution

Devin Knight (a co-worker at Pragmatic Works) is delivering this, and he puts together great sessions. Power View is one of those technologies that I don’t spend a lot of time with, but I know I need to know it better, and this session should help with that. Devin has a lot of practical experience with Power View, so this will be a great opportunity to get a real world look at what’s involved.

SA-102 – Analytics and OLTP Systems: Is Hekaton A Game-Changer?

Hekaton is the new in-memory technology in SQL Server 2014. Its primary focus is on improving the performance of OLTP applications, but Todd McDermid will be looking at it from the perspective of delivering analytics. He’ll be answering the question of whether it can be used to deliver a single database that is suited for both transactional processing and analytics, and I’m very interested to see the results. I feel like the Hekaton technologies could have a place in the BI world, but I haven’t had a chance to go out and really investigate it myself. Thanks to Todd’s efforts, I won’t have to.

There are a lot of great sessions, and those are just 3 of the ones that appealed to me.  I’m really looking forward to attending, and I hope to see you there.

*If you aren’t already registered, you can use the discount code BABQ9B to get $150 off your registration.

It’s Alive!

You may have noticed a real dearth of posts over the last couple of months. That’s been a function of two things:

  1. I’ve been very busy at work with the release of Vivid, our Excel OLAP client, and working on the upcoming releases of Mist and Hadron.
  2. My hosting company and blog software conspired to prevent me (or any of the bloggers at Agile BI) from posting or even logging into the site.

I’m still really busy at work, but I have resolved the 2nd item. New hosting company, new blog software, and everything’s up and running. Unfortunately, if you had an account on the site previously, you will need to create a new one, but hey – look at it as a chance to get that user name you always wanted :) .

In any case, the posts should be back on a more regular schedule over the next few months. David Darden, Melinda Cole, and Justin James also will be posting some updates. Please let me know if you experience any problems with the new site, and I’ll try to get them addressed as soon as possible.

SQL Heroes Contest Winners

They announced the SQL Heroes contest winners at the PASS Summit keynote today. I’m very happy to say that both BIDSHelper and ssisUnit were in the top 5! BIDSHelper had a perfect score of 55 (the only perfect score), based on the judging criteria (yes, 55 was the max you could score). ssisUnit tied with Extended Event Manager, from Jonathan Kehayias, with a score of 45.

Thanks for voting, and please keep the suggestions for improvements coming in.

Preview of the SSWUG Virtual Conference

The SSWUG Business Intelligence Virtual Conference is just two weeks off now. They’ve made a 10 minute preview of my session on SSIS configurations available here: http://www.vconferenceonline.com/speaker.asp?id=Jwelch, if you are interested in getting a sneak peek. There are a few other SSIS previews available:

Matthew Roche’s session on scripting in SSIS: http://www.vconferenceonline.com/speaker.asp?id=mroche

Matt Masson’s session on advanced Lookup scenarios: http://www.vconferenceonline.com/speaker.asp?id=mmasson

If you’d like a $10 discount on the conference fee (making it $90 – a bargain :) ), you can use this code when you register – VCTAF167753-174.

A “Merge” Destination Component

At the Microsoft BI Conference, I did a chalk talk on creating custom components for SSIS. It seemed to go pretty well, as I got a lot of questions and interaction from the audience, which is the point of a chalk talk. As part of it, I demonstrated a component that can be used to perform a Merge (or any other batch oriented SQL statement) in the data flow. This is a pretty common request / complaint about SSIS, as evident from this post on Jamie’s blog.

There’s a pretty common pattern in SSIS for performing updates from the data flow. You can use the OLE DB Command, but it processes data row by row, meaning that performance is not great. To get better performance, you can write the rows to be updated to a working table using an OLE DB Destination in the data flow, and then use an Execute SQL task in the control flow to perform a batch UPDATE against the true destination table. The same concept can be used with the new MERGE statement in SQL Server 2008, as shown in Using MERGE from SSIS. However, since by default SSIS will drop connections after each task completes, you can’t easily use a temporary table as the working table, so you have to create and clean up “permanent” tables, either in the control flow each time the package runs, or by just leaving them in the database.

Clearly, there are some drawbacks to this. You probably don’t want permanent tables in your database that are there only to support a back end process, and it would certainly be simpler to not have to create and delete the table in each package that needs to make use of them. To work around this, I created the Batch Destination component.

Basically, the component works by reading the metadata from the input columns in the data flow, and creating a working table in the target database. It then loads all the incoming rows to the working table. After all the incoming rows have been received and saved in the working table, it executes a SQL statement supplied by the developer. This SQL statement could be a MERGE statement, an UPDATE statement, or pretty much any other DML statement you’d like. After executing the SQL statement, it deletes the working table.
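
If you’re curious what that looks like in code, here’s a rough sketch of the overall shape of such a component against the SQL Server 2008 pipeline API. This is just an illustration of the pattern, not the component’s actual source – the table creation and row loading are only stubbed out, and the “SqlCommand” custom property name is a placeholder.

Imports System.Data.SqlClient
Imports Microsoft.SqlServer.Dts.Pipeline
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper

<DtsPipelineComponent(DisplayName:="Batch Destination", ComponentType:=ComponentType.DestinationAdapter)> _
Public Class BatchDestination
    Inherits PipelineComponent

    Private connection As SqlConnection

    Public Overrides Sub AcquireConnections(ByVal transaction As Object)
        ' Get the SqlConnection from the ADO.NET connection manager assigned to the component.
        Dim cm As Microsoft.SqlServer.Dts.Runtime.ConnectionManager = _
            DtsConvert.GetWrapper(ComponentMetaData.RuntimeConnectionCollection(0).ConnectionManager)
        connection = CType(cm.AcquireConnection(transaction), SqlConnection)
    End Sub

    Public Overrides Sub PreExecute()
        ' Build a CREATE TABLE statement for the working table from the input column
        ' metadata and execute it against the target database (type mapping omitted here).
    End Sub

    Public Overrides Sub ProcessInput(ByVal inputID As Integer, ByVal buffer As PipelineBuffer)
        While buffer.NextRow()
            ' Write the current row's column values to the working table (omitted here).
        End While
    End Sub

    Public Overrides Sub PostExecute()
        ' All rows have been received: run the developer-supplied SQL statement
        ' (MERGE, UPDATE, etc.) against the working table, then drop the working table.
        Dim sql As String = CStr(ComponentMetaData.CustomPropertyCollection("SqlCommand").Value)
        Using cmd As New SqlCommand(sql, connection)
            cmd.ExecuteNonQuery()
        End Using
    End Sub

End Class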

The user interface for the component is shown below. You provide an ADO.NET connection manager to a SQL Server database, a working table name (this can be a temp table, just use # in front of the name), and the SQL command to execute. On the Input Columns tab, you select the columns that you want to include in the working table.

[Screenshot: the Batch Destination editor, with the ADO.NET connection manager, working table name, and SQL command settings]

There are a few limitations in this component. One, it only works with SQL Server through an ADO.NET connection manager currently. Two, it performs very little validation right now, so you can easily add a SQL command that is not actually valid, and you won’t find out till you run the component. Three, it’s written against the SQL Server 2008 API, so it won’t work in 2005.

I’m planning on making the source code available on CodePlex, but in the meantime, you can download a compiled version from my SkyDrive. To install it, you need to put the .DLL in the GAC, and copy it to your C:\Program Files (x86)\Microsoft SQL Server\100\DTS\PipelineComponents directory. To add it to the toolbox in BIDS, right-click on the toolbox, select “Choose Items…”, select the SSIS Data Flow Items tab, and select the BatchDestination item in the list.

ssisUnit 1.0 Is Available

The 1.0 bits for ssisUnit are now on CodePlex. The Release page for it includes versions for SQL Server 2005 and 2008.

The biggest feature in this release is a GUI for creating the unit tests. The test files are still XML, but the GUI abstracts away a lot of the detail involved in creating a test. A couple of things to be aware of with the GUI, however:

  • It does not check to see if your test suite has been saved before closing. So remember, "Save Early, Save Often" :)
  • It’s possible (even likely) that you will see some errors while using the GUI. Being the most recently developed part of ssisUnit, it’s had the least testing. So if you see an error, please open an issue on CodePlex.

If you are reading this, and wondering why anyone would want to unit test an SSIS package, please check out the TechEd Online presentation I posted about recently, Testing the SQL Database. You might also consider attending the Unit Testing an SSIS Package session at the SSWUG BI Virtual Conference. I’ll be covering the basics of unit testing SSIS packages, and an overview of using ssisUnit to automate the process.

If you are reading this, and wondering why unit test at all, I highly encourage you to check out Test-Driven Development: By Example, by Kent Beck. Not a traditional read for database developers, but it will give you a completely different take on the importance and benefits of unit testing.

Importing Files Using SSIS

A topic that has come up a few times recently is the idea of loading a set of files into a database using SSIS. One of the data flow components in SSIS is the Import Column transform, which allows you to load files into a binary column. There is a great video by Brian Knight that shows how to use it, and I recommend viewing that  (12/01/2013 update: evidently, the video this post originally linked to is no longer available). If you are looking for a quick overview of the component, and a different approach for loading the list of filenames to import, then read on.

The Import Column transform works in the data flow, and imports one file for each row that is passed through it. It expects a column that contains the file name to import as an input. It outputs a column of type DT_TEXT, DT_NTEXT, or DT_IMAGE that contains the file contents.

I’ve included a sample package with this post that uses the Import Column transform. It has a single data flow that uses a script component to get the list of files to import.

[Screenshot: the sample package’s data flow]

The package has two connection managers, one of which points to a SQL Server database where the files will be stored. The other connection manager is a File connection manager that points to a folder. This is the folder that we want to import the files from.

[Screenshot: the package’s two connection managers]

The script component was created as a Source. A single output column of type DT_WSTR was added to contain the filenames.

[Screenshot: the Script Component’s output column for the filename]

On the connection managers page, the File connection manager is specified so that it can be accessed from the script.

[Screenshot: the Script Component’s Connection Managers page]

The script uses the Directory class from the System.IO namespace. By calling the GetFiles method, the code can iterate through all of the files in the directory, and output one row for each file.

Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper
Imports System.IO

Public Class ScriptMain
    Inherits UserComponent

    Public Overrides Sub CreateNewOutputRows()
        Dim fileName As String

        ' AcquireConnection on the File connection manager returns the path of the folder it points to.
        For Each fileName In Directory.GetFiles(Me.Connections.ImportFilesDir.AcquireConnection(Nothing).ToString())
            Output0Buffer.AddRow()
            Output0Buffer.Filename = fileName
        Next

        Output0Buffer.SetEndOfRowset()
    End Sub

End Class

The next component is the Import Column transform. Configuring it can be a little difficult, and I found the Books Online documentation a little unclear on what had to be done. On the Input Columns tab, the column that contains the filename (including path) to import needs to be selected.

[Screenshot: the Import Column transform’s Input Columns tab]

On the Input and Output Properties tab, a new column was added to hold the binary contents of the file. When adding this column, make a note of the LineageID value, as it needs to be used in the next step.

[Screenshot: adding the output column on the Input and Output Properties tab, noting its LineageID]

After adding the output column, the input column (the one that contains the filename, not the file contents) needs to be selected. The LineageID from the previous step needs to be put into the FileDataColumnID property. This tells the component which column to populate with the file contents.

[Screenshot: setting the FileDataColumnID property on the input column]

The OLE DB Destination is fairly straightforward, as it just maps the columns from the data flow to the database.

Hopefully this helps if you are working with the Import Column transform. The samples are located on my SkyDrive.

Handling Flat Files with Varying Numbers of Columns

5/15 Update – I added Part 2 to show how to do the same thing with a Conditional Split and a Derived Column transform, per Jamie’s feedback (see the comments on this post).
A common question on the forums has been how to handle flat files that have a varying number of columns. For example, one row contains 3 columns, and another row may contain only two columns. The example below shows a sample file that uses a comma to delimit the columns, and a carriage return / line feed to delimit the row.
TestValue1,100,12/01/2007
TestValue2,200
TestValue3,300,12/01/2007
TestValue4,400,12/01/2007
TestValue5,500
TestValue6,600,12/01/2007
TestValue7,700,12/01/2007
TestValue8,800
TestValue9,900,12/01/2007
TestValue0,1000,12/01/2007

SSIS does not handle this scenario easily, due to the way it parses flat files. It parses by looking for the next column delimiter. The row delimiter is just the column delimiter for the last defined column. So, on our second line in the sample file, SSIS is looking for a comma instead of a CR/LF. The result of this is that the third row ends up combined with the second row: the second column of that row contains “200”, the CR/LF, and “TestValue3”, and the third column contains “300,12/01/2007”.

I’m not going to go into a discussion about whether this is good or bad. This article is about how to work around it. If you’d like to see it changed in future versions of SSIS, please go to Connect (https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=124478) and vote for it to be changed.
Now, onto the workaround. First, I’ve defined a flat file connection manager that treats each row as one column. I’m using the row delimiter (CR/LF) as the column delimiter.

If you are following along, your flat file connection manager should now preview each line of the file as a single column.
Next, in a data flow, I’ve added a flat file source that uses the connection manager. It is connected to a script component that is set as a Transform. The Line column is checked as an input.

In the Inputs and Outputs area, I’ve added three columns, for the three real columns in my flat file, and set the data types appropriately.

Finally, I added the following script to the task:
Public Class ScriptMain
    Inherits UserComponent
    Private columnDelimiter() As Char = CType(",", Char())

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        Dim rowValues As String()

        rowValues = Row.Line.Split(columnDelimiter)
        If rowValues.GetUpperBound(0) < 2 Then
            'Row is not complete - Handle error
            Row.Name_IsNull = True
            Row.Number_IsNull = True
            Row.Date_IsNull = True
        Else
            Row.Name = rowValues.GetValue(0).ToString()
            Row.Number = Convert.ToInt32(rowValues.GetValue(1))
            Row.Date = Convert.ToDateTime(rowValues.GetValue(2))
        End If
    End Sub

End Class
The columnDelimiter variable holds the value for the column delimiter – a comma in my case. The Split function parses the value contained in Line (the single column defined in the connection manager) and returns an array containing one element for each column in it. Since I’m expecting 3 columns, I’m performing a check to see if the array contains all three columns (.NET uses 0-based array indexes). If columns are missing, I have an error that needs to be handled. In this example, I am simply setting all my column values to NULL. The error handling could be enhanced by redirecting the rows to an error output, but I wanted to keep things simple. With this method, I could use a conditional split to filter out the rows with NULL.
Finally, if the correct number of columns are present, I’m setting the output columns created earlier with the values from the array. Notice that the Convert is necessary to make sure the value is the correct type.
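For example, a more defensive version of those assignments (just a sketch, using the same column names defined above) could use TryParse so that a malformed number or date simply results in a NULL column instead of a conversion failure:
Dim numberValue As Integer
Dim dateValue As DateTime

Row.Name = rowValues.GetValue(0).ToString()

'TryParse returns False instead of throwing when the text can't be converted
If Integer.TryParse(rowValues.GetValue(1).ToString(), numberValue) Then
    Row.Number = numberValue
Else
    Row.Number_IsNull = True
End If

If DateTime.TryParse(rowValues.GetValue(2).ToString(), dateValue) Then
    Row.Date = dateValue
Else
    Row.Date_IsNull = True
End If
Rows that fail to parse would then flow through with NULL values, just like the incomplete rows, and could be filtered out with the same conditional split.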
That’s pretty much it. Depending on your needs, you may need to customize the script a bit to better handle error conditions or to reparse the columns. I’ve attached the sample package and text file below. The sample uses the Trash Destination from Konesans, which you can download from www.SQLIS.com.

Sample files here

As always, feedback is appreciated.

Handling Multiple Errors in SSIS

[edited on 12/14/2007 to correct an error in the text around string handling - the samples were not modified]
One actual failure in SSIS can trigger a whole series of error messages. For example, failure to convert a column value from a string to an integer in a Data Conversion transform generates the following messages:
[Data Conversion [70]] Error: Data conversion failed while converting column "Fiscal year" (18) to column "NumericFiscalYear" (83). The conversion returned status value 2 and status text "The value could not be converted because of a potential loss of data.".
[Data Conversion [70]] Error: SSIS Error Code DTS_E_INDUCEDTRANSFORMFAILUREONERROR. The "output column "NumericFiscalYear" (83)" failed because error code 0xC020907F occurred, and the error row disposition on "output column "NumericFiscalYear" (83)" specifies failure on error. An error occurred on the specified object of the specified component. There may be error messages posted before this with more information about the failure.
[DTS.Pipeline] Error: SSIS Error Code DTS_E_PROCESSINPUTFAILED. The ProcessInput method on component "Data Conversion" (70) failed with error code 0xC0209029. The identified component returned an error from the ProcessInput method. The error is specific to the component, but the error is fatal and will cause the Data Flow task to stop running. There may be error messages posted before this with more information about the failure.
[DTS.Pipeline] Error: SSIS Error Code DTS_E_THREADFAILED. Thread "WorkThread0" has exited with error code 0xC0209029. There may be error messages posted before this with more information on why the thread has exited.
If you are logging errors to a flat file or an error logging table, then recording each error may be fine. However, if you’re writing the errors to the Windows event log, or sending them via email, you may not want to record multiple messages each time an error occurs. You might want to record only the first message, or you might want to group all the errors into a single log entry or email. Fortunately, the event model in SSIS allows you to easily customize how errors are handled.
I’ve put together a small sample package that shows how you might accomplish this. The package contains a single data flow that loads a text file, attempts to convert a column from string to numeric, and writes it to a Trash destination (see www.SQLIS.com to get this component).
The text file has an invalid value in one of the columns, which will cause the data flow to fail, and generate the four messages listed above. The package is set up to capture all of the error messages generated, store them in a collection, and concatenate them into a single string when the package is finished executing. Once that is done, the resulting string could be emailed or recorded to a log.
As mentioned, the data flow is very straightforward: a flat file source, the Data Conversion transform, and the Trash destination.

I’ve also created two variables at the package level: errorMessages as an Object, and emailText as a String. I’ll explain why later in the post.
The real work occurs in the event handlers. SSIS raises events for all executables (packages and tasks are both executables). The event we’re interested in is the OnError event, which is raised once for each error that occurs.
You get to the event handlers by selecting the Event Handlers tab in the SSIS designer. Once there, the Executable for which you want to capture events needs to be selected.

Since I want to handle errors for anything in the package, I’m setting the executable to CaptureErrors (the name of the package). By default, any event raised by a child executable (that is, an executable that is nested inside another executable) will also be raised in its parent. You can disable that behavior by setting the Propagate system variable, but that’s a topic for another post. I’m also using “OnError” from the list of events and have added a Script Task to the event handler.

The Script Task has two variables passed in: the System::ErrorDescription, which contains the text of the error message, and User::errorMessages, which I’ll use to keep track of all the error messages.

Here’s the script used in the Script Task:
Dim messages As Collections.ArrayList
Try
    messages = CType(Dts.Variables("errorMessages").Value, Collections.ArrayList)
Catch ex As Exception
    messages = New Collections.ArrayList()
End Try

messages.Add(Dts.Variables("ErrorDescription").Value.ToString())
Dts.Variables("errorMessages").Value = messages
Dts.TaskResult = Dts.Results.Success
I’m first attempting to retrieve the ArrayList from the errorMessages variable. If the value of the variable can’t be cast to an ArrayList, it indicates that it hasn’t been initialized yet. Once that has been handled, the error description is added to the ArrayList. This handles capturing the list of all error messages.
The next step is to process all the messages in order to email or log them. Since I only want to do this once, I’m using the OnPostExecute event, which fires when the executable is finished running.

There is another Script Task present in this event handler. This one has the User::errorMessages and User::emailText variables passed in.

The script in this task is concatenating a long string based on the error messages captured and returning it in the emailText variable:
Dim errorDesc As String
Dim messages As Collections.ArrayList

Try
    messages = CType(Dts.Variables("errorMessages").Value, Collections.ArrayList)
Catch ex As Exception
    'If there is an exception - the object was never initialized, so there were no errors
    Return
End Try

For Each errorDesc In messages
    Dts.Variables("emailText").Value = Dts.Variables("emailText").Value.ToString + errorDesc + vbCrLf
Next

Dts.TaskResult = Dts.Results.Success
Once that has been done, the resulting string could be emailed or logged as desired. Since SSIS can generate fairly verbose error messages, I chose to store the messages in an ArrayList object. I then use the ArrayList to build the actual string message.
Hopefully, this information is helpful to anyone who wants to customize SSIS event handling. I’ve attached the sample package, and the text file used in the sample. If you have any feedback or suggestions for improvement, please leave them in the comments.

Retrieving Information from Active Directory with SSIS

I was recently asked to help someone retrieve information from Active Directory for use in SSIS. My initial thought was “This shouldn’t be too difficult. I know there is an OLE DB driver for Active Directory, so it should just be a matter of hooking up a few components.” As it turns out, it took a few hours of painful trial and error to get something working. Also, at the end of it all, I found a different approach that was as simple as I had originally envisioned it. Hopefully, this will save some time for the next person needing to do this.
I’ll start with a summary of the results, in case you don’t feel like wading through the rest. It does work, but there are a few caveats. You can use either an OLE DB connection manager, or an ADO.NET connection manager with the .NET OLE DB wrapper. I recommend the ADO.NET connection manager, as it will allow you to access Active Directory as a source in your data flows. If you are using the OLE DB connection manager, you will be limited to using Execute SQL tasks and passing the recordset into the data flow. There is a sample package attached that shows both approaches.
I started out using an OLE DB connection manager and chose the OLE DB Provider for Microsoft Directory Services. I put in the server name, left the authentication as Windows Integrated, and clicked Test Connection. The test was successful, so I moved on to the next step.

I attempted to set up an OLE DB data source in my data flow, using my OLE DB connection manager. However, nothing worked. So I backed up, and tried an Execute SQL task. Still nothing. My first thought was that I didn’t have the syntax right for the query, but after trying it successfully outside of SSIS, I was looking at the connection again. Eventually I found a reference in MSDN to setting the ADSI Flag to 1 on the connection properties. After updating that, my Execute SQL task started working.

The Execute SQL task was set up to return the full result set to an object variable. The query used was "SELECT cn FROM 'LDAP://DC=DOMAINCONTROLLER,DC=local' WHERE objectClass='User'". This retrieves all the users from Active Directory.

I then went back to using an OLE DB Source in the data flow, but it still wouldn’t work. The OLE DB Source didn’t seem like it was able to execute the query or get any metadata from the provider. So I took the approach of using a script source to process the recordset in the data flow. (See this post for more information on handling recordsets in scripts.) That worked fine, but I wasn’t completely satisfied.
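In case you’re wondering what that script looks like, here’s a minimal sketch of shredding the recordset in a script component source (the variable name adUsers and the output column cn are placeholders – adjust them to match your package):
Imports System.Data
Imports System.Data.OleDb
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper

Public Class ScriptMain
    Inherits UserComponent

    Public Overrides Sub CreateNewOutputRows()
        Dim adapter As New OleDbDataAdapter()
        Dim results As New DataTable()

        'Fill a DataTable from the ADO recordset stored in the object variable
        'that the Execute SQL task populated (configure it as a read-only variable)
        adapter.Fill(results, Me.Variables.adUsers)

        For Each row As DataRow In results.Rows
            Output0Buffer.AddRow()
            Output0Buffer.cn = row("cn").ToString()
        Next

        Output0Buffer.SetEndOfRowset()
    End Sub

End Class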
I went back and tried creating a new connection manager. This time I used the ADO.NET connection manager, but still used the OLE DB Provider for Microsoft Directory Services through the .NET Provider for OLE DB. Just for grins, I didn’t bother setting the ADSI Flag on this connection.

I then created an ADO.NET Reader data source in my data flow, set the connection to the connection manager I just created and put the same LDAP query in it.

When I ran the updated package, it worked the first time through (without the ADSI Flag=1). Since it’s been my understanding that it is always preferable to use the OLE DB connection manager when working with OLE DB providers, this behavior was a little frustrating. It seems like you should get the same behavior whether you are using the OLE DB connection manager or the ADO.NET connection manager with the OLE DB wrapper.
Anyway, that was my journey through SSIS with Active Directory. I’d definitely recommend using the ADO.NET connection manager, as it makes the process much easier. Please check out the sample package attached if you have further questions.