Welcome to Agile BI Community Sign in | Join | Help

As noted by Matt Masson from the Integration Services team, not much has changed in SSIS 2008 R2. In fact, R2 is pretty much identical, with the exception of the ADO.NET Destination. So if you are developing packages, you can build them using the 2008 or 2008 R2 version of BIDS, and they can be used in either environment successfully, as long as the package doesn’t have an ADO.NET Destination. If it does, a package developed in the 2008 R2 version of BIDS will give errors when you open on a machine with the 2008 version of SSIS installed. This is because the 2008 version of the ADO.NET Destination doesn’t know what to do with the new property added to the R2 version.

This wouldn’t be a big deal, if you could have side-by-side installs of BIDS 2008 and BIDS 2008 R2, but the install for R2 replaces the 2008 version of BIDS. So, if you need to move packages developed in 2008 R2 to 2008 (a common scenario for me), you have to do a little extra work. There are a few values that need to be removed or changed by directly editing the package XML.

As always, it’s a good idea to make a backup of your package before editing the XML directly.

Open the package in your favorite text or XML editor, and look for the section that contains <components>. Underneath that, you need to locate the <component> tag that relates to your ADO NET Destination (the sample below is easy, since the name is “ADO NET Destination”, but that’s not the typical case (you do give your components meaningful names, right?).

<component id="16" name="ADO NET Destination" componentClassID="{2E42D45B-F83C-400F-8D77-61DDE6A7DF29}" description="Writes to a database using ADO.NET provider." localeId="-1" usesDispositions="true" validateExternalMetadata="True" version="1" pipelineVersion="0" contactInfo="Writes to a database using ADO.NET provider.;Microsoft Corporation; Microsoft SqlServer v10; © 2007 Microsoft Corporation; All Rights Reserved; http://www.microsoft.com/sql/support;1">
  <properties>
    <property id="23" name="TableOrViewName" dataType="System.String" state="default" isArray="false" description="The Destination database table name." typeConverter="" UITypeEditor="Microsoft.DataTransformationServices.Controls.ModalMultilineStringEditor, Microsoft.DataTransformationServices.Controls, Version= 10.0.0.0, Culture=neutral, PublicKeyToken=89845dcd8080cc91" containsID="false" expressionType="Notify">"sample"</property>
    <property id="24" name="BatchSize" dataType="System.Int32" state="default" isArray="false" description="Specify the number of rows per batch. The default is 0 which uses the same size as SSIS internal buffer. " typeConverter="" UITypeEditor="" containsID="false" expressionType="Notify">0</property>
    <property id="25" name="CommandTimeout" dataType="System.Int32" state="default" isArray="false" description="The number of seconds before a command times out. A value of 0 indicates infinite time-out. " typeConverter="" UITypeEditor="" containsID="false" expressionType="Notify">30</property>
    <property id="26" name="UseBulkInsertWhenPossible" dataType="System.Boolean" state="default" isArray="false" description="Specifies whether to use the SqlBulkCopy interface to improve the performance of bulk insert operations. Only certain providers support this interface." typeConverter="" UITypeEditor="" containsID="false" expressionType="Notify">true</property>
    <property id="27" name="UserComponentTypeName" dataType="System.String" state="default" isArray="false" description="" typeConverter="" UITypeEditor="" containsID="false" expressionType="None">Microsoft.SqlServer.Dts.Pipeline.ADONETDestination, Microsoft.SqlServer.ADONETDest, Version=10.0.0.0, Culture=neutral, PublicKeyToken=89845dcd8080cc91</property>
  </properties>

Once you’ve located the right <component> tag, you need to make two changes:

One, change the version=”1” attribute in the <component> tag to version=”0”.

<component id="16"
           name="ADO NET Destination"
           componentClassID="{2E42D45B-F83C-400F-8D77-61DDE6A7DF29}"
           description="Writes to a database using ADO.NET provider."
           localeId="-1"
           usesDispositions="true" 
           validateExternalMetadata="True" 
           version="0" 
           pipelineVersion="0" 
           contactInfo="Writes to a database using ADO.NET provider.;Microsoft Corporation; Microsoft SqlServer v10; © 2007 Microsoft Corporation; All Rights Reserved; http://www.microsoft.com/sql/support;1">

Second, remove the entire <property name=”UseBulkInsertWhenPossible> element. You can comment it out, as shown below, or just delete it.

    <property id="25" name="CommandTimeout" dataType="System.Int32" state="default" isArray="false" description="The number of seconds before a command times out. A value of 0 indicates infinite time-out. " typeConverter="" UITypeEditor="" containsID="false" expressionType="Notify">30</property>
    <!--<property id="26" name="UseBulkInsertWhenPossible" dataType="System.Boolean" state="default" isArray="false" description="Specifies whether to use the SqlBulkCopy interface to improve the performance of bulk insert operations. Only certain providers support this interface." typeConverter="" UITypeEditor="" containsID="false" expressionType="Notify">true</property>-->
    <property id="27" name="UserComponentTypeName" dataType="System.String" state="default" isArray="false" description="" typeConverter="" UITypeEditor="" containsID="false" expressionType="None">Microsoft.SqlServer.Dts.Pipeline.ADONETDestination, Microsoft.SqlServer.ADONETDest, Version=10.0.0.0, Culture=neutral, PublicKeyToken=89845dcd8080cc91</property>

If you have a lot of packages to move back and forth, you probably don’t want to hand edit this each time. I wrote a little application that will process a folder of .dtsx files, and strip the 2008 R2 information out. This is done using an XSLT transform (available in the source code). The application and source are available to download from my SkyDrive.

I’ll be presenting a couple of sessions at the Carolina Code Camp on Saturday, May 15th. This will be my second time presenting at the Carolina Code Camp, and I’m looking forward to it. Last time, I had great audience participation, and the event was well organized and fun.

I’ll be presenting on “Processing Flat Files with SSIS” at 8:45 AM and “Unit Testing SSIS Packages” at 12:30 PM. Looking forward to both – it should be fun.

If you are in the area, feel free to drop in – there is no charge for the Code Camp.

The topic for T-SQL Tuesday #006 is LOB data, selected by this month’s host, Michael Coles. If you aren’t familiar with T-SQL Tuesdays, Michael has a nice summary in his post.

In database terms, LOBs are Large OBjects, also referred to as BLOBS (Binary Large OBjects). These are data types that can exceed the maximum row size of 8 KB in SQL Server. Most often, they are used for storing large amounts of text or binary data in the database. An example of this would be an application that stores documents in a database table. SSIS has some special features for handling LOBs in the dataflow, and this post will provide an overview of them and provide a few tips on using them wisely, as well as some approaches to keep them from killing your data flow performance.

Data Types for LOBs

LOBs are represented by three types in SSIS: DT_TEXT, DT_NTEXT, and DT_IMAGE. These map to the SQL Server types VARCHAR(MAX), NVARCHAR(MAX), and VARBINARY(MAX). Each of these types can hold 2,147,483,647 bytes, or almost 2GB of data. Columns with these data types have some special rules.

First, if you want to use them in a Derived Column expression, you’ll probably need to cast it to another type, like DT_STR or DT_WSTR, as the LOB types aren’t supported by most of the built-in functions. One thing to be careful of with this is that the entire contents of the LOB may not fit in the data type you are casting to, which will cause a truncation error. You can either set the Derived Column transformation to ignore truncation errors, or redirect error rows so you can handle the longer values differently.

If you are accessing LOBs in a Script component, you have to use the AddBlobData(), GetBlobData(), and ResetBlobData() methods on the column to get to the data.

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    System.Text.UnicodeEncoding encoding= new System.Text.UnicodeEncoding();

    Row.LOBNTEXT.AddBlobData(encoding.GetBytes("My Test String"));
    byte[] bytes = Row.LOBNTEXT.GetBlobData(0, (int)Row.LOBNTEXT.Length);
    string myString = encoding.GetString(bytes);
}

Note that you should use the System.Text.AsciiEncoding class in the above code if you are working with non-Unicode data.

Transforms for LOB Data

There are two transformations in SSIS specifically designed for working with LOB data: Import Column and Export Column. The Import Column transformation is used to import the contents of a file from disk into a column in the data flow, as a LOB. The Export Column is used to do perform the reverse operation – that is, it saves the contents of a LOB column to a file on disk. Since you can process one file per row, these transformations can be really useful for moving multiple files in and out of a database.

There’s a few good walkthroughs on using these transformations, so rather than reposting the same information, here’s a few links:

The Import Column Transformation, by James Beresford (aka BI Monkey);

The Export Column Transformation, by James Beresford (aka BI Monkey)

Importing Files Using SSIS, by me

Performance

Because these data types have the potential to hold so much data, SSIS handles them a little differently than the standard data types. They are allocated separately from regular data in the buffers. When there is memory pressure, SSIS spools buffers to disk. The potential size of LOB data makes it very likely to be spooled, which can be a pretty major performance bottleneck. To minimize it, you want to make sure that the LOB data is being spooled to a nice, fast disk that doesn’t have a lot of contention. To control what disk they are spooled to, you can set the BLOBTempStoragePath property on the Data Flow task. Note that this is a different setting than BufferTempStoragePath, which controls where regular data is spooled. Ideally, you want to make sure each of these settings points to a different physical disk.

The other item to consider is whether you actually need LOB data. In some scenarios, the LOB data types are used for convenience or because there is a possibility that a small percentage of the data would exceed the maximum length of a regular data type. If you are processing a lot of data with LOB types, and most of the LOB data is small enough to fit in an regular data type, then you may be better off using a two pass approach. In one pass, select only the data with LOB values small enough to fit in a standard data type, and cast it in the SELECT statement. SSIS will process this set of data using the normal buffer approach, which should be faster. Then, do a second pass where you select only the data that is too large for a regular data type. This will minimize the amount of data that has to be processed using LOB data types. To figure out the length of a LOB column in SQL Server, you can use the DATALENGTH function.

SELECT DATALENGTH(MyLOBColumn)
FROM MyLOBTable

As usual with performance, your mileage may vary on this one. It’s worked well for me in a few scenarios, but there’s some overhead involved in using two passes. So, as with any performance tips, test it in your environment, with a solid representative sample of your data, before implementing it in production.

There it is – a whirlwind tour of LOBs in SSIS. If you have any questions or comments, please post them.

I’ll be presenting at the Triad Developers Guild on Tuesday. The presentation will be on Developing Custom Components for SSIS, a topic I’ve spoken on a few times before. Here’s the abstract:

“SSIS data flows are great tools for moving data. But what if you need to go beyond the out-of-the-box components provided with SSIS? John Welch will describe how custom components are a great way to encapsulate and reuse functionality for the data flow in SSIS. He will discuss what it takes to create and deploy custom components in SSIS, review the pros and cons of using custom components instead of scripts, and discuss some of the common challenges and issues with creating them.”

I’m looking forward to the presentation – it usually generates some good discussion, and I like talking about developing for SSIS. If you are in the area, please drop by.

I’ve been meaning to post about this ever since I saw Jamie’s post on “The SQL Developer Gap”. I couldn’t agree more with what Jamie expressed. Prior to getting into serious BI development, I was primarily an application developer. This was at a time when agile development was becoming very popular. As part of that, refactoring, test driven development, continuous integration, and automated unit testing were becoming accepted as good software engineering practices, and tool support was coming along very quickly.

Then I switched over to BI development in the SQL Server 7 time frame. In a lot of ways, it was like going back to the dark ages. No refactoring support, no automated testing, no concept of builds. Nothing significant changed until SQL Server 2005, when tools like SSIS and SSAS took their first steps toward becoming more “developer friendly” by leveraging Visual Studio to easily integrate into source control and the beginnings of multi-developer support. However, there haven’t really been any improvements in this since 2005. Refactoring, automated testing, automated builds, etc., can all be done, but they are painful and time consuming to set up, and require a fair amount of specialized knowledge to do correctly. In addition, these are all skills that the average BI developer usually doesn’t posses.

To join in Jamie’s rant, this is something that has aggravated me increasingly over the last few years. In many ways, BI is ideally suited to an agile approach and developer tools that increase productivity – requirements shift on the whim of the business, you need to deliver quickly and often, and you need easy mechanisms to confirm that what you are delivering provides the correct results. There are many tasks in developing BI solutions that are repetitive and could be easily automated, if only the tools provided better support for it. And developer productivity using the SQL Server BI tools hasn’t seen a significant increase since 2005.

I’m spending a fair amount of my time these days working in Visual Studio, where I have the luxury of a built in unit testing tool, the capability to switch between visual editing and text editing depending on which makes the most sense, the ability to easily do a diff between two versions in source control, a full undo-redo stack, etc. And I get to use add-ins like ReSharper (a fantastic tool that I can’t recommend enough). It really highlights the difference between developing traditional applications and BI applications these days.

That’s part of the reason I joined Varigence, where I have the opportunity to actually help developers deliver BI solutions faster and better. Our approach makes it much easier to support the same features that you see in traditional application development tools. I’ve been pretty pleased to see how easy it is for us to add productivity features to our tools – honestly, it makes me wonder why BI developers had to wait this long for these features to be available in the tools we use on a daily basis.

The Charlotte SQL Saturday is coming up next weekend, March 6th. There is a great lineup of speakers presenting, and we are approaching capacity for the event, so get registered soon, if you haven’t already. It’s going to be a great day – free SQL training from a lot of well-known, well-respected names, a number of our local community members, and some Microsoft people as well.

I’ll be presenting on Creating Custom Components for SSIS – a great way to extend the out-of-the-box functionality of Integration Services, and on Vulcan, an open source framework for modeling and generating portions of your BI solution.

We’re got a great set of sponsors, including SQL Sentry (thanks Peter and Greg for all the work and support for the event – without them, it wouldn’t be what it is today – and if you are using SSAS, you really should check out Performance Advisor for Analysis Services), and Microsoft, who’s providing access to their campus for the event, among other things. Quest, Confio, and Red Gate provide some great tools for administering and developing for SQL Server, Intellinet provides services for SQL Server, and CozyRoc provides a great set of SSIS tasks and components.

Thanks to everyone who made it out to the Columbia Code Camp this weekend, even with the sleet and snow in the area. I had a number of requests for the slides from my presentations, so I’ve uploaded them to my SkyDrive.

Introduction to SSIS (SpeakerRate link)

Creating Custom Components for SSIS (SpeakerRate link)  (the sample component used in this presentation is on CodePlex in the Community Tasks and Components project.

Thanks again for attending, and if you have any follow up questions, please leave them in the comments.

Thanks to the Columbia Enterprise Developer’s Guild for letting me present last night. The audience was great, and I got a lot of good questions. Several people asked if the samples could be made available, and I also had a request to post the slides for some people who weren’t able to make it. So, here they are. I’ve posted them to my SkyDrive here. If you have any questions or comments, please feel free to leave them here.

It’s a new year, and already a lot going on. The new job is going well, but keeping me extremely busy. I’ve got several upcoming presentations, and there’s a SQL Saturday event planned for Charlotte in March that I’m helping organize. I’m also happy to say that my MVP status was re-awarded for 2010.

I have an upcoming presentation at the Columbia Enterprise Developers Guild, next Wednesday the 13th. The presentation will be on handling flat files in SSIS.

Processing Flat Files with SSIS

When doing data integration, a common requirement is to work with flat files, whether for importing data into a system from an external source, or to export it to provide to other systems. SQL Server Integration Services (SSIS) supports flat files, but there can be a number of challenges when working with them. This is particularly true if your flat files have multiple data formats contained in a single file, the data has complex formatting, or the files have inconsistent formatting. This session will help you to be more efficient when working with these types of files. You’ll learn to handle missing delimiters in the files, and parsing files that have multiple data formats. You’ll also see how to produce complex output formats, like headers and footers that contain summary information.

I’ll also be doing a couple of presentations at the Columbia Code Camp on January 30th.

Creating Custom Components for SSIS

SSIS data flows are great tools for moving data. But what if you need to go beyond the out-of-the-box components provided with SSIS? Custom components are a great way to encapsulate and reuse functionality for the data flow in SSIS. We will discuss what it takes to create and deploy custom components in SSIS, review the pros and cons of using custom components instead of scripts, and discuss some of the common challenges and issues with creating them.

Introduction to SSIS

SQL Server Integration Services is a tool provided with SQL Server for moving data between data stores. It is the successor to DTS, but there are many fundamental changes in how SSIS works. This session will provide an overview of SSIS, with a focus on the key elements of SSIS that you need to know to get the most use out of it. This session will help developers efficiently use SSIS when they need to move data around the organization.

If you happen to be in the area, please drop by for these presentations.

It’s pretty well accepted that raw files in SSIS are a very fast means of getting data in and out of the pipeline. Jamie Thomson has referenced the use of raw files a number of times and Todd McDermid recently posted about using them for staging data. It occurred to me, that even though I’d always heard they were faster than other options, I’d never actually tested it to see exactly how much of a difference it would make. So, below I’ve posted some admitted unscientific performance testing between raw files and flat (or text) files.

I tested two variations of flat files, delimited and ragged right. The delimited file was configured with a vertical bar (|) as the column delimiter and CR/LF as the row delimiter. The ragged right file was configured as a fixed width with row delimiters – each column had a fixed width, and a final, zero-width column was appended with CR/LF as the delimiter. The same data was used for each test, the following columns being defined:

Name Data Type Precision Scale Length
TestInt32 DT_I4 0 0 0
TestString DT_STR 0 0 50
TestBool DT_BOOL 0 0 0
TestCurrency DT_CY 0 0 0
TestDBTimestamp DT_DBTIMESTAMP 0 0 0
TestWString DT_WSTR 0 0 50
TestNumeric DT_NUMERIC 18 6 0

One thing to note is that when importing from flat files, everything was imported as strings, to avoid any data conversion issues. This is one of the strengths of raw files – no data conversion necessary. But for this test, I was primarily looking at speed of getting the data on and off disk. I also looked at the difference in file sizes between the formats.

I tested each option with 500,000, 1 million, and 10 million rows. I ran each one 4 times for each row count, and discarded the first run to offset the effects of file caching. The results of the runs were averaged for comparison.

When writing files, there’s no big surprises between the options. raw files are faster on 10 million rows by 9.8 seconds. The difference on smaller numbers of rows is pretty insignificant. Here’s a chart showing the times (the raw data is at the end of the post):

image

Reading files did show a difference that I didn’t expect. Read speeds on raw files and delimited files are fairly comparable, with raw files still having the edge in speed. However, reads on ragged right files are significantly slower – well over twice as slow when compared to raw files.

image

File sizes were also as expected, with delimited files having a slight edge over raw files, likely because the string values I used were not all 50 characters in length.

image

In summary, it’s clear that raw files have an advantage in speed. However, the differences weren’t as large as I was expecting, except in the case of ragged right files. So, in general, using raw files are best for performance, but if you are dealing with row counts of less than 1 million rows, it’s not a huge difference unless you are really concerned with performance. Of course, there are plenty of other differences between the formats, and I’d encourage you to research them before making a decision.

Here’s the raw data on the number of seconds to produce each file:

  500,000 1,000,000 10,000,000
Write To Delimited 2.61 5.16 47.02
Write To Ragged 2.66 5.31 49.03
Write To Raw 2.21 4.23 39.21
  500,000 1,000,000 10,000,000
Read From Delimited 0.77 1.52 16.59
Read From Ragged 2.74 5.89 35.39
Read From Raw 0.60 1.08 10.03

and the file size in KB for each:

  500,000 1,000,000 10,000,000
Delimited 44,624 89,792 946,745
Ragged 92,286 184,571 1,845,704
Raw 47,039 94,402 973,308

Please let me know if you’d like more details or have any questions.

I’ve worked with Mariner for almost 12 years. It’s been a very good journey, with many great experiences. I’ve worked with a lot of great people, and delivered some really interesting BI solutions to clients in a number of industries. One aspect of my job that I always particularly enjoyed was helping developers be more productive when creating BI solutions, and reducing the repetitive (read: “boring”) aspects of developing solutions on the Microsoft stack.

Recently, a new opportunity to focus more heavily on that came along. As a result, after a long and enjoyable career with Mariner doing business intelligence consulting, I am taking a new position with Varigence, a company that is producing tools that will make implementing BI solutions faster and easier, as well as introduce new capabilities and better integration into the Microsoft BI stack.

I’m really looking forward to the new role and the new experiences it will offer. I will continue to be heavily involved in Microsoft BI, so I plan to maintain this blog and continue speaking and writing on it as often as often as possible.

We had good turnout at the Greenville, SC SSIG on Tuesday. If you attended, I hope you enjoyed the presentation. After the meeting, I promised several attendees that I would make the samples developed during the demo available, and here they are. The zip includes both the SSAS project files, and a backup the sample database that the cube was built on. Both are done using the 2008 version of SQL Server.

If you have any questions about , please post them in the comments.

I’ll be doing a presentation on Analysis Services at the SQL Server Innovators Guild in Greenville, SC on Tuesday, Dec. 1st. I’ll be delivering an introduction to SSAS, with lots of demos. If you are interested in attending, please register here. It’s a presentation that I’ve done a few times now, but because it’s mostly demo, something new and interesting always comes up.

 

Introduction to Analysis Services 2008

This session is intended to introduce database developers to Analysis Services 2008, with a focus on being able to quickly construct usable OLAP cubes. This presentation will be light on slides, and heavy on demonstrating how to perform the steps to create the cubes. During this session, we will cover the creation of a new cube from an existing database step by step. We will also highlight the reasons for using Analysis Services, and applicable scenarios for using it.

One of the common problems that beginners have with SSIS is debugging errors involving variables. One example of this occurs when a package uses a Foreach Loop container. These are often used to set a variable value differently for each iteration of a loop. If something fails during the loop, you might want to check the value of the variable in order to determine what went wrong.

Fortunately, this is pretty easy to accomplish in SSIS. You can see the value of any package variable in BIDS when you debug the package by following the steps below:

  1. First, set a breakpoint on a task where you'd like to check the current variable values. You can set a breakpoint by right-clicking on the task and choosing Edit Breakpoints.image
  2. Choose OnPreExecute to see values before the task executes and OnPostExecute to see them after execution. Click OK after enabling the breakpoint.
    image 
  3. Run the package in debug mode (press F5) in Visual Studio. The package will run until the breakpoint is hit.
    image
  4. Once execution stops at the breakpoint, open the Locals window (Ctrl+Alt+V, L or Debug..Windows..Locals)
    image
  5. Expand the Variable node in the Locals window. You can see the current values for all your variables, including system variables, in this window. You may have to scroll down to see your variables in the list
    image

This is a useful technique for troubleshooting packages that use variables, particularly if the variable values are changed during package execution.

I’m really looking forward to the PASS Summit next week, and getting a chance to visit with a lot of the people in the community that I interact with on a regular basis. It’s going to be a really busy week, as there’s a lot of great sessions that I’m looking forward to attending, and a few things that I’m going to be delivering myself.

A quick summary of where I’ll be during the conference:

Outside of that, I’ll be around at other sessions, the evening events, and in the “Ask the Experts” area. Looking forward to seeing everyone there.

More Posts Next page »