Posts tagged ‘SSIS’

Creating a Basic Package Using Biml

This article is going to walk through the process of creating a simple package using Biml and the Biml Package Generator feature in BIDS Helper. To start out, you need to install the latest beta of BIDS Helper from CodePlex. Once that is set up, you should create a new Integration Services project in BIDS. In the project, right-click on the Project in the Solution Explorer. There’s a new item in this menu – Add New Biml File.

Clicking Add New Biml File will add a new file named BimlScript.biml to the Miscellaneous folder in the solution. (The name is automatically generated, so it may be BimlScript1.biml, etc.) You can right-click on the file and choose Rename to give the file a more specific name. For this example, rename the file “BasicPackage.biml”.

Double-clicking on the file will open the XML editor inside of BIDS. The editor supports Intellisense for Biml, so typing an opening tag (“<”) will give you a list of valid options for tags you can use. (If you aren’t seeing the Intellisense, please check this link for troubleshooting steps.)

For this example, copy and paste the following Biml into the document. Since the code below includes the document root tags (<Biml>), you’ll want to make sure you replace the entire contents of the Biml file.

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
    <Connections>
        <Connection Name="AdventureWorks" ConnectionString="Server=.;Initial Catalog=AdventureWorks;Integrated Security=SSPI;Provider=SQLNCLI10"/>
    </Connections>
    <Packages>
        <Package Name="Biml Sample" AutoCreateConfigurationsType="None" ConstraintMode="Linear">
            <Tasks>
                <Dataflow Name="Extract Table List">
                    <Transformations>
                        <OleDbSource Name="Get Table List" ConnectionName="AdventureWorks">
                            <DirectInput>SELECT * FROM sys.tables</DirectInput>
                        </OleDbSource>
                        <Multicast Name="Multicast"/>
                    </Transformations>
                </Dataflow>
            </Tasks>
        </Package>
    </Packages>
</Biml>

The first section (<Connections>) of this Biml defines an OLE DB connection that points to the AdventureWorks database. The next section (inside the <Packages> tag) defines a single package that contains a Dataflow task (the <Dataflow> tag). The Dataflow task contains two components, an OLE DB Source and a Multicast transformation.

The next step is to take this definition of a package, and actually generate the package from it. To do this, right-click on the Biml file, and choose Expand Biml File from the context menu.

A new package will be added to the SSIS Packages folder, named Biml Sample.dtsx. If you review the generated package, you’ll see that it matches up to what was defined in the Biml code.

That’s a quick introduction to the Biml functionality in BIDS Helper. In the next article, we’ll set the stage for some more advanced (read: more interesting) uses of Biml, including some scripting.

Speaking at the Carolina Code Camp

I’ll be presenting a couple of sessions at the Carolina Code Camp on Saturday, May 15th. This will be my second time presenting at the Carolina Code Camp, and I’m looking forward to it. Last time, I had great audience participation, and the event was well organized and fun.

I’ll be presenting on “Processing Flat Files with SSIS” at 8:45 AM and “Unit Testing SSIS Packages” at 12:30 PM. Looking forward to both – it should be fun.

If you are in the area, feel free to drop in – there is no charge for the Code Camp.

T-SQL Tuesday #006 – LOBs in the SSIS Dataflow

The topic for T-SQL Tuesday #006 is LOB data, selected by this month’s host, Michael Coles. If you aren’t familiar with T-SQL Tuesdays, Michael has a nice summary in his post.

In database terms, LOBs are Large OBjects, also referred to as BLOBs (Binary Large OBjects). These are data types that can exceed the maximum row size of 8 KB in SQL Server. Most often, they are used for storing large amounts of text or binary data in the database. An example of this would be an application that stores documents in a database table. SSIS has some special features for handling LOBs in the data flow. This post provides an overview of them, a few tips on using them wisely, and some approaches to keep them from killing your data flow performance.

Data Types for LOBs

LOBs are represented by three types in SSIS: DT_TEXT, DT_NTEXT, and DT_IMAGE. These map to the SQL Server types VARCHAR(MAX), NVARCHAR(MAX), and VARBINARY(MAX). Each of these types can hold 2,147,483,647 bytes, or almost 2GB of data. Columns with these data types have some special rules.

First, if you want to use them in a Derived Column expression, you’ll probably need to cast them to another type, like DT_STR or DT_WSTR, as the LOB types aren’t supported by most of the built-in functions. Be careful with this: the entire contents of the LOB may not fit in the data type you are casting to, which will cause a truncation error. You can either set the Derived Column transformation to ignore truncation errors, or redirect error rows so you can handle the longer values differently.

If you are accessing LOBs in a Script component, you have to use the AddBlobData(), GetBlobData(), and ResetBlobData() methods on the column to get to the data.

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    System.Text.UnicodeEncoding encoding = new System.Text.UnicodeEncoding();

    Row.LOBNTEXT.AddBlobData(encoding.GetBytes("My Test String"));
    byte[] bytes = Row.LOBNTEXT.GetBlobData(0, (int)Row.LOBNTEXT.Length);
    string myString = encoding.GetString(bytes);
}

Note that you should use the System.Text.ASCIIEncoding class in the above code if you are working with non-Unicode data.
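To make the encoding point a bit more concrete, here is a minimal sketch that round-trips the same test string through both encodings outside of SSIS. It only uses System.Text, and the byte arrays stand in for what GetBlobData() would hand back. The class and method names (LobEncodingSketch, Decode) are made up for this example and are not part of the Script component API.

```csharp
using System;
using System.Text;

public static class LobEncodingSketch
{
    // Decodes LOB bytes with whichever encoding matches the column type,
    // instead of hard-coding UnicodeEncoding as in the script above.
    public static string Decode(byte[] blobBytes, Encoding encoding)
    {
        return encoding.GetString(blobBytes);
    }

    public static void Main()
    {
        const string text = "My Test String";

        // Unicode (UTF-16) data, as in a DT_NTEXT column: two bytes per character here.
        byte[] unicodeBytes = Encoding.Unicode.GetBytes(text);

        // Non-Unicode data, as in a DT_TEXT column: one byte per character for ASCII text.
        byte[] asciiBytes = Encoding.ASCII.GetBytes(text);

        Console.WriteLine(unicodeBytes.Length); // 28
        Console.WriteLine(asciiBytes.Length);   // 14

        // Decoding with the matching encoding gives back the original string.
        Console.WriteLine(Decode(unicodeBytes, Encoding.Unicode) == text);
        Console.WriteLine(Decode(asciiBytes, Encoding.ASCII) == text);
    }
}
```

Decoding bytes with the wrong encoding will not throw an exception; it just produces garbage text, which is why it is worth matching the encoding to the column type up front.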

Transforms for LOB Data

There are two transformations in SSIS specifically designed for working with LOB data: Import Column and Export Column. The Import Column transformation is used to import the contents of a file from disk into a column in the data flow, as a LOB. The Export Column transformation performs the reverse operation: it saves the contents of a LOB column to a file on disk. Since you can process one file per row, these transformations can be really useful for moving multiple files in and out of a database.

There are a few good walkthroughs on using these transformations, so rather than reposting the same information, here are a few links:

The Import Column Transformation, by James Beresford (aka BI Monkey)

The Export Column Transformation, by James Beresford (aka BI Monkey)

Importing Files Using SSIS, by me

Performance

Because these data types have the potential to hold so much data, SSIS handles them a little differently than the standard data types. They are allocated separately from regular data in the buffers. When there is memory pressure, SSIS spools buffers to disk. The potential size of LOB data makes it very likely to be spooled, which can be a pretty major performance bottleneck. To minimize it, you want to make sure that the LOB data is being spooled to a nice, fast disk that doesn’t have a lot of contention. To control what disk they are spooled to, you can set the BLOBTempStoragePath property on the Data Flow task. Note that this is a different setting than BufferTempStoragePath, which controls where regular data is spooled. Ideally, you want to make sure each of these settings points to a different physical disk.

The other item to consider is whether you actually need LOB data. In some scenarios, the LOB data types are used for convenience or because there is a possibility that a small percentage of the data would exceed the maximum length of a regular data type. If you are processing a lot of data with LOB types, and most of the LOB data is small enough to fit in a regular data type, then you may be better off using a two-pass approach. In one pass, select only the data with LOB values small enough to fit in a standard data type, and cast it in the SELECT statement. SSIS will process this set of data using the normal buffer approach, which should be faster. Then, do a second pass where you select only the data that is too large for a regular data type. This will minimize the amount of data that has to be processed using LOB data types. To figure out the length of a LOB column in SQL Server, you can use the DATALENGTH function, which returns the size in bytes (so an NVARCHAR(MAX) value uses two bytes per character).

SELECT DATALENGTH(MyLOBColumn)
FROM MyLOBTable

As usual with performance, your mileage may vary on this one. It’s worked well for me in a few scenarios, but there’s some overhead involved in using two passes. So, as with any performance tips, test it in your environment, with a solid representative sample of your data, before implementing it in production.

There it is – a whirlwind tour of LOBs in SSIS. If you have any questions or comments, please post them.

Developer Gaps

I’ve been meaning to post about this ever since I saw Jamie’s post on “The SQL Developer Gap”. I couldn’t agree more with what Jamie expressed. Prior to getting into serious BI development, I was primarily an application developer. This was at a time when agile development was becoming very popular. As part of that, refactoring, test driven development, continuous integration, and automated unit testing were becoming accepted as good software engineering practices, and tool support was coming along very quickly.

Then I switched over to BI development in the SQL Server 7 time frame. In a lot of ways, it was like going back to the dark ages. No refactoring support, no automated testing, no concept of builds. Nothing significant changed until SQL Server 2005, when tools like SSIS and SSAS took their first steps toward becoming more “developer friendly” by leveraging Visual Studio to easily integrate into source control and the beginnings of multi-developer support. However, there haven’t really been any improvements in this since 2005. Refactoring, automated testing, automated builds, etc., can all be done, but they are painful and time consuming to set up, and require a fair amount of specialized knowledge to do correctly. In addition, these are all skills that the average BI developer usually doesn’t possess.

To join in Jamie’s rant, this is something that has aggravated me increasingly over the last few years. In many ways, BI is ideally suited to an agile approach and developer tools that increase productivity – requirements shift on the whim of the business, you need to deliver quickly and often, and you need easy mechanisms to confirm that what you are delivering provides the correct results. There are many tasks in developing BI solutions that are repetitive and could be easily automated, if only the tools provided better support for it. And developer productivity using the SQL Server BI tools hasn’t seen a significant increase since 2005.

I’m spending a fair amount of my time these days working in Visual Studio, where I have the luxury of a built in unit testing tool, the capability to switch between visual editing and text editing depending on which makes the most sense, the ability to easily do a diff between two versions in source control, a full undo-redo stack, etc. And I get to use add-ins like ReSharper (a fantastic tool that I can’t recommend enough). It really highlights the difference between developing traditional applications and BI applications these days.

That’s part of the reason I joined Varigence, where I have the opportunity to actually help developers deliver BI solutions faster and better. Our approach makes it much easier to support the same features that you see in traditional application development tools. I’ve been pretty pleased to see how easy it is for us to add productivity features to our tools – honestly, it makes me wonder why BI developers had to wait this long for these features to be available in the tools we use on a daily basis.

Slides From the Columbia Code Camp

Thanks to everyone who made it out to the Columbia Code Camp this weekend, even with the sleet and snow in the area. I had a number of requests for the slides from my presentations, so I’ve uploaded them to my SkyDrive.


Introduction to SSIS (SpeakerRate link)


Creating Custom Components for SSIS (SpeakerRate link) (the sample component used in this presentation is on CodePlex in the Community Tasks and Components project)


Thanks again for attending, and if you have any follow up questions, please leave them in the comments.

Slides From “Processing Flat Files with SSIS”

Thanks to the Columbia Enterprise Developers Guild for letting me present last night. The audience was great, and I got a lot of good questions. Several people asked if the samples could be made available, and I also had a request to post the slides for some people who weren’t able to make it. So, here they are. I’ve posted them to my SkyDrive here. If you have any questions or comments, please feel free to leave them here.

A New Year and Upcoming Presentations

It’s a new year, and already a lot going on. The new job is going well, but keeping me extremely busy. I’ve got several upcoming presentations, and there’s a SQL Saturday event planned for Charlotte in March that I’m helping organize. I’m also happy to say that my MVP status was re-awarded for 2010.

I have an upcoming presentation at the Columbia Enterprise Developers Guild, next Wednesday the 13th. The presentation will be on handling flat files in SSIS.

Processing Flat Files with SSIS

When doing data integration, a common requirement is to work with flat files, whether for importing data into a system from an external source, or for exporting it to provide to other systems. SQL Server Integration Services (SSIS) supports flat files, but there can be a number of challenges when working with them. This is particularly true if your flat files have multiple data formats contained in a single file, the data has complex formatting, or the files have inconsistent formatting. This session will help you to be more efficient when working with these types of files. You’ll learn to handle missing delimiters in the files, and to parse files that have multiple data formats. You’ll also see how to produce complex output formats, like headers and footers that contain summary information.

I’ll also be doing a couple of presentations at the Columbia Code Camp on January 30th.

Creating Custom Components for SSIS

SSIS data flows are great tools for moving data. But what if you need to go beyond the out-of-the-box components provided with SSIS? Custom components are a great way to encapsulate and reuse functionality for the data flow in SSIS. We will discuss what it takes to create and deploy custom components in SSIS, review the pros and cons of using custom components instead of scripts, and discuss some of the common challenges and issues with creating them.

Introduction to SSIS

SQL Server Integration Services is a tool provided with SQL Server for moving data between data stores. It is the successor to DTS, but there are many fundamental changes in how SSIS works. This session will provide an overview of SSIS, with a focus on the key elements of SSIS that you need to know to get the most use out of it. This session will help developers efficiently use SSIS when they need to move data around the organization.

If you happen to be in the area, please drop by for these presentations.

Performance of Raw Files vs. Flat Files

It’s pretty well accepted that raw files in SSIS are a very fast means of getting data in and out of the pipeline. Jamie Thomson has referenced the use of raw files a number of times and Todd McDermid recently posted about using them for staging data. It occurred to me that even though I’d always heard they were faster than other options, I’d never actually tested it to see exactly how much of a difference it would make. So, below I’ve posted some admittedly unscientific performance testing between raw files and flat (or text) files.

I tested two variations of flat files, delimited and ragged right. The delimited file was configured with a vertical bar (|) as the column delimiter and CR/LF as the row delimiter. The ragged right file was configured as a fixed width with row delimiters – each column had a fixed width, and a final, zero-width column was appended with CR/LF as the delimiter. The same data was used for each test, the following columns being defined:

Name | Data Type | Precision | Scale | Length
TestInt32 | DT_I4 | 0 | 0 | 0
TestString | DT_STR | 0 | 0 | 50
TestBool | DT_BOOL | 0 | 0 | 0
TestCurrency | DT_CY | 0 | 0 | 0
TestDBTimestamp | DT_DBTIMESTAMP | 0 | 0 | 0
TestWString | DT_WSTR | 0 | 0 | 50
TestNumeric | DT_NUMERIC | 18 | 6 | 0

One thing to note is that when importing from flat files, everything was imported as strings, to avoid any data conversion issues. This is one of the strengths of raw files – no data conversion necessary. But for this test, I was primarily looking at speed of getting the data on and off disk. I also looked at the difference in file sizes between the formats.

I tested each option with 500,000, 1 million, and 10 million rows. I ran each one 4 times for each row count, and discarded the first run to offset the effects of file caching. The results of the runs were averaged for comparison.

When writing files, there are no big surprises between the options. Raw files are faster on 10 million rows, by 7.8 seconds compared to delimited files and by 9.8 seconds compared to ragged right files. The difference on smaller numbers of rows is pretty insignificant. Here’s a chart showing the times (the raw data is at the end of the post):

[Chart: write times by file format and row count]

Reading files did show a difference that I didn’t expect. Read speeds on raw files and delimited files are fairly comparable, with raw files still having the edge in speed. However, reads on ragged right files are significantly slower: they took between roughly 3.5 and 5.5 times as long as raw files, depending on the row count.

[Chart: read times by file format and row count]

File sizes were also as expected, with delimited files having a slight edge over raw files, likely because the string values I used were not all 50 characters in length.

[Chart: file sizes by file format and row count]

In summary, it’s clear that raw files have an advantage in speed. However, the differences weren’t as large as I was expecting, except in the case of ragged right files. So, in general, using raw files is best for performance, but if you are dealing with row counts of less than 1 million rows, it’s not a huge difference unless you are really concerned with performance. Of course, there are plenty of other differences between the formats, and I’d encourage you to research them before making a decision.

Here’s the raw data on the number of seconds to produce each file:

Operation (seconds) | 500,000 rows | 1,000,000 rows | 10,000,000 rows
Write To Delimited | 2.61 | 5.16 | 47.02
Write To Ragged | 2.66 | 5.31 | 49.03
Write To Raw | 2.21 | 4.23 | 39.21
Read From Delimited | 0.77 | 1.52 | 16.59
Read From Ragged | 2.74 | 5.89 | 35.39
Read From Raw | 0.60 | 1.08 | 10.03

and the file size in KB for each:

Format (size in KB) | 500,000 rows | 1,000,000 rows | 10,000,000 rows
Delimited | 44,624 | 89,792 | 946,745
Ragged | 92,286 | 184,571 | 1,845,704
Raw | 47,039 | 94,402 | 973,308
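If you want to compare the formats relative to raw files rather than in absolute seconds, a small sketch like the following does the arithmetic on the read times listed above. The class and method names are just for illustration; the numbers are copied from the raw data in this post.

```csharp
using System;

public static class FileTimingRatios
{
    // How many times longer a format took than the raw file for the same row count.
    public static double Ratio(double seconds, double rawSeconds)
    {
        return seconds / rawSeconds;
    }

    public static void Main()
    {
        // Read times in seconds, copied from the raw data in this post.
        int[] rowCounts = { 500000, 1000000, 10000000 };
        double[] delimited = { 0.77, 1.52, 16.59 };
        double[] ragged = { 2.74, 5.89, 35.39 };
        double[] raw = { 0.60, 1.08, 10.03 };

        for (int i = 0; i < rowCounts.Length; i++)
        {
            Console.WriteLine(
                "{0} rows: delimited {1:F2}x raw, ragged right {2:F2}x raw",
                rowCounts[i],
                Ratio(delimited[i], raw[i]),
                Ratio(ragged[i], raw[i]));
        }
    }
}
```

The same Ratio helper works for the write times and the file sizes if you swap in those rows.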

Please let me know if you’d like more details or have any questions.

Implementing PerformUpgrade in Custom Components

If you develop custom components for SSIS, you may have the need to update them as you add new functionality. If you are just upgrading the functionality, but not changing the metadata, then you can simply recompile and redeploy the component. An example of this type of update would be changing the component to do additional warning or informational logging. The code has to be updated, but the metadata (the properties of the component, the settings for the inputs and outputs) was not modified.

The other type of update involves changing the component’s metadata. Examples of this would be adding a new property to the component or adding new inputs or outputs. In this case, you could increment the assembly version of your component, but then you would have to remove the old one from any data flows, and then add the new one back in and reconnect it. Rather than forcing users of the component to go through that effort for every package that uses the component, you can implement the PerformUpgrade method on your component. The PerformUpgrade method will be called when the package is loaded and the current version of the component does not match the version stored in the package’s metadata. You can use this method to compare the current version of the component to the expected version, and adjust the metadata appropriately.

Setting the CurrentVersion

To use this, you have to tell SSIS what the current version of your component is. You do this by setting the CurrentVersion property in the DtsPipelineComponent attribute that can be set on the PipelineComponent class:

[DtsPipelineComponent(
    DisplayName = "Test Component",
    ComponentType = ComponentType.Transform,
    CurrentVersion = 1,
    NoEditor = true)]
public class TestComponent : PipelineComponent

The CurrentVersion property defaults to zero, so a value of 1 indicates that this component is now on its second version.

Performing the Upgrade

Next, you need to implement some code in the PerformUpgrade method. This consists of first getting the value of the CurrentVersion property, and at the end of the method, setting the version in the component’s metadata to the current version.

public override void PerformUpgrade(int pipelineVersion)
{
    // Obtain the current component version from the attribute.
    DtsPipelineComponentAttribute componentAttribute =
      (DtsPipelineComponentAttribute)Attribute.GetCustomAttribute(this.GetType(), typeof(DtsPipelineComponentAttribute), false);
    int currentVersion = componentAttribute.CurrentVersion;

    if (ComponentMetaData.Version < currentVersion)
    {
        //Do the upgrade here
    }

    // Update the saved component version metadata to the current version.
    ComponentMetaData.Version = currentVersion;
}

The actual upgrade code can vary a good bit, from adding custom properties, to adjusting the data types of outputs, to adding or deleting inputs or outputs. I won’t show the logic for these things here, but it’s pretty similar to the code you’d use in ProvideComponentProperties.

Handling Multiple Upgrades

The code above is based on the sample in Books Online, but there’s a slight issue. Determining what upgrades need to be applied can be more complicated than simply comparing the current version to the ComponentMetaData version. Imagine that you have already upgraded the component from version 0 to version 1, by adding a new property. Now, you discover a need to add another new property, which will result in version 2. What do you do about the property added in version 1? You don’t want to add it twice for components that have already been upgraded to version 1. But it’s also possible that not all packages have been upgraded from version 0 yet, so for those you need to add both properties. By altering the version check logic a little, you can accommodate upgrading from multiple versions pretty easily:

if (ComponentMetaData.Version < 1)
{
    //Perform upgrade for V1
}

if (ComponentMetaData.Version < 2)
{
    //Perform upgrade for V2
}

This change will ensure that the appropriate upgrade steps are taken for each version.
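To see the effect of the stepped checks without loading a package, here is a small stand-alone sketch. It does not use the SSIS object model at all: SavedMetadata is a hypothetical stand-in for the metadata stored in a package, and the property names are invented for the example.

```csharp
using System;
using System.Collections.Generic;

public static class UpgradeStepsSketch
{
    // Hypothetical stand-in for the component metadata saved in a package.
    public sealed class SavedMetadata
    {
        public int Version;
        public List<string> Properties = new List<string>();
    }

    public const int CurrentVersion = 2;

    // Applies each version's step only when the saved metadata predates it.
    public static void Upgrade(SavedMetadata metadata)
    {
        if (metadata.Version < 1)
        {
            metadata.Properties.Add("PropertyAddedInV1");
        }

        if (metadata.Version < 2)
        {
            metadata.Properties.Add("PropertyAddedInV2");
        }

        // Record that the metadata now matches the current component version.
        metadata.Version = CurrentVersion;
    }

    public static void Main()
    {
        // A package that was never upgraded, and one already upgraded to version 1.
        var fromV0 = new SavedMetadata { Version = 0 };
        var fromV1 = new SavedMetadata { Version = 1 };
        fromV1.Properties.Add("PropertyAddedInV1");

        Upgrade(fromV0);
        Upgrade(fromV1);

        // Both end up with the same two properties, each added exactly once.
        Console.WriteLine(string.Join(", ", fromV0.Properties));
        Console.WriteLine(string.Join(", ", fromV1.Properties));
    }
}
```

The version 0 package picks up both properties, while the version 1 package only gets the one it was missing.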

Some Other Thoughts

There are a few things to be aware of with PerformUpgrade. One, it’s called only when the package is loaded and the version stored in the package’s metadata is different from the version of the binary component. This can occur both at design time (when the package is opened in Visual Studio) and at runtime (when executing the package from DTEXEC, etc).

Two, when you update the CurrentVersion property, and then add the component to a new package, the version number in the package metadata will initially be set to 0. So the next time the package is opened, PerformUpgrade will be called. Since ProvideComponentProperties would have already set the metadata appropriately for the new version of the component, PerformUpgrade can cause errors by attempting to add the same metadata again. This appears to be a bug in the behavior when adding the component to the data flow, and it occurs under both 2005 and 2008. The workaround is to code the PerformUpgrade method to check before altering any metadata, to make sure that it doesn’t already exist.

Three, due to what looks like another bug, when the package is opened the second time after the component is initially added to the package, the version will be incremented at the end of PerformUpgrade (assuming you use the code above that updates the version). However, this change does not mark the package as dirty in the designer, so the updated version number will not be saved unless some other property in the package is modified, and then the package is saved. This isn’t a huge problem – though you do need to make sure that the code in PerformUpgrade can be run repeatedly to avoid issues.
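A minimal sketch of the "check before you add" idea looks something like this. It is a stand-alone model, not SSIS code: the dictionary stands in for the component’s custom properties, and EnsureProperty and BatchSize are names invented for the example.

```csharp
using System;
using System.Collections.Generic;

public static class IdempotentUpgradeSketch
{
    // Adds the property only when it is missing. Returns true if it was added.
    public static bool EnsureProperty(
        IDictionary<string, object> properties, string name, object defaultValue)
    {
        if (properties.ContainsKey(name))
        {
            return false; // Already present, so leave the existing value alone.
        }

        properties.Add(name, defaultValue);
        return true;
    }

    public static void Main()
    {
        // Simulates a freshly added component: the property already exists
        // even though the stored version number still says 0.
        var properties = new Dictionary<string, object> { { "BatchSize", 500 } };

        // Running the upgrade step repeatedly is harmless.
        Console.WriteLine(EnsureProperty(properties, "BatchSize", 100));
        Console.WriteLine(EnsureProperty(properties, "BatchSize", 100));
        Console.WriteLine(properties["BatchSize"]);
    }
}
```

In a real component, the equivalent check would look through ComponentMetaData.CustomPropertyCollection for a property with the same name before creating a new one. Written this way, the upgrade step is safe no matter how many times PerformUpgrade runs.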

That’s pretty much it. Hopefully this will be helpful if you are developing custom components for SSIS.

Quick Tip for Specifying Columns in Source Components

It’s generally accepted* that when you are using an OLE DB Source or an ADO.NET Source, you should specify all the columns, rather than selecting the table from the dropdown, or using a SELECT * SQL statement.

If you’re like me, then you probably don’t have the columns in all your tables memorized, and you don’t like typing, so getting a complete list of columns is a bit painful. You either have to use the query designer to create the whole query, or go to SSMS and create the SELECT statement from there. However, there is a simpler way. Just type “SELECT * FROM [your table here]” into the source component’s SQL Command box, and then click the Build Query button. The query designer will automatically expand the * into the specific column names. Just click OK as soon as the designer opens, and you end up with a fully populated column list.

 

*For some of the reasons why this is generally accepted from an SSIS perspective, see these posts from Jamie Thomson: http://blogs.conchango.com/jamiethomson/archive/2006/02/21/SSIS_3A00_-SELECT-_2A002E002E002E00_-or-select-from-a-dropdown-in-an-OLE-DB-Source-component_3F00_.aspx and https://blogs.conchango.com/jamiethomson/archive/2006/01/05/2554.aspx (point #4). If you want reasons from a general SQL standpoint, search for “never use ”SELECT *” in SQL” in your favorite search engine.