Posts tagged ‘SSIS’

Performance of Raw Files vs. Flat Files

It’s pretty well accepted that raw files in SSIS are a very fast means of getting data in and out of the pipeline. Jamie Thomson has referenced the use of raw files a number of times, and Todd McDermid recently posted about using them for staging data. It occurred to me that even though I’d always heard they were faster than other options, I’d never actually tested it to see exactly how much of a difference it would make. So, below I’ve posted some admittedly unscientific performance testing comparing raw files and flat (or text) files.

I tested two variations of flat files, delimited and ragged right. The delimited file was configured with a vertical bar (|) as the column delimiter and CR/LF as the row delimiter. The ragged right file was configured as a fixed width with row delimiters – each column had a fixed width, and a final, zero-width column was appended with CR/LF as the delimiter. The same data was used for each test, with the following columns defined:

Name             Data Type        Precision   Scale   Length
TestInt32        DT_I4            0           0       0
TestString       DT_STR           0           0       50
TestBool         DT_BOOL          0           0       0
TestCurrency     DT_CY            0           0       0
TestDBTimestamp  DT_DBTIMESTAMP   0           0       0
TestWString      DT_WSTR          0           0       50
TestNumeric      DT_NUMERIC       18          6       0

One thing to note is that when importing from flat files, everything was imported as strings, to avoid any data conversion issues. This is one of the strengths of raw files – no data conversion is necessary. But for this test, I was primarily looking at the speed of getting the data on and off disk. I also looked at the difference in file sizes between the formats.

I tested each option with 500,000, 1 million, and 10 million rows. I ran each one 4 times for each row count, and discarded the first run to offset the effects of file caching. The results of the runs were averaged for comparison.

When writing files, there are no big surprises between the options. Raw files are faster on 10 million rows, by 7.8 seconds over delimited files and 9.8 seconds over ragged right. The difference on smaller numbers of rows is pretty insignificant. Here’s a chart showing the times (the raw data is at the end of the post):

[Chart: write times by format and row count]

Reading files did show a difference that I didn’t expect. Read speeds on raw files and delimited files are fairly comparable, with raw files still having the edge in speed. However, reads on ragged right files are significantly slower – roughly three and a half times slower than raw files on 10 million rows.

[Chart: read times by format and row count]

File sizes were also as expected, with delimited files having a slight edge over raw files (likely because the string values I used were not all 50 characters in length), and ragged right files coming in at nearly twice the size of either.

[Chart: file sizes by format and row count]

In summary, it’s clear that raw files have a speed advantage. However, the differences weren’t as large as I was expecting, except in the case of ragged right files. So, in general, raw files are the best choice for performance, but if you are dealing with less than 1 million rows, the difference is small unless performance is a critical concern. Of course, there are plenty of other differences between the formats, and I’d encourage you to research them before making a decision.

Here’s the raw data – the number of seconds to write and read each file:

                      500,000   1,000,000   10,000,000
Write To Delimited       2.61        5.16        47.02
Write To Ragged          2.66        5.31        49.03
Write To Raw             2.21        4.23        39.21

                      500,000   1,000,000   10,000,000
Read From Delimited      0.77        1.52        16.59
Read From Ragged         2.74        5.89        35.39
Read From Raw            0.60        1.08        10.03

and the file size in KB for each:

             500,000   1,000,000   10,000,000
Delimited     44,624      89,792      946,745
Ragged        92,286     184,571    1,845,704
Raw           47,039      94,402      973,308

Please let me know if you’d like more details or have any questions.

Implementing PerformUpgrade in Custom Components

If you develop custom components for SSIS, you may have the need to update them as you add new functionality. If you are just upgrading the functionality, but not changing the metadata, then you can simply recompile and redeploy the component. An example of this type of update would be changing the component to do additional warning or informational logging. The code has to be updated, but the metadata (the properties of the component, the settings for the inputs and outputs) is not modified.

The other type of update involves changing the component’s metadata. Examples would be adding a new property to the component, or adding new inputs or outputs. In this case, you could increment the assembly version of your component, but then you would have to remove the old version from any data flows, add the new version back in, and reconnect it. Rather than forcing users of the component to go through that effort for every package that uses it, you can implement the PerformUpgrade method on your component. PerformUpgrade is called when the package is loaded and the current version of the component does not match the version stored in the package’s metadata. You can use this method to compare the current version of the component to the expected version, and adjust the metadata appropriately.

Setting the CurrentVersion

To use this, you have to tell SSIS what the current version of your component is. You do this by setting the CurrentVersion property of the DtsPipelineComponent attribute on your PipelineComponent class:

[DtsPipelineComponent(
    DisplayName = "Test Component",
    ComponentType = ComponentType.Transform,
    CurrentVersion = 1,
    NoEditor = true)]
public class TestComponent : PipelineComponent

The CurrentVersion property defaults to zero, so a value of 1 indicates that this component is now on its second version.

Performing the Upgrade

Next, you need to implement some code in the PerformUpgrade method. This consists of first getting the value of the CurrentVersion property, and at the end of the method, setting the version in the component’s metadata to the current version.

public override void PerformUpgrade(int pipelineVersion)
{
    // Obtain the current component version from the attribute.
    DtsPipelineComponentAttribute componentAttribute =
      (DtsPipelineComponentAttribute)Attribute.GetCustomAttribute(this.GetType(), typeof(DtsPipelineComponentAttribute), false);
    int currentVersion = componentAttribute.CurrentVersion;

    if (ComponentMetaData.Version < currentVersion)
    {
        //Do the upgrade here
    }

    // Update the saved component version metadata to the current version.
    ComponentMetaData.Version = currentVersion;
}

The actual upgrade code can vary a good bit – adding custom properties, adjusting the data types of outputs, or adding and deleting inputs or outputs. I won’t show the logic for all of these here, but it’s pretty similar to the code you’d use in ProvideComponentProperties.

Handling Multiple Upgrades

The code above is based on the sample in Books Online, but there’s a slight issue. Determining which upgrades need to be applied can be more complicated than simply comparing the current version to the ComponentMetaData version. Imagine that you have already upgraded the component from version 0 to version 1 by adding a new property. Now, you discover a need to add another new property, which will result in version 2. What do you do about the property added in version 1? You don’t want to add it twice for components that have already been upgraded to version 1. But it’s also possible that not all packages have been upgraded from version 0 yet, so for those you need to add both properties. By altering the version check logic a little, you can accommodate upgrading from multiple versions pretty easily:

if (ComponentMetaData.Version < 1)
{
    //Perform upgrade for V1
}

if (ComponentMetaData.Version < 2)
{
    //Perform upgrade for V2
}

This change will ensure that the appropriate upgrade steps are taken for each version.

Some Other Thoughts

There are a few things to be aware of with PerformUpgrade. One, it’s called only when the package is loaded and the version stored in the package’s metadata is different from the version of the binary component. This can occur either at design time (when the package is opened in Visual Studio) or at runtime (when executing the package from DTEXEC, etc.).

Two, when you update the CurrentVersion property and then add the component to a new package, the version number in the package metadata will initially be set to 0, so the next time the package is opened, PerformUpgrade will be called. Since ProvideComponentProperties will have already set the metadata appropriately for the new version of the component, PerformUpgrade can cause errors by attempting to add the same metadata again. This appears to be a bug in the behavior when adding the component to the data flow, and it occurs under both 2005 and 2008. The workaround is to code the PerformUpgrade method to check, before altering any metadata, that the change hasn’t already been made.
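
As a minimal sketch of that workaround (assuming the 2008 interfaces – IDTSCustomProperty100; the same pattern applies with the 90 interfaces under 2005 – and a hypothetical property named "NewProperty" added in version 1), the upgrade step might check for the property before creating it:

if (ComponentMetaData.Version < 1)
{
    // Check whether the property already exists, so this code is safe
    // to run even if ProvideComponentProperties already added it.
    bool propertyExists = false;
    foreach (IDTSCustomProperty100 property in ComponentMetaData.CustomPropertyCollection)
    {
        if (property.Name == "NewProperty")
        {
            propertyExists = true;
            break;
        }
    }

    if (!propertyExists)
    {
        // Add the property introduced in version 1.
        IDTSCustomProperty100 newProperty = ComponentMetaData.CustomPropertyCollection.New();
        newProperty.Name = "NewProperty";
        newProperty.Description = "Added in version 1 of the component.";
        newProperty.Value = string.Empty;
    }
}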

Three, due to what looks like another bug, when the package is opened the second time after the component is initially added to it, the version will be incremented at the end of PerformUpgrade (assuming you use the code above that updates the version). However, this change does not mark the package as dirty in the designer, so the updated version number will not be saved unless some other property in the package is modified and the package is then saved. This isn’t a huge problem, though you do need to make sure that the code in PerformUpgrade can be run repeatedly without causing issues.

That’s pretty much it. Hopefully this will be helpful if you are developing custom components for SSIS.

Quick Tip for Specifying Columns in Source Components

It’s generally accepted* that when you are using an OLE DB Source or an ADO.NET Source, you should specify all the columns, rather than selecting the table from the dropdown, or using a SELECT * SQL statement.

If you’re like me, then you probably don’t have the columns in all your tables memorized, and you don’t like typing, so getting a complete list of columns is a bit painful. You either have to use the query designer to create the whole query, or go to SSMS and create the SELECT statement from there. However, there is a simpler way. Just type “SELECT * FROM [your table here]” into the source component’s SQL Command box, and then click the Build Query button. The query designer will automatically expand the * into the specific column names. Just click OK as soon as the designer opens, and you end up with a fully populated column list.
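
For example (the table and column names here are hypothetical, just for illustration), typing this into the SQL Command box:

SELECT * FROM dbo.Customer

and clicking Build Query would expand it to something like:

SELECT CustomerID, FirstName, LastName, CreatedDate
FROM dbo.Customer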


*For some of the reasons why this is generally accepted from an SSIS perspective, see these posts from Jamie Thomson: http://blogs.conchango.com/jamiethomson/archive/2006/02/21/SSIS_3A00_-SELECT-_2A002E002E002E00_-or-select-from-a-dropdown-in-an-OLE-DB-Source-component_3F00_.aspx and https://blogs.conchango.com/jamiethomson/archive/2006/01/05/2554.aspx (point #4). If you want reasons from a general SQL standpoint, search for "never use SELECT * in SQL" in your favorite search engine.

Warning about Using String Variables in Derived Column Expressions

I ran across an interesting behavior in SSIS this week, and thought it was worth sharing. Occasionally, I’ll have the need to use the value of a variable (set in the control flow) in the data flow. The typical way to accomplish this is to use a Derived Column transformation to introduce the variable’s value into the data flow. However, you need to be cautious if you are doing this with a String variable.

When you are working with Derived Column expressions, the output type of the expression is based on the input values. For example, inputting an expression like 1+2 will result in an output column data type of DT_I4 (an integer). Basically, the editor guesses the data type based on the input values. Generally, this is OK – if you are referencing other data flow columns, or variables that aren’t of type String, the lengths of the values are static. However, when the editor calculates the length for a string variable, it uses the variable’s current value. So, if your variable currently holds 3 characters, the output data type only expects 3 characters. If the value in the variable is static and never changes, this is fine. But if the value is updated regularly (perhaps it’s a variable in a template package), or the value is set through an expression (so that it changes at runtime), this can cause truncation errors in the package.

I created a sample package that demonstrates this. The package is pretty simple: a single Data Flow task, with an OLE DB Source, a Derived Column transformation, and a Union All (just there to terminate the data flow).

[Screenshot: the sample data flow – OLE DB Source, Derived Column, and Union All]

There is a single variable named TestVar that contains the value “1234”.

[Screenshot: the TestVar variable with a value of “1234”]

In the Derived Column transformation, there is a single new column added. Note that the calculated length for the output column is 4, matching the value “1234” in the variable.

[Screenshot: the Derived Column editor, showing a calculated length of 4]

If the package is run right now, everything works fine. But that’s not the case if you change the value in the variable to “123456”.

[Screenshot: the TestVar variable with the value changed to “123456”]

Running the package after changing the value results in the following error:

[Derived Column [91]] Error: The "component "Derived Column" (91)" failed because truncation occurred, and the truncation row disposition on "output column "ColumnFromStringVar" (289)" specifies failure on truncation. A truncation error occurred on the specified object of the specified component.

This occurs because the metadata in the Derived Column transformation isn’t updated when the variable’s value is changed. So, to avoid seeing this error occur in your packages, you need to explicitly set the output column’s length.

In 2005, you could change the calculated data type by editing the data type, length, precision, and scale fields. In 2008, however, these values are locked. You can still change the data type by going into the Advanced Editor for the Derived Column, but it’s easier to simply cast the string value, which forces the Derived Column editor to treat it as having a constant length.

[Screenshot: the Derived Column expression with an explicit cast]
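
The cast looks something like this (assuming a 50 character Unicode output column; TestVar is the variable from the sample above):

(DT_WSTR, 50)@[User::TestVar]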

By using this approach, as long as your string variable’s value is 50 characters or fewer, the Derived Column will continue to work. It’s best to set the length of the cast to match the destination column’s length.

There’s a Connect submission on improving this behavior, either by updating the pipeline’s metadata as the string variable’s value changes, or by throwing a validation warning or error if the current value of the variable exceeds the output length in the Derived Column transformation. If you agree that this could use some improvement, you can vote here: https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=470995

In the meantime, I’d highly recommend performing an explicit cast in any Derived Column expression that uses a string variable, particularly if the value is subject to change.

The sample package is on my SkyDrive, if you’d like to see the error in action.

Setting “Work Offline” without Opening the Project

If you’ve done much work with SSIS, you’re probably aware that on opening a package in BIDS, SSIS validates all the objects in the package. This can cause packages to open very slowly, particularly if they have connections to a database that is currently unavailable.

I recently needed to upgrade several older packages to SSIS 2008. Unfortunately, these packages referenced a database that I no longer have access to. On top of that, there are a number of data flows in each package, all of which use this non-existent database. Opening a package in BIDS was taking more than 10 minutes, which is about 9 minutes and 55 seconds past the point where my patience runs out. Normally, this wouldn’t be that big of a deal – once the project was opened, I would just set the Work Offline option (located under the SSIS menu in BIDS), which prevents the validation from running.

[Screenshot: the Work Offline option under the SSIS menu in BIDS]

However, each package was in its own project (for reasons I won’t go into in this post, but primarily failure to plan ahead), so I was looking at a very slow and painful process to upgrade these packages.

Fortunately, there is a way to enable the Work Offline option prior to actually opening the project. Locate the BIDS *.user file associated with the project. For SSIS, this file should be located in the same folder as the project (.dtproj) file, and will have a filename like “<project name>.dtproj.user”. Open this file in Notepad, and you should see something like the following (it has a few additional tags in 2008, but the general format is the same):

<?xml version="1.0" encoding="utf-8"?>
<DataTransformationsUserConfiguration xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <Configurations>
    <Configuration>
      <Name>Development</Name>
      <Options>
        <UserIDs />
        <UserPasswords />
        <OfflineMode>false</OfflineMode>
      </Options>
    </Configuration>
  </Configurations>
</DataTransformationsUserConfiguration>

Locate the <OfflineMode> tag and change the value from false to true. Now, when the project is opened, it will already be in Offline mode, so you won’t have to suffer through a lengthy validation process.

Updating Custom Assembly References in SSIS

Just a quick tip related to a recent problem I had with a custom task I developed for SSIS. I had to recompile it recently, and in the process, managed to change the public key associated with the assembly. Since SSIS uses strong names (which include the public key) to reference custom tasks, this caused a bit of a problem. None of the packages that used the custom task would execute properly, and I really didn’t want to go through 20+ packages and delete the task, then add it back and try to remember how it was set up in each package. Fortunately, fixing this was much simpler. I opened one of the packages in the XML view, located the original strong name (which looks something like this: “Mariner.Pamlico.ETL.ControlFlowTasks.CustomTask, Mariner.Pamlico.ETL.ControlFlowTasks, Version=1.0.0.0, Culture=neutral, PublicKeyToken=73ea249dd43ab42e”), and did a search and replace on all the packages to update the reference with the new public key. I could then execute all my packages without any problems, and they all retained their original settings.
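
To illustrate, the search and replace only needs to touch the PublicKeyToken portion of the strong name (the new token below is a made-up value, just for illustration):

Before: Mariner.Pamlico.ETL.ControlFlowTasks.CustomTask, Mariner.Pamlico.ETL.ControlFlowTasks, Version=1.0.0.0, Culture=neutral, PublicKeyToken=73ea249dd43ab42e
After:  Mariner.Pamlico.ETL.ControlFlowTasks.CustomTask, Mariner.Pamlico.ETL.ControlFlowTasks, Version=1.0.0.0, Culture=neutral, PublicKeyToken=89ab01cd23ef45ab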

Like I said, just a quick tip if you ever have to update multiple packages to reference a new version of a custom task or component. A few words of warning – before updating the XML of a package directly, it’s always a good idea to make a backup. Also, this works as long as you haven’t changed the public interface for the task. If you have removed properties, it’s likely to cause new and interesting errors when you attempt to open the package in BIDS.

Check for SQL Agent Job Completion

Recently, a user on the MSDN forums asked how to check a SQL Agent job from SSIS to see if it has completed. Once the job has completed, additional steps need to be performed – for example, archiving flat files – but those steps can’t be done until the job finishes. Fortunately, this isn’t too difficult to accomplish in SSIS.

I put together the following sample package to illustrate this. Basically, the package starts a job, then goes into a loop that checks whether the job has completed.

[Screenshot: the sample control flow – the Start Job task followed by the For Loop]

The Start Job Execute SQL Task uses the sp_start_job stored procedure to start an Agent job. sp_start_job is asynchronous – that is, it does not wait for the job to complete, but returns right away. The sample just runs a job that pauses execution for several seconds before completing.

[Screenshot: the Start Job Execute SQL Task]
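
The statement in this task looks something like the following (the job name is hypothetical – substitute the name of your own Agent job):

EXEC msdb.dbo.sp_start_job @job_name = N'Test Job';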

The For Loop container is configured to look at the @executionStatus variable. The initial expression (InitExpression) sets the value of the variable to 1 at the beginning of the loop. The loop will continue executing until the EvalExpression is false, in this case when @executionStatus is no longer equal to 1.

[Screenshot: the For Loop configuration]
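
Based on that description, the For Loop’s expressions look like this:

InitExpression: @executionStatus = 1
EvalExpression: @executionStatus == 1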

The first step inside the loop is to delay for a second. This gives the job time to get started, and prevents the package from bombarding the server with status requests. The delay is implemented using a WAITFOR statement in an Execute SQL Task.

[Screenshot: the delay Execute SQL Task]
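
A one second delay looks like this:

WAITFOR DELAY '00:00:01';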

Finally, another Execute SQL Task is used to check the current status of the job. This is done by calling sp_help_job, which returns, among other things, the current_execution_status column. The value in this column indicates what the job is doing currently. 1 indicates that it is currently executing, so when the value is no longer 1, it’s time to exit the loop. The value of the current_execution_status column is assigned to the @executionStatus variable, and the loop repeats as long as the value is 1.

[Screenshot: the Check Job Status Execute SQL Task]
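
The statement for this task might look like the following (the job name is hypothetical again; specifying @job_aspect = N'JOB' limits the result to a single row of job information, which includes the current_execution_status column):

EXEC msdb.dbo.sp_help_job @job_name = N'Test Job', @job_aspect = N'JOB';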

This is the Result Set page, where the column is assigned to the variable.

[Screenshot: the Result Set page of the Check Job Status task]

As I mentioned, this is fairly straightforward to implement. The one wrinkle I found is that you do need some amount of delay between calling sp_start_job and checking the execution status. If you check immediately, it’s likely that the job hasn’t started yet, and the loop will exit right away. On my system, introducing the 1 second delay gave the job time to actually get started. Also, you’ll need to alter the sample to call an Agent job that exists on your system.

The sample package is on my SkyDrive. Let me know if you have questions or feedback.

SSIS Event Handlers

There was a question recently from a fellow MVP, who was wondering whether event handlers in SSIS are fired synchronously or asynchronously. The answer (which I double-checked with Matt Masson) is that they are synchronous. When events are fired asynchronously, they are queued up, and the system runs them as it gets to them; the main thread of the program may continue executing while the events are being handled. When they are fired synchronously, however, the main thread is paused while the event handler executes. Events are not queued up – they are executed as soon as they are fired.

What does this mean for your SSIS packages? Well, for the most part it reduces potential issues that can be caused by events firing out of sequence, and it avoids the threading issues that can come with asynchronous events, so synchronous events make the programming model a little simpler. It also means that any work done in an event handler will be complete before the main thread picks back up. As an example, take the following package, which performs some setup prior to running a data flow. Assuming the setup runs correctly, the data flow should be run.

[Screenshot: the sample control flow]

The Perform Setup Work task has an OnPostExecute event handler that verifies that setup was performed properly, and sets the value of a variable to 1 or 0, depending on whether it was successful.

[Screenshot: the OnPostExecute event handler]

The expression on the precedence constraint checks the value of the variable, to determine whether the data flow should be executed.

[Screenshot: the precedence constraint editor]
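
Assuming the variable is named SetupSucceeded (a hypothetical name, for illustration), the precedence constraint would be set to evaluate “Expression and Constraint”, with an expression like this:

@[User::SetupSucceeded] == 1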

Since the event handler is executed synchronously, you can count on any work that is done in the event handler being complete before execution moves on from that task. That includes the setting of any variables in the package, which makes the above approach possible. If the event handlers were run asynchronously, you would not be able to count on the variable being set before the expression on the precedence constraint was checked.

However, because your main package execution will be paused while event handlers are firing, you’ll want to make sure that any code in the event handlers runs quickly. You can’t use the event handlers to queue up long running processes, as you can in some languages.

Event handlers are a very powerful, but often underutilized, feature in SSIS. Hopefully this additional information will make them a little easier to use.

SSIS Tasks and Components

Todd McDermid (a frequent poster on the MSDN SSIS forums, and creator of the Kimball SCD component) and I have started a new project on CodePlex: the SSIS Community Tasks and Components (ssisCTC) project. This project is intended to be an umbrella project for custom SSIS tasks and components that are released as open source, as well as to provide a listing of other freely available components for SSIS. I’ve published the Batch Destination component that I created to the site, and intend to add some more over the coming months.

Why set up a community project around this? Primarily because I’ve found that many single-person projects get one release, and then are never touched again. Having multiple people involved helps ensure that if one person gets busy, or moves on to other interests, there are still people around to maintain and enhance what’s been put out there. So, if you are interested in joining or contributing, please let me know, or get in touch with us through the CodePlex site. Also, if you have or know of a freely available SSIS component or task, let us know, and we’ll be happy to add it to the list.

The Microsoft BI Conference

I’m doing a couple of chalk talks at the Microsoft BI Conference, taking place in Seattle, WA on October 6-8. The chalk talks are a different format from the typical presentation: smaller audience, no (or very few) slides, and a more conversational style. I’m doing one on unit testing for SSIS, and another on building custom components for SSIS. I really enjoy the chalk talk format, and I’m looking forward to some good conversations. If you are planning on being at the conference, please stop by.