Archive for September 2008

PASS Summit 2008

Last post on presentations for a while, I think / hope.

I’ll be presenting on Unit Testing in SSIS at the PASS Summit 2008, occurring in Seattle, WA from Nov. 18-21. There’s lots of great content planned at PASS this year, and it’s definitely worth attending. If you are there, drop by the session and say hello.

The Microsoft BI Conference

I’m doing a couple of chalk talks at the Microsoft BI Conference, occurring in Seattle, WA on October 6-8. The chalk talks are a different format than the typical presentation: smaller audience, no (or very few) slides, and more conversational in format. I’m doing one on unit testing for SSIS, and another on building custom components for SSIS. I really enjoy the chalk talk format, and I’m looking forward to some good conversations. If you are planning on being at the conference, please stop by.

SQL Saturday in Greenville, SC

There’s a SQL Saturday coming up in Greenville, SC on October 11th. It’s a free, day-long event with a number of great speakers doing sessions. I don’t include myself in the great speaker category, but I was asked to present a couple of sessions on SSIS, including one on data profiling in the ETL process, one of my favorite topics.

Again, it’s free and there’s some great content, so if you are in the area or close by, you should check it out.

The SSWUG BI Virtual Conference Is Moving…

but not very far, just to Nov. 5-7. In order to provide more value to attendees, the BI conference is being run at the same time as the SharePoint, SQL, and .NET conferences. So now, instead of just having access to the BI content, you’ll be able to pick and choose sessions from all four conferences. Personally, I love the idea of 3 days of solid BI content, but if you like a little more variety, this should be great.


Check in at http://www.vconferenceonline.com/Business-Intelligence/ to see the updated information.

Presenting at the Triad SQL Server User Group

I’m presenting at the Triad SQL Server User Group tomorrow night on Migrating to Reporting Services 2008. It was a lively group last time I was there, and I’m looking forward to visiting them again. If you are in the area, please stop by.

Here’s the abstract for the presentation:

With the new and enhanced capabilities that Reporting Services 2008 offers, more and more companies are evaluating a migration from other reporting technologies to Reporting Services. This session will cover the commonly needed steps in a migration process from both project planning and technical delivery perspectives. Commonly used patterns in migrations will also be discussed. There will be specific focus on how ad-hoc reporting environments can be set up while leveraging the new features in 2008.

They are currently having some problems with their site (at www.triadsql.com) but if you need directions, you can get them from the Triad Developers Group site here: http://www.triaddev.org/directions/. They meet at the same location.

Error Redirection with the OLE DB Destination

A question that comes up occasionally is how to handle errors that occur on the OLE DB Destination in the data flow. Errors that occur when using the OLE DB Destination to insert into a table are usually caused by primary key, foreign key, or check constraint violations. When one of these errors occurs, the default behavior of the OLE DB Destination is to fail and not insert any rows, even the good ones. So the question is, "How can you insert the good rows, while redirecting the bad rows somewhere else?"

There is always the option of preprocessing the data prior to sending it to the OLE DB Destination, to ensure that no constraints will be violated. However, this usually means that you are incurring the validation cost twice – once up front, and then again when the rows are inserted and the constraints are checked. You can also drop and recreate the constraints on the table, but that incurs some performance penalties of its own. As usual with anything dealing with performance, your mileage may vary, so the best way to see which option performs best for you is to test it in your environment.
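For illustration, here’s a rough sketch of the drop and recreate option, using the check constraint from the sample table defined later in this post (substitute your own table, constraint, and load step):

-- Sketch only: drop the check constraint before the load...
ALTER TABLE [dbo].[SampleTable] DROP CONSTRAINT [CK_SampleTable]
GO

-- ...run the data flow / inserts here...

-- ...then put it back. WITH CHECK re-validates every existing row,
-- which is where much of the performance cost of this approach shows up.
ALTER TABLE [dbo].[SampleTable] WITH CHECK ADD CONSTRAINT [CK_SampleTable]
    CHECK ([CheckValue] >= 0 AND [CheckValue] <= 9)
GO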

Another approach is to leave the constraints intact on the destination table, and handle the constraint violations through error redirection. You can enable error redirection on the OLE DB Destination, but you have to change the "Maximum insert commit size" property to make it work. By default, this value is set so that the entire set of rows going to the database is committed at once. A failure to insert one row means that no rows will be inserted.


If you set this to a value of one, the OLE DB Destination will only try to commit a single row at a time. The problem with this is that single row inserts are painfully slow. So you probably want to set this value considerably higher, somewhere between 10,000 and 100,000 rows. You may have to try a few different values to determine what works best in your environment.

Now that you have the OLE DB Destination set up to commit multiple, smaller batches instead of one huge batch, you can enable error redirection. However, it will still redirect the entire batch that contains an error, not just the individual rows in error. For example, if you define a batch size of 10,000 rows and one row in that batch has an error, all 10,000 rows will be redirected to the error output. So how do you set it up to get all the good rows inserted, and redirect only the rows that are actually in error? We really want it to work on single row batches for error handling purposes, but as pointed out earlier, that can be very slow.

There is an approach that allows you to get the best of both worlds – single row error handling with good performance. It involves staging the inserts through multiple OLE DB Destinations, each with smaller batch sizes. The last one in the process should have a commit size of 1, so that you get single rows on error redirection.

To illustrate this technique, I’ve created a sample project. It has a few external dependencies that need to be set up before it will run. You need a database with three tables in it. One is a Numbers table, which is incredibly useful for quickly generating sample data. If you don’t have one, please take a look at this post from Adam Machanic to see why you need one. The second table, defined just after the Numbers table sketch below, is used to test inserting data into a table with constraints.
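If you don’t have a Numbers table handy, a minimal one for this sample could look something like this – the table and column names here ([Numbers], [Number]) are just my own convention, and Adam’s post covers better ways to build and use one:

-- Minimal Numbers table: one column, values 1 through 100,000
CREATE TABLE [dbo].[Numbers](
    [Number] [int] NOT NULL,
 CONSTRAINT [PK_Numbers] PRIMARY KEY CLUSTERED ([Number] ASC)
) ON [PRIMARY]
GO

-- Populate it quickly using a cross join of system views
INSERT INTO [dbo].[Numbers] ([Number])
SELECT TOP (100000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects AS a CROSS JOIN sys.all_objects AS b
GO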

CREATE TABLE [dbo].[SampleTable](
    [SampleTableID] [int] NOT NULL,
    [Name] [varchar](50) NULL,
    [Description] [varchar](250) NULL,
    [CheckValue] [int] NULL,
 CONSTRAINT [PK_SampleTable] PRIMARY KEY CLUSTERED ([SampleTableID] ASC)
) ON [PRIMARY]
GO

ALTER TABLE [dbo].[SampleTable] WITH CHECK ADD CONSTRAINT [CK_SampleTable]
    CHECK ([CheckValue] >= 0 AND [CheckValue] <= 9)
GO

 

The third table is a duplicate of the second table, but with no constraints. This table is used to catch the error rows.

CREATE TABLE [dbo].[SampleTableError](
    [SampleTableID] [int] NOT NULL,
    [Name] [varchar](50) NULL,
    [Description] [varchar](250) NULL,
    [CheckValue] [int] NULL
) ON [PRIMARY]
GO

The package control flow truncates the destination table (SampleTable), and then runs a data flow.
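That truncation is just a single statement (my assumption is an Execute SQL Task, but any mechanism that clears the table before the load will do):

-- Clear out the destination table before each run of the sample
TRUNCATE TABLE [dbo].[SampleTable]
GO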


The data flow consists of a source that generates 5000 rows of data, with two of the rows violating constraints on the destination table. The OLE DB Source is connected to an OLE DB Destination pointed at SampleTable, with a Maximum Insert Commit Size of 1000 rows. The error output is connected to a second OLE DB Destination that inserts into the same table (SampleTable). The second destination has its Maximum Insert Commit Size set to 1. Finally, the error output from that OLE DB Destination is connected to a third one, which inserts the error rows into SampleTableError.
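For reference, the query behind the OLE DB Source looks something like the following. This is a reconstruction of the idea rather than the exact query from the package – the two ID values chosen for the bad rows are arbitrary – but it shows how the Numbers table drives the sample data and how a couple of rows get a CheckValue that violates CK_SampleTable:

-- Generate 5,000 rows from the Numbers table. Two rows get a CheckValue
-- of 99, which violates the CK_SampleTable check constraint (0 to 9).
SELECT
    [Number] AS [SampleTableID],
    'Name ' + CAST([Number] AS varchar(10)) AS [Name],
    'Description ' + CAST([Number] AS varchar(10)) AS [Description],
    CASE WHEN [Number] IN (1234, 4321) THEN 99
         ELSE [Number] % 10 END AS [CheckValue]
FROM [dbo].[Numbers]
WHERE [Number] <= 5000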


The first OLE DB Destination attempts to insert a batch, giving us good performance. If the batch is successful, it moves on to the next batch. This keeps the inserts happening quickly. However, if the batch has one or more error rows, the entire batch is redirected to the second OLE DB Destination. This one inserts rows one at a time, giving us the single row error handling that we wanted. Good rows are inserted, and the error rows are redirected to the third OLE DB Destination to be written to an error table. The error handler doesn’t have to be an OLE DB Destination; you could use a Flat File Destination or a Script Component to process the error rows instead.
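Once the package has run, a quick way to sanity check the results is to count what landed where. With the sample data described above, you’d expect 4,998 rows in SampleTable and the 2 constraint violations in SampleTableError (this query is just my own check, not part of the package):

-- Count the rows that made it into the main table vs. the error table
SELECT 'SampleTable' AS [TableName], COUNT(*) AS [Rows] FROM [dbo].[SampleTable]
UNION ALL
SELECT 'SampleTableError', COUNT(*) FROM [dbo].[SampleTableError]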

You can modify this technique by using additional stages of OLE DB Destinations (for example, going from 50,000 rows to 10,000 to 1), but I try not to go beyond three levels to keep things understandable. Your needs may vary, depending on the performance you need and the number of rows you are processing.

I’ve uploaded the sample package to my SkyDrive. This one is done with SSIS 2008, but the same technique works in 2005.