Welcome to Agile BI Community

The Charlotte SQL Saturday is coming up next weekend, March 6th. There is a great lineup of speakers presenting, and we are approaching capacity for the event, so get registered soon, if you haven’t already. It’s going to be a great day – free SQL training from a lot of well-known, well-respected names, a number of our local community members, and some Microsoft people as well.

I’ll be presenting on Creating Custom Components for SSIS – a great way to extend the out-of-the-box functionality of Integration Services, and on Vulcan, an open source framework for modeling and generating portions of your BI solution.

We’ve got a great set of sponsors, including SQL Sentry (thanks Peter and Greg for all the work and support for the event – without them, it wouldn’t be what it is today – and if you are using SSAS, you really should check out Performance Advisor for Analysis Services), and Microsoft, which is providing access to their campus for the event, among other things. Quest, Confio, and Red Gate provide some great tools for administering and developing for SQL Server, Intellinet provides services for SQL Server, and CozyRoc provides a great set of SSIS tasks and components.

Thanks to everyone who made it out to the Columbia Code Camp this weekend, even with the sleet and snow in the area. I had a number of requests for the slides from my presentations, so I’ve uploaded them to my SkyDrive.

Introduction to SSIS (SpeakerRate link)

Creating Custom Components for SSIS (SpeakerRate link) (the sample component used in this presentation is on CodePlex in the Community Tasks and Components project)

Thanks again for attending, and if you have any follow-up questions, please leave them in the comments.

Thanks to the Columbia Enterprise Developers Guild for letting me present last night. The audience was great, and I got a lot of good questions. Several people asked if the samples could be made available, and I also had a request to post the slides for some people who weren’t able to make it. So, here they are. I’ve posted them to my SkyDrive here. If you have any questions or comments, please feel free to leave them here.

It’s a new year, and there’s already a lot going on. The new job is going well, but keeping me extremely busy. I’ve got several upcoming presentations, and there’s a SQL Saturday event planned for Charlotte in March that I’m helping organize. I’m also happy to say that my MVP status was re-awarded for 2010.

I have an upcoming presentation at the Columbia Enterprise Developers Guild, next Wednesday the 13th. The presentation will be on handling flat files in SSIS.

Processing Flat Files with SSIS

When doing data integration, a common requirement is to work with flat files, whether for importing data into a system from an external source, or to export it to provide to other systems. SQL Server Integration Services (SSIS) supports flat files, but there can be a number of challenges when working with them. This is particularly true if your flat files have multiple data formats contained in a single file, the data has complex formatting, or the files have inconsistent formatting. This session will help you to be more efficient when working with these types of files. You’ll learn to handle missing delimiters in the files, and to parse files that have multiple data formats. You’ll also see how to produce complex output formats, like headers and footers that contain summary information.

I’ll also be doing a couple of presentations at the Columbia Code Camp on January 30th.

Creating Custom Components for SSIS

SSIS data flows are great tools for moving data. But what if you need to go beyond the out-of-the-box components provided with SSIS? Custom components are a great way to encapsulate and reuse functionality for the data flow in SSIS. We will discuss what it takes to create and deploy custom components in SSIS, review the pros and cons of using custom components instead of scripts, and discuss some of the common challenges and issues with creating them.

Introduction to SSIS

SQL Server Integration Services is a tool provided with SQL Server for moving data between data stores. It is the successor to DTS, but there are many fundamental changes in how SSIS works. This session will provide an overview of SSIS, with a focus on the key elements of SSIS that you need to know to get the most use out of it. This session will help developers efficiently use SSIS when they need to move data around the organization.

If you happen to be in the area, please drop by for these presentations.

It’s pretty well accepted that raw files in SSIS are a very fast means of getting data in and out of the pipeline. Jamie Thomson has referenced the use of raw files a number of times and Todd McDermid recently posted about using them for staging data. It occurred to me that even though I’d always heard they were faster than other options, I’d never actually tested it to see exactly how much of a difference it would make. So, below I’ve posted some admittedly unscientific performance testing between raw files and flat (or text) files.

I tested two variations of flat files, delimited and ragged right. The delimited file was configured with a vertical bar (|) as the column delimiter and CR/LF as the row delimiter. The ragged right file was configured as a fixed width with row delimiters – each column had a fixed width, and a final, zero-width column was appended with CR/LF as the delimiter. The same data was used for each test, the following columns being defined:

Name             Data Type        Precision  Scale  Length
TestInt32        DT_I4            0          0      0
TestString       DT_STR           0          0      50
TestBool         DT_BOOL          0          0      0
TestCurrency     DT_CY            0          0      0
TestDBTimestamp  DT_DBTIMESTAMP   0          0      0
TestWString      DT_WSTR          0          0      50
TestNumeric      DT_NUMERIC       18         6      0

One thing to note is that when importing from flat files, everything was imported as strings, to avoid any data conversion issues. This is one of the strengths of raw files – no data conversion necessary. But for this test, I was primarily looking at speed of getting the data on and off disk. I also looked at the difference in file sizes between the formats.

I tested each option with 500,000, 1 million, and 10 million rows. I ran each one 4 times for each row count, and discarded the first run to offset the effects of file caching. The results of the runs were averaged for comparison.

When writing files, there are no big surprises between the options. Raw files are faster on 10 million rows, by 7.8 seconds compared to delimited files and by 9.8 seconds compared to ragged right files. The difference on smaller numbers of rows is pretty insignificant. Here’s a chart showing the times (the raw data is at the end of the post):

[Chart: write times in seconds for each file format and row count]

Reading files did show a difference that I didn’t expect. Read speeds on raw files and delimited files are fairly comparable, with raw files still having the edge in speed. However, reads on ragged right files are significantly slower: roughly 3.5 to 5.5 times as slow as raw files, depending on the row count.

[Chart: read times in seconds for each file format and row count]

File sizes were also as expected, with delimited files having a slight edge over raw files, likely because the string values I used were not all 50 characters in length.

[Chart: file sizes in KB for each file format and row count]

In summary, it’s clear that raw files have an advantage in speed. However, the differences weren’t as large as I was expecting, except in the case of ragged right files. So, in general, raw files are best for performance, but if you are dealing with row counts of less than 1 million rows, it’s not a huge difference unless you are really concerned with performance. Of course, there are plenty of other differences between the formats, and I’d encourage you to research them before making a decision.

Here’s the raw data on the number of seconds to produce each file:

Operation             500,000  1,000,000  10,000,000
Write To Delimited       2.61       5.16       47.02
Write To Ragged          2.66       5.31       49.03
Write To Raw             2.21       4.23       39.21
Read From Delimited      0.77       1.52       16.59
Read From Ragged         2.74       5.89       35.39
Read From Raw            0.60       1.08       10.03

and the file size in KB for each:

Format       500,000  1,000,000  10,000,000
Delimited     44,624     89,792     946,745
Ragged        92,286    184,571   1,845,704
Raw           47,039     94,402     973,308
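
To put the read timings in relative terms, here is a small sketch that divides each flat file read time by the raw file time for the same row count. The class and method names are made up for illustration; the numbers are the averages from the tables above.

```csharp
using System;

public static class RawFileBenchmark
{
    // Row counts and average read times (seconds) copied from the tables above.
    static readonly string[] RowCounts = { "500,000", "1,000,000", "10,000,000" };
    static readonly double[] ReadDelimited = { 0.77, 1.52, 16.59 };
    static readonly double[] ReadRagged = { 2.74, 5.89, 35.39 };
    static readonly double[] ReadRaw = { 0.60, 1.08, 10.03 };

    // How many times longer a format took than the raw file for the same row count.
    public static double Ratio(double seconds, double rawSeconds)
    {
        return seconds / rawSeconds;
    }

    public static void Main()
    {
        for (int i = 0; i < RowCounts.Length; i++)
        {
            Console.WriteLine(
                "{0} rows: delimited read {1:F2}x raw, ragged right read {2:F2}x raw",
                RowCounts[i],
                Ratio(ReadDelimited[i], ReadRaw[i]),
                Ratio(ReadRagged[i], ReadRaw[i]));
        }
    }
}
```

The same division works for the write times and file sizes if you want to compare those as well.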

Please let me know if you’d like more details or have any questions.

I’ve worked with Mariner for almost 12 years. It’s been a very good journey, with many great experiences. I’ve worked with a lot of great people, and delivered some really interesting BI solutions to clients in a number of industries. One aspect of my job that I always particularly enjoyed was helping developers be more productive when creating BI solutions, and reducing the repetitive (read: “boring”) aspects of developing solutions on the Microsoft stack.

Recently, a new opportunity to focus more heavily on that came along. As a result, after a long and enjoyable career with Mariner doing business intelligence consulting, I am taking a new position with Varigence, a company that is producing tools that will make implementing BI solutions faster and easier, as well as introduce new capabilities and better integration into the Microsoft BI stack.

I’m really looking forward to the new role and the new experiences it will offer. I will continue to be heavily involved in Microsoft BI, so I plan to maintain this blog and continue speaking and writing on it as often as possible.

We had a good turnout at the Greenville, SC SSIG on Tuesday. If you attended, I hope you enjoyed the presentation. After the meeting, I promised several attendees that I would make the samples developed during the demo available, and here they are. The zip includes both the SSAS project files, and a backup of the sample database that the cube was built on. Both are done using the 2008 version of SQL Server.

If you have any questions, please post them in the comments.

I’ll be doing a presentation on Analysis Services at the SQL Server Innovators Guild in Greenville, SC on Tuesday, Dec. 1st. I’ll be delivering an introduction to SSAS, with lots of demos. If you are interested in attending, please register here. It’s a presentation that I’ve done a few times now, but because it’s mostly demo, something new and interesting always comes up.

Introduction to Analysis Services 2008

This session is intended to introduce database developers to Analysis Services 2008, with a focus on being able to quickly construct usable OLAP cubes. This presentation will be light on slides, and heavy on demonstrating how to perform the steps to create the cubes. During this session, we will cover the creation of a new cube from an existing database step by step. We will also highlight the reasons for using Analysis Services, and applicable scenarios for using it.

One of the common problems that beginners have with SSIS is debugging errors involving variables. One example of this occurs when a package uses a Foreach Loop container. These are often used to set a variable value differently for each iteration of a loop. If something fails during the loop, you might want to check the value of the variable in order to determine what went wrong.

Fortunately, this is pretty easy to accomplish in SSIS. You can see the value of any package variable in BIDS when you debug the package by following the steps below:

  1. First, set a breakpoint on a task where you'd like to check the current variable values. You can set a breakpoint by right-clicking on the task and choosing Edit Breakpoints.
  2. Choose OnPreExecute to see values before the task executes and OnPostExecute to see them after execution. Click OK after enabling the breakpoint.
  3. Run the package in debug mode (press F5) in Visual Studio. The package will run until the breakpoint is hit.
  4. Once execution stops at the breakpoint, open the Locals window (Ctrl+Alt+V, L or Debug > Windows > Locals).
  5. Expand the Variable node in the Locals window. You can see the current values for all your variables, including system variables, in this window. You may have to scroll down to see your variables in the list.

This is a useful technique for troubleshooting packages that use variables, particularly if the variable values are changed during package execution.

I’m really looking forward to the PASS Summit next week, and getting a chance to visit with a lot of the people in the community that I interact with on a regular basis. It’s going to be a really busy week, as there are a lot of great sessions that I’m looking forward to attending, and a few things that I’m going to be delivering myself.

A quick summary of where I’ll be during the conference:

Outside of that, I’ll be around at other sessions, the evening events, and in the “Ask the Experts” area. Looking forward to seeing everyone there.

There’s a new book available for pre-order - “SQL Server MVP Deep Dives”. This book is a little unusual in that 53 MVPs came together to contribute 59 chapters to the book. Some of the best SQL Server authors in the world contributed chapters to it. I’m certainly not one of that group, but somehow, I managed to get included, and it’s a great honor to be in such good company. The book covers a wide variety of SQL Server topics, including:

  • design
  • development
  • administration
  • tuning and optimization
  • and business intelligence (my personal favorite)

This book was a special project for the authors involved. 100% of the author royalties go to War Child International, which is a charity that works to help children affected by war across the world.

So, if you like the idea of learning some interesting things about SQL Server and helping children at the same time, get this book. If you are attending the PASS Summit, it will be available for purchase from the conference bookstore onsite. There will be a large number of the authors at the Summit (including me), so there will be plenty of opportunities to get your copy signed.

Occasionally, you may run into the need to pass values between packages. In most cases, you can use a Parent Package Variable configuration to pass the value of a variable in the calling (or parent) package to the called (or child) package. However, Parent Package Variable configurations can’t be set up on variables of type Object. Fortunately, you can work around this pretty easily, thanks to the fact that the calling package variables are inherently accessible in the called packages. 

I’ve set up a sample parent and child package to illustrate this. The parent package is straightforward. It uses an Execute SQL task to populate an object variable named TestVar with a recordset object, and then calls the child package.


The child package has a Foreach Loop Container to iterate over the recordset object. It has a Script task that is used to copy the parent package’s variable (TestVar) to a local variable named LocalVar. This is the variable that the Foreach Loop is configured to use. Why copy the value? If you don’t have a local variable to reference in the Foreach Loop, it won’t validate properly.


The script in Copy Variable is pretty simple. It relies on the fact that you can reference parent package variables inherently, as they are included in the collection of variables accessible in the local package. The script just copies the value from one variable to the other, so that the Foreach Loop will have something to do.

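public void Main()
{
    Variables vars = null;
    // Lock the child package's local variable for writing and the parent's variable for reading.
    Dts.VariableDispenser.LockForWrite("User::LocalVar");
    Dts.VariableDispenser.LockForRead("User::TestVar");
    Dts.VariableDispenser.GetVariables(ref vars);
    // Copy the parent's recordset object into the local variable used by the Foreach Loop.
    vars["User::LocalVar"].Value = vars["User::TestVar"].Value;
    // Release the locks so that other tasks can access the variables.
    vars.Unlock();
    Dts.TaskResult = (int)ScriptResults.Success;
}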

Please note that for this to work, you cannot have a variable in the child package with the same name as the variable in the parent package. If you do, the local variable will hide the parent variable. Outside of that, this works really well for passing object values between packages. The same technique can also be used in reverse to send values back to the parent package, if you have that need.
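
As a rough sketch of the reverse direction, the Script task below copies a value from the child package into a variable that exists only in the parent package. The variable names User::ParentResult and User::LocalResult are hypothetical and not part of the sample. The method is meant to replace Main() in the Script task's generated ScriptMain class, so it is not a standalone program.

```csharp
public void Main()
{
    Variables vars = null;
    try
    {
        // User::ParentResult (hypothetical) is defined only in the parent package;
        // User::LocalResult (hypothetical) is defined in the child package.
        Dts.VariableDispenser.LockForWrite("User::ParentResult");
        Dts.VariableDispenser.LockForRead("User::LocalResult");
        Dts.VariableDispenser.GetVariables(ref vars);

        // Write the child's value into the parent's variable.
        vars["User::ParentResult"].Value = vars["User::LocalResult"].Value;
        Dts.TaskResult = (int)ScriptResults.Success;
    }
    finally
    {
        // Release the locks even if the copy throws.
        if (vars != null)
        {
            vars.Unlock();
        }
    }
}
```

The try/finally is a defensive choice rather than a requirement; it just makes sure the locks are released if the copy fails.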

The sample has been uploaded to my SkyDrive. Let me know if you have any questions.

If you develop custom components for SSIS, you may have the need to update them as you add new functionality. If you are just upgrading the functionality, but not changing the metadata, then you can simply recompile and redeploy the component. An example of this type of update would be changing the component to do additional warning or informational logging. The code has to be updated, but the metadata (the properties of the component, the settings for the inputs and outputs) was not modified.

The other type of update involves changing the component’s metadata. Examples of this would be adding a new property to the component or adding new inputs or outputs. In this case, you could increment the assembly version of your component, but then you would have to remove the old one from any data flows, and then add the new one back in and reconnect it. Rather than forcing users of the component to go through that effort for every package that uses the component, you can implement the PerformUpgrade method on your component. The PerformUpgrade method will be called when the package is loaded and the current version of the component does not match the version stored in the package’s metadata. You can use this method to compare the current version of the component to the expected version, and adjust the metadata appropriately.

Setting the CurrentVersion

To use this, you have to tell SSIS what the current version of your component is. You do this by setting the CurrentVersion property in the DtsPipelineComponent attribute that can be set on the PipelineComponent class:

[DtsPipelineComponent(
    DisplayName = "Test Component",
    ComponentType = ComponentType.Transform,
    CurrentVersion = 1,
    NoEditor = true)]
public class TestComponent : PipelineComponent

The CurrentVersion property defaults to zero, so a value of 1 indicates that this component is now on its second version.

Performing the Upgrade

Next, you need to implement some code in the PerformUpgrade method. This consists of first getting the value of the CurrentVersion property, and at the end of the method, setting the version in the component’s metadata to the current version.

public override void PerformUpgrade(int pipelineVersion)
{
    // Obtain the current component version from the attribute.
    DtsPipelineComponentAttribute componentAttribute = 
      (DtsPipelineComponentAttribute)Attribute.GetCustomAttribute(this.GetType(), typeof(DtsPipelineComponentAttribute), false);
    int currentVersion = componentAttribute.CurrentVersion;
    if (ComponentMetaData.Version < currentVersion)
    {
        //Do the upgrade here
    }
    // Update the saved component version metadata to the current version.
    ComponentMetaData.Version = currentVersion;
}

The actual upgrade code can vary a good bit, from adding custom properties, to adjusting the data types of outputs, to adding or deleting inputs or outputs. I won’t show the logic for these things here, but it’s pretty similar to the code you’d use in ProvideComponentProperties.

Handling Multiple Upgrades

The code above is based on the sample in Books Online, but there’s a slight issue. Determining what upgrades need to be applied can be more complicated than simply comparing the current version to the ComponentMetaData version. Imagine that you have already upgraded the component from version 0 to version 1, by adding a new property. Now, you discover a need to add another new property, which will result in version 2. What do you do about the property added in version 1? You don’t want to add it twice for components that have already been upgraded to version 1. But it’s also possible that not all packages have been upgraded from version 0 yet, so for those you need to add both properties. By altering the version check logic a little, you can accommodate upgrading from multiple versions pretty easily:

if (ComponentMetaData.Version < 1)
{
    //Perform upgrade for V1
}
if (ComponentMetaData.Version < 2)
{
    //Perform upgrade for V2
}

This change will ensure that the appropriate upgrade steps are taken for each version.

Some Other Thoughts

There are a few things to be aware of with PerformUpgrade. One, it’s called only when the package is loaded, and the version stored in the package’s metadata differs from the version of the binary component. This can occur both at design time (when the package is opened in Visual Studio), or at runtime (when executing the package from DTEXEC, etc).

Two, when you update the CurrentVersion property, and then add the component to a new package, the version number in the package metadata will initially be set to 0. So the next time the package is opened, the PerformUpgrade will be performed. Since the ProvideComponentProperties would have already set the metadata appropriately for the new version of the component, the PerformUpgrade can cause errors by attempting to add the same metadata again. This appears to be a bug in the behavior when adding the component to the data flow, and it occurs under both 2005 and 2008. The workaround is to code the PerformUpgrade method to check before altering any metadata, to make sure that it doesn’t already exist.

Three, due to what looks like another bug, when the package is opened the second time after the component is initially added to the package, the version will be incremented at the end of PerformUpgrade (assuming you use the code above that updates the version). However, this change does not mark the package as dirty in the designer, so the updated version number will not be saved unless some other property in the package is modified, and then the package is saved. This isn’t a huge problem – though you do need to make sure that the code in PerformUpgrade can be run repeatedly to avoid issues.
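
Here is a minimal sketch of a PerformUpgrade that can be run repeatedly, combining the per-version checks with a check that the metadata doesn't already exist. The property names BatchSize and LogWarnings, their default values, and the EnsureCustomProperty helper are all hypothetical. The sketch uses the SSIS 2008 interfaces (IDTSCustomProperty100); under 2005 the equivalent interface is IDTSCustomProperty90. It needs references to the SSIS pipeline assemblies to compile.

```csharp
using System;
using Microsoft.SqlServer.Dts.Pipeline;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;

[DtsPipelineComponent(
    DisplayName = "Test Component",
    ComponentType = ComponentType.Transform,
    CurrentVersion = 2,
    NoEditor = true)]
public class TestComponent : PipelineComponent
{
    public override void PerformUpgrade(int pipelineVersion)
    {
        // Obtain the current component version from the attribute.
        DtsPipelineComponentAttribute componentAttribute =
            (DtsPipelineComponentAttribute)Attribute.GetCustomAttribute(
                this.GetType(), typeof(DtsPipelineComponentAttribute), false);
        int currentVersion = componentAttribute.CurrentVersion;

        if (ComponentMetaData.Version < 1)
        {
            // Hypothetical property introduced in version 1.
            EnsureCustomProperty("BatchSize", 1000);
        }
        if (ComponentMetaData.Version < 2)
        {
            // Hypothetical property introduced in version 2.
            EnsureCustomProperty("LogWarnings", true);
        }

        // Update the saved component version metadata to the current version.
        ComponentMetaData.Version = currentVersion;
    }

    // Adds the custom property only if no property with that name exists yet,
    // so running the upgrade more than once leaves the metadata unchanged.
    private void EnsureCustomProperty(string name, object defaultValue)
    {
        foreach (IDTSCustomProperty100 existing in ComponentMetaData.CustomPropertyCollection)
        {
            if (existing.Name == name)
            {
                return;
            }
        }

        IDTSCustomProperty100 property = ComponentMetaData.CustomPropertyCollection.New();
        property.Name = name;
        property.Value = defaultValue;
    }
}
```

The same "look before you add" approach should carry over to inputs, outputs, and output columns, though the collections involved are different.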

That’s pretty much it. Hopefully this will be helpful if you are developing custom components for SSIS.

Since SQL Azure is currently in a Community Technology Preview, the technology and the information provided below are subject to change. This post is based on the August 18th CTP.

Now that I’ve been working with SSIS against Azure for a few days, I thought I’d post about my experiences. Overall, I’m pretty happy with it, considering that it is a pre-release product. I’ve had some good and some bad experiences, but with what I am seeing right now, and the direction it’s heading in, I think it has a good future.

Prior to the CTP, people wanting to get an early start with SQL Azure were advised to develop locally against SQL Express. Theoretically, you could then simply change your connection strings to point to SQL Azure, and away you go. In practice, that’s not exactly how it worked for me with SSIS (your mileage may vary - .NET apps are probably much easier to port).

Make sure you read through the documentation first – there’s a lot of good information there, and some of it is pretty important. The first thing to note is that SQL Azure currently does not support OLE DB. The normal recommendation for SSIS is to use the OLE DB Source or Destination to access SQL Server. However, if you want to port your packages to SQL Azure, you must use the ADO.NET Source and Destinations. This is fine for 2008, but if you are using SSIS 2005, there is no ADO.NET Destination, so you would have to implement your own through a script component.

The second thing to be aware of is that bulk insert operations are not currently supported (though it’s been said they will be available in a later CTP). Since the ADO.NET Destination doesn’t support bulk inserts anyway, this isn’t a huge issue. However, if you are writing your own destination (in a script component or custom component), you can’t currently use the ADO.NET SqlBulkCopy class.
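
As a rough sketch of what a hand-written destination could do instead of SqlBulkCopy, the class below inserts one row at a time through a parameterized ADO.NET command. The class name, the table [dbo].[MyTable], and its two columns are assumptions for illustration, and the connection string has to be supplied by the caller. It needs a reachable database to run.

```csharp
using System;
using System.Data;
using System.Data.SqlClient;

// Hypothetical helper: writes rows one at a time with a reusable INSERT command.
public sealed class MyTableWriter : IDisposable
{
    private readonly SqlConnection connection;
    private readonly SqlCommand insert;

    public MyTableWriter(string connectionString)
    {
        connection = new SqlConnection(connectionString);
        connection.Open();

        // The table and column names here are assumptions for illustration.
        insert = new SqlCommand(
            "INSERT INTO [dbo].[MyTable] ([MyString], [UpdateID]) " +
            "VALUES (@MyString, @UpdateID)",
            connection);
        insert.Parameters.Add("@MyString", SqlDbType.VarChar, 30);
        insert.Parameters.Add("@UpdateID", SqlDbType.Int);
    }

    // Sets the parameter values for one row and executes the INSERT.
    public void WriteRow(string myString, int updateId)
    {
        insert.Parameters["@MyString"].Value = myString;
        insert.Parameters["@UpdateID"].Value = updateId;
        insert.ExecuteNonQuery();
    }

    public void Dispose()
    {
        insert.Dispose();
        connection.Dispose();
    }
}
```

In a script component, one way to use this would be to create the writer in PreExecute, call WriteRow for each incoming row, and dispose of it in PostExecute. Row-by-row inserts over an internet connection are likely to be slow, so treat this as a starting point rather than a tuned solution.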

So, with those two caveats out of the way, it should be pretty much like creating any data flow in SSIS – add a source, add a destination, and you are ready to go. However, I got the following error when using the ADO.NET Source and Destination:

[Screenshot: error message]

This error appears to come up because SQL Azure does not currently support the system catalog tables that ADO.NET calls to retrieve table information. For the ADO.NET Source, since you can’t type the table name in, the simplest way to work around this is to use the SQL Command option and specify a SQL Statement instead of the Table or View option.

[Screenshot: ADO.NET Source using the SQL Command option]

For the ADO.NET Destination, your only choice is to use the Table or View option, so you can just type the table name in. The table name must be provided in the following format: "schema"."tablename".

Once this is done, you can run the package, and watch your data move. Once or twice, I saw validation warnings that prevented the package from running, but these all went away the next time I ran it, so I’m guessing it was a momentary connectivity issue. I’m on the road right now, so I don’t have the most stable internet connection available.

I’ll be posting a follow up to this soon that talks about performance, and how you can tune your packages to move data in and out more quickly. I should also have a few performance test results to share.

I’ve been playing around with the SQL Azure CTP for a little bit, and generally, it’s going well. However, as with any new technology, there are plenty of things to learn.  I’m planning a series of posts around SQL Azure to share what I’m learning about it. And yes, there will be some SSIS thrown in there, too – what good is a database in the cloud if you can’t get your data in and out?

One of the first things I did was create a new database and some tables (rather obvious, I suppose – you can’t really do much in SQL without that). Something that you will likely encounter immediately when creating tables is the difference between what you can do in SQL Server and what SQL Azure supports. Primarily, it’s related to physical options affecting the storage. As a comparison, here’s the script that SQL Server Management Studio generates if you right-click on a table and choose Script Table As…Create To.

USE [MyDatabase]
GO 
 
SET ANSI_NULLS ON
GO 
 
SET QUOTED_IDENTIFIER ON
GO 
 
SET ANSI_PADDING ON
GO 
 
CREATE TABLE [dbo].[MyTable](
    [MyKey] [int] IDENTITY(1,1) NOT NULL,
    [MyString] [varchar](30) NOT NULL,
    [UpdateID] [int] NOT NULL,
    [UpdateDate] [datetime] NOT NULL,
 CONSTRAINT [pkMyTable] PRIMARY KEY CLUSTERED (
    [MyKey] ASC)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY],
 CONSTRAINT [akMyTable] UNIQUE NONCLUSTERED (
    [MyString] ASC)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY])
 ON [PRIMARY] 
GO 
 
SET ANSI_PADDING OFF
GO 
 
ALTER TABLE [dbo].[MyTable] ADD
  CONSTRAINT [dfUpdateDate]  DEFAULT (getdate()) FOR [UpdateDate]
GO

Here’s the same CREATE TABLE script, but trimmed down to just the items SQL Azure supports:

SET QUOTED_IDENTIFIER ON
GO 
 
CREATE TABLE [dbo].[MyTable](
    [MyKey] [int] IDENTITY(1,1) NOT NULL,
    [MyString] [varchar](30) NOT NULL,
    [UpdateID] [int] NOT NULL,
    [UpdateDate] [datetime] NOT NULL,
 CONSTRAINT [pkMyTable] PRIMARY KEY CLUSTERED (
    [MyKey] ASC)WITH (STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF),
 CONSTRAINT [akMyTable] UNIQUE NONCLUSTERED (
    [MyString] ASC)WITH (STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF))
GO
 
ALTER TABLE [dbo].[MyTable] ADD
  CONSTRAINT [dfUpdateDate]  DEFAULT (getdate()) FOR [UpdateDate]
GO

This is all documented in the SQL Azure documentation on MSDN, under the Transact-SQL Reference. And, as expected, most of the options that aren’t supported are related to physical storage.

One item that does stand out a bit, though, is USE. The USE statement is supported, but only if it references the current database, as in USE MyDatabase when you are connected to MyDatabase. Executing USE MyOtherDatabase when you are connected to MyDatabase will result in an error. Instead, you have to disconnect from MyDatabase and connect to MyOtherDatabase. It does make some sense not to allow users to switch databases in a multi-tenant model (I can picture all sorts of interesting hacks being created if that were possible). I do wonder, though, why it was included at all, as it is fairly useless in its current form. Maybe a future enhancement?

Anyway, if you are interested in SQL Azure, watch this space for more updates as I continue working with it.
