Posts tagged ‘SSIS 2008’

SSIS 2008 – Incrementally Updating the Lookup Cache File

There are a number of new features in the 2008 Lookup Transform. Jamie Thomson posted a good overview of the features, and I recommend reading it prior to this article. One of the new features in SSIS 2008 is the ability to cache the data from the Lookup to a local file. This should help improve performance, as the lookup data would only need to be retrieved once, stored locally, and reused. Internally, the lookup cache file (.caw) uses a RAW file structure (.raw), which is very fast to read and write from disk.


One of the first things that I wanted to try with the new caching features was to dynamically update the cache file. That proved to be a bit of a challenge, as Jamie pointed out in his article. One of the main reasons for this is that, while the cache file is very similar to a RAW file, the cache file stores some additional metadata – hash values for each row, and some header information. This means that while you can use a RAW Source component to read the data from a cache file, you must use the Cache Transform to actually create a cache file.


Unfortunately, the Cache Transform does not support incremental updates to the cache file. Each time you use it, it recreates the file completely. However, with a little creativity, you can minimize the amount of data you have to process. I’ve created a sample package that illustrates one approach to this.


The control flow looks like this (the numbers refer to the steps below):


[image: control flow]


The major components of the process are:



  1. Retrieve the last modified date from the Lookup reference table, using an Execute SQL Task. (A sketch of this query appears after the table definition below.)

  2. Use a Script Task to determine if the cache file already exists. This leverages the System.IO.File class, as discussed in Checking for the Existence of a File. Execution proceeds to step 3, 4, or 7, based on precedence constraints with expressions.

  3. If the file does not exist, it is created through a data flow. This data flow is straightforward – it retrieves all the rows from the reference table and runs them through a Cache Transform.
      [image: data flow that creates the cache file]

  4. If the cache file does exist and the reference table has been modified since the last run of the package, a File System task is used to rename it. This is done so that the cache file can be recreated in the next step. The file cannot be read from and written to at the same time, so there have to be two copies of it for a short time.

  5. Now the cache file is updated within a data flow task. The OLE DB Source retrieves only the modified rows from the reference table (also sketched after the table definition below). The rows that haven’t been changed are pulled in from the cache file, using the Raw File Source. The Lookup is used to determine whether an existing row from the cache file is also included in the set of modified rows from the reference table. The Lookup uses an expression to set the reference query, so that it contains the same set of rows as the OLE DB Source. If a row matches one from the OLE DB Source, it is discarded; only non-matching rows are passed through the Lookup. Finally, the rows are combined using a Union All and written back out to a new cache file (with the same name as the original).
    [image: data flow that updates the cache file]

  6. The copied version of the cache file is deleted, now that it is no longer needed.

  7. This data flow is used to test the resulting cache file. It is not necessary in a production package.

The Lookup table for this example looks like:

CREATE TABLE [dbo].[StatusLookup](
    [ValidValue] [varchar](50) NULL,
    [AlternateValue] [varchar](50) NULL,
    [LastUpdate] [datetime] NULL
) ON [PRIMARY]
GO
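
As a rough sketch (not necessarily the exact SQL in the sample package), the queries behind steps 1 and 5 boil down to something like this:

-- Step 1: retrieve the most recent modification date from the reference table.
-- The Execute SQL Task stores the result in a package variable.
SELECT MAX([LastUpdate]) AS LastModified
FROM [dbo].[StatusLookup];

-- Step 5: the OLE DB Source pulls only the rows changed since the last run.
-- The ? parameter is mapped to the LastRunDate variable.
SELECT [ValidValue], [AlternateValue], [LastUpdate]
FROM [dbo].[StatusLookup]
WHERE [LastUpdate] > ?;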

This package uses a number of variables:


[image: package variables]


and several expressions (some of which are on the variables above):


[image: package expressions]


These are used to make the package a bit more dynamic. The LastRunDate variable controls when the package thinks it was last executed. In a production application, this would need to be set from a parent package or an Execute SQL Task. The LookupTable variable controls the name of the cache files.
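
If you do drive LastRunDate from an Execute SQL Task, the query could be as simple as the following. The ETLRunLog table and its columns are purely hypothetical here, just to illustrate the idea:

-- Hypothetical audit table; map the single-row result to the LastRunDate variable.
SELECT MAX([RunEndDate]) AS LastRunDate
FROM [dbo].[ETLRunLog]
WHERE [PackageName] = 'IncrementalLookupCache'
  AND [Succeeded] = 1;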


I encourage you to download the sample and walk through it to get an in-depth understanding. I’ve really only covered the high points in this article. It’s posted on my SkyDrive.


This was done against CTP6 of SQL Server 2008. The RTM release may necessitate some changes to the pattern and/or the code described here.

Setting the Lookup Query With an Expression

In SSIS 2005, one of the challenges with the Lookup transform was that the SQL used to populate the cache was fixed at design time. You could dynamically pass in parameters to the SQL, but only if you turned on the advanced caching options, which puts the lookup into row-by-row operation. With smaller tables (<100,000 rows) this wasn’t too much of an issue, but with larger tables it could be a problem. A common scenario where this functionality would be nice is with Type 2 slowly changing dimensions (SCD). If you are extracting a single day of source records (doing a daily incremental load from the source system), then the lookup cache only needs to contain records that were in effect on the date of the extract. In SSIS 2005, there was no easy way to accomplish this. However, the functionality has been added in 2008.


To illustrate this, I’ve created a simple package with a single data flow that uses AdventureWorks and AdventureWorksDW. The OLE DB Source extracts Sales Order Detail records for a single day from AdventureWorks, based on the RunDate variable.


[image: data flow]


[image: OLE DB Source query]


The Product Lookup is getting its rows from the DimProduct table in AdventureWorksDW. This is a Type 2 SCD table, with effective dates. The Lookup is set up normally. Initially, I had to specify a default query that returned the columns needed from DimProduct, so that the metadata for the data flow was set up properly.


To add an expression to the Lookup, I went back to the Control Flow view, and selected the Data Flow Task. Now, in the Properties window, you can see two properties associated with the Lookup:


[image: Lookup properties on the Data Flow Task]


SqlCommand is used to override the inner SQL statement used by the Lookup, while SqlCommandParam can be used to modify the entire statement. SqlCommand is enough for most purposes. I have an expression set on the property that builds a SELECT statement dynamically, including a WHERE clause that filters by the RunDate variable.
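
The statement produced by the expression looks roughly like the one below. I’m assuming the standard DimProduct effective-date columns (StartDate and EndDate) and a placeholder date; the actual column list and date literal are built into the string from the package variables:

-- The date literal is concatenated into the SQL by the expression from @[User::RunDate].
SELECT ProductKey, ProductAlternateKey, StartDate, EndDate
FROM dbo.DimProduct
WHERE StartDate <= '20040731'
  AND (EndDate IS NULL OR EndDate > '20040731')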


[image: SqlCommand expression]


The primary advantage to doing this is a reduction in the number of rows in the lookup cache. For the AdventureWorks sample, filtering the set of dimension rows down to just the ones in effect as of the Sales Order date reduces the cached rows from 606 to 339. That’s not a significant amount in this example, but a similar percentage reduction on a million-row dimension table would lead to significant savings.


I’ve uploaded the sample package to my SkyDrive.

Renaming the Script Project in SSIS 2008

I found something interesting in SSIS 2008 today, and thought I’d share it. I was working with David Darden on a package that included a script task. In 2008, the Visual Studio Tools for Applications (VSTA) environment is used to edit script tasks. It exposes a few more details about the script task, including the project name. By default, it creates a rather long project name, as you can see in the screen shot below. (Yes, that is C#. It’s my preferred language, so expect to see more of it here now that it’s available in 2008). The namespace is based on the project name, so you get the same long value for it.

[image: default script project name]

While poking around in the script task’s expressions editor, we found the ScriptProjectName property. This property does not show up in the Properties window (F4), only in the expressions editor. By setting this, you can update the project name in the script. For example, setting this property to "NewScriptProjectName", as shown here:

[image: ScriptProjectName expression]

results in this showing up in the script project:

[image: renamed script project]

There are a few caveats to this. One, if you do this after initially editing the script, the project name will be updated, but the namespace will remain the same. Two, you must save the package after setting the expression, in order for the expression to be applied. If you add the expression, then edit the script without saving, it will not use the new project name.

So, if you want to name your script project with a friendly name, do the following:

  1. Add the Script Task.
  2. Add an expression to set the ScriptProjectName property to your desired value.
  3. Save the package.
  4. Now you can use the Edit Script button to edit the script project with the friendly name.

I do want to point out that the name of the project has no real impact on functionality. The script will not run any differently if you do this. It’s purely for aesthetic purposes. But if the long, random-looking project names bother you, it’s nice to know you can fix them. There is also an MSDN forum post here that details the steps to edit the namespace after the script has been edited for the first time. It’s a little easier to set it this way, though, so that it is handled up front.

RetainSameConnection and Expressions

I ran across a forum post today that highlighted a limitation with connection managers that I wasn’t aware of, so I thought I’d post it here. If you have RetainSameConnection set to TRUE, and you are also trying to update the connection string through an expression, be aware that the connection will not change to reflect the new connection string. Basically, the connection will retain its original value (as RetainSameConnection would indicate).


You might encounter this if you are looping through a set of databases on a server. If you wanted to perform an operation on each database that required a temp table, you would set RetainSameConnection to TRUE. Unfortunately, as soon as you do that, the connection will no longer be dynamic.


I’ve created a sample package to illustrate this. The package flow is straightforward. There is a ForEach Loop that iterates through the databases on the server. The Execute SQL Task retrieves the row count from the sys.tables system view, and stores it in a variable. The Script Task fires two information messages that show the current value of the connection string, and the row count from the variable.
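
For reference, the two queries involved look something like this (a sketch; the exact SQL in the sample may differ slightly):

-- ForEach Loop source: the list of databases on the server
-- (database_id > 4 skips master, tempdb, model, and msdb).
SELECT name FROM sys.databases WHERE database_id > 4;

-- Run against each database by the Execute SQL Task; the connection string
-- is set by an expression on the connection manager for each iteration.
SELECT COUNT(*) AS TableCount FROM sys.tables;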


[image: package control flow]


These are the results if RetainSameConnection is set to FALSE (with the correct counts):


[image: output with RetainSameConnection = FALSE]


and these are the results if RetainSameConnection is set to TRUE (the counts are all the same, from the initial connection):


[image: output with RetainSameConnection = TRUE]


This isn’t a major issue, but it is helpful to be aware of it. The same behavior occurs under CTP6 for Katmai. It makes sense, given that the RetainSameConnection property is supposed to keep the same connection throughout the package, but it can cause some unintended consequences. I’ve posted the sample to my Skydrive here.

Using the Data Profiling Task To Profile All the Tables in a Database

SQL Server 2008 introduces a new task that allows you to profile the data in a table. Jamie Thomson has a great series of posts about the different profiling options available on the task. If you are not already familiar with the task, I highly recommend starting there, as I won’t be covering that information here. Instead, I’ll be focusing on a way to make the task a bit more dynamic.


Before I go any further, I need to mention that this article is based on CTP6, and the technique I’m describing may need to be updated to work in future versions. However, I’ll be talking with the SSIS team in April about improving the task so that this approach may be a bit easier.


I was pretty excited when the task appeared in CTP5, but I was a bit disappointed by the actual experience of using it. My biggest disappointment was that you have to select individual tables to profile. I was really hoping to be able to point it to a database, select all the tables I wanted to profile, and let it run. Unfortunately, there did not appear to be a way to even set the table through an expression. There is a ProfileRequests property, but it is a collection, so you can’t set expressions on it.


After some poking and prodding, though, I found the ProfileInputXml property. This property is not visible in the Properties window in BIDS, but you can locate it in the expressions dialog for the task. It holds a complete XML-based description of the profile requests for the task. Since it is available to set through an expression, it opens up the possibility of making the task dynamic.


The first thing I had to do was to get a sample of the XML. You can get a copy of the XML by viewing the XML code for the package, locating the task, and grabbing the contents of the ProfileInputXml tags. Since this is XML contained within XML, all the typical tags have been converted to their escaped versions (< becomes &lt;, etc.). The other approach is to use the SSIS object model to extract the value directly from the property. Once I had the XML, it became a matter of replacing the appropriate bits of it to make it dynamic. I ended up with the following:



<?xml version=\"1.0\" encoding=\"utf-16\"?>
<DataProfile xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"
xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\"
xmlns=\"http://schemas.microsoft.com/sqlserver/2008/DataDebugger/\">
  <DataSources />
  <DataProfileInput>
    <ProfileMode>Exact</ProfileMode>
    <Timeout>0</Timeout>
    <Requests>
      <ColumnNullRatioProfileRequest ID=\"NullRatioReq\">
        <DataSourceID>DatabaseConn</DataSourceID>
        <Table Schema=\"" + @[User::SchemaName] + "\" Table=\"" + @[User::TableName] + "\" />
        <Column IsWildCard=\"true\" />
      </ColumnNullRatioProfileRequest>
      <ColumnStatisticsProfileRequest ID=\"StatisticsReq\">
        <DataSourceID>DatabaseConn</DataSourceID>
        <Table Schema=\"" + @[User::SchemaName] + "\" Table=\"" + @[User::TableName] + "\" />
        <Column IsWildCard=\"true\" />
      </ColumnStatisticsProfileRequest>
      <ColumnLengthDistributionProfileRequest ID=\"LengthDistReq\">
        <DataSourceID>DatabaseConn</DataSourceID>
        <Table Schema=\"" + @[User::SchemaName] + "\" Table=\"" + @[User::TableName] + "\" />
        <Column IsWildCard=\"true\" />
        <IgnoreLeadingSpace>false</IgnoreLeadingSpace>
        <IgnoreTrailingSpace>true</IgnoreTrailingSpace>
      </ColumnLengthDistributionProfileRequest>
      <ColumnValueDistributionProfileRequest ID=\"ValueDistReq\">
        <DataSourceID>DatabaseConn</DataSourceID>
        <Table Schema=\"" + @[User::SchemaName] + "\" Table=\"" + @[User::TableName] + "\" />
        <Column IsWildCard=\"true\" />
        <Option>FrequentValues</Option>
        <FrequentValueThreshold>0.001</FrequentValueThreshold>
      </ColumnValueDistributionProfileRequest>
      <ColumnPatternProfileRequest ID=\"PatternReq\">
        <DataSourceID>DatabaseConn</DataSourceID>
        <Table Schema=\"" + @[User::SchemaName] + "\" Table=\"" + @[User::TableName] + "\" />
        <Column IsWildCard=\"true\" />
        <MaxNumberOfPatterns>10</MaxNumberOfPatterns>
        <PercentageDataCoverageDesired>95</PercentageDataCoverageDesired>
        <CaseSensitive>false</CaseSensitive>
        <Delimiters> \\t\\r\\n</Delimiters>
        <Symbols>,.;:-\"'`~=&amp;/\\\\@!?()&lt;&gt;[]{}|#*^%</Symbols>
        <TagTableName />
      </ColumnPatternProfileRequest>
      <CandidateKeyProfileRequest ID=\"KeyReq\">
        <DataSourceID>DatabaseConn</DataSourceID>
        <Table Schema=\"" + @[User::SchemaName] + "\" Table=\"" + @[User::TableName] + "\" />
        <KeyColumns>
          <Column IsWildCard=\"true\" />
        </KeyColumns>
        <ThresholdSetting>Specified</ThresholdSetting>
        <KeyStrengthThreshold>0.95</KeyStrengthThreshold>
        <VerifyOutputInFastMode>false</VerifyOutputInFastMode>
        <MaxNumberOfViolations>100</MaxNumberOfViolations>
      </CandidateKeyProfileRequest>
      <FunctionalDependencyProfileRequest ID=\"FDReq\">
        <DataSourceID>DatabaseConn</DataSourceID>
        <Table Schema=\"" + @[User::SchemaName] + "\" Table=\"" + @[User::TableName] + "\" />
        <DeterminantColumns>
          <Column IsWildCard=\"true\" />
        </DeterminantColumns>
        <DependentColumn IsWildCard=\"true\" />
        <ThresholdSetting>Specified</ThresholdSetting>
        <FDStrengthThreshold>0.95</FDStrengthThreshold>
        <VerifyOutputInFastMode>false</VerifyOutputInFastMode>
        <MaxNumberOfViolations>100</MaxNumberOfViolations>
      </FunctionalDependencyProfileRequest>
    </Requests>
  </DataProfileInput>
  <DataProfileOutput>
    <Profiles />
  </DataProfileOutput>
</DataProfile>

Note that this string has already had the quotes and special characters escaped (using the \ symbol) and two variables embedded in it. The User::SchemaName variable holds the schema name, and the User::TableName variable holds the table name. This XML is set through an expression on the ProfileInputXml property.


I created a simple package to iterate all the tables in a database, and run this task for each one.


[image: package control flow]


The Execute SQL Task (Get List of Tables) retrieves the list of tables and their schemas as a recordset, and stores it in an Object variable. The For Each Loop (Loop through tables) iterates through the recordset, updating the table and schema variables with the current values. The Data Profiling Task is set to save each profile to a file in the C:\Temp folder.
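
A query along these lines works for the Get List of Tables step (a sketch, not necessarily the exact SQL in the sample); the two columns feed the schema and table name variables inside the loop:

SELECT s.name AS SchemaName, t.name AS TableName
FROM sys.tables AS t
INNER JOIN sys.schemas AS s ON t.schema_id = s.schema_id
ORDER BY s.name, t.name;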


The XML above contains all of the request types for profiling, including the ColumnPatternProfileRequest. This one runs fairly slowly, so you may want to remove it before testing this. To take out a particular request, just delete that section of the XML, from the opening tag of the request to the closing tag (<ColumnPatternProfileRequest> to </ColumnPatternProfileRequest>). Also, I would recommend that you not run this against a production database without clearing it with someone first, as it generates a lot of database activity.


This gives you a way to easily profile an entire database, or even all the databases on a server, with a little enhancement. It also makes it more practical to run this against a database that you are not familiar with, in order to get a general feel for the data. As a consultant, I find that very valuable. Being able to do this goes a long way toward reducing my concerns about the usability of this task. I’m hoping that the user interface will be improved by RTM, but I also want to make sure it continues to support setting the requests dynamically.


I’ve posted the sample to my Skydrive here. Let me know if you run into any problems with it.

The November CTP for SQL Server 2008 Is Out!

The download link is here: http://www.microsoft.com/downloads/details.aspx?FamilyId=3BF4C5CA-B905-4EBC-8901-1D4C1D1DA884&displaylang=en

Evidently there have been some issues getting it on the Connect site, so they are hosting it in Microsoft’s Download Center for now.

Changes Coming for SSIS 2008

At the PASS Summit this week, I heard a few interesting bits about SSIS 2008 that should be in the next CTP.

One, ADO.NET will be fully supported, with an ADO.NET Data Source (renamed from the Data Reader Data Source) and an ADO.NET Destination. Since ADO.NET has an ODBC provider, we should finally have the ability to use an ODBC database as a destination. And they will both have custom UIs, so no more messing around in the Advanced Editor.

Two, there’s a new data profiling task. This does a really nice job of processing a table and doing all the standard data profiling activities. It’s based on something out of Microsoft Research, so it has some pretty cool capabilities, like a pattern recognition function that will spit out regular expressions that match the contents of the column.

Three, since the script will be run through VSTA instead of VSA, it will be pretty easy to create a web service transform. Just create a script component, reference the web service, and all the proxy code will be created for you. (It’s not news that VSTA is replacing VSA, but I hadn’t thought about how that would impact web services until this).

Four, data flow threading will be much better. Previously, a single execution tree was always single-threaded. That’s why in 2005, if you have a long chain of synchronous transforms, you may get better performance by introducing a Union All transform into the sequence. It breaks up the execution tree and allows the engine to run multiple threads. In 2008, the data flow engine will be able to introduce new threads itself.

Five, there will be a Watson-style dump tool available. It will be invoked automatically on a crash, or you can invoke it on demand. It will dump the current state of the package out to a text file, including variable values, data flow progress, etc.

And finally, lookups are going to be drastically enhanced. They can be sourced from ADO.NET or a flat file. We’ll have the ability to cache the lookups and reuse them across data flows in the same package. We’ll also be able to persist the cache between runs of the package. And, interestingly enough, the persisted file will look a lot like a RAW file. There should be some interesting possibilities in that :) . There will also be a “missed rows” output, instead of having to use the error output to capture records that didn’t have a match.

SQL Server 2008 – Using Merge From SSIS

In my last post, I talked about using the new MERGE statement in SQL Server 2008. It allows you to merge the data in a result set with a table, based on a join condition that you specify. This is a common scenario in data warehousing, where you often need to take a working table and compare it to a target table to determine whether a row should be inserted, updated, or deleted. This is often the case for dimension tables. Currently, there is no SSIS component in the June 2007 CTP that leverages the MERGE statement directly. This post will explore one possible pattern for using MERGE from SSIS.

In SSIS 2005, a commonly used pattern for updating dimension tables was to retrieve all of the rows intended for the dimension table in a data flow via a source component. Then the rows would be run through a lookup component to determine if they existed in the target table. Rows to be updated (those found in the lookup) would be run through an OLE DB Command component. Rows to be inserted would be sent to an OLE DB Destination.

[image: data flow using a Lookup, OLE DB Command, and OLE DB Destination]

In cases where performance is a concern, the OLE DB Command could be replaced with another OLE DB Destination that writes all rows to be updated to a temporary table. In the control flow, an Execute SQL task would issue an UPDATE statement, using the temporary table as a source.
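
In that pattern, the UPDATE issued by the Execute SQL task is just a join against the temporary table, something along these lines (the table and column names are illustrative, not from the sample):

UPDATE d
SET    d.AlternateValue = s.AlternateValue,
       d.LastUpdate     = s.LastUpdate
FROM   dbo.DimStatus AS d
INNER JOIN dbo.StatusStaging AS s
        ON d.ValidValue = s.ValidValue;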

[image: data flow writing updates to a temporary table]

To implement this using a MERGE statement, we can modify the second pattern. All rows that are read from the source should be written to a temporary table. Then, the Execute SQL task will be used to run a MERGE statement that uses the temporary table as the source for the merge. The matching of the rows will be done through the join condition in the MERGE statement.
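
Here is a sketch of what that MERGE might look like, again with illustrative table and column names rather than anything from the sample:

MERGE dbo.DimStatus AS target
USING dbo.StatusStaging AS source
    ON target.ValidValue = source.ValidValue
WHEN MATCHED THEN
    UPDATE SET target.AlternateValue = source.AlternateValue,
               target.LastUpdate     = source.LastUpdate
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ValidValue, AlternateValue, LastUpdate)
    VALUES (source.ValidValue, source.AlternateValue, source.LastUpdate);
-- Add WHEN NOT MATCHED BY SOURCE THEN DELETE if rows missing from the
-- working table should also be removed from the target.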

[image: data flow writing all rows to a temporary table for the MERGE]

Why would using the MERGE statement be better than using the lookup pattern described above? For one thing, the data flow is simpler, with no branches in execution. Also, the data flow only has to write to a single table, instead of two tables. I haven’t done any performance testing on the MERGE statement yet, so I can’t say for certain whether it is faster.

I’m still hoping for a destination component that can be used directly in the data flow to perform the MERGE, but the SSIS team has indicated that it probably won’t happen.