Archive for March 2008

RetainSameConnection and Expressions

I ran across a forum post today that highlighted a limitation with connection managers that I wasn’t aware of, so I thought I’d post it here. If you have RetainSameConnection set to TRUE, and you are also trying to update the connection string through an expression, be aware the connection will not change to reflect the new connection string. Basically, the connection will retain the original value (as RetainSameConnection would indicate).


You might encounter this if you are looping through a set of databases on a server. If you wanted to perform an operation on each database that required a temp table, you would set RetainSameConnection to TRUE. Unfortunately, as soon as you do that, the connection will no longer be dynamic.


I’ve created a sample package to illustrate this. The package flow is straightforward. There is a ForEach Loop that iterates through the databases on the server. The Execute SQL Task retrieves the row count from the sys.tables system view, and stores it in a variable. The Script Task fires two information messages that show the current value of the connection string, and the row count from the variable.


image


These are the results if RetainSameConnection is set to FALSE (with the correct counts):


image


and these are the results if RetainSameConnection is set to TRUE (the counts are all the same, from the initial connection):


image


This isn’t a major issue, but it is helpful to be aware of it. The same behavior occurs under CTP6 for Katmai. It makes sense, given that the RetainSameConnection property is supposed to keep the same connection throughout the package, but it can cause some unintended consequences. I’ve posted the sample to my Skydrive here.

ssisUnit – A Unit Testing Tool for SSIS

I’ve been a bit lax posting on my blogs and the MSDN forums recently. Fortunately, I have a good reason (at least I think it’s a good one). :) My employer, Mariner, has graciously given me permission to open source a unit testing framework for SSIS packages. Preparing it for release has taken a bit more time than I expected, as I wanted to polish up a few items, and that led to a few more changes, etc. The framework, as we were using it, was definitely functional, but I wanted to make a few changes for ease of use. Now I have those changes in, and an alpha (but functional) version is available on Codeplex, under the ssisUnit project.

I’ve posted previously about unit testing for SSIS, and how I really missed the automated unit testing capability that I’d taken for granted in more traditional application development. Since the team currently has no plans for unit testing for SSIS, it seemed like a good time to get this out and available to the public. There are currently methods of testing SSIS, but most of them involve testing the package as a whole. One of our goals for unit testing SSIS, though, was to enable testing at a more granular level. ssisUnit enables testing down to the individual task level in the control flow. In future iterations, I’d like to expand the functionality to include testing individual components in the data flow.

We have additional plans for ssisUnit in the future, including Visual Studio integration, additional command capabilities, and a GUI for creating the test cases. The current version is v0.50, and I hope to have another release by mid-April. Please download it, give it a whirl, and provide feedback and suggestions for improvement. If you’re interested in contributing to the project, please leave a comment on this post, or email me at john.welch@mariner-usa.com.

Using the Data Profiling Task To Profile All the Tables in a Database

SQL Server 2008 introduces a new task that allows you to profile the data in a table. Jamie Thomson has a great series of posts about the different profiling options available on the task. If you are not already familiar with the task, I highly recommend starting there, as I won’t be covering that information here. Instead, I’ll be focusing on a way to make the task a bit more dynamic.


Before I go any further, I need to mention that this article is based on CTP6, and the technique I’m describing may need to be updated to work in future versions. However, I’ll be talking with the SSIS team in April about improving the task so that this approach may be a bit easier.


I was pretty excited when the task appeared in CTP5, but I was a bit disappointed by the actual experience of using it. My biggest disappointment was that you to select individual tables to profile. I was really hoping to be able to point it to a database, select all the tables I wanted to profile, and let it run. Unfortunately, there did not appear to be a way to even set the table through an expression. There is a ProfileRequests property, but it is a collection, so you can’t set expressions on it.


After some poking and prodding, though, I found the ProfileInputXml property. This property is not visible in the Properties window in BIDS, but you can locate it in the expressions dialog for the task. This property holds a complete XML based description of the profile requests for the package. Since it is available to set through an expression, it opens up the possibility to make the task dynamic.


The first thing I had to do was to get a sample of the XML. You can get a copy of the XML by viewing the XML code for the package, locating the task, and grabbing the contents of the ProfileInputXml tags. Since this is XML contained within XML, all the typical tags have been converted to their escaped versions (< becomes “&gt;”, etc.). The other approach is to use the SSIS object model to extract the value directly from the property. Once I had the XML, it became a matter of replacing the appropriate bits of the XML to make it dynamic. I ended up with the following:



<?xml version=\“1.0\” encoding=\“utf-16\”?>
<DataProfile xmlns:xsi=\http://www.w3.org/2001/XMLSchema-instance\ 
xmlns:xsd=\http://www.w3.org/2001/XMLSchema\
xmlns=\“http://schemas.microsoft.com/sqlserver/2008/DataDebugger/\”>
  <DataSources />
  <DataProfileInput>
    <ProfileMode>Exact</ProfileMode>
    <Timeout>0</Timeout>
    <Requests>
      <ColumnNullRatioProfileRequest ID=\“NullRatioReq\”>
        <DataSourceID>DatabaseConn</DataSourceID>
        <Table Schema=\“” + @[User::SchemaName] + “\” Table=\“” + @[User::TableName] + “\” />
        <Column IsWildCard=\“true\” />
      </ColumnNullRatioProfileRequest>
      <ColumnStatisticsProfileRequest ID=\“StatisticsReq\”>
        <DataSourceID>DatabaseConn</DataSourceID>
        <Table Schema=\“” + @[User::SchemaName] + “\” Table=\“” + @[User::TableName] + “\” />
        <Column IsWildCard=\“true\” />
      </ColumnStatisticsProfileRequest>
      <ColumnLengthDistributionProfileRequest ID=\“LengthDistReq\”>
        <DataSourceID>DatabaseConn</DataSourceID>
        <Table Schema=\“” + @[User::SchemaName] + “\” Table=\“” + @[User::TableName] + “\” />
        <Column IsWildCard=\“true\” />
        <IgnoreLeadingSpace>false</IgnoreLeadingSpace>
        <IgnoreTrailingSpace>true</IgnoreTrailingSpace>
      </ColumnLengthDistributionProfileRequest>
      <ColumnValueDistributionProfileRequest ID=\“ValueDistReq\”>
        <DataSourceID>DatabaseConn</DataSourceID>
        <Table Schema=\“” + @[User::SchemaName] + “\” Table=\“” + @[User::TableName] + “\” />
        <Column IsWildCard=\“true\” />
        <Option>FrequentValues</Option>
        <FrequentValueThreshold>0.001</FrequentValueThreshold>
      </ColumnValueDistributionProfileRequest>
      <ColumnPatternProfileRequest ID=\“PatternReq\”>
        <DataSourceID>DatabaseConn</DataSourceID>
        <Table Schema=\“” + @[User::SchemaName] + “\” Table=\“” + @[User::TableName] + “\” />
        <Column IsWildCard=\“true\” />
        <MaxNumberOfPatterns>10</MaxNumberOfPatterns>
        <PercentageDataCoverageDesired>95</PercentageDataCoverageDesired>
        <CaseSensitive>false</CaseSensitive>
        <Delimiters> \\t\\r\\n</Delimiters>
        <Symbols>,.;:-\”‘`~=&amp;/\\\\@!?()&lt;&gt;[]{}|#*^%</Symbols>
        <TagTableName />
      </ColumnPatternProfileRequest>
      <CandidateKeyProfileRequest ID=\“KeyReq\”>
        <DataSourceID>DatabaseConn</DataSourceID>
        <Table Schema=\“” + @[User::SchemaName] + “\” Table=\“” + @[User::TableName] + “\” />
        <KeyColumns>
          <Column IsWildCard=\“true\” />
        </KeyColumns>
        <ThresholdSetting>Specified</ThresholdSetting>
        <KeyStrengthThreshold>0.95</KeyStrengthThreshold>
        <VerifyOutputInFastMode>false</VerifyOutputInFastMode>
        <MaxNumberOfViolations>100</MaxNumberOfViolations>
      </CandidateKeyProfileRequest>
      <FunctionalDependencyProfileRequest ID=\“FDReq\”>
        <DataSourceID>DatabaseConn</DataSourceID>
        <Table Schema=\“” + @[User::SchemaName] + “\” Table=\“” + @[User::TableName] + “\” />
        <DeterminantColumns>
          <Column IsWildCard=\“true\” />
        </DeterminantColumns>
        <DependentColumn IsWildCard=\“true\” />
        <ThresholdSetting>Specified</ThresholdSetting>
        <FDStrengthThreshold>0.95</FDStrengthThreshold>
        <VerifyOutputInFastMode>false</VerifyOutputInFastMode>
        <MaxNumberOfViolations>100</MaxNumberOfViolations>
      </FunctionalDependencyProfileRequest>
    </Requests>
  </DataProfileInput>
  <DataProfileOutput>
    <Profiles />
  </DataProfileOutput>
</DataProfile>

Note that this string has already had the quotes and special characters escaped (using the \ symbol) and two variables embedded in it. The User::SchemaName variable holds the schema name, and the User::TableName holds the table name. This XML is set through an expression on the ProfileInputXml property.


I created a simple package to iterate all the tables in a database, and run this task for each one.


image


The Execute SQL Task (Get List of Tables) retrieves the list of tables and their schemas as a recordset, and stores it in an Object variable. The For Each Loop (Loop through tables) iterates through the recordset, updating the table and schema variables with the current values. The Data Profiling Task is set to save each profile to a file in the C:\Temp folder.


The XML above contains all of the request types for profiling, including the ColumnPatternProfileRequest. This one runs fairly slowly, so you may want to remove it before testing this. To take out a particular request, just delete that section of the XML, from the opening tag of the request to the closing tag (<ColumnPatternProfileRequest> to </ColumnPatternProfileRequest>). Also, I would recommend that you not run this against a production database without clearing it with someone first, as it generates a lot of database activity.


This gives you a way to easily profile an entire database, or even all the databases on a server, with a little enhancement. It also makes it more practical to run this against a database that you are not familiar with, in order to get a general feel for the data. Being a consultant, that is very valuable to me. Being able to do this goes a long way to reducing my concerns about the usability of this task.  I’m hoping that the user interface will be improved by RTM, but I also want to make sure it continues to support setting the requests dynamically.


I’ve posted the sample to my Skydrive here. Let me know if you run into any problems with it.