Posts tagged ‘Data Profiling’

Using the Data Profiling Task To Profile All the Tables in a Database

SQL Server 2008 introduces a new task that allows you to profile the data in a table. Jamie Thomson has a great series of posts about the different profiling options available on the task. If you are not already familiar with the task, I highly recommend starting there, as I won’t be covering that information here. Instead, I’ll be focusing on a way to make the task a bit more dynamic.


Before I go any further, I need to mention that this article is based on CTP6, and the technique I’m describing may need to be updated to work in future versions. However, I’ll be talking with the SSIS team in April about improving the task so that this approach may be a bit easier.


I was pretty excited when the task appeared in CTP5, but I was a bit disappointed by the actual experience of using it. My biggest disappointment was that you to select individual tables to profile. I was really hoping to be able to point it to a database, select all the tables I wanted to profile, and let it run. Unfortunately, there did not appear to be a way to even set the table through an expression. There is a ProfileRequests property, but it is a collection, so you can’t set expressions on it.


After some poking and prodding, though, I found the ProfileInputXml property. This property is not visible in the Properties window in BIDS, but you can locate it in the expressions dialog for the task. This property holds a complete XML based description of the profile requests for the package. Since it is available to set through an expression, it opens up the possibility to make the task dynamic.


The first thing I had to do was to get a sample of the XML. You can get a copy of the XML by viewing the XML code for the package, locating the task, and grabbing the contents of the ProfileInputXml tags. Since this is XML contained within XML, all the typical tags have been converted to their escaped versions (< becomes “&gt;”, etc.). The other approach is to use the SSIS object model to extract the value directly from the property. Once I had the XML, it became a matter of replacing the appropriate bits of the XML to make it dynamic. I ended up with the following:



<?xml version=\“1.0\” encoding=\“utf-16\”?>
<DataProfile xmlns:xsi=\http://www.w3.org/2001/XMLSchema-instance\ 
xmlns:xsd=\http://www.w3.org/2001/XMLSchema\
xmlns=\“http://schemas.microsoft.com/sqlserver/2008/DataDebugger/\”>
  <DataSources />
  <DataProfileInput>
    <ProfileMode>Exact</ProfileMode>
    <Timeout>0</Timeout>
    <Requests>
      <ColumnNullRatioProfileRequest ID=\“NullRatioReq\”>
        <DataSourceID>DatabaseConn</DataSourceID>
        <Table Schema=\“” + @[User::SchemaName] + “\” Table=\“” + @[User::TableName] + “\” />
        <Column IsWildCard=\“true\” />
      </ColumnNullRatioProfileRequest>
      <ColumnStatisticsProfileRequest ID=\“StatisticsReq\”>
        <DataSourceID>DatabaseConn</DataSourceID>
        <Table Schema=\“” + @[User::SchemaName] + “\” Table=\“” + @[User::TableName] + “\” />
        <Column IsWildCard=\“true\” />
      </ColumnStatisticsProfileRequest>
      <ColumnLengthDistributionProfileRequest ID=\“LengthDistReq\”>
        <DataSourceID>DatabaseConn</DataSourceID>
        <Table Schema=\“” + @[User::SchemaName] + “\” Table=\“” + @[User::TableName] + “\” />
        <Column IsWildCard=\“true\” />
        <IgnoreLeadingSpace>false</IgnoreLeadingSpace>
        <IgnoreTrailingSpace>true</IgnoreTrailingSpace>
      </ColumnLengthDistributionProfileRequest>
      <ColumnValueDistributionProfileRequest ID=\“ValueDistReq\”>
        <DataSourceID>DatabaseConn</DataSourceID>
        <Table Schema=\“” + @[User::SchemaName] + “\” Table=\“” + @[User::TableName] + “\” />
        <Column IsWildCard=\“true\” />
        <Option>FrequentValues</Option>
        <FrequentValueThreshold>0.001</FrequentValueThreshold>
      </ColumnValueDistributionProfileRequest>
      <ColumnPatternProfileRequest ID=\“PatternReq\”>
        <DataSourceID>DatabaseConn</DataSourceID>
        <Table Schema=\“” + @[User::SchemaName] + “\” Table=\“” + @[User::TableName] + “\” />
        <Column IsWildCard=\“true\” />
        <MaxNumberOfPatterns>10</MaxNumberOfPatterns>
        <PercentageDataCoverageDesired>95</PercentageDataCoverageDesired>
        <CaseSensitive>false</CaseSensitive>
        <Delimiters> \\t\\r\\n</Delimiters>
        <Symbols>,.;:-\”‘`~=&amp;/\\\\@!?()&lt;&gt;[]{}|#*^%</Symbols>
        <TagTableName />
      </ColumnPatternProfileRequest>
      <CandidateKeyProfileRequest ID=\“KeyReq\”>
        <DataSourceID>DatabaseConn</DataSourceID>
        <Table Schema=\“” + @[User::SchemaName] + “\” Table=\“” + @[User::TableName] + “\” />
        <KeyColumns>
          <Column IsWildCard=\“true\” />
        </KeyColumns>
        <ThresholdSetting>Specified</ThresholdSetting>
        <KeyStrengthThreshold>0.95</KeyStrengthThreshold>
        <VerifyOutputInFastMode>false</VerifyOutputInFastMode>
        <MaxNumberOfViolations>100</MaxNumberOfViolations>
      </CandidateKeyProfileRequest>
      <FunctionalDependencyProfileRequest ID=\“FDReq\”>
        <DataSourceID>DatabaseConn</DataSourceID>
        <Table Schema=\“” + @[User::SchemaName] + “\” Table=\“” + @[User::TableName] + “\” />
        <DeterminantColumns>
          <Column IsWildCard=\“true\” />
        </DeterminantColumns>
        <DependentColumn IsWildCard=\“true\” />
        <ThresholdSetting>Specified</ThresholdSetting>
        <FDStrengthThreshold>0.95</FDStrengthThreshold>
        <VerifyOutputInFastMode>false</VerifyOutputInFastMode>
        <MaxNumberOfViolations>100</MaxNumberOfViolations>
      </FunctionalDependencyProfileRequest>
    </Requests>
  </DataProfileInput>
  <DataProfileOutput>
    <Profiles />
  </DataProfileOutput>
</DataProfile>

Note that this string has already had the quotes and special characters escaped (using the \ symbol) and two variables embedded in it. The User::SchemaName variable holds the schema name, and the User::TableName holds the table name. This XML is set through an expression on the ProfileInputXml property.


I created a simple package to iterate all the tables in a database, and run this task for each one.


image


The Execute SQL Task (Get List of Tables) retrieves the list of tables and their schemas as a recordset, and stores it in an Object variable. The For Each Loop (Loop through tables) iterates through the recordset, updating the table and schema variables with the current values. The Data Profiling Task is set to save each profile to a file in the C:\Temp folder.


The XML above contains all of the request types for profiling, including the ColumnPatternProfileRequest. This one runs fairly slowly, so you may want to remove it before testing this. To take out a particular request, just delete that section of the XML, from the opening tag of the request to the closing tag (<ColumnPatternProfileRequest> to </ColumnPatternProfileRequest>). Also, I would recommend that you not run this against a production database without clearing it with someone first, as it generates a lot of database activity.


This gives you a way to easily profile an entire database, or even all the databases on a server, with a little enhancement. It also makes it more practical to run this against a database that you are not familiar with, in order to get a general feel for the data. Being a consultant, that is very valuable to me. Being able to do this goes a long way to reducing my concerns about the usability of this task.  I’m hoping that the user interface will be improved by RTM, but I also want to make sure it continues to support setting the requests dynamically.


I’ve posted the sample to my Skydrive here. Let me know if you run into any problems with it.