Using the Data Profiling Task To Profile All the Tables in a Database

SQL Server 2008 introduces a new task that allows you to profile the data in a table. Jamie Thomson has a great series of posts about the different profiling options available on the task. If you are not already familiar with the task, I highly recommend starting there, as I won’t be covering that information here. Instead, I’ll be focusing on a way to make the task a bit more dynamic.


Before I go any further, I need to mention that this article is based on CTP6, and the technique I’m describing may need to be updated to work in future versions. However, I’ll be talking with the SSIS team in April about improving the task so that this approach may be a bit easier.


I was pretty excited when the task appeared in CTP5, but I was a bit disappointed by the actual experience of using it. My biggest disappointment was that you to select individual tables to profile. I was really hoping to be able to point it to a database, select all the tables I wanted to profile, and let it run. Unfortunately, there did not appear to be a way to even set the table through an expression. There is a ProfileRequests property, but it is a collection, so you can’t set expressions on it.


After some poking and prodding, though, I found the ProfileInputXml property. This property is not visible in the Properties window in BIDS, but you can locate it in the expressions dialog for the task. This property holds a complete XML based description of the profile requests for the package. Since it is available to set through an expression, it opens up the possibility to make the task dynamic.


The first thing I had to do was to get a sample of the XML. You can get a copy of the XML by viewing the XML code for the package, locating the task, and grabbing the contents of the ProfileInputXml tags. Since this is XML contained within XML, all the typical tags have been converted to their escaped versions (< becomes “&gt;”, etc.). The other approach is to use the SSIS object model to extract the value directly from the property. Once I had the XML, it became a matter of replacing the appropriate bits of the XML to make it dynamic. I ended up with the following:



<?xml version=\“1.0\” encoding=\“utf-16\”?>
<DataProfile xmlns:xsi=\http://www.w3.org/2001/XMLSchema-instance\ 
xmlns:xsd=\http://www.w3.org/2001/XMLSchema\
xmlns=\“http://schemas.microsoft.com/sqlserver/2008/DataDebugger/\”>
  <DataSources />
  <DataProfileInput>
    <ProfileMode>Exact</ProfileMode>
    <Timeout>0</Timeout>
    <Requests>
      <ColumnNullRatioProfileRequest ID=\“NullRatioReq\”>
        <DataSourceID>DatabaseConn</DataSourceID>
        <Table Schema=\“” + @[User::SchemaName] + “\” Table=\“” + @[User::TableName] + “\” />
        <Column IsWildCard=\“true\” />
      </ColumnNullRatioProfileRequest>
      <ColumnStatisticsProfileRequest ID=\“StatisticsReq\”>
        <DataSourceID>DatabaseConn</DataSourceID>
        <Table Schema=\“” + @[User::SchemaName] + “\” Table=\“” + @[User::TableName] + “\” />
        <Column IsWildCard=\“true\” />
      </ColumnStatisticsProfileRequest>
      <ColumnLengthDistributionProfileRequest ID=\“LengthDistReq\”>
        <DataSourceID>DatabaseConn</DataSourceID>
        <Table Schema=\“” + @[User::SchemaName] + “\” Table=\“” + @[User::TableName] + “\” />
        <Column IsWildCard=\“true\” />
        <IgnoreLeadingSpace>false</IgnoreLeadingSpace>
        <IgnoreTrailingSpace>true</IgnoreTrailingSpace>
      </ColumnLengthDistributionProfileRequest>
      <ColumnValueDistributionProfileRequest ID=\“ValueDistReq\”>
        <DataSourceID>DatabaseConn</DataSourceID>
        <Table Schema=\“” + @[User::SchemaName] + “\” Table=\“” + @[User::TableName] + “\” />
        <Column IsWildCard=\“true\” />
        <Option>FrequentValues</Option>
        <FrequentValueThreshold>0.001</FrequentValueThreshold>
      </ColumnValueDistributionProfileRequest>
      <ColumnPatternProfileRequest ID=\“PatternReq\”>
        <DataSourceID>DatabaseConn</DataSourceID>
        <Table Schema=\“” + @[User::SchemaName] + “\” Table=\“” + @[User::TableName] + “\” />
        <Column IsWildCard=\“true\” />
        <MaxNumberOfPatterns>10</MaxNumberOfPatterns>
        <PercentageDataCoverageDesired>95</PercentageDataCoverageDesired>
        <CaseSensitive>false</CaseSensitive>
        <Delimiters> \\t\\r\\n</Delimiters>
        <Symbols>,.;:-\”‘`~=&amp;/\\\\@!?()&lt;&gt;[]{}|#*^%</Symbols>
        <TagTableName />
      </ColumnPatternProfileRequest>
      <CandidateKeyProfileRequest ID=\“KeyReq\”>
        <DataSourceID>DatabaseConn</DataSourceID>
        <Table Schema=\“” + @[User::SchemaName] + “\” Table=\“” + @[User::TableName] + “\” />
        <KeyColumns>
          <Column IsWildCard=\“true\” />
        </KeyColumns>
        <ThresholdSetting>Specified</ThresholdSetting>
        <KeyStrengthThreshold>0.95</KeyStrengthThreshold>
        <VerifyOutputInFastMode>false</VerifyOutputInFastMode>
        <MaxNumberOfViolations>100</MaxNumberOfViolations>
      </CandidateKeyProfileRequest>
      <FunctionalDependencyProfileRequest ID=\“FDReq\”>
        <DataSourceID>DatabaseConn</DataSourceID>
        <Table Schema=\“” + @[User::SchemaName] + “\” Table=\“” + @[User::TableName] + “\” />
        <DeterminantColumns>
          <Column IsWildCard=\“true\” />
        </DeterminantColumns>
        <DependentColumn IsWildCard=\“true\” />
        <ThresholdSetting>Specified</ThresholdSetting>
        <FDStrengthThreshold>0.95</FDStrengthThreshold>
        <VerifyOutputInFastMode>false</VerifyOutputInFastMode>
        <MaxNumberOfViolations>100</MaxNumberOfViolations>
      </FunctionalDependencyProfileRequest>
    </Requests>
  </DataProfileInput>
  <DataProfileOutput>
    <Profiles />
  </DataProfileOutput>
</DataProfile>

Note that this string has already had the quotes and special characters escaped (using the \ symbol) and two variables embedded in it. The User::SchemaName variable holds the schema name, and the User::TableName holds the table name. This XML is set through an expression on the ProfileInputXml property.


I created a simple package to iterate all the tables in a database, and run this task for each one.


image


The Execute SQL Task (Get List of Tables) retrieves the list of tables and their schemas as a recordset, and stores it in an Object variable. The For Each Loop (Loop through tables) iterates through the recordset, updating the table and schema variables with the current values. The Data Profiling Task is set to save each profile to a file in the C:\Temp folder.


The XML above contains all of the request types for profiling, including the ColumnPatternProfileRequest. This one runs fairly slowly, so you may want to remove it before testing this. To take out a particular request, just delete that section of the XML, from the opening tag of the request to the closing tag (<ColumnPatternProfileRequest> to </ColumnPatternProfileRequest>). Also, I would recommend that you not run this against a production database without clearing it with someone first, as it generates a lot of database activity.


This gives you a way to easily profile an entire database, or even all the databases on a server, with a little enhancement. It also makes it more practical to run this against a database that you are not familiar with, in order to get a general feel for the data. Being a consultant, that is very valuable to me. Being able to do this goes a long way to reducing my concerns about the usability of this task.  I’m hoping that the user interface will be improved by RTM, but I also want to make sure it continues to support setting the requests dynamically.


I’ve posted the sample to my Skydrive here. Let me know if you run into any problems with it.

15 Comments

  1. mmasson says:

    Hiya,

    This was one of the first things I noticed when using the task for the first time. Very clever work around!

    There have been a number of usability enhancements done to the task recently (ability to add new connections from the task for example), including an option on the Quick Profile dialog which lets you select all tables in the database in one shot.

    ~Matt

  2. jwelch says:

    That’s great news, Matt. Will those be in the CTP refresh?

    Even with the ability to do that from the UI, I still like the ability to set the profile requests through an expression. I can envision a lot of scenarios where I may want to choose the table(s) to be profiled dynamically. I think that’s very useful functionality, and it would be helpful to have the same capability on many of the database maintenance tasks.

  3. http:// says:

    Yes, the changes should be in the CTP Refresh.

  4. http:// says:

    How do I stop each iteration overwriting the results of the previous data profile task? The end result of me running this package only ever contains the last table in the loop.

  5. http:// says:

    By the way this has been a handy tip for me to stumble across. I’ll be using it to provide some dynamic profiling tools to help out for a large data integration project we’re doing. This is proving to be a very handy way to provide visibility of the data to business people without them neeing to write one sql query.

  6. http:// says:

    I used a file system task to rename the file in the for each loop container to the name of the table. This ensures I have one file per table.

    The only thing left for me to do is figure out how to get all the tables per database into the one single output. Is it possible to dynamically build the ProfileInputXml based off rows within a source table?

    I’ll give this a crack anyway and see. If anybody has more appropriate suggestions i’d really appreciate your input.

  7. Ayyappan says:

    Thanks Matt, It is useful articel. keep up your good work.

  8. julie smith says:

    John, this looks very exciting. Thanks for giving access to it via your skydrive! @datachix1

  9. Steff Sullivan says:

    Hi, I’ve been trying to build my own package from scratch using yours as an example.

    My XML expression looks like:

    Net Adventure Works

    Exact
    0

    {C026647F-A104-44BA-83A4-DEFF6CE3DD3C}

    {C026647F-A104-44BA-83A4-DEFF6CE3DD3C}

    {C026647F-A104-44BA-83A4-DEFF6CE3DD3C}

    false
    true

    {C026647F-A104-44BA-83A4-DEFF6CE3DD3C}

    FrequentValues
    0.001

    I’m getting the error message:
    Error: The result of the expression “@[User::InputXML]” on property “ProfileInputXml” cannot be written to the property. The expression was evaluated, but cannot be set on the property.

    My Output despite this error looks like:

    DynamicFixedSQLServerProfiler.xml

    Net Adventure Works

    OLEDB Adventure Works

    OLEDB Repository

    Exact
    0

    Are you able to advise where I might be going wrong?

    NB. I also had issues with tables with XML in which would break the task forever. Excluding these tables from the set to operate on and recreating the task has prevented a repeat occurence.

    • Steff Sullivan says:

      Apologies my copy and paste didn’t work out all that well!

      My XML expression looks like:

      Net Adventure Works

      Exact
      0

      {C026647F-A104-44BA-83A4-DEFF6CE3DD3C}

      {C026647F-A104-44BA-83A4-DEFF6CE3DD3C}

      {C026647F-A104-44BA-83A4-DEFF6CE3DD3C}

      false
      true

      {C026647F-A104-44BA-83A4-DEFF6CE3DD3C}

      FrequentValues
      0.001

      I’m getting the error message:
      Error: The result of the expression “@[User::InputXML]” on property “ProfileInputXml” cannot be written to the property. The expression was evaluated, but cannot be set on the property.

      My Output despite this error looks like:

      DynamicFixedSQLServerProfiler.xml

      Net Adventure Works

      OLEDB Adventure Works

      OLEDB Repository

      Exact
      0

  10. jwelch says:

    Your XML is still not formatted (I think it’s a limitation in the WordPress comments). The error message indicates that the XML you are using for ProfileInputXml isn’t 100% correct, so I’d guess that the problem is somewhere in the expression – maybe a delimiter or element is being left out.

  11. Shree says:

    Hi,
    I tried this it works fine. i need one more help on this.how to build dynamic column in CandidateKeyProfileRequest.
    Regards,

  12. eshwar vempati says:

    Thanks a lot, i used your package by simple modifications to get xml configuration on schema name, table name values.

  13. Duncan Boydell says:

    This is absolute gold-dust, was just wondering how to change the line : to allow a column name variable like you’ve done with the table and schema. This would save we so much time.Thanks

  14. Kaarel says:

    Thanks! Still works great in SQL 2016.

Leave a Reply to Shree