Posts tagged ‘SQL Server’

PASS Summit 2008

Last post on presentations for a while, I think / hope.

I’ll be presenting on Unit Testing in SSIS at the PASS Summit 2008, occurring in Seattle, WA from Nov. 18-21. There’s lots of great content planned at PASS this year, and it’s definitely worth attending. If you are there, drop by the session and say hello.

SQL Saturday in Greenville, SC

There’s a SQL Saturday coming up in Greenville, SC on October 11th. It’s a free, day long event with a number of great speakers doing sessions. I don’t include myself in the great speaker category, but I was asked to present a couple of sessions on SSIS, including one on data profiling in the ETL process, one of my favorite topics.

Again, it’s free and there’s some great content, so if you are in the area or close by, you should check it out.

Testing the SQL Database at TechED

While at TechEd this summer, I participated in a panel discussion on testing the database and related technologies. The panel also included Gert Drapers and Jamie Laflen of the Visual Studio Team System – Database Edition team, David Reed (who manages the SQL Server community samples on CodePlex), and Adam Machanic as the moderator. The discussion covered some of the pros and cons of testing data, and some of the common issues encountered. It was also a chance to talk about testing related areas (like SSIS packages), of which I am a big fan. If you are interested in seeing it, it was just put online.

If you are interested in unit testing SSIS packages, keep an eye on www.codeplex.com/ssisunit. I should have a new release going up soon, which adds some much needed features, and a much more extensible framework for adding new functionality.

New Tool to Help With Partitioned Tables

Stuart Ozer on the SQL Server Customer Advisory Team (CAT) has posted a new tool to CodePlex to help with the management of "staging" tables for swapping data in and out of partitioned tables. Since this is a common scenario in data warehousing scenarios, I’m really looking forward to trying this out on my next project. If you are interested in it, you can see a description here, and download the code here.

Checking for Overlapping Rows in a Slowly Changing Dimension

A question was posted on the MSDN forums about validating that there are no overlapping rows in a slowly changing dimension (SCD). It is common to use a start and end date in a type 2 SCD to designate when each version of the dimension member was in effect. However, it can create problems if the date ranges for one version of the row overlap with other versions of the row.


A simple way to validate that there is no overlap in an existing dimension table is to run the following query:


SELECT
*
FROM
DimTable A
LEFT JOIN DimTable B ON (A.BusinessKey = B.BusinessKey AND A.DimID <> B.DimID)
WHERE
(A.BeginDate BETWEEN B.BeginDate AND B.EndDate
OR A.EndDate BETWEEN B.BeginDate AND B.EndDate)
ORDER BY
A.DimID


It simply joins the table to itself to compare the date range of a given row with other rows that have the same business key. It assumes that the dimension table contains a surrogate key, a business key, and a begin and end date.


That’s great if you are validating the table after the fact, but what if you want check a new row before you insert it? In some cases, especially if you are dealing with dimension rows that arrive out of sequence, you may need to validate that the effective range for a new row doesn’t overlap with existing rows. If you want to do this in a data flow, the method above doesn’t work, nor does it lend itself to a simple Lookup. You’d need to perform a BETWEEN in the Lookup, which means that caching needs to be disabled. On large lookup tables, this can result in poor performance.


What is really needed is a way to perform an equi-join between the business key and each of the effective dates, so that we can use the caching mode of the Lookup. Fortunately we can do this in SQL Server 2005 using a recursive common table expression (CTE). Using the CTE, we can generate a row for each “covered” date, that is, each day between the start and end effective dates. Once we have that, we can perform one lookup to match on the business key and begin date, and a second lookup to match on the business key and end date. If either of the lookups hits a match, then the date range for the new row overlaps an existing row.


This is the SQL for the lookups:


WITH DIM_CTE (
    DimID
    ,BusinessKey
    ,CoveredDate
    ,BeginDate
    ,EndDate
    ,Days) AS
(
    SELECT
        dim.DimID
        ,dim.BusinessKey
        ,dim.BeginDate AS CoveredDate
        ,dim.BeginDate
        ,dim.EndDate
        ,1 AS Days
    FROM
        DimTable dim


    UNION ALL

    SELECT
        dim.DimID
        ,dim.BusinessKey
        ,DATEADD(day, DIM_CTE.Days, dim.BeginDate) AS CoveredDate
        ,dim.BeginDate
        ,dim.EndDate
        ,DIM_CTE.Days + 1 AS Days
    FROM
        DIM_CTE INNER JOIN DimTable dim
            ON (DIM_CTE.DimID = dim.DimID)
    WHERE
       
DIM_CTE.Days <= DATEDIFF(day, dim.BeginDate, dim.EndDate) 
)
SELECT
    A.DimID, A.CoveredDate, A.BusinessKey
FROM
    DIM_CTE A
OPTION (MAXRECURSION 400)


An important thing to note about the CTE is the use of the OPTION (MAXRECURSION 400). This is a query hint that tells SQL Server how many levels of recursion to allow. Since the CTE recurses once for each day between the effective start date and end date for the rows in the dimension table, you need to make sure that MAXRECURSION is set to the max days between start and end dates. You can use the following SQL to determine what value to use:

SELECT
    MAX(DATEDIFF(day, dim.BeginDate, dim.EndDate))
FROM
    DimTable dim

The Lookups in the SSIS package are both set to ignore failures. A conditional split is used to determine whether both dates fall outside a valid range (by checking whether an ID column was populated by the Lookup using this expression:  ”!(ISNULL(BeginDateID) && ISNULL(EndDateID))”.


A caveat about this technique: the CTE can generate a lot of rows, which can impact performance. Depending on your scenario, it may make more sense to insert the rows, then perform a cleanup routine afterward, or to validate the date ranges in a batch operation before starting the data flow.

If you have any comments about optimizing or improving this, please leave a comment.

Open Source SQL Server Stuff (OSSSS)

I was on the CodePlex (www.codeplex.com) site today looking for the Analysis Services Stored Procedure project, and found that they have a number of SQL Server related projects available. First, all of the SQL Server samples have been uploaded to the site, which is nice if you only want to grab one or two of them.


There is also the aforementioned AS Stored Procedure project (http://www.codeplex.com/ASStoredProcedures) , which has some great examples of extending AS functionality with .NET based stored procedures.


Then there is the SQL Inserter project, which can generate INSERT statements from database tables. It’s handy for quickly moving small amounts of data, if you don’t want to use the Import / Export Wizard.


There are also a couple of test data generators (http://www.codeplex.com/datagenerator and http://www.codeplex.com/TdGen). Data Generator is designed to be used in .NET based development, and TdGen is for databases.