Posts tagged ‘Dimension’

I’m balanced, except when I’m not

Jamie posted a question on ragged hierarchies to his blog today: Are you balanced or unbalanced? I started thinking about it this evening, and decided to post my thoughts here. I think that, for the most part, business requirements should drive the decision between a balanced and unbalanced hierarchy. But there is a gray area. You can often force an unbalanced hierarchy into a balanced one by filling in the missing levels between the top and bottom. This is typically implemented by either leaving the intervening levels blank, or copying the parent value down through each intervening level.

For example, say I have the following unbalanced hierarchy, where the DC branch rolls up directly to the North region, and the California (CA) territory has no branches:

Corporate  Region  Territory  Branch
HQ         North   NY         NYC
           North              DC
           North   MA         Boston
           South   NC         Charlotte
           West    CA

I can implement a balanced hierarchy by filling in the blanks, like this:

Corporate  Region  Territory  Branch
HQ         North   NY         NYC
           North   North      DC
           North   MA         Boston
           South   NC         Charlotte
           West    CA         CA
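The "copy the parent value down" approach can be sketched in a few lines of code. This is an illustrative example only, assuming rows arrive as (Region, Territory, Branch) tuples with missing levels as None; the function name and data are made up to mirror the tables above:

```python
def balance(rows):
    """Fill missing intermediate levels by copying the parent value down."""
    balanced = []
    for region, territory, branch in rows:
        territory = territory or region   # missing territory -> copy region down
        branch = branch or territory      # missing branch -> copy territory down
        balanced.append((region, territory, branch))
    return balanced

# The unbalanced hierarchy from the first table above
unbalanced = [
    ("North", "NY", "NYC"),
    ("North", None, "DC"),        # DC rolls up directly to North
    ("North", "MA", "Boston"),
    ("South", "NC", "Charlotte"),
    ("West", "CA", None),         # CA has no branches
]

for row in balance(unbalanced):
    print(row)
```

In an ETL setting the same effect is usually achieved with COALESCE (or ISNULL) in the view or query that feeds the dimension.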

Sometimes this is perfectly acceptable, but it may not accurately represent the business information. That is a call that has to be made on a case by case basis. However, there are some reasons that balancing the hierarchy may be desirable:

  • Parent-child hierarchies, when used in Analysis Services, only have aggregates created for the key attribute and the All level (or top attribute, if you disable the All level). That has a significant performance impact on large dimensions.
  • If you are using Analysis Services, you can still simulate an unbalanced hierarchy by using the HideMemberIf property, which lets you hide a member if it is blank or contains the same value as its parent.
  • Parent-child table relationships, while being simple to set up at the relational database level, are difficult to query. If you are developing a data mart that end users will be creating queries against, parent-child relationships are likely to cause problems.

In general, I prefer balanced hierarchies, but mostly for technical reasons. There are certainly business cases where they just don’t apply.

Checking for Overlapping Rows in a Slowly Changing Dimension

A question was posted on the MSDN forums about validating that there are no overlapping rows in a slowly changing dimension (SCD). It is common to use a start and end date in a type 2 SCD to designate when each version of the dimension member was in effect. However, it can create problems if the date ranges for one version of the row overlap with other versions of the row.


A simple way to validate that there is no overlap in an existing dimension table is to run the following query:


SELECT
    *
FROM
    DimTable A
    -- The WHERE clause filters on B's columns, so an INNER JOIN is the
    -- correct (and clearer) join here
    INNER JOIN DimTable B
        ON (A.BusinessKey = B.BusinessKey AND A.DimID <> B.DimID)
WHERE
    (A.BeginDate BETWEEN B.BeginDate AND B.EndDate
     OR A.EndDate BETWEEN B.BeginDate AND B.EndDate)
ORDER BY
    A.DimID


It simply joins the table to itself to compare the date range of a given row with other rows that have the same business key. It assumes that the dimension table contains a surrogate key, a business key, and a begin and end date.
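To see the self-join in action without standing up a SQL Server instance, here's a runnable sketch using an in-memory SQLite database. The table and column names follow the article; the three sample rows are made up, with one deliberately overlapping pair:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE DimTable (
        DimID INTEGER PRIMARY KEY,
        BusinessKey INTEGER,
        BeginDate TEXT,
        EndDate TEXT)""")
conn.executemany(
    "INSERT INTO DimTable VALUES (?, ?, ?, ?)",
    [(1, 100, "2007-01-01", "2007-06-30"),
     (2, 100, "2007-06-15", "2007-12-31"),   # overlaps row 1 by two weeks
     (3, 200, "2007-01-01", "2007-12-31")])  # different business key: no conflict

overlaps = conn.execute("""
    SELECT A.DimID, B.DimID
    FROM DimTable A
    INNER JOIN DimTable B
        ON A.BusinessKey = B.BusinessKey AND A.DimID <> B.DimID
    WHERE A.BeginDate BETWEEN B.BeginDate AND B.EndDate
       OR A.EndDate BETWEEN B.BeginDate AND B.EndDate
    ORDER BY A.DimID""").fetchall()
print(overlaps)   # each overlapping pair appears twice, once per direction
```

Note that each conflicting pair shows up twice, once from each row's perspective; that symmetry is also what lets the query catch the case where one row's range fully contains another's.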


That’s great if you are validating the table after the fact, but what if you want to check a new row before you insert it? In some cases, especially if you are dealing with dimension rows that arrive out of sequence, you may need to validate that the effective range for a new row doesn’t overlap with existing rows. If you want to do this in a data flow, the method above doesn’t work, nor does it lend itself to a simple Lookup. You’d need to perform a BETWEEN in the Lookup, which means that caching needs to be disabled. On large lookup tables, this can result in poor performance.


What is really needed is a way to perform an equi-join between the business key and each of the effective dates, so that we can use the caching mode of the Lookup. Fortunately we can do this in SQL Server 2005 using a recursive common table expression (CTE). Using the CTE, we can generate a row for each “covered” date, that is, each day between the start and end effective dates. Once we have that, we can perform one lookup to match on the business key and begin date, and a second lookup to match on the business key and end date. If either of the lookups hits a match, then the date range for the new row overlaps an existing row.


This is the SQL that generates the reference data for the lookups:


WITH DIM_CTE (
    DimID
    ,BusinessKey
    ,CoveredDate
    ,BeginDate
    ,EndDate
    ,Days) AS
(
    SELECT
        dim.DimID
        ,dim.BusinessKey
        ,dim.BeginDate AS CoveredDate
        ,dim.BeginDate
        ,dim.EndDate
        ,1 AS Days
    FROM
        DimTable dim

    UNION ALL

    SELECT
        dim.DimID
        ,dim.BusinessKey
        ,DATEADD(day, DIM_CTE.Days, dim.BeginDate) AS CoveredDate
        ,dim.BeginDate
        ,dim.EndDate
        ,DIM_CTE.Days + 1 AS Days
    FROM
        DIM_CTE INNER JOIN DimTable dim
            ON (DIM_CTE.DimID = dim.DimID)
    WHERE
        DIM_CTE.Days <= DATEDIFF(day, dim.BeginDate, dim.EndDate)
)
SELECT
    A.DimID, A.CoveredDate, A.BusinessKey
FROM
    DIM_CTE A
OPTION (MAXRECURSION 400)
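If you want to experiment with the covered-date expansion outside SQL Server, here's a runnable sketch using SQLite, which supports recursive CTEs with slightly different syntax (an explicit WITH RECURSIVE keyword, date() in place of DATEADD, and no MAXRECURSION hint). The single sample row covering four days is made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE DimTable (
        DimID INTEGER, BusinessKey INTEGER, BeginDate TEXT, EndDate TEXT)""")
conn.execute("INSERT INTO DimTable VALUES (1, 100, '2007-01-01', '2007-01-04')")

# Generate one row per "covered" date between BeginDate and EndDate
rows = conn.execute("""
    WITH RECURSIVE DIM_CTE (DimID, BusinessKey, CoveredDate, EndDate) AS (
        SELECT DimID, BusinessKey, BeginDate, EndDate FROM DimTable
        UNION ALL
        SELECT DimID, BusinessKey, date(CoveredDate, '+1 day'), EndDate
        FROM DIM_CTE
        WHERE CoveredDate < EndDate
    )
    SELECT DimID, CoveredDate FROM DIM_CTE ORDER BY CoveredDate""").fetchall()
print(rows)
```

The one four-day row expands to four (DimID, CoveredDate) rows, which is exactly the equi-joinable shape the cached Lookups need.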


An important thing to note about the CTE is the use of OPTION (MAXRECURSION 400). This is a query hint that tells SQL Server how many levels of recursion to allow. Since the CTE recurses once for each day between the effective start date and end date for the rows in the dimension table, you need to make sure that MAXRECURSION is set at least as high as the maximum number of days between any row’s start and end dates. You can use the following SQL to determine what value to use:

SELECT
    MAX(DATEDIFF(day, dim.BeginDate, dim.EndDate))
FROM
    DimTable dim

The Lookups in the SSIS package are both set to ignore failures. A conditional split then flags a new row as overlapping when either Lookup found a match, by checking whether the Lookups populated the ID columns with this expression: !(ISNULL(BeginDateID) && ISNULL(EndDateID)).
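The Lookup-and-split logic can be emulated in plain Python to make the flow concrete. This is a hypothetical sketch, not SSIS code: the covered dict stands in for the cached Lookup reference built from the CTE, and all names and data are illustrative:

```python
# (BusinessKey, CoveredDate) -> DimID, standing in for the cached Lookup table
covered = {
    (100, "2007-01-01"): 1,
    (100, "2007-01-02"): 1,
    (100, "2007-01-03"): 1,
}

def overlaps_existing(business_key, begin_date, end_date):
    # Lookup 1: match on business key + begin date (None = lookup "failed")
    begin_id = covered.get((business_key, begin_date))
    # Lookup 2: match on business key + end date
    end_id = covered.get((business_key, end_date))
    # Conditional split: overlap if either lookup found a match,
    # i.e. !(ISNULL(BeginDateID) && ISNULL(EndDateID))
    return not (begin_id is None and end_id is None)

print(overlaps_existing(100, "2007-01-03", "2007-01-10"))  # True: begin date is covered
print(overlaps_existing(100, "2007-01-05", "2007-01-10"))  # False: neither date is covered
```

Because both lookups are exact-match equi-joins on precomputed keys, full caching stays enabled, which is the whole point of the covered-date expansion.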


A caveat about this technique: the CTE can generate a lot of rows, which can impact performance. Depending on your scenario, it may make more sense to insert the rows, then perform a cleanup routine afterward, or to validate the date ranges in a batch operation before starting the data flow.

If you have any ideas for optimizing or improving this, please leave a comment.