Death by SQL…an Act in Two Parts

CartoonHow common is it to run into performance issues with SQL Server? Daily?  Hourly? Maybe for you, it’s a common existence; and for that, I’m sorry. And how are some ways that you deal with performance degradation in SQL? I’d venture to say that, for most, it would involve juggling indexes, statistics, etc. on the tables in question. But what about going about this all differently?

What if we take a step back and look at the code itself? Maybe the code is the problem, not the server performance. Since running across SQLServerCentral in the early days of my BI experience, there were a few blog posts and articles which have stuck with me throughout. One such article, More RBAR and “Tuning” UPDATEs, has been of great help to me.

This article opened up my eyes to a completely different way of thinking when it comes to performance problems within SQL Server. I highly suggest reading it before continuing with the rest of my post here.

I ran into this “tuning” problem the other day when working with some Fact records that I was trying to tie to a Type 2 Dimension. I have about 37,000 school enrollment records for which I need to find the appropriate Student ID surrogate key among 273,000 different student records. It seemed pretty simple enough:

  • Link the Fact record to the Dimension record using the Student Number
  • Based upon the Fact record’s registration date column, place the record with the correct version of the Student

Act 1

There are two different ways to construct our SQL statement to get this job accomplished: either set-based or row-by-row (see RBAR above) Obviously, one is much more preferred above the other method. Take, for example, the code below (RBAR):

UPDATE [PreStage_FactSchoolEnrollment]
SET [AlternateStudentID] =
(SELECT
    (SELECT TOP 1 DS.[StudentID]
    FROM [DimStudent] DS
    WHERE DS.[EffectiveEndDate] >= [FactSchoolEnrollment].[RegistrationDate]
        AND DS.[StudentSISID] = [FactSchoolEnrollment].[Pupil_Number]
    ORDER BY DS.[EffectiveEndDate],DS.[StudentID])
FROM [FactSchoolEnrollment]
WHERE [PreStage_FactSchoolEnrollment].[ID] = [FactSchoolEnrollment].[ID])

Seems innocent enough, right? However, there is a huge performance issue with this query. Below is a screenshot of one particular piece of the actual execution plan from this query above. Remember our record counts: ~37,000 Fact records and ~273,000 Dimension records.

queryplan_indexspool

That’s right…that number circled above is over 8 BILLION rows that were created in memory!! (8,465,578,262 to be exact). This is the base problem with RBAR queries. In essence, this query, as it is currently structured, queried and stored the ENTIRE dimension (all 273,000 records) for EACH of the incoming Fact records (37,000). That is where the 8.4 Billion records are created. Notice that this update took over 48 minutes run. There isn’t an index in the world that is going to help this type of performance monster.

Act 2

Enter set-based SQL. How about we reconstruct this query as a set-based query instead? Look at the differences in the SQL below:

UPDATE PreStage_FactSchoolEnrollment
SET PreStage_FactSchoolEnrollment.AlternateStudentID = ISNULL(b.StudentID,-1)

FROM (SELECT ROW_NUMBER() OVER (PARTITION BY StudentSISID ORDER
BY EffectiveEndDate ASC) as rownum, Pupil_Number, sub_b.StudentID
FROM FactSchoolEnrollment sub_a
INNER JOIN DimStudent sub_b ON
    sub_a.Pupil_Number = sub_b.StudentSISID
WHERE (sub_b.EffectiveEndDate >= sub_a.RegistrationDate)) b
WHERE PreStage_FactSchoolEnrollment.Pupil_Number = b.Pupil_Number

This end result of this query is EXACTLY the same as the above query; the only difference is that this query took all of 9 seconds to return data. Now that’s a performance gain!

Followup

Now, understandably, it may not be feasible to rewrite your SQL code because of different constraints. But, if you can, at all, (and I’m pleading with you here), PLEASE try to rewrite the code itself. You will be surprised at how much of a difference syntax can make!

Leave a Reply