It’s been a busy couple of weeks, but I’ve finally found some time to play around with the June CTP of SQL Server 2008. One of the items that caught my interest was the MERGE statement. This is a new addition to the T-SQL language that lets you perform INSERTs, UPDATEs, and DELETEs against a table in a single statement. You specify a source table to compare to, and what conditions should result in an UPDATE vs. an INSERT.
This is a fairly common scenario in data warehousing, where you may need to either insert new rows into a fact table, or update existing rows with new information. Here’s a an example scenario to illustrate this. A store sends an hourly feed of transactions that have occurred in the last hour. You need to populate a fact table that should contain a monthly total for each customer, showing how many of each product were purchased in the month and how much was spent. The fact table needs to updated each time the hourly sales transaction comes in, so that it is always current. And you have to take into account that a customer may have already purchased an item in the same month, so you may need to update an existing row in the fact table rather than insert a new one.
Our fact table looks like this:
CREATE TABLE [dbo].[FactProductSalesSnapshot]( [DimCustomerID] [int] NOT NULL, [PurchaseDateID] [int] NOT NULL, [DimProductID] [int] NOT NULL, [Quantity] [int] NOT NULL, [SalesAmount] [money] NOT NULL )
and the merge table (the source for the MERGE statement) looks like this:
CREATE TABLE [dbo].[SalesTransactionStage]( [SalesTransactionID] [int] NOT NULL, [CustomerID] [int] NOT NULL, [DateID] [int] NOT NULL, [ProductID] [int] NOT NULL, [Quantity] [int] NOT NULL, [SalesAmount] [money] NOT NULL, [BatchID] [tinyint] NOT NULL )
The batch ID indicates which batch of hourly sales transactions this row belongs to. To merge this in, you’d execute the following statement:
MERGE FactProductSalesSnapshot AS fact USING ( SELECT CustomerID ,DateID ,ProductID ,Quantity ,SalesAmount FROM SalesTransactionStage WHERE BatchID = 4 ) AS src ON ( fact.DimProductID = src.ProductID AND fact.DimCustomerID = src.CustomerID AND fact.PurchaseDateID = src.DateID ) WHEN MATCHED THEN UPDATE SET fact.Quantity = fact.Quantity + src.Quantity ,fact.SalesAmount = fact.SalesAmount + src.SalesAmount WHEN NOT MATCHED THEN INSERT (DimCustomerID, PurchaseDateID, DimProductID, Quantity, SalesAmount) VALUES (src.CustomerID, src.DateID, src.ProductID, src.Quantity, src.SalesAmount);
To break this statement down a little bit, the USING (..) portion defines the source for the Merge to use for comparisons (the SalesTransactionStage table defined previously). The ON clause works like the ON clause in a JOIN statement, and determines how the source is matched against the target. In this case, we’re joining on the Product, Customer, and Date keys.
The WHEN MATCHED clause determines what should happen if we find a match. We are performing an UPDATE, and adding the values from the stage table to existing balance in the fact table. WHEN NOT MATCHED indicates that this particular combination of product, customer, and date does not yet exist, so the new values are inserted.
The WHEN clauses also support conditional logic, so we could add additional logic to catch returns (where the quantity is less than zero) and store that information in a separate column.
Overall, the MERGE statement seems to work very well, and I think it will be very useful if you are building warehouses on SQL Server. I’m planning on a part 2 to this to show how MERGE can be used from Integration Services. Currently, there doesn’t appear to be a direct way to leverage it, outside of an Execute SQL statement, which has some drawbacks. I am curious to see if Microsoft introduces a new destination component for the pipeline that takes advantage of the MERGE functionality.