Archive for May 2008

See if a File is Locked

A question that comes up occasionally on the forums is how to determine if a file is being written to by another process. It can cause issues if a file is partially written when a source component starts extracting data from it.

There is a way to check this with a Script Task. The script below attempts to lock the file for writing. If the file is being written to by another process, the attempt to lock it will fail; if the file is available, it will succeed.

Public Sub Main()
    Dim vars As Variables
    Dim file As System.IO.FileStream

    Try
        ' Attempt to open the file with write access. If another process
        ' is still writing to it, this call throws an exception.
        file = System.IO.File.Open(Dts.Connections("FlatFile").ConnectionString, System.IO.FileMode.Open, System.IO.FileAccess.Write)
    Catch ex As Exception
        ' The file is locked - record that in the FileLocked variable.
        ' Note that the variable must be locked for writing, not reading.
        Dts.VariableDispenser.LockOneForWrite("FileLocked", vars)
        vars("FileLocked").Value = True
        vars.Unlock()
    Finally
        ' Release our own lock on the file.
        If Not file Is Nothing Then
            file.Close()
        End If
    End Try

    Dts.TaskResult = Dts.Results.Success
End Sub

The package needs to have a connection manager that points to the file you want to check (named "FlatFile" in the script above), and a variable ("FileLocked") that is set to True if the file is locked. The variable should be set to False by default. The variable can be used in a precedence constraint to control whether other tasks execute.
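For example, the precedence constraint leading to the extract tasks can use an expression like the one below (assuming the variable lives in the User namespace), with the evaluation operation set to Expression. If the script failed to lock the file, FileLocked is True and the downstream tasks are skipped.

    @[User::FileLocked] == False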

Presentation at the Triad SQL Server User Group

On May 15th I’ll be presenting at the Triad SQL Server User Group in High Point, NC. The presentation will focus on the new features in SQL Server 2008 for business intelligence and data warehousing. If you’re in the area, please stop in and introduce yourself.

SSIS 2008 – Incrementally Updating the Lookup Cache File

There are a number of new features in the 2008 Lookup Transform. Jamie Thomson posted a good overview of the features, and I recommend reading it prior to this article. One of the new features in SSIS 2008 is the ability to cache the data from the Lookup to a local file. This should help improve performance, as the lookup data only needs to be retrieved once, stored locally, and reused. Internally, the lookup cache file (.caw) uses a RAW file structure (.raw), which is very fast to read from and write to disk.


One of the first things that I wanted to try with the new caching features was to dynamically update the cache file. That proved to be a bit of a challenge, as Jamie pointed out in his article. One of the main reasons for this is that, while the cache file is very similar to a RAW file, the cache file stores some additional metadata – hash values for each row, and some header information. This means that while you can use a RAW Source component to read the data from a cache file, you must use the Cache Transform to actually create a cache file.


Unfortunately, the Cache Transform does not support incremental updates to the cache file. Each time you use it, it recreates the file completely. However, with a little creativity, you can minimize the amount of data you have to process. I’ve created a sample package that illustrates one approach to this.


The control flow looks like this (the numbers refer to the steps below):


[Image: package control flow]


The major components of the process are:



  1. Retrieve the last modified date from the Lookup reference table using an Execute SQL Task.

  2. Use a Script Task to determine if the cache file already exists. This leverages the System.IO.File class, as discussed in Checking for the Existence of a File (a minimal sketch of the check appears after this list). Execution can proceed to step 3, 4, or 7, based on precedence constraints with expressions.

  3. If the file does not exist, it is created through a data flow. This data flow is straightforward – it retrieves all the rows from the reference table and runs them through a Cache Transform.
      [Image: data flow that creates the cache file]

  4. If the cache file does exist and the reference table has been modified since the last run of the package, a File System Task is used to rename it. This is done so that the cache file can be recreated in the next step. It cannot be read from and written to at the same time, so there have to be two copies of it for a short time.

  5. Now the cache file is updated within a data flow task. The OLE DB Source retrieves only the modified rows from the reference table. The rows that haven’t been changed are pulled in from the renamed cache file, using the Raw File Source. The Lookup is used to determine whether an existing row from the cache file is also included in the set of modified rows from the reference table. The Lookup uses an expression to set the reference query, so that it contains the same set of rows as the OLE DB Source. If a row matches one from the OLE DB Source, it is discarded; only non-matching rows are passed through the Lookup. Finally, the rows are combined using a Union All and written back out to a new cache file (with the same name as the original).
    [Image: data flow that updates the cache file]

  6. The copied version of the cache file is deleted, now that it is no longer needed.

  7. This data flow is used to test the resulting cache file. It is not necessary in a production package.
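For reference, here’s a minimal sketch of the existence check in step 2. The variable name (CacheFileExists) and connection manager name (LookupCache) are illustrative, not necessarily the ones in the sample:

    Public Sub Main()
        Dim vars As Variables

        ' Check whether the cache file is already on disk.
        Dim fileExists As Boolean = _
            System.IO.File.Exists(Dts.Connections("LookupCache").ConnectionString)

        ' Record the result so the precedence constraint expressions can branch on it.
        Dts.VariableDispenser.LockOneForWrite("CacheFileExists", vars)
        vars("CacheFileExists").Value = fileExists
        vars.Unlock()

        Dts.TaskResult = Dts.Results.Success
    End Sub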

The Lookup table for this example looks like:


CREATE TABLE [dbo].[StatusLookup](
    [ValidValue] [varchar](50) NULL,
    [AlternateValue] [varchar](50) NULL,
    [LastUpdate] [datetime] NULL
) ON [PRIMARY]
GO
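For step 1, the Execute SQL Task only needs the most recent modification date from this table. A query along these lines, with the result mapped to a package variable, would do it:

    SELECT MAX([LastUpdate]) AS LastUpdate
    FROM [dbo].[StatusLookup]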

This package uses a number of variables:


[Image: package variables]


and several expressions (some of which are on the variables above):


[Image: package expressions]


These are used to make the package a bit more dynamic. The LastRunDate variable controls when the package thinks it was last executed. In a production application, this would need to be set from a parent package or an Execute SQL Task. The LookupTable variable controls the name of the cache files.
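As an illustration of the kind of expression involved (CacheFileDir is a hypothetical variable; the sample may organize this differently), the connection string for the cache file’s connection manager could be built like this:

    @[User::CacheFileDir] + "\\" + @[User::LookupTable] + ".caw"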


I encourage you to download the sample and walk through it to get an in-depth understanding; I’ve really only covered the high points in this article. It’s posted on my SkyDrive.


This was done against CTP6 of SQL Server 2008. The RTM release may necessitate some changes to the pattern and/or the code described here.

Setting the Lookup Query With an Expression

In SSIS 2005, one of the challenges with the Lookup transform was that the SQL used to populate the cache was fixed at design time. You could dynamically pass parameters to the SQL, but only if you turned on the advanced caching options, which put the lookup into row-by-row operation. With smaller tables (<100,000 rows) this wasn’t too much of an issue, but with larger tables it could be a problem. A common scenario where this functionality would be nice is with Type 2 slowly changing dimensions (SCD). If you are extracting a single day of source records (doing a daily incremental load from the source system), then the lookup cache only needs to contain records that were in effect on the date of the extract. In SSIS 2005, there was no easy way to accomplish this, but the functionality has been added in 2008.


To illustrate this, I’ve created a simple package with a single data flow that uses AdventureWorks and AdventureWorksDW. The OLE DB Source extracts Sales Order Detail records for a single day from AdventureWorks, based on the RunDate variable.
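The source query is along these lines (a sketch; the exact column list in the sample may differ). The ? parameter is mapped to the RunDate variable in the OLE DB Source:

    SELECT sod.SalesOrderID, sod.ProductID, sod.OrderQty, sod.LineTotal
    FROM Sales.SalesOrderDetail sod
    INNER JOIN Sales.SalesOrderHeader soh
        ON soh.SalesOrderID = sod.SalesOrderID
    WHERE soh.OrderDate = ?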


[Image: the data flow, with the OLE DB Source and Product Lookup]


[Image: the OLE DB Source configuration]


The Product Lookup is getting its rows from the DimProduct table in AdventureWorksDW. This is a Type 2 SCD table, with effective dates. The Lookup is set up normally. Initially, I had to specify a default query that returned the columns needed from DimProduct, so that the metadata for the data flow was set up properly.
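That default query was just a straight SELECT against the dimension; the column list here is illustrative:

    SELECT ProductKey, ProductAlternateKey, StartDate, EndDate
    FROM dbo.DimProduct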


To add an expression to the Lookup, I went back to the Control Flow view, and selected the Data Flow Task. Now, in the Properties window, you can see two properties associated with the Lookup:


[Image: Data Flow Task properties showing the Lookup’s SqlCommand and SqlCommandParam]


SqlCommand is used to override the inner SQL statement used by the Lookup, while SqlCommandParam can be used to modify the entire statement. Normally, SqlCommand is enough for most purposes. I have an expression set on that property that builds a SELECT statement dynamically, including a WHERE clause that filters by the RunDate variable.


[Image: property expression on the Lookup’s SqlCommand]
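The expression evaluates to the SELECT statement as a single string, with the RunDate variable cast to a string and concatenated into the WHERE clause. A sketch of what it might look like (the column names and date logic may differ from the sample):

    "SELECT ProductKey, ProductAlternateKey, StartDate, EndDate FROM dbo.DimProduct WHERE StartDate <= '" + (DT_WSTR, 30) @[User::RunDate] + "' AND (EndDate IS NULL OR EndDate > '" + (DT_WSTR, 30) @[User::RunDate] + "')"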


The primary advantage to doing this is a reduction in the number of rows in the lookup cache. For the AdventureWorks sample, filtering the set of dimension rows down to just the ones in effect as of the Sales Order date reduces the cached rows from 606 to 339, roughly a 44% reduction. Not a significant amount in this example, but the same percentage reduction on a million-row dimension table would keep over 400,000 rows out of the cache.


I’ve uploaded the sample package to my SkyDrive.