Posts tagged ‘Expressions’

Checking Whether a Character is Numeric

In SSIS, checking whether a value is numeric has always been a bit painful. Unfortunately, the SSIS expression language doesn’t have an IsNumeric function. Jamie Thomson (blog | twitter) has a nice post here that explains one way to do this check in a script component.

However, there is a shortcut you can take if you only need to check a specific character to see if it’s numeric. You can use the following expression:

FINDSTRING("0123456789", [CharacterToCheck], 1) != 0

If this returns True, the character is numeric. False means it’s not numeric.

This works by checking for the existence of the character in the string that contains all the numeric characters. If it finds an occurrence, then the character must be numeric. Obviously, this has some limitations over a real IsNumeric function, but if you just need to check single characters, it’s a lot faster than writing a script.

Warning about Using String Variables in Derived Column Expressions

I ran across an interesting behavior in SSIS this week, and thought it was worth sharing. Occasionally, I’ll have the need to use the value of a variable (set in the control flow) in the data flow. The typical way to accomplish this is to use a Derived Column transformation to introduce the variable’s value into the data flow. However, you need to be cautious if you are doing this with a String variable.

When you are working with Derived Column expressions, the output type of the Derived Column expression is based on the input values. For example, inputting an expression like 1+2 will result in an output column data type of DT_I4 (an integer). Basically, the editor guesses the data type based on the input values. Generally, this is OK – if you are referencing other data flow columns, or variables that aren’t of type string, the lengths of the values are static. However, when the editor calculates the length of string variables, it uses the current value in the variable. So, if your variable has 3 characters, the output data type only expects three characters. If the value in the variable is static and never changes, this is fine. But if the value is updated regularly (perhaps it’s a variable in a template package) or the value is set through an expression (so that the value changes at runtime), this can cause truncation errors in the package.

I created a sample package that demonstrates this. The package is pretty simple: a single Data Flow task, with an OLE DB Source, a Derived Column transformation, and a Union All (just there to terminate the data flow).

image

There is a single variable named TestVar that contains the value “1234”.

image

In the Derived Column transformation, there is a single new column added. Note that the calculated length for the output column is 4, matching the value “1234” in the variable.

image

If the package is run right now, everything works fine. But that’s not the case if you change the value in the variable to “123456”.

image

Running the package after changing the value results in the following error:

[Derived Column [91]] Error: The "component "Derived Column" (91)" failed because truncation occurred, and the truncation row disposition on "output column "ColumnFromStringVar" (289)" specifies failure on truncation. A truncation error occurred on the specified object of the specified component.

This occurs because the metadata in the Derived Column transformation isn’t updated when the variable’s value is changed. So, to avoid seeing this error occur in your packages, you need to explicitly set the output column’s length.

In 2005, you could change the calculated data type by editing the data type, length, precision, and scale fields. In 2008, however, the values are locked. You can change the data type by going into the advanced editor for the Derived Column. However, it’s easier to simply cast the string value, to force the Derived Column editor to treat it as if it has constant length.

image

By using this approach, as long as your string variable’s value is less than 50 characters, the Derived Column will continue to work. It’s best to set the length of the cast to the same value as the destination column’s length.

There’s a Connect submission on improving this behavior, either by updating the pipeline’s metadata as the string variable’s value changes, or by throwing a validation warning or error if the current value of the variable exceeds the output length in the Derived Column transformation. If you agree that this could use some improvement, you can vote here: https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=470995

In the meantime, I’d highly recommend performing an explicit cast in any Derived Column expression that uses a string variable, particularly if the value is subject to change.

The sample package is on my SkyDrive, if you’d like to see the error in action.

Setting the Lookup Query With an Expression

In SSIS 2005, one of the challenges with the Lookup transform was that the SQL used to populate the cache was fixed at design time. You could dynamically pass in parameters to the SQL, but only if you turned on the advanced caching options, which puts the lookup into row-by-row operation. With smaller tables (<100,000 rows) this wasn’t too much of an issue, but with larger tables it could be a problem. A common scenario where this functionality would be nice is with Type 2 slowly changing dimensions (SCD). If you are extracting a single day of source records (doing a daily incremental load from the source system), then the lookup cache only needs to contain records that were in effect on the date of the extract. In SSIS 2005, there was no easy way to accomplish this. However, the functionality has been added in 2008.


To illustrate this, I’ve created a simple package with a single data flow that uses AdventureWorks and AdventureWorksDW. The OLE DB Source extracts Sales Order Detail records for a single day from AdventureWorks, based on the RunDate variable.


image


image


The Product Lookup is getting its rows from the DimProduct table in AdventureWorksDW. This is a Type 2 SCD table, with effective dates. The Lookup is set up normally. Initially, I had to specify a default query that returned the columns needed from DimProduct, so that the metadata for the data flow was set up properly.


To add an expression to the Lookup, I went back to the Control Flow view, and selected the Data Flow Task. Now, in the Properties window, you can see two properties associated with the Lookup:


image


SqlCommand is used to override the inner SQL statement used by the Lookup, while SqlCommandParam can be used to modify the entire statement. Normally, SqlCommand is enough for most purposes. I have an expression set on the property, that builds a SELECT statement dynamically, including a WHERE clause that filters by the RunDate variable.


image


The primary advantage to doing this is a reduction in the number of rows in the lookup cache. For the AdventureWorks sample, filtering the set of dimension rows down to just the ones in effect as of the Sales Order date reduces the cached rows from 606 to 339. Not really a significant amount in this example, but a similar percentage reduction for a million row dimension table would lead to significant savings.


I’ve uploaded the sample package to my SkyDrive.

Common Expressions In the Data Flow

I use the Derived Column transform in the Data Flow a lot. In fact, I can’t remember the last time I wrote a production package that didn’t include at least one Derived Column transform in it. Over time, I’ve built up a lot of common expressions that I keep using, as well as some that I stored "just in case". I thought it would be good to post a few of them here, primarily because it makes it easier for me to find them, but hopefully others will find these useful as well.

References to columns ([StringColumn], for example) need to be replaced with the appropriate column name in the data flow. Values in italics need to be updated with the appropriate value (literal, column, or variable). Also, some of the date expressions may not work if your regional settings are not US date formats.

Description Expression
Count the occurrences of a specific character in a string LEN([StringColumn]) – LEN(REPLACE([StringColumn],"Character to count",""))
Cast a date stored in a string column as YYMMDD to DT_DBTIMESTAMP (DT_DBTIMESTAMP)(SUBSTRING([StringColumn],3,2) + "-" + SUBSTRING([StringColumn],5,2) + "-" + SUBSTRING([StringColumn],1,2))
Cast a date stored in a string column as YYYYMMDD to DT_DBTIMESTAMP (DT_DBTIMESTAMP)(SUBSTRING([StringColumn],1,4) + "-" + SUBSTRING([StringColumn],5,2) + "-" + SUBSTRING([StringColumn],7,2))
Cast a date stored in a string column as DDMMMYYYY to DT_DBTIMESTAMP (DT_DBTIMESTAMP)(SUBSTRING([StringColumn],3,3) + "-" + SUBSTRING([StringColumn],1,2) + "-" + SUBSTRING([StringColumn],6,4))
Pad a string to the specified length [StringColumn] + REPLICATE(" ", 10 – LEN([StringColumn]))
Check a date to see if it in a valid range and replace with NULL if it is not [DateColumn] < (DT_DBTIMESTAMP)"1753/01/01" ? NULL(DT_DBTIMESTAMP) : [DateColumn > (DT_DBTIMESTAMP)"9999/12/31" ? NULL(DT_DBTIMESTAMP) : [DateColumn]
Convert seconds to minutes and seconds (DT_WSTR,20)([NumberOfSeconds] / 60) + ":" + (DT_WSTR,20)([NumberOfSeconds] % 60)
Combine date and time strings to DT_DBTIMESTAMP (DT_DBTIMESTAMP)("2008-03-25" + " " + "10:20:00")
Return True if Date is a Weekend DATEPART("Weekday",[DateColumn]) == 7 ? (DT_BOOL)"True" : DATEPART("Weekday",[DateColumn]) == 1 ? (DT_BOOL)"True" : (DT_BOOL)"False"
Cast a string to a numeric with a range check and return 0 if out of range [StringColumn] > "9" ? 0 : [StringColumn] < "0" ? 0 : ((DT_NUMERIC,1,0)[StringColumn])

RetainSameConnection and Expressions

I ran across a forum post today that highlighted a limitation with connection managers that I wasn’t aware of, so I thought I’d post it here. If you have RetainSameConnection set to TRUE, and you are also trying to update the connection string through an expression, be aware the connection will not change to reflect the new connection string. Basically, the connection will retain the original value (as RetainSameConnection would indicate).


You might encounter this if you are looping through a set of databases on a server. If you wanted to perform an operation on each database that required a temp table, you would set RetainSameConnection to TRUE. Unfortunately, as soon as you do that, the connection will no longer be dynamic.


I’ve created a sample package to illustrate this. The package flow is straightforward. There is a ForEach Loop that iterates through the databases on the server. The Execute SQL Task retrieves the row count from the sys.tables system view, and stores it in a variable. The Script Task fires two information messages that show the current value of the connection string, and the row count from the variable.


image


These are the results if RetainSameConnection is set to FALSE (with the correct counts):


image


and these are the results if RetainSameConnection is set to TRUE (the counts are all the same, from the initial connection):


image


This isn’t a major issue, but it is helpful to be aware of it. The same behavior occurs under CTP6 for Katmai. It makes sense, given that the RetainSameConnection property is supposed to keep the same connection throughout the package, but it can cause some unintended consequences. I’ve posted the sample to my Skydrive here.