Address Columns Generically In a Script Component

When writing script components in SSIS, it’s often useful to be able to process the columns without knowing exactly what they are. For example, you may want to loop through all the columns to check for a condition, such as NULL values. Or you may want to take a set of columns and concatenate them into a single string for output, or output each one as an XML element. In any of these cases, you don’t necessarily care about the individual column details.


One option is to use Reflection. This is fairly easy to set up, but it is not the fastest-performing option. To use Reflection, add this to the top of the Script class:

Imports System.Reflection

In the Input0_ProcessInputRow method, you can use the following code:

Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
    Dim column As IDTSInputColumn90
    Dim rowType As Type = Row.GetType()
    Dim columnValue As PropertyInfo

    For Each column In Me.ComponentMetaData.InputCollection(0).InputColumnCollection
        'find the buffer property that matches the input column's name
        columnValue = rowType.GetProperty(column.Name)

        'replace any "January" values with an empty string
        If columnValue.GetValue(Row, Nothing).ToString() = "January" Then
            columnValue.SetValue(Row, String.Empty, Nothing)
        End If
    Next
End Sub


This example checks the value of each column to see if it is "January", and if so, writes an empty string back to the column. Be aware that Reflection is not the best-performing option. It's fine for small data sets, but I would not use it for large ones. You can get better performance from the Reflection approach by caching the PropertyInfo objects for each column in an instance variable, but if you are really interested in performance, keep reading.
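
If you want to try the caching idea, here is a rough sketch of what it might look like. The Dictionary, the PreExecute override, and the GetType(Input0Buffer) lookup are my additions to the sample above, so treat this as a starting point rather than tested code:

Imports System
Imports System.Collections.Generic
Imports System.Reflection
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper

Public Class ScriptMain
    Inherits UserComponent

    'cache of buffer property accessors, keyed by column name
    Private columnProperties As New Dictionary(Of String, PropertyInfo)

    Public Overrides Sub PreExecute()
        MyBase.PreExecute()

        'look up each input column's matching buffer property once, instead of on every row
        Dim rowType As Type = GetType(Input0Buffer)
        For Each column As IDTSInputColumn90 In Me.ComponentMetaData.InputCollection(0).InputColumnCollection
            columnProperties.Add(column.Name, rowType.GetProperty(column.Name))
        Next
    End Sub

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        For Each columnValue As PropertyInfo In columnProperties.Values
            If columnValue.GetValue(Row, Nothing).ToString() = "January" Then
                columnValue.SetValue(Row, String.Empty, Nothing)
            End If
        Next
    End Sub
End Class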


This option is a bit more complex, but it performs better. It is based on some code originally written by MSDN Forum member jaegd (original post here). It's been simplified a good bit, which introduces some limitations: as written, it isn't very type safe, so be sure to add code to check the data type if you are not working with strings.

Public Class ScriptMain
    Inherits UserComponent

    Private inputBuffer As PipelineBuffer

    Public Overrides Sub ProcessInput(ByVal InputID As Integer, ByVal Buffer As Microsoft.SqlServer.Dts.Pipeline.PipelineBuffer)
        'cache the buffer so the columns can be addressed by index in ProcessInputRow
        inputBuffer = Buffer
        MyBase.ProcessInput(InputID, Buffer)
    End Sub

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        Dim counter As Integer = 0

        For counter = 0 To inputBuffer.ColumnCount - 1
            'replace any "January" values with an empty string
            If inputBuffer.Item(counter).ToString() = "January" Then
                inputBuffer.Item(counter) = String.Empty
            End If
        Next
    End Sub
End Class


In the ProcessInput method, the Buffer object is cached, as we’ll need it to reference the columns by their index. In the ProcessInputRow method, we can use the Item property and the column index to read from or write to the column.
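
If you do need the data type check mentioned above, the PipelineBuffer exposes enough metadata to do it. A minimal sketch that plugs into the class above; the DT_WSTR/DT_STR check and the IsNull guard are my assumptions about what a "string column" means here:

Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
    Dim counter As Integer = 0

    For counter = 0 To inputBuffer.ColumnCount - 1
        'only touch non-NULL columns that hold string data
        Dim columnInfo As Microsoft.SqlServer.Dts.Pipeline.BufferColumn = inputBuffer.GetColumnInfo(counter)

        If Not inputBuffer.IsNull(counter) AndAlso _
           (columnInfo.DataType = Microsoft.SqlServer.Dts.Runtime.Wrapper.DataType.DT_WSTR OrElse _
            columnInfo.DataType = Microsoft.SqlServer.Dts.Runtime.Wrapper.DataType.DT_STR) Then

            If inputBuffer.Item(counter).ToString() = "January" Then
                inputBuffer.Item(counter) = String.Empty
            End If
        End If
    Next
End Sub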


That’s two options for working with columns generically in the Script component. As mentioned above, use the second option for performance.



As usual, I’ve posted a sample with both methods on my SkyDrive.


Preparing A List of IDs for an IN Condition

There’s been a few questions on the MSDN Forums recently about using a list of keys in an IN condition in a WHERE clause. There are a few scenarios were this can be a bit tricky in SSIS, particularly if you need to dynamically create the list of keys. As an example, let’s look at the AdventureWorksDW database. Suppose that you need to extract rows from the FactInternetSales table based on the customer key.  You may need to use a SELECT like this, where the ? should be replaced with a list of keys:

SELECT
ProductKey, OrderDateKey, CustomerKey, SalesOrderNumber, SalesOrderLineNumber,
OrderQuantity, UnitPrice, TotalProductCost, SalesAmount, TaxAmt, Freight
FROM
FactInternetSales
WHERE
CustomerKey IN (?)

There are a number of ways to do this. First, if the list of keys is coming from a table (the key table), and both tables are in the same database, you may be able to add a JOIN to the SELECT. This works if the list of keys can be determined through a WHERE clause.

SELECT
ProductKey, OrderDateKey, FactInternetSales.CustomerKey, SalesOrderNumber, SalesOrderLineNumber,
OrderQuantity, UnitPrice, TotalProductCost, SalesAmount, TaxAmt, Freight
FROM
FactInternetSales
JOIN DimCustomer ON FactInternetSales.CustomerKey = DimCustomer.CustomerKey
WHERE
DimCustomer.LastName LIKE 'Z%'

This is the simplest approach, but let's assume that we need to get the list of keys from a separate database, and that it is not practical to join between the tables. This could be because they are two different database engines (SQL Server and Oracle, for example) or because of the performance impact of a cross-server join. The next approach to consider is using a temporary table to hold the list of keys. You would use one Data Flow to move the list of IDs from the key table to a temporary table in the same database as the FactInternetSales table. Then, in a second Data Flow, the JOIN technique discussed above can be applied.


Another approach can be used in situations where creating a temporary table isn't desirable. An Execute SQL task can be used to retrieve the list of keys into a recordset object. There are two options once the keys are in a recordset. A Foreach Loop can be used to iterate through the keys and run the Data Flow once for each key in the recordset. A drawback with this approach is that you are processing each key individually, which can slow down the overall processing. The other option is to use a Script task to create a concatenated list of the keys, which can then be used in the Data Flow to process all the keys at once. Both approaches require that the Data Flow source component use a variable or an expression to get the SqlCommand to execute.




The example I am including shows using the Script task to create the list of IDs. The script is:

Public Sub Main()
    Dim vars As Variables
    Dim ds As DataSet
    Dim dr As DataRow
    Dim delimitedKeyList As String = String.Empty

    'get the DataSet produced by the Execute SQL task
    Dts.VariableDispenser.LockOneForRead("KeyList", vars)
    ds = CType(vars("KeyList").Value, DataSet)
    vars.Unlock()

    'build a comma-delimited list of the key values
    For Each dr In ds.Tables(0).Rows
        delimitedKeyList += (dr(0).ToString() + ", ")
    Next

    'remove last delimiter
    delimitedKeyList = delimitedKeyList.Remove(delimitedKeyList.Length - 2, 2)

    'write the list out for use in the Data Flow's source expression
    Dts.VariableDispenser.LockOneForWrite("INList", vars)
    vars("INList").Value = delimitedKeyList
    vars.Unlock()

    Dts.TaskResult = Dts.Results.Success
End Sub


It takes the KeyList variable (populated with a DataSet object by the Execute SQL task) and writes a delimited list of key values to the INList variable. The INList variable is used in an expression set on the Data Flow:

"SELECT
ProductKey, OrderDateKey, FactInternetSales.CustomerKey, SalesOrderNumber, SalesOrderLineNumber,
OrderQuantity, UnitPrice, TotalProductCost, SalesAmount, TaxAmt, Freight
FROM FactInternetSales
WHERE CustomerKey IN ("
+ @[User::INList] + ")"

Be aware that if you are using an expression, the SELECT string can’t be over 4000 characters in length. For a long list of keys being appended to a long SELECT, that can be a problem. You can work around it by creating the entire SELECT statement in the Script task, and assigning it to a variable there. Another item to note is that the script expects an ADO.NET DataSet object. The Execute SQL task is using an ADO.NET connection, so that is what is passed in. If you are using an OLE DB Connection, the code will need to be updated to convert the ADO recordset (what you get with OLE DB) to a DataTable. Also, you may want to set DelayValidation to TRUE on the Data Flow task, to avoid validation errors when the package initially loads.
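
As a rough illustration of the OLE DB variation, the Main routine above could be adjusted along these lines. The OleDbDataAdapter.Fill overload that accepts an ADO Recordset does the conversion; everything else mirrors the ADO.NET version, so treat this as a sketch rather than tested code:

Public Sub Main()
    Dim vars As Variables
    Dim keyTable As New DataTable()
    Dim dr As DataRow
    Dim delimitedKeyList As String = String.Empty

    'with an OLE DB connection, the Full result set is an ADO Recordset,
    'so fill a DataTable from it before building the list
    Dim adapter As New System.Data.OleDb.OleDbDataAdapter()

    Dts.VariableDispenser.LockOneForRead("KeyList", vars)
    adapter.Fill(keyTable, vars("KeyList").Value)
    vars.Unlock()

    For Each dr In keyTable.Rows
        delimitedKeyList += (dr(0).ToString() + ", ")
    Next

    'remove last delimiter
    delimitedKeyList = delimitedKeyList.Remove(delimitedKeyList.Length - 2, 2)

    Dts.VariableDispenser.LockOneForWrite("INList", vars)
    vars("INList").Value = delimitedKeyList
    vars.Unlock()

    Dts.TaskResult = Dts.Results.Success
End Sub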


A sample package showing the Script approach is available here:



Hopefully this provides information on a number of the options for accomplishing this in SSIS. There are other ways that I didn't cover here, but these options cover most scenarios.


Changes Coming for SSIS 2008

At the PASS Summit this week, I heard a few interesting bits about SSIS 2008 that should be in the next CTP.

One, ADO.NET will be fully supported, with an ADO.NET Data Source (renamed from the Data Reader Data Source) and an ADO.NET Destination. Since ADO.NET has an ODBC provider, we should finally have the ability to use an ODBC database as a destination. And they will both have custom UIs, so no more messing around in the Advanced Editor.

Two, there’s a new data profiling task. This does a really nice job of processing a table and doing all the standard data profiling activities. It’s based on something out of Microsoft Research, so it has some pretty cool capabilities, like a pattern recognition function that will spit out regular expressions that match the contents of the column.

Three, since scripts will be run through VSTA instead of VSA, it will be pretty easy to create a web service transform. Just create a Script component, reference the web service, and all the proxy code will be created for you. (It's not news that VSTA is replacing VSA, but I hadn't thought about how that would impact web services until now.)

Four, data flow threading will be much better. Previously, a single execution tree was always single threaded. That's why in 2005, if you have a long chain of synchronous transforms, you may get better performance by introducing a Union All transform into the sequence. It breaks up the execution tree and allows the engine to use multiple threads. In 2008, the data flow engine will be able to introduce new threads itself.

Five, there will be a Watson-style dump tool available. It will be invoked automatically on a crash, or you can invoke it on demand. It will dump the current state of the package out to a text file, including variable values, data flow progress, etc.

And finally, lookups are going to be drastically enhanced. They can be sourced from ADO.NET or a flat file. We’ll have the ability to cache the lookups and reuse them across data flows in the same package. We’ll also be able to persist the cache between runs of the package. And, interestingly enough, the persisted file will look a lot like a RAW file. There should be some interesting possibilities in that :). There will also be a “missed rows” output, instead of having to use the error output to capture records that didn’t have a match.


Dynamically Pivoting Rows to Columns

Previously, I posted about dynamically un-pivoting columns to rows. In this post, we’ll look at the reverse scenario. Suppose that you have a list of sales orders, in this format:

Sales Order Number    Product
SO-001                Hammer
SO-001                Table Saw
SO-001                Cordless Screwdriver
SO-002                Table Saw
SO-002                Wrench
SO-003                Flashlight

If there is a (reasonable) maximum number of products, you can create the maximum number of columns needed and use the Pivot transform. However, if the number of products per order varies significantly, you might want to use something a little more flexible. This can be accomplished by storing the values in a single column as XML, or as a delimited set of values.


For this example, we’ll take the input described above and transform it to this:

Sales Order Number    Product
SO-001                Hammer|Table Saw|Cordless Screwdriver
SO-002                Table Saw|Wrench
SO-003                Flashlight

The data flow for this is straightforward – a Flat File Source, a Script component, and a Multicast (so that a data viewer can be added to see the results of the script). The Script component is where the bulk of the work is done.




The script transform is configured with a single asynchronous output, by setting Output 0’s SynchronousInputID property to “None”. Two columns are added to Output 0, one for the Sales Order Number, and one for the list of products.




The script itself is listed below. Essentially, the script checks each row to see if the sales order number has changed from the previous row. If it is the same, it appends the product to a list. If the sales order number is different, the currently stored values for sales order number and the product list are sent to the asynchronous output.


There is some special handling for the first and last rows. On the first row, we don't want to output anything, as we need to check the next row first. When the input buffer is finished, we need to send the last values for sales order number and product list, and then call SetEndOfRowset on the output buffer. To make sure we've processed all the rows, we need to check EndOfRowset on the input buffer. This should be checked in the Input0_ProcessInput routine, as it will never be true in the Input0_ProcessInputRow method.





Public Class ScriptMain
    Inherits UserComponent

    Private orderNum As String
    Private productList As String
    Private blnFirstRow As Boolean = True

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)

        If Row.SalesOrderNumber = orderNum Then
            'append this item to the list
            productList = productList + "|" + Row.Product
        Else
            If blnFirstRow Then
                blnFirstRow = False
            Else
                'output the current values
                Output0Buffer.AddRow()
                Output0Buffer.SalesOrderNumber = orderNum
                Output0Buffer.ProductList = productList
            End If

            'reset the control variables
            orderNum = Row.SalesOrderNumber
            productList = Row.Product
        End If
    End Sub

    Public Overrides Sub Input0_ProcessInput(ByVal Buffer As Input0Buffer)
        MyBase.Input0_ProcessInput(Buffer)
        If Buffer.EndOfRowset Then
            'send the final order and close out the output
            Output0Buffer.AddRow()
            Output0Buffer.SalesOrderNumber = orderNum
            Output0Buffer.ProductList = productList
            Output0Buffer.SetEndOfRowset()
        End If
    End Sub
End Class
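
If you'd rather store the product list as XML (the other option mentioned earlier), only the accumulation logic changes; the Input0_ProcessInput override stays the same. A rough sketch, where the Product element name is my own choice and no XML escaping is done:

Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)

    If Row.SalesOrderNumber = orderNum Then
        'append this item as an XML element instead of a delimited value
        productList = productList + "<Product>" + Row.Product + "</Product>"
    Else
        If blnFirstRow Then
            blnFirstRow = False
        Else
            'output the current values
            Output0Buffer.AddRow()
            Output0Buffer.SalesOrderNumber = orderNum
            Output0Buffer.ProductList = productList
        End If

        'reset the control variables; start the new order's list with its first element
        orderNum = Row.SalesOrderNumber
        productList = "<Product>" + Row.Product + "</Product>"
    End If
End Sub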

This can be useful if you need to pivot a varying number of rows into a single column. The same technique can also be used anywhere you want to process a set of rows and change the output based on previous row values. If you'd like to take a more in-depth look, the sample files are on my SkyDrive:





Using the WMI Data Reader in SSIS

A recent post on the MSDN forums asked about using the WMI Data Reader Task to retrieve information and send it to a SQL Server destination. I wasn't very familiar with the task, so I decided to try it out. I wasn't expecting much, given some of my experiences with the WMI Event Watcher (which I have found fairly difficult to get working consistently). However, it was very easy to configure and use.


WMI (Windows Management Instrumentation) allows you to get information about your hardware, software, and network. I'm not going to cover it in depth (I'm not anywhere close to being knowledgeable about it), but if you want more information, there is an article from Microsoft on using the WMI Query Language here (http://www.microsoft.com/technet/scriptcenter/resources/begin/ss1206.mspx).


As a sample, I am querying the Win32_Service class to get a list of services running on my computer, and the start mode for each of them. The query for this is:

SELECT NAME, STARTMODE FROM Win32_Service

The control flow contains two tasks, the WMI Data Reader Task and a data flow to write the results out. For sample purposes, I am only sending it to a Multicast, but it could be sent to any destination.


The WMI Data Reader Task is configured with the query in the WqlQuerySource, and the OutputType is set to “Data table”. The results are being put into a variable (User::WMIVar) of type Object.




The task stores the result of the WMI Query in the specified variable as an ADO.NET DataTable object. This variable is used in the data flow, within a Script Source component. The Script component is configured with two output columns, one for the service name and one for the StartupType. Both columns are using the DT_WSTR(50) data type. The WMIVar is passed in as a Read Only Variable. 




The Script Source shreds the DataTable into the data flow.

Public Overrides Sub CreateNewOutputRows()
    Dim dataTable As System.Data.DataTable
    Dim dataRow As System.Data.DataRow

    'retrieve the DataTable stored in the WMIVar variable by the WMI Data Reader Task
    dataTable = CType(Me.Variables.WMIVar, Data.DataTable)

    'add an output row for each service returned by the WMI query
    For Each dataRow In dataTable.Rows
        Output0Buffer.AddRow()

        Output0Buffer.Name = dataRow.Item("NAME").ToString()
        Output0Buffer.StartupType = dataRow.Item("STARTMODE").ToString()
    Next

    Output0Buffer.SetEndOfRowset()
End Sub


That’s all that was necessary to use the results from the WMI Data Reader task. It was pretty straightforward to configure, and it runs quickly as well.


I’m balanced, except when I’m not

Jamie posted a question on ragged hierarchies to his blog today: Are you balanced or unbalanced? I started thinking about it this evening, and decided to post my thoughts here. I think that, for the most part, business requirements should drive the decision between a balanced and unbalanced hierarchy. But there is a gray area. You can often force an unbalanced hierarchy into a balanced hierarchy by filling in the missing levels between the top and bottom. This is typically implemented by either leaving the intervening levels blank, or copying the parent value down through each intervening level.

For example, say I have the following unbalanced hierarchy, where the DC branch rolls up directly to the North region, and the California (CA) territory has no branches:

Corporate   Region   Territory   Branch
HQ          North    NY          NYC
            North                DC
            North    MA          Boston
            South    NC          Charlotte
            West     CA

I can implement a balanced hierarchy by filling in the blanks, like this:

Corporate   Region   Territory   Branch
HQ          North    NY          NYC
            North    North       DC
            North    MA          Boston
            South    NC          Charlotte
            West     CA          CA

Sometimes this is perfectly acceptable, but it may not accurately represent the business information. That is a call that has to be made on a case by case basis. However, there are some reasons that balancing the hierarchy may be desirable:

  • Parent-child hierarchies, when used in Analysis Services, only have aggregates created for the key attribute and the All level (or top attribute, if you disable the All level). That has a significant performance impact on large dimensions.
  • If you are using Analysis Services, you can still simulate an unbalanced hierarchy by using the HideMemberIf property, which allows you to hide a member of a level if it is blank, or contains the same value as its parent (see the sketch after this list).
  • Parent-child table relationships, while being simple to set up at the relational database level, are difficult to query. If you are developing a data mart that end users will be creating queries against, parent-child relationships are likely to cause problems.
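
The HideMemberIf property is normally set in the BIDS dimension designer, but for completeness, here is a rough AMO sketch of setting it programmatically. The server, database, and dimension names are placeholders, and ParentName corresponds to the "copy the parent value down" approach above:

Imports Microsoft.AnalysisServices

Module SetHideMemberIf
    Sub Main()
        'placeholder names - substitute your own server, database, and dimension
        Dim srv As New Server()
        srv.Connect("Data Source=localhost")

        Dim db As Database = srv.Databases.FindByName("AdventureWorksDW")
        Dim geographyDim As Dimension = db.Dimensions.FindByName("Geography")

        'hide any member whose value matches its parent
        For Each hier As Hierarchy In geographyDim.Hierarchies
            For Each lvl As Level In hier.Levels
                lvl.HideMemberIf = HideIfValue.ParentName
            Next
        Next

        geographyDim.Update()
        srv.Disconnect()
    End Sub
End Module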

In general, I prefer balanced hierarchies, but mostly for technical reasons. There are certainly business cases where they just don’t apply.


2007 PASS Community Summit

At the first of this month, I received word that I will be presenting at the PASS Summit this year. They had a late push for more BI front-end presentations, and evidently they were desperate enough to accept not one, but two of mine. One will be a co-presentation with my co-worker, David Darden, on developing non-financial applications in PerformancePoint Plan. The other will be on delivering BI information through the ProClarity Dashboard Server product. It’s late notice, but I’m excited to be doing it.

Prepping for the presentations has taken up a lot of time recently, on top of a tremendous amount of real work to do. Fortunately, I'm working on projects right now that are leveraging both technologies, so I can get some work done on both at the same time.

Between the presentations and work deadlines, it’s been tough to find much time to update my blog. I am working on a few topics for it, though, so keep watching.

If you are attending the PASS Summit, drop me an email. I'm looking forward to the chance to meet some of you in person.


SQL Server 2008 CTP 4 is Out

And you can download it from here. It looks like there are a few new improvements in Reporting Services, and some new data types in SSIS (to complement the new data types in the relational engine), but nothing too major. Hopefully, I’ll have time to take it for a spin this weekend.


PerformancePoint Plan and Time

I’ve recently gotten started on a project using Microsoft’s new PerformancePoint Server (PPS) Plan software. I’m actually writing this on a flight back from Seattle after meeting with the PPS team. It was a valuable trip, and I learned a lot about the product.


Perhaps more importantly, I learned about the best way to utilize it in our scenario, which is not a traditional financial scenario. Microsoft is supporting and encouraging its use in non-financial applications, but there are definitely some things that you have to plan ahead for.


One of the first things we encountered was in handling time. Plan creates the time dimension for you, based on some settings that you specify. You have the option to choose between a number of standard financial calendars, but our calendar definitely did not fall into any of the available options. We ended up creating an additional dimension to hold our time attributes, but there are still a number of open issues around coordinating this alternate time with the built-in financial calendar (which is required in every model). As we resolve some of these issues, I’ll post the resolutions.


Fortunately, from an overall standpoint it looks like PPS Plan will support our scenario very well. As we continue forward with the project, I’ll post some additional details.


Iterative Development In BI

One of the staples of agile development (and many other methodologies) is the idea of iterative development. That's the idea that, rather than doing an entire project in one big pass, it is broken up into smaller iterations. Each iteration consists of a specific set of deliverables with value to the end user (ideally working code), and is a small enough slice of the project that the deliverables can be met in a short time frame*. So in a traditional application development project, the first iteration might deliver a working data entry form to the end user. The second iteration might add an additional form and a background validation process, and so on until all the application requirements are met.


Another feature of iterative development is that each iteration involves some aspects of the full software development lifecycle. For each iteration, you expect to go through some requirements, design, development, and testing. But these activities are focused on just what is required for the current iteration. There usually is a small amount of up-front time spent mapping out the entire project, but not in an extreme amount of detail.


Most of the BI projects I have been involved in, however, follow much more of a waterfall approach. There are usually intermediate milestones defined, but they're usually not working code; they're a requirements document or an architecture diagram. What's really depressing about this is the look of incomprehension I get when I suggest breaking things into smaller iterations.


Here’s some of the common objections I hear to doing iterative development, and my responses to them.



  • “We have to gather all the requirements up front so we can make sure we don’t miss anything.”
    This is one of the most common, and one of the more absurd, objections. The chances that you can gather all or even most of the requirements up front on a project of any complexity are small. There's even less chance that the requirements won't change over the course of the project. I much prefer to do detailed requirements in a just-in-time fashion.
  • “If we don’t know all the requirements at the time we begin design, we might build something that can’t be extended.”
    I agree that designing a BI solution is easier if you have all the pieces in front of you when you begin. However, it is certainly possible to create a flexible design without knowing every detail about how it will be used. In fact, since BI systems encourage emergent behavior, you can count on new requirements arising over the course of the project, and after it is completed. That means a flexible design is necessary, even if you think you have all the requirements up front.
  • “It takes too long to produce working code to deliver iterations that quickly.”
    This is one argument that holds some weight with me. If you are starting from scratch, and you need to build a star schema, load it with data, and create reports on that to deliver to end users, it can be difficult to accomplish that in four to eight weeks. However, that’s where I feel that you may have to narrow the focus of the iteration – instead of building all 8 needed dimensions, you build 4 instead. Then you build the remaining 4 in the next iteration. This does require some creativity to make sure you are building something of use to the end user, while still being able to complete it in the iteration’s timeframe.

That covers some of the common objections. I have done short iterations on BI projects, with a high degree of success, so I know that not only is it possible, it works well. In a future post, I will list some of the benefits I find in doing iterative development on BI projects.


*What is a short time frame? This is somewhat subjective, and you will find opinions ranging from several days to several months. My personal feeling is that an iteration shouldn’t go past 8 weeks, and I think 4 weeks is a much more manageable size.
