Archive for July 2007

PerformancePoint Plan and Time

I’ve recently gotten started on a project using Microsoft’s new PerformancePoint Server (PPS) Plan software. I’m actually writing this on a flight back from Seattle after meeting with the PPS team. It was a valuable trip, and I learned a lot about the product.


Perhaps more importantly, I learned about the best way to utilize it in our scenario, which is not a traditional financial scenario. Microsoft is supporting and encouraging its use in non-financial applications, but there are definitely some things that you have to plan ahead for.


One of the first things we encountered was in handling time. Plan creates the time dimension for you, based on some settings that you specify. You have the option to choose between a number of standard financial calendars, but our calendar definitely did not fall into any of the available options. We ended up creating an additional dimension to hold our time attributes, but there are still a number of open issues around coordinating this alternate time with the built-in financial calendar (which is required in every model). As we resolve some of these issues, I’ll post the resolutions.


Fortunately, from an overall standpoint it looks like PPS Plan will support our scenario very well. As we continue forward with the project, I’ll post some additional details.

Iterative Development In BI

One of the staples of agile development (and many other methodologies) is iterative development: rather than doing an entire project in one big pass, the work is broken up into smaller iterations. Each iteration consists of a specific set of deliverables with value to the end user (ideally working code), and is a small enough slice of the project that the deliverables can be completed in a short time frame*. So in a traditional application development project, the first iteration might deliver a working data entry form to the end user. The second iteration might add an additional form and a background validation process, and so on until all the application requirements are met.


Another feature of iterative development is that each iteration involves some aspects of the full software development lifecycle. For each iteration, you expect to go through some requirements, design, development, and testing, but these activities are focused on just what is required for the current iteration. There is usually a small amount of up-front time spent mapping out the entire project, but not in an extreme amount of detail.


Most of the BI projects I have been involved in, however, follow much more of a waterfall approach. There are usually intermediate milestones defined, but they’re usually not working code; they’re a requirements document or an architecture diagram. What’s really depressing about this is the look of incomprehension I get when I suggest breaking things into smaller iterations.


Here are some of the common objections I hear to doing iterative development, and my responses to them.



  • “We have to gather all the requirements up front so we can make sure we don’t miss anything.”
    This is one of the most common, and one of the more absurd, objections. The chances that you can gather all or even most of the requirements up front on a project of any complexity are small. There’s even less chance that the requirements won’t change over the course of the project. I much prefer to do detailed requirements in a just-in-time fashion.
  • “If we don’t know all the requirements at the time we begin design, we might build something that can’t be extended.”
    I agree that designing a BI solution is easier if you have all the pieces in front of you when you begin. However, it is certainly possible to create a flexible design without knowing every detail about how it will be used. In fact, since BI systems encourage emergent behavior, you can count on new requirements arising over the course of the project, and after it is completed. That means a flexible design is necessary, even if you think you have all the requirements up front.
  • “It takes too long to produce working code to deliver iterations that quickly.”
    This is one argument that holds some weight with me. If you are starting from scratch, and you need to build a star schema, load it with data, and create reports on that to deliver to end users, it can be difficult to accomplish that in four to eight weeks. However, that’s where I feel that you may have to narrow the focus of the iteration – instead of building all 8 needed dimensions, you build 4 instead. Then you build the remaining 4 in the next iteration. This does require some creativity to make sure you are building something of use to the end user, while still being able to complete it in the iteration’s timeframe.

That covers some of the common objections. I have done short iterations on BI projects, with a high degree of success, so I know that not only is it possible, it works well. In a future post, I will list some of the benefits I find in doing iterative development on BI projects.


*What is a short time frame? This is somewhat subjective, and you will find opinions ranging from several days to several months. My personal feeling is that an iteration shouldn’t go past 8 weeks, and I think 4 weeks is a much more manageable size.

x64 Development and Visual Studio Assembly References

This is a Random Bits post – not on my normal BI-related topics, but something I found useful and wanted to record somewhere in case it could help someone else.

I recently upgraded to a new laptop running Vista 64-bit. It’s been a bit of an adventure getting everything working, particularly since several of the people I work with are still on XP 32-bit, and I have to make sure I can continue to collaborate with them on shared projects.

One little piece of information I found was related to Visual Studio. I am working on a project (BIDSHelper) that requires file references to a number of assemblies that are not in the global assembly cache (GAC). When you add these references in Visual Studio, the IDE adds a relative path that points to the assembly to the project file, in the <HintPath> tag.

Unfortunately, these assemblies are not in the same location in a 64-bit install as they are in a 32-bit install. If I just changed the file references, it would break everyone else’s projects, but I couldn’t move forward without changing them. However, I found a reference online to using Reference Paths (found in the Project Properties) to locate assemblies. These allow you to specify a folder (or folders) that will be searched for assemblies or other items used by the project. In my case, simply adding the folder containing the assemblies to the Reference Paths allowed me to continue developing without impacting my teammates.
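To illustrate (with illustrative assembly names and paths, not the actual BIDSHelper entries), the file reference lives in the shared project file, while the Reference Paths are stored per-developer in the .user file, so each machine can point at its own folder:

    <!-- In the shared .vbproj/.csproj: a file reference with a relative HintPath -->
    <Reference Include="Microsoft.DataWarehouse">
      <HintPath>..\..\lib\Microsoft.DataWarehouse.dll</HintPath>
    </Reference>

    <!-- In the per-user .vbproj.user/.csproj.user: semicolon-delimited reference paths -->
    <PropertyGroup>
      <ReferencePath>C:\Program Files (x86)\Microsoft Visual Studio 8\Common7\IDE\PrivateAssemblies\</ReferencePath>
    </PropertyGroup>

Since the .user file isn’t normally checked into source control, this is what keeps the 64-bit paths from affecting anyone else.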

This is probably well known to people who spend more time doing .NET development, but it was definitely helpful for me.

Reference Paths In Books Online

Unit Testing In SSIS

I’d put in a request for additional support for unit testing in Integration Services on Connect (https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=276292). Unfortunately, it doesn’t look like it will make the cut for SQL Server 2008.

I used automated unit testing extensively when I was more focused on application development, and I really miss having it available in SSIS. In my experience, it made a significant difference in both quality and speed of development. Having a suite of test cases gave developers a safety net and being able to quickly test small units of work made progress much more tangible. (For more on test-driven development and some of the benefits, see http://en.wikipedia.org/wiki/Test-driven_development.)

Given that, maybe it would be worth considering incorporating some unit testing capabilities into the BIDSHelper project (http://www.codeplex.com/bidshelper) or starting another community project to add those capabilities. Any thoughts on what would be good features to include? My list includes:

  • Ability to test tasks individually (see the sketch after this list for one way this might look)
  • Ideally, ability to test pipeline components individually
  • Standard xUnit functionality (the ability to set up and tear down a test)
  • Some support for setting the database to a known state prior to the test would be valuable
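As a rough sketch of the first item, here’s what executing a single task from a test might look like using the SSIS runtime API (Microsoft.SqlServer.Dts.Runtime). This is just an illustration of the idea, not a finished harness; the package path, task name, and failure handling are hypothetical placeholders.

Imports Microsoft.SqlServer.Dts.Runtime

Public Class PackageTaskTests

    ' Hypothetical example: load a package, run one task by name, and check the result.
    Public Sub TestTruncateStagingTask()
        Dim app As New Application()
        Dim pkg As Package = app.LoadPackage("C:\Packages\LoadCustomers.dtsx", Nothing)

        ' Locate the task to test by its name in the Executables collection.
        Dim target As TaskHost = Nothing
        For Each exe As Executable In pkg.Executables
            Dim th As TaskHost = TryCast(exe, TaskHost)
            If th IsNot Nothing AndAlso th.Name = "Truncate Staging" Then
                target = th
            End If
        Next

        ' Execute just that task; a real harness would use an xUnit-style assertion here.
        Dim result As DTSExecResult = target.Execute(pkg.Connections, target.Variables, Nothing, Nothing, Nothing)
        If result <> DTSExecResult.Success Then
            Throw New Exception("Task did not succeed: " & result.ToString())
        End If
    End Sub

End Class

The same approach could be wrapped in standard xUnit setup and teardown methods, which is where getting the database to a known state before each test would fit.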

Any other thoughts would be welcome.

Processing a Flat File with Header and Detail Rows

It’s been tough finding time to write any content lately, but this is a scenario that came up recently both at work and on the forums, so I decided to write it up. It’s another variation on processing flat files (the variations seem to be endless). This scenario deals with an input file that contains two types of records, headers and details. The processing for this is based on another post (Handling Flat Files with a Varying Number of Columns), so I’m only going to cover what is unique in this scenario.


The input file looks like this:


HL111Header1
DLDetail1
DLDetail2
HL222Header2
DLDetail3
DLDetail4

The rows that begin with HL are header lines, and the rows that begin with DL are detail lines. One additional requirement in this example is that the header code (the three digits following the HL in each header line) must be included in each detail row.


The data flow looks like this:


HeaderDetail Data Flow


The Flat File Source is bringing in each row as a single column, delimited by CR/LF. The script component has two outputs defined, both synchronous, and with an exclusion group greater than zero. (For more info on this, please see this post from Jamie Thompson.) One output will include the header rows, and the other will include the detail rows. Note that each output must have unique column names defined.


The script appears below. It simply parses the Line column (which contains the entire row of data) and puts the values into the appropriate columns on each output. Note that the headerCode variable is saved each time a header row is encountered, and then used to populate the code value on the detail output.


Public Class ScriptMain
    Inherits UserComponent

    ' Holds the code from the most recent header row, so it can be
    ' applied to the detail rows that follow it.
    Private headerCode As Integer

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        If Row.Line.Substring(0, 2) = "HL" Then
            headerCode = CType(Row.Line.Substring(2, 3), Integer)
            Row.DirectRowToHeader()
            Row.HRRecordType = Row.Line.Substring(0, 2)
            Row.HRCode = CType(Row.Line.Substring(2, 3), Integer)
            Row.HRName = Row.Line.Substring(5)
        ElseIf Row.Line.Substring(0, 2) = "DL" Then
            Row.DirectRowToDetail()
            Row.RecordType = Row.Line.Substring(0, 2)
            Row.Code = headerCode
            Row.Name = Row.Line.Substring(2)
        End If
    End Sub

End Class


There it is. Fairly straightforward, and it was quick to put together. The sample files are attached below.

SQL Server 2008 – Using Merge From SSIS

In my last post, I talked about using the new MERGE statement in SQL Server 2008. It allows you to merge the data in a result set with a table, based on a join condition that you specify. This is a common scenario in data warehousing, where you often need to compare a working table to a target table to determine whether a row should be inserted, updated, or deleted. This is often the case for dimension tables. Currently, there is no SSIS component in the June 2007 CTP that leverages the MERGE statement directly. This post will explore one possible pattern for using MERGE from SSIS.

In SSIS 2005, a commonly used pattern for updating dimension tables was to retrieve all of the rows intended for the dimension table in a data flow via a source component. Then the rows would be run through a lookup component to determine if they existed in the target table. Rows to be updated (those found in the lookup) would be run through an OLE DB Command component. Rows to be inserted would be sent to an OLE DB Destination.


In cases where performance is a concern, the OLE DB Command could be replaced with another OLE DB Destination that writes all rows to be updated to a temporary table. In the control flow, an Execute SQL task would issue an UPDATE statement, using the temporary table as a source.
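For example, assuming a hypothetical DimCustomer dimension and a DimCustomer_Updates working table populated by the data flow (the table and column names here are made up for illustration), the Execute SQL task might run something like this:

    -- Hypothetical table and column names
    UPDATE d
    SET    d.CustomerName = u.CustomerName,
           d.City = u.City
    FROM   dbo.DimCustomer AS d
           INNER JOIN dbo.DimCustomer_Updates AS u
               ON d.CustomerKey = u.CustomerKey;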


To implement this using a MERGE statement, we can modify the second pattern. All rows that are read from the source should be written to a temporary table. Then, the Execute SQL task will be used to run a MERGE statement that uses the temporary table as the source for the merge. The matching of the rows will be done through the join condition in the MERGE statement.
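As a rough sketch, using the same hypothetical DimCustomer tables as above (and keeping in mind that the syntax could still change before release), the statement in the Execute SQL task might look like this:

    -- Hypothetical table and column names; DimCustomer_Staging is loaded by the data flow
    MERGE dbo.DimCustomer AS target
    USING dbo.DimCustomer_Staging AS source
        ON target.CustomerKey = source.CustomerKey
    WHEN MATCHED THEN
        UPDATE SET target.CustomerName = source.CustomerName,
                   target.City = source.City
    WHEN NOT MATCHED THEN
        INSERT (CustomerKey, CustomerName, City)
        VALUES (source.CustomerKey, source.CustomerName, source.City);

If deletes are needed as well, a WHEN NOT MATCHED BY SOURCE clause could handle them, provided the working table contains the complete set of rows.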


Why would using the MERGE statement be better than using the lookup pattern described above? For one thing, the data flow is simpler, with no branches in execution. Also, the data flow only has to write to a single table, instead of two tables. I haven’t done any performance testing on the MERGE statement yet, so I can’t say for certain whether it is faster.

I’m still hoping for a destination component that can be used directly in the data flow to perform the MERGE, but the SSIS team has indicated that it probably won’t happen.