Presenting at SQLSaturday #67–Chicago This Weekend

I’ll be presenting on “Do More (ETL) With Less (Effort) – Automating SSIS” this weekend at the Chicago SQLSaturday. I’m looking forward to it – there’s a lot of great speakers on the schedule. Since I have an early time slot at this one, I should get the chance to relax and enjoy some of the other presentations.

The abstract for my presentation is below. If you happen to be attending the event, please stop by.

Do More (ETL) With Less (Effort) – Automating SSIS

SSIS is a great tool for transferring data from one data source to another, and for implementing complex ETL processes. However, for simple, straightforward data transfer tasks, creating SSIS packages by hand can be time-consuming and repetitious. By attending this session, you’ll learn how to automate package creation in SSIS, including the dynamic generation of data flows. We’ll cover some of the free and open source tools available for this, and discuss “roll your own” options.

Posted in SQL Saturday | Tagged , | Comments Off

Handling Advanced Data Warehouse Scenarios in SSIS

For everyone that attended my presentation of “Handling Advanced Data Warehouse Scenarios in SSIS”, thank you. I’ve posted the slides, SSIS packages, and database script from the presentation on my SkyDrive. You are welcome to download them and take a look at the inner workings. A note of caution, though: please test these in your environment and with your data before applying the patterns everywhere. As I pointed out during the presentation, I had no foreign keys on the data warehouse I was using as a sample, and several of the examples used the technique of generating your own keys in the data flow. If you are using foreign keys or identity / sequence keys, you will need to make some adjustments to the patterns.

The presentation covers the loading of Late Arriving Dimensions, Parent-Child dimensions, Type 3 dimensions, and Type 6 dimensions. Over the next couple of weeks, I’ll be producing a series of blog posts detailing the patterns from the presentation in more detail. I’ll also cover a couple of bridge table patterns which I couldn’t squeeze into the presentation.

Posted in Presentations | Tagged , , , , | Comments Off

Presenting at SQLSaturday #70

I’ll be doing two presentations at the SQLSaturday #70 (#SQLSat70) in Columbia, SC this weekend (Saturday, 3/19). Really looking forward to this event – the last Columbia SQLSaturday was a blast, and this one promises more of the same. There are a number of excellent speakers presenting, so there will be plenty of sessions to learn from.

Varigence will have a table at the event. If you are working in Microsoft BI, I encourage you to stop by and take a look at what we are doing – it’s very cool.

Do More (ETL) with Less (Effort) – Automating SSIS

SSIS is a great tool for transferring data from one data source to another, and for implementing complex ETL processes. However, for simple, straightforward data transfer tasks, creating SSIS packages by hand can be time-consuming and repetitious. By attending this session, you’ll learn how to automate package creation in SSIS, including the dynamic generation of data flows. We’ll cover some of the free and open source tools available for this, and discuss “roll your own” options.

***If all goes well, you may even get a sneak preview of some exciting new BIDS Helper functionality at this session ***

Handling Advanced Data Warehouse Scenarios in SSIS

So you’ve used SSIS to populate a simple star schema data mart, and everybody’s happy. But now you have new requirements that require more advanced data warehouse approaches, like late arriving dimensions, bridge tables, parent child dimensions, and Type 3 or Type 6 slowly changing dimensions (SCD). How do you handle those in a scalable, efficient way in SSIS? This session will present some common patterns for handling these scenarios. You’ll learn when to use each advanced approach and the pros and cons associated with each pattern. You will learn how to implement these patterns in SSIS, and how to tune them for high performance.

Posted in SQL Saturday | Tagged , | Comments Off

T-SQL Tuesday #15: Automating XML Updates w/ PowerShell

This month’s T-SQL Tuesday is being hosted by Pat Wright (blog | twitter). He selected the topic of automation with T-SQL or PowerShell. As it happened, I’d just had a scenario come up that seemed appropriate for a little PowerShell, so I went with it for this post.

The Varigence toolset is based on a language called BIML (Business Intelligence Markup Language), which is an XML-based language for creating BI solutions. We needed to make some mass updates to some of our sample files, but there were quite a few individual ones to update. I already had an XSLT transform that would update each file individually, so all I needed was a quick way to run it on each file in a directory.

I’m pretty new to PowerShell, but it seemed like a good fit for this. So, after a little trial and error, I came up with the following script:

$path = "C:tempxml*.xml"
get-childItem $path | rename-item -newname { $_.name -replace '.xml','_backup.xml' }
$files = Get-ChildItem $path
$xslt = New-Object System.Xml.Xsl.XslCompiledTransform
$xslt.Load("C:tempxmlDataType.xslt")
foreach ($file in $files) {
    $oldFileName = $file.FullName
    $newFileName = $file.FullName -replace '_backup.xml', '.xml'
    $xslt.Transform($oldFileName, $newFileName)
}

Basically, it goes through a directory, renames all the existing .XML files to create a backup, and then gets a list of the files. It then loads an XSLT transformation (used to transform XML documents by inserting or removing markup). Finally, it loops through the files in the folder, and uses the XSLT transform to create an updated version of the file with the original filename.

There are probably better ways to do this, but it does exactly what I needed, and it let me learn a little more about PowerShell. It also struck me that this could be very useful to transform XML into a friendlier shape for processing in SSIS.

Tagged , | Comments Off

T-SQL Tuesday: Fix the Glitch or Resolve the Problem

One of the interesting things about T-SQL Tuesday is that it’s a 24-hour event. So, being on the east coast of the US, I get the chance to read some of the blog posts for the event on Monday night, before I actually post mine, though I usually write them ahead of time. Last night I saw that Rob Farley (twitter | blog) had posted his T-SQL Tuesday post. Reading it, I came to a horrifying realization – Rob and I think alike! (If you know Rob, you’ll realize that’s a lot scarier than it sounds). His post and thought process was very similar to mine – though, while I did think about the screen resolution joke, I had the good sense not to mention it.

Seriously, since Rob’s one of the smartest guys I know, I think it’s pretty cool that we had the same approach.

This month’s T-SQL Tuesday is being hosted by Jen McCown (twitter | blog). She chose the topic of resolutions, with the new year in mind. I’ve never been big on the traditional idea of resolutions, and usually focus more on goals that I revisit on a regular basis.

So I started thinking about other ways that I could blog about resolutions, and the idea of resolving problems came to mind pretty quickly. In the roles I’ve filled over the last several years, I was often called in to deal with problems (technical and non-technical). The scope of the problems could vary quite a bit, from simple, 5 minute technical issues (“Why won’t my cube process?”) to business problems requiring full projects (“I need to analyze student performance and determine what factors impact it most heavily.”). One of the trends that I’ve noticed around solving problems is the shift from resolving the problem to simply fixing the glitch.

What’s the difference?

Well, just a second there, professor. We, uh, we fixed the *glitch*. So he won’t be receiving a paycheck anymore, so it’ll just work itself out naturally.
-Office Space

 

When you fix the glitch, you are basically looking for the quickest, cheapest, least effort patch to getting the immediate results you desire. When you resolve a problem, you are making sure the root cause of the issue is fixed, so that the problem doesn’t resurface in a new form a few months later, and so that you address the logical ramifications of the problem.

Of course, fixing the glitch is sometimes appropriate. If you have a server down situation and 50 people sitting around twiddling their thumbs till it’s back up, you need to get that fixed immediately. However, all too often, it stops there. Rather than investigating the root cause and how to prevent it, often the reaction is “It’s working – now back to business as usual.”

The biggest issue with fixing the glitch is that it sets you up for future failure. Most of the time, the patch fails to prevent issues because it didn’t address the problem completely. Even if the patch continues working, it’s creating something that will have to be worked around in the future. We’ve all dealt with those systems that have so many patches and quick fixes that everyone is scared to make any changes to them, for fear of breaking something else.

Resolving Problems

One of my goals for this year is to avoid quick fixes, and focus on resolving the root problems. That’s sometimes painful in the short term, as you don’t get that immediate satisfaction of resolving a problem 5 minutes into the conversation. The benefit, though, is that you won’t have to keep dealing with that same problem on a recurring basis for the next 5 years, so it’s a price I’m willing to pay.

Posted in Professional Development | Tagged | Comments Off

T-SQL Tuesday #13–Data Quality and the Business User

TSQL2sDay150x150This post is part of T-SQL Tuesday #13 – organized by Adam Machanic (Blog | Twitter), and hosted this month by Steve Jones (Blog | Twitter). The topic this month is “What issues have you had in interacting with the business to get your job done?”

Working in business intelligence, I do a lot of work with business users. For the most part, it’s an enjoyable process and a good interchange of information. I learn about the business, and in the process help the users gain new insights about their data and how it can be used to answer their questions. There is one consistent item, though, that I’ve found difficult to communicate to business users, and that’s the state of their data quality.

Data quality has a pretty big impact on the value of a business intelligence solution. Bad data quality = bad answers, and if your business intelligence solution isn’t delivering good answers, it’s not really worth much. But the data quality in a business intelligence solution depends in large part on the data being fed into it. And unfortunately, the data  in a lot of source systems is not in good shape.

It’s Not Me, It’s You

It’s really difficult to communicate this to the business users, though. After all, they’ve been running their business for years on this data, and now I’m telling them it’s bad? Why haven’t people been complaining about it all along? There are a few things that I try mention at this point.

First, the level of data quality required to get good information from a BI system is very different from that required to run an operational system. For operations, there are usually some required, key pieces of information necessary. These are kept to a minimum, though, because operationally, the more data quality checks you implement, the more you impede the business process. So you want to have just enough data quality to for the system to not fall apart. You don’t really care that the customer entered the wrong zip code with their state (after all, the Post Office will work that out when you send them the bill, right?)

For BI work, though, you are flipping that around. To analyze and get meaning form the data, you need classification of the data, and some of those optional, not-so-important from an operational perspective pieces of information start coming into play. Knowing that your states actually align with your zip codes becomes pretty important if you want to display your data on a map.

Also, people probably have been complaining about data quality – they just aren’t complaining to the business users. The DBAs in most companies I’ve worked with are well aware of the issues in the company’s data. But when they complain about it, they complain to their managers or other DBAs. They don’t complain to business users, because, after all, we technology people are just supposed to make it work, right?

Can We Make This Work?

Convincing the business users of these two points can be pretty difficult, though. In the past, I’ve had to extract data, profile it, and then identify problem areas. Even then, going to a business user with a list of the 1000s of customers who have zip codes that aren’t valid for their state of residence often didn’t help. Business users can’t easily correlate the impact of those types of data issues on their reporting and analytics until they see it in the context of the actual analysis.

So, in the past, I often ended up doing a first pass on the data where I let anything in – quality data or not. Inevitably, as soon as users started looking at the numbers, they’d become much more interested in data quality. Today, tools like PowerPivot can help, because I can quickly build a functional prototype on top of the dirty data, and visually show the impact much more easily.

Once the business users understand the impact of bad quality data, they’ve always seen the value in either fixing it or putting in the appropriate safeguards to prevent it from entering the BI system in the first place. It can be challenging to communicate it, but the faster I could get this point across with the business users, the more successful the projects turned out. One of the keys things that I’ve learned from this process over time is that, while it’s difficult to show someone quality data, you can show clearly show the impact that it has on the solutions. For many business users, this communicates far better than all the talking in the world.

Posted in Requirements | Tagged , , | Comments Off

Processing Flat Files with SSIS at SQLSaturday #61

I’m looking forward to speaking at SQLSaturday #61 (#sqlsat61) in Washington, DC this coming weekend (12/4/2010). As usual for a SQLSaturday, there are some great speakers scheduled to present.

I’ll be presenting on “Processing Flat Files with SSIS”, which goes over some tips and tricks for working with flat files. It’s always a fun session to give, as everyone seems to have encountered some troublesome flat files at some point in their career.

If you are in attendance at the event, please feel free to stop by and say hello.

Posted in SQL Saturday | Tagged , | Comments Off

Data Cleaner Component Published on CodePlex

If you saw my talk “Implementing a Data Cleansing Component for SSIS” at the PASS Summit or SQLSaturday #56 (#sqlsat56), you might be wondering where the source for the sample component is located. It’s published on CodePlex, and downloadable from the Source Code page. As time permits, I’ll create an installer for it, and there’s some updates I’d like to make to add some more functionality. But it’s usable right now, and serves as another example component if you are looking to create your own.

If you have suggestions for improvements (and there are many that can be made), please post them to the CodePlex project’s Issue Tracker page.

Posted in Open Source | Tagged | Comments Off

Moving Data with SQL Azure and SSIS

I did a presentation on moving data to and from SQL Azure for SQLSaturday #49 in Orlando. In it, I reviewed a few of the options available for getting data into and out SQL Azure, including SSIS, BCP, and the Sync Framework. I neglected to load the slides up to the SQLSaturday site, though, so I’m making them available here.

One item to note – in the presentation, I said that the SQL Azure Data Sync tool only supported syncing SQL Azure databases – no on-premise SQL Server. That was correct at the time, but Microsoft has now announced that CTP 2 for Data Sync will support on-premise SQL Servers, meaning it’s a much more viable option for moving data.

The presentation is available here.

Posted in SQL Saturday | Tagged | Comments Off

Recap of “Do You Know the Data Flow” at the Atlanta BI User Group

Last week I presented “Do You Know the Data Flow?” to the Atlanta BI user group. The meeting was well attended, with about 40 people present. The group was great, with lots of good questions and comments. I finally got to meet in person several people in the SQL Server BI community who are local to Atlanta – Teo Lachev, who organizes the group,  Jen Underwood, and Julie and Audrey of DataChix fame. Aaron Nelson (@SQLvariant), Denny Cherry (@mrdenny), and Jorge Segarra (@SQLChicken) were also there.

This was my first public presentation of “Do You Know the Data Flow?”, which is always a little interesting. It’s a subject area I’m pretty comfortable with, though, and the presentation went smoothly. I got some great feedback on it (thanks, everyone), with several people telling me that they’d learned something new. For me, as a presenter, that’s great to hear, because I want people to feel like it was well worth their time to attend. Based on the feedback for this one, I’ll be putting it in my regular roster of presentations.

The slides from the presentation are located here.

Posted in User Groups | Tagged | Comments Off