Insert Parent and Child Tables with SSIS Part 2

Yesterday I posted a simple approach for inserting data into tables with a parent-child relationship (like Order Header to Order Detail), where you need to know the generated identity key to tie the records together. That approach relied on having a business value that could consistently tie the parent and child data together. I ended that post with a question – what happens if you don’t have that value?

One example of this might be an order feed that comes from an external company. Each row contains both the order header and the order detail information, and each row needs to be broken up and inserted into two tables – one for the header data, the other for the detail data. Assuming there is no business value that can tie these together, you need to get the identity key for the parent row as soon as it is inserted.
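To make the scenario concrete, here’s a sketch of what the two target tables might look like (these names are illustrative, not taken from the sample package):

[sourcecode language=”sql”]
-- Hypothetical target tables: the parent generates an identity key
-- that each child row must reference.
CREATE TABLE parent (
    order_id    int IDENTITY(1,1) PRIMARY KEY,
    order_date  datetime,
    customer_id int
);

CREATE TABLE order_details (
    order_detail_id int IDENTITY(1,1) PRIMARY KEY,
    order_id        int REFERENCES parent (order_id),
    product_id      int,
    quantity        int
);
[/sourcecode]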

One approach for this involves retrieving a result set using an Execute SQL task and shredding it with a For Each Loop. This approach does have some downsides – please review the Pros and Cons section below for more details. Within the For Each Loop, two Execute SQL Tasks are used – one to insert the header row and retrieve its ID value, the other to insert the child row.

[Screenshot: control flow with an Execute SQL Task feeding a For Each Loop that contains two Execute SQL Tasks]

Retrieving the result set in the initial Execute SQL task is straightforward – just retrieve all the order rows. You can use a For Each loop to shred the retrieved result set by selecting the For Each ADO enumerator, and mapping the columns in the result set to variables. The For Each will run once per row in the result set, setting the variable values to match the current row.
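As a rough sketch, the query in that initial Execute SQL task might look something like this (staging_orders and its columns are hypothetical stand-ins for your order feed):

[sourcecode language=”sql”]
-- One row per order line; the header columns (here, customer_id)
-- are repeated on every row of the feed.
SELECT customer_id, product_id, quantity
FROM staging_orders
[/sourcecode]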

[Screenshot: For Each Loop editor showing the For Each ADO enumerator and the column-to-variable mappings]

The next Execute SQL task is a little more interesting. It inserts the header row and uses the T-SQL OUTPUT clause to return the identity value for that row.

[sourcecode language=”sql” padlinenumbers=”true”]
INSERT INTO parent (order_date, customer_id)
OUTPUT INSERTED.order_id
VALUES (GETDATE(), ?)
[/sourcecode]

If you aren’t using a database engine that supports OUTPUT or similar functionality, you can also execute a second SQL statement in the same task to get the identity value, using @@IDENTITY (be aware that it returns the last identity generated in any scope, including by triggers), SCOPE_IDENTITY(), or the appropriate function for your database.
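For example, on SQL Server the task could run the insert and retrieve the key in a single batch – a minimal sketch, using the same hypothetical parent table:

[sourcecode language=”sql”]
INSERT INTO parent (order_date, customer_id)
VALUES (GETDATE(), ?);

-- SCOPE_IDENTITY() returns the identity value generated in the current
-- scope, which avoids the trigger-related pitfalls of @@IDENTITY.
SELECT SCOPE_IDENTITY() AS order_id;
[/sourcecode]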

The Execute SQL task maps the returned identity value to another package variable (order_id in this example).

[Screenshot: Execute SQL Task result set tab mapping the returned value to the order_id variable]

The next Execute SQL task can then use that variable when inserting the child record into the order details table.
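A minimal sketch of that child insert, assuming the hypothetical order_details table from above:

[sourcecode language=”sql”]
-- The first parameter maps to the order_id variable captured above;
-- the rest come from the current For Each Loop row.
INSERT INTO order_details (order_id, product_id, quantity)
VALUES (?, ?, ?)
[/sourcecode]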

Pros and Cons

This approach has the advantage of guaranteeing that your child records are matched to the correct parent records. This can be a big plus if you don’t have a reliable way to match the records when making two passes.

There are a couple of downsides, though. The biggest one is speed. This approach forces you into RBAR (row by agonizing row) mode, so it’s going to be slow, especially compared to a batch load operation. The second problem is that if you are dealing with a lot of columns, creating all the variables and maintaining the mappings isn’t a fun experience.

Overall, I prefer the approach I mentioned in the first article, and I tend to avoid this one if possible. In some cases, even if you don’t have a reliable key, you may be able to work around it by staging some data to a working table or modifying the data source to provide a reliable key.

The sample package for this post is on my SkyDrive.


Inserting Parent and Child Tables with SSIS

A relatively frequent occurrence in data integration is the need to insert data into a parent table (like Order Header) and insert related records into a child table (Order Details). If the data is already populated with appropriate keys, and you are just copying it, this isn’t too complex – just copy the parent table first, then the child. What if the new tables use identity keys, though? You need to get the new identity key for each header row before you can insert the child row. This post is going to walk through one pattern for doing this, and I’ll show an alternate approach in my next post.

The first approach assumes that you have some common piece of information to link the data. In this case, often the simplest approach is to use two Data Flow tasks, run in sequence. This is my preferred solution to this problem, because it’s fast and it’s usually straightforward to implement.

The first Data Flow loads the parent records, and the second loads the child records. Note that the second Data Flow can’t run until the first succeeds.

[Screenshot: control flow with two Data Flow tasks connected by a precedence constraint]

The first Data Flow is pretty straightforward – it simply retrieves the parent (order header) data and inserts it into the parent table. The most important item here is that the source component retrieves the right data – that is, one row per order header, including some information that can be used to uniquely identify the order. In the sample package I’ve linked to below, you’ll see that the source of the order records is a single table, where a given row includes both the header information and the detail. The source query for the data flow selects and groups on customer ID, as that uniquely identifies the order in this scenario (one order per customer, per day).
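As a sketch, the source query might look something like this (staging_orders and its columns are illustrative, not the exact names from the sample package):

[sourcecode language=”sql”]
-- One row per order header: group the flattened feed on the columns
-- that uniquely identify an order (customer per day in this scenario).
SELECT customer_id, order_date
FROM staging_orders
GROUP BY customer_id, order_date
[/sourcecode]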

The second data flow retrieves the order detail rows from the same source table. It then uses a Lookup transform to retrieve the correct order ID (the identity key) from the parent table. The Lookup just needs enough data to make a unique match – in this case, the current date and the customer ID.
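In T-SQL terms, the Lookup is doing the equivalent of this join (again with illustrative names):

[sourcecode language=”sql”]
-- Match each detail row back to its header on the natural key
-- to pick up the identity value generated by the first Data Flow.
SELECT d.customer_id, d.product_id, d.quantity, p.order_id
FROM staging_orders d
INNER JOIN parent p
    ON  p.customer_id = d.customer_id
    AND p.order_date = d.order_date
[/sourcecode]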

[Screenshot: data flow with a Lookup transform retrieving the order ID from the parent table]

That’s really all there is to the simple pattern. You can find a sample package that illustrates this on my SkyDrive. But there can be more complex scenarios where you still need to handle a Parent / Child insert. For example, what if there is no reliable key to tie the order detail rows to the order header? In this case you can’t use the lookup. Stay tuned for the next post, where I’ll discuss a different pattern that can handle this scenario, but involves some tradeoffs.


SQLSaturday #49 and #56 Recaps – A Little Different

SQLSaturday #49

My recap on SQLSaturday #49 is waaay overdue (sorry, Jack and Andy). It was nice to get a chance to present at the home of SQLSaturday, and I have to say, the amount of experience they have in running SQLSaturday events really shows. The event ran smoothly, and they had plenty of volunteers in all the right places. I missed the speaker dinner, but heard it was good. There was also some excitement Friday night, when the hotel fire alarm went off at 3:30 AM. Not a pleasant wakeup experience. You could tell who the SQLSaturday people staying at the hotel were, though – they were the ones in pajamas with laptop bags over their shoulders.

The event was well attended, with 270 attendees, and the sessions I went to were pretty full. My presentation was on SQL Azure and loading data to and from the cloud with SSIS and BCP. Fortunately, I was able to get a good internet connection (some other presenters had warned me that reception was a little weak in some rooms), so the demos ran as expected. The after-party was nice as well. I had the chance to catch up with some friends (thanks for the ride, Eric) and meet some new people from the area.

SQLSaturday #56

SQLSaturday #56 – BI Edition was this past weekend. It was another good event, with about 215 attendees. I got to spend some time with Greg Galloway (one of the other developers on BIDS Helper) and Tim Mitchell, and I met several new people as well. The event itself was held at the Microsoft campus in Dallas, and the rooms were very nice. We did have a couple of tornado scares, the first of which came up just as my first presentation was starting. It’s a little disconcerting to hear an announcement that everyone should proceed to the storm shelters two sentences into your presentation. But as it turned out, the room we were presenting in was a “tornado safe zone” (which I guess means safer than the surrounding rooms, which had lots of glass windows), so the presentation was able to continue. The event had a small number of volunteers, and it was nice to see speakers and attendees pitching in to help wherever needed, whether cleaning up or directing people to the appropriate rooms.

A Little Different

Something struck me about both of these SQLSaturday events: they both felt relatively low-key and relaxed. I wasn’t involved in organizing either of them, but I got the impression that the organizers wanted to provide a good event without going to extremes. They realized that not everything has to be perfect or prearranged to have a good event. At some other events, I got the distinct impression that the organizers sat down and said “How do we make this SQLSaturday the best one ever?”, really focused on setting the bar higher, and probably experienced a lot of stress in the process. Don’t get me wrong – I think improving the SQLSaturday experience is important, and continuing to try new things to improve that experience is good. I’ve really enjoyed the events that changed my concept of what a SQLSaturday could be.

However, some of the groups that have done the “higher-end” SQLSaturday events have had a lot of sponsors (which means a bigger budget) and a lot of volunteers (which means spreading the workload a lot more). For those of you considering organizing a SQLSaturday in your area, I think it’s important to know that you can organize a very successful SQLSaturday on a smaller budget and with fewer volunteers. If you focus on providing what the attendees are there for (focused, relevant content that helps them learn), the rest of it tends to sort itself out. Based on the speakers and attendees that I’ve talked to at various events, there’s not a significant difference in the satisfaction levels between events. Attendees are very happy to get free, quality training. Speakers are happy to have an audience. The other stuff (meals, giveaways, personal assistants for every attendee 🙂 ) is secondary.

So, if you are considering a new SQLSaturday event, don’t worry about how it will compare with the one the next state over, or the one you held last year. Most people will be very happy however you choose to organize it. And if someone refuses to attend because there won’t be free doughnuts, well, did you really want them there in the first place?


Virtual PC and Windows 7

This morning, while trying to spin up one of my Virtual PCs, I ran into some peculiar error messages. In case the same thing happens to you, the workaround is pretty simple.

When I opened Virtual PC and selected my VMC file I received the following message:

Could not register the virtual machine.  The virtual machine configuration could not be added. User does not have sufficient access rights.

Even when I ran Virtual PC as administrator, I still got that error. The fix is to create a new virtual machine. When you do so, make sure to point to your existing VHD file. At this point you’ll probably get this message:

Virtual PC . . . was unable to write to one of its virtual hard disks.

The fix here is to reference the VHD file by its UNC share path, not its drive letter. So instead of simply pointing to C:\MyVirtualPCs\MyFavoriteVirtualPC.vhd, you have to point to \\ComputerName\MyVirtualPCs\MyFavoriteVirtualPC.vhd. Since I keep my VHD files on an external hard drive, this meant I had to create a share on the parent directory.

Thanks to Damir Dobric for posting the solution!


One Week to SQLSaturday #49

One week left till SQLSaturday #49 – there’s still time to register. I’m pretty excited about going, as it will be my first Orlando SQLSaturday. Also, it looks like a great lineup of speakers.

I’ll be presenting on ways to get data into and out of SQL Azure, with a focus on SSIS (not surprising that I’d put that spin on it, is it?). It will cover some of the other options too, and you may be surprised by the results. If you’re going to be around, please drop in, either to say hi or, as Eric plans to do, to throw fruit (looking forward to it, Eric). I’m happy either way.


SQLSaturday #48 Recap

I presented at SQLSaturday #48 this weekend, and I thoroughly enjoyed it – K. Brian Kelley and the other organizers put on a great event. This was the first SQLSaturday I’ve been to in a while where I had the time and flexibility to attend several of the sessions. I particularly enjoyed Andy Warren’s presentation on personal development plans; it’s definitely something I need to spend more time on. Andrew Kelly’s presentation on storage and IO was good as well – I’ve seen it before, but I pick up something new every time.

From an attendee perspective, the event was very good. There were plenty of great sessions to choose from. The location (Midlands College) was also nice, with plenty of room, and a nice setup for both attending sessions and networking. The lunch room was a little crowded, but it wasn’t a problem – several of us just grabbed some space on the floor.

Looking at it from a speaker’s standpoint, it was well organized. The directions to various classrooms were clear, with schedules on the doors. The room proctors were very effective and helpful. I missed the speaker dinner, but I heard it was good. The only negative that I noticed is that some sessions were pretty lightly attended. I think this was mostly a factor of the number of simultaneous sessions spreading the attendees a little thin. It wasn’t a big deal, though – most of the speakers are happy to speak to small or large crowds.

Overall, definitely a good event, and I’ll be looking forward to the next one.


Create XML Fragment in SQL Server

The XML datatype, first introduced in SQL Server 2005, can be very handy when utilized properly. XML is an industry standard, and it makes it easy to import/export data across different servers, as well as across entirely different systems. This tutorial will guide you through creating an XML fragment on the fly in SQL Server. These instructions are valid for both SQL Server 2005 and SQL Server 2008.

First off, we need some sample data:

[sourcecode language=”sql”]
DECLARE @Students TABLE (
FirstName nvarchar(50),
LastName nvarchar(50),
DisplayName nvarchar(100)
)
INSERT INTO @Students
SELECT 'Jim' as FirstName, 'Bob' as LastName, 'Jim Bob' as DisplayName
UNION
SELECT 'John', 'Doe', 'John Doe'
UNION
SELECT 'Jane', 'Doe', 'Jane Doe'
UNION
SELECT 'Yuri', 'Tao', 'Yuri Tao'
[/sourcecode]

Now that we have our test data, we can attempt to create an XML string for each distinct row. Ultimately, we want our table to end up like the sample below:

[Screenshot: desired result set – each student row paired with its own XML fragment]

We could try our first option:

[sourcecode language=”sql”]
SELECT * FROM @Students FOR XML RAW
[/sourcecode]

However, this option returns an XML fragment for the entire dataset – not what we’re looking for here. We need an individual XML fragment for each row.

After searching around on the net and a few forums, I was finally able to get an answer:
[sourcecode language=”sql”]
SELECT *
FROM @Students s
CROSS APPLY
(
    SELECT
    (
        SELECT *
        FROM @Students t
        WHERE t.DisplayName = s.DisplayName
        FOR XML RAW
    ) x1
) x
[/sourcecode]

The CROSS APPLY operator evaluates the inner query once for each row of the outer table – in this case, it iterates over each row in our table and produces the desired XML result. However, if you were to run this query as-is on our sample data, you would notice that the XML output, while formatted as XML, isn’t of an XML datatype.

To fix this, simply convert the column, like so:
[sourcecode language=”sql”]
SELECT s.*, CONVERT(XML, x.x1) as ErrorData
FROM @Students s
CROSS APPLY
(
    SELECT
    (
        SELECT *
        FROM @Students t
        WHERE t.DisplayName = s.DisplayName
        FOR XML RAW
    ) x1
) x
[/sourcecode]

That’s all there is to it! We now have an XML fragment for each row of data in our sample. As always, test the example for performance bottlenecks. I’ve used this on 50,000 records, and it returns in about a minute – not too bad, given that the records are very wide (40-50 columns).
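As an aside, if you’d rather skip the explicit CONVERT, the TYPE directive of FOR XML (available in SQL Server 2005 and later) returns an xml-typed column directly – a sketch against the same sample data:

[sourcecode language=”sql”]
-- The TYPE directive makes the subquery return the xml datatype,
-- so no separate CONVERT is needed.
SELECT s.*,
    (SELECT t.*
     FROM @Students t
     WHERE t.DisplayName = s.DisplayName
     FOR XML RAW, TYPE) AS ErrorData
FROM @Students s
[/sourcecode]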

Later, I will write a post detailing querying and shredding XML data into a relational dataset – good stuff indeed!!



Exception: Microsoft.SharePoint.SPException: User cannot be found – SharePoint 2010 Installation Failure

Today I found myself trying to install SharePoint 2010 on my laptop running Windows 7 Enterprise. After reading all of the caveats about doing such a thing, I still needed to get it done. The post I followed was from Microsoft: http://msdn.microsoft.com/en-us/library/ee554869.aspx. It seemed pretty straightforward: pre-configure, install prerequisites, blah, blah, blah. And it was all blah, blah, blah, until:

Error #1:

Failed to create the configuration database.
An exception of type System.Security.Cryptography.CryptographicException was thrown.  Additional exception information: The data is invalid.

Well…great. I received this error after I had successfully installed SharePoint 2010, just as the SharePoint 2010 Configuration Wizard was starting up.

After searching around for a while, I came across a post that said this error is related to permissions on the SharePoint directory – the Network Service account needs Full Control on it.

So, navigate to %commonprogramfiles%\Microsoft Shared\Web Server Extensions\14 and give the Network Service user Full Control on that folder and all of its subfolders and files.

OK. Got that error out of the way. Now to rerun the SharePoint 2010 Configuration Wizard. And…

BOOM!! Error #2:

[Screenshot: Configuration Wizard error dialog – Microsoft.SharePoint.SPException: User cannot be found]

What the heck?! This error proved to be more of a pain to diagnose. But again, Google to the rescue. A little background first though.

My computer is part of our domain here at the office. We have two wireless networks which do not talk to one another (for reasons that aren’t important here). I found that this little error is related to the fact that the Configuration Wizard tries to use my AD user credentials to reach out to the Domain Controller for verification/authorization.

However, I was not on the network that had the Domain Controller on it (my computer likes to randomly change wireless networks without telling me). So all I needed to do was hop onto the wireless network with the Domain Controller and rerun the setup.

Woot!! Success!

I finished the SharePoint Configuration Wizard, error free, and now have SharePoint 2010 installed and working on my Windows 7 laptop!

[Screenshot: the new SharePoint 2010 site up and running]


Less than 2 Weeks till SQLSaturday #48

SQLSaturday #48 is coming up in a few more days, but there is still time to register. There’s a great lineup of speakers, and you can’t beat free for training.

I’ll be presenting 2 sessions on SSIS. If you are there, feel free to drop in.


Death by SQL…an Act in Two Parts

How common is it to run into performance issues with SQL Server? Daily? Hourly? Maybe for you it’s a daily reality; and for that, I’m sorry. And how do you deal with performance degradation in SQL? I’d venture to say that, for most, it involves juggling indexes, statistics, etc. on the tables in question. But what about going about this differently?

What if we take a step back and look at the code itself? Maybe the code is the problem, not the server’s performance. Ever since I ran across SQLServerCentral in the early days of my BI career, a few blog posts and articles have stuck with me. One such article, More RBAR and “Tuning” UPDATEs, has been of great help to me.

This article opened up my eyes to a completely different way of thinking when it comes to performance problems within SQL Server. I highly suggest reading it before continuing with the rest of my post here.

I ran into this “tuning” problem the other day while working with some Fact records that I was trying to tie to a Type 2 Dimension. I had about 37,000 school enrollment records for which I needed to find the appropriate Student ID surrogate key among 273,000 student records. It seemed simple enough:

  • Link the Fact record to the Dimension record using the Student Number
  • Based upon the Fact record’s registration date column, match the record to the correct version of the Student

Act 1

There are two different ways to construct our SQL statement to accomplish this: set-based or row-by-row (see RBAR above). Obviously, one is much preferred over the other. Take, for example, the code below (RBAR):

[sourcecode language=”sql”]
UPDATE [PreStage_FactSchoolEnrollment]
SET [AlternateStudentID] =
(SELECT
    (SELECT TOP 1 DS.[StudentID]
    FROM [DimStudent] DS
    WHERE DS.[EffectiveEndDate] >= [FactSchoolEnrollment].[RegistrationDate]
        AND DS.[StudentSISID] = [FactSchoolEnrollment].[Pupil_Number]
    ORDER BY DS.[EffectiveEndDate], DS.[StudentID])
FROM [FactSchoolEnrollment]
WHERE [PreStage_FactSchoolEnrollment].[ID] = [FactSchoolEnrollment].[ID])
[/sourcecode]

Seems innocent enough, right? However, there is a huge performance issue with this query. Below is a screenshot of one particular piece of the actual execution plan from this query above. Remember our record counts: ~37,000 Fact records and ~273,000 Dimension records.

[Screenshot: execution plan fragment showing an index spool producing 8,465,578,262 rows]

That’s right…the number circled above is over 8 BILLION rows created in memory!! (8,465,578,262, to be exact). This is the basic problem with RBAR queries. In essence, this query, as currently structured, queried and stored the ENTIRE dimension (all 273,000 records) for EACH of the incoming Fact records (37,000). That is where the 8.4 billion records come from. Notice that this update took over 48 minutes to run. There isn’t an index in the world that is going to help this type of performance monster.

Act 2

Enter set-based SQL. How about we reconstruct this query as a set-based query instead? Look at the differences in the SQL below:

[sourcecode language=”sql”]
UPDATE PreStage_FactSchoolEnrollment
SET PreStage_FactSchoolEnrollment.AlternateStudentID = ISNULL(b.StudentID, -1)
FROM (SELECT ROW_NUMBER() OVER (PARTITION BY StudentSISID
          ORDER BY EffectiveEndDate ASC) as rownum,
          Pupil_Number, sub_b.StudentID
      FROM FactSchoolEnrollment sub_a
      INNER JOIN DimStudent sub_b ON
          sub_a.Pupil_Number = sub_b.StudentSISID
      WHERE (sub_b.EffectiveEndDate >= sub_a.RegistrationDate)) b
WHERE PreStage_FactSchoolEnrollment.Pupil_Number = b.Pupil_Number
[/sourcecode]

The end result of this query is EXACTLY the same as the query above; the only difference is that this one took all of 9 seconds to return the data. Now that’s a performance gain!

Followup

Now, understandably, it may not always be feasible to rewrite your SQL code because of various constraints. But if you can at all (and I’m pleading with you here), PLEASE try rewriting the code itself. You will be surprised at how much of a difference the syntax can make!
