Extracting MySql UTF-8 fields with SSIS

Last week I ran into an interesting task with SSIS.  I’m pulling data from a MySql database (5.0.79-enterprise) using the MySql ADO.Net Connector with SSIS and loading the data into a SQL Server 2008 R2 database.  This has worked pretty well, but I ran into a few issues when dealing with UTF-8, so I thought I’d document some of my steps here.

Before we dive into the code, here’s a little information on how UTF-8 is handled in MySql and SQL Server, in case you’re not familiar with it.  Here’s some more information on Unicode that’s good reading material as well.

MySql doesn’t have the concept of national data types (such as nvarchar or nchar).  Instead, this is handled by setting the CHARSET of each table (which can be latin1, UTF-8, etc.).  You can read more about how MySql handles unicode characters in the MySql documentation.  One thing to note is that MySql supports UTF-8 and UCS-2, but you need to be aware of a few details about the implementation: the UTF-8 implementation does not use a BOM, and the UCS-2 implementation uses big-endian byte order and does not use a BOM either.  The ADO.Net Connector doesn’t allow you to set a Code Page when you source the data.

SQL Server stores unicode data in national fields (nchar, nvarchar) using a little-endian UCS-2 encoding.  See http://msdn.microsoft.com/en-us/library/bb330962(SQL.90).aspx for more details (it was written for 2005, but applies to 2008 R2 as well). UCS-2 is a predecessor of UTF-16. It differs from UTF-16 in that UCS-2 is a fixed-length encoding that represents every character as a 16-bit value (2 bytes), and therefore does not support supplementary characters. UCS-2 is frequently confused with UTF-16, which is used to represent text internally in the Microsoft Windows operating systems (Windows NT, Windows 2000, Windows XP, and Windows CE), but UCS-2 is more limited.  Note that the UCS-2 implementations in the two systems use different byte orders (big-endian in MySql, little-endian in SQL Server), so you will have to transform the strings when transferring data between them.
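If it helps to see concrete bytes, here’s a quick illustration (just a C# snippet for this post, not part of the SSIS package): the same character, é (U+00E9), is stored differently by each encoding involved.

[sourcecode language="csharp"]
// Illustration only: one character, three byte representations.
byte[] utf8Bytes    = System.Text.Encoding.UTF8.GetBytes("é");             // C3 A9 (MySql utf8)
byte[] ucs2BeBytes  = System.Text.Encoding.BigEndianUnicode.GetBytes("é"); // 00 E9 (MySql ucs2, big-endian)
byte[] utf16LeBytes = System.Text.Encoding.Unicode.GetBytes("é");          // E9 00 (SQL Server nchar/nvarchar)
[/sourcecode]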

Our source system has been around for quite a while… it started off only supporting latin character sets, but as the company grew we had to handle international characters as well.  Some tables were created using a UTF-8 character set, but some were never converted from latin1… the front end just started inserting UTF-8 strings into the fields.  This means that in certain cases we have different encodings in the same field, which have to be handled.  This doesn’t materially affect the details of how I implemented this solution, but it does mean that some of the built-in conversion functions in MySql won’t necessarily behave as expected, and that you sometimes have to handle a field differently based on when it was stored in the database.

Getting Started

So how do you even know you have a problem like this?  The issue is how the data is represented in each system.  I was trying to get a consistent representation of my data across multiple systems on Linux and Windows, and through a variety of client tools.  Particularly if you don’t control the source of your data, you need to determine whether it is correct, whether there is an encoding issue, or whether there is just a display issue.  It’s also important to make sure your tools can actually display these characters… some can’t.  When in doubt, I’d always fall back to something like Notepad++ with a hex editor plug-in.  Here’s an example of a word expressed in hex (which is correct in the source system), the word as it is displayed in the source system (where it was incorrectly encoded), and the word as it should be expressed when encoded correctly.

[Image: the word’s hex bytes, its garbled display in the source system, and its correct rendering]

When I was initially looking at the data, I tended to focus on a few rows/columns where I could easily see the data was incorrect.  I found Japanese and German to be the easiest for this… Japanese tends to display as “all or nothing” correct, whereas characters such as umlauts in German will be displayed differently in each encoding, giving you a good clue as to when things are right and wrong.  I used a lot of functions such as “HEX()” in MySql and “CAST(xxx AS varbinary)” in SQL Server to look at the hex representations, and I would often dump query results to a text file and look at them in Notepad++ to verify what I was seeing.
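For example, to compare the raw bytes on each side, I’d run something along these lines (the table, column, and id values here are just placeholders):

[sourcecode language="sql"]
-- MySql: show the bytes actually stored in the column
SELECT id, field1, HEX(field1) AS field1_hex
FROM myTable
WHERE id = 42;

-- SQL Server: show the bytes of the loaded nvarchar value
SELECT id, field1, CAST(field1 AS varbinary(8000)) AS field1_hex
FROM dbo.myTable
WHERE id = 42;
[/sourcecode]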

The Approach

I’m pulling data in from the source system directly, not landing it in a file before loading it.  Landing the data first can avoid some of these issues… depending on how the data is stored in the source system, you could just create a Flat File source and import the data using a Code Page of 65001 so the UTF-8 characters are interpreted correctly.  In my particular situation, because of how the data was stored, this wasn’t possible… even if I had wanted to land the data multiple times, which I didn’t.

To start, I created an ADO.Net source to pull data from the source.  All the fields from the source are typed as WSTR… but they still have to be translated from UTF-8 to Windows Unicode.  The most reliable way I found to do this was to create a query like this:

[sourcecode language="sql" padlinenumbers="true"]
SELECT
     id
    ,CAST(COALESCE(NULLIF(field1, ''), ' ') AS binary) AS field1_binary
    ,CAST(COALESCE(NULLIF(field2, ''), ' ') AS binary) AS field2_binary
FROM 
    myTable
[/sourcecode]

The purpose of this is to convert the string to binary (so SSIS will see it as a BYTESTREAM).  So why the NULLIF and COALESCE, you’re probably asking?  SSIS doesn’t like null byte streams… they cause an error (more on that later).  Unfortunately, when you try to cast an empty string to a binary, it is transformed into a null.  I haven’t found a way around that, and further haven’t found a way to COALESCE that null back into anything.  It looks like once it becomes a null, it stays a null.  The solution I found was to convert all nulls and empty strings to a single space, then convert that back to a null downstream.  This isn’t optimal, but it works fine in my situation, so I’m OK with it.

Once we get each of these strings inside SSIS as byte streams, we need to convert them from UTF-8 byte streams into Unicode byte streams.  This isn’t difficult to do in C#, so we just need to create a Script Transform.  You use the binary fields as inputs, then create WSTR outputs for field1 and field2.  Then we use a method that looks like this:

[sourcecode language="csharp"]
public static string GetUnicodeFromUtf8ByteStream(byte[] input, bool convertEmptyStringToNull)
{
    // Create a UTF-8 string from the UTF-8 byte stream
    string inputAsUtf8 = System.Text.Encoding.UTF8.GetString(input, 0, input.Length); 

    // Opportunity to short-circuit; if the string is empty, and
    // the user wants to return nulls for empty strings, go ahead
    // and return a null.
    if (convertEmptyStringToNull && inputAsUtf8.Trim().Length == 0)
    {
        return null;
    } 

    // Convert the  UTF-8 encoded string into a Unicode byte stream
    byte[] convertedToUnicode = System.Text.Encoding.Unicode.GetBytes(inputAsUtf8);
    // Convert the Unicode byte stream into a unicode string
    string output = System.Text.Encoding.Unicode.GetString(convertedToUnicode); 

    // Return the correctly encoded string
    return output;
}

[/sourcecode]

I also created a separate method:

[sourcecode language="csharp"]
public static string GetUnicodeFromUtf8ByteStream(byte[] input)
{
    return GetUnicodeFromUtf8ByteStream(input, true);
}
[/sourcecode]

This provides default behavior for handling empty strings.  I used it to work around the issue where empty strings don’t come across from MySql.  If you have to differentiate between nulls and empty strings, you’ll need to come up with a workaround.

You could also probably just use the method System.Text.Encoding.Convert(Encoding srcEncoding, Encoding dstEncoding, byte[] bytes), but I wanted more control over the transformation.  I haven’t tested that, but it should work.
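For reference, here’s a minimal sketch of what that alternative might look like (again, untested in my package; the method name is just for illustration):

[sourcecode language="csharp"]
public static string GetUnicodeFromUtf8ByteStreamAlt(byte[] input)
{
    // Convert the UTF-8 bytes directly into little-endian Unicode (UTF-16) bytes...
    byte[] unicodeBytes = System.Text.Encoding.Convert(
        System.Text.Encoding.UTF8,     // source encoding
        System.Text.Encoding.Unicode,  // destination encoding
        input);

    // ...then decode those bytes into a string.
    return System.Text.Encoding.Unicode.GetString(unicodeBytes);
}
[/sourcecode]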

I have these methods (along with another few transforms) in a custom assembly, but you can put this directly into the transformation component.  Then, you just need code like this in your ProcessInputRow method:

[sourcecode language="csharp"]
// Convert fields that are stored in UTF-8 format into Unicode
Row.field1 = Converter.GetUnicodeFromUtf8ByteStream(Row.field1binary);
Row.field2 = Converter.GetUnicodeFromUtf8ByteStream(Row.field2binary);
[/sourcecode]

This converts each field containing UTF-8 data into a proper Unicode string that can be loaded into the nvarchar columns in SQL Server.

The Performance

Of course, any time you do something like this there is the question of performance.  I initially ran this test on a set of about 3.5 million records, with 6 fields I was performing the conversion on.  Here are some numbers I came up with running each of these cases a few times.  The total hit is around 10 percent, which isn’t that huge.  I saw roughly the same performance when scaling up to sets of around 100 million rows.

Test                                                          Time
Select (no casting, original fields, throwing away the data)  1:25
Select (coalesces, nullifs, casting)                          1:26
Select (coalesces, nullifs, casting) + transformation         1:34

Some Things that Didn’t Work

One thing that annoyed me about this solution was the COALESCE and NULLIF handling.  Without it, though, null values in a byte stream column will fail the component in SSIS.  I did try changing the ErrorRowDisposition from RD_FailComponent to RD_IgnoreFailure, which allows the nulls to come through.  Unfortunately, at least in my sample, doing this more than doubled the time it took to import the data.  And even then, you have to use a Derived Column transform to create a flag column (indicating whether or not each field is null), and then handle nulls vs. non-nulls differently in the script transforms.  It was a nice thought – and could work for some applications – but it wasn’t a good fit for my solution.

Wrap-up

The biggest issue I had doing all of this was figuring out what was stored in the source system, and how to transform it.  There were actually a few extra flavors of data in the source system, but the approach above worked for all of them.


2 Days to SQLSaturday #46

Only 2 days until SQLSaturday #46 – if you aren’t registered yet, there’s still time. $10 for lunch, lots of great speakers, and some really interesting presentations – what better to do on Saturday?

If you make it out, look me up. I’ve got two presentations, and when I’m not presenting, I’ll be hanging around the Varigence table.


How to Restore a Corrupt Database in SQL Server 2008 (Error 1813)

I ran into this issue the other day, and thought that I’d blog about it. I had copied a database that I needed for my work from a co-worker (MDF file only, unbeknownst to me). He forgot to include the log file for the database in the backup that I received(!) Consequently, when I went to restore the database onto my local machine, I was greeted with a very ‘friendly’ message from SQL Server saying that my database could not be restored.

If this had been any other situation, I would have just gone over to my coworker and asked him to give me a copy of the Log file as well (after having chastised him for giving me a bum backup). However, when this situation occurred, we were at two different physical locations, and I didn’t have a way to get over to where he was.

After googling around for a little bit, I ran across a TERRIFIC post that saved my bacon.

Basically, what he did was create a dummy database, take SQL Server offline, swap the MDF files out, bring SQL Server back online (effectively putting the database into SUSPECT mode), then put the database into EMERGENCY mode and have SQL Server re-create the LDF file. Genius!
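The MDF swap itself happens in the file system while the SQL Server service is stopped, but the T-SQL half of the trick looks roughly like the sketch below. It’s only a sketch with a made-up database name, and REPAIR_ALLOW_DATA_LOSS is as scary as it sounds, so only use it when, as here, there is no log to lose.

[sourcecode language="sql"]
-- After swapping in the real MDF and bringing SQL Server back online,
-- the database shows up as SUSPECT. Then:
ALTER DATABASE CopiedDb SET EMERGENCY;
ALTER DATABASE CopiedDb SET SINGLE_USER;

-- The repair rebuilds the missing log file.
DBCC CHECKDB ('CopiedDb', REPAIR_ALLOW_DATA_LOSS) WITH NO_INFOMSGS;

ALTER DATABASE CopiedDb SET MULTI_USER;
[/sourcecode]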

I can attest to the fact that this works on SQL Server 2008. My database was 27GB in size, so it may take a while to create the LDF file. For me it took about 30 minutes to recreate the file.

Hopefully this post will help someone else out with the issue that I was facing!


Migration to WordPress

We recently migrated this site from Community Server to WordPress. Most of the posts came over pretty well, but some of the code samples and pictures didn’t make it. I’ll try to get these updated and fixed over the next week. Thanks for your patience…


Prevent SSRS 2008 from Overwriting Datasets

Report development is one of my favorite areas of Business Intelligence and I have to say I’ve been fairly happy with the new interface for SSRS 2008.  Except for when it comes to MDX queries. 

Off the top of my head I can think of three major issues with the way SSRS 2008 handles MDX queries.  To be fair, only one of those issues is related to 2008.  The other two issues, which I’m saving for another blog post (no, I’m not even going to hint at them!), are related to SSRS in general, not 2008 specifically.

So what issue am I talking about?  Well, if you’ve ever developed a report in SSRS 2008 that used a parameterized MDX query, then I’m pretty sure you’re familiar with it: after you switch to MDX mode and make and save changes to the report query, the MDX Query Designer overwrites the parameter datasets and wipes out any changes you’ve made to them.

If you define a parameter in your MDX query, SSRS 2008 will automagically create the parameter, and its dataset, for you.  I have to admit, that’s pretty slick.  Unless the dataset already exists, in which case it gets overwritten.  In fact, anytime you modify an MDX query that references the parameter, the dataset will get overwritten.  If you’re using the generic dataset then this isn’t a problem.  But if you’ve customized the dataset in any way, then it gets annoying fast.  Really fast.  Can you imagine having to update an MDX query that referenced several parameters?  And can you imagine having to make multiple tweaks to said MDX query, just so it would operate just right?

After suffering a mild aneurysm I finally said Enough is Enough.  Something must be done.  And, thanks to Teo, something has been done.  Well, kinda-sorta-not really.  There is a bug listed on the Microsoft Connect site, but because a fix would introduce backward-compatibility issues, Microsoft is unable to change this behavior in the current release.  However, they have promised to fix the behavior in a future release.  (Has it been fixed in R2?)

In the meantime, you can perform the workaround described below.  But be warned, it involves manually editing the XML code, so you should probably make a backup of your RDL in case things go horribly awry. 

How to Prevent SSRS 2008 from Overwriting Datasets:

  1. Open the report in XML mode.  There are two ways to do this: One way is to navigate to the RDL file in Windows Explorer, right click on the RDL, and select Open With…  and then choose Notepad.  The second way is to open the report in BIDS, right click on the report in the Solution Explorer, and select View Code. 
  2. Do a search on “<DataSet>”, and add “<rd:SuppressAutoUpdate>true</rd:SuppressAutoUpdate>” inside the “<Query>…</Query>” tags, as shown below.  Just to be on the safe side, I added this line to all of my datasets.

[Image: a dataset’s Query element in the report XML with the rd:SuppressAutoUpdate line added]
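In case the screenshot is hard to read, here’s a rough sketch of what one of the edited parameter datasets looks like. The dataset name, data source name, and query text below are placeholders; the only line that matters is the rd:SuppressAutoUpdate element inside the Query tags.

[sourcecode language="xml"]
<DataSet Name="ParameterDataSet">
  <Query>
    <DataSourceName>MyCube</DataSourceName>
    <CommandText>...the generated MDX for the parameter...</CommandText>
    <!-- Keeps the MDX Query Designer from regenerating this dataset -->
    <rd:SuppressAutoUpdate>true</rd:SuppressAutoUpdate>
  </Query>
  <Fields>
    ...
  </Fields>
</DataSet>
[/sourcecode]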


Presenting at SQLSaturday #49 in Orlando, FL on October 16, 2010

I have a very busy schedule in October. As already mentioned, I’m presenting at SQLSaturday #48 in Columbia earlier in the month. I also have the privilege of presenting at the Orlando, FL SQLSaturday #49 on October 16th, 2010. Thanks to Andy Warren for squeezing me into a slot that opened up in the schedule.

I’ll be presenting a newly revised presentation on SQL Azure – Moving Data with SQL Azure and SSIS. The abstract is:

SQL Azure allows you to host your data in SQL Server in the cloud. That provides some big benefits in scalability and management. However, it leaves open the question, “How do you get your data into / out of the cloud?” At some point, you are going to need to move data to or from an on-premise store to SQL Azure. In this session, we’ll discuss the available options for this, including SSIS, the Sync Framework, and BCP. We’ll cover the pros and cons for each. We’ll drill into one of the options, SSIS, in detail, and review performance options and potential issues that you may encounter when doing this.

I’ve presented on SQL Azure in the past, focusing on how it could be leveraged for BI purposes. In this session, we’ll focus in on moving data in and out of SQL Azure, as that’s an area that’s changing pretty rapidly right now.

If you read the blog, please introduce yourself. I’ll be around all day, and I’m always happy to chat.


How to Pivot Data in SSRS

Pivoting data can be tricky to say the least, whether the actual pivot is done in SQL Server or in Reporting Services. There are cases to be made for both approaches, constrained only by your reporting design requirements. I’ve done both, and had a very good reason for doing each one the way I did.

On one report, I knew that my column headers were always going to be the same, no matter what group of data I was running the report against. In this instance, I let SQL Server do the pivot operation.

In another report, I knew that my column headers were going to change depending on the parameters that were passed into the report. In this case, I had no choice but to let SSRS do the pivot.

If I had a choice, I would do the pivot in SQL Server instead of SSRS, if only for simplicity’s sake. I’m more of a coding type of guy. However, there are instances where you cannot avoid a pivot in SSRS. This tutorial will walk you through the steps of setting up a report in SSRS to pivot data on the fly.

First, we need some data to operate with. The data below is what we are going to start with:

CREATE TABLE tmpTable
(
grp INT
,SortOrder INT
,AssessmentTitle NVARCHAR(50)
,AssessmentLevelLabel NVARCHAR(5)
,LevelScaleLow INT
)

INSERT INTO tmpTable
SELECT 1 AS grp
,1 AS SortOrder
,'Biology' AS AssessmentTitle
,'I' AS AssessmentLevelLabel
,110 AS LevelScaleLow
UNION
SELECT 2 AS grp
,1 AS SortOrder
,'Biology' AS AssessmentTitle
,'II' AS AssessmentLevelLabel
,138 AS LevelScaleLow
UNION
SELECT 3 AS grp
,1 AS SortOrder
,'Biology' AS AssessmentTitle
,'III' AS AssessmentLevelLabel
,147 AS LevelScaleLow
UNION
SELECT 4 AS grp
,1 AS SortOrder
,'Biology' AS AssessmentTitle
,'IV' AS AssessmentLevelLabel
,159 AS LevelScaleLow
UNION
SELECT 1 AS grp
,2 AS SortOrder
,'Math' AS AssessmentTitle
,'I' AS AssessmentLevelLabel
,116 AS LevelScaleLow
UNION
SELECT 2 AS grp
,2 AS SortOrder
,'Math' AS AssessmentTitle
,'II' AS AssessmentLevelLabel
,153 AS LevelScaleLow
UNION
SELECT 3 AS grp
,2 AS SortOrder
,'Math' AS AssessmentTitle
,'III' AS AssessmentLevelLabel
,174 AS LevelScaleLow
UNION
SELECT 4 AS grp
,2 AS SortOrder
,'Math' AS AssessmentTitle
,'IV' AS AssessmentLevelLabel
,198 AS LevelScaleLow

This is how your table should appear:

grp  SortOrder  AssessmentTitle  AssessmentLevelLabel  LevelScaleLow
1    1          Biology          I                     110
2    1          Biology          II                    138
3    1          Biology          III                   147
4    1          Biology          IV                    159
1    2          Math             I                     116
2    2          Math             II                    153
3    2          Math             III                   174
4    2          Math             IV                    198

This is ultimately how we want our data to appear in the Report:

Level  Biology  Math
I      110      116
II     138      153
III    147      174
IV     159      198
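As an aside, if the Assessment columns were fixed, the same shape could be produced in T-SQL with a PIVOT against the tmpTable above instead of in the report. This is just a sketch of the alternative I would normally prefer; it only works because the column list ([Biology], [Math]) is known up front, which is exactly the limitation we’re working around here.

[sourcecode language="sql"]
-- Static pivot: only possible when the output columns are known in advance
SELECT AssessmentLevelLabel AS [Level], [Biology], [Math]
FROM
(
    SELECT AssessmentTitle, AssessmentLevelLabel, LevelScaleLow
    FROM tmpTable
) AS src
PIVOT
(
    MAX(LevelScaleLow) FOR AssessmentTitle IN ([Biology], [Math])
) AS pvt
ORDER BY AssessmentLevelLabel
[/sourcecode]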

Since the columns can change depending on the Assessment that we pass to the stored procedure, a Matrix is going to be needed in the report in order to support dynamic columns. As I discussed in an earlier post, a Matrix is needed for situations where dynamic columns are required; otherwise, you can get by with a static table.

Now that our data is setup for our tutorial, it’s time to start working with pivoting the data in SSRS.

  1. We need to create a new report and add a dataset to it. Here, I’m assuming that you have already created a data source and populated the dataset with the data from the table above.
  2. Grab the Matrix from the Toolbox and drag-n-drop it onto the Report. The Matrix data region provides helpful clues as to where you need to place your fields, specifying Rows, Columns, and Data.
    1. If we look in our table above we can tell that our Rows are going to be the AssessmentLevelLabels with their associated LevelScaleLow values, and the columns are going to be the AssessmentTitles.
  3. So, for Rows, select AssessmentLevelLabel, for Columns, select AssessmentTitle, and for data, select LevelScaleLow.

Grouping and Sorting is automatically done when you assign the fields to the Row Grouping and the Column Grouping (AssessmentLevelLabel and AssessmentTitle in our case).

Now, run the report, and VOILA! Notice that your data is now displayed in the pivoted format!

This approach of pivoting the data in SSRS instead of in SQL Server has a couple of advantages:

  • You don’t have to design the report to accommodate every single column. This isn’t such a big deal in our example, since we ultimately end up with only two columns in the output. But imagine if you could have 30 columns at any one time. You would have to design the report and lay out every single one of those 30 columns. What a pain!
  • Portability – If you decide to drive your pivoted data from a stored procedure, and you are using this pivot across multiple reports, then you only need to go to one central location to make your changes, and they will all be propagated to each report.

That’s all there is to it! If you have any questions, feel free to leave a comment.


Thanks to the Raleigh Triangle SQL User Group

This is overdue (the aforementioned blog problems have delayed a lot of posts), but I wanted to say thanks to the Raleigh, NC TriPASS User Group for letting me present on August 17th, 2010. I talked about “Processing Flat Files with SSIS” and had a great audience, with lots of questions and participation. It was a great experience, and I thoroughly enjoyed it.


Presenting at SQLSaturday #48 in Columbia, SC on Oct. 2nd

I’m excited to be presenting at the upcoming SQLSaturday #48 in Columbia, SC on October 2, 2010. I get to give two presentations at this one:

  1. Creating Custom Components for SSIS – if you are interested in building your own custom components for SSIS, this is a good session to attend
  2. Processing Flat Files with SSIS – feeling confused and angry about the way SSIS handles your flat files? Join this group therapy session where you can learn some tips for making SSIS behave.

If you read the blog, please introduce yourself. I’ll be around all day, and I’m always happy to chat.


Presenting at SQLSaturday #46 in Raleigh September 18th, 2010

I’m very happy to be presenting at the upcoming SQLSaturday #46 in Raleigh, NC on September 18th, 2010. I’ll be presenting two sessions, one on Processing Flat Files with SSIS, and the other on Patterns for SSIS Configuration and Deployment.

I’d also like to mention that Varigence is one of the sponsors of this event, along with a number of other impressive companies. It’s great that so many companies are willing to sponsor these local community events, and I know from helping with the SQLSaturday here in Charlotte that having willing sponsors makes setting up the event much easier. If you’d like to see a demo of Vivid, Mist, or Hadron, feel free to drop by our table at some point during the day.

If you are a reader of this blog, please introduce yourself. I’ll be around all day.
