Archive for September 2010

Extracting MySql UTF-8 fields with SSIS

Last week I ran into an interesting task with SSIS.  I’m pulling data from a MySql database (5.0.79-enterprise) using the MySql ADO.Net Connector with SSIS and loading the data into a SQL Server 2008 R2 database.  This has worked pretty well, but I ran into a few issues when dealing with UTF-8, so I thought I’d document some of my steps here.

Before we dive into the code, here’s a little information on how UTF-8 is handled in MySql and SQL Server in case you’re not familiar with it.  Here’s some more information on Unicode that’s good reading material as well.

MySql doesn’t have the concept of national data types (such as nvarchar or nchar).  This is handled by setting the CHARSET of each table (which can be latin1, UTF-8, etc.).  You can read more about how MySql handles unicode characters in the MySql documentation.  One thing to note is that MySql supports UTF-8 and UCS-2, but you need to be aware of a few details about the implementation… the UTF-8 implementation does not use a BOM, and the UCS-2 implementation uses big-endian byte order and also does not use a BOM.  The ADO.Net Connector doesn’t allow you to set a Code Page when you source the data.

SQL Server stores unicode data in national fields (nchar, nvarchar) using a little-endian UCS-2 encoding.  See http://msdn.microsoft.com/en-us/library/bb330962(SQL.90).aspx for more details (it was written for 2005, but is applicable to 2008 R2 as well). UCS-2 is a predecessor of UTF-16; it differs in that UCS-2 is a fixed-length encoding that represents every character as a 16-bit value (2 bytes), and therefore does not support supplementary characters. UCS-2 is frequently confused with UTF-16, which is used to internally represent text in the Microsoft Windows operating systems (Windows NT, Windows 2000, Windows XP, and Windows CE), but UCS-2 is more limited.  Note that the two systems’ UCS-2 implementations use different byte orders (big-endian in MySql, little-endian in SQL Server), so you will have to transform the strings when transferring data between them.
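To make the byte-order difference concrete, here’s a small illustrative snippet (my own, not part of the original solution) showing how .NET encodes the same character under each scheme:

```csharp
using System;
using System.Text;

class EncodingDemo
{
    static void Main()
    {
        // "é" (U+00E9), a character that shows up differently in each encoding
        string s = "\u00E9";

        // UTF-8 (what MySql stores for utf8 columns, no BOM): C3-A9
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(s)));

        // Little-endian UCS-2/UTF-16 (what SQL Server stores in nchar/nvarchar): E9-00
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes(s)));

        // Big-endian UCS-2/UTF-16 (byte order of MySql's ucs2 charset): 00-E9
        Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes(s)));
    }
}
```

Same character, three different byte sequences, which is exactly why the data has to be transformed in transit.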

Our source system has been around for quite a while… it started off only supporting latin character sets, but as the company grew we had to handle international characters as well.  Some tables were created using a UTF-8 character set, but some were never converted from latin1… the front end just started inserting UTF-8 strings into the fields.  This means that in certain cases, we have different encodings in the same field, which have to be handled.  This doesn’t materially affect the details of how I implemented this solution, but it does mean that some of the built-in conversion functions in MySql won’t necessarily behave as expected, and that you sometimes have to handle a field differently based on when it was stored in the database.

Getting Started

So how do you even know you have a problem like this?  The issue is how the data is represented in each system.  I was trying to get a consistent representation of my data across multiple systems in Linux and Windows, and through a variety of client tools.  Particularly if you don’t control the source of your data, you need to determine whether it is correct, whether there is an encoding issue, or whether there is just a display issue.  One important thing is to make sure your tools can actually handle displaying these characters… some can’t.  When in doubt, I’d always fall back to something like Notepad++ with a hex editor plug-in.  Here’s an example of a word expressed in hex (which is correct in the source system), the word as it is displayed in the source system (where it was incorrectly encoded), and the word as it should be expressed when encoded correctly.

[Image: the word in hex, the incorrectly encoded display, and the correctly encoded display]

When I was initially looking at the data, I tended to focus on a few rows/columns where I could easily see the data was incorrect.  I found Japanese and German to be the easiest for this… Japanese tends to display as “all or nothing” correct, whereas characters such as umlauts in German will be displayed differently in each encoding, giving you a good clue when things are right and wrong.  I used a lot of functions such as HEX() in MySql and CAST(xxx AS varbinary) in SQL Server to look at the hex representations, and I would often dump query results to a text file and look at them in Notepad++ to verify what I saw.
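Along the same lines, a small helper in .NET gives you the equivalent view of HEX() or CAST(xxx AS varbinary) from client code.  This is just a sketch of my own (the class and method names are made up, not part of the SSIS package):

```csharp
using System;
using System.Text;

static class HexInspector
{
    // Show the raw bytes a string would occupy in a given encoding,
    // for comparison against MySql's HEX() or SQL Server's varbinary cast.
    public static string ToHex(string value, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(value));
    }

    static void Main()
    {
        // German umlauts make encoding mistakes easy to spot
        Console.WriteLine(ToHex("für", Encoding.UTF8));    // 66-C3-BC-72
        Console.WriteLine(ToHex("für", Encoding.Unicode)); // 66-00-FC-00-72-00
    }
}
```

If the hex you see in the database doesn’t match the hex of the string in the encoding you expect, you’ve found an encoding problem rather than a display problem.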

The Approach

I’m pulling data in from the source system directly, not landing it in a file before loading it in.  That approach can be used to avoid some of these issues… depending on how the data is stored in the source system, you could just create a Flat File source and import that data using a Code Page of 65001 to transform the Unicode characters.  In my particular situation, because of how the data was stored, this wasn’t possible… even if I did want to land the data multiple times, which I didn’t.

To start, I created an ADO.Net source to pull data from the source.  All the fields from the source are typed as WSTR… but they still have to be translated from UTF-8 to Windows Unicode.  The most reliable way I found to do this was to create a query like this:

SELECT
     id
    ,CAST(COALESCE(NULLIF(field1, ''), ' ') AS binary) AS field1_binary
    ,CAST(COALESCE(NULLIF(field2, ''), ' ') AS binary) AS field2_binary
FROM
    myTable

The purpose of this is to convert the string to binary (so SSIS will see it as a BYTESTREAM).  So why the NULLIF and COALESCE, you’re probably asking?  SSIS doesn’t like null byte streams… they cause an error (more on that later).  Unfortunately, when you try to cast an empty string to a binary, it is transformed into a null.  I haven’t found a way around that, and further haven’t found a way to COALESCE that null back into anything.  It looks like once it becomes a null, it stays a null.  The solution I found was to convert all nulls and empty strings to a single space, then convert that back to a null downstream.  This isn’t optimal, but it works fine in my situation so I’m OK with it.

Once we get each of these strings inside SSIS as byte streams, we need to convert them from UTF-8 byte streams into Unicode byte streams.  This isn’t difficult to do in C#, so we just need to create a Script Transform.  You use the binary fields as inputs, then create WSTR outputs for field1 and field2.  Then we use a method that looks like this:

public static string GetUnicodeFromUtf8ByteStream(byte[] input, bool convertEmptyStringToNull)
{
    // Create a UTF-8 string from the UTF-8 byte stream
    string inputAsUtf8 = System.Text.Encoding.UTF8.GetString(input, 0, input.Length); 

    // Opportunity to short-circuit; if the string is empty, and
    // the user wants to return nulls for empty strings, go ahead
    // and return a null.
    if (convertEmptyStringToNull && inputAsUtf8.Trim().Length == 0)
    {
        return null;
    } 

    // Convert the decoded string into a Unicode (UTF-16LE) byte stream
    byte[] convertedToUnicode = System.Text.Encoding.Unicode.GetBytes(inputAsUtf8);
    // Convert the Unicode byte stream back into a string
    string output = System.Text.Encoding.Unicode.GetString(convertedToUnicode); 

    // Return the correctly encoded string
    return output;
}

I also created a separate method:

public static string GetUnicodeFromUtf8ByteStream(byte[] input)
{
    return GetUnicodeFromUtf8ByteStream(input, true);
}

This overload provides default behavior for handling empty strings.  I used it to work around the issue where empty strings don’t come across from MySql.  If you have to differentiate between nulls and empty strings, you’ll need to come up with a workaround.

You could also probably just use the method System.Text.Encoding.Convert(Encoding srcEncoding, Encoding dstEncoding, byte[] bytes), but I wanted more control over the transformation.  I haven’t tested that, but it should work.
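For reference, here’s roughly what that alternative would look like.  This is an untested sketch (the method name is my own); the empty-string-to-null handling from above would still need to be layered on top:

```csharp
using System.Text;

public static class ConvertSketch
{
    // Alternative to GetUnicodeFromUtf8ByteStream: Encoding.Convert transcodes
    // the UTF-8 bytes to UTF-16LE in one call, then we decode them to a string.
    public static string GetUnicodeViaConvert(byte[] utf8Input)
    {
        byte[] unicodeBytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8Input);
        return Encoding.Unicode.GetString(unicodeBytes);
    }
}
```

The trade-off is brevity versus the chance to short-circuit or special-case individual fields, which is why I kept the explicit version.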

I have these methods (along with another few transforms) in a custom assembly, but you can put this directly into the transformation component.  Then, you just need code like this in your ProcessInputRow method:

// Convert fields that are stored in UTF-8 format into Unicode
Row.field1 = Converter.GetUnicodeFromUtf8ByteStream(Row.field1binary);
Row.field2 = Converter.GetUnicodeFromUtf8ByteStream(Row.field2binary);

This converts a field containing UTF-8 data into a proper Unicode string inside of SQL Server.

The Performance

Of course, any time you do something like this there is the question of performance.  I initially ran this test on a set with about 3.5 million records, with 6 fields I was performing the conversion on.  Here are some numbers I came up with running each of these cases a few times.  The hit is a few percent, but it isn’t that huge.  I saw roughly the same performance when scaling up to sets of around 100 million rows or so.

Test                                                           Time
Select (no casting, original fields, throwing away the data)   1:25
Select (coalesces, nullifs, casting)                           1:26
Select (coalesces, nullifs, casting) + transformation          1:34

Some Things that Didn’t Work

One thing that annoyed me about this solution was the COALESCE and NULLIF handling.  Without this, though, a byte stream column will fail in SSIS.  I did try changing the ErrorRowDisposition from RD_FailComponent to RD_IgnoreFailure.  That allows nulls to come through.  Unfortunately, at least in my sample, I found that doing this more than doubled the time it took to import the data.  And even then, you have to use a Derived Column transform to create a flag column (on whether or not each field is null), then you have to handle nulls vs. non-nulls differently in the script transforms.  It was a nice thought – and could work for some applications – but it wasn’t a good fit for my solution.

Wrap-up

The biggest issue I had doing all of this was figuring out what was stored in the source system, and how to transform it.  There were actually a few extra flavors of data in the source system, but the approach above worked for all of them.

Migration to WordPress

We recently migrated this site from Community Server to WordPress. Most of the posts came over pretty well, but some of the code samples and pictures didn’t make it. I’ll try to get these updated and fixed over the next week. Thanks for your patience…