XML as Source in SSIS

XML is becoming increasingly popular in the workplace. Many systems are using it as a way to transmit data efficiently across the network. While XML certainly has its place, consuming it and deconstructing it can be a real headache.

Unfortunately, the native XML Source components in SSIS are lousy for such a purpose as this. Before I get into the details, I’ll provide a high-level overview.

I was provided roughly 1,000 XML files that I had to import into a SQL database. Sounded easy enough. However, the XML files had no in-line schema, and I was not provided a stand-alone schema so I had to derive one, again using the XML Source components. This would seem fine; however, the schema makes a “best-guess” with relation to the XML file that is presented. It guesses on the datatypes; so in one file, you could have an attribute such as filevalue=”1″ and the schema parsing says that that attribute’s datatype is a Bit. However, the next XML file has filevalue=”2432″. In this case, the import fails because 2432 is obviously not a Bit.

Removing Unwanted Data

Any and all users will tell you “My data is clean!”. But reality often proves different. Eventually, as the project morphed, these individual XML files were rolled into one BIG XML file, formatting, metadata and all. As such, I needed a way to strip all of the superfluous junk out of the file on the fly. Enter — C# StringExtension! I just add this as a script task inside my For Each Loop Container to strip out the junk from each file as it’s processed.

public static class StringExtension
    {
        private static Dictionary<string, string> _replacements = new Dictionary<string, string>();
        static StringExtension()
        {
            _replacements["<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"] = "";
            _replacements["<parmExtRoute>"] = "";
            _replacements["</parmExtRoute>"] = "";
            _replacements["<ReportCacheID>"] = "";
            _replacements["</ReportCacheID>"] = "";
            // etc...

        }
        public static string clean(this string s)
        {
            foreach (string to_replace in _replacements.Keys)
            {
                s = s.Replace(to_replace, _replacements[to_replace]);
            }
            return s;
        }
    }
        public void Main()
        {
            string filename = Dts.Variables["_FileName"].Value.ToString();
            string outfilename = Dts.Variables["_FileName"].Value.ToString();
            try
            {
                StreamReader sr = new StreamReader(filename);

                string content = sr.ReadToEnd();
                content = content.clean();
                sr.Close();

                StreamWriter sw = new StreamWriter(outfilename);
                sw.WriteLine("<Root>");
                sw.WriteLine(content);
                sw.WriteLine("</Root>");

                sw.Close();
                Dts.TaskResult = (int)ScriptResults.Success;
            }
            catch (Exception e)
            {
                Console.WriteLine("{0} Exception caught.", e);
                Dts.TaskResult = (int)ScriptResults.Failure;
            }
        }
    }

All that needs to be modified is the key dictionary pairs, adding additional pairs for whatever needs to be parsed out. Then, simply call the string extension on any string variable that needs to be cleansed, as I’ve done above. I’ve also taken the liberty to surround the entire XML fragment with a singular <root></root> node, so that everything will be happy.

Now, onto importing the data into our table!

Leave a Reply