Handling Flat Files with Varying Numbers of Columns

5/15 Update – I added Part 2 to show how to do the same thing with a Conditional Split and a Derived Column transform, per Jamie’s feedback (see the comments on this post).
A common question on the forums has been how to handle flat files that have a varying number of columns. For example, one row contains 3 columns, and another row may contain only two columns. The example below shows a sample file that uses a comma to delimit the columns, and a carriage return / line feed to delimit each row.
TestValue1,100,12/01/2007
TestValue2,200
TestValue3,300,12/01/2007
TestValue4,400,12/01/2007
TestValue5,500
TestValue6,600,12/01/2007
TestValue7,700,12/01/2007
TestValue8,800
TestValue9,900,12/01/2007
TestValue0,1000,12/01/2007

SSIS does not handle this scenario easily, due to the way it parses flat files. It parses by looking for the next column delimiter. The row delimiter is just the column delimiter for the last defined column. So, on our second line in the sample file, SSIS is looking for a comma instead of a CR/LF. The result of this is that the third row ends up combined with the second row, and we get something that looks like this:
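Roughly speaking, the second line parses into three values along these lines: Column 0 is TestValue2, Column 1 is 200{CR}{LF}TestValue3, and Column 2 is 300,12/01/2007. The line break and the start of the third row end up embedded in the second row’s values.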

I’m not going to go into a discussion about whether this is good or bad. This article is about how to work around it. If you’d like to see it changed in future versions of SSIS, please go to Connect (https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=124478) and vote for it to be changed.
Now, onto the workaround. First, I’ve defined a flat file connection manager that treats each row as one column. I’m using the row delimiter (CR/LF) as the column delimiter.

If you are following along, your flat file should preview like this:
 
Next, in a data flow, I’ve added a flat file source that uses the connection manager. It is connected to a script component that is set as a Transform. The Line column is checked as an input.

In the Inputs and Outputs area, I’ve added three columns, for the three real columns in my flat file, and set the data types appropriately.

Finally, I added the following script to the task:
Public Class ScriptMain
    Inherits UserComponent
    ' The column delimiter used inside each row - a comma in this case
    Private columnDelimiter() As Char = CType(",", Char())

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        Dim rowValues As String()

        ' Split the single Line column into its individual column values
        rowValues = Row.Line.Split(columnDelimiter)
        If rowValues.GetUpperBound(0) < 2 Then
            'Row is not complete - handle the error (here, by setting the outputs to NULL)
            Row.Name_IsNull = True
            Row.Number_IsNull = True
            Row.Date_IsNull = True
        Else
            'Row is complete - assign each value to the matching output column
            Row.Name = rowValues.GetValue(0).ToString()
            Row.Number = Convert.ToInt32(rowValues.GetValue(1))
            Row.Date = Convert.ToDateTime(rowValues.GetValue(2))
        End If
    End Sub

End Class
The columnDelimiter variable holds the value for the column delimiter – a comma in my case. The Split function parses the value contained in Line (the single column defined in the connection manager) and returns an array with one element for each column it finds. Since I’m expecting 3 columns, I check whether the array actually contains all three (arrays in .NET are 0-based, so GetUpperBound(0) returns 2 when all three columns are present). If columns are missing, I have an error that needs to be handled. In this example, I am simply setting all my column values to NULL. The error handling could be enhanced by redirecting the rows to an error output, but I wanted to keep things simple. With this method, I could use a conditional split downstream to filter out the rows with NULL values.
Finally, if the correct number of columns is present, I set the output columns created earlier to the values from the array. Notice that the Convert calls are necessary to make sure each value ends up as the correct type.
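For reference, here is a rough sketch of what the error output approach mentioned above might look like. It is not part of the sample package: it assumes you add a second output to the script component (named "ErrorOutput" here, which is just a name I picked), set both outputs to the same ExclusionGroup, and point their SynchronousInputID at the input, so that SSIS generates the DirectRowTo methods from the output names.

If rowValues.GetUpperBound(0) < 2 Then
    'Sketch only: redirect incomplete rows to the second output
    Row.DirectRowToErrorOutput()
Else
    Row.Name = rowValues.GetValue(0).ToString()
    Row.Number = Convert.ToInt32(rowValues.GetValue(1))
    Row.Date = Convert.ToDateTime(rowValues.GetValue(2))
    Row.DirectRowToOutput0()
End If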
That’s pretty much it. Depending on your needs, you may need to customize the script a bit to better handle error conditions or to parse the columns differently. I’ve attached the sample package and text file below. The sample uses the Trash Destination from Konesans, which you can download from www.SQLIS.com.

Sample files here

As always, feedback is appreciated.

32 Comments

  1. jamiet says:

    John,
    I think it’s important to point out that instead of a script component it is possible to use a Conditional Split combined with a Derived Column as well.

    -Jamie

  2. http:// says:

    Why would you need the Conditional Split? Can’t I just use SUBSTRING and FINDSTRING for each possible column in one Derived Column transformation?

    On a related note, how would I represent a TAB in the FINDSTRING?

    BTW, thanks to both of you for such good and prompt answers. It’s very, very much appreciated.

    Thanks,
    Michael

  3. http:// says:

    Jamie,

    You are correct. Jay’s been a bad influence. :)

    I’ll post another version using a Conditional Split and Derived Column. In my typical implementations of this, though, I have to do some form of processing on the “bad” rows, whether it’s defaulting values, sending them to a different output, etc. I usually just combine all of that into the script. You could do it in the data flow, but my experience has been that the conditional logic needed to handle all the cases is easier to maintain as script.

  4. http:// says:

    Michael,

    Thanks! I’m just happy to know someone is finding some value in it.

  5. http:// says:

    I’ve just stumbled across this site – so glad because I was racking my brain trying to think why my text import wasn’t working.

    The solution is great, but in my case it’s going to mean a lot of extra work.

    I’m importing a document with around 50 columns of various types/lengths, etc.

    If I have to do it this way, I’ll have to set up a load of columns manually, whereas the component can read the header from the text file.

    Is there another way? Perhaps it’s possible to pre-process the file, adding in the correct number of commas to match the fields, before import?

  6. jwelch says:

    That is certainly possible. It means processing the file twice, but that may not be a bad thing, if the file isn’t too big.
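    For example, something along these lines (say, in a Script Task that runs before the data flow) could do the padding. The path and the column count here are just placeholders, and this sketch assumes there are no commas embedded inside quoted text values:

    'Sketch only: pad each line with commas until it has the expected number of columns
    Dim expectedColumns As Integer = 50
    Dim filePath As String = "C:\Data\MyFile.txt" 'placeholder path
    Dim lines() As String = IO.File.ReadAllLines(filePath)
    For i As Integer = 0 To lines.GetUpperBound(0)
        Dim missing As Integer = expectedColumns - lines(i).Split(","c).Length
        If missing > 0 Then
            lines(i) &= New String(","c, missing)
        End If
    Next
    IO.File.WriteAllLines(filePath, lines)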

  7. http:// says:

    V. Useful article, thanks. This certainly is an interesting ‘feature’ Microsoft have included!

    I was hoping that the flat file source would output rejected rows (e.g. where row is truncated) to the error output but it doesn’t seem to do that, it just doesn’t output the row.
    Are you aware of a way to do this?

    Thanks

  8. jwelch says:

    Not without handling it as I show above. You could extend the script component to have an error output and redirect rows to that.

  9. http:// says:

    re: Handling Flat Files with Varying Numbers of Columns

    In the script example above, what if the file is tab delimited instead of comma delimited?

  10. jwelch says:

    I believe you could use an escaped tab character (it’s \t in C#, not sure what it is for VB).
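    For example, VB doesn’t have the \t escape, but you can get the tab character from ControlChars.Tab (or the vbTab constant), so the delimiter declaration from the post would become something like:

    Private columnDelimiter() As Char = New Char() {ControlChars.Tab}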

  11. http:// says:

    Hi
    It works well for a few rows. I have a flat file with 25 columns and 30,000 rows. When I tried to implement this method, it gave me an error saying the data is too large for the column buffer.

    Is there any other way of doing it?

    It’s urgent. I will really appreciate this.

    thank you

  12. jwelch says:

    Not really, but I’m surprised you are getting an error. How wide are your rows?

  13. http:// says:

    After the first step, when you convert all the columns into one row in the flat file connection manager, your row width is equivalent to the width of all 25 columns.
    So I guess that is the reason; all my columns are around 30 characters, so the row will be 25 times 30….

    Thank you!

  14. http:// says:

    HI

    I am still working on an alternative way to deal with it.
    In my last two posts I have mentioned my problem with the varying number of columns.

    Currently I first put the flat file data into a table and then add the additional commas and other modifications through a stored proc (calling it with SSIS). This is not a very efficient way of doing it.

    Still waiting for your reply, guys

    Thank You

  15. jwelch says:

    You might play around with the DefaultBufferSize property on the data flow, and the BufferTempStoragePath.

    Frankly, nothing that you do on a flat file that wide is going to be very efficient. The scripting approach I outline above isn’t going to be blazingly fast on a big file like this. Your current solution might be the best alternative. That, or you could write a custom script source that parses the file for you.
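    As a very rough sketch, a script source configured with an output named "Output 0" and the three columns from the post (Name, Number, Date) could look something like the following. The file path is just a placeholder, and a real version would need the same kind of error handling discussed above:

    Public Overrides Sub CreateNewOutputRows()
        'Sketch only: read the file directly and split each line ourselves
        Using reader As New IO.StreamReader("C:\Data\MyFile.txt") 'placeholder path
            Dim line As String = reader.ReadLine()
            While line IsNot Nothing
                Dim values() As String = line.Split(","c)
                Output0Buffer.AddRow()
                Output0Buffer.Name = values(0)
                If values.GetUpperBound(0) >= 1 Then
                    Output0Buffer.Number = Convert.ToInt32(values(1))
                End If
                If values.GetUpperBound(0) >= 2 Then
                    Output0Buffer.Date = Convert.ToDateTime(values(2))
                End If
                line = reader.ReadLine()
            End While
        End Using
    End Sub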

  16. http:// says:

    Hi,

    Good to see a solution here. I had this issue as well and used a similar approach with .NET code in the script tab. But now we have another major problem, and I’m out of ideas as to how to deal with it.

    We have a flat file source, a comma separated file, with 7 columns.
    The problem here is that it is a comma separated file and there are also commas inside the text fields. It also has a variable number of columns, i.e. some rows have all 7 columns and some have only 4-5 columns populated.

    Here is a sample data snippet.

    ^UserId,education,degree,school,major,status,graduationDate
    01000038990,education,Masters,”University of Electronic Engineering, Moscow, USSR”,MCS,Complete,
    01003328740,education,Bachelors,Oakland University,”BS, MIS”,Complete,

    The degree and school columns contain commas, even multiple commas in some places.

    It would be a great help if somebody could provide a solution to this problem.

  17. http:// says:

    Best solution I’ve found is to import the file into Visual FoxPro, then export it back out again.

    VFP is the King of string manipulation… Even when the 3rd generation of SSIS is STILL deficient in handling something as common as the scenarios listed above (hello, has ANYONE at MS ever worked in the real world, with REAL data), we still have good old VFP to save us.

    * quick-and-dirty program to import CSV file that doesn’t always have same number of columns in each row
    * For example:
    * ProvID,Lname,Fname,MID
    * 10019704,James,James,M
    * 10019706,Aachen,Custer,
    * 10019705,Winans,Scooter
    * 10019707,Williams,Ralph,
    * will error out on 3rd row when using the Flat File Data Task in SSIS 2005, since there isn’t a 4th column defined by the 3rd comma (which IS in the 1st, 2nd and 4th lines)

    * A quick fix is to use VFP to import the file, then re-export it as CSV, which DOES add the proper number of fields.
    * (tried using Excel to do the same, no luck)

    PARAMETERS lInputFilename

    IF PARAMETERS() = 0
    WAIT WINDOW "ERROR! Must include parameter with filename. HALTING…" AT SROWS()/2, SCOLS()/2-25 TIMEOUT 10
    RETURN
    ENDIF

    IF TYPE("lInputFilename") <> "C"
    WAIT WINDOW "ERROR! Must include parameter with filename. HALTING…" AT SROWS()/2, SCOLS()/2-25 TIMEOUT 10
    RETURN
    ELSE
    IF !FILE(lInputFilename) THEN
    WAIT WINDOW "ERROR! Input file (" + ALLTRIM(UPPER(lInputFilename)) + ") does not exist. HALTING…" AT SROWS()/2, SCOLS()/2-25 TIMEOUT 10
    RETURN
    ENDIF
    ENDIF

    SET SAFETY OFF

    WAIT WINDOW NOWAIT "WORKING ON: Input file (" + ALLTRIM(UPPER(lInputFilename)) + ") …" AT SROWS()/2, SCOLS()/2-25

    * There COULD be a lot of error-checking here, and etc, however this is a quick-and-dirty. Re-factoring comes in "version 2"
    CREATE CURSOR tempinput ( ;
    provid char(200) , ;
    firmname char(200) , ;
    lname char(200) , ;
    fname char(200) , ;
    blah, blah
    ENDIF

    WAIT WINDOW NOWAIT “Appending Data… ” AT SROWS()/2, SCOLS()/2-25
    APPEND FROM (lInputFilename) TYPE DELIMITED
    COPY TO (lInputFilename) DELIMITED WITH "" WITH CHARACTER ","

    CLOSE DATABASES
    RETURN

    Create an .EXE and away you go.
    Bk

  18. http:// says:

    I also struggled with this problem until I found a new Delimited File Source Data Flow Source on Codeplex http://ssisdfs.codeplex.com/releases/view/32846
    It handles varying column numbers perfectly and has a lot of other handy features such as an error output.

  19. Sabarinathan says:

    Hi,
    I’m also facing the same problem.
    I have a text file as a source. For example, it contains 4 columns, and some of the columns may contain null values without a delimiter. Below I have given the sample data.

    Source File:

    1,Sabari,27,Chennai
    2,Uma,27
    3,Srivatsan
    4,Sathya
    For the past day I’ve been breaking my head over this; at last I found this forum and how to handle the scenario. I almost got the result, but in my output I’m getting a default 4 commas in all the rows, and I need commas only wherever the null values come. Below I have given the output I get after implementing the code provided by this forum.

    Output:
    1,Sabari,27,Chennai,,,,
    2,Uma,27,,,,
    3,Srivatsan,,,,
    4,Sathya,,,,

    But the output which I want is

    1,Sabari,27,Chennai
    2,Uma,27,,
    3,Srivatsan,,
    4,Sathya,,

    Can anybody tell what’s wrong with the coding I did?
    Public Class ScriptMain
        Inherits UserComponent

        Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
            Dim rowValues As String()

            Dim columnDelimiter() As Char = CType(",", Char())
            rowValues = Row.Line.Split(columnDelimiter)

            If rowValues.GetUpperBound(0) < 3 Then
                'Row is not complete – Handle error
                Row.ID = CInt(True)
                Row.Name = CStr(True)
                Row.Age = CInt(True)
                Row.City = CStr(True)
            Else
                Row.ID = CInt(rowValues.GetValue(0))
                Row.Name = CStr(rowValues.GetValue(1))
                Row.Age = CInt(rowValues.GetValue(2))
                Row.City = CStr(rowValues.GetValue(3))
            End If
        End Sub
    End Class

    Regards,
    Sabarinathan

    • jwelch says:

      The first chunk of your output is the original Line column (the one with the entire row in it). Don’t map that in your destination – you want to ignore it after the script.

      The other problem is the CInt(True), etc that you are using when the count of columns is less than 3. You need to actually pull the correct values from the array – as it is, you are generating blank values anytime the row has less than a complete set of columns.

      Try something like:
      If rowValues.GetUpperBound(0) < 1 Then
      'Row is not complete – Handle error
      Row.ID = CInt(rowValues.GetValue(0))
      'Provide defaults for remaining values
      Row.Name = CStr(True)
      Row.Age = CInt(True)
      Row.City = CStr(True)
      Else If rowValues.GetUpperBound(0) < 2 Then
      'Row is not complete – Handle error
      Row.ID = CInt(rowValues.GetValue(0))
      Row.Name = CStr(rowValues.GetValue(1))
      'Provide defaults for remaining values
      Row.Age = CInt(True)
      Row.City = CStr(True)

      ...etc

  20. Faisal Lodhi says:

    This example is perfect for what I’m trying to do. I copied the code exactly and I’m getting an error: namespace does not directly contain members such as fields or methods… I’m using SSMS 2008. I have no coding experience. Please help.

  21. tom says:

    Thanks, very helpful

  22. Mike says:

    I can not believe that this is STILL an SSIS bug since 2005. Amazing Microsoft, amazing.

  23. Daniel Schlueter says:

    Really informative article. Thanks again. Awesome.

  24. J_Golf says:

    Hi there,

    Great article, but I’m having problems building this solution in SSIS.

    Error 1 ‘Line’ is not a member of ‘SC_41e7d57bd43743eaa9fe27e8dec394d2.vbproj.Input0Buffer

    same error for ‘Number’ and ‘date’

    Thanks,

    J_Golf

  25. chris says:

    Hello! I am getting this error as well, and the names do match. But I did rework a script component and change everything in it. Could that be the problem?
    Thanks, Chris
