
Dealing with Unicode and ASCII Text Files

A recent post on the MSDN forums brought up an interesting problem. The poster was attempting to process a folder of files, all of which had the same format, but some were encoded as ASCII (ANSI, DT_STR, or varchar, if you like the SQL Server data types) and some were encoded as Unicode (DT_WSTR, nvarchar).


The initial issue was determining which way the file was encoded. After a little research, I located a post entitled “Detecting File Encodings in .NET” by Heath Stewart. This showed how to check the encoding of a file in C#. I converted the code to VB.NET and set it up to work within a Script Task. These are the results:

    Public Sub Main()
        ' Code based on a sample originally posted here:
        ' http://www.devhood.com/tutorials/tutorial_details.aspx?tutorial_id=469
        Dim file As System.IO.FileStream = Nothing
        Dim isUnicode As Boolean = False

        Try
            file = New System.IO.FileStream(Dts.Connections("TestFileAnsi").AcquireConnection(Nothing).ToString(), _
                System.IO.FileMode.Open, System.IO.FileAccess.Read, System.IO.FileShare.Read)
            If file.CanSeek Then
                ' Get the byte-order mark, if there is one
                Dim bom(3) As Byte
                file.Read(bom, 0, 4)
                If ((bom(0) = &HEF And bom(1) = &HBB And bom(2) = &HBF) Or _
                    (bom(0) = &HFF And bom(1) = &HFE) Or _
                    (bom(0) = &HFE And bom(1) = &HFF) Or _
                    (bom(0) = 0 And bom(1) = 0 And bom(2) = &HFE And bom(3) = &HFF)) Then
                    isUnicode = True
                Else
                    isUnicode = False
                End If
                ' Reposition the file cursor back to the start of the file
                file.Seek(0, System.IO.SeekOrigin.Begin)
            Else
                ' The file cannot be randomly accessed, so you need to decide what default
                ' encoding to assume. If most of your files come from older applications,
                ' default to ASCII; if they come from newer applications, default to Unicode.
                isUnicode = False
            End If

            Dts.TaskResult = Dts.Results.Success
        Catch e As Exception
            Dts.TaskResult = Dts.Results.Failure
        Finally
            If Not file Is Nothing Then
                file.Close()
            End If

            Try
                Dim vars As Variables
                Dts.VariableDispenser.LockOneForWrite("isUnicode", vars)
                vars("isUnicode").Value = isUnicode
                vars.Unlock()
            Catch ex As Exception
                Dts.TaskResult = Dts.Results.Failure
            End Try
        End Try
    End Sub


Essentially, the script above gets the file path from a flat file connection manager. It then checks the byte order mark (BOM), which is what tells us whether the file is Unicode. If the encoding can't be determined from the BOM, the script assumes the file is ASCII. The result of the check is written to a Boolean variable defined in the package called isUnicode.
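
For reference, the byte sequences the script checks are the standard byte order marks: EF BB BF for UTF-8, FF FE for UTF-16 little endian, FE FF for UTF-16 big endian, and 00 00 FE FF for UTF-32 big endian.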


That part was relatively straightforward. Originally, I thought I could just use this variable to set the Unicode property on the flat file connection manager through an expression and use the same data flow, as long as I typed all the string columns as DT_WSTR (I really should have known better). Oddly enough, when I first tested this, I found that if I set the Unicode property to false on the connection manager but set the column types to DT_WSTR, I could process ASCII files without a problem, but Unicode files didn't send any rows through, even though there was no error. If I set the Unicode property to true and then set it to false via an expression, the flat file source threw a validation error, because the error output was using the DT_NTEXT type, and it needs to use DT_TEXT with ASCII files.


What I ended up with instead isn’t quite as simple, but it does work consistently. I created two connection managers, one configured for ASCII, and the other for Unicode. Both connection managers have an expression defined that sets their ConnectionString property based on the same variable, so they both point at the same file. Then I created two data flows, one for ASCII, one for Unicode. This eliminates any metadata errors. Finally, I set precedence constraints from the script task to each data flow, and made execution conditional based on the isUnicode variable.


[Screenshot: control flow with the Script Task and the two conditional data flows]


The constraint to DF ANSI looks like this:

[Screenshot: precedence constraint editor for DF ANSI]

and the constraint to DF Unicode looks like this:

[Screenshot: precedence constraint editor for DF Unicode]
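
Since both constraints just test the Boolean variable, the expressions would look something like this (with the Evaluation operation set to Expression):

@[User::isUnicode] == FALSE    (constraint to DF ANSI)
@[User::isUnicode] == TRUE     (constraint to DF Unicode)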


The sample files for this (including a test file formatted as Unicode and one formatted as ASCII) are on my Live Drive:





To switch the test file from ASCII to Unicode, change the file path in the “testFilePath” variable.

Scanning a Package For Specific Tasks

I recently had a need to scan through an SSIS package to find all instances of data flow tasks within the package. This is to support a future enhancement in BIDSHelper. However, I ended up making the code a little more general, so it can be used to get a list of all instances of any type of task within a package.


Before looking at the code, there is a concept that it is helpful to understand. SSIS packages have a nested structure. Everything is held within a container. Some containers, like the Sequence or ForEach Loop containers, can hold groups of other containers. The package itself is a container. All containers that can hold multiple containers implement the IDTSSequence interface. The TaskHost container, however, holds a single task, and does not implement the IDTSSequence interface. The TaskHost is what we have to find to identify individual instances of tasks.


The function below is designed to be called recursively. It uses the IDTSSequence interface to iterate through the Executables collection, which can hold either TaskHosts or containers that implement IDTSSequence themselves, which is where the recursion comes in. It also takes the type name as a string parameter. The code checks the type name string against the name of the type as returned by the GetType function. It simply checks whether the typeName string exists within the full string returned, so it doesn't have to be a complete match. This means you don't have to specify the full namespace, like "Microsoft.SqlServer.Dts.Tasks.ExecuteSQLTask". Instead, you can just specify "ExecuteSQL".

        List<DtsContainer> FindExecutablesByType(IDTSSequence sequence, string typeName)
        {
            string typeNameUpper = typeName.ToUpper();
            List<DtsContainer> matchingExecutable = new List<DtsContainer>();
            foreach (Executable e in sequence.Executables)
            {
                if (e.GetType().ToString().ToUpper().Contains(typeNameUpper))
                {
                    matchingExecutable.Add((DtsContainer)e);
                }
                if (e is TaskHost)
                {
                    TaskHost taskHost = (TaskHost)e;

                    if ((typeNameUpper.Contains("DATA FLOW")
                        || typeNameUpper.Contains("DATAFLOW")
                        || typeNameUpper.Contains("MAINPIPE")
                        || typeNameUpper.Contains("PIPELINE"))
                        && taskHost.InnerObject is IDTSPipeline90)
                    {
                        matchingExecutable.Add((DtsContainer)e);
                    }
                    else if (taskHost.InnerObject.GetType().ToString().ToUpper().Contains(typeNameUpper))
                    {
                        matchingExecutable.Add((DtsContainer)e);
                    }
                }
                if (e is IDTSSequence)
                {
                    matchingExecutable.AddRange(FindExecutablesByType((IDTSSequence)e, typeNameUpper));
                }
            }
            return matchingExecutable;
        }


The function returns a list of DtsContainer objects; DtsContainer is a base class for all control flow objects, and it gives access to the name and description of each object. There is some special handling for Data Flow objects. These are actually implemented as COM objects, so a call to GetType on them returns "ComObject" rather than the .NET class name. To work around this, I have hard-coded a few values for the data flow objects, and a specific check for the IDTSPipeline90 interface. There may be a better way to accomplish this, but I haven't located it yet. Another limitation of this code is that it does not check event handlers, so any tasks inside event handlers will not be found. This could be added fairly easily, but wasn't necessary for my current purposes, so it didn't make the cut.
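
If you did need to pick up tasks in event handlers, the relevant pieces are that containers which can raise events (the package, the TaskHost, and so on) inherit from EventsProvider, and each DtsEventHandler implements IDTSSequence. A rough, untested sketch of the extra check that could go inside the foreach loop above:

            // Hypothetical addition: recurse into each executable's event handlers,
            // since each DtsEventHandler is itself a sequence of executables.
            EventsProvider eventsProvider = e as EventsProvider;
            if (eventsProvider != null)
            {
                foreach (DtsEventHandler eventHandler in eventsProvider.EventHandlers)
                {
                    matchingExecutable.AddRange(FindExecutablesByType((IDTSSequence)eventHandler, typeNameUpper));
                }
            }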


To call this function and iterate the entire package, you can use this code:

Microsoft.SqlServer.Dts.Runtime.Application ssisApp = new Microsoft.SqlServer.Dts.Runtime.Application();
Package pkg = ssisApp.LoadPackage("C:\\Temp\\MyPackage.dtsx", null);
List<DtsContainer> containers = FindExecutablesByType((IDTSSequence)pkg, "ExecuteSQL");
foreach (DtsContainer exec in containers)
{
    //Do Something
}

I’ve put together a small GUI around this, posted on my SkyDrive.



If you are interested in learning more about working with SSIS programmatically, please post in the comments. I’m considering doing a few more posts on this, but only if there is interest.

Address Columns Generically In a Script Component

When writing script components in SSIS, it's often useful to be able to process the columns without knowing exactly what they are. For example, you may want to loop through all the columns to check for a condition, like NULL values. Or you may want to take a set of columns and concatenate them into a single string for output, or output each one as an XML element. In any of these cases, you don't necessarily care about the individual column details.


One option is to use Reflection. This is fairly easy to set up, but is not the fastest performing option. To use Reflection, add this to the top of the Script class:

Imports System.Reflection

In the ProcessInputRow method, you can use the following code:

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        Dim column As IDTSInputColumn90
        Dim rowType As Type = Row.GetType()
        Dim columnValue As PropertyInfo

        For Each column In Me.ComponentMetaData.InputCollection(0).InputColumnCollection
            columnValue = rowType.GetProperty(column.Name)
            If columnValue.GetValue(Row, Nothing).ToString() = "January" Then
                columnValue.SetValue(Row, String.Empty, Nothing)
            End If
        Next
    End Sub


This example checks the value of each column to see if it is "January", and if it is, writes an empty string back to the column. Be aware that this is not the best-performing option. It's OK for small data sets, but I would not use it for large ones. You can get better performance from the Reflection approach by caching the PropertyInfo objects for each column in an instance variable, but if you are really interested in performance, keep reading.
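
To illustrate the caching idea, here is a minimal sketch; the columnProperties field and the PreExecute override are additions to the sample above, so the reflection lookup runs once per execution instead of once per row:

    ' Requires Imports System.Collections.Generic in addition to System.Reflection
    Private columnProperties As New List(Of PropertyInfo)

    Public Overrides Sub PreExecute()
        MyBase.PreExecute()
        ' Look up each column's property once, before any rows are processed
        Dim rowType As Type = GetType(Input0Buffer)
        For Each column As IDTSInputColumn90 In Me.ComponentMetaData.InputCollection(0).InputColumnCollection
            columnProperties.Add(rowType.GetProperty(column.Name))
        Next
    End Sub

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        ' Reuse the cached PropertyInfo objects for every row
        For Each columnValue As PropertyInfo In columnProperties
            If columnValue.GetValue(Row, Nothing).ToString() = "January" Then
                columnValue.SetValue(Row, String.Empty, Nothing)
            End If
        Next
    End Sub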


The second option is a bit more complex, but performs better. It is based on some code originally written by MSDN Forum member jaegd (original post here). It's been simplified a good bit, but that introduces some limitations. As written, this isn't very type safe, so be sure to add code to check the data type if you are not working with strings.

Public Class ScriptMain
    Inherits UserComponent

    Private inputBuffer As PipelineBuffer

    Public Overrides Sub ProcessInput(ByVal InputID As Integer, ByVal Buffer As Microsoft.SqlServer.Dts.Pipeline.PipelineBuffer)
        inputBuffer = Buffer
        MyBase.ProcessInput(InputID, Buffer)
    End Sub

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        Dim counter As Integer = 0

        For counter = 0 To inputBuffer.ColumnCount - 1
            If inputBuffer.Item(counter).ToString() = "January" Then
                inputBuffer.Item(counter) = String.Empty
            End If
        Next
    End Sub
End Class


In the ProcessInput method, the Buffer object is cached, as we’ll need it to reference the columns by their index. In the ProcessInputRow method, we can use the Item property and the column index to read from or write to the column.


That’s two options for working with columns generically in the Script component. As mentioned above, use the second option for performance.



As usual, I’ve posted a sample with both methods on my SkyDrive.

Preparing A List of IDs for an IN Condition

There’s been a few questions on the MSDN Forums recently about using a list of keys in an IN condition in a WHERE clause. There are a few scenarios were this can be a bit tricky in SSIS, particularly if you need to dynamically create the list of keys. As an example, let’s look at the AdventureWorksDW database. Suppose that you need to extract rows from the FactInternetSales table based on the customer key.  You may need to use a SELECT like this, where the ? should be replaced with a list of keys:

SELECT
ProductKey, OrderDateKey, CustomerKey, SalesOrderNumber, SalesOrderLineNumber,
OrderQuantity, UnitPrice, TotalProductCost, SalesAmount, TaxAmt, Freight
FROM
FactInternetSales
WHERE
CustomerKey IN (?)

There are a number of ways to do this. First off, if the list of keys is coming from a table (the key table), and both tables are in the same database, you may be able to add a JOIN condition to the SELECT. This works if the list of keys can be determined through a WHERE clause.

SELECT
ProductKey, OrderDateKey, FactInternetSales.CustomerKey, SalesOrderNumber, SalesOrderLineNumber,
OrderQuantity, UnitPrice, TotalProductCost, SalesAmount, TaxAmt, Freight
FROM
FactInternetSales
JOIN DimCustomer ON FactInternetSales.CustomerKey = DimCustomer.CustomerKey
WHERE
DimCustomer.LastName LIKE 'Z%'

This is the simplest approach, but let's assume that we need to get the list of keys from a separate database, and that it is not practical to join between the tables. This could be because they are two different database engines (SQL Server and Oracle, for example) or because of the performance impact of a cross-server join. The next approach to consider is using a temporary table to hold the list of keys. You would use one Data Flow to move the list of IDs from the key table to a temporary table in the same database as the Sales table. Then, in a second Data Flow, the technique discussed above can be applied.
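
For illustration, the source query in the second Data Flow would then look something like this (CustomerKeyList is a hypothetical name for the temporary key table loaded by the first Data Flow):

SELECT
ProductKey, OrderDateKey, FactInternetSales.CustomerKey, SalesOrderNumber, SalesOrderLineNumber,
OrderQuantity, UnitPrice, TotalProductCost, SalesAmount, TaxAmt, Freight
FROM
FactInternetSales
JOIN CustomerKeyList ON FactInternetSales.CustomerKey = CustomerKeyList.CustomerKey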


Another approach can be used in situations where creating a temporary table isn't desirable. An Execute SQL task can be used to retrieve the list of keys into a recordset object. Once the keys are in a recordset, there are two options. A For Each loop can be used to iterate through the keys and run the Data Flow once for each key in the recordset. A drawback with this approach is that you are processing each key individually, which can slow down the overall processing. The other option is to use a Script task to create a concatenated list of the keys, which can then be used in the Data Flow to process all the keys at once. Both approaches require that the Data Flow source component use a variable or an expression to get the SQL command to execute.


[Screenshot: control flow with the Execute SQL task, Script task, and Data Flow]


The example I am including shows using the Script task to create the list of IDs. The script is:

    Public Sub Main()
        Dim vars As Variables
        Dim ds As DataSet
        Dim dr As DataRow
        Dim delimitedKeyList As String = String.Empty

        Dts.VariableDispenser.LockOneForRead("KeyList", vars)
        ds = CType(vars("KeyList").Value, DataSet)
        vars.Unlock()

        For Each dr In ds.Tables(0).Rows
            delimitedKeyList += (dr(0).ToString() + ", ")
        Next

        ' Remove the last delimiter
        delimitedKeyList = delimitedKeyList.Remove(delimitedKeyList.Length - 2, 2)

        Dts.VariableDispenser.LockOneForWrite("INList", vars)
        vars("INList").Value = delimitedKeyList
        vars.Unlock()

        Dts.TaskResult = Dts.Results.Success
    End Sub


It takes the KeyList variable (populated with a DataSet object by the Execute SQL task) and writes a delimited list of key values to the INList variable. The INList variable is used in an expression set on the Data Flow:

"SELECT
ProductKey, OrderDateKey, FactInternetSales.CustomerKey, SalesOrderNumber, SalesOrderLineNumber,
OrderQuantity, UnitPrice, TotalProductCost, SalesAmount, TaxAmt, Freight
FROM FactInternetSales
WHERE CustomerKey IN ("
+ @[User::INList] + ")"

Be aware that if you are using an expression, the SELECT string can't be over 4000 characters in length. For a long list of keys appended to a long SELECT, that can be a problem. You can work around it by creating the entire SELECT statement in the Script task and assigning it to a variable there. Another item to note is that the script expects an ADO.NET DataSet object. The Execute SQL task is using an ADO.NET connection, so that is what is passed in. If you are using an OLE DB connection, the code will need to be updated to convert the ADO recordset (which is what you get with OLE DB) to a DataTable. Also, you may want to set DelayValidation to TRUE on the Data Flow task, to avoid validation errors when the package initially loads.
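
If you do need that conversion, OleDbDataAdapter.Fill has an overload that accepts an ADO Recordset object, so a minimal sketch (reusing the KeyList variable from the script above, and requiring Imports System.Data.OleDb) would be:

        Dim vars As Variables
        Dim dt As New DataTable()
        Dim adapter As New OleDbDataAdapter()

        Dts.VariableDispenser.LockOneForRead("KeyList", vars)
        ' Fill a DataTable from the ADO recordset stored in the variable
        adapter.Fill(dt, vars("KeyList").Value)
        vars.Unlock()

        ' dt.Rows can now be iterated the same way ds.Tables(0).Rows is iterated above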


A sample package showing the Script approach is available here:



Hopefully this provides a good overview of the options for accomplishing this in SSIS. There are other ways that I didn't cover here, but these options handle most scenarios.

Dynamically Pivoting Rows to Columns

Previously, I posted about dynamically un-pivoting columns to rows. In this post, we’ll look at the reverse scenario. Suppose that you have a list of sales orders, in this format:

Sales Order Number    Product
SO-001                Hammer
SO-001                Table Saw
SO-001                Cordless Screwdriver
SO-002                Table Saw
SO-002                Wrench
SO-003                Flashlight

If there is a (reasonable) maximum number of products, you can create the maximum number of columns needed and use the Pivot transform. However, if the number of products per order varies significantly, you might want to use something a little more flexible. This can be accomplished by storing the values in a single column, either as XML or as a delimited set of values.


For this example, we’ll take the input described above and transform it to this:

Sales Order Number    Product
SO-001                Hammer|Table Saw|Cordless Screwdriver
SO-002                Table Saw|Wrench
SO-003                Flashlight

The data flow for this is straightforward – a Flat File Source, a Script component, and a Multicast (so that a data viewer can be added to see the results of the script). The Script component is where the bulk of the work is done.


[Screenshot: data flow with Flat File Source, Script Component, and Multicast]


The script transform is configured with a single asynchronous output, by setting Output 0's SynchronousInputID property to "None". Two columns are added to Output 0, one for the Sales Order Number, and one for the list of products.


[Screenshot: script component output columns]


The script itself is listed below. Essentially, the script checks each row to see if the sales order number has changed from the previous row. If it is the same, it appends the product to a list. If the sales order number is different, the currently stored values for sales order number and the product list are sent to the asynchronous output.


There is some special handling for the first and last rows. On the first row, we don't want to output anything, as we need to check the next row first. When the input buffer is finished, we need to send the last values for the sales order number and product list, and then call SetEndOfRowset on the output buffer. To make sure we've processed all the rows, we need to check EndOfRowset on the input buffer. This should be checked in the Input0_ProcessInput routine, as it will never be true in the Input0_ProcessInputRow method.





Public Class ScriptMain
    Inherits UserComponent

    Private orderNum As String
    Private productList As String
    Private blnFirstRow As Boolean = True

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)

        If Row.SalesOrderNumber = orderNum Then
            ' Append this item to the list
            productList = productList + "|" + Row.Product
        Else
            If blnFirstRow Then
                blnFirstRow = False
            Else
                ' Output the current values
                Output0Buffer.AddRow()
                Output0Buffer.SalesOrderNumber = orderNum
                Output0Buffer.ProductList = productList
            End If

            ' Reset the control variables
            orderNum = Row.SalesOrderNumber
            productList = Row.Product
        End If
    End Sub

    Public Overrides Sub Input0_ProcessInput(ByVal Buffer As Input0Buffer)
        MyBase.Input0_ProcessInput(Buffer)
        If Buffer.EndOfRowset Then
            Output0Buffer.AddRow()
            Output0Buffer.SalesOrderNumber = orderNum
            Output0Buffer.ProductList = productList
            Output0Buffer.SetEndOfRowset()
        End If
    End Sub
End Class

This can be useful if you need to pivot a varying number of rows into a single column. The same technique can also be used anywhere you want to process a set of rows and change the output based on previous row values. If you'd like to take a more in-depth look, the sample files are on my SkyDrive:




Using the WMI Data Reader in SSIS

A recent post on the MSDN forums asked about using the WMI Data Reader Task to retrieve information and send it to a SQL Server destination. I wasn't very familiar with the task, so I decided to try it out. I wasn't expecting much, given some of my experiences with the WMI Event Watcher (which I have found fairly difficult to get working consistently). However, it was very easy to configure and use.


WMI (Windows Management Instrumentation) allows you to get information about your hardware, software, and network. I'm not going to cover it in depth (I'm not anywhere close to being knowledgeable about it), but if you want more information, there is an article from Microsoft on using the WMI Query Language here (http://www.microsoft.com/technet/scriptcenter/resources/begin/ss1206.mspx).


As a sample, I am querying the Win32_Service class to get a list of services running on my computer, and the start mode for each of them. The query for this is:

SELECT NAME, STARTMODE FROM Win32_Service

The control flow contains two tasks, the WMI Data Reader Task and a data flow to write the results out. For sample purposes, I am only sending it to a Multicast, but it could be sent to any destination.

[Screenshots: the control flow and the data flow]

The WMI Data Reader Task is configured with the query in the WqlQuerySource, and the OutputType is set to “Data table”. The results are being put into a variable (User::WMIVar) of type Object.


[Screenshot: WMI Data Reader Task Editor]


The task stores the result of the WMI Query in the specified variable as an ADO.NET DataTable object. This variable is used in the data flow, within a Script Source component. The Script component is configured with two output columns, one for the service name and one for the StartupType. Both columns are using the DT_WSTR(50) data type. The WMIVar is passed in as a Read Only Variable. 


[Screenshots: script component output columns and variable configuration]


The Script Source shreds the DataTable into the data flow.

    Public Overrides Sub CreateNewOutputRows()
        Dim dataTable As System.Data.DataTable
        Dim dataRow As System.Data.DataRow

        dataTable = CType(Me.Variables.WMIVar, Data.DataTable)

        For Each dataRow In dataTable.Rows
            Output0Buffer.AddRow()

            Output0Buffer.Name = dataRow.Item("NAME").ToString()
            Output0Buffer.StartupType = dataRow.Item("STARTMODE").ToString()
        Next

        Output0Buffer.SetEndOfRowset()
    End Sub


That’s all that was necessary to use the results from the WMI Data Reader task. It was pretty straightforward to configure, and it runs quickly as well.

Processing a Flat File with Header and Detail Rows

It’s been tough finding time to write any content recently, but this was a scenario that came up recently both at work and on the forums, so I decided to write it up. It’s another variation on processing flat files (they seem to be endless). This scenario deals with an input file that contains two types of records, headers and details. This processing for this is based off another post (Handling Flat Files with a Varying Number of Columns), so I’m only going to cover what is unique in this scenario.


The input file looks like this:


HL111Header1
DLDetail1
DLDetail2
HL222Header2
DLDetail3
DLDetail4

The rows that begin with HL are header lines, and the rows that begin with DL are detail lines. One additional requirement in this example is that the header code (the three digits following the HL in each header line) must be included in each detail row.


The data flow looks like this:


[Screenshot: HeaderDetail data flow]


The Flat File Source is bringing in each row as a single column, delimited by CR/LF. The script component has two outputs defined, both as synchronous, and with an exclusion group greater than zero. (For more info on this, please see this post from Jamie Thompson.) One output will include the Header row, and the other will include the Detail rows. Note that each output must have unique column names defined.


The script appears below. It is simply parsing the Line column (that contains the entire row of data), and putting the values into the appropriate columns on each output. Note that the headerCode variable is stored each time a header row is encountered, and then used to populate the code value on the detail output. 


Public Class ScriptMain
    Inherits UserComponent

    Private headerCode As Integer

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        If Row.Line.Substring(0, 2) = "HL" Then
            headerCode = CType(Row.Line.Substring(2, 3), Integer)
            Row.DirectRowToHeader()
            Row.HRRecordType = Row.Line.Substring(0, 2)
            Row.HRCode = CType(Row.Line.Substring(2, 3), Integer)
            Row.HRName = Row.Line.Substring(5)
        ElseIf Row.Line.Substring(0, 2) = "DL" Then
            Row.DirectRowToDetail()
            Row.RecordType = Row.Line.Substring(0, 2)
            Row.Code = headerCode
            Row.Name = Row.Line.Substring(2)
        End If
    End Sub

End Class


There it is. Fairly straightforward, and it was quick to put together. The sample files are attached below.

XML Transformations Part 2

Previously, I posted about using a script destination component to output an XML file. A post on the MSDN forums got me thinking about a slight tweak to the previous post that would result in a little more flexibility. The poster wanted to create an XML string containing his row values to save into a column of a database table. So instead of a script destination, what was needed here was a script transform to create a new column.

The data flow is straightforward: a source, a script transform, and an OLE DB Destination.

[Screenshot: data flow]

In the script component, each column from the input file is selected. I added a single output column (XML) to hold the results of the script.

[Screenshot: script component input columns]

The script used is below:

Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper
Imports System.Reflection

Public Class ScriptMain
    Inherits UserComponent

    Private Const rowElement As String = "Row"

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        Dim column As IDTSInputColumn90
        Dim rowType As Type = Row.GetType()
        Dim columnValue As PropertyInfo

        Dim newXML As String

        newXML = (FormatElement(rowElement))
        For Each column In Me.ComponentMetaData.InputCollection(0).InputColumnCollection
            columnValue = rowType.GetProperty(column.Name)
            newXML += ((FormatElement(column.Name) + columnValue.GetValue(Row, Nothing).ToString() + FormatElement(column.Name, True)))
        Next
        newXML += (FormatElement(rowElement, True))

        Row.XML = newXML
    End Sub

    Private Function FormatElement(ByVal elementName As String) As String
        Return FormatElement(elementName, False)
    End Function

    Private Function FormatElement(ByVal elementName As String, ByVal closingTag As Boolean) As String
        Dim returnValue As String
        If closingTag Then
            returnValue = "</"
        Else
            returnValue = "<"
        End If
        returnValue += elementName + ">"

        Return returnValue
    End Function

End Class

The script is very similar to the one from the previous article, though it is a bit simpler. The XML values are prepared by iterating through the Input Collection metadata, and using the Reflection classes to read the Row object dynamically. The resulting value is put into the XML column on the output.

The OLE DB Destination simply maps the XML column to an XML column on a database table.
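
For reference, the destination table just needs an XML-typed column to map to; a minimal (hypothetical) definition would be something like:

CREATE TABLE RowXmlOutput
(
    RowXmlID int IDENTITY(1,1) PRIMARY KEY,
    RowXml xml
)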

This is a good alternate approach to the previous one, particularly if you need to send the XML to a database instead of a flat file.

Multi-File Outputs Part 2

In my previous post, I discussed how to use a script destination to output multiple flat files from a single source. In it, I mentioned that I would post an alternate approach, that didn’t require any scripting. So, here it is.


This is using the same example data as the previous post, a source file that looks like:


NAME;VALUE;DATE
A;1;1/1/2000
A;2;1/2/2000
A;3;1/3/2000
A;4;1/4/2000
A;5;1/5/2000
B;1;1/1/2000
B;2;1/2/2000
B;3;1/3/2000
B;4;1/4/2000
B;5;1/5/2000
C;1;1/1/2000
C;2;1/2/2000
C;3;1/3/2000
C;4;1/4/2000
C;5;1/5/2000

The goal is to output one file containing all “A” rows, one with “B” rows, and one with “C” rows.


 


The control flow consists of a Data Flow, ForEach container, and another Data Flow.


[Screenshot: MultiFile control flow]


The first data flow is used to get the number of “slices” in the original file, and the data to identify each slice. The results are stored in a recordset object that is used by the ForEach loop. In the recordset, I expect to see three records, with the values A, B, and C. To get that from the source, I can either use a conditional split, or an aggregate transform. Important: you do not need both; I am only including both approaches as examples. If you implement this as a real solution, please only use the approach that best suits your needs.


[Screenshot: Get Slice data flow]


The aggregate transform is set to group by and pass through the Name column from the source file.


[Screenshot: Aggregate transform configuration]


The conditional split works by looking for rows with a value of 1. Those are the only rows passed to the recordset destination.


[Screenshot: Conditional Split configuration]


The choice of which one to use really depends on how complex the rules are for determining how many files you need. If it is based on a single column, or a set of columns, the aggregate transform works well. For more complex logic, the conditional split is probably more appropriate.


Once the recordset is populated, it is stored in a variable. There are two defined, one to hold the results of the conditional split (splitRowList), and one to hold the results of the aggregate transform (aggregatedRowList). Again, you don’t need both. The aggregatedRowList is the one used for the rest of this example.


[Screenshot: Variables window]


The Foreach loop is set to use the aggregatedRowList variable.


[Screenshot: Foreach Loop collection configuration]


The first column of the recordset (the Name column) is mapped to the currentSlice variable. This is what is used in the data flow to determine whether the row should be included in the output. It is also used in an expression for the currentFile variable, which is also used to set the ConnectionString property of the Destination File connection manager:

"C:\\OutputDir\\" + "Output" + @[User::currentSlice] + ".txt"

This updates the destination filename for each iteration of the loop.


[Screenshot: Foreach Loop variable mappings]


In the data flow, the source and destination are fairly straightforward.


[Screenshot: data flow]


In the conditional split, the currentSlice variable is compared to the Name column to determine if this row should be included in the output. More complex logic could be included here fairly easily.
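
The condition on the split output is just an equality test between the column and the variable; assuming the column names from the sample file (and casting, if the flat file column is non-Unicode), something like:

(DT_WSTR,50)NAME == @[User::currentSlice]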


[Screenshot: Conditional Split configuration]


That’s pretty much all there is to it.


Why would you want to use this approach over using a script? For one, it would be easy to adapt to multiple table destinations, instead of flat files. Two, this approach does not care whether the rows are sorted or not. However, it is reliant on the conditional split, so the logic for determining which output a row belongs in needs to be something that can be implemented in a conditional split. And it does require multiple passes through the source rows. The script approach seems better suited to sequential, single-pass file processing.

Multi-File Output Destination Script Component

Yesterday I posted about using a destination script component to write XML files from a data flow. Today, I'm continuing in the same vein with a post about writing multiple destination files from a destination script component. The inspiration for this was a post on the MSDN forums where a developer described a scenario where he had a single file and needed to split it into multiple output files based on some conditional logic. Normally, this could be handled with a conditional split and multiple destinations, but the catch in his scenario was that he didn't know at design time how many destination files he would need.


Based on that, I worked up a quick script destination to accomplish his goal. Later, I’m also going to post an alternative approach that does not use any script.


[Screenshot: Multifile Destination data flow]


The data flow is fairly straightforward. The flat file source is retrieving a flat file with the following data:


NAME;VALUE;DATE
A;1;1/1/2000
A;2;1/2/2000
A;3;1/3/2000
A;4;1/4/2000
A;5;1/5/2000
B;1;1/1/2000
B;2;1/2/2000
B;3;1/3/2000
B;4;1/4/2000
B;5;1/5/2000
C;1;1/1/2000
C;2;1/2/2000
C;3;1/3/2000
C;4;1/4/2000
C;5;1/5/2000

The goal is to output three files, one each for the A, B, and C rows. I used a folder connection rather than a specific flat file connection. The folder connection specifies the output directory to place the files in.

[Screenshot: Destination Folder connection manager]

On the script component (created as a destination), all the available input columns were selected.

[Screenshot: script component input columns]

The Destination Folder is specified on the Connection Managers page.

[Screenshot: Connection Managers page]

The script is as follows:

Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper
Imports System.IO

Public Class ScriptMain
    Inherits UserComponent

    Private Const fileName As String = "Output"
    Private Const fileExt As String = ".txt"
    Private Const delimiter As String = ";"
    Private targetFolder As String
    Private fileWriter As StreamWriter
    Private fileCount As Int32 = 1
    Private previousValue As String = String.Empty

    Public Overrides Sub AcquireConnections(ByVal Transaction As Object)
        targetFolder = CType(Me.Connections.Destination.AcquireConnection(Nothing), String) + "\"
    End Sub

    Public Overrides Sub PostExecute()
        fileWriter.Close()
    End Sub

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        If Not Row.NAME = previousValue Then
            If Not fileWriter Is Nothing Then
                fileWriter.Close()
            End If
            fileWriter = New StreamWriter(targetFolder + fileName + fileCount.ToString() + fileExt, False)
            fileCount += 1
            previousValue = Row.NAME
        End If

        With fileWriter
            .Write(Row.NAME + delimiter)
            .Write(Row.VALUE.ToString() + delimiter)
            .WriteLine(Row.DATE.ToString() + delimiter)
        End With
    End Sub

End Class


The most interesting portion of this script is the ProcessInputRow method. The initial section of the method checks whether the current row's Name value is the same as the previous row's. If it is different, it closes the current file, opens a new one, and increments the counter used in the file name. Then the current row's values are written out to the file.

This script works well for outputting multiple files. In the script, I'm using a simple previous-value comparison to decide whether I need to start a new file, but the logic could be much more complex if necessary, as the sketch below shows.
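
As an example of more complex logic, here is a hypothetical variation that starts a new file every N rows instead of on a value change (the rowsInFile counter and maxRowsPerFile constant are additions to the script above):

    Private rowsInFile As Int32 = 0
    Private Const maxRowsPerFile As Int32 = 50000 ' hypothetical limit per output file

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        ' Start a new file when no file is open yet or the current one is full
        If fileWriter Is Nothing OrElse rowsInFile >= maxRowsPerFile Then
            If Not fileWriter Is Nothing Then
                fileWriter.Close()
            End If
            fileWriter = New StreamWriter(targetFolder + fileName + fileCount.ToString() + fileExt, False)
            fileCount += 1
            rowsInFile = 0
        End If

        With fileWriter
            .Write(Row.NAME + delimiter)
            .Write(Row.VALUE.ToString() + delimiter)
            .WriteLine(Row.DATE.ToString() + delimiter)
        End With
        rowsInFile += 1
    End Sub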

An advantage of this approach over the non-script approach is that it only requires one pass through the source file. Another advantage is that the comparison logic can be fairly complex and can easily handle triggering new file creation based on number of rows or a trigger row. The pure SSIS approach can’t handle that as easily. However, this approach does require that your rows are sorted properly before being fed into the destination, whereas the alternate approach can handle the rows in any order.