Anthony Martin's Blog

SQL Server and Microsoft BI

Piling on the Work Pile

with one comment

My Pragmatic Work’s colleague Chris Schmidt recently recommended using the SSIS work pile design pattern to help me out with a challenging customer implementation.  Chris did an excellent job outlining the design pattern in his blog post here: http://bit.ly/1iHzou5.  I’ve had such amazing success with this solution that I wanted to touch on a few of the finer points of the implementation.  Specifically, in this blog post I’ll discuss how to automate the population of the work pile.

image

Big Papi on Top of the Work Pile!

At the center of the work pile design pattern is a task queue that all of the SSIS package threads read from to determine what they should be working on.  Once a SSIS package thread completes its work it picks another task from the task queue and does more work.  This process continues until there is no more work to done.  You can achieve massive scaling and parallelism by simply increasing the amount of SSIS package threads that run at the same time.

For simplicity and brevity, Chris’s blog post had a known set of files to populate into the work pile table.  However, what do you do when the set of files isn’t known AND new files are always being pushed to different file system directories?  I’ll show you how I handled this scenario in a recent implementation…

Getting Started

The first step is to create a table that will store the metadata of the file system directories that we are interested in searching for files to add to the work pile.  In my scenario, I need to search several directories and each of the directories represent files that are pushed from different source systems.  Also, each of the source systems name there files in different ways.

In the two T-SQL scripts below I create a table named dbo.TaskQueueDirectories to store the directory metadata and insert four sample directories, for four different source systems, that all have different file naming specifications.

   1: CREATE TABLE [dbo].[TaskQueueDirectories]

   2: (

   3:     [DirectoryID] [int] IDENTITY(1,1) NOT NULL,

   4:     [Directory] [varchar](200) NOT NULL,

   5:     [SourceSystem] [varchar](50) NOT NULL,

   6:     [FileSpec] [varchar](50) NOT NULL,

   7:     [Priority] [int] NULL,

   8:  CONSTRAINT [PK_TaskQueueDirectories] PRIMARY KEY CLUSTERED 

   9:     (

  10:         [DirectoryID] ASC

  11:     )

  12: )

 

   1: INSERT INTO dbo.TaskQueueDirectories

   2:     (

   3:         Directory

   4:         , SourceSystem

   5:         , FileSpec

   6:         , [Priority]

   7:     )

   8: VALUES

   9:     ('\\MyServer1\ETL\SourceSystem1', 'SourceSystem1', 'abc*.csv', 1)

  10:     , ('\\MyServer2\ETL\SourceSystem2', 'SourceSystem2', 'def*.csv', 2)

  11:     , ('\\MyServer3\ETL\SourceSystem3', 'SourceSystem3', 'ghi*.csv', 3)

  12:     , ('\\MyServer4\ETL\SourceSystem4', 'SourceSystem4', 'jkl*.csv', 4)

Now that the metadata is in place, the next step is to create an SSIS package that will loop through the directories, loop through the files in each directory that match the source system’s file specification, and compare that to our work pile table to see if the file already exists there.  If it doesn’t exist in the work pile we’ll insert it, if it does exist we’ll skip it.  Simple as that!

Finished Product

Let’s start at the final solution and work our way through the process.  Below is a screenshot of the finished package that piles on the work pile!

image

SSIS Package Step 1- Find the Directories

The first step is to read in our dbo.TaskQueueDirectories and store it in an SSIS object type variable.  I created a stored procedure named dbo.TaskQueueDirectories_Get which is executed in an Execute SQL Task.  The results are stored in a variable named TaskQueueDirectories that has a data type of Object.  I’m going to assume that you know how to configure the Execute SQL Task to based on this description.  I will however give you DDL for the stored procedure:

   1: CREATE PROCEDURE [dbo].[TaskQueueDirectories_Get]

   2: AS

   3:  

   4: SELECT

   5:     SourceSystem

   6:     , Directory

   7:     , FileSpec

   8: FROM

   9:     TaskQueueDirectories

  10: ORDER BY

  11:     Priority ASC

  12: GO

SSIS Package Step 2- The Outer Loop

Next we simply use a For Each Loop task to loop through rows in the TaskQueueDirectories variable.  The columns within each row are mapped to three new SSIS variables, SourceFileDirectory, FileSpec, and SourceSystem.

imageimage

 

 

 

 

 

 

 

 

 

SSIS Package Step 3- The Inner Loop

Where are we now?  We have the logic to loop through the directories, now we need the logic to loop through the files within the directory.  To do this we’ll need one more For Each Loop task that utilizes the new variables we created in the previous step.  First, make sure the enumerator is set to ForEach File Enumerator.  Then create two new expressions on the Collection page, one for the Directory property which is set to the value in the SourceFileDirectory variable and another for the FileSpec property which is set to the value of the FileSpec variable.  Finally, switch to the Variable Mappings page to store the name of the file in another new SSIS variable named SourceFileFullyQualified

image

image

SSIS Package Step 4- Wrapping Up!

The final step!  Here we need to determine if the file name that was found in the inner loop already exists in our work pile and insert it into the work pile if it doesn’t exist.  Below is the DDL for my work pile table and a stored procedure to insert a row into the table if the file doesn’t already exist.

   1: CREATE TABLE [dbo].[TaskQueue](

   2:     [TaskID] [int] IDENTITY(1,1) NOT NULL,

   3:     [SourceSystem] [varchar](50) NOT NULL,

   4:     [FilePath] [varchar](255) NOT NULL,

   5:     [Status] [tinyint] NOT NULL,

   6:     [Priority] [int] NULL,

   7:     [DateStarted] [datetime] NULL,

   8:     [DateCompleted] [datetime] NULL,

   9:  CONSTRAINT [PK_TaskQueue] PRIMARY KEY CLUSTERED 

  10:     (

  11:         [TaskID] ASC

  12:     )

  13: )

 

   1: CREATE PROCEDURE [dbo].[TaskQueue_Insert]

   2: (

   3:     @SourceSystem VARCHAR(50)

   4:     , @SourceFileFullyQualified VARCHAR(200)    

   5: )

   6: AS

   7:  

   8: IF NOT EXISTS    

   9:     (

  10:     SELECT

  11:         1

  12:     FROM

  13:         dbo.TaskQueue

  14:     WHERE

  15:         FilePath = @SourceFileFullyQualified

  16:     )

  17: BEGIN

  18:     INSERT INTO dbo.TaskQueue 

  19:         (

  20:             SourceSystem

  21:             , FilePath

  22:             , Status

  23:         ) 

  24:     VALUES

  25:         (

  26:             @SourceSystem

  27:             , @SourceFileFullyQualified

  28:             , 0

  29:         )

  30: END

Finally, I’ll show you how the last Execute SQL Task is configured to execute the stored procedure and pass in the appropriate parameters:

image

image

What we’ve done here is pass the SSIS variable SourceFileFullyQualified, this is used to determine if the file already exists in the work pile.  The second parameter is the variable SourceSystem, this is used to populate the dbo.TaskQueue.SourceSystem column if the file didn’t exist and a new row needs to be inserted.

Finishing Up

The last step is to schedule this bad boy to run on a given schedule, be it hourly, daily, or whatever meets your requirements.  Enjoy!

Written by Anthony Martin

November 5, 2013 at 2:42 pm

Posted in Uncategorized

One Response

Subscribe to comments with RSS.

  1. […] So how did we get around this problem? Well we tried two different design patterns. The first and the one I will be blogging about here is the “MultiFlatfile” connection manager. The second, more complicated to set up, but extremely efficient can be found here: Anthony Martin.  […]


Leave a comment