Piling on the Work Pile
My Pragmatic Work’s colleague Chris Schmidt recently recommended using the SSIS work pile design pattern to help me out with a challenging customer implementation. Chris did an excellent job outlining the design pattern in his blog post here: http://bit.ly/1iHzou5. I’ve had such amazing success with this solution that I wanted to touch on a few of the finer points of the implementation. Specifically, in this blog post I’ll discuss how to automate the population of the work pile.
Big Papi on Top of the Work Pile!
At the center of the work pile design pattern is a task queue that all of the SSIS package threads read from to determine what they should be working on. Once a SSIS package thread completes its work it picks another task from the task queue and does more work. This process continues until there is no more work to done. You can achieve massive scaling and parallelism by simply increasing the amount of SSIS package threads that run at the same time.
For simplicity and brevity, Chris’s blog post had a known set of files to populate into the work pile table. However, what do you do when the set of files isn’t known AND new files are always being pushed to different file system directories? I’ll show you how I handled this scenario in a recent implementation…
Getting Started
The first step is to create a table that will store the metadata of the file system directories that we are interested in searching for files to add to the work pile. In my scenario, I need to search several directories and each of the directories represent files that are pushed from different source systems. Also, each of the source systems name there files in different ways.
In the two T-SQL scripts below I create a table named dbo.TaskQueueDirectories to store the directory metadata and insert four sample directories, for four different source systems, that all have different file naming specifications.
1: CREATE TABLE [dbo].[TaskQueueDirectories]
2: (
3: [DirectoryID] [int] IDENTITY(1,1) NOT NULL,
4: [Directory] [varchar](200) NOT NULL,
5: [SourceSystem] [varchar](50) NOT NULL,
6: [FileSpec] [varchar](50) NOT NULL,
7: [Priority] [int] NULL,
8: CONSTRAINT [PK_TaskQueueDirectories] PRIMARY KEY CLUSTERED
9: (
10: [DirectoryID] ASC
11: )
12: )
1: INSERT INTO dbo.TaskQueueDirectories
2: (
3: Directory
4: , SourceSystem
5: , FileSpec
6: , [Priority]
7: )
8: VALUES
9: ('\\MyServer1\ETL\SourceSystem1', 'SourceSystem1', 'abc*.csv', 1)
10: , ('\\MyServer2\ETL\SourceSystem2', 'SourceSystem2', 'def*.csv', 2)
11: , ('\\MyServer3\ETL\SourceSystem3', 'SourceSystem3', 'ghi*.csv', 3)
12: , ('\\MyServer4\ETL\SourceSystem4', 'SourceSystem4', 'jkl*.csv', 4)
Now that the metadata is in place, the next step is to create an SSIS package that will loop through the directories, loop through the files in each directory that match the source system’s file specification, and compare that to our work pile table to see if the file already exists there. If it doesn’t exist in the work pile we’ll insert it, if it does exist we’ll skip it. Simple as that!
Finished Product
Let’s start at the final solution and work our way through the process. Below is a screenshot of the finished package that piles on the work pile!
SSIS Package Step 1- Find the Directories
The first step is to read in our dbo.TaskQueueDirectories and store it in an SSIS object type variable. I created a stored procedure named dbo.TaskQueueDirectories_Get which is executed in an Execute SQL Task. The results are stored in a variable named TaskQueueDirectories that has a data type of Object. I’m going to assume that you know how to configure the Execute SQL Task to based on this description. I will however give you DDL for the stored procedure:
1: CREATE PROCEDURE [dbo].[TaskQueueDirectories_Get]
2: AS
3:
4: SELECT
5: SourceSystem
6: , Directory
7: , FileSpec
8: FROM
9: TaskQueueDirectories
10: ORDER BY
11: Priority ASC
12: GO
SSIS Package Step 2- The Outer Loop
Next we simply use a For Each Loop task to loop through rows in the TaskQueueDirectories variable. The columns within each row are mapped to three new SSIS variables, SourceFileDirectory, FileSpec, and SourceSystem.
SSIS Package Step 3- The Inner Loop
Where are we now? We have the logic to loop through the directories, now we need the logic to loop through the files within the directory. To do this we’ll need one more For Each Loop task that utilizes the new variables we created in the previous step. First, make sure the enumerator is set to ForEach File Enumerator. Then create two new expressions on the Collection page, one for the Directory property which is set to the value in the SourceFileDirectory variable and another for the FileSpec property which is set to the value of the FileSpec variable. Finally, switch to the Variable Mappings page to store the name of the file in another new SSIS variable named SourceFileFullyQualified.
SSIS Package Step 4- Wrapping Up!
The final step! Here we need to determine if the file name that was found in the inner loop already exists in our work pile and insert it into the work pile if it doesn’t exist. Below is the DDL for my work pile table and a stored procedure to insert a row into the table if the file doesn’t already exist.
1: CREATE TABLE [dbo].[TaskQueue](
2: [TaskID] [int] IDENTITY(1,1) NOT NULL,
3: [SourceSystem] [varchar](50) NOT NULL,
4: [FilePath] [varchar](255) NOT NULL,
5: [Status] [tinyint] NOT NULL,
6: [Priority] [int] NULL,
7: [DateStarted] [datetime] NULL,
8: [DateCompleted] [datetime] NULL,
9: CONSTRAINT [PK_TaskQueue] PRIMARY KEY CLUSTERED
10: (
11: [TaskID] ASC
12: )
13: )
1: CREATE PROCEDURE [dbo].[TaskQueue_Insert]
2: (
3: @SourceSystem VARCHAR(50)
4: , @SourceFileFullyQualified VARCHAR(200)
5: )
6: AS
7:
8: IF NOT EXISTS
9: (
10: SELECT
11: 1
12: FROM
13: dbo.TaskQueue
14: WHERE
15: FilePath = @SourceFileFullyQualified
16: )
17: BEGIN
18: INSERT INTO dbo.TaskQueue
19: (
20: SourceSystem
21: , FilePath
22: , Status
23: )
24: VALUES
25: (
26: @SourceSystem
27: , @SourceFileFullyQualified
28: , 0
29: )
30: END
Finally, I’ll show you how the last Execute SQL Task is configured to execute the stored procedure and pass in the appropriate parameters:
What we’ve done here is pass the SSIS variable SourceFileFullyQualified, this is used to determine if the file already exists in the work pile. The second parameter is the variable SourceSystem, this is used to populate the dbo.TaskQueue.SourceSystem column if the file didn’t exist and a new row needs to be inserted.
Finishing Up
The last step is to schedule this bad boy to run on a given schedule, be it hourly, daily, or whatever meets your requirements. Enjoy!
[…] So how did we get around this problem? Well we tried two different design patterns. The first and the one I will be blogging about here is the “MultiFlatfile” connection manager. The second, more complicated to set up, but extremely efficient can be found here: Anthony Martin. […]
Load thousands of files in SSIS lighting fast! MultiFlatFile Connection Manager. | MitchellSQL
June 26, 2014 at 9:02 pm