Strategies for Staging Data to SQL

Introduction

Staging data is an important part of the ETL (or ELT) process. The general concept is this: before you bring your data into your database, data store, data mart or data warehouse, you need to pre-load it. This is the process of extracting and then loading before moving it on, otherwise known as “staging”.

In the examples below I’m going to load in the contents of a csv file to a SQL table using BULK INSERT.

Truncate and Load

This is the simplest option and normally good if you have a data-set in the thousands. With this strategy you simply delete the copy of the data you have from your last load and replace it with a fresh copy.

TRUNCATE TABLE dbo.Products

BULK INSERT dbo.Products
FROM 'C:\Temp\Products_20200408_235959.csv'
WITH (
FIRSTROW = 2,
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n',
TABLOCK
);

Append

The append strategy works if your data doesn’t change. A good example of data like this might be event logs or till transactions. You’ll want to put some checks in place though to ensure you don’t insert the same data more than once.

-- Truncate and load file data to staging table.
TRUNCATE TABLE dbo.Staging_Trans;

BULK INSERT dbo.Staging_Trans
FROM 'C:\Temp\Transactions_20200408_235959.csv'
WITH (
    FIRSTROW = 2,
    FIELDTERMINATOR = ',',  
    ROWTERMINATOR = '\n',   
    TABLOCK
);

-- Append staged data to loading table.
INSERT INTO dbo.Transactions (tranId, tranDate, productId, tillId, amount, discount)
SELECT tranId, tranDate, productId, tillId, amount, discount
FROM dbo.Staging_Trans AS S -- Source
WHERE NOT EXISTS
	(-- don't insert duplicate rows.
	 SELECT * FROM dbo.Transactions AS T -- Target
	 WHERE S.tranId = T.tranId
	)

Merge Deltas

This process is needed when your data-set changes often. For example, let’s say we have a company that receives orders and then sends out the goods. We may therefore have an Order Date, Shipped Date, maybe even the odd change of address.

With this strategy you want to merge your data set on an identifiable subset.

Let us assume that generally none of my orders take longer than a month to fulfil, and I’ve tested this by reviewing the history of my data.

This being the case I can extract a few months of data at a time (a delta) and merge the results into my target table.

So I would have a primary key and a delta key. The delta key is important because it helps me work out what rows to delete without deleting historical data.

-- ======================
-- Truncate and load file data to staging table.
TRUNCATE TABLE dbo.Staging_Stock;

BULK INSERT dbo.Staging_Stock
FROM 'C:\Temp\Stock_20200408_235959.csv'
WITH (
	FIRSTROW = 2,
    FIELDTERMINATOR = ',',  
    ROWTERMINATOR = '\n',   
    TABLOCK
);

-- ======================
-- Get the delta key value.
DECLARE @YearMonth int 

-- Get the last month in the file.
SELECT TOP(1) 
	@YearMonth = YEAR(inDate) * 100 + MONTH(inDate) 
FROM dbo.Staging_Stock
ORDER BY inDate DESC

-- ======================
-- Merge differences from starting to target
MERGE dbo.Stock AS T -- Target
USING dbo.Staging_Stock AS S -- Staging
	ON T.stockItemId = S.stockItemId
WHEN MATCHED AND
	EXISTS
	( -- Check if column values are different.
	SELECT T.stockItemId, T.inDate, T.outDate, T.productId, T.quantity, T.listPrice			
	EXCEPT
	SELECT S.stockItemId, S.inDate, S.outDate, S.productId, S.quantity, S.listPrice
	)
THEN UPDATE SET -- Update rows that have changed.
	 T.stockItemId = S.stockItemId
	,T.inDate	   = S.inDate
	,T.outDate	   = S.outDate
	,T.productId   = S.productId
	,T.quantity	   = S.quantity
	,T.listPrice   = S.listPrice
	,T.updateDate  = GETDATE() -- update stamp
WHEN NOT MATCHED BY TARGET
	THEN INSERT -- Insert new rows.
	(
	 stockItemId
	,inDate
	,outDate
	,productId
	,quantity
	,listPrice
	,insertDate
	,yearMonthKey
	)
	VALUES
	(
	 S.stockItemId
	,S.inDate
	,S.outDate
	,S.productId
	,S.quantity
	,S.listPrice
	,GETDATE() -- insert stamp
	,YEAR(inDate) * 100 + MONTH(inDate) -- Insert delta key.
	)
	WHEN NOT MATCHED BY SOURCE
		AND yearMonthKey >= @YearMonth -- Only delete rows relative to the delta.
	THEN DELETE;

Conclusion

Now we’ve covered three main strategies.

  • Truncate and load for small and simple data sets.
  • Append for datasets that never change.
  • Merge deltas for large and changing data sets.