Here’s a comprehensive solution addressing all three aspects of your duplicate record problem:
ETL Merge Logic Correction:
Your MERGE statement needs defensive handling of duplicates in the source feed. First, add a unique constraint on the target table so the database itself rejects duplicate keys:
ALTER TABLE InventoryFact
ADD CONSTRAINT UK_Inventory UNIQUE (SKU, LocationID);
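Note that this ALTER will fail if InventoryFact already holds duplicate (SKU, LocationID) pairs, so run the cleanup step below first. A guarded sketch (same table and constraint names as above) could look like:

```sql
-- Sketch: only add the constraint once the table is clean; otherwise fail loudly.
IF NOT EXISTS (
    SELECT 1 FROM InventoryFact
    GROUP BY SKU, LocationID
    HAVING COUNT(*) > 1
)
    ALTER TABLE InventoryFact
        ADD CONSTRAINT UK_Inventory UNIQUE (SKU, LocationID);
ELSE
    RAISERROR('Duplicates exist - run the cleanup before adding UK_Inventory.', 16, 1);
```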
Then modify your merge to use a deduplicated source:
WITH CleanSource AS (
    SELECT SKU, LocationID, Quantity, UpdateTimestamp,
           ROW_NUMBER() OVER (PARTITION BY SKU, LocationID
                              ORDER BY UpdateTimestamp DESC) AS rn
    FROM StagingInventory
)
MERGE INTO InventoryFact AS target
USING (SELECT SKU, LocationID, Quantity
       FROM CleanSource
       WHERE rn = 1) AS source
ON target.SKU = source.SKU AND target.LocationID = source.LocationID
WHEN MATCHED THEN
    UPDATE SET Quantity = source.Quantity, LastUpdated = GETDATE()
WHEN NOT MATCHED THEN
    INSERT (SKU, LocationID, Quantity, LastUpdated)
    VALUES (source.SKU, source.LocationID, source.Quantity, GETDATE());
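One caveat worth flagging: if two ETL sessions can run concurrently, MERGE alone does not stop both from taking the NOT MATCHED branch for the same key. A common T-SQL mitigation is a HOLDLOCK (serializable) hint on the target; a sketch against the same tables:

```sql
-- Sketch only: the same deduplicated merge, serialized with HOLDLOCK so
-- concurrent ETL sessions cannot both insert the same (SKU, LocationID).
WITH CleanSource AS (
    SELECT SKU, LocationID, Quantity,
           ROW_NUMBER() OVER (PARTITION BY SKU, LocationID
                              ORDER BY UpdateTimestamp DESC) AS rn
    FROM StagingInventory
)
MERGE INTO InventoryFact WITH (HOLDLOCK) AS target
USING (SELECT SKU, LocationID, Quantity
       FROM CleanSource WHERE rn = 1) AS source
ON target.SKU = source.SKU AND target.LocationID = source.LocationID
WHEN MATCHED THEN
    UPDATE SET Quantity = source.Quantity, LastUpdated = GETDATE()
WHEN NOT MATCHED THEN
    INSERT (SKU, LocationID, Quantity, LastUpdated)
    VALUES (source.SKU, source.LocationID, source.Quantity, GETDATE());
```

With the unique constraint in place a racing second session would fail rather than duplicate; the hint avoids the error entirely.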
Deduplication with SQL:
For existing duplicates in InventoryFact, implement a cleanup procedure that runs after your ETL:
-- Identify and log duplicates for investigation
INSERT INTO ETL_DuplicateLog (SKU, LocationID, DuplicateCount, DetectedDate)
SELECT SKU, LocationID, COUNT(*) as DupCount, GETDATE()
FROM InventoryFact
GROUP BY SKU, LocationID
HAVING COUNT(*) > 1;
-- Remove duplicates, keeping the most recent row per key
WITH DuplicateRecords AS (
    SELECT InventoryID,
           ROW_NUMBER() OVER (PARTITION BY SKU, LocationID
                              ORDER BY LastUpdated DESC, InventoryID DESC) AS rn
    FROM InventoryFact
)
DELETE FROM DuplicateRecords
WHERE rn > 1;
Inventory Data Validation:
Implement a three-tier validation framework:
- Pre-Merge Validation - Check staging data quality before processing:
-- Detect source duplicates
SELECT SKU, LocationID, COUNT(*) as RecordCount
FROM StagingInventory
GROUP BY SKU, LocationID
HAVING COUNT(*) > 1;
-- Validate quantity ranges
SELECT * FROM StagingInventory
WHERE Quantity < 0 OR Quantity > 999999;
- Post-Merge Validation - Verify data integrity after ETL:
-- Check for duplicates in target
SELECT 'Duplicates Found' as ValidationStatus, COUNT(*) as IssueCount
FROM (SELECT SKU, LocationID FROM InventoryFact
GROUP BY SKU, LocationID HAVING COUNT(*) > 1) d
UNION ALL
-- Verify record counts match expected
SELECT 'Record Count Mismatch' as ValidationStatus,
       ABS((SELECT COUNT(*)
            FROM (SELECT DISTINCT SKU, LocationID FROM StagingInventory) s) -
           (SELECT COUNT(*) FROM InventoryFact)) as IssueCount;
- SSRS Report Validation - Add data quality indicators to your inventory reports:
Create a validation dataset in SSRS that checks for anomalies:
WITH Checks AS (
    SELECT
        (SELECT COUNT(*)
         FROM (SELECT SKU, LocationID FROM InventoryFact
               GROUP BY SKU, LocationID HAVING COUNT(*) > 1) d) AS DuplicateCount,
        (SELECT COUNT(*) FROM InventoryFact WHERE Quantity < 0) AS NegativeQty
)
SELECT CASE WHEN DuplicateCount > 0 THEN 'WARNING: Duplicates detected'
            WHEN NegativeQty > 0 THEN 'ERROR: Negative quantities found'
            ELSE 'OK' END AS DataQualityStatus,
       DuplicateCount,
       NegativeQty
FROM Checks;
Display this as a warning banner at the top of your inventory reports so warehouse managers know when data quality issues exist.
Root Cause Prevention:
Implement these safeguards in your ETL process:
- Add transaction logging to track which source system contributed each record
- Implement a reconciliation process that compares source system totals against warehouse totals
- Create alerts when duplicate counts exceed a threshold (e.g., more than 5 duplicates per ETL run)
- Schedule the deduplication cleanup procedure to run immediately after ETL completes but before reports are generated
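The reconciliation and alerting bullets above could be sketched as follows (the @DuplicateThreshold value and the RAISERROR-based alert are illustrative placeholders for whatever alerting mechanism your shop already uses):

```sql
-- Sketch: post-ETL reconciliation and duplicate-count alerting.
DECLARE @DuplicateThreshold INT = 5;   -- example threshold from above
DECLARE @DupCount INT;

-- Count keys that still appear more than once after the ETL run
SELECT @DupCount = COUNT(*)
FROM (SELECT SKU, LocationID FROM InventoryFact
      GROUP BY SKU, LocationID HAVING COUNT(*) > 1) d;

IF @DupCount > @DuplicateThreshold
    RAISERROR('%d duplicate keys found after ETL run.', 16, 1, @DupCount);

-- Reconciliation: compare staged source totals against the warehouse
SELECT s.TotalQty AS SourceTotal,
       f.TotalQty AS WarehouseTotal,
       s.TotalQty - f.TotalQty AS Variance
FROM (SELECT SUM(Quantity) AS TotalQty FROM StagingInventory) s
CROSS JOIN (SELECT SUM(Quantity) AS TotalQty FROM InventoryFact) f;
```

Note the staging total will double-count any source duplicates, so if the feed is dirty you may prefer to sum the deduplicated staging rows instead.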
This multi-layered approach prevents duplicates at the source, catches them during ETL, cleans up any that slip through, and alerts users when data quality issues exist.