Duplicate records after ETL merge in inventory warehouse causing incorrect stock levels

We’re experiencing a critical issue with our inventory reporting system. After running ETL merge operations in our data warehouse, we’re seeing duplicate records for the same SKUs, which is causing our SSRS inventory reports to show inflated stock levels.

The ETL process uses a MERGE statement to upsert inventory data from multiple distribution centers. The merge logic is supposed to update existing records and insert new ones based on SKU and location. However, we’re ending up with 2-3 duplicate rows for some products.


MERGE INTO InventoryFact AS target
USING StagingInventory AS source
ON target.SKU = source.SKU AND target.LocationID = source.LocationID
WHEN MATCHED THEN UPDATE SET Quantity = source.Quantity
WHEN NOT MATCHED THEN INSERT (SKU, LocationID, Quantity) VALUES (source.SKU, source.LocationID, source.Quantity);

Our inventory data validation checks are failing, and warehouse managers are making decisions based on incorrect stock numbers. Has anyone dealt with deduplication in SQL after ETL merges?

ROW_NUMBER() works but can be slow on very large tables. One option is a CTE with ROW_NUMBER() partitioned by SKU and LocationID, ordered by timestamp descending, keeping only the most recent record. Alternatively, build a cleaned staging table first, using GROUP BY with MAX(timestamp) to identify the latest record for each SKU/Location combination; in my experience this is often faster than window functions on big inventory tables.
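A minimal sketch of the GROUP BY approach, assuming StagingInventory carries an UpdateTimestamp column (adjust names to your schema):

```sql
-- Build a cleaned staging table containing only the latest row per
-- SKU/Location. Assumes UpdateTimestamp is unique within each group;
-- ties would need an extra tiebreaker column.
SELECT s.SKU, s.LocationID, s.Quantity, s.UpdateTimestamp
INTO CleanStagingInventory
FROM StagingInventory AS s
JOIN (
  SELECT SKU, LocationID, MAX(UpdateTimestamp) AS MaxTs
  FROM StagingInventory
  GROUP BY SKU, LocationID
) AS latest
  ON  latest.SKU = s.SKU
  AND latest.LocationID = s.LocationID
  AND latest.MaxTs = s.UpdateTimestamp;
```

You would then point the MERGE at CleanStagingInventory instead of the raw staging table.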

Your MERGE statement looks correct at first glance, but I suspect the issue is in your source data. If StagingInventory has duplicates before the merge, you’ll get multiple inserts. Run a quick check to see if your staging table has duplicate SKU/LocationID combinations before the merge executes.
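Something along these lines, using the columns from your MERGE:

```sql
-- Any rows returned here mean the MERGE will attempt to insert
-- the same SKU/Location combination more than once.
SELECT SKU, LocationID, COUNT(*) AS RecordCount
FROM StagingInventory
GROUP BY SKU, LocationID
HAVING COUNT(*) > 1;
```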

Here’s a comprehensive solution addressing all three aspects of your duplicate record problem:

ETL Merge Logic Correction: Your MERGE statement needs defensive programming to handle source duplicates. First, add a unique constraint on your target table to prevent duplicates at the database level:

ALTER TABLE InventoryFact
ADD CONSTRAINT UK_Inventory UNIQUE (SKU, LocationID);

Then modify your merge to use a deduplicated source:

WITH CleanSource AS (
  SELECT SKU, LocationID, Quantity, UpdateTimestamp,
    ROW_NUMBER() OVER (PARTITION BY SKU, LocationID
                       ORDER BY UpdateTimestamp DESC) AS rn
  FROM StagingInventory
)
MERGE INTO InventoryFact AS target
USING (SELECT SKU, LocationID, Quantity
       FROM CleanSource WHERE rn = 1) AS source
ON target.SKU = source.SKU AND target.LocationID = source.LocationID
WHEN MATCHED THEN
  UPDATE SET Quantity = source.Quantity, LastUpdated = GETDATE()
WHEN NOT MATCHED THEN
  INSERT (SKU, LocationID, Quantity, LastUpdated)
  VALUES (source.SKU, source.LocationID, source.Quantity, GETDATE());

Deduplication with SQL: For existing duplicates in InventoryFact, implement a cleanup procedure that runs after your ETL:

-- Identify and log duplicates for investigation
INSERT INTO ETL_DuplicateLog (SKU, LocationID, DuplicateCount, DetectedDate)
SELECT SKU, LocationID, COUNT(*) as DupCount, GETDATE()
FROM InventoryFact
GROUP BY SKU, LocationID
HAVING COUNT(*) > 1;

-- Remove duplicates keeping most recent
WITH DuplicateRecords AS (
  SELECT InventoryID,
    ROW_NUMBER() OVER (PARTITION BY SKU, LocationID
                       ORDER BY LastUpdated DESC, InventoryID DESC) AS rn
  FROM InventoryFact
)
DELETE FROM InventoryFact
WHERE InventoryID IN (
  SELECT InventoryID FROM DuplicateRecords WHERE rn > 1
);

Inventory Data Validation: Implement a three-tier validation framework:

  1. Pre-Merge Validation - Check staging data quality before processing:
-- Detect source duplicates
SELECT SKU, LocationID, COUNT(*) as RecordCount
FROM StagingInventory
GROUP BY SKU, LocationID
HAVING COUNT(*) > 1;

-- Validate quantity ranges
SELECT * FROM StagingInventory
WHERE Quantity < 0 OR Quantity > 999999;
  2. Post-Merge Validation - Verify data integrity after ETL:
-- Check for duplicates in target
SELECT 'Duplicates Found' as ValidationStatus, COUNT(*) as IssueCount
FROM (SELECT SKU, LocationID FROM InventoryFact
      GROUP BY SKU, LocationID HAVING COUNT(*) > 1) d

UNION ALL

-- Verify record counts match expected
SELECT 'Record Count Mismatch' as ValidationStatus,
       ABS((SELECT COUNT(*)
            FROM (SELECT DISTINCT SKU, LocationID FROM StagingInventory) s) -
           (SELECT COUNT(*) FROM InventoryFact)) as IssueCount;
  3. SSRS Report Validation - Add data quality indicators to your inventory reports:

Create a validation dataset in SSRS that checks for anomalies:

SELECT
  CASE WHEN dq.DuplicateCount > 0 THEN 'WARNING: Duplicates detected'
       WHEN dq.NegativeQty > 0 THEN 'ERROR: Negative quantities found'
       ELSE 'OK' END as DataQualityStatus,
  dq.DuplicateCount,
  dq.NegativeQty
FROM (
  SELECT
    (SELECT COUNT(*)
     FROM (SELECT SKU, LocationID FROM InventoryFact
           GROUP BY SKU, LocationID HAVING COUNT(*) > 1) d) as DuplicateCount,
    (SELECT COUNT(*) FROM InventoryFact WHERE Quantity < 0) as NegativeQty
) dq;

Display this as a warning banner at the top of your inventory reports so warehouse managers know when data quality issues exist.

Root Cause Prevention: Implement these safeguards in your ETL process:

  • Add transaction logging to track which source system contributed each record
  • Implement a reconciliation process that compares source system totals against warehouse totals
  • Create alerts when duplicate counts exceed threshold (e.g., more than 5 duplicates per ETL run)
  • Schedule the deduplication cleanup procedure to run immediately after ETL completes but before reports are generated
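The alerting check might look something like this; ETL_DuplicateLog and the threshold of 5 come from above, while the error severity and date filter are assumptions to adapt to your job framework:

```sql
-- Fail the ETL job (or trigger an alert) when too many duplicate
-- SKU/Location groups were logged for today's run.
DECLARE @DupGroups INT;

SELECT @DupGroups = COUNT(*)
FROM ETL_DuplicateLog
WHERE DetectedDate >= CAST(GETDATE() AS date);

IF @DupGroups > 5
    RAISERROR('Duplicate threshold exceeded: %d groups', 16, 1, @DupGroups);
```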

This multi-layered approach prevents duplicates at the source, catches them during ETL, cleans up any that slip through, and alerts users when data quality issues exist.

We had this exact problem last year. The issue was that our ETL was running concurrently from different distribution centers, and the MERGE wasn’t properly handling race conditions. Two processes would both see ‘NOT MATCHED’ and insert duplicate rows. We fixed it by adding a unique constraint on SKU+LocationID and wrapping the MERGE in a transaction with proper isolation level. Also check if you have triggers on the table that might be causing unexpected inserts.
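For reference, the concurrency fix described above is typically written with a HOLDLOCK (serializable) hint on the MERGE target, which holds key-range locks so the match check and the insert happen atomically; a sketch against the OP's tables:

```sql
BEGIN TRANSACTION;

-- HOLDLOCK keeps range locks until commit, so two concurrent merges
-- can't both see 'NOT MATCHED' for the same SKU/Location and insert twice.
MERGE InventoryFact WITH (HOLDLOCK) AS target
USING StagingInventory AS source
ON target.SKU = source.SKU AND target.LocationID = source.LocationID
WHEN MATCHED THEN
  UPDATE SET Quantity = source.Quantity
WHEN NOT MATCHED THEN
  INSERT (SKU, LocationID, Quantity)
  VALUES (source.SKU, source.LocationID, source.Quantity);

COMMIT TRANSACTION;
```

The unique constraint is still worth having as a backstop; with HOLDLOCK it should simply never fire.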

Before deduplicating, you need to understand why duplicates exist. Are they legitimate updates from the same location within the ETL window, or are they data quality issues from source systems? Sometimes duplicates indicate that different systems are tracking the same inventory (like a WMS and ERP both reporting the same SKU). You might need business logic to determine which source is authoritative rather than just picking the latest timestamp.
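If you do decide one system is authoritative, that business rule can be folded into the dedup ordering instead of relying on timestamps alone. A hypothetical sketch, assuming a SourceSystem column exists in staging:

```sql
-- Prefer WMS rows over any other source; fall back to the newest
-- timestamp. SourceSystem and UpdateTimestamp are assumed column names.
WITH Ranked AS (
  SELECT SKU, LocationID, Quantity,
    ROW_NUMBER() OVER (
      PARTITION BY SKU, LocationID
      ORDER BY CASE SourceSystem WHEN 'WMS' THEN 0 ELSE 1 END,
               UpdateTimestamp DESC) AS rn
  FROM StagingInventory
)
SELECT SKU, LocationID, Quantity
FROM Ranked
WHERE rn = 1;
```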