Here’s the complete solution for handling duplicate records in merged data sources:
Merge Logic Configuration:
In Crystal Reports Data Preparation, configure the merge operation with proper deduplication:
- Define a unique composite key:
CREATE VIEW merged_sales_clean AS
SELECT
    customer_id,
    transaction_date,
    transaction_id,  -- or order_id
    amount,
    product_id,
    region
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY customer_id, transaction_date, transaction_id
               ORDER BY file_load_timestamp DESC
           ) AS rn
    FROM (
        SELECT *, 'North' AS region FROM north_sales
        UNION ALL
        SELECT *, 'South' AS region FROM south_sales
        UNION ALL
        SELECT *, 'West' AS region FROM west_sales
    ) combined
) ranked
WHERE rn = 1;  -- keep only the most recently loaded row per key
This keeps the most recently loaded version when duplicates exist.
Unique Key Constraints:
Establish what makes a record unique:
- Scenario 1: True Duplicates (same transaction reported by multiple regions)
- Unique Key: customer_id + transaction_date + transaction_id
- Action: Keep one record, discard duplicates
- Use ROW_NUMBER() with PARTITION BY unique key
- Scenario 2: Different Transactions (same customer, multiple legitimate transactions)
- Unique Key: transaction_id (should be globally unique)
- Action: Keep all records
- Fix source data to ensure unique transaction_ids
- Scenario 3: Overlapping Customers (customer exists in multiple regions)
- Not duplicates if transactions are different
- Preserve all records
- Add region as part of reporting dimension
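To decide which scenario applies before choosing an action, a diagnostic query along these lines can help. The heuristic (identical amounts across copies suggest a true duplicate) is an assumption, not a guarantee; borderline keys still need manual review:

```sql
-- Classify duplicated keys: one distinct amount across copies suggests the
-- same transaction reported twice (Scenario 1); differing amounts suggest
-- genuinely different transactions sharing a key (review manually).
SELECT
    customer_id,
    transaction_date,
    transaction_id,
    COUNT(*)               AS copies,
    COUNT(DISTINCT amount) AS distinct_amounts,
    CASE WHEN COUNT(DISTINCT amount) = 1
         THEN 'likely true duplicate'
         ELSE 'review manually'
    END AS classification
FROM merged_sales
GROUP BY customer_id, transaction_date, transaction_id
HAVING COUNT(*) > 1;
```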
Duplicate Removal Strategies:
- Hash-based deduplication (most reliable):
-- Note: DISTINCT ON is PostgreSQL-specific syntax; non-text columns
-- may need explicit casts inside CONCAT depending on your database.
WITH hashed_records AS (
    SELECT *,
           MD5(CONCAT(customer_id, transaction_date, amount, product_id)) AS record_hash
    FROM merged_sales
)
SELECT DISTINCT ON (record_hash)
       customer_id, transaction_date, amount, product_id, region
FROM hashed_records
ORDER BY record_hash, file_load_timestamp DESC;
- Rule-based deduplication:
- If transaction_id exists: use it as unique key
- If no transaction_id: create composite key from all business fields
- Apply deduplication in Data Preparation module before data model
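The fallback rule above can be sketched in SQL by coalescing transaction_id with a composite key built from the business fields (column names assumed to match the earlier examples; the '|' separator is an arbitrary choice to keep field values from colliding):

```sql
-- Use transaction_id as the unique key when present; otherwise fall back
-- to a composite of the business fields.
SELECT *
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY COALESCE(
                   CAST(transaction_id AS VARCHAR(40)),
                   CONCAT(customer_id, '|', transaction_date, '|', amount, '|', product_id)
               )
               ORDER BY file_load_timestamp DESC
           ) AS rn
    FROM merged_sales
) t
WHERE rn = 1;
```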
- Source-level resolution:
- Establish data ownership: each transaction reported by only one region
- Add transaction_source field to track origin
- Implement validation at CSV generation to prevent overlaps
Implementation in Crystal Reports 2016:
- In Data Preparation module:
- Add Transformation: “Remove Duplicates”
- Configure unique key fields: customer_id, transaction_date, transaction_id
- Choose conflict resolution: “Keep First” or “Keep Last”
- Create a validation report:
SELECT
    customer_id,
    transaction_date,
    transaction_id,
    COUNT(*) AS duplicate_count,
    GROUP_CONCAT(region) AS regions  -- STRING_AGG(region, ',') in PostgreSQL / SQL Server
FROM merged_sales
GROUP BY customer_id, transaction_date, transaction_id
HAVING COUNT(*) > 1;
Run this after each merge to detect remaining duplicates.
- Add data quality rules:
- Flag records without transaction_id for review
- Alert when duplicate rate exceeds 5%
- Log all deduplication actions for audit trail
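The 5% alert threshold above can be computed with a query along these lines (table and key columns assumed from the earlier examples; some databases require casting the date column to text inside CONCAT):

```sql
-- Share of rows that are extra copies of an already-seen unique key.
SELECT
    100.0 * (COUNT(*) - COUNT(DISTINCT CONCAT(customer_id, '|', transaction_date, '|', transaction_id)))
          / COUNT(*) AS duplicate_pct
FROM merged_sales;
```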
Best Practices:
- Add unique transaction_id at source if not present
- Include file_name and load_timestamp in merged data for traceability
- Create a staging table to inspect data before final merge
- Document deduplication rules in data model metadata
- Schedule regular data quality audits to catch new duplicate patterns
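The traceability columns recommended above can be attached during the merge itself; the file names here are placeholders for whatever your load process records:

```sql
-- Tag each row with its source file and load time before merging,
-- so duplicates can be traced back to the originating CSV.
SELECT *, 'north_sales.csv' AS file_name, CURRENT_TIMESTAMP AS load_timestamp
FROM north_sales
UNION ALL
SELECT *, 'south_sales.csv' AS file_name, CURRENT_TIMESTAMP AS load_timestamp
FROM south_sales;
```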
Prevention:
- Coordinate with regional offices to ensure non-overlapping data
- Implement transaction_id generation at source
- Use API-based data collection instead of CSV files if possible
- Add validation at CSV generation: reject files with duplicate transaction_ids
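If the merged data lands in a staging table first (as suggested under Best Practices), a unique constraint rejects duplicate transaction_ids at load time instead of requiring cleanup later. Column names and types here are assumptions based on the fields used throughout this answer:

```sql
-- Loads that contain a duplicate transaction_id fail immediately,
-- surfacing the problem at the source rather than in reports.
CREATE TABLE staging_sales (
    transaction_id   VARCHAR(40)  NOT NULL,
    customer_id      VARCHAR(20)  NOT NULL,
    transaction_date DATE         NOT NULL,
    amount           DECIMAL(12, 2),
    product_id       VARCHAR(20),
    region           VARCHAR(10),
    file_name        VARCHAR(255),
    load_timestamp   TIMESTAMP,
    CONSTRAINT uq_staging_transaction UNIQUE (transaction_id)
);
```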
Implementing these merge logic improvements and unique key constraints should eliminate the 450 duplicate records you’re currently seeing.