Duplicate records appear in data model after merging multiple CSV sources

Merging multiple CSV data sources in data preparation creates duplicate records in the final data model. I’m combining sales data from three regional offices (North, South, West), each providing a monthly CSV file. The merge logic seems to be creating duplicates where customer records exist in multiple files.

For example, customer ID 12345 appears in both the North and West region files (the customer has offices in both regions), and after the merge I get two records for the same customer, each carrying identical transaction data.

SELECT customer_id, COUNT(*) as record_count
FROM merged_sales
GROUP BY customer_id
HAVING COUNT(*) > 1
-- Returns 450 duplicate customer_id values

How do I configure the merge to handle overlapping records and apply proper unique key constraints?

Here’s the complete solution for handling duplicate records in merged data sources:

Merge Logic Configuration: In Crystal Reports Data Preparation, configure the merge operation with proper deduplication:

  1. Define a unique composite key:
CREATE VIEW merged_sales_clean AS
SELECT
  customer_id,
  transaction_date,
  transaction_id,  -- or order_id
  amount,
  product_id,
  region
FROM (
  SELECT *, ROW_NUMBER() OVER (
    PARTITION BY customer_id, transaction_date, transaction_id
    ORDER BY file_load_timestamp DESC
  ) as rn
  FROM (
    -- assumes each source table carries a file_load_timestamp column
    SELECT *, 'North' as region FROM north_sales
    UNION ALL
    SELECT *, 'South' as region FROM south_sales
    UNION ALL
    SELECT *, 'West' as region FROM west_sales
  ) combined
) ranked
WHERE rn = 1  -- one row per key, so no DISTINCT is needed in the outer SELECT

This keeps the most recently loaded version when duplicates exist.
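If you want to verify this pattern outside Crystal Reports first, here is a minimal, runnable sketch using Python's built-in sqlite3 (window functions need SQLite 3.25+). The table layout mirrors the view above, but the timestamps and amounts are invented for the demo:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("""CREATE TABLE combined (
    customer_id INTEGER, transaction_id INTEGER, transaction_date TEXT,
    amount REAL, region TEXT, file_load_timestamp TEXT)""")
rows = [
    # same transaction reported by two regions; the North copy loaded later
    (12345, 900, "2024-01-15", 250.0, "West",  "2024-02-01 08:00"),
    (12345, 900, "2024-01-15", 250.0, "North", "2024-02-01 09:30"),
    # a distinct transaction for the same customer -> must survive dedup
    (12345, 901, "2024-01-16", 80.0,  "West",  "2024-02-01 08:00"),
]
c.executemany("INSERT INTO combined VALUES (?,?,?,?,?,?)", rows)

# Keep the most recently loaded copy of each (customer, date, transaction)
dedup = c.execute("""
    SELECT customer_id, transaction_id, region FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY customer_id, transaction_date, transaction_id
            ORDER BY file_load_timestamp DESC) AS rn
        FROM combined)
    WHERE rn = 1
    ORDER BY transaction_id
""").fetchall()
print(dedup)  # [(12345, 900, 'North'), (12345, 901, 'West')]
```

The duplicated transaction collapses to its latest-loaded copy while the legitimate second transaction is untouched.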

Unique Key Constraints: Establish what makes a record unique:

  • Scenario 1: True Duplicates (same transaction reported by multiple regions)

    • Unique Key: customer_id + transaction_date + transaction_id
    • Action: Keep one record, discard duplicates
    • Use ROW_NUMBER() with PARTITION BY unique key
  • Scenario 2: Different Transactions (same customer, multiple legitimate transactions)

    • Unique Key: transaction_id (should be globally unique)
    • Action: Keep all records
    • Fix source data to ensure unique transaction_ids
  • Scenario 3: Overlapping Customers (customer exists in multiple regions)

    • Not duplicates if transactions are different
    • Preserve all records
    • Add region as part of reporting dimension

Duplicate Removal Strategies:

  1. Hash-based deduplication (most reliable):
-- PostgreSQL syntax: DISTINCT ON and MD5() are Postgres-specific
WITH hashed_records AS (
  SELECT *,
    -- CONCAT_WS adds a delimiter between fields; plain CONCAT would let
    -- different field combinations (e.g. '12'+'3' vs '1'+'23') collide
    MD5(CONCAT_WS('|', customer_id, transaction_date, amount, product_id)) as record_hash
  FROM merged_sales
)
SELECT DISTINCT ON (record_hash)
  customer_id, transaction_date, amount, product_id, region
FROM hashed_records
ORDER BY record_hash, file_load_timestamp DESC
  2. Rule-based deduplication:

    • If transaction_id exists: use it as unique key
    • If no transaction_id: create composite key from all business fields
    • Apply deduplication in Data Preparation module before data model
  3. Source-level resolution:

    • Establish data ownership: each transaction reported by only one region
    • Add transaction_source field to track origin
    • Implement validation at CSV generation to prevent overlaps
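Strategy 1 (hash-based) can also be applied in plain Python before the files ever reach the merge. This is an illustrative sketch, not Crystal Reports functionality; note the '|' delimiter between fields, for the same collision reason as in the SQL above:

```python
import hashlib

BUSINESS_FIELDS = ("customer_id", "transaction_date", "amount", "product_id")

def record_hash(rec):
    # Delimit fields so adjacent values can't blur together before hashing
    key = "|".join(str(rec[f]) for f in BUSINESS_FIELDS)
    return hashlib.md5(key.encode()).hexdigest()

def dedupe(records):
    # Assumes records are ordered oldest-load-first, so later copies win
    seen = {}
    for rec in records:
        seen[record_hash(rec)] = rec
    return list(seen.values())

rows = [
    {"customer_id": 12345, "transaction_date": "2024-01-15",
     "amount": 250.0, "product_id": 7, "region": "West"},
    {"customer_id": 12345, "transaction_date": "2024-01-15",
     "amount": 250.0, "product_id": 7, "region": "North"},  # same sale, re-reported
    {"customer_id": 12345, "transaction_date": "2024-01-16",
     "amount": 80.0, "product_id": 9, "region": "West"},
]
clean = dedupe(rows)
print(len(clean))  # 2 -- the North/West copies of the same sale collapse to one
```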

Implementation in Crystal Reports 2016:

  1. In Data Preparation module:

    • Add Transformation: “Remove Duplicates”
    • Configure unique key fields: customer_id, transaction_date, transaction_id
    • Choose conflict resolution: “Keep First” or “Keep Last”
  2. Create a validation report:

SELECT
  customer_id,
  transaction_date,
  transaction_id,
  COUNT(*) as duplicate_count,
  GROUP_CONCAT(region) as regions  -- MySQL/SQLite; use string_agg(region, ',') in PostgreSQL
FROM merged_sales
GROUP BY customer_id, transaction_date, transaction_id
HAVING COUNT(*) > 1

Run this after each merge to detect remaining duplicates.
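To sanity-check the validation query itself, here is a small SQLite reproduction (GROUP_CONCAT is available in SQLite too) with one duplicated and one clean transaction:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.executescript("""
    CREATE TABLE merged_sales (customer_id INTEGER, transaction_id INTEGER,
                               transaction_date TEXT, region TEXT);
    INSERT INTO merged_sales VALUES
        (12345, 900, '2024-01-15', 'North'),
        (12345, 900, '2024-01-15', 'West'),   -- duplicate pair
        (12345, 901, '2024-01-16', 'West');   -- clean row, must not be flagged
""")
report = c.execute("""
    SELECT customer_id, transaction_date, COUNT(*) AS duplicate_count,
           GROUP_CONCAT(region) AS regions
    FROM merged_sales
    GROUP BY customer_id, transaction_date, transaction_id
    HAVING COUNT(*) > 1
""").fetchall()
print(report)  # flags only the (12345, '2024-01-15') pair reported by both regions
```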

  3. Add data quality rules:
    • Flag records without transaction_id for review
    • Alert when duplicate rate exceeds 5%
    • Log all deduplication actions for audit trail

Best Practices:

  1. Add unique transaction_id at source if not present
  2. Include file_name and load_timestamp in merged data for traceability
  3. Create a staging table to inspect data before final merge
  4. Document deduplication rules in data model metadata
  5. Schedule regular data quality audits to catch new duplicate patterns

Prevention:

  • Coordinate with regional offices to ensure non-overlapping data
  • Implement transaction_id generation at source
  • Use API-based data collection instead of CSV files if possible
  • Add validation at CSV generation: reject files with duplicate transaction_ids
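The last prevention step can be a few lines run at CSV-generation (or ingestion) time. A sketch, assuming the files have a transaction_id header column; the file name and layout here are made up for the demo:

```python
import csv
from collections import Counter

def find_duplicate_ids(csv_path):
    """Return transaction_ids that appear more than once in the file."""
    with open(csv_path, newline="") as f:
        counts = Counter(row["transaction_id"] for row in csv.DictReader(f))
    return sorted(tid for tid, n in counts.items() if n > 1)

# Demo with a throwaway file (hypothetical layout: transaction_id, amount)
with open("north_sales_demo.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["transaction_id", "amount"])
    w.writerows([["900", "250.0"], ["901", "80.0"], ["900", "250.0"]])

dupes = find_duplicate_ids("north_sales_demo.csv")
print(dupes)  # ['900'] -> reject this file before it reaches the merge
```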

Implementing these merge logic improvements and unique key constraints should eliminate the 450 duplicate records you’re currently seeing.

This is a classic data integration problem. When you merge CSV files, Crystal Reports doesn’t automatically deduplicate unless you explicitly define a unique key constraint. The merge operation is doing a UNION ALL (keep all rows) instead of UNION (remove duplicates). You need to specify which fields constitute a unique record.
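The UNION ALL vs UNION distinction is easy to demonstrate (SQLite here; any dialect behaves the same). One caveat: UNION only removes rows that are identical in every column, so it won't help if the two regions report the same transaction with, say, different load timestamps:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.executescript("""
    CREATE TABLE north_sales (customer_id INTEGER, transaction_id INTEGER, amount REAL);
    CREATE TABLE west_sales  (customer_id INTEGER, transaction_id INTEGER, amount REAL);
    -- Transaction 900 is reported by both regions
    INSERT INTO north_sales VALUES (12345, 900, 250.0);
    INSERT INTO west_sales  VALUES (12345, 900, 250.0), (12345, 901, 80.0);
""")
union_all = c.execute(
    "SELECT * FROM north_sales UNION ALL SELECT * FROM west_sales").fetchall()
union = c.execute(
    "SELECT * FROM north_sales UNION SELECT * FROM west_sales").fetchall()
print(len(union_all), len(union))  # 3 2 -- UNION dropped the exact-duplicate row
```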

Use the DISTINCT clause in your final query or create a view with duplicate removal logic. But fix the root cause in the merge process rather than working around it in every report.

Also check your source CSV files. Are the regions sending overlapping data? Maybe the North region is including transactions from shared customers that the West region also reports. You might need to establish data ownership rules at the source: each transaction should be reported by exactly one region.

What fields should I use for the unique key? Just customer_id, or do I need to include the transaction date and region as well? Some customers legitimately have multiple transactions on the same day across different regions.