Duplicate records appear in data model after merging multiple CSV sources

Merging multiple CSV data sources in data preparation creates duplicate records in the final data model. I’m combining sales data from three regional offices (North, South, West), each providing a monthly CSV file. The merge logic seems to be creating duplicates where customer records exist in multiple files.

For example, customer ID 12345 appears in both the North and West region files (the customer has offices in both regions), and after the merge I get two records for the same customer, each carrying identical transaction data.

SELECT customer_id, COUNT(*) as record_count
FROM merged_sales
GROUP BY customer_id
HAVING COUNT(*) > 1
-- Returns 450 duplicate customer_id values

How do I configure the merge to handle overlapping records and apply proper unique key constraints?

Here’s the complete solution for handling duplicate records in merged data sources:

Merge Logic Configuration: In Crystal Reports Data Preparation, configure the merge operation with proper deduplication:

  1. Define a unique composite key:
CREATE VIEW merged_sales_clean AS
SELECT
  customer_id,
  transaction_date,
  transaction_id,  -- or order_id
  amount,
  product_id,
  region
FROM (
  SELECT *, ROW_NUMBER() OVER (
    PARTITION BY customer_id, transaction_date, transaction_id
    ORDER BY file_load_timestamp DESC
  ) as rn
  FROM (
    -- assumes each source table carries a file_load_timestamp column
    SELECT *, 'North' as region FROM north_sales
    UNION ALL
    SELECT *, 'South' as region FROM south_sales
    UNION ALL
    SELECT *, 'West' as region FROM west_sales
  ) combined
) ranked
WHERE rn = 1  -- one row per key, so no DISTINCT is needed in the outer SELECT

This keeps the most recently loaded version when duplicates exist.
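If you want to verify this pattern outside Crystal Reports first, here is a minimal, runnable sketch using Python's built-in sqlite3 (window functions need SQLite 3.25+). The table layout mirrors the view above, but the timestamps and amounts are invented for the demo:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("""CREATE TABLE combined (
    customer_id INTEGER, transaction_id INTEGER, transaction_date TEXT,
    amount REAL, region TEXT, file_load_timestamp TEXT)""")
rows = [
    # same transaction reported by two regions; the North copy loaded later
    (12345, 900, "2024-01-15", 250.0, "West",  "2024-02-01 08:00"),
    (12345, 900, "2024-01-15", 250.0, "North", "2024-02-01 09:30"),
    # a distinct transaction for the same customer -> must survive dedup
    (12345, 901, "2024-01-16", 80.0,  "West",  "2024-02-01 08:00"),
]
c.executemany("INSERT INTO combined VALUES (?,?,?,?,?,?)", rows)

# Keep the most recently loaded copy of each (customer, date, transaction)
dedup = c.execute("""
    SELECT customer_id, transaction_id, region FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY customer_id, transaction_date, transaction_id
            ORDER BY file_load_timestamp DESC) AS rn
        FROM combined)
    WHERE rn = 1
    ORDER BY transaction_id
""").fetchall()
print(dedup)  # [(12345, 900, 'North'), (12345, 901, 'West')]
```

The duplicated transaction collapses to its latest-loaded copy while the legitimate second transaction is untouched.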

Unique Key Constraints: Establish what makes a record unique:

  • Scenario 1: True Duplicates (same transaction reported by multiple regions)

    • Unique Key: customer_id + transaction_date + transaction_id
    • Action: Keep one record, discard duplicates
    • Use ROW_NUMBER() with PARTITION BY unique key
  • Scenario 2: Different Transactions (same customer, multiple legitimate transactions)

    • Unique Key: transaction_id (should be globally unique)
    • Action: Keep all records
    • Fix source data to ensure unique transaction_ids
  • Scenario 3: Overlapping Customers (customer exists in multiple regions)

    • Not duplicates if transactions are different
    • Preserve all records
    • Add region as part of reporting dimension

Duplicate Removal Strategies:

  1. Hash-based deduplication (most reliable):
-- PostgreSQL syntax: DISTINCT ON and MD5() are Postgres-specific
WITH hashed_records AS (
  SELECT *,
    -- CONCAT_WS adds a delimiter between fields; plain CONCAT would let
    -- different field combinations (e.g. '12'+'3' vs '1'+'23') collide
    MD5(CONCAT_WS('|', customer_id, transaction_date, amount, product_id)) as record_hash
  FROM merged_sales
)
SELECT DISTINCT ON (record_hash)
  customer_id, transaction_date, amount, product_id, region
FROM hashed_records
ORDER BY record_hash, file_load_timestamp DESC
  2. Rule-based deduplication:

    • If transaction_id exists: use it as unique key
    • If no transaction_id: create composite key from all business fields
    • Apply deduplication in Data Preparation module before data model
  3. Source-level resolution:

    • Establish data ownership: each transaction reported by only one region
    • Add transaction_source field to track origin
    • Implement validation at CSV generation to prevent overlaps
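Strategy 1 (hash-based) can also be applied in plain Python before the files ever reach the merge. This is an illustrative sketch, not Crystal Reports functionality; note the '|' delimiter between fields, for the same collision reason as in the SQL above:

```python
import hashlib

BUSINESS_FIELDS = ("customer_id", "transaction_date", "amount", "product_id")

def record_hash(rec):
    # Delimit fields so adjacent values can't blur together before hashing
    key = "|".join(str(rec[f]) for f in BUSINESS_FIELDS)
    return hashlib.md5(key.encode()).hexdigest()

def dedupe(records):
    # Assumes records are ordered oldest-load-first, so later copies win
    seen = {}
    for rec in records:
        seen[record_hash(rec)] = rec
    return list(seen.values())

rows = [
    {"customer_id": 12345, "transaction_date": "2024-01-15",
     "amount": 250.0, "product_id": 7, "region": "West"},
    {"customer_id": 12345, "transaction_date": "2024-01-15",
     "amount": 250.0, "product_id": 7, "region": "North"},  # same sale, re-reported
    {"customer_id": 12345, "transaction_date": "2024-01-16",
     "amount": 80.0, "product_id": 9, "region": "West"},
]
clean = dedupe(rows)
print(len(clean))  # 2 -- the North/West copies of the same sale collapse to one
```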

Implementation in Crystal Reports 2016:

  1. In Data Preparation module:

    • Add Transformation: “Remove Duplicates”
    • Configure unique key fields: customer_id, transaction_date, transaction_id
    • Choose conflict resolution: “Keep First” or “Keep Last”
  2. Create a validation report:

SELECT
  customer_id,
  transaction_date,
  transaction_id,
  COUNT(*) as duplicate_count,
  GROUP_CONCAT(region) as regions  -- MySQL/SQLite; use string_agg(region, ',') in PostgreSQL
FROM merged_sales
GROUP BY customer_id, transaction_date, transaction_id
HAVING COUNT(*) > 1

Run this after each merge to detect remaining duplicates.
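To sanity-check the validation query itself, here is a small SQLite reproduction (GROUP_CONCAT is available in SQLite too) with one duplicated and one clean transaction:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.executescript("""
    CREATE TABLE merged_sales (customer_id INTEGER, transaction_id INTEGER,
                               transaction_date TEXT, region TEXT);
    INSERT INTO merged_sales VALUES
        (12345, 900, '2024-01-15', 'North'),
        (12345, 900, '2024-01-15', 'West'),   -- duplicate pair
        (12345, 901, '2024-01-16', 'West');   -- clean row, must not be flagged
""")
report = c.execute("""
    SELECT customer_id, transaction_date, COUNT(*) AS duplicate_count,
           GROUP_CONCAT(region) AS regions
    FROM merged_sales
    GROUP BY customer_id, transaction_date, transaction_id
    HAVING COUNT(*) > 1
""").fetchall()
print(report)  # flags only the (12345, '2024-01-15') pair reported by both regions
```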

  3. Add data quality rules:
    • Flag records without transaction_id for review
    • Alert when duplicate rate exceeds 5%
    • Log all deduplication actions for audit trail

Best Practices:

  1. Add unique transaction_id at source if not present
  2. Include file_name and load_timestamp in merged data for traceability
  3. Create a staging table to inspect data before final merge
  4. Document deduplication rules in data model metadata
  5. Schedule regular data quality audits to catch new duplicate patterns

Prevention:

  • Coordinate with regional offices to ensure non-overlapping data
  • Implement transaction_id generation at source
  • Use API-based data collection instead of CSV files if possible
  • Add validation at CSV generation: reject files with duplicate transaction_ids
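The last prevention step can be a few lines run at CSV-generation (or ingestion) time. A sketch, assuming the files have a transaction_id header column; the file name and layout here are made up for the demo:

```python
import csv
from collections import Counter

def find_duplicate_ids(csv_path):
    """Return transaction_ids that appear more than once in the file."""
    with open(csv_path, newline="") as f:
        counts = Counter(row["transaction_id"] for row in csv.DictReader(f))
    return sorted(tid for tid, n in counts.items() if n > 1)

# Demo with a throwaway file (hypothetical layout: transaction_id, amount)
with open("north_sales_demo.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["transaction_id", "amount"])
    w.writerows([["900", "250.0"], ["901", "80.0"], ["900", "250.0"]])

dupes = find_duplicate_ids("north_sales_demo.csv")
print(dupes)  # ['900'] -> reject this file before it reaches the merge
```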

Implementing these merge logic improvements and unique key constraints should eliminate the 450 duplicate records you’re currently seeing.

This is a classic data integration problem. When you merge CSV files, Crystal Reports doesn’t automatically deduplicate unless you explicitly define a unique key constraint. The merge operation is doing a UNION ALL (keep all rows) instead of UNION (remove duplicates). You need to specify which fields constitute a unique record.
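The UNION ALL vs UNION distinction is easy to demonstrate (SQLite here; any dialect behaves the same). One caveat: UNION only removes rows that are identical in every column, so it won't help if the two regions report the same transaction with, say, different load timestamps:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.executescript("""
    CREATE TABLE north_sales (customer_id INTEGER, transaction_id INTEGER, amount REAL);
    CREATE TABLE west_sales  (customer_id INTEGER, transaction_id INTEGER, amount REAL);
    -- Transaction 900 is reported by both regions
    INSERT INTO north_sales VALUES (12345, 900, 250.0);
    INSERT INTO west_sales  VALUES (12345, 900, 250.0), (12345, 901, 80.0);
""")
union_all = c.execute(
    "SELECT * FROM north_sales UNION ALL SELECT * FROM west_sales").fetchall()
union = c.execute(
    "SELECT * FROM north_sales UNION SELECT * FROM west_sales").fetchall()
print(len(union_all), len(union))  # 3 2 -- UNION dropped the exact-duplicate row
```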

Use the DISTINCT clause in your final query or create a view with duplicate removal logic. But fix the root cause in the merge process rather than working around it in every report.

Also check your source CSV files. Are the regions sending overlapping data? Maybe the North region is including transactions from shared customers that the West region also reports. You might need to establish data ownership rules at the source: each transaction should be reported by exactly one region.

What fields should I use for the unique key? Just customer_id, or do I need to include the transaction date and region as well? Some customers legitimately have multiple transactions on the same day across different regions.