Merging delta records in Redshift using UPSERT

Handling delta records—also known as incremental data updates—is a common challenge in data engineering and analytics pipelines. In Amazon Redshift, which is designed for high-performance data warehousing, efficiently merging these records requires a strategy that can handle both inserts and updates seamlessly. This process is often referred to as UPSERT (update + insert).

In this blog, we’ll explore how to perform UPSERT operations in Redshift to merge delta records effectively. We'll cover use cases, key SQL strategies, and performance considerations.


Why UPSERT in Redshift?

Delta records typically represent changes (updates, additions) in your source data since the last load. Without handling these properly, you risk:

- Duplicated rows if only inserts are done
- Outdated data if updates are missed
- Inconsistent reporting if the data isn't synchronized

UPSERT addresses this by updating existing rows based on a key (e.g., ID or timestamp) and inserting new ones if they don’t exist.

Redshift and the Lack of Native UPSERT

Unlike PostgreSQL or MySQL, Redshift has no ON CONFLICT-style UPSERT statement, and for most of its history it lacked a MERGE statement as well (see the optional section at the end). You can achieve the same result using a combination of:


- Staging tables
- A DELETE + INSERT pattern
- A CTE-based MERGE workaround (in limited scenarios)

Let’s walk through the common and efficient approach using staging and DELETE + INSERT.


Step-by-Step UPSERT Strategy

1. Create a Staging Table

First, load your delta records into a staging table. This can be done via a data pipeline (ETL/ELT), AWS Glue, or a COPY command from Amazon S3.



CREATE TEMP TABLE stage_users (LIKE users);

Load the delta data:



COPY stage_users
FROM 's3://your-bucket/delta/users/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS JSON 'auto';


2. Delete Matching Records from the Target Table

Use a key (like user_id) to identify which rows should be updated.



DELETE FROM users
USING stage_users
WHERE users.user_id = stage_users.user_id;

This ensures that existing records in the target table are removed before the updated ones are inserted.
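One caveat before inserting: if a single delta batch can contain several rows for the same key (for example, multiple updates to one user between loads), a plain INSERT will reintroduce duplicates. Below is a minimal deduplication sketch, assuming a hypothetical updated_at column that marks the most recent change; if you use it, run the next step's INSERT against stage_users_dedup instead of stage_users.

-- Keep only the most recent row per user_id (updated_at is an assumed column)
CREATE TEMP TABLE stage_users_dedup AS
SELECT user_id, name, updated_at
FROM (
    SELECT s.*,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY updated_at DESC) AS rn
    FROM stage_users s
) ranked
WHERE rn = 1;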


3. Insert New and Updated Records

Now insert all records from the staging table into the main table.



INSERT INTO users
SELECT * FROM stage_users;

This effectively merges new and updated rows into the target table.
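Because DELETE and INSERT are two separate statements, it is worth wrapping them in a single transaction so concurrent readers never see the table with rows deleted but not yet re-inserted. A minimal sketch using the statements from the previous steps:

BEGIN;

DELETE FROM users
USING stage_users
WHERE users.user_id = stage_users.user_id;

INSERT INTO users
SELECT * FROM stage_users;

END;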


Considerations for Efficient UPSERT

Keys and Sort Keys: Redshift doesn't enforce primary keys and has no traditional indexes, but you should still design tables with meaningful merge keys, and choose distribution and sort keys that support the join in the DELETE step (see the sketch below).
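For illustration, a hypothetical definition of the users table might make the merge key both the distribution key and the sort key, so the DELETE's join is resolved locally on each node. The column list here is assumed, not taken from the article:

-- Hypothetical schema; adjust columns to your actual table
CREATE TABLE users (
    user_id    BIGINT,
    name       VARCHAR(256),
    updated_at TIMESTAMP
)
DISTKEY (user_id)
SORTKEY (user_id);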


Batch Size: Load data in reasonably sized batches. Very large DELETE operations can be slow, and deleted rows continue to occupy space until the next VACUUM. Use filters or incremental loading when possible, as sketched below.
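As one hedged illustration: if the target table carries a change-tracking column (again a hypothetical updated_at) and your delta feed is guaranteed to touch only recent rows, you can bound the DELETE so Redshift scans fewer blocks. Use this only when that guarantee actually holds, since an out-of-range match would be missed:

-- Restrict the scan to the last 7 days; updated_at is an assumed column
DELETE FROM users
USING stage_users
WHERE users.user_id = stage_users.user_id
  AND users.updated_at >= DATEADD(day, -7, GETDATE());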


Vacuum & Analyze: After large UPSERT operations, run VACUUM to reorganize storage and ANALYZE to update table statistics:



VACUUM users;
ANALYZE users;

Concurrency: Avoid running UPSERTs on the same table simultaneously. This can cause contention and unexpected behavior.
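If two pipelines can target the same table, one option is to take an explicit table lock at the start of the transaction so a second UPSERT waits rather than interleaving with the first; a minimal sketch:

BEGIN;

-- Serialize competing UPSERT jobs on this table
LOCK users;

-- ... DELETE + INSERT from the steps above ...

END;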


Optional: Use the Native MERGE Command (Advanced)

Newer versions of Amazon Redshift support a native MERGE command, which has since moved from preview to general availability. If your cluster supports it, you can use:



MERGE INTO users USING stage_users
ON users.user_id = stage_users.user_id
WHEN MATCHED THEN UPDATE SET name = stage_users.name
WHEN NOT MATCHED THEN INSERT VALUES (...);

Always check your Redshift version and region to confirm availability.
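For reference, a fully spelled-out version under an assumed two-column schema (user_id, name) might look like this; substitute your table's actual column list:

MERGE INTO users
USING stage_users
ON users.user_id = stage_users.user_id
WHEN MATCHED THEN
    UPDATE SET name = stage_users.name
WHEN NOT MATCHED THEN
    INSERT (user_id, name) VALUES (stage_users.user_id, stage_users.name);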


Conclusion

While Amazon Redshift long lacked a native UPSERT command, merging delta records is entirely feasible using the DELETE + INSERT approach with staging tables, and newer clusters can use MERGE directly. The staging pattern remains efficient, scalable, and widely used in production data pipelines.

By handling updates and inserts correctly, you maintain data integrity, reduce duplication, and ensure your analytics reflect the latest changes. With Redshift’s performance capabilities and a thoughtful UPSERT strategy, your data warehouse can remain both accurate and fast.
