How to backfill data in Kinesis and Glue jobs
Backfilling data is a common requirement in data engineering pipelines. It refers to the process of loading historical data into a data pipeline or system after it has already been running in production. When working with Amazon Kinesis and AWS Glue, backfilling can be complex due to the streaming nature of Kinesis and the batch or ETL nature of Glue. However, with the right approach, you can design your pipelines to support smooth, efficient, and safe backfills.
In this blog, we’ll walk through strategies and best practices for backfilling data using Kinesis and Glue jobs.
Why Backfill Data?
Backfilling is needed when:
New transformation logic needs to be applied to historical data.
Historical data needs to be reprocessed for analytics or compliance.
A bug in the original pipeline caused data loss or corruption.
You onboard a new destination (e.g., S3, Redshift) and need past data.
Challenges of Backfilling with Kinesis and Glue
Kinesis is designed for real-time streaming, not historical processing.
Data in Kinesis is retained for a limited time (default: 24 hours, extendable up to 365 days).
Glue streaming jobs consume real-time data and are not built for processing old datasets.
There's risk of duplicate processing if backfill and live streams aren’t managed properly.
To backfill correctly, you must separate historical batch data from live streaming data, and manage data consistency during the transition.
Strategy 1: Backfilling with Glue Batch Jobs
If your historical data is stored in S3 or another system (like an RDS snapshot or Redshift export), the simplest approach is to create a Glue batch job to process it.
Steps:
Extract historical data from the source system and load it into S3.
Create a new AWS Glue job (batch) that reads from this historical data source.
Apply the same transformations used in your live pipeline.
Write the results to the same or a different destination (e.g., S3 or Redshift).
This approach avoids touching Kinesis entirely and is best when data already exists outside the stream.
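A minimal sketch of such a Glue batch job is shown below. The S3 paths, the JSON input and Parquet output formats, and the pass-through transform are placeholders you would replace with your own pipeline logic:
python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Standard Glue job bootstrap
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read the historical data staged in S3 (bucket and path are assumptions)
historical = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-backfill-bucket/historical/"]},
    format="json"
)

# Apply the same transformation logic used in the live pipeline
# (this pass-through is a placeholder, e.g. ApplyMapping.apply(frame=historical, mappings=[...]))
transformed = historical

# Write to the same destination the live pipeline uses (path is an assumption)
glueContext.write_dynamic_frame.from_options(
    frame=transformed,
    connection_type="s3",
    connection_options={"path": "s3://my-destination-bucket/processed/"},
    format="parquet"
)

job.commit()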
Strategy 2: Backfilling into a Kinesis Stream
If you must send historical data through a Kinesis stream (e.g., to mimic live processing or support downstream consumers), follow this method:
Steps:
Extract and prepare historical data from your original source or archive.
Use the AWS Kinesis Data Streams PutRecord/PutRecords API or the Kinesis Producer Library (KPL) to push historical data into the stream.
python
import boto3
import json

kinesis = boto3.client('kinesis')

def send_to_kinesis(data, stream_name):
    for record in data:
        kinesis.put_record(
            StreamName=stream_name,
            Data=json.dumps(record),
            # A fixed partition key routes every record to a single shard;
            # for higher throughput, derive the key from a record field and
            # consider batching calls with put_records.
            PartitionKey="backfill"
        )
Make sure your Glue streaming job is idempotent, so repeated processing doesn’t lead to duplicates or errors.
Optionally, tag or flag backfill records so they can be identified downstream.
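As one way the last two steps might look on the consumer side, the sketch below normalizes a hypothetical is_backfill flag (set by the backfill producer, absent or null on live records) so downstream consumers can isolate or exclude backfill traffic. The field name and the records DataFrame are assumptions:
python
from pyspark.sql import functions as F

# 'records' is the DataFrame the Glue streaming job builds from the Kinesis
# payloads; is_backfill is a hypothetical field set only by the backfill producer.
# Normalize the flag so live records (which parse it as null under an explicit
# schema) default to false, keeping backfilled rows identifiable downstream.
flagged = records.withColumn(
    "is_backfill",
    F.coalesce(F.col("is_backfill").cast("boolean"), F.lit(False))
)

# Downstream consumers can then exclude or isolate backfill traffic, e.g.:
live_only = flagged.where(F.col("is_backfill") == F.lit(False))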
Strategy 3: Dual Path with Merge
Some teams maintain two separate paths:
One for real-time streaming (Kinesis + Glue Streaming Job)
One for backfill (Glue Batch Job)
Data is eventually merged in a common storage layer (e.g., partitioned S3 bucket or a data warehouse). This ensures minimal impact on live streams and greater control.
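One way to implement the merge is to have both jobs write the same date-partitioned S3 layout, with a column recording which path produced each row. The sketch below assumes a Spark DataFrame named transformed with an event_time column and a hypothetical destination bucket:
python
from pyspark.sql import functions as F

# Both the streaming job and the batch backfill job write to the same
# partitioned layout; only the 'source' value differs between the two paths.
output = (
    transformed
    .withColumn("source", F.lit("backfill"))   # "live" in the streaming job
    .withColumn("year", F.year("event_time"))
    .withColumn("month", F.month("event_time"))
    .withColumn("day", F.dayofmonth("event_time"))
)

(output.write
    .mode("append")
    .partitionBy("year", "month", "day")
    .parquet("s3://my-destination-bucket/processed/"))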
Best Practices
Use partitions wisely: When writing to S3, use date-based partitions (year/month/day) so backfilled and live data land in a predictable layout and can be queried by timeframe.
Avoid duplicate ingestion: Include deduplication logic (e.g., hash keys or UUIDs) in your Glue jobs.
Log and monitor: Use CloudWatch logs and metrics to monitor the backfill process.
Time-based filtering: In Glue scripts, apply filters to process only specific timeframes (see the sketch after this list).
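For the deduplication and time-based filtering practices, a minimal sketch in a Glue/PySpark script might look like the following; records, customer_id, event_type, and event_time are placeholder names for your own data:
python
from pyspark.sql import functions as F

# Build a deterministic hash key from the business fields so repeated
# backfill runs can be safely deduplicated.
keyed = records.withColumn(
    "dedup_key",
    F.sha2(F.concat_ws("|", "customer_id", "event_type", "event_time"), 256)
).dropDuplicates(["dedup_key"])

# Restrict the run to an explicit time window so a re-run never touches
# data outside the range it was asked to process.
window = keyed.where(
    (F.col("event_time") >= "2024-01-01") & (F.col("event_time") < "2024-02-01")
)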
Conclusion
Backfilling data in Kinesis and Glue requires a careful balance between real-time and historical processing. Depending on your architecture, you can choose to backfill through batch Glue jobs, re-ingest into Kinesis, or design a hybrid model. Whichever method you choose, maintaining data integrity, consistency, and traceability is essential to ensuring a successful and accurate backfill process.