Creating partitioned tables in Glue Catalog

May 21, 2025

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and catalog data for analytics. One of Glue’s powerful features is its Data Catalog, which acts as a central metadata repository. To efficiently manage and query large datasets, it's often beneficial to use partitioned tables in the Glue Catalog. Partitioning improves query performance and reduces the amount of data scanned by tools like Amazon Athena, Amazon Redshift Spectrum, and AWS Glue ETL jobs.

In this blog, we’ll walk through what partitioned tables are, why they matter, and how to create and manage them in the AWS Glue Data Catalog.

What Are Partitioned Tables?

Partitioned tables are a way of organizing data into discrete sections based on values in one or more columns (called partition keys). Common partition keys include date, region, or device type. For example, a sales table might be partitioned by year, month, and day:

bash

Copy

Edit

s3://your-bucket/sales/year=2024/month=12/day=25/

Each partition is stored in a separate folder in Amazon S3, allowing AWS services to only read the relevant partitions instead of scanning the entire dataset.

Benefits of Partitioned Tables

Improved Performance: Tools like Athena and Redshift Spectrum only scan relevant partitions.
Reduced Cost: Less data scanned = lower costs.
Simplified Data Management: Easier to manage time-based or hierarchical data.

Creating Partitioned Tables in Glue Catalog

There are several ways to create partitioned tables in AWS Glue Data Catalog:

1. Using AWS Glue Crawler

The simplest method is to use a Glue crawler to automatically create a partitioned table.

Steps:

Store your data in S3 with partitioned folders:

bash

s3://your-bucket/logs/year=2025/month=05/day=21/

Go to AWS Glue Console > Crawlers and create a new crawler.
Set the S3 location as the data source.
Choose or create a Glue database.

Run the crawler.

The crawler detects partitions based on the folder structure and creates a table with corresponding partition columns: year, month, day.

2. Manually Creating a Partitioned Table Using AWS CLI or Boto3

You can define a partitioned table manually using the AWS CLI or the Boto3 SDK.

Example CLI command:

bash

aws glue create-table --database-name my_database --table-input '{

"Name": "sales_data",

"StorageDescriptor": {

"Columns": [

{"Name": "product_id", "Type": "string"},

{"Name": "amount", "Type": "double"}

"Location": "s3://your-bucket/sales/",

"InputFormat": "org.apache.hadoop.mapred.TextInputFormat",

"OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",

"SerdeInfo": {

"SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",

"Parameters": {

"field.delim": ","

}

"PartitionKeys": [

{"Name": "year", "Type": "int"},

{"Name": "month", "Type": "int"}

"TableType": "EXTERNAL_TABLE"

Once the table is created, you must add partitions using batch-create-partition or MSCK REPAIR TABLE in Athena.

Managing Partitions

MSCK REPAIR TABLE: In Athena, this SQL command loads partitions from S3 into the Glue Catalog.

sql

MSCK REPAIR TABLE sales_data;

BatchCreatePartition (AWS CLI/Boto3): Manually add partitions if you're not using a crawler.

Conclusion

Creating partitioned tables in AWS Glue Data Catalog is essential for optimizing performance and cost in large-scale data processing. Whether you automate the process using crawlers or manage it manually with CLI/SDK, partitioning allows your analytics queries to scale efficiently. By organizing data thoughtfully and keeping partitions up to date, you’ll gain faster insights and better control over your AWS data lake architecture.

Learn AWS Data Engineer with Data Analytics

Read More: Leveraging AWS Step Functions for data orchestration

Visit Quality Thought Training Institute in Hyderabad
Get Direction

Search This Blog

Quality Thought Training Institute

Creating partitioned tables in Glue Catalog

What Are Partitioned Tables?

Benefits of Partitioned Tables

Creating Partitioned Tables in Glue Catalog

1. Using AWS Glue Crawler

2. Manually Creating a Partitioned Table Using AWS CLI or Boto3

Conclusion

Comments

Post a Comment

Popular posts from this blog

Tosca vs Selenium: Which One to Choose?

Flask API Optimization: Using Content Delivery Networks (CDNs)

Using ID and Name Locators in Selenium Python