AWS Glue development using notebooks in SageMaker

Modern data engineering often involves building pipelines that are efficient, scalable, and easy to manage. AWS Glue is a serverless data integration service designed to make it simple to discover, prepare, and transform data for analytics and machine learning. While AWS Glue provides its own development interface, one powerful and flexible way to work with Glue jobs is through SageMaker Studio Notebooks.

In this blog, we’ll explore how to develop AWS Glue scripts using SageMaker notebooks, including setup, code samples, and best practices.


Why Use SageMaker Notebooks for AWS Glue Development?

While AWS Glue Studio offers a visual interface and script editor, SageMaker notebooks provide a more interactive, code-first development experience. This setup is perfect for data engineers and scientists who:

  • Want to explore datasets interactively
  • Prefer to use Jupyter-based notebooks
  • Need a flexible Python environment
  • Need to integrate Glue with machine learning pipelines

Prerequisites

Before you begin, ensure you have the following:

  1. An AWS account with access to Glue, SageMaker, and S3
  2. A SageMaker Studio environment configured
  3. IAM roles with the appropriate permissions for Glue, S3, and SageMaker
  4. Glue interactive sessions enabled in your AWS account


Step 1: Set Up Glue Interactive Sessions

AWS Glue interactive sessions let you run Glue code against a live, serverless Spark backend and see results immediately, which makes them ideal for notebook-based development.

To enable them:

  1. Open the AWS Glue console.
  2. Navigate to Notebooks > Interactive sessions.
  3. Choose an IAM role with the required permissions, or create one with AmazonS3FullAccess, AWSGlueServiceRole, and AmazonAthenaFullAccess attached (a scripted alternative is sketched below).
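
If you prefer to script this setup, a minimal sketch using boto3 might look like the following. The role name is hypothetical, and the managed policies match the broad ones listed above; in practice you would scope them down to your own buckets and catalogs.

python

import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets AWS Glue assume the role (role name is a placeholder)
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="GlueInteractiveSessionRole",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the managed policies named above (broad; narrow them for production use)
for policy_arn in [
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
    "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
    "arn:aws:iam::aws:policy/AmazonAthenaFullAccess",
]:
    iam.attach_role_policy(
        RoleName="GlueInteractiveSessionRole",
        PolicyArn=policy_arn,
    )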


Step 2: Launch SageMaker Notebook

  1. Open Amazon SageMaker Studio.
  2. Create a new Python 3 notebook.
  3. Install the AWS Glue interactive sessions library and boto3 using the following code:


python


!pip install --upgrade aws-glue-sessions boto3

Import necessary libraries:


python


import sys
import boto3

from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext

Note: Some setups may require you to launch a kernel with PySpark pre-installed.
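
If you use the Glue PySpark kernel provided by the aws-glue-sessions package (it ships an install-glue-kernels command to register the kernels with Jupyter), you can configure the interactive session with cell magics before running any code. A minimal sketch, with placeholder values for the role ARN, region, and sizing:

python

# Cell magics understood by the Glue PySpark kernel (all values below are placeholders)
%iam_role arn:aws:iam::123456789012:role/GlueInteractiveSessionRole
%region us-east-1
%glue_version 4.0
%worker_type G.1X
%number_of_workers 2
%idle_timeout 60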


Step 3: Connect to a Glue Session

You can start an interactive session directly from the notebook using:


python


from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

This allows you to write PySpark or Glue DynamicFrame code inside your notebook.
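
With the session up, you can also read data that is already registered in the Glue Data Catalog. The database and table names below are hypothetical; substitute tables crawled in your own account:

python

# Read a catalog table into a DynamicFrame (database and table names are placeholders)
catalog_frame = glueContext.create_dynamic_frame.from_catalog(
    database="your_database",
    table_name="your_table"
)
catalog_frame.printSchema()
print(catalog_frame.count())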


Step 4: Develop and Test ETL Code

Now you can write and test ETL logic interactively. The example below reads CSV data from an S3 bucket, drops a column that isn't needed, and writes the result back to S3 in Parquet format.


python


datasource = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://your-bucket/input-data"]},
    format="csv",
    format_options={"withHeader": True}
)

# Drop a column that isn't needed downstream
transformed = datasource.drop_fields(["unnecessary_column"])

# Write the result back to S3 as Parquet
glueContext.write_dynamic_frame.from_options(
    frame=transformed,
    connection_type="s3",
    connection_options={"path": "s3://your-bucket/output-data"},
    format="parquet"
)
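
Because a DynamicFrame converts freely to and from a Spark DataFrame, you can mix Glue transforms with plain PySpark wherever that is more convenient. A small sketch continuing from the code above (the "status" column is hypothetical):

python

from awsglue.dynamicframe import DynamicFrame

# Convert to a Spark DataFrame for DataFrame-style filtering, then back to a DynamicFrame
df = transformed.toDF()
filtered_df = df.filter(df["status"] == "active")  # "status" is a placeholder column
filtered = DynamicFrame.fromDF(filtered_df, glueContext, "filtered")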

Step 5: Convert to a Glue Job (Optional)

Once your notebook code is ready, you can convert it into a Glue job script:

  1. Export the notebook as a .py file.
  2. Upload the script to S3.
  3. Create a Glue job in the console (or via boto3, as sketched below) that points to the script.
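
If you would rather create the job programmatically, a minimal sketch with boto3 follows. The job name, script location, and capacity settings are assumptions; point ScriptLocation at the script you uploaded and reuse the IAM role from Step 1:

python

import boto3

glue = boto3.client("glue")

# Job name, role, script path, and sizing are placeholders
glue.create_job(
    Name="my-etl-job",
    Role="GlueInteractiveSessionRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://your-bucket/scripts/my_etl_job.py",
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    WorkerType="G.1X",
    NumberOfWorkers=2,
)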


Best Practices

  • Use Glue job bookmarks to track already-processed data (see the sketch after this list).
  • Store configurations in parameterized cells for reusability.
  • Use logging to monitor job performance.
  • Leverage Git integration in SageMaker Studio for version control.
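
To make the first two points concrete: job bookmarks only work when each source carries a transformation_ctx and the script commits a Job object at the end of the run. The sketch below shows that pattern, plus basic logging, inside a job script; the bucket path is a placeholder, and JOB_NAME is supplied automatically when the script runs as a Glue job.

python

import sys
import logging

from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Resolve standard job arguments
args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# transformation_ctx lets bookmarks remember which files were already processed
datasource = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://your-bucket/input-data"]},
    format="csv",
    format_options={"withHeader": True},
    transformation_ctx="datasource",
)
logger.info("Read %d records", datasource.count())

job.commit()  # records the bookmark checkpoint for this run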


Conclusion

Using SageMaker notebooks for AWS Glue development combines the best of both worlds: the power of Glue’s managed ETL service and the interactivity of Jupyter notebooks. This approach is ideal for data engineers and scientists building sophisticated, scalable data pipelines with full control and flexibility.
