In big data and advanced analytics, PySpark has emerged as a powerful tool for processing large datasets and analyzing distributed data. Deploying PySpark applications on the cloud can be a game-changer, offering scalability and flexibility for data-intensive tasks. Amazon Web Services (AWS) provides an ideal platform for such deployments, and when combined with Docker containers, it becomes a seamless and efficient solution.
However, deploying PySpark on cloud infrastructure can be complex and daunting. The intricacies of setting up a distributed computing environment, configuring Spark clusters, and managing resources often deter many from harnessing its full potential.
Before embarking on the journey to deploy PySpark on AWS using Docker, ensure that you have the following prerequisites in place:
🚀 Local PySpark Installation: To develop and test PySpark applications, it’s essential to have PySpark installed on your local machine. You can install PySpark by following the official documentation for your operating system. This local installation will serve as your development environment, allowing you to write and test PySpark code before deploying it on AWS.
🌐 AWS Account: You’ll need an active AWS (Amazon Web Services) account to access the cloud infrastructure and services required for PySpark deployment. You can sign up on the AWS website if you don’t have an AWS account. Be prepared to provide your payment information, although AWS offers a free tier with limited resources for new users.
🐳 Docker Installation: Docker is a pivotal component in this deployment process. Install Docker on your local machine by following the steps below (shown for Ubuntu). Docker containers will allow you to encapsulate and deploy your PySpark applications consistently.
1. Open your terminal and update your package manager:
sudo apt-get update
2. Install necessary dependencies:
sudo apt-get install -y apt-transport-https ca-certificates curl software-properties-common
3. Add Docker’s official GPG key:
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
4. Set up the Docker repository:
echo "deb [signed-by=/usr/share/keyrings/docker-archive-keyring.gpg]
https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" |
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
5. Update your package index again:
sudo apt-get update
6. Install Docker:
sudo apt-get install -y docker-ce docker-ce-cli containerd.io
7. Start and enable the Docker service:
sudo systemctl start docker
sudo systemctl enable docker
8. Verify the installation:
sudo docker --version
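Optionally, you can also confirm that the engine can actually run containers by using the small test image Docker publishes for exactly this purpose:
sudo docker run hello-world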
Amazon Web Services (AWS) is the backbone of our PySpark deployment, and we’ll use two essential services, Elastic Container Registry (ECR) and Elastic Compute Cloud (EC2), to create a dynamic cloud environment.
If you haven’t already, head to the AWS sign-up page to create an account. Please follow the registration process, provide the necessary information, and be ready with your payment details if you’d like to explore beyond the AWS Free Tier.
For those new to AWS, take advantage of the AWS Free Tier, which offers limited resources and services at no cost for 12 months. This is an excellent way to explore AWS without incurring charges.
You’ll need an Access Key ID and Secret Access Key to interact with AWS programmatically. To generate them, open the IAM console, select your IAM user, go to the Security credentials tab, and choose Create access key; copy or download the key pair immediately, because the secret key is shown only once.
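Once you have the keys, a quick way to confirm they work (assuming you have the AWS CLI installed locally) is to configure a profile and query your identity:
aws configure                  # paste the Access Key ID, Secret Access Key, and default region when prompted
aws sts get-caller-identity    # prints your account ID and user ARN if the credentials are valid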
ECR is a managed Docker container registry service provided by AWS and will be our repository for storing Docker images. To set it up, open the ECR console, choose Create repository, give the repository a name (for example, your-ecr-repository-name), and note the repository URI shown after creation.
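If you prefer the command line over the console, the same repository can be created with the AWS CLI; the repository name and region below are the placeholder values used later in this article:
aws ecr create-repository --repository-name your-ecr-repository-name --region us-east-1
The repositoryUri in the command output contains the registry address (used later as AWS_ECR_LOGIN_URI) followed by the repository name.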
EC2 provides scalable computing capacity in the cloud and will host your PySpark application. To set up an EC2 instance, launch an Ubuntu server from the EC2 console, choose an instance type with enough memory for your Spark workload, and create or select a key pair so you can connect over SSH.
Important: after launching the instance, attach a security group that allows the inbound traffic you need, such as SSH on port 22 and HTTP on port 80 for the application container.
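As a rough command-line sketch, the same security-group rules can be added with the AWS CLI; the group ID below is a placeholder, and in practice you would restrict the source CIDR rather than opening the ports to the world:
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 22 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 80 --cidr 0.0.0.0/0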
Make a note of the following values from your AWS setup (the values shown here are placeholders):
AWS_ACCESS_KEY_ID: AKIAYOURSAMPLEACCESSKEY
AWS_ECR_LOGIN_URI: 123456789012.dkr.ecr.region.amazonaws.com
AWS_REGION: us-east-1
AWS_SECRET_ACCESS_KEY: YOURSAMPLESECRETACCESSKEY12345
ECR_REPOSITORY_NAME: your-ecr-repository-name
Now that you have your AWS setup values ready, it’s time to securely configure them in your GitHub repository using GitHub secrets and variables. This adds an extra layer of security and convenience to your PySpark deployment process.
Follow these steps to set up your AWS values: in your GitHub repository, go to Settings > Secrets and variables > Actions, click New repository secret, and add each of the keys listed above (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION, AWS_ECR_LOGIN_URI, and ECR_REPOSITORY_NAME) with its corresponding value.
With your AWS secrets securely stored in GitHub, you can easily reference them in your GitHub Actions workflows and securely access AWS services during deployment.
Your AWS setup values are now safely configured in your GitHub repository, making them readily available for your PySpark deployment workflow.
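If you use the GitHub CLI, the same secrets can also be set from the terminal; the values below are the placeholders from the list above and should be replaced with your real ones:
gh secret set AWS_ACCESS_KEY_ID --body "AKIAYOURSAMPLEACCESSKEY"
gh secret set AWS_SECRET_ACCESS_KEY --body "YOURSAMPLESECRETACCESSKEY12345"
gh secret set AWS_REGION --body "us-east-1"
gh secret set AWS_ECR_LOGIN_URI --body "123456789012.dkr.ecr.region.amazonaws.com"
gh secret set ECR_REPOSITORY_NAME --body "your-ecr-repository-name"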
To effectively deploy PySpark on AWS using Docker, it’s essential to grasp the structure of your project’s code. Let’s break down the components that make up the codebase:
├── .github
│ ├── workflows
│ │ ├── build.yml
├── airflow
├── configs
├── consumerComplaint
│ ├── cloud_storage
│ ├── components
│ ├── config
│ │ ├── py_sparkmanager.py
│ ├── constants
│ ├── data_access
│ ├── entity
│ ├── exceptions
│ ├── logger
│ ├── ml
│ ├── pipeline
│ ├── utils
├── output
│ ├── .png
├── prediction_data
├── research
│ ├── jupyter_notebooks
├── saved_models
│ ├── model.pkl
├── tests
├── venv
├── Dockerfile
├── app.py
├── requirements.txt
├── .gitignore
├── .dockerignore
import os

from dotenv import load_dotenv
from pyspark.sql import SparkSession

# Load environment variables from .env
load_dotenv()

access_key_id = os.getenv("AWS_ACCESS_KEY_ID")
secret_access_key = os.getenv("AWS_SECRET_ACCESS_KEY")

# Initialize SparkSession
spark_session = SparkSession.builder.master('local[*]').appName('consumer_complaint') \
    .config("spark.executor.instances", "1") \
    .config("spark.executor.memory", "6g") \
    .config("spark.driver.memory", "6g") \
    .config("spark.executor.memoryOverhead", "8g") \
    .config('spark.jars.packages', "com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3") \
    .getOrCreate()

# Configure SparkSession for AWS S3 access
spark_session._jsc.hadoopConfiguration().set("fs.s3a.awsAccessKeyId", access_key_id)
spark_session._jsc.hadoopConfiguration().set("fs.s3a.awsSecretAccessKey", secret_access_key)
spark_session._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark_session._jsc.hadoopConfiguration().set("com.amazonaws.services.s3.enableV4", "true")
spark_session._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider")
spark_session._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "ap-south-1.amazonaws.com")
spark_session._jsc.hadoopConfiguration().set("fs.s3.buffer.dir", "tmp")
This code sets up your SparkSession, configures it for AWS S3 access, and loads AWS credentials from environment variables, allowing you to work with AWS services seamlessly in your PySpark application.
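As a quick usage sketch (the bucket and file name here are hypothetical), the configured session can read data straight from S3 through the s3a scheme:
df = spark_session.read.csv("s3a://your-bucket/consumer_complaints.csv", header=True, inferSchema=True)
df.show(5)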
This section will explore how to create Docker images that encapsulate your PySpark application, making it portable, scalable, and ready for deployment on AWS. Docker containers provide a consistent environment for your PySpark applications, ensuring seamless execution in various settings.
The key to building Docker images for PySpark is a well-defined Dockerfile. This file specifies the instructions for setting up the container environment, including Python and PySpark dependencies.
# Use an Ubuntu base image (Python 3 and pip are installed below, so a separate Python base image isn't needed)
FROM ubuntu:20.04
# Set JAVA_HOME and install OpenJDK 8
ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
RUN apt-get update -y \
&& apt-get install -y openjdk-8-jdk \
&& apt-get install python3-pip -y \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
# Set environment variables for your application
ENV AIRFLOW_HOME="/app/airflow"
ENV PYSPARK_PYTHON=/usr/bin/python3
ENV PYSPARK_DRIVER_PYTHON=/usr/bin/python3
# Create a directory for your application and set it as the working directory
WORKDIR /app
# Copy the contents of the current directory to the working directory in the container
COPY . /app
# Install Python dependencies from requirements.txt
RUN pip3 install -r requirements.txt
# Set the entry point to run your app.py script
CMD ["python3", "app.py"]
Once you have your Dockerfile ready, you can build the Docker image using the following command:
docker build -t your-image-name .
Replace your-image-name with the desired name (and optionally a tag, such as your-image-name:latest) for your Docker image; the trailing dot tells Docker to use the current directory as the build context.
After building the image, you can inspect what Docker now has on your machine:
docker images        # list local images
docker ps -a         # list all containers, including stopped ones
docker system df     # show disk space used by images, containers, and volumes
With your Docker image prepared, you can go ahead and run your PySpark application in a Docker container. Use the following command:
docker run your-image-name
If the application inside the container listens on a port (for example 8080), publish it with the -p flag so it is reachable from the host:
docker run -p 80:8080 your-image-name
docker run -p 8080:8080 your-image-name
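To check that the container started correctly, the standard Docker commands below help; replace the container name or ID with whatever docker ps reports:
docker ps                    # running containers and their published ports
docker logs <container-id>   # the application's stdout and stderr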
This section will walk through deploying your PySpark application on AWS using Docker containers. This deployment will involve launching Amazon Elastic Compute Cloud (EC2) instances for creating a PySpark cluster.
The EC2 instance and ECR repository needed here were set up in the AWS section above; the next step is to install Docker on the EC2 instance.
# Download the Docker installation script
curl -fsSL https://get.docker.com -o get-docker.sh

# Run the Docker installation script with root privileges
sudo sh get-docker.sh

# Add the current user to the docker group (replace 'ubuntu' with your username)
sudo usermod -aG docker ubuntu

# Activate the group change by starting a new shell session or using 'newgrp'
newgrp docker
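Before registering the runner, it's worth confirming that the ubuntu user can reach the Docker daemon without sudo:
docker info    # should print daemon details without a permission-denied error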
We’ll set up a self-hosted runner for GitHub Actions, responsible for executing your CI/CD workflows. A self-hosted runner runs on your infrastructure and is a good choice for running workflows that require specific configurations or access to local resources.
$ mkdir actions-runner && cd actions-runner
$ curl -o actions-runner-linux-x64-2.309.0.tar.gz -L https://github.com/actions/runner/releases/download/v2.309.0/actions-runner-linux-x64-2.309.0.tar.gz
$ echo "2974243bab2a282349ac833475d241d5273605d3628f0685bd07fb5530f9bb1a  actions-runner-linux-x64-2.309.0.tar.gz" | shasum -a 256 -c
$ tar xzf ./actions-runner-linux-x64-2.309.0.tar.gz
$ ./config.sh --url https://github.com/<OWNER>/<REPO> --token <TOKEN>   # register the runner; copy the exact command (with your token) from Settings > Actions > Runners > New self-hosted runner
$ ./run.sh
In a CI/CD pipeline, the build.yaml file is crucial in defining the steps required to build and deploy your application. This configuration file specifies the workflow for your CI/CD process, including how code is built, tested, and deployed. Let’s dive into the critical aspects of the build.yaml configuration and its importance:
The build.yaml file outlines the tasks executed during the CI/CD pipeline. It defines the steps for continuous integration, which involves building and testing your application, and for continuous delivery, where the application is deployed to various environments.
This phase typically includes tasks like code compilation, unit testing, and code quality checks. The build.yaml file specifies the tools, scripts, and commands required to perform these tasks. For example, it might trigger the execution of unit tests to ensure code quality.
After successful CI, the CD phase involves deploying the application to different environments, such as staging or production. The build.yaml file specifies how the deployment should happen, including where and when to deploy and which configurations to use.
The build.yaml file often includes details about project dependencies. It defines where to fetch external libraries or dependencies from, which can be crucial for the successful build and deployment of the application.
CI/CD workflows often require environment-specific configurations, such as API keys or connection strings. The build.yaml file may define how these environment variables are set for each pipeline stage.
In case of failures or issues during the CI/CD process, notifications and alerts are essential. The build.yaml file can configure how and to whom these alerts are sent, ensuring that problems are addressed promptly.
Depending on the CI/CD workflow, the build.yaml file may specify what artifacts or build outputs should be generated and where they should be stored. These artifacts can be used for deployments or further testing.
By understanding the build.yaml file and its components, you can effectively manage and customize your CI/CD workflow to meet the needs of your project. It is the blueprint for the entire automation process, from code changes to production deployments.
You can customize this configuration further based on the specific details of your build.yaml file and how it fits into your CI/CD pipeline.
name: workflow

on:
  push:
    branches:
      - main
    paths-ignore:
      - 'README.md'

permissions:
  id-token: write
  contents: read

jobs:
  integration:
    name: Continuous Integration
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
        uses: actions/checkout@v3

      - name: Lint code
        run: echo "Linting repository"

      - name: Run unit tests
        run: echo "Running unit tests"

  build-and-push-ecr-image:
    name: Continuous Delivery
    needs: integration
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
        uses: actions/checkout@v3

      - name: Install Utilities
        run: |
          sudo apt-get update
          sudo apt-get install -y jq unzip

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_REGION }}

      - name: Login to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v1

      - name: Build, tag, and push image to Amazon ECR
        id: build-image
        env:
          ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
          ECR_REPOSITORY: ${{ secrets.ECR_REPOSITORY_NAME }}
          IMAGE_TAG: latest
        run: |
          # Build a docker container and push it to ECR
          # so that it can be deployed on EC2.
          docker build -t $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG .
          docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
          echo "::set-output name=image::$ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG"

  Continuous-Deployment:
    needs: build-and-push-ecr-image
    runs-on: self-hosted
    steps:
      - name: Checkout
        uses: actions/checkout@v3

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_REGION }}

      - name: Login to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v1

      - name: Pull latest images
        run: |
          docker pull ${{ secrets.AWS_ECR_LOGIN_URI }}/${{ secrets.ECR_REPOSITORY_NAME }}:latest

      - name: Stop and remove sensor container if running
        run: |
          # '|| true' keeps this step from failing when no sensor container exists yet
          docker ps -q --filter "name=sensor" | grep -q . && docker stop sensor && docker rm -fv sensor || true

      - name: Run Docker Image to serve users
        run: |
          docker run -d -p 80:8080 --name=sensor -e 'AWS_ACCESS_KEY_ID=${{ secrets.AWS_ACCESS_KEY_ID }}' -e 'AWS_SECRET_ACCESS_KEY=${{ secrets.AWS_SECRET_ACCESS_KEY }}' -e 'AWS_REGION=${{ secrets.AWS_REGION }}' ${{ secrets.AWS_ECR_LOGIN_URI }}/${{ secrets.ECR_REPOSITORY_NAME }}:latest

      - name: Clean previous images and containers
        run: |
          docker system prune -f
If you run into any issues, refer to the GitHub repository mentioned at the end of this article.
The Continuous-Deployment job runs on the self-hosted EC2 runner: it pulls the latest image from ECR, replaces any running application container, and starts the new one with the AWS credentials supplied as environment variables.
To make the entire CI/CD process seamless and responsive to code changes, you can configure your repository to trigger the workflow upon code commits automatically or pushes. Every time you save and push changes to your repository, the CI/CD pipeline will start working its magic.
By automating the workflow execution, you ensure that your application remains up-to-date with the latest changes without manual intervention. This automation can significantly improve development efficiency and provide rapid feedback on code changes, making it easier to catch and resolve issues early in the development cycle.
To set up automated workflow execution on code changes, follow these steps:
git add .
git commit -m "message"
git push origin main
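After the push, you can follow the pipeline's progress on the repository's Actions tab, or from the terminal if you use the GitHub CLI (purely optional):
gh run list --limit 5   # most recent workflow runs and their status
gh run watch            # pick a run and follow it until it completes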
In this comprehensive guide, we’ve walked you through the intricate process of deploying PySpark on AWS using EC2 and ECR. Utilizing containerization and continuous integration and delivery, this approach provides a robust and adaptable solution for managing large-scale data analytics and processing tasks. By following the steps outlined in this blog, you can harness the full power of PySpark in a cloud environment, taking advantage of the scalability and flexibility AWS offers.
It’s important to note that AWS offers many deployment options, from EC2 and ECR to specialized services like EMR, and the right choice ultimately depends on the unique requirements of your project. Whether you prefer the containerization approach demonstrated here or opt for a different AWS service, the key is to leverage the capabilities of PySpark effectively in your data-driven applications. With AWS as your platform, you’re well-equipped to unlock the full potential of PySpark for large-scale data analytics and processing.
Q1. What is PySpark, and why deploy it on AWS?
A. PySpark is the Python API for Apache Spark, a robust framework for large-scale data processing. Deploying PySpark on AWS offers scalable and flexible solutions for data-intensive tasks, making it an ideal choice for distributed data analysis.
Q2. Can I run PySpark locally instead of deploying it to the cloud?
A. While you can run PySpark locally, cloud deployment is recommended for handling large datasets efficiently. AWS provides the infrastructure and tools needed for scaling PySpark applications.
Q3. How do I keep my AWS credentials secure in the CI/CD pipeline?
A. Use GitHub Secrets to store AWS credentials and securely access them in your workflow. This ensures your credentials remain protected and are not exposed in your code.
Q4. Why use Docker containers for PySpark deployment?
A. Docker containers offer a consistent environment across different platforms, ensuring your PySpark application runs the same way in development, testing, and production. They also simplify the process of building and deploying PySpark applications.
Q5. How much does it cost to run PySpark on AWS?
A. The cost of running PySpark on AWS depends on various factors, including the type and number of EC2 instances used, data storage, data transfer, and more. It is essential to monitor your AWS usage and optimize resources to manage costs efficiently.