PySpark Basics

PySpark is the Python API for Apache Spark. It allows you to process large-scale data using Python while leveraging Spark’s distributed computing power.

PySpark is widely used in Data Engineering, Big Data processing, and Machine Learning workflows.

Why Use PySpark?

PySpark is useful when:

  • Data is too large for Pandas or Excel
  • You need distributed processing
  • You are working with Big Data systems
  • You want to integrate with Hadoop or cloud platforms

Installing PySpark

You can install PySpark using pip:

pip install pyspark

Or use it in environments like:

  • Jupyter Notebook
  • Google Colab
  • Databricks

Creating a Spark Session

The first step in any PySpark program is creating a SparkSession.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PySpark Basics") \
    .getOrCreate()

SparkSession is the entry point for working with data in Spark.
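
As a quick sanity check, you can print the version of the active session; when the application is finished, spark.stop() shuts the session down:

print(spark.version)  # version string of the running Spark session
# spark.stop()        # call this when you are done to release resources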

Reading Data in PySpark

Read CSV File

df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.show()

Read JSON File

df = spark.read.json("data.json")
df.show()

Understanding DataFrames

In PySpark, the main data structure is a DataFrame.

It is similar to a Pandas DataFrame but distributed across multiple machines.
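
You can also build a DataFrame directly from in-memory Python data, which is handy for quick experiments. The column names below are invented for illustration (and match the examples that follow):

data = [("Alice", "Engineering", 72000),
        ("Bob", "Sales", 48000)]
df = spark.createDataFrame(data, ["name", "department", "salary"])
df.show()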

Display Schema

df.printSchema()

Show Data

df.show()

Basic DataFrame Operations

Select Columns

df.select("name", "salary").show()

Filter Data

df.filter(df.salary > 50000).show()

Group By

df.groupBy("department").count().show()
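
groupBy() supports richer aggregations than counting. This sketch assumes the salary column from the earlier examples:

from pyspark.sql import functions as F

df.groupBy("department").agg(F.avg("salary").alias("avg_salary")).show()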

Add New Column

from pyspark.sql.functions import col

df = df.withColumn("bonus", col("salary") * 0.10)
df.show()

Transformations vs Actions

Transformations

Operations that create a new DataFrame (lazy execution).

Examples:

  • select()
  • filter()
  • groupBy()

Actions

Operations that trigger execution.

Examples:

  • show()
  • count()
  • collect()

Spark follows lazy evaluation, meaning it waits until an action is called before executing transformations.
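
To see this in practice: in the snippet below, the first two lines only record the plan, and no data is read or processed until show() is called (column names follow the earlier examples):

high_paid = df.filter(df.salary > 50000)           # transformation: nothing runs yet
by_dept = high_paid.groupBy("department").count()  # transformation: still nothing runs
by_dept.show()                                     # action: Spark now executes the whole plan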

Writing Data

Save as CSV

df.write.csv("output_folder", header=True)

Save as Parquet

df.write.parquet("output_folder")

Parquet is a columnar storage format optimized for Big Data processing.
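
Parquet also stores the schema alongside the data, so reading it back needs no inferSchema step. The mode("overwrite") call below is optional and simply replaces any existing output folder:

df.write.mode("overwrite").parquet("output_folder")
df2 = spark.read.parquet("output_folder")
df2.printSchema()  # the schema comes back exactly as it was written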

PySpark vs Pandas

Pandas:

  • Works on a single machine
  • Best for small to medium datasets

PySpark:

  • Distributed processing
  • Handles massive datasets
  • Scalable
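
The two also interoperate: if an aggregated Spark result is small enough to fit in driver memory, you can hand it to Pandas for local analysis:

pandas_df = df.toPandas()  # collects all rows to the driver; only safe for small results
print(pandas_df.head())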

Real-World Use Case

Example workflow:

Raw Sales Data → Clean with PySpark → Aggregate Revenue → Store in Data Warehouse → Visualize in Power BI
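
A minimal sketch of that pipeline in PySpark, with the file path and column names invented for illustration:

# raw sales data -> clean -> aggregate revenue -> store for the warehouse
raw = spark.read.csv("raw_sales.csv", header=True, inferSchema=True)
clean = raw.dropna(subset=["order_id", "amount"])             # drop incomplete records
revenue = clean.groupBy("region").sum("amount")               # total revenue per region
revenue.write.mode("overwrite").parquet("warehouse/revenue")  # warehouse/BI tools read from here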

Final Takeaway

PySpark allows Python developers to work with Big Data efficiently using distributed computing.

Mastering PySpark is essential for building scalable data pipelines and becoming a Data Engineer.
