PySpark is the Python API for Apache Spark. It allows you to process large-scale data using Python while leveraging Spark’s distributed computing power.
PySpark is widely used in Data Engineering, Big Data processing, and Machine Learning workflows.
Why Use PySpark?
PySpark is useful when:
- Data is too large for Pandas or Excel
- You need distributed processing
- You are working with Big Data systems
- You want to integrate with Hadoop or cloud platforms
Installing PySpark
You can install PySpark using pip:
pip install pyspark
Or use it in environments like:
- Jupyter Notebook
- Google Colab
- Databricks
Creating a Spark Session
The first step in any PySpark program is creating a SparkSession.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
.appName("PySpark Basics") \
.getOrCreate()
SparkSession is the entry point for working with data in Spark.
Reading Data in PySpark
Read CSV File
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.show()
Read JSON File
df = spark.read.json("data.json")
df.show()
Understanding DataFrames
In PySpark, the main data structure is a DataFrame.
It is similar to a Pandas DataFrame but distributed across multiple machines.
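If you do not have a file handy, you can also build a small DataFrame directly from Python data. A minimal sketch follows; the column names and sample rows are invented for illustration, and the later examples assume similar columns:

# Build a small DataFrame from an in-memory list of tuples.
# The columns (name, department, salary) and rows are sample values for illustration.
data = [
    ("Alice", "Engineering", 85000),
    ("Bob", "Marketing", 62000),
    ("Cara", "Engineering", 91000),
]
df = spark.createDataFrame(data, ["name", "department", "salary"])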
Display Schema
df.printSchema()
Show Data
df.show()
Basic DataFrame Operations
Select Columns
df.select("name", "salary").show()
Filter Data
df.filter(df.salary > 50000).show()
Group By
df.groupBy("department").count().show()
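Beyond count(), groupBy() can compute aggregates per group. A short sketch, assuming the same department and salary columns as above:

from pyspark.sql.functions import avg

# Average salary per department
df.groupBy("department").agg(avg("salary").alias("avg_salary")).show()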
Add New Column
from pyspark.sql.functions import col

df = df.withColumn("bonus", col("salary") * 0.10)
df.show()
Transformations vs Actions
Transformations
Operations that create a new DataFrame (lazy execution).
Examples:
- select()
- filter()
- groupBy()
Actions
Operations that trigger execution.
Examples:
- show()
- count()
- collect()
Spark follows lazy evaluation, meaning it waits until an action is called before executing transformations.
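For example, in the sketch below (reusing the df from earlier), no data is processed until show() is called:

# Transformations only build up a query plan; nothing runs yet.
high_paid = df.filter(df.salary > 50000).select("name", "salary")

# The action triggers Spark to execute the plan.
high_paid.show()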
Writing Data
Save as CSV
df.write.csv("output_folder", header=True)
Save as Parquet
df.write.parquet("output_folder")
Parquet is a columnar storage format optimized for Big Data processing.
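Parquet output can be read back just as easily. A small sketch, with the folder name taken from the example above:

# Read the Parquet output back; the schema is stored in the files themselves.
df_parquet = spark.read.parquet("output_folder")
df_parquet.show()

# mode("overwrite") replaces the folder if it already exists.
df.write.mode("overwrite").parquet("output_folder")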
PySpark vs Pandas
Pandas:
- Works on a single machine
- Best for small to medium datasets
PySpark:
- Distributed processing
- Handles massive datasets
- Scalable
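The two also interoperate: a PySpark DataFrame can be converted to Pandas and back. A sketch, assuming the df from earlier is small enough to fit in the driver's memory:

# Collect a distributed DataFrame into a local Pandas DataFrame.
# Only do this when the data fits on a single machine.
pdf = df.toPandas()

# Convert a Pandas DataFrame back into a distributed Spark DataFrame.
df_spark = spark.createDataFrame(pdf)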
Real-World Use Case
Example workflow:
Raw Sales Data → Clean with PySpark → Aggregate Revenue → Store in Data Warehouse → Visualize in Power BI
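A rough sketch of what that pipeline might look like in code; the file paths, column names, and cleaning rules are all assumptions chosen for illustration:

from pyspark.sql.functions import col, sum as spark_sum

# Hypothetical input path and column names, for illustration only.
raw = spark.read.csv("raw_sales.csv", header=True, inferSchema=True)

# Clean: drop rows with missing values and keep only positive amounts.
clean = raw.dropna().filter(col("amount") > 0)

# Aggregate revenue per region.
revenue = clean.groupBy("region").agg(spark_sum("amount").alias("total_revenue"))

# Store as Parquet; a data warehouse or Power BI can pick it up from here.
revenue.write.mode("overwrite").parquet("sales_revenue")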
Final Takeaway
PySpark allows Python developers to work with Big Data efficiently using distributed computing.
Mastering PySpark is essential for building scalable data pipelines and becoming a Data Engineer.