Transformations and Actions

In PySpark, understanding transformations and actions is essential because Spark relies on lazy evaluation: work is deferred until a result is actually needed.

PySpark is built on Apache Spark, which processes data only when necessary.

What are Transformations?

Transformations are operations that:

  • Create a new DataFrame or RDD
  • Do NOT execute immediately
  • Are lazily evaluated

Spark simply records the transformation steps but does not run them until an action is called.
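A plain-Python analogy can make this concrete. A generator expression, like a Spark transformation, only records a recipe; nothing is computed until you consume it. (Spark DataFrames are not generators, but the lazy principle is the same.)

```python
numbers = range(1_000_000)

# Like a transformation: this records the filtering logic but computes nothing yet.
high = (n for n in numbers if n > 999_997)

# Like an action: consuming the generator triggers the actual work.
result = list(high)
print(result)  # [999998, 999999]
```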

Common Transformations

  • select()
  • filter()
  • groupBy()
  • withColumn()
  • drop()
  • join()
  • orderBy()

Example of Transformation

df_filtered = df.filter(df.salary > 50000)

At this stage, Spark does NOT process the data.
It only remembers the instruction.

What are Actions?

Actions are operations that:

  • Trigger execution
  • Return results
  • Perform computation

When an action is called, Spark executes all previous transformations.

Common Actions

  • show()
  • count()
  • collect()
  • first()
  • take()
  • write.save() / write.csv() (df.write returns a DataFrameWriter; its save methods trigger execution)

Example of Action

df_filtered.show()

Now Spark processes the data and displays results.

Lazy Evaluation (Important Concept)

Spark follows lazy evaluation:

Transformations → Stored in DAG → Action Called → Execution Starts

Spark builds a logical execution plan (DAG – Directed Acyclic Graph) and optimizes it before running.

This makes Spark:

  • Faster
  • More efficient
  • Optimized automatically

Types of Transformations

Narrow Transformations

  • Data remains in the same partition
  • No shuffling required
  • Faster

Examples:

  • select()
  • filter()

Wide Transformations

  • Data moves across partitions
  • Shuffling occurs
  • Slower than narrow transformations

Examples:

  • groupBy()
  • join()
  • distinct()

Practical Example

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Example").getOrCreate()
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Transformations
df_filtered = df.filter(df.amount > 1000)
df_grouped = df_filtered.groupBy("region").sum("amount")

# Action
df_grouped.show()

Execution starts only when show() is called.

Real-World Flow

Load Data → Apply Transformations → Store Plan → Call Action → Spark Executes

Example:

Raw Sales Data → Filter High Sales → Group by Region → Show Results

Interview Tip

Common interview question:

“Why is Spark faster than traditional systems?”

Answer:
Because Spark uses in-memory processing and lazy evaluation with optimized execution plans.

Final Takeaway

  • Transformations define what to do
  • Actions define when to execute
  • Spark executes only when required

Understanding this concept is essential for building optimized Big Data pipelines with PySpark.
