In PySpark, understanding Transformations and Actions is essential because Spark is built around a concept called lazy evaluation.
PySpark runs on Apache Spark, which processes data only when a result is actually needed.
What are Transformations?
Transformations are operations that:
- Create a new DataFrame or RDD
- Do NOT execute immediately
- Are lazily evaluated
Spark simply records the transformation steps but does not run them until an action is called.
Common Transformations
select(), filter(), groupBy(), withColumn(), drop(), join(), orderBy()
Example of Transformation
df_filtered = df.filter(df.salary > 50000)
At this stage, Spark does NOT process the data.
It only remembers the instruction.
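Several transformations can be chained together and Spark will still not touch the data. A minimal sketch (assuming a DataFrame df with name, department, and salary columns):
# Each step only adds to the plan; nothing runs yet
df_filtered = df.filter(df.salary > 50000)
df_selected = df_filtered.select("name", "department", "salary")
df_sorted = df_selected.orderBy("salary", ascending=False)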
What are Actions?
Actions are operations that:
- Trigger execution
- Return results
- Perform computation
When an action is called, Spark executes all previous transformations.
Common Actions
show(), count(), collect(), first(), take(), write()
Example of Action
df_filtered.show()
Now Spark processes the data and displays results.
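Actions can also return values directly to the driver program. For example, count() runs the pending transformations and hands back a number (a small sketch, reusing df_filtered from above):
row_count = df_filtered.count()  # executes the filter and returns an int
print(row_count)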
Lazy Evaluation (Important Concept)
Spark follows lazy evaluation:
Transformations → Stored in DAG → Action Called → Execution Starts
Spark builds a logical execution plan (DAG – Directed Acyclic Graph) and optimizes it before running.
This makes Spark:
- Faster
- More efficient
- Optimized automatically
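You can inspect the plan Spark has built, without triggering execution, by calling explain() on a DataFrame. A minimal sketch (the exact plan text varies by Spark version):
# Prints the logical and physical plans; no data is processed
df_filtered.explain(True)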
Types of Transformations
Narrow Transformations
- Data remains in the same partition
- No shuffling required
- Faster
Examples:
select(), filter()
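One way to see that narrow transformations keep data in place is to compare partition counts before and after, for example with rdd.getNumPartitions() (a rough sketch, assuming the df with a salary column from the earlier example):
print(df.rdd.getNumPartitions())                              # e.g. 8
print(df.filter(df.salary > 50000).rdd.getNumPartitions())    # still 8: no shuffle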
Wide Transformations
- Data moves across partitions
- Shuffling occurs
- Slower than narrow transformations
Examples:
groupBy(), join(), distinct()
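A wide transformation shows up as a shuffle (an Exchange step) in the physical plan. As a rough sketch (the column name here is just a placeholder, and the plan text depends on the Spark version):
# groupBy needs rows with the same key on the same partition,
# so the plan contains an Exchange (shuffle) step
df.groupBy("department").count().explain()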
Practical Example
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Example").getOrCreate()

df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Transformations
df_filtered = df.filter(df.amount > 1000)
df_grouped = df_filtered.groupBy("region").sum("amount")

# Action
df_grouped.show()
Execution starts only when show() is called.
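Replacing show() with a write is also an action and triggers the same plan. A small sketch (the output path and format here are placeholders):
# Writing the grouped result to disk executes the full pipeline
df_grouped.write.mode("overwrite").parquet("output/region_totals")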
Real-World Flow
Load Data → Apply Transformations → Store Plan → Call Action → Spark Executes
Example:
Raw Sales Data → Filter High Sales → Group by Region → Show Results
Interview Tip
Common interview question:
“Why is Spark faster than traditional systems?”
Answer:
Because Spark uses in-memory processing and lazy evaluation with optimized execution plans.
Final Takeaway
- Transformations define what to do
- Actions define when to execute
- Spark executes only when required
Understanding this concept is essential for building optimized Big Data pipelines with PySpark.