Transformations and Actions

In PySpark, understanding transformations and actions is essential because Spark relies on lazy evaluation: work is deferred until a result is actually needed.

PySpark is built on Apache Spark, which processes data only when necessary.

What are Transformations?

Transformations are operations that:

  • Create a new DataFrame or RDD
  • Do NOT execute immediately
  • Are lazily evaluated

Spark simply records the transformation steps but does not run them until an action is called.
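A plain-Python analogy can make this concrete. A generator expression, like a Spark transformation, only records a recipe; nothing is computed until you consume it. (Spark DataFrames are not generators, but the lazy principle is the same.)

```python
numbers = range(1_000_000)

# Like a transformation: this records the filtering logic but computes nothing yet.
high = (n for n in numbers if n > 999_997)

# Like an action: consuming the generator triggers the actual work.
result = list(high)
print(result)  # [999998, 999999]
```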

Common Transformations

  • select()
  • filter()
  • groupBy()
  • withColumn()
  • drop()
  • join()
  • orderBy()

Example of Transformation

df_filtered = df.filter(df.salary > 50000)

At this stage, Spark does NOT process the data.
It only remembers the instruction.

What are Actions?

Actions are operations that:

  • Trigger execution
  • Return results
  • Perform computation

When an action is called, Spark executes all previous transformations.

Common Actions

  • show()
  • count()
  • collect()
  • first()
  • take()
  • write.save() / write.csv() (df.write returns a DataFrameWriter; its save methods trigger execution)

Example of Action

df_filtered.show()

Now Spark processes the data and displays results.

Lazy Evaluation (Important Concept)

Spark follows lazy evaluation:

Transformations → Stored in DAG → Action Called → Execution Starts

Spark builds a logical execution plan (DAG – Directed Acyclic Graph) and optimizes it before running.

This makes Spark:

  • Faster
  • More efficient
  • Optimized automatically

Types of Transformations

Narrow Transformations

  • Data remains in the same partition
  • No shuffling required
  • Faster

Examples:

  • select()
  • filter()

Wide Transformations

  • Data moves across partitions
  • Shuffling occurs
  • Slower than narrow transformations

Examples:

  • groupBy()
  • join()
  • distinct()

Practical Example

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Example").getOrCreate()
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Transformations
df_filtered = df.filter(df.amount > 1000)
df_grouped = df_filtered.groupBy("region").sum("amount")

# Action
df_grouped.show()

Execution starts only when show() is called.

Real-World Flow

Load Data → Apply Transformations → Store Plan → Call Action → Spark Executes

Example:

Raw Sales Data → Filter High Sales → Group by Region → Show Results

Interview Tip

Common interview question:

“Why is Spark faster than traditional systems?”

Answer:
Because Spark uses in-memory processing and lazy evaluation with optimized execution plans.

Final Takeaway

  • Transformations define what to do
  • Actions define when to execute
  • Spark executes only when required

Understanding this concept is essential for building optimized Big Data pipelines with PySpark.
