Scheduling Pipelines

Scheduling pipelines is one of the most important features of Apache Airflow. It allows you to automatically run workflows at specific times or intervals without manual intervention.

With proper scheduling, you can automate daily reports, hourly data refreshes, weekly backups, and more.

What is Pipeline Scheduling?

Pipeline scheduling means defining:

  • When a workflow should start
  • How often it should run
  • Whether missed runs should be executed

In Airflow, scheduling is controlled inside the DAG definition.

Key Scheduling Parameters

1. start_date

Defines when the scheduler should begin triggering the DAG.

Example:

start_date=datetime(2024, 1, 1)

Airflow will not schedule runs before this date.
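Note that a classic Airflow schedule triggers a run only after its data interval has fully elapsed, so with a daily schedule the first run actually executes one interval after start_date. A minimal stdlib sketch of that timing rule:

```python
from datetime import datetime, timedelta

start_date = datetime(2024, 1, 1)
interval = timedelta(days=1)  # the equivalent of '@daily'

# Airflow triggers a run once its data interval has closed, so the
# first daily run covering start_date executes one day later.
first_run_executes_at = start_date + interval
print(first_run_executes_at)  # 2024-01-02 00:00:00
```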

2. schedule_interval

Defines how often the DAG runs.

Common options:

  • '@once' → Run only once
  • '@hourly' → Every hour
  • '@daily' → Every day
  • '@weekly' → Every week
  • '@monthly' → Every month

Example:

schedule_interval='@daily'
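Each of these presets is shorthand for a cron expression. The mapping below mirrors the equivalents documented by Airflow:

```python
# Cron equivalents of the Airflow schedule presets listed above.
PRESET_CRON = {
    "@hourly":  "0 * * * *",   # top of every hour
    "@daily":   "0 0 * * *",   # midnight every day
    "@weekly":  "0 0 * * 0",   # midnight every Sunday
    "@monthly": "0 0 1 * *",   # midnight on the first of the month
}
```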

3. Cron Expressions

For custom schedules, use cron format:

schedule_interval='0 6 * * *'

This means:
Run every day at 6:00 AM.

Cron Format Structure:

Minute  Hour  Day of Month  Month  Day of Week
0       6     *             *     *

Examples:

  • '0 0 * * *' → Midnight daily
  • '0 */2 * * *' → Every 2 hours
  • '0 9 * * 1' → Every Monday at 9 AM
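To make the five-field format concrete, here is a toy matcher (the function names are illustrative, not part of Airflow) that handles the '*', '*/n', and plain-number forms used in the examples above:

```python
def field_matches(field: str, value: int) -> bool:
    """Check one cron field against a value ('*', '*/n', or a plain number)."""
    if field == "*":
        return True
    if field.startswith("*/"):
        return value % int(field[2:]) == 0
    return value == int(field)

def matches_cron(expr: str, minute: int, hour: int, day: int, month: int, weekday: int) -> bool:
    """Toy matcher for five-field cron expressions (no ranges or lists)."""
    fields = expr.split()
    return all(field_matches(f, v)
               for f, v in zip(fields, (minute, hour, day, month, weekday)))

# '0 6 * * *' matches 06:00 on any date
print(matches_cron("0 6 * * *", minute=0, hour=6, day=15, month=3, weekday=2))  # True
```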

4. catchup

Determines whether Airflow should run missed schedules.

catchup=False

If set to True:
Airflow will run every missed interval between start_date and the current time.

If set to False:
Airflow schedules only the most recent interval and skips the rest.

In most modern pipelines, catchup is set to False.
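The effect of the flag can be sketched in plain Python. This is a simplified model, not Airflow's actual scheduler code: it lists the intervals that have fully elapsed and keeps either all of them or only the latest:

```python
from datetime import datetime, timedelta

def runs_to_schedule(start_date, now, interval, catchup):
    """Simplified model of how catchup decides which past intervals to run."""
    due = []
    point = start_date
    while point + interval <= now:  # an interval is due once it has fully elapsed
        due.append(point)
        point += interval
    if catchup:
        return due       # every missed interval since start_date
    return due[-1:]      # only the most recent interval

start = datetime(2024, 1, 1)
now = datetime(2024, 1, 5)
daily = timedelta(days=1)

print(len(runs_to_schedule(start, now, daily, catchup=True)))   # 4 (Jan 1 through Jan 4)
print(runs_to_schedule(start, now, daily, catchup=False))       # [datetime(2024, 1, 4, 0, 0)]
```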

Example: Daily ETL Pipeline

from datetime import datetime
from airflow import DAG

dag = DAG(
    dag_id='daily_sales_pipeline',
    start_date=datetime(2024, 1, 1),
    schedule_interval='@daily',
    catchup=False,
)

This pipeline will:

  • Run once per day
  • Not execute past missed runs
  • Start scheduling from Jan 1, 2024

Types of Scheduling in Real Projects

  1. Time-Based Scheduling
    Example: Daily sales refresh at 2 AM
  2. Event-Based Scheduling
    Triggered after another DAG finishes
  3. Manual Trigger
    Run from Web UI
  4. Dataset/Dependency-Based Scheduling
    Run when upstream data becomes available

Backfilling

Backfilling allows running a DAG for past dates manually.

Used when:

  • Historical data needs processing
  • A bug was fixed and past data must be reloaded
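Backfills are usually started from the command line. Assuming the dag_id from the earlier example, a backfill over the first week of January might look like this (sketch only; check your Airflow version's CLI help for the exact flags):

```shell
# Re-run daily_sales_pipeline for a range of past dates
airflow dags backfill \
    --start-date 2024-01-01 \
    --end-date 2024-01-07 \
    daily_sales_pipeline
```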

Time Zones in Scheduling

Airflow works with time zones. Always:

  • Set proper timezone
  • Be consistent with server time

In production systems, UTC is commonly used.
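One practical way to stay consistent is to make start_date timezone-aware. Airflow ships pendulum for this, but a plain stdlib-aware datetime illustrates the idea:

```python
from datetime import datetime, timezone

# A timezone-aware start_date pins the schedule to UTC explicitly,
# instead of depending on the scheduler machine's local time.
start_date = datetime(2024, 1, 1, tzinfo=timezone.utc)

print(start_date.isoformat())  # 2024-01-01T00:00:00+00:00
```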

Best Practices for Scheduling

  • Avoid very frequent schedules unless necessary
  • Set catchup carefully
  • Use meaningful start_date
  • Test DAG manually before production scheduling
  • Monitor execution time and failures

Interview Answer (Short Version)

Scheduling pipelines in Apache Airflow involves setting the start_date, schedule_interval, and catchup parameters inside a DAG to control when and how often workflows run.

Final Summary

Scheduling in Apache Airflow allows you to:

  • Automate workflows
  • Control execution frequency
  • Handle missed runs
  • Run pipelines reliably

Proper scheduling ensures smooth and automated data engineering operations.
