Scheduling Pipelines

Scheduling pipelines is one of the most important features of Apache Airflow. It allows you to automatically run workflows at specific times or intervals without manual intervention.

With proper scheduling, you can automate daily reports, hourly data refreshes, weekly backups, and more.

What is Pipeline Scheduling?

Pipeline scheduling means defining:

  • When a workflow should start
  • How often it should run
  • Whether missed runs should be executed

In Airflow, scheduling is controlled inside the DAG definition.

Key Scheduling Parameters

1. start_date

Defines when the scheduler should begin triggering the DAG.

Example:

start_date=datetime(2024, 1, 1)

Airflow will not schedule runs before this date.
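Note that a classic Airflow schedule triggers a run only after its data interval has fully elapsed, so with a daily schedule the first run actually executes one interval after start_date. A minimal stdlib sketch of that timing rule:

```python
from datetime import datetime, timedelta

start_date = datetime(2024, 1, 1)
interval = timedelta(days=1)  # the equivalent of '@daily'

# Airflow triggers a run once its data interval has closed, so the
# first daily run covering start_date executes one day later.
first_run_executes_at = start_date + interval
print(first_run_executes_at)  # 2024-01-02 00:00:00
```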

2. schedule_interval

Defines how often the DAG runs.

Common options:

  • '@once' → Run only once
  • '@hourly' → Every hour
  • '@daily' → Every day
  • '@weekly' → Every week
  • '@monthly' → Every month

Example:

schedule_interval='@daily'
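Each of these presets is shorthand for a cron expression. The mapping below mirrors the equivalents documented by Airflow:

```python
# Cron equivalents of the Airflow schedule presets listed above.
PRESET_CRON = {
    "@hourly":  "0 * * * *",   # top of every hour
    "@daily":   "0 0 * * *",   # midnight every day
    "@weekly":  "0 0 * * 0",   # midnight every Sunday
    "@monthly": "0 0 1 * *",   # midnight on the first of the month
}
```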

3. Cron Expressions

For custom schedules, use cron format:

schedule_interval='0 6 * * *'

This means:
Run every day at 6:00 AM.

Cron Format Structure:

Minute  Hour  Day of Month  Month  Day of Week
0       6     *             *     *

Examples:

  • '0 0 * * *' → Midnight daily
  • '0 */2 * * *' → Every 2 hours
  • '0 9 * * 1' → Every Monday at 9 AM
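To make the five-field format concrete, here is a toy matcher (the function names are illustrative, not part of Airflow) that handles the '*', '*/n', and plain-number forms used in the examples above:

```python
def field_matches(field: str, value: int) -> bool:
    """Check one cron field against a value ('*', '*/n', or a plain number)."""
    if field == "*":
        return True
    if field.startswith("*/"):
        return value % int(field[2:]) == 0
    return value == int(field)

def matches_cron(expr: str, minute: int, hour: int, day: int, month: int, weekday: int) -> bool:
    """Toy matcher for five-field cron expressions (no ranges or lists)."""
    fields = expr.split()
    return all(field_matches(f, v)
               for f, v in zip(fields, (minute, hour, day, month, weekday)))

# '0 6 * * *' matches 06:00 on any date
print(matches_cron("0 6 * * *", minute=0, hour=6, day=15, month=3, weekday=2))  # True
```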

4. catchup

Determines whether Airflow should run missed schedules.

catchup=False

If set to True:
Airflow will run every missed interval between start_date and the current time.

If set to False:
Airflow schedules only the most recent interval and skips the rest.

In most modern pipelines, catchup is set to False.
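The effect of the flag can be sketched in plain Python. This is a simplified model, not Airflow's actual scheduler code: it lists the intervals that have fully elapsed and keeps either all of them or only the latest:

```python
from datetime import datetime, timedelta

def runs_to_schedule(start_date, now, interval, catchup):
    """Simplified model of how catchup decides which past intervals to run."""
    due = []
    point = start_date
    while point + interval <= now:  # an interval is due once it has fully elapsed
        due.append(point)
        point += interval
    if catchup:
        return due       # every missed interval since start_date
    return due[-1:]      # only the most recent interval

start = datetime(2024, 1, 1)
now = datetime(2024, 1, 5)
daily = timedelta(days=1)

print(len(runs_to_schedule(start, now, daily, catchup=True)))   # 4 (Jan 1 through Jan 4)
print(runs_to_schedule(start, now, daily, catchup=False))       # [datetime(2024, 1, 4, 0, 0)]
```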

Example: Daily ETL Pipeline

from datetime import datetime
from airflow import DAG

dag = DAG(
    dag_id='daily_sales_pipeline',
    start_date=datetime(2024, 1, 1),
    schedule_interval='@daily',
    catchup=False,
)

This pipeline will:

  • Run once per day
  • Not execute past missed runs
  • Start scheduling from Jan 1, 2024

Types of Scheduling in Real Projects

  1. Time-Based Scheduling
    Example: Daily sales refresh at 2 AM
  2. Event-Based Scheduling
    Triggered after another DAG finishes
  3. Manual Trigger
    Run from Web UI
  4. Dataset/Dependency-Based Scheduling
    Run when upstream data becomes available

Backfilling

Backfilling allows running a DAG for past dates manually.

Used when:

  • Historical data needs processing
  • A bug was fixed and past data must be reloaded
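Backfills are usually started from the command line. Assuming the dag_id from the earlier example, a backfill over the first week of January might look like this (sketch only; check your Airflow version's CLI help for the exact flags):

```shell
# Re-run daily_sales_pipeline for a range of past dates
airflow dags backfill \
    --start-date 2024-01-01 \
    --end-date 2024-01-07 \
    daily_sales_pipeline
```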

Time Zones in Scheduling

Airflow works with time zones. Always:

  • Set proper timezone
  • Be consistent with server time

In production systems, UTC is commonly used.
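One practical way to stay consistent is to make start_date timezone-aware. Airflow ships pendulum for this, but a plain stdlib-aware datetime illustrates the idea:

```python
from datetime import datetime, timezone

# A timezone-aware start_date pins the schedule to UTC explicitly,
# instead of depending on the scheduler machine's local time.
start_date = datetime(2024, 1, 1, tzinfo=timezone.utc)

print(start_date.isoformat())  # 2024-01-01T00:00:00+00:00
```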

Best Practices for Scheduling

  • Avoid very frequent schedules unless necessary
  • Set catchup carefully
  • Use meaningful start_date
  • Test DAG manually before production scheduling
  • Monitor execution time and failures

Interview Answer (Short Version)

Scheduling pipelines in Apache Airflow involves setting the start_date, schedule_interval, and catchup parameters inside a DAG to control when and how often workflows run.

Final Summary

Scheduling in Apache Airflow allows you to:

  • Automate workflows
  • Control execution frequency
  • Handle missed runs
  • Run pipelines reliably

Proper scheduling ensures smooth and automated data engineering operations.
