Scheduling pipelines is one of the most important features of Apache Airflow. It allows you to automatically run workflows at specific times or intervals without manual intervention.
With proper scheduling, you can automate daily reports, hourly data refreshes, weekly backups, and more.
What is Pipeline Scheduling?
Pipeline scheduling means defining:
- When a workflow should start
- How often it should run
- Whether missed runs should be executed
In Airflow, scheduling is controlled inside the DAG definition.
Key Scheduling Parameters
1. start_date
Defines when the scheduler should begin triggering the DAG.
Example:
start_date=datetime(2024, 1, 1)
Airflow will not schedule runs before this date. Note that a run for a given data interval is triggered only after that interval ends: with a daily schedule and this start_date, the first run fires at midnight on January 2, 2024, covering January 1.
2. schedule_interval
Defines how often the DAG runs. (In Airflow 2.4+, schedule_interval has been superseded by the unified schedule parameter, but it remains common in 2.x code.)
Common options:
- '@once' → Run only once
- '@hourly' → Every hour, on the hour
- '@daily' → Every day at midnight
- '@weekly' → Every week, at midnight on Sunday
- '@monthly' → Every month, at midnight on the first day
Example:
schedule_interval='@daily'
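Besides the presets, schedule_interval also accepts a datetime.timedelta for fixed-frequency runs. A minimal sketch (the dag_id is illustrative):
from datetime import datetime, timedelta
from airflow import DAG

# Runs every 6 hours instead of on a cron-style calendar schedule
dag = DAG(
    dag_id='every_six_hours',
    start_date=datetime(2024, 1, 1),
    schedule_interval=timedelta(hours=6),
    catchup=False,
)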
3. Cron Expressions
For custom schedules, use cron format:
schedule_interval='0 6 * * *'
This means:
Run every day at 6:00 AM.
Cron Format Structure:
Minute  Hour  Day of month  Month  Day of week
0       6     *             *      *
Examples:
- '0 0 * * *' → Midnight daily
- '0 */2 * * *' → Every 2 hours
- '0 9 * * 1' → Every Monday at 9 AM
4. catchup
Determines whether Airflow should run missed schedules.
catchup=False
If set to True:
Airflow will create a run for every missed interval between start_date and the current time.
If set to False:
Airflow schedules only the most recent interval and skips the backlog.
In most modern pipelines, catchup is set to False.
Example: Daily ETL Pipeline
from datetime import datetime
from airflow import DAG

dag = DAG(
    dag_id='daily_sales_pipeline',
    start_date=datetime(2024, 1, 1),
    schedule_interval='@daily',
    catchup=False,
)
This pipeline will:
- Run once per day
- Not execute past missed runs
- Start scheduling from Jan 1, 2024
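For completeness, here is a minimal runnable sketch of the same pipeline with one task attached; the function body and task name are placeholders:
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def load_sales():
    # Placeholder for the actual extract/transform/load logic
    print("Loading daily sales data")

with DAG(
    dag_id='daily_sales_pipeline',
    start_date=datetime(2024, 1, 1),
    schedule_interval='@daily',
    catchup=False,
) as dag:
    load = PythonOperator(
        task_id='load_sales',
        python_callable=load_sales,
    )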
Types of Scheduling in Real Projects
- Time-Based Scheduling: e.g., a daily sales refresh at 2 AM
- Event-Based Scheduling: triggered after another DAG finishes
- Manual Trigger: run on demand from the Web UI
- Dataset/Dependency-Based Scheduling: run when upstream data becomes available (see the sketch after this list)
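As an illustration of the last option, Airflow 2.4+ supports data-aware scheduling through Datasets: a consumer DAG runs whenever a producer task updates a dataset it declares as an outlet. A sketch with hypothetical DAG ids and a hypothetical URI (note that dataset schedules use the newer schedule parameter):
from datetime import datetime
from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.python import PythonOperator

sales_data = Dataset("s3://my-bucket/sales.csv")  # hypothetical URI

# Producer: marks the dataset as updated when this task succeeds
with DAG(dag_id='produce_sales', start_date=datetime(2024, 1, 1),
         schedule_interval='@daily', catchup=False):
    PythonOperator(task_id='write_sales', python_callable=lambda: None,
                   outlets=[sales_data])

# Consumer: runs whenever sales_data is updated, not on a time schedule
with DAG(dag_id='consume_sales', start_date=datetime(2024, 1, 1),
         schedule=[sales_data], catchup=False):
    PythonOperator(task_id='read_sales', python_callable=lambda: None)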
Backfilling
Backfilling allows running a DAG for past dates manually.
Used when:
- Historical data needs processing
- A bug was fixed and past data must be reloaded
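In Airflow 2, a backfill is typically launched from the command line. A sketch, assuming the daily_sales_pipeline DAG defined above:
airflow dags backfill \
    --start-date 2024-01-01 \
    --end-date 2024-01-07 \
    daily_sales_pipeline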
Time Zones in Scheduling
Airflow is timezone-aware. Always:
- Set an explicit timezone on your DAGs
- Stay consistent with the scheduler's timezone
In production systems, UTC is commonly used; it is also Airflow's default.
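The Airflow documentation recommends pendulum for timezone-aware start dates. A minimal sketch (the dag_id is illustrative):
import pendulum
from airflow import DAG

dag = DAG(
    dag_id='tz_aware_pipeline',
    # An aware datetime; Airflow stores and schedules in UTC internally
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule_interval='@daily',
    catchup=False,
)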
Best Practices for Scheduling
- Avoid very frequent schedules unless necessary
- Set catchup carefully
- Use a static, meaningful start_date (avoid dynamic values like datetime.now())
- Test DAG manually before production scheduling
- Monitor execution time and failures
Interview Answer (Short Version)
Scheduling pipelines in Apache Airflow involves setting the start_date, schedule_interval, and catchup parameters inside a DAG to control when and how often workflows run.
Final Summary
Scheduling in Apache Airflow allows you to:
- Automate workflows
- Control execution frequency
- Handle missed runs
- Run pipelines reliably
Proper scheduling ensures smooth and automated data engineering operations.