Creating a DAG (Directed Acyclic Graph) is the first step in building workflows in Apache Airflow.
A DAG defines:
- What tasks to run
- In what order to run them
- When to run them
DAGs are written in Python and saved as .py files inside the Airflow dags folder.
Basic Structure of a DAG
A simple DAG includes:
- DAG definition
- Default arguments
- Tasks (Operators)
- Task dependencies
Step 1: Import Required Libraries
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
Step 2: Define Default Arguments
Default arguments are settings applied to all tasks.
default_args = {
    'owner': 'airflow',  # owner label shown in the Airflow UI
    'retries': 1         # retry each failed task once before marking it failed
}
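Other standard BaseOperator arguments can go in the same dictionary. A minimal sketch adding a retry delay (retry_delay is a real BaseOperator argument; the five-minute value is just an illustration):

from datetime import timedelta

default_args = {
    'owner': 'airflow',
    'retries': 1,
    'retry_delay': timedelta(minutes=5)  # wait 5 minutes between retry attempts
}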
Step 3: Create the DAG Object
dag = DAG(
    dag_id='simple_pipeline',
    default_args=default_args,
    start_date=datetime(2024, 1, 1),
    schedule_interval='@daily',
    catchup=False
)
Key Parameters:
- dag_id → Unique name of the DAG
- start_date → When scheduling begins
- schedule_interval → How often it runs (renamed to schedule in Airflow 2.4+)
- catchup → Whether to backfill runs for past, missed schedule intervals
Step 4: Define Tasks
Example Python function:
def say_hello():
    print("Hello from Airflow")
Create a task using PythonOperator:
task1 = PythonOperator(
    task_id='hello_task',
    python_callable=say_hello,
    dag=dag
)
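The same task can also be declared with the DAG as a context manager, which removes the need to pass dag=dag to every operator. A minimal sketch, reusing the objects defined above:

with DAG(
    dag_id='simple_pipeline',
    default_args=default_args,
    start_date=datetime(2024, 1, 1),
    schedule_interval='@daily',
    catchup=False
) as dag:
    # operators created inside the block attach to this DAG automatically
    task1 = PythonOperator(
        task_id='hello_task',
        python_callable=say_hello
    )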
Step 5: Set Task Dependencies
Dependencies define the order in which tasks run. A DAG with a single task needs no dependency operators; the task simply stands on its own:
task1
For multiple tasks, chain them with the bitshift operator:
task1 >> task2 >> task3
This means:
task1 runs first, then task2, then task3.
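The dependency operators also work with lists and in the upstream direction. A short sketch, assuming task1, task2, and task3 are already defined:

# fan-out: task1 runs first, then task2 and task3 can run in parallel
task1 >> [task2, task3]

# << expresses the same relationship from the downstream side
task3 << task1  # equivalent to task1 >> task3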
Example: A Complete Simple DAG
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

default_args = {
    'owner': 'airflow',
    'retries': 1
}

def extract():
    print("Extracting data")

def transform():
    print("Transforming data")

def load():
    print("Loading data")

dag = DAG(
    dag_id='etl_pipeline',
    default_args=default_args,
    start_date=datetime(2024, 1, 1),
    schedule_interval='@daily',
    catchup=False
)

extract_task = PythonOperator(
    task_id='extract',
    python_callable=extract,
    dag=dag
)

transform_task = PythonOperator(
    task_id='transform',
    python_callable=transform,
    dag=dag
)

load_task = PythonOperator(
    task_id='load',
    python_callable=load,
    dag=dag
)

extract_task >> transform_task >> load_task
Common DAG Scheduling Options
- '@once' → Run once
- '@hourly' → Every hour
- '@daily' → Daily
- '@weekly' → Weekly
- Cron expression → Custom schedule
Example:
schedule_interval='0 6 * * *'
This runs the DAG daily at 6:00 AM (Airflow interprets schedules in UTC by default).
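schedule_interval also accepts a datetime.timedelta for fixed-frequency schedules that do not align with calendar boundaries. A minimal sketch (the dag_id here is only an illustration):

from datetime import timedelta

dag = DAG(
    dag_id='every_six_hours',
    start_date=datetime(2024, 1, 1),
    schedule_interval=timedelta(hours=6),  # one run every six hours
    catchup=False
)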
Best Practices for Creating DAGs
- Keep tasks small and focused
- Avoid heavy processing inside DAG files
- Use modular code (see the sketch after this list)
- Use clear task names
- Disable catchup unless needed
- Monitor logs regularly
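For example, heavy logic can live in a plain Python module so the DAG file stays a thin definition. A minimal sketch, assuming a hypothetical etl_helpers.py that sits on the scheduler's Python path:

# etl_helpers.py (hypothetical module holding the task logic)
def extract():
    print("Extracting data")

# etl_dag.py (the DAG file only wires things together)
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
from etl_helpers import extract

dag = DAG(
    dag_id='modular_etl',
    start_date=datetime(2024, 1, 1),
    schedule_interval='@daily',
    catchup=False
)

extract_task = PythonOperator(
    task_id='extract',
    python_callable=extract,
    dag=dag
)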
How to Deploy a DAG
- Save the Python file
- Place it in the Airflow DAGs folder
- Wait for the scheduler to parse it (the scheduler scans the DAGs folder periodically, so a restart is rarely needed)
- Enable DAG in Web UI
- Trigger manually or wait for schedule
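The last three steps can also be done from the command line. A short sketch using standard Airflow 2 CLI commands, with the etl_pipeline DAG from the example above:

airflow dags list                  # confirm the scheduler has parsed the DAG
airflow dags unpause etl_pipeline  # enable it (same as the toggle in the Web UI)
airflow dags trigger etl_pipeline  # start a manual run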
Interview Answer (Short Version)
Creating a DAG in Apache Airflow involves defining a DAG object in Python, adding tasks using operators, and setting dependencies between tasks to control execution order.
Final Summary
Creating DAGs in Apache Airflow involves:
- Defining workflow structure
- Adding tasks
- Setting dependencies
- Scheduling execution
DAGs are the backbone of workflow orchestration in modern data engineering pipelines.