Creating DAGs

Creating a DAG (Directed Acyclic Graph) is the first step in building workflows in Apache Airflow.

A DAG defines:

  • What tasks to run
  • In what order to run them
  • When to run them

DAGs are written in Python and saved as .py files inside the Airflow dags folder.

Basic Structure of a DAG

A simple DAG includes:

  • DAG definition
  • Default arguments
  • Tasks (Operators)
  • Task dependencies

Step 1: Import Required Libraries

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

Step 2: Define Default Arguments

Default arguments are settings applied to all tasks.

default_args = {
    'owner': 'airflow',
    'retries': 1
}
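Conceptually, default arguments behave like a dictionary merge: every task starts from default_args, and anything passed directly to the operator overrides the default. The snippet below is only an illustration of that precedence, not Airflow's actual internal code.

```python
# Illustration of default_args precedence (not Airflow internals):
# task-level keyword arguments override the shared defaults.
default_args = {'owner': 'airflow', 'retries': 1}
task_kwargs = {'retries': 3}  # this task wants more retries

effective = {**default_args, **task_kwargs}
print(effective)  # {'owner': 'airflow', 'retries': 3}
```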

Step 3: Create the DAG Object

dag = DAG(
    dag_id='simple_pipeline',
    default_args=default_args,
    start_date=datetime(2024, 1, 1),
    schedule_interval='@daily',
    catchup=False
)

Key Parameters:

  • dag_id → Unique name of the DAG
  • start_date → When scheduling begins
  • schedule_interval → How often it runs
  • catchup → Whether to backfill runs missed between start_date and now

Step 4: Define Tasks

Example Python function:

def say_hello():
    print("Hello from Airflow")

Create a task using PythonOperator:

task1 = PythonOperator(
    task_id='hello_task',
    python_callable=say_hello,
    dag=dag
)

Step 5: Set Task Dependencies

Dependencies define execution order, using the >> (downstream) and << (upstream) operators. A DAG with a single task needs no dependency line at all:

task1

For multiple tasks:

task1 >> task2 >> task3

This means:
task1 runs first, then task2, then task3.
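The >> operator also accepts lists, which lets you fan out to parallel tasks and fan back in. The stub class below is only a sketch to show how >> records edges; in real Airflow this logic lives in the operator base class, and the task names are made up for the example.

```python
# Minimal stand-in for an Airflow operator, purely to illustrate how
# >> builds the dependency graph (not the real Airflow API).
class Task:
    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        # supports  t1 >> t2  and fan-out  t1 >> [t2, t3]
        targets = other if isinstance(other, list) else [other]
        for t in targets:
            self.downstream.append(t.task_id)
        return other

    def __rrshift__(self, others):
        # supports fan-in  [t2, t3] >> t4
        for t in others:
            t.downstream.append(self.task_id)
        return self

extract = Task("extract")
transform_a = Task("transform_a")
transform_b = Task("transform_b")
load = Task("load")

# extract runs first, both transforms run in parallel, then load
extract >> [transform_a, transform_b] >> load

print(extract.downstream)      # ['transform_a', 'transform_b']
print(transform_a.downstream)  # ['load']
```

Real Airflow operators support the same list syntax, so extract >> [transform_a, transform_b] >> load works verbatim in a DAG file.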

Example Complete Simple DAG

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

default_args = {
    'owner': 'airflow',
    'retries': 1
}

def extract():
    print("Extracting data")

def transform():
    print("Transforming data")

def load():
    print("Loading data")

dag = DAG(
    dag_id='etl_pipeline',
    default_args=default_args,
    start_date=datetime(2024, 1, 1),
    schedule_interval='@daily',
    catchup=False
)

extract_task = PythonOperator(
    task_id='extract',
    python_callable=extract,
    dag=dag
)

transform_task = PythonOperator(
    task_id='transform',
    python_callable=transform,
    dag=dag
)

load_task = PythonOperator(
    task_id='load',
    python_callable=load,
    dag=dag
)

extract_task >> transform_task >> load_task

Common DAG Scheduling Options

  • '@once' → Run once
  • '@hourly' → Every hour
  • '@daily' → Daily
  • '@weekly' → Weekly
  • Cron expression → Custom schedule

Example:

schedule_interval='0 6 * * *'

This runs daily at 6 AM.
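To make the five positions of a cron expression explicit, the snippet below just labels each field of '0 6 * * *'; it is a plain-Python illustration, not something Airflow requires.

```python
# Label the five fields of a standard cron expression.
expr = "0 6 * * *"
fields = dict(zip(
    ["minute", "hour", "day_of_month", "month", "day_of_week"],
    expr.split(),
))
print(fields)
# {'minute': '0', 'hour': '6', 'day_of_month': '*', 'month': '*', 'day_of_week': '*'}
```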

Best Practices for Creating DAGs

  • Keep tasks small and focused
  • Avoid heavy processing inside DAG files
  • Use modular code
  • Use clear task names
  • Disable catchup unless needed
  • Monitor logs regularly

How to Deploy a DAG

  1. Save the Python file
  2. Place it in the Airflow DAGs folder
  3. Restart scheduler (if needed)
  4. Enable DAG in Web UI
  5. Trigger manually or wait for schedule
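The steps above can be sketched from the command line. The paths are assumptions (AIRFLOW_HOME defaults to ~/airflow), and the file name simple_pipeline.py is just the example from this section; the airflow CLI commands are shown commented out since they require a running Airflow installation.

```shell
# Assumed layout: AIRFLOW_HOME defaults to ~/airflow unless overridden.
DAGS_DIR="${AIRFLOW_HOME:-$HOME/airflow}/dags"
mkdir -p "$DAGS_DIR"
echo "Place your DAG file in: $DAGS_DIR"
# cp simple_pipeline.py "$DAGS_DIR/"      # step 2: copy the DAG file
# airflow dags list                       # confirm the scheduler parsed it
# airflow dags trigger simple_pipeline    # step 5: trigger manually
```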

Interview Answer (Short Version)

Creating a DAG in Apache Airflow involves defining a DAG object in Python, adding tasks using operators, and setting dependencies between tasks to control execution order.

Final Summary

Creating DAGs in Apache Airflow involves:

  • Defining workflow structure
  • Adding tasks
  • Setting dependencies
  • Scheduling execution

DAGs are the backbone of workflow orchestration in modern data engineering pipelines.
