This exam covers the foundational aspects of Apache Airflow, including DAG concepts, scheduling, task execution, operators, sensors, hooks, and Airflow architecture. Students practice identifying pipeline failures, configuring Airflow environments, securing deployments, and optimizing DAG performance. The exam includes hands-on DAG interpretation, dependency mapping, and error-handling strategies essential for managing reliable workflows.

Question 1. Which component is primarily responsible for parsing DAG files and creating task instances?
A) Webserver
B) Scheduler
C) Worker
D) Metadata Database
Answer: B
Explanation: The Scheduler continuously scans the DAG folder, parses DAG definitions, and determines when tasks should be scheduled.

Question 2. In Airflow terminology, what does the "D" in DAG stand for?
A) Distributed
B) Directed
C) Dynamic
D) Data-driven
Answer: B
Explanation: DAG stands for Directed Acyclic Graph; "Directed" indicates that edges have a direction from upstream to downstream tasks.

Question 3. Which executor runs tasks sequentially in the same process as the scheduler?
A) LocalExecutor
B) CeleryExecutor
C) SequentialExecutor
D) KubernetesExecutor
Answer: C
Explanation: SequentialExecutor is the simplest executor; it executes tasks one at a time in the scheduler’s process, useful for debugging.
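As a rough illustration of Questions 2 and 3, the sketch below models a pipeline as a directed acyclic graph and derives a one-at-a-time execution order. It is a toy model built on Python's standard `graphlib` module, not Airflow's scheduler or SequentialExecutor; the task names and the helper functions `run_sequentially` and `is_acyclic` are assumptions made up for this example.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical pipeline: each task maps to the set of upstream tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}


def run_sequentially(deps):
    """Return task names in an order that respects every upstream dependency."""
    return list(TopologicalSorter(deps).static_order())


def is_acyclic(deps):
    """Return True when the dependency graph contains no cycle."""
    try:
        TopologicalSorter(deps).prepare()
        return True
    except CycleError:
        return False


order = run_sequentially(dependencies)
print(order)
```

A graph in which two tasks depend on each other fails the `is_acyclic` check, which is the property the "Acyclic" in DAG rules out.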

Question 4. What is stored in Airflow’s metadata database?
A) Task code files
B) DAG definitions only
C) Execution history, variables, connections, and DAG metadata
D) Log files
Answer: C
Explanation: The metadata DB tracks DAG runs, task instances, variables, connections, and other state information.

Question 5. Which of the following best describes an Airflow “Operator”?
A) A reusable piece of code that defines a type of work
B) An instance of a task in a DAG
C) A UI component for monitoring
D) A configuration file for the scheduler
Answer: A
Explanation: Operators are Python classes that encapsulate the logic for a specific type of work (e.g., BashOperator runs a bash command).

Question 6. How does a “Task” differ from an “Operator”?
A) A Task is the class definition, an Operator is the instantiated node
B) A Task is the instantiated node, an Operator is the class definition
C) They are interchangeable terms
D) A Task runs on a worker, an Operator runs on the scheduler
Answer: B
Explanation: An Operator defines the template; a Task is the concrete instance of that Operator placed in a DAG.
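The class-versus-instance distinction in Questions 5 and 6 can be sketched in plain Python. `EchoOperator` below is a hypothetical stand-in written for this example, not a real Airflow operator; it only shows that one class (the template) can produce several tasks (the instances).

```python
class EchoOperator:
    """Toy stand-in for an operator: a class describing one type of work."""

    def __init__(self, task_id, message):
        self.task_id = task_id
        self.message = message

    def execute(self):
        # The "work" for this operator type is simply formatting a message.
        return f"{self.task_id}: {self.message}"


# Two tasks created from the same operator class, each with its own task_id.
greet = EchoOperator(task_id="greet", message="hello")
farewell = EchoOperator(task_id="farewell", message="goodbye")

results = [task.execute() for task in (greet, farewell)]
print(results)
```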
Explanation: “{{ ds }}” is a Jinja macro that expands to the logical date (execution date) of the task run.

Question 10. Which operator would you use to wait for a file to appear in a filesystem?
A) BashOperator
B) PythonOperator
C) FileSensor
D) S3ToRedshiftOperator
Answer: C
Explanation: FileSensor continuously checks for the existence of a file and only proceeds when it appears.

Question 11. In the context of XComs, what does “ti.xcom_pull(key='value', task_ids='task_a')” do?
A) Pushes a value to the XCom store
B) Retrieves a value from task_a’s XCom with key ‘value’
C) Deletes the XCom entry
D) Lists all XCom entries for the DAG run
Answer: B
Explanation: xcom_pull fetches a value previously pushed to XCom by the specified task and key.

Question 12. What is the maximum size recommended for an XCom payload?
A) 1 KB
B) 10 KB
C) 100 KB
D) 1 MB
Answer: B
Explanation: XComs are stored in the metadata DB; keeping payloads ≤10 KB avoids performance issues.

Question 13. Which configuration file is the primary source for Airflow settings?
A) airflow.cfg
B) airflow.yaml
C) settings.ini
D) config.json
Answer: A
Explanation: airflow.cfg contains sections for scheduler, webserver, executor, and other core settings.

Question 14. How can you override a setting defined in airflow.cfg without editing the file?
A) Using a .env file in the DAG folder
B) Setting an environment variable with the same name prefixed by “AIRFLOW__”
C) Adding a comment line in airflow.cfg
D) Modifying the DAG’s default_args
Answer: B
Explanation: Airflow reads environment variables with the pattern AIRFLOW__{SECTION}__{KEY} to override config values.

Question 15. Which UI component allows you to view task logs for a specific run?
A) Tree View
B) Graph View
C) Gantt Chart
D) Log Tab in Task Instance Details
Answer: D
Explanation: The callable must return a task_id (or list) that determines the active downstream branch.

Question 19. Which of the following statements about “catchup=False” is true?
A) It disables scheduled runs entirely
B) It prevents backfilling but still runs the most recent schedule interval
C) It forces Airflow to run all past intervals immediately
D) It only affects the webserver UI display
Answer: B
Explanation: catchup=False tells Airflow to ignore past intervals and only schedule the latest interval when the DAG is turned on.

Question 20. What does the “schedule_interval” value “@hourly” represent?
A) Every hour at minute 0
B) Every minute
C) Every day at midnight
D) Every hour at minute 30
Answer: A
Explanation: “@hourly” is a preset cron expression equivalent to “0 * * * *”.

Question 21. Which task state indicates that a task has completed successfully?
A) queued
B) running
C) success
D) failed
Answer: C
Explanation: The “success” state marks a task instance that finished without errors.
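The effect of catchup described in Question 19 can be sketched with plain `datetime` arithmetic on an hourly schedule. This is a simplified model under stated assumptions, not Airflow's scheduling code: the function `intervals_to_schedule` and the example dates are hypothetical, and an interval is treated as schedulable only once it has fully elapsed.

```python
from datetime import datetime, timedelta


def intervals_to_schedule(start_date, now, interval, catchup):
    """Return the (start, end) intervals a scheduler would create runs for."""
    completed = []
    start = start_date
    # Collect every interval that has fully elapsed by `now`.
    while start + interval <= now:
        completed.append((start, start + interval))
        start += interval
    if catchup:
        return completed       # backfill every missed interval
    return completed[-1:]      # only the most recent interval


start_date = datetime(2024, 1, 1, 0, 0)
now = datetime(2024, 1, 1, 3, 30)
hourly = timedelta(hours=1)

with_catchup = intervals_to_schedule(start_date, now, hourly, catchup=True)
without_catchup = intervals_to_schedule(start_date, now, hourly, catchup=False)
print(len(with_catchup), len(without_catchup))
```

With these example values, three hourly intervals have elapsed, so the catchup case yields three runs and the non-catchup case yields only the 02:00 to 03:00 interval.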

Question 22. What happens when a task exceeds its defined “retries” count?
A) It is marked as “success”
B) It is marked as “failed” and downstream tasks may be skipped
C) It is automatically deleted
D) Airflow raises an alert but keeps the task in “running” state
Answer: B
Explanation: After exhausting retries, the task moves to “failed”, which may trigger downstream failure propagation.

Question 23. Which command lists all DAGs available in the Airflow environment?
A) airflow dags list
B) airflow list_dags
C) airflow dags show
D) airflow dag list
Answer: A
Explanation: “airflow dags list” (Airflow 2.x) prints the IDs of all DAGs discovered in the dags_folder.

Question 24. How can you trigger a DAG run manually from the CLI?
A) airflow trigger_dag <dag_id>
B) airflow run_dag <dag_id>
C) airflow start <dag_id>
D) airflow exec_dag <dag_id>
Answer: A
Explanation: The “trigger_dag” command (Airflow 1.x syntax) creates a new DAG run immediately, optionally with a specific execution date. In Airflow 2.x the equivalent command is “airflow dags trigger <dag_id>”.
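The retry behaviour in Question 22 can be sketched as a small loop: one initial attempt, then up to `retries` further attempts, each preceded by `retry_delay`. This is a toy model rather than Airflow's implementation; `run_with_retries`, `always_fail`, and `succeed_on_third` are hypothetical names, and the delay is only accumulated, not actually slept.

```python
from datetime import timedelta


def run_with_retries(attempt_fn, retries, retry_delay):
    """Try attempt_fn up to 1 + retries times.

    Returns (final_state, attempts_made, total_wait_between_attempts).
    """
    total_wait = timedelta(0)
    attempts = 0
    for attempt in range(1, retries + 2):
        attempts = attempt
        try:
            attempt_fn(attempt)
            return "success", attempts, total_wait
        except Exception:
            if attempt <= retries:
                # A real system would wait here before the next attempt.
                total_wait += retry_delay
    return "failed", attempts, total_wait


def always_fail(attempt):
    raise RuntimeError(f"attempt {attempt} failed")


def succeed_on_third(attempt):
    if attempt < 3:
        raise RuntimeError(f"attempt {attempt} failed")


failed_result = run_with_retries(always_fail, retries=3, retry_delay=timedelta(minutes=5))
flaky_result = run_with_retries(succeed_on_third, retries=3, retry_delay=timedelta(minutes=5))
print(failed_result, flaky_result)
```

In this model a task with retries=3 and a 5-minute delay makes four attempts and accumulates 15 minutes of waiting before ending in the failed state.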

Question 28. In a DAG file, where should you place the import statements for operators?
A) Inside the default_args dictionary
B) At the top of the Python file, before DAG definition
C) Inside each task’s callable function
D) In the Airflow UI settings
Answer: B
Explanation: Standard Python practice; imports are placed at the beginning of the file so they are available when the DAG is parsed.

Question 29. What is the effect of setting “max_active_runs=1” on a DAG?
A) Only one task can run at a time across the entire Airflow instance
B) Only one DAG run can be active simultaneously for that DAG
C) The DAG will be paused after the first run
D) It limits the number of workers to one
Answer: B
Explanation: max_active_runs limits concurrent DAG runs; with a value of 1, a new run will not start until the previous one finishes.

Question 30. Which of the following best describes a “SubDAG”?
A) A separate Airflow instance used for testing
B) A DAG defined inside another DAG’s task, used for modularization
C) A UI feature for grouping tasks
D) A special kind of pool
Answer: B
Explanation: A SubDAG is a DAG object used as a task in a parent DAG, allowing hierarchical workflow composition (though Task Groups are preferred today).

Question 31. How do you define a default argument that applies to all tasks in a DAG?
A) Pass it to each operator’s constructor individually
B) Include it in the DAG’s default_args parameter when creating the DAG object
C) Set it in airflow.cfg under [core]
D) Use an environment variable named DEFAULT_ARGS
Answer: B
Explanation: The default_args dictionary supplied to DAG() is merged with each task’s arguments unless overridden.

Question 32. Which of the following is a built-in Airflow operator for moving data from S3 to Redshift?
A) S3ToRedshiftOperator
B) S3TransferOperator
C) RedshiftCopyOperator
D) S3CopyOperator
Answer: A
Explanation: S3ToRedshiftOperator handles copying data from an S3 bucket into a Redshift table.

Question 33. What does the “depends_on_past” task parameter control?
A) Whether a task waits for the previous DAG run’s same task to succeed
B) Whether a task depends on all upstream tasks in the same run
C) Whether a task can be retried
D) Whether a task runs in parallel with its downstream tasks
Answer: A
Explanation: When True, the task instance will not run unless the previous execution’s instance succeeded.
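The merging rule in Question 31 can be pictured as a dictionary merge in which task-level values win over DAG-level defaults. The sketch below is plain Python, not Airflow's internal logic; `resolve_task_args` and the example values are assumptions for illustration.

```python
from datetime import timedelta

# Hypothetical DAG-level defaults shared by every task.
default_args = {
    "owner": "data-team",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}


def resolve_task_args(defaults, **task_args):
    """Merge DAG-level defaults with task-level arguments; task-level values win."""
    return {**defaults, **task_args}


extract_args = resolve_task_args(default_args)           # inherits every default
load_args = resolve_task_args(default_args, retries=5)   # overrides retries only
print(extract_args["retries"], load_args["retries"])
```

The defaults dictionary itself is left unchanged, so an override on one task does not leak into the others.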
Explanation: Task Groups are a UI-only grouping mechanism that avoids the overhead and complexity of SubDAGs, leading to better performance.

Question 37. Which connection URI format is correct for a PostgreSQL database?
A) postgres://user:pass@host:5432/dbname
B) pgsql://user@host/dbname
C) mysql://user:pass@host/dbname
D) sqlite:///:memory:
Answer: A
Explanation: The standard PostgreSQL URI uses the scheme “postgres://” followed by credentials, host, port, and database name.

Question 38. How can you securely store a secret (e.g., API key) for use in a DAG without hard-coding it?
A) Place it in the DAG file as a global variable
B) Store it as an Airflow Variable with “is_secret=True” (or use a secret backend)
C) Write it to a local text file in the dags_folder
D) Encode it in base64 and embed in the DAG code
Answer: B
Explanation: Airflow Variables (or external secret backends) keep secrets out of code. In Airflow 2.x, the UI masks a Variable’s value based on its key name (for example, keys containing “secret” or “password”) rather than on an “is_secret” flag.

Question 39. What does the “poke_interval” parameter of a Sensor control?
A) How long the sensor waits before timing out
B) How frequently the sensor checks the condition
C) The maximum number of retries
D) The priority weight of the sensor task
Answer: B
Explanation: poke_interval (seconds) defines the sleep time between successive condition checks.

Question 40. Which of the following is NOT a valid schedule expression in Airflow?
A) @daily
B) 0 12 * * MON-FRI
C) every_hour
D) */15 * * * *
Answer: C
Explanation: “every_hour” is not a recognized preset; valid presets include @hourly, @daily, etc., or cron strings.

Question 41. If a task has “retries=3” and “retry_delay=timedelta(minutes=5)”, how long will Airflow wait in total before marking the task as failed after the first attempt fails?
A) 5 minutes
B) 10 minutes
C) 15 minutes
D) 20 minutes
Answer: C
Explanation: Three retries × 5 minutes each = 15 minutes of waiting after the initial failure.

Question 42. Which CLI command clears the state of a specific task instance?
A) airflow tasks clear <dag_id> <task_id> --execution_date
Answer: C
Explanation: dag_id must be unique; it is the primary key for DAG metadata in the database.

Question 46. Which of the following best describes the purpose of the “Webserver” component?
A) Executes tasks on remote workers
B) Parses DAG files and schedules runs
C) Provides a UI for monitoring DAGs, tasks, and logs
D) Stores connection credentials
Answer: C
Explanation: The Webserver hosts the Flask-based UI used for visualizing and managing Airflow resources.

Question 47. What happens if a DAG’s “start_date” is set to a future date?
A) The DAG will run immediately
B) The DAG will never be scheduled until the start_date is reached
C) Airflow will ignore the start_date and use the current date
D) The scheduler will raise an error and stop
Answer: B
Explanation: Airflow only schedules runs after the start_date; a future start_date delays the first run.

Question 48. Which of the following is a recommended practice for writing idempotent tasks?
A) Always delete output data before writing new data
B) Use random filenames for each run
C) Ensure that re-executing the task yields the same result without side effects
D) Disable retries to avoid duplicate execution
Answer: C
Explanation: Idempotent tasks can be safely retried or backfilled because repeated runs do not alter the final state.

Question 49. How can you make a DAG visible only to certain users in a RBAC-enabled Airflow deployment?
A) Place the DAG file in a private folder
B) Set “is_paused=True” in the DAG definition
C) Use “access_control” parameter to map roles to the DAG
D) Rename the DAG with a leading underscore
Answer: C
Explanation: The “access_control” dict assigns specific roles permissions to view or edit a DAG.

Question 50. Which operator would you use to execute a simple no-op placeholder in a DAG?
A) BashOperator
B) DummyOperator (Airflow 1.x) / EmptyOperator (Airflow 2.x)
C) PythonOperator with a pass statement
D) PauseOperator
Answer: B
Explanation: EmptyOperator (formerly DummyOperator) creates a task that does nothing, useful for structuring DAGs.

Question 51. What is the purpose of the “trigger_rule” parameter on a task?
A) Determines the schedule interval for the task

Question 54. Which of the following is a valid way to pass runtime parameters to a PythonOperator?
A) Using the “op_kwargs” argument to supply a dictionary of keyword arguments
B) Modifying the DAG’s default_args after creation
C) Editing the Python function’s global variables inside the DAG file
D) Setting environment variables inside the task code
Answer: A
Explanation: op_kwargs maps directly to the callable’s keyword parameters at execution time.

Question 55. What does the “sla_miss_callback” function allow you to do?
A) Automatically retry a failed task
B) Execute custom logic (e.g., send alerts) when a task exceeds its SLA
C) Skip the task on the next run
D) Change the task’s priority weight dynamically
Answer: B
Explanation: sla_miss_callback is invoked when a task’s SLA is missed, enabling custom alerting or remediation.

Question 56. Which of the following best describes the “Data Interval” concept?
A) The time between two consecutive task executions within a DAG run
B) The logical time span that a DAG run is processing (e.g., a day for @daily)
C) The duration of the scheduler’s polling loop
D) The time a worker spends on a task
Answer: B
Explanation: Data Interval is the period of data that a particular DAG run is intended to handle, derived from the schedule.
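How op_kwargs reaches the callable (Question 54) can be sketched without Airflow installed: the stored dictionary is unpacked into keyword arguments when the function is called. `execute_callable` and `build_query` are hypothetical helpers for this example and only mimic the argument passing, not the rest of a PythonOperator.

```python
def build_query(table, limit=10):
    """Example callable whose keyword parameters are filled from op_kwargs."""
    return f"SELECT * FROM {table} LIMIT {limit}"


def execute_callable(python_callable, op_args=(), op_kwargs=None):
    """Call the function with the configured positional and keyword arguments."""
    return python_callable(*op_args, **(op_kwargs or {}))


query = execute_callable(build_query, op_kwargs={"table": "orders", "limit": 5})
default_limit_query = execute_callable(build_query, op_args=("customers",))
print(query)
print(default_limit_query)
```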

Question 57. How can you view the lineage of a task’s upstream dependencies in the UI?
A) Click the “Tree View” and hover over the task
B) Use the “Graph View” which displays upstream and downstream edges
C) Open the “Log” tab for the task
D) Check the “Code” view of the DAG file
Answer: B
Explanation: Graph View visualizes the DAG structure, showing arrows from upstream to downstream tasks.

Question 58. Which of the following is true about the “KubernetesExecutor”?
A) It runs each task in a separate Docker container on a Kubernetes cluster
B) It executes tasks on the same machine as the scheduler
C) It does not support task retries
D) It requires a local SQLite metadata database
Answer: A
Explanation: KubernetesExecutor launches a pod per task, providing isolation and scalability.

Question 59. What does the “priority_weight” attribute affect?
A) The order in which tasks are displayed in the UI
B) The order in which tasks are scheduled when resources are limited (higher weight = higher priority)
C) The number of retries a task receives
D) The size of the task’s log file
Answer: B
Explanation: priority_weight influences the scheduler’s decision when multiple tasks compete for limited slots.
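The idea behind Question 59 can be sketched with a priority queue: when fewer slots are open than tasks are queued, the tasks with the highest weight are picked first. This is an illustrative model only, not the scheduler's actual algorithm; `pick_tasks` and the task names and weights are assumptions.

```python
import heapq


def pick_tasks(queued, open_slots):
    """Return the task ids that receive a slot, highest priority_weight first."""
    # heapq is a min-heap, so the weight is negated; the index keeps ties in queue order.
    heap = [(-weight, index, task_id) for index, (task_id, weight) in enumerate(queued)]
    heapq.heapify(heap)
    slots = min(open_slots, len(heap))
    return [heapq.heappop(heap)[2] for _ in range(slots)]


# Hypothetical queue of (task_id, priority_weight) pairs competing for two slots.
queued = [("cleanup", 1), ("load_sales", 10), ("refresh_cache", 5)]
chosen = pick_tasks(queued, open_slots=2)
print(chosen)
```

With two open slots, the two heaviest tasks are chosen and the lowest-weight task waits for the next free slot.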