Apache Airflow and Apache NiFi: A Comprehensive Q&A Guide for Data Engineers

A set of questions and answers covering key concepts in Apache Airflow and Apache NiFi, two widely used tools for building data pipelines. It covers the functionality of each component, including scheduling, data transfer, and error handling. The questions test understanding of DAGs, operators, processors, and data flow management within these platforms, which makes the set a useful resource for students and professionals in data engineering. Each answer is accompanied by an explanation.

Typology: Exams

2024/2025

Available from 05/27/2025

nicky-jone

Cloudera CDP Data Developer Exam
1. Which Apache Airflow component is responsible for scheduling and triggering
DAGs?
A) Executor
B) Scheduler
C) Worker
D) Webserver
Answer: B) Scheduler
Explanation: The Scheduler monitors DAG definitions and schedules tasks to be executed
by workers.
2. In Apache Airflow, which operator would you use to make an HTTP request to a
REST API?
A) BashOperator
B) PythonOperator
C) HttpOperator
D) DummyOperator
Answer: C) HttpOperator
Explanation: The HttpOperator (named SimpleHttpOperator in older releases of the HTTP
provider package) sends HTTP requests to REST APIs, so a task can call an endpoint and
retrieve its response.
3. Which Airflow feature allows secure storage of sensitive information like API keys
used in REST API connections?
A) Variables
B) Connections
C) XComs
D) Macros
Answer: B) Connections
Explanation: Airflow Connections securely store credentials and connection details needed
to interact with external systems.
4. What is the purpose of a DAG in Apache Airflow?

A) To define the structure of a data pipeline
B) To store logs
C) To manage Airflow users
D) To configure Airflow's executor
Answer: A) To define the structure of a data pipeline
Explanation: A DAG (Directed Acyclic Graph) outlines the tasks and their dependencies,
representing the workflow of a data pipeline.
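The idea of a DAG as "tasks plus dependencies" can be illustrated without Airflow itself. The sketch below uses only Python's standard library; the task names (`extract`, `transform`, `validate`, `load`) are hypothetical, and this is not how Airflow schedules work internally, only a picture of why the graph must be acyclic.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical pipeline: each key is a task, each value is the set of
# tasks that must finish before it can start.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"extract"},
    "load": {"transform", "validate"},
}


def execution_order(deps):
    """Return one valid run order; raises CycleError if deps contain a cycle."""
    return list(TopologicalSorter(deps).static_order())


def has_cycle(deps):
    """Report whether the dependency graph contains a cycle."""
    try:
        execution_order(deps)
    except CycleError:
        return True
    return False


order = execution_order(dependencies)
print(order)  # "extract" comes first and "load" comes last
print(has_cycle({"a": {"b"}, "b": {"a"}}))  # a two-task loop is not a DAG
```

A graph with a cycle has no valid run order, which is why a workflow tool can only schedule acyclic graphs.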

  1. Which Airflow component allows you to execute code in response to events or schedules? A) DAGs B) Operators C) Sensors D) Hooks Answer: B) Operators Explanation: Operators are the building blocks of tasks in Airflow, executing specific actions like running scripts or making API calls.
  2. To download data from a REST API and store it in S3 using Airflow, which operators might you use in combination? A) HttpOperator and S3Hook B) BashOperator and PythonOperator C) PythonOperator and S3Operator D) HttpOperator and S3UploadOperator Answer: C) PythonOperator and S3Operator Explanation: The PythonOperator can handle the logic for downloading data, while an S3 operator manages uploading data to S3. Note that Airflow does not ship a class literally named S3Operator; with the Amazon provider package the upload is typically done through S3Hook or an operator such as S3CreateObjectOperator.
  3. Which Airflow component is responsible for executing tasks defined in a DAG? A) Scheduler B) Executor C) Worker D) Webserver
  1. In Apache NiFi, what is the primary component used to move data between processors? A) FlowFiles B) Processors C) Connections D) Controller Services Answer: C) Connections Explanation: Connections link processors, allowing FlowFiles to flow from one processor to another.
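  2. Which NiFi processor is used to ingest data from a REST API? A) GetHTTP B) InvokeHTTP C) ListenHTTP D) FetchHTTP Answer: A) GetHTTP Explanation: GetHTTP periodically makes HTTP requests to retrieve data from REST APIs. Note that GetHTTP was deprecated in favor of InvokeHTTP and is removed in NiFi 2.0, so InvokeHTTP is the processor to use on current releases.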
  3. What is a FlowFile in Apache NiFi? A) A configuration file for NiFi B) The data structure that moves through the NiFi flow C) A log file generated by NiFi D) A template for NiFi processors Answer: B) The data structure that moves through the NiFi flow Explanation: FlowFiles represent the data and its metadata as it traverses through the NiFi data flow.
  4. Which processor would you use to parse JSON data in NiFi? A) ConvertJSONToXML B) SplitJSON C) EvaluateJsonPath D) MergeJSON

Answer: C) EvaluateJsonPath Explanation: EvaluateJsonPath extracts specific data from JSON structures, enabling further processing based on the extracted values.
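As a rough illustration of what EvaluateJsonPath does when its results are written to FlowFile attributes, the following standard-library sketch models a FlowFile as content plus an attribute map. The `FlowFile` class and `evaluate_json_path` function are hypothetical stand-ins, not NiFi APIs, and only simple dotted paths such as `$.user.id` are handled.

```python
import json
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """Toy stand-in for a NiFi FlowFile: content bytes plus string attributes."""
    content: bytes
    attributes: dict = field(default_factory=dict)


def evaluate_json_path(flowfile, mappings):
    """Copy values found at simple '$.a.b' paths into FlowFile attributes.

    Only dotted object keys are supported here; real JSONPath has far
    richer syntax (array indexes, filters, wildcards).
    """
    document = json.loads(flowfile.content)
    for attr_name, path in mappings.items():
        value = document
        for key in path.removeprefix("$.").split("."):
            value = value[key]
        flowfile.attributes[attr_name] = str(value)
    return flowfile


ff = FlowFile(content=b'{"user": {"id": 42, "country": "IN"}}')
evaluate_json_path(ff, {"user.id": "$.user.id", "user.country": "$.user.country"})
print(ff.attributes)  # {'user.id': '42', 'user.country': 'IN'}
```

Once the values are attributes, downstream processors can route or rename the FlowFile based on them without parsing the content again.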

  1. How can you ensure data provenance in Apache NiFi? A) By using SSL encryption B) By configuring data lineage tracking C) By enabling data provenance in the NiFi settings D) By using FlowFile attributes Answer: C) By enabling data provenance in the NiFi settings Explanation: NiFi records provenance events for each FlowFile by default; the provenance repository settings in nifi.properties control how those events are stored and how long they are retained, which allows data to be tracked through the system.
  2. Which of the following is NOT a built-in processor in Apache NiFi? A) PutHDFS B) GetSFTP C) QueryDatabaseTable D) SparkSubmit Answer: D) SparkSubmit Explanation: SparkSubmit is not a standard NiFi processor. NiFi includes processors like PutHDFS and GetSFTP for specific tasks.
  3. What feature of NiFi allows you to manage and monitor data flows in real-time? A) NiFi Registry B) NiFi UI C) NiFi API D) NiFi CLI Answer: B) NiFi UI Explanation: The NiFi user interface provides real-time monitoring and management capabilities for data flows.

Answer: B) GetHDFS Explanation: GetHDFS fetches files from the Hadoop Distributed File System (HDFS).

  1. To transfer data from HDFS to S3 using NiFi, which sequence of processors could be used? A) GetHDFS → PutS3Object B) FetchHDFS → PutS3Object C) GetHDFS → FetchS3Object D) ReadHDFS → WriteS Answer: A) GetHDFS → PutS3Object Explanation: GetHDFS retrieves data from HDFS, and PutS3Object uploads it to S3.
  2. Which property must be configured in the PutS3Object processor to authenticate with AWS? A) AWS Access Key and Secret Key B) HDFS URL C) S3 Bucket Name D) Both A and C Answer: D) Both A and C Explanation: The AWS credentials authenticate the processor and the S3 bucket name identifies the target, so both are required for a working upload.
  3. What is the purpose of the S3Transfer protocol in NiFi? A) To move data between S3 and HDFS B) To enable secure transfer to S C) It is not a standard protocol in NiFi D) To transfer data to S3 using a specific protocol Answer: C) It is not a standard protocol in NiFi Explanation: S3Transfer is not a recognized standard protocol within NiFi. Standard processors handle S3 interactions.
  1. When transferring data from HDFS to S3 in NiFi, which processor is responsible for handling retries in case of failures? A) GetHDFS B) PutS3Object C) RetryHandler D) ErrorHandlingProcessor Answer: B) PutS3Object Explanation: PutS3Object can be configured with retry settings to manage upload failures.
  2. In NiFi, how can you ensure that files are deleted from HDFS after successful transfer to S3? A) Use the DeleteHDFS processor after PutS3Object B) Configure GetHDFS to delete after fetching C) Use a success relationship to trigger deletion D) All of the above Answer: D) All of the above Explanation: Multiple approaches can ensure files are deleted post-transfer, including using DeleteHDFS or configuring GetHDFS accordingly.
  3. Which NiFi processor is used to write data to HDFS? A) PutHDFS B) WriteHDFS C) UploadHDFS D) SendHDFS Answer: A) PutHDFS Explanation: PutHDFS handles writing data to the Hadoop Distributed File System.
  4. How does NiFi handle large files during transfer between HDFS and S3? A) It splits the files into smaller chunks B) It streams the data to avoid memory overload

Explanation: Kudu is designed for real-time data operations, enabling fast ingestion and updates suitable for analytics.

  1. Which NiFi processor can be used to interact directly with Apache Kudu? A) PutKudu B) KuduPut C) There is no direct Kudu processor in NiFi; use a custom processor or intermediary like Spark D) WriteKudu Answer: A) PutKudu Explanation: NiFi 1.x includes a PutKudu processor that writes records to a Kudu table using a configured Record Reader. Routing data through Spark is an alternative when heavier transformation is needed, not a requirement.
  2. How can Apache Spark be used in conjunction with NiFi to load data into Kudu? A) Spark processes data ingested by NiFi and writes it to Kudu B) NiFi directly writes to Kudu via Spark C) Spark and NiFi write to Kudu independently D) NiFi cannot integrate with Spark for Kudu loading Answer: A) Spark processes data ingested by NiFi and writes it to Kudu Explanation: NiFi handles data ingestion and routing, while Spark processes the data and interfaces with Kudu for storage.
  3. Which Spark component is responsible for executing the data loading into Kudu? A) Spark Driver B) Spark Executors C) Kudu Connector D) All of the above Answer: D) All of the above Explanation: The Spark Driver coordinates the job, Executors perform the tasks, and the Kudu Connector facilitates interaction with Kudu.
  1. What is the role of the Kudu Connector in Spark? A) It provides a user interface for Kudu B) It allows Spark to read and write data to Kudu C) It manages Kudu nodes D) It optimizes Spark queries Answer: B) It allows Spark to read and write data to Kudu Explanation: The Kudu Connector enables Spark applications to perform read and write operations on Kudu tables.
  2. Which data format is recommended for efficient data loading into Kudu via Spark? A) CSV B) JSON C) Parquet D) Avro Answer: C) Parquet Explanation: Parquet is a columnar storage format that is efficient for both Spark processing and Kudu ingestion.
  3. In Spark, which API would you typically use to write data to Kudu? A) RDD API B) DataFrame API C) Dataset API D) SQL API Answer: B) DataFrame API Explanation: The DataFrame API is commonly used with the Kudu Connector for structured data operations.
  4. Which NiFi processor can be used to trigger a Spark job for loading data into Kudu? A) ExecuteSpark B) ExecuteStreamCommand

Explanation: Combining NiFi for data ingestion and routing with Spark for processing streamlines the overall data pipeline to Kudu.
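The Spark side of such a pipeline might look roughly like the sketch below, which appends a DataFrame to a Kudu table through the kudu-spark connector. It assumes the connector jar is on the Spark classpath and that the table already exists; the function names, the master address, and the table name are hypothetical.

```python
def kudu_options(master, table):
    """Build the option map read by the kudu-spark connector."""
    return {"kudu.master": master, "kudu.table": table}


def append_to_kudu(df, master, table):
    """Append a Spark DataFrame to an existing Kudu table.

    `df` is expected to be a pyspark.sql.DataFrame whose columns match
    the Kudu table schema.
    """
    (
        df.write
        .format("org.apache.kudu.spark.kudu")
        .options(**kudu_options(master, table))
        .mode("append")
        .save()
    )


# Placeholder values for illustration only.
print(kudu_options("kudu-master.example.com:7051", "events"))

# With a live SparkSession and a DataFrame `events_df`, the call would be:
# append_to_kudu(events_df, "kudu-master.example.com:7051", "events")
```

Keeping the option map in its own function makes the target table easy to change per environment without touching the write logic.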

  1. In a Spark-Kudu integration, what is the purpose of defining a schema? A) To validate data types before loading B) To define table structure in Kudu C) To optimize query performance D) All of the above Answer: D) All of the above Explanation: Defining a schema ensures data consistency, defines Kudu table structures, and optimizes query performance.
  2. Which NiFi processor can be used to buffer data before it's processed by Spark for loading into Kudu? A) QueueProcessor B) PutKafka C) PublishKafka D) All of the above, depending on the architecture Answer: D) All of the above, depending on the architecture Explanation: Depending on the design, processors like PublishKafka can buffer data before Spark consumes it.
  3. How does Apache Spark handle data consistency when loading into Kudu? A) Spark does not handle consistency; Kudu manages it B) Spark uses transactions to ensure data consistency in Kudu C) Spark writes data in batch mode, ensuring consistency D) Spark relies on NiFi for consistency Answer: B) Spark uses transactions to ensure data consistency in Kudu Explanation: The Kudu Connector allows Spark to perform atomic transactions, ensuring data consistency during writes.
  1. What is the typical deployment model for Apache Kudu in a production environment? A) Single-node cluster B) Distributed cluster with multiple nodes for scalability and redundancy C) Cloud-only deployment D) On-premises, single-server deployment Answer: B) Distributed cluster with multiple nodes for scalability and redundancy Explanation: Kudu is deployed as a distributed system to provide scalability, fault tolerance, and high availability.
  2. Which NiFi processor would you use to convert data into a format suitable for Spark processing before loading into Kudu? A) ConvertRecord B) SerializeRecord C) ConvertJSONToCSV D) All of the above, depending on the desired format Answer: D) All of the above, depending on the desired format Explanation: Various processors like ConvertRecord or specific format converters can be used based on the required data format.
  3. What role does Apache ZooKeeper play in the integration between NiFi, Spark, and Kudu? A) Manages configurations and coordination among the cluster nodes B) Provides storage for Spark C) Acts as a message broker D) It is not involved in this integration Answer: A) Manages configurations and coordination among the cluster nodes Explanation: ZooKeeper provides coordination for clustered services; a NiFi cluster, for example, relies on it for coordinator election and shared state. Kudu itself does not depend on ZooKeeper (it uses Raft consensus), and Spark needs it only for high availability of a standalone master.
  4. In the context of NiFi and Spark integration, what is backpressure?
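One way to picture backpressure is a bounded queue between a producer and a consumer: once the queue is full, the producer has to wait until the consumer catches up. The sketch below is a toy model using Python's standard library, with an arbitrary limit of 3 items. In NiFi the thresholds are configured per connection, and a full connection causes the upstream processor to stop being scheduled rather than to receive an error.

```python
import queue

# A connection is modelled as a bounded queue. The limit of 3 is an
# arbitrary illustrative threshold, not a NiFi default.
connection = queue.Queue(maxsize=3)


def offer(flowfile):
    """Try to enqueue; return False when the queue is full (backpressure)."""
    try:
        connection.put_nowait(flowfile)
        return True
    except queue.Full:
        return False


accepted = [offer(f"flowfile-{i}") for i in range(5)]
print(accepted)  # [True, True, True, False, False]

connection.get_nowait()        # the downstream consumer takes one item
resumed = offer("flowfile-5")  # space is available again
print(resumed)  # True
```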

Answer: B) Airflow's HttpOperator can make GET, POST, PUT, DELETE requests to REST APIs Explanation: HttpOperator supports multiple HTTP methods, enabling versatile interactions with REST APIs.
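To make the four methods concrete, here is a standard-library sketch that builds one request per method without sending anything. It is not Airflow code: in Airflow the same choices are expressed through operator arguments such as `method`, `endpoint`, and `http_conn_id`. The `build_request` helper and the example.com URLs are placeholders.

```python
import json
import urllib.request


def build_request(method, url, payload=None):
    """Build (but do not send) an HTTP request using the given method."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(url, data=data, headers=headers, method=method)


# Placeholder endpoints; nothing is sent over the network.
requests_to_send = [
    build_request("GET", "https://example.com/api/items"),
    build_request("POST", "https://example.com/api/items", {"name": "a"}),
    build_request("PUT", "https://example.com/api/items/1", {"name": "b"}),
    build_request("DELETE", "https://example.com/api/items/1"),
]

for req in requests_to_send:
    print(req.get_method(), req.full_url)
```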

  1. How does Apache Airflow ensure that a task using an API is idempotent? A) By ignoring task retries B) By designing the task to handle multiple executions without adverse effects C) By storing the task state in a database D) By using unique task identifiers Answer: B) By designing the task to handle multiple executions without adverse effects Explanation: Idempotent tasks can safely run multiple times without changing the outcome beyond the initial execution.
  2. What is the purpose of Airflow Variables? A) To store task outputs B) To store global configuration parameters C) To store user credentials D) To define DAG dependencies Answer: B) To store global configuration parameters Explanation: Variables hold configuration values that can be accessed by multiple tasks within DAGs.
  3. Which Airflow feature allows dynamic generation of tasks based on external data or conditions? A) Macros B) Dynamic DAGs C) XComs D) Templates Answer: B) Dynamic DAGs Explanation: Dynamic DAGs build tasks and dependencies in Python code from external inputs or conditions, typically when the DAG file is parsed; Airflow 2.3 and later can also expand tasks at runtime through dynamic task mapping.
  1. In Apache Airflow, what is the purpose of 'retries' and 'retry_delay' parameters in a task? A) To define the maximum runtime of a task B) To specify how many times a task should be retried upon failure and the delay between retries C) To set the priority of a task D) To log retries of a task Answer: B) To specify how many times a task should be retried upon failure and the delay between retries Explanation: These parameters control the retry behavior for tasks that fail, enhancing reliability.
  2. What is a common use case for the 'List' processor family in NiFi? A) To list files or data from a directory or service without fetching the content B) To list all available processors C) To list user permissions D) To list system logs Answer: A) To list files or data from a directory or service without fetching the content Explanation: 'List' processors enumerate available data sources without retrieving the actual data.
  3. Which NiFi processor is best suited for splitting large files into smaller chunks? A) SplitText B) MergeContent C) SplitContent D) DivideFlow Answer: C) SplitContent Explanation: SplitContent divides a FlowFile's content at each occurrence of a configured byte sequence. For splitting by line count NiFi provides SplitText, and for fixed-size segments it provides SegmentContent.
  4. What is the function of the 'UpdateAttribute' processor in NiFi?
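Returning to the 'retries' and 'retry_delay' parameters covered in this group of questions, their effect can be sketched in plain Python: a task gets one initial attempt plus up to `retries` further attempts, with a pause of `retry_delay` between them. The `run_with_retries` helper is hypothetical and only mimics the semantics; in Airflow the scheduler handles this once the two parameters are set on a task.

```python
import time
from datetime import timedelta


def run_with_retries(task, retries, retry_delay):
    """Run task(); on failure, retry up to `retries` more times.

    Returns (result, attempts_used). Re-raises the last error once the
    retry budget is exhausted.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except Exception:
            if attempts > retries:
                raise
            time.sleep(retry_delay.total_seconds())


calls = {"n": 0}


def flaky_task():
    """Fail twice with a transient error, then succeed."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


# A zero delay keeps the example fast; a real task would use minutes.
result, attempts = run_with_retries(flaky_task, retries=3, retry_delay=timedelta(seconds=0))
print(result, attempts)  # ok 3
```

Because a retried task may run more than once, this pairs naturally with the idempotency point above: each attempt should be safe to repeat.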

Answer: C) There is no specific processor; use generic format conversion processors Explanation: Generic processors like ConvertRecord can handle various data format conversions as needed.

  1. What is the purpose of 'FlowFile Attributes' in the context of transferring data between HDFS and S3? A) To store the content of the data B) To store metadata such as filenames, paths, and other properties C) To define security credentials D) To manage processor configurations Answer: B) To store metadata such as filenames, paths, and other properties Explanation: FlowFile Attributes hold metadata that can be used to manage and route data flows effectively.
  2. Which NiFi processor can send data to a Spark Streaming job? A) PublishKafka B) PutSpark C) ExecuteSparkStreaming D) StreamToSpark Answer: A) PublishKafka Explanation: By publishing data to Kafka, Spark Streaming can consume and process the data in real-time.
  3. When using Spark to load data into Kudu, what is the function of the 'kudu.table' option? A) It specifies the target Kudu table B) It defines the schema of the data C) It sets the partitioning for Kudu D) It is used for logging purposes Answer: A) It specifies the target Kudu table Explanation: The 'kudu.table' option directs Spark to the specific Kudu table where data should be written.
  1. What is a common challenge when integrating NiFi and Spark for data loading into Kudu? A) Data format incompatibility B) Managing data flow synchronization and backpressure C) Lack of security features D) NiFi cannot handle large data volumes Answer: B) Managing data flow synchronization and backpressure Explanation: Coordinating data flow rates between NiFi and Spark to prevent bottlenecks is a key challenge.
  2. Which Spark configuration property is important for optimizing memory usage when loading large datasets into Kudu? A) spark.executor.memory B) spark.driver.memory C) spark.kudu.memory D) Both A and B Answer: D) Both A and B Explanation: Allocating sufficient memory to both executors and the driver ensures efficient processing of large datasets.
  3. Which NiFi processor can be used to convert CSV data into a format Spark can easily consume? A) ConvertCSV B) ConvertRecord with CSVReader and appropriate writer C) CSVParser D) SplitCSV Answer: B) ConvertRecord with CSVReader and appropriate writer Explanation: ConvertRecord allows flexible data format transformations using specified readers and writers.
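The reader/writer pairing behind ConvertRecord can be approximated in a few lines of standard-library Python: a CSV reader turns each row into a record, and a JSON writer serializes each record. This is only an analogy for the processor's behavior; the function name and the sample data are made up, and all values stay strings here because no schema is applied.

```python
import csv
import io
import json


def convert_csv_to_json_lines(csv_text):
    """Read CSV with a header row and emit one JSON object per line."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return "\n".join(json.dumps(row) for row in reader)


sample = "id,name\n1,alice\n2,bob\n"
converted = convert_csv_to_json_lines(sample)
print(converted)
# {"id": "1", "name": "alice"}
# {"id": "2", "name": "bob"}
```

Line-delimited JSON of this kind is straightforward for Spark to read, which is one reason the record-oriented approach suits the NiFi-to-Spark handoff.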