Cloudera CDP Data Engineer Practice Exam, Exams of Technology

A deep engineering exam testing distributed data processing skills using Spark, Kafka, NiFi, Airflow, and CDP data services. It emphasizes building resilient pipelines, streaming analytics, job optimization, cluster tuning, schema evolution, and orchestrating enterprise data flows.

Typology: Exams

2025/2026

Available from 01/06/2026

shilpi-jain-1

Cloudera CDP Data Engineer Practice Exam
**Question 1.** Which component in a Spark-on-Kubernetes deployment is responsible for launching
driver pods?
A) Spark Submit
B) kube-scheduler
C) Spark-executor-controller
D) Spark-master
Answer: A
Explanation: `spark-submit` translates the Spark application into a driver pod specification and asks the
Kubernetes API to create the driver pod.
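
To make the role of `spark-submit` in Question 1 concrete, here is a minimal sketch that only assembles the command line for a Kubernetes cluster-mode submission. The helper name `build_spark_submit_args`, the API server URL, the image name, and the application path are all hypothetical placeholders; the flags themselves (`--master k8s://...`, `--deploy-mode cluster`, `spark.kubernetes.container.image`) are standard `spark-submit` options.

```python
from typing import List


def build_spark_submit_args(
    api_server: str,
    image: str,
    app_name: str,
    app_resource: str,
) -> List[str]:
    """Assemble a spark-submit command line for Kubernetes cluster mode."""
    return [
        "spark-submit",
        # The k8s:// prefix points spark-submit at the Kubernetes API server.
        "--master", f"k8s://{api_server}",
        # In cluster mode the driver runs in a pod requested through that API.
        "--deploy-mode", "cluster",
        "--name", app_name,
        "--conf", f"spark.kubernetes.container.image={image}",
        app_resource,
    ]


if __name__ == "__main__":
    # Every value below is a hypothetical placeholder.
    args = build_spark_submit_args(
        api_server="https://k8s.example.com:6443",
        image="registry.example.com/spark:latest",
        app_name="demo-job",
        app_resource="local:///opt/spark/app/job.py",
    )
    print(" ".join(args))
```

The sketch stops at building the argument list; actually running it requires a Spark distribution and access to a cluster.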
**Question 2.** In Spark DataFrames, which method returns a new DataFrame containing only distinct
rows?
A) distinct()
B) unique()
C) dropDuplicates()
D) filterDistinct()
Answer: A
Explanation: `distinct()` removes duplicate rows and produces a DataFrame with unique records.
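
The difference between comparing whole rows and comparing a subset of columns can be modelled in plain Python. The sketch below is not Spark code: `distinct` and `drop_duplicates` are hypothetical helpers that mimic the semantics of the DataFrame methods `distinct()` and `dropDuplicates([...])` on a list of dictionaries. It keeps the first occurrence of each key, whereas Spark makes no promise about which duplicate survives.

```python
from typing import Dict, Hashable, Iterable, List, Sequence

Row = Dict[str, Hashable]


def distinct(rows: Iterable[Row]) -> List[Row]:
    """Keep one copy of each row, comparing every column."""
    seen = set()
    out: List[Row] = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out


def drop_duplicates(rows: Iterable[Row], subset: Sequence[str]) -> List[Row]:
    """Keep one row per combination of the columns named in `subset`."""
    seen = set()
    out: List[Row] = []
    for row in rows:
        key = tuple(row[col] for col in subset)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out


if __name__ == "__main__":
    rows = [
        {"id": 1, "date": "2024-01-01", "amount": 10},
        {"id": 1, "date": "2024-01-01", "amount": 10},  # exact duplicate
        {"id": 1, "date": "2024-01-01", "amount": 25},  # same id/date, new amount
        {"id": 2, "date": "2024-01-02", "amount": 7},
    ]
    print(len(distinct(rows)))                         # 3
    print(len(drop_duplicates(rows, ["id", "date"])))  # 2
```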
**Question 3.** Which Spark storage level persists data only in memory and discards it when the JVM
runs out of memory?
A) MEMORY_ONLY_SER
B) MEMORY_AND_DISK
C) OFF_HEAP
D) MEMORY_ONLY
Answer: D


Explanation: `MEMORY_ONLY` stores partitions as deserialized Java objects in RAM; if memory is insufficient, some partitions are not cached and are recomputed when they are needed.
**Question 4.** When reading a Hive table with Spark, which option enables schema inference from the Hive Metastore?
A) .option("inferSchema", "true")
B) .format("hive")
C) .enableHiveSupport()
D) .option("hive.metastore.uris", "...")
Answer: C
Explanation: `enableHiveSupport()` tells Spark to use the Hive Metastore for schema information when accessing Hive tables.
**Question 5.** Which file format provides columnar storage and built-in compression and is optimal for analytical queries?
A) JSON
B) CSV
C) Parquet
D) Avro
Answer: C
Explanation: Parquet stores data column-wise, enabling predicate push-down and efficient compression for analytics.
**Question 6.** In Airflow, which object defines the execution order and dependencies of tasks?
A) Operator
B) DAG
C) XCom

C) Executors
D) SQL
Answer: B
Explanation: The Stages tab shows task duration distribution; unusually long tasks in a stage often indicate skew.
**Question 10.** When using `groupBy` followed by `agg`, which Spark setting determines how many shuffle partitions are created?
A) spark.sql.shuffle.partitions
B) spark.default.parallelism
C) spark.sql.autoBroadcastJoinThreshold
D) spark.executor.cores
Answer: A
Explanation: `spark.sql.shuffle.partitions` defines the default number of partitions for shuffle operations like `groupBy`.
**Question 11.** Which Spark transformation is lazy and does not trigger execution until an action is called?
A) collect()
B) count()
C) map()
D) show()
Answer: C
Explanation: `map()` creates a new RDD/DataFrame but does not compute results until an action (e.g., `collect`) is invoked.
**Question 12.** Which of the following is a best practice to improve join performance on large tables?
A) Use cross join without conditions
B) Broadcast the larger table
C) Broadcast the smaller table
D) Disable partition pruning
Answer: C
Explanation: Broadcasting the smaller table avoids shuffling the larger one, making the join more efficient.
**Question 13.** In Iceberg, what feature enables "time-travel" queries?
A) Hidden partitioning
B) Snapshot IDs
C) Partition evolution
D) Schema pruning
Answer: B
Explanation: Iceberg stores each write as a snapshot; querying a prior snapshot ID allows reading data as of that point in time.
**Question 14.** Which command creates a Kudu table with a primary key on column `id`?
A) CREATE TABLE t (id INT PRIMARY KEY) STORED AS KUDU;
B) CREATE TABLE t (id INT, PRIMARY KEY (id)) USING KUDU;
C) CREATE TABLE t (id INT PRIMARY KEY) USING KUDU;
D) CREATE TABLE t (id INT, PRIMARY KEY (id)) STORED AS KUDU;
Answer: D
Explanation: In Hive/Impala syntax, `PRIMARY KEY (id)` defines the Kudu primary key, and `STORED AS KUDU` selects the storage engine.
**Question 15.** Which Spark SQL function can be used to explode an array column into multiple rows?

A) start_date
B) schedule_interval
C) catchup
D) default_args
Answer: B
Explanation: `schedule_interval` specifies how often the DAG should be triggered (e.g., a cron expression).
**Question 19.** Which Spark configuration enables automatic broadcast of a table when its size is less than the threshold?
A) spark.sql.autoBroadcastJoinThreshold
B) spark.broadcast.compress
C) spark.sql.broadcastTimeout
D) spark.sql.broadcastJoinSize
Answer: A
Explanation: `spark.sql.autoBroadcastJoinThreshold` (default 10 MB) determines when Spark will automatically broadcast a smaller table.
**Question 20.** Which DataFrame method would you use to write data in ORC format partitioned by column `year`?
A) write.format("orc").partitionBy("year")
B) write.orc().partitionBy("year")
C) write.saveAsTable("tbl", format="orc", partitionBy="year")
D) write.mode("overwrite").orc("path", partitionBy="year")
Answer: A
Explanation: `write.format("orc").partitionBy("year")` sets the file format and partition column before calling `save`.
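
What `partitionBy("year")` does to the output can be pictured without Spark. The sketch below is a plain-Python model, not the Spark writer: `hive_partition_layout` is a hypothetical helper, and the base path and rows are made up. It groups rows into Hive-style `year=<value>` directories and drops the partition column from the stored records, since that value is carried by the directory name.

```python
from collections import defaultdict
from typing import Any, Dict, List


def hive_partition_layout(
    rows: List[Dict[str, Any]],
    base_path: str,
    partition_col: str,
) -> Dict[str, List[Dict[str, Any]]]:
    """Map each Hive-style partition directory to the records stored under it."""
    layout: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for row in rows:
        directory = f"{base_path}/{partition_col}={row[partition_col]}"
        # The partition value is encoded in the path, so it is not repeated
        # inside the stored record.
        stored = {k: v for k, v in row.items() if k != partition_col}
        layout[directory].append(stored)
    return dict(layout)


if __name__ == "__main__":
    # Hypothetical sample data and base path.
    sales = [
        {"id": 1, "year": 2023, "amount": 10.0},
        {"id": 2, "year": 2024, "amount": 12.5},
        {"id": 3, "year": 2023, "amount": 7.25},
    ]
    for directory, records in sorted(hive_partition_layout(sales, "/data/sales", "year").items()):
        print(directory, records)
```

A query that filters on `year` can then skip whole directories, which is the point of partitioning on that column.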

**Question 21.** What is the primary advantage of using bucketed tables over plain partitioned tables in Hive?
A) Faster writes
B) Better compression
C) Optimized joins on bucket columns
D) Automatic schema evolution
Answer: C
Explanation: Bucketing distributes rows into fixed buckets, enabling map-side joins when both tables are bucketed on the join key.
**Question 22.** Which Spark UI metric indicates the amount of data read from external storage during a stage?
A) Input Metrics → Bytes Read
B) Shuffle Read Metrics → Bytes Read
C) Executor Metrics → Disk Spilled
D) Task Metrics → Duration
Answer: A
Explanation: Input Metrics track data read directly from sources like HDFS, S3, or JDBC.
**Question 23.** In Airflow, which trigger rule executes a downstream task only if all upstream tasks succeeded?
A) all_success
B) one_success
C) none_failed
D) all_done

B) memory
C) foreachBatch
D) file
Answer: C
Explanation: `foreachBatch` lets you apply idempotent batch writes (e.g., to Delta Lake) inside a transaction, which is what yields exactly-once guarantees.
**Question 27.** Which Spark SQL function calculates the rank of rows within a partition ordered by sales descending?
A) rank() over (order by sales desc)
B) dense_rank() over (partition by sales order by desc)
C) row_number() over (partition by sales order by desc)
D) rank() over (partition by category order by sales desc)
Answer: D
Explanation: To rank rows per category, you need a `partition by` clause; option D correctly defines the window.
**Question 28.** In Cloudera Data Platform, which service provides lineage visualization for datasets?
A) Ranger
B) Atlas
C) NiFi
D) Oozie
Answer: B
Explanation: Apache Atlas tracks data lineage and metadata across CDP services.
**Question 29.** Which Airflow operator is best suited for executing a HiveQL script stored in HDFS?
A) HiveOperator
B) PrestoOperator
C) SqoopOperator
D) SparkSubmitOperator
Answer: A
Explanation: `HiveOperator` runs Hive queries or scripts, optionally pointing to a file in HDFS.
**Question 30.** Which Spark configuration controls the maximum size of a serialized task result that can be fetched by the driver?
A) spark.driver.maxResultSize
B) spark.task.maxResultSize
C) spark.executor.memoryOverhead
D) spark.sql.shuffle.partitions
Answer: A
Explanation: `spark.driver.maxResultSize` limits the total size of results returned to the driver to avoid OOM errors.
**Question 31.** What is the effect of setting `spark.sql.shuffle.partitions` to a very high number on a small dataset?
A) Faster execution
B) Increased memory usage and overhead
C) Automatic schema inference
D) Reduced number of executor cores
Answer: B
Explanation: Excessive shuffle partitions create many small tasks, leading to higher scheduling overhead and memory consumption.
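
The overhead described in Question 31 can be illustrated with a small plain-Python model of hash partitioning. This is a sketch under stated assumptions, not Spark's implementation: `zlib.crc32` stands in for Spark's own hash function, and `partition_sizes` and `summarize` are hypothetical helpers. It shows that spreading a handful of keys over thousands of partitions leaves almost all of them empty, yet each one would still need to be scheduled as a task unless something such as adaptive query execution coalesces them.

```python
import zlib
from collections import Counter
from typing import Dict, Iterable, List


def partition_sizes(keys: Iterable[object], num_partitions: int) -> List[int]:
    """Count how many keys land in each partition under a simple hash scheme."""
    counts = Counter(
        zlib.crc32(str(key).encode("utf-8")) % num_partitions for key in keys
    )
    return [counts.get(p, 0) for p in range(num_partitions)]


def summarize(keys: Iterable[object], num_partitions: int) -> Dict[str, int]:
    """Report how many partitions are empty versus holding at least one key."""
    sizes = partition_sizes(list(keys), num_partitions)
    empty = sum(1 for size in sizes if size == 0)
    return {
        "partitions": num_partitions,
        "empty": empty,
        "non_empty": num_partitions - empty,
    }


if __name__ == "__main__":
    small_dataset = range(50)  # 50 distinct grouping keys
    for n in (8, 200, 2000):
        print(summarize(small_dataset, n))
```

With only 50 keys, no more than 50 partitions can ever contain data, so every partition beyond that is pure scheduling cost.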

**Question 35.** When converting a DataFrame from JSON to Parquet, which Spark option improves write performance by reducing the number of files?
A) .option("compression","snappy")
B) .repartition(1) before write
C) .option("mergeSchema","true")
D) .coalesce(1) before write
Answer: D
Explanation: `coalesce(1)` merges the existing partitions without a full shuffle, so the write produces a single Parquet file; `repartition(1)` also yields one file but forces a full shuffle first.
**Question 36.** Which Spark method removes duplicate rows based on a subset of columns `["id","date"]`?
A) dropDuplicates(["id","date"])
B) distinct(["id","date"])
C) dropDuplicates().select("id","date")
D) filterDuplicates(["id","date"])
Answer: A
Explanation: `dropDuplicates` accepts a column list to consider when identifying duplicates.
**Question 37.** In CDP, which service provides a managed, serverless Spark environment for data engineers?
A) DataFlow
B) Data Engineering Service (CDE)
C) Data Warehouse
D) Data Hub
Answer: B
Explanation: CDE (Cloudera Data Engineering) offers managed Spark clusters with autoscaling and UI/CLI controls.
**Question 38.** Which Spark SQL clause is used to limit the number of rows returned by a query?
A) TOP
B) LIMIT
C) FETCH FIRST
D) SAMPLE
Answer: B
Explanation: `LIMIT n` restricts the result set to `n` rows.
**Question 39.** Which of the following best describes "data skew" in Spark?
A) Uneven distribution of tasks across executors causing some tasks to run much longer
B) Duplicate records in a dataset
C) Missing values in a column
D) Incorrect data types in a schema
Answer: A
Explanation: Skew occurs when a small number of partitions contain disproportionately many records, leading to long-running tasks.
**Question 40.** Which Airflow operator would you use to move files between HDFS directories?
A) HdfsToS3Operator
B) BashOperator
C) HdfsCopyOperator
D) HdfsFileSensor

D) spark.executor.instances
Answer: A
Explanation: Setting `spark.dynamicAllocation.enabled` to true allows Spark to request and release executors based on workload.
**Question 44.** What does the `spark.sql.sources.partitionOverwriteMode` setting control?
A) How partitions are overwritten when writing a DataFrame
B) The number of shuffle partitions
C) The compression codec for Parquet files
D) The default timestamp format
Answer: A
Explanation: This property determines whether an overwrite replaces only the partitions that receive new data (`dynamic`) or first deletes every partition matching the write's partition specification (`static`, the default).
**Question 45.** In Airflow, which sensor blocks DAG execution until a condition is met?
A) TimeSensor
B) ExternalTaskSensor
C) FileSensor
D) All of the above
Answer: D
Explanation: Sensors (`TimeSensor`, `ExternalTaskSensor`, `FileSensor`, etc.) wait for a condition before allowing downstream tasks to proceed.
**Question 46.** Which Spark function can be used to replace null values in a column `price` with 0?
A) na.fill(0, Seq("price"))
B) fillna(0, ["price"])
C) replaceNull(0, "price")
D) coalesce(col("price"), lit(0))
Answer: A
Explanation: `na.fill` (or `DataFrameNaFunctions.fill`) replaces nulls in specified columns with a given value.
**Question 47.** Which of the following is a benefit of using column pruning in Parquet files?
A) Reducing network I/O by reading only needed columns
B) Encrypting data at rest
C) Enforcing row-level security
D) Improving write latency
Answer: A
Explanation: Column pruning allows Spark to read only the columns required for a query, decreasing I/O.
**Question 48.** Which Airflow parameter determines whether past DAG runs are created when the scheduler catches up after being down?
A) catchup
B) depends_on_past
C) max_active_runs
D) start_date
Answer: A
Explanation: `catchup=False` disables backfilling of missed DAG runs; `True` enables it.
**Question 49.** In Spark, which method returns the schema of a DataFrame as a `StructType` object?
A) printSchema()
B) schema()

C) To serialize a DataFrame to disk
D) To limit the number of shuffle partitions
Answer: A
Explanation: `broadcast(df)` marks the DataFrame so that Spark sends a full copy of it to each executor, enabling a broadcast join.
**Question 53.** Which Iceberg command can be used to roll back a table to a previous snapshot ID?
A) ALTER TABLE ... ROLLBACK TO SNAPSHOT …
B) CALL iceberg.rollback(table, snapshot_id)
C) RESTORE SNAPSHOT snapshot_id ON table
D) MERGE INTO table USING snapshot_id
Answer: A
Explanation: Iceberg supports `ALTER TABLE table ROLLBACK TO SNAPSHOT <id>` to revert to an earlier state. The exact syntax varies by engine; Spark, for instance, exposes the same operation as the `rollback_to_snapshot` stored procedure.
**Question 54.** In Airflow, which attribute of a task defines its retry delay?
A) retry_exponential_backoff
B) retry_delay
C) execution_timeout
D) max_retry_attempts
Answer: B
Explanation: `retry_delay` sets the time interval between consecutive retries of a failed task.
**Question 55.** Which Spark configuration sets the size threshold for broadcast joins?
A) spark.sql.autoBroadcastJoinThreshold
B) spark.broadcast.blockSize
C) spark.sql.broadcastJoinThreshold
D) spark.executor.broadcastLimit
Answer: A
Explanation: This setting (default 10 MB) determines when Spark automatically broadcasts a small table for a join.
**Question 56.** Which of the following is a valid way to add a new column `age INT` to an existing Iceberg table without rewriting data files?
A) ALTER TABLE events ADD COLUMNS (age INT);
B) ALTER TABLE events ADD COLUMN age INT;
C) ALTER TABLE events ADD (age INT);
D) ALTER TABLE events MODIFY COLUMNS (age INT);
Answer: B
Explanation: Iceberg uses the standard `ADD COLUMN` syntax; the table metadata is updated while existing files remain unchanged.
**Question 57.** Which Spark DataFrame method is used to write data in "append" mode?
A) write.mode("append")
B) write.append()
C) write.saveAsTable("tbl", mode="append")
D) write.option("mode","append")
Answer: A
Explanation: `write.mode("append")` tells Spark to add new rows to the target without overwriting existing data.
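
To put the "append" mode from Question 57 in context, the sketch below models the DataFrame writer's save modes (`append`, `overwrite`, `ignore`, and the default `errorifexists`) against an in-memory dictionary. It is a plain-Python illustration of the semantics only: the `write` function and the dictionary "store" are hypothetical stand-ins, not Spark APIs.

```python
from typing import Dict, List

Store = Dict[str, List[dict]]


def write(store: Store, path: str, rows: List[dict], mode: str = "errorifexists") -> None:
    """Apply a Spark-style save mode to an in-memory target."""
    exists = path in store
    if mode == "append":
        # Add the new rows after whatever is already there.
        store.setdefault(path, []).extend(rows)
    elif mode == "overwrite":
        # Replace the existing contents entirely.
        store[path] = list(rows)
    elif mode == "ignore":
        # Leave an existing target untouched; write only if it is absent.
        if not exists:
            store[path] = list(rows)
    elif mode in ("error", "errorifexists"):
        # Default behaviour: refuse to write over an existing target.
        if exists:
            raise FileExistsError(f"target already exists: {path}")
        store[path] = list(rows)
    else:
        raise ValueError(f"unknown save mode: {mode}")


if __name__ == "__main__":
    store: Store = {}
    write(store, "tbl", [{"id": 1}], mode="append")
    write(store, "tbl", [{"id": 2}], mode="append")
    print(store["tbl"])  # both rows are present after two appends
```

Only `append` grows the target; `overwrite` discards what was there, and the other two modes never modify an existing target.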