Cloudera CDP Data Engineer Practice Exam, Exams of Technology

Technology

A deep engineering exam testing distributed data processing skills using Spark, Kafka, NiFi, Airflow, and CDP data services. It emphasizes building resilient pipelines, streaming analytics, job optimization, cluster tuning, schema evolution, and orchestrating enterprise data flows.

Typology: Exams

2025/2026

Available from 01/06/2026

shilpi-jain-1 🇮🇳


Cloudera CDP Data Engineer Practice Exam

**Question 1.** Which component in a Spark-on-Kubernetes deployment is responsible for launching driver pods?

A) Spark Submit

B) kube‑scheduler

C) Spark‑executor‑controller

D) Spark‑master

Answer: A

Explanation: `spark-submit` translates the Spark application into a driver pod specification and asks the Kubernetes API to create the driver pod.

**Question 2.** In Spark DataFrames, which method returns a new DataFrame containing only distinct rows?

A) distinct()

B) unique()

C) dropDuplicates()

D) filterDistinct()

Answer: A

Explanation: `distinct()` removes duplicate rows and produces a DataFrame with unique records.
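The difference between `distinct()` and `dropDuplicates()` with a column subset can be modeled in plain Python. This is a minimal sketch, not Spark code: the `drop_duplicates` helper and the sample rows are hypothetical, and the helper keeps the first row seen per key, whereas Spark does not guarantee which duplicate survives.

```python
from typing import Dict, Hashable, Iterable, List, Optional, Sequence

Row = Dict[str, Hashable]


def drop_duplicates(rows: Iterable[Row], subset: Optional[Sequence[str]] = None) -> List[Row]:
    """Keep the first row seen for each distinct key.

    With subset=None every column is part of the key, which models distinct().
    With a subset, only those columns are compared, which models
    dropDuplicates(["id", "date"]).
    """
    seen = set()
    kept = []
    for row in rows:
        columns = subset if subset is not None else sorted(row)
        key = tuple((c, row[c]) for c in columns)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Hypothetical sample data (assumption, not from the exam).
rows = [
    {"id": 1, "date": "2025-01-01", "amount": 10},
    {"id": 1, "date": "2025-01-01", "amount": 10},  # exact duplicate
    {"id": 1, "date": "2025-01-01", "amount": 99},  # same id/date, different amount
]

all_columns = drop_duplicates(rows)                    # models distinct()
by_key = drop_duplicates(rows, subset=["id", "date"])  # models dropDuplicates on a subset
print(len(all_columns), len(by_key))
```

Comparing all columns removes only the exact duplicate, while comparing `id` and `date` alone collapses the three rows into one.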

**Question 3.** Which Spark storage level persists data only in memory and discards it when the JVM runs out of memory?

A) MEMORY_ONLY_SER

B) MEMORY_AND_DISK

C) OFF_HEAP

D) MEMORY_ONLY

Answer: D


Explanation: `MEMORY_ONLY` stores partitions as deserialized Java objects in RAM; if memory is insufficient, some partitions are not cached.

**Question 4.** When reading a Hive table with Spark, which option enables schema inference from the Hive Metastore?

A) .option("inferSchema", "true")

B) .format("hive")

C) .enableHiveSupport()

D) .option("hive.metastore.uris", "...")

Answer: C

Explanation: `enableHiveSupport()` tells Spark to use the Hive Metastore for schema information when accessing Hive tables.

**Question 5.** Which file format provides columnar storage, built-in compression, and is optimal for analytical queries?

A) JSON

B) CSV

C) Parquet

D) Avro

Answer: C

Explanation: Parquet stores data column-wise, enabling predicate push-down and efficient compression for analytics.

**Question 6.** In Airflow, which object defines the execution order and dependencies of tasks?

A) Operator

B) DAG

C) XCom

C) Executors

D) SQL

Answer: B

Explanation: The Stages tab shows task duration distribution; unusually long tasks in a stage often indicate skew.

**Question 10.** When using `groupBy` followed by `agg`, which Spark concept determines how many shuffle partitions are created?

A) spark.sql.shuffle.partitions

B) spark.default.parallelism

C) spark.sql.autoBroadcastJoinThreshold

D) spark.executor.cores

Answer: A

Explanation: `spark.sql.shuffle.partitions` defines the default number of partitions for shuffle operations like `groupBy`.

**Question 11.** Which Spark transformation is lazy and does not trigger execution until an action is called?

A) collect()

B) count()

C) map()

D) show()

Answer: C

Explanation: `map()` creates a new RDD/DataFrame but does not compute results until an action (e.g., `collect`) is invoked.

**Question 12.** Which of the following is a best practice to improve join performance on large tables?

A) Use cross join without conditions

B) Broadcast the larger table

C) Broadcast the smaller table

D) Disable partition pruning

Answer: C

Explanation: Broadcasting the smaller table avoids shuffling the larger one, making the join more efficient.

**Question 13.** In Iceberg, what feature enables “time-travel” queries?

A) Hidden partitioning

B) Snapshot IDs

C) Partition evolution

D) Schema pruning

Answer: B

Explanation: Iceberg stores each write as a snapshot; querying a prior snapshot ID allows reading data as of that point in time.

**Question 14.** Which command creates a Kudu table with a primary key on column `id`?

A) CREATE TABLE t (id INT PRIMARY KEY) STORED AS KUDU;

B) CREATE TABLE t (id INT, PRIMARY KEY (id)) USING KUDU;

C) CREATE TABLE t (id INT PRIMARY KEY) USING KUDU;

D) CREATE TABLE t (id INT, PRIMARY KEY (id)) STORED AS KUDU;

Answer: D

Explanation: In Impala syntax, `PRIMARY KEY (id)` defines the Kudu primary key and `STORED AS KUDU` selects the storage engine; `USING KUDU` is not valid. Impala also accepts an inline `PRIMARY KEY` attribute for a single-column key, so the form in option A may be accepted as well.

**Question 15.** Which Spark SQL function can be used to explode an array column into multiple rows?

A) start_date

B) schedule_interval

C) catchup

D) default_args

Answer: B

Explanation: `schedule_interval` specifies how often the DAG should be triggered (e.g., cron expression).

**Question 19.** Which Spark configuration enables automatic broadcast of a table when its size is less than the threshold?

A) spark.sql.autoBroadcastJoinThreshold

B) spark.broadcast.compress

C) spark.sql.broadcastTimeout

D) spark.sql.broadcastJoinSize

Answer: A

Explanation: `spark.sql.autoBroadcastJoinThreshold` (default 10 MB) determines when Spark will automatically broadcast a smaller table.

**Question 20.** Which DataFrame method would you use to write data in ORC format partitioned by column `year`?

A) write.format("orc").partitionBy("year")

B) write.orc().partitionBy("year")

C) write.saveAsTable("tbl", format="orc", partitionBy="year")

D) write.mode("overwrite").orc("path", partitionBy="year")

Answer: A

Explanation: `write.format("orc").partitionBy("year")` sets the file format and partition column before calling `save`.
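A partitioned write like the one in Question 20 produces one directory per partition value, named in the Hive style `column=value`. The sketch below models that layout in plain Python; it is not Spark code, and the `partition_layout` helper, the sample rows, and the `/data/orders_orc` path are all hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List

Row = Dict[str, Any]


def partition_layout(rows: List[Row], column: str, base_path: str) -> Dict[str, List[Row]]:
    """Group rows into Hive-style partition directories (column=value)."""
    layout: Dict[str, List[Row]] = defaultdict(list)
    for row in rows:
        directory = f"{base_path}/{column}={row[column]}"
        # The partition value lives in the directory name, so it is
        # left out of the rows stored in that directory's data files.
        data = {k: v for k, v in row.items() if k != column}
        layout[directory].append(data)
    return dict(layout)


# Hypothetical sample data and output path (assumptions).
rows = [
    {"order_id": 1, "year": 2024, "total": 30.0},
    {"order_id": 2, "year": 2025, "total": 12.5},
    {"order_id": 3, "year": 2025, "total": 7.0},
]

layout = partition_layout(rows, "year", "/data/orders_orc")
for directory in sorted(layout):
    print(directory, len(layout[directory]))
```

A query that filters on `year` only needs to read the matching directories, which is what makes partition pruning possible.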

**Question 21.** What is the primary advantage of using bucketed tables over plain partitioned tables in Hive?

A) Faster writes

B) Better compression

C) Optimized joins on bucket columns

D) Automatic schema evolution

Answer: C

Explanation: Bucketing distributes rows into fixed buckets, enabling map-side joins when both tables are bucketed on the join key.

**Question 22.** Which Spark UI metric indicates the amount of data read from external storage during a stage?

A) Input Metrics → Bytes Read

B) Shuffle Read Metrics → Bytes Read

C) Executor Metrics → Disk Spilled

D) Task Metrics → Duration

Answer: A

Explanation: Input Metrics track data read directly from sources like HDFS, S3, or JDBC.

**Question 23.** In Airflow, which trigger rule executes a downstream task only if all upstream tasks succeeded?

A) all_success

B) one_success

C) none_failed

D) all_done

B) memory

C) foreachBatch

D) file

Answer: C

Explanation: `foreachBatch` allows you to apply idempotent batch writes (e.g., to Delta Lake) inside a transaction, achieving exactly-once guarantees.

**Question 27.** Which Spark SQL function calculates the rank of rows within a partition ordered by sales descending?

A) rank() over (order by sales desc)

B) dense_rank() over (partition by sales order by desc)

C) row_number() over (partition by sales order by desc)

D) rank() over (partition by category order by sales desc)

Answer: D

Explanation: To rank rows per category, you need a `partition by` clause; option D correctly defines the window.

**Question 28.** In Cloudera Data Platform, which service provides lineage visualization for datasets?

A) Ranger

B) Atlas

C) NiFi

D) Oozie

Answer: B

Explanation: Apache Atlas tracks data lineage and metadata across CDP services.

**Question 29.** Which Airflow operator is best suited for executing a HiveQL script stored in HDFS?

A) HiveOperator

B) PrestoOperator

C) SqoopOperator

D) SparkSubmitOperator

Answer: A

Explanation: `HiveOperator` runs Hive queries or scripts, optionally pointing to a file in HDFS.

**Question 30.** Which Spark configuration controls the maximum size of a serialized task result that can be fetched by the driver?

A) spark.driver.maxResultSize

B) spark.task.maxResultSize

C) spark.executor.memoryOverhead

D) spark.sql.shuffle.partitions

Answer: A

Explanation: `spark.driver.maxResultSize` limits the total size of results returned to the driver to avoid OOM errors.

**Question 31.** What is the effect of setting `spark.sql.shuffle.partitions` to a very high number on a small dataset?

A) Faster execution

B) Increased memory usage and overhead

C) Automatic schema inference

D) Reduced number of executor cores

Answer: B

Explanation: Excessive shuffle partitions create many small tasks, leading to higher scheduling overhead and memory consumption.
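The overhead described in Question 31 can be illustrated with a plain-Python model of hash partitioning. This is a sketch under stated assumptions, not Spark's implementation: the `partition_sizes` helper and the sample keys are hypothetical, and Python's built-in `hash` stands in for Spark's own hash function.

```python
from collections import Counter
from typing import Iterable, List


def partition_sizes(keys: Iterable[int], num_partitions: int) -> List[int]:
    """Count how many records land in each partition under hash partitioning."""
    counts = Counter(hash(k) % num_partitions for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


# Hypothetical small dataset: 1,000 records over only 10 distinct grouping keys.
keys = [i % 10 for i in range(1000)]

for n in (4, 200):
    sizes = partition_sizes(keys, n)
    empty = sizes.count(0)
    print(f"{n} partitions: {n - empty} non-empty, {empty} empty")
```

With only ten distinct keys, at most ten partitions can receive data, so nearly all of the 200 partitions are empty yet each would still correspond to a task to schedule. In Spark 3, adaptive query execution may coalesce such small partitions when it is enabled.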

**Question 35.** When converting a DataFrame from JSON to Parquet, which Spark option improves write performance by reducing the number of files?

A) .option("compression","snappy")

B) .repartition(1) before write

C) .option("mergeSchema","true")

D) .coalesce(1) before write

Answer: D

Explanation: `coalesce(1)` merges the existing partitions into one without a full shuffle, so the write produces a single Parquet file. `repartition(1)` would also yield one file but forces a full shuffle first.

**Question 36.** Which Spark method removes duplicate rows based on a subset of columns `["id","date"]`?

A) dropDuplicates(["id","date"])

B) distinct(["id","date"])

C) dropDuplicates().select("id","date")

D) filterDuplicates(["id","date"])

Answer: A

Explanation: `dropDuplicates` accepts a column list to consider when identifying duplicates.

**Question 37.** In CDP, which service provides a managed, serverless Spark environment for data engineers?

A) DataFlow

B) Data Engineering Service (CDE)

C) Data Warehouse

D) Data Hub

Answer: B

Explanation: CDE (Cloudera Data Engineering) offers managed Spark clusters with autoscaling and UI/CLI controls.

**Question 38.** Which Spark SQL clause is used to limit the number of rows returned by a query?

A) TOP

B) LIMIT

C) FETCH FIRST

D) SAMPLE

Answer: B

Explanation: `LIMIT n` restricts the result set to n rows.

**Question 39.** Which of the following best describes “data skew” in Spark?

A) Uneven distribution of tasks across executors causing some tasks to run much longer

B) Duplicate records in a dataset

C) Missing values in a column

D) Incorrect data types in a schema

Answer: A

Explanation: Skew occurs when a small number of partitions contain disproportionately many records, leading to long-running tasks.

**Question 40.** Which Airflow operator would you use to move files between HDFS directories?

A) HdfsToS3Operator

B) BashOperator

C) HdfsCopyOperator

D) HdfsFileSensor

D) spark.executor.instances

Answer: A

Explanation: Setting `spark.dynamicAllocation.enabled` to true allows Spark to request and release executors based on workload.

**Question 44.** What does the `spark.sql.sources.partitionOverwriteMode` setting control?

A) How partitions are overwritten when writing a DataFrame

B) The number of shuffle partitions

C) The compression codec for Parquet files

D) The default timestamp format

Answer: A

Explanation: In `static` mode (the default), Spark deletes every partition that matches the partition specification before writing. In `dynamic` mode, it overwrites only the partitions that receive new data and leaves the others in place.

**Question 45.** In Airflow, which sensor blocks DAG execution until a condition is met?

A) TimeSensor

B) ExternalTaskSensor

C) FileSensor

D) All of the above

Answer: D

Explanation: Sensors (`TimeSensor`, `ExternalTaskSensor`, `FileSensor`, etc.) wait for a condition before allowing downstream tasks to proceed.

**Question 46.** Which Spark function can be used to replace null values in a column `price` with 0?

A) na.fill(0, Seq("price"))

B) fillna(0, ["price"])

C) replaceNull(0, "price")

D) coalesce(col("price"), lit(0))

Answer: A

Explanation: `na.fill` (`DataFrameNaFunctions.fill`) replaces nulls in the specified columns with the given value. Option A shows the Scala form; PySpark exposes the same operation as `df.na.fill(0, ["price"])` or `df.fillna(0, ["price"])`.

**Question 47.** Which of the following is a benefit of using column pruning in Parquet files?

A) Reducing network I/O by reading only needed columns

B) Encrypting data at rest

C) Enforcing row-level security

D) Improving write latency

Answer: A

Explanation: Column pruning allows Spark to read only the columns required for a query, decreasing I/O.

**Question 48.** Which Airflow parameter determines whether past DAG runs are created when the scheduler catches up after being down?

A) catchup

B) depends_on_past

C) max_active_runs

D) start_date

Answer: A

Explanation: `catchup=False` disables backfilling of missed DAG runs; `True` enables it.

**Question 49.** In Spark, which method returns the schema of a DataFrame as a `StructType` object?

A) printSchema()

B) schema()

C) To serialize a DataFrame to disk

D) To limit the number of shuffle partitions

Answer: A

Explanation: `broadcast(df)` hints Spark to send the entire DataFrame to each executor, enabling a broadcast join.

**Question 53.** Which Iceberg command can be used to roll back a table to a previous snapshot ID?

A) ALTER TABLE ... ROLLBACK TO SNAPSHOT …

B) CALL iceberg.rollback(table, snapshot_id)

C) RESTORE SNAPSHOT snapshot_id ON table

D) MERGE INTO table USING snapshot_id

Answer: A

Explanation: Of the options listed, only the `ALTER TABLE` form resembles supported syntax. In Hive and Impala the statement is written `ALTER TABLE <table> EXECUTE ROLLBACK(<id>)`; in Spark the equivalent is the `rollback_to_snapshot` stored procedure, invoked with `CALL`.

**Question 54.** In Airflow, which attribute of a task defines its retry delay?

A) retry_exponential_backoff

B) retry_delay

C) execution_timeout

D) max_retry_attempts

Answer: B

Explanation: `retry_delay` sets the time interval between consecutive retries of a failed task.

**Question 55.** Which Spark configuration sets the size threshold below which a table is broadcast for a join?

A) spark.sql.autoBroadcastJoinThreshold

B) spark.broadcast.blockSize

C) spark.sql.broadcastJoinThreshold

D) spark.executor.broadcastLimit

Answer: A

Explanation: This setting (default 10 MB) determines when Spark automatically broadcasts a small table for a join.

**Question 56.** Which of the following is a valid way to add a new column `age INT` to an existing Iceberg table without rewriting data files?

A) ALTER TABLE events ADD COLUMNS (age INT);

B) ALTER TABLE events ADD COLUMN age INT;

C) ALTER TABLE events ADD (age INT);

D) ALTER TABLE events MODIFY COLUMNS (age INT);

Answer: B

Explanation: Iceberg uses the standard `ADD COLUMN` syntax; the table metadata is updated while existing files remain unchanged.

**Question 57.** Which Spark DataFrame method is used to write data in “append” mode?

A) write.mode("append")

B) write.append()

C) write.saveAsTable("tbl", mode="append")

D) write.option("mode","append")

Answer: A

Explanation: `write.mode("append")` tells Spark to add new rows to the target without overwriting existing data.
