[CDPDD] Cloudera CDP Data Developer Certification Exam Preparation, Exams of Technology

The Cloudera CDP Data Developer (CDPDD) Exam Preparation delivers comprehensive coverage of big data architecture, Cloudera platform tools, data ingestion, transformation, and analytics practices. Candidates will strengthen hands-on skills for building, deploying, and managing scalable data pipelines. Structured labs, practice scenarios, and exam-style questions ensure readiness for certification.

Typology: Exams

2025/2026

Available from 02/02/2026

shilpi-jain-3
[CDPDD] Cloudera CDP Data Developer
Certification Exam Preparation
Question 1. In Spark’s architecture, which component is responsible for converting a user
program into a logical execution plan?
A) Driver
B) Executor
C) Cluster Manager
D) Task Scheduler
Answer: A
Explanation: The Driver runs the user code, creates the SparkContext, and builds the logical plan,
which is then optimized into a physical plan and scheduled as tasks on the executors.
Question 2. Which Spark component actually runs the tasks that perform the computation?
A) Driver
B) Executor
C) Cluster Manager
D) Scheduler Backend
Answer: B
Explanation: Executors are launched on worker nodes and execute the tasks assigned by the
driver.
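
The division of labor in Questions 1 and 2 can be illustrated with a plain-Python toy, offered only as a sketch: it does not use Spark, and `run_job` and `sum_of_squares` are hypothetical names. One function plays the driver (it splits the work and collects results), and a thread pool plays the executors (each worker runs one task on one partition).

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(data, num_partitions, task):
    """Toy 'driver': split the input into partitions, hand one task per
    partition to a pool of workers ('executors'), and collect the results."""
    partitions = [data[i::num_partitions] for i in range(num_partitions)]
    with ThreadPoolExecutor(max_workers=num_partitions) as executors:
        # map() preserves partition order in the returned results
        return list(executors.map(task, partitions))


def sum_of_squares(partition):
    """Toy task: the computation an 'executor' runs on a single partition."""
    return sum(x * x for x in partition)


partials = run_job(list(range(10)), num_partitions=3, task=sum_of_squares)
total = sum(partials)  # the 'driver' combines the per-partition results
```

In real Spark the driver also builds and optimizes the plan and the executors run in separate JVM processes; the toy only mirrors who plans and who computes.
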
Question 3. When using Spark on YARN, what does the “client” deployment mode mean?
A) Driver runs inside the ApplicationMaster
B) Driver runs on the client machine that submitted the job
C) Driver runs on the first available worker node
D) Driver runs on a dedicated master node
Answer: B
Explanation: In client mode, the driver stays on the submitting machine, while the ApplicationMaster only coordinates resources.
Question 4. Which of the following statements about Spark stages is correct?
A) A stage contains tasks that can be executed in parallel without shuffling data.
B) A stage is equivalent to a Spark job.
C) Stages are defined by user‑defined functions.
D) Each stage corresponds to a single partition.
Answer: A
Explanation: Stages are separated by shuffle boundaries; within a stage, each task processes its own partition and no data moves between partitions.
Question 5. In Spark’s DAG, a “job” is created when which action is called?
A) map()
B) filter()
C) collect()
D) select()
Answer: C
Explanation: Actions such as collect(), count(), and save() trigger job execution; transformations only build the DAG.
Question 6. Which of the following best describes the difference between cache() and persist(StorageLevel.MEMORY_ONLY)?
A) cache() always stores data on disk, persist() never does.
B) cache() uses the default storage level MEMORY_AND_DISK, persist() allows explicit level.
C) cache() is deprecated, persist() is the new API.

B) lead()
C) lag()
D) rank()
Answer: A
Explanation: The window function sum() with an ORDER BY clause creates a cumulative total.
Question 10. What is the primary performance drawback of using a User‑Defined Function (UDF) in Spark SQL?
A) UDFs cannot be used with DataFrames.
B) UDFs are not compiled to native code and prevent Catalyst optimizations.
C) UDFs automatically cache data, increasing memory usage.
D) UDFs cause Spark to run in single‑threaded mode.
Answer: B
Explanation: UDFs are black boxes to Catalyst, so Spark cannot optimize them, leading to slower execution.
Question 11. Which of the following is a built‑in Spark function for handling null values?
A) coalesce()
B) explode()
C) regexp_replace()
D) json_tuple()
Answer: A
Explanation: coalesce() returns the first non‑null value among its arguments.
Question 12. In Spark Structured Streaming, which output mode is NOT supported?
A) Append
B) Update
C) Complete
D) Overwrite
Answer: D
Explanation: Overwrite is not a streaming output mode; Append, Update, and Complete are supported.
Question 13. Which file format provides schema evolution and hidden partitioning without requiring Hive Metastore?
A) CSV
B) Parquet
C) Avro
D) Apache Iceberg
Answer: D
Explanation: Iceberg (strictly a table format rather than a file format) stores schema and partition metadata in the table itself, supporting evolution and hidden partitioning.
Question 14. What ACID property does Iceberg guarantee when multiple writers concurrently add data files?
A) Atomicity only
B) Consistency only
C) Isolation only
D) All four (Atomicity, Consistency, Isolation, Durability)
Answer: D
Explanation: Iceberg implements full ACID semantics using snapshot isolation and atomic commits.
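
The atomic-commit idea behind Question 14 can be sketched as a toy model in plain Python. This is not Iceberg's API: `ToyTable`, `compare_and_swap`, and `append_files` are hypothetical names, and the table state is reduced to a version number plus an immutable tuple of file names. Each writer reads the current snapshot, prepares a new one, and commits only if nobody else committed in between; otherwise it retries against the newer snapshot.

```python
import threading


class ToyTable:
    """Hypothetical table whose state is one pointer to an immutable snapshot."""

    def __init__(self):
        self._lock = threading.Lock()
        self.version = 0
        self.snapshot = ()  # tuple of data file names

    def read(self):
        with self._lock:
            return self.version, self.snapshot

    def compare_and_swap(self, expected_version, new_snapshot):
        """Atomically install new_snapshot if the table is still at expected_version."""
        with self._lock:
            if self.version != expected_version:
                return False  # another writer committed first
            self.version += 1
            self.snapshot = new_snapshot
            return True


def append_files(table, files):
    """Optimistic writer: retry until the commit lands on the latest snapshot."""
    while True:
        version, snapshot = table.read()
        if table.compare_and_swap(version, snapshot + tuple(files)):
            return


table = ToyTable()
writers = [
    threading.Thread(
        target=append_files,
        args=(table, [f"w{i}-a.parquet", f"w{i}-b.parquet"]),
    )
    for i in range(4)
]
for w in writers:
    w.start()
for w in writers:
    w.join()
```

Readers always see a complete snapshot, either before or after a given commit, and never a half-applied set of files.
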

Explanation: Managed tables have both metadata and data owned by Hive; dropping the table deletes the data files.
Question 18. In a partitioned Iceberg table, which approach reduces the number of files read during a query?
A) Increasing the number of partitions
B) Using hidden partitioning
C) Storing data in CSV format
D) Disabling predicate pushdown
Answer: B
Explanation: Hidden partitioning lets Iceberg prune files based on data values without explicit directory structures.
Question 19. Which cloud storage option is NOT natively supported by CDP Data Hub for table locations?
A) Amazon S3
B) Azure Data Lake Storage (ADLS) Gen2
C) Google Cloud Storage (GCS)
D) Alibaba OSS
Answer: D
Explanation: CDP supports S3, ADLS, and GCS; Alibaba OSS is not a built‑in option.
Question 20. What is the primary benefit of bucketing a table in Hive/Impala?
A) It eliminates the need for partition pruning.
B) It guarantees equal size files.
C) It enables efficient joins on the bucket column.
D) It stores data in JSON format.
Answer: C
Explanation: Bucketing groups rows with the same bucket column value together, allowing map‑side joins without shuffle.
Question 21. In Airflow, which parameter defines the schedule interval for a DAG?
A) start_date
B) schedule_interval
C) catchup
D) default_args
Answer: B
Explanation: schedule_interval determines how often the DAG runs (cron expression, timedelta, etc.).
Question 22. Which operator would you use in an Airflow DAG to submit a Spark job to a YARN cluster?
A) BashOperator
B) PythonOperator
C) SparkSubmitOperator
D) KubernetesPodOperator
Answer: C
Explanation: SparkSubmitOperator wraps the spark-submit command, allowing submission to YARN, K8s, etc.
Question 23. How does an Airflow Sensor differ from a regular Operator?
A) Sensors run in parallel by default.

A) Increase the number of executors.
B) Use DataFrames instead of RDDs.
C) Broadcast the smaller dataset.
D) Disable Kryo serialization.
Answer: C
Explanation: Broadcasting avoids the shuffle that would otherwise be required to bring matching rows together.
Question 27. When examining a Spark SQL explain plan, which operator indicates a shuffle read? (Select the most specific)
A) Project
B) Filter
C) Exchange
D) Scan
Answer: C
Explanation: Exchange operators represent data movement between stages, i.e., shuffle read/write.
Question 28. What does the “spark.sql.autoBroadcastJoinThreshold” configuration control?
A) Maximum size of a broadcasted table in bytes.
B) Minimum number of partitions for a shuffle join.
C) Timeout for broadcast joins.
D) Number of broadcast replicas.
Answer: A
Explanation: This setting defines the size threshold under which Spark will automatically broadcast a table.
Question 29. Which technique can mitigate data skew for a join on a high‑cardinality key?
A) Increase executor memory.
B) Use salting to add a random prefix to the key.
C) Disable adaptive query execution.
D) Increase the number of shuffle partitions.
Answer: B
Explanation: Salting distributes skewed keys across multiple reducers, balancing load.
Question 30. In Spark, what is the effect of setting “spark.sql.adaptive.enabled” to true?
A) Disables dynamic allocation.
B) Enables runtime optimization such as coalescing shuffle partitions.
C) Forces broadcast joins for all joins.
D) Turns off the Catalyst optimizer.
Answer: B
Explanation: Adaptive Query Execution (AQE) adjusts query plans during execution, e.g., reducing shuffle partitions.
Question 31. Which of the following is NOT a valid Spark executor memory configuration?
A) spark.executor.memory
B) spark.executor.memoryOverhead
C) spark.executor.memoryFraction
D) spark.executor.memoryLimit
Answer: D

Explanation: The Ranger‑Hive plugin enforces column‑ and row‑level policies defined in Ranger.
Question 35. What is the primary function of Apache Atlas in a CDP environment?
A) Data encryption at rest.
B) Service discovery for micro‑services.
C) Metadata catalog and lineage tracking.
D) Real‑time streaming ingestion.
Answer: C
Explanation: Atlas stores metadata, data models, and lineage information for governance.
Question 36. Which command lists all CDP services in a given environment via the CLI?
A) cdp services list
B) cdp env describe
C) cdp cluster list‑services
D) cdp service describe-all
Answer: A
Explanation: “cdp services list” returns the services deployed in the current environment.
Question 37. In CDP, what does SDX stand for?
A) Secure Data Exchange
B) Shared Data Experience
C) Scalable Data eXecution
D) Service Development eXtension
Answer: B
Explanation: SDX provides a unified interface for security, governance, and data catalog across CDP services.
Question 38. Which of the following is a valid Spark SQL built‑in function for extracting a substring?
A) substr()
B) slice()
C) split()
D) extract()
Answer: A
Explanation: substr(string, pos, len) returns a substring starting at position pos.
Question 39. When reading JSON data with Spark, which option should you set to handle corrupt records gracefully?
A) mode("PERMISSIVE")
B) mode("FAILFAST")
C) mode("DROP")
D) mode("IGNORE")
Answer: A
Explanation: PERMISSIVE mode places malformed records into a column named _corrupt_record instead of failing.
Question 40. Which Iceberg table property controls the number of files written per commit?
A) write.target-file-size-bytes
B) commit.max-files
C) iceberg.max-file-size

B) .trigger(ProcessingTime("5 minutes"))
C) .option("mergeSchema", true)
D) .option("checkpointLocation", "/path")
Answer: D
Explanation: Providing a checkpoint location allows Spark to store offsets and achieve exactly‑once semantics.
Question 44. Which of the following is a correct way to define a window specification that orders rows by “event_time” and partitions by “user_id”?
A) Window.partitionBy("user_id").orderBy("event_time")
B) Window.orderBy("user_id").partitionBy("event_time")
C) Window.partitionBy("event_time").orderBy("user_id")
D) Window.orderBy("event_time", "user_id")
Answer: A
Explanation: partitionBy defines the grouping column; orderBy defines the ordering within each partition.
Question 45. In Iceberg, what does the “metadata.log” file contain?
A) List of all data files in the table.
B) History of snapshots and their locations.
C) Partition column definitions.
D) User access logs.
Answer: B
Explanation: metadata.log tracks snapshot IDs, timestamps, and the locations of metadata files.
Question 46. Which Ranger policy type is used to grant column‑level masking?
A) Row‑level filter policy
B) Column‑level masking policy
C) Data‑level encryption policy
D) Tag‑based policy
Answer: B
Explanation: Ranger supports column masking policies that replace sensitive column values with masked data.
Question 47. What is the default storage level when calling DataFrame.persist() without arguments?
A) MEMORY_ONLY
B) MEMORY_AND_DISK
C) DISK_ONLY
D) OFF_HEAP
Answer: B
Explanation: persist() defaults to MEMORY_AND_DISK, caching in memory and spilling to disk if needed.
Question 48. Which Spark SQL function can be used to convert a timestamp column to a date string in “yyyy‑MM‑dd” format?
A) date_format(timestamp, 'yyyy-MM-dd')
B) to_date(timestamp, 'yyyy-MM-dd')
C) format_timestamp(timestamp, 'yyyy-MM-dd')
D) cast(timestamp as date)
Answer: A
Explanation: date_format formats a timestamp according to the provided pattern.
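
The distinction that Question 48 turns on, a formatted string versus a date value, can be shown with a plain-Python analogy. This is only a sketch using the standard library, not Spark code: `strftime` stands in for date_format, `.date()` stands in for a cast to date, and the sample timestamp is made up.

```python
from datetime import date, datetime

ts = datetime(2024, 3, 9, 14, 30, 0)  # made-up sample timestamp

# Formatting produces a string, like date_format(ts, 'yyyy-MM-dd').
as_string = ts.strftime("%Y-%m-%d")

# Truncating produces a date value, like cast(ts as date).
as_date = ts.date()

print(type(as_string).__name__, as_string)
print(type(as_date).__name__, as_date)
```

The question asks for a date string, which is why the formatting function is the expected answer rather than the cast.
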

Explanation: Setting spark.serializer to the KryoSerializer class activates Kryo serialization.
Question 52. When using Spark on Kubernetes, which resource defines the pod template for executors?
A) spark.kubernetes.executor.podTemplateFile
B) spark.kubernetes.executor.podSpec
C) spark.kubernetes.executor.podName
D) spark.kubernetes.executor.resourceFile
Answer: A
Explanation: spark.kubernetes.executor.podTemplateFile points to a YAML file that defines executor pod specifications.
Question 53. Which function would you use to flatten a nested array column in a DataFrame?
A) explode()
B) flatten()
C) unnest()
D) split()
Answer: A
Explanation: explode() creates a new row for each element in an array or map column. Note that Spark also has a built‑in flatten() function, which merges an array of arrays into a single array without adding rows.
Question 54. In a Hive external table pointing to an S3 location, what happens when you DROP the table?
A) Data files are deleted from S3.
B) Only the metadata is removed; data remains in S3.
C) The table is converted to a managed table.
D) S3 bucket is also deleted.
Answer: B
Explanation: External tables only store metadata in the Metastore; dropping them does not delete underlying data.
Question 55. Which of the following statements about Spark’s “mapPartitions” transformation is true?
A) It processes each row individually.
B) It provides access to the entire partition as an iterator.
C) It cannot be used with DataFrames.
D) It always results in a shuffle.
Answer: B
Explanation: mapPartitions receives an iterator over all rows in a partition, allowing batch processing.
Question 56. What does the “spark.sql.shuffle.partitions” default value of 200 imply?
A) All shuffle operations will use exactly 200 reducers regardless of data size.
B) Spark will dynamically adjust the number of partitions up to 200.
C) By default, each shuffle will create 200 output partitions.
D) The setting is ignored unless adaptive query execution is enabled.
Answer: C
Explanation: The default config creates 200 partitions for each shuffle stage unless overridden.
Question 57. Which Airflow operator would you use to wait for a file to appear in HDFS before proceeding?
A) HdfsSensor
B) FileSensor