The Cloudera CDP Data Developer (CDPDD) Exam Preparation delivers comprehensive coverage of big data architecture, Cloudera platform tools, data ingestion, transformation, and analytics practices. Candidates will strengthen hands-on skills for building, deploying, and managing scalable data pipelines. Structured labs, practice scenarios, and exam-style questions ensure readiness for certification.
Typology: Exams

Question 1. In Spark’s architecture, which component is responsible for converting a user program into a logical execution plan?
A) Driver
B) Executor
C) Cluster Manager
D) Task Scheduler
Answer: A
Explanation: The Driver receives the user code, creates SparkContext, and builds the logical plan before it is optimized and sent to executors.

Question 2. Which Spark component actually runs the tasks that perform the computation?
A) Driver
B) Executor
C) Cluster Manager
D) Scheduler Backend
Answer: B
Explanation: Executors are launched on worker nodes and execute the tasks assigned by the driver.

Question 3. When using Spark on YARN, what does the “client” deployment mode mean?
A) Driver runs inside the ApplicationMaster
B) Driver runs on the client machine that submitted the job
C) Driver runs on the first available worker node
D) Driver runs on a dedicated master node
Answer: B
Explanation: In client mode, the driver stays on the submitting machine, while the ApplicationMaster only coordinates resources.

Question 4. Which of the following statements about Spark stages is correct?
A) A stage contains tasks that can be executed in parallel without shuffling data.
B) A stage is equivalent to a Spark job.
C) Stages are defined by user‑defined functions.
D) Each stage corresponds to a single partition.
Answer: A
Explanation: Stages are separated by shuffle boundaries; within a stage, each task processes its own partition without needing data from other partitions.

Question 5. In Spark’s DAG, a “job” is created when which action is called?
A) map()
B) filter()
C) collect()
D) select()
Answer: C
Explanation: Actions such as collect(), count(), and save() trigger job execution; transformations only build the DAG.

Question 6. Which of the following best describes the difference between cache() and persist(StorageLevel.MEMORY_ONLY)?
A) cache() always stores data on disk, persist() never does.
B) cache() uses the default storage level MEMORY_AND_DISK, persist() allows explicit level.
C) cache() is deprecated, persist() is the new API.
B) lead()
C) lag()
D) rank()
Answer: A
Explanation: The window function sum() with an ORDER BY clause creates a cumulative total.

Question 10. What is the primary performance drawback of using a User‑Defined Function (UDF) in Spark SQL?
A) UDFs cannot be used with DataFrames.
B) UDFs are not compiled to native code and prevent Catalyst optimizations.
C) UDFs automatically cache data, increasing memory usage.
D) UDFs cause Spark to run in single‑threaded mode.
Answer: B
Explanation: UDFs are black boxes to Catalyst, so Spark cannot optimize them, leading to slower execution.

Question 11. Which of the following is a built‑in Spark function for handling null values?
A) coalesce()
B) explode()
C) regexp_replace()
D) json_tuple()
Answer: A
Explanation: coalesce() returns the first non‑null value among its arguments.

Question 12. In Spark Structured Streaming, which output mode is NOT supported?
A) Append
B) Update
C) Complete
D) Overwrite
Answer: D
Explanation: Overwrite is not a streaming output mode; Append, Update, and Complete are supported.

Question 13. Which format provides schema evolution and hidden partitioning without requiring Hive Metastore?
A) CSV
B) Parquet
C) Avro
D) Apache Iceberg
Answer: D
Explanation: Iceberg, a table format layered over data files such as Parquet, stores schema and partition metadata in the table itself, supporting evolution and hidden partitioning.

Question 14. What ACID property does Iceberg guarantee when multiple writers concurrently add data files?
A) Atomicity only
B) Consistency only
C) Isolation only
D) All four (Atomicity, Consistency, Isolation, Durability)
Answer: D
Explanation: Iceberg implements full ACID semantics using snapshot isolation and atomic commits.
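The atomic-commit behaviour named in Question 14 can be pictured with a small toy model. The sketch below is not Iceberg code: `ToyTable`, `CommitConflict`, and `append_with_retry` are hypothetical names, and a Python lock stands in for the catalog's atomic pointer swap. It only illustrates the general idea of optimistic concurrency, where a writer commits against the snapshot it started from and retries if another writer got there first.

```python
import threading


class CommitConflict(Exception):
    """Raised when another writer committed first."""


class ToyTable:
    """Toy model: table state is an append-only list of immutable snapshots."""

    def __init__(self):
        self._lock = threading.Lock()   # stands in for an atomic catalog swap
        self.snapshots = [frozenset()]  # snapshot 0 is the empty table

    def current_id(self):
        return len(self.snapshots) - 1

    def read(self, snapshot_id):
        # Readers see one fixed snapshot, never a half-written commit.
        return self.snapshots[snapshot_id]

    def commit(self, base_id, new_files):
        # Compare-and-swap: succeed only if nobody committed since base_id.
        with self._lock:
            if base_id != self.current_id():
                raise CommitConflict(f"base snapshot {base_id} is stale")
            self.snapshots.append(self.snapshots[base_id] | frozenset(new_files))
            return self.current_id()


def append_with_retry(table, new_files, max_attempts=5):
    for _ in range(max_attempts):
        base = table.current_id()
        try:
            return table.commit(base, new_files)
        except CommitConflict:
            continue  # re-read the latest snapshot and try again
    raise CommitConflict("gave up after retries")


table = ToyTable()
base = table.current_id()              # writers A and B both start from snapshot 0
table.commit(base, {"a.parquet"})      # writer A commits first
try:
    table.commit(base, {"b.parquet"})  # writer B's base is now stale
    conflict = False
except CommitConflict:
    conflict = True
latest = append_with_retry(table, {"b.parquet"})  # B retries on the new base
```

In this model a reader holding snapshot 1 keeps seeing only `a.parquet` even after snapshot 2 is committed, which is the intuition behind snapshot isolation.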
Explanation: Managed tables have both metadata and data owned by Hive; dropping the table deletes the data files.

Question 18. In a partitioned Iceberg table, which approach reduces the number of files read during a query?
A) Increasing the number of partitions
B) Using hidden partitioning
C) Storing data in CSV format
D) Disabling predicate pushdown
Answer: B
Explanation: Hidden partitioning lets Iceberg prune files based on data values without explicit directory structures.

Question 19. Which cloud storage option is NOT natively supported by CDP Data Hub for table locations?
A) Amazon S3
B) Azure Data Lake Storage (ADLS) Gen2
C) Google Cloud Storage (GCS)
D) Alibaba OSS
Answer: D
Explanation: CDP supports S3, ADLS, and GCS; Alibaba OSS is not a built‑in option.

Question 20. What is the primary benefit of bucketing a table in Hive/Impala?
A) It eliminates the need for partition pruning.
B) It guarantees equal size files.
C) It enables efficient joins on the bucket column.
D) It stores data in JSON format.
Answer: C
Explanation: Bucketing groups rows with the same bucket column value together, allowing map‑side joins without shuffle.

Question 21. In Airflow, which parameter defines the schedule interval for a DAG?
A) start_date
B) schedule_interval
C) catchup
D) default_args
Answer: B
Explanation: schedule_interval determines how often the DAG runs (cron expression, timedelta, etc.).

Question 22. Which operator would you use in an Airflow DAG to submit a Spark job to a YARN cluster?
A) BashOperator
B) PythonOperator
C) SparkSubmitOperator
D) KubernetesPodOperator
Answer: C
Explanation: SparkSubmitOperator wraps the spark-submit command, allowing submission to YARN, K8s, etc.

Question 23. How does an Airflow Sensor differ from a regular Operator?
A) Sensors run in parallel by default.
A) Increase the number of executors.
B) Use DataFrames instead of RDDs.
C) Broadcast the smaller dataset.
D) Disable Kryo serialization.
Answer: C
Explanation: Broadcasting avoids the shuffle that would otherwise be required to bring matching rows together.

Question 27. When examining a Spark SQL explain plan, which operator indicates a shuffle read? (Select the most specific)
A) Project
B) Filter
C) Exchange
D) Scan
Answer: C
Explanation: Exchange operators represent data movement between stages, i.e., shuffle read/write.

Question 28. What does the “spark.sql.autoBroadcastJoinThreshold” configuration control?
A) Maximum size of a broadcasted table in bytes.
B) Minimum number of partitions for a shuffle join.
C) Timeout for broadcast joins.
D) Number of broadcast replicas.
Answer: A
Explanation: This setting defines the size threshold under which Spark will automatically broadcast a table.

Question 29. Which technique can mitigate data skew for a join on a high‑cardinality key?
A) Increase executor memory.
B) Use salting to add a random prefix to the key.
C) Disable adaptive query execution.
D) Increase the number of shuffle partitions.
Answer: B
Explanation: Salting distributes skewed keys across multiple reducers, balancing load.

Question 30. In Spark, what is the effect of setting “spark.sql.adaptive.enabled” to true?
A) Disables dynamic allocation.
B) Enables runtime optimization such as coalescing shuffle partitions.
C) Forces broadcast joins for all joins.
D) Turns off catalyst optimizer.
Answer: B
Explanation: Adaptive Query Execution (AQE) adjusts query plans during execution, e.g., reducing shuffle partitions.

Question 31. Which of the following is NOT a valid Spark executor memory configuration?
A) spark.executor.memory
B) spark.executor.memoryOverhead
C) spark.executor.memoryFraction
D) spark.executor.memoryLimit
Answer: D
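The salting idea from Question 29 can be illustrated without a cluster. The following is a minimal pure-Python sketch, not Spark code: `toy_partition` is a hypothetical stand-in for a hash partitioner, the key names and row counts are made up, and the salt is a round-robin suffix rather than the random prefix the question mentions, so that the result is reproducible.

```python
from collections import Counter

NUM_PARTITIONS = 8
NUM_SALTS = 8


def toy_partition(key):
    # Deterministic stand-in for a hash partitioner (not Spark's real hash).
    return sum(key.encode("utf-8")) % NUM_PARTITIONS


# 900 rows share one hot key; 100 rows have distinct keys.
keys = ["hot"] * 900 + [f"k{i}" for i in range(100)]

# Without salting, every "hot" row lands in the same partition.
plain_load = Counter(toy_partition(k) for k in keys)

# With salting, each hot row gets a suffix 0..NUM_SALTS-1, so the hot key
# becomes NUM_SALTS distinct keys that can land in different partitions.
salted_keys = [
    f"{k}#{i % NUM_SALTS}" if k == "hot" else k
    for i, k in enumerate(keys)
]
salted_load = Counter(toy_partition(k) for k in salted_keys)

max_plain = max(plain_load.values())
max_salted = max(salted_load.values())
print(f"largest partition without salt: {max_plain} rows")
print(f"largest partition with salt:    {max_salted} rows")
```

In a real join, the other side must carry matching salted keys as well, typically by replicating each of its rows once per salt value, so the balanced load comes at the cost of a larger join input.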
Explanation: The Ranger‑Hive plugin enforces column‑ and row‑level policies defined in Ranger.

Question 35. What is the primary function of Apache Atlas in a CDP environment?
A) Data encryption at rest.
B) Service discovery for micro‑services.
C) Metadata catalog and lineage tracking.
D) Real‑time streaming ingestion.
Answer: C
Explanation: Atlas stores metadata, data models, and lineage information for governance.

Question 36. Which command lists all CDP services in a given environment via the CLI?
A) cdp services list
B) cdp env describe
C) cdp cluster list‑services
D) cdp service describe-all
Answer: A
Explanation: “cdp services list” returns the services deployed in the current environment.

Question 37. In CDP, what does SDX stand for?
A) Secure Data Exchange
B) Shared Data Experience
C) Scalable Data eXecution
D) Service Development eXtension
Answer: B
Explanation: SDX provides a unified interface for security, governance, and data catalog across CDP services.

Question 38. Which of the following is a valid Spark SQL built‑in function for extracting a substring?
A) substr()
B) slice()
C) split()
D) extract()
Answer: A
Explanation: substr(string, pos, len) returns a substring starting at position pos.

Question 39. When reading JSON data with Spark, which option should you set to handle corrupt records gracefully?
A) mode("PERMISSIVE")
B) mode("FAILFAST")
C) mode("DROP")
D) mode("IGNORE")
Answer: A
Explanation: PERMISSIVE mode places malformed records into a column named _corrupt_record instead of failing.

Question 40. Which Iceberg table property controls the number of files written per commit?
A) write.target-file-size-bytes
B) commit.max-files
C) iceberg.max-file-size
B) .trigger(ProcessingTime("5 minutes"))
C) .option("mergeSchema", true)
D) .option("checkpointLocation", "/path")
Answer: D
Explanation: Providing a checkpoint location allows Spark to store offsets and achieve exactly‑once semantics.

Question 44. Which of the following is a correct way to define a window specification that orders rows by “event_time” and partitions by “user_id”?
A) Window.partitionBy("user_id").orderBy("event_time")
B) Window.orderBy("user_id").partitionBy("event_time")
C) Window.partitionBy("event_time").orderBy("user_id")
D) Window.orderBy("event_time", "user_id")
Answer: A
Explanation: partitionBy defines the grouping column; orderBy defines the ordering within each partition.

Question 45. In Iceberg, what does the “metadata.log” file contain?
A) List of all data files in the table.
B) History of snapshots and their locations.
C) Partition column definitions.
D) User access logs.
Answer: B
Explanation: metadata.log tracks snapshot IDs, timestamps, and the locations of metadata files.

Question 46. Which Ranger policy type is used to grant column‑level masking?
A) Row‑level filter policy
B) Column‑level masking policy
C) Data‑level encryption policy
D) Tag‑based policy
Answer: B
Explanation: Ranger supports column masking policies that replace sensitive column values with masked data.

Question 47. What is the default storage level when calling DataFrame.persist() without arguments?
A) MEMORY_ONLY
B) MEMORY_AND_DISK
C) DISK_ONLY
D) OFF_HEAP
Answer: B
Explanation: persist() defaults to MEMORY_AND_DISK, caching in memory and spilling to disk if needed.

Question 48. Which Spark SQL function can be used to convert a timestamp column to a date string in “yyyy-MM-dd” format?
A) date_format(timestamp, 'yyyy-MM-dd')
B) to_date(timestamp, 'yyyy-MM-dd')
C) format_timestamp(timestamp, 'yyyy-MM-dd')
D) cast(timestamp as date)
Answer: A
Explanation: date_format formats a timestamp according to the provided pattern.
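The distinction Question 48 turns on is that date_format produces a string, while to_date and a cast to date produce a date value. The snippet below is only a standard-library Python analogy, not Spark code; note that Python's strftime uses the pattern `%Y-%m-%d` where Spark uses `yyyy-MM-dd`, and the timestamp is an arbitrary example value.

```python
from datetime import datetime, date

ts = datetime(2024, 3, 9, 14, 30, 0)

# Analogue of date_format(ts, 'yyyy-MM-dd'): the result is a formatted string.
as_string = ts.strftime("%Y-%m-%d")

# Analogue of to_date(ts) or CAST(ts AS DATE): the result is a date value,
# not a string, so it carries no display pattern of its own.
as_date = ts.date()

print(repr(as_string))  # '2024-03-09'
print(repr(as_date))    # datetime.date(2024, 3, 9)
```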
Explanation: Setting spark.serializer to the KryoSerializer class activates Kryo serialization.

Question 52. When using Spark on Kubernetes, which resource defines the pod template for executors?
A) spark.kubernetes.executor.podTemplateFile
B) spark.kubernetes.executor.podSpec
C) spark.kubernetes.executor.podName
D) spark.kubernetes.executor.resourceFile
Answer: A
Explanation: spark.kubernetes.executor.podTemplateFile points to a YAML file that defines executor pod specifications.

Question 53. Which function would you use to turn each element of an array column in a DataFrame into its own row?
A) explode()
B) flatten()
C) unnest()
D) split()
Answer: A
Explanation: explode() creates a new row for each element in an array or map column. flatten(), by contrast, merges an array of arrays into a single array without adding rows.

Question 54. In a Hive external table pointing to an S3 location, what happens when you DROP the table?
A) Data files are deleted from S3.
B) Only the metadata is removed; data remains in S3.
C) The table is converted to a managed table.
D) S3 bucket is also deleted.
Answer: B
Explanation: External tables only store metadata in the Metastore; dropping them does not delete underlying data.

Question 55. Which of the following statements about Spark’s “mapPartitions” transformation is true?
A) It processes each row individually.
B) It provides access to the entire partition as an iterator.
C) It cannot be used with DataFrames.
D) It always results in a shuffle.
Answer: B
Explanation: mapPartitions receives an iterator over all rows in a partition, allowing batch processing.

Question 56. What does the “spark.sql.shuffle.partitions” default value of 200 imply?
A) All shuffle operations will use exactly 200 reducers regardless of data size.
B) Spark will dynamically adjust the number of partitions up to 200.
C) By default, each shuffle will create 200 output partitions.
D) The setting is ignored unless adaptive query execution is enabled.
Answer: C
Explanation: The default config creates 200 partitions for each shuffle stage unless overridden.

Question 57. Which Airflow operator would you use to wait for a file to appear in HDFS before proceeding?
A) HdfsSensor
B) FileSensor