[CDPDD] Cloudera CDP Data Developer Certification Exam Preparation, Exams of Technology

Technology

The Cloudera CDP Data Developer (CDPDD) Exam Preparation delivers comprehensive coverage of big data architecture, Cloudera platform tools, data ingestion, transformation, and analytics practices. Candidates will strengthen hands-on skills for building, deploying, and managing scalable data pipelines. Structured labs, practice scenarios, and exam-style questions ensure readiness for certification.

Typology: Exams

2025/2026

Available from 02/02/2026

shilpi-jain-3 🇮🇳

[CDPDD] Cloudera CDP Data Developer Certification Exam Preparation

Question 1. In Spark’s architecture, which component is responsible for converting a user program into a logical execution plan?
A) Driver
B) Executor
C) Cluster Manager
D) Task Scheduler
Answer: A
Explanation: The Driver receives the user code, creates the SparkContext, and builds the logical plan, which is then optimized and scheduled as tasks on the executors.

Question 2. Which Spark component actually runs the tasks that perform the computation?
A) Driver
B) Executor
C) Cluster Manager
D) Scheduler Backend
Answer: B
Explanation: Executors are launched on worker nodes and execute the tasks assigned by the driver.

Question 3. When using Spark on YARN, what does the “client” deployment mode mean?
A) Driver runs inside the ApplicationMaster
B) Driver runs on the client machine that submitted the job
C) Driver runs on the first available worker node
D) Driver runs on a dedicated master node
Answer: B
Explanation: In client mode, the driver stays on the submitting machine, while the ApplicationMaster only coordinates resources.

Question 4. Which of the following statements about Spark stages is correct?
A) A stage contains tasks that can be executed in parallel without shuffling data.
B) A stage is equivalent to a Spark job.
C) Stages are defined by user‑defined functions.
D) Each stage corresponds to a single partition.
Answer: A
Explanation: Stages are separated by shuffle boundaries; within a stage, each task processes one partition without exchanging data with other tasks.

Question 5. In Spark’s DAG, a “job” is created when which action is called?
A) map()
B) filter()
C) collect()
D) select()
Answer: C
Explanation: Actions such as collect(), count(), and save() trigger job execution; transformations only build the DAG.

Question 6. Which of the following best describes the difference between cache() and persist(StorageLevel.MEMORY_ONLY)?
A) cache() always stores data on disk, persist() never does.
B) cache() uses the default storage level MEMORY_AND_DISK, persist() allows explicit level.
C) cache() is deprecated, persist() is the new API.

B) lead()
C) lag()
D) rank()
Answer: A
Explanation: The window function sum() with an ORDER BY clause creates a cumulative total.

Question 10. What is the primary performance drawback of using a User‑Defined Function (UDF) in Spark SQL?
A) UDFs cannot be used with DataFrames.
B) UDFs are not compiled to native code and prevent Catalyst optimizations.
C) UDFs automatically cache data, increasing memory usage.
D) UDFs cause Spark to run in single‑threaded mode.
Answer: B
Explanation: UDFs are black boxes to Catalyst, so Spark cannot optimize them, leading to slower execution.

Question 11. Which of the following is a built‑in Spark function for handling null values?
A) coalesce()
B) explode()
C) regexp_replace()
D) json_tuple()
Answer: A
Explanation: coalesce() returns the first non‑null value among its arguments.

Question 12. In Spark Structured Streaming, which output mode is NOT supported?
A) Append
B) Update
C) Complete
D) Overwrite
Answer: D
Explanation: Overwrite is not a streaming output mode; Append, Update, and Complete are supported.

Question 13. Which file format provides schema evolution and hidden partitioning without requiring Hive Metastore?
A) CSV
B) Parquet
C) Avro
D) Apache Iceberg
Answer: D
Explanation: Iceberg, strictly a table format layered over data files such as Parquet, stores schema and partition metadata in the table itself, supporting evolution and hidden partitioning.

Question 14. Which ACID properties does Iceberg guarantee when multiple writers concurrently add data files?
A) Atomicity only
B) Consistency only
C) Isolation only
D) All four (Atomicity, Consistency, Isolation, Durability)
Answer: D
Explanation: Iceberg implements full ACID semantics using snapshot isolation and atomic commits.
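
To make the hidden-partitioning idea from Question 13 concrete, here is a small plain-Python model. It is not Iceberg's API: the table, the day transform on a timestamp column, and the names `DataFile`, `day_transform`, and `plan_files` are all hypothetical and exist only for this sketch. What it tries to show is that the partition value is derived from the column itself, so a filter on the timestamp alone is enough to skip files.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for a data file entry kept in table metadata."""
    path: str
    partition_day: date  # value produced by the partition transform


def day_transform(ts: datetime) -> date:
    """Derive the partition value from the timestamp column itself."""
    return ts.date()


def plan_files(files, ts_from: datetime, ts_to: datetime):
    """Keep only files whose partition day can hold rows in [ts_from, ts_to]."""
    lo, hi = day_transform(ts_from), day_transform(ts_to)
    return [f for f in files if lo <= f.partition_day <= hi]


# Made-up file listing for the illustration.
files = [
    DataFile("data/a.parquet", date(2026, 1, 30)),
    DataFile("data/b.parquet", date(2026, 1, 31)),
    DataFile("data/c.parquet", date(2026, 2, 1)),
]

# The query filters on the timestamp only; no partition column is named.
selected = plan_files(
    files,
    datetime(2026, 1, 31, 8, 0),
    datetime(2026, 1, 31, 20, 0),
)
print([f.path for f in selected])
```

In this model only the file for 31 January survives planning; the other two are never opened, which is the file pruning the explanation refers to.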

Explanation: Managed tables have both metadata and data owned by Hive; dropping the table deletes the data files.

Question 18. In a partitioned Iceberg table, which approach reduces the number of files read during a query?
A) Increasing the number of partitions
B) Using hidden partitioning
C) Storing data in CSV format
D) Disabling predicate pushdown
Answer: B
Explanation: Hidden partitioning lets Iceberg prune files based on data values without explicit directory structures.

Question 19. Which cloud storage option is NOT natively supported by CDP Data Hub for table locations?
A) Amazon S3
B) Azure Data Lake Storage (ADLS) Gen2
C) Google Cloud Storage (GCS)
D) Alibaba OSS
Answer: D
Explanation: CDP supports S3, ADLS, and GCS; Alibaba OSS is not a built‑in option.

Question 20. What is the primary benefit of bucketing a table in Hive/Impala?
A) It eliminates the need for partition pruning.
B) It guarantees equal size files.
C) It enables efficient joins on the bucket column.
D) It stores data in JSON format.
Answer: C
Explanation: Bucketing groups rows with the same bucket column value together, allowing map‑side joins without shuffle.

Question 21. In Airflow, which parameter defines the schedule interval for a DAG?
A) start_date
B) schedule_interval
C) catchup
D) default_args
Answer: B
Explanation: schedule_interval determines how often the DAG runs (cron expression, timedelta, etc.).

Question 22. Which operator would you use in an Airflow DAG to submit a Spark job to a YARN cluster?
A) BashOperator
B) PythonOperator
C) SparkSubmitOperator
D) KubernetesPodOperator
Answer: C
Explanation: SparkSubmitOperator wraps the spark-submit command, allowing submission to YARN, K8s, etc.

Question 23. How does an Airflow Sensor differ from a regular Operator?
A) Sensors run in parallel by default.

A) Increase the number of executors.
B) Use DataFrames instead of RDDs.
C) Broadcast the smaller dataset.
D) Disable Kryo serialization.
Answer: C
Explanation: Broadcasting avoids the shuffle that would otherwise be required to bring matching rows together.

Question 27. When examining a Spark SQL explain plan, which operator indicates a shuffle read? (Select the most specific)
A) Project
B) Filter
C) Exchange
D) Scan
Answer: C
Explanation: Exchange operators represent data movement between stages, i.e., shuffle read/write.

Question 28. What does the “spark.sql.autoBroadcastJoinThreshold” configuration control?
A) Maximum size of a broadcasted table in bytes.
B) Minimum number of partitions for a shuffle join.
C) Timeout for broadcast joins.
D) Number of broadcast replicas.
Answer: A
Explanation: This setting defines the size threshold under which Spark will automatically broadcast a table.

Question 29. Which technique can mitigate data skew for a join on a high‑cardinality key?
A) Increase executor memory.
B) Use salting to add a random prefix to the key.
C) Disable adaptive query execution.
D) Increase the number of shuffle partitions.
Answer: B
Explanation: Salting distributes skewed keys across multiple reducers, balancing load.

Question 30. In Spark, what is the effect of setting “spark.sql.adaptive.enabled” to true?
A) Disables dynamic allocation.
B) Enables runtime optimization such as coalescing shuffle partitions.
C) Forces broadcast joins for all joins.
D) Turns off the Catalyst optimizer.
Answer: B
Explanation: Adaptive Query Execution (AQE) adjusts query plans during execution, e.g., reducing shuffle partitions.

Question 31. Which of the following is NOT a valid Spark executor memory configuration?
A) spark.executor.memory
B) spark.executor.memoryOverhead
C) spark.executor.memoryFraction
D) spark.executor.memoryLimit
Answer: D
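
The salting idea from Question 29 can be illustrated with a plain-Python model; this is not Spark code. Several things here are assumptions made for the sketch: the byte-sum `reducer_for` function is a deliberately simple, deterministic stand-in for a hash partitioner, the reducer and salt counts are picked by hand, and the salt is assigned round-robin rather than randomly so that the run is repeatable.

```python
from collections import Counter

NUM_REDUCERS = 4
NUM_SALTS = 4  # assumption: chosen by hand for this illustration


def reducer_for(key: str) -> int:
    """Toy partitioner: byte sum modulo reducer count (not Spark's hash)."""
    return sum(key.encode("utf-8")) % NUM_REDUCERS


# A skewed dataset: one hot key dominates.
keys = ["hot"] * 1000 + ["a", "b", "c", "d"] * 10

# Without salting, every "hot" row lands on the same reducer.
plain_load = Counter(reducer_for(k) for k in keys)

# With salting, a prefix turns each key into up to NUM_SALTS shuffle keys.
# Round-robin salts are used instead of random ones for repeatability.
salted_keys = [f"{i % NUM_SALTS}_{k}" for i, k in enumerate(keys)]
salted_load = Counter(reducer_for(k) for k in salted_keys)

print("max rows on one reducer, plain: ", max(plain_load.values()))
print("max rows on one reducer, salted:", max(salted_load.values()))
```

In a real join the other side has to be expanded with every salt value so that matching rows still meet; the sketch only covers how the load is spread.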

Explanation: The Ranger‑Hive plugin enforces column‑ and row‑level policies defined in Ranger.

Question 35. What is the primary function of Apache Atlas in a CDP environment?
A) Data encryption at rest.
B) Service discovery for micro‑services.
C) Metadata catalog and lineage tracking.
D) Real‑time streaming ingestion.
Answer: C
Explanation: Atlas stores metadata, data models, and lineage information for governance.

Question 36. Which command lists all CDP services in a given environment via the CLI?
A) cdp services list
B) cdp env describe
C) cdp cluster list‑services
D) cdp service describe-all
Answer: A
Explanation: “cdp services list” returns the services deployed in the current environment.

Question 37. In CDP, what does SDX stand for?
A) Secure Data Exchange
B) Shared Data Experience
C) Scalable Data eXecution
D) Service Development eXtension
Answer: B
Explanation: SDX provides a unified interface for security, governance, and data catalog across CDP services.

Question 38. Which of the following is a valid Spark SQL built‑in function for extracting a substring?
A) substr()
B) slice()
C) split()
D) extract()
Answer: A
Explanation: substr(string, pos, len) returns a substring of length len starting at position pos.

Question 39. When reading JSON data with Spark, which option should you set to handle corrupt records gracefully?
A) mode("PERMISSIVE")
B) mode("FAILFAST")
C) mode("DROP")
D) mode("IGNORE")
Answer: A
Explanation: PERMISSIVE mode, which is set on the reader as option("mode", "PERMISSIVE"), places malformed records into a column named _corrupt_record by default instead of failing.

Question 40. Which Iceberg table property controls the number of files written per commit?
A) write.target-file-size-bytes
B) commit.max-files
C) iceberg.max-file-size

B) .trigger(ProcessingTime("5 minutes"))
C) .option("mergeSchema", true)
D) .option("checkpointLocation", "/path")
Answer: D
Explanation: Providing a checkpoint location allows Spark to store offsets and achieve exactly‑once semantics.

Question 44. Which of the following is a correct way to define a window specification that orders rows by “event_time” and partitions by “user_id”?
A) Window.partitionBy("user_id").orderBy("event_time")
B) Window.orderBy("user_id").partitionBy("event_time")
C) Window.partitionBy("event_time").orderBy("user_id")
D) Window.orderBy("event_time", "user_id")
Answer: A
Explanation: partitionBy defines the grouping column; orderBy defines the ordering within each partition.

Question 45. In Iceberg, what does the “metadata.log” file contain?
A) List of all data files in the table.
B) History of snapshots and their locations.
C) Partition column definitions.
D) User access logs.
Answer: B
Explanation: metadata.log tracks snapshot IDs, timestamps, and the locations of metadata files.

Question 46. Which Ranger policy type is used to grant column‑level masking?
A) Row‑level filter policy
B) Column‑level masking policy
C) Data‑level encryption policy
D) Tag‑based policy
Answer: B
Explanation: Ranger supports column masking policies that replace sensitive column values with masked data.

Question 47. What is the default storage level when calling DataFrame.persist() without arguments?
A) MEMORY_ONLY
B) MEMORY_AND_DISK
C) DISK_ONLY
D) OFF_HEAP
Answer: B
Explanation: persist() defaults to MEMORY_AND_DISK, caching in memory and spilling to disk if needed.

Question 48. Which Spark SQL function can be used to convert a timestamp column to a date string in “yyyy‑MM‑dd” format?
A) date_format(timestamp, 'yyyy-MM-dd')
B) to_date(timestamp, 'yyyy-MM-dd')
C) format_timestamp(timestamp, 'yyyy-MM-dd')
D) cast(timestamp as date)
Answer: A
Explanation: date_format formats a timestamp according to the provided pattern.
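
As a rough plain-Python analogue of Question 48 (this is not Spark code), the snippet below contrasts formatting a timestamp as a string with converting it to a date value, which is the distinction between options A and D. The sample timestamp is made up, and pairing Spark's `yyyy-MM-dd` pattern with `strftime`'s `%Y-%m-%d` is offered only as an illustration.

```python
from datetime import date, datetime

ts = datetime(2026, 2, 2, 13, 45, 0)  # made-up sample timestamp

# Analogue of date_format(ts, 'yyyy-MM-dd'): the result is a string.
formatted = ts.strftime("%Y-%m-%d")

# Analogue of cast(ts as date): the result is a date value, not a string.
as_date = ts.date()

print(formatted, type(formatted).__name__)
print(as_date, type(as_date).__name__)
```

Both values print the same text, but only the first is a string, which is what the question asks for.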

Explanation: Setting spark.serializer to the KryoSerializer class activates Kryo serialization.

Question 52. When using Spark on Kubernetes, which resource defines the pod template for executors?
A) spark.kubernetes.executor.podTemplateFile
B) spark.kubernetes.executor.podSpec
C) spark.kubernetes.executor.podName
D) spark.kubernetes.executor.resourceFile
Answer: A
Explanation: spark.kubernetes.executor.podTemplateFile points to a YAML file that defines executor pod specifications.

Question 53. Which function would you use to flatten a nested array column in a DataFrame?
A) explode()
B) flatten()
C) unnest()
D) split()
Answer: A
Explanation: explode() creates a new row for each element in an array or map column. Note that Spark also has a flatten() function, which merges an array of arrays into a single array without adding rows; the answer key here takes “flatten” to mean turning array elements into rows.

Question 54. In a Hive external table pointing to an S3 location, what happens when you DROP the table?
A) Data files are deleted from S3.
B) Only the metadata is removed; data remains in S3.
C) The table is converted to a managed table.
D) S3 bucket is also deleted.
Answer: B
Explanation: External tables only store metadata in the Metastore; dropping them does not delete underlying data.

Question 55. Which of the following statements about Spark’s “mapPartitions” transformation is true?
A) It processes each row individually.
B) It provides access to the entire partition as an iterator.
C) It cannot be used with DataFrames.
D) It always results in a shuffle.
Answer: B
Explanation: mapPartitions receives an iterator over all rows in a partition, allowing batch processing.

Question 56. What does the “spark.sql.shuffle.partitions” default value of 200 imply?
A) All shuffle operations will use exactly 200 reducers regardless of data size.
B) Spark will dynamically adjust the number of partitions up to 200.
C) By default, each shuffle will create 200 output partitions.
D) The setting is ignored unless adaptive query execution is enabled.
Answer: C
Explanation: The default config creates 200 partitions for each shuffle stage unless overridden.

Question 57. Which Airflow operator would you use to wait for a file to appear in HDFS before proceeding?
A) HdfsSensor
B) FileSensor
