[CDP3002] CDP 3002 CDP Data Engineer Certification Exam Guide, Exams of Technology

This exam guide prepares candidates for data engineering roles within CDP ecosystems. Coverage includes data architecture, ETL processes, performance optimization, scalability, and governance.

Typology: Exams

2025/2026

Available from 02/10/2026

shilpi-jain-3
[CDP3002] CDP 3002 CDP Data Engineer
Certification Exam Guide
**Question 1. Which component in a Spark-on-Kubernetes deployment is responsible for
coordinating the execution of tasks across executor pods?**
A) Spark Driver pod
B) Kube-Scheduler
C) Spark Executor pod
D) Kubernetes API server
Answer: A
Explanation: The Spark driver pod runs the SparkContext, creates tasks, and schedules them to
executor pods; it is the central coordinator in a Spark-on-K8s cluster.
**Question 2. In Kubernetes, which resource limit setting controls the maximum amount of
memory a Spark executor pod can use?**
A) cpuLimit
B) memoryRequest
C) memoryLimit
D) podQuota
Answer: C
Explanation: `memoryLimit` defines the upper bound of memory a pod can consume (in a pod spec
this is `resources.limits.memory`); a container that exceeds it is OOM-killed.
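As a rough illustration of where that limit comes from in Spark on Kubernetes: Spark sizes the executor pod from the executor memory plus a memory overhead. The sketch below assumes the documented defaults for JVM jobs (an overhead factor of 0.10 with a 384 MiB floor). The function name is hypothetical, and PySpark or off-heap settings would add further terms that are ignored here.

```python
MIN_OVERHEAD_MIB = 384  # assumed default floor for the memory overhead


def executor_pod_memory_mib(executor_memory_mib: int,
                            overhead_factor: float = 0.10) -> int:
    """Estimate the memory (MiB) an executor pod asks Kubernetes for.

    Assumption: total = executor memory + max(factor * memory, 384 MiB).
    """
    overhead = max(int(executor_memory_mib * overhead_factor), MIN_OVERHEAD_MIB)
    return executor_memory_mib + overhead


if __name__ == "__main__":
    for memory in (1024, 4096, 16384):
        print(memory, "MiB executor ->", executor_pod_memory_mib(memory), "MiB pod")
```

A process that grows past this total is what Kubernetes terminates with an OOM kill, which is why small executors are dominated by the 384 MiB floor rather than the percentage.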
**Question 3. When configuring Spark on Kubernetes, what is the effect of setting
`spark.kubernetes.driver.podTemplateFile`?**
A) It defines the Docker image for the driver.
B) It provides a pod-level YAML template for the driver pod.
C) It configures the number of driver cores.
D) It enables driver-side logging to a remote store.
Answer: B


Explanation: The `podTemplateFile` option lets users supply a full pod specification (labels, volumes, etc.) for the driver pod.
**Question 4. Which Spark storage level persists data in memory in a serialized format and also spills to disk when necessary?**
A) MEMORY_ONLY
B) DISK_ONLY
C) MEMORY_AND_DISK_SER
D) OFF_HEAP
Answer: C
Explanation: `MEMORY_AND_DISK_SER` stores data serialized in memory and writes overflow to disk, reducing memory pressure.
**Question 5. In Spark SQL, what does the `EXPLAIN` command display?**
A) Physical execution plan only.
B) Logical plan, optimized logical plan, and physical plan.
C) Data lineage graph.
D) Only the cost-based optimizer decisions.
Answer: B
Explanation: With the `EXTENDED` option, `EXPLAIN` shows the parsed and analyzed logical plans, the optimized logical plan, and the final physical plan used for execution; without a mode keyword it prints only the physical plan.
**Question 6. Which of the following best describes Spark’s shuffle operation?**
A) Data is repartitioned without network I/O.
B) Data is moved across executors to satisfy grouping or join requirements.
C) Data is cached locally on each executor.

A) Operator
B) DAG
C) TaskInstance
D) XCom
Answer: B
Explanation: A DAG (Directed Acyclic Graph) specifies tasks and their dependencies.
**Question 10. Which Airflow operator would you use to execute a Spark job on a Kubernetes cluster?**
A) BashOperator
B) SparkSubmitOperator
C) KubernetesPodOperator
D) PythonOperator
Answer: C
Explanation: `KubernetesPodOperator` can launch a pod that runs `spark-submit`, enabling Spark jobs on K8s.
**Question 11. In an Airflow DAG, what does the `schedule_interval='@daily'` argument represent?**
A) Run once a day at midnight UTC.
B) Run every 24 hours after the previous run finishes.
C) Run at the start of each day in the local timezone.
D) Run every hour.
Answer: A
Explanation: `@daily` is a preset cron expression equivalent to `0 0 * * *`, triggering at midnight UTC.
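To make the preset in Question 11 concrete, here is a minimal sketch that lists Airflow's cron presets and checks whether a timestamp falls on a `@daily` tick. The helper name `is_daily_boundary` is hypothetical, and the check assumes the DAG runs in the default UTC timezone and is given timezone-aware datetimes.

```python
from datetime import datetime, timedelta, timezone

# Cron presets and the expressions they stand for.
CRON_PRESETS = {
    "@hourly": "0 * * * *",
    "@daily": "0 0 * * *",
    "@weekly": "0 0 * * 0",
    "@monthly": "0 0 1 * *",
    "@yearly": "0 0 1 1 *",
}


def is_daily_boundary(moment: datetime) -> bool:
    """True when a timezone-aware `moment` is exactly midnight in UTC."""
    utc = moment.astimezone(timezone.utc)
    return (utc.hour, utc.minute, utc.second, utc.microsecond) == (0, 0, 0, 0)


if __name__ == "__main__":
    ist = timezone(timedelta(hours=5, minutes=30))
    # 05:30 in India is midnight UTC, so it lines up with a @daily tick.
    print(is_daily_boundary(datetime(2025, 1, 2, 5, 30, tzinfo=ist)))
```

Local midnight in a non-UTC zone does not match, which is the distinction between options A and C.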

**Question 12. How does the Airflow `catchup` parameter affect DAG execution when scheduled intervals have been missed?**
A) It disables all future runs.
B) It runs only the latest missed schedule.
C) It backfills all missed runs from the start date.
D) It skips missed intervals entirely.
Answer: C
Explanation: When `catchup=True`, Airflow creates a DAG run (and its task instances) for every interval that was missed since the DAG’s `start_date`.
**Question 13. Which Airflow feature enables tasks to exchange small pieces of data without using external storage?**
A) Variables
B) Connections
C) XComs
D) Pools
Answer: C
Explanation: XComs (cross-communication) allow tasks to push and pull small messages or values.
**Question 14. In Airflow, what is the purpose of a Pool?**
A) To group tasks for parallel execution.
B) To limit the number of concurrent task instances for a given resource.
C) To store secret credentials.
D) To schedule DAGs.
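The catchup behavior from Question 12 can be pictured with a small simulation. This is a sketch under simplifying assumptions, not Airflow's scheduler: it models only a daily schedule, the function names are hypothetical, and it treats `catchup=False` as scheduling just the most recent completed interval.

```python
from datetime import datetime, timedelta, timezone


def missed_daily_intervals(start_date, now):
    """List the (start, end) daily data intervals that have fully elapsed."""
    intervals = []
    begin = start_date
    while begin + timedelta(days=1) <= now:
        intervals.append((begin, begin + timedelta(days=1)))
        begin += timedelta(days=1)
    return intervals


def runs_to_create(start_date, now, catchup):
    """With catchup, backfill every interval; without it, only the latest."""
    intervals = missed_daily_intervals(start_date, now)
    return intervals if catchup else intervals[-1:]


if __name__ == "__main__":
    start = datetime(2025, 1, 1, tzinfo=timezone.utc)
    now = datetime(2025, 1, 4, 12, tzinfo=timezone.utc)
    print(len(runs_to_create(start, now, catchup=True)))   # three backfilled runs
    print(len(runs_to_create(start, now, catchup=False)))  # one run
```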

B) Enables automatic schema evolution.
C) Improves query compilation time.
D) Allows storage of nested JSON.
Answer: A
Explanation: Bucketing aligns rows with the same bucket key on the same executor, often avoiding a full shuffle during joins.
**Question 18. Which Spark join strategy is chosen automatically when one side of the join fits in executor memory?**
A) Sort-merge join
B) Broadcast hash join
C) Shuffle hash join
D) Cartesian join
Answer: B
Explanation: Spark’s optimizer picks a broadcast hash join when the smaller side can be broadcast to all executors.
**Question 19. In Spark, what does the “small file problem” refer to?**
A) Excessive number of tiny files causing overhead in the driver.
B) Inability to read files smaller than 128 KB.
C) Memory overflow due to large files.
D) Loss of data during checkpointing.
Answer: A
Explanation: Many small files increase the overhead of file system metadata operations and can degrade read performance.
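One common remedy for the small file problem in Question 19 is to compact many tiny files into fewer larger ones. The sketch below only plans such a compaction by greedily grouping file sizes; the 128 MB target and the function name `plan_compaction` are assumptions for illustration, and a real job would then rewrite each batch as one file.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedily group files into batches of at most `target_mb` each.

    A file that is already larger than the target ends up in its own batch.
    """
    batches, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        # Close the current batch when the next file would overflow it.
        if current and current_size + size > target_mb:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    tiny_files = [10] * 20  # twenty 10 MB files
    print(len(tiny_files), "files ->", len(plan_compaction(tiny_files, 100)), "files")
```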

**Question 20. Which Spark storage level stores data off-heap using Tungsten memory management?**
A) MEMORY_ONLY
B) OFF_HEAP
C) DISK_ONLY
D) MEMORY_AND_DISK_SER_
Answer: B
Explanation: `OFF_HEAP` stores serialized data in off-heap memory managed by Tungsten, reducing JVM GC pressure.
**Question 21. Which Iceberg feature allows you to query the state of a table as of a previous snapshot?**
A) Hidden partitioning
B) Time travel
C) Schema evolution
D) Partition pruning
Answer: B
Explanation: Iceberg’s time-travel capability lets you specify a snapshot ID or timestamp to read historical data.
**Question 22. In Iceberg, what is stored in the “metadata.json” file?**
A) Actual data rows.
B) List of data files and partition information for a snapshot.
C) Hive metastore connection details.
D) Spark executor logs.
Answer: B
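To illustrate the time-travel idea from Question 21, the following sketch resolves a timestamp to the snapshot that was current at that moment. It is a conceptual model only, not Iceberg's implementation: the `Snapshot` type and `snapshot_as_of` function are hypothetical, and the snapshot log is just an in-memory list.

```python
from typing import NamedTuple, Optional, Sequence


class Snapshot(NamedTuple):
    snapshot_id: int
    timestamp_ms: int  # commit time in epoch milliseconds


def snapshot_as_of(snapshots: Sequence[Snapshot],
                   as_of_ms: int) -> Optional[Snapshot]:
    """Return the latest snapshot committed at or before `as_of_ms`."""
    eligible = [s for s in snapshots if s.timestamp_ms <= as_of_ms]
    return max(eligible, key=lambda s: s.timestamp_ms) if eligible else None


if __name__ == "__main__":
    history = [Snapshot(1, 1000), Snapshot(2, 2000), Snapshot(3, 3000)]
    print(snapshot_as_of(history, 2500))  # the snapshot committed at 2000 ms
```

A query for a time before the first commit has no snapshot to read, which the sketch models by returning `None`.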

D) Dedicated
Answer: B
Explanation: “All-Purpose” VCs provide a balanced mix of resources suitable for interactive notebooks and ad-hoc jobs.
**Question 26. In the CDE CLI, which command lists all active virtual clusters?**
A) cde vc list
B) cde cluster show
C) cde virtual-clusters ls
D) cde vc describe
Answer: A
Explanation: `cde vc list` returns a table of currently running virtual clusters.
**Question 27. How does the CDE service enforce isolation between jobs from different tenants?**
A) By using separate physical machines.
B) By assigning each tenant a distinct Kubernetes namespace.
C) By encrypting all network traffic.
D) By requiring SSL certificates per job.
Answer: B
Explanation: CDE maps each tenant’s virtual cluster to a dedicated K8s namespace, ensuring resource and security isolation.
**Question 28. Which REST API endpoint is used to submit a new Spark job to CDE?**
A) POST /api/v1/jobs/submit
B) GET /api/v1/jobs/status
C) PUT /api/v1/jobs/update
D) DELETE /api/v1/jobs/remove
Answer: A
Explanation: The `POST /api/v1/jobs/submit` endpoint accepts job definition payloads for submission.
**Question 29. When configuring a CDE job, which artifact type allows you to include third-party Python libraries?**
A) JAR file
B) Wheel file
C) ZIP archive of virtualenv
D) Docker image
Answer: C
Explanation: CDE accepts a zipped Python virtual environment, which is unpacked on the executor nodes.
**Question 30. In CDE, what is the purpose of workload credentials?**
A) To authenticate the user to the Cloudera UI.
B) To provide temporary tokens for accessing external services from a job.
C) To encrypt Spark shuffle data.
D) To configure driver memory.
Answer: B
Explanation: Workload credentials are short-lived secrets (e.g., IAM tokens) injected into job containers for secure external access.
**Question 31. Which Spark configuration disables the default “dynamic allocation” of executors?**

**Question 34. Which Airflow sensor is most appropriate for waiting until a file appears in HDFS?**
A) HttpSensor
B) HdfsSensor
C) S3KeySensor
D) TimeSensor
Answer: B
Explanation: `HdfsSensor` polls HDFS for the existence of a specified path.
**Question 35. When using Spark Structured Streaming with a Hive table as sink, which mode guarantees exactly-once semantics?**
A) Append
B) Update
C) Complete
D) None; streaming to Hive cannot be exactly-once.
Answer: A
Explanation: In Append mode, each micro-batch writes new rows atomically to the Hive table, preserving exactly-once guarantees when the source is also exactly-once.
**Question 36. Which Iceberg table property controls the maximum number of data files per manifest?**
A) write.target-file-size-bytes
B) manifest.max-rows-per-file
C) manifest.max-file-size-bytes
D) commit.target-file-size-bytes
Answer: C
Explanation: `manifest.max-file-size-bytes` limits the size of each manifest file, influencing how many data files are listed per manifest.
**Question 37. In Spark, what does the `coalesce(numPartitions)` transformation do?**
A) Increases the number of partitions without shuffle.
B) Decreases the number of partitions without shuffle.
C) Repartitions using a full shuffle.
D) Persists data to disk.
Answer: B
Explanation: `coalesce` reduces partitions by collapsing them, avoiding a full shuffle (unless `shuffle=true` is specified).
**Question 38. Which Spark UI tab provides information about stage-level task durations and shuffle read/write metrics?**
A) Jobs
B) Stages
C) Executors
D) SQL
Answer: B
Explanation: The Stages tab breaks down each stage, showing task times, shuffle read/write, and spill information.
**Question 39. In Airflow, what does setting `depends_on_past=True` for a task enforce?**
A) The task must wait for all upstream tasks.
B) The task will only run if the previous run of the same task succeeded.

B) It replicates a small DataFrame to all executors to avoid a shuffle join.
C) It caches the DataFrame in off-heap memory.
D) It converts a DataFrame to an RDD.
Answer: B
Explanation: `broadcast` marks a DataFrame as small enough to be sent to all executors, enabling a broadcast hash join.
**Question 43. Which Spark configuration can be tuned to reduce the size of shuffle files on disk?**
A) spark.shuffle.file.buffer
B) spark.shuffle.compress
C) spark.shuffle.spill.compress
D) spark.shuffle.io.maxRetries
Answer: C
Explanation: `spark.shuffle.spill.compress` enables compression of spilled shuffle data, decreasing disk usage.
**Question 44. In Airflow, which parameter controls the time interval between successive retries of a failed task?**
A) retry_delay
B) retry_exponential_backoff
C) retry_timeout
D) retry_interval
Answer: A
Explanation: `retry_delay` is a `datetime.timedelta` specifying how long to wait before each retry.
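The effect of `retry_delay` in Question 44 can be sketched as a timeline of retry attempts. This is a simplified model with a hypothetical `retry_schedule` helper: every retry is assumed to fail immediately, and the exponential variant simply doubles the delay each time, which is not Airflow's exact backoff formula (that one also applies jitter and an optional cap).

```python
from datetime import datetime, timedelta


def retry_schedule(first_failure, retry_delay, retries, exponential=False):
    """Return the earliest start time of each retry attempt."""
    times, moment = [], first_failure
    for attempt in range(retries):
        # Fixed delay by default; doubled per attempt in the simplified
        # exponential mode (an assumption, not Airflow's formula).
        delay = retry_delay * (2 ** attempt) if exponential else retry_delay
        moment = moment + delay
        times.append(moment)
    return times


if __name__ == "__main__":
    failed_at = datetime(2025, 1, 1, 0, 0)
    for when in retry_schedule(failed_at, timedelta(minutes=5), retries=3):
        print(when.time())
```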

**Question 45. Which Iceberg feature allows you to change the data type of an existing column without rewriting data files?**
A) Column renaming
B) Type promotion (e.g., INT to BIGINT)
C) Partition evolution
D) Snapshot expiration
Answer: B
Explanation: Iceberg supports safe type promotion (e.g., widening numeric types) by updating the schema metadata only.
**Question 46. When deploying Spark on Kubernetes, which volume type is recommended for storing driver logs that need to survive pod restarts?**
A) emptyDir
B) hostPath
C) PersistentVolumeClaim (PVC)
D) configMap
Answer: C
Explanation: A PVC provides durable storage that persists beyond the lifecycle of a pod, making it suitable for log retention.
**Question 47. Which Spark SQL function can be used to retrieve the current Iceberg snapshot ID of a table?**
A) iceberg_snapshot_id()
B) current_snapshot()
C) snapshot_id()
D) iceberg_current_snapshot()
Answer: B
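For the type promotion in Question 45, a small validity check may help. The sketch assumes the widening rules of Iceberg format versions 1 and 2 (`int` to `long`, `float` to `double`, and a `decimal` whose precision grows while its scale stays fixed). The function name is hypothetical, and the type names are Iceberg's, so SQL `INT` and `BIGINT` correspond to `int` and `long`.

```python
import re

_DECIMAL = re.compile(r"decimal\((\d+),\s*(\d+)\)")


def is_safe_promotion(old: str, new: str) -> bool:
    """True if changing a column from `old` to `new` only widens the type."""
    if (old, new) in {("int", "long"), ("float", "double")}:
        return True
    old_dec, new_dec = _DECIMAL.fullmatch(old), _DECIMAL.fullmatch(new)
    if old_dec and new_dec:
        old_precision, old_scale = map(int, old_dec.groups())
        new_precision, new_scale = map(int, new_dec.groups())
        # Precision may grow, but the scale has to stay the same.
        return new_scale == old_scale and new_precision > old_precision
    return False


if __name__ == "__main__":
    print(is_safe_promotion("int", "long"))  # widening
    print(is_safe_promotion("long", "int"))  # narrowing
```

Because each allowed change only widens the value range, existing data files remain readable and only the schema metadata needs to change.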

D) @workflow
Answer: A
Explanation: The `@dag` decorator (available in Airflow 2.x) turns a Python function into a DAG definition.
**Question 51. Which of the following Spark actions triggers a job execution?**
A) map()
B) filter()
C) count()
D) select()
Answer: C
Explanation: `count()` is an action that forces Spark to compute the DataFrame and return a result, launching a job.
**Question 52. In Iceberg, what does the `expire-snapshots` command accomplish?**
A) Deletes data files older than a retention period.
B) Removes metadata for snapshots older than a specified timestamp.
C) Compacts small files into larger ones.
D) Renames partitions.
Answer: B
Explanation: `expire-snapshots` cleans up old snapshot metadata, optionally deleting unreferenced data files.
**Question 53. Which Spark setting controls the maximum size of a single task’s result that can be sent back to the driver?**
A) spark.driver.maxResultSize
B) spark.executor.resultSize
C) spark.task.resultSize
D) spark.sql.maxResultSize
Answer: A
Explanation: `spark.driver.maxResultSize` limits the total size of results returned to the driver to avoid OOM.
**Question 54. In Airflow, which hook is used to interact with a Hive Metastore?**
A) HiveHook
B) PrestoHook
C) SparkHook
D) MetastoreHook
Answer: A
Explanation: `HiveHook` provides methods to run HiveQL and interact with the metastore.
**Question 55. When using Spark Structured Streaming with a checkpoint directory on HDFS, what is the purpose of the checkpoint?**
A) To store intermediate shuffle files.
B) To persist the streaming query’s state for fault tolerance.
C) To cache the source data.
D) To log driver events.
Answer: B
Explanation: Checkpoints capture offsets and state so that a streaming job can recover after a failure.
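The recovery role of a checkpoint in Question 55 can be shown with a toy model. This is not Structured Streaming itself: a local JSON file stands in for the checkpoint directory, only a single offset is tracked (real checkpoints also hold operator state), and all function names are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, offset):
    """Atomically record the next offset to process."""
    tmp = path + ".tmp"
    with open(tmp, "w") as handle:
        json.dump({"offset": offset}, handle)
    os.replace(tmp, path)  # rename so a crash never leaves a partial file


def load_checkpoint(path):
    """Return the stored offset, or 0 when no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0
    with open(path) as handle:
        return json.load(handle)["offset"]


def process(records, path, fail_at=None):
    """Process records from the checkpointed offset; optionally crash."""
    done = []
    for index in range(load_checkpoint(path), len(records)):
        if index == fail_at:
            raise RuntimeError("simulated failure")
        done.append(records[index])
        save_checkpoint(path, index + 1)
    return done


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
    data = ["a", "b", "c", "d"]
    try:
        process(data, ckpt, fail_at=2)
    except RuntimeError:
        pass
    print(process(data, ckpt))  # resumes with the records after the crash
```

After the simulated failure, the second call resumes at the saved offset instead of reprocessing the first two records, which is the fault-tolerance property the explanation describes.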