[CDP3002] CDP 3002 CDP Data Engineer Certification Exam Guide, Exams of Technology

Technology

This exam guide prepares candidates for data engineering roles within CDP ecosystems. Coverage includes data architecture, ETL processes, performance optimization, scalability, and governance.

Typology: Exams

2025/2026

Available from 02/10/2026

shilpi-jain-3 🇮🇳

[CDP3002] CDP 3002 CDP Data Engineer

Certification Exam Guide

**Question 1. Which component in a Spark‑on‑Kubernetes deployment is responsible for coordinating the execution of tasks across executor pods?**

A) Spark Driver pod

B) Kube‑Scheduler

C) Spark Executor pod

D) Kubernetes API server

Answer: A

Explanation: The Spark driver pod runs the SparkContext, creates tasks, and schedules them to executor pods; it is the central coordinator in a Spark‑on‑K8s cluster.

**Question 2. In Kubernetes, which resource limit setting controls the maximum amount of memory a Spark executor pod can use?**

A) cpuLimit

B) memoryRequest

C) memoryLimit

D) podQuota

Answer: C

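Explanation: `memoryLimit` stands for the container memory limit (written as `resources.limits.memory` in a pod spec). It defines the upper bound of memory the container can consume; exceeding it causes the container to be OOM‑killed.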
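For Spark on Kubernetes, the executor limit is usually not set by hand. As the Spark documentation describes it, the pod's memory is derived from `spark.executor.memory` plus a memory overhead. The sketch below shows that arithmetic only; the function name `executor_pod_memory_mib` is hypothetical, and the 10% factor and 384 MiB floor are assumed defaults for JVM jobs that can differ by Spark version and job type.

```python
# Assumed floor for the overhead (MiB), matching Spark's documented default.
MIN_OVERHEAD_MIB = 384


def executor_pod_memory_mib(executor_memory_mib: int, overhead_factor: float = 0.10) -> int:
    """Estimate the memory (MiB) requested for one executor pod.

    Hypothetical helper: executor memory plus overhead, where the overhead is
    a fraction of executor memory but never less than MIN_OVERHEAD_MIB.
    """
    overhead = max(int(executor_memory_mib * overhead_factor), MIN_OVERHEAD_MIB)
    return executor_memory_mib + overhead


# A 4 GiB executor gets a proportional overhead; a 1 GiB executor hits the floor.
print(executor_pod_memory_mib(4096))
print(executor_pod_memory_mib(1024))
```

Sizing a pod this way explains why an executor configured with 4 GiB can still be OOM‑killed: the limit that Kubernetes enforces includes the overhead, and off-heap usage beyond it is not covered.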

**Question 3. When configuring Spark on Kubernetes, what is the effect of setting `spark.kubernetes.driver.podTemplateFile`?**

A) It defines the Docker image for the driver.

B) It provides a pod‑level YAML template for the driver pod.

C) It configures the number of driver cores.

D) It enables driver‑side logging to a remote store.

Answer: B

Explanation: The `podTemplateFile` option allows users to supply a full pod specification (labels, volumes, etc.) for the driver pod.

**Question 4. Which Spark storage level persists data in memory in a serialized format and also spills to disk when necessary?**

A) MEMORY_ONLY

B) DISK_ONLY

C) MEMORY_AND_DISK_SER

D) OFF_HEAP

Answer: C

Explanation: `MEMORY_AND_DISK_SER` stores data serialized in memory and writes overflow to disk, reducing memory pressure.

**Question 5. In Spark SQL, what does the `EXPLAIN` command display?**

A) Physical execution plan only.

B) Logical plan, optimized logical plan, and physical plan.

C) Data lineage graph.

D) Only the cost‑based optimizer decisions.

Answer: B

Explanation: `EXPLAIN EXTENDED` shows the parsed and analyzed logical plans, the optimized logical plan, and the final physical plan used for execution. Plain `EXPLAIN` without a mode prints only the physical plan, so option B describes the extended form.

**Question 6. Which of the following best describes Spark’s shuffle operation?**

A) Data is repartitioned without network I/O.

B) Data is moved across executors to satisfy grouping or join requirements.

C) Data is cached locally on each executor.

Certification Exam Guide

A) Operator

B) DAG

C) TaskInstance

D) XCom

Answer: B

Explanation: A DAG (Directed Acyclic Graph) specifies tasks and their dependencies.

**Question 10. Which Airflow operator would you use to execute a Spark job on a Kubernetes cluster?**

A) BashOperator

B) SparkSubmitOperator

C) KubernetesPodOperator

D) PythonOperator

Answer: C

Explanation: `KubernetesPodOperator` can launch a pod that runs `spark-submit`, enabling Spark jobs on K8s.

**Question 11. In an Airflow DAG, what does the `schedule_interval='@daily'` argument represent?**

A) Run once a day at midnight UTC.

B) Run every 24 hours after the previous run finishes.

C) Run at the start of each day in the local timezone.

D) Run every hour.

Answer: A

Explanation: `@daily` is a preset cron expression equivalent to `0 0 * * *`, triggering at midnight UTC.
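The preset-to-cron equivalence in Question 11 can be sketched with the standard library alone. The mapping below mirrors the cron aliases that Airflow documents; the helper `next_daily_run` is a hypothetical illustration of "midnight UTC" and is not part of Airflow's API.

```python
from datetime import datetime, timedelta, timezone

# Cron presets and the expressions they stand for.
CRON_PRESETS = {
    "@hourly": "0 * * * *",
    "@daily": "0 0 * * *",
    "@weekly": "0 0 * * 0",
    "@monthly": "0 0 1 * *",
    "@yearly": "0 0 1 1 *",
}


def next_daily_run(after: datetime) -> datetime:
    """Return the first midnight UTC strictly after `after`.

    `after` must be timezone-aware; it is converted to UTC before rounding.
    """
    after_utc = after.astimezone(timezone.utc)
    midnight = after_utc.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight + timedelta(days=1)


now = datetime(2024, 1, 1, 13, 30, tzinfo=timezone.utc)
print(CRON_PRESETS["@daily"], next_daily_run(now).isoformat())
```

Because the input is converted to UTC first, a timestamp given in another timezone still resolves to the next UTC midnight, which is the distinction between options A and C.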

**Question 12. How does the Airflow `catchup` parameter affect DAG execution when scheduled intervals have been missed?**

A) It disables all future runs.

B) It runs only the latest missed schedule.

C) It backfills all missed runs from the start date.

D) It skips missed intervals entirely.

Answer: C

Explanation: When `catchup=True`, Airflow creates a DAG run for every interval that was missed since the DAG’s `start_date`.

**Question 13. Which Airflow feature enables tasks to exchange small pieces of data without using external storage?**

A) Variables

B) Connections

C) XComs

D) Pools

Answer: C

Explanation: XComs (cross‑communication) allow tasks to push and pull small messages or values.

**Question 14. In Airflow, what is the purpose of a Pool?**

A) To group tasks for parallel execution.

B) To limit the number of concurrent task instances for a given resource.

C) To store secret credentials.

D) To schedule DAGs.

Certification Exam Guide

B) Enables automatic schema evolution.

C) Improves query compilation time.

D) Allows storage of nested JSON.

Answer: A

Explanation: Bucketing aligns rows with the same bucket key on the same executor, often avoiding a full shuffle during joins.

**Question 18. Which Spark join strategy is chosen automatically when one side of the join fits in executor memory?**

A) Sort‑merge join

B) Broadcast hash join

C) Shuffle hash join

D) Cartesian join

Answer: B

Explanation: Spark’s optimizer picks a broadcast hash join when the estimated size of the smaller side is below `spark.sql.autoBroadcastJoinThreshold` (10 MB by default), so that side can be broadcast to all executors.

**Question 19. In Spark, what does the “small file problem” refer to?**

A) Excessive number of tiny files causing overhead in the driver.

B) Inability to read files smaller than 128 KB.

C) Memory overflow due to large files.

D) Loss of data during checkpointing.

Answer: A

Explanation: Many small files increase the overhead of file system metadata operations and can degrade read performance.
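To make the small file problem concrete, the sketch below counts how many files a dataset occupies at different file sizes. The helper `file_count` and the 128 MiB target are assumptions chosen for illustration; the source does not prescribe a target size.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def file_count(total_bytes: int, file_bytes: int = 128 * MIB) -> int:
    """Number of files needed to hold `total_bytes` at `file_bytes` per file.

    Hypothetical helper; uses ceiling division so a partial last file counts.
    """
    if total_bytes <= 0:
        return 0
    return -(-total_bytes // file_bytes)


dataset = 10 * GIB
# The same data as 1 MiB files versus files compacted to the assumed target.
print(file_count(dataset, 1 * MIB), file_count(dataset))
```

Each file costs at least one metadata lookup and usually one task when it is read, so the ratio between the two counts is a rough indicator of the overhead that compaction removes.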

**Question 20. Which Spark storage level stores data off‑heap using Tungsten memory management?**

A) MEMORY_ONLY

B) OFF_HEAP

C) DISK_ONLY

D) MEMORY_AND_DISK_SER

Answer: B

Explanation: `OFF_HEAP` stores serialized data in off‑heap memory managed by Tungsten, reducing JVM GC pressure.

**Question 21. Which Iceberg feature allows you to query the state of a table as of a previous snapshot?**

A) Hidden partitioning

B) Time travel

C) Schema evolution

D) Partition pruning

Answer: B

Explanation: Iceberg’s time‑travel capability lets you specify a snapshot ID or timestamp to read historical data.

**Question 22. In Iceberg, what is stored in the “metadata.json” file?**

A) Actual data rows.

B) List of data files and partition information for a snapshot.

C) Hive metastore connection details.

D) Spark executor logs.

Answer: B
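The timestamp form of time travel from Question 21 amounts to picking the most recent snapshot at or before the requested time. A toy sketch follows, assuming a snapshot list shaped like the `snapshot-id` and `timestamp-ms` fields of Iceberg table metadata; the IDs and timestamps are made up, and `snapshot_as_of` is a hypothetical helper, not an Iceberg API.

```python
from typing import Dict, List, Optional

# Made-up snapshot entries standing in for the list kept in table metadata.
SNAPSHOTS: List[Dict[str, int]] = [
    {"snapshot-id": 101, "timestamp-ms": 1_700_000_000_000},
    {"snapshot-id": 102, "timestamp-ms": 1_700_000_600_000},
    {"snapshot-id": 103, "timestamp-ms": 1_700_001_200_000},
]


def snapshot_as_of(snapshots: List[Dict[str, int]], as_of_ms: int) -> Optional[int]:
    """Return the ID of the latest snapshot committed at or before `as_of_ms`."""
    eligible = [s for s in snapshots if s["timestamp-ms"] <= as_of_ms]
    if not eligible:
        return None  # the table did not exist yet at that time
    return max(eligible, key=lambda s: s["timestamp-ms"])["snapshot-id"]


print(snapshot_as_of(SNAPSHOTS, 1_700_000_900_000))
```

A real reader would then follow the chosen snapshot to its manifests to find the data files; this sketch stops at snapshot selection.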

Certification Exam Guide

D) Dedicated

Answer: B

Explanation: “All‑Purpose” VCs provide a balanced mix of resources suitable for interactive notebooks and ad‑hoc jobs.

**Question 26. In the CDE CLI, which command lists all active virtual clusters?**

A) cde vc list

B) cde cluster show

C) cde virtual-clusters ls

D) cde vc describe

Answer: A

Explanation: `cde vc list` returns a table of currently running virtual clusters.

**Question 27. How does the CDE service enforce isolation between jobs from different tenants?**

A) By using separate physical machines.

B) By assigning each tenant a distinct Kubernetes namespace.

C) By encrypting all network traffic.

D) By requiring SSL certificates per job.

Answer: B

Explanation: CDE maps each tenant’s virtual cluster to a dedicated K8s namespace, ensuring resource and security isolation.

**Question 28. Which REST API endpoint is used to submit a new Spark job to CDE?**

A) POST /api/v1/jobs/submit

B) GET /api/v1/jobs/status

C) PUT /api/v1/jobs/update

D) DELETE /api/v1/jobs/remove

Answer: A

Explanation: The `POST /api/v1/jobs/submit` endpoint accepts job definition payloads for submission.

**Question 29. When configuring a CDE job, which artifact type allows you to include third‑party Python libraries?**

A) JAR file

B) Wheel file

C) ZIP archive of virtualenv

D) Docker image

Answer: C

Explanation: CDE accepts a zipped Python virtual environment, which is unpacked on the executor nodes.

**Question 30. In CDE, what is the purpose of workload credentials?**

A) To authenticate the user to the Cloudera UI.

B) To provide temporary tokens for accessing external services from a job.

C) To encrypt Spark shuffle data.

D) To configure driver memory.

Answer: B

Explanation: Workload credentials are short‑lived secrets (e.g., IAM tokens) injected into job containers for secure external access.

**Question 31. Which Spark configuration disables the default “dynamic allocation” of executors?**

Certification Exam Guide

**Question 34. Which Airflow sensor is most appropriate for waiting until a file appears in HDFS?**

A) HttpSensor

B) HdfsSensor

C) S3KeySensor

D) TimeSensor

Answer: B

Explanation: `HdfsSensor` polls HDFS for the existence of a specified path.

**Question 35. When using Spark Structured Streaming with a Hive table as sink, which mode guarantees exactly‑once semantics?**

A) Append

B) Update

C) Complete

D) None; streaming to Hive cannot be exactly‑once.

Answer: A

Explanation: In Append mode, each micro‑batch writes new rows atomically to the Hive table, preserving exactly‑once guarantees when the source is also exactly‑once.

**Question 36. Which Iceberg table property controls the maximum number of data files per manifest?**

A) write.target-file-size-bytes

B) manifest.max-rows-per-file

C) manifest.max-file-size-bytes

D) commit.target-file-size-bytes

Answer: C

Explanation: `manifest.max-file-size-bytes` limits the size of each manifest file, influencing how many data files are listed per manifest.

**Question 37. In Spark, what does the `coalesce(numPartitions)` transformation do?**

A) Increases the number of partitions without shuffle.

B) Decreases the number of partitions without shuffle.

C) Repartitions using a full shuffle.

D) Persists data to disk.

Answer: B

Explanation: `coalesce` reduces partitions by collapsing them, avoiding a full shuffle (unless `shuffle=true` is specified).

**Question 38. Which Spark UI tab provides information about stage‑level task durations and shuffle read/write metrics?**

A) Jobs

B) Stages

C) Executors

D) SQL

Answer: B

Explanation: The Stages tab breaks down each stage, showing task times, shuffle read/write, and spill information.

**Question 39. In Airflow, what does setting `depends_on_past=True` for a task enforce?**

A) The task must wait for all upstream tasks.

B) The task will only run if the previous run of the same task succeeded.

Certification Exam Guide

B) It replicates a small DataFrame to all executors to avoid a shuffle join.

C) It caches the DataFrame in off‑heap memory.

D) It converts a DataFrame to an RDD.

Answer: B

Explanation: `broadcast` marks a DataFrame as small enough to be sent to all executors, enabling a broadcast hash join.

**Question 43. Which Spark configuration can be tuned to reduce the size of shuffle files on disk?**

A) spark.shuffle.file.buffer

B) spark.shuffle.compress

C) spark.shuffle.spill.compress

D) spark.shuffle.io.maxRetries

Answer: C

Explanation: `spark.shuffle.spill.compress` enables compression of data spilled during shuffles, decreasing disk usage. The related `spark.shuffle.compress` setting (option B) compresses the map output files themselves, so both settings affect how much shuffle data lands on disk.

**Question 44. In Airflow, which parameter controls the time interval between successive retries of a failed task?**

A) retry_delay

B) retry_exponential_backoff

C) retry_timeout

D) retry_interval

Answer: A

Explanation: `retry_delay` is a `datetime.timedelta` specifying how long to wait before each retry.
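Since `retry_delay` is a plain `datetime.timedelta`, its effect can be sketched without Airflow installed. The `default_args` dictionary below has the shape commonly passed to Airflow tasks, with illustrative values; `earliest_retry_times` is a hypothetical helper that assumes every attempt fails immediately and that no exponential backoff is applied.

```python
from datetime import datetime, timedelta
from typing import List

# Illustrative task arguments: up to 3 retries, 5 minutes apart.
default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}


def earliest_retry_times(first_failure: datetime, retries: int,
                         retry_delay: timedelta) -> List[datetime]:
    """Earliest start time of each retry under a fixed delay.

    Simplification: each attempt is assumed to fail as soon as it starts,
    so retry i begins (i + 1) delays after the first failure.
    """
    return [first_failure + retry_delay * (i + 1) for i in range(retries)]


failed_at = datetime(2024, 1, 1, 12, 0)
for when in earliest_retry_times(failed_at, default_args["retries"],
                                 default_args["retry_delay"]):
    print(when.isoformat())
```

In practice each retry's clock starts when the previous attempt fails, so long-running attempts push the later retries out further than this fixed schedule suggests.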

**Question 45. Which Iceberg feature allows you to change the data type of an existing column without rewriting data files?**

A) Column renaming

B) Type promotion (e.g., INT to BIGINT)

C) Partition evolution

D) Snapshot expiration

Answer: B

Explanation: Iceberg supports safe type promotion (e.g., widening numeric types) by updating the schema metadata only.

**Question 46. When deploying Spark on Kubernetes, which volume type is recommended for storing driver logs that need to survive pod restarts?**

A) emptyDir

B) hostPath

C) PersistentVolumeClaim (PVC)

D) configMap

Answer: C

Explanation: A PVC provides durable storage that persists beyond the lifecycle of a pod, making it suitable for log retention.

**Question 47. Which Spark SQL function can be used to retrieve the current Iceberg snapshot ID of a table?**

A) iceberg_snapshot_id()

B) current_snapshot()

C) snapshot_id()

D) iceberg_current_snapshot()

Answer: B

Certification Exam Guide

D) @workflow

Answer: A

Explanation: The `@dag` decorator (available in Airflow 2.x) turns a Python function into a DAG definition.

**Question 51. Which of the following Spark actions triggers a job execution?**

A) map()

B) filter()

C) count()

D) select()

Answer: C

Explanation: `count()` is an action that forces Spark to compute the DataFrame and return a result, launching a job.

**Question 52. In Iceberg, what does the `expire-snapshots` command accomplish?**

A) Deletes data files older than a retention period.

B) Removes metadata for snapshots older than a specified timestamp.

C) Compacts small files into larger ones.

D) Renames partitions.

Answer: B

Explanation: `expire-snapshots` cleans up old snapshot metadata, optionally deleting unreferenced data files.

**Question 53. Which Spark setting limits the total size of serialized results that an action can send back to the driver?**

A) spark.driver.maxResultSize

B) spark.executor.resultSize

C) spark.task.resultSize

D) spark.sql.maxResultSize

Answer: A

Explanation: `spark.driver.maxResultSize` limits the total size of results returned to the driver to avoid OOM.

**Question 54. In Airflow, which hook is used to interact with a Hive Metastore?**

A) HiveHook

B) PrestoHook

C) SparkHook

D) MetastoreHook

Answer: A

Explanation: `HiveHook` provides methods to run HiveQL and interact with the metastore.

**Question 55. When using Spark Structured Streaming with a checkpoint directory on HDFS, what is the purpose of the checkpoint?**

A) To store intermediate shuffle files.

B) To persist the streaming query’s state for fault tolerance.

C) To cache the source data.

D) To log driver events.

Answer: B

Explanation: Checkpoints capture offsets and state so that a streaming job can recover after a failure.
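The recovery behavior described in Question 55 can be illustrated with a toy model: commit the next offset after each batch, and read it back on restart. This is a sketch of the idea only. The JSON layout, the function `run_batches`, and the simulated failure are all hypothetical and do not reflect Spark's actual checkpoint format.

```python
import json
import os
import tempfile


def run_batches(records, sink, checkpoint_path, batch_size, fail_after_batches=None):
    """Append records to `sink` in batches, committing the next offset each time.

    On start, resumes from the offset stored in `checkpoint_path` if present.
    `fail_after_batches` simulates a crash after that many committed batches.
    """
    offset = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            offset = json.load(f)["next_offset"]
    batches_done = 0
    while offset < len(records):
        if fail_after_batches is not None and batches_done >= fail_after_batches:
            raise RuntimeError("simulated failure")
        batch = records[offset:offset + batch_size]
        sink.extend(batch)
        offset += len(batch)
        with open(checkpoint_path, "w") as f:
            json.dump({"next_offset": offset}, f)  # commit progress
        batches_done += 1


records = list(range(10))
sink = []
with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, "offsets.json")
    try:
        run_batches(records, sink, ckpt, batch_size=4, fail_after_batches=1)
    except RuntimeError:
        pass  # the job crashed after committing its first batch
    run_batches(records, sink, ckpt, batch_size=4)  # restart resumes from the checkpoint

print(sink)
```

The restarted run picks up at the committed offset instead of reprocessing from the beginning. In this toy version the sink write and the offset commit are not atomic, which is the gap that real exactly-once sinks have to close.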
