PrepIQ CDP3002 CDP Data Engineer Ultimate Exam, Exams of Technology

Technology

Focused review of data engineering, ETL pipelines, distributed processing, data quality, governance, performance tuning, and cloud data workflows.

Typology: Exams

2025/2026

Available from 06/12/2026

shilpi-jain-2 🇮🇳


PrepIQ CDP3002 CDP Data Engineer Ultimate Exam

**Question 1.** Which component of the Spark execution engine is primarily responsible for converting a logical plan into an optimized physical plan?

A) Catalyst Optimizer
B) Tungsten Engine
C) DAG Scheduler
D) Task Scheduler

Answer: A

Explanation: The Catalyst Optimizer performs rule-based and cost-based optimizations on the logical query plan before generating the physical plan that Tungsten will execute.

**Question 2.** In a CDE virtual cluster, Spark applications are run inside containers orchestrated by which platform?

A) Docker Swarm
B) Kubernetes
C) Mesos
D) YARN

Answer: B

Explanation: CDE leverages Kubernetes to launch Spark driver and executor pods, providing isolation and resource management.

**Question 3.** Which Spark DataFrame API call reads a JSON file into a DataFrame while automatically inferring the schema?

A) spark.read.text()
B) spark.read.format("json").load()
C) spark.read.parquet()
D) spark.read.csv()

Answer: B

Explanation: The `format("json")` reader parses JSON and infers column types unless a schema is explicitly supplied.
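To make "inferring the schema" in Question 3 concrete, here is a toy sketch in plain Python. It is not Spark's implementation: the helper `infer_schema` and the sample records are hypothetical. It only shows the general idea that inference needs a pass over the data to collect the types seen for each field, which is why supplying an explicit schema avoids that extra scan.

```python
import json


def infer_schema(json_lines):
    """Map each top-level field name to the Python type names observed for it."""
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    # Sort the type names so the result is deterministic.
    return {field: sorted(types) for field, types in schema.items()}


# Hypothetical newline-delimited JSON input; the second record adds a field.
lines = [
    '{"id": 1, "name": "a"}',
    '{"id": 2, "name": "b", "score": 9.5}',
]
schema = infer_schema(lines)
print(schema)
```

A field that appears in only some records (here `score`) still ends up in the inferred schema, which mirrors why an inferring reader has to look at more than the first record.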


**Question 4.** When working with nested columns, which Spark function extracts a field from a struct column?

A) explode()
B) getItem()
C) col()
D) withField()

Answer: B

Explanation: getItem("fieldName") or the dot notation (col("struct.field")) extracts a nested field from a struct. getField("fieldName") is the Column method documented specifically for structs and resolves the same way.

**Question 5.** In Spark Structured Streaming, which output mode is required when performing aggregations that produce new rows for each micro-batch?

A) Append
B) Update
C) Complete
D) Incremental

Answer: C

Explanation: Complete mode writes the full result table of an aggregation on every trigger and is supported only for aggregation queries. Update mode also works with aggregations but emits only the rows that changed since the last trigger.

**Question 6.** Which Hive Warehouse Connector (HWC) method writes a Spark DataFrame to an ACID-managed Hive table?

A) df.write.format("hive").saveAsTable()
B) df.writeTo("hive_table").append()
C) df.write.format("iceberg").save()
D) df.writeToTable("hive_table")

Answer: B

Explanation: writeTo with HWC enables ACID semantics by leveraging Hive’s transaction manager.
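The difference between the Complete and Update output modes in Question 5 can be sketched without Spark. The simulation below is a hedged illustration in plain Python: `run_micro_batches` and the sample batches are hypothetical, and it models only a running count per key, not Spark's state store or watermarking.

```python
from collections import Counter


def run_micro_batches(batches):
    """Simulate a running count and what each output mode would emit per trigger."""
    state = Counter()
    complete_outputs = []  # full result table after every trigger
    update_outputs = []    # only rows whose value changed in that trigger
    for batch in batches:
        changed = set()
        for key in batch:
            state[key] += 1
            changed.add(key)
        complete_outputs.append(dict(state))
        update_outputs.append({key: state[key] for key in changed})
    return complete_outputs, update_outputs


complete, update = run_micro_batches([["a", "b", "a"], ["b", "c"]])
print(complete[1])  # every key seen so far
print(update[1])    # only the keys touched by the second batch
```

In the second trigger the complete output still contains `a` even though no new `a` arrived, while the update output omits it. That is the trade-off: complete mode rewrites everything, update mode writes less but needs a sink that can handle upserts.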

**Question 10.** Which Airflow operator is specifically designed to submit a Spark job to a CDE virtual cluster?

A) BashOperator
B) PythonOperator
C) SparkSubmitOperator
D) CdeSparkSubmitOperator

Answer: D

Explanation: The CdeSparkSubmitOperator is a custom operator provided by Cloudera to interact with CDE.

**Question 11.** When implementing Change Data Capture (CDC) in an Airflow DAG, which pattern is most common for incremental extraction?

A) Full table reload each run
B) Timestamp-based WHERE clause
C) Random sampling of rows
D) Using a fixed row limit

Answer: B

Explanation: CDC typically uses a high-watermark timestamp column to fetch only rows changed since the last run.

**Question 12.** In Airflow, what mechanism allows tasks to share small pieces of data without using external storage?

A) Variables
B) Connections
C) XComs
D) Pools

Answer: C

Explanation: XComs (cross-communication) let tasks push and pull small payloads between each other.
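The high-watermark pattern from Question 11 can be sketched with the standard-library `sqlite3` module. This is a minimal illustration under stated assumptions: the `orders` table, its `updated_at` column, and the helper `extract_incremental` are hypothetical, and a real DAG would persist the watermark between runs (for example in an XCom or a metadata table) rather than in a local variable.

```python
import sqlite3


def extract_incremental(conn, last_watermark):
    """Fetch only rows changed after last_watermark and return the new watermark."""
    rows = conn.execute(
        "SELECT id, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    # Advance the watermark only if something new was read.
    new_watermark = rows[-1][1] if rows else last_watermark
    return rows, new_watermark


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "2024-01-01T00:00:00"), (2, "2024-01-02T00:00:00")],
)

# First run: start from an epoch watermark, so every row is picked up.
first_batch, watermark = extract_incremental(conn, "1970-01-01T00:00:00")

# A new row arrives; the second run reads only that row.
conn.execute("INSERT INTO orders VALUES (3, '2024-01-03T00:00:00')")
second_batch, watermark = extract_incremental(conn, watermark)
print(second_batch, watermark)
```

ISO 8601 timestamps stored as text compare correctly as strings, which is what makes the simple `>` filter work here. A strict `>` can miss rows that share the exact watermark timestamp, so some pipelines use `>=` plus deduplication instead.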

**Question 13.** Which Spark join strategy is automatically chosen when one side of the join is smaller than the broadcast threshold?

A) Sort-Merge Join
B) Broadcast Hash Join
C) Shuffle Hash Join
D) Cartesian Join

Answer: B

Explanation: Spark will broadcast the smaller dataset to all executors, enabling a fast hash join.

**Question 14.** How can data skew be mitigated when joining on a high-cardinality key?

A) Increase shuffle partitions only
B) Use broadcast join on both sides
C) Apply salting to the skewed key
D) Disable catalyst optimizer

Answer: C

Explanation: Adding a random “salt” to the skewed key distributes records more evenly across partitions.

**Question 15.** Which file system issue is addressed by compaction in a data lake?

A) Too many small files causing NameNode overload
B) Corrupted parquet schema
C) Insufficient replication factor
D) Missing ACLs

Answer: A

Explanation: Compaction merges many tiny files into larger ones, reducing metadata overhead and improving read performance.
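The effect of salting described in Question 14 can be simulated in plain Python. The sketch below is an illustration only, not Spark's partitioner: the partition count, the number of salt buckets, the key names, and the use of `zlib.crc32` as a stand-in hash are all assumptions chosen to make the run deterministic.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 8  # assumed number of shuffle partitions
SALT_BUCKETS = 8    # assumed number of distinct salt values


def partition_for(key):
    """Stand-in for hash partitioning: stable hash of the key modulo partitions."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def partition_loads(keys):
    """Count how many records land in each partition."""
    loads = Counter(partition_for(k) for k in keys)
    return [loads.get(p, 0) for p in range(NUM_PARTITIONS)]


rng = random.Random(42)

# Skewed input: one hot key accounts for 90% of the records.
keys = ["hot_customer"] * 9000 + [f"customer_{i}" for i in range(1000)]

# Salting appends a random suffix so the hot key becomes several distinct keys.
salted_keys = [f"{k}_{rng.randrange(SALT_BUCKETS)}" for k in keys]

unsalted = partition_loads(keys)
salted = partition_loads(salted_keys)
print("max load without salt:", max(unsalted))
print("max load with salt:   ", max(salted))
```

Without the salt, all 9000 hot records hash to one partition. With it, they spread over up to `SALT_BUCKETS` keys. In an actual join the other side must be expanded with every salt value so that matching rows still meet, which is the cost of the technique.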

**Question 19.** In Iceberg, what is the purpose of a “metadata file”?

A) Store the actual data rows
B) Track the list of data files and their manifests for each snapshot
C) Hold user-defined functions
D) Encrypt the data files

Answer: B

Explanation: The table metadata file records the schema, partition spec, and snapshots. Each snapshot references a manifest list whose manifests point to data files, enabling fast table scans and snapshot management.

**Question 20.** Which Iceberg property controls how many files are written per partition during a write operation?

A) write.target-file-size-bytes
B) iceberg.max-file-size
C) spark.sql.files.maxPartitionBytes
D) iceberg.target-file-size-bytes

Answer: A

Explanation: write.target-file-size-bytes is the Iceberg table property that defines the desired file size for writes, influencing the number of output files per partition.

**Question 21.** Which CDE CLI command lists all virtual clusters in a CDP environment?

A) cde clusters list
B) cde virtual-clusters show
C) cde cluster list
D) cde vc describe

Answer: A

Explanation: cde clusters list returns a table of existing virtual clusters.

**Question 22.** To limit the memory used by a Spark executor, which configuration should be set?

A) spark.executor.cores
B) spark.executor.memoryOverhead
C) spark.driver.memory
D) spark.sql.shuffle.partitions

Answer: B

Explanation: spark.executor.memoryOverhead reserves additional non-heap memory per executor for JVM overhead, helping prevent OOM errors. The executor heap itself is sized by spark.executor.memory.

**Question 23.** Which Airflow feature allows you to define a maximum number of concurrent runs for a DAG?

A) max_active_runs
B) concurrency
C) pool_slots
D) dagrun_timeout

Answer: A

Explanation: max_active_runs limits how many DAG runs can be active simultaneously.

**Question 24.** In Spark, which transformation is lazy and does not trigger execution until an action is called?

A) map()
B) filter()
C) reduceByKey()
D) all of the above

Answer: D

Explanation: All listed transformations are lazy; actions such as collect() or write trigger the computation.

**Question 25.** Which Spark UI tab provides details about the amount of data persisted in memory and on disk?

A) Partition pruning
B) Snapshot IDs
C) Data compaction
D) Bucketing

Answer: B

Explanation: Snapshots capture the table state; querying a specific snapshot ID provides a historical view.

**Question 29.** In Airflow, which component stores connection credentials securely?

A) Variables
B) Secrets Backend
C) XComs
D) DAG Parameters

Answer: B

Explanation: The Secrets Backend integrates with external vaults (e.g., HashiCorp Vault) to retrieve connection info.

**Question 30.** Which Spark configuration reduces the size of shuffle files by using columnar compression?

A) spark.sql.parquet.compression.codec
B) spark.shuffle.compress
C) spark.sql.inMemoryColumnarStorage.compressed
D) spark.sql.execution.arrow.enabled

Answer: B

Explanation: spark.shuffle.compress enables compression of shuffle data, decreasing network I/O.

**Question 31.** Which of the following best describes a “bucket” in Iceberg?

A) A physical file on disk
B) A logical grouping of rows based on hash of a column
C) A partition directory
D) An Iceberg metadata snapshot

Answer: B

Explanation: Bucketing hashes a column’s value to distribute rows evenly across a fixed number of buckets.

**Question 32.** In CDE, which resource is used to enforce CPU limits for a Spark executor pod?

A) executorMemory
B) executorCores
C) cpuRequest
D) cpuLimit

Answer: D

Explanation: cpuLimit defines the maximum CPU cores a pod can consume, enforced by Kubernetes.

**Question 33.** Which Spark function converts a DataFrame column containing JSON strings into a struct column?

A) from_json()
B) to_json()
C) json_tuple()
D) explode_json()

Answer: A

Explanation: from_json(col, schema) parses JSON strings into a struct according to the provided schema.

**Question 34.** In Airflow, what does setting retries=3 on a task accomplish?

A) The task will run three times in parallel
B) The task will be attempted up to three additional times after failure

C) snapshot.max-count
D) iceberg.snapshot.retention.max

Answer: D

Explanation: iceberg.snapshot.retention.max limits the number of snapshots kept before automatic expiration.

**Question 38.** In Spark Structured Streaming, which trigger processes data as soon as it arrives?

A) Trigger.Once()
B) Trigger.ProcessingTime("5 minutes")
C) Trigger.Continuous("1 second")
D) Trigger.AvailableNow()

Answer: C

Explanation: Trigger.Continuous runs the query continuously with the specified checkpoint interval.

**Question 39.** Which Airflow hook would you use to interact with a Hive Metastore for metadata queries?

A) MySqlHook
B) HiveCliHook
C) HiveMetastoreHook
D) PrestoHook

Answer: C

Explanation: HiveMetastoreHook provides a client to call Hive Metastore APIs.

**Question 40.** Which Spark configuration property enables whole-stage code generation for Java and Scala?

A) spark.sql.codegen.wholeStage
B) spark.sql.codegen.factoryMode
C) spark.sql.codegen.maxFields
D) spark.sql.codegen.enabled

Answer: A

Explanation: spark.sql.codegen.wholeStage toggles the whole-stage code generation optimization, which is enabled by default.

**Question 41.** In Iceberg, what does the term “manifest list” refer to?

A) List of all data files in the table
B) List of manifest files that together describe a snapshot
C) List of columns in the schema
D) List of partition values

Answer: B

Explanation: A manifest list aggregates multiple manifest files, each of which points to data files for a snapshot.

**Question 42.** Which Spark DataFrame API call removes duplicate rows based on all columns?

A) distinct()
B) dropDuplicates()
C) unique()
D) both A and B

Answer: D

Explanation: Both distinct() and dropDuplicates() without arguments achieve the same result.

**Question 43.** In Airflow, what is the purpose of a “pool”?

A) To store temporary files for tasks
B) To limit the number of concurrent tasks for a resource
C) To group DAGs by owner
D) To manage secret variables

Answer: B
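The idea behind a pool in Question 43, a fixed number of slots that caps how many tasks touch a resource at once, can be sketched with a semaphore. This is an analogy in plain Python rather than Airflow code: `POOL_SLOTS`, `run_task`, and the six dummy tasks are hypothetical.

```python
import threading
import time

POOL_SLOTS = 2  # hypothetical pool size

pool = threading.Semaphore(POOL_SLOTS)
lock = threading.Lock()
running = 0
max_running = 0
completed = []


def run_task(task_id):
    """Hold one pool slot while 'working', tracking the peak concurrency."""
    global running, max_running
    with pool:  # blocks until a slot is free
        with lock:
            running += 1
            max_running = max(max_running, running)
        time.sleep(0.01)  # stand-in for real work against the limited resource
        with lock:
            running -= 1
            completed.append(task_id)


threads = [threading.Thread(target=run_task, args=(i,)) for i in range(6)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("peak concurrency:", max_running)
```

All six tasks finish, but never more than `POOL_SLOTS` run at the same time. Queued tasks simply wait for a slot, much as a task assigned to a full pool waits to be scheduled.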

**Question 47.** In CDE, what is the effect of setting spark.sql.shuffle.partitions to a value higher than the number of executors?

A) Improves performance by increasing parallelism
B) Causes unnecessary small tasks and overhead
C) Triggers a runtime error
D) No effect; Spark ignores the setting

Answer: B

Explanation: A partition count far beyond what the data volume and available cores warrant creates many tiny tasks, increasing scheduling overhead without performance gains. A count moderately above the total number of cores is common practice and is not harmful in itself.

**Question 48.** Which Airflow operator would you use to execute a Python callable that returns a Pandas DataFrame?

A) PythonOperator
B) PandasOperator
C) DataFrameOperator
D) SparkSubmitOperator

Answer: A

Explanation: PythonOperator can run any Python function, including those returning Pandas DataFrames.

**Question 49.** Which Spark feature enables columnar in-memory representation for faster CPU utilization?

A) Tungsten
B) Catalyst
C) GraphX
D) MLlib

Answer: A

Explanation: Tungsten provides off-heap memory management and bytecode generation for columnar processing.

**Question 50.** When using Hive Warehouse Connector, which Spark configuration must be set to enable ACID reads?

A) spark.sql.hive.convertMetastoreParquet
B) spark.sql.hive.hwc.enabled
C) spark.sql.hive.metastore.version
D) spark.sql.hive.metastore.jars

Answer: B

Explanation: spark.sql.hive.hwc.enabled activates the HWC and its ACID capabilities.

**Question 51.** Which of the following is NOT a valid Spark shuffle write compression codec?

A) lz4
B) snappy
C) gzip
D) bzip2

Answer: D

Explanation: Spark's built-in codecs for spark.io.compression.codec are lz4, lzf, snappy, and zstd; bzip2 is not available for shuffle compression. gzip is not among the built-in codecs either, so option C is also questionable.

**Question 52.** In Airflow, which parameter of a DAG defines the timezone used for schedule intervals?

A) default_args
B) catchup
C) timezone
D) schedule_interval

Answer: C

Explanation: Airflow derives a DAG’s timezone from a timezone-aware start_date, falling back to the configured default timezone, and exposes it as the DAG’s timezone attribute.

**Question 56.** Which Airflow feature allows you to pause a DAG without deleting its schedule?

A) set_active(False)
B) pause_dag()
C) is_paused flag in UI
D) disable_schedule()

Answer: C

Explanation: The UI includes an “is_paused” toggle that stops scheduling while preserving DAG definition.

**Question 57.** Which Spark configuration controls the amount of memory allocated for the driver process?

A) spark.driver.memory
B) spark.executor.memory
C) spark.memory.fraction
D) spark.sql.autoBroadcastJoinThreshold

Answer: A

Explanation: spark.driver.memory defines the JVM heap size for the driver.

**Question 58.** When reading from an Iceberg table via Spark, which format must be specified for optimal performance?

A) iceberg
B) parquet
C) orc
D) delta

Answer: A

Explanation: Using format("iceberg") lets Spark leverage Iceberg’s metadata for pruning and projection.

**Question 59.** Which Spark DataFrame API is used to flatten an array column into multiple rows?

A) explode()
B) flatten()
C) split()
D) unnest()

Answer: A

Explanation: explode(col) transforms each element of an array into a separate row.

**Question 60.** In Airflow, which attribute of a task determines the maximum execution time before it is marked as failed?

A) execution_timeout
B) timeout_seconds
C) sla
D) retry_delay

Answer: A

Explanation: execution_timeout is a datetime.timedelta after which the task is terminated.

**Question 61.** Which Spark UI tab shows the DAG of stages and tasks for a particular job?

A) Stages
B) DAG Visualization (under Stages)
C) Executors
D) SQL

Answer: B

Explanation: The DAG Visualization within the Stages tab displays stage dependencies.

**Question 62.** Which Iceberg command is used to rewrite data files to achieve better file sizes?

A) rewrite_data_files()
