PrepIQ Cloudera CDP Data Developer Ultimate Exam, Exams of Technology

The PrepIQ Cloudera CDP Data Developer Ultimate Exam prepares learners to develop and manage big data solutions using Cloudera Data Platform technologies, data pipelines, analytics tools, and distributed processing systems.

Typology: Exams

2025/2026

Available from 06/07/2026

shilpi-jain-2
PrepIQ Cloudera CDP Data Developer
Ultimate Exam
Question 1. **Which CDP deployment model allows you to run services on a public
cloud provider while keeping the control plane on-premises?**
A) CDP Public Cloud
B) CDP Private Cloud Base
C) CDP Hybrid
D) CDP Edge
Answer: C
Explanation: CDP Hybrid combines on-premises control with public-cloud data
services, enabling workloads to span both environments.
---
Question 2. **In CDP, which component provides fine-grained access control for
Hive, HBase, and Spark through policies?**
A) Apache Atlas
B) Apache Ranger
C) Apache Knox
D) Apache NiFi
Answer: B
Explanation: Apache Ranger centralizes security policies, enforcing row-level and
column-level permissions across Hadoop services.
---
Question 3. **What is the primary purpose of Apache Atlas in the Shared Data
Experience (SDX) layer?**
A) Data encryption at rest
B) Job scheduling

C) Metadata cataloging and lineage tracking
D) Resource provisioning
Answer: C
Explanation: Atlas captures metadata, relationships, and lineage, enabling governance and impact analysis.
---

Question 4. **Which CDP service is optimized for ad-hoc SQL analytics on large datasets?**
A) Cloudera Data Engineering (CDE)
B) Cloudera Data Warehouse (CDW)
C) Cloudera Machine Learning (CML)
D) Cloudera Data Flow (CDF)
Answer: B
Explanation: CDW provides a high-performance, MPP-style SQL engine (Impala) for interactive analytics.
---

Question 5. **Virtual Private Clusters (VPC) in CDP are used to:**
A) Encrypt data in transit
B) Isolate compute resources per team or project
C) Store metadata for tables
D) Manage Kerberos tickets
Answer: B

Question 8. **Which Spark SQL feature enables predicate pushdown to Parquet files, reducing I/O?**
A) Catalyst optimizer
B) Tungsten execution engine
C) DataSource V2 API
D) Partition pruning
Answer: A
Explanation: The Catalyst optimizer pushes filter predicates down to the Parquet reader, which uses row-group statistics to skip irrelevant row groups. Partition pruning is a separate optimization that skips entire partition directories based on partition column values.
---
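The row-group skipping mentioned in the explanation relies on per-row-group min/max statistics. The following is a minimal pure-Python sketch of that idea, not Spark or Parquet code; the `RowGroup` class and `scan_greater_than` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Toy stand-in for a Parquet row group with min/max column statistics."""
    min_value: int
    max_value: int
    rows: List[int]


def scan_greater_than(row_groups: List[RowGroup], threshold: int) -> Tuple[List[int], int]:
    """Return the rows with value > threshold and the number of row groups read."""
    matches: List[int] = []
    groups_read = 0
    for group in row_groups:
        # If even the largest value fails the filter, skip the group unread.
        if group.max_value <= threshold:
            continue
        groups_read += 1
        matches.extend(v for v in group.rows if v > threshold)
    return matches, groups_read


groups = [
    RowGroup(1, 10, [1, 5, 10]),
    RowGroup(11, 20, [11, 15, 20]),
    RowGroup(21, 30, [21, 25, 30]),
]
rows, read = scan_greater_than(groups, 18)
print(rows, read)  # [20, 21, 25, 30] 2
```

Only two of the three toy row groups are read, because the statistics of the first one already rule it out.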

Question 9. **Which compression codec offers the best trade-off between speed and compression ratio for columnar storage in CDP?**
A) Gzip
B) Snappy
C) Bzip2
D) LZO
Answer: B
Explanation: Snappy compresses and decompresses quickly with a modest size reduction, making it well suited to Parquet and ORC files.
---

Question 10. **When writing a DataFrame to a Hive table using saveAsTable, which mode overwrites the existing table data but preserves the table definition?**
A) append
B) overwrite
C) ignore
D) errorIfExists
Answer: B
Explanation: overwrite replaces the table’s existing data with the DataFrame’s contents while the table stays registered in the Hive metastore under the same name.
---

Question 11. **Which of the following cloud storage options is native to Azure and supported by CDP for data ingestion?**
A) Amazon S3
B) Google Cloud Storage
C) Azure Data Lake Storage Gen2 (ADLS Gen2)
D) IBM Cloud Object Storage
Answer: C
Explanation: ADLS Gen2 is Azure’s native data lake storage and is supported by CDP’s storage connectors.
---

Question 12. **In Apache Iceberg, what is the purpose of a manifest file?**
A) Stores the full table schema
B) Lists data files that belong to a particular snapshot
C) Contains statistics for query optimization
D) Holds the Iceberg configuration settings
Answer: B
Explanation: Each manifest enumerates the data files (and their partition values) that are part of a snapshot, enabling fast metadata scans.
---

B) Automatic partition pruning based on file statistics
C) Storing partition values inside the data files instead of the directory structure
D) Encrypting partition values for security
Answer: C
Explanation: Iceberg embeds partition values in the file metadata, allowing flexible partitioning without directory nesting.
---

Question 16. **Which ACID property is most directly enforced by Iceberg’s write-conflict detection?**
A) Atomicity
B) Consistency
C) Isolation
D) Durability
Answer: C
Explanation: Iceberg detects concurrent writes to the same snapshot and aborts conflicting transactions, ensuring isolation.
---
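The conflict detection described above follows an optimistic-concurrency pattern. The sketch below is a deliberately simplified pure-Python model, not Iceberg's implementation: `ToyTable` and `CommitConflict` are hypothetical names, and the model rejects every stale commit, whereas a real engine may retry or check whether the changes actually overlap.

```python
class CommitConflict(Exception):
    """Raised when a writer's base snapshot is no longer the current one."""


class ToyTable:
    """Hypothetical in-memory table that tracks only a snapshot counter."""

    def __init__(self) -> None:
        self.current_snapshot = 0

    def commit(self, base_snapshot: int) -> int:
        # A commit succeeds only if nobody else committed since the writer read.
        if base_snapshot != self.current_snapshot:
            raise CommitConflict(
                f"based on {base_snapshot}, table is at {self.current_snapshot}"
            )
        self.current_snapshot += 1
        return self.current_snapshot


table = ToyTable()
base_a = table.current_snapshot  # writer A reads snapshot 0
base_b = table.current_snapshot  # writer B reads snapshot 0
first = table.commit(base_a)     # A commits first and wins
try:
    table.commit(base_b)         # B is stale and is rejected
    conflict = False
except CommitConflict:
    conflict = True
print(first, conflict)  # 1 True
```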

Question 17. **In an Airflow DAG, which operator would you use to trigger a Spark job running in CDE?**
A) BashOperator
B) SparkSubmitOperator
C) CDEOperator
D) PythonOperator
Answer: C
Explanation: The CDE operator is a Cloudera-specific Airflow operator that submits Spark jobs to the Cloudera Data Engineering service (in Cloudera's Airflow provider the class is named CDEJobRunOperator).
---

Question 18. **What is the purpose of an Airflow Sensor?**
A) Execute a Python function on a schedule
B) Pause DAG execution until a condition is met
C) Send email alerts on failure
D) Perform data transformations
Answer: B
Explanation: Sensors are special operators that repeatedly check for a condition (e.g., file arrival) before allowing downstream tasks to run.
---
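The "repeatedly check for a condition" behaviour can be pictured as a polling loop. This is a plain-Python sketch of the pattern, not Airflow code; `wait_for` and `file_has_arrived` are hypothetical helpers, and the parameter names `poke_interval` and `timeout` are borrowed only for familiarity.

```python
import time
from typing import Callable


def wait_for(condition: Callable[[], bool], poke_interval: float, timeout: float) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True   # condition met: downstream work may start
        if time.monotonic() >= deadline:
            return False  # gave up waiting
        time.sleep(poke_interval)


# Hypothetical condition: becomes true on the third check.
checks = {"count": 0}


def file_has_arrived() -> bool:
    checks["count"] += 1
    return checks["count"] >= 3


arrived = wait_for(file_has_arrived, poke_interval=0.01, timeout=1.0)
print(arrived, checks["count"])  # True 3
```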

Question 19. **Which scheduling option allows a DAG to run only when new data appears in an S3 bucket?**
A) cron expression
B) timedelta schedule_interval
C) S3KeySensor
D) FixedDateSchedule
Answer: C
Explanation: S3KeySensor monitors an S3 path and triggers downstream tasks once the specified key is detected.
---

C) Exchange
D) Scan
Answer: C
Explanation: The Exchange operator represents a shuffle stage where data is repartitioned across executors.
---

Question 23. **Which join strategy is most efficient when one side of the join is less than 10 MB and the other side is large?**
A) Shuffle Hash Join
B) Broadcast Hash Join
C) Sort-Merge Join
D) Cartesian Join
Answer: B
Explanation: Broadcasting the small dataset avoids a shuffle, allowing each executor to join locally.
---
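What "join locally" means can be sketched with an in-memory hash join: the small side becomes a lookup table and the large side is streamed past it once. This is an illustrative pure-Python model, not Spark's implementation; the function name and sample data are hypothetical.

```python
from collections import defaultdict


def broadcast_hash_join(large_rows, small_rows, key):
    """Join two lists of dicts by building a hash table on the small side."""
    # Build phase: the small side is what every executor would receive a copy of.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)
    # Probe phase: stream the large side once, with no repartitioning.
    joined = []
    for row in large_rows:
        for match in lookup.get(row[key], []):
            joined.append({**row, **match})
    return joined


orders = [
    {"cust_id": 1, "amount": 30},
    {"cust_id": 2, "amount": 12},
    {"cust_id": 1, "amount": 7},
]
customers = [{"cust_id": 1, "name": "Ada"}, {"cust_id": 2, "name": "Lin"}]
result = broadcast_hash_join(orders, customers, "cust_id")
print(result)
```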

Question 24. **What Spark configuration controls the amount of memory allocated to the executor’s JVM heap?**
A) spark.driver.memory
B) spark.executor.memory
C) spark.memory.fraction
D) spark.sql.shuffle.partitions
Answer: B
Explanation: spark.executor.memory defines the JVM heap size for each executor process.
---

Question 25. **Dynamic allocation in Spark primarily adjusts which resource?**
A) Number of Spark SQL tables
B) Executor count based on workload demand
C) Size of the driver memory
D) Number of Hive Metastore connections
Answer: B
Explanation: Dynamic allocation adds or removes executors automatically according to the current stage’s needs.
---
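As a rough illustration of how an executor target could follow demand, the sketch below sizes the executor count from the task backlog and clamps it to configured bounds. It is a simplified model under stated assumptions (one core per task, no idle timeouts, no ramp-up), not Spark's actual allocation logic; `target_executors` is a hypothetical helper.

```python
import math


def target_executors(pending_tasks: int, running_tasks: int,
                     cores_per_executor: int,
                     min_executors: int, max_executors: int) -> int:
    """Estimate how many executors the current backlog calls for."""
    tasks_per_executor = cores_per_executor  # assumes one core per task
    needed = math.ceil((pending_tasks + running_tasks) / tasks_per_executor)
    # Never go below the floor or above the ceiling set by configuration.
    return max(min_executors, min(max_executors, needed))


print(target_executors(90, 10, 4, 1, 20))  # 20 (25 needed, capped at the maximum)
print(target_executors(6, 2, 4, 1, 20))    # 2
print(target_executors(0, 0, 4, 1, 20))    # 1 (idle, held at the minimum)
```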

Question 26. **Which data organization technique can dramatically reduce the amount of data read during a query that filters on a column with high cardinality?**
A) Bucketing by that column
B) Sorting by that column only within partitions
C) Using a single large file
D) Disabling compression
Answer: A
Explanation: Bucketing hashes the column into a fixed number of buckets, so rows with the same value always land in the same bucket file. An equality filter on that column can then skip the other buckets.
---
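The bucket-skipping idea can be sketched in plain Python. This is an illustration only: `zlib.crc32` stands in for whatever hash the engine uses, and `NUM_BUCKETS`, `write_bucketed`, and `read_equal` are hypothetical names.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count


def bucket_for(value: str) -> int:
    """Deterministically map a value to a bucket (crc32 is a stand-in hash)."""
    return zlib.crc32(value.encode("utf-8")) % NUM_BUCKETS


def write_bucketed(rows, column):
    """Group rows into buckets by the hash of the bucketing column."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_for(row[column])].append(row)
    return buckets


def read_equal(buckets, column, value):
    """An equality filter only needs the one bucket the value hashes to."""
    candidate = buckets.get(bucket_for(value), [])
    return [row for row in candidate if row[column] == value]


rows = [{"user_id": f"u{i}", "clicks": i} for i in range(1000)]
buckets = write_bucketed(rows, "user_id")
hits = read_equal(buckets, "user_id", "u42")
scanned = len(buckets[bucket_for("u42")])
print(hits, scanned, len(rows))
```

The lookup scans only the rows in one bucket rather than all 1000, which is the saving the explanation refers to.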

Answer: B
Explanation: Masking policies transform column values (e.g., show only last 4 digits) when a user queries the table.
---
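The "show only last 4 digits" transformation can be illustrated with a small function. This is a toy sketch, not Ranger's implementation; the function names, the mask character, and the group-based rule are assumptions made for the example.

```python
def mask_show_last4(value: str, mask_char: str = "x") -> str:
    """Hide all but the last four characters of a value."""
    if len(value) <= 4:
        return value
    return mask_char * (len(value) - 4) + value[-4:]


def read_column(value: str, user_groups: set, unmasked_groups: set) -> str:
    """Return the raw value only to users in an unmasked group (hypothetical rule)."""
    if user_groups & unmasked_groups:
        return value
    return mask_show_last4(value)


card = "4111111111111111"
analyst_view = read_column(card, {"analysts"}, {"finance"})
finance_view = read_column(card, {"finance"}, {"finance"})
print(analyst_view)  # xxxxxxxxxxxx1111
print(finance_view)  # 4111111111111111
```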

Question 30. **Which Apache Atlas feature helps developers understand the upstream sources of a dataset?**
A) Glossary terms
B) Lineage graphs
C) Classification tags
D) Policy enforcement
Answer: B
Explanation: Atlas lineage graphs visualize data flow from source to destination, aiding impact analysis.
---

Question 31. **Data-at-rest encryption in CDP is typically enforced at which layer?**
A) Application code level
B) HDFS block storage level (transparent encryption)
C) Network firewall
D) Airflow DAG definition
Answer: B
Explanation: CDP can enable Transparent Data Encryption (TDE) on HDFS, encrypting blocks on disk automatically.
---

Question 32. **Which protocol does Kerberos use to obtain a ticket-granting ticket (TGT) for a user?**
A) LDAP
B) HTTP
C) TCP/UDP on port 88 (AS)
D) SSH
Answer: C
Explanation: Kerberos Authentication Service (AS) runs on port 88, issuing TGTs after verifying credentials.
---

Question 33. **In CDP, what is the purpose of workload passwords?**
A) To encrypt HDFS data
B) To authenticate service-to-service calls without exposing Kerberos keys
C) To store user passwords in plain text
D) To configure network firewalls
Answer: B
Explanation: Workload passwords allow applications (e.g., Spark) to authenticate to services like Hive without needing Kerberos tickets.
---

Question 34. **Which CDP service provides a managed JupyterLab environment for data scientists?**
A) CDE

Answer: B
Explanation: Excessive partitions create many small files, leading to overhead in task scheduling and network I/O.
---

Question 37. **Which Iceberg table property controls the minimum number of files that must be rewritten during a major compaction?**
A) write.target-file-size-bytes
B) snapshot.interval-ms
C) min-snapshots-to-keep
D) delete.target-file-size-bytes
Answer: A
Explanation: write.target-file-size-bytes sets the target size for data files. Compaction merges small files toward that size, so the property indirectly influences how many files are rewritten rather than setting a file count directly.
---

Question 38. **In Airflow, what does the depends_on_past=True parameter achieve?**
A) Forces the DAG to wait for external triggers
B) Makes each task instance wait for the previous run’s same task to succeed
C) Disables retries for the task
D) Enables task-level parallelism
Answer: B
Explanation: When true, a task will not run for the current schedule until the same task from the previous schedule has completed successfully.
---
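The rule in the explanation can be written as a small decision function. This is a simplified model of the behaviour, not Airflow's scheduler code: `can_run` and the list-based history are hypothetical, and only the "success" state is treated as satisfying the dependency.

```python
def can_run(task_history, run_index, depends_on_past):
    """Decide whether the task instance for `run_index` may start.

    `task_history` holds the final state of the same task in earlier runs,
    ordered oldest first (a hypothetical, simplified representation).
    """
    if not depends_on_past or run_index == 0:
        return True  # the first run has no predecessor to wait for
    return task_history[run_index - 1] == "success"


history = ["success", "failed"]
print(can_run(history, 1, depends_on_past=True))   # True: run 0 succeeded
print(can_run(history, 2, depends_on_past=True))   # False: run 1 failed
print(can_run(history, 2, depends_on_past=False))  # True: past runs are ignored
```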

Question 39. **Which Spark SQL function can be used to explode an array column into multiple rows?**
A) flatten()
B) posexplode()
C) explode()
D) split()
Answer: C
Explanation: explode() creates a new row for each element in the array, preserving other columns.
---
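The row-per-element behaviour can be mimicked on plain Python dictionaries. This is a sketch of the semantics only, not PySpark code; the generator below is a hypothetical stand-in and, like Spark's explode(), it produces no output row for an empty array.

```python
def explode(rows, column):
    """Yield one output row per element of the list stored in `column`."""
    for row in rows:
        for element in row[column]:
            # Copy the other columns and replace the list with a single element.
            yield {**row, column: element}


rows = [
    {"id": 1, "tags": ["a", "b"]},
    {"id": 2, "tags": []},
    {"id": 3, "tags": ["c"]},
]
exploded = list(explode(rows, "tags"))
print(exploded)
```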

Question 40. **When storing semi-structured JSON data in a Spark DataFrame, which data type best represents nested objects?**
A) String
B) MapType
C) StructType
D) BinaryType
Answer: C
Explanation: StructType models a fixed schema with nested fields, matching JSON objects.
---

Question 41. **What is the primary advantage of using the Iceberg MERGE INTO statement over a classic Hive INSERT OVERWRITE for incremental loads?**
A) It automatically creates indexes
B) It can update, insert, or delete rows in place without rewriting the whole table

Explanation: CDF (based on NiFi, Kafka, Flink) is designed for building and operating data pipelines and streaming applications.

Question 44. **What does the spark.sql.autoBroadcastJoinThreshold configuration control?**
A) Maximum size of a table that can be broadcast for a join
B) Minimum number of partitions for a shuffle join
C) Timeout for broadcast join execution
D) Number of broadcast join retries
Answer: A
Explanation: This threshold (default 10 MB) determines whether Spark will automatically broadcast a small table during a join.
---

Question 45. **Which Spark storage level persists data both in memory and on disk, and also replicates it across two executors?**
A) MEMORY_ONLY_2
B) MEMORY_AND_DISK_2
C) DISK_ONLY_2
D) OFF_HEAP
Answer: B
Explanation: MEMORY_AND_DISK_2 keeps data in memory when possible, spills to disk otherwise, and stores two replicas.
---

Question 46. **In Iceberg, what is a “manifest list”?**
A) A list of all column names in the table
B) A file that aggregates pointers to individual manifest files for a snapshot
C) A log of all schema changes
D) A configuration file for storage locations
Answer: B
Explanation: The manifest list references the set of manifest files that constitute a particular snapshot, enabling fast metadata reads.
---
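The relationship between a snapshot, its manifest list, and the manifests can be pictured as a small tree. The classes below are a hypothetical, heavily reduced model for illustration (real manifests also carry partition values and statistics), and the file paths are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Manifest:
    """Lists the data files tracked by one manifest."""
    data_files: List[str]


@dataclass
class ManifestList:
    """Points to the manifests that make up one snapshot."""
    manifests: List[Manifest]


@dataclass
class Snapshot:
    snapshot_id: int
    manifest_list: ManifestList


def files_in_snapshot(snapshot: Snapshot) -> List[str]:
    """Walk snapshot -> manifest list -> manifests -> data files."""
    return [f for m in snapshot.manifest_list.manifests for f in m.data_files]


m1 = Manifest(["data/a.parquet", "data/b.parquet"])
m2 = Manifest(["data/c.parquet"])
snap1 = Snapshot(1, ManifestList([m1]))
snap2 = Snapshot(2, ManifestList([m1, m2]))  # the append reuses m1 unchanged
print(files_in_snapshot(snap1))
print(files_in_snapshot(snap2))
```

In this model the second snapshot reuses the first manifest, so an append only has to write one new manifest and one new manifest list.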

Question 47. **Which Ranger policy type can enforce data masking on a column based on user groups?**
A) Row filter policy
B) Column masking policy
C) Access type policy
D) Tag-based policy
Answer: B
Explanation: Column masking policies allow different mask expressions per user or group.
---

Question 48. **When configuring Kerberos for a Spark application on CDP, which principal is typically used for the driver?**
A) hdfs/_HOST@REALM
B) spark/_HOST@REALM
C) yarn/_HOST@REALM