
























































The PrepIQ Cloudera CDP Data Developer Ultimate Exam prepares learners to develop and manage big data solutions using Cloudera Data Platform technologies, data pipelines, analytics tools, and distributed processing systems.
Typology: Exams

























































Question 1. Which CDP deployment model allows you to run services on a public cloud provider while keeping the control plane on-premises? A) CDP Public Cloud B) CDP Private Cloud Base C) CDP Hybrid D) CDP Edge Answer: C Explanation: CDP Hybrid combines on-premises control with public-cloud data services, enabling workloads to span both environments.
Question 2. In CDP, which component provides fine-grained access control for Hive, HBase, and Spark through policies? A) Apache Atlas B) Apache Ranger C) Apache Knox D) Apache NiFi Answer: B Explanation: Apache Ranger centralizes security policies, enforcing row-level and column-level permissions across Hadoop services.
Question 3. What is the primary purpose of Apache Atlas in the Shared Data Experience (SDX) layer? A) Data encryption at rest B) Job scheduling C) Metadata cataloging and lineage tracking D) Resource provisioning Answer: C Explanation: Atlas captures metadata, relationships, and lineage, enabling governance and impact analysis.
Question 4. Which CDP service is optimized for ad-hoc SQL analytics on large datasets? A) Cloudera Data Engineering (CDE) B) Cloudera Data Warehouse (CDW) C) Cloudera Machine Learning (CML) D) Cloudera Data Flow (CDF) Answer: B Explanation: CDW provides a high-performance, MPP-style SQL engine (Impala) for interactive analytics.
Question 5. Virtual Private Clusters (VPC) in CDP are used to: A) Encrypt data in transit B) Isolate compute resources per team or project C) Store metadata for tables D) Manage Kerberos tickets Answer: B
Question 8. Which Spark SQL feature enables predicate pushdown to Parquet files, reducing I/O? A) Catalyst optimizer B) Tungsten execution engine C) DataSource V2 API D) Partition pruning Answer: D Explanation: Partition pruning applies filters on partition columns so that Spark skips whole partition directories; within the files that remain, Parquet filter pushdown uses row-group statistics to skip irrelevant row groups.
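Skipping whole partitions and skipping row groups inside a file are two separate steps, and a toy model in plain Python may make the difference clearer. This is only a sketch, not Spark's implementation: the `files` layout, the min/max statistics, and the function `plan_scan` are hypothetical.

```python
# Toy model of two I/O-reduction steps; all names and data are hypothetical.
# Each file belongs to one partition (year) and holds row groups that carry
# min/max statistics for an `amount` column.
files = [
    {"path": "year=2022/part-0.parquet", "year": 2022,
     "row_groups": [{"min": 1, "max": 50}, {"min": 51, "max": 100}]},
    {"path": "year=2023/part-0.parquet", "year": 2023,
     "row_groups": [{"min": 1, "max": 40}, {"min": 90, "max": 200}]},
]

def plan_scan(files, year, min_amount):
    """Return (path, row_group_index) pairs that must actually be read."""
    to_read = []
    for f in files:
        # Partition pruning: skip the whole file when its partition value
        # cannot satisfy the filter on `year`.
        if f["year"] != year:
            continue
        for i, rg in enumerate(f["row_groups"]):
            # Statistics-based skipping: a row group whose max is below the
            # threshold cannot contain a row with amount >= min_amount.
            if rg["max"] < min_amount:
                continue
            to_read.append((f["path"], i))
    return to_read

scan = plan_scan(files, year=2023, min_amount=80)
print(scan)
```

Of the four row groups in the toy data, only the second row group of the 2023 file survives both checks.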
Question 9. Which compression codec offers the best trade-off between speed and compression ratio for columnar storage in CDP? A) Gzip B) Snappy C) Bzip2 D) LZO Answer: B Explanation: Snappy compresses quickly with modest size reduction, making it ideal for Parquet and ORC files.
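Question 10. When writing a DataFrame to a Hive table using saveAsTable, which mode overwrites the existing table data but preserves the table definition? A) append B) overwrite C) ignore D) errorIfExists Answer: B Explanation: overwrite is the only listed mode that replaces the table's existing data. Note that Spark may drop and recreate the table in this mode, so writing with insertInto in overwrite mode is the safer choice when the existing Hive definition must be kept exactly.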
Question 11. Which of the following cloud storage options is native to Azure and supported by CDP for data ingestion? A) Amazon S3 B) Google Cloud Storage C) Azure Data Lake Storage Gen2 (ADLS Gen2) D) IBM Cloud Object Storage Answer: C Explanation: ADLS Gen2 is Azure’s object store, fully integrated with CDP’s storage connectors.
Question 12. In Apache Iceberg, what is the purpose of a manifest file? A) Stores the full table schema B) Lists data files that belong to a particular snapshot C) Contains statistics for query optimization D) Holds the Iceberg configuration settings Answer: B Explanation: Each manifest enumerates the data files (and their partition values) that are part of a snapshot, enabling fast metadata scans.
B) Automatic partition pruning based on file statistics C) Storing partition values inside the data files instead of the directory structure D) Encrypting partition values for security Answer: C Explanation: Iceberg embeds partition values in the file metadata, allowing flexible partitioning without directory nesting.
Question 16. Which ACID property is most directly enforced by Iceberg’s write-conflict detection? A) Atomicity B) Consistency C) Isolation D) Durability Answer: C Explanation: Iceberg detects concurrent writes to the same snapshot and aborts conflicting transactions, ensuring isolation.
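The conflict detection described in Question 16 can be sketched as an optimistic compare-and-swap on the table's current snapshot pointer. This is a minimal sketch under that assumption: `ToyCatalog` and `CommitConflict` are hypothetical names, and real Iceberg catalogs also retry and validate commits in ways not shown here.

```python
class CommitConflict(Exception):
    """Raised when a writer's base snapshot is no longer the current one."""

class ToyCatalog:
    """Hypothetical catalog that tracks only the current snapshot id."""

    def __init__(self):
        self.current_snapshot = 0

    def commit(self, base_snapshot, new_snapshot):
        # Compare-and-swap: the commit succeeds only if nobody else has
        # committed since this writer read `base_snapshot`.
        if base_snapshot != self.current_snapshot:
            raise CommitConflict(
                f"expected snapshot {base_snapshot}, "
                f"found {self.current_snapshot}"
            )
        self.current_snapshot = new_snapshot

catalog = ToyCatalog()
base_a = catalog.current_snapshot  # writer A reads snapshot 0
base_b = catalog.current_snapshot  # writer B reads snapshot 0 as well

catalog.commit(base_a, new_snapshot=1)  # A commits first and wins

try:
    catalog.commit(base_b, new_snapshot=2)  # B's base is now stale
    conflict = False
except CommitConflict:
    conflict = True

print(conflict, catalog.current_snapshot)
```

Writer B never sees a half-applied state from writer A; it either commits on top of the snapshot it read or is rejected.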
Question 17. In an Airflow DAG, which operator would you use to trigger a Spark job running in CDE? A) BashOperator B) SparkSubmitOperator C) CDEOperator D) PythonOperator Answer: C Explanation: The CDEOperator is a CDP-specific Airflow operator that submits Spark jobs to the Cloudera Data Engineering service.
Question 18. What is the purpose of an Airflow Sensor? A) Execute a Python function on a schedule B) Pause DAG execution until a condition is met C) Send email alerts on failure D) Perform data transformations Answer: B Explanation: Sensors are special operators that repeatedly check for a condition (e.g., file arrival) before allowing downstream tasks to run.
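The "repeatedly check for a condition" behaviour of a sensor amounts to a polling loop with a timeout. The sketch below uses only the standard library rather than Airflow itself; `wait_for`, `poke_interval`, and `timeout` are illustrative names chosen to echo sensor terminology, not Airflow's API.

```python
import time

def wait_for(condition, poke_interval=0.01, timeout=1.0):
    """Call `condition` until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True  # condition met: downstream work may start
        time.sleep(poke_interval)  # wait before the next check
    return False  # timed out without the condition being met

# Simulate a file that "arrives" on the third check.
arrivals = iter([False, False, True])
ready = wait_for(lambda: next(arrivals))

# A condition that never becomes true ends in a timeout.
timed_out = wait_for(lambda: False, timeout=0.05)

print(ready, timed_out)
```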
Question 19. Which scheduling option allows a DAG to run only when new data appears in an S3 bucket? A) cron expression B) timedelta schedule_interval C) S3KeySensor D) FixedDateSchedule Answer: C Explanation: S3KeySensor monitors an S3 path and triggers downstream tasks once the specified key is detected.
C) Exchange D) Scan Answer: C Explanation: The Exchange operator represents a shuffle stage where data is repartitioned across executors.
Question 23. Which join strategy is most efficient when one side of the join is less than 10 MB and the other side is large? A) Shuffle Hash Join B) Broadcast Hash Join C) Sort-Merge Join D) Cartesian Join Answer: B Explanation: Broadcasting the small dataset avoids a shuffle, allowing each executor to join locally.
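Why broadcasting avoids a shuffle can be illustrated without Spark: every partition of the large side receives a full copy of the small side as a hash table and joins locally. The function `broadcast_hash_join` and the sample rows are hypothetical, and the sketch ignores serialization and memory limits.

```python
def broadcast_hash_join(large_partitions, small_rows, key):
    """Inner-join each partition of the large side against a shared lookup."""
    # Build the hash table once; in a cluster this is what gets broadcast.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    joined = []
    for partition in large_partitions:
        # Each partition is processed independently: no rows of the large
        # side move between partitions, so no shuffle is needed.
        for row in partition:
            for match in lookup.get(row[key], []):
                joined.append({**row, **match})
    return joined

orders = [  # large side, already split into two partitions
    [{"cust_id": 1, "total": 30}, {"cust_id": 2, "total": 15}],
    [{"cust_id": 1, "total": 7}, {"cust_id": 3, "total": 99}],
]
customers = [  # small side
    {"cust_id": 1, "name": "Ana"},
    {"cust_id": 2, "name": "Bo"},
]

result = broadcast_hash_join(orders, customers, "cust_id")
for row in result:
    print(row)
```

The order for `cust_id` 3 has no match and is dropped, as expected for an inner join.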
Question 24. What Spark configuration controls the amount of memory allocated to the executor’s JVM heap? A) spark.driver.memory B) spark.executor.memory C) spark.memory.fraction D) spark.sql.shuffle.partitions Answer: B Explanation: spark.executor.memory defines the JVM heap size for each executor process.
Question 25. Dynamic allocation in Spark primarily adjusts which resource? A) Number of Spark SQL tables B) Executor count based on workload demand C) Size of the driver memory D) Number of Hive Metastore connections Answer: B Explanation: Dynamic allocation adds or removes executors automatically according to the current stage’s needs.
Question 26. Which data organization technique can dramatically reduce the amount of data read during a query that filters on a column with high cardinality? A) Bucketing by that column B) Sorting by that column only within partitions C) Using a single large file D) Disabling compression Answer: A Explanation: Bucketing hashes the column value to assign each row to a fixed bucket, so all rows with the same value land in the same bucket file and Spark can skip the other buckets for equality filters.
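Bucket skipping follows from a deterministic hash: an equality filter on the bucketing column only needs the one bucket that the value hashes to. In this sketch `zlib.crc32` stands in for the engine's own hash function, and `NUM_BUCKETS`, `bucket_for`, and the sample rows are hypothetical.

```python
import zlib

NUM_BUCKETS = 4  # assumed bucket count for the illustration

def bucket_for(value, num_buckets=NUM_BUCKETS):
    """Deterministically map a value to a bucket id."""
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets

# Write path: every row goes to the bucket chosen by its user_id.
rows = [{"user_id": f"u{i}", "clicks": i} for i in range(20)]
buckets = {b: [] for b in range(NUM_BUCKETS)}
for row in rows:
    buckets[bucket_for(row["user_id"])].append(row)

# Read path: an equality filter only has to scan a single bucket.
target = "u7"
candidate = buckets[bucket_for(target)]
matches = [r for r in candidate if r["user_id"] == target]

print(f"scanned {len(candidate)} of {len(rows)} rows, found {matches}")
```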
Answer: B Explanation: Masking policies transform column values (e.g., show only last 4 digits) when a user queries the table.
Question 30. Which Apache Atlas feature helps developers understand the upstream sources of a dataset? A) Glossary terms B) Lineage graphs C) Classification tags D) Policy enforcement Answer: B Explanation: Atlas lineage graphs visualize data flow from source to destination, aiding impact analysis.
Question 31. Data-at-rest encryption in CDP is typically enforced at which layer? A) Application code level B) HDFS block storage level (transparent encryption) C) Network firewall D) Airflow DAG definition Answer: B Explanation: CDP can enable Transparent Data Encryption (TDE) on HDFS, encrypting blocks on disk automatically.
Question 32. Which protocol does Kerberos use to obtain a ticket-granting ticket (TGT) for a user? A) LDAP B) HTTP C) TCP/UDP on port 88 (AS) D) SSH Answer: C Explanation: Kerberos Authentication Service (AS) runs on port 88, issuing TGTs after verifying credentials.
Question 33. In CDP, what is the purpose of workload passwords? A) To encrypt HDFS data B) To authenticate service-to-service calls without exposing Kerberos keys C) To store user passwords in plain text D) To configure network firewalls Answer: B Explanation: Workload passwords allow applications (e.g., Spark) to authenticate to services like Hive without needing Kerberos tickets.
Question 34. Which CDP service provides a managed JupyterLab environment for data scientists? A) CDE
Answer: B Explanation: Excessive partitions create many small files, leading to overhead in task scheduling and network I/O.
Question 37. Which Iceberg table property controls the minimum number of files that must be rewritten during a major compaction? A) write.target-file-size-bytes B) snapshot.interval-ms C) min-snapshots-to-keep D) delete.target-file-size-bytes Answer: A Explanation: write.target-file-size-bytes sets the target size of the data files that writes and compaction produce, which in turn influences how many small files are merged into each output file.
Question 38. In Airflow, what does the depends_on_past=True parameter achieve? A) Forces the DAG to wait for external triggers B) Makes each task instance wait for the previous run’s same task to succeed C) Disables retries for the task D) Enables task-level parallelism Answer: B Explanation: When true, a task will not run for the current schedule until the same task from the previous schedule has completed successfully.
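The rule behind depends_on_past can be written as a small predicate over the history of earlier runs. This is a sketch of the rule only, not Airflow code: `can_run` and the `history` dictionary are hypothetical, and runs are numbered from 0 for simplicity.

```python
def can_run(task_id, run_index, history, depends_on_past):
    """Decide whether a task instance is allowed to start."""
    # Without the flag, and for the very first run, nothing is checked.
    if not depends_on_past or run_index == 0:
        return True
    # Otherwise the same task in the previous run must have succeeded.
    return history.get((task_id, run_index - 1)) == "success"

# (task_id, run_index) -> final state of that task instance
history = {("load", 0): "success", ("load", 1): "failed"}

after_success = can_run("load", 1, history, depends_on_past=True)
after_failure = can_run("load", 2, history, depends_on_past=True)
flag_disabled = can_run("load", 2, history, depends_on_past=False)

print(after_success, after_failure, flag_disabled)
```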
Question 39. Which Spark SQL function can be used to explode an array column into multiple rows? A) flatten() B) posexplode() C) explode() D) split() Answer: C Explanation: explode() creates a new row for each element in the array, preserving other columns.
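What explode() does to a row can be mimicked in plain Python with lists of dictionaries standing in for a DataFrame. The helper below is a hypothetical re-implementation for illustration; like Spark's explode(), it produces no output row for a null or empty array.

```python
def explode(rows, column):
    """Emit one output row per array element, copying the other columns."""
    out = []
    for row in rows:
        # A null (None) or empty array contributes no output rows.
        for element in row[column] or []:
            new_row = dict(row)
            new_row[column] = element
            out.append(new_row)
    return out

rows = [
    {"id": 1, "tags": ["a", "b"]},
    {"id": 2, "tags": []},
    {"id": 3, "tags": None},
]

exploded = explode(rows, "tags")
print(exploded)
```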
Question 40. When storing semi-structured JSON data in a Spark DataFrame, which data type best represents nested objects? A) String B) MapType C) StructType D) BinaryType Answer: C Explanation: StructType models a fixed schema with nested fields, matching JSON objects.
Question 41. What is the primary advantage of using the Iceberg MERGE INTO statement over a classic Hive INSERT OVERWRITE for incremental loads? A) It automatically creates indexes B) It can update, insert, or delete rows in place without rewriting the whole table
Explanation: CDF (based on NiFi, Kafka, Flink) is designed for building and operating data pipelines and streaming applications.
Question 44. What does the spark.sql.autoBroadcastJoinThreshold configuration control? A) Maximum size of a table that can be broadcast for a join B) Minimum number of partitions for a shuffle join C) Timeout for broadcast join execution D) Number of broadcast join retries Answer: A Explanation: This threshold (default 10 MB) determines whether Spark will automatically broadcast a small table during a join.
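The threshold check can be reduced to a simple size comparison. The sketch below assumes the planner has an estimated size in bytes for the smaller table; `should_broadcast` is a hypothetical helper, not a Spark API, and it treats a negative threshold as "broadcasting disabled", which is how setting the option to -1 behaves.

```python
DEFAULT_THRESHOLD = 10 * 1024 * 1024  # 10 MB, the documented default

def should_broadcast(estimated_bytes, threshold=DEFAULT_THRESHOLD):
    """Return True when a table is small enough to be broadcast."""
    if threshold < 0:
        return False  # a negative threshold turns automatic broadcast off
    return estimated_bytes <= threshold

small = should_broadcast(2 * 1024 * 1024)          # 2 MB dimension table
large = should_broadcast(500 * 1024 * 1024)        # 500 MB fact table
disabled = should_broadcast(1024, threshold=-1)    # broadcast switched off

print(small, large, disabled)
```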
Question 45. Which Spark storage level persists data both in memory and on disk, and also replicates it across two executors? A) MEMORY_ONLY_2 B) MEMORY_AND_DISK_2 C) DISK_ONLY_2 D) OFF_HEAP Answer: B Explanation: MEMORY_AND_DISK_2 keeps data in memory when possible, spills to disk otherwise, and stores two replicas.
Question 46. In Iceberg, what is a “manifest list”? A) A list of all column names in the table B) A file that aggregates pointers to individual manifest files for a snapshot C) A log of all schema changes D) A configuration file for storage locations Answer: B Explanation: The manifest list references the set of manifest files that constitute a particular snapshot, enabling fast metadata reads.
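The chain from snapshot to manifest list to manifests to data files can be modelled with nested dictionaries. All file names and contents here are hypothetical, and real Iceberg stores this metadata as Avro files with far more fields; the sketch only shows how a reader resolves a snapshot to its data files and can filter on partition values without opening any data file.

```python
# Hypothetical metadata: one snapshot, its manifest list, and two manifests.
snapshot = {"snapshot_id": 2, "manifest_list": "snap-2.avro"}

manifest_lists = {
    "snap-2.avro": ["manifest-a.avro", "manifest-b.avro"],
}

manifests = {
    "manifest-a.avro": [
        {"path": "data/f1.parquet", "partition": {"day": "2024-01-01"}},
        {"path": "data/f2.parquet", "partition": {"day": "2024-01-01"}},
    ],
    "manifest-b.avro": [
        {"path": "data/f3.parquet", "partition": {"day": "2024-01-02"}},
    ],
}

def data_files(snapshot, day=None):
    """Resolve a snapshot to data file paths, optionally filtered by day."""
    paths = []
    # Step 1: the manifest list names the manifests of this snapshot.
    for manifest_name in manifest_lists[snapshot["manifest_list"]]:
        # Step 2: each manifest enumerates data files and partition values.
        for entry in manifests[manifest_name]:
            if day is None or entry["partition"]["day"] == day:
                paths.append(entry["path"])
    return paths

all_files = data_files(snapshot)
jan_2 = data_files(snapshot, day="2024-01-02")
print(all_files, jan_2)
```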
Question 47. Which Ranger policy type can enforce data masking on a column based on user groups? A) Row filter policy B) Column masking policy C) Access type policy D) Tag-based policy Answer: B Explanation: Column masking policies allow different mask expressions per user or group.
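Group-dependent masking can be pictured as a list of (group, mask function) rules where the first matching rule is applied. The sketch is not Ranger's policy engine: `mask_show_last_4`, `mask_nullify`, `apply_mask`, and the group names are hypothetical, and a user who matches no rule sees the unmasked value.

```python
def mask_show_last_4(value):
    """Replace all but the last four characters with 'x'."""
    return "x" * max(len(value) - 4, 0) + value[-4:]

def mask_nullify(value):
    """Hide the value completely."""
    return None

# Ordered rules: the first group the user belongs to decides the mask.
policy = [
    ("support", mask_show_last_4),
    ("interns", mask_nullify),
]

def apply_mask(value, user_groups, policy):
    """Return the value as the given user would see it."""
    for group, mask_fn in policy:
        if group in user_groups:
            return mask_fn(value)
    return value  # no masking rule matched

card = "4111111111111111"
seen_by_support = apply_mask(card, {"support"}, policy)
seen_by_intern = apply_mask(card, {"interns"}, policy)
seen_by_auditor = apply_mask(card, {"auditors"}, policy)

print(seen_by_support, seen_by_intern, seen_by_auditor)
```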
Question 48. When configuring Kerberos for a Spark application on CDP, which principal is typically used for the driver? A) hdfs/_HOST@REALM B) spark/_HOST@REALM C) yarn/_HOST@REALM