




























































































This exam validates skills in developing data pipelines using CDP tools such as Spark, Hive, Impala, NiFi, and HDFS. It includes questions on ETL logic, performance tuning, data ingestion patterns, orchestrating flows, schema design, and SQL optimization within distributed systems.
Typology: Exams





























































































Question 1. In CDP terminology, which plane hosts the Management Console and handles account provisioning?
A) Data Plane
B) Control Plane
C) Service Plane
D) Edge Plane
Answer: B
Explanation: The Control Plane provides the web UI, APIs, and services for managing clusters, users, and policies across CDP deployments.

Question 2. Which CDP deployment model runs on third-party public cloud providers such as AWS, Azure, or GCP?
A) CDP Private Cloud Base
B) CDP Public Cloud
C) CDP Hybrid Cloud
D) CDP On-Premise
Answer: B
Explanation: CDP Public Cloud is Cloudera's managed cloud service that runs on the major public cloud platforms.

Question 3. What is the primary purpose of the Shared Data Experience (SDX) in CDP?
A) To provide a unified UI for all services
B) To enforce global security policies across data stores
C) To enable data sharing across multiple analytic experiences without duplication
D) To replace the Hive Metastore
Answer: C
Explanation: SDX allows different analytic experiences (e.g., Data Engineering, Data Warehouse) to access the same underlying data assets without needing separate copies.

Question 4. Which component stores metadata, lineage, and tags for data assets in CDP?
A) Ranger
B) Atlas
C) Impala Catalog
D) Ozone Metadata Service
Answer: B
Explanation: Apache Atlas provides data governance capabilities such as metadata management, lineage tracking, and tagging.

Question 5. Ranger policies are primarily used to control access to which type of resources?
A) Compute resources (CPU, memory)
B) Network bandwidth
C) Data resources such as Hive tables, HDFS directories, and Kafka topics
D) UI navigation menus
Answer: C
Explanation: Apache Ranger enforces fine-grained, role-based access control on data resources across the CDP ecosystem.

Question 6. Which distributed file system is the default storage layer for CDP Private Cloud Base?
A) Amazon S3
B) Azure Data Lake Store
C) HDFS
D) Google Cloud Storage

C) Spark SQL
D) Presto
Answer: B
Explanation: Impala is a massively parallel processing (MPP) SQL query engine designed for fast, interactive queries.

Question 10. When creating a Hive table, what distinguishes a Managed table from an External table?
A) Managed tables store data in HDFS; External tables store data in S3.
B) Managed tables are owned by the Hive Metastore and are dropped with the table; External tables keep data external to Hive.
C) Managed tables support ACID; External tables do not.
D) Managed tables can only be partitioned.
Answer: B
Explanation: For Managed tables, Hive controls the lifecycle of the data files, so dropping the table also deletes the data. External tables reference data stored outside Hive’s control.

Question 11. Which compression codec offers the best trade-off between speed and compression ratio for Parquet files in CDP?
A) GZIP
B) Snappy
C) BZIP2
D) LZO
Answer: B
Explanation: Snappy provides fast compression and decompression with reasonable size reduction, making it the default for Parquet in many CDP workloads.
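
The managed versus external distinction in Question 10 can be illustrated with a small toy model. This is only a sketch: `ToyMetastore`, its methods, and the table names are hypothetical and are not part of any Hive API. It assumes a catalog that records a data directory per table and deletes that directory on drop only when the table is managed.

```python
import os
import shutil
import tempfile


class ToyMetastore:
    """Hypothetical in-memory catalog; NOT a Hive API, just an illustration."""

    def __init__(self):
        self.tables = {}  # table name -> (data directory, is_managed)

    def create_table(self, name, location, managed):
        os.makedirs(location, exist_ok=True)
        # Write one placeholder data file so there is something to keep or delete.
        with open(os.path.join(location, "part-00000"), "w") as f:
            f.write("1,example\n")
        self.tables[name] = (location, managed)

    def drop_table(self, name):
        location, managed = self.tables.pop(name)  # metadata is always removed
        if managed:
            shutil.rmtree(location)  # managed: the data goes with the table


root = tempfile.mkdtemp()
managed_dir = os.path.join(root, "warehouse", "sales")
external_dir = os.path.join(root, "landing", "sales")

metastore = ToyMetastore()
metastore.create_table("sales_managed", managed_dir, managed=True)
metastore.create_table("sales_external", external_dir, managed=False)

metastore.drop_table("sales_managed")
metastore.drop_table("sales_external")

managed_data_exists = os.path.exists(managed_dir)
external_data_exists = os.path.exists(external_dir)
print(managed_data_exists, external_data_exists)

shutil.rmtree(root)  # clean up the temporary directory
```

After both drops, neither table is known to the toy catalog, but only the external table's files are still on disk.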
Question 12. NiFi’s FlowFile provenance data is stored where by default?
A) In a relational database on the same host
B) In the local file system under /var/lib/nifi/provenance_repository
C) In HDFS
D) In an external Elasticsearch cluster
Answer: B
Explanation: NiFi writes provenance events to a local disk-based repository for fast retrieval and low overhead.

Question 13. Which NiFi processor would you use to ingest data from a relational database using a JDBC connection?
A) GetFile
B) ExecuteSQL
C) PutKafka
D) ListenHTTP
Answer: B
Explanation: ExecuteSQL runs a user-defined SQL query against a JDBC connection and creates FlowFiles from the result set.

Question 14. In NiFi, what mechanism prevents a downstream processor from being overwhelmed by upstream data?
A) FlowFile attributes
B) Back pressure
C) Load balancing
D) Prioritization queue
Answer: B
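
The idea behind back pressure in Question 14 can be sketched with a toy queue model. This is not NiFi code: the `Connection` class, the `simulate` function, and the tick-based scheduling are assumptions made purely for illustration. The sketch assumes the upstream side is simply not scheduled while the queue between the two processors is at or above an object-count threshold.

```python
from collections import deque


class Connection:
    """Toy queue between two processors with an object-count threshold."""

    def __init__(self, object_threshold):
        self.object_threshold = object_threshold
        self.queue = deque()

    def back_pressure_engaged(self):
        return len(self.queue) >= self.object_threshold


def simulate(ticks, produce_per_tick, consume_per_tick, threshold):
    """Return (items produced, ticks the producer was held back, max queue depth)."""
    conn = Connection(threshold)
    produced = 0
    held_back = 0
    max_depth = 0
    for _ in range(ticks):
        if conn.back_pressure_engaged():
            held_back += 1  # upstream is not scheduled this tick
        else:
            for _ in range(produce_per_tick):
                conn.queue.append(produced)
                produced += 1
        max_depth = max(max_depth, len(conn.queue))
        # The slower downstream processor drains what it can.
        for _ in range(min(consume_per_tick, len(conn.queue))):
            conn.queue.popleft()
    return produced, held_back, max_depth


produced, held_back, max_depth = simulate(
    ticks=100, produce_per_tick=5, consume_per_tick=2, threshold=10
)
print(produced, held_back, max_depth)
```

In this model the threshold acts as a soft limit: a batch that starts just below the threshold may overshoot it, but the queue stays bounded instead of growing with every tick.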

D) min.insync.replicas
Answer: C
Explanation: retention.ms defines the time-based retention period for records in a topic.

Question 18. MiNiFi is best suited for which scenario?
A) High-throughput batch ingestion from a data lake
B) Edge devices with limited resources that need to forward sensor data
C) Centralized orchestration of Spark jobs
D) Managing Hive metastore schema changes
Answer: B
Explanation: MiNiFi is a lightweight NiFi agent designed for constrained environments such as IoT gateways.

Question 19. In Spark, which component holds the compiled logical plan before execution?
A) Driver
B) Catalyst Optimizer
C) Executor
D) Tungsten Engine
Answer: B
Explanation: The Catalyst optimizer transforms the logical plan into an optimized physical plan.

Question 20. What is the default storage level when you call persist() on a DataFrame without arguments?
A) MEMORY_ONLY
B) DISK_ONLY
C) MEMORY_AND_DISK
Answer: C
Explanation: persist() without parameters uses MEMORY_AND_DISK, storing data in memory and spilling to disk if needed.

Question 21. Which Spark join strategy is automatically chosen when one side of the join is smaller than the broadcast threshold?
A) Shuffle hash join
B) Sort-merge join
C) Broadcast join
D) Cartesian join
Answer: C
Explanation: Spark will broadcast the smaller dataset to all executors, avoiding a costly shuffle.

Question 22. When reading a Parquet file with Spark, schema inference is performed by:
A) The Hive Metastore
B) The Parquet file footer
C) The Spark driver only
D) The underlying Hadoop InputFormat
Answer: B
Explanation: Parquet stores its schema in the file footer, allowing Spark to read it without external metadata.

Question 23. Which Spark configuration controls the number of cores allocated per executor?

Explanation: DataFrames (and Datasets) provide a higher-level, declarative API with Catalyst optimization, making them ideal for ETL pipelines.

Question 26. In CDP, which service provides a fast, column-oriented store that supports both random reads and analytical scans?
A) HBase
B) Kudu
C) Hive
D) Impala
Answer: B
Explanation: Kudu combines the low-latency random access of a row store with columnar storage for analytics.

Question 27. Which HBase operation retrieves a specific column value from a row?
A) Get
B) Scan
C) Put
D) Delete
Answer: A
Explanation: The Get operation fetches data for a given row key, optionally specifying column families or qualifiers.

Question 28. When using Airflow’s SparkSubmitOperator, which argument specifies the main application file?
A) application
B) java_class
C) py_files
D) jars
Answer: A
Explanation: The application argument points to the .jar or .py file that contains the Spark job to be submitted.

Question 29. In an Airflow DAG, which keyword defines a task that must run after two upstream tasks have successfully completed?
A) trigger_rule='all_success'
B) upstream_tasks
C) depends_on_past=True
D) wait_for_downstream=True
Answer: A
Explanation: The default trigger rule all_success ensures the task executes only when all its direct upstream tasks finish successfully.

Question 30. Which Airflow component stores DAG runs, task instances, and metadata?
A) Scheduler
B) Webserver
C) Worker
D) Metadata Database
Answer: D
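
The trigger rule from Question 29 can be sketched as a simple predicate over upstream task states. This is a simplified model rather than Airflow's scheduler logic: `should_run` is a hypothetical helper, it covers only three of Airflow's trigger rules, and it ignores details such as how Airflow propagates `upstream_failed` or `skipped` states to the downstream task.

```python
FINISHED_STATES = {"success", "failed", "skipped", "upstream_failed"}


def should_run(upstream_states, trigger_rule="all_success"):
    """Decide whether a task may run, given its direct upstream task states."""
    if trigger_rule == "all_success":
        # Every direct upstream task must have succeeded.
        return all(state == "success" for state in upstream_states)
    if trigger_rule == "all_done":
        # Every direct upstream task must have finished, in any final state.
        return all(state in FINISHED_STATES for state in upstream_states)
    if trigger_rule == "one_success":
        # At least one direct upstream task has succeeded.
        return any(state == "success" for state in upstream_states)
    raise ValueError(f"trigger rule not covered by this sketch: {trigger_rule}")


print(should_run(["success", "success"]))             # both upstream tasks succeeded
print(should_run(["success", "failed"]))              # one upstream task failed
print(should_run(["success", "failed"], "all_done"))  # both finished, state ignored
```

Under the default rule, a task with two upstream tasks runs only when both are in the `success` state, which is the behavior the question describes.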

D) HDFS Overview
Answer: C
Explanation: The Spark UI (accessible via Cloudera Manager) displays stages, tasks, and execution times, helping pinpoint bottlenecks.

Question 34. If a Spark job fails with “ExecutorLostFailure”, the most likely cause is:
A) Syntax error in the driver code
B) Out-of-memory error on the executor
C) Missing JAR on the classpath
D) Incorrect Hive metastore version
Answer: B
Explanation: ExecutorLostFailure typically indicates that an executor process died, often due to OOM or node failure.

Question 35. Which of the following is NOT a valid Spark execution mode in CDP?
A) Standalone
B) YARN
C) Kubernetes
D) Mesos
Answer: D
Explanation: While Spark can run on Mesos, CDP does not ship Mesos as a supported resource manager; only Standalone, YARN, and Kubernetes are offered.
Question 36. What does the “partitionBy” method do when writing a DataFrame to Parquet?
A) Creates a single file per partition column value
B) Compresses the data using partition-level compression
C) Stores the partition column as a separate metadata file
D) Writes data into sub-directories named after partition column values
Answer: D
Explanation: partitionBy creates directory hierarchies where each subdirectory corresponds to a distinct value of the partition column.

Question 37. In Hive, which file format provides built-in support for ACID transactions?
A) TextFile
B) ORC
C) Parquet
D) Avro
Answer: B
Explanation: Hive’s full ACID tables, which support insert, update, and delete, store their data in ORC.

Question 38. When configuring a NiFi processor to connect to a secured Kafka cluster, which property must be set?
A) SSL Context Service
B) Kafka Topic Name
C) Batch Size
D) FlowFile Size
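
The directory layout described in Question 36 can be modeled without Spark. The sketch below is a plain-Python approximation: `partition_layout` and the sample rows are hypothetical, and it only mimics the `column=value` sub-directory naming used for partitioned output, with the partition columns left out of the per-directory records.

```python
from collections import defaultdict


def partition_layout(rows, partition_cols, base="out"):
    """Group rows by partition values into 'col=value' directory paths."""
    layout = defaultdict(list)
    for row in rows:
        subdir = "/".join(f"{col}={row[col]}" for col in partition_cols)
        # Partition columns are encoded in the path, not in the stored records.
        record = {k: v for k, v in row.items() if k not in partition_cols}
        layout[f"{base}/{subdir}"].append(record)
    return dict(layout)


rows = [
    {"year": 2023, "country": "US", "amount": 10},
    {"year": 2023, "country": "DE", "amount": 7},
    {"year": 2024, "country": "US", "amount": 3},
]

layout = partition_layout(rows, ["year", "country"], base="out/sales")
for path in sorted(layout):
    print(path, layout[path])
```

Each distinct combination of partition values gets its own nested directory, which is what lets a query engine skip directories that a filter on `year` or `country` rules out.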

A) spark.broadcast.blockSize
B) spark.sql.autoBroadcastJoinThreshold
C) spark.broadcast.compress
D) spark.broadcast.timeout
Answer: B
Explanation: spark.sql.autoBroadcastJoinThreshold defines the size (in bytes) under which Spark will automatically broadcast a table for a join.

Question 42. What is the effect of setting spark.sql.shuffle.partitions to a lower value than the default?
A) Increases parallelism of shuffle operations
B) Reduces the number of output files but may cause data skew
C) Enables in-memory shuffle only
D) Disables shuffle entirely
Answer: B
Explanation: Fewer shuffle partitions produce fewer output files, but if the data is large, it may lead to uneven partition sizes and performance degradation.

Question 43. When using the NiFi PutHDFS processor, which property controls the write mode?
A) Conflict Resolution Strategy
B) Batch Size
C) Compression Codec
D) File Owner
Answer: A
Explanation: Conflict Resolution Strategy determines whether to replace, ignore, fail, or append when the target file already exists.

Question 44. Which of the following is true about a Hive “external” table stored on an S3 bucket?
A) Hive automatically deletes the data when the table is dropped.
B) The table’s metadata is stored in the Hive Metastore, but the data remains in S3.
C) External tables cannot be partitioned.
D) S3 does not support Hive external tables.
Answer: B
Explanation: External tables reference data outside the Hive warehouse; dropping the table only removes metadata, leaving S3 objects untouched.

Question 45. In Kafka Streams, what does the KTable abstraction represent?
A) An unbounded stream of events
B) A compacted changelog representing the latest value for each key
C) A static lookup table loaded from HDFS
D) A windowed aggregation
Answer: B
Explanation: KTable is a view of a changelog topic where only the most recent value per key is retained, providing a table-like abstraction.

Question 46. Which of the following best describes the role of the spark.sql.warehouse.dir property?

Explanation: HiveOperator executes Hive queries or scripts, and can reference scripts located on HDFS.

Question 49. When configuring Ranger policies for HDFS, which resource type is required?
A) hive_db
B) hdfs_path
C) kafka_topic
D) kudu_table
Answer: B
Explanation: Ranger’s HDFS service uses the hdfs_path resource to define permissions on directories and files.

Question 50. What does the spark.sql.sources.partitionOverwriteMode property control?
A) Whether partitioned writes replace existing partitions or append to them
B) The number of partitions created during a write
C) The compression codec for partition files
D) The default file format for partitioned tables
Answer: A
Explanation: Setting this property to dynamic allows Spark to overwrite only the partitions that are being written, rather than the entire table.

Question 51. In CDP, which analytic experience is specifically designed for building and serving machine-learning models?
A) Data Engineering
B) Data Warehouse
C) Operational Database
D) Machine Learning
Answer: D
Explanation: The Machine Learning analytic experience provides tools such as MLflow, Spark ML pipelines, and model serving capabilities.

Question 52. Which of the following is a valid way to secure a NiFi flow using LDAP?
A) Enable the “Single User” authentication mode
B) Configure an LDAP Identity Provider and map groups to policies
C) Use the “Anonymous” access policy
D) Disable HTTPS
Answer: B
Explanation: NiFi can integrate with LDAP to authenticate users and assign them to groups that are granted specific permissions.

Question 53. What does the spark.sql.autoBroadcastJoinThreshold default value of 10 MB imply?
A) Tables larger than 10 MB will always be broadcast.
B) Tables smaller than 10 MB may be broadcast automatically.
C) Broadcast joins are disabled unless the user sets this property.
D) The threshold applies to shuffle partitions, not joins.
Answer: B
Explanation: Spark will automatically broadcast a table if its size is estimated to be less than the threshold (10 MB by default).
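
The size check behind Question 53 can be written as a short predicate. This is a deliberately reduced model: `may_auto_broadcast` is a hypothetical helper, not a Spark function, and the real planner also takes join type, hints, and the quality of table statistics into account. The sketch assumes only the documented 10 MB default and the convention that a value of -1 disables automatic broadcasting.

```python
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024  # 10 MB, the documented default


def may_auto_broadcast(estimated_size_bytes, threshold_bytes=DEFAULT_THRESHOLD_BYTES):
    """Return True if a table of this estimated size is a broadcast candidate."""
    if threshold_bytes < 0:
        # A threshold of -1 turns automatic broadcast joins off.
        return False
    return estimated_size_bytes <= threshold_bytes


small_dim_table = 5 * 1024 * 1024    # 5 MB lookup table
large_fact_table = 50 * 1024 * 1024  # 50 MB fact table

print(may_auto_broadcast(small_dim_table))
print(may_auto_broadcast(large_fact_table))
print(may_auto_broadcast(small_dim_table, threshold_bytes=-1))
```

The word "may" in answer B matters: falling under the threshold makes a table eligible for broadcasting, but the estimate comes from statistics, so a table with stale or missing statistics can be misjudged in either direction.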