









































Focused review of data engineering, ETL pipelines, distributed processing, data quality, governance, performance tuning, and cloud data workflows.
Typology: Exams










































Question 1. Which component of the Spark execution engine is primarily responsible for converting a logical plan into an optimized physical plan?
A) Catalyst Optimizer
B) Tungsten Engine
C) DAG Scheduler
D) Task Scheduler
Answer: A
Explanation: The Catalyst Optimizer performs rule-based and cost-based optimizations on the logical query plan before generating the physical plan that Tungsten will execute.

Question 2. In a CDE virtual cluster, Spark applications are run inside containers orchestrated by which platform?
A) Docker Swarm
B) Kubernetes
C) Mesos
D) YARN
Answer: B
Explanation: CDE leverages Kubernetes to launch Spark driver and executor pods, providing isolation and resource management.

Question 3. Which Spark DataFrame API call reads a JSON file into a DataFrame while automatically inferring the schema?
A) spark.read.text()
B) spark.read.format("json").load()
C) spark.read.parquet()
D) spark.read.csv()
Answer: B
Explanation: The format("json") reader parses JSON and infers column types unless a schema is explicitly supplied.

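To make the schema inference mentioned in Question 3 more concrete, here is a plain-Python sketch (not Spark code) that scans JSON records and derives a field-to-type mapping, widening to a common type when records disagree. The function names and the widening rules are illustrative assumptions; Spark's own inference is more elaborate.

```python
import json


def _type_name(value):
    """Map a parsed JSON value to a simplified type label."""
    if value is None:
        return "null"
    # bool must be checked before int because bool is a subclass of int
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "struct"


def _widen(a, b):
    """Pick a type that can hold both a and b (assumed rules, for illustration)."""
    if a == b:
        return a
    if a == "null":
        return b
    if b == "null":
        return a
    if {a, b} == {"long", "double"}:
        return "double"
    return "string"  # fall back to string on any other conflict


def infer_schema(json_lines):
    """Infer a {field: type} mapping from an iterable of JSON object strings."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            seen = _type_name(value)
            schema[field] = seen if field not in schema else _widen(schema[field], seen)
    return schema


sample = ['{"id": 1, "name": "a"}', '{"id": 2.5, "name": null, "tags": ["x"]}']
inferred = infer_schema(sample)
```

Supplying an explicit schema skips this scan entirely, which is why the explanation above notes that inference happens only when no schema is given.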
Question 4. When working with nested columns, which Spark function extracts a field from a struct column?
A) explode()
B) getItem()
C) col()
D) withField()
Answer: B
Explanation: getItem("fieldName") or the dot notation (col("struct.field")) extracts a nested field from a struct.

Question 5. In Spark Structured Streaming, which output mode is required when performing aggregations that produce new rows for each micro-batch?
A) Append
B) Update
C) Complete
D) Incremental
Answer: C
Explanation: Complete mode writes the full result of the aggregation on every trigger. Update mode is also supported for streaming aggregations and emits only the rows that changed; append mode is allowed for aggregations only when a watermark is defined.

Question 6. Which Hive Warehouse Connector (HWC) method writes a Spark DataFrame to an ACID-managed Hive table?
A) df.write.format("hive").saveAsTable()
B) df.writeTo("hive_table").append()
C) df.write.format("iceberg").save()
D) df.writeToTable("hive_table")
Answer: B
Explanation: writeTo with HWC enables ACID semantics by leveraging Hive’s transaction manager.

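Question 5 turns on how output modes treat an aggregation's result table. The following plain-Python sketch (not Structured Streaming code) simulates a running count over micro-batches and shows what a sink would receive in complete mode versus update mode. The function name run_stream and the list-of-lists batch format are assumptions made for illustration.

```python
from collections import Counter


def run_stream(batches, mode):
    """Simulate per-trigger sink output for a running count aggregation."""
    totals = Counter()
    outputs = []
    for batch in batches:
        changed_keys = set(batch)
        totals.update(batch)  # fold the micro-batch into the running state
        if mode == "complete":
            # complete: the whole result table is emitted on every trigger
            outputs.append(dict(totals))
        elif mode == "update":
            # update: only keys whose aggregate changed in this trigger
            outputs.append({key: totals[key] for key in changed_keys})
        else:
            raise ValueError(f"unsupported mode: {mode}")
    return outputs


batches = [["a", "b"], ["a"]]
complete_out = run_stream(batches, "complete")
update_out = run_stream(batches, "update")
```

In the second trigger, complete mode re-emits the unchanged count for "b", while update mode emits only the new count for "a".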
Question 10. Which Airflow operator is specifically designed to submit a Spark job to a CDE virtual cluster?
A) BashOperator
B) PythonOperator
C) SparkSubmitOperator
D) CdeSparkSubmitOperator
Answer: D
Explanation: The CdeSparkSubmitOperator is a custom operator provided by Cloudera to interact with CDE.

Question 11. When implementing Change Data Capture (CDC) in an Airflow DAG, which pattern is most common for incremental extraction?
A) Full table reload each run
B) Timestamp-based WHERE clause
C) Random sampling of rows
D) Using a fixed row limit
Answer: B
Explanation: CDC typically uses a high-watermark timestamp column to fetch only rows changed since the last run.

Question 12. In Airflow, what mechanism allows tasks to share small pieces of data without using external storage?
A) Variables
B) Connections
C) XComs
D) Pools
Answer: C
Explanation: XComs (cross-communication) let tasks push and pull small payloads between each other.

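The high-watermark pattern from Question 11 can be sketched with the standard-library sqlite3 module. The table name source_table, its columns, and the helper extract_changed_rows are hypothetical; in an Airflow DAG the watermark would typically be persisted between runs, which this sketch leaves out.

```python
import sqlite3


def extract_changed_rows(conn, last_watermark):
    """Return rows modified after last_watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, payload, updated_at FROM source_table "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    # Advance the watermark only if something was extracted
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_table (id INTEGER, payload TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO source_table VALUES (?, ?, ?)",
    [
        (1, "a", "2024-01-01T00:00:00"),
        (2, "b", "2024-01-02T00:00:00"),
        (3, "c", "2024-01-03T00:00:00"),
    ],
)

# ISO-8601 strings in one fixed format compare correctly as text
rows, watermark = extract_changed_rows(conn, "2024-01-01T12:00:00")
```

Running the extraction again with the returned watermark yields no rows until the source table changes, which is what keeps each run incremental.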
Question 13. Which Spark join strategy is automatically chosen when one side of the join is smaller than the broadcast threshold?
A) Sort-Merge Join
B) Broadcast Hash Join
C) Shuffle Hash Join
D) Cartesian Join
Answer: B
Explanation: Spark will broadcast the smaller dataset to all executors, enabling a fast hash join.

Question 14. How can data skew be mitigated when joining on a high-cardinality key?
A) Increase shuffle partitions only
B) Use broadcast join on both sides
C) Apply salting to the skewed key
D) Disable catalyst optimizer
Answer: C
Explanation: Adding a random “salt” to the skewed key distributes records more evenly across partitions.

Question 15. Which file system issue is addressed by compaction in a data lake?
A) Too many small files causing NameNode overload
B) Corrupted parquet schema
C) Insufficient replication factor
D) Missing ACLs
Answer: A
Explanation: Compaction merges many tiny files into larger ones, reducing metadata overhead and improving read performance.

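As a rough illustration of the salting idea in Question 14, this plain-Python sketch (not Spark code) hashes keys into partitions with and without a random salt suffix and compares the resulting partition sizes. The CRC32-based partitioner, the key names, and the choice of 16 salts and 8 partitions are assumptions chosen for the demonstration.

```python
import random
import zlib
from collections import Counter


def partition_for(key, num_partitions):
    """Deterministic stand-in for a hash partitioner."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def salted_key(key, num_salts, rng):
    """Append a random salt so one hot key becomes num_salts distinct keys."""
    return f"{key}_{rng.randrange(num_salts)}"


def partition_sizes(keys, num_partitions):
    """Count how many records land in each partition."""
    return Counter(partition_for(key, num_partitions) for key in keys)


rng = random.Random(42)
# 90% of the records share a single hot key
keys = ["hot"] * 9000 + [f"k{i}" for i in range(1000)]

plain = partition_sizes(keys, 8)
salted = partition_sizes([salted_key(key, 16, rng) for key in keys], 8)

largest_plain = max(plain.values())
largest_salted = max(salted.values())
```

Without salting, every "hot" record lands in the same partition. In a real join, the other side must be replicated once per salt value so that each salted key still finds its match.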
Question 19. In Iceberg, what is the purpose of a “metadata file”?
A) Store the actual data rows
B) Track the list of data files and their manifests for each snapshot
C) Hold user-defined functions
D) Encrypt the data files
Answer: B
Explanation: The table metadata file records the schema, partition spec, and snapshots; each snapshot points to a manifest list whose manifests reference the data files, enabling fast scan planning and snapshot management.

Question 20. Which Iceberg property controls how many files are written per partition during a write operation?
A) write.target-file-size-bytes
B) iceberg.max-file-size
C) spark.sql.files.maxPartitionBytes
D) iceberg.target-file-size-bytes
Answer: A
Explanation: write.target-file-size-bytes is the Iceberg table property that sets the desired size of written data files, which in turn influences the number of output files per partition.

Question 21. Which CDE CLI command lists all virtual clusters in a CDP environment?
A) cde clusters list
B) cde virtual-clusters show
C) cde cluster list
D) cde vc describe
Answer: A
Explanation: cde clusters list returns a table of existing virtual clusters.

Question 22. To limit the memory used by a Spark executor, which configuration should be set?
A) spark.executor.cores
B) spark.executor.memoryOverhead
C) spark.driver.memory
D) spark.sql.shuffle.partitions
Answer: B
Explanation: spark.executor.memoryOverhead sets the additional non-heap memory allocated per executor for JVM and native overheads; sizing it correctly helps prevent executors from being killed for exceeding their memory limits.

Question 23. Which Airflow feature allows you to define a maximum number of concurrent runs for a DAG?
A) max_active_runs
B) concurrency
C) pool_slots
D) dagrun_timeout
Answer: A
Explanation: max_active_runs limits how many DAG runs can be active simultaneously.

Question 24. In Spark, which transformation is lazy and does not trigger execution until an action is called?
A) map()
B) filter()
C) reduceByKey()
D) all of the above
Answer: D
Explanation: All listed transformations are lazy; actions such as collect() or write trigger the computation.

Question 25. Which Spark UI tab provides details about the amount of data persisted in memory and on disk?
[...]
A) Partition pruning
B) Snapshot IDs
C) Data compaction
D) Bucketing
Answer: B
Explanation: Snapshots capture the table state; querying a specific snapshot ID provides a historical view.

Question 29. In Airflow, which component stores connection credentials securely?
A) Variables
B) Secrets Backend
C) XComs
D) DAG Parameters
Answer: B
Explanation: The Secrets Backend integrates with external vaults (e.g., HashiCorp Vault) to retrieve connection info.

Question 30. Which Spark configuration reduces the size of shuffle files by using columnar compression?
A) spark.sql.parquet.compression.codec
B) spark.shuffle.compress
C) spark.sql.inMemoryColumnarStorage.compressed
D) spark.sql.execution.arrow.enabled
Answer: B
Explanation: spark.shuffle.compress enables compression of shuffle data, decreasing network I/O.

Question 31. Which of the following best describes a “bucket” in Iceberg?
A) A physical file on disk
B) A logical grouping of rows based on hash of a column
C) A partition directory
D) An Iceberg metadata snapshot
Answer: B
Explanation: Bucketing hashes a column’s value to distribute rows evenly across a fixed number of buckets.

Question 32. In CDE, which resource is used to enforce CPU limits for a Spark executor pod?
A) executorMemory
B) executorCores
C) cpuRequest
D) cpuLimit
Answer: D
Explanation: cpuLimit defines the maximum CPU cores a pod can consume, enforced by Kubernetes.

Question 33. Which Spark function converts a DataFrame column containing JSON strings into a struct column?
A) from_json()
B) to_json()
C) json_tuple()
D) explode_json()
Answer: A
Explanation: from_json(col, schema) parses JSON strings into a struct according to the provided schema.

Question 34. In Airflow, what does setting retries=3 on a task accomplish?
A) The task will run three times in parallel
B) The task will be attempted up to three additional times after failure
[...]
C) snapshot.max-count
D) iceberg.snapshot.retention.max
Answer: D
Explanation: iceberg.snapshot.retention.max limits the number of snapshots kept before automatic expiration.

Question 38. In Spark Structured Streaming, which trigger processes data as soon as it arrives?
A) Trigger.Once()
B) Trigger.ProcessingTime("5 minutes")
C) Trigger.Continuous("1 second")
D) Trigger.AvailableNow()
Answer: C
Explanation: Trigger.Continuous runs the query continuously with the specified checkpoint interval.

Question 39. Which Airflow hook would you use to interact with a Hive Metastore for metadata queries?
A) MySqlHook
B) HiveCliHook
C) HiveMetastoreHook
D) PrestoHook
Answer: C
Explanation: HiveMetastoreHook provides a client to call Hive Metastore APIs.

Question 40. Which Spark configuration property enables whole-stage code generation for Java and Scala?
A) spark.sql.codegen.wholeStage
B) spark.sql.codegen.factoryMode
C) spark.sql.codegen.maxFields
D) spark.sql.codegen.enabled
Answer: A
Explanation: spark.sql.codegen.wholeStage toggles the whole-stage code generation optimization.

Question 41. In Iceberg, what does the term “manifest list” refer to?
A) List of all data files in the table
B) List of manifest files that together describe a snapshot
C) List of columns in the schema
D) List of partition values
Answer: B
Explanation: A manifest list aggregates multiple manifest files, each of which points to data files for a snapshot.

Question 42. Which Spark DataFrame API call removes duplicate rows based on all columns?
A) distinct()
B) dropDuplicates()
C) unique()
D) both A and B
Answer: D
Explanation: Both distinct() and dropDuplicates() without arguments achieve the same result.

Question 43. In Airflow, what is the purpose of a “pool”?
A) To store temporary files for tasks
B) To limit the number of concurrent tasks for a resource
C) To group DAGs by owner
D) To manage secret variables
Answer: B

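The snapshot, manifest list, and manifest hierarchy described in Question 41 can be pictured with a small data-structure sketch. This is plain Python rather than the Iceberg API; the class names, file names, and the helper data_files_for are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Manifest:
    """A manifest file: tracks a subset of the table's data files."""
    path: str
    data_files: List[str]


@dataclass
class Snapshot:
    """A table snapshot; its manifest list names every manifest it uses."""
    snapshot_id: int
    manifest_list: List[Manifest]


def data_files_for(snapshot):
    """Resolve a snapshot to its data files by walking the manifest list."""
    return [f for manifest in snapshot.manifest_list for f in manifest.data_files]


m1 = Manifest("m1.avro", ["a.parquet", "b.parquet"])
m2 = Manifest("m2.avro", ["c.parquet"])

snap1 = Snapshot(1, [m1])
# A later snapshot can reuse m1 unchanged and add only the new manifest m2
snap2 = Snapshot(2, [m1, m2])
```

Because each snapshot keeps its own manifest list, reading an older snapshot means resolving that snapshot's list instead of the current one.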
Question 47. In CDE, what is the effect of setting spark.sql.shuffle.partitions to a value higher than the number of executors?
A) Improves performance by increasing parallelism
B) Causes unnecessary small tasks and overhead
C) Triggers a runtime error
D) No effect; Spark ignores the setting
Answer: B
Explanation: Excessive partitions create many tiny tasks, increasing scheduling overhead without performance gains.

Question 48. Which Airflow operator would you use to execute a Python callable that returns a Pandas DataFrame?
A) PythonOperator
B) PandasOperator
C) DataFrameOperator
D) SparkSubmitOperator
Answer: A
Explanation: PythonOperator can run any Python function, including those returning Pandas DataFrames.

Question 49. Which Spark feature enables columnar in-memory representation for faster CPU utilization?
A) Tungsten
B) Catalyst
C) GraphX
D) MLlib
Answer: A
Explanation: Tungsten provides off-heap memory management and bytecode generation for columnar processing.

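The overhead argument in Question 47 can be made tangible with a deliberately simple cost model. The model below is an assumption for illustration only: tasks run in waves of one task per core, the work divides evenly, and each task pays a fixed scheduling overhead. It ignores skew, memory pressure, and other effects that can make a moderately higher partition count worthwhile in practice.

```python
import math


def estimated_stage_seconds(num_partitions, total_work_seconds, cores,
                            per_task_overhead_seconds):
    """Toy estimate of stage duration under the assumptions stated above."""
    waves = math.ceil(num_partitions / cores)  # rounds of tasks across all cores
    task_seconds = per_task_overhead_seconds + total_work_seconds / num_partitions
    return waves * task_seconds


# Hypothetical workload: 600 s of total work, 20 cores, 0.5 s overhead per task
estimates = {
    partitions: estimated_stage_seconds(partitions, 600, 20, 0.5)
    for partitions in (20, 200, 2000)
}
```

Under these assumed numbers the estimate rises as the partition count grows from 20 to 2000, because the fixed per-task overhead is paid in every additional wave while the useful work per task shrinks.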
Question 50. When using Hive Warehouse Connector, which Spark configuration must be set to enable ACID reads?
A) spark.sql.hive.convertMetastoreParquet
B) spark.sql.hive.hwc.enabled
C) spark.sql.hive.metastore.version
D) spark.sql.hive.metastore.jars
Answer: B
Explanation: spark.sql.hive.hwc.enabled activates the HWC and its ACID capabilities.

Question 51. Which of the following is NOT a valid Spark shuffle write compression codec?
A) lz4
B) snappy
C) gzip
D) bzip2
Answer: D
Explanation: Spark's built-in codecs for spark.io.compression.codec are lz4, lzf, snappy, and zstd; bzip2 is not available for shuffle compression. Note that gzip is not among the built-in codecs either.

Question 52. In Airflow, which parameter of a DAG defines the timezone used for schedule intervals?
A) default_args
B) catchup
C) timezone
D) schedule_interval
Answer: C
Explanation: The timezone setting determines the timezone in which the DAG’s schedule is evaluated. In Airflow this is normally supplied through a timezone-aware start_date (for example, one built with pendulum) rather than a separate DAG constructor argument.

Question 56. Which Airflow feature allows you to pause a DAG without deleting its schedule?
A) set_active(False)
B) pause_dag()
C) is_paused flag in UI
D) disable_schedule()
Answer: C
Explanation: The UI includes an “is_paused” toggle that stops scheduling while preserving the DAG definition.

Question 57. Which Spark configuration controls the amount of memory allocated for the driver process?
A) spark.driver.memory
B) spark.executor.memory
C) spark.memory.fraction
D) spark.sql.autoBroadcastJoinThreshold
Answer: A
Explanation: spark.driver.memory defines the JVM heap size for the driver.

Question 58. When reading from an Iceberg table via Spark, which format must be specified for optimal performance?
A) iceberg
B) parquet
C) orc
D) delta
Answer: A
Explanation: Using format("iceberg") lets Spark leverage Iceberg’s metadata for pruning and projection.

Question 59. Which Spark DataFrame API is used to flatten an array column into multiple rows?
A) explode()
B) flatten()
C) split()
D) unnest()
Answer: A
Explanation: explode(col) transforms each element of an array into a separate row.

Question 60. In Airflow, which attribute of a task determines the maximum execution time before it is marked as failed?
A) execution_timeout
B) timeout_seconds
C) sla
D) retry_delay
Answer: A
Explanation: execution_timeout is a datetime.timedelta after which the task is terminated.

Question 61. Which Spark UI tab shows the DAG of stages and tasks for a particular job?
A) Stages
B) DAG Visualization (under Stages)
C) Executors
D) SQL
Answer: B
Explanation: The DAG Visualization within the Stages tab displays stage dependencies.

Question 62. Which Iceberg command is used to rewrite data files to achieve better file sizes?
A) rewrite_data_files()