


































































This exam certifies proficiency in building data engineering pipelines and distributed computing solutions using Scala. Candidates are tested on Spark, Hadoop, streaming pipelines, data ingestion, schema management, and orchestration tools. It also covers functional programming best practices for big data applications, ensuring engineers can deliver efficient, reliable, and high-performance solutions in cloud environments.
Typology: Exams



































































Question 1. Which Scala keyword should be used for a value that must never change after its initial assignment?
A) var
B) val
C) def
D) lazy
Answer: B
Explanation: val creates an immutable reference; once assigned it cannot be reassigned, which is essential for safe distributed computation.

Question 2. In Spark, which component is responsible for converting a logical plan into an optimized physical plan?
A) Scheduler
B) Catalyst Optimizer
C) Tungsten Engine
D) Cluster Manager
Answer: B
Explanation: Catalyst analyzes and rewrites the logical query plan, applying rule-based and cost-based optimizations before execution.

Question 3. What is the result of applying flatMap to a List(List(1,2), List(3)) with the identity function?
A) List(List(1,2), List(3))
B) List(1,2,3)
C) List(List(1,2,3))
D) List(1, List(2,3))
Answer: B
Explanation: flatMap both maps each inner collection and flattens one level, producing a single list of all elements.

Question 4. Which Spark storage level stores data only in memory and does not serialize the objects?
A) MEMORY_ONLY_SER
B) MEMORY_ONLY
Answer: B
Explanation: MEMORY_ONLY keeps the RDD in deserialized form in RAM, offering the fastest access but no disk fallback.

Question 5. In the Medallion Architecture, which layer typically contains raw, uncurated data?
A) Gold
B) Silver
C) Bronze
D) Platinum
Answer: C
Explanation: Bronze tables store the ingested, raw data; Silver refines it, and Gold provides curated, business-ready views.

Question 6. Which Scala construct allows you to define behavior that can be mixed into multiple classes without inheritance?
A) Abstract class
B) Trait
C) Object
D) Companion object
Answer: B
Explanation: Traits are similar to interfaces with concrete methods and can be mixed into any class, facilitating modular design.

Question 7. What does the Option type represent in Scala?
A) A value that can be either Some or None
B) A collection of optional elements
C) A mutable container for optional values
D) An error handling wrapper similar to Try
Answer: A
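
The Some/None behavior behind Question 7 can be sketched in plain Scala. This is a minimal illustration, not part of the exam material: the object name OptionSketch and the ports lookup table are hypothetical.

```scala
object OptionSketch {
  // Hypothetical lookup table, used only for illustration.
  private val ports: Map[String, Int] = Map("http" -> 80, "https" -> 443)

  // Map#get returns Some(value) when the key exists and None otherwise.
  def portFor(scheme: String): Option[Int] = ports.get(scheme)

  // getOrElse supplies a fallback, so callers never deal with null.
  def portOrDefault(scheme: String, default: Int): Int =
    portFor(scheme).getOrElse(default)

  // Pattern matching makes both cases explicit.
  def describe(scheme: String): String = portFor(scheme) match {
    case Some(p) => s"$scheme uses port $p"
    case None    => s"$scheme has no known port"
  }
}
```

Here portFor("ftp") yields None instead of throwing or returning null, which is why Option is usually preferred for values that may be absent.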
Question 11. Which Spark configuration property controls the maximum amount of memory a single executor can use?
A) spark.driver.memory
B) spark.executor.memory
C) spark.memory.fraction
D) spark.sql.shuffle.partitions
Answer: B
Explanation: spark.executor.memory sets the heap size for each executor process.

Question 12. In Structured Streaming, which output mode is NOT supported for aggregations?
A) Append
B) Update
C) Complete
D) Incremental
Answer: D
Explanation: Spark supports Append, Update, and Complete modes; Incremental is not a defined mode.

Question 13. Which of the following best describes a broadcast join in Spark?
A) Both sides are shuffled to the same partition
B) The smaller dataset is sent to all executor nodes
C) Data is partitioned using a hash of the join key
D) It only works for inner joins
Answer: B
Explanation: A broadcast join replicates the small table to every executor, avoiding a shuffle of the large table.

Question 14. Which Scala collection guarantees O(1) random access and is immutable?
A) List
B) Set
C) Vector
D) Map
Answer: C
Explanation: Vector provides efficient indexed access using a tree of arrays, while remaining immutable.

Question 15. What does the Try type encapsulate in Scala?
A) Optional values
B) Potential exceptions during computation
C) Lazy evaluation of expressions
D) Pattern matching results
Answer: B
Explanation: Try[T] can be Success[T] or Failure[Throwable], allowing functional handling of exceptions.

Question 16. In Delta Lake, which feature enables you to query a table as it existed at a previous point in time?
A) Schema evolution
B) Time travel
C) Data compaction
D) Z-order clustering
Answer: B
Explanation: Delta Lake stores versioned transaction logs, allowing queries on historical snapshots.

Question 17. Which Spark UI tab shows the amount of data cached in memory for each RDD/DataFrame?
A) Stages
B) SQL
C) Storage
D) Executors
Answer: C
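
Question 15's description of Try as either Success or Failure can be sketched with the standard library alone. The object name TrySketch and its helper methods are hypothetical and exist only for illustration.

```scala
import scala.util.{Failure, Success, Try}

object TrySketch {
  // String#toInt throws NumberFormatException on bad input; Try captures it.
  def parseInt(raw: String): Try[Int] = Try(raw.trim.toInt)

  // map is skipped for a Failure, so only successful parses are doubled.
  def doubled(raw: String): Try[Int] = parseInt(raw).map(_ * 2)

  // Pattern matching separates the two outcomes without try/catch.
  def render(raw: String): String = doubled(raw) match {
    case Success(n)  => s"ok: $n"
    case Failure(ex) => s"failed: ${ex.getClass.getSimpleName}"
  }
}
```

The exception never escapes parseInt; it travels as a value, so callers decide where and how to handle it.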
Question 21. Which of the following is a correct way to define a window specification that orders rows by timestamp and partitions by user_id?
A) Window.partitionBy("user_id").orderBy("timestamp")
B) Window.orderBy("user_id").partitionBy("timestamp")
C) Window.partitionBy("timestamp").orderBy("user_id")
D) Window.over("user_id", "timestamp")
Answer: A
Explanation: The partitionBy method defines grouping, and orderBy defines the order within each partition.

Question 22. In Spark, what does the term “shuffle read” refer to?
A) Reading data from external storage
B) Pulling partitioned data from other executors after a shuffle
C) Loading data into memory cache
D) Persisting data to disk
Answer: B
Explanation: During a wide dependency, executors must fetch shuffled blocks from remote nodes; this is the shuffle read phase.

Question 23. Which Scala feature enables compile-time generation of boilerplate code for case classes?
A) Implicit conversions
B) Macros
C) Synthetic methods generated by the compiler
D) Type classes
Answer: C
Explanation: The Scala compiler automatically generates methods like apply, copy, equals, and hashCode for case classes.

Question 24. Which of the following is NOT a valid Spark storage level?
A) MEMORY_ONLY_2
B) DISK_ONLY_2
C) OFF_HEAP
Answer: B
Explanation: The answer key treats DISK_ONLY_2 as undefined and DISK_ONLY as the correct name. Storage levels can carry a replication factor (e.g., MEMORY_ONLY_2), and Spark's StorageLevel does define DISK_ONLY_2 as well, so treat this keyed answer with caution.

Question 25. Which of the following expressions correctly creates a Spark DataFrame from a JSON file stored in S3?
A) spark.read.json("s3a://bucket/data.json")
B) spark.read.format("csv").load("s3://bucket/data.json")
C) spark.read.parquet("s3://bucket/data.json")
D) spark.read.textFile("s3a://bucket/data.json")
Answer: A
Explanation: The json method reads JSON files; the s3a scheme enables S3 access through Hadoop’s S3A connector.

Question 26. In Scala, which of the following statements about var is true?
A) It creates a thread-safe immutable reference
B) It can be reassigned after initialization
C) It is only allowed inside a case class
D) It guarantees compile-time constant values
Answer: B
Explanation: var defines a mutable variable that can be reassigned; it offers no thread-safety guarantees.

Question 27. Which Spark configuration enables dynamic allocation of executors?
A) spark.dynamicAllocation.enabled
B) spark.executor.instances
C) spark.executor.cores
D) spark.sql.autoBroadcastJoinThreshold
Answer: A
Explanation: Setting spark.dynamicAllocation.enabled to true lets Spark request and release executors based on workload.
B) ALTER TABLE delta_table ALTER COLUMN age SET DEFAULT 0
C) ALTER TABLE delta_table ADD COLUMNS (age INT) NOT NULL DEFAULT 0
D) ALTER TABLE delta_table RENAME TO delta_table_v2
Answer: C
Explanation: Delta Lake supports ADD COLUMNS with NOT NULL and DEFAULT to enforce schema evolution while providing a default.

Question 32. Which Scala collection method returns a new collection containing only the elements that satisfy a predicate?
A) map
B) flatMap
C) filter
D) reduce
Answer: C
Explanation: filter evaluates the predicate on each element and retains those returning true.

Question 33. Which Spark setting controls the number of partitions created when reading a large Parquet file without explicit partitioning?
A) spark.sql.shuffle.partitions
B) spark.default.parallelism
C) spark.sql.parquet.mergeSchema
D) spark.sql.sources.partitionOverwriteMode
Answer: B
Explanation: Of the listed options, spark.default.parallelism is the one that influences the default number of partitions for RDDs and file reads when no other hint is provided. For DataFrame file sources, the split size is also bounded by spark.sql.files.maxPartitionBytes.

Question 34. Which of the following is a correct way to read a streaming source from Kafka using Scala?
A) spark.readStream.format("kafka").option("kafka.bootstrap.servers","host:9092").load()
B) spark.read.format("kafka").option("topic","myTopic").load()
C) spark.readStream.kafka("host:9092","myTopic")
D) spark.read.kafka("host:9092").option("subscribe","myTopic")
Answer: A
Explanation: readStream with format "kafka" and the kafka.bootstrap.servers option is the standard API.

Question 35. Which of the following statements about Spark’s Tungsten execution engine is FALSE?
A) It uses off-heap memory management for better CPU cache utilization
B) It automatically generates Java bytecode for SQL expressions
C) It replaces the need for the Catalyst optimizer
D) It provides binary row format for efficient serialization
Answer: C
Explanation: Tungsten works with Catalyst; Catalyst still performs logical and physical planning. Tungsten focuses on low-level execution.

Question 36. In ScalaTest, which trait provides the should syntax for assertions?
A) FunSuite
B) FlatSpec
C) Matchers
D) WordSpec
Answer: C
Explanation: The Matchers trait adds the should DSL for expressive assertions.

Question 37. Which of the following best describes the purpose of a “companion object” in Scala?
A) To hold mutable state shared across all instances
B) To define static members and factory methods for a class
C) To enforce inheritance hierarchies
D) To provide runtime reflection capabilities
Answer: B
Explanation: A companion object shares the same name and source file as a class, allowing definition of static-like members and apply factories.
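
The static-like members and apply factory mentioned in Question 37 can be sketched as follows. The Celsius class, its AbsoluteZero constant, and the enclosing CompanionSketch object are hypothetical names chosen for illustration; the outer object is there only so the class and its companion are defined together.

```scala
object CompanionSketch {
  // The constructor is private, so instances are created only via the companion.
  class Celsius private (val degrees: Double) {
    def toFahrenheit: Double = degrees * 9.0 / 5.0 + 32.0
  }

  object Celsius {
    // Static-like constant shared by all callers.
    val AbsoluteZero: Double = -273.15

    // apply acts as a factory: Celsius(20.0) expands to Celsius.apply(20.0).
    def apply(degrees: Double): Celsius = {
      require(degrees >= AbsoluteZero, s"below absolute zero: $degrees")
      new Celsius(degrees)
    }
  }
}
```

Because the companion can call the private constructor, validation lives in one place and callers write Celsius(20.0) without new.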
B) Stages
C) Executors
D) SQL
Answer: B
Explanation: The Stages tab breaks down the execution into stages, showing task counts and timing.

Question 42. In a Spark application, which of the following is the most effective way to reduce data skew for a key with a very high frequency?
A) Increase the number of partitions globally
B) Use a broadcast join on the skewed key
C) Apply a salting technique to the key before the join
D) Cache the skewed dataset
Answer: C
Explanation: Salting appends a random suffix to the skewed key, spreading its rows across multiple reducers.

Question 43. Which Scala feature allows you to write list.map(_ + 1) instead of specifying a named parameter?
A) Placeholder syntax
B) Implicit parameters
C) Currying
D) Partial functions
Answer: A
Explanation: The underscore (_) acts as a placeholder for a single argument in an anonymous function.

Question 44. In Spark Structured Streaming, which trigger processes data as soon as it arrives?
A) Trigger.ProcessingTime("10 seconds")
B) Trigger.Once()
C) Trigger.Continuous("1 second")
D) Trigger.AvailableNow()
Answer: C
Explanation: Continuous trigger enables low-latency processing, aiming for sub-second end-to-end latency.

Question 45. Which of the following is true about Spark’s mapPartitions transformation?
A) It receives a single element at a time
B) It can be used to reuse expensive connections across rows in a partition
C) It always results in a shuffle
D) It guarantees order preservation across partitions
Answer: B
Explanation: mapPartitions provides an iterator over all rows in a partition, allowing initialization of resources once per partition.

Question 46. Which of the following statements about lazy val in Scala is correct?
A) It is evaluated at object construction time
B) It is thread-safe and evaluated the first time it is accessed
C) It can be reassigned later in the program
D) It bypasses the compiler’s type inference
Answer: B
Explanation: lazy val defers initialization until first use and ensures safe publication across threads.

Question 47. In Delta Lake, which transaction log file records the schema of each version?
A) _delta_log/metadata.json
B) _delta_log/commit.json
C) _delta_log/checkpoints/…
D) _delta_log/schema.parquet
Answer: A
Explanation: metadata.json within the _delta_log directory holds schema and configuration information for each version.
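
The deferred, once-only initialization described in Question 46 can be observed with a small counter. This is a single-threaded sketch; the Config class, its initCount field, and the "mode" setting are hypothetical and used only to make the evaluation order visible.

```scala
object LazyValSketch {
  class Config {
    // Counts how many times the initializer body runs.
    var initCount: Int = 0

    // The body runs on first access only; later reads reuse the cached value.
    lazy val settings: Map[String, String] = {
      initCount += 1
      Map("mode" -> "batch") // hypothetical setting, for illustration only
    }
  }
}
```

Constructing a Config leaves initCount at 0; it becomes 1 on the first read of settings and stays there on every later read.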
C) object
D) static
Answer: C
Explanation: An object defines a singleton; methods inside are accessed statically (e.g., MyObject.myMethod).

Question 52. Which Spark configuration determines the amount of memory reserved for storing broadcast variables?
A) spark.broadcast.blockSize
B) spark.memory.fraction
C) spark.sql.autoBroadcastJoinThreshold
D) spark.broadcast.compress
Answer: A
Explanation: spark.broadcast.blockSize sets the chunk size for broadcast data; larger blocks can reduce overhead.

Question 53. Which of the following is the most appropriate way to handle schema evolution when appending new columns to a Delta table?
A) Drop and recreate the table
B) Use ALTER TABLE ... ADD COLUMNS with MERGE option
C) Overwrite the entire table with new schema
D) Disable schema enforcement
Answer: B
Explanation: Delta Lake supports ALTER TABLE ... ADD COLUMNS and can automatically merge new columns during writes.

Question 54. In Spark, which of the following actions forces the execution of a lazy transformation pipeline?
A) map
B) filter
C) foreach
D) select
Answer: C
Explanation: foreach is an action that triggers the computation; map, filter, and select are transformations.

Question 55. Which of the following is a correct way to define a custom accumulator in Spark using Scala?
A) Extend AccumulatorV2[Long, Long] and implement required methods
B) Use spark.sparkContext.accumulator(0L) directly for custom types
C) Define a var in the driver and update it inside tasks
D) Use Broadcast variables instead of accumulators
Answer: A
Explanation: Custom accumulators must subclass AccumulatorV2 and implement isZero, copy, reset, add, merge, and value.

Question 56. Which of the following statements about Spark’s coalesce transformation is TRUE?
A) It always triggers a shuffle
B) It can increase the number of partitions
C) It reduces partitions without a full shuffle when decreasing the count
D) It is equivalent to repartition with the same number of partitions
Answer: C
Explanation: coalesce can collapse partitions without a shuffle if the target number is less than or equal to the current number.

Question 57. Which of the following is the correct way to enable Hive support in a SparkSession?
A) SparkSession.builder().enableHiveSupport().getOrCreate()
B) SparkSession.builder().config("spark.sql.catalogImplementation","hive").getOrCreate()
C) Both A and B
D) Hive support is enabled by default; no configuration needed
Answer: C
Explanation: Both the enableHiveSupport() method and setting spark.sql.catalogImplementation to hive achieve the same result.
B) Hash partitioning on the entire row
C) Range partitioning on the timestamp column
D) No partitioning, let Spark decide
Answer: C
Explanation: Range partitioning on the timestamp ensures that date-range queries scan only relevant partitions.

Question 62. In Scala, which of the following statements about implicit parameters is correct?
A) They must be defined in the same file as the calling method
B) They are resolved at compile time based on the nearest implicit value in scope
C) Implicit parameters can only be of type String
D) They cannot be used with generic methods
Answer: B
Explanation: The compiler searches the implicit scope for a matching value and injects it automatically.

Question 63. Which Spark configuration controls the maximum size of a single shuffle file before it is split?
A) spark.sql.shuffle.partitions
B) spark.shuffle.file.buffer
C) spark.reducer.maxSizeInFlight
D) spark.shuffle.io.maxSize
Answer: D
Explanation: spark.shuffle.io.maxSize (or spark.shuffle.file.buffer in older versions) limits the size of shuffle blocks written to disk.

Question 64. Which of the following is the correct syntax to define a generic case class Pair[A,B] in Scala?
A) case class Pair[A, B](first: A, second: B)
B) case class Pair(first: A, second: B) where A, B are types
C) class Pair[A, B](val first: A, val second: B)
D) case class Pair(first: Any, second: Any)
Answer: A
Explanation: Generic type parameters are declared in brackets after the class name.

Question 65. Which of the following Spark listeners can be used to capture job start and end events for custom metrics?
A) SparkListenerSQLExecutionStart
B) SparkListenerStageSubmitted
C) SparkListenerJobStart / SparkListenerJobEnd
D) SparkListenerTaskEnd
Answer: C
Explanation: SparkListenerJobStart and SparkListenerJobEnd provide hooks for job-level lifecycle events.

Question 66. In a Spark Structured Streaming query, which option enables exactly-once semantics when writing to a file sink?
A) outputMode("append")
B) outputMode("complete")
C) checkpointLocation("/path")
D) trigger(Trigger.Once())
Answer: C
Explanation: Providing a checkpoint location allows Spark to maintain state and achieve exactly-once guarantees.

Question 67. Which of the following statements about the fold operation on an RDD is FALSE?
A) It requires an initial zero value
B) The zero value must be of the same type as the RDD elements
C) It can be used with non-commutative operations safely
D) It performs a tree-aggregate to combine results
Answer: C
Explanation: fold assumes the operation is both associative and commutative; non-commutative functions may yield incorrect results.
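
Question 67's point about non-commutative operations can be illustrated without a cluster. The sketch below imitates a partition-wise fold on local collections; it is not Spark's implementation. It assumes each partition is folded separately and that per-partition results may be combined in any order. FoldSketch and foldPartitions are hypothetical names.

```scala
object FoldSketch {
  // Imitates a partition-wise fold on local data: fold each partition with the
  // zero value, then fold the per-partition results, again from the zero value.
  def foldPartitions[A](partitions: Seq[Seq[A]], zero: A)(op: (A, A) => A): A =
    partitions.map(_.foldLeft(zero)(op)).foldLeft(zero)(op)

  val parts: Seq[Seq[String]] = Seq(Seq("a", "b"), Seq("c", "d"))

  // Addition is commutative, so the order of partition results does not matter.
  val sumInOrder: Int  = foldPartitions(Seq(Seq(1, 2), Seq(3, 4)), 0)(_ + _)
  val sumReversed: Int = foldPartitions(Seq(Seq(3, 4), Seq(1, 2)), 0)(_ + _)

  // String concatenation is associative but not commutative, so the result
  // changes when the partition results are combined in a different order.
  val concatInOrder: String  = foldPartitions(parts, "")(_ + _)
  val concatReversed: String = foldPartitions(parts.reverse, "")(_ + _)
}
```

Both sums agree, while the two concatenations differ ("abcd" versus "cdab"), which is the kind of order dependence the explanation warns about.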