A technical preparation resource focusing on Scala-based cloud data engineering, covering distributed processing, big data frameworks, and advanced coding practices alongside exam simulations and practical exercises.
Typology: Exams

Question 1. In Scala, which keyword declares an immutable reference that cannot be reassigned?
A) var
B) val
C) def
D) lazy
Answer: B
Explanation: val creates a read-only reference that cannot be reassigned after initialization, unlike var; the object it refers to may itself still be mutable.

Question 2. What does the lazy keyword affect in Scala?
A) Variable mutability
B) Eager evaluation of the expression
C) Evaluation of the expression only when first accessed
D) Thread safety of the variable
Answer: C
Explanation: A lazy val is evaluated the first time it is accessed, enabling deferred computation and potentially reducing startup cost.

Question 3. Which of the following statements about Scala’s type inference is true?
A) The compiler always requires explicit type annotations.
B) Type inference works only for local variables, not for method parameters.
C) The compiler can infer the type of a val from its right-hand side expression.
D) Type inference is disabled when using the REPL.
Answer: C
Explanation: Scala’s compiler infers the type of a val based on the expression assigned, reducing boilerplate.

Question 4. Which of the following is a pure function in functional programming?
A) A function that modifies a global variable.
B) A function that reads from a database.
C) A function that returns the same output for the same input and has no side effects.
D) A function that prints to the console.
Answer: C
Explanation: Pure functions are deterministic and free of side effects, making them easier to reason about and test.

Question 5. How can a higher-order function be identified in Scala?
A) It returns a Future.
B) It takes another function as a parameter or returns a function.
C) It uses pattern matching.
D) It is defined inside an object.
Answer: B
Explanation: Higher-order functions either accept functions as arguments or produce functions as results.

Question 6. What does “currying” a function mean in Scala?
A) Converting a method into a class.
B) Splitting a function that takes multiple arguments into a series of functions each taking a single argument.
C) Making a function tail-recursive.
B) flatMap
C) filter
D) All of the above
Answer: D
Explanation: map, flatMap, and filter all process elements sequentially and retain the original order in the resulting collection.

Question 10. Between List, Vector, and Array, which provides the fastest random-access reads?
A) List
B) Vector
C) Array
D) All have the same performance
Answer: C
Explanation: Array stores elements in a contiguous memory block, allowing O(1) index access. List is a linked list, so indexing is O(n), and Vector uses a shallow tree structure whose lookups are effectively constant time but slower than a direct array read.

Question 11. Which Spark component is responsible for converting user code into a physical execution plan?
A) Cluster Manager
B) Driver
C) Catalyst Optimizer
D) Executor
Answer: C
Explanation: The Catalyst Optimizer parses, analyzes, and optimizes logical plans into efficient physical execution plans.

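To make the access-cost contrast in Question 10 concrete, here is a minimal sketch in plain Scala (no Spark required). The object name `RandomAccessSketch`, the helpers `timeNanos` and `sumByIndex`, and the collection sizes are illustrative assumptions, not part of the exam material. The timings it prints depend on the JVM and hardware, so treat them only as a rough indication.

```scala
object RandomAccessSketch {
  // Hypothetical helper: evaluates `body` and returns (result, elapsed nanoseconds).
  def timeNanos[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, System.nanoTime() - start)
  }

  // Sums `reads` elements fetched by index, jumping around positions 0 until n.
  def sumByIndex(lookup: Int => Int, n: Int, reads: Int): Long = {
    var total = 0L
    var i = 0
    while (i < reads) {
      total += lookup((i * 7919) % n)
      i += 1
    }
    total
  }

  def main(args: Array[String]): Unit = {
    val n = 20000
    val reads = 2000
    val asList = List.range(0, n)     // linked list: indexing walks the nodes
    val asVector = Vector.range(0, n) // shallow tree: effectively constant lookups
    val asArray = Array.range(0, n)   // contiguous block: direct O(1) lookups

    val (listSum, listNs) = timeNanos(sumByIndex(i => asList(i), n, reads))
    val (vecSum, vecNs) = timeNanos(sumByIndex(i => asVector(i), n, reads))
    val (arrSum, arrNs) = timeNanos(sumByIndex(i => asArray(i), n, reads))

    // All three collections hold the same values, so the sums must agree.
    assert(listSum == vecSum && vecSum == arrSum)
    println(s"List:   $listNs ns")
    println(s"Vector: $vecNs ns")
    println(s"Array:  $arrNs ns")
  }
}
```

A single unwarmed run like this is not a benchmark; a harness such as JMH would be the usual choice for real measurements.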
Question 12. In Spark, what is the role of the Driver program?
A) Store data on disk.
B) Execute tasks on worker nodes.
C) Coordinate job execution, schedule tasks, and maintain SparkContext.
D) Manage the cluster’s resources.
Answer: C
Explanation: The Driver hosts the SparkContext, builds the DAG, and distributes tasks to executors.

Question 13. Which of the following statements about Spark’s lazy evaluation is correct?
A) Transformations are executed immediately when called.
B) Actions trigger the execution of the lineage graph.
C) Lazy evaluation increases memory usage.
D) Lazy evaluation is only applicable to DataFrames, not RDDs.
Answer: B
Explanation: Transformations build a logical plan; the actual computation occurs only when an action (e.g., collect, count) is invoked.

Question 14. What does a shuffle operation in Spark typically cause?
A) Increased CPU usage only.
B) Network I/O and possible data skew.
C) No impact on performance.
D) Automatic data compression.
Answer: B
Explanation: Shuffles redistribute data across partitions, which incurs network traffic and can lead to skew if partition sizes become uneven.

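The behaviour described in Question 13 can be observed directly. The sketch below assumes Spark (the spark-sql artifact) is on the classpath and runs in local mode; the object name `LazyEvaluationSketch`, the accumulator name, and the sample data are illustrative assumptions. Accumulator updates made inside a transformation can be applied more than once if a task is retried, so the counts are only expected to be exact in a failure-free local run.

```scala
import org.apache.spark.sql.SparkSession

object LazyEvaluationSketch {
  // Returns (map calls seen before the action, map calls seen after the action).
  def run(spark: SparkSession): (Long, Long) = {
    val sc = spark.sparkContext
    val calls = sc.longAccumulator("mapCalls")

    // Transformation: only recorded in the lineage; nothing executes yet.
    val doubled = sc.parallelize(1 to 10).map { x =>
      calls.add(1)
      x * 2
    }
    val before = calls.sum

    // Action: triggers execution of the lineage built above.
    doubled.count()
    val after = calls.sum

    (before, after)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lazy-evaluation-sketch")
      .master("local[2]")
      .getOrCreate()
    try {
      val (before, after) = run(spark)
      println(s"map calls before action: $before, after action: $after")
    } finally {
      spark.stop()
    }
  }
}
```

In a failure-free local run the first number should be 0 and the second 10, since the map function only runs once `count()` forces the job.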
Answer: B
Explanation: UDFs are treated as black boxes, preventing Catalyst from optimizing them; they can become performance bottlenecks.

Question 18. What is the purpose of broadcasting a variable in Spark?
A) To replicate the variable on each executor for efficient read-only access.
B) To share mutable state across tasks.
C) To store intermediate shuffle data.
D) To increase the size of the driver’s heap.
Answer: A
Explanation: Broadcast variables are sent once to each executor, reducing network traffic when the same read-only data is needed by many tasks.

Question 19. Which technique helps mitigate data skew during a join operation?
A) Increasing the number of partitions arbitrarily.
B) Using a salted key (adding a random prefix) to distribute skewed keys.
C) Disabling shuffle.
D) Converting the join to a Cartesian product.
Answer: B
Explanation: Salting adds a random component to skewed keys, spreading them across partitions and reducing hotspot executors.

Question 20. In Spark, what is the difference between storage memory and execution memory?
A) Storage memory is used for caching, execution memory for shuffles, joins, and aggregations.
B) Execution memory stores RDDs, storage memory holds DataFrames.
C) Both are the same; Spark does not differentiate them.
D) Storage memory is only for driver, execution memory for executors.
Answer: A
Explanation: Within its unified memory pool, Spark distinguishes storage memory (for persisted data) from execution memory (for intermediate computation); the boundary is soft, so each region can borrow unused space from the other.

Question 21. Which file format provides columnar storage and is optimized for query performance in Spark?
A) CSV
B) JSON
C) Parquet
D) Text
Answer: C
Explanation: Parquet stores data column-wise, enabling predicate pushdown and efficient compression, which benefits Spark queries.

Question 22. When reading data from Amazon S3 using Spark, which configuration improves read throughput?
A) spark.hadoop.fs.s3a.connection.maximum = 1
B) spark.hadoop.fs.s3a.fast.upload = true
C) spark.hadoop.fs.s3a.impl = org.apache.hadoop.fs.s3a.S3AFileSystem
D) spark.hadoop.fs.s3a.buffer.dir = /tmp
Answer: C
Explanation: Setting the S3A implementation ensures Spark uses the optimized S3A connector for high-throughput reads.

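As a small illustration of the Parquet behaviour in Question 21, the sketch below writes a tiny table to Parquet and reads back one column with a filter. It assumes Spark (spark-sql) is on the classpath and runs in local mode; the object name `ParquetPushdownSketch`, the column names, the sample rows, and the temporary directory are all hypothetical. The physical plan printed by `explain()` typically lists the pruned read schema and any pushed filters, though the exact text varies between Spark versions.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object ParquetPushdownSketch {
  // Writes sample rows as Parquet, then counts the rows that pass a filter.
  def run(spark: SparkSession): Long = {
    import spark.implicits._

    // Hypothetical scratch location; a real job would point at HDFS, S3, etc.
    val path = Files.createTempDirectory("parquet-sketch").resolve("events").toString

    // Write a small three-column table in columnar Parquet format.
    Seq((1, "click", 0.5), (2, "view", 1.5), (3, "click", 2.5))
      .toDF("id", "kind", "score")
      .write.mode("overwrite").parquet(path)

    // Read back only what the query needs: one column, filtered on another.
    val clicks = spark.read.parquet(path)
      .filter($"kind" === "click")
      .select("id")

    // Print the physical plan so the scan details can be inspected.
    clicks.explain()
    clicks.count()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-pushdown-sketch")
      .master("local[2]")
      .getOrCreate()
    try {
      println(s"rows with kind = click: ${run(spark)}")
    } finally {
      spark.stop()
    }
  }
}
```

With a row-oriented format such as CSV, the same query would still have to read and parse every column of every row.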
Explanation: Time travel stores versioned metadata, enabling queries against historical snapshots of the data.

Question 26. Which Spark Structured Streaming trigger processes data as soon as it arrives?
A) Trigger.Once()
B) Trigger.ProcessingTime("10 seconds")
C) Trigger.Continuous("1 second")
D) Trigger.AvailableNow()
Answer: C
Explanation: Trigger.Continuous enables low-latency continuous processing, handling records as they arrive rather than in micro-batches; the interval argument sets how often progress is checkpointed.

Question 27. In Structured Streaming, what is the purpose of watermarking?
A) To encrypt data in transit.
B) To limit the amount of state kept for late-arriving events.
C) To compress output files.
D) To increase parallelism of the job.
Answer: B
Explanation: Watermarks define a threshold for how late data can be considered; events older than the watermark are dropped from stateful operations.

Question 28. Which window type groups events that have no fixed start or end but are separated by inactivity periods?
A) Tumbling window
B) Sliding window
C) Session window
D) Fixed window
Answer: C
Explanation: Session windows close when there is a gap (inactivity) longer than a defined timeout, capturing bursts of activity.

Question 29. When integrating Spark with Apache Kafka, which option ensures exactly-once processing semantics?
A) enable.auto.commit = true
B) spark.sql.streaming.checkpointLocation set and using write-ahead logs
C) Using Kafka’s at-least-once delivery mode only
D) Disabling offsets commit in the consumer
Answer: B
Explanation: Providing a checkpoint location lets Spark record the offsets of each batch and resume from them after a failure; combined with a replayable source such as Kafka and an idempotent or transactional sink, this yields exactly-once results.

Question 30. Which Scala testing framework is commonly used for unit testing Spark jobs?
A) JUnit
B) TestNG
C) ScalaTest
D) Cucumber
Answer: C
Explanation: ScalaTest integrates well with Scala’s language features and provides matchers for collections, making it suitable for Spark unit tests.

Question 31. In SBT, which command compiles the project and creates a JAR file?

Question 34. Which AWS service can be used to store secrets such as database passwords for Spark jobs?
A) S
B) IAM
C) Secrets Manager
D) CloudWatch
Answer: C
Explanation: AWS Secrets Manager securely stores, rotates, and retrieves credentials, which can be accessed by Spark applications at runtime.

Question 35. What does the term “data lineage” refer to in a data engineering context?
A) The physical location of data files.
B) The transformation history and provenance of data from source to destination.
C) The backup schedule of a dataset.
D) The schema definition of a table.
Answer: B
Explanation: Data lineage tracks how data moves and transforms across pipelines, essential for auditability and compliance.

Question 36. Which of the following is a best practice for handling PII data in a cloud lakehouse?
A) Store it in plain text for faster queries.
B) Encrypt at rest using a customer-managed KMS key.
C) Disable all access controls.
D) Replicate it across all regions without encryption.
Answer: B
Explanation: Encrypting PII at rest with a customer-managed key ensures confidentiality while still allowing controlled access.

Question 37. In Spark, which operation can be used to remove duplicate rows based on a subset of columns?
A) distinct()
B) dropDuplicates()
C) dropDuplicates("col1","col2")
D) removeDuplicates()
Answer: C
Explanation: dropDuplicates("col1","col2") considers only the specified columns when detecting duplicates and keeps one full row for each distinct combination of their values.

Question 38. Which of the following Spark configurations controls the number of shuffle partitions?
A) spark.sql.shuffle.partitions
B) spark.default.parallelism
C) spark.executor.cores
D) spark.memory.fraction
Answer: A
Explanation: spark.sql.shuffle.partitions sets the default number of partitions used during shuffle operations like joins and aggregations.

Question 39. What is the primary advantage of using Delta Lake’s “Z-ordering” on a table?
A) Improves write throughput.

Question 42. When using Apache Iceberg, which feature enables schema evolution without rewriting existing data files?
A) Partition pruning
B) Snapshot isolation
C) Manifest files with column definitions
D) Table compaction
Answer: C
Explanation: Iceberg tracks columns by unique field IDs in its metadata layer rather than by name or position in the data files, allowing new columns to be added or existing ones to be renamed without touching data files.

Question 43. Which of the following is a common cause of “Task Not Serializable” exceptions in Spark?
A) Using a mutable variable inside a transformation.
B) Defining a case class inside a method.
C) Referencing a non-serializable object (e.g., a database connection) from a closure.
D) Calling collect() on a DataFrame.
Answer: C
Explanation: Spark ships closures to executors; if the closure captures a non-serializable object, serialization fails, causing the exception.

Question 44. In Structured Streaming, what does the outputMode “append” mean?
A) Only new rows are written to the sink; existing rows are never updated.
B) All rows are rewritten on every trigger.
C) Updates are written, but deletions are ignored.
D) The sink receives a complete snapshot each time.
Answer: A
Explanation: “append” mode writes only rows that are newly added to the result table, suitable for sinks that do not support updates.

Question 45. Which GCP service provides a serverless data warehouse that integrates natively with Spark via the Spark-BigQuery connector?
A) Cloud Storage
B) BigQuery
C) Dataproc
D) Pub/Sub
Answer: B
Explanation: BigQuery is GCP’s analytics data warehouse; the Spark-BigQuery connector lets Spark read/write data directly to BigQuery tables.

Question 46. Which of the following statements about the foldLeft operation on a Scala collection is true?
A) It processes elements from right to left.
B) It requires an initial accumulator value and a binary operator.
C) It cannot be used on parallel collections.
D) It returns an Option.
Answer: B
Explanation: foldLeft starts with an initial value and applies a binary function sequentially from the leftmost element.

Question 47. In Spark, what is the effect of setting spark.sql.autoBroadcastJoinThreshold to -1?
A) Disables broadcast joins completely.

Question 50. Which of the following is an advantage of using Dataset[T] over DataFrame in Spark when working with Scala?
A) Datasets are always faster than DataFrames.
B) Datasets provide compile-time type safety and can leverage Scala case classes.
C) DataFrames cannot be cached.
D) Datasets do not support SQL queries.
Answer: B
Explanation: Dataset[T] retains strong typing, allowing the compiler to catch errors early, while still offering the optimizations of DataFrames.

Question 51. In Scala, which keyword is used to define a function that may not return a value (i.e., returns Unit)?
A) def
B) val
C) lazy
D) unit
Answer: A
Explanation: def declares a method; when its body produces no meaningful value, the result type is inferred as Unit (or can be declared as Unit explicitly).

Question 52. Which Spark configuration controls the amount of memory allocated for caching persisted data?
A) spark.memory.fraction
B) spark.memory.storageFraction
C) spark.executor.memory
D) spark.sql.shuffle.partitions
Answer: B
Explanation: spark.memory.storageFraction (default 0.5) defines the portion of the unified memory pool reserved for storage (caching), within which cached blocks are immune to eviction by execution.

Question 53. What is the main benefit of using “map-side joins” in Spark?
A) They avoid shuffling the larger dataset.
B) They increase the number of shuffle stages.
C) They require more memory on the driver.
D) They disable partitioning.
Answer: A
Explanation: In a map-side join, the smaller dataset is broadcast to each mapper, eliminating the need to shuffle the larger table.

Question 54. Which of the following is a correct way to define a case class in Scala?
A) case class Person(name: String, age: Int)
B) class Person case class(name: String, age: Int)
C) case class Person { val name: String; val age: Int }
D) def Person(name: String, age: Int) = (name, age)
Answer: A
Explanation: The syntax case class Person(name: String, age: Int) creates an immutable data holder with automatically generated apply, unapply, equals, and hashCode.

Question 55. In Spark Structured Streaming, which sink guarantees exactly-once delivery when writing to a file system?
A) console
B) foreachBatch