






























































Databricks Certified Associate Developer for Apache Spark 3.5 – Python Exam | Complete 200-Question Practice Exam with Answers & Explanations | PDF
Typology: Exams































































Question 1
A data scientist at an e-commerce company is working with user data obtained from its subscriber database and has stored the data in a DataFrame df_user. Before processing the data further, the data scientist wants to create another DataFrame, df_user_non_pii, that stores only the non-PII columns. The PII columns in df_user are first_name, last_name, email, and birthdate. Which code snippet can be used to meet this requirement?
A. df_user_non_pii = df_user.drop("first_name", "last_name", "email", "birthdate")
B. df_user_non_pii = df_user.drop("first_name, last_name, email, birthdate")
C. df_user_non_pii = df_user.dropfields("first_name", "last_name", "email", "birthdate")
D. df_user_non_pii = df_user.dropfields("first_name, last_name, email, birthdate")
Answer: A
Explanation: The PySpark drop() method returns a new DataFrame without the specified columns. Multiple column names are passed as separate arguments, not as one comma-joined string. DataFrame has no dropfields() method; dropFields() is a Column method that removes fields from a struct.
Question 2
A data engineer is working on a streaming DataFrame streaming_df with unbounded streaming data. Which operation is supported with streaming_df?
A. streaming_df.select(countDistinct("Name"))
B. streaming_df.groupby("Id").count()
C. streaming_df.orderBy("timestamp").limit(4)
D. streaming_df.filter(col("count") < 30).show()
Answer: B
Explanation: Structured Streaming supports aggregations over a key (groupBy). Exact distinct counts, sorting a stream that has not been aggregated, and show() are not supported on a streaming DataFrame; results are produced by starting a query with writeStream.
C. Spark stores as much as possible in memory and spills the rest to disk when memory is full, continuing processing with performance overhead.
D. Spark stores frequently accessed rows in memory and less frequently accessed rows on disk.
Answer: C
Explanation: MEMORY_AND_DISK caches as many partitions as fit in memory and spills the remainder to disk, so processing continues at some performance cost.

Question 5
A data engineer is building a Structured Streaming pipeline and wants it to recover from failures or intentional shutdowns by continuing where it left off. How can this be achieved?
A. Configure checkpointLocation during readStream
B. Configure recoveryLocation during SparkSession initialization
C. Configure recoveryLocation during writeStream
D. Configure checkpointLocation during writeStream
Answer: D
Explanation: Setting the checkpointLocation option on writeStream lets Spark persist the query's progress and state, so a restarted query resumes where it left off.

Question 6
A Spark DataFrame df contains a column event_time of type timestamp. You want to calculate the time difference in seconds between consecutive rows, partitioned by user_id and ordered by event_time. Which function should you use?
A. lag()
B. lead()
C. row_number()
D. dense_rank()
Answer: A
Explanation: The lag() function returns the value of a column from a previous row in a window. Combined with a window partitioned by user_id and ordered by event_time, it lets you compute the difference between each row and the one before it.

Question 7
Which PySpark DataFrame method allows adding a new column based on an existing column using a SQL expression?
A. withColumn()
B. selectExpr()
C. transform()
D. map()
Answer: B
Explanation: selectExpr() accepts SQL expression strings, so it can create new columns or transform existing ones. Example: df.selectExpr("*", "existing_col * 2 as new_col") keeps the existing columns and adds new_col.

Question 8
You want to join two DataFrames df1 and df2 on the column id, keeping all rows from df1 and only matching rows from df2. Which join type should you use?
B. explode()
C. split()
D. collect_list()
Answer: A
Explanation: from_json() parses a JSON string column into a structured column using the supplied schema. Example: df.withColumn("jsonData", from_json("payload", schema)).

Question 11
Which of the following is a correct way to repartition a DataFrame df by a column user_id into 10 partitions?
A. df.repartition(10, "user_id")
B. df.coalesce(10, "user_id")
C. df.partitionBy("user_id", 10)
D. df.shuffle("user_id", 10)
Answer: A
Explanation: repartition(numPartitions, *cols) shuffles the DataFrame into the given number of partitions, hash-partitioned by the specified column(s).

Question 12
You have a DataFrame df with a nested structure: a column address of type StructType containing city and state. How can you select only the city?
A. df.select("address.city")
B. df.select(col("address").city)
C. df.select("address->city")
D. df.selectStruct("address", "city")
Answer: A
Explanation: Nested fields in a StructType can be accessed using dot notation: "struct_col.field_name". Attribute access on a Column, as in option B, also resolves a struct field in PySpark; options C and D do not work.

Question 13
Which PySpark transformation allows you to explode an array column into multiple rows?
A. split()
B. explode()
C. flatten()
D. collect_list()
Answer: B
Explanation: explode() converts each element of an array into a separate row while keeping other columns intact.

Question 14
You want to persist a DataFrame df in memory only without writing to disk. Which storage level should you use?
A. MEMORY_ONLY
B. MEMORY_AND_DISK
C. DISK_ONLY
D. OFF_HEAP
You want to compute the average sales per region in a DataFrame df. Which PySpark method is correct?
A. df.groupBy("region").mean("sales")
B. df.groupBy("region").agg({"sales": "avg"})
C. df.groupby("region").average("sales")
D. Both A and B
Answer: D
Explanation: Both groupBy("region").mean("sales") and groupBy("region").agg({"sales": "avg"}) are valid ways to calculate averages in PySpark. Grouped data has no average() method.

Question 17
Which Spark operation triggers the execution of transformations?
A. map()
B. filter()
C. show()
D. select()
Answer: C
Explanation: Transformations are lazy in Spark; actions such as show(), collect(), or saving through df.write trigger execution.

Question 18
You want to join df1 and df2 using a broadcast join to optimize performance because df2 is small. Which function should you use?
A. df1.join(broadcast(df2), "id")
B. df1.join(df2.hint("broadcast"), "id")
C. Both A and B
D. df1.join(df2, "id")
Answer: C
Explanation: Both broadcast(df2) and df2.hint("broadcast") ask Spark to send the small DataFrame to every executor, so the join can run without shuffling df1.

Question 19
Which PySpark function is used to flatten an array column into multiple rows?
A. flatten()
B. explode()
C. split()
D. collect_list()
Answer: B
Explanation: explode() generates a new row for each element of an array column, keeping other columns unchanged.

Question 20
You want to drop rows in a DataFrame where the column age is null. Which method is correct?
A. df.dropna(subset=["age"])
Answer: D
Explanation: You can use row_number() over a window to filter out the first 5 rows efficiently in Spark.

Question 23
Which PySpark function allows you to compute a cumulative sum over a window?
A. sum()
B. cumsum()
C. sum().over(windowSpec)
D. aggregate()
Answer: C
Explanation: Applying an aggregate as a window function, such as sum("col").over(windowSpec) with an ordered windowSpec, produces cumulative or running totals in Spark.

Question 24
You want to convert a PySpark DataFrame df to a Pandas DataFrame. Which method is correct?
A. df.toPandas()
B. df.collect()
C. df.asPandas()
D. df.convertToPandas()
Answer: A
Explanation: toPandas() collects the Spark DataFrame to the driver and returns a Pandas DataFrame, so the data must fit in driver memory.

Question 25
Which PySpark method ensures DataFrame persistence in memory across multiple actions?
A. cache()
B. persist(StorageLevel.MEMORY_ONLY)
C. Both A and B
D. checkpoint()
Answer: C
Explanation: Both cache() and persist() keep a DataFrame available across actions. For DataFrames, cache() uses the default storage level, which is a memory-and-disk level rather than MEMORY_ONLY; persist() lets you choose the level explicitly.

Question 26
You want to explode a map column attributes into two columns key and value. Which function should you use?
A. explode()
B. posexplode()
C. inline()
D. from_json()
Answer: A
Explanation: explode() on a map column produces one row per entry with two columns, named key and value by default. posexplode() adds a third pos column, and inline() applies to an array of structs, not to a map.
Question 29
You want to convert a column date_str in format yyyy-MM-dd to a DateType column. Which function should you use?
A. to_date("date_str", "yyyy-MM-dd")
B. cast("date_str", "date")
C. from_unixtime("date_str")
D. date_format("date_str", "yyyy-MM-dd")
Answer: A
Explanation: to_date() parses a string column to DateType using the given format.

Question 30
You want to aggregate a DataFrame with multiple aggregations: sum of sales and max of profit per region. Which syntax is correct?
A. df.groupBy("region").agg({"sales": "sum", "profit": "max"})
B. df.groupBy("region").agg(sum("sales"), max("profit"))
C. Both A and B
D. df.aggregate("sales", "profit")
Answer: C
Explanation: Both dictionary-based aggregation and column function aggregation are supported in PySpark, provided sum and max in option B are imported from pyspark.sql.functions rather than being the Python built-ins.

Question 31
You want to remove duplicate rows based on the id column in a DataFrame df. Which method is correct?
A. df.dropDuplicates(["id"])
B. df.distinct(["id"])
C. df.drop_duplicates(["id"])
D. df.removeDuplicates(["id"])
Answer: A
Explanation: dropDuplicates(["id"]) removes rows with duplicate values in the specified columns. distinct() takes no arguments and removes only fully duplicated rows. Note that PySpark also provides drop_duplicates() as an alias for dropDuplicates(), so option C behaves the same way as A.

Question 32
You need to perform a left outer join between df1 and df2. Which syntax is correct?
A. df1.join(df2, on="id", how="left")
B. df1.join(df2, "id", "left_outer")
C. Both A and B
D. df1.leftJoin(df2, "id")
Answer: C
Explanation: Both syntaxes are valid for performing a left outer join in PySpark; "left", "leftouter", and "left_outer" all name the same join type.

Question 33
You want to calculate the rolling average of the sales column over the last 3 rows, ordered by date. Which PySpark function is appropriate?
D. Both A and B
Answer: D
Explanation: You can access nested fields using dot notation or col("struct.field") in PySpark.

Question 36
You want to remove the first row of a DataFrame df without collecting the DataFrame to the driver. Which approach is correct?
A. df.tail(df.count() - 1)
B. df.filter(row_number() > 1)
C. df.drop(0)
D. df.limit(df.count() - 1)
Answer: B
Explanation: Using row_number() over a window allows filtering without bringing data to the driver. Option B is shorthand: row_number() must be evaluated over an ordered window and added as a column (for example with withColumn) before filtering, because window functions cannot appear directly in filter(). tail() returns rows to the driver, and limit() keeps the leading rows rather than dropping the first one.

Question 37
You want to persist a DataFrame df in memory and disk for fault tolerance. Which storage level should you use?
A. MEMORY_ONLY
B. MEMORY_AND_DISK
C. DISK_ONLY
D. MEMORY_ONLY_SER
Answer: B
Explanation: MEMORY_AND_DISK stores as much as possible in memory and spills remaining partitions to disk.

Question 38
You need to perform incremental aggregation on a streaming DataFrame by userId. Which operation is supported?
A. streaming_df.groupBy("userId").count()
B. streaming_df.select(countDistinct("userId"))
C. streaming_df.orderBy("timestamp").limit(10)
D. streaming_df.show()
Answer: A
Explanation: Streaming aggregations over a key are supported. Exact distinct counts, sorting a stream that has not been aggregated, and show() are not supported on a streaming DataFrame.

Question 39
You have a column tags as an array. Which function converts it into multiple rows?
A. explode(col("tags"))
B. split(col("tags"), ",")
C. posexplode(col("tags"))
D. Both A and C
Answer: D