Databricks Certified Associate Developer for Apache Spark 3.5 – Python Exam | Complete 200-Question Practice Exam with Answers & Explanations | PDF

Typology: Exams

2025/2026

Available from 06/09/2026

worlden

Question 1
A data scientist of an e-commerce company is working with user data obtained
from its subscriber database and has stored the data in a DataFrame df_user.
Before further processing the data, the data scientist wants to create another
DataFrame df_user_non_pii and store only the non-PII columns in this
DataFrame. The PII columns in df_user are first_name, last_name,
email, and birthdate. Which code snippet can be used to meet this
requirement?
A. df_user_non_pii = df_user.drop("first_name",
"last_name", "email", "birthdate")
B. df_user_non_pii = df_user.drop("first_name",
"last_name", "email", "birthdate")
C. df_user_non_pii = df_user.dropfields("first_name",
"last_name", "email", "birthdate")
D. df_user_non_pii = df_user.dropfields("first_name,
last_name, email, birthdate")
Answer: A
Explanation:
The PySpark drop() method removes specified columns and returns a new
DataFrame. Multiple column names are passed as separate arguments.

Question 2
A data engineer is working on a Streaming DataFrame streaming_df with unbounded streaming data. Which operation is supported with streaming_df?
A. streaming_df.select(countDistinct("Name"))
B. streaming_df.groupby("Id").count()
C. streaming_df.orderBy("timestamp").limit(4)
D. streaming_df.filter(col("count") < 30).show()
Answer: B
Explanation:
Structured Streaming supports aggregations over a key (groupBy). Distinct aggregations such as countDistinct, sorting or limiting the raw stream, and show() are not supported on a streaming DataFrame; sorting is allowed only after an aggregation in complete output mode.

C. Spark stores as much as possible in memory and spills the rest to disk when memory is full, continuing processing with performance overhead.
D. Spark stores frequently accessed rows in memory and less frequently accessed rows on disk.
Answer: C
Explanation:
MEMORY_AND_DISK caches as much data as possible in memory and spills the remainder to disk to continue processing.

Question 5
A data engineer is building a Structured Streaming pipeline and wants it to recover from failures or intentional shutdowns by continuing where it left off. How can this be achieved?
A. Configure checkpointLocation during readStream
B. Configure recoveryLocation during SparkSession initialization
C. Configure recoveryLocation during writeStream
D. Configure checkpointLocation during writeStream
Answer: D
Explanation:
Setting checkpointLocation in writeStream allows Spark to store streaming progress and recover from failures.

Question 6
A Spark DataFrame df contains a column event_time of type timestamp. You want to calculate the time difference in seconds between consecutive rows, partitioned by user_id and ordered by event_time. Which function should you use?
A. lag()
B. lead()
C. row_number()
D. dense_rank()
Answer: A
Explanation:
The lag() function returns the value of a column from a previous row in a window. Combined with window partitioning and ordering, it allows you to calculate differences between consecutive rows.

Question 7
Which PySpark DataFrame method allows adding a new column based on an existing column using a SQL expression?
A. withColumn()
B. selectExpr()
C. transform()
D. map()
Answer: B
Explanation:
selectExpr() allows using SQL expressions to create new columns or transform existing ones. Example: df.selectExpr("existing_col * 2 as new_col").

Question 8
You want to join two DataFrames df1 and df2 on the column id, keeping all rows from df1 and only matching rows from df2. Which join type should you use?

B. explode()
C. split()
D. collect_list()
Answer: A
Explanation:
from_json() parses JSON strings into structured columns. Example: df.withColumn("jsonData", from_json("payload", schema)).

Question 11
Which of the following is a correct way to repartition a DataFrame df by a column user_id into 10 partitions?
A. df.repartition(10, "user_id")
B. df.coalesce(10, "user_id")
C. df.partitionBy("user_id", 10)
D. df.shuffle("user_id", 10)
Answer: A
Explanation:
repartition(numPartitions, *cols) reshuffles the DataFrame by the specified column(s) into the given number of partitions.

Question 12
You have a DataFrame df with a nested structure: a column address of type StructType containing city and state. How can you select only the city?
A. df.select("address.city")
B. df.select(col("address").city)
C. df.select("address->city")
D. df.selectStruct("address", "city")
Answer: A
Explanation:
Nested fields in a StructType can be accessed using dot notation: "struct_col.field_name". Note that option B also resolves in PySpark, because attribute access on a Column returns the named field; option A is the form the answer key expects.

Question 13
Which PySpark transformation allows you to explode an array column into multiple rows?
A. split()
B. explode()
C. flatten()
D. collect_list()
Answer: B
Explanation:
explode() converts each element of an array into a separate row while keeping other columns intact.

Question 14
You want to persist a DataFrame df in memory only without writing to disk. Which storage level should you use?
A. MEMORY_ONLY
B. MEMORY_AND_DISK
C. DISK_ONLY
D. OFF_HEAP

You want to compute the average sales per region in a DataFrame df. Which PySpark method is correct?
A. df.groupBy("region").mean("sales")
B. df.groupBy("region").agg({"sales": "avg"})
C. df.groupby("region").average("sales")
D. Both A and B
Answer: D
Explanation:
Both groupBy("region").mean("sales") and groupBy("region").agg({"sales": "avg"}) are valid ways to calculate averages in PySpark.

Question 17
Which Spark operation triggers the execution of transformations?
A. map()
B. filter()
C. show()
D. select()
Answer: C
Explanation:
Transformations are lazy in Spark; actions such as show() or collect(), or writing the DataFrame out, trigger execution.

Question 18
You want to join df1 and df2 using a broadcast join to optimize performance because df2 is small. Which function should you use?
A. df1.join(broadcast(df2), "id")
B. df1.join(df2.hint("broadcast"), "id")
C. Both A and B
D. df1.join(df2, "id")
Answer: C
Explanation:
Both broadcast(df2) and the .hint("broadcast") approach inform Spark to broadcast the small DataFrame to all nodes for a more efficient join.

Question 19
Which PySpark function is used to flatten an array column into multiple rows?
A. flatten()
B. explode()
C. split()
D. collect_list()
Answer: B
Explanation:
explode() generates a new row for each element of an array column, keeping other columns unchanged.

Question 20
You want to drop rows in a DataFrame where the column age is null. Which method is correct?
A. df.dropna(subset=["age"])

Answer: D
Explanation:
You can use row_number() over a window to filter out the first 5 rows efficiently in Spark.

Question 23
Which PySpark function allows you to compute a cumulative sum over a window?
A. sum()
B. cumsum()
C. sum().over(windowSpec)
D. aggregate()
Answer: C
Explanation:
Window functions like sum().over(windowSpec) allow cumulative or running totals in Spark.

Question 24
You want to convert a PySpark DataFrame df to a Pandas DataFrame. Which method is correct?
A. df.toPandas()
B. df.collect()
C. df.asPandas()
D. df.convertToPandas()
Answer: A
Explanation:
toPandas() collects the Spark DataFrame to the driver and returns a Pandas DataFrame.

Question 25
Which PySpark method ensures DataFrame persistence in memory across multiple actions?
A. cache()
B. persist(StorageLevel.MEMORY_ONLY)
C. Both A and B
D. checkpoint()
Answer: C
Explanation:
Both cache() and persist() keep a DataFrame available across actions. cache() uses the default storage level, which for DataFrames is a memory-and-disk level, while persist() accepts an explicit level such as MEMORY_ONLY or MEMORY_AND_DISK.

Question 26
You want to explode a map column attributes into two columns key and value. Which function should you use?
A. explode()
B. posexplode()
C. inline()
D. from_json()
Answer: A
Explanation:
explode() applied to a map column returns one row per map entry, with the columns named key and value by default. inline() applies to an array of structs rather than a map, and posexplode() would add a third column, pos.

Question 29
You want to convert a column date_str in format yyyy-MM-dd to a DateType column. Which function should you use?
A. to_date("date_str", "yyyy-MM-dd")
B. cast("date_str", "date")
C. from_unixtime("date_str")
D. date_format("date_str", "yyyy-MM-dd")
Answer: A
Explanation:
to_date() parses a string column to DateType using the given format.

Question 30
You want to aggregate a DataFrame with multiple aggregations: sum of sales and max of profit per region. Which syntax is correct?
A. df.groupBy("region").agg({"sales": "sum", "profit": "max"})
B. df.groupBy("region").agg(sum("sales"), max("profit"))
C. Both A and B
D. df.aggregate("sales", "profit")
Answer: C
Explanation:
Both dictionary-based aggregation and column function aggregation are supported in PySpark.

Question 31
You want to remove duplicate rows based on the id column in a DataFrame df. Which method is correct?
A. df.dropDuplicates(["id"])
B. df.distinct(["id"])
C. df.drop_duplicates(["id"])
D. df.removeDuplicates(["id"])
Answer: A
Explanation:
dropDuplicates(["id"]) removes rows with duplicate values in the specified columns. distinct() takes no column argument and removes only fully duplicated rows. Note that drop_duplicates() in option C is an alias of dropDuplicates() in PySpark and behaves the same way; option A is the form the answer key expects.

Question 32
You need to perform a left outer join between df1 and df2. Which syntax is correct?
A. df1.join(df2, on="id", how="left")
B. df1.join(df2, "id", "left_outer")
C. Both A and B
D. df1.leftJoin(df2, "id")
Answer: C
Explanation:
Both syntaxes are valid for performing a left outer join in PySpark.

Question 33
You want to calculate the rolling average of the sales column over the last 3 rows, ordered by date. Which PySpark function is appropriate?

D. Both A and B
Answer: D
Explanation:
You can access nested fields using dot notation or col("struct.field") in PySpark.

Question 36
You want to remove the first row of a DataFrame df without collecting the DataFrame to the driver. Which approach is correct?
A. df.tail(df.count() - 1)
B. df.filter(row_number() > 1)
C. df.drop(0)
D. df.limit(df.count() - 1)
Answer: B
Explanation:
Using row_number() over a window allows filtering without bringing data to the driver. Option B is shorthand: row_number() has to be evaluated with .over(windowSpec), typically into a helper column, and the filter is then applied to that column.

Question 37
You want to persist a DataFrame df in memory and disk for fault tolerance. Which storage level should you use?
A. MEMORY_ONLY
B. MEMORY_AND_DISK
C. DISK_ONLY
D. MEMORY_ONLY_SER
Answer: B
Explanation:
MEMORY_AND_DISK stores as much as possible in memory and spills remaining partitions to disk.

Question 38
You need to perform incremental aggregation on a streaming DataFrame by userId. Which operation is supported?
A. streaming_df.groupBy("userId").count()
B. streaming_df.select(countDistinct("userId"))
C. streaming_df.orderBy("timestamp").limit(10)
D. streaming_df.show()
Answer: A
Explanation:
Streaming aggregations over a key are supported. Distinct aggregations, sorting or limiting the raw stream, and show() are not supported on a streaming DataFrame.

Question 39
You have a column tags as an array. Which function converts it into multiple rows?
A. explode(col("tags"))
B. split(col("tags"), ",")
C. posexplode(col("tags"))
D. Both A and C
Answer: D