


























































Databricks Certified Associate Developer for Apache Spark 3.5 – Python Exam | Complete 200 Question Practice Exam with Answers & Explanations | PDF
Typology: Exams



























































Question 1
A data scientist at an e-commerce company is working with user data obtained from its subscriber database and has stored the data in a DataFrame df_user. Before further processing the data, the data scientist wants to create another DataFrame df_user_non_pii and store only the non-PII columns in this DataFrame. The PII columns in df_user are first_name, last_name, email, and birthdate. Which code snippet can be used to meet this requirement?
A. df_user_non_pii = df_user.drop("first_name", "last_name", "email", "birthdate")
B. df_user_non_pii = df_user.drop("first_name", "last_name", "email", "birthdate")
C. df_user_non_pii = df_user.dropfields("first_name", "last_name", "email", "birthdate")
D. df_user_non_pii = df_user.dropfields("first_name, last_name, email, birthdate")
Answer: A
Explanation: The PySpark drop() method returns a new DataFrame without the specified columns. Multiple column names are passed as separate string arguments.

Question 2
A data engineer is working on a Streaming DataFrame streaming_df with unbounded streaming data. Which operation is supported with streaming_df?
A. streaming_df.select(countDistinct("Name"))
B. streaming_df.groupby("Id").count()
C. streaming_df.orderBy("timestamp").limit(4)
D. streaming_df.filter(col("count") < 30).show()
Answer: B
Explanation: Structured Streaming supports aggregations over a key (groupBy). Exact distinct counts are not supported on a streaming DataFrame, sorting is supported only after an aggregation in Complete output mode, and show() cannot be called on a streaming DataFrame because streaming queries must be started with writeStream.start().

Question 3
An MLOps engineer is building a Pandas UDF that applies a language model translating English strings to Spanish. The initial code loads the model on every call to the UDF:
Question 5
A data engineer is building a Structured Streaming pipeline and wants it to recover from failures or intentional shutdowns by continuing where it left off. How can this be achieved?
A. Configure checkpointLocation during readStream
B. Configure recoveryLocation during SparkSession initialization
C. Configure recoveryLocation during writeStream
D. Configure checkpointLocation during writeStream
Answer: D
Explanation: Setting checkpointLocation in writeStream allows Spark to store streaming progress and recover from failures.

Question 6
A Spark DataFrame df contains a column event_time of type timestamp. You want to calculate the time difference in seconds between consecutive rows, partitioned by user_id and ordered by event_time. Which function should you use?
A. lag()
B. lead()
C. row_number()
D. dense_rank()
Answer: A
Explanation: The lag() function returns the value of a column from a previous row in a window. Combined with window partitioning and ordering, it allows you to calculate differences between consecutive rows.
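
The lag() behavior described in Question 6 can be illustrated without a Spark cluster. The following is a minimal plain-Python sketch that models what a window partitioned by user_id and ordered by event_time computes; it is not Spark code, and the function name seconds_since_previous and the sample events are hypothetical. In PySpark the previous value would typically come from lag("event_time").over(Window.partitionBy("user_id").orderBy("event_time")).

```python
from collections import defaultdict
from datetime import datetime


def seconds_since_previous(events):
    """Model lag() over a window partitioned by user_id, ordered by event_time.

    events: iterable of (user_id, event_time) tuples.
    Returns (user_id, event_time, diff_seconds) tuples; diff_seconds is None
    for the first event of each user, where lag() would yield null.
    """
    by_user = defaultdict(list)
    for user_id, event_time in events:
        by_user[user_id].append(event_time)

    result = []
    for user_id in sorted(by_user):
        previous = None  # no previous row at the start of a partition
        for event_time in sorted(by_user[user_id]):
            if previous is None:
                diff = None
            else:
                diff = (event_time - previous).total_seconds()
            result.append((user_id, event_time, diff))
            previous = event_time
    return result


# Hypothetical sample data: three events for u1, one for u2, out of order.
events = [
    ("u1", datetime(2024, 1, 1, 12, 0, 30)),
    ("u2", datetime(2024, 1, 1, 9, 0, 0)),
    ("u1", datetime(2024, 1, 1, 12, 0, 0)),
    ("u1", datetime(2024, 1, 1, 12, 2, 0)),
]
diffs = [diff for _, _, diff in seconds_since_previous(events)]
```

Each user's first event has no predecessor, so its difference is None, which mirrors the null that lag() returns at the start of a partition.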
Question 7
Which PySpark DataFrame method allows adding a new column based on an existing column using a SQL expression?
A. withColumn()
B. selectExpr()
C. transform()
D. map()
Answer: B
Explanation: selectExpr() accepts SQL expression strings to create new columns or transform existing ones. To keep the existing columns and add a new one, include "*" in the call. Example: df.selectExpr("*", "existing_col * 2 as new_col").

Question 8
You want to join two DataFrames df1 and df2 on the column id, keeping all rows from df1 and only matching rows from df2. Which join type should you use?
A. Inner join
B. Left join
C. Right join
D. Full outer join
Answer: B
Explanation: A left join keeps all rows from the left DataFrame (df1) and appends matched rows from the right DataFrame (df2); unmatched rows get nulls in the df2 columns.

Question 9
A. df.repartition(10, "user_id")
B. df.coalesce(10, "user_id")
C. df.partitionBy("user_id", 10)
D. df.shuffle("user_id", 10)
Answer: A
Explanation: repartition(numPartitions, *cols) shuffles the DataFrame into the given number of partitions, hash-partitioned by the specified column(s).

Question 12
You have a DataFrame df with a nested structure: a column address of type StructType containing city and state. How can you select only the city?
A. df.select("address.city")
B. df.select(col("address").city)
C. df.select("address->city")
D. df.selectStruct("address", "city")
Answer: A
Explanation: Nested fields in a StructType can be accessed using dot notation: "struct_col.field_name".

Question 13
Which PySpark transformation allows you to explode an array column into multiple rows?
A. split()
B. explode()
C. flatten()
D. collect_list()
Answer: B
Explanation: explode() converts each element of an array into a separate row while keeping other columns intact.

Question 14
You want to persist a DataFrame df in memory only without writing to disk. Which storage level should you use?
A. MEMORY_ONLY
B. MEMORY_AND_DISK
C. DISK_ONLY
D. OFF_HEAP
Answer: A
Explanation: MEMORY_ONLY caches the DataFrame in memory; if it does not fit, some partitions are not cached and are recomputed when needed.

Question 15
You have a PySpark UDF that returns multiple columns. Which function is used to apply it?
A. udf()
B. pandas_udf() with StructType return type
C. map()
D. apply()
Answer: B
Explanation: To return multiple columns, use a Pandas UDF with a StructType specifying the schema of returned columns.
C. show()
D. select()
Answer: C
Explanation: Transformations are lazy in Spark; actions such as show(), collect(), or writing output through df.write trigger execution.

Question 18
You want to join df1 and df2 using a broadcast join to optimize performance because df2 is small. Which function should you use?
A. df1.join(broadcast(df2), "id")
B. df1.join(df2.hint("broadcast"), "id")
C. Both A and B
D. df1.join(df2, "id")
Answer: C
Explanation: Both broadcast(df2) and the .hint("broadcast") approach tell Spark to broadcast the small DataFrame to all executors, which avoids shuffling the larger DataFrame for the join.

Question 19
Which PySpark function is used to flatten an array column into multiple rows?
A. flatten()
B. explode()
C. split()
D. collect_list()
Answer: B
Explanation: explode() generates a new row for each element of an array column, keeping other columns unchanged.

Question 20
You want to drop rows in a DataFrame where the column age is null. Which method is correct?
A. df.dropna(subset=["age"])
B. df.filter("age IS NOT NULL")
C. Both A and B
D. df.na.drop()
Answer: C
Explanation: Both dropna(subset=["age"]) and filter("age IS NOT NULL") remove rows with nulls in the age column. df.na.drop() without arguments drops rows that contain a null in any column.

Question 21
Which PySpark method allows you to rename multiple columns at once?
A. withColumnRenamed()
B. toDF()
C. alias()
D. selectExpr()
Answer: B
Explanation: toDF(*new_column_names) renames all columns at once, by position. withColumnRenamed() works for one column at a time.
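
The difference between the two renaming approaches in Question 21 can be sketched on plain lists of column names. This is a hedged plain-Python model, not Spark code: the helper names to_df_names and with_column_renamed are hypothetical, and the ValueError is only a stand-in for the error Spark reports when the number of new names does not match the number of columns.

```python
def to_df_names(current_names, *new_names):
    """Model of toDF(*new_names): every column is renamed by position."""
    if len(new_names) != len(current_names):
        # Stand-in for Spark's error on a column-count mismatch.
        raise ValueError(
            f"expected {len(current_names)} names, got {len(new_names)}"
        )
    return list(new_names)


def with_column_renamed(current_names, existing, new):
    """Model of withColumnRenamed(existing, new): renames one column.

    If the existing name is absent, the names are returned unchanged,
    matching the documented no-op behavior.
    """
    return [new if name == existing else name for name in current_names]


# Hypothetical schema used only for illustration.
columns = ["id", "fname", "lname"]
renamed_all = to_df_names(columns, "user_id", "first_name", "last_name")
renamed_one = with_column_renamed(columns, "fname", "first_name")
unchanged = with_column_renamed(columns, "missing", "whatever")
```

Because toDF() matches names purely by position, it suits cases where every column gets a new name; for a single rename, withColumnRenamed() leaves the other columns untouched.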
A. df.toPandas()
B. df.collect()
C. df.asPandas()
D. df.convertToPandas()
Answer: A
Explanation: toPandas() collects the Spark DataFrame to the driver and returns a Pandas DataFrame, so the data must fit in driver memory.

Question 25
Which PySpark method ensures DataFrame persistence in memory across multiple actions?
A. cache()
B. persist(StorageLevel.MEMORY_ONLY)
C. Both A and B
D. checkpoint()
Answer: C
Explanation: Both cache() and persist() keep a DataFrame available across multiple actions. cache() uses the default storage level, which for DataFrames can also spill to disk, while persist() accepts an explicit storage level such as MEMORY_ONLY or MEMORY_AND_DISK.

Question 26
You want to explode a map column attributes into two columns key and value. Which function should you use?
A. explode()
B. posexplode()
C. inline()
D. from_json()
Answer: A
Explanation: explode() applied to a map column returns one row per map entry with the default column names key and value. posexplode() adds a third column, pos, and inline() applies to arrays of structs rather than maps.

Question 27
You need to write a DataFrame df as a Parquet file with Snappy compression. Which option is correct?
A. df.write.option("compression", "snappy").parquet("/path")
B. df.write.parquet("/path", compression="snappy")
C. Both A and B
D. df.write.snappy("/path")
Answer: C
Explanation: Both syntaxes are valid for writing Parquet files with Snappy compression in PySpark.

Question 28
Which PySpark function allows you to create a new column with the rank of a value within a window?
A. rank()
B. dense_rank()
C. row_number()
D. All of the above
Answer: D
You want to remove duplicate rows based on the id column in a DataFrame df. Which method is correct?
A. df.dropDuplicates(["id"])
B. df.distinct(["id"])
C. df.drop_duplicates(["id"])
D. df.removeDuplicates(["id"])
Answer: A
Explanation: dropDuplicates(["id"]) removes rows with duplicate values in the specified columns. distinct() takes no column arguments and removes only fully duplicated rows. Note that PySpark also exposes drop_duplicates() as an alias of dropDuplicates(), so option C behaves the same way in practice.

Question 32
You need to perform a left outer join between df1 and df2. Which syntax is correct?
A. df1.join(df2, on="id", how="left")
B. df1.join(df2, "id", "left_outer")
C. Both A and B
D. df1.leftJoin(df2, "id")
Answer: C
Explanation: Both syntaxes are valid for performing a left outer join in PySpark; "left" and "left_outer" are accepted names for the same join type.

Question 33
You want to calculate the rolling average of the sales column over the last 3 rows, ordered by date. Which PySpark function is appropriate?
A. avg("sales").over(windowSpec) with a window of 3 preceding rows
B. rolling(3).mean("sales")
C. sum("sales").over(windowSpec)
D. cumsum("sales")
Answer: A
Explanation: A window ordered by date with rowsBetween(-2, 0) covers the current row and the two rows before it, so avg("sales") over that window gives a rolling average of the last 3 rows.

Question 34
Which PySpark function can split a string column tags containing comma-separated values into an array column?
A. split(col("tags"), ",")
B. array_split(col("tags"), ",")
C. explode(col("tags"), ",")
D. from_csv(col("tags"))
Answer: A
Explanation: split() converts a string column into an array using the specified delimiter pattern.

Question 35
A DataFrame df has nested columns in a struct called address. How do you select the city field inside address?
A. df.select("address.city")
B. df.select(col("address.city"))
C. df.select("address.*")
D. Both A and B
Answer: D
Question 38
You need to perform incremental aggregation on a streaming DataFrame by userId. Which operation is supported?
A. streaming_df.groupBy("userId").count()
B. streaming_df.select(countDistinct("userId"))
C. streaming_df.orderBy("timestamp").limit(10)
D. streaming_df.show()
Answer: A
Explanation: Streaming aggregations over a key are supported. Exact distinct counts and show() are not supported on a streaming DataFrame, and sorting is allowed only after an aggregation in Complete output mode.

Question 39
You have a column tags as an array. Which function converts it into multiple rows?
A. explode(col("tags"))
B. split(col("tags"), ",")
C. posexplode(col("tags"))
D. Both A and C
Answer: D
Explanation: explode() and posexplode() flatten an array column into multiple rows; posexplode() also gives the position of each element.

Question 40
You need to ensure exactly-once processing in a Structured Streaming pipeline. Which configuration is essential?
A. .option("checkpointLocation", "/path") during writeStream
B. .option("checkpointLocation", "/path") during readStream
C. .option("recoveryLocation", "/path")
D. .option("stateStoreLocation", "/path")
Answer: A
Explanation: checkpointLocation during writeStream allows Spark to recover streaming queries from recorded progress. End-to-end exactly-once semantics additionally depend on a replayable source and an idempotent sink.

Question 41
Which PySpark function allows you to extract the month from a date column order_date?
A. month("order_date")
B. date_format("order_date", "MM")
C. Both A and B
D. extract_month("order_date")
Answer: C
Explanation: Both functions extract the month from a date column: month() returns it as an integer, while date_format() with the pattern "MM" returns it as a two-digit string.

Question 42
You want to join two DataFrames df1 and df2 without duplicating columns that exist in both. Which approach works?