Databricks Certified Associate Developer for Apache Spark 3.5 – Python Exam | Complete 200 Question Practice Exam with Answers & Explanations
Question 1
A data scientist at an e-commerce company is working with user data obtained from its subscriber database and has stored the data in a DataFrame df_user. Before further processing the data, the data scientist wants to create another DataFrame df_user_non_pii and store only the non-PII columns in this DataFrame. The PII columns in df_user are first_name, last_name, email, and birthdate. Which code snippet can be used to meet this requirement?
A. df_user_non_pii = df_user.drop("first_name", "last_name", "email", "birthdate")
B. df_user_non_pii = df_user.drop("first_name", "last_name", "email", "birthdate")
C. df_user_non_pii = df_user.dropfields("first_name", "last_name", "email", "birthdate")
D. df_user_non_pii = df_user.dropfields("first_name, last_name, email, birthdate")
Answer: A
Explanation: The PySpark drop() method removes the specified columns and returns a new DataFrame. Multiple column names are passed as separate arguments.

Question 2
A data engineer is working on a Streaming DataFrame streaming_df with unbounded streaming data.
Which operation is supported with streaming_df?
A. streaming_df.select(countDistinct("Name"))
B. streaming_df.groupby("Id").count()
C. streaming_df.orderBy("timestamp").limit(4)
D. streaming_df.filter(col("count") < 30).show()
Answer: B
Explanation: Structured Streaming supports aggregations over a key (groupBy). Operations such as countDistinct, orderBy on the raw stream, and show() are not supported on a streaming DataFrame; sorting is allowed only after an aggregation in complete output mode.

Question 3
An MLOps engineer is building a Pandas UDF that applies a language model translating English strings to Spanish. The initial code loads the model on every call to the UDF:

Question 5
A data engineer is building a Structured Streaming pipeline and wants it to recover from failures or intentional shutdowns by continuing where it left off. How can this be achieved?
A. Configure checkpointLocation during readStream
B. Configure recoveryLocation during SparkSession initialization
C. Configure recoveryLocation during writeStream
D. Configure checkpointLocation during writeStream
Answer: D
Explanation: Setting checkpointLocation in writeStream allows Spark to store streaming progress and recover from failures.

Question 6
A Spark DataFrame df contains a column event_time of type timestamp. You want to calculate the time difference in seconds between consecutive rows, partitioned by user_id and ordered by event_time. Which function should you use?
A. lag()
B. lead()
C. row_number()
D. dense_rank()
Answer: A
Explanation: The lag() function returns the value of a column from a previous row in a window. Combined with window partitioning and ordering, it allows you to calculate differences between consecutive rows.
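
To make the lag() idea in Question 6 concrete, here is a minimal plain-Python sketch of what the window computation produces. It is not Spark code: the helper name seconds_since_previous and the sample events are hypothetical, and the sketch only mimics partitioning by user_id, ordering by event_time, and subtracting the previous row's timestamp.

```python
from collections import defaultdict
from datetime import datetime


def seconds_since_previous(events):
    """Mimic lag(event_time) over (partition by user_id order by event_time).

    `events` is an iterable of (user_id, event_time) pairs. The first row of
    each user has no previous row, so its difference is None, just as lag()
    yields null there.
    """
    by_user = defaultdict(list)
    for user_id, event_time in events:
        by_user[user_id].append(event_time)  # "partitionBy"

    result = []
    for user_id in sorted(by_user):
        previous = None
        for event_time in sorted(by_user[user_id]):  # "orderBy"
            if previous is None:
                diff = None
            else:
                diff = (event_time - previous).total_seconds()
            result.append((user_id, event_time, diff))
            previous = event_time
    return result


# Hypothetical sample data
events = [
    ("u1", datetime(2024, 1, 1, 10, 0, 0)),
    ("u2", datetime(2024, 1, 1, 9, 0, 0)),
    ("u1", datetime(2024, 1, 1, 10, 0, 30)),
    ("u1", datetime(2024, 1, 1, 10, 2, 0)),
]
rows = seconds_since_previous(events)
```

In PySpark the same result would typically come from lag("event_time") applied over Window.partitionBy("user_id").orderBy("event_time"), followed by a subtraction of the two timestamps.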

Question 7
Which PySpark DataFrame method allows adding a new column based on an existing column using a SQL expression?
A. withColumn()
B. selectExpr()
C. transform()
D. map()
Answer: B
Explanation: selectExpr() allows using SQL expressions to create new columns or transform existing ones. Example: df.selectExpr("existing_col * 2 as new_col").

Question 8
You want to join two DataFrames df1 and df2 on the column id, keeping all rows from df1 and only matching rows from df2. Which join type should you use?
A. Inner join
B. Left join
C. Right join
D. Full outer join
Answer: B
Explanation: A left join keeps all rows from the left DataFrame (df1) and appends matched rows from the right DataFrame (df2).

Question 9
A. df.repartition(10, "user_id")
B. df.coalesce(10, "user_id")
C. df.partitionBy("user_id", 10)
D. df.shuffle("user_id", 10)
Answer: A
Explanation: repartition(numPartitions, *cols) reshuffles the DataFrame by the specified column(s) into the given number of partitions.

Question 12
You have a DataFrame df with a nested structure: a column address of type StructType containing city and state. How can you select only the city?
A. df.select("address.city")
B. df.select(col("address").city)
C. df.select("address->city")
D. df.selectStruct("address", "city")
Answer: A
Explanation: Nested fields in a StructType can be accessed using dot notation: "struct_col.field_name". Note that attribute access on a Column, as in option B, also resolves a struct field in PySpark, so B works in practice as well; A is the form the answer key expects.

Question 13
Which PySpark transformation allows you to explode an array column into multiple rows?
A. split()
B. explode()
C. flatten()
D. collect_list()
Answer: B
Explanation: explode() converts each element of an array into a separate row while keeping other columns intact.

Question 14
You want to persist a DataFrame df in memory only without writing to disk. Which storage level should you use?
A. MEMORY_ONLY
B. MEMORY_AND_DISK
C. DISK_ONLY
D. OFF_HEAP
Answer: A
Explanation: MEMORY_ONLY caches the DataFrame in memory; if it does not fit, some partitions will not be cached.

Question 15
You have a PySpark UDF that returns multiple columns. Which function is used to apply it?
A. udf()
B. pandas_udf() with StructType return type
C. map()
D. apply()
Answer: B
Explanation: To return multiple columns, use a Pandas UDF with a StructType specifying the schema of returned columns.
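
As a side note on the explode() behaviour from Question 13, the following plain-Python sketch imitates what happens to each row. It is an illustration only, not Spark's implementation: the helper name explode_rows and the sample rows are hypothetical, and for simplicity the exploded value stays under the original field name, whereas Spark names the output column col unless it is aliased.

```python
def explode_rows(rows, array_field):
    """Emit one output row per element of row[array_field].

    Rows whose array is empty or None produce no output, which matches
    explode(); explode_outer() would keep such rows with a null instead.
    """
    exploded = []
    for row in rows:
        for element in row[array_field] or []:
            new_row = dict(row)             # other columns are carried over
            new_row[array_field] = element  # the array is replaced by one element
            exploded.append(new_row)
    return exploded


# Hypothetical sample data
rows = [
    {"id": 1, "tags": ["a", "b"]},
    {"id": 2, "tags": []},
    {"id": 3, "tags": ["c"]},
]
exploded = explode_rows(rows, "tags")
```

The row with id 2 disappears from the output because its array is empty.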
C. show()
D. select()
Answer: C
Explanation: Transformations are lazy in Spark; actions such as show(), collect(), or writing output trigger execution.

Question 18
You want to join df1 and df2 using a broadcast join to optimize performance because df2 is small. Which function should you use?
A. df1.join(broadcast(df2), "id")
B. df1.join(df2.hint("broadcast"), "id")
C. Both A and B
D. df1.join(df2, "id")
Answer: C
Explanation: Both broadcast(df2) and the .hint("broadcast") approach tell Spark to broadcast the small DataFrame to all nodes for a more efficient join.

Question 19
Which PySpark function is used to flatten an array column into multiple rows?
A. flatten()
B. explode()
C. split()
D. collect_list()
Answer: B
Explanation: explode() generates a new row for each element of an array column, keeping other columns unchanged.

Question 20
You want to drop rows in a DataFrame where the column age is null. Which method is correct?
A. df.dropna(subset=["age"])
B. df.filter("age IS NOT NULL")
C. Both A and B
D. df.na.drop()
Answer: C
Explanation: Both dropna(subset=["age"]) and filter("age IS NOT NULL") remove rows with nulls in the age column.

Question 21
Which PySpark method allows you to rename multiple columns at once?
A. withColumnRenamed()
B. toDF()
C. alias()
D. selectExpr()
Answer: B
Explanation: toDF(*new_column_names) renames all columns at once. withColumnRenamed() works for one column at a time.

A. df.toPandas()
B. df.collect()
C. df.asPandas()
D. df.convertToPandas()
Answer: A
Explanation: toPandas() collects the Spark DataFrame to the driver and returns a Pandas DataFrame.

Question 25
Which PySpark method ensures DataFrame persistence in memory across multiple actions?
A. cache()
B. persist(StorageLevel.MEMORY_ONLY)
C. Both A and B
D. checkpoint()
Answer: C
Explanation: Both cache() and persist() keep a DataFrame available across multiple actions. cache() uses the default storage level, while persist() accepts an explicit level such as MEMORY_ONLY or MEMORY_AND_DISK.

Question 26
You want to explode a map column attributes into two columns key and value. Which function should you use?
A. explode()
B. posexplode()
C. inline()
D. from_json()
Answer: A
Explanation: explode() applied to a map column produces one row per entry with two columns, key and value. inline() works on an array of structs, not on a map.

Question 27
You need to write a DataFrame df as a Parquet file with Snappy compression. Which option is correct?
A. df.write.option("compression", "snappy").parquet("/path")
B. df.write.parquet("/path", compression="snappy")
C. Both A and B
D. df.write.snappy("/path")
Answer: C
Explanation: Both syntaxes are valid for writing Parquet files with Snappy compression in PySpark.

Question 28
Which PySpark function allows you to create a new column with the rank of a value within a window?
A. rank()
B. dense_rank()
C. row_number()
D. All of the above
Answer: D
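
The three ranking functions in Question 28 differ only in how they treat ties. The plain-Python sketch below reproduces that difference for a single window partition ordered ascending; the helper name rank_functions and the sample values are hypothetical, and this is an illustration of the semantics rather than Spark code.

```python
def rank_functions(values):
    """Return (value, row_number, rank, dense_rank) for one ordered partition.

    row_number: always increases by one, ties broken arbitrarily.
    rank:       ties share a rank, and the following rank skips ahead.
    dense_rank: ties share a rank, and the following rank does not skip.
    """
    ordered = sorted(values)
    results = []
    rank = 0
    dense_rank = 0
    previous = object()  # sentinel that never equals a real value
    for row_number, value in enumerate(ordered, start=1):
        if value != previous:
            rank = row_number
            dense_rank += 1
            previous = value
        results.append((value, row_number, rank, dense_rank))
    return results


# Hypothetical sample data with one tie
ranked = rank_functions([20, 10, 30, 20])
```

With the tie on 20, row_number gives 2 and 3, rank gives 2 and 2 followed by 4, and dense_rank gives 2 and 2 followed by 3.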

You want to remove duplicate rows based on the id column in a DataFrame df. Which method is correct?
A. df.dropDuplicates(["id"])
B. df.distinct(["id"])
C. df.drop_duplicates(["id"])
D. df.removeDuplicates(["id"])
Answer: A
Explanation: dropDuplicates(["id"]) removes rows with duplicate values in the specified columns. distinct() removes full duplicate rows and takes no column argument. In PySpark, drop_duplicates() is an alias for dropDuplicates(), so option C behaves the same way even though the answer key lists only A.

Question 32
You need to perform a left outer join between df1 and df2. Which syntax is correct?
A. df1.join(df2, on="id", how="left")
B. df1.join(df2, "id", "left_outer")
C. Both A and B
D. df1.leftJoin(df2, "id")
Answer: C
Explanation: Both syntaxes are valid for performing a left outer join in PySpark.

Question 33
You want to calculate the rolling average of the sales column over the last 3 rows, ordered by date. Which PySpark function is appropriate?
A. avg("sales").over(windowSpec) with a window of 3 preceding rows
B. rolling(3).mean("sales")
C. sum("sales").over(windowSpec)
D. cumsum("sales")
Answer: A
Explanation: Window functions with rowsBetween(-2, 0) allow calculating rolling averages in Spark.

Question 34
Which PySpark function can split a string column tags containing comma-separated values into an array column?
A. split(col("tags"), ",")
B. array_split(col("tags"), ",")
C. explode(col("tags"), ",")
D. from_csv(col("tags"))
Answer: A
Explanation: split() converts a string column into an array using the specified delimiter.

Question 35
A DataFrame df has nested columns in a struct called address. How do you select the city field inside address?
A. df.select("address.city")
B. df.select(col("address.city"))
C. df.select("address.*")
D. Both A and B
Answer: D
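
Returning to the rolling average from Question 33: a frame of rowsBetween(-2, 0) covers the current row and the two rows before it, so the first two rows are averaged over fewer than three values. The plain-Python sketch below shows that frame arithmetic; the helper name rolling_average and the sample figures are hypothetical, and the input is assumed to be already ordered by date.

```python
def rolling_average(values, preceding=2):
    """Average each value with up to `preceding` earlier values.

    Mirrors avg(...) over a window frame of rowsBetween(-preceding, 0):
    the frame is simply shorter near the start of the partition.
    """
    averages = []
    for i in range(len(values)):
        frame = values[max(0, i - preceding): i + 1]
        averages.append(sum(frame) / len(frame))
    return averages


# Hypothetical daily sales, already ordered by date
sales = [10, 20, 30, 40]
averages = rolling_average(sales)
```

Here the last value averages 20, 30, and 40, while the first value averages only itself.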

Question 38
You need to perform incremental aggregation on a streaming DataFrame by userId. Which operation is supported?
A. streaming_df.groupBy("userId").count()
B. streaming_df.select(countDistinct("userId"))
C. streaming_df.orderBy("timestamp").limit(10)
D. streaming_df.show()
Answer: A
Explanation: Streaming aggregations over a key are supported. Global aggregates or ordering are not allowed without windowing.

Question 39
You have a column tags as an array. Which function converts it into multiple rows?
A. explode(col("tags"))
B. split(col("tags"), ",")
C. posexplode(col("tags"))
D. Both A and C
Answer: D
Explanation: explode() and posexplode() flatten an array column into multiple rows; posexplode() also gives the position of each element.

Question 40
You need to ensure exactly-once processing in a Structured Streaming pipeline. Which configuration is essential?
A. .option("checkpointLocation", "/path") during writeStream
B. .option("checkpointLocation", "/path") during readStream
C. .option("recoveryLocation", "/path")
D. .option("stateStoreLocation", "/path")
Answer: A
Explanation: checkpointLocation during writeStream allows Spark to recover streaming queries. Together with a replayable source and an idempotent sink, it is what makes exactly-once semantics possible.

Question 41
Which PySpark function allows you to extract the month from a date column order_date?
A. month("order_date")
B. date_format("order_date", "MM")
C. Both A and B
D. extract_month("order_date")
Answer: C
Explanation: month() returns the month as an integer, while date_format() with the "MM" pattern returns it as a two-digit string.

Question 42
You want to join two DataFrames df1 and df2 without duplicating columns that exist in both. Which approach works?