








This document covers various topics related to the Databricks Certified Associate Developer for Apache Spark 3.0 - Python exam. It provides explanations and examples for concepts such as the Spark driver, worker nodes, slots, tasks, stages, shuffles, transformations, actions, execution/deployment modes, out-of-memory errors, storage levels, broadcast variables, data partitioning, DataFrames, and SQL UDFs. It is likely intended as a study guide or reference for people preparing for that exam, and its level of technical detail suggests it is most useful for university students or lifelong learners with a strong background in data engineering, big data processing, and Apache Spark.
Typology: Exams









What is a Spark driver? - The Spark driver is the node on which the Spark application's main method runs to coordinate the application. It contains the SparkContext object and is responsible for scheduling the execution of data by the worker nodes in cluster mode.

What are worker nodes in cluster-mode Spark? - Worker nodes are machines that host the executors responsible for executing tasks.

What are slots? - Slots are resources for parallelization within a Spark application.

What is a combination of a block of data and a set of transformations that runs on a single executor? - Task

What is a group of tasks that can be executed in parallel to compute the same set of operations on potentially multiple machines? - Stage
What is a shuffle? - A shuffle is the process by which data is compared across partitions.

If you have a DF with more partitions than you have (single-core) executors, what happens? - Performance will be suboptimal because not all data can be processed at the same time. Shuffle commands will create a large number of connections. There is increased overhead associated with managing resources for data processing for each task. There is an increased risk of out-of-memory errors, depending on the size of the executors.

Which of the following operations will trigger evaluation? A) df.filter() B) df.distinct() C) df.intersect() D) df.join() E) df.count() - E) df.count()
What is an out-of-memory error in Spark? - An out-of-memory error occurs when either the driver or an executor does not have enough memory to collect or process the data allocated to it.

Which of the following is the default storage level for persist() for a non-streaming DataFrame/Dataset? A) MEMORY_AND_DISK B) MEMORY_AND_DISK_SER C) DISK_ONLY D) MEMORY_ONLY_SER E) MEMORY_ONLY - A) MEMORY_AND_DISK

What is a broadcast variable? - A broadcast variable is entirely cached on each worker node so it doesn't need to be shipped or shuffled between nodes within each stage.
Which of the following operations is most likely to induce skew in the size of your data's partitions? A) df.collect() B) df.cache() C) df.repartition(n) D) df.coalesce(n) E) df.persist() - D) df.coalesce(n)

What data structures are Spark DataFrames built on top of? - RDDs (resilient distributed datasets)

What is the code block needed to return a DataFrame containing only column 'storeId' and column 'division' from a DataFrame called 'storesDF'? - storesDF.select("storeId", "division")
What is the code that returns a new DF from a DF 'storesDF' where column 'numberOfManagers' is the constant integer 1? - storesDF.withColumn("numberOfManagers", lit(1))

Which of the following operations can be used to split an array column into an individual DataFrame row for each element in the array? A) extract() B) split() C) explode() D) arrays_zip() E) unpack() - C) explode()

What code returns a new DataFrame where column "storeCategory" is an all-lowercase version of column "storeCategory" in DataFrame "storesDF"? - storesDF.withColumn("storeCategory", lower(col("storeCategory")))
The code block shown below contains an error. The code block is intended to return a new DataFrame where column division from DataFrame storesDF has been renamed to column state and column managerName from DataFrame storesDF has been renamed to column managerFullName. Identify the error. Code block: (storesDF.withColumnRenamed("state", "division").withColumnRenamed("managerFullName", "managerName")) - The first argument to operation withColumnRenamed() should be the old column name and the second argument should be the new column name.

What is the code that returns a DataFrame where rows in DataFrame "storesDF" containing missing values in every column have been dropped? - storesDF.na.drop("all")

Which of the following operations fails to return a DataFrame where every row is unique? A) DataFrame.distinct()
Fill in the blanks in the block below to return a new DF with the mean of column 'sqft' from DF 'storesDF' in column 'sqftMean'. storesDF.__1__(__2__(__3__).alias("sqftMean")) - 1 - agg 2 - mean 3 - col("sqft")

Which of the following code blocks returns the number of rows in DF 'storesDF'? A) storesDF.withColumn("numberOfRows", count()) B) storesDF.withColumn(count().alias("numberOfRows")) C) storesDF.countDistinct() D) storesDF.count() E) storesDF.agg(count()) - D) storesDF.count()
What is the code block which returns the sum of values in column 'sqft' in DF 'storesDF' grouped by distinct values in column 'division'? - storesDF.groupBy("division").agg(sum(col("sqft")))

What is the code block which returns a DF containing summary statistics only for column 'sqft' in DF 'storesDF'? - storesDF.describe("sqft")

Which of the following operations can be used to sort the rows of a DataFrame? A) sort() and orderBy() B) orderby() C) sort() and orderby() D) orderBy() E) sort() - A) sort() and orderBy()
1.2._3 - 1) storesDF