Databricks Certified Associate Developer for Apache Spark 3.0 – Python exam

This document covers topics from the Databricks Certified Associate Developer for Apache Spark 3.0 – Python exam. It provides explanations and examples for concepts such as the Spark driver, worker nodes, slots, tasks, stages, shuffles, transformations, actions, execution/deployment modes, out-of-memory errors, storage levels, broadcast variables, data partitioning, DataFrames, and SQL UDFs. It is likely intended as a study guide or reference for people preparing for the exam, and it is most useful to readers who already have a strong background in data engineering, big data processing, and Apache Spark.

Typology: Exams

2023/2024

Available from 08/13/2024

Ellah1

Databricks Certified Associate Developer for Apache Spark 3.0 – Python exam

What is a Spark driver? - The Spark driver is the node on which the Spark application's main method runs to coordinate the application. It contains the SparkContext object and is responsible for scheduling the execution of tasks on the worker nodes in cluster mode.

What are worker nodes in cluster-mode Spark? - Worker nodes are machines that host the executors responsible for the execution of tasks.

What are slots? - Slots are resources for parallelization within a Spark application. Each slot, typically one per executor core, can run one task at a time.

What is a combination of a block of data and a set of transformations that runs on a single executor? - Task

What is a group of tasks that can be executed in parallel to compute the same set of operations on potentially multiple machines? - Stage
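
To make the relationship between slots, tasks, and stages concrete, here is a minimal plain-Python sketch. It is not Spark code and does not model Spark's real scheduler. It assumes one task per partition and one slot per executor core; the function name `waves_for_stage` and the example numbers are illustrative only.

```python
import math


def waves_for_stage(num_partitions: int, num_executors: int, cores_per_executor: int) -> int:
    """Estimate how many rounds ("waves") of tasks a stage needs.

    Assumptions (simplified model, not Spark's scheduler):
    - a stage launches one task per partition
    - each executor core provides one slot, and a slot runs one task at a time
    """
    slots = num_executors * cores_per_executor
    if slots <= 0:
        raise ValueError("at least one slot is required")
    return math.ceil(num_partitions / slots)


# 8 partitions on 4 single-core executors: 4 slots, so the stage needs 2 waves.
two_waves = waves_for_stage(num_partitions=8, num_executors=4, cores_per_executor=1)

# 4 partitions on the same cluster: every task gets a slot at once, so 1 wave.
one_wave = waves_for_stage(num_partitions=4, num_executors=4, cores_per_executor=1)

print(two_waves, one_wave)  # 2 1
```

Under these assumptions, a stage with more tasks than slots cannot process all of its partitions at the same time, because the extra tasks wait for a slot to free up.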

What is a shuffle? - A shuffle is the process by which data is compared across partitions, which requires moving rows between partitions and often between executors.

If you have a DataFrame with more partitions than you have (single-core) executors, what happens? - Performance will be suboptimal because not all data can be processed at the same time. Shuffle operations will create a large number of connections. There is increased overhead in managing resources for data processing for each task. There is an increased risk of out-of-memory errors, depending on the size of the executors.

Which of the following operations will trigger evaluation? A) df.filter() B) df.distinct() C) df.intersect() D) df.join() E) df.count() - E) df.count()
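
The difference between transformations (such as filter() and distinct()) and actions (such as count()) can be sketched with a toy class in plain Python. This is only a model of lazy evaluation, not how Spark implements it; the class `LazyFrame` and the helper `greater_than_one` are hypothetical names invented for this illustration.

```python
class LazyFrame:
    """Toy stand-in for a DataFrame: transformations are recorded, actions run them."""

    def __init__(self, rows, plan=()):
        self._rows = list(rows)
        self._plan = tuple(plan)  # pending transformations, nothing executed yet

    def filter(self, predicate):
        # Transformation: returns a new object with a longer plan, touches no data.
        return LazyFrame(self._rows, self._plan + (("filter", predicate),))

    def distinct(self):
        # Transformation: also only recorded.
        return LazyFrame(self._rows, self._plan + (("distinct", None),))

    def count(self):
        # Action: only now is the recorded plan executed.
        rows = self._rows
        for op, arg in self._plan:
            if op == "filter":
                rows = [r for r in rows if arg(r)]
            elif op == "distinct":
                rows = list(dict.fromkeys(rows))
        return len(rows)


calls = []


def greater_than_one(x):
    calls.append(x)  # record every time the predicate actually runs
    return x > 1


df = LazyFrame([1, 2, 2, 3, 4, 4])
planned = df.filter(greater_than_one).distinct()
calls_before_action = len(calls)  # the transformations alone ran nothing
result = planned.count()          # the action evaluates the whole plan
calls_after_action = len(calls)

print(calls_before_action, result, calls_after_action)  # 0 3 6
```

The predicate is not called until count() runs, which mirrors why df.count() is the only option in the question above that triggers evaluation.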

What is an out-of-memory error in Spark? - An out-of-memory error occurs when either the driver or an executor does not have enough memory to collect or process the data allocated to it.

Which of the following is the default storage level for persist() for a non-streaming DataFrame/Dataset? A) MEMORY_AND_DISK B) MEMORY_AND_DISK_SER C) DISK_ONLY D) MEMORY_ONLY_SER E) MEMORY_ONLY - A) MEMORY_AND_DISK

What is a broadcast variable? - A broadcast variable is entirely cached on each worker node so it doesn't need to be shipped or shuffled between nodes within each stage.

Which of the following operations is most likely to induce a skew in the size of your data's partitions? A) df.collect() B) df.cache() C) df.repartition(n) D) df.coalesce(n) E) df.persist() - D) df.coalesce(n)

What data structures are Spark DataFrames built on top of? - RDDs (resilient distributed datasets)

What is the code block needed to return a DataFrame containing only column 'storeId' and column 'division' from a DataFrame called 'storesDF'? - storesDF.select("storeId", "division")
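
Why coalesce(n) is more likely than repartition(n) to leave partitions uneven can be sketched with a simplified model in plain Python. The assumptions here are that coalesce merges existing partitions into contiguous groups without a shuffle, and that repartition performs a full shuffle that spreads rows evenly. Spark's actual grouping logic also considers data locality, so treat this as an illustration only; both function names are hypothetical.

```python
def coalesce_sizes(partition_sizes, n):
    """Merge existing partitions into n contiguous groups without splitting any of them."""
    if n >= len(partition_sizes):
        return list(partition_sizes)  # coalesce cannot increase the partition count
    merged = [0] * n
    for i, size in enumerate(partition_sizes):
        merged[i * n // len(partition_sizes)] += size
    return merged


def repartition_sizes(partition_sizes, n):
    """Full shuffle: rows are redistributed evenly over n new partitions."""
    total = sum(partition_sizes)
    base, extra = divmod(total, n)
    return [base + (1 if i < extra else 0) for i in range(n)]


sizes = [100, 100, 100, 100, 100]  # row counts of five evenly sized partitions
coalesced = coalesce_sizes(sizes, 2)
repartitioned = repartition_sizes(sizes, 2)

print(coalesced, repartitioned)  # [300, 200] [250, 250]
```

In this model, five equal partitions cannot be merged into two equal ones without splitting a partition, so the coalesced result is skewed while the shuffled result is balanced.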

What is the code that returns a new DataFrame from a DataFrame 'storesDF' where column 'numberOfManagers' is the constant integer 1? - storesDF.withColumn("numberOfManagers", lit(1))

Which of the following operations can be used to split an array column into an individual DataFrame row for each element in the array? A) extract() B) split() C) explode() D) arrays_zip() E) unpack() - C) explode()

What code returns a new DataFrame where column "storeCategory" is an all-lowercase version of column "storeCategory" in DataFrame "storesDF"? - storesDF.withColumn("storeCategory", lower(col("storeCategory")))

The code block shown below contains an error. The code block is intended to return a new DataFrame where column division from DataFrame storesDF has been renamed to column state and column managerName from DataFrame storesDF has been renamed to column managerFullName. Identify the error. Code block: (storesDF.withColumnRenamed("state", "division").withColumnRenamed("managerFullName", "managerName")) - The first argument to operation withColumnRenamed() should be the old column name and the second argument should be the new column name.

What is the code that returns a DataFrame where rows in DataFrame "storesDF" containing missing values in every column have been dropped? - storesDF.na.drop("all")

Which of the following operations fails to return a DataFrame where every row is unique? A) DataFrame.distinct()
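
For the withColumnRenamed() question above, a corrected version might look like the sketch below. The wrapper function `rename_store_columns` is a hypothetical name added for illustration; it expects an existing PySpark DataFrame with the columns named in the question. Note that withColumnRenamed() is a no-op when the existing column name is not found, so the erroneous version would likely return the DataFrame unchanged rather than raise an error.

```python
def rename_store_columns(storesDF):
    """Rename division -> state and managerName -> managerFullName.

    withColumnRenamed(existing, new): the existing column name comes first,
    the new column name second.
    """
    return (
        storesDF
        .withColumnRenamed("division", "state")
        .withColumnRenamed("managerName", "managerFullName")
    )
```

Calling `rename_store_columns(storesDF)` returns a new DataFrame and leaves storesDF itself untouched, since DataFrames are immutable.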

Fill in the blanks in the block below to return a new DataFrame with the mean of column 'sqft' from DataFrame 'storesDF' in column 'sqftMean'. storesDF.__1__(__2__(__3__).alias("sqftMean")) - 1) agg 2) mean 3) col("sqft")

Which of the following code blocks returns the number of rows in DataFrame 'storesDF'? A) storesDF.withColumn("numberOfRows", count()) B) storesDF.withColumn(count().alias("numberOfRows")) C) storesDF.countDistinct() D) storesDF.count() E) storesDF.agg(count()) - D) storesDF.count()

What is the code block which returns the sum of values in column 'sqft' in DataFrame 'storesDF' grouped by distinct values in column 'division'? - storesDF.groupBy("division").agg(sum(col("sqft")))

What is the code block which returns a DataFrame containing summary statistics only for column 'sqft' in DataFrame 'storesDF'? - storesDF.describe("sqft")

Which of the following operations can be used to sort the rows of a DataFrame? A) sort() and orderBy() B) orderby() C) sort() and orderby() D) orderBy() E) sort() - A) sort() and orderBy()

__1__.__2__.__3__ - 1) storesDF 2) first() 3) sqft

How do you print the schema of a DataFrame? - DataFrame.printSchema()

In what order should the lines of code below be run to create and register a SQL UDF named "ASSESS_PERFORMANCE" using the Python function 'assessPerformance' and apply it to column 'customerSatisfaction' in table 'stores'? Lines of code:
  1. spark.udf.register("ASSESS_PERFORMANCE", assessPerformance)
  2. spark.sql("SELECT customerSatisfaction, assessPerformance(customerSatisfaction) AS result FROM stores")
  3. spark.udf.register(assessPerformance, "ASSESS_PERFORMANCE")
  4. spark.sql("SELECT customerSatisfaction, ASSESS_PERFORMANCE(customerSatisfaction) AS result FROM stores")
- 1 -> 4
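
The two-step order for the SQL UDF question can be sketched as follows. The body of `assessPerformance` is not shown in the source, so the scoring rule below is purely hypothetical, as is the wrapper `register_and_query`. The wrapper expects an existing SparkSession and a table or view named `stores` to already be available.

```python
def assessPerformance(customerSatisfaction):
    """Hypothetical scoring rule; the source does not show the real function body."""
    if customerSatisfaction is None:
        return None
    return "good" if customerSatisfaction >= 4 else "needs improvement"


def register_and_query(spark):
    """Run the two steps in order on an existing SparkSession passed in as `spark`."""
    # Step 1: register the UDF. The SQL name comes first, the Python function second.
    spark.udf.register("ASSESS_PERFORMANCE", assessPerformance)
    # Step 2: call the UDF by its registered SQL name inside the query.
    return spark.sql(
        "SELECT customerSatisfaction, "
        "ASSESS_PERFORMANCE(customerSatisfaction) AS result FROM stores"
    )
```

With no return type given, spark.udf.register() treats a plain Python function as returning a string, which matches the string labels in this sketch.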