This exam guide prepares data engineers to design and maintain data pipelines. Coverage includes data modeling, ETL/ELT processes, data warehouses, streaming data, performance tuning, security, governance, and analytics enablement for business intelligence and machine learning use cases.

Question 1. Which Sqoop command option is used to import only the rows that satisfy a specific condition?
A) --where
B) --target-dir
C) --as-parquetfile
D) --split-by
Answer: A
Explanation: The --where option lets you specify a SQL WHERE clause so that only rows meeting the condition are imported.

Question 2. In Flume, what component is responsible for temporarily storing events before they are delivered to the sink?
A) Source
B) Channel
C) Sink
D) Interceptor
Answer: B
Explanation: A channel acts as a buffer between the source and sink, holding events until they can be written to the sink.

Question 3. Which Hadoop file system command lists the contents of a directory in HDFS?
A) hdfs dfs -ls
B) hdfs dfs -put
C) hdfs dfs -rm
D) hdfs dfs -cat
Answer: A
Explanation: -ls displays the files and sub-directories within a given HDFS path.

Question 4. When converting a CSV file to Parquet using Spark, which option provides the most efficient columnar storage?
A) spark.read.csv(...).write.parquet(...)
B) spark.read.text(...).write.parquet(...)
C) spark.read.json(...).write.parquet(...)
D) spark.read.parquet(...).write.parquet(...)
Answer: A
Explanation: Reading the CSV as a DataFrame and writing it directly to Parquet yields compressed columnar storage. Column types are inferred only when the inferSchema option is enabled; otherwise every CSV column is read as a string.

Question 5. Which compression codec offers the best trade-off between speed and compression ratio for Hive tables stored as ORC?
A) Gzip
B) Snappy
C) Bzip
D) LZO
Answer: B
Explanation: Snappy is optimized for fast compression/decompression with reasonable size reduction, making it a common choice for ORC.

Question 6. In data cleaning, which Spark function is used to drop rows that contain any null values?
A) na.drop()
B) filter(isnull)
C) dropna()
B) flatMap()
C) withColumn()
D) groupBy()
Answer: C
Explanation: withColumn() adds or replaces a column; you can apply a UDF to compute the new “point” column.

Question 10. In Kafka, what term describes a logical grouping of consumers that jointly read the partitions of a topic?
A) Topic
B) Broker
C) Consumer group
D) Cluster
Answer: C
Explanation: All consumers in a consumer group share the load of reading the partitions of a topic; each partition is read by only one consumer in the group at a time.

Question 11. Which Sqoop option specifies the number of map tasks to use during import?
A) -m or --num-mappers
B) --split-by
C) --fetch-size
D) --direct
Answer: A
Explanation: -m or --num-mappers sets the parallelism level for the import job.
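
The load sharing described in Question 10 can be illustrated with a minimal Python sketch. It is a simplified round-robin scheme written for illustration only, not Kafka's actual assignor implementation; the function name `assign_partitions` and the consumer names are hypothetical.

```python
from collections import defaultdict


def assign_partitions(partitions, consumers):
    """Spread partitions over the consumers of one group, round-robin.

    Illustrative only: real Kafka assignors also handle rebalances,
    multiple topics, and sticky assignments.
    """
    assignment = defaultdict(list)
    for index, partition in enumerate(sorted(partitions)):
        # Each partition goes to exactly one consumer in the group.
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return dict(assignment)


# Six partitions of one topic shared by a group of three consumers.
result = assign_partitions(range(6), ["c1", "c2", "c3"])
print(result)
```

The point of the sketch is that no partition appears under two consumers, which is why adding consumers beyond the partition count leaves the extras idle.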
Question 12. When loading data into HDFS, which command would you use to change the permission of a directory to be readable and writable by the owner only?
A) hdfs dfs -chmod 700 /dir
B) hdfs dfs -chmod 777 /dir
C) hdfs dfs -chown root /dir
D) hdfs dfs -setrep 3 /dir
Answer: A
Explanation: chmod 700 gives read, write, execute permissions only to the owner.

Question 13. Which Hive function can be used to calculate the standard deviation of a numeric column?
A) STDDEV_POP()
B) VARIANCE()
C) AVG()
D) MEDIAN()
Answer: A
Explanation: STDDEV_POP() computes the population standard deviation across all rows.

Question 14. In Spark, which storage level persists data only in memory without replication?
A) MEMORY_ONLY_SER
B) MEMORY_ONLY
C) DISK_ONLY
D) MEMORY_AND_DISK_
Answer: B
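
As a quick check on the octal modes in Question 12, the following Python sketch decodes a mode string into the familiar rwx notation using the standard library. It runs against no cluster; the helper name `describe_mode` is hypothetical, and the directory flag is added only so the output resembles a directory listing.

```python
import stat


def describe_mode(octal_mode: str) -> str:
    """Render an octal permission string such as '700' in rwx notation."""
    # Interpret the digits as octal and mark the entry as a directory.
    mode = int(octal_mode, 8) | stat.S_IFDIR
    return stat.filemode(mode)


owner_only = describe_mode("700")   # option A
world_open = describe_mode("777")   # option B
print(owner_only, world_open)
```

Mode 700 leaves the group and other permission triplets empty, whereas 777 opens the directory to everyone.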
Answer: A
Explanation: Adding a field with a default value ensures older readers can ignore the new field.

Question 18. In Hive, which clause is used to limit the number of rows returned by a query?
A) TOP
B) LIMIT
C) ROWNUM
D) FETCH FIRST
Answer: B
Explanation: LIMIT restricts the result set to the specified number of rows.

Question 19. Which Oozie action type is used to run a MapReduce job?
A) shell
B) java
C) map-reduce
D) spark
Answer: C
Explanation: The map-reduce action defines a Hadoop MapReduce job within an Oozie workflow.

Question 20. When using Flume to write data to HDFS, which sink type automatically rolls files based on time or size?
A) HDFS Sink
B) File Channel
C) Avro Sink
D) HDFS Sink with RollPolicy
Answer: D
Explanation: The HDFS sink's roll settings (for example hdfs.rollInterval and hdfs.rollSize) control when a file is closed and a new one opened, based on time or file size.

Question 21. Which Spark SQL function can be used to explode an array column into multiple rows?
A) flatten()
B) posexplode()
C) explode()
D) split()
Answer: C
Explanation: explode() creates a new row for each element in the array.

Question 22. What is the purpose of the --as-avrodatafile option in Sqoop import?
A) Imports data as a sequence file
B) Stores data in Avro format with schema embedded
C) Compresses data using Gzip
D) Splits data by a column
Answer: B
Explanation: --as-avrodatafile writes the imported data as Avro files, preserving the schema.

Question 23. Which Hive setting controls the number of reducers for a query that contains a GROUP BY?
Answer: B
Explanation: ORC stores columnar statistics (min, max, ndv) that the optimizer can use.

Question 27. In Spark, which API is used to read data from a Hive table directly?
A) spark.read.format("hive")
B) spark.sql("SELECT …") after enabling Hive support
C) spark.read.hiveTable()
D) spark.read.jdbc()
Answer: B
Explanation: With .enableHiveSupport(), spark.sql() can query Hive tables as if they were Spark tables.

Question 28. Which of the following is a valid use of Apache Atlas?
A) Real-time data streaming
B) Metadata catalog and lineage tracking
C) Cluster resource scheduling
D) Data compression
Answer: B
Explanation: Atlas provides metadata management, data lineage, and governance capabilities.

Question 29. Which Hadoop command removes a directory and all its contents from HDFS?
A) hdfs dfs -rm /dir
B) hdfs dfs -rmdir /dir
C) hdfs dfs -rm -r /dir
D) hdfs dfs -delete /dir
Answer: C
Explanation: -rm -r recursively deletes the directory and its files.

Question 30. When using Sqoop to export data from HDFS to a relational database, which option specifies the target table?
A) --table
B) --export-dir
C) --target-dir
D) --columns
Answer: A
Explanation: --table identifies the database table that receives the exported data.

Question 31. Which Spark configuration property sets the default number of shuffle partitions?
A) spark.sql.shuffle.partitions
B) spark.default.parallelism
C) spark.shuffle.compress
D) spark.executor.instances
Answer: A
Explanation: spark.sql.shuffle.partitions controls how many partitions are created during shuffles.
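
To show what the partition count in Question 31 governs, here is a small Python sketch of hash partitioning, the general idea behind a shuffle: each key is hashed and mapped to one of a fixed number of partitions. It is a toy model, not Spark code; CRC32 stands in for Spark's internal hash function, and `shuffle_partition` and the `user_` keys are hypothetical.

```python
import zlib
from collections import Counter


def shuffle_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition index by hashing (CRC32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Assumed partition count for the sketch; Spark's own default
# for spark.sql.shuffle.partitions is 200.
NUM_PARTITIONS = 8

keys = [f"user_{i}" for i in range(1000)]
counts = Counter(shuffle_partition(k, NUM_PARTITIONS) for k in keys)
print(sorted(counts.items()))
```

Rows with the same key always land in the same partition, so raising or lowering the partition count changes how many rows each shuffle task has to process.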
Question 35. Which Spark method triggers execution of a lazy transformation chain?
A) transform()
B) collect()
C) map()
D) filter()
Answer: B
Explanation: collect() forces the DAG to be computed and returns results to the driver.

Question 36. In Oozie, which element defines a conditional branch based on the exit code of a previous action?
A)
B)
C)
D)
Answer: C
Explanation: contains statements that route workflow based on exit codes or other conditions.

Question 37. Which Hive setting controls whether a query can read from external tables without explicit permission?
A) hive.exec.dynamic.partition.mode
B) hive.security.authorization.enabled
C) hive.metastore.warehouse.dir
D) hive.exec.max.dynamic.partitions
Answer: B
Explanation: Enabling hive.security.authorization.enabled makes Hive enforce permission checks on external tables.

Question 38. When using Spark Structured Streaming, which sink writes data to a file system in a transactional manner?
A) console
B) memory
C) file
D) foreachBatch
Answer: C
Explanation: The file sink writes micro-batches as files, committing each batch atomically.

Question 39. Which command in Sqoop is used to import a table into HDFS as a sequence file?
A) --as-sequencefile
B) --as-avrodatafile
C) --as-parquetfile
D) --as-textfile
Answer: A
Explanation: --as-sequencefile stores the imported data in Hadoop’s SequenceFile format.

Question 40. In Hive, which function can be used to convert a string column containing JSON into a struct?
A) from_json()
B) json_tuple()
C) parse_json()
C) CREATE TABLE t (… ) STORED AS PARQUET;
D) CREATE TABLE t (… ) WITH SERDEPROPERTIES …;
Answer: A
Explanation: PARTITIONED BY specifies the column used for partitioning.

Question 44. In Spark, which method is used to cache a DataFrame in memory for repeated access?
A) persist()
B) checkpoint()
C) cache()
D) repartition()
Answer: C
Explanation: cache() is shorthand for persist() with the default storage level. For a DataFrame that default is MEMORY_AND_DISK, so partitions that do not fit in RAM spill to disk; for an RDD the default is MEMORY_ONLY.

Question 45. Which Flume channel type provides guaranteed delivery at the cost of higher latency?
A) MemoryChannel
B) FileChannel
C) JDBCChannel
D) KafkaChannel
Answer: B
Explanation: FileChannel writes events to disk, ensuring durability even if the agent crashes.

Question 46. When using Hive’s INSERT OVERWRITE statement, what happens to the existing data in the target directory?
A) It is appended to the existing data.
B) It is deleted before the new data is written.
C) It is moved to a backup location.
D) Hive throws an error.
Answer: B
Explanation: INSERT OVERWRITE replaces the contents of the target directory, removing previous files.

Question 47. Which Spark API is optimal for performing iterative machine-learning algorithms that repeatedly reuse the same dataset?
A) RDD API with checkpointing
B) DataFrame API with caching
C) Spark Streaming API
D) GraphX API
Answer: B
Explanation: Caching a DataFrame keeps data in memory across iterations, reducing recomputation.

Question 48. Which command in Oozie is used to submit a new workflow job?
A) oozie job -run
B) oozie jobs -submit
C) oozie job -submit -config workflow.xml
D) oozie admin -jobs -submit
Answer: C
Explanation: oozie job -submit with the -config parameter submits a job using the supplied job configuration, which identifies the workflow to launch. A submitted job waits in PREP status until it is started, whereas -run submits and starts it in one step.
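
The benefit of caching described in Question 47 can be seen in a toy Python model. This is not the Spark API: `LazyDataset` and `run_iterations` are hypothetical stand-ins that only count how often the underlying data has to be recomputed across iterations.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (not a Spark class)."""

    def __init__(self, loader):
        self._loader = loader
        self._cached = None
        self._use_cache = False
        self.load_count = 0  # how many times the data was recomputed

    def cache(self):
        # Mark the dataset so the first materialization is kept around.
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.load_count += 1
        data = self._loader()
        if self._use_cache:
            self._cached = data
        return data


def run_iterations(dataset, iterations):
    """Reuse the same dataset on every pass, as an iterative algorithm would."""
    total = 0
    for _ in range(iterations):
        total += sum(dataset.collect())
    return total


uncached = LazyDataset(lambda: list(range(100)))
cached = LazyDataset(lambda: list(range(100))).cache()

uncached_total = run_iterations(uncached, 5)
cached_total = run_iterations(cached, 5)
print(uncached.load_count, cached.load_count)
```

Both runs produce the same result, but the uncached dataset is rebuilt on every iteration while the cached one is built once.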
Question 52. Which Hive command removes a table’s metadata but retains the underlying HDFS data?
A) DROP TABLE table_name;
B) DROP TABLE table_name PURGE;
C) DROP TABLE table_name EXTERNAL;
D) DROP TABLE table_name IF EXISTS;
Answer: C
Explanation: Declaring a table as EXTERNAL and dropping it removes only the metadata; the data files stay intact.

Question 53. Which Spark function is used to convert a DataFrame column of strings to timestamps given a format?
A) to_date()
B) to_timestamp()
C) unix_timestamp()
D) date_format()
Answer: B
Explanation: to_timestamp(col, format) parses strings into TimestampType using the supplied pattern.

Question 54. In Flume, what is the purpose of an interceptor?
A) To route events to multiple sinks
B) To modify or drop events before they reach the channel
C) To store events in HDFS
D) To provide authentication for sources
Answer: B
Explanation: Interceptors can transform, enrich, or filter events before they are placed in a channel.

Question 55. Which Hive DDL statement adds a new column with a default value to an existing table?
A) ALTER TABLE tbl ADD COLUMNS (col INT DEFAULT 0);
B) ALTER TABLE tbl MODIFY COLUMN col INT DEFAULT 0;
C) ALTER TABLE tbl ADD COLUMNS (col INT);
D) ALTER TABLE tbl CHANGE col col INT DEFAULT 0;
Answer: C
Explanation: Hive’s ADD COLUMNS adds new columns; default values are not stored, but existing rows will have nulls which can be handled later.

Question 56. Which Spark configuration controls the maximum size of a single partition when reading from HDFS?
A) spark.hadoop.fs.local.block.size
B) spark.sql.files.maxPartitionBytes
C) spark.default.parallelism
D) spark.sql.shuffle.partitions
Answer: B
Explanation: spark.sql.files.maxPartitionBytes sets the target size for each file partition.

Question 57. Which Oozie coordinator action is used to trigger a workflow based on time-based schedules?
A)
B)