






















































































This exam tests knowledge of Hadoop architecture, HDFS, MapReduce programming, cluster configuration, data ingestion, and ecosystem tools such as Hive, Pig, Sqoop, Flume, and HBase. Learners solve coding scenarios, performance tuning problems, and data transformation tasks. The exam ensures understanding of distributed computing principles, job workflows, fault tolerance, and designing efficient big-data pipelines for large-scale processing.
Typology: Exams























































































Question 1. Which of the following best describes the “Velocity” characteristic of Big Data?
A) The size of data sets is extremely large.
B) Data is generated and processed at high speed.
C) Data comes in many different formats.
D) The accuracy and trustworthiness of data.
Answer: B
Explanation: Velocity refers to the rapid rate at which data is created, collected, and processed, requiring real‑time or near‑real‑time handling.

Question 2. In the Hadoop ecosystem, which component is primarily responsible for resource management and job scheduling?
A) HDFS
B) YARN
C) Hive
D) Pig
Answer: B
Explanation: YARN (Yet Another Resource Negotiator) decouples resource management from processing, handling allocation of containers and scheduling jobs.

Question 3. What is the default block size in HDFS for most Hadoop distributions?
A) 64 MB
B) 128 MB
C) 256 MB
D) 512 MB
Answer: B
Explanation: Modern Hadoop distributions set the default HDFS block size to 128 MB, though it can be configured per file or cluster.

Question 4. Which Hadoop component provides a high‑level declarative language similar to SQL for querying data stored in HDFS?
A) Pig
B) HBase
C) Hive
D) Sqoop
Answer: C
Explanation: Hive offers HiveQL, a SQL‑like language that translates queries into MapReduce, Tez, or Spark jobs.

Question 5. In HDFS, what is the role of the Secondary NameNode?
A) It provides a hot standby for the active NameNode.
B) It periodically merges the edit log with the FsImage.
C) It stores block replicas on secondary storage devices.
D) It balances data across DataNodes.
Answer: B
Explanation: The Secondary NameNode checkpoints the namespace by merging the edit log with the FsImage, reducing NameNode startup time.

Question 6. Which of the following statements about Hadoop’s “rack awareness” is true?
A) All replicas are placed on the same rack to reduce network latency.
B) Replicas are placed on different racks to improve fault tolerance.
A) Map phase
B) Shuffle phase
C) Combine phase
D) Partition phase
Answer: B
Explanation: The shuffle phase transfers mapper output to reducers and sorts it by key, ensuring reducers receive ordered data.

Question 10. What is the purpose of a Combiner in a MapReduce job?
A) To partition data across reducers.
B) To aggregate mapper output locally before shuffling.
C) To serialize data for network transfer.
D) To generate final output files.
Answer: B
Explanation: A Combiner performs a mini‑reduce on each mapper’s output, reducing the amount of data transferred across the network.

Question 11. Which of the following Hadoop components enables real‑time ingestion of log data into HDFS?
A) Sqoop
B) Flume
C) Oozie
D) Zookeeper
Answer: B
Explanation: Apache Flume is designed for high‑volume, real‑time log data collection and delivery to HDFS or other storage.

Question 12. In YARN, what entity negotiates resources on behalf of an application?
A) ResourceManager
B) NodeManager
C) ApplicationMaster
D) JobTracker
Answer: C
Explanation: The ApplicationMaster communicates with the ResourceManager to request containers and monitors their execution.

Question 13. Which YARN component runs on each worker node and manages container life cycles?
A) ResourceManager
B) NodeManager
C) ApplicationMaster
D) JobTracker
Answer: B
Explanation: The NodeManager is responsible for launching, monitoring, and terminating containers on its node.

Question 14. Which of the following best describes a “managed table” in Hive?
A) The table data is stored outside of Hive’s warehouse directory.
B) Hive controls both the metadata and the data files.
Question 17. In Pig Latin, which data type represents an unordered collection of tuples?
A) Tuple
B) Bag
C) Map
D) Chararray
Answer: B
Explanation: A Bag is an unordered multiset of tuples, analogous to a table in relational databases.

Question 18. Which Pig statement is used to define a user‑defined function (UDF) written in Java?
A) DEFINE myfunc org.example.MyUDF();
B) REGISTER 'myudf.jar';
C) LOAD myfunc USING org.example.MyUDF();
D) STORE myfunc AS org.example.MyUDF();
Answer: A
Explanation: The DEFINE statement associates a name with a Java class that implements the UDF interface.

Question 19. What is the primary storage unit in HBase?
A) Row key
B) Column family
C) Cell (column qualifier + timestamp)
D) Region
Answer: C
Explanation: An HBase cell stores a value identified by row key, column family, column qualifier, and timestamp.

Question 20. Which HBase component is responsible for serving read/write requests?
A) HMaster
B) RegionServer
C) ZooKeeper
D) HDFS NameNode
Answer: B
Explanation: RegionServers host regions and handle client read/write operations.

Question 21. In Spark, which abstraction represents an immutable, distributed collection of objects that can be operated on in parallel?
A) DataFrame
B) Dataset
C) RDD
D) DStream
Answer: C
Explanation: Resilient Distributed Datasets (RDDs) are the foundational immutable collections in Spark.

Question 22. Which Spark transformation is lazy and returns a new RDD without immediately executing the computation?
A) collect()
B) reduce()
C) map()
A) A distributed dataset of static files.
B) A continuous sequence of RDDs representing streaming data.
C) A configuration object for Spark contexts.
D) A machine‑learning model.
Answer: B
Explanation: DStreams (Discretized Streams) are series of RDDs generated at each batch interval from live data sources.

Question 26. Which MLlib algorithm is most appropriate for clustering unlabeled data?
A) Linear Regression
B) Decision Tree
C) K‑Means
D) Naïve Bayes
Answer: C
Explanation: K‑Means partitions data into k clusters without needing labeled outcomes.

Question 27. Which component of Hadoop provides workflow scheduling and coordination of multiple jobs?
A) Oozie
B) Zookeeper
C) Flume
D) Ambari
Answer: A
Explanation: Apache Oozie manages complex job workflows, handling dependencies, triggers, and scheduling.
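As a rough illustration of the K‑Means idea in Question 26, the sketch below alternates between assigning points to their nearest centroid and recomputing each centroid as the mean of its cluster. It is plain Python, not MLlib; the function name `kmeans_1d`, the sample points, and the starting centroids are all made up for illustration. No labels appear anywhere in the input.

```python
def kmeans_1d(points, centroids, max_iters=100):
    """Cluster 1-D points with Lloyd's algorithm (illustrative sketch only)."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        updated = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:  # nothing moved, so we have converged
            break
        centroids = updated
    return centroids, clusters


# Hypothetical unlabeled data with two visible groups.
points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(centroids)  # [1.5, 10.0]
print(clusters)   # [[1.0, 1.5, 2.0], [9.0, 10.0, 11.0]]
```

The result depends on the starting centroids, which is why practical implementations usually pick them with some randomized strategy rather than fixed values as done here.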
Question 28. When configuring HDFS high availability, which two NameNode modes are typically deployed?
A) Active and Passive
B) Primary and Secondary
C) Active and Standby
D) Master and Slave
Answer: C
Explanation: HA uses an Active NameNode serving client requests and a Standby NameNode ready to take over on failure.

Question 29. Which Hadoop configuration file contains settings for the ResourceManager and NodeManager?
A) core-site.xml
B) hdfs-site.xml
C) yarn-site.xml
D) mapred-site.xml
Answer: C
Explanation: yarn-site.xml holds YARN-specific configuration, including RM and NM properties.

Question 30. In a Hadoop cluster, which command is used to check the health of the HDFS file system?
A) hdfs dfsadmin -report
B) hdfs fsck /
C) hdfs balancer
B) Kerberos
C) SSL/TLS
D) OAuth
Answer: B
Explanation: Kerberos issues tickets that both client and service verify, ensuring secure authentication across the cluster.

Question 34. Which HDFS command changes the permission of a directory to read‑only for all users?
A) hdfs dfs -chmod 444 /path
B) hdfs dfs -chown 444 /path
C) hdfs dfs -setrep 444 /path
D) hdfs dfs -chmod 777 /path
Answer: A
Explanation: chmod 444 sets read‑only permissions for owner, group, and others.

Question 35. What does the Hadoop “DistCp” utility primarily accomplish?
A) Copy data between HDFS clusters efficiently.
B) Distribute a compiled JAR to all nodes.
C) Perform a distributed checksum verification.
D) Dynamically balance block distribution.
Answer: A
Explanation: DistCp (distributed copy) leverages MapReduce to copy large datasets between clusters or within a cluster.
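To see why 444 in Question 34 means read‑only for everyone, it can help to decode the octal digits by hand. The following is a small plain-Python sketch; the helper name `describe_mode` is invented for this example and is not part of any Hadoop tool.

```python
def describe_mode(octal_mode):
    """Render a three-digit octal mode such as '444' as rwx triplets."""
    flags = "rwx"
    parts = []
    for digit in octal_mode:  # owner, group, others
        bits = int(digit, 8)
        # Bit value 4 = read, 2 = write, 1 = execute.
        parts.append("".join(
            flag if bits & (4 >> i) else "-"
            for i, flag in enumerate(flags)
        ))
    return "".join(parts)


print(describe_mode("444"))  # r--r--r--
print(describe_mode("777"))  # rwxrwxrwx
```

Each digit covers one class of user, so 444 grants only the read bit to owner, group, and others, while 777 (option D) grants everything.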
Question 36. Which of the following best characterizes a “cold” data node in a Hadoop cluster?
A) A node that stores frequently accessed data.
B) A node that has been shut down for maintenance.
C) A node that stores infrequently accessed archival data.
D) A node that runs only the ResourceManager.
Answer: C
Explanation: “Cold” storage refers to data that is rarely accessed, often moved to cheaper, slower disks.

Question 37. In Spark, which API provides compile‑time type safety and functional transformations?
A) RDD API
B) DataFrame API
C) Dataset API
D) SQL API
Answer: C
Explanation: The Dataset API combines the benefits of RDDs (type safety) with the optimization of DataFrames.

Question 38. Which of the following statements about Hadoop’s “map‑side join” is true?
A) It requires the same number of reducers as mappers.
B) It can be used when the small dataset fits into memory on each mapper.
C) It is only possible with Hive, not raw MapReduce.
D) It always improves performance regardless of dataset sizes.
C) Replicating a region across multiple nodes.
D) Merging two regions into one.
Answer: B
Explanation: When a region grows beyond a configured size, HBase automatically splits it to improve load balancing.

Question 42. Which tool would you use to schedule a periodic Hive query that runs every night at 02:00?
A) Oozie
B) Flume
C) Zookeeper
D) Ambari
Answer: A
Explanation: Oozie can define time‑based workflows, enabling scheduled Hive queries.

Question 43. In Spark, which transformation results in a new RDD that contains only distinct elements from the source RDD?
A) filter()
B) distinct()
C) union()
D) groupByKey()
Answer: B
Explanation: distinct() removes duplicate elements, returning an RDD of unique values.
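The behavior described in Question 43 can be pictured as a map step followed by a reduce-by-key step. The sketch below is plain Python rather than Spark code; it only mimics that pattern on an in-memory list, and the function name and sample data are illustrative assumptions.

```python
def distinct(elements):
    """Deduplicate via a map -> reduce-by-key -> map pipeline (sketch)."""
    # Map: pair every element with a placeholder value.
    pairs = [(x, None) for x in elements]
    # Reduce by key: keep a single entry per distinct key.
    reduced = {}
    for key, value in pairs:
        reduced.setdefault(key, value)
    # Map: drop the placeholder and keep only the keys.
    return list(reduced.keys())


data = [3, 1, 3, 2, 1]
unique = distinct(data)
print(sorted(unique))  # [1, 2, 3]
```

The output is sorted before printing because a distributed distinct generally gives no ordering guarantee; only the set of values matters.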
Question 44. Which of the following best describes the purpose of the DistributedCache in Hadoop MapReduce?
A) To cache intermediate key/value pairs between map and reduce phases.
B) To make small read‑only files available to all nodes executing a job.
C) To store the final output of a job in memory.
D) To replicate HDFS blocks across racks.
Answer: B
Explanation: DistributedCache distributes files (e.g., lookup tables) to each node’s local filesystem for fast access during job execution.

Question 45. What is the default replication factor for HDFS blocks in a fresh Hadoop installation?
A) 1
B) 2
C) 3
D) 4
Answer: C
Explanation: By default, each block is replicated three times to ensure fault tolerance.

Question 46. Which of the following is a primary reason to use Apache Pig over raw MapReduce?
A) Pig provides a GUI for designing jobs.
B) Pig scripts are automatically compiled to Java code.
C) Pig Latin abstracts the low‑level MapReduce API, reducing development time.
D) Pig runs only on Windows clusters.
Answer: C
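To make the default replication factor from Question 45 concrete, the following back-of-the-envelope sketch combines it with the 128 MB default block size from Question 3. The function name `hdfs_footprint` and the file sizes are hypothetical, and the calculation ignores metadata and any compression.

```python
import math

BLOCK_SIZE_MB = 128      # default block size cited in Question 3
REPLICATION_FACTOR = 3   # default replication factor from Question 45


def hdfs_footprint(file_size_mb,
                   block_size_mb=BLOCK_SIZE_MB,
                   replication=REPLICATION_FACTOR):
    """Return (blocks, block replicas, raw MB stored) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    # A partially filled last block only occupies its actual size.
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb


print(hdfs_footprint(1024))  # (8, 24, 3072)
print(hdfs_footprint(200))   # (2, 6, 600)
```

So a 1 GB file splits into 8 blocks, is stored as 24 block replicas, and consumes roughly 3 GB of raw cluster capacity under the defaults.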
B) registerTempTable()
C) cacheTable()
D) sqlContext.register()
Answer: A
Explanation: createOrReplaceTempView("viewName") makes the DataFrame accessible via Spark SQL statements.

Question 50. Which of the following best explains “schema on read” as used by Hadoop?
A) The schema is defined before data is loaded into storage.
B) Data is validated against a schema during ingestion.
C) The schema is applied when the data is read for processing.
D) Hadoop does not support schemas at any stage.
Answer: C
Explanation: “Schema on read” means that raw data is stored without a predefined schema; the schema is applied at query or processing time.

Question 51. Which command would you use to increase the replication factor of an existing file to 5?
A) hdfs dfs -setrep -w 5 /path/file
B) hdfs dfs -setrep -R 5 /path/file
C) hdfs dfs -chmod 5 /path/file
D) hdfs dfs -chown -R 5 /path/file
Answer: B
Explanation: -setrep with the value 5 changes the file's replication factor to five replicas. Note that in current Hadoop releases the -R flag is accepted only for backward compatibility and has no effect, while the -w flag in option A additionally makes the command wait until replication completes.
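The “schema on read” idea from Question 50 can be sketched in a few lines: the raw lines are stored untouched, and a schema is imposed only when they are read. This is plain Python with invented names (`read_with_schema`, the sample rows, and both schemas are assumptions for illustration), not the API of Hive or any other Hadoop tool.

```python
# Raw data is stored as-is; nothing is validated at write time.
raw_lines = ["1,alice,30", "2,bob,not_a_number"]


def read_with_schema(lines, schema):
    """Apply a (column name, type) schema while reading each line."""
    rows = []
    for line in lines:
        row = {}
        for (name, cast), value in zip(schema, line.split(",")):
            try:
                row[name] = cast(value)
            except ValueError:
                row[name] = None  # value does not fit the requested type
        rows.append(row)
    return rows


typed_schema = [("id", int), ("name", str), ("age", int)]
string_schema = [("id", str), ("name", str), ("age", str)]

typed_rows = read_with_schema(raw_lines, typed_schema)
string_rows = read_with_schema(raw_lines, string_schema)
print(typed_rows[1])   # {'id': 2, 'name': 'bob', 'age': None}
print(string_rows[1])  # {'id': '2', 'name': 'bob', 'age': 'not_a_number'}
```

The same stored bytes yield different results depending on the schema chosen at read time, and a bad value surfaces only when a reader asks for a type it cannot satisfy.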
Question 52. In Hive, which statement is true about a “bucketed” table?
A) Bucketing automatically partitions data by date.
B) Bucketing distributes rows into a fixed number of files based on a hash of a column.
C) Bucketing is only supported for ORC file format.
D) Bucketing eliminates the need for a primary key.
Answer: B
Explanation: Bucketing uses a hash of one or more columns to assign rows to a predefined number of buckets (files).

Question 53. Which Hadoop ecosystem tool is designed for bulk loading data into HBase from HDFS files?
A) Sqoop
B) Flume
C) ImportTsv
D) Oozie
Answer: C
Explanation: ImportTsv (part of the HBase bulk load utilities) reads TSV files from HDFS and loads them into HBase tables.

Question 54. Which of the following is a characteristic of a “cold standby” NameNode in HDFS HA?
A) It actively serves client read/write requests.
B) It maintains a synchronized edit log with the active NameNode.
C) It stores a full copy of all block metadata in memory.
D) It runs on the same host as the active NameNode.
Answer: B
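The hash-based assignment described in Question 52 can be sketched as "hash the bucketing column, then take the remainder modulo the bucket count". The code below is a plain-Python illustration under stated assumptions: the bucket count, the `bucket_for` helper, and the user IDs are invented, and the stand-in hash (the integer itself, masked to be non-negative) is not guaranteed to match what any particular Hive version computes.

```python
NUM_BUCKETS = 4  # hypothetical bucket count chosen for this example


def bucket_for(key, num_buckets=NUM_BUCKETS):
    """Map an integer key to a bucket index in [0, num_buckets)."""
    # Stand-in hash: use the integer itself, masked to stay non-negative.
    return (key & 0x7FFFFFFF) % num_buckets


user_ids = [101, 102, 103, 104, 205]
buckets = {b: [] for b in range(NUM_BUCKETS)}
for uid in user_ids:
    buckets[bucket_for(uid)].append(uid)

print(buckets)  # {0: [104], 1: [101, 205], 2: [102], 3: [103]}
```

Because the bucket count is fixed, rows with the same key always land in the same bucket, which is what makes bucketed tables useful for sampling and for joins on the bucketing column.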