


Big data chapter 6: MapReduce Algorithms - Graph Algorithms Exam 2025/2026 – Solved Questions and Answers
Typology: Exams



Define the notion of graphs. Why do we consider graphs a big data case? - A graph is defined as G = (V, E), where V is a set of nodes and E is a set of edges connecting them. Graphs can be directed or undirected, cyclic or acyclic, and may carry attributes on nodes and edges. Graphs are a big data case because real-world examples such as the web or social networks involve billions of nodes and edges, making them too large to store and process on a single machine.

How can we represent a graph? Which representation is considered appropriate for the MapReduce framework? - Graphs can be represented using adjacency matrices or adjacency lists. An adjacency matrix is an n×n grid in which entry (i, j) indicates whether an edge exists from node i to node j. An adjacency list stores only each node's neighbors, which makes it far more compact and scalable for sparse graphs. Adjacency lists are the appropriate representation for MapReduce: in-link computations are done by grouping on the destination node, and graph inversion reduces to emitting each edge as a (destination, source) pair.

Describe Dijkstra's solution for the SSSP problem. - Dijkstra's algorithm finds the shortest paths from a source node to all other nodes, assuming non-negative edge weights. It starts with all distances set to ∞, except the source (0), and keeps the nodes in a priority queue ordered by distance. At each step, it removes the node with the smallest distance, updates each neighbor whose distance can be shortened through that node, and repeats until all nodes are processed.

How can we perform BFS with the MapReduce framework? How many iterations are needed and why? - Parallel BFS in MapReduce advances the frontier one hop per iteration, so exploring the full graph takes multiple MapReduce jobs. Mappers emit tentative distances for the neighbors of each reached node; reducers pick the shortest distance per node and update the node. With equal edge weights the number of iterations equals the graph's diameter, because a node at hop distance k from the source is first reached in iteration k. With arbitrary positive weights up to |V|−1 iterations may be needed, because a shortest path can contain up to |V|−1 edges. The algorithm stops when distances stop changing, which is tracked via Hadoop counters and checked by a driver program.

Describe the MapReduce SSSP algorithm. How many iterations are needed? What termination criterion do we use? How can it be implemented? - The MapReduce SSSP algorithm finds shortest paths from a source by iteratively updating distance estimates in parallel. Mappers emit each node's own record (distance and adjacency list) plus a tentative distance for every neighbor; reducers keep the smallest distance per node and update the node if it improved. Because each job's output is the next job's input, the graph structure must be passed through every iteration. The number of iterations depends on the edge weights: up to the graph's diameter for equal weights, or up to |V|−1 for positive weights. Termination is based on convergence: a Hadoop counter records how many distances were updated in the job, and the driver program submits another iteration only while that counter is non-zero.
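
The map, reduce, and driver roles described above can be sketched as a single-process simulation. This is a minimal illustration, not Hadoop code: the function names, the `"NODE"`/`"DIST"` tags, and the example graph are all assumptions made for this sketch (the graph is not the one from the exercise in these notes), and the `updates` variable merely stands in for a Hadoop counter.

```python
import math
from collections import defaultdict

INF = math.inf

def sssp_map(node_id, node):
    """Mapper: pass the graph structure along and emit tentative distances."""
    dist, adj = node
    yield node_id, ("NODE", dist, adj)        # keeps the adjacency list alive
    if dist < INF:                            # only reached nodes can relax edges
        for neighbor, weight in adj:
            yield neighbor, ("DIST", dist + weight)

def sssp_reduce(node_id, values):
    """Reducer: keep the smallest distance seen for this node."""
    old_dist, adj, best = INF, [], INF
    for value in values:
        if value[0] == "NODE":
            old_dist, adj = value[1], value[2]
        else:
            best = min(best, value[1])
    new_dist = min(old_dist, best)
    return node_id, (new_dist, adj), new_dist < old_dist

def run_iteration(graph):
    """One simulated job: map, shuffle (group by key), reduce."""
    groups = defaultdict(list)
    for node_id, node in graph.items():
        for key, value in sssp_map(node_id, node):
            groups[key].append(value)
    new_graph, updates = {}, 0
    for key, values in groups.items():
        node_id, node, changed = sssp_reduce(key, values)
        new_graph[node_id] = node
        updates += changed                    # stand-in for a Hadoop counter
    return new_graph, updates

def sssp(adjacency, source):
    """Driver: resubmit the job until no distance changes."""
    graph = {n: (0 if n == source else INF, adj) for n, adj in adjacency.items()}
    iterations = 0
    while True:
        graph, updates = run_iteration(graph)
        iterations += 1
        if updates == 0:
            break
    return {n: d for n, (d, _) in graph.items()}, iterations

# Hypothetical weighted directed graph: node -> [(neighbor, weight), ...]
example_graph = {
    0: [(1, 10), (2, 5)],
    1: [(3, 1)],
    2: [(1, 3), (3, 9), (4, 2)],
    3: [(4, 4)],
    4: [(0, 7), (3, 6)],
}
distances, iterations = sssp(example_graph, 0)
# distances == {0: 0, 1: 8, 2: 5, 3: 9, 4: 7}, reached after 4 jobs
```

The last of the four jobs changes nothing and exists only to confirm convergence. Every job also re-emits distances for all reached nodes, which is the redundant work that a priority queue lets Dijkstra's algorithm avoid.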
Using the respective MapReduce algorithm, compute the SSSPs from node 0 to all nodes. Assume the employment of 2 mappers and 2 reducers. The first mapper is presented with nodes 0,1,2, and the second mapper with nodes 3,4. Suppose that the partitioner sends nodes 0,1,2 to the first reducer, while the rest are sent to the second reducer. - see Google Docs

Compare the MapReduce solution to Dijkstra's solution. - Dijkstra's algorithm is more efficient for SSSP because it uses a priority queue to expand only the node with the current minimum distance, so each node is settled once and no work is wasted. In contrast, MapReduce has no global data structure such as a priority queue. It explores all paths in parallel, and each iteration recomputes distances across the whole graph, which leads to redundant computation.

What is a random walk? - A random walk moves from one point to another in a state space, where each successive step is determined by a probability distribution. For example, in the "random surfer model", a user starts at a random web page and, with probability α, jumps to a completely random page or, with probability (1 − α), clicks a random link on the current page. This process continues indefinitely, simulating a random exploration of the web.

Suppose that an undirected graph is stored as a file of edges. Each line of the file is in the format (u, v), denoting that there is an edge between nodes u and v. Each edge in the graph occurs only once in the file. That is, for an edge between nodes u and v, the file contains only (u, v), u < v. Devise a MapReduce algorithm that finds the node with the maximum degree (i.e., maximum number of edges). - see Google Docs

Write a MapReduce algorithm to find in a directed graph G = (V, E), given as a collection of adjacency lists: a) The total number of reflexive links. A link is called reflexive if it connects a node to itself. b) All pairs of nodes that are reciprocally linked. A pair of nodes v, u is reciprocally linked if there exists a link from v to u and vice versa. - see Google Docs

Write a MapReduce program that takes, as input, a very large file containing a directed graph G = (V, E) as a collection of adjacency lists, and augments the adjacency list of each node v ∈ V with all nodes that are reachable in two steps from v. In other words, if there exists a path of edges v → w → u, then you should add to the adjacency list of v the entry u, unless it already exists. Important note: There should be no duplicate nodes in an adjacency list. - see Google Docs

Define PageRank. What does it represent? - PageRank is a probability distribution over pages that characterizes the fraction of time a random surfer spends on any given page. Equivalently, it gives the likelihood that a random walk arrives at a specific node. It captures notions of page importance: a high-quality page is endorsed by many other pages via incoming hyperlinks, and a page linked to by a high-quality page is itself likely to be of high quality.
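
The random surfer model and the PageRank definition above can be tied together with a short iterative sketch. It assumes the common formulation P(n) = α/|V| + (1 − α) · Σ P(m)/C(m), summed over the pages m linking to n, where C(m) is the out-degree of m. The function name, the default α = 0.15, the fixed iteration count, and the even redistribution of mass from pages without out-links are assumptions of this sketch, not details given in these notes. The graph is also assumed to list every node as a key.

```python
from collections import defaultdict

def pagerank(adjacency, alpha=0.15, iterations=50):
    """Iterative PageRank on {node: [out-neighbors]}; alpha is the random-jump probability."""
    nodes = list(adjacency)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}        # start from the uniform distribution
    for _ in range(iterations):
        # "Map" side: each node sends an equal share of its mass along every out-link.
        incoming = defaultdict(float)
        dangling = 0.0                        # mass held by nodes with no out-links
        for v in nodes:
            out = adjacency[v]
            if out:
                share = rank[v] / len(out)
                for u in out:
                    incoming[u] += share
            else:
                dangling += rank[v]
        # "Reduce" side: sum the shares per destination, then mix in the random jump.
        rank = {
            v: alpha / n + (1 - alpha) * (incoming[v] + dangling / n)
            for v in nodes
        }
    return rank

# Hypothetical four-page web: "c" is linked to by every other page.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
most_important = max(ranks, key=ranks.get)    # "c"
```

Page "d" has no incoming links, so the surfer reaches it only through random jumps and its rank stays at α/|V|. Page "c" collects mass from all three other pages and ends up with the highest rank.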
How can we exploit the graph topologies? What is the difficulty with this issue? - Graph partitioning aims to group nodes so that each group has many internal links and few external ones. Map tasks can then process each group locally and use combiners effectively, which reduces the data transferred across the network. The difficulty is that good partitioning is itself a hard problem, so in practice it is done with heuristics: for example, sorting social network nodes by zip code or school, or web pages by language or domain, since links within such groups are denser than links between them.
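
One way to judge such a heuristic is to measure the fraction of edges that stay inside a partition. The sketch below compares a domain-based grouping with an arbitrary round-robin assignment; the function name, the page names, and the edge list are all made up for illustration.

```python
def internal_edge_fraction(edges, group_of):
    """Fraction of edges whose two endpoints fall in the same partition."""
    internal = sum(1 for u, v in edges if group_of[u] == group_of[v])
    return internal / len(edges)

# Hypothetical web graph: pages link mostly within their own domain.
edges = [
    ("a.org/1", "a.org/2"), ("a.org/2", "a.org/3"), ("a.org/3", "a.org/1"),
    ("b.org/1", "b.org/2"), ("b.org/2", "b.org/1"),
    ("a.org/1", "b.org/1"),                   # the only cross-domain link
]
pages = sorted({page for edge in edges for page in edge})

# Heuristic from the text: group web pages by domain.
by_domain = {page: page.split("/")[0] for page in pages}

# Baseline: assign pages to two partitions round-robin, ignoring topology.
round_robin = {page: i % 2 for i, page in enumerate(pages)}

domain_fraction = internal_edge_fraction(edges, by_domain)     # 5 of 6 edges
baseline_fraction = internal_edge_fraction(edges, round_robin) # 1 of 6 edges
```

The higher the internal fraction, the more partial results a combiner can merge locally before anything is sent to the reducers.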