6 of the mapper code in Figure 5.4, instead of emitting d + 1 as the value, we must now emit d +
w where w is the edge distance. No other changes to the algorithm are required, but the termination
behavior is very different. To illustrate, consider the graph fragment in Figure 5.5, where s is the
source node, and in this iteration, we just “discovered” node r for the very first time. Assume for
the sake of argument that we’ve already discovered the shortest distance to node p, and that the
shortest distance to r so far goes through p. This, however, does not guarantee that we’ve
discovered the shortest distance to r, since there may exist a path going through q that we haven’t
encountered yet (because it lies outside the search frontier).6 However, as the search frontier
expands, we’ll eventually cover q and all other nodes along the path from p to q to r—which means
that with sufficient iterations, we will discover the shortest distance to r. But how do we know that
we’ve found the shortest distance to p? Well, if the shortest path to p lies within the search frontier,
we would have already discovered it. And if it doesn’t, the above argument applies. Similarly, we
can repeat the same argument for all nodes on the path from s to p. The conclusion is that, with
sufficient iterations, we’ll eventually discover all the shortest distances.
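To make the weighted variant concrete, here is a minimal single-process sketch of what the modified mapper and reducer compute. The function names, the ("NODE", ...) / ("DIST", ...) tagging, and the adjacency-list representation are illustrative assumptions, not the book's actual code from Figure 5.4; the key point is simply that the mapper emits d + w rather than d + 1.

```python
# Hypothetical sketch of the weighted parallel BFS mapper and reducer.
# Names and value tags are illustrative, not the book's Figure 5.4 code.

INF = float("inf")

def map_node(node_id, distance, adjacency_list):
    """Emit the node structure plus (neighbor, d + w) tentative distances."""
    # Pass the graph structure along so it survives the iteration.
    yield node_id, ("NODE", distance, adjacency_list)
    if distance < INF:  # only nodes already reached contribute distances
        for neighbor, w in adjacency_list:
            yield neighbor, ("DIST", distance + w)  # d + w, not d + 1

def reduce_node(node_id, values):
    """Recover the node structure and keep the minimum tentative distance."""
    best, adj = INF, []
    for tag, *rest in values:
        if tag == "NODE":
            d, adj = rest
            best = min(best, d)
        else:
            best = min(best, rest[0])
    return node_id, best, adj
```

A reducer receiving the node's previous record and a shorter tentative distance keeps the minimum, which is exactly the "discovery" step the argument above relies on.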
So exactly how many iterations does “eventually” mean? In the worst case, we might need as many
iterations as there are nodes in the graph minus one. In fact, it is not difficult to construct graphs
that will elicit this worst-case behavior: Figure 5.6 provides an example, with n1 as the source.
The parallel breadth-first search algorithm would not discover that the shortest path from n1 to n6
goes through n3, n4, and n5 until the fifth iteration. Three more iterations are necessary to cover
the rest of the graph. Fortunately, for most real-world graphs, such extreme cases are rare, and the
number of iterations necessary to discover all shortest distances is quite close to the diameter of
the graph, as in the unit edge distance case.
In practical terms, how do we know when to stop iterating in the case of arbitrary edge distances?
The algorithm can terminate when shortest distances at every node no longer change. Once again,
we can use counters to keep track of such events. Every time we encounter a shorter distance in
the reducer, we increment a counter. At the end of each MapReduce iteration, the driver program
reads the counter value and determines if another iteration is necessary.
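The driver logic can be sketched as follows in a single-process simulation. The graph, function names, and the `improved` variable (standing in for the framework's global counter) are illustrative assumptions; the example graph echoes the flavor of Figure 5.6 but is not a reproduction of it.

```python
# Single-process sketch of the driver loop: run map/reduce-style rounds
# over a weighted graph and stop once a round makes no improvement.
from collections import defaultdict

INF = float("inf")

def parallel_bfs(graph, source):
    """graph: {node: [(neighbor, weight), ...]}.
    Returns final distances and the number of iterations performed."""
    dist = {n: INF for n in graph}
    dist[source] = 0
    iterations = 0
    while True:
        iterations += 1
        improved = 0  # stand-in for the framework's global counter
        # "Map" phase: every reached node relaxes its outgoing edges.
        tentative = defaultdict(lambda: INF)
        for node, d in dist.items():
            if d == INF:
                continue
            for neighbor, w in graph[node]:
                tentative[neighbor] = min(tentative[neighbor], d + w)
        # "Reduce" phase: keep the minimum per node, counting improvements.
        for node, d in tentative.items():
            if d < dist[node]:
                dist[node] = d
                improved += 1
        if improved == 0:  # the driver reads the counter and stops
            break
    return dist, iterations
```

On a graph with a heavy direct edge n1 to n6 (weight 10) and a light chain n1, n3, n4, n5, n6 (weight 1 per edge), the cheap path is only discovered on the fourth round, and a fifth round with zero improvements triggers termination.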
Compared to Dijkstra’s algorithm on a single processor, parallel breadth-first search in
MapReduce can be characterized as a brute force approach that “wastes” a lot of time performing
computations whose results are discarded. At each iteration, the algorithm attempts to recompute
distances to all nodes, but in reality only useful work is done along the search frontier: inside the
search frontier, the algorithm is simply repeating previous computations.7 Outside the search
frontier, the algorithm hasn’t discovered any paths to nodes there yet, so no meaningful work is
done. Dijkstra’s algorithm, on the other hand, is far more efficient. Every time a node is explored,
we’re guaranteed to have already found the shortest path to it. However, this is made possible by
maintaining a global data structure (a priority queue) that holds nodes sorted by distance—this is
not possible in MapReduce because the programming model does not provide support for global
data that is mutable and accessible by the mappers and reducers. These inefficiencies represent the
cost of parallelization.
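For contrast, here is a standard sketch of Dijkstra's algorithm; the heap is precisely the global, mutable, distance-sorted structure that the MapReduce programming model does not provide. The graph representation matches the earlier sketch and is an assumption for illustration.

```python
# Standard Dijkstra's algorithm: the priority queue settles each node
# exactly once, so no work inside the frontier is ever repeated.
import heapq

def dijkstra(graph, source):
    """graph: {node: [(neighbor, weight), ...]} with non-negative weights."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry; node already settled via a shorter path
        for neighbor, w in graph.get(node, []):
            nd = d + w
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist
```

Each heap pop explores a node whose shortest distance is already final, which is exactly the guarantee that iterative recomputation in MapReduce trades away.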
The parallel breadth-first search algorithm is instructive in that it represents the prototypical
structure of a large class of graph algorithms in MapReduce. They share the following
characteristics: