0

0

975 days ago,
255 views

Prologue to diagram calculations and chart representations ... Does the calculation ever end? In the long run, all hubs will be found, all edges ...

Distributed computing Lecture #5 Graph Algorithms with MapReduce Jimmy Lin The iSchool University of Maryland Wednesday, October 1, 2008 Some material adjusted from slides by Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed Computing Seminar, 2007 (authorized under Creation Commons Attribution 3.0 License) This work is authorized under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States See http://creativecommons.org/licenses/by-nc-sa/3.0/us/for points of interest

Today's Topics Introduction to chart calculations and diagram representations Single Source Shortest Path (SSSP) issue Refresher: Dijkstra's calculation Breadth-First Search with MapReduce PageRank

What's a diagram? G = (V,E), where V speaks to the arrangement of vertices (hubs) E speaks to the arrangement of edges (connections) Both vertices and edges may contain extra data Different sorts of charts: Directed versus undirected edges Presence or nonappearance of cycles Graphs are all over: Hyperlink structure of the Web Physical structure of PCs on the Internet Interstate parkway framework Social systems

Some Graph Problems Finding most brief ways Routing Internet activity and UPS trucks Finding least crossing trees Telco setting down fiber Finding Max Flow Airline booking Identify "exceptional" hubs and groups Breaking up fear monger cells, spread of avian influenza Bipartite coordinating Monster.com, Match.com And obviously... PageRank

Graphs and MapReduce Graph calculations normally include: Performing calculation at every hub Processing hub particular information, edge-particular information, and connection structure Traversing the diagram in some way Key inquiries: How would you speak to chart information in MapReduce? How would you navigate a diagram in MapReduce?

Representing Graphs G = (V, E) A poor representation for computational purposes Two normal representations Adjacency framework Adjacency list

Adjacency Matrices Represent a diagram as a n x n square network M n = |V| M ij = 1 implies a connection from hub i to j 2 1 3 4

Adjacency Matrices: Critique Advantages: Naturally epitomizes emphasis over hubs Rows and sections relate to inlinks and outlinks Disadvantages: Lots of zeros for meager grids Lots of squandered space

Adjacency Lists Take nearness lattices… and discard every one of the zeros 1: 2, 4 2: 1, 3, 4 3: 1 4: 1, 3

Adjacency Lists: Critique Advantages: Much more minimized representation Easy to figure over outlinks Graph structure can be separated and disseminated Disadvantages: Much more hard to register over inlinks

Single Source Shortest Path Problem: find most limited way from a source hub to at least one target hubs First, a refresher: Dijkstra's Algorithm

Dijkstra's Algorithm Example 1 10 0 9 2 3 4 6 5 7 2 Example from CLR

Dijkstra's Algorithm Example 10 1 10 0 9 2 3 4 6 5 7 5 2 Example from CLR

Dijkstra's Algorithm Example 8 14 1 10 0 9 2 3 4 6 5 7 5 7 2 Example from CLR

Dijkstra's Algorithm Example 8 13 1 10 0 9 2 3 4 6 5 7 5 7 2 Example from CLR

Dijkstra's Algorithm Example 8 9 1 10 0 9 2 3 4 6 5 7 5 7 2 Example from CLR

Dijkstra's Algorithm Example 8 9 1 10 0 9 2 3 4 6 5 7 5 7 2 Example from CLR

Single Source Shortest Path Problem: find briefest way from a source hub to at least one target hubs Single processor machine: Dijkstra's Algorithm MapReduce: parallel Breadth-First Search (BFS)

Finding the Shortest Path First, consider measure up to edge weights Solution to the issue can be characterized inductively Here's the instinct: DistanceTo(startNode) = 0 For all hubs n specifically reachable from startNode, DistanceTo( n ) = 1 For all hubs n reachable from some other arrangement of hubs S, DistanceTo(n) = 1 + min(DistanceTo(m), m S)

From Intuition to Algorithm A guide errand gets Key: hub n Value: D (remove from begin), focuses to (rundown of hubs reachable from n ) p focuses to: radiate ( p , D+1) The decrease undertaking assembles conceivable separations to a given p and chooses the base one

Multiple Iterations Needed This MapReduce assignment propels the "known outskirts" by one jump Subsequent emphasess incorporate more reachable hubs as boondocks advances Multiple cycles are expected to investigate whole chart Feed yield once again into the same MapReduce assignment Preserving diagram structure: Problem: Where did the focuses to list go? Arrangement: Mapper discharges ( n , focuses to) too

Visualizing Parallel BFS 3 1 2 3 4

Termination Does the calculation ever end? In the long run, all hubs will be found, all edges will be considered (in an associated chart) When do we stop?

Weighted Edges Now add positive weights to the edges Simple change: focuses to list in guide errand incorporates a weight w for each indicated hub discharge ( p , D+ w p ) rather than ( p , D+1) for every hub p Does this ever end? Yes! In the long run, no better separations will be found. At the point when separation is the same, we stop Mapper ought to emanate ( n , D) to guarantee that "present separation" is conveyed into the reducer

Comparison to Dijkstra's calculation is more proficient At any progression it just seeks after edges from the base cost way inside the boondocks MapReduce investigates all ways in parallel Divide and overcome Throw more equipment at the issue

General Approach MapReduce is adjust at controlling diagrams Store charts as nearness records Graph calculations with for MapReduce: Each guide errand gets a hub and its outlinks Map undertaking figure some capacity of the connection structure, discharges esteem with focus as the key Reduce assignment gathers keys (target hubs) and totals Iterate various MapReduce cycles until some end condition Remember to "pass" chart structure from one emphasis to next

Random Walks Over the Web Model: User begins at an irregular Web page User arbitrarily taps on connections, surfing from page to page What's the measure of time that will be spent on any given page? This is PageRank

PageRank: Defined Given page x with in-bound connections t 1 … t n , where C(t) is the out-level of t is likelihood of irregular hop N is the aggregate number of hubs in the diagram t 1 X t 2 … t n

Computing PageRank Properties of PageRank Can be registered iteratively Effects at every emphasis is neighborhood Sketch of calculation: Start with seed PR i values Each page disseminates PR i "credit" to all pages it connections to Each objective page includes "credit" from various in-bound connections to figure PR i+1 Iterate until qualities unite

PageRank in MapReduce Map: appropriate PageRank "credit" to connection targets Reduce: get together PageRank "credit" from different sources to process new PageRank esteem Iterate until union ...

PageRank: Issues Is PageRank ensured to focalize? How rapidly? What is the "right" esteem of , and how touchy is the calculation to it? Shouldn't something be said about dangling joins? How would you know when to stop?

Questions? (Ask them now, since will need to execute this!)

SPONSORS

No comments found.

SPONSORS

SPONSORS