Jimmy Lin The iSchool College of Maryland Wednesday, September 3, 2008

Jimmy lin the ischool university of maryland wednesday september 3 2008 l.jpg
1 / 44
1063 days ago, 353 views
PowerPoint PPT Presentation
MapReduce: the

Presentation Transcript

Slide 1

Distributed computing Lecture #1 What is Cloud Computing? (furthermore, an introduction to parallel/disseminated handling) Jimmy Lin The iSchool University of Maryland Wednesday, September 3, 2008 Some material adjusted from slides by Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed Computing Seminar, 2007 (authorized under Creation Commons Attribution 3.0 License) This work is authorized under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States See http://creativecommons.org/licenses/by-nc-sa/3.0/us/for subtle elements

Slide 2

Source: http://www.free-pictures-photos.com/

Slide 3

What is Cloud Computing? Web-scale issues Large server farms Different models of registering Highly-intelligent Web applications

Slide 4

1. Web-Scale Problems Characteristics: Definitely information concentrated May likewise be handling serious Examples: Crawling, ordering, seeking, mining the Web "Post-genomics" life sciences examine Other logical information (material science, cosmologists, and so forth.) Sensor systems Web 2.0 applications …

Slide 5

How much information? Wayback Machine has 2 PB + 20 TB/month (2006) Google forms 20 PB a day (2008) "all words ever talked by people" ~ 5 EB NOAA has ~1 PB atmosphere information (2007) CERN's LHC will create 15 PB a year (2008) 640K should be sufficient for anyone.

Slide 6

Maximilien Brice, © CERN

Slide 7

Maximilien Brice, © CERN

Slide 8

(Banko and Brill, ACL 2001) (Brants et al., EMNLP 2007) There's not at all like more information! s/motivation/information/g;

Slide 9

What to do with more information? Noting tidbit questions Pattern coordinating on the Web Works amazingly well Learning relations Start with seed occurrences Search for examples on the Web Using examples to discover more cases Who shot Abraham Lincoln?  X shot Abraham Lincoln Wolfgang Amadeus Mozart (1756 - 1791) Einstein was conceived in 1879 Birthday-of(Mozart, 1756) Birthday-of(Einstein, 1879) PERSON ( DATE – PERSON was conceived in DATE (Brill et al., TREC 2001; Lin, ACM TOIS 2007) (Agichtein and Gravano, DL 2000; Ravichandran and Hovy, ACL 2002; … )

Slide 10

2. Substantial Data Centers Web-scale issues? Toss more machines at it! Clear pattern: centralization of processing assets in substantial server farms Necessary fixings: fiber, squeeze, and space What do Oregon, Iceland, and deserted mines have in like manner? Critical Issues: Redundancy Efficiency Utilization Management

Slide 11

Source: Harper's (Feb, 2008)

Slide 12

Maximilien Brice, © CERN

Slide 13

Key Technology: Virtualization App OS Operating System Hypervisor Hardware Traditional Stack Virtualized Stack

Slide 14

3. Diverse Computing Models Utility registering Why purchase machines when you can lease cycles? Illustrations: Amazon's EC2, GoGrid, AppNexus Platform as a Service (PaaS) Give me pleasant API and deal with the usage Example: Google App Engine Software as a Service (SaaS) Just run it for me! Case: Gmail "Why do it without anyone's help in the event that you can pay somebody to do it for you?"

Slide 15

4. Web Applications An error on top of a hack based on sand held together by channel tape? What is the way of programming applications? From the desktop to the program SaaS == Web-based applications Examples: Google Maps, Facebook How would we convey profoundly intelligent Web-based applications? AJAX (nonconcurrent JavaScript and XML) For better, or for more regrettable…

Slide 16

What is the course about? MapReduce: the "back-end" of distributed computing Batch-situated handling of huge datasets Ajax: the "front-end" of distributed computing Highly-intuitive Web-based applications Computing "in the mists" Amazon's EC2/S3 for instance of utility figuring

Slide 17

Amazon Web Services Elastic Compute Cloud (EC2) Rent registering assets by the hour Basic unit of bookkeeping = example hour Additional expenses for transfer speed Simple Storage Service (S3) Persistent capacity Charge by the GB/month Additional expenses for transmission capacity You'll be utilizing EC2/S3 for course assignments!

Slide 18

This course is not for you… If you're not truly intrigued by the subject If you're not prepared to do a considerable measure of programming If you're not open to pondering processing in new ways If you can't adapt to uncertainly, eccentrics, poor documentation, and juvenile programming If you can't invest the energy Otherwise, this will be a lavishly compensating course!

Slide 19

Source: http://davidzinger.wordpress.com/2007/05/page/2/

Slide 20

Cloud Computing Zen Don't get baffled (take a full breath)… This is cutting edge innovation Those W$*#T@F! minutes Be tolerant… This is the second first time I've shown this course Be adaptable… There will be unexpected issues en route Be useful… Tell me how I can improve everybody's experience

Slide 21

Source: Wikipedia

Slide 22

Source: Wikipedia

Slide 23

Source: Wikipedia

Slide 24

Source: Wikipedia

Slide 25

Things to go over… Course plan Assignments and deliverables Amazon EC2/S3

Slide 26

Web-Scale Problems? Try not to hold your breath: Biocomputing Nanocomputing Quantum figuring … It all comes down to… Divide-and-overcome Throwing more equipment at the issue Simple to comprehend… a lifetime to ace…

Slide 27

Divide and Conquer "Work" Partition w 1 w 2 w 3 "laborer" "specialist" "laborer" r 1 r 2 r 3 Combine "Result"

Slide 28

Different Workers Different strings in a similar center Different centers in a similar CPU Different CPUs in a multi-processor framework Different machines in a conveyed framework

Slide 29

Choices, Choices, Choices Commodity versus "fascinating" equipment Number of machines versus processor versus centers Bandwidth of memory versus plate versus arrange Different programming models

Slide 30

Flynn's Taxonomy Instructions Single (SI) Multiple (MI) Single (SD) Data Multiple (MD)

Slide 31

SISD Processor D Instructions

Slide 32

SIMD Processor D 0 D 0 D 0 D 0 D 0 D 0 D 0 D 1 D 1 D 1 D 1 D 1 D 1 D 1 D 2 D 2 D 2 D 2 D 2 D 2 D 2 D 3 D 3 D 3 D 3 D 3 D 3 D 3 D 4 D 4 D 4 D 4 D 4 D 4 D 4 … D n D n D n D n D n D n D n Instructions

Slide 33

MIMD Processor D Instructions

Slide 34

Memory Typology: Shared Processor Memory Processor

Slide 35

Memory Typology: Distributed Processor Memory Processor Memory Network Processor Memory Processor Memory

Slide 36

Memory Typology: Hybrid Processor Memory Processor Memory Processor Network Processor Memory Processor Memory Processor

Slide 37

Parallelization Problems How would we appoint function units to specialists? Imagine a scenario in which we have more work units than specialists. Consider the possibility that specialists need to share incomplete results. How would we total fractional results? How would we know every one of the specialists have wrapped up? What if laborers pass on? What is the basic subject of these issues?

Slide 38

General Theme? Parallelization issues emerge from: Communication between specialists Access to shared assets (e.g., information) Thus, we require a synchronization framework! This is precarious: Finding bugs is hard Solving bugs is much harder

Slide 39

Managing Multiple Workers Difficult on the grounds that (Often) don't have the foggiest idea about the request in which specialists run (Often) don't know where the laborers are running (Often) don't know when laborers interfere with each different Thus, we require: Semaphores (bolt, open) Conditional factors (hold up, inform, communicate) Barriers Still, bunches of issues: Deadlock, livelock, race conditions, ... Lesson of the story: be cautious! Much trickier if the specialists are on various machines

Slide 40

Patterns for Parallelism Parallel registering has been around for quite a long time Here are some "plan designs" …

Slide 41

Master/Slaves ace slaves

Slide 42

Producer/Consumer Flow P C P C P C P C P C P C

Slide 43

Work Queues P C shared line P C W P C

Slide 44

Rubber Meets Road From examples to execution: pthreads, OpenMP for multi-strung programming MPI for grouping figuring … The truth: Lots of erratic arrangements, custom code Write you possess devoted library, then program with it Burden on the software engineer to unequivocally oversee everything MapReduce to the safeguard! (for next time)