MPI: The Best High Performance Programming Model for Clusters and Grids


Presentation Transcript

Slide 1

William Gropp
Mathematics and Computer Science
Argonne National Laboratory
MPI—The Best High Performance Programming Model for Clusters and Grids

Slide 2

Outline
- MPI features that make it effective
- Why is MPI the programming model of choice?
- Special cluster and grid needs:
  - Fault tolerance
  - Latency tolerance
  - Communication tuning

Slide 3

The Success of MPI
- Applications: most recent Gordon Bell prize winners use MPI (26TF climate simulation on the Earth Simulator, 16TF DNS)
- Libraries: growing collection of powerful software components; MPI programs with no MPI calls (all in libraries)
- Tools: performance tracing (Vampir, Jumpshot, etc.), debugging (TotalView, etc.)
- Results: papers
- Beowulf: ubiquitous parallel computing
- Grids: MPICH-G2 (MPICH over Globus)

Slide 4

Why Was MPI Successful?
It addressed all of the following issues:
- Portability
- Performance
- Simplicity and symmetry
- Modularity
- Composability
- Completeness

Slide 5

Performance
- Performance must be competitive
  - Pay attention to memory motion
  - Leave freedom for implementers to exploit any special features
- The standard document requires careful reading
- Not all implementations are perfect
- MPI ping-pong performance should be asymptotically as good as that of any interprocess communication mechanism ("these should be the same")

Slide 6

Parallel Computing and Uniprocessor Performance
- Deeper memory hierarchy
- Synchronization/coordination costs
[Figure: memory hierarchy from CPUs through cache to main memory and remote memory; the CPU-to-cache gap is "the hardest gap," not the gap to remote memory]

Slide 7

Simplicity and Symmetry
- MPI is organized around a small number of concepts
- The number of routines is not a good measure of complexity:
  - Fortran has a large number of intrinsic functions
  - The C and Java runtimes are large
  - Development frameworks have hundreds to thousands of methods
- This doesn't bother millions of programmers; why should it bother us?

Slide 8

Is Ease of Use the Overriding Goal?
- MPI is often described as "the assembly language of parallel programming"
- C and Fortran have been described as "portable assembly languages"
- Ease of use is important, but completeness is more important
  - Don't force users to switch to a different approach as their application evolves
- Don't forget: no application wants to use parallelism except to get adequate performance
  - Ease of use at the expense of performance is irrelevant

Slide 9

Fault Tolerance in MPI
- Can MPI be fault tolerant? What does that mean?
- Implementation versus specification
  - Work to be done on the implementations
  - Work to be done on the algorithms
- Semantically meaningful and efficient collective operations
- Use MPI at the right level
  - Build libraries to encapsulate important programming paradigms
(The following slides are joint work with Rusty Lusk)

Slide 10

Myths and Facts
- Myth: MPI behavior is defined by its implementations. Fact: MPI behavior is defined by the Standard document.
- Myth: MPI is not fault tolerant. Fact: This statement is not well formed; its truth depends on what it means, and one can't tell from the statement itself. More later.
- Myth: All processes of an MPI program exit if any one process crashes. Fact: Sometimes they do; sometimes they don't; sometimes they should; sometimes they shouldn't. More later.
- Myth: Fault tolerance means reliability. Fact: These are entirely different. Again, definitions are required.

Slide 11

More Myths and Facts
- Myth: Fault tolerance is independent of performance. Fact: In general, no; perhaps for some (weak) aspects, yes. Support for fault tolerance will negatively impact performance.
- Myth: Fault tolerance is a property of the MPI standard (which it lacks). Fact: Fault tolerance is not a property of the specification, so the specification can neither have nor lack it.
- Myth: Fault tolerance is a property of an MPI implementation (which most lack). Fact: Fault tolerance is a property of a program. Some implementations make it easier to write fault-tolerant programs than others do.

Slide 12

What is Fault Tolerance Anyway?
- A fault-tolerant program can "survive" (in some sense we need to discuss) a failure of the infrastructure (machine crash, network failure, etc.)
- This is not, in general, completely achievable. (What if all processes crash?)
- How much is recoverable depends on how much state the failed component holds at the time of the crash:
  - In many master/slave algorithms, a slave holds a small amount of easily recoverable state (the most recent subproblem it received)
  - In most mesh algorithms, a process may hold a large amount of hard-to-recover state (data values for some portion of the domain/grid)
  - Communication networks hold varying amounts of state in communication buffers

Slide 13

What Does the Standard Say About Errors?
- A set of errors is defined, to be returned by MPI functions if MPI_ERRORS_RETURN is set
- Implementations are allowed to extend this set
- It is not required that subsequent operations work after an error is returned (or that they fail, either)
- It may not be possible for an implementation to recover from some kinds of errors even enough to return an error code (and such implementations are conforming)
- Much is left to the implementation; some helpful implementations may return errors in situations where other conforming implementations abort (see the "quality of implementation" discussion in the Standard)
- Implementations are allowed to trade performance against fault tolerance to meet the needs of their users

Slide 14

Some Approaches to Fault Tolerance in MPI Programs
- Master/slave algorithms using intercommunicators
  - No change to existing MPI semantics
  - MPI intercommunicators generalize the well-understood two-party model to groups of processes, allowing either the master or the slave to be a parallel program optimized for performance
- Checkpointing
  - In wide use now
  - Plain versus fancy; MPI-IO can make it efficient
- Extending MPI with some new objects to allow a wider class of fault-tolerant programs: the "pseudo-communicator"
- Another approach: change the semantics of existing MPI functions. But then it is no longer MPI (semantics, not syntax, defines MPI)

Slide 15

A Fault-Tolerant MPI Master/Slave Program
- Master process comes up alone first (size of MPI_COMM_WORLD = 1)
- It creates slaves with MPI_Comm_spawn
  - Gets back an intercommunicator for each one
  - Sets MPI_ERRORS_RETURN on each
- Master communicates with each slave using its particular communicator (MPI_Send/Recv to/from rank 0 in the remote group)
- Master maintains the state information needed to restart each subproblem in the event of failure
- Master may start a replacement slave with MPI_Comm_spawn
- Slaves may themselves be parallel (size of MPI_COMM_WORLD > 1 on the slaves)
- Allows the programmer to control the tradeoff between fault tolerance and performance

Slide 16

Checkpointing
- Application-driven versus externally driven
  - The application knows when the message-passing subsystem is quiescent
  - Checkpointing every n timesteps allows long (months) ASCI computations to proceed routinely in the face of outages
  - Externally driven checkpointing requires much more cooperation from the MPI implementation, which may impact performance
- MPI-IO can help with large, application-driven checkpoints
- "Extreme" checkpointing: MPICH-V (Paris group)
  - All messages logged
  - States periodically checkpointed asynchronously
  - Can restore local state from the checkpoint plus the message log since the last checkpoint
  - Not high performance; scalability challenges

Slide 17

Latency Tolerance
- Memory systems have a 100+ to 1 latency ratio to the CPU
- Cluster interconnects have a 10,000+ to 1 latency ratio to the CPU
- Grid interconnects have a 10,000,000+ to 1 latency ratio to the CPU
- The others you might just possibly fix with clever engineering; fixing this one requires warp-drive technology
- A good approach is to split operations into separate initiation and completion steps
  - The LogP model separates overhead from irreducible latency
  - Programmers are rarely good at writing programs with split operations

Slide 18

Split Phase Operations
- MPI nonblocking operations provide split operations
- MPI-2 adds generalized requests (with the same wait/test)
  - "Everything's a request" (almost (sigh))
- MPI-2 RMA operations are nonblocking but with separate completion operations
  - Chosen to make the RMA API fast
  - Nonblocking ≠ concurrent
- Parallel I/O in MPI
  - Aggregates, not POSIX streams
  - Reasonable atomicity permits reasonable caching strategies while maintaining reasonable semantics
  - Well suited to grid-scale remote I/O
- Two-phase collective I/O provides a model for two-phase collective operations

Slide 19

MPI Implementation Choices for Latency Tolerance
- MPI implementations can be optimized for clusters and grids
- Example: point-to-point "rendezvous"
  - Typical 3-way: sender requests; receiver acks with OK to send; sender delivers data
  - Alternative 2-way "receiver requests": receiver sends a "request to receive" to a designated sender; sender delivers data
    - MPI_ANY_SOURCE receives interfere
  - (Even better: use MPI RMA: the sender delivers data to a previously agreed-upon location)
- Example: collective algorithms
  - Typical: simple MST, etc.
  - Grids and SMP clusters: topology-aware (MagPIe; MPICH-G2)
  - Switched networks: scatter/gather-based algorithms instead of fanout trees

Slide 20

Communication Tuning for Grids
- Quality of service
  - MPI attributes provide a modular way to pass information to the implementation
  - The only change to user code is an MPI_Comm_set_attr call
- TCP recovery
  - Recovery from transient TCP failures
  - Hard to get TCP state from the TCP API (how much data was delivered?)
- Messages are not streams
  - A user buffer can be sent in any order
  - Allows aggressive (but good-citizen) UDP-based communication with aggregate acks/nacks
  - Compare to "infinite window" TCP (receive buffer)
  - 80%+ of bandwidth achievable on long-haul networks