The Multicore Programming Challenge Barbara Chapman University of Houston November 22, 2007 High Performance Computing and Tools Group http://www.cs.uh.edu/~hpctools
Slide 2 Agenda: Multicore is Here … Here comes Manycore. The Programming Challenge. OpenMP as a Potential API. How about the Implementation?
Slide 3 The Power Wall. Want to escape the heat wave? Go multicore! Add multithreading (SMT) for better throughput. Add accelerators for low-power, high-performance execution of some operations.
Slide 4 The Memory Wall. Even so, this is how our compiled programs run. Growing divergence between memory access times and CPU speed. Increases in cache size show diminishing returns. Multithreading can overcome minor delays, but multicore and SMT reduce the amount of cache per thread and introduce competition for bandwidth to memory.
Slide 5 So Now We Have Multicore. Small number of cores, shared memory. Some systems have multithreaded cores. Trend toward simplicity in the cores (e.g. no branch prediction). Multiple threads share resources (L2 cache, possibly FP units). Deployment in the embedded market as well as other sectors. IBM Power4, 2001; Sun T-1 (Niagara), 2005; Intel makes its move, 2005.
Slide 6 Take-Up in Enterprise Server Market. Increase in volume of customers. Increase in data and number of transactions. Need to increase performance. Survey data.
Slide 7 What Is Hard About MC Programming? (Diagram labels: Processor, Accelerator, Core.) We may want sibling threads to share in a workload on a multicore, but we may want SMT threads to do different things. Parallel programming becomes mainstream. Lower single-thread performance. Hierarchical, heterogeneous parallelism: SMPs, many cores, SMT, ILP, FPGA, GPGPU, … Diversity in the kind and degree of resource sharing, with potential for thread contention. Reduced effective cache per instruction stream. Non-uniform memory access on chip. Contention for access to main memory. Runtime power management.
Slide 8 Manycore is Coming, Ready or Not. An Intel prediction of what the technology may support: 2010: 16–64 cores, 200 GF–1 TF; 2013: 64–256 cores, 500 GF–4 TF; 2016: 256–1024 cores, 2 TF–20 TF. More cores, more multithreading. More complexity in the individual system: hierarchical parallelism (ILP, SMT, core); accelerators, graphics units, FPGAs; multistage networks for data movement and synchronization? Memory coherence? Applications are long-lived: a program written for multicore computers may need to run fast on manycore systems later.
Slide 9 Agenda: Multicore is Here … Here comes Manycore. The Programming Challenge. OpenMP as a Potential API. How about the Implementation?
Slide 10 Application Developer Needs. Time to market: often an overriding requirement for ISVs and enterprise IT; may favor rapid prototyping and iterative development. Cost of programming: development, testing, maintenance and support (related to quality but equally to ease of use). Human effort of development and testing is now also a recognized issue in HPC and scientific computing. Productivity.
Slide 11 Does It Matter? Of Course! Server market survey data. Performance.
Slide 12 Programming Model: Some Requirements. General-purpose platforms are parallel: generality of the parallel programming model matters. User expectations: performance and efficiency matter, and so does error handling. Many threads with shared memory: scalability matters. Mapping of work and data to the machine will affect performance: work/data locality matters. More complex, "componentized" applications: modularity matters. Even if parallelization is easy, scaling may be hard: Amdahl's Law.
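To make the Amdahl's Law point concrete, a small hypothetical calculation (mine, not from the slides): if a fraction p of the runtime is parallelizable, the speedup on N cores is at most 1 / ((1 - p) + p/N), so even a 95%-parallel code tops out well below the core count.

    /* Illustrative sketch of Amdahl's Law (not part of the original slides). */
    #include <stdio.h>

    static double amdahl_speedup(double p, int n)
    {
        /* p = parallel fraction of the work, n = number of cores/threads */
        return 1.0 / ((1.0 - p) + p / n);
    }

    int main(void)
    {
        int cores[] = { 4, 16, 64, 256 };
        for (int i = 0; i < 4; i++)
            printf("p = 0.95, cores = %3d -> speedup = %5.2f\n",
                   cores[i], amdahl_speedup(0.95, cores[i]));
        return 0;   /* at 256 cores the speedup is still under 20x */
    }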
Slide 13 Some Programming Approaches. From high-end computing: Libraries: MPI, Global Arrays. Partitioned Global Address Space languages: Co-Array Fortran, Titanium, UPC. Shared memory programming: OpenMP, Pthreads, autoparallelization. New ideas: HPCS languages (Fortress, Chapel, X10), Transactional Memory. And vendor- and domain-specific APIs. Let's look at these further…
Slide 14 First Thoughts: Ends of the Spectrum. Automatic parallelization: usually works for short regions of code; current research attempts to do better by combining static and dynamic approaches to uncover parallelism; consideration of the interplay with ILP-level issues in a multithreaded environment. MPI (Message Passing Interface): widely used in HPC, can be implemented on shared memory; enforces locality. But there is no incremental development path, a relatively low level of abstraction, and it uses too much memory. For very large systems: how many processes can be supported?
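For contrast with the directive-based models discussed later, here is a minimal MPI sketch (illustrative only, not from the slides) showing the explicit SPMD style and the low level of abstraction the slide refers to:

    /* Minimal MPI SPMD sketch (illustrative only, not from the original slides). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);                 /* every process runs this same program */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I?  */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes?  */

        /* All data partitioning, communication and synchronization are the
           programmer's job; there is no incremental path from the serial code. */
        printf("process %d of %d\n", rank, size);

        MPI_Finalize();
        return 0;
    }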
Slide 15 PGAS Languages. Partitioned Global Address Space: Co-Array Fortran, Titanium, UPC. Different details but similar in spirit. Raises the level of abstraction. User specifies the data and work mapping. Not intended for fine-grained parallelism. Co-Array Fortran: communication is explicit but the details are left to the compiler; SPMD computation (local view of code); entering the Fortran standard. Example of a remote read: X = F[ p ]
Slide 16 HPCS Languages. High-Performance, High-Productivity programming: Chapel, Fortress, X10. Research languages that explore a variety of new ideas. Target global address space, multithreaded platforms. Aim for high levels of scalability. Asynchronous and synchronous threads. All of them provide support for locality and affinity: machine descriptions, mapping of work and computation to the machine; locales, places. Attempt to lower the cost of synchronization and give a simpler programming model for it: atomic blocks/transactions.
Slide 19 Shared Memory Models 1: PThreads. Flexible library for shared memory programming; widely available; performance can be good; likely to be used for programming multicore. Some deficiencies, though: no memory model; does not support productivity (relatively low level of abstraction, doesn't always work with Fortran, no easy code migration path from a sequential program, and the lack of structure makes it error-prone).
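A small Pthreads sketch (illustrative, not from the slides) shows what that low level of abstraction looks like: thread creation, argument passing and joining are all explicit, and nothing in the code marks which data is shared.

    /* Minimal Pthreads sketch (illustrative only); compile with -pthread. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static void *worker(void *arg)
    {
        long id = (long)arg;        /* arguments travel through an untyped pointer */
        printf("thread %ld working\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];

        /* Creation and joining are manual; there is no structured
           fork/join region as in OpenMP, and no help from the compiler. */
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }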
Slide 20 Agenda: Multicore is Here … Here comes Manycore. The Programming Challenge. OpenMP as a Potential API. How about the Implementation?
Slide 21 Shared Memory Models 2: OpenMP*. A set of compiler directives and library routines. Can be used with Fortran, C and C++. The user maps code to threads that share memory: parallel loops, parallel sections, workshare. The user decides whether data is shared or private. The user coordinates accesses to shared data: critical regions, atomic updates, barriers, locks. Examples scattered across the slide: C$OMP FLUSH, #pragma omp critical, C$OMP THREADPRIVATE(/ABC/), CALL OMP_SET_NUM_THREADS(10), C$OMP parallel do shared(a, b, c), call omp_test_lock(jlok), call OMP_INIT_LOCK (ilok), C$OMP MASTER, C$OMP ATOMIC, C$OMP SINGLE PRIVATE(X), setenv OMP_SCHEDULE "dynamic", C$OMP PARALLEL DO ORDERED PRIVATE (A, B, C), C$OMP ORDERED, C$OMP PARALLEL REDUCTION (+: A, B), C$OMP SECTIONS, #pragma omp parallel for private(A, B), !$OMP BARRIER, C$OMP PARALLEL COPYIN(/blk/), C$OMP DO lastprivate(XX), Nthrds = OMP_GET_NUM_PROCS(), omp_set_lock(lck). * The name "OpenMP" is the property of the OpenMP Architecture Review Board.
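A short C sketch (mine, not from the slides; the summed array is a made-up example) of the combination named above: the user marks the parallel loop, declares data shared, and coordinates updates to the shared result with an atomic update.

    /* Illustrative sketch of shared data plus synchronization in OpenMP
       (not from the original slides); the array sum is a stand-in example. */
    #include <omp.h>
    #include <stdio.h>
    #define N 1000

    int main(void)
    {
        double a[N], sum = 0.0;
        for (int i = 0; i < N; i++) a[i] = 0.5 * i;

        #pragma omp parallel for shared(a, sum)
        for (int i = 0; i < N; i++) {
            #pragma omp atomic        /* coordinate access to the shared sum */
            sum += a[i];
        }

        printf("sum = %f using up to %d threads\n", sum, omp_get_max_threads());
        return 0;
    }

(A reduction clause would be the more idiomatic way to write this particular loop; the atomic form is shown only to illustrate the synchronization directives listed above.)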
Slide 22 Shared Memory Models 2: OpenMP. High-level, directive-based multithreaded programming. The user makes the key decisions; the compiler figures out the details. Threads cooperate by sharing variables. Synchronization is used to order accesses and prevent data conflicts. Structured programming reduces the likelihood of bugs.
    #pragma omp parallel
    #pragma omp for schedule(dynamic)
    for (I=0; I<N; I++) {
        NEAT_STUFF(I);
    }   /* implicit barrier here */
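A self-contained version of that fragment, under the assumption that NEAT_STUFF is just a placeholder for independent per-iteration work:

    /* Compilable version of the slide's fragment; NEAT_STUFF is assumed to be
       a placeholder for independent work on iteration I. */
    #include <omp.h>
    #include <stdio.h>
    #define N 16

    static void NEAT_STUFF(int i)
    {
        printf("iteration %d done by thread %d\n", i, omp_get_thread_num());
    }

    int main(void)
    {
        #pragma omp parallel
        #pragma omp for schedule(dynamic)
        for (int I = 0; I < N; I++) {
            NEAT_STUFF(I);
        }   /* implicit barrier here */
        return 0;
    }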
Slide 23 Cart3D OpenMP Scaling. (Figure annotations: M = 2.6, 2.09°, 0.8°.) 4.7 M cell mesh, Space Shuttle Launch Vehicle example. The OpenMP version uses the same domain decomposition strategy as MPI for data locality, avoiding false sharing and fine-grained remote data access. The OpenMP version slightly outperforms the MPI version on an SGI Altix 3700BX2, with both close to linear scaling.
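The data-locality point generalizes to any shared-memory code: give each thread a contiguous block of the data rather than interleaved elements, so threads rarely touch the same cache lines. A hedged sketch of that idea (not Cart3D source code):

    /* Illustrative block decomposition (not Cart3D code): each thread updates a
       contiguous chunk of the array, which avoids false sharing between threads. */
    #include <omp.h>
    #define NCELLS 100000

    static double cell[NCELLS];

    void relax(void)
    {
        #pragma omp parallel
        {
            int nth   = omp_get_num_threads();
            int tid   = omp_get_thread_num();
            int chunk = (NCELLS + nth - 1) / nth;   /* size of each thread's block */
            int lo    = tid * chunk;
            int hi    = (lo + chunk < NCELLS) ? lo + chunk : NCELLS;

            for (int i = lo; i < hi; i++)           /* contiguous block per thread */
                cell[i] = 0.5 * cell[i];
        }   /* implicit barrier at the end of the parallel region */
    }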
Slide 24 The OpenMP ARB. OpenMP is maintained by the OpenMP Architecture Review Board (the ARB), which: interprets OpenMP; writes new specifications, keeping OpenMP relevant; works to increase the impact of OpenMP. Members are organizations, not individuals. Current members: Permanent: AMD, Cray, Fujitsu, HP, IBM, Intel, Microsoft, NEC, PGI, SGI, Sun. Auxiliary: ASCI, cOMPunity, EPCC, KSL, NASA, RWTH Aachen. www.compunity.org
Slide 25 OpenMP specification history: Oct 1997 – 1.0 Fortran; Oct 1998 – 1.0 C/C++; Nov 1999 – 1.1 Fortran (interpretations added); Nov 2000 – 2.0 Fortran; Mar 2002 – 2.0 C/C++; May 2005 – 2.5 Fortran/C/C++ (mostly a merge); ?? 2008 – 3.0 Fortran/C/C++ (extensions). Original goals: ease of use, incremental approach to parallelization; "sufficient" speedup on small SMPs, with the ability to write scalable code for large SMPs with corresponding effort; as far as possible, parallel code "compatible" with serial code. www.compunity.org
Slide 26 OpenMP 3.0. Many proposals for new features: features to improve expressivity, expose more parallelism, and support multicore. Better parallelization of loop nests. Parallelization of a wider range of loops. Nested parallelism. Control over the default behavior of idle threads. And more.
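One of the loop-nest proposals became the collapse clause; a hedged sketch (my example, not from the slides) of how a doubly nested loop can be handed to OpenMP as a single iteration space:

    /* Sketch of loop-nest parallelization with the 3.0 collapse clause:
       both loops form one iteration space, exposing ROWS*COLS iterations. */
    #include <omp.h>
    #define ROWS 200
    #define COLS 200

    void scale(double a[ROWS][COLS], double s)
    {
        #pragma omp parallel for collapse(2)
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                a[i][j] *= s;               /* each (i, j) pair is one iteration */
    }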
Slide 27 Pointer-Chasing Loops in OpenMP? for(p = list; p; p = p->next) { process(p->item); } This cannot be parallelized with omp for: the number of iterations is not known in advance. Transformation to a "canonical" loop can be very labor-intensive and inefficient.
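The manual transformation the slide criticizes looks roughly like the following (hypothetical sketch): walk the list once to collect the node pointers, then run a parallel for over the resulting array. It works, but it adds a serial pass and extra storage.

    /* Hypothetical illustration of the "canonical loop" rewrite; the node type,
       MAXITEMS bound and process() routine are assumptions, not from the slides. */
    #include <omp.h>
    #define MAXITEMS 10000

    struct node { struct node *next; void *item; };
    extern void process(void *item);                  /* assumed user routine */

    void process_list(struct node *list)
    {
        struct node *parr[MAXITEMS];
        int n = 0;

        for (struct node *p = list; p && n < MAXITEMS; p = p->next)
            parr[n++] = p;                            /* serial pass to collect nodes */

        #pragma omp parallel for schedule(dynamic)    /* now a canonical loop */
        for (int i = 0; i < n; i++)
            process(parr[i]->item);
    }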
Slide 28 OpenMP 3.0 Introduces Tasks, explicitly created and processed:
    #pragma omp parallel
    {
        #pragma omp single
        {
            p = listhead;
            while (p) {
                #pragma omp task
                    process(p);
                p = next(p);
            }
        }
    }
Each encountering thread packages a new instance of a task (code and data). Some thread in the team executes the task.
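A compilable version of the same pattern (the node type and the process() body are placeholders of mine, and an OpenMP 3.0 compiler is assumed):

    /* Self-contained sketch of the tasking pattern above; node and process()
       are placeholders, not from the slides. Requires OpenMP 3.0 support. */
    #include <omp.h>
    #include <stdio.h>

    struct node { int item; struct node *next; };

    static void process(struct node *p)
    {
        printf("item %d processed by thread %d\n", p->item, omp_get_thread_num());
    }

    void traverse(struct node *listhead)
    {
        #pragma omp parallel
        {
            #pragma omp single                        /* one thread walks the list...   */
            {
                for (struct node *p = listhead; p; p = p->next) {
                    #pragma omp task firstprivate(p)  /* ...packaging one task per node */
                    process(p);
                }
            }
        }   /* all tasks finish before the parallel region ends */
    }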
Slide 29 More and More Threads. Busy-waiting may consume valuable resources and interfere with the work of other threads on a multicore. OpenMP 3.0 will allow the user more control over the way idle threads are handled. Improved support for …
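One control that later appeared in OpenMP 3.0 for this is the OMP_WAIT_POLICY environment variable (active vs. passive waiting); the surrounding program below is only an illustrative sketch of mine showing where the policy matters:

    /* Illustrative only: a code with many parallel regions, between which worker
       threads are idle. Running with OMP_WAIT_POLICY=passive (for example
       "export OMP_WAIT_POLICY=passive") asks the runtime to let waiting threads
       yield the core instead of spinning. */
    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        for (int step = 0; step < 4; step++) {
            #pragma omp parallel
            {
                printf("step %d, thread %d\n", step, omp_get_thread_num());
            }   /* workers wait here until the next region; the wait policy
                   decides whether they spin (active) or sleep (passive) */
        }
        return 0;
    }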