High Performance Programming on a Single Processor: Memory Hierarchies, Matrix Multiplication, Automatic Performance Tuning


Slide 1

High Performance Programming on a Single Processor: Memory Hierarchies, Matrix Multiplication, Automatic Performance Tuning
James Demmel
CS267 Lecture 2

Slide 2

Outline
- Idealized and actual costs in modern processors
- Memory hierarchies
- Case Study: Matrix Multiplication
- Automatic Performance Tuning

Slide 3

Outline
- Idealized and actual costs in modern processors
- Memory hierarchies
- Case Study: Matrix Multiplication
- Automatic Performance Tuning

Slide 4

Idealized Uniprocessor Model
- Processor names bytes, words, etc. in its address space
  - These represent integers, floats, pointers, arrays, etc.
- Operations include
  - Read and write (given an address/pointer)
  - Arithmetic and other logical operations
- Order specified by program
  - Read returns the most recently written data
  - Compiler and architecture translate high-level expressions into "obvious" lower-level instructions
  - Hardware executes instructions in the order specified by the compiler
- Cost
  - Each operation has roughly the same cost (read, write, add, multiply, etc.)
- Example: A = B + C becomes
  - Read address(B) to R1
  - Read address(C) to R2
  - R3 = R1 + R2
  - Write R3 to address(A)

Slide 5

Uniprocessors in the Real World
Real processors have
- registers and caches
  - small amounts of fast memory
  - store values of recently used or nearby data
  - different memory operations can have very different costs
- parallelism
  - multiple "functional units" that can run in parallel
  - different orders and instruction mixes have different costs
- pipelining
  - a form of parallelism, like an assembly line in a factory
Why is this your problem? In theory, compilers understand all of this and can optimize your program; in practice they don't. Even if they could optimize one algorithm, they won't know about a different algorithm that might be a far better "match" to the processor.

Slide 6

What is Pipelining?
Dave Patterson's laundry example: 4 people doing laundry; wash (30 min) + dry (40 min) + fold (20 min) = 90 min latency.
- In this example:
  - Sequential execution takes 4 * 90 min = 6 hours
  - Pipelined execution takes 30 + 4*40 + 20 = 210 min = 3.5 hours
- Bandwidth = loads/hour
  - BW = 4/6 loads/hour without pipelining
  - BW = 4/3.5 loads/hour with pipelining
  - BW <= 1.5 loads/hour with pipelining, with more total loads
- Pipelining helps bandwidth but not latency (still 90 min per load)
- Bandwidth is limited by the slowest pipeline stage
- Potential speedup = number of pipe stages
(Figure: pipelined laundry timeline, tasks A-D in task order, 6 PM to 9 PM.)

Slide 7

Example: 5 Steps of MIPS Datapath
(Figure 3.4, page 134, CA:AQA 2e by Patterson and Hennessy: Instruction Fetch, Instruction Decode/Register Fetch, Execute/Address Calculation, Memory Access, and Write Back stages, separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers.)
Pipelining is also used within arithmetic units: a floating-point multiply may have a latency of 10 cycles but a throughput of 1 per cycle.

Slide 8

Outline
- Idealized and actual costs in modern processors
- Memory hierarchies
- Case Study: Matrix Multiplication
- Automatic Performance Tuning

Slide 9

Memory Hierarchy
- Most programs have a high degree of locality in their accesses
  - spatial locality: accessing things nearby previous accesses
  - temporal locality: reusing an item that was previously accessed
- The memory hierarchy tries to exploit locality
(Figure: memory hierarchy levels: registers and datapath, on-chip cache, second-level cache (SRAM), main memory (DRAM), secondary storage (disk), tertiary storage (disk/tape), with processor control at the top.)

Slide 10

Processor-DRAM Gap (latency)
- Memory hierarchies are getting deeper: processors get faster more quickly than memory
- µProc: 60%/yr ("Moore's Law"); DRAM: 7%/yr
- Processor-memory performance gap grows 50%/year
(Figure: log-scale performance vs. time, 1980 to 2000.)

Slide 11

Approaches to Handling Memory Latency
- Bandwidth has improved more than latency
- Approaches to the memory latency problem:
  - Eliminate memory operations by saving values in small, fast memory (cache) and reusing them
    - requires temporal locality in the program
  - Take advantage of better bandwidth by fetching a chunk of memory, saving it in small fast memory (cache), and using the whole chunk
    - requires spatial locality in the program
  - Take advantage of better bandwidth by allowing the processor to issue multiple reads to the memory system at once
    - concurrency in the instruction stream, e.g. load a whole array, as in vector processors; or prefetching

Slide 12

Cache Basics
- Cache hit: in-cache memory access (cheap)
- Cache miss: non-cached memory access (expensive); need to access the next, slower level of cache
- Consider a tiny cache (for illustration only)
- Cache line length: number of bytes loaded together in one entry
  - 2 in the example above
- Associativity
  - direct-mapped: only 1 address (line) in a given range can be in the cache
  - n-way: n >= 2 lines with different addresses can be stored

Slide 13

Why Have Multiple Levels of Cache?
- On-chip vs. off-chip: on-chip caches are faster but limited in size
- A large cache has delays
  - hardware to check longer addresses in the cache takes more time
  - associativity, which gives a more general set of data in the cache, also takes more time
- Some examples:
  - Cray T3E eliminated one cache to speed up misses
  - IBM uses a level of cache as a "victim cache", which is cheaper
- There are other levels of the memory hierarchy
  - registers, pages (TLB, virtual memory), ...
- And it isn't always a hierarchy

Slide 14

Experimental Study of Memory (Membench)
Microbenchmark for memory system performance:

  for array A of length L from 4 KB to 8 MB by 2x
    for stride s from 4 bytes (1 word) to L/2 by 2x
      time the following loop (repeat many times and average)
        for i from 0 to L by s
          load A[i] from memory (4 bytes)

Slide 15

Membench: What to Expect
- Consider the average cost per load
  - Plot one line for each array length, time vs. stride
  - Small stride is best: if a cache line holds 4 words, at most 1/4 of accesses miss
  - If the array is smaller than a given cache, all those accesses will hit (after the first run, which is negligible for sufficiently many runs)
  - Picture assumes only one level of cache
  - Values have become harder to measure on modern processors
(Figure: average cost per access vs. stride s; total size < L1 gives the L1 cache hit time, size > L1 gives the memory time.)

Slide 16

Memory Hierarchy on a Sun Ultra-2i
Sun Ultra-2i, 333 MHz (x-axis: array length)
- Mem: 396 ns (132 cycles)
- L2: 2 MB, 12 cycles (36 ns), 64-byte line
- L1: 16 KB, 2 cycles (6 ns), 16-byte line
- 8 KB pages, 32 TLB entries

Slide 17

Memory Hierarchy on a Pentium III
Katmai processor on Millennium, 550 MHz (x-axis: array size)
- L2: 512 KB, 60 ns
- L1: 64 KB, 5 ns, 4-way?
- L1 line: 32 bytes?

Slide 18

Memory Hierarchy on a Power3 (Seaborg)
Power3, 375 MHz (x-axis: array size)
- Mem: 396 ns (132 cycles)
- L2: 8 MB, 128-byte line, 9 cycles
- L1: 32 KB, 128-byte line, 0.5-2 cycles

Slide 19

Memory Performance on Itanium 2 (CITRIS)
Itanium 2, 900 MHz (x-axis: array size)
- Mem: 11-60 cycles
- L3: 2 MB, 128-byte line, 3-20 cycles
- L2: 256 KB, 128-byte line, 0.5-4 cycles
- L1: 32 KB, 64-byte line, 0.34-1 cycles
Uses the MAPS benchmark

Slide 20

Lessons
- The actual performance of a simple program can be a complicated function of the architecture
  - Slight changes in the architecture or the program change the performance significantly
  - To write fast programs, you need to consider the architecture
  - True on sequential or parallel processors
  - We would like simple models to help us design efficient algorithms
- We will illustrate with a common technique for improving cache performance, called blocking or tiling
  - Idea: use divide-and-conquer to define a problem that fits in registers/L1 cache/L2 cache

Slide 21

Outline
- Idealized and actual costs in modern processors
- Memory hierarchies
- Case Study: Matrix Multiplication
- Automatic Performance Tuning

Slide 22

Why Matrix Multiplication?
- An important kernel in scientific problems
  - Appears in many linear algebra algorithms
  - Closely related to other algorithms, e.g., transitive closure on a graph using Floyd-Warshall
- Optimization ideas can be used in other problems
- The best case for optimization payoffs
- The most-studied algorithm in high performance computing

Slide 23

Matrix multiply, optimized several ways
Speed of n-by-n matrix multiply on a Sun Ultra-1/170, peak = 330 MFlops

Slide 24

Outline
- Idealized and actual costs in modern processors
- Memory hierarchies
- Case Study: Matrix Multiplication
  - Simple cache model
  - Warm-up: matrix-vector multiplication
  - Blocking algorithms
  - Other techniques
- Automatic Performance Tuning

Slide 25

Note on Matrix Storage
- A matrix is a 2-D array of elements, but memory addresses are "1-D"
- Conventions for matrix layout
  - by column, or "column major" (Fortran default): A(i,j) at A+i+j*n
  - by row, or "row major" (C default): A(i,j) at A+i*n+j
  - recursive (later)
- Column major (for now)

  Column major          Row major
   0  5 10 15            0  1  2  3
   1  6 11 16            4  5  6  7
   2  7 12 17            8  9 10 11
   3  8 13 18           12 13 14 15
   4  9 14 19           16 17 18 19

Column major matrix in memory: a blue row of the matrix is spread across red cachelines.
Figure source: Larry Carter, UCSD

Slide 26

Computational Intensity: Key to algorithm efficiency
Machine Balance: Key to machine efficiency