Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems


Presentation Transcript

Slide 1

Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems. Onur Mutlu and Thomas Moscibroda, Computer Architecture Group, Microsoft Research

Slide 2

Outline: Background and Goal; Motivation: Destruction of Intra-Thread DRAM Bank Parallelism; Parallelism-Aware Batch Scheduling: Request Batching, Within-Batch Scheduling; System Software Support; Evaluation; Summary

Slide 3

The DRAM System. Figure: four banks (BANK 0–BANK 3), each an array of rows and columns with a Row Buffer, connected through the DRAM Bus to the DRAM CONTROLLER. FR-FCFS scheduling policy: (1) row-hit first, (2) oldest first.
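A minimal sketch of the FR-FCFS rule above, assuming a simple software model of the request queue (the `Request` fields and the `open_row` map are illustrative simplifications, not the controller's actual structures):

```python
# Minimal FR-FCFS arbitration sketch (hypothetical data model).
from dataclasses import dataclass

@dataclass
class Request:
    thread: int   # issuing thread (used by later sketches)
    bank: int     # target DRAM bank
    row: int      # target row within the bank
    arrival: int  # lower value = older request

def fr_fcfs_pick(requests, open_row):
    """Pick the next request: row hits first, then oldest first.

    open_row maps bank -> row currently latched in that bank's row buffer.
    """
    # Rule 1: row-hit first. A request to the already-open row skips the
    # precharge + activate steps, so it completes with much lower latency.
    hits = [r for r in requests if r.row == open_row.get(r.bank)]
    candidates = hits or requests
    # Rule 2: oldest first among the remaining candidates.
    return min(candidates, key=lambda r: r.arrival)
```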

Slide 4

Multi-Core Systems. Figure: a multi-core chip with CORE 0–CORE 3, each with a private L2 CACHE, sharing the DRAM memory system: the DRAM MEMORY CONTROLLER in front of DRAM Bank 0 through DRAM Bank 7. The threads' requests interfere in this shared system.

Slide 5

Inter-Thread Interference in the DRAM System. Threads delay each other by causing resource contention: bank, bus, and row buffer conflicts [MICRO 2007]. Threads can also destroy each other's DRAM bank parallelism: otherwise-parallel requests can become serialized. Existing DRAM schedulers are unaware of this interference; they only aim to maximize DRAM throughput. They are thread-unaware and thread-unfair: there is no intent to service each thread's requests in parallel, and the FR-FCFS policy of (1) row-hit first, (2) oldest first unfairly prioritizes threads with high row buffer locality.

Slide 6

Consequences of Inter-Thread Interference in DRAM. Unfair slowdown of different threads [MICRO 2007]; system performance loss [MICRO 2007]; vulnerability to denial of service [USENIX Security 2007]; inability to enforce system-level thread priorities [MICRO 2007]. Figure: normalized memory stall time when DRAM is the only shared resource; a high-priority thread acts as a memory performance hog while the low-priority cores make very slow progress.

Slide 7

Our Goal. Control inter-thread interference in DRAM. Design a shared DRAM scheduler that provides high system performance, preserves each thread's DRAM bank parallelism, provides fairness to threads sharing the DRAM system, balances the memory slowdowns of equal-priority threads, is controllable and configurable, and enables different service levels for threads with different priorities.

Slide 8

Outline: Background and Goal; Motivation: Destruction of Intra-Thread DRAM Bank Parallelism; Parallelism-Aware Batch Scheduling: Request Batching, Within-Batch Scheduling; System Software Support; Evaluation; Summary

Slide 9

The Problem. Processors try to tolerate the latency of DRAM requests by generating multiple outstanding requests: Memory-Level Parallelism (MLP), via out-of-order execution, non-blocking caches, and runahead execution. This is effective only if the DRAM controller actually services the multiple requests in parallel in the DRAM banks. But multiple threads share the DRAM controller, and DRAM controllers are not aware of a thread's MLP: they can service each thread's outstanding requests serially, not in parallel.

Slide 10

Bank Parallelism of a Thread. Figure: a single thread (Thread A) computes, issues 2 DRAM requests (Bank 0, Row 1 and Bank 1, Row 1), and stalls. The bank access latencies of the two requests are overlapped, so the thread stalls for ~ONE bank access latency.

Slide 11

Bank Parallelism Interference in DRAM. Figure: under the baseline scheduler, Thread A issues 2 DRAM requests (Bank 0, Row 1 and Bank 1, Row 1) and Thread B issues 2 DRAM requests (Bank 1, Row 99 and Bank 0, Row 99). The scheduler interleaves the two threads, so each thread's bank access latencies are serialized: each thread stalls for ~TWO bank access latencies.

Slide 12

Parallelism-Aware Scheduler. Figure: the baseline scheduler serializes each thread's 2 DRAM requests, stalling both Thread A and Thread B for ~two bank access latencies. A parallelism-aware scheduler services Thread A's requests in parallel (saved cycles), then Thread B's: Thread A stalls for ~one latency and Thread B for ~two, for an average stall time of ~1.5 bank access latencies.
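The stall times in the three examples above reduce to simple arithmetic. A back-of-envelope sketch, assuming a uniform bank access latency (a deliberate simplification of real DRAM timing):

```python
# Stall-time arithmetic for slides 10-12, with a uniform bank access
# latency L as the unit of time (a simplifying assumption).
L = 1.0  # one bank access latency

# Slide 10: one thread, 2 requests to different banks, overlapped.
single_thread_stall = 1 * L          # thread stalls ~ONE latency

# Slide 11: a thread-unaware scheduler interleaves threads A and B,
# serializing each thread's pair of requests.
baseline_stall_per_thread = 2 * L    # each thread stalls ~TWO latencies

# Slide 12: a parallelism-aware scheduler services A's pair in parallel,
# then B's pair: A stalls ~1 L, B stalls ~2 L (it finishes after A).
parbs_avg_stall = (1 * L + 2 * L) / 2
print(parbs_avg_stall)               # -> 1.5 bank access latencies
```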

Slide 13

Outline: Background and Goal; Motivation: Destruction of Intra-Thread DRAM Bank Parallelism; Parallelism-Aware Batch Scheduling (PAR-BS): Request Batching, Within-Batch Scheduling; System Software Support; Evaluation; Summary

Slide 14

Parallelism-Aware Batch Scheduling (PAR-BS). Principle 1: Parallelism-awareness. Schedule requests from a thread (to different banks) back to back; this preserves each thread's bank parallelism, but it can cause starvation. Principle 2: Request Batching. Group a fixed number of the oldest requests from each thread into a "batch"; service the batch before all other requests; form a new batch when the current one is done. Batching eliminates starvation and provides fairness, while allowing parallelism-awareness within a batch. Figure: per-bank request queues (Bank 0, Bank 1) holding requests from threads T0–T3, with the oldest requests grouped into a batch.

Slide 15

PAR-BS Components. Request batching; within-batch scheduling (parallelism-aware).

Slide 16

Request Batching. Each memory request has a bit (marked) associated with it. Batch formation: mark up to Marking-Cap oldest requests per bank for each thread; the marked requests constitute the batch; form a new batch when no marked requests are left. Marked requests are prioritized over unmarked ones. No reordering of requests across batches: no starvation, high fairness. How to prioritize requests within a batch?
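A minimal sketch of batch formation under a Marking-Cap, assuming per-(thread, bank) request queues and a mutable `marked` flag on each request (the queue layout and the cap value are illustrative assumptions):

```python
# Batch formation sketch (slide 16). Queue layout is hypothetical.
MARKING_CAP = 5  # example value; the paper studies this parameter

def form_batch(queues):
    """Mark up to MARKING_CAP oldest requests per bank for each thread.

    queues maps (thread_id, bank_id) -> list of requests, oldest first.
    Marked requests constitute the batch; a new batch is formed only
    when no marked requests remain, so requests are never reordered
    across batches (no starvation).
    """
    for reqs in queues.values():
        for r in reqs[:MARKING_CAP]:
            r.marked = True
```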

Slide 17

Within-Batch Scheduling. Can use any existing DRAM scheduling policy: FR-FCFS (row-hit first, then oldest first) exploits row buffer locality. But we also want to preserve intra-thread bank parallelism by servicing each thread's requests back to back. The scheduler computes a ranking of threads when the batch is formed; higher-ranked threads are prioritized over lower-ranked ones. Because different threads are prioritized in the same order across ALL banks, this improves the likelihood that a thread's requests are serviced in parallel by different banks. HOW?

Slide 18

How to Rank Threads within a Batch. The ranking scheme affects both system throughput and fairness. To maximize system throughput: minimize the average stall time of threads within the batch. To minimize unfairness (equalize the slowdown of threads): service threads with inherently low stall time early in the batch; the insight is that delaying memory non-intensive threads results in high slowdown. A shortest stall-time first (shortest job first) ranking provides optimal system throughput [Smith, 1956]*. The controller estimates each thread's stall time within the batch and ranks threads with shorter stall time higher. * W. E. Smith, "Various optimizers for single-stage production," Naval Research Logistics Quarterly, 1956.

Slide 19

Shortest Stall-Time First Ranking. Max-bank-load: the maximum number of marked requests a thread has to any single bank; the thread with the lower max-bank-load is ranked higher (~ low stall time). Total-load: the thread's total number of marked requests; it breaks ties, ranking the thread with the lower total-load higher. Figure: marked requests from T0–T3 spread across Bank 0–Bank 3 produce the ranking T0 > T1 > T2 > T3.
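A minimal sketch of this ranking computation, assuming request objects with `thread` and `bank` attributes (the names are illustrative):

```python
# Shortest stall-time first ranking sketch (slide 19).
from collections import defaultdict

def rank_threads(marked_requests):
    """Return thread ids ordered highest rank first."""
    bank_load = defaultdict(int)   # (thread, bank) -> marked requests
    total_load = defaultdict(int)  # thread -> total marked requests
    for r in marked_requests:
        bank_load[(r.thread, r.bank)] += 1
        total_load[r.thread] += 1
    # max-bank-load: a thread's marked requests to its most-loaded bank,
    # an estimate of the thread's stall time within the batch.
    max_bank_load = defaultdict(int)
    for (thread, _bank), load in bank_load.items():
        max_bank_load[thread] = max(max_bank_load[thread], load)
    # Lower max-bank-load ranks higher; lower total-load breaks ties.
    return sorted(total_load, key=lambda t: (max_bank_load[t], total_load[t]))
```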

Slide 20

Example Within-Batch Scheduling Order. Figure: the same batch of requests from T0–T3 at Bank 0–Bank 3, serviced over 7 time slots. Under the baseline scheduling order (arrival order), the threads' stall times average 5 bank access latencies; under the PAR-BS scheduling order with ranking T0 > T1 > T2 > T3, they average 3.5 bank access latencies.

Slide 21

Putting It Together: PAR-BS Scheduling Policy. Priority order: (1) marked requests first, (2) row-hit requests first, (3) higher-rank thread first (shortest stall-time first), (4) oldest first. Rule (1) implements batching; rules (2)–(4) implement parallelism-aware within-batch scheduling. Three properties: exploits row buffer locality and intra-thread bank parallelism; work-conserving (services unmarked requests to banks with no marked requests); Marking-Cap is important (too small a cap destroys row buffer locality, too large a cap penalizes memory non-intensive threads). Many more trade-offs are analyzed in the paper.
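The four rules compose into a single priority tuple. A sketch, reusing the request fields and ranking from the earlier sketches (`rank` maps a thread id to its batch-ranking position, 0 = highest; all names are illustrative):

```python
# PAR-BS request prioritization sketch (slide 21). Python sorts False
# before True, so boolean tuple fields express "X first" conditions.
def parbs_pick(requests, open_row, rank):
    def priority(r):
        return (
            not r.marked,                   # (1) marked requests first
            r.row != open_row.get(r.bank),  # (2) row-hit requests first
            rank[r.thread],                 # (3) higher-ranked thread first
            r.arrival,                      # (4) oldest first
        )
    # The request with the smallest priority tuple wins arbitration.
    return min(requests, key=priority)
```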

Slide 22

Hardware Cost. <1.5 KB storage cost for an 8-core system with a 128-entry memory request buffer. No complex operations (e.g., divisions). Not on the critical path: the scheduler makes a decision only once every DRAM cycle.

Slide 23

Outline: Background and Goal; Motivation: Destruction of Intra-Thread DRAM Bank Parallelism; Parallelism-Aware Batch Scheduling: Request Batching, Within-Batch Scheduling; System Software Support; Evaluation; Summary

Slide 24

System Software Support. The OS conveys each thread's priority level to the controller: levels 1, 2, 3, … (highest to lowest priority). The controller enforces priorities in two ways: (1) requests from a thread with priority X are marked only every Xth batch; (2) within a batch, higher-priority threads' requests are scheduled first. A special low-priority level L gives purely opportunistic service: requests from such threads are never marked. Quantitative analysis in the paper.
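A minimal sketch of the marking-frequency rule, where the sentinel for the special level L is our own choice (the function shape is an assumption, not the paper's hardware mechanism):

```python
# Priority enforcement via marking frequency (slide 24).
LEVEL_L = 0  # sentinel for the special "opportunistic only" level

def should_mark(priority_level, batch_number):
    """May a thread at this priority level join the current batch?"""
    if priority_level == LEVEL_L:
        return False  # never marked: purely opportunistic service
    # Priority-X threads are marked only every Xth batch, so lower
    # priority (larger X) means less frequent batch membership.
    return batch_number % priority_level == 0
```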

Slide 25

Outline: Background and Goal; Motivation: Destruction of Intra-Thread DRAM Bank Parallelism; Parallelism-Aware Batch Scheduling: Request Batching, Within-Batch Scheduling; System Software Support; Evaluation; Summary

Slide 26

Evaluation Methodology. 4-, 8-, and 16-core systems; x86 processor model based on the Intel Pentium M: 4 GHz processor, 128-entry instruction window, 512 KB per-core private L2 caches, 32 L2 miss buffers. Detailed DRAM model based on Micron DDR2-800: 128-entry memory request buffer, 8 banks, 2 KB row buffer, 40 ns (160 cycles) row-hit round-trip latency, 80 ns (320 cycles) row-conflict round-trip latency. Benchmarks: multiprogrammed SPEC CPU2006 and Windows desktop applications; 100, 16, and 12 program combinations for the 4-, 8-, and 16-core experiments.
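For reference, the simulated parameters restated as plain data (the field names are ours; the values come from the slide):

```python
# Evaluation configuration from slide 26, restated as data.
SYSTEM_CONFIG = {
    "cores": (4, 8, 16),               # system sizes evaluated
    "clock_ghz": 4,                    # per-core clock
    "instruction_window": 128,         # entries
    "private_l2_kb": 512,              # per core
    "l2_miss_buffers": 32,
    "memory_request_buffer": 128,      # entries
    "dram_banks": 8,
    "row_buffer_kb": 2,
    "row_hit_rt_ns": 40,               # 160 cycles round trip
    "row_conflict_rt_ns": 80,          # 320 cycles round trip
}
```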

Slide 27

Comparison with Other DRAM Controllers. Baseline: FR-FCFS [Zuravleff and Robinson, US Patent 1997; Rixner et al., ISCA 2000], which prioritizes row-hit requests first, then oldest requests.
