High Performance Sorting and Searching using Graphics Processors

Presentation Transcript

Slide 1

High Performance Sorting and Searching using Graphics Processors. Naga K. Govindaraju, Microsoft Concurrency

Slide 2

Sorting and Searching "I trust that practically every essential part of programming emerges some place with regards to sorting or looking !" - Don Knuth

Slide 3

Sorting and Searching Well studied High performance computing Databases Computer graphics Programming languages ... Google map-reduce algorithm Spec benchmark routine!

Slide 4

Massive Databases Terabyte data sets are common Google sorts more than 100 billion terms in its index > 1 trillion records in web indexed! Database sizes are rapidly increasing! Max DB sizes increase 3x every year (http://www.wintercorp.com) Processor improvements not matching information explosion

Slide 5

CPU vs. GPU: CPU (3 GHz, 2 x 1 MB cache) with system memory (2 GB) and AGP memory (512 MB), connected over a PCI-E bus (4 GB/s) to the GPU (690 MHz) with video memory (512 MB)

Slide 6

Massive Data Handling on CPUs Require random memory accesses Small CPU caches (< 2 MB) Random memory accesses slower than even sequential disk accesses High memory latency Huge memory-to-compute gap! CPUs are deeply pipelined Pentium 4 has 30 pipeline stages Do not hide latency - high cycles per instruction (CPI) CPU is under-utilized for data intensive applications

Slide 7

Massive Data Handling on CPUs Sorting is hard! GPU a potentially scalable solution for terabyte sorting and scientific computing We beat the sorting benchmark with GPUs and provide a scalable solution

Slide 8

Graphics Processing Units (GPUs) Commodity processor for graphics applications Massively parallel vector processors High memory bandwidth Low memory latency pipeline Programmable High growth rate Power-efficient

Slide 9

GPU: Commodity Processor Laptops Consoles Cell phones PSP Desktops

Slide 10

Graphics Processing Units (GPUs) Commodity processor for graphics applications Massively parallel vector processors 10x more operations per second than CPUs High memory bandwidth Better hides memory latency pipeline Programmable High growth rate Power-efficient

Slide 11

Parallelism on GPUs Graphics FLOPS GPU – 1.3 TFLOPS CPU – 25.6 GFLOPS

Slide 12

Graphics Processing Units (GPUs) Commodity processor for graphics applications Massively parallel vector processors High memory bandwidth 10x more memory bandwidth than CPUs Better hides memory latency pipeline Programmable High growth rate Power-efficient

Slide 13

Graphics Pipeline Low pipeline depth 56 GB/s Hides memory latency!! vertex: programmable vertex processing (fp32); setup: polygon setup, culling, rasterization; rasterizer; pixel: programmable per-pixel math (fp32); texture: per-pixel texture, fp16 blending; memory: Z-buffer, fp16 blending, anti-aliasing (MRT); image

Slide 14

NON-Graphics Pipeline Abstraction (Courtesy: David Kirk, Chief Scientist, NVIDIA) data: programmable MIMD processing (fp32); lists: SIMD "rasterization" setup; rasterizer; data: programmable SIMD processing (fp32); data fetch, fp16 blending; memory: predicated write, fp16 blend, multiple output; data

Slide 15

Graphics Processing Units (GPUs) Commodity processor for graphics applications Massively parallel vector processors High memory bandwidth Better hides latency pipeline Programmable High growth rate Power-efficient

Slide 16

GPU Growth Rate vs. CPU Growth Rate: exploiting technology moving faster than Moore's Law

Slide 17

Graphics Processing Units (GPUs) Commodity processor for graphics applications Massively parallel vector processors High memory bandwidth Better hides latency pipeline Programmable High growth rate Power-efficient

Slide 18

CPU versus GPU (Henry Moreton: NVIDIA, Aug. 2005)

Slide 19

GPUs for Sorting and Searching: Issues No support for arbitrary writes Optimized CPU algorithms do not map well! Lack of support for general data types Cache-efficient algorithms Small data caches No cache information from vendors Out-of-core algorithms Limited GPU memory

Slide 20

Outline Overview Sorting and Searching on GPUs Applications Conclusions and Future Work

Slide 21

Sorting on GPUs Adaptive sorting algorithms Extent of sorted order in a sequence General sorting algorithms External memory sorting algorithms

Slide 22

Adaptive Sorting on GPUs Prior adaptive sorting algorithms require random data writes GPUs optimized for minimum depth or visible surface computation Using depth test functionality Design adaptive sorting using only minimum computations

Slide 23

N. Govindaraju, M. Henson, M. Lin and D. Manocha, Proc. of ACM I3D, 2005 Adaptive Sorting Algorithm Multiple iterations Each iteration uses a two-pass algorithm First pass – compute an increasing sequence M Second pass – compute the sorted elements in M Iterate on the remaining unsorted elements

Slide 24

Increasing Sequence Given a sequence S = {x_1, ..., x_n}, an element x_i belongs to M if and only if x_i ≤ x_j for all j > i, x_j in S

Slide 25

Increasing Sequence (≤): x_1 x_2 x_3 ... x_{i-1} x_i x_{i+1} ... x_{n-2} x_{n-1} x_n. M is an increasing sequence

Slide 26

Increasing Sequence Computation: x_1 x_2 ... x_{i-1} x_i x_{i+1} ... x_{n-1} x_n

Slide 27

Increasing Sequence Computation: start at the right with x_n; x_n ≤ ∞

Slide 28

Increasing Sequence Computation: at x_i, is x_i ≤ Min? Yes: prepend x_i to M and set Min = x_i. Continue over x_i x_{i+1} ... x_{n-1} x_n

Slide 29

Increasing Sequence Computation: x_1 x_2 ... x_{i-1} x_i x_{i+1} ... x_{n-1} x_n; finally, is x_1 ≤ min{x_2, ..., x_n}?
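
The first pass therefore reduces to a single right-to-left scan that keeps the minimum of the suffix seen so far. Below is a minimal CPU reference sketch of that pass, based on the membership test from slide 24; on the GPU the minimum computations are carried out by the depth-test hardware, and all names here are illustrative rather than taken from the paper.

```cpp
#include <cstdio>
#include <limits>
#include <vector>

// First pass of the adaptive sort (slides 24-29), as a CPU reference sketch:
// x[i] belongs to the increasing sequence M iff x[i] <= x[j] for every j > i,
// so a right-to-left scan that keeps the minimum of the suffix decides
// membership. Names are illustrative, not from the paper.
static std::vector<bool> increasingSequence(const std::vector<float>& x) {
    std::vector<bool> inM(x.size(), false);
    float suffixMin = std::numeric_limits<float>::infinity();  // x_n <= infinity
    for (size_t i = x.size(); i-- > 0;) {
        if (x[i] <= suffixMin) {      // x[i] <= every element to its right
            inM[i] = true;            // "prepend x_i to M"
            suffixMin = x[i];         // "Min = x_i"
        }
    }
    return inM;
}

int main() {
    const std::vector<float> x = {8, 1, 2, 3, 4, 5, 6, 7, 9};  // slide 40 input
    std::vector<bool> inM = increasingSequence(x);
    for (size_t i = 0; i < x.size(); ++i)
        std::printf("%g%s ", x[i], inM[i] ? "*" : "");  // * marks members of M
    std::printf("\n");  // prints: 8 1* 2* 3* 4* 5* 6* 7* 9*
}
```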

Slide 30

Computing Sorted Elements Theorem 1: Given the increasing sequence M, the rank of an element x_i in M is determined if x_i < min(I - M)

Slide 31

Computing Sorted Elements: x_1 x_2 x_3 ... x_{i-1} x_i x_{i+1} ... x_{n-2} x_{n-1} x_n

Slide 32

Computing Sorted Elements: elements of M (x_1 x_3 ... x_i x_{i+1} ... x_{n-2} x_{n-1}) compared (≤, ≥) against the remaining elements (x_2 ... x_{i-1} ... x_n)

Slide 33

Computing Sorted Elements Linear-time algorithm Maintaining the minimum

Slide 34

Computing Sorted Elements: scan the elements of M (x_1 x_3 ... x_i x_{i+1} ... x_{n-2} x_{n-1}) against the remaining elements (x_2 ... x_{i-1} ... x_n)

Slide 35

Computing Sorted Elements: at x_i, is x_i in M? No: update min. Yes: if x_i ≤ min, add x_i to the sorted list. Scan x_1 x_2 ... x_{i-1} x_i

Slide 36

Computing Sorted Elements: continue the scan over x_1 x_2 ... x_{i-1} x_i x_{i+1} ... x_{n-1} x_n
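
Putting the two passes together gives the iteration structure of slides 23 and 30-36. The sketch below is again a plain CPU reference that repeats the first-pass scan so it compiles on its own; per Theorem 1, an element of M is emitted once it is at most the minimum of the preceding elements outside M, and everything else is carried into the next iteration. Function names and the use of float keys are assumptions for illustration, not the paper's GPU implementation.

```cpp
#include <algorithm>
#include <cstdio>
#include <limits>
#include <vector>

// Pass 1 (slides 24-29): mark the increasing sequence M with a
// right-to-left scan over a running suffix minimum.
static std::vector<bool> increasingSequence(const std::vector<float>& x) {
    std::vector<bool> inM(x.size(), false);
    float suffixMin = std::numeric_limits<float>::infinity();
    for (size_t i = x.size(); i-- > 0;)
        if (x[i] <= suffixMin) { inM[i] = true; suffixMin = x[i]; }
    return inM;
}

// Pass 2 (slides 30-36): emit elements of M whose rank is determined
// (Theorem 1); everything else goes into the remainder for the next pass.
static void emitSorted(const std::vector<float>& x, const std::vector<bool>& inM,
                       std::vector<float>& sorted, std::vector<float>& rest) {
    float minNotInM = std::numeric_limits<float>::infinity();
    for (size_t i = 0; i < x.size(); ++i) {
        if (inM[i] && x[i] <= minNotInM) {
            sorted.push_back(x[i]);                  // rank is determined
        } else {
            rest.push_back(x[i]);                    // unsorted remainder
            if (!inM[i]) minNotInM = std::min(minNotInM, x[i]);  // "update min"
        }
    }
}

int main() {
    std::vector<float> data = {8, 1, 2, 3, 4, 5, 6, 7, 9};  // slide 40 input
    std::vector<float> sorted;
    int iterations = 0;
    while (!data.empty()) {                          // iterate on the remainder
        std::vector<float> rest;
        emitSorted(data, increasingSequence(data), sorted, rest);
        data.swap(rest);
        ++iterations;
    }
    std::printf("%d iterations:", iterations);       // 2 iterations here
    for (float v : sorted) std::printf(" %g", v);
    std::printf("\n");
}
```

On the slide-40 input the loop finishes in 2 iterations, within the bound given by Theorem 2 below.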

Slide 37

Algorithm Analysis Knuth's measure of disorder: given a sequence I and its longest increasing subsequence LIS(I), the sequence of unordered elements is Y = I - LIS(I) Theorem 2: Given a sequence I and LIS(I), our adaptive algorithm sorts in at most (2 ||Y|| + 1) iterations
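
Restated in standard notation (a hedged reconstruction of the slide's formula, where ||Y|| is the number of elements outside the longest increasing subsequence):

```latex
\[
  Y = I \setminus \mathrm{LIS}(I), \qquad
  \text{iterations}(I) \;\le\; 2\,\lVert Y \rVert + 1 .
\]
```

For the slide-40 input I = (8, 1, 2, ..., 7, 9) we have LIS(I) = (1, 2, ..., 7, 9) and Y = (8), so the bound is 3 iterations; the run sketched above finishes in 2.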

Slide 38

Pictorial Proof: x_1 x_2 ... x_l x_{l+1} x_{l+2} ... x_m x_{m+1} x_{m+2} ... x_q x_{q+1} x_{q+2} ... x_n

Slide 39

Pictorial Proof: 2 iterations, 2 iterations, 2 iterations: x_1 x_2 ... x_l x_{l+1} x_{l+2} ... x_m x_{m+1} x_{m+2} ... x_q x_{q+1} x_{q+2} ... x_n

Slide 40

Example: 8 1 2 3 4 5 6 7 9

Slide 41

Example: 1 2 3 4 5 6 7 are sorted after the first iteration; 8 9 remain

Slide 42

Analysis

Slide 43

Advantages Linear in the input size and the sorted extent Works well on nearly sorted data Maps well to GPUs Uses depth test functionality for minimum operations Useful for performing 3D visibility ordering Perform transparency computations on dynamic 3D environments Cons: Expected time: O(n² – 2n√n) on random sequences

Slide 44

Video: Transparent PowerPlant 790K polygons Depth complexity ~ 13 1600x1200 resolution NVIDIA GeForce 6800 5-8 fps

Slide 45

Video: Transparent PowerPlant

Slide 46

N. Govindaraju, N. Raghuvanshi and D. Manocha, Proc. of ACM SIGMOD, 2005 General Sorting on GPUs General datasets High performance

Slide 47

General Sorting on GPUs Design sorting algorithms with deterministic memory accesses – "Texturing" on GPUs 56 GB/s peak memory bandwidth Can better hide the memory latency!! Require minimum and maximum computations – "Blending functionality" on GPUs Low branching overhead No data dependencies Utilize high parallelism on GPUs

Slide 48

GPU-Based Sorting Networks Represent data as 2D arrays Multi-stage algorithm Each stage involves multiple steps In each step compare one array element against exactly one other element at a fixed distance Perform a conditional assignment (MIN or MAX) at each element location
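
One concrete network with exactly this structure is the bitonic sorting network: a sequence of stages, each made of steps, where every step compares each element with a single partner at a fixed distance and conditionally keeps the MIN or MAX. The sketch below is a plain CPU version for illustration only; it is not the talk's GPU implementation, which expresses each step as a rendering pass over 2D textures, and it assumes the input length is a power of two.

```cpp
#include <cassert>
#include <cstdio>
#include <utility>
#include <vector>

// Illustrative CPU sketch of a bitonic sorting network. Each step compares
// every element with exactly one partner at distance j and performs a
// conditional MIN/MAX assignment, matching the structure described on the
// slide. Assumes n is a power of two (pad with +infinity otherwise).
static void bitonicSort(std::vector<float>& a) {
    const size_t n = a.size();
    assert((n & (n - 1)) == 0 && "pad the input to a power of two");
    for (size_t k = 2; k <= n; k <<= 1) {          // stage: sorted run length
        for (size_t j = k >> 1; j > 0; j >>= 1) {  // step: compare distance j
            for (size_t i = 0; i < n; ++i) {       // data-parallel on the GPU
                size_t partner = i ^ j;            // the one fixed partner
                if (partner <= i) continue;        // handle each pair once
                bool keepMinHere = (i & k) == 0;   // direction of this region
                if ((a[i] > a[partner]) == keepMinHere)
                    std::swap(a[i], a[partner]);   // conditional MIN/MAX write
            }
        }
    }
}

int main() {
    std::vector<float> a = {9, 1, 8, 2, 7, 3, 6, 4};
    bitonicSort(a);
    for (float v : a) std::printf("%g ", v);
    std::printf("\n");
}
```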

Slide 49

Sorting (animation in the original presentation)

Slide 50

2D Memory Addressing GPUs optimized for 2D representations Map 1D arrays to 2D arrays Minimum and maximum regions mapped to row-aligned or column-aligned quads

Slide 51

1D – 2D Mapping: MIN and MAX regions

Slide 52

1D – 2D Mapping (MIN): effectively reduces instructions per element
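
The 1D-to-2D layout itself is just index arithmetic: element i of the key array lives at row i / width, column i mod width of a texture, so partners at a fixed 1D distance fall into row-aligned or column-aligned quads. A small sketch, with the texture width as an assumed parameter rather than a value from the talk:

```cpp
#include <cstdio>

// Minimal sketch of the 1D-to-2D index mapping used to store keys in a
// width x height texture. The width is an assumed example value.
struct Texel { int row, col; };

static Texel to2D(int i, int width)   { return {i / width, i % width}; }
static int   to1D(Texel t, int width) { return t.row * width + t.col; }

int main() {
    const int width = 1024;                       // assumed texture width
    const int samples[] = {0, 1023, 1024, 4097};  // a few example indices
    for (int i : samples) {
        Texel t = to2D(i, width);
        std::printf("%d -> (row %d, col %d) -> %d\n",
                    i, t.row, t.col, to1D(t, width));
    }
}
```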

Slide 53

Sorting on GPU: Pipelining and Parallelism Input Vertices Texturing, Caching and 2D Quad Comparisons Sequential Writes

Slide 54

Comparison with GPU-Based Algorithms 3-6x faster than prior GPU-based algorithms!

Slide 55

GPU vs. High-End Multi-Core CPUs 2-2.5x faster than Intel high-end processors Single GPU performance comparable to a high-end dual-core Athlon Hand-optimized CPU code from Intel Corporation!

Slide 56

GPU vs. High-End Multi-Core CPUs 2-2.5x faster than Intel high-end processors Single GPU performance comparable to a high-end dual-core Athlon Slashdot and Tom's Hardware Guide headlines, June 2005

Slide 57

N. Govindaraju, S. Larsen, J. Gray, and D. Manocha, Proc. of ACM SuperComputing, 2006 GPU Cache Model Small data caches Better hide the memory latency Vendors do not disclose cache information – critical for scientific computing on GPUs We design a simple model Determine cache parameters (block and cache sizes) Improve sorting, FFT and SGEMM performance
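
A common technique for recovering undocumented cache parameters is to time memory accesses at increasing strides and look for the stride at which the cost per access jumps, which exposes the cache block size (and, with growing working sets, the cache size). The sketch below is only a CPU analogue of that idea with made-up sizes, not the paper's GPU methodology, which runs the corresponding experiments through rendering passes.

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

// CPU-style sketch of a strided-access microbenchmark for estimating the
// cache block size: when the stride exceeds the block size, every access
// misses and the per-access time jumps. Sizes and strides are illustrative.
int main() {
    const size_t bytes = 64u << 20;                 // 64 MB working set
    std::vector<char> buf(bytes, 1);
    for (size_t stride = 16; stride <= 4096; stride <<= 1) {
        auto t0 = std::chrono::steady_clock::now();
        volatile char sink = 0;
        size_t touched = 0;
        for (size_t i = 0; i < bytes; i += stride, ++touched)
            sink = sink + buf[i];                   // one access per stride
        auto t1 = std::chrono::steady_clock::now();
        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
        std::printf("stride %5zu B: %.2f ns/access\n", stride, ns / touched);
    }
}
```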

Slide 58

Cache Evictions
