
High-Performance Sorting and Searching using Graphics Processors
Naga K. Govindaraju, Microsoft Concurrency

Sorting and Searching
"I believe that virtually every important aspect of programming arises somewhere in the context of sorting or searching!" - Don Knuth

Sorting and Searching
- Well studied: high-performance computing, databases, computer graphics, programming languages, ...
- Google MapReduce algorithm
- Spec benchmark routine!

Massive Databases
- Terabyte data sets are common: Google sorts more than 100 billion terms in its index; > 1 trillion records in its web index!
- Database sizes are rapidly increasing: max DB sizes grow 3x every year (http://www.wintercorp.com)
- Processor improvements are not keeping pace with the data explosion

CPU vs. GPU
[Diagram: CPU (3 GHz, 2 x 1 MB cache); System Memory (2 GB); AGP Memory (512 MB); PCI-E Bus (4 GB/s); GPU (690 MHz); Video Memory (512 MB)]

Massive Data Handling on CPUs
- Sorting requires random memory accesses, but CPU caches are small (< 2 MB)
- Random memory accesses are slower than even sequential disk accesses: high memory latency, a huge memory-to-compute gap!
- CPUs are deeply pipelined: the Pentium 4 has 30 pipeline stages, and they do not hide latency, giving high cycles per instruction (CPI)
- The CPU is under-utilized for data-intensive applications

Massive Data Handling on CPUs
- Sorting is hard!
- The GPU is a potentially scalable solution for terabyte sorting and scientific computing
- We beat the sorting benchmark with GPUs and provide a scalable solution

Graphics Processing Units (GPUs)
- Commodity processor for graphics applications
- Massively parallel vector processors
- High memory bandwidth
- Low memory latency pipeline
- Programmable
- High growth rate
- Power-efficient

GPU: Commodity Processor
Laptops, consoles, cell phones, PSP, desktops

Graphics Processing Units (GPUs)
- Commodity processor for graphics applications
- Massively parallel vector processors: 10x more operations per sec than CPUs
- High memory bandwidth
- Pipeline better hides memory latency
- Programmable
- High growth rate
- Power-efficient

Parallelism on GPUs
Graphics FLOPS: GPU 1.3 TFLOPS vs. CPU 25.6 GFLOPS

Graphics Processing Units (GPUs)
- Commodity processor for graphics applications
- Massively parallel vector processors
- High memory bandwidth: 10x more memory bandwidth than CPUs
- Pipeline better hides memory latency
- Programmable
- High growth rate
- Power-efficient

Graphics Pipeline (low pipeline depth, 56 GB/s; hides memory latency!!)
[Diagram: vertex - programmable vertex processing (fp32); setup - polygon setup, culling, rasterization; pixel - programmable per-pixel math (fp32); texture - per-pixel texture, fp16 blending; memory - Z-buffer, fp16 blending, anti-aliasing (MRT); image]

Non-Graphics Pipeline Abstraction
[Diagram: data - programmable MIMD processing (fp32); lists - SIMD "rasterization" setup; rasterizer; data - programmable SIMD processing (fp32); data fetch, fp16 blending; memory - predicated write, fp16 blend, multiple output; data]
Courtesy: David Kirk, Chief Scientist, NVIDIA

Graphics Processing Units (GPUs)
- Commodity processor for graphics applications
- Massively parallel vector processors
- High memory bandwidth
- Pipeline better hides memory latency
- Programmable
- High growth rate
- Power-efficient

GPU Growth Rate vs. CPU Growth Rate
Exploiting technology moving faster than Moore's Law

Graphics Processing Units (GPUs)
- Commodity processor for graphics applications
- Massively parallel vector processors
- High memory bandwidth
- Pipeline better hides memory latency
- Programmable
- High growth rate
- Power-efficient

CPU versus GPU (Henry Moreton: NVIDIA, Aug. 2005)

GPUs for Sorting and Searching: Issues
- No support for arbitrary writes: optimized CPU algorithms do not map well!
- Lack of support for general data types
- Cache-efficient algorithms: small data caches, and no cache information from vendors
- Out-of-core algorithms: limited GPU memory

Outline
Overview; Sorting and Searching on GPUs; Applications; Conclusions and Future Work

Sorting on GPUs
- Adaptive sorting algorithms: exploit the extent of sorted order in a sequence
- General sorting algorithms
- External memory sorting algorithms

Adaptive Sorting on GPUs
- Prior adaptive sorting algorithms require random data writes
- GPUs are optimized for minimum-depth or visible-surface computation, using depth-test functionality
- We design adaptive sorting using only minimum computations

Adaptive Sorting Algorithm (N. Govindaraju, M. Henson, M. Lin and D. Manocha, Proc. of ACM I3D, 2005)
- Multiple iterations; each iteration uses a two-pass algorithm
- First pass: compute an increasing sequence M
- Second pass: compute the sorted elements in M
- Iterate on the remaining unsorted elements

Increasing Sequence
Given a sequence S = {x_1, ..., x_n}, an element x_i belongs to M if and only if x_i ≤ x_j for all i < j, x_j in S

[Animation: computing the increasing sequence M. x_n is compared against ∞ and always joins M; scanning from x_n down to x_1, if x_i ≤ Min, prepend x_i to M and set Min = x_i, ending with the test x_1 ≤ {x_2, ..., x_n}?]
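The scan in the slides above can be sketched on the CPU (an illustration of the logic, not the actual depth-test implementation; the function name is ours):

```python
def increasing_sequence(s):
    """Compute M: x_i belongs to M iff x_i <= x_j for every j > i.
    Scan right to left, maintaining the running minimum of the suffix."""
    m = []
    running_min = float("inf")       # x_n is effectively compared against infinity
    for x in reversed(s):
        if x <= running_min:         # x_i <= Min? then prepend x_i to M
            m.append(x)
            running_min = x
    m.reverse()                      # we appended in reverse order
    return m
```

On the GPU, this minimum computation is what the depth test provides in hardware.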

Computing Sorted Elements
Theorem 1: Given the increasing sequence M, the rank of an element x_i in M is determined if x_i < min(I - M)

[Animation: computing the sorted elements in linear time by maintaining a minimum. Scanning x_1 to x_n: if x_i is in M and x_i ≤ min, add x_i to the sorted list; otherwise update min.]
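The second pass can be sketched similarly (a CPU illustration; M is passed in explicitly, and distinct keys are assumed):

```python
def sorted_elements(s, m):
    """Place the elements of the increasing sequence m whose rank is final
    (Theorem 1: x_i is placed if it is at most the minimum of everything
    left unsorted so far); return (placed, remaining)."""
    m = set(m)                       # assumes distinct keys
    placed, remaining = [], []
    running_min = float("inf")       # minimum over elements left unsorted so far
    for x in s:
        if x in m and x <= running_min:
            placed.append(x)         # rank determined: add to the sorted list
        else:
            remaining.append(x)      # iterate on these next time
            running_min = min(running_min, x)
    return placed, remaining
```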

Algorithm Analysis
- Knuth's measure of disorder: given a sequence I and its longest increasing subsequence LIS(I), the sequence of unordered elements is Y = I - LIS(I)
- Theorem 2: Given a sequence I and LIS(I), our adaptive algorithm sorts in at most (2||Y|| + 1) iterations

Pictorial Proof
[Slides: the sequence x_1 ... x_n split into runs x_1 ... x_l, x_{l+1} ... x_m, x_{m+1} ... x_q, x_{q+1} ... x_n, with each disordered run consumed in 2 iterations]

Example
The sequence 8 1 2 3 4 5 6 7 9; its increasing sequence M is 1 2 3 4 5 6 7 9, each element ≤ all later elements

Example (after one iteration)
1 2 3 4 5 6 7 are output as sorted; 8 and 9 remain for the next iteration
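Putting the two passes together gives the full iterative algorithm; on the example above, the first iteration outputs 1 through 7 and the second handles 8 and 9. A minimal CPU sketch, assuming distinct keys:

```python
def adaptive_sort(s):
    """Iterate the two-pass algorithm until every element is placed."""
    out = []
    while s:
        # Pass 1: increasing sequence M via a right-to-left running minimum.
        m, suffix_min = set(), float("inf")
        for x in reversed(s):
            if x <= suffix_min:
                m.add(x)
                suffix_min = x
        # Pass 2: emit elements of M that are below everything left unsorted.
        rest, prefix_min = [], float("inf")
        for x in s:
            if x in m and x <= prefix_min:
                out.append(x)
            else:
                rest.append(x)
                prefix_min = min(prefix_min, x)
        s = rest    # the global minimum is always placed, so the input shrinks
    return out
```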

Analysis

Advantages
- Linear in the input size and the sorted extent; works well on nearly sorted inputs
- Maps well to GPUs: uses depth-test functionality for the minimum operations
- Useful for 3D visibility ordering: perform transparency computations on dynamic 3D environments
Cons:
- Expected time O(n^2 - 2n√n) on random sequences

Video: Transparent PowerPlant
790K polygons; depth complexity ~13; 1600x1200 resolution; NVIDIA GeForce 6800; 5-8 fps

Video: Transparent PowerPlant

General Sorting on GPUs (N. Govindaraju, N. Raghuvanshi and D. Manocha, Proc. of ACM SIGMOD, 2005)
General datasets; high performance

General Sorting on GPUs
- Design sorting algorithms with deterministic memory accesses: "texturing" on GPUs, 56 GB/s peak memory bandwidth, can better hide the memory latency!!
- Require minimum and maximum computations: "blending functionality" on GPUs, low branching overhead, no data dependencies
- Exploit the high parallelism on GPUs

GPU-Based Sorting Networks
- Represent data as 2D arrays
- Multi-stage algorithm; each stage involves multiple steps
- In each step: compare one array element against exactly one other element at a fixed distance, and perform a conditional assignment (MIN or MAX) at each element location
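One classic network with exactly this structure is bitonic sort: every step compares element i with its partner at a fixed distance j (index i XOR j), keeping a MIN or MAX at each location. A CPU sketch of the network (the GPU version evaluates each location as a fragment):

```python
def bitonic_sort(seq):
    """Bitonic sorting network; input length must be a power of two.
    All comparisons are at fixed, data-independent distances."""
    a = list(seq)
    n = len(a)
    assert n and n & (n - 1) == 0, "network assumes a power-of-two input"
    k = 2
    while k <= n:                    # stages
        j = k // 2
        while j > 0:                 # steps within a stage: distance j
            for i in range(n):
                p = i ^ j            # partner at fixed distance
                if p > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[p]) == ascending:
                        a[i], a[p] = a[p], a[i]   # conditional MIN/MAX assignment
            j //= 2
        k *= 2
    return a
```

Because the compare distances depend only on the stage and step, not on the data, every location can be processed in parallel with no branching on values.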

[Flash animation: the sorting network in action]

2D Memory Addressing
- GPUs are optimized for 2D representations
- Map 1D arrays to 2D arrays
- Minimum and maximum regions are mapped to row-aligned or column-aligned quads
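The mapping itself is simple index arithmetic; with a texture of width W, a compare distance of 1 pairs neighbours within a row, while a distance of W pairs the same columns of adjacent rows (W here is a hypothetical width, for illustration):

```python
W = 4                                # hypothetical texture width

def to_2d(i, width=W):
    """Map a 1D array index to a (row, column) texture address."""
    return divmod(i, width)

# distance 1: the partner sits in the same row (row-aligned regions)
assert to_2d(5)[0] == to_2d(5 ^ 1)[0]
# distance W: the partner sits in the same column, one row away (column-aligned)
assert to_2d(5)[1] == to_2d(5 ^ W)[1]
```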

[Slides: 1D to 2D mapping of the MIN and MAX regions, effectively reducing the instructions per element]

Sorting on GPU: Pipelining and Parallelism
Input vertices → texturing, caching and 2D quad comparisons → sequential writes

Comparison with GPU-Based Algorithms
3-6x faster than prior GPU-based algorithms!

GPU vs. High-End Multi-Core CPUs
- 2-2.5x faster than Intel high-end processors
- Single-GPU performance comparable to a high-end dual-core Athlon
- Hand-optimized CPU code from Intel Corporation!

GPU vs. High-End Multi-Core CPUs
- 2-2.5x faster than Intel high-end processors
- Single-GPU performance comparable to a high-end dual-core Athlon
- Slashdot and Tom's Hardware Guide headlines, June 2005

GPU Cache Model (N. Govindaraju, S. Larsen, J. Gray, and D. Manocha, Proc. of ACM SuperComputing, 2006)
- Small data caches better hide the memory latency
- Vendors do not disclose cache information, which is critical for scientific computing on GPUs
- We design a simple model to determine cache parameters (block and cache sizes)
- Improves sorting, FFT and SGEMM performance
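The flavour of such a model can be sketched with a tiny direct-mapped cache simulator: sweeping the stride of a scan and watching where the miss count jumps reveals the block size (all sizes below are illustrative, not the GPU's real parameters):

```python
def strided_misses(n_accesses, stride, block_size, num_blocks):
    """Count misses for a strided scan in a direct-mapped cache model."""
    tags = [None] * num_blocks
    misses = 0
    for k in range(n_accesses):
        block = (k * stride) // block_size   # which memory block is touched
        slot = block % num_blocks            # direct-mapped placement
        if tags[slot] != block:
            tags[slot] = block
            misses += 1
    return misses
```

With block_size = 8, a stride-1 scan misses once per 8 addresses, while a stride-8 scan misses on every access; the stride at which the miss rate saturates exposes the block size.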

Cache Evictions
[Diagram: cache eviction behaviour]
