S04: High Performance Computing with CUDA Case Study: Molecular Modeling Applications

NIH Resource for Macromolecular Modeling and Bioinformatics — GPU acceleration of cutoff pair potentials for molecular modeling applications.

Presentation Transcript

Slide 1

S04: High Performance Computing with CUDA Case Study: Molecular Modeling Applications John E. Stone Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of Illinois at Urbana-Champaign http://www.ks.uiuc.edu/Research/gpu/Tutorial S04, Supercomputing 2009, Portland, OR, Nov 15, 2009

Slide 2

VMD – "Visual Molecular Dynamics" Visualization and analysis of molecular dynamics simulations, sequence data, volumetric data, quantum chemistry simulations, particle systems, … User extensible with scripting and plugins http://www.ks.uiuc.edu/Research/vmd/

Slide 3

Case Study Topics: VMD – molecular visualization + analysis NAMD – molecular dynamics simulation See our GPU article in the CACM issue included in your SC2009 registration goodie bag… See live demos in the NVIDIA booth Klaus Schulten Masterworks Lecture: "Fighting Swine Flu through Computational Medicine" Wednesday, 04:15PM - 05:00PM Room PB253-254 Includes GPU performance results + scientific applications

Slide 4

CUDA Acceleration in VMD Electrostatic field calculation, ion placement: 20x to 44x faster Molecular orbital calculation and display: 100x to 120x faster Imaging of gas migration pathways in proteins with implicit ligand sampling: 20x to 30x faster GPU: massively parallel co-processor

Slide 5

Recurring Algorithm Design Principles Pre-processing and sorting of operands to organize the computation for peak efficiency on the GPU Tiled/blocked data structures in GPU global memory for peak bandwidth utilization Extensive use of on-chip shared memory and constant memory to further improve effective memory bandwidth Use of the CPU to "regularize" the work done by the GPU, handle exceptions & unusual work units Asynchronous operation of CPU/GPU enabling overlap of computation and I/O on both ends

Slide 6

Electrostatic Potential Maps Electrostatic potentials evaluated on a 3-D lattice: Applications include: Ion placement for structure building Time-averaged potentials for simulation Visualization and analysis Isoleucine tRNA synthetase

Slide 7

Infinite vs. Cutoff Potentials Infinite range potential: All atoms contribute to all lattice points Quadratic time complexity Cutoff (range-limited) potential: Atoms contribute within cutoff distance to lattice points, resulting in linear time complexity Used for fast-decaying interactions (e.g. Lennard-Jones, Buckingham) Fast full electrostatics: Replace electrostatic potential with shifted form Combine short-range part with long-range approximation Multilevel summation method (MSM), linear time complexity

Slide 8

Short-range Cutoff Summation Each lattice point accumulates the electrostatic potential contribution from atoms within the cutoff distance: if (r_ij < cutoff) potential[j] += (charge[i] / r_ij) * s(r_ij) Smoothing function s(r) is algorithm-dependent Cutoff radius r_ij: distance from lattice[j] to atom[i] Lattice point j being evaluated atom[i]

Slide 9

Cutoff Summation on the GPU Atoms are spatially hashed into fixed-size bins CPU handles overflowed bins (GPU kernel can be very aggressive) GPU thread block computes the corresponding region of the potential map Bin/region neighbor checks costly; solved with universal table look-up Each thread block cooperatively loads atom bins from the surrounding neighborhood into shared memory for evaluation Shared memory atom bin Global memory Constant memory Offsets for bin neighborhood Potential map regions Look-up table encodes "logic" of spatial geometry Bins of atoms

Slide 10

GPU cutoff with CPU overlap: 17x-21x faster than CPU core If the asynchronous stream blocks due to queue filling, performance will degrade from peak… Cutoff Summation Performance GPU acceleration of cutoff pair potentials for molecular modeling applications. C. Rodrigues, D. Hardy, J. Stone, K. Schulten, W. Hwu. Proceedings of the 2008 Conference On Computing Frontiers, pp. 273-282, 2008.

Slide 11

Cutoff Summation Observations Use of the CPU to handle overflowed bins is very effective, overlaps completely with GPU work Caveat: Overfilling the stream queue can trigger blocking behavior. Recent drivers queue >100 operations before blocking. Higher precision: Compensated summation (all GPUs) or double precision (GT200 only) is only a ~10% performance penalty vs. single-precision arithmetic Next-gen "Fermi" GPUs will have an even lower performance cost for double precision math

Slide 12

Multilevel Summation Calculation exact short-range interactions interpolated long-range interactions Computational Steps: atom charges anterpolation h-lattice cutoff restriction 2h-lattice cutoff restriction 4h-lattice long-range parts prolongation prolongation interpolation short-range cutoff potential map

Slide 13

Multilevel Summation on the GPU Accelerate short-range cutoff and lattice cutoff parts Performance profile for a 0.5 Å map of the potential for 1.5 M atoms. Hardware platform is Intel QX6700 CPU and NVIDIA GTX 280.

Slide 14

Photobiology of Vision and Photosynthesis Investigations of the chromatophore, a photosynthetic organelle Light Partial model: ~10M atoms Electrostatics needed to build full structural model, place ions, study macroscopic properties Electrostatic field of chromatophore model from the multilevel summation method: computed with 3 GPUs (G80) in ~90 seconds, 46x faster than a single CPU core Full chromatophore model will permit structural, chemical and dynamic investigations at a structural systems biology level

Slide 15

Computing Molecular Orbitals Visualization of MOs aids in understanding the chemistry of molecular systems MO spatial distribution is correlated with electron probability density Calculation of high-resolution MO grids can require tens to hundreds of seconds on CPUs >100x speedup allows interactive animation of MOs @ 10 FPS C 60

Slide 16

Molecular Orbital Computation and Display Process One-time initialization Read QM simulation log file, trajectory Initialize pool of GPU worker threads Preprocess MO coefficient data eliminate duplicates, sort by type, etc… For each trajectory frame, for each MO shown For current frame and MO index, retrieve MO wavefunction coefficients Compute 3-D lattice of MO wavefunction amplitudes Most performance-demanding step, runs on GPU… Extract isosurface mesh from the 3-D MO lattice Apply user coloring/texturing and render the resulting surface

Slide 17

CUDA Block/Grid Decomposition MO 3-D lattice decomposes into 2-D slices (CUDA grids) Grid of thread blocks: 0,0 0,1 … 1,0 1,1 … Small 8x8 thread blocks afford a large per-thread register count and shared memory Threads compute one MO lattice point each … Padding optimizes global memory performance, guaranteeing coalescing

Slide 18

MO Kernel for One Grid Point (Naive C)

Loop over atoms:

for (at=0; at<numatoms; at++) {
  int prim_counter = atom_basis[at];
  calc_distances_to_atom(&atompos[at], &xdist, &ydist, &zdist, &dist2, &xdiv);
  for (contracted_gto=0.0f, shell=0; shell < num_shells_per_atom[at]; shell++) {
    int shell_type = shell_symmetry[shell_counter];
    for (prim=0; prim < num_prim_per_shell[shell_counter]; prim++) {
      float exponent       = basis_array[prim_counter    ];
      float contract_coeff = basis_array[prim_counter + 1];
      contracted_gto += contract_coeff * expf(-exponent*dist2);
      prim_counter += 2;
    }
    for (tmpshell=0.0f, j=0, zdp=1.0f; j<=shell_type; j++, zdp*=zdist) {
      int imax = shell_type - j;
      for (i=0, ydp=1.0f, xdp=pow(xdist, imax); i<=imax; i++, ydp*=ydist, xdp*=xdiv)
        tmpshell += wave_f[ifunc++] * xdp * ydp * zdp;
    }
    value += tmpshell * contracted_gto;
    shell_counter++;
  }
}

Loop over shells Loop over primitives: largest fraction of runtime, due to expf() Loop over angular momenta (unrolled in real code)

Slide 19

MO GPU Kernel Snippet: Contracted GTO Loop, Use of Constant Memory

[… outer loop over atoms …]
float dist2 = xdist2 + ydist2 + zdist2;
// Loop over the shells belonging to this atom (or basis function)
for (shell=0; shell < maxshell; shell++) {
  float contracted_gto = 0.0f;
  // Loop over the Gaussian primitives of this contracted basis function to build the atomic orbital
  int maxprim = const_num_prim_per_shell[shell_counter];
  int shelltype = const_shell_types[shell_counter];
  for (prim=0; prim < maxprim; prim++) {
    float exponent       = const_basis_array[prim_counter    ];
    float contract_coeff = const_basis_array[prim_counter + 1];
    contracted_gto += contract_coeff * __expf(-exponent*dist2);
    prim_counter += 2;
  }
  [… continue to angular momenta loop …]

Constant memory: nearly register-speed when array elements are accessed in unison by all peer threads…

Slide 20

MO GPU Kernel Snippet: Unrolled Angular Momenta Loop

Loop unrolling: Saves registers (important for GPUs!) Reduces loop control overhead Increases arithmetic intensity

/* multiply with the appropriate wavefunction coefficient */
float tmpshell=0;
switch (shelltype) {
  case S_SHELL:
    value += const_wave_f[ifunc++] * contracted_gto;
    break;
  [… P_SHELL case …]
  case D_SHELL:
    tmpshell += const_wave_f[ifunc++] * xdist2;
    tmpshell += const_wave_f[ifunc++] * xdist * ydist;
    tmpshell += const_wave_f[ifunc++] * ydist2;
    tmpshell += const_wave_f[ifunc++] * xdist * zdist;
    tmpshell += const_wave_f[ifunc++] * ydist * zdist;
    tmpshell += const_wave_f[ifunc++] * zdist2;
    value += tmpshell * contracted_gto;
    break;
  [... Other cases: F_SHELL, G_SHELL, etc. …]
} // end switch

Slide 21

Preprocessing of Atoms, Basi