Accelerating Molecular Modeling Applications with GPU Computing


Presentation Transcript

Slide 1

Accelerating Molecular Modeling Applications with GPU Computing. John Stone, Theoretical and Computational Biophysics Group, Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign. http://www.ks.uiuc.edu/Research/gpu/ Supercomputing 2009, Portland, OR, November 18, 2009

Slide 2

VMD – "Visual Molecular Dynamics": Visualization and analysis of molecular dynamics simulations, sequence data, volumetric data, quantum chemistry simulations, particle systems, … User extensible with scripting and plugins. http://www.ks.uiuc.edu/Research/vmd/

Slide 3

Range of VMD Usage Scenarios: Users run VMD on a diverse range of hardware: laptops, desktops, clusters, and supercomputers. VMD is typically used as a desktop application for interactive 3D molecular visualization and analysis, but it can also be run in pure text mode for numerically intensive analysis tasks, batch-mode movie rendering, and so on. GPU acceleration provides an opportunity to make some slow batch-mode calculations fast enough to run interactively, or on demand.

Slide 4

CUDA Acceleration in VMD: Electrostatic field calculation and ion placement, 20x to 44x faster. Molecular orbital calculation and display, 100x to 120x faster. Imaging of gas migration pathways in proteins with implicit ligand sampling, 20x to 30x faster.

Slide 5

Electrostatic Potential Maps: Electrostatic potentials are evaluated on a 3-D lattice. Applications include ion placement for structure building, time-averaged potentials for simulation, and visualization and analysis. Example shown: isoleucine tRNA synthetase.
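The slide does not include code; what follows is a minimal CUDA sketch of the kind of direct Coulomb summation such a potential map entails, processing one 2-D plane of the lattice per kernel launch. All names (cenergy, atominfo, MAXATOMS) and the data layout are illustrative assumptions, not VMD's actual kernel.

  #define MAXATOMS 4000

  // Atom data (x, y, z, charge) broadcast to all threads via constant memory.
  __constant__ float4 atominfo[MAXATOMS];

  // Each thread computes the potential at one lattice point of one 2-D plane.
  __global__ void cenergy(int numatoms, float gridspacing, float zplane,
                          float *energygrid) {
    unsigned int xindex = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int yindex = blockIdx.y * blockDim.y + threadIdx.y;
    unsigned int outaddr = gridDim.x * blockDim.x * yindex + xindex;
    float coorx = gridspacing * xindex;
    float coory = gridspacing * yindex;
    float energy = 0.0f;
    for (int n = 0; n < numatoms; n++) {      // sum over all atoms
      float dx = coorx - atominfo[n].x;
      float dy = coory - atominfo[n].y;
      float dz = zplane - atominfo[n].z;
      // rsqrtf computes 1/sqrt; atominfo[n].w holds the charge
      energy += atominfo[n].w * rsqrtf(dx*dx + dy*dy + dz*dz);
    }
    energygrid[outaddr] += energy;
  }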

Slide 6

Multilevel Summation Main Ideas: Split the 1/r potential into a short-range cutoff part plus smoothed parts that are successively more slowly varying. All but the top-level potential are cut off. Smoothed potentials are interpolated from successively coarser lattices. The finest lattice spacing h and the smallest cutoff distance a are doubled at each successive level. [Figure: splitting of the 1/r potential into atom-level, h-lattice, and 2h-lattice contributions with cutoffs a and 2a, and interpolation of the smoothed potentials.]
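The slide states the splitting idea without a formula. In one standard formulation of multilevel summation (following Hardy and Skeel's work, on which this method is based), the splitting over L levels can be written as

$$\frac{1}{r} \;=\; \underbrace{\frac{1}{r} - \frac{1}{a}\,\gamma\!\Big(\frac{r}{a}\Big)}_{\text{short-range, cut off at } a} \;+\; \underbrace{\frac{1}{a}\,\gamma\!\Big(\frac{r}{a}\Big) - \frac{1}{2a}\,\gamma\!\Big(\frac{r}{2a}\Big)}_{\text{smoothed, cut off at } 2a} \;+\;\cdots\;+\; \underbrace{\frac{1}{2^{L-1}a}\,\gamma\!\Big(\frac{r}{2^{L-1}a}\Big)}_{\text{top level, not cut off}}$$

where $\gamma$ is a smoothing function with $\gamma(\rho) = 1/\rho$ for $\rho \ge 1$, so every term except the last vanishes beyond its cutoff, exactly as the slide describes.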

Slide 7

Multilevel Summation Calculation: the potential map is the sum of exact short-range interactions and interpolated long-range interactions. [Diagram: computational steps. Atom charges are anterpolated onto the h-lattice; restriction carries charges up through the 2h- and 4h-lattices; a cutoff summation is performed on each lattice level, with the long-range parts on the coarsest lattice not cut off; prolongation carries potentials back down through the lattices; interpolation plus the short-range cutoff part yields the final map.]
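To make the ordering of the diagram's steps concrete, here is a schematic C sketch of the computation. The function names and the top-level driver are illustrative placeholders (with empty stub bodies), not an actual VMD/NAMD API; the per-level ordering is my reading of the diagram.

  /* Illustrative stubs for the steps named in the diagram. */
  static void anterpolate(void)        { /* atom charges -> h-lattice */ }
  static void restrict_level(int k)    { (void)k; /* level k -> level k+1 */ }
  static void lattice_cutoff(int k)    { (void)k; /* cutoff sum on level k */ }
  static void top_level_sum(void)      { /* coarsest lattice: no cutoff */ }
  static void prolongate(int k)        { (void)k; /* coarse -> fine, accumulate */ }
  static void interpolate(void)        { /* h-lattice -> atom positions */ }
  static void short_range_cutoff(void) { /* exact pairs within distance a */ }

  void multilevel_summation(int toplevel) {
    anterpolate();
    for (int k = 0; k < toplevel; k++) restrict_level(k);
    for (int k = 0; k < toplevel; k++) lattice_cutoff(k);
    top_level_sum();
    for (int k = toplevel; k > 0; k--) prolongate(k);
    interpolate();
    short_range_cutoff();
  }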

Slide 8

Multilevel Summation on the GPU: Accelerate the short-range cutoff and lattice cutoff parts. Performance profile for a 0.5 Å map of the potential for 1.5 M atoms. Hardware platform: Intel QX6700 CPU and NVIDIA GTX 280.

Slide 9

Photobiology of Vision and Photosynthesis: Investigations of the chromatophore, a photosynthetic organelle. Partial model: ~10M atoms. Electrostatics are needed to build the full structural model, place ions, and study macroscopic properties. The electrostatic field of the chromatophore model was computed with the multilevel summation method on 3 GPUs (G80) in ~90 seconds, 46x faster than a single CPU core. The full chromatophore model will permit structural, chemical, and kinetic investigations at a whole-systems biology level.

Slide 10

Computing Molecular Orbitals: Visualization of MOs aids in understanding the chemistry of molecular systems. The MO spatial distribution is correlated with the electron probability density. Calculation of high-resolution MO grids can require tens to hundreds of seconds on CPUs; a >100x speedup allows interactive animation of MOs at 10 FPS. Example shown: C60.
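For reference (not spelled out on the slide), the quantity being computed at each lattice point is the MO amplitude: a weighted sum of basis functions, each a contracted Gaussian-type orbital (GTO) multiplied by an angular momentum factor. In roughly standard quantum chemistry notation:

$$\Psi_i(\mathbf{r}) \;=\; \sum_{j} c_{ij}\,\Phi_j(\mathbf{r}), \qquad \Phi_j(\mathbf{r}) \;=\; x^{l}\,y^{m}\,z^{n} \sum_{p} c_{p}\, e^{-\zeta_{p} R^{2}}$$

where $(x, y, z)$ and $R$ are measured from the atom on which the basis function is centered, and $l + m + n$ determines the shell type (S, P, D, …). The kernels on the following slides evaluate exactly this: a loop over atoms, shells, primitives, and angular momenta.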

Slide 11

Molecular Orbital Computation and Display Process. One-time initialization: read the QM simulation log file and trajectory; initialize a pool of GPU worker threads; preprocess the MO coefficient data (eliminate duplicates, sort by type, and so on). Then, for each trajectory frame and each MO displayed: retrieve the MO wavefunction coefficients for the current frame and MO index; compute the 3-D grid of MO wavefunction amplitudes (the most performance-demanding step, run on the GPU); extract an isosurface mesh from the 3-D MO grid; apply user coloring/texturing and render the resulting surface. A skeleton of this loop is sketched below.
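A compilable C skeleton of the pipeline just described, with empty stubs standing in for each stage; the function names are placeholders, not VMD's internal API.

  #include <stdio.h>

  /* Illustrative stubs for the pipeline stages named on the slide. */
  static void read_qm_log_and_trajectory(void)    { /* one-time init */ }
  static void init_gpu_worker_pool(void)          { /* one-time init */ }
  static void preprocess_coefficients(void)       { /* dedup, sort by type */ }
  static void fetch_mo_coefficients(int f, int m) { (void)f; (void)m; }
  static void compute_mo_grid_on_gpu(void)        { /* dominant cost */ }
  static void extract_isosurface(void)            { /* e.g. marching cubes */ }
  static void color_and_render(void)              { }

  int main(void) {
    int numframes = 100, nmos = 1;           /* hypothetical sizes */
    read_qm_log_and_trajectory();
    init_gpu_worker_pool();
    preprocess_coefficients();
    for (int f = 0; f < numframes; f++)      /* every trajectory frame */
      for (int m = 0; m < nmos; m++) {       /* every MO displayed */
        fetch_mo_coefficients(f, m);
        compute_mo_grid_on_gpu();            /* 3-D grid of amplitudes */
        extract_isosurface();
        color_and_render();
      }
    return 0;
  }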

Slide 12

CUDA Block/Grid Decomposition: The MO 3-D map decomposes into 2-D slices (CUDA grids), each covered by a grid of thread blocks (indexed 0,0; 0,1; …; 1,0; 1,1; …). Small 8x8 thread blocks afford a large per-thread register count and shared memory. Threads compute one MO lattice point each. Padding optimizes global memory performance by guaranteeing coalescing. A launch-configuration sketch follows.
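A hedged CUDA sketch of this decomposition: one kernel launch per 2-D slice, 8x8 thread blocks, and slice dimensions padded up to a multiple of the block size so rows start on coalescing boundaries. The names (mo_kernel, BLOCKSZX/Y) and the padding arithmetic are illustrative assumptions, not VMD's exact code.

  #define BLOCKSZX 8
  #define BLOCKSZY 8

  __global__ void mo_kernel(int zplane, float *slice) {
    // Placeholder body: a real kernel would evaluate the MO amplitude at
    // this thread's lattice point (see the snippets on Slides 13-15).
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    slice[(size_t)y * gridDim.x * blockDim.x + x] = 0.0f;  // stub value
    (void)zplane;
  }

  void launch_mo_slices(int nx, int ny, int nz, float *d_grid) {
    // Round each slice dimension up to a multiple of the block size
    // (valid because the block sizes are powers of two).
    int padded_nx = (nx + BLOCKSZX - 1) & ~(BLOCKSZX - 1);
    int padded_ny = (ny + BLOCKSZY - 1) & ~(BLOCKSZY - 1);
    dim3 block(BLOCKSZX, BLOCKSZY, 1);  // 64 threads: large register budget each
    dim3 grid(padded_nx / BLOCKSZX, padded_ny / BLOCKSZY, 1);
    for (int z = 0; z < nz; z++)        // one CUDA grid launch per 2-D slice
      mo_kernel<<<grid, block>>>(z, d_grid + (size_t)z * padded_nx * padded_ny);
  }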

Slide 13

MO Kernel for One Grid Point (Naive C)

  // Loop over atoms
  for (at=0; at<numatoms; at++) {
    int prim_counter = atom_basis[at];
    calc_distances_to_atom(&atompos[at], &xdist, &ydist, &zdist, &dist2, &xdiv);
    // Loop over shells
    for (contracted_gto=0.0f, shell=0; shell < num_shells_per_atom[at]; shell++) {
      int shell_type = shell_symmetry[shell_counter];
      // Loop over primitives: largest component of runtime, due to expf()
      for (prim=0; prim < num_prim_per_shell[shell_counter]; prim++) {
        float exponent       = basis_array[prim_counter    ];
        float contract_coeff = basis_array[prim_counter + 1];
        contracted_gto += contract_coeff * expf(-exponent*dist2);
        prim_counter += 2;
      }
      // Loop over angular momenta (unrolled in real code)
      for (tmpshell=0.0f, j=0, zdp=1.0f; j<=shell_type; j++, zdp*=zdist) {
        int imax = shell_type - j;
        for (i=0, ydp=1.0f, xdp=pow(xdist, imax); i<=imax; i++, ydp*=ydist, xdp*=xdiv)
          tmpshell += wave_f[ifunc++] * xdp * ydp * zdp;
      }
      value += tmpshell * contracted_gto;
      shell_counter++;
    }
  }

Slide 14

MO GPU Kernel Snippet: Contracted GTO Loop, Use of Constant Memory

  [… outer loop over atoms …]
  float dist2 = xdist2 + ydist2 + zdist2;
  // Loop over the shells belonging to this atom (or basis function)
  for (shell=0; shell < maxshell; shell++) {
    float contracted_gto = 0.0f;
    // Loop over the Gaussian primitives of this contracted basis
    // function to build the atomic orbital
    int maxprim = const_num_prim_per_shell[shell_counter];
    int shelltype = const_shell_types[shell_counter];
    for (prim=0; prim < maxprim; prim++) {
      float exponent       = const_basis_array[prim_counter    ];
      float contract_coeff = const_basis_array[prim_counter + 1];
      contracted_gto += contract_coeff * __expf(-exponent*dist2);
      prim_counter += 2;
    }
    [… continue on to angular momenta loop …]

Constant memory: nearly register speed when array elements are accessed in unison by all peer threads.
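The snippet above assumes the basis-set data has already been placed in constant memory. A minimal host-side sketch of that setup, using cudaMemcpyToSymbol; the array names and sizes are illustrative assumptions, and the 64 kB limit is why the data must first be shrunk as described on Slide 16.

  #define MAX_CONST_FLOATS 4096
  #define MAX_CONST_SHELLS 1024

  __constant__ float const_basis_array[MAX_CONST_FLOATS];
  __constant__ int   const_num_prim_per_shell[MAX_CONST_SHELLS];
  __constant__ int   const_shell_types[MAX_CONST_SHELLS];

  int copy_basis_to_const(const float *basis, int nbasis,
                          const int *nprim, const int *types, int nshells) {
    if (nbasis > MAX_CONST_FLOATS || nshells > MAX_CONST_SHELLS)
      return -1;  // data too large: fall back to the shared-memory tiling path
    cudaMemcpyToSymbol(const_basis_array, basis, nbasis * sizeof(float));
    cudaMemcpyToSymbol(const_num_prim_per_shell, nprim, nshells * sizeof(int));
    cudaMemcpyToSymbol(const_shell_types, types, nshells * sizeof(int));
    return 0;
  }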

Slide 15

MO GPU Kernel Snippet: Unrolled Angular Momenta Loop. Loop unrolling: saves registers (important for GPUs!), reduces loop control overhead, and increases arithmetic intensity.

  /* multiply with the appropriate wavefunction coefficient */
  float tmpshell=0;
  switch (shelltype) {
    case S_SHELL:
      value += const_wave_f[ifunc++] * contracted_gto;
      break;
    [… P_SHELL case …]
    case D_SHELL:
      tmpshell += const_wave_f[ifunc++] * xdist2;
      tmpshell += const_wave_f[ifunc++] * xdist * ydist;
      tmpshell += const_wave_f[ifunc++] * ydist2;
      tmpshell += const_wave_f[ifunc++] * xdist * zdist;
      tmpshell += const_wave_f[ifunc++] * ydist * zdist;
      tmpshell += const_wave_f[ifunc++] * zdist2;
      value += tmpshell * contracted_gto;
      break;
    [… other cases: F_SHELL, G_SHELL, etc. …]
  } // end switch

Slide 16

Preprocessing of Atoms, Basis Set, and Wavefunction Coefficients: Must make effective use of the high-bandwidth, low-latency GPU on-chip memory, or the CPU cache. The overall storage requirement is reduced by eliminating duplicate basis set coefficients. Sorting atoms by element type allows re-use of basis set coefficients for subsequent atoms of identical type. Padding and alignment of arrays guarantees coalesced GPU global memory accesses and CPU SSE loads. A sketch of the sort-and-deduplicate step follows.
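A minimal C sketch of the idea, assuming one shared copy of coefficients per element type; the Atom struct and all names here are hypothetical, not VMD's data structures.

  #include <stdlib.h>

  /* Atoms are sorted by element type so consecutive atoms of the same
     type index a single shared copy of that type's basis coefficients. */
  typedef struct { float x, y, z; int type; } Atom;

  static int cmp_atom_type(const void *a, const void *b) {
    return ((const Atom *)a)->type - ((const Atom *)b)->type;
  }

  void preprocess_atoms(Atom *atoms, int numatoms,
                        int *atom_basis, const int *basis_offset_for_type) {
    qsort(atoms, numatoms, sizeof(Atom), cmp_atom_type);
    /* After sorting, every atom of a given type points at the same
       (single) copy of its coefficients, eliminating duplicates. */
    for (int i = 0; i < numatoms; i++)
      atom_basis[i] = basis_offset_for_type[atoms[i].type];
  }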

Slide 17

GPU Traversal of Atom Type, Basis Set, Shell Type, and Wavefunction Coefficients: The atom type, basis set, and shell type data are constant for all MOs and all timesteps, with monotonically increasing memory references; loop iterations always access the same or consecutive array elements for all threads in a thread block, which yields good constant memory cache performance and increases shared memory tile reuse. The wavefunction coefficients differ at each timestep and for each MO, with strictly sequential memory references.

Slide 18

Use of GPU On-chip Memory: If the total data is under 64 kB, use only constant memory: it broadcasts data to all threads, with no global memory accesses! For large data, shared memory is used as a program-managed cache, with coefficients loaded on demand: tiles are sized large enough to service entire inner loop runs, and broadcast to all 64 threads in a block. Complications: nested loops, multiple arrays, varying lengths. The key to performance is to locate the tile loading checks outside of the two performance-critical inner loops. This approach is only 27% slower than the hardware caching provided by constant memory (GT200). Next-generation "Fermi" GPUs will provide larger on-chip shared memory, L1/L2 caches, and reduced control overhead.

Slide 19

[Figure: an array tile loaded into GPU shared memory from the coefficient array in GPU global memory. The tile size is a power of two, a multiple of the coalescing size, and allows simple indexing in inner loops (array indices are merely offset for reference within the loaded tile). Full tile padding extends the tile to 64-byte memory coalescing block boundaries; the surrounding data is unreferenced by the next batch of loop iterations.]

Slide 20

MO GPU Kernel Snippet: Loading Tiles Into Shared Memory On-Demand

  [… outer loop over atoms …]
  if ((prim_counter + (maxprim<<1)) >= SHAREDSIZE) {
    prim_counter += sblock_prim_counter;
    sblock_prim_counter = prim_counter & MEMCOAMASK;
    s_basis_array[sidx      ] = basis_array[sblock_prim_counter + sidx      ];
    s_basis_array[sidx +  64] = basis_array[sblock_prim_counter + sidx +  64];
    s_basis_array[sidx + 128] = basis_array[sblock_prim_counter + sidx + 128];
    [… remainder of the snippet is truncated in the transcript …]
