11.2. Parallelization

11.2.1. Introduction

There are numerous solutions to address the issue of high computational requirements. Generally, the solution is to use supercomputers, clusters, or grid systems. These systems are built by connecting numerous processors, either by some sort of direct link or by a network connection. Several computing centres can be connected to each other via the Internet, thus creating grids. Managing and programming these systems requires specific software environments and tools.

11.2.2. Supercomputers, clusters, and Grids

Parallel computers can be roughly classified according to the level at which the hardware supports parallelism. This classification is broadly analogous to the distance between basic computing nodes.

Multicore computing:

A multicore processor is a processor that includes multiple execution units ("cores") on the same chip. These processors differ from superscalar processors, which can issue multiple instructions per cycle from one instruction stream (thread); in contrast, a multicore processor can issue multiple instructions per cycle from multiple instruction streams. Each core in a multicore processor can potentially be superscalar as well - that is, on every cycle, each core can issue multiple instructions from one instruction stream.

Supercomputers (Symmetric multiprocessing):

A symmetric multiprocessor (SMP) is a computer system with multiple identical processors that share memory and connect by a local high-speed computer bus.

Distributed computing:

A distributed computer is a distributed memory computer system in which the processing elements are connected by a network. Distributed computers are highly scalable.

1. Cluster computing:

A cluster is a group of loosely coupled computers that work together closely, so that in some respects they can be regarded as a single computer. Clusters are composed of multiple standalone machines connected by a network. While machines in a cluster do not have to be symmetric, load balancing is more difficult if they are not.

2. Grid computing:

Grid computing is the most distributed form of parallel computing: a distributed system consists of multiple computers that communicate through a computer network and interact with each other in order to achieve a common goal. A computer program that runs in a distributed system is called a distributed program, and distributed programming is the process of writing such programs.

In order to parallelize the sequential code of a reaction-diffusion simulation, the domain decomposition concept should be followed: the two-dimensional grid is partitioned along the x direction, so the domain is decomposed into columns. These two-dimensional subdomains can then be mapped onto a one-dimensional logical grid of processes. An equal partition of subdomains among the processes gives a well-balanced load during the solution of the reaction-diffusion equations. During the calculation of the diffusion of the chemical species, communication is required to exchange the boundary concentrations between nearest-neighbour subdomains (Figure 11.16).

Figure 11.16: A typical performance result of a diffusion problem on Grid.
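
The boundary exchange described above can be sketched with MPI-style message passing. The following minimal host-code fragment is our illustration only (the function, names, and data layout are assumptions, not the original code): each process owns one column-strip of the 2D grid and swaps its boundary columns with its nearest neighbours.

    #include <mpi.h>

    /* Hypothetical sketch of the nearest-neighbour boundary exchange:
       own_left/own_right are this process's boundary columns, and
       halo_left/halo_right receive the neighbours' boundary columns. */
    void exchange_boundaries(double *own_left, double *own_right,
                             double *halo_left, double *halo_right,
                             int ny, int rank, int nprocs)
    {
        MPI_Status st;
        int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

        /* send own boundary columns, receive the neighbours' into halos */
        MPI_Sendrecv(own_left,   ny, MPI_DOUBLE, left,  0,
                     halo_right, ny, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, &st);
        MPI_Sendrecv(own_right,  ny, MPI_DOUBLE, right, 1,
                     halo_left,  ny, MPI_DOUBLE, left,  1,
                     MPI_COMM_WORLD, &st);
    }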

11.2.3. GPU – Graphical Processing Unit

In the past five years, the technological development of consumer graphics hardware has created the possibility of using desktop video cards to solve numerically intensive problems, since their computational capacity far exceeds that of desktop CPUs. Using GPUs (the processors of video cards) for general-purpose calculations is called GPGPU. Its main advantage is high cost-effectiveness: compared to former solutions, there are virtually no extra costs to operate these systems, such as high-powered air conditioning, high electric power consumption, or the need for a large space (building). Traditionally, GPU programming was achieved using library functions designed for 3D computer graphics (games, CAD), and due to this constraint, programming such applications was difficult. Recently, NVIDIA addressed this problem by creating a new parallel computing model called CUDA, which is an extension of the well-known C language. A compiler, a software development kit with utilities and numerous examples, and complete documentation of the architecture and the language extensions are available.

In this chapter we present an efficient parallelization of the stochastic Lagrangian particle model using CUDA. In this model, each particle can be handled independently, making it a perfect candidate for parallelization.

11.2.3.1. Lagrangian model

In this model type, as discussed earlier, individual particles are released from a point source in each time step. Each particle represents a given mass or activity; particles are moved in 3D space by advection (wind field) and turbulent diffusion, they can be transformed into other species (by radioactive decay), and they can be removed from the atmosphere via dry and wet deposition processes to the surface. This means that the model is continuous in space and discrete in time. The advection field is assumed to be deterministic, while the effect of the turbulent diffusion is always stochastic. The new spatial position of a particle can be calculated with the following equations (Molnár et al., 2010):

x_i(t + Δt) = x_i(t) + u_i Δt + x_{rand,i},

(11.3)

where x_i(t + Δt) and x_i(t) are the spatial coordinates of the particle after and before the time step, respectively, u_i is the ith coordinate of the wind velocity vector, and Δt is the time step (10 s in the model simulation). The last term in equation (11.3), x_{rand,i}, describes the effect of stochastic turbulent processes, and i denotes the spatial dimension (i = 1, 2, 3). The stochastic term is calculated by

x_{rand,i} = rand · √(2 K_i Δt).

(11.4)

Random numbers (rand) with a normal distribution (mean 0.0 and variance 1.0) were generated using the Mersenne Twister random number generator and a Cartesian Box–Muller transformation. K_i is the turbulent diffusion coefficient in each direction. For horizontal dispersion a constant value was used (K_h). The vertical diffusion coefficient (K_z) depends on the height, and it can be calculated by the Monin–Obukhov theory.
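
As a minimal sketch (the function and all names are ours, not taken from the original implementation), equations (11.3) and (11.4) translate into a CUDA device function such as the following, assuming the normally distributed random numbers come from a pre-filled buffer as described in section 11.2.3.2:

    __device__ void move_particle(float pos[3],         /* particle position x, y, z   */
                                  const float wind[3],  /* u_i at the particle position */
                                  const float K[3],     /* K_h, K_h, K_z               */
                                  float dt,             /* time step (10 s)            */
                                  const float randn[3]) /* N(0,1) random numbers       */
    {
        for (int i = 0; i < 3; ++i) {
            float x_rand = randn[i] * sqrtf(2.0f * K[i] * dt); /* eq. (11.4) */
            pos[i] += wind[i] * dt + x_rand;                   /* eq. (11.3) */
        }
    }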

Released particles can decay (radioactive decay) and deposit via dry and wet pathways. The probability that an individual particle will decay or deposit during a time step can be estimated by the following relation:

P = 1 − exp(−k Δt),

(11.5)

where k is either the radioactive decay constant or the wet or dry deposition constant. These constants depend on the chemical nature of the toxic species. Dry deposition applies only to particles below the mixing layer height; wet deposition occurs only in case of precipitation.
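
In code, equation (11.5) can be applied as a simple stochastic test, sketched below with illustrative names: draw a uniform random number and deactivate the particle when it falls below the computed probability.

    __device__ bool decays_or_deposits(float k,  /* decay or deposition constant   */
                                       float dt, /* time step                      */
                                       float u)  /* uniform random number in [0,1) */
    {
        float p = 1.0f - expf(-k * dt); /* eq. (11.5): probability within dt  */
        return u < p;                   /* true: the particle decays/deposits */
    }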

11.2.3.2. Parallel application and implementation

The main concept of the CUDA parallel computing model is to operate with tens of thousands of lightweight threads, grouped into thread blocks. These threads execute the same function with different parameters. The function that contains the computations and runs in parallel in many instances is called the kernel. Threads in the same thread block can synchronize with each other by inserting synchronization points in the kernel, which must be reached by all threads in the block before execution continues. These threads can also share data during execution. This way, usually several hundred threads in the same block can work cooperatively. Threads of different thread blocks cannot be synchronized and should be considered to run independently.
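
A minimal, self-contained illustration of this execution model (all names and sizes are ours): every thread runs the same kernel with different indices, and __syncthreads() is a block-level synchronization point.

    #include <cuda_runtime.h>

    __global__ void scale(float *data, float factor, int n)
    {
        __shared__ float tile[256];                    /* visible to one block    */
        int i = blockIdx.x * blockDim.x + threadIdx.x; /* global thread index     */
        if (i < n) tile[threadIdx.x] = data[i];
        __syncthreads();                               /* whole block waits here  */
        if (i < n) data[i] = tile[threadIdx.x] * factor;
    }

    int main(void)
    {
        const int n = 1 << 20;
        float *d;
        cudaMalloc(&d, n * sizeof(float));
        scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);   /* many blocks, 256 threads each */
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }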

It is possible to use a small number of threads and/or small number of blocks to execute a kernel, however, it would be very inefficient. This would utilize only a fraction of the computing power of the GPU. Therefore, CUDA is the best suited to those problems that can be divided into many parts, which can be computed independently (in different blocks), and these should be further divided into smaller cooperating pieces (into threads).

There are several types of memory available in CUDA, designed for different uses in kernels. Proper use of these memories can increase computational performance. The global memory is essentially the random-access video memory available on the video card. It may be read or written at any time, at any location, by any of the threads, but to achieve high performance, access to global memory should be coalesced, meaning that the threads must follow a specific memory access pattern. For the complete (and hardware-revision-dependent) description, please refer to the CUDA Programming Guide. A kernel has access to two cached, read-only memories: the constant memory and the texture memory. Constant memory may be used to store constants that do not change during kernel execution. Texture memory may be used efficiently when threads access data with spatial locality in one, two, or three dimensions; it also provides built-in linear interpolation of the data. There is also a parallel data cache available for reading and writing by all the threads of a thread block, called the shared memory, which makes the cooperative work of threads in a block possible. It is divided into banks. Kernels should be written to avoid bank conflicts, meaning that threads which execute physically at the same time should access different banks. The Programming Guide presents the details of how to achieve this.
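
The following declarations sketch where each memory space appears in code (all symbols are ours; the texture-reference API shown is the one contemporary with this work, later replaced by texture objects):

    __constant__ float halfLife[16];                    /* constant memory: cached, read-only   */
    texture<float, 3, cudaReadModeElementType> windTex; /* texture memory: cached, interpolated */

    __global__ void process(float *g)                   /* g points into global memory */
    {
        __shared__ float buf[256];                      /* shared memory: per-block, banked */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        buf[threadIdx.x] = g[i];                        /* consecutive threads read consecutive
                                                           addresses: a coalesced access */
        __syncthreads();
        g[i] = buf[threadIdx.x] + halfLife[0];          /* cooperative work would happen here */
    }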

Memory management and kernel execution are controlled by CUDA library functions in the host code (the code which runs on the CPU). While kernels are executing on the device, the CPU continues to execute host code, so the CPU and GPU can work in parallel.

Implementing a single-threaded CPU version of the model is straightforward. Assuming that the emission profile (the number of emitted particles in every time step) for all species and the maximum number of particles released during the simulation are known, activities or masses can be assigned to the particles a priori. The maximum number of particles is the dominant variable affecting simulation time and precision; it is limited only by the available memory.

Figure 11.17: Flowchart of the stochastic Lagrangian model on CPU and GPU.

The main loop is the time evolution of the simulation (Figure 11.17). In every iteration the main steps for every particle are the same: interpolating (sampling) weather data in the x, y, z, and time dimensions using linear interpolation; calculating the turbulent diffusion coefficient; moving the particle by the wind, then by turbulent diffusion; and finally testing for dry and wet deposition. Particles may become inactive, meaning they are no longer moved by wind or turbulent diffusion, when they are deposited or when they reach the predefined boundaries of the simulated area. At the end of every nth time step, the activity of the particles is calculated based on the isotope's half-life and the time since the particle was emitted. Activities or masses of particles are summed on a rectangular grid for visualization and further statistical evaluation. From a technical point of view, the value of n should not be smaller than four; it should be set to the highest value that still achieves the desired precision of the time-integrated dosage calculation (a post-processing step on the simulation results).

In the parallelized CUDA version of the program, we utilize the various memory types available. Weather data are loaded into three-dimensional textures to utilize the hardware-implemented trilinear interpolation. Since the weather data are four-dimensional, an extra interpolation step is necessary. This fourth interpolation must be performed in the z dimension, because the vertical weather information is nonlinear, unlike in the x, y, and time dimensions. Using texture memory is also advantageous because the plume usually propagates in a specific direction, giving the spatial locality required for the texture cache to work efficiently. Physical constants of the isotopes are loaded into constant memory. Shared memory is used for caching the particle data (position, state information): the data are loaded from global memory to shared memory when a kernel starts, and written back at the end if they were modified.
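
One possible arrangement, shown below purely as our illustration (the original data layout may differ), keeps a 3D texture per vertical model level, spanning x, y, and time, and performs the fourth interpolation in software:

    texture<float, 3, cudaReadModeElementType> windLevelLo; /* level k:   (x, y, t) */
    texture<float, 3, cudaReadModeElementType> windLevelHi; /* level k+1: (x, y, t) */

    /* tex3D() provides hardware trilinear interpolation in x, y and time;
       the two bracketing vertical levels are then combined in software,
       because the model levels are non-uniformly spaced in z. */
    __device__ float sample_wind(float x, float y, float t,
                                 float z, float zLo, float zHi)
    {
        float lo = tex3D(windLevelLo, x, y, t);
        float hi = tex3D(windLevelHi, x, y, t);
        float w  = (z - zLo) / (zHi - zLo); /* software interpolation weight in z */
        return lo + w * (hi - lo);
    }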

The calculations in each time step are done by two kernel functions. The first step is generating random numbers using the Mersenne Twister random number generator; the implementation is provided by the CUDA SDK. A large buffer (on the GPU) is used to store the numbers, because the random number generator works more efficiently if a large quantity of numbers is requested at once. The second step is the main kernel. First, particle information and random numbers are loaded from global memory into shared memory. The second step is sampling the weather data, interpolating in the z dimension, and calculating the turbulent diffusion coefficient; these results are stored in shared memory. The next step is moving the particles by the wind field and turbulent diffusion, and testing for deposition. This step uses the interpolated weather prediction model outputs and the random numbers to calculate turbulent motion and stochastic deposition. The final step is writing the changed particle information back to global memory.
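
Putting the pieces together, the main kernel could look like the following compact sketch, reusing the hypothetical move_particle() and decays_or_deposits() functions from above. Weather sampling is collapsed into placeholder constants, and the shared-memory staging of particle data used in the original is omitted for brevity:

    __global__ void step_particles(float3 *pos, int *active,
                                   const float *randn,   /* N(0,1) buffer, 3 per particle */
                                   const float *uniform, /* U(0,1) buffer, 1 per particle */
                                   float k_dep, float dt, int n)
    {
        /* grid-stride loop: each thread handles several particles */
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x) {
            if (!active[i]) continue;

            /* weather sampling and the Monin-Obukhov K_z are omitted here;
               constant placeholders keep the sketch self-contained */
            float wind[3] = { 1.0f, 0.0f, 0.0f };
            float K[3]    = { 50.0f, 50.0f, 1.0f };

            float3 p = pos[i];                               /* read particle state */
            float xyz[3] = { p.x, p.y, p.z };
            move_particle(xyz, wind, K, dt, &randn[3 * i]);  /* eqs (11.3)-(11.4) */
            if (decays_or_deposits(k_dep, dt, uniform[i]))   /* eq (11.5) */
                active[i] = 0;
            pos[i] = make_float3(xyz[0], xyz[1], xyz[2]);    /* write back */
        }
    }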

Calculation and integration of activities on a rectangular grid is performed in the same way as in the sequential version, and it is calculated on the CPU while the GPU is processing the next time steps in parallel. It should be noted that this step requires extra memory transfers between device and host and should be done as rarely as possible.

The kernel is configured to execute in 32 blocks, with 256 threads in each block. The optimal number of threads in a block was found using the CUDA Occupancy Calculator, which is part of the CUDA SDK. Each thread processes multiple particles, since the number of particles may well exceed the number of threads.
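
With the grid-stride loop in the sketch above, this configuration is a one-line launch (buffer names are illustrative):

    /* 32 blocks x 256 threads; each thread strides through the particle array */
    step_particles<<<32, 256>>>(d_pos, d_active, d_randn, d_uniform,
                                k_dep, 10.0f, nParticles);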

The simulation is clearly limited by memory bandwidth, because relatively few arithmetic operations are needed to update a particle's information. Therefore, the most important rule for optimizing this simulation in CUDA is to follow the specific memory access pattern that achieves coalesced memory reads and writes. High multiprocessor occupancy (the number of threads executed physically in parallel, compared to the maximum supported number of threads) is also desirable to hide memory latency. The highest possible occupancy is 33% because of the high register requirement of the kernel; this is achieved by using 256 threads per block. Shared memory bank conflicts never arise, since particles never interact with each other.

We performed simulations with various particle numbers on a 2.33 GHz Core 2 Duo CPU and on two video cards: a GeForce 8800 GTS and a GeForce 8800 GTX. Figure 11.18 shows the actual speedup gained using the graphics processors. There is no technological difference between the two cards, only a difference in computing power and memory speed; the ratio of the speedups with the GTS and GTX cards represents the performance difference between them well. In both cases the speedup increases with the particle number. For optimal performance, at least half a million particles are required. If fewer particles are used, the GPU spends most of the time on initialization, synchronizing execution, etc., while the useful calculations take comparatively little time. In a real application, however, one should use as many particles as the memory on the video card allows, which should be several million particles.

The speedup achieved is the result of many factors. The memory bandwidth between the GPU and the video card's RAM is approximately one order of magnitude higher than between the CPU and the PC's RAM. During GPU calculation, access to physical constants and meteorological data is even faster, because the constant and texture caches are used. Calculation is also accelerated by the hardware trilinear interpolation provided by the GPU. Finally, random number generation on the GPU is about two orders of magnitude faster than on the CPU.

Figure 11.18: The speedup of the applications as a function of released particles using GeForce 8800 GTS and GeForce 8800 GTX video cards.

11.2.3.3. Eulerian model

Solving diffusion or advection-diffusion equations fits the architecture of CUDA well. The basic ideas are the following. Concentrations of the species in a simulation can be stored in global memory on the device. We can assign the computation of the next time level to a kernel function, assigning threads to compute individual grid points. The rectangular space represented by the grid points can easily be split into smaller parts, which can be assigned to blocks. Shared memory is utilized within a block in the approximation of the spatial derivatives, because this computation requires data belonging to neighbouring grid points (and threads). Physical constants and other parameters of a simulation can be stored in constant memory (Molnár et al., 2011).
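
A minimal sketch of the thread-per-grid-point idea (shown in 2D with a 5-point stencil for brevity, whereas the study uses a 3D, 19-point stencil; all names are ours):

    __global__ void diffuse(const float *c, float *cNew, int nx, int ny,
                            float D, float dt, float h)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x; /* one thread per grid point */
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i < 1 || i >= nx - 1 || j < 1 || j >= ny - 1)
            return;                                    /* boundaries handled separately */

        int idx = j * nx + i;
        float lap = (c[idx - 1] + c[idx + 1] + c[idx - nx] + c[idx + nx]
                     - 4.0f * c[idx]) / (h * h);       /* discrete Laplacian */
        cNew[idx] = c[idx] + dt * D * lap;             /* explicit time step */
    }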

Two very important performance guidelines must be followed to reach maximum performance: accesses (reads or writes) to global memory should follow a coalesced access pattern, and accesses to shared memory should be free of bank conflicts. First-generation CUDA devices have very strict rules for achieving coalesced memory access, which results in one computational solution for them and another for the second generation. However, a solution that is well optimized for first-generation devices should also run very efficiently on second-generation devices with minor adjustments to some parameters.

Two different reaction-diffusion problems were chosen and solved using CUDA to illustrate the capability and efficiency of GPU computing. The first is pure diffusion, which represents the 'core' mechanism in all reaction-diffusion-advection related problems. The second example extends the pure diffusion transport problem with advection. This arises in many areas, especially in air pollution, where the transport of air pollutants consists of two main transport phenomena: advection (transport by the wind field) and turbulent diffusion.

The first application is a simple diffusion problem without any reactions: a chemically inert compound diffuses from the centre of the computational domain, from which it can reach the outer regions. Simulating this problem essentially amounts to calculating the Laplacian operator at all grid points and updating the concentrations in every time step. It is therefore easier to review the computational solution and its performance in this simple case before applying the main concepts to more complex simulations.

The main question in computing the Laplacian is how to split the computational job on the domain (grid) into smaller rectangular blocks, which will be assigned to thread blocks, where a single thread updates a single grid point. With the spatial discretization scheme used, data from nineteen grid points must be read to update one point; but since these are neighbouring points, the data of every point will be read nineteen times (by its own thread and its neighbours) while updating the whole grid. To avoid reading so many times from the slow global memory, the data should be copied to the shared memory of a thread block. With this stencil, a rectangular block requires an extra layer of grid points around it to allow computation on all points in the block. One way to read this data is that, after all threads have read their corresponding grid points into shared memory, some threads also read the extra layer. The other way is that all threads read their corresponding grid points into shared memory, but the threads on the edge do not compute, and the blocks overlap in every direction. We found that the second approach is generally faster; however, it results in read redundancy: grid points where the blocks overlap are read more than once in a single time step. This redundancy can be calculated by the following formula:

redundancy = (width · height · depth) / ((width − 2) · (height − 2) · (depth − 2)),

(11.6)

where width, height, and depth are the dimensions of a block. If there is no overlap in a given direction (see later), the corresponding factor can be eliminated from the formula.
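
In 2D, with an illustrative 16 × 16 block, the overlapping-blocks approach reads as follows (a simplified sketch of the idea, not the original 3D code); note that with these sizes equation (11.6) gives a read redundancy of (16/14)² ≈ 1.31:

    #define BW 16 /* block width and height (illustrative) */

    __global__ void diffuse_shared(const float *c, float *cNew,
                                   int nx, int ny, float D, float dt, float h)
    {
        __shared__ float tile[BW][BW];

        /* blocks overlap by one point in each direction: each block
           produces (BW - 2) x (BW - 2) updated points */
        int i = min(blockIdx.x * (BW - 2) + threadIdx.x, nx - 1);
        int j = min(blockIdx.y * (BW - 2) + threadIdx.y, ny - 1);
        int idx = j * nx + i;

        tile[threadIdx.y][threadIdx.x] = c[idx]; /* one global read per thread */
        __syncthreads();

        /* edge threads of the block only supply halo data; the physical
           boundary of the domain is skipped (set by boundary kernels) */
        if (threadIdx.x > 0 && threadIdx.x < BW - 1 &&
            threadIdx.y > 0 && threadIdx.y < BW - 1 &&
            i < nx - 1 && j < ny - 1) {
            float lap = (tile[threadIdx.y][threadIdx.x - 1] +
                         tile[threadIdx.y][threadIdx.x + 1] +
                         tile[threadIdx.y - 1][threadIdx.x] +
                         tile[threadIdx.y + 1][threadIdx.x] -
                         4.0f * tile[threadIdx.y][threadIdx.x]) / (h * h);
            cNew[idx] = c[idx] + dt * D * lap;
        }
    }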

We can also utilize the fact that block indices are only two-dimensional in CUDA (though blocks themselves can be three-dimensional). We cannot assign all the "blocks" of grid points to thread blocks at once; instead, we assign only a thin, one-block-wide layer, and the kernels iterate through the simulated space in the third dimension. Since the blocks must overlap, the last two z-layers of data can be kept in shared memory instead of being read again from global memory as the blocks iterate through the space. We call this solution the simple "Shared" method because it utilizes the shared memory in a simple way. It is very similar to the 3D finite difference computation example in the CUDA SDK, but in our case the stencil uses off-axis elements, so we need shared memory for all z-layers in the block, not only for the central z-layer. Moreover, the Shared method fits only the second generation of CUDA devices, because all global memory accesses are uncoalesced according to the strict rules of the first generation of CUDA devices.

For first-generation devices, the thread blocks must iterate over the first dimension, and they must read and write global memory starting at an address that is a multiple of 64 bytes; therefore the block width must be 16 (using 4-byte floats) to achieve coalesced access. Each block must use a wide tile of shared memory for (1 + 2 × 16) × height × depth elements. The extra shared memory is used as a streaming buffer for the global memory. In an iteration step, data from global memory are read into the second 16-element-wide part of the shared memory. The Laplacian can then be computed on all elements of the first 16-element-wide part, because the extra layer of data required on the left edge is provided from the previous step in the 1-element-wide part, and the right edge is provided by the newly read data in the second 16-element-wide part. After computing and writing the results back to global memory, the tile is moved to the right by 16 elements, and the iteration continues. This Moving Tiles method achieves coalesced memory access and can be implemented without shared memory bank conflicts, because the block width is 16, which equals the number of shared memory banks. However, because of the high shared memory requirement, fewer blocks can be allocated on a multiprocessor at once, limiting the performance for more complicated reaction-diffusion systems.

On the edge of the simulated space, the outermost grid points cannot be updated (the Laplacian cannot be computed there), because they have no outer neighbours. Instead, these points are boundary points: their values are set according to the boundary conditions of the PDE before each time step. We use three separate kernels for updating these values on the left and right, top and bottom, and front and back sides of the simulated space, because the space is not necessarily cube-shaped. Three additional kernels update the edges of the space parallel to the x, y, and z axes. Corners do not need values because the stencil does not use them. These kernels are not optimized, because their job is very small compared to the Laplacian computation; they contribute approximately 2% of the computational time.
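
One of these boundary kernels might look like the following 2D sketch (a zero-gradient condition is shown as an example; the actual condition depends on the PDE being solved):

    __global__ void boundary_left_right(float *c, int nx, int ny)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x; /* one thread per row */
        if (j >= ny) return;
        c[j * nx]            = c[j * nx + 1];          /* left face  */
        c[j * nx + (nx - 1)] = c[j * nx + (nx - 2)];   /* right face */
    }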

Our last example is an advection-diffusion problem, which has great relevance in air pollution modelling. Solving advection-diffusion equations to describe the spread and/or transformation of air pollutants is a very important computational and environmental task. The numerical simulations obviously must run faster than real time in order to be usable in decision support.

From the computational point of view, this simulation is very similar to the simple diffusion problem, except that an extra advection term is added. The data read into shared memory for the Laplacian computation can also be used to approximate the first derivatives for advection. An upwind approximation was used to provide a stable solution. The parallel implementations achieve typical acceleration values on the order of 5–40 times compared to a single-threaded CPU implementation on a 2.8 GHz desktop computer, depending on the problem and the parallelization strategy used.
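
The upwind idea, as a small hedged sketch (names are ours): the one-sided difference for the advection term u·∂c/∂x is always taken from the side the wind blows from, which is what keeps the explicit scheme stable.

    /* cm, c0, cp: concentrations at grid points i-1, i, i+1; h: grid spacing */
    __device__ float upwind_advection(float u, float cm, float c0, float cp, float h)
    {
        return (u > 0.0f) ? u * (c0 - cm) / h  /* wind from the left  */
                          : u * (cp - c0) / h; /* wind from the right */
    }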

Most of the models used in environmental protection and air quality management fall into the Lagrangian and Eulerian types. Therefore, it is important to know the real advantages and drawbacks of the applied models (Figure 11.19). From the results presented in this study, it is clearly seen that to estimate the dispersion of pollutants from a strong point source on the local scale, a Lagrangian particle model should be used. The main advantage of this model is that the position of each particle is a continuous function in space. This overcomes the problem of the steep concentration gradients that arise near a point source, and the Lagrangian particle model can be used even in cases of extremely low wind speed, where other models (Gaussian, Eulerian) simply fail and cannot provide a stable and/or realistic solution. On the other hand, the main problem with this approach is that each particle represents a given mass or activity (in the case of radionuclides); therefore, it is very hard to obtain a continuous physical quantity like concentration with a small number of particles, while increasing this number increases the computational time. Using a Lagrangian particle model to simulate the transport of air pollutants from multiple emission sources is practically impossible because of the huge computational task. However, this task is well suited to Eulerian models, which can easily calculate and provide continuous quantities. Eulerian models can predominantly be used to calculate the transport of chemical species from multiple emission sources. In this framework, the chemical reactions occurring between chemical species can be easily handled, because the reaction rates can be calculated from the concentrations, which are prognostic variables in Eulerian models. Modelling the chemical interactions between particles in Lagrangian models is a very challenging task, as it involves considering the stochastic nature of chemical reactions. In conclusion, the model should always be chosen to fit the given air pollution problem in order to provide an accurate and cost-efficient solution. Developing such cost-efficient strategies requires very accurate calculations, and the numerical simulations demand a huge computational effort.

There are numerous solutions to address this issue; parallelizing the models and running them on supercomputers, clusters, or grid systems is a common technique. We investigated a new parallel computational framework for solving environment-related problems using the graphical processing unit. A stochastic Lagrangian particle model and an Eulerian model were developed on CUDA to estimate the transport and transformation of radionuclides from a single point source during hypothetical accidental releases. Our results show that the parallel implementations achieve typical acceleration values on the order of a hundred times for the Lagrangian model and tens of times for the Eulerian model, compared to a single-threaded CPU implementation on a regular desktop computer (Figure 11.19). The relatively high speedup, with no additional costs to maintain this parallel architecture, could lead to the wide usage of GPUs for diverse environmental applications in the near future.

We should also mention some important aspects of this technology. In the past years, as this technology has developed, many papers have been published (for a review, see Lee et al., 2010), and it seems that the "speedup" gained over general CPU solutions has become the measure of CUDA technology. However, one must be very careful to interpret such results correctly. There are numerous parameters that determine the final performance gain over the CPU, such as the GPU type, clock frequency, number of CUDA cores, amount and speed of memory on the video card, CPU type, CPU clock, host memory amount and speed, and, most importantly, the implementation of the algorithm itself. Together, these make it almost impossible to predict the computational speedup of a given application when it is ported to CUDA. Some published works show speedups of several hundred times. This is never the "magical" power of CUDA; it is always the result of an unoptimized CPU code serving as the basis of comparison. The paper by Lee et al. (2010) carries important information: with enough experience and expertise, great optimizations can be achieved on the CPU, bringing the relative performance gain of GPU solutions down close to what one would expect from the raw hardware performance of CPUs and GPUs. However, the amount of computational and programming knowledge required to do so is usually so large that only computer scientists of specific fields can afford to learn it. This is why CUDA has gained so much popularity among scientists from different fields: with relatively limited (but still advanced) knowledge of CPU and GPU programming, the CUDA solution is much faster than the CPU solution. This is the main advantage of CUDA (together with its cost-effectiveness).

We believe that the GPU, as a new computational tool, will revolutionize environmental computing and can provide a new, cost-efficient, and effective technique for environmental protection. The system would be most profitably used in simulating accidental releases of hazardous materials from industrial facilities (e.g., nuclear power plants). The recent accident at the Fukushima Nuclear Power Plant highlights the importance of predicting the accidental release of toxic materials. Implementing a model using this easy-to-maintain and cost-efficient computational framework in all facilities storing hazardous materials would be realistic and could help decrease the environmental consequences of possible future accidents.

Figure 11.19: Speedup of different applications on first and second generation video cards (GeForce 8800 GTX and GeForce GTX 275, respectively) compared to CPU (2.33 GHz Core 2).

References

Baklanov A., Mahura A., Jaffe D., Thaning L., Bergman R., and Andres R. 2002. Atmospheric transport patterns and possible consequences for the European North after a nuclear accident. In: Journal of Environmental Radioactivity. 60. 23–48.

Brandt J., Mikkelsen T., Thykier-Nielsen S., and Zlatev Z. 1996. Using a combination of two models in tracer simulations. In: Mathematical and Computer Modelling. 23. 99–115.

Ryall D.B. and Maryon R.H. 1998. Validation of the UK Met Office NAME model against the ETEX data set. In: Atmospheric Environment. 32. 4265–4276.

Garcia-Menendez F. and Odman M.T. 2011. Adaptive grid use in air quality modelling. In: Atmosphere. 2. 484–509.

Hart G., Tomlin A., Smith J., and Berzins M. 1998. Multi-scale atmospheric dispersion modelling by use of adaptive gridding techniques. In: Environmental Monitoring and Assessment. 52. 225–238.

Horányi A., Ihász I., and Radnóti G. 1996. ARPEGE/ALADIN: A numerical weather prediction model for Central Europe with the participation of the Hungarian Meteorological Service. In: Időjárás. 100. 277–301.

Lagzi I., Tomlin A.S., Turányi T., and Haszpra L. 2009. Modelling photochemical air pollutant formation in Hungary using an adaptive grid technique. In: International Journal of Environment and Pollution. 36. 44–58.

Lagzi I., Kármán D., Tomlin A.S., Turányi T., and Haszpra L. 2004. Simulation of the dispersion of nuclear contamination using an adaptive Eulerian grid model. In: Journal of Environmental Radioactivity. 75. 59–82.
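
Lee V.W., Kim C., Chhugani J., Deisher M., Kim D., Nguyen A.D., Satish N., Smelyanskiy M., Chennupaty S., Hammarlund P., Singhal R., and Dubey P. 2010. Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. In: Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA '10). 451–460.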

Molnár F., Szakály T., Mészáros R., and Lagzi I. 2010. Air pollution modelling using a Graphics Processing Unit with CUDA. In: Computer Physics Communications. 181. 105–112.

Molnár F., Izsák F., Mészáros R., and Lagzi I. 2011. Simulation of reaction-diffusion processes in three dimensions using CUDA. In: Chemometrics and Intelligent Laboratory Systems. 108. 76–85.

Nasstrom J.S. and Pace J.C. 1998. Evaluation of the effect of meteorological data resolution on Lagrangian particle dispersion simulations using the ETEX experiment. In: Atmospheric Environment. 32. 4187–4194.

Sorensen J.H. 1998. Sensitivity of the DERMA long-range Gaussian dispersion model to meteorological input and diffusion parameters. In: Atmospheric Environment. 32. 4195–4206.