A Hybrid MPI and OpenMP Parallel Scheme for the GRAPES_Global Model
Abstract
Clustered SMP systems are becoming increasingly prominent as advances in multi-core technology allow larger numbers of CPU cores to share a single memory space. To exploit this hardware architecture, which combines distributed and shared memory, a hybrid MPI and OpenMP parallel programming model is a promising approach. This hierarchical model achieves both inter-node and intra-node parallelization by combining message passing with thread-based shared-memory parallelization within the same application: MPI handles coarse-grained communication between SMP nodes, while thread-based OpenMP handles fine-grained computation within each SMP node.

As a typical large-scale, compute- and storage-intensive numerical weather forecasting application, GRAPES (Global/Regional Assimilation and PrEdiction System) has been developed as an MPI code and put into operational use. To adapt it to SMP cluster systems and achieve higher scalability, a hybrid MPI and OpenMP parallel model suited to the GRAPES_Global model is developed, introducing a horizontal domain decomposition method and loop-level parallelization. In the horizontal domain decomposition method, the whole forecast domain is divided into patches, and each patch is uniformly divided into several tiles. Performing parallel operations on tiles has two main advantages. First, tile-level parallelization, which applies OpenMP at a high level, is to some extent coarse-grained parallelism: compared with the computational work associated with each tile, the OpenMP threading overhead is negligible. Second, the implementation is relatively simple; subroutine thread safety is the only requirement. Loop-level parallelization is fine-grained parallelism: OpenMP parallel directives are applied to the main computational loops, and load imbalance can be mitigated by adopting different thread scheduling policies. Horizontal domain decomposition is preferred for uniform grid computation, while loop-level parallelization is preferred for non-uniform grid computation and for thread-unsafe procedures.

Experiments with a 1°×1° dataset are performed, and timings of the main subroutines of the model integration are compared. The hybrid scheme outperforms the pure MPI scheme for the long-wave radiation, microphysics and land-surface processes, while the generalized conjugate residual (GCR) solver for the Helmholtz equation is difficult to thread-parallelize because of its incomplete LU (ILU) factorization preconditioner; applying tile-level parallelization to the ILU part improves the GCR solver's hybrid parallelization. For the short-wave radiation process, hybrid performance is close to that of the pure MPI scheme on the same number of cores. For a fixed number of MPI processes, the elapsed time of the hybrid scheme decreases as the number of threads increases, and with up to four threads the hybrid scheme outperforms the pure MPI scheme in large-scale experiments. The hybrid scheme also achieves better scalability than the pure MPI scheme. These experiments show that the hybrid MPI and OpenMP parallel scheme is well suited to the GRAPES_Global model.
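To make the hybrid structure concrete, the following is a minimal sketch in C (not taken from the GRAPES source, which is not shown here) of how MPI processes and OpenMP threads coexist: MPI is initialized with thread support, OpenMP threads perform the intra-node work, and process-level communication stays coarse-grained.

```c
/* Minimal hybrid MPI+OpenMP skeleton: MPI ranks map to SMP nodes
 * (coarse-grained communication), OpenMP threads work within a node
 * (fine-grained computation). */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* Request thread support so OpenMP regions may coexist with MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Fine-grained, thread-level work inside one SMP node. */
    #pragma omp parallel
    {
        printf("rank %d of %d, thread %d of %d\n",
               rank, nranks, omp_get_thread_num(), omp_get_num_threads());
    }

    /* Coarse-grained, process-level communication between nodes
     * (a barrier here as a stand-in for real halo exchanges). */
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```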
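Tile-level parallelization can be sketched as below. The names process_patch and compute_tile are hypothetical, and the uniform splitting of a patch into latitude strips is an assumption made for illustration; the key points are that one OpenMP thread handles an entire tile and that the worker routine must be thread-safe.

```c
/* Tile-level parallelization sketch: a patch (one MPI rank's share of
 * the horizontal domain) is split uniformly into ntiles strips, and one
 * OpenMP thread handles all the work of each tile. */

/* Hypothetical thread-safe worker over one tile's rows. */
static void compute_tile(double *field, int ni, int js, int je)
{
    for (int j = js; j <= je; j++)
        for (int i = 0; i < ni; i++)
            field[j * ni + i] += 1.0;   /* stand-in for real physics */
}

void process_patch(double *field, int ni, int nj, int ntiles)
{
    /* One thread per tile: coarse-grained OpenMP, so fork/join
     * overhead is negligible relative to the work per tile. */
    #pragma omp parallel for schedule(static)
    for (int t = 0; t < ntiles; t++) {
        int js = t * nj / ntiles;             /* uniform tile split */
        int je = (t + 1) * nj / ntiles - 1;
        compute_tile(field, ni, js, je);      /* must be thread-safe */
    }
}
```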
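Loop-level parallelization with an explicit scheduling policy might look like the following sketch; physics_driver is a hypothetical stand-in for one of the model's main computational loops, and the dynamic schedule illustrates how a thread scheduling policy can mitigate load imbalance when the work per grid column is non-uniform.

```c
/* Loop-level parallelization sketch: an OpenMP parallel directive is
 * applied directly to the main computational loop. A dynamic schedule
 * rebalances load when some columns are more expensive than others;
 * for uniform work, schedule(static) avoids the runtime overhead. */
void physics_driver(double *state, int ncols, int nlev)
{
    #pragma omp parallel for schedule(dynamic, 8)
    for (int c = 0; c < ncols; c++) {
        for (int k = 0; k < nlev; k++)
            state[c * nlev + k] *= 0.99;   /* stand-in computation */
    }
}
```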