Jiang Qingu, Jin Zhiyan. The hybrid MPI and OpenMP parallel scheme of GRAPES_global model. J Appl Meteor Sci, 2014, 25(5): 581-591.

The Hybrid MPI and OpenMP Parallel Scheme of GRAPES_Global Model

  • Received Date: 2013-10-25
  • Rev Recd Date: 2014-04-30
  • Publish Date: 2014-09-30
  • Clustered SMP systems are becoming increasingly prominent as advances in multi-core technology allow larger numbers of CPUs to share a single memory space. To exploit this hardware architecture, which combines distributed and shared memory, the hybrid MPI and OpenMP parallel programming model is a natural choice. This hierarchical model achieves both inter-node and intra-node parallelization by combining message passing with thread-based shared-memory parallelism within the same application: MPI handles coarse-grained communication between SMP nodes, while OpenMP threads perform fine-grained computation within a node. GRAPES (Global/Regional Assimilation and PrEdiction System), a typical computation- and storage-intensive numerical weather prediction application, has been parallelized with MPI and put into operational use. To adapt to SMP cluster systems and achieve higher scalability, a hybrid MPI and OpenMP parallel scheme for the GRAPES_Global model is developed, introducing a horizontal domain decomposition method and loop-level parallelization. In the horizontal domain decomposition method, the whole forecast domain is divided into patches, and each patch is uniformly divided into several tiles. Performing parallel work on tiles has two main advantages. First, tile-level parallelization applies OpenMP at a high level and is, to some extent, coarse-grained parallelism: compared with the computing work associated with each tile, the OpenMP threading overhead is negligible. Second, the implementation is relatively simple; the only requirement is that the subroutines be thread-safe. Loop-level parallelization, in which OpenMP parallel directives are applied to the main computational loops, is fine-grained parallelism and can mitigate load imbalance by adopting different thread scheduling policies. Horizontal domain decomposition is preferred for uniform grid computations, whereas loop-level parallelization is preferred for non-uniform grid computations and for thread-unsafe procedures. Experiments with a 1°×1° dataset are performed and the timings of the main subroutines of the integration are compared. The hybrid parallel scheme outperforms the pure MPI scheme for longwave radiation, microphysics and the land surface process, whereas the generalized conjugate residual (GCR) solver for the Helmholtz equation is difficult to thread-parallelize because of its incomplete LU (ILU) factorization preconditioner; applying tile-level parallelization to the ILU part improves the hybrid parallelization of GCR. For the shortwave radiation process, the hybrid performance is close to that of the pure MPI scheme on the same number of cores. With a fixed number of MPI processes, the elapsed time of the hybrid scheme decreases as the number of threads increases, and with up to four threads the hybrid scheme outperforms the pure MPI scheme in large-scale experiments. The hybrid scheme also achieves better scalability than the pure MPI scheme. The experiments show that the hybrid MPI and OpenMP parallel scheme is suitable for the GRAPES_Global model. (A minimal illustrative sketch of this two-level structure is given after the tables below.)
  • Fig. 1  The horizontal domain decomposition scheme of the GRAPES hybrid parallelization (from Reference [18])

    Fig. 2  The calculation flowchart of the GRAPES model (from Reference [19])

    Fig. 3  The average time of the GCR algorithm with the ILU preconditioner in a single integration step

    Fig. 4  The comparison of computational time for different parallel schemes

    Fig. 5  The speedup of the integration in the hybrid parallelization

    Fig. 6  The comparison of integration time of the main subroutines for each experiment scheme

    Fig. 7  The integration time of 35 steps with different numbers of computing cores

    Table  1  The comparison of parallel computational time with different tile-decomposition schemes

    Tile decomposition scheme | Tiles (meridional × zonal) | Computation time/ms
    1-D meridional decomposition | 8×1 | 20.3
    2-D horizontal decomposition (1) | 4×2 | 17.1
    2-D horizontal decomposition (2) | 2×4 | 14.3
    1-D zonal decomposition | 1×8 | 12.3

    Table  2  The comparison of computational time with different patch-decomposition schemes

    Patches (meridional × zonal) | Tiles per patch | Computation time/ms
    8×2 | 4 | 28.1
    4×4 | 4 | 19.1
    2×8 | 4 | 19.3

    Table  3  The comparison of interpolation and cumulus convection scheme computational time with three thread scheduling policies

    Processes | Threads | Interpolation computation time/ms | Cumulus convection parameterization computation time/ms
     |  | static | dynamic | guided | static | dynamic | guided
    16 | 1 | 48.4 | 47.7 | 47.9 | 75.0 | 75.1 | 75.0
    16 | 2 | 28.7 | 27.7 | 28.0 | 40.4 | 38.2 | 38.5
    16 | 4 | 17.2 | 17.5 | 16.9 | 21.3 | 20.0 | 19.8
    16 | 8 | 11.6 | 12.2 | 11.8 | 11.1 | 10.4 | 10.7

    Table  4  The comparison of tile-level and loop-level parallelization results

    Processes | Threads | Computation time/ms
     |  | Tile-level parallelization | Loop-level parallelization
    16 | 1 | 50.8 | 48.4
    16 | 2 | 31.0 | 28.7
    16 | 4 | 18.3 | 17.2
    16 | 8 | 12.3 | 11.6

    Table  5  The comparison of hybrid parallel results of GCR algorithm

    Nodes | Processes | Threads | Computation time/s
    16 | 64 | 1* | 0.6875
    16 | 64 | 1 | 0.6937
    16 | 32 | 2 | 0.6777
    16 | 16 | 4 | 0.7622
    8 | 8 | 8 | 0.8230
    Note: * denotes the pure MPI scheme compiled without the OpenMP option.
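    The following minimal C sketch illustrates the two-level structure described in the abstract (GRAPES itself is written in Fortran; the names compute_tile, N_TILES_X, N_TILES_Y and n_columns are hypothetical and not taken from the GRAPES source): each MPI process owns one patch of the horizontal domain, OpenMP threads work concurrently on the tiles of that patch (tile-level parallelization), and individual computational loops can instead be threaded directly with a scheduling clause (loop-level parallelization).

    /* Minimal hybrid MPI + OpenMP sketch of the patch/tile scheme (illustrative only;
     * compute_tile(), N_TILES_X/N_TILES_Y and n_columns are hypothetical names). */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    #define N_TILES_X 2               /* tiles per patch, meridional direction (assumed) */
    #define N_TILES_Y 4               /* tiles per patch, zonal direction (assumed) */

    /* Work on one tile; must be thread-safe because tiles are processed concurrently. */
    static void compute_tile(int rank, int tx, int ty) {
        (void)rank; (void)tx; (void)ty; /* ... grid-point computations for tile (tx, ty) ... */
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);       /* inter-node, patch-level parallelism */
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Tile-level parallelization: one loop iteration per tile of this patch. */
        #pragma omp parallel for collapse(2) schedule(static)
        for (int tx = 0; tx < N_TILES_X; tx++)
            for (int ty = 0; ty < N_TILES_Y; ty++)
                compute_tile(rank, tx, ty);

        /* Loop-level parallelization: a directive on a main computational loop.
         * dynamic (or guided) scheduling helps when the work per column is uneven,
         * e.g. cumulus convection is triggered only at some grid points. */
        const int n_columns = 1000;   /* grid columns in this patch (assumed) */
        double heating = 0.0;
        #pragma omp parallel for schedule(dynamic) reduction(+:heating)
        for (int i = 0; i < n_columns; i++)
            heating += 0.001 * i;     /* stand-in for per-column physics */

        MPI_Barrier(MPI_COMM_WORLD);  /* halo exchange between patches would go here */
        if (rank == 0)
            printf("patches = %d, threads per patch = %d\n", nprocs, omp_get_max_threads());
        MPI_Finalize();
        return 0;
    }

    Built with, for example, mpicc -fopenmp and launched with one MPI process per patch and OMP_NUM_THREADS threads per process, such a program mirrors the process × thread combinations compared in Tables 3-5; the schedule(dynamic) clause corresponds to the dynamic scheduling policy of Table 3.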
  • [1]
    Gysi T, Fuhrer O, Osuna C, et al.Porting COSMO to Hybrid Architectures.[2013-04-14]. http://data1.gfdl.noaa.gov/multi-core/2012/presentations/Session_2_Messmer.pdf.
    [2]
    冯云, 周淑秋.MPI+OpenMP混合并行编程模型应用研究.计算机系统应用, 2006(2):33-35. http://cdmd.cnki.com.cn/Article/CDMD-10530-2008180946.htm
    [3]
    樊志杰, 赵文涛.GRAPES四维变分同化系统MPI和OpenMP混合算法研究.计算机光盘软件与应用, 2012(19):21-23. http://www.cnki.com.cn/Article/CJFDTOTAL-GPRJ201219008.htm
    [4]
    The Weather Research and Forecasting (WRF) Model.[2013-01-09].http://wrf-model.org/.
    [5]
    The Users Home Page for the Weather Research and Forecasting (WRF) Modeling System.[2013-01-09]. http://www.mmm.ucar.edu/wrf/users/.
    [6]
    Skamarock W C, Klemp J B, Dudhia J, et al.A Description of the Advanced Research WRF Version 3.NCAR Tech Note NCAR/TN-475+STR, 2005.
    [7]
    Šipková V, Lúcny A, Gazák M.Experiments with a Hybrid-Parallel Model of Weather Research and Forecasting (WRF) System.GCCP 2010 Book of Abstracts, 2010:37. doi:  10.1175/2008MWR2445.1
    [8]
    Epicoco I, Mocavero S, Giovanni A.NEMO-Med:Optimization and Improvement of Scalability.CMCC Research Paper, 2011. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1959924
    [9]
    张昕, 季仲贞, 王斌.OpenMP在MM5中尺度模式中的应用试验.气候与环境研究, 2001, 6(1):84-90. http://www.cnki.com.cn/Article/CJFDTOTAL-QHYH200101009.htm
    [10]
    朱政慧, 施培量, 颜宏.用OpenMP并行化气象预报模式试验.应用气象学报, 2002, 13(1):102-108. http://qikan.camscma.cn/jams/ch/reader/view_abstract.aspx?file_no=20020112&flag=1
    [11]
    朱政慧.并行高分辨率有限区预报系统在IBM SP上的建立.应用气象学报, 2003, 14(1):119-121. http://qikan.camscma.cn/jams/ch/reader/view_abstract.aspx?file_no=20030114&flag=1
    [12]
    朱政慧.一个数值天气预报模式的并行混合编程模型及其应用.数值计算与计算机应用, 2005, 26(3):203-204. http://www.cnki.com.cn/Article/CJFDTOTAL-SZJS200503005.htm
    [13]
    郭妙, 金之雁, 周斌.基于通用图形处理器的GRAPES长波辐射并行方案.应用气象学报, 2012, 23(3):348-354. http://qikan.camscma.cn/jams/ch/reader/view_abstract.aspx?file_no=20120311&flag=1
    [14]
    郑芳, 许先斌, 向冬冬, 等.基于GPU的GRAPES数值预报系统中RRTM模块的并行化研究.计算机科学, 2012, 39(6):370-374. http://www.cnki.com.cn/Article/CJFDTOTAL-JSJA2012S1100.htm
    [15]
    OpenMP Specications.OpenMP Application Programing Interface.V3.0, 2008.[2013-01-09]. http://www.openmp.org/mp-documents/spec30.pdf.
    [16]
    Chapman B, Jost G, Van Der Pas R.Using OpenMP:Portable Shared Memory Parallel Programming.London:MIT Press, 2008.
    [17]
    Blaise Barney.OpenMP.[2013-01-09].https://computing.llnl.gov/tutorials/openMP/.
    [18]
    薛纪善, 陈德辉.数值预报系统GRAPES的科学设计与应用.北京:科学出版社, 2008.
    [19]
    伍湘君.GRAPES高分辨率气象数值预报模式并行计算关键技术研究.北京:国防科学技术大学, 2011.
    [20]
    伍湘君, 金之雁, 黄丽萍, 等.GRAPES模式软件框架与实现.应用气象学报, 2005, 16(4):539-546. doi:  10.11898/1001-7313.20050415
    [21]
    Fowler R F, Greenough C.Mixed MPI:OpenMP Programming:A Study in Parallelisation of a CFD Multiblock Code.CCLRC Rutherford Appleton Laboratory, 2003. http://www.softeng.rl.ac.uk/st/archive/SoftEng/SESP/Publications/mpi_openmp/mpi_openmp/
    [22]
    金之雁, 王鼎兴.一种在异构系统中实现负载平衡的方法.应用气象学报, 2003, 14(4):410-418. http://qikan.camscma.cn/jams/ch/reader/view_abstract.aspx?file_no=20030451&flag=1
    [23]
    陈德辉, 沈学顺.新一代数值预报系统GRAPES研究进展.应用气象学报, 2007, 17(6):773-777. http://qikan.camscma.cn/jams/ch/reader/view_abstract.aspx?file_no=200606125&flag=1
    [24]
    刘宇, 曹建文.适用于GRAPES数值天气预报软件的ILU预条件子.计算机工程与设计, 2008, 29(3):731-734. http://www.cnki.com.cn/Article/CJFDTOTAL-SJSJ200803062.htm