Wei Min, Wang Bin, He Xiang, et al. Optimizing BCCAGCM on Sunway TaihuLight. J Appl Meteor Sci, 2019, 30(4): 502-512. DOI: 10.11898/1001-7313.20190410.

Optimizing BCCAGCM on Sunway TaihuLight

DOI: 10.11898/1001-7313.20190410
  • Received Date: 2019-03-17
  • Rev Recd Date: 2019-05-08
  • Publish Date: 2019-07-31
  • With the rise of many-core processors such as Intel MIC, GPU and SW26010, the architecture of supercomputer systems has changed greatly. Supercomputers are moving from homogeneous systems containing only multi-core CPUs to heterogeneous systems in which CPUs coexist with many-core accelerators. Heterogeneous architectures provide powerful computing capacity for large, complex applications. However, because numerical models have mostly been developed for conventional CPUs, which differ from many-core accelerators, the existing tens of thousands of lines of legacy code cannot take full advantage of the parallel computing capacity of the new architecture. Porting and optimizing weather and climate numerical models on the new systems is therefore of great significance for improving the adaptability of the models to the new computing architecture.

    The Sunway TaihuLight system is the world's first supercomputer with a peak performance greater than 100 PFlops, and it is based on the homegrown SW26010 heterogeneous many-core chip. Each SW26010 processor consists of management processing elements (MPEs) and clusters of computing processing elements (CPEs). To support parallel computing on the heterogeneous architecture, the system provides a set of compilation tools, including basic C/C++ and Fortran compilers, as well as a customized Sunway OpenACC tool that supports the OpenACC 2.0 syntax.

    As the atmospheric component of BCCCSM, BCCAGCM is the most computationally expensive component in typical configurations. Because BCCAGCM had not previously been run on the Sunway system, it is first ported to the system using only the MPEs for computation. The computing framework is then analyzed to identify the major kernels that consume the most computing time. The original BCCAGCM uses a hybrid parallelization scheme combining MPI and OpenMP. On the Sunway system, MPI and OpenACC are used instead to obtain appropriate parallelism from the CPE clusters. On the one hand, the computational sequence and the loop structures are adjusted to aggregate more parallel computations, so that the parallelism of the CPE clusters is fully utilized. On the other hand, the design optimizes the data access and transmission strategy, makes better use of the LDM, and minimizes the share of data movement and computation overhead.

    The efficiency of the optimized MPE+CPE heterogeneous computation is compared with that of the original MPE-only computation. The computing efficiency of the optimized kernels is generally about 3 times that of the original, and up to about 14 times. After the optimized kernels are integrated, the new version of the model reaches a computing efficiency 1.9 times that of the original. Although the overall speedup of the model is modest, building this baseline heterogeneous many-core version of BCCAGCM adds to the experience of optimizing and refactoring meteorological numerical models for new computing architectures.
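The loop restructuring mentioned in the abstract can be illustrated with a minimal sketch. It is not taken from BCCAGCM: the function names, the array layout, and the arithmetic in the loop body are hypothetical, and only standard OpenACC directives are used, not the customized Sunway OpenACC extensions. The first function exposes only the outer level loop as a candidate for parallel execution; the second collapses both loops into one larger iteration space and names the data to be moved explicitly, which is one possible way to aggregate more parallel work for an accelerator.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel on a field stored row-major: index = k * ncol + i.
 * The formula 2*x + 1 is a placeholder for real physics. */

/* Original structure: only the outer loop over levels would be
 * parallelized, so at most nlev independent work items exist. */
void update_levels_only(int nlev, int ncol, const double *in, double *out)
{
    assert(in != NULL && out != NULL && nlev > 0 && ncol > 0);
    for (int k = 0; k < nlev; ++k) {
        for (int i = 0; i < ncol; ++i) {
            out[k * ncol + i] = 2.0 * in[k * ncol + i] + 1.0;
        }
    }
}

/* Restructured: collapse(2) merges the two loops into a single
 * iteration space of nlev * ncol independent points. The data clauses
 * state which arrays are copied to and from the accelerator.
 * A compiler without OpenACC support ignores the pragma and runs
 * the loops serially, giving the same result. */
void update_collapsed(int nlev, int ncol, const double *in, double *out)
{
    assert(in != NULL && out != NULL && nlev > 0 && ncol > 0);
    #pragma acc parallel loop collapse(2) \
        copyin(in[0:nlev * ncol]) copyout(out[0:nlev * ncol])
    for (int k = 0; k < nlev; ++k) {
        for (int i = 0; i < ncol; ++i) {
            out[k * ncol + i] = 2.0 * in[k * ncol + i] + 1.0;
        }
    }
}
```

Both functions compute the same result; the difference lies only in how much independent work is exposed to the accelerator and how data transfers are declared.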
  • Fig. 1  General architecture of the SW26010 processor

    Fig. 2  Software environment composition of Sunway TaihuLight

    Fig. 3  Compilation process of Sunway TaihuLight OpenACC

    Fig. 4  Execution model of Sunway TaihuLight OpenACC

    Fig. 5  Storage model of Sunway TaihuLight OpenACC

    Fig. 6  Computing framework of BCCAGCM

    Fig. 7  The acceleration effect of major kernels of BCCAGCM and their proportions in the total runtime

    Fig. 8  The acceleration effect of BCCAGCM: performance of the model running on MPEs and CPE clusters compared with the model running on MPEs only

    Fig. 9  The speedup of BCCAGCM computing

    Fig. 10  The parallel efficiency of BCCAGCM computing

    Table  1  Major kernels of BCCAGCM

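    Function                                                                          Percentage/%
    Gravity wave process                                                              23.09
    Computation of symmetric component coefficients, inverse Legendre transform       5.15
    Computation of antisymmetric component coefficients, inverse Legendre transform   4.75
    Shortwave radiation process                                                       1.32
    Dynamical interpolation                                                           0.91
    Large-scale condensation and precipitation process                                0.88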