Optimizing BCCAGCM on Sunway TaihuLight
Summary: Porting and optimizing numerical weather models on the Sunway TaihuLight system is important for studying how well such models adapt to new computing architectures. This paper takes the BCCAGCM model as its subject: the model is ported to Sunway TaihuLight, a fully domestically developed heterogeneous many-core computing system, its performance is analyzed, the computational structure of its dynamical framework and physical processes is adjusted, its computational kernels are accelerated on the many-core processor with OpenACC, and a large amount of code is algorithmically refactored. The results show that the computational efficiency of each kernel is generally about 3 times that of the unoptimized code and reaches about 14 times at most. Integrating the kernels yields a heterogeneous many-core version that runs correctly and stably with reasonable computational error. Across parallel scales, the speedup of the whole model obtained by using the CPEs is fairly stable at about 1.9 times. At a parallel scale of 26000 cores, the parallel efficiency is about 70% for the dynamics experiment and about 57% for the other experiments.

Abstract: With the rise of many-core processors such as Intel MIC, GPU and SW26010, the architecture of supercomputer systems has changed greatly. Supercomputers are moving from homogeneous systems containing only multi-core CPUs to heterogeneous systems in which CPUs coexist with many-core accelerators. Heterogeneous architectures provide powerful computing capacity for large, complex applications. However, because numerical models have mostly been developed for conventional CPUs, which differ from many-core accelerators, the existing tens of thousands of lines of legacy code cannot take full advantage of the parallel computing capacity of the new architecture. Porting and optimizing weather and climate numerical models on the new system is therefore of great significance for improving the adaptability of the models to the new computing architecture.

The Sunway TaihuLight system is the world's first supercomputer with a peak performance greater than 100 PFlops, and it is based on the homegrown SW26010 heterogeneous many-core chip. Each SW26010 processor consists of management processing elements (MPEs) and clusters of computing processing elements (CPEs). To support parallel computing on the heterogeneous architecture, the system provides a set of compilation tools, including basic C/C++ and Fortran compilers, as well as a customized Sunway OpenACC tool that supports the OpenACC 2.0 syntax.

As the atmospheric component of BCCCSM, BCCAGCM is the most computationally expensive component in typical configurations.
Because BCCAGCM had not previously been run on the Sunway system, it is first ported to the system using only the MPEs for computation. The computational framework is then analyzed to identify the major kernels that take the most time. BCCAGCM uses a hybrid parallelization scheme combining MPI and OpenMP. On the Sunway system, MPI and OpenACC are used to obtain appropriate parallelism from the CPE clusters. On one hand, the computational sequence and the loop structures are adjusted to aggregate more parallel computation, so that the parallelism of the CPE clusters is fully utilized. On the other hand, the data access and transfer strategy is optimized to improve use of the LDM and to minimize the share of time spent on data movement and computational overhead.

The efficiency of the optimized MPE+CPE heterogeneous computation is compared with that of the original MPE-only computation. The optimized kernels are generally about 3 times as efficient as before, and up to about 14 times. With the kernels integrated, the new version is 1.9 times as efficient as before. Although the overall speedup of the model is modest, building this basic heterogeneous many-core version of BCCAGCM adds to the experience of optimizing and refactoring meteorological numerical models for new computing architectures.
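The two optimization directions named above, aggregating loop parallelism and fitting data into the LDM, can be illustrated with a small back-of-the-envelope sketch. It is not the authors' code. It assumes 64 CPEs per cluster and 64 KB of local data memory (LDM) per CPE, figures commonly quoted for the SW26010 rather than stated in this abstract. The loop sizes, the array count and the function names `utilization` and `columns_per_tile` are hypothetical.

```python
import math

# Assumed hardware figures (not given in the abstract).
CPES_PER_CLUSTER = 64        # CPEs in one SW26010 cluster
LDM_BYTES = 64 * 1024        # local data memory per CPE


def utilization(n_iterations: int, n_workers: int = CPES_PER_CLUSTER) -> float:
    """Fraction of worker time spent on useful work under a block distribution."""
    if n_iterations == 0:
        return 0.0
    # The most heavily loaded worker determines the elapsed time.
    busiest = math.ceil(n_iterations / n_workers)
    return n_iterations / (n_workers * busiest)


def columns_per_tile(n_arrays: int, n_levels: int, bytes_per_value: int = 8,
                     ldm_bytes: int = LDM_BYTES) -> int:
    """Largest number of vertical columns whose working set fits in one LDM."""
    bytes_per_column = n_arrays * n_levels * bytes_per_value
    return ldm_bytes // bytes_per_column


# Hypothetical loop nest: 26 vertical levels by 128 longitudes.
n_lev, n_lon = 26, 128
outer_only = utilization(n_lev)          # parallelize the level loop alone
collapsed = utilization(n_lev * n_lon)   # merge both loops into one index space
tile = columns_per_tile(n_arrays=8, n_levels=n_lev)
print(f"outer loop only: {outer_only:.2f}, collapsed: {collapsed:.2f}, "
      f"tile: {tile} columns")
```

With only 26 outer iterations, most of the 64 workers have nothing to do, whereas the merged index space keeps every worker busy. The tile size bounds how many columns a CPE can hold locally before it has to fetch more data from main memory.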
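Table 1 Major kernels of BCCAGCM
Function                                                            Percentage/%
Gravity wave process                                                23.09
Inverse Legendre transform: symmetric component coefficients        5.15
Inverse Legendre transform: antisymmetric component coefficients    4.75
Shortwave radiation process                                         1.32
Dynamical interpolation                                             0.91
Large-scale condensation and precipitation process                  0.88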
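An Amdahl's-law estimate gives a rough sense of how kernel-level speedups translate into whole-model speedup. The sketch below assumes that the percentages in Table 1 are shares of total MPE-only runtime, and it applies a hypothetical uniform speedup of 3 times to every listed kernel. The six kernels cover only about 36% of the total, so the result is illustrative and is not expected to match the reported whole-model figure of 1.9 times. The function name `overall_speedup` is made up for this example.

```python
# Percentages from Table 1, assumed here to be shares of total MPE-only runtime.
KERNEL_SHARES = {
    "gravity wave process": 23.09,
    "inverse Legendre transform, symmetric": 5.15,
    "inverse Legendre transform, antisymmetric": 4.75,
    "shortwave radiation": 1.32,
    "dynamical interpolation": 0.91,
    "large-scale condensation and precipitation": 0.88,
}


def overall_speedup(shares_percent, kernel_speedup):
    """Amdahl's law: only the listed share of the runtime is accelerated."""
    accelerated = sum(shares_percent) / 100.0
    remaining = 1.0 - accelerated
    return 1.0 / (remaining + accelerated / kernel_speedup)


shares = list(KERNEL_SHARES.values())
typical = overall_speedup(shares, 3.0)            # hypothetical uniform 3x
ceiling = overall_speedup(shares, float("inf"))   # listed kernels cost nothing
print(f"listed share: {sum(shares):.2f}%")
print(f"speedup at 3x per kernel: {typical:.2f}")
print(f"upper bound from these kernels alone: {ceiling:.2f}")
```

The part of the runtime that stays on the MPE caps the overall gain however fast the individual kernels become.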