Optimizing BCCAGCM on Sunway TaihuLight
-
Abstract
With the rise of many-core processors such as the Intel MIC, GPUs, and the SW26010, the architecture of supercomputer systems has changed substantially. Supercomputers have moved from homogeneous systems containing only multi-core CPUs to heterogeneous systems in which CPUs coexist with many-core accelerators. Heterogeneous architectures provide powerful computing capability for large, complex applications. However, numerical models have largely been developed for conventional CPUs, which differ from many-core accelerators, so the existing tens of thousands of lines of legacy code cannot take full advantage of the parallel computing capacity of the new architecture. Porting and optimizing weather and climate numerical models on the new systems is therefore of great significance for improving the adaptability of these models to new computing architectures.

The Sunway TaihuLight system is the world's first supercomputer with a peak performance greater than 100 PFlops, and it is based on the homegrown SW26010 heterogeneous many-core chip. Each SW26010 processor consists of management processing elements (MPEs) and clusters of computing processing elements (CPEs). To support parallel computing on this heterogeneous architecture, the system provides a set of compilation tools, including basic C/C++ and Fortran compilers. In addition, a customized Sunway OpenACC tool supports the OpenACC 2.0 syntax.

As the atmospheric component of BCCCSM, BCCAGCM is the most computationally expensive component in typical configurations. Because BCCAGCM had not previously been run on the Sunway system, it is first ported to the system using only the MPEs to perform the computation. The computational framework is then analyzed to identify the major kernels that consume the most time. BCCAGCM uses a hybrid parallelization scheme combining MPI and OpenMP to carry out its computation.
On the Sunway system, MPI and OpenACC are used to obtain appropriate parallelism from the CPE cluster. On the one hand, the computational sequence and the loop structures are adjusted to aggregate more parallel computation, so that the parallelism of the CPE cluster is fully utilized. On the other hand, the design optimizes the data access and transfer strategy, improves LDM utilization, and minimizes the share of time spent on data movement and computational overhead.

The efficiency of the optimized MPE+CPE heterogeneous computation is compared with that of the original MPE-only computation. The optimized kernels generally run about 3 times as fast as before, and up to about 14 times as fast. After the optimized kernels are integrated into the model, the new version achieves a computing efficiency 1.9 times that of the original. Although the overall acceleration of the model is not very pronounced, the resulting basic heterogeneous many-core version of BCCAGCM adds to the experience in optimizing and refactoring meteorological numerical models for new computing architectures.
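
The gap between the kernel-level figures (about 3 times, up to about 14 times) and the whole-model figure (1.9 times) is what Amdahl's law would lead one to expect when only part of the runtime is accelerated. The following is a rough illustration rather than a measurement from this work: it assumes that a single fraction p of the original MPE-only runtime lies in the ported kernels and that all of them speed up by the same factor s. The function names and the uniform-speedup assumption are hypothetical and introduced only for this sketch.

```python
def overall_speedup(p: float, s: float) -> float:
    """Amdahl's law: whole-model speedup when a fraction p of the
    original runtime is accelerated by a factor s (assumed uniform)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if s <= 0.0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - p) + p / s)


def accelerated_fraction(overall: float, s: float) -> float:
    """Invert Amdahl's law: the runtime fraction p that would have to be
    accelerated by s to produce the given whole-model speedup.

    From 1/overall = (1 - p) + p/s it follows that
    p = (1 - 1/overall) / (1 - 1/s).
    """
    if s <= 1.0:
        raise ValueError("s must exceed 1 for the inversion to be defined")
    if not 1.0 <= overall <= s:
        raise ValueError("overall speedup must lie in [1, s]")
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / s)


if __name__ == "__main__":
    # Hypothetical scenarios: every ported kernel gains exactly s.
    for s in (3.0, 14.0):
        p = accelerated_fraction(1.9, s)
        print(f"kernel speedup {s:4.1f}x -> implied accelerated fraction {p:.2f}")
```

Under these assumptions, a uniform kernel speedup of 3 times would correspond to roughly 71% of the original runtime being spent in the ported kernels. The real kernels speed up by different factors, so this is only an order-of-magnitude reading of the reported numbers, but it suggests why the remaining MPE-only code limits the overall gain.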
-
-