Parallel Computation and Implementation of Matrix Multiplication on Supercomputers
-
Abstract: Matrix multiplication is used frequently in numerical weather prediction (NWP) systems. On distributed supercomputers such as the IBM-SP, parallel matrix multiplication requires substantial data movement, so efficient data transfer is crucial to its performance. This paper discusses two parallel matrix multiplication algorithms, one based on a column-row decomposition of the matrices and one based on a mesh (grid) partition, and compares their implementation and communication time. Experimental results on the IBM-SP show that the mesh-partitioned algorithm has lower communication overhead and higher parallel efficiency, improving the parallel speedup by about 10% over the column-row algorithm.
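The abstract does not spell out either algorithm, so the following is only a minimal sketch of what the two decompositions commonly look like, not the authors' implementation. It assumes square matrices, a process count that divides the matrix order, and a serial simulation in place of real message passing. The names `column_row_product` and `mesh_product` are hypothetical, and the "elements received" count is a simplified model of communication volume rather than a measurement.

```python
# Illustrative sketch only: processes are simulated serially, and every
# function name here is hypothetical rather than taken from the paper.

def matmul(a, b):
    """Plain product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def column_row_product(a, b, p):
    """Process r owns row strip r of A and column strip r of B.

    Each process must see every column strip of B once, so the strips
    are assumed to circulate among the p processes.
    """
    n = len(a)
    s = n // p  # strip width; assumes p divides n
    c = [[0] * n for _ in range(n)]
    received = 0  # elements fetched from other processes (simplified model)
    for r in range(p):
        a_strip = a[r * s:(r + 1) * s]
        for step in range(p):
            owner = (r + step) % p  # whose column strip is visiting process r
            b_strip = [row[owner * s:(owner + 1) * s] for row in b]
            if owner != r:
                received += n * s
            block = matmul(a_strip, b_strip)
            for i in range(s):
                c[r * s + i][owner * s:(owner + 1) * s] = block[i]
    return c, received


def mesh_product(a, b, q):
    """Process (i, j) on a q x q grid owns block (i, j) of A, B and C."""
    n = len(a)
    s = n // q  # block size; assumes q divides n
    c = [[0] * n for _ in range(n)]
    received = 0
    for i in range(q):
        for j in range(q):
            acc = [[0] * s for _ in range(s)]
            for k in range(q):
                a_blk = [row[k * s:(k + 1) * s] for row in a[i * s:(i + 1) * s]]
                b_blk = [row[j * s:(j + 1) * s] for row in b[k * s:(k + 1) * s]]
                if k != j:
                    received += s * s  # A block (i, k) comes from grid row i
                if k != i:
                    received += s * s  # B block (k, j) comes from grid column j
                part = matmul(a_blk, b_blk)
                acc = [[x + y for x, y in zip(ra, rp)]
                       for ra, rp in zip(acc, part)]
            for r in range(s):
                c[i * s + r][j * s:(j + 1) * s] = acc[r]
    return c, received


n = 8
a = [[i + j for j in range(n)] for i in range(n)]
b = [[i * j + 1 for j in range(n)] for i in range(n)]
reference = matmul(a, b)
c_strip, moved_strip = column_row_product(a, b, 4)  # 4 processes as strips
c_mesh, moved_mesh = mesh_product(a, b, 2)          # 4 processes as a 2 x 2 grid
print(c_strip == reference, c_mesh == reference)
print(moved_strip, moved_mesh)
```

Under this simplified model, with the same four processes, the strip layout moves more matrix elements than the 2 x 2 grid layout, which is in line with the lower communication overhead the abstract reports for the mesh partition. The counts say nothing about latency, overlap, or the actual IBM-SP interconnect.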
-
Table 1 Tests with matrices of different sizes
Table 2 Comparison of run times for a 9216 × 9216 matrix on different numbers of CPUs