Wang Bin, Zong Xiang, Wei Min. A fine-grained, real time HPC resource management system. J Appl Meteor Sci, 2008, 19(4): 507-512.

A Fine-grained, Real Time HPC Resource Management System

  • Received Date: 2007-07-26
  • Rev Recd Date: 2008-04-07
  • Publish Date: 2008-08-31
  • Resource management on the national meteorological high performance computers has lagged behind the rapid growth in computing capability. Without operational resource management software, system administrators lack detailed knowledge of how these computers are being used and cannot exert effective control over resource allocation. To address these problems, a fine-grained, real-time high performance computer resource management system with cross-cluster (Grid) support is proposed. The system focuses on CPU hours, a resource under keen competition. It introduces the GCU (General Computing Unit), a resource virtualization unit for measuring computing resources, which hides the differences between computing resources on different high performance computer systems and enables fine-grained, uniform, quantitative management. The target users include resource users, leaders of user organizations, resource system administrators, and decision makers. The system comprises three layers: user interfaces, resource management, and high performance computer systems. The resource management layer, the primary layer, consists of the resource accounting and allocation manager, the Grid platform, and the resource information database. Built on open source software from supercomputing centers abroad, a Grid project funded by MOST, and an RDBMS, the system has been implemented, deployed, and run experimentally at the National Meteorological Information Center. The fundamental functions of resource accounting and allocation management have been implemented, including cluster job accounting, resource account management, and the management, allocation, and query of users and organizations, with a command line interface provided for users.
PostgreSQL is adopted as the resource information database, in which tables for accounts, users, organizations, computer systems, job records, and accounting and allocation relations are created. The software has been deployed on the three partitions of the IBM high performance computer system, the Sunway 32I cluster, the Sunway 32P cluster, and the IBM SP system, working with LoadLeveler and PBS. User information on the national meteorological high performance computer systems has been sorted and updated, yielding uniform UIDs and GIDs, and loaded into the database. Two levels of management, organizations (projects) and individuals, are established. Computing resources, measured in GCUs, are allocated evenly among user organizations on the basis of 200 percent of the total available resource. Individual users can use only the resources allocated to their own department. Allocations are valid for one season. Overdraft is allowed. Based on partial data collected during the experimental run, initial statistical analyses are made of resource usage by user organizations and individuals. The system has now been put into operation and successfully applied to operational management.
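The accounting rules described above (GCU conversion, even allocation among organizations at 200 percent of the available total, charging individual jobs to the user's organization, a validity period, and permitted overdraft) can be illustrated with a minimal Python sketch. This is not the system's actual implementation: the conversion factors, the system keys, and the names `job_cost_gcu`, `OrgAccount`, and `allocate_evenly` are all hypothetical, since the abstract does not give them.

```python
from dataclasses import dataclass, field
from datetime import date

# Hypothetical conversion factors (GCUs per CPU-hour on each system).
# The abstract does not state the real values.
GCU_PER_CPU_HOUR = {"ibm_hpc": 1.0, "sunway_32i": 0.5}


def job_cost_gcu(system: str, cpus: int, wall_hours: float) -> float:
    """Convert one job's CPU-hours on a given system into GCUs."""
    return GCU_PER_CPU_HOUR[system] * cpus * wall_hours


@dataclass
class OrgAccount:
    """Organization-level account; individuals draw on it."""
    name: str
    allocated_gcu: float
    expires: date
    used_gcu: float = 0.0
    members: set = field(default_factory=set)
    usage_by_user: dict = field(default_factory=dict)

    @property
    def balance(self) -> float:
        return self.allocated_gcu - self.used_gcu

    def charge(self, user: str, cost_gcu: float, on: date) -> float:
        """Charge a job to the organization and return the new balance."""
        if user not in self.members:
            raise PermissionError(f"{user} is not a member of {self.name}")
        if on > self.expires:
            raise ValueError("allocation has expired")
        self.used_gcu += cost_gcu
        self.usage_by_user[user] = self.usage_by_user.get(user, 0.0) + cost_gcu
        # Overdraft is allowed, so the balance may go negative.
        return self.balance


def allocate_evenly(org_names, total_available_gcu, expires, factor=2.0):
    """Split factor * total evenly; factor=2.0 mirrors the 200 percent rule."""
    share = factor * total_available_gcu / len(org_names)
    return {name: OrgAccount(name, share, expires) for name in org_names}
```

For example, with two hypothetical organizations sharing 1000 available GCUs, each receives 1000 GCUs. A 64-CPU, 10-hour job on the system keyed `sunway_32i` costs 320 GCUs under the assumed factor, and a later charge larger than the remaining balance simply leaves the account overdrawn rather than being rejected.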
  • Fig. 1  Design scheme of HPC resource management system

    Fig. 2  Computing resource usage (a) and job submissions (b) by major user organizations

    Table  1  Top 10 computing resource consumption users

    Table  2  Top 5 job submission users

    Table  3  Top 10 computing resource consumption jobs

    Table  4  Computing usage by user organizations
