A Set of MapReduce Tuning Experiments Based on Meteorological Operations
Summary: Cloud computing uses distributed computing to deliver the capacity and efficiency of parallel computation, overcoming the limited computing power of a single server. Climate normals computed from long-series historical data are important for real-time operations, near-real-time operations, and scientific research in meteorology. Because these long-series historical data are voluminous and the computation logic is relatively complex, compiling them on a traditional single-node platform takes a very long time. This paper builds a cluster-mode cloud computing platform on the Hadoop distributed computing framework, takes long-series historical data as the source data, and implements some of the compilation algorithms with the MapReduce model to shorten computation time. In addition, because the source data consist of many small files, the storage format and file size of the source data are modified: the SequenceFile approach and the text-file merging approach are compared on the same scenario, with merges of 10 files and of 100 files tested, which improves timeliness further.
Abstract: Cloud computing uses distributed computing technology to achieve the computing power and efficiency of parallel computing, which solves the problem of the low computing power of a standalone server. It is a new application model for decentralized computing that can serve the maximum number of users reliably and in a customized way with minimum resources, and it is also an important way to combine cloud computing theory and practice with other theories and techniques. Cloud computing is being applied in a growing range of industries and fields, and its flexibility, ease of use, and stability are gradually being recognized. In the meteorological department, the development of cloud-based platforms for scientific computing is still very limited, but some attempts have been made as cloud computing has matured. In meteorological operations, large-scale scientific computing and other general computing models run on high-performance server clusters. Because resources and the number of HPC nodes are limited, scientific computing still relies on the traditional standalone or clustered mode. Exploring a platform that combines conventional general-purpose computing with cloud computing is therefore very meaningful for the meteorological department. A valuable 60-year long series of historical data is stored at the National Meteorological Information Center for use in real-time operations, near-real-time operations, and research. Processing these historical data is time-consuming, so some new methods are implemented.
A cluster-mode platform is built on Hadoop, and a variety of statistical methods are implemented with the MapReduce computation model. The storage format of the source data is changed to SequenceFile, which stores serialized <Key, Value> pairs; in this way, multiple Format-A files are merged into one large SequenceFile to test how computational efficiency changes. In addition, many small files are merged into larger files. The configuration of the Hadoop cluster environment is modified experimentally, and computational efficiency is recorded for different numbers of task nodes.
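The small-file consolidation and MapReduce-style statistics described in the abstract can be illustrated with a minimal single-process Python sketch. It is not the authors' Hadoop implementation: the record layout (`station_id,year,temperature`), the file names, the station identifiers, and the helper functions `pack_small_files`, `map_records`, and `reduce_mean` are all hypothetical, and an in-memory list of (key, value) pairs merely stands in for a SequenceFile.

```python
from collections import defaultdict


def pack_small_files(files):
    """Merge many small files into one list of (key, value) pairs.

    This mimics the idea behind a SequenceFile: key = file name,
    value = file content. `files` is a dict {file_name: text};
    the names used below are hypothetical.
    """
    return [(name, text) for name, text in sorted(files.items())]


def map_records(packed):
    """Map phase: emit (station_id, temperature) for every data line."""
    for _name, text in packed:
        for line in text.splitlines():
            if not line.strip():
                continue  # skip blank lines
            station_id, _year, temperature = line.split(",")
            yield station_id, float(temperature)


def reduce_mean(pairs):
    """Shuffle and reduce phase: group values by key, then average them."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) / len(values) for key, values in groups.items()}


# Hypothetical input: one small file per station and year.
small_files = {
    "S001_1981.txt": "S001,1981,12.0\n",
    "S001_1982.txt": "S001,1982,13.0\n",
    "S002_1981.txt": "S002,1981,16.5\n",
    "S002_1982.txt": "S002,1982,17.5\n",
}

packed = pack_small_files(small_files)           # one container, four records
station_means = reduce_mean(map_records(packed))  # mean temperature per station
print(station_means)
```

On a real cluster the packed container would be written to HDFS and the map and reduce functions would run as distributed tasks. The sketch only shows why one large keyed container is easier to process than many small files: a single reader iterates over all records.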
Key words:
- MapReduce;
- cloud computing;
- Hadoop;
- meteorological data processing
Table 1 Configuration of the host machines on the physical cloud platform

| No. | Operating system | CPU cores | Memory | Network |
|---|---|---|---|---|
| 1 | SUSE 10 (x86_64) | 16 (2.27 GHz) | 16 GB | Gigabit |
| 2 | SUSE 10 (x86_64) | 16 (2.27 GHz) | 16 GB | Gigabit |
| 3 | SUSE 11 (x86_64) | 8 (2.0 GHz) | 16 GB | Gigabit |
| 4 | Redhat 6.3 Beta | 8 (2.0 GHz) | 16 GB | Gigabit |
| 5 | Redhat 6.3 Beta | 8 (2.0 GHz) | 16 GB | Gigabit |

Table 2 Experimental results for different storage structures and data file sizes (unit: s)

| Data storage structure | 5 nodes | 6 nodes | 7 nodes | 8 nodes | 9 nodes | 10 nodes |
|---|---|---|---|---|---|---|
| Original files | 36720 | 31310 | 26279 | 22285 | 19494 | 17817 |
| 10 files merged | 3945 | 3170 | 2663 | 2278 | 2006 | 1805 |
| 100 files merged | 535 | 442 | 389 | 342 | 316 | 312 |
| SequenceFile | 166 | 158 | 123 | 110 | 107 | 94 |
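To make the Table 2 timings easier to compare, the following short Python sketch derives two ratios from them: the speedup of each storage layout relative to the original files at the same node count, and the ratio of 5-node time to 10-node time for each layout (2.0 would correspond to ideal scaling when the node count doubles). The timings are copied from Table 2; the dictionary keys and the function names `speedup_vs_original` and `node_scaling` are illustrative choices, not names from the paper.

```python
# Elapsed times in seconds, copied from Table 2 (columns: 5 to 10 nodes).
nodes = [5, 6, 7, 8, 9, 10]
times = {
    "original": [36720, 31310, 26279, 22285, 19494, 17817],
    "merge_10": [3945, 3170, 2663, 2278, 2006, 1805],
    "merge_100": [535, 442, 389, 342, 316, 312],
    "sequencefile": [166, 158, 123, 110, 107, 94],
}


def speedup_vs_original(layout):
    """Speedup of `layout` over the original files, one value per node count."""
    return [orig / t for orig, t in zip(times["original"], times[layout])]


def node_scaling(layout):
    """Ratio of the 5-node time to the 10-node time (ideal value: 2.0)."""
    return times[layout][0] / times[layout][-1]


for layout in times:
    speedups = ", ".join(f"{s:.1f}" for s in speedup_vs_original(layout))
    print(f"{layout:>12}: speedup vs original = [{speedups}], "
          f"5->10 node scaling = {node_scaling(layout):.2f}")
```

With the Table 2 values, the SequenceFile layout comes out roughly 221 times faster than the original files at 5 nodes and about 190 times faster at 10 nodes. Doubling the node count roughly halves the run time for the original files but yields a smaller gain (a ratio of about 1.7 to 1.8) for the 100-file merge and SequenceFile layouts.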