Design and Implementation of a Data-Computation Integrated Spatial Analysis Database for Meteorological Gridded Data

Wang Shu; Xu Yongjun; He Wenchun; Wu Huanping; Gao Feng; Liu Yuanyuan; Liu Bei; Lü Guanru; Ni Xuelei

doi:10.11898/1001-7313.20250112

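Summary: Meteorological gridded data are usually stored as files in distributed file repositories. To use them, operational systems must first download the files and parse them before any analysis or computation can take place. This makes data retrieval difficult and response times long, and it cannot satisfy operational needs for online computation and interactive applications. To address this, at the end of 2022 the National Meteorological Information Center completed PostGrid, a database built on the Tianqing Spatial Analysis Library that integrates meteorological gridded data with computation in a distributed environment. The database consists of a data layer and an operator layer. The data layer splits meteorological gridded data along dimensions such as element, forecast start time, forecast lead time, space, level, and sample, then stores the pieces in a unified, standardized form, which improves the efficiency of reading and analyzing data. The operator layer is implemented as SQL functions in the database; it supports a variety of operations on gridded data inside the database, and the operators support distributed parallel computing. Performance tests and operational applications show that PostGrid reduces the response time of traditional aggregation computation services from minutes to milliseconds, greatly improving the performance, flexibility, and data-computation integration of meteorological gridded data services, and that it has broad application value.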

Abstract: Meteorological gridded data are typically stored as files in distributed file repositories such as network-attached storage (NAS). To use the data, operational systems must download the files, parse them, and only then perform analysis and computation. This approach makes data retrieval difficult, prolongs response times, and cannot meet the demand for online computation and interactive applications. To address these issues, the National Meteorological Information Center developed PostGrid, a database that integrates meteorological gridded data with computation. PostGrid is built on the Tianqing Spatial Analysis Library and is designed for distributed environments. It consists of two layers: a data layer and an operator layer. The data layer stores the various types of meteorological gridded data. On import, data are stored in a standardized, uniform manner. Each dataset comprises two components, a header file and entity data, both stored in binary format. The header file contains basic descriptive information about the gridded data, while the entity data hold the specific layers or fields obtained by partitioning the original gridded dataset. By organizing data along dimensions such as meteorological element, forecast start time, forecast lead time, space, level, and sample, the data layer lets the database read and analyze gridded data more efficiently than file-based access. The operator layer is implemented as SQL functions within the database. These operators support a range of operations on gridded data, including matrix calculation, spatial analysis, statistical aggregation, dimensionality reduction, and data filtering.
Furthermore, the operators support distributed parallel computing, which speeds up the processing of large datasets. Performance tests and operational applications show that PostGrid reduces the response time of traditional aggregation calculations from minutes to milliseconds, substantially improving the performance and flexibility of meteorological gridded data services. Integrating data and computation within a single platform enables faster data retrieval, online computation, and more advanced interactive applications, giving PostGrid broad application potential across meteorological services.
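
The storage idea described in the abstract can be illustrated with a minimal sketch. It is not PostGrid's implementation or API: PostGrid's operators are SQL functions inside the database, whereas this is a pure-Python, in-memory stand-in. Every name here (`GridStore`, `ingest`, `box_mean`, the record key layout, the `"TEM"` element code) is a hypothetical choice made for illustration. The sketch splits each field into spatial tiles, stores every tile as a small binary record keyed by element, forecast start time, lead time, level and sample, and then computes an aggregate that decodes only the tiles intersecting the requested box.

```python
import struct
from statistics import fmean


def split_into_tiles(grid, tile_rows, tile_cols):
    """Split a 2-D grid (list of rows) into tiles keyed by their (row0, col0) origin."""
    tiles = {}
    for r0 in range(0, len(grid), tile_rows):
        for c0 in range(0, len(grid[0]), tile_cols):
            tiles[(r0, c0)] = [row[c0:c0 + tile_cols] for row in grid[r0:r0 + tile_rows]]
    return tiles


def pack_tile(tile):
    """Serialize a tile: two uint16 values (rows, cols) followed by float32 data."""
    rows, cols = len(tile), len(tile[0])
    flat = [v for row in tile for v in row]
    return struct.pack(f"<HH{rows * cols}f", rows, cols, *flat)


def unpack_tile(blob):
    """Inverse of pack_tile."""
    rows, cols = struct.unpack_from("<HH", blob)
    flat = struct.unpack_from(f"<{rows * cols}f", blob, 4)
    return [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]


class GridStore:
    """Hypothetical in-memory stand-in for a tiled data layer (one binary record per tile)."""

    def __init__(self, tile_rows, tile_cols):
        self.tile_rows, self.tile_cols = tile_rows, tile_cols
        # key: (element, init_time, lead_hour, level, member, row0, col0) -> bytes
        self.records = {}

    def ingest(self, element, init_time, lead_hour, level, member, grid):
        tiles = split_into_tiles(grid, self.tile_rows, self.tile_cols)
        for (r0, c0), tile in tiles.items():
            key = (element, init_time, lead_hour, level, member, r0, c0)
            self.records[key] = pack_tile(tile)

    def box_mean(self, element, init_time, level, member, rows, cols):
        """Mean over a half-open row/column box, computed separately for each lead hour."""
        (r_lo, r_hi), (c_lo, c_hi) = rows, cols
        per_lead = {}
        # A linear scan keeps the sketch short; a real database would use an index here.
        for (el, it, lead, lv, mb, r0, c0), blob in self.records.items():
            if (el, it, lv, mb) != (element, init_time, level, member):
                continue
            if (r0 >= r_hi or r0 + self.tile_rows <= r_lo
                    or c0 >= c_hi or c0 + self.tile_cols <= c_lo):
                continue  # tile lies entirely outside the box, so it is never decoded
            for i, row in enumerate(unpack_tile(blob)):
                for j, value in enumerate(row):
                    if r_lo <= r0 + i < r_hi and c_lo <= c0 + j < c_hi:
                        per_lead.setdefault(lead, []).append(value)
        return {lead: fmean(vals) for lead, vals in sorted(per_lead.items())}


# Usage: two lead hours of a synthetic 4x4 field, stored as 2x2 tiles.
store = GridStore(tile_rows=2, tile_cols=2)
for lead in (0, 3):
    field = [[float(lead + r * 4 + c) for c in range(4)] for r in range(4)]
    store.ingest("TEM", "2025010100", lead, 850, 0, field)

result = store.box_mean("TEM", "2025010100", 850, 0, rows=(1, 3), cols=(1, 3))
print(result)
```

Because tiles are independent records, the per-tile work in `box_mean` could in principle be distributed across workers and combined afterwards, which is the property the abstract attributes to the operator layer; the sketch itself runs serially.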


Design and Application of a Data-computation Integrated Database for Meteorological Grid Data