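Zhang Yiran, Chen Long, An Xiangzhe, Yan Shengen. Study on Performance Optimization of Reduction Algorithm Targeting GPU Computing Platform [J]. Computer Science, 2019, 46(2): 306-309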
Study on Performance Optimization of Reduction Algorithm Targeting GPU Computing Platform
Received: 2018-09-12  Revised: 2018-11-20
DOI:
Keywords: Reduction algorithm, GPU, Intra-thread reduction, OpenCL
Funding:
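Author    Affiliation
Zhang Yiran    Beijing Information Science and Technology University, Beijing 100049
Chen Long    BGP Inc., China National Petroleum Corporation, Zhuozhou, Hebei 072751
An Xiangzhe    BGP Inc., China National Petroleum Corporation, Zhuozhou, Hebei 072751
Yan Shengen    Shenzhen SenseTime Technology Co., Ltd., Shenzhen, Guangdong 518000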
Abstract:
      Reduction is widely used in scientific computing, image processing and other fields, and it is one of the basic algorithms of parallel computing, so accelerating it is of practical importance. To fully exploit the computing power of GPUs on heterogeneous platforms, this paper proposes a reduction optimization method that works at three levels: intra-thread reduction, intra-work-group reduction and inter-work-group reduction. Unlike previous work, which concentrated its optimization effort on intra-work-group reduction, the paper argues that intra-thread reduction is the real bottleneck of the reduction algorithm. Experimental results show that, across different data sizes, the proposed algorithm achieves speedups of 3.91~15.93 on the AMD W8000 and 2.97~20.24 on the NVIDIA Tesla K20M over the carefully optimized CPU version of the OpenCV library. On the NVIDIA Tesla K20M, it achieves speedups of 2.25~5.97 and 1.25~1.75 over the CUDA and OpenCL versions of the OpenCV library, respectively. On the AMD W8000, it achieves a speedup of 1.24~5.15 over the OpenCL version of the OpenCV library. The work thus delivers both high performance for reduction on GPU platforms and performance portability across different GPU computing platforms.
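The three levels named in the abstract (reduction inside each thread, reduction inside each work-group, and reduction across work-groups) can be illustrated with a small serial simulation. This is only a sketch under stated assumptions, not the authors' OpenCL kernel: the operator is assumed to be a sum, the function names (in_thread_reduce, work_group_reduce, reduce_sum) and the default sizes are hypothetical, and plain Python loops stand in for work-items that a GPU would run in parallel.

```python
from typing import List, Sequence


def in_thread_reduce(data: Sequence[int], global_id: int, global_size: int) -> int:
    """Level 1: one simulated work-item serially accumulates every
    global_size-th element, starting at its own global id."""
    acc = 0
    for i in range(global_id, len(data), global_size):
        acc += data[i]
    return acc


def work_group_reduce(local_vals: Sequence[int]) -> int:
    """Level 2: tree reduction over the per-thread partial sums of one
    work-group (the list stands in for local memory)."""
    vals = list(local_vals)
    stride = len(vals) // 2
    while stride > 0:
        # On a GPU these additions run in parallel, with a barrier per step.
        for lid in range(stride):
            vals[lid] += vals[lid + stride]
        stride //= 2
    return vals[0]


def reduce_sum(data: Sequence[int], num_groups: int = 4, local_size: int = 8) -> int:
    """Three-level sum; local_size is assumed to be a power of two."""
    assert local_size > 0 and local_size & (local_size - 1) == 0
    global_size = num_groups * local_size
    partials: List[int] = []
    for group in range(num_groups):
        local_vals = [
            in_thread_reduce(data, group * local_size + lid, global_size)
            for lid in range(local_size)
        ]
        partials.append(work_group_reduce(local_vals))
    # Level 3: combine the one partial result produced by each work-group.
    return sum(partials)
```

With 1,000 input elements and the default 32 simulated work-items, each work-item accumulates 31 or 32 elements serially in level 1, while levels 2 and 3 together handle only 32 partial values. That imbalance is one way to read the abstract's point that most of the work, and therefore most of the optimization opportunity, sits in the per-thread stage.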