10 Nov 2014 Santa Clara, Livermore - Modern CPU and GPU processors integrate SIMD execution units (also called VPUs, Vector Processing Units) on die to achieve higher performance and power efficiency, but using this SIMD hardware effectively remains a challenge. Wide vector registers and SIMD instructions (single instructions that operate on multiple data elements packed in wide registers, as in AltiVec, SSE, AVX and MIC) pose a compilation challenge that is greatly eased through programmer hints. While many applications are implemented using OpenMP [13, 17], a widely accepted industry standard for exploiting thread-level parallelism, to leverage the full potential of today's multi-core architectures, no industry standard has offered any means to express SIMD parallelism. Instead, each compiler vendor has provided its own vendor-specific hints for exploiting vector parallelism, or programmers have relied on the compiler's automatic vectorization capability, which is known to be limited because many program factors are unknown at compile time.
To alleviate the situation for programmers, the OpenMP language committee added SIMD constructs to OpenMP to support vector-level parallelism. These new constructs are standardized, so programmers no longer need to use non-portable, vendor-specific vectorization intrinsics or directives. In addition, the SIMD constructs give the compiler additional knowledge about the code structure and allow for better vectorization that blends well with parallelization. To the best of our knowledge, the OpenMP 4.0 specification is the first industry standard that includes explicit vector programming constructs for programmers.
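As a rough illustration of what such a construct looks like in C, the sketch below applies the OpenMP 4.0 `simd` directive to a simple loop. The function name `saxpy` and its parameters are chosen here for illustration and are not taken from the paper; the `restrict` qualifiers reflect an assumption that the two arrays do not overlap, which the loop needs in order to be safely vectorized.

```c
#include <stddef.h>

/* Illustrative helper (hypothetical name, not from the paper):
 * computes y[i] = a * x[i] + y[i] for i in [0, n).
 * The simd construct tells the compiler the iterations may be
 * executed together using SIMD instructions. The caller is assumed
 * to pass non-overlapping arrays, as the restrict qualifiers state. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    #pragma omp simd
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

A compiler without OpenMP support, or one invoked without its OpenMP option, typically ignores the pragma, so the loop still produces the same values and only the vectorization hint is lost.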
This paper describes the C/C++ and Fortran SIMD extensions for explicit vector programming available in the OpenMP 4.0 specification. We explain the semantics of SIMD constructs and clauses with simple examples. In addition, Sections 3 and 4 provide a set of explicit vector programming guidelines and programming examples to help programmers write efficient SIMD programs that achieve higher performance. Section 5 presents a case study of achieving a ~2000x performance speedup using OpenMP 4.0 PARALLEL and SIMD constructs on Intel Xeon Phi coprocessors. Section 6 summarizes this paper.
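To give a sense of how thread-level and vector-level parallelism can be combined, the sketch below uses the OpenMP 4.0 combined `parallel for simd` construct with a `reduction` clause. It is not the code from the paper's case study; the function name `dot` and its parameters are assumptions made for this example.

```c
/* Illustrative helper (hypothetical name, not from the paper):
 * returns the dot product of x and y over n elements.
 * "parallel for simd" splits the iterations across threads and
 * vectorizes each thread's share; reduction(+:sum) gives every
 * thread and SIMD lane a private partial sum that is combined
 * into sum at the end of the loop. */
double dot(int n, const double *x, const double *y)
{
    double sum = 0.0;
    #pragma omp parallel for simd reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

Because the reduction may add the partial sums in a different order than a serial loop would, floating-point results can differ slightly from run to run for general inputs.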
Keywords: OpenMP, Explicit Vectorization, SIMD programming model, Multicore
The complete article can be downloaded at http://primeurmagazine.com/repository/PrimeurMagazine-AE-PR-12-14-32.pdf
Xinmin Tian (1), Bronis R. de Supinski (2)
(1) Intel Corporation, Santa Clara, California, USA
(2) Lawrence Livermore National Laboratory, Livermore, California, USA
OpenMP Architecture Review Board (ARB)
Email: email@example.com, firstname.lastname@example.org