ARM reveals details of vector HPC extensions: The Scalable Vector Extension (SVE) for the ARMv8-A architecture


24 Aug 2016 Cambridge - At the Hot Chips conference, ARM's Nigel Stephens gave details of the new Scalable Vector Extension (SVE) for the ARMv8-A architecture. He summarized his presentation in a blog on the ARM website. ARM is significantly extending the vector processing capabilities associated with AArch64 (64-bit) execution in the ARM architecture, now and into the future, enabling implementation choices for vector lengths that scale from 128 to 2048 bits. High Performance Scientific Compute provides an excellent focus for the introduction of this technology and its associated ecosystem development. SVE features will enable advanced vectorizing compilers to extract more fine-grain parallelism from existing code and so reduce software deployment effort.

Stephens first provided some historical context. ARMv7 Advanced SIMD (aka the ARM NEON instructions) is ~12 years old, a technology originally intended to accelerate media processing tasks on the main processor. It operated on well-conditioned data in memory with fixed-point and single-precision floating-point elements in sixteen 128-bit vector registers. With the move to AArch64, NEON gained full IEEE double-precision float and 64-bit integer operations, and the register file grew to thirty-two 128-bit vector registers. These evolutionary changes made NEON a better compiler target for general-purpose compute. SVE is a complementary extension that does not replace NEON and was developed specifically for vectorization of HPC scientific workloads, Stephens said in summarizing the history.

HPC processing is required because of the Big Data involved. Immense amounts of data are being collected today in areas such as meteorology, geology, astronomy, quantum physics, fluid dynamics, and pharmaceutical research. Exascale computing is the target that many HPC systems aspire to over the next 5-10 years. In addition, advances in data analytics and in areas such as computer vision and machine learning are already raising the demand for parallel program execution, today and into the future.

Over the years, considerable research has gone into determining how best to extract more data-level parallelism from general-purpose programming languages such as C, C++ and Fortran. This has resulted in the inclusion of vectorization features such as gather load and scatter store, per-lane predication, and of course longer vectors.
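To make these terms concrete, the scalar C loop below is a minimal sketch of the kind of code that needs all three features before a compiler can vectorize it. The function name `scale_selected` and its parameters are hypothetical and do not come from the presentation; the sketch also assumes the entries of `idx` are distinct, so that iterations do not conflict.

```c
#include <stddef.h>

/* Hypothetical example: scale selected elements of data[] in place.
   Assumes all idx[i] are distinct and within bounds. */
void scale_selected(double *data, const int *idx, const double *w, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        /* Per-lane predication: only iterations with a non-zero weight
           do any work, so a vector version must mask off the others. */
        if (w[i] != 0.0) {
            /* Gather load: the read address depends on idx[i], so the
               elements are not contiguous in memory. */
            double v = data[idx[i]];
            /* Scatter store: the write goes back to the same
               non-contiguous location. */
            data[idx[i]] = v * w[i];
        }
    }
}
```

Without gather, scatter and predication in the instruction set, a compiler would typically have to leave a loop like this in scalar form.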

A key choice to make is the most appropriate vector length.

Rather than mandating a single vector length, SVE allows CPU designers to choose the most appropriate vector length for their application and market, from 128 bits up to 2048 bits per vector register. SVE also supports a vector-length agnostic (VLA) programming model that can adapt to the available vector length. Adoption of the VLA paradigm allows you to compile or hand-code your program for SVE once, and then run it at different implementation performance points, while avoiding the need to recompile or rewrite it when longer vectors appear in the future. This reduces deployment costs over the lifetime of the architecture; a program just works and executes wider and faster.
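The VLA idea can be illustrated with a portable C emulation. This is only a sketch under stated assumptions: it does not use SVE instructions or intrinsics, the function name `daxpy_vla` is hypothetical, and the vector length is passed in as the parameter `vl` (in doubles per vector), whereas on real hardware it would be fixed by the implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Emulated vector-length agnostic loop computing y = a*x + y.
   vl is a hypothetical stand-in for the hardware vector length,
   counted in 64-bit elements: 2 for 128-bit vectors, 32 for 2048-bit. */
void daxpy_vla(size_t vl, double a, const double *x, double *y, size_t n)
{
    assert(vl >= 2 && vl <= 32);
    /* Each outer iteration models one vector operation of vl lanes. */
    for (size_t i = 0; i < n; i += vl) {
        for (size_t k = 0; k < vl; ++k) {
            /* Lane predicate: lanes past the end of the array are
               inactive, so no separate scalar tail loop is needed. */
            if (i + k < n)
                y[i + k] = a * x[i + k] + y[i + k];
        }
    }
}
```

Calling `daxpy_vla` with `vl` set to 2 or to 32 gives the same result for any `n`; only the number of outer iterations changes, which is the property that lets one binary run on implementations with different vector lengths.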

Scientific workloads have traditionally been written to exploit as much data-level parallelism as possible, with careful use of OpenMP pragmas and other source code annotations. It's therefore relatively straightforward for a compiler to vectorize such code and make good use of a wider vector unit, Stephens notes. Supercomputers are also built with the wide, high-bandwidth memory systems necessary to feed a longer vector unit.
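As an illustration of such an annotation, the sketch below marks a dot-product loop with the OpenMP `simd` directive and a `reduction` clause. The function name `dot` is hypothetical, and the pragma only takes effect when the compiler is invoked with OpenMP support enabled (for example `-fopenmp` or `-fopenmp-simd` in GCC and Clang); otherwise it is ignored and the loop runs as ordinary scalar code.

```c
#include <stddef.h>

/* Hypothetical example of an annotated loop. The pragma tells the
   compiler that iterations may be executed in vector lanes and that
   sum is a reduction variable, so partial sums may be combined. */
double dot(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    #pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

Because the annotation states the programmer's intent explicitly, the compiler does not have to prove on its own that reordering the additions is acceptable.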

Stephens concludes: "So SVE also introduces novel features that begin to tackle some of the barriers to compiler vectorization. The general philosophy of SVE is to make it easier for a compiler to opportunistically vectorize code where it would not normally be possible or cost effective to do so."
Ad Emmen