In this release, the developers have included, for the first time, a compiler based on the LLVM compiler infrastructure, which complements the venerable Mercurium source-to-source compiler. This extended LLVM compiler is in beta, but it already supports most OmpSs-2 features when targeting the Nanos6 runtime system. Moreover, the LLVM OpenMP runtime distributed with the extended LLVM compiler has been modified to support the TAMPI library, which allows seamless use of non-blocking MPI calls inside OpenMP tasks.
This release is also the first one to support kernels specified with OpenACC pragmas. To that end, the Mercurium source-to-source compiler and the Nanos6 runtime have been extended to support a subset of the OpenACC pragmas and the PGI runtime API respectively. Moreover, the CUDA device has been refactored to include automatic data prefetching when CUDA Unified Memory is used. This version also includes support for cuBLAS and similar libraries.
In this release, the developers have modified the runtime to use the low-level API of jemalloc to improve the performance and scalability of small memory allocations inside the runtime. The CPU manager and scheduling infrastructure have been refactored to improve performance and scalability on many-core systems. The implementation of work-sharing tasks has been modified to better exploit data locality across task for instances. Finally, a new turbo variant of the runtime is available. This variant enables some processor floating-point optimizations as well as the discrete dependency system.
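As a rough illustration of the work-sharing tasks mentioned above, the following sketch annotates a loop with the OmpSs-2 `task for` construct. The function name `scale` and its parameters are hypothetical and are not taken from the release. The `inout(v[0;n])` clause uses the OmpSs-2 start;length array-section notation. A compiler without OmpSs-2 support ignores the pragmas and runs the loop sequentially with the same result.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: multiplies every element of v by factor. */
void scale(double *v, int n, double factor)
{
    assert(v != NULL || n == 0);

    /* One work-sharing task: the runtime may split the iteration
       space into chunks executed by several collaborating threads. */
    #pragma oss task for inout(v[0;n])
    for (int i = 0; i < n; ++i) {
        v[i] *= factor;
    }

    /* Wait for the task before the caller reads v again. */
    #pragma oss taskwait
}
```

How iterations are distributed among threads is decided by the runtime, so this sketch makes no assumption about chunk sizes or placement.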
Nanos6 has a new experimental lightweight tracing module that generates traces in the Common Trace Format (CTF). The module is lockless for most common cases and emits a minimalistic set of Nanos6 events with optional support for PAPI hardware counters. Future releases will support MPI and Linux kernel events. Nanos6 converts CTF traces to Paraver traces automatically, and these can be inspected using the new set of Paraver configurations provided.
In this release, the developers have extended the lock-free discrete dependency system to support weak, commutative and concurrent dependencies, so it now supports all OmpSs-2 dependency types except regions.
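To make two of these dependency types concrete, here is a minimal sketch assuming the standard OmpSs-2 clause names `commutative`, `weakinout` and `inout`. The function names `sum_commutative` and `add_nested` are hypothetical and chosen only for illustration. Both functions depend on whole scalar variables rather than on regions, which is the case the discrete dependency system covers. Without an OmpSs-2 compiler the pragmas are ignored and the code runs sequentially with the same result.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: sums n values with one task per element. */
long sum_commutative(const long *values, size_t n)
{
    assert(values != NULL || n == 0);

    long sum = 0;
    for (size_t i = 0; i < n; ++i) {
        /* commutative: the tasks may run in any order, but never
           two of them at the same time on sum. */
        #pragma oss task commutative(sum)
        sum += values[i];
    }
    #pragma oss taskwait
    return sum;
}

/* Hypothetical example: the outer task never touches total itself,
   so it declares only a weak dependency; the nested task performs
   the actual access with a strong dependency. */
long add_nested(long start, long delta)
{
    long total = start;
    #pragma oss task weakinout(total)
    {
        #pragma oss task inout(total)
        total += delta;
    }
    #pragma oss taskwait
    return total;
}
```

A concurrent dependency would look similar to the commutative case, except that the tasks may overlap in time, so the update itself would need to be protected by the programmer, for example with an atomic operation.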
Most of the new features in this release have been developed in the context of the DEEP-EST and Lo-Sync (PRACE-6IP) projects. The support for OpenACC kernels has been developed in the context of the EPEEC project.