30 Aug 2017 Austin - Data analysis is central to nearly every field of research these days - not to mention finance, government, and much of modern life. Our capacity to capture and generate data that describe the intricate workings of the world requires tools to synthesize that data and turn it into insights - whether for making business decisions, personalizing health care, or understanding plate tectonics.
One of the most popular data analysis tools on the market today is R, a free, powerful, open source software package with extensive statistical computing and graphics capabilities. Designed in the 1990s, it has proven useful for a range of research problems and has accumulated many highly reusable components tailored to popular data analytic methods or designed to address specific domain problems directly.
"For many academics seeking to wean themselves off commercial software, R is the data-analysis tool of choice", data scientist and journalist Sylvia Tippmann wrote in Nature in 2014.
The fact that R was developed more than two decades ago, however, means it was not designed to take advantage of modern computer infrastructure, including multiple cores, parallel coprocessors, and multi-node computing clusters. This infrastructure can speed up data analysis and allow users to investigate much larger datasets in greater depth. Fortunately, R's vast user community has developed many packages to address these shortcomings and enable R to keep pace with the growth of data analytic problem sizes.
Researchers at the Texas Advanced Computing Center (TACC) recognized the growing use of R by the scientists it supported, and the lost opportunities caused by the frequent serial execution of the tool. They set about benchmarking current approaches that can enable R to run parallel and distributed computations. These include both single-node parallelism - which uses multiple processing cores within a computer system - and multi-node parallelism - which uses computing clusters like those designed and operated by TACC.
They went further and performed a detailed performance study to determine the benefits of using accelerators with R - in particular with new many-core architectures.
Results of their study were included in "Conquering Big Data with High Performance Computing", edited by Ritu Arora and published by Springer in 2016.
"Although R is clearly a 'high productivity' language, high performance has not been a development goal of R", the authors, Weijia Xu, Ruizhu Huang, Hui Zhang, Yaakoub El-Khamra, and David Walling, wrote. "On the contrary, R-based analysis solutions often have the reputation of lacking the efficiency required for large-scale computations. As Big Data becomes a norm in almost every scientific field, many inherent issues hindering the programming execution efficiency become real challenges and roadblocks that prevent an existing solution [from remaining] viable with increasing volume of data and increasing complexity of the analysis in practice."
R is a scripting language: rather than being compiled ahead of time, R scripts - series of commands - are translated and executed by an interpreter at run time.
R's inefficiency can come both from the language specification itself and from the way the interpreter is implemented. Designed as a computing language with high-level expressiveness - which means it's easy to write code that's easy to understand - R originally lacked much of the fine-grained control and basic constructs needed for highly efficient code development.
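The interpreter's overhead is easy to see in a short benchmark. The sketch below is illustrative (the vector size is arbitrary): it compares an explicit loop, where every iteration passes through the interpreter, against the equivalent vectorized built-in, which runs in compiled C code.

```r
# Summing a vector two ways: interpreted loop vs. vectorized built-in
x <- rnorm(1e6)

loop_time <- system.time({
  s <- 0
  for (v in x) s <- s + v      # each iteration goes through the interpreter
})

vec_time <- system.time(s2 <- sum(x))  # one call into compiled C code

# The vectorized version is typically orders of magnitude faster,
# which is why "vectorize your code" is the first advice given to R users.
```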
But over time, many of these limitations have been overcome. For those hoping to maximize the efficiency of R on today's computers, or on computing clusters, more than 70 packages are available for operating in parallel.
Some of these packages - such as Rmpi and SNOW - are designed so users can control the parallelization explicitly; some provide implicit parallelism, abstracting the parallelization away from the user, as in multi-core computing; others - for example, snowfall and foreach - are high-level wrappers for other packages, intended to ease the use of parallelism.
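As a minimal illustration of the explicit style, base R's parallel package (which grew out of SNOW and multicore) lets the user create a cluster of worker processes and distribute an apply-style call across them. The worker count and the toy workload here are arbitrary choices for the sketch:

```r
library(parallel)

# Explicitly create a small cluster of worker processes (SNOW-style)
cl <- makeCluster(2)

# Distribute the function calls across the workers
squares <- parSapply(cl, 1:8, function(x) x^2)

stopCluster(cl)  # always release the workers when done
print(squares)
```

The same pattern scales from a laptop's cores to an MPI-backed cluster by changing how the cluster object is created, which is what makes the explicit-control packages attractive for HPC work.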
"Taking advantage of parallelism in R can provide immediate speed-ups, allowing researchers to solve more problems, tackle larger datasets or arrive at solutions faster", Weijia Xu stated.
But choosing the right package for the right problem, or simply knowing that a variety of options exist, remains a challenge.
Packages for parallel computing with R had been around for a while, but the opportunities offered by many-core processors like the Intel Xeon Phi are relatively new.
The Xeon Phi is a lightweight x86 processor released by Intel in 2013 that contains more than 60 cores on a single chip. It offers superb parallel processing capabilities for problems that can take advantage of all of its cores.
A critical advantage of the Xeon Phi coprocessor is that, unlike GPU-based coprocessors, it runs the standard x86 instruction set, so the same program code can be used efficiently on the host and the co-processor. Users can offload parts of the computing workload to the Xeon Phi, or use the co-processor to run programs independently.
The TACC team tested the use of Xeon Phi for parallel, high-performance R on a number of test cases, including ones related to disease outbreak data and multi-dimensional educational assessments. They found that the performance improvement varied depending on the type of workload and input data size.
With linear algebra related operations, such as matrix transformations, inverse matrices and matrix cross products, they observed a significant speed up, leading to an approximately 10 percent performance improvement. They found that it was optimal to offload about 70 percent of the workload to the Xeon Phi for these types of operations, with the remainder computed on the host core.
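A dense cross product is exactly the kind of call that benefits. When R is linked against a multithreaded BLAS such as Intel MKL, the one-line R call below runs in parallel with no code changes, and on a suitably configured system MKL can additionally offload part of the computation to a Xeon Phi. This is a hedged sketch with an arbitrary matrix size, not the study's actual benchmark code:

```r
# A dense linear-algebra kernel of the kind the study benchmarked.
# With R linked against a threaded BLAS (e.g. Intel MKL), this single
# call fans out across the host cores automatically; MKL's automatic
# offload can additionally push a share of the work to a coprocessor.
n <- 1000
m <- matrix(rnorm(n * n), nrow = n)

elapsed <- system.time(cp <- crossprod(m))  # computes t(m) %*% m
print(elapsed)
```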
"We are encouraged by these initial results and feel there is great potential to utilize hybrid parallel models, running across multiple nodes and multiple Xeon Phi coprocessors, in tackling large-scale data problems with R", the authors wrote.
From their analysis, the authors concluded that the key challenge to enabling high-performance R is not a shortage of software packages that can take advantage of state-of-the-art computing hardware, but a lack of accessible support to help end users apply the right tools.
Different types of problems, and even distinct parts within a given analysis, have the potential to use different types of parallelism. Selecting the most suitable R parallelization package requires a thorough knowledge of the software and a deep understanding of the problem, which is often missing.
To advance the entire community of R users, what was needed, they concluded, were tools to help researchers understand which parts of their codes could be effectively run in parallel and the right package to use to take advantage of those opportunities.
"Many Big Data problems are no longer just algorithmic problems", they wrote. "The solutions to Big Data problems therefore require not only new novel algorithms but also appropriate system support that are capable of executing the developed solutions."
TACC is addressing this problem, in part, by offering regular training in high-performance R for a wide range of scientists. In the past two years more than 250 people have participated in workshops that provided guidance on making the best use of advanced computing resources with R.
Courses focus on showing researchers the right libraries and functions to use to take advantage of HPC environments. Participants learn how to efficiently read and write large amounts of data, how to transform that data in a performant manner using optimized math kernels, and how to effectively use R in the batch environments on TACC systems and other supercomputers.
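For the I/O piece, one commonly taught pattern - illustrative here, not taken from TACC's course materials - is to replace base read.csv/write.csv with the multi-threaded readers and writers in the data.table package:

```r
library(data.table)

# data.table's fread/fwrite use multiple cores and are typically far
# faster than base read.csv/write.csv on large files. mtcars and a
# temporary file stand in for a real dataset here.
tmp <- tempfile(fileext = ".csv")
fwrite(as.data.table(mtcars), tmp)   # multi-threaded write
dt <- fread(tmp)                     # multi-threaded read
print(nrow(dt))
```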
"Demand for R and other data-centric languages and tools is increasing, and we are addressing that need by adding more training in data analytics", stated Luke Wilson, Director of Training and Professional Development at TACC. "Last year we added a dedicated Data and Information Analytics summer institute, and this spring we expanded our data analytics training to not only cover R, but Hadoop/Spark, Python with matplotlib and pandas, and Scala."
TACC is also considering ways to engineer code analysis tools that can help researchers determine the best packages to use with their codes.
In the meantime, TACC has made SparkR available on its resources. SparkR is a package that allows data scientists to use Spark's distributed computation engine to interactively run large-scale data analyses. It acts as a wrapper that implements various forms of data parallelism for R code and potentially lowers the barrier to entry for those looking to employ high-performance R.
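A minimal SparkR session might look like the sketch below. It assumes a working Spark installation; the built-in faithful dataset and the filter condition are arbitrary stand-ins:

```r
library(SparkR)

# Start (or connect to) a Spark session
sparkR.session(appName = "sparkr-example")

# Convert a local data frame into a distributed Spark DataFrame
df <- as.DataFrame(faithful)

# Familiar R-style operations execute on Spark's distributed engine
long_waits <- filter(df, df$waiting > 70)
head(summarize(groupBy(long_waits, long_waits$waiting),
               count = n(long_waits$waiting)))

sparkR.session.stop()
```

The appeal is that the code reads like ordinary R while the filtering and aggregation are planned and executed by Spark, so the same script can move from a sample on a laptop to the full dataset on a cluster.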
"Aside from the basic education and training required to achieving these goals, comprehensive analytic software packages and new programming environments like SparkR can also help in bringing those new technological advantages to the end user", stated Weijia Xu. "TACC is working to develop and deploy such tools, which will help thousands, even tens of thousands of researchers, transition their work to high-performance computers."