At the University of Illinois at Urbana-Champaign's National Center for Supercomputing Applications (NCSA), researchers routinely use computing power to dig for answers in piles of untamed data. NCSA Core Faculty member Peter Christensen, also an Assistant Professor of Agricultural and Consumer Economics, regularly tackles policy-driven research questions with his Big Data for Environmental Economics and Policy (BDEEP) team, investigating the ways that data science and economics can help inform, justify, and validate public policies, or in some cases, reveal underlying flaws. To support this work, Christensen's team has developed a machine learning-driven analytics pipeline to handle data collection, pre-processing, and statistical modeling for a range of projects that involve high-dimensional data and continuous acquisition.
Over the past year, collaborators Peter Christensen, Erica Myers and Mateus Souza of the College of Agricultural, Consumer and Environmental Sciences and Paul Francisco of the Indoor Climate Research & Training Institute have been using the BDEEP team's pipeline and NCSA's experimental data analytics compute cluster to analyze a large dataset of home performance measures. Their aim is to better understand the drivers of past savings and to provide insights into how to optimize benefits in the future.
The project's first goal was to quantify disparities between predicted and realized savings. Savings are typically predicted using energy modeling software that incorporates information about housing structure and the upgrades being performed. However, predictions from those models are often inaccurate. With hundreds of program offerings (from adding insulation and updating furnaces to installing energy-efficient appliances and LED lightbulbs) performed in various combinations on a variety of home types, parsing out nuanced effects can be difficult. Using before-and-after utility bills, housing structure variables, and data for thousands of homes treated over the past decade, the researchers implemented a new machine learning method that can estimate and predict savings patterns that neither a human mind nor traditional engineering and statistical models would be able to identify.
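To make the comparison between predicted and realized savings concrete, here is a minimal Python sketch. It is not the researchers' method: the records, the field names, and the "realization rate" summary are all hypothetical, and a plain before-and-after difference ignores weather and other factors that a real analysis would have to control for.

```python
# Hypothetical records: annual energy use before and after an upgrade,
# plus the savings an engineering model predicted for that home.
# All values are invented for illustration.
homes = [
    {"id": "A", "before_kwh": 12000, "after_kwh": 10800, "predicted_savings_kwh": 2000},
    {"id": "B", "before_kwh": 9000, "after_kwh": 8400, "predicted_savings_kwh": 500},
    {"id": "C", "before_kwh": 15000, "after_kwh": 13500, "predicted_savings_kwh": 2500},
]


def realized_savings(home):
    """Naive realized savings: energy use before minus energy use after."""
    return home["before_kwh"] - home["after_kwh"]


def savings_gap(home):
    """Predicted minus realized savings; positive means the model overestimated."""
    return home["predicted_savings_kwh"] - realized_savings(home)


def realization_rate(records):
    """Share of predicted savings that actually showed up across all homes."""
    total_realized = sum(realized_savings(h) for h in records)
    total_predicted = sum(h["predicted_savings_kwh"] for h in records)
    return total_realized / total_predicted


gaps = {h["id"]: savings_gap(h) for h in homes}
rate = realization_rate(homes)
```

In this toy data the model overestimates savings for homes A and C and underestimates them for home B, while the overall realization rate is below 1, meaning realized savings fall short of predictions on average.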
Most of NCSA's computing systems are a good fit for a wide variety of uses, but the analytics cluster the researchers used was developed and optimized specifically for this kind of problem by NCSA Industry, Yifang Zhang of the NCSA Data Analytics team, and Christensen's BDEEP group. The multi-processor system runs SparkR, which lets users employ respected but computationally intensive machine learning algorithms implemented in the R programming language while also capitalizing on the distributed memory capabilities of the Spark platform. Christensen said that SparkR was "essential" for running machine learning models at scale: "These models can take weeks or longer to run without the ability to distribute tasks and run them in parallel, which Spark enables. Our distributed models were running more than ten times faster than on a standard HPC cluster, a big enough difference to transform how we approached the problem."
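The idea of distributing independent tasks and running them in parallel can be sketched in a few lines of Python. This is only an analogy for what Spark does across a cluster, not the team's SparkR code: `fit_model`, its toy scoring rule, and the configuration names are hypothetical stand-ins for an expensive model fit.

```python
from concurrent.futures import ThreadPoolExecutor


def fit_model(config):
    """Stand-in for an expensive model fit; returns (config name, score)."""
    # Hypothetical scoring rule, used only so the example runs instantly.
    score = 1.0 / (1 + abs(config["depth"] - 4))
    return config["name"], score


def fit_all(configs, max_workers=4):
    """Fit every configuration concurrently and collect the scores by name."""
    # The fits do not depend on each other, so they can be handed to
    # separate workers and gathered once all of them have finished.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(fit_model, configs))


configs = [{"name": f"depth_{d}", "depth": d} for d in (2, 4, 6, 8)]
scores = fit_all(configs)
best = max(scores, key=scores.get)
```

A thread pool keeps the sketch portable. For genuinely CPU-bound fits in Python, a process pool or a cluster framework such as Spark would be the usual choice, since the work then spreads across cores or machines.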
Mateus Souza, the PhD student who pioneered the implementation of machine learning on the platform, agreed: "NCSA's analytics cluster gave me the flexibility to be more creative as I tested and refined our algorithms and model configurations."
The researchers found that in many cases, current predictions were overestimating some benefits and underestimating others. On average, actual savings were falling short of predicted savings. Providing more accurate assessments of how energy-saving measures benefit particular homes can help policymakers, programs and homeowners focus their efforts and get the maximum energy savings from their investments.
Christensen predicted that machine learning models will soon be heavily used across industry and government for decision making, but acknowledged that they are not quite there yet, in part because model training can be very time-consuming, and sometimes very expensive, without the right infrastructure. Part of the challenge for researchers will be bringing the benefits of efficient machine learning to collaborators outside of compute-forward environments like NCSA. The team is ready for that challenge and others: in addition to this project, it has already used machine learning on NCSA's analytics cluster for another project, detecting racial discrimination in the US housing market.
Christensen said his group will continue to focus on addressing real-world public challenges with technological solutions: "We use these technologies to rethink economic and policy problems that have been around in our discipline for a long time, but haven't been tractable. We want to solve that kind of problem."