Many scientists today use sophisticated data-intensive approaches to combine and analyze large data sets from scientific instruments and data stores all over the country. While these techniques hold great value for discovery and innovation, integrating the necessary data and tools into a scientist's workflow is often a complex undertaking. In addition, errors can be introduced when data is moved or analyzed; if those errors go undetected, they can compromise the science.
With the two new grants, NSF's Office of Advanced Cyberinfrastructure is supporting RENCI and its collaborators in developing a high-performance, end-to-end platform to facilitate the use and processing of scientific data, and a system to detect and diagnose unintentional data errors. The ultimate aim of both projects is to improve the productivity of America's scientists and increase confidence in the outcomes of their work.
NSF announced an award of $999,575 over three years to support Integrity Introspection for Scientific Workflows (IRIS), a project to develop a seamless system to uncover unintentional data errors. While previous work has focused on catching malicious hackers or software bugs, IRIS will be the first to specifically address the detection of unintentional - and often unseen - errors that can be introduced when working with Big Data.
"Like a game of telephone, sometimes the data that goes in one end of the process is not exactly the same as the data that comes out the other end", stated RENCI research scientist Anirban Mandal, the project's Principal Investigator. "Some of these errors can go undetected and as a result the scientific analysis or simulation gives wrong answers, and the scientist may not even be aware of that. Many of these errors are not very obvious, and it takes complex analysis to catch them and find the root cause."
RENCI, the project's lead institution, will receive $400,000. Collaborators include Von Welch of Indiana University's Center for Applied Cybersecurity Research and Ewa Deelman of the University of Southern California Information Sciences Institute's Science Automation Technologies.
The project will use machine learning techniques to monitor data at different points in the research workflow, identify unintentional errors and trace their origins. Once developed, the error detection and analysis tool will be integrated with Pegasus, software widely used by scientists to manage computations on Big Data as well as its movement and use. The integration will let scientists readily add automatic error detection to their scientific analysis processes. IRIS researchers will use data and tools from scientific applications in gravitational wave physics, earthquake science and bioinformatics to develop, validate and test their methods.
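To make the idea of checking data at several points in a workflow and tracing an error to its origin more concrete, here is a minimal sketch in Python. It is not the IRIS method, which the project describes as based on machine learning, and it does not use any Pegasus interface. It swaps in a much simpler technique, comparing SHA-256 checksums before and after each step, and the stage names, sample data and the `first_corrupted_stage` helper are all hypothetical.

```python
import hashlib
from typing import List, Optional, Tuple


def digest(data: bytes) -> str:
    """Return the SHA-256 checksum of a block of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def first_corrupted_stage(
    stages: List[Tuple[str, bytes, bytes]]
) -> Optional[str]:
    """Return the name of the first stage whose output differs from its input.

    Each entry is (stage name, data sent into the stage, data received
    from it). The stage names and layout are hypothetical, chosen only
    for illustration. Returns None when every stage preserved the data.
    """
    for name, sent, received in stages:
        if digest(sent) != digest(received):
            return name
    return None


if __name__ == "__main__":
    original = b"detector readings 0.12 0.15 0.11"
    altered = b"detector readings 0.12 0.95 0.11"  # one value silently changed

    # Hypothetical three-hop workflow: the second hop alters the data,
    # and the third hop faithfully passes the altered data along.
    workflow = [
        ("instrument to archive", original, original),
        ("archive to compute site", original, altered),
        ("compute site to results store", altered, altered),
    ]
    print(first_corrupted_stage(workflow))
```

A check like this only catches changes to data that is supposed to stay identical between two points. Errors introduced during analysis, where the data is expected to change, are harder to spot, which is the kind of problem the project's learning-based approach is meant to address.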
"We're using these domain applications to drive our computer science research, and the outcomes of the computer science research will feed back into the domain sciences", stated Anirban Mandal. "While the focus on detecting and analyzing unintentional errors is an angle that we haven't explored before, the project builds on our current work in the NSF SWIP project, research conducted in the NRIG research group, and it leverages RENCI's long history of supporting domain scientists' ability to use distributed large-scale computational infrastructure."
The second project, Delivering a Dynamic Network-centric Platform for Data-driven Science (DyNamo), also uses Pegasus as a platform. In this case, the aim is to help scientists take better advantage of available resources and high-performance networks for working with large data sets scattered across facilities around the country. The award, announced in July, provides a total of $1 million over two years.
"Many distributed computing resources and high-speed networks currently exist but are not used effectively because they're too hard to use", stated Anirban Mandal, the project's Principal Investigator. "To make it easier for scientists to use these resources, we will build a platform that bridges the gap between the capabilities that are out there and what scientists can realistically integrate into their workflows."
RENCI, the project's lead institution, will receive $375,000. Co-Principal Investigators include Ewa Deelman of the University of Southern California, Michael Zink of the University of Massachusetts at Amherst, Cong Wang of RENCI, and Ivan Rodero of Rutgers.
The fundamental idea behind DyNamo is that scientists should be able to spend more time doing science and less time figuring out how to navigate complicated systems for accessing, moving and analyzing data. The project will develop new algorithms, policies and mechanisms to coordinate access to data repositories and facilitate their use.
The team will work with domain scientists from the Collaborative Adaptive Sensing of the Atmosphere (CASA) and Ocean Observatories Initiative (OOI) communities to develop and test an end-to-end system that can be seamlessly integrated into Pegasus. If it works for CASA and OOI researchers, who depend heavily on scientific instruments, Big Data sets and simulations to study the atmosphere and oceans, respectively, then it will be well positioned to work for many other scientific domains.
The researchers will test new algorithms and models in real time with live streaming data, which is currently not possible in many scientific domains, including those served by CASA and OOI. By serving as testbeds for DyNamo, the CASA and OOI scientific communities will benefit from its capabilities early on, potentially speeding advances in weather forecasting and improving understanding of how the oceans impact ecosystems, the atmosphere and human society.