Biologists, hydrologists, computer engineers and computer scientists will join forces with Alex Feltus and Melissa Smith to design a system called Scientific Data Analysis at Scale (SciDAS). Their goal is to help current researchers and future innovators discover data, move it smoothly across advanced networks and improve the flexibility and accessibility of national and global resources.
Alex Feltus is the principal investigator of the three-year project. Melissa Smith is a co-principal investigator, along with Claris Castillo and Ray Idaszak of Renaissance Computing Institute (RENCI) at the University of North Carolina, Chapel Hill; and Stephen Ficklin of Washington State University in Pullman.
"A key aspect of the SciDAS team is that we'll be processing scientific data at the same time that we're gluing together all the parts needed for a national cyberinfrastructure ecosystem", stated Alex Feltus, associate professor of genetics and biochemistry in Clemson University's College of Science. "We're trying to avoid the problem of 'if you build it they will come' and instead enlist the input of a variety of scientists to join us on the ground floor and help us build it. Thus, our software will be refined by using real data by real users with real habits."
Scientific discovery has become increasingly dependent on terascale (one trillion floating-point operations per second) and even petascale (one quadrillion per second) data processing that only the world's fastest supercomputers can perform. Fortunately, years of significant and strategic support from public and private sectors have created a distributed computational ecosystem to help meet these extraordinary demands. Available resources include high-speed networks like Internet2, open source scientific software packages, supercomputers in national labs, campus supercomputers, commercial cloud providers and deep data repositories like the National Center for Biotechnology Information. The Internet2 cyberteam will be assisting the research team in optimizing end-to-end data transfer rates.
"Many fields are awash with huge datasets. This is certainly true of biology and hydrology, but it also includes researchers who are studying satellite imagery, remote sensors and education analytics, to name a few", Alex Feltus stated. "Today's scientists are now required to understand both the underlying science and the cyberinfrastructure ecosystem to design and execute mind-bogglingly complex computations. SciDAS will combine new software with existing software to construct a system that will be efficient, practical and user-friendly."
SciDAS will enable a broad range of scientists not only to get information faster, but also to use much larger datasets and tease out information that they might not even know exists.
"The need for large data computing brings new challenges for scientists to be able to use complex systems efficiently and effectively", stated Melissa Smith, associate professor in the Holcombe Department of Electrical and Computer Engineering in Clemson's College of Engineering, Computing and Applied Sciences.
"My specialty is in computing architectures, application optimization and machine learning. Using these tools and techniques, we're going to be building an infrastructure that is easier for data scientists to manage. We have a good body of software and data repositories already in place that have been individually tried and tested. We're going to bring these components together and make their use seamless for the scientist across existing cyberinfrastructure and also cyberinfrastructure that will be available in the future."
On a technical level, SciDAS will combine access to multiple national cyberinfrastructure resources, including NSF Clouds, the Open Science Grid, the Extreme Science and Engineering Discovery Environment, petascale supercomputers such as COMET, and a variety of nationwide university resources such as Clemson's Palmetto Cluster. The distributed and scalable nature of both the data-sharing and the computer infrastructure will be exploited to boost the performance of workflows and scientific productivity.
"Given the huge problems and opportunities at play in the 21st century, we intend to speed up the discovery process and complex end-to-end data analysis process through a tight coupling of science and cyberinfrastructure experts", Alex Feltus stated. "This is not about making one-size-fits-all software. Rather, we'll be binding together the national cyberinfrastructure ecosystem to focus on real data of interest to practicing scientists."
RENCI will lead the effort to integrate existing cyber tools and technologies into the new SciDAS infrastructure that will be designed to support all aspects of distributed, data-driven research. Development of the SciDAS framework will involve integrating a number of NSF-funded cyberinfrastructure systems into one package.
"We will build on successful cyberinfrastructure projects developed here at RENCI, most of them with funding from the National Science Foundation", stated Claris Castillo, a senior computational and networked systems researcher at RENCI. "Through NSF support, RENCI has developed a number of tools and environments that make science more productive. SciDAS will integrate those tools and work environments into a unified cyberinfrastructure tailored to support science applications at scale. It is a win for scientists and a way to extend the value of our funded projects."
Stephen Ficklin, a computational biologist with the department of horticulture at Washington State University, will demonstrate the effectiveness of SciDAS by building gene co-expression networks for plants, animals, insects and people as a use-case for systems biology.
This data-intensive project, which maps the interactions of tens of thousands of genes in organisms, could help farmers breed new crops using traditional methods or aid scientists in finding new genes that influence plant and animal health.
"In the end, we will create the most complete repository of gene co-expression networks that exists anywhere", Stephen Ficklin stated. "Improving our cyberinfrastructure helps make our country more competitive in research. It keeps us in the forefront of data science."