All the way through to the 20th century, knowledge of the soybean came from the outside, through selective breeding and manipulation of its environment: the warm weather, targeted water, loose soil, and full sunlight it needs to grow.
Today, an ambitious project called the Soybean Knowledge Base (SoyKB), developed at the University of Missouri-Columbia (MU), aims to find and share comprehensive knowledge from within the soybean: its genetic and genomic data, all publicly available and made possible through high-performance computing.
Dong Xu is one of the principal investigators of SoyKB, which he describes as a web resource for all soybean data, from molecular data to field data, including several analytical tools. Dong Xu is a professor and department chair of computer science at MU.
"Our goal, first of all, is to provide a resource for people to find information about the soybean genes, their behaviour, their gene expression, the metabolic pathways, and more", Dong Xu stated. He added that it's more than just a clearinghouse of data. SoyKB promotes deeper understanding through data analysis, helping scientists who want to improve crops to develop and verify their hypotheses. More than 2,000 unique users log on to the SoyKB website every month, and over 10,000 unique users have utilized SoyKB since it was developed in 2010.
SoyKB started small, initially focusing on the genomics aspects of soybean data, according to Co-PI Trupti Joshi. She is the director of Translational Bioinformatics at the School of Medicine Medical Research Office and assistant research professor in the Department of Molecular Microbiology and Immunology at MU.
"After a year or two", stated Trupti Joshi, "we added the USDA germplasm data set, which gives you phenotypic information for about 19,000 soybean germplasm lines." Germplasm is basically the living genetic material, held in seed banks, that scientists use to improve their breeding. "That is when we started building a lot of tools in the informatics suite", she stated. These efforts, she added, are helping researchers find connections between the genomics data and variations in the germplasm lines.
"SoyKB has grown tremendously", Trupti Joshi stated. "Over the years, we have had users from academic and industry environments. We have both domestic and international users from Canada, Brazil, India, China, and a lot of different countries in Europe. It's really been widely accessible." Times have changed since the days of American colonist Samuel Bowen, who brought soybeans to North America in 1765.
The ultimate goal of SoyKB, said Trupti Joshi, is to improve soybean traits and to help researchers develop better soybean breeding techniques. "Our focus has been mainly on integrating multi-omics data sets about gene expression, protein expression, variations in the soybean, and then bridging it from this translational genomics side to the molecular breeding side, where it affects the soybean researchers and farmers", Trupti Joshi stated.
The SoyKB project started its computation with NSF-sponsored XSEDE, the eXtreme Science and Engineering Discovery Environment, through an allocation awarded in 2014 on the Stampede supercomputer at the Texas Advanced Computing Center. In all, it has used about 370,000 core hours on a massive project to sequence and analyze the genomes of over 1,000 soybean germplasm lines.
The technique is called resequencing: for each line, the genomic variations relative to a reference genome are identified. "The way resequencing is conducted is to chop the genome in many small pieces and see the many, many combinations of small pieces", stated Dong Xu. "The data are huge, millions of fragments mapped to a reference. That's actually a very time consuming process. Resequencing data analysis takes most of our computing time on XSEDE."
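The idea of mapping fragments to a reference and reading off the differences can be sketched in a few lines of Python. This is a toy illustration only, not the SoyKB pipeline: the sequences, the function names `best_position` and `call_variants`, and the `min_depth` threshold are all invented for the example, and production pipelines rely on dedicated aligners and variant callers rather than a brute-force search like this one.

```python
from collections import Counter, defaultdict


def best_position(read, reference):
    """Return the offset where the read has the fewest mismatches against the reference."""
    best_pos, best_mismatches = 0, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches < best_mismatches:
            best_pos, best_mismatches = pos, mismatches
    return best_pos


def call_variants(reads, reference, min_depth=2):
    """Pile mapped reads onto the reference; report sites where the majority base differs."""
    pileup = defaultdict(Counter)
    for read in reads:
        start = best_position(read, reference)
        for offset, base in enumerate(read):
            pileup[start + offset][base] += 1

    variants = []
    for pos in sorted(pileup):
        base, count = pileup[pos].most_common(1)[0]
        if count >= min_depth and base != reference[pos]:
            variants.append((pos, reference[pos], base))
    return variants


# Hypothetical 20-base reference and a sample line that differs at index 8 (A -> T).
reference = "ACGTTGCAAGCTTAGGCATC"
sample = "ACGTTGCATGCTTAGGCATC"

# "Chop the genome in many small pieces": overlapping 8-base fragments of the sample.
reads = [sample[i:i + 8] for i in range(0, len(sample) - 7, 2)]

print(call_variants(reads, reference))  # [(8, 'A', 'T')]
```

Even in this miniature version, every fragment is compared against every position of the reference, which hints at why the real analysis, with millions of fragments per line, consumes so much computing time.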
SoyKB sought the genetic markers for major soybean traits that include oil and protein content; soybean cyst nematode resistance; resistance to drought, heat and salinity; and healthy root system structure. "These data were very useful", stated Dong Xu, "because once we identified the genetic variations of those lines, they can be used for breeding purposes. It's really valuable data. In order to analyze the data, we didn't have enough resources. That's how XSEDE really helped us a great deal. In fact, we became one of the heavy users of XSEDE. Without XSEDE, we wouldn't be able to analyze this data. Now that the data are mostly analyzed, and we deposited this data into SoyKB, other researchers can also utilize it to answer questions of their interest", Dong Xu stated.
SoyKB was more or less a pipeline of Perl scripts when it first came to XSEDE, according to Mats Rynge. Mats Rynge is a computer scientist with the Information Sciences Institute (ISI), part of the University of Southern California (USC). He's part of the XSEDE Extended Collaborative Support Services (ECSS) effort. ECSS is a pool of experts who help researchers use the cyberinfrastructure of XSEDE, a nationwide grid of some of the most powerful computational hardware and software in the world. Like the warm weather soybeans require, XSEDE provided the environment of hardware, software, and expertise SoyKB needed to thrive.
Mats Rynge's group at ISI had experience with the Pegasus workflow system, and he thought it would be a good fit for transforming SoyKB from scripts into a workflow optimized for supercomputers. One might think of Pegasus as the flow of water for a data-thirsty SoyKB platform. "Pegasus is a workflow system that can take a set of computational tasks, where one task produces a piece of data that is used by another task downstream", explained Mats Rynge. Pegasus ensured that the ordering of the tasks was correct and that the data were formatted to best suit the execution environment of the parallel processing machines on XSEDE. It also handled the data management between tasks and the inputs and outputs.
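The core idea of such a workflow, tasks that consume the data produced upstream and therefore must run in dependency order, can be sketched with Python's standard library. The snippet below does not use the Pegasus API; the task names, the `run_workflow` helper, and the input string are hypothetical, chosen only to show how a dependency graph determines a valid execution order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(tasks, dependencies, data):
    """Run each task after all of its upstream tasks, sharing results through `data`."""
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        data[name] = tasks[name](data)
    return order, data


# Hypothetical tasks: each reads what upstream tasks stored and returns its own output.
tasks = {
    "split_reads": lambda d: d["input"].split(),
    "map_reads": lambda d: [read.upper() for read in d["split_reads"]],
    "count_reads": lambda d: len(d["split_reads"]),
    "report": lambda d: f"{d['count_reads']} reads mapped: {','.join(d['map_reads'])}",
}

# Each task maps to the set of tasks whose output it needs.
dependencies = {
    "split_reads": set(),
    "map_reads": {"split_reads"},
    "count_reads": {"split_reads"},
    "report": {"map_reads", "count_reads"},
}

order, results = run_workflow(tasks, dependencies, {"input": "acg ttg cat"})
print(order)
print(results["report"])
```

Here `map_reads` and `count_reads` do not depend on each other, so a scheduler is free to run them in either order or in parallel, which is the kind of freedom a workflow system can exploit on a parallel machine.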
The workflow inputs were moved from MU and hosted on the data store of NSF-funded CyVerse. CyVerse, formerly iPlant, is a multi-institution resource for the life sciences to handle Big Data with platforms that provide data storage, bioinformatics tools, image analyses, Cloud services, APIs, and more. CyVerse resources supported the framework that allowed SoyKB to scale up for its thousand-genome resequencing project. "For example, the data store framework really helped us tremendously", Trupti Joshi stated. "We generated close to 25-30 terabytes of raw data from just one large-scale sequencing project."
Another move SoyKB made was to shift its memory-guzzling genomic analysis from Stampede to Wrangler, a data-intensive system that launched in 2015. Like the loose, fertile soil soybeans need, Wrangler's unprecedentedly large memory-to-core ratio gave ample room for the SoyKB workflow to avoid data bottlenecks. "I think part of the success story", stated Mats Rynge, "is when Wrangler came on, it turned out to be a much better fit. We transitioned from Stampede to Wrangler, and we have been very happy with it since."
"Many times our PGen Pegasus workflows would run anywhere from 10 to 15 days on the Stampede systems", Trupti Joshi stated. "But then the same analysis could be completed in about 8 to 10 days when we moved those to the Wrangler system."
One big highlight of the SoyKB project is the easy-to-use suite of tools developed for informatics data analysis, said Trupti Joshi. "They are complete all the way from doing analysis with the soybean genome to getting you a view of what the gene expression might look like in different soybean tissues versus how certain soybean lines might respond to stress, whether it is in response to soybean cyst nematode worms or whether it is in response to drought stress. We actually built a system that stressed the user's perspective", Trupti Joshi stated.
MU scientists Trupti Joshi and Dong Xu were both on the team that in 2010 sequenced the first reference soybean genome. "It was exhilarating to be part of that community", Trupti Joshi stated. "This was a great step forward for the soybean community with the first genome draft."
"Since then, we have actually had a second revision", stated Trupti Joshi. "A version of the genome sequence and the gene model is being revised. We are really thrilled, because now we are in collaboration with Dr. Henry Nguyen at the University of Missouri and the Washington University genome sequencing centre, McDonnell Genome Institute. We are sequencing the second reference genome for the 'Lee' (PI 548656), which is representative of the southern cultivars. We are looking at a second reference genome coming out of soybeans", Trupti Joshi stated.
Dong Xu of MU wants SoyKB to expand its platform to other systems through something like an 'app' store. "This means we have many individual tools other than the data analysis pipeline", Dong Xu stated. "We have a genotype-phenotype analysis pipeline. We also developed some visualization capacity. We have more than a dozen tools. We would like to make these tools available to any other databases. We have been working with the corn community and others", stated Dong Xu.
Another future direction for SoyKB, Dong Xu said, is to make it a generic platform that other science groups can use to quickly develop their own knowledge bases. "Basically you could input the genome of any species and some annotations, and that would feed into what we call the 'KBCommons'", Dong Xu stated. The KBCommons would generate websites automatically for scientists. "People can develop a knowledge base for a particular disease, like heart disease or diabetes", Dong Xu stated. "Even though there are a lot of databases for human genomics, there is still this need for these special purposes. Our platform can allow people to generate a specific platform quickly and easily."
One way that SoyKB is getting more users on board is through an early research allocation on Jetstream, XSEDE's first scalable and fully customizable cloud environment. The web-based user interface of Jetstream allows seamless integration with other XSEDE resources via Globus Auth.
With the help of XSEDE hardware, software, and expertise, SoyKB has grown to be a rich ecosystem for the community of interdisciplinary researchers, industry, and non-scientists hoping to take advantage of the latest science on soybeans. And it has planted seeds of knowledge in the form of the many students who have participated in SoyKB.
"This is a great training environment for students", Trupti Joshi of MU stated. "Being in an academic institution, where we have developed this system, it also gives a nice framework for us to be training the next generation of scientists. Plus, it gets high school students involved, even if they're simply interested in knowing what a soybean plant looks like and how it responds to stress. You could just go to the SoyKB website and do a quick search to look for one of the lines that are best for growing in a drought environment."
"One of the things that I really like about SoyKB when it comes to knowledge transfer is the student involvement", stated Mats Rynge of XSEDE ECSS. "SoyKB had a more than normal number of students working with us. This is an important point, where the knowledge transfer is not just to computational scientists at some other project. It's really teaching students on how to do computing. That will hopefully help them with their computational needs in research when they are graduated and doing their own research."