An early and important use of grid (and now cloud) environments comes from applications managing large data sets. Scientific fields such as high-energy physics, seismology, cosmology, and more recently the life sciences produce ever more data (in the number of files managed, the size of each file, or both). High-performance computing applications running on these data require seamless, high-performance access mechanisms that manage the available bandwidth of communication links. To manage big data for grids and clouds, we have designed a high-performance, distributed data-management system for SysFera-DS.
Several important issues must be solved by modern data-management systems.
Transparency is one of the most important, perhaps even ahead of performance. Data should be stored at the most appropriate place, sometimes without the end-user's intervention, whether in memory, on a local disk, or on remote storage. Likewise, a computational server needs to fetch data as fast as possible without a priori knowledge of where it is stored. An efficient data manager must therefore provide localization mechanisms as well as methods to store data on different kinds of storage while accounting for their performance; in other words, storage heterogeneity has to be taken into account.
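To illustrate, placement across heterogeneous storage can be sketched as a simple tier-selection policy. The tier names and capacity/bandwidth figures below are invented for the example and are not SysFera-DS internals:

```python
# Illustrative storage tiers: (name, capacity in MB, bandwidth in MB/s),
# listed fastest first. All figures are invented for the example.
TIERS = [
    ("memory", 1_000, 5_000),
    ("local_disk", 100_000, 200),
    ("remote_storage", 10_000_000, 50),
]

def place(size_mb, used):
    """Return the fastest tier that still has room for the data.

    `used` maps tier name -> MB already consumed on that tier.
    """
    for name, capacity, _bandwidth in TIERS:
        if used.get(name, 0) + size_mb <= capacity:
            return name
    raise RuntimeError("no storage tier has enough free space")
```

A real localization mechanism would also track where each piece of data ended up so that servers can find it later; the sketch only shows the placement decision itself.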
On the other hand, application developers may already know precisely how their applications use data, and may therefore want to tune data management themselves to improve performance (e.g., by overlapping communications and computations). Transparency should thus never be provided at the expense of flexibility in data management.
When several replicas of the same data exist on a platform, carefully choosing the best source is necessary to make the most of the available communication bandwidth. Multiple sources can sometimes even be used to parallelize the transfer itself, keeping communication time as low as possible.
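A minimal sketch of both ideas, assuming per-source bandwidth estimates are available (the data structures and figures are invented for the example):

```python
from dataclasses import dataclass

@dataclass
class Replica:
    host: str
    bandwidth: float  # estimated available bandwidth to the client, in MB/s

def best_source(replicas):
    """Choose the single replica with the highest estimated bandwidth."""
    return max(replicas, key=lambda r: r.bandwidth)

def split_transfer(replicas, size_mb):
    """Split one transfer across several sources, proportionally to their
    bandwidth, so that all chunks finish at roughly the same time."""
    total = sum(r.bandwidth for r in replicas)
    return {r.host: size_mb * r.bandwidth / total for r in replicas}
```

With two sources at 100 MB/s and 50 MB/s, a 300 MB transfer is split 200/100, so both chunks take about two seconds instead of three for a single-source download.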
To avoid large communication and data-movement overheads when data is used several times (typically, when someone runs a parameter-sweep analysis), data should be kept where it is created or used. Persistence must be exploited as much as possible and, if the storage volume allows it, replication can be used to increase parallelism between computational servers or to ensure fault tolerance.
To improve the global throughput of software environments, replicas are usually placed at carefully selected sites. Moreover, computation requests have to be scheduled among the available resources. To get the best performance, scheduling and data replication have to be tightly coupled, which is not always the case in existing approaches: persistence and replication strategies must be linked with job scheduling so that scheduling decisions take all the parameters into account, computation costs as well as communication costs.
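The coupling of computation and communication costs can be sketched as a single cost function over candidate servers. The server parameters below are invented for the example; a real scheduler is considerably more elaborate:

```python
def pick_server(job_mflop, data_mb, servers):
    """Pick the server minimizing estimated completion time.

    Each server is a dict with 'name', 'speed' (MFlop/s), 'bandwidth'
    (MB/s towards the data source), and 'has_replica' (local copy present?).
    A server holding a replica pays no transfer cost.
    """
    def cost(s):
        compute = job_mflop / s["speed"]
        transfer = 0.0 if s["has_replica"] else data_mb / s["bandwidth"]
        return compute + transfer
    return min(servers, key=cost)
```

The point of the sketch: for a data-heavy job, a slower server that already holds a replica can beat a faster server that must first fetch the data, which is exactly why scheduling and replication cannot be decided independently.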
Often, many analyses are run on the same initial data set. That is the case for cosmological simulations, where the result of a big MPI simulation is fed to several post-processing programs, each called several times with a different set of parameters; or in genomics, where databases are accessed by several people at the same time. Having only one replica of the data can severely limit the throughput a platform can deliver to a given application: with two replicas on two computational resources, it might be possible to process twice as fast as with a single copy. SysFera-DS is able to automatically replicate data on several sites whenever required, thus improving computation time. Such a replica is either temporary, deleted at the end of the computation that required it, or kept persistent in order to limit data movement and to ensure that the data will be present for further needs. Simplicity being crucial, SysFera-DS can either implicitly replicate data when necessary, or let the user explicitly request replication on a given set of nodes.
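The temporary-versus-persistent distinction can be modeled with a toy replica catalog. This is purely illustrative and does not reflect the actual SysFera-DS API:

```python
class ReplicaCatalog:
    """Toy catalog tracking where each piece of data is replicated."""

    def __init__(self):
        self.replicas = {}  # data_id -> {node: persistent flag}

    def replicate(self, data_id, nodes, persistent=False):
        """Record replicas of `data_id` on `nodes` (temporary by default)."""
        entry = self.replicas.setdefault(data_id, {})
        for node in nodes:
            entry[node] = persistent

    def computation_done(self, data_id):
        """Drop temporary replicas; keep persistent ones for later reuse."""
        entry = self.replicas.get(data_id, {})
        self.replicas[data_id] = {n: p for n, p in entry.items() if p}
```

After a computation finishes, only the replicas marked persistent survive, limiting data movement for subsequent runs.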
Thus, replication can decrease data-processing time by allowing more parallelism. However, it needs to be controlled, as it consumes both network bandwidth and storage capacity. SysFera-DS provides a quota mechanism to limit the disk and memory space available for data storage on each node. Whenever a storage resource is full, several policies can be applied: either no new data can be copied to this node until old data is explicitly deleted, or one of many replacement algorithms can automatically evict old data.
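Both policies can be sketched as a node-local store with a quota. The least-recently-used (LRU) eviction below is just one of the possible replacement algorithms, and the class is illustrative rather than the production implementation:

```python
from collections import OrderedDict

class QuotaStore:
    """Node-local storage under a quota: either refuse new data when full,
    or evict least-recently-used entries to make room."""

    def __init__(self, quota_mb, evict=True):
        self.quota_mb = quota_mb
        self.evict = evict               # False -> refuse instead of evicting
        self.entries = OrderedDict()     # name -> size_mb, LRU order first

    def access(self, name):
        self.entries.move_to_end(name)   # mark as most recently used

    def put(self, name, size_mb):
        """Store a new item; return False if it cannot be accommodated."""
        if size_mb > self.quota_mb:
            return False
        while sum(self.entries.values()) + size_mb > self.quota_mb:
            if not self.evict:
                return False             # wait for an explicit deletion
            self.entries.popitem(last=False)  # evict least recently used
        self.entries[name] = size_mb
        return True
```

With a 100 MB quota holding items of 60 MB and 30 MB, storing a new 40 MB item evicts whichever of the two was least recently accessed.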
The data manager used in the SysFera-DS software solution provides the functionalities outlined above, among others. It is already used in production for applications as diverse as cosmology, meteorology, and biology, which typically require advanced data-management functionalities. The SysFera R&D team constantly improves its software to keep ahead of the ever-increasing needs for data management. When coupled with the advanced dataflow engine included in SysFera-DS, it can handle complex dataflows that process large amounts of data on heterogeneous and distributed resources.