
Primeur weekly 2011-10-03

Special

The perfect data manager ...

The Cloud

Cloud computing - current scenario, trends & key players, according to ReportsnReports ...

ROLF Group signs with HP to move technology infrastructure into the Cloud ...

IBM expands business partner initiative with new Cloud channel offering ...

Red Lambda's MetaGrid software transforms security and operations for customers with big data IT, network and Cloud infrastructures ...

Desktop Grids

Desktop Grid middleware XtremWeb-HEP 7.6.0 released ...

EuroFlash

Bull launches its new mainframe family GCOS 7 systems leveraging Extreme Computing technologies ...

Airbus completes data centre transformation with HP PODs ...

"Efficient use of GPU-accelerators to solve large problems" contest third tage starts ...

Unique supercomputer complex presentation held at the Tomsk State University ...

T-Platforms Company has become a leader in the Russian Top50 supercomputer list ...

USFlash

Scientists release most accurate simulation of the universe to date ...

Mongolia's National Agency of Meteorology and Environmental Monitoring orders a Cray XE6m supercomputer ...

How graphene's electrical properties can be tuned ...

Japan's KEK Research and IBM agree to develop powerful KEK central computer system ...

Intel Labs announces latest Science and Technology Center focused on next generation of pervasive computing ...

U.S. Department of Energy selects NetApp as the storage foundation for one of the world's most powerful supercomputers ...

Oracle achieves world record result with SPECjEnterprise2010 benchmark ...

Canon and Oracle join forces to integrate Canon's imaging technologies with Oracle ...

Oracle Utilities Meter Data Management running with Oracle Exadata Database Machine and Oracle Exalogic Elastic Cloud demonstrates extreme performance in processing data from smart meters ...

Oracle announces Hybrid Columnar Compression support for ZFS Storage Appliances and Pillar Axiom Storage Systems ...

Oracle unveils the world's fastest general purpose engineered system - the SPARC SuperCluster T4-4 ...

Oracle launches next generation SPARC T4 servers ...

Oracle's SPARC T4 servers deliver record-breaking performance results ...

SDSC and SDSU share in $4.6 million NSF grant to simulate earthquake faults ...

Stampede charges computational science forward in tackling complex societal challenges ...

The perfect data manager

26 Sep 2011 Lyon - At the EGI TF conference last week in Lyon, the booth of SysFera attracted a lot of attention. The company focuses on middleware and data management, both hotly debated topics in Lyon. In this article, SysFera's CTO, Benjamin Depardon, gives an overview of the state of the art in data management.

Introduction

An early and important use of grid (and now Cloud) environments comes from applications that manage large data sets. Fields of science such as high-energy physics, seismology, cosmology and, more recently, the life sciences are producing ever more data, whether measured by the number of files managed, by the size of each file, or by both. High Performance Computing applications executed on this data require seamless, high-performance access mechanisms that manage the available bandwidth of the communication links. To manage big data for grids and Clouds, we have designed a high-performance, distributed data-management system for SysFera-DS.

Core issues

Several important issues must be solved by modern data-management systems.

Transparency is one of the most important, perhaps even ahead of performance. Data should be stored at the most appropriate place, sometimes without the end-user's intervention, be it in memory, on a local disk or on remote storage. In the same way, a computational server needs to get the data as fast as possible, without a priori knowledge of where it has been stored. An efficient data manager should therefore incorporate data-location mechanisms, as well as ways to store data on different kinds of storage while accounting for their performance; in other words, storage heterogeneity has to be taken into account.
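
To make that concrete, here is a minimal sketch, in Python and with purely hypothetical names (this is not SysFera-DS code), of the location catalogue such transparency relies on: clients address data by a logical identifier, and the manager records which storage backend actually holds it.

    from abc import ABC, abstractmethod

    class StorageBackend(ABC):
        """One kind of storage: memory, local disk, remote server, ..."""

        @abstractmethod
        def read(self, data_id: str) -> object: ...

        @abstractmethod
        def write(self, data_id: str, payload: object) -> None: ...

    class MemoryBackend(StorageBackend):
        """Fastest tier; kept deliberately trivial for the sketch."""

        def __init__(self):
            self._store = {}

        def read(self, data_id):
            return self._store[data_id]

        def write(self, data_id, payload):
            self._store[data_id] = payload

    class DataManager:
        """Clients use logical IDs only; the manager decides (and
        remembers) which backend actually holds the bytes."""

        def __init__(self, backends):
            self._backends = backends   # ordered fastest-first
            self._catalog = {}          # data_id -> backend holding it

        def put(self, data_id, payload):
            backend = self._backends[0]      # naive: always the fastest tier
            backend.write(data_id, payload)
            self._catalog[data_id] = backend

        def get(self, data_id):
            # The caller never learns, or needs to learn, where the data lives.
            return self._catalog[data_id].read(data_id)

A real system would add disk and remote backends, capacity checks and performance-aware placement; the point is only that callers of get() never see where the data ended up.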

On the other hand, application developers may already have a good knowledge of how their applications use data, and may thus want to tune data management themselves in order to improve their application's performance (e.g., by overlapping communications and computations). Transparency should therefore never be provided at the expense of flexibility in data management.
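
As a sketch of that kind of tuning, the toy loop below overlaps communication with computation, assuming hypothetical fetch() and compute() stand-ins: while the application processes one data set, a background thread is already transferring the next one.

    import time
    from concurrent.futures import ThreadPoolExecutor

    def fetch(data_id):
        time.sleep(0.1)                  # stand-in for a slow remote transfer
        return "payload-" + data_id

    def compute(payload):
        time.sleep(0.1)                  # stand-in for the processing step

    def process_all(data_ids):
        # While compute() runs on chunk i, the pool's single worker is
        # already fetching chunk i + 1: transfer time is hidden behind
        # computation instead of adding to it.
        with ThreadPoolExecutor(max_workers=1) as pool:
            pending = pool.submit(fetch, data_ids[0])
            for i in range(len(data_ids)):
                payload = pending.result()
                if i + 1 < len(data_ids):
                    pending = pool.submit(fetch, data_ids[i + 1])
                compute(payload)

    process_all(["a", "b", "c"])         # ~4 x 0.1 s instead of 6 x 0.1 s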

When several replicas of the same data exist on a platform, carefully choosing the best source is necessary to get the most out of the available communication bandwidth. Multiple sources can sometimes even be used to parallelize the transfer itself and thus keep communication time as low as possible.
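
Both ideas fit in a few lines. In the sketch below (hypothetical names, static bandwidth estimates), pick_source() selects the replica behind the fastest link, and split_ranges() divides a single transfer across several sources in proportion to their bandwidth, so that all streams finish at roughly the same time.

    def pick_source(replicas, bandwidth):
        # Choose the replica host reachable over the fastest estimated link.
        return max(replicas, key=lambda host: bandwidth.get(host, 0.0))

    def split_ranges(size, sources):
        # Split a `size`-byte transfer across (host, bandwidth) pairs,
        # proportionally to bandwidth.
        total = sum(bw for _, bw in sources)
        offset, parts = 0, []
        for i, (host, bw) in enumerate(sources):
            if i == len(sources) - 1:
                length = size - offset       # last source absorbs rounding
            else:
                length = round(size * bw / total)
            parts.append((host, offset, length))
            offset += length
        return parts

    print(pick_source(["nodeA", "nodeB"], {"nodeA": 100.0, "nodeB": 40.0}))
    print(split_ranges(1000, [("nodeA", 100.0), ("nodeB", 50.0)]))
    # -> nodeA
    # -> [('nodeA', 0, 667), ('nodeB', 667, 333)]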

To avoid large communication and data-movement overheads when data is used several times (typically, when someone runs a parameter-sweep analysis), data should be kept where it is created or used. Persistence must be used as much as possible and, if the storage volume allows it, replication can be used to increase parallelism between computational servers or to ensure fault-tolerance.

To improve the global throughput of software environments, replicas are usually placed at carefully selected sites. Moreover, computation requests have to be scheduled among the available resources. To get the best performance, scheduling and data replication have to be tightly coupled, which is not always the case in existing approaches. These persistence and replication strategies have to be linked with job scheduling, so that scheduling decisions take all the parameters into account: communication costs as well as computation costs.
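
A toy version of such a coupled cost model might look as follows; the names and units are hypothetical, and this is a deliberate simplification rather than SysFera-DS's actual scheduler. The point is that the cheapest node overall may be a slower node that already holds the data.

    def schedule(data_size_mb, work_gflop, nodes, replicas, bandwidth, speed):
        """Return the node with the lowest estimated completion time.

        nodes:     candidate compute nodes
        replicas:  set of nodes already holding the input data
        bandwidth: dict (src, dst) -> estimated MB/s
        speed:     dict node -> estimated GFlop/s
        """
        def cost(node):
            if node in replicas:
                transfer = 0.0                   # data already local
            else:
                best = max(bandwidth[(src, node)] for src in replicas)
                transfer = data_size_mb / best   # seconds moving data
            return transfer + work_gflop / speed[node]

        return min(nodes, key=cost)

    # A slower node holding the data beats a faster node that must fetch it:
    # "fast" costs 8000/100 + 100/10 = 90 s, "slow" costs 0 + 100/5 = 20 s.
    print(schedule(8000, 100,
                   nodes=["fast", "slow"], replicas={"slow"},
                   bandwidth={("slow", "fast"): 100.0},
                   speed={"fast": 10.0, "slow": 5.0}))   # -> slow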

Often, many analyses are run on the same initial data set. That is the case for cosmological simulations, where the result of a big MPI simulation is fed to several post-processing programs, each called several times with a different set of parameters; or in genomics, where databases are accessed by several people at the same time. Having only one replica of the data can severely limit the power that a platform could deliver to a given application: with two replicas on two computational resources, processing can be up to twice as fast as with a single copy. SysFera-DS is able to automatically replicate data on several sites whenever required, thus improving computation time. Such a replica is either temporary, deleted at the end of the computation that required it, or kept persistent in order to limit data movement and to ensure that the data will be present for further needs. Simplicity is crucial: SysFera-DS can replicate data implicitly when necessary, or the user can explicitly state that the data needs to be replicated on a given set of nodes.
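
The sketch below shows the two modes side by side with a toy registry; the API is hypothetical, not SysFera-DS's. Explicit calls pin persistent replicas on user-chosen nodes, while temporary replicas disappear once the computation that needed them completes.

    class ReplicaRegistry:
        def __init__(self):
            self._replicas = {}   # data_id -> {node: persistent?}

        def replicate(self, data_id, nodes, persistent=False):
            # Explicit replication: pin copies on user-chosen nodes.
            entry = self._replicas.setdefault(data_id, {})
            for node in nodes:
                entry[node] = persistent

        def release(self, data_id):
            # End of a computation: temporary replicas are dropped,
            # persistent ones stay for later runs (e.g. a parameter sweep).
            entry = self._replicas.get(data_id, {})
            self._replicas[data_id] = {n: True for n, keep in entry.items() if keep}

    registry = ReplicaRegistry()
    registry.replicate("sim_output", ["node1"], persistent=True)
    registry.replicate("sim_output", ["node2"])          # temporary copy
    registry.release("sim_output")
    print(registry._replicas["sim_output"])              # {'node1': True}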

Thus, replication can provide an efficient way of decreasing data-processing time by allowing more parallelism. However, replication needs to be controlled, as it consumes both network bandwidth and storage capacity. SysFera-DS provides a quota mechanism to limit the disk and memory space available for data storage on each node. Whenever a storage resource is full, several policies can be applied: either no new data can be copied to the node until old data is explicitly deleted, or one of the many replacement algorithms (least recently used, for instance) can be applied to automatically evict old data.
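
As an illustration of the second policy, the sketch below (hypothetical names) enforces a per-node byte quota and evicts the least-recently-used entries whenever new data would not fit, one classic replacement algorithm among many.

    from collections import OrderedDict

    class QuotaStore:
        def __init__(self, quota_bytes):
            self.quota = quota_bytes
            self.used = 0
            self._items = OrderedDict()          # data_id -> size, oldest first

        def get(self, data_id):
            self._items.move_to_end(data_id)     # mark as recently used
            return data_id

        def put(self, data_id, size):
            if size > self.quota:
                raise ValueError("object exceeds the node's quota")
            # Evict least-recently-used entries until the new data fits.
            while self.used + size > self.quota:
                _, victim_size = self._items.popitem(last=False)
                self.used -= victim_size
            self._items[data_id] = size
            self.used += size

    store = QuotaStore(quota_bytes=100)
    store.put("a", 60); store.put("b", 30)
    store.get("a")                               # touch "a" so "b" is oldest
    store.put("c", 30)                           # evicts "b", not "a"
    print(list(store._items))                    # ['a', 'c']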

Conclusion

The data manager used in the SysFera-DS software solution provides the functionalities outlined above, among others. It is already used in production for applications as diverse as cosmology, meteorology and biology, which typically require advanced data-management functionalities. The SysFera R&D team constantly improves the software to keep ahead of ever-increasing data-management needs. Coupled with the advanced dataflow engine included in SysFera-DS, the data manager can handle complex dataflows that process large amounts of data on heterogeneous and distributed resources.
Source: www.sysfera.com
