Primeur weekly 2015-11-09

Special

HARNESS explored principles to integrate heterogeneous resources into Cloud platform ...

Focus

Combining the benefits of both GPU and CPU in heterogeneous computing ...

Exascale supercomputing

Towards future supercomputing: EU project Exa2Green improves energy efficiency in high performance computing ...

DEEP project unveils next-generation HPC platform ...

Focus on Europe

Launch of BioExcel - Centre of Excellence for Biomolecular Research ...

Information security community for e-infrastructures crystallises at WISE workshop ...

ALCF helps tackle the Large Hadron Collider's Big Data challenge ...

Middleware

Bright Computing to release updates to popular management software at SC15 ...

Altair partners with South Africa's Centre for High Performance Computing ...

Cray, AMPLab, NERSC collaboration targets Spark performance on HPC platforms ...

Hardware

Singapore scientists among the first to benefit from Infinera Cloud Xpress with 100 GbE for data centre interconnect ...

Supermicro world record performance benchmarks for SYS-1028GR-TR with Intel Xeon Phi coprocessors announced at Fall 2015 STAC Summit ...

IBM Teams with Mellanox to help maximize performance of Power Systems LC line servers for Cloud and cluster deployments ...

LSU deploys new IBM supercomputer "Delta" to advance Big Data research in Louisiana ...

Applications

Nomadic computing speeds up Big Data analytics ...

Clemson researchers and IT scientists team up to tackle Big Data ...

Calcium-48's 'neutron skin' thinner than previously thought ...

Oklahoma University collaborating in NSF South Big Data Regional Innovation Hub ...

Columbia to lead Northeast Big Data Innovation Hub ...

University of Miami gets closer to helping find a cure for gastrointestinal cancer thanks to DDN storage ...

The Cloud

Cornell leads new National Science Foundation federated Cloud project ...

Bright Computing reveals plans for Cloud Expo Frankfurt ...

UberCloud delivers CAE Applications as a Service ...

IBM plans to acquire The Weather Company's product and technology businesses; extends power of Watson to the Internet of Things ...

Oracle updates Oracle Cloud Infrastructure services ...

Nomadic computing speeds up Big Data analytics

Schematic of the proposed approach to predicting gene-disease associations. First, the researchers construct gene and disease features using different sources. Then, they perform inductive matrix completion using row and column features. The shaded region in the P matrix corresponds to genes or diseases with at least one known association. Credit: Nagarajan Natarajan and Inderjit Dhillon.

4 Nov 2015 Arlington - How do Netflix or Facebook know which movies you might like or who you might want to be friends with? Here's a hint: It starts with a few trillion data points and involves some complicated math and a lot of smart computer programming. The ability to make sense of massive amounts of raw data - a process known as data analytics - has already brought benefits to consumers and long-lost friends and is beginning to have a real impact in medicine, law enforcement and public services.

Inderjit Dhillon, a professor of computer science at the University of Texas at Austin, is an expert in this new world of Big Data. He was named a 2014 Fellow of the Association for Computing Machinery for his contributions to large-scale data analytics, machine learning and computational mathematics.

"Nowadays, there is an abundance of massive networks", Inderjit Dhillon stated. "These networks may be explicit or implicit, and we want to use predictive analytics on these networks to see what they can tell us."

He is among those who have realized it's possible to tame highly complex data - or "data with high dimensionality", in the lingo of the field - by using machine learning to reduce data to its most meaningful parameters. His approaches are widely adopted in science and industry.

"People have come to realize over the last two decades that indeed, data that comes from different applications often has special structures", Inderjit Dhillon stated. "For example, in the case of a very high dimensional regression, it's only a small number of dimensions that may actually matter."

Imagine Netflix wants to recommend movies that customers might like based on their ratings. A customer who rates several buddy comedies highly will probably like other films in that genre.

"If the user-movie matrix was general, there's no way that you could infer missing elements because they could be anything", Inderjit Dhillon stated. "But when you make the assumption that people have a finite number of tastes, or factors that determine what they like, the problem becomes tractable."

In collaboration with Vishwanathan's group at the University of California, Santa Cruz, and with support from the National Science Foundation (NSF) and computing resources from the Texas Advanced Computing Center (TACC), Inderjit Dhillon and his group have recently developed a new data analysis tool, called NOMAD. It stands for "non-locking, stochastic multi-machine algorithm for asynchronous and decentralized matrix completion".

NOMAD can pull insights from data much faster than other current state-of-the-art tools. It is also able to explore datasets, including some of the largest available, that break other leading software.

Among the problems Inderjit Dhillon and his team are exploring with NOMAD are "topic modelling", where the system automatically determines the appropriate topics related to billions of documents, and "recommender systems", where, based on millions of users and billions of records, the system can suggest appropriate items to buy or people to meet.

In cases like these, fitting the data on a single computer is often impossible. Instead, users distribute data among a large number of host systems. At the heart of NOMAD is a new method for orchestrating computations among those hosts.

"Suppose you have a massive computational problem and you need to run it on datasets that do not fit in a computer's memory", Inderjit Dhillon stated. "If you want the answer in a reasonable amount of time, the logical thing to do would be to distribute the computations over different machines."

Traditionally, systems have managed distributed computations through a process known as bulk synchronization. After computing a solution, each of the processors involved - often thousands - stops and communicates with the others, passing along the results of their computations.

In the programme that Inderjit Dhillon and Vishwanathan have developed, the communication is done asynchronously - the processors no longer stop and communicate at the same time.

"We are trying to develop an asynchronous method where each parameter is, in a sense, a nomad", he explained. "The parameters go to different processors, but instead of synchronizing this computation followed by communication, the nomadic framework does its work whenever a variable is available at a particular processor."

As soon as the work is done, the nomadic parameter travels to another processor. As a result, there is no waiting around for all the other processors in the system to finish computing.
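
A toy sketch of that control flow follows, using ordinary Python threads in place of separate machines and the same kind of made-up ratings data as above; all sizes and parameter choices are assumptions, and this illustrates the nomadic idea rather than the actual NOMAD implementation. Item-factor vectors play the role of the nomads: they travel between workers through queues, each worker updates whichever one arrives against its own slice of the ratings, and - unlike the bulk-synchronous pattern described earlier - no worker ever waits at a global barrier.

```python
import queue
import threading

import numpy as np

# Toy ratings matrix, split by rows (users) across workers; 0 means "not rated".
R = np.array([[5, 4, 0, 1],
              [4, 0, 0, 1],
              [0, 1, 5, 4],
              [1, 0, 4, 5]], dtype=float)
n_users, n_items = R.shape
k, lr, reg, n_workers = 2, 0.05, 0.02, 2
rng = np.random.default_rng(0)

W = 0.1 * rng.standard_normal((n_users, k))          # user factors, one owner each
inboxes = [queue.Queue() for _ in range(n_workers)]  # one inbox of "nomads" per worker
rows_of = [range(0, 2), range(2, 4)]                 # users owned by each worker

def worker(wid, n_visits):
    """Process n_visits nomadic item parameters, then stop."""
    for _ in range(n_visits):
        j, h = inboxes[wid].get()          # wait for an item-factor nomad to arrive
        for i in rows_of[wid]:             # SGD updates against local ratings only
            if R[i, j] > 0:
                err = R[i, j] - W[i] @ h
                W[i] += lr * (err * h - reg * W[i])
                h += lr * (err * W[i] - reg * h)
        # Hand the nomad straight to another worker; nobody waits at a barrier.
        inboxes[(wid + 1) % n_workers].put((j, h))

# Seed the item parameters as nomads scattered across the workers' inboxes.
for j in range(n_items):
    inboxes[j % n_workers].put((j, 0.1 * rng.standard_normal(k)))

threads = [threading.Thread(target=worker, args=(w, 200)) for w in range(n_workers)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(np.round(W, 2))   # learned user factors; item factors remain in the queues
```

Python threads will not deliver a real speed-up because of the interpreter lock; the point of the sketch is the communication pattern, with workers standing in for separate machines.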

Inderjit Dhillon and his team developed and tested NOMAD on the Stampede supercomputer at TACC, the ninth most powerful in the world, and on Rustler, a system that specializes in running machine learning algorithms.

Managing the process in this way, the team was able to get a superlinear speed-up. That means when they ran the code on one thousand processors, they were able to solve the problem more than one thousand times faster. They were also able to seamlessly handle millions of documents with billions of occurrences of words - or millions of users and billions of ratings - in a reasonable time period.
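
For reference, "speed-up" here has its standard meaning; the definition below is general background rather than a formula quoted from the team's papers.

```latex
% T(p): time to solve the problem on p processors
S(p) = \frac{T(1)}{T(p)}, \qquad \text{superlinear speed-up} \iff S(p) > p
```

Solving the problem more than one thousand times faster on one thousand processors means S(1000) > 1000. A common reason such speed-ups occur, although the article does not spell it out, is that once the data is partitioned it fits in the combined caches and memory of the participating machines, so each processor also runs faster on its own share.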

The team reported their results in the Proceedings of the VLDB Endowment in July 2014 and at the World Wide Web conference in May 2015.

The research applies broadly to one of the most pressing challenges today, according to Amy Apon, a programme director at NSF.

"Traditionally, machine learning inference algorithms run on a single large - and sometimes expensive - server, and this limits the size of the problem that can be addressed", she stated. "This team has noticed a property of some machine learning algorithms that if a few slowly changing variables can be only occasionally synchronized, then the work can be more easily distributed across different computers. Their clever mathematical approach is opening doors to running machine algorithms on the kind of massive-scale, distributed, commodity computers that we find in today's Cloud computing environment."

When we hear data analytics, we typically think of social networks like Facebook or LinkedIn, but Inderjit Dhillon has applied his tools to a different type of network - the gene networks involved in disease.

Teaming up with Edward Marcotte in the biology department at the University of Texas at Austin, Inderjit Dhillon and his group have applied their methods and state-of-the-art algorithms to the gene networking problems Marcotte is trying to solve.

"We thought about the relationships or linkages between genes and diseases as a network", Inderjit Dhillon stated. "From there, the question is: Can you do prediction from this sort of data to determine what genes have a propensity to be linked to which diseases? And you can do that by actually developing new mathematics."

Edward Marcotte and Inderjit Dhillon used evolutionary relationships to track down gene networks involved in human health. With this method, they showed it was possible to predict gene-disease associations for diabetes based on functional gene associations and gene-phenotype associations in model organisms.
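
The figure caption at the top of this article describes the formulation as inductive matrix completion with row and column features. The sketch below shows that formulation on made-up data - all sizes, features and hyperparameters are assumptions, and a real solver would also treat unobserved pairs as weak negatives. Known associations P are approximated as X W H^T Y^T, where X holds gene features and Y holds disease features, so gene-disease pairs with no known association can still be scored from their features.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_diseases, d_g, d_d, k = 30, 20, 8, 6, 4   # toy sizes, all assumed

X = rng.standard_normal((n_genes, d_g))        # gene features (e.g. network-derived)
Y = rng.standard_normal((n_diseases, d_d))     # disease features (e.g. phenotypes)
P = (rng.random((n_genes, n_diseases)) < 0.05).astype(float)  # sparse known links
known = np.argwhere(P > 0)                     # pairs with a known association

W = 0.1 * rng.standard_normal((d_g, k))
H = 0.1 * rng.standard_normal((d_d, k))
lr, reg = 0.01, 0.1

for epoch in range(300):
    for i, j in known:
        xi, yj = X[i], Y[j]
        err = P[i, j] - xi @ W @ H.T @ yj      # error on one known association
        # Gradient steps on the squared error for this single entry.
        W -= lr * (-err * np.outer(xi, H.T @ yj) + reg * W)
        H -= lr * (-err * np.outer(yj, W.T @ xi) + reg * H)
        # (A real solver also uses unobserved pairs as weak negatives; omitted here.)

# Score every gene-disease pair, including pairs with no known association.
scores = X @ W @ H.T @ Y.T                     # shape: (n_genes, n_diseases)
print(scores.shape, float(scores.max()))
```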

The results of this work were published in PLOS ONE in 2013, and a further study by Inderjit Dhillon and his group, using a similar approach, appeared in Bioinformatics last year.

In October, Inderjit Dhillon was awarded a three-year follow-up grant from NSF to continue work on nomadic algorithms for machine learning in the Cloud, beginning in January 2016.

He and his group are now extending their study to medical informatics, where the problem might be to predict co-morbidities or the propensity for re-hospitalization.

"Data can tell a thousand stories", stated Inderjit Dhillon, "and the challenge is to develop new mathematics and methods that can extract the knowledge and discard the spurious. The possibilities are endless."
Source: National Science Foundation
