Primeur weekly 2015-02-02

The Cloud

Shinra Technologies partners with NTT and Techorus and announces launch date of Japanese technical beta ...

UTSA and Indiana University partner on $6.6 million NSF Cloud-based advanced computing systems grant ...

Independent research firm ranks HP private Cloud a leader in China ...

Desktop Grids

Grant for Nerdalize for heating houses with computing power ...

EuroFlash

Business Secretary Cable announces partners in the Alan Turing Institute in the UK ...

Schools in Wales challenged to break the world land speed record of 1,000mph ...

PUZZLECLUSTER: The first reuse application of the PUZZLEPHONE ...

Chemists control structure to unlock magnetization and polarization simultaneously ...

MEP Awards 2015 - Shortlisted nominees for ICT announced ...

DIADEMS - finding the sensor behind the sparkle ...

Entanglement on a chip: Breakthrough promises secure communications and faster computers ...

USFlash

NERSC seeks industry partners for collaborative research ...

Exascale Hearing Testimony in Congress highlights CS research accomplishments ...

D-Wave Systems raises an additional $29 million, closing 2014 financing at $62 million ...

MAGMA MIC 1.3.1 for Intel Xeon Phi coprocessors released ...

Dot Hill announces general availability of the Ultra56 AssuredSAN Hybrid storage array ...

New climate change projections for Australia ...

SGI reports financial results for the second quarter of fiscal 2015 ...

Researchers identify materials to improve biofuel and petroleum processing ...

Supercomputing the evolution of a model flower ...

Obsidian unveils plans for 400G-capable enhanced InfiniBand services platform at EmTech Singapore 2015 ...

New supercomputer allows for massive data analysis in less time ...

IBM Research to lead company's advanced computer chip R&D at SUNY Polytechnic Institute ...

Building trustworthy Big Data algorithms ...

Parallelizing common algorithms ...

New pathway to valleytronics ...

Nanoscale mirrored cavities amplify, connect quantum memories ...

Building trustworthy Big Data algorithms

29 Jan 2015 Evanston - Reams of our data sit in large databases of unstructured text. Finding insights among e-mails, text documents, and websites is extremely difficult unless we can search, characterize, and classify their text data in a meaningful way. One of the leading Big Data algorithms for finding related topics within unstructured text - an area called topic modelling - is latent Dirichlet allocation (LDA). But when Northwestern University professor Luis Amaral set out to test LDA, he found that it was neither as accurate nor as reproducible as a leading topic modelling algorithm should be.

Drawing on his background in network analysis, Luis Amaral, professor of chemical and biological engineering in Northwestern's McCormick School of Engineering and Applied Science, developed a new topic modelling algorithm that showed very high accuracy and reproducibility in tests. The results, co-authored with Konrad Kording, associate professor of physical medicine and rehabilitation, physiology, and applied mathematics at Northwestern, were published January 29 in Physical Review X.

Topic modelling algorithms take unstructured text and find a set of topics that can be used to describe each document in the set. They are the workhorses of big data science, used as the foundation for recommendation systems, spam filtering, and digital image processing. The LDA topic modelling algorithm was developed in 2003 and has been widely used for academic research and for commercial applications, like search engines.

When Luis Amaral explored how LDA worked, he found that the algorithm produced different results each time it was run on the same set of data, and that those results were often inaccurate. Luis Amaral and his group tested LDA by running it on documents they created that were written in English, French, Spanish, and other languages. Using different languages let them prevent text overlap among the documents.

"In this simple case, the algorithm should be able to perform at 100 percent accuracy and reproducibility", he stated. But when LDA was used, it separated these documents into similar groups with only 90 percent accuracy and 80 percent reproducibility. "While these numbers may appear to be good, they are actually very poor, since they are for an exceedingly easy case", Luis Amaral stated.
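The article does not say how reproducibility was scored. As a minimal sketch of one common approach, and only an assumption here, two runs of a clustering algorithm can be compared by the fraction of document pairs they treat the same way (the Rand index). The function name and the sample label lists below are hypothetical.

```python
from itertools import combinations


def pairwise_agreement(labels_a, labels_b):
    """Fraction of document pairs that two runs treat the same way.

    A pair counts as agreeing when both runs put the two documents in the
    same group, or both runs put them in different groups (Rand index).
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical group labels for six documents from three separate runs.
run_1 = [0, 0, 1, 1, 2, 2]
run_2 = [1, 1, 0, 0, 2, 2]  # same grouping as run_1, only the label names differ
run_3 = [0, 0, 1, 2, 2, 2]  # one document ends up in a different group

same_grouping = pairwise_agreement(run_1, run_2)
shifted_grouping = pairwise_agreement(run_1, run_3)
```

Because the score depends only on which documents are grouped together, it is unaffected when a run happens to number its groups differently.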

To create a better algorithm, Luis Amaral took a network approach. The result, called TopicMapping, begins by preprocessing data to replace words with their stems (so "star" and "stars" would be considered the same word). It then builds a network of connecting words and identifies a "community" of related words - just as one could look for communities of people on Facebook. The words within a given community define a topic.
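The three steps described above (stemming, building a word network, finding communities) can be sketched as follows. This is an illustration under stated assumptions, not the published TopicMapping code: the stemmer is a toy rule, words are linked when they share a document, and connected components stand in for a real community detection method. All function names and sample documents are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def stem(word):
    """Toy stemmer (an assumption): lowercase and drop one trailing 's'."""
    word = word.lower()
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def build_word_network(documents):
    """Link every pair of distinct stems that appear in the same document."""
    graph = defaultdict(set)
    for text in documents:
        stems = sorted({stem(w) for w in text.split()})
        for s in stems:
            graph[s]  # register the word even if it has no neighbours
        for a, b in combinations(stems, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def find_communities(graph):
    """Connected components, a simple stand-in for community detection."""
    seen, communities = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        communities.append(component)
    return communities


# Hypothetical documents in two languages with no shared words.
documents = [
    "stars shine bright",
    "the star shines",
    "les étoiles brillent",
    "une étoile brille",
]

topics = find_communities(build_word_network(documents))
```

On these sample documents the English and French words share no links, so they fall into two separate communities, mirroring the language-separation test described in the article. Real text has words that bridge topics, which is why an actual community detection method is needed rather than connected components.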

The algorithm was able to perfectly separate the documents according to language and was able to reproduce its results. It also had high accuracy and reproducibility when separating 23,000 scientific papers and 1.2 million Wikipedia articles by topic.

These results show the need for more testing of Big Data algorithms and more research into improving their accuracy and reproducibility, Luis Amaral said.

"Companies that make products must show that their products work", he stated. "They must be certified. There is no such case for algorithms. We have a lot of uninformed consumers of Big Data algorithms that are using tools that haven't been tested for reproducibility and accuracy."

Source: Northwestern University
