
Primeur weekly 2014-06-30

Special

Innovative HTC facilities needed to support computational genomics at the application and data level ...

HPC-assisted cell-based immunotherapy successful in curing melanoma ...

Restless hearts are simulated with real-time modelling at IBM Research in Zurich ...

A live report from the Adapteva A-1 "smallest supercomputer in the world" launch at ISC'14 ...

The Cloud

HP launches Helion Managed Services for optimizing Cloud storage workloads ...

Oracle unveils next generation Virtual Compute Appliance ...

Desktop Grids

ATLAS@Home crowd computing project launched at CERN ...

EuroFlash

a.s.r. uses ADVA Optical Networking's GNOC for critical network monitoring and maintenance ...

ADVA Optical Networking launches new era of data centre connectivity with Big Data transport solution ...

KTH wins from crystal clear insight into application performance with Allinea Performance Reports ...

Neurological simulation milestone reached after UCL embraces Allinea's tools on UK's largest supercomputer ...

Bright Computing is building on its success in data centres across Europe ...

UK Atomic Weapons Establishment launches SGI supercomputer ...

Spectra Logic tape library to archive the UK's fastest supercomputer ...

Physicists find way to boot up quantum computers 72 times faster than previously possible ...

In the fast lane: Mediatec uses Calibre UK LEDView530 scalers at FIA World Endurance Championships ...

The upcoming cybernetic age is one of intellectual capital ...

International Supercomputing Conference moves to Frankfurt, Germany in 2015 ...

USFlash

DataDirect Networks helps EMSL speed climate, energy and bioscience discoveries with high performance and massively scalable storage ...

Spectra and the Tandy Supercomputer shorten calculation rates from days to minutes, saving time and lives ...

New A*STAR-SMU centre combines high-powered computing and behavioural sciences to study people-centric issues ...

Scheduling algorithms based on game theory make better use of computational resources ...

National Renewable Energy Laboratory supercomputer tackles power grid problems ...

Simulations help scientists understand and control turbulence in humans and machines ...

Stampede supercomputer enables discoveries throughout science and engineering ...

Supercomputing simulations crucial to the study of Ras protein in determining anticancer drugs ...

Stampede supercomputer powers innovations in DNA sequencing technologies ...

Stampede supercomputer helps researchers design and test improved hurricane forecasting system ...

NSF-supported Stampede supercomputer powers innovations in materials science ...

D-Wave and predecessors: From simulated to quantum annealing ...

HPC server market shrinks 9.6% in the first quarter of 2014, according to IDC ...

University of Maryland's Deepthought2 debuts in global supercomputer rankings ...

Nine ways NSF-supported supercomputers help scientists understand and treat the disease ...

CAST releases wysiwyg R33 ...

Innovative HTC facilities needed to support computational genomics at the application and data level


24 Jun 2014 Leipzig - In the session on HPC in Life Sciences at ISC'14 in Leipzig, C. Victor Jongeneel from the National Center for Supercomputing Applications (NCSA) and the University of Illinois at Urbana-Champaign talked about the role of HPC in sequencing technology and genomic biology. He said that traditional HPC facilities can be leveraged for computational genomics, but this takes some effort and technical savvy. There is a real need to build HTC facilities designed specifically to support computational genomics.

There are two key domains in life sciences addressed by HPC, stated Victor Jongeneel: sequencing technology, performed at the Roy J. Carver Biotechnology Center, and genomic science and HPC, performed at NCSA.

The second initiative, CompGen, is very different: it sits where computer science meets genomic biology. The goal is to develop an ideal architecture for this purpose, using an experimental machine. Victor Jongeneel described this interesting development, carried out in collaboration with industrial partners on both the user and the developer side.

The Illumina HiSeq 2500 sequencing system is used to perform the genomic analysis; Illumina is the predominant vendor in this area. A single run takes 2 to 11 days and produces 50-600 Gb of sequence, with reads of up to 2 x 100 bp.

Illumina also sells sequencers in sets of ten, optimized for human genome analysis: a set of ten HiSeq X machines can produce more than 320 genomes per week, or more than 18,000 per year, explained Victor Jongeneel. This sets the stage for the scale of data processing that is required.
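To put that output in perspective, here is a back-of-the-envelope calculation in Python of the raw data volume such an installation generates per year, assuming the roughly 300 GB of read files per 50-fold genome quoted further down in this article (the result is only an order-of-magnitude illustration, not a figure from the talk):

# Rough yearly raw-data volume for a ten-machine HiSeq X installation,
# using the figures quoted in this article.
genomes_per_year = 18_000        # "more than 18,000 per year"
gb_per_genome = 300              # ~300 GB of read files per 50x human genome

total_tb = genomes_per_year * gb_per_genome / 1_000
print(f"Raw sequence data per year: about {total_tb:,.0f} TB "
      f"(~{total_tb / 1_000:.1f} PB)")   # roughly 5,400 TB, i.e. ~5.4 PB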

In next-generation DNA sequencing, resequencing takes the output of the machine, aligns the reads to a reference genome, and identifies the variants.

Using de novo assembly, researchers can construct a genome sequence from the overlaps between reads. We are talking millions to billions of reads here, Victor Jongeneel told the audience.

The aim is to proceed from unmapped reads to true genetic variation in next-generation sequencing data. A single run of a sequencer generates short reads; the origin of each read in the human genome sequence has to be determined, stated Victor Jongeneel.

To call variants in a human genome, a typical data-intensive work flow has to be run, followed by an integrated analysis in terms of a medical diagnosis or an assessment of the genetic background of an individual.

The input consists of short reads generated by the Illumina sequencer. A whole genome at 50-fold coverage amounts to 1.5 billion paired reads, 225 billion nucleotides, and more than 300 GB of files.

The intermediate files consist of reads aligned to the reference, explained Victor Jongeneel. If you want to analyze hundreds or thousands of genomes, you will want to run many of these work flows simultaneously. For a single hospital, that adds up to some 600 TB of storage, Victor Jongeneel calculated.
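A simple calculation shows how a figure of that order can arise. Assuming the 300-600 GB of input per genome mentioned further down, plus intermediate alignment files of roughly the same size again (that intermediate estimate is an assumption, not a number from the talk), a thousand genomes in flight already fill some 600 TB:

# Illustrative working-storage estimate for running many variant-calling
# work flows at the same time in a single hospital.
genomes_in_flight = 1_000          # "100 or 1000" genomes analyzed together
input_gb = 300                     # raw reads per genome (lower end of 300-600 GB)
intermediate_gb = 300              # assumed: aligned/sorted intermediates, similar size

total_tb = genomes_in_flight * (input_gb + intermediate_gb) / 1_000
print(f"Working storage needed: about {total_tb:,.0f} TB")   # ~600 TB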

Data-level parallelization is a typical strategy for variant calling when there are no dependencies between subsets of the data. This data parallelization includes the following steps (a sketch of the pattern follows the list):

1. production of chunks of sequence reads and mapping of each chunk to the reference genome

2. merging all alignments

3. performing duplicate marking

4. splitting again per chromosome
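Below is a minimal, self-contained Python sketch of this scatter-gather pattern. It operates on toy in-memory "reads" instead of calling real aligners, so all names and data structures are purely illustrative; only the structure matters: chunk, map the chunks in parallel, merge, remove duplicates, and split again per chromosome.

# Toy illustration of the data-parallel variant-calling pattern listed above.
# The "alignment" is faked; only the scatter/merge/split structure is real.
import random
from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor

def align_chunk(chunk):
    # Pretend to map each read to a (chromosome, position) on the reference.
    return [(read, f"chr{random.randint(1, 22)}", random.randint(1, 1_000_000))
            for read in chunk]

def main():
    reads = [f"read_{i}" for i in range(10_000)]                        # stand-in for FASTQ reads
    chunks = [reads[i:i + 1_000] for i in range(0, len(reads), 1_000)]  # 1. chunk the reads

    with ProcessPoolExecutor() as pool:                                 # map chunks in parallel
        aligned_chunks = list(pool.map(align_chunk, chunks))

    alignments = [a for chunk in aligned_chunks for a in chunk]         # 2. merge all alignments
    deduplicated = list({(chrom, pos): (read, chrom, pos)               # 3. mark/drop duplicates
                         for read, chrom, pos in alignments}.values())

    per_chromosome = defaultdict(list)                                  # 4. split per chromosome
    for read, chrom, pos in deduplicated:
        per_chromosome[chrom].append((read, pos))

    print({chrom: len(hits) for chrom, hits in sorted(per_chromosome.items())})

if __name__ == "__main__":
    main()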

This approach was demonstrated on the Beagle supercomputer at Argonne. The benefits of parallelization are that a single user controls the cluster resources, the I/O never becomes limiting, and the scaling stays linear, explained Victor Jongeneel. You can read more in the paper titled "Supercomputing for the parallelization of whole genome analysis".

Victor Jongeneel expanded on the characteristics of traditional versus genomic HPC. In traditional HPC, there is a small number of MPI-enabled codes with code-level parallelism, and a small number of very large jobs. The scaling is limited by message passing over the network. Only simple work flows are addressed, and the total data footprint is modest, spread over a limited number of files.

In genomic HPC, there is a large number of scalar codes, some of which are multi-threaded, with data-level parallelism, and a large number of very small jobs. The scaling is limited by data manipulation overheads and by complex, data-intensive work flows. There is also a large total data footprint with large numbers of small files, Victor Jongeneel explained.

As an illustration of the total data footprint, the variant calling work flow was performed 50 times on a synthetic whole human genome. The input per dataset is not that large, about 300-600 GB per genome.

Next, Victor Jongeneel moved on to eGWAS: a project with the Mayo Clinic in Florida that studies epistatic interactions in eGWAS data. The project has a large dataset of 400 patients, half of whom are diagnosed with Alzheimer's disease. Gene expression data from two brain regions is taken from each patient, and genetic polymorphisms are measured at 225,000 loci per patient. This is Big Data and involves a big computational problem, Victor Jongeneel told the audience.

He said that 181 individuals with 24,000 phenotypes yield 50,011,495,056 pairs of variants and billions to trillions of data points. Indeed, this is a Big Data problem.
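The combinatorics behind those numbers is simple: testing epistatic (pairwise) interactions among n variants means examining n(n-1)/2 pairs, and each pair is examined against every phenotype. A small illustration with the 225,000 loci quoted above (the exact counts in the talk will depend on the precise variant and phenotype sets used):

# How pairwise (epistatic) interaction counts explode with the number of variants.
def variant_pairs(n_variants: int) -> int:
    return n_variants * (n_variants - 1) // 2          # "n choose 2"

n_loci = 225_000                                        # loci per patient, quoted above
pairs = variant_pairs(n_loci)
print(f"{pairs:,} pairs of variants")                   # ~25.3 billion pairs
print(f"{pairs * 24_000:.1e} pair-phenotype tests")     # with 24,000 phenotypes: ~6.1e14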

File system tiering adds about 160 TB of flash memory. Victor Jongeneel stated that it delivers great performance, though not the best streaming performance, which makes it ideal for all the small files.

A large number of jobs is needed for the variant calling. The actions performed include alignment; merging and sorting; splitting by chromosome; realigning and recalibrating; and, finally, the variant calling itself.

There are three steps: pre-processing, the actual calculation, and post-processing.

How can the researchers handle this large number of jobs? Victor Jongeneel asked. A solution is to wrap multiple SMP jobs with a launcher, turning them into a single MPI job with a single multi-node reservation.
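A minimal sketch of that launcher idea, assuming mpi4py is available: each MPI rank takes its share of a list of independent shell commands and runs them, so thousands of scalar tasks fit inside a single multi-node MPI reservation. The script and its command file are hypothetical illustrations of the general pattern, not the specific launcher used at NCSA.

# launcher.py - run many independent shell commands inside one MPI job.
# Usage inside a batch reservation:  mpirun -n <ranks> python launcher.py commands.txt
# Each line of commands.txt is one independent task (e.g. one alignment command).
import subprocess
import sys

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

with open(sys.argv[1]) as f:
    commands = [line.strip() for line in f if line.strip()]

# Simple static distribution: rank r runs commands r, r + size, r + 2*size, ...
for cmd in commands[rank::size]:
    result = subprocess.run(cmd, shell=True)
    if result.returncode != 0:
        print(f"[rank {rank}] FAILED: {cmd}", file=sys.stderr)

comm.Barrier()   # wait until every rank has finished its share of the work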

Handling the data-parallel work flows constitutes a second issue. The researchers write a set of configurable scripts that handle job submission, data dependencies and error trapping. Victor Jongeneel said that what is really needed is a work flow manager that understands data-parallel computing requirements.

The challenge is to find out where the bottlenecks are at the application level due to inefficient software platforms, as well as at the data level.

Victor Jongeneel ended his talk by sketching a notional HTC architecture for computational genomics, built from four components. Innovations are required for a true HTC system: the applications should run software that leverages the hardware architecture efficiently, and the data should sit on fast, highly parallel and resilient storage that can serve many concurrent work flows.
Leslie Versweyveld
