Victor Jongeneel stated that there are two key domains in the life sciences addressed by HPC. The first covers sequencing technology, performed at the Roy J. Carver Biotechnology Center, and genomic science on HPC, performed at NCSA.
The second initiative is very different: CompGen, a project where computer science meets genomic biology. The goal is to develop an ideal architecture for this purpose in the form of an experimental machine. Victor Jongeneel described this development as a collaboration with industrial partners on both the user and the development side.
The Illumina HiSeq 2500 sequencing system is used to perform genomic analysis; Illumina is the predominant vendor in this area. The system completes a run in 2 to 11 days, producing 50-600 Gb of sequence at read lengths of up to 2 x 100 bp.
Illumina sells these sequencers in batches of ten, optimized for human genome analysis. A set of ten HiSeq X machines can produce more than 320 genomes per week, or more than 18,000 per year, explained Victor Jongeneel. This sets the stage for the data processing that has to follow.
In next-generation DNA sequencing, resequencing takes the output of the machine, aligns the reads to a reference genome, and identifies the variants.
Using de novo assembly, researchers can construct a genome sequence from the overlaps between reads. We are talking millions to billions of reads here, Victor Jongeneel told the audience.
The aim is to proceed from unmapped reads to true genetic variation in next-generation sequencing data. A single run of a sequencer generates short reads; the origin of each read within the human genome sequence has to be found, stated Victor Jongeneel.
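As a toy illustration of what locating a read's origin means, the sketch below scans a reference sequence for the best-matching position of a short read and reports mismatches as candidate variants. Real pipelines use indexed aligners over billions of reads; the function names here are hypothetical and the brute-force scan is for illustration only.

```python
# Naive resequencing sketch: place a short read in a reference sequence
# and report base differences as candidate variants. Illustrative only;
# production aligners use indexed data structures, not a linear scan.

def align_read(reference: str, read: str) -> int:
    """Return the offset in `reference` where `read` matches with the
    fewest mismatches, scanning every position (O(len(ref) * len(read)))."""
    best_pos, best_mismatches = -1, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        mismatches = sum(1 for a, b in zip(reference[pos:pos + len(read)], read)
                         if a != b)
        if mismatches < best_mismatches:
            best_pos, best_mismatches = pos, mismatches
    return best_pos

def call_variants(reference: str, read: str, pos: int):
    """List (position, reference_base, read_base) differences at the alignment."""
    return [(pos + i, reference[pos + i], base)
            for i, base in enumerate(read) if reference[pos + i] != base]

reference = "ACGTACGTTAGCCGTA"
print(align_read(reference, "TAGCCGTA"))             # exact match at position 8
print(call_variants(reference, "TAGACGTA", 8))       # one candidate variant
```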
To call human variants, a typical data-intensive work flow has to be run, followed by an integrated analysis, for example for a medical diagnosis or an assessment of the genetic background of an individual.
The input consists of short reads generated by the Illumina sequencer. A whole genome at 50-fold coverage amounts to 1.5 billion paired reads, 225 billion nucleotides, and more than 300 GB of files.
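A quick back-of-the-envelope check shows these figures are internally consistent. The human genome size used below (~3.2 Gbp) is an assumption, not a number from the talk, and the resulting raw depth is higher than the quoted 50-fold usable coverage, as expected before filtering and duplicate removal.

```python
# Sanity check on the quoted input figures for one whole genome.
reads = 1_500_000_000          # 1.5 billion paired-end reads
total_bases = 225_000_000_000  # 225 billion nucleotides

read_length = total_bases // reads
print(read_length)             # 150 bp per read, implied by the two figures

genome_size = 3_200_000_000    # assumed human genome length (~3.2 Gbp)
raw_coverage = total_bases / genome_size
print(round(raw_coverage))     # ~70x raw depth before filtering
```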
The intermediate files consist of reads aligned to the reference, explained Victor Jongeneel. If you want to analyze hundreds or thousands of genomes, you may want to run many of these work flows simultaneously. Victor Jongeneel calculated that a single hospital would then need 600 TB of storage.
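The 600 TB figure follows directly from the per-genome input size quoted earlier, taking the upper end of the range and a cohort of a thousand genomes (raw input only, before any intermediate alignment files):

```python
# Storage estimate behind the 600 TB figure for a single hospital.
per_genome_gb = 600   # upper end of the quoted 300-600 GB per genome
genomes = 1_000       # a thousand genomes analyzed simultaneously

total_tb = per_genome_gb * genomes / 1_000
print(total_tb)       # 600.0 TB of raw input alone
```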
Data-level parallelization is a typical strategy for variant calling when there are no dependencies between subsets of the data. This data parallelization includes:
1. production of chunks of sequence reads and mapping of each chunk to the reference genome
2. merging all alignments
3. performing duplicate marking
4. splitting again per chromosome
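The four steps above follow a scatter-gather pattern, which can be sketched as below. The function names are hypothetical placeholders, not tools named in the talk, and threads stand in for cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor

def map_chunk(chunk):
    """Placeholder for aligning one chunk of reads against the reference;
    a real pipeline would invoke an aligner here."""
    return sorted(chunk)

def run_data_parallel(reads, n_chunks=4):
    # 1. scatter: cut the read set into independent chunks
    chunks = [reads[i::n_chunks] for i in range(n_chunks)]
    # 2. map each chunk in parallel -- valid because the chunks
    #    have no dependencies on one another
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        aligned = list(pool.map(map_chunk, chunks))
    # 3. gather: merge all partial alignments
    #    (duplicate marking would happen on the merged result)
    merged = sorted(a for part in aligned for a in part)
    # 4. the merged result is then re-split, e.g. per chromosome,
    #    for the downstream variant calling steps
    return merged

print(run_data_parallel([5, 3, 1, 4, 2]))  # [1, 2, 3, 4, 5]
```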
The work was run on the Beagle supercomputer at Argonne. The benefits of this parallelization are that a single user controls the cluster resources, the I/O never becomes limiting, and the scaling remains linear, explained Victor Jongeneel. This is described in the paper titled "Supercomputing for the parallelization of whole genome analysis".
Victor Jongeneel expanded on the characteristics of traditional versus genomic HPC. In traditional HPC, there is a small number of MPI-enabled codes with code-level parallelism, and a small number of very large jobs. Scaling is limited by message passing over the network. Only simple work flows are addressed, and the total data footprint is modest, spread over a limited number of files.
In genomic HPC, there is a large number of scalar codes, some of which are multi-threaded, with data-level parallelism, and a large number of very small jobs. Scaling is limited by data-manipulation overheads and by complex, data-intensive work flows. There is also a large total data footprint with large numbers of small files, Victor Jongeneel explained.
As an illustration of the data footprint, the variant calling work flow on a synthetic whole human genome was run 50 times. The input per dataset is not large, about 300-600 GB per genome.
Next, Victor Jongeneel moved on to eGWAS and epistatic interactions, in a project with the Mayo Clinic in Florida. The project has a large dataset of 400 patients, half of whom are diagnosed with Alzheimer's Disease. Gene expression data on two brain regions is taken from each patient, and the genetic polymorphisms amount to 225,000 loci per patient. This is Big Data and involves a big computation problem, Victor Jongeneel told the audience.
He said that 181 individuals with 24,000 phenotypes yield 50,011,495,056 pairs of variants and billions to trillions of data points. This is indeed a Big Data problem.
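The talk does not spell out how the exact pair count is derived, but the key point is that pairwise epistasis testing grows quadratically with the number of loci, and multiplying by thousands of phenotypes pushes the test count into the trillions. A minimal sketch of that scaling:

```python
def unordered_pairs(n: int) -> int:
    """Number of distinct variant pairs in a pairwise epistasis scan."""
    return n * (n - 1) // 2

# 225,000 genotyped loci already yield tens of billions of pairs;
# combined with thousands of phenotypes, the number of pairwise
# tests reaches the order of 10**14.
print(unordered_pairs(225_000))           # 25,312,387,500 pairs
print(unordered_pairs(225_000) * 24_000)  # pairs x phenotypes
```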
The file system tiering provides about 160 TB of flash memory. Victor Jongeneel stated that this delivers great performance, though not the best streaming performance; it is therefore ideal for the many small files.
A large number of jobs is needed for the variant calling. The following actions are performed: alignment; merging and sorting; splitting by chromosome; realigning and recalibrating; and, finally, the variant calling itself.
There are three steps: pre-processing, the actual calculation, and post-processing.
How can the researchers handle this large number of jobs? Victor Jongeneel asked. A solution is to wrap multiple SMP jobs with a launcher, turning them into a single MPI job with a single multi-node reservation.
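A minimal sketch of the launcher idea: many independent single-node tasks are wrapped so that one scheduler reservation executes them all. In a real MPI launcher each rank would pick up its share of the task list; the round-robin assignment simulated below is an assumed, illustrative scheme.

```python
def tasks_for_rank(tasks, rank, n_ranks):
    """Round-robin assignment: the slice of the task list that one
    MPI rank (or node) executes inside a single multi-node reservation."""
    return tasks[rank::n_ranks]

# Simulate a 4-rank launcher wrapping 10 independent sample-analysis jobs.
jobs = [f"sample_{i:02d}" for i in range(10)]
for rank in range(4):
    print(rank, tasks_for_rank(jobs, rank, 4))
```

Each rank gets a disjoint subset, and together the ranks cover every job, so the whole batch runs as one MPI job instead of thousands of scheduler submissions.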
How to handle the data-parallel work flows constitutes a second issue. The researchers write a set of configurable scripts that handle job submission, data dependencies and error trapping. Victor Jongeneel said that they need a work flow manager that understands data-parallel computing requirements.
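The dependency handling and error trapping such scripts provide can be sketched as follows. This is an assumed, minimal stand-in for a real work flow manager; the stage names are illustrative, not the project's actual pipeline.

```python
# Minimal work-flow runner sketch: each stage declares the stages it
# depends on, and failures are trapped so that downstream stages are
# skipped rather than run on missing or corrupt input.

def run_workflow(stages):
    """stages: dict of name -> (dependency names, callable).
    Returns a per-stage status of 'ok', 'failed', or 'skipped'.
    Assumes the dependency graph is acyclic."""
    status = {}
    remaining = dict(stages)
    while remaining:
        for name, (deps, action) in list(remaining.items()):
            if any(d in remaining for d in deps):
                continue  # a dependency has not finished yet
            if any(status.get(d) != "ok" for d in deps):
                status[name] = "skipped"  # error trapping: bad input upstream
            else:
                try:
                    action()
                    status[name] = "ok"
                except Exception:
                    status[name] = "failed"
            del remaining[name]
    return status

def fail():
    raise RuntimeError("simulated failure in the merge step")

status = run_workflow({
    "align": ([], lambda: None),
    "merge": (["align"], fail),
    "call":  (["merge"], lambda: None),
})
print(status)  # align ran, merge failed, call was skipped
```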
The challenge is to find out where the bottlenecks are, both at the application level, due to inefficient software platforms, and at the data level.
High throughput computing for computational genomics needs a notional HTC architecture with four components, Victor Jongeneel concluded his talk. Innovations are required for a true HTC system: the applications should run software that leverages the hardware architecture efficiently, and the data should reside on fast, highly parallel storage that is resilient enough to sustain concurrent access.