At BGI, researchers are analysing population genomics and conducting phylogenetic studies as well as genome association studies. Systems biology works with data at many different levels, and this has proved not to be easy, as BingQiang Wang explained.
The challenges researchers face are the diverse types of workloads, which require both high-throughput computing and HPC. They need massive compute power and storage capacity, and this increases the computing complexity. Consideration is needed for an infrastructure that is scalable and ready for the future. The speaker also pleaded for a balance between compute and I/O, as well as better scientist-system interaction. Developers have to make HPC systems more end-user friendly, since training biologists in computer "things" is not easy, BingQiang Wang insisted.
The present status is as follows, the speaker went on. In physics, we know how to model solids, waves, and atoms. In chemistry we know less, and in biology even less. There is a lack of models, methods, and theory.
The challenges ahead lie in the alignment of long, error-prone third-generation reads. This can be addressed with hybrid de novo assembly using both second- and third-generation reads. The assembly of metagenomic data amounts to up to several terabase pairs. Researchers have to identify sequencing errors and rare species.
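One common building block for identifying likely sequencing errors is the k-mer spectrum: substrings that appear only once across all reads are more likely to stem from errors than from real variation. The following toy sketch illustrates the idea (the k-mer size and count threshold are illustrative assumptions, not BGI's actual pipeline):

```python
from collections import Counter

def kmer_counts(reads, k=4):
    """Count every k-mer across a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def rare_kmers(read, counts, k=4, min_count=2):
    """Return the k-mers in `read` seen fewer than `min_count` times
    overall; these cluster around likely sequencing errors."""
    return [read[i:i + k] for i in range(len(read) - k + 1)
            if counts[read[i:i + k]] < min_count]

# Two identical reads plus one read with a single-base error (T at position 4)
reads = ["ACGTACGT", "ACGTACGT", "ACGTTCGT"]
counts = kmer_counts(reads)
print(rare_kmers("ACGTTCGT", counts))  # -> ['CGTT', 'GTTC', 'TTCG', 'TCGT']
```

Every k-mer overlapping the erroneous base is unique in the data set, so the error stands out even without a reference genome.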
BingQiang Wang described the ideal computational tools. They include, among others, the SOAP3-DP aligner. Sequence alignment is a way of arranging sequences of DNA, RNA, or proteins to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.
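The "DP" in SOAP3-DP stands for dynamic programming, the classic technique behind alignment scoring. A textbook Needleman-Wunsch global alignment score, not BGI's GPU implementation, can be sketched in a few lines (the match, mismatch, and gap penalties are illustrative assumptions):

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score via dynamic programming."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per base
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag,                  # match or mismatch
                              score[i - 1][j] + gap,  # gap in b
                              score[i][j - 1] + gap)  # gap in a
    return score[-1][-1]

print(nw_score("GATTACA", "GATCA"))  # -> 1  (five matches, two gaps)
```

SOAP3-DP's contribution is running this kind of recurrence massively in parallel on a GPU; the quadratic table above is why alignment of long reads is so compute-hungry.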
SNP calling with GSNP is another tool. A single-nucleotide polymorphism (SNP) is a DNA sequence variation that occurs when a single nucleotide in the genome differs between individuals, the speaker explained. With the elapsed time of all steps included, GSNP is around 50x faster than the single-threaded CPU-based method.
From the systems perspective, BingQiang Wang explained that BGI runs a large-scale computing facility with a very complicated scenario: hundreds of end users, hundreds of analysis tools, and tens of analysis pipelines. Scripting is very popular.
The speaker also highlighted the issues with the current systems. There is an imbalance between compute and I/O, and BGI researchers are dealing with low utilization.
Compression, on the other hand, does not come for free. BGI uses domain-optimized compression, with compression and decompression accelerated on heterogeneous hardware by means of GPUs. The researchers also use generic algorithms instead of specific ones, and try to fully exploit the characteristics of genomics data.
For job characterization and scheduling, jobs are submitted to the BGI computing farm using the Sun Grid Engine, but users need to specify CPU slots and memory usage, the speaker explained.
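Such a submission might be assembled as follows. The `-pe` (parallel environment slots) and `-l h_vmem=` (hard virtual-memory limit) flags are standard Sun Grid Engine resource requests, while the parallel environment name `smp` and the script name are illustrative assumptions about a particular site's configuration:

```python
def qsub_command(script, slots, mem_gb, pe="smp"):
    """Build a Sun Grid Engine qsub command line that pins CPU slots
    and memory, as BGI users must do for each job.

    The parallel environment name `pe` is site-specific (assumed here);
    h_vmem is the hard virtual-memory limit enforced per slot.
    """
    return ["qsub",
            "-pe", pe, str(slots),
            "-l", f"h_vmem={mem_gb}G",
            script]

print(" ".join(qsub_command("align_pipeline.sh", slots=8, mem_gb=4)))
# -> qsub -pe smp 8 -l h_vmem=4G align_pipeline.sh
```

Having to state slots and memory up front is exactly the characterization burden the speaker mentioned: over-requesting wastes capacity, under-requesting gets the job killed.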
To analyse the preliminary results, the researchers use a simulator to investigate the system.
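Such a simulator can be as simple as replaying a job trace against a fixed pool of slots. The sketch below (an assumption about the general approach, not BGI's actual simulator) schedules jobs first-come-first-served and reports the makespan and the average slot utilization:

```python
import heapq

def simulate(jobs, total_slots):
    """Toy farm simulator. `jobs` is a list of (slots, runtime) pairs
    submitted in order; scheduling is FIFO against `total_slots`.

    Returns (makespan, average slot utilization)."""
    free = total_slots
    running = []      # min-heap of (finish_time, slots_held)
    now = 0.0
    busy_area = 0.0   # integral of busy slots over time
    for slots, runtime in jobs:
        while free < slots:            # wait for enough slots to free up
            finish, s = heapq.heappop(running)
            busy_area += (total_slots - free) * (finish - now)
            now, free = finish, free + s
        heapq.heappush(running, (now + runtime, slots))
        free -= slots
    while running:                     # drain the remaining jobs
        finish, s = heapq.heappop(running)
        busy_area += (total_slots - free) * (finish - now)
        now, free = finish, free + s
    return now, busy_area / (now * total_slots)

# Two 2-slot jobs fill a 4-slot farm, then a 4-slot job runs: full utilization
print(simulate([(2, 10), (2, 10), (4, 5)], total_slots=4))  # -> (15.0, 1.0)
```

Even a model this crude lets one replay real traces with different slot counts or scheduling orders and see how the low-utilization problem reported above responds.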
BingQiang Wang concluded that the hardware trend is evolving towards more threads and less memory, but it remains a fact that bioinformatics is memory hungry.