To put things into perspective, Dr. Villanustre provided the audience with a brief history of high-performance and data-intensive computing.
Initially, HPC efforts were driven by the need for specialized hardware to tackle complex problems, but demand quickly grew and made HPC commercially viable, with Cray as a strong player in the market, the speaker explained. The goal was to build the largest and fastest computers in the world. Typical simulation workloads in those days involved a large number of floating-point operations on a more or less constant-sized data set, requiring only a single thread or a few threads of execution.
The birth of Beowulf introduced the notion of distributed execution in HPC, the speaker continued. Commodity hardware, commodity operating systems, a distributed architecture, and virtually limitless scalability made supercomputing affordable to the masses. The burden of parallelism was thus moved into the tools and became the programmers' problem, Dr. Villanustre explained.
But then the growth of digital data and the Internet generated data volumes never seen before. Search engines and data services companies pushed traditional data stores to their limits. Since the value derived from aggregating massive amounts of data, distributed data stores and data-intensive computing turned data locality into a key performance factor. As such, distributed execution paradigms quickly gained interest, the speaker told the audience.
So, in this context, what are the key design principles for distributed data-intensive computing? Dr. Villanustre asked. Data can be huge, but its volume is usually reduced dramatically as it moves through the processing pipeline. Disk activity can often be sequenced to minimize the number of disk seeks. Many data search and retrieval problems are embarrassingly parallel, although not all of them are. Commodity hardware has become so affordable that agile, exploratory analysis of large data sets is now both feasible and desirable, according to the speaker.
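To make the principle concrete, here is a minimal Python sketch (not from the talk) of an embarrassingly parallel filter-and-count over data partitions: each worker scans its partition sequentially and returns only a small partial aggregate, so the data volume shrinks drastically as the computation proceeds. The partition contents, the count_matches helper, and the filtering predicate are all hypothetical.

```python
# Hypothetical sketch: an embarrassingly parallel filter-and-count over data
# partitions. The raw input may be huge, but each worker returns only a tiny
# partial aggregate, so data volume shrinks as processing proceeds.
from multiprocessing import Pool

def count_matches(partition):
    # Scan one partition sequentially (few seeks) and keep only an aggregate.
    return sum(1 for record in partition if record % 7 == 0)

if __name__ == "__main__":
    # Toy "partitions"; in practice these would be blocks on different nodes.
    partitions = [range(i * 1_000_000, (i + 1) * 1_000_000) for i in range(8)]
    with Pool() as pool:
        partial_counts = pool.map(count_matches, partitions)  # parallel map
    total = sum(partial_counts)                                # cheap reduce
    print(total)
```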
At present we need analytics and recommendation systems, non-linear regression, multinomial classification, and complex clustering. This is where machine learning comes in. As a matter of fact, many of these algorithms cannot easily be translated into maps and folds, so iterative optimizers are a must when closed-form solutions would be highly inefficient or simply do not exist. Dr. Villanustre warned that the parallelisation of algorithms can be a daunting task, and when it comes to massive graphs, the partitioning can be really tricky.
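As an illustration of that point, the following sketch (mine, not the speaker's) solves the same least-squares problem twice: once with the closed-form normal equations and once with an iterative gradient-descent optimizer, the kind of approach that remains when a closed form is too expensive or unavailable. The data sizes, learning rate, and iteration count are arbitrary choices for the example.

```python
# Hypothetical sketch: the same least-squares fit solved two ways -- a
# closed-form solution (normal equations) and an iterative optimizer
# (gradient descent), which is what remains when no closed form exists
# or it is too expensive at scale.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)

# Closed form: solve (X^T X) w = X^T y
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Iterative: batch gradient descent on the squared error
w = np.zeros(3)
lr = 0.01
for _ in range(2000):
    grad = X.T @ (X @ w - y) / len(y)
    w -= lr * grad

print(w_closed, w)  # both approach the true weights
```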
Over the years we have learned that floating-point operations are still very important, but so are in-place data processing and data locality. The speaker pointed out that cache coherence is critical to getting the most out of CPUs, and that even RAM can be too slow. Parallelism is paramount, but unfortunately parallel algorithms are difficult to implement and debug. Uniform memory access simplifies programming models but has scalability challenges. Luckily, most data analytic algorithms can be expressed as a series of vector and matrix operations. However, power consumption is starting to become a significant problem, as we all can witness, according to the speaker.
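As an example of such a formulation, here is a small sketch (not from the talk) of PageRank written purely as vector and matrix operations: each iteration is a single matrix-vector product plus a vector update, exactly the kind of kernel a distributed linear algebra framework can parallelise. The tiny four-node graph and the damping factor of 0.85 are illustrative assumptions.

```python
# Hypothetical sketch: PageRank expressed purely as vector and matrix
# operations -- the iteration is nothing more than repeated matrix-vector
# products, which vectorized and distributed frameworks handle well.
import numpy as np

# Column-stochastic link matrix of a tiny 4-node graph (illustrative).
A = np.array([
    [0.0, 0.5, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.5],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
])
n = A.shape[0]
d = 0.85                    # damping factor
rank = np.full(n, 1.0 / n)  # initial rank vector

for _ in range(100):
    rank = (1 - d) / n + d * (A @ rank)  # one matrix-vector product per step

print(rank)
```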
In terms of storage, custom is the new commodity, Dr. Villanustre stated. Mechanical hard drives are too slow, particularly when random access is required. NAND is faster, but even NAND can be too slow when compared with RAM. Can we push processing into NAND controllers? the speaker wondered. How about type-aware storage? RAM in turn is too slow when compared with CPU cache: can we extend this model and push processing into RAM? Yet another interesting question.
Heterogeneous computing can be significantly more efficient than existing homogeneous architectures, the speaker continued. GPGPUs are more cost- and energy-efficient than conventional CPUs. Low-power processors, traditionally relegated to mobile and battery-operated devices, can also offer a denser and more energy-efficient architecture, at the expense of having to scale up the number of threads of execution. FPGAs are quite efficient when operating on bytes, words, or doubles would be wasteful because only a few bits are required.
When locality is insufficient, we should turn to networks because of the bandwidth they can offer, the speaker stated. True, latency can limit scalability if the partition sizes are too small. Sufficient bandwidth is critical to real-time event processing, delivery, and analytics, but not really to bulk processing, the speaker explained.
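A back-of-envelope sketch (my own, with assumed numbers) makes the partition-size point concrete: modelling transfer time as latency plus size divided by bandwidth shows that for small partitions the fixed latency term dominates, while for large, bulk-sized partitions the available bandwidth does.

```python
# Hypothetical back-of-envelope model: transfer time = latency + size / bandwidth.
# With small partitions the fixed latency term dominates, which is how latency
# can limit scalability even when plenty of bandwidth is available.
LATENCY_S = 100e-6       # assumed 100 microseconds per message
BANDWIDTH_BPS = 10e9 / 8 # assumed 10 Gbit/s link, in bytes per second

def transfer_time(partition_bytes):
    return LATENCY_S + partition_bytes / BANDWIDTH_BPS

for size in (4 * 1024, 1024 * 1024, 256 * 1024 * 1024):
    t = transfer_time(size)
    print(f"{size:>12} bytes: {t * 1e3:8.3f} ms "
          f"(latency share {100 * LATENCY_S / t:5.1f}%)")
```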
Dr. Villanustre identified the sheer complexity that software developers are exposed to as the biggest challenge. He argued that the importance of programmer productivity must not be underestimated. Indeed, software engineering is the single largest expense in any project, but how can we do better?
Unfortunately, in imperative programming models too much time is spent on controlling the execution flow and on low-level details, and it is hard to generalize the optimizations. According to the speaker, declarative programming, both data flow and functional, offers significant advantages. The compiler can determine the best order of execution. Optimizations applicable to non-strict evaluation can be used. It is up to the compiler to offload portions of the code to the most appropriate execution units. Persisted portions of the execution flow can short-cut future re-execution. The optimizer decides on the partitioning, based on the specific activities and the topology of the system.
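To illustrate the contrast, here is a small Python sketch (not from the talk) computing the same grouped sum twice: an imperative version with an explicit loop and mutable state that fixes the execution order, and a declarative, functional version built from side-effect-free stages that an optimizer or dataflow engine would be free to reorder, partition, and place. The records, the filtering predicate, and the add helper are purely illustrative.

```python
# Hypothetical sketch contrasting imperative control flow with a declarative,
# functional formulation of the same aggregation. In the declarative form the
# programmer states *what* to compute; because the stages are side-effect free,
# an engine is free to choose ordering, partitioning, and placement.
from functools import reduce

records = [("a", 3), ("b", 5), ("a", 2), ("c", 7), ("b", 1)]

# Imperative version: explicit loop, mutable state, fixed execution order.
totals = {}
for key, value in records:
    if value > 1:
        totals[key] = totals.get(key, 0) + value

# Declarative/functional version: a pipeline of pure transformations.
def add(acc, kv):
    key, value = kv
    return {**acc, key: acc.get(key, 0) + value}

totals_fn = reduce(add, filter(lambda kv: kv[1] > 1, records), {})

print(totals, totals_fn)  # same result either way
```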
In conclusion, Dr. Villanustre provided a look into the future. He believes in specialized environments for bulk processing and real-time processing. He made a plea for more abstraction, as opposed to purely imperative programming models, and for higher degrees of parallelism to leverage more power-efficient hardware. He also stressed the need to continue exploiting data locality and to integrate distributed linear algebra frameworks for vectorized algorithms. We should also embrace heterogeneous computing and develop better user interfaces.