Those were the days of old-time supercomputing

2 Jun 2010 Hamburg - To celebrate the 100th anniversary of Konrad Zuse's birth, the ISC'10 team organized a session on the history of computing, chaired by Prof. Dr. Wolfgang Karl and Prof. Dr. Klaus Waldschmidt. Prof. Dr. Karl from the Karlsruhe Institute of Technology devoted a short introduction to the key features of the Z3, developed by Konrad Zuse, who would have turned 100 on June 22nd, 2010. Afterwards, Prof. Dr. Yale N. Patt, as the first speaker of the session, entertained the audience with an account of the history of microarchitecture.

Prof. Dr. Karl gave an overview of some of the key features of the Z3, including the use of the binary number system both for representing numbers and for the circuits; the use of floating-point numbers, along with algorithms for converting between binary and decimal and vice versa; and the implementation of an algorithm for the non-restoring calculation of the square root. With this algorithm, the square root is obtained in n steps, one result digit per step.
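For readers curious what such a digit recurrence looks like, here is a minimal sketch in Python of a generic non-restoring binary square root for integers. It only illustrates the principle of one result bit per step and is not a reconstruction of Zuse's floating-point hardware.

    def isqrt_nonrestoring(d, n_bits):
        """Integer square root of d (0 <= d < 4**n_bits) by the non-restoring
        digit recurrence: one result bit per step, and a negative partial
        remainder is corrected in the next step rather than restored."""
        q, r = 0, 0                                       # partial root, partial remainder
        for i in range(n_bits - 1, -1, -1):
            pair = (d >> (2 * i)) & 0b11                  # bring down the next two bits of d
            if r >= 0:
                r = ((r << 2) | pair) - ((q << 2) | 1)    # try subtracting 4q + 1
            else:
                r = ((r << 2) | pair) + ((q << 2) | 3)    # correct by adding 4q + 3
            q = (q << 1) | (1 if r >= 0 else 0)           # append the next result bit
        return q

    print(isqrt_nonrestoring(1936, 8))                    # -> 44, since 44 * 44 = 1936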

He also mentioned value reuse, a look-ahead mechanism that identifies stores whose data will be reused within the next two instructions. That data is placed in a register with mechanical contacts, from which it can be read with no latency. He also touched on overflow: if a result exceeds the range of the arithmetic unit, it is treated as a special value.
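As a toy illustration of that look-ahead idea (not of Zuse's actual relay logic), a pass over an instruction stream might flag such stores as follows; the (operation, address) instruction format is invented for the example.

    def find_forwardable_stores(instructions):
        """Flag stores whose address is read again by a load within the next
        two instructions, i.e. candidates for keeping the value in a fast
        forwarding register instead of going back to storage."""
        hits = []
        for i, (op, addr) in enumerate(instructions):
            if op != "store":
                continue
            window = instructions[i + 1:i + 3]            # the next two instructions
            if any(op2 == "load" and addr2 == addr for op2, addr2 in window):
                hits.append(i)
        return hits

    # The store to address 7 is reloaded immediately, so it is flagged.
    print(find_forwardable_stores([("store", 7), ("load", 7), ("add", 3)]))   # -> [0]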

Enter Prof. Dr. Yale N. Patt, with "pomp and circumstance" and lots of to-the-point humor, to interpret the development of microarchitecture from 1970 to the present and beyond. He started out with Moore's Law and process technology.

In 1971 a chip held 2,300 transistors; by 1992 the number had risen to 3.1 million. Today there are more than a billion transistors on a chip, and ten years from now the number will have grown to some 50 billion.
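Those figures line up roughly with a doubling of the transistor count about every two years; a quick back-of-the-envelope check (the two-year cadence and the 1971 starting point are assumptions made only for the sake of the arithmetic):

    def projected_transistors(year, base_year=1971, base_count=2300, doubling_years=2.0):
        """Transistor count projected from a fixed doubling cadence."""
        return base_count * 2 ** ((year - base_year) / doubling_years)

    print(f"{projected_transistors(2010):.2e}")   # ~1.7e9  -> "more than a billion"
    print(f"{projected_transistors(2020):.2e}")   # ~5.5e10 -> roughly the 50 billion mark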

To put microarchitecture into perspective, he listed the layers that make up a computing operation: problem, algorithm, program, instruction set architecture, microarchitecture, circuits, and electrons. Of course, the electrons are doing all the work.

From 1971 to 2000, we could observe three agents of evolution: performance enhancement, the removal of bottlenecks, and ... good fortune, according to Prof. Dr. Patt. By the mid-eighties, chips had become sufficiently large that no separate chip was needed. The Pentium chip added additional instructions. The next obvious thing to do was to improve performance.

Caches and pipelining are equally important. Branch prediction became the next thing to add to the processor, and the instruction and data caches were split into separate structures.
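He did not go into predictor details, but a minimal sketch of the classic two-bit saturating-counter predictor (one early, common scheme, not necessarily the one discussed in the talk) gives the flavor of the idea:

    class TwoBitPredictor:
        """Two-bit saturating counters indexed by branch address:
        states 0-1 predict not-taken, states 2-3 predict taken."""
        def __init__(self, table_size=1024):
            self.table_size = table_size
            self.counters = [1] * table_size              # start weakly not-taken

        def predict(self, pc):
            return self.counters[pc % self.table_size] >= 2

        def update(self, pc, taken):
            i = pc % self.table_size
            if taken:
                self.counters[i] = min(3, self.counters[i] + 1)
            else:
                self.counters[i] = max(0, self.counters[i] - 1)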

Prof. Dr. Patt also mentioned on-chip accelerators, starting with the floating-point unit. He talked about speculative execution combined with out-of-order execution and in-order retirement.
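The point of pairing out-of-order execution with in-order retirement is that results may be produced in any order but only become architecturally visible in program order. A toy reorder buffer, sketched here with an invented interface, captures that constraint:

    from collections import deque

    class ReorderBuffer:
        """Results may complete out of order, but they retire strictly in order."""
        def __init__(self):
            self.entries = deque()                        # each entry: [tag, done, value]

        def issue(self, tag):
            self.entries.append([tag, False, None])       # allocate in program order

        def complete(self, tag, value):
            for entry in self.entries:                    # mark an instruction as finished
                if entry[0] == tag:
                    entry[1], entry[2] = True, value
                    return

        def retire(self):
            retired = []                                  # drain finished entries from the head only
            while self.entries and self.entries[0][1]:
                retired.append(self.entries.popleft())
            return retired

    rob = ReorderBuffer()
    for tag in ("i1", "i2", "i3"):
        rob.issue(tag)
    rob.complete("i3", 30)                                # i3 finishes first, out of order ...
    print(rob.retire())                                   # ... but nothing retires yet: i1 is pending
    rob.complete("i1", 10)
    rob.complete("i2", 20)
    print(rob.retire())                                   # now i1, i2, i3 retire in program order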

And there was wide issue. Prof. Dr. Patt noted that a trace cache makes no sense at all if you have no wide issue. He explicitly credited the concept of multi-threading to Burton Smith.

All in all, we have seen very different chips over all these years, but the last ten years are a different story.

Everything has become harder. How have we used the available transistors?

Around 2000 the multi-core was born, while transistor counts kept growing. Developers tried to find a solution in doubling the number of cores, but this has not delivered the expected performance.

"Why do we need a multi-core chip?" asked Prof. Dr. Patt. Because it is easier than designing a much better uni-core, he told the giggling audience.

We continue to "step the reticle". The dual-core becomes the quad-core, then the octo-core. A simple 8-core becomes 16-core, then 32-core. So Moore's Law has to be redefined: "Doubling the number of cores every 18 months". But has it actually worked?

Prof. Dr. Patt had his doubts. Why have we not seen a comparable benefit, except perhaps in certain cases?

We have seen the emergence of GPUs, which do well as long as there is no divergence, and of the database, as long as there are no critical sections.

A quad-core is no good for embarrassingly parallel calculations, and many mini-cores are no good either.

With regard to the microarchitecture for beyond the year 2010, Prof. Dr. Patt expressed the thought that the answer is to break the layers. In fact, we already have in limited cases, he said. He mentioned pragmas in the language, the refrigerator, and X + Superstar.

The algorithm, the language, the compiler and the microarchitecture all have to work together. There are plenty of opportunities: the compiler, the microarchitecture, the algorithm.

The requirements are to develop heavyweight cores to handle single-thread applications and OS code, to handle Amdahl bottlenecks, and to handle critical sections.
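The "Amdahl bottleneck" refers to Amdahl's Law: if a fraction s of a program's work is serial, no number of cores can speed it up by more than 1/s. A short illustration (the 10 percent serial fraction is just an example value):

    def amdahl_speedup(serial_fraction, cores):
        """Upper bound on speedup when a fixed fraction of the work stays serial."""
        return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

    # With 10% serial work, adding cores saturates well below 10x:
    for n in (2, 4, 8, 32, 1024):
        print(n, round(amdahl_speedup(0.10, n), 2))   # 1.82, 3.08, 4.71, 7.8, 9.91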

There is also a need for lots of mini-cores: a very large number of simple processors.

Prof. Dr. Patt concluded with the multi-core chip of the future, the Pentium X/Niagara Y, with multiple interfaces to the software and means of tackling off-chip bandwidth.
Leslie Versweyveld