The concept of cramming more and more computer power onto enterprise CPUs is championed by the chipmakers – but when push comes to shove, does it actually mean that your business applications will run faster or more effectively?
The general-purpose CPU is undergoing its first revolution since the advent of the silicon chip in the 1970s, by embracing multicores operating in parallel. Many earlier attempts at a revolution, such as SIMD (single instruction, multiple data), failed to make an impact in mainstream computing; but this time there is no turning back, for there is no other way of sustaining the continuing thirst for ever greater performance.
Bluntly, it is no longer feasible to continue increasing clock speeds at a time when escalating energy costs are conspiring with environmental concerns to make the greater power consumption unacceptable. What’s more, “it is not possible to clock a single processor at 4 or 5GHz”, says Giuseppe Amato, EMEAI marketing director of the value proposition group at AMD, one of the big makers of x86 processors for the Microsoft world. “The industry realised enough was enough.”
Since the launch of dual core processors for x86 in 2006, the trend in clock speeds has been gently downwards, for the first time in the history of the silicon chip. As Amato notes, this enables the voltage to be reduced, cutting power leakage, saving energy. But for the moment, one familiar aspect of chip evolution will not change – transistor density will continue to increase, which means that Moore’s Law will still be obeyed.
During 2007, for example, major chip makers began to migrate from 65nm to 45nm processes, increasing transistor density by 33 per cent. The continuing transistor density increase is no longer being exploited to build faster single processors, but to install multiple cores on a single chip or die.
At some point an even more radical revolution will be required to prevent Moore’s Law being broken, for the more fundamental laws of physics will prevent transistor sizes sinking much closer to the size of a silicon atom, which is about 0.25nm across, only 180 times smaller than the latest process size – but that’s another story.
There is another factor in the mix: the role of dedicated silicon such as ASICs (application-specific integrated circuits) and FPGAs (field programmable gate arrays).
Use of ASICs was driven by the performance advantages and cost savings that could be achieved through using a design dedicated to a particular task, such as video encoding or data encryption. However, with increasing performance and bandwidth demands, such dedicated processors are becoming a liability, consuming too much power and becoming bottlenecks themselves.
“The penalty for having separate ASICs has become too severe,” agrees Vivek Sarkar from Rice University in the US, a pioneer of programming for parallel architectures, who was senior manager of programming technologies at IBM before joining Rice University to research languages and compiler designs for parallel computing in 2007.
As the number of cores increases, multicore processors will absorb ASICs and dedicated silicon, says Sarkar. The idea is that current homogeneous chips with just a few identical cores will evolve into heterogeneous designs including a variety of different core types optimised for specific tasks.
These cores are unlikely to be totally dedicated to highly specific tasks – such as a particular encryption algorithm – as this would constrain their utility too much, and have the effect of rendering part of the chip obsolete or redundant. The objective instead is to have cores that are programmable but optimised for certain categories of task, such as video encoding, that will be required in some capacity on a large number of systems and will not become obsolete since the logic or software can be upgraded.
Such cores will be built to execute particular types of problem, such as vector manipulation or matrix multiplication, that are common to a number of tasks, but that are not needed by most general purpose business applications.
Rather than ASICs, or even FPGAs, such cores would resemble the GPUs (graphics processing units) that provide visual processing for PCs and game consoles, with a highly parallel structure making them more efficient for a range of algorithms manipulating vectors or rows of symbols than general purpose CPUs.
GPUs are already expanding in both directions, taking over some functions, for example in financial modelling, from the general CPU, and performing specialist tasks such as encoding, with leading chip vendors such as AMD planning to incorporate them in their multicores. “We are now working with some leading software vendors to take advantage of GPUs,” says Amato.
“Soon you will hear about software that will allow the GPU to do high-definition encoding.” Amato reckons consumer software will start to exploit GPUs for tasks such as HD encoding, for example in games applications, by the end of 2008.
Chips currently have just a few cores, up to four at present, but the number will increase rapidly to reach 1,000 by 2015, according to Sarkar. Already IBM has demonstrated a 256 core design. This proliferation of cores will bring extensive challenges both for bandwidth and programming, which only really emerge as the number of cores starts to exceed current levels. On the bandwidth front, the main problem is that the number of connectors within a chip cannot scale with the number of cores within a two dimensional package, so new approaches are required.
Intel has been promoting PCI Express, which it introduced in 2004 to replace the old PCI expansion bus, adding extensions to make it suitable for multicore; but, ultimately, more radical therapy will be required to transport data between thousands of cores within a chip, with options including optical and RF (radio frequency) interconnects.
IBM has been trying to solve the communication problem by going into the third dimension, stacking cores on top of each other to increase the number of interconnects within a given package. In such a chip, all the two dimensional interconnections can still be there, plus additional ones between successive layers of the stack. But this introduces a new problem: removing heat from the cores buried within the stack. IBM has tackled this by reintroducing water cooling (first used for mainframes 40 years ago). Whether this will prove economical remains to be seen, although it has the advantage of collecting the heat efficiently, enabling energy to be recovered from it.
The even greater challenge, though, lies within the whole software development cycle, from compilers up to high level languages and re-engineering of legacy applications, as was observed recently by Bill Gates when he declared of multicore computing: “This is the one which will have the biggest impact on us – we have never had a problem to solve like this… A breakthrough is needed in how applications are done on multicore devices.” This means that for the first time in Microsoft’s history, the software industry is being asked to play an equal part in keeping up with Moore’s Law, rather than assuming that hardware will continue to support larger, ever more bloated applications.
“The revolution in hardware technology has got to be followed by an evolution in programming languages,” observes John Stewardson, Hewlett-Packard’s product marketing manager for multi-processor industry standard servers in EMEA.
According to Gates, Microsoft is now aware of its full responsibilities, and is working with the chip makers to bring about this evolution. AMD is one such partner, and is helping ensure that Windows 7, the sequel to Vista, has much better support for multicores, according to Amato. “We are also working with Microsoft on heterogeneous cores, to ensure that the new operating system can take advantage of different types of central units, and be able to interact better with people,” says Amato.
As Amato hints here, there are two aspects to software development for multicore architectures. The first involves splitting applications up into smaller components for execution in parallel across multiple identical processors. On this front there is already a reasonable body of expertise accumulated in the scientific programming arena (and high-performance computing platforms, such as IBM’s Blue Gene). Although actual programming languages used in these arenas will not transfer to the commercial sphere, many of the development tools will, so the industry does not have to start from a clean slate.
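The first aspect, splitting work into independent components that run on identical cores, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names are hypothetical, and summing squares merely stands in for whatever independent work an application might contain. The sketch uses the standard library's `concurrent.futures` executors; it is not drawn from any of the vendors' tools mentioned in the article.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from math import ceil


def split_into_chunks(values, parts):
    """Split `values` into at most `parts` contiguous chunks of near-equal size."""
    size = max(1, ceil(len(values) / parts))
    return [values[i:i + size] for i in range(0, len(values), size)]


def sum_of_squares(chunk):
    """Stand-in workload: the independent unit of work done on one chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(values, workers, executor_cls=ProcessPoolExecutor):
    """Fan the chunks out to `workers` workers, then combine the partial results.

    `executor_cls` is a hypothetical convenience parameter: pass
    ThreadPoolExecutor to run the same logic on threads instead of processes.
    """
    chunks = split_into_chunks(values, workers)
    with executor_cls(max_workers=workers) as pool:
        # Each chunk is processed independently; only the final sum is sequential.
        return sum(pool.map(sum_of_squares, chunks))
```

In this sketch the splitting and the final combination remain sequential, which is exactly the kind of residual work that limits the achievable gain. With the default process-based executor, the call should be made from a script's main module so that worker processes can import the functions.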
But heterogeneous cores add an extra level of complexity because they require partitioning of an application or process into functional units suited for different types of core, which would be particularly painful when it comes to re-engineering existing software. It has proved hard enough for developers in the mobile and ARM embedded controller sectors, since different processors tend to have their own set of development tools, operating systems, and compilers.
Amato urges the x86 software community to rise to the challenge and develop a common operating system that can operate across heterogeneous multicores and dispatch tasks to any of them, rather than requiring programmers to cope with the different environments.
Yet this will still not spare programmers from the pain of migrating their own skill sets to multicore, as Rob Gibson, solutions manager at IBM’s Industry Systems Division observes. “Development tools will provide acceptable code, but as with any architecture, programmers must learn new techniques to optimise code for multicore systems.”
There is no choice though, for while uniprocessor performance can still improve some more, it will not be at historical rates. “Multicore is a necessity if we are to continue to deliver performance improvements to our customers at the same rate as in the past,” Gibson concludes.
Further information:
www.amd.com
www.ibm.com
www.rice.edu
http://x10-lang.org/
Trade-offs
Opportunities and limitations of parallelism
With the parallel revolution in hardware gathering pace, the old question of how much it can improve performance is becoming relevant for a growing number of IT departments and software developers. As always, the answer depends on the application, with ‘Amdahl’s Law’ continuing to apply. This states that if, say, only 80 per cent of a process can be split up into smaller components for independent simultaneous execution in parallel, then the maximum overall speed improvement can be no more than fivefold, however many processors are used, assuming each processor continues to clock at the same rate. Performance is constrained by the residual 20 per cent of the process that has to be executed sequentially.
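The fivefold ceiling can be checked with a few lines of Python. This is a minimal sketch of the standard formula, speed-up = 1 / (s + p/N), where p is the parallelisable fraction, s = 1 - p is the sequential remainder and N is the number of processors; the function names are hypothetical and chosen only for this illustration.

```python
import math


def amdahl_speedup(parallel_fraction, processors):
    """Overall speed-up when `parallel_fraction` of the work runs on `processors` cores."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


def amdahl_limit(parallel_fraction):
    """Ceiling on the speed-up as the number of processors grows without bound."""
    serial_fraction = 1.0 - parallel_fraction
    return math.inf if serial_fraction == 0 else 1.0 / serial_fraction


if __name__ == "__main__":
    # The article's example: 80 per cent of the process can be parallelised.
    for n in (2, 4, 16, 256, 1000):
        print(f"{n:>5} processors: speed-up {amdahl_speedup(0.8, n):.2f}")
    print(f"ceiling: {amdahl_limit(0.8):.2f}")
```

With four cores, for instance, the 80 per cent example yields a speed-up of 2.5, and adding cores brings it ever closer to five without reaching it.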
This used to be the end of the story, but recently the emergence of ‘heterogeneous parallelism’ has added a new twist, because now a process can be split into various unequal parts, each executed on the processor best suited to its needs. This can provide an additional acceleration over and above that achieved by splitting up a process for parallel execution.
It may even be that the residual 20 per cent of an application that cannot be parallelised, an encryption process perhaps, could still be delegated to a dedicated processor that may well perform the task ten times faster. In that case, assuming that the other 80 per cent can be split into as many parallel components as needed, the ceiling on the speed improvement would no longer be a factor of five, but 50 (with 40 parallel components, the improvement would be a factor of 25). This is because the 20 per cent residual component is itself accelerated ten times, so that it shrinks to 2 per cent of the original run time. This is one reason why the big chip makers, such as Intel and AMD, are now investing heavily in heterogeneous cores.
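The heterogeneous case can be sketched in the same way, by dividing the sequential remainder by the acceleration that a dedicated core provides. This is an illustrative extension of the usual speed-up formula under the article's assumptions (80 per cent parallelisable, sequential part run ten times faster); the function name and parameters are hypothetical.

```python
import math


def heterogeneous_speedup(parallel_fraction, processors, serial_acceleration):
    """Speed-up when the parallel part is spread over `processors` identical cores
    and the sequential part runs `serial_acceleration` times faster on a
    specialised core."""
    serial_time = (1.0 - parallel_fraction) / serial_acceleration
    parallel_time = parallel_fraction / processors
    return 1.0 / (serial_time + parallel_time)


if __name__ == "__main__":
    # 40 parallel components, sequential part accelerated tenfold.
    print(f"{heterogeneous_speedup(0.8, 40, 10):.1f}")
    # Unlimited parallel components: the ceiling on the improvement.
    print(f"{heterogeneous_speedup(0.8, math.inf, 10):.1f}")
```

Under these assumptions the factor of 50 is a ceiling approached as the number of parallel components grows; with exactly 40 components the sketch gives a factor of 25, because the parallel and sequential parts then each take 2 per cent of the original run time.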
Business case in brief
Enterprise IT specifiers will need to be convinced that paying a premium to upgrade their hardware to platforms based on multicore chips is going to deliver quantifiable return on investment. The main reason for this is that there isn't a lot of business software specifically designed and written to take advantage of multicore architectures – run a standard application on a four-core processor, and it won't necessarily perform any better; most likely three of the four cores will be left idle. True multicore software must be designed to let different cores simultaneously handle different tasks in an application.
However, Computerworld reported last month (29/09/08) that there are gains to be found when running virtualisation on multicore, where each core is assigned its own virtual machine, allowing each to run a separate application.
Backgrounds
Combination harvester
Multicore architectures and parallel programming may be quite new to Intel and AMD, but have been around longer in other regions of computing.
In 2001, IBM introduced the first general-purpose multicore processor, the dual core Power4 RISC processor with shared memory, deployed in supercomputers and high-end enterprise servers.
IBM’s POWER6 processors are fabricated hundreds at a time on a silicon wafer. The wafer is cut into individual chips that are then packaged and built into IBM servers. Each chip has two cores, runs at up to 4.7GHz, and contains 790 million transistors. The current POWER6 iteration was claimed to be the world’s fastest processor when launched in May 2007. As in current multicores from Intel and AMD, the cores are identical, but hybrid designs are already commonplace in digital signal processing, as was noted by parallel programming pioneer Vivek Sarkar from Rice University in the US. “The embedded systems world is quite used to this: for example, chips for cell phones combine general purpose CPUs with some analogue capabilities.”
Indeed, heterogeneous cores are now widely used in ARM 32-bit RISC processors, which account for 75 per cent of all embedded 32-bit RISC CPUs and are widespread in PDAs, game machines and media players.