It is a sci-fi staple: intelligent machines designing and building even more intelligent machines. If you want it to be scary, it is Skynet and the ‘Terminator’; if you want it to be funny, it is ‘Deep Thought’ and our own Planet Earth. In the real world, though – and particularly in the electronic design automation (EDA) industry – it is all about earning your crust.
As we move further into the multicore processing age, it is tempting to see EDA as a vanguard sector. It helped design the chips. Maybe, its natural participation in designing the generations and refinements that follow will help us solve the accompanying multicore programming problem.
Economics do push EDA to the latest hardware platforms. Chips do not just have multiple cores. The first two-billion-transistor design has been publicly reported – Intel’s Tukwila server chip for the Itanium family. Silicon with more than 100 million gates is set to become commonplace. All this means that there are vast amounts of data to crunch for any design, particularly during such stages in a flow as verification, and that it must be crunched under tight time-to-market constraints. Moreover, while the hardware is relatively cheap, the costs of maintaining and even just cooling it are also on the rise. Combined, this all puts serious pressure on the software to meet high performance standards.
“The good news is that this is not a new problem. You could even say it’s been around for a decade,” says Duaine Pryor, high performance computing architect for the Calibre product line at EDA vendor Mentor Graphics. “Quite a few design tasks exceeded the computational capacity of a desktop workstation some time ago and went out to these increasingly large compute farms – verification is one obvious example. As they made that shift, we obviously had to start thinking about how tools would exploit parallelism.”
With this in mind, EDA has made some impressive advances already in delivering tools that are well-suited to farms of the latest quad-core server processors. Mentor’s Calibre suite uses a variety of multicore techniques and extends along a design flow from physical verification to mask data preparation. It does this so well and so efficiently that it is also one of the big drivers behind Cadence Design Systems’ current hostile bid for the company.
Rival vendor Synopsys has also recently revealed that it has added multicore capabilities to a number of its tools or introduced entirely new ones, and launched a major multicore initiative earlier this year. Its experiences are instructive in that they show that different tasks in a design flow often demand emphasis on different techniques from within the multicore toolbox: ‘horses for courses’ is a good maxim to keep in mind. Here are a couple of examples.
First, take the Proteus back-end tool the company announced in February. Traditionally, mask synthesis has sequentially preceded mask data preparation. In Proteus, the two tasks have been pipelined so that they can be undertaken concurrently, taking a good chunk out of the design cycle time.
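The pipelining idea can be illustrated with a minimal sketch. It assumes the layout can be cut into independent tiles, so that data preparation starts on the first tile while synthesis is still working on the next. The tile names and the `synthesize` and `prepare` functions are hypothetical stand-ins and say nothing about how Proteus itself is built.

```python
import queue
import threading

def synthesize(tile):
    # Hypothetical stand-in for mask synthesis on one tile of the layout.
    return f"synth({tile})"

def prepare(mask):
    # Hypothetical stand-in for mask data preparation on one synthesized tile.
    return f"prep({mask})"

def run_pipeline(tiles):
    handoff = queue.Queue(maxsize=4)  # bounded buffer between the two stages
    done = object()                   # sentinel that marks the end of the stream
    results = []

    def stage_one():
        # Producer: synthesize each tile and hand it on immediately.
        for tile in tiles:
            handoff.put(synthesize(tile))
        handoff.put(done)

    def stage_two():
        # Consumer: prepare tiles as soon as they arrive, in arrival order.
        while True:
            item = handoff.get()
            if item is done:
                break
            results.append(prepare(item))

    workers = [threading.Thread(target=stage_one),
               threading.Thread(target=stage_two)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results

print(run_pipeline(["t0", "t1", "t2"]))
```

Python threads are used here only to show the structure. For genuinely compute-bound stages, a production tool would more likely use separate processes or native threads.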
Now, take the example of ZRoute, the new router Synopsys is adding to its IC Compiler suite. It makes more use of multi-threading. The company also took the decision to develop ZRoute from the ground up, and is claiming excellent results: a three- to four-fold gain in basic performance, boosted a further 3X, for an overall gain in the region of 10X when running on quad-core CPUs.
“What can you learn from this?” asks Saleem Haider, Synopsys’ senior director of marketing for physical design and DFM. “In part, it is that multicore is naturally driven to the most computationally intensive parts of the design flow, to the bottlenecks. This is where the customers need it most. You are not going to change an entire flow overnight – it will be a gradual process.
“Second, innovation is another area that will benefit from multicore sooner. ZRoute was done from scratch because it addresses the needs of the 45nm node. There is innovation to deal with emerging problems. However, I do not think that any vendor has the wherewithal to take every single tool in a flow and completely rearchitect it to run on multicore systems overnight – it simply wouldn’t make commercial sense. Again, this is something that will happen over time.”
At Mentor, Sudhakar Jilla, director of marketing for the company’s place-and-route products, agrees with this combination. “If you look at P&R, a lot more is being asked of it because of the issue around timing closure and signal integrity, so the computational demands really have risen. There are big pressures to bring down the runtimes and also to bring in new techniques such as multi-corner, multi-mode analysis,” he says.
Moreover, in the case of the tools Jilla is discussing, it is also significant that they are, like ZRoute, comparatively new. Mentor has them in its line-up following its acquisition of Sierra Design Automation last year. This is the new stuff, folks – it was written when multicore was already on the agenda. So far, we have seen that multicore has had a large upside for the EDA business. Better tools, running faster. All very good, thank you.
However, there are some significant tensions here. The first of those is a caution among the vendors as to just how much they can sell to their customers. This is because performance scaling with multicore is not consistently linear. “OK, so let’s say you throw 50 CPUs at the problem and you get 50X improvement, but if you throw 100 CPUs at it, you get only 80X – and that pretty much describes the kind of scenario we’re seeing,” notes Mentor’s Pryor. “I think that there are then two questions we need to communicate to the customer. One – up to what point do we still get linear scaling and what are the economic implications of that? Two – how far beyond that point where linear scaling stops does it remain economically viable to keep squeezing out what improvements you can?
“And those answers are going to vary. If you’re talking to an IDM, he’s concerned about everything from the start of place-and-route right through to mask data preparation. But a fabless guy has his focus on functional verification, and the mask house is just concerned with the data prep. Also, there are stages where you need to get the full linear improvement for the sums to add up, but also those where just a minor improvement is still worth a lot of value because you’re still addressing cost and complexity.
“What matters is that we don’t oversell this, but get it to match what the users need. That’s better for us in terms of the R&D investment and better for them for tool cost and performance.”
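Pryor’s figures can be turned into two simple measures: parallel efficiency (speedup divided by CPU count) and the marginal gain from each CPU added past the linear range. The sketch below uses only the numbers he quotes; the function names are illustrative.

```python
def efficiency(speedup, cpus):
    # Fraction of ideal linear scaling actually achieved.
    return speedup / cpus

def marginal_gain(speedup_a, cpus_a, speedup_b, cpus_b):
    # Extra speedup bought by each CPU added between the two configurations.
    return (speedup_b - speedup_a) / (cpus_b - cpus_a)

print(efficiency(50, 50))               # 1.0
print(efficiency(80, 100))              # 0.8
print(marginal_gain(50, 50, 80, 100))   # 0.6
```

On those figures, the second batch of 50 CPUs returns only 0.6X each, and whether that is worth paying for is the economic question Pryor raises.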
This acknowledgement that different parts of a design flow may stand to gain by different degrees then points to another tension. The last few years have seen all the major EDA vendors move away from selling tools to do specific jobs – so-called ‘point tools’ – and look more to sell integrated flows. As complexity has grown and the need for closer relationships between various parts of the flow has become clear, this flow-led approach has been broadly accepted.
The problem is that while some tools might be parallelisable to good effect for multicore, overall flows are a great deal more stubborn. “If you are going to successfully parallelise an operation then you have to control the overhead,” says Patrick Groeneveld, chief technologist for Magma Design Automation. “You need to keep down the dependencies and interactions between the threads, avoid bottlenecks and reduce the burden of partitioning and then re-assembling. You want tasks that are 100 per cent independent.”
Thus, he adds, you can fairly conclude that analysis is a relatively straightforward task to parallelise, and that routing something is easier than then optimising it in the light of relationships between various elements in the design.
“Another way of looking at things is that parallelisation is something that lends itself more to graphics processing. There, you have a great deal of floating-point activity relative to a small amount of data. However, with EDA, you are looking at a lot less floating-point, but across a much larger amount of data,” Groeneveld says.
Finally, another problem lies in the naturally algorithmic nature of EDA tools and the limits set for high performance computing by Amdahl’s Law. The relevant issue here is that it defines the need for sequential processing as one of the main limits on the improvements you can achieve.
“And, if your tool or your strategy is based on sequences of algorithms, then the limitations on what you can do to parallelise that part of the flow are intrinsic,” says Groeneveld.
So while one might be able to secure a 10X improvement for a single element in a flow by optimising for a multicore platform, Groeneveld’s argument is that the ‘brakes’ might hold back overall improvement to closer to just 2X – and that the more difficult elements in that flow could potentially hold the performance boost to this level for some time to come.
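Amdahl’s Law makes the gap between 10X and 2X easy to reproduce. If a fraction p of the flow’s runtime is accelerated by a factor s, the overall speedup is 1 / ((1 - p) + p / s). The runtime shares in the sketch below are hypothetical, chosen only to show how quickly the unaccelerated remainder dominates.

```python
def overall_speedup(accelerated_fraction, stage_speedup):
    # Amdahl's Law: the part that is not accelerated still runs at full length.
    untouched = 1.0 - accelerated_fraction
    return 1.0 / (untouched + accelerated_fraction / stage_speedup)

# Hypothetical shares of total flow runtime taken by the accelerated stage.
for share in (0.25, 0.50, 0.56, 0.90):
    gain = overall_speedup(share, 10)
    print(f"{share:.0%} of runtime at 10X -> {gain:.2f}X overall")
```

Under these assumptions, a stage has to account for roughly 56 per cent of total runtime before a 10X gain within it delivers 2X for the flow as a whole.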
Magma is talking about the problem at the flow level because it believes it has an approach that will work. Its Hydra technology allows for micro- and macro-partitioning of a design as appropriate within the progress of the flow, retaining and then re-inserting the information that will allow these elements to ultimately be stitched back together. And yet partitioning, however smart it is, is still a performance limiter relative to the ideal.
The lessons from EDA then are that there remain many specifics to resolve. The industry is attacking its lower hanging fruit – although given the complexity of even this software task it seems mealy-mouthed to call them such – but will over time still have to address more points.
Skynet isn’t here yet – the process still needs a human guiding hand to help it make smart workarounds.
There is a constant tension in EDA between smart methods and available horsepower.
EDA companies generally present themselves as experts in electronic design. However, very often they will only introduce a new class of tool once a solid demand has appeared in the semiconductor business.
It is a common complaint of semiconductor executives that EDA is slow to react: very often chipmakers have to develop their own prototype tools to solve an immediate problem before anything appears on the market.
However, this view misunderstands the role that the EDA companies play in the market. They do not necessarily invent design techniques: they come up with ways to scale them from the circuit level to that of the billion-transistor chip.
It means taking algorithms and reworking them so that they scale much more efficiently. An algorithm that scales with the square of design size will quickly run out of steam. Coming up with one that scales with the log of the design size is much more manageable. It is why a lot of EDA-oriented conferences spend a lot of time talking about novel sparse-matrix representations and alternative representations of design data.
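A quick calculation shows why the difference matters. The sketch below compares how much extra work each kind of algorithm needs when a design grows from a million to a billion elements. Those sizes are hypothetical, and constant factors are ignored.

```python
import math

def growth_factor(cost, small, large):
    # How many times more work the larger design needs under a cost model.
    return cost(large) / cost(small)

def quadratic(n):
    return n ** 2

def logarithmic(n):
    return math.log2(n)

small, large = 10**6, 10**9  # hypothetical design sizes
for name, cost in (("quadratic", quadratic), ("logarithmic", logarithmic)):
    print(f"{name}: {growth_factor(cost, small, large):,.1f}x more work")
```

The quadratic algorithm needs a million times more work for a design a thousand times larger, while the logarithmic one needs only about one and a half times as much.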
However, the appearance of many-core architectures can swing the pendulum back the other way. For recent versions of the Calibre optical-proximity correction (OPC) software, Mentor switched to a grid-based technique from a data structure that concentrated on details and corners.
This was in response both to the way in which OPC has become more complex since the appearance of the 65nm process node and to the way in which architectures such as the IBM Cell make it worth developing highly parallelisable code.
Similarly, Nascentric took a step back from the sparse-matrix representations of FastSpice in a version of its Spice simulator that runs on massively parallel graphics processor cores.
Rahm Shastry, CEO of Nascentric, says: “In the first release of OmegaSim GX, we are not accelerating sparse-matrix calculations on the GPU. We profiled sub-65nm circuit simulations, and we found that when you run Spice at its most accurate mode, transistor evaluations consume the bulk of total simulation time – anywhere from 80 per cent to 90 per cent of total runtime.
“We ported this ‘low hanging fruit’ onto the GPU. So, instead of deploying the ‘Fast-SPICE trick’ of a simplified transistor model, we can now run the detailed BSIM transistor models in our GX option as-is and remove this huge transistor evaluation burden from the CPU. We are looking to offload other simulation tasks onto GPU in the future, working closely with nVidia.”
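The profiling figures Shastry quotes set a ceiling on what offloading transistor evaluation alone can achieve, because the remaining 10 to 20 per cent of runtime stays on the CPU. The sketch below works that ceiling out. The 20X accelerator speedup is a hypothetical value for illustration, not a Nascentric figure.

```python
def offload_ceiling(offloaded_fraction):
    # Best case: the offloaded work takes no time at all on the accelerator.
    return 1.0 / (1.0 - offloaded_fraction)

def offload_speedup(offloaded_fraction, accelerator_speedup):
    # Realistic case: the offloaded work still takes some time.
    remaining = 1.0 - offloaded_fraction
    return 1.0 / (remaining + offloaded_fraction / accelerator_speedup)

for fraction in (0.80, 0.90):
    ceiling = offload_ceiling(fraction)
    realistic = offload_speedup(fraction, 20)  # 20X is a hypothetical GPU gain
    print(f"{fraction:.0%} offloaded: ceiling {ceiling:.1f}X, at 20X {realistic:.1f}X")
```

On these assumptions the ceiling lies between 5X and 10X, which suggests why offloading the other simulation tasks would be the natural next step.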
Chris Edwards
In 2006, electronic design tools vendor Synopsys joined the supercomputer club, and did not buy a pile of server blades to do it. The company’s IT team found a way to wire up six of its smaller clusters into one grid computer.
The grid achieved benchmark results of more than 3.7 Teraflops, roughly equivalent to 18,000 PCs working at the same time. Rather than buy new equipment that would remain idle much of the year to form the machine, the Synopsys IT team came up with ways to combine the resources of six existing clusters and test them on a nightly basis to make sure it worked without disrupting ongoing projects.
The supercomputer comprised 329 Linux servers connected by standard Gigabit Ethernet switches. At the time, Hasmukh Ranjan, senior director of IT at Synopsys, said: “The goal of getting into the Top 500 was just to achieve a solution to a business problem. The basic purpose of putting this together was to speed up both the way we build our tools and to improve the runtime performance of our tools.”
Like the other major electronic design-tool suppliers, Synopsys has provided support for companies that use server farms for tasks such as regression testing and, one of the biggest applications of multiple servers in EDA, distributed hardware simulation. You can regard simulation as parallelisation the easy way.
Rather than trying to deal with the communication overhead of running one simulation across many computers, most users cut the problem in the other direction. They run many copies of one simulation and have each run a small subset of the overall testbench. In this environment, the biggest problem is combining the results of all the simulations in one cohesive database. But it is much easier than trying to rewrite a logic simulator to take advantage of hundreds of computers.
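The approach can be sketched in a few lines: split the testbench into shards, run one simulator copy per shard and merge the partial results into a single table. Everything below is a hypothetical stand-in. `run_shard` simply marks tests whose names end in `_bad` as failures, and a thread pool takes the place of the job scheduler that would dispatch work across a real farm.

```python
from concurrent.futures import ThreadPoolExecutor

def shard(tests, n):
    # Deal the tests out round-robin into n roughly equal subsets.
    return [tests[i::n] for i in range(n)]

def run_shard(tests):
    # Hypothetical stand-in for one simulator copy running its subset.
    return {t: ("FAIL" if t.endswith("_bad") else "PASS") for t in tests}

def run_regression(tests, n_workers):
    merged = {}
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Each worker runs independently; results are combined at the end.
        for partial in pool.map(run_shard, shard(tests, n_workers)):
            merged.update(partial)
    return merged

results = run_regression(["reset", "fifo_bad", "dma", "irq"], n_workers=2)
failures = sorted(t for t, status in results.items() if status == "FAIL")
print(failures)
```

The merge step is trivial here, but it is the stage that turns into the combined database the text describes once each shard returns large volumes of simulation output rather than a single pass or fail.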
In some cases, users have farms in multiple locations – 10,000 CPUs in one place and a few thousand in another – which demands specialised management tools. The farms can generate terabytes of data in which a bug might be lurking, so the post-processing runs the danger of being a bottleneck in itself if teams are not careful.
Chris Edwards
If massively parallel architectures have taken a hold anywhere in EDA, it is in the hardware emulator.
In the 1990s, Quickturn – later bought by Cadence Design Systems – embraced massively parallel reconfigurable computing as the way it would emulate the function of hundred-million-transistor chips.
Mentor Graphics’ Celaro used reconfigurable computing but the company has recently moved back to emulation on field-programmable gate arrays with the Veloce family.
However, in contrast to other emulators based on FPGAs, Mentor decided to create its own architecture. Commercial FPGAs are designed for deployment efficiency rather than compile speed. In an emulator, the opposite is better as the design will change frequently.
Similarly, I/O multiplexing from an earlier generation of custom emulators was brought into Veloce but, this time, an implementation based on FPGA gates was swapped into custom logic.
As with the other parts of EDA, the architectures of emulators are in constant flux.