We all have this inner urge to destroy, don’t we? Not by default. We are good people, right? It definitely needs a trigger. Some alcohol could do the trick. Or a bad night of sleep combined with some major setback. Or Saturn lines up with the moon while Venus plays online chess with Mars. This primal instinct is important for the verification engineer. A little appetite for destruction (*) is healthy, and here is why.
Bug hunting is not the “CSI” of digital VLSI design. Still, we need people to challenge the design under test, though not to the absurd. It is a kind of yin and yang situation. When we verify a design, we make sure the assumptions the designer made are valid. Those assumptions come from the specification of the device: the primary source of information for the designer’s implementation. But that doesn’t mean everyone reads it the same way. Other project members could read the same spec and interpret it differently. Maybe a certain configuration for a mode of operation, or an alternative configuration for the same functionality, or a switch from one mode of operation to another. Switching of modes, in particular, is rarely adequately documented. This list of potential bug sources is far from complete; I just want to highlight the ways verification should try to confirm that the design “works”. The customer could well interpret the spec differently than the designer intended. And then all hell breaks loose. You don’t want that, trust me.
Measuring the coverage of features is a given. Spoiled as we are, specifications are always unambiguous, and they list all the features with a handy reference. So the super detailed verification plan, ready before verification starts, refers to those references. When verification finishes, all test scenarios cover all spec feature references. Easy-peasy, isn’t it? Unfortunately, we cannot measure what we do not find in the specification. Let me give an example that illustrates my point. Imagine a complex chip with many clocks that have software enables. This allows maximum flexibility for the customer’s future use cases. But verification might pass with all clocks enabled all the time, or with only the clocks for a certain communication protocol under test. The actual use case, which might be unknown at the time of verification, could reveal a clock enabling issue, because there are so many combinations of active and inactive parts of the chip. Software control over clocks, resets, power domains and master/slave configurations makes a lot of things possible with the actual chip. But in verification land, that poses a huge challenge.
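To get a feel for why those clock enables are such a headache, here is a minimal back-of-the-envelope sketch in Python. The numbers (12 enables, the two “typical” test configurations) are hypothetical, chosen just to show how small a fraction of the configuration space a conventional verification plan actually touches:

```python
import random

# Hypothetical chip: 12 software clock enables. Real SoCs have far more.
NUM_CLOCK_ENABLES = 12

total = 2 ** NUM_CLOCK_ENABLES  # every on/off combination

# A typical plan covers "all clocks on" plus one protocol-specific subset.
covered = {
    (1,) * NUM_CLOCK_ENABLES,  # all clocks enabled
    tuple(1 if i < 3 else 0 for i in range(NUM_CLOCK_ENABLES)),  # protocol-under-test clocks only
}
print(f"covered: {len(covered)} of {total} combinations")

# Constrained-random configuration sampling at least probes the space.
random.seed(0)
for _ in range(100):
    covered.add(tuple(random.randint(0, 1) for _ in range(NUM_CLOCK_ENABLES)))
print(f"after 100 random configs: {len(covered)} of {total}")
```

Even after a hundred random configurations, the vast majority of the space remains unvisited; the point is not to enumerate it all, but to make the gap explicit so risk can be weighed deliberately.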
Now, this happened a long time ago, but I feel it illustrates my point and is still extremely relevant. Back then I was running simulations on a system with an AHB master and multiple slaves, plus an AHB-to-APB bridge for low-bandwidth configuration and status interfaces. The AMBA protocol has ways to deal with slaves that need multiple wait states, and the bridge supported the protocol. My task was to verify the addition of a new communication IP with an APB configuration interface. Probably the first thing I did was access the APB slaves with a few back-to-back APB requests. Mind you, I was just starting on the project for an external customer, not yet familiar with all the technical details or the customer. My back-to-back access via the embedded processor failed miserably. First candidate to blame: myself. So a few hours passed, reading the specification, going through the code and the AMBA protocol documentation. But I couldn’t find anything wrong with my access. Flagging something is never easy. Do it too soon and it might be something stupid you overlooked, and you look bad. Wait too long and your direct supervisor wonders why the task is taking so long. Therefore, one of the most important things in the logical world of dVLSI design is gut feeling. Coming back to the APB bridge: it turned out to be an RTL bug. I wasn’t part of the design team as such; I was a subcontractor. The external company’s engineers did APB accesses a certain way. My way was different; I had no bias or assumptions like they had. The moral of the story is that one needs to avoid what I call “symmetric” bugs: when the same assumption lives in both the design and the tests, neither side can catch it. The basic rule is that the verification engineer can never be the designer of the RTL code. Like everything in life, sometimes a unique view is mandatory.
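The failure mode behaves roughly like the toy transaction-level model below. This is a Python sketch, not RTL, and every name in it is hypothetical; the real bug was in cycle-accurate AHB/APB signalling. The sketch only captures the shape of the problem: a bridge that implicitly assumes an idle cycle between requests, so the designers’ spaced-out accesses pass while back-to-back accesses lose a transfer.

```python
class BuggyBridge:
    """Toy model of a bridge that accepts a request only when idle.

    The (hypothetical) bug: while a slave access with wait states is
    in flight, a new request is silently dropped instead of stalled.
    """

    def __init__(self, wait_states):
        self.wait_states = wait_states
        self.busy_for = 0        # cycles until the current access completes
        self.completed = []      # addresses that actually reached the slave

    def request(self, addr):
        if self.busy_for > 0:
            return False         # the bug: request silently lost
        self.busy_for = 1 + self.wait_states
        self.completed.append(addr)
        return True

    def tick(self):              # advance one clock cycle
        if self.busy_for > 0:
            self.busy_for -= 1


bridge = BuggyBridge(wait_states=2)

# The design team always left idle cycles between accesses, so their
# tests passed. Back-to-back requests expose the dropped transfer:
accepted = [bridge.request(0x100), bridge.request(0x104)]
print(accepted)  # [True, False] -- the second access is lost
```

A correct bridge would stall the second request until the first completes, which is exactly what the wait-state mechanism in the protocol is for.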
GOOD STUFF, BAD STUFF
In the above, breaking stuff is a good thing, because it avoids problems swept under the rug returning like a boomerang when the customer starts experimenting. Good team members will always strive to think outside and inside the box. Their goal is the project, and they reduce its risk by verifying. But today, because of -let’s face it- weak management, bugs are used as a metric. Especially the top needs multi-color bar graphs. They use them to take quick action, because they need visibility. And this comes from tracking the number of bugs found. They want to see more bugs closed than reported. And they don’t want to see some units with a lot of bugs next to others with just a few. They make true data-driven decisions, but they never question the data. Was the design of a certain unit flawed? And was it the designer, or is there another reason? That other unit with few bugs: was it because verification focused on hitting coverage and not so much on risk reduction? Were the verification methods the most suitable for that unit? And what about the verification resources? It almost never suffices to just collect data and draw conclusions. There are so many different factors in play that it becomes Garbage In, Garbage Out (aka GIGO) reporting. Especially in verification, things can get ugly. Imagine a big chip where firmware (embedded software) is running on an embedded processor. Calling it complicated is the understatement of the century. Breaking the device isn’t that difficult. The question we must ask is: is it a bug or not? For example, if the firmware disables the clock of one part of the chip and then tries to access that part, the firmware could hang forever. The software person developing the firmware will not immediately know why the access never completes or returns a value. But hardware isn’t there to recover from all misconfigurations.
The firmware person would insist on more visibility, more debug capabilities and actual logic to catch various mishaps. Perfectly understandable. But the hardware cost in area, power consumption, silicon yield, … matters. The basic rule is not to bloat the chip with logic that protects against every stupidity. There are thousands of ways to deadlock a chip without being able to tell where it happened and why. The only sensible thing, in my opinion, is to add a watchdog timer to catch infinite loops, timeouts and bus deadlocks. Similar to the dead man’s switch in a train.
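The watchdog idea can be sketched in a few lines. Again, this is an illustrative Python model under assumed parameters (a made-up timeout and kick schedule), not RTL or a real driver: a free-running down-counter that firmware must “kick” periodically. If a bus access hangs and the kick never comes, the counter expires and forces recovery instead of a silent, permanent deadlock.

```python
class Watchdog:
    def __init__(self, timeout_cycles):
        self.timeout = timeout_cycles
        self.count = timeout_cycles
        self.fired = False

    def kick(self):
        self.count = self.timeout   # healthy firmware restarts the countdown

    def tick(self):                 # called every clock cycle
        if self.count > 0:
            self.count -= 1
            if self.count == 0:
                self.fired = True   # a real watchdog would assert reset here


wd = Watchdog(timeout_cycles=5)
for cycle in range(20):
    if cycle < 8 and cycle % 3 == 0:
        wd.kick()                   # firmware is alive for the first cycles...
    wd.tick()                       # ...then hangs on a dead bus access
print(wd.fired)  # True: the deadlock was caught instead of hanging forever
```

The design trade-off is exactly the one above: one small counter buys a guaranteed escape hatch from thousands of unforeseeable deadlock scenarios, at a fraction of the cost of protecting each one individually.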
Of course, there are many more things to say about bugs and debug, but let’s conclude with this. Verification is about risk reduction in a given timeframe with a given budget. To be a lean, mean verification machine, you need to understand what the most appropriate strategy is for each level of the chip. To automate tedious and error-prone tasks. To understand the tools and their use of resources, both human and machine. You know how people talk about 10x software engineers? The same goes for hardware engineers. They beat the big guns with 10% of the budget and 10% of the resources, and still tape out considerably earlier with reduced risk. In the spirit of the Pareto principle, roughly 20% of project time is spent on design and 80% on verification. That makes verification a CRUCIAL factor for any ASIC or FPGA project.
(*) Guns N’ Roses album released in 1987.