Predicting And Avoiding Failures In Automotive Chips

Semiconductor Engineering sat down to discuss automotive electronics reliability with Jay Rathert, senior director of strategic collaborations at KLA; Dennis Ciplickas, vice president of advanced solutions at PDF Solutions; Uzi Baruch, vice president and general manager of the automotive business unit at OptimalPlus; Gal Carmel, general manager of proteanTecs’ Automotive Division; Andre van de Geijn, business development manager at yieldHUB; and Jeff Phillips, go-to-market lead for transportation at National Instruments. What follows are excerpts of that conversation. To view part one of this discussion, click here.

SE: There are four major problems with data. One is there is just too much of it. If you try to cut out the amount of data using some sort of intelligent processing, you’re in danger of leaving something important behind. Second, no matter how good the data is, it probably still isn’t complete. Third, it’s usually not in a consistent format. And fourth, there’s an underlying competitive element so people are reluctant to share data. How will this play out in automotive?

Ciplickas: With regard to too much data and how to put it together in a clean way, we’ve developed a notion called a “semantic model.” Semantics are different than schema. It’s like the difference between grammar and syntax. A schema is a way to relate keys between different sources of data. But when you put semantics on top of that, although data may come from different sources, each with their own keys, you can see the data are actually very similar to, or the same as, each other. So for different types of tools, or different types of sensors on tools, even though they’re physically different and from different vendors, sometimes they can be treated as logically identical for analytics purposes. By identifying what’s semantically similar across all of the different sources of data, you can easily assemble data to extract useful results. By driving the notion of semantics up and down the supply chain, you get a very efficient way to put everything together for analytics and control.
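
The difference between a schema and a semantic layer can be illustrated with a short Python sketch. It is an illustration only: the tool names, vendor keys, and semantic names are hypothetical and do not come from any PDF Solutions product or standard. The idea is that each vendor-specific key is mapped to a shared semantic name and a common unit, so records from physically different tools can be analyzed together.

```python
# Hypothetical semantic layer: (source, vendor key) -> (semantic name, factor to common unit).
# Every name below is an assumption made for illustration.
MTORR_TO_PA = 0.133322  # 1 mTorr is about 0.133322 Pa

SEMANTIC_MAP = {
    ("vendor_a_etcher", "ChamberPress_mT"): ("chamber_pressure_pa", MTORR_TO_PA),
    ("vendor_b_etcher", "press_pa"): ("chamber_pressure_pa", 1.0),
    ("vendor_a_etcher", "TempC"): ("chuck_temperature_c", 1.0),
    ("vendor_b_etcher", "chuck_temp_c"): ("chuck_temperature_c", 1.0),
}


def to_semantic(source, record):
    """Translate one vendor-specific record into shared semantic names and units."""
    translated = {}
    for key, value in record.items():
        mapping = SEMANTIC_MAP.get((source, key))
        if mapping is None:
            continue  # No agreed semantic meaning for this key yet; skip it.
        name, factor = mapping
        translated[name] = value * factor
    return translated


if __name__ == "__main__":
    a = to_semantic("vendor_a_etcher", {"ChamberPress_mT": 10.0, "TempC": 60.0})
    b = to_semantic("vendor_b_etcher", {"press_pa": 1.4, "chuck_temp_c": 61.0})
    # Both records now expose the same field names and units.
    print(sorted(a) == sorted(b))
```

Keeping the mapping as data rather than code is one way to avoid going back to an engineering team each time a new tool or sensor is added.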

Baruch: I agree. We’ve both taken different approaches to that. We grew up on the sensor side, which is much more structured. When we went into the automotive side, we had to revisit that model to bring the semantics into a descriptive layer. Also, I want to introduce the notion of event-driven. Those processes are not bounded. There may be 10, 15, or even 100 process steps when you look at the components as they are built across time. And so that idea of being able to ingest data from multiple layers — different layers, different processes, different sensors, different equipment types — and combine them all together, if you don’t have a descriptive approach you will either end up going back to your engineering team every time you need to do something, or not being able actually to fulfill that use case because you’ll get stuck in different areas. We used this approach, and it helped us move between different companies, from Tier 1s to the OEMs. By the way, both are using different languages for altogether different problems, and they still can provide a solution.

Ciplickas: Yes, there are different layers of data and different sequences that chips go through, from the wafer into the singulated die and into the packages and systems. Traceability is getting more and more important, especially in SiP. The automotive companies are dangling their feet in the water now with these advanced technologies. And if you think about the technology in an iPhone now becoming part of a car, it’s amazing the technology they can jam into a package, what they mount to a board, and how it all interacts. It will be a huge challenge to achieve automotive-level reliability. You want to be able to look at a failure from the field and quickly associate where that device came from, and what its neighbors or siblings were doing at every point along the way. From this you can build a picture of what is abnormal versus what is normal, and contrast the two to quickly get to the bottom of what’s going on. Establishing that traceability link completely through the stack is super valuable, necessary, and difficult to do. I’ve been talking with some folks at big fabless companies, and they say that as soon as you put the die into the assembly, you lose that traceability. No, you don’t have to lose that traceability. You can actually implement that in the tools. There’s formats to represent all of this, like E142, to capture component movement and operations. We participate in, as well as lead, these standards efforts to keep the format current, adding things like consumables, because you may get a bad batch of solder paste or gold wire. Bringing together data across the entire supply chain creates significant efficiencies and understanding of failures. Furthermore, once you understand the root cause of the failure and you have the traceability, you can look at what else is at risk.
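
As a rough illustration of the traceability lookup described above, the Python sketch below starts from a failed package and finds the die inside it, the other die from the same wafer, and the die that sat directly next to it. The record layout and identifiers are hypothetical simplifications and are not the E142 format itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DieRecord:
    die_id: str
    lot: str
    wafer: int
    x: int           # die column on the wafer map
    y: int           # die row on the wafer map
    package_id: str  # package the die was assembled into


def trace_failure(records, failed_package_id):
    """Return (failed die, same-wafer siblings, adjacent neighbors) for a field return."""
    failed = [r for r in records if r.package_id == failed_package_id]
    siblings, neighbors = [], []
    for f in failed:
        for r in records:
            if r.die_id == f.die_id or (r.lot, r.wafer) != (f.lot, f.wafer):
                continue
            siblings.append(r)
            # Adjacent means one step away on the wafer map, diagonals included.
            if max(abs(r.x - f.x), abs(r.y - f.y)) == 1:
                neighbors.append(r)
    return failed, siblings, neighbors


if __name__ == "__main__":
    records = [
        DieRecord("L1-W3-5-5", "L1", 3, 5, 5, "PKG-9"),
        DieRecord("L1-W3-5-6", "L1", 3, 5, 6, "PKG-10"),
        DieRecord("L1-W3-9-9", "L1", 3, 9, 9, "PKG-11"),
        DieRecord("L1-W4-5-5", "L1", 4, 5, 5, "PKG-12"),
    ]
    failed, siblings, neighbors = trace_failure(records, "PKG-9")
    print([r.package_id for r in neighbors])
```

The packages returned for siblings and neighbors are the ones to examine first when asking what else is at risk.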

Rathert: There’s another side of this loop, too. We’re talking primarily about stopping bad die or bad packages from escaping into the supply chain, but there is also plenty of potential data feedback that could improve each of our respective domains. How do I harden my design? How do I optimize my process control plan? How do I tighten my test program? Could we harvest both sides of this?

Carmel: There needs to be an incentive and a means to share data across the industry and across suppliers. This will help to identify, predict, and monitor issues, and offer value throughout — not to mention reach a fast resolution if something does happen. The amount of data is not as important as the quality, relevance and actionability of the data being collected. That’s why we aggregate deep data, based on real-time measurements and with a cross-stage commonality, and extrapolate the value, supplying only what is defined as pertinent. It starts with the chip vendors and Tier 1s. They can begin sharing data without compromising sensitive information. For silos to begin to organically share this data, a framework must be clearly defined, along with a common data baseline. Once those foundations are in place, liability models can be built that serve the full toolchain and push the performance, efficiency and safety envelopes at scale.

Ciplickas: We’re leading a standards effort with SEMI’s Single Device Tracking task force, using a ledger-based method to track assets — die, packages, PCBs, etc. — through the supply chain. We think it requires some type of standards effort because it’s a huge challenge to tool up the supply chain. Everyone must participate in enrolling an asset and tracking its custody and ownership as it moves from one party to another. You want to have as little data in that ledger as you can, because nobody wants to put all their manufacturing data into a public blockchain. But once you have this ledger, you know who to call in order to get more detailed data, such as through a private contract with a supplier. If you’ve got an RMA from an OEM or an automotive Tier 1 supplier, you want to understand, ‘What in the fab, at a particular process step, caused the defect? What inspection or metrology measurements do we have for this lot?’ Using the ledger, you can get back to the right supplier using the public blockchain, and then have a private discussion with them about what exactly happened. Getting everybody in the industry — including the big guns that are making the chips and spending all the money — to buy into this is taking time. So it’s a challenge, but it’s a good approach to solving this.
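
A minimal sketch of the ledger idea follows, in Python. It is a single-process hash chain, not a distributed blockchain, and it is not the design of the SEMI task force; the field names and party names are assumptions. It shows the two properties mentioned above: the ledger holds only custody information, and it tells you which party to contact for detailed data.

```python
import hashlib
import json


def _digest(entry):
    """Hash the entry fields deterministically (keys sorted)."""
    payload = json.dumps(entry, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_transfer(ledger, asset_id, from_party, to_party):
    """Append a custody transfer chained to the hash of the previous entry."""
    entry = {
        "asset_id": asset_id,
        "from": from_party,
        "to": to_party,
        "prev_hash": ledger[-1]["hash"] if ledger else "",
    }
    entry["hash"] = _digest(entry)  # Hash covers the four fields above.
    ledger.append(entry)
    return entry


def verify(ledger):
    """Return True if every entry links to its predecessor and is unmodified."""
    prev = ""
    for entry in ledger:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev_hash"] != prev or entry["hash"] != _digest(body):
            return False
        prev = entry["hash"]
    return True


def custody_chain(ledger, asset_id):
    """List the parties that received the asset, in order."""
    return [e["to"] for e in ledger if e["asset_id"] == asset_id]


if __name__ == "__main__":
    ledger = []
    append_transfer(ledger, "die-001", "foundry", "osat")
    append_transfer(ledger, "die-001", "osat", "tier1")
    print(verify(ledger), custody_chain(ledger, "die-001"))
```

No manufacturing measurements are stored in the ledger. An RMA investigation would use the custody chain to identify the right supplier and then request the detailed data privately.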

Phillips: There’s a whole different vector to the role that data can play for autonomous driving around the model and the modeling environment — and the combination of those used to define the behavior of the algorithm itself. Having the data from the street or from fleets being able to feed back in, in real time — whether you call that a digital twin or building some relationship between the real life data and the model, specifically in the realm of autonomous driving — is a huge opportunity. That data can be connected back into the development and testing process.

Ciplickas: It’s like real-world test cases used in software, but now you’re talking about autonomous vehicles and sensor data versus what you’re designing.

van de Geijn: We also use that for change data for subcontractors, our customers, and the customer’s customers. But getting data from fabless companies, which get data from the foundries, and matching that with data you get from wafer sort is a problem. You need to go to different parties to get different kinds of data. You have to align it, merge it, and put your analysis on it. A lot of this is in place, so it’s very easy to use. The problem is that you have to convince your customers this is the way to go. A lot of our customers agree this is necessary, and they already are collecting data from different parties. And then we merge that together in our tools to create overviews to run analyses from all kinds of different angles, even depending on the processes you have. A CMOS chip needs different analyses than other processes. Microcontrollers may be talking to power components, each using different production processes, and different methodologies are needed to run those analyses. Now it’s a matter of having good conversations with your suppliers. But with E142 and RosettaNet, the users not only have to get the data together, they also have to know and understand what they can do. And we’re seeing more and more suppliers helping them. Then they can bring it together for startups that have some knowledge about it.

SE: Given all of this, is it realistic to think that we can build a 5nm chip that will hold up for 18 years in the Mojave Desert, or in Fairbanks, Alaska?

van de Geijn: We will know in 18 years. If you go to Johannesburg, they will ask you if you’re planning to drive to the Kalahari Desert, because if you do, you’ll want to be able to crank open your windows by hand. You don’t want to rely on a button if there’s a lion approaching a car. So you end up with basic cars at the moment, because they don’t trust all the electronics. Yes, you have your satellite telephone with you in case you blow up your gearbox and need help. But there are places where you’re stuck if you’re relying on electronics and something breaks. We will have to wait another 18 years before we know if something is really reliable. Here in the United States or Europe, if something happens to the car, you can take over. But I expect that for the next five years, if I go to a remote place, I will still be manually opening and closing the windows.

Rathert: That question is one that OEMs are frequently asking of us. Historically, automotive semiconductors have been developed on processes that have had years to mature. And now, suddenly, the requirements include 7nm and 5nm parts, and they just don’t have the years of maturity that a 45nm part has. So they’re asking us how we can achieve the reliability of the 45nm part in nine months instead of five years? That is the challenge. How do we cram as much learning into a couple of cycles, across all these different silos, so that we can achieve reliability faster?

Carmel: We need a data-driven approach. This process begins with a shift from identifying non-functional ECUs to predicting performance degradation, and then performing cloud-based troubleshooting while in mission mode. To achieve this, electronics need visibility so we can understand how environments, functional stress, and software impact long-term reliability, both within and in between vehicles. This can be achieved by applying deep data, which allows us to cross-correlate variations or fluctuations, and benchmark operation to the spec and guard-bands. This will help assure its functionality over 18 years. It’s not just a matter of jumping between non-failure to failure in the desert. It’s predicting what’s going to happen at the system level if you go to the desert.

Baruch: That brings up a good point, which is the notion that those chips need to be considered in the context of a higher-level system. So if a Tier 1 is developing a chip, they need to know the details about how it’s being assembled and used in the context of a system. That requires sharing of data because of the reliability aspects. But solutions also are driven by analyzing product failures, so that data also is important for outlier detection and finding anomalies. That can be applied to the system-level, as well, all the way up to the car. The combination of sharing data and applying those techniques into the higher-level components can improve reliability. You don’t want to have to wait 18 years to find out whether something is reliable.
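
Outlier detection can take many forms, and the panel does not name a specific method. As one common example, the Python sketch below flags parts whose measurement is far from the population median, using the modified z-score based on the median absolute deviation. The threshold of 3.5 is a conventional default, and the sample readings are made up.

```python
import statistics


def robust_outliers(values, threshold=3.5):
    """Return indices whose modified z-score (median/MAD based) exceeds threshold."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # No spread to compare against, so nothing can be ranked.
    return [
        i
        for i, v in enumerate(values)
        if 0.6745 * abs(v - median) / mad > threshold
    ]


if __name__ == "__main__":
    # Hypothetical leakage-current readings for eight parts from one lot.
    readings = [1.00, 1.02, 0.99, 1.01, 1.00, 0.98, 1.03, 1.45]
    print(robust_outliers(readings))
```

A median-based score is used here because a single extreme part would inflate a mean and standard deviation and could hide itself. The same screen could be applied to module-level or system-level measurements.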

Ciplickas: The predictive nature of having the right data is important. If you can understand why something is behaving the way it’s behaving, that can give you confidence that you’re going to achieve the goal. So when you have a failure, knowing the root cause or why there is drift, even if it’s not a failure, and having that drift correlate with variables that you understand and can control, will then give you some confidence that you’re starting to get your arms around this problem. If you look at the number of cars that will be driven in the Sahara, it’s a small fraction of the cars in the world. But there’s a bazillion other environments in which cars are going to drive. That will propel learning over the short term. Cell phones gave us confidence that we really could have this super high technology in an affordable device to do something really valuable, and that’s bleeding into ADAS systems and things like that. If you have data and predictive models, the future involves feedback loops for what happened in manufacturing. Inside the fab, there didn’t used to be as advanced process controls as we have right now, and as you add more and more advanced control loops, it will start happening in assembly fabs, on test floors and even between all of these places. Then you can take that out into the field, where you’re getting data from these systems as they operate. Building a stable feedback loop from the field is a huge challenge, for sure. But being able to connect all that data together, right back to this data-sharing thing, could then enable a predictive model that you could take action on. That’s the path to understanding long-term reliability before the 18 years is up.

Carmel: Beyond collecting the data is how you adaptively use it to define thresholds. You can extend the lifespan of electronics with continuous verification of pre-set guard-bands. The first step is generating the data, and the second step is to create the learning curves to set the correct threshold so you can balance between safety and availability.
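
One way to picture continuous verification against a guard-band is the Python sketch below. It is not proteanTecs' method. It assumes a hypothetical "margin" reading that shrinks as a device ages, fits a straight line to the history, and reports both the current status and an estimate of how long the trend takes to reach the failure limit. The linear trend and all numbers are assumptions.

```python
def fit_line(times, values):
    """Ordinary least-squares slope and intercept (needs two or more distinct times)."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    return slope, mean_v - slope * mean_t


def assess_margin(times, margins, guard_band, fail_limit):
    """Classify the latest margin and estimate time until the trend hits fail_limit."""
    latest = margins[-1]
    if latest <= fail_limit:
        status = "fail"
    elif latest <= guard_band:
        status = "warn"  # Inside the guard-band: still working, but act early.
    else:
        status = "ok"
    slope, intercept = fit_line(times, margins)
    time_to_limit = None  # None means no downward trend, or already failed.
    if slope < 0 and latest > fail_limit:
        crossing = (fail_limit - intercept) / slope
        time_to_limit = max(0.0, crossing - times[-1])
    return status, time_to_limit


if __name__ == "__main__":
    # Hypothetical margin history sampled once per year of operation.
    print(assess_margin([0, 1, 2, 3], [10.0, 9.0, 8.0, 7.0], guard_band=6.0, fail_limit=2.0))
```

Moving `guard_band` closer to `fail_limit` raises availability at the cost of less warning; moving it away does the opposite. That is the safety versus availability balance described above.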

[Uzi Baruch has since left Optimal Plus and joined proteanTecs as chief strategy officer.]