

# **Johannes Pfau**

**RFET Reconfigurable Devices:** 

**Power Aware FPGA Architectures and Toolflow** 


For the attainment of the academic degree of

Doctor of Engineering (Dr.-Ing.)

approved by the KIT Department of Electrical Engineering and Information Technology of the Karlsruhe Institute of Technology (KIT)

#### **Dissertation**

by

#### **M.Sc. Johannes Pfau**

Date of the oral examination: 25.06.2024

First reviewer: Prof. Dr.-Ing. Dr. h. c. Jürgen Becker

Second reviewer: Prof. Dr.-Ing. Klaus Hofmann




First edition: September 2024

DOI: 10.5445/IR/1000174452

Copyright © Johannes Pfau, 2024

The latest digital edition of this work is available at [dx.doi.org/10.5445/IR/1000174452](https://dx.doi.org/10.5445/IR/1000174452).

## <span id="page-5-0"></span>**Abstract**

In recent years, power consumption in modern [Integrated Circuits \(ICs\)](#page-362-0) has become increasingly important. One special kind of [IC](#page-362-0), [Field Programmable Gate Arrays \(FPGAs\)](#page-362-1), is especially hampered by high power consumption. The main reason is that in an [FPGA](#page-362-1), the actual application is not known at chip manufacturing time and is only programmed later, in the field. Because of this, [FPGAs](#page-362-1) will have unused resources, as applications rarely use all of them. Such unused resources do not consume dynamic power, but they do consume static power due to leakage currents. Classical mitigation approaches, such as power gating, cannot easily be used in [FPGAs](#page-362-1), as the unused resources are often spread over the whole chip area. In addition, [Process-, Voltage-, Temperature-Variation and Aging \(PVTA\)](#page-364-0) effects hit [FPGAs](#page-362-1) harder: Unlike in [Application Specific Integrated Circuits \(ASICs\)](#page-361-0), logic placement and density are not known at manufacturing time. Voltage drop and temperature analysis can therefore not be performed before chip manufacturing. [FPGA](#page-362-1) architectures must therefore assume uniform applications and worst-case [PVTA](#page-364-0) effects instead. With some transistor technologies, however, it is possible to trade off leakage power against logic performance. In this case, safety margins in performance prevent further reduction in power usage.

As a solution to these problems, this thesis proposes the [Power Aware Reconfigurable FPGA Architecture \(PARFAIT\)](#page-363-0). This [FPGA](#page-362-1) is divided into power regions that can be controlled individually by power controllers. The [Electronic Design Automation \(EDA\)](#page-362-2) toolflow for user application synthesis is modified to determine the required performance in each region. Additionally, instead of simply assuming worst-case [PVTA](#page-364-0), a measurement system obtains real propagation delays at runtime. Combining these two approaches allows the power in each region to be reduced at runtime until the performance matches the determined requirements. Such a system also implicitly compensates for dynamic changes in [PVTA](#page-364-0) values.

To realize this architecture, a transistor technology that enables a trade-off between performance and leakage current is needed. For comparison, this thesis evaluates all results in [Silicon on Insulator \(SOI\)](#page-364-1) technology with body biasing. The main technology investigated, however, is [Reconfigurable \(Ambipolar\) FET \(RFET\)](#page-364-2) technology with program gate voltage scaling. To fully utilize this technology, various changes to the standard [FPGA](#page-362-1) toolflow and architecture are proposed: A standard cell library and evaluation for [RFET](#page-364-2) enables realization of non-reconfigurable logic. For reconfigurable logic, [RFET](#page-364-2)-based [Universal Logic Modules \(ULMs\)](#page-364-3) are investigated as a less area-intensive alternative to [Lookup Tables \(LUTs\)](#page-363-1). Based on this, the architecture with power regions and region controllers is introduced. The architecture further introduces logic invasion, a novel method to repurpose parts of the reconfigurable logic for performance measurement at runtime.

Performance requirements are calculated for each power region during application synthesis with the adjusted [EDA](#page-362-2) toolflow. These values are then used in a final hardware/software co-simulation, demonstrating the functionality and determining achievable power savings. For the simulation, technology models that determine propagation delay from [PVTA](#page-364-0) parameters are derived. Additionally, various scenarios to describe the [PVTA](#page-364-0) parameters in the evaluation are developed. Final results show that a static power reduction down to 49.1% of original power consumption is possible with [RFET](#page-364-2). Due to the different behavior of [SOI](#page-364-1) technology, this technology even allows power to be reduced down to 2.76%. This large difference seems to suggest that leakage power sensitivity to the control parameter needs to be improved in [RFET](#page-364-2) technology. As will be explained in the thesis though, the absolute leakage currents of [RFETs](#page-364-2) are already significantly smaller, so that further optimization may actually not be required. [RFET](#page-364-2) technology propagation delay is also more sensitive to changes in the control parameter, which enables a wider range of compensation.

# <span id="page-7-0"></span>**Summary (Kurzfassung)**

Power consumption in information processing systems has become more and more important in recent years, which is why reducing power dissipation is becoming increasingly relevant in [ICs](#page-362-0) as well. Compared to [ASICs](#page-361-0), [FPGAs](#page-362-1) often exhibit particularly high power dissipation due to their special properties: Since the final application is not yet known at chip manufacturing time, [FPGAs](#page-362-1) must be designed generically for different applications. Concrete applications, however, rarely require all [FPGA](#page-362-1) resources, which leads to unused resources and increased energy demand. The power dissipation in these resources is primarily caused by static leakage currents; dynamic power dissipation is irrelevant in unused resources. Many approaches that are commonly used in [ASICs](#page-361-0) cannot be realized in [FPGAs](#page-362-1): Power gating, for example, is difficult to implement because unused resources are unknown at manufacturing time and are often spread across the chip area. For the same reasons, [FPGAs](#page-362-1) are also particularly affected by [PVTA](#page-364-0) effects: Since the logic density, like the logic placement, is unknown at manufacturing time, analyses of voltage drops and temperature distributions cannot be performed in advance. [FPGAs](#page-362-1) must therefore be developed under worst-case assumptions. If technologies that allow a trade-off between speed and power dissipation are used to realize the [FPGA](#page-362-1), relying on such worst-case scenarios prevents a further reduction of power dissipation.

To solve these problems, this dissertation presents the [Power Aware Reconfigurable FPGA Architecture \(PARFAIT\)](#page-363-0). It divides the [FPGA](#page-362-1) into several regions with associated controllers that regulate power dissipation and speed. Open-source [EDA](#page-362-2) programs are adapted to determine the maximum propagation delay, which serves as a measure of the speed requirements in each region. Instead of assuming a worst-case scenario, a system is additionally developed to capture the actual propagation delay in each region. Combining the two approaches yields a system that dynamically reduces power dissipation while ensuring that the requirements on circuit speed are met. Furthermore, this system can compensate for dynamic [PVTA](#page-364-0) effects.

To implement these approaches, circuit technologies that allow a trade-off between power dissipation and speed are used. For this purpose, a commercial [SOI](#page-364-1) technology with [Body Biasing \(BB\)](#page-361-1) as a reference and an [RFET](#page-364-2) technology with program-gate-based threshold voltage scaling are evaluated. The concepts also require adaptations to the [EDA](#page-362-2) tools and the [FPGA](#page-362-1) architecture: This work introduces a standard cell library for [RFETs](#page-364-2) to realize digital logic. To implement the reconfigurable logic, [RFET](#page-364-2)-based [Universal Logic Modules \(ULMs\)](#page-364-3) are investigated as an alternative to [LUTs](#page-363-1). Building on this, the [FPGA](#page-362-1) architecture with power regions and region controllers is presented. Among other things, the concept of logic invasion is introduced here, which enables characterization of the circuit speed by sharing the reconfigurable logic elements that are already present.

The requirements on circuit speed in each region are calculated by the adapted [EDA](#page-362-2) tools. For various benchmarks, these requirements are then used in a hardware/software co-simulation to determine the achievable reduction in power dissipation. To this end, simulation models are introduced for the investigated technologies that allow the propagation delay to be estimated as a function of [PVTA](#page-364-0). Finally, several scenarios for changes in the [PVTA](#page-364-0) parameters are evaluated. It is shown that the [RFET](#page-364-2) technology allows a reduction down to 49% of the original power dissipation. For the investigated [SOI](#page-364-1) technology, a reduction down to 2% is obtained. On the one hand, these results show that the influence of the program gate on leakage currents in [RFET](#page-364-2) technology can still be improved. On the other hand, the absolute leakage currents in [RFET](#page-364-2) technology are already orders of magnitude lower. In addition, the dependence of circuit speed on the control parameter is stronger in [RFET](#page-364-2) technology, which enables better [PVTA](#page-364-0) compensation.

# <span id="page-9-0"></span>**Acknowledgements**

Writing a dissertation and the research involved in it is a complex, long-term task. I couldn't have completed it without various people's help, both technical help and support in general.

First, I'd like to thank my supervisor Jürgen Becker: When I joined the institute, I intended to work in longer established projects. Luckily you convinced me to join the [Power Aware Reconfigurable FPGA Architecture \(PARFAIT\)](#page-363-0) project. As the project progressed, you always provided invaluable feedback and ideas, e.g. during meetings and interim presentations. Despite this guidance, you still offered me complete freedom to shape the research according to my interests. I'm very grateful for the trust in me and for providing the opportunity to take responsibility early on. I'd also like to thank the members of my examination board: Klaus Hofmann for providing help and feedback as my secondary advisor and as project partner in [PARFAIT](#page-363-0). I still fondly recall the productive discussions and unique ideas we developed in project meetings. Peter Rost for chairing the examination and for the reassuring words in the preparation meeting and Jasmin Aghassi-Hagmann and Laurent Schmalen for investing their time to be part of the committee.

Invaluable insights were also provided by [PARFAIT](#page-363-0) project partners and the co-authors of my research papers: Maximilian Reuter started his dissertation research at essentially the same time as I did. It was a pleasure to discuss ideas, organize work plans, write joint papers and carry out research and dissertation basically in lockstep. Tillmann Krauss introduced me to the world of [RFETs](#page-364-2). Thank you for the patience when teaching me [RFET](#page-364-2) working principles and for the dedication to still offer advice years after you left university. Jens Trommer joined the second phase of the [PARFAIT](#page-363-0) project. Being the "new Till" for me, you had to explain [RFET](#page-364-2) circuits such as RGATEs. Your help with the [PARFAIT](#page-363-0) 2 project proposal enabled a large part of the research in this thesis. Giulio Galderisi was the one carrying out device measurements and improvements. For a long time I thought this dissertation would have to be based solely on simulated transistors. You made it possible to actually use real, measured transistor characteristics in the evaluation. In addition, there have been many more people involved to get the [PARFAIT](#page-363-0) project up and running. There are too many of you to list everyone, but I certainly did not forget your valuable help. This also includes all students whose master or bachelor thesis I supervised. I always loved to gain new perspectives and to discuss various technical details with you.

I also want to express my gratitude to all colleagues and friends at the institute: A special thanks goes to Tobias Dörr, with whom I shared an office for almost seven years. Whenever I had any question about anything, you always provided advice and often a complementary perspective, which I could not get to on my own. More thanks go to Hannes Stoll and Timo Sandmann. For a short time, we set up some server things at the institute and since then, you're my go-to experts for everything network related. Similarly, thanks go to the "dynamic duo" Fabian Lesniak and Tim Hotfilter: We did not only set up servers but also locomotives and coffee machines. Time went by way too fast when we plotted various plans, and I'll for sure continue to annoy you with music genre discussions. In almost seven years at the institute I met many more colleagues, most of which I consider good friends by now. There's not enough space here to thank everyone individually, but anyone who was part of the cinema group, shared discussions in coffee breaks or at lunch, was part of PhD hat building projects or any other craft projects: You're what makes the institute special. Thank you for all the shared laughs and the great atmosphere in the last seven years.

I'd also like to thank everyone involved in my research stay in Kobe, especially Kentaro Sano who so kindly received me as his guest researcher. Thanks to the whole team in Kobe as well, I learned so much from you. The opportunity to see a culture which is sometimes so different from the German one, was probably a once-in-a-lifetime experience. Special thanks go to Carlos Cortes for being an awesome friend, tour guide and for always finding all the best places to eat in Japan.

An apology to my non-institute friends and family: You didn't get to see me a lot when I was writing the thesis and I am sorry for missing one or the other event. Special thanks to everyone in the board game group, for offering an escape from sometimes boring day-to-day life: To Valentin and Sara for always offering shelter, Benni for figuring out the best strategies, Jannika for being the best hiking-buddy one can think of, Feli and Steven for always having an ear for me ranting about something, and Pia and Reiner for always coming back the long distance to Karlsruhe to meet everyone. Thank you all for the countless fun hours. Thanks also go to my small family core: To my parents, Hans and Gaby, who always supported me on my way to the PhD. You always believed in me and let me go my own way, even though academia is not really your world. To my grandma Anne-Marie, who still can't believe that sitting in front of a screen in home-office is real work, and to my brother Mathias, who has to do the real work while I'm sitting in front of screens. Also thanks to my aunt, uncle and cousin Johanna, Kurt and Monika. I promise to join family meetings more often again.

Last but not least, I'd like to thank the giants on whose shoulders I'm standing: This includes all the countless scientists who did the previous research that enabled the work I did as part of this thesis. It also includes all those who invested their time to teach me those scientific concepts in 13 long years of school and five years of university. All those motivated teachers who put in unpaid extra hours, true humanists burning for education and enabling social advancement through the promotion of knowledge. Thanks to all those who still do more than "Dienst nach Vorschrift" and burn for something more than ever more money in these neoliberal times.

*To all my friends, present, past and beyond — Pennywise*


# <span id="page-19-0"></span>**Notation**

This chapter introduces the notation and symbol types which are used in this thesis.

### **General notation**




### **Terminology**

- **Technology**  Process and/or transistor type used.
- **Hard Logic**  Logic realized in non-reprogrammable ways.
- **Soft Logic**  Logic realized in reprogrammable ways.
- **Faster Logic**  Higher application frequency, i.e. less propagation delay in the critical path.

# <span id="page-21-0"></span>**Symbols**

- $\alpha$  Switching activity.
- $C_{\text{in}}$  Input capacitance.
- $C_{\text{L}}$  Load capacitance.
- $C_{\text{tot}}$  Total load capacitance in a path.
- $f_{\text{clk}}$  Clock frequency.
- <span id="page-21-3"></span> $I_{\text{D}}$  The current flowing through the drain contact of a [FET](#page-362-3).
- <span id="page-21-6"></span> $I_{\text{off}}$  Off current of a [FET](#page-362-3): The maximum current when the transistor off voltage is applied to the respective gate.
- <span id="page-21-5"></span> $I_{\text{on}}$  On current of a [FET](#page-362-3): The maximum current when the transistor on voltage is applied to the respective gate.
- $L_{\text{eff}}$  Effective transistor channel length, i.e. including process variation.
- <span id="page-21-7"></span> $\mu$  Charge carrier mobility in a [FET](#page-362-3).
- <span id="page-21-8"></span> $T$  Temperature.
- <span id="page-21-2"></span> $t_{\text{arr}}$  Actual arrival time: Actual time a signal arrives at a [Flip-Flop \(FF\)](#page-362-4) input.
- $t_{\text{clk}}$  Clock period: Time between two rising clock edges.
- $t_f$  Fall time: Time for an output to fall from 80% of its steady state high value to 20%.
- $t_{\text{hold}}$  Hold time: Time a signal at a Flip-Flop input must be stable after the clock edge arrived.
- <span id="page-21-4"></span> $t_{\text{PD}}$  Propagation delay: Time for a signal to propagate through a cell.
- $t_r$  Rise time: Time for an output to rise from 20% of its steady state high value to 80%.
- <span id="page-21-1"></span> $t_{\text{req}}$  Required arrival time: Latest time a signal must arrive at an [FF](#page-362-4) input for timing closure.
- $t_{\text{setup}}$  Setup time: Time a signal at a Flip-Flop input must be stable before the clock edge arrives.
- $t_{\text{skew}}$  Clock skew: Time a local clock is shifted compared to a global, virtual time reference clock.
- $t_{\text{slack}}$  Slack: Difference between required arrival time [$t_{\text{req}}$](#page-21-1) and actual arrival time [$t_{\text{arr}}$](#page-21-2).
- $t_{WD}$  Wire delay: Part of propagation delay caused by parasitic effects of connected wires.
- $V$  Supply Voltage.
- <span id="page-22-1"></span> $VDD$  Supply Voltage Positive Potential.
- $VSS$  Supply Voltage Ground Potential.
- $\Delta V$  Variation in Supply Voltage.
- $V_{\text{IH}}$  Input voltages above this threshold are stable high signals.
- $V_{\text{IL}}$  Input voltages below this threshold are stable low signals.
- <span id="page-22-2"></span> $V_{\text{BS}}$  Body Bias voltage, as potential difference between bulk and source contacts.
- <span id="page-22-0"></span> $V_{th}$  Threshold Voltage of a [FET](#page-362-3).
- $V_{\text{th FG}}$  Threshold Voltage of the Front Gate.
- $x$  Input of a logic gate.
- $y$  Output of a logic gate.

# **Part I**

# <span id="page-23-0"></span>**Prologue**

# <span id="page-25-0"></span>**Chapter 1**

### **Introduction**

This thesis describes the overall design and some specific features of the [PARFAIT](#page-363-0). It was developed by the author as part of the [PARFAIT](#page-363-0) research project, which focuses on the use of [RFET](#page-364-2) technology in [FPGAs](#page-362-1). Whereas project partners focus on the development of [RFET](#page-364-2) transistor devices [\[1,](#page-315-1) [2\]](#page-315-2) and circuits [\[Reu20,](#page-344-0) [Reu21\]](#page-344-1), this thesis discusses system-level aspects related to the use of [RFET](#page-364-2) technology in [FPGA](#page-362-1) architecture design. Although this thesis provides a short introduction to the device and circuit aspects in [chapters 2](#page-31-0) and [3,](#page-91-0) the reader is referred to additional publications for a complete overview of the [PARFAIT](#page-363-0) research project.

**Power Reduction** In recent years, power efficiency of embedded systems has become more and more important. Reducing power consumption is a challenge especially for [FPGAs,](#page-362-1) which are commonly considered to be less energy efficient than [ASICs](#page-361-0) or [Central Processing Units \(CPUs\).](#page-361-3) The power losses in [ICs](#page-362-0) are generally classified into static and dynamic power, as will be explained in [section 2.3.](#page-42-0) The dynamic power part is dominated by switching power, which is proportional to the clocking frequency.

In [FPGAs](#page-362-1), user applications usually do not make use of all available resources. Unused resources do not have switching transistors and therefore do not exhibit dynamic power loss. They are, however, affected by static power losses, especially by static leakage paths. These paths are caused by transistors in the off state conducting leakage current, thus essentially forming paths between power supply rails. [CPUs](#page-361-3) and [ASICs](#page-361-0) use well-established techniques to reduce power in such contexts: Solutions such as power gating can turn off unused parts of [ICs](#page-362-0). These approaches have not seen widespread application in [FPGAs](#page-362-1) though: Which resources are unused depends primarily on the [FPGA](#page-362-1) user application. This information is not known at the time the [FPGA](#page-362-1) is manufactured, rendering many of the [ASIC](#page-361-0) power management solutions unusable. The primary objective of this work is to investigate approaches for reducing static leakage power that are suitable for [FPGAs](#page-362-1).
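As a rough orientation for the two loss components discussed above, the usual first-order textbook expressions can be written with the symbols from the Symbols list. The total leakage current $I_{\text{leak}}$ is a name assumed here for illustration only; the detailed treatment is the one referenced in [section 2.3](#page-42-0).

$$
P_{\text{total}} = P_{\text{dyn}} + P_{\text{stat}}, \qquad P_{\text{dyn}} \approx \alpha \, C_{\text{L}} \, V^{2} \, f_{\text{clk}}, \qquad P_{\text{stat}} \approx V \cdot I_{\text{leak}}
$$

For an unused resource the switching activity is $\alpha \approx 0$, so $P_{\text{dyn}}$ vanishes while $P_{\text{stat}}$ remains.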

**[PVTA](#page-364-0)** This dissertation also addresses [PVTA](#page-364-0) in [FPGAs](#page-362-1). As will be explained in detail in [section 2.4,](#page-49-0) [PVTA](#page-364-0) effects describe variations of transistor characteristics due to various causes. Such varying transistor characteristics lead to varying propagation delays of logic cells [\(section 2.3\)](#page-42-0). These effects therefore need to be taken into account when designing [ICs](#page-362-0): Process variation causes differences in transistor characteristics introduced by the [IC](#page-362-0) manufacturing process. These variations can cause both differences between [ICs](#page-362-0) and differences between different logic cells in one [IC](#page-362-0). In [FPGAs](#page-362-1), again many of the established solutions for [ASICs](#page-361-0) are not applicable: When a user application is synthesized to a bitstream, it is expected to work on any [FPGA](#page-362-1) [IC](#page-362-0). Whereas [ASICs](#page-361-0) can be measured and binned after production, this is therefore not readily possible for [FPGAs](#page-362-1). Each [FPGA](#page-362-1) [IC](#page-362-0) must work with any bitstream, regardless of the actual resources used in the bitstream. Speed grading in [FPGAs](#page-362-1) therefore cannot be based on specific resources on the device, but must always consider all resources.

Voltage and temperature variation can cause similar effects: Locally increased power consumption can lead to a localized drop in supply voltage or temperature hotspots. As the use of [FPGA](#page-362-1) resources is not known at manufacturing time, solutions such as locally increasing power grid density cannot be used. Unlike process variation, the effects on the user application can however be estimated during synthesis of the user application. In addition to those effects, aging describes wear of devices over time.

When performing [Static Timing Analysis \(STA\)](#page-364-4) for [FPGA](#page-362-1) applications, [EDA](#page-362-2) tools have to assume the worst possible delay an [IC](#page-362-0) could have, effectively using the worst-case process variation. In reality, not all resources on each [IC](#page-362-0) will be affected by worst-case process variation. The application may therefore have a larger timing slack in practice than predicted during [STA](#page-364-4). This unused performance may also be interpreted as a waste of power: Solutions which trade off transistor performance against power could enable power reduction, if the available timing slack was known.
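The gap between predicted and real slack can be made concrete with a small sketch. It uses the slack definition from the Symbols list ($t_{\text{slack}} = t_{\text{req}} - t_{\text{arr}}$); the class and function names as well as all numbers are hypothetical and only serve as an illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathTiming:
    """Timing of one register-to-register path, all values in picoseconds."""
    t_req: int         # required arrival time
    t_arr_worst: int   # arrival time assumed by STA (worst-case process variation)
    t_arr_actual: int  # arrival time on one concrete device (hypothetical measurement)


def slack(t_req: int, t_arr: int) -> int:
    """Slack as defined in the Symbols list: t_slack = t_req - t_arr."""
    return t_req - t_arr


def hidden_margin(path: PathTiming) -> int:
    """Extra slack the device has beyond what worst-case STA predicts."""
    return slack(path.t_req, path.t_arr_actual) - slack(path.t_req, path.t_arr_worst)


# Hypothetical example values, not taken from the thesis.
path = PathTiming(t_req=10_000, t_arr_worst=9_500, t_arr_actual=8_000)
print("STA slack:", slack(path.t_req, path.t_arr_worst), "ps")
print("actual slack:", slack(path.t_req, path.t_arr_actual), "ps")
print("hidden margin:", hidden_margin(path), "ps")
```

The hidden margin is the part of the timing budget that could, in principle, be traded for lower power if it were known at runtime.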

**Performance Power Trade-Off** The performance of [ICs](#page-362-0) is mainly characterized by the propagation delay through the logic cells used. As mentioned, this delay relates to the drain current $I_D$ of the transistors used. Both the on-current $I_{\text{on}}$ and the off-current leakage $I_{\text{off}}$ are affected by the threshold voltage $V_{th}$. Trade-offs between performance and leakage power can therefore be achieved if $V_{th}$ can be modulated.

Depending on the technology used, there are various effects that can be used to modulate $V_{th}$. In bulk silicon technology and more effectively in [SOI](#page-364-1), body biasing can be used (see [section 2.1\)](#page-31-1). For the [RFET](#page-364-2) technology used in [PARFAIT](#page-363-0), a similar effect can be achieved through voltage scaling on the [Program Gate \(PG\)](#page-363-3), an additional gate introduced in [RFET](#page-364-2) devices. A quick introduction to this effect will be given in [section 2.2.](#page-36-0) The newly introduced [PG](#page-363-3) also enables novel approaches for reconfigurable logic. This dissertation will therefore also discuss the use of [LUT](#page-363-1) replacements based on [RFET](#page-364-2) technology in [FPGAs](#page-362-1).
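To give a feeling for the direction of this trade-off, the following sketch evaluates two generic textbook approximations for conventional MOSFETs: the alpha-power law for delay and an exponential subthreshold leakage model. These are not the technology models of this thesis (those are introduced in section 4.6), and all parameter values (`v_dd`, `alpha`, `n`, `v_thermal`, the baseline $V_{th}$) are assumptions chosen purely for illustration.

```python
import math


def relative_delay(v_th: float, v_dd: float = 0.8, alpha: float = 1.3) -> float:
    """Alpha-power-law delay, up to a constant factor: V_DD / (V_DD - V_th)^alpha."""
    if v_th >= v_dd:
        raise ValueError("model only valid for V_th < V_DD")
    return v_dd / (v_dd - v_th) ** alpha


def relative_leakage(v_th: float, n: float = 1.5, v_thermal: float = 0.026) -> float:
    """Subthreshold leakage, up to a constant factor: exp(-V_th / (n * v_T))."""
    return math.exp(-v_th / (n * v_thermal))


baseline = 0.30  # hypothetical nominal threshold voltage in volts
for v_th in (0.30, 0.35, 0.40):
    delay = relative_delay(v_th) / relative_delay(baseline)
    leak = relative_leakage(v_th) / relative_leakage(baseline)
    print(f"V_th={v_th:.2f} V  delay x{delay:.2f}  leakage x{leak:.3f}")
```

In this simplified picture, raising $V_{th}$ makes the logic slower but reduces leakage exponentially, which is the qualitative behavior the trade-off relies on.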

<span id="page-27-0"></span>

**Figure 1.1:** The [PARFAIT](#page-363-0) [FPGA](#page-362-1) architecture with the shade of red in each region representing locally adjusted performance using [PVTA](#page-364-0) compensation. Also shown are region controllers and the main controller, which orchestrate logic invasion and measurement.

**Power Regions** Scaling  $V_{th}$  instead of  $VDD$  enables connecting regions with different performance: In  $VDD$  scaling, when gates of transistors in one region are being driven by transistors in another region, certain restrictions for the voltage levels arise. With  $V_{th}$  scaling, all gate and supply voltages are identical, avoiding these problems.  $V_{th}$  scaling therefore enables fine-grain power regions as shown in [figure 1.1.](#page-27-0)

With power regions introduced in [chapter 7,](#page-217-0) each region enables a local tradeoff between performance and power. When additionally supporting dynamic adjustment of this trade-off, the system can also be used to counter [PVTA](#page-364-0) effects. For this, in addition to knowing the application slack in each region [\(section 8.1\)](#page-225-1), the [FPGA](#page-362-1) architecture also has to determine the current real performance under [PVTA](#page-364-0) effects. [Chapter 8](#page-225-0) describes the logic invasion method introduced as part of this thesis: It invades application resources to periodically characterize the performance of each resource in the [FPGA,](#page-362-1) without interrupting the user application.
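The closed loop described here (reduce the power of a region as long as the measured performance still meets the application requirement, and back off when [PVTA](#page-364-0) effects slow the region down) can be illustrated with a toy simulation. Everything in it is hypothetical: the delay model, the step size, the margin and all names are assumptions for illustration and do not describe the actual region controllers.

```python
def region_delay_ps(setting: int, drift_ps: int) -> int:
    """Hypothetical delay model: each power-saving step costs 100 ps, drift adds on top."""
    return 1000 + 100 * setting + drift_ps


def controller_step(setting: int, measured_ps: int, allowed_ps: int,
                    margin_ps: int, max_setting: int) -> int:
    """One control iteration; a higher setting means lower power and slower logic."""
    if measured_ps > allowed_ps and setting > 0:
        return setting - 1  # too slow: spend more power
    if measured_ps < allowed_ps - margin_ps and setting < max_setting:
        return setting + 1  # spare slack: save power
    return setting


def run(drifts: list[int], allowed_ps: int = 1500, margin_ps: int = 100,
        max_setting: int = 8) -> list[int]:
    """Apply one controller step per measurement and record the chosen settings."""
    setting = 0
    history = []
    for drift in drifts:
        measured = region_delay_ps(setting, drift)
        setting = controller_step(setting, measured, allowed_ps, margin_ps, max_setting)
        history.append(setting)
    return history


# Six measurements without drift, then a 150 ps slowdown (e.g. a temperature rise).
print(run([0] * 6 + [150] * 3))
```

The margin is chosen no smaller than the delay cost of one step so that the loop settles instead of oscillating. In this toy model the region briefly exceeds the allowed delay when the drift sets in, so a real controller would presumably need an additional guard band for changes between two measurements.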

**Logic Cells** For logic cells to use $V_{\text{th}}$ scaling, special [RFET](#page-364-2) reconfigurable cells are used. Those can provide more area-efficient configurable logic than [LUTs](#page-363-1), but come with certain limitations: Most notably, such [ULMs](#page-364-3) often cannot realize all functions of $N$ input variables, but only a certain subset. Chapter 6 explains how such cells can be used in an [FPGA](#page-362-1) architecture and what changes are necessary. In addition, [chapter 5](#page-183-0) describes how [RFET](#page-364-2) cells can be used to realize standard cell based applications.
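The difference between a cell that covers all functions and one that covers only a subset can be checked by enumeration for $N = 2$. The sketch below compares a 2-input LUT with a toy cell built from a single 2:1 multiplexer whose pins can each be tied to an input or a constant. This toy cell is an assumption made for illustration and is not one of the [RFET](#page-364-2)-based [ULMs](#page-364-3) investigated in this thesis.

```python
from itertools import product

N = 2
INPUT_ROWS = list(product((0, 1), repeat=N))  # all (a, b) input combinations


def truth_table(fn):
    """Truth table of a 2-input function as a tuple of output bits."""
    return tuple(fn(a, b) for a, b in INPUT_ROWS)


# A LUT stores one configuration bit per input row, so every truth table is reachable.
lut_functions = set(product((0, 1), repeat=2 ** N))

# Toy cell: a 2:1 multiplexer, each pin tied to a, b, constant 0 or constant 1.
PIN_SOURCES = {
    "a": lambda a, b: a,
    "b": lambda a, b: b,
    "0": lambda a, b: 0,
    "1": lambda a, b: 1,
}


def mux_cell(sel, d0, d1):
    """Return the function computed by the multiplexer for one pin assignment."""
    return lambda a, b: d1(a, b) if sel(a, b) else d0(a, b)


cell_functions = {
    truth_table(mux_cell(sel, d0, d1))
    for sel, d0, d1 in product(PIN_SOURCES.values(), repeat=3)
}

AND = truth_table(lambda a, b: a & b)
XOR = truth_table(lambda a, b: a ^ b)
print(len(lut_functions), "functions via LUT,", len(cell_functions), "via toy cell")
print("AND realizable:", AND in cell_functions, "| XOR realizable:", XOR in cell_functions)
```

A mapping toolflow for such cells has to take into account that some functions, here XOR for example, do not fit into a single cell.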

**Evaluation** In order to evaluate the [PARFAIT](#page-363-0) [FPGA](#page-362-1) architecture, [section 9.1](#page-253-1) introduces optimizations for the [Virtual FPGA \(VFPGA\)](#page-364-5), enabling evaluation on commercial [FPGAs](#page-362-1). Such a [VFPGA](#page-364-5)-based evaluation can however not simulate the [PVTA](#page-364-0) effects. Because of this, the main evaluation of this thesis' results is based on simulation. [Section 4.6](#page-155-0) introduces the technology models that derive propagation delay $t_{\text{PD}}$ from [PVTA](#page-364-0) parameters. Models are then extended to also consider $V_{th}$ scaling using a control parameter. Two models are introduced: One for a reference [SOI](#page-364-1) technology, based on Scarpato's work [\[3\]](#page-315-3). This model is extended to support $V_{th}$ scaling and then parametrized for the used technology. In addition, the Scarpato model is adjusted for the [RFET](#page-364-2) technology characterized in [\[2\]](#page-315-2). Here, some changes in modeling are necessary. The model is then fitted to the measured [RFET](#page-364-2) characterization data of [\[2\]](#page-315-2) to obtain propagation delays.

To obtain $t_{\text{PD}}$, the delay model requires [PVTA](#page-364-0) parameter inputs. [Section 4.7](#page-174-0) describes the scenario models used to derive realistic [PVTA](#page-364-0) parameters for the evaluation. The control parameter is calculated by region controllers in the [FPGA](#page-362-1) architecture, as shown in [figure 1.1.](#page-27-0) To simulate these [Very High Speed Integrated Circuit Hardware Description Language \(VHDL\)](#page-364-6) based controllers and the delay model, a co-simulation framework will be introduced in [section 9.4.](#page-267-0) The final evaluation in [chapter 10](#page-275-0) then uses a set of benchmark user [FPGA](#page-362-1) applications introduced in [section 9.5.](#page-271-0) Those are first placed on the [PARFAIT](#page-363-0) [FPGA](#page-362-1) with different region sizes, then the slack factors for all regions are calculated. Using these factors and the scenario models, the co-simulation evaluates the region controller responses. The relative power and the control voltage in each region, as well as the achieved delay factors, will be presented in the results chapter.

**Novel contributions** As part of this thesis, various new topics have been investigated or advanced significantly beyond state of the art. The following list provides an overview, referencing the relevant section in the dissertation.

- Parametrization of the Scarpato delay model for [SOI](#page-364-1) technology and extensions to model body biasing: [Section 4.6.](#page-155-1)
- Modification of the Scarpato delay model for [RFET](#page-364-2) technology and parametrization according to measurements: [Section 4.6.](#page-166-0)
- Evaluation of large digital circuits in [RFET](#page-364-2) technology using a custom standard cell library: [Section 5.3.](#page-192-0)
- Derivation of relative power and delay metrics that can be used in the absence of absolute power and delay information: [Sections 4.6](#page-155-0) and [9.2.](#page-262-0)
- Derivation of a toolflow to map [FPGA](#page-362-1) applications to [ULMs:](#page-364-3) [Section 6.2.](#page-202-0)
- Design of [Controllable Logic Blocks \(CLBs\)](#page-361-4) based on [ULMs](#page-364-3) and metrics and methodology to evaluate the designs: [Section 6.4.](#page-208-0)
- Introduction of Logic Invasion, efficiently reusing [CLBs](#page-361-4) for $t_{\text{PD}}$ characterization of all [CLBs](#page-361-4) with little resource overhead: [Section 8.2.](#page-228-0)
- Introduction of slack factors to describe the available timing slacks in power regions, and algorithms to derive those: [Section 8.1.](#page-225-1)
- Introduction of a co-simulation framework to evaluate power aware architectures and derivation of related [PVTA](#page-364-0) scenarios: [Section 9.4.](#page-267-0)
- Evaluation of power saving in the [PARFAIT:](#page-363-0) [Section 10.4.](#page-288-0)
- Introduction of manual placement techniques for [VFPGA:](#page-364-5) [Section 9.1.](#page-253-1)

## <span id="page-31-0"></span>**Chapter 2**

### **Fundamentals**

This chapter reviews necessary prerequisites for understanding the [PARFAIT](#page-363-0) architecture. Explanations are kept succinct, as the reader is assumed to possess knowledge of electrical and information systems engineering. If further explanations are desired, the end of each chapter includes references to relevant standard works discussing the topics in more breadth and depth.

#### <span id="page-31-1"></span>**2.1 Classic Silicon Semiconductors**

Silicon semiconductors used for CMOS circuits are realized in various different physical device implementations, mostly depending on target feature size and other application characteristics. In this section, only basic device implementations will be discussed, as this is sufficient to describe the functional difference between devices for classic [Complementary Metal Oxide](#page-361-5) [Semiconductor \(CMOS\)](#page-361-5) and [RFET.](#page-364-2)



<span id="page-31-2"></span>**Figure 2.1:** Cross-section of a basic bulk-silicon type [Metal Oxide Semiconductor](#page-363-4) [FET \(MOSFET\).](#page-363-4) The bulk (or body) silicon is electrically connected to the source (S) terminal. Apart from this, the source (S) and drain (D) terminals are symmetrical. Materials for [N-Channel MOSFET \(NMOS\)](#page-363-5)/ [P-Channel](#page-363-6) [MOSFET \(PMOS\)](#page-363-6) are n/p doped bulk (black), n+/p+ doped source and drain regions (orange), some insulator for the gate oxide (yellow), polysilicon for the gate region (green) and metal for the wiring contacts (blue).

[Figure 2.1](#page-31-2) on the preceding page shows the most basic [MOSFET,](#page-363-4) a bulk type device. It is stacked onto the bulk or body silicon, which is provided by the wafer itself and usually lightly doped. Source and drain regions of the wafer are n+ doped for [NMOS](#page-363-5) devices. An oxide layer – in the simplest form silicon oxide – is then used to isolate the channel regions from the polysilicon gate region. Gate, source and drain are connected to the metal wiring layers to form circuits. The body contact connecting to the bulk is not shown in [figure 2.1.](#page-31-2) It is often globally connected to the source terminal of the respective device type and not available as an individual terminal. In a simple explanation, the electric field between gate and bulk forms a depletion region and causes charge carriers to form a conducting channel below the gate oxide. When this channel has been established, a current flow between drain and source can be induced when a respective potential difference is applied across the source and drain terminals. In this current flow, generally only one type of charge carrier is involved. The distinction between source and drain contacts is needed because of the potential applied to the bulk terminal.

<span id="page-32-0"></span>

**Figure 2.2:** Symbols and Terminals for [MOSFET](#page-363-4) Devices.

In the bulk [MOSFET,](#page-363-4) the potential of the device body is usually applied globally to the bulk substrate, so individual control of the potential per transistor is not possible. The device can therefore be described as a three-terminal device, as indicated by the commonly used symbols shown in [figures 2.2a](#page-32-0) and [2.2b.](#page-32-0) Other device types allow more fine-grain access to the body potential and can thus support four-terminal device control as shown in [figure 2.2c.](#page-32-0) An example for such a technology are [SOI](#page-364-1) [MOSFETs](#page-363-4) as shown in [figure 2.3a](#page-33-0) on the next page. Unlike bulk [MOSFETs,](#page-363-4) the semiconductor layers are not etched and doped into the bulk silicon substrate. In [SOI](#page-364-1) devices, the substrate is instead electrically isolated using a [Buried Oxide \(BOX\),](#page-361-6) onto which semiconductor layers are deposited.

A [SOTB](#page-364-7) device, a special realization of [SOI,](#page-364-1) is shown in [figure 2.3b](#page-33-0) on the facing page. Compared to normal [SOI](#page-364-1) devices, [SOTB](#page-364-7) devices feature a thinner [BOX.](#page-361-6) The main benefit of a thin oxide is better modulation of the device channel

<span id="page-33-0"></span>

**Figure 2.3:** Cross-sections of [SOI](#page-364-1) [MOSFETs.](#page-363-4) The transistor layers are similar to the bulk [MOSFET,](#page-363-4) but all layers are stacked onto a buried oxide layer. **[\(a\)](#page-33-0)** The conventional [SOI](#page-364-1) [MOSFET](#page-363-4) with thick [BOX.](#page-361-6) **[\(b\)](#page-33-0)** [Silicon on Thin BOX \(SOTB\)](#page-364-7) device with thinner [BOX](#page-361-6) layer. These devices differ in manufacturing and in electrical characteristics.

from the substrate below the [BOX,](#page-361-6) including more effective body biasing [\[4,](#page-315-4) p. 14]. In a generalized abstract view, a contacted substrate can act similarly to the transistor gate. To be able to better describe these contacts in these devices, gates are usually distinguished as [Front Gate \(FG\)](#page-362-5) and [Back Gate](#page-361-7) [\(BG\).](#page-361-7)

In conventional designs, the substrate is connected to the same voltage potential as the source terminal. [Body Biasing \(BB\)](#page-361-1) designs use a potential difference between substrate and source to enable more control over the threshold voltage  $V_{\text{th}}$  and the on  $(I_{\text{on}})$  and off  $(I_{\text{off}})$  currents. For bulk devices, the body effect is commonly described as a shift in  $V_{th}$ :

$$
V_{\text{th}} = V_{\text{th0}} + \gamma \left( \sqrt{|-2\Phi - V_{\text{BS}}|} - \sqrt{|-2\Phi|} \right) \tag{2.1}
$$

Here, $V_{\text{BS}}$ is the body biasing voltage, $V_{\text{th0}}$ the threshold voltage without body biasing, $\Phi$ the substrate Fermi potential and $\gamma$ is the body effect coefficient, a technology specific constant. The maximum potential difference in bulk silicon designs is limited, leaving most benefits of body biasing for [SOI](#page-364-1) devices. As the substrate is isolated from the channel by the [BOX,](#page-361-6) the terminal used for body biasing in these technologies is commonly referred to as [Back Gate](#page-361-7) [\(BG\)](#page-361-7). Whereas [BB](#page-361-1) conceptually makes the [MOSFET](#page-363-4) a four-terminal device as shown in [figure 2.2c](#page-32-0) on the preceding page, the fourth terminal's purpose is usually limited to power saving. Unlike four-terminal [RFETs](#page-364-2) introduced in the next section, the classic [MOSFET](#page-363-4) structure with *n* or *p* channels is kept. Only one type of charge carrier, electrons in the [NMOS](#page-363-5) or holes in [PMOS,](#page-363-6) is involved in current flow.
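The body effect of equation (2.1) can be made tangible with a short numerical sketch. The following Python snippet simply evaluates the formula; the function name and all parameter values (`v_th0`, `gamma`, `phi`) are illustrative assumptions and are not taken from this thesis or from any specific technology. It further assumes an NMOS device with a negative Fermi potential $\Phi$, so that a negative $V_{\text{BS}}$ corresponds to reverse body bias.

```python
import math


def threshold_voltage(v_bs, v_th0=0.40, gamma=0.40, phi=-0.30):
    """Evaluate equation (2.1) for a given body bias voltage v_bs (in volts).

    Assumed, illustrative values: v_th0 in V, gamma in sqrt(V), phi in V.
    """
    shifted = math.sqrt(abs(-2.0 * phi - v_bs))   # term with body bias applied
    unbiased = math.sqrt(abs(-2.0 * phi))         # term without body bias
    return v_th0 + gamma * (shifted - unbiased)


if __name__ == "__main__":
    # Reverse bias, no bias and forward bias (values chosen for illustration).
    for v_bs in (-0.5, 0.0, 0.3):
        print(f"V_BS = {v_bs:+.1f} V -> V_th = {threshold_voltage(v_bs):.3f} V")
```

Under these assumptions, reverse body bias raises $V_{\text{th}}$ (lower leakage, slower switching), whereas forward body bias lowers it. This is the qualitative trade-off that body biasing exploits.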

[SOI](#page-364-1) devices can be divided into [Partially Depleted SOI \(PDSOI\)](#page-363-7) and [Fully Depleted SOI \(FDSOI\)](#page-362-6) types. For [FDSOI,](#page-362-6) the channel material is intrinsic silicon. As the channel material is not doped, the channel is fully depleted. For [PDSOI,](#page-363-7) the channel material is doped like in bulk devices. [PDSOI](#page-363-7) is commonly used for thick [BOX](#page-361-6) devices, whereas [FDSOI](#page-362-6) commonly goes with [SOTB.](#page-364-7) [FDSOI](#page-362-6) technology has recently been adopted in industry and designs have demonstrated speed improvements and power reductions for both logic and memory applications [\[5,](#page-315-5) [6,](#page-315-6) [7\]](#page-315-7).

The drain current $I_D$ of the transistor plays an important role in the development of digital circuits. It depends on the technology type, and its formulaic description changes depending on the region of the transfer characteristic in which the transistor is operated. For example, in the Shockley model, there is a cutoff, a linear and a saturation region. Static operation of digital circuits uses transistors in the saturation region. Drain current in the Shockley model is described for the saturation region as below [\[8\]](#page-315-8):

$$
I_D = 0.5 K (V_{\text{GS}} - V_{\text{th}})^2 = C_s (V_{\text{GS}} - V_{\text{th}})^2 \tag{2.2}
$$

The Shockley model has been found to be insufficient to describe modern [Field Effect Transistors \(FETs\),](#page-362-3) as it does not model short-channel effects. It is therefore commonly replaced by a more general formula, the alpha-power law. This thesis will later on make use of the Scarpato model to describe circuit delay depending on [PVTA.](#page-364-0) As the Scarpato model is based on the alpha-power law, this necessitates its introduction [\[3\]](#page-315-3). Digital circuits operate in the pentode region of that model, where the drain current is defined as follows [\[8\]](#page-315-8):

$$
I_D = \frac{W}{L_{\text{eff}} P_C} (V_{\text{GS}} - V_{\text{th}})^{\alpha} = C_{\alpha} \mu (V_{\text{GS}} - V_{\text{th}})^{\alpha} \tag{2.3}
$$

Both models make use of different constants, but only the temperature dependency of those is relevant for this thesis. This dependency was analyzed by Dasdan and Hom, especially focusing on the [Inverted Temperature Dependence \(ITD\)](#page-362-7) effect [\[9\]](#page-315-9). They note that the carrier mobility $\mu$ and threshold voltage $V_{\text{th}}$ depend on the temperature like this:

$$
\mu(T) = \mu(300) \left(\frac{300}{T}\right)^m \tag{2.4}
$$

$$
V_{\text{th}}(T) = V_{\text{th}}(300) - \kappa (T - 300) \tag{2.5}
$$

It can therefore be seen that rising temperature  $T$  leads to decreasing mobility  $\mu$  and therefore decreasing  $I_{\text{D}}$ . On the other hand, rising temperature causes decreasing threshold voltage  $V_{th}$  and therefore increasing  $I_D$ . Which effect is dominant, and whether  $I_D$  ultimately increases or decreases with  $T$ , depends on the supply voltage  $VDD$  [\[9\]](#page-315-9).
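The competition between the two temperature effects can be illustrated numerically. The following Python sketch combines the alpha-power form $I_D = C_{\alpha} \mu (V_{\text{GS}} - V_{\text{th}})^{\alpha}$ of equation (2.3) with equations (2.4) and (2.5). All function names and parameter values (`m`, `kappa`, `alpha`, the threshold voltage at 300 K and the two gate voltages) are assumptions chosen only to make the effect visible; they are not fitted to any technology discussed in this thesis.

```python
def mobility(t, mu_300=1.0, m=1.5):
    """Equation (2.4): carrier mobility at temperature t (K), normalized to 300 K."""
    return mu_300 * (300.0 / t) ** m


def threshold_voltage_at(t, v_th_300=0.40, kappa=1.0e-3):
    """Equation (2.5): threshold voltage (V) at temperature t (K)."""
    return v_th_300 - kappa * (t - 300.0)


def drain_current(v_gs, t, c_alpha=1.0, alpha=1.3):
    """Second form of equation (2.3), in arbitrary units."""
    overdrive = v_gs - threshold_voltage_at(t)
    if overdrive <= 0.0:
        return 0.0  # device is off in this simplified model
    return c_alpha * mobility(t) * overdrive ** alpha


def current_ratio(v_gs, t_hot=400.0, t_ref=300.0):
    """Ratio I_D(t_hot) / I_D(t_ref) for a fixed gate-source voltage."""
    return drain_current(v_gs, t_hot) / drain_current(v_gs, t_ref)


if __name__ == "__main__":
    for v_gs in (0.6, 1.2):
        print(f"V_GS = {v_gs:.1f} V: I_D(400 K) / I_D(300 K) = {current_ratio(v_gs):.2f}")
```

With these assumed values, the ratio is above one for the low gate voltage (the threshold voltage reduction dominates, i.e. inverted temperature dependence) and below one for the high gate voltage (the mobility reduction dominates).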

#### **Further Reading**

A broad range of textbooks are available discussing various aspects of classic silicon technology. For a broad overview of topics concerning digital circuit design, see [\[Rab03\]](#page-35-0). For an introduction into [Metal Oxide Semiconductor](#page-363-8) [\(MOS\)](#page-363-8) technology and details on analog circuit design, see [\[Raz16\]](#page-35-1). An introduction into Ultra-Thin-Body [MOSFETs](#page-363-4) and [SOI](#page-364-1) in general can be found in [\[Fos13\]](#page-35-2). For details about manufacturing and devices in [SOI,](#page-364-1) [\[Kon14\]](#page-35-3) may be used as a reference. Articles describing various body biasing techniques for [FDSOI](#page-362-6) devices at various granularity have been collected in [\[Cle20\]](#page-35-4).

<span id="page-35-4"></span><span id="page-35-3"></span><span id="page-35-2"></span><span id="page-35-1"></span><span id="page-35-0"></span>
#### **2.2 Ambipolar Silicon Semiconductors**

In all devices introduced in [section 2.1,](#page-31-0) current through the channel involves only one type of charge carrier. The type of carrier is fixed at manufacturing time by selection of materials, usually by the doping of the source and drain regions. Devices where current flow through the channel involves both electrons and holes are called ambipolar devices. With early ambipolar devices, the ambipolarity was not actively used to enable additional features and devices were designed to suppress the ambipolar behavior [\[10,](#page-316-0) [11\]](#page-316-1). A recent review publication summarizes various approaches which have been taken in this respect [\[12\]](#page-316-2). Active use of ambipolarity was initially described in [\[13\]](#page-316-3) and is the main idea of [Reconfigurable \(Ambipolar\) FETs \(RFETs\)](#page-364-0), which use the ambipolar behavior to switch device polarization. It should therefore be noted that not all ambipolar devices qualify as [RFET.](#page-364-0)

<span id="page-36-0"></span>

**Figure 2.4:** Cross-sections of basic ambipolar [MOSFETs.](#page-363-0) Unlike bulk and [SOI](#page-364-2) [MOSFETs,](#page-363-0) the shown devices employ source and drain region materials with metallic behavior. Junctions between these regions and the channel region show Schottky junction behavior. **[\(a\)](#page-36-0)** The basic [Schottky Barrier FET](#page-364-1) [\(SBFET\)](#page-364-1) device is derived from the bulk [MOSFET.](#page-363-0) **[\(b\)](#page-36-0)** A basic planar [Reconfigurable \(Ambipolar\) FET \(RFET\)](#page-364-0) device derived from the [SOI](#page-364-2) [MOSFET.](#page-363-0)

[Figure 2.4a](#page-36-0) shows the cross-section of the [Schottky Barrier FET \(SBFET\)](#page-364-1) device. The [SBFET'](#page-364-1)s structure is similar to the bulk [MOSFET](#page-363-0) of [figure 2.1](#page-31-1) on page [9,](#page-31-1) with the main difference being the use of different material for source and drain regions. Instead of using p- or n-doped silicon, [SBFETs](#page-364-1) use materials with metallic behavior, usually a silicide. The use of these materials creates Schottky junctions between the channel and these regions. The effect of the gate can then no longer be described as forming a conducting channel through accumulation of charge carriers. Instead, a more complex analysis has to consider band diagrams in the device and of the junctions, and how the electric field formed by the gate contact interacts with those junctions. For the use of the devices in this thesis, a detailed understanding of such effects is not necessary and the reader is referred to literature [\[1\]](#page-315-0).

The use of Schottky contacts leads to ambipolar behavior, i.e. both charge carrier types are involved in current flow. The [SBFET](#page-364-1) is a conventional three terminal device, so the ambipolarity is usually not used to enhance the [FET](#page-362-0) functionality.

[Figure 2.4b](#page-36-0) shows the idea of a simple, planar [RFET](#page-364-0) device. From a technology and manufacturing perspective, this device is related to the basic [SOI](#page-364-2) and [SOTB](#page-364-3) devices. Like those, it builds on top of a [BOX,](#page-361-0) where usually a thin-[BOX](#page-361-0) concept is used to enhance electrostatic control using the back gate. Like the [SBFET,](#page-364-1) it uses source and drain region materials with metallic behavior, but unlike the [SBFET](#page-364-1) it uses four or more terminals. In [RFET](#page-364-0) concepts, the polarity of the channel can be changed. For the device shown here, applying a certain voltage potential bias can either make the device act like a [PMOS](#page-363-1) or an [NMOS](#page-363-2) device. The analogy is limited to the functional level, in particular the polarity of the front gate threshold voltage $V_{th,FG}$. More detailed analysis of channel currents will show differences in device physics and behavior when compared to classic [NMOS](#page-363-2) or [PMOS](#page-363-1) devices. For the architecture-level analysis in this thesis, these details are not relevant. The reader is referred to [\[1\]](#page-315-0) for more details.

Some types of [RFET](#page-364-0) devices, such as the [PARFAIT](#page-363-3) device, do not use any doped materials. Instead, they rely only on intrinsic silicon, polysilicon, oxides and silicides [\[14\]](#page-316-4). Apart from reducing the number of manufacturing steps, this also allows devices to be used across a wider temperature range: Freeze-out at low temperature and intrinsic behavior at high temperature are not an issue with these [RFET](#page-364-0) concepts [\[15\]](#page-316-5). As the functionality of the device is determined by the [BG](#page-361-1) potential instead of by doping, the process of configuring the device using a static voltage potential is called electrostatic doping [\[16\]](#page-316-6). From a fabrication point of view, electrostatic doping poses similar requirements as body biasing in [SOI](#page-364-2) devices [\[17\]](#page-316-7). If reconfiguration is not required, [RFETs](#page-364-0) can be configured statically as [PMOS](#page-363-1) or [NMOS](#page-363-2) devices by statically connecting the back gates of all respective devices to the respective supply [\[18\]](#page-316-8). Alternatively, the back gates can be connected to variable voltage rails to enable similar effects as body biasing in [SOI.](#page-364-2)

[Figure 2.5](#page-38-0) shows the final [RFET](#page-364-0) device developed in the [PARFAIT](#page-363-3) 1 project. To enhance device characteristics, the [FG](#page-362-1) has been split into three gates. Two [TGs](#page-364-4) are placed directly above the Schottky junctions and are used to configure the device polarity. The [FG](#page-362-1) is located in the center of the channel and is used as the main control gate for this device. Both [TGs](#page-364-4) are usually connected internally to reduce external wiring overhead and therefore influence both junctions identically. In its most general configuration, the [PARFAIT](#page-363-3) [RFET](#page-364-0)



<span id="page-38-0"></span>**Figure 2.5:** Cross-section of the PARFAIT [RFET,](#page-364-0) adapted from [\[19\]](#page-316-9). Compared to the basic planar [RFET,](#page-364-0) the [PARFAIT](#page-363-3) [RFET](#page-364-0) splits the top gate into three independent gates. The [Top Gates \(TGs\)](#page-364-4) (T in the figure) are internally connected and exposed as one contact.

is a five-terminal device as shown in [figure 2.6a.](#page-38-1) It features the well-known source and drain terminals connected to the channel, the back gate below the [BOX](#page-361-0) and one contact for [TG](#page-364-4) and [FG](#page-362-1) each.

To reduce complexity both in connecting the devices using metal layers and in circuit design, certain device variations with combined terminals have been devised: [Figure 2.6b](#page-38-1) shows a variant where the [BG](#page-361-1) has been connected to the source terminal. This design is based on the bulk [MOSFET](#page-363-0) design, where the substrate or well is connected to the source terminal. As thin-[BOX](#page-361-0) technology is not always available, the [BG](#page-361-1) is often separated from the channel through a thicker oxide than the [TG.](#page-364-4) Because of this, the [TG](#page-364-4) often provides a more efficient way to manipulate the Schottky junction, justifying this device design. For details, please refer to [\[19\]](#page-316-9) and [\[1\]](#page-315-0). [Figure 2.6c](#page-38-1) shows a variant where the [TG](#page-364-4) and [FG](#page-362-1) have been connected. This configuration resembles the more traditional design of [figure 2.4](#page-36-0) on page [14](#page-36-0) with one front and one back gate. It can be used for circuits originally devised for these less complex devices.

<span id="page-38-1"></span>

**Figure 2.6:** Symbols and Terminals for the [PARFAIT](#page-363-3) [RFET](#page-364-0) Devices.

Whereas this introduction derived ambipolar transistors from planar [SBFETs,](#page-364-1) ambipolar behavior was chronologically first explored as part of other device technologies. A summary of early publications can be found in [\[20\]](#page-317-0) and more current ones are summarized in [\[21\]](#page-317-1). Early devices were mostly one-dimensional devices and include silicon [\[22,](#page-317-2) [23,](#page-317-3) [24,](#page-317-4) [25,](#page-317-5) [26\]](#page-317-6) or germanium nanowires [\[27,](#page-317-7) [28\]](#page-317-8) as well as carbon nanotubes [\[13\]](#page-316-3). Later on, two-dimensional devices using graphene [\[29\]](#page-317-9), dichalcogenides [\[30,](#page-317-10) [31,](#page-318-0) [32,](#page-318-1) [33\]](#page-318-2) and black phosphorus [\[34\]](#page-318-3) materials were introduced. Planar, silicon-based [RFETs](#page-364-0) on the other hand can be considered as an extension of existing transistor technology. They can be realized with comparatively small changes to existing manufacturing processes [\[18\]](#page-316-8). As shown in the deduction of the [PARFAIT](#page-363-3) [RFET](#page-364-0) from [SOI](#page-364-2) and [SBFET](#page-364-1) devices, the main difference is in the material of source and drain regions. As silicides are already available in existing processes, the technology can be adapted for most [FDSOI](#page-362-2) processes, but a thin-[BOX](#page-361-0) process is advantageous for good electrostatic behavior. This has led planar [RFETs](#page-364-0) to enter the world of commercial manufacturing processes. For example, devices have recently been integrated into GlobalFoundries 22FDX [\[17\]](#page-316-7) 22 nm [FDSOI](#page-362-2) technology with minimal changes to the manufacturing process [\[35,](#page-318-4) [36\]](#page-318-5).

From a circuit perspective, [RFETs](#page-364-0) can be divided into two groups [\[18\]](#page-316-8): Those with independent control of the source and drain junctions and those with only combined control over both interfaces. The PARFAIT device in [figure 2.5](#page-38-0) is an example of the second category: Although it does provide two [TGs,](#page-364-4) those are internally connected. The [BGs](#page-361-1) of [figures 2.4b](#page-36-0) and [2.5](#page-38-0) also control both junctions at the same time, and therefore also belong to category two. In the remainder of this thesis, when [RFET](#page-364-0) technology is referred to, a device of the second category will be assumed.

Compared to conventional silicon technology, [RFETs](#page-364-0) offer three primary benefits which will be leveraged in this thesis: As already mentioned, [RFETs](#page-364-0) offer reconfiguration, providing a way to change the polarity of the device not only during manufacturing, but also in the field. This technological advantage enables the design of various circuits with fewer transistors, potentially reducing area and power of circuits [\[18\]](#page-316-8). Enabling hardware reconfiguration, this immediately benefits [FPGAs](#page-362-3) and will therefore be discussed in detail as part of this thesis. Whereas these logic cells for [FPGA](#page-362-3) based on [RFET](#page-364-0) technology will be introduced in [section 2.6](#page-75-0) on page [53,](#page-75-0) an overview of reconfiguration and further applications can be found in literature [\[18,](#page-316-8) [26,](#page-317-6) [37,](#page-318-6) [38,](#page-318-7) [39\]](#page-318-8). The second major benefit of [RFETs](#page-364-0) is the increased temperature range [\[40\]](#page-319-0). As the devices do not require doping, wide temperature ranges [\[15,](#page-316-5) [41\]](#page-319-1) allow targeting cryogenic and high-temperature applications. Whereas this enables the devices to function over a wide range, temperature-dependent change of device performance is still likely. This thesis therefore will introduce a mechanism to counter temperature variation induced effects and discuss them in more detail in [section 2.4](#page-49-0) on page [27.](#page-49-0) The third and most important advantage is the ability to adjust the threshold voltage of the device [\[37,](#page-318-6) [38,](#page-318-7) [42,](#page-319-2) [43\]](#page-319-3). Whereas for basic [RFETs](#page-364-0) changing $V_{\text{th}}$ happens with the same terminal that programs the polarity of the device, more advanced designs such as the [PARFAIT](#page-363-3) 1 [RFET](#page-364-0) allow independent control using additional terminals. An adjustment of the threshold voltage changes the static leakage current and affects the dynamic power as well. On the other hand, it modifies the [on](#page-21-0)-current $I_{\text{on}}$, affecting the delay times and therefore performance of the circuits. A detailed derivation of these effects will be given in [section 2.3.](#page-42-0) The trade-off between power and performance forms the main aspect of this thesis.

The planar [PARFAIT](#page-363-3) device is also compatible with CMOS processes, which enables mixed circuits made of both [RFET](#page-364-0) and normal [CMOS](#page-361-2) devices. This allows for more rapid integration and will be used in this work to reduce the complexity of the [RFET](#page-364-0) standard cell library presented in chapter 5 on page [161.](#page-183-0) [RFETs](#page-364-0) further possess some advantageous properties for analog and RF design, which are not further explored here [\[Reu22\]](#page-344-0). The reader is referred to the [PARFAIT](#page-363-3) project publications and related work for more information on those topics.

#### **Further Reading**

As academic research initially focused on suppression of ambipolarity instead of on active use, most information about ambipolar transistor development is currently found in research articles. Most of those articles have narrow scope and are therefore not useful as overview works. They have been referenced in the bibliography section instead. The review articles [\[Ren19\]](#page-40-0) for ambipolar devices in general and [\[Hu21,](#page-41-0) [Fei22\]](#page-41-1) for two-dimensional devices serve as a good overview of recent developments. For general information about design of [SBFET](#page-364-1) devices, refer to the summary in [\[Rud23\]](#page-41-2). Additionally, an introduction into characteristics and manufacturing of the specific [RFET](#page-364-0) device used in PARFAIT 1 can be found in [\[Kra19\]](#page-41-3). As there has been further research since the finalization of [\[Kra19\]](#page-41-3), readers are advised to also review the PARFAIT research literature referenced in this section.

- <span id="page-40-0"></span>**[Ren19]** REN, Yi; YANG, Xiaoyang; ZHOU, Li; MAO, Jing-Yu; HAN, Su-Ting and ZHOU, Ye: "Recent Advances in Ambipolar Transistors for Functional Applications". In: *Advanced Functional Materials* 29.40 (2019), p. 1902105. DOI: [10.1002/adfm.201902105.](https://doi.org/10.1002/adfm.201902105)

- <span id="page-41-0"></span>**[Hu21]** HU, Wennan; SHENG, Zhe; HOU, Xiang; CHEN, Huawei; ZHANG, Zengxing; ZHANG, David Wei and ZHOU, Peng: "Ambipolar 2D Semiconductors and Emerging Device Applications". In: *Small methods* 5.1 (2021), e2000837. DOI: [10.1002/smtd.202000837.](https://doi.org/10.1002/smtd.202000837)
- <span id="page-41-1"></span>**[Fei22]** FEI, Wenwen; TROMMER, Jens; LEMME, Max Christian; MIKOLAJICK, Thomas and HEINZIG, André: "Emerging reconfigurable electronic devices based on two–dimensional materials: A review". In: *InfoMat* 4.10 (2022). DOI: [10.1002/inf2.12355.](https://doi.org/10.1002/inf2.12355)
- <span id="page-41-2"></span>**[Rud23]** RUDAN, Massimo: Springer Handbook of Semiconductor Devices. Springer Handbooks. Cham: Springer International Publishing AG, 2023.
- <span id="page-41-3"></span>**[Kra19]** KRAUSS, Tillmann A.: "Planare elektrostatisch dotierte rekonfigurierbare Schottky-Barriere FDSOI Feldeffekttransistor Strukturen". Dissertation. Darmstadt: Technische Universität Darmstadt, 2019.

# <span id="page-42-0"></span>**2.3 CMOS Circuit Technology**

In order to realize logic using transistor devices, those have to be connected in specific ways. There are various logic technologies, but the most commonly used one for digital logic is [CMOS](#page-361-2) technology. In [CMOS,](#page-361-2) logic gates consist of a pull-up network used to connect the output to $VDD$ and a pull-down network, which connects to $VSS$. To avoid short circuits, it must be ensured that only one of the networks is conducting at a time, i.e. the networks have to be complementary.

This simplified view only applies to the static, steady state case, where all inputs of a cell are constant. Observation of the dynamic behavior, as explained below, will show a current flowing through the pull-up and pull-down networks during the output transition. It should also be noted that some optimized cells – such as multiplexers and [LUTs](#page-363-4) – are commonly realized using other logic, e.g. using pass transistors or transmission gates.
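The complementarity requirement for the static case can be illustrated with a purely logical sketch. The following Python snippet models the pull-up and pull-down networks of a two-input NAND gate as Boolean functions and checks that exactly one of them conducts for every input combination. The NAND gate and all function names are chosen for illustration only; the model ignores all electrical and dynamic effects.

```python
from itertools import product


def pull_down_conducts(a, b):
    """Two series NMOS devices: path to VSS only if both inputs are high."""
    return a and b


def pull_up_conducts(a, b):
    """Two parallel PMOS devices: path to VDD if at least one input is low."""
    return (not a) or (not b)


def nand2_output(a, b):
    """Static output of the gate, derived from which network conducts."""
    up, down = pull_up_conducts(a, b), pull_down_conducts(a, b)
    if up == down:
        # Both conducting would be a short circuit, none a floating output.
        raise ValueError("networks are not complementary")
    return up  # output is high exactly when the pull-up network conducts


def is_complementary():
    """Check that exactly one network conducts for every input combination."""
    return all(
        pull_up_conducts(a, b) != pull_down_conducts(a, b)
        for a, b in product((False, True), repeat=2)
    )


if __name__ == "__main__":
    print("complementary:", is_complementary())
    for a, b in product((False, True), repeat=2):
        print(int(a), int(b), "->", int(nand2_output(a, b)))
```

The same check applies to the inverter discussed in this section, where each network consists of a single transistor.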

### <span id="page-42-1"></span>**[CMOS](#page-361-2) Cells**



(a) Traditional Inverter

(b) [RFET](#page-364-0) Inverter

**Figure 2.7:** Inverter circuit in [MOSFET](#page-363-0) [CMOS](#page-361-2) technology and in statically configured [RFET](#page-364-0) [CMOS](#page-361-2) technology. The [MOSFET](#page-363-0) inverter realizes the classical inverter circuit using [PMOS](#page-363-1) and [NMOS](#page-363-2) devices [\[44\]](#page-319-4). The [RFET](#page-364-0) inverter uses a static configuration, where the [TGs](#page-364-4) of the transistors are directly connected to the $VSS$ and $VDD$ voltages [\[Reu19\]](#page-344-1). In this configuration, [RFETs](#page-364-0) always behave as either [PMOS](#page-363-1) or [NMOS](#page-363-2) transistors and cannot be reconfigured.

[Figure 2.7](#page-42-1) shows a simple inverter as an example for a gate in [CMOS](#page-361-2) logic. Implementations are shown for both [SOI](#page-364-2) [MOSFET](#page-363-0) and [RFET](#page-364-0) technology. As one of the most basic gates, the inverter features only one input $x$ and one output $y$, which simplifies the introduction of the required concepts. In the [RFET](#page-364-0) case in [figure 2.7b,](#page-42-1) the [RFETs](#page-364-0) are statically configured by connecting their [TGs](#page-364-4) to $VDD$ or $VSS$ [\[Reu19\]](#page-344-1). This way, the transistor shown in the upper part of the figure will behave as a [PMOS](#page-363-1) transistor, the lower transistor as [NMOS.](#page-363-2) As this circuit is not reconfigurable, from a system perspective it behaves similarly to the [SOI](#page-364-2) inverter in [figure 2.7a.](#page-42-1) From a circuit perspective, electrical characteristics such as the voltage transfer characteristic of course vary.

<span id="page-43-0"></span>

**Figure 2.8:** Inverter voltage transfer characteristic, showing the output voltage depending on input voltage for the [RFET](#page-364-0) inverter of [figure 2.7b.](#page-42-1) The figure depicts noise margins, slope tangents and the low and high levels derived from those. Taken from [\[Reu19\]](#page-344-1).

**Power:** [Figure 2.8](#page-43-0) shows the voltage transfer characteristic for a statically configured [RFET](#page-364-0) inverter, closely resembling those of an [SOI](#page-364-2) inverter [\[Reu19\]](#page-344-1). As can be seen from the figure, an input voltage between $V_{\text{IL}}$ and $V_{\text{IH}}$ will cause an output voltage somewhere in between the low and high levels. In this range, both the pull-up and the pull-down network are conducting, leading to a current flow from *VDD* to *VSS*. For static operation, a circuit designer has to ensure that the input voltage is out of this range. For dynamic operation, when the input is switching from low to high or vice versa, the input voltage will be in this range for some time.

In order to quantify the energy loss during such a transition, a closer look at power loss in [CMOS](#page-361-2) gates is necessary. In general, dynamic power loss in nanometer CMOS is caused not only by short-circuit currents, but mostly by charging and discharging of load capacitance. This load capacitance consists of the parasitic input capacitance of the gates connected to an output, as well as the capacitance of the metal connection. In addition to those dynamic effects, which occur only during transitions of input signals, there are also static power dissipation effects, occurring even when signals are constant. These effects include subthreshold leakage, gate leakage, junction leakage and contention current. The following formulas summarize the power dissipation in [CMOS](#page-361-2) circuits [\[45\]](#page-319-5):

$$
P_{\text{total}} = P_{\text{dynamic}} + P_{\text{static}} \tag{2.6}
$$

$$
P_{\text{dynamic}} = P_{\text{switching}} + P_{\text{short circuit}} \tag{2.7}
$$

$$
P_{\text{static}} = (I_{\text{sub}} + I_{\text{gate}} + I_{\text{junct}} + I_{\text{cont}}) V_{\text{DD}} \tag{2.8}
$$

Switching power is given as:

$$
P_{\text{switching}} = \alpha C_{\text{L}} V_{\text{DD}}^2 f \tag{2.9}
$$

Where $C_{\text{L}}$ is the load capacitance, $V_{\text{DD}}$ is the supply voltage, and the clock frequency $f$ multiplied by the activity factor $\alpha$ gives (half) the number of transitions per second. The switching power therefore is directly proportional to the frequency $f$ and the activity factor, which specifies in what proportion of the cycles the signal is actually switching. Switching power is therefore ultimately dependent on the number of transitions. Short-circuit current has become mostly negligible in nanometer processes [\[45\]](#page-319-5).
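As a minimal sketch, equations (2.6) to (2.9) can be evaluated numerically as shown below. The function names and all numeric values are illustrative assumptions and are not taken from the thesis; short-circuit power is neglected, in line with the remark above.

```python
def switching_power(alpha, c_load, v_dd, f_clk):
    """Equation (2.9): P_switching = alpha * C_L * V_DD^2 * f."""
    return alpha * c_load * v_dd ** 2 * f_clk


def static_power(leakage_currents, v_dd):
    """Equation (2.8): sum of all static currents multiplied by V_DD."""
    return sum(leakage_currents) * v_dd


# Hypothetical example values (assumptions, not measured data).
ALPHA = 0.1                           # activity factor
C_LOAD = 2e-15                        # load capacitance in F
V_DD = 0.8                            # supply voltage in V
F_CLK = 500e6                         # clock frequency in Hz
LEAKAGE = [5e-9, 1e-9, 0.5e-9, 0.0]   # I_sub, I_gate, I_junct, I_cont in A

p_sw = switching_power(ALPHA, C_LOAD, V_DD, F_CLK)
p_static = static_power(LEAKAGE, V_DD)
p_total = p_sw + p_static             # equation (2.6), P_short_circuit neglected

print(f"P_switching = {p_sw * 1e9:.2f} nW")
print(f"P_static    = {p_static * 1e9:.2f} nW")
print(f"P_total     = {p_total * 1e9:.2f} nW")
```

Because of the quadratic dependency on $V_{\text{DD}}$, halving the supply voltage in this sketch reduces the switching power to a quarter, while the static term only scales linearly as long as the leakage currents are assumed constant.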

<span id="page-44-0"></span>

**Figure 2.9:** Definition of the propagation delay $t_{\text{PD}}$ for both high-to-low and low-to-high transition at the output of an inverter. Also shown are the definitions of the rise and fall times $t_r$ and $t_f$.

**Delay:** [Figure 2.9](#page-44-0) shows the definition of the propagation delay, a value used to characterize the dynamic timing behavior of cells [\[45\]](#page-319-5). If a change of the value of a cell input leads to a change of the value of a cell output, the propagation delay is the time between the input being at 50% of its high potential and the output reaching 50% of the high potential. Propagation delay can vary for different transitions. For an inverter, the only two possible transitions, a falling output and a rising output, are shown in the figure. More complex gates with multiple inputs and outputs accordingly need more propagation delay values to be described completely. The figure also shows the definitions of rise and fall times, the time needed for an output to transition between 20% and 80% of its steady-state high value. Those again depend on the combination of input and output used.
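The 50% and 20%/80% definitions can be illustrated with a small sketch that extracts the propagation delay and the fall time from sampled waveforms. The helper `crossing_time`, the piecewise-linear waveforms and all numbers are hypothetical and only serve to make the definitions concrete.

```python
def crossing_time(times, values, level):
    """Return the first time the sampled waveform reaches `level`.

    Linear interpolation is used between neighbouring samples.
    """
    samples = list(zip(times, values))
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if v0 == level:
            return t0
        if (v0 - level) * (v1 - level) < 0:
            return t0 + (level - v0) * (t1 - t0) / (v1 - v0)
    if values[-1] == level:
        return times[-1]
    raise ValueError("waveform never reaches the requested level")


# Hypothetical waveforms: time in ps, voltages normalised to the high level.
TIMES = [0.0, 5.0, 10.0, 25.0, 30.0]
V_IN = [0.0, 0.5, 1.0, 1.0, 1.0]      # rising inverter input
V_OUT = [1.0, 1.0, 0.75, 0.0, 0.0]    # falling inverter output

# Propagation delay: 50% of the input to 50% of the output.
t_pd_hl = crossing_time(TIMES, V_OUT, 0.5) - crossing_time(TIMES, V_IN, 0.5)
# Fall time: output transition from 80% down to 20%.
t_fall = crossing_time(TIMES, V_OUT, 0.2) - crossing_time(TIMES, V_OUT, 0.8)

print(f"t_PD (high-to-low) = {t_pd_hl:.2f} ps")
print(f"t_f                = {t_fall:.2f} ps")
```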

Furthermore, the exact values of those variables also depend on the load connected to the output. Assuming a load capacitance $C_{\text{L}}$, the propagation delay can be specified based on the time needed to charge this capacitance. Using the alpha-power model, this leads to [\[8\]](#page-315-1):

$$
t_{\rm PD} = \left(\frac{1}{2} - \frac{1 - \nu_{\rm T}}{1 + \alpha}\right) t_{\rm T} + \frac{C_{\rm L} V_{\rm DD}}{2I_{\rm D}}\tag{2.10}
$$

Neglecting the constant addend which depends on the input signal, Scarpato simplifies this to the following proportionality [\[3\]](#page-315-2):

$$
t_{\rm PD} \propto \frac{C_{\rm L} V_{\rm DD}}{I_{\rm D}} \tag{2.11}
$$

Using the alpha-power law model of [equation \(2.3\)](#page-34-0) for the drain current, this can be written as:

$$
t_{\rm PD} \propto \frac{C_{\rm L} V_{\rm DD}}{\mu (V_{\rm GS} - V_{\rm th})^{\alpha}},\tag{2.12}
$$

where the temperature dependencies of $\mu$ and $V_{\text{th}}$ are given in [equations \(2.4\)](#page-34-1) and [\(2.5\).](#page-34-2) Scarpato then extends this model for chains of gates, assuming that $I_{\text{D}}$ and $V_{\text{DD}}$ are constant and using a replacement capacitance $C_{\text{tot}}$ as the sum of all load capacitances in a path [\[3\]](#page-315-2):

$$
t_{\rm PD,Path} \propto \frac{C_{\rm tot} V_{\rm DD}}{\mu (V_{\rm GS} - V_{\rm th})^{\alpha}} \tag{2.13}
$$
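To illustrate the proportionalities (2.12) and (2.13), the following sketch evaluates the right-hand side for a few supply voltages. It assumes $V_{\text{GS}} = V_{\text{DD}}$ and uses arbitrary normalised values for $\mu$, $V_{\text{th}}$, $\alpha$ and the load capacitances; none of these numbers come from the thesis, and only ratios between results are meaningful.

```python
def relative_delay(v_dd, v_th, alpha, c_load=1.0, mu=1.0):
    """Right-hand side of (2.12), assuming V_GS = V_DD."""
    if v_dd <= v_th:
        raise ValueError("model is only meaningful for V_DD > V_th")
    return c_load * v_dd / (mu * (v_dd - v_th) ** alpha)


# Hypothetical technology values.
V_TH = 0.3
ALPHA = 1.3

nominal = relative_delay(0.8, V_TH, ALPHA)
for v_dd in (0.8, 0.7, 0.6, 0.5):
    ratio = relative_delay(v_dd, V_TH, ALPHA) / nominal
    print(f"V_DD = {v_dd:.1f} V -> delay x {ratio:.2f}")

# Equation (2.13): a path is described by the sum of its load capacitances.
STAGE_LOADS = [1.0, 2.5, 0.8]
path_delay = relative_delay(0.8, V_TH, ALPHA, c_load=sum(STAGE_LOADS))
stage_sum = sum(relative_delay(0.8, V_TH, ALPHA, c_load=c) for c in STAGE_LOADS)
```

Since the expression is linear in the capacitance, summing the per-stage values gives the same result as using $C_{\text{tot}}$ directly, which is the reason the replacement capacitance can be used for a whole path.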

#### **Static Timing Analysis**

[Figure 2.10a](#page-46-0) shows a sequential (clocked) circuit consisting of two [FFs](#page-362-4) and a combinational part with one inverter and one *AND* gate. In order for the [FFs](#page-362-4) to properly latch the input signal (avoiding zero and double clocking), certain constraints regarding setup and hold times must hold. Conceptually, the setup time is the duration the signal at a [FF](#page-362-4) input must be stable before the clock edge arrives at the [FF.](#page-362-4) Hold time on the other hand is the duration the input signal must be stable after the clock edge has arrived. Satisfying these constraints for all [FFs](#page-362-4) in a circuit achieves timing closure, and the process of verification is called [STA](#page-364-5) [\[46\]](#page-319-6).

<span id="page-46-0"></span>

(c) Hold Time Constraints

**Figure 2.10:** Example circuit and explanation of times used in [STA.](#page-364-5)

Various definitions for the setup time calculations of the example circuit are shown in [figure 2.10b.](#page-46-0) The markers at $t_{\text{clk}}$ denote the clock edges of a global, virtual reference clock with period $T_{\text{clk}}$. Because of delay caused by clock routing, the clock may not arrive at exactly the same time at all [FFs](#page-362-4). In that case, $t_{\text{skew}}$ describes the delay of the clock arriving at one [FF,](#page-362-4) compared to the virtual global clock. It essentially "shifts" the clock edges in the figure. In the example, the skew for the first [FF](#page-362-4) is assumed to be zero for simplicity, i.e. a clock perfectly aligned to the virtual reference. The setup time itself, $t_{\text{setup}}$, describes how long before the clock edge the input signal needs to be stable. This point in time is shown as $t_{\text{req}}$, the required arrival time. The actual arrival time $t_{\text{arr}}$ can be calculated by summing all signal delays in the path. This includes the propagation delays $t_{\text{PD}}$ of all gates in the path, as well as the wire delay $t_{\text{WD}}$. Wire delay usually describes the increase of propagation delay of the driving gate due to parasitic capacitance of the interconnect. In some cases, it also includes propagation delays of buffers inserted by [EDA](#page-362-5) tools to drive long wires. The difference between $t_{\text{req}}$ and $t_{\text{arr}}$ is called [slack](#page-22-10) $t_{\text{slack}}$. If it is positive, the signal arrives early enough and setup time constraints are adhered to. Putting all this together, with $t_{\text{comb}}$ denoting the summed gate and wire delays of the combinational path, leads to a formula for the minimum clock period and thereby the maximum achievable clock frequency:

$$
T_{\rm{clk}} \ge t_{\rm{comb}} + t_{\rm{setup}} + t_{\rm{skew}} \tag{2.14}
$$

Definitions for the hold time are shown in [figure 2.10c,](#page-46-0) where the local clock may again be delayed by $t_{\text{skew}}$. The [hold](#page-21-14) time $t_{\text{hold}}$ denotes the hold time required by the [FF.](#page-362-4) To analyze hold time constraints, the signal propagation time needs to be assessed relative to the same clock edge at the first [FF.](#page-362-4) The formula for a minimum combinational delay is then:

$$
t_{\rm comb} \ge t_{\rm skew} + t_{\rm hold} \tag{2.15}
$$
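The two constraints (2.14) and (2.15) can be checked with a few lines of code. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the setup slack is taken as the margin by which (2.14) is fulfilled, and all delay values (in ns) are made up.

```python
def min_clock_period(t_comb, t_setup, t_skew):
    """Equation (2.14): smallest clock period that satisfies the setup constraint."""
    return t_comb + t_setup + t_skew


def setup_slack(t_clk, t_comb, t_setup, t_skew):
    """Margin of (2.14); a non-negative value means the setup constraint is met."""
    return t_clk - min_clock_period(t_comb, t_setup, t_skew)


def hold_satisfied(t_comb_min, t_skew, t_hold):
    """Equation (2.15): the fastest path must not be shorter than skew plus hold time."""
    return t_comb_min >= t_skew + t_hold


# Hypothetical path: inverter and AND gate plus two wire segments (ns).
GATE_DELAYS = [0.12, 0.20]
WIRE_DELAYS = [0.05, 0.08]
T_SETUP, T_HOLD, T_SKEW = 0.06, 0.03, 0.04

t_comb = sum(GATE_DELAYS) + sum(WIRE_DELAYS)
t_min = min_clock_period(t_comb, T_SETUP, T_SKEW)
f_max_ghz = 1.0 / t_min                # delays in ns give a frequency in GHz

print(f"t_comb = {t_comb:.2f} ns, T_clk,min = {t_min:.2f} ns, f_max = {f_max_ghz:.2f} GHz")
print(f"setup slack at T_clk = 1 ns: {setup_slack(1.0, t_comb, T_SETUP, T_SKEW):.2f} ns")
print(f"hold constraint met: {hold_satisfied(t_comb, T_SKEW, T_HOLD)}")
```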

As shown in this example, the combinational path is usually formed out of multiple independent elements, [ARCs](#page-365-0). [STA](#page-364-5) usually represents the complete circuit as a graph and validates the mentioned constraints for all possible paths. As previously explained, $t_{\text{PD}}$ varies for different transitions. [STA](#page-364-5) therefore has to consider the worst case (or the worst possible combination) of all propagation delays in a path. Furthermore, cell speed differs depending on temperature, supply voltage and manufacturing effects. [Process Design Kit](#page-363-5) [\(PDK\)](#page-363-5) vendors therefore usually provide different parametrized sets of values. Fixing the temperature, voltage and process parameters yields one set of values, a so-called corner. [STA](#page-364-5) in common realizations uses the worst-case value, leading to pessimistic results [\[46,](#page-319-6) p. 224].
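The graph-based view can be sketched as a propagation of arrival times in topological order. The following example is a simplified illustration and not the algorithm of any particular [EDA](#page-362-5) tool: the pin names, the `(min, max)` delay pairs per arc and the function `arrival_times` are assumptions made for this sketch. The latest arrival time is relevant for the setup check, the earliest one for the hold check.

```python
from graphlib import TopologicalSorter


def arrival_times(arcs):
    """Propagate earliest and latest arrival times through a timing graph.

    `arcs` maps (source, sink) pairs to (min_delay, max_delay).
    """
    preds = {}
    for src, dst in arcs:
        preds.setdefault(src, set())
        preds.setdefault(dst, set()).add(src)

    earliest, latest = {}, {}
    # static_order() yields every node after all of its predecessors.
    for node in TopologicalSorter(preds).static_order():
        if not preds[node]:              # start point, e.g. a flip-flop output
            earliest[node] = latest[node] = 0.0
            continue
        earliest[node] = min(earliest[p] + arcs[(p, node)][0] for p in preds[node])
        latest[node] = max(latest[p] + arcs[(p, node)][1] for p in preds[node])
    return earliest, latest


# Hypothetical timing arcs in ns: (min_delay, max_delay).
ARCS = {
    ("FF1/Q", "INV"): (0.03, 0.05),
    ("INV", "AND"): (0.10, 0.15),
    ("FF1/Q", "AND"): (0.04, 0.06),
    ("AND", "FF2/D"): (0.12, 0.20),
}

earliest, latest = arrival_times(ARCS)
print(f"latest arrival at FF2/D:   {latest['FF2/D']:.2f} ns (setup check)")
print(f"earliest arrival at FF2/D: {earliest['FF2/D']:.2f} ns (hold check)")
```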

Real circuits will possess many timing paths, but for circuit designers, primarily the critical paths are important. These are the paths which either have a negative slack or the smallest slack values; the path with the smallest slack is called the critical path. Manual optimizations of combinational delay are usually only needed for setup time constraints. For the hold time constraint, [EDA](#page-362-5) tools can automatically increase the delay of the combinational path, if necessary. Furthermore, clock skew can be adjusted deliberately to achieve timing closure [\[46\]](#page-319-6).

## **Further Reading**

[CMOS](#page-361-2) circuit fundamentals have been discussed on different abstraction levels in various standard works. Streetman focuses largely on technology aspects and transistor devices, but also gives a quick introduction to inverter characteristics [\[Str15\]](#page-48-0). Razavi also summarizes semiconductor physics, but focuses more on circuit level aspects [\[Raz21\]](#page-48-1). He also covers an introduction to [CMOS](#page-361-2) circuits, focusing on noise margins and transition times. Baker also explains circuits such as oscillators, rather than only basic gates [\[Bak19\]](#page-48-2). He also covers physical aspects of cell design such as transistor sizing and layout. Weste covers [CMOS](#page-361-2) design from a system perspective [\[Wes11\]](#page-48-3). After quickly introducing [MOSFET](#page-363-0) devices and basic concepts such as delay and power in [CMOS](#page-361-2) circuits, he covers delay models, interconnect aspects, combinational and sequential circuit design, [EDA](#page-362-5) and verification. Sakurai focuses on [SOI](#page-364-2) and circuit design for this technology [\[Sak06\]](#page-48-4). He also includes a chapter on [CMOS](#page-361-2) circuit design with focus on low-power realization in [SOI.](#page-364-2) An introduction to timing analysis can be found in [\[Wes11\]](#page-48-3), but a more thorough introduction to this topic is given in [\[Kah22\]](#page-48-5). The actual algorithms used in [EDA](#page-362-5) tools are described in more detail in [\[Ger99\]](#page-48-6).

- <span id="page-48-0"></span>**[Str15]** STREETMAN, Ben: Solid State Electronic Devices: Global Edition. 7th Edition. Harlow: Pearson, 2015.
- <span id="page-48-1"></span>**[Raz21]** RAZAVI, Behzad: Fundamentals of microelectronics: With robotics and bioengineering applications. Third edition. Hoboken: Wiley, 2021.
- <span id="page-48-2"></span>**[Bak19]** BAKER, Russel Jacob: CMOS circuit design, layout, and simulation. Fourth edition. Vol. 22. IEEE Press series on microelectronic systems. Piscataway, NJ and Hoboken, New Jersey: IEEE Press and Wiley, 2019.
- <span id="page-48-3"></span>**[Wes11]** WESTE, Neil H. E. and HARRIS, David Money: CMOS VLSI design: A circuits and systems perspective. 4. ed. Boston, Mass.: Addison-Wesley, 2011.
- <span id="page-48-4"></span>**[Sak06]** SAKURAI, Takayasu: Fully-Depleted SOI CMOS Circuits and Technology: For Ultralow-Power Applications. Boston, MA: Springer, 2006. DOI: [10.1007/978-0-387-29218-2.](https://doi.org/10.1007/978-0-387-29218-2)
- <span id="page-48-5"></span>**[Kah22]** KAHNG, Andrew B.; LIENIG, Jens; MARKOV, Igor L. and HU, Jin: VLSI Physical Design: From Graph Partitioning to Timing Closure. Cham: Springer International Publishing, 2022. DOI: [10.1007/978-3-030-96415-3.](https://doi.org/10.1007/978-3-030-96415-3)
- <span id="page-48-6"></span>**[Ger99]** GEREZ, Sabih H.: Algorithms for VLSI design automation. Chichester and Weinheim: Wiley, 1999.

# <span id="page-49-0"></span>**2.4 PVT Variation and Aging**

As one of its primary features, the [PARFAIT](#page-363-3) FPGA architecture enables local power adjustments. These can be used to tune the performance of the implemented circuit toward user application requirements: For example, the power and performance of paths which are not critical for timing can be reduced. However, in order to guarantee that paths still meet timing requirements, it is necessary to not only know the expected path delay, but also the actual delay in the manufactured circuit. Here it is important to notice that currently used models for [STA](#page-364-5) do not represent the real path delay accurately, as they ignore various operating condition effects. Additionally, corner-based [STA](#page-364-5) for [ASICs](#page-361-3) can consider various combinations of operating conditions, balancing device yield against the productivity loss caused by overly strict requirements. As will be shown here, [FPGAs](#page-362-3) however generally have to use the worst-case corner, leading to pessimistic [STA](#page-364-5) results.

In the following chapter, the four main operating conditions which affect circuit delays will be introduced:

- **Process Variation:** Due to manufacturing effects,
- **Voltage Variation:** Due to locally varying power supply,
- **Temperature Variation:** Due to locally varying temperature and
- **Aging:** Due to degradation of circuits over time.

It will be explained what causes those effects and how representative models for simulation can be derived from literature. In addition, it will be shown how existing models can be adapted for [RFET](#page-364-0) devices. Following that, the influence of transistor-level effects on circuit and system level metrics will be explained, and modeling approaches for those effects will be introduced. In later chapters, these models will be used to simulate the [PARFAIT](#page-363-3) architecture with changing operating conditions.

#### **Process Variation**

Process variation describes the fact that transistor parameter values are not identical for all transistors on a manufactured [IC,](#page-362-6) even if they have been designed with nominally same values. This variation of transistor parameters is caused by manufacturing and illustrated in [figure 2.11.](#page-50-0) It mainly affects the threshold voltage $V_{\text{th}}$ and effective channel length $L_{\text{eff}}$, but other parameters are affected as well. This variation is usually assumed to be Gaussian and can then be split into an offset of the mean and into zero-mean Gaussian variation. To understand process variation in transistor manufacturing, it is illustrative to review the main processing steps introducing variations. For an overview of [Very Large Scale Integration \(VLSI\)](#page-364-6) device manufacturing, please refer to literature [\[48,](#page-319-8) [49,](#page-319-9) [50\]](#page-319-10).

<span id="page-50-0"></span>

**Figure 2.11:** Process Variation affecting a transistor parameter $x$. The figure shows the probability density of a parameter relative to its nominal value $x_0$. $x_{\mu}$ shows the effective mean value including variation, $x_n$ is one instance of a measured device. Adapted from [\[47\]](#page-319-7).

**Process Variation Sources:** In device manufacturing, photolithography is used to project patterns from masks onto the wafer. The process consists of multiple steps, including deposition of a resist film, mask alignment, exposure of the resist, development and baking. As structure sizes have decreased, various complex forms of photolithography have been introduced. Photolithography introduces process variation mainly through two effects: [Line-Edge Roughness \(LER\)](#page-363-6) is a variation of the channel length along the width, or the width along the length. It is caused by optical effects during exposure and mainly affects the $V_{\text{th}}$ of devices. The second effect is the [Optical](#page-363-7) [Proximity Effect \(OPE\):](#page-363-7) Because of diffraction effects, the exposure of a structure actually depends on neighboring elements. [OPE](#page-363-7) correction has been developed, but cannot completely counter the effect. [OPE](#page-363-7) mostly leads to width variation and affects a device's $V_{\text{th}}$. In addition, mask alignment issues during lithography also affect device width and $V_{\text{th}}$.

Etching, which removes unneeded structures according to the mask, is another step introducing process variation. Depending on the method used (wet / dry / plasma / sputtering), there can be over- or underetching. These effects depend on the mask layout and therefore introduce variation. Doping is also a main contributor to process variation. It is used to insert dopants into the crystal lattice to form semiconductors. Commonly used methods for doping include ion implantation and diffusion. The process is prone to [Random](#page-364-7) [Dopant Fluctuation \(RDF\),](#page-364-7) causing locally varying doping concentrations, which mainly leads to varying $V_{\text{th}}$.

<span id="page-51-0"></span>

**Figure 2.12:** Process Variation taxonomy according to [\[51\]](#page-319-11). Adapted from [\[52\]](#page-320-0).

Deposition of materials is one more step which can cause process variation. As the threshold voltage depends on the gate oxide thickness, layer deposition of the gate oxide can cause variation. Methods commonly used for deposition are [Chemical Vapor Deposition \(CVD\),](#page-362-7) [Physical Vapor Deposition \(PVD\)](#page-363-8) and atomic layer deposition. One more processing step introducing variation is planarization of surfaces. As the wafer has to be as planar as possible for various production steps, it is planarized repeatedly. Whereas there are various etchback techniques, [Chemical Mechanical Polishing \(CMP\)](#page-361-4) is the most commonly used planarization method. It is prone to two effects causing variation: Dishing happens when the wafer is over-polished and too much metal is removed between dielectric parts, as it is softer than those. Erosion is similar, but includes both dielectric and metal loss. Again it is more marked in locations where there is a lot of metal compared to dielectric, i.e. in locations with dense metal wiring. Both effects reduce the metal layer width, increasing resistance in the interconnect.

**Process Variation Taxonomy:** According to [\[51,](#page-319-11) [52\]](#page-320-0), process variation effects can be classified into the categories shown in [figure 2.12.](#page-51-0) *Systematic* or deterministic process variation is caused by deterministic effects, which affect all produced devices similarly. These manufacturing effects are identical across wafers, but could for example depend on the circuit layout instead. The most common cause of such effects occurs in photolithography, with optical proximity effects being the main contributor to variation. As these effects are systematic, they can be compensated at least partially.

Apart from systematic effects, all other effects can be categorized as *Non-Systematic* or random effects [\[52\]](#page-320-0). As these variations are arbitrary, modeling generally uses statistical approaches. Non-systematic effects can then be further separated into *Inter-Die* or global variation, and *Intra-Die* or local variation. Inter-Die variation affects transistors on each single die in the same way, but does produce variations across different dies. Such effects are generally introduced by manufacturing variations which concern at least the whole die in the same way. Here die-to-die variations can occur due to mask alignment issues, wafer-to-wafer variations due to wafer positioning, lot-to-lot variations because of changes in equipment or source material and fab-to-fab variations due to differences between fabs. Inter-Die variation is usually countered by binning devices [\[47\]](#page-319-7): Device performance is measured and devices are classified into different speed grades according to their performance. As Inter-Die variation affects all transistors and therefore logic paths on a wafer similarly, there is little use in trying to locally compensate effects on the produced chips. Inter-Die variation can be modeled using a statistical distribution of affected device parameters, in the simplest case assuming a Gaussian distribution. As these effects do not have spatial correlation, all transistors can be modelled in the same way.
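Binning can be pictured as a simple classification of measured devices against speed-grade limits. The sketch below assumes a Gaussian distribution of the measured maximum frequency per die; the limits, the distribution parameters and the function `speed_grade` are purely hypothetical.

```python
import bisect
import random
from collections import Counter

# Hypothetical lower frequency limits of three speed grades in MHz.
GRADE_LIMITS_MHZ = [400.0, 450.0, 500.0]


def speed_grade(f_max_mhz, limits=GRADE_LIMITS_MHZ):
    """Return 0 for a rejected die, otherwise the number of limits that are met."""
    return bisect.bisect_right(limits, f_max_mhz)


# Inter-die variation modelled as one Gaussian sample per die (assumption).
rng = random.Random(1)
dies = [rng.gauss(460.0, 30.0) for _ in range(1000)]
bins = Counter(speed_grade(f) for f in dies)

for grade in range(len(GRADE_LIMITS_MHZ) + 1):
    print(f"grade {grade}: {bins[grade]} dies")
```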

*Intra-Die* variation on the other hand can be differentiated into pure random variation and spatially correlated variation. Most notably, these changes can affect the feature size of different transistors on one die, leading to varying electrical characteristics for those. In currently used, modern technologies, most variation within produced devices is intra-die variation. Pure random intra-die variation represents effects which do not have any spatial correlation, i.e. affect nearby transistors in the same way as transistors which are at a larger distance.

According to [\[52\]](#page-320-0), these effects are mainly caused by random-dopant fluctuation and line edge roughness: Random-dopant fluctuation is less of an issue for [RFET](#page-364-0) technology, as it does not use any doping steps. Line edge roughness however affects [RFET](#page-364-0) technology as well, as photolithography is used for [RFET](#page-364-0) manufacturing. Spatially correlated variation is location dependent: The correlation between transistor characteristics depends on their relative distance, with larger correlation for transistors located more closely together on the die. Intuitively, this means that transistors close to each other on a die are more likely to have "similar" characteristics. Various manufacturing steps can introduce such effects, most notably photolithography, etching and chemical-mechanical polishing. All of those can change gradually along the die. Intra-die variation affects mostly the channel length, width and oxide thickness of transistors.

**Process Variation Modeling:** This thesis will mainly address intra-die variation, but inter-die effects can be added using an additional Gaussian parameter in the variation model. In that case, mean and standard deviation of the inter-die variation distribution have to be known. The most commonly used way in literature to model process variation of some parameter $X$ is to consider it as a random variable. Then a linear combination of individual components can be used to describe the final random variable for the total variation [\[52,](#page-320-0) [53,](#page-320-1) [54,](#page-320-2) [55\]](#page-320-3):

$$
\begin{aligned}
X &= X_0 + \Delta X_{\text{D2D}} + \Delta X_{\text{WID}} \\
  &= X_0 + \Delta X_{\text{D2D}} + \Delta X_{\text{WID,c}} + \Delta X_{\text{WID,r}}
\end{aligned} \tag{2.16}
$$

<span id="page-53-0"></span>In [equation \(2.16\),](#page-53-0) $X_0$ represents the nominal value of the parameter, $\Delta X_{\text{D2D}}$ the inter-die variation and $\Delta X_{\text{WID}}$ the intra-die variation with correlated part $\Delta X_{\text{WID,c}}$ and pure random part $\Delta X_{\text{WID,r}}$. Assuming $X$ to be normally distributed and considering $\Delta X_{\text{D2D}}$, $\Delta X_{\text{WID,c}}$ and $\Delta X_{\text{WID,r}}$ as zero-mean, normally distributed and independent, mean $\mu_X$ and variance $\sigma_X^2$ of $X$ are given as:

<span id="page-53-2"></span><span id="page-53-1"></span>
$$
\mu_X = X_0 \tag{2.17}
$$

$$
\sigma_X^2 = \sigma_{X_{\text{D2D}}}^2 + \sigma_{X_{\text{WID},c}}^2 + \sigma_{X_{\text{WID},r}}^2 \tag{2.18}
$$

Which of the parts contributes to the parameter variation  $X$  differs for each parameter, depending on how it is affected by the variation sources [\[52\]](#page-320-0). [Equations \(2.17\)](#page-53-1) and [\(2.18\)](#page-53-2) don't necessarily apply for some recent models which use non-normally distributed variables.
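Equations (2.17) and (2.18) can be checked numerically with a small Monte-Carlo experiment. The sketch below is a simplification under stated assumptions: all three components are drawn independently per sample, so it only reproduces the marginal distribution of $X$ and ignores that the inter-die part is shared by all transistors of a die and that the correlated part depends on position. The nominal value and the standard deviations are invented for illustration.

```python
import math
import random
import statistics


def sample_parameter(x0, sigma_d2d, sigma_wid_c, sigma_wid_r, rng):
    """Draw one value of X according to equation (2.16) with zero-mean Gaussian parts."""
    return (x0
            + rng.gauss(0.0, sigma_d2d)
            + rng.gauss(0.0, sigma_wid_c)
            + rng.gauss(0.0, sigma_wid_r))


# Hypothetical values, e.g. a threshold voltage in V.
X0 = 0.30
SIGMAS = (0.010, 0.015, 0.015)   # sigma_D2D, sigma_WID,c, sigma_WID,r

rng = random.Random(0)
samples = [sample_parameter(X0, *SIGMAS, rng) for _ in range(20000)]

expected_sigma = math.sqrt(sum(s * s for s in SIGMAS))   # equation (2.18)
print(f"mean:  expected {X0:.4f}, sampled {statistics.fmean(samples):.4f}")
print(f"sigma: expected {expected_sigma:.4f}, sampled {statistics.stdev(samples):.4f}")
```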

**VARIUS Model:** Models for process variation usually need technology dependent parameters. As those are not always available from fabs, this can make adaptation of those models difficult. The thesis therefore first introduces the VARIUS process variation model from [\[53\]](#page-320-1), as it is a simple model requiring few parameters and provides exemplary parameter values which match the empirical study in [\[56\]](#page-320-4). It can easily be adjusted for device data from other studies such as [\[57,](#page-320-5) [58\]](#page-320-6), or be tailored to a custom device characterization. The model is physically motivated, i.e. derived from transistor equations, and considers two main factors in process variation: Threshold voltage $V_{\text{th}}$ and the effective gate length $L_{\text{eff}}$.

VARIUS describes $\Delta X_{\text{WID,r}}$ and $\Delta X_{\text{WID,c}}$ in [equation \(2.16\)](#page-53-0) as independent normal distributions. It models spatial correlation between two points on the chip with the correlation

$$
corr(X_{\vec{x}}, X_{\vec{y}}) = \rho(r) \qquad r = |\vec{x} - \vec{y}| \tag{2.19}
$$

and a spherical model for the correlation function  $\rho(r)$ :

$$
\rho = \begin{cases} 1 - \frac{3r}{2\Phi} + \frac{r^3}{2\Phi^3} & (r \le \Phi) \\ 0 & \text{otherwise} \end{cases} \tag{2.20}
$$

It notes that the spatially correlated component of  $L_{\text{eff}}$  is correlated to  $V_{\text{th}}$  and derives it accordingly:

$$
L_{\text{eff}} = L_{\text{eff}}^{0} \left( 1 + \frac{V_{\text{th}} - V_{\text{th}}^{0}}{2V_{\text{th}}^{0}} \right) \tag{2.21}
$$

The publication also offers parameter values $\Phi = 0.5$, $\sigma/\mu = 0.063$ for random and systematic $V_{\text{th}}$, and $\sigma / \mu = 0.032$ for random $L_{\text{eff}}$. Those can be used to reproduce the results for the transistor technology of [\[56\]](#page-320-4).
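The spherical correlation function (2.20) and the $L_{\text{eff}}$ relation (2.21) are simple enough to be written down directly. The following sketch only transcribes these two equations; the function names and the nominal device values are assumptions, and the distance $r$ is assumed to be expressed in the same normalised unit as $\Phi$.

```python
def spherical_correlation(r, phi):
    """Equation (2.20): correlation of two points at distance r with range phi."""
    if r < 0:
        raise ValueError("distance must be non-negative")
    if r <= phi:
        return 1.0 - 3.0 * r / (2.0 * phi) + r ** 3 / (2.0 * phi ** 3)
    return 0.0


def l_eff_from_vth(v_th, v_th0, l_eff0):
    """Equation (2.21): L_eff derived from the correlated V_th deviation."""
    return l_eff0 * (1.0 + (v_th - v_th0) / (2.0 * v_th0))


PHI = 0.5            # correlation range as given in the text
V_TH0 = 0.30         # hypothetical nominal threshold voltage in V
L_EFF0 = 20.0        # hypothetical nominal effective gate length in nm

for r in (0.0, 0.1, 0.25, 0.5, 0.75):
    print(f"r = {r:.2f}: rho = {spherical_correlation(r, PHI):.4f}")

# A V_th that is 10 % above nominal maps to an L_eff that is 5 % above nominal.
print(f"L_eff = {l_eff_from_vth(1.1 * V_TH0, V_TH0, L_EFF0):.2f} nm")
```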

**Scarpato Model:** In [\[3\]](#page-315-2), Scarpato presents a complete approach to derive parameters from fab data for his propagation delay model. Compared to the VARIUS model, the model directly represents propagation delay instead of transistor parameters. It provides a complete model not only for process variation, but also for voltage and temperature variation as well as aging. Parameter values for this model are obtained through SPICE simulation of the intended target technology. As a test circuit for the simulation, a representative path is selected from the target application [\[3\]](#page-315-2). To model process variation, the Scarpato model first characterizes the technology's voltage and temperature dependencies. First, various voltage and temperature values are simulated. Then the parameters in the following equation are fitted to the SPICE simulation results:

<span id="page-54-0"></span>
$$
t_{\rm pd}(V,T) = p_{\beta} + (C_1 + k_1 T^{n_1}) \frac{V}{(V - (C_2 - k_2 T^{n_2}))^{p_{\alpha}}} \tag{2.22}
$$

To model the process dependency, the simulation is repeated for various corners. To be able to get results for process variation between corners, some sort of interpolation is necessary. Scarpato therefore introduces parameter shifts depending on the process corner. Here he notes that there seems to be no meaningful dependency when all the parameters are varied between corners. Therefore, the model uses constant values for all parameters except for  $C_2$  and  $p_\alpha$ . Those parameters can then be considered to be a linear function between best and worst values:

$$
C_2(P) = C_{2,FF} + m_{c_2} \cdot P \tag{2.23}
$$

$$
p_{\alpha}(P) = p_{\alpha, FF} + m_{p_{\alpha}} \cdot P \tag{2.24}
$$

Here, $P$ is a normalized process parameter from fastest $(0)$ to slowest $(1)$, $C_{2,FF}$ and $p_{\alpha,FF}$ are the base values at the fastest corner, and $m_{c_2}$ and $m_{p_\alpha}$ are the slopes of the linear functions. Inserting into [equation \(2.22\)](#page-54-0) yields:

<span id="page-54-1"></span>
$$
t_{\rm pd}(P, V, T) = p_{\beta} + (C_1 + k_1 T^{n_1}) \frac{V}{(V - (C_2(P) - k_2 T^{n_2}))^{p_{\alpha}(P)}} \tag{2.25}
$$

<span id="page-55-0"></span>

**Figure 2.13:** Delay increase due to supply voltage drop. The figure shows the effect of a 5 mV drop depending on the supply voltage. Adapted from [\[3\]](#page-315-2).

The Scarpato model thereby provides the estimated delay of the circuit at one operating point described by the $(P,V,T)$ tuple. It does not model local variation directly, but can be used to obtain locally varying delays. When the local variation of the parameters $f : (x, y) \mapsto (P, V, T)$ has been modelled, the Scarpato model can be used to obtain the local delay.
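A minimal sketch of how equation (2.25) could be evaluated, together with a local map $f$, is given below. All fitted parameter values, the class name `DelayModel` and the linear gradients in `operating_point` are invented for illustration; real values have to be obtained by fitting to SPICE results as described above, and the units of $T$ and $t_{\text{pd}}$ depend on that fit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DelayModel:
    """Fitted parameters of equation (2.25); the values used below are made up."""
    p_beta: float
    c1: float
    k1: float
    n1: float
    c2_ff: float
    m_c2: float
    k2: float
    n2: float
    p_alpha_ff: float
    m_p_alpha: float

    def t_pd(self, p, v, t):
        c2 = self.c2_ff + self.m_c2 * p                     # equation (2.23)
        p_alpha = self.p_alpha_ff + self.m_p_alpha * p      # equation (2.24)
        overdrive = v - (c2 - self.k2 * t ** self.n2)
        if overdrive <= 0.0:
            raise ValueError("supply voltage too low for this operating point")
        return self.p_beta + (self.c1 + self.k1 * t ** self.n1) * v / overdrive ** p_alpha


MODEL = DelayModel(p_beta=0.05, c1=0.2, k1=1e-4, n1=1.0,
                   c2_ff=0.35, m_c2=0.05, k2=2e-4, n2=1.0,
                   p_alpha_ff=1.2, m_p_alpha=0.2)


def operating_point(x, y):
    """Hypothetical local map f: (x, y) -> (P, V, T) on a unit-square die."""
    p = 0.5                     # same process parameter everywhere
    v = 0.80 - 0.02 * x         # supply voltage droops towards the right edge
    t = 300.0 + 40.0 * y        # temperature rises towards the top edge
    return p, v, t


def local_delay(x, y):
    return MODEL.t_pd(*operating_point(x, y))


print(f"fast corner: {MODEL.t_pd(0.0, 0.8, 300.0):.3f}")
print(f"slow corner: {MODEL.t_pd(1.0, 0.8, 300.0):.3f}")
print(f"local delay at (1.0, 0.0): {local_delay(1.0, 0.0):.3f}")
```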

**Other Models:** Various other models have been proposed in literature. Simple, grid based models have been described as early as in [\[59\]](#page-320-7). As process variation grows increasingly complex, the number of parameters can lead to long simulation times. Because of this, [\[60\]](#page-320-8) proposes a model to reduce the number of modelled parameters, which is especially important for Monte-Carlo simulations. Other models enable description of manufacturing parameters as non-Gaussian and correlated parameters [\[61,](#page-320-9) [62\]](#page-321-0). An overview of commercially used models is given in [\[63,](#page-321-1) [64\]](#page-321-2). Other recent publications have focused on describing the variation as a skewed distribution instead of a normal distribution [\[65\]](#page-321-3). The methods shown in this thesis can be used for all of those models. In order to simulate all four [PVTA](#page-364-8) parameters, a model encompassing all of those is necessary. As the mentioned models focus only on process variation, this thesis' evaluations will be based on a Scarpato model variant extended for local variation.

#### **Voltage Variation**

As demonstrated in [figure 2.13,](#page-55-0) changes in the supply voltage also cause changes in circuit delay. Supply voltage variation can be classified in two categories: Steady-state effects and dynamic effects [\[3\]](#page-315-2). Steady-state effects are caused by $IR$ drop, the voltage drop caused by the parasitic resistance $R$ of the power network when a current $I$ flows [\[66\]](#page-321-4). Varying voltage drops can therefore be caused by both variation in resistance $R$ and variation in current $I$. Local variation of the resistance can be caused by process variation. In addition, it can also be systematically caused by unbalanced power network design. Changes in current can be caused by both the density of transistors within a certain area and switching activity. Varying switching activity through clock gating or frequency scaling results in varying local power usage and $IR$ drop [\[67\]](#page-321-5). Switching activity is furthermore data-dependent, but for analysis of steady-state effects, considering the mean switching activity is sufficient. Whereas the transistor density can be taken into account when designing the power network, switching activity can not be compensated easily, as it may not be constant over time. This is especially the case in [FPGA](#page-362-3) designs, where both the final switching activity and the density of actively used transistors depend on the application bitstream. Both effects can therefore not be estimated until the chip is programmed in the field, making power network based compensation largely impossible. Dynamic effects are mostly caused by $di/dt$ noise [\[66\]](#page-321-4) and can cause supply voltage variations of up to 10 % [\[3\]](#page-315-2). They can be classified according to the duration of the drops they cause [\[68\]](#page-321-6): The deepest drops are usually short and in the range of nanoseconds. They are caused by package inductance and the capacitance of the power distribution network. Effects in the range of hundreds of nanoseconds and slower effects in the range of microseconds usually have external causes. They can be mitigated by better package decoupling on the [Printed Circuit Board \(PCB\)](#page-363-9) integrating the [IC.](#page-362-6)
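The dependency of the steady-state $IR$ drop on the spatial distribution of the current can be illustrated with a deliberately simple sketch. It assumes a single power rail that is fed from one end only, with identical segment resistances; the function name `rail_voltages` and all numbers are illustrative assumptions and do not describe any real power network.

```python
def rail_voltages(v_dd, segment_resistance, node_currents):
    """Voltage at each tap of a rail that is fed from one end only.

    Segment k carries the current of tap k and of all taps behind it,
    so the IR drop accumulates along the rail.
    """
    voltages = []
    voltage = v_dd
    downstream_current = sum(node_currents)
    for current in node_currents:
        voltage -= segment_resistance * downstream_current
        voltages.append(voltage)
        downstream_current -= current
    return voltages


if __name__ == "__main__":
    # Same total current (0.4 A), different spatial distribution.
    uniform = rail_voltages(1.0, 0.01, [0.1, 0.1, 0.1, 0.1])
    clustered = rail_voltages(1.0, 0.01, [0.0, 0.0, 0.0, 0.4])
    print("uniform  :", [round(v, 4) for v in uniform])
    print("clustered:", [round(v, 4) for v in clustered])
```

Although both cases draw the same total current, the clustered load sees a larger worst-case drop. This is the situation an [FPGA](#page-362-3) power network has to tolerate, as the placement of the active logic is only determined by the bitstream.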

**Effects on [FPGAs:](#page-362-3)** For [FPGAs,](#page-362-3) voltage variation is an especially difficult issue to address. In [ASICs,](#page-361-3) the power supply network can be adapted to the application as it is manufactured together with the circuit. In [FPGAs,](#page-362-3) the power grid has to be pre-defined at the manufacturing time of the [FPGA](#page-362-3) [IC.](#page-362-6) The application, however, is known only much later, when programming the bitstream. It is therefore not readily possible to adapt the grid to the transistor density and switching activity of the target application. Because of that, the power network in [FPGAs](#page-362-3) needs to be designed defensively. In addition, when mapping the target application and calculating timing slacks, certain safety margins have to be included. Voltage drops are the reason for the largest part of the timing margins and cause overly pessimistic margins in general [\[69\]](#page-321-7). Because it is difficult to address this issue in the [FPGA](#page-362-3) architecture, publications such as [\[69\]](#page-321-7) instead provide advice for [FPGA](#page-362-3) application designers. For example, careful floor planning of the user application can reduce the issue. The analysis in [\[69\]](#page-321-7) also shows that out of [Process-, Voltage-, Temperature-Variation \(PVT\),](#page-364-9) dynamic voltage variation contributes most to the increase of path delay, followed by process variation, temperature variation and, last, static voltage variation.

**Voltage Variation Modeling:** The influence on the circuit timing can be estimated with a [PVT](#page-364-9) model such as [equation \(2.25\)](#page-54-1) or the alpha-power law [\[66\]](#page-321-4):

$$
t_{\rm pd}(V + \Delta V) \propto \frac{V + \Delta V}{\left(V + \Delta V - V_{\rm th}\right)^{\alpha}}\tag{2.26}
$$

Here it can be seen that the closer $V$ is to $V_{\rm th}$, the larger the influence of $\Delta V$ [\[3\]](#page-315-2). This effect can also be seen in [figure 2.13](#page-55-0) on page [33,](#page-55-0) where a voltage drop of 5 mV has no effect at a supply voltage of 1.4 V, but causes 10 % additional delay at 0.86 V. In addition to such a delay model, a model describing the voltage variation on the [IC,](#page-362-6) $f: (x, y) \mapsto V$, is needed. Modeling of such voltage variation differs for steady-state and dynamic variation. For [ASICs,](#page-361-3) steady-state variation is usually analyzed using deterministic methods. If the application design and switching frequencies are known, [Computer Aided Design \(CAD\)](#page-361-5) tools support calculation and analysis of $IR$ drop in power networks. Similarly, for [FPGAs](#page-362-3) voltage drop effects could be estimated if details about the [FPGA](#page-362-3) power network, the application bitstream and the switching activity are known. As this drop can be analyzed deterministically for applications, there are no ready-to-use models which represent a "typical" application.
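The growing sensitivity close to the threshold voltage can be made tangible with a short numerical sketch of the alpha-power law in equation (2.26). The values chosen for `v_th` and `alpha` are assumptions for illustration only; they are not fitted to any technology and are not meant to reproduce the measurements in figure 2.13.

```python
def relative_delay(v, v_th=0.45, alpha=1.3):
    """Alpha-power-law delay up to a constant factor: V / (V - V_th)^alpha."""
    if v <= v_th:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return v / (v - v_th) ** alpha


def delay_increase(v, delta_v, v_th=0.45, alpha=1.3):
    """Relative delay change when the supply moves from v to v + delta_v."""
    nominal = relative_delay(v, v_th, alpha)
    return relative_delay(v + delta_v, v_th, alpha) / nominal - 1.0


if __name__ == "__main__":
    # Apply the same 50 mV drop at different nominal supply voltages.
    for v in (1.4, 1.0, 0.86, 0.6):
        increase = delay_increase(v, -0.05)
        print(f"V = {v:.2f} V: 50 mV drop -> {increase * 100:.1f} % extra delay")
```

With these assumed parameters, the same absolute voltage drop costs only a few percent of delay at high supply voltages, but a large fraction of the nominal delay once $V$ approaches $V_{\rm th}$.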

Research on dynamic voltage drops, especially on [FPGAs,](#page-362-3) on the other hand largely focuses on security issues. Voltage drops can leak data through side channels or could be used for [Denial Of Service \(DOS\)](#page-362-8) attacks in cloud [FPGA](#page-362-3) hosts [\[70\]](#page-321-8). Models for these effects have so far only been provided on transistor level [\[71\]](#page-322-0). It has also been shown that voltage drop effects are usually spatially correlated over the [IC](#page-362-6) [\[72\]](#page-322-1), as voltage drops effectively occur in the power network. They therefore always affect multiple devices connected to that local part of the network. The author of this work is not aware of any system-level model combining these aspects for representative applications on [FPGAs.](#page-362-3) The example evaluations in this work will therefore only consider static effects, but the designed simulation framework will be adaptable to dynamic models as well.

### **Temperature Variation**

Temperature variation can be a large issue in modern [VLSI](#page-364-6) [ICs.](#page-362-6) Early works have demonstrated that there can be local hotspots in [CPUs](#page-361-6) of up to 120 °C [\[67\]](#page-321-5). Furthermore, it has been shown that both transistor device and interconnect performance are affected by temperature effects [\[67\]](#page-321-5). Whereas this effect is an issue for classic circuits as well, energy saving [Near-Threshold Voltage Operation \(NTVO\)](#page-363-10) is affected in a special way: In this situation, on-chip temperature is often close to ambient temperature. Therefore, fluctuations in ambient temperature have an immediate effect on circuit performance [\[73\]](#page-322-2).

<span id="page-58-0"></span>

**Figure 2.14:** Temperature variation on [FPGAs](#page-362-3) as reported by previous works in measurements and simulation models. **[\(a\)](#page-58-0)** Temperature profile for an application using only programmable soft logic. **[\(b\)](#page-58-0)** Temperature profile for an application using hard blocks on a platform [FPGA.](#page-362-3) **[\(c\)](#page-58-0)** Temperature profile as obtained using a simulation model.

Whereas in [ASICs](#page-361-3) a temperature profile can be devised by CAD tools before manufacturing the chip, in [FPGAs](#page-362-3) such an analysis can only be performed when implementing the user application. Mitigation techniques which improve cooling of the chip locally can therefore not be applied. Furthermore, recent work shows that heating can also be excessive on [FPGAs.](#page-362-3) Specially designed [FPGA](#page-362-3) applications were able to heat commercial [FPGAs](#page-362-3) to temperatures of 50 °C to 120 °C [\[77\]](#page-322-6). Standard [FPGA](#page-362-3) designs also cause temperature variation, especially in platform [FPGAs](#page-362-3) [\(figures 2.14a](#page-58-0) and [2.14b\)](#page-58-0). In those, especially hard blocks can cause local heating [\[75\]](#page-322-4). An exemplary analysis of [CPU](#page-361-6) circuits implemented on [FPGAs](#page-362-3) has further found temperature hotspots caused by cache and memory interfaces [\[78\]](#page-322-7). Other works have inserted temperature sensors into certain [FPGA](#page-362-3) applications to verify predictions of hotspot locations or monitor those at runtime [\[74\]](#page-322-3).

**Placement Strategies:** One way to counter these hotspot effects in [FPGAs](#page-362-3) is using a thermal aware placement strategy [\[75\]](#page-322-4). In those works, the thermal effects are usually estimated using a thermal simulator and the placement is then modified accordingly. Apart from larger modifications of placement, even small changes such as swapping the inputs of [LUTs](#page-363-4) can reduce the thermal hotspots [\[79\]](#page-322-8). Lu et al. provided a slightly different approach where the initial temperature profile is not obtained through simulation, but through measurements [\[80\]](#page-322-9). They first perform a normal placement, but embed temperature sensors. The user application design then needs to be executed on the [FPGA,](#page-362-3) so the embedded sensors can be used to obtain the temperature profile. The proposed algorithm then performs a new placement for the application, taking the temperature profile into account. Overall, this resulted in a 13.9% increase of uniformity in temperature distribution over the [FPGA.](#page-362-3)

**Modeling Approaches:** To evaluate the effectiveness of all those mitigations for temperature variation in simulation, a model for this variation is needed. Temperature variation is usually modelled deterministically, based on the concrete circuit. Whereas [CAD](#page-361-5) tools for [ASICs](#page-361-3) integrate such temperature modeling, for [FPGAs](#page-362-3) usually only an overall temperature for the whole chip is estimated by vendor tools. Academia has therefore focused on deriving custom workflows to generate temperature profiles. [Equations \(2.27\)](#page-59-0) and [\(2.28\)](#page-59-1) demonstrate that the chip temperature is related to leakage power and dynamic power [\[76\]](#page-322-5):

$$
P_{\text{leak}} = P_0 e^{\frac{-k}{T}} \tag{2.27}
$$

<span id="page-59-1"></span><span id="page-59-0"></span>
$$
T = T_A + \Theta (P_{\text{leak}} + P_d) \tag{2.28}
$$

It is therefore necessary to derive dynamic and leakage power to estimate temperature profiles. Whereas dynamic power can be obtained from switching activity, leakage power is often not reported by [FPGA](#page-362-3) vendor tools. In [\[76\]](#page-322-5), the authors therefore propose an iterative approach to derive the leakage power. Once those components are known, the temperature distribution can be obtained using thermal simulators such as HotSpot or ISAC. An example for such a simulated heatmap is shown in [figure 2.14c.](#page-58-0) Validating their measurements using thermal cameras, [\[76\]](#page-322-5) shows that this strategy can estimate temperature with a precision of up to 1 °C. The author of this dissertation is not aware of any application-independent, generic temperature model. All models focus on the behavior of a given application, but there are none which predict the behavior of an exemplary application. As such, temperature modelling for the purpose of this thesis needs to be performed using the modelling approach presented above: The dynamic and leakage power need to be derived, and the temperature distribution can then be obtained using thermal modeling for each specific application.
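The mutual dependency of leakage power and temperature can be sketched as a fixed-point iteration. The sketch below is not the procedure of [\[76\]](#page-322-5); it only illustrates the feedback loop for a single lumped temperature, reading equation (2.28) as ambient temperature plus thermal resistance $\Theta$ times total power. All parameter values (`p0`, `k`, `theta`, `t_ambient`) are assumed placeholders.

```python
import math


def leakage_power(temperature, p0=50.0, k=1500.0):
    """Equation (2.27): P_leak = P_0 * exp(-k / T), with T in kelvin."""
    return p0 * math.exp(-k / temperature)


def steady_state(p_dyn, t_ambient=298.0, theta=2.0, tol=1e-6, max_iter=200):
    """Fixed-point iteration over equations (2.27) and (2.28).

    Returns (temperature in K, leakage power in W).
    """
    temperature = t_ambient
    for _ in range(max_iter):
        updated = t_ambient + theta * (leakage_power(temperature) + p_dyn)
        if abs(updated - temperature) < tol:
            return updated, leakage_power(updated)
        temperature = updated
    raise RuntimeError("no convergence, possibly thermal runaway")


if __name__ == "__main__":
    temperature, p_leak = steady_state(p_dyn=5.0)
    print(f"T = {temperature:.2f} K, P_leak = {p_leak:.3f} W")
```

A spatially resolved profile, as in figure 2.14c, additionally requires a thermal simulator; the lumped model only shows why leakage and temperature have to be solved together.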

**Effect on Circuit Delay:** The effect of varying temperature depends largely on the actual transistor technology used [\[81\]](#page-323-0). Previous studies found transistor drive current to decrease by 4% and interconnect delay to increase by 5% for a temperature increase of 10 K [\[67,](#page-321-5) [81\]](#page-323-0). Various parameters in the transistor equation are temperature dependent, including the channel mobility and charge, as well as the gate overdrive voltage through the dependency of the threshold voltage. For [FinFET](#page-365-1) and other [SOI](#page-364-2) devices, reduced short channel effects and self heating effects might have additional influence [\[81\]](#page-323-0). For bulk [MOSFET,](#page-363-0) temperature effects can be physically understood using the BSIM3 and BSIM4 transistor models. These models contain various parameters with temperature dependencies. The delay variation characteristics change, depending on which parameter is dominant in each specific case [\[82\]](#page-323-1). As an example, reduced supply voltage can cause an inversion of the temperature dependency: In older technology nodes, carrier mobility variation dominated, increasing circuit delay with rising temperature. In 45 nm technology, gate overdrive becomes the dominant parameter and circuit delay now decreases with rising temperature [\[83\]](#page-323-2).

Because of this technology dependent behavior, there are no physically based models for system level delay estimation. Simulation is commonly based on the device level SPICE models, which is accurate yet computationally infeasible for large circuits. Alternatively, mathematical models that abstract from the physical device behavior and use parameter fitting can be used. This is the approach used in the Scarpato model, fitting parameters of a mathematically motivated model to SPICE simulations. Here the dependency of the temperature variation on the dominant physical parameter is hidden in the fitted model parameters. This simplifies the equations and allows for efficient computation using a generic model that adapts to various technologies. The resulting model is only valid as long as the temperature and other parameters stay within the limits used in the fitting process [\[3\]](#page-315-2).

#### **Aging**

Apart from process variation, which affects circuit performance independently of the environment conditions during operation, and environment conditions such as voltage and temperature, another aspect to consider in circuit performance is [IC](#page-362-6) aging. [IC](#page-362-6) aging can be caused by [Front End of Line \(FEOL\)](#page-362-9) effects, i.e. by transistors, by [Back End of Line \(BEOL\)](#page-361-7) effects, i.e. metal layers, and on package level and back end [\[84\]](#page-323-3). Aging effects can cause gradual degradation in performance or complete breakdown of functionality. Effects can also be temporary, allowing recovery under certain conditions, or permanent. For the analysis in this thesis, both permanent and temporary gradual degradation effects will be addressed. Effects causing complete breakdown of functionality will not be considered.

**[FEOL](#page-362-9) effects:** There are three main effects belonging to the [FEOL](#page-362-9) category: [Hot Carrier Injection \(HCI\)](#page-362-10) is an effect where charges get trapped in the dielectric of the transistor. It is caused by current flow in the transistor channel, letting "hot" carriers enter the dielectric. Whereas it also affects carrier mobility and current flow, its main result is a permanent change in  $V_{th}$  [\[3\]](#page-315-2). Signal activity is the main factor affecting [HCI](#page-362-10) severity. It is also dependent on the supply voltage of the transistors, but mostly independent of the temperature [\[85,](#page-323-4) [86\]](#page-323-5). It is generally most relevant for analog circuits and less of an issue in digital designs [\[87\]](#page-323-6).

<span id="page-61-0"></span>

**Figure 2.15:** [Negative Bias Temperature Instability \(NBTI\)](#page-363-11) stress and recovery in [PMOS](#page-363-1) transistors, taken from [\[87\]](#page-323-6). Left: Charge carriers entering dielectric and breaking Si-H bonds under stress (negative $V_{th}$). Right: Charge carriers leaving dielectric and Si-H bonds restoring under recovery conditions (zero $V_{th}$).

The second effect in [FEOL](#page-362-9) is [Bias Temperature Instability \(BTI\)](#page-361-8) and its variants [NBTI](#page-363-11) and [Positive Bias Temperature Instability \(PBTI\).](#page-363-12) [NBTI](#page-363-11) as shown in [figure 2.15](#page-61-0) occurs in [PMOS](#page-363-1) devices and is the more common of the two effects. Like [HCI,](#page-362-10) it is an effect where charge carriers enter the dielectric and cause an increase in $V_{th}$. Unlike [HCI](#page-362-10) however, the cause for [BTI](#page-361-8) is the electric field perpendicular to the channel. The effect is therefore mainly caused by the voltage applied to the gate and the severity of [BTI](#page-361-8) depends on the signal switching probability [\[3\]](#page-315-2). Some of the [BTI](#page-361-8) effect is reversible: Charge carriers entering the dielectric can leave it again if reduced gate voltage is applied. [BTI](#page-361-8) can however also lead to permanent effects, e.g. through broken Si-H bonds [\[3\]](#page-315-2). The effect is highly temperature dependent [\[88,](#page-323-7) [89\]](#page-323-8) and is the dominant aging effect in digital circuits [\[87\]](#page-323-6).

The third common aging effect in [FEOL](#page-362-9) is [Time Dependent Dielectric Breakdown \(TDDB\).](#page-364-10) [TDDB](#page-364-10) is a mostly permanent and often destructive effect. It causes a conductive path in the dielectric, which leads to increased leakage currents. When the hard breakdown occurs, a conducting path is created from gate to substrate, making the transistor unusable [\[3\]](#page-315-2). As destructive effects won't be addressed in this thesis, [TDDB](#page-364-10) effects will not be examined further.

**[BEOL](#page-361-7) effects:** [BEOL](#page-361-7) aging effects are those effects which are caused in metal layers. The most prominent failure mechanism for metal layer based interconnects is [Electromigration \(EM\).](#page-362-11) [EM](#page-362-11) is caused by the electric field enabling the current flow in a conductor. This field causes atoms in the conductor to slowly migrate, as shown in [figure 2.16.](#page-62-0) As a result, wires can develop voids, causing increasing resistance on the wire. This effect is not necessarily destructive. In addition, migrated atoms accumulate on certain spots, forming hillocks. Hillocks can cause short-circuits between neighboring wires, leading to permanent defects [\[90\]](#page-323-9).

<span id="page-62-0"></span>

**Figure 2.16:** [EM](#page-362-11) effects observed using a scanning electron microscope. Picture taken from [\[91\]](#page-324-0). Left: A void in the metal wire connection, increasing connection resistance. Right: A hillock in the wire connection, potentially causing short circuits.

Metal layers are further affected by mechanical stress, e.g. caused by thermal cycling [\[3\]](#page-315-2). More aging effects concern the packaging of the [IC.](#page-362-6) As those are usually destructive effects, they are not further elaborated here.

**Aging in [FPGAs:](#page-362-3)** For [FPGAs](#page-362-3) as primarily digital circuits, [BTI](#page-361-8) is the most critical aging cause. When considering both [BTI](#page-361-8) and [HCI,](#page-362-10) it should be noted that treating them as completely independent effects can result in pessimistic estimations [\[3\]](#page-315-2). When modelling [BTI,](#page-361-8) it should also be considered that it varies locally. This local variation is however largely independent of the process variation [\[3\]](#page-315-2). As [BTI](#page-361-8) causes increased $V_{th}$ and therefore increased delay in circuits [\[87\]](#page-323-6), it will also affect [FPGA](#page-362-3) logic. Due to the [FPGA'](#page-362-3)s late configuration, it is again not possible to consider application specific countermeasures during circuit design. Solutions could therefore add safety margins during timing analysis of [FPGA](#page-362-3) applications. Local variation and dependency on switching activity will however lead to pessimistic results. Alternatively, aging monitoring can detect delay degradation. Such systems can also handle local degradation, if they can counter effects using some mechanism to locally increase performance.

**Modeling Approaches:** Models for the aging processes have been proposed at various layers of abstraction. At the lowest abstraction, physical models can be used to enhance SPICE simulations with aging information. Until recently, fabs provided [EDA](#page-362-5) specific simulation models and libraries for their respective technologies [\[92\]](#page-324-1). More recently, the Silicon Integration Initiative's Compact Model Coalition has standardized an [Application Programming Interface \(API\)](#page-361-9) to develop heating and aging models [\[84,](#page-323-3) [93\]](#page-324-2). These models are then used in combination with the standard SPICE transistor models, such as BSIM. Factory provided models for aging can be either empirical or physical models, trading off performance and accuracy. For physical models, reaction-diffusion [\[94\]](#page-324-3) and trapping-detrapping [\[95\]](#page-324-4) models have been proposed. A summary of these models and the corresponding equations is given by Khoshavi et al. and can be found in [\[87\]](#page-323-6).

To provide faster models for simulation, circuit level predictive aging models have been proposed [\[87\]](#page-323-6). For example, long-term [HCI](#page-362-10) and [BTI](#page-361-8) effects in circuits can be modeled using these equations [\[96,](#page-324-5) [97\]](#page-324-6):

$$
\Delta V_{\text{th\_BTI}} = a(TSP \cdot t)^n \tag{2.29}
$$

$$
\Delta V_{\text{th\_HCI}} = b \left( TSwP \cdot t \right)^m \tag{2.30}
$$

In these equations, $a$ and $b$ are technology and temperature specific constants, $TSP$ is the transistor stress probability, $TSwP$ is the transistor switching stress probability, $t$ is time, and $n$ and $m$ are time exponential constants. As the switching stress probabilities are application dependent, a workload profile for the application has to be provided [\[98\]](#page-324-7). For the different effects, both the signal probability (the fraction of time a signal assumes a certain value) and the transition density have to be known. In addition, these models require temperature and supply voltage profiles as inputs [\[99\]](#page-324-8). Furthermore, stochastic models can take non-deterministic effects into account [\[100\]](#page-324-9). For some effects, it is however important that process variation and aging are not modelled independently, as this could lead to pessimistic results [\[87\]](#page-323-6).
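The power-law form of equations (2.29) and (2.30) can be evaluated directly once a workload profile is available. The following sketch uses placeholder constants for `a`, `b`, `n` and `m`; they are assumptions chosen only to produce plausible orders of magnitude and are not taken from [\[96,](#page-324-5) [97\]](#page-324-6).

```python
def delta_vth_bti(tsp, t_seconds, a=3.0e-3, n=0.17):
    """Equation (2.29): BTI threshold shift in volts for stress probability tsp."""
    return a * (tsp * t_seconds) ** n


def delta_vth_hci(tswp, t_seconds, b=1.0e-5, m=0.45):
    """Equation (2.30): HCI threshold shift in volts for switching probability tswp."""
    return b * (tswp * t_seconds) ** m


if __name__ == "__main__":
    year = 365 * 24 * 3600.0
    for years in (1, 5, 10):
        bti = delta_vth_bti(tsp=0.5, t_seconds=years * year)
        hci = delta_vth_hci(tswp=0.1, t_seconds=years * year)
        print(f"{years:2d} years: BTI {bti * 1e3:.1f} mV, HCI {hci * 1e3:.1f} mV")
```

Because the exponents are smaller than one, most of the shift accumulates early in the lifetime: doubling the stress time only scales the shift by $2^n$ or $2^m$, respectively.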

Another kind of aging model on a yet higher level of abstraction are architecture models [\[87\]](#page-323-6). For example, [\[101\]](#page-324-10) proposed to select a "representative" transistor for a certain area and use it to estimate the values of all transistors in this area. [\[102\]](#page-325-0) simulates a certain workload pattern and derives voltage and temperature profiles. It then uses this information to derive gate delays. [\[103\]](#page-325-1) extends all these considerations to instantaneous [BTI](#page-361-8) effects. Application specific models are of limited use in [FPGAs,](#page-362-3) as the application is not known during [FPGA](#page-362-3) design time. As the [FPGA](#page-362-3) transistor sizes are predetermined, these models can be used for analysis, but are of limited use for mitigation.

**Scarpato Model:** One variant of a circuit level aging model is the one proposed in [\[3\]](#page-315-2). As an extension of the [PVT](#page-364-9) model presented in the same work, it is primarily an empirical, mathematical model: Parameters in the model equation are fitted to results of SPICE simulations. This approach therefore requires existing aging models that can be used in a SPICE simulator. The main benefit of the Scarpato model is its reduced computational complexity when compared to the SPICE model. The Scarpato [PVT](#page-364-9) model presented earlier in this section was therefore analyzed to determine which parameters change with [BTI](#page-361-8) and [HCI](#page-362-10) effects. [EM](#page-362-11) effects were not considered in this model, but extending it accordingly is possible. To reduce the model complexity, aging variation has been reduced to an aging-dependent shift of a single parameter in the model.

$$
t_{\rm pd}(V,T) = p_{\beta} + p_{\mu^{-1}}(T) \left( \frac{V}{(V - p_{V_{\rm th}}(T))^{p_{\alpha}}} \right) \tag{2.31}
$$

$$
t_{\rm pd}(V,T,t) = p_{\beta} + p_{\mu^{-1}}(T) \left( \frac{V}{\left(V - \left(p_{V_{\rm th}}(T) + \Delta p_{V_{\rm th}}(V,T,t)\right)\right)^{p_{\alpha}}}\right) \tag{2.32}
$$

<span id="page-64-3"></span><span id="page-64-2"></span><span id="page-64-1"></span><span id="page-64-0"></span>
$$
\Delta p_{V_{\text{th}}}(T) \propto e^{\frac{-E_{\alpha}}{kT}} \tag{2.33}
$$

$$
\Delta p_{V_{\text{th}}}(V,T) = A \cdot V^{\gamma} \cdot e^{\frac{-E_{\alpha}}{kT}} \tag{2.34}
$$

Revisiting [equation \(2.31\),](#page-64-0) a short-hand version of [equation \(2.22\),](#page-54-0) $p_{V_{th}}$ was identified as the parameter affected most by the aging shift. Introducing the new $\Delta p_{V_{th}}$ parameter into this equation yields the aging-aware model in [equation \(2.32\).](#page-64-1) Scarpato then continues to describe the temperature and supply voltage dependencies. Various alternatives are examined for the voltage dependency, but ultimately [equation \(2.33\)](#page-64-2) is used in the model.

To introduce the time dependency of the aging parameter, Scarpato introduces trapping-detrapping and reaction-diffusion models into [equation \(2.34\).](#page-64-3) He then evaluates which variant of the equation yields a better fit of the SPICE simulations. Ultimately, the final model chosen for $\Delta p_{V_{\text{th}}}$ is given as:

$$
\Delta p_{V_{\text{th}}}(V,T,t) = \left(C_1 t^{n_1 + a_1 \log(V)} + C_2 t^{n_2 + a_2 \log(V)}\right) \cdot V^{\gamma} \cdot e^{\frac{-E_{\alpha}}{kT}} \tag{2.35}
$$

[Equation \(2.31\)](#page-64-0) does not include process variation effects, but Scarpato showed that process variation is uncorrelated to aging. The parameter can therefore be determined once and be reused for different corners in [equation \(2.22\)](#page-54-0) to include process variation.

As mentioned in the introduction of the physical aging effects, workload described by switching frequency and signal probability also affects the aging process. As it was found that workload can not be simply reduced to a fixed number of parameters, the  $\Delta p_{V_{th}}$  parameter is instead being estimated for different workloads using different SPICE simulations. The model can also include dynamic variation, i.e. changing supply voltage, temperature and workload over time. Refer to [\[3,](#page-315-2) p. 74] for details.
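To illustrate how equations (2.31), (2.32) and (2.35) interact, the following sketch evaluates the delay of a fresh and an aged cell. It is only an illustration of the structure of the equations: the temperature dependencies of $p_{\mu^{-1}}(T)$ and $p_{V_{\rm th}}(T)$ are passed in as callables with made-up forms, the logarithm is assumed to be the natural logarithm, and every numeric parameter is a placeholder rather than a value fitted by Scarpato.

```python
import math

K_B = 8.617333262e-5  # Boltzmann constant in eV/K


def delta_p_vth(v, temp, t, c1, n1, a1, c2, n2, a2, gamma, e_a):
    """Aging shift of the threshold parameter, equation (2.35)."""
    if t <= 0.0:
        return 0.0
    log_v = math.log(v)  # natural logarithm assumed
    time_term = c1 * t ** (n1 + a1 * log_v) + c2 * t ** (n2 + a2 * log_v)
    return time_term * v ** gamma * math.exp(-e_a / (K_B * temp))


def t_pd(v, temp, p_beta, p_mu_inv, p_vth, p_alpha, shift=0.0):
    """Propagation delay, equation (2.31) for shift=0 and (2.32) otherwise."""
    v_th = p_vth(temp) + shift
    if v <= v_th:
        raise ValueError("model only defined for V above the threshold parameter")
    return p_beta + p_mu_inv(temp) * v / (v - v_th) ** p_alpha


# Placeholder parameters, NOT fitted to any technology.
DELAY_PARAMS = dict(
    p_beta=5e-12,
    p_mu_inv=lambda temp: 2e-11 * (temp / 300.0) ** 1.5,
    p_vth=lambda temp: 0.40 - 1e-3 * (temp - 300.0),
    p_alpha=1.3,
)
AGING_PARAMS = dict(c1=0.02, n1=0.17, a1=0.02, c2=1e-3, n2=0.3, a2=0.02,
                    gamma=2.0, e_a=0.1)

if __name__ == "__main__":
    v, temp, ten_years = 0.9, 350.0, 10 * 365 * 24 * 3600.0
    shift = delta_p_vth(v, temp, ten_years, **AGING_PARAMS)
    fresh = t_pd(v, temp, **DELAY_PARAMS)
    aged = t_pd(v, temp, shift=shift, **DELAY_PARAMS)
    print(f"shift = {shift * 1e3:.1f} mV, delay +{(aged / fresh - 1) * 100:.1f} %")
```

The benefit over a SPICE simulation is visible in the structure: once the parameters are fitted, an aged delay is a handful of arithmetic operations per cell, which makes evaluation over a complete [FPGA](#page-362-3) fabric feasible.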

#### **Further Reading**

Various textbooks cover the [VLSI](#page-364-6) [IC](#page-362-6) fabrication process in detail. For example, a recent take can be found in [\[Gen17\]](#page-66-0). For process variation, [\[Cha18\]](#page-66-1) provides a detailed introduction from a circuit perspective. After introducing classical [STA](#page-364-5) and explaining the need for statistical analysis, it introduces mathematical foundations needed for the modeling. It further includes an overview of process variation sources and a classification of those. It also extends the analysis to gate and path delays for [CMOS](#page-361-2) circuits. Additional information about process variation and modelling can be found in [\[Chi07,](#page-66-2) [Die12,](#page-66-3) [Huf20\]](#page-66-4). A detailed introduction to voltage variation can be found in [\[Wir13\]](#page-66-5) and an introduction to aging, including sources, models and optimization techniques, is provided by [\[Tan19\]](#page-66-6). Additionally, [\[Ye21\]](#page-66-7) presents methods to deal with aging effects. The [PVTA](#page-364-8) simulation model by Scarpato, which is primarily used in this thesis to simulate the developed [FPGA](#page-362-3) architecture, is described in detail in his dissertation [\[Alt17\]](#page-66-8).

- <span id="page-66-0"></span>**[Gen17]** GENG, Hwaiyu: Semiconductor Manufacturing Handbook, Second Edition. 2nd edition. New York, N.Y.: McGraw-Hill Education and McGraw Hill, 2017.
- <span id="page-66-1"></span>**[Cha18]** CHAMPAC, Victor and GARCIA GERVACIO, Jose: Timing Performance of Nanometer Digital Circuits Under Process Variations. Vol. 39. Cham: Springer International Publishing, 2018. DOI: [10.1007/978-3-319-75465-9.](https://doi.org/10.1007/978-3-319-75465-9)
- <span id="page-66-2"></span>**[Chi07]** CHIANG, Charles C. and KAWA, Jamil: Design for manufacturability and yield for nano-scale CMOS. Series on integrated circuits and systems. Dordrecht: Springer, 2007.
- <span id="page-66-3"></span>**[Die12]** DIETRICH, Manfred and HAASE, Joachim: Process Variations and Probabilistic Integrated Circuit Design. New York, NY: Springer New York, 2012. DOI: [10.1007/978-1-4419-6621-6.](https://doi.org/10.1007/978-1-4419-6621-6)
- <span id="page-66-4"></span>**[Huf20]** HUFF, Michael: Process Variations in Microsystems Manufacturing. 1st ed. 2020. Microsystems and Nanosystems. Cham: Springer International Publishing and Imprint Springer, 2020. DOI: [10.1007/978-3-030-40560-1.](https://doi.org/10.1007/978-3-030-40560-1)
- <span id="page-66-5"></span>**[Wir13]** WIRNSHOFER, Martin: Variation-aware adaptive voltage scaling for digital CMOS circuits. Vol. 41. Springer series in advanced microelectronics. Dordrecht: Springer, 2013. DOI: [10.1007/978-94-007-6196-4.](https://doi.org/10.1007/978-94-007-6196-4)
- <span id="page-66-6"></span>**[Tan19]** TAN, Sheldon; TAHOORI, Mehdi Baradaran; KIM, Taeyoung; WANG, Shengcheng; SUN, Zeyu and KIAMEHR, Saman: Long-Term Reliability of Nanometer VLSI Systems: Modeling, Analysis and Optimization. Springer eBook Collection. Cham: Springer, 2019. DOI: [10.1007/978-3-030-26172-6.](https://doi.org/10.1007/978-3-030-26172-6)
- <span id="page-66-7"></span>**[Ye21]** YE, Wei; ALAWIEH, Mohamed Baker; HSU, Che-Lun; LIN, Yibo and PAN, David Z.: "Dealing with Aging and Yield in Scaled Technologies". In: *Dependable Embedded Systems*. Ed. by HENKEL, Jörg and DUTT, Nikil. Embedded Systems. Cham: Springer International Publishing, 2021, pp. 409–429. DOI: [10.1007/978-3-030-52017-5\_17.](https://doi.org/10.1007/978-3-030-52017-5_17)
- <span id="page-66-8"></span>**[Alt17]** ALTIERI SCARPATO, Mauricio: "Digital circuit performance estimation under PVT and aging effects". Thesis. Université Grenoble Alpes, 2017. URL: [https://theses.hal.science/tel-01773745.](https://theses.hal.science/tel-01773745)

# **2.5 FPGA Logic Generators**

The [CMOS](#page-361-2) circuits introduced previously have their functionality determined and fixed at manufacturing time of the circuit. As manufacturing is a long and expensive process, this approach is not suitable for prototyping or if only a few devices are needed. A solution to this issue lies in field programmable devices, which have their ultimate functionality decided after manufacturing, "in the field". Over the last decades, various approaches have emerged to realize the main component of such systems, the programmable logic cell.

**Logic Matrices** The first commercially available devices that provided programmable cells were [Programmable Read Only Memories \(PROMs\)](#page-363-13) [\[104\]](#page-325-2). Whereas they were mainly meant to be used as memory, connecting input signals to address lines and output signals to the data output of the memory realizes programmable logic. This approach works like an early implementation of a [LUT,](#page-363-4) but is rather inefficient due to the need for a complete address decoder.
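The idea of using a memory as programmable logic can be sketched in a few lines: the input signals form the address, and the stored bit at that address is the function value. The helper names `program_prom` and `read_prom` are hypothetical and only serve this illustration.

```python
def program_prom(func, n_inputs):
    """Return the memory contents: one stored bit per input combination."""
    contents = []
    for address in range(2 ** n_inputs):
        bits = [(address >> i) & 1 for i in range(n_inputs)]
        contents.append(func(*bits) & 1)
    return contents


def read_prom(contents, inputs):
    """Apply the input signals to the address lines and read the data output."""
    address = sum(bit << i for i, bit in enumerate(inputs))
    return contents[address]


if __name__ == "__main__":
    xor3 = program_prom(lambda a, b, c: a ^ b ^ c, 3)
    print("contents:", xor3)
    print("f(1, 0, 1) =", read_prom(xor3, (1, 0, 1)))
```

The memory needs $2^n$ entries for $n$ inputs regardless of the function, which is the same exponential area scaling that applies to [LUTs.](#page-363-4)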

<span id="page-67-0"></span>

**Figure 2.17:** [Programmable Array Logic \(PAL\)](#page-363-14) structure according to [\[104\]](#page-325-2). Like [Programmable Logic Arrays \(PLAs\),](#page-363-15) [PALs](#page-363-14) realize two level logic, but unlike those, they do not support configuration of the *OR* stage. As depicted in the left part of the picture, all configuration happens in the *AND* plane.

The next step in the evolution of programmable logic cells were [PLAs.](#page-363-15) [PLAs](#page-363-15) realize two-level logic, where the first level is implemented as a wired *AND* and the second level as a wired *OR*. Due to this structure, these circuits can directly realize the [Sum-of-Products \(SOP\)](#page-364-11) form of an equation. [PLAs](#page-363-15) were mainly used to implement combinational logic and to replace multiple discrete gates on a [PCB.](#page-363-9) [PLAs](#page-363-15) had comparatively wide inputs in both configuration stages, causing the main drawbacks of early [PLAs:](#page-363-15) Comparatively large delay caused by two levels of configurable cells and complex manufacturing [\[104\]](#page-325-2). To solve these issues, [PALs](#page-363-14) as shown in [figure 2.17](#page-67-0) were introduced. Compared to [PLAs,](#page-363-15) these systems kept the configurable *AND* plane, but fixed the functionality of the *OR* stage. In addition, [PALs](#page-363-14) also introduced [FFs](#page-362-4) on the outputs of the combinational logic cell. As those are also available as inputs for the *AND* matrix, these programmable logic circuits enable implementation of sequential circuits.
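The difference between the two structures can be sketched behaviorally. In the sketch below, a product term is a list of literals, a PLA additionally has a programmable list of product terms per output, and a PAL always assigns a fixed block of product terms to each output. The data layout and function names are assumptions for illustration; registers and feedback paths are omitted.

```python
def evaluate_and_plane(and_plane, inputs):
    """Each product term is a list of (input_index, required_value) literals."""
    return [all(inputs[i] == value for i, value in term) for term in and_plane]


def evaluate_pla(and_plane, or_plane, inputs):
    """PLA: both planes are programmable; or_plane lists product indices per output."""
    products = evaluate_and_plane(and_plane, inputs)
    return [int(any(products[p] for p in terms)) for terms in or_plane]


def evaluate_pal(and_plane, terms_per_output, inputs):
    """PAL: the OR stage is fixed, output k always ORs its own block of products."""
    products = evaluate_and_plane(and_plane, inputs)
    return [int(any(products[start:start + terms_per_output]))
            for start in range(0, len(products), terms_per_output)]


# Full adder in sum-of-products form, inputs = (a, b, carry_in).
SUM_TERMS = [[(0, 1), (1, 0), (2, 0)], [(0, 0), (1, 1), (2, 0)],
             [(0, 0), (1, 0), (2, 1)], [(0, 1), (1, 1), (2, 1)]]
CARRY_TERMS = [[(0, 1), (1, 1)], [(0, 1), (2, 1)], [(1, 1), (2, 1)]]
NEVER = [(0, 1), (0, 0)]  # a AND NOT a: disables an unused PAL product term

PLA_AND = SUM_TERMS + CARRY_TERMS
PLA_OR = [[0, 1, 2, 3], [4, 5, 6]]
PAL_AND = SUM_TERMS + CARRY_TERMS + [NEVER]

if __name__ == "__main__":
    for inputs in ((0, 0, 0), (1, 0, 1), (1, 1, 1)):
        print(inputs, evaluate_pla(PLA_AND, PLA_OR, inputs),
              evaluate_pal(PAL_AND, 4, inputs))
```

In the PAL variant, the carry output only needs three of its four product terms, so the fourth one has to be programmed to a constant zero. This illustrates the price of the fixed *OR* stage: unused product terms cannot be reassigned to another output.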

Later on, combinations of multiple [PLAs](#page-363-15) or [PALs](#page-363-14) were put in a single chip, connected using some kind of interconnect [\[104\]](#page-325-2). These chips, called [Complex Programmable Logic Devices \(CPLDs\),](#page-361-10) enable realization of more complex sequential logic such as state machines. The basic elements in all of these logic circuits are programmable matrices, usually a programmable *AND* matrix. They have to support a large number of inputs and are therefore usually implemented in specific ways, which are more efficient than a naive [CMOS](#page-361-2) implementation. As programmable matrix structures will not be employed in this thesis, the transistor level implementation details will not be discussed further.

**AND-Inverter Logic** Another possible realization of configurable logic was introduced in 2012 by Parandeh-Afshar and Benbihi [\[105\]](#page-325-3): So-called [AND-Inverter Cones \(AICs\)](#page-361-11) are configurable logic cells consisting of *AND* and *INVERTER* gates. Unlike [PALs](#page-363-14) and [PLAs](#page-363-15), [AICs](#page-361-11) do not realize the configurable logic through specific assignment of inputs to gates. Instead, connections are hardwired and reconfiguration is achieved by changing the logic function in the logic element. [Figure 2.18](#page-69-0) shows a simplified 3-level [AIC](#page-361-11) implementation. Each node in the shown graph is an *AND* with an optional *INVERTER* and can therefore either realize an *AND* or a *NAND* function. Each cell provides multiple outputs, so that several signals can be derived from one block. This feature may also be used to implement multiple functions in a single block, enabling a fracturable logic system.
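A minimal behavioral sketch of such a cone is given below. It assumes a full binary tree in which every node is an *AND* whose output can optionally be inverted; the names `aic_eval` and `INVERT` are hypothetical, and details of the published cell such as its exact output taps are not modeled.

```python
def aic_eval(inputs, invert):
    """Evaluate a full binary cone of AND nodes with optional output inversion.

    inputs: list of 2**k booleans (primary inputs of a k-level cone).
    invert: one list of configuration bits per level, from the input side to
            the root; True turns the corresponding AND node into a NAND.
    Returns the outputs of every level, so intermediate signals stay visible.
    """
    level_outputs = []
    signals = list(inputs)
    for bits in invert:
        if len(bits) * 2 != len(signals):
            raise ValueError("each level must halve the number of signals")
        # XOR with the configuration bit models the optional inverter.
        signals = [(signals[2 * i] and signals[2 * i + 1]) != bits[i]
                   for i in range(len(bits))]
        level_outputs.append(signals)
    return level_outputs

# Assumed example configuration of a 3-level cone with 8 inputs:
# NAND of NANDs yields (a AND b) OR (c AND d); the root combines both halves.
INVERT = [[True, True, True, True], [True, True], [False]]

outs = aic_eval([True, True, False, False, False, True, True, True], INVERT)
print(outs[-1])   # root output
print(outs[1])    # intermediate outputs of the second level
```

In this simplified model, a cone with $2^k$ inputs contains $2^k - 1$ nodes and $k$ levels, which matches the linear area and logarithmic delay scaling attributed to [AICs](#page-361-11).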

[AICs](#page-361-11) are inspired by modern synthesis tools, which represent circuits as graphs of *AND* and *INVERTER* nodes. Their area scales linearly with the number of inputs, which is a benefit compared to [LUTs](#page-363-4), whose area scales exponentially. Furthermore, the delay scales logarithmically, whereas it scales linearly for [LUTs](#page-363-4). Because of those advantages, Parandeh-Afshar et al. introduced the logic cell in a hybrid [FPGA](#page-362-3), combining both [LUTs](#page-363-4) and [AICs](#page-361-11). They achieved

<span id="page-69-0"></span>

**Figure 2.18:** [AIC-](#page-361-11)3 structure adapted from [\[105\]](#page-325-3). Each node represents a logic cell which can either realize an *AND* or a *NAND* function. Shown here is a simplified 3-stage design, whereas the [AICs](#page-361-11) used in practice usually consist of 6 levels. Intermediate signals are also available as outputs to make the cell fracturable and enable reuse of intermediate values.

a 16% decrease in area and up to 32% decrease in delay compared to their [LUT-](#page-363-4)only reference [\[105\]](#page-325-3).

[AICs](#page-361-11) are less expressive than [LUTs](#page-363-4), as they cannot realize every possible function of their inputs. Because of that, numerous [AICs](#page-361-11) need to be used, which causes additional interconnect pressure. To avoid routing congestion in the interconnect, multiple [AICs](#page-361-11) are combined within one cluster using internal loopback connections. In 2014, Zgheib et al. implemented the proposed [AIC](#page-361-11) [FPGA](#page-362-3) in 40 nm technology [\[106\]](#page-325-4). They found that because of the large number of inputs and outputs, a comparatively large crossbar is required. This crossbar requires almost twice as much area as the [LUT](#page-363-4) system used for comparison. Furthermore, the authors show that reduced delays compared to [LUTs](#page-363-4) only were achieved for short critical paths.

In 2020, Thummler et al. proposed new, optimized mapping algorithms for [AIC](#page-361-11)-based programmable logic systems [\[107\]](#page-325-5). Their algorithm reduces area by up to 16.4%, but the general issue with large crossbars persists to date. Because of this, authors have proposed alternative solutions with small, reconfigurable cells as summarized in [\[108\]](#page-325-6). The most common replacement cell design changes functions between *NOR* and *NAND*. Compared to [LUTs](#page-363-4) or the [ULMs](#page-364-12) explained in the next section, the number of functions that can be realized by the cell is still small.

**Universal Logic Modules** Whereas [AICs](#page-361-11) realize two logic functions in logic cells, [ULMs](#page-364-12) take this idea one step further. Initial [ULM](#page-364-12) research was carried out decades before the introduction of the first [FPGAs](#page-362-3) [\[109\]](#page-325-7). An early definition of the [ULM](#page-364-12) was given in [\[110\]](#page-325-8), for example: An $ULM.m$ is a function $U(z_0..z_n)$ that realizes all possible functions $f(x_0..x_m)$ by substitution of $z_0..z_n$ with any of $\{x_0..x_m,\overline{x_0}..\overline{x_m},0,1\}$. Strictly speaking, a $ULM.m$ must therefore be able to implement all functions of $m$ input variables. Later works however also refer …

<span id="page-70-0"></span>

**Figure 2.19:** Exemplary [ULM](#page-364-12) implementations proposed in literature. **[\(a\)](#page-70-0)** [ULM.](#page-364-12)3 designed using Binary Decision Diagrams (BDDs). **[\(b\)](#page-70-0)** 8-input logic cell manually designed by Iida et al.

[Figure 2.19](#page-70-0) shows two examples which were especially designed for [FPGA](#page-362-3) applications. The adaption of [ULMs](#page-364-12) for [FPGAs](#page-362-3) happened later, starting with Thakur et al. in 1995 [\[112\]](#page-326-0). This work summarized the previous [ULM](#page-364-12) work, considering that equivalent functions can be combined and inputs can be swapped in [FPGAs](#page-362-3). It then provided an algorithm to derive sets of $ULM.m$ and computed some possible implementations, such as the Actel multiplexer-based cell. The main goal of that publication was to generate a set of suitable functions … The work of Zilic et al. uses BDDs to design the [ULMs](#page-364-12). It also provides concrete 3- and 4-input examples, one of which is reproduced in [figure 2.19a](#page-70-0). This publication also first separated configuration and general purpose inputs. Configuration inputs are then driven by [Static Random-Access Memory](#page-364-13) [\(SRAM\)](#page-364-13) or similar memory cells, reducing pressure on the global interconnect. Zilic et al. also note that the number of configuration bits for the 4-input cell was reduced to 13 bit, compared to 16 bit needed for a 4-input [LUT](#page-363-4). Some years later, new algorithms for [ULM](#page-364-12) design were proposed to design cells with more inputs [\[113\]](#page-326-1).

A set of cell implementations using even less area, the COGRE cells, has been presented in [\[111\]](#page-325-9). Strictly speaking, those cells are not [ULMs](#page-364-12), as they do not cover all possible functions. Taking advantage of that design decision, the authors state that their 8-input cell uses 75.19% less area and 68.27% fewer configuration bits than an 8-input [LUT](#page-363-4). An implementation of the 8-input COGRE cell is shown in [figure 2.19b](#page-70-0). Even though there has been continued research on [ULMs](#page-364-12) themselves, there are few publications and no commercial systems actually using them in [FPGAs](#page-362-3).
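The substitution-based [ULM](#page-364-12) definition given earlier in this section can be checked by brute force for a very small case. The sketch below assumes a plain 2:1 multiplexer as the module $U(z_0, z_1, z_2)$ and two variables $x_0, x_1$; this choice and all names (`mux`, `CANDIDATES`, `realized`) are illustrative and not taken from the cited works.

```python
from itertools import product

def mux(s, d0, d1):
    """2:1 multiplexer used as the candidate module U(z0, z1, z2)."""
    return d1 if s else d0

ASSIGNMENTS = list(product([0, 1], repeat=2))   # all values of (x0, x1)

# Signals that may be substituted for the module inputs:
# the variables, their complements and the two constants.
CANDIDATES = {
    "x0": lambda x0, x1: x0,
    "x1": lambda x0, x1: x1,
    "~x0": lambda x0, x1: 1 - x0,
    "~x1": lambda x0, x1: 1 - x1,
    "0": lambda x0, x1: 0,
    "1": lambda x0, x1: 1,
}

realized = {}
for (ns, s), (n0, d0), (n1, d1) in product(CANDIDATES.items(), repeat=3):
    table = tuple(mux(s(*x), d0(*x), d1(*x)) for x in ASSIGNMENTS)
    realized.setdefault(table, (ns, n0, n1))     # remember one substitution

print(len(realized), "of", 2 ** len(ASSIGNMENTS), "two-variable functions")
print("XOR via (select, d0, d1) =", realized[(0, 1, 1, 0)])
```

Under these assumptions, every one of the 16 truth tables of two variables is reached, so the multiplexer satisfies the definition for two variables. Larger modules can be checked in the same way, although the search space grows quickly.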



**Figure 2.20:** Microsemi Axcelerator C-Cell Logic Block [\[114\]](#page-326-2), cited via [\[115\]](#page-326-3). This block realizes a [Multiplexer \(MUX\)](#page-363-16)-based [Logic Element \(LE\)](#page-362-12), but combines the multiplexer with hard logic gates to achieve higher expressiveness.

**Multiplexer Logic** An alternative to [ULMs](#page-364-12) are [MUX](#page-363-16)-based logic cells. In contrast to [LUTs](#page-363-4), [MUX](#page-363-16)-based cells route input signals to both select and data inputs of multiplexers. [LUTs](#page-363-4) are similarly based on multiplexers. As they route input signals to the select inputs of multiplexers only, however, the data inputs in [LUTs](#page-363-4) are always constant. They are directly connected to memory, usually [SRAM](#page-364-13).

A two data input multiplexer can realize six functions when it uses variables and constants as inputs. In general, [MUX](#page-363-16)-based logic cells can therefore achieve large expressiveness with few transistors. This benefit is however diminished by the large number of inputs they require compared to [LUT](#page-363-4)-based cells. Additional inputs increase resource demand and complexity in the interconnect, making the use of small routing switches especially important. In practical realizations, [MUX](#page-363-16)-based logic cells have therefore often been combined with flash or antifuse technology [\[116\]](#page-326-0). To realize cells with more inputs, [MUXs](#page-363-0) are often combined with logic gates to yield more complex functions. [Figure 2.20](#page-71-0) shows an example of such a cell which was used in a commercial architecture released in the early 1990s, the Actel ACT3 [\[114\]](#page-326-1). In addition to the [MUXs](#page-363-0), it includes hard logic *AND* and *OR* gates to realize more functions. Another early example of [MUX](#page-363-0)-based cells are early [FPGAs](#page-362-0) by Quicklogic [\[104\]](#page-325-0). Both early commercial variants are antifuse-based, making use of small switches to reduce the wiring overhead. [MUX](#page-363-0)-based logic cells have gone out of fashion in recent [FPGA](#page-362-0) architectures, which use [LUTs](#page-363-1) almost exclusively. This is often attributed to the comparatively complex tools needed to map user applications to such logic. For [LUT](#page-363-1)-based [FPGAs](#page-362-0) in comparison, the mapping step is reduced to a less complex graph-covering problem [\[117\]](#page-326-2).

In 2016, Chin et al. presented a more modern variant of a [MUX](#page-363-0)-based logic cell. Their variant realizes a 6-input [LE](#page-362-1) using a 4-input [MUX](#page-363-0). Using both the data and the select inputs of these multiplexers, they obtain a [LE](#page-362-1) with the same number of inputs as a modern [LUT](#page-363-1)-based [LE](#page-362-1). The authors' [LE](#page-362-1) requires only 15% of the area of a comparable 6-input [LUT](#page-363-1). When used in a hybrid architecture mixed with [LUTs](#page-363-1), the hybrid [FPGA](#page-362-0) architecture still used 8% less area than a [LUT](#page-363-1)-only architecture. The logic cell can map all functions of 2 and 3 inputs and some functions of up to 6 inputs. For the logic mapping, the authors explain in detail how the Shannon expansion can be used to fit logic functions to the [LE](#page-362-1).

**Lookup Tables** [LUTs](#page-363-1) are the most common logic generator used in recent, commercial [FPGAs](#page-362-0) [\[118,](#page-326-3) p. 4 ff.]. An $N$[-LUT](#page-363-1) combines a $2^N$-to-1 [MUX](#page-363-0) and a $2^N$-bit memory, usually [SRAM](#page-364-0). [Figure 2.21b](#page-73-0) depicts the transistor level implementation of such a basic [LUT](#page-363-1). Memory outputs are connected to the data inputs of the multiplexer, whereas its select inputs are external inputs for the logic function's parameters. The figure also includes signal buffers in various parts of the [LUT](#page-363-1), similar to the buffered [MUX](#page-363-0) shown in [figure 2.21a](#page-73-0). Depending on the requirements of output signal levels and the characteristics of transistors and technology, some buffers may not be required.

To realize a function $f(x_0..x_n)$ in a [LUT](#page-363-1), its truth table needs to be obtained and stored in the memory. Inputs $x_0..x_n$ are connected to the select inputs and choose the matching output value from the truth table for the given input combination. Like in [MUX](#page-363-0) logic, the function table can again be obtained using the Shannon decomposition $f(x_0, x_1..x_n) = \overline{x_0} \cdot f_0(x_1..x_n) \vee x_0 \cdot f_1(x_1..x_n)$, where $f_0$ and $f_1$ denote $f$ with $x_0$ fixed to 0 and 1, respectively. An important difference compared to [MUX](#page-363-0) logic is that the decomposition needs to be applied repeatedly until all data inputs of the multiplexers are constant.
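The programming and read-out of a [LUT](#page-363-1) described above can be sketched behaviorally as follows. The function names `program_lut` and `read_lut`, the bit ordering ($x_0$ as most significant address bit) and the majority function used as an example are assumptions made for illustration only.

```python
from itertools import product

def program_lut(f, n):
    """Return the 2**n configuration bits of an n-input LUT realizing f.

    The bit at address k holds f for the input combination whose binary
    encoding is k, with x_0 as the most significant bit.
    """
    return [int(bool(f(*bits))) for bits in product([0, 1], repeat=n)]

def read_lut(memory, inputs):
    """Read the LUT like a tree of 2:1 multiplexers.

    Each select input keeps one half of the remaining table, which
    corresponds to one step of the Shannon decomposition.
    """
    table = memory
    for x in inputs:
        half = len(table) // 2
        table = table[half:] if x else table[:half]
    return table[0]

def majority(a, b, c):
    """Example function: 1 if at least two inputs are 1."""
    return (a + b + c) >= 2

MEMORY = program_lut(majority, 3)
print(MEMORY)
print(read_lut(MEMORY, (1, 0, 1)))
```

After all $n$ select inputs have been applied, exactly one constant memory bit remains, which mirrors the statement that the decomposition is repeated until all multiplexer data inputs are constant.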

<span id="page-73-0"></span>

**Figure 2.21:** Exemplary [MUX](#page-363-0) and [LUT](#page-363-1) implementations proposed in literature. Implementations shown are models used in COFFEE, which automatically determines transistor widths for [FPGA](#page-362-0) elements [\[119\]](#page-326-4). **[\(a\)](#page-73-0)** A two-level multiplexer in pass-transistor logic configured by [SRAM](#page-364-0) cells. The output contains a level restore buffer to compensate for the voltage drop across pass transistors. **[\(b\)](#page-73-0)** An implementation of a [LUT](#page-363-1) using pass-transistors. Internal level restore buffers are needed after three pass transistors to ensure signal integrity. Buffered multiplexer select inputs are driven externally, whereas buffered data inputs are driven by [SRAM](#page-364-0) cells.

[LUTs](#page-363-1) can describe all functions of $N$ inputs and have therefore been called the "computational heart" of the [FPGA](#page-362-0) [\[118,](#page-326-3) p. 4 ff.]. They allow for a higher logic density and large digital designs, causing a historical paradigm shift [\[104\]](#page-325-0): Large digital systems can now be realized in small quantities, avoiding the initial setup cost involved in [ASIC](#page-361-0) manufacturing. The best size of a [LUT](#page-363-1), i.e. the number of select inputs, depends on the mapped user applications and has been extensively researched. Literature suggests 4-input [LUTs](#page-363-1), but commercial architectures commonly use 6-input [LUTs](#page-363-1) as well. Larger [LUTs](#page-363-1) can realize more complex functions, but cause higher propagation delay [\[118,](#page-326-3) p. 4]. Whereas the delay of [LUTs](#page-363-1) grows linearly with the number of inputs, their area grows exponentially [\[116\]](#page-326-0). Because of this, useful sizes of [LUTs](#page-363-1) are limited. An alternative way to realize large functions is to divide the function into multiple parts, which are then realized in multiple [LUTs](#page-363-1) [\[118\]](#page-326-3).

Recent commercial [LUTs](#page-363-1) often include enhanced functionality, such as being useable as memory or shift registers [\[120,](#page-326-5) p. 37]. These extensions are usually realized on transistor level, allowing for an area and delay efficient implementation. Other extras which are often implemented on cell level to complement the [LUT](#page-363-1) include dedicated signals for carry chains and external adders or *XOR* gates [\[121,](#page-326-6) p. 25].

### **Further Reading**

Information about [CPLDs](#page-361-1), [PLAs](#page-363-2) and [PALs](#page-363-3) can be found in an early work of Brown et al. [\[Bro96\]](#page-74-0). This work also gives a quick summary of commercial architectures available in the late 1990s. In a more recent survey, Kuon et al. give an overview of commercial architectures available in 2007 [\[Kuo07\]](#page-74-1). [\[Yan14\]](#page-74-2) can be used as a starting point for information on [AICs](#page-361-2). Information about [LUTs](#page-363-1) can be found in most [FPGA](#page-362-0) textbooks, as these are the most commonly used logic generators. Some examples include [\[Vas07\]](#page-74-3), which summarizes various architecture studies, [\[DeH07\]](#page-74-4), which includes a discussion of granularity, and [\[Ama18\]](#page-74-5), which discusses performance trade-offs and [LUT](#page-363-1) size. The 2020 textbook of Rodríguez-Andina provides an overview of commercial [FPGA](#page-362-0) architectures and the logic generators used in them [\[Rod20\]](#page-74-6). For the most recent research on logic generators, [\[Rai21\]](#page-74-7) provides a short overview. Apart from discussing [AICs](#page-361-2) again, this work also introduces logic generators which make efficient use of various novel transistor technologies. Those generators will be discussed in detail in the next section.

<span id="page-74-7"></span><span id="page-74-6"></span><span id="page-74-5"></span><span id="page-74-4"></span><span id="page-74-3"></span><span id="page-74-2"></span><span id="page-74-1"></span><span id="page-74-0"></span>

## **2.6 Ambipolar Reconfigurable Cells**

The logic generators discussed so far have been realized mostly in traditional [CMOS](#page-361-3) silicon technologies. The development of those logic generators has therefore been limited to designs which can be efficiently realized in that technology. Furthermore, those works focus on the design of efficient cells for [FPGAs](#page-362-0) leveraging flexibility given by the technology. In recent years, a new type of logic generator has been proposed in literature: Logic cells which make use of – and are optimized for – ambipolar transistor technologies. As these technologies enable efficient implementation of reconfigurable [FETs](#page-362-2), their adoption in larger, reconfigurable cells seems natural. The realization of such cells however has to obey different technology constraints, leading to new challenges. Most notably, the cell implementations with low area usually also provide low flexibility or expressiveness [\[122\]](#page-326-7). Ambipolar reconfigurable cells have been realized in both dynamic and static logic styles. In dynamic logic, the cell needs to be clocked to transition between various internal states over time. Usually, such additional states are used to pre-charge certain capacitances in the circuit. Apart from the need for a clocking system for the cells, another drawback of dynamic logic is reduced energy efficiency [\[123\]](#page-326-8). Static cells on the other hand do not need any clocking input and can therefore replace [LUTs](#page-363-1) in common [FPGA](#page-362-0) architectures more easily.

The main differentiating characteristics of ambipolar reconfigurable cells are the number of functions they can realize "in the field" [\[18\]](#page-316-0) and the number of transistors used [\[124\]](#page-326-9). Mikolajick et al. first classify cells according to their configuration inputs as "hard-wired" or "soft-wired". "Hard-wired" logic cells connect their configuration inputs to voltages statically, i.e. during manufacturing. Such devices do not offer any customization in the field [\[18\]](#page-316-0). They can however allow for more cost-efficient manufacturing of [ASICs](#page-361-0), as the functionality of the circuit can be changed through modification of only metal layers. The semiconductor layers can therefore be manufactured using the same masks for different applications. Furthermore, it is also possible to develop standard cells with completely fixed functionality for these ambipolar technologies [\[125\]](#page-327-0).

More interesting for this thesis is the second group of "soft-wired" logic cells. These cells enable changing of functionality "in the field" and can therefore be used to realize [FPGA](#page-362-0) architectures. The most basic of those cells can realize only two different functions. They are usually implemented using complementary pull-up and pull-down networks and switch between complementary

functions [\[18\]](#page-316-0). In this category, NAND/NOR, AND/OR, XOR/XNOR and other cells have been proposed [\[26\]](#page-317-0). Other works in literature have also combined multiple of these cells using multiplexers, forming more expressive cells, e.g. a 6-function static-logic cell [\[126\]](#page-327-1).

<span id="page-76-0"></span>

**Figure 2.22:** Reconfigurable cells based on ambipolar devices as proposed in literature. Circuits were originally realized using [Dual Gate CNTFET \(DGCNTFET\)](#page-362-3) devices [\[123\]](#page-326-8), but the structure has been adapted to other ambipolar technologies. **[\(a\)](#page-76-0)** A 7-transistor cell realizing 14 functions of two inputs. This realization uses a dynamic logic approach. **[\(b\)](#page-76-0)** A 10-transistor cell realizing all 16 functions of two inputs. The circuit has been realized as static logic.

More advanced cells can be grouped into almost-[ULMs](#page-364-1), [ULMs](#page-364-1) and novel [LUT](#page-363-1) classes. An example of an almost-[ULM](#page-364-1) is shown in [figure 2.22a](#page-76-0). The original version of this circuit was presented in [\[127\]](#page-327-2) and described an 8-function cell. It was derived from [\[128\]](#page-327-3), which is the first published "soft-wired" ambipolar reconfigurable cell. The circuit is implemented using [DGCNTFET](#page-362-3) devices in dynamic logic. The three programming inputs *A*, *B* and *C* are programmed with positive and negative supply voltages. In [\[123\]](#page-326-8), the concept was extended to use three values for the programming voltages. Using an additional zero level voltage allows turning off [RFET](#page-364-2) devices, and enables realization of 14 functions in total. Jabeur et al. also proposed a similar static logic cell, shown in [figure 2.22b](#page-76-0). This cell is an example of a full [ULM](#page-364-1), as it can realize all 16 possible functions of 2 input signals. Whereas a static cell is easier to use, the main drawback of this specific cell is the large number of configuration inputs: A total of 9 configuration inputs with three possible values each requires more configuration storage than the 4 bit needed for a two-input [LUT](#page-363-1). Because of that, later works such as [\[129\]](#page-327-4) explicitly try to reduce the needed configuration storage: The cell proposed by Kato et al. reduces configuration input values back to only two voltages, but it is a dynamic logic design. Some designs of similar

cells, such as a 6-function 2-input cell or a 13-function 3-input cell, can be found in [\[122\]](#page-326-7). The 6-input static cell presented there uses only 150 transistors compared to 648 needed for a similar [LUT.](#page-363-1)
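The configuration storage argument made above for the static 16-function cell can be made concrete with a short calculation. The sketch assumes that the three-valued configuration inputs are stored in ordinary binary memory cells; how the cited works actually store them is not stated here, so both encodings below are illustrative.

```python
# Assumption: ternary configuration values are held in binary memory cells.
TERNARY_INPUTS = 9   # configuration inputs of the static cell
LEVELS = 3           # possible programming voltages per input
LUT_INPUTS = 2       # inputs of the LUT used for comparison

configurations = LEVELS ** TERNARY_INPUTS

# Lower bound: all nine inputs encoded jointly in one binary word.
bits_joint = (configurations - 1).bit_length()

# Simpler scheme: every ternary input gets its own binary field.
bits_separate = TERNARY_INPUTS * (LEVELS - 1).bit_length()

lut_bits = 2 ** LUT_INPUTS

print("distinct configurations:", configurations)
print("bits, jointly encoded:  ", bits_joint)
print("bits, one field each:   ", bits_separate)
print("bits for the 2-LUT:     ", lut_bits)
```

Under either encoding, the cell needs several times the 4 bit of a two-input [LUT](#page-363-1), although both realize the same 16 functions.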

The fourth class, [LUT-](#page-363-1)like cells, can also realize all possible functions for a certain input configuration. Unlike [ULMs](#page-364-1), these devices however implement a memory-based logic generator, just like traditional [LUTs](#page-363-1). Examples of such cells include the ones proposed by Kumar et al. [\[130\]](#page-327-5) and improved by Guo et al. [\[131\]](#page-327-6). Both works use memristors to implement [LUTs](#page-363-1) in less area than is needed for a silicon [LUT](#page-363-1) realization. They mainly achieve this through combination of the data storage and the [MUX](#page-363-0) or decoder network. Furthermore, they replace the [SRAM](#page-364-0) for configuration storage with memristor elements. Such logic cells have been proposed as two, four and six input variants. Both Kumar's and Guo's implementations are similar, with the main difference being a reduced number of transistors in Guo's variant.

Another way to classify ambipolar logic generators is according to the way the logic elements have been designed. Here, Cheng et al. distinguish four classes [\[122\]](#page-326-7): [CMOS](#page-361-3)-like structures are based on complementary networks and include designs like the simple 2-function cells of [\[26\]](#page-317-0). Stack-based cell architectures stack multiple rows, where each row consists of two transistors. Inverted output architectures are those which feature an inverter at the output, such as the designs of [figure 2.22](#page-76-0). [BDD](#page-361-5)-based cells are designed using [BDDs](#page-361-5), enabling the designer to specify the functions to be realized [\[132\]](#page-327-7). Yet another way to classify those cells is chosen by Rai et al. and groups devices according to technology [\[108\]](#page-325-1): The realizations by Jabeur et al. [\[123\]](#page-326-8) and Cheng et al. [\[124\]](#page-326-9) are [DGCNTFET](#page-362-3)-based, whereas Gaillardon's works are based on silicon nanowires [\[125\]](#page-327-0). Rai also presents spintronic-based cells as well as memristor-based [LUT](#page-363-1) realizations. As some of those ambipolar reconfigurable cells provide only limited expressiveness, some works have investigated efficient interconnect concepts to combine multiple of those cells. Yakymets et al. propose an efficient interconnect matrix [\[133\]](#page-327-8), whereas Cheng et al. focus on the implementation of efficient cell clusters [\[124\]](#page-326-9).

### **Further Reading**

As there are no textbooks available on reconfigurable ambipolar logic cells, information has to be obtained largely from original publications. Some review papers are available and summarized in the following: [\[Che13\]](#page-78-0) provides an overview of cell designs published until 2013. It analyzes the number of transistors used as well as the number of functions realized. In a later publication, Cheng et al. categorize various 2-input cells and analyze the realized functions in detail. They also discuss efficient combination of multiple cells in matrix structures [\[Che17\]](#page-78-1). Mikolajick et al. provide an overview of [RFET](#page-364-2) device technology in 2017, including a short section on applications [\[Mik17\]](#page-78-2). They describe various published reconfigurable cells and cover the topic of [Multiple Independent Gate FETs \(MIGFETs\)](#page-363-4), which allow for further optimizations in cell design. The most recent review on reconfigurable ambipolar cells was published by Rai et al. in 2021 [\[Rai21\]](#page-78-3). It summarizes both traditional [FPGA](#page-362-0) logic generators and [RFET](#page-364-2)-based ones, as well as memristor and spintronic cells. Rai et al. also evaluate these cells in terms of area, delay and power for certain [FPGA](#page-362-0) benchmarks.

- <span id="page-78-0"></span>**[Che13]** CHENG, Kevin; LE BEUX, Sebastien and O'CONNOR, Ian: "Am/IDG-FET based reconfigurable cells versus LUTs: Characteristics description and analysis". In: *2013 25th International Conference on Microelectronics (ICM)*. IEEE, 2013, pp. 1–4. DOI: [10.1109/ICM.2013.6734987.](https://doi.org/10.1109/ICM.2013.6734987)
- <span id="page-78-1"></span>**[Che17]** CHENG, Kevin; LE BEUX, Sebastien and O'CONNOR, Ian: "Hybrid Topologies for Reconfigurable Matrices Based on Nano-Grain Cells". In: *2017 IEEE International Conference on Rebooting Computing (ICRC)*. IEEE, 2017, pp. 1–8. DOI: [10.1109/ICRC.2017.8123639.](https://doi.org/10.1109/ICRC.2017.8123639)
- <span id="page-78-2"></span>**[Mik17]** MIKOLAJICK, T.; HEINZIG, A.; TROMMER, J.; BALDAUF, T. and WEBER, W. M.: "The RFET—a reconfigurable nanowire transistor and its application to novel electronic circuits and systems". In: *Semiconductor Science and Technology* 32.4 (2017), p. 043001. DOI: [10.1088/1361-6641/aa5581.](https://doi.org/10.1088/1361-6641/aa5581)
- <span id="page-78-3"></span>**[Rai21]** RAI, Shubham; NATH, Pallab; RUPANI, Ansh; VISHVAKARMA, Santosh Kumar and KUMAR, Akash: "A Survey of FPGA Logic Cell Designs in the Light of Emerging Technologies". In: *IEEE Access* 9 (2021), pp. 91564–91574. DOI: [10.1109/ACCESS.2021.3092167.](https://doi.org/10.1109/ACCESS.2021.3092167)

## **2.7 FPGA System Architecture**

Previous sections have introduced reconfigurable cells that can be used to realize small combinational functions. In order to design larger digital circuits, these basic cells need to be integrated in a larger, reconfigurable system. This thesis will focus on one specific type of reconfigurable system, [Field Programmable Gate Arrays \(FPGAs\)](#page-362-0). [FPGAs](#page-362-0) are "field programmable", which means that they can be reconfigured "in the field", i.e. without manufacturing of customized hardware. This is in contrast to [ASICs](#page-361-0) and mask-programmable devices, where customizing the function means developing at least some custom masks and manufacturing new [ASIC](#page-361-0) devices using those. [FPGAs](#page-362-0) are fine-grain reconfigurable devices: They are optimized to provide efficient manipulation of single bits, whereas coarse-grain systems operate on whole words – often even floating point values – at once [\[134\]](#page-327-9). As [FPGAs](#page-362-0) enable the implementation of customized digital [ICs](#page-362-4) without expensive manufacturing, they are widely used for prototyping and small-scale series production, where manufacturing [ASICs](#page-361-0) is not cost-efficient. Commercial devices are available from various vendors with different performance, power and cost characteristics. The following section will briefly summarize how [FPGAs](#page-362-0) are realized using the reconfigurable cells introduced previously.

<span id="page-79-0"></span>

**Figure 2.23:** Combination of a D[-FF](#page-362-5) and the reconfigurable cell in a [LE](#page-362-1) [\[118\]](#page-326-3).

**Logic Elements** [LEs](#page-362-1) are based on the previously shown reconfigurable cells, but extend them in various ways. In commercial devices, [LEs](#page-362-1) are usually realized using [LUTs](#page-363-1), but any of the previously introduced reconfigurable cells could be used. As this cell is the base element of [FPGAs](#page-362-0), it has to feature some way to reconfigure the logic function it realizes in the field. In addition, to support arbitrary user application logic functions, the [FPGA](#page-362-0) must be able to provide a functionally complete set of Boolean operations. Although this usually means one type of [LE](#page-362-1) has to support such a complete set, it is also possible to use multiple types of [LEs](#page-362-1) in an [FPGA](#page-362-0). Alternatively, some logic operations could be integrated within the device interconnect, e.g. an inversion as part of an inverting buffer. As long as the combined set of reconfigurable elements in the [FPGA](#page-362-0) realizes a functionally complete set of operations, basic [FPGA](#page-362-0) functionality can be achieved. Larger logic functions can then simply be decomposed into operations supported by the [LEs](#page-362-1) using [EDA](#page-362-6) tools.

In addition to the reconfigurable cells, which enable implementation of combinational logic, [LEs](#page-362-1) usually feature a storage element [\[118\]](#page-326-3). In most implementations, this element is a simple D[-FF](#page-362-5) connected to the output of the reconfigurable cell, as shown in [figure 2.23](#page-79-0). It is needed to implement sequential circuits, commonly used in [Finite State Machine](#page-362-7) [\(FSM\)](#page-362-7) implementations. The [FF](#page-362-5) is usually not directly connected to the output of the [LE](#page-362-1), but via a user-configurable [MUX](#page-363-0): This design makes it possible to optionally bypass the [FF](#page-362-5) and output the combinational signal of the reconfigurable cell. It is mostly useful to realize large combinational functions using multiple [LEs](#page-362-1). Similarly, a bypass for the reconfigurable cell, which is often realized using the identity function in [LUTs](#page-363-1), makes it possible to use the [LE](#page-362-1) as a basic memory storage element.
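The interplay of reconfigurable cell, [FF](#page-362-5) and output [MUX](#page-363-0) can be sketched with a small behavioral model. The class `LogicElement` and its attribute names are hypothetical; the model assumes a [LUT](#page-363-1) as the reconfigurable cell and ignores timing, reset and clock enable signals.

```python
class LogicElement:
    """Behavioral sketch of a LUT-based LE with D-FF and output multiplexer."""

    def __init__(self, lut_bits, use_ff):
        self.lut_bits = list(lut_bits)  # LUT truth table (configuration)
        self.use_ff = use_ff            # configuration bit of the output MUX
        self.q = 0                      # state of the D-FF

    def comb(self, inputs):
        """Combinational output of the reconfigurable cell."""
        addr = 0
        for x in inputs:                # first input is the most significant bit
            addr = (addr << 1) | int(x)
        return self.lut_bits[addr]

    def output(self, inputs):
        """LE output: registered value, or combinational value if bypassed."""
        return self.q if self.use_ff else self.comb(inputs)

    def clock(self, inputs):
        """Rising clock edge: the D-FF stores the cell output."""
        self.q = self.comb(inputs)

# Registered use: a 1-input LUT configured as inverter, fed back from the FF,
# yields a toggling output.
toggle = LogicElement([1, 0], use_ff=True)
trace = []
for _ in range(4):
    trace.append(toggle.output([toggle.q]))
    toggle.clock([toggle.q])
print(trace)

# Combinational use: FF bypassed, 2-input LUT configured as XOR.
xor_le = LogicElement([0, 1, 1, 0], use_ff=False)
print(xor_le.output([1, 0]))
```

Configuring the [LUT](#page-363-1) with the identity function (`[0, 1]`) and `use_ff=True` turns the same element into a plain one-bit register, which corresponds to the bypass of the reconfigurable cell mentioned above.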

<span id="page-80-0"></span>

**Figure 2.24:** The logic cluster as used in the commercial Intel Arria 10 architecture [\[135\]](#page-328-0). The cluster consists of [MLAB](#page-363-5) and [LAB](#page-362-8), each consisting of 10 [ALMs](#page-361-6).

**Logic Clusters** Multiple [LEs](#page-362-1) are commonly combined in clusters, called [Logic Array Block \(LAB\)](#page-362-8) in Intel and [CLB](#page-361-7) in Xilinx [FPGAs](#page-362-0) [\[120,](#page-326-5) [136\]](#page-328-1). An example of such a cluster, the [LAB](#page-362-8) used in Arria 10 [FPGAs](#page-362-0), is shown in [figure 2.24](#page-80-0). A [LAB](#page-362-8) consists of 10 [Adaptive Logic Modules \(ALMs\)](#page-361-6), where "ALM" is Intel's term for [LE](#page-362-1) [\[135\]](#page-328-0). As can be seen in the figure, logic clusters commonly share a local interconnect, which is their defining characteristic: The interconnect is commonly the largest single contributor to [FPGA](#page-362-0) area and application delay, with up to 50% of both being caused by it [\[137\]](#page-328-2). Because of that, [FPGA](#page-362-0) designers introduce one level of hierarchy using a local interconnect. This way, not all signals have to be routed using the global interconnect. Xilinx uses similar clusters called [CLBs](#page-361-7). As an example, in their Ultrascale+ architecture, this cluster consists of eight 6-input [LUTs](#page-363-1), 16 [FFs](#page-362-5) and additional carry chain logic [\[138\]](#page-328-3).

<span id="page-81-0"></span>

**Figure 2.25:** Defining characteristics of an island-style [FPGA](#page-362-0) architecture [\[134\]](#page-327-9). The interconnect is routed in a 2D-grid, logic blocks are islands within the interconnect.

**Global Architecture** Logic clusters are then combined in the [FPGA](#page-362-0) system as depicted in [figure 2.25.](#page-81-0) The main feature at this abstraction level is the [FPGA'](#page-362-0)s programmable interconnect. It allows connecting the logic clusters in a way that is configured by the user when programming the [FPGA.](#page-362-0) The interconnect itself used to be an active area of research. Commonly, the [Switch](#page-364-3) [Boxes \(SBs\)](#page-364-3) have been altered to only support some connections instead of all, in an effort to reduce their area [\[120\]](#page-326-5). Similarly, [Connection Boxes \(CBs\)](#page-361-8) often do not connect logic clusters to all signals in a channel, but only to a few. For the spatial placement of logic clusters, multiple variants have been used over time: Whereas row-based systems used to be common for early [FPGAs,](#page-362-0) modern systems are largely island-style architectures. These have been shown to be efficiently implementable, as their defining characteristic, a regular 2D grid of wires, is easy to manufacture [\[137\]](#page-328-2). In addition, commercial systems tend to divide the [FPGA](#page-362-0) into regions, where some resources, such as clock signals, are constrained to individual regions. Commercial systems usually do not only consist of configurable logic clusters, but they include

<span id="page-82-0"></span>hard logic such as memory blocks, [Digital Signal Processing \(DSP\)](#page-362-9) blocks, optimized [Input / Output \(IO\)](#page-362-10) and other [Intellectual Property \(IP\)](#page-362-11) blocks [\[120\]](#page-326-5). Hard logic in this case describes logic which is directly realized on the chip, as opposed to soft logic, which is using the programmable logic resources.



**Figure 2.26:** Simple [FPGA](#page-362-0) architecture as used in the remainder of this thesis. [IO](#page-362-10) blocks are at the periphery. Central blocks are logic clusters (white), memory (diagonal lines) and compute elements (grid). [SBs](#page-364-3) are denoted as small gray blocks, [CBs](#page-361-8) as dots.

Academic [FPGA](#page-362-0) architectures, on the other hand, mostly focus on research of [LEs](#page-362-1) and interconnect. They are usually simple island-style architectures as shown in [figure 2.26,](#page-82-0) with [IO](#page-362-10) at the periphery. In real [ICs](#page-362-4) this is difficult to realize, as large pin counts demand [IO](#page-362-10) pads all over the chip area. Furthermore, hard logic for high-performance [IO](#page-362-10) needs to be physically close to the [IO](#page-362-10) pads, which may cause issues if pads are only at the periphery. Because of the research focus, such architectures commonly do not include any hard logic blocks. [Figure 2.26](#page-82-0) shows an illustration of an island-style [FPGA](#page-362-0) architecture which will be used in the rest of this thesis. It has been simplified to show interconnect as simple dots and squares, avoiding visual clutter.
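As a rough illustration of such a simplified island-style floorplan, the following Python sketch builds a grid with IO blocks on the periphery and logic clusters in the center. The function name `island_style_grid` and the tile labels `"IO"` and `"CLB"` are assumptions made for this example; memory and compute blocks as well as the interconnect are deliberately left out.

```python
def island_style_grid(width, height):
    """Return a 2D list of tile types: IO on the border, logic clusters inside."""
    grid = []
    for y in range(height):
        row = []
        for x in range(width):
            on_edge = x in (0, width - 1) or y in (0, height - 1)
            row.append("IO" if on_edge else "CLB")
        grid.append(row)
    return grid


# A 4x4 device has 12 IO tiles around a 2x2 core of logic clusters.
for row in island_style_grid(4, 4):
    print(" ".join(f"{tile:3}" for tile in row))
```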

**Programming & Storage** Whether based on [LUTs](#page-363-1) or other reconfigurable cells, [FPGAs](#page-362-0) also need reconfigurable storage. To date, three types of memory have been commonly used for that [\[134\]](#page-327-9): The most commonly used storage is [SRAM](#page-364-0) storage. It is commonly used in high-performance commercial [FPGAs](#page-362-0) and is relatively easy to integrate in manufacturing, as it places few demands on device technology. Its main drawbacks are the relatively large size of up to 6 or 7 transistors per bit and the relatively high power usage. Furthermore, the volatility of the storage necessitates additional non-volatile storage, which is used to program the [SRAM](#page-364-0) storage in the final system. One way to avoid this external storage is to directly use Flash storage in the [FPGA.](#page-362-0) This approach is not commonly used though, as it poses certain requirements on the technology used. Efficient realization of Flash cells requires thick oxide layers to prevent discharge of stored charges. Additionally, programming of the storage cells requires high voltages, needing additional circuits and making dynamic reconfiguration more difficult. As a slight variation, [Dynamic Random-Access](#page-362-12) [Memory \(DRAM\)](#page-362-12) [FPGAs](#page-362-0) have been proposed in academia: They store data similarly to Flash-based [FPGAs,](#page-362-0) but need periodic refreshes of the stored charges. Whereas this relaxes some requirements on device technology, it also makes storage volatile again. The third category of memory is anti-fuse based memory. This approach has been used in older commercial devices. Its main benefit is the non-volatility and power efficiency of the storage. Programming again needs to happen using high voltages. These are then used to break down fuses, often realized in between metal layers to avoid area overhead on the semiconductor layers [\[139\]](#page-328-4). Whereas this storage is non-volatile, programming is also permanent and does not enable any reprogramming. Furthermore, scalability to smaller device nodes is limited.

#### **Further Reading**

Various textbooks on [FPGA](#page-362-0) design cover [FPGA](#page-362-0) architectures and some good survey papers are available as well. [\[DeH07\]](#page-83-0) provides an extensive overview of all [FPGA](#page-362-0) related topics, especially compute models and programming. As it was published in 2007, it also features descriptions of older commercial architectures. For an overview of [PLA](#page-363-2) and [PAL](#page-363-3) architectures, [\[Kuo07\]](#page-83-1) can be used. In addition to those works, [\[Vas07\]](#page-84-0) provides an overview of research architectures available in 2007. For a more recent introduction to [FPGA](#page-362-0) architectures, [\[Ama18\]](#page-84-1) can be recommended. It is supplemented by [\[Rod20\]](#page-84-2), which focuses on the use of commercial [FPGAs](#page-362-0) in industry contexts, and [\[Bou21\]](#page-84-3), which provides an overview of research [FPGA](#page-362-0) architectures in 2021.

- <span id="page-83-0"></span>**[DeH07]** DEHON, André and HAUCK, Scott: Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation. 1st ed. Systems on Silicon. s.l.: Elsevier professional, 2007.
- <span id="page-83-1"></span>**[Kuo07]** KUON, Ian; TESSIER, Russell and ROSE, Jonathan: "FPGA Architecture: Survey and Challenges". In: *Foundations and Trends® in Electronic Design Automation* 2.2 (2007), pp. 135–253. DOI: [10.1561/1000000005.](https://doi.org/10.1561/1000000005)
- <span id="page-84-0"></span>**[Vas07]** VASSILIADIS, Stamatis, ed.: Fine- and coarse-grain reconfigurable computing. Dordrecht: Springer, 2007.
- <span id="page-84-1"></span>**[Ama18]** AMANO, Hideharu, ed.: Principles and structures of FPGAs. Singapore: Springer, 2018.
- <span id="page-84-2"></span>**[Rod20]** RODRÍGUEZ-ANDINA, Juan José; LA TORRE-ARNANZ, Eduardo de and VALDÉS PEÑA, María Dolores: FPGAs: Fundamentals, advanced features, and applications in industrial electronics. First issued in paperback. Boca Raton: CRC Press, 2020.
- <span id="page-84-3"></span>**[Bou21]** BOUTROS, Andrew and BETZ, Vaughn: "FPGA Architecture: Principles and Progression". In: *IEEE Circuits and Systems Magazine* 21.2 (2021), pp. 4–29. DOI: [10.1109/MCAS.2021.3071607.](https://doi.org/10.1109/MCAS.2021.3071607)

## **2.8 Synthesis and Implementation**

The following section gives a quick introduction to [EDA](#page-362-6) tools. This thesis will introduce both standard cell design for [RFETs](#page-364-2) and new approaches for application mapping to [RFET-](#page-364-2)based [FPGAs,](#page-362-0) so both [FPGA](#page-362-0) and [ASIC](#page-361-0) design flows will be explained. Due to the similarities of the design flows, and [FPGA](#page-362-0) tools being more relevant for this thesis, the [FPGA](#page-362-0) flows are used as an example and relevant differences to [ASICs](#page-361-0) will be noted inline. [Figure 2.27](#page-85-0) shows the [EDA](#page-362-6) tools involved in [FPGA](#page-362-0) application design. This figure includes host integration, as in the general case, [FPGAs](#page-362-0) may be used in combination with a host processor [\[140\]](#page-328-5). This aspect is of no particular relevance to this work though and will therefore not be elaborated further.

<span id="page-85-0"></span>

**Figure 2.27:** Design flow for [FPGA](#page-362-0) applications, including design capture using [High](#page-362-13) [Level Synthesis \(HLS\)](#page-362-13) and [Domain Specific Languages \(DSLs\),](#page-362-14) as well as integration with host code [\[140\]](#page-328-5).

**Design Capture** The first step in designing an [FPGA](#page-362-0) application is design capture [\[141\]](#page-328-6). In this step, the user describes the application in a machine-readable way. For higher productivity, [DSLs](#page-362-14) and [HLS](#page-362-13) have been introduced as ways to enable higher-level descriptions than the traditionally used [Hardware](#page-362-15) [Description Languages \(HDLs\)](#page-362-15) VHDL and Verilog. A detailed overview of such high-level approaches is given in [\[140\]](#page-328-5), but the applications used in this thesis will start on [HDL](#page-362-15) level.

**Synthesis** After an [HLS](#page-362-13) design has been compiled to [HDL](#page-362-15) or a user has developed an [HDL](#page-362-15) design, the first step is to transform it to a technology-independent netlist in [Register Transfer Level \(RTL\)](#page-364-4) form. The [RTL](#page-364-4) form consists of only registers and combinational logic, allowing direct mapping to the reconfigurable and non-reconfigurable logic cells introduced previously. The transformation step, technology-independent synthesis, replaces high-level language constructs such as processes, and emits a structured netlist. As a technology-independent description of combinational logic, netlists often include function tables with an arbitrary number of inputs. The tools commonly used in the open source community for this synthesis step are ODIN II [\[142\]](#page-328-7) and *Yosys* [\[143\]](#page-328-8). In addition, various commercial tools are available. In general, technology-independent synthesis does not differ for [ASIC](#page-361-0) and [FPGA](#page-362-0) targets and tools like *Yosys* have been used for both. Nevertheless, [FPGA](#page-362-0) vendors often include tools specific to their [FPGAs](#page-362-0) in their tool suites, such as in Xilinx *Vivado*. Similarly, tools like Cadence Genus are mostly used for [ASIC](#page-361-0) design. Often, technology-independent synthesis and technology mapping are combined in one tool.

**Technology Mapping** Technology mapping is the step of converting the technology-independent [RTL](#page-364-4) netlist to a technology-specific netlist. For [ASICs,](#page-361-0) the tools have to ensure that only logic cells available in a standard library are used. For [LUT](#page-363-1) based [FPGAs,](#page-362-0) the problem is easier, as a [LUT](#page-363-1) can realize any function. The only limitation in that case is the number of available inputs of the [LUT,](#page-363-1) reducing the problem to a covering problem which can be solved using e.g. dynamic programming [\[118\]](#page-326-3). Both standard cell mapping and [LUT](#page-363-1) mapping are handled by *ABC* in open source design flows. *ABC* converts netlists into an internal [And-Inverter Graph \(AIG\)](#page-361-9) representation, performs technology-independent logic optimization, and also maps the internal representation to cells or [LUTs](#page-363-1) [\[144\]](#page-328-9). Commercial tools again include [FPGA](#page-362-0) specific vendor tools such as *Vivado* and tools such as Cadence Genus for [ASICs.](#page-361-0)
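To make the covering idea more concrete, the following Python sketch maps a small gate network to k-input LUTs using cut enumeration, one dynamic-programming formulation of the problem. It is an illustrative simplification and not the algorithm implemented in *ABC*: the function name `map_to_luts`, the dictionary-based network format and the depth-first tie-breaking are assumptions made for this example, and k must be at least the largest gate fan-in.

```python
from itertools import product


def map_to_luts(fanins, outputs, k):
    """Depth-oriented LUT covering via cut enumeration.

    fanins:  dict node -> tuple of fanin nodes, listed in topological order;
             primary inputs map to an empty tuple.
    Returns a dict mapping each chosen LUT root to the set of its LUT inputs.
    """
    cuts, depth, best = {}, {}, {}
    for node, ins in fanins.items():
        trivial = frozenset([node])
        if not ins:                                  # primary input
            cuts[node], depth[node] = {trivial}, 0
            continue
        # Combine one cut per fanin; keep results that fit into a k-input LUT.
        merged = set()
        for combo in product(*(cuts[i] for i in ins)):
            cut = frozenset().union(*combo)
            if len(cut) <= k:
                merged.add(cut)
        # Prefer the cut whose leaves arrive earliest, then the smaller cut.
        best[node] = min(merged, key=lambda c: (max(depth[l] for l in c), len(c)))
        depth[node] = 1 + max(depth[l] for l in best[node])
        cuts[node] = merged | {trivial}
    # Walk back from the outputs and instantiate only the LUTs that are needed.
    luts, todo = {}, list(outputs)
    while todo:
        node = todo.pop()
        if node in luts or not fanins[node]:
            continue
        luts[node] = set(best[node])
        todo.extend(best[node])
    return luts


network = {
    "a": (), "b": (), "c": (), "d": (),
    "g1": ("a", "b"), "g2": ("c", "d"), "g3": ("g1", "g2"),
}
print(map_to_luts(network, ["g3"], k=4))   # three gates collapse into one 4-LUT
print(map_to_luts(network, ["g3"], k=2))   # with 2-LUTs every gate needs its own LUT
```

The only architecture-dependent parameter in this sketch is `k`, which mirrors the observation that the LUT input count is the sole constraint of the covering problem.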

[ASIC](#page-361-0) tools can in general be used with different technology nodes and fabs. For a tool to work with their technology, technology vendors provide [PDKs](#page-363-6) to be used with those tools. For the technology mapping, [PDKs](#page-363-6) provide Liberty or *.lib* files, describing which standard cells are available and their timing information. The format developed by Synopsys is text-based and can therefore easily be generated manually [\[145\]](#page-328-10). Apart from information about the cells themselves, such as names and pins, the file includes all possible timing arcs for the cells. Furthermore, it includes a wire-load model to be used to derive an approximation of the capacitive load caused by wires connecting cells. This load information is then used to look up the timing arcs, which are usually parametrized on this load. Traditional standard cell libraries may provide multiple cells with the same function for different wire loads. The synthesis then performs [STA](#page-364-5) as explained earlier and chooses a cell which minimizes the path delay or area, depending on optimization goal and criticality of the path. Whereas [STA](#page-364-5) for [ASICs](#page-361-0) works as previously explained, [STA](#page-364-5) for FPGA applications is simplified: For [LUT](#page-363-1) based [FPGAs,](#page-362-0) the delay introduced by logic elements is always the same. However, in [FPGAs](#page-362-0) the interconnect delay is more important: Unlike in [ASICs,](#page-361-0) it is not only caused by parasitic wire loads. It also includes the delays caused by interconnect switches. For both targets, synthesis will perform [STA](#page-364-5) even before placement to obtain estimates of expected delays and guide synthesis. The longest, critical path will limit the maximum achievable frequency, which has to match or surpass the target frequency set by the application designer. In [ASICs,](#page-361-0) [STA](#page-364-5) is explicitly performed on different corners because of the delay variations due to [PVTA.](#page-364-6) Usually, results of the worst corner have to be used, pessimizing the timing results. Even though [FPGA](#page-362-0) tools often do not expose various corners for [STA,](#page-364-5) they are also affected by [PVTA](#page-364-6) and internally assume worst-case delays of cells.
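The relation between path delays and the achievable frequency can be sketched as a longest-path computation on the netlist graph. The example below is a minimal illustration, assuming a combinational netlist given in topological order, a constant delay per LUT and per-connection interconnect delays; the function names and all delay values are made up for the example.

```python
def arrival_times(fanins, cell_delay):
    """Propagate arrival times through a topologically ordered netlist.

    fanins: dict node -> list of (predecessor, interconnect_delay_ns).
    """
    arrival = {}
    for node, preds in fanins.items():
        latest_input = max((arrival[p] + wire for p, wire in preds), default=0.0)
        arrival[node] = latest_input + cell_delay[node]
    return arrival


def max_frequency_mhz(fanins, cell_delay):
    # The critical path is the latest arrival time anywhere in the netlist.
    critical_ns = max(arrival_times(fanins, cell_delay).values())
    return 1000.0 / critical_ns


# Two LUTs with identical cell delay; the interconnect delays differ per connection.
netlist = {
    "in1": [],
    "lut_a": [("in1", 0.25)],
    "lut_b": [("in1", 0.25), ("lut_a", 0.75)],
}
delays = {"in1": 0.0, "lut_a": 0.5, "lut_b": 0.5}

print(arrival_times(netlist, delays)["lut_b"])   # 2.0 ns on the critical path
print(max_frequency_mhz(netlist, delays))        # 500.0 MHz
```

In this toy netlist, the interconnect contributes as much to the critical path as the two LUTs combined, which illustrates why interconnect delay dominates in FPGA timing analysis.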

**Packing** Packing or clustering is a step performed before the placement step, reducing that step's complexity [\[118\]](#page-326-3). It is most commonly used to cluster primitives which need to be placed in related locations on the final [FPGA.](#page-362-0) An example of this is [Versatile Place and Route \(VPR\)'](#page-364-7)s VPACK algorithm: It is primarily used to cluster [LUTs](#page-363-1) and [FFs,](#page-362-5) which are usually realized in one [LE](#page-362-1) in [FPGAs.](#page-362-0) Furthermore, it also allows packing multiple [LEs](#page-362-1) within a logic cluster [\[146\]](#page-328-11). Apart from handling some parts of the placement process, packing also handles some placement legality constraints, simplifying the implementation of placement algorithms [\[118\]](#page-326-3).
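The first packing stage, pairing a LUT with the flip-flop it drives, can be sketched as follows. This is a simplified illustration and not VPACK itself: it assumes a single-output logic element, so a LUT is only merged with a flip-flop if that flip-flop is its sole sink. The function name and the data layout are hypothetical.

```python
def pack_logic_elements(luts, ff_driver, fanout_count):
    """Pair each flip-flop with its driving LUT when the LUT has no other sink.

    luts:         list of LUT names
    ff_driver:    dict flip-flop name -> name of the signal driving its D input
    fanout_count: dict LUT name -> number of sinks of the LUT output
    Returns a list of (lut, ff) tuples; None marks an unused position.
    """
    elements, packed_luts = [], set()
    for ff, driver in ff_driver.items():
        if driver in luts and fanout_count[driver] == 1 and driver not in packed_luts:
            elements.append((driver, ff))     # LUT and FF share one logic element
            packed_luts.add(driver)
        else:
            elements.append((None, ff))       # FF alone, LUT position stays unused
    # Remaining LUTs occupy a logic element without using its flip-flop.
    elements += [(lut, None) for lut in luts if lut not in packed_luts]
    return elements


packed = pack_logic_elements(
    luts=["l0", "l1"],
    ff_driver={"f0": "l0", "f1": "in"},
    fanout_count={"l0": 1, "l1": 2},
)
print(packed)   # [('l0', 'f0'), (None, 'f1'), ('l1', None)]
```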

**Placement** Placement can be realized using structured or unstructured approaches. Structured approaches use information about the design hierarchy, whereas unstructured approaches view the circuit as a large unstructured network of logic blocks and [FFs.](#page-362-5) Structured approaches include datapath-oriented placement as well as user-guided placement variants. The most commonly used placement algorithms however are unstructured [\[118\]](#page-326-3). For [ASICs,](#page-361-0) placement usually means placing rectangular elements on a 2D grid or placing standard cells within predefined rows. For this, information about the physical dimensions of a cell is needed, which is provided as a .lef file. Placement for [ASICs](#page-361-0) can be separated into global and detailed placement. Global placement may then generate illegal results, such as small overlaps between cells, which need to be sorted out in detailed placement [\[46\]](#page-319-0). Apart from stochastic placement algorithms, [ASICs](#page-361-0) sometimes use analytic approaches: Here, a global function describing the total wire length is used to analytically find a minimum [\[118\]](#page-326-3). [FPGAs](#page-362-0) usually use stochastic placement algorithms, where the most commonly used ones are based on simulated annealing [\[118\]](#page-326-3). For example, [VPR'](#page-364-7)s placer operates in three phases [\[146\]](#page-328-11): In the first, it produces a random placement. In the second, it performs random pairwise swaps, and in the third, it reevaluates costs. Using the standard deviation of these initial placements, an initial temperature for the annealing is derived. As a last step, the annealing itself is performed with a specific temperature schedule. For an open source placement tool targeting commercial [FPGAs,](#page-362-0) refer to the nextpnr tools [\[143\]](#page-328-8).
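A compact Python sketch of a simulated-annealing placer in this style is given below. It is not VPR's implementation: the half-perimeter wire length cost, the factor of 20 applied to the standard deviation, the geometric cooling factor of 0.9 and the number of moves per temperature are assumptions chosen for illustration, and all function names are hypothetical.

```python
import math
import random


def hpwl(nets, pos):
    """Half-perimeter wire length summed over all nets."""
    total = 0
    for net in nets:
        xs = [pos[b][0] for b in net]
        ys = [pos[b][1] for b in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def anneal_placement(blocks, nets, width, height, seed=0):
    rng = random.Random(seed)
    sites = [(x, y) for x in range(width) for y in range(height)]
    rng.shuffle(sites)
    pos = dict(zip(blocks, sites))                 # random initial placement

    # Probe random pairwise swaps and derive the start temperature
    # from the standard deviation of the observed costs.
    samples = []
    for _ in range(len(blocks)):
        a, b = rng.sample(blocks, 2)
        pos[a], pos[b] = pos[b], pos[a]
        samples.append(hpwl(nets, pos))
    mean = sum(samples) / len(samples)
    std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
    temperature = 20 * std if std > 0 else 1.0

    cost = hpwl(nets, pos)
    best_pos, best_cost = dict(pos), cost
    while temperature > 0.01:
        for _ in range(10 * len(blocks)):
            a, b = rng.sample(blocks, 2)
            pos[a], pos[b] = pos[b], pos[a]
            new_cost = hpwl(nets, pos)
            delta = new_cost - cost
            # Always accept improvements; accept worse moves with a
            # probability that shrinks as the temperature drops.
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                cost = new_cost
                if cost < best_cost:
                    best_pos, best_cost = dict(pos), cost
            else:
                pos[a], pos[b] = pos[b], pos[a]    # undo the rejected swap
        temperature *= 0.9                         # geometric cooling schedule
    return best_pos, best_cost


blocks = ["a", "b", "c", "d"]
nets = [("a", "b"), ("b", "c"), ("c", "d")]
placement, cost = anneal_placement(blocks, nets, width=2, height=2)
print(placement, cost)
```

Because moves are restricted to pairwise swaps between blocks, the placement stays legal by construction, with every block occupying its own site.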

**Routing** Routing finds legal connections between placed logic elements, according to the netlist. In [ASICs,](#page-361-0) routing is often split into global and detailed routing phases [\[46,](#page-319-0) [147\]](#page-329-0). Global routing in this case determines routing between certain predefined regions, whereas detailed routing then finalizes the connections for each single network. [FPGA](#page-362-0) tools usually combine global and detailed routing, as resources in routing channels are more limited than in [ASICs](#page-361-0) [\[118\]](#page-326-3).
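What finding a legal connection means on a routing resource graph can be shown with a simple breadth-first maze router. This is a deliberately naive sketch, not the algorithm used by VPR or any other tool mentioned here: nets are routed one after another, each resource can be used by a single net, and no rip-up or negotiation takes place. The graph and all node names are invented for the example.

```python
from collections import deque


def route_net(graph, source, sink, used):
    """Find a shortest path from source to sink that avoids used resources."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == sink:
            path = []
            while node is not None:          # walk back to the source
                path.append(node)
                node = parent[node]
            used.update(path)                # resources are now occupied
            return path[::-1]
        for nxt in graph[node]:
            if nxt not in parent and nxt not in used:
                parent[nxt] = node
                queue.append(nxt)
    return None                              # no legal route with the free resources


# Tiny routing resource graph: output pins reach input pins via wires w0 and w1.
rr_graph = {
    "A.out": ["w0", "w1"], "D.out": ["w0", "w1"], "E.out": ["w0"],
    "w0": ["B.in"], "w1": ["B.in", "C.in"],
    "B.in": [], "C.in": [],
}
occupied = set()
first = route_net(rr_graph, "A.out", "B.in", occupied)
second = route_net(rr_graph, "D.out", "C.in", occupied)
third = route_net(rr_graph, "E.out", "B.in", occupied)
print(first, second, third)
```

The third net fails because its only wire is already occupied, which hints at why the limited channel resources of FPGAs make routing harder than in ASICs.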

<span id="page-88-0"></span>

**Figure 2.28:** Tools and file formats used in common [FPGA](#page-362-0) design flows [\[148\]](#page-329-1). The figure focuses on open source tools, but includes some commercial tools for reference. Highlighted is the *fasm* format, which is used as a text-based bitstream description for various [FPGA](#page-362-0) architectures.

**Bitstream Generation** After a legal placement and routing has been obtained, the reconfigurable cells and interconnect of the [FPGA](#page-362-0) have to be configured to realize this result. For that, usually a file containing the configuration bits for all reconfigurable units is created [\[118\]](#page-326-3). For [LUTs,](#page-363-1) this means directly storing the function table, whereas for other elements, the configuration has to be encoded in an element-specific way. Regarding the interconnect, selected multiplexer bits are usually stored directly. [Figure 2.28](#page-88-0) shows tools and file formats used in open source design flows for [FPGAs.](#page-362-0) Bitstream generation is generally an [FPGA](#page-362-0) dependent task, but the [FPGA ASM \(FASM\)](#page-362-16) format allows encoding this information in a common text format [\[148\]](#page-329-1). FASM data can be generated by [VPR](#page-364-7) as a result of the place and route phases [\[149\]](#page-329-2) and can then be used in custom tools to transfer the result to a binary format.
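The following Python sketch illustrates how a LUT function table and multiplexer selections could be written as FASM-style feature lines. The feature names such as `CLB_X1Y1.LUT0` and `SB_X1Y1.MUX_E0` are hypothetical and do not correspond to any real architecture database; only the general line shape, a dotted feature name with an optional bit range and a Verilog-style value, follows the FASM text format.

```python
def lut_init_bits(function, num_inputs):
    """Evaluate the function for every input combination and pack the results."""
    bits = 0
    for index in range(2 ** num_inputs):
        inputs = [(index >> i) & 1 for i in range(num_inputs)]   # inputs[0] is the LSB
        if function(*inputs):
            bits |= 1 << index
    return bits


def fasm_lines(lut_name, function, num_inputs, mux_selects):
    width = 2 ** num_inputs
    init = lut_init_bits(function, num_inputs)
    # LUT content: the function table is stored directly.
    lines = [f"{lut_name}.INIT[{width - 1}:0] = {width}'b{init:0{width}b}"]
    # Interconnect: one enabled feature per selected multiplexer input.
    lines += [f"{mux}.{choice}" for mux, choice in mux_selects.items()]
    return lines


example = fasm_lines(
    "CLB_X1Y1.LUT0",                     # hypothetical feature name
    lambda a, b: a ^ b,                  # 2-input XOR
    2,
    {"SB_X1Y1.MUX_E0": "IN_W2"},         # hypothetical switch box multiplexer
)
print("\n".join(example))
```

A device-specific tool would then translate each of these feature lines into the actual bit positions of the binary configuration.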

### **Further Reading**

[EDA](#page-362-6) tools and algorithms have been covered for both [ASICs](#page-361-0) and [FPGAs](#page-362-0) in literature. For [DSLs,](#page-362-14) [HLS](#page-362-13) and programming paradigms for [FPGAs](#page-362-0) in general, refer to [\[Del23\]](#page-89-0). An in-depth treatment of mapping, placement, routing and bitstream generation for [FPGAs](#page-362-0) is given in [\[DeH07\]](#page-89-1). For a shorter, more recent summary of those topics, refer to [\[Ama18\]](#page-89-2). Examples and case studies of [FPGA](#page-362-0) architectures and tools can be found in [\[Vas07\]](#page-89-3), whereas an industry-oriented introduction covering mixed signal simulation and debugging in general is given by [\[Rod20\]](#page-89-4). [ASIC](#page-361-0) [EDA](#page-362-6) topics are covered in detail in [\[Ger99\]](#page-89-5), with a shorter, more recent summary given in [\[Kah22\]](#page-89-6). For an overview of design flows and mixed signal aspects, refer to [\[Wes11\]](#page-89-7).

<span id="page-89-2"></span><span id="page-89-1"></span><span id="page-89-0"></span>

- <span id="page-89-5"></span><span id="page-89-4"></span><span id="page-89-3"></span>**[Ger99]** GEREZ, Sabih H.: Algorithms for VLSI design automation. Chichester and Weinheim: Wiley, 1999.
- <span id="page-89-6"></span>**[Kah22]** KAHNG, Andrew B.; LIENIG, Jens; MARKOV, Igor L. and HU, Jin: VLSI Physical Design: From Graph Partitioning to Timing Closure. Cham: Springer International Publishing, 2022. DOI: [10.1007/978-3-030-96415-3.](https://doi.org/10.1007/978-3-030-96415-3)
- <span id="page-89-7"></span>**[Wes11]** WESTE, Neil H. E. and HARRIS, David Money: CMOS VLSI design: A circuits and systems perspective. 4. ed. Boston, Mass.: Addison-Wesley, 2011.


# **Chapter 3**

# **Related Work**

This chapter covers related work for the various aspects covered in this thesis. It starts with ambipolar standard cell libraries, continues with ambipolar [FPGAs,](#page-362-0) introduces works related to dynamic reconfiguration in [FPGAs,](#page-362-0) summarizes some [PVTA](#page-364-6) compensation works and power management techniques and concludes with an overview of synthesis approaches for reconfigurable cells.

## **3.1 Ambipolar Standard Cell Libraries**

To develop digital circuits based on ambipolar devices in an automated, standard cell based [EDA](#page-362-6) flow, a matching [PDK](#page-363-6) including a set of standard cells is required. Traditionally, [PDKs](#page-363-6) have been largely shipped as closed source [IP](#page-362-11) and have been provided by the foundries which developed the technology nodes and ultimately manufacture the [ICs.](#page-362-4) In recent years, there have been releases of open-source [PDKs](#page-363-6) for commercially available [CMOS](#page-361-3) technologies. Two well-known examples are SkyWater's 130 nm *SKY130* [\[150\]](#page-329-3) and GlobalFoundries' 180 nm *GF180MCU* [\[151\]](#page-329-4) technology.

**[Predictive PDKs](#page-363-7)** It is possible to develop [PDKs](#page-363-6) without having a complete manufacturing process for a technology. Those [PDKs](#page-363-6) are called [Predictive](#page-363-7) [PDKs \(PPDKs\),](#page-363-7) as they describe a predicted technology. Traditionally, academia did not have access to small feature size technologies, such as recent FinFET technology. Because of that, researchers built [PPDKs](#page-363-7) for these [CMOS](#page-361-3) technologies, enabling them to prototype and test their circuit designs on modern technology nodes without access to the foundries' commercial [PDKs.](#page-363-6) One of the earliest examples of a standard [CMOS](#page-361-3) [PPDK](#page-363-7) is the 45 nm *FreePDK* [\[152\]](#page-329-5). Some years later, *FreePDK15* [\[153\]](#page-329-6) was developed to target a functional 15 nm FinFET technology. A basic [PDK](#page-363-6) such as *FreePDK15* usually includes various rules and definitions: A set of design rules specifies required distances between traces and similar geometric rules. The layer definitions determine what layers are available and what properties they possess. This includes active layers for the design of transistor devices as well as metal layers for wiring. Furthermore, a [PDK](#page-363-6) usually includes SPICE simulation models for the transistor devices. More advanced [PDKs](#page-363-6) may also provide support for [Parasitics Extraction \(PEX\)](#page-363-8) to determine parasitics from the layout and [Layout vs.](#page-363-9) [Schematic \(LVS\)](#page-363-9) to verify that the layout matches the schematic. With those parts, [PDKs](#page-363-6) can be used to design and simulate analog circuits, including the standard cells themselves.

<span id="page-92-0"></span>

**Table 3.1:** Views and file formats to describe a standard cell library for use in common commercial [EDA](#page-362-6) tools. The table shows the views supported by the *FreePDK15* standard cell library and was taken from [\[154\]](#page-329-7). Not all functions in the [EDA](#page-362-6) flow require all views.

Standard cell libraries are often not included with the [PDKs](#page-363-6) themselves, but offered as an additional package. For example, for *FreePDK15*, a standard cell library has been provided by Martins et al. [\[154\]](#page-329-7). Depending on the actual [EDA](#page-362-6) tool and the task to be performed, different information grouped in "Views" is needed. [Table 3.1](#page-92-0) shows the views provided by the *FreePDK15* library, which is a mostly complete set. Here, the .lib file is used for synthesis, .lef for physical synthesis, .v is used for simulation, the .gds file for streamout and the .spi file for timing characterization of the standard cells. The .oa file is a database which combines some of these individual files. Standard cell libraries vary in the number and type of cells they provide, as well as in the functions those cells can realize. For example, [\[154\]](#page-329-7) provides 76 cells with 21 logic functions. Some cells are available in different drive strengths, so that cells with higher drive strength can be used in high-fanout situations. In addition, buffer and inverter cells are usually available in more drive strengths than other cells. The library also provides sequential cells such as [FFs,](#page-362-5) scanflops and latches. In addition to such logic cells, libraries provide helper cells: Antenna cells, tie-high and tie-low cells and filler cells. The implementation of cells commonly follows a template, as described in [\[154\]](#page-329-7): Apart from the physical height of cells, usually the positions of the power supply pins as well as some well dimensions are the same for all cells. Such regularity ensures that the user application design can later be routed more efficiently by the [EDA](#page-362-6) tools. Timing and power information are ultimately stored as simple tables in the respective views, where values may be interpolated by the [EDA](#page-362-6) tools if necessary. Nevertheless, some methodologies have been established for the characterization of cells to obtain these tables: As presented in [\[154\]](#page-329-7), the non-linear delay model is commonly used, the Composite Current Source model is used by Synopsys and the Effective Current Source model is used by Cadence. In general, tools like Cadence Liberate can provide an automated characterization of [CMOS](#page-361-3) standard cells: Given a netlist for a cell, these tools automatically create SPICE netlists to simulate the cell. The simulations are then performed using the SPICE model of the [PDK](#page-363-6) and the results are used to derive the values for the .lib file. This approach can also be used to characterize [RFET](#page-364-2) based cells, as it is independent of the cell layout. It however requires fully functional SPICE models.
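How a tool might interpolate such a timing table can be sketched with a bilinear lookup over input slew and output load, the two axes typically used in non-linear delay model tables. The sketch below is an assumption about the general principle, not the interpolation scheme of any specific tool; the function names, axes and table values are invented for the example.

```python
from bisect import bisect_right


def _segment(axis, value):
    """Index i of the axis segment [axis[i], axis[i+1]] used for interpolation."""
    i = bisect_right(axis, value) - 1
    return max(0, min(i, len(axis) - 2))


def lookup_delay(slew_axis, load_axis, table, slew, load):
    """Bilinear interpolation in a delay table indexed by [slew][load].

    Values outside the axes are linearly extrapolated from the nearest segment.
    """
    i = _segment(slew_axis, slew)
    j = _segment(load_axis, load)
    ts = (slew - slew_axis[i]) / (slew_axis[i + 1] - slew_axis[i])
    tl = (load - load_axis[j]) / (load_axis[j + 1] - load_axis[j])
    low = table[i][j] + tl * (table[i][j + 1] - table[i][j])
    high = table[i + 1][j] + tl * (table[i + 1][j + 1] - table[i + 1][j])
    return low + ts * (high - low)


slew_axis = [0.0, 1.0]            # input transition time, arbitrary units
load_axis = [0.0, 2.0]            # output capacitance, arbitrary units
delay_table = [[1.0, 3.0],
               [2.0, 4.0]]

print(lookup_delay(slew_axis, load_axis, delay_table, 0.5, 1.0))   # 2.5
```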

**[RFET](#page-364-2) Specific Optimization** When it comes to logic cells for [RFETs,](#page-364-2) few publications provide a complete set of standard cells. Most publications focus on specific issues and optimizations instead: For example, Rai et al. propose a set of six functionally enhanced logic gates [\[126\]](#page-327-1): Compared to commonly used standard cells, these integrate XOR operations, which can be efficiently realized using [RFETs.](#page-364-2) In another publication, Rai et al. introduce the concept of self-dual functions [\[155\]](#page-329-8): They first observe that a specific class of reconfigurable cell can be obtained by switching the pull-up and pull-down networks, including the power rails. Such an operation makes a cell switch between its dual functions. For example, a traditional [CMOS](#page-361-3) *NOR* becomes a *NAND* if its pull-up network and pull-down network are switched. Of course, for traditional technology, switching means keeping the network topology but using the correct transistor type in each network. For ambipolar transistors used in [RFET](#page-364-2) circuits, the transistors in the pull-up and pull-down networks are identical, and switching their polarities and power rails can indeed realize the opposite circuit. Based on this idea, Rai et al. propose a workflow which finds such pairs of functions, called self-dual in the publication. The workflow then automatically derives a set of standard cells which implements these special, reconfigurable cells. In benchmarks, Rai et al. could show a decrease in used area of 13% and a decrease in delay of 11.5%. Another [RFET](#page-364-2) specific optimization for standard cells has been proposed by Krinke et al. in [\[156\]](#page-329-9): After observing that the additional configuration gates used by [RFET](#page-364-2) circuits cause additional wiring overhead, Krinke et al. propose various solutions to this issue. Apart from optimizing the layout of reconfigurable cells and relative positions of these additional gate contacts, they also introduce special buffer drivers to drive the gates in a reconfigurable cell. Furthermore, they investigate placement of parts of the cells in specific power regions to reduce power.
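The NOR/NAND example can be checked with a few lines of Python. The sketch assumes that swapping the networks corresponds to taking the dual in the Boolean sense, i.e. inverting all inputs and the output of a function; the helper names `truth_table` and `dual` are made up for this illustration and are not part of the workflow in [\[155\]](#page-329-8).

```python
from itertools import product


def truth_table(function, num_inputs):
    """Outputs of the function for all input combinations, in counting order."""
    return tuple(int(bool(function(*bits))) for bits in product((0, 1), repeat=num_inputs))


def dual(function):
    """Dual of f: invert every input and the output, f_d(x) = NOT f(NOT x)."""
    return lambda *bits: 1 - int(bool(function(*(1 - b for b in bits))))


def nand(a, b):
    return 1 - (a & b)


def nor(a, b):
    return 1 - (a | b)


print(truth_table(dual(nand), 2))   # (1, 0, 0, 0), the NOR function
print(truth_table(dual(nor), 2))    # (1, 1, 1, 0), the NAND function
```

Applying `dual` twice returns the original function, matching the idea of one cell switching back and forth between a pair of functions.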

**[RFET](#page-364-2) [PPDKs](#page-363-7) and Standard Cells** To date, few standard cell libraries have been presented for [RFET](#page-364-2) [PPDKs.](#page-363-7) The first one was presented by Ben-Jamaa et al. in [\[157\]](#page-329-10) for [Carbon Nanotube FETs \(CNTFETs\).](#page-361-4) They developed a library based on *NOR*, *NAND*, *AOI* and *OAI* gates, but extended it with "nonconventional" gates embedding *XOR* functionality. They also realized 46 functions with one base cell design to achieve a reduction of 26% in delay and 32% in area compared to using only unipolar [CNTFET](#page-361-4) gates. The library characterization was performed mostly manually. For area, the authors provided relative estimates based on a weighted device count: They normalized the area of each used transistor with respect to a unit transistor and then summed the normalized area of all transistors in a cell. They did however not consider the complete layout of each single cell. For the delay, Ben-Jamaa et al. used SPICE simulation to obtain the FO4 delay, where each gate is driving 4 instances of the same gate at its output. For evaluation, the authors used the open source ABC tool to perform technology mapping. They developed a custom genlib file to specify the available gates to ABC and mapped various benchmark circuits, achieving a 6.9% average speedup.

In 2018, a new set of standard cell libraries was presented by Rai et al., this time for [Silicon Nanowire \(SiNW\)](#page-364-8) technology [\[158\]](#page-329-11). Unlike Ben-Jamaa, Rai et al. designed full cell layouts for seven cells and characterized their library in .lib and .lef files. Their library does not contain any sequential elements. To obtain the timing characterization, the authors simulated the SPICE netlist with a Verilog-A table model for the silicon nanowire transistors. Area was obtained from the gate layouts. Using those files, they evaluated the MCNC benchmarks using the qflow flow, which internally uses Yosys and ABC. Their circuits required on average 13% more area than a comparable [CMOS](#page-361-3) implementation, which used a scaled version of *FreePDK45*.

Yet another [PDK](#page-363-6) was presented by Gore et al. in 2019 [\[159\]](#page-330-0). The [PDK](#page-363-6) features Verilog-A SPICE models, design rule manuals and [Design-Rule Check \(DRC\)](#page-362-17) and [LVS](#page-363-9) support for 10 nm silicon-nanowire [Three-Independent-Gate FETs](#page-364-9) [\(TIGFETs\).](#page-364-9) As is common for these [PPDKs,](#page-363-7) the SPICE model was derived as a table model from [Technology CAD \(TCAD\)](#page-364-10) simulation. Whereas this initial publication did not include a standard cell library, one was presented in 2022 by Gauchi [\[160\]](#page-330-1) and Keyser [\[161\]](#page-330-2). Using the analog features of the [PDK,](#page-363-6) Gauchi designed layouts for 11 cells in Cadence Virtuoso. The cells were then characterized using Cadence Liberate and the results were analyzed using benchmarks. Whereas the analog flow and design of the libraries used Cadence tools, the benchmarks were synthesized using open source tools such as Yosys. When analyzing the synthesis results of the PicoRV32 RISC-V processor, Gauchi reported 2.3 times less area and 5.7 times less energy compared to the GlobalFoundries 12 nm technology. [Figure 3.1](#page-95-0) shows the cell characterization flow for this technology as described by Keyser [\[161\]](#page-330-2). Keyser's description focuses more on physical aspects of the cells. After layout, [DRC](#page-362-17) and [LVS,](#page-363-9) the authors also perform [PEX](#page-363-8) to assess parasitics of the devices. They then use Cadence Liberate to obtain the .lib files.

<span id="page-95-0"></span>

**Figure 3.1:** Design flow for a [TIGFET](#page-364-9) standard cell library, taken from [\[161\]](#page-330-2). Gates are designed using the analog parts of the [TIGFET](#page-364-9) [PDK](#page-363-6) of [\[159\]](#page-330-0). Functionality is then verified in [DRC](#page-362-17) and [LVS](#page-363-9) checks. The gates are then combined with wire models taken from *FreePDK15* to obtain parasitics for the final characterization.

The latest [PDK](#page-363-6) was presented by Quijada et al. in 2022 [\[162\]](#page-330-3). It targets the germanium nanowire technology and again derived its SPICE model as a Verilog-A table model from [TCAD](#page-364-10) simulation. The authors analyze various cells in SPICE, but do not provide any .lib or .lef files for a complete standard cell library. The authors claim that parasitic effects can easily be derived, because the nanowire design is based on a commercial FinFET process and values can



**Table 3.2:** Overview of previously published [PDKs](#page-363-6) for ambipolar transistor devices. The table compares the technology used, whether sequential cells such as [FFs](#page-362-5) are provided, how .lib files are obtained, whether the [PDK](#page-363-6) supports standard cell synthesis for digital circuits, whether it integrates a wireload model and whether it can be mixed with existing silicon standard cells.

be adapted. But before standard cells can be provided, the [PDK](#page-363-6) needs to be extended with [DRC](#page-362-17) and [LVS](#page-363-9) support first.

### **3.2 Ambipolar FPGA Architectures**

Although various publications have covered ambipolar reconfigurable cells and [FPGAs](#page-362-0) are commonly mentioned as the main use case, few publications on ambipolar device based [FPGAs](#page-362-0) are available. Part of the reason for this is that a circuit level implementation and evaluation requires a full [PPDK](#page-363-7) and [EDA](#page-362-6) integration. So far, only one such [PPDK](#page-363-7) for [RFET](#page-364-2) has been published and only in 2022: The [SiNW](#page-364-8) [TIGFET](#page-364-9) [PDK](#page-363-6) by Gore, Gauchi and Keyser [\[159,](#page-330-0) [160,](#page-330-1) [161\]](#page-330-2). Without such a [PDK,](#page-363-6) full circuit level implementation, simulation and evaluation is not possible. Nevertheless, some previous works have investigated partial aspects of ambipolar [FPGA](#page-362-0) architectures.

**Novel non[-RFET](#page-364-2) [FPGAs](#page-362-0)** Some publications have investigated [FPGA](#page-362-0) architectures using novel [LEs,](#page-362-1) but still using conventional technology. An example for this is the work by Parandeh-Afshar et al. in 2012 [\[105\]](#page-325-2): In this publication, [LUTs](#page-363-1) have been replaced with [AICs.](#page-361-2) As a motivation, the authors argue that programmable devices have previously been designed to fit [EDA](#page-362-6) tools: As those tools mostly used [SOP](#page-364-11) representation of logic, [PALs](#page-363-3) and similar devices have focused on *AND* and *OR* gates. With the adoption of [AIGs](#page-361-9) in [EDA](#page-362-6) tools, the introduction of [AICs](#page-361-2) enables a simple technology mapping implementation. As [AICs](#page-361-2) are less expressive than [LUTs,](#page-363-1) a larger number of [AICs](#page-361-2) has to be used, increasing demands on the interconnect. To avoid congestion, the authors propose the introduction of [AIC](#page-361-2) clusters, for which they also evaluate various sizes. For the local interconnect in these clusters, the authors evaluate various depopulation levels for crossbars to reduce the area usage. For their architecture, they found 75 % populated crossbars to provide the best trade-off. In addition to a purely [AIC](#page-361-2) based system, the authors also introduce a hybrid system which mixes [LUTs](#page-363-1) and [AIC.](#page-361-2) Few details are provided for the hybrid system, but ultimately, the authors claim this architecture reduced delay by up to 32% and area by 16%.

Another novel, non[-RFET](#page-364-2) [FPGA](#page-362-0) architecture has been proposed by Gonçalves et al. in 2013 [\[163\]](#page-330-4). In this architecture, the authors replaced normal [SRAM](#page-364-0)-based [LUTs](#page-363-1) with [Magnetoresistive Random-Access Memory \(MRAM\)](#page-363-10) backed ones. This architecture is intended to be used in space systems, where radiation is causing issues with [SRAM](#page-364-0) based storage. The 2 input [LUT,](#page-363-1) a combination of [MRAM](#page-363-10) for long-term storage and [DRAM](#page-362-12) for permanent access to the stored data, has been manufactured and analyzed. The authors have not discussed the system architecture of their [FPGA](#page-362-0) or any implications for [EDA](#page-362-6) tools. As the new architecture is however still using [LUTs,](#page-363-1) it can be assumed that few or no changes are necessary.

**Ambipolar [FPGA](#page-362-0) Architectures** The first and to the knowledge of the author only complete description of an [FPGA](#page-362-0) architecture based on ambipolar devices has been given in various publications by Ben-Jamaa and Gaillardon. In 2011, they first described their reconfigurable logic cell, a [DGCNTFET](#page-362-3) based replacement for [LUTs](#page-363-1) [\[164\]](#page-330-5). In addition to the reconfigurable cell, the authors also introduced a way to allow permutation of power lines, which may be needed if nets change between pull-up and pull-down functionality. Special to this initial publication is that configuration inputs of the cell are not directly connected to storage. They are realized as normal inputs instead and routed using the interconnect. The authors present an [EDA](#page-362-6) flow based on ABC, VPACK and VPR and evaluate various benchmarks for their architecture. <span id="page-98-0"></span>They report a 13% speedup compared to a reference architecture with [LUTs,](#page-363-1) but at an area overhead of 10%. The authors have not made use of any other benefits of [RFET](#page-364-2) and have not introduced any novelties in the [FPGA](#page-362-0) system architecture.



**Figure 3.2:** Introduction of the MCluster and different internal routing architectures [\[165\]](#page-330-6). MClusters reduce the amount of signals which are routed on the global interconnect and therefore reduce routing congestion.

A slightly modified variant of this [FPGA](#page-362-0) architecture has been presented in the same year by Gaillardon and Ben-Jamaa [\[166\]](#page-330-7). The main novelty of this publication is the introduction of MClusters, shown in [figure 3.2.](#page-98-0) Using 3x3 MClusters, the authors managed to reduce the area by 62% compared to [LUTs.](#page-363-1) As MClusters use a special interconnect implementation, there are more changes required in [EDA](#page-362-6) tools, which are explained in detail in the publication. Again, the publication does not make use of any other features of [RFET,](#page-364-2) e.g. for power reduction. The final version of this architecture has been presented 4 years later, in [\[165\]](#page-330-6). It is based on the same dynamic logic reconfigurable cell as the previous publications and again uses MClusters. This publication however puts even more focus on [EDA](#page-362-6) aspects and evaluates a larger set of benchmarks. In addition, the authors perform an evaluation of various granularity levels for MClusters.

Whereas the [FPGA](#page-362-0) architecture by Gaillardon et al. did not make use of special [RFET](#page-364-2) features, a publication by Park et al. in 2017 [\[167\]](#page-330-8) explicitly focuses on one such feature: The technology used enables the storage of configuration data on the [BGs](#page-361-10) of the device. The [BGs](#page-361-10) therefore effectively work as a flash memory, as shown in [figure 3.3.](#page-99-0) Programming of the [BGs](#page-361-10) is carried out using high voltage pulses, where the pulse duration determines drain current and pulse potential determines threshold voltage. Unfortunately, the publication provides little additional information: The transistor technology

<span id="page-99-0"></span>

**Figure 3.3:** Programming of a transistor using charge storage on the [BG](#page-361-10) [\[167\]](#page-330-8). **Left:** Programming pulse polarity determines device polarity. Pulse duration determines drain current. **Right:** Variation of programming voltage affects threshold voltage.

has not been presented in detail. It is based on poly-silicon and the measured devices have gate lengths of 1 μm. The authors did not explain whether they expect their technology to be shrinkable to smaller feature sizes. Similarly, little information is given about the reconfigurable cell or the system [FPGA](#page-362-0) architecture. Although the publication envisions high-density reconfigurable devices, it does not present such a system. The publication rather focuses on the description of the device on transistor level.

### **3.3 Dynamic Reconfiguration**

Modern [FPGA](#page-362-0) architectures not only allow configuring [FPGAs](#page-362-0) once at startup, but also provide more advanced configuration features. As most commercial [FPGAs](#page-362-0) are [SRAM](#page-364-0) based, various advanced applications using the ability to quickly modify memory have been developed. In general, reconfiguration, the process of programming an [FPGA](#page-362-0) with a bitstream, has been divided into different categories [\[168\]](#page-331-0): Reconfiguration in general describes a replacement of the configuration of the whole [FPGA.](#page-362-0) In contrast, partial reconfiguration replaces only a part of the [FPGA](#page-362-0) bitstream. Both techniques can be further distinguished as static or dynamic reconfiguration: In static reconfiguration, non-reconfigured logic is kept in a reset or stopped state. In dynamic reconfiguration, on the other hand, those areas which are not reconfigured contain active logic and [FPGA](#page-362-0) applications mapped to those areas are not interrupted by reconfiguration. Most commercial [FPGAs](#page-362-0) now either provide only static

reconfiguration of the whole [FPGA,](#page-362-0) or they provide [Partial Dynamic Reconfiguration \(PDR\),](#page-363-11) allowing to replace parts of the logic while keeping other parts of the application working.

A summary of recent academic and commercial reconfiguration architectures has been collected in 2019 by Vipin et al. [\[168\]](#page-331-0), whereas older architectures have been described in [\[169\]](#page-331-1). [PDR](#page-363-11) enables various new application patterns: Logic can be time-multiplexed, which allows execution of large circuits on smaller [FPGAs.](#page-362-0) As reconfiguration is often implemented using serial register chains and serial programming, the data transfer rate in reconfiguration is often limited. Reconfiguration can therefore be performed faster in general, if only a part of the bitstream is changed. Another benefit of [PDR](#page-363-11) is that a part of the user application circuit can be kept active during reconfiguration. This can be useful to keep a peripheral link active and is commonly used with [Peripheral Component Interconnect Express \(PCIe\)](#page-363-12) connections to host computers.
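The effect of bitstream size on reconfiguration time can be illustrated with a back-of-the-envelope calculation. The following sketch assumes a hypothetical configuration port with fixed width and clock frequency; all numbers are illustrative placeholders and are not taken from any specific device or from the cited surveys.

```python
def reconfiguration_time_s(bitstream_bits: int, port_width_bits: int,
                           port_clock_hz: float) -> float:
    """Transfer time of a bitstream over a port with constant throughput."""
    if port_width_bits <= 0 or port_clock_hz <= 0:
        raise ValueError("port width and clock frequency must be positive")
    return bitstream_bits / (port_width_bits * port_clock_hz)


# Hypothetical example values (assumptions for illustration only).
FULL_BITSTREAM_BITS = 32_000_000     # complete device configuration
PARTIAL_BITSTREAM_BITS = 2_000_000   # one partially reconfigurable region
PORT_WIDTH_BITS = 1                  # serial programming interface
PORT_CLOCK_HZ = 50e6

full_time = reconfiguration_time_s(FULL_BITSTREAM_BITS, PORT_WIDTH_BITS, PORT_CLOCK_HZ)
partial_time = reconfiguration_time_s(PARTIAL_BITSTREAM_BITS, PORT_WIDTH_BITS, PORT_CLOCK_HZ)
print(f"full: {full_time:.3f} s, partial: {partial_time:.3f} s")
```

Under these assumptions, the transfer time scales linearly with the bitstream size, so replacing one sixteenth of the bitstream takes one sixteenth of the time. Real devices add protocol and setup overhead, which this model ignores.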

Reconfiguration architectures can be distinguished into architectures supporting fine-grain reconfiguration and coarse grain reconfiguration. Whereas fine-grain reconfiguration allows reconfiguring individual programmable elements, most architectures combine multiple elements to be reprogrammed at the same time into [Partially Reconfigurable Regions \(PRRs\).](#page-363-13) Grouping elements in this way allows reducing area overhead of the configuration network at the cost of reduced flexibility. Most architectures support reading back the configuration [SRAM,](#page-364-0) but this feature is often not exposed. Cardona et al. used this feature to read back frames, modify single [LUTs](#page-363-1) and write back the bitstream [\[170\]](#page-331-2). This enabled them to support fine-grain reconfiguration on an architecture which originally only supports frame-based reconfiguration. In addition to reconfiguration, some architectures support relocation: In the relocation case, a user application can be placed onto a location on the [FPGA](#page-362-0) for which it has not originally been synthesized. Such concepts are most common when multiple applications are to be executed on one [FPGA.](#page-362-0) In such a case, it might not be known during synthesis time which applications will be running on the [FPGA](#page-362-0) later on, and the location can not be determined ahead of time. Most of the commercial toolchains still require defining possible target areas ahead of time. An application can then be placed onto any such block, but not onto freely chosen locations. Yet another conceptual difference can be found in reconfiguration time: Whereas specialized early architectures enabled reconfiguration in one clock cycle, for recent commercial architectures, reconfiguration is a slower process.

In an abstract view, the configuration memory of an [FPGA](#page-362-0) can be thought

<span id="page-101-0"></span>

**Figure 3.4:** Conceptual view of reconfiguration in [FPGA](#page-362-0) [\[168\]](#page-331-0). **Left:** Configuration memory as a virtual layer, independent of the hardware layer. **Right:** Extending the concept to multiple memory layers leads to multi-context [FPGAs.](#page-362-0)

to be independent of the hardware logic, as shown in [figure 3.4.](#page-101-0) The figure also shows a conceptual view of multi-context [FPGAs,](#page-362-0) which were an active topic of research in the late 1990s. As [FPGAs](#page-362-0) were limited in size, such a concept allowed to split larger circuits in a time-multiplexed manner. To realize this, multi-context [FPGAs](#page-362-0) store multiple independent bitstreams and switch between those, usually on a fixed clock-cycle schedule. Switching between different bitstreams leads to many changing signals, which causes a large activity factor for most nets in the [FPGA.](#page-362-0) This leads to high power consumption, which ultimately caused this concept to be no longer used when large [FPGAs](#page-362-0) became available [\[168\]](#page-331-0).

When it comes to adoption in commercial architectures, Xilinx and Intel support [PDR](#page-363-11) on a coarse grain level. Xilinx introduces frames as the smallest reconfigurable unit. In early Xilinx architectures, frames used to cover complete columns in the [FPGA.](#page-362-0) They have become smaller in more recent architectures though. National Semiconductor, Lattice and Actel initially supported [PDR,](#page-363-11) but removed it from later architectures. In general, [PDR](#page-363-11) has not been adopted by a larger audience [\[168\]](#page-331-0). Nevertheless, some applications where it has been used will be explained in more detail:

**Task Based Reconfiguration** Apart from domain-specific examples, [PDR](#page-363-11) has been mostly investigated as part of task based reconfiguration systems. Such systems take the idea of tasks as used in software [Operating System](#page-363-14) [\(OS\)](#page-363-14) and transfer it to [FPGA.](#page-362-0) As shown in [figure 3.5,](#page-102-0) hardware tasks are usually realized as partial bitstreams describing a local region of the overall [FPGA.](#page-362-0)

Publications for [FPGA](#page-362-0) tasks can be roughly sorted into four categories: Task-Mapping, Task-Scheduling, [OS](#page-363-14) and Hardware-Software integration. Most

<span id="page-102-0"></span>

**Figure 3.5:** Hardware tasks in a 1D area model [\[171\]](#page-331-3). In the 1D model, tasks are always full-height and vary only in width, simplifying task placement. Tasks need predefined communication interfaces to be relocatable. Steiger's architecture also reserves some area for [OS](#page-363-14) support.

publications do not describe any changes in [FPGA](#page-362-0) architecture, but they make certain assumptions on the reconfiguration system. When it comes to task mapping, one of the earliest publications was published by Diessel in 1997 [\[172\]](#page-331-4). The publication assumes a homogeneous [FPGA](#page-362-0) which allows arbitrary relocation of logic in two dimensions. For such an architecture, the authors presented an algorithm which can find free space to efficiently map tasks. If not enough space is available, tasks can be moved to reduce gaps, called compaction or defragmentation. In [\[173\]](#page-331-5), Walder et al. use a similar approach, but propose a different algorithm to efficiently keep track of free space. A more recent take of task mapping is given by Sidiropoulos et al. in [\[174\]](#page-331-6). The authors use an architecture with multiple independent [FPGA](#page-362-0) cores and combine those with a host computer. The host computer keeps track of the unused logic resources within each [FPGA](#page-362-0) core. When an application needs to be mapped, the system uses the netlist to actually place and route the design on demand. It takes into account information about already used resources and therefore enables tasks to overlap in area. Compared to other systems which are based on pre-placed and pre-routed tasks, this allows for better resource usage. For large [FPGA](#page-362-0) designs, the reconfiguration time can however become excessive when a place and route step has to be performed on demand.
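The interplay of free-space search and compaction can be illustrated in the simpler 1D area model of figure 3.5. The following is a minimal sketch under that assumption; it uses a plain first-fit search and is not the algorithm of any of the cited publications. The names `Task`, `find_gap`, `compact` and `place` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    name: str
    start: int  # first occupied column
    width: int  # number of occupied columns


def find_gap(tasks: List[Task], total_width: int, width: int) -> Optional[int]:
    """First fit: leftmost start column with `width` free columns, or None."""
    cursor = 0
    for task in sorted(tasks, key=lambda t: t.start):
        if task.start - cursor >= width:
            return cursor
        cursor = max(cursor, task.start + task.width)
    return cursor if total_width - cursor >= width else None


def compact(tasks: List[Task]) -> None:
    """Shift all tasks to the left, keeping their order, to merge free gaps."""
    cursor = 0
    for task in sorted(tasks, key=lambda t: t.start):
        task.start = cursor
        cursor += task.width


def place(tasks: List[Task], total_width: int, name: str, width: int) -> Optional[Task]:
    """Place a new task; compact first if fragmentation prevents placement."""
    start = find_gap(tasks, total_width, width)
    if start is None:
        compact(tasks)
        start = find_gap(tasks, total_width, width)
    if start is None:
        return None  # not enough free area even after compaction
    task = Task(name, start, width)
    tasks.append(task)
    return task


# Two tasks leave two separate gaps of two columns each on a 10 column device.
tasks = [Task("A", 0, 3), Task("B", 5, 3)]
placed = place(tasks, total_width=10, name="C", width=4)
```

In this example, task C only fits after B has been moved next to A. On a real device, each such move corresponds to a relocation of a running task, which is why the cost of compaction has to be weighed against the gained area.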

Task scheduling has been discussed starting with [\[175\]](#page-331-7). In this publication, the authors describe a system which allows to choose the location of tasks at runtime. They provide an algorithm for efficient arrangement and packing, assuming a homogeneous [FPGA](#page-362-0) architecture. For recent, commercial [FPGAs,](#page-362-0) Sterpone et al. introduce a custom routing system [\[176\]](#page-331-8). As this system is aware of reconfiguration frames, it routes the design in a way to minimize the number of frames and bitstream size. This allows for more efficient reconfiguration of tasks. Another scheduling algorithm has been presented by da Silva et al. and focuses on streaming applications [\[177\]](#page-331-9). It uses runtime task scheduling and introduces a performance model to enable prediction of the speedup.

The third topic, [OS](#page-363-14) for [FPGA](#page-362-0) systems, has been led by Steiger et al. In 2003, they first described the idea of an [FPGA](#page-362-0) [OS](#page-363-14) as shown in [figure 3.5,](#page-102-0) including a scheduler, placer and loader for tasks [\[178\]](#page-331-10). The publication's main focus however is on algorithms for planning and placement. In the followup publication [\[171\]](#page-331-3), Steiger et al. then describe the [OS](#page-363-14) in detail. It introduces online scheduling with hard realtime guarantees. It further asserts that as [FPGA](#page-362-0) systems do not always allow arbitrary relocation of logic, tasks need to use a pre-defined communication scheme. In such a scheme, the location of communication logic within a task is fixed.

More recently, with the introduction of [FPGA](#page-362-0) in data centers and cloud computing, hardware-software co-design aspects have been investigated. In such systems, a host computer is used to configure the [FPGA](#page-362-0) with various tasks. These tasks are then used to accelerate certain specific operations, but the main application logic is still executed in software on the host computer. Publications include [\[179\]](#page-332-0), which was one of the first to discuss the topic and to combine software and hardware tasks. Janßen et al. then expanded on the concept by providing a predefined library of hardware accelerators, so-called hardware overlays [\[180,](#page-332-1) [181\]](#page-332-2). Their system was one of the first to be integrated with the PYNQ software stack. Another hardware-software integration framework which provides automated design-space exploration to find efficient trade-offs is TaPasCo. It focuses on parallel computation of tasks and ease of use in software [\[182\]](#page-332-3).

**Fine Grain Reconfiguration** Previously mentioned publications have not made any changes to [FPGA](#page-362-0) architecture, although some of them have assumed fine-grain reconfigurability with relocation support. In the following, custom architectures with novelties in the reconfiguration system are reviewed quickly.

<span id="page-104-0"></span>

**Figure 3.6:** Overview of a time multiplexed [FPGA](#page-362-0) as described by [\[183\]](#page-332-4). **[\(a\)](#page-104-0)** Working principle with multiple independent memory layers time multiplexed for one logic layer. **[\(b\)](#page-104-0)** Splitting combinational logic into multiple contexts.

[Figure 3.6](#page-104-0) shows one of the first multi-context [FPGA](#page-362-0) [\[183\]](#page-332-4). It is reconfigurable in one cycle and allows to choose one out of eight stored configurations. To support quick saving and restoring of the state, the architecture introduced micro registers, which can store [CLB](#page-361-7) outputs. The authors also explore various usage patterns of such an architecture in detail. One of those, Logic Engine mode, is shown in [figure 3.6b.](#page-104-0) It depicts how the system can be used to time-multiplex a single design: Combinational logic is split into multiple parts and executed in multiple cycles.
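The principle of splitting combinational logic into multiple contexts can be illustrated with a small behavioral model. The sketch below is a strong simplification and not the architecture of [\[183\]](#page-332-4): each context is modeled as a hypothetical Python function, and a dictionary stands in for the micro registers which carry intermediate results from one cycle to the next.

```python
from typing import Callable, Dict, List

Signals = Dict[str, bool]
Context = Callable[[Signals], Signals]


def run_time_multiplexed(contexts: List[Context], inputs: Signals) -> Signals:
    """Evaluate one context per cycle on the same (modeled) logic layer.

    Values computed in earlier cycles stay available in the micro registers.
    """
    micro_registers: Signals = dict(inputs)
    for context in contexts:
        micro_registers.update(context(micro_registers))
    return micro_registers


# Hypothetical design y = (a AND b) XOR (c OR d), split into two contexts.
def context_0(s: Signals) -> Signals:
    return {"t0": s["a"] and s["b"], "t1": s["c"] or s["d"]}


def context_1(s: Signals) -> Signals:
    return {"y": s["t0"] != s["t1"]}


result = run_time_multiplexed(
    [context_0, context_1],
    {"a": True, "b": True, "c": False, "d": False},
)
print(result["y"])
```

The design needs two cycles instead of one, but each context only requires the logic resources of its own part of the circuit.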

A similar architecture was proposed by Li et al. [\[184\]](#page-332-5). Based on multi-context [FPGAs,](#page-362-0) they introduce configuration caching: To reduce the time spent in reconfiguration, the authors propose various optimizations to reduce the number of reconfigurations. They also propose caching algorithms which work efficiently for relocation and defragmentation.

<span id="page-104-1"></span>

**Figure 3.7:** Row based defragmentation as proposed by Compton [\[185\]](#page-332-6). Configuration data and state of each single row is first read into the row buffer, then stored into the new location.

[Figure 3.7](#page-104-1) shows row-based defragmentation, a concept proposed by Compton et al. [\[185\]](#page-332-6). Their architecture is a homogeneous [FPGA](#page-362-0) with homogeneous and virtual IO to enable moving of logic. As 2D defragmentation is an algorithmically complex problem, the authors focus on 1D defragmentation instead. Through introduction of a row buffer, reconfiguration of the device always happens one row at a time. In addition, the current configuration and state can be read back into the row buffer. When multiple applications have been started and stopped, some empty rows may reside between applications. To merge this space into a larger region, a defragmentation system is introduced. This system moves applications on the [FPGA](#page-362-0) by copying them into the row buffer and then to their new location row-by-row. A slightly modified and extended variant of this architecture has been presented by Brebner et al. [\[186\]](#page-332-7). The authors' architecture enables relocation without a host computer, implementing all the relocation logic in hardware. For this, they extend the Compton architecture's row-based defragmentation to quickly find free rows in hardware. Based on this free-space search, they implement on-chip compaction of running [FPGA](#page-362-0) tasks.
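The row-by-row movement through a single row buffer can be sketched as a simple in-place compaction. This is an illustrative model only, assuming that each row is either empty or fully owned by one application; it does not reproduce the hardware mechanisms of [\[185\]](#page-332-6) or [\[186\]](#page-332-7), and the function name is hypothetical.

```python
from typing import List, Optional

Row = Optional[str]  # name of the application occupying the row, None if empty


def defragment_rows(rows: List[Row]) -> int:
    """Compact occupied rows towards index 0 using a single row buffer.

    Returns the number of rows that had to be moved.
    """
    moves = 0
    write = 0  # next free target row
    for read in range(len(rows)):
        if rows[read] is None:
            continue
        if read != write:
            row_buffer = rows[read]    # read configuration and state into the buffer
            rows[read] = None
            rows[write] = row_buffer   # write the buffer to the new location
            moves += 1
        write += 1
    return moves


rows: List[Row] = ["A", "A", None, "B", None, None, "C", "C"]
moves = defragment_rows(rows)
print(rows, moves)
```

Because rows are processed in ascending order, the relative order of the applications is preserved and rows which are already in place are not touched.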

In 2004, Koch et al. introduce hardware extensions for preemptive task scheduling and defragmentation [\[187\]](#page-332-8). [FFs](#page-362-5) can already be configured as part of the bitstream, as those can have defined initialization values. Freezing the current application state can therefore be realized without hardware extensions, but it requires reading back the complete configuration memory, including [LUT](#page-363-1) configuration which does not change. The authors therefore introduce an additional scan chain which only connects the [FFs](#page-362-5) in the logic cells. This way, a smaller amount of data needs to be saved and restored. The authors also present an extension to enable two-dimensional defragmentation through shifting.

<span id="page-105-0"></span>

**Figure 3.8:** [FPGA](#page-362-0) core fusion as proposed by Figuli [\[188\]](#page-333-0). Multiple independent [FPGA](#page-362-0) cores can be combined to place larger tasks.

[Figure 3.8](#page-105-0) shows an approach to task mapping presented by Figuli et al. in 2011 [\[188\]](#page-333-0). This [FPGA](#page-362-0) architecture is divided into multiple identical cores, which behave independently. Application size is limited to be multiples of the core size and applications are mapped to the cores using a controller. When an application is too large for one core, the system supports core fusion: In core fusion, the [IO](#page-362-10) peripherals on one side are disabled and interconnect switches instead connect to the next core. To enable routing of [IO](#page-362-10) pins in such an architecture, the authors also provide virtual [IO](#page-362-10) adapters. Defragmentation can be performed on a coarse level using this system: The architecture supports freezing and restoring state of a complete core. This allows applications to be moved between cores, enabling defragmentation on core level.

A completely different approach to fine-granular reconfiguration presented by Bozzoli et al. in 2019 is shown in [figure 3.9](#page-106-0) [\[189\]](#page-333-1). This architecture allows reconfiguration on single [LUT](#page-363-1) level through introduction of the "Reconfigurable Multipotent Cell", ReM. This cell combines logic, memory and reconfiguration: It enables distributed reconfiguration, as each cell can trigger reconfiguration of its neighbors. Unfortunately, the authors have not demonstrated how existing, regular applications can make use of such a novel architecture.

<span id="page-106-0"></span>

**Figure 3.9:** Basic cell of the distributed reconfigurable architecture by Bozzoli et al. [\[189\]](#page-333-1). Each cell can trigger reconfiguration of its neighboring cells.

An extension for relocation support on commercial [FPGAs](#page-362-0) has been presented by Adetomi et al. [\[190\]](#page-333-2). Its main contribution is the use of clock lines to communicate between applications. As clock lines are not part of static wiring in Xilinx architectures, this can reduce routing issues when relocating applications. This approach is mostly useful when an existing architecture must be used and can not be modified. When novel [FPGA](#page-362-0) architectures are

designed, it is possible to include dedicated wiring such as [Network-on-Chip](#page-363-15) [\(NoC\)](#page-363-15) instead.

### **3.4 PVTA Compensation**

**Traditional [PVTA](#page-364-6) Suppression Methods** As explained in the previous chapter, various [PVTA](#page-364-6) sources affect the propagation delay of logic gates, including the reconfigurable logic elements in [FPGA.](#page-362-0) Similar effects also affect the propagation delay of wires and the [FPGA](#page-362-0) interconnect in general. When working with average or typical process values and the corresponding propagation delays, circuits can fail: When one of the critical paths contains gates with worse delay than assumed in [EDA](#page-362-6) [STA,](#page-364-5) the circuit may actually cause a setup time violation, even though this was not visible in [STA.](#page-364-5) Commonly used solutions include speed binning and worst corner design: In speed binning, it is accepted that some produced [ICs](#page-362-4) may fail at their nominal clock frequency. Therefore, each single [IC](#page-362-4) is measured either statically after fabrication or dynamically at runtime. Depending on the results, it may be used with clock frequencies which are lower than the nominal frequency. Similarly, [ICs](#page-362-4) with lower propagation delays may be used at a higher frequency. Such an approach however requires detailed measurements of the [IC,](#page-362-4) which often needs test structures requiring additional area on the chip. An alternative to this is worst-corner design: In this case, [STA](#page-364-5) is performed with the worst case values for all possible variation sources. This includes process, voltage and temperature variation as well as aging. As process variation is largely random, the worst case combination of all sources is unlikely to occur, especially all the time and affecting the total [IC](#page-362-4) area. Worst-case corner design therefore introduces overly pessimistic guard bands [\[191\]](#page-333-3). This leads to additional design effort in otherwise unnecessary circuit optimization.
It also means that circuits often are clocked at a lower frequency than theoretically possible. For [FPGAs,](#page-362-0) this issue occurs in exactly the same way. A commonly proposed solution to improve this situation is [Statistical STA \(SSTA\)](#page-364-12) [\[51\]](#page-319-1). [SSTA](#page-364-12) does not work with single values for propagation delays, but operates on statistical distributions instead. As a result, [SSTA](#page-364-12) yields a distribution of circuit delay paths. This allows to estimate the number of circuits which can operate within a certain performance region. It therefore allows to shape the performance distribution during design, which increases the yield after binning.
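The difference between worst-corner and statistical analysis can be illustrated with a small numerical experiment. The sketch below uses Monte Carlo sampling instead of the analytical propagation of distributions used in actual [SSTA](#page-364-12) tools, and it assumes a single path with independent, normally distributed gate delays. All values and function names are illustrative assumptions.

```python
import random
from typing import Sequence


def sample_path_delay(gate_means_ns: Sequence[float], sigma_rel: float,
                      rng: random.Random) -> float:
    """One sample of the path delay with independently varying gate delays."""
    return sum(rng.gauss(mean, sigma_rel * mean) for mean in gate_means_ns)


def estimate_yield(gate_means_ns: Sequence[float], sigma_rel: float,
                   clock_period_ns: float, samples: int = 10_000,
                   seed: int = 0) -> float:
    """Fraction of sampled circuits whose path delay meets the clock period."""
    rng = random.Random(seed)
    passing = sum(
        sample_path_delay(gate_means_ns, sigma_rel, rng) <= clock_period_ns
        for _ in range(samples)
    )
    return passing / samples


def worst_case_delay(gate_means_ns: Sequence[float], sigma_rel: float,
                     k: float = 3.0) -> float:
    """Worst-corner estimate: every gate is at mean + k * sigma simultaneously."""
    return sum(mean * (1.0 + k * sigma_rel) for mean in gate_means_ns)


# Hypothetical path of ten gates with 1 ns nominal delay and 10 % sigma each.
path = [1.0] * 10
print("worst corner delay:", worst_case_delay(path, 0.1), "ns")
print("yield at 11 ns clock period:", estimate_yield(path, 0.1, 11.0))
```

Under these assumptions, the worst-corner analysis demands a clock period of 13 ns, although almost all sampled circuits already meet an 11 ns period. This is because independent variations partially average out along the path. Correlated variation sources, which this sketch ignores, reduce the effect.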

**Applicability for [FPGAs](#page-362-0)** Whereas [FPGA](#page-362-0) suffer from performance variation due to [PVTA](#page-364-6) just like [ASICs,](#page-361-0) the [SSTA](#page-364-12) solution is unfortunately not easily
applicable. An approach was described in [\[192\]](#page-333-0), but it suffers from various usability issues: In [FPGAs,](#page-362-0) the critical paths are not known during design and manufacturing of the [FPGA](#page-362-0) itself. [SSTA](#page-364-0) therefore has to be performed on the user application. However, at this point the [FPGA](#page-362-0) [ICs](#page-362-1) have already been manufactured and sold, which makes yield optimization difficult. In essence this would mean that a produced bitstream works only on some [FPGAs.](#page-362-0) Whereas this is already undesirable for [FPGA](#page-362-0) users, it would also be more difficult to realize binning for [FPGA](#page-362-0) bitstream programming. To make matters worse, whether a bitstream operates properly on some [FPGA](#page-362-0) [IC](#page-362-1) is also highly dependent on the placement. An [FPGA](#page-362-0) [IC](#page-362-1) which works for one version of the design might not work anymore when minor changes in the design demand a rerun of the [EDA](#page-362-2) flow and yield a changed placement. Design synthesis could then be repeated until the design is found to work on an [FPGA,](#page-362-0) but again, this only optimizes for a single [IC.](#page-362-1) All in all, [SSTA](#page-364-0) approaches therefore are less suited for [FPGAs:](#page-362-0) As binning becomes infeasible due to previously manufactured devices and because of device reuse using reconfiguration, the utility of statistical approaches is limited.

**[FPGA](#page-362-0) Specific [PVTA](#page-364-1) Handling** Solutions to mitigate [PVTA](#page-364-1) effects on [FPGA](#page-362-0) therefore look different from ones used for [ASICs.](#page-361-0) They are usually making use of the reconfigurability of the [FPGAs](#page-362-0) in some way. Most of the time, solutions are dynamic, adjusting various aspects of the implemented circuit during circuit operation. A few completely static solutions, which do not perform any operation at runtime, but only modify the [EDA](#page-362-2) flow, have been proposed. Usually those rely on a previous characterization of the target [IC](#page-362-1) and make use of that information during placement. Dynamic solutions also rely on such a measurement of device characteristics, but perform these online. In addition, they also include some aspects to compensate the [PVTA](#page-364-1) effects during runtime. In some cases, solutions also characterize the user application: This allows to not only compensate [PVTA](#page-364-1) to reach nominal operating conditions, but to also accept reduced performance in areas where the user application has sufficient slack and can tolerate larger delays. Compared to nominal conditions, this concept allows for further energy saving optimization. In the following, publications covering those individual aspects will be presented. Publications combining these topics to yield a full compensation system similar to the one in this thesis will be introduced at the end of this section.

#### **Critical Path Identification**

Critical paths in an application design can be identified in two ways: One approach is special handling or detection of paths during synthesis and implementation in the [EDA](#page-362-2) tools. The other is to detect critical paths, or violations of their timing constraints, directly in the circuit.

<span id="page-109-0"></span>In 1990, Kaenel et al. [\[193\]](#page-333-1) presented a system for global voltage reduction. To estimate the available timing slack of the critical path, they built a circuit which emulates this path. This "Equivalent Critical Path" is measured and depending on available slack, the global power supply of the circuit is adjusted. To build this equivalent path, the authors identify and extract the critical path using [EDA](#page-362-2) tools.



**Figure 3.10:** Variation aware chipwise placement as proposed by Cheng et al. [\[194\]](#page-333-2). Placement uses a chip specific variation map to optimize critical path location.

A different approach was taken by Cheng et al. [\[194\]](#page-333-2). Instead of regulating performance dynamically, they characterize each [FPGA](#page-362-0) [IC](#page-362-1) individually: Using test circuits configured onto the [FPGA,](#page-362-0) they obtain performance variation maps for some defined regions in their [ICs.](#page-362-1) They then integrate this information in the [EDA](#page-362-2) flow as shown in [figure 3.10](#page-109-0) to customize the placement step for each [FPGA](#page-362-0) [IC.](#page-362-1) As the process variation is known, critical paths can be placed into regions with smaller propagation delay. Overall, this improved performance by 12%. The main drawback of the system is that the application bitstream has to be regenerated for each [FPGA](#page-362-0) [IC,](#page-362-1) which is a time-intensive operation.
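The core idea of chip-specific placement can be illustrated with a small sketch. The following Python code is not the algorithm of Cheng et al.; it is a minimal greedy illustration which assumes a hypothetical per-region delay factor map, a dictionary of nominal path delays and exactly one path per region. All names and values are made up for this example.

```python
def assign_paths_to_regions(path_delay, region_delay_factor):
    """Greedy illustration: map the most critical paths to the fastest regions.

    path_delay: dict path name -> nominal delay (higher = more critical)
    region_delay_factor: dict region name -> measured delay multiplier
                         (below 1.0 means faster than nominal)
    Assumes each region hosts exactly one path (a strong simplification).
    """
    if len(path_delay) > len(region_delay_factor):
        raise ValueError("not enough regions for all paths")
    paths = sorted(path_delay, key=path_delay.get, reverse=True)
    regions = sorted(region_delay_factor, key=region_delay_factor.get)
    return dict(zip(paths, regions))


def worst_delay(assignment, path_delay, region_delay_factor):
    """Worst path delay after scaling each path with its region's factor."""
    return max(path_delay[p] * region_delay_factor[r]
               for p, r in assignment.items())


# Hypothetical example values, not measured data.
paths = {"p_crit": 10.0, "p_mid": 8.0, "p_short": 5.0}
variation_map = {"r0": 1.10, "r1": 0.95, "r2": 1.00}
placement = assign_paths_to_regions(paths, variation_map)
```

In this toy example, the most critical path ends up in the fastest region `r1`. A real placer additionally has to respect routing, resource and legality constraints, which is why the bitstream must be regenerated for every device.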

A related approach was taken by Ghosh et al. [\[195\]](#page-333-3) for low-power design. In such systems, optimizations such as gate sizing can increase the number of critical paths. In order to compensate process variation, the authors therefore describe a way to reduce the number of critical paths in circuits: Using modified [EDA](#page-362-2) tools, they customize the Shannon Expansion step in a way to shape critical paths. Using this, they then confine the critical paths to certain logic cofactors and isolate them. They then switch to two-cycle operation at runtime when those critical paths are activated.

A different solution was given by Ebrahimi et al. [\[196\]](#page-333-4). Their compensation approach is based on the common critical path replica idea, but the selection of the critical path is special: Whereas there have already been publications describing how to select those critical paths which are most likely to be affected by aging for [ASICs,](#page-361-0) the authors adapt this idea for [FPGAs.](#page-362-0) As the transistor-level design of commercial [FPGAs](#page-362-0) is not publicly known, [ASIC](#page-361-0) models cannot be used and the authors derive an [FPGA](#page-362-0) model, taking e.g. static and dynamic stress into account.

Elgebaly et al. noted that the critical path in a circuit might change over time due to [PVTA](#page-364-1) [\[197\]](#page-333-5). They therefore propose to track a changing, emulated path. This emulated path is designed to exhibit the same behavior as the actual critical path under all [PVT](#page-364-2) conditions. Their path emulation covers both interconnect and combinational logic.

<span id="page-110-0"></span>

**Figure 3.11:** Razor (left) and Bubble Razor (right) timing violation detector [\[198\]](#page-334-0). Special hardware monitors in all paths detect violations of setup time constraints.

A different style of compensation systems has been published based on the Razor system. [Figure 3.11](#page-110-0) compares Razor and similar systems with the Bubble Razor system. Instead of detecting critical paths ahead of time, Razor-like systems extend combinational logic to include detectors for setup time violations. The original Razor system detects such a violation by checking whether a transition on the data signal occurs shortly after a clock transition. Razor therefore adds a constraint on the minimal hold time within a circuit. Bubble Razor, on the other hand, introduces two latches in between combinational paths and uses them to detect changes within the combinational logic at invalid times. When integrated into [PVTA](#page-364-1) compensation systems, those systems operate the application at almost-failure frequency. When a failure occurs, the system needs to invoke restore logic to replay the operation.
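The detection window of a Razor-style monitor, and the minimum-delay constraint it implies, can be expressed in a few lines. The following Python sketch is a strongly simplified timing model and not taken from the cited publications: it assumes a main flip-flop sampling at the end of the clock period and a hypothetical shadow element sampling `shadow_delay` later, with all times in the same arbitrary unit.

```python
def razor_check(arrival_time, clock_period, shadow_delay):
    """Classify one data arrival in a simplified Razor-style model."""
    if arrival_time <= clock_period:
        return "ok"              # main and shadow element agree
    if arrival_time <= clock_period + shadow_delay:
        return "error_detected"  # late transition seen, replay required
    return "undetected"          # too late even for the shadow element


def min_delay_ok(shortest_path_delay, shadow_delay):
    """Short paths must not change the data before the shadow element
    has sampled, otherwise a false error would be reported."""
    return shortest_path_delay > shadow_delay
```

The second function reflects the additional constraint mentioned above: every path, including the shortest one, has to be slower than the detection window.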

A novel approach for critical path replicas in [ASICs](#page-361-0) was recently published by Miro-Panades et al. [\[199\]](#page-334-1). It operates the application circuit in two phases: In the first phase, it detects the available timing margins. In the second phase, it actually operates the application normally. The remaining slack is estimated using a timing fault sensor, emulating critical paths. Unlike previous publications, this sensor is not fixed-function but configurable. It can therefore adapt to different application designs.

#### **Device Characterization**

Apart from detecting the critical path in applications, [PVTA](#page-364-1) compensation systems usually also characterize the device in some way. Often, a circuit is characterized using a replica of the critical path in the application. In other cases however, process, voltage, temperature variation and aging are measured directly. Most of those sensors proposed for [FPGAs](#page-362-0) are based on delay measurements. As all of those physical values are correlated with delay, designing sensors for one of them requires compensation of the others. The following section will present a short overview of publications on those topics.

<span id="page-111-0"></span>

**Figure 3.12:** Fully digital temperature sensor as presented by Chen et al. [\[200\]](#page-334-2). All signal processing is performed in the digital domain.

Yu et al. describe a way to measure process variation, which they use for a variation aware design approach [\[201\]](#page-334-3). They measure the delay of a ring-oscillator-based circuit to estimate both [LE](#page-362-3) and wire delay. The design is implemented for [FPGA](#page-362-0) and only uses resources available in standard [FPGA](#page-362-0) architectures. Gnad et al. analyzed voltage fluctuations in commercial [FPGAs](#page-362-0) and used a similar time-to-digital sensor to measure the voltage [\[69\]](#page-321-0). They placed multiple sensors over the whole [FPGA](#page-362-0) area and analyzed temporal and spatial effects depending on the application design.

Various publications have used similar sensors for temperature sensing. One of the first such temperature sensors was described by Chen et al. and is shown in [figure 3.12.](#page-111-0) It consists of a cyclic delay line and uses the system clock as a time reference. The whole design fits into 140 [LEs](#page-362-3) and achieves an error between −1.5 K and 0.8 K. With 260 μs conversion time, the sensor can be used for dynamic measurements. Franco et al. describe a similar, ring-oscillator-based temperature sensor for Virtex5 [\[202\]](#page-334-4). They explicitly discuss the voltage sensitivity of the sensor and how to address it. Happe et al. use 144 similar sensors on a Virtex 6 architecture [\[203\]](#page-334-5). Using this dense set of sensors, they derive a thermal model which they use for thread mapping onto [CPU](#page-361-1) cores. For characterization and dynamic modelling, a set of heating elements on the [IC](#page-362-1) produces temperature gradients which are measured using the sensors. Calibration is performed using a temperature measurement diode which is integrated in the [FPGA.](#page-362-0)

Aging and process variation are monitored similarly. Agarwal et al. describe a system for circuit failure prediction [\[204\]](#page-334-6). They mostly focus on [PMOS](#page-363-0) aging and [NBTI](#page-363-1) effects and implement an aging detector integrated into a [FF.](#page-362-4) Huard et al. use aging monitors to perform adaptive wearout management [\[205\]](#page-334-7). They argue that replica elements work well for global effects, but less so for local effects. In-situ monitors integrated within the application design and embedded in the physical area of those application parts enable more direct monitoring of delays of real paths. Another ring-oscillator-based system to detect aging was presented by Sengupta et al. [\[206\]](#page-334-8). Their publication focuses on [BTI](#page-361-2) and [HCI](#page-362-5) effects and explains the required sensor circuit calibration.

A sensor directly developed to measure delay was presented by Zick et al. [\[207\]](#page-335-0). They implemented an [FPGA](#page-362-0) based sensor node in 8 [LUTs.](#page-363-2) The sensor element was therefore able to fit within a single Virtex5 [CLB.](#page-361-3) For usage scenarios, they explain how the sensor can measure delay, temperature and IR drop or voltage variation.

#### **[PVTA](#page-364-1) Compensation Architectures**

Based on critical path identification and device characterization, various solutions for management and compensation of [PVTA](#page-364-1) have been proposed. Most systems target [ASICs](#page-361-0) and are often not directly applicable for [FPGAs,](#page-362-0) as the critical path is not known at [IC](#page-362-1) manufacturing time. Some ideas presented in these systems are however general and have been adapted to [FPGAs.](#page-362-0)

An analysis focusing on circuit and transistor level aging solutions was presented by Alam et al. [\[208\]](#page-335-1). The authors suggest considering aging effects already in the design of circuits. Transistor sizes and other physical parameters would then also be chosen according to aging requirements. Unlike this static compensation approach, Gupta et al. presented TRIBECA, a dynamic compensation system mostly targeting processor systems [\[209\]](#page-335-2). The authors provide an analysis of variation sources, then propose a system for error detection and compensation: In order to correct errors, they introduce an error detection unit and operation replay support. As their solution is local, it can also handle spatial variation effects. The proposed system is tightly integrated with the [CPU](#page-361-1) architecture presented and can therefore not be used for general purpose applications. When it comes to more general compensation systems for [ASICs,](#page-361-0) various systems have been proposed over time. An overview of those systems can be found in recent survey publications [\[87,](#page-323-0) [210,](#page-335-3) [211\]](#page-335-4). Khoshavi et al. provide an overview of measurement and monitoring approaches. They compare static guard banding and dynamic approaches. Static approaches include design aware balancing, which balances critical paths according to aging criteria, and other [EDA](#page-362-2) based solutions such as the one by Alam. Dynamic approaches presented focus mostly on voltage and frequency scaling. Another summary, focusing mostly on [CPU](#page-361-1) systems, was published by Mittal et al. [\[210\]](#page-335-3). They mostly include solutions which address process variation, including block selection and error management techniques. Compared to other publications, they also summarize works which investigate scheduling of tasks on [CPU](#page-361-1) under process variation.
A similar, but more extensive survey is given by Rahimi et al. [\[211\]](#page-335-4). It focuses on processor systems as well, but covers variability mitigation from circuit to software level. In the following, some individual works which are closely related to the work in this thesis will be presented in more detail.

**[Adaptive Body Biasing](#page-361-4)** Similar to [RFET](#page-364-3) technology, [BB](#page-361-5) in [SOI](#page-364-4) devices enables fine-grain adjustment of transistor threshold voltages $V_{\text{th}}$. As [SOI](#page-364-4) technology is readily available in commercial manufacturing processes, various systems using [BB](#page-361-5) at different granularity have been proposed. When the amount of [BB](#page-361-5) is adjusted at runtime, these systems are called [Adaptive Body](#page-361-4) [Biasing \(ABB\)](#page-361-4) systems. [Figure 3.13](#page-114-0) shows such an [ABB](#page-361-4) based compensation system, as presented by Tschanz et al. [\[212\]](#page-335-5). This early work focused on die-to-die chip variation and as such uses [BB](#page-361-5) on a chip-wide scale. The system is dynamic, in that it finds the optimum [BB](#page-361-5) voltage for each chip during runtime of the application. To determine whether the circuit speed needs to be improved or whether the circuit can be slowed down, the system constantly monitors a replica of a critical path. It then adjusts the bias of a circuit block accordingly. In the test chip shown in the figure, this circuit block is not a complete processor, but only consists of some extracted processor paths for simplicity. The authors also propose an extension to use multiple instances of this system on one [IC](#page-362-1) to compensate intra-die variations. The granularity of this [ABB](#page-361-4) approach is therefore at chip level, whereas the proposed extension operates at region level.
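The replica-based control loop described above can be sketched as a simple bang-bang controller. The following Python code is only an illustration under stated assumptions and not the controller of Tschanz et al.: the step size, the bias limits, the sign convention (positive values mean forward bias) and the linear delay model `replica_delay` are all hypothetical.

```python
def abb_step(bias_mv, replica_delay_ns, target_ns, margin_ns=0.05,
             step_mv=50, min_mv=-500, max_mv=500):
    """One iteration of a hypothetical bang-bang body-bias controller.

    Positive bias is taken to mean forward bias (faster, more leakage).
    """
    if replica_delay_ns > target_ns:
        bias_mv += step_mv   # replica too slow: speed the block up
    elif replica_delay_ns < target_ns - margin_ns:
        bias_mv -= step_mv   # slack available: slow down, save leakage
    return max(min_mv, min(max_mv, bias_mv))


def replica_delay(bias_mv, nominal_ns=1.10, ns_per_mv=-0.0004):
    """Toy linear delay model of the replica path (assumed, not measured)."""
    return nominal_ns + ns_per_mv * bias_mv


bias = 0
for _ in range(20):
    bias = abb_step(bias, replica_delay(bias), target_ns=1.01)
```

With these example values, the loop settles at the smallest forward bias for which the replica meets the target delay. The margin keeps the controller from oscillating between two neighboring bias steps.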

<span id="page-114-0"></span>

**Figure 3.13:** [Adaptive Body Biasing](#page-361-4) test chip presented by Tschanz et al. [\[212\]](#page-335-5). The chip contains a critical path replica used for characterization as well as a circuit block which simulates a [CPU.](#page-361-1)

A similar system proposed by Teodorescu et al. is explicitly tailored to mitigate process variation for processors [\[213\]](#page-335-6). This work considers fine-grain [BB,](#page-361-5) although in this case granularity is also only at region, not at transistor scale. It also combines the [BB](#page-361-5) system with [Dynamic Voltage and Frequency Scaling](#page-362-6) [\(DVFS\)](#page-362-6) of the processor, to reduce power consumption even more. In addition, the authors propose an in-chip $V_{\text{th}}$ variation model for the manufacturing process they use.

Yet another [ABB](#page-361-4) system was proposed by Mauricio et al. [\[214\]](#page-335-7). As shown in [figure 3.14,](#page-115-0) this work again divides a chip into regions, where each region contains one biasing system. These systems track runtime variations in the circuits in a region using a delay sensor, which is not further specified. The main focus of the paper is on an area efficient bias generator for the individual regions. Using a charge pump in this bias generator allows generating bias voltages above and below the power supply voltages. Here, the authors claim that their approach requires 70% less area than a comparable system using [Digital to Analog Converters \(DACs\).](#page-362-7)

<span id="page-115-0"></span>

**Figure 3.14:** [Body Biasing](#page-361-5) Islands in a compensation system proposed by Mauricio et al. [\[214\]](#page-335-7). Each island contains the compensation circuit, which consists of a delay detector and charge pump.

Whereas previous systems have been proposed for general purpose [ASIC](#page-361-0) applications, a system closer to solutions for [FPGAs](#page-362-0) has been proposed by Matsushita et al. [\[215\]](#page-335-8). This work focuses on [Coarse Grain Reconfigurable](#page-361-6) [Arrays \(CGRAs\)](#page-361-6) and primarily aims to reduce leakage power using [BB](#page-361-5) on [SOTB](#page-364-5) technology. To enable [BB,](#page-361-5) the authors first divide the [CGRA](#page-361-6) into multiple regions of a certain size. In [CGRAs,](#page-361-6) just like in [FPGAs,](#page-362-0) some details about critical paths may only be known after configuration with a user application. The authors therefore determine the biases for each region as part of the application [EDA](#page-362-2) flow, after the [CGRA](#page-361-6) user application has been placed. As there is no runtime management system, the authors' approach is completely static. To find the optimal size, the grain size is then varied in a design space exploration. The authors claim to achieve a 40% reduction of static leakage using this approach, at an area overhead of 6%.

**[FPGA](#page-362-0) Solutions** In addition to the generic compensation systems presented so far, some specialized [FPGA](#page-362-0) systems have been described in literature as well. Chow et al. proposed one of the first systems to realize [Dynamic Voltage](#page-362-8) [Scaling \(DVS\)](#page-362-8) on commercial [FPGAs.](#page-362-0) Their proposed system uses the delay sensor in [figure 3.15](#page-116-0) to determine the current speed of the [FPGA](#page-362-0) logic. It then scales the supply voltage globally for the whole [FPGA](#page-362-0) to reduce power as far as possible. The system was mainly meant to be used to reduce power usage, not to counter [PVTA.](#page-364-1) As such, there is only one sensor for the whole [IC](#page-362-1) and there is no local adjustment of voltages. The authors claim a 54% reduction in power.

<span id="page-116-0"></span>

**Figure 3.15:** Logic delay measurement circuit proposed by Chow et al. [\[216\]](#page-335-9). Registers are clocked using the same signal as the data input of the inverter chain. The falling clock edge will trigger signal changes in the inverter chain, which will be captured by the [FF.](#page-362-4)

A different goal was pursued by Nabaa et al. [\[217\]](#page-336-0): Their system primarily targets compensation of process variation, requiring a more fine-grain, localized measurement of delays. The main novelty in this publication is the fact that the characterizer circuit shown in [figure 3.16](#page-117-0) is placed only once on the [IC.](#page-362-1) To measure delays in different areas of the [FPGA,](#page-362-0) the authors iteratively route the signal to be measured through different [LEs](#page-362-3) on the [FPGA.](#page-362-0) They then use [BB](#page-361-5) to slow down regions with positive slack, achieving a 3-times reduction in leakage power. As the system can only characterize [LEs](#page-362-3) as long as they are not in use by the application, the authors propose to run the characterization phase once, before starting the user application. This semi-dynamic compensation can therefore not address variation during application runtime, including temperature variation and voltage variation. Long-term aging can be compensated if the user application is stopped periodically.

Hioki et al. focus on reduction of leakage power in their [SOTB](#page-364-5) test chip [\[218\]](#page-336-1). Using a fine granular approach with 57 $V_{\text{th}}$ domains per [FPGA](#page-362-0) region, they achieve up to 50 times reduction in leakage. Due to this fine-grain approach, they however report an area overhead of 26 % and an additional 10 % of area which has to remain unused for separation. There is no runtime measurement of path delay. Instead, the authors modify their [FPGA](#page-362-0) tools to adjust the bias for all power domains after placing the user application. As such, not only the critical path but all paths individually are considered. The resulting circuit characterization is used to tune the biases in the power domains of the device. As this approach is static, it cannot address temperature and voltage variation or aging. It could compensate process variation if the chips were characterized before configuration. This option was however not explored by the authors.

<span id="page-117-0"></span>

**Figure 3.16:** [FPGA](#page-362-0) block characterizer used in the system proposed by Nabaa et al. [\[217\]](#page-336-0). A clock signal is routed through the block to be characterized to a phase detector. The phase detector also gets a direct connection to the clock, so that the measured phase difference characterizes the delay of the [FPGA](#page-362-0) block and wire delay.

A slightly different approach has been taken in the thesis by Burmester Campos [\[219\]](#page-336-2). His work focuses on [EDA](#page-362-2) algorithms and proposes variation aware optimization algorithms. These approaches are validated on the PAnDA architecture, an [FPGA](#page-362-0) with reconfigurable transistor widths.

Focusing on [FPGA](#page-362-0) runtime again, Maragos et al. proposed a [PVT](#page-364-2) system for commercial [FPGAs](#page-362-0) [\[220\]](#page-336-3). It places dozens of delay sensors onto an [FPGA,](#page-362-0) embedded into the user application. Each sensor then uses a logic chain to measure the delay and compares it to a configurable, acceptable delay. This acceptable delay is predetermined in the [EDA](#page-362-2) flow for the user application. This information is then used to dynamically adapt the global power supply of the [FPGA.](#page-362-0) The authors achieve up to 27% reduction in power, at a resource overhead of 1.6%. The main limitation of this approach is that [DVS](#page-362-8) can only happen globally in commercial [FPGAs.](#page-362-0)

Apart from these compensation systems presented in academia, commercial [FPGA](#page-362-0) vendors have started to make use of [BB](#page-361-5) in [SOTB](#page-364-5) [FPGAs.](#page-362-0) In 2019, Lattice Semiconductor presented the Crosslink-NX FPGA, the first based on the Nexus platform [\[221\]](#page-336-4). These devices are based on a [FDSOI](#page-362-9) process and enable [BB](#page-361-5) on a chip-wide level. The devices also only allow for two performance levels to be selected. So far, no dynamic management system has been proposed, further limiting the use for [PVTA](#page-364-1) compensation.

### <span id="page-118-1"></span>**3.5 Power Management Techniques**

The previous section has included some systems which do not only focus on [PVTA](#page-364-1) compensation, but also feature power reduction. As the mechanisms for performance measurement and adjustment are the same in both cases, solutions are often similar as well. Previously presented publications were however more focused on technology-level details, such as [BB](#page-361-5) in [SOI](#page-364-4) devices. The power management schemes discussed in this section focus more on system-level aspects.

<span id="page-118-0"></span>

**Figure 3.17:** Classification scheme for power aware [FPGA](#page-362-0) architectures and techniques as proposed by Akgün et al. [\[222\]](#page-336-5).

[Figure 3.17](#page-118-0) shows a classification of such power aware [FPGA](#page-362-0) techniques, as recently proposed by Akgün et al. [\[222\]](#page-336-5). The authors distinguish between three main topics: Power-saving techniques, runtime management and fault diagnosis. In the following, we will mainly consider the first topic, which includes system architecture solutions. The second aspect mostly focuses on application specific solutions which are of less relevance to this thesis. Error detection and recovery is also not relevant to this work.

**[EDA](#page-362-2) Techniques** In addition to hardware approaches, power saving approaches have also been introduced in [EDA](#page-362-2) tools. Early works by Sutter et al. have focused on more power efficient [FSM](#page-362-10) state encoding [\[223\]](#page-336-6). Apart from binary and one-hot encoding, they also analyzed two-hot encoding and a custom scheme which is intended to minimize switching activity. Using the best technique reduces the power consumption by 57%. Whereas this approach considers one small detail, more systemic [EDA](#page-362-2) power saving solutions have been proposed by Singh et al. [\[224\]](#page-336-7). Their work considers clustering and placement under power reduction constraints. Using a specialized logic clustering algorithm enabled power reduction by up to 13%. To achieve that, the authors proposed a new algorithm for routability estimation between logic cluster blocks. Gayasen et al. introduced power regions in their [FPGA](#page-362-0) architecture [\[225\]](#page-336-8). In order to reduce leakage power, they added sleep transistors to all those regions and turn off unused ones. To improve power reduction results, they also customized the placement phase in their [EDA](#page-362-2) tools. Using a vertical or horizontal direction based placement made it possible to maximize the number of unused regions. Further details on the placement algorithm were not given by the authors. They also briefly note that their region based power management, essentially a power gating scheme, can also be used at runtime. This however requires explicit activation or deactivation of regions by the user application.
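The influence of state encoding on switching activity, as discussed for the [FSM](#page-362-10) work above, can be estimated by counting the state register bits that toggle per transition. The following Python sketch is a minimal illustration and not the evaluation method of Sutter et al.: the transition probabilities are made up, and Gray coding is included only as a further comparison point that is not mentioned in the cited work.

```python
def hamming(a, b):
    """Number of differing bits between two state codes."""
    return bin(a ^ b).count("1")


def binary_encoding(n_states):
    return {s: s for s in range(n_states)}


def one_hot_encoding(n_states):
    return {s: 1 << s for s in range(n_states)}


def gray_encoding(n_states):
    return {s: s ^ (s >> 1) for s in range(n_states)}


def expected_bit_flips(transitions, encoding):
    """Expected number of toggling state bits per clock cycle.

    transitions: dict (src, dst) -> probability of that transition per cycle
    """
    return sum(p * hamming(encoding[s], encoding[d])
               for (s, d), p in transitions.items())


# Hypothetical 4-state FSM cycling 0 -> 1 -> 2 -> 3 -> 0.
cycle = {(0, 1): 0.25, (1, 2): 0.25, (2, 3): 0.25, (3, 0): 0.25}
flips_binary = expected_bit_flips(cycle, binary_encoding(4))
flips_one_hot = expected_bit_flips(cycle, one_hot_encoding(4))
flips_gray = expected_bit_flips(cycle, gray_encoding(4))
```

For this counter-like example, the three encodings yield 1.5, 2.0 and 1.0 toggling bits per cycle. The ranking depends on the transition statistics of the concrete state machine, and register toggling is only one contribution to total power.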

**Power Gating** Many other publications have focused on power gating on an architectural level. In 2007, Tuan et al. described a low-power version of the Xilinx Spartan 3 architecture [\[226\]](#page-336-9). Apart from various general optimizations, such as lowering the core voltage and reducing leakage in configuration [SRAM,](#page-364-6) they also include a power gating scheme. As [SRAM](#page-364-6) leakage has already been optimized, the authors do not power gate the configuration storage. Whereas power gating was originally mostly used for unused logic, keeping configuration storage also enables a standby mode in the architecture: In this mode, the logic in the design can be power gated dynamically and powered on again later. The authors propose that this could be used with a user-provided central power controller, which is itself never power gated.

<span id="page-120-0"></span>

**Figure 3.18:** Fine grain power gating for [FPGA](#page-362-0) as proposed by Bsoul et al. [\[227\]](#page-337-0). All logic blocks (LC) and neighboring routing channels can be power gated individually.

As the area overhead of power gating solutions can be significant, Bsoul et al. carried out a design space exploration of various sizes of gating regions [\[228\]](#page-337-1). The authors use a dynamically power-gated design and study the area vs. power saving trade-off. For their architecture, they find tiles of 3x3 [CLBs](#page-361-3) to be most efficient, reducing leakage by 40% at an area overhead of only 1%. The proposed architecture allows runtime adaptation as well, but the power controller needs to be implemented by the [FPGA](#page-362-0) application designer. In a follow-up publication, Bsoul et al. extend the architecture [\[227\]](#page-337-0) to allow for fine-grain power gating, as shown in [figure 3.18.](#page-120-0) Apart from the finer granularity for [SBs,](#page-364-7) the authors also introduce a modified [EDA](#page-362-2) tool flow. To introduce a power domain aware routing algorithm, they modify the cost function of [VPR](#page-364-8)'s router to penalize routing through regions which do not belong to the fan-ins or fan-outs of the net. This is intended to reduce routing through otherwise unused tiles, making more tiles available for power gating. With these changes, the authors managed to reduce leakage by 83% in total.
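The effect of such a routing penalty can be shown with a toy cost function. The following Python sketch does not reproduce the cost function of VPR or the modification by Bsoul et al.; the multiplicative `penalty`, the tile names and the base costs are hypothetical and only illustrate how a penalty can steer a net away from tiles it does not otherwise use.

```python
def routing_cost(base_cost, tile, net_tiles, penalty):
    """Cost of using one routing resource located in `tile`.

    net_tiles: set of tiles which contain the source or a sink of the net.
    Resources in any other tile are penalized (hypothetical cost model).
    """
    return base_cost if tile in net_tiles else base_cost * penalty


def path_cost(path, base_costs, net_tiles, penalty):
    """Sum of the costs of all routing resources along a candidate path."""
    return sum(routing_cost(base_costs[t], t, net_tiles, penalty)
               for t in path)


base = {"A": 1.0, "B": 1.0, "D": 1.0}
net = {"A", "D"}                 # the net only connects logic in A and D
shortcut = ["A", "B", "D"]       # crosses the otherwise unused tile B
detour = ["A", "A", "D", "D"]    # longer, but stays within A and D
```

Without a penalty (`penalty=1.0`) the shortcut through tile `B` is cheaper. With `penalty=3.0` the detour wins, so tile `B` stays unused and remains a candidate for power gating.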

Whereas Bsoul et al. modified [CAD](#page-361-7) algorithms, a different approach is taken by Seifoori et al. [\[229\]](#page-337-2). In an attempt to more efficiently select the multiplexers which are part of one power region, the authors first place and route a set of benchmarks for a baseline [FPGA](#page-362-0) architecture. They then perform k-means clustering and a custom utilization similarity clustering to find multiplexers and [SBs](#page-364-7) which are commonly used together. Instead of modifying the [CAD](#page-361-7) tools, this approach shapes the power-gating regions according to the results of the [CAD](#page-361-7) tools for certain benchmark circuits.

**[DVS](#page-362-8) and [DVFS](#page-362-6)** Another commonly used power management technique on [FPGAs](#page-362-0) is [DVS.](#page-362-8) As previously mentioned, one of the first such systems was the setup by Chow et al. [\[216\]](#page-335-9). It used an off-chip voltage controller and a commercial [FPGA](#page-362-0) to realize global [DVS](#page-362-8) for the complete [IC.](#page-362-1) A more recent publication by Nabina et al. replicated the idea on a Virtex 5 architecture [\[230\]](#page-337-3). Using a LEON3 processor, the authors realize an [Application Specific](#page-361-8) [Integrated Processor \(ASIP\),](#page-361-8) where some [FPGA](#page-362-0) resources are used to implement a reconfigurable part. For the processor, the authors also implement dynamic frequency scaling. For the reconfigurable module, the authors use an adaptive voltage scaling scheme, measuring the logic delay using a delay measurement circuit.

[Figure 3.19](#page-122-0) shows a [DVS](#page-362-8) system proposed by Nunez-Yanez [\[231\]](#page-337-4). This system also uses an external power regulator, but uses an in-situ detector to determine whether the voltage should be scaled up or down. The detector consists of two [FFs,](#page-362-4) a main [FF](#page-362-4) and a slow [FF.](#page-362-4) The main [FF](#page-362-4) directly replaces a [FF](#page-362-4) in user application logic, whereas the slow [FF](#page-362-4) connects to the same input, but with an additional delay. When propagation delay increases due to voltage reduction and there is a setup time violation, this is first detected in the slow [FF.](#page-362-4) As the main [FF](#page-362-4) still operates correctly, the user application will not be affected. The [DVS](#page-362-8) controller uses the information about failing slow [FFs](#page-362-4) to stop further reduction of the supply voltage. To integrate these detectors within the application logic, the author has developed an [EDA](#page-362-2) tool to automatically replace [FFs](#page-362-4) in critical paths.
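The interplay of main and slow [FF](#page-362-4) can be illustrated with a small behavioral model. The following Python sketch is not the implementation by Nunez-Yanez; it assumes a hypothetical delay model in which the path delay is inversely proportional to the supply voltage, and all voltages, step sizes and delays are example values.

```python
def path_delay_ns(voltage_mv):
    """Toy delay model: delay grows as the supply voltage drops (assumed)."""
    return 1000.0 / voltage_mv


def main_ff_ok(voltage_mv, clock_ns):
    """The main flip-flop captures correct data if the path meets timing."""
    return path_delay_ns(voltage_mv) <= clock_ns


def slow_ff_ok(voltage_mv, clock_ns, extra_delay_ns):
    """The slow flip-flop sees the same data, delayed by extra_delay_ns."""
    return path_delay_ns(voltage_mv) + extra_delay_ns <= clock_ns


def scale_voltage(v_start_mv, v_min_mv, step_mv, clock_ns, extra_delay_ns):
    """Lower the supply step by step until the slow flip-flop reports a failure."""
    v = v_start_mv
    while v - step_mv >= v_min_mv and slow_ff_ok(v, clock_ns, extra_delay_ns):
        v -= step_mv
    return v


v_final = scale_voltage(1000, 600, 25, clock_ns=1.5, extra_delay_ns=0.1)
```

In this model the loop stops at 700 mV: the slow flip-flop already fails there, while the main flip-flop still meets timing. The additional delay therefore acts as the safety margin between the detected limit and an actual application error.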

Another [DVS](#page-362-8) system was proposed by Ahmed et al. [\[232\]](#page-337-5). This system changes the supply voltage according to a calibration lookup table. This table contains the minimum operation voltage for the circuit at various operation points. It is obtained through a custom [EDA](#page-362-2) tool which automatically generates a calibration bitstream based on the user application. The calibration bitstream then has to be programmed once to each [FPGA](#page-362-0) [IC](#page-362-1) to obtain the device specific calibration. In [\[233\]](#page-337-6), this idea was extended by the authors to use multiple calibration bitstreams.
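A table-driven voltage selection of this kind can be sketched as follows. The Python code below is only an illustration: it assumes that the operation points are temperatures, which is not stated above, and the table contents as well as the conservative choice between two neighboring entries are hypothetical.

```python
import bisect

# Hypothetical calibration result: (temperature in °C, minimum voltage in mV),
# sorted by temperature.
CALIBRATION = [(0, 820), (25, 800), (50, 810), (85, 850)]


def lookup_voltage_mv(table, temperature_c):
    """Return a safe supply voltage for the given temperature.

    Between two calibrated points, the higher of the two neighboring
    voltages is used as a conservative choice.
    """
    temps = [t for t, _ in table]
    if not temps[0] <= temperature_c <= temps[-1]:
        raise ValueError("temperature outside calibrated range")
    i = bisect.bisect_left(temps, temperature_c)
    if temps[i] == temperature_c:
        return table[i][1]
    return max(table[i - 1][1], table[i][1])
```

For example, `lookup_voltage_mv(CALIBRATION, 30)` selects the larger of the entries for 25 °C and 50 °C. Since the table is device specific, it has to be regenerated for every [FPGA](#page-362-0) [IC](#page-362-1) using the calibration bitstream.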

Recent systems realizing [DVFS](#page-362-6) have also been proposed by Levine, Zhao and Taka. Levine's system performs online slack measurement using shadow registers. These registers have to be added to critical paths using a custom [EDA](#page-362-2) tool [\[234\]](#page-337-7). Apart from enabling global power scaling, the proposed system also scales one global clock. Zhao's system performs offline self-calibration instead, finding the frequency and voltage limits for an application before it is actually active [\[235\]](#page-337-8). To realize this, the authors extract the critical paths of the user application and build a calibration design based on these critical paths. The custom [FPGA](#page-362-0) bitstream then performs the self-calibration step. The authors also scale power and one global clock and evaluated their design using a [Finite Impulse Response \(FIR\)](#page-362-11) filter. Taka's design focuses on one specific application only, a RISC-V processor [\[236\]](#page-337-9). To detect whether the processor is operating correctly at a certain frequency and voltage, the authors slowly increase the frequency of the processor for each tested voltage. The authors then run a test application at all frequency steps and verify the application's output. If the application output is incorrect, the system assumes the frequency has been increased too much. As the system requires running software, the concept is limited to processors.

<span id="page-122-0"></span>

**Figure 3.19:** [DVS](#page-362-8) system for [FPGA](#page-362-0) proposed by Nunez-Yanez [\[231\]](#page-337-4). Voltage scaling is performed off-chip, as the used commercial [FPGA](#page-362-0) does not provide any on-chip voltage configuration.
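The frequency sweep described for the RISC-V design can be sketched as a simple search. The following Python code is a minimal illustration and not the implementation of the cited work: `run_test` stands for running the test application and checking its output, and `fake_test` is a made-up stand-in which assumes that the maximum frequency grows linearly with the supply voltage.

```python
def find_limits(voltages_mv, frequencies_mhz, run_test):
    """For each voltage, raise the frequency until the test output is wrong.

    run_test(voltage_mv, frequency_mhz) returns True if the output was correct.
    Returns a dict voltage -> highest passing frequency (None if none passed).
    """
    limits = {}
    for v in voltages_mv:
        best = None
        for f in sorted(frequencies_mhz):
            if not run_test(v, f):
                break       # output incorrect: frequency was raised too far
            best = f
        limits[v] = best
    return limits


def fake_test(voltage_mv, frequency_mhz):
    """Stand-in for the test application (assumed linear f_max model)."""
    return frequency_mhz * 5 <= voltage_mv - 300


limits = find_limits([800, 900, 1000], range(50, 201, 10), fake_test)
```

With the assumed model, the sweep finds limits of 100 MHz, 120 MHz and 140 MHz for the three voltages. In a real system, each call of `run_test` involves reconfiguring clock and supply and executing software, which makes the calibration comparatively slow.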

Other publications attempted to reduce power on [FPGAs](#page-362-0) using a dual-VDD design [\[237\]](#page-338-0): In such architectures, voltage scaling is not global and each [CLB](#page-361-3) can be programmed to use a high or a low supply voltage. For this, Gayasen et al. changed [EDA](#page-362-2) tools to assign [CLBs](#page-361-3) to one of the supply rails. After evaluating various techniques, one was found to provide an average power reduction of 61% in the MCNC benchmarks. Historically, a few such dual-VDD designs have been explored. With the availability of [SOI](#page-364-4) technology, fine-grain techniques have focused more on [BB,](#page-361-5) as explained in the previous section.

## <span id="page-123-1"></span>**3.6 Synthesis for Reconfigurable Cells**

Whereas various ambipolar reconfigurable cells have been described in literature, there are few publications focusing on [EDA](#page-362-2) flow integration to use these cells in [FPGA.](#page-362-0) As most of these cells are experimental and have not been manufactured in large-scale [FPGA](#page-362-0) devices, there has so far been no demand for [EDA](#page-362-2) tools. Nevertheless, a few publications have proposed solutions for ambipolar reconfigurable cells or for [ULMs](#page-364-9) and logic macro cells, which need similar [EDA](#page-362-2) flows. For [DGCNTFET,](#page-362-12) Zukoski et al. have not only described their logic cell, but also gave a quick introduction to the [EDA](#page-362-2) flow they use [\[238\]](#page-338-1). They use ABC for optimization and technology mapping. For the technology mapping itself, they provide a custom library of gates which represent the functions that can be realized by the reconfigurable cell. The place and route steps for use in a final [FPGA](#page-362-0) architecture are not further explained.

<span id="page-123-0"></span>

**Figure 3.20:** Asymmetric 4+5 [LUT](#page-363-2) (left) and extended 5 [LUT](#page-363-2) (right) by Anderson et al. [\[239\]](#page-338-2). The asymmetric [LUT](#page-363-2) was developed for trimming input optimization, the extended [LUT](#page-363-2) for gated input optimization.

Publications on [EDA](#page-362-2) for [ULM](#page-364-9) and macro-cell based reconfigurable architectures were proposed in the late 1990s and early 2000s. Due to their age, these publications use older synthesis approaches, based on [SOP](#page-364-10) representation. For example, Lin et al. use the SIS mapper for their reconfigurable architecture [\[240\]](#page-338-3). They first map a set of benchmarks to a generic 3-input [LUT](#page-363-2) architecture. They then analyze which functions are commonly used and derive a set of custom [ULM](#page-364-9) modules from this information. Synthesis for these [ULM](#page-364-9) modules is not explained in detail. Another [EDA](#page-362-2) flow has been presented by Cong et al., focusing on synthesis for $k/m$ cell reconfigurable logic [241]. $k/m$ cells are [PLA-](#page-363-3)like reconfigurable cells and can be described by the number of inputs, $k$, and the number of product terms that can be realized, $m$. Unlike a [PLA,](#page-363-3) $k/m$ cells always have exactly one output signal. Optimization and mapping for this cell has been implemented as a custom algorithm in SIS and verified using the MCNC benchmarks. For place and route, [VPR](#page-364-8) has been used.

[Figure 3.20](#page-123-0) on the preceding page shows a special reconfigurable cell derived by Anderson et al. [\[239\]](#page-338-2). The architecture has been developed to fit a specific, custom [EDA](#page-362-2) flow efficiently: The authors introduce the notion of trimming inputs and gated inputs. A trimming input is an input which, when used in a Shannon decomposition, yields a cofactor with more than one variable less than the original function. When applied to functions with 6 inputs, such a decomposition can be realized by the reconfigurable element on the left-hand side of [figure 3.20.](#page-123-0) A gating input is a special trimming input, which yields one constant cofactor in the Shannon decomposition. Such a decomposition has been realized with the cell on the right-hand side of the figure. The authors observe that such special inputs can be found quickly in [AIGs](#page-361-9) by finding non-inverting paths. The mapping algorithm can therefore be implemented in tools like ABC.
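The definitions of trimming and gating inputs can be illustrated on small truth tables. The following is a minimal sketch that follows the definitions as stated above, not the AIG-based detection used by Anderson et al.; the helper names (`truth_table`, `cofactor`, `support_size`, `classify_input`) and the two example functions are assumptions chosen for illustration.

```python
def truth_table(func, n):
    """Truth table of an n-input function; bit i of the index is input i."""
    return [int(func(*[(idx >> i) & 1 for i in range(n)]))
            for idx in range(1 << n)]


def cofactor(tt, n, var, value):
    """Shannon cofactor: truth table of f with input `var` fixed to `value`."""
    return [tt[idx] for idx in range(1 << n) if (idx >> var) & 1 == value]


def support_size(tt, n):
    """Number of inputs the function actually depends on."""
    return sum(cofactor(tt, n, v, 0) != cofactor(tt, n, v, 1)
               for v in range(n))


def classify_input(tt, n, var):
    """Classify input `var` as 'gating', 'trimming' or 'ordinary'."""
    sizes = [support_size(cofactor(tt, n, var, value), n - 1)
             for value in (0, 1)]
    if min(sizes) == 0:
        return "gating"    # one cofactor is constant
    if min(sizes) < n - 1:
        return "trimming"  # one cofactor lost more than one variable
    return "ordinary"


# Illustrative 3-input functions (inputs a, b, c are input 0, 1, 2)
mux = truth_table(lambda a, b, c: b if a else c, 3)
and_xor = truth_table(lambda a, b, c: a & (b ^ c), 3)

print(classify_input(mux, 3, 0))      # select input of the multiplexer
print(classify_input(and_xor, 3, 0))  # AND input that forces the output to 0
```

In this sketch the select input of the multiplexer is a trimming input, because each cofactor depends on a single remaining variable, while the AND input of the second function is a gating input, because fixing it to 0 yields a constant.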

Apart from special [EDA](#page-362-2) solutions for specific architectures, literature has also proposed solutions for hybrid architectures. These architectures do contain [LUTs,](#page-363-2) which are primarily used when the hard logic macro cannot represent a function. Such an approach was used by Hu et al. [\[242\]](#page-338-5): The authors first analyze which functions are commonly used in a set of benchmarks and then design a macro cell to realize those. They then propose an [FPGA](#page-362-0) architecture using both these cells and [LUTs.](#page-363-2) For [EDA,](#page-362-2) they use a standard cut-based mapping technique for [LUTs](#page-363-2) first. They then determine which [LUTs](#page-363-2) in the netlist could be replaced by a macro cell. Furthermore, they use a modified packing and repacking scheme to recover area. This step consists of realizing macro functions using a [LUT](#page-363-2) instead of a macro cell, if many free [LUTs](#page-363-2) are available.

A similar approach to combine [LUTs](#page-363-2) and [ULMs,](#page-364-9) also called universal logic generators, has been presented by Luo et al. [\[243\]](#page-338-6). The publication focuses on the description of the complete [EDA](#page-362-2) flow, as shown in [figure 3.21.](#page-125-0) It uses ABC for optimization and technology mapping to standard 4-input [LUTs.](#page-363-2) In a post-processing pass, the netlist is modified to replace [LUT](#page-363-2) instances with [ULMs](#page-364-9) when possible. Packing, placement and routing are performed in [VPR](#page-364-8) as usual. The authors however note that a [LUT](#page-363-2) can also realize the [ULM](#page-364-9) function, but not the other way round. They therefore specify [LUTs](#page-363-2) as multimode cells in the [VPR](#page-364-8) architecture, enabling [VPR](#page-364-8) to also place logic mapped to [ULMs](#page-364-9) into [LUTs.](#page-363-2)
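The post-processing pass and the multimode idea can be sketched as follows. This is an illustrative model only and not the implementation by Luo et al.: the `Cell` type, the set `ULM_FUNCTIONS` (here simply 4-input AND, OR and XOR as 16-bit truth tables) and the function names are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    name: str
    kind: str         # "lut" or "ulm"
    truth_table: int  # 16-bit truth table of a 4-input function


# Hypothetical set of functions a ULM can realize: AND4, OR4, XOR4
ULM_FUNCTIONS = frozenset({0x8000, 0xFFFE, 0x6996})


def replace_luts(netlist, ulm_functions=ULM_FUNCTIONS):
    """Post-processing pass: turn eligible LUT instances into ULM instances."""
    return [
        Cell(c.name, "ulm", c.truth_table)
        if c.kind == "lut" and c.truth_table in ulm_functions
        else c
        for c in netlist
    ]


def can_host(site_kind, cell):
    """A LUT site can host LUT and ULM cells, a ULM site only ULM cells."""
    return site_kind == "lut" or cell.kind == "ulm"


netlist = [Cell("and4", "lut", 0x8000), Cell("misc", "lut", 0x1234)]
mapped = replace_luts(netlist)
print([c.kind for c in mapped])
```

The asymmetric `can_host` relation mirrors the observation that a LUT can realize every ULM function, but not the other way round.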

<span id="page-125-0"></span>

**Figure 3.21:** [EDA](#page-362-2) flow for a hybrid [FPGA](#page-362-0) consisting of [LUTs](#page-363-2) and universal logic generators (ULG) [\[243\]](#page-338-6). Applications are first mapped to [LUTs](#page-363-2) as usual, then eligible [LUTs](#page-363-2) are replaced by ULGs.

# **3.7 Summary**

[Table 3.3](#page-126-0) shows a quick summary of the related works which are most closely related to this dissertation. Table columns show: Whether a non[-LUT](#page-363-2) [LE](#page-362-3) is used, whether moving and freezing of data is supported, whether power reduction is supported, whether [PVTA](#page-364-1) compensation and measurement happen once or repeatedly, whether compensation is local or global and whether a delay measurement system is used and if it reuses existing logic. This dissertation combines three aspects, which has not been done in any related work: [RFET](#page-364-3) standard cell usage, system level [FPGA](#page-362-0) power management and [PVTA](#page-364-1) compensation, and low level reconfiguration aspects. As there is little overlap, the first aspect was not included in the table, but is explained in [section 3.1.](#page-91-0) As the table shows, no previous work has combined the low-level reconfiguration system with [PVTA](#page-364-1) compensation and power reduction techniques. Combining these aspects is the main novelty of this thesis, as it allows for fully dynamic, local [PVTA](#page-364-1) compensation with limited resource overhead. And whereas some works do use [FPGA](#page-362-0) resources for measurement, none is able to share the resources with an application. To the best of the author's knowledge, the transparent logic invasion concept has not been proposed anywhere previously.

<span id="page-126-0"></span>

# **Part II**

# **PARFAIT Architecture**

# **Chapter 4**

# **Overall System**

In the following chapters, state-of-the-art [PVTA](#page-364-1) compensation and power saving measures in [FPGAs](#page-362-0) will be investigated to be used with ambipolar transistor devices. To keep the overall system independent of technology, it will also be adapted and evaluated for a state-of-the-art [SOI](#page-364-4) technology. This chapter will first introduce a standard [FPGA](#page-362-0) architecture for [PARFAIT,](#page-363-4) which will be used as the baseline in the evaluation. It furthermore introduces the simulation models and parameters for the technologies used. The chapter concludes with a quick preview of topics which will be investigated in more detail in following chapters.

### **4.1 FPGA Base Architecture**

**Top Architecture** [Figure 4.1](#page-130-0) shows the top-level view of the base [FPGA](#page-362-0) architecture used in this thesis. It is based on *k6\_frac\_N10\_40nm*, which is one of the reference architectures shipped with [Verilog to Routing \(VTR\).](#page-364-11) These architectures model a 40 nm technology [FPGA,](#page-362-0) trying to match the Stratix IV architecture closely. Area, wire segment lengths, capacitances and resistances have been modelled to match this commercial [FPGA](#page-362-0) architecture. *k6\_frac\_N10\_40nm* is based on the main [VTR](#page-364-11) flagship architecture, but with the memory blocks, [DSP](#page-362-13) blocks and carry chains removed: As these elements are not relevant for this thesis, the simplest of the flagship architecture versions, consisting of only [CLBs](#page-361-3) and [Input-/Output-Blocks \(IOBs\),](#page-362-14) was selected. The flagship architecture uses uniform channel widths of 1.0 for all channels and Wilton switch blocks with $f_s = 3$. It uses only one type of wire segments of length 4 and fully populated connections to [CB](#page-361-10) and [SB,](#page-364-7) i.e. wires are connected to those elements whenever they intersect. Blocks shown in white are the logic generators [\(CLBs\)](#page-361-3), and blocks with a dotted pattern are [IOBs](#page-362-14) for external connectivity. For more details about this architecture, refer to the [VTR](#page-364-11) reference manual.

<span id="page-130-0"></span>



**Figure 4.1:** Top-level view of [VTR'](#page-367-0)s flagship *k6\_frac\_N10\_40nm* architecture specified in the timing/k6\_frac\_N10\_40nm.xml file shipped with [VTR](#page-367-0) version 8. It consists of [IOBs](#page-362-14) at the periphery and [CLBs](#page-361-3) in all remaining locations. The corners do not contain any blocks.

A reduced excerpt of the XML architecture description is given in [listing 4.1.](#page-130-1) Physical values depending on technology such as delays, capacitances, resistances and area have been removed in the excerpt for brevity. For all evaluations performed with [VPR,](#page-364-8) the values have been kept as in the original file.

```
 1 <architecture>
 2   <layout>
 3     <auto_layout aspect_ratio="1.0">
 4       <perimeter type="io" priority="100"/>
 5       <corners type="EMPTY" priority="101"/>
 6       <fill type="clb" priority="10"/>
 7     </auto_layout>
 8   </layout>
 9
10   <device>
11     <chan_width_distr>
12       <x distr="uniform" peak="1.000000"/>
13       <y distr="uniform" peak="1.000000"/>
14     </chan_width_distr>
15     <switch_block type="wilton" fs="3"/>
16     <connection_block input_switch_name="ipin_cblock"/>
17   </device>
18   <switchlist>...</switchlist>
19   <segmentlist>
20     <segment freq="1.000000" length="4" type="unidir">
21       <mux name="0"/>
22       <sb type="pattern">1 1 1 1 1</sb>
23       <cb type="pattern">1 1 1 1</cb>
24     </segment>
25   </segmentlist>
26
27   <complexblocklist>..</complexblocklist>
28   <power>..</power>
29   <clocks>..</clocks>
30 </architecture>
```


The excerpt shows how architecture dimensions are defined and how specific blocks are assigned to locations, according to their priority. The `auto_layout` statement in line 3 instructs [VPR](#page-364-8) to automatically size the device to fit the [FPGA](#page-362-0) user application which is currently being placed. Lines 11 – 14 specify the channel width, which is given as a relative fraction of the architecture channel width. This final architecture channel width is usually determined dynamically by [VPR](#page-364-8) when placing the application: [VPR](#page-364-8) starts with a small value and increases the channel width whenever it cannot route a design due to congestion.

The following lines specify [SB](#page-364-7) and [CB](#page-361-10) details. For the [SB,](#page-364-7) only the switching pattern is specified here. For the [CB,](#page-361-10) a switch is selected from the `switchlist`. This architecture uses only one [SB](#page-364-7) and one [CB](#page-361-10) type. Lines 19 – 25 specify the routing segments available in the global interconnect. They set the wire length to 4 and specify that the wires connect to all [SBs](#page-364-7) and [CBs.](#page-361-10) Line 28 specifies technological parameters for power analysis and line 29 specifies the clock, for which the reference architecture defines only a single one. Line 27 describes the blocks in the architecture in more detail.
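The channel width search described above can be sketched in a few lines. This is a simplified illustration and not VPR's actual search strategy, which is more elaborate; the callback `try_route`, the start value, the limit and the step of 2 are assumptions made for this example.

```python
def find_min_channel_width(try_route, start=2, step=2, limit=512):
    """Increase the channel width until `try_route(width)` succeeds.

    `try_route` is a hypothetical callback that returns True if the design
    can be routed with the given channel width.
    """
    width = start
    while width <= limit:
        if try_route(width):
            return width
        width += step
    raise RuntimeError("design is not routable within the width limit")


# Toy stand-in for a router: the design needs at least 37 tracks.
print(find_min_channel_width(lambda width: width >= 37))
```

With the relative channel width distribution of 1.0 in both directions, as in lines 11 – 14 of the excerpt, the resulting value applies uniformly to all channels.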

**[IOB](#page-362-14) Description** [IOBs](#page-362-14) will not be further examined in this thesis, but their architecture affects the placement of [FPGA](#page-362-0) user applications in the evaluation. Therefore, a quick description of the blocks is given here, whereas the full [IOB](#page-362-14) architecture description can be found in the appendix, see [listing B.1](#page-371-0) on page [349.](#page-371-0) [IOBs](#page-362-14) in the architecture operate in one of two modes, but never in both at the same time: An [IOB](#page-362-14) can either be in input, or in output mode. [IOBs](#page-362-14) are clocked devices and therefore connect to the global device clock. On each [IOB](#page-362-14) location, there can be up to 8 instances of [IOBs.](#page-362-14) This means that each single location provides up to 8 inputs or outputs. For inputs, [IOBs](#page-362-14) connect to 15% of the wires in a channel, for outputs to 10%. The [IOB](#page-362-14) architecture model specifies pins on all sides, allowing for a single model to be used for all [FPGA](#page-362-0) sides. As only one of the [IOB](#page-362-14) sides is ever connected to the global interconnect, this does not increase the number of available connections for the [IOB.](#page-362-14)

<span id="page-132-0"></span>

**Figure 4.2:** [CLB](#page-361-3) in [VTR's](#page-367-0) flagship *k6\_frac\_N10\_40nm* architecture for [VTR](#page-367-0) version 8, containing 10 [Fracturable Logic Elements \(FLEs\).](#page-362-15)

**[CLB](#page-361-3) Description** [Figure 4.2](#page-132-0) shows the structure of the [CLB](#page-361-3) logic generator in the reference architecture. It is a cluster of $N = 10$ [FLEs,](#page-362-15) where each [FLE](#page-362-15) can be configured as either one $K = 6$ input [LUT,](#page-363-2) or as two $K = 5$ input [LUTs](#page-363-2) with identical inputs. The [CLB](#page-361-3) connects to the global clock, which is forwarded to the [FLE.](#page-362-15) To connect the inputs to the global interconnect, it contains a fully populated crossbar. The crossbar connects 40 inputs to the tracks in the adjacent channel through a [CB.](#page-361-10) In addition, all 20 outputs of the [FLE](#page-362-15) are fed back into the crossbar to realize a local interconnect. This allows multiple [FLEs](#page-362-15) to be connected in series without using global routing resources. The crossbar then provides 60 inputs to the [FLEs,](#page-362-15) 6 for each single one. As the crossbar is fully populated, all outputs can select any of the inputs. The [CLB](#page-361-3) connects to 15% of input wires in a channel and 10% of outputs. An architecture description, reduced in the same way as the top architecture description, is shown below in [listing 4.2.](#page-133-0)

```
<pb_type name="clb" area="53894">
  <input name="I" num_pins="40" equivalent="full"/>
  <output name="O" num_pins="20" equivalent="none"/>
  <clock name="clk" num_pins="1"/>

  <pb_type name="fle" num_pb="10">
    <input name="in" num_pins="6"/>
    <output name="out" num_pins="2"/>
    <clock name="clk" num_pins="1"/>

    <mode name="n2_lut5">...</mode>
    <mode name="n1_lut6">...</mode>
  </pb_type>
  <interconnect>
    <complete name="crossbar" input="clb.I fle[9:0].out" output="fle[9:0].in"></complete>
    <complete name="clks" input="clb.clk" output="fle[9:0].clk"></complete>
    <direct name="clbouts1" input="fle[9:0].out[0:0]" output="clb.O[9:0]"/>
    <direct name="clbouts2" input="fle[9:0].out[1:1]" output="clb.O[19:10]"/>
  </interconnect>

  <fc in_type="frac" in_val="0.15" out_type="frac" out_val="0.10"/>
  <pinlocations pattern="spread"/>
</pb_type>
```
**Listing 4.2:** [CLB](#page-361-3) part of the XML architecture description used by the [VTR](#page-364-11) framework and [VPR.](#page-364-8)
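The pin counts given in the [CLB](#page-361-3) description can be cross-checked with a short calculation. The sketch below only restates the numbers from the text and the listing; the helper `tracks_per_pin`, its rounding and the example channel width of 100 are assumptions and do not necessarily match how [VPR](#page-364-8) evaluates fractional connectivity.

```python
N_FLE = 10        # FLEs per CLB
FLE_INPUTS = 6    # inputs per FLE
FLE_OUTPUTS = 2   # outputs per FLE
CLB_INPUTS = 40   # CLB inputs from the global interconnect

# The crossbar sees the CLB inputs plus the fed-back FLE outputs ...
crossbar_inputs = CLB_INPUTS + N_FLE * FLE_OUTPUTS
# ... and drives every FLE input.
crossbar_outputs = N_FLE * FLE_INPUTS


def tracks_per_pin(channel_width, fc_frac):
    """Tracks a pin connects to for a fractional connectivity (assumed rounding)."""
    return max(1, round(channel_width * fc_frac))


print(crossbar_inputs, crossbar_outputs)
print(tracks_per_pin(100, 0.15), tracks_per_pin(100, 0.10))
```

As the crossbar is fully populated, each of its 60 outputs corresponds to a multiplexer selecting from all 60 crossbar inputs.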

**[BLE](#page-361-11) Description** A structural depiction of the [FLE](#page-362-15) modes is shown in [figure 4.3,](#page-134-0) where [figure 4.3a](#page-134-0) shows the 6-input [LUT](#page-363-2) [FLE](#page-362-15) mode. It contains a [Basic Logic Element \(BLE\)](#page-361-11) with an additional [FF](#page-362-4) and a [MUX](#page-363-5) to select between the [LUT](#page-363-2) output and the [FF](#page-362-4) output. This way, the [FLE](#page-362-15) can realize both combinational and sequential logic. The architecture description XML for this mode is given in the appendix, see [listing B.2](#page-372-0) on page [350.](#page-372-0) [Figure 4.3b](#page-134-0) shows the second mode the [FLE](#page-362-15) can operate in, as two 5-input [LUTs.](#page-363-2) In this mode, only the first 5 of the 6 available inputs are used. The inputs are connected identically to both [LUTs.](#page-363-2) [LUTs](#page-363-2) are again combined with a [FF](#page-362-4) and [MUX](#page-363-5) in a [BLE.](#page-361-11) This mode provides two outputs, which are both connected individually to the outputs of the [CLB.](#page-361-3) This mode's architecture description XML can also be found in the appendix, see [listing B.3](#page-373-0) on page [351.](#page-373-0)

<span id="page-134-0"></span>

**Figure 4.3:** [FLE](#page-362-15) used in the [CLBs](#page-361-3) in [VTR'](#page-367-0)s flagship *k6\_frac\_N10\_40nm* architecture. The [FLE](#page-362-15) can be configured to operate in one of two modes: **[\(a\)](#page-134-0)** [LUT6](#page-363-2) mode which implements a single 6-input [LUT](#page-363-2) with one output. **[\(b\)](#page-134-0)** Fractured mode, which implements two [LUT5](#page-363-2) with the same 5 inputs and provides two outputs in total.

**[RFET](#page-364-3) Hard [IP](#page-362-16)** In order to make use of the [BG](#page-361-12) of [RFETs](#page-364-3) to trade off performance and static leakage, as introduced in [section 2.2](#page-36-0) on page [14,](#page-36-0) the transistors in the [FPGA](#page-362-0) architecture have to be replaced with [RFETs.](#page-364-3) One approach to achieve this is by replacing common [CMOS](#page-361-13) cells with [RFET](#page-364-3) standard cells. [Chapter 5](#page-183-0) on page [161](#page-183-0) will therefore present an [RFET](#page-364-3) standard cell library. It will also describe an [EDA](#page-362-2) flow as introduced in [section 3.6](#page-123-1) on page [101](#page-123-1) to make use of them. Results will be compared with the state-of-the-art works introduced in [section 3.1](#page-91-0) on page [69.](#page-91-0)

**[RFET](#page-364-3) Logic Generators** Whereas [RFETs](#page-364-3) can be used as drop-in replacements for standard silicon devices, such a usage neglects some of the benefits of [RFETs,](#page-364-3) especially for reconfigurable [LEs.](#page-362-3) This thesis will therefore analyze substitution of [CLBs](#page-361-3) in the architecture with [RFET](#page-364-3) based [LEs,](#page-362-3) as introduced in [section 2.6](#page-75-0) on page [53.](#page-75-0) Details of this analysis will be presented in [chapter 6](#page-199-0) on page [177](#page-199-0) and results will be compared to the state-of-the-art architectures of [section 3.2](#page-96-0) on page [74.](#page-96-0)

Due to different propagation delays of the cells, replacing these [CLBs](#page-361-3) in [figure 4.1](#page-130-0) on page [108](#page-130-0) and adjusting their [BG](#page-361-12) voltages will not only affect the leakage current, but also the performance of the [FPGA:](#page-362-0) In that figure, an example of a critical path is shown in orange. It depicts the path with the least available slack as introduced in [section 2.3](#page-42-0) on page [20,](#page-42-0) for a specific user application. Similarly to the example in that chapter, the total propagation delay $t_{\text{PD}}$ is given as the sum of propagation delays in the wiring and in the combinational logic. And similarly to the introduction there, setup and hold conditions dictate the maximum clock frequency [\(equation \(2.15\)](#page-47-0) on page [25\)](#page-47-0) and the minimal delay [\(equation \(2.14\)](#page-47-1) on page [25\)](#page-47-1). A major difference is that in [FPGAs,](#page-362-0) there are no logic gates to be considered. Delays are instead caused by the [LEs](#page-362-3) [\(section 2.5](#page-67-0) on page [45\)](#page-67-0) and [MUXs](#page-363-5) in the interconnect.

This dissertation has to answer the question of how a change in [BG](#page-361-12) voltage will affect these delays and how such a change can be made at application runtime without violating timing constraints. As shown in [equation \(2.3\)](#page-34-0) on page [12,](#page-34-0) for normal [CMOS](#page-361-13) devices, $I_D$ is dependent on $V_{th}$, which is modulated in [RFETs](#page-364-3) using the back gate:

$$
\begin{aligned}
I_D &= \frac{W}{L_{\text{eff}} P_C} (V_{\text{GS}} - V_{\text{th}})^{\alpha} \\
    &= C_{\alpha} \mu (V_{\text{GS}} - V_{\text{th}})^{\alpha}
\end{aligned} \tag{4.1}
$$

$t_{\rm PD}$ on the other hand depends on $I_{\rm D}$, as introduced previously in [equation \(2.10\)](#page-45-0) on page [23:](#page-45-0)

$$
t_{\rm PD} = \left(\frac{1}{2} - \frac{1 - \nu_{\rm T}}{1 + \alpha}\right) t_{\rm T} + \frac{C_{\rm L} V_{\rm DD}}{2I_{\rm D}}\tag{4.2}
$$

This leads to a dependency of $t_{\text{PD}}$ on the [BG](#page-361-12) voltage. The propagation delay of a combined path, containing multiple transistors or gates, can be described as in [equation \(2.13\)](#page-45-1) on page [23:](#page-45-1)

$$
t_{\rm PD,Path} \propto \frac{C_{\rm tot} V_{\rm DD}}{\mu (V_{\rm GS} - V_{\rm th})^{\alpha}} \tag{4.3}
$$

The exact relation of $t_{\text{PD}}$ and the [BG](#page-361-12) voltage depends on the used technology. It can be simulated in SPICE simulations, if a suitable model for dynamic behavior is included in the [PDK.](#page-363-6) As shown in [section 3.1](#page-91-0) on page [69,](#page-91-0) this is commonly not the case for [RFET](#page-364-3) technologies. As an alternative, the Scarpato Model introduced in [section 3.4](#page-107-0) on page [85](#page-107-0) will be adopted and parametrized for the considered target technologies. This parametrization will be detailed in the following sections in this chapter.
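To get a feeling for the proportionality in equation (4.3), the relative change in path delay for a shifted threshold voltage can be computed directly, as $C_{\rm tot}$, $V_{\rm DD}$ and $\mu$ cancel in the ratio. This is a minimal sketch and not the Scarpato Model used later in this thesis; the function name and the numeric values for $V_{\rm GS}$, $V_{\rm th}$ and $\alpha$ are illustrative assumptions and do not describe any of the target technologies.

```python
def relative_path_delay(v_th, v_th_ref, v_gs=0.8, alpha=1.3):
    """Ratio t_PD(v_th) / t_PD(v_th_ref) following equation (4.3).

    All parameter values are placeholders; C_tot, V_DD and mu cancel out.
    """
    if v_gs <= v_th or v_gs <= v_th_ref:
        raise ValueError("model requires V_GS > V_th")
    return ((v_gs - v_th_ref) / (v_gs - v_th)) ** alpha


# Raising V_th by 100 mV (lower leakage) slows the path down:
print(relative_path_delay(v_th=0.4, v_th_ref=0.3))
```

How a given [BG](#page-361-12) voltage translates into a threshold voltage shift is technology dependent and not covered by this sketch.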

Possible change of $t_{\text{PD}}$ in the critical path also needs to be considered during user application synthesis, which will be covered as part of [chapter 7](#page-217-0) on page [195.](#page-217-0) Furthermore, using different [LEs](#page-362-3) requires changes in [EDA](#page-362-2) tools, which will be discussed in [chapter 6](#page-199-0) on page [177](#page-199-0) and compared to the state-of-the-art solutions of [section 3.6](#page-123-1) on page [101.](#page-123-1)

Power can be described as static and dynamic power, as shown in [equations \(2.8\)](#page-44-0) and [\(2.9\)](#page-44-1) on page [22:](#page-44-0)

<span id="page-136-1"></span><span id="page-136-0"></span>
$$
P_{\text{total}} = P_{\text{dynamic}} + P_{\text{static}} \tag{4.4}
$$

$$
P_{\text{dynamic}} = P_{\text{switching}} + P_{\text{short circuit}} \tag{4.5}
$$

$$
P_{\text{static}} = (I_{\text{sub}} + I_{\text{gate}} + I_{\text{junct}} + I_{\text{cont}})V_{\text{DD}} \tag{4.6}
$$

$$
P_{\text{switching}} = \alpha C_{\text{L}} V_{\text{DD}}^2 f \tag{4.7}
$$

As static leakage current is also dependent on $I_{\rm D}$, there is a trade-off between performance and power usage that will be exploited in power management techniques similar to [section 3.5](#page-118-1) on page [96](#page-118-1) and will be evaluated in [chapter 9](#page-253-0) on page [231.](#page-253-0)
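Equations (4.4) to (4.7) translate directly into a small calculation. The following sketch uses made-up example values; the function names are assumptions, and the activity factor $\alpha$ of equation (4.7) is called `activity` to distinguish it from the exponent in equation (4.1).

```python
def switching_power(activity, c_load, v_dd, f):
    """Equation (4.7): P_switching = alpha * C_L * V_DD^2 * f."""
    return activity * c_load * v_dd ** 2 * f


def static_power(leakage_currents, v_dd):
    """Equation (4.6): sum of all leakage currents times V_DD."""
    return sum(leakage_currents) * v_dd


def total_power(activity, c_load, v_dd, f, leakage_currents,
                p_short_circuit=0.0):
    """Equations (4.4) and (4.5): dynamic plus static power."""
    p_dynamic = switching_power(activity, c_load, v_dd, f) + p_short_circuit
    return p_dynamic + static_power(leakage_currents, v_dd)


# Illustrative values only: 1 pF load, 1 V supply, 100 MHz, 1 uA leakage
leakage = [1e-6, 0.0, 0.0, 0.0]  # I_sub, I_gate, I_junct, I_cont
print(total_power(0.1, 1e-12, 1.0, 1e8, leakage))
# An unused cell does not switch (f = 0) but still leaks:
print(total_power(0.1, 1e-12, 1.0, 0.0, leakage))
```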

#### **4.2 FPGA Power Regions**

[Figure 4.4](#page-137-0) shows the reference architecture of [figure 4.1](#page-130-0) with a slightly more complex user application, depicted in orange. As depicted in this example, utilization of [FPGAs](#page-362-0) is commonly below 100% in real world applications. Full utilization usually reduces performance because of routing congestion or makes designs not routable in extreme cases. Other effects reducing the total utilization are limited amounts and fixed positions of other resources, such as [DSP](#page-362-13) blocks or specific [IOBs.](#page-362-14)

As introduced in the previous section, it is now possible to scale the overall performance vs. leakage power of the [FPGA,](#page-362-0) as long as the timing constraints for the critical path hold. However, as can also be seen, the utilization of [CLBs](#page-361-3) can vary locally: This example design was placed into the bottom left corner of the device, adjacent to the used [IOBs.](#page-362-14) On the other hand, almost no [CLBs](#page-361-3) in the top right corner are utilized. With a global power scaling approach as presented in [section 3.5](#page-118-1) on page [96,](#page-118-1) those unused cells still draw static power $P_{\text{static}}$ due to leakage (see [equation \(4.6\)\)](#page-136-0). As a first estimate, it can be assumed that there is however no dynamic power use $P_{\text{dynamic}}$, as there is no switching, i.e. $f = 0$ (see [equation \(4.5\)\)](#page-136-1).

<span id="page-137-0"></span>

**Figure 4.4:** The *k6\_frac\_N10\_40nm* architecture with a placed user application (orange) and the newly introduced region grid (blue). Region size was arbitrarily chosen as 3x3 blocks, including [IOBs.](#page-362-14)

To avoid static power in unused [CLBs,](#page-361-3) a region based power management strategy is introduced for the [PARFAIT](#page-363-4) system. Regions are depicted in blue in [figure 4.4](#page-137-0) and allow voltage scaling approaches to be used locally. The concepts introduced in the dissertation are largely independent of the power management technology used. The evaluation of the [RFET](#page-364-3) system will use [BG](#page-361-12) based threshold voltage adjustment, whereas the evaluation of the [SOI](#page-364-4) reference technology will use body biasing. The approaches and tools presented here can also be used for supply voltage scaling, as previously presented in literature. In that case, the limited output voltage swing of cells needs to be considered when driving cells in other regions. As the approaches used here do not scale the supply voltage, this problem is not relevant for the [PARFAIT](#page-363-4) system.

With the exemplary region size and application of [figure 4.4,](#page-137-0) four of nine regions could be turned off completely. But even for regions where this is not the case, more fine-grained power management is now possible: Timing requirements in the form of the critical path can now be considered independently for the regions. Instead of focusing on the one global critical path, the critical path in each region needs to be considered. This becomes especially obvious when thinking of examples where the critical path does not cross one region at all. Nevertheless, timing constraints for paths in this region need to hold, and a local critical path needs to be tracked. For this, the [EDA](#page-362-2) process for user applications introduced in [section 2.8](#page-85-0) on page [63](#page-85-0) needs to be modified: The critical path needs to be determined and stored for each single region. Special care must be taken for paths which cross region boundaries, as scaling the performance in one region affects the performance of the whole path. It might therefore preclude further scaling in other regions which contain this path as well.
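The region bookkeeping described above can be sketched as follows. This is a simplified illustration and not the modified [EDA](#page-362-2) flow of this thesis: block coordinates, the path representation as `(slack, blocks)` tuples and all function names are assumptions. The sketch only records the worst slack seen per region; distributing the slack of a path that crosses several regions among them is deliberately left out.

```python
REGION_SIZE = 3  # blocks per region edge, as in the 3x3 example


def region_of(x, y, size=REGION_SIZE):
    """Map a block coordinate to the index of its region."""
    return (x // size, y // size)


def unused_regions(grid_w, grid_h, used_blocks, size=REGION_SIZE):
    """Regions that contain no used block and could be turned off."""
    all_regions = {region_of(x, y, size)
                   for x in range(grid_w) for y in range(grid_h)}
    used = {region_of(x, y, size) for x, y in used_blocks}
    return all_regions - used


def region_slack(paths, size=REGION_SIZE):
    """Worst (smallest) slack of any path touching each region."""
    worst = {}
    for slack, blocks in paths:
        for region in {region_of(x, y, size) for x, y in blocks}:
            worst[region] = min(slack, worst.get(region, float("inf")))
    return worst


# Hypothetical placement on a 9x9 grid and two timing paths
used = [(0, 0), (1, 2), (4, 1), (2, 4)]
paths = [(0.5, [(0, 0), (1, 1)]),   # stays inside region (0, 0)
         (2.0, [(1, 1), (4, 1)])]   # crosses into region (1, 0)
print(sorted(unused_regions(9, 9, used)))
print(region_slack(paths))
```

In this example, region (1, 0) is only touched by the second path and therefore has more slack available than region (0, 0), even though both regions share that path.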

In addition to critical path extraction, a region based architecture calls for further optimization in the [EDA](#page-362-2) tools: If placement is made aware of regions by factoring in the cost of placing logic into currently unused regions, keeping regions empty can be incentivized. Another approach is to predefine regions with fixed performance/power trade-offs at [FPGA](#page-362-0) manufacturing time. Such a static high performance / low power region approach reduces complexity and circuit overhead, as dynamic scaling after in-the-field programming is not required. [EDA](#page-362-2) tool modifications presented here will also enable evaluation of such systems.
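A region-aware placement cost can be pictured as an additional penalty term. The sketch below is a hypothetical illustration of the idea and not the cost function used in this thesis; the names, the additive form and the default weight are assumptions.

```python
def region_penalty(candidate, occupied_regions, weight=1.0, size=3):
    """Penalty for placing a block into a region that is still empty."""
    region = (candidate[0] // size, candidate[1] // size)
    return 0.0 if region in occupied_regions else weight


def placement_cost(base_cost, candidate, occupied_regions,
                   weight=1.0, size=3):
    """Base placement cost (e.g. wirelength or timing) plus region penalty."""
    return base_cost + region_penalty(candidate, occupied_regions,
                                      weight, size)


occupied = {(0, 0)}
print(placement_cost(10.0, (1, 1), occupied))              # used region
print(placement_cost(10.0, (4, 1), occupied, weight=2.5))  # empty region
```

A larger weight makes the placer more reluctant to open a new region, at the expense of the base cost.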

Details on power management region aspects of the [PARFAIT](#page-363-4) system will be presented in [chapter 7](#page-217-0) on page [195](#page-217-0) and a simulation framework to evaluate the whole system including regions support will be presented in [chapter 9](#page-253-0) on page [231.](#page-253-0) Final [Design Space Evaluation \(DSE\)](#page-362-17) of parameters, such as the region size or cost function weights, will then be presented in [chapter 10](#page-275-0) on page [253.](#page-275-0)

# **4.3 PVTA Compensation**

Previous sections explained how voltage adjustment can be used to reduce performance where the slack of the critical paths allows it, and thereby decrease the static leakage. In a first implementation, this can be realized as a static approach: The user application can be analyzed in an [EDA](#page-362-2) tool and the slack can be derived in each region and then stored as part of the [FPGA](#page-362-0) bitstream. During programming, each region is set to the proper performance level, which then does not change at runtime.

<span id="page-139-0"></span>

**Figure 4.5:** The *k6\_frac\_N10\_40nm* architecture with the shade of red in each region representing local performance variations due to [PVTA.](#page-364-1) Effects vary within each region as well, but have been limited to region scope here for readability.

Such a static approach however wastes some potential. As has been discussed in [section 2.4](#page-49-0) on page [27,](#page-49-0) manufactured [FPGA](#page-362-0) [ICs](#page-362-1) have local performance variations, which then cause pessimistic results in [STA.](#page-364-12) This effect is depicted in [figure 4.5,](#page-139-0) where different shades of red show different performance in the regions. Even when the application is analyzed during [EDA,](#page-362-2) as the performance of each region is not known ahead of time, the performance adjustment needs to be performed conservatively.

To address this in the [PARFAIT](#page-363-4) system, a dynamic, closed loop control system for performance in each region is introduced. The overall, high level concept is presented in [figure 4.6.](#page-140-0)

<span id="page-140-0"></span>

**Figure 4.6:** Region detail and region controller concept. The figures show a zoomed in detail of the central region in [figure 4.4.](#page-137-0) **[\(a\)](#page-140-0)** Region controller in idle mode. **[\(b\)](#page-140-0)** Row was reconfigured dynamically with measurement circuits. **[\(c\)](#page-140-0)** Delay characterization is active.

[Figure 4.6a](#page-140-0) shows a zoomed in detail of one region. In addition to the [FPGA](#page-362-0) elements of the base architecture, it also contains the building blocks for the closed loop control system introduced in [section 3.4](#page-107-0) on page [85:](#page-107-0) Critical path identification is performed in [EDA](#page-362-2) tools for the [FPGA](#page-362-0) application. The determined, available slack is used to derive a slack factor, which describes the relative percentage of how much the delay in a region is allowed to increase. This is stored as the *Target* for each region.
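The derivation of a per-region *Target* can be illustrated with a small sketch. The exact formula is not given at this point, so the definition below (slack divided by critical path delay) is an assumption, and the region names and delay values are purely hypothetical.

```python
def slack_factor(clock_period_ns: float, critical_delay_ns: float) -> float:
    """Relative delay increase a region can tolerate (0.25 means +25 %).

    Assumption: slack factor = available slack / critical path delay.
    """
    if critical_delay_ns <= 0:
        raise ValueError("critical path delay must be positive")
    slack_ns = clock_period_ns - critical_delay_ns
    # Negative slack means timing is already violated: no slowdown allowed.
    return max(slack_ns, 0.0) / critical_delay_ns


# Hypothetical per-region critical path delays as reported by timing analysis.
region_delays_ns = {"r0": 8.0, "r1": 5.0, "r2": 10.0}
targets = {name: slack_factor(10.0, d) for name, d in region_delays_ns.items()}
```

With a 10 ns clock period, region `r0` could become 25 % slower, while region `r2` has no slack and has to stay at nominal performance.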

The architecture does not make use of path replicas. As paths on [FPGA](#page-362-0) are always made up of the base [LEs](#page-362-3) anyway, it is sufficient to characterize these basic [LEs.](#page-362-3) This work is performed in the *Measure* block. The PARFAIT architecture measures the real delays of real [LEs,](#page-362-3) not artificial, additional logic. This also enables the architecture to effectively measure each single [CLB](#page-361-3) and [BLE](#page-361-11) on the [FPGA.](#page-362-0) In order to keep resource overhead low, the architecture uses fine-grain dynamic reconfiguration [\(section 3.3](#page-99-0) on page [77\)](#page-99-0) to temporarily reprogram areas of the [FPGA.](#page-362-0) As depicted in [figure 4.6b,](#page-140-0) the architecture therefore transparently moves the user application down one row in the [FPGA](#page-362-0) on a row-by-row basis. In order for this to work, one row of the [FPGA](#page-362-0) needs to stay unprogrammed. Furthermore, some special handling is needed to ensure that the user application can continue to be executed. Details about this are given in [section 8.2](#page-228-0) on page [206.](#page-228-0)

Once the original application logic has been moved, the row is reprogrammed with specific delay measurement circuits. This process is called "Logic Invasion" in the remainder of this thesis, as it replaces the user application logic. While most of the measurement logic is kept in reconfigurable cells, a central *Measure* controller orchestrates the measurements, as shown in [figure 4.6c.](#page-140-0) The measured delay is then compared to the nominal delay and the target delay in a controller system. This controller appropriately adjusts the performance in the regions using a voltage controller. Although not directly visible in [figure 4.6,](#page-140-0) this effectively forms a closed loop system: The change in voltage will change the cell delay, as introduced previously, closing the loop.
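The measure, compare and adjust cycle can be illustrated with a deliberately simple sketch. Everything in it is an assumption made for illustration: the delay model (delay inversely proportional to voltage), the fixed step size, the voltage limits and the function names are hypothetical and do not describe the actual PARFAIT controller or technology model.

```python
def toy_delay_ns(voltage_mv: int, nominal_delay_ns: float = 1.0,
                 nominal_mv: int = 1000) -> float:
    """Toy cell model (assumption): delay grows as the voltage is lowered."""
    return nominal_delay_ns * nominal_mv / voltage_mv


def control_step(voltage_mv: int, measured_ns: float, nominal_ns: float,
                 target: float, step_mv: int = 10,
                 v_min_mv: int = 500, v_max_mv: int = 1000) -> int:
    """One controller iteration: compare measured delay to the allowed delay."""
    allowed_ns = nominal_ns * (1.0 + target)
    if measured_ns > allowed_ns:
        voltage_mv += step_mv   # region is too slow: raise performance
    elif measured_ns < allowed_ns:
        voltage_mv -= step_mv   # slack left: lower performance to save power
    return min(max(voltage_mv, v_min_mv), v_max_mv)


def run_loop(target: float, iterations: int = 50) -> int:
    """Close the loop: each voltage change alters the next measurement."""
    voltage_mv = 1000
    for _ in range(iterations):
        measured_ns = toy_delay_ns(voltage_mv)
        voltage_mv = control_step(voltage_mv, measured_ns, 1.0, target)
    return voltage_mv


final_mv = run_loop(target=0.25)
```

In this toy setting the loop settles at the voltage where the measured delay equals the allowed delay. A practical controller would additionally need a margin or hysteresis so that measurement noise does not cause constant toggling.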

Whereas this enables more efficient power reduction for the user application, it also allows addressing [PVTA](#page-364-1) effects: As these manifest in changed cell propagation delay, they will be picked up by the *Measure* characterization step and be compensated transparently. It should be noted that the high-level description here is simplified and the mentioned blocks do not necessarily all have dedicated hardware in each region. As will be shown later, all blocks, except for the voltage adjustment logic, will actually be centralized to reduce the hardware overhead.

**Simulation and DSE** Whereas previous sections have described the complete [PARFAIT](#page-363-4) architecture, some more thoughts are necessary to efficiently simulate and evaluate the system. As previously described, simulation models are necessary to describe the actual influence of voltage adjustment on the propagation delays of cells. Furthermore, a similar model is needed to estimate the changes in $P_{static}$ with voltage adjustment. For [PVTA,](#page-364-1) two simulation models are required: First, it must be determined how those effects actually affect the propagation delay, $t_{\rm PD}$:

$$
t_{\rm PD} = f(P, V, T, A) \tag{4.8}
$$

And considering the possibility of another independent variable for performance adjustment, $C$:

$$
t_{\rm PD} = f(P, V, T, A, C) \tag{4.9}
$$

These equations are heavily technology-dependent. For well-established technologies, an algebraic expression can be derived from equations given in [chapter 2.](#page-31-0) For various novel technologies, including the [RFET](#page-364-3) technology, such detailed equations and understanding are however not yet available. Even for [SOI](#page-364-4) and similar technology, using text-book equations may lead to larger modeling errors than using vendor-specific SPICE models shipped as part of [PDKs.](#page-363-6) This dissertation will therefore use the generic Scarpato model presented in [section 2.4](#page-49-0) on page [27,](#page-49-0) fitting it to the technologies evaluated in this thesis.

With this model, the system can be simulated given the $P, V, T, A$ and $C$ inputs. $C$ can be obtained through simulation of the voltage control loop, but for the other inputs, realistic test scenarios need to be given. The following sections will derive both the technology model and the scenarios used to derive [PVTA](#page-364-1) parameters in detail.

# **4.4 FPGA Implementation**

For comparison of the final results and as a basis for the [PARFAIT](#page-363-4) [FPGA,](#page-362-0) the *k6\_frac\_N10\_40nm* architecture was implemented in [VHDL.](#page-364-13) The [VPR](#page-364-8) toolflow was extended with missing tools to enable a full Verilog-to-Bitstream flow for this architecture. Ultimately, benchmark applications can therefore be synthesized to a bitstream and the [FPGA](#page-362-0) can be simulated in Questasim, including the application bitstream. This workflow enables prototyping of system-level [FPGA](#page-362-0) features with quick functional verification.

**Architecture Modifications** Some minor details of the *k6\_frac\_N10\_40nm* reference had to be changed for implementation: First, a fixed layout has been introduced instead of the originally used automated layout. With an automated layout, [VPR](#page-364-8) dynamically sizes the [FPGA,](#page-362-0) finding the smallest [FPGA](#page-362-0) size which fits the user application to be placed. Although the bitstream generator and the [FPGA](#page-362-0) implementation support parametrization to any size, there are benefits in fixing the size. For example, it allows generation of bitstreams for multiple applications, that can all be evaluated on the same [FPGA](#page-362-0) instance. As a second step, [FASM](#page-362-18) metadata has been added to the architecture to enable generation of bitstreams. This metadata does not affect the [FPGA](#page-362-0) architecture in any way and will be discussed in detail in the next section.

In addition, there are actual changes of the functional architecture itself: The pin equivalence for [CLBs](#page-361-3) has been changed to none. Without this setting, [VPR](#page-364-8) introduces another stage in the final generated netlist which permutes inputs of [CLBs.](#page-361-3) Although the [VHDL](#page-364-13) code for the implemented architecture supports this permutation, [VPR](#page-364-8) does not emit [FASM](#page-362-18) for those. Without this information, it is however not possible to generate correct bitstreams, so the permutation was disabled completely. In an additional change, the capacity of [IO](#page-362-19) cells has been reduced to five instead of eight in the original architecture. As a consequence, there will be only five input and output pins per [IOB](#page-362-14) instead of eight. This is not a real limitation, as the evaluated benchmarks require large [FPGA](#page-362-0) sizes and the [IO](#page-362-19) ports are therefore not a limiting factor. [IOBs](#page-362-14) connect to channels on only one side, so five outputs in total match the [CLB'](#page-361-3)s five outputs on each side. Given that [VPR](#page-364-8) selects the tracks to connect to depending on the outputs of blocks adjacent to a channel, using the same number here greatly simplifies the connection pattern. With this change, this pattern becomes identical for all block outputs all over the [FPGA,](#page-362-0) for both [IOBs](#page-362-14) and [CLBs.](#page-361-3) This greatly reduces complexity both in the [VHDL](#page-364-13) [FPGA](#page-362-0) implementation and in the bitstream generator. Block input mapping still varies between blocks, as will be explained later in this section.

The last change is related to the routing channels and [Programmable Switch](#page-363-7) [Matrices \(PSMs\):](#page-363-7) In [VPR,](#page-364-8) [PSMs](#page-363-7) for [FPGAs](#page-362-0) with unidirectional channels use modified Wilton switch patterns. These patterns can only be tiled, i.e. are only identical for each [PSM](#page-363-7) instance, if all tracks start or stop at a [PSM.](#page-363-7) If a track instead passes through a [PSM,](#page-363-7) [VPR](#page-364-8) will use a custom pattern for each [PSM.](#page-363-7) As such an irregularity largely increases complexity of the [FPGA](#page-362-0) and the bitstream generator, pass-through tracks had to be avoided. To realize this, the segment length of tracks was set to one instead of the original four. As these architecture changes affect benchmark results, both the [LUT](#page-363-2) based architecture used for comparison and the [ULM](#page-364-9) [RFET](#page-364-3) architecture implement all these changes consistently.

**FPGA Interconnect** An important aspect of [FPGA](#page-362-0) implementations that has not been discussed so far is the interconnect system, responsible for routing of signals between logic blocks. This work focuses largely on changes of the [CLB,](#page-361-3) but an [FPGA](#page-362-0) implementation also has to deal with various interconnect aspects, which are explained in the following.

The interconnect system can be divided roughly into two categories: First, a local interconnect, which is used for internal connections within each logic block. In the [PARFAIT](#page-363-4) [FPGA,](#page-362-0) this is mainly the crossbar in [CLBs,](#page-361-3) explained as part of [figure 4.2](#page-132-0) on page [110.](#page-132-0) The second category is the global interconnect, which consists of [PSMs](#page-363-7) and the routing channels, which contain tracks connecting the [PSMs.](#page-363-7) These tracks also connect to [CBs,](#page-361-10) where [CBs](#page-361-10) can be either part of [CLB](#page-361-3) or of the top-level system. The [PARFAIT](#page-363-4) FPGA realizes a generic [CB](#page-361-10) with customizable patterns, which specify how to connect inputs and outputs to certain tracks. In this implementation, the [CBs](#page-361-3) are therefore part of the global interconnect.

<span id="page-144-0"></span>

[Figure 4.7](#page-144-0) shows the various grids and indexing schemes that are used in the top-level design of the [FPGA.](#page-362-0) As shown in **[\(a\)](#page-144-0)**, [IOBs](#page-362-3) are numbered in a one-dimensional scheme: They are grouped into north, east, south and west and then numbered from left to right or bottom to top. This indexing is primarily used in the [VHDL](#page-364-0) code, as there is no [IOB](#page-362-3) specific bitstream in the bitstream generator. Subfigure **[\(b\)](#page-144-0)** shows the 2D grid used to number all blocks. This is the indexing scheme which is used in [FASM](#page-362-2) and the generated bitstream, as it handles all block indices in a generic way. Subfigure **[\(c\)](#page-144-0)** shows the 2D grid indexing scheme used to name [CLBs](#page-361-0) in the [VHDL](#page-364-0) code. The [VHDL](#page-364-0) code gives names to blocks according to type.

Subfigure **[\(d\)](#page-144-0)** shows the locations and indexing of [PSMs.](#page-363-0) It should be noted that [PSMs](#page-363-0) at edge and corner locations are technically not connected to all channels. Nevertheless, for simplicity, an identical implementation is used. For technical reasons, the [VHDL](#page-364-0) code however has to actually connect unused channel outputs. Because of that, the unused signals are declared in the code, but they are not connected anywhere else. Subfigures **[\(e\)](#page-144-0)** and **[\(f\)](#page-144-0)** show [CBs](#page-361-3) connecting channels to blocks south and north or east and west. The [VHDL](#page-364-0) implementation of those [CBs](#page-361-3) is the same for both variants. It should be noted that the size of each grid varies slightly, but is always derived from the total [FPGA](#page-362-0) width and height.

The top-level file and interconnect are automatically generated using the custom pgen tool. This tool parses a mustache template file and then generates a [VHDL](#page-364-0) file implementing an [FPGA](#page-362-0) of the selected size. An excerpt of this template is shown in [listing 4.3](#page-145-0) as an example. As can be seen, this template system allows substitution of certain values and repeated instantiation of blocks. A single instance of generated code is shown in [listing 4.4.](#page-145-1)

```
{{#IOB_N}}
iob_n{{NUM}}: entity vfpga.IOB
    port map (
        din => cb_ns_no({{CB_X}}, {{CB_Y}})(7 downto 0),
        dout => cb_ns_ni({{CB_X}}, {{CB_Y}})(7 downto 0),

        pad_i => iob_ni({{NUM}}),
        pad_o => iob_no({{NUM}})
    );
{{/IOB_N}}
```
**Listing 4.3:** Excerpt of the top-level FPGA mustache template showing the IOB instantiation code.

```
iob_n00: entity vfpga.IOB
    port map (
        din => cb_ns_no(0, 3)(7 downto 0),
        dout => cb_ns_ni(0, 3)(7 downto 0),

        pad_i => iob_ni(00),
        pad_o => iob_no(00)
    );
```
**Listing 4.4:** Excerpt of the top-level FPGA architecture generated from [listing 4.3.](#page-145-0)
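The template expansion can be mimicked in a few lines. The following is only a sketch of the mustache mechanism (one repeated section plus variable substitution) using the Python standard library; it is not the pgen tool, and the way `NUM`, `CB_X` and `CB_Y` are generated here is a hypothetical choice for illustration.

```python
import re

TEMPLATE = """{{#IOB_N}}
iob_n{{NUM}}: entity vfpga.IOB
    port map (
        din => cb_ns_no({{CB_X}}, {{CB_Y}})(7 downto 0),
        dout => cb_ns_ni({{CB_X}}, {{CB_Y}})(7 downto 0),
        pad_i => iob_ni({{NUM}}),
        pad_o => iob_no({{NUM}})
    );
{{/IOB_N}}"""


def render(template: str, context: dict) -> str:
    """Expand {{#NAME}}...{{/NAME}} sections, then {{VAR}} placeholders."""
    def expand_section(match: re.Match) -> str:
        name, body = match.group(1), match.group(2)
        # A section is repeated once per entry of the list stored under NAME.
        return "".join(render(body, item) for item in context.get(name, []))

    text = re.sub(r"\{\{#(\w+)\}\}\n?(.*?)\{\{/\1\}\}\n?", expand_section,
                  template, flags=re.S)
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(context[m.group(1)]), text)


# Hypothetical context: two north IOBs attached to CBs in grid row 3.
iobs = [{"NUM": f"{i:02d}", "CB_X": i, "CB_Y": 3} for i in range(2)]
vhdl = render(TEMPLATE, {"IOB_N": iobs})
```

Each list entry produces one instantiation, so scaling the FPGA only requires a longer context list, not a different template.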

**CB Implementation** The used [Connection Box](#page-361-3) consists of multiple configurable multiplexers, as shown in [figure 4.8.](#page-146-0)

<span id="page-146-0"></span>

**Figure 4.8:** [CB](#page-361-3) used in the [PARFAIT](#page-363-1) [FPGA.](#page-362-0) CHAN I and CHAN D are channel signals connecting to [PSMs.](#page-363-0) As tracks are unidirectional, a [CB](#page-361-3) connects to tracks in increasing and decreasing direction. IN 1 and OUT 1 are output and input signals from one [CLB](#page-361-0) adjacent to the [CB.](#page-361-3) IN 2 and OUT 2 are the respective signals for the other [CLB.](#page-361-0)

The exact number of tracks in all signal busses can be configured in the [VHDL](#page-364-0) code. How a specific [CB](#page-361-3) instance maps the output signals OUT 1 and OUT 2 to channel tracks can be specified using a generic parameter. The same is true for the input signals IN 1 and IN 2. Which of the individual connected tracks is actually selected by the multiplexers is configurable through the [FPGA](#page-362-0) bitstream. The [CB](#page-361-3) instances in the [PARFAIT](#page-363-1) architecture use pin maps extracted from [VPR,](#page-364-1) to realize the connections following the [VPR](#page-364-1) architecture description.

A requirement of [VPR](#page-364-1) is that both connected [CLBs](#page-361-0) must be able to drive inputs of the other one through the [CB.](#page-361-3) Therefore, both [CLB'](#page-361-0)s inputs IN 1 and IN 2 are first multiplexed onto the channel bus. After that, both [CLB](#page-361-0) outputs OUT 1 and OUT 2 are read from the now multiplexed channel. Signals are unidirectional, so each [CLB](#page-361-0) connects to the channel busses for increasing and decreasing direction. The [CLB](#page-361-0) signals must be able to drive and read signals in both directions, which is ensured by the realization in [figure 4.8.](#page-146-0) [VPR](#page-364-1) normally determines the required channel width on the fly, depending on the mapped application. To get more reproducible results, the channel width is always fixed to 80 tracks in this dissertation, giving 40 tracks in each direction.

**Programming** Programmable elements, which include the [CLBs,](#page-361-0) [PSMs](#page-363-0) and the [CBs,](#page-361-3) contain internal configuration storage. This storage is realized as a shift register with serial input and output, and a parallel output. This way, multiple elements can be chained. Data is transferred using a single serial data line, a clock signal and an enable signal. Such an approach reduces the number of required wires at the expense of longer configuration times, as long sequences of bits need to be shifted into the registers. Configuration time is not critical for the [PARFAIT](#page-363-1) [FPGA,](#page-362-0) so this approach was chosen.
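A behavioral sketch may help to see how chained configuration registers fill up from a single serial line. The class and function names below are hypothetical, the register widths are arbitrary, and the enable signal is omitted for brevity, so this is a simplified model rather than the VHDL implementation.

```python
class ConfigRegister:
    """Shift register with serial input/output and a parallel output."""

    def __init__(self, width: int):
        self.bits = [0] * width  # parallel output, index 0 is nearest the input

    def shift(self, serial_in: int) -> int:
        """Clock one bit in and return the bit leaving at the serial output."""
        serial_out = self.bits[-1]
        self.bits = [serial_in] + self.bits[:-1]
        return serial_out


def program_chain(chain: list, bitstream: list) -> None:
    """Shift a bitstream through registers whose serial ports are chained."""
    for bit in bitstream:
        carry = bit
        for reg in chain:
            # Each register receives the bit its predecessor just shifted out.
            carry = reg.shift(carry)


# Two hypothetical elements with 3 and 2 configuration bits.
chain = [ConfigRegister(3), ConfigRegister(2)]
program_chain(chain, [1, 0, 1, 1, 0])
```

The first bit sent ends up at the far end of the chain, which is why the bitstream has to be ordered to match the chain layout.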

<span id="page-147-0"></span>

**Figure 4.9:** Programming chain connecting the configurable blocks in the [PARFAIT](#page-363-1) [FPGA.](#page-362-0) **[\(a\)](#page-147-0)** Chain for the [CLBs.](#page-361-0) Each row can be programmed individually. **[\(b\)](#page-147-0)** Chain for interconnect elements. Programs all [PSMs,](#page-363-0) [CBSN](#page-361-1) and [CBEW](#page-361-2) at once.

[Figure 4.9](#page-147-0) shows how the programmable elements are chained. The leftmost arrows depict external inputs, which are connected to the [FPGA](#page-362-0) programming port driven by the ProgController code. Subfigure **[\(a\)](#page-147-0)** depicts the configuration chains for the [CLBs.](#page-361-0) [CLB](#page-361-0) configuration includes the local interconnect, i.e. the crossbar. It does however not include any of the [CBs,](#page-361-3) which are programmed as part of the interconnect bitstream instead. Each row gets its own [CLB](#page-361-0) configuration chain. This enables configuration of single [CLB](#page-361-0) rows while keeping other rows unaffected. Subfigure **[\(b\)](#page-147-0)** shows the configuration chain for the global interconnect. As can be seen, this chain includes the [PSMs,](#page-363-0) [CBSNs](#page-361-1) and [CBEWs.](#page-361-2)

[Figure 4.10](#page-148-0) shows the [FSM](#page-362-4) describing the PROG controller. This controller can program a configuration chain of arbitrary length. It uses a simple interface to receive the data to be programmed in parallel, and to enable programming of the serial chain. As shown in the [FSM,](#page-362-4) the controller initially resets the configured chain's storage registers. It then shifts all bits onto the configuration chain. When it has finished programming all bits, it asserts the done signal.

<span id="page-148-0"></span>

**Figure 4.10:** [FSM](#page-362-4) describing the PROG component.

[Figure 4.11](#page-148-1) depicts how the PROG block is used in a ProgController to program all chains of [figure 4.9.](#page-147-0) There are two instances of the PROG block: One directly drives the interconnect programming chain. The other one is used to program the [CLB](#page-361-0) rows. As only one controller is used for the rows, a multiplexer connects the proper row configuration chain to the controller.

<span id="page-148-1"></span>

**Figure 4.11:** Block diagram describing the ProgController component. It consists of the control [FSM,](#page-362-4) two PROG controllers for the interconnect and for the [CLB](#page-361-0) rows, and one multiplexer to select the active [CLB](#page-361-0) row.

The [FSM](#page-362-4) shown in [figure 4.12](#page-148-2) is used to coordinate the programming process. It first programs the interconnect, then all the [CLB](#page-361-0) rows individually. Once the [FPGA](#page-362-0) is fully programmed, it asserts a done signal to inform the user that the [FPGA](#page-362-0) application is now ready.
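The programming order can be written down as a short generator. The state names are hypothetical, since the states of the FSM in the figure are not spelled out in the text; only the sequence (interconnect first, then each row, then done) is taken from the description.

```python
from enum import Enum, auto


class State(Enum):
    PROG_INTERCONNECT = auto()
    PROG_CLB_ROW = auto()
    DONE = auto()


def prog_controller(num_rows: int):
    """Yield (state, selected_row) in the order the chains are programmed."""
    yield State.PROG_INTERCONNECT, None
    for row in range(num_rows):
        # The multiplexer routes the shared PROG block to this row's chain.
        yield State.PROG_CLB_ROW, row
    yield State.DONE, None


steps = list(prog_controller(num_rows=3))
```

Because each row is a separate step, the same sequence could be restarted for a single row without touching the others.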

<span id="page-148-2"></span>

**Figure 4.12:** [FSM](#page-362-4) describing the ProgController component.

The exact format of the bitstream will be introduced as part of the next section, as it is relevant for the logic invasion concept introduced in this thesis.

# **4.5 FPGA Toolflow**

The [VPR](#page-364-1) toolchain can place applications to [FPGA](#page-362-0) architectures using only an architecture file. It can then gather statistics such as the number of used blocks, routing track utilization and more. When testing functional aspects such as logic invasion presented in [section 8.2](#page-228-0) on page [206,](#page-228-0) it is more promising to simulate the real implementation of the architecture presented in the previous section. However, for such an evaluation, the user application needs to be programmed to this custom [FPGA](#page-362-0) architecture. This then requires a more complete toolflow than the one offered by [VPR,](#page-364-1) as the placed and routed design also needs to be converted to a bitstream.

<span id="page-149-0"></span>

**Figure 4.13:** Verilog-to-Bitstream toolflow for the [LUT](#page-363-2) based [PARFAIT](#page-363-1) reference architecture. Parts marked in orange are custom parts that have been developed as part of this thesis.

[Figure 4.13](#page-149-0) shows the toolflow developed as part of this thesis. It is largely based on the original [VPR](#page-364-1) toolflow, but extended with some newly written tools, which are shown in orange in the figure. The basic approach chosen for bitstream generation is based on [FASM,](#page-362-2) a text based format for bitstream representation available in [VPR](#page-364-1) [\[244\]](#page-338-0).

```
BLK_X[001]Y[002]Z[000].CLB.FLE[8].5BLE[1].LUT5.INIT[31:0]=32'b111100000000000001111000000000000
```
**Listing 4.5:** An example [FASM](#page-362-2) feature for a 32 bit [LUT.](#page-363-2)

[FASM](#page-362-2) aims to be the analogue of software assembly formats for [FPGA](#page-362-0) based hardware description. The format specification only specifies that "features" are hierarchically grouped keys, which optionally can define a value. The exact content and meaning of these keys is specified by the architecture designer in the [VPR](#page-364-1) architecture description, as metadata. An example [FASM](#page-362-2) feature illustrating these elements is given in [listing 4.5.](#page-149-1) Here, everything up to the equals sign is the feature key. The value is a Verilog-style binary string literal. The feature is hierarchical: It describes the INIT value of the second LUT5 (5BLE[1]) in FLE[8] of the CLB at block position (1, 2).
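The key and value structure of such a feature can be taken apart with a few lines of code. This is a minimal sketch for the feature style used in this chapter, not the official FASM parser; the function names are hypothetical.

```python
import re


def parse_fasm_feature(line: str):
    """Split a feature line into its hierarchical key parts and optional value."""
    key, sep, value = line.strip().partition("=")
    # Only the key is split on dots, so dots inside the value are preserved.
    return key.split("."), (value if sep else None)


def block_position(component: str):
    """Extract (x, y, z) from a BLK_X[..]Y[..]Z[..] key component, if present."""
    match = re.fullmatch(r"BLK_X\[(\d+)\]Y\[(\d+)\]Z\[(\d+)\]", component)
    return tuple(int(g) for g in match.groups()) if match else None


line = ("BLK_X[001]Y[002]Z[000].CLB.FLE[8].5BLE[1].LUT5.INIT[31:0]"
        "=32'b111100000000000001111000000000000")
parts, value = parse_fasm_feature(line)
position = block_position(parts[0])
```

Features without a value, such as a flag that only needs to be present, simply yield `None` as the value.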

Although [FASM](#page-362-2) can be written manually for maximum control of placement and routing, it commonly is generated automatically from Verilog code instead. As shown in [figure 4.13,](#page-149-0) the Verilog application is first fed into *ODIN II* for synthesis. After synthesis has transformed the Verilog application into a technology-independent netlist, this netlist is optimized in *ABC*. Then, also in *ABC*, the netlist is mapped to the target technology. For the reference architecture, *ABC* is instructed to map to 6-input [LUTs.](#page-363-2) If fewer inputs are required, [VPR](#page-364-1) will later automatically use the fractured 5-input [LUT](#page-363-2) instead.

After synthesis, *VPR* is used for the implementation passes. First, it packs related [FFs](#page-362-5) and [LUTs](#page-363-2) together for use in [BLEs.](#page-361-4) It then places all logic into [BLEs](#page-361-4) and [CLBs](#page-361-0) based on the architecture description. In the last implementation step, [VPR](#page-364-1) routes the design. At this point, the placed and routed design is saved into multiple files: A .place file describing block locations, a .route file specifying the global routing, a .net file describing the intra-block routing and a .blif file, describing the netlist. All files need to be parsed manually to obtain the full information required to generate bitstreams. As this is a tedious and error-prone process, the *GENFASM* tool has been introduced. It can read the aforementioned files and emit [FASM](#page-362-2) such as the one in [listing 4.5.](#page-149-1) Given that [FASM](#page-362-2) does not provide any semantic meaning in the specification, it however needs additional specifications using metadata in the architecture description. Furthermore, metadata needs to be added to the [VPR](#page-364-1) [Routing](#page-364-2) [Resource Graph \(RRG\),](#page-364-2) which is realized by the custom written *PBIT\_RR* tool. The generated .fasm code finally can be transformed to a binary bitstream using the custom *PBIT* tool. In the following pages, the newly introduced tools and steps will be explained in more detail.

**Architecture Metadata** [VPR](#page-364-1) emits [FASM](#page-362-2) whenever it creates instances of pbs specified in an architecture description. [Listing 4.6](#page-151-0) shows an example of such metadata added to the [FPGA](#page-362-0) architecture description. In this case, it describes the LUT5.INIT[31:0] feature. [VPR](#page-364-1) will automatically add the [LUT](#page-363-2) table as the feature value. The prefix of the feature is also constructed automatically, by looking at the metadata of parent pbs.

```
<pb_type name="lut5" blif_model=".names" num_pb="1" class="lut">
    <metadata>
        <meta name="fasm_type">LUT</meta>
        <meta name="fasm_lut">LUT5.INIT[31:0]</meta>
    </metadata>
```
**Listing 4.6:** Architecture metadata to generate FASM features for LUTs.

[VPR'](#page-364-1)s current [FASM](#page-362-2) implementation does not support emitting block location indices. Instead, users have to specify metadata for each single [CLB](#page-361-0) individually. As this is rather tedious, the [FASM](#page-362-2) support was extended as part of this thesis to automatically add position information when a \$BLK\_X\_Y\_Z feature is processed. Indices in the block grid are defined according to [figure 4.7b](#page-144-0) on page [122.](#page-144-0) Furthermore, [VPR](#page-364-1) currently only supports emitting [FASM](#page-362-2) for multiplexers, but not for crossbars. The reference k6\_frac\_N10\_mem32K\_40nm architecture however does use a crossbar in the [CLB](#page-361-0) implementation. As the architecture should not be modified too much to keep results comparable, crossbar [FASM](#page-362-2) support was added to [VPR](#page-364-1) as well.

[Listing 4.7](#page-151-1) summarizes all the features that are derived from the architecture description. Those include [LUT](#page-363-2) data for the two 5-input [LUTs](#page-363-2) in a [BLE](#page-361-4) or for a single 6-input [LUT,](#page-363-2) if the [BLE](#page-361-4) is not in split mode. The features further specify whether the [FF](#page-362-5) in the [BLE](#page-361-4) is used or bypassed. The last feature shows how the [CLB](#page-361-0) crossbar is modeled in the [PARFAIT](#page-363-1) [FASM](#page-362-2) description: It specifies that the third input of [BLE](#page-361-4) 8 should be connected to the second input of the [CLB,](#page-361-0) driven by the global interconnect. Alternatively, a value of fle[0].O[0] would specify internal feedback, i.e. connecting the first output of [BLE](#page-361-4) 0 to this input.

```
BLK_X[001]Y[002]Z[000].CLB.FLE[8].5BLE[1].LUT5.INIT[31:0]=...
BLK_X[001]Y[002]Z[000].CLB.FLE[8].5BLE[1].FF.BYPASS
BLK_X[001]Y[002]Z[000].CLB.FLE[8].5BLE[0].LUT5.INIT[31:0]=...
BLK_X[001]Y[002]Z[000].CLB.FLE[8].5BLE[0].FF.ENABLE
BLK_X[000]Y[000]Z[000].CLB.FLE[7].6BLE.LUT6.INIT[63:0]=...
BLK_X[000]Y[000]Z[000].CLB.FLE[7].6BLE.FF.ENABLE
BLK_X[001]Y[002]Z[000].CLB.FLE[8].CB.I[2]="clb[0].I[1]"
```
**Listing 4.7:** Summary of features that are derived from the architecture description.

**PBIT\_RR** is used to parse the [VPR](#page-364-1) [RRG](#page-364-2) and add [FASM](#page-362-2) metadata. [Listing 4.8](#page-152-0) shows an example of an [RRG](#page-364-2) after it has been processed by PBIT\_RR, where the metadata tag has been added by the tool. The modified [RRG](#page-364-2) is fed into the *GENFASM* tool, which will simply emit the feature string whenever it finds that a segment has been used in the routing.

```
<node id="20" type="OPIN" capacity="1">
    <loc xlow="0" ylow="1" xhigh="0" yhigh="1" side="RIGHT" ptc="1"/>
</node>
<node id="9552" type="CHANY" direction="INC_DIR" capacity="1">
    <loc xlow="0" ylow="1" xhigh="0" yhigh="1" ptc="0"/>
</node>
<!-- ... -->
<edge src_node="20" sink_node="9552" switch_id="2">
    <metadata>
        <meta name="fasm_features">BLK_X[000]Y[001]Z[000].CBW[E].ODST[00]=1</meta>
    </metadata>
</edge>
```
**Listing 4.8:** Example [VPR](#page-364-1) [RRG](#page-364-2) extended with [FASM](#page-362-2) metadata by PBIT\_RR.

In the [RRG,](#page-364-2) nodes represent channels (connecting [PSMs\)](#page-363-0), or IPINs and OPINs, representing inputs and outputs of [CLBs](#page-361-0) and other blocks. Edges on the other hand represent a possible connection between such nodes. The [RRG](#page-364-2) contains all theoretically possible connections. It is therefore only dependent on the [FPGA](#page-362-0) architecture description, but is independent of the application. The PBIT\_RR tool analyzes all segments and derives [FASM](#page-362-2) features for them. It then distinguishes two kinds of edges: If an edge connects two channel nodes, it is representing a connection in a [PSM.](#page-363-0) If it connects a channel and an IPIN or a channel and an OPIN, it is a [CB](#page-361-3) connection.
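The distinction between the two kinds of edges can be expressed compactly. The sketch below is not the PBIT\_RR implementation: the helper names are hypothetical, the XML is a reduced fragment wrapped in a single root element, and node 9553 is an invented second channel node added only to obtain a channel-to-channel edge.

```python
import xml.etree.ElementTree as ET

CHANNEL_TYPES = {"CHANX", "CHANY"}
PIN_TYPES = {"IPIN", "OPIN"}


def classify_edge(src_type: str, sink_type: str) -> str:
    """Return 'PSM' for channel-to-channel edges, 'CB' for channel-to-pin edges."""
    types = {src_type, sink_type}
    if types <= CHANNEL_TYPES:
        return "PSM"
    if types & CHANNEL_TYPES and types & PIN_TYPES:
        return "CB"
    return "other"


RRG_XML = """<rr_graph>
  <node id="20" type="OPIN" capacity="1"/>
  <node id="9552" type="CHANY" direction="INC_DIR" capacity="1"/>
  <node id="9553" type="CHANX" direction="INC_DIR" capacity="1"/>
  <edge src_node="20" sink_node="9552" switch_id="2"/>
  <edge src_node="9552" sink_node="9553" switch_id="2"/>
</rr_graph>"""


def classify_edges(xml_text: str) -> list:
    """Look up the node type of both edge ends and classify every edge."""
    root = ET.fromstring(xml_text)
    node_type = {n.get("id"): n.get("type") for n in root.iter("node")}
    return [(e.get("src_node"), e.get("sink_node"),
             classify_edge(node_type[e.get("src_node")],
                           node_type[e.get("sink_node")]))
            for e in root.iter("edge")]


edge_kinds = classify_edges(RRG_XML)
```

The classification only depends on node types, which matches the observation that the graph is independent of the mapped application.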

```
PSM_X[000]Y[001]Z[000].MUX[S].TRACK[79]="N"
BLK_X[001]Y[002]Z[000].CBW[W].ODST[79]=59
BLK_X[000]Y[001]Z[000].CBR[E].I[12]=79
```
**Listing 4.9:** Summary of features that are derived from the [RRG.](#page-364-2)

[Listing 4.9](#page-152-1) gives examples of those generated features, with the first line describing a [PSM](#page-363-0) connection. Apart from block coordinates, it contains the information that track 79 in the south output of the [PSM](#page-363-0) needs to be driven by the input from north. Each [PSM](#page-363-0) output only connects to one track in each direction, so this information is complete.

The second line shows information for the output part of a [CB,](#page-361-3) driving routing channels from a [CLB](#page-361-0) or [IOB.](#page-362-3) This specific feature signifies that track 79 in the channel in the west needs to be driven by output 59 of the block. The tracks that can connect to a certain output follow a fixed pattern, so not all numbers may be emitted here. The third line shows information for the input part of a [CB,](#page-361-3) selecting tracks to route into a [CLB](#page-361-0) or [IOB.](#page-362-3) In this case, it specifies that input 12 should be driven by track 79 in the channel to the east of the logic block.

The generated [FASM](#page-362-2) uses pin and track numbers as used by [VPR,](#page-364-1) to enable easier debugging. These numbers do not directly correspond to multiplexer indices though: As only some connections are possible, there are fewer multiplexer addresses than pin and track numbers. The mapping is done by both *PBIT*, which maps track indices to multiplexer addresses, and by the [VHDL](#page-364-0) code specifying the architecture. The latter has to undo the mapping to connect to the proper tracks for a certain multiplexer address. The connection patterns of pins to tracks are determined in [VPR](#page-364-1) using an elaborate algorithm. Reimplementing this algorithm in custom tools would be error-prone, so a different solution has been chosen: The *PBIT\_RR* tool will generate pin maps based on the [RRG](#page-364-2) [\[245\]](#page-338-1) to be used in the other tools. An example of such a pin map for inputs in the north of an [IOB](#page-362-3) block is shown in [listing 4.10.](#page-153-0) Here, the first mapping line for example states that [IOB](#page-362-3) pin 0 can connect to tracks 0, 1, 12, etc.

```
======================== IOB_N ========================
00 => [0, 1, 12, 13, 26, 27, 40, 41, 52, 53, 66, 67]
03 => [2, 3, 16, 17, 28, 29, 42, 43, 56, 57, 68, 69]
06 => [4, 5, 18, 19, 32, 33, 44, 45, 58, 59, 72, 73]
09 => [8, 9, 20, 21, 34, 35, 48, 49, 60, 61, 74, 75]
12 => [10, 11, 24, 25, 36, 37, 50, 51, 64, 65, 76, 77]
```
**Listing 4.10:** Pin map generated by PBIT\_RR for the north inputs of an [IOB.](#page-362-3)
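To make the role of such a pin map more concrete, the following Python sketch parses the text format of listing 4.10 and looks up the position of a track in the connection list of a pin. The names `parse_pin_map` and `mux_address` are hypothetical, and the assumption that a multiplexer address equals the list position of the track is made here for illustration only; it is not taken from the *PBIT* sources.

```python
import re

PIN_MAP_TEXT = """\
======================== IOB_N ========================
00 => [0, 1, 12, 13, 26, 27, 40, 41, 52, 53, 66, 67]
03 => [2, 3, 16, 17, 28, 29, 42, 43, 56, 57, 68, 69]
06 => [4, 5, 18, 19, 32, 33, 44, 45, 58, 59, 72, 73]
09 => [8, 9, 20, 21, 34, 35, 48, 49, 60, 61, 74, 75]
12 => [10, 11, 24, 25, 36, 37, 50, 51, 64, 65, 76, 77]
"""

# An optional leading listing line number is tolerated by both patterns.
HEADER = re.compile(r"^(?:\d+\s+)?=+\s*(\w+)\s*=+$")
ENTRY = re.compile(r"^(?:\d+\s+)?(\d+)\s*=>\s*\[(.*)\]$")


def parse_pin_map(text):
    """Return {section: {pin: [track, ...]}} for a pin map in the listed format."""
    pin_map, section = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        header = HEADER.match(line)
        if header:
            section = header.group(1)
            pin_map[section] = {}
            continue
        entry = ENTRY.match(line)
        if entry and section is not None:
            tracks = [int(t) for t in entry.group(2).split(",")]
            pin_map[section][int(entry.group(1))] = tracks
    return pin_map


def mux_address(pin_map, section, pin, track):
    """ASSUMPTION: the multiplexer address is the list position of the track."""
    return pin_map[section][pin].index(track)


pin_map = parse_pin_map(PIN_MAP_TEXT)
```

With this sketch, `mux_address(pin_map, "IOB_N", 3, 16)` yields the position of track 16 in the list of pin 3, and a track that is not connectable raises a `ValueError`.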

**PBIT** reads [FASM](#page-362-2) files, which might be either generated as previously explained or written manually, and transforms them to binary configuration data. In this process, it first maps track indices and block pins to multiplexer addresses, using the pin maps generated by *PBIT\_RR*. It then converts these addresses to binary data vectors. Vectors representing different blocks are then ultimately combined to fit the order of the configuration chains of [figure 4.9](#page-147-0) on page [125.](#page-147-0) Data can be directly emitted as a [VHDL](#page-364-0) package, based on a given mustache template.

Bitstream formats are shown in tables 4.1 to 4.4. [Table 4.1](#page-154-0) shows the bitstream format used for [PSMs.](#page-363-0) For all 40 tracks in all 4 directions, it encodes the direction a track is driven from. For example, the N entry in TRACK\_0 specifies where the track 0 output in the north of the [PSM](#page-363-0) is driven from. Values are encoded as 2 bit in increasing order: from left, from center, from right (looking from output track into the [PSM\)](#page-363-0).

<span id="page-154-0"></span>

| 319      |   |         |         |
|----------|---|---------|---------|
| TRACK_39 | … | TRACK_1 | TRACK_0 |
| W … N    | … | W … N   | W … N   |

**Table 4.1:** Bitstream Format for the [PSM](#page-363-0)

[Table 4.2](#page-154-2) describes the bitstream format used for [BLEs.](#page-361-4) It consists of the [LUT](#page-363-2) data, which is either two times 32 bit in fractured [LUT](#page-363-2) mode or one time 64 bit for a 6-input [LUT.](#page-363-2) The SPLIT bit specifies which mode is selected. The MUX0 and MUX1 bits then drive the multiplexers selecting between [LUT](#page-363-2) or [FF](#page-362-5) outputs.

<span id="page-154-2"></span>

| 66    | 65   | 64   | 63 … 0   |
|-------|------|------|----------|
| SPLIT | MUX1 | MUX0 | LUT data |

**Table 4.2:** Bitstream Format for the [BLE](#page-361-4)

[Table 4.3](#page-154-1) describes the bitstream format for complete [CLBs.](#page-361-0) It consists of 10 [BLEs](#page-361-4) as specified in [table 4.2.](#page-154-2) Furthermore, it contains the configuration for the crossbar in the [CLB.](#page-361-0) Each entry is a 6 bit value, selecting one of 40 global inputs (0 – 39) or one of the feedback signals (40 – 59). Inputs are sorted so that the six inputs I[0] to I[5] in category MUX[0] drive inputs 0 – 5 of [BLE](#page-361-4) 0.

<span id="page-154-1"></span>

| 1029                          |   | 705                           | 669   |   |       |
|-------------------------------|---|-------------------------------|-------|---|-------|
| MUX_9                         | … | MUX_0                         | BLE_9 | … | BLE_0 |
| I[5] I[4] I[3] I[2] I[1] I[0] |   | I[5] I[4] I[3] I[2] I[1] I[0] |       |   |       |

**Table 4.3:** Bitstream Format for the [CLB](#page-361-0)
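The bit indices in table 4.3 can be cross-checked with a few lines of Python. The sketch below assumes that fields are packed starting at bit 0 with the lowest-numbered field first (BLE 0 at the bottom, I[0] as the lowest entry of each MUX category) and that a [BLE](#page-361-4) occupies 67 bit (64 bit [LUT](#page-363-2) data plus SPLIT, MUX0 and MUX1). Both assumptions are inferred from the indices 669, 705 and 1029; the helper names are hypothetical.

```python
BLE_BITS = 64 + 1 + 1 + 1      # LUT data + MUX0 + MUX1 + SPLIT
BLES_PER_CLB = 10
INPUTS_PER_BLE = 6
XBAR_ENTRY_BITS = 6            # 6 bit select one of 60 crossbar sources

# ASSUMPTION: the crossbar configuration starts right above the last BLE.
XBAR_BASE = BLE_BITS * BLES_PER_CLB


def ble_range(ble):
    """Return (lsb, msb) of the configuration bits of one BLE."""
    lsb = ble * BLE_BITS
    return lsb, lsb + BLE_BITS - 1


def xbar_range(ble, inp):
    """Return (lsb, msb) of crossbar entry I[inp] in category MUX[ble]."""
    lsb = XBAR_BASE + (ble * INPUTS_PER_BLE + inp) * XBAR_ENTRY_BITS
    return lsb, lsb + XBAR_ENTRY_BITS - 1


def set_field(bits, lsb, msb, value):
    """Write value into bits[msb:lsb] of an integer used as bit vector."""
    width = msb - lsb + 1
    if not 0 <= value < (1 << width):
        raise ValueError("value does not fit into field")
    mask = ((1 << width) - 1) << lsb
    return (bits & ~mask) | (value << lsb)


# Example: drive input 0 of BLE 0 from feedback signal 40.
config = set_field(0, *xbar_range(0, 0), 40)
```

Under these assumptions, BLE 9 ends at bit 669, MUX[0] ends at bit 705 and MUX[9] ends at bit 1029, which matches the indices shown in the table.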

[Table 4.4](#page-155-0) describes the bitstream format for [CBs.](#page-361-3) A [CB](#page-361-3) can drive 32 tracks of a channel in both directions, from two inputs A and B. Each of the fields DBD, etc., is a single bit which is high if the respective track is driven from the respective input. Otherwise, the track is passed through unmodified. As an example, a 1 bit in DBD of DRIVE\_0 signifies that track 0 in decreasing channel direction should be driven by port B. Note that track index 0 is not the track index as in [VPR](#page-364-1) and the pin index in B is not configurable. These values are fixed in the [VHDL](#page-364-0) implementation according to the previously introduced pin maps. Values in the READ categories describe what channel track should drive an input to logic block A or B. Each value is a 4 bit vector, encoding values from 0 to 12. Again, the exact track numbers selected are determined according to the pin map. In addition, the numbers also encode the direction, as the multiplexer connects to both channel directions (refer to [figure 4.8](#page-146-0) on page [124\)](#page-146-0). For example, RA = 4b'0010 in READ\_9 signifies that input signal 9 of the block A should be driven by track 2.

<span id="page-155-0"></span>

| 207    |   | 135 … 128 |                 |   |                 |
|--------|---|-----------|-----------------|---|-----------------|
| READ_9 | … | READ_0    | DRIVE_31        | … | DRIVE_0         |
| RA RB  |   | RA RB     | DAI DAD DBI DBD |   | DAI DAD DBI DBD |

**Table 4.4:** Bitstream Format for the [CB](#page-361-3)
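The field layout of table 4.4 can be illustrated with a small decoder. This is a sketch under stated assumptions: the DRIVE groups are assumed to start at bit 0 with DBD as the lowest bit of each group, and RB is assumed to be the lower 4 bit of each READ group, which is consistent with the indices 128, 135 and 207. The function names are hypothetical.

```python
DRIVE_TRACKS = 32
DRIVE_FLAGS = ("DBD", "DBI", "DAD", "DAI")    # bit 0 .. bit 3 of a DRIVE group
READ_BASE = DRIVE_TRACKS * len(DRIVE_FLAGS)   # first bit above all DRIVE groups
READ_WIDTH = 4


def drive_bit(track, flag):
    """Bit position of one drive flag of DRIVE_<track>."""
    return track * len(DRIVE_FLAGS) + DRIVE_FLAGS.index(flag)


def read_lsb(index, port):
    """LSB position of RA or RB in READ_<index> (RB assumed to be the lower half)."""
    offset = 0 if port == "RB" else READ_WIDTH
    return READ_BASE + index * 2 * READ_WIDTH + offset


def decode_read(bits, index, port):
    """Extract the 4 bit multiplexer value of RA or RB in READ_<index>."""
    return (bits >> read_lsb(index, port)) & 0xF


def is_driven(bits, track, flag):
    """True if the given drive flag of DRIVE_<track> is set."""
    return bool((bits >> drive_bit(track, flag)) & 1)


# The two examples from the text: RA = 4b'0010 in READ_9 and DBD set in DRIVE_0.
config = (0b0010 << read_lsb(9, "RA")) | (1 << drive_bit(0, "DBD"))
```

Decoding `config` returns the value 2 for RA in READ\_9, and only the DBD flag of DRIVE\_0 is reported as set.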

## **4.6 Technology Modeling**

To be able to simulate the [PARFAIT](#page-363-1) architecture, a simulation model to predict the circuit delay under certain process conditions is required. Although any model can be used in the tools designed as part of this thesis, the evaluation here will use models based on Scarpato's work [\[3\]](#page-315-0). Refer to [sections 2.3](#page-42-0) and [2.4](#page-49-0) on page [20](#page-42-0) and on page [27](#page-49-0) for an introduction to this model.

<span id="page-155-1"></span>

**Table 4.5:** Layout parameters for the [MOSFETs](#page-363-5) used in the test circuit for $t_{\text{PD}}$ analysis in *XT018*. Most parameters are identical to [\[Reu21\]](#page-344-0), but the device width was recalculated to obtain symmetrical inverters at 1.8 V.

In general, the Scarpato model describes delay for a certain gate or a whole circuit. In this dissertation, the model was derived for specific test gates.

Based on this, it is assumed that [FPGA](#page-362-0) circuits based on reconfigurable [LEs](#page-362-6) show behavior proportional to the test gates. The propagation delay models will therefore be normalized to yield a 1.0 value at nominal condition. As a result, the final models will yield a factor that can be multiplied with the [LE](#page-362-6) delay at nominal condition to describe performance increase and decrease.

#### **SOI Reference Technology**

The commercial *XT018* technology is used as a reference to compare the [RFET](#page-364-3) results to. It is a 180 nm [SOI](#page-364-4) technology, featuring a similar feature size as the [PARFAIT](#page-363-1) [RFET.](#page-364-3) As the technology features a complete, commercial [PDK,](#page-363-6) required simulation models can be fitted to results of [SPICE](#page-367-0) simulations. For these evaluations, a simple [CMOS](#page-361-5) inverter as shown in [figure 4.14](#page-157-0) was designed in Cadence *Virtuoso*. General layout parameters were taken from Reuter et al. to match the investigation in [\[Reu21\]](#page-344-0). As the operating voltage used in this dissertation is 1.8 V, the width of the transistors had to be refitted though. Layout parameters are shown in [table 4.5.](#page-155-1)

**Test Circuit** [Figure 4.14](#page-157-0) shows the test circuit used to extract the propagation delays. Delays are measured for the second inverter and in between the I and O points.

In analog simulations, the first inverter's input is driven by a simulated, variable voltage source. The source voltage starts at 0 V at  $t = 0$ , and starts rising linearly at  $t = 40$  ps with a slope of 1.8V/40 ps. For the nominal voltage of 1.8 V, this ensures that the input rises in 40 ps to the maximum input voltage. When using non-nominal supply voltage, this specification keeps the slope constant, changing the transition time. At  $t = 500$  ps, the input falls linearly to 0 V, matching the same absolute slope as for the rising edge. This artificial signal generates an output at the first inverter, which is close to inverter output transients in real circuits. It drives the second inverter, which in turn drives four other inverters at its output. Outputs of the last inverter stage are kept unconnected. An implementation of this circuit is evaluated in Cadence *Virtuoso*, where it can be parametrized for the supply voltage, body biasing voltage and temperature.
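The input stimulus described above is a piecewise-linear waveform and can be written down in a few lines of Python. The sketch is not part of the *Virtuoso* test bench; the function names are hypothetical and only restate the description of the source.

```python
SLOPE = 1.8 / 40e-12       # V/s, identical for all supply voltages
T_RISE_START = 40e-12      # s, start of the rising edge
T_FALL_START = 500e-12     # s, start of the falling edge


def stimulus(t, vdd=1.8):
    """Input voltage of the first inverter at time t (in seconds)."""
    if t < T_RISE_START:
        return 0.0
    if t < T_FALL_START:
        return min(vdd, (t - T_RISE_START) * SLOPE)
    return max(0.0, vdd - (t - T_FALL_START) * SLOPE)


def transition_time(vdd):
    """Duration of one edge; it scales with vdd because the slope is fixed."""
    return vdd / SLOPE
```

At the nominal 1.8 V the edge takes 40 ps, whereas at 1.3 V the same slope results in a transition time of roughly 28.9 ps.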

**Voltage and Temperature** Following the process in [\[3\]](#page-315-0), the model will first be fitted to the V and T data. Delays are extracted from the transient simulations shown in [figures 4.15a](#page-158-0) and [4.15b](#page-158-0) on page [136,](#page-158-0) where the propagation delay is computed in Cadence *Virtuoso* using the delayMeasure function. Both the rise-fall $t_{\text{PD}}$ and fall-rise $t_{\text{PD}}$ have been extracted, and it has been verified that they show identical behavior with respect to the tested parameters. The absolute values of these delays are not identical, due to small mismatches in the [PMOS](#page-363-3) and [NMOS](#page-363-4) networks. However, for the system level simulation, the model only needs to report a performance scaling factor. Because of this, the absolute values are not important, and the model developed here therefore will be parametrized using only fall-rise transitions. The model will then be normalized in the end.

<span id="page-157-0"></span>

**Figure 4.14:** Test circuit to extract the [FO4](#page-365-0) delay of an inverter. The characterized inverter (second from left to right) is driven by another inverter and drives a load consisting of four inverters.

For supply voltage dependency modeling, the supply voltage was swept from 1.3 V to 2.3 V using steps of 50 mV, whereas other parameters have been fixed at nominal conditions. The simulated data, 21 points covering a range of ±500 mV centered on the nominal 1.8 V, was then exported. Data samples are shown in [figure 4.16](#page-163-0) on page [141](#page-163-0) as dots in the left column. A similar approach was taken to extract the temperature dependency ([figure 4.15b](#page-158-0)). Temperature *T* was swept from −40 °C to 175 °C with a step size of 5 °C. The resulting 44 delay values were exported and the temperatures were then converted to kelvin. Resulting values are shown in [figure 4.16](#page-163-0) on page [141](#page-163-0) as dots in the right column.

<span id="page-158-0"></span>

**Figure 4.15:** Transients obtained for the test circuit of [figure 4.14](#page-157-0) for XT018 technology, varying various parameters. All simulations represent the typical corner. Only four curves are shown for illustrative reasons. **[\(a\)](#page-158-0)** Varying  $V$  from 1.3 V to 2.3 V. *T* = 25 °C,  $V_{BB}$  = 0 V. [\(b\)](#page-158-0) Varying *T* from −40 °C to 175 °C.  $V = 1.8$ V,  $V_{BB} = 0$ V. [\(c\)](#page-158-0) Varying  $V_{BB}$  from  $-0.5$  V to 0.5 V.  $T = 25$  °C,  $V = 1.8V$ .

The data was fitted to [equation \(2.22\),](#page-54-0) which is repeated here in [equation \(4.10\).](#page-159-0) Fitting used a least-squares optimization with the Python scipy.optimize.curve\_fit function. This resulted in the parameters shown in [table 4.6,](#page-159-1) which define the initial model describing the variation in delay based on temperature $T$ and supply voltage $V_{\text{DD}}$.

<span id="page-159-0"></span>
$$
t_{\rm PD} = f(V, T) = p_\beta + (C_1 + k_1 T^{n_1}) \frac{V}{(V - (C_2 - k_2 T^{n_2}))^{p_\alpha}} \tag{4.10}
$$
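The fitting step can be sketched in Python as follows. This is a reduced illustration and not the script used for the thesis: it uses synthetic data generated from invented parameter values instead of the exported *XT018* delays, and it fits only the voltage sweep. At a fixed temperature, the two temperature-dependent terms of the model collapse into constants, which are called `k` and `v_th` here; these names and the bounds are assumptions of this sketch.

```python
import numpy as np
from scipy.optimize import curve_fit


def delay_model(v, t, p_beta, c1, k1, n1, c2, k2, n2, p_alpha):
    """Full delay model as a function of supply voltage v and temperature t."""
    drive = c1 + k1 * t**n1
    v_th = c2 - k2 * t**n2
    return p_beta + drive * v / (v - v_th) ** p_alpha


def delay_at_fixed_t(v, p_beta, k, v_th, p_alpha):
    """Same model at constant temperature: k = C1 + k1*T^n1, v_th = C2 - k2*T^n2."""
    return p_beta + k * v / (v - v_th) ** p_alpha


# Synthetic stand-in for the exported sweep (NOT the XT018 data), delays in ps.
v_sweep = np.linspace(1.3, 2.3, 21)            # 50 mV steps around 1.8 V
t_pd = delay_at_fixed_t(v_sweep, 5.0, 30.0, 0.5, 1.3)

# Bounds keep v_th below the smallest supply voltage, so the power stays real.
popt, _ = curve_fit(
    delay_at_fixed_t, v_sweep, t_pd,
    p0=(4.0, 25.0, 0.4, 1.2),
    bounds=([0.0, 0.0, 0.0, 0.5], [100.0, 1000.0, 1.2, 3.0]),
    max_nfev=10000,
)
fitted = delay_at_fixed_t(v_sweep, *popt)
max_rel_error = np.max(np.abs(fitted - t_pd) / t_pd)
```

Because the parameters of this model are strongly correlated over such a narrow voltage range, the quality of the fit is better judged by the residual (`max_rel_error`) than by the individual parameter values. Fitting the full model would additionally require the temperature sweep.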

<span id="page-159-1"></span>

**Table 4.6:** Parameters obtained when least-squares fitting the delays obtained in simulations of the test circuit in the *XT018* technology to [equation \(4.10\).](#page-159-0)

**Process Variation** As in [\[3\]](#page-315-0), the next step in derivation of the delay model is characterization of the process variation dependency. Unfortunately, little information about process variation was available for the evaluated *XT018* technology. Variation of geometrical parameters (see [section 2.4\)](#page-49-0) could be simulated in *Virtuoso* through parametrization. However, the more interesting information is the absolute amount of variation to be expected, which could not be obtained. Therefore, instead of modifying the [SPICE](#page-367-0) circuit simulation to estimate process variation, factory provided data for digital cells was used: For those cells, the [PDK](#page-363-6) provides various corners (see [section 2.8\)](#page-85-0) for slow, typical and fast devices.

The test circuit of [figure 4.14](#page-157-0) has therefore been replicated in [VHDL](#page-364-0) as shown in [listing C.1](#page-375-0) on page [353.](#page-375-0) The test uses inverter cells of type INHDX0 from the D\_CELLS\_HD high density standard cell library shipped with the [PDK.](#page-363-6) The [VHDL](#page-364-0) test circuit has then been synthesized in Cadence *Genus* using all available corners. A timing analysis has been performed and the top 100 critical paths have been extracted. Out of those paths, the worst rise-fall and fall-rise propagation times for the second inverter, inv1, have been extracted. Again, only the fall-rise transitions have been used for modelling.

The absolute delay of the custom designed cell in [SPICE](#page-367-0) simulation is not identical to the one of the commercial cells in the [PDK.](#page-363-6) Because of that, a scaling factor has been derived using the delays of both cells at nominal operating conditions. For the [SPICE](#page-367-0) simulated results, the delay was obtained using the model in [equation \(4.10\).](#page-159-0) For the cell from the standard cell library, the delay has been obtained through simulation in Cadence *Genus*. The obtained scaling factor was used to scale all delays obtained in the *Genus* simulations before any further processing.

Then, as in [\[3\]](#page-315-0), all parameters except for $C_2$ and $p_\alpha$ were fixed to the values in [table 4.6.](#page-159-1) As only few data points were available, $p_{\alpha}$ was fixed as well. Then, all 4 delays obtained for the fast corner, at various temperatures and $V_{\text{DD}}$, were used to fit [equation \(4.10\)](#page-159-0) with these fixed parameters. This process was repeated with the delays obtained for the slow corner. The resulting $C_2$ values are shown in [table 4.7.](#page-160-0)

<span id="page-160-0"></span>

**Table 4.7:** Fitted  $C_2$  for [equation \(4.10\)](#page-159-0) and the Cadence *Genus* results for the *XT018* technology. Other parameters are kept fixed as in [table 4.6.](#page-159-1)

As in [\[3\]](#page-315-0), to obtain delays for continuous process variation instead of only for the corners, the  $C_2$  variable has been interpolated linearly, fitting it to the following equation:

$$
C_2(P) = C_{2C} + C_{2m} * P \tag{4.11}
$$

Here, *P* describes the process as a number ranging from  $-1$  (slow) to 1 (fast). Values were fitted as  $C_{2C} = -14.361068$  and  $C_{2m} = 0.26553966$  and the extended model with process variation is given as:

$$
t_{\rm PD} = f(V, T, P) = p_\beta + (C_1 + k_1 T^{n_1}) \frac{V}{(V - (C_2(P) - k_2 T^{n_2}))^{p_\alpha}} \tag{4.12}
$$
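As a brief illustration, the linear interpolation of equation (4.11) can be evaluated directly with the fitted values quoted above. The helper name `c2` and the range check are additions of this sketch.

```python
C2_C = -14.361068     # fitted offset, value for the typical process (P = 0)
C2_M = 0.26553966     # fitted slope per unit of the process parameter P


def c2(p):
    """C2 for a process value p between -1 (slow) and 1 (fast)."""
    if not -1.0 <= p <= 1.0:
        raise ValueError("process parameter must lie in [-1, 1]")
    return C2_C + C2_M * p


corner_values = {
    name: c2(p) for name, p in (("slow", -1.0), ("typical", 0.0), ("fast", 1.0))
}
```

The resulting values are approximately −14.627 for the slow, −14.361 for the typical and −14.096 for the fast process. Any intermediate process value is obtained the same way and can then be inserted as $C_2(P)$ into the delay model.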

[Figure 4.17](#page-164-0) on page [142](#page-164-0) shows examples of varying the process parameter P in the derived model, keeping other parameters at nominal values. Five values are depicted, $-1$ (slow), $-0.5$, 0 (typical), 0.5 and 1.0 (fast). Dots in these figures depict the simulated values obtained from Cadence *Genus*. Unfortunately, the available corners do not only differ in process variation, but also in supply voltage and temperature. Because of this, only a typical process corner is available at nominal temperature and supply voltage, so only one dot for the typical process is shown.

**Body Biasing** As a next step, the dependency of body biasing, used to control the performance of the device, is added to the model. Body biasing is not covered in [\[3\]](#page-315-0), but the extension of the model is straightforward: Comparing [equation \(4.10\)](#page-159-0) and equation (2.12) to the physical model of the drain current in [equation \(2.3\)](#page-34-0) shows the physical motivation of the Scarpato delay model and that the threshold voltage $V_{th}$ is modeled like this:

<span id="page-161-1"></span>
$$
V_{\text{th}} \propto (C_2(P) - k_2 T^{n_2}) \tag{4.13}
$$

Body biasing affects the threshold voltage $V_{th}$ based on body effect factor $\gamma$ and the Fermi potential $\Phi$ according to [equation \(2.1\)](#page-33-0) on page [11.](#page-33-0) The original model has therefore been extended with additional terms affecting the threshold voltage following equation (2.1) in equation (4.15). The new complete model is given in [equation \(4.14\):](#page-161-1)

$$
\begin{aligned}
t_{\rm PD} &= f(V, T, P, C) \\
&= p_{\beta} + (C_1 + k_1 T^{n_1}) \ast \frac{V}{(V - (C_2(P) + C_3(C) - k_2 T^{n_2}))^{p_{\alpha}}}
\end{aligned} \tag{4.14}
$$

$$
C_3(C) = C_{\gamma} \ast (\sqrt{|-C_{\Phi} - C|} - \sqrt{|-C_{\Phi}|}) \tag{4.15}
$$

The control input C corresponds directly to the body bias voltage $V_{BS}$, and $C_{\gamma}$ and $C_{\Phi}$ are newly introduced fitting parameters.

The test circuit of [figure 4.14](#page-157-0) was evaluated for a body bias voltage sweep from −500 mV to 500 mV, where bulk voltages are defined as in the following:

<span id="page-161-0"></span>
$$
V_{\rm BS,p} = V_{\rm DD} - V_{\rm BS} \tag{4.16}
$$

$$
V_{\rm BS,n} = 0 \, \text{V} + V_{\rm BS} \tag{4.17}
$$

[PMOS](#page-363-3) and [NMOS](#page-363-4) devices are therefore biased symmetrically. Positive $V_{BS}$ is defined as forward body biasing, where the delay of the circuit is reduced. The voltage has been increased in 50 mV increments and the resulting 21 data points are shown as dots in the left-hand side figures of [figure 4.17.](#page-164-0) The newly introduced parameters were then fitted as $C_{\gamma} = 0.17329925253921682$ and $C_{\Phi} = 0.7588269701830175$. Interpolated results, which were obtained using this model, are shown in [figure 4.17.](#page-164-0)
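The body bias term of equation (4.15) and the bulk voltage definitions of equations (4.16) and (4.17) can be evaluated with the fitted values as shown in the following sketch. The function names and the construction of the sweep are illustrative assumptions; the remaining model parameters are not needed for this term.

```python
import math

C_GAMMA = 0.17329925253921682   # fitted body effect factor
C_PHI = 0.7588269701830175      # fitted potential term


def c3(v_bs):
    """Shift of the threshold term caused by the body bias voltage v_bs (volts)."""
    return C_GAMMA * (math.sqrt(abs(-C_PHI - v_bs)) - math.sqrt(abs(-C_PHI)))


def bulk_voltages(v_bs, v_dd=1.8):
    """Bulk potentials (PMOS, NMOS) for a symmetric body bias v_bs."""
    return v_dd - v_bs, 0.0 + v_bs


# 21 points from -500 mV to 500 mV in 50 mV increments, as in the sweep above.
sweep = [round(-0.5 + 0.05 * i, 2) for i in range(21)]
shifts = [c3(v) for v in sweep]
```

By construction, the term vanishes for $V_{BS} = 0$ V, so the extended model reduces to the model with process variation when no body bias is applied.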

**Aging** As a final step, the model was extended to model aging. Unfortunately, no detailed aging information was available in the used *XT018* [PDK.](#page-363-6) This is an issue, as aging itself depends on temperature and voltage [\[3\]](#page-315-0), which makes it difficult to model without sufficient data. However, as the model derived here will only be used as a realistic test case, it does not have to model the *XT018* technology exactly. As long as the model behaves in a way a real technology would, it is suitable for the architecture evaluation. Because of this, an existing aging model from [\[3\]](#page-315-0) has been integrated into the [SOI](#page-364-4) reference model used here. Following equation (3.10) in [\[3\]](#page-315-0), aging is introduced as a shift in $V_{th}$ with $\Delta V_{th} = A$, yielding the final model:

$$
\begin{aligned}
t_{\rm PD} &= f(V, T, P, C, A) \\
&= p_{\beta} + (C_1 + k_1 T^{n_1}) \ast \frac{V}{(V - (C_2(P) + C_3(C) + A - k_2 T^{n_2}))^{p_{\alpha}}}
\end{aligned} \tag{4.18}
$$

The simplest of the aging models investigated in [\[3\]](#page-315-0) has been chosen for the final propagation delay model:

<span id="page-162-2"></span><span id="page-162-1"></span>
$$
A = f(t, V_A, T_A) = C * t^n * V_A^{\gamma} * e^{-900/T_A} \tag{4.19}
$$

The parameters have been taken from table 3.5 in [\[3\]](#page-315-0) and are reprinted in this thesis in [table 4.8.](#page-162-0)

<span id="page-162-0"></span>

**Table 4.8:** Parameters for [equation \(4.19\)](#page-162-1) taken from Table 3.5 in [\[3\]](#page-315-0). This reproduces the aging model for the technology evaluated in [\[3\]](#page-315-0).

This approach is realistic, as this aging model represents an absolute shift of the threshold voltage. Even though the technologies are different, absolute values of threshold voltages are similar, which makes it possible to obtain a realistic combined model. It should be noted that parameters $V$ and $T$ in [equation \(4.18\)](#page-162-2) are instantaneous values at a certain time $t$. The values in [equation \(4.19\)](#page-162-1) on the other hand are values which represent the aging process, e.g. devices age faster at higher temperature. $V$ and $V_{\mathrm{A}}$, and $T$ and $T_{\mathrm{A}}$, therefore have to be considered to be independent parameters.

<span id="page-163-0"></span>

**Figure 4.16:** $t_{\text{PD}}$ model (lines) and fitting data (dots) for the *XT018* technology. The left-hand side figures show propagation delay vs. supply voltage, parametrized on the four other parameters. The right-hand side shows the delay dependency on temperature. Parameters were chosen to yield five exemplary curves covering the full parameter range, where one curve matches the parameter value for the original fitting data.

<span id="page-164-0"></span>

**Figure 4.17:** $t_{\text{PD}}$ model (lines) and fitting data (dots) for the *XT018* technology. The left-hand side figures show propagation delay vs. body bias voltage, parametrized on the four other parameters. The right-hand side shows the delay dependency on process variation. Parameters were chosen to yield five exemplary curves covering the full parameter range, where one curve matches the parameter value for the original fitting data.



**Figure 4.18:** $t_{\text{PD}}$ model (lines) and fitting data (dots) for the *XT018* technology. Figures show propagation delay vs. aging, parametrized on the four other parameters. Aging was simulated at a temperature of 120 °C and 1.8 V.

**Normalization** Ultimately, the model in [equation \(4.9\)](#page-141-0) has been normalized to yield a value of 1.0 at nominal conditions, $P = 0$, $V = 1.8\,\text{V}$, $T = 25\,^{\circ}\text{C}$, $A = 0\,\text{V}$ and $C = 0\,\text{V}$. As a result, the final model in [equation \(4.18\)](#page-162-2) needs to be scaled by $C_{\text{norm}} = 1.2068296052 \times 10^{10}$.

**Leakage Current** In addition to the model for $t_{\text{PD}}$, a model for the leakage current is required to model power through static leakage [\(equation \(2.8\)\)](#page-44-0). Similar to the $t_{\text{PD}}$ model, an inverter is used as a representative [CMOS](#page-361-5) circuit and a factor describing increase or decrease has been derived. The leakage current in a [CMOS](#page-361-5) inverter is mostly determined by the $I_D$ current of the transistor in off state. As such, two aspects are primarily important: The first is the absolute value of the gate voltages, i.e. whether the driving gate achieves full output swing. For this dissertation, this aspect will be assumed to be given. The second aspect is the $I_D$ when $V_{GS}$ is close to 0 V.

Theoretically, this drain current can again depend on $P$, $V$, $T$ and aging. For this thesis, the most important relation is the dependency on the control input or body bias voltage $V_{BS}$. As the derivation of a more complete model is out of scope of this thesis, only the $V_{BS}$ dependency has been modeled. For this, a single inverter has been modeled in *Virtuoso*, again using the geometric parameters of [table 4.5.](#page-155-1) The inverter input has been biased to 0 V and the body bias voltage was again swept from −500 mV to 500 mV in 21 steps. The results of this simulation are shown as dots in [figure 4.19.](#page-166-0)

<span id="page-166-0"></span>

**Figure 4.19:** Leakage current in an inverter for the *XT018* technology. Dots depict values simulated in Cadence *Virtuoso* [SPICE](#page-367-0) simulation, the line shows values obtained through [equation \(4.20\).](#page-166-1)

This data was then fitted to a shifted power function:

<span id="page-166-1"></span>
$$
I_{\text{leak}} = f(C) = C_{\text{ID}} * (1.8 - C)^{\alpha_{\text{ID}}} \tag{4.20}
$$

The parameters for this model and the test data were calculated as $C_{\text{ID}} = 1.14293038 \times 10^{-6}$ and $\alpha_{\text{ID}} = -1.46337323 \times 10^{1}$.
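Since the system level evaluation only needs a relative leakage factor, the fitted function can be wrapped as shown in the following sketch. The function names and the normalization to the unbiased device are assumptions made for illustration; the two constants are the fitted values given above.

```python
C_ID = 1.14293038e-6
ALPHA_ID = -1.46337323e1


def leakage_current(v_bs):
    """Fitted inverter leakage current for a body bias voltage v_bs in volts."""
    return C_ID * (1.8 - v_bs) ** ALPHA_ID


def leakage_factor(v_bs):
    """Leakage relative to the unbiased device (v_bs = 0 V)."""
    return leakage_current(v_bs) / leakage_current(0.0)
```

According to this fit, a forward body bias of 500 mV increases the leakage by roughly two orders of magnitude compared to the unbiased device, whereas a reverse bias of −500 mV reduces it to a few percent. The fit is only backed by data within the simulated range of ±500 mV.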

#### **RFET Technology**

Now that the model for [SOI](#page-364-4) technology has been derived, a similar $t_{\text{PD}}$ model will be discussed for [RFET](#page-364-3) technology: This work will primarily use Galderisi's *RGATE* [\[246\]](#page-338-2) to demonstrate the feasibility of an [FPGA](#page-362-0) based on ambipolar technology. To estimate the feasibility of the introduced [PVTA](#page-364-5) compensation concepts, and to demonstrate the potential of [RFETs,](#page-364-3) evaluations should ideally be performed using the *RGATE* technology.

Unfortunately, the *RGATE* and related technology were only introduced recently (2024 and 2022). Because of this, no propagation delay model, [PDK](#page-363-6) or even [SPICE](#page-367-0) model is available yet. However, extensive device characterization data for various parameters has been published in [\[42\]](#page-319-0) and [\[2\]](#page-315-1). As a temporary solution until more exact models become available, the Scarpato model will be adapted and fitted to the device data in a best-effort approach. Certain deviations and inaccuracies are expected: Physical effects determining the device performance are partially different for Schottky Barrier based [RFETs,](#page-364-3) whereas the Scarpato model was derived for standard silicon technologies. Conclusions drawn using the adapted model should therefore be interpreted with caution, but this model at least enables a proof-of-concept evaluation. Concrete limitations of the model will be explained during the derivation and discussion of the model.

The general approach for $t_{\text{PD}}$ modelling using device data was derived through observation of [equation \(2.11\).](#page-45-1) Introducing the modeled parameters results in the following equation:

<span id="page-167-0"></span>
$$
t_{\rm PD}(V, T, P, C, A) \propto \frac{C_{\rm L}V}{I_{\rm D}(V, T, P, C, A)} \tag{4.21}
$$

The parameters are the same as in the [SOI](#page-364-4) model: $V$ is the supply voltage $V_{\text{DD}}$, $T$ the temperature, $P$ the process variable, $C$ the control variable for the [PG](#page-363-7) and $A$ an aging parameter. In the equation, it has been assumed that $C_{\rm L}$ is independent of the parameters. Whether this is the case needs to be confirmed through future device characteristics measurements and physical modeling, which are out of scope for this dissertation.

[Equation \(4.21\)](#page-167-0) only provides a proportionality, so deriving absolute propagation delays requires fitting a model to measured or simulated delays. It is however not possible to perform these evaluations for the chosen [RFET](#page-364-3) technology, as neither measured data nor simulation models are available. This dissertation however only needs information about relative effects in delay: If one of the parameters changes, how does the delay change in comparison to the original value? This normalization is used even for the [SOI](#page-364-4) model, where absolute data is theoretically available: A model using absolute values is only valid for a single, specific gate with specific load conditions. Deriving relative changes instead enables generalization of the model for various gates, as will be shown later.

The relative delay can then be described as:

<span id="page-167-1"></span>
$$
\frac{t_{\rm PD}(V,T,P,C,A)}{t_{\rm PD}(V_0,T_0,P_0,C_0,A_0)} = \frac{V \cdot I_{\rm D}(V_0,T_0,P_0,C_0,A_0)}{V_0 \cdot I_{\rm D}(V,T,P,C,A)}\tag{4.22}
$$

It can be seen that [equation \(4.22\)](#page-167-1) allows derivation of a propagation delay model when knowing only the drain current $I_D$ and supply voltage $V_{\text{DD}}$. These values are available from [\[2,](#page-315-1) [42\]](#page-319-0): $I_D$ is the main quantity evaluated in device characterization and directly available. For $V_{\text{DD}}$, the situation is more complicated: In a typical [CMOS-](#page-361-5)like circuit, $V_{\text{DD}}$ determines both the drain-source voltage of the device and the control gate voltage. How this voltage exactly determines the drain-source voltage in a circuit depends on the circuit and will not be discussed here in detail. For the model, measurements performed using a drain-source voltage of 1 V will be used, which is the highest voltage measured. The $V_{\text{DD}}$ parameter dependency will be extracted from the control gate voltage.

Another source of errors is inherent in this derivation process: Obtaining propagation delays from measured drain currents requires an inversion operation. Currents are however generally small, which causes larger relative measurement errors. When these values are inverted, such small errors lead to larger errors in propagation delay. It would therefore be preferable to directly measure delays, but respective data is not available. Measurement errors are partially reduced through the normalization step performed in [equation \(4.22\),](#page-167-1) but this only cancels gain errors.
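The effect of the normalization on measurement errors can be illustrated with a short Python sketch of equation (4.22). The drain currents below are hypothetical values chosen only for illustration and are not taken from [\[2\]](#page-315-1) or [\[42\]](#page-319-0); the function name is an assumption as well.

```python
def relative_delay(v, i_d, v0, i_d0):
    """Delay relative to the nominal operating point (v0, i_d0)."""
    return (v * i_d0) / (v0 * i_d)


# Hypothetical drain currents in amperes, for illustration only.
i_nominal, i_reduced = 2.0e-6, 1.0e-6

ideal = relative_delay(3.0, i_reduced, 3.0, i_nominal)

# A 10 % gain error scales both currents and cancels in the ratio.
with_gain_error = relative_delay(3.0, 1.1 * i_reduced, 3.0, 1.1 * i_nominal)

# A constant 0.1 uA offset does not cancel and affects the smaller current more.
offset = 0.1e-6
with_offset_error = relative_delay(3.0, i_reduced + offset, 3.0, i_nominal + offset)
```

In this example, the gain error leaves the relative delay at 2.0, whereas the offset error changes it to about 1.91. This shows why only gain errors are removed by the normalization.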

**Voltage and Temperature** Galderisi et al. provide characterizations for [RFETs](#page-364-3) in low- and high-threshold configurations and for *n*, *p* and *ambipolar* modes [\[2\]](#page-315-1). For brevity, the model will be parametrized for only the *p* mode low-threshold configuration, although the *RGATE* uses both low- and high threshold configurations and *n* and *p* mode. In the model parameters and in the figures shown, the sign of all voltages will be flipped so that the final model parameters are positive numbers.

<span id="page-168-0"></span>

**Table 4.9:** Parameters for RFET propagation delay model in nominal conditions. Data was derived from drain current measurements in [\[2\]](#page-315-1).

For the voltage dependency, the data available from figure 1h in [\[2\]](#page-315-1) has been used. The data was extracted for a drain voltage of −1 V, a program gate voltage of −3 V and a temperature of 25 °C. As for all measurements used, it will be assumed that values correspond to the typical process. Control gate values were selected from $-2.45$ V to $-3$ V to ensure the current values are obtained in the saturation region.

Temperature data was obtained from figure 5c in [\[2\]](#page-315-1), which again describes the *p* mode in low-threshold configuration. Program gate and drain voltage were again fixed at −3 V and −1 V, and the data was assumed to represent the typical process. Drain currents were extracted for a control gate voltage of −3 V and temperatures from 325 K to 425 K. In addition, a data point representing room temperature was taken from the voltage dependency dataset.

As in the [SOI](#page-364-4) model, data was fitted to [equation \(2.22\)](#page-54-0) using a least-squares fit and the Python scipy.optimize.curve\_fit function:

<span id="page-169-0"></span>
$$
t_{\rm PD} = f(V, T) = p_\beta + (C_1 + k_1 T^{n_1}) \frac{V}{(V - (C_2 - k_2 T^{n_2}))^{p_\alpha}} \tag{4.23}
$$

This resulted in the parameters shown in [table 4.9](#page-168-0) for the initial model describing delay variation for temperature $T$ and supply voltage $V_{\text{DD}}$. [Figure 4.20](#page-172-0) on page [150](#page-172-0) shows how the final model fits the extracted values. The model in [equation \(4.23\)](#page-169-0) corresponds to the final model with $C = 3.0$ V, $P = 0$ and $t = 0$, i.e. no aging. Data in the figures has been normalized so that the delay at nominal condition ($T = 25\,^{\circ}\text{C}$, $V = 3.0\,\text{V}$, $C = 3.0\,\text{V}$, $P = 0$, $t = 0$) equals 1.0.

**Process Variation** Galderisi et al. provide some initial process variation data in [\[2,](#page-315-1) [42\]](#page-319-0). As this is currently a lab-scale process and scaling of the technology will likely have effects on process variation, this data was not used and the process variation data obtained for the industrial [SOI](#page-364-4) process reference has been adapted instead: First, relative changes in delay for worst and best process have been evaluated for the [SOI](#page-364-4) model, obtained at otherwise nominal conditions. This led to a propagation delay increase of 30% for the slow process and a decrease of 34% for the fast process. These values were then used to scale the previously obtained temperature and voltage datasets. Like in the [SOI](#page-364-4) model, all parameters except for $C_2$ were fixed to the values in [table 4.9.](#page-168-0) The data for the slow and fast process was then again fitted to this model, yielding the $C_2$ values shown in [table 4.10.](#page-170-0)

Following [3], the  $C_2$  variable has again been interpolated linearly by fitting it to the following equation:

$$
C_2(P) = C_{2C} + C_{2m} * P \tag{4.24}
$$

<span id="page-170-0"></span>

**Table 4.10:** Fitted  $C_2$  for [equation \(4.23\)](#page-169-0) and relative process variation matching the *RFET* technology. Other parameters are kept fixed as in [table 4.9.](#page-168-0)

Here,  $P$  again describes the process as a number ranging from  $-1$  (slow) to  $1$  (fast). Values were fitted as  $C_{2C} = -4.23116198 \times 10^4$  and  $C_{2m} = -666.535$, and the extended model with process variation is given as:

$$
t_{\rm PD} = f(V, T, P) = p_\beta + (C_1 + k_1 T^{n_1}) \frac{V}{(V - (C_2(P) - k_2 T^{n_2}))^{p_\alpha}} \tag{4.25}
$$
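The linear interpolation of  $C_2$  over the process parameter can be sketched in a few lines of Python. The two constants are the fitted values quoted in the text; the function name `c2_of_process` and the range check are illustrative additions, not part of the original framework.

```python
C2_C = -4.23116198e4  # fitted offset C_2C as quoted in the text
C2_M = -666.535       # fitted slope C_2m as quoted in the text


def c2_of_process(p: float) -> float:
    """Linear interpolation C_2(P) = C_2C + C_2m * P, equation (4.24)."""
    if not -1.0 <= p <= 1.0:
        # P is only defined between the slow (-1) and fast (+1) process.
        raise ValueError("process parameter P must lie in [-1, 1]")
    return C2_C + C2_M * p


if __name__ == "__main__":
    for label, p in (("slow", -1.0), ("typical", 0.0), ("fast", 1.0)):
        print(label, c2_of_process(p))
```

The returned value would then replace the constant  $C_2$  when evaluating the delay model for a given process corner.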

A comparison of the nominal operating point and the modelled values for various different process parameters is shown in [figure 4.21](#page-173-0) on page [151](#page-173-0) in the right column. As can be seen in [figure 4.20d,](#page-172-0) the model becomes unrealistic for high temperatures in the fast process, even leading to negative values. Evaluations in this parameter combination region should therefore be avoided.

**Program Gate Voltage** For the [PG](#page-363-7) voltage, a different modeling approach had to be chosen compared to the [SOI](#page-364-4) body biasing voltage: The physical principles are different and figure 1j in [\[2\]](#page-315-1) suggests that the influence of the [PG](#page-363-7) voltage is exponential. With the previously shown fitted parameters,  $p_{\beta}$ is comparatively large. This means that the whole second term can only have limited effects on the propagation delay value. Modeling the control parameter as part of this term, like in the [SOI](#page-364-4) model, could therefore not represent this exponential influence.

$$
\begin{aligned}
t_{\rm PD} &= f(V, T, P, C) \\
&= \left( p_\beta + (C_1 + k_1 T^{n_1}) \frac{V}{(V - (C_2(P) - k_2 T^{n_2}))^{p_\alpha}} \right) \cdot C_3(C)
\end{aligned} \tag{4.26}
$$

$$
C_3(C) = C_\gamma \cdot e^{C_\phi \cdot C} \tag{4.27}
$$

As a solution, the slightly modified version of the model shown above was introduced. Instead of modifying the second term, the whole function is scaled using an exponential function. This model is not physically motivated and is therefore a purely mathematical model. It assumes that the [PG](#page-363-7) voltage and other parameters are uncorrelated and can be modelled independently, which needs to be verified in future research and with more measurements. For the evaluation of the [PARFAIT](#page-363-1) system, this assumption is however not limiting.

The data to be matched was extracted from figure 1i in [\[2\]](#page-315-1) for  $V = 3\,\mathrm{V}$. Drain voltage and temperature were 1 V and 25 °C respectively, and values belong to the typical process. The [PG](#page-363-7) voltages for the extracted drain currents were between 0 V and 3 V. The exponential term was then first fitted independently to the normalized, inverted drain current values. The least-squares fit was modified with the additional constraint that the value for  $V_{\text{PG}} = 3\,\mathrm{V}$  matches the nominal model exactly, i.e. it yields exactly 1.0. This ensures high accuracy for the nominal [PG](#page-363-7) voltage.

The newly introduced parameters were then fitted as  $C_{\gamma} = 71310.4$  and  $C_{\phi} = -3.724932$. Interpolated results obtained using this model are shown in [figure 4.21](#page-173-0) on page [151](#page-173-0) in the left column. Note that these results are plotted using a linear y axis, whereas other figures showing the [PG](#page-363-7) voltage as a parameter are plotted on a logarithmic y axis.
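As a quick plausibility check of the fitted scaling term, the following Python sketch evaluates the exponential factor of equation (4.27) with the two constants quoted above. The function name `c3_of_pg_voltage` is illustrative. At the nominal program gate voltage of 3 V the factor should come out very close to 1.0, as required by the fitting constraint.

```python
import math

C_GAMMA = 71310.4   # fitted scale factor as quoted in the text
C_PHI = -3.724932   # fitted exponent coefficient as quoted in the text


def c3_of_pg_voltage(v_pg: float) -> float:
    """Delay scaling factor C_3(C) = C_gamma * exp(C_phi * C), equation (4.27)."""
    return C_GAMMA * math.exp(C_PHI * v_pg)


if __name__ == "__main__":
    for v_pg in (3.0, 2.5, 2.0, 1.5):
        print(f"V_PG = {v_pg:.1f} V -> delay factor {c3_of_pg_voltage(v_pg):.4g}")
```

Because the factor is purely exponential, each reduction of the program gate voltage by 0.5 V multiplies the modelled delay by the same constant, roughly 6.4 for these parameters.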

**Aging** As there is no aging data available for [RFETs,](#page-364-3) the same model as used for the [SOI](#page-364-4) reference has been integrated into the [RFET](#page-364-3) model. For the same reasons as in the [PG](#page-363-7) voltage modeling, aging can not easily be modeled as a shift in threshold voltage. A solution similar to the [PG](#page-363-7) voltage model was used: The dependency was extracted from nominal values in the [SOI](#page-364-4) model [\(SOI](#page-364-4) aging at 1.8 V and 120 °C). The obtained data was then fitted to a shifted root function and introduced in the final model:

$$
\begin{aligned}
t_{\rm PD} &= f(V, T, P, C, A) \\
&= \left( p_\beta + (C_1 + k_1 T^{n_1}) \frac{V}{(V - (C_2(P) - k_2 T^{n_2}))^{p_\alpha}} \right) \cdot C_3(C) \cdot A
\end{aligned} \tag{4.28}
$$

where the aging parameter is defined as:

$$
A = f(t) = 1 + C_A * t^{\alpha_A} \tag{4.29}
$$

This models aging independently of temperature and voltage, which is sufficient for this thesis. Values were fitted to  $C_A = 0.00192291$  and  $\alpha_A = 0.19914002$ . Examples of aging data obtained using the model are shown in [figure 4.22](#page-174-0) on page [152.](#page-174-0)
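The aging factor of equation (4.29) can be evaluated directly, as in the following minimal Python sketch. The constants are the fitted values quoted above; the unit of the aging time  $t$  is not stated in this section, so the example treats it as a plain number in whatever unit the [SOI](#page-364-4) reference model uses. The function name `aging_factor` is illustrative.

```python
C_A = 0.00192291      # fitted coefficient as quoted in the text
ALPHA_A = 0.19914002  # fitted exponent as quoted in the text


def aging_factor(t: float) -> float:
    """Multiplicative delay factor A = 1 + C_A * t**alpha_A, equation (4.29)."""
    if t < 0.0:
        raise ValueError("aging time must be non-negative")
    return 1.0 + C_A * t ** ALPHA_A


if __name__ == "__main__":
    # The time values are arbitrary sample points in the (assumed) model unit.
    for t in (0.0, 1.0, 1e3, 1e6):
        print(t, aging_factor(t))
```

Since the factor is exactly 1.0 at  $t = 0$, multiplying it onto equation (4.26) leaves the unaged model unchanged.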

<span id="page-172-0"></span>

**Figure 4.20:**  $t_{\text{PD}}$  model (lines) and fitting data (dots) for the *RFET* technology. The left-hand side figures show propagation delay vs. supply voltage, parametrized on the four other parameters. The right-hand side shows the delay dependency on temperature. Parameters were chosen to yield five exemplary curves covering the full parameter range, where one curve matches the parameter value for the original fitting data.

<span id="page-173-0"></span>

**Figure 4.21:**  $t_{\text{PD}}$  model (lines) and fitting data (dots) for the *RFET* technology. The left-hand side figures show propagation delay vs. program gate voltage, parametrized on the four other parameters. The right-hand side shows the delay dependency on process variation. Parameters were chosen to yield five exemplary curves covering the full parameter range, where one curve matches the parameter value for the original fitting data.

<span id="page-174-0"></span>

**Figure 4.22:**  $t_{\text{PD}}$  model (lines) and fitting data (dots) for the *RFET* technology. Figures show propagation delay vs. aging, parametrized on the four other parameters. Aging was matched to [SOI](#page-364-4) aging at 120 °C and 1.8 V.

**Leakage Current** Leakage data was extracted from drain currents for a control gate voltage of 0 V from figure 1k in [\[2\]](#page-315-1). Program gate voltages were selected from 0 V to 3 V. Data had to be extracted from the  $n$  mode graphs, as data on  $p$  mode were not discernible at 0 V. [Figure 4.23](#page-175-0) shows the measured data as dots and the final model as a line.

For the final model, the measured data was fitted to the following exponential function:

<span id="page-174-1"></span>
$$
I_{\text{leak}} = f(C) = C_{\text{ID}} * e^{\alpha_{\text{ID}} \cdot C} \tag{4.30}
$$

The parameters for this model and the test data were calculated as  $C_{\text{ID}} = 2.81359916 \times 10^{-14}$  and  $\alpha_{\text{ID}} = 3.55417232 \times 10^{-1}$.
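The leakage model of [equation \(4.30\)](#page-174-1) is simple enough to evaluate by hand, but a short Python sketch helps to see how strongly the program gate voltage influences leakage. The constants are the fitted values quoted above; the function name `leakage_current` is illustrative, and the unit of the result is the unit of the extracted drain currents, which is not restated in this section.

```python
import math

C_ID = 2.81359916e-14      # fitted prefactor as quoted in the text
ALPHA_ID = 3.55417232e-1   # fitted exponent coefficient as quoted in the text


def leakage_current(v_pg: float) -> float:
    """Leakage I_leak = C_ID * exp(alpha_ID * C), equation (4.30)."""
    return C_ID * math.exp(ALPHA_ID * v_pg)


if __name__ == "__main__":
    base = leakage_current(0.0)
    for v_pg in (0.0, 1.0, 2.0, 3.0):
        i_leak = leakage_current(v_pg)
        print(f"V_PG = {v_pg:.1f} V: I_leak = {i_leak:.3e} ({i_leak / base:.2f}x)")
```

Printing the ratio to the 0 V value makes the leakage side of the trade-off directly comparable with the delay scaling factor of equation (4.27).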

### **4.7 PVTA Scenario Modeling**

<span id="page-175-0"></span>

**Figure 4.23:** Leakage current in transistor for the *RFET* technology. Dots depict values extracted from [\[2\]](#page-315-1), the line shows simulated values obtained through the model in [equation \(4.30\).](#page-174-1)

Previously introduced delay models for [SOI](#page-364-4) and [RFET](#page-364-3) provide the delay for a circuit under certain [PVTA](#page-364-5) conditions and body bias. To simulate the complete [FPGA](#page-362-0) architecture, it is necessary to derive [PVTA](#page-364-5) conditions on the chip as inputs for the delay models. In the following, specific [PVTA](#page-364-5) scenarios used for evaluation in this thesis will be discussed, but the propagation delay models and simulation framework enable simulation of any other [PVTA](#page-364-5) scenario as well. The model's  $C$  input used for body biasing will not be discussed here: It is not an external scenario influence, but will be directly controlled by the [FPGA](#page-362-0) architecture instead.

**Process Variation** As introduced in [section 2.4](#page-49-0) on page [27,](#page-49-0) models for process variation are usually technology dependent. As the required technology parameters are not always available, this can make adaptation of some models difficult. Some of them also require detailed modeling of the circuit for which process variation is analyzed, e.g. in SPICE. As a primary aim of the framework developed in this thesis is quick evaluation of different architectures, such extensive modelling approaches are not feasible and a simpler model is needed.

The previously introduced VARIUS process variation model [\[53\]](#page-320-0) fits these requirements: It has few parameters and provides exemplary values, which match the empirical study in [\[56\]](#page-320-1). It will therefore be used for all evaluation scenarios in this thesis. The simulation framework derived here however also supports more advanced models such as [\[60,](#page-320-2) [61,](#page-320-3) [247,](#page-338-3) [248\]](#page-339-0), when required parameter values are available.

<span id="page-176-0"></span>

**Figure 4.24:** Simulated intra-die process variation showing the transistor threshold voltage  $V_{th}$  for a 24x24 grid. The spatially correlated part is shown in **[\(a\)](#page-176-0)**, the uncorrelated part in **[\(b\)](#page-176-0)** and **[\(c\)](#page-176-0)** depicts the resulting total variation. Brighter color indicates smaller  $V_{th}$.

The VARIUS model considers two main factors in process variation, threshold voltage  $V_{th}$  and the effective gate length  $L_{eff}$. For the discussion here and for the derivation of the  $P$  parameter, only the  $V_{th}$  part will be used. Following equations  $(2.19)$  and  $(2.20)$, 2D process variation maps will be derived as random samples of the stochastic distribution determined by the correlation between values:

$$
\mathrm{corr}(P_{\vec{x}}, P_{\vec{y}}) = \rho(r) \qquad r = |\vec{x} - \vec{y}| \tag{4.31}
$$

$$
\rho = \begin{cases} 1 - \frac{3r}{2\Phi} + \frac{r^3}{2\Phi^3} & (r \le \Phi) \\ 0 & \text{otherwise} \end{cases} \tag{4.32}
$$

Values are then normalized to be in range  $[-1,1]$  to match the P process parameter. For the simulation,  $x$  and  $y$  are chosen to be coordinates on the finest [FPGA](#page-362-0) grid, the block grid. This way, each [CLB](#page-361-0) gets a distinct process variation factor. The  $P$  parameter is finally given as:

$$
P = f(x, y) = \text{map}_i(x, y), \tag{4.33}
$$

where  $\mathrm{map}_i$  is one randomly sampled 2D map of the VARIUS model.

In the evaluation, a single map will be considered. Given the stochastic nature of the variation map, results obtained with a single instance are representative. The parameter values  $\Phi = 0.5$,  $\sigma/\mu = 0.063$  for random and systematic  $V_{th}$  variation presented in the original VARIUS paper will be used for process variation modeling throughout this thesis. An exemplary 24x24 variation grid for  $V_{th}$  derived in this way is shown in [figure 4.24.](#page-176-0)
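One possible way to draw such a map is sketched below in plain Python. It builds the correlation matrix from equation (4.32), factorizes it with a Cholesky decomposition and multiplies the factor with independent Gaussian samples, which is a standard technique for sampling correlated Gaussian fields and not necessarily how the thesis framework implements it. Several points are assumptions made only for this sketch: grid coordinates are scaled to the unit square so that  $\Phi = 0.5$  corresponds to half the grid width, the random part has the same variance as the systematic part, and the final normalization to  $[-1, 1]$  is a min-max scaling. Because of that normalization, the absolute value of  $\sigma/\mu$  does not appear in the code.

```python
import math
import random


def rho(r: float, phi: float) -> float:
    """Spherical correlation function of equation (4.32)."""
    if r > phi:
        return 0.0
    return 1.0 - 3.0 * r / (2.0 * phi) + r ** 3 / (2.0 * phi ** 3)


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def sample_process_map(n: int, phi: float = 0.5, seed: int = 0):
    """Return an n x n map of P values in [-1, 1] (assumed min-max scaling)."""
    rng = random.Random(seed)
    # Assumption: block grid coordinates are scaled to the unit square.
    pts = [(x / (n - 1), y / (n - 1)) for y in range(n) for x in range(n)]
    # A small diagonal jitter keeps the factorization numerically stable.
    cov = [[rho(math.dist(p, q), phi) + (1e-9 if p == q else 0.0) for q in pts]
           for p in pts]
    low = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in pts]
    systematic = [sum(low[i][k] * z[k] for k in range(i + 1))
                  for i in range(len(pts))]
    # Assumption: the uncorrelated part has the same variance as the systematic part.
    total = [s + rng.gauss(0.0, 1.0) for s in systematic]
    lo, hi = min(total), max(total)
    flat = [2.0 * (v - lo) / (hi - lo) - 1.0 for v in total]
    return [flat[r * n:(r + 1) * n] for r in range(n)]


if __name__ == "__main__":
    for row in sample_process_map(6):
        print(" ".join(f"{v:+.2f}" for v in row))
```

The pure-Python factorization scales cubically with the number of grid points, so for a 24x24 grid a numerical library would be the more practical choice.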

**Voltage Variation** As motivated in [section 2.4,](#page-49-0) supply voltage variation affects the propagation delay on [FPGAs.](#page-362-0) In [ASIC](#page-361-6) design, voltage variation is analyzed deterministically for each circuit, simulating switching activity and power supply wire density to determine IR drop. The information obtained in simulation is then commonly used to increase metal width or wire density locally. This technique cannot be used in [FPGAs,](#page-362-0) as the local resource utilization and switching activity are not known at manufacturing time of the [IC](#page-362-7) and only become known after application placement.

As deterministic analysis is effective, statistical models describing voltage variation for artificial circuits are not commonly available. Such models could be derived through analysis of multiple real circuits, but this is not necessary for this thesis: In a simpler approach, a deterministic, application specific voltage drop model is used for each user application. An IR drop analysis as sophisticated as in [ASIC](#page-361-6) design is however out of scope for this thesis: It requires detailed knowledge of the power supply network of the modeled [FPGA](#page-362-0) architecture and modeling of the metal layers of the used technology. In high-level, early design stage [DSE](#page-362-8) of [FPGA](#page-362-0) architectures, this information may not be readily available.

In this thesis, it is therefore assumed that the power grid in the [FPGA](#page-362-0) architecture is regular: For all locations on the [FPGA,](#page-362-0) an identical utilization percentage and switching activity will lead to the same IR drop. IR drop then does not only depend on a single unit such as one [CLB,](#page-361-0) but on all devices in the vicinity as well, as they share at least parts of the supply. Furthermore, it is assumed that all [LEs](#page-362-6) within a power region have identical supply voltage: This assumption requires that the local power connections within a region are dense enough that there is no voltage drop within the region. Each region is then assigned a voltage drop factor, describing the percentage of supply voltage reduction. To simplify the model, it is also assumed that power usage in regions does not interfere with other regions, so the voltage drop within each region depends only on the utilization within this region:

$$
LE(R) = \{e \in R \mid \text{type}(e) = LE\} \tag{4.34}
$$

$$
LE_{used}(R) = \{ le \in LE(R) \mid \text{used}(le) \} \tag{4.35}
$$

$$
K(R) = \varepsilon * \frac{|LE_{used}(R)|}{|LE(R)|} \tag{4.36}
$$

<span id="page-177-0"></span>
$$
V = f(x, y) = (1 - K(\text{reg}(x, y))) * V_0 \tag{4.37}
$$

In this notation,  $reg(x, y)$  returns the set of all elements in the region which covers position  $(x, y)$ .  $K(R)$  is the voltage reduction factor for a region R and  $VDD$  reduction is obtained according to [equation \(4.37\).](#page-177-0)
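The region-based voltage model of equations (4.34) to (4.37) translates almost literally into code. The following Python sketch is one possible reading: a region is a list of elements, each with a type and a usage flag. The `Element` class, the function names and the example values for  $\varepsilon$  and  $V_0$  are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Element:
    kind: str   # element type, e.g. "LE" or a hard block such as "DSP"
    used: bool  # True if the placed user application occupies the element


def voltage_drop_factor(region: Sequence[Element], epsilon: float) -> float:
    """K(R) = epsilon * |LE_used(R)| / |LE(R)|, equations (4.34) to (4.36)."""
    les = [e for e in region if e.kind == "LE"]
    if not les:
        return 0.0  # assumption: a region without LEs causes no drop
    used = sum(1 for e in les if e.used)
    return epsilon * used / len(les)


def region_voltage(region: Sequence[Element], epsilon: float, v0: float) -> float:
    """Supply voltage V = (1 - K(R)) * V_0 inside a region, equation (4.37)."""
    return (1.0 - voltage_drop_factor(region, epsilon)) * v0


if __name__ == "__main__":
    region = [Element("LE", True), Element("LE", True),
              Element("LE", False), Element("LE", False),
              Element("DSP", True)]
    # Hypothetical values: at most 10 % drop at full utilization, 3.0 V nominal.
    print(region_voltage(region, epsilon=0.1, v0=3.0))
```

In this reading,  $\varepsilon$  is the relative voltage drop of a fully utilized region, so a half-utilized region with  $\varepsilon = 0.1$  sees a 5% reduction of  $V_0$.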

**Temperature Variation** For temperature variation, similar remarks apply as for voltage variation. There are no commonly used statistical models as their usefulness would be limited. Instead, [ASICs](#page-361-6) are again designed with temperature analysis specific to each single circuit and again this approach is not suitable for [FPGAs.](#page-362-0) This thesis will therefore use a high level model for temperature modeling. Like in the voltage variation case, the [FPGA](#page-362-0) is assumed to be regular and whether heating occurs is mainly determined by the utilization in the vicinity. However, unlike voltage variation, temperature variation is modelled as an [IC](#page-362-7)-global effect: Even when an [LE](#page-362-6) is at a physically large distance from the analyzed [LE,](#page-362-6) it is assumed that it still adds to overall heating and produces a temperature increase in the analyzed [LE.](#page-362-6) To model this, the distance between two [LEs](#page-362-6) is first given as the geometrical distance between their coordinates:

$$
d(x_1, y_1, x_2, y_2) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2} \tag{4.38}
$$

The heating of an [LE](#page-362-6) at position  $(x, y)$  is then determined according to the following equation:

$$
\Lambda(x, y) = \delta_0 * \sum_{(i,j)\in U} e^{-\delta_1 * d(x, y, i, j)} \tag{4.39}
$$

Here it is assumed that [LEs](#page-362-6) have exponentially declining impact on the analyzed [LE](#page-362-6)'s temperature with increasing distance. The equation then sums up contributions of all [LEs](#page-362-6) which are used by the user application, denoted as set  $U$. The rate of the decline can be adjusted using parameter  $\delta_1$, whereas parameter  $\delta_0$  is used to scale the overall contribution relative to the ambient temperature. It can be derived by finding the maximum  $\Lambda(x, y)$  for an application, then fixing  $\delta_0$  to yield the desired maximum temperature increase  $\Delta T$:

$$
\max_{(x,y)\in U} \Lambda(x,y) * T_0 = \Delta T \tag{4.40}
$$

This definition is dependent on the placed user application. If an absolute reference is desired,  $\delta_0$  can be calculated in the same way, but assuming that all [LEs](#page-362-6) are used. In this case,  $\Delta T$  will be the maximum theoretically possible temperature increase if the [FPGA](#page-362-0) is fully utilized.

Using the local temperature increase  $\Lambda(x, y)$ , the final temperature is derived based on the ambient temperature  $T_0$  as:

<span id="page-178-0"></span>
$$
T = f(x, y) = (1 + \Lambda(x, y)) * T_0 \tag{4.41}
$$
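The static temperature model, including the calibration of  $\delta_0$  through equation (4.40), can be sketched as follows. The function names and all numeric values in the usage part are hypothetical; in particular, expressing the ambient temperature in kelvin is an assumption of this example, as the multiplicative form of [equation \(4.41\)](#page-178-0) depends on the chosen temperature scale.

```python
import math
from typing import Iterable, Tuple

Coord = Tuple[float, float]


def heat_sum(x: float, y: float, used: Iterable[Coord], delta1: float) -> float:
    """Unscaled sum of equation (4.39): contributions of all used LEs."""
    return sum(math.exp(-delta1 * math.hypot(x - i, y - j)) for i, j in used)


def calibrate_delta0(used, delta1: float, t0: float, delta_t: float) -> float:
    """Choose delta_0 so that max Lambda(x, y) * T_0 = Delta T, equation (4.40)."""
    used = list(used)
    peak = max(heat_sum(x, y, used, delta1) for x, y in used)
    return delta_t / (t0 * peak)


def temperature(x: float, y: float, used, delta0: float, delta1: float,
                t0: float) -> float:
    """Local temperature T = (1 + Lambda(x, y)) * T_0, equation (4.41)."""
    return (1.0 + delta0 * heat_sum(x, y, used, delta1)) * t0


if __name__ == "__main__":
    # Hypothetical placement: a small cluster of used LEs and one isolated LE.
    used = [(0, 0), (1, 0), (0, 1), (1, 1), (8, 8)]
    t0, delta_t, delta1 = 300.0, 30.0, 0.5
    delta0 = calibrate_delta0(used, delta1, t0, delta_t)
    for pos in used:
        print(pos, round(temperature(*pos, used, delta0, delta1, t0), 2))
```

Calibrating against all [LEs](#page-362-6) of the device instead of only the used ones would give the application-independent reference described in the text.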

Previous analysis has only considered static scenarios, where the temperature stays constant over time. As the architectures discussed in this thesis cover dynamic effects, dynamic heating scenarios will be considered as well. For this, temperature rise in local hotspots will be derived as an exponentially falling temperature gradient, relative to a center position  $(x_0, y_0)$. In addition, the temperature will rise over time and approach its final value exponentially. These requirements lead to the model shown below:

$$
\Lambda_t(x, y, t) = \delta_0 * e^{-\delta_1 * d(x, y, x_0, y_0)} * (1 - e^{-\delta_2 t}) \tag{4.42}
$$

Here,  $\delta_0$  is a parameter for the total temperature rise at the center point,  $\delta_1$ models the distance covered by the hotspot and  $\delta_2$  determines how long it takes for the rising temperature to reach the final value. The final temperature for the dynamic model is given below:

$$
T = f(x, y, t) = (1 + \Lambda_t(x, y, t)) * T_0 \tag{4.43}
$$

This thesis will consider both local hotspot heating over time and global chip heating, which can be caused e.g. by insufficient cooling of the [IC.](#page-362-7) The same formula will be used for both cases, only the hotspot position and  $\delta$  parameters will be changed.
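A minimal Python sketch of the dynamic hotspot model of equations (4.42) and (4.43) is given below. Function names and all parameter values are hypothetical. The second call in the usage part illustrates one conceivable parametrization of global heating, namely a  $\delta_1$  of zero, which removes the spatial decay; this is an assumption of the example and not necessarily the parametrization used in the evaluation.

```python
import math
from typing import Tuple


def hotspot_rise(x: float, y: float, t: float, center: Tuple[float, float],
                 delta0: float, delta1: float, delta2: float) -> float:
    """Relative temperature rise Lambda_t(x, y, t) of equation (4.42)."""
    d = math.hypot(x - center[0], y - center[1])
    spatial = math.exp(-delta1 * d)          # falls off with distance
    temporal = 1.0 - math.exp(-delta2 * t)   # saturates towards 1 over time
    return delta0 * spatial * temporal


def hotspot_temperature(x: float, y: float, t: float,
                        center: Tuple[float, float], delta0: float,
                        delta1: float, delta2: float, t0: float) -> float:
    """Temperature T = (1 + Lambda_t(x, y, t)) * T_0, equation (4.43)."""
    return (1.0 + hotspot_rise(x, y, t, center, delta0, delta1, delta2)) * t0


if __name__ == "__main__":
    center, t0 = (12.0, 12.0), 300.0  # hypothetical hotspot center and ambient
    # Local hotspot: pronounced spatial decay.
    print(hotspot_temperature(12, 12, 5.0, center, 0.1, 0.3, 0.5, t0))
    # Global heating sketch: no spatial decay, same rise everywhere.
    print(hotspot_temperature(0, 0, 5.0, center, 0.1, 0.0, 0.5, t0))
```

At  $t = 0$  the rise is zero everywhere, and for large  $t$  the rise at the center approaches  $\delta_0$, matching the parameter description above.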

**Aging** For aging, two scenarios can be considered: The first is global aging, which assumes the same conditions on the whole [IC.](#page-362-7) In this mode, the aging parameter A will be derived from [equation \(4.19\)](#page-162-1) using constant  $V_A$  and  $T_A$ values. This allows artificial aging simulations for predefined environment temperatures and voltage changes.

The second scenario is slightly more realistic and calculates  $V_A$  and  $T_A$  based on the temperature and voltage profiles defined before. The aging parameter will then depend on  $x$  and  $y$  as well. The model for this scenario is given below:

$$
A = f(x, y, t) = A(t, V(x, y), T(x, y)), \tag{4.44}
$$

with  $A(t, V_A, T_A)$  from [equation \(4.19\),](#page-162-1)  $V(x, y)$  from [equation \(4.37\)](#page-177-0) and  $T(x, y)$ from [equation \(4.41\).](#page-178-0) Like the aging model in general, this scenario assumes that  $V$  and  $T$  distributions calculated at the beginning are constant over time during the aging process.

## **4.8 Upcoming Aspects**

This chapter has presented a high-level overview of the [PARFAIT](#page-363-1) architecture derived in this thesis. Whereas this top-down view provides the overall idea and picture, individual subtopics will be addressed in more detail in the following chapters. In addition, some aspects regarding simulation and evaluation of the architecture will also be presented.

[Chapter 5](#page-183-0) will derive a standard cell library for [RFET](#page-364-0) devices and demonstrate a synthesis flow using these. This will enable the development of digital circuits in [RFET](#page-364-0) technology using state-of-the-art commercial [EDA](#page-362-0) tools. In the [PARFAIT](#page-363-0) architecture, this could be used to realize the non-reconfigurable hard blocks, such as [DSP](#page-362-1) blocks or others, in [RFET](#page-364-0) technology. Although [RFET](#page-364-0) technology can be combined with classical [MOSFETs](#page-363-1) on one die, using [RFETs](#page-364-0) for such circuits enables fine-grain performance scaling using the [BG.](#page-361-0) For evaluation of the library, the dissertation will present basic arithmetic circuits and one large circuit, an accelerator for a symmetric cryptography cipher.

[Chapter 6](#page-199-0) will focus on the main element of [FPGAs,](#page-362-2) the reconfigurable [LE.](#page-362-3) It will introduce the [RFET](#page-364-0) based logic element used in the [PARFAIT](#page-363-0) architecture and describe its integration in the base [FPGA](#page-362-2) architecture as described here. To allow for a fairer comparison between the reference architecture and the [PARFAIT](#page-363-0) one, the top-level architecture will be kept unmodified, as described in [section 4.1.](#page-129-0) In order to also keep the [FPGA](#page-362-2) device size similar for both approaches, the [RFET](#page-364-0) based [CLB](#page-361-1) needs to provide similar expressiveness as the [LUT](#page-363-2) based one. Apart from discussing this in detail and evaluating the expressiveness, the chapter also focuses on changes needed in [EDA](#page-362-0) tools to map user applications to such [RFET](#page-364-0) based logic cells.

[Chapter 7](#page-217-0) will introduce power regions in more detail. As the architecture of power regions is simple, the chapter instead focuses on the changes in [EDA](#page-362-0) tools necessary to support those regions. This includes support to model regions as part of the [FPGA](#page-362-2) architecture description in [VPR.](#page-364-1) Both pre-determined low-power / high performance modes for regions (Static Assignment) and runtime configurable regions (Dynamic Assignment) are considered.

[Chapter 8](#page-225-0) will introduce the [PVTA](#page-364-2) compensation system in detail. It first describes how the required performance for each region is determined by extracting the local critical path in [VPR.](#page-364-1) It then describes the [Ring Oscillator](#page-364-3) [\(RO\)](#page-364-3) based approach to determine the momentary performance in each region. Based on this information, it continues with a description of the [Logic](#page-363-3) [Invasion \(LI\)](#page-363-3) system, which dynamically reconfigures parts of the [FPGA](#page-362-2) as [RO](#page-364-3) circuits. This allows the whole [FPGA](#page-362-2) to be characterized transparently, without impacting the user application. After introduction of these building blocks, the power management controller will be introduced. The chapter concludes with an overall summary of the [PVTA](#page-364-2) compensation and performance adjustment system.

[Chapter 9](#page-253-0) concludes the main chapters. It will present the methodology for simulation and evaluation of the [PARFAIT](#page-363-0) architecture. For that, it will give an introduction to [VFPGAs,](#page-364-4) which enable prototyping of [FPGA](#page-362-2) architectures on commercial [FPGAs.](#page-362-2) It then continues by describing the static power analysis used in [VPR](#page-364-1) for power evaluation. In order to also model dynamic effects caused by the [PVTA](#page-364-2) compensation system, a QuestaSim based runtime simulation framework is derived. Whereas this can provide a functional simulation of the [FPGA](#page-362-2) architecture and the power management controller, it cannot simulate the propagation delay due to [PVTA](#page-364-2) and voltage scaling. The following section in the chapter therefore introduces a co-simulation system. This system combines the [PVTA](#page-364-2) models described in [section 4.7](#page-174-0) as a software implementation with the VHDL hardware implementation simulated in QuestaSim. The chapter concludes with an introduction of the chosen user application benchmarks for the final [DSE](#page-362-4) and descriptions of the concrete [PVTA](#page-364-2) scenarios which will be evaluated.

The final chapters will conclude the thesis: [Chapter 10](#page-275-0) will provide [DSE](#page-362-4) results and results for the previously introduced test scenarios. Results will be presented individually for the [RFET](#page-364-0) standard cells, the [RFET](#page-364-0) [LE,](#page-362-3) power regions, and the overall [PARFAIT](#page-363-0) power management system, combining those aspects. [Chapter 11](#page-309-0) will then summarize results and insights obtained in this thesis and will give an outlook on possible future research.


# <span id="page-183-0"></span>**Chapter 5**

# **Ambipolar Standard Cells**

To analyze the application of [RFET](#page-364-0) in non-reconfigurable parts of [FPGAs,](#page-362-2) a standard cell based approach has been taken. The following sections will first explain the derivation of an [RFET](#page-364-0) standard cell library, based on characterization data obtained in the PARFAIT project by Reuter [\[Reu21\]](#page-344-0). Based on this cell library, this thesis will investigate application in simple arithmetic units, as they are used as static logic extensions in [LEs.](#page-362-3) To also verify applicability of [RFET](#page-364-0) cells in large circuits and to test mixing of [RFET](#page-364-0) and standard silicon [CMOS,](#page-361-2) the last section in this chapter presents an [RFET](#page-364-0) based cryptographic accelerator.

### **5.1 Standard Cell Library**

Characterization and derivation of an [RFET](#page-364-0) based standard cell library has originally been described in [\[Reu21\]](#page-344-0). This section will first introduce the [RFET](#page-364-0) device used. It will then summarize the work by Reuter, which consists of modeling and analysis of the individual cells. The section will conclude with a derivation of a standard library as done by this thesis' author in a contribution to [\[Reu21\]](#page-344-0).

**[RFET](#page-364-0) Device** After early works of Reuter had analyzed static device behavior such as [Voltage-Transfer-Characteristics \(VTCs\)](#page-364-5) [\[Reu20,](#page-344-1) [Reu19\]](#page-344-2), [\[Reu21\]](#page-344-0) discusses dynamic characteristics. As no SPICE transient model was available for the [RFET](#page-364-0) technology, logic cells have been modelled and analyzed in [TCAD.](#page-364-6) [Figure 5.1](#page-184-0) shows the [RFET](#page-364-0) device which was used for the design of the standard cell library. It is based on the work by Krauss [\[1\]](#page-315-0), but device size has been optimized by Reuter [\[Reu21\]](#page-344-0). Providing both [BG](#page-361-0) and [TG,](#page-364-7) the device enables independent biasing of those gates. With a 90 nm [FG](#page-362-5) length but 210 nm [BG](#page-361-0) length, fair comparisons should compare the device to [SOI](#page-364-8) technology with 210 nm. Reuter therefore compares results to a commercial 180 nm [SOI](#page-364-8) technology, where the transistors have been manually scaled to a gate length of 210 nm. As modification of the commercial standard cell library was not possible, this comparison was only possible for analysis of single cell characteristics. Comparison of the standard cell library was therefore based on the commercial 180 nm standard cells.

<span id="page-184-0"></span>



**Figure 5.1:** The [RFET](#page-364-0) device used for the standard cell characterization [\[Reu21\]](#page-344-0). Possessing a 90 nm [FG](#page-362-5) length and a 210 nm [BG](#page-361-0) length, the device can be compared to 210 nm [SOI](#page-364-8) devices.

The cell supply voltage  $VDD$  was fixed at  $1.6$  V to match the commercial [SOI](#page-364-8) technology. In addition, the device has been tuned for symmetric behavior in p- and n- configuration through adjustment of gate work functions. As the device behaves symmetrically, one device can be used for pull-up and pull-down networks, so transistor sizing to compensate drive current is not necessary. After tuning, drive currents  $I_D$  were measured at 64.3  $\mu$ A in n-mode and 83.4 μA in p-mode.

**Cell Characterization** For development of the standard cells, the devices operate as described in [figure 2.6b:](#page-38-0) The [BG](#page-361-0) of all devices are connected to the devices' source terminals. [TG](#page-364-7) terminals of each device are combined and used as the configuration gate or secondary input. Use of the configuration gate to configure the device as [NMOS](#page-363-4) or [PMOS](#page-363-5) device is shown in [figure 5.2a](#page-185-0) on the left: For [NMOS](#page-363-4) operation, the [TG](#page-364-7) is connected to  $VDD$; for [PMOS](#page-363-5) operation, it is connected to  $VSS$. The [FG](#page-362-5) is used as the primary gate of a transistor.

The drive current of all cells has been normalized to match the drive current of the inverter. *INV*, *NAND* and *NOR* cells do not make use of reconfiguration and use statically configured [RFETs](#page-364-0) as shown in [figure 5.2b.](#page-185-0) The *XOR* cell on the other hand uses the reconfiguration gate as a logic input. The cell design shown in [figure 5.2b](#page-185-0) is based on the design in [\[Reu20\]](#page-344-1) and realizes the *XOR* gate using fewer transistors than is possible in standard [CMOS](#page-361-2) technology.

<span id="page-185-0"></span>

**Figure 5.2:** Schematics for the [RFET](#page-364-0) based standard cells as designed by Reuter [\[Reu21\]](#page-344-0). **[\(a\)](#page-185-0)** Static biasing of [RFET](#page-364-0) devices to obtain *NAND*, *NOR* and *INV* gates equivalent to conventional [CMOS](#page-361-2) designs. **[\(b\)](#page-185-0)** *XOR* gate realized using [RFET](#page-364-0) reconfiguration. As described in [\[Reu20\]](#page-344-1), this device can be realized in 8 transistors, including required input inverters.

<span id="page-185-1"></span>

**Figure 5.3:** Characterization of dynamic cell behavior as described by Reuter for an *INV* cell [\[Reu21\]](#page-344-0). **[\(a\)](#page-185-1)** Characterization of rise  $t_r$  and fall times  $t_f$, here called  $t_{\rm rise}$  and  $t_{\rm fall}$. **[\(b\)](#page-185-1)** Characterization of propagation delay  $t_{\rm PD}$, here called  $t_{\rm d}$. Delays are individually characterized for possible [ARCs](#page-365-0) and edges.

[Figure 5.3](#page-185-1) shows an exemplary extraction of cell characteristics for the *INV* gate. The characteristics are obtained in the [Non-Linear Delay Model \(NLDM\).](#page-363-6) In this model, cell characteristics such as propagation delay  $t_{\text{PD}}$  and rise  $t_{\text{r}}$  and fall times  $t_f$  are modeled as a lookup table. The table is parametrized by the input transition time and the output capacitance. Furthermore, [EDA](#page-362-0) tools interpolate between table entries. For the output capacitance, Reuter uses fixed values between 3 fF and 15 fF. Compared to other common models like [FO4,](#page-365-1) dynamic capacitance effects are therefore not included. This trade-off is necessary, as the number of transistors that can be simulated in reasonable time in [TCAD](#page-364-6) simulation is limited. Reuter then performs transient simulations as shown in [figure 5.3a](#page-185-1) for rise and fall times and in [figure 5.3b](#page-185-1) for propagation delay. Similar simulations are executed for the modified 180 nm reference technology to obtain values for comparison. Characterization values are extracted from these simulations using custom scripts. Results show that the reference technology achieves higher drive currents, which also affects the slew rate. For the same output capacitance, depending on the gate type, propagation delay was 160% to 638% slower for the [RFET](#page-364-0) cells. On the other hand, [RFET](#page-364-0) standard cells possess 64% to 86% less input capacitance  $C_{\text{in}}$. Further details on cell characterization can be found in the original publication [\[Reu21\]](#page-344-0). Based on the work by Reuter described in the previous two paragraphs, this thesis' author contributed a cell library for synthesis in Cadence Genus to [\[Reu21\]](#page-344-0), which will partially be reproduced and summarized in the following sections.
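To illustrate how a lookup table parametrized by input transition time and output capacitance can be evaluated between its entries, the following Python sketch performs a bilinear interpolation. Bilinear interpolation is used here as a simple, common choice and is an assumption of this example; the exact scheme applied by a particular [EDA](#page-362-0) tool may differ. The axis values and delay entries are hypothetical and are not taken from the characterized library, only the 3 fF to 15 fF load range follows the text.

```python
from bisect import bisect_right
from typing import Sequence, Tuple


def _bracket(axis: Sequence[float], value: float) -> Tuple[int, int]:
    """Indices of the two axis points used for interpolation (clamped to the axis)."""
    hi = min(max(bisect_right(axis, value), 1), len(axis) - 1)
    return hi - 1, hi


def nldm_lookup(slews: Sequence[float], loads: Sequence[float],
                table: Sequence[Sequence[float]],
                slew: float, load: float) -> float:
    """Bilinear interpolation in a table indexed as table[slew_index][load_index].

    Queries outside the table range are extrapolated linearly from the
    outermost pair of entries.
    """
    i0, i1 = _bracket(slews, slew)
    j0, j1 = _bracket(loads, load)
    u = (slew - slews[i0]) / (slews[i1] - slews[i0])
    w = (load - loads[j0]) / (loads[j1] - loads[j0])
    return ((1 - u) * (1 - w) * table[i0][j0] + u * (1 - w) * table[i1][j0]
            + (1 - u) * w * table[i0][j1] + u * w * table[i1][j1])


if __name__ == "__main__":
    slews = [0.1, 0.5, 1.0]    # input transition times in ns (hypothetical)
    loads = [3.0, 9.0, 15.0]   # output capacitances in fF (range from the text)
    delays = [[0.8, 1.4, 2.0],  # hypothetical propagation delays in ns
              [1.0, 1.7, 2.4],
              [1.3, 2.1, 2.9]]
    print(nldm_lookup(slews, loads, delays, slew=0.3, load=6.0))
```

The same function could serve for rise and fall time tables, since they share the two index axes.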

**Wire Load Estimation** Before considering a library description of the cells themselves, a characterization of the connections between the cells, the wires, is needed. Such wires add parasitic resistance and capacitance to the load of their driving cell. To ensure that the results of [RFET](#page-364-0) synthesis can be compared to synthesis results of [CMOS](#page-361-2) technologies, such wire load effects need to be modeled. Cadence Genus in version 17.11.00 was used for synthesis in both reference [SOI](#page-364-8) and [RFET](#page-364-0) technologies. Genus supports four methods to estimate wire load effects: wireload models, [Physical Layout Estimation \(PLE\),](#page-363-7) spatial synthesis and physical synthesis. Wireload models are a statistical solution, providing tables of absolute capacitance, resistance, wire length and wire area. These tables are provided as part of the timing .lib file and parametrized on the number of gates in the circuit and the fanout of the respective cell. Additionally, these tables can be parametrized on process corner, supply voltage and temperature, using common liberty file constructs. Wireload models therefore essentially estimate values by averaging over representative circuits of a certain size. For fanout values which are not present in the table, Genus uses linearly interpolated values. [PLE](#page-363-7) uses physical information from .lef files and capacitance table files, aiming to obtain more accurate estimations. Physical information most notably includes resistance and capacitance values per wire length. Genus then uses a proprietary algorithm to derive the wire lengths for various connections and calculate the capacitance and resistance for all nets. To provide more exact results, unlike wireload models, [PLE](#page-363-7) calculates the length for each net individually. The third category of wire load estimation includes spatial and physical estimation. Both use the physical information used in [PLE](#page-363-7) and additionally take a user-supplied floorplan into account. In spatial mode, only a quick initial placement is used. In physical mode, an intermediate step invokes a full layout step to take into account cell placement, routing and congestion effects.

<span id="page-187-0"></span>

**Table 5.1:** Comparison of the wire load estimation techniques supported in Cadence Genus, using the [SOI](#page-364-8) reference technology. Values shown are averages over all nets in the circuit. Statistical models show larger deviations for small circuits, as those do not match the average circuit assumed in the model.

[Table 5.1](#page-187-0) shows estimated values in the reference [SOI](#page-364-8) technology to assess the quality of results for the different estimation techniques. Results have been evaluated for the small *fa* and the large *chacha\_reg* circuit, which will be used for all analyses in this chapter. In addition, wire loads have been extracted after implementation in Cadence *Innovus* 17.11.000 to evaluate the accuracy of the estimations. For both physical synthesis and implementation, the same simple quadratic floorplan targeting a utilization ratio of 70% has been used. As can be seen in [table 5.1,](#page-187-0) the wireload model and [PLE](#page-363-7) underestimate capacitance and resistance for the test circuits. The physical flow overestimates values for the small circuit but is more accurate for the large circuit.

**Liberty File Derivation** Based on the device characterization by Reuter and the wire load discussion, a .lib file has been derived to enable synthesis of larger circuits. This file is reprinted in [appendix D](#page-377-0) for reference. Apart from the main definitions, it consists of the library name, the type of delay model used, metadata and the definition of units for all values. Power related units are defined for completeness, although the [RFET](#page-364-0) liberty file currently focuses on timing information and does not contain power or area information. For the wire load model, [PLE](#page-363-7) or the physical flows cannot be used, as full layout and area information for cells is not available in the current [RFET](#page-364-0) [PDK.](#page-363-8) Wireload models on the other hand can be included in the [RFET](#page-364-0) standard cell library, enabling fairer comparison to the [SOI](#page-364-8) reference technology. As one of the main advantages of planar [RFET](#page-364-0) technology is integration with existing [CMOS](#page-361-2) technologies, repurposing an existing wireload model is possible. As the device size is comparable, and the metal layers will be identical, the reference [SOI](#page-364-8) wireload model will be reused for the [RFET](#page-364-0) library as well. [Listing 5.1](#page-188-0) shows how this wireload model is defined in the library. Concrete values were taken from [SOI](#page-364-8) liberty files, but have been replaced with dummy values to conform to NDA rules. Line 1 describes one of the tables, defining absolute area, capacitance, wire length and resistance for various fanout values. Synthesis tools choose the respective entry within this wire load table according to the fanout, interpolating the value if necessary. The wire load selection wload\_sel in line 9 maps the tables to the area of the synthesized circuit. The final statement then defines the default wire load selection to be used in the library file.

```
 1 wire_load_table (wload_1) {
 2   fanout_area (1, 1);
 3   fanout_capacitance (1, 1);
 4   fanout_length (1, 1);
 5   fanout_resistance (1, 1);
 6   // ...
 7 }
 8 // ...
 9 wire_load_selection (wload_sel) {
10   wire_load_from_area (0, 100, wload_1);
11   wire_load_from_area (100, 500, wload_5);
12   // ...
13 }
14
15 default_wire_load_selection : wload_sel;
```
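The table selection and fanout interpolation described above can be illustrated with a minimal Python sketch. It is not how Genus implements the lookup: the table contents are made-up dummy values, the half-open area ranges and the clamping outside the table range are assumptions, and the helper names `select_table` and `fanout_capacitance` are hypothetical.

```python
from bisect import bisect_right

# Hypothetical dummy tables (fanout, capacitance); not values from any real library.
WIRE_LOAD_TABLES = {
    "wload_1": [(1, 1.0), (2, 2.5), (4, 6.0)],
    "wload_5": [(1, 2.0), (2, 4.5), (4, 10.0)],
}

# (min_area, max_area, table name), mirroring the wire_load_from_area entries.
WIRE_LOAD_SELECTION = [(0, 100, "wload_1"), (100, 500, "wload_5")]


def select_table(area):
    """Pick the table whose area range contains the circuit area.

    Assumption: ranges are treated as half-open intervals [min, max).
    """
    for lo, hi, name in WIRE_LOAD_SELECTION:
        if lo <= area < hi:
            return name
    raise ValueError(f"no wire load table covers area {area}")


def fanout_capacitance(table, fanout):
    """Linearly interpolate the wire capacitance between fanout entries.

    Assumption: values outside the table range are clamped to the nearest entry.
    """
    points = WIRE_LOAD_TABLES[table]
    fanouts = [f for f, _ in points]
    if fanout <= fanouts[0]:
        return points[0][1]
    if fanout >= fanouts[-1]:
        return points[-1][1]
    i = bisect_right(fanouts, fanout)
    (f0, c0), (f1, c1) = points[i - 1], points[i]
    return c0 + (c1 - c0) * (fanout - f0) / (f1 - f0)
```

A net with fanout 3 in a circuit of area 50 would, in this sketch, be looked up as `fanout_capacitance(select_table(50), 3)`, which interpolates between the entries for fanout 2 and fanout 4.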


An example for a cell definition is shown in [listing 5.2](#page-189-0), a reduced version of the [RFET](#page-364-0) inverter definition. The cell definition consists of pin definitions, including power pins (not shown here), the input pin A and the output pin Q. Each pin section includes the pin direction. For inputs, the input capacitance of the pin on a rising transition and on a falling transition is defined according to the values obtained by Reuter. Outputs contain a logical description of the output as a function of the cell inputs and were modeled similarly to the [SOI](#page-364-8) reference technology. Furthermore, output pins contain delay model tables for rise and fall transitions, as well as for cell\_rise and cell\_fall (not shown here). The rise and fall tables model the transition times $t_r$ and $t_f$ of the output pin. cell\_rise and cell\_fall tables on the other hand model the cell propagation delay $t_{\text{PD}}$ from input edge to output edge. Delay tables are indexed by the input transition time (slew) and the total capacitance connected to the output pin. The slew is defined to be the duration in which the input edge is between 20% and 80% of $VDD$.

```
cell (INV) {
  // ...
  pin (A) {
    direction : input;
    rise_capacitance : 0.004;
    fall_capacitance : 0.003;
  }
  pin (Q) {
    direction : output;
    function : "!(A)";

    timing () {
      timing_sense : negative_unate;
      related_pin : A;
      rise_transition (delay_3x3) {
        index_1 ("0.06, 0.30, 0.54");
        index_2 ("0.003, 0.0065, 0.010");
        values ( \
          "0.10, 0.14, 0.19", \
          "0.16, 0.20, 0.25", \
          "0.22, 0.26, 0.31");
      }
      fall_transition (delay_3x3) {
        index_1 ("0.06, 0.30, 0.54");
        index_2 ("0.003, 0.0065, 0.010");
        values ( \
          "0.11, 0.15, 0.19", \
          "0.17, 0.21, 0.25", \
          "0.24, 0.28, 0.31");
      }
      // ...
    }
  }
}
```
**Listing 5.2:** Shortened version of the RFET inverter cell section in the *.lib* file. Values have been truncated, *cell\_rise* and *cell\_fall* sections are not shown and power pin related constructs have been removed.
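How a two-dimensional delay table is evaluated between its grid points can be sketched in Python. The numbers are the truncated `rise_transition` values of the inverter listing. The bilinear interpolation shown is one common way to evaluate such tables and is not taken from any specific tool. That `index_1` is the input transition time and `index_2` the output capacitance is an assumption, as the `delay_3x3` template is not reproduced here. The names `nldm_lookup` and `_segment` are hypothetical.

```python
from bisect import bisect_right

SLEW = [0.06, 0.30, 0.54]        # index_1, assumed to be the input transition time
LOAD = [0.003, 0.0065, 0.010]    # index_2, assumed to be the output capacitance
RISE_TRANSITION = [              # one row per index_1 entry
    [0.10, 0.14, 0.19],
    [0.16, 0.20, 0.25],
    [0.22, 0.26, 0.31],
]


def _segment(axis, x):
    """Return the lower grid index and the fractional position of x.

    Points outside the axis use the nearest segment, so the lookup
    extrapolates linearly there.
    """
    i = min(max(bisect_right(axis, x) - 1, 0), len(axis) - 2)
    t = (x - axis[i]) / (axis[i + 1] - axis[i])
    return i, t


def nldm_lookup(table, slew, load):
    """Bilinear interpolation in a table indexed by slew and load."""
    i, u = _segment(SLEW, slew)
    j, v = _segment(LOAD, load)
    lower = table[i][j] * (1 - v) + table[i][j + 1] * v
    upper = table[i + 1][j] * (1 - v) + table[i + 1][j + 1] * v
    return lower * (1 - u) + upper * u
```

At the grid points the function returns the table entries unchanged. Between them it blends the four surrounding entries, for example `nldm_lookup(RISE_TRANSITION, 0.18, 0.003)` lies halfway between the first two entries of the first column.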

As [FF](#page-362-6) cells are required to implement sequential circuits and the [RFET](#page-364-0) [PDK](#page-363-8) does not provide such a cell yet, the reference technology [SOI](#page-364-8) cell was used in the [RFET](#page-364-0) library. This is justified, as the technology allows mixing [RFET](#page-364-0) and other cells. However, it should be noted that the limited drive current of [RFET](#page-364-0) cells and the comparatively large input capacitance of the [FF](#page-362-6) [SOI](#page-364-8) cell will lead to increased delays. Furthermore, as the [RFET](#page-364-0) library consists of only five cells, a comparison to a normal standard cell library with hundreds of cells would not be fair: Synthesis tools could choose cell variants optimized for a certain fanout, or they could use advanced cells, which combine multiple basic functions in one cell. In order to allow for a fairer comparison, a reduced library has been derived for the [SOI](#page-364-8) technology. This library consists of only those cell types which are also available in the [RFET](#page-364-0) library.

### **5.2 Application in Arithmetic Units**

After analysis of the dynamic characteristics of the standard cells and the extraction of properties such as propagation delays into a cell timing library, the library was evaluated in demonstration circuits. The selected circuits make use of *XOR* cells, which demonstrate one main benefit of [RFET](#page-364-0) technology. As shown, *XOR* cells can be implemented using fewer transistors than in the normal [CMOS](#page-361-2) implementation. For a complete evaluation, two methods were examined to map the circuit to the cells in the respective timing library: In the first test with combinational circuits, cells are mapped manually to ensure that the netlists of planar [RFET](#page-364-0) and [SOI](#page-364-8) reference technology coincide. In the next section, a normal synthesis approach will be used, which performs individual optimizations on the circuit depending on the provided timing library. All syntheses are performed for both the planar [RFET](#page-364-0) technology and the [SOI](#page-364-8) reference. Comparability between planar [RFET](#page-364-0) and [SOI](#page-364-8) reference technology is limited, due to the fact that the [SOI](#page-364-8) reference cell library is based on 180 nm digital devices. Upscaling of the [SOI](#page-364-8) devices to match channel lengths, as performed by Reuter when comparing the cell performance, is not possible for a digital cell library. The device performance is expected to decrease with upscaling of the channel length, which benefits the presented timing characteristics of the reference technology. The interpretation of the results of the comparison between the two technologies therefore has to take into account a significant boost for the reference technology.

For the direct comparison of identical netlists, a set of simple test applications has been chosen. Those applications resemble circuits that are commonly used in [FPGAs](#page-362-2) and implemented in non-reconfigurable logic. Here, full adders are commonly included in [LEs](#page-362-3) to allow for faster carry-ripple adder implementations. [Figure 5.4](#page-191-0) introduces the combinational test circuits: a 32 bit carry-ripple adder made entirely out of the full-adder cells of [figure 5.4a,](#page-191-0) a 4 bit parity checking adder, as initially proposed for nanowire [RFETs](#page-364-0) in [\[249\]](#page-339-0), the complemented carry generation for the checked adder in [figure 5.4b](#page-191-0) and the two-rail checker for the checked adder in [figure 5.4c.](#page-191-0) Circuits have been transformed to NAND and NOR logic, as they will be directly mapped to gates according to these schematics.
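As an illustration of the first test circuit, a carry-ripple adder built from full-adder cells can be modeled at gate level in Python. The gate structure used here (two XOR gates and three NAND gates per full adder) is a textbook decomposition and only an assumption. It is not necessarily the structure of the schematic in the figure.

```python
def nand(a, b):
    """Two-input NAND on single bits."""
    return 1 - (a & b)


def xor(a, b):
    """Two-input XOR on single bits."""
    return a ^ b


def full_adder(a, b, cin):
    """Full adder from XOR and NAND gates (assumed textbook structure)."""
    p = xor(a, b)                          # propagate signal
    s = xor(p, cin)                        # sum output
    cout = nand(nand(a, b), nand(p, cin))  # carry = a*b + p*cin
    return s, cout


def ripple_carry_add(x, y, width=32):
    """Chain `width` full adders; the carry ripples from LSB to MSB."""
    carry = 0
    result = 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry
```

The carry output of each stage feeds the next stage, so the critical path of this structure grows linearly with the word width.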

<span id="page-191-0"></span>

**Figure 5.4:** Arithmetic, combinational test circuits for logic synthesis. **[\(a\)](#page-58-0)** Full Adder cell. **[\(b\)](#page-58-0)** Complemented carry generation for Checked Adder [\[249\]](#page-339-0). **[\(c\)](#page-58-0)** 2RC Checker for Checked Adder [\[249\]](#page-339-0).

For these small circuits, each single gate was analyzed in detail. It was therefore necessary that the final netlist is structured exactly as shown in [figure 5.4.](#page-191-0) To ensure this, an essentially pre-mapped, albeit hierarchical netlist was used as input to the Genus synthesis tool: Circuits are modeled in structural VHDL, where gates are modeled as small entities wrapping the respective technology's cell. An example for the *xor2* cell is shown in [listing 5.3.](#page-192-0)

```
architecture parfait of xor2 is
begin
  -- Wrap the technology's XOR cell (EO2) in the generic xor2 entity
  cellx: entity work.EO2
    port map (
      A => a,
      B => b,
      Q => y
    );
end;
```
**Listing 5.3:** VHDL description to model the *xor2* cell for synthesis.

To prevent Genus from analyzing the cell functions and mapping to different gates, the dont\_touch attribute was used for all cells instantiated in the circuits. The syn\_map phase was then skipped.

### **5.3 Application in Cryptographic Accelerators**

To analyze the use of the [RFET](#page-364-0) .lib file in a more realistic way using standard synthesis and mapping, a larger test circuit is needed. To make use of [RFET](#page-364-0) benefits, an application which makes heavy use of *XOR* cells was selected. An accelerator for the ChaCha cipher fits these requirements and will be described in the following. The accelerator was originally designed and evaluated for [FPGAs](#page-362-2) by this thesis' author in [\[Pfa19\]](#page-343-0).

#### **The ChaCha Cipher**

ChaCha is a symmetric stream cipher originally published by Bernstein in 2008 and is now commonly used in various communication protocols [\[250\]](#page-339-1). As a stream cipher, the various ChaChaN variants first generate a stream of key data, the keystream. To encrypt data, the bytes of the keystream are combined with the bytes of the datastream by an XOR operation, resulting in the cipherstream. As decryption is symmetrical, both the encrypting and the decrypting parties have to generate the same keystream. The algorithm for encryption and decryption is therefore identical and shown below:

<span id="page-193-2"></span>
$$
cipherstream = keystream \oplus datastream \qquad (5.1)
$$

<span id="page-193-3"></span>
$$
datastream = keystream \oplus cipherstream \qquad (5.2)
$$
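The symmetry of the two equations can be shown with a few lines of Python. This is only a sketch of the XOR combination step. The function name `xor_stream` is hypothetical, and the keystream is passed in as a parameter instead of being generated by the cipher.

```python
def xor_stream(keystream: bytes, data: bytes) -> bytes:
    """Combine data with the keystream byte by byte using XOR.

    The same function encrypts and decrypts, as XOR is its own inverse.
    """
    if len(keystream) < len(data):
        raise ValueError("keystream is shorter than the data")
    return bytes(k ^ d for k, d in zip(keystream, data))
```

Applying the function twice with the same keystream returns the original data, which is why both parties only need to generate identical keystreams.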

In order to generate the keystream, ChaCha performs various operations on a matrix consisting of 32 bit unsigned integers. The initial matrix  $M$  is formed as follows:



The values in the first row are the 16 constant bytes `expand 32-byte k` in hexadecimal notation, and are followed by the symmetric key. The counter is used to provide the current position in the keystream. For the first block, i.e. the first 64 bytes in the keystream, it is zero. For the next block it is one, etc. ChaCha allows for random access to the keystream: It is possible to calculate blocks at any stream offset without calculating any previous block. The last entry is a number unique to each keystream, the [Number-used-Once](#page-363-9) [\(nonce\).](#page-363-9)

To process the matrix M, ChaChaN performs  $N = 8$ , 12 or 20 rounds of operations on the matrix, as shown in [listing 5.4:](#page-193-0)

```
foreach i in (0 .. N-1)
    in = odd(i) ? diags(M) : cols(M)
    out[0..4] = qround(in[0..4])
    odd(i) ? diags(M) = out : cols(M) = out
```
**Listing 5.4:** The $rounds_N(M)$ operation

Quarter-rounds consist of addition `+=`, XOR `^=` and rotate `<<<=` operations as shown in [listing 5.5:](#page-193-1)

<span id="page-193-1"></span>

```
1 a += b; d ^= a; d <<<= 16;
2 c += d; b ^= c; b <<<= 12;
3 a += b; d ^= a; d <<<= 8;
4 c += d; b ^= c; b <<<= 7;
```

**Listing 5.5:** The $qround(a, b, c, d)$ operation

[Figure 5.5](#page-194-0) shows how even rounds operate on the four columns of the matrix (column rounds) and odd rounds on four diagonal vectors (diagonal rounds). Each round then performs the quarter-round sub-operation *qround* once per input vector.

<span id="page-194-0"></span>

**Figure 5.5:** Definition of columns $c_i$ and diagonals $d_i$ for the $qround$ input data.

After matrix $M$ has been processed through $N$ rounds, the last processing step for the keystream block *keystream*(*counter*) is to add the processed matrix to the initial matrix:

<span id="page-194-1"></span>
$$
keystream(counter) := M + rounds_N(M) \qquad (5.4)
$$

To obtain the next 64 B of the keystream, the counter field in the initial matrix  $M$  is incremented by one and the process in [equation \(5.4\)](#page-194-1) is repeated for the new matrix.
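The quarter-round, the round structure and the final addition can be put together in a short Python reference model. It is a sketch for illustration and not the hardware implementation discussed in this chapter. The matrix is assumed to be stored row by row as a list of 16 words, and the column and diagonal index sets follow the usual ChaCha definition, which is assumed to match the figure. The helper name `rotl32` and the default of 20 rounds are choices made for this sketch.

```python
MASK32 = 0xFFFFFFFF

# Word indices of the row-major 4x4 matrix (assumed standard ChaCha layout).
COLUMNS = [(0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15)]
DIAGONALS = [(0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14)]


def rotl32(x, n):
    """Rotate a 32 bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def qround(a, b, c, d):
    """Quarter-round: four add, XOR, rotate steps."""
    a = (a + b) & MASK32; d = rotl32(d ^ a, 16)
    c = (c + d) & MASK32; b = rotl32(b ^ c, 12)
    a = (a + b) & MASK32; d = rotl32(d ^ a, 8)
    c = (c + d) & MASK32; b = rotl32(b ^ c, 7)
    return a, b, c, d


def rounds(m, n):
    """Apply n rounds: even rounds on columns, odd rounds on diagonals."""
    s = list(m)
    for i in range(n):
        for idx in (DIAGONALS if i % 2 else COLUMNS):
            out = qround(*(s[k] for k in idx))
            for k, word in zip(idx, out):
                s[k] = word
    return s


def keystream_block(m, n=20):
    """Final step: word-wise addition of initial and processed matrix."""
    return [(x + y) & MASK32 for x, y in zip(m, rounds(m, n))]
```

For the next block, the caller would increment the counter word of the initial matrix and call `keystream_block` again. Which word holds the counter depends on the matrix layout and is left out of this sketch.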

#### **ChaCha in Digital Hardware**

<span id="page-194-2"></span>

**Figure 5.6:** Quarter-round operation depicted as a [Data Flow Graph \(DFG\).](#page-362-7)

[Figure 5.6](#page-194-2) shows the quarter-round operation of [listing 5.5](#page-193-1) as a [DFG,](#page-362-7) depicting both the operations used and the data flow of the algorithm. As can be seen, each output $a, b, c$ and $d$ is dependent on each input and on intermediate results, showing that parallelization of this operation is not easily possible. [Figure 5.7](#page-195-0) shows a slightly modified form of the first two sections of the graph, suggesting that the whole operation can be built out of four basis cells. These [Add Rotate XOR \(ARX\)](#page-361-3) cells form the base operation used in the ChaCha cipher, but outputs need to be permuted in order to directly chain these cells: The outputs of the first cell, $a'$, $b'$, $c'$ and $d'$, are the updated input variables as described in line one of [listing 5.5.](#page-193-1) Line two of [listing 5.5](#page-193-1) then applies the same operations to the updated input variables. As it maps the inputs differently to the operations, the basis cell has to perform that permutation. If four basis cells are connected serially, the final result will be in the same order as the input variables.

<span id="page-195-0"></span>

**Figure 5.7:** [ARX](#page-361-3) cell with permuted outputs to realize the *qround* operation.

As shown in the graph, the rotation distance is different for each stage. To handle this in the basis cell, the distance can be required to be a constant parameter. The rotation operation will then be implemented as a simple wire permutation by synthesis tools, introducing no additional logic delay. As such a cell can be used for only one stage in the quarter-round, four physical copies of these cells with different rotation distances will be required. An alternative is to make the rotation distance changeable as a runtime input: In that case, the rotation operation will be synthesized into a 4-input multiplexer structure, requiring additional hardware resources and introducing logic delay. As explained in more detail in [\[Pfa19\]](#page-343-0), it is possible to pipeline this [ARX](#page-361-3) basis cell.
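The difference between the two rotation variants can be made explicit on bit level. The following Python sketch models a word as a list of 32 wires (LSB first). A constant rotation only changes which input wire drives which output wire, whereas a runtime-selectable rotation has to choose between four such wirings. The function names and the `select` encoding are hypothetical and only serve this illustration.

```python
WIDTH = 32


def to_bits(value):
    """Split an unsigned integer into WIDTH single-bit wires, LSB first."""
    return [(value >> i) & 1 for i in range(WIDTH)]


def from_bits(bits):
    """Combine single-bit wires (LSB first) into an unsigned integer."""
    return sum(bit << i for i, bit in enumerate(bits))


def rotl_wires(bits, distance):
    """Constant left rotation: output wire i is input wire (i - distance).

    No logic gates are involved, only a fixed permutation of wires.
    """
    return [bits[(i - distance) % WIDTH] for i in range(WIDTH)]


def rotl_mux(bits, select, distances=(16, 12, 8, 7)):
    """Runtime rotation: a 4-input multiplexer per output bit.

    `select` chooses which of the four fixed wirings drives the output.
    """
    candidates = [rotl_wires(bits, d) for d in distances]
    return [candidates[select][i] for i in range(WIDTH)]
```

In hardware, `rotl_wires` costs no gates, while `rotl_mux` corresponds to 32 multiplexers in the data path, which matches the additional resources and delay mentioned above.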

**Quarter-Round and Rounds** Two different options to combine [ARX](#page-361-3) cells have been evaluated. One option is to employ a pipeline structure as depicted by the sections in [figure 5.6.](#page-194-2) In this implementation, the quarter-round is implemented as one pipeline consisting of four physical [ARX](#page-361-3) cells. This pipeline does the complete round processing for a quarter of the matrix and each operation can be mapped to one physical [ARX](#page-361-3) cell. The rotation distance for each cell is constant, reducing logic delay. Another benefit of this structure is the possibility of deep pipelining: As the [ARX](#page-361-3) basis cell is pipelined, an in-series chain of these cells should not pose further restrictions on the critical path and overall performance.

In order to utilize such a deep pipeline completely, all stages have to be filled with independent data vectors, i.e. no vector in the pipeline may depend in any way on the final *qround* result of any other vector in the pipeline. As can be seen from [figure 5.5](#page-194-0) and [listing 5.4,](#page-193-0) the four quarter-rounds in each round are independent. Their inputs depend on the results of the previous round, but not on any result of other quarter-rounds in the same round. Problems however arise after the vectors of one round have been processed: The operations of the next round have data dependencies on all data vectors modified in the previous round [\[Pfa19\]](#page-343-0). One solution is to start processing the next round only when the previous columns have been completely processed, but this reduces the performance of the implementation. As individual keystream blocks in ChaCha are independent, another solution is to prepare the initial matrix $M_{n+1}$ for the next keystream block and process the rounds of this matrix interleaved with the $M_n$ rounds. This idea can be generalized to any number of matrices, depending on the pipeline depth. In general, processing then alternates between the rounds of multiple matrices.
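The effect of interleaving can be estimated with a simple toy model in Python. It is not derived from the actual accelerator. It assumes one quarter-round pipeline that accepts one vector per cycle, four vectors per round and matrix, and a latency of `pipeline_depth` cycles until a result can be used again. All function names are hypothetical.

```python
def round_period(matrices, pipeline_depth, vectors_per_round=4):
    """Cycles until the first matrix can start its next round.

    The pipeline needs one issue slot per vector of every interleaved
    matrix. In addition, the last vector of the first matrix is issued in
    cycle vectors_per_round - 1 and its result is assumed to be usable
    pipeline_depth cycles later.
    """
    issue_slots = matrices * vectors_per_round
    earliest_restart = (vectors_per_round - 1) + pipeline_depth
    return max(issue_slots, earliest_restart)


def utilization(matrices, pipeline_depth, vectors_per_round=4):
    """Fraction of cycles in which a new vector enters the pipeline."""
    used = matrices * vectors_per_round
    return used / round_period(matrices, pipeline_depth, vectors_per_round)


def matrices_for_full_utilization(pipeline_depth, vectors_per_round=4):
    """Smallest number of interleaved matrices without pipeline bubbles."""
    needed = (vectors_per_round - 1) + pipeline_depth
    return -(-needed // vectors_per_round)  # ceiling division
```

Under these assumptions, a four-stage pipeline processing a single matrix is idle for three out of seven cycles, while interleaving two matrices already keeps it fully occupied. Deeper pipelines need correspondingly more matrices.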

Alternatively, a more software-like approach where at least one [ARX](#page-361-3) cell is used with a configurable rotation distance was evaluated. This way, the four stages of a quarter-round can be processed consecutively and the result of every step will be written back to the processing matrix. Four [ARX](#page-361-3) cells were still used in parallel to enhance throughput, similar to [Single Instruction Multiple Data \(SIMD\)](#page-364-9) in software optimizations. In this implementation, four parallel [ARX](#page-361-3) cells will process the whole matrix at once instead of processing vectors consecutively. As one cycle always yields a complete matrix, there are no pipelining complications when starting to process the next round. It is therefore not necessary to alternate processing between multiple matrices. The main drawback is that this approach requires a runtime-adjustable rotation distance, as the distance will be different for each processing cycle.

After the matrix has been processed through the *rounds*<sub>N</sub> implementation, the final addition of [equation \(5.4\)](#page-194-1) and the encryption or decryption operations of [equations \(5.1\)](#page-193-2) and [\(5.2\)](#page-193-3) need to be performed. As the final operation is an addition followed by an XOR operation, it is possible to reuse the [ARX](#page-361-3) cell for this operation.

**ChaCha Accelerator System Architecture** Based on the previously introduced building blocks, three different system architectures have been evaluated. The pipeline implementation consists of four parallel quarter-round cells, calculating one round completely in parallel. The individual Round blocks are then chained to form a deep pipeline, taking care to permute the signals connecting two rounds. If the output datastream is not ready or the input datastream is not valid, an [FSM](#page-362-8) accordingly stops the Round blocks to pause keystream generation.

The Block Memory and the Register implementation share most modules. The core blocks which calculate the complete $rounds_N$ however differ between the two implementations. In either case, processing a matrix in a Core block takes a certain number of cycles. The top level module therefore allows using a configurable number of cores in parallel, interleaving their outputs and enhancing throughput. As the outputs are interleaved, the maximum number of cores is reached when every cycle yields 16 words of data. Any further parallelization then requires duplicating the top-level architecture.

For the Block Memory implementation, one pipelined quarter-round is used. The four output words are then saved to four parallel Block RAMs at certain indices: Each row of the matrix is kept in one Block RAM, which then allows reading all four inputs for a column round or for a diagonal round in parallel using proper addressing. The benefit of this implementation is that all operations in the quarter-round are placed into one physical pipeline implementation with constant rotation values.
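One addressing scheme that permits such parallel reads can be sketched in Python. It assumes that Block RAM `row` stores the four words of matrix row `row` at addresses 0 to 3, which is an assumption made for this illustration and not necessarily the scheme used in the accelerator. The function names are hypothetical.

```python
def read_addresses(vector, diagonal):
    """Address applied to each of the four row memories for one vector.

    Column vector i reads address i from every memory. Diagonal vector i
    reads address (i + row) mod 4, so every memory is accessed exactly
    once in both cases.
    """
    return [(vector + row) % 4 if diagonal else vector for row in range(4)]


def read_vector(brams, vector, diagonal):
    """Fetch the four quarter-round inputs in parallel, one per memory."""
    addresses = read_addresses(vector, diagonal)
    return [brams[row][addr] for row, addr in enumerate(addresses)]
```

As each memory delivers exactly one word per access, a column or a diagonal vector can be read in a single cycle without any memory port conflicts.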

For the register based implementation, no complex address calculation needs to be done for memory access. Instead, after the initial matrix has been loaded using a multiplexer, the complete matrix is kept in register storage internal to a parallel quarter-round implementation. The outputs are then looped back to the inputs. The block operates on the whole matrix, processing one fourth of a round per cycle.

#### **[RFET](#page-364-0) Implementation**

After the cryptographic accelerator has been designed and evaluated on [FPGA,](#page-362-2) this thesis' author used it for [RFET](#page-364-0) standard cell evaluation in [\[Reu21\]](#page-344-0). Out of the ChaCha implementations, only the Register variant shown in [figure 5.8](#page-198-0) was used for [RFET](#page-364-0) synthesis. This variant is smaller than the Pipeline variant and does not need block memory, which is not available in the minimal standard library. Instead of analyzing every single gate as done for the small circuits, the purpose of this larger circuit is mainly to verify proper synthesis and to evaluate wire load estimation approaches on large circuits. Because of this, standard behavioral VHDL coding techniques have been used and an [RFET](#page-364-0) .lib file with the *INV*, *NAND*, *NOR* and *XOR* cells has been supplied to Genus, instead of manually mapping cells. A similarly reduced library with the same types of cells was used for the [SOI](#page-364-8) technology, to allow for direct comparison. As technology mapping is done entirely by the synthesis tool in usual digital standard cell flows, this approach is closer to common practice.

<span id="page-198-0"></span>

**Figure 5.8:** Simplified system architecture of the Register ChaCha accelerator variant [\[Reu21\]](#page-344-0).

The timing library of the planar [RFET](#page-364-0) is limited to combinational cells and therefore only allows for synthesis of combinational circuits. To be able to evaluate sequential circuits, a D[-FF](#page-362-6) was added to the .lib library. The deployed D[-FF](#page-362-6) comes from the [SOI](#page-364-8) reference technology .lib file, as explained previously. The input capacitance of the D[-FF](#page-362-6) is approximately half the capacitance of the [RFET](#page-364-0) inverter input pin and the timing tables support output loads of up to 80 times the maximum input capacitance of all pins in the [RFET](#page-364-0) library. The cells are therefore compatible for use in the timing analysis.

# <span id="page-199-0"></span>**Chapter 6**

# **Ambipolar Reconfigurable Cells**

As the applicability of [RFET](#page-364-0) in non-reconfigurable digital logic was evaluated in the previous chapter, this chapter now focuses on the application of reconfigurable cells. A summary of available reconfigurable cells was given in [section 2.6](#page-75-0) on page [53,](#page-75-0) whereas reconfigurable cells in the [PARFAIT](#page-363-0) [RFET](#page-364-0) technology have only become available during the writing of this thesis. This chapter therefore focuses on the use of [ULM](#page-364-10) cells in [FPGAs](#page-362-2) in general, independent of the technology used and of the concrete cell design. The content of this chapter has been previously published in [\[Pfa20\]](#page-343-1), but it has been edited and extended for this thesis.

### **6.1 Basic Logic Cells**

Various existing ambipolar technology reconfigurable cells were considered for an [FPGA](#page-362-2) architecture which realizes the opportunities offered by ambipolar transistor technology. As a result, the *CNT-DR8F* cell presented by Liu et al. [\[128\]](#page-327-0) was selected as a base for this work: Compared to simpler 2-input cells with only two selectable functions, this eight-function cell offers more choices for the mapping and packing algorithms. Unlike some larger proposed cells, this cell does however not provide the complete function set of two variables. It is therefore not a 1:1 replacement for [LUTs:](#page-363-2) A cell which can represent all possible functions of $N$ input variables can be treated like a [LUT](#page-363-2) in most of the [EDA](#page-362-0) tool flow and therefore puts fewer restrictions on the [EDA](#page-362-0) tools than cells with a reduced function set. For specialized [RFET](#page-364-0) reconfigurable cells, it is often not easily possible to influence the realized function set. Because of that, it is important to evaluate the [EDA](#page-362-0) tool flow and the architecture with regard to a real ambipolar base cell. This will take into account restrictions which may arise due to the limited set of functions in the used cells. *CNT-DR8F* offers the function set shown in [table 6.1.](#page-200-0) The table also shows the names of the functions as they are used in the rest of this chapter, the [EDA](#page-362-0) tools and the [FPGA](#page-362-2) architecture. Simple permutation of cell inputs can easily be done in the [FPGA](#page-362-2) interconnect and does not need to be implemented explicitly in the reconfigurable cell. The function pairs $(\overline{A} \cdot B, A \cdot \overline{B})$ and $(A + \overline{B}, \overline{A} + B)$ supported by the original cell have therefore been combined into single ulm\_and2n and ulm\_or2n functions.
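Why one function per pair is sufficient can be checked exhaustively in Python. The sketch assumes that ulm\_and2n stands for the first function of the first pair and ulm\_or2n for the first function of the second pair. Which member of each pair is actually kept is not stated here, so this choice and the helper name `swapped` are assumptions.

```python
from itertools import product


def ulm_and2n(a, b):
    """Assumed variant: NOT(A) AND B."""
    return (1 - a) & b


def ulm_or2n(a, b):
    """Assumed variant: A OR NOT(B)."""
    return a | (1 - b)


def swapped(func):
    """Model an input permutation performed in the interconnect."""
    return lambda a, b: func(b, a)


def equivalent(f, g):
    """Compare two 2-input functions over all four input combinations."""
    return all(f(a, b) == g(a, b) for a, b in product((0, 1), repeat=2))
```

Swapping the two inputs of one function yields the other member of its pair, for example `equivalent(swapped(ulm_and2n), lambda a, b: a & (1 - b))` holds. The second member of each pair therefore adds no expressiveness once the interconnect can permute inputs.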

<span id="page-200-0"></span>

**Table 6.1:** The set of functions supported by the *CNT-DR8F* [ULM](#page-364-10) cell. The cell operates with two configuration voltage levels and is used in dynamic logic. Adapted from [\[128\]](#page-327-0).

**Logic Generator** [Table 6.1](#page-200-0) also shows the configuration values to obtain a certain output function for reference. For the following discussions, the exact $(V_{\text{bA}},V_{\text{bB}},V_{\text{bC}})$ combination used is however not relevant. If the physical view considering the voltages is mapped to a logical view considering configuration bits of [SRAM](#page-364-11) cells, the combination essentially determines the bitstream encoding used. An important point is the number and type of reconfiguration inputs used. Some reconfigurable cells use three values for the configuration input: a positive, a negative and a zero voltage. In those cells, the positive and negative voltages are used to configure n-channel or p-channel behavior. The zero voltage is used to put the transistor in an off, high-impedance state. Such a three-level configuration can however not directly be represented in [SRAM,](#page-364-11) so a decoder network is needed for these cells. The selected *CNT-DR8F* cell uses three configuration inputs with two voltage levels each. The configuration storage for such an element can therefore be kept entirely in three [SRAM](#page-364-11) cells. If the chosen ambipolar technology supports storing the configuration as charges of the respective configuration gates [\[167\]](#page-330-0), external [SRAM](#page-364-11) storage is not needed. In this case, logic and memory are combined, so the extra transistors otherwise used for configuration memory can be saved. The total transistor count for the *CNT-DR8F* cell, including configuration storage, then reduces to seven transistors [\[128\]](#page-327-0).

<span id="page-201-0"></span>



**Figure 6.1:** Top Level [FPGA](#page-362-2) Architecture. Dotted: [IO](#page-362-9) blocks; white: Configurable logic blocks; diagonal: Memory blocks; grid: Multiplier blocks. Multiplier and memory blocks repeat every 8th column. Wire length of the global interconnect is 4.

**[FPGA](#page-362-2) Architecture** The [RFET](#page-364-0) [FPGA](#page-362-2) architecture is based on the [VPR](#page-364-1) reference architecture k6\_frac\_N10\_mem32K\_40nm shown in [figure 6.1.](#page-201-0) In addition to being the base for the [RFET](#page-364-0) architecture, this architecture will also be used as a reference to compare the results to. Certain differences between the [RFET](#page-364-0) reconfigurable cell and [LUTs](#page-363-2) have to be considered when designing the [RFET](#page-364-0) [FPGA:](#page-362-2) As *CNT-DR8F* is a cell using only two inputs, it has less expressiveness than the 6-input [LUT](#page-363-2) commonly used as [LE](#page-362-3) in [FPGAs.](#page-362-2) Using it directly as a configurable logic block in the [FPGA](#page-362-2) grid of the reference architecture therefore leads to many wires having to be routed using the global interconnect. Such routing leads to increased wire delay, negatively affecting the critical path and severely limiting the maximum frequency achievable for user designs in the [FPGA](#page-362-2) architecture. It may also lead to routing congestion for some user applications. As the interconnect in modern [FPGAs](#page-362-2) already makes up most of the total area, using even wider interconnect channels is not an option. The [RFET](#page-364-0) based architecture evaluated here will therefore keep the same channel width as the reference architecture which is used for comparison. To ensure the global interconnect can be kept unchanged, the [RFET-](#page-364-0)based [CLB](#page-361-1) replacement will be designed to possess a similar expressiveness as the reference architecture [CLB.](#page-361-1) The discussion in this chapter will be kept on system level: For example, cell area largely depends on the technology and cell used. In a more detailed analysis, cell area however does have an effect on the application performance and on interconnect design. Future circuit-level investigation will therefore require a more detailed analysis of the logic cell and a complete circuit-level reimplementation of the cell in the target technology, including a cell layout. Because of this, circuit-level analysis is deferred to future work.

**Fracturable Logic Cell** Aiming to stay close to the design of the [VPR](#page-364-1) k6\_frac\_N10\_mem32K\_40nm reference architecture, the fracturable 6-input cell structure shown in [figure 6.2a](#page-203-0) has been derived. It combines 5 [ULM](#page-364-10) base cells in a tree-like structure, providing three outputs. In this base structure, the outputs are not independent and can only be used in parallel if an intermediate output is required. Analysis of the packed netlist will show whether such situations are common enough in the netlists of real applications to be beneficial. The maximum depth of the cell is three [ULM](#page-364-10) cells, but it is not a fully populated tree. Depths of one [ULM](#page-364-10) or two [ULMs](#page-364-10) (full tree) are available at the other outputs. Such a cell does however not yet allow all required functions to be implemented on an [FPGA:](#page-362-2) As an example, the case where an IO input is directly connected to a register cannot be represented in this architecture, as all signals have to be routed through a logic block before being fed into a register. [LUT](#page-363-2) based architectures are not affected by this problem, as they can always implement the identity function, a simple pass-through. To make the [ULM](#page-364-10) based architecture universal, a pass-through mode has been added to the [ULM](#page-364-10) element, as will be explained in the following sections.

### **6.2 Electronic Design Automation**

To evaluate various parameters and compare the designed logic cells to [LUT](#page-363-2) based [FPGAs,](#page-362-2) an evaluation methodology is needed. Like most [FPGA](#page-362-2) research, this work will be carried out in an empirical way, analyzing various representative benchmark applications. Evaluation and benchmarking of the different architectures will be performed in the most recent [VTR](#page-367-0) release, version 8.0 [\[142\]](#page-328-0). The [VTR](#page-364-12) tool suite was originally designed for [LUT](#page-363-2) based [FPGAs,](#page-362-2) which means various changes are necessary to use it with fixed function [ULM](#page-364-10) cells. An approach to handle this was originally presented in [\[Pfa20\]](#page-343-1) and will be summarized here. In addition, a novel, more advanced [EDA](#page-362-0) approach

<span id="page-203-0"></span>

**Figure 6.2:** Basic structure of the [ULM](#page-364-10) based [LE](#page-362-3) and logic cluster. **[\(a\)](#page-203-0)** [ULM](#page-364-10) based 6-input [FLEs.](#page-362-10) **[\(b\)](#page-203-0)** Configurable logic block cluster of [LEs](#page-362-3) and [FLEs.](#page-362-10)

yielding better results for fracturable cells will be presented in the second part of this section.

**Basic [EDA](#page-362-0) Flow** [Figure 6.3](#page-204-0) on the next page shows the most basic modifications of the [VTR](#page-367-0) flow that are needed to synthesize for [ULMs.](#page-364-10) Whereas tools depicted in gray and black are unchanged from the original [VTR](#page-367-0) flow, those in orange are newly introduced or modified. User applications, in the case of [VTR](#page-367-0) the benchmark applications, are first passed to ODIN 2 for synthesis. This step is largely technology independent, except for the direct use of black boxes and other [IP](#page-362-11) cores in the Verilog code. Use of special mathematical or memory operations may also need a [DSP](#page-362-1) or block memory element in an architecture description for ODIN. These descriptions do not differ between [ULM](#page-364-10) and [LUT](#page-363-2) based [FPGAs](#page-362-2) and are therefore not further described. After synthesis, ABC performs technology independent optimizations. This step is again unchanged. The primary change is in the next step, technology mapping in ABC. Technology mapping is guided by a synthesis script, which is automatically generated by the [VTR](#page-367-0) tool flow. The generated script however assumes mapping to [LUTs](#page-363-2) and cannot be used for [ULMs.](#page-364-10) This script was adapted to perform a standard cell mapping, using a custom genlib technology file. The genlib library usually contains the cells available in a standard cell library. In this case, the different [ULM](#page-364-10) modes have been specified as individual gates.

<span id="page-204-0"></span>

**Figure 6.3:** Custom [EDA](#page-362-0) flow to map [VTR](#page-367-0) benchmarks to [ULMs.](#page-364-10) Steps in orange have been newly added using custom written tools. Libraries and models in orange have been newly derived for the specific [ULM](#page-364-10) used.

As a result, ABC will map the user application logic to those functions. As a slight complication, [VTR](#page-364-12) sometimes performs multiple iterations of reading, optimizing and writing the mapped netlist. This is supported by ABC when using [LUTs](#page-363-2) as the target technology, as the representation of logic in unmapped netlists happens to be the same as for [LUTs.](#page-363-2) For standard cell synthesis, this is not the case and care must be taken to perform initial ABC steps using [LUT](#page-363-2) mapping, only switching to standard cell mapping in the final iteration.
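The idea of describing each [ULM](#page-364-10) mode as an individual gate can be illustrated with a minimal sketch that writes a genlib file. Only ulm\_nor2 and ulm\_and2 are mode names taken from this chapter; the remaining mode names, the pin names `a`, `b`, `O`, the helper cell names and all area and delay values are placeholders assumed for illustration and do not reproduce the library used in this work.

```python
# Sketch: emit a genlib library in which every ULM mode is one gate.
# Mode names other than ulm_nor2/ulm_and2, pin names and all numeric
# values are assumptions made for this example.
ULM_MODES = {
    "ulm_nand2": "!(a*b)",
    "ulm_nor2": "!(a+b)",
    "ulm_and2": "a*b",
    "ulm_or2": "a+b",
}

def genlib_gate(name, expr, area=1.0):
    """Return one GATE entry with a single wildcard PIN line."""
    # PIN fields: name, phase, input load, max load,
    # rise block/fanout delay, fall block/fanout delay.
    return (f"GATE {name} {area:g} O={expr};\n"
            f"  PIN * UNKNOWN 1 999 1.0 0.0 1.0 0.0\n")

def build_genlib(modes):
    """Build the library text: ULM modes plus the helper cells ABC expects."""
    parts = [genlib_gate(name, expr) for name, expr in modes.items()]
    parts.append(genlib_gate("ulm_buf", "a"))    # buffer, removed again later
    parts.append(genlib_gate("ulm_inv", "!a"))   # inverter
    parts.append("GATE ulm_zero 0 O=CONST0;\n")  # constant generators
    parts.append("GATE ulm_one 0 O=CONST1;\n")
    return "".join(parts)

if __name__ == "__main__":
    print(build_genlib(ULM_MODES))
```

In ABC, a library of this kind would typically be loaded with `read_library` before running `map`; the exact script used in the modified [VTR](#page-367-0) flow is not shown here.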

The mapped netlist generated by ABC cannot be directly used in the pack and place phase of [VPR:](#page-364-1) The generated [Berkeley Logic Interchange Format](#page-365-2) [\(BLIF\)](#page-365-2) file uses .gate directives, which are not supported in [VPR.](#page-364-1) Therefore, the custom G2S script is used to transform the .gate directives to .subckt directives. Those directives are supported by [VPR](#page-364-1) and are usually used to specify black boxes and more complex predefined blocks in the [FPGA](#page-362-2) architecture. As the [VPR](#page-364-1) packer was designed to be flexible, it can be used to pack the [ULM](#page-364-10) functions into the [FLEs.](#page-362-10) For this, a custom [FLE](#page-362-10) model is provided to [VPR.](#page-364-1) It is modeled to consist of [ULMs,](#page-364-10) which are modeled as logic blocks with different modes. Each mode corresponds to one function of the [ULM](#page-364-10) cell and to one entry in the ABC gate library. The configuration of each individual [ULM](#page-364-10) can therefore be obtained by simply extracting the used mode from [VPR](#page-364-1) results. A major drawback of this simple approach is extended tool runtime, as an unusually large number of cells has to be processed by the packer. Furthermore, some logic functions cannot be legally packed onto some [ULMs](#page-364-10) in a [FLE](#page-362-10) due to routing restrictions. This leads to a lot of backtracking in the packer. As in the [LUT](#page-363-2) case, the packer is also used to pack [FLEs](#page-362-10) into logic clusters.
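The directive rewrite performed by G2S can be sketched as follows. This is not the actual G2S implementation: it assumes one directive per line (no backslash continuations), and the black-box declarations it emits assume that every gate has a single output pin named `O`, which is a convention chosen for this example.

```python
def gate_to_subckt(blif_text):
    """Rewrite '.gate' lines as '.subckt' lines; collect the pins per gate."""
    out_lines = []
    models = {}  # gate name -> formal pin names in first-seen order
    for line in blif_text.splitlines():
        tokens = line.split()
        if tokens and tokens[0] == ".gate":
            name, conns = tokens[1], tokens[2:]
            # Each connection has the form formal=actual.
            models.setdefault(name, [c.split("=", 1)[0] for c in conns])
            out_lines.append(" ".join([".subckt", name] + conns))
        else:
            out_lines.append(line)
    return "\n".join(out_lines) + "\n", models

def blackbox_models(models, output_pin="O"):
    """Emit a black-box .model declaration for every referenced gate."""
    chunks = []
    for name, pins in sorted(models.items()):
        inputs = [p for p in pins if p != output_pin]
        chunks.append(
            f".model {name}\n.inputs {' '.join(inputs)}\n"
            f".outputs {output_pin}\n.blackbox\n.end\n"
        )
    return "\n".join(chunks)

if __name__ == "__main__":
    src = (".model top\n.inputs x y\n.outputs z\n"
           ".gate ulm_nor2 a=x b=y O=z\n.end\n")
    text, models = gate_to_subckt(src)
    print(text + "\n" + blackbox_models(models))
```

Since the connection syntax of both directives is the same list of formal=actual pairs, the rewrite itself is purely textual; the additional work lies in declaring the referenced models.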

Place and route steps are largely unmodified. [VPR](#page-364-1) needs an architecture description which declares the location and number of logic clusters. Here, no complex modifications are necessary and this thesis will adapt a reference architecture from [VTR.](#page-367-0) To evaluate results, statistics will be extracted as will be described in the remainder of this chapter. Various top-level statistics such as [FPGA](#page-362-2) size and channel width can be obtained directly from [VTR.](#page-367-0) Other interesting statistics, such as the utilization of [FLEs](#page-362-10) in logic clusters or the modes the [ULMs](#page-364-10) operate in, are not available from [VTR](#page-367-0)'s statistics. The information is however available in the mapped netlist generated by [VPR.](#page-364-1) Therefore, the custom tool UANA was written to extract the relevant metrics.

**Flow for Fracturable Cells** As will be analyzed in the evaluation, the simple [EDA](#page-362-0) flow presented in [\[Pfa20\]](#page-343-1) has significant limitations. As previously described, it slows down the packing phase considerably. Furthermore, mapping to [FLEs](#page-362-10) has been shown to be inefficient. Although in general [VPR](#page-364-1) is able to pack multiple [ULMs](#page-364-10) into a [FLE,](#page-362-10) output and input utilization of those [FLEs](#page-362-10) is low on average. As the problem is mainly in the packing step, a modified flow as shown in [figure 6.4](#page-206-0) on the following page has been implemented. Elements shown in blue have been modified compared to the previously described basic flow.

Instead of modeling the base primitives in the technology mapping, a more exhaustive technology library was derived. This library contains every complex N-input logic function which can be represented by the [FLE.](#page-362-10) It includes both functions which use the whole [FLE](#page-362-10) and those which only use a part of the [FLE](#page-362-10) and can be used in fracturable mode. Here, fitting more complex functions in a [FLE](#page-362-10) makes less use of the local and global interconnect than mapping to multiple, simple functions. This should be reflected in the generated library to incentivize the synthesis tool to use the larger functions. Adjusting the cost of functions this way has been implemented by ensuring that larger functions are virtually assigned a smaller cost, i.e. area.
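One possible way to bias an area-driven mapper towards larger functions is sketched below. The cost formula and its constants are assumptions chosen purely for illustration; the actual area values used in the derived library are not given in this chapter.

```python
def virtual_area(num_ulms, base=10, per_ulm=1):
    """Virtual area in arbitrary integer units (assumed formula).

    The fixed base term dominates, so the cost per ULM shrinks as a
    function occupies more ULMs of the FLE.
    """
    return base + per_ulm * num_ulms

def cover_cost(ulms_per_gate):
    """Total virtual area of a cover given the ULM count of each gate."""
    return sum(virtual_area(n) for n in ulms_per_gate)

if __name__ == "__main__":
    one_large = cover_cost([5])              # one function using 5 ULMs
    five_small = cover_cost([1, 1, 1, 1, 1])  # five single-ULM functions
    print(one_large, five_small)
```

With these example values, a single function occupying five [ULMs](#page-364-10) is cheaper than five separate single-[ULM](#page-364-10) gates, so a mapper minimizing total area would prefer the larger function.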

The G2S step now becomes slightly more complicated: It still adapts .gate directives into .subckt directives, but in addition, it also transforms the netlist from the complex, modeled function to [ULM](#page-364-10) [FLE](#page-362-10) mode. A [FLE](#page-362-10) mode is a configuration which differs in more than only the [ULM](#page-364-10) function used. For example, a [FLE](#page-362-10) can be in fractured mode, realizing a two-input and a four-input function. Or it might only realize a single 6-input function. Two,

<span id="page-206-0"></span>

**Figure 6.4:** Advanced custom [EDA](#page-362-0) flow to map [VTR](#page-367-0) benchmarks efficiently to [ULM](#page-364-10) based [FLEs.](#page-362-10) Tools, libraries and models in orange have been newly added compared to the standard [VTR](#page-367-0) flow. Components highlighted in blue have been modified or extended compared to the simpler approach of [figure 6.3.](#page-204-0)

four and six-input are therefore modes, but a two-input *NAND* and a two-input *NOR* would be mapped to the same mode, as both can be realized by the same [FLE](#page-362-10) topology and only need different [ULM](#page-364-10) configurations. When this mapping step is performed in the F2L tool, it should be noted that the information about which exact function is implemented in each directive is lost. While removing this information is the main step to simplify packing, it needs to be restored in bitstream generation. For packing, [FLEs](#page-362-10) are now simply described as the different modes. [ULMs](#page-364-10) do not need to be described in [VTR](#page-367-0) at all anymore. The task of the packing tool is therefore reduced to its original task, packing multiple logic functions into one [FLE.](#page-362-10) Packing [FLEs](#page-362-10) into logic clusters then works as usual and the remaining [EDA](#page-362-0) toolflow operates in the same way as in the basic flow.
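The separation between the mode seen by the packer and the function needed for the bitstream can be illustrated with a small sketch. All function names, mode names and [ULM](#page-364-10) configurations in the table are hypothetical; the sketch only shows that two functions can share one mode while their configurations are kept in a side table for later bitstream generation.

```python
# Hypothetical table: function name -> (FLE mode, ULM configurations).
FUNCTION_TABLE = {
    "nand2": ("fle_2in", ("ulm_nand2",)),
    "nor2": ("fle_2in", ("ulm_nor2",)),
    "and4": ("fle_4in", ("ulm_and2", "ulm_and2", "ulm_and2")),
}

def to_mode_netlist(instances):
    """Split (instance, function) pairs into a packer view and a side table.

    The packer view only names the FLE mode. The side table keeps the
    per-instance ULM configuration that is needed again when the
    bitstream is generated.
    """
    packer_view = []
    bitstream_info = {}
    for inst, func in instances:
        mode, ulm_config = FUNCTION_TABLE[func]
        packer_view.append((inst, mode))
        bitstream_info[inst] = ulm_config
    return packer_view, bitstream_info

if __name__ == "__main__":
    view, info = to_mode_netlist([("u0", "nand2"), ("u1", "nor2")])
    print(view)  # both instances use the same mode
    print(info)  # but keep different ULM configurations
```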

#### **6.3 Design Methodology**

As the expressiveness of such a 6-input cell is still limited, the [ULM](#page-364-10) based [FLE](#page-362-10) cell has been embedded in a cluster, as shown in [figure 6.2b](#page-203-0) on page [181.](#page-203-0) This is also consistent with the [VPR](#page-364-1) reference architecture, which uses 10 6-input [LUTs](#page-363-2) in each logic cluster. Like in the reference architecture, the inputs of the cluster are connected to the inputs of the [FLEs](#page-362-10) using a full

crossbar and the outputs of the cell are fed back into the crossbar. This allows connections between several of these cells to be routed locally, without using global interconnect. The logic element is supposed to have the same expressiveness as the one of the reference architecture, which means the [FPGA](#page-362-2) device size and global interconnect usage should be the same for a set of benchmark circuits. The exact number and type of [FLEs](#page-362-10) in a complex cluster will therefore be determined according to this goal in the following sections.

**Benchmark-Driven Design** The actual benchmarking of the architecture is performed using [VTR](#page-367-0)'s benchmark set, whereas for evaluation of intermediate architecture results, only a reduced set of benchmarks is used to reduce tool runtime. To evaluate the results and guide architecture design, the statistics offered directly by [VPR](#page-364-1) are useful, but not sufficient: These statistics are mostly centered on the top-level view of the FPGA architecture, including the FPGA dimensions in blocks of the top level grid, the block type which dominates the device size, channel width and congestion for global routing, and similar data points. The main conclusion that can be drawn from these measurements is whether the objective of performing similarly to a [LUT](#page-363-2) based architecture, with respect to the global architecture, has been fulfilled. Here, a comparison between the [VTR](#page-367-0) reference architecture and the same architecture with the [LUT](#page-363-2) based logic block replaced by [ULM](#page-364-10) logic clusters will be carried out.

**Custom Analysis** In the design phase of a replacement logic cell which is supposed to integrate into an unmodified global architecture, evaluating these global aspects is of limited use. Bottlenecks such as reduced logic expressiveness caused by too few logic elements in a cluster, by restricted local routing or by underutilized logic elements cannot be found in these top-level statistics. To remedy this, a custom tool for analysis and statistics collection for the logic cluster blocks has been designed. By parsing [VPR](#page-364-1)'s structured, packed netlist output, it is possible to gain information about the sub-blocks in the hierarchy instead of only top-level information. Quantities which will be analyzed include the utilization of input and output ports of the logic clusters, the utilization of the available [FLE](#page-362-10) cells in a cluster, utilization of inputs and outputs of the [FLE](#page-362-10) cells, [ULM](#page-364-10) configurations used in the [FLE](#page-362-10) cells and similar metrics. Statistics are then generated for aggregated quantities, such as the average and median number of inputs used, and similar values. To make informed decisions on the architecture, histograms are generated in addition. Those can for example be used to gauge how likely outputs, inputs or [FFs](#page-362-6) in a [FLE](#page-362-10) are used. In addition to aggregated statistics, statistics for

individual outputs and cells are generated. This makes it possible to answer questions such as how often a specific output or cell is used.
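The kind of analysis described above can be sketched on a strongly simplified packed netlist. This is not the custom analysis tool itself: the embedded sample only mimics the nested `<block>` structure of [VPR](#page-364-1)'s packed netlist, with `open` marking unused pins and blocks, and the block type names `clb` and `fle` as well as the mode names are assumptions for this example.

```python
import statistics
import xml.etree.ElementTree as ET
from collections import Counter

# Simplified stand-in for a packed netlist (assumed structure and names).
SAMPLE_NET = """<block name="top" instance="FPGA_packed_netlist[0]">
  <block name="c0" instance="clb[0]" mode="default">
    <inputs><port name="I">n1 n2 open open</port></inputs>
    <block name="f0" instance="fle[0]" mode="fle_2in"/>
    <block name="open" instance="fle[1]"/>
  </block>
  <block name="c1" instance="clb[1]" mode="default">
    <inputs><port name="I">n3 n4 n5 open</port></inputs>
    <block name="f1" instance="fle[0]" mode="fle_4in"/>
    <block name="f2" instance="fle[1]" mode="fle_2in"/>
  </block>
</block>"""

def used_pins(block, section):
    """Count pins in <inputs> or <outputs> that are not marked 'open'."""
    ports = block.findall(f"./{section}/port")
    return sum(tok != "open" for port in ports
               for tok in (port.text or "").split())

def analyze(xml_text, cluster="clb", element="fle"):
    """Collect per-cluster input usage, FLE utilization and FLE modes."""
    root = ET.fromstring(xml_text)
    inputs_used, fle_share, modes = [], [], Counter()
    for clb in root.findall("./block"):
        if not clb.get("instance", "").startswith(cluster + "["):
            continue
        inputs_used.append(used_pins(clb, "inputs"))
        fles = [b for b in clb.findall("./block")
                if b.get("instance", "").startswith(element + "[")]
        used = [b for b in fles if b.get("name") != "open"]
        if fles:
            fle_share.append(len(used) / len(fles))
        modes.update(b.get("mode") for b in used)
    return {
        "mean_inputs": statistics.mean(inputs_used),
        "median_inputs": statistics.median(inputs_used),
        "input_histogram": Counter(inputs_used),
        "mean_fle_utilization": statistics.mean(fle_share),
        "fle_modes": modes,
    }

if __name__ == "__main__":
    for key, value in analyze(SAMPLE_NET).items():
        print(key, value)
```

The histogram and the per-mode counter correspond to the distribution-type statistics mentioned above, while the mean and median are examples of the aggregated quantities.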

### **6.4 Logic Clusters**

Changing certain parameters in the [FLE](#page-362-10) and logic cluster can largely affect the expressiveness of the logic cluster. Cell design therefore needs to be guided through a set of measurable variables, which will be obtained through analysis of user application benchmarks. The following quantities have been selected to evaluate the expressiveness of the overall logic cluster:

- 1. Total FPGA size: The width and height of the FPGA in complex blocks. This measurement is only relevant if the logic blocks [\(LUT-](#page-363-2) or [ULM-](#page-364-10)Cluster) determine the device size. Both points can be determined from the statistics file generated by [VPR.](#page-364-1)
- 2. Percentage of [FLEs](#page-362-10) used in logic clusters: This measurement gives an overview of how well device logic resources are utilized and allows drawing indirect conclusions on congestion issues of the global and local interconnect. [FLEs](#page-362-10) that are not fully utilized, although the device size is limited by the logic cells, hint that the cells cannot be connected as required. This may be caused by congestion on global or local interconnects, as well as by an insufficient number of available inputs or outputs.
- 3. Logic cluster input and output utilization: A high utilization of inputs or outputs with little [FLE](#page-362-10) utilization suggests that excessive amounts of signals have to be routed using the global interconnect. Improving connectivity of the local interconnect can unburden the global interconnect and allow for better [FLE](#page-362-10) utilization.
- 4. [FF](#page-362-6) utilization: The number of [FFs](#page-362-6) which are actually used. This measurement is particularly sensitive to the set of benchmarks used. If the [ULM](#page-364-10) based logic cluster is designed to have the same expressiveness as a [LUT](#page-363-2) based cluster, it is also expected to see the same number of [FFs](#page-362-6) in such a cluster as in the reference architecture.

In addition to those quantities for logic cluster design, additional quantities can be used to evaluate the [FLE.](#page-362-10) The following quantities are therefore also evaluated:

- 1. [FLE](#page-362-10) input usage and distribution: Information about the average number of inputs used can suggest whether a [FLE](#page-362-10) with more or fewer inputs may lead to a better utilization of [ULM](#page-364-10) cells. A low number of used inputs suggests that the input circuits do not map well to the [ULM](#page-364-10) topology, whereas a low number of distinct inputs suggests that some [ULM](#page-364-10) inputs may be combined into a single input, to reduce the size of the logic cluster crossbar. The distribution of inputs can also be used to gain certain insights: It allows drawing conclusions about which part of the fracturable logic is most often used.
- 2. [ULM](#page-364-10) usage and distribution: This is another measurement to determine which part of the fracturable logic is most used. It further provides direct feedback whether the [FLE](#page-362-10) cell successfully matches the input logic functions, or whether the topologyof [ULMs](#page-364-10) does not allow nets to be mapped efficiently.
- 3. [FLE](#page-362-10) output usage and distribution: This is the primary way to determine whether a [FLE](#page-362-10) design is working efficiently. A low average number of used outputs suggests that the cell cannot drive multiple outputs at the same time, likely caused by the cell topology. A low utilization of a certain, specific output can hint that this sub-part of the [FLE](#page-362-10) topology is not frequently used and that the output may be removed from the architecture without reduction of expressiveness.

**[LE](#page-362-3) Inputs** [Figure 6.5a](#page-210-0) on the following page shows the [LE](#page-362-3) utilization for a very simple, naïve [FLE](#page-362-10) which consists of only one [ULM.](#page-364-10) This [LE](#page-362-3) is instantiated ten times in the cluster, according to [figure 6.2b](#page-203-0) on page [181.](#page-203-0) The resulting architecture has been modelled in [VPR](#page-364-1) and the [VTR](#page-364-12) MCNC benchmarks have been evaluated for it. As can be seen in [figure 6.5,](#page-210-0) the results motivate the need for a more complex cell. In the architecture, [VPR](#page-364-1) makes use of only 9.64 inputs and 3.5 outputs on average, although the [ULM](#page-364-10) utilization in [figure 6.5a](#page-210-0) is at 96.1%. This clearly shows that even when all [ULMs](#page-364-10) are fully utilized, not all inputs and outputs can be used. In such a situation, either the number of inputs and outputs needs to be reduced, or the [LE](#page-362-3) in the cluster needs to be changed. Furthermore, the number of complex logic blocks used in the benchmarks increased by 241.4% compared to the reference architecture. Caused by fewer used inputs and outputs, the minimum channel width is also reduced to 77.0%. As the [FPGA](#page-362-2) top level architecture is supposed to be kept the same, this clearly indicates that a 2-input [ULM](#page-364-10) is not expressive enough as a basic logic element.

<span id="page-210-0"></span>

**Figure 6.5:** Logic cluster utilization when using a simple [LE](#page-362-3) consisting of one [ULM.](#page-364-10) **[\(a\)](#page-210-0)** Utilization of [LEs](#page-362-3) in the cluster. **[\(b\)](#page-210-0)** Utilization of cluster inputs. Unused values not shown, in total 40 inputs are available. **[\(c\)](#page-210-0)** Utilization of cluster outputs. Unused values not shown, in total 40 outputs are available.

**Fracturable [LEs](#page-362-3)** As an initial improvement, the combined cell in [figure 6.2a](#page-203-0) on page [181](#page-203-0) was derived: Like the 6-[LUT](#page-363-2) building block in the reference architecture, it can take up to six different input signals. This decision has been made to initially have similar routing requirements in the local interconnect of the logic cluster as in the reference architecture. This is supposed to yield higher input utilization for clusters even with an unchanged input crossbar. In addition, the cells have been arranged in a way to allow the system to be fracturable: If [ULM](#page-364-10) 4 in the tree is configured to pass through its right-hand input, the 6-input [FLE](#page-362-10) decomposes into a two-input and a four-input logic element. Similarly, the cell can be decomposed into three two-input logic elements by further putting ULM 3 into bypass mode, forwarding only its right input. As previously explained, multiple outputs can still be derived from all six inputs, if the outputs are related and their combinational logic functions can be mapped to the topology shown in [figure 6.2a.](#page-203-0) Reevaluating the changed architecture yields the results in [figure 6.6,](#page-212-0) where it can be seen in **[\(c\)](#page-212-0)** that only one output is heavily used. Further analysis shows that some functions mapped to this output actually are two-input functions, which forces other [ULM](#page-364-10) cells into bypass mode and essentially implements simple functions in overly complex cells. The main problem lies within the overly simple technology mapping and packing approach: Directly mapping onto [ULMs](#page-364-10) in synthesis and combining multiple such mapped functions into more complex logic blocks in the packing stage prevents the synthesis tool from transforming functions directly for those larger cells.
Another drawback of modelling the [FLEs](#page-362-10) as a simple set of [ULMs](#page-364-10) is the increased calculation effort in the packing stage, which results in increased tool runtime. This limitation is not present in the advanced toolflow presented later on.

**Internal [LEs](#page-362-3)** In order to reduce utilization of the global interconnect, logic clusters usually provide a local interconnect with direct feedback paths from the logic output to the logic input. If an output is fed back in the local interconnect, unless it is also required as an input for another logic function, it is not routed on the global interconnect. This means that larger utilization of the local interconnect will lead to less utilization of the logic cluster outputs. In order to avoid wasting resources, there are two possible solutions: One is to reduce the number of outputs. This however has consequences for the global interconnect and the overall [FPGA](#page-362-2) architecture. As the original top-level architecture should be kept close to the reference architecture, the number of outputs in a logic cluster should be kept the same. To still achieve higher output utilization, internal-only cells have been tested: The outputs of these cells are only connected to the local interconnect and to other logic cells' inputs in the same cluster. They are not connected to outputs of the

<span id="page-212-0"></span>

**Figure 6.6:** Internal utilization of elements in the 6-input [FLE](#page-362-10) as well as input and output utilization. **[\(a\)](#page-212-0)** Amount of [FLE](#page-362-10) inputs used. **[\(b\)](#page-212-0)** Amount of internal [ULMs](#page-364-10) used in the [FLE.](#page-362-10) **[\(c\)](#page-212-0)** Amount of [FLE](#page-362-10) outputs used. **[\(d\)](#page-212-0)** Distribution of the used [FLE](#page-362-10) outputs.

logic cluster and therefore also not connected to the global interconnect. This logic cluster architecture including internal cells was presented in [figure 6.2b](#page-203-0) on page [181.](#page-203-0)

**Simple [LEs](#page-362-3)** For internal cells, questions regarding the number of inputs, the number of basic [ULMs](#page-364-10) chained and the overall topology of the [LE](#page-362-3) arise in the same way they do for the non-internal cells. If internal cells do not contain [FFs,](#page-362-6) they need to be part of a deeper logic function. In this case, the depth of the internal cell itself should be reduced to make sure that the total depth of one internal-only cell chained with one non-internal-only cell is not too large, which would prevent mapping of logic functions. If [FFs](#page-362-6) are used in internal cells, the internal cells can be used to terminate a combinational net. The output must however still be mapped to a non-internal cell and if the [FF](#page-362-6) in a cell is optional and bypassed, the considerations regarding path depth still hold. For this internal cell purpose, a non-fracturable version of the [FLE](#page-362-10) as shown in [figure 6.7a](#page-213-0) was evaluated. Compared to simple logic cells consisting of one [ULM](#page-364-10) and one [FF](#page-362-6) that can be bypassed, these cells do not show large advantages in experimental benchmark statistics. The simple cell consisting of one [ULM](#page-364-10) is sufficient to increase output utilization. The final

evaluated architecture therefore uses 5 such simple [LEs](#page-362-3) combined with 15 [FLEs.](#page-362-10) As this adds up to 20 outputs in total, which matches the logic cluster output count, all cell outputs are exposed and internal-only cells are not used.

<span id="page-213-0"></span>

**Figure 6.7:** Modified [ULM](#page-364-10) architectures for further evaluation. **[\(a\)](#page-212-0)** A non-fracturable, reduced version of the [LE.](#page-362-3) **[\(b\)](#page-212-0)** A version introducing bypass paths.

**Cell Functions** Apart from the functions of [table 6.1](#page-200-0) on page [178](#page-200-0) which are directly provided by the ambipolar base cell, the [EDA](#page-362-0) tools used require certain other, artificial modes: The ABC tool for synthesis expects a buffer cell, an inverter cell and constant zero and one generator cells. The buffer cell is unnecessary (or implicit) in an [FPGA](#page-362-2) architecture and can simply be removed from the netlist. The inverter cell can be represented in the *CNT-DR8F* cell, when both inputs are connected to the same input variable and a configuration such as ulm\_nor2 is chosen. Connecting inputs in this way is possible for those [ULM](#page-364-10) cells which are directly connected to the input crossbar, i.e. simple [LEs](#page-362-3) and the first level of cells in the [FLEs.](#page-362-10) For other [ULMs,](#page-364-10) an inverter function cannot be realized without additional hardware. Alternatively, better [EDA](#page-362-0) algorithms could reduce use of the inverter function, as it can be largely absorbed into complex modes with inverted inputs. For constant inputs, [LUT](#page-363-2) based [FPGA](#page-362-2) architectures can simply adjust the lookup tables to account for the constant inputs. An [ULM](#page-364-10) based architecture on the other hand has to provide these constant values as possible inputs to the logic cells. The constant zero and one generator cells have therefore been

implemented as two additional inputs to the logic cluster input crossbar, as shown in [figure 6.2b](#page-203-0) on page [181.](#page-203-0) Having local constants avoids routing of these constant nets, reducing congestion on the interconnects while still creating minimal resource usage in a logic cluster.

**Cell Bypass** In some cases, [ULMs](#page-364-10) need a bypass mode which simply forwards one of the inputs to the output. This function is equivalent to the buffer function which is required by synthesis tools, but the buffer function is removed from the netlist before packing. The main difference is the reason such a function is required: Whereas the buffer mode is required because of limitations of [EDA](#page-362-0) tools not adapted completely to an [ULM](#page-364-10) workflow, the bypass mode is used to increase flexibility of the fracturable logic cells: To realize simple two-input functions in the proposed [FLE,](#page-362-10) some [ULMs](#page-364-10) need to be bypassed. In the same way as the inverter function, the bypass function can be implemented with no overhead if both inputs of the bypassed [ULM](#page-364-10) can be connected to the same input. This is again the case for those [ULMs](#page-364-10) which are directly connected to the input crossbar, as these can be configured in ulm\_and2 mode to forward the input. For other [ULMs,](#page-364-10) additional multiplexers are required to either bypass the cell, or connect both inputs to the same value. Connecting the inputs has the benefit of the bypassed [ULM](#page-364-10) still serving as an electrical buffer and has been implemented in [figure 6.7b.](#page-213-0) This approach has the drawback of requiring additional transistors for the multiplexers, as well as an additional bit of configuration storage.
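Both tricks that rely on tying the two [ULM](#page-364-10) inputs together, the inverter realized with ulm\_nor2 and the bypass realized with ulm\_and2, follow directly from the truth tables. The short check below models the two modes as plain Boolean functions; the Python function names are illustrative only.

```python
def ulm_nor2(a, b):
    """Two-input NOR on 0/1 values."""
    return int(not (a or b))

def ulm_and2(a, b):
    """Two-input AND on 0/1 values."""
    return int(a and b)

def tied(mode):
    """Connect both ULM inputs to the same signal."""
    return lambda x: mode(x, x)

inverter = tied(ulm_nor2)  # NOR(x, x) = NOT x
bypass = tied(ulm_and2)    # AND(x, x) = x

if __name__ == "__main__":
    for x in (0, 1):
        print(x, inverter(x), bypass(x))
```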

<span id="page-214-0"></span>

**Figure 6.8:** Utilization statistics for the [FF](#page-362-6) in the logic cluster. **[\(a\)](#page-214-0)** Utilization of [FFs](#page-362-6) in [FLE](#page-362-10) cells. **[\(b\)](#page-214-0)** Utilization of [FFs](#page-362-6) in simple cells.

**[FF](#page-362-6) Amount** When there are more [FLEs](#page-362-10) in the [RFET](#page-364-0) logic cluster than in the reference architecture [LUT](#page-363-2) cluster, it can be questioned whether each [FLE](#page-362-10) needs to contain one [FF.](#page-362-6) Gathering the [FF](#page-362-6) usage statistics from the [VPR](#page-364-1) benchmarks shown in [figure 6.8](#page-214-0) suggests that only 12.2% of the [FFs](#page-362-6) in [FLEs](#page-362-10) and 9.2% of the [FFs](#page-362-6) in simple logic elements are used. Adjustments to the

architecture and further benchmark statistics suggest that a total of 10 [FFs](#page-362-6) per logic cluster yields almost complete utilization of all [FFs.](#page-362-6) At the same time, the total [FPGA](#page-362-2) area did not increase in these benchmarks. Such an increase would occur if there were too few [FFs](#page-362-6) in a logic cluster and more clusters needed to be instantiated. A reasonable trade-off therefore seems to be to reduce the number of [FFs](#page-362-6) in the cluster to 10: The final tested architecture therefore uses 5 simple logic elements with [FFs,](#page-362-6) 5 [FLEs](#page-362-10) with [FFs](#page-362-6) and 10 [FLEs](#page-362-10) without [FFs.](#page-362-6)

## <span id="page-217-1"></span>**Chapter 7**

# **Power Management Regions**

The following chapter provides details for the implementation of the region concept presented in [section 4.2](#page-136-0) on page [114](#page-136-0) for [VPR.](#page-364-0) The implementation presented enables different parts of the [FPGA](#page-362-0) architecture to operate under different [PVTA](#page-364-1) conditions and in different performance levels. An example of a supported region grid is shown in [figure 7.1,](#page-217-0) reprinted from [figure 4.4.](#page-137-0)

<span id="page-217-0"></span>

**Figure 7.1:** An example of the *k6\_frac\_N10\_40nm* architecture with a placed user application (orange) and region grid (blue). Region size was arbitrarily chosen as 3x3 blocks, including [IOBs.](#page-362-1)

Apart from the introduction of the region feature, region assignment will be discussed. Region assignment determines how [FPGA](#page-362-0) resources are grouped into regions and which operating conditions are used for each region. Two variations of region assignment will be presented: Static region assignment maps the operating conditions to each region statically, according to the [FPGA](#page-362-0) architecture description. This mapping is therefore fixed for all user applications and all [FPGA](#page-362-0) devices of this architecture. Dynamic region assignment, on the other hand, is used for architectures where the operating conditions of a region can change in the field, during application use. The final [PARFAIT](#page-363-0) [PVTA](#page-364-1) system will use only dynamic region assignment, but static assignment can provide a less complex alternative which needs fewer resources to realize.

The region description format was kept abstract, to enable modelling of not only [DVS](#page-362-2) systems, but also other varying parameters. Parts of this chapter were originally published in [\[Pfa23b\]](#page-344-0) and this chapter has been extended with further details. In addition, the region model support has been integrated and evaluated in a more thorough design space exploration, which will be presented in the final chapters of this dissertation.

### <span id="page-218-1"></span>**7.1 Region Modelling in VPR**

To support locally varying operating conditions, two new concepts are introduced: Power regions and region modes. A power region is a physical area on an [FPGA](#page-362-0) containing one or multiple primitive blocks, usually [CLBs.](#page-361-0) Such a region is the smallest entity for which operating conditions can be changed individually.

```
<region sizex="2" sizey="2" default="vdd_1V">
  <mode name="vdd_1V"/>
  <mode name="vdd_0.95V"/>
  <layout>
    <fill type="vdd_1V" priority="10"/>
    <pattern type="vdd_0.95V" priority="20" ... />
  </layout>
</region>
```
**Listing 7.1:** Example showing a region specification with two modes, vdd\_1V and vdd\_0.95V, with static region assignment.

What operating conditions can be changed within a region is irrelevant for the modeling scheme: It could be $VDD$, body biasing, [PG](#page-363-1) voltage or any other quantity. A region mode is a specific instance of these operating conditions. For example, if *VDD* is varied, operating modes could be two different voltages, 1.0 V and 0.95 V, as shown in [listing 7.1.](#page-218-0) Regions are usually placed as rectangles in grid form, but the introduced extensions for the architecture description allow any shape supported by [VPR.](#page-364-0) In the simplest case, a [VPR](#page-364-0) architecture description is extended with region width and height information and the grid will be created automatically. In [listing 7.1,](#page-218-0) a pattern description enables static assignment of region modes.

[LUT](#page-363-2) and therefore [CLB](#page-361-0) delays are the most critical components to consider in mode adjustments, e.g. when reducing operating voltage [\[251\]](#page-339-0). Primitive blocks, switches and other elements in these blocks will exhibit different propagation delays and power usage, depending on the operating conditions. For the region models in this thesis, clocking and routing networks are assumed to have an independent power supply and are not included in the regions. The [VPR](#page-364-0) architecture description format was accordingly extended to support different delay values depending on region mode. An example for such a mode-based timing specification is shown in [listing 7.2.](#page-219-0)

```
<delay_constant max="85e-12" in_port="l.o" out_port="b.o">
  <region mode="vdd_1V" max="85e-12"/>
  <region mode="vdd_0.95V" max="84.27e-12"/>
</delay_constant>
```
**Listing 7.2:** Example showing the description of timing specifications which vary depending on the region mode.

In addition, two changes were made in the [VPR](#page-364-0) power estimator: To enable parametrization on modes, absolute values given in the architecture file, such as those in the absolute or pin-toggle power models, were changed to tables. For more complex transistor-level power modeling, e.g. as part of the auto-size model, [VPR](#page-364-0) obtains per-transistor data from a .tech file. This functionality was also extended to support loading different .tech files depending on each region's current mode.

For logic block delay calculation, the [VPR](#page-364-0) algorithms were changed to use the delay from the architecture file according to the region mode. In the placement phase, it is ensured that the cost function also considers the now region-specific, adjusted delays. The power estimation was also changed to estimate each region's power usage independently, based on its current mode.
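As a rough illustration of how a region-dependent delay could be resolved, the following C++ sketch maps a block location to its region and then selects the delay for that region's current mode. It is a minimal sketch under stated assumptions and not the actual VPR implementation: the names RegionGrid, ModeDelay and block\_delay are hypothetical, and a plain rectangular grid with row-major region indices is assumed.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical model of a rectangular region grid (not a VPR data structure).
struct RegionGrid {
    int size_x;           // region width in blocks (cf. sizex in listing 7.1)
    int size_y;           // region height in blocks (cf. sizey in listing 7.1)
    int regions_per_row;  // number of regions along the x axis
    std::vector<std::string> mode_of_region;  // current mode per region index

    // Map a block location to the index of the region containing it.
    int region_at(int x, int y) const {
        assert(x >= 0 && y >= 0);
        int idx = (y / size_y) * regions_per_row + (x / size_x);
        assert(idx < static_cast<int>(mode_of_region.size()));
        return idx;
    }
};

// Delay annotation with a default value and optional per-mode overrides,
// mirroring the structure of the timing specification in listing 7.2.
struct ModeDelay {
    double default_max;
    std::map<std::string, double> per_mode_max;

    // Return the mode-specific delay, or the default if the mode is not listed.
    double max_delay(const std::string& mode) const {
        auto it = per_mode_max.find(mode);
        return it != per_mode_max.end() ? it->second : default_max;
    }
};

// Delay of a block at (x, y) under the mode currently assigned to its region.
double block_delay(const RegionGrid& grid, const ModeDelay& d, int x, int y) {
    return d.max_delay(grid.mode_of_region[grid.region_at(x, y)]);
}
```

Falling back to the default delay for unlisted modes is a design choice of this sketch; it keeps existing architecture files valid when no per-mode values are given.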

### **7.2 Static Mode Assignment**

In static region assignment, modes are pre-assigned to power regions in the [FPGA](#page-362-0) architecture description. [VPR](#page-364-0) will not change modes of the regions, and will simply calculate the initial placement according to this grid instead. [Figure 7.2](#page-220-0) shows two simple examples of such static assignments, a checkerboard and a column pattern.

<span id="page-220-0"></span>

**Figure 7.2:** Two example grids specified using static assignment. Two different modes are indicated by the shade of blue. **[\(a\)](#page-220-0)** Checkerboard pattern. **[\(b\)](#page-220-0)** Column pattern.

The main expected benefit of static assignment is reduced resource overhead in implementation: As each region operates in only one predetermined mode, the implementation only has to provide one operating condition, e.g. one supply voltage, for each region. Compared to a dynamic scheme, however, this lower resource overhead comes at the cost of reduced flexibility: Modes cannot be adjusted according to the placed application. Instead, the placement logic is modified to place the application according to the predefined grid. Another major limitation that follows from this is that a system with static assignment cannot adjust modes at runtime. It can therefore not implement the [PVTA](#page-364-1) compensation scheme.

[Figure 7.3](#page-221-0) shows modifications in the [VPR](#page-364-0) toolflow to support static region assignment. Firstly, a new region layout stage is inserted in the [VPR](#page-364-0) flow after packing, building the grid according to the architecture file. This step has to be performed after packing, as the [FPGA](#page-362-0) size is not available before. This limitation is grounded in VPR's support for application-specific [FPGA](#page-362-0) size determination, which is performed in the packing phase. Secondly, the newly built region grid (*R-Grid*) is made available for the placement phase. In this phase, [VPR](#page-364-0) uses the new information to place logic in low-power or high-performance regions. This is accomplished primarily through consideration of timing delays according to the respective modes.

<span id="page-221-0"></span>

**Figure 7.3:** [VPR](#page-364-0) extensions to support mapping applications to regions with statically assigned modes. The *R-Layout* pass and the *R-Grid* model shown in orange have been newly added to the original [VPR](#page-364-0) flow.

As an implementation peculiarity, [VPR](#page-364-0) caches certain block-internal delay values, assuming those are independent of the exact location of the block. As this is not true if regions exhibit different performance characteristics, the implementation was changed to update cached delays whenever a logic block moves between regions.

### **7.3 Dynamic Mode Assignment**

As static assignment can not be used for [PVTA](#page-364-1) compensation, a dynamic mode assignment scheme was implemented as well. In this approach, modes are assigned to regions after the design has been placed as usual by the standard [VPR](#page-364-0) flow. For the placement phase, there are two options to consider: All regions may be considered to be in low-power mode when the design is placed. After placement, the voltages of regions containing paths failing the timing constraints are increased, until the paths meet the timing constraints. Alternatively, the design may be placed assuming all regions are in high-performance mode. After the initial placement, the voltage for each region is reduced as long as the design still achieves timing closure. Experiments show that the latter approach yields better results.

[Figure 7.4](#page-222-0) shows examples of dynamically assigned modes for the grid introduced in [figure 7.2.](#page-220-0) Two different applications were placed onto the [FPGA,](#page-362-0) as indicated by the orange blocks. Regions which do not contain any logic used by the application can be switched off completely, as can be seen in [figure 7.4a](#page-222-0) in the top-right and in [figure 7.4b](#page-222-0) in the bottom left. The modes selected by [VPR](#page-364-0) map to the initial or expected conditions: Based on the propagation delays for the default process, [VPR](#page-364-0) calculates the available slack. If [PVTA](#page-364-1) conditions are considered, the real available slack in each region may change. Usually, the delays used during placement are the worst case expected delays: This ensures that the placed application works on every [FPGA](#page-362-0) [IC.](#page-362-3)

<span id="page-222-0"></span>

**Figure 7.4:** Example mode assignments for two applications using dynamic assignment. Different modes are indicated by the shade of blue. **[\(a\)](#page-222-0)** Application 1. **[\(b\)](#page-222-0)** Application 2.

[Figure 7.5](#page-223-0) shows the implementation of dynamic region mode assignment in [VPR.](#page-364-0) The implementation of the algorithm largely follows the variant proposed by [\[237\]](#page-338-0) for dual power supply [FPGAs.](#page-362-0) Two new processing phases are introduced in [figure 7.5:](#page-223-0) Slack budget calculation and dynamic region assignment. In slack budget calculation, all paths going through a region are enumerated and their timing slacks are collected. The minimum slack is then assigned as the slack budget of that region. After determining the budgets of all regions, budgets are sorted and the dynamic region assignment phase is executed: [VPR](#page-364-0) iterates through all regions, starting with those with the highest slack budget. It assigns a lower voltage to the region and recalculates the overall timing. If timing closure is still achieved, the change is accepted and processing continues with the next region. Otherwise, the change is reverted before continuing with the next region. When all regions are processed, the algorithm concludes with a final timing analysis.
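The greedy assignment loop described above can be sketched in C++ as follows. This is only an illustration under simplifying assumptions: each region has just two modes (high-performance and low-power), and the timing analysis is abstracted into a caller-supplied callback. The names assign\_modes and meets\_timing are hypothetical and do not correspond to VPR functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Greedy mode assignment sketch (hypothetical names, not VPR API).
// slack_budget[r]: minimum slack over all paths through region r.
// meets_timing(modes): stand-in for a full timing analysis of a candidate
//                      assignment; modes[r] == true means region r is in
//                      low-power mode.
std::vector<bool> assign_modes(
    const std::vector<double>& slack_budget,
    const std::function<bool(const std::vector<bool>&)>& meets_timing) {
    const std::size_t n = slack_budget.size();
    std::vector<bool> low_power(n, false);  // start in high-performance mode
    assert(meets_timing(low_power));        // initial placement closes timing

    // Visit regions in order of decreasing slack budget.
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) {
                         return slack_budget[a] > slack_budget[b];
                     });

    for (std::size_t r : order) {
        low_power[r] = true;              // tentatively lower the voltage
        if (!meets_timing(low_power)) {
            low_power[r] = false;         // revert on a timing violation
        }
    }
    return low_power;
}
```

Because every tentative change is either accepted or reverted immediately, the returned assignment always satisfies the timing check, at the cost of one timing analysis per region.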

<span id="page-223-0"></span>

**Figure 7.5:** [VPR](#page-364-0) extensions to support mapping applications to regions with dynamically assigned modes. The *Budget* slack estimation pass and the *R-Assign* mode assignment pass shown in orange have been newly added to the original [VPR](#page-364-0) flow.

The approach introduced here uses information about the delays in modes to select modes during placement. The next chapter will discuss an alternative approach, where the slacks in each region will simply be recorded. Together with the critical path length in each region, this information will be used to derive a $k_{\text{slack}}$ factor: This factor represents a relative measurement stating how much relative increase in path delay can be accepted, before timing violations occur.


## **Chapter 8**

# **PVT- and Aging Compensation**

With [RFET-](#page-364-2)based [LEs](#page-362-4) in place and power region support in [EDA](#page-362-5) tools introduced, the final steps for power management are determining the acceptable performance degradation in regions and measuring a region's current performance. These two aspects are first covered in this chapter. The chapter then concludes with a summary describing the [PARFAIT](#page-363-0) power controller and [FPGA](#page-362-0) architecture.

### <span id="page-225-0"></span>**8.1 Performance Requirement Determination**

The [PVTA](#page-364-1) compensation approach employed in the [PARFAIT](#page-363-0) [FPGA](#page-362-0) does not have direct feedback of whether an application achieves timing closure: Unlike the Razor schemes introduced in [section 3.4](#page-107-0) on page [85,](#page-107-0) the scheme used here does not detect timing violations in the configured application. Instead, the measurement approach introduced in the next section determines the current propagation delays for all [LEs.](#page-362-4) The impact on user application paths is however not directly known, as the [PVTA](#page-364-1) controller does not know how [LEs](#page-362-4) are connected by paths in the application.

Nevertheless, the controller approach needs a target variable for each region. This dissertation therefore introduces the derivation of a slack factor for each region and each application. This factor is calculated in the [EDA](#page-362-5) flow in each region, based on the critical path and its available slack. As the [EDA](#page-362-5) timing analysis uses the nominal or typical [LE](#page-362-4) delay, the factor describes how much [LE](#page-362-4) delay can differ from nominal delay to still ensure timing closure. Referring to [section 2.3](#page-42-0) on page [20,](#page-42-0) $t_{\text{slack}}$ is the difference between required and actual arrival time. This means that the path delay, as characterized by [VPR](#page-364-0) when placing the user application, could increase by up to $t_{\text{slack}}$ for the circuit to still work correctly:

$$
t_{\rm PD,max} = t_{\rm PD, typ} + t_{\rm slack} \tag{8.1}
$$

The slack factor,  $k_{\text{slack}}$ , is then defined as:

$$
t_{\rm PD,max} = k_{\rm slack} \cdot t_{\rm PD, typ} \tag{8.2}
$$

$$
k_{\text{slack}} = \frac{t_{\text{PD,max}}}{t_{\text{PD,typ}}} \tag{8.3}
$$

<span id="page-226-0"></span>
$$
=\frac{t_{\rm PD,typ} + t_{\rm slack}}{t_{\rm PD,typ}}\tag{8.4}
$$

The terms of [equation \(8.4\)](#page-226-0) can be easily obtained from the [VPR](#page-364-0) tool with some minor code changes. Defining the $k_{\text{slack}}$ factor provides a relative definition independent of the absolute slack and propagation delay, which both depend on the path length. When assuming that a path delay is a linear combination of [LE](#page-362-4) delays and that all [LEs](#page-362-4) are affected by a change in delay in the same way, the $k_{\text{slack}}$ factor can be compared to the relative change in [LE](#page-362-4) performance: If the delay of each [LE](#page-362-4) in a path increases by a factor of less than $k_{\text{slack}}$, the total path delay will not exceed $t_{\text{PD,max}}$.
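To make the relation concrete, consider a purely illustrative path; the numbers are assumptions chosen for easy arithmetic and are not measured values. With $t_{\text{PD,typ}} = 8\,\text{ns}$ and $t_{\text{slack}} = 2\,\text{ns}$, equations (8.1) and (8.3) give:

$$
t_{\text{PD,max}} = 8\,\text{ns} + 2\,\text{ns} = 10\,\text{ns}, \qquad k_{\text{slack}} = \frac{10\,\text{ns}}{8\,\text{ns}} = 1.25
$$

Under the linearity assumption stated above, the path delay is the sum of its element delays $t_{\text{LE},i}$. If every element delay is scaled by the same factor $a$, the scaled path delay is:

$$
\sum_i a \cdot t_{\text{LE},i} = a \cdot t_{\text{PD,typ}} \le k_{\text{slack}} \cdot t_{\text{PD,typ}} = t_{\text{PD,max}} \quad \text{if and only if} \quad a \le k_{\text{slack}}
$$

In this example, each element could therefore become up to 25% slower before the path violates its timing constraint.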

When a path passes through multiple regions, the path element delays could theoretically have different relative delay increases. To properly handle this case, the [VPR](#page-364-0) code calculating $k_{\text{slack}}$ considers all paths passing through each region and selects the smallest factor. A path spanning multiple regions might therefore determine the factor in most of those regions. If the factor is different for one region, as per the definition, it must be smaller. So even in this case, $t_{\text{PD,max}}$ still will not be exceeded. Such a case however provides room for future optimization: If a part of a path is known to never be operating at its slowest possible performance, as other paths in that region require higher performance, the delay in other regions might be safely increased further than the averaged $k_{\text{slack}}$ factor suggests.

[Listing 8.1](#page-226-1) shows pseudocode implementing the $k_{\text{slack}}$ calculation in [VPR.](#page-364-0) In general, the calculation code builds on the region definition code of [section 7.1](#page-218-1) to load regions from the architecture definition. [VPR](#page-364-0) then places and routes the user application or loads a pre-defined placement and routing. The pseudocode then uses [VPR'](#page-364-0)s TimingPathCollector to iterate over all paths and determines whether the path uses any [CLB](#page-361-0) in a specific region. Ultimately, it selects the path with the smallest slack in a region to calculate the region's factor.

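```
// Collect all timing paths of the placed and routed design.
auto paths = path_collector.get_all_paths();

for (auto path: paths) {
    auto slack = get_slack(path);
    auto delay = get_delay(path);

    // Visit every element of the path and look up the region it is placed in.
    for (auto elem: get_elements(path)) {
        auto loc = get_loc(elem);
        auto region = get_region(loc);

        // Keep the path with the smallest slack seen so far in this region.
        // result[region].slack is assumed to start at +infinity.
        if (result[region].slack > slack) {
            result[region].slack = slack;
            result[region].k = (delay + slack) / delay;
        }
    }
}
```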
**Listing 8.1:** Pseudocode implementing the newly introduced $k_{\text{slack}}$ calculation in [VPR.](#page-364-0)
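For experimentation outside of VPR, the pseudocode of listing 8.1 can be turned into a small self-contained C++ sketch. The Path and RegionResult types and the function region\_slack\_factors are hypothetical stand-ins for VPR's timing data, and each path is assumed to simply carry the list of regions it passes through.

```cpp
#include <cassert>
#include <limits>
#include <map>
#include <vector>

// Hypothetical stand-in for a timing path reported by the timing analysis.
struct Path {
    double delay;              // typical path delay t_PD,typ
    double slack;              // available slack t_slack
    std::vector<int> regions;  // regions visited by the path elements
};

// Per-region result; both fields start at +infinity for untouched regions.
struct RegionResult {
    double slack = std::numeric_limits<double>::infinity();
    double k = std::numeric_limits<double>::infinity();
};

// For every region, keep the factor of the path with the smallest slack,
// following the selection rule of listing 8.1.
std::map<int, RegionResult> region_slack_factors(const std::vector<Path>& paths) {
    std::map<int, RegionResult> result;
    for (const Path& path : paths) {
        assert(path.delay > 0.0);  // the factor is undefined for zero delay
        for (int region : path.regions) {
            RegionResult& entry = result[region];  // created on first access
            if (entry.slack > path.slack) {
                entry.slack = path.slack;
                entry.k = (path.delay + path.slack) / path.delay;
            }
        }
    }
    return result;
}
```

A path with a delay of 8 and a slack of 2 that spans regions 0 and 1 sets the factor of both regions to 1.25, unless another path with a smaller slack also passes through one of them. The sketch follows listing 8.1 in comparing slacks; comparing the factors themselves would be a small variation if the smallest factor per region is wanted directly.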

[Figure 8.1b](#page-227-0) shows a graphical demonstration of the $k_{\text{slack}}$ factors obtained for a benchmark application. Here, the diffeq1 benchmark has been placed onto a 24x24 [FPGA,](#page-362-0) with 4x4 power regions. [Figure 8.1a](#page-227-0) shows the user application placement, [figure 8.1b](#page-227-0) the $k_{\rm slack}$ factors. In this example, the critical path spans 7 regions, which therefore all have the same slack factor of 1.17. The remaining regions have varying factors from 1.39 to 4.73.

<span id="page-227-0"></span>

**Figure 8.1:** Example application placement and available slack in each region. **[\(a\)](#page-227-0)** shows how the benchmark application has been placed onto the FPGA and into regions. **[\(b\)](#page-227-0)** depicts the slack factor for each region. Brighter color indicates a lower factor, i.e. less available slack in the critical path. Unused regions are depicted in white color.

### <span id="page-228-0"></span>**8.2 Transparent Logic Invasion**

As explained in [section 4.1,](#page-129-0) a special logic invasion scheme has been designed as part of this thesis to characterize the performance of [CLBs.](#page-361-0) In this scheme, the user application is transparently relocated, with the application output being unaffected. This relocation process is specifically designed to always keep a single row of the [FPGA](#page-362-0) unused. Over time, each single row of the [FPGA](#page-362-0) becomes temporarily unused, allowing characterization of all [CLBs](#page-361-0) with few changes in [FPGA](#page-362-0) architecture and little hardware overhead. The performance measurement system is described in detail in the next section. This section describes the invasion process itself, which realizes the logic relocation and the programming of arbitrary bitstreams onto unused [CLBs.](#page-361-0)

**Concept** The overall process of logic invasion is depicted in [figure 8.2](#page-229-0) by means of a small example. Subfigure **[\(a\)](#page-229-0)** shows the initial state after the user application has been programmed, i.e. the DONE state has been reached in [figure 4.12](#page-148-0) on page [126.](#page-148-0) In this example application, a bitstream has been programmed to the four upper [CLBs,](#page-361-0) the lowest row being unused. Numbers in the [CLBs](#page-361-0) denote the individual [CLB](#page-361-0) bitstreams programmed onto those, whereas blocks in blue show unused [CLBs.](#page-361-0) The global interconnect, including [PSMs](#page-363-3) and [CBs,](#page-361-1) is not affected by the logic invasion scheme. After the user application has been programmed initially, this configuration is never modified. It will therefore not be mentioned explicitly in the following discussion and figures.

Subfigure **[\(b\)](#page-229-0)** shows the [FPGA](#page-362-0) after the first logic invasion: The middle row has been invaded and is now effectively unused. It could be programmed with any arbitrary bitstream at this point without affecting the user application. To achieve this, the [CLB](#page-361-0) configuration and bitstream of this row have been duplicated to the previously unused row below. In the following, this process will be called relocation of logic from the invaded row to the backup row. As mentioned before, the interconnect is not modified, so inputs for the relocated [CLBs,](#page-361-0) which are now in the lower row, are still provided by the [CBs](#page-361-1) in the invaded row. The architecture has been modified accordingly to allow redirection of the inputs of one [CLB](#page-361-0) to the [CLB](#page-361-0) directly below. Similarly, outputs can be redirected from the relocated-to row back to the invaded row. Active extra connections are shown in orange in the figures and make up the main changes and resource overhead of the proposed logic invasion scheme.

<span id="page-229-0"></span>

**Figure 8.2:** Overview of the logic invasion process, numbers depict specific application [CLB](#page-361-0) bitstreams. **[\(a\)](#page-229-0)** Initial state. **[\(b\)](#page-229-0)** Row 1 invaded. **[\(c\)](#page-229-0)** Row 2 invaded, final up state. **[\(d\)](#page-229-0)** Row 1 invaded. **[\(e\)](#page-229-0)** Row 0 invaded, final down state.

This strategy using additional connections has been chosen, as the alternative, dynamic reprogramming of the global interconnect, essentially requires expensive re-routing of the user application: It is not possible to simply move [PSM](#page-363-3) configurations downwards. Even in this simple architecture there may be signals which still need to connect to specific [IOBs](#page-362-1) in the invaded row. [PSMs](#page-363-3) in the invaded row could therefore not simply be reprogrammed to forward between the upper and lower row. As an example, consider that the leftmost [CLB](#page-361-0) in the middle row connects to the left [IOB](#page-362-1) in the same row. If the [CLB](#page-361-0) moves down, it still needs to connect to the [IOB](#page-362-1) one row above. This connection requires introducing another track, which might not be available. For such a simple architecture, an elaborate solution could be implemented in the [EDA](#page-362-5) toolflow: If the [EDA](#page-362-5) tools ensure that in these cases, [CBs](#page-361-1) always use a wire in upwards direction, this wire could be forwarded to the [PSM](#page-363-3) above. Such an approach however may be difficult to scale to more complex architectures. It is mentioned here though as a possible optimization for future research, which could further reduce hardware overhead in logic invasion.

Subfigure **[\(c\)](#page-229-0)** shows the final invasion in upward direction. In this case, the top-most row has been invaded, where the general concept is the same as for all other rows. The invaded, or unused, row has now fully moved from the bottom to the top of the [FPGA,](#page-362-0) covering each single [CLB](#page-361-0) once. As there is now no unprogrammed row in the bottom that can be used for relocation, the invasion process can not simply start again from the beginning. Instead, the invasion process is now rolled back step by step until the original state is reached again.

Subfigure **[\(d\)](#page-229-0)** shows the first invasion in downwards direction. In this case, logic of [CLB](#page-361-0) 2 and 3 is moved back to its original location in the top-most row. In the following, this process will be called relocation of logic from the invaded row back to the original row. As the logic moved back to the original location, no further connections are necessary. Previously active bypass connections shown in orange will just have to be deactivated again. Subfigure **[\(e\)](#page-229-0)** shows the final invasion in the downwards direction. At this point, the bottom-most row is invaded and the [FPGA](#page-362-0) state is identical to the state in the beginning in subfigure **[\(a\)](#page-229-0)**. The invasion process can now start again from the beginning, allowing for unlimited, continuous operation.
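The resulting up-and-down sweep of the unused row can be summarized in a few lines of C++. The helper invaded\_row\_sequence is a hypothetical illustration and not part of the hardware or toolflow; rows are numbered from 0 at the bottom, as in [figure 8.2.](#page-229-0)

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: sequence of unused (invaded) rows over time for an
// FPGA with `rows` CLB rows, where row 0 is the bottom row that starts unused.
std::vector<int> invaded_row_sequence(int rows, int steps) {
    assert(rows >= 2 && steps >= 0);
    std::vector<int> sequence;
    int row = 0;         // initially the bottom row is unused
    int direction = +1;  // +1: sweeping upwards, -1: rolling back downwards
    for (int i = 0; i < steps; ++i) {
        sequence.push_back(row);
        if (row + direction < 0 || row + direction >= rows) {
            direction = -direction;  // reverse at the top and bottom row
        }
        row += direction;
    }
    return sequence;
}
```

For the three-row example of the figure, the first five entries correspond to subfigures (a) through (e), after which the sweep repeats.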

<span id="page-230-0"></span>

**Figure 8.3:** Modifications of the [CLB](#page-361-0) of [figure 4.2](#page-132-0) on page [110](#page-132-0) to enable logic invasion. Newly added components are marked in orange.

**Architecture Changes** As was explained in the overview, changes to the architecture are necessary to support the relocation process. All changes are contained in the [CLB](#page-361-0) implementation and are shown in [figure 8.3,](#page-230-0) which is a modified version of [figure 4.2](#page-132-0) on page [110.](#page-132-0) For the logic invasion implementation, the [BLE](#page-361-2) is treated largely as a black box, enabling any logic generator to be used as the [BLE.](#page-361-2) The concept here is therefore not limited to [LUTs.](#page-363-2) The only needed change in [BLEs](#page-361-2) is using the [FF](#page-362-6) implementations with an asynchronous load option.

The main changes in [figure 8.3](#page-230-0) are the introduction of new multiplexers marked in orange and the corresponding in- and output signals. The multiplexer on the left side is added to support redirection of the inputs. It uses the IBP\_I signal to gain access to the global interconnect inputs of the [CLB](#page-361-0) in the row above. At the same time, it also forwards its input to the [CLB](#page-361-0) in the row below using the IBP\_O output. Similarly, the output multiplexer can connect the global interconnect output to the output of the [BLEs](#page-361-2) in the row below, the OBP\_I signal. It also forwards its own [BLE](#page-361-2) outputs to the [CLB](#page-361-0) in the row above using the OBP\_O signal.

<span id="page-231-0"></span>

**Figure 8.4:** Modification of the [FFs](#page-362-6) in the [BLEs](#page-361-2) for logic invasion. A direct output provides the [FF](#page-362-6) output independently of the [MUX](#page-363-4) state in the [BLE.](#page-361-2) In addition, the [FF](#page-362-6) supports asynchronous loading from two selectable inputs.

Further modifications concern internal signals: During reconfiguration, internal feedback loops may not be initialized: Some [BLE](#page-361-2) might recursively depend on internal feedback of some other BLE, rendering its output initially undefined. As the [BLEs](#page-361-2) are not connected to the global interconnect yet, this is not immediately an issue, but ultimately, the output needs to be defined for relocation to be successful. As an example, think of a [BLE](#page-361-2) acting as a simple latch, passing one input directly to the output. If the latch output is fed back to the input, this results in a stable 0 or 1 at the output. Which one it is after programming of the [BLE](#page-361-2) is however undefined. The new architecture therefore introduces the feedback multiplexer and the FBBP signals: These allow temporarily driving the feedback inputs for the crossbar from either the [CLB](#page-361-0) above, or the [CLB](#page-361-0) below. Note that in this case, connections to both above and below [CLBs](#page-361-0) are required. For the global interconnect [CB](#page-361-1) inputs and outputs, this was not necessary: In the downwards move, connections are just restored to the original connections. But for feedback, the relocated-to row has to get feedback from the active row, which is the row below in the downward relocation case.

Similar observations apply to the [FFs](#page-362-6) in the [BLEs.](#page-361-2) When relocating logic, the state of [FFs](#page-362-6) needs to be preserved. [FFs](#page-362-6) in the relocated-to row therefore need to load their values from the relocated-from row. Again, for the same reasons as for the feedback signals, the [CLB](#page-361-0) must be able to load these from both the [CLBs](#page-361-0) in the rows above and below. In [figure 8.3,](#page-230-0) [FF](#page-362-6) loading is realized by the FF signals. The modified [FF](#page-362-6) architecture itself is shown in [figure 8.4:](#page-231-0) It consists of a multiplexer to select the load input from one of two options, north and south, and the [FF](#page-362-6) itself. The [FF](#page-362-6) has an additional asynchronous load input, L. As long as this input is high, the [FF](#page-362-6) asynchronously loads the input A as its stored value. In this mode, the output Q therefore follows the input A transparently.
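The behavior of the modified flip-flop can be captured in a short behavioral C++ model. This is a simplified sketch for illustration only: the type InvasionFF and the function copy\_from\_north are hypothetical names, and timing aspects of the real circuit are not modeled.

```cpp
#include <cassert>

// Behavioral sketch of the modified flip-flop (illustrative names only).
struct InvasionFF {
    bool q = false;  // stored value, visible at output Q

    // Evaluate one step. While `load` (L) is high, the value selected from
    // the north or south neighbor is loaded asynchronously, so Q follows the
    // selected input. Otherwise, D is captured on a rising clock edge.
    bool step(bool d, bool clock_edge, bool load,
              bool select_north, bool north, bool south) {
        if (load) {
            q = select_north ? north : south;
        } else if (clock_edge) {
            q = d;
        }
        return q;
    }
};

// Copy the state of a flip-flop in the row above into the relocated-to
// flip-flop, using the asynchronous load path with the north input selected.
void copy_from_north(InvasionFF& to, const InvasionFF& from) {
    to.step(false, false, true, true, from.q, false);
    assert(to.q == from.q);  // state is preserved across relocation
}
```

Giving the asynchronous load priority over the clocked input mirrors the description above, where the output follows the load input for as long as L is high.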

**Toolflow Modifications** As depicted in [figure 8.2](#page-229-0) on page [207,](#page-229-0) the introduced invasion scheme requires that the bottom row is not used in the initial state. This means that the application bitstream must not use the [CLBs](#page-361-0) in this row. Other interconnect resources and [IOBs](#page-362-1) can be used as usual. To ensure that [VPR](#page-364-0) does not place any logic in these [CLBs,](#page-361-0) they were defined as special blocks in the [FPGA](#page-362-0) architecture. Marking them simply as empty has important side effects: [VPR](#page-364-0) changes the pin connection pattern of other [CLBs](#page-361-0) next to these empty blocks. This complicates bitstream generation and should therefore be avoided. The blocks have therefore not been defined as empty, but as special black box blocks. These blocks have the same number of global interconnect inputs and outputs as [CLBs,](#page-361-0) and therefore the same connection pattern. They however do not contain any [BLEs,](#page-361-2) [LUTs](#page-363-2) or any other logic blocks, so [VPR](#page-364-0) does not attempt to place application logic in these locations.

In addition, these changes in the architecture may increase the worst-case propagation delays through [CLBs,](#page-361-0) as additional signal routing is introduced. The [EDA](#page-362-5) tools have to use these new worst-case delay values for [CLBs](#page-361-0) in timing analysis.

**Invasion Steps** [Figure 8.5](#page-233-0) shows the new state machine introduced in the ProgController block. It hooks into the state machine in [figure 4.12](#page-148-0) and replaces the DONE state: Instead of stopping operations after the initial programming, the controller jumps to the new WAIT state of the logic invasion state machine.

<span id="page-233-0"></span>

**Figure 8.5:** [FSM](#page-362-7) which summarizes the steps in the invasion process. After invasion finishes in the DOUT BP state, the [CLBs](#page-361-0) in the original row are unused and can be reprogrammed with an arbitrary bitstream.

- 1. The initial WAIT state is used to pause logic invasion. It waits for the invasion\_en signal and only continues to the next state if this signal is asserted.
- 2. In the INIT state, the [FSM](#page-362-7) sets the row\_select signal for the row where logic will be relocated to. It also disables the clock of that row. All operations carried out during logic invasion operate on one specific row, so control signals are used in a special way: There is one row\_select signal, selecting the row which is being operated on. All other control signals are provided in parallel to all rows. They are gated accordingly to ensure only the selected row is affected. Operations always affect all [CLBs](#page-361-0) in the selected row in the same way.

- 3. In PROG ROW, the [FSM](#page-362-7) configures the row programmer of [figure 4.11](#page-148-1) to program the relocated-to row with the bitstream of the relocated-from row. In practice, the bitstream is not read from the relocated-from row, as the configuration system only supports writing the configuration. The [FSM](#page-362-7) computes the index of the row in the original bitstream instead, depending on the reconfiguration direction. It then configures the row programmer to use that bitstream, activates the programmer and waits for the row to be programmed. Once programming is finished, it transitions to the next state.
- 4. Depending on reconfiguration direction, the BP ON state activates or deactivates the [CLB](#page-361-0) input bypass IBP\_EN: When moving up, logic is moved down and the bypass is enabled in the relocated-to row. Otherwise, the bypass in the relocated-to row is deactivated to fetch the inputs from the interconnect, as in the initial configuration. Furthermore, this state selects the FFBP\_N input in upward relocation, FFBP\_S in downward relocation. All these actions are performed in the same cycle, and the [FSM](#page-362-7) immediately proceeds to the next state.
- 5. The START CLK state simply enables the clock for the newly configured, relocated-to row. The feedback bypass is still kept active and the [FFs](#page-362-6) have not been initialized yet.
- 6. In COPY FF, the [FSM](#page-362-7) enables the load signals for the [FFs.](#page-362-6) When relocating upwards, it selects FF\_N, otherwise FF\_S.
- 7. In the next state, BP OFF, the feedback bypass signal FFBP\_S or FFBP\_N and the [FF](#page-362-6) load signal FF\_L are deasserted. These signals are deasserted at exactly the same time, ensuring that the [FF](#page-362-6) values and the feedback signal are synchronized. After this step, the relocated-to [CLBs](#page-361-0) fully use their internal feedbacks. They get interconnect inputs matching the relocated-from [CLBs,](#page-361-0) and compute the same outputs.
- 8. The DOUT BP state configures the OBP\_EN switch of the original row: It is either enabled, when relocating up, or disabled again, when relocating down. After this step, the relocated-from [CLBs](#page-361-0) have been disconnected entirely from the global [IC,](#page-362-3) and can be reconfigured arbitrarily.
- 9. The final state, NEXT, prepares invasion of the next row. It adjusts the internal row counter accordingly and changes direction if necessary. It then proceeds to the WAIT state again, which allows pausing the invasion process using the invasion\_en signal.
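The sequence of states listed above can be summarized as a small behavioural model. The following Python sketch is only an illustration and not the ProgController implementation: the names `InvasionState`, `next_state` and `trace_one_row`, as well as the `row_programmed` flag, are hypothetical, and all actions performed inside the states (row selection, bypass control, clock gating) are omitted.

```python
from enum import Enum


class InvasionState(Enum):
    """States of the invasion FSM, in the order listed in the text."""
    WAIT = "WAIT"
    INIT = "INIT"
    PROG_ROW = "PROG ROW"
    BP_ON = "BP ON"
    START_CLK = "START CLK"
    COPY_FF = "COPY FF"
    BP_OFF = "BP OFF"
    DOUT_BP = "DOUT BP"
    NEXT = "NEXT"


# Enum members iterate in definition order.
_ORDER = list(InvasionState)


def next_state(state, invasion_en, row_programmed=True):
    """Return the successor state for one FSM step (assumed transitions)."""
    # WAIT pauses invasion until invasion_en is asserted.
    if state is InvasionState.WAIT and not invasion_en:
        return InvasionState.WAIT
    # PROG ROW waits until the row programmer has finished.
    if state is InvasionState.PROG_ROW and not row_programmed:
        return InvasionState.PROG_ROW
    # All other states advance linearly; NEXT wraps around to WAIT.
    idx = _ORDER.index(state)
    return _ORDER[(idx + 1) % len(_ORDER)]


def trace_one_row():
    """List the state names visited while relocating a single row."""
    state = InvasionState.WAIT
    visited = [state]
    while True:
        state = next_state(state, invasion_en=True)
        visited.append(state)
        if state is InvasionState.WAIT:
            break
    return [s.value for s in visited]
```

Calling `trace_one_row()` walks once through all nine states and ends in WAIT again, which is the point where the invasion can be paused.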

[Figure 8.6](#page-235-0) shows two [CLBs](#page-361-0) of two adjacent rows after an upwards relocation. Here, the input multiplexer in the lower row was configured to select the signal from above. The output multiplexer of the upper row selects the output signal from the [CLB](#page-361-0) below. It can be seen that the lower [CLB](#page-361-0) could still pass down its original input to the next [CLB](#page-361-0) below. It could also still connect its interconnect output to the [CLB](#page-361-0) below, even if it is itself acting as a relocated-to row for the one above. This feature is required to ensure that all rows can be relocated consecutively.

The arrows marked in orange depict the complete signal path for a relocated [CLB.](#page-361-0) The blue path denotes that the internal feedback is also completely contained within the relocated-to [CLB.](#page-361-0) It does not depend on the relocated-from [CLB](#page-361-0) in any way.

<span id="page-235-0"></span>

**Figure 8.6:** Example showing how the modifications in [figure 8.3](#page-230-0) are used to relocate a [CLB.](#page-361-0) Used bypass connections are shown in orange. When the relocation is complete, the relocated-to [CLB](#page-361-0) uses internal feedback, shown in blue. The original [CLB](#page-361-0) is unused. **[\(a\)](#page-235-0)** [CLB](#page-361-0) in upper row, relocated from. **[\(b\)](#page-235-0)** [CLB](#page-361-0) in lower row, relocated to.

### <span id="page-236-0"></span>**8.3 Chip Performance Characterization**

With the changes introduced in the previous section, it is possible to program arbitrary bitstreams to invaded [CLBs,](#page-361-0) without affecting the user application. In this section, this feature will be used to continuously perform performance characterizations of all [CLBs](#page-361-0) on the [FPGA.](#page-362-0) Here, the performance is measured as the propagation delay $t_{\text{PD}}$ of the [LEs](#page-362-4) in each [CLB.](#page-361-0) As explained in [section 4.3](#page-139-0) on page [117,](#page-139-0) information about the performance of each specific [CLB,](#page-361-0) which changes depending on location and time, will then be used to compensate [PVTA](#page-364-1) effects.

**Overview** The primary goal of the performance measurement system in the [PARFAIT](#page-363-0) [FPGA](#page-362-0) is to achieve a fine-grain characterization of all [CLBs](#page-361-0) with few additional hardware resources. This requirement and the specific architecture of [FPGAs](#page-362-0) compared to [ASICs](#page-361-3) make some characterization approaches introduced in [section 3.4](#page-107-0) on page [85](#page-107-0) less viable: Critical path extraction and replica paths are less useful in [FPGAs.](#page-362-0) First, replica paths require additional hardware resources. In addition, critical paths in [FPGAs](#page-362-0) are not known during chip manufacturing. A critical path can only be determined once the user application is considered as well, as explained in [section 8.1.](#page-225-0) A replica would therefore have to be reconfigurable as well. Spending additional resources on reconfigurable replicas which are not used for application logic, however, seems wasteful.

The alternative approach, directly measuring the performance of the individual building blocks in a path, is more viable for [FPGAs:](#page-362-0) Paths are made up of [LEs,](#page-362-4) which to a large part determine the delay of a path. In addition, the routing delay caused by both parasitic effects and interconnect multiplexers can be significant in [FPGAs.](#page-362-0) As a first estimate, it will be assumed that a performance increase or decrease in routing multiplexers follows the relative changes in speed of nearby logic elements. This work will therefore focus on the delay of logic elements, determining the delay of individual [LEs](#page-362-4) using ring oscillators and counters: A ring oscillator forms an inverting, back-coupled delay path through certain [LEs.](#page-362-4) As a result, when measuring a signal connecting any two [LEs](#page-362-4) in the path, an oscillating square wave can be obtained. The frequency of this square wave is determined by the propagation delay, as explained in detail below. Determining the propagation delay therefore becomes a problem equivalent to determining the oscillation frequency. As previously shown in state-of-the-art works, this can easily be achieved by counting the number of oscillations in a certain, predetermined time frame.

Whereas previous works have often used such measurements in a coarse-grain way and often measured temperature or voltage, this work will directly measure propagation delay. As all the [PVTA](#page-364-1) effects affect this propagation delay, isolating one effect requires compensation of the other effects, and can therefore be difficult. For the closed loop [PVTA](#page-364-1) compensation system designed here, this is not necessary: A change in propagation delay can be compensated independently of the original cause of the change. This measurement approach is therefore expected to yield better results than when used to obtain temperature or voltage measurements.

In order to achieve more fine-grain characterization than state-of-the-art works, each single [CLB](#page-361-0) will be characterized. To avoid heavy resource overhead, configurable [FPGA](#page-362-0) resources will be reused as much as possible through the previously introduced invasion technique: Ring oscillators and counters will be realized as soft logic using the [BLEs](#page-361-2) of the [FPGA](#page-362-0) architecture. Apart from saving resources, this provides one other, major benefit: Characterization is actually performed on those [LEs,](#page-362-4) which also implement the user application circuit. Compared to path replica, the proposed approach therefore provides a better estimate of propagation delay in [LEs.](#page-362-4) The following paragraphs describe this characterization system in more detail.

<span id="page-237-0"></span>

**Figure 8.7:** Modifications for the [CLB](#page-361-0) of [figure 8.3](#page-230-0) on page [208](#page-230-0) to enable measurement of propagation delay in the [CLB.](#page-361-0) Newly added components and connections are marked in orange.

**Architecture Changes** Even though the measurement approach presented here largely reuses existing hardware in soft logic, three further modifications to the [CLBs](#page-361-0) are required. [Figure 8.7](#page-237-0) shows the [CLB](#page-361-0) for logic invasion from [figure 8.3](#page-230-0) on page [208](#page-230-0) with newly introduced signals marked in orange. The EXT\_I signal is an additional input provided to the crossbar. It can be passed to any of the crossbar outputs, which in turn allows passing an external value to any of the [BLEs](#page-361-2) and [LEs.](#page-362-4) Changes to the bitstream structure are not required: The crossbar's 6 bit multiplexer select signals already allow selection from 64 inputs, leaving 4 inputs unused. The [FASM](#page-362-8) assembler for the architecture has been extended to support this external input option for each [BLE,](#page-361-2) generating the matching bitstream. [VPR](#page-364-0) based logic synthesis does not use this input and therefore does not have to be changed. All soft-logic bitstreams for characterization will be manually designed in [FASM](#page-362-8) instead of Verilog for full control of placement and routing. Supporting this signal in the assembler is therefore sufficient.

The second change is the introduction of the 1 bit LUT0\_O signal. It passes the output of [LE](#page-362-4) 0 in [BLE](#page-361-2) 0 to the ProgController, which orchestrates the performance characterization. Such a single output allows for transmission of arbitrary data when using a serial protocol. It is deliberately connected to the [LE](#page-362-4) output directly instead of the [BLE](#page-361-2) output: As shown in [figure 4.3b](#page-134-0) on page [112,](#page-134-0) this essentially means the signal always carries the [LE](#page-362-4) output signal, regardless of how the output multiplexer is configured. This primarily allows configuration of this multiplexer to forward the [FF](#page-362-6) state of the respective [BLE,](#page-361-2) while still making the [LE](#page-362-4) output accessible externally. As will be explained later, this is not strictly required, but it simplifies reading of register values via this output.

The last change is the introduction of a clock switch signal, CLK\_S. Counting the ring oscillator pulses will be realized using a synchronous counter: The pulse signal to be counted will be provided as clock signal to all counter [FFs](#page-362-6) synchronously. Compared to asynchronous counters, this requires only a small change in the clocking network: Instead of using an external clock signal, all [FFs](#page-362-6) will be configured to switch to an internally provided clock signal. In the implementation, a single clock multiplexer in the [CLB](#page-361-0) allows selection of the LUT0\_O signal as clock for all [FFs.](#page-362-6) As this connection provides access to the signal output of [LE](#page-362-4) 0, it can be used as a tap into the ring oscillator. In commercial [FPGA](#page-362-0) architectures, clock switching architectures are commonly used to provide [CLBs](#page-361-0) with different clocks. The introduction of another clock switch is therefore not an intrusive change.

**Measurement Steps** Based on the discussed architecture changes, measurements can be implemented in soft logic. Like invasion, the measurement process is implemented in ProgController. An overview of all states for measurement is given in [figure 8.8.](#page-239-0) As an additional change, a new MEASURE state is inserted in [figure 8.5](#page-233-0) between the DOUT BP and NEXT states. In this state, the relocation of a row has just been finished and [CLBs](#page-361-0) in the relocated-from row are ready to be configured with arbitrary bitstreams. The measurement states introduced here are sub-states of the MEASURE state in the invasion [FSM.](#page-362-7) In the following, a quick overview of the measurement [FSM](#page-362-7) states will be given, before the most important aspects will be discussed in detail.

<span id="page-239-0"></span>

**Figure 8.8:** [FSM](#page-362-7) realizing the measurement of propagation delay in the invaded [CLBs.](#page-361-0)

- 1. The INIT state selects the relocated-from row for invasion and measurement.
- 2. PROG INIT programs the invaded row with a bitstream that initializes all [FFs](#page-362-6) which are used in the counter. It also disables the external clock input for the row and the configuration reset in the row programmer: Usually, the row programmer resets [FF](#page-362-6) contents before programming, to guarantee a defined reset state. The measurement process however requires that [FF](#page-362-6) values are kept between reconfigurations, so the programmer is configured not to issue such reset requests.
- 3. The PROG COUNT state programs the bitstream that realizes the ring oscillator and the counter in the [CLBs.](#page-361-0)
- 4. The START state starts the measurement. For this, it initializes the ProgController time counter to 0 and sets the EXT\_I signal to 1, enabling the ring oscillator. It also asserts the CLK\_S signal to ensure that the [FFs](#page-362-6) in the [CLBs](#page-361-0) are clocked from the [LE](#page-362-4) 0 output, i.e. the ring oscillator.
- 5. The COUNT state counts up the reference counter using the ProgController reference clock. When the count reaches a predefined value, it deasserts the CLK\_S signal to stop clocking of the [CLB](#page-361-0) [FFs.](#page-362-6) This essentially switches back to the external clock, which is still disabled for the row. It ensures the [FFs](#page-362-6) keep their values, which represent the final counter state. The EXT\_I signal is furthermore set to 0 to disable the soft-logic oscillator, saving energy.
- 6. PROG READ programs the bitstream that outputs the registers using the LUT0\_O signal.
- 7. In the READ state, the ProgController reads the values from all [CLBs'](#page-361-0) LUT0\_O outputs in the measured row. This essentially reads one register, i.e. one bit of the counter state, for each [CLB.](#page-361-0) The process needs to be repeated for all 9 registers with a slightly modified bitstream, to select the registers properly. ProgController uses an internal counter for this and returns to the PROG READ state until all [FFs](#page-362-6) have been read out. After reading the last [FF,](#page-362-6) the [FSM](#page-362-7) proceeds to the OUT state.
- 8. In OUT, the ProgController simply provides the measurement values of all [CLBs](#page-361-0) in the row to an external controller. It uses a simple bus consisting of VALID, CLB\_X, CLB\_Y and VALUE signals. VALID is asserted for one cycle for each [CLB.](#page-361-0) If it is asserted, CLB\_X and CLB\_Y provide the position of the [CLB.](#page-361-0) VALUE provides the counter value describing the propagation delay.
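The order of the measurement sub-states, including the repeated PROG READ and READ pair, can be written down as a short generator. This Python sketch is an illustration under assumptions: the function names `measurement_states` and `out_bus_beats` are hypothetical, and the assumption that CLB\_X simply enumerates the CLBs of the measured row from 0 is made here and is not taken from the thesis.

```python
NUM_COUNTER_BITS = 9  # nine counter registers per CLB, as stated in the text


def measurement_states(num_bits=NUM_COUNTER_BITS):
    """Yield the measurement sub-states in the order described in the text."""
    yield "INIT"
    yield "PROG INIT"
    yield "PROG COUNT"
    yield "START"
    yield "COUNT"
    # One PROG READ / READ pair per counter register.
    for _ in range(num_bits):
        yield "PROG READ"
        yield "READ"
    yield "OUT"


def out_bus_beats(row_y, values):
    """Model the OUT bus: one VALID beat per CLB of the measured row.

    Assumption: CLB_X counts the CLBs of the row starting at 0.
    """
    return [
        {"VALID": 1, "CLB_X": x, "CLB_Y": row_y, "VALUE": value}
        for x, value in enumerate(values)
    ]
```

For nine registers, the generator produces 24 steps per measured row, of which 18 belong to the readout loop. This illustrates why the readout accounts for most of the reconfigurations in one measurement.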

**Invasion Bitstreams** Some measurement steps mentioned require programming of the measured [CLBs](#page-361-0) with special bitstreams. In the following, these special bitstreams will be demonstrated for a single [CLB.](#page-361-0) All [CLBs](#page-361-0) in an invaded row are configured with the same bitstream and perform the same operation. Therefore, the bitstream for only a single [CLB](#page-361-0) is stored in memory, and the ProgController duplicates this bitstream for all [CLBs](#page-361-0) in the row.

The first special bitstream initializes the counter registers in the PROG INIT state. Although register initialization could be handled using explicit architecture modifications, this is not necessary. The required functionality can be realized without any additional logic using the existing [CLB](#page-361-0) structure. [Figure 8.9](#page-241-0) shows the [CLB](#page-361-0) configuration used. All [BLEs](#page-361-2) are configured in split mode, providing two [LEs](#page-362-4) and [FFs.](#page-362-6) The multiplexers selecting between [FF](#page-362-6) or [LE](#page-362-4) output are not shown in the figure. Similarly, unused [FFs](#page-362-6) are not shown and connections between [FFs](#page-362-6) and [LEs](#page-362-4) which pass through the crossbar are shown as direct connections. Elements which had to be added to the [CLB](#page-361-0) only for measurement and which are used in this configuration are shown in orange. The blue box shows one example of how the depicted [LEs](#page-362-4) and [FFs](#page-362-6) map to the [BLEs:](#page-361-2) The upper row shows the [LE](#page-362-4) 0 elements of the [CLB](#page-361-0) and no [FFs,](#page-362-6) as the [FF](#page-362-6) 0 elements are not used. The second row shows the [LE](#page-362-4) 1 and [FF](#page-362-6) 1 elements. One [LE](#page-362-4) in the top row and the [LE](#page-362-4) and register below therefore belong to one [BLE.](#page-361-2)

<span id="page-241-0"></span>

**Figure 8.9:** Initialization of registers used to realize the counter in a [CLB.](#page-361-0) Required newly added components and connections are marked in orange. The blue box shows which [LEs](#page-362-4) and [FFs](#page-362-6) are contained in a single [BLE](#page-361-2) [\(figure 4.3b\)](#page-134-0).

Initializing the registers consists of setting the first register to logic value 1, all others to 0. This is required due to a peculiarity of the chosen counter: The [Linear Feedback Shift Register \(LFSR\)](#page-363-5) connects feedback of registers through an XOR operation to the first shift register input. However, if the initial state is all 0, the shift register is stuck in this state. To initialize the [LFSR,](#page-363-5) the [LEs](#page-362-4) connected to the registers are configured to output a constant value, 0 or 1. All [FFs](#page-362-6) then need to be clocked once, to ensure that the input is stored in the [FF.](#page-362-6) As the measurement logic has no control of the user-provided application clock (it might be unpredictably slow or gated off), the [CLB](#page-361-0) is configured to use the [LE](#page-362-4) 0 output as a clock. This [LE](#page-362-4) 0 is configured to pass through its only input, acting as a transparent buffer. The crossbar then connects this input to the newly added EXT\_I input, which is driven by the ProgController. The ProgController can therefore directly drive the clocks and issue a clock edge as needed by driving this input pin. An excerpt of the [FASM](#page-362-8) used to realize this configuration is shown in [listing 8.2;](#page-241-1) the full [FASM](#page-362-8) is available in the appendix, see [listing E.1](#page-381-0) on page [359.](#page-381-0)

```
# Pass through FLE0 LUT0 input 0 from external input
FLE[0]
.CB.I[0]="ext[0].I"
.5BLE[0].FF.ENABLE
.5BLE[0].LUT5.INIT[31:0]=32'b10101010101010101010101010101010

# Configure BLE0 LUT 1 as constant 1
.5BLE[1].FF.ENABLE
.5BLE[1].LUT5.INIT[31:0]=32'b11111111111111111111111111111111
```




The configuration for the second bitstream, used during measurement, is shown in [figure 8.10,](#page-243-0) with the same conventions as in the previous figure. This bitstream realizes a ring oscillator using 11 [LEs,](#page-362-4) 10 in the top row and the rightmost one in the bottom row. It also realizes an [LFSR](#page-363-5) based counter using the first 9 [FFs](#page-362-6) in the second row. The measurement system uses the newly added CLK\_S signal to drive the [FF](#page-362-6) clocks using the [LE](#page-362-4) 0 output. Placing the logic appropriately therefore enables the oscillation signal to be used as a clock for the counter. This setup ensures that the path used for propagation delay measurement passes through all 10 [BLE](#page-361-2) elements, and the additional eleventh [LE](#page-362-4) is used to further reduce the oscillation frequency. When using a classical oscillator consisting of only inverters, the frequency directly represents the propagation delay of one gate. This is however not practical here, as the oscillation frequency is too high to clock the [FFs](#page-362-6) in the counter: As the feedback is also realized using one [LE,](#page-362-4) it will have a propagation delay of half the clock period in this case. To obtain higher margins for timing closure and to reduce the total number of clock periods counted, this setup instead uses a single inverter paired with 10 buffers. The frequency can then be obtained as in the following equations:

$$
t_{\rm PD,OSC} = 11 \cdot t_{\rm PD,LE} \tag{8.5}
$$

$$
T_{\rm OSC} = 2 \cdot t_{\rm PD, OSC} \tag{8.6}
$$

$$
f_{\text{OSC}} = \frac{1}{T_{\text{OSC}}} \tag{8.7}
$$

Instead of using a simple inverter as the first gate, the oscillator uses a *NAND* gate combining the feedback path and the EXT\_I input. This input can then be driven low to stop oscillation, or high to enable the oscillator. Driving the input low for some time also ensures a defined state after programming: As the initial [LE](#page-362-4) output values are unknown after programming, it is not guaranteed that all outputs initially have the same value. The measured oscillation signal could therefore be unpredictable. When driving EXT\_I low until the [LE](#page-362-4) outputs have propagated through the whole chain, all outputs will be predictably 0.

The [LFSR](#page-363-5) used has the characteristic polynomial $x^9 + x^5 + 1$. This LFSR is of maximum length, with a period of $2^n - 1$ [\[252\]](#page-339-1). It therefore counts $C_{\text{max}} = 511$ distinct values before it starts repeating values. The main benefit of this counter is that it requires little logic, only a single *XOR* gate. The drawback of [LFSR](#page-363-5) counters is that decoding the [LFSR](#page-363-5) state into the count number is non-trivial. As the number of distinct states is small in this specific counter, decoding can however be efficiently implemented using a precomputed lookup table.
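The lookup-table decoding can be illustrated with a short software model. The following Python sketch assumes a shift register whose first register receives the XOR of registers 8 and 4 (counted from 0), and a seed in which only the first register is 1. The function names `lfsr_step` and `build_decode_table` and the bit ordering of the state word (register $i$ in bit $i$) are choices made for this sketch, not taken from the thesis.

```python
def lfsr_step(state):
    """Advance the 9-bit LFSR by one clock.

    Bit i of `state` models register i. Register 0 is loaded with
    reg[8] XOR reg[4]; every other register takes its predecessor's value.
    """
    feedback = ((state >> 8) ^ (state >> 4)) & 1
    return ((state << 1) & 0x1FF) | feedback


def build_decode_table(seed=0b000000001):
    """Map each reachable LFSR state to the number of clocks since the seed."""
    table = {}
    state = seed
    count = 0
    while state not in table:
        table[state] = count
        state = lfsr_step(state)
        count += 1
    return table


if __name__ == "__main__":
    table = build_decode_table()
    print(len(table))  # number of distinct states before the sequence repeats
```

The table contains one entry per reachable state, so decoding a value read from a CLB is a single dictionary lookup. The all-zero state never appears in the table and maps to itself under `lfsr_step`, which mirrors the stuck state that the initialization bitstream avoids.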

<span id="page-243-0"></span>

**Figure 8.10:** Ring oscillator and counter realized in a single [CLB](#page-361-0) for propagation delay characterization. Required newly added components and connections are marked in orange.

Considering that the [LE](#page-362-4) delay of the [VPR](#page-364-0) reference architecture is 235 ps, 11 [LEs](#page-362-4) cause an oscillation period of  $T_{\text{OSC}} = 5.17$  ns. When using a 100 MHz reference clock in the ProgController and counting for 100 cycles, the measure time  $T_{\text{meas}}$  is 1000 ns, resulting in a count value of 194 for the nominal frequency. It is also possible to calculate the minimum and maximum [LE](#page-362-4) delay before the counter will overflow or underflow:

$$
T_{\rm OSC,min} = \frac{T_{\rm meas}}{C_{\rm max}}\tag{8.8}
$$

$$
T_{\text{OSC,max}} = \frac{T_{\text{meas}}}{1} \tag{8.9}
$$

$$
t_{\text{PD,LE,min}} = \frac{T_{\text{OSC,min}}}{2 \cdot 11} \tag{8.10}
$$

$$
t_{\text{PD,LE,max}} = \frac{T_{\text{OSC,max}}}{2 \cdot 11} \tag{8.11}
$$

For the values chosen here, this leads to a minimum measurable [LE](#page-362-4) propagation delay of 89 ps. The circuit can therefore be up to 2.6 times faster than the nominal case, before the measurement circuit counter rolls over. Similarly, the calculated maximum propagation delay of 45 ns means the circuit can slow down by a factor of 190. It should be noted that the relative resolution of values becomes lower for lower count values:

$$
\frac{t_{\text{PD,LE,1}}}{t_{\text{PD,LE,2}}} = \frac{T_{\text{OSC,1}}}{T_{\text{OSC,2}}} \tag{8.12}
$$

$$
=\frac{C_2}{C_1}\tag{8.13}
$$

$$
=\frac{C_1+1}{C_1} \tag{8.14}
$$

For smaller counts, the factor is up to 100%, whereas for counts close to the maximum, the factor is 0.2%. At the nominal count of 194, the factor is 0.5%, showing the minimal change relative to the nominal propagation delay that can be measured. Further optimization of these limits can be achieved through variation of the measure time: A longer measure time will yield higher counts and resolution, whereas a shorter measure time enables measuring faster signals without overflow. In addition, when the signal is known to be in a certain range, roll-over events of the counter can be taken into account to enable even longer measure times. Knowing the typical propagation delay of the [LE,](#page-362-4) a relative increase  $k_{\text{clb}}$  in the delay can be calculated from the count value. This relative increase can then directly be compared to the slack factor  $k_{\text{slack}}$  determined in [equation \(8.4\).](#page-226-0) As long as the measured increase factor is less than  $k_{\text{slack}}$ , user application paths are guaranteed to meet timing closure.
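The figures quoted in this part of the section can be reproduced with a few lines of arithmetic. The Python sketch below evaluates equations (8.5) to (8.14) for the stated values (235 ps nominal delay, 11 [LEs](#page-362-4), 1000 ns measure time, $C_{\text{max}} = 511$). The function names are hypothetical, and expressing $k_{\text{clb}}$ as the ratio of nominal count to measured count is an assumption derived from equations (8.12) and (8.13), not a definition given in the text.

```python
T_PD_LE_NOMINAL_PS = 235.0   # nominal LE delay of the VPR reference architecture
NUM_LES = 11                 # LEs in the ring oscillator
T_MEAS_PS = 1_000_000.0      # 100 cycles of a 100 MHz reference clock
C_MAX = 511                  # distinct states of the 9-bit LFSR


def osc_period_ps(t_pd_le_ps):
    """Equations (8.5) and (8.6): one period is two passes through the chain."""
    return 2 * NUM_LES * t_pd_le_ps


def le_delay_from_count_ps(count):
    """Invert the measurement: LE delay implied by a given count value."""
    return T_MEAS_PS / count / (2 * NUM_LES)


def relative_step(count):
    """Equation (8.14) minus one: relative delay change per count step."""
    return (count + 1) / count - 1.0


def k_clb(count, nominal_count):
    """Assumed definition: measured delay relative to nominal delay."""
    return nominal_count / count


if __name__ == "__main__":
    print(osc_period_ps(T_PD_LE_NOMINAL_PS) / 1000.0)      # period in ns
    print(T_MEAS_PS / osc_period_ps(T_PD_LE_NOMINAL_PS))   # periods per window
    print(le_delay_from_count_ps(C_MAX))                   # minimum delay in ps
    print(le_delay_from_count_ps(1) / 1000.0)              # maximum delay in ns
    print(relative_step(194) * 100.0)                      # resolution in percent
```

The measure window holds roughly 193.4 nominal oscillation periods. Whether the counter then reads 193 or 194 depends on where the first clock edge falls within the window, a detail this sketch does not model.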

[Listing 8.3](#page-244-0) shows an excerpt of the [FASM](#page-362-8) used to implement the counter and oscillator. The full [FASM](#page-362-8) is available in the appendix, see [listing E.2](#page-383-0) on page [361.](#page-383-0)

```
# Invert fle[9].O[1] if ext[0].I is 1
FLE[0]
.CB.I[0]="fle[9].O[1]"
.CB.I[3]="ext[0].I"
.5BLE[0].FF.BYPASS
.5BLE[0].LUT5.INIT[31:0]=32'b01010101000000000101010100000000
# x9 + x5 + 1. Use I1 and I2 for an XOR
.CB.I[1]="fle[8].O[1]"
.CB.I[2]="fle[4].O[1]"
.5BLE[1].FF.ENABLE
.5BLE[1].LUT5.INIT[31:0]=32'b00111100001111000011110000111100

# LUTs as pass-through, oscillator ones without registers
FLE[1]
.CB.I[0]="fle[0].O[0]"
.CB.I[1]="fle[0].O[1]"
.5BLE[0].FF.BYPASS
.5BLE[0].LUT5.INIT[31:0]=32'b10101010101010101010101010101010
.5BLE[1].FF.ENABLE
.5BLE[1].LUT5.INIT[31:0]=32'b11001100110011001100110011001100
```

**Listing 8.3:** [FASM](#page-362-8) excerpt representing the counter and oscillator for measurement. A full version can be found in [listing E.2](#page-383-0) on page [361.](#page-383-0)
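To check what the `INIT` strings in the listing compute, they can be evaluated as truth tables. The following Python sketch assumes that bit $i$ of `INIT[31:0]` holds the LUT output for the input word $i$, with `I[0]` as the least significant bit. This ordering is an assumption, although it agrees with the comments in the listing (pass-through of input 0, XOR of inputs 1 and 2). The function name `lut_output` is hypothetical.

```python
def lut_output(init_bits, inputs):
    """Evaluate a 5-input LUT.

    init_bits: 32-character string as written after 32'b (bit 31 first).
    inputs:    tuple (I0, I1, I2, I3, I4) of 0/1 values.
    Assumption: INIT bit i is the output for input word i, I0 being the LSB.
    """
    index = sum(bit << pos for pos, bit in enumerate(inputs))
    return int(init_bits[31 - index])


OSC_GATE = "01010101000000000101010100000000"   # FLE[0], LUT 0
XOR_GATE = "00111100001111000011110000111100"   # FLE[0], LUT 1
PASS_I0 = "10101010101010101010101010101010"    # FLE[1], LUT 0
PASS_I1 = "11001100110011001100110011001100"    # FLE[1], LUT 1

if __name__ == "__main__":
    for ext in (0, 1):
        for feedback in (0, 1):
            # I0 carries the feedback, I3 the external enable input.
            print(ext, feedback, lut_output(OSC_GATE, (feedback, 0, 0, ext, 0)))
```

Under this assumed bit ordering, the first LUT evaluates to the inverted feedback while the external input is 1 and to 0 otherwise, which is consistent with the comment on the first line of the listing.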

The third special bitstream is used in the PROG READ state to receive the count value stored in the [CLB](#page-361-0) registers. For this, all clocks applied to the registers are stopped after counting, so the values are stable. The readout then utilizes only the existing interconnect and the single, newly added LUT0\_O output to read all 9 bits. [Figure 8.11](#page-245-0) shows the general idea for this: [LE](#page-362-4) 0 is programmed in buffer mode, simply passing its input to its output. This output is connected to LUT0\_O and can be read externally. The ProgController now modifies this original bitstream to obtain nine slightly different versions: The crossbar configuration is modified to pass other signals to the [LE](#page-362-4) 0 input. This way, there is one modified bitstream for each of the nine [FFs,](#page-362-6) routing its value to this [LE](#page-362-4) and ultimately to the output. By programming all nine modified bitstreams one after the other, the ProgController can read all bits.
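Since each readout pass delivers one bit per [CLB](#page-361-0), the nine passes have to be combined into one 9 bit word per [CLB](#page-361-0). The Python sketch below shows this bookkeeping under the assumption that pass $k$ returns register $k$ and that register $k$ is stored in bit $k$ of the resulting word. The function name `assemble_counts` and the data layout are hypothetical.

```python
def assemble_counts(read_passes):
    """Combine per-pass readout bits into one counter state per CLB.

    read_passes[k][x] is the bit observed on the LUT0_O output of CLB x
    while register k is routed to LE 0 (assumed ordering of the passes).
    """
    num_clbs = len(read_passes[0])
    states = [0] * num_clbs
    for reg_index, row_bits in enumerate(read_passes):
        for x, bit in enumerate(row_bits):
            states[x] |= (bit & 1) << reg_index
    return states


if __name__ == "__main__":
    # Two CLBs holding the example states 0b000000101 and 0b100000000.
    example = [[(0b000000101 >> k) & 1, (0b100000000 >> k) & 1] for k in range(9)]
    print(assemble_counts(example))
```

The resulting words are raw [LFSR](#page-363-5) states and still have to be decoded into count values, for example with a precomputed lookup table.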

<span id="page-245-0"></span>

**Figure 8.11:** Configuration of the [CLB](#page-361-0) used to output counter register 0 to the ProgController. Required newly added components and connections are marked in orange. To read other registers, the [CLB](#page-361-0) crossbar is reprogrammed to connect these registers' outputs to the [LUT](#page-363-2) 0 input.

[Listing 8.4](#page-245-1) shows an excerpt of the [FASM](#page-362-8) used to implement the readout logic. The full [FASM](#page-362-8) is available in the appendix, see [listing E.3](#page-385-0) on page [363.](#page-385-0)

```
FLE[0]
# This selects the register routed to the first LUT
.CB.I[0]="fle[0].O[0]"

# LUT0 just forwards its input 0
.5BLE[0].LUT5.INIT[31:0]=32'b10101010101010101010101010101010
.5BLE[0].FF.ENABLE
.5BLE[1].FF.ENABLE

# Enable all FFs
FLE[1]
.5BLE[0].FF.ENABLE
.5BLE[1].FF.ENABLE
```

**Listing 8.4:** [FASM](#page-362-8) excerpt representing the [FF](#page-362-6) readout configuration for measurement. A full version can be found in [listing E.3](#page-385-0) on page [363.](#page-385-0)

### **8.4 Power Management Controller**

With the performance requirement $k_{\text{slack}}$ for a region, and the current measured performance $k_{\text{clb}}$ of each [CLB,](#page-361-0) the power management controller can now be derived in this section. Referring back to the concept in [figure 4.6](#page-140-0) on page [118,](#page-140-0) the region controller is reprinted here in [figure 8.12.](#page-246-0)

<span id="page-246-0"></span>

**Figure 8.12:** Region detail and region controller concept. The figures show a zoomed in detail of the central region in [figure 4.4.](#page-137-0) **[\(a\)](#page-246-0)** Region controller in idle mode. **[\(b\)](#page-246-0)** Row was reconfigured dynamically with measurement circuits. **[\(c\)](#page-246-0)** Delay characterization is active.

The $k_{\text{clb}}$ factor is measured for each [CLB,](#page-361-0) but the $k_{\text{slack}}$ factor is only determined once for a whole power region. This factor could easily be calculated for each [CLB,](#page-361-0) but the performance and voltage adjustment techniques used operate on whole regions. In addition, as the $k_{\text{slack}}$ values need to be stored as part of the bitstream, a single factor per region reduces the configuration storage requirements. The configuration for the $k_{\text{slack}}$ factors themselves is inserted in the interconnect configuration chain. As these factors are constant and do not need to change during logic invasion, they can be stored as part of the static configuration.

Unlike shown in the simplified concept in [figure 8.12,](#page-246-0) the  $k_{\text{clb}}$  measurement is not performed locally by the region controller. It is rather orchestrated globally for the whole [FPGA](#page-362-0) in the ProgController's logic invasion of [section 8.2.](#page-228-0) The measured values therefore have to be passed from this central controller to the individual regions' controllers. [Figure 8.13](#page-247-0) shows this final [FPGA](#page-362-0) power compensation scheme, with the global and the per-region controllers added, including control connections.

<span id="page-247-0"></span>

**Figure 8.13:** The [PARFAIT](#page-363-0) [FPGA](#page-362-0) architecture with the shade of red in each region representing locally adjusted performance using [PVTA](#page-364-1) compensation. Also shown is how region controllers connect to the ProgController which orchestrates logic invasion and measurement.

All region controllers connect to the same measurement output port of the central controller. They then assess the X and Y signals of that port to determine whether a measured value belongs to the current region. The controllers detect when the factor for the last [CLB](#page-361-0) in the region's last row has been received and then perform one update cycle. As controllers only consider the slowest [CLB,](#page-361-0) they only keep the largest $k_{\text{clb}}$ factor for the region. Each controller then calculates a single new control value using a control algorithm and adjusts the voltage in its region. The next adjustment will then be performed when a new measurement is available.
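The filtering performed by each region controller can be sketched as follows. This Python model is an illustration under assumptions: the class name `RegionTracker`, the rectangular region given by two `range` objects, and the convention that the last [CLB](#page-361-0) of a region is the one with the highest X in the highest Y row are all hypothetical and not taken from the implementation.

```python
class RegionTracker:
    """Keeps the largest k_clb seen for one rectangular power region."""

    def __init__(self, x_range, y_range):
        self.x_range = x_range
        self.y_range = y_range
        # Assumption: measurements arrive row by row and the last CLB of
        # the region is the one with the highest X in the highest Y row.
        self.last_clb = (x_range[-1], y_range[-1])
        self.worst = None

    def on_measurement(self, clb_x, clb_y, k_clb):
        """Process one VALID beat; return the worst factor when complete."""
        if clb_x not in self.x_range or clb_y not in self.y_range:
            return None  # value belongs to another region
        if self.worst is None or k_clb > self.worst:
            self.worst = k_clb
        if (clb_x, clb_y) == self.last_clb:
            result, self.worst = self.worst, None
            return result  # triggers one controller update cycle
        return None


if __name__ == "__main__":
    tracker = RegionTracker(range(0, 2), range(0, 2))
    for beat in [(0, 0, 1.01), (5, 0, 2.0), (1, 0, 1.05), (0, 1, 1.02), (1, 1, 1.03)]:
        print(beat, tracker.on_measurement(*beat))
```

Only the final beat of a region returns a value, so the control algorithm runs once per completed measurement of that region.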

Because of the slow invasion process, the controller operates in closed-loop configuration with a slow update rate. As the [PVTA](#page-364-1) changes observed are generally slow effects, this is not an issue. For the initial characterization, which compensates primarily the process variation, this might seem counterintuitive: However, during the [EDA](#page-362-5) flow, the application timing analysis was performed for nominal delay. Therefore, there should not be any timing violations, as long as the [CLBs](#page-361-0) do not have higher delays than the nominal values. Voltage changes in this initial case will in turn only reduce the voltage and performance, reducing energy consumption in the device. Adjustments to actually increase performance are only required once aging slows down the circuit further.

[Listing 8.5](#page-248-0) shows the [VHDL](#page-364-3) code used to implement the control algorithm within the region controller. For the evaluations in this thesis, a basic proportional controller is used and investigation of more elaborate control algorithms is left for future work. In addition, the current implementation is only meant to be used during simulation and therefore was not designed to be synthesizable.

```
architecture sim of RegionPCTRL is
    signal bg_voltage_buf: real := 1.0;
    constant MAX_DELTA: real := 0.001;
begin
    bg_voltage <= bg_voltage_buf;

    impl: process(clk)
        variable bg_new: real;
    begin
        if rising_edge(clk) then
            if target_delay_factor > 10.0 then
                bg_voltage_buf <= VAL_MIN;
            else
                -- Proportional step on the delay factor error
                bg_new := bg_voltage_buf - VAL_P * (
                    target_delay_factor - current_delay_factor);
                -- Clamp the result to the allowed control voltage range
                if bg_new < VAL_MIN then
                    bg_new := VAL_MIN;
                elsif bg_new > VAL_MAX then
                    bg_new := VAL_MAX;
                end if;
                bg_voltage_buf <= bg_new;
            end if;
        end if;
    end process;
end;
```

**Listing 8.5:** VHDL excerpt showing a simple proportional controller used to calculate the control voltage from $k_{\text{slack}}$ and $k_{\text{clb}}$ in each region. `MAX_DELTA`, `VAL_MIN` and `P` are generic parameters.

### **8.5 Power-Aware FPGA Architecture**

This section briefly summarizes how the previously discussed individual aspects are combined in the [PARFAIT](#page-363-0) [FPGA.](#page-362-0) An overview of the final architecture has been shown in [figure 8.13](#page-247-0) on page [225.](#page-247-0) Implementation of non-reconfigurable logic uses the ambipolar standard cells introduced in [chapter 5.](#page-183-0) Apart from implementing standard logic, this provides an additional [BG](#page-361-4) allowing for $V_{th}$ voltage scaling.

For the [LE](#page-362-4) in the [PARFAIT](#page-363-0) [FPGA,](#page-362-0) reconfigurable ambipolar cells as introduced in [chapter 6](#page-199-0) are used. However, for the final evaluation, the [RFET](#page-364-2) technology characterized in [\[42\]](#page-319-0) was used. This technology was developed by [PARFAIT](#page-363-0) project partner NaMLab and was characterized in the ways necessary for the [PARFAIT](#page-363-0) [FPGA.](#page-362-0) The original characterization in [\[42\]](#page-319-0) and the temperature characterization in [\[2\]](#page-315-0) enabled the derivation of the [RFET](#page-364-2) delay model in [section 4.6](#page-166-0) on page [144.](#page-166-0) During the writing of this thesis, a novel [LE](#page-362-4) based on this technology was published in [\[246\]](#page-338-1) as an early access publication. This allows the [PARFAIT](#page-363-0) [FPGA](#page-362-0) to actually make use of a [ULM](#page-364-4) realized in exactly the technology used for the delay model characterization. For more consistent results, the final [PARFAIT](#page-363-0) [FPGA](#page-362-0) therefore uses this cell instead of the *CNT-DR8F* cell introduced in [section 6.1](#page-199-1) on page [177.](#page-199-1) As the sets of supported functions largely overlap for both cells, the approach presented in [section 6.1](#page-199-1) does not change. [Figure 8.14](#page-250-0) shows the [LE](#page-362-4) based on the *RGATE*. Unlike the *CNT-DR8F* based [LE](#page-362-4) in [chapter 6,](#page-199-0) this cell has not yet been optimized to match the [LUT](#page-363-2) expressiveness, as will be explained in the results chapter. For synthesis, the [PARFAIT](#page-363-0) architecture only uses the combined genlib synthesis approach introduced in [section 6.2.](#page-202-0)

Power management regions are used in the [PARFAIT](#page-363-0) [FPGA](#page-362-0) as introduced in [chapter 7.](#page-217-1) The [PARFAIT](#page-363-0) architecture exclusively uses dynamic power region assignment: As all regions are characterized repeatedly using the invasion scheme, all regions also enable threshold voltage adjustment. The benefits of higher power reduction in dynamic assignment are therefore achievable with little resource overhead, which makes it the preferred solution.

<span id="page-250-0"></span>

**Figure 8.14:** Final [FLE](#page-362-9) used for the [PARFAIT](#page-363-0) [FPGA,](#page-362-0) using the *RGATE* as published in [\[246\]](#page-338-1) based on the technology of [\[2\]](#page-315-0).

For logic invasion itself, the scheme introduced in [section 8.2](#page-228-0) works identically for ambipolar cells as well as for [LUTs.](#page-363-2) For measurement in [section 8.3,](#page-236-0) some slight adjustments are necessary: Whereas the approach works just as before, the [LUT](#page-363-2) values in the bitstream have to be replaced with configuration for the new [LEs.](#page-362-4) The measurement bitstreams used place some requirements on the [LE](#page-362-4) feature set: For register initialization, the [LEs](#page-362-4) must be able to output constant 0 or 1 values on both outputs. This can be achieved by routing constant values to [LE](#page-362-4) inputs in the crossbar and passing them through or inverting them in the [LE.](#page-362-4) In addition, output 0 must be able to pass through an input to enable clocking of the registers, which is supported as well.

For the measurement bitstream, output 0 of a [BLE](#page-361-2) must be able to realize a *NAND* function. The minimal [LE](#page-362-4) based on *RGATE*s from [\[246\]](#page-338-1) does not support this directly: Although the first *RGATE* supports the *NAND*, the second stage cannot be configured in pass-through mode and will always invert. As a solution, output 0 of the second [BLE](#page-361-2) in the oscillator chain can be configured as an inverter instead of a buffer. This is possible, as the first stage *RGATE* does support pass-through mode. The counter only requires pass-through mode on the [BLE](#page-361-2) 1 output, which is also supported. For [FF](#page-362-6) readout, again only a pass-through for output 0 of [LE](#page-362-4) 0 is needed. Therefore, both the *CNT-DR8F* [ULM](#page-364-4) and the *RGATE* based [ULM](#page-364-4) support all measurement operations.

## **Chapter 9**

# **System Simulation and Evaluation Methodology**

Previous chapters have introduced the [PVTA](#page-364-0) compensation system, the [PARFAIT](#page-363-0) architecture and the delay models. This chapter will describe how those parts are combined for the final evaluation. Two approaches will be covered: First, a hardware evaluation based on the [VFPGA](#page-364-1) system will be described. This approach can only be used for prototyping of the digital aspects, so [PVTA](#page-364-0) changes cannot be assessed with such a system. For faster validation, a functional simulation of the [PARFAIT](#page-363-0) system is introduced. To evaluate [PVTA](#page-364-0) aspects, this simulation will be coupled with the technology models [\(section 4.6\)](#page-155-0) and the scenario models [\(section 4.7\)](#page-174-0). The introduced co-simulation system will also integrate [VPR](#page-364-2) to enable power analysis.

## **9.1 Virtual FPGA Evaluation**

Evaluation of custom [FPGA](#page-362-0) architectures on commercial [FPGAs](#page-362-0) can be achieved using [VFPGAs](#page-364-1): [VFPGAs](#page-364-1) are a small layer on top of commercial [FPGAs](#page-362-0) implementing custom [FPGAs.](#page-362-0) The [PARFAIT](#page-363-0) implementation presented in [chapter 4](#page-129-0) on page [107](#page-129-0) can essentially be used as a [VFPGA.](#page-364-1) When mapping to commercial [FPGAs,](#page-362-0) there is however one issue: [FPGAs](#page-362-0) need to ensure uniform delays for all their logic blocks. This then enables [VPR](#page-364-2) architectures to describe the propagation delay characteristics for basic blocks, and to calculate the final delays. Such uniform delays can however not be guaranteed when mapping an [FPGA](#page-362-0) to another [FPGA.](#page-362-0) Fine-grain timing constraints can ensure that a guaranteed maximum delay is never exceeded, but this is not very practical: Ideally the whole [FPGA](#page-362-0) should operate at the maximum possible performance, not at an arbitrary user-derived frequency.

To solve this issue, manual placement was investigated in [\[Pfa21\]](#page-344-0) and [\[Pfa22\]](#page-344-1). The primary idea here is that the custom [FPGA](#page-362-0) architecture is uniform, as is the commercial [FPGA](#page-362-0) it is mapped to. Using scripts, it is therefore possible to manually fix the placement, instead of leaving placement up to vendor tools. A detailed evaluation of this approach can be found in the original publication, but the following paragraphs will quickly reprint and summarize these publications.

#### **VFPGA Architecture**

[Figure 9.1](#page-254-0) shows the [VFPGA](#page-364-1) architecture and the arrangement into the common types (1-9) of tiles. In its simplest configuration, the [VFPGA](#page-364-1) consists of 9 different tile types, which are distinguished by orientation and contained elements. A single tile can contain all elements (like type 2 and 4), all but the [IOB](#page-362-1) (type 9, i.e. the central tiles), [PSM](#page-363-1) and [IOB](#page-362-1) (type 1, 5, 6, 8), [CLB,](#page-361-0) [PSM](#page-363-1) and two [IOBs](#page-362-1) (type 3), or only the [PSM](#page-363-1) (type 1). Even for tiles which have the same types of elements, their differences in orientation — and therefore layout — will require them to be placed differently.

<span id="page-254-0"></span>

**Figure 9.1:** [VFPGA](#page-364-1) architecture details. **[\(a\)](#page-254-0)** Tile distribution and top-level architecture. **[\(b\)](#page-254-0)** [CBR](#page-361-1) and [CBW](#page-361-2) implementation and connection to routing channels.

The [Configuration Units \(CUs\)](#page-361-3) are not shown in [figure 9.1a,](#page-254-0) as they are an implementation detail of the [VFPGA.](#page-364-1) They store and provide the configuration for the [CLB,](#page-361-0) [PSM](#page-363-1) and [IOB](#page-362-1) in their respective tile and enable dynamic reconfiguration of the [VFPGA.](#page-364-1) In the used [VFPGA](#page-364-1) implementation, the [CU](#page-361-3) is implemented essentially as a shift register with parallel output. It will have to be mapped to the host [FPGA](#page-362-0) in addition to the other components.

[Figure 9.1a](#page-254-0) also shows the wires corresponding to relevant delays for the final [VPR](#page-364-2) architecture model: Apart from intra-block delays such as delays within the [CLB,](#page-361-0) these consist of nets in the global routing channels. [Figure 9.1b](#page-254-0) shows how the [CLB](#page-361-0) in a tile connects to the global routing channels. Connection points are realized using [Read Connection Boxes \(CBRs\)](#page-361-1) and [Write Connection Boxes \(CBWs\).](#page-361-2) Those consist of multiplexers, which either connect multiple wires to one [CLB](#page-361-0) input, or the [CLB](#page-361-0) output to one of the channel wires.

To realize this connection, the [CBW](#page-361-2) consists of one multiplexer for each wire in the channel. The multiplexer either forwards the signal of the channel wire or writes the CLB output to this wire. Structurally, these multiplexers will be mapped to host [FPGA](#page-362-0) [LUTs.](#page-363-2) Therefore, on the host [FPGA,](#page-362-0) a [VFPGA](#page-364-1) wire from [PSM](#page-363-1) to [PSM](#page-363-1) will actually consist of at least two host wire segments and the [LUT.](#page-363-2) Similar effects are also caused by [IOB](#page-362-1) connections.

Another peculiarity of the original [VFPGA](#page-364-1) implementation concerns the implementation of its bidirectional wiring: Logically, the [VFPGA](#page-364-1) architecture uses bidirectional wiring, but the implementation on the host [FPGA](#page-362-0) can only make use of unidirectional wiring. [Figure 9.1b](#page-254-0) shows how the [CBR](#page-361-1) and [CBW](#page-361-2) connect to different host wires, leading in different directions. In order to drive a logical [VFPGA](#page-364-1) wire in both directions, the [PSMs](#page-363-1) will loop back the right-to-left signal in left-to-right direction. Due to that, in the final [PARFAIT](#page-363-0) architecture, only unidirectional wiring is used.

All these effects cause two implications for this work: First, when constraining the design, the [VFPGA](#page-364-1) wire can not be constrained as one unit. Instead, all wire segments on the host [FPGA](#page-362-0) have to be constrained individually. Furthermore, when modelling the [VFPGA](#page-364-1) architecture in [VPR,](#page-364-2) the delay for the complete net is needed. For this reason, the data extraction script extracts the individual segments. But for architecture modeling and for assessment of placement results in this publication, these have to be summed.

#### **Uniformity Metrics**

Uniformity is a measure of local delay variation across the [VFPGA](#page-364-1) structure. If every tile is placed in the same way, the [VFPGA](#page-364-1) is a uniform structure, which in theory could be placed uniformly on the host [FPGA.](#page-362-0) As explained previously, however, not all tiles have exactly the same internal structure, and nonuniformity of the host [FPGA](#page-362-0) architecture will further degrade the uniformity. To address this, the introduced definition of uniformity divides the [VFPGA](#page-364-1) into $N_C$ sets, where each set represents one column $c$. Nets are grouped into classes, so that similar nets in different tiles are within a single class. The following classes have been introduced:

- 1. **[PSM](#page-363-1) Left:** Horizontal nets, starting at [PSM](#page-363-1) left output multiplexers and ending at [PSM](#page-363-1) right input multiplexers.
- 2. **[PSM](#page-363-1) Right:** Horizontal nets, starting at [PSM](#page-363-1) right output multiplexers and ending at [PSM](#page-363-1) left input multiplexers.
- 3. **[PSM](#page-363-1) Top:** Vertical nets, starting at [PSM](#page-363-1) top output multiplexers and ending at [PSM](#page-363-1) bottom input multiplexers.
- 4. **[PSM](#page-363-1) Bottom:** Vertical nets, starting at [PSM](#page-363-1) bottom output multiplexers and ending at [PSM](#page-363-1) top input multiplexers.
- 5. **[PSM](#page-363-1) Internal:** Internal nets within the [PSM,](#page-363-1) realizing the Wilton switch pattern.
- 6. **[CLB](#page-361-0) Input:** Nets starting at the output of the [CBR](#page-361-1) and ending at the input of the [LUT.](#page-363-2)
- 7. **[CLB](#page-361-0) Output:** Nets starting at the [LUT](#page-363-2) output and ending at the input of the [CBW.](#page-361-2)

The definition of uniformity then essentially measures differences between rows within a set. This definition is then formalized in the following equations:

$$
\mu_{c,n} = \frac{1}{N_R} \sum_{r=1}^{N_R} t_{c,r,n} \tag{9.1}
$$

$$
\sigma_{c,n}^2 = \frac{1}{N_R} \sum_{r=1}^{N_R} (t_{c,r,n} - \mu_{c,n})^2 \tag{9.2}
$$

<span id="page-256-0"></span>
$$
\overline{\sigma} = \frac{1}{N_C N_N} \sum_{c=1}^{N_C} \sum_{n=1}^{N_N} \sqrt{\sigma_{c,n}^2} \tag{9.3}
$$

$$
\overline{c_v} = \frac{1}{N_C N_N} \sum_{c=1}^{N_C} \sum_{n=1}^{N_N} \frac{\sqrt{\sigma_{c,n}^2}}{\mu_{c,n}} \tag{9.4}
$$

Here, $t_{c,r,n}$ is the delay of a net in class $n$, column $c$ and row $r$. [Equation \(9.1\)](#page-256-0) provides the arithmetic mean $\mu_{c,n}$ of the delays, calculated over the $N_R$ [VFPGA](#page-364-1) rows. $\sigma_{c,n}^2$ is then the variance for a net class in a certain column over the rows. This is further used in $\overline{\sigma}$ to calculate the arithmetic mean of the standard deviations of all $N_N$ net classes in all columns. $\overline{c_v}$ provides the arithmetic mean over the coefficient of variation of all net classes in all columns. Whereas the standard deviation is an absolute value and therefore depends on the mean of the delays, the coefficient of variation provides a relative measurement. As the delays in the host [FPGA](#page-362-0) are largely discrete (e.g. fixed delays in [LUTs\)](#page-363-2), it is expected that relative delays cannot be reduced further at some point. Because of this, $\overline{\sigma}$ is used to guide the design of the placement strategies and for evaluation of practically achievable uniformity. $\overline{c_v}$ is used to judge the quality of results for the [VFPGA:](#page-364-1) As a smaller delay $\tau$ allows more logic elements to be placed in a path at the same frequency for [VFPGA](#page-364-1) applications, a constant standard deviation leads to reduced certainty about the number of [VFPGA](#page-364-1) logic elements in the path. A constant relative value $\overline{c_v}$ signifies unchanged conditions for the [VFPGA](#page-364-1) application synthesis.
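As a minimal sketch, equations (9.1) to (9.4) can be evaluated as follows. The nested-list data layout `delays[c][n][r]` and the function name are assumptions made for this illustration, and every column is assumed to contain the same number of net classes.

```python
import math


def uniformity(delays):
    """Compute (sigma_bar, cv_bar) from net delays.

    delays[c][n] is the list of delays t_{c,r,n} over all rows r
    for net class n in column c (layout assumed for this sketch).
    """
    sigmas = []
    cvs = []
    for column in delays:
        for row_delays in column:
            n_r = len(row_delays)
            mu = sum(row_delays) / n_r                          # eq. (9.1)
            var = sum((t - mu) ** 2 for t in row_delays) / n_r  # eq. (9.2)
            sigma = math.sqrt(var)
            sigmas.append(sigma)
            cvs.append(sigma / mu)
    count = len(sigmas)  # equals N_C * N_N for a regular layout
    sigma_bar = sum(sigmas) / count                             # eq. (9.3)
    cv_bar = sum(cvs) / count                                   # eq. (9.4)
    return sigma_bar, cv_bar
```

Scaling all delays by a common factor changes `sigma_bar` by the same factor but leaves `cv_bar` unchanged, which is the property that makes the coefficient of variation the relative measure discussed above.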

#### **Building Blocks**

After a [VFPGA](#page-364-1) design has been synthesized, the custom placement strategies are applied. The strategies will be described in detail in the next section, but all of them are based on the following constraints:

**Timing Constraints** As *Vivado* analyzes every possible path in the design, it will also consider configurations of [PSM](#page-363-1) multiplexers that can create combinational loops. It is therefore not easily possible to constrain the timing of the design by simple definition of the final clock period, as *Vivado* will break the loops at arbitrary points. This generates long paths through different numbers of [CLBs](#page-361-0) and [PSMs,](#page-363-1) making it further impossible to constrain a path just between two specific [PSMs.](#page-363-1) To solve this problem, these paths are broken manually.

Two variants of constraints are used: In the variant with fine-grain constraints, all individual atomic nets have their delay constrained using the set\_max\_delay timing exception, ensuring that the design still meets timing constraints and forcing the timing driven optimization to operate. These constraints will lead to path segmentation, which is the desired outcome. In addition, false path constraints on the original long paths are added automatically. Path segmentation can affect logic placement and timing results, so special care needs to be taken when examining the *Vivado* timing reports. Because of this, custom scripts are used to evaluate the delays of relevant nets. As was shown in [\[Pfa22\]](#page-344-1), these fine-grain constraints are necessary to force *Vivado* to optimize the routing for the manually placed design. The drawback of this approach is limited scalability, as large designs which introduce many of these constraints cause excessive memory usage and runtime in the *Vivado* toolflow. Due to this, placement strategies without the fine-grain constraints were evaluated as well.

**Placement Constraints** Placement constraints are used to perform floorplanning through definition of pin placement and absolute, or relative, placement of cells. They guide and control where the place-and-route tools may put [FPGA](#page-362-0) design elements. *Vivado* supports various placement constraints, ranging from just constraining a group of logic to a certain area to exact placement of single cells on a certain logic element. The following placement constraints were used in this work:

- 1. **LUTNM and HLUTNM:** Used to place two combinational functions into the same LUT.
- 2. **PROHIBIT:** When the only requirement is to avoid placing any logic at a specific site, this is achieved using this constraint.
- 3. **LOC and BEL:** To place a logical element in a specific location, the *place\_cell* command is used. This command translates into LOC and BEL constraints, where LOC links the element from the netlist to a slice and BEL places it to a specific LUT or flip-flop within the slice.
- 4. **PBlock:** A PBlock is a collection of cells in one or more rectangular regions that specify the device resources contained by the block. It is more restrictive than no placement constraints, but less constraining than LOC and BEL.

#### **Placement Strategies**

The following paragraphs introduce the manual placement strategies in detail. The uniformity metric was used to guide development of the strategies. Critical path delay and [LUT](#page-363-2) overhead will not be evaluated here, but were assessed in [\[Pfa22\]](#page-344-1).

<span id="page-259-0"></span>

**Figure 9.2:** [VFPGA](#page-364-1) placed using standard *Vivado* placement. Host-[FPGA](#page-362-0) [CLBs](#page-361-0) belonging to the same [VFPGA](#page-364-1) tile are shown in the same color. As can be seen, some tiles are compact, whereas some are scattered across a wider area. It can also be seen that *Vivado* packs tightly and does not keep empty sites to preserve the overall structure.

**Standard Vivado Placement** [Figure 9.2](#page-259-0) shows the placement result of the default *Vivado* strategy (*Vivado* Synthesis Defaults, *Vivado* Implementation Defaults, *Vivado* 2019.1.1). The figure illustrates the arguments given previously in the motivation of this work: Automated placement does not make explicit use of the structural regularity of the [VFPGA,](#page-364-1) which results in tiles being implemented in slightly different ways. Some are more distributed, others more localized, leading to varying net delays and reducing uniformity of the [VFPGA](#page-364-1) architecture. This effect is even more apparent in larger designs, where placement algorithms have to deal with an overall larger number of nets and cells.

**Basic PBlock Strategy** In the basic PBlock strategy, each tile is contained in a single Partition Block (PBlock): After a block with suitable size is created in quadratic or rectangular form, the add\_cells\_to\_pblock TCL command is used to add all cells of a tile to the block. When the tile size has been determined, the host [FPGA](#page-362-0) location and target area are fixed. Finally, all [VFPGA](#page-364-1) cells are fixed to the PBlocks belonging to their tile using the PBlock constraints, completing the basic PBlock placement.
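The steps of the basic PBlock strategy can be sketched as a small generator that emits Tcl commands for one tile. This is an illustration only, not the placement script used in this work: the function name, the PBlock naming scheme, the cell name pattern and the site range are hypothetical, whereas `create_pblock`, `add_cells_to_pblock`, `resize_pblock`, `get_pblocks` and `get_cells` are standard *Vivado* Tcl commands.

```python
def basic_pblock_tcl(tile_name, cell_pattern, site_range):
    """Emit Tcl lines that constrain all cells of one tile to one PBlock.

    tile_name:    hypothetical tile identifier, e.g. "tile_0_0"
    cell_pattern: assumed hierarchical name pattern of the tile's cells
    site_range:   host FPGA site range, e.g. "SLICE_X0Y0:SLICE_X7Y7"
    """
    pblock = f"pblock_{tile_name}"
    return [
        # Create the (initially empty) PBlock for this tile.
        f"create_pblock {pblock}",
        # Assign every cell of the tile to the PBlock.
        f"add_cells_to_pblock [get_pblocks {pblock}] "
        f'[get_cells -hierarchical -filter {{NAME =~ "{cell_pattern}"}}]',
        # Fix the location and target area on the host FPGA.
        f"resize_pblock [get_pblocks {pblock}] -add {{{site_range}}}",
    ]
```

Calling the function once per tile with a regular grid of site ranges would yield the uniform floorplan that the strategy aims for. How tile size and position are derived is not covered by this sketch.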

**Nested PBlock Strategy** In addition to the PBlocks used in the first strategy, this strategy introduces up to two additional PBlocks within each tile. Logic belonging to the [VFPGA](#page-364-1) [CLBs](#page-361-0) and [IOBs](#page-362-1) is mapped to these nested PBlocks accordingly: When defining the PBlocks, all assigned logic cells are forced into the blocks, but this does not prevent placing any additional unassigned cells into them. Based on this idea, two more variants are introduced in addition to the rectangular vs. quadratic layout distinction: In the partially nested strategy, the outer PBlock is used for the tile and nested blocks are used for [IOB](#page-362-1) and [CLB,](#page-361-0) but the [PSM](#page-363-1) is only constrained by the outer PBlock. This gives *Vivado* the freedom to place the [PSM](#page-363-1) in the remaining outer PBlock area, or place part of it inside the nested PBlocks. In the fully nested strategy, *Vivado* is forced to not place any [PSM](#page-363-1) logic in the nested PBlocks, prohibiting usage of remaining logic cells in them. [Figure 9.3](#page-260-0) demonstrates the concept for a 5x5 [CLB](#page-361-0) [VFPGA.](#page-364-1)

<span id="page-260-0"></span>

**Figure 9.3:** 5x5 [CLB](#page-361-0) [VFPGA](#page-364-1) floorplan with nested PBlocks. The nested [CLB](#page-361-0) PBlock is divided into two pieces to ensure the minimum possible area is used. The top right corner tile has an extra nested PBlock for its second [IOB](#page-362-1) unit. No internal PBlock was used at all in the bottom left corner tile, as it only contains a [PSM.](#page-363-1)

The placement script is extended with the following steps to create the nested PBlocks:

- 1. The internal PBlocks can consist of multiple rectangles. The [CLB](#page-361-0) PBlock is placed in the bottom left corner with a height of at most the height of the tile minus one. This leaves some freedom for the [IOB](#page-362-1) PBlock and ensures distribution of the [PSM](#page-363-1) unit over the tile PBlock.
- 2. The [IOB](#page-362-1) PBlock is placed within the tile PBlock. The side is determined according to the tile type.

**Fine-Grain Manual Placement Strategy** This strategy further constrains logic, directly mapping the relevant [LUTs](#page-363-2) and flip-flops to specific [LUTs](#page-363-2) or flip-flops in the 7 series host [CLB.](#page-361-0) As there are numerous ways to place the logic within a tile, a manually derived layout is chosen instead of trying to find a fully automated one. The strategy is then made generic to support different [VFPGA](#page-364-1) parameters, but the layout is fixed to the [VFPGA](#page-364-1) and therefore cannot be reused for completely different applications. Evaluation of different manual layouts led to a placement as was presented in [figure 9.1a:](#page-254-0) The [PSM](#page-363-1) is located in the upper right corner and the [CLB](#page-361-0) is placed in the lower part of the tile. [Figure 9.4](#page-261-0) shows the device view in *Vivado* after the manual placement strategy has been applied.

<span id="page-261-0"></span>

**Figure 9.4:** [VFPGA](#page-364-1) tile (type 9) placed using the manual placement strategy. Multiplexers of the [PSM](#page-363-1)'s top, right, bottom and left side are marked red (1), purple (2), yellow (3) and blue (4). The [CLB](#page-361-0) is located at the bottom with the [LUT,](#page-363-2) two internal multiplexers and D-flip-flop colored in black (5). Sky blue color represents the configuration units of the tile (6). Yellow blocks at the bottom (7) depict [Write Connection Boxes,](#page-361-2) whereas [Read Connection Boxes](#page-361-1) are marked green and turquoise (8) and make up the remaining logic distributed around the [LUT.](#page-363-2)

The implementation of this strategy operates on two lists for each tile PBlock, an instruction list and a list of the free host [FPGA](#page-362-0) [LUTs.](#page-363-2) The instruction list contains simple [VFPGA](#page-364-1) logic element place instructions, interleaved with sorting instructions. It is processed element by element, either placing logic elements or resorting the list of free resources. When an element placement instruction is processed, the logic elements are mapped sequentially to the elements in the sorted list of free resources, starting at a specified offset. When a resorting instruction is found, the resorting algorithm sorts the list of remaining available host [LUTs.](#page-363-2) It sorts horizontally or vertically and uses ascending or descending sorting order, depending on the instruction. As an example, the sort\_xy\_dd instruction sorts first based on the $x$ location, and if the $x$ value is the same for some [CLBs,](#page-361-0) it uses $y$ as the secondary criterion. Descending sorting is applied in both cases. This specific instruction is used to sort the list of available logic elements before placing the right and left multiplexers of the [PSM,](#page-363-1) as they need to be placed vertically from the top right corner. Sorting is always done on the list of free resources, so the length of this list decreases as the placement process proceeds. This makes it possible to reach every single [CLB](#page-361-0) in the PBlock, not just the ones at the borders.
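The interplay of the two lists can be illustrated with a short sketch. The instruction encoding is an assumption made for this example: sorting instructions are strings following the `sort_xy_dd` naming pattern (axis order, then one direction letter per axis), and place instructions are tuples of the form `("place", elements, offset)`. Only `sort_xy_dd` itself is taken from the text; the other variants are extrapolated from its name.

```python
def sort_sites(sites, instruction):
    """Sort (x, y) sites according to e.g. 'sort_xy_dd' or 'sort_yx_ad'."""
    _, axes, directions = instruction.split("_")
    index = {"x": 0, "y": 1}

    def key(site):
        # Negate a coordinate to obtain descending order for that axis.
        return tuple(
            -site[index[axis]] if direction == "d" else site[index[axis]]
            for axis, direction in zip(axes, directions)
        )

    return sorted(sites, key=key)


def run_instructions(instructions, free_sites):
    """Process the instruction list for one tile PBlock.

    Returns the element-to-site mapping and the remaining free sites.
    """
    free = list(free_sites)
    placement = {}
    for instr in instructions:
        if isinstance(instr, str):
            # Resorting instruction: reorder the remaining free sites.
            free = sort_sites(free, instr)
        else:
            # Place instruction: map elements sequentially from the offset.
            _, elements, offset = instr
            chosen = free[offset:offset + len(elements)]
            if len(chosen) < len(elements):
                raise ValueError("not enough free host LUTs in this PBlock")
            placement.update(zip(elements, chosen))
            del free[offset:offset + len(elements)]
    return placement, free
```

Because placed sites are removed from the free list, a later resorting instruction only reorders what is still available, which is what allows inner sites of the PBlock to be reached.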

### <span id="page-262-1"></span>**9.2 Static Power Analysis**

[Section 2.3](#page-42-0) on page [20](#page-42-0) introduced the sources of power consumption in digital circuits. Power consumption was split into two categories, dynamic power and static power. Whereas dynamic power is largely determined by switching power, static power is largely determined by leakage effects. Switching power was given in [equation \(2.9\)](#page-44-0) as:

<span id="page-262-0"></span>
$$
P_{\text{switching}} = \alpha C_{\text{L}} V_{\text{DD}}^2 f \tag{9.5}
$$

As can be seen, it is dependent on the switching frequency, which is application specific.

Power analysis for custom [FPGA](#page-362-0) architectures for both dynamic and static power can be performed in [VPR.](#page-364-2) For the dynamic power estimation, [VPR](#page-364-2) can estimate the switching activity in a placed application. To realize power estimation, it uses .tech files which represent power for a specific technology. [VPR](#page-364-2) ships such files as examples for some technologies and provides a workflow to derive files for other technologies from [SPICE](#page-367-0) models. The [VPR](#page-364-2) power estimator has been extended as explained in [chapter 7](#page-217-0) to support various operating modes and regions. However, obtaining the .tech models for [RFET](#page-364-3) technology is difficult: As there are no [SPICE](#page-367-0) models available, those would have to be derived manually. This is difficult in turn, as only selected current measurements are available for [RFETs,](#page-364-3) but power measurements are not available.

As an alternative, higher level power modelling in [VPR](#page-364-2) does not require .tech files: In this model, *RC* values or absolute power for components are specified in the architecture. This model was also extended to support regions and multiple modes. Still, obtaining power values for [RFETs](#page-364-3) is difficult due to the reasons mentioned before.

For the evaluation in this thesis, a different approach was chosen because of these difficulties with the [VPR](#page-364-2) power estimation. The [PVTA](#page-364-0) system introduced and the voltage scaling methods used primarily affect static power: Switching power in [equation \(9.5\)](#page-262-0) is derived from charging curves of load capacitances and therefore only depends on the voltage the gates are charged to or from. This voltage is the supply voltage $V_{\text{DD}}$, so threshold voltage scaling or body biasing do not affect the switching power. For dynamic power, these approaches can affect the short circuit power, but this effect is not further discussed here.

<span id="page-263-0"></span>

**Figure 9.5:** Potential current leakage paths in the *RGATE* between $P1$ and $P2$. Current direction depends on the program voltages $P1$ and $P2$. Which transistor is "off" and contributes to the total leakage depends on $P1$ and $P2$ as well as the input variable state. Figure adapted from [\[2\]](#page-315-0).

Static power and the subthreshold leakage on the other hand are directly influenced by threshold voltage scaling and body biasing. As [figure 9.5](#page-263-0) shows, leakage current is also a concern in the [CMOS-](#page-361-4)like arrangement of transistors in the *RGATE*. With the models for the leakage current parametrized on control variables introduced in [section 4.6,](#page-155-0) static power can be estimated. Whereas these models do not allow for absolute power estimation, a relative estimation can be used to evaluate how much the leakage current is reduced: First, the total leakage current in absence of voltage scaling for all [CLBs](#page-361-0) in an [FPGA](#page-362-0) can be described as:

$$
\begin{aligned}
I_{\text{tot0}} &= \sum_{i=1}^{N_{\text{CLB}}} \alpha \cdot I_{\text{Leak}}(V_{\text{C0}}) \\
&= \alpha \cdot N_{\text{CLB}} \cdot I_{\text{Leak}}(V_{\text{C0}})
\end{aligned} \tag{9.6}
$$

where $\alpha$ is a proportionality constant representing the number of transistors forming potential leakage paths and $V_{\text{C0}}$ is the control parameter at the nominal value. Generalizing this formula for arbitrary control parameters then yields the general formula. The current with body biasing or threshold voltage scaling in regions is then given as:

$$
\begin{aligned}
I_{\text{tot}} &= \sum_{i=1}^{N_{\text{Region}}} \alpha \cdot I_{\text{Leak}}(V_{\text{C,i}}) \\
&= \alpha \sum_{i=1}^{N_{\text{Region}}} I_{\text{Leak}}(V_{\text{C,i}})
\end{aligned} \tag{9.7}
$$

The relative current consumption then is given as:

$$
i = \frac{I_{\text{tot}}}{I_{\text{tot0}}} = \frac{\sum_{i=1}^{N_{\text{Region}}} I_{\text{Leak}}(V_{\text{C,i}})}{N_{\text{CLB}} \cdot I_{\text{Leak}}(V_{\text{C0}})} \tag{9.8}
$$

Given that  $P = U \cdot I$ ,  $VDD$  being the relevant voltage and  $VDD$  being unaffected by the control parameter, the relative power is equal to the relative current:

$$
p = \frac{P_{\text{tot}}}{P_{\text{tot0}}} = \frac{V_{\text{DD}} \cdot I_{\text{tot}}}{V_{\text{DD}} \cdot I_{\text{tot0}}} = i \tag{9.9}
$$

The derivation given here is only valid for body biasing and threshold voltage scaling. When scaling the supply voltage, formulas need to be adjusted accordingly. Furthermore, scaling the supply voltage also affects the dynamic power with a quadratic influence. Using different voltages however requires careful matching between the regions and is not considered in this thesis.
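As a rough illustration of equations (9.6) to (9.9), the following Python sketch evaluates the relative leakage for a few regions. The exponential `leak_current` function and its parameters are placeholders chosen for illustration and are not the leakage model of section 4.6. The sketch also assumes one [CLB](#page-361-0) per region, so that the number of summands matches the count in the denominator.

```python
import math


def leak_current(v_c, i0=1.0, slope=0.1):
    """Placeholder leakage model (assumption): exponential in the control value."""
    return i0 * math.exp(v_c / slope)


def relative_leakage(v_regions, v_nominal, leak=leak_current):
    """Relative current i = I_tot / I_tot0, following equation (9.8).

    Assumes one CLB per region, so N_CLB equals len(v_regions). The
    proportionality constant alpha cancels and therefore never appears.
    """
    i_tot = sum(leak(v) for v in v_regions)
    i_tot0 = len(v_regions) * leak(v_nominal)
    return i_tot / i_tot0


if __name__ == "__main__":
    v_nom = 0.0
    regions = [0.0, -0.05, -0.1, 0.02]  # hypothetical control values per region
    # With VDD unchanged, relative power equals relative current (equation 9.9).
    print(f"relative leakage and power: {relative_leakage(regions, v_nom):.3f}")
```

Because only the ratio is evaluated, any monotonic leakage model can be substituted for the placeholder without changing the structure of the calculation.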

Power analysis will be mostly important for performance adjustments in an application: When the application is run at nominal conditions, the power can be reduced as the control voltage will be reduced in various sections. [PVTA](#page-364-0) compensation uses the same control mechanism, but will not necessarily lead to leakage reduction: The [PVTA](#page-364-0) effects increase the delay and therefore require adjusting the control voltage in ways which will increase leakage currents. When the process variation in a given [IC](#page-362-2) leads to reduced propagation delay compared to the nominal delay, reductions in leakage currents may be observed though. When [FPGA](#page-362-0) applications are placed for worst-case conditions, this will commonly be the case and process variation compensation can reduce the leakage.

If [PVTA](#page-364-0) effects are analyzed over time, the energy may be more meaningful than the instant power. Relative energy can easily be obtained through integration of relative power over time.
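The integration mentioned above can be sketched as follows. This is a minimal example, assuming that relative power is available as samples over time and that the nominal power is constant, so that relative energy is the time integral of relative power divided by the observed duration. The function name and the sample values are illustrative only.

```python
def relative_energy(times, rel_power):
    """Trapezoidal integral of relative power, normalized by the duration.

    Assumes constant nominal power, so the nominal energy over the same
    interval is proportional to the duration itself.
    """
    if len(times) != len(rel_power) or len(times) < 2:
        raise ValueError("need at least two matching samples")
    area = 0.0
    for k in range(1, len(times)):
        dt = times[k] - times[k - 1]
        area += 0.5 * (rel_power[k] + rel_power[k - 1]) * dt
    return area / (times[-1] - times[0])


if __name__ == "__main__":
    # Hypothetical trace: relative power drops from 1.0 to 0.5 over two time units.
    print(relative_energy([0.0, 2.0], [1.0, 0.5]))
```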

## **9.3 Functional Runtime Simulation**

To validate the logic invasion and characterization concept of [section 8.2,](#page-228-0) a testing framework has been set up in QuestaSim. First, various small unit tests have been derived for the individual parts of the [FPGA.](#page-362-0) These are also used as regression tests, to quickly find broken changes during development. In addition, a large integration test has been devised to test the system as a whole with all parts included.

<span id="page-265-0"></span>

**Figure 9.6:** Test setup to validate logic invasion and logic characterization functions. A 4x5 [FPGA](#page-362-0) is programmed with a sample application bitstream. The application is provided with stimuli and its outputs are compared to a golden reference. In parallel, the [FPGA](#page-362-0) transparently invades the logic and provides characterization results, which are validated as well.

The structure of this integration test is shown in [figure 9.6.](#page-265-0) It consists of the 4x5 [FPGA](#page-362-0) to be tested and various helper logic. 4x5 is the smallest configuration where the logic invasion can be tested appropriately, as it contains two usable rows apart from the one reserved for logic invasion. A simple test application was then synthesized for this [FPGA,](#page-362-0) using 3 input signals to generate 3 output signals. The bitstream is embedded in the simulation and after initial programming, the application virtually runs on the [FPGA](#page-362-0) in the QuestaSim simulation.

The *Application Stimuli* generator periodically repeats all possible input combinations. The stimuli are then provided to the [FPGA](#page-362-0) on the [IOB](#page-362-1) ports used by the test application. In parallel, stimuli are also provided to the golden reference model. This model calculates the expected outputs independently, reimplementing the test application's logic operations in [VHDL](#page-364-4) code. The output of this golden reference and the test application running on the [PARFAIT](#page-363-0) [FPGA](#page-362-0) are continuously compared and checked for equality. If values differ at any point in time, the test fails.

After the test application has been programmed and some time has been spent verifying the application output, the *Test Controller* enables logic characterization in the [FPGA.](#page-362-0) At this point, the ProgController embedded in the [FPGA](#page-362-0) will start transparent logic invasion. Application stimuli testing continues in parallel, to verify that logic invasion does not upset the test application operation.

The measured delays are verified in the *Characterizer Check*. This component verifies the locations and delay values produced by the ProgController in the [FPGA.](#page-362-0) To simulate the [LUT](#page-363-2) delay, [VHDL](#page-364-4) transport delays were modeled for the [LUT](#page-363-2) as shown in [listing 9.1.](#page-266-0)

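```
if (split = '1') then
    -- split = '1': q0 reads the table with the MSB of sel forced to '0',
    -- q1 reads it with the MSB of sel forced to '1'
    q0 <= transport table(sel and "011111") after 235 ps;
    q1 <= transport table(sel or "100000") after 235 ps;
else
    -- otherwise only q0 is driven from the full table; q1 is left as 'U'
    q0 <= transport table(sel) after 235 ps;
    q1 <= 'U';
end if;
```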

**Listing 9.1:** VHDL implementation of a LUT with delay information added. Type casts have been omitted to increase readability.

With [LUT](#page-363-2) delays of 235 ps, the period of the ring oscillator is  $2 \cdot 11 \cdot 235$  ps = 5.17 ns. The ProgController is clocked at a period of 10 ns and measures for 100 clock cycles, yielding a measuring time of 1000 ns. Comparing the oscillator period to that, the expected count is 194. Converting this to the [LFSR](#page-363-3) value yields a value of 241, which is the expected output of the characterizer in the test case. As the functional simulation does not model any difference in propagation delays in different [LUTs,](#page-363-2) the expected value is the same for all tested [CLBs.](#page-361-0) The *Characterizer Check* also validates that values are emitted for all [CLBs](#page-361-0) in the [FPGA.](#page-362-0)
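The arithmetic behind the expected count can be reproduced with a short Python sketch. The function names are illustrative. Rounding the ratio up to the next integer is an assumption about how the counter treats the partially completed period, and the conversion of the count to the [LFSR](#page-363-3) value is not modeled here.

```python
import math


def ring_oscillator_period_ps(stages, stage_delay_ps):
    """One full period requires an edge to pass through all stages twice."""
    return 2 * stages * stage_delay_ps


def periods_in_window(period_ps, clock_period_ps, cycles):
    """Number of oscillator periods (fractional) that fit into the measuring time."""
    window_ps = clock_period_ps * cycles
    return window_ps / period_ps


if __name__ == "__main__":
    period = ring_oscillator_period_ps(stages=11, stage_delay_ps=235)
    ratio = periods_in_window(period, clock_period_ps=10_000, cycles=100)
    print(f"period = {period} ps, periods in window = {ratio:.2f}")
    # Assumption: the counter rounds up, i.e. it also counts the started period.
    print(f"count (rounded up) = {math.ceil(ratio)}")
```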

In addition to the automated test cases, the measuring time has also been verified manually through observation of signal waveforms. This ensures that the measuring time was implemented correctly, which is crucial for the system. Validation of the system with *RGATE* or other [ULMs](#page-364-5) can be performed in the same way. As the logic invasion system is independent of the logic cell used, results are however expected to match the [LUT](#page-363-2) case.

## **9.4 Co-Simulating DVS**

The previously introduced functional simulation shows the working principle of the [PVTA](#page-364-0) invasion. It cannot, however, directly answer the question of how much power can be saved or how well the controller reacts to [PVTA](#page-364-0) changes. This information cannot be evaluated in the purely digital simulation, as the effects cannot be modeled there. As a solution, the [VHDL](#page-364-4) [FPGA](#page-362-0) implementation was combined with the propagation delay and [PVTA](#page-364-0) models from [chapter 4.](#page-129-0) This section introduces this QuestaSim based co-simulation framework. Apart from the architecture and models, it will also include [VPR](#page-364-2) for timing and power evaluation. Parts of this section were originally published in [\[Pfa23b\]](#page-344-2), but it has been extended with further details.

<span id="page-267-0"></span>

**Figure 9.7:** Process flow for the [PVTA](#page-364-0) compensation co-simulation. The simulation loads results previously obtained using the [VPR](#page-364-2) place and route flow. It then calculates the initial  $k_{\text{slack}}$  factors once before starting the main operation loop. In this loop, the simulation repeatedly updates the [VHDL](#page-364-4) controller state and recalculates the  $k_{\text{CLB}}$  factors based on the control variable and [PVTA](#page-364-0) conditions.

[Figure 9.7](#page-267-0) shows the general workflow of the co-simulation: First, applications are placed and routed using [VPR](#page-364-2) as explained in [chapter 6](#page-199-0) for [ULMs](#page-364-5) and in [section 4.5](#page-149-0) for [LUT](#page-363-2) based architectures. The place and route files are then loaded into the co-simulation, which is implemented as a QuestaSim plugin. It uses an embedded version of [VPR](#page-364-2) to calculate the initial slack budgets  $k_{\text{slack}}$  once for all regions. After this initial step, it repeatedly updates the  $k_{\text{CLB}}$  factors using the propagation delay model and the scenario models. The control input for the propagation delay model is directly obtained from the [VHDL](#page-364-4) architecture implementation. This [VHDL](#page-364-4) code is then given the  $k_{\text{slack}}$  and  $k_{\text{CLB}}$  factors and one time-step of the [DVS](#page-362-3) controller is simulated. After this, updated  $k_{\text{CLB}}$  values are calculated and the process repeats.

As an optimization, the logic invasion based measurement is not used here and  $k_{\text{CLB}}$  factors are directly provided to the [DVS](#page-362-3) controllers in the tiles instead. The main reason for this is that the simulation models produce relative delays, which directly correspond to these factors. They could be used to derive absolute [LUT](#page-363-2) or [ULM](#page-364-5) delays through scaling of a typical value, but there is little benefit in doing this. It would significantly increase simulation time and provide no new information, as the logic invasion measurement was already evaluated independently.
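The closed loop described above can be mimicked with a toy Python model. This is only a sketch under strong assumptions: the linear `delay_model`, the fixed-step `controller_step` and all numeric values are hypothetical stand-ins for the propagation delay model, the [DVS](#page-362-3) controller and the scenario models, and none of them reflect the actual [VHDL](#page-364-4) or [VPR](#page-364-2) implementation.

```python
def delay_model(control):
    """Hypothetical delay model: relative delay grows linearly with the control value."""
    return 1.0 + control


def controller_step(control, k_clb, k_slack, step=0.01):
    """Hypothetical controller: move the control value toward the slack budget."""
    if k_clb < k_slack:
        return control + step  # timing headroom left: slow the region down
    if k_clb > k_slack:
        return control - step  # region too slow: speed it up again
    return control


def co_simulate(k_slack, k_process, steps=500):
    """Alternate controller updates and k_CLB recalculation for each region."""
    control = [0.0] * len(k_slack)
    k_clb = [p * delay_model(c) for p, c in zip(k_process, control)]
    for _ in range(steps):
        control = [controller_step(c, k, s)
                   for c, k, s in zip(control, k_clb, k_slack)]
        k_clb = [p * delay_model(c) for p, c in zip(k_process, control)]
    return control, k_clb


if __name__ == "__main__":
    # Two regions with hypothetical slack budgets and process variation factors.
    control, k_clb = co_simulate(k_slack=[1.5, 1.2], k_process=[1.0, 0.9])
    print("control values:", [round(c, 2) for c in control])
    print("k_CLB factors: ", [round(k, 3) for k in k_clb])
```

In this toy setup each region settles within one controller step of its slack budget, which mirrors the behavior the co-simulation is meant to evaluate for real controller implementations.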

<span id="page-268-0"></span>

**Figure 9.8:** QuestaSim and VPR based co-simulation framework to simulate region based adaptive voltage scaling [FPGA](#page-362-0) architectures. Grey blocks are implemented in [HDL](#page-362-4) as part of the [FPGA](#page-362-0) architecture and simulated using QuestaSim. Blocks in orange are either embedding [VPR](#page-364-2) C++ code or a Lua interpreter, evaluating [PVTA](#page-364-0) models as .lua scripts.

[Figure 9.8](#page-268-0) shows a structural view of the co-simulation extensions. As shown in [figure 9.7,](#page-267-0) the overall simulation setup realizes a closed feedback loop. This loop can be used to evaluate either the *DVS* controller, user provided scenarios of external influences *PVTA*, or the  $t_{\text{PD}}$ *Model*. The left-hand side of the figure is part of the simulated [FPGA](#page-362-0) architecture and is modeled in HDL using QuestaSim. It consists of the *[DVS](#page-362-3)* controller, which adjusts the performance of regions dynamically, memory which stores  $k_{\text{slack}}$  characterization values, and the *Characterizer*, which assesses the current performance within a region. The simulation framework does not dictate any concrete implementation of those components, giving maximum flexibility to user supplied models.

The right-hand side of [figure 9.8](#page-268-0) is part of the simulation environment, implemented in C++, and interfaces to the HDL code using QuestaSim's [Foreign](#page-362-5) [Language Interface \(FLI\).](#page-362-5) It takes the requested control voltage from the architecture's *DVS* controller and passes it to user-supplied *[PVTA](#page-364-0)* scenario models written in Lua. Scenario model results are passed to other modules to derive the  $k_{\text{CLB}}$  timing degradation factors through the  $t_{\text{PD}}$  models. Using this information, a static timing analysis with updated delays can be performed in VPR. The [STA](#page-364-6) pass makes it possible to assess the target application's timing requirements under changing operating conditions. The results are also forwarded to the *Characterizer*, which estimates the current delay in each region. To obtain a region's total performance, the characterizer uses the worst factor of all [CLBs](#page-361-0) within a region. The interface integrating C++ and [VHDL](#page-364-4) code is shown in [figure 9.9.](#page-269-0)
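The reduction performed by the characterizer can be illustrated in a few lines of Python. The sketch assumes that a larger  $k_{\text{CLB}}$  factor means a larger delay, so that the worst factor in a region is its maximum. The data layout and the function name are illustrative and not taken from the framework.

```python
def region_factors(k_clb_by_region):
    """Return the worst (largest) delay factor of all CLBs in each region.

    Assumption: a larger factor corresponds to a slower CLB.
    """
    return {region: max(factors) for region, factors in k_clb_by_region.items()}


if __name__ == "__main__":
    # Hypothetical per-CLB delay factors, grouped by region index.
    sample = {0: [0.95, 1.02, 0.98], 1: [1.10, 1.07]}
    print(region_factors(sample))
```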

<span id="page-269-0"></span>

**Figure 9.9:** Logic interface block which provides the interface between software and [VHDL](#page-364-4) code in each region. The interface receives its position in the grid as well as the control input. As output, it provides the current  $k_{\text{CLB}}$  factor as well as the initial  $k_{\text{slack}}$  factor for the region.

For power analysis, model values are forwarded to the *[VPR](#page-364-2) Power* estimator, most notably the current voltage for a region. The [VPR](#page-364-2) power analysis region extensions can then be used to calculate per-region power for all primitive blocks. The support of different .tech files for each region and point in time allows full flexibility for simulation of [DVS](#page-362-3) systems. For the evaluation presented in this thesis though, the simpler analysis method introduced in [section 9.2](#page-262-1) on page [240](#page-262-1) will be used.

[Figure 9.10](#page-270-0) shows a graphical depiction of an example benchmark evaluation. In this figure, subfigures [\(a\)](#page-270-0) and [\(b\)](#page-270-0) show the  $k_{\text{CLB}}$  factors at the beginning and at the end of the simulation. It can be seen that in the beginning, the measured delay largely conforms to the example process variation introduced in figure 4.24c on page [154.](#page-176-0) At time  $t_1$ , the FPGA architecture's [DVS](#page-362-3) controller has adjusted the power supply in each region and made the measured delay variations match the application's slack factors,  $k_{\text{slack}}$ . The adjustment in the control voltage needed to achieve this is shown in [figure 9.10c.](#page-270-0) This result is mostly a combination of the process variation map and the slack factor, as the voltage adjustment has to both cancel the process variation and adapt the local performance to application requirements.

The simulation framework also enables validation of timing paths under voltage scaling: The slack values can be assessed through additional [STA](#page-364-6) for the target application at different time steps in the simulation. The results for this demonstration are shown in [figure 9.11,](#page-270-1) where regions are numbered from bottom-left to top-right. Analysis of the minimal slack in each region shows that all paths still achieve timing closure and most critical paths have reduced slacks. This also applies for the averages in most regions, showing that all paths are affected similarly.

<span id="page-270-0"></span>

**Figure 9.10:** Delay factors  $k_{\text{CLB}}$  and region voltage at start and end of simulation. Unused regions are depicted in white color. **[\(a\)](#page-270-0)** and **[\(b\)](#page-270-0)** show the current delay factors at the respective point of time. Brighter color indicates a smaller factor, i.e. delays in the region are smaller compared to the nominal case. **[\(c\)](#page-270-0)** shows the voltage factors used to achieve the delay factors in **[\(b\)](#page-270-0)**. Brighter color indicates a lower factor, i.e. the threshold voltage in that region has been increased. At the initial time  $t_{0}$ , all region factors are 1.0.

An important observation is that some regions even provide larger slack values. This is caused by the initial measurement  $k_{\text{slack}}$  being based on the initial [VPR](#page-364-2) placement, which does not include device specific variation. In some cases, the device specific variation might provide locally better performance in some regions than the worst-case corner model that was used for  $k_{\text{slack}}$ . Another observation is that the slack values of critical paths have not been reduced to zero. As explained previously, this happens when paths traverse other critical regions: For example, the worst path in region 3 could be scaled by a factor of 4.73 and the analysis assumes it is scaled by this factor in all regions it traverses. However, the path also passes through region 2, which contains another, more critical, path, and can only be scaled by 1.59. The effect is also further explained by the paths used for characterization not being identical to the real application paths.

<span id="page-270-1"></span>

**Figure 9.11:** Available slacks for each region. Bars depict the mean over the slacks of all paths in a region and error bars show the minimum and maximum slack values. The nominal initial distribution after placement, ignoring the device specific variation model, is shown as white bars. Black bars show the final distribution with adjusted voltage factors at  $t_1$.
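The effect of a path crossing regions with different scaling limits can be made concrete with a small Python sketch. The scaling factors 4.73 and 1.59 are the ones quoted in the text, while the split of the path delay between regions 3 and 2 is invented purely for illustration.

```python
def scaled_path_delay(segment_delays, region_scale):
    """Sum the path delay, scaling each segment by the factor of its region."""
    return sum(delay * region_scale[region]
               for region, delay in segment_delays.items())


if __name__ == "__main__":
    # Hypothetical split: 1.0 ns of the path lies in region 3, 0.5 ns in region 2.
    segments = {3: 1.0, 2: 0.5}
    assumed = scaled_path_delay(segments, {3: 4.73, 2: 4.73})  # analysis assumption
    actual = scaled_path_delay(segments, {3: 4.73, 2: 1.59})   # region 2 is limited
    print(f"assumed delay  = {assumed:.3f} ns")
    print(f"actual delay   = {actual:.3f} ns")
    print(f"slack retained = {assumed - actual:.3f} ns")
```

The retained slack equals the delay portion in region 2 multiplied by the difference of the two factors, which is why critical paths do not end up with zero slack.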

## **9.5 Benchmark Applications**

Results in [FPGA](#page-362-0) research also depend on the benchmark user applications used to evaluate the architecture, so a representative standard set of benchmarks had to be chosen. Out of the benchmarks offered by [VTR,](#page-364-7) the *MCNC20* benchmarks have been discarded as being too small: The [VTR](#page-364-7) documentation discourages their use, as results obtained using those are no longer representative for real-world applications.

The *Titan* set of benchmarks provides large-scale benchmarks for [FPGA](#page-362-0) design. These benchmarks are however only available as pre-synthesized netlists. Such netlists are already mapped to a certain logic generator primitive, usually a specific [LUT](#page-363-2) type. They can therefore not be used to target [ULM](#page-364-5) based [FPGAs](#page-362-0) or [FPGAs](#page-362-0) with other custom [LUT](#page-363-2) types. This set of benchmarks has therefore also been discarded.

Newly available *Koios 2.0* benchmarks could theoretically be used in future work. The stable [VTR](#page-364-7) distribution used for the evaluation of this thesis does however not yet support these benchmarks. Synthesizing the benchmarks with the old ODIN version included in stable [VTR](#page-364-7) is not successful, as some Verilog features are not supported. As the benchmarks were designed for machine learning use case evaluation, which is not relevant here, they have not been used. Similarly, the *Symbiflow* benchmarks have not been used as they are meant to be used only as regression test suite for some specific architectures. *NOC* benchmarks have also not been used, as the [PARFAIT](#page-363-0) architecture does not provide [NoC](#page-363-4) support.

Instead of these, the *VTR Benchmark* standard set has been used. Those medium-sized benchmarks are directly supported in [VTR,](#page-364-7) and can be mapped to any logic generator. To enable evaluation of all benchmarks, the [PARFAIT](#page-363-0) architecture had to be extended with hard memory blocks. The evaluation was then performed using the *run\_vtr\_flow* tools: At first, all benchmarks were mapped using an auto\_layout version of the architecture. Then, the maximum [FPGA](#page-362-0) size to realize the benchmarks was determined. After that, all benchmarks were mapped to an appropriate fixed\_layout of fixed size for all applications.

# **Part III**

# **Final Remarks**

## **Chapter 10**

## **Evaluation**

The following sections will present various evaluation results for the [PARFAIT](#page-363-0) [FPGA.](#page-362-0) First sections evaluate selected individual aspects, whereas the final section presents power saving and [PVTA](#page-364-0) compensation results for the whole system.

## **10.1 Ambipolar Standard Cell Application**

Non-reconfigurable logic can be implemented in [RFET](#page-364-3) using the custom standard cell library introduced in [chapter 5.](#page-183-0) In this section, implementations of test circuits using these cells will be evaluated. At first, simple combinational circuits will be analyzed. Then, a more complex cryptographic accelerator will be evaluated to demonstrate sequential circuits.

#### **Combinational Cells**

To compare the results of synthesis of combinational circuits, the full adder cell was synthesized with both planar [RFET](#page-364-3) and [SOI](#page-364-8) reference timing libraries. A manual mapping was performed to ensure the netlists in both technologies are equal. These results have previously been published as a part of [\[Reu21\]](#page-344-3).

The timing report of the full adder circuit, presented in [table 10.3](#page-279-0) on page [257,](#page-279-0) lists the internal gates (1, 2 and 3) in the full adder, referring to gate numbers in the full adder schematic in [figure 5.4a](#page-191-0) on page [169.](#page-191-0) Gates which are directly connected to an output do not drive any loads, as ideal high impedance circuit outputs were used. Because of this, the timing of internal gates, which do have loads, will be analyzed.

Although the *NAND* gates (index 2 and 3) show an increased cell delay, the overall critical path of the planar [RFET](#page-364-3) circuit (1130 ps) is shorter than for the [SOI](#page-364-8) reference technology (1331 ps). The reason for this can be found when comparing the *XOR* gate (index 1) delay: While the planar [RFET](#page-364-3) *XOR* gate has a cell delay of 336 ps, the cell delay of the *XOR* gate provided by the [SOI](#page-364-8) reference is, with 493 ps, significantly higher. A substantial difference between the timing of the planar [RFET](#page-364-3) and the reference technology is that the critical path is different: For the planar [RFET,](#page-364-3) it runs from input  $a$  to the carry output  $c_{out}$. For the [SOI](#page-364-8) technology however, the critical path leads from  $a$  to output v. This observation reinforces the assumption that the planar [RFET](#page-364-3) boosts performance of largely XOR based circuits when compared to the [SOI](#page-364-8) reference technology.

[Table 10.4](#page-280-0) on page [258](#page-280-0) shows reduced versions of critical paths in **[\(a\)](#page-280-0)** a 32 bit carry ripple adder, **[\(b\)](#page-280-0)** a 4 bit checked adder and **[\(c\)](#page-280-0)** the [ARX](#page-361-5) cell. The circuits are based on the building blocks in [figure 5.4](#page-191-0) on page [169.](#page-191-0) They are completely combinational and therefore synthesized to exclusively [RFET](#page-364-3) cells. Using fanout information and the input capacitance of the cells and the wire loads, Cadence Genus determines the output load of all cells. It then calculates the delay according to the cell type, arc (input pin to output pin used by the analyzed timing path), the input slew rate (not shown), the input transition edge (rise or fall) and the output load. Delays match what a manual analysis suggests, which encourages the view that the extracted timing information has been properly transformed into the .lib file. Further extraction and validation of the capacitance max rise and capacitance max fall values has been performed, to ensure that rising and falling transitions are both handled correctly.

#### **Cryptographic Accelerator**

In the following, the various system architectures for the cryptographic accelerator will be presented first, followed by a mapping to [RFET.](#page-364-3) Results for the system architecture have been presented previously in [\[Pfa19\]](#page-343-0), results for the [RFET](#page-364-3) mapping in [\[Reu21\]](#page-344-3).

**[FPGA](#page-362-0) Evaluation** Xilinx *Vivado* 2018.3 was used for evaluation. Retiming was explicitly enabled, and the VC707 board and Virtex 7 XC7VX485T-2FFG1761C [FPGA](#page-362-0) were targeted for implementation. Architectures can be parametrized on core count and the maximum reachable clock frequency reduces with higher count. This can be explained by increased stress on placement and routing. Throughput depends on both the number of cores and the clock frequency, so [figure 10.1](#page-277-0) compares different implementations' throughput vs. the required resources. The Pipeline implementation is shown as points, as it does not provide a Core count parameter to balance resources vs throughput.  $d =$  denotes whether DSP blocks have been used for the ARX cell (1) or not (0) and  $r =$  gives the number of introduced pipeline registers. [Table 10.5](#page-281-0) on page [259](#page-281-0) shows that these implementations surpass all state of the art ChaCha implementations, except for the low-resource optimized implementation by At et al., when comparing bitrate per slice. The Block Memory and Pipeline implementation also surpass [Advanced Encryption Standard \(AES\)](#page-361-6) state of the art, the Pipeline implementation even by a factor of 8. Results shown are for ChaCha8. To calculate numbers for ChaCha12/20, divide throughput by 1.5 and 2.5 for the Register and Memory implementations. For the Pipeline implementation, resource requirements are increased by these factors and the maximum clock frequency and therefore throughput may also be affected.
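The conversion from ChaCha8 results to ChaCha12 and ChaCha20 for the Register and Memory implementations amounts to dividing by the ratio of round counts. A minimal Python sketch, with an illustrative function name and a made-up throughput value, could look as follows. It does not apply to the Pipeline implementation, where resources rather than throughput scale with the round count.

```python
def scale_throughput(chacha8_throughput, rounds):
    """Scale a ChaCha8 throughput to another round count (Register/Memory variants).

    The divisor is rounds / 8, i.e. 1.5 for ChaCha12 and 2.5 for ChaCha20.
    """
    if rounds <= 0:
        raise ValueError("rounds must be positive")
    return chacha8_throughput / (rounds / 8)


if __name__ == "__main__":
    base = 10.0  # hypothetical ChaCha8 throughput in arbitrary units
    for rounds in (8, 12, 20):
        print(f"ChaCha{rounds}: {scale_throughput(base, rounds):.2f}")
```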

<span id="page-277-0"></span>

**Figure 10.1:** Comparing throughput vs. number of slices for various ChaCha accelerator system architectures.

**[RFET](#page-364-3) Evaluation** To evaluate the [RFET](#page-364-3) *.lib* file, the ChaCha accelerator circuit was synthesized in Cadence Genus. For this synthesis run, pure logic synthesis based on the *.lib* file was performed. Physical information from *.lef* files or capacitance tables was omitted. The synthesis makes use of the [SOI](#page-364-8) D [FF](#page-362-6) and the design is therefore not completely synthesized in [RFET](#page-364-3) technology. Unlike for the small circuits, the dont\_touch attribute is not used, permitting optimization during synthesis. As the cells have only been characterized for up to 15 fF output capacitance, synthesis tools insert buffer cells if deemed appropriate. This however increases the path lengths and reduces achievable target frequencies.

[Table 10.1](#page-278-0) shows the critical path in the ChaCha accelerator for the planar [RFET](#page-364-3) timing library. The critical path length of the [SOI](#page-364-8) reference technology is 58% of the critical path length of the planar [RFET,](#page-364-3) leading to an advantage of 42% in speed for the reference technology. Inspecting the critical path of the planar [RFET](#page-364-3) synthesis shows that 63 out of 96 gates in the critical paths are *NAND* gates. The comparison of cell timing characteristics between planar [RFET](#page-364-3) and [SOI](#page-364-8) reference cells in [\[Reu21\]](#page-344-3) states that the relative performance of the planar [RFET](#page-364-3) technology is worst for the *NAND* cell. The overall worse performance of the planar [RFET](#page-364-3) implementation for the ChaCha accelerator can therefore be partly traced back to the worse relative performance of the proposed planar [RFET](#page-364-3) cells. On the other hand, the missing derating of the [SOI](#page-364-8) reference, which would compensate the device specific characteristics like channel length and threshold voltage, benefits the reference as well [\[Reu21\]](#page-344-3).

<span id="page-278-0"></span>

**Table 10.1:** Excerpt of critical path analysis in ChaCha accelerator for [RFET](#page-364-3) standard library.

[Table 10.2](#page-279-1) shows the area summary for the ChaCha accelerator as reported by Cadence Genus. As can be seen, the cell area is largely zero. This is expected, as the *.lib* file is currently missing area definitions for the [RFET](#page-364-3) cells and only [SOI](#page-364-8) D [FF](#page-362-6) cells provide area information. Although area reports are currently of limited use because of this, they can be used to verify whether wire load models are working correctly.

<span id="page-279-1"></span>

**Table 10.2:** Genus area report for ChaCha accelerator for [RFET](#page-364-3) standard library.

<span id="page-279-0"></span>

**Table 10.3:** Delay analysis for the full adder in [RFET](#page-364-3) technology in comparison to [SOI](#page-364-8) technology.

<span id="page-280-0"></span>

**Table 10.4:** Genus critical path analysis in combinational circuits for [RFET](#page-364-3) technology. **[\(a\)](#page-280-0)** 32 bit carry ripple adder: Reference is 78% faster. **[\(b\)](#page-280-0)** 4 bit checked adder: Reference is 26% faster. **[\(c\)](#page-280-0)** [ARX](#page-361-5) cell: Reference is 42% faster.

<span id="page-281-0"></span>

## **10.2 Ambipolar Reconfigurable Cells**

For the evaluation of the logic cell, all parameters were tuned as explained in [chapter 6:](#page-199-0) The logic cluster has 40 inputs that feed an input crossbar. It combines 5 simple, two-input logic elements with one [FF](#page-362-6) each, 5 [FLEs](#page-362-7) with one [FF](#page-362-6) each and 10 [FLEs](#page-362-7) without [FF.](#page-362-6) Each of the previously mentioned cells provides one output of the 20 outputs of the logic cluster and there are no internal-only cells in the final architecture. The evaluation used a [FLE](#page-362-7) with the topology of [figure 6.7a,](#page-213-0) as results did not show large benefits when using fracturable outputs. As will be shown, benefits of fracturable cells can only be realized when the more advanced combined genlib [EDA](#page-362-8) approach is used. Results currently do not include an absolute timing analysis for the FPGA architecture, as this requires dynamic characterization of at least the [ULM](#page-364-5) in the [RFET](#page-364-3) technology. Preferably, all [FPGA](#page-362-0) components should be characterized in the target technology to get detailed results, but due to incomplete [PDKs,](#page-363-5) such a detailed analysis is currently not possible. As timing and delays also influence FPGA size due to timing-driven routing algorithms, timing-driven routing and packing were disabled in [VPR](#page-364-2) for this evaluation.

**Simple [EDA](#page-362-8) Flow** [Figure 10.2](#page-283-0) shows results obtained through the [VTR](#page-367-1) benchmark flow. Data has been collected from benchmarks and averaged accordingly. [Figure 10.2a](#page-283-0) shows the input utilization of the logic cluster. The architecture uses up to 32 inputs, which is the maximum channel width in the routed design. As [VTR](#page-367-1) automatically increases the channel width in benchmark mode, this shows that a larger channel width was not required. One limiting factor of the architecture is therefore likely the number of [LEs](#page-362-9) in each cluster. This conclusion is supported by the output utilization of the logic cluster shown in [figure 10.2b.](#page-283-0) It can be seen there that out of 20 available outputs, an average of only 6.9 are used. Both [FLE](#page-362-7) and simple [LE](#page-362-9) are almost fully utilized, as can be seen in [figures 10.2c](#page-283-0) and [10.2d.](#page-283-0) This further reinforces the conclusion that the main limiting factor of logic expressiveness in this architecture is still the [LE.](#page-362-9) Introducing more [LEs](#page-362-9) however increases the size of the input crossbar, which increases area and makes comparison to the reference architecture more difficult. As will be shown in this section, even though this architecture does not yield full utilization of logic cluster inputs and outputs, in most benchmarks the final [FPGA](#page-362-0) size is close to the reference architecture. A problem can be seen in [figure 10.2e,](#page-283-0) which shows the utilization of [FLE](#page-362-7) inputs. It can be seen that without the fracturable [LE,](#page-362-9) the [LE](#page-362-9) is fully utilized in only a few cases. In the most common case, simple 2-input functions are mapped to the [LE](#page-362-9) and most inputs remain unused. As some complex functions are successfully mapped to the [FLE,](#page-362-7) it can however be seen that the approach in general is working correctly. The issue with underutilized [FLE](#page-362-7) inputs will be further discussed for the results using the advanced [EDA](#page-362-8) flow.

<span id="page-283-0"></span>

**Figure 10.2:** Utilization statistics for the proposed logic cluster. **[\(a\)](#page-283-0)** Cluster input utilization. Mean = 16.8. **[\(b\)](#page-283-0)** Cluster output utilization. Mean = 6.9. **[\(c\)](#page-283-0)** Utilization of [FLEs](#page-362-7) in cluster. Mean = 12.8. **[\(d\)](#page-283-0)** Utilization of simple [LE](#page-362-9) in cluster. **[\(e\)](#page-283-0)** Utilization of the [FLE](#page-362-7) inputs.

<span id="page-284-0"></span>

**Figure 10.3:** Comparison of [EDA](#page-362-8) metrics for [ULM](#page-364-5) clusters and [LUT.](#page-363-2) This figure shows the [EDA](#page-362-8) tool runtime, averaged over all benchmark circuits.

[Figure 10.3](#page-284-0) shows the average tool runtime compared to the runtime in the [LUT](#page-363-2) based FPGA case for the evaluated benchmarks in the relevant tool steps. As can be seen, runtime is largely similar for most steps. For the packing phase however, the runtime increases by almost two orders of magnitude. An increase in packing time is expected, as the packer has to pack all [ULMs](#page-364-5) into the [FLE](#page-362-7) in the simple [EDA](#page-362-8) flow. As there are numerous unused [ULMs](#page-364-5) due to their limited expressiveness, more packing steps are necessary than in a [LUT](#page-363-2) based architecture. In [LUT-](#page-363-2)based [FPGA](#page-362-0) architectures on the other hand, packing is often used only to combine [FF](#page-362-6) and [LUT](#page-363-2) or multiple [LUTs](#page-363-2) into one cluster. Nevertheless, this increase in packing runtime is excessive and a severe limitation of the [ULM](#page-364-5) synthesis toolflow. As will be explained in the next section, this overhead can fortunately be reduced using the advanced [EDA](#page-362-8) flow. In addition, there is a slight increase in routing runtime. This increase can be explained by the mapped [ULM](#page-364-5) benchmarks being slightly larger than the [LUT](#page-363-2) baselines.

[Figure 10.4](#page-285-0) evaluates the FPGA sizes for the used [VTR](#page-367-1) benchmark circuits. [FPGA](#page-362-0) size is evaluated in blocks, counting logic and memory blocks but excluding the peripheral blocks. It can be seen that the [ULM](#page-364-5) based architecture uses less than 10% more blocks in most cases. For the LU\*PEEng and the mcml circuits, the [ULM](#page-364-5) based [FPGA](#page-362-0) was however up to 50% larger. This suggests that further architecture tuning may be beneficial for some circuits and that

<span id="page-285-0"></span>

**Figure 10.4:** Comparison of [EDA](#page-362-8) metrics for [ULM](#page-364-5) clusters and [LUT.](#page-363-2) This figure depicts the [FPGA](#page-362-0) device size in blocks for the various benchmarks.

the [FLE](#page-362-7) expressiveness does not yet fully match the [LUT](#page-363-2) baseline. The issue can be addressed by introduction of more [FLEs](#page-362-7) in the logic cluster, at the expense of increasing the input crossbar size.

**Advanced [EDA](#page-362-8) Flow** To address underutilized inputs of [LE](#page-362-9) cells and non-utilized outputs of the fracturable cell, results have been reevaluated with the advanced tool flow. As this flow models combined functions to be realized by multiple [ULMs](#page-364-5) in one [FLE](#page-362-7) cell, it reduces the amount of work in the packing step. The packing algorithm is then only used to pack one or multiple combined functions into one [FLE](#page-362-7) and operates in the same way as for fracturable [LUTs.](#page-363-2) [Figure 10.5](#page-286-0) shows an evaluation of the fracturable [LE](#page-362-9) using the advanced flow. As can be seen in [figure 10.5a,](#page-286-0) the [EDA](#page-362-8) flow now successfully makes use of multiple inputs in the [FLE.](#page-362-7) Functions with one input are still used, as those are needed to realize inverter or buffer functions in some cases. Two or three inputs are never used for the tested benchmark sets. This clearly shows that in cases where a 2-input function is mapped to a [FLE,](#page-362-7) it gets fractured and another function is additionally mapped. More than half of the cells have fully utilized their inputs, a significant difference to the simple [EDA](#page-362-8) results of [figure 10.2e.](#page-283-0) [Figure 10.5b](#page-286-0) shows the utilization of [FLE](#page-362-7) outputs. It can clearly be seen that multiple outputs are now actively used. Whereas cells with only one mapped output suggest that the whole cell has been mapped to one function, two and three outputs can be intermediate results or be fractured cells implementing multiple functions.

<span id="page-286-0"></span>

**Figure 10.5:** Utilization of the 6-input [FLE](#page-362-7) when using the advanced, combined function [EDA](#page-362-8) flow. **[\(a\)](#page-286-0)** Number of [FLE](#page-362-7) inputs used. **[\(b\)](#page-286-0)** Number of [FLE](#page-362-7) outputs used.

#### **10.3 Power Management Regions**

To demonstrate the use of power management regions and mode assignment approaches, a simple architecture supporting two power modes is evaluated. Parts of this evaluation were originally published in [\[Pfa23b\]](#page-344-2), but this version has been extended with a more thorough evaluation of modified [FPGA](#page-362-0) architectures. For the static mode assignment, modes have been assigned in an alternating pattern. The architecture used for evaluation is based on the 40 nm k6\_frac\_N10\_40nm architecture shipped with [VTR.](#page-364-7) It contains only basic logic elements and no memory, allowing easier comparison. To extend the architecture, delay values for various supply voltages were obtained in a similar 45 nm [PPDK](#page-363-6) using COFFEE2 [\[258\]](#page-340-0) and HSPICE. Power and delay values obtained with 1 V supply voltage were used for the high performance power regions. For the low power regions, varying power and delay values were evaluated. To evaluate the architecture, MCNC benchmarks shipped with [VPR](#page-364-2) were used. As a system-level measurement for power reduction, the number of used [CLBs](#page-361-0) in low-power mode is compared to the baseline architecture. All result figures then show averages for the evaluated benchmarks.

[Figure 10.6](#page-287-0) depicts results for the static assignment strategy, showing that between 30% and 40% of the [CLBs](#page-361-0) have been placed into low power regions. Larger region sizes, which need fewer hardware resources in implementation, do not strongly affect the number of low-power [CLBs](#page-361-0) in this evaluation. As the target frequency was not set (best-effort), the maximum frequency decreases by a factor of 1.1 at 0.95 V and by up to 1.8 at 0.7 V. When actively using static assignment, it should be assessed whether setting a fixed target frequency is necessary.
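The static strategy can be illustrated with a small sketch. It assumes that the alternating pattern is a checkerboard of square regions and that regions with odd parity are the low-power ones. Both choices, the function name `low_power_fraction` and the toy placement are illustrative and are not taken from the evaluated toolflow.

```python
def low_power_fraction(used, region_size):
    """Fraction of used CLBs that fall into low-power regions.

    used: 2D list of booleans, True where a CLB is occupied.
    Regions alternate between the two modes in a checkerboard
    pattern (an assumed reading of "alternating pattern").
    """
    total = 0
    low = 0
    for y, row in enumerate(used):
        for x, occupied in enumerate(row):
            if not occupied:
                continue
            total += 1
            # Odd region parity is treated as low-power (assumption).
            if (x // region_size + y // region_size) % 2 == 1:
                low += 1
    return low / total if total else 0.0


# Toy 8x8 device with a 6x6 block of used CLBs in one corner.
used = [[x < 6 and y < 6 for x in range(8)] for y in range(8)]
for size in (1, 2, 4):
    print(size, round(low_power_fraction(used, size), 3))
```

Because the placement is fixed before the modes are assigned, the fraction is determined purely by how the placed logic overlaps the pattern, which matches the observation that region size has only a weak influence here.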

<span id="page-287-0"></span>

**Figure 10.6:** Static region assignment: Fraction of [CLBs](#page-361-0) in low power mode for different region sizes and low power supply levels. The supply voltage in high-performance regions is 1 V.

Unlike static assignment, a dynamic assignment will not affect maximum application frequencies: The algorithm essentially modifies the distribution of timing slacks, reducing the slack of non-critical paths. For an exemplary evaluation, the architecture is modified to specify only one region type with two possible voltage/power/delay combinations. The high performance mode continues to operate at 1 V and the low power mode uses varying voltage, power and delay. [Figure 10.7](#page-287-1) shows the average number of cells in low-power mode

<span id="page-287-1"></span>

**Figure 10.7:** Dynamic region assignment: Fraction of [CLBs](#page-361-0) in low power mode for different region sizes and low power supply levels. The supply voltage in high-performance regions is 1 V.

for the dynamic strategy. It illustrates that a large number of [CLBs](#page-361-0) can be operated in low power mode, but results depend heavily on the parameters chosen: A low-power supply voltage of 0.9 V will enable almost all [CLBs](#page-361-0) to be in low-power mode. At 0.75 V for most region sizes, less than 25% of the [CLBs](#page-361-0) are placed in this mode. The other factor largely influencing the results is region size: A finer granularity in voltage adjustments allows for better results. For practical [FPGA](#page-362-0) design, such fine-grain voltage selection will have to be balanced against the additional resources it requires.
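The interaction between region size and slack can be sketched with a strongly simplified per-CLB check. This is not the path-based algorithm used in the evaluation: it assumes that each used CLB carries a single tolerable delay factor and that a region may switch to low-power mode only if every used CLB inside it tolerates the low-power delay penalty. The names `dynamic_low_power_fraction`, `tolerable` and `penalty` as well as all numbers are hypothetical.

```python
def dynamic_low_power_fraction(tolerable, region_size, penalty):
    """Fraction of used CLBs that end up in low-power regions.

    tolerable: 2D list; None for unused CLBs, otherwise the factor by
               which the CLB delay may grow without violating timing.
    penalty:   delay factor of the low-power mode (illustrative).
    """
    rows, cols = len(tolerable), len(tolerable[0])
    total = 0
    low = 0
    for ry in range(0, rows, region_size):
        for rx in range(0, cols, region_size):
            cells = [tolerable[y][x]
                     for y in range(ry, min(ry + region_size, rows))
                     for x in range(rx, min(rx + region_size, cols))
                     if tolerable[y][x] is not None]
            total += len(cells)
            # The least tolerant CLB decides the mode of the whole region.
            if cells and min(cells) >= penalty:
                low += len(cells)
    return low / total if total else 0.0


grid = [
    [1.0, 1.5, 2.0, 2.0],
    [1.3, 1.6, None, 2.0],
    [1.2, 1.2, 1.9, None],
    [1.0, 1.4, 1.8, 1.8],
]
for size in (1, 2, 4):
    print(size, round(dynamic_low_power_fraction(grid, size, 1.25), 3))
```

In this toy grid the fraction drops as regions grow, because a single CLB with little slack forces its whole region to stay in high-performance mode.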

### **10.4 Power Management and Compensation**

The following section will discuss various tests and evaluations using the final [PARFAIT](#page-363-0) architecture. Evaluations will focus on [PVTA](#page-364-0) compensation and power reduction using the co-simulation environment. The logic invasion scheme has been abstracted in the co-simulation to decrease simulation times. Apart from that, the evaluated architecture follows the description in [section 8.5](#page-249-0) on page [227.](#page-249-0) This evaluation section explains how to read the result graphs and provides an overview of aggregated results. Additional raw evaluations for eight selected benchmarks are available in [appendix F](#page-387-0) on page [365](#page-387-0) and will be referenced in this evaluation section.

**General** As previously mentioned, this final evaluation does not use the *CNT-DR8F* based [LE](#page-362-0) presented in [section 10.2,](#page-282-0) but the *RGATE* based [LE](#page-362-0) from [figure 8.14.](#page-250-0) The primary reason for this is that the *RGATE* was realized in a technology for which characterization data is available for various [PVTA](#page-364-0) parameters. This allowed the [RFET](#page-364-1) propagation delay model in [section 4.6](#page-166-0) to be fitted for multiple parameters, which was not possible for the *CNT-DR8F* gate. Using *RGATE* therefore allows for more realistic evaluation, as the delay model matches the technology of the actual cell.

<span id="page-288-0"></span>

**Figure 10.8:** Number of [ULM-](#page-364-2)based [CLBs](#page-361-0) in the [PARFAIT](#page-363-0) [FPGA](#page-362-1) vs. number of [LUT-](#page-363-1)based [CLBs](#page-361-0) in the reference [FPGA](#page-362-1) for the evaluated benchmarks.

[Figure 10.8](#page-288-0) provides an overview of [CLB](#page-361-0) usage in the reference architecture vs. the modified architecture with *RGATE* cells. As can be seen, the *RGATE* based [FLE](#page-362-2) has not been optimized for expressiveness, as was done for the *CNT-DR8F* based [LE.](#page-362-0) Due to that, in the worst case [CLB](#page-361-0) usage increases by a factor of up to 3.31. It should be noted that the *RGATE* [LE](#page-362-0) shown in [figure 8.14](#page-250-0) needs only 10 bits for storage, plus one bit to select split mode. This is 54 bits fewer than the [LUT](#page-363-1) used for comparison in the reference architecture.

A 3.31x overhead would therefore still reduce the needed configuration storage. The total area however is also determined by the interconnect size, which increases by a factor of 3.31 as well. A naive approach could simply increase the number of [LEs](#page-362-0) in a [CLB,](#page-361-0) therefore avoiding an increase in [CLBs](#page-361-0) and global interconnect. This approach would however increase the size of the crossbar used for local interconnect in the [CLB,](#page-361-0) as well as the number of output pins. A more advanced approach could consider that few of the *RGATE* [LEs](#page-362-0) are actually used in split mode in the benchmarks. Split mode could therefore be removed, and the second output could be driven by an identical implementation of the [LE.](#page-362-0) This essentially enables doubling the logic density without increasing the number of output pins or the crossbar, but input pins need to be shared between two instances of the *RGATE* logic. Further evaluation is necessary to determine whether such a system can be fully utilized in benchmarks. Alternatively, different topologies for the [ULM,](#page-364-2) with e.g. more *RGATEs*, could be considered.
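The storage claim can be checked with a short calculation. The sketch assumes that the reference [LUT](#page-363-1) stores 64 bits (10 bits plus the stated difference of 54 bits), that the split-mode bit is counted for the *RGATE* [LE](#page-362-0), and that both architectures contain the same number of logic elements per [CLB](#page-361-0). The function name is hypothetical.

```python
def relative_config_storage(le_bits, lut_bits, clb_overhead):
    """LE configuration storage relative to the LUT baseline.

    Assumes the same number of logic elements per CLB in both
    architectures, so only bits per element and CLB count matter.
    """
    return clb_overhead * le_bits / lut_bits


RGATE_LE_BITS = 10 + 1   # 10 storage bits plus one split-mode bit
LUT_BITS = 10 + 54       # assumption derived from the stated difference

ratio = relative_config_storage(RGATE_LE_BITS, LUT_BITS, 3.31)
break_even = LUT_BITS / RGATE_LE_BITS
print(f"relative storage: {ratio:.2f}, break-even overhead: {break_even:.2f}x")
```

Under these assumptions the worst-case benchmark still needs less logic configuration storage than the baseline, and the storage advantage would only vanish at a considerably larger [CLB](#page-361-0) overhead. Routing configuration bits are not covered by this estimate.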

<span id="page-289-0"></span>

**Figure 10.9:** Process variation map used for the [PARFAIT](#page-363-0) [FPGA](#page-362-1) evaluation in this section. The map was generated as explained in [section 4.7](#page-174-0) and contains 250x250 points, one for each location on the [FPGA.](#page-362-1) **[\(a\)](#page-289-0)** Pure random variation. **[\(b\)](#page-289-0)** Spatially correlated variation. **[\(c\)](#page-289-0)** Final process variation model.

[Figure 10.9](#page-289-0) shows the process variation map which will be used to simulate process variation for the final results. All benchmarks in this analysis were mapped to a device of size 250x250, as this fits the largest benchmark. Because of this, the process variation map was also generated at a resolution of 250x250 according to [section 4.7.](#page-174-0) This then enables each [CLB](#page-361-0) in the architecture to get an individual process variation factor, as opposed to one factor for a whole region. Subfigure **[\(a\)](#page-289-0)** shows the pure random variation, which does not have any spatial correlation. Subfigure **[\(b\)](#page-289-0)** shows the spatially correlated part and subfigure **[\(c\)](#page-289-0)** shows the combined process variation, which is used in the co-simulation. The variation maps were generated to describe $V_{th}$, as explained in the introduction of the process variation model. Values are then scaled to be in the range of $[-1,1]$ to fit the $P$ parameter range of the propagation delay models. Darker colors in the figure depict a larger $P$ value, i.e. a [CLB](#page-361-0) with less propagation delay. As this process variation map is representative of process variation, it is used in all the following evaluations. Although the co-simulation easily allows simulation with different process variation maps to assess the effects of those, such an additional evaluation was out of scope of this thesis.
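The three-step construction (random part, correlated part, combination and scaling) can be sketched as follows. This is only a stand-in for the model of [section 4.7](#page-174-0): the correlated part is approximated by a coarse random grid that is repeated blockwise, the two parts are mixed with an arbitrary weight, and the sign convention between $V_{th}$ and $P$ is not modeled. The function name and all parameters are assumptions.

```python
import random


def process_variation_map(size, block, weight, seed=0):
    """Return a size x size map scaled to the range [-1, 1].

    block:  edge length of one spatially correlated patch (assumed).
    weight: share of the correlated part in the combined map (assumed).
    """
    rng = random.Random(seed)
    blocks = -(-size // block)  # ceiling division
    # Correlated part: one Gaussian sample per coarse patch.
    coarse = [[rng.gauss(0.0, 1.0) for _ in range(blocks)]
              for _ in range(blocks)]
    combined = []
    for y in range(size):
        row = []
        for x in range(size):
            correlated = coarse[y // block][x // block]
            independent = rng.gauss(0.0, 1.0)  # pure random part
            row.append(weight * correlated + (1.0 - weight) * independent)
        combined.append(row)
    # Min-max scaling so that the values fit the P parameter range.
    lo = min(map(min, combined))
    hi = max(map(max, combined))
    return [[2.0 * (v - lo) / (hi - lo) - 1.0 for v in row]
            for row in combined]


p_map = process_variation_map(size=250, block=25, weight=0.5)
print(len(p_map), len(p_map[0]))
```

With one value per location, every [CLB](#page-361-0) can look up its own factor by its coordinates, independent of the power region it belongs to.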

<span id="page-290-0"></span>



[Figure 10.10](#page-290-0) shows placement maps of three selected example benchmarks. Orange color depicts used [CLB](#page-361-0) blocks, red color at the border shows used [IOBs](#page-362-3) and blue color shows used memory blocks. Gray color in general depicts unused blocks and gray stripes are caused by columns of memory blocks being unused. [Figure F.1](#page-388-0) on page [366](#page-388-0) in the appendix shows the placement maps of all evaluated benchmarks. In general, due to the device size being fixed to 250x250 blocks and the benchmarks having widely varying sizes, the results are quite different. For example, [figure 10.10a](#page-290-0) shows an average size benchmark, [figure 10.10b](#page-290-0) a similarly sized one with two clusters of placed logic and [figure 10.10c](#page-290-0) shows the benchmark with the largest utilization. This evaluation section will only provide averaged statistics and explain the general structure of evaluation graphics using examples. Due to the different placements of benchmarks, individual results may be interesting and are provided in [appendix F](#page-387-0) for all benchmarks and most parameter combinations. The placement of benchmarks was fixed for all evaluations; the co-simulation was configured to always load the pre-placed results shown here.

**Power Reduction** [Figure 10.12](#page-293-0) provides an exemplary overview of all evaluations for the arm\_core benchmark, with [RFET](#page-364-1) delay and power models and no [PVTA.](#page-364-0) Columns show different region sizes, whereas the rows show different parameter evaluations. The first row shows target delay factors $k_{\text{slack}}$, as obtained from [VPR.](#page-364-3) Values have been color coded according to [figure 10.11,](#page-291-0) where unused blocks are shown in gray. For larger region sizes, the worst case value for all [CLBs](#page-361-0) within a region is used. As can be seen in e.g. [figure 10.12c](#page-293-0) and as expected, this causes larger regions to require the worst case slack value, even if only a single [CLB](#page-361-0) within the region is utilized. $k_{\text{slack}}$ values are independent of [PVTA](#page-364-0) and are therefore shown only once. Factors for all benchmarks can be found in [figures F.2](#page-389-0) to [F.5.](#page-392-0)
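The worst-case aggregation per region can be written down in a few lines. The sketch assumes that the worst case is the smallest target factor among the used [CLBs](#page-361-0) of a region, since a smaller factor corresponds to a faster required circuit. The function name and the example values are hypothetical.

```python
def region_targets(k_slack, region_size):
    """Worst-case (smallest) target factor of the used CLBs per region.

    k_slack: 2D list; None marks an unused CLB.
    Returns a 2D list with one entry per region; None if the region
    contains no used CLB.
    """
    rows, cols = len(k_slack), len(k_slack[0])
    targets = []
    for ry in range(0, rows, region_size):
        row = []
        for rx in range(0, cols, region_size):
            used = [k_slack[y][x]
                    for y in range(ry, min(ry + region_size, rows))
                    for x in range(rx, min(rx + region_size, cols))
                    if k_slack[y][x] is not None]
            row.append(min(used) if used else None)
        targets.append(row)
    return targets


grid = [[None] * 4 for _ in range(4)]
grid[0][0] = 0.8  # a single demanding CLB
grid[3][3] = 1.6  # a relaxed CLB in the opposite corner
print(region_targets(grid, 2))
print(region_targets(grid, 4))
```

With 2x2 regions only the region containing the demanding [CLB](#page-361-0) inherits its target; with one 4x4 region the same single [CLB](#page-361-0) dictates the target for the whole device.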

<span id="page-291-0"></span>

**Figure 10.11:** This legend shows the color coding for the various heatmaps. Values in the center depict typical values, the left side shows slow values and the right side fast values. Color gradients are scaled exponentially, so that smaller changes in the typical region cause larger variation in colors.

The second row shows the achieved delay factor, $k_{\text{CIR}}$. As in the $k_{\text{slack}}$ figures, a blue value indicates a smaller factor, i.e. a smaller propagation delay and therefore a circuit that is faster than nominal. Colors in all graphs have been normalized to common values according to [figure 10.11,](#page-291-0) enabling comparisons between multiple figures.

The third row shows control voltage $C$. Graphs have been normalized for the range of [2,3] volts, with the minimum value used by the power controller being 1 V. Values outside the color range will be clamped to the range limits. Color normalization comes at the cost of reduced color resolution for some regions. For example, a control voltage of 2.0 V cannot easily be distinguished from a value of 2.05 V. Because of this, some control voltage and power heatmaps appear to have uniform color, although there actually are differences in the obtained values. The last row in [figure 10.12](#page-293-0) shows the relative current or power, with blue color depicting higher power. For the absolute color values, the same remarks as for the control voltage apply.

[Figure 10.13](#page-294-0) shows relative power for the benchmarks. Simulations were performed using 100 simulation time steps, i.e. 100 invocations of the control loop. The value for each benchmark was obtained by first finding the control voltage in each [CLB](#page-361-0) at the last simulated time point. These values were then used with the [RFET](#page-364-1) and [SOI](#page-364-4) current models of [section 4.6](#page-155-0) to obtain the normalized leakage current for all [CLBs.](#page-361-0) The result is assumed to also model the relative change in power, as explained in [section 9.2.](#page-262-0) Finally, the average of these values is calculated to obtain an aggregate value for each benchmark. The results for [SOI](#page-364-4) are shown in [figure 10.13a,](#page-294-0) for [RFET](#page-364-1) in [figure 10.13b.](#page-294-0)
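The aggregation steps above can be sketched as follows. The exponential `toy_leakage` function is only a placeholder and does not reproduce the fitted [RFET](#page-364-1) or [SOI](#page-364-4) current models of [section 4.6](#page-155-0). The nominal voltage, the sensitivity and the example voltages are arbitrary illustrative values.

```python
import math


def toy_leakage(control_voltage, nominal=1.0, sensitivity=2.0):
    """Placeholder leakage model, normalized to 1 at the nominal voltage."""
    return math.exp(sensitivity * (control_voltage - nominal))


def benchmark_relative_power(final_voltages, leakage_model):
    """Average normalized leakage over all CLBs of one benchmark.

    final_voltages: control voltage of every CLB at the last time step.
    """
    currents = [leakage_model(v) for v in final_voltages]
    return sum(currents) / len(currents)


# Three CLBs scaled down by their controllers, one kept at nominal.
final_voltages = [0.5, 0.5, 0.6, 1.0]
print(round(benchmark_relative_power(final_voltages, toy_leakage), 3))
```

Passing the leakage model as a parameter mirrors the evaluation, where the same control-voltage data is evaluated once per technology.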

It can be seen that for the [SOI](#page-364-4) model, all benchmark applications at all region sizes have a relative power of $2.76 \times 10^{-2}$. This is caused by the characteristics of body bias control in this [SOI](#page-364-4) technology: The application benchmarks have been placed assuming the worst-case process variation, $P = -1$. This co-simulation on the other hand does not yet include the process variation map and therefore determines the current delay factors $k_{\text{CIR}}$ using nominal process variation, $P = 0$. Due to this, each [CLB](#page-361-0) is already 30.2 % faster than what was assumed during application placement. The controller accordingly modifies the body bias to increase delay and save power. The slope $\delta t_{\rm PD}/\delta V_{\rm BS}$ is comparatively small, as was shown in [figure 4.17](#page-164-0) on page [142.](#page-164-0) This leads to the controller quickly setting the bias voltage to the minimum value, −0.5 V. Essentially, the controllers configure all regions in the slowest low-power mode available, reducing power as far as possible. The power saving is still significant, due to the larger slope of $\delta I_{\text{leak}}/\delta V_{\text{BS}}$ as shown in [figure 4.19](#page-166-1) on page [144.](#page-166-1) In this case, region size does not have any influence, as all regions always use the same body bias, regardless of region size.
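The saturation can be reproduced with a toy control loop. The sketch assumes a proportional controller and a delay factor that falls linearly with the body bias. Only the lower limit of −0.5 V is taken from the text; the upper limit, the gain, the slopes and the delay factors are illustrative, and `settle_body_bias` is a hypothetical name.

```python
def settle_body_bias(k_target, k_at_zero_bias, slope, gain=1.0,
                     v_min=-0.5, v_max=0.5, steps=100):
    """Run a clamped proportional controller and return the final bias.

    The achieved delay factor is modeled as k_at_zero_bias - slope * v,
    so a lower bias voltage makes the circuit slower (toy model).
    """
    v = 0.0
    for _ in range(steps):
        k_cir = k_at_zero_bias - slope * v   # achieved delay factor
        error = k_cir - k_target             # negative: faster than required
        v = min(v_max, max(v_min, v + gain * error))
    return v


# Shallow slope: even the lowest bias cannot use up the timing margin.
print(settle_body_bias(k_target=1.0, k_at_zero_bias=0.77, slope=0.1))
# Steep slope: the controller settles inside the allowed range.
print(round(settle_body_bias(1.0, 0.77, slope=1.0, gain=0.5), 3))
```

When the slope is shallow, the controller runs into the lower limit regardless of the individual targets, which is consistent with all regions ending up at the same bias voltage.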

The [RFET](#page-364-1) simulation in [figure 10.13b](#page-294-0) shows different behavior: The slope $\delta t_{\rm PD}/\delta V_{\rm PG}$ is steeper and the behavior of that dependency is exponential, as was shown in [figure 4.21](#page-173-0) on page [151.](#page-173-0) This means that the controllers can change the [PG](#page-363-2) voltage only within a limited range before the delay increases too much. In addition, $\delta I_{\text{leak}}/\delta V_{\text{PG}}$ is comparatively small, as shown in [figure 4.23](#page-175-0) on page [153.](#page-175-0) It should be noted that this leakage current model may however vary a lot with changes in [RFET](#page-364-1) technology and that the measurements used for modelling have a high uncertainty due to difficulty of measuring these

<span id="page-293-0"></span>

**Figure 10.12:** Power reduction and delay for the arm\_core benchmark. Evaluated using the [RFET](#page-364-1) model and without process variation. Rows show from top to bottom: Target delay, achieved delay, control voltage and relative power. Columns show different region sizes. From left to right: 5x5, 10x10 and 25x25.

<span id="page-294-1"></span>values. Refer to [\[2\]](#page-315-0) for details. It should also be noted that while this effect limits the total energy saving possible using  $V_{th}$  scaling in [RFET](#page-364-1) technology, it also means large compensation in performance can be achieved with small changes in program voltage.

<span id="page-294-0"></span>

**Figure 10.13:** Normalized static power *p* for the benchmarks versus power region size, simulated without process variation. **[\(a\)](#page-294-0)** [SOI](#page-364-4) model. **[\(b\)](#page-294-0)** [RFET](#page-364-1) model.

As the region controllers do not reach limits when scaling control voltages for [RFET,](#page-364-1) region size and resource utilization effects can now be seen. In general, unused resources will still be scaled to the minimum control voltage and power consumption. Higher utilization reduces the number of regions in this low power mode and therefore leads to higher power consumption for all region sizes. This can for example be seen easily in the largest benchmark, LU64PEEng, which has the highest power consumption, followed by the second-largest, bgm. Region size effects can also be seen: A single [CLB](#page-361-0) with higher performance requirements in a region causes the whole region to consume more power. Therefore, larger regions are expected to increase power consumption. On the other hand, placement of logic is often clustered, which affects the severity of this effect. [Figure 10.13b](#page-294-0) shows that for benchmarks with limited resource utilization, the difference between fine-grain 1x1 regions and large 50x50 regions is less than 20% and that benchmarks with less resource utilization are less affected by region size. The effects for 25x25 regions are already less severe, so a trade-off between additional logic for voltage scaling and power saving can be made based on these figures.

**Process Variation** [Figure 10.16](#page-298-0) shows similar [FPGA](#page-362-1) maps, but this time for [SOI](#page-364-4) technology and including process variation. Target delay factors $k_{\text{slack}}$ are shown again in the first row to easily compare them to the graphs in the rows below. The figure also shows achieved delay, control voltage and relative power using the same conventions as the previous figures. It can be seen that even for [SOI](#page-364-4) technology, with process variation, not all regions can be operated in the lowest power mode. Higher power regions are required according to a combination of the placement map and the process variation map: Locations where the target slack factor is smaller, or the process variation leads to worse performance, require higher body bias voltages.

[Figure 10.17](#page-299-0) shows the effect for the [RFET](#page-364-1) technology. Due to the previously explained differences in [RFET](#page-364-1) and [SOI](#page-364-4) models, the difference between off-regions and regions where active logic is placed is larger for [RFET](#page-364-1) technology, which means the placement locations can be easily recognized in the second, third and fourth row. When looking closely, process variation can still be seen in the background of the current delay figures. The control parameters also vary slightly, but due to the large $\delta t_{\text{PD}}/\delta V_{\text{PG}}$, this slight variation is not visible in the figures.

[Figure 10.14](#page-296-0) shows aggregate values, obtained in the same way as the ones in [figure 10.13.](#page-294-0) There is little change compared to the data without process variation for the [RFET](#page-364-1) technology, as changes in the control voltage are small due to the large slope of the delay models. Results for [SOI](#page-364-4) are more interesting: Here, it can be seen that with process variation, not all regions are in low-power mode and differences between region sizes and benchmark resource utilization start to show. It can again be seen that larger benchmarks cause higher relative power consumption for all regions. Also, larger regions lead to

<span id="page-296-0"></span>

more current consumption for all benchmarks and affect larger benchmarks more.

**Figure 10.14:** Normalized static power *p* for the benchmarks versus power region size, simulated with process variation. **[\(a\)](#page-296-0)** [SOI](#page-364-4) model. **[\(b\)](#page-296-0)** [RFET](#page-364-1) model.

One interesting observation is the largest benchmark requiring more power than nominal power in the 50x50 region case. This can be explained in the following way: Due to the simple proportional control algorithm used for the region controllers, the control voltage for some regions had not yet settled on a final value. The regions therefore are in a higher performance mode than necessary and use more power. This is especially severe when large regions are used, as the affected area compared to total area becomes larger. In the 50x50 case, the effect is severe enough that more than nominal power is required. Even with the simple algorithm, the problem would be solved by simulating more than 100 time steps. If a faster settling time is demanded, a more elaborate control algorithm should be chosen.
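Why additional time steps help can be seen from a linearized view of the control loop. The following is a sketch under assumptions that are not stated in this section: the controller updates the control voltage as $C_{n+1} = C_n + K_p e_n$ with the error $e_n = k_{\text{CIR}}(C_n) - k_{\text{slack}}$, and the achieved delay factor changes locally linearly with slope $-g$, $g > 0$. The symbols $K_p$, $g$, $e_n$ and $e_{\text{tol}}$ are introduced only for this illustration.

$$e_{n+1} = k_{\text{CIR}}(C_n + K_p e_n) - k_{\text{slack}} \approx (1 - g K_p)\, e_n \quad\Rightarrow\quad |e_n| = |1 - g K_p|^{n}\, |e_0|$$

$$|e_n| \le e_{\text{tol}} \quad\Leftrightarrow\quad n \ge \frac{\ln\left(e_{\text{tol}} / |e_0|\right)}{\ln |1 - g K_p|} \qquad \text{for } 0 < g K_p < 2,\; g K_p \neq 1$$

Under these assumptions the error shrinks geometrically. A region that starts far from its target, or whose product $g K_p$ is small, needs correspondingly more steps, so a fixed budget of 100 steps may end before the tolerance is reached.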

<span id="page-297-0"></span>

**Figure 10.15:** Normalized static power *p* for the benchmarks versus utilization, simulated with process variation. **[\(a\)](#page-297-0)** [SOI](#page-364-4) model. **[\(b\)](#page-297-0)** [RFET](#page-364-1) model.

[Figure 10.15](#page-297-0) evaluates the same data in a slightly different way: Here, the y-axis still depicts relative power, but over an x-axis depicting logic utilization. The tested utilization range is limited according to the available benchmarks. In this figure, it can be clearly seen how a larger utilization leads to higher power consumption. As expected, there is therefore more potential for power saving when there is less resource utilization. Additionally, it can be seen how larger region sizes reduce the obtainable power reduction. The previously mentioned limitation in the 50x50 test can also be seen here.

<span id="page-298-0"></span>

**Figure 10.16:** Process variation and delay for the arm\_core benchmark, evaluated using the [SOI](#page-364-4) model. Rows show from top to bottom: Target delay, achieved delay, control voltage and relative power. Columns show different region sizes. From left to right: 5x5, 10x10 and 25x25.

<span id="page-299-0"></span>

**Figure 10.17:** Process variation and delay for the arm\_core benchmark, evaluated using the [RFET](#page-364-1) model. Rows show from top to bottom: Target delay, achieved delay, control voltage and relative power. Columns show different region sizes. From left to right: 5x5, 10x10 and 25x25.

**Voltage Variation** Voltage variation has been evaluated similarly, using the voltage variation scenario described in [section 4.7.](#page-174-0) For this evaluation, it has been assumed that the power grid is using 5x5 tiles, independent of the power regions implemented for power management. The maps in [figure 10.18](#page-300-0) have been derived using this grid and variation scenario based on the local utilization in each region. Examples for three benchmarks are given here, whereas the voltage variation maps for the remaining benchmarks can be found in [figure F.18](#page-405-0) on page [383.](#page-405-0) The voltage variation maps shown here are independent of the $\epsilon$ value, as it scales all voltage drop values in the same way. [Figure 10.19](#page-301-0) shows the achieved delays for the arm\_core benchmark and $\epsilon = 0.1$ in the [RFET](#page-364-1) model. Heatmaps for the [SOI](#page-364-4) model are similar and are therefore not explicitly shown in this chapter. The voltage variation simulations shown here and in the appendix additionally include process variation as explained in the previous section. Graphs for other values of $\epsilon$ and for [SOI](#page-364-4) are shown in [appendix F.](#page-387-0)
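How a utilization-based voltage drop map can be derived on a 5x5 power grid is sketched below. The exact scenario is defined in [section 4.7](#page-174-0) and is not reproduced here. The sketch simply assumes that the relative drop of a tile equals $\epsilon$ times the fraction of used [CLBs](#page-361-0) in that tile. The function name and the toy placement are hypothetical.

```python
def relative_voltage_drop(used, epsilon, tile=5):
    """Relative supply drop per power-grid tile.

    used:    2D list of booleans, True where a CLB is occupied.
    epsilon: scaling factor for the magnitude of the drops.
    Assumes drop = epsilon * tile utilization (illustrative model).
    """
    rows, cols = len(used), len(used[0])
    drops = []
    for ty in range(0, rows, tile):
        row = []
        for tx in range(0, cols, tile):
            cells = [used[y][x]
                     for y in range(ty, min(ty + tile, rows))
                     for x in range(tx, min(tx + tile, cols))]
            utilization = sum(cells) / len(cells)
            row.append(epsilon * utilization)
        drops.append(row)
    return drops


# Toy device with 5 rows and 10 columns, i.e. two power-grid tiles.
used = [[x < 5 and y < 3 for x in range(10)] for y in range(5)]
print(relative_voltage_drop(used, epsilon=0.1))
print(relative_voltage_drop(used, epsilon=0.2))
```

Since $\epsilon$ enters only as a common factor in this model, the shape of the map is the same for every value of $\epsilon$ and only its magnitude changes.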

<span id="page-300-0"></span>

**Figure 10.18:** Voltage variation maps for example applications. **[\(a\)](#page-300-0)** arm\_core benchmark. **[\(b\)](#page-300-0)** stereovision0 benchmark. **[\(c\)](#page-300-0)** LU64PEEng benchmark.

[Figure 10.20](#page-302-0) shows aggregate statistics for the voltage variation case, similarly to the previously shown aggregate statistics for process variation. Unlike in previous graphs, the graphs here only show relative current instead of relative power: Due to the voltage variations, *VDD* can no longer be assumed to be constant and power can no longer be derived from current using a single factor. The overall trend of larger regions causing higher leakage currents can be seen for both technologies. The effect seems to be independent of the $\epsilon$ value, i.e. the magnitude of the voltage drops. This can be explained by voltage drops occurring in regions with active logic independently of values of $\epsilon$ or region size.
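The distinction between relative current and relative power can be made explicit for a single [CLB](#page-361-0) $j$. The symbols $i_j$, $p_j$, $I_{j,0}$ and $V_{DD,0}$ are introduced only for this sketch and denote the relative current, the relative power, the nominal leakage current and the nominal supply voltage:

$$p_j = \frac{V_{DD,j}\, I_j}{V_{DD,0}\, I_{j,0}} = \frac{V_{DD,j}}{V_{DD,0}}\; i_j, \qquad i_j = \frac{I_j}{I_{j,0}}$$

With a constant supply, $V_{DD,j} = V_{DD,0}$ and therefore $p_j = i_j$, which is why the previous figures can report relative power. With voltage variation the ratio $V_{DD,j}/V_{DD,0}$ differs between locations, so the averaged relative current no longer equals the averaged relative power.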

For the [SOI](#page-364-4) model, differences between $\epsilon$ values are clearly visible. Because of the large range of current values, the y-axis was plotted logarithmically. Values below 1 show operation parameters under which the system is still able to reduce current. For larger $\epsilon$ or larger region sizes, this is not the case and higher currents are accepted when boosting the performance of regions.

The influence of *VDD* in the [RFET](#page-364-1) model is less significant and results for all values are similar. Due to the large influence of the program gate control voltage on the propagation delay, the compensation system is able to compensate completely. Because of this, the effect of different  $\epsilon$  values on the total leakage current is also limited.

<span id="page-301-0"></span>

**Figure 10.19:** Voltage variation compensation for the arm\_core benchmark, evaluated using the [RFET](#page-364-1) model and $\epsilon = 0.1$. Rows show from top to bottom: Achieved delay, control voltage and relative power. Columns show different region sizes. From left to right: 5x5, 10x10 and 25x25.

<span id="page-302-0"></span>

**Figure 10.20:** Normalized leakage current *i* for the benchmarks versus power region size and voltage variation. **[\(a\)](#page-302-0)** [SOI](#page-364-4) model. **[\(b\)](#page-302-0)** [RFET](#page-364-1) model.

**Temperature Variation** Temperature variation results are given in [figure 10.21,](#page-303-0) showing again achieved delay, control voltage and relative power for one exemplary benchmark. Additionally, aggregated statistics in [figure 10.23](#page-304-0) show achieved static power reduction with the hotspot simulations. The local hotspot model introduced in [section 4.7](#page-174-0) used for the evaluation is visualized in [figure 10.22.](#page-304-1) It can be seen that the hotspot appears over time. The evaluation figures however show the results at time step 100, when the final value has settled. Simulation over time is supported in the co-simulator, but the data has not been evaluated here.

[Figures 10.23a](#page-304-0) and [10.23b](#page-304-0) show little effect of the local hotspot. Those graphs closely resemble the average graphs for the process variation evaluation, which is expected as the temperature hotspot scenario was simulated with process variation. The power usage is independent of the hotspot for both technologies. In the [SOI](#page-364-4) model, the hotspot leads to slightly increased local delay. The effect is compensated by the [PVTA](#page-364-0) scheme, but due to the small number of affected regions, the leakage power reduction does not change significantly.

In the [RFET](#page-364-1) model, higher temperature leads to reduced delays. This effect allows reducing the control parameter locally, but this does not lead to visible changes in leakage power either.

<span id="page-303-0"></span>

**Figure 10.21:** Temperature variation compensation for the arm\_core benchmark, evaluated using the [RFET](#page-364-1) model and $T = 100$. Rows show from top to bottom: Achieved delay, control voltage and relative power. Columns show different region sizes. From left to right: 5x5, 10x10 and 25x25.

<span id="page-304-1"></span>

**Figure 10.22:** Local hotspot simulation for various points of time in the simulation. **(a)** $t = 1$. **(b)** $t = 5$. **(c)** $t = 10$. **(d)** $t = 20$.

<span id="page-304-0"></span>

**Figure 10.23:** Normalized static power *p* for the benchmarks versus power region size and temperature variation. **[\(a\)](#page-304-0)** [SOI](#page-364-4) model. **[\(b\)](#page-304-0)** [RFET](#page-364-1) model.

**Aging** Achieved delay, control voltage and relative power for one exemplary benchmark in the aging scenario are shown in [figure 10.24.](#page-306-0) [Figure 10.25](#page-307-0) again shows aggregated static power statistics and additional graphs are again available in [appendix F.](#page-387-0)

Aging was modeled according to [section 4.7](#page-174-0) on page [152](#page-174-0) and, like all other simulations, was simulated with process variation. For [SOI,](#page-364-4) the parameters used to derive the aging parameter A were chosen as $120^{\circ}$C and 1.8 V. As was explained in [section 4.6,](#page-155-0) the [RFET](#page-364-1) aging model was already matched to produce results matching this aging parameter in the [SOI](#page-364-4) technology.

With aging, there is again no large influence on the [RFET](#page-364-1) model, but the [SOI](#page-364-4) power consumption changes after longer aging. This is again related to the delay sensitivity to control voltage changes in the [RFET](#page-364-1) and [SOI](#page-364-4) models. Due to the large sensitivity of the current model in [SOI,](#page-364-4) relative power increases by up to an order of magnitude. The region size again has a large influence on the additional power used. For [RFET,](#page-364-1) high sensitivity of propagation delay and low sensitivity of current to the control voltage again result in little change over time.

The [RFET](#page-364-1) technology therefore seems to be relatively stable against any [PVTA](#page-364-0) influences, as they can be compensated more effectively than in [SOI](#page-364-4) technology. On the other hand, power saving benefits in [RFET](#page-364-1) result primarily from completely unutilized regions: Due to small changes in control voltage and little sensitivity of the leakage current to this control voltage, the effects are limited. This can be beneficial, as increasing relative performance is easier, and detrimental, as saving power is more difficult. Whether this different behavior is beneficial therefore depends on the absolute performance reachable in each technology. Such performance results are however largely dependent on the maturity of the technology and could not be evaluated in this thesis. It should also be remembered that the [RFET](#page-364-1) model partially uses data from the [SOI](#page-364-4) model, such as aging and process variation. When production-ready manufacturing for [RFET](#page-364-1) is introduced, those parts need to be re-modeled to match the technology. Nevertheless, this initial analysis provided certain insights into the differences between the technologies used and showed the viability of a power region based [FPGA](#page-362-1) with [PVTA](#page-364-0) compensation at system level, for both technologies. The introduced co-simulation and the evaluation methodology lay the foundations to evaluate improved models and region controller control strategies in future work.

<span id="page-306-0"></span>

**Figure 10.24:** Aging compensation for the arm\_core benchmark, evaluated using the [RFET](#page-364-1) model and $t = 10y$. Rows show from top to bottom: Achieved delay, control voltage and relative power. Columns show different region sizes. From left to right: 5x5, 10x10 and 25x25.

<span id="page-307-0"></span>

**Figure 10.25:** Normalized static power *p* for the benchmarks versus power region size and aging. **[\(a\)](#page-307-0)** [SOI](#page-364-4) model. **[\(b\)](#page-307-0)** [RFET](#page-364-1) model.


### **Chapter 11**

### **Conclusion and Outlook**

The following sections will quickly summarize the topics and results of this thesis. The final section will then discuss aspects that could not be addressed fully in this thesis and should be addressed in future work.

#### **Summary**

The overall goal of this thesis was to introduce novel ideas to reduce power consumption in [FPGAs.](#page-362-1) Here, the thesis focused largely on static leakage currents, which could be reduced to as little as 2.76% in the most extreme cases, when the whole [FPGA](#page-362-1) could be put into low-power mode with the evaluated [SOI](#page-364-4) technology. Other, related topics were addressed in this document as well: For [PVTA](#page-364-0) compensation, simulation models and a co-simulation system were introduced. For example, this simulator has shown that for region sizes up to 10x10, voltage drops of up to 10% could be compensated in the [RFET](#page-364-1) technology without additional current draw. Similarly, it was shown that small, local temperature hotspots have little effect on overall leakage current with the simulated transistor technologies. It was also shown that process variation can lead to propagation delay variation of up to 30% in the evaluated technologies. Making use of the fact that [FPGA](#page-362-1) applications are commonly checked against worst-case delays in [STA,](#page-364-5) the power reduction to 2.76% could be realized. It was also shown that for some benchmarks, depending on region size, the improvement can be smaller. The results can therefore serve as guidance when selecting the region size for such a power-aware [FPGA](#page-362-1) architecture.

The thesis also introduced [RFET](#page-364-1) based logic generators for [FPGAs,](#page-362-1) evaluating the power savings achievable with [RFET](#page-364-1) technology. It introduced a delay and a power model for this technology, which were used in the final evaluation to <span id="page-310-0"></span>estimate the power savings. Furthermore, it was shown that the [RFET](#page-364-1) technology used has a high sensitivity of propagation delay to the program gate voltage, and little sensitivity of leakage current to this voltage. The high sensitivity of the propagation delay enables more efficient [PVTA](#page-364-0) compensation compared to the tested [SOI](#page-364-4) technology. On the other hand, low sensitivity of the leakage current limits the achievable power reduction compared to [SOI.](#page-364-4) For example, the final evaluation showed a best-case power reduction to 49.1%, whereas the [SOI](#page-364-4) model could reduce power to 2.76%. The results of this thesis suggest that future technology research could focus more on the influence of program gate bias on off-currents. On the other hand, it should be noted that the [RFET](#page-364-1) technology already offers absolute off-current values that are much smaller than the ones in other technologies [\[2\]](#page-315-0). To demonstrate all those benefits, various [FPGA](#page-362-1) aspects have been analyzed with respect to [RFET](#page-364-1) technology.

**Simulation Models** For the final evaluation, system-level simulation models for [RFET](#page-364-1) were needed. This thesis therefore analyzed a way to derive high-level propagation delays when only the device currents are known, as available in device research papers. It further described how to use this propagation delay information for a single cell to determine delays in a whole [FPGA.](#page-362-1) Those delays were then modeled as relative changes, as absolute values were not obtainable from raw device data. The models were then designed to describe [PVTA](#page-364-0) dependencies and control parameters for program gate voltage scaling. In addition to [RFET](#page-364-1) models based on device measurements, an [SOI](#page-364-4) model was introduced to describe a commercial technology for comparison. For this comparison model, propagation delays could directly be obtained from [SPICE](#page-367-0) simulation. As some [PVTA](#page-364-0) dependencies were not known for the [RFET](#page-364-1) technology, they were transferred from the [SOI](#page-364-4) model to the [RFET](#page-364-1) model, obtaining a plausible, but not real, estimated technology.
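As a rough illustration of why relative delays can be derived from device currents alone, the following sketch uses the common first-order approximation that propagation delay scales with $C \cdot V_{DD} / I_{on}$. This is an assumption made for illustration and not necessarily the model derived in this thesis; the function name `relative_delay` and the current values are hypothetical. With an unchanged load capacitance, the capacitance cancels in the ratio, so only the on-currents and supply voltages are needed.

```python
def relative_delay(i_on, i_on_ref, vdd=1.0, vdd_ref=1.0):
    """Delay relative to a reference operating point.

    Assumes the first-order relation t_d ~ C * VDD / I_on with an unchanged
    load capacitance C, so that C cancels in the ratio (illustrative only).
    """
    if i_on <= 0 or i_on_ref <= 0:
        raise ValueError("on-currents must be positive")
    return (vdd / vdd_ref) * (i_on_ref / i_on)


# Hypothetical on-currents in ampere for two program gate voltages.
reference_current = 1.0e-6
boosted_current = 1.25e-6

ratio = relative_delay(boosted_current, reference_current)
print(f"relative delay after boosting: {ratio:.3f}")
```

Under these assumptions, a 25% higher on-current shortens the delay to roughly 80% of the reference value. Absolute delays would additionally require the load capacitance, which matches the observation above that only relative changes could be modeled.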

**Building Circuits** In order to build large circuits using [RFET](#page-364-1) technology, the design and use of an [RFET](#page-364-1) standard cell library was investigated. Whereas simulation of individual cells was performed by Reuter et al. [\[Reu21\]](#page-344-0), the derivation of the standard cell library was part of this thesis. The library was then used for [STA](#page-364-5) of large circuits, focusing on a cryptographic accelerator for the ChaCha cipher. Specifically for [FPGAs,](#page-362-1) the application of [RFETs](#page-364-1) in reconfigurable logic was evaluated. Whereas small reconfigurable cells are available in literature, they cannot directly be used in [FPGAs](#page-362-1) due to their limited expressiveness. Because of that, more complex [ULMs](#page-364-2) have been derived based on these reconfigurable cells to replace the [LUTs](#page-363-1) in [FPGAs.](#page-362-1) Then, efficient strategies to combine these [ULMs](#page-364-2) in clusters were evaluated. This thesis also developed tools and methodology to evaluate the expressiveness of these cells and compare them to reference [LUTs.](#page-363-1) In addition, a toolflow that can map arbitrary circuits to such [ULM-](#page-364-2)based [FPGAs](#page-362-1) instead of [LUT-](#page-363-1)based [FPGAs](#page-362-1) has been introduced.

**Building [FPGAs](#page-362-1)** To use these [RFET](#page-364-1) [ULMs](#page-364-2) in [FPGAs,](#page-362-1) system-level changes in the [FPGA](#page-362-1) architecture were evaluated: Power management regions were introduced to enable local trade-offs between performance and power. To map applications to power regions, two approaches to assign region modes have been introduced: Static assignment and dynamic assignment. In static assignment, body biasing or program gate voltage scaling is determined during user application implementation. The dynamic assignment, which is used in the final [PARFAIT](#page-363-0) architecture, allows changing these control voltages at runtime. For this, modifications in the [EDA](#page-362-4) toolflow have been implemented to characterize application paths and derive slack factors for each region. Whereas this gives a metric for the required performance in a region, a mechanism to measure the actual performance in each region is also required. Here, the novel logic invasion scheme allows measuring the propagation delay of all [CLBs](#page-361-0) on an [FPGA](#page-362-1) with little hardware overhead. Logic invasion uses special auto-generated partial bitstreams to dynamically reprogram parts of the [FPGA](#page-362-1) with delay measurement logic, i.e. ring oscillators and counters. This invasion occurs transparently, not interfering with the user application. It is performed repeatedly to enable measurement of dynamic effects over time. Target slack factors and the measured delays are then compared in a region controller. This controller can adjust the body bias or program gate voltage in a closed-loop manner, as it receives feedback from the delay measurement system.
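The closed-loop behavior described above can be sketched in a few lines. The following is a minimal illustration and not the region controller implemented in this thesis: the class `RegionController`, the step size, the dead band, the normalized voltage range and the linear plant model `toy_region_delay` are all assumptions chosen for readability. It also assumes that a higher control voltage makes the region faster at the cost of more leakage.

```python
from dataclasses import dataclass


@dataclass
class RegionController:
    """Simple step controller for one power region (illustrative only)."""
    target_rel_delay: float   # allowed delay relative to nominal, from the slack factor
    v_ctrl: float = 0.0       # normalized control voltage (0 = low power, 1 = full boost)
    step: float = 0.01        # voltage change per control step
    dead_band: float = 0.02   # relative margin tolerated before saving power

    def update(self, measured_rel_delay: float) -> float:
        """Run one closed-loop step and return the new control voltage."""
        if measured_rel_delay > self.target_rel_delay:
            # Region is too slow: boost performance at the cost of leakage.
            self.v_ctrl = min(1.0, self.v_ctrl + self.step)
        elif measured_rel_delay < self.target_rel_delay * (1.0 - self.dead_band):
            # Region is faster than required: trade speed for lower leakage.
            self.v_ctrl = max(0.0, self.v_ctrl - self.step)
        return self.v_ctrl


def ring_oscillator_rel_delay(count: int, nominal_count: int) -> float:
    """Relative delay from ring-oscillator counts in a fixed time window.

    Fewer oscillations in the same window mean slower logic.
    """
    if count <= 0 or nominal_count <= 0:
        raise ValueError("counts must be positive")
    return nominal_count / count


def toy_region_delay(v_ctrl: float) -> float:
    """Hypothetical plant: relative delay falls linearly with control voltage."""
    return 1.3 - 0.5 * v_ctrl


def settle(controller: RegionController, steps: int = 100) -> float:
    """Feed the measured delay back into the controller for a number of steps."""
    for _ in range(steps):
        controller.update(toy_region_delay(controller.v_ctrl))
    return controller.v_ctrl


controller = RegionController(target_rel_delay=1.1)
final_v = settle(controller)
print(f"settled control voltage: {final_v:.2f}")
```

In this toy setup the controller raises the voltage until the measured delay no longer exceeds the target and then holds it, since the dead band prevents it from oscillating around the target. A real controller would take its measurements from the logic invasion system, for example via counts converted as in `ring_oscillator_rel_delay`.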

**Methodology and Evaluation** To evaluate possible power savings, the derived current and delay models for [SOI](#page-364-4) and [RFET](#page-364-1) technology were used in simulation. For this simulation, a co-simulation framework has been designed to combine [VHDL](#page-364-6) simulation for the [FPGA](#page-362-1) architecture, Lua models for [PVTA](#page-364-0) influence and [VPR](#page-364-3) for target slack factor estimation. The resulting simulator is fully flexible in regard to the used technology models and the simulation scenarios. Whereas this thesis provided some example evaluations, further evaluations can be realized easily. For example, heating evaluation could consider larger hotspots, faster dynamic effects, multiple hot spots etc. In addition to this simulation, the [VFPGA](#page-364-7) was extended to allow for some hardware simulation of the architecture.

## **Outlook**

This thesis has set the foundations for research into power aware [FPGA](#page-362-1) architectures. In some of the addressed topics, there are however opportunities for further research:

Evaluation results depend heavily on the [PVTA](#page-364-0) models. Whereas the [SOI](#page-364-4) model was derived from mature [SPICE](#page-367-0) models, the [RFET](#page-364-1) model was directly extracted from measurement data. This data is not as mature as final [SPICE](#page-367-0) models, so the [RFET](#page-364-1) model could be improved with better source data. For scenario evaluations, more evaluations are possible. In addition, [DSE](#page-362-5) could be performed for more parameters: In this thesis, it was assumed that control voltages can be set to arbitrary values. Further research could evaluate the effect of discrete power levels and the number of levels required for efficient power saving.

The *RGATE* [ULM](#page-364-2) presented in this work can be further improved as well. Such improvements were shown in detail for the *CNT-DR8F* [ULM,](#page-364-2) but could not be implemented for the *RGATE* one because of time restrictions. The *RGATE* cell has various benefits over *CNT-DR8F*: It is a static logic cell, similar to common [CMOS](#page-361-1) cells, and it is implemented in a technology for which detailed characterization data is available. It also reduces the number of required configuration inputs, which is crucial to reduce the [FPGA](#page-362-1) configuration storage. For example, due to the missing optimization, the *RGATE* [FPGAs](#page-362-1) needed three to four times more [CLBs](#page-361-0) to realize the evaluated benchmarks than the [LUT](#page-363-1) reference architecture. However, the *RGATE* [ULM](#page-364-2) consists of only 5 *RGATE*s with 2 bits of configuration each, requiring 10 bits of configuration storage in total. The reference [LUT,](#page-363-1) on the other hand, uses 64 bits of data storage. The *RGATE* [FPGA](#page-362-1) therefore needed less configuration storage, even though it was larger. Increasing [CLB](#page-361-0) numbers however also increase the global interconnect area, which reduces the benefits of smaller storage. Future work should therefore focus on introducing more logic in [CLBs,](#page-361-0) e.g. by doubling the number of *RGATE* [ULMs](#page-364-2) in an [FLE.](#page-362-2) For example, instead of using fracturable cells, both outputs could be connected to full logic trees. This would enable doubling the logic in a [CLB](#page-361-0) without affecting the input or output interconnects. The topic of dynamic power was also not considered in this thesis, as the evaluation focused largely on power loss induced by leakage current.
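The storage figures quoted above can be checked with a short calculation. The sketch below assumes, for simplicity, that the three- to four-fold increase in [CLBs](#page-361-0) translates directly into three to four times as many logic elements and that one *RGATE* [ULM](#page-364-2) takes the place of one reference [LUT.](#page-363-1) It ignores the configuration bits of local and global interconnect, so it only covers the logic element storage; the names used are illustrative.

```python
RGATES_PER_ULM = 5          # RGATE cells per ULM, as stated in the text
CONFIG_BITS_PER_RGATE = 2   # configuration bits per RGATE cell
LUT_CONFIG_BITS = 64        # storage of the reference LUT

# Configuration storage of a single RGATE ULM.
ulm_bits = RGATES_PER_ULM * CONFIG_BITS_PER_RGATE


def storage_ratio(element_overhead: float) -> float:
    """Logic configuration storage of the RGATE design relative to the LUT design.

    element_overhead is the assumed factor by which the number of logic
    elements grows compared to the LUT reference architecture.
    """
    return element_overhead * ulm_bits / LUT_CONFIG_BITS


for overhead in (3, 4):
    print(f"{overhead}x elements -> {storage_ratio(overhead):.1%} of LUT storage")
```

Under these assumptions, even four times as many logic elements need less logic configuration storage than the [LUT](#page-363-1) reference, which is consistent with the statement above. The additional interconnect configuration caused by the higher [CLB](#page-361-0) count is not captured here.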

Nevertheless, this thesis demonstrated the potential of power aware [FPGA](#page-362-1) architectures, i.e. smarter [FPGA](#page-362-1) architectures that perform active power management. Application of logic invasion has shown that fine-grain performance measurement in [FPGAs](#page-362-1) is possible with little additional resource usage. In addition, an interdisciplinary approach has been used to evaluate these power savings on novel [RFET](#page-364-1) technology, even before detailed circuit simulation models are available. The propagation delay models introduced for [RFET](#page-364-1) and [SOI](#page-364-4) in this thesis are thoroughly described and can be easily adapted for other circuit evaluations. Similarly, the developed simulation approach can be used to simulate arbitrary [PVTA](#page-364-0) scenarios. All in all, this thesis demonstrated that [RFET](#page-364-1) based [FPGAs](#page-362-1) are feasible and that the required changes in toolflow and architecture are limited. [RFET](#page-364-1) technology has been shown to provide better compensation of [PVTA](#page-364-0) effects due to the strong dependency of device on-current on program gate voltage. On the other hand, the comparison [SOI](#page-364-4) technology showed better improvements in leakage power reduction, as the sensitivity of off-current to program gate voltage was limited for [RFET.](#page-364-1) For larger power savings, [RFET](#page-364-1) technology research would have to focus on improving this parameter. After the writing of this thesis, improved [RFET](#page-364-1) device measurements with better control of $I_{\text{off}}$ have been researched in the [PARFAIT](#page-363-0) project. Updated evaluation results are intended to be published in late 2024. Besides that, the power management and [PVTA](#page-364-0) compensation system was shown to be technology independent, through evaluation in [SOI](#page-364-4) and [RFET](#page-364-1) technologies. Such a power aware [FPGA](#page-362-1) architecture could therefore be implemented in any technology that allows for power and performance trade-offs in some way.


# **Bibliography**

- **[1]** KRAUSS, Tillmann A.: "Planare elektrostatisch dotierte rekonfigurierbare Schottky-Barriere FDSOI Feldeffekttransistor Strukturen". Dissertation. Darmstadt: Technische Universität Darmstadt, 2019 (cit. on pp. [3,](#page-25-0) [14–](#page-36-0)[16,](#page-38-0) [161\)](#page-183-0).
- <span id="page-315-0"></span>**[2]** GALDERISI, Giulio; BEYER, Christoph; MIKOLAJICK, Thomas and TROMMER, Jens: "Insights into the Temperature Dependent Switching Behaviour of Three-Gated Reconfigurable Field Effect Transistors". In: *physica status solidi (a)* (2023). DOI: [10.1002/pssa.202300019](https://doi.org/10.1002/pssa.202300019) (cit. on pp. [3,](#page-25-0) [6,](#page-28-0) [144–](#page-166-2)[149,](#page-171-0) [152,](#page-174-1) [153,](#page-175-1) [227,](#page-249-1) [228,](#page-250-1) [241,](#page-263-0) [272,](#page-294-1) [288\)](#page-310-0).
- **[3]** ALTIERI SCARPATO, Mauricio: "Digital circuit performance estimation under PVT and aging effects". Thesis. Université Grenoble Alpes, 2017. URL: [https://theses.hal.science/tel-01773745](https://theses.hal.science/tel-01773745) (cit. on pp. [6,](#page-28-0) [12,](#page-34-0) [23,](#page-45-0) [32](#page-54-0)[–35,](#page-57-0) [38](#page-60-0)[–43,](#page-65-0) [133,](#page-155-1) [134,](#page-156-0) [137](#page-159-0)[–140,](#page-162-0) [147\)](#page-169-0).
- **[4]** CLERC, Sylvain; DI GILIO, Thierry and CATHELIN, Andreia, eds.: The Fourth Terminal: Benefits of Body-Biasing Techniques for FDSOI Circuits and Systems. 1st ed. 2020. Integrated Circuits and Systems. Cham: Springer International Publishing and Imprint Springer, 2020 (cit. on p. [11\)](#page-33-0).
- **[5]** RANICA, R.; PLANES, N.; WEBER, O.; THOMAS, O.; HAENDLER, S.; NOBLET, D.; CROAIN, D.; GARDIN, C. and ARNAUD, F.: "FDSOI process/design full solutions for ultra low leakage, high speed and low voltage SRAMs". In: *2013 Symposium on VLSI Circuits* (2013) (cit. on p. [12\)](#page-34-0).
- **[6]** HARTMANN, Joel: "FD-SOI Technology Development and Key Devices Characteristics for Fast, Power Efficient, Low Voltage SoCs". In: *2014 IEEE Compound Semiconductor Integrated Circuit Symposium (CSICS)*. IEEE, 2014, pp. 1–4. DOI: [10.1109/CSICS.2014.6978554](https://doi.org/10.1109/CSICS.2014.6978554) (cit. on p. [12\)](#page-34-0).
- **[7]** SKOTNICKI, Thomas et al.: "Innovative Materials, Devices, and CMOS Technologies for Low-Power Mobile Multimedia". In: *IEEE Transactions on Electron Devices* 55.1 (2008), pp. 96–130. DOI: [10.1109/TED.2007.911338](https://doi.org/10.1109/TED.2007.911338) (cit. on p. [12\)](#page-34-0).
- **[8]** SAKURAI, T. and NEWTON, A. R.: "Alpha-power law MOSFET model and its applications to CMOS inverter delay and other formulas". In: *IEEE Journal of Solid-State Circuits* 25.2 (1990), pp. 584–594. DOI: [10.1109/4.52187](https://doi.org/10.1109/4.52187) (cit. on pp. [12,](#page-34-0) [23\)](#page-45-0).
- **[9]** DASDAN, Ali and HOM, Ivan: "Handling inverted temperature dependence in static timing analysis". In: *ACM Transactions on Design Automation of Electronic Systems* 11.2 (2006), pp. 306–324. DOI: [10.1145/1142155.1142158](https://doi.org/10.1145/1142155.1142158) (cit. on pp. [12,](#page-34-0) [13\)](#page-35-0).
- **[10]** GHONEIM, H.; KNOCH, J.; RIEL, H.; WEBB, D.; BJORK, M. T.; KARG, S.; LORTSCHER, E.; SCHMID, H. and RIESS, W.: "Interface engineering for the suppression of ambipolar behavior in Schottky-barrier MOSFETs". In: *2009 10th International Conference on Ultimate Integration of Silicon*. IEEE, 2009, pp. 69–72. DOI: [10.1109/ULIS.2009.4897541](https://doi.org/10.1109/ULIS.2009.4897541) (cit. on p. [14\)](#page-36-0).
- **[11]** GHONEIM, H.; KNOCH, J.; RIEL, H.; WEBB, D.; BJÖRK, M. T.; KARG, S.; LÖRTSCHER, E.; SCHMID, H. and RIESS, W.: "Suppression of ambipolar behavior in metallic source/drain metal-oxide-semiconductor field-effect transistors". In: *Applied Physics Letters* 95.21 (2009), p. 213504. DOI: [10.1063/1.3266526](https://doi.org/10.1063/1.3266526) (cit. on p. [14\)](#page-36-0).
- **[12]** REENA MONICA, P.: "Seven Strategies to Suppress the Ambipolar Behaviour in CNTFETs: a Review". In: *Silicon* 14.16 (2022), pp. 10199–10216. DOI: [10.1007/s12633-022-01813-5](https://doi.org/10.1007/s12633-022-01813-5) (cit. on p. [14\)](#page-36-0).
- **[13]** LIN, Y.-M.; APPENZELLER, J.; KNOCH, J. and AVOURIS, P.: "High-Performance Carbon Nanotube Field-Effect Transistor With Tunable Polarities". In: *IEEE Transactions on Nanotechnology* 4.5 (2005), pp. 481–489. DOI: [10.1109/TNANO.2005.851427](https://doi.org/10.1109/TNANO.2005.851427) (cit. on pp. [14,](#page-36-0) [17\)](#page-39-0).
- **[14]** KRAUSS, Tillmann; WESSELY, Frank and SCHWALKE, Udo: "Fabrication and simulation of electrically reconfigurable dual metal-gate planar field-effect transistors for dopant-free CMOS". In: *2017 12th International Conference on Design & Technology of Integrated Systems In Nanoscale Era (DTIS)*. IEEE, 2017, pp. 1–6. DOI: [10.1109/DTIS.2017.7930155](https://doi.org/10.1109/DTIS.2017.7930155) (cit. on p. [15\)](#page-37-0).
- **[15]** KRAUSS, Tillmann; WESSELY, Frank and SCHWALKE, Udo: "Electrostatically Doped Planar Field-Effect Transistor for High Temperature Applications". In: *ECS Journal of Solid State Science and Technology* 4.5 (2015), Q46–Q50. DOI: [10.1149/2.0021507jss](https://doi.org/10.1149/2.0021507jss) (cit. on pp. [15,](#page-37-0) [17\)](#page-39-0).
- **[16]** HUETING, Raymond J. E. and GUPTA, Gaurav: "Electrostatic Doping and Devices". In: *Springer Handbook of Semiconductor Devices*. Ed. by RUDAN, Massimo; BRUNETTI, Rossella and REGGIANI, Susanna. Springer Handbooks. Cham: Springer International Publishing, 2023, pp. 371–389. DOI: [10.1007/978-3-030-79827-7\_11](https://doi.org/10.1007/978-3-030-79827-7_11) (cit. on p. [15\)](#page-37-0).
- **[17]** CARTER, R. et al.: "22nm FDSOI technology for emerging mobile, Internet-of-Things, and RF applications". In: *2016 IEEE International Electron Devices Meeting (IEDM)*. IEEE, 2016, pp. 2.2.1–2.2.4. DOI: [10.1109/IEDM.2016.7838029](https://doi.org/10.1109/IEDM.2016.7838029) (cit. on pp. [15,](#page-37-0) [17\)](#page-39-0).
- **[18]** MIKOLAJICK, T.; HEINZIG, A.; TROMMER, J.; BALDAUF, T. and WEBER, W. M.: "The RFET—a reconfigurable nanowire transistor and its application to novel electronic circuits and systems". In: *Semiconductor Science and Technology* 32.4 (2017), p. 043001. DOI: [10.1088/1361-6641/aa5581](https://doi.org/10.1088/1361-6641/aa5581) (cit. on pp. [15,](#page-37-0) [17,](#page-39-0) [53,](#page-75-0) [54\)](#page-76-0).
- **[19]** KRAUSS, Tillmann; WESSELY, Frank and SCHWALKE, Udo: "Reconfigurable electrostatically doped 2.5-gate planar field-effect transistors for dopant-free CMOS". In: *2018 13th International Conference on Design & Technology of Integrated Systems In Nanoscale Era (DTIS)*. IEEE, 2018, pp. 1–4. DOI: [10.1109/DTIS.2018.8368567](https://doi.org/10.1109/DTIS.2018.8368567) (cit. on p. [16\)](#page-38-0).
- **[20]** WEBER, W. M.; HEINZIG, A.; TROMMER, J.; MARTIN, D.; GRUBE, M. and MIKOLAJICK, T.: "Reconfigurable nanowire electronics – A review". In: *Solid-State Electronics* 102 (2014), pp. 12–24. DOI: [10.1016/j.sse.2014.06.010](https://doi.org/10.1016/j.sse.2014.06.010) (cit. on p. [16\)](#page-38-0).
- **[21]** FEI, Wenwen; TROMMER, Jens; LEMME, Max Christian; MIKOLAJICK, Thomas and HEINZIG, André: "Emerging reconfigurable electronic devices based on two–dimensional materials: A review". In: *InfoMat* 4.10 (2022). DOI: [10.1002/inf2.12355](https://doi.org/10.1002/inf2.12355) (cit. on p. [16\)](#page-38-0).
- **[22]** HEINZIG, André; SLESAZECK, Stefan; KREUPL, Franz; MIKOLAJICK, Thomas and WEBER, Walter M.: "Reconfigurable silicon nanowire transistors". In: *Nano letters* 12.1 (2012), pp. 119–124. DOI: [10.1021/nl203094h](https://doi.org/10.1021/nl203094h) (cit. on p. [17\)](#page-39-0).
- **[23]** WESSELY, Frank; KRAUSS, Tillmann and SCHWALKE, Udo: "Reconfigurable CMOS with undoped silicon nanowire midgap Schottky-barrier FETs". In: *Microelectronics Journal* 44.12 (2013), pp. 1072–1076. DOI: [10.1016/j.mejo.2012.08.004](https://doi.org/10.1016/j.mejo.2012.08.004) (cit. on p. [17\)](#page-39-0).
- **[24]** KOO, Sang-Mo; LI, Qiliang; EDELSTEIN, Monica D.; RICHTER, Curt A. and VOGEL, Eric M.: "Enhanced channel modulation in dual-gated silicon nanowire transistors". In: *Nano letters* 5.12 (2005), pp. 2519–2523. DOI: [10.1021/nl051855i](https://doi.org/10.1021/nl051855i) (cit. on p. [17\)](#page-39-0).
- **[25]** COLLI, Alan; TAHRAOUI, Abbes; FASOLI, Andrea; KIVIOJA, Jani M.; MILNE, William I. and FERRARI, Andrea C.: "Top-gated silicon nanowire transistors in a single fabrication step". In: *ACS nano* 3.6 (2009), pp. 1587–1593. DOI: [10.1021/nn900284b](https://doi.org/10.1021/nn900284b) (cit. on p. [17\)](#page-39-0).
- **[26]** TROMMER, Jens; HEINZIG, Andre; BALDAUF, Tim; SLESAZECK, Stefan; MIKOLAJICK, Thomas and WEBER, Walter M.: "Functionality-Enhanced Logic Gate Design Enabled by Symmetrical Reconfigurable Silicon Nanowire Transistors". In: *IEEE Transactions on Nanotechnology* 14.4 (2015), pp. 689–698. DOI: [10.1109/TNANO.2015.2429893](https://doi.org/10.1109/TNANO.2015.2429893) (cit. on pp. [17,](#page-39-0) [54,](#page-76-0) [55\)](#page-77-0).
- **[27]** TROMMER, J.; HEINZIG, A.; SLESAZECK, S.; MUHLE, U.; LOFFLER, M.; WALTER, D.; MAYR, C.; MIKOLAJICK, T. and WEBER, W. M.: "Reconfigurable germanium transistors with low source-drain leakage for secure and energy-efficient doping-free complementary circuits". In: *2017 75th Annual Device Research Conference (DRC)*. IEEE, 2017, pp. 1–2. DOI: [10.1109/DRC.2017.7999426](https://doi.org/10.1109/DRC.2017.7999426) (cit. on p. [17\)](#page-39-0).
- **[28]** TROMMER, Jens et al.: "Enabling Energy Efficiency and Polarity Control in Germanium Nanowire Transistors by Individually Gated Nanojunctions". In: *ACS nano* 11.2 (2017), pp. 1704–1711. DOI: [10.1021/acsnano.6b07531](https://doi.org/10.1021/acsnano.6b07531) (cit. on p. [17\)](#page-39-0).
- **[29]** NAKAHARAI, Shu; IIJIMA, Tomohiko; OGAWA, Shinich; SUZUKI, Shingo; TSUKAGOSHI, Kazuhito; SATO, Shintaro and YOKOYAMA, Naoki: "Electrostatically-reversible polarity of dual-gated graphene transistors with He ion irradiated channel: Toward reconfigurable CMOS applications". In: *2012 International Electron Devices Meeting*. IEEE, 2012, pp. 4.2.1–4.2.4. DOI: [10.1109/IEDM.2012.6478976](https://doi.org/10.1109/IEDM.2012.6478976) (cit. on p. [17\)](#page-39-0).
- **[30]** NAKAHARAI, Shu; YAMAMOTO, Mahito; UENO, Keiji; LIN, Yen-Fu; LI, Song-Lin and TSUKAGOSHI, Kazuhito: "Electrostatically Reversible Polarity of Ambipolar α-MoTe2 Transistors". In: *ACS nano* 9.6 (2015), pp. 5976–5983. DOI: [10.1021/acsnano.5b00736](https://doi.org/10.1021/acsnano.5b00736) (cit. on p. [17\)](#page-39-0).
- **[31]** LARENTIS, Stefano; FALLAHAZAD, Babak; MOVVA, Hema C. P.; KIM, Kyounghwan; RAI, Amritesh; TANIGUCHI, Takashi; WATANABE, Kenji; BANERJEE, Sanjay K. and TUTUC, Emanuel: "Reconfigurable Complementary Monolayer MoTe2 Field-Effect Transistors for Integrated Circuits". In: *ACS nano* 11.5 (2017), pp. 4832–4839. DOI: [10.1021/acsnano.7b01306](https://doi.org/10.1021/acsnano.7b01306) (cit. on p. [17\)](#page-39-0).
- **[32]** RESTA, Giovanni V.; BALAJI, Yashwanth; LIN, Dennis; RADU, Iuliana P.; CATTHOOR, Francky; GAILLARDON, Pierre-Emmanuel and MICHELI, Giovanni de: "Doping-free complementary inverter enabled by 2D WSe2 electrostatically-doped reconfigurable transistors". In: *2018 76th Device Research Conference (DRC)*. IEEE, 2018, pp. 1–2. DOI: [10.1109/DRC.2018.8442152](https://doi.org/10.1109/DRC.2018.8442152) (cit. on p. [17\)](#page-39-0).
- **[33]** PANG, Chin-Sheng and CHEN, Zhihong: "First Demonstration of WSe2 CMOS Inverter with Modulable Noise Margin by Electrostatic Doping". In: *2018 76th Device Research Conference (DRC)*. IEEE, 2018, pp. 1–2. DOI: [10.1109/DRC.2018.8442258](https://doi.org/10.1109/DRC.2018.8442258) (cit. on p. [17\)](#page-39-0).
- **[34]** WU, Peng; AMEEN, Tarek; ZHANG, Huairuo; BENDERSKY, Leonid A.; ILATIKHAMENEH, Hesameddin; KLIMECK, Gerhard; RAHMAN, Rajib; DAVYDOV, Albert V. and APPENZELLER, Joerg: "Complementary Black Phosphorus Tunneling Field-Effect Transistors". In: *ACS nano* 13.1 (2019), pp. 377–385. DOI: [10.1021/acsnano.8b06441](https://doi.org/10.1021/acsnano.8b06441) (cit. on p. [17\)](#page-39-0).
- **[35]** KOLODINSKI, S. et al.: "IPCEI subcontracts contributing to 22-FDX Add-On Functionalities at GF". In: *ESSDERC 2019 - 49th European Solid-State Device Research Conference (ESSDERC)*. IEEE, 2019, pp. 74–77. DOI: [10.1109/ESSDERC.2019.8901736](https://doi.org/10.1109/ESSDERC.2019.8901736) (cit. on p. [17\)](#page-39-0).
- **[36]** SIMON, Maik; MULAOSMANOVIC, Halid; SESSI, Violetta; DRESCHER, Maximilian; BHATTACHARJEE, Niladri; SLESAZECK, Stefan; WIATR, Maciej; MIKOLAJICK, Thomas and TROMMER, Jens: "Three-to-one analog signal modulation with a single back-bias-controlled reconfigurable transistor". In: *Nature communications* 13.1 (2022), p. 7042. DOI: [10.1038/s41467-022-34533-w](https://doi.org/10.1038/s41467-022-34533-w) (cit. on p. [17\)](#page-39-0).
- **[37]** ZHANG, Jian; TANG, Xifan; GAILLARDON, Pierre-Emmanuel and MICHELI, Giovanni de: "Configurable Circuits Featuring Dual-Threshold-Voltage Design With Three-Independent-Gate Silicon Nanowire FETs". In: *IEEE Transactions on Circuits and Systems I: Regular Papers* 61.10 (2014), pp. 2851–2861. DOI: [10.1109/TCSI.2014.2333675](https://doi.org/10.1109/TCSI.2014.2333675) (cit. on pp. [17,](#page-39-0) [18\)](#page-40-0).
- **[38]** GAILLARDON, Pierre-Emmanuel; BEIGNE, Edith; LESECQ, Suzanne and MICHELI, Giovanni de: "A Survey on Low-Power Techniques with Emerging Technologies". In: *ACM Journal on Emerging Technologies in Computing Systems* 12.2 (2015), pp. 1–26. DOI: [10.1145/2714566](https://doi.org/10.1145/2714566) (cit. on pp. [17,](#page-39-0) [18\)](#page-40-0).
- **[39]** MARCHI, M. de; SACCHETTO, D.; FRACHE, S.; ZHANG, J.; GAILLARDON, P.-E.; LEBLEBICI, Y. and MICHELI, G. de: "Polarity control in double-gate, gate-all-around vertically stacked silicon nanowire FETs". In: *2012 International Electron Devices Meeting*. IEEE, 2012, pp. 8.4.1–8.4.4. DOI: [10.1109/IEDM.2012.6479004](https://doi.org/10.1109/IEDM.2012.6479004) (cit. on p. [17\)](#page-39-0).
- **[40]** WESSELY, F.; KRAUSS, T. and SCHWALKE, U.: "Virtually dopant-free CMOS: Midgap Schottky-barrier nanowire field-effect-transistors for high temperature applications". In: *Solid-State Electronics* 74 (2012), pp. 91–96. DOI: [10.1016/j.sse.2012.04.017](https://doi.org/10.1016/j.sse.2012.04.017) (cit. on p. [17\)](#page-39-0).
- **[41]** NI, Wangze; ZHANG, Yichi; HUANG, Bairun and CHEN, Zhuojun: "The Impact of Temperature on Reconfigurable Field-Effect Transistor and Its Applications". In: *2021 9th International Symposium on Next Generation Electronics (ISNE)*. IEEE, 2021, pp. 1–4. DOI: [10.1109/ISNE48910.2021.9493616](https://doi.org/10.1109/ISNE48910.2021.9493616) (cit. on p. [17\)](#page-39-0).
- **[42]** GALDERISI, Giulio; MIKOLAJICK, Thomas and TROMMER, Jens: "Robust Reconfigurable Field Effect Transistors Process Route Enabling Multi-VT Devices Fabrication for Hardware Security Applications". In: *2022 Device Research Conference (DRC)*. IEEE, 2022, pp. 1–2. DOI: [10.1109/DRC55272.2022.9855805](https://doi.org/10.1109/DRC55272.2022.9855805) (cit. on pp. [18,](#page-40-0) [144,](#page-166-2) [145,](#page-167-0) [147,](#page-169-0) [227\)](#page-249-1).
- **[43]** ZHANG, Jian; MARCHI, Michele de; SACCHETTO, Davide; GAILLARDON, Pierre-Emmanuel; LEBLEBICI, Yusuf and MICHELI, Giovanni de: "Polarity-Controllable Silicon Nanowire Transistors With Dual Threshold Voltages". In: *IEEE Transactions on Electron Devices* 61.11 (2014), pp. 3654–3660. DOI: [10.1109/TED.2014.2359112](https://doi.org/10.1109/TED.2014.2359112) (cit. on p. [18\)](#page-40-0).
- **[44]** BAKER, Russel Jacob: CMOS circuit design, layout, and simulation. Fourth edition. Vol. 22. IEEE Press series on microelectronic systems. Piscataway, NJ and Hoboken, New Jersey: IEEE Press and Wiley, 2019 (cit. on p. [20\)](#page-42-0).
- **[45]** WESTE, Neil H. E. and HARRIS, David Money: CMOS VLSI design: A circuits and systems perspective. 4. ed. Boston, Mass.: Addison-Wesley, 2011 (cit. on pp. [22,](#page-44-0) [23\)](#page-45-0).
- **[46]** KAHNG, Andrew B.; LIENIG, Jens; MARKOV, Igor L. and HU, Jin: VLSI Physical Design: From Graph Partitioning to Timing Closure. Cham: Springer International Publishing, 2022. DOI: [10.1007/978-3-030-96415-3](https://doi.org/10.1007/978-3-030-96415-3) (cit. on pp. [24–](#page-46-0)[26,](#page-48-0) [65,](#page-87-0) [66\)](#page-88-0).
- **[47]** HUFF, Michael: "Review—Important Considerations Regarding Device Parameter Process Variations in Semiconductor-Based Manufacturing". In: *ECS Journal of Solid State Science and Technology* 10.6 (2021), p. 064002. DOI: [10.1149/2162-8777/ac02a4](https://doi.org/10.1149/2162-8777/ac02a4) (cit. on pp. [28,](#page-50-0) [30\)](#page-52-0).
- **[48]** MAY, Gary S. and SPANOS, Costas J.: Fundamentals of semiconductor manufacturing and process control. Hoboken, NJ: Wiley-Interscience, 2006 (cit. on p. [28\)](#page-50-0).
- **[49]** VAN ZANT, Peter: Microchip fabrication: A practical guide to semiconductor processing. 6. ed. New York, NY: McGraw-Hill, 2014 (cit. on p. [28\)](#page-50-0).
- **[50]** GENG, Hwaiyu: Semiconductor Manufacturing Handbook, Second Edition. 2nd edition. New York, N.Y.: McGraw-Hill Education and McGraw Hill, 2017 (cit. on p. [28\)](#page-50-0).
- **[51]** BLAAUW, D.; CHOPRA, K.; SRIVASTAVA, A. and SCHEFFER, L.: "Statistical Timing Analysis: From Basic Principles to State of the Art". In: *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems* 27.4 (2008), pp. 589–607. DOI: [10.1109/TCAD.2007.907047](https://doi.org/10.1109/TCAD.2007.907047) (cit. on pp. [29,](#page-51-0) [85\)](#page-107-0).
- **[52]** CHAMPAC, Victor and GARCIA GERVACIO, Jose: Timing Performance of Nanometer Digital Circuits Under Process Variations. Vol. 39. Cham: Springer International Publishing, 2018. DOI: [10.1007/978-3-319-75465-9](https://doi.org/10.1007/978-3-319-75465-9) (cit. on pp. [29–](#page-51-0)[31\)](#page-53-0).
- **[53]** SARANGI, Smruti R.; GRESKAMP, Brian; TEODORESCU, Radu; NAKANO, Jun; TIWARI, Abhishek and TORRELLAS, Josep: "VARIUS: A Model of Process Variation and Resulting Timing Errors for Microarchitects". In: *IEEE Transactions on Semiconductor Manufacturing* 21.1 (2008), pp. 3–13. DOI: [10.1109/TSM.2007.913186](https://doi.org/10.1109/TSM.2007.913186) (cit. on pp. [31,](#page-53-0) [153\)](#page-175-1).
- **[54]** AGARWAL, A.; BLAAUW, D. and ZOLOTOV, V.: "Statistical timing analysis for intra-die process variations with spatial correlations". In: *ICCAD-2003. International Conference on Computer Aided Design (IEEE Cat. No.03CH37486)*. IEEE, 2003, pp. 900–907. DOI: [10.1109/ICCAD.2003.159781](https://doi.org/10.1109/ICCAD.2003.159781) (cit. on p. [31\)](#page-53-0).
- **[55]** AGARWAL, A.; BLAAUW, D.; ZOLOTOV, V.; SUNDARESWARAN, S.; ZHAO, Min; GALA, K. and PANDA, R.: "Statistical delay computation considering spatial correlations". In: *Proceedings of the ASP-DAC Asia and South Pacific Design Automation Conference, 2003*. IEEE, 2003, pp. 271–276. DOI: [10.1109/ASPDAC.2003.1195028](https://doi.org/10.1109/ASPDAC.2003.1195028) (cit. on p. [31\)](#page-53-0).
- **[56]** KESHAVARZI, Ali et al.: "Measurements and modeling of intrinsic fluctuations in MOSFET threshold voltage". In: *Proceedings of the 2005 international symposium on Low power electronics and design - ISLPED '05*. Ed. by ROY, Kaushik and TIWARI, Vivek. New York, New York, USA: ACM Press, 2005, p. 26. DOI: [10.1145/1077603.1077611](https://doi.org/10.1145/1077603.1077611) (cit. on pp. [31,](#page-53-0) [32,](#page-54-0) [153\)](#page-175-1).
- **[57]** FRIEDBERG, P.; CAO, Yu; CAIN, J.; WANG, Ruth; RABAEY, Jan and SPANOS, C.: "Modeling Within-Die Spatial Correlation Effects for Process-Design Co-Optimization". In: *Sixth International Symposium on Quality of Electronic Design (ISQED'05)*. IEEE, 2005, pp. 516–521. DOI: [10.1109/ISQED.2005.82](https://doi.org/10.1109/ISQED.2005.82) (cit. on p. [31\)](#page-53-0).
- **[58]** PANG, Liang-Teck and NIKOLIC, Borivoje: "Measurements and Analysis of Process Variability in 90 nm CMOS". In: *IEEE Journal of Solid-State Circuits* 44.5 (2009), pp. 1655–1663. DOI: [10.1109/JSSC.2009.2015789](https://doi.org/10.1109/JSSC.2009.2015789) (cit. on p. [31\)](#page-53-0).
- **[59]** CHANG, Hongliang and SAPATNEKAR, S. S.: "Statistical timing analysis under spatial correlations". In: *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems* 24.9 (2005), pp. 1467–1482. DOI: [10.1109/TCAD.2005.850834](https://doi.org/10.1109/TCAD.2005.850834) (cit. on p. [33\)](#page-55-0).
- **[60]** GHASEMZADEH MOHAMMADI, Hassan; GAILLARDON, Pierre-Emmanuel and MICHELI, Giovanni de: "Efficient Statistical Parameter Selection for Nonlinear Modeling of Process/Performance Variation". In: *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems* 35.12 (2016), pp. 1995–2007. DOI: [10.1109/TCAD.2016.2547908](https://doi.org/10.1109/TCAD.2016.2547908) (cit. on pp. [33,](#page-55-0) [153\)](#page-175-1).
- **[61]** LANGE, Andre; SOHRMANN, Christoph; JANCKE, Roland; HAASE, Joachim; CHENG, Binjie; ASENOV, Asen and SCHLICHTMANN, Ulf: "Multivariate Modeling of Variability Supporting Non-Gaussian and Correlated Parameters". In: *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems* 35.2 (2016), pp. 197–210. DOI: [10.1109/TCAD.2015.2459042](https://doi.org/10.1109/TCAD.2015.2459042) (cit. on pp. [33,](#page-55-0) [153\)](#page-175-1).
- **[62]** LANGE, Andre; SOHRMANN, Christoph; JANCKE, Roland; HAASE, Joachim; LORENZ, Ingolf and SCHLICHTMANN, Ulf: "Probabilistic standard cell modeling considering non-Gaussian parameters and correlations". In: *Design, Automation & Test in Europe Conference & Exhibition (DATE), 2014*. New Jersey: IEEE Conference Publications, 2014, pp. 1–4. DOI: [10.7873/DATE.2014.243](https://doi.org/10.7873/DATE.2014.243) (cit. on p. [33\)](#page-55-0).
- **[63]** KISHORE, Sajja Krishna; PATNALA, Tulasi Radhika; TIGADI, Arun S. and JAMSHED, Aatif: "An On-chip Analysis of the VLSI designs under Process Variations". In: *2020 International Conference on Smart Electronics and Communication (ICOSEC)*. IEEE, 2020, pp. 1273–1277. DOI: [10.1109/ICOSEC49089.2020.9215244](https://doi.org/10.1109/ICOSEC49089.2020.9215244) (cit. on p. [33\)](#page-55-0).
- **[64]** SHAH, Nivana; HD, Nataraj Urs; GADHAWE, Akanksha and SAXENA, Ankit: "Study of the Impact of Variations on Standard Cells". In: *Indian Journal of Science and Technology* 12.36 (2019), pp. 1–5. DOI: [10.17485/ijst/2019/v12i36/147751](https://doi.org/10.17485/ijst/2019/v12i36/147751) (cit. on p. [33\)](#page-55-0).
- **[65]** JIN, Leilei; FU, Wenjie; YAN, Hao and SHI, Longxing: "A Statistical Cell Delay Model for Estimating the 3σ Delay by Matching Kurtosis". In: *IEEE Transactions on Circuits and Systems II: Express Briefs* 69.6 (2022), pp. 2932–2936. DOI: [10.1109/TCSII.2022.3157981](https://doi.org/10.1109/TCSII.2022.3157981) (cit. on p. [33\)](#page-55-0).
- **[66]** WIRNSHOFER, Martin: Variation-aware adaptive voltage scaling for digital CMOS circuits. Vol. 41. Springer series in advanced microelectronics. Dordrecht: Springer, 2013. DOI: [10.1007/978-94-007-6196-4](https://doi.org/10.1007/978-94-007-6196-4) (cit. on pp. [34,](#page-56-0) [35\)](#page-57-0).
- **[67]** BORKAR, Shekhar; KARNIK, Tanay; NARENDRA, Siva; TSCHANZ, Jim; KESHAVARZI, Ali and DE, Vivek: "Parameter variations and impact on circuits and microarchitecture". In: *Proceedings of the 40th annual Design Automation Conference*. Ed. by GETREU, Ian; FIX, Limor and LAVAGNO, Luciano. New York, NY, USA: ACM, 2003, pp. 338–342. DOI: [10.1145/775832.775920](https://doi.org/10.1145/775832.775920) (cit. on pp. [34,](#page-56-0) [36,](#page-58-0) [38\)](#page-60-0).
- **[68]** WONG, K. L.; RAHAL-ARABI, T.; MA, M. and TAYLOR, G.: "Enhancing Microprocessor Immunity to Power Supply Noise With Clock-Data Compensation". In: *IEEE Journal of Solid-State Circuits* 41.4 (2006), pp. 749–758. DOI: [10.1109/JSSC.2006.870925](https://doi.org/10.1109/JSSC.2006.870925) (cit. on p. [34\)](#page-56-0).
- **[69]** GNAD, Dennis R. E.; OBORIL, Fabian; KIAMEHR, Saman and TAHOORI, Mehdi B.: "An Experimental Evaluation and Analysis of Transient Voltage Fluctuations in FPGAs". In: *IEEE Transactions on Very Large Scale Integration (VLSI) Systems* 26.10 (2018), pp. 1817–1830. DOI: [10.1109/TVLSI.2018.2848460](https://doi.org/10.1109/TVLSI.2018.2848460) (cit. on pp. [34,](#page-56-0) [90,](#page-112-0) [104\)](#page-126-0).
- **[70]** KRAUTTER, Jonas: "Analysis and Mitigation of Remote Side-Channel and Fault Attacks on the Electrical Level". PhD thesis. 2022. DOI: [10.5445/IR/1000144660](https://doi.org/10.5445/IR/1000144660) (cit. on p. [35\)](#page-57-0).
- **[71]** LARSSON, Patrik: "di/dt Noise in CMOS Integrated Circuits". In: *Analog Design Issues in Digital VLSI Circuits and Systems*. Ed. by BECERRA, Juan J. and FRIEDMAN, Eby G. Boston, MA: Springer US, 1997, pp. 113–129. DOI: [10.1007/978-1-4615-6101-9\_10](https://doi.org/10.1007/978-1-4615-6101-9_10) (cit. on p. [35\)](#page-57-0).
- **[72]** GUPTA, Meeta S.; OATLEY, Jarod L.; JOSEPH, Russ; WEI, Gu-Yeon and BROOKS, David M.: "Understanding Voltage Variations in Chip Multiprocessors using a Distributed Power-Delivery Network". In: *2007 Design, Automation & Test in Europe Conference & Exhibition*. IEEE, 2007, pp. 1–6. DOI: [10.1109/DATE.2007.364663](https://doi.org/10.1109/DATE.2007.364663) (cit. on p. [35\)](#page-57-0).
- **[73]** KIAMEHR, Saman; EBRAHIMI, Mojtaba; GOLANBARI, Mohammad Saber and TAHOORI, Mehdi B.: "Temperature-Aware Dynamic Voltage Scaling to Improve Energy Efficiency of Near-Threshold Computing". In: *IEEE Transactions on Very Large Scale Integration (VLSI) Systems* 25.7 (2017), pp. 2017–2026. DOI: [10.1109/TVLSI.2017.2669375](https://doi.org/10.1109/TVLSI.2017.2669375) (cit. on p. [36\)](#page-58-0).
- **[74]** MONDAL, S.; MUKHERJEE, R. and MEMIK, S. O.: "Fine-Grain Thermal Profiling and Sensor Insertion for FPGAs". In: *2006 IEEE International Symposium on Circuits and Systems*. IEEE, 2006, pp. 4387–4390. DOI: [10.1109/ISCAS.2006.1693601](https://doi.org/10.1109/ISCAS.2006.1693601) (cit. on p. [36\)](#page-58-0).
- **[75]** SUNDARARAJAN, Priya; GAYASEN, Aman; VIJAYKRISHNAN, N. and TUAN, T.: "Thermal characterization and optimization in platform FPGAs". In: *Proceedings of the 2006 IEEE/ACM international conference on Computer-aided design - ICCAD '06*. Ed. by HASSOUN, Soha. New York, New York, USA: ACM Press, 2006, p. 443. DOI: [10.1145/1233501.1233589](https://doi.org/10.1145/1233501.1233589) (cit. on p. [36\)](#page-58-0).
- **[76]** AMOURI, Abdulazim; AMROUCH, Hussam; EBI, Thomas; HENKEL, Jorg and TAHOORI, Mehdi: "Accurate Thermal-Profile Estimation and Validation for FPGA-Mapped Circuits". In: *2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines*. IEEE, 2013, pp. 57–60. DOI: [10.1109/FCCM.2013.48](https://doi.org/10.1109/FCCM.2013.48) (cit. on pp. [36,](#page-58-0) [37\)](#page-59-0).
- **[77]** AMOURI, Abdulazim; HEPP, Jochen and TAHOORI, Mehdi: "Built-In Self-Heating Thermal Testing of FPGAs". In: *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems* 35.9 (2016), pp. 1546–1556. DOI: [10.1109/TCAD.2015.2512905](https://doi.org/10.1109/TCAD.2015.2512905) (cit. on p. [36\)](#page-58-0).
- **[78]** AMROUCH, Hussam; EBI, Thomas; SCHNEIDER, Josef; PARAMESWARAN, Sridevan and HENKEL, Jorg: "Analyzing the thermal hotspots in FPGA-based embedded systems". In: *2013 23rd International Conference on Field programmable Logic and Applications*. IEEE, 2013, pp. 1–4. DOI: [10.1109/FPL.2013.6645567](https://doi.org/10.1109/FPL.2013.6645567) (cit. on p. [36\)](#page-58-0).
- **[79]** AMOURI, Abdulazim: "Degradation in FPGAs: Monitoring, Modeling and Mitigation". PhD thesis. 2015. DOI: [10.5445/IR/1000051435](https://doi.org/10.5445/IR/1000051435) (cit. on p. [36\)](#page-58-0).
- **[80]** LU, Weina; HU, Yu; YE, Jing and LI, Xiaowei: "TeSHoP: A Temperature Sensing based Hotspot-Driven Placement technique for FPGAs". In: *2016 26th International Conference on Field Programmable Logic and Applications (FPL)*. IEEE, 2016, pp. 1–4. DOI: [10.1109/FPL.2016.7577304](https://doi.org/10.1109/FPL.2016.7577304) (cit. on p. [37\)](#page-59-0).
- **[81]** SOLEIMANI, S.; AFZALI-KUSHA, A. and FOROUZANDEH, B.: "Temperature dependence of propagation delay characteristic in FinFET circuits". In: *2008 International Conference on Microelectronics*. IEEE, 2008, pp. 276–279. DOI: [10.1109/ICM.2008.5393513](https://doi.org/10.1109/ICM.2008.5393513) (cit. on pp. [37,](#page-59-0) [38\)](#page-60-0).
- **[82]** KUMAR, R. and KURSUN, V.: "Impact of temperature fluctuations on circuit characteristics in 180nm and 65nm CMOS technologies". In: *2006 IEEE International Symposium on Circuits and Systems*. IEEE, 2006, p. 4. DOI: [10.1109/ISCAS.2006.1693470](https://doi.org/10.1109/ISCAS.2006.1693470) (cit. on p. [38\)](#page-60-0).
- **[83]** KUMAR, R. and KURSUN, V.: "Reversed Temperature-Dependent Propagation Delay Characteristics in Nanometer CMOS Circuits". In: *IEEE Transactions on Circuits and Systems II: Express Briefs* 53.10 (2006), pp. 1078–1082. DOI: [10.1109/TCSII.2006.882218](https://doi.org/10.1109/TCSII.2006.882218) (cit. on p. [38\)](#page-60-0).
- **[84]** LANGE, André; GONZALEZ, Fabio A. Velarde; GIERING, Kay-Uwe; VERVANTIDIS, Anastasios; HAHNE, Lukas; HEINIG, Andy and JANCKE, Roland: "A general approach for degradation modeling to enable a widespread use of aging simulations in IC design". In: *Microelectronics Reliability* 137 (2022), p. 114775. DOI: [10.1016/j.microrel.2022.114775](https://doi.org/10.1016/j.microrel.2022.114775) (cit. on pp. [38,](#page-60-0) [41\)](#page-63-0).
- **[85]** WANG, Xiaofei; TANG, Qianying; JAIN, Pulkit; JIAO, Dong and KIM, Chris H.: "The Dependence of BTI and HCI-Induced Frequency Degradation on Interconnect Length and Its Circuit Level Implications". In: *IEEE Transactions on Very Large Scale Integration (VLSI) Systems* 23.2 (2015), pp. 280–291. DOI: [10.1109/TVLSI.2014.2307589](https://doi.org/10.1109/TVLSI.2014.2307589) (cit. on p. [39\)](#page-61-0).
- **[86]** TYAGINOV, Stanislav; JECH, Markus; FRANCO, Jacopo; SHARMA, Prateek; KACZER, Ben and GRASSER, Tibor: "Understanding and Modeling the Temperature Behavior of Hot-Carrier Degradation in SiON nMOSFETs". In: *IEEE Electron Device Letters* 37.1 (2016), pp. 84–87. DOI: [10.1109/LED.2015.2503920](https://doi.org/10.1109/LED.2015.2503920) (cit. on p. [39\)](#page-61-0).
- **[87]** KHOSHAVI, Navid; ASHRAF, Rizwan A.; DEMARA, Ronald F.; KIAMEHR, Saman; OBORIL, Fabian and TAHOORI, Mehdi B.: "Contemporary CMOS aging mitigation techniques: Survey, taxonomy, and methods". In: *Integration* 59 (2017), pp. 10–22. DOI: [10.1016/j.vlsi.2017.03.013](https://doi.org/10.1016/j.vlsi.2017.03.013) (cit. on pp. [39,](#page-61-0) [41,](#page-63-0) [42,](#page-64-0) [91\)](#page-113-0).
- **[88]** VELAMALA, Jyothi Bhaskarr; SUTARIA, Ketul; SATO, Takashi and CAO, Yu: "Physics matters: Statistical aging prediction under trapping/detrapping". In: *Proceedings of the 49th Annual Design Automation Conference*. Ed. by GROENEVELD, Patrick; SCIUTO, Donatella and HASSOUN, Soha. New York, NY, USA: ACM, 2012, pp. 139–144. DOI: [10.1145/2228360.2228388](https://doi.org/10.1145/2228360.2228388) (cit. on p. [39\)](#page-61-0).
- **[89]** GUO, Xinfei; BURLESON, Wayne and STAN, Mircea: "Modeling and Experimental Demonstration of Accelerated Self-Healing Techniques". In: *Proceedings of the 51st Annual Design Automation Conference*. New York, NY, USA: ACM, 2014, pp. 1–6. DOI: [10.1145/2593069.2593162](https://doi.org/10.1145/2593069.2593162) (cit. on p. [39\)](#page-61-0).
- **[90]** YE, Wei; ALAWIEH, Mohamed Baker; HSU, Che-Lun; LIN, Yibo and PAN, David Z.: "Dealing with Aging and Yield in Scaled Technologies". In: *Dependable Embedded Systems*. Ed. by HENKEL, Jörg and DUTT, Nikil. Embedded Systems. Cham: Springer International Publishing, 2021, pp. 409–429. DOI: [10.1007/978-3-030-52017-5\_17](https://doi.org/10.1007/978-3-030-52017-5_17) (cit. on p. [40\)](#page-62-0).
- **[91]** ARNAUD, Lucile; TARTAVEL, G.; BERGER, T.; MARIOLLE, D.; GOBIL, Y. and TOUET, I.: "Microstructure and electromigration in copper damascene lines". In: *Microelectronics Reliability* 40.1 (2000), pp. 77–86. DOI: [10.1016/S0026-2714\(99\)00209-7](https://doi.org/10.1016/S0026-2714(99)00209-7) (cit. on p. [40\)](#page-62-0).
- **[92]** LANGE, Andre: Aging Models: The Basis For Predicting Circuit Reliability. 2018. URL: [https://semiengineering.com/aging-models-the-basis-for-predicting-circuit-reliability/](https://semiengineering.com/aging-models-the-basis-for-predicting-circuit-reliability/) (visited on 04/25/2023) (cit. on p. [41\)](#page-63-0).
- **[93]** Open Model Interface Provides Standard for Advanced SPICE Capabilities. URL: <https://si2.org/open-model/> (visited on 04/25/2023) (cit. on p. [41\)](#page-63-0).
- **[94]** NAPHADE, T.; GOEL, N.; NAIR, P. R. and MAHAPATRA, S.: "Investigation of stochastic implementation of reaction diffusion (RD) models for NBTI related interface trap generation". In: *2013 IEEE International Reliability Physics Symposium (IRPS)*. IEEE, 2013, XT.5.1–XT.5.11. DOI: [10.1109/IRPS.2013.6532120](https://doi.org/10.1109/IRPS.2013.6532120) (cit. on p. [41\)](#page-63-0).
- **[95]** KACZER, B.; GRASSER, T.; ROUSSEL, Ph. J.; FRANCO, J.; DEGRAEVE, R.; RAGNARSSON, L.-A.; SIMOEN, E.; GROESENEKEN, G. and REISINGER, H.: "Origin of NBTI variability in deeply scaled pFETs". In: *2010 IEEE International Reliability Physics Symposium*. IEEE, 2010, pp. 26–32. DOI: [10.1109/IRPS.2010.5488856](https://doi.org/10.1109/IRPS.2010.5488856) (cit. on p. [41\)](#page-63-0).
- **[96]** NUNES, Cícero; BUTZEN, Paulo F.; REIS, André I. and RIBAS, Renato P.: "BTI, HCI and TDDB aging impact in flip–flops". In: *Microelectronics Reliability* 53.9-11 (2013), pp. 1355–1359. DOI: [10.1016/j.microrel.2013.07.044](https://doi.org/10.1016/j.microrel.2013.07.044) (cit. on p. [41\)](#page-63-0).
- **[97]** JAFARI, Atousa; RAJI, Mohsen and GHAVAMI, Behnam: "Impacts of Process Variations and Aging on Lifetime Reliability of Flip-Flops: A Comparative Analysis". In: *IEEE Transactions on Device and Materials Reliability* 19.3 (2019), pp. 551–562. DOI: [10.1109/TDMR.2019.2933998](https://doi.org/10.1109/TDMR.2019.2933998) (cit. on p. [41\)](#page-63-0).
- **[98]** LORENZ, Dominik; BARKE, Martin and SCHLICHTMANN, Ulf: "Aging analysis at gate and macro cell level". In: *2010 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)*. IEEE, 2010, pp. 77–84. DOI: [10.1109/ICCAD.2010.5654309](https://doi.org/10.1109/ICCAD.2010.5654309) (cit. on p. [41\)](#page-63-0).
- **[99]** LU, Yinghai; SHANG, Li; ZHOU, Hai; ZHU, Hengliang; YANG, Fan and ZENG, Xuan: "Statistical reliability analysis under process variation and aging effects". In: *Proceedings of the 46th Annual Design Automation Conference*. New York, NY, USA: ACM, 2009, pp. 514–519. DOI: [10.1145/1629911.1630044](https://doi.org/10.1145/1629911.1630044) (cit. on p. [41\)](#page-63-0).
- **[100]** KIAMEHR, Saman; WECKX, Pieter; TAHOORI, Mehdi; KACZER, Ben; KUKNER, Halil; RAGHAVAN, Praveen; GROESENEKEN, Guido and CATTHOOR, Francky: "The impact of process variation and stochastic aging in nanoscale VLSI". In: *2016 IEEE International Reliability Physics Symposium (IRPS)*. IEEE, 2016, CR-1-1–CR-1-6. DOI: [10.1109/IRPS.2016.7574590](https://doi.org/10.1109/IRPS.2016.7574590) (cit. on p. [41\)](#page-63-0).
- **[101]** OBORIL, Fabian and TAHOORI, Mehdi B.: "ExtraTime: Modeling and analysis of wearout due to transistor aging at microarchitecture-level". In: *IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012)*. IEEE, 2012, pp. 1–12. DOI: [10.1109/DSN.2012.6263957](https://doi.org/10.1109/DSN.2012.6263957) (cit. on p. [42\)](#page-64-0).
- **[102]** FIROUZI, Farshad; KIAMEHR, Saman; TAHOORI, Mehdi and NASSIF, Sani: "Incorporating the Impacts of Workload-Dependent Runtime Variations into Timing Analysis". In: *Design, Automation & Test in Europe Conference & Exhibition (DATE), 2013*. New Jersey: IEEE Conference Publications, 2013, pp. 1022–1025. DOI: [10.7873/DATE.2013.213](https://doi.org/10.7873/DATE.2013.213) (cit. on p. [42\)](#page-64-0).
- **[103]** VAN SANTEN, Victor M.; AMROUCH, Hussam; MARTIN-MARTINEZ, Javier; NAFRIA, Montserrat and HENKEL, Jörg: "Designing guardbands for instantaneous aging effects". In: *Proceedings of the 53rd Annual Design Automation Conference*. New York, NY, USA: ACM, 2016, pp. 1–6. DOI: [10.1145/2897937.2898006](https://doi.org/10.1145/2897937.2898006) (cit. on p. [42\)](#page-64-0).
- **[104]** BROWN, S. and ROSE, J.: "FPGA and CPLD architectures: a tutorial". In: *IEEE Design & Test of Computers* 13.2 (1996), pp. 42–57. DOI: [10.1109/54.500200](https://doi.org/10.1109/54.500200) (cit. on pp. [45,](#page-67-0) [46,](#page-68-0) [50,](#page-72-0) [51\)](#page-73-0).
- **[105]** PARANDEH-AFSHAR, Hadi; BENBIHI, Hind; NOVO, David and IENNE, Paolo: "Rethinking FPGAs: elude the flexibility excess of LUTs with and-inverter cones". In: *Proceedings of the ACM/SIGDA international symposium on Field Programmable Gate Arrays - FPGA '12*. Ed. by COMPTON, Katherine and HUTCHINGS, Brad. New York, New York, USA: ACM Press, 2012, p. 119. DOI: [10.1145/2145694.2145715](https://doi.org/10.1145/2145694.2145715) (cit. on pp. [46,](#page-68-0) [47,](#page-69-0) [75,](#page-97-0) [104\)](#page-126-0).
- **[106]** ZGHEIB, Grace; YANG, Liqun; HUANG, Zhihong; NOVO, David; PARANDEH-AFSHAR, Hadi; YANG, Haigang and IENNE, Paolo: "Revisiting and-inverter cones". In: *Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays*. Ed. by BETZ, Vaughn and CONSTANTINIDES, George A. New York, NY, USA: ACM, 2014, pp. 45–54. DOI: [10.1145/2554688.2554791](https://doi.org/10.1145/2554688.2554791) (cit. on p. [47\)](#page-69-0).
- **[107]** THUMMLER, Martin; RAI, Shubham and KUMAR, Akash: "Improving Technology Mapping for And-Inverter-Cones". In: *2022 Design, Automation & Test in Europe Conference & Exhibition (DATE)*. IEEE, 2022, pp. 274–279. DOI: [10.23919/DATE54114.2022.9774544](https://doi.org/10.23919/DATE54114.2022.9774544) (cit. on p. [47\)](#page-69-0).
- **[108]** RAI, Shubham; NATH, Pallab; RUPANI, Ansh; VISHVAKARMA, Santosh Kumar and KUMAR, Akash: "A Survey of FPGA Logic Cell Designs in the Light of Emerging Technologies". In: *IEEE Access* 9 (2021), pp. 91564–91574. DOI: [10.1109/ACCESS.2021.3092167](https://doi.org/10.1109/ACCESS.2021.3092167) (cit. on pp. [47,](#page-69-0) [55\)](#page-77-0).
- **[109]** ZILIC, Z. and VRANESIC, Z. G.: "Using BDDs to Design ULMs for FPGAs". In: *Fourth International ACM Symposium on Field-Programmable Gate Arrays*. IEEE, 1996, pp. 24–30. DOI: [10.1109/FPGA.1996.242252](https://doi.org/10.1109/FPGA.1996.242252) (cit. on pp. [47,](#page-69-0) [48\)](#page-70-0).
- **[110]** PREPARATA, Franco P. and MULLER, David E.: "Generation of near-optimal universal Boolean functions". In: *Journal of Computer and System Sciences* 4.2 (1970), pp. 93–102. DOI: [10.1016/S0022-0000\(70\)80002-2](https://doi.org/10.1016/S0022-0000(70)80002-2) (cit. on p. [47\)](#page-69-0).
- **[111]** IIDA, Masahiro; AMAGASAKI, Motoki; OKAMOTO, Yasuhiro; ZHAO, Qian and SUEYOSHI, Toshinori: "COGRE: A Novel Compact Logic Cell Architecture for Area Minimization". In: *IEICE Transactions on Information and Systems* E95-D.2 (2012), pp. 294–302. DOI: [10.1587/transinf.E95.D.294](https://doi.org/10.1587/transinf.E95.D.294) (cit. on p. [48\)](#page-70-0).
- **[112]** THAKUR, S. and WONG, D. F.: "On Designing ULM-Based FPGA Logic Modules". In: (1995), pp. 3–9. DOI: [10.1109/FPGA.1995.241856](https://doi.org/10.1109/FPGA.1995.241856) (cit. on p. [48\)](#page-70-0).
- **[113]** HUTTER, M.: "Designing universal logic modules". In: *9th International Conference on Electronics, Circuits and Systems*. IEEE, 2002, pp. 709–712. DOI: [10.1109/ICECS.2002.1046267](https://doi.org/10.1109/ICECS.2002.1046267) (cit. on p. [48\)](#page-70-0).
- **[114]** MICROSEMI: Axcelerator Family FPGAs. Revision 18. 2012. (Visited on 06/28/2023) (cit. on pp. [49,](#page-71-0) [50\)](#page-72-0).
- **[115]** KUON, Ian; TESSIER, Russell and ROSE, Jonathan: "FPGA Architecture: Survey and Challenges". In: *Foundations and Trends® in Electronic Design Automation* 2.2 (2007), pp. 135–253. DOI: [10.1561/1000000005](https://doi.org/10.1561/1000000005) (cit. on p. [49\)](#page-71-0).
- **[116]** YANG, Haigang; ZHANG, Jia; SUN, Jiabin and LE YU: "Review of advanced FPGA architectures and technologies". In: *Journal of Electronics (China)* 31.5 (2014), pp. 371–393. DOI: [10.1007/s11767-014-4090-x](https://doi.org/10.1007/s11767-014-4090-x) (cit. on pp. [50,](#page-72-0) [51\)](#page-73-0).
- **[117]** CHIN, Stephen Alexander; LUU, Jason; HUDA, Safeen and ANDERSON, Jason H.: "Hybrid LUT/Multiplexer FPGA Logic Architectures". In: *IEEE Transactions on Very Large Scale Integration (VLSI) Systems* 24.4 (2016), pp. 1280–1292. DOI: [10.1109/TVLSI.2015.2451658](https://doi.org/10.1109/TVLSI.2015.2451658) (cit. on p. [50\)](#page-72-0).
- **[118]** DEHON, André and HAUCK, Scott: Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation. 1. ed. Systems on Silicon. s.l.: Elsevier professional, 2007 (cit. on pp. [50,](#page-72-0) [51,](#page-73-0) [57,](#page-79-0) [58,](#page-80-0) [64–](#page-86-0)[66\)](#page-88-0).
- **[119]** CHIASSON, Charles and BETZ, Vaughn: "COFFE: Fully-automated transistor sizing for FPGAs". In: *2013 International Conference on Field-Programmable Technology (FPT)*. IEEE, 2013, pp. 34–41. DOI: [10.1109/FPT.2013.6718327](https://doi.org/10.1109/FPT.2013.6718327) (cit. on p. [51\)](#page-73-0).
- **[120]** AMANO, Hideharu, ed.: Principles and structures of FPGAs. Singapore: Springer, 2018 (cit. on pp. [51,](#page-73-0) [58–](#page-80-0)[60\)](#page-82-0).
- **[121]** RODRÍGUEZ-ANDINA, Juan José; LA TORRE-ARNANZ, Eduardo de and VALDÉS PEÑA, María Dolores: FPGAs: Fundamentals, advanced features, and applications in industrial electronics. First issued in paperback. Boca Raton: CRC Press, 2020 (cit. on p. [51\)](#page-73-0).
- **[122]** CHENG, Kevin; LE BEUX, Sebastien and O'CONNOR, Ian: "Hybrid Topologies for Reconfigurable Matrices Based on Nano-Grain Cells". In: *2017 IEEE International Conference on Rebooting Computing (ICRC)*. IEEE, 2017, pp. 1–8. DOI: [10.1109/ICRC.2017.8123639](https://doi.org/10.1109/ICRC.2017.8123639) (cit. on pp. [53,](#page-75-0) [55\)](#page-77-0).
- **[123]** JABEUR, Kotb; YAKYMETS, Natalya; O'CONNOR, Ian and LE-BEUX, Sébastien: "Fine-grain reconfigurable logic cells based on double-gate CNTFETs". In: *Proceedings of the 21st edition of the great lakes symposium on Great lakes symposium on VLSI*. Ed. by ATIENZA, David; XIE, Yuan; AYALA, Jose L. and STEVENS, Ken. New York, NY: ACM, 2011, p. 19. DOI: [10.1145/1973009.1973014](https://doi.org/10.1145/1973009.1973014) (cit. on pp. [53](#page-75-0)[–55\)](#page-77-0).
- **[124]** CHENG, Kevin; LE BEUX, Sebastien and O'CONNOR, Ian: "Am/IDG-FET based reconfigurable cells versus LUTs: Characteristics description and analysis". In: *2013 25th International Conference on Microelectronics (ICM)*. IEEE, 2013, pp. 1–4. DOI: [10.1109/ICM.2013.6734987](https://doi.org/10.1109/ICM.2013.6734987) (cit. on pp. [53,](#page-75-0) [55\)](#page-77-0).
- **[125]** GAILLARDON, Pierre-Emmanuel; AMARÙ, Luca Gaetano; BOBBA, Shashikanth; MARCHI, Michele de; SACCHETTO, Davide and MICHELI, Giovanni de: "Nanowire systems: technology and design". In: *Philosophical transactions. Series A, Mathematical, physical, and engineering sciences* 372.2012 (2014), p. 20130102. DOI: [10.1098/rsta.2013.0102](https://doi.org/10.1098/rsta.2013.0102) (cit. on pp. [53,](#page-75-0) [55\)](#page-77-0).
- **[126]** RAI, Shubham; TROMMER, Jens; RAITZA, Michael; MIKOLAJICK, Thomas; WEBER, Walter M. and KUMAR, Akash: "Designing Efficient Circuits Based on Runtime-Reconfigurable Field-Effect Transistors". In: *IEEE Transactions on Very Large Scale Integration (VLSI) Systems* 27.3 (2019), pp. 560–572. DOI: [10.1109/TVLSI.2018.2884646](https://doi.org/10.1109/TVLSI.2018.2884646) (cit. on pp. [54,](#page-76-0) [71\)](#page-93-0).
- **[127]** O'CONNOR, Ian et al.: "CNTFET Modeling and Reconfigurable Logic-Circuit Design". In: *IEEE Transactions on Circuits and Systems I: Regular Papers* 54.11 (2007), pp. 2365–2379. DOI: [10.1109/TCSI.2007.907835](https://doi.org/10.1109/TCSI.2007.907835) (cit. on p. [54\)](#page-76-0).
- **[128]** LIU, J.; O'CONNOR, I.; NAVARRO, D. and GAFFIOT, F.: "Design of a Novel CNTFET-based Reconfigurable Logic Gate". In: *IEEE Computer Society Annual Symposium on VLSI (ISVLSI '07)*. IEEE, 2007, pp. 285–290. DOI: [10.1109/ISVLSI.2007.39](https://doi.org/10.1109/ISVLSI.2007.39) (cit. on pp. [54,](#page-76-0) [177](#page-199-0)[–179\)](#page-201-0).
- **[129]** KATO, Junki; WATANABE, Shigeyoshi; NINOMIYA, Hiroshi; KOBAYASHI, Manabu and MIURA, Yasuyuki: "Circuit design of reconfigurable dynamic logic based on double gate CNTFETs focusing on number of states of back gate voltages". In: *Contemporary Engineering Sciences* 7 (2014), pp. 39–52. DOI: [10.12988/ces.2014.3952](https://doi.org/10.12988/ces.2014.3952) (cit. on p. [54\)](#page-76-0).
- **[130]** KUMAR, T. Nandha; ALMURIB, Haider A. F. and LOMBARDI, Fabrizio: "A novel design of a memristor-based look-up table (LUT) for FPGA". In: *2014 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS)*. IEEE, 2014, pp. 703–706. DOI: [10.1109/APCCAS.2014.7032878](https://doi.org/10.1109/APCCAS.2014.7032878) (cit. on p. [55\)](#page-77-0).
- **[131]** GUO, Yanwen; WANG, Xiaoping and ZENG, Zhigang: "A Compact Memristor-CMOS Hybrid Look-Up-Table Design and Potential Application in FPGA". In: *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems* 36.12 (2017), pp. 2144–2148. DOI: [10.1109/TCAD.2017.2681079](https://doi.org/10.1109/TCAD.2017.2681079) (cit. on p. [55\)](#page-77-0).
- **[132]** NINOMIYA, Hiroshi; KOBAYASHI, Manabu and WATANABE, Shigeyoshi: "Reduced Reconfigurable Logic Circuit Design Based on Double Gate CNTFETs Using Ambipolar Binary Decision Diagram". In: *IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences* E96.A.1 (2013), pp. 356–359. DOI: [10.1587/transfun.E96.A.356](https://doi.org/10.1587/transfun.E96.A.356) (cit. on p. [55\)](#page-77-0).
- **[133]** YAKYMETS, N.; JABEUR, K.; O'CONNOR, I. and LE BEUX, S.: "Interconnect topology for cell matrices based on low-power nanoscale devices". In: *2011 Faible Tension Faible Consommation (FTFC)*. IEEE, 2011, pp. 99–102. DOI: [10.1109/FTFC.2011.5948929](https://doi.org/10.1109/FTFC.2011.5948929) (cit. on p. [55\)](#page-77-0).
- **[134]** BABU, Praveenkumar and PARTHASARATHY, Eswaran: "Reconfigurable FPGA Architectures: A Survey and Applications". In: *Journal of The Institution of Engineers (India): Series B* 102.1 (2021), pp. 143–156. DOI: [10.1007/s40031-020-00508-y](https://doi.org/10.1007/s40031-020-00508-y) (cit. on pp. [57,](#page-79-0) [59,](#page-81-0) [60\)](#page-82-0).
- **[135]** INTEL: Intel® Arria® 10 Core Fabric and General Purpose I/Os Handbook. 2023. URL: [https://www.intel.com/content/www/us/en/docs/programmable/683461/](https://www.intel.com/content/www/us/en/docs/programmable/683461/) (visited on 07/22/2023) (cit. on p. [58\)](#page-80-0).
- **[136]** ZGHEIB, Grace and IENNE, Paolo: "Evaluating FPGA clusters under wide ranges of design parameters". In: *2017 27th International Conference on Field Programmable Logic and Applications (FPL)*. IEEE, 2017, pp. 1–8. DOI: [10.23919/FPL.2017.8056826](https://doi.org/10.23919/FPL.2017.8056826) (cit. on p. [58\)](#page-80-0).
- **[137]** BOUTROS, Andrew and BETZ, Vaughn: "FPGA Architecture: Principles and Progression". In: *IEEE Circuits and Systems Magazine* 21.2 (2021), pp. 4–29. DOI: [10.1109/MCAS.2021.3071607](https://doi.org/10.1109/MCAS.2021.3071607) (cit. on p. [59\)](#page-81-0).
- **[138]** XILINX: UltraScale Architecture Configurable Logic Block: User Guide. 2017. URL: <https://docs.xilinx.com/v/u/en-US/ug574-ultrascale-clb> (visited on 07/22/2023) (cit. on p. [59\)](#page-81-0).
- **[139]** SMITH, Michael John Sebastian: Application-specific integrated circuits. VLSI systems series. Reading, Mass.: Addison-Wesley, 1997 (cit. on p. [61\)](#page-83-0).
- **[140]** DEL SOZZO, Emanuele; CONFICCONI, Davide; ZENI, Alberto; SALARIS, Mirko; SCIUTO, Donatella and SANTAMBROGIO, Marco D.: "Pushing the Level of Abstraction of Digital System Design: A Survey on How to Program FPGAs". In: *ACM Computing Surveys* 55.5 (2023), pp. 1–48. DOI: [10.1145/3532989](https://doi.org/10.1145/3532989) (cit. on p. [63\)](#page-85-0).
- **[141]** LYKE, James C.; CHRISTODOULOU, Christos G.; VERA, G. Alonzo and EDWARDS, Arthur H.: "An Introduction to Reconfigurable Systems". In: *Proceedings of the IEEE* 103.3 (2015), pp. 291–317. DOI: [10.1109/JPROC.2015.2397832](https://doi.org/10.1109/JPROC.2015.2397832) (cit. on p. [63\)](#page-85-0).
- **[142]** LUU, Jason et al.: "VTR 7.0". In: *ACM Transactions on Reconfigurable Technology and Systems* 7.2 (2014), pp. 1–30. DOI: [10.1145/2617593](https://doi.org/10.1145/2617593) (cit. on pp. [64,](#page-86-0) [180\)](#page-202-0).
- **[143]** SHAH, David; HUNG, Eddie; WOLF, Clifford; BAZANSKI, Serge; GISSELQUIST, Dan and MILANOVIC, Miodrag: "Yosys+nextpnr: An Open Source Framework from Verilog to Bitstream for Commercial FPGAs". In: *2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)*. IEEE, 2019, pp. 1–4. DOI: [10.1109/FCCM.2019.00010](https://doi.org/10.1109/FCCM.2019.00010) (cit. on pp. [64,](#page-86-0) [66\)](#page-88-0).
- **[144]** BRAYTON, Robert and MISHCHENKO, Alan: "ABC: An Academic Industrial-Strength Verification Tool". In: *Computer Aided Verification*. Ed. by HUTCHISON, David et al. Vol. 6174. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer Berlin Heidelberg, 2010, pp. 24–40. DOI: [10.1007/978-3-642-14295-6\\_5](https://doi.org/10.1007/978-3-642-14295-6_5) (cit. on p. [64\)](#page-86-0).
- **[145]** SYNOPSYS: Liberty User Guides and Reference Manual Suite. 2017 (cit. on p. [64\)](#page-86-0).
- **[146]** BETZ, Vaughn and ROSE, Jonathan: "VPR: a new packing, placement and routing tool for FPGA research". In: 1304 (1997), pp. 213–222. DOI: [10.1007/3-540-63465-7\\_226](https://doi.org/10.1007/3-540-63465-7_226). (Visited on 04/04/2019) (cit. on pp. [65,](#page-87-0) [66\)](#page-88-0).
- **[147]** GEREZ, Sabih H.: Algorithms for VLSI design automation. Chichester and Weinheim: Wiley, 1999. URL: [http://www.loc.gov/catdir/bios/wiley042/98039574.html](http://www.loc.gov/catdir/bios/wiley042/98039574.html) (cit. on p. [66\)](#page-88-0).
- **[148]** F4PGA: FPGA Assembly (FASM) documentation. 2023. URL: [https://fasm.readthedocs.io/en/latest/](https://fasm.readthedocs.io/en/latest/) (visited on 07/23/2023) (cit. on p. [66\)](#page-88-0).
- **[149]** VTR DEVELOPERS: FPGA Assembly (FASM) Output Support. 2023. URL: [https://docs.verilogtorouting.org/en/latest/utils/fasm/](https://docs.verilogtorouting.org/en/latest/utils/fasm/) (visited on 07/23/2023) (cit. on p. [67\)](#page-89-0).
- **[150]** SKYWATER TECHNOLOGY: SkyWater Open Source PDK. 2020. URL: [https://github.com/google/skywater-pdk](https://github.com/google/skywater-pdk) (visited on 08/01/2023) (cit. on p. [69\)](#page-91-0).
- **[151]** GLOBALFOUNDRIES: GlobalFoundries GF180MCU Open Source PDK. 2022. URL: <https://github.com/google/gf180mcu-pdk> (cit. on p. [69\)](#page-91-0).
- **[152]** STINE, James E. et al.: "FreePDK: An Open-Source Variation-Aware Design Kit". In: *2007 IEEE International Conference on Microelectronic Systems Education (MSE'07)*. IEEE, 2007, pp. 173–174. DOI: [10.1109/MSE.2007.44](https://doi.org/10.1109/MSE.2007.44) (cit. on p. [69\)](#page-91-0).
- **[153]** BHANUSHALI, Kirti and DAVIS, W. Rhett: "FreePDK15". In: *Proceedings of the 2015 Symposium on International Symposium on Physical Design*. Ed. by DAVOODI, Azadeh and YOUNG, Evangeline. New York, NY, USA: ACM, 2015, pp. 165–170. DOI: [10.1145/2717764.2717782](https://doi.org/10.1145/2717764.2717782) (cit. on p. [69\)](#page-91-0).
- **[154]** MARTINS, Mayler; MATOS, Jody Maick; RIBAS, Renato P.; REIS, André; SCHLINKER, Guilherme; RECH, Lucio and MICHELSEN, Jens: "Open Cell Library in 15nm FreePDK Technology". In: *Proceedings of the 2015 Symposium on International Symposium on Physical Design*. Ed. by DAVOODI, Azadeh and YOUNG, Evangeline. New York, NY, USA: ACM, 2015, pp. 171–178. DOI: [10.1145/2717764.2717783](https://doi.org/10.1145/2717764.2717783) (cit. on pp. [70,](#page-92-0) [71\)](#page-93-0).
- **[155]** RAI, Shubham; RAITZA, Michael; SAHOO, Siva Satyendra and KUMAR, Akash: "DiSCERN: Distilling Standard-Cells for Emerging Reconfigurable Nanotechnologies". In: *2020 Design, Automation & Test in Europe Conference & Exhibition (DATE)*. IEEE, 2020, pp. 674–677. DOI: [10.23919/DATE48585.2020.9116216](https://doi.org/10.23919/DATE48585.2020.9116216) (cit. on p. [71\)](#page-93-0).
- **[156]** KRINKE, Andreas; RAI, Shubham; KUMAR, Akash and LIENIG, Jens: "Exploring Physical Synthesis for Circuits based on Emerging Reconfigurable Nanotechnologies". In: *2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)*. IEEE, 2021, pp. 1–9. DOI: [10.1109/ICCAD51958.2021.9643439](https://doi.org/10.1109/ICCAD51958.2021.9643439) (cit. on p. [72\)](#page-94-0).
- **[157]** BEN-JAMAA, M. Haykel; MOHANRAM, Kartik and MICHELI, Giovanni de: "An Efficient Gate Library for Ambipolar CNTFET Logic". In: *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems* 30.2 (2011), pp. 242–255. DOI: [10.1109/TCAD.2010.2085250](https://doi.org/10.1109/TCAD.2010.2085250) (cit. on pp. [72,](#page-94-0) [74\)](#page-96-0).
- **[158]** RAI, Shubham; RUPANI, Ansh; WALTER, Dennis; RAITZA, Michael; HEINZIG, Andre; BALDAUF, Tim; TROMMER, Jens; MAYR, Christian; WEBER, Walter M. and KUMAR, Akash: "A physical synthesis flow for early technology evaluation of silicon nanowire based reconfigurable FETs". In: *2018 Design, Automation & Test in Europe Conference & Exhibition (DATE)*. IEEE, 2018, pp. 605–608. DOI: [10.23919/DATE.2018.8342080](https://doi.org/10.23919/DATE.2018.8342080) (cit. on pp. [72,](#page-94-0) [74\)](#page-96-0).
- **[159]** GORE, Ganesh; CADAREANU, Patsy; GIACOMIN, Edouard and GAILLARDON, Pierre-Emmanuel: "A Predictive Process Design Kit for Three-Independent-Gate Field-Effect Transistors". In: *2019 IFIP/IEEE 27th International Conference on Very Large Scale Integration (VLSI-SoC)*. IEEE, 2019, pp. 172–177. DOI: [10.1109/VLSI-SoC.2019.8920358](https://doi.org/10.1109/VLSI-SoC.2019.8920358) (cit. on pp. [73,](#page-95-0) [74\)](#page-96-0).
- **[160]** GAUCHI, Roman; SNELGROVE, Ashton and GAILLARDON, Pierre-Emmanuel: "An Open-source Three-Independent-Gate FET Standard Cell Library for Mixed Logic Synthesis". In: *2022 IEEE International Symposium on Circuits and Systems (ISCAS)*. IEEE, 2022, pp. 273–277. DOI: [10.1109/ISCAS48785.2022.9937590](https://doi.org/10.1109/ISCAS48785.2022.9937590) (cit. on pp. [73,](#page-95-0) [74\)](#page-96-0).
- **[161]** KEYSER, Michael; GAUCHI, Roman and GAILLARDON, Pierre-Emmanuel: "An Energy-Efficient Three-Independent-Gate FET Cell Library for Low-Power Edge Computing". In: *2022 IFIP/IEEE 30th International Conference on Very Large Scale Integration (VLSI-SoC)*. IEEE, 2022, pp. 1–6. DOI: [10.1109/VLSI-SoC54400.2022.9939636](https://doi.org/10.1109/VLSI-SoC54400.2022.9939636) (cit. on pp. [73,](#page-95-0) [74\)](#page-96-0).
- **[162]** QUIJADA, Jorge Navarro; BALDAUF, Tim; RAI, Shubham; HEINZIG, Andre; KUMAR, Akash; WEBER, Walter M.; MIKOLAJICK, Thomas and TROMMER, Jens: "A Germanium Nanowire Reconfigurable Transistor Model for Predictive Technology Evaluation". In: *IEEE Transactions on Nanotechnology* (2022), pp. 1–8. DOI: [10.1109/TNANO.2022.3221836](https://doi.org/10.1109/TNANO.2022.3221836) (cit. on pp. [73,](#page-95-0) [74\)](#page-96-0).
- **[163]** GONCALVES, O.; PRENAT, G. and DIENY, B.: "Radiation Hardened MRAM-Based FPGA". In: *IEEE Transactions on Magnetics* 49.7 (2013), pp. 4355–4358. DOI: [10.1109/TMAG.2013.2247744](https://doi.org/10.1109/TMAG.2013.2247744) (cit. on pp. [75,](#page-97-0) [104\)](#page-126-0).
- **[164]** BEN JAMAA, M. Haykel; GAILLARDON, Pierre-Emmanuel; FRÉGONÈSE, Sebastien; MARCHI, Michele de; MICHELI, Giovanni de; ZIMMER, Thomas; O'CONNOR, Ian and CLERMIDY, Fabien: "FPGA Design with Double-Gate Carbon Nanotube Transistors". In: *ECS Transactions* 34.1 (2011), pp. 1005–1010. DOI: [10.1149/1.3567706](https://doi.org/10.1149/1.3567706) (cit. on pp. [75,](#page-97-0) [104\)](#page-126-0).
- **[165]** GAILLARDON, Pierre-Emmanuel; TANG, Xifan; KIM, Gain and MICHELI, Giovanni de: "A Novel FPGA Architecture Based on Ultrafine Grain Reconfigurable Logic Cells". In: *IEEE Transactions on Very Large Scale Integration (VLSI) Systems* 23.10 (2015), pp. 2187–2197. DOI: [10.1109/TVLSI.2014.2359385](https://doi.org/10.1109/TVLSI.2014.2359385) (cit. on p. [76\)](#page-98-0).
- **[166]** GAILLARDON, Pierre-Emmanuel; BEN-JAMAA, M. Haykel; CLERMIDY, Fabien and O'CONNOR, Ian: "Ultra-fine grain FPGAs: A granularity study". In: *2011 IEEE/ACM International Symposium on Nanoscale Architectures*. IEEE, 2011, pp. 9–15. DOI: [10.1109/NANOARCH.2011.5941477](https://doi.org/10.1109/NANOARCH.2011.5941477) (cit. on pp. [76,](#page-98-0) [104\)](#page-126-0).
- **[167]** PARK, Jun-Mo; BAE, Jong-Ho; EUM, Jai-Ho; JIN, Sung Hun; PARK, Byung-Gook and LEE, Jong-Ho: "High-Density Reconfigurable Devices With Programmable Bottom-Gate Array". In: *IEEE Electron Device Letters* 38.5 (2017), pp. 564–567. DOI: [10.1109/LED.2017.2679343](https://doi.org/10.1109/LED.2017.2679343) (cit. on pp. [76,](#page-98-0) [77,](#page-99-0) [178\)](#page-200-0).
- **[168]** VIPIN, Kizheppatt and FAHMY, Suhaib A.: "FPGA Dynamic and Partial Reconfiguration". In: *ACM Computing Surveys* 51.4 (2019), pp. 1–39. DOI: [10.1145/3193827](https://doi.org/10.1145/3193827) (cit. on pp. [77–](#page-99-0)[79\)](#page-101-0).
- **[169]** VASSILIADIS, Stamatis, ed.: Fine- and coarse-grain reconfigurable computing. Dordrecht: Springer, 2007 (cit. on p. [78\)](#page-100-0).
- **[170]** CARDONA, Luis Andres and FERRER, Carles: "AC\_ICAP: A Flexible High Speed ICAP Controller". In: *International Journal of Reconfigurable Computing* 2015 (2015), pp. 1–15. DOI: [10.1155/2015/314358](https://doi.org/10.1155/2015/314358) (cit. on p. [78\)](#page-100-0).
- **[171]** STEIGER, C.; WALDER, H. and PLATZNER, M.: "Operating systems for reconfigurable embedded platforms: online scheduling of real-time tasks". In: *IEEE Transactions on Computers* 53.11 (2004), pp. 1393–1407. DOI: [10.1109/TC.2004.99](https://doi.org/10.1109/TC.2004.99) (cit. on pp. [80,](#page-102-0) [81\)](#page-103-0).
- **[172]** DIESSEL, Oliver and ELGINDY, Hossam: "Run-time compaction of FPGA designs". In: *Field-Programmable Logic and Applications*. Ed. by GOOS, Gerhard; HARTMANIS, Juris; VAN LEEUWEN, Jan; LUK, Wayne; CHEUNG, Peter Y. K. and GLESNER, Manfred. Vol. 1304. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer Berlin Heidelberg, 1997, pp. 131–140. DOI: [10.1007/3-540-63465-7\\_218](https://doi.org/10.1007/3-540-63465-7_218) (cit. on p. [80\)](#page-102-0).
- **[173]** WALDER, H.; STEIGER, C. and PLATZNER, M.: "Fast online task placement on FPGAs: free space partitioning and 2D-hashing". In: *Proceedings International Parallel and Distributed Processing Symposium*. IEEE Comput. Soc, 2003, p. 8. DOI: [10.1109/IPDPS.2003.1213329](https://doi.org/10.1109/IPDPS.2003.1213329) (cit. on p. [80\)](#page-102-0).
- **[174]** SIDIROPOULOS, Harry; FIGULI, Peter; SIOZIOS, Kostas; SOUDRIS, Dimitrios and BECKER, Jurgen: "A platform-independent runtime methodology for mapping multiple applications onto FPGAs through resource virtualization". In: *2013 23rd International Conference on Field programmable Logic and Applications*. IEEE, 2013, pp. 1–4. DOI: [10.1109/FPL.2013.6645564](https://doi.org/10.1109/FPL.2013.6645564) (cit. on p. [80\)](#page-102-0).
- **[175]** DIESSEL, O.; ELGINDY, H.; MIDDENDORF, M.; SCHMECK, H. and SCHMIDT, B.: "Dynamic scheduling of tasks on partially reconfigurable FPGAs". In: *IEE Proceedings - Computers and Digital Techniques* 147.3 (2000), p. 181. DOI: [10.1049/ip-cdt:20000485](https://doi.org/10.1049/ip-cdt:20000485) (cit. on p. [81\)](#page-103-0).
- **[176]** STERPONE, Luca and BOZZOLI, Ludovica: "Fast Partial Reconfiguration on SRAM-Based FPGAs: A Frame-Driven Routing Approach". In: *Applied Reconfigurable Computing. Architectures, Tools, and Applications*. Ed. by VOROS, Nikolaos; HUEBNER, Michael; KERAMIDAS, Georgios; GOEHRINGER, Diana; ANTONOPOULOS, Christos and DINIZ, Pedro C. Vol. 10824. Lecture Notes in Computer Science. Cham: Springer International Publishing, 2018, pp. 319–330. DOI: [10.1007/978-3-319-78890-6\\_26](https://doi.org/10.1007/978-3-319-78890-6_26) (cit. on p. [81\)](#page-103-0).
- **[177]** DA SILVA, Bruno; BRAEKEN, An and TOUHAFI, Abdellah: "Probabilistic Performance Modelling when Using Partial Reconfiguration to Accelerate Streaming Applications with Non-deterministic Task Scheduling". In: 11444 (2019), pp. 81–95. DOI: [10.1007/978-3-030-17227-5\\_7](https://doi.org/10.1007/978-3-030-17227-5_7) (cit. on p. [81\)](#page-103-0).
- **[178]** STEIGER, C.; WALDER, H.; PLATZNER, M. and THIELE, L.: "Online scheduling and placement of real-time tasks to partially reconfigurable devices". In: *Proceedings. 2003 International Symposium on System-on-Chip (IEEE Cat. No.03EX748)*. IEEE Comput. Soc, 2003, pp. 224–225. DOI: [10.1109/REAL.2003.1253269](https://doi.org/10.1109/REAL.2003.1253269) (cit. on p. [81\)](#page-103-0).
- **[179]** ISMAIL, Aws and SHANNON, Lesley: "FUSE: Front-End User Framework for O/S Abstraction of Hardware Accelerators". In: *2011 IEEE 19th Annual International Symposium on Field-Programmable Custom Computing Machines*. IEEE, 2011, pp. 170–177. DOI: [10.1109/FCCM.2011.48](https://doi.org/10.1109/FCCM.2011.48) (cit. on p. [81\)](#page-103-0).
- **[180]** JANßEN, Benedikt; WINGENDER, Tim and HÜBNER, Michael: Hardware Accelerator Framework Approach for Dynamic Partial Reconfigurable Overlays on Xilinx PYNQ. 2017. DOI: [10.18420/IN2017\\_44](https://doi.org/10.18420/IN2017_44) (cit. on p. [81\)](#page-103-0).
- **[181]** JANßEN, Benedikt; KÄSTNER, Florian; WINGENDER, Tim and HUEBNER, Michael: "A Dynamic Partial Reconfigurable Overlay Framework for Python". In: *Applied Reconfigurable Computing. Architectures, Tools, and Applications*. Ed. by VOROS, Nikolaos; HUEBNER, Michael; KERAMIDAS, Georgios; GOEHRINGER, Diana; ANTONOPOULOS, Christos and DINIZ, Pedro C. Vol. 10824. Lecture Notes in Computer Science. Cham: Springer International Publishing, 2018, pp. 331–342. DOI: [10.1007/978-3-319-78890-6\\_27](https://doi.org/10.1007/978-3-319-78890-6_27) (cit. on p. [81\)](#page-103-0).
- **[182]** KORINTH, Jens; HOFMANN, Jaco; HEINZ, Carsten and KOCH, Andreas: "The TaPaSCo Open-Source Toolflow for the Automated Composition of Task-Based Parallel Reconfigurable Computing Systems". In: *Applied Reconfigurable Computing*. Ed. by HOCHBERGER, Christian; NELSON, Brent; KOCH, Andreas; WOODS, Roger and DINIZ, Pedro. Vol. 11444. Lecture Notes in Computer Science. Cham: Springer International Publishing, 2019, pp. 214–229. DOI: [10.1007/978-3-030-17227-5\\_16](https://doi.org/10.1007/978-3-030-17227-5_16) (cit. on p. [81\)](#page-103-0).
- **[183]** TRIMBERGER, S.; CARBERRY, D.; JOHNSON, A. and WONG, J.: A time-multiplexed FPGA. Los Alamitos Calif.: IEEE Computer Society Press, 1997. DOI: [10.1109/FPGA.1997.624601](https://doi.org/10.1109/FPGA.1997.624601) (cit. on p. [82\)](#page-104-0).
- **[184]** LI, Z.; COMPTON, K. and HAUCK, S.: "Configuration caching management techniques for reconfigurable computing". In: *Proceedings 2000 IEEE Symposium on Field-Programmable Custom Computing Machines (Cat. No.PR00871)*. IEEE Comput. Soc, 2000, pp. 22–36. DOI: [10.1109/FPGA.2000.903390](https://doi.org/10.1109/FPGA.2000.903390) (cit. on p. [82\)](#page-104-0).
- **[185]** COMPTON, K.; LI, Zhiyuan; COOLEY, J.; KNOL, S. and HAUCK, S.: "Configuration relocation and defragmentation for run-time reconfigurable computing". In: *IEEE Transactions on Very Large Scale Integration (VLSI) Systems* 10.3 (2002), pp. 209–220. DOI: [10.1109/TVLSI.2002.1043324](https://doi.org/10.1109/TVLSI.2002.1043324) (cit. on pp. [82,](#page-104-0) [104\)](#page-126-0).
- **[186]** BREBNER, Gordon and DIESSEL, Oliver: "Chip-Based Reconfigurable Task Management". In: *Field-Programmable Logic and Applications*. Ed. by GOOS, Gerhard; HARTMANIS, Juris; VAN LEEUWEN, Jan; BREBNER, Gordon and WOODS, Roger. Vol. 2147. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer Berlin Heidelberg, 2001, pp. 182–191. DOI: [10.1007/3-540-44687-7\\_19](https://doi.org/10.1007/3-540-44687-7_19) (cit. on pp. [83,](#page-105-0) [104\)](#page-126-0).
- **[187]** KOCH, D.; AHMADINIA, A.; BOBDA, C. and KALTE, H.: "FPGA architecture extensions for preemptive multitasking and hardware defragmentation". In: *Proceedings. 2004 IEEE International Conference on Field-Programmable Technology (IEEE Cat. No.04EX921)*. IEEE, 2004, pp. 433–436. DOI: [10.1109/FPT.2004.1393318](https://doi.org/10.1109/FPT.2004.1393318) (cit. on p. [83\)](#page-105-0).
- **[188]** FIGULI, Peter; HUBNER, Michael; GIRARDEY, Romuald; BAPP, Falco; BRUCKSCHLOGL, Thomas; THOMA, Florian; HENKEL, Jorg and BECKER, Jurgen: "A heterogeneous SoC architecture with embedded virtual FPGA cores and runtime Core Fusion". In: *2011 NASA/ESA Conference on Adaptive Hardware and Systems (AHS)*. IEEE, 2011, pp. 96–103. DOI: [10.1109/AHS.2011.5963922](https://doi.org/10.1109/AHS.2011.5963922) (cit. on pp. [83,](#page-105-0) [84,](#page-106-0) [104\)](#page-126-0).
- **[189]** BOZZOLI, Ludovica and STERPONE, Luca: "ReM: A Reconfigurable Multipotent Cell for New Distributed Reconfigurable Architectures". In: 11444 (2019), pp. 295–304. DOI: [10.1007/978-3-030-17227-5\\_21](https://doi.org/10.1007/978-3-030-17227-5_21) (cit. on p. [84\)](#page-106-0).
- **[190]** ADETOMI, Adewale; ENEMALI, Godwin and ARSLAN, Tughrul: "Enabling Dynamic Communication for Runtime Circuit Relocation". In: *IEEE Transactions on Very Large Scale Integration (VLSI) Systems* 28.1 (2020), pp. 142–155. DOI: [10.1109/TVLSI.2019.2934927](https://doi.org/10.1109/TVLSI.2019.2934927) (cit. on p. [84\)](#page-106-0).
- **[191]** DONGAONKAR, Sourabh; MUDANAI, Sivakumar P. and GILES, Martin D.: "From Process Corners to Statistical Circuit Design Methodology: Opportunities and Challenges". In: *IEEE Transactions on Electron Devices* 66.1 (2019), pp. 19–27. DOI: [10.1109/TED.2018.2860929](https://doi.org/10.1109/TED.2018.2860929) (cit. on p. [85\)](#page-107-0).
- **[192]** LIN, Yan; HUTTON, Mike and HE, Lei: "Placement and Timing for FPGAs Considering Variations". In: *2006 International Conference on Field Programmable Logic and Applications*. IEEE, 2006, pp. 1–7. DOI: [10.1109/FPL.2006.311192](https://doi.org/10.1109/FPL.2006.311192) (cit. on p. [86\)](#page-108-0).
- **[193]** KAENEL, V. von; MACKEN, P. and DEGRAUWE, M.G.R.: "A voltage reduction technique for battery-operated systems". In: *IEEE Journal of Solid-State Circuits* 25.5 (1990), pp. 1136–1140. DOI: [10.1109/4.62134](https://doi.org/10.1109/4.62134) (cit. on pp. [87,](#page-109-0) [104\)](#page-126-0).
- **[194]** CHENG, Lerong; XIONG, Jinjun; HE, Lei and HUTTON, Mike: "FPGA Performance Optimization Via Chipwise Placement Considering Process Variations". In: *2006 International Conference on Field Programmable Logic and Applications*. IEEE, 2006, pp. 1–6. DOI: [10.1109/FPL.2006.311193](https://doi.org/10.1109/FPL.2006.311193) (cit. on pp. [87,](#page-109-0) [104\)](#page-126-0).
- **[195]** GHOSH, Swaroop; BHUNIA, Swarup and ROY, Kaushik: "CRISTA: A New Paradigm for Low-Power, Variation-Tolerant, and Adaptive Circuit Synthesis Using Critical Path Isolation". In: *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems* 26.11 (2007), pp. 1947–1956. DOI: [10.1109/TCAD.2007.896305](https://doi.org/10.1109/TCAD.2007.896305) (cit. on pp. [87,](#page-109-0) [104\)](#page-126-0).
- **[196]** EBRAHIMI, Mohammad and NAVABI, Zainalabedin: "Selecting Representative Critical Paths for Sensor Placement Provides Early FPGA Aging Information". In: *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems* 39.10 (2020), pp. 2976–2989. DOI: [10.1109/TCAD.2019.2953174](https://doi.org/10.1109/TCAD.2019.2953174) (cit. on p. [88\)](#page-110-0).
- **[197]** ELGEBALY, Mohamed and SACHDEV, Manoj: "Efficient adaptive voltage scaling system through on-chip critical path emulation". In: *Proceedings of the 2004 international symposium on Low power electronics and design*. Ed. by JOSHI, Rajiv; CHOI, Kiyoung; TIWARI, Vivek and ROY, Kaushik. New York, NY, USA: ACM, 2004, pp. 375–380. DOI: [10.1145/1013235.1013325](https://doi.org/10.1145/1013235.1013325) (cit. on p. [88\)](#page-110-0).
- **[198]** FOJTIK, Matthew; FICK, David; KIM, Yejoong; PINCKNEY, Nathaniel; HARRIS, David Money; BLAAUW, David and SYLVESTER, Dennis: "Bubble Razor: Eliminating Timing Margins in an ARM Cortex-M3 Processor in 45 nm CMOS Using Architecturally Independent Error Detection and Correction". In: *IEEE Journal of Solid-State Circuits* 48.1 (2013), pp. 66–81. DOI: [10.1109/JSSC.2012.2220912](https://doi.org/10.1109/JSSC.2012.2220912) (cit. on p. [88\)](#page-110-0).
- **[199]** MIRO-PANADES, Ivan; BEIGNE, Edith; BILLOINT, Olivier and THONNART, Yvain: "In-situ Fmax/Vmin tracking for energy efficiency and reliability optimization". In: *2017 IEEE 23rd International Symposium on On-Line Testing and Robust System Design (IOLTS)*, pp. 96–99. DOI: [10.1109/IOLTS.2017.8046240](https://doi.org/10.1109/IOLTS.2017.8046240) (cit. on p. [89\)](#page-111-0).
- **[200]** CHEN, Poki; SHIE, Mon-Chau; ZHENG, Zhi-Yuan; ZHENG, Zi-Fan and CHU, Chun-Yan: "A Fully Digital Time-Domain Smart Temperature Sensor Realized With 140 FPGA Logic Elements". In: *IEEE Transactions on Circuits and Systems I: Regular Papers* 54.12 (2007), pp. 2661–2668. DOI: [10.1109/TCSI.2007.906073](https://doi.org/10.1109/TCSI.2007.906073) (cit. on p. [89\)](#page-111-0).
- **[201]** YU, Haile; XU, Qiang and LEONG, Philip H.W.: "Fine-grained characterization of process variation in FPGAs". In: *2010 International Conference on Field-Programmable Technology*. IEEE, 2010, pp. 138–145. DOI: [10.1109/FPT.2010.5681770](https://doi.org/10.1109/FPT.2010.5681770) (cit. on pp. [90,](#page-112-0) [104\)](#page-126-0).
- **[202]** FRANCO, John J. Leon; BOEMO, Eduardo; CASTILLO, Encarnacion and PARRILLA, Luis: "Ring oscillators as thermal sensors in FPGAs: Experiments in low voltage". In: *2010 VI Southern Programmable Logic Conference (SPL)*. IEEE, 2010, pp. 133–137. DOI: [10.1109/SPL.2010.5483027](https://doi.org/10.1109/SPL.2010.5483027) (cit. on pp. [90,](#page-112-0) [104\)](#page-126-0).
- **[203]** HAPPE, Markus; AGNE, Andreas and PLESSL, Christian: "Measuring and Predicting Temperature Distributions on FPGAs at Run-Time". In: *2011 International Conference on Reconfigurable Computing and FPGAs*. IEEE, 2011, pp. 55–60. DOI: [10.1109/ReConFig.2011.59](https://doi.org/10.1109/ReConFig.2011.59) (cit. on p. [90\)](#page-112-0).
- **[204]** AGARWAL, Mridul; PAUL, Bipul C.; ZHANG, Ming and MITRA, Subhasish: "Circuit Failure Prediction and Its Application to Transistor Aging". In: *25th IEEE VLSI Test Symmposium (VTS'07)*. IEEE, 2007, pp. 277–286. DOI: [10.1109/VTS.2007.22](https://doi.org/10.1109/VTS.2007.22) (cit. on p. [90\)](#page-112-0).
- **[205]** HUARD, V.; CACHO, F.; GINER, F.; SALIVA, M.; BENHASSAIN, A.; PATEL, D.; TORRES, N.; NAUDET, S.; JAIN, A. and PARTHASARATHY, C.: "Adaptive Wearout Management with in-situ aging monitors". In: *2014 IEEE International Reliability Physics Symposium*. IEEE, 2014, 6B.4.1–6B.4.11. DOI: [10.1109/IRPS.2014.6861106](https://doi.org/10.1109/IRPS.2014.6861106) (cit. on p. [90\)](#page-112-0).
- **[206]** SENGUPTA, Deepashree and SAPATNEKAR, Sachin S.: "Estimating Circuit Aging Due to BTI and HCI Using Ring-Oscillator-Based Sensors". In: *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems* 36.10 (2017), pp. 1688–1701. DOI: [10.1109/TCAD.2017.2648840](https://doi.org/10.1109/TCAD.2017.2648840) (cit. on p. [90\)](#page-112-0).
- **[207]** ZICK, Kenneth M. and HAYES, John P.: "Low-cost sensing with ring oscillator arrays for healthier reconfigurable systems". In: *ACM Transactions on Reconfigurable Technology and Systems* 5.1 (2012), pp. 1–26. DOI: [10.1145/2133352.2133353](https://doi.org/10.1145/2133352.2133353) (cit. on p. [90\)](#page-112-0).
- **[208]** ALAM, M.: "Reliability- and process-variation aware design of integrated circuits". In: *Microelectronics Reliability* 48.8-9 (2008), pp. 1114–1122. DOI: [10.1016/j.microrel.2008.07.039](https://doi.org/10.1016/j.microrel.2008.07.039) (cit. on p. [91\)](#page-113-0).
- **[209]** GUPTA, Meeta S.; RIVERS, Jude A.; BOSE, Pradip; WEI, Gu-Yeon and BROOKS, David: "Tribeca: Design for PVT Variations with Local Recovery and Fine-grained Adaptation". In: *Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture*. Ed. by ALBONESI, David; MARTONOSI, Margaret; AUGUST, David and MARTÍNEZ, José. New York, NY, USA: ACM, 2009, pp. 435–446. DOI: [10.1145/1669112.1669168](https://doi.org/10.1145/1669112.1669168) (cit. on p. [91\)](#page-113-0).
- **[210]** MITTAL, Sparsh: "A Survey of Architectural Techniques for Managing Process Variation". In: *ACM Computing Surveys* 48.4 (2016), pp. 1–29. DOI: [10.1145/2871167](https://doi.org/10.1145/2871167) (cit. on p. [91\)](#page-113-0).
- **[211]** RAHIMI, Abbas; BENINI, Luca and GUPTA, Rajesh K.: "Variability Mitigation in Nanometer CMOS Integrated Systems: A Survey of Techniques From Circuits to Software". In: *Proceedings of the IEEE* 104.7 (2016), pp. 1410–1448. DOI: [10.1109/JPROC.2016.2518864](https://doi.org/10.1109/JPROC.2016.2518864) (cit. on p. [91\)](#page-113-0).
- **[212]** TSCHANZ, J.W.; KAO, J. T.; NARENDRA, S. G.; NAIR, R.; ANTONIADIS, D. A.; CHANDRAKASAN, A. P. and DE, V.: "Adaptive body bias for reducing impacts of die-to-die and within-die parameter variations on microprocessor frequency and leakage". In: *IEEE Journal of Solid-State Circuits* 37.11 (2002), pp. 1396–1402. DOI: [10.1109/JSSC.2002.803949](https://doi.org/10.1109/JSSC.2002.803949) (cit. on p. [92\)](#page-114-0).
- **[213]** TEODORESCU, Radu; NAKANO, Jun; TIWARI, Abhishek and TORRELLAS, Josep: "Mitigating Parameter Variation with Dynamic Fine-Grain Body Biasing". In: *40th Annual IEEE/ACM International Symposium on Microarchitecture (MI-CRO 2007)*. IEEE, 2007, pp. 27–42. DOI: [10.1109 /MICRO.2007.43](https://doi.org/10.1109/MICRO.2007.43) (cit. on p. [92\)](#page-114-0).
- **[214]** MAURICIO, Joan and MOLL, Francesc: "Local variations compensation with DLL-based Body Bias Generator for UTBB FD-SOI technology". In: *2015 IEEE 13th International New Circuits and Systems Conference (NEWCAS)*. IEEE, 2015, pp. 1–4. DOI: [10.1109/NEWCAS.2015.7182005](https://doi.org/10.1109/NEWCAS.2015.7182005) (cit. on p. [93\)](#page-115-0).
- **[215]** MATSUSHITA, Yusuke; OKUHARA, Hayate; MASUYAMA, Koichiro; FUJITA, Yu; KAWANO, Ryuta and AMANO, Hideharu: "Body bias grain size exploration for a coarse grained reconfigurable accelerator". In: *2016 26th International Conference on Field Programmable Logic and Applications (FPL)*. IEEE, 2016, pp. 1–4. DOI: [10.1109/FPL.2016.7577346](https://doi.org/10.1109/FPL.2016.7577346) (cit. on p. [93\)](#page-115-0).
- **[216]** CHOW, C. T.; TSUI, L.S.M.; LEONG, P.H.W.; LUK, W. and WILTON, S.J.E.: "Dynamic voltage scaling for commercial FPGAs". In: *Proceedings. 2005 IEEE International Conference on Field-Programmable Technology, 2005*. IEEE, 2005, pp. 173–180. DOI: [10.1109/FPT.2005.1568543](https://doi.org/10.1109/FPT.2005.1568543) (cit. on pp. [94,](#page-116-0) [99,](#page-121-0) [104\)](#page-126-0).
- **[217]** NABAA, Georges; AZIZI, Navid and NAJM, Farid N.: An adaptive FPGA architecture with process variation compensation and reduced leakage. New York, NY and Piscataway, NJ: Association for Computing Machinery and IEEE Service Center, 2006. DOI: [10.1145/1146909.1147069](https://doi.org/10.1145/1146909.1147069) (cit. on pp. [94,](#page-116-0) [95,](#page-117-0) [104\)](#page-126-0).
- **[218]** HIOKI, Masakazu; MA, Chao; KAWANAMI, Takashi; OGASAHARA, Yasuhiro; NAKAGAWA, Tadashi; SEKIGAWA, Toshihiro; TSUTSUMI, Toshiyuki and KOIKE, Hanpei: "SOTB Implementation of a Field Programmable Gate Array with Fine-Grained Vt Programmability". In: *Journal of Low Power Electronics and Applications* 4.3 (2014), pp. 188–200. DOI: [10.3390/jlpea4030188](https://doi.org/10.3390/jlpea4030188) (cit. on pp. [95,](#page-117-0) [104\)](#page-126-0).
- **[219]** BURMESTER CAMPOS, Pedro: "Variability-Aware Circuit Performance Optimisation Through Digital Reconfiguration". PhD thesis. 2015 (cit. on p. [95\)](#page-117-0).
- **[220]** MARAGOS, Konstantinos; LENTARIS, George and SOUDRIS, Dimitrios: "A PVT-Aware Voltage Scaling Method for Energy Efficient FPGAs". In: *2021 IEEE International Symposium on Circuits and Systems (ISCAS)*. IEEE, 2021, pp. 1–5. DOI: [10.1109/ISCAS51556.2021.9401622](https://doi.org/10.1109/ISCAS51556.2021.9401622) (cit. on pp. [96,](#page-118-0) [104\)](#page-126-0).
- **[221]** EETIMES: Lattice unveils first FPGAs on FDSOI. 2019. URL: [https://www.eetimes.com/lattice-unveils-first-fpgas-on-fd-soi/](https://www.eetimes.com/lattice-unveils-first-fpgas-on-fd-soi/) (visited on 08/07/2023) (cit. on p. [96\)](#page-118-0).
- **[222]** AKGUN, Gokhan; ALI, Muhammad and GOHRINGER, Diana: "Power-Aware Computing Systems on FPGAs: A Survey". In: *2021 31st International Conference on Field-Programmable Logic and Applications (FPL)*. IEEE, 2021, pp. 45–51. DOI: [10.1109/FPL53798.2021.00016](https://doi.org/10.1109/FPL53798.2021.00016) (cit. on pp. [96,](#page-118-0) [97\)](#page-119-0).
- **[223]** SUTTER, G.; TODOROVICH, E.; LOPEZ-BUEDO, S. and BOEMO, E.: "Low-Power FSMs in FPGA: Encoding Alternatives". In: *Integrated Circuit Design. Power and Timing Modeling, Optimization and Simulation*. Ed. by GOOS, Gerhard; HARTMANIS, Juris; VAN LEEUWEN, Jan; HOCHET, Bertrand; ACOSTA, Antonio J. and BELLIDO, Manuel J. Vol. 2451. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer Berlin Heidelberg, 2002, pp. 363–370. DOI: [10.1007/3-540-45716-X\_36](https://doi.org/10.1007/3-540-45716-X_36) (cit. on p. [97\)](#page-119-0).
- **[224]** SINGH, Amit; PARTHASARATHY, Ganapathy and MAREK-SADOWSKA, Malgorzata: "Efficient circuit clustering for area and power reduction in FPGAs". In: *ACM Transactions on Design Automation of Electronic Systems* 7.4 (2002), pp. 643–663. DOI: [10.1145/605440.605448](https://doi.org/10.1145/605440.605448) (cit. on p. [97\)](#page-119-0).
- **[225]** GAYASEN, A.; TSAI, Y.; VIJAYKRISHNAN, N.; KANDEMIR, M.; IRWIN, M. J. and TUAN, T.: "Reducing leakage energy in FPGAs using region-constrained placement". In: *Proceedings of the 2004 ACM/SIGDA 12th international symposium on Field programmable gate arrays*. Ed. by TESSIER, Russ and SCHMIT, Herman. New York, NY, USA: ACM, 2004, pp. 51–58. DOI: [10.1145/968280.968289](https://doi.org/10.1145/968280.968289) (cit. on p. [97\)](#page-119-0).
- **[226]** TUAN, Tim; RAHMAN, Arif; DAS, Satyaki; TRIMBERGER, Steve and KAO, Sean: "A 90-nm Low-Power FPGA for Battery-Powered Applications". In: *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems* 26.2 (2007), pp. 296–300. DOI: [10.1109/TCAD.2006.885731](https://doi.org/10.1109/TCAD.2006.885731) (cit. on p. [97\)](#page-119-0).
- **[227]** BSOUL, Assem A. M.; WILTON, Steven J. E.; TSOI, Kuen Hung and LUK, Wayne: "An FPGA Architecture and CAD Flow Supporting Dynamically Controlled Power Gating". In: *IEEE Transactions on Very Large Scale Integration (VLSI) Systems* 24.1 (2016), pp. 178–191. DOI: [10.1109/TVLSI.2015.2393914](https://doi.org/10.1109/TVLSI.2015.2393914) (cit. on p. [98\)](#page-120-0).
- **[228]** BSOUL, Assem A. M. and WILTON, Steven J. E.: "An FPGA architecture supporting dynamically controlled power gating". In: *2010 International Conference on Field-Programmable Technology*. IEEE, 2010, pp. 1–8. DOI: [10.1109/FPT.2010.5681533](https://doi.org/10.1109/FPT.2010.5681533) (cit. on p. [98\)](#page-120-0).
- **[229]** SEIFOORI, Zeinab; ASADI, Hossein and STOJILOVIC, Mirjana: "A Machine Learning Approach for Power Gating the FPGA Routing Network". In: *2019 International Conference on Field-Programmable Technology (ICFPT)*. IEEE, 2019, pp. 10–18. DOI: [10.1109/ICFPT47387.2019.00010](https://doi.org/10.1109/ICFPT47387.2019.00010) (cit. on p. [98\)](#page-120-0).
- **[230]** NABINA, Atukem and NUNEZ-YANEZ, Jose Luis: "Adaptive Voltage Scaling in a Dynamically Reconfigurable FPGA-Based Platform". In: *ACM Transactions on Reconfigurable Technology and Systems* 5.4 (2012), pp. 1–22. DOI: [10.1145/2392616.2392618](https://doi.org/10.1145/2392616.2392618) (cit. on p. [99\)](#page-121-0).
- **[231]** NUNEZ-YANEZ, Jose Luis: "Adaptive Voltage Scaling with In-Situ Detectors in Commercial FPGAs". In: *IEEE Transactions on Computers* 64.1 (2015), pp. 45–53. DOI: [10.1109/TC.2014.2365963](https://doi.org/10.1109/TC.2014.2365963) (cit. on pp. [99,](#page-121-0) [100\)](#page-122-0).
- **[232]** AHMED, Ibrahim; ZHAO, Shuze; TRESCASES, Olivier and BETZ, Vaughn: "Measure twice and cut once: Robust dynamic voltage scaling for FPGAs". In: *2016 26th International Conference on Field Programmable Logic and Applications (FPL)*. IEEE, 2016, pp. 1–11. DOI: [10.1109/FPL.2016.7577342](https://doi.org/10.1109/FPL.2016.7577342) (cit. on p. [99\)](#page-121-0).
- **[233]** AHMED, Ibrahim; ZHAO, Shuze; TRESCASES, Olivier and BETZ, Vaughn: "Automatic Application-Specific Calibration to Enable Dynamic Voltage Scaling in FPGAs". In: *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems* 37.12 (2018), pp. 3095–3108. DOI: [10.1109/TCAD.2018.2801222](https://doi.org/10.1109/TCAD.2018.2801222) (cit. on p. [99\)](#page-121-0).
- **[234]** LEVINE, Joshua M.; STOTT, Edward and CHEUNG, Peter Y.K.: "Dynamic voltage & frequency scaling with online slack measurement". In: *Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays*. Ed. by BETZ, Vaughn and CONSTANTINIDES, George A. New York, NY, USA: ACM, 2014, pp. 65–74. DOI: [10.1145/2554688.2554784](https://doi.org/10.1145/2554688.2554784) (cit. on p. [99\)](#page-121-0).
- **[235]** ZHAO, Shuze; AHMED, Ibrahim; LAMOUREUX, Carl; LOTFI, Ashraf; BETZ, Vaughn and TRESCASES, Olivier: "A universal self-calibrating Dynamic Voltage and Frequency Scaling (DVFS) scheme with thermal compensation for energy savings in FPGAs". In: *2016 IEEE Applied Power Electronics Conference and Exposition (APEC)*. IEEE, 2016, pp. 1882–1887. DOI: [10.1109/APEC.2016.7468125](https://doi.org/10.1109/APEC.2016.7468125) (cit. on p. [100\)](#page-122-0).
- **[236]** TAKA, Endri; LENTARIS, George and SOUDRIS, Dimitrios: "Improving the performance of RISC-V softcores on FPGA by exploiting PVT variability and DVFS". In: *2022 IEEE International Symposium on Circuits and Systems (ISCAS)*. IEEE, 2022, pp. 1595–1599. DOI: [10.1109/ISCAS48785.2022.9937320](https://doi.org/10.1109/ISCAS48785.2022.9937320) (cit. on p. [100\)](#page-122-0).
- **[237]** GAYASEN, A.; LEE, K.; VIJAYKRISHNAN, N.; KANDEMIR, M.; IRWIN, M. J. and TUAN, T.: "A Dual-VDD Low Power FPGA Architecture". In: *Field Programmable Logic and Application*. Ed. by HUTCHISON, David et al. Vol. 3203. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer Berlin Heidelberg, 2004, pp. 145–157. DOI: [10.1007/978-3-540-30117-2\_17](https://doi.org/10.1007/978-3-540-30117-2_17) (cit. on pp. [100,](#page-122-0) [200\)](#page-222-0).
- **[238]** ZUKOSKI, Andrew; YANG, Xuebei and MOHANRAM, Kartik: "Universal logic modules based on double-gate carbon nanotube transistors". In: *Proceedings of the 48th Design Automation Conference*. Ed. by STOK, Leon; DUTT, Nikil and HASSOUN, Soha. New York, NY, USA: ACM, 2011, pp. 884–889. DOI: [10.1145/2024724.2024921](https://doi.org/10.1145/2024724.2024921) (cit. on p. [101\)](#page-123-0).
- **[239]** ANDERSON, Jason H. and WANG, Qiang: "Area-efficient FPGA logic elements: Architecture and synthesis". In: *16th Asia and South Pacific Design Automation Conference (ASP-DAC 2011)*. IEEE, 2011, pp. 369–375. DOI: [10.1109/ASPDAC.2011.5722215](https://doi.org/10.1109/ASPDAC.2011.5722215) (cit. on pp. [101,](#page-123-0) [102\)](#page-124-0).
- **[240]** LIN, Chih-Chang and MAREK-SADOWSKA, M.: "On designing universal logic blocks and their application to FPGA design". In: *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems* 16.5 (1997), pp. 519–527. DOI: [10.1109/43.631214](https://doi.org/10.1109/43.631214) (cit. on p. [101\)](#page-123-0).
- **[241]** CONG, Jason; HUANG, Hui and YUAN, Xin: "Technology mapping and architecture evalution for k/m-macrocell-based FPGAs". In: *ACM Transactions on Design Automation of Electronic Systems* 10.1 (2005), pp. 3–23. DOI: [10.1145/1044111.1044113](https://doi.org/10.1145/1044111.1044113) (cit. on p. [101\)](#page-123-0).
- **[242]** HU, Yu; DAS, Satyaki; TRIMBERGER, Steve and HE, Lei: "Design, Synthesis and Evaluation of Heterogeneous FPGA with Mixed LUTs and Macro-Gates". In: *Proceedings of the 2007 IEEE/ACM International Conference on Computer-Aided Design* (2007), pp. 188–193 (cit. on p. [102\)](#page-124-0).
- **[243]** LUO, Tao; LIANG, Hao; ZHANG, Wei; HE, Bingsheng and MASKELL, Douglas: "A Hybrid Logic Block Architecture in FPGA for Holistic Efficiency". In: *IEEE Transactions on Circuits and Systems II: Express Briefs* 64.1 (2017), pp. 71–75. DOI: [10.1109/TCSII.2016.2551555](https://doi.org/10.1109/TCSII.2016.2551555) (cit. on pp. [102,](#page-124-0) [103\)](#page-125-0).
- **[244]** VTR DEVELOPERS: VTR Documentation: FPGA Assembly (FASM) Output Support. 2024. URL: <https://docs.verilogtorouting.org/en/stable/utils/fasm/> (visited on 02/01/2024) (cit. on p. [127\)](#page-149-0).
- **[245]** VTR DEVELOPERS: VTR Documentation: Routing Resource Graph File Format. 2024. URL: [https://docs.verilogtorouting.org/en/stable/vpr/file\_formats/#routing-resource-graph-file-format-xml](https://docs.verilogtorouting.org/en/stable/vpr/file_formats/#routing-resource-graph-file-format-xml) (visited on 02/01/2024) (cit. on p. [131\)](#page-153-0).
- **[246]** GALDERISI, G.; MIKOLAJICK, T. and TROMMER, J.: "The RGATE: an 8-in-1 Polymorphic Logic Gate Built from Reconfigurable Field Effect Transistors". In: *IEEE Electron Device Letters* (2024). Early Access. DOI: [10.1109/LED.2023.3347397](https://doi.org/10.1109/LED.2023.3347397) (cit. on pp. [144,](#page-166-0) [227–](#page-249-0)[229\)](#page-251-0).
- **[247]** FREELEY, Jennifer; MISHAGLI, Dmytro; BRAZIL, Tom and BLOKHINA, Elena: "Statistical Simulations of Delay Propagation in Large Scale Circuits Using Graph Traversal and Kernel Function Decomposition". In: *2018 15th International Conference on Synthesis, Modeling, Analysis and Simulation Methods and Applications to Circuit Design (SMACD)*. IEEE, 2018, pp. 213–219. DOI: [10.1109/SMACD.2018.8434901](https://doi.org/10.1109/SMACD.2018.8434901) (cit. on p. [153\)](#page-175-0).

- **[248]** MANFREDI, Paolo and TRINCHERO, Riccardo: "A Probabilistic Machine Learning Approach for the Uncertainty Quantification of Electronic Circuits Based on Gaussian Process Regression". In: *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems* 41.8 (2022), pp. 2638–2651. DOI: [10.1109/TCAD.2021.3112138](https://doi.org/10.1109/TCAD.2021.3112138) (cit. on p. [153\)](#page-175-0).
- **[249]** TURKYILMAZ, Ogun; CLERMIDY, Fabien; AMARU, Luca Gaetano; GAILLARDON, Pierre-Emmanuel and MICHELI, Giovanni de: "Self-checking ripple-carry adder with Ambipolar Silicon NanoWire FET". In: *2013 IEEE International Symposium on Circuits and Systems (ISCAS2013)*. IEEE, 2013, pp. 2127–2130. DOI: [10.1109/ISCAS.2013.6572294](https://doi.org/10.1109/ISCAS.2013.6572294) (cit. on p. [169\)](#page-191-0).
- **[250]** BERNSTEIN, Daniel J.: ChaCha, a variant of Salsa20. 2008. URL: [http://cr.yp.to/chacha/chacha-20080120.pdf](http://cr.yp.to/chacha/chacha-20080120.pdf) (visited on 04/25/2023) (cit. on p. [170\)](#page-192-0).
- **[251]** AHMED, Ibrahim; SHEN, Linda L. and BETZ, Vaughn: "Optimizing FPGA Logic Circuitry for Variable Voltage Supplies". In: *IEEE Transactions on Very Large Scale Integration (VLSI) Systems* 28.4 (2020), pp. 890–903. DOI: [10.1109/TVLSI.2019.2962501](https://doi.org/10.1109/TVLSI.2019.2962501) (cit. on p. [197\)](#page-219-0).
- **[252]** GOLOMB, Solomon W.: Shift register sequences: [secure and limited-access code generators ; efficiency code generators ; prescribed property generators ; mathematical models]. Rev. ed. Laguna Hills, Calif.: Aegean Park Pr, 1982 (cit. on p. [221\)](#page-243-0).
- **[253]** XIPHERA LTD.: CHACHA20-POLY1305 PRODUCT BRIEF. 2019. URL: [https://xiphera.com/product\_brief/ChaCha20\_Poly1305\_MPSoC.pdf](https://xiphera.com/product_brief/ChaCha20_Poly1305_MPSoC.pdf) (visited on 04/16/2019) (cit. on p. [259\)](#page-281-0).
- **[254]** AT, Nuray; BEUCHAT, Jean-Luc; OKAMOTO, Eiji; SAN, Ismail and YAMAZAKI, Teppei: "Compact Hardware Implementations of ChaCha, BLAKE, Threefish, and Skein on FPGA". In: *IEEE Transactions on Circuits and Systems I: Regular Papers* 61.2 (2014), pp. 485–498. DOI: [10.1109/TCSI.2013.2278385](https://doi.org/10.1109/TCSI.2013.2278385) (cit. on p. [259\)](#page-281-0).
- **[255]** STRÖMBERGSON, Joachim: Verilog 2001 implementation of the ChaCha stream cipher. 2019. URL: [https://github.com/secworks/chacha/](https://github.com/secworks/chacha/) (visited on 04/16/2019) (cit. on p. [259\)](#page-281-0).
- **[256]** SILITONGA, Arthur; SCHADE, Florian; JIANG, Guanru and BECKER, Juergen: "HLS-Based Performance and Resource Optimization of Cryptographic Modules". In: *Proceedings of the 16th IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA2018), Melbourne, Australia, 11th-13th December 2018*. IEEE, 2018, pp. 1009–1016. DOI: [10.1109/BDCloud.2018.00147](https://doi.org/10.1109/BDCloud.2018.00147) (cit. on p. [259\)](#page-281-0).
- **[257]** SOLTANI, Abolfazl and SHARIFIAN, Saeed: "An ultra-high throughput and fully pipelined implementation of AES algorithm on FPGA". In: *Microprocessors and Microsystems* 39.7 (2015), pp. 480–493. DOI: [10.1016/j.micpro.2015.07.005](https://doi.org/10.1016/j.micpro.2015.07.005) (cit. on p. [259\)](#page-281-0).

- **[258]** YAZDANSHENAS, Sadegh and BETZ, Vaughn: "COFFE 2: Automatic Modelling and Optimization of Complex and Heterogeneous FPGA Architectures". In: *ACM Transactions on Reconfigurable Technology and Systems (TRETS)* 12.1 (2019), p. 3. DOI: [10.1145/3301298](https://doi.org/10.1145/3301298) (cit. on p. [264\)](#page-286-0).

### **Publications**

This section contains a complete list of the author's own publications. Publications [\[Pfa18,](#page-343-0) [Pfa19,](#page-343-1) [Pfa20,](#page-343-2) [Pfa21,](#page-344-0) [Pfa22,](#page-344-1) [Pfa23a,](#page-344-2) [Pfa23b\]](#page-344-3) were authored by this thesis' author and are listed first. Of those, [\[Pfa18\]](#page-343-0) focuses on [FPGA](#page-362-0)-based signal processing and [\[Pfa23a\]](#page-344-2) on teaching of [System-on-Chip \(SoC\)](#page-364-0) design; neither is directly related to the content of this thesis. These original publications are also available as open access preprints in the KITopen repository. The list continues with publications to which this thesis' author has contributed. [PARFAIT](#page-363-0) publications [\[Reu19,](#page-344-4) [Reu20,](#page-344-5) [Reu21,](#page-344-6) [Reu22\]](#page-344-7) contain contributions relevant to this thesis and are again listed first. All remaining publications are not related to the thesis' topic.

- <span id="page-343-0"></span>**[Pfa18]** PFAU, Johannes; FIGULI, Shalina Percy Delicia; BÄHR, Steffen and BECKER, Jürgen: "Reconfigurable FPGA-Based Channelization Using Polyphase Filter Banks for Quantum Computing Systems". In: *Applied Reconfigurable Computing. Architectures, Tools, and Applications*. Ed. by VOROS, Nikolaos; HUEBNER, Michael; KERAMIDAS, Georgios; GOEHRINGER, Diana; ANTONOPOULOS, Christos and DINIZ, Pedro C. Vol. 10824. Lecture Notes in Computer Science. Cham: Springer International Publishing, 2018, pp. 615–626. DOI: [10.1007/978-3-319-78890-6\_49](https://doi.org/10.1007/978-3-319-78890-6_49).
- <span id="page-343-1"></span>**[Pfa19]** PFAU, Johannes; REUTER, Maximilian; HARBAUM, Tanja; HOFMANN, Klaus and BECKER, Jürgen: "A Hardware Perspective on the ChaCha Ciphers: Scalable Chacha8/12/20 Implementations Ranging from 476 Slices to Bitrates of 175 Gbit/s". In: *2019 32nd IEEE International System-on-Chip Conference (SOCC)*. IEEE, 2019, pp. 294–299. DOI: [10.1109/socc46988.2019.1570548289](https://doi.org/10.1109/socc46988.2019.1570548289) (cit. on pp. [170,](#page-192-0) [173,](#page-195-0) [174,](#page-196-0) [254\)](#page-276-0).
- <span id="page-343-2"></span>**[Pfa20]** PFAU, Johannes; REUTER, Maximilian; HOFMANN, Klaus and BECKER, Jürgen: "Designing Universal Logic Module FPGA Architectures for Use With Ambipolar Transistor Technology". In: *2020 International Conference on Field-Programmable Technology (ICFPT)*. IEEE, 2020, pp. 165–173. DOI: [10.1109/icfpt51103.2020.00031](https://doi.org/10.1109/icfpt51103.2020.00031) (cit. on pp. [177,](#page-199-0) [180,](#page-202-0) [183\)](#page-205-0).

<span id="page-344-0"></span>

- <span id="page-344-1"></span>**[Pfa22]** PFAU, Johannes; ZAKI, Peter Wagih and BECKER, Jürgen: "V-FPGAs: Increasing Performance with Manual Placement, Timing Extraction and Extended Timing Modeling". In: *Journal of Signal Processing Systems* 94.9 (2022), pp. 865–882. DOI: [10.1007/s11265-022-01786-z](https://doi.org/10.1007/s11265-022-01786-z) (cit. on pp. [232,](#page-254-0) [236\)](#page-258-0).
- <span id="page-344-2"></span>**[Pfa23a]** PFAU, Johannes; LEYS, Richard; NEU, Marc; SERDYUK, Alexey; PERIC, Ivan and BECKER, Jürgen: "A Unified SoC Lab Course: Combined Teaching of Mixed Signal Aspects, System Integration, Software Development and Documentation". In: *2023 IEEE International Symposium on Circuits and Systems (ISCAS)*. IEEE, 2023, pp. 1–5. DOI: [10.1109/ISCAS46773.2023.10181679](https://doi.org/10.1109/ISCAS46773.2023.10181679).
- <span id="page-344-3"></span>**[Pfa23b]** PFAU, Johannes; HERNANDEZ, Jiro; REUTER, Maximilian; HOFMANN, Klaus and BECKER, Jürgen: "Co-Simulating Region-Based Dynamic Voltage Scaling for FPGA Architecture Design". In: *2023 IEEE Nordic Circuits and Systems Conference (NorCAS)*. IEEE, 2023, pp. 1–7. DOI: [10.1109/NorCAS58970.2023.10305486](https://doi.org/10.1109/NorCAS58970.2023.10305486) (cit. on pp. [196,](#page-218-0) [245,](#page-267-0) [264\)](#page-286-0).
- <span id="page-344-4"></span>**[Reu19]** REUTER, Maximilian; KRAUSS, Tillmann A.; MORADINASAB, Mahdi; PFAU, Johannes; SCHWALKE, Udo; BECKER, Jürgen and HOFMANN, Klaus: "From MOSFETs to Ambipolar Transistors: A Static DeFET Inverter Cell for SOI". In: *2019 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS)*. IEEE, 2019, pp. 113–116. DOI: [10.1109/APCCAS47518.2019.8953083](https://doi.org/10.1109/APCCAS47518.2019.8953083) (cit. on pp. [20,](#page-42-0) [21,](#page-43-0) [161\)](#page-183-0).
- <span id="page-344-5"></span>**[Reu20]** REUTER, Maximilian; PFAU, Johannes; KRAUSS, Tillmann A.; MORADINASAB, Mahdi; SCHWALKE, Udo; BECKER, Jürgen and HOFMANN, Klaus: "Towards Ambipolar Planar Devices: The DeFET Device in Area Constrained XOR Applications". In: *2020 IEEE 11th Latin American Symposium on Circuits & Systems (LASCAS)*. IEEE, 2020, pp. 1–4. DOI: [10.1109/lascas45839.2020.9069043](https://doi.org/10.1109/lascas45839.2020.9069043) (cit. on pp. [3,](#page-25-0) [161,](#page-183-0) [163\)](#page-185-0).
- <span id="page-344-6"></span>**[Reu21]** REUTER, Maximilian; PFAU, Johannes; KRAUSS, Tillmann A.; BECKER, Jürgen and HOFMANN, Klaus: "From MOSFETs to Ambipolar Transistors: Standard Cell Synthesis for the Planar RFET Technology". In: *IEEE Transactions on Circuits and Systems I: Regular Papers* 68.1 (2021), pp. 114–125. DOI: [10.1109/TCSI.2020.3035889](https://doi.org/10.1109/TCSI.2020.3035889) (cit. on pp. [3,](#page-25-0) [133,](#page-155-0) [134,](#page-156-0) [161–](#page-183-0)[164,](#page-186-0) [175,](#page-197-0) [176,](#page-198-0) [253,](#page-275-0) [254,](#page-276-0) [256,](#page-278-0) [288\)](#page-310-0).
- <span id="page-344-7"></span>**[Reu22]** REUTER, Maximilian; KRAMER, Andreas; KRAUSS, Tillmann; PFAU, Johannes; BECKER, Jürgen and HOFMANN, Klaus: "Reconfiguring an RFET Based Differential Amplifier". In: *2022 IEEE 40th Central America and Panama Convention (CONCAPAN)*. IEEE, 2022, pp. 1–6. DOI: [10.1109/CONCAPAN48024.2022.9997726](https://doi.org/10.1109/CONCAPAN48024.2022.9997726) (cit. on p. [18\)](#page-40-0).

- **[Pis16]** PISTORIUS, Felix; LAUBER, Andreas; PFAU, Johannes; KLIMM, Alexander and BECKER, Jürgen: "Development of a Latency Optimized Communication Device for WAVE and SAE Based V2X-Applications". In: *SAE Technical Paper Series*. SAE Technical Paper Series. SAE International, 400 Commonwealth Drive, Warrendale, PA, United States, 2016. DOI: [10.4271/2016-01-0150](https://doi.org/10.4271/2016-01-0150).
- **[Nus22]** NUSS, Benjamin; GROESCHEL, Patrick; PFAU, Johannes; BECKER, Juergen; VOSSIEK, Martin and ZWICK, Thomas: "Broadband MIMO Testbed for the Development and Research on 6G". In: *European Wireless 2022; 27th European Wireless Conference*. 2022.
- **[Kar22]** KARLE, Christian Maximilian; KREUTZER, Marius; PFAU, Johannes and BECKER, Jürgen: "A hardware/software co-design approach to prototype 6G mobile applications inside the GNU Radio SDR Ecosystem using FPGA hardware accelerators". In: *International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies*. New York, NY, USA: ACM, 2022, pp. 33–41. DOI: [10.1145/3535044.3535049](https://doi.org/10.1145/3535044.3535049).
- **[Chu23]** CHU, Anqi et al.: "LETSCOPE: Lifecycle Extensions Through Software-Defined Predictive Control of Power Electronics". In: *IEEE EUROCON 2023 - 20th International Conference on Smart Technologies*. IEEE, 2023, pp. 665–670. DOI: [10.1109/EUROCON56442.2023.10199076](https://doi.org/10.1109/EUROCON56442.2023.10199076).
- **[Kar23]** KARLE, Christian; NEU, Marc; PFAU, Johannes; SPERLING, Jan and BECKER, Jürgen: "ReLoDAQ: Resource-Efficient, Low-Overhead 200 Gbit s⁻¹ Data Acquisition System for 6G Prototyping". In: *2023 IEEE 31st Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)*. IEEE, 2023, p. 209. DOI: [10.1109/FCCM57271.2023.00037](https://doi.org/10.1109/FCCM57271.2023.00037).
- **[Mar23]** MARTEN, Johann Christian; YOUNIS, Marwan; KRIEGER, Gerhard; PFAU, Johannes; UNGER, Kai and BECKER, Jürgen: "Design and Implementation of Staggered-SAR Azimuth-Processing". In: *2023 24th International Radar Symposium (IRS)*. IEEE, 2023, pp. 1–10. DOI: [10.23919/IRS57608.2023.10172405](https://doi.org/10.23919/IRS57608.2023.10172405).
- **[Kre23]** KREß, Fabian; PFAU, Johannes; KEMPF, Fabian; SCHMIDT, Patrick; HE, Zhuofan; HARBAUM, Tanja and BECKER, Jürgen: "Automated Replacement of State-Holding Flip-Flops to Enable Non-Volatile Checkpointing". In: *2023 IEEE Nordic Circuits and Systems Conference (NorCAS)*. IEEE, 2023, pp. 1–7. DOI: [10.1109/norcas58970.2023.10305469](https://doi.org/10.1109/norcas58970.2023.10305469).

### **Student Theses**

This section contains a list of student theses which have been supervised by this thesis' author. Entries are sorted by year and author name. For [\[Den21,](#page-347-0) [Len21,](#page-347-1) [He22b,](#page-348-0) [Li23\]](#page-348-1), this work's author was the secondary supervisor. Theses [\[Cre19,](#page-347-2) [Ste19,](#page-347-3) [Len21\]](#page-347-1) were external theses, for which this work's author was the institute's supervising contact.

<span id="page-347-3"></span><span id="page-347-2"></span><span id="page-347-1"></span><span id="page-347-0"></span>

<span id="page-348-1"></span><span id="page-348-0"></span>

# **Figures**












# **Tables**



# **Listings**




# **Acronyms**

<span id="page-361-7"></span><span id="page-361-6"></span><span id="page-361-5"></span><span id="page-361-4"></span><span id="page-361-3"></span><span id="page-361-2"></span><span id="page-361-1"></span><span id="page-361-0"></span>

<span id="page-362-9"></span><span id="page-362-8"></span><span id="page-362-7"></span><span id="page-362-6"></span><span id="page-362-5"></span><span id="page-362-4"></span><span id="page-362-3"></span><span id="page-362-2"></span><span id="page-362-1"></span><span id="page-362-0"></span>

<span id="page-363-6"></span><span id="page-363-5"></span><span id="page-363-4"></span><span id="page-363-3"></span><span id="page-363-2"></span><span id="page-363-1"></span><span id="page-363-0"></span>

<span id="page-364-8"></span><span id="page-364-7"></span><span id="page-364-6"></span><span id="page-364-5"></span><span id="page-364-4"></span><span id="page-364-3"></span><span id="page-364-2"></span><span id="page-364-1"></span><span id="page-364-0"></span>

# **Glossary**

<span id="page-365-6"></span><span id="page-365-5"></span><span id="page-365-4"></span><span id="page-365-3"></span><span id="page-365-2"></span><span id="page-365-1"></span><span id="page-365-0"></span>

<span id="page-366-7"></span><span id="page-366-6"></span><span id="page-366-5"></span><span id="page-366-4"></span><span id="page-366-3"></span><span id="page-366-2"></span><span id="page-366-1"></span><span id="page-366-0"></span>


<span id="page-367-3"></span><span id="page-367-2"></span><span id="page-367-1"></span><span id="page-367-0"></span>


## **Appendix A**

#### **Research Data Archive**

Some data used in this thesis cannot be published as open source, as it uses parts of a commercial [PDK](#page-363-1) or was partially derived from source code written by previous institute members, for which the licensing is unclear. Nevertheless, all code developed as part of this thesis, as well as all raw evaluation data, has been stored in the [Research Data Archive](https://www.rda.kit.edu/). The filename for the thesis data is johannes\_pfau\_thesis.tar.zst and the SHA256 checksum of the file is bf8192cbd243af405b845f2aa5c95d856c37a0ae43cebdb2747b8f899e8fb728.
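The integrity of a downloaded copy can be checked against the checksum stated above. The following is a minimal sketch in Python using only the standard library; the helper names `sha256_of_file` and `archive_is_intact` are illustrative assumptions and are not part of the archived thesis data, and the archive is assumed to lie in the current working directory.

```python
import hashlib
from pathlib import Path

# SHA256 checksum as printed in this appendix.
EXPECTED_SHA256 = "bf8192cbd243af405b845f2aa5c95d856c37a0ae43cebdb2747b8f899e8fb728"


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so that large archives need not fit into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected=EXPECTED_SHA256):
    """Compare the digest of the file at `path` with the expected checksum."""
    return sha256_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumed location: the archive lies in the current working directory.
    archive = Path("johannes_pfau_thesis.tar.zst")
    if archive.exists():
        print("checksum OK" if archive_is_intact(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found")
```

On most Linux systems, running `sha256sum johannes_pfau_thesis.tar.zst` and comparing the output by eye serves the same purpose.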


## **Appendix B**

#### **FPGA Architecture Descriptions**

```
<pb_type name="io" capacity="8" area="0">
  <input name="outpad" num_pins="1"/>
  <output name="inpad" num_pins="1"/>
  <clock name="clock" num_pins="1"/>

  <!-- IOs can operate as either inputs or outputs. -->
  <mode name="inpad">
    <pb_type name="inpad" blif_model=".input" num_pb="1">
      <output name="inpad" num_pins="1"/>
    </pb_type>
    <interconnect>
      <direct name="inpad" input="inpad.inpad" output="io.inpad">
        <delay_constant max="4.243e-11" in_port="inpad.inpad" out_port="io.inpad"/>
      </direct>
    </interconnect>
  </mode>

  <mode name="outpad">
    <pb_type name="outpad" blif_model=".output" num_pb="1">
      <input name="outpad" num_pins="1"/>
    </pb_type>
    <interconnect>
      <direct name="outpad" input="io.outpad" output="outpad.outpad">
        <delay_constant max="1.394e-11" in_port="io.outpad" out_port="outpad.outpad"/>
      </direct>
    </interconnect>
  </mode>

  <fc in_type="frac" in_val="0.15" out_type="frac" out_val="0.10"/>

  <pinlocations pattern="custom">
    <loc side="left">io.outpad io.inpad io.clock</loc>
    <loc side="top">io.outpad io.inpad io.clock</loc>
    <loc side="right">io.outpad io.inpad io.clock</loc>
    <loc side="bottom">io.outpad io.inpad io.clock</loc>
  </pinlocations>

  <power method="ignore"/>
</pb_type>
```
**Listing B.1:** Full description of the *IOB* used in the *k6\_frac\_N10\_40nm* reference architecture.
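
Architecture descriptions such as Listing B.1 are plain XML and can be inspected with generic tooling. The following minimal sketch, which is not part of the thesis toolflow, uses Python's standard `xml.etree.ElementTree` module to list the operating modes of a `pb_type` together with their constant interconnect delays. The embedded snippet is a reduced copy of Listing B.1, and the function name `mode_delays` is a hypothetical helper chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Reduced copy of the IOB description from Listing B.1 (ports and pin locations omitted).
IOB_SNIPPET = """
<pb_type name="io" capacity="8" area="0">
  <mode name="inpad">
    <interconnect>
      <direct name="inpad" input="inpad.inpad" output="io.inpad">
        <delay_constant max="4.243e-11" in_port="inpad.inpad" out_port="io.inpad"/>
      </direct>
    </interconnect>
  </mode>
  <mode name="outpad">
    <interconnect>
      <direct name="outpad" input="io.outpad" output="outpad.outpad">
        <delay_constant max="1.394e-11" in_port="io.outpad" out_port="outpad.outpad"/>
      </direct>
    </interconnect>
  </mode>
</pb_type>
"""


def mode_delays(xml_text):
    """Map each mode name to a list of (in_port, out_port, max delay in seconds)."""
    root = ET.fromstring(xml_text)
    result = {}
    # Only direct <mode> children of the top-level pb_type are considered.
    for mode in root.findall("mode"):
        entries = []
        for delay in mode.iter("delay_constant"):
            entries.append(
                (delay.get("in_port"), delay.get("out_port"), float(delay.get("max")))
            )
        result[mode.get("name")] = entries
    return result


delays = mode_delays(IOB_SNIPPET)
for name, entries in delays.items():
    for in_port, out_port, seconds in entries:
        # Delays are stored in seconds; print them in picoseconds for readability.
        print(f"{name}: {in_port} -> {out_port}: {seconds * 1e12:.2f} ps")
```

The sketch only reads `delay_constant` elements; other timing annotations of the architecture format, such as `T_setup`, would need to be handled separately.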

```
<mode name="n1_lut6">
  <pb_type name="ble6" num_pb="1">
    <input name="in" num_pins="6"/>
    <output name="out" num_pins="1"/>
    <clock name="clk" num_pins="1"/>

    <pb_type name="lut6" blif_model=".names" num_pb="1" class="lut">
      <input name="in" num_pins="6" port_class="lut_in"/>
      <output name="out" num_pins="1" port_class="lut_out"/>
    </pb_type>

    <pb_type name="ff" blif_model=".latch" num_pb="1" class="flipflop">
      <input name="D" num_pins="1" port_class="D"/>
      <output name="Q" num_pins="1" port_class="Q"/>
      <clock name="clk" num_pins="1" port_class="clock"/>
      <T_setup value="66e-12" port="ff.D" clock="clk"/>
      <T_clock_to_Q max="124e-12" port="ff.Q" clock="clk"/>
    </pb_type>

    <interconnect>
      <direct name="direct1" input="ble6.in" output="lut6[0:0].in"/>
      <direct name="direct2" input="lut6.out" output="ff.D">
        <pack_pattern name="ble6" in_port="lut6.out" out_port="ff.D"/>
      </direct>
      <direct name="direct3" input="ble6.clk" output="ff.clk"/>
      <mux name="mux1" input="ff.Q lut6.out" output="ble6.out"></mux>
    </interconnect>
  </pb_type>
  <interconnect>
    <direct name="direct1" input="fle.in" output="ble6.in"/>
    <direct name="direct2" input="ble6.out" output="fle.out[0:0]"/>
    <direct name="direct3" input="fle.clk" output="ble6.clk"/>
  </interconnect>
</mode>
```
**Listing B.2:** [LUT6](#page-363-3) mode of the [FLE](#page-362-8) XML architecture description used by the [VTR](#page-364-2) framework and [VPR.](#page-364-3)

```
<mode name="n2_lut5">
  <pb_type name="lut5inter" num_pb="1">
    <input name="in" num_pins="5"/>
    <output name="out" num_pins="2"/>
    <clock name="clk" num_pins="1"/>

    <pb_type name="ble5" num_pb="2">
      <input name="in" num_pins="5"/>
      <output name="out" num_pins="1"/>
      <clock name="clk" num_pins="1"/>

      <pb_type name="lut5" blif_model=".names" num_pb="1" class="lut">
        <input name="in" num_pins="5" port_class="lut_in"/>
        <output name="out" num_pins="1" port_class="lut_out"/>
      </pb_type>

      <pb_type name="ff" blif_model=".latch" num_pb="1" class="flipflop">
        <input name="D" num_pins="1" port_class="D"/>
        <output name="Q" num_pins="1" port_class="Q"/>
        <clock name="clk" num_pins="1" port_class="clock"/>
      </pb_type>

      <interconnect>
        <direct name="direct1" input="ble5.in[4:0]" output="lut5[0:0].in[4:0]"/>
        <direct name="direct2" input="lut5[0:0].out" output="ff[0:0].D">
          <pack_pattern name="ble5" in_port="lut5[0:0].out" out_port="ff[0:0].D"/>
        </direct>
        <direct name="direct3" input="ble5.clk" output="ff[0:0].clk"/>
        <mux name="mux1" input="ff[0:0].Q lut5.out[0:0]" output="ble5.out[0:0]"></mux>
      </interconnect>
    </pb_type>
    <interconnect>
      <direct name="direct1" input="lut5inter.in" output="ble5[0:0].in"/>
      <direct name="direct2" input="lut5inter.in" output="ble5[1:1].in"/>
      <direct name="direct3" input="ble5[1:0].out" output="lut5inter.out"/>
      <complete name="complete1" input="lut5inter.clk" output="ble5[1:0].clk"/>
    </interconnect>
  </pb_type>

  <interconnect>
    <direct name="direct1" input="fle.in[4:0]" output="lut5inter.in"/>
    <direct name="direct2" input="lut5inter.out" output="fle.out"/>
    <direct name="direct3" input="fle.clk" output="lut5inter.clk"/>
  </interconnect>
</mode>
```
**Listing B.3:** Dual [LUT5](#page-363-3) mode of the [FLE](#page-362-8) XML architecture description used by the [VTR](#page-364-2) framework and [VPR.](#page-364-3)
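
The nesting of `pb_type` elements in these listings can be inspected programmatically. The following is a minimal sketch, not part of the VTR toolflow: it parses a trimmed-down copy of the LUT6 mode (timing and pack patterns omitted) with Python's standard XML parser and lists each block with its nesting depth. The helper name `list_pb_types` and the rule that a block with a `blif_model` attribute counts as a primitive are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed-down excerpt of the LUT6 mode; timing and pack patterns are omitted.
ARCH_SNIPPET = """
<mode name="n1_lut6">
  <pb_type name="ble6" num_pb="1">
    <input name="in" num_pins="6"/>
    <output name="out" num_pins="1"/>
    <clock name="clk" num_pins="1"/>
    <pb_type name="lut6" blif_model=".names" num_pb="1" class="lut">
      <input name="in" num_pins="6" port_class="lut_in"/>
      <output name="out" num_pins="1" port_class="lut_out"/>
    </pb_type>
    <pb_type name="ff" blif_model=".latch" num_pb="1" class="flipflop">
      <input name="D" num_pins="1" port_class="D"/>
      <output name="Q" num_pins="1" port_class="Q"/>
      <clock name="clk" num_pins="1" port_class="clock"/>
    </pb_type>
  </pb_type>
</mode>
"""


def list_pb_types(element, depth=0):
    """Return (depth, name, is_primitive) for every nested pb_type."""
    found = []
    for child in element:
        if child.tag == "pb_type":
            # Assumption: blocks that name a BLIF model are primitives (leaves).
            found.append((depth, child.get("name"), "blif_model" in child.attrib))
            found.extend(list_pb_types(child, depth + 1))
        elif child.tag == "mode":
            # A mode groups alternative implementations at the same depth.
            found.extend(list_pb_types(child, depth))
    return found


if __name__ == "__main__":
    for depth, name, primitive in list_pb_types(ET.fromstring(ARCH_SNIPPET)):
        print("  " * depth + name + (" (primitive)" if primitive else ""))
```

Running the same function on the dual LUT5 mode would show one more level, because `lut5inter` wraps the two `ble5` instances.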

# **Appendix C**

#### **Delay Model Extraction**

```
library ieee;
use ieee.std_logic_1164.all;

entity tpd is
  port (
    a: in std_logic;
    y: out std_logic_vector(3 downto 0)
  );
end;

architecture xt018 of tpd is
  signal inv0_out, inv1_out: std_logic;
begin

  inv0: entity work.INHDX0
    port map (
      A => a,
      Q => inv0_out
    );

  inv1: entity work.INHDX0
    port map (
      A => inv0_out,
      Q => inv1_out
    );

  inv2_0: entity work.INHDX0
    port map (
      A => inv1_out,
      Q => y(0)
    );

  inv2_1: entity work.INHDX0
    port map (
      A => inv1_out,
      Q => y(1)
    );

  inv2_2: entity work.INHDX0
    port map (
      A => inv1_out,
      Q => y(2)
    );

  inv2_3: entity work.INHDX0
    port map (
      A => inv1_out,
      Q => y(3)
    );

end;
```
**Listing C.1:** VHDL code used in Cadence Genus synthesis with the *XT018* PDK to extract [FO4](#page-365-6) inverter delay. The model is used with different corner files to extract differences caused by process variation.

# **Appendix D**

#### **Standard Cell Library Excerpts**

```
/*
 * cell_description : 2-Input NOR
 */

cell (NO2HDX1) {
  area : 10.0352;
  cell_footprint : NO2;

  pg_pin (gnd) {
    voltage_name : gndCommon;
    pg_type : primary_ground;
  }
  pg_pin (vdd) {
    voltage_name : vddInt;
    pg_type : primary_power;
  }

  pin (A) {
    direction : input;
    related_ground_pin : gnd;
    related_power_pin : vdd;
    rise_capacitance : 0.006461898799178912;
    fall_capacitance : 0.005618442917028880;
  }
  pin (B) {
    direction : input;
    related_ground_pin : gnd;
    related_power_pin : vdd;
    rise_capacitance : 0.006480533990037019;
    fall_capacitance : 0.005240246832871408;
  }
  pin (Q) {
    direction : output;
    function : "!(A+B)";
    max_capacitance: 0.015;
    max_fanout : 10;
    related_ground_pin : gnd;
    related_power_pin : vdd;
    power_down_function : "!vdd + gnd";

    timing () {
      timing_sense : negative_unate;
      related_pin : A;
      rise_transition (delay_template_3x3) {
        index_1 ("0.060000, 0.300000, 0.540000");
        index_2 ("0.003000, 0.009000, 0.015000");
        values ( \
          "0.205750, 0.241829, 0.277907", \
          "0.329676, 0.356424, 0.383173", \
          "0.453601, 0.471020, 0.488439");
      }
      fall_transition (delay_template_3x3) {
        index_1 ("0.060000, 0.300000, 0.540000");
        index_2 ("0.003000, 0.009000, 0.015000");
        values ( \
          "0.270651, 0.305638, 0.340626", \
          "0.384688, 0.406597, 0.428506", \
          "0.498726, 0.507556, 0.516386");
      }
      cell_rise (delay_template_3x3) {
        index_1 ("0.060000, 0.300000, 0.540000");
        index_2 ("0.003000, 0.009000, 0.015000");
        values ( \
          "0.250443, 0.271110, 0.291777", \
          "0.339382, 0.364643, 0.389904", \
          "0.428321, 0.458176, 0.488031");
      }
      cell_fall (delay_template_3x3) {
        index_1 ("0.060000, 0.300000, 0.540000");
        index_2 ("0.003000, 0.009000, 0.015000");
        values ( \
          "0.235192, 0.321403, 0.407613", \
          "0.312413, 0.392071, 0.471729", \
          "0.389633, 0.462739, 0.535844");
      }
    }
    timing () {
      timing_sense : negative_unate;
      related_pin : B;
      rise_transition (delay_template_3x3) {
        index_1 ("0.060000, 0.300000, 0.540000");
        index_2 ("0.003000, 0.009000, 0.015000");
        values ( \
          "0.205300, 0.252093, 0.298886", \
          "0.329448, 0.366706, 0.403964", \
          "0.453595, 0.481319, 0.509043");
      }
      fall_transition (delay_template_3x3) {
        index_1 ("0.060000, 0.300000, 0.540000");
        index_2 ("0.003000, 0.009000, 0.015000");
        values ( \
          "0.158357, 0.193715, 0.229073", \
          "0.272944, 0.298892, 0.324839", \
          "0.387532, 0.404069, 0.420605");
      }
      cell_rise (delay_template_3x3) {
        index_1 ("0.060000, 0.300000, 0.540000");
        index_2 ("0.003000, 0.009000, 0.015000");
        values ( \
          "0.183191, 0.207483, 0.231776", \
          "0.272849, 0.306402, 0.339955", \
          "0.362508, 0.405321, 0.448134");
      }
      cell_fall (delay_template_3x3) {
        index_1 ("0.060000, 0.300000, 0.540000");
        index_2 ("0.003000, 0.009000, 0.015000");
        values ( \
          "0.160941, 0.223859, 0.286777", \
          "0.236255, 0.303522, 0.370789", \
          "0.311569, 0.383185, 0.454801");
      }
    }
  }
}
```
**Listing D.1:** Full description of the NO2 gate in the [RFET](#page-364-7) *.lib* file.
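
Timing tools typically evaluate such 3x3 tables by interpolating between the characterized points. The sketch below illustrates this with plain bilinear interpolation on the `cell_rise` table for `related_pin : A` from Listing D.1. It is an illustration only: the meaning of the two axes is defined by `delay_template_3x3`, which is not part of the excerpt, so reading `index_1` as input transition time and `index_2` as output load is an assumption, and the function names are hypothetical.

```python
from bisect import bisect_right

# cell_rise table for related_pin A, copied from Listing D.1.
INDEX_1 = [0.06, 0.30, 0.54]     # assumed: input transition time
INDEX_2 = [0.003, 0.009, 0.015]  # assumed: output load capacitance
CELL_RISE_A = [
    [0.250443, 0.271110, 0.291777],
    [0.339382, 0.364643, 0.389904],
    [0.428321, 0.458176, 0.488031],
]


def _segment(axis, x):
    """Index i of the axis segment [axis[i], axis[i + 1]] used for x."""
    i = bisect_right(axis, x) - 1
    # Clamp so that points outside the table reuse the outermost segment.
    return min(max(i, 0), len(axis) - 2)


def lookup(table, index_1, index_2, x1, x2):
    """Bilinear interpolation; values outside the table are extrapolated."""
    i = _segment(index_1, x1)
    j = _segment(index_2, x2)
    t = (x1 - index_1[i]) / (index_1[i + 1] - index_1[i])
    u = (x2 - index_2[j]) / (index_2[j + 1] - index_2[j])
    return ((1 - t) * (1 - u) * table[i][j]
            + (1 - t) * u * table[i][j + 1]
            + t * (1 - u) * table[i + 1][j]
            + t * u * table[i + 1][j + 1])


if __name__ == "__main__":
    # A point halfway between the first two entries on both axes.
    print(lookup(CELL_RISE_A, INDEX_1, INDEX_2, 0.18, 0.006))
```

At a characterized grid point the function returns the table entry itself; between grid points it returns a weighted average of the four surrounding entries.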

```
wire_load_table (0_1k) {
  fanout_area (1, x.xx);
  fanout_capacitance (1, x.xx);
  fanout_length (1, x.xx);
  fanout_resistance (1, x.xx);
  fanout_area (5, x.xx);
  fanout_capacitance (5, x.xx);
  fanout_length (5, x.xx);
  fanout_resistance (5, x.xx);
  fanout_area (20, x.xx);
  fanout_capacitance (20, x.xx);
  fanout_length (20, x.xx);
  fanout_resistance (20, x.xx);
  fanout_area (10000, x.xx);
  fanout_capacitance (10000, x.xx);
  fanout_length (10000, x.xx);
  fanout_resistance (10000, x.xx);
}
```
**Listing D.2:** Wireload description in the [RFET](#page-364-7) *.lib* file.

# **Appendix E**

#### **FASM for Logic Invasion**

```
# This is the fasm file used to initialize the LFSR counter. LUTs always set the registers to a constant value.
# To compile: ./src/sw/pbit --single CLB src/fasm/clb_counter_init.fasm

# Pass through FLE0 LUT0 input 0 from external input
BLK_X[000]Y[000]Z[000].CLB.FLE[0].CB.I[0]="ext[0].I"
BLK_X[000]Y[000]Z[000].CLB.FLE[0].5BLE[0].FF.ENABLE
BLK_X[000]Y[000]Z[000].CLB.FLE[0].5BLE[0].LUT5.INIT[31:0]=32'b10101010101010101010101010101010

# Configure BLE0 LUT 1 as constant 1
BLK_X[000]Y[000]Z[000].CLB.FLE[0].5BLE[1].FF.ENABLE
BLK_X[000]Y[000]Z[000].CLB.FLE[0].5BLE[1].LUT5.INIT[31:0]=32'b11111111111111111111111111111111

# All other LUTs as constant 0
BLK_X[000]Y[000]Z[000].CLB.FLE[1].5BLE[0].FF.ENABLE
BLK_X[000]Y[000]Z[000].CLB.FLE[1].5BLE[0].LUT5.INIT[31:0]=32'b00000000000000000000000000000000
BLK_X[000]Y[000]Z[000].CLB.FLE[1].5BLE[1].FF.ENABLE
BLK_X[000]Y[000]Z[000].CLB.FLE[1].5BLE[1].LUT5.INIT[31:0]=32'b00000000000000000000000000000000

BLK_X[000]Y[000]Z[000].CLB.FLE[2].5BLE[0].FF.ENABLE
BLK_X[000]Y[000]Z[000].CLB.FLE[2].5BLE[0].LUT5.INIT[31:0]=32'b00000000000000000000000000000000
BLK_X[000]Y[000]Z[000].CLB.FLE[2].5BLE[1].FF.ENABLE
BLK_X[000]Y[000]Z[000].CLB.FLE[2].5BLE[1].LUT5.INIT[31:0]=32'b00000000000000000000000000000000

BLK_X[000]Y[000]Z[000].CLB.FLE[3].5BLE[0].FF.ENABLE
BLK_X[000]Y[000]Z[000].CLB.FLE[3].5BLE[0].LUT5.INIT[31:0]=32'b00000000000000000000000000000000
BLK_X[000]Y[000]Z[000].CLB.FLE[3].5BLE[1].FF.ENABLE
BLK_X[000]Y[000]Z[000].CLB.FLE[3].5BLE[1].LUT5.INIT[31:0]=32'b00000000000000000000000000000000

BLK_X[000]Y[000]Z[000].CLB.FLE[4].5BLE[0].FF.ENABLE
BLK_X[000]Y[000]Z[000].CLB.FLE[4].5BLE[0].LUT5.INIT[31:0]=32'b00000000000000000000000000000000
BLK_X[000]Y[000]Z[000].CLB.FLE[4].5BLE[1].FF.ENABLE
BLK_X[000]Y[000]Z[000].CLB.FLE[4].5BLE[1].LUT5.INIT[31:0]=32'b00000000000000000000000000000000

BLK_X[000]Y[000]Z[000].CLB.FLE[5].5BLE[0].FF.ENABLE
BLK_X[000]Y[000]Z[000].CLB.FLE[5].5BLE[0].LUT5.INIT[31:0]=32'b00000000000000000000000000000000
BLK_X[000]Y[000]Z[000].CLB.FLE[5].5BLE[1].FF.ENABLE
BLK_X[000]Y[000]Z[000].CLB.FLE[5].5BLE[1].LUT5.INIT[31:0]=32'b00000000000000000000000000000000

BLK_X[000]Y[000]Z[000].CLB.FLE[6].5BLE[0].FF.ENABLE
BLK_X[000]Y[000]Z[000].CLB.FLE[6].5BLE[0].LUT5.INIT[31:0]=32'b00000000000000000000000000000000
BLK_X[000]Y[000]Z[000].CLB.FLE[6].5BLE[1].FF.ENABLE
BLK_X[000]Y[000]Z[000].CLB.FLE[6].5BLE[1].LUT5.INIT[31:0]=32'b00000000000000000000000000000000

BLK_X[000]Y[000]Z[000].CLB.FLE[7].5BLE[0].FF.ENABLE
BLK_X[000]Y[000]Z[000].CLB.FLE[7].5BLE[0].LUT5.INIT[31:0]=32'b00000000000000000000000000000000
BLK_X[000]Y[000]Z[000].CLB.FLE[7].5BLE[1].FF.ENABLE
BLK_X[000]Y[000]Z[000].CLB.FLE[7].5BLE[1].LUT5.INIT[31:0]=32'b00000000000000000000000000000000

BLK_X[000]Y[000]Z[000].CLB.FLE[8].5BLE[0].FF.ENABLE
BLK_X[000]Y[000]Z[000].CLB.FLE[8].5BLE[0].LUT5.INIT[31:0]=32'b00000000000000000000000000000000
BLK_X[000]Y[000]Z[000].CLB.FLE[8].5BLE[1].FF.ENABLE
BLK_X[000]Y[000]Z[000].CLB.FLE[8].5BLE[1].LUT5.INIT[31:0]=32'b00000000000000000000000000000000

BLK_X[000]Y[000]Z[000].CLB.FLE[9].5BLE[0].FF.ENABLE
BLK_X[000]Y[000]Z[000].CLB.FLE[9].5BLE[0].LUT5.INIT[31:0]=32'b00000000000000000000000000000000
BLK_X[000]Y[000]Z[000].CLB.FLE[9].5BLE[1].FF.ENABLE
BLK_X[000]Y[000]Z[000].CLB.FLE[9].5BLE[1].LUT5.INIT[31:0]=32'b00000000000000000000000000000000
```
**Listing E.1:** [FASM](#page-362-0) representing the register initialization for measurement in [section 8.3](#page-236-0) on page [214.](#page-236-0)
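
The `LUT5.INIT` strings in the listing are truth tables written as 32-bit binary literals. The following sketch derives such a string from a Boolean function. It assumes that bit `i` of `INIT[31:0]` holds the LUT output for the input combination encoded by `i`, with input 0 as the least significant bit, and that the literal is written most significant bit first. This assumption is consistent with the pass-through and constant patterns in the listing, but the helper `lut5_init` itself is hypothetical and not part of the thesis toolflow.

```python
def lut5_init(func):
    """Return the INIT literal of a 5-input LUT, most significant bit first.

    Assumption: bit i of INIT is the output for the input combination whose
    binary encoding is i, with I0 as the least significant bit.
    """
    bits = []
    for i in range(32):
        inputs = [(i >> k) & 1 for k in range(5)]  # inputs[k] is Ik
        bits.append("1" if func(*inputs) else "0")
    # bits[0] is INIT[0]; reverse so that INIT[31] is printed first.
    return "32'b" + "".join(reversed(bits))


if __name__ == "__main__":
    print(lut5_init(lambda i0, i1, i2, i3, i4: i0))  # pass-through of input 0
    print(lut5_init(lambda i0, i1, i2, i3, i4: 1))   # constant 1
    print(lut5_init(lambda i0, i1, i2, i3, i4: 0))   # constant 0
```

Under this assumption, the three calls reproduce the three distinct `INIT` values used in Listing E.1.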

```
# This is the fasm file used to build an oscillator for logic invasion
# To compile: ./src/sw/pbit --single CLB src/fasm/clb_ring.fasm

# We build a ring oscillator in the LUT0s of BLE 0-9 and LUT1 of BLE10
# Oscillator input is I0, so an inverter/NAND of I0 is used
# The counter / LFSR is implemented in LUT1s of BLE0-8.
# I1 is counter input, I2 optional second input when using an xor for the first tap.
# For non-first tap, use a pass-through of I1

# Externally gated inverter (invert fle[9].O[1] if ext[0].I is 1)
BLK_X[000]Y[000]Z[000].CLB.FLE[0].CB.I[0]="fle[9].O[1]"
BLK_X[000]Y[000]Z[000].CLB.FLE[0].CB.I[3]="ext[0].I"
BLK_X[000]Y[000]Z[000].CLB.FLE[0].5BLE[0].FF.BYPASS
BLK_X[000]Y[000]Z[000].CLB.FLE[0].5BLE[0].LUT5.INIT[31:0]=32'b01010101000000000101010100000000
# x9 + x5 + 1
BLK_X[000]Y[000]Z[000].CLB.FLE[0].CB.I[1]="fle[8].O[1]"
BLK_X[000]Y[000]Z[000].CLB.FLE[0].CB.I[2]="fle[4].O[1]"
BLK_X[000]Y[000]Z[000].CLB.FLE[0].5BLE[1].FF.ENABLE
# The XOR of I1 and I2
BLK_X[000]Y[000]Z[000].CLB.FLE[0].5BLE[1].LUT5.INIT[31:0]=32'b00111100001111000011110000111100

# All following ring elements are buffers
BLK_X[000]Y[000]Z[000].CLB.FLE[1].CB.I[0]="fle[0].O[0]"
BLK_X[000]Y[000]Z[000].CLB.FLE[1].CB.I[1]="fle[0].O[1]"
BLK_X[000]Y[000]Z[000].CLB.FLE[1].5BLE[0].FF.BYPASS
BLK_X[000]Y[000]Z[000].CLB.FLE[1].5BLE[0].LUT5.INIT[31:0]=32'b10101010101010101010101010101010
BLK_X[000]Y[000]Z[000].CLB.FLE[1].5BLE[1].FF.ENABLE
BLK_X[000]Y[000]Z[000].CLB.FLE[1].5BLE[1].LUT5.INIT[31:0]=32'b11001100110011001100110011001100

BLK_X[000]Y[000]Z[000].CLB.FLE[2].CB.I[0]="fle[1].O[0]"
BLK_X[000]Y[000]Z[000].CLB.FLE[2].CB.I[1]="fle[1].O[1]"
BLK_X[000]Y[000]Z[000].CLB.FLE[2].5BLE[0].FF.BYPASS
BLK_X[000]Y[000]Z[000].CLB.FLE[2].5BLE[0].LUT5.INIT[31:0]=32'b10101010101010101010101010101010
BLK_X[000]Y[000]Z[000].CLB.FLE[2].5BLE[1].FF.ENABLE
BLK_X[000]Y[000]Z[000].CLB.FLE[2].5BLE[1].LUT5.INIT[31:0]=32'b11001100110011001100110011001100

BLK_X[000]Y[000]Z[000].CLB.FLE[3].CB.I[0]="fle[2].O[0]"
BLK_X[000]Y[000]Z[000].CLB.FLE[3].CB.I[1]="fle[2].O[1]"
BLK_X[000]Y[000]Z[000].CLB.FLE[3].5BLE[0].FF.BYPASS
BLK_X[000]Y[000]Z[000].CLB.FLE[3].5BLE[0].LUT5.INIT[31:0]=32'b10101010101010101010101010101010
BLK_X[000]Y[000]Z[000].CLB.FLE[3].5BLE[1].FF.ENABLE
BLK_X[000]Y[000]Z[000].CLB.FLE[3].5BLE[1].LUT5.INIT[31:0]=32'b11001100110011001100110011001100

BLK_X[000]Y[000]Z[000].CLB.FLE[4].CB.I[0]="fle[3].O[0]"
BLK_X[000]Y[000]Z[000].CLB.FLE[4].CB.I[1]="fle[3].O[1]"
BLK_X[000]Y[000]Z[000].CLB.FLE[4].5BLE[0].FF.BYPASS
BLK_X[000]Y[000]Z[000].CLB.FLE[4].5BLE[0].LUT5.INIT[31:0]=32'b10101010101010101010101010101010
BLK_X[000]Y[000]Z[000].CLB.FLE[4].5BLE[1].FF.ENABLE
BLK_X[000]Y[000]Z[000].CLB.FLE[4].5BLE[1].LUT5.INIT[31:0]=32'b11001100110011001100110011001100

BLK_X[000]Y[000]Z[000].CLB.FLE[5].CB.I[0]="fle[4].O[0]"
BLK_X[000]Y[000]Z[000].CLB.FLE[5].CB.I[1]="fle[4].O[1]"
BLK_X[000]Y[000]Z[000].CLB.FLE[5].5BLE[0].FF.BYPASS
BLK_X[000]Y[000]Z[000].CLB.FLE[5].5BLE[0].LUT5.INIT[31:0]=32'b10101010101010101010101010101010
BLK_X[000]Y[000]Z[000].CLB.FLE[5].5BLE[1].FF.ENABLE
BLK_X[000]Y[000]Z[000].CLB.FLE[5].5BLE[1].LUT5.INIT[31:0]=32'b11001100110011001100110011001100

BLK_X[000]Y[000]Z[000].CLB.FLE[6].CB.I[0]="fle[5].O[0]"
BLK_X[000]Y[000]Z[000].CLB.FLE[6].CB.I[1]="fle[5].O[1]"
BLK_X[000]Y[000]Z[000].CLB.FLE[6].5BLE[0].FF.BYPASS
BLK_X[000]Y[000]Z[000].CLB.FLE[6].5BLE[0].LUT5.INIT[31:0]=32'b10101010101010101010101010101010
BLK_X[000]Y[000]Z[000].CLB.FLE[6].5BLE[1].FF.ENABLE
BLK_X[000]Y[000]Z[000].CLB.FLE[6].5BLE[1].LUT5.INIT[31:0]=32'b11001100110011001100110011001100

BLK_X[000]Y[000]Z[000].CLB.FLE[7].CB.I[0]="fle[6].O[0]"
BLK_X[000]Y[000]Z[000].CLB.FLE[7].CB.I[1]="fle[6].O[1]"
BLK_X[000]Y[000]Z[000].CLB.FLE[7].5BLE[0].FF.BYPASS
BLK_X[000]Y[000]Z[000].CLB.FLE[7].5BLE[0].LUT5.INIT[31:0]=32'b10101010101010101010101010101010
BLK_X[000]Y[000]Z[000].CLB.FLE[7].5BLE[1].FF.ENABLE
BLK_X[000]Y[000]Z[000].CLB.FLE[7].5BLE[1].LUT5.INIT[31:0]=32'b11001100110011001100110011001100

BLK_X[000]Y[000]Z[000].CLB.FLE[8].CB.I[0]="fle[7].O[0]"
BLK_X[000]Y[000]Z[000].CLB.FLE[8].CB.I[1]="fle[7].O[1]"
BLK_X[000]Y[000]Z[000].CLB.FLE[8].5BLE[0].FF.BYPASS
BLK_X[000]Y[000]Z[000].CLB.FLE[8].5BLE[0].LUT5.INIT[31:0]=32'b10101010101010101010101010101010
BLK_X[000]Y[000]Z[000].CLB.FLE[8].5BLE[1].FF.ENABLE
BLK_X[000]Y[000]Z[000].CLB.FLE[8].5BLE[1].LUT5.INIT[31:0]=32'b11001100110011001100110011001100

BLK_X[000]Y[000]Z[000].CLB.FLE[9].CB.I[0]="fle[8].O[0]"
BLK_X[000]Y[000]Z[000].CLB.FLE[9].CB.I[1]="fle[9].O[0]"
BLK_X[000]Y[000]Z[000].CLB.FLE[9].5BLE[0].FF.BYPASS
BLK_X[000]Y[000]Z[000].CLB.FLE[9].5BLE[0].LUT5.INIT[31:0]=32'b10101010101010101010101010101010
BLK_X[000]Y[000]Z[000].CLB.FLE[9].5BLE[1].FF.BYPASS
# Pass-Through I1
BLK_X[000]Y[000]Z[000].CLB.FLE[9].5BLE[1].LUT5.INIT[31:0]=32'b11001100110011001100110011001100
```
**Listing E.2:** [FASM](#page-362-0) representing the counter and oscillator for measurement in [section 8.3](#page-236-0) on page [214.](#page-236-0)
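
The counter wiring in Listing E.2 can be checked with a small behavioral model. The sketch below is a plain software simulation, not the measurement flow of the thesis: `state[k]` stands for the register in `FLE[k]`, the first stage receives the XOR of the registers in `FLE[8]` and `FLE[4]`, and every other stage copies its predecessor. The seed, with only the first register set, follows the initialization in Listing E.1. The function names are hypothetical.

```python
def lfsr_step(state):
    """Advance the 9-bit shift register by one clock.

    state[k] models the register in FLE[k]; FLE[0] receives the XOR of
    FLE[8] and FLE[4], every other stage copies its predecessor.
    """
    feedback = state[8] ^ state[4]
    return [feedback] + state[:8]


def lfsr_period(seed):
    """Count the clocks until the register returns to its seed value."""
    state, steps = lfsr_step(seed), 1
    while state != seed:
        state, steps = lfsr_step(state), steps + 1
    return steps


SEED = [1, 0, 0, 0, 0, 0, 0, 0, 0]  # FLE[0] set to one, all others zero

if __name__ == "__main__":
    print(lfsr_period(SEED))
```

Because each step can be undone (the bit shifted out is recoverable from the new state), the sequence always returns to its seed. For these taps the register cycles through all 511 non-zero states, which is the maximum for nine bits.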

```
# This is the fasm file used to output the register values to the LUT0 output.
# All registers are on, so they are available on the CB
# To compile: ./src/sw/pbit --single CLB src/fasm/clb_counter_init.fasm

# This selects the register routed to the 0 input of the first LUT,
# which is then emitted to LUT0. This is just a placeholder, the actual
# value is substituted in gen_clb_read_bitstream
BLK_X[000]Y[000]Z[000].CLB.FLE[0].CB.I[0]="fle[0].O[0]"

# LUT0 just forwards its input 0. The CB must route the wanted register to this input.
BLK_X[000]Y[000]Z[000].CLB.FLE[0].5BLE[0].LUT5.INIT[31:0]=32'b10101010101010101010101010101010
BLK_X[000]Y[000]Z[000].CLB.FLE[0].5BLE[0].FF.ENABLE
BLK_X[000]Y[000]Z[000].CLB.FLE[0].5BLE[1].FF.ENABLE

# Enable all FFs
BLK_X[000]Y[000]Z[000].CLB.FLE[1].5BLE[0].FF.ENABLE
BLK_X[000]Y[000]Z[000].CLB.FLE[1].5BLE[1].FF.ENABLE

BLK_X[000]Y[000]Z[000].CLB.FLE[2].5BLE[0].FF.ENABLE
BLK_X[000]Y[000]Z[000].CLB.FLE[2].5BLE[1].FF.ENABLE

BLK_X[000]Y[000]Z[000].CLB.FLE[3].5BLE[0].FF.ENABLE
BLK_X[000]Y[000]Z[000].CLB.FLE[3].5BLE[1].FF.ENABLE

BLK_X[000]Y[000]Z[000].CLB.FLE[4].5BLE[0].FF.ENABLE
BLK_X[000]Y[000]Z[000].CLB.FLE[4].5BLE[1].FF.ENABLE

BLK_X[000]Y[000]Z[000].CLB.FLE[5].5BLE[0].FF.ENABLE
BLK_X[000]Y[000]Z[000].CLB.FLE[5].5BLE[1].FF.ENABLE

BLK_X[000]Y[000]Z[000].CLB.FLE[6].5BLE[0].FF.ENABLE
BLK_X[000]Y[000]Z[000].CLB.FLE[6].5BLE[1].FF.ENABLE

BLK_X[000]Y[000]Z[000].CLB.FLE[7].5BLE[0].FF.ENABLE
BLK_X[000]Y[000]Z[000].CLB.FLE[7].5BLE[1].FF.ENABLE

BLK_X[000]Y[000]Z[000].CLB.FLE[8].5BLE[0].FF.ENABLE
BLK_X[000]Y[000]Z[000].CLB.FLE[8].5BLE[1].FF.ENABLE

BLK_X[000]Y[000]Z[000].CLB.FLE[9].5BLE[0].FF.ENABLE
BLK_X[000]Y[000]Z[000].CLB.FLE[9].5BLE[1].FF.ENABLE
```
**Listing E.3:** [FASM](#page-362-0) representing the [FF](#page-362-9) readout configuration for measurement in [section 8.3](#page-236-0) on page [214.](#page-236-0)
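
Since the readout configuration is highly regular, it lends itself to being generated rather than written by hand. The sketch below shows one possible way to emit the feature lines of Listing E.3 for an arbitrary register. It is not the `gen_clb_read_bitstream` step mentioned in the listing, whose implementation is not shown here. The function names, the default parameters and the `source` argument are hypothetical.

```python
PREFIX = "BLK_X[000]Y[000]Z[000].CLB"  # block prefix as used in the listings


def ff_enable_lines(num_fle=10, bles_per_fle=2, prefix=PREFIX):
    """Generate one FF.ENABLE feature line per flip-flop of a CLB."""
    return [
        f"{prefix}.FLE[{fle}].5BLE[{ble}].FF.ENABLE"
        for fle in range(num_fle)
        for ble in range(bles_per_fle)
    ]


def readout_fasm(source="fle[0].O[0]", prefix=PREFIX):
    """Route `source` to input 0 of the first LUT and forward it unchanged."""
    lines = [
        f'{prefix}.FLE[0].CB.I[0]="{source}"',
        # Pass-through of input 0, same INIT value as in Listing E.3.
        f"{prefix}.FLE[0].5BLE[0].LUT5.INIT[31:0]=32'b" + "10" * 16,
    ]
    return lines + ff_enable_lines(prefix=prefix)


if __name__ == "__main__":
    print("\n".join(readout_fasm("fle[3].O[1]")))
```

Calling `readout_fasm` once per register source yields one configuration per flip-flop to be read out.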

# **Appendix F**

#### **PARFAIT FPGA Evaluation Results**

This chapter contains additional figures for the evaluation in [chapter 10](#page-275-0) on page [253.](#page-275-0) The evaluation chapter contains detailed instructions on how these figures can be interpreted. It also provides examples and interpretations for selected benchmarks. The figures in this appendix cover eight of the evaluated benchmarks. In addition, not all [RFET](#page-364-7) and [SOI](#page-364-0) evaluations are shown, due to space limitations.

This appendix first shows power evaluation figures, followed by process variation, voltage variation, temperature variation and aging figures. Aggregated statistics, which summarize average values, are given in the evaluation chapter. [Figure F.1](#page-388-0) on the following page provides the placements of all benchmarks on the [FPGA](#page-362-2). [Figure F.2](#page-389-0) on page [367](#page-389-0) to [figure F.9](#page-396-0) on page [374](#page-396-0) provide evaluations for power management without [PVTA.](#page-364-8) [Figure F.10](#page-397-0) on page [375](#page-397-0) to [figure F.17](#page-404-0) on page [382](#page-404-0) show the simulations with process variation. [Figure F.18](#page-405-0) on page [383](#page-405-0) gives an overview of all voltage variation maps used in the voltage evaluation. The voltage variation evaluation itself is given in [figure F.19](#page-406-0) on page [384](#page-406-0) to [figure F.32](#page-419-0) on page [397.](#page-419-0) Temperature evaluation maps follow in [figure F.33](#page-420-0) on page [398](#page-420-0) to [figure F.36](#page-423-0) on page [401.](#page-423-0) The appendix concludes with aging evaluations in [figure F.37](#page-424-0) on page [402](#page-424-0) to [figure F.44](#page-431-0) on page [409.](#page-431-0)

One aspect of the achieved delay graphs should be noted: Even for regions in low-power modes, such as unused regions, the achieved delay can be lower than the nominal delay. Process variation gives some areas a better than typical delay, which results in graphs with a mostly blue shade. For larger region sizes and the same benchmark, the number of such high-performance regions can appear smaller. The reason is that the graphs show the delay as it is known to the compensation algorithm. This algorithm always uses the worst value in a region, so the whole region is shown with a higher delay. The real delay can, however, be lower for most [CLBs](#page-361-0) in the region.
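
The effect of the region size on the displayed delay can be reproduced with a few lines of code. The sketch below is an illustration with made-up delay factors, not the compensation algorithm of the thesis: it replaces the delay of every cell by the worst (largest) delay found in its region, which is the behavior described above. The function name and the example grid are hypothetical.

```python
def region_delay_map(delays, region):
    """Replace every CLB delay by the worst (largest) delay in its region."""
    rows, cols = len(delays), len(delays[0])
    result = [[0.0] * cols for _ in range(rows)]
    for r0 in range(0, rows, region):
        for c0 in range(0, cols, region):
            # All cells of the region starting at (r0, c0), clipped at the edge.
            cells = [(r, c)
                     for r in range(r0, min(r0 + region, rows))
                     for c in range(c0, min(c0 + region, cols))]
            worst = max(delays[r][c] for r, c in cells)
            for r, c in cells:
                result[r][c] = worst
    return result


# Hypothetical delay factors; values below 1.0 are faster than nominal.
EXAMPLE = [
    [0.90, 0.95, 0.85, 0.80],
    [0.92, 1.05, 0.88, 0.82],
    [0.91, 0.93, 0.87, 0.86],
    [0.94, 0.96, 0.89, 1.02],
]

if __name__ == "__main__":
    for size in (1, 2, 4):
        print(size, region_delay_map(EXAMPLE, size))
```

With a region size of 1 the map is unchanged, while a single region covering the whole grid shows the slowest cell everywhere, even though most cells are faster.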

<span id="page-388-0"></span>

**Figure F.1:** Placements for the benchmarks which are evaluated on the [PARFAIT](#page-363-5) [FPGA](#page-362-2) architecture. Placements were obtained using the [ULM](#page-364-6) [VPR](#page-364-3) flow introduced in [section 6.2](#page-202-0) on page [180.](#page-202-0) **[\(a\)](#page-388-0)** to **[\(h\)](#page-388-0)**: Benchmarks arm\_core, bgm, blob\_merge, diffeq2, ch\_intrinsics, LU64PEEng, mkSMAdapter4B and stereovision0.

<span id="page-389-0"></span>

**Figure F.2:** [PARFAIT](#page-363-5) target factor maps evaluated using the [SOI](#page-364-0) delay model introduced in [section 4.6.](#page-155-0) Benchmarks top to bottom: arm\_core, bgm, blob\_merge, diffeq2. Region size left to right: 5x5, 10x10, 25x25, 50x50.



**Figure F.3:** [PARFAIT](#page-363-5) target factor maps evaluated using the [SOI](#page-364-0) delay model introduced in [section 4.6.](#page-155-0) Benchmarks top to bottom: ch\_intrinsics, LU64PEEng, mkSMAdapter4B, stereovision0. Region size left to right: 5x5, 10x10, 25x25, 50x50.



**Figure F.4:** [PARFAIT](#page-363-5) target factor maps evaluated using the [RFET](#page-364-7) delay model introduced in [section 4.6.](#page-155-0) Benchmarks top to bottom: arm\_core, bgm, blob\_merge, diffeq2. Region size left to right: 5x5, 10x10, 25x25, 50x50.



**Figure F.5:** [PARFAIT](#page-363-5) target factor maps evaluated using the [RFET](#page-364-7) delay model introduced in [section 4.6.](#page-155-0) Benchmarks top to bottom: ch\_intrinsics, LU64PEEng, mkSMAdapter4B, stereovision0. Region size left to right: 5x5, 10x10, 25x25, 50x50.



**Figure F.6:** [PARFAIT](#page-363-5) delay factor maps evaluated using the [RFET](#page-364-7) delay model introduced in [section 4.6.](#page-155-0) Benchmarks top to bottom: arm\_core, bgm, blob\_merge, diffeq2. Region size left to right: 5x5, 10x10, 25x25, 50x50.



**Figure F.7:** [PARFAIT](#page-363-5) delay factor maps evaluated using the [RFET](#page-364-7) delay model introduced in [section 4.6.](#page-155-0) Benchmarks top to bottom: ch\_intrinsics, LU64PEEng, mkSMAdapter4B, stereovision0. Region size left to right: 5x5, 10x10, 25x25, 50x50.



**Figure F.8:** [PARFAIT](#page-363-5) power maps evaluated using the [RFET](#page-364-7) delay model introduced in [section 4.6.](#page-155-0) Benchmarks top to bottom: arm\_core, bgm, blob\_merge, diffeq2. Region size left to right: 5x5, 10x10, 25x25, 50x50.


**Figure F.9:** [PARFAIT](#page-363-0) power maps evaluated using the [RFET](#page-364-0) delay model introduced in [section 4.6.](#page-155-0) Benchmarks top to bottom: ch\_intrinsics, LU64PEEng, mkSMAdapter4B, stereovision0. Region size left to right: 5x5, 10x10, 25x25, 50x50.



















**Figure F.10:** [PARFAIT](#page-363-0) delay factor maps evaluated using the [SOI](#page-364-1) delay model introduced in [section 4.6](#page-155-0) with process variation. Benchmarks top to bottom: arm\_core, bgm, blob\_merge, diffeq2. Region size left to right: 5x5, 10x10, 25x25, 50x50.



**Figure F.11:** [PARFAIT](#page-363-0) delay factor maps evaluated using the [SOI](#page-364-1) delay model introduced in [section 4.6](#page-155-0) with process variation. Benchmarks top to bottom: ch\_intrinsics, LU64PEEng, mkSMAdapter4B, stereovision0. Region size left to right: 5x5, 10x10, 25x25, 50x50.



**Figure F.12:** [PARFAIT](#page-363-0) power maps evaluated using the [SOI](#page-364-1) delay model introduced in [section 4.6](#page-155-0) with process variation. Benchmarks top to bottom: arm\_core, bgm, blob\_merge, diffeq2. Region size left to right: 5x5, 10x10, 25x25, 50x50.



**Figure F.13:** [PARFAIT](#page-363-0) power maps evaluated using the [SOI](#page-364-1) delay model introduced in [section 4.6](#page-155-0) with process variation. Benchmarks top to bottom: ch\_intrinsics, LU64PEEng, mkSMAdapter4B, stereovision0. Region size left to right: 5x5, 10x10, 25x25, 50x50.



**Figure F.14:** [PARFAIT](#page-363-0) delay factor maps evaluated using the [RFET](#page-364-0) delay model introduced in [section 4.6](#page-155-0) with process variation. Benchmarks top to bottom: arm\_core, bgm, blob\_merge, diffeq2. Region size left to right: 5x5, 10x10, 25x25, 50x50.



**Figure F.15:** [PARFAIT](#page-363-0) delay factor maps evaluated using the [RFET](#page-364-0) delay model introduced in [section 4.6](#page-155-0) with process variation. Benchmarks top to bottom: ch\_intrinsics, LU64PEEng, mkSMAdapter4B, stereovision0. Region size left to right: 5x5, 10x10, 25x25, 50x50.



**Figure F.16:** [PARFAIT](#page-363-0) power maps evaluated using the [RFET](#page-364-0) delay model introduced in [section 4.6](#page-155-0) with process variation. Benchmarks top to bottom: arm\_core, bgm, blob\_merge, diffeq2. Region size left to right: 5x5, 10x10, 25x25, 50x50.



**Figure F.17:** [PARFAIT](#page-363-0) power maps evaluated using the [RFET](#page-364-0) delay model introduced in [section 4.6](#page-155-0) with process variation. Benchmarks top to bottom: ch\_intrinsics, LU64PEEng, mkSMAdapter4B, stereovision0. Region size left to right: 5x5, 10x10, 25x25, 50x50.

<span id="page-405-0"></span>

**Figure F.18:** Voltage Variation Maps for the benchmarks which are evaluated on the [PARFAIT](#page-363-0) [FPGA](#page-362-0) architecture. Placements were obtained using the [ULM](#page-364-2) [VPR](#page-364-3) flow introduced in [section 6.2](#page-202-0) on page [180.](#page-202-0) **[\(a\)](#page-405-0)** to **[\(h\)](#page-405-0)**: Benchmarks arm\_core, bgm, blob\_merge, diffeq2, ch\_intrinsics, LU64PEEng, mkSMAdapter4B and stereovision0.



**Figure F.19:** [PARFAIT](#page-363-0) delay factor maps evaluated using the [SOI](#page-364-1) delay model introduced in [section 4.6](#page-155-0) with voltage variation. Benchmarks top to bottom: arm\_core, bgm, blob\_merge, diffeq2. Region size left to right: 5x5, 10x10, 25x25, 50x50.  $\varepsilon = 0.1$ .







**Figure F.20:** [PARFAIT](#page-363-0) delay factor maps evaluated using the [SOI](#page-364-1) delay model introduced in [section 4.6](#page-155-0) with voltage variation. Benchmarks top to bottom: ch\_intrinsics, LU64PEEng, mkSMAdapter4B, stereovision0. Region size left to right: 5x5, 10x10, 25x25, 50x50.  $\epsilon = 0.1$ .



**Figure F.21:** [PARFAIT](#page-363-0) delay factor maps evaluated using the [SOI](#page-364-1) delay model introduced in [section 4.6](#page-155-0) with voltage variation. Benchmarks top to bottom: arm\_core, bgm, blob\_merge, diffeq2. Region size left to right: 5x5, 10x10, 25x25, 50x50.  $\varepsilon = 0.3$ .



**Figure F.22:** [PARFAIT](#page-363-0) delay factor maps evaluated using the [SOI](#page-364-1) delay model introduced in [section 4.6](#page-155-0) with voltage variation. Benchmarks top to bottom: ch\_intrinsics, LU64PEEng, mkSMAdapter4B, stereovision0. Region size left to right: 5x5, 10x10, 25x25, 50x50.  $\epsilon = 0.3$ .



**Figure F.23:** [PARFAIT](#page-363-0) power maps evaluated using the [SOI](#page-364-1) delay model introduced in [section 4.6](#page-155-0) with voltage variation. Benchmarks top to bottom: arm\_core, bgm, blob\_merge, diffeq2. Region size left to right: 5x5, 10x10, 25x25, 50x50.  $\epsilon = 0.3$ .



**Figure F.24:** [PARFAIT](#page-363-0) power maps evaluated using the [SOI](#page-364-1) delay model introduced in [section 4.6](#page-155-0) with voltage variation. Benchmarks top to bottom: ch\_intrinsics, LU64PEEng, mkSMAdapter4B, stereovision0. Region size left to right: 5x5, 10x10, 25x25, 50x50.  $\varepsilon = 0.3$ .



**Figure F.25:** [PARFAIT](#page-363-0) delay factor maps evaluated using the [RFET](#page-364-0) delay model introduced in [section 4.6](#page-155-0) with voltage variation. Benchmarks top to bottom: arm\_core, bgm, blob\_merge, diffeq2. Region size left to right: 5x5, 10x10, 25x25, 50x50. $\epsilon = 0.1$.



**Figure F.26:** [PARFAIT](#page-363-0) delay factor maps evaluated using the [RFET](#page-364-0) delay model introduced in [section 4.6](#page-155-0) with voltage variation. Benchmarks top to bottom: ch\_intrinsics, LU64PEEng, mkSMAdapter4B, stereovision0. Region size left to right: 5x5, 10x10, 25x25, 50x50.  $\epsilon = 0.1$ .



**Figure F.27:** [PARFAIT](#page-363-0) power maps evaluated using the [RFET](#page-364-0) delay model introduced in [section 4.6](#page-155-0) with voltage variation. Benchmarks top to bottom: arm\_core, bgm, blob\_merge, diffeq2. Region size left to right: 5x5, 10x10, 25x25, 50x50. $\varepsilon = 0.1$.



**Figure F.28:** [PARFAIT](#page-363-0) power maps evaluated using the [RFET](#page-364-0) delay model introduced in [section 4.6](#page-155-0) with voltage variation. Benchmarks top to bottom: ch\_intrinsics, LU64PEEng, mkSMAdapter4B, stereovision0. Region size left to right: 5x5, 10x10, 25x25, 50x50.  $\varepsilon = 0.1$ .



**Figure F.29:** [PARFAIT](#page-363-0) delay factor maps evaluated using the [RFET](#page-364-0) delay model introduced in [section 4.6](#page-155-0) with voltage variation. Benchmarks top to bottom: arm\_core, bgm, blob\_merge, diffeq2. Region size left to right: 5x5, 10x10, 25x25, 50x50. $\epsilon = 0.3$.



**Figure F.30:** [PARFAIT](#page-363-0) delay factor maps evaluated using the [RFET](#page-364-0) delay model introduced in [section 4.6](#page-155-0) with voltage variation. Benchmarks top to bottom: ch\_intrinsics, LU64PEEng, mkSMAdapter4B, stereovision0. Region size left to right: 5x5, 10x10, 25x25, 50x50.  $\varepsilon = 0.3$ .



**Figure F.31:** [PARFAIT](#page-363-0) power maps evaluated using the [RFET](#page-364-0) delay model introduced in [section 4.6](#page-155-0) with voltage variation. Benchmarks top to bottom: arm\_core, bgm, blob\_merge, diffeq2. Region size left to right: 5x5, 10x10, 25x25, 50x50.  $\varepsilon = 0.3$ .



**Figure F.32:** [PARFAIT](#page-363-0) power maps evaluated using the [RFET](#page-364-0) delay model introduced in [section 4.6](#page-155-0) with voltage variation. Benchmarks top to bottom: ch\_intrinsics, LU64PEEng, mkSMAdapter4B, stereovision0. Region size left to right: 5x5, 10x10, 25x25, 50x50.  $\varepsilon = 0.3$ .



**Figure F.33:** [PARFAIT](#page-363-0) delay factor maps evaluated using the [RFET](#page-364-0) delay model introduced in [section 4.6](#page-155-0) with temperature variation. Benchmarks top to bottom: arm\_core, bgm, blob\_merge, diffeq2. Region size left to right: 5x5, 10x10, 25x25, 50x50.  $T = 100$  K.



**Figure F.34:** [PARFAIT](#page-363-0) delay factor maps evaluated using the [RFET](#page-364-0) delay model introduced in [section 4.6](#page-155-0) with temperature variation. Benchmarks top to bottom: ch\_intrinsics, LU64PEEng, mkSMAdapter4B, stereovision0. Region size left to right: 5x5, 10x10, 25x25, 50x50.  $T = 100$  K.



**Figure F.35:** [PARFAIT](#page-363-0) power maps evaluated using the [RFET](#page-364-0) delay model introduced in [section 4.6](#page-155-0) with temperature variation. Benchmarks top to bottom: arm\_core, bgm, blob\_merge, diffeq2. Region size left to right: 5x5, 10x10, 25x25, 50x50. $T = 100$ K.



**Figure F.36:** [PARFAIT](#page-363-0) power maps evaluated using the [RFET](#page-364-0) delay model introduced in [section 4.6](#page-155-0) with temperature variation. Benchmarks top to bottom: ch\_intrinsics, LU64PEEng, mkSMAdapter4B, stereovision0. Region size left to right: 5x5, 10x10, 25x25, 50x50.  $T = 100$  K.



**Figure F.37:** [PARFAIT](#page-363-0) delay factor maps evaluated using the [RFET](#page-364-0) delay model introduced in [section 4.6](#page-155-0) with aging. Benchmarks top to bottom: arm\_core, bgm, blob\_merge, diffeq2. Region size left to right: 5x5, 10x10, 25x25, 50x50.  $t = 1$  year.



**Figure F.38:** [PARFAIT](#page-363-0) delay factor maps evaluated using the [RFET](#page-364-0) delay model introduced in [section 4.6](#page-155-0) with aging. Benchmarks top to bottom: ch\_intrinsics, LU64PEEng, mkSMAdapter4B, stereovision0. Region size left to right: 5x5, 10x10, 25x25, 50x50.  $t = 1$  year.



**Figure F.39:** [PARFAIT](#page-363-0) power maps evaluated using the [RFET](#page-364-0) delay model introduced in [section 4.6](#page-155-0) with aging. Benchmarks top to bottom: arm\_core, bgm, blob\_merge, diffeq2. Region size left to right: 5x5, 10x10, 25x25, 50x50.  $t = 1$  year.



**Figure F.40:** [PARFAIT](#page-363-0) power maps evaluated using the [RFET](#page-364-0) delay model introduced in [section 4.6](#page-155-0) with aging. Benchmarks top to bottom: ch\_intrinsics, LU64PEEng, mkSMAdapter4B, stereovision0. Region size left to right: 5x5, 10x10, 25x25, 50x50.  $t = 1$  year.



**Figure F.41:** [PARFAIT](#page-363-0) delay factor maps evaluated using the [RFET](#page-364-0) delay model introduced in [section 4.6](#page-155-0) with aging. Benchmarks top to bottom: arm\_core, bgm, blob\_merge, diffeq2. Region size left to right: 5x5, 10x10, 25x25, 50x50.  $t = 10$  years.



**Figure F.42:** [PARFAIT](#page-363-0) delay factor maps evaluated using the [RFET](#page-364-0) delay model introduced in [section 4.6](#page-155-0) with aging. Benchmarks top to bottom: ch\_intrinsics, LU64PEEng, mkSMAdapter4B, stereovision0. Region size left to right: 5x5, 10x10, 25x25, 50x50.  $t = 10$  years.



**Figure F.43:** [PARFAIT](#page-363-0) power maps evaluated using the [RFET](#page-364-0) delay model introduced in [section 4.6](#page-155-0) with aging. Benchmarks top to bottom: arm\_core, bgm, blob\_merge, diffeq2. Region size left to right: 5x5, 10x10, 25x25, 50x50.  $t = 10$  years.



**Figure F.44:** [PARFAIT](#page-363-0) power maps evaluated using the [RFET](#page-364-0) delay model introduced in [section 4.6](#page-155-0) with aging. Benchmarks top to bottom: ch\_intrinsics, LU64PEEng, mkSMAdapter4B, stereovision0. Region size left to right: 5x5, 10x10, 25x25, 50x50.  $t = 10$  years.
